The Internet, and especially its most famous offspring, the World Wide Web (WWW), has changed the way most of us do business and go about our daily working lives. In the past several years, the proliferation of personal computers and other key technologies such as client-server computing, standardized communications protocols (TCP/IP, HTTP), Web browsers, and corporate intranets has dramatically changed the way we discover, view, obtain, and exploit information. As well as an infrastructure for electronic mail and a playground for academic users, the Internet has increasingly become a vital information resource for commercial enterprises, which want to keep in touch with their existing customers or reach new customers with new online product offerings. The Internet has also become an information resource for enterprises that want to stay informed about their competitors' strengths and weaknesses. (Ferguson and Wooldridge, 1997)
The growth in volume and diversity of the WWW creates an increasing demand from its users for sophisticated information and knowledge management services beyond searching and retrieving. Such services include cataloguing and classification, resource discovery and filtering, personalization of access, and monitoring of new and changing resources, among others. The number of professionally and commercially valuable information resources available on the WWW has grown considerably over the last years, yet access to them still relies on general-purpose Internet search engines. Satisfying the vast and varied requirements of corporate users is quickly becoming too complex a task for Internet search engines. (Ferguson and Wooldridge, 1997)
Every day the WWW grows by roughly a million electronic pages, adding to the hundreds of millions already on-line. This volume of information is loosely held together by more than a billion connections, called hyperlinks. (Chakrabarti et al. 1999)
Because of the Web's rapid, chaotic growth, it lacks organization and structure. People of any background, education, culture, interest and motivation, writing in many kinds of dialect or style, can create Web pages in any language. Each page might range from a few characters to a few hundred thousand, containing truth, falsehood, wisdom, propaganda or sheer nonsense. Discovering high-quality, relevant pages in this digital mess in response to a specific information need is quite difficult. (Chakrabarti et al. 1999)
So far people have relied on search engines that hunt for specific words or terms. Text searches frequently retrieve tens of thousands of pages, many of them useless. The problem is how to quickly locate only the information that is needed, and to be sure that it is authentic and reliable. (Chakrabarti et al. 1999)
Another approach to finding pages is to use manually produced subject lists, which encourage users to browse the WWW. The production of hierarchical browsing tools has sometimes led to the adoption of library classification schemes to provide the subject hierarchy. (Brümmer et al. 1997a)
The Telecommunications Software and Multimedia Laboratory of Helsinki University of Technology started the IPMAN project in January 1999. It is financed by TEKES, Nokia Networks Oy and Open Environment Software Oy. In 1999 the project produced a literature survey, which was published in Publications in Telecommunications Software and Multimedia.
The objective of the IPMAN project is to research the increasing Internet Protocol (IP) traffic and its effects on the network architecture and network management. Data volumes will grow explosively in the near future as new Internet-related services enable more customers, more interactions and more data per interaction.
Solving the problems caused by the continuously growing volume of Internet traffic is important for the business world, as networks and distributed processing systems have become critical success factors. As networks have become larger and more complex, automated network management has become unavoidable.
In the IPMAN project, network management has been divided into four levels: Network Element Management, Traffic Management, Service Management and Content Management. The levels are shown in figure 1.
Figure 1. Network management levels in the IPMAN project (Uosukainen et al. 1999, p. 14)

The network element management level deals with the question of how to manage network elements in the IP network. The traffic management level aims to manage the network so that the expected traffic properties are achieved. The service management level manages service applications and platforms. The final level, content management, deals with managing the content provided by the service applications.
During the year 1999 the main emphasis was on studying service management. The aim of the project during the year 2000 is to concentrate on content management, and the main effort is to create a prototype. The prototype's subject is content personalization, which means that a user can influence the content he or she wants to receive. My task in the IPMAN project is to survey the different classification methods that could be used in IP networks. The choice of the method to be used in the prototype will be based on my findings.
Scope of the thesis
The Web contains approximately 300 million hypertext pages, and the number continues to grow at roughly a million pages per day. The variation among pages is large. The set of Web pages lacks a unifying structure and shows far more variation in authoring style and content than has been seen in traditional text-document collections. (Chakrabarti et al. 1999b, p. 60)
The scope of this thesis is to focus on the different classification and indexing methods that are useful for text classification or indexing in IP networks. Information retrieval is one of the most popular research subjects of today. The main purpose of many study groups is to develop an efficient and useful classification or indexing method to be used for information retrieval on the Internet. This thesis introduces the basic methods of classification and indexing and some of the latest applications and projects where those methods are used. The main purpose is to find out what kinds of applications for classification and indexing have been created lately, and what their advantages and weaknesses are. An appropriate method for text classification and indexing will make IP networks, especially the Internet, more useful to end-users as well as to content providers.
Structure of the thesis
Chapter two describes metadata and possible ways to use it. Chapters three and four describe different existing indexing and classification methods.
Chapter five describes how classification and indexing are put into practice on the Internet today; the current problems and the demands of the future are also examined there. Chapter six introduces new applications that use the existing classification and indexing methods. The purpose has been to find a working, existing application of each method; however, a few applications that are still experiments are also introduced.
Chapter seven draws conclusions about all the methods and applications, and chapter eight contains the summary. The results of the thesis are thus reported in eight chapters, and the main contents are outlined in figure 2.
Figure 2. Main contents of the thesis (input: impetus, project description, current problems; process: research problem, content management, classification and indexing; output: description of Dublin Core and RDF).
This chapter explains metadata and publishing languages. One way to make classification and indexing easier is to add metadata to an electronic resource located in a network. The metadata used in electronic libraries (eLibs) is based on the Dublin Core metadata element set, which is described in chapter 2.1.1. The eLib metadata uses the 15 Dublin Core attributes. Dublin Core attributes are also used in ordinary Web pages to give metadata information to search engines.
Resource Description Framework (RDF) is a new architecture for metadata on the Web, meant especially for the diverse metadata needs of separate publishers on the Web. It can be used in resource discovery to provide better search engine capabilities and to describe the content and content relationships of a Web page.
Search engines on the Internet use the information embedded in WWW pages with some page description and publishing language. In this work, HyperText Markup Language (HTML) and one of the newest languages, Extensible Markup Language (XML), are described after Dublin Core and RDF. Extensible HyperText Markup Language (XHTML) is the latest version of HTML.
XML and XHTML are quite new publishing languages and are assumed to attain an important role in Internet publishing in the near future. Therefore both of them are described in more detail than HTML, which is the main publishing language at present but will apparently make room for XML and XHTML. In the chapters on XML and XHTML, the properties of HTML are brought forward and compared with those of XML and XHTML.
Description of metadata
The International Federation of Library Associations and Institutions gives the following description of metadata:
"Metadata is data about data. The term is used to refer to any data that is used to aid the identification, description and location of networked electronic resources. Many different metadata formats exist, some quite simple in their description, others quite complex and rich." (IFLA 2000)
According to another definition, metadata is machine-understandable information about Web resources or other things. (Berners-Lee 1997)
The main purpose of metadata is to give computers information about a document that they cannot deduce from the document itself. Keywords and descriptions are supposed to present the main concepts and subjects of the text. (Kirsanov 1997a)
Metadata is open to abuse, but it is still the only technique capable of helping computers better understand human-produced documents. According to Kirsanov, we will have no choice but to rely on some sort of metadata until computers achieve a level of intelligence comparable to that of human beings. (Kirsanov 1997a)
Metadata consists of a set of elements and attributes that are needed to describe a document. For instance, the library card index is a metadata method: it includes descriptive information, such as the creator, the title and the year of publication, of a book or other document in the library. (Stenvall and Hakala 1998)
Metadata can be used in documents in two ways:
- the elements of metadata are situated in a separate record, for instance in a library card index, or
- the elements of metadata are embedded in the document.
(Stenvall and Hakala 1998)
Once created, metadata can be interpreted and processed without human assistance because of its machine-readability. After it has been extracted from the actual content, it should be possible to transfer and process it independently and separately from the original content. This allows operations on the metadata alone instead of on the whole content. (Savia et al. 1998)
Dublin Core element set
In March 1995 the OCLC/NCSA Metadata Workshop agreed on a core list of metadata elements called the Dublin Metadata Core Element Set, or Dublin Core for short. Dublin Core provides a standard format (Internet standard RFC 2413) for metadata and ensures interoperability for the eLib metadata. The eLib metadata uses the 15 appropriate Dublin Core attributes. (Gardner 1999)
The purpose of Dublin Core metadata element set is to facilitate discovery of electronic resources. It was originally conceived for author-generated description of Web resources but it has also attracted the attention of formal resource description communities such as museums, libraries, government agencies, and commercial organizations. (DCMI 2000c)
Dublin Core attempts to capture several characteristics, analyzed below:
- it is meant to be usable by all users, non-catalogers as well as resource description specialists.
- it increases the possibility of semantic interoperability across disciplines by promoting a commonly understood set of descriptors that helps to unify other data content standards.
- it recognizes the international scope of resource discovery on the Web, which is critical to the development of an effective discovery infrastructure.
- it provides an economical alternative to more elaborate description models.
- it supports metadata modularity on the Web: the diversity of metadata needs on the Web requires an infrastructure that supports the coexistence of complementary, independently maintained metadata packages. (DCMI 2000b)
Each Dublin Core element is optional and repeatable. Most of the elements also have qualifiers, which make the meaning of the element more precise. (Stenvall and Hakala 1998)
The elements are given descriptive names. The intention of the descriptive names is to make it easier for the user to understand the semantic meaning of the element. To promote global interoperability, the element descriptions are associated with a controlled vocabulary for the respective element values. (DCMI 2000a)

Element descriptions

1. Title
The name given to the resource, usually by the creator or publisher.

2. Author or Creator
The person or organization primarily responsible for creating the intellectual content of the resource.

3. Subject and Keywords
The topic of the resource. Typically, the subject will be expressed as keywords or phrases that describe the subject or content of the resource.

4. Description
A textual description of the content of the resource.

5. Publisher
The entity responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.

6. Other Contributor
A person or organization that has made significant intellectual contributions to the resource but is not specified in a Creator element.

7. Date
The date the resource was created or made available.

8. Resource Type
The category of the resource, such as home page, novel, technical report or dictionary.

9. Format
The data format of the resource, used to identify the software and sometimes also the hardware that is needed to display or operate the resource. Dimensions, size and duration, for example, can optionally be given here as well.

10. Resource Identifier
A string or number used to identify the resource. Identifiers can be, for example, URLs (Uniform Resource Locator), URNs (Uniform Resource Name) and ISBNs (International Standard Book Number).

11. Source
Information about a second resource from which the present resource is derived, if it is considered important for discovery of the present resource.

12. Language
The language used in the content of the resource.

13. Relation
The identifier of a second resource and its relationship to the present resource. This element is used to express linkages among related resources.

14. Coverage
The spatial and/or temporal characteristics of the intellectual content of the resource. Spatial coverage refers to a physical region; temporal coverage refers to the period that the content of the resource is about.

15. Rights Management
An identifier that links to a rights management statement, or to a service providing information about rights management for the resource. (Weibel et al. 1998)
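As an illustration, a few of the elements above can be embedded in an ordinary Web page with HTML META tags, following the common DC.* naming convention. The page title and all element values below are invented for the example:

```html
<head>
  <title>Classification methods in IP networks</title>
  <!-- Dublin Core elements embedded as META tags; values are
       illustrative only -->
  <meta name="DC.Title"    content="Classification methods in IP networks">
  <meta name="DC.Creator"  content="Kirsi Lehtinen">
  <meta name="DC.Subject"  content="classification; indexing; metadata">
  <meta name="DC.Language" content="en">
  <meta name="DC.Date"     content="2000-03-01">
</head>
```

A search engine that understands the convention can read these tags and index the page by its stated title, subject and language instead of guessing them from the body text.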
Resource Description Framework
The World Wide Web Consortium (W3C) has begun to implement an architecture for metadata for the Web. The Resource Description Framework (RDF) is designed with an eye to many diverse metadata needs of vendors and information providers. (DCMI 2000c)
RDF is meant to support the interoperability of metadata. It allows any kind of Web resource, in other words any object with a Uniform Resource Identifier (URI) as its address, to be described in machine-understandable form. (Iannella 1999)
RDF is meant to provide metadata for any object that can be found on the Web. It is a means for developing tools and applications that use a common syntax for describing Web resources. In the year 1997 the W3C recognized the need for a language which would address the problems of content ratings, intellectual property rights and digital signatures while allowing all kinds of Web resources to be visible and discoverable on the Web. A working group within the W3C has drawn up a data model and syntax for RDF. (Heery 1998)
RDF is designed specifically with the Web in mind, so it takes the features of Web resources into account. Its syntax is based on a data model, which influences the way properties are described. The structure of descriptions is explicit, which makes RDF a good fit for describing Web resources. On the other hand, this might cause problems in environments where there is a need to re-use or interoperate with 'legacy metadata', which may well contain logical inconsistencies. (Heery 1998)
The model for representing properties and property values is the foundation of RDF, and the basic data model consists of three object types: resources, properties and statements.

All things described by RDF expressions are called resources. A resource can be an entire Web page, such as an HTML document, or a part of a Web page, such as an element within the HTML or XML document source. A resource may also be a whole collection of pages, such as an entire Web site. An object that is not directly accessible via the Web, such as a printed book, can also be considered a resource. A resource always has a URI and an optional anchor id.

A property is a specific aspect, characteristic, attribute or relation used to describe a resource. Each property has a specific meaning, and it defines its permitted values, the types of resources it can describe, and its relationship with other properties.

An RDF statement is a specific resource together with a named property plus the value of that property for that resource. These three parts of a statement are called the subject, the predicate and the object. The object of a statement can be another resource, specified by a URI, or a literal, that is, a simple string or other primitive data type defined by XML. (Lassila and Swick 1999)
The following sentences can be considered as an example:
The individual referred to by employee id 92758 is named Kirsi Lehtinen and has the email address email@example.com. The resource http://www.lut.fi/~klehtine/index.html was created by this individual.
The sentence is illustrated in figure 3.
Figure 3. RDF property with structured value. (Lassila and Swick 1999)
The example is written in RDF/XML in the following way:
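A sketch of such a serialization is given below, following the RDF/XML syntax of Lassila and Swick (1999). The schema namespace v: and the staff_id URI used to identify the employee are hypothetical; the email address is reproduced as it appears in the example sentence:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:v="http://description.org/schema/">
  <!-- The page's Creator property points to another resource:
       the individual identified by the (hypothetical) staff_id URI. -->
  <rdf:Description about="http://www.lut.fi/~klehtine/index.html">
    <v:Creator rdf:resource="http://www.lut.fi/staff_id/92758"/>
  </rdf:Description>
  <!-- The structured value: the individual's name and email address. -->
  <rdf:Description about="http://www.lut.fi/staff_id/92758">
    <v:Name>Kirsi Lehtinen</v:Name>
    <v:Email>email@example.com</v:Email>
  </rdf:Description>
</rdf:RDF>
```

Note how the object of the first statement is itself a resource with its own statements, which is exactly the structured value shown in figure 3.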
A universally understood language is needed for publishing information globally; it should be a language that all computers may potentially understand. (Raggett 1999) The most famous and common language for page description and publishing on the Web is HyperText Markup Language (HTML). It describes the contents and appearance of documents published on the Web. Publishing languages are formed from entities, elements and attributes. Because HTML has become insufficient for the needs of publication, other languages have been developed. Extensible Markup Language (XML) has been developed as a language which better satisfies the needs of information retrieval and of diverse browsing devices. Its purpose is to describe the structure of a document without addressing the appearance of the document. Extensible HyperText Markup Language (XHTML) is a combination of HTML and XML.
HyperText Markup Language
HyperText Markup Language (HTML) was originally developed by Tim Berners-Lee while he was working at CERN. NCSA developed the Mosaic browser, which popularized HTML. During the 1990s HTML has been a success thanks to the explosive growth of the Web, and since the beginning it has been extended in a number of ways. (Raggett 1999)
HTML is a universally understood publishing language used by the WWW. (Raggett 1999) Metadata can be embedded in an HTML document, and with the help of that metadata the document can be classified and indexed.
The properties of HTML are listed below:
- Online documents can include headings, text, tables, lists, photos, etc.
- Online information can be retrieved via hypertext links just by clicking a button.
- Forms for conducting transactions with remote services can be designed, for use in searching for information, making reservations, ordering products, etc.
- Spreadsheets, video clips, sound clips, and other applications can be included directly in documents. (Raggett 1999)
HTML is a non-proprietary format based upon Standard Generalized Markup Language (SGML). It can be created and processed by a wide range of tools, from simple plain text editors to more sophisticated authoring tools. To structure text into headings, paragraphs, lists, hypertext links etc., HTML uses tags such as <h1>, <p>, <ul> and <a>. (Raggett et al. 2000)
A typical example of HTML code could be as follows:
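A minimal sketch of such a page is shown below; the headings, link target and list items are invented for illustration:

```html
<html>
  <head>
    <title>An example page</title>
  </head>
  <body>
    <h1>Main heading</h1>
    <p>A paragraph of text with a
       <a href="http://www.w3.org/">hypertext link</a>.</p>
    <ul>
      <li>First list item</li>
      <li>Second list item</li>
    </ul>
  </body>
</html>
```

The tags mark the role of each piece of text (title, heading, paragraph, list), and the browser decides how to render them.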