Table of Contents
LIST OF FIGURES AND TABLES
1 INTRODUCTION
1.1 IPMAN-project
1.2 Scope of the thesis
1.3 Structure of the thesis
2 METADATA AND PUBLISHING LANGUAGES
2.1 Description of metadata
2.1.1 Dublin Core element set
2.1.2 Resource Description Framework
2.2 Description of publishing languages
2.2.1 HyperText Markup Language
2.2.2 Extensible Markup Language
2.2.3 Extensible HyperText Markup Language
3 METHODS OF INDEXING
3.1 Description of indexing
3.2 Customs to index
3.2.1 Full-text indexing
3.2.2 Inverted indexing
3.2.3 Semantic indexing
3.2.4 Latent semantic indexing
3.3 Automatic indexing vs. manual indexing
4 METHODS OF CLASSIFICATION
4.1 Description of classification
4.2 Classification used in libraries
4.2.1 Dewey Decimal Classification
4.2.2 Universal Decimal Classification
4.2.3 Library of Congress Classification
4.2.4 National general schemes
4.2.5 Subject specific and home-grown schemes
4.3 Neural network methods and fuzzy systems
4.3.1 Self-Organizing Map / WEBSOM
4.3.2 Multi-Layer Perceptron Network
4.3.3 Fuzzy clustering
5 INFORMATION RETRIEVAL IN IP NETWORKS
5.1 Classification at present
5.1.1 Search alternatives
5.1.2 Searching problems
5.2 Demands in future
6 CLASSIFICATION AND INDEXING APPLICATIONS
6.1 Library classification-based applications
6.1.1 WWLib – DDC classification
6.1.2 GERHARD with DESIRE II – UDC classification
6.1.3 CyberStacks(sm) – LC classification
6.2 Neural network classification-based applications
6.2.1 Basic Units for Retrieval and Clustering of Web Documents – SOM-based classification
6.2.2 HyNeT – Neural network classification
6.3 Applications with other classification methods
6.3.1 Mondou – web search engine with mining algorithm
6.3.2 EVM – advanced search technology for unfamiliar metadata
6.3.3 SHOE – Semantic Search with SHOE Search Engine
7 CONCLUSIONS
8 SUMMARY
LIST OF FIGURES AND TABLES
LIST OF FIGURES
Figure 1. Network management levels in IPMAN-project (Uosukainen et al. 1999, p. 14)
Figure 2. Outline of the thesis.
Figure 3. RDF property with structured value. (Lassila and Swick 1999)
Figure 4. The structure and function of a neuron (Department of Trade and Industry 1993, p. 2.2)
Figure 5. A neural network architecture (Department of Trade and Industry 1993, p. 2.3; Department of Trade and Industry 1994, p. 17)
Figure 6. The computation involved in an example neural network unit. (Department of Trade and Industry 1994, p. 15)
Figure 7. The architecture of a SOM network. (Browne NCTT 1998)
Figure 8. The training process (Department of Trade and Industry 1993, p. 2.1)
Figure 9. A characteristic function of the set A. (Tizhoosh 2000)
Figure 10. A characterizing membership function of young people's fuzzy set. (Tizhoosh 2000)
Figure 11. Overview of the WWLib architecture. (Jenkins et al. 1998)
Figure 12. Classification system with BUDWs. (Hatano et al. 1999)
Figure 13. The structure of the Mondou system (Kawano and Hasegawa 1998)
Figure 14. The external architecture of the EVM system (Gey et al. 1999)
Figure 15. The SHOE system architecture. (Heflin et al. 2000a)
LIST OF TABLES
Table 1. Precision and recall ratios between normal and relevance feedback operations (Hatano et al. 1999)
Table 2. Distribution of the titles. (Wermter et al. 1999)
Table 3. Results of the use of the recurrent plausibility network. (Panchev et al. 1999)
LIST OF ABBREVIATIONS
AI Artificial Intelligence
CERN European Organization for Nuclear Research
CGI Common Gateway Interface
DARPA Defense Advanced Research Projects Agency
DDC Dewey Decimal Classification
DESIRE Development of a European Service for Information on Research and Education
DFG Deutsche Forschungsgemeinschaft
DTD Document Type Definition
Ei Engineering information
eLib Electronic Library
ETH Eidgenössische Technische Hochschule
GERHARD German Harvest Automated Retrieval and Directory
HTML HyperText Markup Language
HTTP HyperText Transfer Protocol
IP Internet Protocol
IR Information Retrieval
ISBN International Standard Book Number
KB Knowledge base
LCC Library of Congress Classification
LC Library of Congress
MARC Machine-Readable Cataloguing
MLP Multi-Layer Perceptron Network
NCSA National Center for Supercomputing Applications
PCDATA parsed character data
RDF Resource Description Framework
SIC Standard Industrial Classification
SGML Standard Generalized Markup Language
TCP Transmission Control Protocol
UDC Universal Decimal Classification
URI Uniform Resource Identifier
URL Uniform Resource Locator
URN Uniform Resource Name
W3C World Wide Web Consortium
WEBSOM Neural network (SOM) software product
VRML Virtual Reality Modeling Language
WWW World Wide Web
XHTML Extensible HyperText Markup Language
XML Extensible Markup Language
XSL Extensible Stylesheet Language
XLink XML Linking Language
XPointer XML Pointer Language
INTRODUCTION
The Internet, and especially its most famous offspring, the World Wide Web (WWW), has changed the way most of us do business and go about our daily working lives. In the past several years, the proliferation of personal computers and other key technologies such as client-server computing, standardized communications protocols (TCP/IP, HTTP), Web browsers, and corporate intranets has dramatically changed the way we discover, view, obtain, and exploit information. Besides being an infrastructure for electronic mail and a playground for academic users, the Internet has increasingly become a vital information resource for commercial enterprises, which want to keep in touch with their existing customers or reach new customers with new online product offerings. The Internet has also become an information resource through which enterprises keep track of their competitors' strengths and weaknesses. (Ferguson and Wooldridge, 1997)
The growth in the volume and diversity of the WWW creates an increasing demand among its users for sophisticated information and knowledge management services beyond searching and retrieval. Such services include cataloguing and classification, resource discovery and filtering, personalization of access and monitoring of new and changing resources, among others. The number of professionally and commercially valuable information resources available on the WWW has grown considerably over the last years, yet access to them still relies on general-purpose Internet search engines. Satisfying the vast and varied requirements of corporate users is quickly becoming too complex a task for Internet search engines. (Ferguson and Wooldridge, 1997)
Every day the WWW grows by roughly a million electronic pages, adding to the hundreds of millions already on-line. This volume of information is loosely held together by more than a billion connections, called hyperlinks. (Chakrabarti et al. 1999)
Because of the Web's rapid, chaotic growth, it lacks organization and structure. People of any background, education, culture, interest and motivation, writing in many kinds of dialect or style, can create Web pages in any language. Each page might range from a few characters to a few hundred thousand, containing truth, falsehood, wisdom, propaganda or sheer nonsense. Discovering high-quality, relevant pages in this digital mess, in response to a specific information need, is quite difficult. (Chakrabarti et al. 1999)
So far people have relied on search engines that hunt for specific words or terms. Text searches frequently retrieve tens of thousands of pages, many of them useless. The problem is how to quickly locate only the information that is needed, and to be sure that it is authentic and reliable. (Chakrabarti et al. 1999)
The other approach to finding pages is to use manually produced subject lists, which encourage users to browse the WWW. The production of hierarchical browsing tools has sometimes led to the adoption of library classification schemes to provide the subject hierarchy. (Brümmer et al. 1997a)
IPMAN-project
The Telecommunications Software and Multimedia Laboratory of Helsinki University of Technology started the IPMAN-project in January 1999. The project is financed by TEKES, Nokia Networks Oy and Open Environment Software Oy. In 1999 the project produced a literature study, which was published in Publications in Telecommunications Software and Multimedia.
The objective of the IPMAN-project is to study the increase in Internet Protocol (IP) traffic and its effects on the network architecture and network management. Data volumes will grow explosively in the near future, as new Internet-related services enable more customers, more interactions and more data per interaction.
Solving the problems caused by the continuously growing volume of the Internet is important for the business world, as networks and distributed processing systems have become critical success factors. As networks have become larger and more complex, automated network management has become unavoidable.
In the IPMAN-project, network management has been divided into four levels: Network Element Management, Traffic Management, Service Management and Content Management. The levels are shown in figure 1.
Figure 1. Network management levels in IPMAN-project (Uosukainen et al. 1999, p. 14)
The network element management level deals with how to manage the network elements in an IP network. The traffic management level aims to manage the network so that the expected traffic properties are achieved. The service management level manages service applications and platforms. The final level, content management, deals with managing the content provided by the service applications.
During the year 1999 the main emphasis was on studying service management. In the year 2000 the project concentrates on content management, and the main goal is to create a prototype. The prototype's subject is content personalization, which means that users can influence the content they receive. My task in the IPMAN-project is to survey the different classification methods that could be used in IP networks. The decision on the method to be used in the prototype will be based on this survey.
Scope of the thesis
The Web contains approximately 300 million hypertext pages, and the number of pages continues to grow at roughly a million pages per day. The variation between pages is large: the set of Web pages lacks a unifying structure and shows more variation in authoring style and content than has been seen in traditional text-document collections. (Chakrabarti et al. 1999b, p. 60)
The scope of this thesis is to focus on different classification and indexing methods that are useful for text classification or indexing in IP networks. Information retrieval is one of the most popular research subjects of today, and the main purpose of many study groups is to develop an efficient and useful classification or indexing method for information retrieval on the Internet. This thesis introduces the basic methods of classification and indexing, and some of the latest applications and projects where those methods are used. The main purpose is to find out what kinds of classification and indexing applications have been created lately, and what their advantages and weaknesses are. An appropriate method for text classification and indexing will make IP networks, especially the Internet, more useful to end-users as well as to content providers.
Structure of the thesis
Chapter two describes metadata and possible ways to use it. Chapters three and four describe the existing indexing and classification methods.
Chapter five describes how classification and indexing are put into practice on the Internet today; the current problems and the demands of the future are also examined there. Chapter six introduces new applications that use the existing classification and indexing methods. The purpose has been to find a working, existing application of each method, although a few applications that are still experiments are introduced as well.
Chapter seven presents conclusions about all the methods and applications, and chapter eight contains the summary. The results of the thesis are thus reported in eight chapters, and the main contents are outlined in figure 2.
[Figure 2 presents the outline of the thesis as an input-process-output diagram: the impetus, the project description and the current problems lead to the research problem of content management, classification and indexing; the descriptions of metadata and publishing languages cover Dublin Core, RDF, HTML and XML; the descriptions of indexing and classification methods cover customs to index, automatic indexing, DDC, UDC, LCC, special schemes, SOM, MLP and fuzzy clustering; information retrieval and search engines cover search alternatives, searching problems and future demands; and the new applications and experiments include Mondou, EVM, SHOE, WWLib, DESIRE II and CyberStacks.]
Figure 2. Outline of the thesis.
METADATA AND PUBLISHING LANGUAGES
Metadata and publishing languages are explained in this chapter. One way to make classification and indexing easier is to add metadata to an electronic resource located in a network. The metadata used in electronic libraries (eLibs) is based on the Dublin Core metadata element set, which is described in chapter 2.1.1. The eLib metadata uses the 15 Dublin Core attributes. Dublin Core attributes are also used in ordinary Web pages to give metadata information to search engines.
Resource Description Framework (RDF) is a new architecture for metadata on the Web, designed especially for the diverse metadata needs of separate publishers on the Web. It can be used in resource discovery to provide better search engine capabilities and to describe the content and content relationships of a Web page.
Search engines on the Internet use the information embedded in WWW pages, which are written in some page description and publishing language. In this work, HyperText Markup Language (HTML) and one of the newest languages, Extensible Markup Language (XML), are described after Dublin Core and RDF. Extensible HyperText Markup Language (XHTML) is the latest version of HTML.
XML and XHTML are quite new publishing languages and are expected to attain an important role in Internet publishing in the near future. Therefore both of them are described in more detail than HTML, which is the main publishing language at present but will apparently give way to XML and XHTML. In the chapters on XML and XHTML, the properties of HTML are brought up and compared with those of XML and XHTML.
Description of metadata
The International Federation of Library Associations and Institutions gives the following description of metadata:
"Metadata is data about data. The term is used to refer to any data that is used to aid the identification, description and location of networked electronic resources. Many different metadata formats exist, some quite simple in their description, others quite complex and rich." (IFLA 2000)
According to another definition, metadata is machine-understandable information about Web resources or other things. (Berners-Lee 1997)
The main purpose of metadata is to give information about a document to computers that cannot deduce this information from the document itself. Keywords and descriptions are supposed to present the main concepts and subjects of the text. (Kirsanov 1997a)
Metadata is open to abuse, but it is still the only technique capable of helping computers to better understand human-produced documents. According to Kirsanov, we will have no choice but to rely on some sort of metadata until computers achieve a level of intelligence comparable to that of human beings. (Kirsanov 1997a)
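Such keywords and descriptions are typically written into the head of an HTML document with meta tags. The following is only an illustrative sketch; the title and the content values are invented for this example:

```html
<head>
  <title>Classification methods in IP networks</title>
  <!-- hypothetical example values for the conventional
       "keywords" and "description" meta names -->
  <meta name="keywords" content="classification, indexing, metadata">
  <meta name="description"
        content="An overview of text classification and indexing
                 methods for information retrieval.">
</head>
```

A search engine that indexes the page can read these fields directly, without trying to deduce the subject of the document from its body text.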
Metadata consists of a set of elements and attributes that are needed to describe a document. For instance, library card indexing is a metadata method: a card includes descriptive information, such as the creator, the title and the year of publication, about a book or other document in the library. (Stenvall and Hakala 1998)
Metadata can be used in documents in two ways:
- the elements of metadata are situated in a separate record, for instance in a library card index, or
- the elements of metadata are embedded in the document.
(Stenvall and Hakala 1998)
Once created, metadata can be interpreted and processed without human assistance because of its machine-readability. Once extracted from the actual content, it should be possible to transfer and process the metadata independently and separately from the original content. This allows operations on the metadata alone instead of on the whole content. (Savia et al. 1998)
Dublin Core element set
In March 1995 the OCLC/NCSA Metadata Workshop agreed on a core list of metadata elements called the Dublin Metadata Core Element Set, or Dublin Core for short. Dublin Core provides a standard format (Internet standard RFC 2413) for metadata and ensures interoperability for the eLib metadata. The eLib metadata uses the 15 appropriate Dublin Core attributes. (Gardner 1999)
The purpose of Dublin Core metadata element set is to facilitate discovery of electronic resources. It was originally conceived for author-generated description of Web resources but it has also attracted the attention of formal resource description communities such as museums, libraries, government agencies, and commercial organizations. (DCMI 2000c)
Dublin Core aims to provide several characteristics, outlined below:
- it increases the possibility of semantic interoperability across disciplines by promoting a commonly understood set of descriptors that helps to unify other data content standards,
- it recognizes the international scope of resource discovery on the Web, which is critical to the development of an effective discovery infrastructure,
- it provides an economical alternative to more elaborate description models,
- it supports metadata modularity on the Web: the diversity of metadata needs requires an infrastructure that supports the coexistence of complementary, independently maintained metadata packages. (DCMI 2000b)
Each Dublin Core element is optional and repeatable. Most of the elements also have qualifiers, which make the meaning of the element more precise. (Stenvall and Hakala 1998)
The elements have been given descriptive names, the intention of which is to make it easier for the user to understand the semantic meaning of each element. To promote global interoperability, the element descriptions are associated with a controlled vocabulary for the respective element values. (DCMI 2000a)
1. Title
The name given to the resource, usually by the creator or publisher.
2. Author or Creator
The person or organization primarily responsible for creating the intellectual content of the resource.
3. Subject and Keywords
The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource.
4. Description
A textual description of the content of the resource.
5. Publisher
The entity responsible for making the resource available in its present form, like a publishing house, a university department, or a corporate entity.
6. Other Contributor
A person or organization that has made significant intellectual contributions to the resource but was not specified in a Creator element.
7. Date
The date on which the resource was created or made available.
8. Resource Type
The category in which the resource belongs, such as home page, novel, poem, working paper, technical report, essay, dictionary.
9. Format
The data format of the resource, used to identify the software and sometimes also the hardware needed to display or operate the resource. Optional information such as dimensions, size and duration can also be given here.
10. Resource Identifier
A string or number used to identify the resource, for example a URL (Uniform Resource Locator), URN (Uniform Resource Name) or ISBN (International Standard Book Number).
11. Source
Information about a second resource from which the present resource is derived, given when it is considered important for the discovery of the present resource.
12. Language
The language used in the content of the resource.
13. Relation
The identifier of a second resource and its relationship to the present resource. This element is used to express linkages among related resources.
14. Coverage
The spatial and/or temporal characteristics of the intellectual content of the resource. Spatial coverage refers to a physical region; temporal coverage refers to the period of time the content of the resource is about.
15. Rights Management
An identifier that links to a rights management statement, or an identifier that links to a service providing information about rights management for the resource. (Weibel et al. 1998)
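In practice, these elements are often embedded in the head of an HTML page as meta tags whose names carry the DC. prefix. The following is a hypothetical sketch; the content values are invented for illustration:

```html
<head>
  <!-- hypothetical Dublin Core metadata embedded in HTML -->
  <meta name="DC.Title" content="Methods of classification in IP networks">
  <meta name="DC.Creator" content="N.N.">
  <meta name="DC.Subject" content="classification; indexing; metadata">
  <meta name="DC.Language" content="en">
  <meta name="DC.Date" content="2000-05-15">
</head>
```

A search engine or harvester that understands the Dublin Core convention can then read these elements without parsing the body of the document.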
Resource Description Framework
The World Wide Web Consortium (W3C) has begun to implement an architecture for metadata for the Web. The Resource Description Framework (RDF) is designed with an eye to many diverse metadata needs of vendors and information providers. (DCMI 2000c)
RDF is meant to support the interoperability of metadata. It allows any kind of Web resources, in other words, any object with a Uniform Resource Identifier (URI) as its address, to be made available in machine understandable form. (Iannella 1999)
RDF is meant to be metadata for any object that can be found on the Web. It is a means for developing tools and applications that use a common syntax for describing Web resources. In the year 1997 the W3C recognized the need for a language which would address the problems of content rating, intellectual property rights and digital signatures, while allowing all kinds of Web resources to be visible and discoverable on the Web. A working group within the W3C has drawn up a data model and syntax for RDF. (Heery 1998)
RDF is designed specifically with the Web in mind, so it takes into account the features of Web resources. Its syntax is based on a data model, which influences the way properties are described. The structure of descriptions is explicit, which makes RDF a good fit for describing Web resources. On the other hand, this might cause problems in environments where there is a need to re-use or interoperate with 'legacy metadata', which may well contain logical inconsistencies. (Heery 1998)
The model for representing properties and property values is the foundation of RDF and the basic data model consists of three object types:
Resources. All things described by RDF expressions are called resources. A resource can be an entire Web page, such as an HTML document, or a part of a Web page, such as an element within the HTML or XML document source. A resource may also be a whole collection of pages, such as an entire Web site. An object that is not directly accessible via the Web, such as a printed book, can also be considered a resource. A resource always has a URI and an optional anchor ID.
Properties. A property is a specific aspect, characteristic, attribute or relation used to describe a resource. Each property has a specific meaning, and it defines its permitted values, the types of resources it can describe, and its relationship with other properties.
Statements. An RDF statement is a specific resource together with a named property plus the value of that property for that resource. These three parts of a statement are called the subject, the predicate, and the object. The object of a statement can be another resource or a literal; that is, a resource specified by a URI, or a simple string or other primitive data type defined by XML. (Lassila and Swick 1999)
The following sentences can be considered as an example:
The individual referred to by employee id 92758 is named Kirsi Lehtinen and has the email address firstname.lastname@example.org. The resource http://www.lut.fi/~klehtine/index.html was created by this individual.
The sentences are illustrated in figure 3.
Figure 3. RDF property with structured value. (Lassila and Swick 1999)
The example is written in RDF/XML in the following way:
<rdf:RDF>
  <rdf:Description about="http://www.lut.fi/~klehtine/index.html">
    <s:Creator rdf:resource="http://www.lut.fi/studentid/92758"/>
  </rdf:Description>
  <rdf:Description about="http://www.lut.fi/studentid/92758">
    <v:Name>Kirsi Lehtinen</v:Name>
    <v:Email>firstname.lastname@example.org</v:Email>
  </rdf:Description>
</rdf:RDF>
Here the s: and v: prefixes refer to schema namespaces declared in the enclosing rdf:RDF element. (Lassila and Swick 1999)
Description of publishing languages
A universally understood language is needed for publishing information globally; it should be a language that all computers can potentially understand. (Raggett 1999) The most famous and common language for page description and publishing on the Web is HyperText Markup Language (HTML), which describes the contents and appearance of documents published on the Web. Publishing languages are formed from entities, elements and attributes. Because HTML has become insufficient for the needs of publishing, other languages have been developed. Extensible Markup Language (XML) has been developed to better satisfy the needs of information retrieval and diverse browsing devices; its purpose is to describe the structure of a document without dictating the appearance of the document. Extensible HyperText Markup Language (XHTML) is a combination of HTML and XML.
HyperText Markup Language
HyperText Markup Language (HTML) was originally developed by Tim Berners-Lee while he was working at CERN. NCSA developed the Mosaic browser, which popularized HTML, and during the 1990s the language flourished with the explosive growth of the Web. Since the beginning, HTML has been extended in a number of ways. (Raggett 1999)
HTML is a universally understood publishing language used on the WWW. (Raggett 1999) Metadata can be embedded in an HTML document, and with the help of this metadata the document can be classified and indexed.
The properties of HTML are listed below:
- Online documents can include headings, text, tables, lists, photos, etc.
- Online information can be retrieved via hypertext links just by clicking a button.
- Forms for conducting transactions with remote services can be designed, for example for searching for information, making reservations, or ordering products.
- Spreadsheets, video clips, sound clips, and other applications can be included directly in documents. (Raggett 1999)
HTML is a non-proprietary format based upon Standard Generalized Markup Language (SGML). It can be created and processed by a wide range of tools, from simple plain-text editors to more sophisticated authoring tools. To structure text into headings, paragraphs, lists, hypertext links etc., HTML uses tags such as <h1> and <p>. (Raggett et al. 2000)
A typical example of HTML code could be as follows:
<html>
<head>
<title>My first HTML document</title>
</head>
<body>
<p>Hello world!</p>
</body>
</html>