Index (search engine)

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

Popular engines focus on the full-text indexing of online, natural language documents. Media types such as video, audio, and graphics are also searchable.

Metasearch engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.

Indexing

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.
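
To make the tradeoff concrete, the following sketch (in Python, over a hypothetical corpus of 10,000 tiny documents; an illustration, not any engine's implementation) contrasts a sequential scan with a lookup against a precomputed index:

  import time

  # A hypothetical corpus: 10,000 documents, each a list of words.
  corpus = [["the", "cow", "says", "moo"]] * 10000

  # Sequential scan: examine every word of every document per query.
  def scan(corpus, word):
      return [i for i, doc in enumerate(corpus) if word in doc]

  # Build the index once; afterwards each query is a direct lookup.
  index = {}
  for i, doc in enumerate(corpus):
      for word in doc:
          index.setdefault(word, set()).add(i)

  start = time.perf_counter()
  scan(corpus, "moo")
  print("sequential scan:", time.perf_counter() - start, "seconds")

  start = time.perf_counter()
  index.get("moo", set())
  print("indexed lookup:", time.perf_counter() - start, "seconds")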

Index Design Factors

Major factors in designing a search engine's architecture include:

Merge factors : How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL MERGE command and other merge algorithms.
Storage techniques : How to store the index data, that is, whether information should be compressed or filtered.
Index size : How much computer storage is required to support the index.
Lookup speed : How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.
Maintenance : How the index is maintained over time.
Fault tolerance : How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, and schemes such as hash-based or composite partitioning, as well as replication.

Index Data Structures

Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Types of indices include:

Suffix tree : Figuratively structured like a tree, supports linear-time lookup. Built by storing the suffixes of words. The suffix tree is a type of trie. Tries support extendible hashing, which is important for search engine indexing. Used for searching for patterns in DNA sequences and clustering. A major drawback is that storing a word in the tree may require more space than storing the word itself. An alternate representation is a suffix array, which is considered to require less virtual memory and supports data compression schemes such as the Burrows–Wheeler transform (BWT).

Inverted index : Stores a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.

Citation index : Stores citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
N-gram index : Stores sequences of length n of data to support other types of retrieval or text mining.
Document-term matrix : Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional sparse matrix.

Challenges in Parallelism

A major challenge in the design of search engines is the management of serial computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.
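
A minimal sketch of the producer-consumer relationship described above, with one thread standing in for the crawler and another for the indexer (real engines use distributed processes, not two threads, and the document contents here are made up):

  import queue
  import threading

  docs = queue.Queue()

  def crawler():  # producer: fetches documents into the cache
      for i in range(3):
          docs.put(("doc%d" % i, "the cow says moo"))
      docs.put(None)  # sentinel: no more documents

  def indexer():  # consumer: parses each document as it arrives
      while True:
          item = docs.get()
          if item is None:
              break
          name, text = item
          print(name, "->", text.split())

  t1 = threading.Thread(target=crawler)
  t2 = threading.Thread(target=indexer)
  t1.start()
  t2.start()
  t1.join()
  t2.join()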

Inverted Indices

Many search engines incorporate an inverted index when evaluating a search query to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:
Inverted Index
Word Documents
the Document 1, Document 3, Document 4, Document 5
cow Document 2, Document 3, Document 4
says Document 5
moo Document 7
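
Given such an index, a Boolean AND query is answered by intersecting the document lists of the query words; a sketch using the table above (documents abbreviated to their numbers):

  # The sample inverted index above, documents abbreviated to numbers.
  index = {
      "the":  {1, 3, 4, 5},
      "cow":  {2, 3, 4},
      "says": {5},
      "moo":  {7},
  }

  # AND semantics: a document matches only if it contains every word.
  def boolean_and(index, words):
      sets = [index.get(w, set()) for w in words]
      return set.intersection(*sets) if sets else set()

  print(boolean_and(index, ["the", "cow"]))  # {3, 4}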


This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a Boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.
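
For example, a positional index records the offsets at which each word occurs in each document, and a phrase matches where those offsets are consecutive; a sketch with made-up postings:

  # Hypothetical positional postings: word -> {document: [positions]}.
  index = {
      "the":  {5: [0]},
      "cow":  {5: [1]},
      "says": {5: [2]},
      "moo":  {5: [3]},
  }

  # A two-word phrase matches in a document if some position of the
  # first word is immediately followed by a position of the second.
  def phrase(index, w1, w2):
      hits = set()
      for doc, pos1 in index.get(w1, {}).items():
          pos2 = set(index.get(w2, {}).get(doc, []))
          if any(p + 1 in pos2 for p in pos1):
              hits.add(doc)
      return hits

  print(phrase(index, "cow", "says"))  # {5}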

The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two-dimensional array. The index is similar to the term-document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table.

Index Merging

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.
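
In the simplest case, merging the in-memory postings for a word with the on-disk postings reduces to merging two sorted lists; a sketch (document identifiers only, both lists assumed sorted):

  # Merge two sorted postings lists (e.g. an in-memory run and an
  # on-disk run) into one sorted, de-duplicated list for one word.
  def merge_postings(mem, disk):
      out, i, j = [], 0, 0
      while i < len(mem) and j < len(disk):
          if mem[i] == disk[j]:
              out.append(mem[i]); i += 1; j += 1
          elif mem[i] < disk[j]:
              out.append(mem[i]); i += 1
          else:
              out.append(disk[j]); j += 1
      return out + mem[i:] + disk[j:]

  print(merge_postings([2, 5, 9], [1, 2, 8]))  # [1, 2, 5, 8, 9]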

After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts, the development of a forward index and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.

The Forward Index

The forward index stores a list of words for each document. The following is a simplified form of the forward index:
Forward Index
Document Words
Document 1 the,cow,says,moo
Document 2 the,cat,and,the,hat
Document 3 the,dish,ran,away,with,the,spoon


The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
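
The conversion can be sketched directly: emit a (word, document) pair for every word occurrence, sort the pairs by word, and group them into postings (document names follow the example table):

  from itertools import groupby

  # Two documents from the forward index example above.
  forward = {
      "Document 1": ["the", "cow", "says", "moo"],
      "Document 2": ["the", "cat", "and", "the", "hat"],
  }

  # Emit (word, document) pairs, sort by word, group into postings.
  pairs = sorted((w, d) for d, words in forward.items() for w in words)
  inverted = {w: sorted({d for _, d in grp})
              for w, grp in groupby(pairs, key=lambda p: p[0])}

  print(inverted["the"])  # ['Document 1', 'Document 2']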

Compression

Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full-text, Internet search engine.
  • An estimated 2,000,000,000 different web pages exist as of the year 2000
  • Suppose there are 250 words on each webpage (based on the assumption that they are similar to the pages of a novel)
  • It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2 bytes per character
  • The average number of characters in any given word on a page may be estimated at 5
  • The average personal computer comes with 100 to 250 gigabytes of usable space


Given this scenario, an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2,500 gigabytes of storage space alone, more than the average free disk space of 25 personal computers. This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression.
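
The arithmetic behind the 2,500-gigabyte figure, spelled out:

  pages = 2_000_000_000   # estimated web pages
  words_per_page = 250    # assumed words per page
  bytes_per_word = 5      # 5 characters at 1 byte each

  total = pages * words_per_page * bytes_per_word
  print(total / 10**9, "gigabytes")  # 2500.0 gigabytes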

Notably, large-scale search engine designs incorporate the cost of storage as well as the cost of electricity to power the storage. Thus compression is a measure of cost.

Document parsing

Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

Natural language processing, as of 2006, is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementations of which are commonly kept as corporate secrets.

Challenges in Natural Language Processing

Word Boundary Ambiguity : Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the texts of other languages such as Chinese, Japanese or Arabic represent a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).
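
One simple illustration of such language-specific logic is greedy longest-match ("maximum matching") segmentation against a dictionary, a common baseline for scripts written without spaces; the lexicon below is a made-up stand-in, and the unspaced text is English purely so the example is readable:

  # A toy lexicon standing in for a real dictionary of the language.
  lexicon = {"we", "like", "search", "engines"}

  # At each position, take the longest dictionary word that matches;
  # unknown characters are emitted as single-character tokens.
  def segment(text, lexicon):
      words, i = [], 0
      longest = max(map(len, lexicon))
      while i < len(text):
          for n in range(min(longest, len(text) - i), 0, -1):
              if text[i:i + n] in lexicon:
                  words.append(text[i:i + n])
                  i += n
                  break
          else:
              words.append(text[i])
              i += 1
      return words

  print(segment("welikesearchengines", lexicon))
  # ['we', 'like', 'search', 'engines']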

Language Ambiguity : To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.

Diverse File Formats : In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.

Faulty Storage : The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

Tokenization

Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a tokenizer or parser or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.

During tokenization, the parser identifies sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
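
A minimal sketch of a tokenizer recording a few of these characteristics (stream position, byte offset, and case), with a simple regular expression standing in for a production lexer:

  import re

  # Words, or single punctuation marks, as tokens.
  TOKEN = re.compile(r"\w+|[^\w\s]")

  def tokenize(text):
      for i, m in enumerate(TOKEN.finditer(text)):
          tok = m.group()
          case = ("upper" if tok.isupper() else
                  "lower" if tok.islower() else
                  "proper" if tok.istitle() else "other")
          yield {"token": tok, "position": i,
                 "offset": m.start(), "case": case}

  for t in tokenize("The cow says Moo."):
      print(t)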

Language Recognition

If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language dependent (such as stemming and part of speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Finding which language the words belong to may involve the use of a language recognition chart.
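
One simple family of approaches scores a document's character n-grams against per-language frequency profiles; the tiny profiles below are made-up stand-ins for profiles trained on real corpora:

  # Toy per-language trigram profiles (illustrative only; real systems
  # derive these from large training corpora).
  profiles = {
      "english": {"the", "ing", "and", "ion"},
      "german":  {"sch", "der", "ein", "ich"},
  }

  # Guess the language whose profile overlaps the document's trigrams most.
  def guess_language(text, profiles):
      trigrams = {text[i:i + 3].lower() for i in range(len(text) - 2)}
      return max(profiles, key=lambda lang: len(trigrams & profiles[lang]))

  print(guess_language("The indexing of things", profiles))  # english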

Format Analysis

If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, HTML documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and font size or style. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents which controls the way the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, and text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:
  • HTML
  • ASCII text files (a text document without specific computer-readable formatting)
  • Adobe's Portable Document Format (PDF)
  • PostScript (PS)
  • LaTeX
  • UseNet netnews server formats
  • XML and derivatives like RSS
  • SGML
  • Multimedia metadata formats like ID3
  • Microsoft Word
  • Microsoft Excel
  • Microsoft PowerPoint
  • IBM Lotus Notes


Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, and writing a custom parser.
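
For HTML, tag stripping can be sketched with Python's standard-library parser (a stand-in for a full format-analysis pipeline; a fuller version would also discard script and style content):

  from html.parser import HTMLParser

  # Collect only character data, discarding tags and their attributes.
  class TagStripper(HTMLParser):
      def __init__(self):
          super().__init__()
          self.chunks = []
      def handle_data(self, data):
          self.chunks.append(data)
      def text(self):
          return "".join(self.chunks)

  stripper = TagStripper()
  stripper.feed("<html><body><b>The cow</b> says moo.</body></html>")
  print(stripper.text())  # The cow says moo.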

Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately (see the sketch after this list). Commonly supported compressed file formats include:
  • ZIP - ZIP archive file
  • RAR - Roshal ARchive file
  • CAB - Microsoft Windows Cabinet file
  • Gzip - file compressed with gzip
  • BZIP - file compressed using bzip2
  • Tape ARchive (TAR) - Unix archive file, not (itself) compressed
  • TAR.Z, TAR.GZ or TAR.BZ2 - Unix archive files compressed with Compress, GZIP or BZIP2
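
A sketch of the decompress-then-index step for two of these formats, using Python's standard-library modules (the indexer and file paths are hypothetical placeholders):

  import gzip
  import zipfile

  def index_document(name, data):  # stand-in for the real indexer
      print("indexing", name, len(data), "bytes")

  # A gzip file yields a single decompressed document.
  def index_gzip(path):
      with gzip.open(path, "rb") as f:
          index_document(path, f.read())

  # A zip archive may yield several documents, indexed separately.
  def index_zip(path):
      with zipfile.ZipFile(path) as z:
          for name in z.namelist():
              index_document(name, z.read(name))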


Format analysis can involve quality-improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:
  • Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).
  • Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.

Section Recognition

Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side sections which do not contain primary material (that which the document is about). For example, this article displays a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:
  • Content in different sections is treated as related in the index, when in reality it is not
  • Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.


Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the Noscript tag to ensure that the web page is indexed properly. At the same time, this fact can also be exploited to cause the search engine indexer to 'see' different content than the viewer.

Meta Tag Indexing

Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. Earlier Internet search engine technology would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was the hardware able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.
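
Extracting those keywords can be sketched with the standard-library HTML parser, roughly as a keyword-only engine might have done (the logic is illustrative, not any particular engine's):

  from html.parser import HTMLParser

  # Pull the content of <meta name="keywords" content="..."> tags.
  class MetaKeywords(HTMLParser):
      def __init__(self):
          super().__init__()
          self.keywords = []
      def handle_starttag(self, tag, attrs):
          a = dict(attrs)
          if tag == "meta" and a.get("name", "").lower() == "keywords":
              self.keywords += [k.strip() for k in
                                a.get("content", "").split(",")]

  p = MetaKeywords()
  p.feed('<meta name="keywords" content="cow, moo, dairy">')
  print(p.keywords)  # ['cow', 'moo', 'dairy']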

As the Internet grew through the 1990s, many brick-and-mortar corporations went 'online' and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified was leading to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.

In desktop search, many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is more under the control of the user, while Internet search engines must focus more on the full-text index.

See also

  • Compound term processing
  • Concordance
  • Content analysis
  • Controlled vocabulary
  • Desktop search
  • Documentation
  • Document retrieval
  • Index (database)
  • Information extraction
  • Information retrieval
  • Keyword in Context indexing
  • Latent semantic indexing
  • List of search engines
  • Natural language processing
  • Search engine
  • Selection-based search
  • Semantic Web
  • Text mining
  • Text retrieval
  • Vertical search
  • Web crawler
  • Website Parse Template
  • Windows Indexing Service


