Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other forms of knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, authorised terms that have been preselected by the designer of the vocabulary, in contrast to natural language vocabularies, where there is no restriction on the vocabulary.
In library and information science
In library and information science, a controlled vocabulary is a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search. Controlled vocabularies solve the problems of homographs, synonyms and polysemes by a bijection between concepts and authorized terms. In short, controlled vocabularies reduce the ambiguity inherent in normal human languages, where the same concept can be given different names, and ensure consistency.
For example, in the Library of Congress Subject Headings (a subject heading system that uses a controlled vocabulary), authorized terms (subject headings in this case) have to be chosen to handle choices between variant spellings of the same concept (American versus British), choices between scientific and popular terms (Cockroaches versus Periplaneta americana), and choices between synonyms (automobiles versus cars), among other difficult issues.
Choices of authorized terms are based on the principles of:
- user warrant (what terms users are likely to use),
- literary warrant (what terms are generally used in the literature and documents), and
- structural warrant (terms chosen by considering the structure and scope of the controlled vocabulary).
Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term "pool" has to be qualified to refer to either a swimming pool or the game of pool, to ensure that each authorized term or heading refers to only one concept.
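The mapping from variant terms to authorized headings, and the use of qualifiers for homographs, can be sketched in a few lines of Python. This is only an illustration: the tables `ENTRY_TERMS` and `HOMOGRAPHS`, the function `authorized_headings`, and the particular headings chosen as "authorized" are all invented for the example and are not taken from any real subject heading list.

```python
# Hypothetical authority table: every entry term (variant spelling, synonym,
# scientific or popular name) points to exactly one authorized heading.
ENTRY_TERMS = {
    "automobile": "Automobiles",
    "automobiles": "Automobiles",
    "car": "Automobiles",
    "cars": "Automobiles",
    "cockroaches": "Cockroaches",
    "periplaneta americana": "Cockroaches",
    "color": "Color",
    "colour": "Color",
}

# Hypothetical homograph table: an ambiguous word leads to several
# qualified headings, one per concept.
HOMOGRAPHS = {
    "pool": ["Pool (Game)", "Pool (Swimming)"],
}


def authorized_headings(term: str) -> list[str]:
    """Return the authorized heading(s) for a searcher's term.

    One heading: the term is unambiguous. Several headings: the searcher
    must choose a qualified heading. Empty list: the term is unknown.
    """
    key = " ".join(term.lower().split())  # normalize case and whitespace
    if key in HOMOGRAPHS:
        return list(HOMOGRAPHS[key])
    if key in ENTRY_TERMS:
        return [ENTRY_TERMS[key]]
    return []
```

With these tables, "cars" and "automobile" both resolve to the single heading "Automobiles", while "pool" returns two qualified headings so that each heading still refers to only one concept.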
There are two main kinds of controlled vocabulary tools used in libraries: subject headings and thesauri. While the differences between the two are diminishing, there are still some minor differences.
Historically, subject headings were designed by catalogers to describe books in library catalogs, while thesauri were used by indexers to apply index terms to documents and articles. Subject headings tend to be broader in scope, describing whole books, while thesauri tend to be more specialized, covering very specific disciplines. Also, because of the card catalog system, subject headings tend to have terms that are in indirect order (though with the rise of automated systems this is being removed), while thesaurus terms are always in direct order. Subject headings also tend to use more pre-coordination of terms, such that the designer of the controlled vocabulary will combine various concepts together to form one authorized subject heading (e.g., children and terrorism), while thesauri tend to use singular direct terms. Lastly, thesauri list not only equivalent terms but also narrower, broader and related terms among various authorized and non-authorized terms, while historically most subject headings did not.
For example, the Library of Congress Subject Headings themselves did not have much syndetic structure until 1943, and it was not until 1985 that they began to adopt the thesaurus-type terms "Broader term" and "Narrower term".
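The syndetic structure of a thesaurus (broader, narrower and related terms) can be modelled as a small graph. The following is a minimal sketch under stated assumptions: the class names `Term` and `Thesaurus`, their methods, and the sample terms are hypothetical and do not reproduce the data model of any particular thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One authorized term with its syndetic links."""
    label: str
    broader: set[str] = field(default_factory=set)   # "Broader term" links
    narrower: set[str] = field(default_factory=set)  # "Narrower term" links
    related: set[str] = field(default_factory=set)   # "Related term" links


class Thesaurus:
    def __init__(self) -> None:
        self.terms: dict[str, Term] = {}

    def term(self, label: str) -> Term:
        # Create the term on first use so links can be added in any order.
        return self.terms.setdefault(label, Term(label))

    def add_broader(self, narrow: str, broad: str) -> None:
        # Broader and narrower links are reciprocal, so record both sides.
        self.term(narrow).broader.add(broad)
        self.term(broad).narrower.add(narrow)

    def add_related(self, a: str, b: str) -> None:
        # Related-term links are symmetric.
        self.term(a).related.add(b)
        self.term(b).related.add(a)

    def all_narrower(self, label: str) -> set[str]:
        """Collect every term below `label`, following narrower links."""
        found: set[str] = set()
        stack = [label]
        while stack:
            for child in self.term(stack.pop()).narrower:
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found
```

A search system built on such a structure could expand a query on a broad term to all of its narrower terms, which is one way the hierarchy helps retrieval.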
The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in the subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not occur within the document's text. Well known subject heading systems include the Library of Congress system, MeSH, and Sears. Well known thesauri include the Art and Architecture Thesaurus and the ERIC Thesaurus.
Choosing authorized terms is a tricky business. Besides the areas already considered above, the designer has to consider the specificity of the term chosen, whether to use direct entry, inter consistency and stability of the language. Lastly, the amount of pre-coordination (in which case the degree of enumeration versus synthesis becomes an issue) and post-coordination in the system is another important issue.
Controlled vocabulary elements (terms/phrases) employed as tags, to aid in the content identification process of documents or other information system entities (e.g. DBMS, Web Services), qualify as metadata.
Indexing languages
There are three main types of indexing languages.
- Controlled indexing language - Only approved terms can be used by the indexer to describe the document
- Natural language indexing language - Any term from the document in question can be used to describe the document.
- Free indexing language - Any term (not only from the document) can be used to describe the document.
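The difference between the three types comes down to which terms an indexer is permitted to assign. A rough sketch follows; the names `IndexingLanguage`, `words` and `is_allowed` are hypothetical, and "term from the document" is simplified here to mean that every word of the term occurs somewhere in the document text.

```python
import re
from enum import Enum


class IndexingLanguage(Enum):
    CONTROLLED = "controlled"  # only approved terms
    NATURAL = "natural"        # only terms taken from the document
    FREE = "free"              # any term at all


def words(text: str) -> set[str]:
    """Lowercased alphanumeric words of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_allowed(term: str, language: IndexingLanguage,
               document_text: str, approved_terms: set[str]) -> bool:
    """Decide whether an indexer may assign `term` under `language`."""
    if language is IndexingLanguage.CONTROLLED:
        return term.lower() in {a.lower() for a in approved_terms}
    if language is IndexingLanguage.NATURAL:
        term_words = words(term)
        # Simplification: every word of the term must occur in the document.
        return bool(term_words) and term_words <= words(document_text)
    return True  # FREE: no restriction
```

For a document that only ever says "soccer", a controlled language with the approved term "Association football" rejects "soccer", a natural language accepts "soccer" but rejects "Association football", and a free language accepts both.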
When indexing a document, the indexer also has to choose the level of indexing exhaustivity, that is, the level of detail in which the document is described. For example, with low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general, the higher the indexing exhaustivity, the more terms are indexed for each document.
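One way to picture exhaustivity is as a cut-off on how prominent a topic must be before it earns an index term. The sketch below assumes, purely for illustration, that each candidate topic carries a weight between 0 and 1 for its share of the document; the function `select_index_terms` and the weights are invented.

```python
def select_index_terms(topic_weights: dict[str, float],
                       min_weight: float) -> list[str]:
    """Keep topics whose weight reaches the cut-off.

    A high `min_weight` models low exhaustivity (only the main subject is
    indexed); a low `min_weight` models high exhaustivity.
    Results are ordered from most to least prominent topic.
    """
    kept = [t for t, w in topic_weights.items() if w >= min_weight]
    return sorted(kept, key=lambda t: (-topic_weights[t], t))


# Hypothetical document: mostly about stadium design, football is secondary.
weights = {"Stadium architecture": 0.7, "Football": 0.2, "Urban planning": 0.1}
low_exhaustivity = select_index_terms(weights, min_weight=0.5)
high_exhaustivity = select_index_terms(weights, min_weight=0.1)
```

Here `low_exhaustivity` contains only "Stadium architecture", while `high_exhaustivity` contains all three topics.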
In recent years, free text search as a means of access to documents has become popular. This involves using natural language indexing with indexing exhaustivity set to maximum (every word in the text is indexed). Many studies have been done to compare the efficiency and effectiveness of free text searches against documents that have been indexed by experts using a few well chosen controlled vocabulary descriptors.
Controlled vocabularies are often claimed to improve the accuracy of free text searching, for instance by reducing irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word football for example. Football is the name given to a number of different team sports. Worldwide the most popular of these team sports is Association football, which also happens to be called soccer in several countries. The English language word football is also applied to Rugby football (Rugby union and rugby league), American football, Australian rules football, Gaelic football, and Canadian football. A search for football therefore will retrieve documents that are about several completely different sports. Controlled vocabulary solves this problem by tagging the documents in such a way that the ambiguities are eliminated.
Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic).
In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once the correct authorized term is searched, you don't need to worry about searching for other terms that might be synonyms of that term.
However, a controlled vocabulary search may also lead to unsatisfactory recall, in that it will fail to retrieve some documents that are actually relevant to the search question.
This is particularly problematic when the search question involves terms that are sufficiently tangential to the subject area that the indexer might have decided to tag the document using a different term (one that the searcher might nonetheless consider equivalent). Essentially, this can be avoided only by an experienced user of controlled vocabulary whose understanding of the vocabulary coincides with the way it is used by the indexer.
Another possibility is that the article is just not tagged by the indexer because indexing exhaustivity is low. For example an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant and hence recall fails. A free text search would automatically pick up that article regardless.
On the other hand, free text searches have high exhaustivity (you search on every word), so they have the potential for high recall (assuming you solve the problem of synonyms by entering every combination) but will have much lower precision.
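The trade-off between precision and recall can be made concrete with a toy collection. Everything in this sketch is invented for illustration: the five documents, their tags, the set of documents judged relevant, and the helper names. It assumes a searcher interested in association football.

```python
# Hypothetical collection: document id -> (text, assigned controlled headings).
DOCS = {
    "d1": ("football world cup final", {"Association football"}),
    "d2": ("american football super bowl", {"American football"}),
    "d3": ("soccer league transfer news", {"Association football"}),
    # Football is only a secondary focus here, so the indexer did not tag it.
    "d4": ("stadium design with a short note on the football pitch",
           {"Stadium architecture"}),
    "d5": ("rugby football scrum laws", {"Rugby football"}),
}

# Documents the searcher would actually judge relevant (an assumption).
RELEVANT = {"d1", "d3", "d4"}


def free_text_search(word: str) -> set[str]:
    """Retrieve every document whose text contains the word."""
    return {doc_id for doc_id, (text, _) in DOCS.items() if word in text.split()}


def controlled_search(heading: str) -> set[str]:
    """Retrieve every document tagged with the authorized heading."""
    return {doc_id for doc_id, (_, tags) in DOCS.items() if heading in tags}


def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


free_p, free_r = precision_recall(free_text_search("football"), RELEVANT)
ctrl_p, ctrl_r = precision_recall(controlled_search("Association football"), RELEVANT)
```

In this toy collection the free text search for "football" has a precision of 0.5, because it also retrieves the American football and rugby documents, and it misses the "soccer" document. The controlled search has a precision of 1.0 but misses the document in which football was only a secondary focus. Both reach a recall of 2/3, each for a different reason described above.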
Controlled vocabularies also become outdated quickly, and in fast-developing fields of knowledge the terms a searcher needs might not be available if the vocabulary is not updated regularly. Even in the best case scenario, controlled language is often not as specific as the words of the text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, while a free text search is in no danger of doing so, because it uses the author's own words.
The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with the controlled vocabulary scheme to make best use of the system. But as already mentioned, the control of synonyms and homographs can help increase precision.
Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be described in multiple ways.
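As a small illustration of describing one record in multiple ways, the sketch below assigns each record a value under several independent facets and then filters on any combination of them. The facet names (`sport`, `region`, `form`), the records and the function `filter_by_facets` are hypothetical.

```python
# Hypothetical records, each described under three independent facets.
RECORDS = {
    "rec1": {"sport": "Association football", "region": "Europe",
             "form": "Match report"},
    "rec2": {"sport": "Rugby union", "region": "Europe",
             "form": "Rule book"},
    "rec3": {"sport": "Association football", "region": "South America",
             "form": "Match report"},
}


def filter_by_facets(records: dict[str, dict[str, str]],
                     **wanted: str) -> list[str]:
    """Return ids of records matching every requested facet value."""
    return sorted(
        rec_id
        for rec_id, facets in records.items()
        if all(facets.get(name) == value for name, value in wanted.items())
    )


by_sport = filter_by_facets(RECORDS, sport="Association football")
by_region_and_form = filter_by_facets(RECORDS, region="Europe", form="Match report")
```

The same record ("rec1") is reachable through its sport, its region or its form, rather than through a single predetermined place in one hierarchy.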
Applications
Controlled vocabularies, such as the Library of Congress Subject Headings, are an essential component of bibliography, the study and classification of books. They were initially developed in library and information science. In the 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the U.S. National Library of Medicine. Subsequently, for-profit firms (called abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed based on dialup X.25 networking. These services were seldom made available to the public because they were difficult to use; specialist librarians called search intermediaries handled the searching job. In the 1980s, the first full text databases appeared; these databases contain the full text of the indexed articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services without charge; some of these services may be accessible without charge at a public library.
In large organizations, controlled vocabularies may be introduced to improve technical communication. The use of controlled vocabulary ensures that everyone is using the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management, where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to the same thing.
Web searching could be dramatically improved by the development of a controlled vocabulary for describing Web pages; the use of such a vocabulary could culminate in a Semantic Web, in which the content of Web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative. An example of a controlled vocabulary which is usable for indexing web pages is PSH.
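As a rough sketch of what machine-readable description of a Web page can look like, the function below renders a few Dublin Core elements as HTML `meta` tags using the common `DC.` name prefix. The function name `dublin_core_meta` and the sample record are hypothetical, and the output is a simplified illustration rather than a complete or validated Dublin Core encoding.

```python
from html import escape


def dublin_core_meta(record: dict[str, list[str]]) -> str:
    """Render element -> values as one <meta> tag per value."""
    lines = []
    for element, values in record.items():
        for value in values:
            # Escape the value so quotes and angle brackets stay valid HTML.
            content = escape(value, quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Hypothetical page description; the subject value would ideally come from
# a controlled vocabulary rather than from free keywords.
page = {
    "title": ["Cup final report"],
    "subject": ["Association football"],
}
meta_tags = dublin_core_meta(page)
```

Filling the `subject` element from a controlled vocabulary, rather than from arbitrary keywords, is what would let a search engine treat two pages tagged "Association football" as being about the same concept.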
It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire Web. To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page's contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML is designed on faceted classification principles.
See also
- Authority control
- Controlled natural language
- Faceted classification
- Full text search
- Information retrieval
- IMS Vocabulary Definition Exchange
- Language for specific purposes
- Metadata
- Metadata registry
- Nomenclature
- Ontology (computer science)
- Semantic spectrum
- Subject indexing
- Terminology
- Technical terminology
- Text retrieval
- Thesaurus
- Universal Data Element Framework
- Vocabulary-based transformation
External links