Question answering - AbsoluteAstronomy.com

In information retrieval and natural language processing (NLP), question answering (QA) is the task of automatically answering a question posed in natural language. To find the answer to a question, a QA computer program may use either a pre-structured database or a collection of natural language documents (a text corpus such as the World Wide Web or some local collection).

QA research attempts to deal with a wide range of question types including: fact, list, definition, How, Why, hypothetical, semantically constrained, and cross-lingual questions. Search collections vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web.

Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Alternatively, closed-domain might refer to a situation where only limited types of questions are accepted, such as questions asking for descriptive rather than procedural information.

Open-domain question answering deals with questions about nearly anything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.

QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval; natural language search engines are therefore sometimes regarded as the next step beyond current search engines.

Architecture

The first QA systems were developed in the 1960s and were essentially natural-language interfaces to expert systems tailored to specific domains. In contrast, current QA systems use text documents as their underlying knowledge source and combine various natural language processing techniques to search for the answers.

Current QA systems typically include a question classifier module that determines the type of question and the type of answer. After the question is analysed, the system typically uses several modules that apply increasingly complex NLP techniques to a gradually reduced amount of text. Thus, a document retrieval module uses search engines to identify the documents or paragraphs in the document set that are likely to contain the answer. Subsequently, a filter preselects small text fragments that contain strings of the same type as the expected answer. For example, if the question is "Who invented penicillin?", the filter returns text that contains names of people. Finally, an answer extraction module looks for further clues in the text to determine whether the answer candidate can indeed answer the question.
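
The staged architecture described above (question classification, document retrieval, answer-type filtering, answer extraction) can be illustrated with a deliberately small Python sketch. Everything in it is an assumption made for illustration: the three-sentence document list, the stopword set, the regular expressions that stand in for a named-entity recogniser, and the keyword-overlap score. It is not how any particular QA system is implemented.

```python
import re

# Hypothetical toy collection; a real system would call a search engine here.
DOCUMENTS = [
    "Penicillin was discovered by Alexander Fleming in 1928; he is often "
    "said to have invented the first antibiotic.",
    "Penicillin is an antibiotic used against bacterial infections.",
    "The telephone was invented by Alexander Graham Bell.",
]

STOPWORDS = {"who", "what", "when", "where", "the", "a", "an", "is", "was", "of"}

# Crude stand-ins for a named-entity recogniser (illustrative only).
PATTERNS = {
    "PERSON": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"),
    "DATE": re.compile(r"\b\d{4}\b"),
}


def tokens(text):
    """Lower-cased set of word tokens in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify_question(question):
    """Question classifier: map the question word to an expected answer type."""
    first_word = question.lower().split()[0]
    return {"who": "PERSON", "when": "DATE"}.get(first_word, "OTHER")


def keywords(question):
    """Content words of the question, used for retrieval and scoring."""
    return tokens(question) - STOPWORDS


def retrieve(question, documents):
    """Document retrieval: keep documents that share a keyword with the question."""
    keys = keywords(question)
    return [doc for doc in documents if keys & tokens(doc)]


def answer(question, documents=DOCUMENTS):
    """Filter candidates by answer type, then pick the one with the best context."""
    pattern = PATTERNS.get(classify_question(question))
    if pattern is None:
        return None  # this sketch only handles PERSON and DATE questions
    keys = keywords(question)
    best, best_score = None, 0
    for doc in retrieve(question, documents):
        score = len(keys & tokens(doc))  # keyword overlap as a rough "clue" count
        for candidate in pattern.findall(doc):
            if score > best_score:
                best, best_score = candidate, score
    return best


print(answer("Who invented penicillin?"))         # Alexander Fleming
print(answer("When was penicillin discovered?"))  # 1928
```

Each stage narrows the text handed to the next one, which is the point of the design: the cheap keyword match runs over every document, while the type filter and the scoring step only see what survived retrieval.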

Question answering methods

QA depends heavily on a good search corpus: without documents containing the answer, there is little any QA system can do. It thus makes sense that larger collection sizes generally lend themselves to better QA performance, unless the question domain is orthogonal to the collection. The notion of data redundancy in massive collections, such as the web, means that nuggets of information are likely to be phrased in many different ways in differing contexts and documents, leading to two benefits:

(1) By having the right information appear in many forms, the burden on the QA system to perform complex NLP techniques to understand the text is lessened.

(2) Correct answers can be filtered from false positives by relying on the correct answer to appear more times in the documents than instances of incorrect ones.
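
Benefit (2) amounts to a frequency vote over candidate answers. The following Python sketch shows the idea under simple assumptions: the candidate strings are made up, and normalisation is limited to trimming whitespace and lower-casing, where a real system would need far more careful matching of answer variants.

```python
from collections import Counter


def vote(candidates):
    """Return the most frequent candidate and its share of all candidates."""
    if not candidates:
        raise ValueError("no candidate answers to vote on")
    # Normalise lightly so trivially different spellings count as one answer.
    counts = Counter(c.strip().lower() for c in candidates)
    best, hits = counts.most_common(1)[0]
    return best, hits / len(candidates)


# Hypothetical candidates extracted from five different documents.
candidates = [
    "Alexander Fleming",
    "alexander fleming",
    "Howard Florey",
    "Alexander Fleming ",
    "Louis Pasteur",
]
print(vote(candidates))  # ('alexander fleming', 0.6)
```

The share returned alongside the answer can serve as a rough confidence value: a low share suggests that the collection offers little redundancy for this question.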

Shallow

Some methods of QA use keyword-based techniques to locate interesting passages and sentences from the retrieved documents and then filter based on the presence of the desired answer type within that candidate text. Ranking is then done based on syntactic features such as word order or location and similarity to the query.

When using massive collections with good data redundancy, some systems use templates to find the final answer in the hope that the answer is just a reformulation of the question. If you posed the question "What is a dog?", the system would detect the substring "What is a X" and look for documents which start with "X is a Y". This often works well on simple "factoid" questions seeking factual tidbits of information such as names, dates, locations, and quantities.

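The template approach for "What is a X" questions can be sketched in Python with two regular expressions. This is only an illustration: the regular expressions, the function names and the sample documents are assumptions, and real systems use many more templates than this single one.

```python
import re

# Template for the question side: "What is a X?" (also accepts "an").
QUESTION_TEMPLATE = re.compile(r"^what is an? (?P<x>[\w ]+?)\?*$", re.IGNORECASE)


def answer_pattern(question):
    """Turn "What is a X?" into a pattern for text that starts with "X is a Y"."""
    match = QUESTION_TEMPLATE.match(question.strip())
    if match is None:
        return None  # the question does not fit this template
    x = re.escape(match.group("x"))
    return re.compile(rf"^(?:an? )?{x} is an? (?P<y>[^.]+)\.", re.IGNORECASE)


def find_definitions(question, documents):
    """Collect Y from every document that starts with "X is a Y"."""
    pattern = answer_pattern(question)
    if pattern is None:
        return []
    found = []
    for doc in documents:
        match = pattern.match(doc.strip())
        if match:
            found.append(match.group("y"))
    return found


# Hypothetical documents for illustration.
docs = [
    "A dog is a domesticated descendant of the wolf. Dogs bark.",
    "My dog is a bit tired today.",
    "Cats are independent.",
]
print(find_definitions("What is a dog?", docs))
# ['domesticated descendant of the wolf']
```

Anchoring the pattern at the start of the document follows the description above and keeps out sentences such as the second one, at the cost of missing definitions that appear later in a text.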
Deep

Sometimes question reformulation or keyword techniques do not suffice. Then syntactic, semantic and contextual processing is usually performed to extract or construct the answer. Such processing includes named-entity recognition, relation detection, coreference resolution, syntactic alternations, word sense disambiguation, and so on.

More difficult queries, such as Why or How questions, hypothetical postulations, spatially or temporally constrained questions, dialog queries, and badly worded or ambiguous questions, usually all need the deeper types of question analysis mentioned above. Likewise, complex or ambiguous document passages also need more sophisticated NLP techniques.

Statistical question answering is growing in popularity in the research community. Many of the lower-level NLP tools, such as part-of-speech tagging, parsing, named-entity detection, sentence boundary detection, and document retrieval, are already available as probabilistic applications.

Answer Questioning (AQ) provides suggestions for metadata or linked data, or simply finds more indexable questions that would lead to that single answer. AQ starts with an answer and proceeds to formulate various questions about that answer. For example, the answer "I like sushi." can be turned into: Q: "Why do I like sushi?" A: "The flavor." A useful response would be to add some metadata that describes the flavor of sushi. The next step is to go beyond that with: Q: "What aspect of the flavor of sushi do I like?" A useful response to this would be to link data about that flavor to the metadata for the sushi. This leads to an emerging ontology, purely for answering questions.

Issues

In 2002 a group of researchers wrote a roadmap of research in question answering. The following issues were identified.

Question classes : Different types of questions (e.g., "What is the capital of Liechtenstein?" vs. "Why does a rainbow form?" vs. "Did Marilyn Monroe and Cary Grant ever appear in a movie together?") require the use of different strategies to find the answer. Question classes are arranged hierarchically in taxonomies.

Question processing : The same information request can be expressed in various ways, some interrogative ("Who is the president of the United States?") and some assertive ("Tell me the name of the president of the United States."). A semantic model of question understanding and processing would recognize equivalent questions, regardless of how they are presented. This model would enable the translation of a complex question into a series of simpler questions, would identify ambiguities and treat them in context or by interactive clarification.

Context and QA : Questions are usually asked within a context and answers are provided within that specific context. The context can be used to clarify a question, resolve ambiguities or keep track of an investigation performed through a series of questions. (For example, the question, "Why did Joe Biden visit Iraq in January 2010?" might be asking why Vice President Biden visited and not President Obama, why he went to Iraq and not Afghanistan or some other country, why he went in January 2010 and not before or after, or what Biden was hoping to accomplish with his visit. If the question is one of a series of related questions, the previous questions and their answers might shed light on the questioner's intent.)

Data sources for QA : Before a question can be answered, it must be known what knowledge sources are available and relevant. If the answer to a question is not present in the data sources, a correct result will not be obtained, no matter how well the question processing, information retrieval and answer extraction are performed.

Answer extraction : Answer extraction depends on the complexity of the question, on the answer type provided by question processing, on the actual data where the answer is searched, on the search method and on the question focus and context.

Answer formulation : The result of a QA system should be presented in a way as natural as possible. In some cases, simple extraction is sufficient. For example, when the question classification indicates that the answer type is a name (of a person, organization, shop or disease, etc.), a quantity (monetary value, length, size, distance, etc.) or a date (e.g. the answer to the question, "On what day did Christmas fall in 1989?") the extraction of a single datum is sufficient. For other cases, the presentation of the answer may require the use of fusion techniques that combine the partial answers from multiple documents.

Real time question answering : There is a need to develop Q&A systems that are capable of extracting answers from large data sets in several seconds, regardless of the complexity of the question, the size and multitude of the data sources or the ambiguity of the question.

Multilingual (or cross-lingual) question answering : The ability to answer a question posed in one language using an answer corpus in another language (or even several). This allows users to consult information that they cannot use directly. (See also Machine translation.)

Interactive QA : It is often the case that the information need is not well captured by a QA system, as the question processing part may fail to classify the question properly, or the information needed for extracting and generating the answer is not easily retrieved. In such cases, the questioner might want not only to reformulate the question, but to have a dialogue with the system. (For example, the system might ask for a clarification of the sense in which a word is being used, or what type of information is being asked for.)

Advanced reasoning for QA : More sophisticated questioners expect answers that are outside the scope of written texts or structured databases. To upgrade a QA system with such capabilities, it would be necessary to integrate reasoning components operating on a variety of knowledge bases, encoding world knowledge and common-sense reasoning mechanisms, as well as knowledge specific to a variety of domains.

User profiling for QA : The user profile captures data about the questioner, comprising context data, domain of interest, reasoning schemes frequently used by the questioner, common ground established within different dialogues between the system and the user, and so forth. The profile may be represented as a predefined template, where each template slot represents a different profile feature. Profile templates may be nested one within another.
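
The nested profile template mentioned in the last item can be pictured as a small data structure. The Python sketch below is only one possible reading: the class names, the slot names and the sample values are hypothetical, chosen to mirror the features listed above (context data, domain of interest, reasoning schemes, common ground per dialogue).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DialogueProfile:
    """Nested template: common ground established within one dialogue."""
    topic: str
    common_ground: List[str] = field(default_factory=list)


@dataclass
class UserProfile:
    """Top-level template; each field is one slot for a profile feature."""
    domain_of_interest: str
    context: Dict[str, str] = field(default_factory=dict)
    reasoning_schemes: List[str] = field(default_factory=list)
    dialogues: List[DialogueProfile] = field(default_factory=list)


# Hypothetical values for illustration.
profile = UserProfile(domain_of_interest="medicine")
profile.context["language"] = "en"
profile.dialogues.append(
    DialogueProfile("antibiotics", ["penicillin is an antibiotic"])
)
print(profile.dialogues[0].common_ground)  # ['penicillin is an antibiotic']
```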

History

Some of the early AI systems were question answering systems. Two of the most famous QA systems of that time are BASEBALL and LUNAR, both of which were developed in the 1960s. BASEBALL answered questions about the US baseball league over a period of one year. LUNAR, in turn, answered questions about the geological analysis of rocks returned by the Apollo moon missions. Both QA systems were very effective in their chosen domains. In fact, LUNAR was demonstrated at a lunar science convention in 1971 and it was able to answer 90% of the questions in its domain posed by people untrained on the system. Further restricted-domain QA systems were developed in the following years. The common feature of all these systems is that they had a core database or knowledge system that was hand-written by experts of the chosen domain.

Some of the early AI systems included question-answering abilities. Two of the most famous early systems are SHRDLU and ELIZA. SHRDLU simulated the operation of a robot in a toy world (the "blocks world"), and it offered the possibility to ask the robot questions about the state of the world. Again, the strength of this system was the choice of a very specific domain and a very simple world with rules of physics that were easy to encode in a computer program. ELIZA, in contrast, simulated a conversation with a psychologist. ELIZA was able to converse on any topic by resorting to very simple rules that detected important words in the person's input. It had a very rudimentary way to answer questions, and on its own it led to a series of chatterbots such as the ones that participate in the annual Loebner prize.

The 1970s and 1980s saw the development of comprehensive theories in computational linguistics, which led to the development of ambitious projects in text comprehension and question answering. One example of such a system was the Unix Consultant (UC), a system that answered questions pertaining to the Unix operating system. The system had a comprehensive hand-crafted knowledge base of its domain, and it aimed at phrasing the answer to accommodate various types of users. Another project was LILOG, a text-understanding system that operated on the domain of tourism information in a German city. The systems developed in the UC and LILOG projects never went past the stage of simple demonstrations, but they helped the development of theories on computational linguistics and reasoning.

An increasing number of systems include the World Wide Web as one more corpus of text. However, these tools mostly work by using shallow methods, as described above, thus returning a list of documents, usually with an excerpt containing the probable answer highlighted, plus some context. Furthermore, highly specialized natural language question-answering engines, such as EAGLi for health and life scientists, have been made available.

The Future of Question Answering

QA systems have been extended in recent years to explore critical new scientific and practical dimensions. For example, systems have been developed to automatically answer temporal and geospatial questions, definitional questions, biographical questions, multilingual questions, and questions from multimedia (e.g., audio, imagery, video). Additional aspects such as interactivity (often required for clarification of questions or answers), answer reuse, and knowledge representation and reasoning to support question answering have been explored. Future research may explore what kinds of questions can be asked and answered about social media, including sentiment analysis.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.