Text analytics
The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. The term is roughly synonymous with text mining; indeed, Prof. Ronen Feldman modified a 2000 description of "text mining" in 2004 to describe "text analytics." The latter term is now used more frequently in business settings while "text mining" is used in some of the earliest application areas, dating to the 1980s, notably life-sciences research and government intelligence.

Text analytics involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis via application of natural language processing (NLP) and analytical methods.
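
As an illustration of lexical analysis used to study word frequency distributions, the following minimal Python sketch tokenizes a small set of documents and counts token occurrences. The tokenizer rule, the function names, and the sample sentences are assumptions made for this example; production systems typically use far more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple lexer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_frequencies(documents):
    """Count how often each token occurs across a collection of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

# Hypothetical sample documents, used only to exercise the functions.
docs = [
    "Text analytics turns text into data.",
    "Text mining is roughly synonymous with text analytics.",
]
freq = word_frequencies(docs)

# The two most frequent tokens and their counts.
print(freq.most_common(2))
```

The resulting counts are the kind of "data" that later statistical or data mining steps can work with.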

The term also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text. These techniques and processes discover and present knowledge (facts, business rules, and relationships) that is otherwise locked in textual form, impenetrable to automated processing.

A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.
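
The second variant, populating a search index from scanned documents, can be sketched as a small inverted index in Python. This is only an illustration under stated assumptions: the document identifiers, the sample texts, and the helper names `build_index` and `search` are hypothetical, and the "extraction" here is plain tokenization rather than full information extraction.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical document set keyed by made-up identifiers.
documents = {
    "memo-1": "Quarterly sales revenue rose in the northern region.",
    "memo-2": "Customer complaints about delivery rose last quarter.",
}
index = build_index(documents)

print(sorted(search(index, "rose")))
print(sorted(search(index, "sales revenue")))
```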

History

The challenge of exploiting the large proportion of enterprise information that originates in "unstructured" form has been recognized for decades. It is recognized in the earliest definition of business intelligence (BI), in an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence System, which describes a system that will:


"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points."


Yet as management information systems developed starting in the 1960s, and as BI emerged in the '80s and '90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in "unstructured" documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof. Marti A. Hearst in the paper Untangling Text Data Mining:

For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.


Hearst's 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later.

Text Analysis Processes

Subtasks — components of a larger text-analytics effort — typically include:
  • Information Retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content management system, for analysis.
  • Named Entity Recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on. Disambiguation (the use of contextual clues) may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star (Glenn or Harrison?), or some other entity.
  • Recognition of Pattern Identified Entities: Features such as telephone numbers, e-mail addresses, quantities (with units) can be discerned via regular expression or other pattern matches.
  • Coreference: identification of noun phrases and other terms that refer to the same object.
  • Relationship, Fact, and Event Extraction: identification of associations among entities and other information in text
  • Sentiment Analysis involves discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.
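
Two of the subtasks listed above, gazetteer-based entity lookup and recognition of pattern-identified entities, can be sketched together in Python. The regular expressions, the unit list, the gazetteer entries, and the sample sentence are all assumptions chosen for illustration; real formats vary far more widely, and this sketch performs no disambiguation.

```python
import re

# Illustrative patterns only; they cover a narrow range of formats.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "quantity": re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|km|mg|ml)\b"),
}

# Hypothetical gazetteer mapping surface strings to entity types.
GAZETTEER = {"IBM": "organization", "Sheffield": "place"}

def extract_entities(text):
    """Return (label, matched text) pairs found by pattern match or gazetteer lookup."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    for name, label in GAZETTEER.items():
        # Whole-word lookup; an ambiguous name would need contextual clues.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((label, name))
    return found

sample = "Contact jane.doe@example.com or 555-123-4567 about the 12.5 kg shipment to IBM."
entities = extract_entities(sample)
for label, value in entities:
    print(label, value)
```

A statistical recognizer would replace the gazetteer lookup when the set of names cannot be enumerated in advance.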

Applications

The technology is now broadly applied for a wide variety of government, research, and business needs. Applications can be sorted into a number of categories by analysis type or by business function. Using this approach to classifying solutions, application categories include:
  • Enterprise Business Intelligence/Data Mining, Competitive Intelligence
  • E-Discovery, Records Management
  • National Security/Intelligence
  • Scientific Discovery, especially Life Sciences
  • Sentiment Analysis Tools, Listening Platforms
  • Natural Language/Semantic Toolkit or Service
  • Publishing
  • Search/Information Access

Software

There are many text analytics research, commercial, and open source software options. Some are comprehensive solutions; others handle particular subtasks.

Commercial Software

  • AeroText - provides a suite of text mining applications for content analysis. Content used can be in multiple languages.
  • Attensity - hosted, integrated and stand-alone text analytics software that uses natural language processing technology to address collective intelligence in social media and forums; the voice of the customer in surveys and emails; customer relationship management; e-services; research and e-discovery; risk and compliance; and intelligence analysis.
  • Clarabridge - provides SaaS, hosted and on-premise text and sentiment analytics that enable companies to collect, listen to, analyze, and act on the Voice of the Customer (VOC) from both external (Twitter, Facebook, Yelp!, product forums, etc.) and internal sources (call center notes, CRM, Enterprise Data Warehouse, BI, surveys, emails, etc.).
  • General Sentiment - technology company that produces comprehensive research products to help marketing, sales and communications executives evaluate their brand performance in the media
  • IBM LanguageWare - the IBM suite for text analytics (tools and runtime).
  • IBM SPSS - provider of PASW Text Analytics for Surveys and PASW Text Analytics, advanced NLP-based text analysis software (multi-lingual sentiment, event and fact extraction) that can be used in conjunction with SPSS Predictive Analysis Solutions.
  • Language Computer Corporation - provides a suite of customizable text extraction and analysis tools including natural language search, available in multiple languages.
  • Lexalytics - provides commercial sentiment analysis for many OEM and direct customers, including analysis of financial news feeds for the Thomson Reuters RMDS trading information system.
  • MeshLabs - MeshLabs develops text analytics solutions that discover information from unstructured data and deliver highly relevant personalized knowledge and actionable insights from any given content source, channel, and type.
  • SAS - a leading business intelligence and business analytics provider, SAS provides text analysis capabilities with the Enterprise Miner data-mining workbench and via Teragram linguistic-analysis tools.
  • StatSoft - provides a Text Miner extension to the STATISTICA Data Miner product. STATISTICA Text Miner features text retrieval, pre-processing, and analytic procedures for unstructured text data, with options to convert text into numeric information for mapping, clustering, and predictive data mining.
  • Sysomos - provider of a social media analytics software platform, including text analytics and sentiment analysis of online consumer conversations.

Open-Source Software

  • GATE - General Architecture for Text Engineering, an open-source toolbox for natural language processing
  • Text Engineering Software Laboratory (Tesla) - A component framework for experiments in natural language processing
  • Apache UIMA - Unstructured Information Management Architecture
  • Natural Language Toolkit - open-source Python modules, linguistic data and documentation for text analytics
  • RapidMiner - open-source software for data and text mining

See also

  • Noisy text analytics
  • Information extraction
  • Computational linguistics
  • Natural language processing
  • Named entity recognition
  • Identity resolution
  • Text mining
  • News analytics

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 