Document classification

Document classification (or categorization) is a problem in information science. The task is to assign an electronic document to one or more categories based on its contents. Document classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification, where the classification must be done entirely without reference to external information. There is also semi-supervised document classification, where only some of the documents are labeled by the external mechanism.

Techniques


Document classification techniques include:
  • naive Bayes classifier
  • tf-idf
  • latent semantic indexing
  • support vector machines
  • artificial neural network
  • kNN (k-nearest neighbor algorithm)
  • decision trees, such as ID3
  • concept mining

and approaches based on natural language processing.
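
As a rough illustration of how two of the listed techniques can be combined, the sketch below weights terms with tf-idf and then classifies a new document by a k-nearest-neighbor vote over cosine similarity. It is a minimal sketch under stated assumptions, not a reference implementation: the function names, the whitespace tokenizer, the smoothed idf formula (one of several common variants), and the six training sentences are all invented for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def fit_idf(token_lists):
    """Compute a smoothed inverse document frequency for every training term."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector as a dict; terms unseen in training are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(text, train_vectors, train_labels, idf, k=3):
    """Label a document by majority vote among its k most similar neighbors."""
    query = tfidf_vector(tokenize(text), idf)
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training documents (supervised setting).
TRAIN = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the team defence", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("shares fell as the bank reported losses", "finance"),
    ("investors expect interest rates to rise", "finance"),
]

train_tokens = [tokenize(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
idf = fit_idf(train_tokens)
train_vectors = [tfidf_vector(tokens, idf) for tokens in train_tokens]

print(knn_classify("the bank cut interest rates", train_vectors, train_labels, idf))
print(knn_classify("the team scored a goal", train_vectors, train_labels, idf))
```

Storing vectors as dictionaries keeps the sketch short; a practical system would more likely use sparse matrices and a tokenizer that handles punctuation and stop words.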

Applications


Classification techniques have been applied to spam filtering, a process which tries to distinguish e-mail spam messages from legitimate e-mails.
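
To make the spam-filtering application concrete, here is a small sketch of a multinomial naive Bayes classifier with Laplace (add-one) smoothing. The class name, its methods, the `alpha` parameter, and the six training messages are assumptions made up for illustration; real spam filters use far larger training sets and more careful tokenization.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over whitespace tokens (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.class_doc_counts = Counter()        # number of documents per label
        self.word_counts = defaultdict(Counter)  # per-label word frequencies
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.class_doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        # Words never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocabulary]
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log(
                    (counts[token] + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training messages; "ham" denotes legitimate e-mail.
MESSAGES = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("cheap pills free offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the attached report", "ham"),
    ("lunch on monday with the team", "ham"),
]

spam_filter = NaiveBayesTextClassifier().fit(
    [text for text, _ in MESSAGES],
    [label for _, label in MESSAGES],
)

print(spam_filter.predict("claim your free prize"))
print(spam_filter.predict("monday team meeting"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that is absent from one class from forcing that class's probability to zero.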

See also


  • classification
  • compound term processing
  • supervised learning, unsupervised learning
  • document retrieval
  • information retrieval
  • string metrics
  • machine learning
  • text mining, web mining, concept mining

