Document classification (or document categorization) is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. Document classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification, where the classification must be done entirely without reference to external information. There is also semi-supervised document classification, where only some of the documents are labeled by the external mechanism.
Techniques
Document classification techniques include:
- naive Bayes classifier
- tf-idf
- latent semantic indexing
- support vector machines
- artificial neural network
- k-nearest neighbors (kNN)
- decision trees, such as ID3
- concept mining
and approaches based on natural language processing.
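As a rough illustration of how two of the listed techniques can be combined, the sketch below weights terms with tf-idf and classifies a new document by a k-nearest-neighbors vote using cosine similarity. It is a minimal sketch under stated assumptions, not a reference implementation: the whitespace tokenizer, the `log(N/df) + 1` idf variant, the function names, and the tiny training set are all made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vectors, idf): one sparse tf-idf dict per document plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed idf variant: log(N / df) + 1, so terms found everywhere keep a small weight.
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def vectorize(text, idf):
    """Build a tf-idf vector for unseen text; terms missing from the idf table are ignored."""
    tokens = tokenize(text)
    tf = Counter(t for t in tokens if t in idf)
    total = len(tokens) or 1
    return {t: (c / total) * idf[t] for t, c in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, docs, labels, k=3):
    """Label `text` by majority vote among its k most similar training documents."""
    vectors, idf = tfidf_vectors(docs)
    query = vectorize(text, idf)
    ranked = sorted(zip(vectors, labels), key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled documents, invented for illustration only.
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the goalkeeper",
    "the central bank raised interest rates",
    "the stock market fell sharply",
    "investors fear rising inflation",
]
train_labels = ["sport", "sport", "sport", "finance", "finance", "finance"]

print(knn_classify("the goalkeeper saved a goal", train_docs, train_labels))
```

Because kNN defers all computation to query time, this sketch recomputes the tf-idf vectors on every call; a practical system would presumably compute them once and reuse them.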
Applications
Classification techniques have been applied to spam filtering, a process which tries to discern e-mail spam messages from legitimate emails.
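One way such a spam filter might look is a naive Bayes classifier over word counts. The sketch below is a minimal multinomial variant with add-one (Laplace) smoothing; the class name `NaiveBayesTextClassifier`, the `alpha` parameter, the whitespace tokenization, and the six example messages are assumptions made for illustration and do not describe any particular production filter.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength (assumed default: add-one)
        self.class_doc_counts = Counter()        # label -> number of training documents
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrences
        self.vocabulary = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.class_doc_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_doc_counts.items():
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence here
                count = self.word_counts[label][word]
                score += math.log((count + self.alpha) / (total_words + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training messages, invented for illustration only.
messages = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "meeting moved to monday",
    "please review the attached report",
    "lunch on monday",
]
message_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

spam_filter = NaiveBayesTextClassifier().fit(messages, message_labels)
print(spam_filter.predict("free prize now"))
print(spam_filter.predict("monday meeting report"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied, and the smoothing term keeps a single unseen word-label pair from zeroing out a class.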
See also
- classification
- compound term processing
- supervised learning, unsupervised learning
- document retrieval
- information retrieval
- string metrics
- machine learning
- text mining, web mining, concept mining