Data mining



 
 


Data mining is the process of extracting hidden patterns from data. As more data is gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform this data into information. It is commonly used in a wide range of profiling practices, such as marketing, fraud detection and scientific discovery. Data mining can be applied to data sets of any size. However, while it can uncover hidden patterns in data that have been collected, it can neither uncover patterns that are not already present in the data nor uncover patterns in data that have not been collected.

Background

Humans have been "manually" extracting information from data for centuries, but the increasing volume of data in modern times has called for more automatic approaches. As data sets and the information extracted from them have grown in size and complexity, direct hands-on data analysis has increasingly been supplemented and augmented with indirect, automatic data processing using more complex and sophisticated tools, methods and models. The proliferation, ubiquity and increasing power of computer technology have aided data collection, processing, management and storage. However, the captured data needs to be converted into information and knowledge to become useful. Data mining is the process of using computing power to apply methodologies, including new techniques for knowledge discovery, to data.

Data mining identifies trends within data that go beyond simple data analysis. Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of processes and target opportunities. However, handing control and understanding of these processes from statisticians to poorly informed or uninformed users can result in false positives, no useful results and, worst of all, results that are misleading or misinterpreted.

Although data mining is a relatively new term, the technology is not. For many years, businesses and governments have used increasingly powerful computers to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.) Continuous innovations in computer processing power, disk storage, data capture technology, algorithms, methodologies and analysis software have dramatically increased the accuracy and usefulness of the extracted information.

The term data mining is often used to apply to the two separate processes of knowledge discovery and prediction. Knowledge discovery provides explicit information about the characteristics of the collected data, using a number of techniques (e.g., association rule mining). Forecasting and predictive modeling provide predictions of future events, and the processes may range from the transparent (e.g., rule-based approaches) through to the opaque (e.g., neural networks).

Metadata (data about the characteristics of a data set) are often expressed in a condensed data-minable format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.

Data mining is usually performed on "real-world data". Such data are vulnerable to collinearity because of unknown and possibly unobserved interrelations. An unavoidable fact of data mining is that the (sub-)set of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships that exist across other parts of the domain. Alternative methods using experiment-based approaches, such as Choice Modelling for human-generated data, may be used to address this sort of issue. In these situations, inherent correlations can be either controlled for or removed altogether during the construction of the experimental design.

There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions of these standards are under development. Independent of these standardization efforts, freely available open-source software systems like RapidMiner and Weka have become an informal standard for defining data-mining processes.

Since the availability of affordable computer processing power in the last quarter of the 20th century, organizations have been accumulating vast and ever-growing amounts of data, including, for example:
  • operational and transactional data, such as sales, cost, inventory, payroll and accounting data
  • nonoperational data, such as forecasts and macroeconomic data
  • metadata: data about the data itself, such as logical database design and data dictionary definitions


This article outlines the longitudinal changes of DMKD research activities during the last decade by surveying a large collection of Data Mining literature to provide a comprehensive picture of current DMKD research and classify these research activities into high-level categories.

The process of data mining


Knowledge Discovery in Databases (KDD) is the name coined in 1989 to describe the process of finding interesting, interpreted, useful and novel data. There are many nuances to this process, but roughly the steps are to preprocess raw data, mine the data, and interpret the results.

Pre-processing

Once the objective for the KDD process is known, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a data mart or data warehouse.

The target set is then cleaned. Cleaning removes the observations with noise and missing data.
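
As a rough illustration of the cleaning step, the sketch below drops every observation that has a missing field. The two-field record layout (age, income) and the use of None to mark a missing value are assumptions made for the example; detecting noisy observations depends on the domain and is not attempted here.

```python
from typing import Optional

# Hypothetical observations: each row is (age, income); None marks a missing value.
Observation = tuple[Optional[float], Optional[float]]


def clean(observations: list[Observation]) -> list[Observation]:
    """Keep only the observations in which every field is present."""
    return [obs for obs in observations if all(value is not None for value in obs)]


raw = [(34.0, 52000.0), (None, 48000.0), (51.0, None), (29.0, 39000.0)]
cleaned = clean(raw)  # the two incomplete rows are removed
```

Dropping incomplete rows is the simplest policy; whether it is acceptable depends on how much data is lost and on whether the missing values are spread evenly across the data set.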

The clean data is reduced into feature vectors, one vector per observation. A feature vector is a summarized version of the raw data observation. For example, a black and white image of a face which is 100px by 100px would contain 10,000 bits of raw data. This might be turned into a feature vector by locating the eyes and mouth in the image. Doing so would reduce the data for each vector from 10,000 bits to three codes for the locations, dramatically reducing the size of the data set to be mined, and hence reducing the processing effort. The features selected will depend on the objectives; selecting the "right" features is fundamental to successful data mining.

The feature vectors are divided into two sets, the "training set" and the "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.
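
The division into training and test sets can be sketched as a shuffled split. The function name, the 25% test fraction and the fixed random seed are illustrative assumptions, not values given in the text.

```python
import random


def split_train_test(vectors, test_fraction=0.25, seed=0):
    """Shuffle a copy of the feature vectors and return (training_set, test_set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(vectors)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical feature vectors, two features per observation.
vectors = [(i, i * 2) for i in range(20)]
training_set, test_set = split_train_test(vectors)
```

Shuffling before splitting guards against any ordering in the source data (for example, records sorted by date) leaking into one of the two sets.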

Data mining

Data mining commonly involves four classes of task:
  • Classification - Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include nearest neighbor, the naive Bayes classifier and neural networks.
  • Clustering - Like classification, but the groups are not predefined, so the algorithm tries to group similar items together.
  • Regression - Attempts to find a function which models the data with the least error. One method is to use genetic programming.
  • Association rule learning - Searches for relationships between variables. For example, a supermarket might gather data on what each customer buys. Using association rule learning, the supermarket can work out which products are frequently bought together, which is useful for marketing purposes. This is sometimes referred to as "market basket analysis".
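
To make the classification task concrete, here is a minimal sketch of a nearest neighbor classifier for the spam example. The two features per email (number of links, number of exclamation marks) and the training values are invented for illustration; a practical classifier would use many more features and usually more than one neighbor.

```python
import math


def nearest_neighbor(training, query):
    """Return the label of the training vector closest to `query` (Euclidean distance)."""
    _, label = min(training, key=lambda item: math.dist(item[0], query))
    return label


# Hypothetical features per email: (number of links, number of exclamation marks).
training = [
    ((0, 0), "legitimate"),
    ((1, 1), "legitimate"),
    ((8, 6), "spam"),
    ((9, 9), "spam"),
]

label_a = nearest_neighbor(training, (7, 7))  # closest training vector is (8, 6)
label_b = nearest_neighbor(training, (2, 1))  # closest training vector is (1, 1)
```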


Interpreting the results

The final step of knowledge discovery from data is to evaluate the patterns produced by the data mining algorithms. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a "test set" of data which the data mining algorithm was not trained on. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a "training set" of sample emails. Once trained, the learnt patterns would be applied to the test set of emails on which it had not been trained; the accuracy of these patterns can then be measured from how many emails they correctly classify. A number of statistical methods, such as ROC curves, may be used to evaluate the algorithm.

If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and data mining. If the learnt patterns do meet the desired standards, then the final step is to interpret the learnt patterns and turn them into knowledge.
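
The evaluation on a test set can be sketched as follows. The function computes the accuracy together with the true positive rate and false positive rate, which give one point on a ROC curve. The labels and predictions are made-up sample data, and the function name is an assumption for the example.

```python
def evaluate(predicted, actual, positive="spam"):
    """Return (accuracy, true_positive_rate, false_positive_rate) on a labelled test set."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    pairs = list(zip(predicted, actual))
    true_pos = sum(p == positive and a == positive for p, a in pairs)
    false_pos = sum(p == positive and a != positive for p, a in pairs)
    positives = sum(a == positive for a in actual)
    negatives = len(actual) - positives
    accuracy = sum(p == a for p, a in pairs) / len(pairs)
    tpr = true_pos / positives if positives else 0.0
    fpr = false_pos / negatives if negatives else 0.0
    return accuracy, tpr, fpr


# Hypothetical test set: true labels and the labels produced by the learnt patterns.
actual = ["spam", "spam", "spam", "legitimate", "legitimate",
          "legitimate", "legitimate", "legitimate"]
predicted = ["spam", "spam", "legitimate", "legitimate", "legitimate",
             "legitimate", "spam", "legitimate"]

accuracy, tpr, fpr = evaluate(predicted, actual)
```

An accuracy on the test set that is much lower than the accuracy on the training set is a typical sign of overfitting.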

Notable uses of data mining


Combating terrorism

It has been suggested that both the Central Intelligence Agency and the Canadian Security Intelligence Service have employed this method.

Previous U.S. government data mining programs aimed at stopping terrorism include the Total Information Awareness (TIA) program, the Computer-Assisted Passenger Prescreening System (CAPPS II), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE), the Multistate Anti-Terrorism Information Exchange (MATRIX), and the Secure Flight program. These programs have been discontinued due to controversy over whether they violate the Fourth Amendment to the US Constitution, although many programs that were formed under them continue to be funded by different organizations, or under different names, to this day.

Two plausible data mining techniques in the context of combatting terrorism include "pattern mining" and "subject-based data mining".

An example of a probable application to national security monitoring would be the ability for government analysts to define a pattern of interest as "all individuals traveling from the United States to the Middle East in the next six months" and have the ADVISE tool provide an alert whenever this pattern emerges in the data.

Pattern mining
"Pattern mining" is a data mining technique that involves finding existing patterns in data. In this context, patterns often mean association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behaviour in terms of the purchased products. For example, an association rule "beer => chips (80%)" states that four out of five customers that bought beer also bought chips.
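
The percentage attached to a rule such as "beer => chips (80%)" is usually called the confidence of the rule. A minimal sketch of how it could be computed from transaction data is shown below; the baskets are invented, and the function names are assumptions for the example.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for transaction in transactions if items <= set(transaction))


def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    with_antecedent = count_containing(transactions, antecedent)
    if with_antecedent == 0:
        raise ValueError("no transaction contains the antecedent")
    with_both = count_containing(transactions, set(antecedent) | set(consequent))
    return with_both / with_antecedent


# Hypothetical supermarket baskets: five contain beer, four of those also contain chips.
baskets = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "chips", "bread"},
    {"beer", "chips", "milk"},
    {"beer", "bread"},
    {"milk", "bread"},
]

rule_confidence = confidence(baskets, {"beer"}, {"chips"})  # four out of five
```

Item sets are passed as Python sets; passing a bare string would be split into characters by set(), so single items should be wrapped as {"beer"}.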

In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity - these patterns might be regarded as small signals in a large ocean of noise." Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen both in the temporal and non-temporal domains are imported to classical knowledge discovery search techniques.

Subject-based data mining
"Subject-based data mining" is a data mining technique involving the search for associations between individuals in data. In the context of combatting terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."

Games

Since the early 1960s, the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), has opened up a new area for data mining: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to have the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art (i.e. pre-tablebase knowledge), is used to yield insightful patterns. Berlekamp (in dots-and-boxes etc.) and John Nunn (in chess endgames) are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.

Business

Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than randomly contacting a prospect or customer through a call center or sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across campaigns so that one may predict which channel and which offer an individual is most likely to respond to, across all potential offers. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.
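
The idea behind uplift modeling can be illustrated with a deliberately simplified sketch: for each customer segment, compare the response rate of customers who received the offer with that of customers who did not. The segment names and outcome data are invented, and real uplift models are fitted to much larger treated and control groups rather than computed as a plain difference.

```python
def response_rate(outcomes):
    """Fraction of customers who responded (1 = responded, 0 = did not)."""
    return sum(outcomes) / len(outcomes)


def uplift(treated, control):
    """Increase in response rate attributable to the offer."""
    return response_rate(treated) - response_rate(control)


# Hypothetical outcomes per segment: (customers given the offer, customers not given it).
segments = {
    "loyal": ([1, 1, 1, 0], [1, 1, 1, 0]),        # responds with or without an offer
    "persuadable": ([1, 1, 0, 0], [0, 0, 0, 0]),  # responds only when given an offer
}

uplift_by_segment = {name: uplift(t, c) for name, (t, c) in segments.items()}
best_segment = max(uplift_by_segment, key=uplift_by_segment.get)
```

In this sample the "loyal" segment has the higher raw response rate, but the offer changes nothing for it; the "persuadable" segment is where the offer makes the difference.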

Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can quickly become very large. Rather than building one model to predict which customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people that are likely to churn, it may only want to send offers to customers that are likely to take the offer. Finally, it may also want to determine which customers are going to be profitable over a window of time and only send the offers to those that are likely to be profitable. In order to maintain this quantity of models, businesses need to manage model versions and move to automated data mining.

Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.

Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones. Although some relationships may be difficult to explain, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction based, and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months.

Market basket analysis has also been used to identify the purchase patterns of the Alpha consumer. Alpha consumers are people who play a key role in connecting with the concept behind a product, then adopting that product, and finally validating it for the rest of society. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

Data mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich history of customer transactions on millions of customers dating back several years. Data mining tools can identify patterns among customers and help identify the customers most likely to respond to upcoming mailing campaigns.

Related to an integrated-circuit production line, an example of data mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing." In this paper the application of data mining and decision analysis to the problem of die-level functional test is described. Experiments mentioned in this paper demonstrate the ability of applying a system of mining historical die-test data to create a probabilistic model of patterns of die failure which are then utilized to decide in real time which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.

Given below is a list of the top data-mining software vendors in 2008, as published in a Gartner study.
  • Angoss Software
  • Infor CRM Epiphany
  • Kxen
  • Portrait Software
  • SAS
  • SPSS
  • ThinkAnalytics
  • Unica
  • Viscovery
  • Monarch


Science and engineering

In recent years, data mining has been widely used in areas of science and engineering such as bioinformatics, genetics, medicine, education and electrical power engineering.

In the study of human genetics, an important goal is to understand the mapping relationship between inter-individual variation in human DNA sequences and variability in disease susceptibility. In lay terms, the aim is to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This understanding is important for improving the diagnosis, prevention and treatment of these diseases. A data mining technique used to perform this task is known as multifactor dimensionality reduction.
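
The idea behind multifactor dimensionality reduction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`mdr_fit`, `mdr_accuracy`, `best_factor_set`), the 0/1 genotype coding and the toy data are all hypothetical, and the cross-validation step normally used to pick a model is omitted. Each combination of genotypes is pooled into a "high risk" or "low risk" group by comparing its case-to-control ratio with the overall ratio, which reduces several factors to a single two-valued attribute.

```python
from collections import defaultdict
from itertools import combinations


def mdr_fit(rows, labels, factors):
    """Label each genotype combination of the chosen factors high or low risk."""
    counts = defaultdict(lambda: [0, 0])  # combination -> [controls, cases]
    for row, label in zip(rows, labels):
        counts[tuple(row[f] for f in factors)][label] += 1
    n_cases = sum(labels)
    overall_ratio = n_cases / (len(labels) - n_cases)  # assumes both classes present
    # A cell is high risk when its case:control ratio exceeds the overall ratio.
    return {combo: cases > overall_ratio * controls
            for combo, (controls, cases) in counts.items()}


def mdr_accuracy(model, rows, labels, factors):
    """Fraction of samples whose risk label matches their case/control status."""
    hits = sum(int(model.get(tuple(row[f] for f in factors), False)) == label
               for row, label in zip(rows, labels))
    return hits / len(labels)


def best_factor_set(rows, labels, order):
    """Exhaustively score every factor combination of the given order."""
    candidates = combinations(range(len(rows[0])), order)
    return max(candidates,
               key=lambda fs: mdr_accuracy(mdr_fit(rows, labels, fs), rows, labels, fs))


# Toy data (hypothetical): disease status depends on the interaction of
# factors 0 and 1 (an XOR pattern); factor 2 is noise. Genotypes are coded 0/1.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, _ in rows]

best = best_factor_set(rows, labels, order=2)
print(best, mdr_accuracy(mdr_fit(rows, labels, best), rows, labels, best))
```

In the toy data neither factor 0 nor factor 1 is informative on its own; only their combination separates cases from controls, which is the kind of interaction the technique is designed to detect.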

In the area of electrical power engineering, data mining techniques have been widely used for condition monitoring of high-voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data clustering techniques such as the self-organizing map (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Different tap positions generate different signals; however, there was also considerable variability among normal-condition signals for the exact same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.
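
How a self-organizing map might flag an abnormal signal can be sketched as follows. This is an illustrative example under stated assumptions, not the method used in the tap-changer studies: the two-number feature vectors, the one-dimensional chain of map units, the training schedule and the use of the distance to the best-matching unit as an abnormality score are all hypothetical choices made for brevity.

```python
import math
import random


def best_matching_unit(units, x):
    """Index of the map unit whose weight vector is closest to x."""
    return min(range(len(units)), key=lambda i: math.dist(units[i], x))


def train_som(samples, n_units=4, steps=400, seed=0):
    """Train a one-dimensional SOM (a chain of units) on the samples."""
    rng = random.Random(seed)
    units = [list(rng.choice(samples)) for _ in range(n_units)]
    for step in range(steps):
        frac = step / steps
        rate = 0.5 * (1.0 - frac)                        # learning rate decays to 0
        radius = max((n_units / 2) * (1.0 - frac), 0.5)  # neighbourhood shrinks
        x = rng.choice(samples)
        bmu = best_matching_unit(units, x)
        for i, weights in enumerate(units):
            # Units near the best-matching unit on the chain move more.
            influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
            for d in range(len(weights)):
                weights[d] += rate * influence * (x[d] - weights[d])
    return units


def quantization_error(units, x):
    """Distance from x to its best-matching unit; large values suggest novelty."""
    return min(math.dist(weights, x) for weights in units)


# Hypothetical features extracted from normal signals at two tap positions.
normal = [(1.0, 0.20), (1.1, 0.22), (0.9, 0.18), (1.05, 0.21),
          (2.0, 0.50), (2.1, 0.52), (1.9, 0.48), (2.05, 0.51)]
abnormal = (5.0, 3.0)  # hypothetical signal from a degraded contact

units = train_som(normal)
threshold = max(quantization_error(units, s) for s in normal)
print(quantization_error(units, abnormal) > threshold)
```

Training only on normal-condition signals lets the map absorb the natural variability within each tap position, so that a signal lying far from every unit stands out.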

Data mining techniques have also been applied to dissolved gas analysis (DGA) of power transformers. DGA has been available as a diagnostic for power transformers for many years. Data mining techniques such as SOM have been applied to analyse the data and to determine trends that are not obvious to standard DGA ratio techniques such as the Duval Triangle.

A fourth area of application for data mining in science and engineering is educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning and to understand the factors influencing university student retention. A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalized and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

Other examples of applying data mining techniques include the analysis of biomedical data facilitated by domain ontologies, the mining of clinical trial data, and traffic analysis using SOM.

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents. Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions with medical diagnoses.
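
Screening a report database for unusual drug-reaction pairs can be illustrated with a simple observed-versus-expected comparison. The text above does not specify which measure is used, so the sketch below is only an assumption about the general approach: the function names, the 0.5 shrinkage term, the threshold and the report data are all hypothetical. A pair scores highly when it is reported more often than would be expected if the drug and the reaction were independent.

```python
import math
from collections import Counter


def disproportionality(n_pair, n_drug, n_event, n_total):
    """log2 of observed over expected report counts, shrunk towards zero."""
    expected = n_drug * n_event / n_total
    # Adding 0.5 to both terms damps scores that rest on very few reports.
    return math.log2((n_pair + 0.5) / (expected + 0.5))


def screen(reports, threshold=2.0):
    """Return (drug, event, score) for pairs scoring above the threshold."""
    pair_counts = Counter(reports)
    drug_counts = Counter(drug for drug, _ in reports)
    event_counts = Counter(event for _, event in reports)
    flagged = []
    for (drug, event), n_pair in pair_counts.items():
        score = disproportionality(n_pair, drug_counts[drug],
                                   event_counts[event], len(reports))
        if score > threshold:
            flagged.append((drug, event, score))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical reports, one (drug, event) tuple per suspected reaction.
reports = ([("A", "rash")] * 8 + [("A", "nausea")] * 2 +
           [("B", "nausea")] * 40 + [("B", "rash")] * 2 +
           [("C", "headache")] * 48)

for drug, event, score in screen(reports):
    print(drug, event, round(score, 2))
```

A high score is only a prompt for further review; it says nothing by itself about whether the drug caused the reaction.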

Privacy concerns and ethics

The way data mining is used can raise questions regarding privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program, has raised privacy concerns.

Data mining can uncover information or patterns that may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation, in which data, possibly from various sources, are put together so that they can be analyzed. The threat to an individual's privacy arises when the compiled data allow the data miner to identify specific individuals, especially when the data were originally anonymous.

It is recommended that an individual be made aware of the following before data are collected:
  • the purpose of the data collection and any data mining projects,
  • how the data will be used,
  • who will be able to mine the data and use it,
  • the security surrounding access to the data, and
  • how collected data can be updated.
The data may additionally be modified so that they are anonymous and individuals cannot be readily identified.
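
One way to reason about whether modified data are still identifying is to count how many records share each combination of indirectly identifying attributes, an idea usually called k-anonymity. The text does not name a specific technique, so this choice, the field names, the helper functions and the records below are all illustrative assumptions. If any combination occurs only once, that record could be singled out when the data are aggregated with another source.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the rarest combination of quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical records with names already removed.
records = [
    {"age": 34, "postcode": "02139", "diagnosis": "asthma"},
    {"age": 36, "postcode": "02139", "diagnosis": "diabetes"},
    {"age": 52, "postcode": "02139", "diagnosis": "asthma"},
    {"age": 58, "postcode": "02139", "diagnosis": "migraine"},
]

before = smallest_group(records, ["age", "postcode"])
banded = [generalize_age(r) for r in records]
after = smallest_group(banded, ["age", "postcode"])
print(before, after)
```

Coarsening the age into ten-year bands means every record now shares its age and postcode with at least one other, at the cost of some analytical detail.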

See also


  • Association rule learning
  • Data analysis
  • Data warehouse
  • Cluster analysis
  • Knowledge discovery
  • Stellar wind (code name)
  • Structured data analysis (statistics)
  • Screen scraping
  • Web-scraping software comparison


Data mining is about analysing data; for information about extracting information out of data, see:
  • Information extraction
  • Named entity recognition
  • Profiling
  • Profiling practices


Further reading

  • Wang, X.Z.; Medasani, S.; Marhoon, F; Al-Bazzaz, H. (2004) Multidimensional visualisation of principal component scores for process historical data analysis. Industrial & Engineering Chemistry Research, 43(22), pp.7036-7048.
  • Wang, X.Z. (1999) Data mining and knowledge discovery for process monitoring and control. Springer, London.
  • Peter Cabena, Pablo Hadjnian, Rolf Stadler, Jaap Verhees, Alessandro Zanasi, Discovering Data Mining: From Concept to Implementation (1997), Prentice Hall, ISBN 0137439806.
  • Ronen Feldman and James Sanger, The Text Mining Handbook, Cambridge University Press, ISBN 9780521836579.
  • Phiroz Bhagat, Pattern Recognition in Industry, Elsevier, ISBN 0-08-044538-1.
  • Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (2000), ISBN 1-55860-552-5. (See also the free Weka software.)
  • Mark F. Hornick, Erik Marcade, Sunil Venkayala, Java Data Mining: Strategy, Standard, and Practice: A Practical Guide for Architecture, Design, and Implementation (paperback).
  • Weiss and Indurkhya, Predictive Data Mining, Morgan Kaufmann.
  • Yike Guo and Robert Grossman, editors: High Performance Data Mining: Scaling Algorithms, Applications and Systems, Kluwer Academic Publishers, 1999.
  • Trevor Hastie, Robert Tibshirani and Jerome Friedman (2001). The Elements of Statistical Learning, Springer. ISBN 0387952845.
  • Pascal Poncelet, Florent Masseglia and Maguelonne Teisseire (Editors). Data Mining Patterns: New Methods and Applications, Information Science Reference, ISBN 978-1599041629 (October 2007).
  • Ingo Mierswa, Michael Wurst, Ralf Klinkenberg, Martin Scholz and Timm Euler: YALE: Rapid Prototyping for Complex Data Mining Tasks, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), 2006.
  • Peng, Y., Kou, G., Shi, Y. and Chen, Z. A Descriptive Framework for the Field of Data Mining and Knowledge Discovery, International Journal of Information Technology and Decision Making, Vol. 7, Issue 4, pp. 639-682, 2008.


External links

  • - Data Mining Community's Top Resource since 1997
  • - ACM SIGKDD, The Society for Knowledge Discovery and Data Mining