Data classification (Business Intelligence)
Encyclopedia
In business intelligence
Business intelligence
Business intelligence mainly refers to computer-based techniques used in identifying, extracting, and analyzing business data, such as sales revenue by products and/or departments, or by associated costs and incomes....

, data classification has close ties to data clustering
Data clustering
Cluster analysis or clustering is the task of assigning a set of objects into groups so that the objects in the same cluster are more similar to each other than to those in other clusters....

, but where data clustering is descriptive, data classification
Data classification
In the field of data management, data classification as a part of Information Lifecycle Management process can be defined as tool for categorization of data to enable/help organization to effectively answer following questions:...

 is predictive. In essence data classification consists of using variable
Variable
Variable may refer to:* Variable , a logical set of attributes* Variable , a symbol that represents a quantity in an algebraic expression....

s with known values to predict the unknown or future values of other variables. It can be used in e.g. direct marketing
Direct marketing
Direct marketing is a channel-agnostic form of advertising that allows businesses and nonprofits to communicate straight to the customer, with advertising techniques such as mobile messaging, email, interactive consumer websites, online display ads, fliers, catalog distribution, promotional...

, insurance fraud
Insurance fraud
Insurance fraud is any act committed with the intent to fraudulently obtain payment from an insurer.Insurance fraud has existed ever since the beginning of insurance as a commercial enterprise. Fraudulent claims account for a significant portion of all claims received by insurers, and cost billions...

 detection or medical diagnosis
Medical diagnosis
Medical diagnosis refers both to the process of attempting to determine or identify a possible disease or disorder , and to the opinion reached by this process...

.

The first step in doing a data classification is to cluster the data set
Data set
A data set is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random numbers. Each...

 used for category training, to create the wanted number of categories. An algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

, called the classifier, is then used on the categories, creating a descriptive model for each. These models can then be used to categorize new items in the created classification system.

According to Golfarelli and Rizzi, these are the measures of effectiveness of the classifier:
  • Predictive accuracy: How well does it predict the categories for new observations?
  • Speed: What is the computational cost of using the classifier?
  • Robustness: How well do the models created perform if data quality
    Data quality
    Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" . Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which they refer...

     is low?
  • Scalability: Does the classifier function efficiently with large amounts of data?
  • Interpretability: Are the results understandable to users?


Typical examples of input for data classification could be variables such as demographics
Demographics
Demographics are the most recent statistical characteristics of a population. These types of data are used widely in sociology , public policy, and marketing. Commonly examined demographics include gender, race, age, disabilities, mobility, home ownership, employment status, and even location...

, lifestyle information, or economical behaviour.

Challenges for data classification

There are several challenges in working with data classification. One in particular is that it is necessary for all using categories on e.g. customer
Customer
A customer is usually used to refer to a current or potential buyer or user of the products of an individual or organization, called the supplier, seller, or vendor. This is typically through purchasing or renting goods or services...

s or client
Client
Client may refer to:* Customer, someone who purchases or hires something from someone else* Client , software that accesses a remote service on another computer* The client who received patronage in ancient Rome...

s, to do the modeling in an iterative process. This is to make sure that change in the characteristics of customer groups does not go unnoticed, making the existing categories outdated and obsolete, without anyone noticing.

This could be of special importance to insurance
Insurance
In law and economics, insurance is a form of risk management primarily used to hedge against the risk of a contingent, uncertain loss. Insurance is defined as the equitable transfer of the risk of a loss, from one entity to another, in exchange for payment. An insurer is a company selling the...

 or banking companies, where fraud detection is extremely relevant. New fraud patterns may come unnoticed, if the methods to surveil these changes and alert when categories are changing, disappearing or new ones emerge, are not developed and implemented.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK