Discretization of continuous features
In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables into discretized or nominal attributes, features, variables or intervals. This can be useful when creating probability mass functions, which is formally a problem of density estimation. It is a form of binning, as in making a histogram.

Typically, data is discretized into K partitions of equal width (equal intervals) or into partitions that each contain the same share of the total data (equal frequencies).
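The two schemes can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `equal_width_edges`, `assign_bin` and `equal_frequency_bins` are hypothetical, and the sketch assumes a non-empty list of numbers and K of at least 1.

```python
def equal_width_edges(values, k):
    """Return the k + 1 edges of k equal-width intervals spanning the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def assign_bin(value, edges):
    """Return the index of the interval containing value.

    Intervals are closed on the left and open on the right, except the
    last one, which also includes its right edge.
    """
    k = len(edges) - 1
    for i in range(k):
        if value < edges[i + 1]:
            return i
    return k - 1


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins whose sizes differ by at most one."""
    ordered = sorted(values)
    n = len(ordered)
    bins, start = [], 0
    for i in range(k):
        # The first n % k bins take one extra value each.
        end = start + n // k + (1 if i < n % k else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
edges = equal_width_edges(data, 3)            # [1.0, 34.0, 67.0, 100.0]
width_bins = [assign_bin(v, edges) for v in data]
freq_bins = equal_frequency_bins(data, 3)     # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 100]]
```

With this skewed sample, equal-width binning puts nine of the ten values in the first interval and leaves the middle one empty, while equal-frequency binning keeps the bins balanced at the cost of very unequal interval widths.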

Some mechanisms for discretizing continuous data include:
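  • Fayyad & Irani's MDL method: uses information gain to recursively choose the cut points that define the bins, with a minimum description length (MDL) criterion deciding when to stop splitting.
  • Many other methods exist.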
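The recursive, information-gain-driven idea behind the first method can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, every boundary between distinct sorted values is tried as a candidate cut, and the stopping rule is the commonly cited MDL acceptance test, which accepts a split only if its gain exceeds (log2(N - 1) + delta) / N.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mdl_accepts(ys, left, right, gain):
    """MDL stopping rule: accept the split only if the gain pays for its cost."""
    n = len(ys)
    k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(ys) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (math.log2(n - 1) + delta) / n


def _split(xs, ys):
    """Recursively collect accepted cut points for sorted xs with labels ys."""
    n = len(xs)
    if n < 2:
        return []
    base = entropy(ys)
    best = None  # (gain, cut value, split index)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # cannot cut between identical values
        left, right = ys[:i], ys[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, (xs[i - 1] + xs[i]) / 2, i)
    if best is None:
        return []
    gain, cut, i = best
    if not mdl_accepts(ys, ys[:i], ys[i:], gain):
        return []
    return _split(xs[:i], ys[:i]) + [cut] + _split(xs[i:], ys[i:])


def mdl_cut_points(values, labels):
    """Return the sorted list of cut points chosen for one continuous feature."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return _split(xs, ys)


values = [1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18]
labels = ["a"] * 8 + ["b"] * 8
cuts = mdl_cut_points(values, labels)  # [9.5]
```

Unlike equal-width or equal-frequency binning, this approach is supervised: it needs class labels, and the number of bins is determined by the data rather than fixed in advance. When the labels carry no usable signal, the stopping rule rejects every split and the feature is left as a single bin.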

Many machine learning algorithms are reported to produce better models when continuous attributes are discretized first.

See also

  • Data binning
  • Density estimation
  • Discretization error
  • Histogram
  • Continuity correction

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.