C4.5 algorithm
C4.5 is an algorithm developed by Ross Quinlan that is used to generate a decision tree. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier.

Algorithm

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s1, s2, ... of already classified samples. Each sample si is a vector x1, x2, ..., where the xj represent attributes or features of the sample. The training data is augmented with a vector C = c1, c2, ..., where the ci represent the class to which each sample belongs.

At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
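To make the splitting criterion concrete, the following Java sketch computes Shannon entropy and the normalized information gain (gain ratio) for one candidate partition of the samples. The class names, method names, and toy data are illustrative choices, not anything prescribed by C4.5 itself.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the splitting criterion: entropy, information gain, and the
// normalized gain (gain ratio) that C4.5 uses to rank candidate attributes.
// Class labels and attribute values are plain strings for simplicity.
public class GainRatio {

    // Shannon entropy of a list of class labels, in bits.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : labels) counts.merge(c, 1, Integer::sum);
        double h = 0.0, n = labels.size();
        for (int count : counts.values()) {
            double p = count / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Gain ratio = information gain / split information, where the partition is
    // the grouping of the samples' labels by the value of one discrete attribute.
    static double gainRatio(List<String> labels, List<List<String>> partition) {
        double n = labels.size();
        double remainder = 0.0, splitInfo = 0.0;
        for (List<String> subset : partition) {
            double w = subset.size() / n;
            remainder += w * entropy(subset);
            splitInfo -= w * (Math.log(w) / Math.log(2));
        }
        double gain = entropy(labels) - remainder;
        return splitInfo == 0.0 ? 0.0 : gain / splitInfo;
    }

    public static void main(String[] args) {
        // Toy example: 4 samples split by a binary attribute into two pure subsets.
        List<String> labels = List.of("yes", "yes", "no", "no");
        List<List<String>> partition = List.of(List.of("yes", "yes"), List.of("no", "no"));
        System.out.println(gainRatio(labels, partition));   // 1.0: a perfect split
    }
}

In the toy call the attribute separates the classes perfectly, so the gain ratio is 1.0; C4.5 compares such scores across all candidate attributes and splits on the best one.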

This algorithm has a few base cases.
  • All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
  • None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
  • An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.

Pseudocode

In pseudocode, the general algorithm for building decision trees is:
  1. Check for base cases
  2. For each attribute a
    1. Find the normalized information gain from splitting on a
  3. Let a_best be the attribute with the highest normalized information gain
  4. Create a decision node that splits on a_best
  5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the decision node
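Read as Java, the recursion above might look roughly like the sketch below. It handles discrete attributes only and omits C4.5's refinements (threshold splits, missing values, pruning); the Node class, the map-based sample representation, and the small dataset in main are assumptions made purely for illustration.

import java.util.*;

// A hedged, illustrative sketch of the pseudocode above, restricted to discrete
// attributes. Samples are maps from attribute name to value, with class labels
// kept in a parallel list; this is not Quinlan's implementation, just one way
// to realize the recursion.
public class SimpleC45 {

    static class Node {
        String label;                           // set on leaf nodes
        String splitAttribute;                  // set on decision nodes
        Map<String, Node> children = new HashMap<>();
    }

    // Shannon entropy of a list of class labels, in bits.
    static double entropy(List<String> labels) {
        Map<String, Long> counts = new HashMap<>();
        for (String c : labels) counts.merge(c, 1L, Long::sum);
        double h = 0.0, n = labels.size();
        for (long count : counts.values()) {
            double p = count / n;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Normalized information gain (gain ratio) of splitting on one attribute.
    static double gainRatio(List<Map<String, String>> samples, List<String> labels, String attribute) {
        Map<String, List<String>> byValue = new HashMap<>();
        for (int i = 0; i < samples.size(); i++)
            byValue.computeIfAbsent(samples.get(i).get(attribute), v -> new ArrayList<>()).add(labels.get(i));
        double n = labels.size(), remainder = 0.0, splitInfo = 0.0;
        for (List<String> subset : byValue.values()) {
            double w = subset.size() / n;
            remainder += w * entropy(subset);
            splitInfo -= w * Math.log(w) / Math.log(2);
        }
        double gain = entropy(labels) - remainder;
        return splitInfo == 0.0 ? 0.0 : gain / splitInfo;
    }

    static String majorityClass(List<String> labels) {
        Map<String, Long> counts = new HashMap<>();
        for (String c : labels) counts.merge(c, 1L, Long::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static Node build(List<Map<String, String>> samples, List<String> labels, Set<String> attributes) {
        Node node = new Node();
        // Base case: every sample has the same class -> leaf predicting that class.
        if (new HashSet<>(labels).size() == 1) { node.label = labels.get(0); return node; }
        // Pick the attribute with the highest gain ratio; if none helps, fall back to the majority class.
        String best = null;
        double bestScore = 0.0;
        for (String a : attributes) {
            double score = gainRatio(samples, labels, a);
            if (score > bestScore) { bestScore = score; best = a; }
        }
        if (best == null) { node.label = majorityClass(labels); return node; }
        // Decision node: split on the best attribute and recurse on each sublist.
        node.splitAttribute = best;
        Map<String, List<Integer>> byValue = new HashMap<>();
        for (int i = 0; i < samples.size(); i++)
            byValue.computeIfAbsent(samples.get(i).get(best), v -> new ArrayList<>()).add(i);
        Set<String> remaining = new HashSet<>(attributes);
        remaining.remove(best);
        for (Map.Entry<String, List<Integer>> e : byValue.entrySet()) {
            List<Map<String, String>> subSamples = new ArrayList<>();
            List<String> subLabels = new ArrayList<>();
            for (int i : e.getValue()) { subSamples.add(samples.get(i)); subLabels.add(labels.get(i)); }
            node.children.put(e.getKey(), build(subSamples, subLabels, remaining));
        }
        return node;
    }

    public static void main(String[] args) {
        // Tiny made-up dataset: the tree should split on "outlook".
        List<Map<String, String>> samples = List.of(
                Map.of("outlook", "sunny"), Map.of("outlook", "rain"), Map.of("outlook", "rain"));
        List<String> labels = List.of("no", "yes", "yes");
        Node root = build(samples, labels, Set.of("outlook"));
        System.out.println(root.splitAttribute);   // prints "outlook"
    }
}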

Implementations

J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool.
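Assuming a recent Weka release is on the classpath, J48 can be trained and cross-validated from Java in a few lines. The ARFF file name below is only a placeholder for a dataset whose last attribute is the class, and -C / -M are J48's usual pruning-confidence and minimum-leaf-size options; treat this as a sketch rather than the definitive way to drive Weka.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

// Hedged sketch of training Weka's J48 on an ARFF file.
public class J48Example {
    public static void main(String[] args) throws Exception {
        DataSource source = new DataSource("weather.nominal.arff");   // placeholder dataset
        Instances data = source.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);                 // class = last attribute

        J48 tree = new J48();
        // -C: pruning confidence factor, -M: minimum instances per leaf (J48 defaults).
        tree.setOptions(Utils.splitOptions("-C 0.25 -M 2"));
        tree.buildClassifier(data);
        System.out.println(tree);                                     // print the learned tree

        // 10-fold cross-validation of the same configuration.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}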

Improvements from ID3 algorithm

C4.5 made a number of improvements to ID3. Some of these are:
  • Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it (see the sketch after this list).
  • Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations.
  • Handling attributes with differing costs.
  • Pruning trees after creation - C4.5 goes back through the tree once it's been created and attempts to remove branches that do not help by replacing them with leaf nodes.
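As a rough sketch of the threshold mechanism in the first bullet above, the Java code below tries a cut point midway between each pair of adjacent distinct values of a continuous attribute and keeps the one with the highest information gain. The gain-ratio normalization and C4.5's other refinements are omitted, and the temperature/play data in main is a toy example, not part of the original algorithm description.

import java.util.*;

// Hedged sketch of choosing a threshold split on a continuous attribute:
// evaluate the binary split (value <= t) vs (value > t) at each candidate
// cut point and keep the one with the highest information gain.
public class ThresholdSplit {

    // Shannon entropy of a list of class labels, in bits.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : labels) counts.merge(c, 1, Integer::sum);
        double h = 0.0, n = labels.size();
        for (int count : counts.values()) {
            double p = count / n;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Returns the threshold t maximizing the gain of the split (value <= t) vs (value > t).
    static double bestThreshold(double[] values, List<String> labels) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        double base = entropy(labels), bestGain = -1.0, bestT = Double.NaN;
        for (int k = 0; k < order.length - 1; k++) {
            double lo = values[order[k]], hi = values[order[k + 1]];
            if (lo == hi) continue;                       // no cut point between equal values
            double t = (lo + hi) / 2.0;
            List<String> left = new ArrayList<>(), right = new ArrayList<>();
            for (int i = 0; i < values.length; i++)
                (values[i] <= t ? left : right).add(labels.get(i));
            double remainder = (left.size() * entropy(left) + right.size() * entropy(right)) / values.length;
            double gain = base - remainder;
            if (gain > bestGain) { bestGain = gain; bestT = t; }
        }
        return bestT;
    }

    public static void main(String[] args) {
        // Toy data: a numeric attribute and the class of each sample.
        double[] temperature = {64, 68, 70, 71, 72, 80, 83, 85};
        List<String> play = List.of("yes", "yes", "yes", "no", "no", "yes", "no", "no");
        System.out.println(bestThreshold(temperature, play));   // a cut separating mostly-yes from mostly-no
    }
}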

Improvements in C5.0/See5 algorithm

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows) which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are:
  • Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude)
  • Memory usage - C5.0 is more memory efficient than C4.5
  • Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
  • Support for boosting - Boosting improves the trees and gives them more accuracy.
  • Weighting - C5.0 allows you to weight different cases and misclassification types.
  • Winnowing - a C5.0 option automatically winnows the attributes to remove those that may be unhelpful.


Source for a single-threaded Linux version of C5.0 is available under the GPL.
