Exploratory data analysis

In statistics, exploratory data analysis (EDA) is an approach to analysing data sets to summarize their main characteristics in easy-to-understand form, often with visual graphs, without using a statistical model or having formulated a hypothesis. Exploratory data analysis was promoted by John Tukey to encourage statisticians to examine their data sets visually, to formulate hypotheses that could be tested on new data sets.

Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs: the S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers and patterns in data that merited further study.

Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the quartiles. Because the median and quartiles are functions of the empirical distribution, they are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than traditional summaries (the mean and standard deviation). The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems).
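As a concrete illustration of the summaries discussed above, the following sketch (pure Python standard library; the code and names are illustrative, not from the original article) computes a five-number summary, shows how the median resists an outlier that drags the mean, and runs a tiny Efron-style bootstrap of the median. Note that `statistics.quantiles` interpolates quartiles in a way that can differ slightly from Tukey's hinges.

```python
import random
import statistics

def five_number_summary(xs):
    """(min, Q1, median, Q3, max); Q1/Q3 come from statistics.quantiles,
    whose interpolation can differ slightly from Tukey's hinges."""
    xs = sorted(xs)
    q1, med, q3 = statistics.quantiles(xs, n=4)
    return (xs[0], q1, med, q3, xs[-1])

data = [2, 4, 4, 5, 6, 7, 9]
with_outlier = data + [1000]            # one wild value

print(five_number_summary(data))        # (2, 4.0, 5.0, 7.0, 9)
print(statistics.median(with_outlier))  # 5.5 -- barely moved
print(statistics.mean(with_outlier))    # 129.625 -- dragged far to the right

# Efron-style bootstrap: resample with replacement to gauge the
# sampling variability of the median.
random.seed(0)
boot_medians = [statistics.median(random.choices(data, k=len(data)))
                for _ in range(1000)]
print(min(boot_medians), max(boot_medians))  # stays within the data's range
```

The bootstrap medians can never leave the range of the observed data, which is one reason resampling summaries are robust for many problems.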

Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems, such as the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.

EDA development

Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.
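Tukey's warning can be demonstrated with a small simulation (standard library only; the setup is illustrative, not from the article): when the data are allowed to suggest which of several pure-noise variables to test, and the test is run on the same data, a nominal 5% test rejects far too often.

```python
import math
import random

random.seed(42)
N_TRIALS, K_VARS, N_OBS = 500, 10, 30

def z_stat(xs):
    """One-sample z-like statistic: mean / (sd / sqrt(n))."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))
    return m / (sd / math.sqrt(len(xs)))

selected_rejections = 0   # test the variable the data itself singled out
fixed_rejections = 0      # test a variable chosen before seeing the data

for _ in range(N_TRIALS):
    # K_VARS independent null variables: every true effect is zero.
    zs = [z_stat([random.gauss(0, 1) for _ in range(N_OBS)])
          for _ in range(K_VARS)]
    if max(abs(z) for z in zs) > 1.96:   # hypothesis suggested by the data
        selected_rejections += 1
    if abs(zs[0]) > 1.96:                # pre-specified hypothesis
        fixed_rejections += 1

print(selected_rejections / N_TRIALS)  # roughly 1 - 0.95**10, i.e. around 0.4
print(fixed_rejections / N_TRIALS)     # close to the nominal 0.05
```

The pre-specified test keeps roughly its advertised error rate; the data-suggested test does not, which is the systematic bias Tukey cautioned against.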

The objectives of EDA are to:
  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection through surveys or experiments


Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.

Techniques

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.

Typical graphical techniques used in EDA are:
  • Box plot
  • Histogram
  • Multi-vari chart
  • Run chart
  • Pareto chart
  • Scatter plot
  • Stem-and-leaf plot
  • Odds ratio
  • Chi-square
  • Multidimensional scaling
  • Targeted projection pursuit
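One of the simplest displays among the graphical techniques, the stem-and-leaf plot, fits in a few lines of Python. This is a minimal sketch (assuming nonnegative integer data, with tens digits as stems and units digits as leaves), not a full-featured implementation:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Simplified stem-and-leaf plot for nonnegative integers:
    stems are tens digits, leaves are units digits."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return "\n".join(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}"
                     for stem, leaves in sorted(stems.items()))

print(stem_and_leaf([12, 15, 21, 24, 24, 33, 38, 41]))
#   1 | 2 5
#   2 | 1 4 4
#   3 | 3 8
#   4 | 1
```

Unlike a histogram, the display retains every individual value while still showing the shape of the distribution.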



Typical quantitative techniques are:
  • Median polish
  • Trimean
  • Letter values
  • Resistant line
  • Resistant smooth
  • Rootogram
  • Ordination
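Two of the quantitative techniques are easy to sketch with the standard library: Tukey's trimean, and a bare-bones median polish for a two-way table. These are illustrative implementations under stated assumptions, not the article's own code; `statistics.quantiles` may compute quartiles slightly differently from Tukey's hinges.

```python
import statistics

def trimean(xs):
    """Tukey's trimean: (Q1 + 2*median + Q3) / 4."""
    q1, med, q3 = statistics.quantiles(sorted(xs), n=4)
    return (q1 + 2 * med + q3) / 4

def median_polish(table, iterations=10):
    """Fit overall + row effect + column effect by repeatedly sweeping
    out row and column medians; returns (overall, rows, cols, residuals)."""
    rows = [0.0] * len(table)
    cols = [0.0] * len(table[0])
    overall = 0.0
    resid = [row[:] for row in table]
    for _ in range(iterations):
        for i, row in enumerate(resid):          # sweep row medians
            m = statistics.median(row)
            rows[i] += m
            resid[i] = [x - m for x in row]
        m = statistics.median(rows)              # recentre row effects
        overall += m
        rows = [r - m for r in rows]
        for j in range(len(cols)):               # sweep column medians
            col = [resid[i][j] for i in range(len(resid))]
            m = statistics.median(col)
            cols[j] += m
            for i in range(len(resid)):
                resid[i][j] -= m
        m = statistics.median(cols)              # recentre column effects
        overall += m
        cols = [c - m for c in cols]
    return overall, rows, cols, resid

print(trimean([1, 2, 3, 4, 5, 6, 7]))  # 4.0

overall, rows, cols, resid = median_polish([[10, 11, 12],
                                            [15, 16, 17]])
print(overall, rows, cols)  # residuals vanish for a purely additive table
```

For an exactly additive table, as here, the residuals sweep out to zero and the fitted decomposition reproduces each cell.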


History

Many EDA ideas can be traced back to earlier authors, for example:
  • Francis Galton emphasized order statistics and quantiles.
  • Arthur Bowley used precursors of the stemplot and five-number summary. (Bowley actually used a "seven-figure summary", including the extremes, deciles and quartiles, along with the median; see his Elementary Manual of Statistics (3rd edn., 1920), p. 62, where he defines "the maximum and minimum, median, quartiles and two deciles" as the "seven positions".)
  • Andrew Ehrenberg articulated a philosophy of data reduction (see his book of the same name).


The Open University course Statistics in Society (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.

Software

  • GGobi, free software for interactive data visualization.
  • Mondrian, free software for interactive data visualization.
  • OpenSHAPA (modern open source successor to MacSHAPA), permits analysis of various media files (e.g. video, sound).
  • CMU-DAP (Carnegie-Mellon University Data Analysis Package, FORTRAN source for EDA tools with English-style command syntax, 1977).
  • Data Applied, a comprehensive web-based data visualization and data mining environment.
  • Fathom (for high-school and intro college courses).
  • JMP, an EDA package from SAS Institute.
  • KNIME, the Konstanz Information Miner, an open-source data exploration platform based on Eclipse.
  • LiveGraph (open source real-time data series plotter).
  • Orange, an open-source data mining software suite.
  • SOCR, which provides a large number of free Internet-accessible tools and interactive aids.
  • DASS-GUI, a data mining framework written in C++ and Qt.
  • TinkerPlots (for upper elementary and middle school students).
  • Weka, an open source data mining package that includes visualisation and EDA tools such as targeted projection pursuit.

See also

  • Anscombe's quartet, on the importance of exploration
  • Predictive analytics
  • Structured data analysis (statistics)
  • Configural frequency analysis

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 