# Data analysis
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical or structural models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.

## Types of data

Data can be of several types:
• Quantitative data: the datum is a number. Often this is a continuous decimal number to a specified number of significant digits; sometimes it is a whole counting number.
• Categorical data: the datum is one of several categories.
• Qualitative data: the datum is a pass/fail result or the presence or absence of a characteristic.

## The process of data analysis

Data analysis is a process within which several phases can be distinguished:

## Data cleaning

Data cleaning is an important procedure during which the data are inspected, and erroneous data are—if necessary, preferable, and possible—corrected. Data cleaning can be done during the stage of data entry. If this is done, it is important that no subjective decisions are made. The guiding principle provided by Adèr (ref) is: during subsequent manipulations of the data, information should always be cumulatively retrievable. In other words, it should always be possible to undo any data set alterations. Therefore, it is important not to throw information away at any stage in the data cleaning phase. All information should be saved (i.e., when altering variables, both the original values and the new values should be kept, either in a duplicate data set or under a different variable name), and all alterations to the data set should be carefully and clearly documented, for instance in a syntax file or a log.
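The retrievability principle above can be sketched in a few lines: keep the original values untouched, write corrections to a separate copy, and log every alteration so it can be undone. The function name, the example data, and the sign-error correction below are illustrative, not from the source.

```python
# A minimal sketch of retrievable data cleaning: never overwrite the
# original values; store corrections separately and log every alteration.

def clean_variable(values, is_valid, correct):
    """Return (original, cleaned, log) so every alteration can be undone."""
    original = list(values)          # untouched copy of the raw data
    cleaned = []
    log = []                         # documents each alteration
    for i, v in enumerate(original):
        if is_valid(v):
            cleaned.append(v)
        else:
            new = correct(v)
            cleaned.append(new)
            log.append((i, v, new))  # index, old value, new value
    return original, cleaned, log

# Example: ages recorded with an impossible negative sign (assumed to be
# a data-entry error for this illustration)
raw_ages = [34, -27, 45, 19]
orig, fixed, log = clean_variable(
    raw_ages,
    is_valid=lambda v: v >= 0,
    correct=abs,
)
# orig == [34, -27, 45, 19]; fixed == [34, 27, 45, 19]; log == [(1, -27, 27)]
```

Keeping `orig` and `log` alongside `fixed` is what makes later manipulations cumulatively retrievable.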

## Initial data analysis

The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analyses that are aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:

### Quality of data

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analyses: frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency histograms, normal probability plots), associations (correlations, scatter plots).
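The descriptive checks listed above can be computed directly. A minimal sketch using only the Python standard library (the moment-based skewness and excess-kurtosis formulas below are one common convention, not the only one):

```python
# Minimal data-quality summary: mean, standard deviation, median,
# skewness, and excess kurtosis, using simple moment formulas.
import statistics as st

def describe(xs):
    n = len(xs)
    mean = st.mean(xs)
    sd = st.pstdev(xs)                    # population standard deviation
    skew = sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)
    kurt = sum((x - mean) ** 4 for x in xs) / (n * sd ** 4) - 3
    return {"mean": mean, "sd": sd, "median": st.median(xs),
            "skewness": skew, "excess_kurtosis": kurt}

stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
# Markedly non-zero skewness or kurtosis would prompt the normality
# checks (frequency histograms, normal probability plots) named above.
```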

Other initial data quality checks are:
• Checks on data cleaning: have decisions influenced the distribution of the variables? The distribution of the variables before data cleaning is compared to the distribution after data cleaning to see whether data cleaning has had unwanted effects on the data.
• Analysis of missing observations: are there many missing values, and are the values missing at random? The missing observations in the data are analyzed to see whether more than 25% of the values are missing, whether they are missing at random (MAR), and whether some form of imputation is needed.
• Analysis of extreme observations: outlying observations (outliers) in the data are analyzed to see if they seem to disturb the distribution.
• Comparison and correction of differences in coding schemes: variables are compared with coding schemes of variables external to the data set, and possibly corrected if coding schemes are not comparable.
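The missing-observation check can be sketched as follows. The 25% threshold comes from the text; the data and variable names are invented, and `None` stands in for a missing value:

```python
# Compute the fraction of missing values per variable and flag anything
# above the threshold, as a first step before deciding on imputation.
def missing_report(data, threshold=0.25):
    report = {}
    for name, values in data.items():
        frac = sum(v is None for v in values) / len(values)
        report[name] = (frac, frac > threshold)   # (fraction, flagged?)
    return report

data = {"age": [34, None, 45, 19], "income": [None, None, None, 50_000]}
report = missing_report(data)
# age: 25% missing -> not flagged; income: 75% missing -> flagged
```

Whether the values are missing at random (MAR) cannot be read off from the fractions alone; it requires comparing the missingness pattern against the other variables.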

The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.

### Quality of measurements

The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether the structure of the measurement instruments corresponds to the structure reported in the literature.

There are two ways to assess measurement quality:
• Confirmatory factor analysis
• Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's α of the scales, and the change in Cronbach's α when an item would be deleted from a scale.
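Cronbach's α itself is straightforward to compute from the item scores: α = k/(k−1) · (1 − Σ item variances / variance of the total score), with population variances. A minimal sketch for a small scale (the item scores below are hypothetical):

```python
# Cronbach's alpha for a scale. `items` is a list of per-item score
# lists, with respondents in the same order in every list.
import statistics as st

def cronbach_alpha(items):
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # per-respondent sums
    item_var = sum(st.pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / st.pvariance(totals))

# Three hypothetical items answered by four respondents
items = [[3, 4, 3, 5], [2, 4, 4, 5], [3, 5, 4, 5]]
alpha = cronbach_alpha(items)
```

Recomputing α with each item left out in turn shows whether deleting an item would raise the scale's internal consistency, as described above.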

### Initial transformations

After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.

Possible transformations of variables are:
• Square root transformation (if the distribution differs moderately from normal)
• Log-transformation (if the distribution differs substantially from normal)
• Inverse transformation (if the distribution differs severely from normal)
• Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no transformations help)
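The transformation ladder above can be sketched as a single dispatch function; the severity labels mirror the list, while the function name and example data are illustrative:

```python
# Apply a normalizing transformation chosen by how far the distribution
# departs from normal, following the ladder in the list above.
import math

def transform(xs, severity):
    """severity: 'moderate' -> square root, 'substantial' -> log,
    'severe' -> inverse. Values must be positive for log and inverse."""
    if severity == "moderate":
        return [math.sqrt(x) for x in xs]
    if severity == "substantial":
        return [math.log(x) for x in xs]
    if severity == "severe":
        return [1 / x for x in xs]
    raise ValueError(severity)

transform([1, 4, 9], "moderate")   # -> [1.0, 2.0, 3.0]
```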

### Did the implementation of the study fulfill the intentions of the research design?

One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups.

If the study did not need and/or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample.

Other possible data distortions that should be checked are:
• Dropout (this should be identified during the initial data analysis phase)
• Item nonresponse (whether this is random or not should be assessed during the initial data analysis phase)
• Treatment quality (using manipulation checks)

### Characteristics of data sample

In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase.

The characteristics of the data sample can be assessed by looking at:
• Basic statistics of important variables
• Scatter plots
• Correlations
• Cross-tabulations
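A cross-tabulation, the last item above, is just a frequency count over pairs of category values. A minimal sketch (the variables and data are invented for illustration):

```python
# Cross-tabulate two categorical variables observed on the same units.
from collections import Counter

def crosstab(rows, cols):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(zip(rows, cols))

sex = ["f", "m", "f", "f", "m"]
group = ["treat", "treat", "control", "treat", "control"]
table = crosstab(sex, group)
# table[("f", "treat")] == 2: two female respondents in the treatment group
```

A table like this makes the subgroup sizes explicit, which is exactly what matters when subgroup analyses are planned for the main analysis phase.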

### Final stage of the initial data analysis

During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken.

Also, the original plan for the main data analyses can and should be specified in more detail and/or rewritten.
In order to do this, several decisions about the main data analyses can and should be made:
• In the case of non-normal distributions: should one transform variables, make variables categorical (ordinal/dichotomous), or adapt the analysis method?
• In the case of missing data: should one neglect or impute the missing data, and which imputation technique should be used?
• In the case of outliers: should one use robust analysis techniques?
• In case items do not fit the scale: should one adapt the measurement instrument by omitting items, or rather ensure comparability with other (uses of the) measurement instrument(s)?
• In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small-sample techniques, like exact tests or bootstrapping?
• In case the randomization procedure seems to be defective: can and should one calculate propensity scores and include them as covariates in the main analyses?
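Bootstrapping, one of the small-sample techniques mentioned above, can be sketched as resampling with replacement and inspecting the spread of the recomputed statistic. The sample values and resample count below are invented, and the percentile interval is only the roughest form of bootstrap interval:

```python
# Bootstrap the mean of a small sample: resample with replacement many
# times and look at the distribution of the resampled means.
import random
import statistics as st

def bootstrap_means(sample, n_resamples=1000, seed=0):
    rng = random.Random(seed)        # fixed seed for reproducibility
    return [st.mean(rng.choices(sample, k=len(sample)))
            for _ in range(n_resamples)]

sample = [5.1, 4.8, 6.2, 5.5, 5.0]
means = sorted(bootstrap_means(sample))
# The 2.5th and 97.5th percentiles give a rough 95% percentile interval.
lo, hi = means[25], means[974]
```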

### Analyses

Several analyses can be used during the initial data analysis phase:
• Univariate statistics
• Bivariate associations (correlations)
• Graphical techniques (scatter plots)

It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level:
• Nominal and ordinal variables
• Frequency counts (numbers and percentages)
• Associations
• Crosstabulations
• hierarchical loglinear analysis (restricted to a maximum of 8 variables)
• loglinear analysis (to identify relevant/important variables and possible confounders)
• Exact tests or bootstrapping (in case subgroups are small)
• Computation of new variables

• Continuous variables
• Distribution
• Statistics (M, SD, variance, skewness, kurtosis)
• Stem-and-leaf displays
• Box plots

## Main data analysis

In the main analysis phase, analyses aimed at answering the research question are performed, as well as any other relevant analyses needed to write the first draft of the research report.

### Exploratory and confirmatory approaches

In the main analysis phase either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis clear hypotheses about the data are tested.

Exploratory data analysis should be interpreted carefully. When testing multiple models at once there is a high chance of finding at least one of them to be significant, but this can be due to a Type I error. It is important to always adjust the significance level when testing multiple models with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found exploratively in a dataset, then following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same Type I error that produced the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.
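The Bonferroni correction mentioned above simply divides the significance level by the number of tests. A minimal sketch with invented p-values:

```python
# Bonferroni correction: reject only p-values below alpha / (number of tests).
def bonferroni_reject(p_values, alpha=0.05):
    adjusted = alpha / len(p_values)
    return [p < adjusted for p in p_values]

p_values = [0.001, 0.02, 0.04, 0.30]
rejections = bonferroni_reject(p_values)
# Adjusted threshold = 0.05 / 4 = 0.0125, so only the first test survives,
# even though three of the four raw p-values are below 0.05.
```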

### Stability of results

It is important to obtain some indication about how generalizable the results are. While this is hard to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing this:
• Cross-validation: by splitting the data into multiple parts, we can check whether an analysis (like a fitted model) based on one part of the data generalizes to another part of the data as well.
• Sensitivity analysis: a procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do this is with bootstrapping.
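The cross-validation idea can be sketched with the simplest possible "model", the training mean: fit on one half of the data, then see whether the fit error holds up on the held-out half. Everything below is illustrative:

```python
# Two-fold check of stability: fit a trivial model (the mean) on the
# first half of the data and compare its error on both halves.
import statistics as st

def two_fold_check(data):
    half = len(data) // 2
    train, test = data[:half], data[half:]
    model = st.mean(train)                            # "fitted model"
    train_err = st.mean((x - model) ** 2 for x in train)
    test_err = st.mean((x - model) ** 2 for x in test)
    return model, train_err, test_err

# A much larger test error than training error suggests the result does
# not generalize from one part of the data to the other.
result = two_fold_check([2.0, 4.0, 3.0, 5.0])
```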

### Statistical methods

Many statistical methods have been used for statistical analyses. A very brief list of four of the more popular methods is:
• General linear model: a widely used model on which various statistical methods are based (e.g. t test, ANOVA, ANCOVA, MANOVA). Usable for assessing the effect of several predictors on one or more continuous dependent variables.
• Generalized linear model: an extension of the general linear model for discrete dependent variables.
• Structural equation modelling: usable for assessing latent structures from measured manifest variables.
• Item response theory: models for (mostly) assessing one latent variable from several binary measured variables (e.g. an exam).
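As a concrete instance of the general linear model, ordinary least squares with a single predictor has a closed-form solution. A minimal sketch (the data points are invented):

```python
# Simple linear regression by ordinary least squares, the most basic
# special case of the general linear model.
import statistics as st

def ols(x, y):
    """Return (intercept, slope) of the least-squares line."""
    mx, my = st.mean(x), st.mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

intercept, slope = ols([1, 2, 3, 4], [2.1, 3.9, 6.1, 7.9])
# Fitted line: y ≈ 0.1 + 1.96 * x
```

With several predictors the same idea is written in matrix form, which is how t tests, ANOVA, and ANCOVA arise as special cases.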

## Free software for data analysis

• ROOT - C++ data analysis framework developed at CERN
• PAW - FORTRAN/C data analysis framework developed at CERN
• jHepWork - Java (multi-platform) data analysis framework developed at ANL
• KNIME - the Konstanz Information Miner, a user-friendly and comprehensive data analytics framework
• Data Applied - an online data mining and data visualization solution
• R - a programming language and software environment for statistical computing and graphics
• DevInfo - a database system endorsed by the United Nations Development Group for monitoring and analyzing human development
• Zeptoscope Basic - an interactive Java-based plotter developed at Nanomix

## Nuclear and particle physics

In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system. The data are then processed, in a step usually called data reduction, to apply calibrations and to extract physically significant information. Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad hoc. The resulting data n-tuples are then scrutinized by the physicists, using specialized software tools like ROOT or PAW, comparing the results of the experiment with theory.

The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for Monte Carlo simulation software like Geant4 to predict the response of the detector to a given theoretical event, producing simulated events which are then compared to experimental data.

• Adèr, H.J. & Mellenbergh, G.J. (with contributions by D.J. Hand) (2008). Advising on Research Methods: A consultant's companion. Huizen, the Netherlands: Johannes van Kessel Publishing.
• ASTM International (2002). Manual on Presentation of Data and Control Chart Analysis, MNL 7A, ISBN 0803120931
• Godfrey, A. B. (1999). Juran's Quality Handbook, ISBN 00703400359
• Lewis-Beck, Michael S. (1995). Data Analysis: an Introduction, Sage Publications Inc, ISBN 0803957726
• NIST/SEMATECH (2008). Handbook of Statistical Methods
• Pyzdek, T. (2003). Quality Engineering Handbook, ISBN 0824746147
• Richard Veryard (1984). Pragmatic Data Analysis. Oxford: Blackwell Scientific Publications, ISBN 0632013117
• Tabachnick, B.G. & Fidell, L.S. (2007). Using Multivariate Statistics, Fifth Edition. Boston: Pearson Education, Inc. / Allyn and Bacon, ISBN 978-0205459384