Data analysis



 
 


Data analysis is a process of gathering, modeling, and transforming data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical or structural models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling, which is unrelated to the subject of this article.

Nuclear and particle physics

In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system. They are then processed, in a step usually called data reduction, to apply calibrations and to extract physically significant information. Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by purpose-written software. The resulting data n-tuples are then scrutinized by the physicists, using specialized software tools such as ROOT or PAW, to compare the results of the experiment with theory.

The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for Monte Carlo simulation software such as Geant4, which predicts the response of the detector to a given theoretical event, producing simulated events that are then compared to experimental data.
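
As a toy illustration of the simulation step described above, the following sketch smears a hypothetical true energy with a Gaussian detector resolution and bins the simulated events into a histogram, which is the form in which they would typically be compared with measured data. The function names, the 100 GeV energy, and the 5% resolution are illustrative assumptions; none of this reflects the API or physics models of Geant4 or any real experiment.

```python
import random


def simulate_detector_response(true_energy, resolution, n_events, seed=0):
    """Return simulated measured energies for a toy detector.

    Assumption: the response is Gaussian with relative width `resolution`.
    Real simulation toolkits model the detector in far more detail.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sigma = true_energy * resolution
    return [rng.gauss(true_energy, sigma) for _ in range(n_events)]


def histogram(values, low, high, n_bins):
    """Count values into n_bins equal-width bins on [low, high)."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for v in values:
        if low <= v < high:
            # min() guards against floating-point rounding at the upper edge
            counts[min(int((v - low) / width), n_bins - 1)] += 1
    return counts


# Hypothetical theoretical event: a 100 GeV deposit seen with 5% resolution.
simulated = simulate_detector_response(true_energy=100.0, resolution=0.05, n_events=10_000)
sim_hist = histogram(simulated, low=80.0, high=120.0, n_bins=8)
```

A histogram of experimental data filled with the same binning could then be compared bin by bin with `sim_hist`.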

See also: Computational physics.

Social sciences

Qualitative data analysis (QDA) or qualitative research is the non-quantitative analysis of data from non-numerical sources, for example words, photographs, and observations.

Phases in data analysis

The statistical analysis of data is a process with several phases, each with its own goal.

Data cleaning

During data cleaning, erroneous entries are inspected and corrected where possible. In some cases, it is easy to substitute suspect data with the correct values. However, when it is unclear what caused the erroneous data or what should replace them, no subjective decisions should be made, so that the quality of the data is preserved. Furthermore, it is important not to throw information away at any stage of the data cleaning phase. When altering variables, the original values should be kept in a duplicate dataset or under a different variable name so that information is always cumulatively retrievable.
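
The advice to keep original values under a different variable name can be sketched as follows. This is a minimal example, not a prescribed procedure: the field names `age` and `age_clean` and the plausibility range of 0 to 120 are assumptions chosen for illustration.

```python
def clean_age(records):
    """Return new records with an added 'age_clean' field.

    The original 'age' value is never overwritten, so every change
    made during cleaning can be traced back and undone.
    """
    cleaned = []
    for rec in records:
        new_rec = dict(rec)  # copy, so the input records stay untouched
        age = rec.get("age")
        # Assumption: ages outside 0..120 are treated as entry errors.
        if isinstance(age, (int, float)) and 0 <= age <= 120:
            new_rec["age_clean"] = age
        else:
            # The cause of the error is unknown, so the value is flagged
            # as missing instead of being replaced by a guess.
            new_rec["age_clean"] = None
        cleaned.append(new_rec)
    return cleaned


raw = [{"id": 1, "age": 34}, {"id": 2, "age": 340}, {"id": 3, "age": None}]
cleaned = clean_age(raw)
```

Each cleaned record carries both the original and the cleaned value, so an implausible entry such as 340 remains available for later inspection.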

Initial data analysis

The initial data analysis uses descriptive statistics to answer the following four questions:
  1. What is the quality of the data?
  2. What is the quality of the measurements?
  3. Did the implementation of the study fulfill the intentions of the research design?
  4. What are the characteristics of the data sample?
Each step of the initial data analysis is described below.

The quality of the data
The quality of the data can be assessed in several ways. First, the distribution of the variables before data cleaning is compared to the distribution of the variables after data cleaning to see whether data cleaning has had unwanted effects on the data. Second, the missing observations in the data are analyzed to see whether they are missing at random and whether some form of imputation is needed. Third, extreme observations (outliers) in the data are analyzed to see whether they seem to disturb the distribution. If that is the case, robust techniques can be applied.
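
The second and third checks can be sketched with the Python standard library. The example below reports the fraction of missing observations and flags extreme observations with a robust rule based on the median and the median absolute deviation (MAD). The scaling constant 0.6745 and the cutoff of 3.5 are a commonly used convention adopted here as an assumption; the sample values are made up for illustration.

```python
import statistics


def missing_fraction(values):
    """Fraction of observations recorded as None."""
    return sum(v is None for v in values) / len(values)


def mad_outliers(values, threshold=3.5):
    """Flag observations far from the median, using a MAD-based score.

    Assumption: score = 0.6745 * |x - median| / MAD, flagged above 3.5.
    The median and MAD are robust, so one extreme value barely moves them.
    """
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    mad = statistics.median([abs(v - med) for v in observed])
    if mad == 0:
        return []  # no spread around the median; the score is undefined
    return [v for v in observed if 0.6745 * abs(v - med) / mad > threshold]


measurements = [4.1, 3.9, None, 4.3, 4.0, 19.5, 4.2, None, 3.8]
share_missing = missing_fraction(measurements)
extreme = mad_outliers(measurements)
```

Whether the missing values are missing at random cannot be decided from this fraction alone; it only indicates how much depends on that question.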
The quality of the measurements
When the quality of the measurement instruments used is not the main focus of the research, it can be checked during initial data analysis. One way to assess the quality of a measurement instrument is to perform an analysis of homogeneity (internal consistency). A homogeneity index such as Cronbach's α gives an indication of the reliability of a measurement instrument.
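
As a minimal sketch, Cronbach's α for k items can be computed from the standard formula α = k/(k-1) · (1 - Σ item variances / variance of the total score). The item scores below are invented for illustration, and the function name is an assumption.

```python
import statistics


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of items.

    item_scores: one list per item, each holding one score per respondent,
    with respondents in the same order in every list.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("need at least two items")
    item_variances = [statistics.variance(item) for item in item_scores]
    # Total score per respondent: sum across items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores of five respondents on three items.
items = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]
alpha = cronbach_alpha(items)
```

Values of α close to 1 indicate that the items vary together, which is what a homogeneous instrument should show.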
The implementation of the design
In many cases, a check of whether the randomization procedure has worked will be the starting point for analyzing the implementation of the design. This can be done by checking whether variables are equally distributed across groups. Other ways of checking the implementation of the design are manipulation checks and the analysis of nonresponse and dropout.
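
One simple way to check whether a variable is equally distributed across two groups is the standardized mean difference. The sketch below is an illustration under assumptions: the ages are invented, and the choice of this particular balance measure is ours, not something the text prescribes.

```python
import statistics


def standardized_mean_difference(group_a, group_b):
    """Difference in group means, divided by the pooled standard deviation."""
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    pooled_sd = ((statistics.variance(group_a) + statistics.variance(group_b)) / 2) ** 0.5
    return (mean_a - mean_b) / pooled_sd


# Hypothetical covariate (age) in a treatment group and a control group.
treatment_age = [34, 29, 41, 38, 30, 36]
control_age = [33, 31, 40, 37, 28, 35]
smd = standardized_mean_difference(treatment_age, control_age)
```

A value near zero suggests that randomization balanced the covariate; larger absolute values point to a possible problem with the procedure.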
Characteristics of the data sample
In this step, the findings of the initial data analysis are documented and possible corrective actions are taken. For instance, when the distribution of a variable is not normal, the data may need to be transformed or categorized. Furthermore, a decision should be made on how to handle missing data and outliers. If the randomization procedure seems to be defective, propensity scores can be calculated and included in the main analyses as covariates.
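
As an example of transforming a variable that is not normally distributed, the sketch below applies a logarithmic transformation to a right-skewed variable and compares the skewness before and after. The income figures are invented, and the log transformation is only one possible choice, assumed here for illustration.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed variable with one very large value.
incomes = [12, 15, 18, 20, 22, 25, 30, 45, 80, 250]
log_incomes = [math.log(v) for v in incomes]

skew_before = skewness(incomes)
skew_after = skewness(log_incomes)
```

Here the transformed variable is still skewed to the right, but less so than the original, which is the typical effect of a log transformation on such data.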


