Data set

Data set

Discussion
Ask a question about 'Data set'
Start a new discussion about 'Data set'
Answer questions from other users
Full Discussion Forum
 
Encyclopedia
A data set is a collection of data
Data
The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which...

, usually presented in tabular
Table (database)
In relational databases and flat file databases, a table is a set of data elements that is organized using a model of vertical columns and horizontal rows. A table has a specified number of columns, but can have any number of rows...

 form. Each column
Column (database)
In the context of a relational database table, a column is a set of data values of a particular simple type, one for each row of the table. The columns provide the structure according to which the rows are composed....

 represents a particular variable. Each row
Row (database)
In the context of a relational database, a row—also called a record or tuple—represents a single, implicitly structured data item in a table. In simple terms, a database table can be thought of as consisting of rows and columns or fields...

 corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of random number
Random number
Random number may refer to:* A number generated for or part of a set exhibiting statistical randomness.* A random sequence obtained from a stochastic process.* An algorithmically random sequence in algorithmic information theory....

s. Each value is known as a datum
Datum
A geodetic datum is a reference from which measurements are made. In surveying and geodesy, a datum is a set of reference points on the Earth's surface against which position measurements are made, and an associated model of the shape of the earth to define a geographic coordinate system...

. The data set may comprise data for one or more members, corresponding to the number of rows.

Nontabular data sets can take the form of marked up
Markup language
A markup language is a modern system for annotating a text in a way that is syntactically distinguishable from that text. The idea and terminology evolved from the "marking up" of manuscripts, i.e. the revision instructions by editors, traditionally written with a blue pencil on authors' manuscripts...

 strings of characters, such as an XML
XML
Extensible Markup Language is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards....

 file.

Almost all data sets, although they may often be written using high-level languages and base-ten numbers
Decimal
The decimal numeral system has ten as its base. It is the numerical base most widely used by modern civilizations....

, end up transcoded into machine language for processing by the computer
Computer
A computer is a programmable machine designed to sequentially and automatically carry out a sequence of arithmetic or logical operations. The particular sequence of operations can be changed readily, allowing the computer to solve more than one kind of problem...

s involved. Thus, for all their semantic diversity and tabular or nontabular forms, most datasets can be expressed in binary code
Binary code
A binary code is a way of representing text or computer processor instructions by the use of the binary number system's two-binary digits 0 and 1. This is accomplished by assigning a bit string to each particular symbol or instruction...

 as a long string of zeros and ones.

History


Historically, the term originated in the mainframe field
Mainframe computer
Mainframes are powerful computers used primarily by corporate and governmental organizations for critical applications, bulk data processing such as census, industry and consumer statistics, enterprise resource planning, and financial transaction processing.The term originally referred to the...

, where it had a well-defined meaning
Data set (IBM mainframe)
data set , dataset , is a computer file having a record organization. The term pertains to the IBM mainframe operating system line, starting with OS/360, and is still used by its successors, including the current z/OS. Those systems historically preferred this term over a file...

, very close to contemporary computer file
Computer file
A computer file is a block of arbitrary information, or resource for storing information, which is available to a computer program and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished...

. This topic is not covered here.

Properties


A data set has several characteristics which define its structure and properties. These include the number and types of the attributes or variables and the various statistical measures which may be applied to them such as standard deviation
Standard deviation
Standard deviation is a widely used measure of variability or diversity used in statistics and probability theory. It shows how much variation or "dispersion" there is from the average...

 and kurtosis
Kurtosis
In probability theory and statistics, kurtosis is any measure of the "peakedness" of the probability distribution of a real-valued random variable...

.

In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list. In spite of the name, such a univariate
Univariate
In mathematics, univariate refers to an expression, equation, function or polynomial of only one variable. Objects of any of these types but involving more than one variable may be called multivariate...

 data set is not a set in the usual mathematical sense, since a given value may occur multiple times. Normally the order does not matter, and then the collection of values may be considered to be a multiset
Multiset
In mathematics, the notion of multiset is a generalization of the notion of set in which members are allowed to appear more than once...

 rather than an (ordered) list.

The values may be numbers, such as real number
Real number
In mathematics, a real number is a value that represents a quantity along a continuum, such as -5 , 4/3 , 8.6 , √2 and π...

s or integer
Integer
The integers are formed by the natural numbers together with the negatives of the non-zero natural numbers .They are known as Positive and Negative Integers respectively...

s, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement
Level of measurement
The "levels of measurement", or scales of measure are expressions that typically refer to the theory of scale types developed by the psychologist Stanley Smith Stevens. Stevens proposed his theory in a 1946 Science article titled "On the theory of scales of measurement"...

. For each variable, the values will normally all be of the same kind. However, there may also be "missing values
Missing values
In statistics, missing data, or missing values, occur when no data value is stored for the variable in the current observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data....

", which need to be indicated in some way.

In statistics
Statistics
Statistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....

 data sets usually come from actual observations obtained by sampling
Sampling (statistics)
In statistics and survey methodology, sampling is concerned with the selection of a subset of individuals from within a population to estimate characteristics of the whole population....

 a statistical population
Statistical population
A statistical population is a set of entities concerning which statistical inferences are to be drawn, often based on a random sample taken from the population. For example, if we were interested in generalizations about crows, then we would describe the set of crows that is of interest...

, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software such as PSPP
PSPP
PSPP is a free software application for analysis of sampled data. It has a graphical user interface and conventional command line interface. It is written in C, uses GNU Scientific Library for its mathematical routines, and plotutils for generating graphs....

 still present their data in the classical data set fashion.

Classic data sets


Several classic data sets have been used extensively in the statistical literature:
  • Iris flower data set
    Iris flower data set
    The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Sir Ronald Aylmer Fisher as an example of discriminant analysis...

     - multivariate data set introduced by Ronald Fisher
    Ronald Fisher
    Sir Ronald Aylmer Fisher FRS was an English statistician, evolutionary biologist, eugenicist and geneticist. Among other things, Fisher is well known for his contributions to statistics by creating Fisher's exact test and Fisher's equation...

     (1936).
  • Categorical data analysis - Data sets used in the book, An Introduction to Categorical Data Analysis, by Agresti are provided on-line by StatLib.
  • Robust statistics
    Robust statistics
    Robust statistics provides an alternative approach to classical statistical methods. The motivation is to produce estimators that are not unduly affected by small departures from model assumptions.- Introduction :...

    - Data sets used in Robust Regression and Outlier Detection (Rousseeuw
    Peter Rousseeuw
    Peter J. Rousseeuw is a Belgian statistician known for his work on robust statistics and cluster analysis.-Books:...

     and Leroy, 1986). Provided on-line at the University of Cologne.
  • Time series
    Time series
    In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones index or the annual flow volume of the...

    - Data used in Chatfield's book, The Analysis of Time Series, are provided on-line by StatLib.
  • Extreme values - Data used in the book, An Introduction to the Statistical Modeling of Extreme Values are provided on-line by Stuart Coles, the book's author. [Dead link]

  • Bayesian Data Analysis - Data used in the book are provided on-line by Andrew Gelman
    Andrew Gelman
    Andrew Gelman is an American statistician, professor of statistics and political science, and director of the Applied Statistics Center at Columbia University. He earned an S.B. in mathematics and in physics from MIT in 1986 and a Ph.D...

    , one of the book's authors.
  • The [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders Bupa liver data], used in several papers in the machine learning (data mining) literature.

External links