A
data set is a collection of
dataThe term data refers to qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which...
, usually presented in
tabularIn relational databases and flat file databases, a table is a set of data elements that is organized using a model of vertical columns and horizontal rows. A table has a specified number of columns, but can have any number of rows...
form. Each
columnIn the context of a relational database table, a column is a set of data values of a particular simple type, one for each row of the table. The columns provide the structure according to which the rows are composed....
represents a particular variable. Each
rowIn the context of a relational database, a row—also called a record or tuple—represents a single, implicitly structured data item in a table. In simple terms, a database table can be thought of as consisting of rows and columns or fields...
corresponds to a given member of the data set in question. Its values for each of the variables, such as height and weight of an object or values of
random numberRandom number may refer to:* A number generated for or part of a set exhibiting statistical randomness.* A random sequence obtained from a stochastic process.* An algorithmically random sequence in algorithmic information theory....
s. Each value is known as a
datumA geodetic datum is a reference from which measurements are made. In surveying and geodesy, a datum is a set of reference points on the Earth's surface against which position measurements are made, and an associated model of the shape of the earth to define a geographic coordinate system...
. The data set may comprise data for one or more members, corresponding to the number of rows.
Nontabular data sets can take the form of
marked upA markup language is a modern system for annotating a text in a way that is syntactically distinguishable from that text. The idea and terminology evolved from the "marking up" of manuscripts, i.e. the revision instructions by editors, traditionally written with a blue pencil on authors' manuscripts...
strings of characters, such as an
XMLExtensible Markup Language is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards....
file.
Almost all data sets, although they may often be written using high-level languages and
base-ten numbersThe decimal numeral system has ten as its base. It is the numerical base most widely used by modern civilizations....
, end up transcoded into machine language for processing by the
computerA computer is a programmable machine designed to sequentially and automatically carry out a sequence of arithmetic or logical operations. The particular sequence of operations can be changed readily, allowing the computer to solve more than one kind of problem...
s involved. Thus, for all their semantic diversity and tabular or nontabular forms, most datasets can be expressed in
binary codeA binary code is a way of representing text or computer processor instructions by the use of the binary number system's two-binary digits 0 and 1. This is accomplished by assigning a bit string to each particular symbol or instruction...
as a long string of zeros and ones.
History
Historically, the term originated in the
mainframe fieldMainframes are powerful computers used primarily by corporate and governmental organizations for critical applications, bulk data processing such as census, industry and consumer statistics, enterprise resource planning, and financial transaction processing.The term originally referred to the...
, where it had a
well-defined meaningdata set , dataset , is a computer file having a record organization. The term pertains to the IBM mainframe operating system line, starting with OS/360, and is still used by its successors, including the current z/OS. Those systems historically preferred this term over a file...
, very close to contemporary
computer fileA computer file is a block of arbitrary information, or resource for storing information, which is available to a computer program and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished...
. This topic is not covered here.
Properties
A data set has several characteristics which define its structure and properties. These include the number and types of the attributes or variables and the various statistical measures which may be applied to them such as
standard deviationStandard deviation is a widely used measure of variability or diversity used in statistics and probability theory. It shows how much variation or "dispersion" there is from the average...
and
kurtosisIn probability theory and statistics, kurtosis is any measure of the "peakedness" of the probability distribution of a real-valued random variable...
.
In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list. In spite of the name, such a
univariateIn mathematics, univariate refers to an expression, equation, function or polynomial of only one variable. Objects of any of these types but involving more than one variable may be called multivariate...
data set is not a set in the usual mathematical sense, since a given value may occur multiple times. Normally the order does not matter, and then the collection of values may be considered to be a
multisetIn mathematics, the notion of multiset is a generalization of the notion of set in which members are allowed to appear more than once...
rather than an (ordered) list.
The values may be numbers, such as
real numberIn mathematics, a real number is a value that represents a quantity along a continuum, such as -5 , 4/3 , 8.6 , √2 and π...
s or
integerThe integers are formed by the natural numbers together with the negatives of the non-zero natural numbers .They are known as Positive and Negative Integers respectively...
s, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a
level of measurementThe "levels of measurement", or scales of measure are expressions that typically refer to the theory of scale types developed by the psychologist Stanley Smith Stevens. Stevens proposed his theory in a 1946 Science article titled "On the theory of scales of measurement"...
. For each variable, the values will normally all be of the same kind. However, there may also be "
missing valuesIn statistics, missing data, or missing values, occur when no data value is stored for the variable in the current observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data....
", which need to be indicated in some way.
In
statisticsStatistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....
data sets usually come from actual observations obtained by
samplingIn statistics and survey methodology, sampling is concerned with the selection of a subset of individuals from within a population to estimate characteristics of the whole population....
a
statistical populationA statistical population is a set of entities concerning which statistical inferences are to be drawn, often based on a random sample taken from the population. For example, if we were interested in generalizations about crows, then we would describe the set of crows that is of interest...
, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software such as
PSPPPSPP is a free software application for analysis of sampled data. It has a graphical user interface and conventional command line interface. It is written in C, uses GNU Scientific Library for its mathematical routines, and plotutils for generating graphs....
still present their data in the classical data set fashion.
Classic data sets
Several classic data sets have been used extensively in the statistical literature:
- Iris flower data set
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Sir Ronald Aylmer Fisher as an example of discriminant analysis...
- multivariate data set introduced by Ronald FisherSir Ronald Aylmer Fisher FRS was an English statistician, evolutionary biologist, eugenicist and geneticist. Among other things, Fisher is well known for his contributions to statistics by creating Fisher's exact test and Fisher's equation...
(1936).
- Categorical data analysis - Data sets used in the book, An Introduction to Categorical Data Analysis, by Agresti are provided on-line by StatLib.
- Robust statistics
Robust statistics provides an alternative approach to classical statistical methods. The motivation is to produce estimators that are not unduly affected by small departures from model assumptions.- Introduction :...
- Data sets used in Robust Regression and Outlier Detection (RousseeuwPeter J. Rousseeuw is a Belgian statistician known for his work on robust statistics and cluster analysis.-Books:...
and Leroy, 1986). Provided on-line at the University of Cologne.
- Time series
In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Examples of time series are the daily closing value of the Dow Jones index or the annual flow volume of the...
- Data used in Chatfield's book, The Analysis of Time Series, are provided on-line by StatLib.
- Extreme values - Data used in the book, An Introduction to the Statistical Modeling of Extreme Values are provided on-line by Stuart Coles, the book's author. [Dead link]
- Bayesian Data Analysis - Data used in the book are provided on-line by Andrew Gelman
Andrew Gelman is an American statistician, professor of statistics and political science, and director of the Applied Statistics Center at Columbia University. He earned an S.B. in mathematics and in physics from MIT in 1986 and a Ph.D...
, one of the book's authors.
- The [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders Bupa liver data], used in several papers in the machine learning (data mining) literature.
External links