Data set



 
 


A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question and lists values for each of the variables, such as the height and weight of an object or values of random numbers. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows.
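
The row-and-column layout described above can be sketched in a few lines of Python. This is only an illustration: the variable names `height_cm` and `weight_kg`, the values, and the helper `column` are hypothetical and not part of any particular data set.

```python
# Hypothetical example data: each dict is one row (a member of the data set),
# and each key is one column (a variable).
data_set = [
    {"height_cm": 170.0, "weight_kg": 65.0},
    {"height_cm": 182.5, "weight_kg": 80.2},
    {"height_cm": 158.0, "weight_kg": 52.3},
]

def column(rows, variable):
    """Return the values of one variable, in row order."""
    return [row[variable] for row in rows]

n_members = len(data_set)           # number of rows
variables = list(data_set[0])       # column names, taken from the first row
heights = column(data_set, "height_cm")
```

Here each individual value, such as `170.0`, is one datum, and `n_members` equals the number of rows.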

Historically, the term originated in the mainframe field, where it had a well-defined meaning, very close to the contemporary computer file. This topic is not covered here.

In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list. In spite of the name, such a univariate data set is not a set in the usual mathematical sense, since a given value may occur multiple times. Normally the order does not matter, and then the collection of values may be considered to be a multiset rather than an (ordered) list.
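
The difference between a list, a set, and a multiset can be illustrated with Python's standard `collections.Counter`, used here as one possible multiset representation. The values are hypothetical.

```python
from collections import Counter

values = [3, 1, 3, 2, 1, 3]      # hypothetical univariate data set, as a list

as_set = set(values)             # a mathematical set: repeated values are lost
as_multiset = Counter(values)    # a multiset: keeps multiplicities, ignores order

# The same values in a different order give an equal multiset,
# although the two lists themselves are not equal.
reordered = [1, 1, 2, 3, 3, 3]
same_multiset = Counter(reordered) == as_multiset
same_list = reordered == values
```

In this sketch `as_multiset[3]` is the number of times the value 3 occurs, which is exactly the information a plain set would discard.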

The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values will normally all be of the same kind. However, there may also be "missing values", which need to be indicated in some way.

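One common way to indicate missing values is a dedicated marker that cannot be confused with a real observation. The sketch below assumes Python's `None` as that marker; the column names, the placeholder labels "A" and "B", and the helper `observed` are hypothetical.

```python
from statistics import mean

MISSING = None  # assumed marker for a missing value

# Hypothetical rows: one numeric variable and one nominal variable.
rows = [
    {"height_cm": 170, "ethnicity": "A"},
    {"height_cm": MISSING, "ethnicity": "B"},
    {"height_cm": 158, "ethnicity": MISSING},
]

def observed(rows, variable):
    """Return the values of one variable, skipping missing entries."""
    return [row[variable] for row in rows if row[variable] is not MISSING]

mean_height = mean(observed(rows, "height_cm"))   # uses the two observed heights
n_missing_ethnicity = sum(row["ethnicity"] is MISSING for row in rows)
```

Skipping missing entries is only one possible treatment; whether it is appropriate depends on the analysis.
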
In statistics, data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software, such as PSPP, still presents its data in the classical data set fashion.
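
Both ideas, sampling a population and generating data algorithmically, can be sketched with Python's standard `random` module. Everything here is an assumption made for illustration: the population is synthetic, and its size, the sample size, and the normal distribution parameters are arbitrary.

```python
import random

rng = random.Random(0)  # fixed seed so the generated data are reproducible

# Hypothetical population of 1000 elements, each with one synthetic measurement
# drawn from a normal distribution (mean 50, standard deviation 10).
population = [{"id": i, "value": rng.gauss(50.0, 10.0)} for i in range(1000)]

# Simple random sample without replacement: each sampled element
# becomes one row of the resulting data set.
sample = rng.sample(population, k=20)

sample_ids = [row["id"] for row in sample]
```

A generated data set like `population` is also the kind of algorithmic test input mentioned above, since the seed makes every run produce the same rows.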

Classic data sets

Several classic data sets have been used extensively in the statistical literature:

  • Iris flower data set - multivariate data set introduced by Ronald Fisher (1936).
  • Categorical data analysis - Data sets used in the book An Introduction to Categorical Data Analysis, by Agresti.
  • Robust statistics - Data sets used in Robust Regression and Outlier Detection (Rousseeuw and Leroy, 1986).
  • Time series - Data used in Chatfield's book, The Analysis of Time Series.
  • Extreme values - Data used in the book An Introduction to the Statistical Modeling of Extreme Values, provided by the book's author.
  • Bayesian Data Analysis - Data used in the book Bayesian Data Analysis, provided by one of the book's authors.
  • The Bupa liver data (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders), used in several papers in the machine learning (data mining) literature.


External links

  • The Global Change Master Directory contains more than 20,000 descriptions of Earth science data sets and services covering all aspects of Earth and environmental sciences.