Free statistical software
In this article, the word free generally means that the software can be legally obtained without paying any money (cf. free beer). Just a few of the software packages mentioned here are also free in the sense of free speech: they are not only open source but also free software, in the sense that the source code is freely available and can be freely modified by anyone who so desires, and also distributed to others, as long as the redistributed modifications remain free in exactly the same strong sense.

Free statistical software is a practical alternative to commercial packages. In general, free statistical software gives the same results as commercial programs, and many of the packages are fairly easy to learn because they use menu systems, although a few are command-driven. These packages come from a variety of sources, including governments, international organizations such as UNESCO, and universities, and some are developed by individuals.

Some packages are developed for specific purposes (e.g., time series analysis, factor analysis, calculators for probability distributions, etc.), while others are general packages, with a variety of statistical procedures. Others are meta-packages or statistical computing environments, which allow the user to code completely new statistical procedures. This article is a review of the general statistical packages.

Brief history of free statistical software

Some of the free software packages come from governmental or international organizations, such as Epi Info, from the CDC (Centers for Disease Control and Prevention), and IDAMS, from UNESCO. Other packages come from smaller or independent organizations or universities, such as Instat or Irristat. Another package, the R project, is being developed by a large group of volunteers all over the world. This package is notable in that it is not just open source but also free software, in the same sense that material written on Wikipedia is free: others can edit, use, and redistribute it at will.

A large proportion of free statistical software packages, however, are from individuals. Some of these software packages from individuals include EasyReg, MicrOsiris, OpenStat, PSPP, SOFA, and Zelig.

At least one package, WinIDAMS, was developed to make key technologies available to those who could not otherwise afford them, in order to empower development. OpenStat and Instat were developed as teaching aids. Other packages were developed for specific purposes but can be used more generally. Examples are Irristat, developed for agricultural analysis, and Epi Info, developed for public health. Several of the packages (PSPP, R and Osiris) do not appear to state why they were developed, other than for general statistical analysis.

These free software packages have been used in a number of scholarly publications. For example, OpenStat was used in a research letter to JAMA and in several published studies. Irristat was used in an agricultural report, EasyReg is listed or used in several papers, Epi Info was also used in several papers, R was used in a number of papers, and WinIDAMS was used in other papers.

While MicrOsiris does not appear to be used in academic research, the author of the program was one of the original authors of OSIRIS, which was the starting program from which WinIDAMS was developed. The author of MicrOsiris has also contributed or co-contributed several components to WinIDAMS.

Reviews of free statistical software

There are a few reviews of free statistical software. Two appeared in journals (but were not peer reviewed): one by Zhu and Kuljaca, and an article by Grant that consisted mainly of a brief review of R. Zhu and Kuljaca outlined some useful characteristics of software, such as ease of use, a good range of statistical procedures, and the ability to develop new procedures. They reviewed several programs and identified which ones, at that time, had the most functionality. At that time, several of the programs may not have had all of the capabilities desired for advanced statistics. Grant reviewed some of the programming features of R, and briefly mentioned the availability of other programs. One other paper reviewed statistical packages, mainly commercial ones, but included R. One article reviewed EasyReg and included a discussion of its accuracy.

Only one review has compared the output of various packages. In this review, all of the packages read either CSV files or Excel format. All of the packages gave exactly the same results for correlation and regression. The free software packages also gave the same regression results as Excel did. One of the main differences among the packages was how they handled missing data. With the example data sets used in the review, and for the package versions available in November 2006 when this review was conducted, two packages, MicrOsiris and Epi Info, could read files with blanks for missing values. Two other programs, Stat4U and WinIDAMS, needed a placeholder for missing values, such as -9 or -9.99. The other packages could only handle data sets with no missing values.

Two websites that list software also have very brief reviews of each package: StatCon and a site by Pezzullo. These sites mainly offer a brief list of the features available in the packages. Similarly, one other website compares the statistical procedures available in free statistical packages. In this review, R had all of the procedures, OpenStat had 16, MacAnova had 15, and MicrOsiris had 12. The others had from 8 to 11 of the procedures.

There is also a journal specifically for statistical software, although the main focus is on commercial software, R and some coding snippets.

In contrast, there are various reviews of commercial statistical software, such as a comparison between several major packages and a brief review of several packages.

Using free statistical software

Before using any statistical package, it is generally a good idea to have a solid background in statistics. The packages can then be used to best advantage, for example to choose the most appropriate test and to make sure all the necessary assumptions are met, so that appropriate conclusions can be drawn.

Once the statistical issues are understood, the next step is to decide which package to use. Most of these packages are menu driven and can be learned in a couple of hours at most. The exceptions are R, which is generally code driven and requires a much longer time to learn, and to some extent CDC's Epi Info, which also takes some time to learn.

Several of the packages also have tutorials, which give a basic introduction to the programs. For example, the CDC has tutorials about Epi Info. The CDC page also lists a video slide show tutorial from the University of Nebraska, and another site has online training classes. R has a large number of tutorials and manuals, in English and other languages, and a FAQ site. A few of the packages have email discussion lists, including R and PSPP.

Most of the packages have online manuals, guides or help pages. These manuals or guides are useful when there are questions about specific procedures or statistical tests. Manuals or guides are available for R, EasyReg, OpenStat, PSPP, Vista, WinIDAMS, MicrOsiris and Zelig. The CDC Epi Info site itself does not have a manual, but one faculty member from Emory's School of Public Health has an introductory manual.

Finally, there are a number of commercial packages, such as SAS, SPSS and many others. Most of the major commercial and free packages have many statistical procedures in common. The main reason to use free packages is probably the cost.

Menu driven packages

Many of the packages have some kind of opening menu that is used to get or enter the data, manipulate the data, and select the statistical analysis. After starting the program, people generally load data, either from previously saved data sets or by importing from some other format. Data files in various formats can be imported from this menu. For example, if the data is in CSV form (text with commas between values), the program recognizes the format and creates a data set from the CSV file. Finally, people can use the program to do some analysis: in the analysis menu, they select the variables of interest, along with other options, and then run the analysis to obtain the results.

Command driven packages

A few programs, like WinIDAMS, need commands for many of their procedures. WinIDAMS does have an interactive menu to read in data, but then specific statistical procedures need a set of text commands. For example, the text command lines for frequencies look like this:
$COMMENT basic freqs of testing data
$RUN TABLES
$FILES
DICTIN = PD_data_idams.dic
DATAIN = PD_data_idams.dat
$SETUP
FREQUENCY TABLES
PRINT=(CDICT)
TABLES
ROWVARS=(V21) CELLS=(ROWP,FREQS)


This set of commands identifies the procedure (TABLES), the data set and dictionary (PD_data_idams.dat and PD_data_idams.dic) and the variables. The procedures all have various options outlined in the manuals.
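
For readers more used to general-purpose languages, the kind of output requested above (frequencies and percentages for one variable) can be sketched in Python. This is only an illustration of the idea, not WinIDAMS syntax or its algorithm; the function name frequency_table and the sample codes standing in for variable V21 are made up for the example.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, percent) rows, sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, n, 100.0 * n / total) for value, n in sorted(counts.items())]


# Hypothetical codes for a single variable (a stand-in for V21).
v21 = [1, 2, 2, 3, 1, 2, 3, 3, 3, 2]

for value, n, pct in frequency_table(v21):
    # One line per category: code, frequency, percentage.
    print(f"{value:>5} {n:>5} {pct:6.1f}%")
```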

R can be used in a menu-driven way, as a programming language, and interactively as an interpreter.

Getting data

Most packages are able to import data from Excel or CSV (text with commas separating values).

One consideration is whether there are missing data. Some packages, like PSPP and MicrOsiris, can automatically deal with the missing data. So for example, say one set of data look like this:
Name    Age   Sex   Born in US   Degree
Joe     31    M     Yes          BA
Sam           M     No           MS
Sally   28    F                  Ph.D.

In this data set, Sam is missing age, and Sally is missing whether she was born in the USA. When some packages, like PSPP or MicrOsiris, read in or import the original data set, the packages will recognize that those values are missing, and do their calculations accordingly. MicrOsiris automatically assigns 1.5 or 1.6 billion to blanks as missing, and these values are excluded from analysis.
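
The idea of treating blanks as missing can be sketched in Python with the standard csv module. This is a minimal illustration using the example data above written out as CSV text; it does not describe how PSPP or MicrOsiris work internally, and the helper names read_with_blanks and mean_of are hypothetical.

```python
import csv
import io

# The example data set as CSV text; empty fields mark missing values.
CSV_TEXT = """Name,Age,Sex,Born in US,Degree
Joe,31,M,Yes,BA
Sam,,M,No,MS
Sally,28,F,,Ph.D.
"""


def read_with_blanks(text):
    """Read CSV rows, turning empty fields into None (missing)."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]


def mean_of(rows, column):
    """Mean of a numeric column, skipping rows where the value is missing."""
    values = [float(row[column]) for row in rows if row[column] is not None]
    return sum(values) / len(values) if values else None


rows = read_with_blanks(CSV_TEXT)
print(mean_of(rows, "Age"))  # computed from Joe and Sally only
```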

Other packages need a placeholder, such as '-9', where there is missing data. Before the package is used to read the data, the data set has to be edited to put in a placeholder wherever data are missing. So for example:
Name    Age   Sex   Born in US   Degree
Joe     31    M     Yes          BA
Sam     -9    M     No           MS
Sally   28    F     -9           Ph.D.

The data set now includes '-9', and people reading in the data need to tell the program that -9 means missing data.
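
Why the placeholder has to be declared can be seen in a short Python sketch. It assumes the example rows above and uses a hypothetical helper, recode_missing; each package has its own way of declaring missing-value codes, which is not shown here.

```python
MISSING_CODE = "-9"  # the placeholder used in the example data


def recode_missing(rows, missing_code=MISSING_CODE):
    """Replace the placeholder with None so it is not treated as a real value."""
    return [[None if cell == missing_code else cell for cell in row] for row in rows]


rows = [
    ["Joe", "31", "M", "Yes", "BA"],
    ["Sam", "-9", "M", "No", "MS"],
    ["Sally", "28", "F", "-9", "Ph.D."],
]

# Without recoding, -9 is averaged in as if it were Sam's age.
naive_ages = [int(row[1]) for row in rows]
naive_mean = sum(naive_ages) / len(naive_ages)

# With recoding, the missing age is skipped.
cleaned = recode_missing(rows)
ages = [int(row[1]) for row in cleaned if row[1] is not None]
clean_mean = sum(ages) / len(ages)

print(naive_mean, clean_mean)
```

The naive mean is pulled well below either observed age, which is the kind of silent error that declaring the missing-value code is meant to prevent.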

Limitations of packages

Most of the packages have limitations of some sort.

Variables in WinIDAMS are limited to 9 digits in length, so longer values have to be manipulated before analysis. In the version of PSPP current as of April 2009, there are a limited number of procedures available, including means, frequencies, crosstabs, two non-parametric tests, t-tests, ANOVA and basic regression. In addition, the output is, apparently, not easy to use, as it cannot be copied and pasted to other applications, and it is not clear where, in Windows Vista, the output is saved. Several of the programs, including EasyReg, Epidata and Instat, do not appear to handle missing data or do not handle it well. While Epi Info has many statistical procedures, correlation is not one of them. Rather, correlation is found by regression. This means that Epi Info will not produce a single table showing correlations among multiple variables. According to the Zelig installation manual, use of Zelig requires that R and several of its libraries already be installed, and the installation also requires some degree of background in R. One limit of MicrOsiris is in handling the output. When calculations are complete, the output pages through the results, but various menu boxes also appear over the results, and so the results cannot be accessed. The output can be saved, though, as a text file and then used.
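
The remark that correlation can be found by regression rests on a standard identity for simple linear regression of y on x: the Pearson correlation equals the slope multiplied by the ratio of the standard deviations of x and y. The Python sketch below illustrates that identity on made-up numbers; it is not Epi Info's procedure, and the function name slope_and_correlation is hypothetical.

```python
import math


def slope_and_correlation(x, y):
    """Fit y = a + b*x by least squares and derive Pearson's r from the slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    # r = slope * (sd of x / sd of y); the n - 1 factors cancel.
    r = slope * math.sqrt(sxx / syy)
    return slope, r


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
slope, r = slope_and_correlation(x, y)
print(slope, r)
```

Because each regression gives the correlation for only one pair of variables, a full correlation table would require repeating this for every pair.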

One limitation is specific to programs that were developed by individuals. Support for these programs is limited to the time that the author has available. While the authors may, and often do, respond fairly quickly when there are few people asking questions, if too many people ask questions or the author is otherwise busy, support would correspondingly be slower.

R is both written by and used by a large number of people all over the world, and many Internet forums and other online facilities can be used to get support from other users. While R is powerful, the learning curve can be rather steep for those not already familiar with other kinds of scientific programming.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 