Biased sample

In statistics, sampling bias occurs when a sample is collected in such a way that some members of the intended population are less likely to be included than others. It results in a biased sample, a non-random sample of a population (or of non-human factors) in which not all individuals, or instances, were equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling.

Medical sources sometimes refer to sampling bias as ascertainment bias. Ascertainment bias has basically the same definition, but is still sometimes classified as a separate type of bias.

Distinction from selection bias

Sampling bias is mostly classified as a subtype of selection bias, sometimes specifically termed sample selection bias, but some classify it as a separate type of bias. A distinction, albeit not universally accepted, is that sampling bias undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.

However, selection bias and sampling bias are often used synonymously.

Types of sampling bias

  • Selection from a specific area. For example, a survey of high school students to measure teenage use of illegal drugs will be a biased sample because it does not include home-schooled students or dropouts. This is an extreme form of biased sampling, because certain members of the population are totally excluded from the sample (that is, they have zero probability of being selected). A sample is also biased if certain members are underrepresented or overrepresented relative to others in the population. For example, a "man on the street" interview that selects people who walk by a certain location will overrepresent healthy individuals, who are more likely to be out of the home than individuals with a chronic illness.
  • Self-selection bias, which is possible whenever the group of people being studied has any form of control over whether to participate. Participants' decision to participate may be correlated with traits that affect the study, making the participants a non-representative sample. For example, people who have strong opinions or substantial knowledge may be more willing to spend time answering a survey than those who do not. Another example is online and phone-in polls, which are biased samples because the respondents are self-selected. Individuals who are highly motivated to respond, typically those who have strong opinions, are overrepresented, and individuals who are indifferent or apathetic are less likely to respond. This often leads to a polarization of responses, with extreme perspectives being given disproportionate weight in the summary. As a result, these types of polls are regarded as unscientific.
  • Pre-screening of trial participants, or advertising for volunteers within particular groups. For example, a study to "prove" that smoking does not affect fitness might recruit at the local fitness center, but advertise for smokers during the advanced aerobics class and for non-smokers during the weight loss sessions.
  • Exclusion bias results from exclusion of particular groups from the sample, e.g. exclusion of subjects who have recently migrated into the study area (this may occur when newcomers are not available in a register used to identify the source population). Excluding subjects who move out of the study area during follow-up is instead equivalent to dropout or nonresponse, a selection bias in that it affects the internal validity of the study.
  • Healthy user bias, when the study population is likely healthier than the general population, e.g. workers (someone in ill health is unlikely to have a job as a manual laborer).
  • Overmatching, matching for an apparent confounder that actually is a result of the exposure. The control group becomes more similar to the cases in regard to exposure than the general population.

Symptom-based sampling

The study of medical conditions begins with anecdotal reports. By their nature, such reports only include those referred for diagnosis and treatment. A child who can't function in school is more likely to be diagnosed with dyslexia than a child who struggles but passes. A child examined for one condition is more likely to be tested for and diagnosed with other conditions, skewing comorbidity statistics. As certain diagnoses become associated with behavior problems or intellectual disability, parents try to prevent their children from being stigmatized with those diagnoses, introducing further bias. Studies carefully selected from whole populations are showing that many conditions are much more common and usually much milder than formerly believed.

Truncate selection in pedigree studies

Geneticists are limited in how they can obtain data from human populations. As an example, consider a human characteristic. We are interested in deciding if the characteristic is inherited as a simple Mendelian trait. Following the laws of Mendelian inheritance, if the parents in a family do not have the characteristic but carry the allele for it, they are carriers (non-expressing heterozygotes). In this case their children will each have a 25% chance of showing the characteristic. The problem arises because we can't tell which families have both parents as carriers (heterozygous) unless they have a child who exhibits the characteristic. The description follows the textbook by Sutton.

The figure shows the pedigrees of all the possible families with two children when the parents are carriers (Aa).
  • Nontruncate selection. In a perfect world we should be able to discover all such families carrying the gene, including those whose members are simply carriers. In this situation the analysis would be free from ascertainment bias and the pedigrees would be under "nontruncate selection". In practice, most studies identify, and include, families in a study based upon their having affected individuals.
  • Truncate selection. When afflicted individuals have an equal chance of being included in a study this is called truncate selection, signifying the inadvertent exclusion (truncation) of families who are carriers for a gene. Because selection is performed on the individual level, families with two or more affected children would have a higher probability of becoming included in the study.
  • Complete truncate selection is a special case where each family with an affected child has an equal chance of being selected for the study.


The probabilities of each of the families being selected are given in the figure, with the sample frequency of affected children also given. In this simple case, the researcher will look for a frequency of 4/7 (complete truncate selection) or 5/8 (truncate selection) for the characteristic, rather than the 1/4 expected under nontruncate selection.
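The expected frequencies for the two-child case can be checked with a short calculation. The following is a minimal sketch, not taken from the textbook: the function name and the weighting scheme are illustrative assumptions. In particular, truncate selection is modeled here by assuming a family's chance of entering the study is proportional to its number of affected children, and complete truncate selection by giving every family with at least one affected child the same chance.

```python
from fractions import Fraction
from math import comb


def affected_frequency(n_children, p_affected, family_weight):
    """Expected share of affected children among the families that enter a study.

    family_weight(k) is the relative chance (an assumed model, see above) that
    a family with k affected children is ascertained; 0 means never ascertained.
    """
    affected = Fraction(0)
    total = Fraction(0)
    for k in range(n_children + 1):
        # Binomial probability that a carrier x carrier family has k affected children.
        p_family = comb(n_children, k) * p_affected**k * (1 - p_affected) ** (n_children - k)
        w = p_family * family_weight(k)
        affected += w * k
        total += w * n_children
    return affected / total


p = Fraction(1, 4)  # Mendelian chance that a child of two carriers is affected

nontruncate = affected_frequency(2, p, lambda k: 1)
complete_truncate = affected_frequency(2, p, lambda k: 1 if k >= 1 else 0)
truncate = affected_frequency(2, p, lambda k: k)

print(nontruncate, complete_truncate, truncate)  # 1/4 4/7 5/8
```

Under these assumptions, both forms of truncate selection inflate the apparent frequency well above the Mendelian 1/4, because families with no affected children never enter the sample.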

The caveman effect

An example of selection bias is called the "caveman effect." Much of our understanding of prehistoric peoples comes from caves, such as cave paintings made nearly 40,000 years ago. If there had been contemporary paintings on trees, animal skins or hillsides, they would have been washed away long ago. Similarly, evidence of fire pits, middens, burial sites, etc. is most likely to remain intact to the modern era in caves. Prehistoric people are associated with caves because that is where the data still exists, not necessarily because most of them lived in caves for most of their lives.

Problems caused by sampling bias

A biased sample causes problems because any statistic computed from that sample has the potential to be consistently erroneous. The bias can lead to an over- or underestimation of the corresponding parameter in the population. Almost every sample in practice is biased because it is practically impossible to ensure a perfectly random sample. If the degree of underrepresentation is small, the sample can be treated as a reasonable approximation to a random sample. Also, if the group that is underrepresented does not differ markedly from the other groups in the quantity being measured, then the sample can still be a reasonable approximation to a random sample.
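The two conditions above (a small degree of underrepresentation, or little difference between groups) can be illustrated with a rough large-sample calculation. This is a sketch under simplifying assumptions: the population is split into groups with known shares, each member of a group has the same inclusion probability, and all numbers below are hypothetical.

```python
def population_mean(groups):
    """True mean, given (population_share, inclusion_probability, group_mean) tuples."""
    return sum(share * mean for share, _incl, mean in groups)


def expected_sample_mean(groups):
    """Approximate large-sample mean when groups are included at different rates."""
    num = sum(share * incl * mean for share, incl, mean in groups)
    den = sum(share * incl for share, incl, _mean in groups)
    return num / den


# Hypothetical population: two equally sized groups with means 10 and 20.
strongly_biased = [(0.5, 1.0, 10.0), (0.5, 0.5, 20.0)]  # second group half as likely to be sampled
mildly_biased = [(0.5, 1.0, 10.0), (0.5, 0.9, 20.0)]    # slight underrepresentation
similar_groups = [(0.5, 1.0, 15.0), (0.5, 0.5, 15.0)]   # groups do not differ in the quantity measured

for label, groups in [("strong", strongly_biased), ("mild", mildly_biased), ("similar", similar_groups)]:
    bias = expected_sample_mean(groups) - population_mean(groups)
    print(f"{label}: bias = {bias:.3f}")
```

In this sketch the bias shrinks as the inclusion probabilities approach each other, and vanishes when the groups share the same mean, regardless of how unevenly they are sampled.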

The word bias in common usage has a strong negative connotation, and implies a deliberate intent to mislead or other scientific fraud. In statistical usage, bias merely represents a mathematical property, whether it is deliberate, unconscious, or due to imperfections in the instruments used for observation. While some individuals might deliberately use a biased sample to produce misleading results, more often a biased sample is just a reflection of the difficulty in obtaining a truly representative sample.

Some samples use a biased statistical design which nevertheless allows the estimation of parameters. The U.S. National Center for Health Statistics, for example, deliberately oversamples from minority populations in many of its nationwide surveys in order to gain sufficient precision for estimates within these groups. These surveys require the use of sample weights (see below) to produce proper estimates across all racial and ethnic groups. Provided that certain conditions are met (chiefly that the sample is drawn randomly from the entire population), these samples permit accurate estimation of population parameters.

Historical examples

A classic example of a biased sample and the misleading results it produced occurred in 1936. In the early days of opinion polling, the American Literary Digest magazine collected over two million postal surveys and predicted that the Republican candidate in the U.S. presidential election, Alf Landon, would beat the incumbent president, Franklin Roosevelt, by a large margin. The result was the exact opposite. The Literary Digest survey represented a sample collected from readers of the magazine, supplemented by records of registered automobile owners and telephone users. This sample overrepresented wealthy individuals, who, as a group, were more likely to vote for the Republican candidate. In contrast, a poll of only 50 thousand citizens selected by George Gallup's organization successfully predicted the result, leading to the popularity of the Gallup poll.

Another classic example occurred in the 1948 presidential election. On election night, the Chicago Tribune printed the headline DEWEY DEFEATS TRUMAN, which turned out to be mistaken. In the morning the grinning President-Elect, Harry S. Truman, was photographed holding a newspaper bearing this headline. The reason the Tribune was mistaken is that its editor trusted the results of a phone survey. Survey research was then in its infancy, and few academics realized that a sample of telephone users was not representative of the general population. Telephones were not yet widespread, and those who had them tended to be prosperous and have stable addresses. (In many cities, the Bell System telephone directory contained the same names as the Social Register.) In addition, the Gallup poll that the Tribune based its headline on was over two weeks old at the time of the printing.

Statistical corrections for a biased sample

If entire segments of the population are excluded from a sample, then there are no adjustments that can produce estimates that are representative of the entire population. But if some groups are underrepresented and the degree of underrepresentation can be quantified, then sample weights can correct the bias.

For example, a hypothetical population might include 10 million men and 10 million women. Suppose that a biased sample of 100 patients included 20 men and 80 women. A researcher could correct for this imbalance by attaching a weight of 2.5 for each male and 0.625 for each female. This would adjust any estimates to achieve the same expected value as a sample that included exactly 50 men and 50 women, unless men and women differed in their likelihood of taking part in the survey.
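The weights in this example can be reproduced with a few lines of code. The sketch below is illustrative only: the function names are made up for this example, and the outcome values (a measurement of 180 for every man and 165 for every woman) are hypothetical numbers chosen to make the effect of weighting visible.

```python
def sample_weights(population_counts, sample_counts):
    """Weight per group = population share divided by sample share."""
    total_pop = sum(population_counts.values())
    total_sample = sum(sample_counts.values())
    weights = {}
    for group, pop_count in population_counts.items():
        n = sample_counts.get(group, 0)
        if n == 0:
            # A group that is entirely missing from the sample cannot be reweighted.
            raise ValueError(f"group {group!r} is absent from the sample")
        weights[group] = (pop_count / total_pop) / (n / total_sample)
    return weights


def weighted_mean(values, groups, weights):
    """Mean of values where each observation carries its group's weight."""
    num = sum(weights[g] * v for v, g in zip(values, groups))
    den = sum(weights[g] for g in groups)
    return num / den


population = {"men": 10_000_000, "women": 10_000_000}
sample = {"men": 20, "women": 80}
weights = sample_weights(population, sample)  # roughly 2.5 for men, 0.625 for women

# Hypothetical measurements: 180 for each man, 165 for each woman.
groups = ["men"] * 20 + ["women"] * 80
values = [180.0] * 20 + [165.0] * 80

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, groups, weights)
print(unweighted, weighted)
```

With these made-up values the unweighted mean is pulled toward the overrepresented group, while the weighted mean matches what a sample of 50 men and 50 women would give. The explicit error for an empty group reflects the point made earlier: no weight can stand in for a segment of the population that was excluded entirely.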

See also

  • Cherry picking (fallacy)
  • File drawer problem
  • Friendship paradox
  • Reporting bias
  • Spectrum bias

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 