Theil–Sen estimator
In non-parametric statistics, the Theil–Sen estimator, also known as Sen's slope estimator, slope selection, the single median method, or the Kendall robust line-fit method, is a method for robust linear regression that chooses the median slope among all lines through pairs of two-dimensional sample points. It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively. It can be computed efficiently and is insensitive to outliers; it can be significantly more accurate than simple linear regression for skewed and heteroskedastic data, and competes well against simple least squares even for normally distributed data. It has been called "the most popular nonparametric technique for estimating a linear trend".

Definition

As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points (x_i, y_i) is the median of the slopes determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two samples have the same x-coordinate. In Sen's definition, one takes the median of the slopes defined only from pairs of points having distinct x-coordinates.

Once the slope m has been determined, one may determine a line through the sample points (x_i, y_i) by setting the y-intercept b to be the median of the values y_i − m x_i. As Sen observed, this slope is the value that makes the Kendall tau rank correlation coefficient between the x-coordinates x_i and the residuals y_i − m x_i approximately zero.
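The definition above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not a reference implementation: the function name `theil_sen` and the choice to represent samples as a list of `(x, y)` tuples are assumptions made for the example.

```python
from itertools import combinations
from statistics import median


def theil_sen(points):
    """Return (slope, intercept) of the Theil-Sen line for (x, y) pairs.

    Brute force: examines all O(n^2) pairs of points.
    """
    # Sen's definition: only pairs with distinct x-coordinates contribute.
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x-coordinates")
    m = median(slopes)
    # Intercept: median of the values y_i - m * x_i.
    b = median(y - m * x for x, y in points)
    return m, b


# Four points on y = 2x + 1 plus one gross outlier.
samples = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)]
slope, intercept = theil_sen(samples)
```

For these five samples the fitted line is y = 2x + 1: six of the ten pairwise slopes equal 2, so the outlier at x = 4 does not move the median.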

A confidence interval for the slope estimate may be determined as the interval containing the middle 95% of the slopes of lines determined by pairs of points, and may be estimated quickly by sampling pairs of points and determining the 95% interval of the sampled slopes. According to simulations, approximately 600 sample pairs are sufficient to determine an accurate confidence interval.
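The sampling procedure described above might look roughly as follows. This sketch follows the description in the text literally (the middle 95% of sampled pairwise slopes); the function name, the default of 600 pairs taken from the text, the `seed` parameter, and the nearest-rank rule used to pick the interval endpoints are all assumptions of the example.

```python
import math
import random


def sampled_slope_interval(points, n_pairs=600, coverage=0.95, seed=None):
    """Estimate the interval holding the middle `coverage` of pairwise slopes."""
    if len({x for x, _ in points}) < 2:
        raise ValueError("need at least two distinct x-coordinates")
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < n_pairs:
        # Draw two different sample points uniformly at random.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 != x2:  # skip pairs with an undefined slope
            slopes.append((y2 - y1) / (x2 - x1))
    slopes.sort()
    tail = (1.0 - coverage) / 2.0
    # Nearest-rank endpoints, rounded outward.
    lo = slopes[math.floor(tail * (len(slopes) - 1))]
    hi = slopes[math.ceil((1.0 - tail) * (len(slopes) - 1))]
    return lo, hi


# Example: noiseless points on y = 3x + 1 give a degenerate interval.
line = [(x, 3 * x + 1) for x in range(10)]
interval = sampled_slope_interval(line, seed=0)
```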

Variations

A variation of the Theil–Sen estimator, the repeated median estimator of Siegel, determines, for each sample point (x_i, y_i), the median of the slopes of lines through that point, and then determines the overall estimator as the median of these medians.
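The median-of-medians variation can be sketched as below. The function name `repeated_median_slope` and the `(x, y)` tuple representation are assumptions of the example, and the quadratic-time loop is chosen for clarity rather than efficiency.

```python
from statistics import median


def repeated_median_slope(points):
    """Median, over all points, of the median slope through each point."""
    per_point_medians = []
    for i, (xi, yi) in enumerate(points):
        # Slopes of the lines from point i to every other point.
        slopes = [
            (yj - yi) / (xj - xi)
            for j, (xj, yj) in enumerate(points)
            if j != i and xj != xi
        ]
        if slopes:
            per_point_medians.append(median(slopes))
    if not per_point_medians:
        raise ValueError("need at least two points with distinct x-coordinates")
    return median(per_point_medians)


# Four points on y = 2x + 1 plus one gross outlier.
slope = repeated_median_slope([(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)])
```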

A different variant pairs up sample points by the rank of their x-coordinates (the point with the smallest coordinate being paired with the first point above the median coordinate, and so on) and computes the median of the slopes of the lines determined by these pairs of points.
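A possible reading of this rank-pairing variant is sketched below. The handling of an odd number of points (the middle point is left unpaired) is an assumption of the example, since the text does not specify it; the function name is also hypothetical.

```python
from statistics import median


def paired_median_slope(points):
    """Pair the k-th smallest x with the k-th x above the median; take the median slope."""
    pts = sorted(points, key=lambda p: p[0])
    half = len(pts) // 2
    lower = pts[:half]               # points below the median x-coordinate
    upper = pts[len(pts) - half:]    # points above it (middle point dropped if n is odd)
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(lower, upper)
        if x1 != x2
    ]
    if not slopes:
        raise ValueError("need at least one pair with distinct x-coordinates")
    return median(slopes)


# Five points on y = 2x plus one outlier at x = 5.
slope = paired_median_slope([(0, 0), (1, 2), (2, 4), (3, 6), (4, 8), (5, 50)])
```

Only about n/2 slopes are computed, so this variant is cheaper than examining all pairs.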

Variations of the Theil–Sen estimator based on weighted medians have also been studied, based on the principle that pairs of samples whose x-coordinates differ more greatly are more likely to have an accurate slope and therefore should receive a higher weight.
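One way to realize the weighted-median idea is shown below. The specific weight used here, the absolute difference |x_j − x_i|, is an assumption chosen for illustration; the text only requires that more widely separated pairs receive more weight. The helper names are hypothetical.

```python
from itertools import combinations


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    ordered = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in ordered:
        running += weight
        if running >= half:
            return value
    raise ValueError("no values given")


def weighted_theil_sen_slope(points):
    """Weighted median of pairwise slopes, weighting each pair by its x-separation."""
    slopes, weights = [], []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 != x2:
            slopes.append((y2 - y1) / (x2 - x1))
            weights.append(abs(x2 - x1))  # assumed weighting scheme
    return weighted_median(slopes, weights)


# Five points on y = 2x + 1 plus one gross outlier.
slope = weighted_theil_sen_slope([(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 100)])
```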

For seasonal data, it may be appropriate to smooth out seasonal variations in the data by considering only pairs of sample points that both belong to the same month or the same season of the year, and finding the median of the slopes of the lines determined by this more restrictive set of pairs.
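The seasonal restriction can be sketched by grouping observations by season before forming pairs. The input format, a list of `(season, x, y)` tuples, and the function name are assumptions of the example.

```python
from collections import defaultdict
from itertools import combinations
from statistics import median


def seasonal_theil_sen_slope(observations):
    """Median slope over pairs of observations that share a season label."""
    by_season = defaultdict(list)
    for season, x, y in observations:
        by_season[season].append((x, y))
    # Only pairs drawn from the same season contribute a slope.
    slopes = [
        (y2 - y1) / (x2 - x1)
        for pts in by_season.values()
        for (x1, y1), (x2, y2) in combinations(pts, 2)
        if x1 != x2
    ]
    if not slopes:
        raise ValueError("no same-season pairs with distinct x-coordinates")
    return median(slopes)


# Trend of 2 per time step, with the "dry" season offset upward by 10.
data = [("wet", 0, 0), ("dry", 1, 12), ("wet", 2, 4),
        ("dry", 3, 16), ("wet", 4, 8), ("dry", 5, 20)]
slope = seasonal_theil_sen_slope(data)
```

Because cross-season pairs are excluded, the constant offset between the two seasons in this example does not affect the estimated trend.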

Statistical properties

The Theil–Sen estimator is an unbiased estimator of the true slope in simple linear regression. For many distributions of the response error, this estimator has high asymptotic efficiency relative to least-squares estimation. Estimators with low efficiency require more independent observations to attain the same sample variance as efficient unbiased estimators.

The Theil–Sen estimator is more robust than the least-squares estimator because it is much less sensitive to outliers. It has a breakdown point of 1 − 1/√2 ≈ 29.3%, meaning that it can tolerate arbitrary corruption of up to 29.3% of the input data points without degradation of its accuracy. However, the breakdown point decreases for higher-dimensional generalizations of the method. A higher breakdown point, 50%, holds for the repeated median estimator of Siegel.
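The 29.3% figure can be motivated by a short heuristic calculation, given here as a sketch for large samples rather than a formal proof. Suppose a fraction ε of the n sample points is corrupted (ε is a symbol introduced for this illustration). A pairwise slope is trustworthy only when both of its endpoints are uncorrupted, which happens for roughly a fraction (1 − ε)² of all pairs. The median slope stays within the range of trustworthy slopes as long as those pairs form a majority:

```latex
\[
  (1-\varepsilon)^2 > \tfrac{1}{2}
  \quad\Longleftrightarrow\quad
  \varepsilon < 1 - \frac{1}{\sqrt{2}} \approx 0.293 .
\]
```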

The Theil–Sen estimator is equivariant under every linear transformation of its response variable, but is not equivariant under affine transformations of both the predictor and response variables.

Algorithms

The median slope of a set of sample points may be computed exactly by computing all lines through pairs of points and then applying a linear-time median-finding algorithm, or it may be estimated by sampling pairs of points. It is equivalent, under projective duality, to the problem of finding the crossing point in an arrangement of lines that has the median x-coordinate among all such crossing points.
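The duality mentioned above can be made concrete with one common convention, assumed here for illustration: map each sample point (a_i, b_i) to the dual line y = a_i x − b_i. Two dual lines cross where their right-hand sides agree, and the x-coordinate of that crossing is exactly the slope of the line through the two original points:

```latex
\[
  a_i x - b_i = a_j x - b_j
  \quad\Longrightarrow\quad
  x = \frac{b_i - b_j}{a_i - a_j}
  \qquad (a_i \neq a_j).
\]
```

The median pairwise slope therefore corresponds to the crossing point with the median x-coordinate in the dual arrangement.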

The problem of performing slope selection exactly but more efficiently than the brute-force quadratic-time algorithm has been extensively studied in computational geometry. Several different methods are known for computing the Theil–Sen estimator exactly in O(n log n) time, either deterministically or using randomized algorithms. Siegel's repeated median estimator can also be constructed efficiently in the same time bound.
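The sketch below is not one of the published slope-selection algorithms. It only illustrates a building block that fast methods of this kind can rely on: counting how many pairwise slopes lie below a candidate value m in O(n log n) time, without enumerating the pairs. It assumes all x-coordinates are distinct, and the function names are hypothetical. After sorting by x, a pair has slope below m exactly when the values y − m·x appear in decreasing order, so the count equals the number of inversions, which merge sort can tally.

```python
def _sort_and_count(seq):
    """Merge sort that also returns the number of strict inversions."""
    if len(seq) <= 1:
        return list(seq), 0
    mid = len(seq) // 2
    left, left_count = _sort_and_count(seq[:mid])
    right, right_count = _sort_and_count(seq[mid:])
    merged, count = [], left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] is smaller than every remaining element of left.
            merged.append(right[j])
            j += 1
            count += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, count


def count_slopes_below(points, m):
    """Number of point pairs whose connecting line has slope < m (distinct x assumed)."""
    ordered = sorted(points, key=lambda p: p[0])
    residuals = [y - m * x for x, y in ordered]
    return _sort_and_count(residuals)[1]


# Pairwise slopes of these points are 1, 2, 3, 3, 4, 5.
parabola = [(0, 0), (1, 1), (2, 4), (3, 9)]
below = count_slopes_below(parabola, 3.5)
```

With such a counting routine, the rank of any candidate slope can be determined quickly, which is the kind of query a search for the median slope needs.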

An estimator for the slope with approximately median rank, having the same breakdown point as the Theil–Sen estimator, may be maintained in the data stream model (in which the sample points are processed one by one by an algorithm that does not have enough persistent storage to represent the entire data set) using an algorithm based on ε-nets.

Applications

Theil–Sen estimation has been applied to astronomy due to its ability to handle censored regression models. In biophysics, its use has been suggested for remote sensing applications such as the estimation of leaf area from reflectance data due to its "simplicity in computation, analytical estimates of confidence intervals, robustness to outliers, testable assumptions regarding residuals and ... limited a priori information regarding measurement errors". For measuring seasonal environmental data such as water quality, a seasonally adjusted variant of the Theil–Sen estimator has been proposed as preferable to least squares estimation due to its high precision in the presence of skewed data. In computer science, the Theil–Sen method has been used to estimate trends in software aging.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 