Krippendorff's Alpha
Krippendorff's alpha coefficient is a statistical measure of the agreement achieved when coding a set of units of analysis in terms of the values of a variable. Since the 1970s, alpha has been used in content analysis, where textual units are categorized by trained readers; in counseling and survey research, where experts code open-ended interview data into analyzable terms; in psychological testing, where alternative tests of the same phenomena need to be compared; and in observational studies, where unstructured happenings are recorded for subsequent analysis.

Krippendorff's alpha generalizes several known statistics, often called measures of inter-coder agreement, inter-rater reliability, or reliability of coding (as distinct from unitizing), but it also distinguishes itself from statistics that claim to measure reliability yet are unsuitable for assessing the reliability of coding or of the data it generates.

Krippendorff’s alpha is applicable to any number of coders, each assigning one value to one unit of analysis, to incomplete (missing) data, to any number of values available for coding a variable, to binary, nominal, ordinal, interval, ratio, polar, and circular metrics (Levels of Measurement), and it adjusts itself to small sample sizes of the reliability data. The virtue of a single coefficient with these variations is that computed reliabilities are comparable across any numbers of coders and values, different metrics, and unequal sample sizes.

Software for calculating Krippendorff's alpha is available.

Reliability data

Reliability data are generated in a situation in which m ≥ 2 jointly instructed (e.g., by a code book) but independently working coders assign any one of a set of values c or k of a variable to a common set of N units of analysis. In their canonical form, reliability data are tabulated in an m-by-N matrix containing the values c_{iu} or k_{ju} that coder i or j has assigned to unit u. When data are incomplete, some cells in this matrix are empty or missing; hence, the number m_u of values assigned to unit u may vary. Reliability data require that values be pairable within units, i.e., m_u ≥ 2. The total number of pairable values is n ≤ mN.
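To make the canonical form concrete, the following minimal Python sketch (variable and function names are illustrative, not part of Krippendorff's terminology) stores the 3-coder-by-15-unit reliability data used in the computational example later in this article and counts its pairable values:

```python
# Canonical m-by-N reliability-data matrix: rows are coders, columns are units.
# None marks a missing value ("*" in the worked example below).
reliability_data = [
    [None, None, None, None, None, 3, 4, 1, 2, 1, 1, 3, 3, None, 3],  # coder A
    [1, None, 2, 1, 3, 3, 4, 3, None, None, None, None, None, None, None],  # coder B
    [None, None, 2, 1, 3, 4, 4, None, 2, 1, 1, 3, 3, None, 4],  # coder C
]

def pairable_values(data):
    """Values in units coded by at least two coders (m_u >= 2)."""
    values = []
    for u in range(len(data[0])):
        column = [row[u] for row in data if row[u] is not None]
        if len(column) >= 2:      # units with fewer than two values are not pairable
            values.extend(column)
    return values

print(len(pairable_values(reliability_data)))  # prints 26, i.e. n = 26
```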

General form of alpha

\alpha = 1 - \frac{D_o}{D_e}

where the disagreement within unit u,

D_u = \frac{1}{m_u(m_u - 1)} \sum_{c} \sum_{k \ne c} n_{cku}\,\delta(c,k) ,

is the average difference δ(c,k) between two values c and k over all m_u(m_u–1) pairs of values possible within unit u – without reference to coders. Here n_{cku} denotes the number of c-k pairs of values found in unit u, and δ(c,k) is a function of the metric of the variable, see below. The observed disagreement

D_o = \frac{1}{n} \sum_{u=1}^{N} m_u D_u

is the average of the N within-unit disagreements D_u, each weighted by the number m_u of values in its unit. And the expected disagreement

D_e = \frac{1}{n(n-1)} \sum_{c} \sum_{k \ne c} n_c\, n_k\, \delta(c,k)

is the average difference between any two values c and k over all n(n–1) pairs of values possible within the reliability data – without reference to coders and units. In effect, D_e is the disagreement that is expected when the values used by all coders are randomly assigned to the given set of units.

One interpretation of Krippendorff's alpha is:
α = 1 indicates perfect reliability
α = 0 indicates the absence of reliability. Units and the values assigned to them are statistically unrelated
α < 0 when disagreements are systematic and exceed what can be expected by chance.
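In this general form, alpha can be computed directly, if inefficiently, by enumerating pairs of values within units and within the whole body of data. A minimal Python sketch under the definitions above (function names are illustrative; `reliability_data` refers to the sketch of the canonical matrix given earlier):

```python
from itertools import permutations

def alpha_general(data, delta):
    """Krippendorff's alpha in its general form.
    data: m-by-N list of lists with None for missing values.
    delta: difference function with delta(c, c) == 0 and delta(c, k) >= 0."""
    units = []
    for u in range(len(data[0])):
        column = [row[u] for row in data if row[u] is not None]
        if len(column) >= 2:                     # only pairable units contribute
            units.append(column)
    values = [c for column in units for c in column]
    n = len(values)

    # Observed disagreement: average delta over the pairs within units,
    # each unit's pairs weighted by 1/(m_u - 1).
    D_o = sum(delta(c, k) / (len(col) - 1)
              for col in units for c, k in permutations(col, 2)) / n

    # Expected disagreement: average delta over all n(n-1) pairs of values,
    # without reference to coders and units.
    D_e = sum(delta(c, k) for c, k in permutations(values, 2)) / (n * (n - 1))

    return 1.0 - D_o / D_e

nominal_delta = lambda c, k: 0.0 if c == k else 1.0   # nominal metric, see below
# alpha_general(reliability_data, nominal_delta) gives about 0.691 for the
# example data introduced above.
```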


In this general form, the disagreements D_o and D_e may be conceptually transparent but are computationally inefficient. They can be simplified algebraically, especially when expressed in terms of the visually more instructive coincidence-matrix representation of the reliability data.

Coincidence matrices

A coincidence matrix cross-tabulates the n pairable values from the canonical form of the reliability data into a v-by-v square matrix, where v is the number of values available in a variable. Unlike contingency matrices, familiar in association and correlation statistics, which tabulate pairs of values (cross tabulation), a coincidence matrix tabulates all pairable values. A coincidence matrix omits references to coders and is symmetrical around its diagonal, which contains all perfect matches, c = k. The matrix of observed coincidences contains the frequencies

o_{ck} = \sum_{u} \frac{\text{number of } c\text{-}k \text{ pairs of values in unit } u}{m_u - 1} = o_{kc} ,

n_c = \sum_{k} o_{ck} , \quad n_k = \sum_{c} o_{ck} , \quad \text{and} \quad n = \sum_{c} n_c = \sum_{k} n_k .


Because a coincidence matrix tabulates all pairable values and its contents sum to the total n, o_{ck} may be fractions when some units are coded by three or more coders.

The matrix of expected coincidences contains the frequencies

e_{ck} = \frac{1}{n-1} \begin{cases} n_c(n_c - 1) & \text{if } c = k \\ n_c\, n_k & \text{if } c \ne k \end{cases} = e_{kc} ,

which sum to the same n_c, n_k, and n as does o_{ck}. In terms of these coincidences, Krippendorff's alpha becomes:

\alpha = 1 - \frac{D_o}{D_e} = 1 - \frac{\sum_{c} \sum_{k} o_{ck}\,\delta(c,k)}{\sum_{c} \sum_{k} e_{ck}\,\delta(c,k)} .
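The coincidence-matrix form translates directly into a sketch like the following (again with illustrative names; `delta` is any difference function of the kind defined in the next section, and `data` is a canonical m-by-N matrix with None for missing values):

```python
from collections import defaultdict
from itertools import permutations

def coincidence_matrix(data):
    """Observed coincidences o_ck from an m-by-N matrix (None = missing)."""
    o = defaultdict(float)
    for u in range(len(data[0])):
        column = [row[u] for row in data if row[u] is not None]
        m_u = len(column)
        if m_u < 2:
            continue                          # unpairable units are ignored
        for c, k in permutations(column, 2):
            o[(c, k)] += 1.0 / (m_u - 1)      # each ordered pair, weighted by 1/(m_u - 1)
    return dict(o)

def alpha_from_coincidences(o, delta):
    """alpha = 1 - sum(o_ck * delta) / sum(e_ck * delta)."""
    values = sorted({v for pair in o for v in pair})
    n_c = {c: sum(o.get((c, k), 0.0) for k in values) for c in values}
    n = sum(n_c.values())
    observed = sum(o.get((c, k), 0.0) * delta(c, k) for c in values for k in values)
    expected = sum((n_c[c] * (n_c[c] - 1) if c == k else n_c[c] * n_c[k])
                   * delta(c, k) / (n - 1) for c in values for k in values)
    return 1.0 - observed / expected
# For the example data and the nominal metric, this yields about 0.691,
# matching the worked example below.
```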

Difference functions

Difference functions between values c and k reflect the metric properties (Levels of Measurement) of their variable.

In general:

\delta(c,c) = 0 \quad \text{and} \quad \delta(c,k) = \delta(k,c) \ge 0 .

In particular:
For nominal data \delta_{nominal}(c,k) = 0 if c = k and 1 otherwise, where c and k serve as names.
For ordinal data \delta_{ordinal}(c,k) = \left( \sum_{g=c}^{k} n_g - \frac{n_c + n_k}{2} \right)^2, where c, k, and g are ranks.

For interval data \delta_{interval}(c,k) = (c - k)^2, where c and k are interval scale values.

For ratio data \delta_{ratio}(c,k) = \left( \frac{c - k}{c + k} \right)^2, where c and k are absolute values.
For polar data \delta_{polar}(c,k) = \frac{(c - k)^2}{(c + k - 2c_{min})(2c_{max} - c - k)}, where c_{min} and c_{max} define the end points of the polar scale.
For circular data \delta_{circular}(c,k) = \left( \sin \frac{180°(c - k)}{U} \right)^2, where the sine function is expressed in degrees and U is the circumference or the range of values in a circle or loop before they repeat. For equal-interval circular metrics, the smallest and largest integer values of this metric are adjacent to each other and U = c_{largest} – c_{smallest} + 1.
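The difference functions above translate directly into code. A minimal sketch, assuming numeric values; the ordinal, polar, and circular functions take their additional quantities (the rank frequencies n_g, the scale end points, and the circumference U) as extra arguments:

```python
import math

def nominal(c, k):
    return 0.0 if c == k else 1.0

def interval(c, k):
    return float(c - k) ** 2

def ratio(c, k):
    return 0.0 if c == k else ((c - k) / (c + k)) ** 2

def ordinal(c, k, freq):
    """freq maps each rank g to its frequency n_g in the reliability data."""
    lo, hi = min(c, k), max(c, k)
    between = sum(n_g for g, n_g in freq.items() if lo <= g <= hi)
    return (between - (freq[c] + freq[k]) / 2.0) ** 2

def polar(c, k, c_min, c_max):
    if c == k:
        return 0.0
    return (c - k) ** 2 / ((c + k - 2 * c_min) * (2 * c_max - c - k))

def circular(c, k, U):
    return math.sin(math.pi * (c - k) / U) ** 2   # sin of 180°(c - k)/U, in radians
```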

Significance

Inasmuch as mathematical statements of the statistical distribution of alpha are always only approximations, it is preferable to obtain alpha's distribution by bootstrapping. Alpha's distribution gives rise to two indices:
  • The confidence intervals of a computed alpha at various levels of statistical significance

  • The probability that alpha could be below a chosen minimum, required for data to be considered sufficiently reliable (one-tailed test). This index acknowledges that the null-hypothesis (of chance agreement) is so far removed from the range of relevant alpha coefficients that its rejection would mean little regarding how reliable given data are. To be judged reliable, data must not significantly deviate from perfect agreement.


The minimum acceptable alpha coefficient should be chosen according to the importance of the conclusions to be drawn from imperfect data. When the costs of mistaken conclusions are high, the minimum alpha needs to be set high as well. In the absence of knowledge of the risks of drawing false conclusions from unreliable data, social scientists commonly rely on data with reliabilities α ≥ 0.800, consider data with 0.800 > α ≥ 0.667 only to draw tentative conclusions, and discard data whose agreement measures α < 0.667.
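Alpha's distribution is typically obtained with a bootstrap procedure tailored to reliability data; the following is only a generic sketch (not Krippendorff's own algorithm) that resamples units with replacement to make the two indices concrete. `alpha_fn` is assumed to be a function such as the `alpha_general` sketch above, and `alpha_min` defaults to the 0.667 convention just mentioned:

```python
import random

def bootstrap_alpha(data, delta, alpha_fn, alpha_min=0.667, trials=1000, seed=1):
    """Approximate alpha's distribution by resampling units with replacement."""
    rng = random.Random(seed)
    n_units = len(data[0])
    samples = []
    for _ in range(trials):
        cols = [rng.randrange(n_units) for _ in range(n_units)]
        resample = [[row[u] for u in cols] for row in data]
        try:
            samples.append(alpha_fn(resample, delta))
        except ZeroDivisionError:          # degenerate resample, e.g. no variation left
            continue
    samples.sort()
    ci_95 = (samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))])
    p_below = sum(a < alpha_min for a in samples) / len(samples)
    return ci_95, p_below   # 95% confidence interval and Pr(alpha < alpha_min)
```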

A misunderstanding of Krippendorff's alpha has become an instructive public controversy.

A computational example

Let the canonical form of reliability data be a 3-coder-by-15-unit matrix with 45 cells:

Units u:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Coder A:   *  *  *  *  *  3  4  1  2  1  1  3  3  *  3
Coder B:   1  *  2  1  3  3  4  3  *  *  *  *  *  *  *
Coder C:   *  *  2  1  3  4  4  *  2  1  1  3  3  *  4


Suppose "*" indicates a default category like "cannot code," "no answer," or "lacking an observation." Then * provides no information about the reliability of data in the four values that matter. Note that units 2 and 14 contain no information and unit 1 contains only one value, which is not pairable within that unit. Thus, these reliability data consist not of mN = 45 but of n = 26 pairable values, located not in N = 15 but in 12 multiply coded units.

The coincidence matrix for these data would be constructed as follows:

o_{11} = 2/(2-1) {in u=4} + 2/(2-1) {in u=10} + 2/(2-1) {in u=11} = 6
o_{13} = 1/(2-1) {in u=8} = 1 = o_{31}
o_{22} = 2/(2-1) {in u=3} + 2/(2-1) {in u=9} = 4
o_{33} = 2/(2-1) {in u=5} + 2/(3-1) {in u=6} + 2/(2-1) {in u=12} + 2/(2-1) {in u=13} = 7
o_{34} = 2/(3-1) {in u=6} + 1/(2-1) {in u=15} = 2 = o_{43}
o_{44} = 6/(3-1) {in u=7} = 3

Values c or k:     1    2    3    4    n_c
Value 1:           6    .    1    .     7
Value 2:           .    4    .    .     4
Value 3:           1    .    7    2    10
Value 4:           .    .    2    3     5
Frequency n_k:     7    4   10    5    26


In terms of the entries in this coincidence matrix, Krippendorff's alpha may be calculated from:

\alpha = 1 - \frac{(n-1) \sum_{c} \sum_{k>c} o_{ck}\,\delta(c,k)}{\sum_{c} \sum_{k>c} n_c\, n_k\,\delta(c,k)} .


For convenience, because products with δ(c,c) = 0 exclude c = k pairs from being counted and coincidences are symmetrical, only the entries in one of the off-diagonal triangles of the coincidence matrix are listed in the following:

\alpha = 1 - \frac{(26-1)\,[\,o_{13}\,\delta(1,3) + o_{34}\,\delta(3,4)\,]}{n_1 n_2\,\delta(1,2) + n_1 n_3\,\delta(1,3) + n_1 n_4\,\delta(1,4) + n_2 n_3\,\delta(2,3) + n_2 n_4\,\delta(2,4) + n_3 n_4\,\delta(3,4)}

Considering that all \delta_{nominal}(c,k) = 1 for c ≠ k, the above expression yields:

\alpha_{nominal} = 1 - \frac{25\,(1 + 2)}{28 + 70 + 35 + 40 + 20 + 50} = 1 - \frac{75}{243} = 0.691

With \delta_{interval}(1,3) = 2^2 = 4, \delta_{interval}(3,4) = 1^2 = 1, and \delta_{interval}(c,k) = (c - k)^2 for the remaining pairs, the above expression yields:

\alpha_{interval} = 1 - \frac{25\,(1 \cdot 4 + 2 \cdot 1)}{28 \cdot 1 + 70 \cdot 4 + 35 \cdot 9 + 40 \cdot 1 + 20 \cdot 4 + 50 \cdot 1} = 1 - \frac{150}{793} = 0.811

Here, \alpha_{interval} > \alpha_{nominal} because disagreements happen to occur largely among neighboring values, visualized by occurring closer to the diagonal of the coincidence matrix, a condition that \alpha_{interval} takes into account but \alpha_{nominal} does not. When the observed frequencies o_{c \ne k} are on the average proportional to the expected frequencies e_{c \ne k}, \alpha_{interval} = \alpha_{nominal}.

Comparing alpha coefficients across different metrics can provide clues to how coders conceptualize the metric of a variable.
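The arithmetic of this example can be checked mechanically. A short sketch (illustrative names) that hard-codes the off-diagonal observed coincidences and the marginal frequencies from the matrix above:

```python
# Upper-triangle observed coincidences and marginals from the example.
o = {(1, 3): 1, (3, 4): 2}                  # all other off-diagonal o_ck are zero
n_marg = {1: 7, 2: 4, 3: 10, 4: 5}
n = 26

def alpha_example(delta):
    values = sorted(n_marg)
    num = (n - 1) * sum(o_ck * delta(c, k) for (c, k), o_ck in o.items())
    den = sum(n_marg[c] * n_marg[k] * delta(c, k)
              for i, c in enumerate(values) for k in values[i + 1:])
    return 1.0 - num / den

nominal = lambda c, k: 0.0 if c == k else 1.0
interval = lambda c, k: float(c - k) ** 2

print(round(alpha_example(nominal), 3))     # 0.691
print(round(alpha_example(interval), 3))    # 0.811
```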

Alpha's embrace of other statistics

Krippendorff's alpha brings several known statistics under a common umbrella; each of them has its own limitations but no additional virtues.
  • Scott's pi is an agreement coefficient for nominal data and two coders:

\pi = \frac{P_o - P_e}{1 - P_e} ,

where P_o = \sum_c \frac{o_{cc}}{n} is the observed proportion of agreement and P_e = \sum_c \left( \frac{n_c}{n} \right)^2 is the proportion expected by chance.

When data are nominal, alpha reduces to a form resembling Scott's pi:

\alpha_{nominal} = \frac{P_o - \sum_c \frac{n_c(n_c - 1)}{n(n-1)}}{1 - \sum_c \frac{n_c(n_c - 1)}{n(n-1)}}

Scott's observed proportion of agreement P_o appears in alpha's numerator, exactly. Scott's expected proportion of agreement, P_e = \sum_c (n_c / n)^2, is asymptotically approximated by \sum_c \frac{n_c(n_c - 1)}{n(n-1)} when the sample size n is large, and equal to it when n is infinite. It follows that Scott's pi is that special case of alpha in which two coders generate a very large sample of nominal data. For finite sample sizes, \alpha_{nominal} \ge \pi, and in the limit \lim_{n \to \infty} \alpha_{nominal} = \pi (see the numerical sketch at the end of this section).

  • Fleiss' kappa is an agreement coefficient for nominal data, a fixed number of m coders, each coding all of N units without exception, and very large sample sizes. Fleiss claimed to have extended Cohen's kappa to three or more raters or coders, but generalized Scott's pi instead. This confusion is reflected in Fleiss' choice of its name, which has been recognized by renaming it K:

K = \frac{\bar P - \bar P_e}{1 - \bar P_e} ,

where \bar P = \frac{1}{Nm(m-1)} \sum_{u=1}^{N} \sum_c n_{cu}(n_{cu} - 1) and \bar P_e = \sum_c \left( \frac{n_c}{n} \right)^2, with n_{cu} the number of times value c occurs in unit u and n = mN.

When sample sizes are finite, K can be seen to perpetrate the inconsistency of obtaining the proportion of observed agreements \bar P by counting matches within the m(m-1) possible pairs of values within u, properly excluding values paired with themselves, while the proportion \bar P_e is obtained by counting matches within all (mN)^2 = n^2 possible pairs of values, effectively including values paired with themselves. It is the latter that introduces a bias into the coefficient. However, just as for pi, when sample sizes become very large this bias disappears and the proportion \sum_c \frac{n_c(n_c - 1)}{n(n-1)} in \alpha_{nominal} above asymptotically approximates \bar P_e in K. Nevertheless, Fleiss' kappa, or rather K, intersects with alpha in that special situation in which a fixed number of m coders code all of N units (no data are missing), using nominal categories, and the sample size n = mN is very large, theoretically infinite.

  • Spearman's rank correlation coefficient rho measures the agreement between two coders' ranking of the same set of N objects. In its original form:

\rho = 1 - \frac{6 \sum_u D_u^2}{N(N^2 - 1)} ,

where D_u = c - k is the difference between one coder's rank c and the other coder's rank k of the same object u. Whereas alpha accounts for tied ranks in terms of their frequencies for all coders, rho averages them in each individual coder's instance. In the absence of ties, rho's numerator and denominator correspond, up to constant factors, to the observed and expected disagreements of the ordinal form of alpha with n = 2N, and the two coefficients converge as sample sizes become large. So, Spearman's rho is that special case of alpha in which two coders rank a very large set of units.

  • Pearson's intraclass correlation coefficient r_ii is an agreement coefficient for interval data, two coders, and very large sample sizes. To obtain it, Pearson's original suggestion was to enter the observed pairs of values twice into a table, once as c-k and once as k-c, to which the traditional Pearson product-moment correlation coefficient is then applied. By entering pairs of values twice, the resulting table becomes a coincidence matrix without reference to the two coders, contains n = 2N values, and is symmetrical around the diagonal, i.e., the joint linear regression line is forced into a 45° line and references to coders are eliminated. Hence, Pearson's intraclass correlation coefficient is that special case of interval alpha for two coders and large sample sizes; the two coincide as the sample size becomes very large.

  • Finally, the disagreements in the interval alpha, D_u, D_o, and D_e, are proper sample variances. It follows that the reliability the interval alpha assesses is consistent with all variance-based analytical techniques, such as the analysis of variance. Moreover, by incorporating difference functions not just for interval data but also for nominal, ordinal, ratio, polar, and circular data, alpha extends the notion of variance to metrics (levels of measurement) that classical analytical techniques rarely address.


Evidently, Krippendorff's alpha is more general than any of these special-purpose coefficients. It adjusts to varying sample sizes and affords comparisons across a great variety of reliability data, mostly ignored by the familiar measures.
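The asymptotic relationship between Scott's pi and nominal alpha claimed above can be illustrated numerically. A rough sketch with made-up two-coder data (the data and function names are purely illustrative):

```python
from collections import Counter

def scott_pi(pairs):
    """Scott's pi for two coders; pairs is a list of (c, k) value pairs."""
    N = len(pairs)
    p_o = sum(c == k for c, k in pairs) / N
    counts = Counter([c for c, _ in pairs] + [k for _, k in pairs])
    n = 2 * N
    p_e = sum((n_c / n) ** 2 for n_c in counts.values())
    return (p_o - p_e) / (1 - p_e)

def nominal_alpha(pairs):
    """Nominal alpha for the same complete two-coder data."""
    N = len(pairs)
    n = 2 * N
    counts = Counter([c for c, _ in pairs] + [k for _, k in pairs])
    D_o = 1 - sum(c == k for c, k in pairs) / N
    D_e = 1 - sum(m * (m - 1) for m in counts.values()) / (n * (n - 1))
    return 1 - D_o / D_e

pairs = [(1, 1), (1, 2), (2, 2), (2, 2), (3, 3), (3, 2), (1, 1), (3, 3)]  # made-up data
print(scott_pi(pairs), nominal_alpha(pairs))  # alpha (about 0.647) slightly exceeds
                                              # pi (about 0.624); the gap shrinks as N grows
```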

Coefficients incompatible with alpha and the reliability of coding

Semantically, reliability is the ability to rely on something, here on coded data for subsequent analysis. When a sufficiently large number of coders agree perfectly on what they have read or observed, relying on their descriptions is a safe bet. Judgments of this kind hinge on the number of coders duplicating the process and how representative the coded units are of the population of interest. Problems of interpretation arise when agreement is less than perfect, especially when reliability is absent.
  • Correlation and association coefficients. Pearson's product-moment correlation coefficient r_ij, for example, measures deviations from any linear regression line between the coordinates of i and j. Unless that regression line happens to be exactly 45° or centered, r_ij does not measure agreement. Similarly, while perfect agreement between coders also means perfect association, association statistics register any above-chance pattern of relationships between variables. They do not distinguish agreement from other associations and are, hence, unsuitable as reliability measures.

  • Coefficients measuring the degree to which coders are statistically dependent on each other. When the reliability of coded data is at issue, the individuality of coders can have no place in it. Coders need to be treated as interchangeable. Alpha, Scott's pi, and Pearson's original intraclass correlation accomplish this by being definable as functions of coincidences, not only of contingencies. Unlike the more familiar contingency matrices, which tabulate N pairs of values and maintain reference to the two coders, coincidence matrices tabulate the n pairable values used in coding, regardless of who contributed them, in effect treating coders as interchangeable. Cohen's kappa, by contrast, defines expected agreement in terms of contingencies, as the agreement that would be expected if coders were statistically independent of each other. Cohen's conception of chance fails to include disagreements between coders' individual predilections for particular categories, punishes coders who agree on their use of categories, and rewards those who do not agree with higher kappa values. This is the cause of other noted oddities of kappa. The statistical independence of coders is only marginally related to the statistical independence of the units coded and the values assigned to them. Cohen's kappa, by ignoring crucial disagreements, can become deceptively large when the reliability of coding data is to be assessed (see the sketch after this list).

  • Coefficients measuring the consistency of coder judgments. In the psychometric literature, reliability tends to be defined as the consistency with which several tests perform when applied to a common set of individual characteristics. Cronbach's alpha, for example, is designed to assess the degree to which multiple tests produce correlated results. Perfect agreement is the ideal, of course, but Cronbach's alpha is high also when test results vary systematically. Consistency of coders' judgments does not provide the needed assurances of data reliability. Any deviation from identical judgments – systematic or random – needs to count as disagreement and reduce the measured reliability. Cronbach's alpha is not designed to respond to absolute differences.

  • Coefficients with baselines (conditions under which they measure 0) that cannot be interpreted in terms of reliability, i.e., that have no dedicated value to indicate when the units and the values assigned to them are statistically unrelated. Simple %-agreement ranges from 0 = extreme disagreement to 100 = perfect agreement, with chance having no definite value. As already noted, Cohen's kappa falls into this category by defining the absence of reliability as the statistical independence between two individual coders. The baseline of Bennett, Alpert, and Goldstein's S is defined in terms of the number of values available for coding, which has little to do with how values are actually used. Goodman and Kruskal's lambda_r is defined to vary between –1 and +1, leaving 0 without a particular reliability interpretation. Lin's reproducibility or concordance coefficient r_c takes Pearson's product-moment correlation r_ij as a measure of precision and adds to it a measure C_b of accuracy, ostensibly to correct for r_ij's above-mentioned inadequacy. It varies between –1 and +1, and the reliability interpretation of 0 is uncertain. There are more so-called reliability measures whose reliability interpretations become questionable as soon as they deviate from perfect agreement.
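To make the contrast between contingency-based and coincidence-based expected agreement concrete, here is a rough sketch (made-up two-coder data, illustrative names) in which the two coders favor the categories with very different frequencies; Cohen's kappa comes out considerably larger than nominal alpha:

```python
from collections import Counter

def cohen_kappa(pairs):
    """Cohen's kappa: chance agreement from each coder's own marginal frequencies."""
    N = len(pairs)
    p_o = sum(c == k for c, k in pairs) / N
    a = Counter(c for c, _ in pairs)              # coder A's marginals
    b = Counter(k for _, k in pairs)              # coder B's marginals
    p_e = sum((a[v] / N) * (b[v] / N) for v in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

def nominal_alpha(pairs):
    """Nominal alpha: chance agreement from the pooled values of both coders."""
    N = len(pairs)
    n = 2 * N
    counts = Counter([c for c, _ in pairs] + [k for _, k in pairs])
    D_o = 1 - sum(c == k for c, k in pairs) / N
    D_e = 1 - sum(m * (m - 1) for m in counts.values()) / (n * (n - 1))
    return 1 - D_o / D_e

# Coder A mostly uses "x", coder B mostly uses "y"; they agree on only half the units.
pairs = [("x", "y")] * 4 + [("x", "x")] * 2 + [("y", "y")] * 2
print(cohen_kappa(pairs), nominal_alpha(pairs))   # roughly 0.20 versus 0.06
```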


Naming a statistic as one of agreement, reproducibility, or reliability does not make it a valid index of whether one can rely on coded data in subsequent decisions. Its mathematical structure must fit the process of coding units into a system of analyzable terms.