*This article is a preliminary version of what will be published in
the
International Encyclopedia of the Social and Behavioral Sciences,
and was written while Professor Edwards was visiting CIMAT, March
8-19, 1999.
A statistical model for phenomena in the sciences or social sciences
is a mathematical construct which associates a probability with each of
the possible outcomes. If the data are discrete, such as the numbers of
people falling into various classes, the model will be a discrete probability
distribution, but if the data consist of measurements or other numbers
which may take any values in a continuum, the model will be a continuous
probability distribution. When two different models, or perhaps two variants
of the same model differing only in the value of some adjustable parameter(s),
are to be compared as explanations for the same observed outcome, the probability
of obtaining this particular outcome can be calculated for each and is
then known as likelihood for the model or parameter value(s) given the
data.
Probabilities and likelihoods are easily (and frequently) confused,
and it is for this reason that in 1921 R.A.Fisher introduced the new word:
‘What we can find from a sample is the likelihood of any particular value
of [the parameter], if we define the likelihood as a quantity proportional
to the probability that, from a population having that particular value,
the [observed sample] should be obtained. So defined, probability and likelihood
are quantities of an entirely different nature’.
The first difference to be noted is that the variable quantity in a
likelihood statement is the hypothesis (a word which conveniently covers
both the case of a model and of particular parameter values in a single
model), the outcome being that actually observed, in contrast to a probability
statement, which refers to a variety of outcomes, the hypothesis being
assumed and fixed. Thus a manufacturer of dice may reasonably assert that
the outcomes 1, 2, 3, 4, 5, 6 of a throw each have probability 1/6 on the
hypothesis that his dice are well-balanced, whilst an inspector of casinos
charged with testing a particular die will wish to compute the likelihoods
for various hypotheses about these probabilities on the basis of data from
actual tosses.
The second difference arises directly from the first. If all the outcomes
of a statistical model are considered their total probability will be 1
since one of them must occur and they are mutually exclusive; but since
in general hypotheses are not exhaustive – one can usually think of another
one – it is not to be expected that the sum of likelihoods has any particular
meaning, and indeed there is no addition law for likelihoods corresponding
to the addition law for probabilities. It follows that only relative likelihoods
are informative, which is the reason for Fisher’s use of the word ‘proportional’
in his original definition.
The most important application of likelihood is in parametric statistical
models. Consider the simplest binomial example, such as that of the distribution
of the number of boys r in families of size n (an example
which has played an important role in the development of statistical theory
since the early eighteenth century). The probability of getting exactly
r boys will be given by the binomial distribution indexed by a parameter
p, the probability of a male birth. Denote this probability of r
boys by P(r|p), n being assumed fixed and of no statistical
interest. Then we write
for the likelihood of p given the particular value r,
the double vertical line || being used to indicate that the likelihood
of p is not conditional on r in the technical probability
sense. In this binomial example L(p||r) is a continuous function of the
parameter p and is known as the likelihood function. When only two
hypotheses are compared, such as two particular values of p in the
present example, the ratio of their likelihoods is known as the likelihood
ratio.
The value of p which maximises L(p||r) for an observed r is
known as the maximum-likelihood estimate of p and is denoted by
p^; expressed in general form as a function of r it is known as
the maximum-likelihood estimator. Since the pioneering work of Fisher in
the 1920s it has been known that maximum-likelihood estimators possess
certain desirable properties under repeated-sampling (consistency and asymptotic
efficiency, and in an important class of models sufficiency and full efficiency),
and for this reason they have come to occupy a central position in repeated-sampling
(or ‘frequentist’) theories of statistical inference.
However, partly as a reaction to the unsatisfactory features which
repeated-sampling theories display when used as theories of evidence, coupled
with a reluctance to embrace the full-blown Bayesian theory of statistical
inference, likelihood is increasingly seen as a fundamental concept enabling
hypotheses and parameter values to be compared directly.
The basic notion, championed by Fisher as early as 1912 whilst still
an undergraduate at Cambridge but now known to have been occasionally suggested
by other writers even earlier, is that the likelihood ratio for two hypotheses
or parameter values is to be interpreted as the degree to which the data
support the one hypothesis against the other. Thus a likelihood ratio of
1 corresponds to indifference between the hypotheses on the basis of the
evidence in the data, whilst the maximum-likelihood value of a parameter
is regarded as the best-supported value, other values being ranked by their
lesser likelihoods accordingly. This was formalised as the Law of Likelihood
by Ian Hacking in 1965. Fisher’s final advocacy of the direct use of likelihood
will be found in his last book Statistical Methods and Scientific Inference
(1956).
Such an approach, unsupported by any appeal to repeated-sampling criteria,
is ultimately dependent on the primitive notion that the best hypothesis
or parameter-value on the evidence of the data is the one which would explain
what has in fact been observed with the highest probability. The strong
intuitive appeal of this can be captured by recognizing that it is the
value which would lead, on repeated sampling, to a precise repeat of the
data with the least expected delay. In this sense it offers the best statistical
explanation of the data.
In addition to specifying that relative likelihoods measure degrees
of support, the likelihood approach requires us to accept that the likelihood
function or ratio contains all the information we can extract from the
data about the hypotheses in question on the assumption of the specified
statistical model – the so-called Likelihood Principle. It is important
to include the qualification requiring the specification of the model,
first because the adoption of a different model might prove necessary later
and secondly because in some cases the structure of the model enables inferences
to be made in terms of fiducial probability which, though dependent on
the likelihood, are stronger, possessing repeated-sampling properties which
enable confidence intervals to be constructed.
Though it would be odd to accept the Law of Likelihood and not the
Likelihood Principle, Bayesians necessarily accept the Principle but not
the Law, for although the likelihood is an intrinsic component of Bayes’s
Theorem, Bayesians deny that a likelihood function or ratio has any meaning
in isolation. For those who accept both the Law and the Principle it is
convenient to express the two together as:
The Likelihood Axiom: Within the framework of a statistical model, all the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of those hypotheses on the data, and the likelihood ratio is to be interpreted as the degree to which the data support the one hypothesis against the other (Edwards, 1972).
The likelihood approach has many advantages apart from its intuitive
appeal. It is straightforward to apply because the likelihood function
is usually simple to obtain analytically or easy to compute and display.
It leads directly to the important theoretical concept of sufficiency according
to which the function of the data which is the argument of the likelihood
function itself carries the information. This reduction of the data is
often a simple statistic such as the sample mean. Moreover, the approach
illuminates many of the controversies surrounding repeated-sampling theories
of inference, especially those concerned with ancillarity and conditioning.
Birnbaum (1962) argued that it was possible to derive the Likelihood Principle
from the concepts of sufficiency and conditionality, but to most people
the Principle itself seems the more primitive concept and the fact that
it leads to notions of sufficiency and conditioning seems an added reason
for accepting it.
Likelihoods are multiplicative over independent data sets referring
to the same hypotheses or parameters, facilitating the combination of information.
For this reason log-likelihood is often preferred because information is
then combined by addition. In the field of genetics, where likelihood theory
is widely applied, the log-likelihood with the logarithms taken to the
base 10 is known as a LOD, but for general use natural logarithms to the
base e are to be preferred, in which case log-likelihood is sometimes called
support. Most importantly, the likelihood approach is compatible with Bayesian
statistical inference in the sense that the posterior Bayes distribution
for a parameter is, by Bayes’s Theorem, found by multiplying the prior
distribution by the likelihood function. Thus when, in accordance with
Bayesian principles, a parameter can itself be given a probability distribution
(and this assumption is the Achilles’ heel of Bayesian inference) all the
information the data contain about the parameter is transmitted via the
likelihood function in accordance with the Likelihood Principle. It is
indeed difficult to see why the medium through which such information is
conveyed should depend on the purely external question of whether the parameter
may be considered to have a probability distribution, and this is another
powerful argument in favour of the Principle itself.
In the case of a single parameter the likelihood function or the log-likelihood
function may easily be drawn, and if it is unimodal limits may be assigned
to the parameter, analogous to the confidence limits of repeated-sampling
theory. Calling the log-likelihood the support, m-unit support limits are
the two parameter values astride the maximum at which the support is m
units less than at the maximum. For the simplest case of estimating the
mean of a normal distribution of known variance the 2-unit support limits
correspond closely to the 95% confidence limits which are at
1.96 standard errors. In this normal case the support function is quadratic
and may therefore be characterized completely by the maximum-likelihood
estimate and the curvature at the maximum (the reciprocal of the radius
of curvature) which is defined as the observed information. Comparable
definitions apply in multiparameter cases, leading to the concept of an
m-unit support region and an observed information matrix. In cases in which
the support function is not even approximately quadratic the above approach
may still be applied if a suitable transformation of the parameter space
can be found.
The representation of a support function for more than two parameters
naturally encounters the usual difficulties associated with the visualisation
of high-dimensioned spaces, and a variety of methods have been suggested
to circumvent the problem. It will often be the case that information is
sought about some subset of the parameters, the others being considered
to be nuisance parameters of no particular interest. In fortunate cases
it may be possible to restructure the model so that the nuisance parameters
are eliminated, and in all cases in which the support function is quadratic
(or approximately so) the dimensions corresponding to the nuisance parameters
can simply be ignored.
Several other approaches are in use to eliminate nuisance parameters.
Marginal likelihoods rely on finding some function of the data which does
not depend on them; notable examples involve the normal distribution, where
a marginal likelihood for the variance can be found from the distribution
of the sample variance which is independent of the mean, and a marginal
likelihood for the mean can similarly be found using the t-distribution.
Profile likelihoods, also called maximum relative likelihoods, are found
by replacing the nuisance parameters by their maximum-likelihood estimates
at each value of the parameters of interest. It is easy to visualise from
the case of two parameters why this is called a profile likelihood.
Naturally, a solution can always be found by strengthening the model
through adopting particular values for the nuisance parameters, just as
a Bayesian solution using integrated likelihoods can always be found by
adopting a prior distribution for them and integrating them out, but such
assumptions do not command wide assent. When a marginal likelihood solution
has been found it may correspond to a Bayesian integrated likelihood for
some choice of prior, and such priors are called neutral priors to distinguish
them from so-called uninformative priors for which no comparable justification
exists. However, in the last analysis there is no logical reason why nuisance
parameters should be other than a nuisance, and procedures for mitigating
the nuisance must be regarded as expedients.
All the common repeated-sampling tests of significance have their analogues
in likelihood theory, and in the case of the normal model it may seem that
only the terminology has changed. At first sight an exception seems to
be the x2 goodness-of-fit test, where no alternative
hypothesis is implied. However, this is deceptive, and a careful analysis
shows that there is an implied alternative hypothesis which allows the
variances of the underlying normal model to depart from their multinomial
values. In this way the paradox of small values of x2
being interpreted as meaning that the model is ‘too good’ is exposed, for
in reality they mean that the model is not good enough and that one with
a more appropriate variance structure will have a higher likelihood. Likelihood
ratio tests are based on the distribution under repeated-sampling of the
likelihood ratio and are therefore not part of likelihood theory.
When likelihood arguments are applied to models with continuous sample
spaces it may be necessary to take into account the approximation involved
in representing data, which are necessarily discrete, by a continuous model.
Neglect of this can lead to the existence of singularities in the likelihood
function or other artifacts which a more careful analysis will obviate.
It is often argued that in comparing two models by means of a likelihood
ratio, allowance should be made for any difference in the number of parameters
by establishing a ‘rate of exchange’ between an additional parameter and
the increase in log-likelihood expected. The attractive phrase ‘Occam’s
bonus’ has been suggested for such an allowance (J.H.Edwards, 1969). However,
the proposal seems only to have a place in a repeated-sampling view of
statistical inference, where a bonus such as that suggested by Akaike’s
information criterion is sometimes canvassed.
The major application of likelihood theory so far has been in human
genetics, where log-likelihood functions are regularly drawn for recombination
fractions (linkage values) (see Ott, 1991), but even there a reluctance
to abandon significance-testing altogether has led to a mixed approach.
Other examples, especially from medical fields, will be found in the books
cited below.
Although historically the development of a likelihood approach to statistical
inference was almost entirely due to R.A.Fisher, it is interesting to recall
that the Neyman–Pearson approach to hypothesis testing derives ultimately
from a remark of ‘Student’s’ (W.S.Gossett) in a letter to E.S.Pearson in
1926 that ‘if there is any alternative hypothesis which will explain the
occurrence of the sample with a more reasonable probability … you will
be very much more inclined to consider that the original hypothesis is
not true’, a direct likelihood statement (quoted in McMullen and Pearson,
1939). Indeed, it has been remarked that ‘Just as support [log-likelihood]
is Bayesian inference without the priors, so it turns out to be Neyman–Pearson
inference without the ‘errors’ (Edwards, 1972).
The literature on likelihood is gradually growing as an increasing
number of statisticians become concerned at the inappropriate use of significance
levels, confidence intervals and other repeated-sampling criteria to represent
evidence. The movement is most advanced in biostatistics as may be seen
from books such as Clayton and Hills (1993) and Royall (1997), but general
texts such as Lindsey (1995) exist as well. Amongst older books Cox and
Hinkley (1974) contains much that is relevant to likelihood, whilst Edwards
(1972, 1992) was the first book to advocate a purely likelihood approach,
and is rich in relevant quotations from Fisher’s writings. The history
of likelihood is treated by Edwards (1974; reprinted in Edwards 1992).
Birnbaum, A. (1962) On the foundations of statistical inference. J.
Amer. Statist. Ass. 57, 269–326.
Clayton, D.G. and Hills, M. (1993) Statistical Models in Epidemiology.
Oxford University Press.
Cox, D.R. and Hinkley, D.V. (1974) Theoretical Statistics. London:
Chapman & Hall.
Edwards, A.W.F. (1972) Likelihood. Cambridge University Press.
Edwards, A.W.F. (1974) The history of likelihood. Int. Statist. Rev.
42, 9-15.
Edwards, A.W.F. (1992) Likelihood. Baltimore: Johns Hopkins University
Press.
Edwards, J.H. (1969) In: Computer Applications in Genetics, ed. N.E.Morton.
Honolulu: University of Hawaii Press.
Fisher, R.A. (1912) On an absolute criterion for fitting frequency
curves. Mess. Math. 41, 155-60.
Fisher, R.A. (1921) On the ‘probable error’ of a coefficient of correlation
deduced from a small sample. Metron 1 pt 4, 3-32.
Fisher, R.A. (1956) Statistical Methods and Scientific Inference. Edinburgh:
Oliver & Boyd.
Hacking, I. (1965) Logic of Statistical Inference. Cambridge University
Press.
Lindsey, J.K. (1995) Introductory Statistics: A Modelling Approach.
Oxford: Clarendon Press.
McMullen, L. and Pearson, E.S. (1939) William Sealy Gosset, 1876–1937.
Biometrika 30, 205-50.
Royall, R. (1997) Statistical Evidence: A Likelihood Paradigm. London:
Chapman & Hall.