Frequentist and Bayesian Statistics | Fernando Villanea

Both frequentist and Bayesian statistics share a large body of underlying theory developed concurrently to solve similar problems. Likewise, both schools of statistics have converged on similar terminology and compatible representations of results, and they readily supplement each other (Puga et al. 2015b). However, some underlying elements of the two schools remain distinct, and understanding these differences is essential to applying either approach effectively.

To formalize the distinction mathematically, let there be a population of data N, from which we have the sample n, which is only a portion of the entire population. For any variable of interest in the data, the frequency histogram of all possible values is called the population distribution. The population N possesses immutable characteristics, or parameters, such as a mean μ (Krzywinski and Altman 2013b). If n=N, then the population mean μ is known; otherwise, it must be described in terms of probability. This is where the differences in philosophy commence.

Frequentist probability, also known as physical or objective probability, is associated with repeatable processes that occur at a given rate (i.e., at some frequency over a long series of trials). The probability of an outcome is expressed as the relative frequency of that outcome over a large number of trials. For example, the probability of rolling a ‘20’ on a 20-sided die [P(20)] is estimated by the number of times a ‘20’ is rolled [n₂₀] divided by the total number of trials [n_t], that is, P(20) ≈ n₂₀/n_t. Relative frequency is a poor approximation of the “true” probability when the number of trials is low, but as the number of trials approaches infinity the relative frequency converges to the true probability (i.e., the law of large numbers). Similarly, when frequentist statistics are used to estimate a parameter of the total population of data N, for example the mean μ, from a sample n, the sample mean x̄ is an imprecise approximation of μ when the sample size is small and becomes more precise as the sample size increases. Finally, when n=N, then x̄=μ. That is, when every member of the population has been observed, the mean is known.

Interval estimation is commonly used to estimate unknown parameters of the population N, as an alternative to providing a single point estimate. A confidence interval (CI) is a range of values, calculated from the sample, that should contain the true parameter of the population at a given relative frequency, or confidence level. For example, a 95% CI is constructed around x̄ in such a way that, if a sample n were taken repeatedly and the interval recalculated each time, about 95% of those intervals would contain μ (Krzywinski and Altman 2013a).
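The two frequentist ideas above, relative frequency over many trials and the repeated-sampling meaning of a confidence interval, can be illustrated with a short simulation. This is only a minimal sketch: the function names, the population values (μ = 10, σ = 2), the sample size, and the use of a z-interval with known σ are all assumptions chosen for illustration and do not come from the cited papers.

```python
import random
import statistics
from statistics import NormalDist


def relative_frequency(n_trials, target=20, sides=20, rng=None):
    """Estimate P(target) on a fair die as n_target / n_trials."""
    if rng is None:
        rng = random.Random()
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, sides) == target)
    return hits / n_trials


def ci_coverage(mu=10.0, sigma=2.0, n=30, n_repeats=2000, level=0.95, rng=None):
    """Fraction of repeated-sample confidence intervals that contain mu.

    Assumption: sigma is treated as known, so a z-interval is used.
    """
    if rng is None:
        rng = random.Random()
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    half_width = z * sigma / n ** 0.5
    covered = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        if x_bar - half_width <= mu <= x_bar + half_width:
            covered += 1
    return covered / n_repeats


rng = random.Random(1)  # fixed seed so the run is repeatable
for n_trials in (20, 2_000, 200_000):
    # The estimate should drift toward the true value 1/20 = 0.05 as trials grow.
    print(n_trials, relative_frequency(n_trials, rng=rng))

# Should be close to 0.95: about 95% of the intervals contain mu.
print("coverage:", ci_coverage(rng=rng))
```

With few trials the estimate of P(20) can be far from 0.05, and it settles as the number of trials grows. The coverage figure describes the long-run behaviour of the interval procedure, not the probability that any single interval contains μ.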

The central ideas of frequentist probability are commonly applied in hypothesis testing in the form of the familiar p-value. To connect the frequentist philosophy with hypothesis testing, it is important to formalize the null hypothesis in terms of a distribution of expected observations. To generate a null hypothesis, or null distribution, we need a control or reference, and we have to assume we can characterize all the random fluctuations inherent in measuring that control or reference. If this is possible, we can construct the null distribution, which has a mean μ corresponding to the value of the reference, and a variance determined by the inherent random fluctuations (Krzywinski and Altman 2013c). The purpose of a statistical test is to determine how a new observation compares to this distribution and, in particular, how far removed it is from the mean μ. The significance of the difference between the observation and μ is determined by first calculating the proportion of the null distribution that is at least as extreme as the observation (this is the p-value), and then comparing that to a proportion of most extreme values that are defined a priori as outliers (this is the α value). The value of α is used to calculate maximum and minimum threshold values beyond which the observation is considered significantly different from μ.

A p-value is the probability of observing a value equal to or more extreme than the one actually observed, assuming that the null hypothesis is true (Krzywinski and Altman 2013c). Thus, when a p-value is found to be less than a standard α of 0.05 (i.e., p < 0.05), the observation falls in the most extreme 5% of the null distribution relative to the mean μ. Now we can connect the concepts of hypothesis testing and confidence intervals, as the confidence level is the complement of the significance level α. For example, a 95% confidence interval contains all hypothesized values of μ that would not be rejected at the 0.05 level given the sample.
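As a rough illustration of the test described above, the sketch below computes a two-sided p-value and the α thresholds for a normally distributed null. The normal shape of the null distribution, the function names, and the numeric values (μ = 10, standard deviation 2, observation 14.5) are assumptions made for the example, not values from the source.

```python
from statistics import NormalDist


def two_sided_p_value(observation, null_mean, null_sd):
    """Proportion of the null distribution at least as far from the mean
    as the observation, counting both tails."""
    null = NormalDist(null_mean, null_sd)
    distance = abs(observation - null_mean)
    upper_tail = 1 - null.cdf(null_mean + distance)
    return 2 * upper_tail  # the normal null is symmetric, so double one tail


def rejection_thresholds(null_mean, null_sd, alpha=0.05):
    """Minimum and maximum values beyond which an observation is
    considered significantly different from the null mean."""
    null = NormalDist(null_mean, null_sd)
    return null.inv_cdf(alpha / 2), null.inv_cdf(1 - alpha / 2)


# Hypothetical reference: mean 10, standard deviation 2; new observation 14.5.
alpha = 0.05
p = two_sided_p_value(14.5, null_mean=10.0, null_sd=2.0)
low, high = rejection_thresholds(10.0, 2.0, alpha)

print("p-value:", p)
print("thresholds:", low, high)
# The two views agree: p < alpha exactly when the observation
# falls outside the thresholds.
print("significant:", p < alpha, 14.5 < low or 14.5 > high)
```

Comparing p with α and comparing the observation with the thresholds are two ways of asking the same question, which is why the confidence level and α are complements.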

Bayesian probability, on the other hand, is built on conditional probability. A conditional probability measures the chance of an outcome given another outcome, as explained by Bayes’ theorem (Box 2). Back to the example of population N and its mean μ, under Bayesian theory both the sample n and its mean x̄ are treated more fluidly, and described probabilistically. A very important distinction is that the sample n is considered a realized sample and treated as the only source of data (D), while frequentist statistics are concerned with repeated sampling. Another key distinction is that frequentist statistics treat the true mean μ of population N as fixed but unknown, and are forced to approximate its value through the sample mean x̄ (except in the rare instances when n=N).

In Bayesian statistics, it is possible to estimate the true mean μ by associating a conditional probability with it, that is, the probability of μ given the data, or P(μ|D). This is referred to as the posterior probability of parameter μ given some known data D. It should not be confused with the likelihood, P(D|μ), which is the probability of the data given a value of μ. Bayes’ theorem combines the likelihood with a prior probability for μ to obtain the posterior. The posterior is described as a probability density function, which can be used to determine the probability of μ having a value in any given interval, called a credible interval (or credibility interval). The credible interval is a range of values that contains the true parameter of the population with a given probability. For example, a 95% credible interval consists of values which include the true value for μ with 95% probability (Casella 2008). Because in Bayesian statistics μ can take a range of values, each with an associated conditional probability, it is typical to represent this distribution with a summary statistic and a credible interval. The mean of the probability distribution is typically reported for symmetrical distributions, whereas for an asymmetric distribution the median or the mode (i.e., the highest-probability value) is often more characteristic of the distribution and should be used.
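To make the idea of a probability distribution over μ more concrete, the sketch below approximates P(μ|D) on a grid of candidate values and summarizes it. It is a simplified illustration under stated assumptions: the data are simulated, the population standard deviation is treated as known (σ = 2), the prior is flat over the grid, and all names (`grid_posterior`, `quantile`) are hypothetical helpers rather than part of any library.

```python
import math
import random
import statistics


def grid_posterior(data, sigma, grid, prior):
    """Posterior P(mu | D) on a grid, proportional to prior times likelihood.

    Assumption: observations are normal with known standard deviation sigma.
    """
    log_post = []
    for mu, p in zip(grid, prior):
        log_lik = sum(-0.5 * ((x - mu) / sigma) ** 2 for x in data)
        log_post.append(math.log(p) + log_lik)
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def quantile(grid, posterior, q):
    """Smallest grid value whose cumulative posterior probability reaches q."""
    cumulative = 0.0
    for mu, p in zip(grid, posterior):
        cumulative += p
        if cumulative >= q:
            return mu
    return grid[-1]


rng = random.Random(7)
data = [rng.gauss(10.0, 2.0) for _ in range(25)]  # the realized sample D
grid = [5 + 0.01 * i for i in range(1001)]        # candidate mu: 5.00 to 15.00
prior = [1 / len(grid)] * len(grid)               # flat prior (an assumption)
posterior = grid_posterior(data, sigma=2.0, grid=grid, prior=prior)

post_mean = sum(mu * p for mu, p in zip(grid, posterior))
post_median = quantile(grid, posterior, 0.5)
post_mode = grid[posterior.index(max(posterior))]
lower = quantile(grid, posterior, 0.025)
upper = quantile(grid, posterior, 0.975)

print("sample mean:", statistics.fmean(data))
print("posterior mean, median, mode:", post_mean, post_median, post_mode)
print("95% credible interval:", lower, upper)
```

Because this posterior is symmetric, the mean, median, and mode nearly coincide. With an informative or skewed prior they would separate, which is the situation in which the median or mode is usually the better summary.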