Recently, I revisited the concept of the median and took this opportunity to record some key points.
When performing outlier removal or clipping, we often need a “benchmark.” For example, for a set of non-negative data, we may consider values greater than 50 times the benchmark as outliers. How should we choose this benchmark? A commonly used metric is the mean, but the mean is easily “biased” by outliers. Therefore, using it as a benchmark might lean toward the outliers, causing some results to be missed. In such cases, we can consider using the median as the benchmark.
Basic Properties
For one-dimensional data points x_1,x_2,\cdots,x_n, their mean is defined as \mathop{\mathrm{mean}}(x_1,x_2,\cdots,x_n) = \frac{1}{n}\sum_{i=1}^n x_i. Since all data points directly participate in the averaging calculation, if a few points are exceptionally large, the mean will also become larger, thereby interfering with outlier detection.
The idea of the median is to find a “neutral” value in the data to serve as the benchmark. Specifically, it seeks a split point such that the numbers of data points greater than or equal to it and less than or equal to it are the same; thus it is also called the “50% quantile (50th percentile).” If n is odd, it equals the (n+1)/2-th largest number in the set; if n is even, any number between the n/2-th and n/2+1-th largest can be considered the median, and usually we take the average of the two.
From the definition of the median, we can see its robustness: as long as the relative order of each number with respect to the median remains unchanged, the median does not change. Furthermore, we can introduce the “breakdown point” to describe this ability, which is “the smallest proportion of outliers needed to make the estimate tend to infinity.” For the mean, the breakdown point is essentially 0, because just one point tending to infinity makes it tend to infinity; but for the median, it is 50\%, because more than half of the data must tend to infinity for the median to tend to infinity.
The disadvantage of the median is that it is more computationally complex than the mean, because it requires at least some form of sorting, which is especially unfriendly for distributed computation. Fortunately, most scenarios do not require an exact median, so there is some leeway; for example, one can compute the median locally and then compute a global mean or median, which can significantly reduce communication overhead.
Optimization Perspective
Interestingly, we can unify the mean and median from an optimization perspective: \begin{gathered} \mathop{\mathrm{mean}}(x_1,x_2,\cdots,x_n) = \mathop{\mathrm{argmin}}_{\mu} \sum_{i=1}^n (x_i - \mu)^2, \\ \mathop{\mathrm{median}}(x_1,x_2,\cdots,x_n) = \mathop{\mathrm{argmin}}_{\mu} \sum_{i=1}^n |x_i - \mu|. \end{gathered} The proof for \mathop{\mathrm{mean}} is relatively simple and left to the reader; here we briefly introduce the proof for \mathop{\mathrm{median}}. Without loss of generality, assume the x_i are already sorted, i.e. x_1 \leq x_2 \leq \cdots \leq x_n, and let f(\mu) = \sum_{i=1}^n |x_i - \mu|. Direct differentiation gives f'(\mu) = \sum_{i=1}^n \mathop{\mathrm{sign}}(\mu - x_i) = \#\{x_i < \mu\} - \#\{x_i > \mu\}, where the symbol \# denotes the number of elements satisfying the condition. To find the minimum, we want the derivative to be as close to 0 as possible, i.e. we want the numbers of x_i greater than \mu and less than \mu to be as equal as possible, which aligns with the idea of the median. A more detailed discussion requires a case analysis: f'(\mu) = \left\{\begin{aligned} &\left.\begin{aligned} &\underbrace{\#\{x_i < \mu\}}_{\leq k} - \underbrace{\#\{x_i > \mu\}}_{\geq k+1} < 0, &\mu < x_{k+1} \\ &\underbrace{\#\{x_i < \mu\}}_{\geq k+1} - \underbrace{\#\{x_i > \mu\}}_{\leq k} > 0, &\mu > x_{k+1} \\ \end{aligned}\right\} & n = 2k+1 \\[5pt] &\left.\begin{aligned} &\underbrace{\#\{x_i < \mu\}}_{\leq k-1} - \underbrace{\#\{x_i > \mu\}}_{\geq k+1} < 0, &\mu < x_{k\phantom{+1}} \\ &\underbrace{\#\{x_i < \mu\}}_{\geq k+1} - \underbrace{\#\{x_i > \mu\}}_{\leq k-1} > 0, &\mu > x_{k+1} \\ \end{aligned}\right\} & n = 2k \end{aligned}\right. That is, when n is odd (2k+1), both \mu < x_{k+1} and \mu > x_{k+1} increase f(\mu), so \mu^* = x_{k+1}; when n is even (2k), both \mu < x_k and \mu > x_{k+1} increase f(\mu), so \mu^* \in [x_k, x_{k+1}]. It can be verified that when \mu^* lies in this interval, f(\mu^*) remains unchanged, so any number in this interval can serve as \mu^*. In conclusion, \mu^* is perfectly consistent with the definition of the median.
High-dimensional Space
From the optimization perspective, we can also understand why the median is more robust against outliers than the mean: if there is an exceptionally large x_i, the loss contributed to the mean is (x_i - \mu)^2, while for the median it is |x_i - \mu|. Generally, (x_i - \mu)^2 \gg |x_i - \mu|, so the mean will be more inclined toward the outlier to minimize the loss as much as possible.
Furthermore, another advantage of the optimization perspective is that it can be easily generalized to high-dimensional spaces. We know that the concept of the mean is easily extended to high dimensions: for a set of vectors \boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_n, the mean vector is \frac{1}{n}\sum_{i=1}^n \boldsymbol{x}_i. However, the concept of the median is not as straightforward to generalize, because it relies on sorting, and it is difficult to define a good ordering for vector data.
However, from the optimization perspective, this generalization is very natural: \begin{gathered} \mathop{\mathrm{mean}}(\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_n) = \mathop{\mathrm{argmin}}_{\boldsymbol{\mu}} \sum_{i=1}^n \Vert\boldsymbol{x}_i - \boldsymbol{\mu}\Vert_2^2, \\ \mathop{\mathrm{median}}(\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_n) = \mathop{\mathrm{argmin}}_{\boldsymbol{\mu}} \sum_{i=1}^n \Vert\boldsymbol{x}_i - \boldsymbol{\mu}\Vert_2. \end{gathered} Here \Vert\cdot\Vert_2 denotes the Euclidean distance. It is easy to show that the mean vector defined this way is exactly \frac{1}{n}\sum_{i=1}^n \boldsymbol{x}_i, matching the empirical definition. As for the median vector, we also call it the Geometric Median. From the objective function, we can see that when n=3, it becomes the classic “Fermat point” (Fermat point); hence, we often directly refer to the median vector as the Fermat point.
Unfortunately, the median vector does not have an analytical solution; we usually compute it using the following Weiszfeld iteration: \boldsymbol{\mu}_{t+1} = \frac{\sum_{i=1}^n \boldsymbol{x}_i / \Vert\boldsymbol{x}_i - \boldsymbol{\mu}_t\Vert_2}{\sum_{i=1}^n 1 / \Vert\boldsymbol{x}_i - \boldsymbol{\mu}_t\Vert_2}.
Further Generalization
Obviously, we can consider an even more general extension: \mathop{\mathrm{average}}(\boldsymbol{x}_1,\boldsymbol{x}_2,\cdots,\boldsymbol{x}_n;\alpha) = \mathop{\mathrm{argmin}}_{\boldsymbol{\mu}} \sum_{i=1}^n \Vert\boldsymbol{x}_i - \boldsymbol{\mu}\Vert_2^{\alpha}. Let f(\boldsymbol{\mu}) = \sum_{i=1}^n \Vert\boldsymbol{x}_i - \boldsymbol{\mu}\Vert_2^{\alpha}; then \nabla_{\boldsymbol{\mu}} f(\boldsymbol{\mu}) = \alpha\sum_{i=1}^n \Vert\boldsymbol{x}_i - \boldsymbol{\mu}\Vert_2^{\alpha - 2}(\boldsymbol{\mu} - \boldsymbol{x}_i). Setting \nabla_{\boldsymbol{\mu}} f(\boldsymbol{\mu}) = \boldsymbol{0}, the equation to solve can be written as \boldsymbol{\mu} = \frac{\sum_{i=1}^n \Vert\boldsymbol{x}_i - \boldsymbol{\mu}\Vert_2^{\alpha - 2}\boldsymbol{x}_i}{\sum_{i=1}^n\Vert\boldsymbol{x}_i - \boldsymbol{\mu}\Vert_2^{\alpha - 2}}. Substituting \boldsymbol{\mu}_t into the right-hand side and denoting it as \boldsymbol{\mu}_{t+1} yields a fixed-point iteration. Setting \alpha=2 gives the mean vector, and \alpha=1 yields the Weiszfeld iteration. Strictly speaking, one must also prove convergence and uniqueness, but the details are somewhat involved and are omitted here.
Additionally, there is another less commonly used form of the median vector in high-dimensional space, which replaces the Euclidean norm with the L^1 norm: \mathop{\mathrm{argmin}}_{\boldsymbol{\mu}} \sum_{i=1}^n \Vert\boldsymbol{x}_i - \boldsymbol{\mu}\Vert_1. This is called the Coordinate-wise Median, because it essentially computes the one-dimensional median component-wise. It is relatively easy to compute, but because it lacks clear geometric meaning, it is used less frequently.
Summary
This article briefly summarizes the concept, properties, and high-dimensional generalizations of the median.
For reprints, please include the link to this article:
https://kexue.fm/archives/11693
For more detailed reprint guidelines, please refer to:
Science
Space FAQ