In the "Approximate Estimation" section of "Higher-order MuP: Simpler but Smarter Spectral Condition Scaling", we "pre-paid" a conclusion: "For a random matrix of size n \times m following a standard normal distribution, its spectral norm is approximately \sqrt{n} + \sqrt{m}."
In this article, we fill in the discussion behind this conclusion and present a method for quickly estimating the spectral norm of random matrices.
Random Matrix Theory
Suppose we have a random matrix \boldsymbol{W} \in \mathbb{R}^{n \times m} whose entries are independently and identically sampled from the standard normal distribution \mathcal{N}(0,1). We want to estimate the spectral norm of \boldsymbol{W}, i.e., its largest singular value, and we take the expectation \mathbb{E}[\Vert\boldsymbol{W}\Vert_2] as the final estimate.
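Before any theory, the quantity itself is easy to probe numerically. Below is a minimal sketch (assuming NumPy; the sizes n and m are arbitrary illustration choices, not values from this article) that draws a Gaussian matrix and computes its spectral norm:

```python
import numpy as np

# A minimal numerical probe, assuming NumPy; the sizes n, m and the
# seed are arbitrary illustration choices.
rng = np.random.default_rng(0)
n, m = 1000, 400
W = rng.standard_normal((n, m))   # entries i.i.d. N(0, 1)
spec = np.linalg.norm(W, ord=2)   # spectral norm = largest singular value
print(spec)                       # the claim: close to sqrt(n) + sqrt(m) ~ 51.6
```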
First, it should be noted that the behavior of random matrices is the subject of a specialized branch of mathematics. For the spectral norm of Gaussian random matrices, relevant keywords include the "Marchenko–Pastur distribution", the "Bai–Yin law", and "Gordon's theorem", which provide detailed results on the distribution of the singular values. However, we will not go into these results here. Instead, we introduce a way to arrive at \sqrt{n} + \sqrt{m} quickly.
This method is the author's simplification of Section 5.3.1 of "Introduction to the non-asymptotic analysis of random matrices". It should be regarded only as a "heuristic sketch" that helps readers quickly see where \sqrt{n} + \sqrt{m} comes from; a rigorous proof would require many tedious details, which we will not expand upon here.
Spectral Norm Estimation
Our starting point is the identity: \begin{equation} \Vert\boldsymbol{W}\Vert_2 = \max_{\Vert \boldsymbol{u}\Vert=1, \Vert \boldsymbol{v}\Vert=1} \boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} \label{eq:w-uv} \end{equation} where \boldsymbol{u} \in \mathbb{R}^n, \boldsymbol{v} \in \mathbb{R}^m. Computing \Vert\boldsymbol{W}\Vert_2 directly is usually not easy, so it is natural to look for an approximation. We consider the following two "half-finished" expressions, in which the vector not being maximized over is held fixed: \begin{equation} \max_{\Vert \boldsymbol{u}\Vert=1} \boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} \qquad\qquad \max_{\Vert \boldsymbol{v}\Vert=1} \boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} \end{equation} Intuitively, compared with Eq. [eq:w-uv], each of these expressions completes only half of the work. We make a bold assumption: each result is also roughly half of the full result, so their sum should be a good approximation: \begin{equation} \Vert\boldsymbol{W}\Vert_2 = \max_{\Vert \boldsymbol{u}\Vert=1, \Vert \boldsymbol{v}\Vert=1} \boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} \approx \max_{\Vert \boldsymbol{u}\Vert=1} \boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} + \max_{\Vert \boldsymbol{v}\Vert=1} \boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} = \Vert \boldsymbol{W}\boldsymbol{v}\Vert + \Vert \boldsymbol{u}^{\top}\boldsymbol{W}\Vert \label{eq:core-approx} \end{equation} where the first maximum is attained at \boldsymbol{u} = \boldsymbol{W}\boldsymbol{v}/\Vert\boldsymbol{W}\boldsymbol{v}\Vert and the second at \boldsymbol{v} = \boldsymbol{W}^{\top}\boldsymbol{u}/\Vert\boldsymbol{W}^{\top}\boldsymbol{u}\Vert. In other words, we sample \boldsymbol{u} and \boldsymbol{v} uniformly from the unit hyperspheres of \mathbb{R}^n and \mathbb{R}^m respectively, and the above formula then yields an approximation to the spectral norm of \boldsymbol{W}.
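The following sketch (again assuming NumPy, with illustrative sizes and seed) implements Eq. [eq:core-approx] with a single random draw of \boldsymbol{u} and \boldsymbol{v}, and compares the approximation against the exact spectral norm:

```python
import numpy as np

# A sketch of Eq. [eq:core-approx], assuming NumPy. Sizes n, m and the
# seed are arbitrary illustration choices, not values from the article.
rng = np.random.default_rng(0)
n, m = 1000, 400
W = rng.standard_normal((n, m))  # entries i.i.d. N(0, 1)

# Normalizing a Gaussian vector gives a uniform sample on the unit sphere.
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(m); v /= np.linalg.norm(v)

approx = np.linalg.norm(W @ v) + np.linalg.norm(u @ W)  # ||Wv|| + ||u^T W||
exact = np.linalg.norm(W, ord=2)                        # largest singular value
print(approx, exact, np.sqrt(n) + np.sqrt(m))           # all close to 51.6
```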
Calculating the Expected Value
With this approximation, we can calculate: \begin{equation} \mathbb{E}[\Vert\boldsymbol{W}\Vert_2] \approx \mathbb{E}[\Vert \boldsymbol{W}\boldsymbol{v}\Vert] + \mathbb{E}[\Vert \boldsymbol{u}^{\top}\boldsymbol{W}\Vert] \approx \sqrt{\mathbb{E}[\Vert \boldsymbol{W}\boldsymbol{v}\Vert^2]} + \sqrt{\mathbb{E}[\Vert \boldsymbol{u}^{\top}\boldsymbol{W}\Vert^2]} \end{equation} where the second approximation is acceptable because \Vert \boldsymbol{W}\boldsymbol{v}\Vert^2 is a sum of n i.i.d. terms and thus concentrates around its mean, making the Jensen gap negligible. Next, \begin{equation} \mathbb{E}[\Vert \boldsymbol{W}\boldsymbol{v}\Vert^2] = \mathbb{E}[ \boldsymbol{v}^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{v}] = \boldsymbol{v}^{\top}\mathbb{E}[\boldsymbol{W}^{\top}\boldsymbol{W}]\boldsymbol{v} = \boldsymbol{v}^{\top}(n\boldsymbol{I}_m)\boldsymbol{v} = n \end{equation} Here \mathbb{E}[\boldsymbol{W}^{\top}\boldsymbol{W}] = n\boldsymbol{I}_m because each diagonal entry is a sum of n independent squares with unit mean, while the off-diagonal entries have zero mean. Similarly, \mathbb{E}[\Vert \boldsymbol{u}^{\top}\boldsymbol{W}\Vert^2] = m, so \begin{equation} \mathbb{E}[\Vert\boldsymbol{W}\Vert_2] \approx \sqrt{n} + \sqrt{m} \end{equation} This result is in fact very accurate (as simulation experiments confirm). Specifically, if n=ka and m=kb for constants a and b, then: \begin{equation} \lim_{k\to\infty} \frac{\Vert\boldsymbol{W}\Vert_2}{\sqrt{n} + \sqrt{m}} = 1, \qquad \boldsymbol{W}_{ij} \sim \mathcal{N}(0,1) \end{equation} The reason it is so accurate is that we "cheated": the crucial Eq. [eq:core-approx] is essentially reverse-engineered from the known correct answer. Beyond Eq. [eq:core-approx], the only conditions we used were \mathbb{E}[\boldsymbol{W}^{\top}\boldsymbol{W}] = n\boldsymbol{I}_m and \mathbb{E}[\boldsymbol{W}\boldsymbol{W}^{\top}] = m\boldsymbol{I}_n, so the approximation can be expected to hold for any entry distribution with zero mean and unit variance.
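A quick simulation sketch (assuming NumPy; the distributions, sizes, and trial count below are illustrative choices) checks both the Gaussian case and the zero-mean, unit-variance universality claim:

```python
import numpy as np

# A simulation sketch, assuming NumPy. The derivation only used zero
# mean and unit variance, so the estimate should also hold for
# non-Gaussian entries; the distributions below are illustrative.
rng = np.random.default_rng(0)
n, m, trials = 1000, 400, 20
samplers = {
    "gaussian": lambda: rng.standard_normal((n, m)),
    "rademacher": lambda: rng.choice([-1.0, 1.0], size=(n, m)),
    "uniform": lambda: rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, m)),
}
for name, sample in samplers.items():
    mean_spec = np.mean([np.linalg.norm(sample(), ord=2) for _ in range(trials)])
    print(name, mean_spec / (np.sqrt(n) + np.sqrt(m)))  # each ratio ~ 1
```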
Minimum Singular Value
The spectral norm is the largest singular value; in fact, the same logic lets us estimate the smallest singular value. Of course, "smallest" needs a precise definition here. Assuming n \geq m, it refers to the m-th singular value of \boldsymbol{W} (ordered from largest to smallest), which equals: \begin{equation} \sigma_{\min}(\boldsymbol{W}) = \min_{\Vert \boldsymbol{v}\Vert=1}\max_{\Vert \boldsymbol{u}\Vert=1} \boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} \end{equation} Note that the \min and \max here cannot be swapped, nor can the variables they act on. As before, we take the sum of the two "half-finished" expressions, one for the \min and one for the \max, as an approximation: \begin{equation} \sigma_{\min}(\boldsymbol{W}) = \min_{\Vert \boldsymbol{v}\Vert=1}\max_{\Vert \boldsymbol{u}\Vert=1} \boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} \approx \min_{\Vert \boldsymbol{v}\Vert=1}\boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} + \max_{\Vert \boldsymbol{u}\Vert=1} \boldsymbol{u}^{\top}\boldsymbol{W} \boldsymbol{v} = -\Vert \boldsymbol{u}^{\top}\boldsymbol{W}\Vert + \Vert \boldsymbol{W}\boldsymbol{v}\Vert \end{equation} where the minimum is attained at \boldsymbol{v} = -\boldsymbol{W}^{\top}\boldsymbol{u}/\Vert\boldsymbol{W}^{\top}\boldsymbol{u}\Vert. The rest follows the same calculation as before, giving \mathbb{E}[\sigma_{\min}(\boldsymbol{W})] \approx \sqrt{n} - \sqrt{m}.
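The same kind of simulation (assuming NumPy; sizes and trial count are again illustrative, with n \geq m) checks this estimate against the smallest singular value returned by an SVD:

```python
import numpy as np

# Checking E[sigma_min] ~ sqrt(n) - sqrt(m), assuming NumPy and n >= m;
# sizes and trial count are illustrative choices.
rng = np.random.default_rng(0)
n, m, trials = 1000, 400, 20
smin = np.mean([
    np.linalg.svd(rng.standard_normal((n, m)), compute_uv=False)[-1]
    for _ in range(trials)
])  # singular values come back in descending order, so [-1] is the m-th
print(smin, np.sqrt(n) - np.sqrt(m))  # both approximately 11.6
```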
Closing Remarks
This article has presented a quick way to estimate the spectral norm of a random matrix. Strictly speaking, it is a heuristic, introductory explanation rather than a rigorous derivation; it could be made rigorous, but doing so would require filling in many theoretical details, all of which have been skipped here.