Quantitative finance has a statistics problem: not a lack of statistics, but an overconfidence in them. Every factor, every strategy, every signal is dressed in t-statistics, p-values, and Sharpe ratios. Most of them mean considerably less than they appear to.
This module is about developing healthy statistical scepticism. The factor premia are real and the academic evidence is substantial, but the methods used to find and validate them are routinely misapplied, and understanding those misapplications is the first step to building research that actually holds up.
The problem with normal distributions
Standard portfolio theory and most backtest statistics assume that asset returns are normally distributed: the bell curve. This assumption is mathematically convenient and empirically wrong.
Real market returns have fat tails: extreme events occur far more often than a normal distribution predicts. The March 2020 COVID crash, the 2018 Indian small-cap collapse, and the 2008 crisis are all "multi-sigma events" under normal distribution assumptions, events the bell curve says should almost never occur. Yet they happen every few years.
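A minimal sketch of the gap, using synthetic fat-tailed data as a stand-in for real index returns (swap in actual daily returns to reproduce the effect): count how often the series moves more than three standard deviations, and compare with the normal-distribution prediction.

```python
# Sketch: observed frequency of 3-sigma daily moves vs. the normal model.
# `returns` here is simulated fat-tailed data, a placeholder for real
# daily index returns (e.g. the Nifty 50).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
returns = rng.standard_t(df=3, size=5000) * 0.01  # fat-tailed stand-in

z = (returns - returns.mean()) / returns.std()
observed = np.mean(np.abs(z) > 3)   # empirical tail frequency
predicted = 2 * stats.norm.sf(3)    # ~0.27% under normality

print(f"observed 3-sigma frequency: {observed:.4%}")
print(f"normal-model prediction:    {predicted:.4%}")
```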
The practical implication: any risk measure that assumes normal returns (standard deviation-based confidence intervals, parametric VaR) systematically underestimates tail risk. Max drawdown is a better risk measure precisely because it doesn't assume normality: it just reads the actual worst loss from the data.
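Reading the worst loss from the data takes three steps: build the wealth curve, track its running peak, and take the deepest gap between the two. A sketch (the return series here is simulated, purely for illustration):

```python
# Sketch: max drawdown read directly from a return series, with no
# distributional assumption. `returns` is an array of periodic returns.
import numpy as np

def max_drawdown(returns: np.ndarray) -> float:
    """Largest peak-to-trough decline of the cumulative wealth curve."""
    wealth = np.cumprod(1.0 + returns)        # cumulative growth of 1 unit
    running_peak = np.maximum.accumulate(wealth)
    drawdowns = wealth / running_peak - 1.0   # <= 0 at every point
    return drawdowns.min()

rng = np.random.default_rng(0)
print(f"max drawdown: {max_drawdown(rng.normal(0.0005, 0.01, 2500)):.2%}")
```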
P-values and what they actually mean
A strategy backtest shows a t-statistic of 2.3 and a p-value of 0.02. Statistically significant at the 5% level, so the strategy works, right? Not necessarily. The p-value answers a specific, narrow question: if this strategy had zero true edge, how likely is it that I'd see results this good by chance? A p-value of 0.02 means a 2% chance. That sounds compelling. But there are two critical issues:
- The p-value is not the probability that the strategy works. It's the probability of seeing these results if the strategy doesn't work. These are very different statements.
- The p-value doesn't tell you about economic significance. A strategy with a t-statistic of 2.1 and an annual excess return of 0.05% is "statistically significant", yet economically worthless after costs; the sketch below separates the two questions.
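To make the distinction concrete, here is what the standard test actually computes, run on a hypothetical series of monthly excess returns (the numbers are simulated; only the mechanics matter):

```python
# Sketch: the t-test behind a typical backtest claim. It asks one narrow
# question: if the true mean excess return were zero, how often would a
# sample mean this large appear by chance?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
excess = rng.normal(0.002, 0.03, 120)  # 10 years of monthly excess returns

t_stat, p_value = stats.ttest_1samp(excess, popmean=0.0)
print(f"t-stat: {t_stat:.2f}, p-value: {p_value:.3f}")

# Economic significance is a separate question entirely:
print(f"annualised excess return: {excess.mean() * 12:.2%}")
```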
Harvey, Liu & Zhu (2016) argue that, given the number of strategies tested in academic finance, a new factor needs a t-statistic of at least 3.0 to be taken seriously, not the traditional 2.0. The inflation of factor discovery has made the 5% significance threshold nearly meaningless.
The multiple testing problem
This is the most important statistical concept in quantitative research, and the most widely ignored. Suppose you test 100 completely random signals against the Nifty 500 universe. By pure chance, roughly 5 will show statistically significant results at the 5% level, even though none has real edge. If you publish the best-performing 3 as "discovered factors", you've committed p-hacking.
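This is easy to verify by simulation. The sketch below tests 100 pure-noise "strategies" (synthetic data, no real signals involved) and counts how many clear the 5% bar anyway:

```python
# Sketch: the multiple testing problem by simulation. Each "signal" is
# pure noise with zero true edge; roughly 5 of 100 still look significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_signals, n_months = 100, 120

false_positives = 0
for _ in range(n_signals):
    noise = rng.normal(0.0, 0.03, n_months)   # zero-edge "strategy"
    _, p = stats.ttest_1samp(noise, popmean=0.0)
    false_positives += p < 0.05

print(f"{false_positives} of {n_signals} random signals look 'significant'")
```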
Most factor research, including published academic work, doesn't disclose how many signals were tested before the "discovered" factor was selected. The backtest shows the best outcome of many trials, and the statistics are calculated as if it were the only trial.
The honest check: when evaluating any factor or strategy, ask how many signals were tested to find this one. If the answer is "many" and no multiple-testing correction is applied, treat the results with deep scepticism. That standard applies to your own research too.
Stationarity and regime change
A signal that worked from 2003 to 2015 may not work from 2015 to 2023. Financial time series are non-stationary: their statistical properties change over time as market structure, participant composition, and information availability evolve. This creates a fundamental tension: you want as much historical data as possible (for statistical power), but older data may describe a market regime that no longer exists.
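One crude diagnostic, assuming you have returns spanning both eras: split the sample at the suspected regime break and test whether the subperiod means are consistent with a single stable process. The data and split dates below are hypothetical; a strong rejection is a warning, though failing to reject doesn't prove stationarity.

```python
# Sketch: a simple regime check via Welch's two-sample t-test on
# synthetic monthly returns split at a hypothetical regime break.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
early = rng.normal(0.010, 0.05, 144)  # hypothetical 2003-2015 months
late = rng.normal(0.002, 0.05, 96)    # hypothetical 2015-2023 months

t_stat, p_value = stats.ttest_ind(early, late, equal_var=False)
print(f"early mean: {early.mean():.3%}, late mean: {late.mean():.3%}")
print(f"difference t-stat: {t_stat:.2f}, p-value: {p_value:.3f}")
```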
Autocorrelation and inflated t-statistics
Standard statistical tests assume independent observations. Monthly returns from a momentum strategy are not independent: if the strategy performs well in month 1, it tends to hold similar positions in month 2, so returns are correlated. This autocorrelation means the sample carries fewer effectively independent observations than the nominal count, so naive t-statistics look more significant than they are.
The practical fix: use Newey-West standard errors when calculating t-statistics for time-series strategies, or use non-overlapping periods for performance evaluation. Reporting a 10-year backtest as "120 independent monthly observations" when there's significant serial correlation is a common and consequential error.
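A sketch of the fix with statsmodels, on simulated autocorrelated returns (the AR(1) coefficient and lag count are illustrative choices, not recommendations): regressing the return series on a constant recovers its mean, and a HAC (Newey-West) covariance matrix corrects the t-statistic for serial correlation.

```python
# Sketch: naive vs. Newey-West t-statistics for a mean return, using
# statsmodels. Returns are simulated AR(1) with rho = 0.4.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, rho = 120, 0.4
shocks = rng.normal(0.0, 0.03, n)
returns = np.zeros(n)
for t in range(1, n):
    returns[t] = rho * returns[t - 1] + shocks[t]
returns += 0.004  # add a constant mean

X = np.ones((n, 1))                # constant-only regression: slope = mean
naive = sm.OLS(returns, X).fit()   # assumes independent observations
robust = sm.OLS(returns, X).fit(cov_type="HAC", cov_kwds={"maxlags": 6})

print(f"naive t-stat:      {naive.tvalues[0]:.2f}")
print(f"Newey-West t-stat: {robust.tvalues[0]:.2f}")
```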
RupeeCase backtests use non-overlapping annual windows for primary performance reporting: a 10-year backtest gives 10 independent annual observations, not 120 correlated monthly ones. The multiple testing problem is addressed by using fixed canonical factor parameters from the academic literature rather than in-sample optimization. Available at invest.rupeecase.com.
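The aggregation itself is simple, as this sketch on hypothetical monthly data shows: compound each block of 12 monthly returns into one annual observation.

```python
# Sketch: collapsing 120 monthly returns into 10 non-overlapping annual
# observations by compounding each 12-month block. Data is simulated.
import numpy as np

rng = np.random.default_rng(5)
monthly = rng.normal(0.01, 0.04, 120)

annual = np.prod(1.0 + monthly.reshape(10, 12), axis=1) - 1.0
print(f"{len(annual)} annual observations from {len(monthly)} monthly ones")
```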
Sources & further reading
- Harvey, C. R., Liu, Y. & Zhu, H. (2016). …and the Cross-Section of Expected Returns. Review of Financial Studies.
- Bailey, D. H. & Lopez de Prado, M. (2014). The Deflated Sharpe Ratio. Journal of Portfolio Management.
- Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
- NSE India Research Publications.