Quantitative finance has a statistics problem. Not a lack of statistics, but an overconfidence in them. Every factor, every strategy, every signal is dressed in t-statistics, p-values, and Sharpe ratios. Most of them mean considerably less than they appear to.

This module is about developing healthy statistical scepticism. The factor premia are real and the academic evidence is substantial, but the methods used to find and validate them are routinely misapplied, and understanding those misapplications is the first step to building research that actually holds up.

Why most backtests lie: the multiple testing problem

[Illustration] Probability of at least one false positive at p < 0.05 significance:
1 strategy tested: 5% · 20 tested: 64% · 100 tested: 99.4%
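The numbers in the illustration follow from one line of probability: if each test has a 5% chance of a false positive, the chance of at least one across n independent tests is 1 − (1 − 0.05)^n. A minimal sketch:

```python
# Probability of at least one false positive when testing n independent
# signals at significance level alpha: 1 - (1 - alpha)**n
alpha = 0.05

for n in (1, 20, 100):
    p_false = 1 - (1 - alpha) ** n
    print(f"{n:>3} signals tested -> {p_false:.1%} chance of a false positive")
```

At 20 tests the false-positive probability is already about 64%; at 100 it is essentially certain.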

The problem with normal distributions

Standard portfolio theory and most backtest statistics assume that asset returns are normally distributed: the bell curve. This assumption is mathematically convenient and empirically wrong.

Real market returns have fat tails: extreme events occur far more often than a normal distribution predicts. The March 2020 COVID crash, the 2018 Indian small-cap collapse, the 2008 crisis: all are described as "multi-sigma events" under normal distribution assumptions. Yet they happen regularly.

A 5-sigma daily move should occur roughly once every 7,000 years under a normal distribution (two-sided, 252 trading days per year). In practice, it happens roughly once every decade in major markets.
Kurtosis of Nifty 500 daily returns: > 6. A normal distribution has kurtosis = 3. Higher kurtosis means much fatter tails: more extreme moves, more often.

The practical implication: any risk measure that assumes normal returns (standard deviation-based confidence intervals, parametric VaR) systematically underestimates tail risk. Max drawdown is a better risk measure precisely because it doesn't assume normality: it just reads the actual worst loss from the data.
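Both quantities mentioned here, kurtosis and max drawdown, are a few lines of numpy. The data below is a synthetic fat-tailed sample (Student-t), standing in for actual Nifty 500 returns:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic fat-tailed daily returns; in practice, load real index data.
returns = rng.standard_t(df=5, size=2500) * 0.01

def kurtosis(x):
    """Pearson kurtosis of a sample (normal distribution = 3)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4))

def max_drawdown(x):
    """Worst peak-to-trough loss of the cumulative equity curve."""
    equity = np.cumprod(1 + x)
    peaks = np.maximum.accumulate(equity)
    return float(np.min(equity / peaks - 1))

print(f"kurtosis:     {kurtosis(returns):.2f}")   # well above 3 => fat tails
print(f"max drawdown: {max_drawdown(returns):.1%}")
```

Note that max drawdown makes no distributional assumption at all; it is computed purely from the realised equity path.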

P-values and what they actually mean

A strategy backtest shows a t-statistic of 2.3 and a p-value of 0.02. "Statistically significant at the 5% level", so the strategy works, right? Not necessarily. The p-value answers a specific, narrow question: if this strategy had zero true edge, how likely is it that I'd see results this good by chance? A p-value of 0.02 means a 2% chance. That sounds compelling, but there are two critical issues: the p-value is not the probability that the strategy works, and the 5% threshold assumes you ran exactly one test.

Harvey, Liu & Zhu (2016): given the number of strategies tested in academic finance, a new factor needs a t-statistic of at least 3.0 to be taken seriously, not the traditional 2.0. The sheer volume of factor-discovery attempts has made the 5% significance threshold nearly meaningless.

Harvey, Liu & Zhu (2016) — …and the Cross-Section of Expected Returns

The multiple testing problem

This is the most important statistical concept in quantitative research, and the most widely ignored. Suppose you test 100 completely random signals against the Nifty 500 universe. By pure chance, roughly 5 will show statistically significant results at the 5% level, even though none have real edge. If you publish the best-performing 3 as "discovered factors," you've committed p-hacking.

Most factor research, including published academic work, doesn't disclose how many signals were tested before the "discovered" factor was selected. The backtest shows the best outcome of many trials, and the statistics are calculated as if it were the only trial.
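The effect is easy to reproduce in simulation. The sketch below generates 100 "strategies" with zero true edge (pure noise monthly returns) and counts how many clear the conventional significance bar; the parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n_signals, n_months = 100, 120

# 100 signals with zero true edge: monthly "strategy returns" are pure noise.
returns = rng.normal(loc=0.0, scale=0.04, size=(n_signals, n_months))

# t-statistic of the mean return for each signal
t_stats = returns.mean(axis=1) / (returns.std(axis=1, ddof=1) / np.sqrt(n_months))

significant = np.abs(t_stats) > 1.96   # ~5% two-sided threshold
print(f"{significant.sum()} of {n_signals} random signals look 'significant'")
print(f"best t-stat: {t_stats.max():.2f}")
```

Publishing only the best t-stat from a run like this, while reporting its p-value as if it were a single test, is exactly the error described above.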

Bonferroni Correction
Adjusted α = α / n
Testing n strategies with family-wise error rate α = 0.05 means each individual test needs p < 0.05/n. Testing 20 signals: each needs p < 0.0025. This is conservative but shows how much the bar rises with each additional test.
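The correction itself is trivial to apply; the hard part is being honest about n. A minimal sketch:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise
    false positive rate at alpha across n_tests tests."""
    return alpha / n_tests

for n in (1, 5, 20, 100):
    print(f"{n:>3} tests -> each needs p < {bonferroni_threshold(0.05, n):.4f}")
```

With 100 tests, each individual result must reach p < 0.0005 before the family of tests as a whole is significant at 5%.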

The honest check: when evaluating any factor or strategy, ask: "How many signals were tested to find this one?" If the answer is "many" and no multiple testing correction is applied, treat the results with significant scepticism, including your own research.

Stationarity and regime change

A signal that worked from 2003 to 2015 may not work from 2015 to 2023. Financial time series are non-stationary: their statistical properties change over time as market structure, participant composition, and information availability evolve. This creates a fundamental tension: you want as much historical data as possible (for statistical power), but older data may describe a market regime that no longer exists.

Momentum: relatively stationary
The 12-1 momentum signal has worked in Indian markets across multiple regime changes (2003 to present). Consistency across regimes is evidence of genuine structural alpha, not overfitting to one period.
Value: non-stationary post-2015
P/B-based value worked well in India pre-2015. The rise of capital-light businesses (IT, pharma, consumer) made book value less relevant. A value factor that doesn't adapt has decaying predictive power.
Liquidity regime shift
Indian small-cap liquidity has changed dramatically since 2010 as retail participation grew. Small-cap strategies that worked pre-2010 had different execution characteristics than today.
How to detect it
Chow tests and rolling-window analysis can detect when a signal's predictive power changed. Any factor strategy should be tested for structural breaks, not just evaluated on whole-period average performance.
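A rolling-window check is the simpler of the two to implement. The sketch below tracks the t-statistic of a strategy's mean return through time on a toy series whose edge disappears halfway through; the regime parameters are made up for illustration:

```python
import numpy as np

def rolling_t_stats(returns, window=36):
    """t-statistic of the mean return in each rolling window of months."""
    out = []
    for start in range(len(returns) - window + 1):
        w = returns[start:start + window]
        out.append(w.mean() / (w.std(ddof=1) / np.sqrt(window)))
    return np.array(out)

rng = np.random.default_rng(1)
# Toy series: a signal with real edge for 10 years, then none for 5 years.
regime_a = rng.normal(0.02, 0.03, 120)   # positive mean monthly return
regime_b = rng.normal(0.00, 0.03, 60)    # edge has decayed to zero
t = rolling_t_stats(np.concatenate([regime_a, regime_b]))

print(f"t-stat, early windows: {t[:12].mean():.2f}")
print(f"t-stat, late windows:  {t[-12:].mean():.2f}")
```

A whole-period average would still look respectable here; the rolling view reveals that all of the edge lives in the first regime.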

Autocorrelation and inflated t-statistics

Standard statistical tests assume independent observations. Monthly returns from a momentum strategy are not independent: if the strategy performs well in month 1, it tends to hold similar positions in month 2, so returns are correlated. This autocorrelation inflates the apparent sample size and makes t-statistics look more significant than they are.

The practical fix: use Newey-West standard errors when calculating t-statistics for time-series strategies, or use non-overlapping periods for performance evaluation. Reporting a 10-year backtest as "120 independent monthly observations" when there's significant serial correlation is a common and consequential error.
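The Newey-West adjustment replaces the naive variance of the mean with a long-run variance that adds Bartlett-weighted autocovariances. A plain-numpy sketch, tested here on synthetic AR(1) returns with positive serial correlation (lag choice and parameters are illustrative):

```python
import numpy as np

def newey_west_t(returns, max_lag=6):
    """t-stat of the mean return with Newey-West (HAC) standard errors,
    correcting for serial correlation up to max_lag lags."""
    r = np.asarray(returns, dtype=float)
    T = len(r)
    d = r - r.mean()
    # Long-run variance: gamma_0 + 2 * sum of Bartlett-weighted autocovariances
    lrv = d @ d / T
    for lag in range(1, max_lag + 1):
        gamma = d[lag:] @ d[:-lag] / T
        lrv += 2 * (1 - lag / (max_lag + 1)) * gamma
    return r.mean() / np.sqrt(lrv / T)

rng = np.random.default_rng(7)
# Synthetic AR(1) monthly returns: positively autocorrelated, like momentum P&L
mu, phi = 0.01, 0.5
eps = rng.normal(0, 0.02, 120)
r = np.empty(120)
r[0] = mu + eps[0]
for t in range(1, 120):
    r[t] = mu * (1 - phi) + phi * r[t - 1] + eps[t]

naive_t = r.mean() / (r.std(ddof=1) / np.sqrt(len(r)))
print(f"naive t-stat:      {naive_t:.2f}")
print(f"Newey-West t-stat: {newey_west_t(r):.2f}")  # smaller when autocorrelation is positive
```

With positive serial correlation the Newey-West t-statistic comes out lower than the naive one, which is the honest answer: the 120 months carry less independent information than they appear to.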

How RupeeCase handles these issues

RupeeCase backtests use non-overlapping annual windows for primary performance reporting: a 10-year backtest gives 10 independent annual observations, not 120 correlated monthly ones. The multiple testing problem is addressed by using fixed canonical factor parameters from academic literature rather than in-sample optimization. Available at invest.rupeecase.com.

Rigorous backtesting
Canonical parameters, honest statistics, non-overlapping windows: the RupeeCase backtester.
Built to avoid the statistical traps described here.

Glossary

Key terms from this module
Fat tails
The tendency of financial return distributions to have more extreme observations than a normal distribution predicts. Formally measured by excess kurtosis.
P-value
The probability of seeing results at least as extreme as the data, assuming no true effect. Does not equal the probability that a strategy works.
Multiple testing
The inflation of false positive rates when many hypotheses are tested. Testing 100 signals at 5% produces ~5 false positives even if none have genuine edge.
Stationarity
A time series is stationary if its statistical properties don't change over time. Financial returns are typically non-stationary, complicating historical analysis.
Autocorrelation
Correlation between a time series and lagged versions of itself. Serial correlation in strategy returns inflates apparent statistical significance if not corrected.



Up next, Module 5.2: Time Series Analysis. Volatility clustering, mean reversion, momentum persistence, and the five statistical properties of Indian equity returns that every systematic strategy must account for.
A note from the author
Why I wrote this path

I’ve spent years building quantitative models for Indian markets. The hardest lesson wasn’t learning the math, it was learning when the math lies to you. Most backtests are garbage. Most “alpha signals” are noise dressed up as signal.

This path exists because I wish someone had told me about multiple testing bias, fat tails, and out-of-sample validation before I wasted months on strategies that only worked in-sample. These are the statistical foundations that separate real quant work from curve-fitting.

Fair warning: this is the hardest path in the series. But if you get through it, you’ll have a genuine edge in evaluating any quantitative strategy, including the ones on RupeeCase.

Tanmay Kurtkoti
Founder & CEO, RupeeCase · 17 years systematic trading · QC Alpha
Want to put this into practice? RupeeCase is the systematic investing terminal built around everything you're learning here: factor scores, strategy backtests, and portfolio construction for Indian markets.