From CAPM To Hedging

Let’s start with a question from Twitter:

This is a provocative question. Patrick was clever to disallow Berkshire. In this post, we are going to use this question to launch into the basics of regression, correlation, beta hedging and risk.

Let’s begin.

My Reaction To The Question

I don’t know anything about picking stocks. I do know about the nature of stocks which makes this question scary. Why?

  1. Stocks don’t last forever

    Many stocks go to zero. The distribution of many stocks is positively skewed which means there’s a small chance of them going to the moon and a reasonable chance that they go belly-up. The price of a stock reflects its mathematical expectation. Since the downside is bounded by zero and the upside is infinite, for the expectation to balance the probability of the stock going down can be much higher than our flawed memories would guess. Stock indices automatically rebalance, shedding companies that lose relevance and value. So the idea that stocks go up over time is really that stock indices go up over time, even though individual stocks have a nasty habit of going to zero. For more see Is There Actually An Equity Premium Puzzle?.

  2. Diversification is the only free lunch

    The first point hinted at my concern with the question. I want to be diversified. Markets do not pay you for non-systematic risk. In other words, you do not get paid for risks that you can hedge. All but the most fundamental risks can be hedged with diversification. See Why You Don’t Get Paid For Diversifiable Risks. To understand how diversifiable risks get arbed out of the market ask yourself who the most efficient holder of a particular idiosyncratic risk is? If it’s not you, then you are being outbid by someone else, or you’re holding the risk at a price that doesn’t make sense given your portfolio choices. Read You Don’t See The Whole Picture to see why.

My concerns reveal why Berkshire would be an obvious choice. Patrick ruled it out to make the question much harder. Berkshire is a giant conglomerate. Many would have chosen it because it’s run by masterful investors Warren Buffett and Charlie Munger. But I would have chosen it because it’s diversified. It is one of the closest companies I could find to an equity index. Many people look at the question and think about where their return is going to be highest. I have no edge in that game. Instead, I want to minimize my risk by diversifying and accepting the market’s compensation for broad equity exposure.

In a sense, this question reminds me of an interview question I’ve heard.

You are gifted $1,000,000. You must put it all in play on a roulette wheel. What do you do?

The roulette wheel has negative edge no matter what you do. Your betting strategy can only alter the distribution. You can be crazy and bet it all on one number. Your expectancy is negative but the payoff is positively skewed…you probably lose your money but have a tiny chance at becoming super-rich. You can try to play it safe by risking your money on most of the numbers, but that is still negative expectancy. The skew flips to negative. You probably win, but there’s a small chance of losing most of your gifted cash.

I would choose what’s known as a minimax strategy, which seeks to minimize the maximum loss. I would spread my money evenly on all the numbers and accept a sure loss of 5.26%. The minimax response to Patrick’s question is to find the stock that is the most internally diversified.
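Here is a rough sketch of the arithmetic behind that choice. It assumes an American double-zero wheel (38 pockets, single-number bets paying 35-to-1), which is what the 5.26% figure implies. The function name `roulette_outcomes` is made up for illustration.

```python
from fractions import Fraction

POCKETS = 38   # assumption: American wheel (1-36 plus 0 and 00)
PAYOUT = 35    # a winning single-number bet pays 35-to-1

def roulette_outcomes(bankroll, numbers_covered):
    """Spread the bankroll evenly over `numbers_covered` single-number bets.

    Returns (probability of a hit, P&L on a hit, P&L on a miss, expected P&L).
    """
    stake = Fraction(bankroll, numbers_covered)
    p_hit = Fraction(numbers_covered, POCKETS)
    # On a hit, one stake is returned along with 35x winnings; the rest are lost.
    pnl_hit = stake * (PAYOUT + 1) - bankroll
    pnl_miss = -bankroll
    expected = p_hit * pnl_hit + (1 - p_hit) * pnl_miss
    return p_hit, pnl_hit, pnl_miss, expected

for covered in (1, 35, 38):
    p_hit, pnl_hit, pnl_miss, expected = roulette_outcomes(1_000_000, covered)
    print(
        f"cover {covered:2d} numbers: P(hit)={float(p_hit):.1%}, "
        f"hit P&L={float(pnl_hit):+,.0f}, miss P&L={float(pnl_miss):+,.0f}, "
        f"expected={float(expected):+,.0f}"
    )
```

The expected loss is identical in every row. Only the shape of the distribution changes: one number is a lottery ticket, 35 numbers is a likely small win with a rare wipeout, and all 38 is a guaranteed small loss.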

Berkshire Vs The Market

I don’t have an answer to Patrick’s question. Feel free to explore the speculative responses in the thread. Instead, I want to dive further into my gut reaction that Berkshire would be a reasonable proxy to the market. If we look at the mean of its annual returns from 1965 to 2001, the numbers are gaudy. Its CAGR was 26.6% vs the SP500 at 11%. Different era. Finding opportunities at the scale Buffett needs to move the needle has been much harder in the past 2 decades.

Buffett has been human for the past 20 years. This is a safer assumption than the hero stats he was putting up in the last half of the 20th century.

The mean arithmetic returns and standard deviations validate my hunch that Berkshire’s size and diversification make it behave like the whole market in a single stock.

Let’s add a scatterplot with a regression. 

If you tried to anticipate Berkshire’s return, your best guess might be its past 20 year return, distributed similarly to its prior volatility. Another approach would be to see this relationship to the SP500 and notice that a portion of its return can simply be explained by the market. It clearly has a positive correlation to the SP500. But just how much of the relationship is explained by the SP500? This is a large question with practical applications. Specifically, it underpins how market neutral traders think about hedges. If I hedge an exposure to Y with X how much risk do I have remaining? To answer this question we will go on a little learning journey:

  1. Deriving sensitivities from regressions in general
  2. Interpreting the regression
  3. CAPM: Applying regression to compute the “risk remaining of a hedge”

On this journey you can expect to learn the difference between beta and correlation, build intuition for how regressions work, and see how market exposures are hedged. 

Unpacking The Berkshire Vs SP500 Regression

A regression is simply a model of how an independent variable influences a dependent variable. Use a regression when you believe there is a causal relationship between 2 variables. Spurious correlations are correlations that will appear to be causal because they can be tight. The regression math may even suggest that’s the case. I’m sorry. Math is just a tool. It requires judgement. The sheer number of measurable quantities in the world guarantees an infinite list of correlations that serve as humor, not insight.

The SP500 is steered by the corporate earnings of the largest public companies (and in the long-run the Main Street economy) discounted by some risk-aware consensus. Berkshire is big and broad enough to inherit the same drivers. We accept that Berkshire’s returns are partly driven by the market and partly due to its own idiosyncrasies.

Satisfied that some of Berkshire’s returns are attributable to the broader market, we can use regression to understand the relationship. In the figure above, I had Excel simply draw a line that best fit the scatterplot with SP500 being the independent variable, or X, and Berkshire returns being the dependent variable, or Y. The best fit line (there are many kinds of regression but we are using a simple linear regression) is defined the same way any line is: by a slope and an intercept.

The regression equation should remind you of the generic form of a line y = mx + b where m is the slope and b is the intercept. 

In a regression:

y = α + βx

y = dependent variable (Berkshire returns)

x = independent variable (SP500 returns)

α = the intercept (a constant)

β = the slope or sensitivity of the Y variable based on the X variable

If you right-click on a scatterplot in Excel you can choose “Add Trendline”. It will open the below menu where you can set the fitted line to be linear and also check a box to “Display Equation on chart”.

This is how I found the slope and intercept for the Berkshire chart:

y = .6814x + .0307

Suppose the market returns 2%:

Predicted Berkshire return = .6814 * 2% + 3.07%

Predicted Berkshire return = 4.43%

So based on actual data, we built a simple model of Berkshire’s returns as a function of the market. 
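To make the model concrete, here is a tiny sketch that plugs a few market returns into the fitted line. The slope and intercept are the ones Excel reported above; the function name is just for illustration.

```python
# Coefficients read off the fitted line in the post: y = .6814x + .0307
BETA = 0.6814
ALPHA = 0.0307

def predicted_berkshire_return(market_return):
    """Model's best guess for Berkshire's annual return given the SP500's."""
    return ALPHA + BETA * market_return

for market in (-0.20, 0.0, 0.02, 0.20):
    guess = predicted_berkshire_return(market)
    print(f"SP500 {market:+.1%} -> predicted Berkshire {guess:+.2%}")
```

Notice that a flat market still predicts a 3.07% Berkshire return (the intercept), and every extra 1% of market return adds about 0.68% (the slope).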

It’s worth slowing down to understand how this line is being created. Conceptually it is the line that minimizes the squared errors between itself and the actual data. Since each point has 2 coordinates, we are dealing with the variance of a joint distribution. We use covariance instead of variance but the concepts are analogous. With variance we square the deviations from a mean. For covariance, we multiply the distance of each X and Y in a coordinate from their respective means: (xᵢ – x̄)(yᵢ -ȳ)

Armed with that idea, we can compute the regression line by hand with the following formulas:

β or slope = covar(x,y)/ var(x)

α or intercept = ȳ – βx̄
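A minimal sketch of those two formulas in Python. The return series here is synthetic (I am generating made-up data with a seed, not using the actual Berkshire and SP500 returns), and the helper names `fit_line` and `squared_error` are my own.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Slope and intercept of the least-squares line, computed by hand."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sums of cross-deviations and squared deviations. Dividing both by n
    # would give covar(x,y) and var(x); the n cancels in the ratio.
    covar = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    beta = covar / var_x
    alpha = y_bar - beta * x_bar
    return alpha, beta

def squared_error(xs, ys, alpha, beta):
    """Sum of squared gaps between the actual ys and the line's predictions."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

# Synthetic stand-ins for 20 annual returns (assumed numbers, not real data).
rng = random.Random(0)
xs = [rng.gauss(0.08, 0.18) for _ in range(20)]
ys = [0.03 + 0.7 * x + rng.gauss(0, 0.10) for x in xs]

alpha, beta = fit_line(xs, ys)
best = squared_error(xs, ys, alpha, beta)
print(f"alpha={alpha:.4f}, beta={beta:.4f}, squared error={best:.4f}")

# Nudging the line in any direction should only make the squared error worse.
for d_alpha, d_beta in ((0.01, 0.0), (-0.01, 0.0), (0.0, 0.05), (0.0, -0.05)):
    worse = squared_error(xs, ys, alpha + d_alpha, beta + d_beta)
    print(f"nudge ({d_alpha:+.2f}, {d_beta:+.2f}) -> squared error {worse:.4f}")
```

The nudges at the end are the "minimizes the squared errors" claim in action: any other slope or intercept fits this data worse.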

We will look at the full table of this computation later to verify Excel’s regression line. Before we do that, let’s make sure that this model is even helpful. One standard we could use to determine if the model is useful is whether it performs better than the cheapest naive model that says:

Our predicted Berkshire return is simply the mean return from the sample.

The green arrows in this picture represent the error between this simple model and the actual returns.

Summing the squared differences from the mean of Berkshire’s returns under this naive model is exactly the computation behind variance. You are computing squared differences from a mean. If you take the square root of the average of the squared differences you get a standard deviation. In this simple model where our prediction is simply the mean, our volatility is 16.5%, the volatility of Berkshire’s returns over the 20 years.

In the regression context, the total variance of the dependent variable from its mean is known as the Total Sum of Squares or TSS.

The point of using regression though is we can make a better prediction of Berkshire’s returns if we know the SP500’s returns. So we can compare the mean to the fitted line instead of the actual returns. The sum of those squared differences is known as the Regression Sum Of Squares or RSS. This is the sum of squared deviations between the mean and fitted predictions instead of the actual returns. If there is tremendous overlap between the RSS and TSS, then we think much of the variance of Y is explained by the variance in X.

The last quantity we can look at is the Error Sum of Squares or ESS. These are the deviations from the actual data to the predicted values represented by our fitted line. This represents the unexplained portion of Y’s variance. 


Let’s use 2008’s giant negative return to show how TSS, RSS, and ESS relate.


The visual shows:


We can compute the sum of these squared deviations simply from their definitions:

TSS (aka variance) Σ(actual-mean)²
ESS (sum of errors squared) Σ(actual-predicted)²
RSS (aka TSS – ESS) Σ(predicted-mean)²
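Those three definitions translate directly into code. This is a sketch on synthetic returns (seeded, made-up numbers), using this post’s naming where RSS is the regression sum of squares and ESS is the error sum of squares. Be aware that many textbooks swap those two acronyms.

```python
import random
from statistics import mean

# Synthetic stand-ins for 20 years of SP500 (x) and Berkshire (y) returns.
rng = random.Random(1)
xs = [rng.gauss(0.08, 0.18) for _ in range(20)]
ys = [0.03 + 0.7 * x + rng.gauss(0, 0.10) for x in xs]

# Fit the least-squares line by hand.
x_bar, y_bar = mean(xs), mean(ys)
covar = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
var_x = sum((x - x_bar) ** 2 for x in xs)
beta = covar / var_x
alpha = y_bar - beta * x_bar
predicted = [alpha + beta * x for x in xs]

tss = sum((y - y_bar) ** 2 for y in ys)                   # actual vs mean
ess = sum((y - p) ** 2 for y, p in zip(ys, predicted))    # actual vs predicted
rss = sum((p - y_bar) ** 2 for p in predicted)            # predicted vs mean

print(f"TSS={tss:.4f}  RSS={rss:.4f}  ESS={ess:.4f}")
print(f"RSS + ESS={rss + ess:.4f}  R^2={rss / tss:.2%}")
```

For a least-squares line with an intercept, RSS + ESS equals TSS up to floating point noise, which is why RSS can be backed out as TSS minus ESS.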

The only other quantities we need are variances and covariances to compute β or slope of the regression line. 

In the table below:

ŷ = the predicted value of Berkshire’s return aka “y-hat”

x̄ = mean SP500 return aka “x-bar”

ȳ = mean Berkshire return aka “y-bar”



  β = .40 / .59 = .6814

  α = ȳ – βx̄ = 10.6% – .6814 * 11.1% = 3.07%

This yields the same regression equation Excel spit out:


ŷ = 3.07% + .6814x


We walked through this slowly as a learning exercise, but the payoff is appreciating the R². Excel computed it as 52%. But we did everything we need to compute it by hand. Go back to our different sum of squares.

TSS or variance of Y = .52

ESS or sum of squared difference between actual data and the model = .25

Re-arranging TSS = RSS + ESS we can see that RSS = .27

Which brings us to:

R² = RSS/TSS = .27/.52 = 52% 

Same as Excel!

R² is the regression sum of squares divided by the total variance of Y. It is called the coefficient of determination and can be interpreted as:

The variability in Y explained by X

So based on this small sample, 52% of Berkshire’s variance is explained by the market, as proxied by the SP500. 


Correlation, r (or if you prefer Greek, ρ) can be computed in at least 2 ways. It’s the square root of R².

r = √R² = √.52 = .72

We can confirm this by computing correlation by hand according to its own formula:

r = covar(x,y) / (σx * σy) = covar(x,y) / sqrt(var(x) * var(y))

Looking at the table above we have all the inputs:

r = .40 / sqrt(.59 x .52)

r = .72
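A quick sketch that reproduces both routes to the correlation from the rounded figures quoted in this post (.40, .59, .52, .25). Because the inputs are rounded to two decimals, expect the results to agree only to about two decimals.

```python
import math

# Rounded inputs quoted in the post.
covar_xy = 0.40   # co-variation of SP500 and Berkshire returns
var_x = 0.59      # variation of SP500 returns
tss = 0.52        # variation of Berkshire returns (TSS)
ess = 0.25        # squared errors between actual data and the fitted line

rss = tss - ess
r_squared = rss / tss

r_from_r_squared = math.sqrt(r_squared)
r_from_covariance = covar_xy / math.sqrt(var_x * tss)
beta_from_rounded = covar_xy / var_x

print(f"R^2 = {r_squared:.2%}")
print(f"r via sqrt(R^2)      = {r_from_r_squared:.3f}")
print(f"r via covariance     = {r_from_covariance:.3f}")
print(f"beta, rounded inputs = {beta_from_rounded:.3f} (Excel: .6814)")
```

One caveat on the first route: a square root is never negative, so sqrt(R²) only gives the size of the correlation. The covariance formula also carries the sign, which matters when the two variables move in opposite directions.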

Variance is an unintuitive number. By taking the square root of variance, we arrive at a standard deviation which we can actually use.

Similarly, covariance is an intermediate computation lacking intuition. By normalizing it (ie dividing it) by the standard deviations of X and Y we arrive at correlation, a measure that holds meaning to us. It is bounded by -1 and +1. If the correlation is .72 then we can make the following statement:

If x is 1 standard deviation above its mean, I expect y to be .72 standard deviations above its own mean.

It is a normalized measure of how one variable co-varies versus the other. 

How Beta And Correlation Relate

Beta, β, is the slope of the regression equation.

Correlation is the square root of R², the coefficient of determination.

Beta actually embeds correlation within it.

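Look closely at the formulas:

β = covar(x,y) / var(x) = covar(x,y) / σx²

r = covar(x,y) / (σx * σy)

Watch what happens when we divide β by r.

β / r = (σx * σy) / σx² = σy / σx

β = r * (σy / σx)
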
Beta equals correlation times the ratio of the standard deviations. 
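Here is a small numerical check of that identity on synthetic data (the series and parameters below are assumptions chosen only for illustration). It also computes the beta in the reverse direction, regressing x on y.

```python
import random
from statistics import mean, pstdev

# Synthetic pair: x is a low-vol series, y is a noisier series that loads on x.
rng = random.Random(2)
xs = [rng.gauss(0, 0.15) for _ in range(500)]
ys = [2.1 * x + rng.gauss(0, 0.32) for x in xs]

x_bar, y_bar = mean(xs), mean(ys)
covar = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)
sigma_x, sigma_y = pstdev(xs), pstdev(ys)

r = covar / (sigma_x * sigma_y)
beta_y_on_x = covar / sigma_x ** 2   # slope when y is the dependent variable
beta_x_on_y = covar / sigma_y ** 2   # slope when x is the dependent variable

print(f"r                      = {r:.4f}")
print(f"beta (y on x)          = {beta_y_on_x:.4f}")
print(f"r * sigma_y / sigma_x  = {r * sigma_y / sigma_x:.4f}")
print(f"beta (x on y)          = {beta_x_on_y:.4f}")
print(f"r * sigma_x / sigma_y  = {r * sigma_x / sigma_y:.4f}")
```

The two betas are very different numbers even though there is only one correlation. Their product is r², which is one way to see that the correlation is shared while the vol ratio flips.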

The significance of that insight is about to become clear as we move from our general use of regression to the familiar CAPM regression. From the CAPM formula we can derive the basis of hedge ratios and more!

We have done all the heavy lifting at this point. The reward will be a set of simple, handy formulas that have served me throughout my trading career.

Let’s continue.

From Regression To CAPM 

The famous CAPM pricing equation is a simple linear regression stipulating that the return of an asset is a function of the risk free rate, a beta to the broader market, plus an error term that represents the security’s own idiosyncratic risk. 

Rᵢ = Rբ + β(Rₘ – Rբ) + Eᵢ


Rᵢ = security total return

Rբ = risk-free rate

β = sensitivity of security’s return to the overall market’s excess return (ie the return above the risk-free rate)

Eᵢ = the security’s unique return (aka the error or noise term)

Since the risk-free rate is a constant, let’s scrap it to clean the equation up.

Rᵢ = βRₘ + Eᵢ

This is the variance equation for this security (the error term is uncorrelated with the market, so there is no cross term):

σᵢ² = β²σₘ² + σₑ²

Recall that beta is the vol ratio * correlation:

β = r * (σᵢ / σₘ)

We can use this to factor the “market variance” term.

β²σₘ² = r² * (σᵢ² / σₘ²) * σₘ² = r²σᵢ²

Plugging this form of “variance due to the market” back into the variance equation:

σᵢ² = r²σᵢ² + σₑ²

This reduces to the prized equation: The “risk remaining” formula which is the proportion of a stock’s volatility due to its own idiosyncratic risk.

σₑ / σᵢ = sqrt(1 – r²)

This makes sense. R² is the amount of variance in a dependent variable attributable to the independent variable. If we subtract that proportion from 1 we arrive at the “unexplained” or idiosyncratic variance. By taking the square root of that quantity, we are left with unexplained volatility or “risk remaining”.
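If you would rather see it than derive it, here is a simulation sketch. The market and stock series are synthetic, with parameters I made up. We beta-hedge the stock with the market and compare the volatility left over to sqrt(1 – r²).

```python
import math
import random
from statistics import mean, pstdev

# Synthetic returns: a market series and a stock that partly follows it.
rng = random.Random(3)
market = [rng.gauss(0, 0.15) for _ in range(2000)]
stock = [0.7 * m + rng.gauss(0, 0.11) for m in market]

m_bar, s_bar = mean(market), mean(stock)
covar = sum((m - m_bar) * (s - s_bar) for m, s in zip(market, stock)) / len(market)
beta = covar / pstdev(market) ** 2
r = covar / (pstdev(market) * pstdev(stock))

# Long 1 unit of the stock, short beta units of the market.
hedged = [s - beta * m for m, s in zip(market, stock)]

measured = pstdev(hedged) / pstdev(stock)
formula = math.sqrt(1 - r ** 2)
print(f"correlation                    = {r:.3f}")
print(f"hedged vol / unhedged vol      = {measured:.3f}")
print(f"sqrt(1 - r^2)                  = {formula:.3f}")
```

The two numbers match because the hedge uses the beta estimated on the same sample. In live trading the beta and correlation come from the past, so the realized risk remaining will differ from the formula.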

Let’s use what we’ve learned in a concrete example.

From CAPM To Hedge Ratios

Let’s return to Berkshire vs the SP500. Suppose we are long $10mm worth of BRK.B and want to hedge our exposure by going short SP500 futures. 

We want to compute:

  1. How many dollars worth of SP500 to get short
  2. The “risk remaining” on the hedged portfolio

How many dollars of SP500 do we need to short?

Before we answer this let’s consider a few ways we can hedge with SP500.

  • Dollar weighting

    We could simply sell $10mm worth of SP500 futures which corresponds to our $10mm long in BRK.B. Since Berkshire and the SP500 have similar volatility this is a reasonable approach. But suppose we were long TSLA instead of BRK.B. Assuming TSLA was sufficiently correlated to the market (say .70 like BRK.B), the SP500 hedge would be “too light”.

    Why? Because TSLA is about 3x more volatile than the SP500. If the SP500 fell 1 standard deviation, we expect TSLA to fall .70 standard deviations. Since TSLA’s standard deviations are much larger than the SP500’s we would be tragically underhedged. Our TSLA long would lose much more money than our short SP500 position would make because we are not short enough dollars of SP500.

  • Vol weighting

    Dollar weighting is clearly naive if there are large differences in volatility between our long and short. Let’s stick with the TSLA example. If TSLA is 3x as volatile as the SP500 then if we are long $10mm TSLA, we need to short $30mm worth of SP500.

    Uh oh. 

    That’s going to be too much. Remember the correlation. It’s only .70. The pure vol weighted hedge only makes sense if the correlation is 1. If the SP500 drops one standard deviation, we expect TSLA to drop only .70 standard deviations, not a full standard deviation. In this case, we will have made too much money on our hedge, but if the market had rallied 1 standard deviation our oversized short would have been “heavy”. We would lose more money than we gained on our TSLA long. Again, only partially hedged.

  • Beta weighting

    At last, we arrive at the goldilocks solution. We use the beta or slope of the linear regression to weight our hedge. Since beta equals correlation * vol ratio we are incorporating both vol and correlation weighting into our hedge!

    I made up vols and correlations to complete the summary tables below. The key is seeing how much the prescribed hedge ratios can vary depending on how you weight the trades.

    Beta weighting accounts for both relative volatilities and the correlation between names. Beta has a one-to-many relationship to its construction. A beta of .5 can come from:

    • A .50 correlation but equal vols
    • A .90 correlation but vol ratio of .56
    • A .25 correlation but vol ratio of 2
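The three weighting schemes can be sketched as one small function. The TSLA inputs below (45% vol, 15% SP500 vol, .70 correlation) are the made-up numbers used in this post, and `hedge_notionals` is a name I invented for the example.

```python
def hedge_notionals(long_notional, vol_long, vol_hedge, correlation):
    """Dollars of the hedge instrument to short under each weighting scheme."""
    vol_ratio = vol_long / vol_hedge
    beta = correlation * vol_ratio
    return {
        "dollar": long_notional,             # ignores vol and correlation
        "vol": long_notional * vol_ratio,    # assumes correlation of 1
        "beta": long_notional * beta,        # uses both vol and correlation
    }

tsla = hedge_notionals(10_000_000, vol_long=0.45, vol_hedge=0.15, correlation=0.70)
for scheme, dollars in tsla.items():
    print(f"{scheme:>6}-weighted SP500 short: ${dollars:,.0f}")

# Three different routes to the same beta of roughly .5
for correlation, vol_ratio in ((0.50, 1.0), (0.90, 0.56), (0.25, 2.0)):
    print(f"correlation {correlation:.2f} x vol ratio {vol_ratio:.2f} "
          f"= beta {correlation * vol_ratio:.2f}")
```

For the TSLA example this gives shorts of roughly $10mm, $30mm, and $21mm, which brackets the beta-weighted answer between the "too light" and "too heavy" hedges.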

It’s important to decompose betas because the correlation portion is what determines the “risk remaining” on a hedge. Let’s take a look. 

How much risk remains on our hedges?

We are long $10,000,000 of TSLA

We sell $21,000,000 of SP500 futures as a beta-weighted hedge. 

Risk remaining is the volatility of TSLA that is unexplained by the market.

  • R² is the amount of variance in the TSLA position explained by the market.
  • 1-R² is the amount of variance that remains unexplained
  • The vol remaining is sqrt(1-R²)

Risk (or vol) remaining = sqrt(1 – .70²) = sqrt(.51) = 71%

TSLA annual volatility is 45% so the risk remaining is 71.4% * 45% = 32.1%

32.1% of $10,000,000 of TSLA = $3,210,000

So if you ran a hedged position, within 1 standard deviation, you still expect $3,210,000 worth of noise!

Remember correlation is symmetrical. The correlation of A to B is the same as the correlation of B to A (you can confirm this by looking at the formula).

Beta is not symmetrical because it’s correlation * σdependent / σindependent

Yet risk remaining only depends on correlation.

So what happens if we flipped the problem and tried to hedge $10,000,000 worth of SP500 with a short TSLA position?

  1. First, this is conceptually a more dangerous idea. Even though the correlation is .70, we are less likely to believe that TSLA’s variance explains the SP500’s variance. Math without judgement will impale you on a spear of overconfidence.

  2. I’ll work through the example just to be complete.

    To compute beta we flip the vol ratio from 3 to 1/3 then multiply by the correlation of .7

    Beta of SP500 to TSLA is .333 * .7 = .233

    If we are long $10,000,000 of SP500, we sell $2,333,000 of TSLA. The risk remaining is still 71% but it is applied to the SP500 volatility of 15%.

    71.4% x 15% = 10.7% so we expect 10.7% of $10,000,000 or $1,070,000 of the SP500 position to be unexplained by TSLA.

  3. I’m re-emphasizing: math without judgement is a recipe for disaster. The formulas are tools, not substitutes for reasoning. 
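A short sketch of the dollar arithmetic in both directions, using the post’s made-up inputs (TSLA vol 45%, SP500 vol 15%, correlation .70). Note that with a correlation of .70, 1 – R² is .51 and its square root is about .71. The function names are mine.

```python
import math

def risk_remaining(correlation):
    """Fraction of the position's volatility left unexplained by the hedge."""
    return math.sqrt(1 - correlation ** 2)

def unexplained_dollars(notional, annual_vol, correlation):
    """One standard deviation of annual P&L noise left after a beta hedge."""
    return notional * annual_vol * risk_remaining(correlation)

CORRELATION = 0.70

long_tsla = unexplained_dollars(10_000_000, 0.45, CORRELATION)
long_spx = unexplained_dollars(10_000_000, 0.15, CORRELATION)

print(f"risk remaining at r={CORRELATION}: {risk_remaining(CORRELATION):.1%}")
print(f"long $10mm TSLA, short SP500: ${long_tsla:,.0f} of noise")
print(f"long $10mm SP500, short TSLA: ${long_spx:,.0f} of noise")
```

The percentage of risk remaining is the same in both directions because it depends only on correlation. The dollar noise differs because it scales with the volatility of whatever you are long.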

Changes in Correlation Have Non-Linear Effects On Your Risk

Hedging is tricky. You can see that risk remaining explodes rapidly as correlation falls.

If correlation is as high as .86, you already have 50% risk remaining!
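To get a feel for the curve, this sketch tabulates risk remaining for a few correlations and then inverts the formula to ask how much correlation you need for a given level of leftover risk. The helper names are illustrative.

```python
import math

def risk_remaining(correlation):
    """Unexplained volatility as a fraction of the position's volatility."""
    return math.sqrt(1 - correlation ** 2)

def correlation_needed(target_risk_remaining):
    """Correlation required to get risk remaining down to the target."""
    return math.sqrt(1 - target_risk_remaining ** 2)

for r in (0.99, 0.95, 0.90, 0.86, 0.70, 0.50):
    print(f"correlation {r:.2f} -> risk remaining {risk_remaining(r):.0%}")

for target in (0.50, 0.25, 0.10):
    print(f"risk remaining {target:.0%} requires correlation "
          f"{correlation_needed(target):.3f}")
```

The inverse view is the sobering one: cutting the leftover risk to 10% takes a correlation of about .995.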

In practice, a market maker may:

  1. group exposures to the most related index (they might have NDX, SPX, and IWM buckets for example)
  2. offset deltas between exposures as they accumulate
  3. and hedge the remaining deltas with futures. 

You might create risk tolerances that stop you from, say, being long $50mm worth of SPX and short $50mm of NDX, leaving you exposed to the underlying factors which differentiate these indices. Even though they might be tightly correlated intraday, correlations change over time and your risk remaining can begin to swamp your edge.

The point of hedging is to neutralize the risks you are not paid to take. But hedging is costly. Traders must always balance these trade-offs in the context of their capital, risk tolerances, and changing correlations. 


I walked slowly through topics that are familiar to many investors and traders. I did this because the grout in these ideas often triggers an insight or newfound clarity about something we thought we understood.

This is a recap of important ideas in this post:

  • Variance is a measure of dispersion for a single distribution. Covariance is a measure of dispersion for a joint distribution.
  • Just as we take the square root of variance to normalize it to something useful (standard deviation, or in a finance context — volatility), we normalize covariance into correlation.
  • Intuition for a positive(negative) correlation: if X is N standard deviations above its mean, Y is r * N standard deviations above(below) its mean. 
  • Beta is r * the vol ratio of Y to X. In a finance context, it allows us to convert a correlation from a standard deviation comparison to a simple elasticity. If beta = 1.5, then if X is up 2%, I expect Y to be up 3%
  • Correlation is symmetrical. Beta is not. 
  • R² is the variance explained by the independent variable. Risk remaining is the volatility that remains unexplained. It is equal to sqrt(1-R²). 
  • There is a surprising amount of risk remaining even if correlations are strong. At a correlation of .86, about 50% of the volatility remains unexplained!
  • Don’t compute robotically. Reason > formulas. 



Least squares linear regression is only one method for fitting a line. It only works for linear relationships. Its application is fraught with pitfalls. It’s important to understand the assumptions in any models you use before they become load-bearing beams in your process. 


The table in this post was entirely inspired by Rahul Pathak’s post Anova For Regression.

For the primer on regression and sum of squares I read these 365 DataScience posts in the following order:

  1. Getting Familiar with the Central Limit Theorem and the Standard Error

  2. How To Perform A Linear Regression In Python (With Examples!)

  3. The Difference between Correlation and Regression

  4. Sum of Squares Total, Sum of Squares Regression and Sum of Squares Error

  5. Measuring Explanatory Power with the R-squared

  6. Exploring the 5 OLS Assumptions for Linear Regression Analysis
    (I strongly recommend reading this post before diving in on your own. )




There’s Gold In Them Thar Tails: Part 1

If you were accepted to a selective college or job in the 90s, have you ever wondered if you’d get accepted in today’s environment? I wonder myself. It leaves me feeling grateful because I think the younger version of me would not have gotten into Cornell or SIG today. Not that I dwell on this too much. I take Heraclitus at his word that we do not cross the same river twice. Transporting a fixed mental impression of yourself into another era is naive (cc the self-righteous who think they’d be on the right side of history on every topic). Still, my self-deprecation has teeth. When I speak to friends with teens I hear too many stories of sterling resumes bulging with 3.9 GPAs, extracurriculars, and Varsity sport letters, being warned: “don’t bother applying to Cal”.

A close trader friend explained his approach. His daughter is a high achiever. She’s also a prolific writer. Her passion is the type all parents hope their children will be lucky enough to discover. My friend recognizes that the bar is so high to get into a top school that acceptance above that bar is a roulette wheel. With so much randomness lying above a strict filter, he de-escalates the importance of getting into an elite school. “Do what you can, but your life doesn’t depend on the whim of an admissions officer”. She will lean into getting better at what she loves wherever she lands. This approach is not just compassionate but correct. She’s thought ahead, got her umbrella, but she can’t control the weather.

My friend’s insight that acceptance above a high threshold is random is profound. And timely. I had just finished reading Rohit Krishnan’s outstanding post Spot The Outlier, and immediately sent it to my friend.

I chased down several citations in Rohit’s post to improve my understanding of this topic.

In this post, we will tie together:

  1. Why the funnels are getting narrower
  2. The trade-offs in our selection criteria
  3. The nature of the extremes: tail divergence
  4. Strategies for the extremes

We will extend the discussion in a later post with:

  1. What this means for intuition in general
  2. Applications to investing

Why Are The Funnels Getting Narrower?

The answer to this question is simple: abundance.

In college admissions, the number of candidates in aggregate grows with the population. But this isn’t the main driver behind the increased selectivity.  The chart below shows UC acceptance rates plummeting as total applications outstrip admits.

The spread between applicants and admissions has exploded. UCLA received almost 170k applications for the 2021 academic year! Cal receives over 100k applicants for about 10k spots. Your chances of getting in have cratered in the past 20 years. Applications have lapped population growth due to a familiar culprit: connectivity. It is much easier to apply to schools today. The UC system now uses a single boilerplate application for all of its campuses.

This dynamic exists everywhere. You can apply to hundreds of jobs without a postage stamp. Artists, writers, analysts, coders, designers can all contribute their work to the world in a permissionless way with as little as a smartphone. Sifting through it all necessitated the rise of algorithms — the admissions officers of our attention.

Trade-offs in Selection Criteria

There’s a trade-off between signal and variance. What if Spotify employed an extremely narrow recommendation engine indexed solely on artist? If listening to Enter Sandman only led you to Metallica’s deepest cuts, the engine is failing to aid discovery. If it indexed by “year”, you’d get a lot more variance since it would choose across genres, but headbangers don’t want to listen to Color Me Badd. This prediction fails to delight the user.

Algorithms are smarter than my cardboard examples but the tension remains. Our solutions to one problem exacerbate another. Rohit describes the dilemma:

The solution to the problem of discovery is better selection, which is the second problem. Discovery problems demand you do something different, change your strategy, to fight to be amongst those who get seen.

There’s plenty of low-hanging fruit to find recommendations that reside between Color Me Badd and St. Anger. But once it’s picked, we are still left with a vast universe of possible songs for the recommendation engine to choose from.

Selection problems reinforce the fact that what we can measure and what we want to measure are two different things, and they diverge once you get past the easy quadrant.

In other words, it’s easy enough to rule out B students, but we still need to make tens of thousands of coinflip-like decisions between the remaining A students. Are even stricter exams an effective way to narrow an unwieldy number of similar candidates? Since in many cases predictors poorly map to the target, the answer is probably no. Imagine taking it to the extreme and setting the cutoff to the lowest SAT score that would satisfy Cal’s expected enrollment. Say that’s 1400. This feels wrong for good reasons (and this is not even touching the hot stove topic of “fairness”). Our metrics are simply imperfect proxies for who we want to admit. In mathy language we can say, the best person at Y (our target variable) is not likely to come from the best candidates we screened if the screening criterion, X, is an imperfect correlate of success (Y).

The cost of this imperfect correlation is a loss of diversity or variance. Rohit articulates the true goal of selection criteria (emphasis mine):

Since no exam perfectly captures the necessary qualities of the work, you end up over-indexing on some qualities to the detriment of others. For most selection processes the idea isn’t to get those that perfectly fit the criteria as much as a good selection of people from amongst whom a great candidate can emerge.

This is even true in sports. Imagine you have a high NBA draft pick. A great professional must endure 82 games (plus a long playoff season), fame, money, and most importantly, a sustained level of unprecedented competition. Until the pros, they were kids. Big fish in small ponds. If you are selecting for an NBA player with narrow metrics, even beyond the well-understood requisite screens for talent, then those metrics are likely to be a poor guide to how the player will handle such an outlier life. The criteria will become more squishy as you try to parse the right tail of the distribution.

In the heart of the population distribution, the contribution to signal of increasing selectivity is worth the loss of variance. We can safely rule out B students for Cal and D3 basketball players for the NBA.  But as we get closer to elite performers, at what point should our metrics give way to discretion? Rohit provides a hint:

When the correlation between the variable measured and outcome desired isn’t a hundred percent, the point at which the variance starts outweighing the mean error is where dragons lie!

Nature Of The Extremes: Tail Divergence

To appreciate why the signal of our predictive metrics becomes random at the extreme right tail we start with these intuitive observations via LessWrong:

Extreme outliers of a given predictor are seldom similarly extreme outliers on the outcome it predicts, and vice versa. Although 6’7″ is very tall, it lies within a couple of standard deviations of the median US adult male height – there are many thousands of US men taller than the average NBA player, yet are not in the NBA. Although elite tennis players have very fast serves, if you look at the players serving the fastest serves ever recorded, they aren’t the very best players of their time. It is harder to look at the IQ case due to test ceilings, but again there seems to be some divergence near the top: the very highest earners tend to be very smart, but their intelligence is not in step with their income (their cognitive ability is around +3 to +4 SD above the mean, yet their wealth is much higher than this).

The trend seems to be that even when two factors are correlated, their tails diverge: the fastest servers are good tennis players, but not the very best (and the very best players serve fast, but not the very fastest); the very richest tend to be smart, but not the very smartest (and vice versa). 

The post uses simple scatterplots to demonstrate. Here are 2 self-explanatory charts. 

LessWrong continues: Given a correlation, the envelope of the distribution should form some sort of ellipse, narrower as the correlation goes stronger, and more circular as it gets weaker.

If we zoom into the far corners of the ellipse, we see ‘divergence of the tails’: as the ellipse doesn’t sharpen to a point, there are bulges where the maximum x and y values lie with sub-maximal y and x values respectively:

Say X is SAT score and Y is college GPA. We shouldn’t expect that the person with the highest SATs will earn the highest GPA. SAT is an imperfect correlate of GPA. LessWrong’s interpretation is not surprising:

The fact that a correlation is less than 1 implies that other things matter to an outcome of interest. Although being tall matters for being good at basketball, strength, agility, hand-eye-coordination matter as well (to name but a few). The same applies to other outcomes where multiple factors play a role: being smart helps in getting rich, but so does being hard working, being lucky, and so on.
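A simulation makes the tail divergence tangible. This is a sketch under a strong simplifying assumption: the predictor X and the outcome Y are jointly normal with a chosen correlation, which real admissions or sports data need not follow. The function name `top_overlap` is made up.

```python
import math
import random

def top_overlap(correlation, population, top_n, rng):
    """Fraction of the top_n by predictor X who are also top_n by outcome Y."""
    people = []
    for _ in range(population):
        x = rng.gauss(0, 1)
        # Build Y so that its correlation with X equals `correlation`.
        y = correlation * x + math.sqrt(1 - correlation ** 2) * rng.gauss(0, 1)
        people.append((x, y))
    ranked_by_x = sorted(range(population), key=lambda i: people[i][0])
    ranked_by_y = sorted(range(population), key=lambda i: people[i][1])
    best_x = set(ranked_by_x[-top_n:])
    best_y = set(ranked_by_y[-top_n:])
    return len(best_x & best_y) / top_n

rng = random.Random(4)
for r in (0.50, 0.70, 0.90, 0.99):
    share = top_overlap(r, population=20_000, top_n=20, rng=rng)
    print(f"correlation {r:.2f}: {share:.0%} of the top 20 on X "
          f"are also top 20 on Y")
```

Even with a correlation most social scientists would envy, the very top of the X ranking and the very top of the Y ranking are mostly different people.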

Pushing this even further, if we zoom in on the extreme of a distribution we may find correlations invert! This scatterplot via shows a positive correlation over the full sample (pink) but a negative correlation for a slice (blue). 

This is known as Berkson’s Paradox and can appear when you measure a correlation over a “restricted range” of a distribution (for example, if we restrict our sample to the best 20 basketball players in the world we might find that height is negatively correlated to skill if the best players were mostly point guards).
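The restricted-range effect can be reproduced with a small simulation sketch. The assumptions here are illustrative: two traits with a true correlation of 0.5, and an "elite" slice defined as anyone whose combined score x + y exceeds 3. Neither number comes from the chart above.

```python
import math
import random

def pearson(pairs):
    """Plain Pearson correlation for a list of (x, y) tuples."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)
rho = 0.5  # assumed correlation in the full population
population = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    population.append((x, y))

# Hypothetical selection rule: only the highest combined scores make the cut
elite = [(x, y) for x, y in population if x + y > 3]

full_corr = pearson(population)
elite_corr = pearson(elite)
print(f"full sample:  n={len(population)}, corr={full_corr:+.2f}")
print(f"elite slice:  n={len(elite)}, corr={elite_corr:+.2f}")
```

Inside the slice, anyone who made the cut with a modest x must have had an unusually high y, which is what flips the sign.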

[I’ve written about Berkson’s Paradox here. Always be wary of someone trying to show a correlation from a cherry-picked range of a distribution. Once you internalize this you will see it everywhere! I’d be charitable to the perpetrator. I suspect it’s usually careless thinking rather than a nefarious attempt to persuade.]

Strategies For The Extremes

In 1849, assayer Dr. M. F. Stephenson shouted ‘There’s gold in them thar hills’ from the steps of the Lumpkin County Courthouse in a desperate bid to keep the miners in Georgia from heading west to chase riches in California. We know there’s gold in the tails of distributions but our standard filters are unfit to sift for them.

Let’s pause to take inventory of what we know. 

  1. As the number of candidates or choices increases we demand stricter criteria to keep the field to a manageable size.
  2. At some cutoff, in the extreme of a distribution, selection metrics can lead to random or even misleading predictions. 1

    I’ll add a third point to what we have already established:

  3. Evolution in nature works by applying competitive pressures to a diverse population to stimulate adaptation (a form of learning). Diversity is more than a social buzzword. It’s an essential input to progress. Rohit implicitly acknowledges the dangers of inbreeding when he warns against putting folks through a selection process that reflexively molds them into rule-following perfectionists rather than those who are willing to take risks to create something new.

With these premises in place we can theorize strategies for both the selector and the selectee to improve the match between a system’s desired output (the definition of success depends on the context) and its inputs (the criteria the selector uses to filter). 

Selector Strategies

We can continue to rely on conventional metrics to filter the meat of the distribution for a pool of candidates. As we get into the tails, our adherence to and reverence for measures should be put aside in favor of increasing diversity and variance. Remember the output of an overly strict filter in the tail is arbitrary anyway. Instead we can be deliberate about the randomness we let seep into selections to maximize the upside of our optionality.

Rohit summarizes the philosophy:

Change our thinking from a selection mindset (hire the best 5%) to a curation mindset (give more people a chance, to get to the best 5%).

Practically speaking this means selectors must widen the top of the funnel then…enforce the higher variance strategy of hire-and-train.

Rohit furnishes examples:

  • Tyler Cowen’s strategy of identifying unconventional talent and placing small but influential bets on the candidates. This is easier to say than do but Tony Kulesa finds some hints in Cowen’s template. 
  • The Marine Corps famously funnels wide electing not to focus so much on the incoming qualifications, but rather look at recruiting a large class and banking on attrition to select the right few.
  • Investment banks and consulting firms hire a large group of generically smart associates, and let attrition decide who is best suited to stick around.

David Epstein, author of Range and The Sports Gene, has spent the past decade studying the development of talent in sports and beyond. He echoes these strategies:

One practice we’ve often come back to: not forcing selection earlier than necessary. People develop at different speeds, so keep the participation funnel wide, with as many access points as possible, for as long as possible. I think that’s a pretty good principle in general, not just for sports.

I’ll add 2 meta observations to these strategies:

  1. The silent implication is the upside of matching the right talent to the right role is potentially massive. If you were hiring someone to bag groceries the payoff to finding the fastest bagger on the planet is capped. An efficient checkout process is not the bottleneck to a supermarket’s profits. There’s a predictable ceiling to optimizing it to the microsecond. That’s not the case with roles in the above examples. 

  2. Increasing adoption of these strategies requires thoughtful “accounting” design. High stakes busts, whether they are first round draft picks or 10x engineers, are expensive in time and money for the employer and candidate. If we introduce more of a curation mindset, cast wider nets and hire more employees, we need to understand that the direct costs of doing that should be weighed against the opaque and deferred costs of taking a full-size position in expensive employees from the outset.

    Accrual accounting is an attempt to match a business’ economic mechanics to meaningful reports of stocks and flows so we extract insights that lead to better bets. Fully internalized, we must recognize that some amount of churn is expected as “breakage”. Lost option premiums need to be charged against the options that have paid off 100x. If an organization fails to design its incentive and accounting structures in accordance with curation/optionality thinking it will be unable to maintain its discipline to the strategy.

Selectee Strategies

For the selectee trying to maximise their own potential there are strategies which exploit the divergence in the tails. 

To understand, we first recognize that, in any complicated domain, the effort to become the best is not linear. You could devote a few years to becoming an 80th or 90th percentile golfer or chess player. But in your lifetime you wouldn’t become Tiger or Magnus. The rewards to effort decay exponentially after a certain point. Anyone who has lifted weights knows you can spend a year progressing rapidly, only to hit a plateau that lasts just as long.

The folk wisdom of the 80/20 rule captures this succinctly: 80% of the reward comes from 20% of the effort, and the remaining 20% of the reward requires 80% effort. The exact numbers don’t matter. Divorced from contexts, it’s more of a guideline.

This is the invisible foundation of Marc Andreessen and Scott Adams’s career advice to level up your skills in multiple domains. Say coding and public speaking or writing plus math. If it’s exponentially easier to get to the 90th percentile than the 99th then consider the arithmetic2.

a) If you are in the 99th percentile you are 1 in 100. 

b) If you are top 10% in 2 different (technically uncorrelated) domains then you are also 1 in 100 because 10% x 10% = 1%

It’s exponentially easier to achieve the second scenario because of the effort scaling function. 

If this feels too stifling you can simply follow your curiosity. In Why History’s Greatest Innovators Optimized for Interesting, Taylor Pearson summarizes the work of Juergen Schmidhuber which contends that curiosity is the desire to make sense of, or compress, information in such a way that we make it more beautiful or useful in its newly ordered form. If learning (or as I prefer to say – adapting) is downstream from curiosity we should optimize for interesting.

Lawrence Yeo unknowingly takes the baton in True Learning Is Done With Agency, with his practical advice. He tells us to truly learn we must:

decouple an interest from its practical value. Instead of embarking on something with an end goal in mind, you do it for its own sake. You don’t learn because of the career path it’ll open up, but because you often wonder about the topic at hand.

…understand that a pursuit truly driven by curiosity will inevitably lend itself to practical value anyway. The internet has massively widened the scope of possible careers, and it rewards those who exercise agency in what they pursue.


Rohit’s essay anchored Part 1 of this series. I can’t do better than let his words linger before moving on to Part 2.
If measurement is too strict, we lose out on variance.

If we lose out on variance, we miss out on what actually impacts outcomes.

If we miss what actually impacts outcomes, we think we’re in a rut.

But we might not be.

Once you’ve weeded out the clear “no”s, then it’s better to bet on variance rather than trying to ascertain the true mean through imprecise means.

We should at least recognize that our problems might be stemming from selection efforts. We should probably lower our bars at the margin and rely on actual performance [as opposed to proxies for performance] to select for the best. And face up to the fact that maybe we need lower retention and higher experimentation.

Looking Ahead

In Part 2, we will explore what divergence in the tails can tell us about life and investing.


Solving A Compounding Riddle With Black-Scholes

A few weeks ago I was getting on an airplane armed with a paper and pen, ready to solve the problem in the tweet below. And while I think you will enjoy the approach, the real payoff is going to follow shortly after — I’ll show you how to not only solve it with option theory but expand your understanding of the volatility surface. This is going to be fun. Thinking caps on. Let’s go.

The Question That Launched This Post

From that tweet, you can see the distribution of answers has no real consensus. So don’t let others’ choices affect you. Try to solve the problem yourself. I’ll re-state some focusing details:

  • Stock A compounds at 10% per year with no volatility
  • Stock B has the same annual expectancy as A but has volatility. Its annual return is binomial — either up 30% or down 10%.
  • After 10 years, what’s the chance volatile stock B is higher than A?

You’ll get the most out of this post if you try to solve the problem. Give it a shot. Take note of your gut reactions before you start working through it. In the next section, I will share my gut reaction and solution.

My Approach To The Problem

Gut Reaction

So the first thing I noticed is that this is a “compounding” problem. It’s multiplicative. We are going to be letting our wealth ride and incurring a percent return. We are applying a rate of return to some corpus of wealth that is growing or shrinking. I’m being heavy-handed in identifying that because it stands in contrast to a situation where you earn a return, take profits off the table, and bet again. Or situations, where you bet a fixed amount in a game as opposed to a fraction of your bankroll. This particular poll question is a compounding question, akin to re-investing dividends not spending them. This is the typical context investors reason about when doing “return” math. Your mind should switch into “compounding” mode when you identify these multiplicative situations.

So if this is a compounding problem, and the arithmetic returns for both investments are 10% I immediately know that volatile stock “B” is likely to be lower than stock “A” after 10 years. This is because of the “volatility tax” or what I’ve called the volatility drain. Still, that only conclusively rules out choice #4. Since we could rule that out without doing any work and over 2,000 respondents selected it, I know there’s a good reason to write this post!

Showing My Work

Here’s how I reasoned through the problem step-by-step.

Stock A’s Path (10% compounded annually)

Stock B’s Path (up 30% or down 10%)

The fancy term for this is “binomial tree” but it’s an easy concept visually. Let’s start simple and just draw the path for the first 2 years. Up nodes are created by multiplying the stock price by 1.3, down nodes are created by multiplying by .90.


Year 1: 2 cumulative outcomes. Volatile stock B is 50/50 to outperform
Year 2: There are 3 cumulative outcomes. Stock B only outperforms in one of them.

Let’s pause here because while we are mapping the outcome space, we need to recognize that not every one of these outcomes has equal probability.

2 points to keep in mind:

  • In a binomial tree, the number of possibilities is 2ᴺ where N is the number of years. This makes sense since each node in the tree has 2 possible outcomes, the tree grows by 2ᴺ.
  • However, the number of outcomes is N + 1. So in Year 1, there are 2 possible outcomes. In year 2, 3 possible outcomes.

Probability is the number of ways an outcome can occur divided by the total number of possibilities.


So by year 2 (N=2), there are 3 outcomes (N+1) and 4 cumulative paths (2ᴺ)
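The counting above can be checked by brute force. This is a minimal sketch that enumerates every up/down sequence and groups the sequences by number of up years. The function name `tree_summary` and the $100 starting price are illustrative choices.

```python
from collections import Counter
from itertools import product

def tree_summary(n_years, up=1.3, down=0.9, start=100.0):
    """Enumerate every up/down path and group the paths by number of up years."""
    paths = list(product("UD", repeat=n_years))      # 2**N distinct paths
    ways = Counter(path.count("U") for path in paths)
    summary = {}
    for ups, count in sorted(ways.items(), reverse=True):
        price = start * up ** ups * down ** (n_years - ups)
        # (number of paths, probability, terminal price) for this outcome
        summary[ups] = (count, count / len(paths), price)
    return len(paths), summary

n_paths, outcomes = tree_summary(2)
print(f"{n_paths} paths, {len(outcomes)} distinct outcomes")
for ups, (count, prob, price) in outcomes.items():
    print(f"{ups} up years: {count} path(s), probability {prob:.0%}, price ${price:.2f}")
```

For year 2 this lists prices of $169, $117 and $81, and only the first beats stock A's $121.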

We are moving slowly, but we are getting somewhere.

In year 1, the volatile investment has a 50% chance of winning. The frequency of win paths and lose paths are equal. But what happens in an even year?

There is an odd number of outcomes, with the middle outcome representing the number of winning years and the number of losing years being exactly the same. If the frequency of the wins and losses is the same the volatility tax dominates. If you start with $100 and make 10% then lose 10% the following year, your cumulative result is a loss.

$100 x 1.1 x .9 = $99

Order doesn’t matter.

$100 x .9 x 1.1 = $99

In odd years, like year 3, there is a clear winner because the number of wins and losses cannot be the same. Just like a 3-game series.

Solving for year 10

If we extend this logic, it’s clear that year 10 is going to have a big volatility tax embedded in it because of the term that includes stock B having 5 up years and 5 loss years.

N = 10
Outcomes (N+1) = 11 (ie 10 up years, 9 up years, 8 up years…0 up years)
# of paths (2ᴺ) = 1024

We know that 10, 9, 8, 7, 6 “ups” result in B > A.
We know that 4, 3, 2, 1, 0 “ups” result in B < A.

The odds of those outcomes are symmetrical. So the question is how often does 5 wins, 5 losses happen? That’s the outcome in which stock A wins because the volatility tax effect is so dominant.

The number of ways to have 5 wins in 10 years is a combination formula for “10 choose 5”:

₁₀C₅ or in Excel =combin(10,5) = 252

So there are 252 out of 1024 total paths in which there are 5 wins and 5 losses. 24.6%

24.6% of the time the volatility tax causes A > B. The remaining paths represent 75.4% of the paths and those have a clear winner that is evenly split between A>B and B>A.

75.4% / 2 = 37.7%

So volatile stock B only outperforms stock A 37.7% of the time despite having the same arithmetic expectancy!
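Here is the same calculation as a short sketch. It counts the paths on which B's terminal value beats A's and divides by the 2ᴺ total paths. The function name and default parameters are illustrative.

```python
from math import comb

def prob_b_beats_a(n_years, up=1.3, down=0.9, steady=1.1):
    """Fraction of equally likely up/down paths where B finishes above A."""
    a_growth = steady ** n_years
    winning_paths = 0
    for ups in range(n_years + 1):
        b_growth = up ** ups * down ** (n_years - ups)
        if b_growth > a_growth:
            winning_paths += comb(n_years, ups)  # ways to arrange this many ups
    return winning_paths / 2 ** n_years

for years in (1, 2, 3, 10):
    print(f"year {years}: P(B > A) = {prob_b_beats_a(years):.1%}")
```

For 10 years the winning paths are the 386 with six or more up years, and 386 / 1024 is the 37.7% above.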

This will surprise nobody who recognized that the geometric mean corresponds to the median of a compounding process. The geometric mean of this investment is not 10% per year but 8.17%. Think of how you compute a CAGR by taking the terminal wealth and raising it to the 1/N power. So if you returned $2 after 10 years on a $1 investment your CAGR is 2^(1/10) – 1 = 7.18%. To compute a geometric mean for stock B we invert the math: .9^(1/2) * 1.3^(1/2) -1  = 8.17%. (we’ll come back to this after a few pictures)
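The CAGR and geometric mean arithmetic is easy to reproduce. A minimal sketch, with illustrative helper names:

```python
def cagr(terminal_multiple, years):
    """Annualized growth rate implied by a total wealth multiple."""
    return terminal_multiple ** (1 / years) - 1

def geometric_mean_return(gross_returns):
    """Per-period growth rate of a sequence of gross returns (1 + r)."""
    compounded = 1.0
    for gross in gross_returns:
        compounded *= gross
    return compounded ** (1 / len(gross_returns)) - 1

doubling_cagr = cagr(2.0, 10)
stock_b_geo = geometric_mean_return([1.3, 0.9])  # one up year, one down year
median_terminal = 100 * (1 + stock_b_geo) ** 10

print(f"CAGR for doubling in 10 years: {doubling_cagr:.2%}")
print(f"Stock B geometric mean:        {stock_b_geo:.2%}")
print(f"$100 compounded at that rate:  ${median_terminal:.2f}")
```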

The Full Visual

A fun thing to recognize with binomial trees is that the coefficients (ie the number of ways a path can be made that we denoted with the “combination” formula) can be created easily with Pascal’s Triangle. Simply sum the 2 coefficients directly from the line above it.

Coefficients of the binomial expansion (# of ways to form the path)


Probabilities (# of ways to form each path divided by total paths)

Corresponding Price Paths

Above we computed the geometric mean to be 8.17%. If we compounded $100 at 8.17% for 10 years we end up with $219 which is the median result that corresponds to 5 up years and 5 down years! 

The Problem With This Solution

I solved the 10-year problem by recognizing that, in even years, the volatility tax would cause volatile stock B to lose when the up years and down years occurred equally. (Note that while an equal number of heads and tails is the most likely outcome, it’s still not likely. There’s a 24.6% chance that it happens in 10 trials).

But there’s an issue. 

My intuition doesn’t scale for large N. Consider 100 years. Even in the case where B is up 51 times and down 49 times the volatility tax will still cause the cumulative return of B < A. We can use guess-and-test to see how many winning years B needs to have to overcome the tax for N = 100.

N = 100

If we put $1 into A, it grows to 1.1^100 = $13,781

If we put $1 into B and it has 54 winning years and 46 losing years, it will return 1.3^54 * .9^46 = $11,171. It underperforms A.

If we put $1 into B and it has 55 winning years and 45 losing years, it will return 1.3^55 * .9^45 = $16,136. It outperforms A.

So B needs to have 55 “ups”/45 “downs” or about 20% more winning years than losing years to overcome the volatility tax. It’s not as simple as B needing more up years than down years, like we found for shorter horizons.
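The guess-and-test step can be automated. The sketch below searches for the smallest number of up years that beats stock A, and also computes the breakeven share of up years from logs (setting 1.3^k * .9^(N-k) equal to 1.1^N and solving for k/N). The function names are illustrative.

```python
from math import comb, log

def min_up_years(n_years, up=1.3, down=0.9, steady=1.1):
    """Smallest number of up years for B to finish above A, or None."""
    target = steady ** n_years
    for ups in range(n_years + 1):
        if up ** ups * down ** (n_years - ups) > target:
            return ups
    return None

def breakeven_up_fraction(up=1.3, down=0.9, steady=1.1):
    """Share of up years at which B exactly matches A, from taking logs."""
    return log(steady / down) / log(up / down)

def prob_at_least(n_years, ups_needed):
    """Chance of at least ups_needed up years when each year is 50/50."""
    ways = sum(comb(n_years, k) for k in range(ups_needed, n_years + 1))
    return ways / 2 ** n_years

for years in (10, 100):
    needed = min_up_years(years)
    print(f"N={years}: need {needed} up years, P(B > A) = {prob_at_least(years, needed):.1%}")
print(f"breakeven share of up years: {breakeven_up_fraction():.1%}")
```

The breakeven share is a bit under 55%, which is why 6 of 10 is enough at the short horizon but 51 of 100 is not.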

We need a better way. 

The General Solution Comes From Continuous Compounding: The Gateway To Option Theory

In the question above, we compounded the arithmetic return of 10% annually to get our expectancy for the stocks.

Both stocks’ expected value after 10 years is 100 * 1.1^10 = $259.37.

Be careful. You don’t want the whole idea of the geometric mean to trip you up. The compounding of volatility does NOT change the expectancy. It changes the distribution of outcomes. This is crucial.

The expectancy is the same, the distribution differs.

If we keep cutting the compounding periods from 1 year to 1 week to 1 minute…we approach continuous compounding. That’s what logreturns are. Continuously compounded returns.

Here’s the key:

Returns conform to a lognormal distribution. You cannot lose more than 100% but you have unlimited upside because of the continuous compounding. Compared to a bell-curve the lognormal distribution is positively skewed. The counterbalance of the positive skew is that the geometric mean or center of mass of the distribution is necessarily lower than the arithmetic expectancy. How much lower? It depends on the volatility because the volatility tax1 pulls the geometric mean down from the arithmetic mean or expectancy. The higher the volatility, the more positively skewed the lognormal or compounded distribution is. The more volatile the asset is in a positively skewed distribution the larger the right tail grows since the left tail is bounded by zero. The counterbalance to the positive skew is that the median outcome is the geometric mean.

I’ll pause here for a moment to just hammer home the idea of positive skew:

If stock B doubled 20% of the time and lost 12.5% the remaining 80% of the time its average return would be exactly the same as stock A after 1 year (20% * $200 + 80% * $87.5 = $110). The arithmetic mean is the same. But the most common lived result is that you lose. The more we crank the volatility higher, the more it looks like a lotto ticket with a low probability outcome driving the average return.

Look at the terminal prices for stock B:

The arithmetic mean is the same as A, $259.

The geometric mean, or most likely outcome, is only $219 (again corresponding to the 8.17% geometric return).

The magnitude of that long right tail ($1,379 is > 1200% total return, while the left tail is a cumulative loss of 65%) is driving that 10% arithmetic return.

Compounding is pulling the typical outcome down as a function of volatility but it’s not changing the overall expectancy.
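To make "same expectancy, different distribution" concrete, this sketch rebuilds stock B's 11 terminal prices with their probabilities and compares the probability-weighted mean with the median. Variable names are illustrative.

```python
from math import comb

YEARS = 10
# (terminal price, probability) for each possible count of up years
outcomes = [
    (100 * 1.3 ** ups * 0.9 ** (YEARS - ups), comb(YEARS, ups) / 2 ** YEARS)
    for ups in range(YEARS + 1)
]

mean_price = sum(price * weight for price, weight in outcomes)

# Median: walk up from the lowest price until half the probability is covered
cumulative = 0.0
median_price = None
for price, weight in sorted(outcomes):
    cumulative += weight
    if cumulative >= 0.5:
        median_price = price
        break

worst_price = min(price for price, _ in outcomes)
best_price = max(price for price, _ in outcomes)
print(f"mean ${mean_price:.2f}, median ${median_price:.2f}")
print(f"worst ${worst_price:.2f}, best ${best_price:.2f}")
```

The mean lands on stock A's $259 while the median sits at $219. The gap is paid for by the long right tail.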

A Pause To Gather Ourselves

  • We now understand that compounded returns are positively skewed.
  • We now understand that logreturns are just compounded returns taken continuously as opposed to annually.
  • This continuous, logreturn world is the basis of option math. 


The lognormal distribution underpins the Black-Scholes model used for pricing options.

The median of a lognormal distribution is the geometric mean. By now we understand that the geometric mean is always lower than the arithmetic mean. So in a compounded world we understand that the typical outcome is lower than the arithmetic mean.

Geometric mean  = arithmetic mean – .5 * volatility²

The question we worked on is not continuous compounding but if it were, the geometric mean = 10% – .5 * (.20)² = 8%. Just knowing this was enough to know that most likely B would not outperform A even though they have the same average expectancy.

Let’s revisit the original question, but now we will assume continuous compounding instead of annual compounding. The beauty of this is we can now use Black Scholes to solve it!

Re-framing The Poll As An Options Question

We now switch compounding frequency from annual to continuous so we are officially in Black-Scholes lognormal world. 

Expected return (arithmetic mean)

  • Annual compounding: $100 * (1.1)¹⁰ = $259.37
  • Continuous compounding (B-S world): 100*e^(.10 * 10) = $271.83

Median return (geometric mean)

  • Annual compounding: $100 x 1.0817¹⁰ = $219.24
  • Continuous compounding (B-S world): $100 * e^((.10 – .5 * .2²) * 10) = $222.55
    • remember Geometric mean  = arithmetic mean – .5 * volatility²
    • geometric mean < arithmetic mean of course
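The four figures above can be reproduced in a few lines. A minimal sketch with illustrative variable names. The only inputs are the $100 start, the 10% return, the 20% volatility and the 10-year horizon.

```python
from math import exp

S0, T = 100.0, 10       # starting price and horizon in years
r, sigma = 0.10, 0.20   # arithmetic return and volatility

# Annual compounding (the binomial tree world)
annual_mean = S0 * (1 + r) ** T
annual_median = S0 * (1.3 * 0.9) ** (T / 2)   # 5 up years and 5 down years

# Continuous compounding (the Black-Scholes world)
cont_mean = S0 * exp(r * T)
cont_median = S0 * exp((r - 0.5 * sigma ** 2) * T)  # drift reduced by half the variance

print(f"annual:     mean ${annual_mean:.2f}, median ${annual_median:.2f}")
print(f"continuous: mean ${cont_mean:.2f}, median ${cont_median:.2f}")
```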

The original question:

What’s the probability that stock B with its 10% annual return and 20% volatility outperforms stock A with its 10% annual return and no volatility in 10 years?

Asking the question in options language:

What is the probability that a 10-year call option on stock B with a strike price of $271.83 expires in-the-money?

If you have heard that “delta” is the probability of “expiring in-the-money” then you think we are done. We have all the variables we need to use a Black-Scholes calculator which will spit out a delta. The problem is delta is only approximately the probability of expiring in-the-money. In cases with lots of time to expiry, like this one where the horizon is 10 years, they diverge dramatically. 2

We will need to extract the probability from the Black Scholes equation. Rest assured, we already have all the variables. 

Computing The Probability That Stock “B” Expires Above Stock “A”

If we simplify Black-Scholes to a bumper sticker, it is the probability-discounted stock price beyond a fixed strike price. Under the hood of the equation, there must be some notion of a random variable’s probability distribution. In fact, it’s comfortingly simple. The crux of the computation is just calculating z-scores.

I think of a z-score as the “X” coordinate on a graph where the “Y” coordinate is a probability on a distribution. Refresher pic3:

Conceptually, a z-score is a distance from a distribution’s mean normalized by its standard deviation. In Black-Scholes world, z-scores are a specified logreturn’s distance from the geometric mean normalized by the stock’s volatility. Same idea as the Gaussian z-scores you have seen before.

Conveniently, logreturns are themselves normally distributed allowing us to use the good ol’ NORM.DIST Excel function to turn those z-scores into probabilities and deltas. 

In Black Scholes,

  • delta is N(d1)
  • probability of expiring in-the-money is N(d2)
  • d1 and d2 are z-scores

Here are my calcs4:


The probability of stock B finishing above stock A (ie the strike or forward price of a $100 stock continuously compounded at 10% for 10 years) is…


This is respectably close to the 37.7% we computed using Pascal’s Triangle. The difference is we used the continuous compounding (lognormal) distribution of returns instead of calculating the return outcomes discretely. 
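For readers without the spreadsheet, the z-score calculation can be sketched in a few lines. This follows the textbook Black-Scholes d1 and d2 for a non-dividend-paying stock, and it assumes, as the post does, that the 10% return doubles as the carry rate. The helper name `d1_d2` is illustrative, and Python's `statistics.NormalDist` stands in for Excel's normal distribution function.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def d1_d2(spot, strike, rate, vol, years):
    """Black-Scholes z-scores for a non-dividend-paying stock."""
    total_vol = vol * sqrt(years)  # volatility scaled to the full horizon
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / total_vol
    d2 = d1 - total_vol
    return d1, d2

spot, rate, vol, years = 100.0, 0.10, 0.20, 10
strike = spot * exp(rate * years)  # stock A's terminal value, the forward price

d1, d2 = d1_d2(spot, strike, rate, vol, years)
normal_cdf = NormalDist().cdf
delta = normal_cdf(d1)      # N(d1)
prob_itm = normal_cdf(d2)   # N(d2), the chance B finishes above the strike

print(f"strike = ${strike:.2f}")
print(f"d1 = {d1:.4f}, delta N(d1) = {delta:.1%}")
print(f"d2 = {d2:.4f}, P(B > A) = N(d2) = {prob_itm:.1%}")
```

With these inputs N(d2) comes out near 37.6% while the delta is near 62%, which shows how far the two can drift apart at a 10-year horizon.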

The Lognormal Distribution Is A Lesson In How Compounding Influences Returns

I ran all the same inputs through Black Scholes for strikes up to $750.

  • This lets us compute all the straddles and butterflies in Black-Scholes universe (ie what market-makers back in the day called “flat sheets”. That means no additional skew parameters were fit to the model or the model was not fit to the market).
  • The flys let us draw the distribution of prices.

A snippet of the table:

I highlighted a few cells of note:

  • The 220 strike has a 50% chance of expiring ITM. That makes sense, it’s the geometric mean or arithmetic median.
  • The 270 strike is known as At-The-Forward because it corresponds to the forward price of $271.83 derived from continuously compounding $100 at 10% per year for 10 years (ie Seʳᵗ). If 10% were a risk-free rate this would be treated like the 10 year ATM price in practice. Notice it has a 63% delta. This surprises people new to options but for veterans this is expected (assuming you are running a model without spot-vol correlation).
  • You have to go to the $330 strike to find the 50% delta option! If you need to review why see Lessons From The .50 Delta Option.

The summary picture below adds one more lesson:

The cheapest straddle (and therefore most expensive butterfly) occurs at the modal return, about $150. If the stock increased from $100 to $150, your CAGR would be 4.1%. This is the single most likely event despite the fact that it’s below the median AND has a point probability of only 1.7%
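The landmarks in the table and the picture can be recovered without building the whole strike grid. The sketch below uses the standard closed-form median and mode of a lognormal distribution rather than the straddle and butterfly table, so treat it as a cross-check under the same inputs, not a reproduction of the spreadsheet. Names are illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist

normal_cdf = NormalDist().cdf
S, r, vol, T = 100.0, 0.10, 0.20, 10

def delta_and_prob_itm(strike):
    """Return (N(d1), N(d2)) for a call at the given strike."""
    total_vol = vol * sqrt(T)
    d1 = (log(S / strike) + (r + 0.5 * vol ** 2) * T) / total_vol
    return normal_cdf(d1), normal_cdf(d1 - total_vol)

for strike in (220, 270, 330):
    delta, prob = delta_and_prob_itm(strike)
    print(f"strike {strike}: delta {delta:.0%}, P(ITM) {prob:.0%}")

median_price = S * exp((r - 0.5 * vol ** 2) * T)        # 50% chance of finishing above
fifty_delta_strike = S * exp((r + 0.5 * vol ** 2) * T)  # strike where N(d1) = 50%
mode_price = S * exp((r - 1.5 * vol ** 2) * T)          # peak of the lognormal density
mode_cagr = (mode_price / S) ** (1 / T) - 1

print(f"median ${median_price:.2f}, 50-delta strike ${fifty_delta_strike:.2f}")
print(f"mode ${mode_price:.2f}, annualized {mode_cagr:.1%}")
```

The ordering is the point: mode below median, median below forward, forward below the 50-delta strike.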

Speaking of Skew

Vanilla Black-Scholes option theory is a handy framework for understanding the otherwise unintuitive hand of compounding. The lognormal distribution is the distribution that corresponds to continuously compounded returns. However, it is important to recognize that nobody actually believes this distribution describes any individual investment. A biotech stock might be bimodally distributed, contingent on an FDA approval. If you price SPX index options with a positively skewed model like this you will not last long.

A positively skewed distribution says “on average I’ll make X because sometimes I’ll make multiples of X but most of the time, my lived experience is I’ll make less than X”.

In reality, the market imputes negative skew on the SPX options market. This shifts the peak to the right, shortens the right tail, and fattens the left tail. That implied skew says “on average I make X, I often make more than X, because occasionally I get annihilated”. 

It often puzzles beginning traders that adding “put skew” to a market, which feels like a “negative” sentiment, raises the value of call spreads. But that actually makes sense. A call spread is a simple over/under bet that reduces to the odds of some outcome happening. If the spot price is unchanged, and the puts become more expensive because the left tail is getting fatter, then it means the asset must be more likely to appreciate to counterbalance those 2 conditions. So of course the call spreads must be worth more. 


Final Wrap

Compounding is a topic that gives beginners and even experienced professionals difficulty. By presenting the solution to the question from a discrete binomial angle and a continuous Black-Scholes angle, I hope it solidified or even furthered your appreciation for how compounding works.

My stretch goal was to advance your understanding of option theory. While it overlaps with many of my other option theory posts, if it led to even any small additional insight, I figure it’s worth it. I enjoyed sensing that the question could be solved using options and then proving it out. 

I want to thank @10kdiver for the work he puts out consistently and the conversation we had over Twitter DM regarding his question. If you are trying to learn basic and intermediate level financial numeracy his collection of threads is unparalleled. Work I aspire to. Check them out here:

Remember, my first solution (Pascal’s Triangle) only worked for relatively small N. It was not a general solution. The Black-Scholes solution is a general one but required changing “compounded annually” to “compounded continuously”. 10kdiver provided the general solution, using logs (so also moving into continuous compounding) but did not require discussion of option theory. 

I’ll leave you with that:

Additional Reading 

  • Path: How Compounding Alters Return Distributions (Link)

This post shows how return distributions built from compounding depend on the ratio of trend vs chop.

  • The difficulty with shorting and inverse positions (Link)

    The reason shorting and inverse positions are problematic is intimately tied to compounding math.


A Cleaner Dashboard: Z-Scores Instead Of Price Changes

Most investors’ or traders’ dashboards include a watchlist with the field “percentage price change”. Perhaps you have several fields for this. Daily, weekly, monthly.

Here’s a useful way to filter out the noise and get a nicer view of the market action:

Re-scale all the moves in terms of standard deviations

My preference, although it relies on having options data, is to use implied volatility which is the market’s consensus for what the standard deviation is.

Here are the formulas:

  • Daily = % change on day * 16/IV from yesterday’s ATM straddle
  • Weekly = % change on week * 7.2 / IV week ago
  • Monthly = % change on month * 3.5 / IV month ago

Implied vols are annualized numbers so the factors (16, 7.2, and 3.5) re-scale the vols for the measurement period.

These are just Z-scores!


  • If the absolute value of any of these numbers exceeds 1, the asset moved more than 1 implied standard deviation.
  • You can put all the assets on the x-axis of a barchart to see them visually. If you want, you can even subtract 1 from each value to see the excess move above one standard deviation. Or you set your filter at any other level.
  • This is not a tool to find opportunities or anything fancy, it’s literally just a cleaner way to visualize price moves and ignore noise.
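A minimal sketch of the re-scaling follows. It uses the square root of 252, 52 and 12 periods per year, which is where the rounded factors 16, 7.2 and 3.5 come from. The tickers, price changes and implied vols in the watchlist are made-up placeholders, and the function name is illustrative.

```python
from math import sqrt

PERIODS_PER_YEAR = {"daily": 252, "weekly": 52, "monthly": 12}

def move_z_score(pct_change, implied_vol, horizon):
    """Price change expressed in implied standard deviations for the horizon."""
    period_vol = implied_vol / sqrt(PERIODS_PER_YEAR[horizon])  # de-annualize the IV
    return pct_change / period_vol

# Hypothetical watchlist: (daily % change, annualized IV from the prior close)
watchlist = {
    "AAA": (0.020, 0.32),
    "BBB": (-0.012, 0.15),
    "CCC": (0.004, 0.45),
}

for ticker, (change, iv) in watchlist.items():
    z = move_z_score(change, iv, "daily")
    flag = " <-- more than 1 implied standard deviation" if abs(z) > 1 else ""
    print(f"{ticker}: {change:+.1%} move = {z:+.2f} sd{flag}")
```

A 1.2% drop in a 15-vol asset is a bigger event than a 2% gain in a 32-vol asset, which a plain percent-change column hides.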

I was too lazy to make one for stocks or futures, but the output will look like this (instead of MPG imagine it was “price change”):

If you want to use straddle prices which represent mean absolute deviation or MAD then divide the formulas further by .8.

The reason you use .8 is explained in my post Straddles, Volatility, and Win Rates.

What The Widowmaker Can Teach Us About Trade Prospecting And Fool’s Gold

We’re going to go on a little ride to talk about trade prospecting. We’ll use the natural gas futures and options market to demonstrate how to think about markets and what’s required to actually identify opportunities.
The nat gas market is all the rage these days as we head into the winter of 2021/22.

Let’s start with some background.

The Widowmaker

Enter the famous March/April futures spread in the natural gas market. This was the football famously tossed between John Arnold’s Centaurus and Brian Hunter’s Amaranth. You can get a good account of the story here, as told by the excellent @HideNotSlide.

The reason it’s a “widowmaker” is the spread can get nasty. The March future, henceforth known by its future code (H), represents the price of gas by the end of winter when supply has been withdrawn from storage.  April (J) is the price of gas in the much milder “shoulder” month. H futures expire in Feb but are called “March” because they are named by when the gas must be delivered. Same with J. They expire in March, but delivered in April. The H/J spread references the spread or difference between the 2 prices.

If you “buy” the spread, you are buying H and selling J.

  • If the price of the spread is positive, the market is backwardated. H is trading premium to J.
  • If the spread is negative, H<J (ie contango)
On 10/6/2021 the spread settled at +$1.44 because:
  • H future = $5.437
  • J future = $3.997

Introducing Options Into The Mix

There are vanilla options that trade on each month.
So there are options that reference the March future and they expire a day before the future (so in February).
  • H settled $5.437 so the ATM straddle would be approximately the $5.45 strike. Strikes in nat gas are a nickel apart.
  • For April futures the ATM strike is the $4.00 line. You can see the J straddle (ATM C + P) settled around $1.14

Commodities Are Not Like Equities

Every option expiry in equities references the same underlying — the common stock price. If you trade Sep, Oct, Nov, or Dec SPY options they all reference the same underlying price.
The December 100 call cannot be worth less than the November 100 call because of simple arbitrage conditions. Your December options also capture the volatility that occurs in November (in fact if you wanted to bet on the volatility just in December, you would structure a time spread that bought December vol and sold November vol, to strip out all the time before November expiration. The structure of that trade is beyond the scope of this post.)
This doesn’t work in commodities because each month has a different underlyer.
Recall H =$5.437 and J = $3.997
  • The H $5 call is almost .44 ITM
  • The J $5 call is a full dollar OTM

Despite J options having a month longer until expiry, the J $5 call trades waaaay under the H $5 call.

It gets better.

Even if H and J were trading the same price, the H $5 call can trade over the J $5 call. This is where newcomers to commodities from equities find their muscle memory misfires.

The H implied volatility can go so far north of the J vol that it can swamp the 1 month time difference.

As described earlier, in an equity, March and April options would reference the same underlyer so owning April vol exposes you to the March vol.

Not true in NG.

Severing the arbitrage link between spreads

H is trading above J. The spread is backwardated. But H and J are not fungible. They are deliverable at different times. If you need H gas, you need H gas. It’s cold today. You cannot wait for J gas to be delivered. You won’t need it then.
This is generally true in commodities.
There is no arb to a backwardated market.
A contango market can be bounded by the cost of storage. Be careful though. The steep contangos of oil in Spring 2020 and around the GFC are lessons in “limits to arbitrage”. The cost of storage is effectively infinite if you run out of storage. So contango represents the market “bidding for storage”. You can’t just build new storage overnight. The other major input into contango spreads is the funding cost of holding a commodity either via opportunity cost or interest rates. The GFC was a credit crunch. Funding was squeezed. That cuts right to the heart of “cost of carry” that contango represents.

So we now understand that H and J can become unhinged from each other. That’s why the spread is a widowmaker. It can be pushed around until convergence happens near the expiry of the near month. That’s when reality’s vote gets counted.

More Complexity: Options On Those Crazy Spreads

You can also trade options directly on the H/J futures spread. Since H/J is considered a calendar spread, the options are cleverly named:
Calendar spread options.
The cool kids refer to them as “CSOs”.
Let’s talk CSOs.
We established that the H/J future spread is $1.44
  • You can buy a call option on that spread. You can buy (or sell) an OTM call, like the H/J $10 call.
  • You can buy an ITM call like the H/J $1 call. That option is 44 cents ITM.
  • You can buy a put on the spread. If you buy the H/J 0 put (pronounced “zero put”), that option is currently OTM. It goes ITM if H collapses relative to J and the spread goes negative (ie contango).
These exist in WTI oil as well. Imagine a fairly typical market regime where oil is in contango. The CL1-CL2 spread might trade -.40. That means the front month is .40 under the second month. CSOs trade on these negative spreads as well! If someone buys the -$1.00 put they are betting the market gets even more steeply contango.
I’ll pause for a moment.

Right now, you are playing with an example in your mind. Something like: “so if I buy the -$.25 call, I’m rooting for…ahh, CL1 to narrow against CL2 or even trade premium into backwardation”

Don’t be hard on yourself. This is supposed to hurt. It hurts everyone’s head when they learn it. It’s just a language. The more you do it, the easier it gets and with enough reps you won’t remember what it was like to not be able to understand it natively.
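One way to get the reps in is to let code do the arithmetic. This is a rough sketch, not a pricing model: it only computes intrinsic value at expiry, and the function names are made up for this illustration. The spread is quoted front month minus back month, as in the examples above.

```python
def spread_value(front: float, back: float) -> float:
    """Calendar spread quoted as front month minus back month."""
    return front - back


def cso_call_intrinsic(spread: float, strike: float) -> float:
    """Intrinsic value of a call on the spread (strike may be negative)."""
    return max(spread - strike, 0.0)


def cso_put_intrinsic(spread: float, strike: float) -> float:
    """Intrinsic value of a put on the spread (strike may be negative)."""
    return max(strike - spread, 0.0)


# Nat gas H/J from the 10/6/2021 settlements quoted above
hj = spread_value(5.437, 3.997)
print(round(hj, 2))                            # 1.44
print(round(cso_call_intrinsic(hj, 1.00), 2))  # 0.44, the $1 call is ITM
print(cso_put_intrinsic(hj, 0.00))             # 0.0, the zero put is OTM

# WTI-style contango: CL1-CL2 = -0.40
print(cso_call_intrinsic(-0.40, -0.25))            # 0.0, still OTM
print(round(cso_call_intrinsic(-0.10, -0.25), 2))  # 0.15 once the spread narrows to -0.10
print(round(cso_put_intrinsic(-1.30, -1.00), 2))   # 0.3 if contango steepens to -1.30
```

The negative strikes are the part that hurts. In code they are just numbers on a line: a call wins when the spread finishes above the strike, a put when it finishes below.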

Real-life example

These prices are from 10/6/2021 settlement.
H settled $5.437
The H 15 strike call settled $.42
H/J spread = $1.44
H/J $10 CSO call = $.38
Let’s play market maker.
You make some markets around these values:
  • Suppose you get lifted on the CSO call at $.40 (2 cents of edge or 20 ticks. 1/10 cent is min tick size)
  • Meanwhile the other mm on your desk gets her bid hit on the vanilla H 15 call at $.40 (also 2 cents of edge)

Your desk has legged getting long the H 15 call, and short the H/J 10 call for net zero premium. If we zoomed ahead to expiration what are some p/l scenarios?

  • H expires at $5 and J is trading $4 on the day H expires or “rolls off”. Therefore H/J = $1
    • Both calls expire worthless. P/L = 0
  • H expires $15 and J is trading $4 so H/J is $11.
    • Ouch. Your long call expired worthless and your short H/J $10 call expired at $1.00. You just lost a full $1.00 or 1,000 ticks. That’s a pretty wild scenario. H went from $5.43 to $15 and J…didn’t even move?!

How about another scenario.

  • H goes to $16 and J to $7. So H/J expires at $9.
    •  The $10 CSO call you are short expires OTM and the vanilla H 15 call earned you $1.00. Now you made 1000 ticks.
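The three scenarios can be replayed with a few lines of code. This is a sketch under the same assumptions as the example: the desk is long the vanilla H 15 call at $.40, short the H/J $10 CSO call at $.40, both held to expiry with no hedging. The function name and defaults are mine.

```python
def desk_pnl(h_final: float, j_final: float,
             vanilla_strike: float = 15.0, cso_strike: float = 10.0,
             vanilla_paid: float = 0.40, cso_collected: float = 0.40) -> float:
    """Expiry P/L per unit: long H call, short H/J CSO call."""
    long_call_value = max(h_final - vanilla_strike, 0.0)
    short_cso_value = max((h_final - j_final) - cso_strike, 0.0)
    return (long_call_value - vanilla_paid) + (cso_collected - short_cso_value)


for h, j in [(5.0, 4.0), (15.0, 4.0), (16.0, 7.0)]:
    pnl = desk_pnl(h, j)
    # 1 tick = 1/10 of a cent, so $1.00 = 1,000 ticks
    print(f"H={h} J={j} spread={h - j} P/L={pnl:+.2f} ({pnl * 1000:+.0f} ticks)")
```

Feed it your own (H, J) pairs. The P/L depends on where H lands and, separately, on how far J follows it, which is why a single vol number does a poor job summarizing the position.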

It quickly becomes clear that vol surfaces for these products are untamed. Option models assume bell-curvish type distributions. They are not well-suited for this task. You really have to reason about these like a puzzle in price space. I won’t really dive into how to manage a book like this because it’s very far out of scope for a post but it’s critical to remember that pricing is just one consideration. Mark-to-market, path, margin play a huge role.

Sucker Bets

The truth is the gas market is very smart. The options are priced in such a way that the path is highly respected. The OTM calls are jacked, because if we see H gas trade $10, the straddle will go nuclear.

Why? Because it has to balance 2 opposing forces.

  1. It’s not clear how high the price can go in a true squeeze or shortage
  2. The MOST likely scenario is the price collapses back to $3 or $4.
Let me repeat how gnarly this is.
The price has an unbounded upside, but it will most likely end up in the $3-$4 range.
Try to think of a strategy to trade that.
Good luck.
  • Wanna trade verticals? You will find they all point right back to the $3 to $4 range.
  • Upside butterflies which are the spread of call spreads (that’s not a typo…that’s what a fly is…a spread of spreads. Prove it to yourself with a pencil and paper) are zeros.
The market places very little probability density at high prices but this is very jarring to people who see the jacked call premiums.
That’s not an opportunity. It’s a sucker bet.
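If you would rather skip the pencil and paper, here is a small check that a call butterfly really is a spread of call spreads. The $9/$10/$11 strikes are hypothetical, chosen only to make the payoffs easy to read.

```python
def call_payoff(price: float, strike: float) -> float:
    return max(price - strike, 0.0)


def call_spread(price: float, low: float, high: float) -> float:
    """Long the low-strike call, short the high-strike call."""
    return call_payoff(price, low) - call_payoff(price, high)


def butterfly(price: float, low: float, mid: float, high: float) -> float:
    """Long 1 low call, short 2 mid calls, long 1 high call."""
    return (call_payoff(price, low)
            - 2 * call_payoff(price, mid)
            + call_payoff(price, high))


def spread_of_spreads(price: float, low: float, mid: float, high: float) -> float:
    """Long the lower call spread, short the upper call spread."""
    return call_spread(price, low, mid) - call_spread(price, mid, high)


prices = [p / 4 for p in range(0, 81)]  # $0 to $20 in 25-cent steps
same = all(butterfly(p, 9, 10, 11) == spread_of_spreads(p, 9, 10, 11) for p in prices)
print(same)                      # True
print(butterfly(10, 9, 10, 11))  # 1.0, the most the fly can pay
print(butterfly(15, 9, 10, 11))  # 0.0, worthless away from the body
```

Because the fly only pays when the price finishes near the middle strike, what it costs relative to its maximum payout is a rough read on the probability the market assigns to that neighborhood. A fly priced near zero says the market sees almost no chance of finishing there.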

Let me show you what’s going on with the CSOs:


The CSO options tell us that the H/J spread has roughly 3% chance of settling near $2, a 2% chance of ending near $3 and a 0%  chance of settling anywhere higher than that.
And yet the futures spread is trading $1.44 today! And the options fully expect that to collapse.
What is going on?
Look at history. Even in cold winters, the spread almost always settles….at zero! When H expires, it is basically going to be at the same price as J.
Now, I know nothing of gas fundamentals. And none of this is advice. And I’m not currently up on the market. What I am explaining is how these prices can look crazy (as in “whoa, look at all this opportunity”) and yet actually be fair.
The market does something brilliant.
It appreciates path while never giving you great odds on making money on the terminal value of the options.

The Wider Lesson

So how do you make money without a differentiated view on fundamentals in such a market?

There are 2 ways and they double as general lessons.

  1. Play bookie

    You have a team that trades flow. You are trading the screens and voice, you’re getting hit on March calls over here, you’re getting lifted on March puts over there, you’re buying CSO puts on that phone, your clerk is hedging futures spreads on the screens. Unfortunately, this is not really a trade. This is a business. It needs software, expertise, relationships. Sorry not widely helpful.

  2. Radiate outwards

    The other way to make money is prospecting elsewhere, with the knowledge that the gas market is smart. It’s the fair market. It’s not the market where you get the edge, it’s the one that tells you what’s fair or expected. So you prospect for other markets or assets that have moved in response to what happened in the gas market, but did so in a naive way. A way that doesn’t appreciate how much reversion the gas market has priced in. Can you find another asset that’s related, but whose participants are using standard assumptions or surfaces? Use the fair market’s intelligence to inform trades in a dumber or less liquid or stale market.

Trading As a Concept

Many people think that trading is about having a view. Trading is really about measuring the odds of certain outcomes based on market prices. Markets imply or try to tell us something about the future. The job is to find markets that say something contrary about the future and take both bets. Arbitrage is an extreme example of this. If one person thinks USA basketball is 90% to win the gold and another thinks the field is 15% to win the gold, you can bet against them both and get paid $105 while knowing you’ll only owe $100. Trading is identifying similar examples, but of course in reality they are harder to find and require creativity and proper access. To see the present clearly you must be agnostic. You look for contrary propositions. Trading is not about having strong opinions. It’s not thematic. You don’t have some grand view of what the future looks like or the implications of some emerging technology or change in regulations. You just want to find prices that disagree.
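The basketball arithmetic, spelled out. This is a sketch that assumes each person will pay exactly their own believed probability for a ticket that pays $100 if their side wins, and that “USA” and “the field” are mutually exclusive and cover every outcome. The function name is made up for illustration.

```python
def locked_in_profit(believed_probs: dict, payout: float = 100.0) -> float:
    """Sell each person a ticket on their own outcome at their own price.

    Exactly one outcome happens, so you owe `payout` exactly once.
    """
    collected = sum(prob * payout for prob in believed_probs.values())
    return collected - payout


beliefs = {"USA": 0.90, "field": 0.15}
print(round(locked_in_profit(beliefs), 2))  # 5.0, no matter who wins
```

Whenever the believed probabilities of a complete set of outcomes sum to more than 100%, the prices disagree and the difference is yours. No opinion about basketball required.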
Why would you slug it out in smart markets? Use them to find trades in markets that radiate away from them that are not incorporating parameters from the smart market fully. If you can’t get away from fair markets, you are going to need to be absolutely elite.
Battling it out in SPY reminds me of this cartoon:

The solutions in markets are rarely going to be where it’s easy to see because that’s where everyone will be looking.

Happy prospecting.

If you found CSOs interesting recognize there are physical assets that are just like options on a spread.

  • Oil refineries = Heat/Gas crack options
  • Power plants = Spark spread options
  • Oil storage facility = WTI CSO puts
  • Soybean mill that crushes soy into meal/bean oil

If you had a cap ex program to build one of these assets how would you value it? You’d need to model volatility for the spread between its inputs and outputs!

The owners of these assets understand this. They are the ones selling CSOs! It’s the closest hedge to their business.

I got the data for this post from the CME website’s nat gas settlements page.
The dropdowns on the right of the page should keep you busy.

Teach A Math Idea To Internalize It

My 8-year-old Zak is going to be taking the OLSAT soon. It’s a 64-question test that looks an awful lot like an IQ test. The test (or one of its brethren like the CoGat) is administered to all 3rd graders in CA. If you score in the top 2 or 3% you can be eligible for your local ‘gifted and talented’ program. 20% of the questions are considered “very challenging” and that’s where the separation on the high end happens.

I gave Zak a practice test just to familiarize him with it. He’s never taken a test with a time limit before and never filled out Scantron bubbles. Do not underestimate how confusing those sheets are to kids. It took a while for him to register how it worked because he only saw choices A,B,C,D for each of the 64 questions.

Daddy, the answer to question 1 is ‘cat’ not A,B,C, or D

I know, Zak, it’s just that…you know what bud, how about just circle the right answer on the question for now.

Hopefully, some practice breaks the seal so he isn’t scared when he sits for his first test ever. I think a small amount of prep is helpful even though I get the sense that caring about tests is not in style around here. Call me old-fashioned. I’m not bringing out a whip, but having the option to go to the program seems worth putting in a token effort if you think your kid has a shot.

Anyway, he took one test. Poking around a bit, I think his raw score would land him in the 90th percentile. Not good enough but it was his first shot and if he doesn’t improve much, that’s totally fine too. Plenty of people are content just flipping burgers (I’m kidding, calm down. Also, get your own kid to stuff your insecurities into). One thing did stand out. He got all the math questions (about 1/3 of the test) correct.


It made me think of how I was a decent math student growing up.


Not good enough to compete with peers who did math team in HS, but enough to get through Calc BC. Regretfully, I never took another math class after that. I optimized my college courses for A’s not learning. Short-sighted.

I really felt the pain of that decision when I got hired to trade options and was surrounded by a cohort in which 50% of the trainees had an 800 math SAT. (There were 3 people in our office of about 60 that had an SAT verbal > math. I was one of them.) That inferiority exists even to this day. Until Google Translate can decode academic papers, those things are for lining birdcages.


Every now and then, I’ll come across a math topic that seems useful for making estimates about practical things, so I’ll learn it.

And then I’m reminded I have no math gifts because that learning process is uphill in molasses. When I was young I did lots of practice problems (how else are you supposed to become a doctor and please mom) which got me proficient. Today, it’s a similar process. I just power through it.

But there is a difference in how I power through it.

Instead of practice problems, I watch YouTube until I can write the ELI5 version for others. Everyone has heard that if you want to test your knowledge, teach it to others. In that case, it’s a win-win. We all learn.

So that’s what I did this week. I wrote an ELI5 version of a concept called Jensen’s Inequality.

  • Jensen’s Inequality As An Intuition Tool (10 min read)

    You will learn:

    • Why I found Jensen’s Inequality interesting
    • The conditions and statement of the inequality
    • An example that affects us all
    • Spotting Jensen’s in the wild

    If you struggle to understand it after reading it tell me. I am challenging myself to see if I can relay not just the concept but the significance of it with minimal effort on behalf of the reader. If I can get to the point where I’m “putting in the effort so you don’t have to” then I’ll feel like I’m being useful here.

    If you think you got it, test yourself the way I did. Construct an example. (That’s what I did with the “traffic on the way to Sizzler” example.)

  • If you grok Jensen’s Inequality and want to relate it to portfolio construction Corey is your guy. Before I learned of this concept his tweets would have made no sense to me, but now I at least kinda get it.

Understanding Vega Risk

In a chat with an options novice, they told me they didn’t want to take vol (vega) risk so they only traded short-dated options. This post will explain why that logic doesn’t work.

Here’s the gist:

It’s true that the near-term option’s vega is not large. That is counterbalanced by the fact that near-term implied vols move faster (ie are more volatile) than longer-term vols.

The goal of this post is to:

  • demonstrate that near-term vols are more volatile both intuitively and with napkin math
  • show the practical implications for measuring risk

Near Term Vols Are More Volatile

An Intuitive Understanding

Think of the standard deviation of returns that a stock can realize over the course of a week. If there is a holiday in that week the realized volatility will likely be dampened since there are 4 days of trading instead of 5. If Independence Day falls on Friday, Thursday might see even lower volatility than a typical trading day as fund managers chopper to the Hamptons early. On the other extreme, if a stock misses earnings and drops 25%, then we have a Lenin-esque week where a year happens. The range of realized volatilities is extremely wide. This requires the range of implied volatilities to be similarly wide for a 1-week option. Those large single-day moves are diluted when they are part of a computation for 1-year realized volatility (there are 253 trading days in a year).

This concept is easily shown with a “volatility cone” (credit: OptionsUniversity)

Here we can see the standard deviation of realized volatility itself declines as the sampling period lengthens.

The Napkin Math Understanding

The intuition for why the range of short-dated volatility is wider than long-dated volatility is easy to grasp. To cement the intuition let’s look at a numerical example.


A weekly option [5 days til expiry]

Assume the stock’s daily vol is expected to be 1% per day. The fair implied vol can be computed as follows:

IV = sqrt(.01² x 5 days x 52 weeks) = 16.1%¹

Remember variances are additive not standard deviations so we must square daily vols before annualizing them. We take a square root of the expression to bring it back into vols or standard deviation terms.

Ok say 1 of those days is an earnings day and is expected to be 3% daily vol.

IV = sqrt([.01² x 4 days + .03² x 1 day] x 52 weeks) = 26%

Look what happened.

The single-day expected vol jumping from 1% to 3% means there is more variance in that single day than the remaining 4 days!

.01² x 4 days < .03²

How did this earnings day affect the fair IV of longer-dated options?

A 2-week option [10 days til expiry]

 IV =  sqrt([.01² x 9 days + .03² x 1 day] x 26 bi-weeks) = 21.6%

A 1-month option [21 days til expiry]

IV = sqrt([.01² x 20 days + .03² x 1 day] x 12 months) = 18.7%

The increased vol from a single day is clearly diluted as we extend the time til expiry. When we inserted a single day of 3% vol:

  • The 1-week option vol went from 16% to 26%. 10 vol point increase.
  • The 1-month option went from 16% to 18.7%. 2.7 vol point increase.
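The napkin math above fits in one small function. This sketch uses the same annualization as the examples (52 weeks, 26 bi-weeks, 12 months), and the function name is mine.

```python
import math


def fair_iv(daily_vols: list, periods_per_year: int) -> float:
    """Annualized vol implied by the expected daily vols inside one period.

    Variances add, so square each daily vol, sum them, annualize,
    then take the square root to get back to vol terms.
    """
    period_variance = sum(v ** 2 for v in daily_vols)
    return math.sqrt(period_variance * periods_per_year)


scenarios = {
    "1-week, no event": fair_iv([0.01] * 5, 52),
    "1-week, earnings": fair_iv([0.01] * 4 + [0.03], 52),
    "2-week, earnings": fair_iv([0.01] * 9 + [0.03], 26),
    "1-month, earnings": fair_iv([0.01] * 20 + [0.03], 12),
}
for label, iv in scenarios.items():
    print(f"{label}: {iv:.1%}")
# 1-week, no event: 16.1%
# 1-week, earnings: 26.0%
# 2-week, earnings: 21.6%
# 1-month, earnings: 18.7%
```

Swap in your own event-day vol or move the event around to see how quickly the effect dilutes as the option gets longer.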

To understand why this matters look at the effect on P/L:

Remember, the vega of the 1-month straddle is 2x the vega of the 1-week option.

    • The 1-week straddle increased by 10 vol points x the vega.
    • The 1-month straddle increased by roughly 3 vol points (2.7 rounded up) x 2 x the vega of the 1-week straddle

      10x > 6x

      The 1-week straddle’s price increase was 10/6 of the 1-month straddle’s, ie roughly 67% larger!

      (This is why event pricing is so important. The astute novice’s head will now explode as they realize how this works in reverse. You cannot know what a clean implied vol even is unless you can back out the market’s event pricing)
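To see the same comparison in dollars, here is a sketch that prices both straddles before and after the earnings day is inserted. It leans on a common rule of thumb that I am assuming here, ATM straddle ≈ 0.8 x S x vol x sqrt(T), and on a hypothetical $100 stock. It uses the unrounded 2.7 point move, so the ratio comes out a bit higher than the 10/6 napkin version.

```python
import math


def atm_straddle(spot: float, vol: float, years: float) -> float:
    """Rule-of-thumb ATM straddle price (an approximation, not a model)."""
    return 0.8 * spot * vol * math.sqrt(years)


spot = 100.0                 # hypothetical stock price
week, month = 1 / 52, 1 / 12

# Vols before and after the 3% earnings day, from the napkin math above
week_gain = atm_straddle(spot, 0.260, week) - atm_straddle(spot, 0.161, week)
month_gain = atm_straddle(spot, 0.187, month) - atm_straddle(spot, 0.160, month)

print(f"1-week straddle gained  ${week_gain:.2f}")
print(f"1-month straddle gained ${month_gain:.2f}")
print(f"ratio: {week_gain / month_gain:.2f}x")
```

The 1-month straddle has about twice the vega, yet the weekly gains more dollars because its vol moved almost 4x as many points.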

Practical Implications For Measuring Vega Risk

Comparing Risk

So while a 1-month ATM option has 1/2 the vega of a 4-month option², if the 1-month IV is twice as volatile it’s the same vega risk in practice. You need to consider both the vega and the vol of vol!

In practice, if I tell you that I’m long 100k vega, that means if volatility increases [decreases] 1 point my position makes [loses] $100k. But this risk doesn’t mean much without context. A 100k vega position means something very different in a 1-week option versus a 1-year option. Looking at a vol cone, we might see that 1-week implied vol has an inter-quartile range of 30 points while 1-year vol might only have a 3 point range. You have 10x the risk if the vega is in the weekly vs the yearly!

Another way of thinking about this is how many contracts you would need to have to hold 100k vega. Since vega scales by sqrt(time) we know that a 1-year option has √52x or 7.2x as much vega. So to have the equivalent amount of vega in a 1-week option as a 1-year option you must be holding 7x as many contracts in the near-dated.
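Both ways of looking at it, in code. This sketch assumes vega scales with sqrt(time) as stated above and uses the illustrative 30 point and 3 point vol ranges from the text. Multiplying vega by the vol range is just a crude risk proxy of my own, not a formal risk measure.

```python
import math


def vega_ratio(long_years: float, short_years: float) -> float:
    """How many times more vega the longer-dated ATM option has per contract."""
    return math.sqrt(long_years / short_years)


def vol_risk(vega_dollars: float, vol_range_points: float) -> float:
    """Crude proxy: dollars at risk if vol travels across its typical range."""
    return vega_dollars * vol_range_points


ratio = vega_ratio(1.0, 1 / 52)
print(f"1-year option has {ratio:.1f}x the vega of a 1-week option")  # 7.2x

weekly = vol_risk(100_000, 30)  # 100k vega, 30 point inter-quartile range
yearly = vol_risk(100_000, 3)   # 100k vega, 3 point range
print(f"weekly: ${weekly:,.0f}  yearly: ${yearly:,.0f}  ratio: {weekly / yearly:.0f}x")
```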

Normalizing Vegas

It’s common for traders and risk managers to normalize vega risk to a specific tenor. The assumption embedded in this summary is that volatility changes are inversely proportional to root(time). So if 1-week volatility increased by 7 points, we expect 1-year vol to increase by 1 point.

This is an example of normalizing risk to a 6-month tenor:


  • Your headline raw vega is long, but normalized vega is short
  • Your 2,000 vega in a weekly option is more vol risk than your 10,000 vega in the 6-month
  • You want the belly of the curve to decline faster than the long end. This is a flattening of the curve in a rising vol environment and a steepening in a declining vol environment.
  • If the entire vol curve were to parallel shift lower, you’d lose as you are net-long raw vega.
  • If we choose to normalize to a different tenor than 180 days, we would end up with a different normalized vega. The longer the tenor we choose, the shorter our normalized vega becomes (test for yourself).
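Here is a sketch of the normalization itself. The book below is hypothetical (it is not the table from the original post), built only so that it shows the same qualitative features as the bullets: raw vega is long, normalized vega is short. Each bucket’s vega is scaled by sqrt(reference tenor / bucket tenor).

```python
import math

# Hypothetical book: days to expiry -> raw vega in dollars per vol point
book = {7: 2_000, 30: -5_000, 90: -6_000, 180: 10_000}


def normalized_vega(positions: dict, ref_days: float) -> float:
    """Sum of vegas, each scaled to the reference tenor by sqrt(ref / tenor)."""
    return sum(vega * math.sqrt(ref_days / days) for days, vega in positions.items())


print(f"raw vega:               {sum(book.values()):+,.0f}")
for ref in (90, 180, 360):
    print(f"normalized to {ref:>3} days: {normalized_vega(book, ref):+,.0f}")

# The 2,000 weekly vega counts for more than the 10,000 six-month vega
print(round(2_000 * math.sqrt(180 / 7)))  # 10142
```

Changing the reference tenor multiplies every bucket by the same factor, so it never flips the sign of the total. It only stretches it, which is why a short normalized vega gets shorter as the reference tenor gets longer.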

Critically, we must remember that this summary of net vega while likely better than a simple sum of raw vega is embedding an assumption of sqrt(time). If you presume that vol changes across the curve move in proportion to 1/sqrt(t), the value of calendar straddle spreads stays constant. At this point, you should be able to test that for yourself using the straddle approximation in the footnotes. This would imply that as long as your total normalized vega is 0, you are truly vega neutral (your p/l is not sensitive to changes in implied vol).
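A quick way to test that claim, assuming the rule-of-thumb straddle approximation (ATM straddle ≈ 0.8 x S x vol x sqrt(T)) and hypothetical inputs: shock each tenor’s vol in proportion to 1/sqrt(T) and see what happens to the calendar straddle spread.

```python
import math


def atm_straddle(spot: float, vol: float, years: float) -> float:
    """Rule-of-thumb ATM straddle price (an approximation, not a model)."""
    return 0.8 * spot * vol * math.sqrt(years)


def shocked_vol(vol: float, years: float, shock_at_1y: float) -> float:
    """Vol after a curve move where the change is proportional to 1/sqrt(T)."""
    return vol + shock_at_1y / math.sqrt(years)


spot, vol = 100.0, 0.20   # hypothetical stock and flat starting curve
near, far = 1 / 12, 1 / 2
shock = 0.01              # 1 vol point at the 1-year tenor

before = atm_straddle(spot, vol, far) - atm_straddle(spot, vol, near)
after = (atm_straddle(spot, shocked_vol(vol, far, shock), far)
         - atm_straddle(spot, shocked_vol(vol, near, shock), near))

print(f"near vol moved {shocked_vol(vol, near, shock) - vol:.2%}, "
      f"far vol moved {shocked_vol(vol, far, shock) - vol:.2%}")
print(f"calendar spread before: {before:.4f}  after: {after:.4f}")
```

The near-dated vol moves about 3.5 points and the far-dated about 1.4, yet both straddles gain the same dollar amount, so the spread between them is unchanged. The sqrt(T) in the vega cancels the 1/sqrt(T) in the vol change.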

As you might expect, that assumption of sqrt(time) vol changes across the curve is just a useful summary assumption, not gospel. In fact, on any given day you can expect the curve changes to deviate from that model. As we saw above, the bottom-up approach of adding/subtracting volatility with a calendar has uneven effects that won’t match up to the sqrt(time) rule. Your actual p/l attributed to changes in volatility will depend on how the curve shifts and twists. Perhaps the decay rate in a vol cone could provide a basis for a more accurate scaling factor. That requires more work, though, and scaling by time lets us normalize across assets and securities more understandably than empirical or idiosyncratic functions would.


Just because the vega of a longer-dated option is larger doesn’t necessarily mean it has more vol risk.

  • We need to consider how wide the vol range is per tenor. We looked at realized vol cones, but implied vol cones can also be used to approximate vol risk.
  • We need to recognize that a steepening or flattening of vol curves means the price of straddle spreads is changing. That means a vega-neutral position can still generate volatility profits and losses.
  • Changing straddle spreads, by definition, means that vol changes are not happening at the simple rate of sqrt(time).
  • Measuring and normalizing vols (or any parameter really) always presents trade-offs between ease, legibility/intuition, and accuracy.

Shorting In The Time Of ShitCos

HTZG, GME, now HWIN.  The more slandered or shorted or ridiculous the name is the more bullish it seems to be for the stock. Just imagine explaining this to an alien.

“I bought a deli for $100mm. It’s an investment.

A deli? Well… it’s a place where people from the surrounding neighborhood go midday for some protein stuffed into wheat…umm, no, not every person in the neighborhood. Just like some of them. Why not everyone? There are other delis I guess. And a McDonald’s. Oh, you have those too? Yea I love the fries myself. Ah, yes back to the deli. Right, so the deli actually has to buy the ingredients. Correct, it doesn’t grow them. Slaves? What? No, no, no. Those people are called “employees”. I have to pay them. And yes, that guy needs to be paid too. IRS. We call him IRS.

Did I mention it has the best dills?”

The entire shorting business model appears to be broken. In a period where concentrated shorts are getting lit up, in a period where diamond hands combined with brick brains, shorting just looks like return-free risk. Or at least the style where you try to recruit support after establishing the short.

I think @Mephisto731 is correct. Probably super correct. The best time to sell insurance is after the earthquake blows out your competitors.

You’re sneering. Fine, I’ll play along.

Common Objections To Shorting

It’s common for shorting detractors to mock the strategy as negative EV for 2 reasons. I’m just going to annihilate them now so we can get to a more productive discussion.

  1. Stocks have positive drift (aka “stonks only go up”)

    I get it, you are fighting the most fundamental risk premium. The “equity risk premium”. First of all, that’s debatable. After all, most stocks go to zero. Stock indices have risen over time thanks to rebalancing. But more clinically, the drift that works against a short can be offset by hedging the beta. You can short the target and get long a basket to sterilize the drift. So, in practice, and possibly in theory, this positive drift objection can be put to rest.

  2. Stocks have unbounded upside but limited downside

    This has no bearing on the EV of shorting. Anyone familiar with options understands that individual stocks have positive skew. If a stock is $100 despite everyone knowing that it is bounded by zero and infinity then the odds of it going down are the counterbalance. And the fact that most stocks go to zero is in keeping with that understanding. So, stop citing the unbounded upside as a reason why shorting is negative EV. Remember EV is a sumproduct of terminal prices and probability.
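A toy sumproduct makes the point concrete. The two-outcome distribution below is entirely hypothetical and ignores drift, interest and borrow costs. It only shows that a $100 price is consistent with a stock that usually goes to zero.

```python
# Hypothetical terminal distribution: (terminal price, probability)
outcomes = [(0.0, 0.90), (1000.0, 0.10)]

price_today = 100.0
expected_price = sum(price * prob for price, prob in outcomes)
prob_down = sum(prob for price, prob in outcomes if price < price_today)
short_ev = sum((price_today - price) * prob for price, prob in outcomes)

print(f"expected terminal price: {expected_price:.2f}")  # 100.00
print(f"probability the stock falls: {prob_down:.0%}")   # 90%
print(f"EV of being short: {short_ev:.2f}")              # 0.00
```

The short wins 90% of the time and gets carried out the other 10%, and the two net to zero. The unbounded upside is already paid for by the odds.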

That said, shorting is no stroll in the park. We just don’t need to fabricate objections like the ones above to show that.

The Real Reasons Why Shorting Is Difficult

  • No limit to arbitrage on the short side

    First, think of the long side. I’ll paraphrase Sam Bankman-Fried’s explanation from his recent Odd Lots interview:

If AAPL stock price went to $1 tomorrow, Warren Buffett or whoever would just buy the whole company. It makes billions of dollars in earnings and you could just buy all the earnings for less than the stock price if it got low enough. But on the short side, there is no mechanism to moor the stock to reality (although as we learned from the Archegos saga, a secondary to feed the ducks has consequences).

This lack of limit to arbitrage doesn’t change the EV of the stock which is already balanced by probabilities, but it does change the path behavior. You need to borrow shares to be short, and any share borrowed means a future buy order. So inflows of cash can cascade into forced covering since the short-seller is effectively levered.

  • The negative gamma effect

    I’ve explained this before with respect to shorting, but I’ll re-hash it simply. When a fund sizes a short it does so as a percentage of its AUM. Say the short is 10% of its AUM. You can think of the AUM as the denominator and the dollar-weighted short as the numerator. This ratio starts at 10/100.

    What happens if the fund wins on the trade because the stock drops 50%?

    Well, now the fund has made 50% on a 10% position, so its new equity is 105. Yet, the size of the short shrank with the stock halved. So now the numerator is 5, not 10 units. So the short is now 5/105 or 4.8%. The fund needs to more than double the size of the short to maintain constant exposure as a percentage of AUM. Both the numerator and denominator moved in a way that reduced the position.

    This looks just like short gamma. You need to sell more as the stock falls!

    When the stock rallies, the size of the short (numerator) increases, while the fund’s equity (denominator) gets hammered. Both forces conspire to force short-covering. Or buying, in a rallying market. Negative gamma. And to think, you often pay to borrow stocks, so you get the indignity of paying theta to play this game.
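The numerator/denominator arithmetic in both directions, as a sketch. It assumes the fund’s only P/L comes from this one short and that it wants to get back to a 10% target. The function name is made up for illustration.

```python
def short_after_move(aum: float, short_value: float, stock_return: float):
    """Return (new AUM, new short size, exposure as a fraction of AUM)."""
    pnl = -short_value * stock_return             # a short profits when the stock falls
    new_aum = aum + pnl                           # denominator
    new_short = short_value * (1 + stock_return)  # numerator
    return new_aum, new_short, new_short / new_aum


target = 0.10
for move in (-0.50, +0.50):
    aum, short, exposure = short_after_move(100.0, 10.0, move)
    adjustment = target * aum - short  # + means short more, - means cover
    action = "short more" if adjustment > 0 else "buy to cover"
    print(f"stock {move:+.0%}: AUM {aum:.0f}, short {short:.0f}, "
          f"exposure {exposure:.1%}, {action} {abs(adjustment):.1f}")
# stock -50%: AUM 105, short 5, exposure 4.8%, short more 5.5
# stock +50%: AUM 95, short 15, exposure 15.8%, buy to cover 5.5
```

Sell more on the way down, buy on the way up. That is the trading pattern of someone who is short gamma.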

The Options Approach

Let’s address the ways we can use options to be short.

  • Synthetic shorts

    If you want to implement the short in the most similar way to a short stock position, then you will want to structure a “synthetic short”. Just like a stock position, it has 100 delta and no Greeks except exposure to cost of carry. But you faced that risk from the prime you borrow shares from anyway.  In this case, the borrow cost is embedded in the options but the clearing rate for that cost will be inherited from the arbitrageurs with the best funding rates.

    How to implement a synthetic short

    You buy a put and short a call on the same strike in the same expiry. To prove to yourself that it is the exact same exposure as a short stock position work through this example:

    Stock is $100
    You buy the 1 year 100 put for $10 and sell the 1 year 100 call at $10.

    The stock drops to $80 by expiration. What’s your p/l?
    What if the stock ripped to $120?
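    If you want to check your pencil-and-paper answer, here is a sketch of the expiry P/L. It uses the prices from the example and ignores carry, dividends and early exercise. The function names are mine.

```python
def synthetic_short_pnl(final_price: float, strike: float = 100.0,
                        put_paid: float = 10.0, call_collected: float = 10.0) -> float:
    """Expiry P/L per share of long put + short call on the same strike."""
    long_put = max(strike - final_price, 0.0) - put_paid
    short_call = call_collected - max(final_price - strike, 0.0)
    return long_put + short_call


def short_stock_pnl(final_price: float, entry_price: float = 100.0) -> float:
    """P/L per share of a plain short from the entry price."""
    return entry_price - final_price


for final in (80.0, 100.0, 120.0):
    print(final, synthetic_short_pnl(final), short_stock_pnl(final))
# 80.0 20.0 20.0
# 100.0 0.0 0.0
# 120.0 -20.0 -20.0
```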

The synthetic short will have the same path risks as an actual short so let’s move on to option strategies that mitigate the path risk.

  • Outright puts

    If short-selling seems like it has negative gamma, you could always substitute your trade expression with long options. At least, you get something for the theta.  So while you will be paying to borrow, it might actually be at a better rate than you can borrow from your broker. And the moment you buy the put, the funding rate is capped at the implied cost you traded at. If the borrow gets more expensive from that point forward, your put will actually appreciate in step with its rho.

    The risks of buying puts are familiar. You can be wrong on timing, vol, how far the stock actually falls.  You can get middled. Your thesis can be right but not right enough.

    The benefit is you cannot lose more than the premium (unless you dynamically hedge…but if you are using the puts directionally then you shouldn’t be doing that anyway). This simple fact turns your weak hand into a strong hand. You always reserve the right to roll your puts down as you take profits or up to chase the rising stock. But the basic position, while risky, is path-resistant. And path is why shorting is so hard.

  • Put spreads

    Buying a put vertical (buy 1 put, sell a lower strike put, same expiry) sterilizes many of the Greeks since you buy and sell an option, including some of the borrow costs.  The tighter the strikes the more the bet looks like a pure probability play. If the strikes are wide, your further OTM put will not offset the Greeks of the near put as much (if you think about it, an outright put position is the same thing as a put spread where the further OTM strike is the zero strike).

    If the stock has a lot of negative sentiment around it, depending which put spreads you choose, it’s possible you are getting a bargain if the put skew is especially fat.

Options and the “Write Down Your Thoughts” Effect

I’m not shilling for options here. I’m just pointing out that in a market that is scaring vanilla short sellers away, there are trade expressions that allow you to stay in the game at the time when you probably want to the most. Even if you decide not to use options, there is a benefit from walking through the trade construction process — it will tighten up your thinking. It’s like journaling.

Before choosing an option implementation, you should write down your answers. I’d be surprised if the answers to these questions didn’t impact how you might frame a vanilla short.

Let’s walk through questions you must answer before buying a put spread.

  • Edge: if the put spread I’m looking at pays 6-1 what do I think the true odds are? 4-1? 3-1? The amount of edge AND the fact that we are talking about a bet with a sub 25% hit ratio will dictate my risk budget.
  • Risk budget: How much am I willing to lose in premium?
  • Should I spread my risk budget over several months or is there a specific catalyst or expiring lockup that favors concentrating the bet in a single month?
  • Which put spread should I buy? Would I rather buy $1,000,000 worth of the 85%-80% put spread or the 70%-65% if $1,000,000 buys me 2x as many of the further OTM spreads? Or maybe I prefer a higher delta trade that pays off more often but pays smaller odds. This forces me to think about price targets and the market’s relative implied pricing of those targets. It directs your attention to the meatiness or winginess of your thesis.
  • Does the winginess or meatiness of my thesis correlate to any other forces in the market or is it a purely idiosyncratic idea? For example, if you were interested in owning put spreads on a portion of the ARKK basket, then you could concentrate your put spreads on the subset of the basket that offered the best implied odds. Your thesis wasn’t specific to a single stock but more of a general liquidity trade.
  • How much dry powder do you want in reserve to roll your put spread up when the stock rallies? What thresholds would trigger rollups? Likewise, if the stock sells off, will you roll spreads down? How about down and out into a further calendar month? Will you roll down on a 1-to-1 basis (taking profits) or aggro win-big-or-go-home style where you use 100% of the collected premium to buy a boatload of further OTM put spreads?
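
The edge question can be made concrete with a little arithmetic. Here is a minimal sketch, assuming "pays 6-1" means you risk $1 of premium to make $6 and that the "true odds" are quoted the same way (X-to-1 against). The function names are hypothetical, chosen only for illustration.

```python
def breakeven_probability(payout_odds: float) -> float:
    """Hit rate at which a bet paying `payout_odds`-to-1 has zero expected value."""
    return 1.0 / (payout_odds + 1.0)


def ev_per_dollar(payout_odds: float, true_odds: float) -> float:
    """Expected profit per $1 of premium if the true odds are `true_odds`-to-1 against."""
    p_win = 1.0 / (true_odds + 1.0)
    return p_win * payout_odds - (1.0 - p_win) * 1.0


market_odds = 6  # the put spread pays 6-1
print(f"breakeven hit rate: {breakeven_probability(market_odds):.1%}")
for my_odds in (4, 3):
    p = 1.0 / (my_odds + 1.0)
    print(f"true odds {my_odds}-1 -> hit rate {p:.0%}, "
          f"EV per $1 of premium: ${ev_per_dollar(market_odds, my_odds):.2f}")
```

Even when the expected value per dollar looks large, the bet still loses at least 75% of the time in these cases, which is why the hit ratio feeds straight into the risk budget.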

Working through these questions refines your thinking and creates a plan for different scenarios. I find that the granularity of options and layers of relative pricing force me to “write down my thoughts” in a way that delta 1 trading can easily gloss over.


Short-selling is hard. Not because it’s negative EV, but because limits to arbitrage and the reality of levered return math create perilous paths. Whether the bruises from the recent mania will usher in a “golden age of short-selling” remains to be seen. But removing an entire direction of returns from your arsenal seems short-sighted. It’s a surrender to the current moment just when you should be thinking hardest about profiting from names that on a long enough time frame will have prices that match their ShitCo status. Options provide a more path-hardy set of trade expressions and may become table stakes for investors (ie hedge funds) whose mandates should not allow them to ignore the short side.


The difficulty with shorting and inverse positions

Shorting Bimodal Stocks

A Thought Exercise For Outsourcing Liquidity Risk

Understanding Edge

In my indoctrination into trading, the term “edge” was equated to the bookie’s “vig” or a casino’s “house edge”. This makes sense since I started in this business as a market maker. The interview questions I faced were focused on mathematical expectation or expected value. For example, if someone offered you a game that pays you the number that comes up on a single die, what would you pay to play? The weighted average payout of the game is $3.50. So if you can pay $3 to play, you’d make $.50 in theoretical profit. Of course, you could still lose if you roll a 1 or 2, but if you could do this every day, you’d earn 14% ($.50/$3.50) in the long run.
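
The die game is small enough to check by brute force. This is a minimal sketch using exact fractions; the variable names are mine, and "edge as a percent of fair value" is my reading of the 14% figure.

```python
from fractions import Fraction

faces = range(1, 7)                              # a fair six-sided die
fair_value = Fraction(sum(faces), len(faces))    # average payout per roll
price_paid = Fraction(3)                         # what we pay to play

edge = fair_value - price_paid                   # theoretical profit per game
edge_pct_of_fair = edge / fair_value             # edge relative to fair value

# Rolls that pay less than the price paid are losing rolls (a 1 or a 2).
prob_lose = Fraction(sum(1 for f in faces if f < price_paid), len(faces))

print(f"fair value: ${float(fair_value):.2f}")
print(f"edge: ${float(edge):.2f} ({float(edge_pct_of_fair):.1%} of fair value)")
print(f"chance of losing on a single roll: {float(prob_lose):.1%}")
```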

The basic premise of the market-making business is 2-fold: capture edge and manage risk so you can survive to actually see that long run.

  1. The edge comes from identifying the fair price.
  2. The primary risk management levers are diversification and sizing.

If you can price accurately and manage risk competently, you can crystallize the edge as surely as the Wynn prints money.

In this post, I will share:

  • the nature of edge in both trading and investing contexts
  • unbehaved edge in the real world
  • intuitions you can take with you

The Nature Of Edge in Trading And Investing

First, let’s define fair value. I will decompose it into 2 concepts.

  1. Expectation

    This can be a price that is ultimately an arbitrage. The die game from the intro or a casino game can be squeezed into this since the asset’s expectancy can be computed. With a large enough bankroll or sufficiently small bet size, it’s practically impossible to lose in the long run. Cash/futures arbitrage and creating/redeeming ETFs trading away from NAV are market examples.

  2. The liquid price

    In the market-maker pasture I was raised in, we’d call any price that was transparently and liquidly trading “fair value”. If the market for an option was “choice” or “pick’em” with deep-pocketed players on both sides then it was “fair”. We might say “fair value is $5, Goldman Sachs by JP Morgan”. In other words, a GS client was $5 bid and a JP Morgan client was offered at $5, it was trading, and there was enough size available for anyone else to basically participate. It’s a fleeting concept, but useful. We could use that price as a benchmark to compare less liquid derivatives as we looked for relative value.

With the idea of fair value established, we can begin exploring the nature of edge with a familiar toy model — the coin flip.

The Power Of Small Edges

Imagine a coin flip game. Call the toss correctly, make $1, otherwise, lose $1. Let’s pretend you could predict the coin flip with 50.5% accuracy. Sweet.

  • What’s your edge?

    The expected value of playing the game is 1% of the $1 stake because your expected payoff is .505 * $1 - .495 * $1 = $.01

  • What’s the standard deviation?

    From the binomial distribution, we know the standard dev or vol is √(.505 * .495) or 50%

  • What’s your risk/reward (Sharpe ratio)?

    I’m going to use the term “Sharpe ratio” in a specific context, as the ratio of edge to volatility. This is intuitively important since edge doesn’t mean much without a measure of variance. For this single toss, the Sharpe ratio is a measly .02 (1%/50%).

A 1% edge on this coin flip doesn’t seem like much. The .02 Sharpe ratio is a laughable signal-to-noise ratio. But as we increase N from 1 flip to many, the binomial distribution can be closely approximated by the familiar Gaussian curve [Taleb, spare my window, I’ll address reality later].

Look closely. The Sharpe ratio increases with N. Specifically, it increases at the rate of √N.

Why? Because the edge or numerator grows linearly with N while the denominator, or vol, only increases at √N. This property of edge is the foundation of trading and gambling. With enough trials, victory is nearly guaranteed. With a 1% edge on a coin flip, you are 90% certain you will be up money after 4,000 trades. So if you have 10 traders making 20 trades each business day, in one month you are more than 90% certain you are winning. In one year, you can’t lose.
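
To see the √N scaling in numbers, here is a rough sketch under the same Gaussian approximation. It takes the .02 single-flip Sharpe from the text as an input rather than deriving it, and it assumes 21 business days per month and 252 per year, which are my assumptions and not figures from the post.

```python
from math import sqrt
from statistics import NormalDist


def sharpe_after(n_bets: int, sharpe_per_bet: float) -> float:
    """Edge grows with N, vol grows with sqrt(N), so Sharpe grows with sqrt(N)."""
    return sharpe_per_bet * sqrt(n_bets)


def prob_profit(n_bets: int, sharpe_per_bet: float) -> float:
    """Chance of being up money after n_bets, using the normal approximation."""
    return NormalDist().cdf(sharpe_after(n_bets, sharpe_per_bet))


SHARPE_PER_FLIP = 0.02           # taken from the text
trades_per_day = 10 * 20         # 10 traders making 20 trades each

scenarios = {
    "4,000 trades": 4_000,
    "one month (21 days, assumed)": trades_per_day * 21,
    "one year (252 days, assumed)": trades_per_day * 252,
}
for label, n in scenarios.items():
    print(f"{label}: Sharpe {sharpe_after(n, SHARPE_PER_FLIP):.2f}, "
          f"P(profit) {prob_profit(n, SHARPE_PER_FLIP):.4%}")
```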

Getting A Feel For Edges

Let’s look at the math in reverse. In Excel, we can use NORM.INV() to find what return corresponds to a desired probability for a given EV and vol. Let’s say we want to be 95% certain we make money. In math language, we are interested in the point where the 5th percentile return of the CDF is equal to 0.

We want to ask Excel:

How many trials do I need to have so that my Sharpe ratio sets my 5th-percentile return to zero?

To do this let’s standardize the vol to 1. The equation we need to solve is:

NORM.INV(5%, EV, 1) = 0

To solve for EV we use Excel’s Goal Seek feature. We find EV = 1.645.

Since we standardized the vol to 1, then we have discovered that at a Sharpe ratio of 1.645 (again Sharpe is EV/vol), the 5th percentile return is 0. That is the Sharpe ratio we need to be 95% certain we make money.
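
If you don’t have Excel handy, the same number falls out of Python’s standard library without any goal seeking. This is a minimal sketch; `required_sharpe` is a name I made up, and it relies on the symmetry of the normal distribution rather than a numerical solver.

```python
from statistics import NormalDist


def required_sharpe(confidence: float) -> float:
    """Sharpe (EV with vol standardized to 1) at which the (1 - confidence)
    percentile return is exactly zero."""
    # With vol = 1, the p-th percentile is EV + z_p, and by symmetry z_(1-c) = -z_c.
    return NormalDist().inv_cdf(confidence)


ev = required_sharpe(0.95)
print(f"required Sharpe for 95% confidence: {ev:.3f}")

# Sanity check: the Python analogue of NORM.INV(5%, EV, 1) should be ~0.
fifth_percentile = NormalDist(mu=ev, sigma=1.0).inv_cdf(0.05)
print(f"5th percentile return at that Sharpe: {fifth_percentile:.6f}")
```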

Remember that having a 1% edge on a single coin flip only has a Sharpe of .02.

But as we increase N, the Sharpe increases by √N :

SR of 1 trial × N/√N = SR_N   (the Sharpe after N trials; note N/√N = √N)
.02 × N/√N = 1.645
N = 6,764

If we flip the coin 6,764 times, we are 95% sure we will make money even though we have a tiny edge on a volatile bet.

Let’s recap in English what we did here:

  1. Computed the risk/reward or Sharpe for a single bet
  2. Figured out the risk/reward needed to be 95% certain we will make money on a series of bets
  3. Computed how many times we need to play to achieve that risk/reward

Let’s look at the relationship between the Sharpe of a single bet and how many trials we need to be 95% certain we win.

  • If we have .02 Sharpe per bet, we need to do 25 trades per day for a year to be 95% certain of making money.
  • If we have .10 Sharpe per bet, then 1 trade per day will help us realize the same risk/reward over the course of a year.

This table highlights another important point: by increasing the Sharpe per bet by an order of magnitude (ie from 1% to 10%) we cut the required number of trials by 2 orders of magnitude (27,055 to 271).

Think about that. The improvement in Sharpe leads to a quadratic reduction in trials needed to maintain the same risk/reward for the series of bets.

Inverting the logic:

If the risk/reward of your bet is halved, you need to bet 4x as many times for the strategy to maintain the same overall risk/reward.
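
The quadratic relationship is easy to tabulate. This is a sketch under the same Gaussian assumptions as the rest of this section; `trials_needed` is a hypothetical helper, and the list of per-bet Sharpe ratios is just a convenient set of inputs.

```python
from statistics import NormalDist


def trials_needed(sharpe_per_bet: float, confidence: float = 0.95) -> float:
    """Number of independent bets needed so that the chance of being
    profitable equals `confidence`, given the Sharpe of a single bet."""
    z = NormalDist().inv_cdf(confidence)       # about 1.645 for 95%
    # sharpe_per_bet * sqrt(N) = z  =>  N = (z / sharpe_per_bet) ** 2
    return (z / sharpe_per_bet) ** 2


print(f"{'Sharpe per bet':>15} {'trials needed':>15}")
for sr in (0.01, 0.02, 0.05, 0.10, 0.20):
    print(f"{sr:>15.2f} {trials_needed(sr):>15,.0f}")

# Halving the per-bet Sharpe quadruples the required number of bets.
ratio = trials_needed(0.4) / trials_needed(0.8)
print(f"trials at Sharpe .4 vs .8: {ratio:.1f}x")
```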

From Trading To Investing

The domain of many individual bets fits more under the umbrella of trading. For investing, we tend to think of the annual Sharpe ratios of investing styles or asset classes. Without looking this up, I’d guess that the SP500 has a long-term Sharpe ratio of about .40. I’m estimating an 8% annual return divided by 20% vol.

We can use the same math we did above to see how many years we’d need to invest to be 95% certain we did not lose money in nominal terms. Turns out the answer is 17 years. The table below finds the number of years for other combinations of expected return and volatility.

Years Required to Be 95% Sure of Profit
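
A grid like this can be generated with a few lines. This is a sketch, assuming expected return scales with T and vol with √T just as in the coin-flip math; the return and vol values in the grid are illustrative choices of mine, not the figures from the original table.

```python
from statistics import NormalDist


def years_needed(annual_return: float, annual_vol: float,
                 confidence: float = 0.95) -> float:
    """Years until the chance of a nominal profit reaches `confidence`,
    treating annual returns as independent and Gaussian (an assumption)."""
    z = NormalDist().inv_cdf(confidence)
    annual_sharpe = annual_return / annual_vol
    return (z / annual_sharpe) ** 2


print(f"8% return, 20% vol: {years_needed(0.08, 0.20):.1f} years")

returns = (0.04, 0.08, 0.12)   # illustrative expected returns
vols = (0.10, 0.20, 0.30)      # illustrative volatilities
print("return \\ vol " + "".join(f"{v:>8.0%}" for v in vols))
for r in returns:
    row = "".join(f"{years_needed(r, v):>8.1f}" for v in vols)
    print(f"{r:>12.0%} {row}")
```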

The Real World

Bell curves are great to build intuition but they are not reality. We can’t really be 95% sure we’ll make money by holding stocks for a generation because the historically sampled returns and volatilities are just that — sampled. We don’t know what the actual distributions are. Fat tails, skew, other moments I don’t even know about. 

We can use a highly skewed bet to demonstrate how volatility can distort our impression of risk. This renders the Sharpe ratio useless in highly skewed scenarios.

Consider 2 stocks, both fairly priced at $100. We’ll call them Balanced Corp and Skewed Corp.

Balanced Corp is 50% to go up or down $10.

Skewed Corp has a 90% chance of going up $3.33 and a 10% chance of dropping $30.

Using the bimodal distribution we find that the stocks have the same volatility. However, they would have different straddle prices if there were options listed on them.

(It’s a good exercise for the reader to use what we know about expected value to manually compute the call and put prices).

So here we have 2 stocks with the same true volatility but different straddle prices if we compute them via expected value. Of course, we would not use B-S for a stock that was discontinuous and was going to magically open at one of 2 prices in a year. But this does show how the effect of a strong skew would suppress the value of a straddle for a given level of volatility. 

This is actually more intuitive than it appears. FX carry is a highly skewed trade that might exhibit minimal vol on a daily basis. The volatility imputed by the straddle understates the risk because the straddle derives most of its value from the behavior of daily moves, whereas the risk of a jump is better reflected in the cost of OTM options. In the above case, the Balanced Corp 90 put is worthless while the 90 put on Skewed Corp is worth $2 (10% of the time it finishes $20 in-the-money).
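
The two-stock example can be verified by expected value alone. This is a minimal sketch, assuming zero interest rates (no discounting) and treating each stock as a list of (probability, price change) pairs; the helper names are mine.

```python
from math import sqrt

SPOT = 100.0  # both stocks are fairly priced at $100

# (probability, price change) pairs taken from the text
balanced = [(0.5, 10.0), (0.5, -10.0)]
skewed = [(0.9, 3.33), (0.1, -30.0)]


def volatility(outcomes):
    """Standard deviation of the price change, in dollars."""
    mean = sum(p * move for p, move in outcomes)
    variance = sum(p * (move - mean) ** 2 for p, move in outcomes)
    return sqrt(variance)


def option_value(outcomes, strike, kind):
    """Expected payoff at expiry; no discounting (an assumption)."""
    value = 0.0
    for p, move in outcomes:
        final_price = SPOT + move
        if kind == "call":
            payoff = max(final_price - strike, 0.0)
        else:
            payoff = max(strike - final_price, 0.0)
        value += p * payoff
    return value


def straddle(outcomes, strike=SPOT):
    """Call plus put at the same strike."""
    return option_value(outcomes, strike, "call") + option_value(outcomes, strike, "put")


for name, stock in (("Balanced Corp", balanced), ("Skewed Corp", skewed)):
    print(f"{name}: vol ${volatility(stock):.2f}, "
          f"100 straddle ${straddle(stock):.2f}, "
          f"90 put ${option_value(stock, 90.0, 'put'):.2f}")
```

Both stocks come out with roughly $10 of volatility, yet the Skewed Corp straddle is worth only about 60% of the Balanced Corp straddle, while its 90 put is the one with real value.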

So if you use straddle prices to impute volatilities which are then used to calibrate Sharpe ratios, you may be understating the risk of highly skewed assets. Your risk/reward ratio is actually overstated which means it will take far more trials to realize your edge, assuming you actually have any. And remember how diabolical the math is…if your Sharpe ratio is overstated by 2x (let’s say you think it’s .8 and it’s actually .4), then you need 4x the number of trades to maintain the same assumptions about making or losing money. How would you feel if you found out that the long run for your strategy wasn’t 10 years, but 40?

Takeaways About Edge

Self-aware investors and traders are always questioning their edge. Evaluating a track record or doing post-mortems on your own strategies requires being able to handicap the true distribution of your trades. The more Gaussian they look (for example if you play limit poker instead of no-limit) the easier it is to ascertain the strength of your edge statistically. You can tell the difference between a bad run and a change in the quality of your edge. Some runs would be almost impossible if your edge was real.

Edge is scarce. When we prospect for it, we should expect to mostly find fool’s gold. There are many reasons for this.

On skew

While both high volatility and high skew make it harder to determine if you have an edge statistically, skew is especially tricky. It is hard to see without liquid option surfaces. Here’s an intuitive way to see how skew distorts reality. Imagine finding a video poker machine that didn’t show its payoff table. Under the hood, it gives slightly worse payoffs on a pair of Jacks or better, but offers a billion to one on the Royal Flush. You could play that machine for days or even weeks and never realize you had massively positive EV.

On sample size

  • Having a small edge or a small number of trials makes it hard to verify an edge. Remember that when evaluating anyone trading highly volatile assets (ie crypto), engaging in highly skewed trades (carry, staking tokens for yield, option selling), or making a few concentrated bets per year (much of discretionary fundamental investing).

  • Remember the phrase “to think in N not T”. If there is a flow that shows up every day for a month do you have a sample of 30 or just 1 bit of behavior spread over 30 days? It’s the philosophical version of how auto-correlation artificially inflates N.

On luck vs skill

  • If you have negative edge, trade less. Short-term variance may turn up a friend named “Luck”. In the long run, she’s lost your number. 

  • In chess, a difference in Elo ratings can be used to handicap a match between 2 players. Chess has no element of randomness. The signal is extremely strong. Backgammon has randomness, so the predictive strength of the Elo spread increases with match length. This comment in a chess forum cements this:

    While Magnus Carlsen would stand virtually no chance against the top chess programs, the Elo rating difference between Extreme Gammon, (the best bot) and the top humans is more like 75 points, so XG would be something like a 2-1 favorite in a 25-point match against the top human player.

The importance of edge

  • When I was a market-maker we were always on the lookout for a new source of edge (perhaps a new name to trade or spotting a new flow to trade against). Edge is pure gold. Its scaling properties are amazing if it’s genuine. We were encouraged to not worry about risk if we could find a legit edge. The firm would find a way to hedge some portion of the risk if the edge was worthwhile, and you could always use sizing to manage the risk. Finding edges was top priority. It’s what you build businesses around.
  • A 1% edge in a stock or ETF is enormous. Imagine paying $49.50 for a stock that was trading “fair” at $50. This is an order of magnitude more edge than HFTs earn. Hold my beer now as we do options. If the fair price for a call or put is $.50 and the bid/ask is $.49-$.51, you are giving up 2% edge every time you hit or lift. Before fees! Option prices themselves are more volatile than the underlying stock so from the market-maker’s perspective the Sharpe of the trade might be pretty small (getting 2% edge on a security that might have a 100% vol for example). But think of the second-order effect…the optical tightness of the market and high volatility of option prices means it can take many trades before the option tourist realizes just how much the deck is stacked against them. For independent market-makers, like I was 10 years ago, the tight markets made our business worse because our risk and capital limits did not allow us to keep pace with the volume scaling required to make up for the smaller edge per trade. But the large market-makers welcomed the increased transparency and liquidity because they could leverage their infrastructure effectively.

  • If you make a 50/50 bet with a bookie but need to pay them 105 to 100 you are giving up 2.5% per bet (imagine you win one and lose one…you are down 5% after 2 bets). Now think of a vertical spread or risk reversal in the options market. Pay up a nickel on a $2 spread? Might as well have a bookie on speed dial.
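
The percentages in these bullets all come from the same division. This is a small sketch; the $2.05 fill on the $2 spread is my reading of "pay up a nickel", and the bookie edge is expressed per $100 unit as in the text.

```python
def edge_vs_fair(fair_price: float, traded_price: float) -> float:
    """Distance between the traded price and fair value, as a fraction of fair value."""
    return abs(traded_price - fair_price) / fair_price


stock_edge = edge_vs_fair(50.00, 49.50)    # buying a $50 stock for $49.50
option_edge = edge_vs_fair(0.50, 0.51)     # lifting a $.51 offer on a $.50 option
spread_edge = edge_vs_fair(2.00, 2.05)     # paying up a nickel on a $2 spread (assumed fill)

# Bookie: a 50/50 bet where you risk $105 to win $100.
win_amount, risk_amount = 100.0, 105.0
bookie_ev = 0.5 * win_amount - 0.5 * risk_amount   # expected dollars per bet
bookie_edge = -bookie_ev / win_amount              # per $100 unit

for label, value in (("stock", stock_edge), ("option", option_edge),
                     ("$2 spread", spread_edge), ("bookie", bookie_edge)):
    print(f"{label:>10}: {value:.1%}")
```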

Edge in the real world is nebulous

Firms with provable edges don’t try to raise money. If it’s provable it does not need more eyeballs on it. The epistemological status of edges that are trying to raise money is unknown. Many will never get the sample size to prove it. Asset management is the vitamin industry. It sells noise as signal. It sells placebos.  There will always be one edge that never goes out of style — marketing.

True mathematical edge is hard to find.


  • Nick Maggiulli’s Why You Shouldn’t Pick Individual Stocks: On The Existential Dilemma Of Stock Picking (Link)

  • Moontower Money Wiki: Time And Human Capital (Link)

Interview Questions A Market Maker Gave Me in 1999

SIG is well known for asking probability questions to filter trainees. This is not surprising. They view option theory as a pillar of decision-making in general. Thinking in probabilities takes practice which is why they like to look for talent amongst gamers who make many probabilistic decisions and need to interpret feedback in the context of uncertainty. They require many hours of poker during “class”. In this 3-month period, junior traders live and breathe options in lovely suburban Philly after apprenticing (“clerking”) on a trading desk for about a year.

Here are some of the questions I remember from my interviews in 1999.

  1. You roll a single die and will be paid $1 times the number that comes up. How much would you pay to play?
    • Suppose I let you take a mulligan on the roll. Now how much would you pay (you are pricing an option now btw)?
  2. My batting avg is higher than yours for the first half of the season. It’s also higher than yours for the second half of the season.

    Is it possible your avg for the full season is higher than mine?

    (Hint: Simpson’s paradox)

  3. You are in the middle of a game that you have a wager on. Your opponent offers to double the stakes; if you decline, you automatically lose (like the doubling cube in backgammon).

    What’s the min probability of winning you need to continue playing?

  4. You’re down by 2 with seconds left in a regulation basketball game and have a 50/50 chance of winning if the game goes to overtime. You have a 50% 2-pt shooter and a 33% 3-pt shooter.

    Who do you give the ball to?

    (simple EV question)

  5. You are given $1,000,000 for free but there’s a catch. You must put all of it into play on roulette.

    What do you do?

  6. There’s a 30% chance of raining Saturday. 30% chance of raining Sunday.

    What’s the probability it rains at least one day?

To encourage you to try before looking up the answers, I’ll make it annoying…the answers are somewhere in this thread.

I wrapped that thread with a short post on Trading And Aptitude (Link)