Commodity Futures Curve Modeling and Factor Trading

§ 01 · The Question

10 factors. 7 strategies. One answer.

3 factor families: structural curve (carry, slope, curvature, curve momentum from WRDS back-month contracts), behavioural (TSMOM, cross-sectional momentum, CFTC positioning), and fundamental (EIA inventory surprise, macro exposure, volatility regime). 7 portfolio constructions from cross-sectional ranking to convenience-yield regime timing to multi-layer term structure strategies. Proper IS/OOS splits, expanding-window z-scores, per-commodity roll costs. The risk premium dominates everything.

The setup: every commodity desk runs a carry signal. But after CTA capital piled into systematic commodity strategies, the real question is whether any of these factors still produce alpha net of costs. I built the system to answer that directly.

Carry OOS Sharpe: -0.15 · TSMOM OOS Sharpe: -0.23 · Factors tested: 10 · EW long OOS Sharpe: +0.35

§ 02 · Results

7 strategies, after costs.

Execution model: 13-year in-sample period (2005-2017), 7-year out-of-sample period (2018-2024). T+1 execution lag on all signals. Per-commodity roll costs calibrated from actual bid-ask spreads. Turnover-adjusted commissions. No survivorship bias, no lookahead.
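A minimal numpy sketch of how a T+1 lag and turnover-based costs fit together. The function name `backtest_t_plus_1` and the linear cost model (cost proportional to traded notional, one rate per asset as a stand-in for per-commodity roll costs) are assumptions for illustration, not the project's code.

```python
import numpy as np

def backtest_t_plus_1(signals, returns, cost_bps):
    """Net daily P&L with a one-day execution lag and linear trading costs.

    signals, returns: arrays of shape (n_days, n_assets). A signal formed from
    data up to the close of day t is held from day t+1, so it earns day t+1's
    return and never day t's.
    cost_bps: cost per unit of traded notional in basis points; a scalar or
    one value per asset (hypothetical stand-in for per-commodity roll costs).
    """
    signals = np.asarray(signals, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # Shift positions forward one day; nothing is held on the first day.
    positions = np.zeros_like(signals)
    positions[1:] = signals[:-1]
    gross = np.sum(positions * returns, axis=1)
    # Traded notional = absolute change in position, including the initial build.
    trades = np.abs(np.diff(positions, axis=0, prepend=0.0))
    costs = np.sum(trades * np.asarray(cost_bps, dtype=float) / 1e4, axis=1)
    return gross - costs, trades.sum(axis=1)
```

The lag is applied to positions rather than to returns so that the cost of a trade is charged on the day the position actually changes.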


Drawdown from peak, full sample. More complexity and more turnover go with worse drawdowns. The passive EW long portfolio recovers faster than every active strategy.

Strategy	OOS CAGR	OOS Vol	OOS Sharpe	OOS Max DD	Turnover
EW Long	+5.3%	14.7%	+0.35	-29.6%	-
TSMOM	-2.4%	10.8%	-0.23	-30.9%	0.04
TSI	-3.0%	10.2%	-0.30	-33.0%	0.09
Calendar Spread	-0.8%	10.7%	-0.07	-25.2%	0.06
XS Carry	-1.6%	10.6%	-0.15	-39.0%	0.06
Multi-Factor EW	-6.5%	10.7%	-0.63	-38.7%	0.14
Sector Neutral	-7.4%	10.7%	-0.72	-48.8%	0.18
SPY	+14.2%	19.7%	+0.67	-31.8%	-
AGG	-1.5%	6.0%	-0.25	-22.9%	-

§ 03 · Signal Contamination

I found a contamination mechanism in daily curve factors.

All 4 curve factors (carry, slope, curvature, curve momentum) show a lag-0 information coefficient (IC) 3-4x higher than at lag 1. The interpolated curve is computed from the same prices you're trying to predict, so the signal is mechanically correlated with contemporaneous returns. Once signals are lagged to T+1 (which any real execution requires), the predictive power collapses to noise. This can affect daily-frequency carry backtests that do not explicitly test for it.

Diagnostic finding: this effect is easy to miss in monthly commodity factor studies (Gorton-Rouwenhorst, Erb-Harvey, Szymanowska), where monthly rebalancing masks the daily lag issue. I built an IC decay diagnostic across lags [0, 1, 5, 10, 20] to test for it systematically.

IC at lag 0 vs lag 1 across all 10 factors. The 4 curve factors (left) show 3-4x decay. Momentum factors (center) retain some signal at lag 1. Fundamental factors (right) are flat. This is the key diagnostic.

Carry lag-0 IC: 0.12 · Carry lag-1 IC: 0.035 · Decay ratio: 3.4x · Curve factors affected: 4 / 4
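The IC decay diagnostic can be sketched as a Spearman rank correlation between the signal on day t and the return on day t+lag. This is an illustrative single-series version on synthetic data (the project may compute cross-sectional ICs instead); the signal deliberately contains part of the same-day return, which mimics the contamination mechanism described above.

```python
import numpy as np

def rank(x):
    """Ordinal ranks 0..n-1 (fine for continuous data; ties are not averaged)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def rank_ic(signal, returns, lag):
    """Spearman correlation between the signal on day t and the return on day t+lag."""
    signal = np.asarray(signal, dtype=float)
    returns = np.asarray(returns, dtype=float)
    s = signal[: len(signal) - lag]
    r = returns[lag:]
    return float(np.corrcoef(rank(s), rank(r))[0, 1])

def ic_decay(signal, returns, lags=(0, 1, 5, 10, 20)):
    return {lag: rank_ic(signal, returns, lag) for lag in lags}

# Synthetic demo: a signal that partly contains same-day returns, as happens when
# a curve factor and the return are both computed from today's settlement prices.
rng = np.random.default_rng(0)
daily_ret = rng.normal(0.0, 0.02, 5000)
contaminated = 0.5 * daily_ret + rng.normal(0.0, 0.02, 5000)
decay = ic_decay(contaminated, daily_ret)
```

On this synthetic series the lag-0 IC is large while every lagged IC sits near zero: the same lag-0 versus lag-1 pattern the diagnostic looks for.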

§ 04 · Curve Structure

Full term structure model across 19 markets.

Log-linear interpolation of WRDS back-month contracts to standardised tenors (F1M through F12M) with per-commodity roll calendars and 45-day extrapolation limits. From there I extract convenience yield via cost-of-carry inversion (risk-free rate from FRED, storage costs calibrated per-commodity from in-sample contango depth) and classify each market into 5 regimes. The curve builder processes all 19 commodities in under 30 seconds.
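A minimal sketch of log-linear interpolation to fixed tenors with a 45-day extrapolation limit, for a single date's strip of contracts. The calendar-day targets in `TENOR_DAYS` are assumptions, and this sketch simply returns NaN when a needed price is non-positive rather than reproducing the project's negative-price handling.

```python
import numpy as np

# Assumed calendar-day targets for the tenors named in the text.
TENOR_DAYS = {"F1M": 30, "F2M": 61, "F3M": 91, "F6M": 182, "F9M": 273, "F12M": 365}

def interpolate_tenors(days_to_expiry, prices, tenor_days=TENOR_DAYS, max_extrap_days=45):
    """Log-linear interpolation of one day's futures strip to fixed tenors.

    Tenors more than max_extrap_days outside the listed contracts are left NaN.
    Non-positive prices (e.g. WTI in April 2020) are not handled: the tenor is NaN.
    """
    t = np.asarray(days_to_expiry, dtype=float)
    p = np.asarray(prices, dtype=float)
    order = np.argsort(t)
    t, p = t[order], p[order]
    out = {}
    for name, target in tenor_days.items():
        if len(t) < 2 or target < t[0] - max_extrap_days or target > t[-1] + max_extrap_days:
            out[name] = np.nan
            continue
        # Bracketing pair of contracts; the end pair is reused when extrapolating.
        j = int(np.clip(np.searchsorted(t, target), 1, len(t) - 1))
        t0, t1 = t[j - 1], t[j]
        p0, p1 = p[j - 1], p[j]
        if p0 <= 0 or p1 <= 0:
            out[name] = np.nan
            continue
        w = (target - t0) / (t1 - t0)  # w < 0 or w > 1 means extrapolation
        out[name] = float(np.exp((1 - w) * np.log(p0) + w * np.log(p1)))
    return out
```

Interpolating in log price means a curve with constant annualised carry is reproduced exactly between contracts.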

Theory of storage: convenience yield reflects physical scarcity. When inventories are low, the spot premium rises and the curve inverts into backwardation. This regime mapping is how physical commodity trading desks typically frame directional views.
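The cost-of-carry inversion can be written as F(τ) = S·exp((r + u − y)·τ), where r is the risk-free rate, u the storage cost and y the convenience yield. A sketch under two assumptions: the spot S is cancelled by using the ratio of two tenors, and the 5 regimes are bucketed on the excess convenience yield y − (r + u) with placeholder thresholds (the project's regime cut-offs are not published here).

```python
import numpy as np

REGIMES = ("deep_contango", "contango", "flat", "backwardation", "deep_backwardation")

def convenience_yield(f_near, f_far, tau_near, tau_far, r, storage):
    """Annualised convenience yield implied by cost of carry between two tenors.

    Cost of carry: F(tau) = S * exp((r + u - y) * tau). The ratio of two tenors
    cancels the unobserved spot S, giving
        y = r + u - ln(F_far / F_near) / (tau_far - tau_near)
    r (risk-free) and u (storage) are annualised rates; tau is in years.
    """
    return r + storage - np.log(f_far / f_near) / (tau_far - tau_near)

def classify_regime(cy, r, storage, thresholds=(-0.10, -0.02, 0.02, 0.10)):
    """Bucket a market into 5 curve regimes from its excess convenience yield.

    y - (r + u) equals minus the annualised log slope of the curve, so it is
    positive in backwardation and negative in contango. Thresholds are placeholders.
    """
    excess = cy - (r + storage)
    return REGIMES[int(np.searchsorted(thresholds, excess, side="right"))]
```

Classifying on y − (r + u) rather than on y alone keeps the label consistent with the sign of the curve slope when rates or storage costs change.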


Regime classification across all 19 commodities over time. Deep backwardation clusters during supply crises: 2008 energy, 2022 grains.

§ 05 · TSI Strategy

Term Structure Intelligence: 3 layers, 1 thesis.

A multi-layer strategy architecture built around convenience yield dynamics. Layer 1: directional regime tilt with a TSMOM trend gate (40% risk budget, monthly rebalance). Layer 2: curve transition momentum with a confirmation filter (25%, weekly). Layer 3: structural spreads, including a CY crack spread, an EIA inventory overlay, and a deseasonalised livestock spread (35%, biweekly). The layers are combined with risk-budget weights and vol-targeted using a Ledoit-Wolf shrinkage covariance. Every signal is grounded in physical commodity microstructure, not data-mined patterns.
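One way to sketch the combination step, assuming each layer is available as a daily return series of a unit-sized book. The covariance is the Ledoit-Wolf (2004) shrinkage toward a scaled identity (the same target scikit-learn's `LedoitWolf` uses), written in numpy here. The budgets are applied as shares of standalone volatility, which is a simplification of true risk budgeting, and the 10% vol target is a placeholder.

```python
import numpy as np

def ledoit_wolf_cov(x):
    """Ledoit-Wolf (2004) shrinkage of the sample covariance toward a scaled identity."""
    x = x - x.mean(axis=0)
    n, p = x.shape
    s = x.T @ x / n
    mu = np.trace(s) / p
    target = mu * np.eye(p)
    delta = np.sum((s - target) ** 2) / p
    # Average squared distance of each observation's outer product from s.
    beta_bar = sum(np.sum((np.outer(row, row) - s) ** 2) for row in x) / (n**2 * p)
    shrink = 0.0 if delta == 0 else min(beta_bar, delta) / delta
    return shrink * target + (1.0 - shrink) * s

def combine_layers(layer_returns, budgets=(0.40, 0.25, 0.35), target_vol=0.10, periods=252):
    """Layer weights: budget-scaled inverse vol, then levered to an annual vol target."""
    cov = ledoit_wolf_cov(np.asarray(layer_returns, dtype=float))
    vols = np.sqrt(np.diag(cov))
    raw = np.asarray(budgets, dtype=float) / vols  # each layer's vol share equals its budget
    port_vol = np.sqrt(raw @ cov @ raw * periods)  # ex-ante annualised portfolio vol
    return raw * target_vol / port_vol
```

A full risk-parity solve on marginal risk contributions would give different weights when the layers are correlated; the inverse-vol form is used here only because it is short and transparent.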

CY Estimation → Regime Classification → Directional → Transition → Spreads → Vol Target

Per-layer cumulative returns. The directional layer drives in-sample gains. Transition and spread layers diversify but add no persistent OOS alpha.

TSI IS Sharpe: +0.39 · TSI OOS Sharpe: -0.30 · IS-to-OOS gap: -0.69 · Bootstrap 95% CI includes zero: Yes

The IS/OOS gap is -0.69 Sharpe, and that is the important result: even when the signals are built the way a physical commodity trader thinks about the market (convenience yield for positioning, regime transitions for timing, structural spreads for relative value), they did not hold up in the 2018-2024 out-of-sample period. The system runs end to end, but the evidence for an edge weakens out of sample.

§ 06 · What Works

The risk premium is the trade.

Equal-weight long across 19 markets. OOS Sharpe +0.35 with zero free parameters. The commodity risk premium comes from a structural source: commercial hedgers (producers, refiners, airlines) systematically pay speculators to take the other side of their price risk. That transfer has been happening since the 1930s and it's not going away. Unlike factor signals, it doesn't degrade when more capital chases it because the hedging demand is driven by real economic activity, not by quant alpha.

Zero parameters: no lookback windows, no thresholds, no regime classification, no estimation error, no overfitting surface. The EW long portfolio is the correct null hypothesis for commodity factor research. After testing 10 factors and 7 strategies, none of the tested active variants beat it OOS.

252-day rolling Sharpe across the IS/OOS boundary. EW long stays consistently positive through multiple regimes. Active factor strategies oscillate around zero with no persistent edge.
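A minimal sketch of the EW long portfolio and the 252-day rolling Sharpe. It assumes daily rebalancing to equal weights across whichever markets have data that day (the page does not state the rebalance frequency) and treats futures returns as excess returns, so no risk-free rate is subtracted.

```python
import numpy as np

def equal_weight_returns(asset_returns):
    """Equal-weight long across all markets with data that day (daily rebalance assumed)."""
    return np.nanmean(np.asarray(asset_returns, dtype=float), axis=1)

def rolling_sharpe(returns, window=252, periods=252):
    """Trailing annualised Sharpe; futures returns treated as excess returns."""
    r = np.asarray(returns, dtype=float)
    out = np.full(len(r), np.nan)
    for end in range(window, len(r) + 1):
        w = r[end - window : end]
        sd = w.std(ddof=1)
        if sd > 0:
            out[end - 1] = w.mean() / sd * np.sqrt(periods)
    return out
```

The only choices here are the window and the annualisation factor, which is the sense in which the EW long portfolio has no free signal parameters.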

10,000-sample block bootstrap, 95% CI. Only EW long excludes zero. No active strategy is statistically significant.
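A sketch of a block-bootstrap confidence interval for the annualised Sharpe ratio. The circular moving-block scheme and the 21-day block length are assumptions; the page states only the 10,000 resamples and the 95% level. A strategy "excludes zero" when the lower bound is above 0.

```python
import numpy as np

def sharpe(r, periods=252):
    sd = r.std(ddof=1)
    return np.nan if sd == 0 else r.mean() / sd * np.sqrt(periods)

def block_bootstrap_sharpe_ci(returns, n_boot=10_000, block=21, alpha=0.05, seed=0):
    """Percentile CI for annualised Sharpe from a circular moving-block bootstrap.

    Resampling whole blocks of consecutive days keeps short-range dependence
    (volatility clustering, trends) that an i.i.d. bootstrap would destroy.
    """
    r = np.asarray(returns, dtype=float)
    n = len(r)
    rng = np.random.default_rng(seed)
    n_blocks = -(-n // block)  # ceil(n / block)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        idx = (starts[:, None] + np.arange(block)) % n  # wrap past the end of the sample
        stats[b] = sharpe(r[idx.ravel()[:n]])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```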

§ 07 · Cost Sensitivity

How much friction can it take?

Commodity futures execution typically costs 2 to 5 bps.

Transaction cost 5 bps: Sharpe 0.35 · CAGR 5.3% · Max DD -30% · Breakeven 42 bps
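A sketch of one common way to compute cost sensitivity and a breakeven cost under a linear cost model: costs are a fixed drag of turnover × cost on the mean return, and breakeven is the cost at which the net mean return reaches zero. The definition behind the 42 bps figure is not stated on this page, so this convention is an assumption, and the inputs below are hypothetical, not the project's numbers.

```python
def net_sharpe(gross_ann_return, ann_vol, ann_turnover, cost_bps):
    """Annual Sharpe after linear costs, treating costs as a fixed drag on the mean."""
    return (gross_ann_return - ann_turnover * cost_bps / 1e4) / ann_vol

def breakeven_cost_bps(gross_ann_return, ann_turnover):
    """Cost per unit of traded notional (bps) at which the net mean return hits zero."""
    return gross_ann_return / ann_turnover * 1e4

# Hypothetical inputs: 6% gross return, 15% vol, 1.5x annual turnover.
for cost in (2, 5, 50):
    print(cost, round(net_sharpe(0.06, 0.15, 1.5, cost), 3))
print("breakeven bps:", breakeven_cost_bps(0.06, 1.5))
```

Because cost enters linearly, low-turnover strategies such as a passive long book have a high breakeven, while the same cost hits high-turnover strategies proportionally harder.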

§ 08 · Stress Test

Stress tested across 4 major crises.

GFC 2008 (commodities crashed 60%). Oil glut 2014-2016 (WTI from $105 to $26). COVID, March-April 2020 (front-month WTI settled negative on 20 April, for the first time in history). Russia-Ukraine energy spike 2022 (nat gas +300%, grain +80%). The EW long portfolio survived all 4.

Strategy performance during each crisis window. The diversified long portfolio recovers faster than any single-factor strategy because sector diversification absorbs idiosyncratic shocks.
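A sketch of per-window crisis statistics: compounded total return and maximum drawdown of a daily return series inside each window. The window dates in `CRISES` are approximate assumptions, not the project's exact definitions.

```python
import numpy as np

# Approximate crisis windows; these dates are assumptions, not the project's definitions.
CRISES = {
    "GFC 2008": ("2008-07-01", "2009-02-27"),
    "Oil glut": ("2014-06-20", "2016-02-11"),
    "COVID": ("2020-02-20", "2020-04-30"),
    "Russia-Ukraine": ("2022-02-24", "2022-08-31"),
}

def max_drawdown(returns):
    """Worst peak-to-trough loss of compounded wealth, starting from 1.0."""
    wealth = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    peak = np.maximum.accumulate(np.concatenate(([1.0], wealth)))[1:]
    return float(np.min(wealth / peak - 1.0))

def crisis_stats(dates, returns, windows=CRISES):
    """Total return and max drawdown of a daily return series inside each window."""
    dates = np.asarray(dates, dtype="datetime64[D]")
    r = np.asarray(returns, dtype=float)
    out = {}
    for name, (start, end) in windows.items():
        mask = (dates >= np.datetime64(start)) & (dates <= np.datetime64(end))
        if mask.any():  # skip windows with no data
            w = r[mask]
            out[name] = {"total_return": float(np.prod(1.0 + w) - 1.0),
                         "max_drawdown": max_drawdown(w)}
    return out
```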

Annualised return contribution by sector. Energy is the largest driver of both gains and losses. Precious metals partially hedge equity drawdowns. Agriculture and livestock provide structural diversification.

§ 09 · Architecture

End to end, from data to evaluation.

WRDS Ingest → Curve Builder → Factor Engine → Signal Gen → Backtest → Evaluation

Data · WRDS Datastream institutional access (2.4M contract rows, 3,000+ individual futures contracts) · yfinance front-month + benchmarks · EIA weekly petroleum + gas inventories · FRED rates/inflation/USD · CFTC disaggregated COT positioning
Universe · 19 commodities across 5 sectors: CL, NG, RB, HO (energy) · GC, SI, HG, PL, PA (metals) · ZC, ZS, ZW (grains) · KC, SB, CC (softs) · LC, LH, FC, LB (livestock and lumber)
Curve engine · Log-linear interpolation to F1M/F2M/F3M/F6M/F9M/F12M with per-commodity roll calendars, 45-day extrapolation limit, negative price handling (WTI April 2020). Full universe builds in 28 seconds
Factor engine · 10 factors across 3 families (structural curve, flow/behavioural, fundamental/macro). Expanding-window z-scores throughout, no lookahead. Sanity-checked against known market events (WTI -$37.63, gold $1900 Aug 2011)
Infra · 438 unit tests, YAML config-driven pipeline, Parquet storage, `make all` reproduces everything from raw data to charts. CI-ready
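An expanding-window z-score can be sketched as follows. This version standardises each value using only observations strictly before it, which is one conservative reading of "no lookahead"; the `min_periods` warm-up of 252 days is an assumption, and NaN handling in the input is omitted for brevity.

```python
import numpy as np

def expanding_zscore(x, min_periods=252):
    """Standardise each value using only observations strictly before it."""
    x = np.asarray(x, dtype=float)
    z = np.full(len(x), np.nan)
    for t in range(max(min_periods, 2), len(x)):
        hist = x[:t]  # data available before day t
        sd = hist.std(ddof=1)
        if sd > 0:
            z[t] = (x[t] - hist.mean()) / sd
    return z
```

Changing a later observation never changes an earlier z-score, which is the property a lookahead test should check.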

Factor computation loop

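# Assumes config, wrds_data, rates, thresholds, front_month and returns are loaded upstream.
factors, signals = {}, {}

for commodity in config["commodities"]:
    curve  = build_curve(wrds_data[commodity])          # constant-maturity tenors F1M..F12M
    cy     = estimate_convenience_yield(curve, rates)   # cost-of-carry inversion
    regime = classify_regime(cy, thresholds)            # one of the 5 curve regimes

    # 6 of the 10 factors; positioning and fundamental factors are omitted here.
    factors[commodity] = {
        "carry":     compute_carry(curve),
        "slope":     compute_slope(curve),
        "curvature": compute_curvature(curve),
        "curve_mom": compute_curve_momentum(curve),
        "tsmom":     compute_tsmom(front_month[commodity]),
        "xsmom":     compute_xsmom(returns, commodity),  # ranked against the other markets
    }
    signals[commodity] = combine_factors(factors[commodity])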
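The page does not give the formulas behind the factor functions called in the loop. The definitions below are hypothetical, textbook-style stand-ins with matching signatures, assuming `curve` is a dict of equal-length numpy arrays keyed by tenor: carry as the annualised F1M/F2M roll yield, slope as the annualised F1M/F12M log slope, curvature as a log butterfly, curve momentum as a 21-day change in slope, and TSMOM as the sign of the trailing 252-day return. Signs are chosen so that backwardation reads as positive.

```python
import numpy as np

def compute_carry(curve):
    # Annualised roll yield between the first two monthly tenors.
    return np.log(curve["F1M"] / curve["F2M"]) * 12.0

def compute_slope(curve):
    # Annualised log slope from 1 to 12 months (11 months apart).
    return np.log(curve["F1M"] / curve["F12M"]) * 12.0 / 11.0

def compute_curvature(curve):
    # Log butterfly: mid tenor relative to the two wings.
    return 2 * np.log(curve["F6M"]) - np.log(curve["F1M"]) - np.log(curve["F12M"])

def compute_curve_momentum(curve, lookback=21):
    # Change in slope over the lookback; NaN until enough history exists.
    slope = compute_slope(curve)
    out = np.full_like(slope, np.nan)
    out[lookback:] = slope[lookback:] - slope[:-lookback]
    return out

def compute_tsmom(prices, lookback=252):
    # Sign of the trailing log return of the rolled front-month series.
    p = np.asarray(prices, dtype=float)
    out = np.full(len(p), np.nan)
    out[lookback:] = np.sign(np.log(p[lookback:] / p[:-lookback]))
    return out
```

Carry and slope both read the front of the same curve, which is consistent with the high carry/slope correlation reported in the methodology notes.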

§ 10 · Methodology

Methodology notes.

Strict IS/OOS discipline on every strategy variant. 10,000-sample block bootstrap for all Sharpe CIs. Signal contamination identified and quantified across all factor families. 19 markets with institutional data going back to 2003. 438 unit tests covering every module from data ingestion through evaluation. The entire pipeline reproduces from a single `make all` command.

IS vs OOS Sharpe for every strategy. Only EW long (zero parameters) improves out of sample. Everything else degrades.

Pairwise correlations across all 10 factors. Carry and slope are 0.75 correlated (both read the same curve shape). Momentum factors are nearly orthogonal to curve factors, which gives you diversification but not alpha.

Monthly returns for EW long. No seasonal clustering. The returns come from the risk premium, not from timing anything.