Give an LLM agent a few years of prices and it posts a great backtest. Move the test window to *after* its training cutoff and the edge mostly evaporates. That gap undercuts a lot of hyped blog posts and is the subject of several research papers: the alpha came from memorization, not from trading skill.
**Profit Mirage** makes the cleanest case. On windows held out past the cutoff and matched for market return, performance collapses: Sharpe down 51–62%, total return down 50–72%.[^1] The cause is **data leakage**: when the backtest window sits inside the training data, the model is recalling outcomes rather than predicting them. It's also model-agnostic. Bigger models memorize more, so the frontier is more exposed, not less.
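To make the Sharpe comparison concrete, here is a minimal sketch of the arithmetic. It is not Profit Mirage's code. It assumes you have per-day simple returns tagged with dates; `sharpe`, `split_at_cutoff`, and `relative_decline` are hypothetical helper names, the 252-day annualization and zero risk-free rate are my assumptions, and it skips the paper's matching for market return.

```python
import math
from datetime import date
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def sharpe(returns: Sequence[float], periods_per_year: int = 252,
           risk_free: float = 0.0) -> float:
    """Annualized Sharpe ratio from per-period simple returns.

    risk_free is an annual rate, spread evenly across periods (an assumption).
    """
    excess = [r - risk_free / periods_per_year for r in returns]
    sd = stdev(excess)  # sample standard deviation; needs >= 2 points
    if sd == 0:
        raise ValueError("returns have zero volatility; Sharpe is undefined")
    return mean(excess) / sd * math.sqrt(periods_per_year)


def split_at_cutoff(dated_returns: Sequence[Tuple[date, float]],
                    cutoff: date) -> Tuple[List[float], List[float]]:
    """Split (date, return) pairs into in-cutoff and post-cutoff returns."""
    before = [r for d, r in dated_returns if d <= cutoff]
    after = [r for d, r in dated_returns if d > cutoff]
    return before, after


def relative_decline(in_sample: float, held_out: float) -> float:
    """Fractional drop from the in-sample metric to the held-out one."""
    return (in_sample - held_out) / abs(in_sample)
```

A Sharpe that falls from 2.0 inside the cutoff to 0.8 after it gives `relative_decline(2.0, 0.8)` of about 0.6, a 60% drop, which is inside the range the paper reports.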
The obvious objection, *those were weak models*, doesn't rescue the bullish case. **StockBench** ran clean, multi-month tests on state-of-the-art models; most still lost to buy-and-hold, and being better at financial Q&A didn't predict being better at trading.[^2] **FINSABER** finds the same over 2004–2024 with survivorship and look-ahead controls, though I weight it less: its agents run on the cheap `gpt-4o-mini`.[^3]
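Buy-and-hold is a cheap baseline to compute, which is part of why losing to it is damning. A rough sketch, assuming a list of per-period strategy returns and a list of benchmark closing prices over the same window; the function names are hypothetical, and dividends, fees, and slippage are ignored.

```python
from typing import Sequence


def total_return(returns: Sequence[float]) -> float:
    """Compound per-period simple returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def buy_and_hold_return(prices: Sequence[float]) -> float:
    """Return from buying at the first price and selling at the last."""
    if len(prices) < 2 or prices[0] <= 0:
        raise ValueError("need at least two prices and a positive first price")
    return prices[-1] / prices[0] - 1.0


def excess_over_buy_and_hold(strategy_returns: Sequence[float],
                             benchmark_prices: Sequence[float]) -> float:
    """Strategy total return minus the benchmark's buy-and-hold return."""
    return total_return(strategy_returns) - buy_and_hold_return(benchmark_prices)
```

A negative `excess_over_buy_and_hold` means the agent would have done better by buying the benchmark and doing nothing. A positive number on one window still says nothing about significance, which is what FINSABER's p-values address.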
The bullish headlines are the mirror image — short, in-sample, bull-market runs, rarely measured against SPY. The most-cited hobbyist version gave five LLMs $100K each. The "winner" was whoever held the most tech during a tech rally; the authors flag the leakage and overfitting themselves.[^4]
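One way to check whether a "winner" is just tech exposure is to regress its returns on a tech benchmark's returns. The sketch below is a plain one-factor least-squares fit, not the arena authors' analysis; `beta_and_alpha` is a hypothetical name, and the choice of tech benchmark is left to you.

```python
from statistics import mean
from typing import Sequence, Tuple


def beta_and_alpha(portfolio_returns: Sequence[float],
                   factor_returns: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope (beta) and intercept (per-period alpha)
    of portfolio returns regressed on a single factor's returns."""
    if len(portfolio_returns) != len(factor_returns):
        raise ValueError("return series must have the same length")
    mp, mf = mean(portfolio_returns), mean(factor_returns)
    cov = sum((p - mp) * (f - mf)
              for p, f in zip(portfolio_returns, factor_returns))
    var = sum((f - mf) ** 2 for f in factor_returns)
    if var == 0:
        raise ValueError("factor returns are constant; beta is undefined")
    beta = cov / var
    alpha = mp - beta * mf
    return beta, alpha
```

A high beta with an alpha near zero is the pattern you'd expect if the result is mostly the rally and not the model.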
---
*Footnotes*
[^1]: Profit Mirage (arXiv 2510.07920) — on out-of-cutoff windows matched for market return, reports Sharpe declines of 51–62% and total-return declines of 50–72%, read as memorization rather than skill.
[^2]: StockBench (arXiv 2510.02209) — clean multi-month trading tests on frontier models; most fail to beat buy-and-hold, and financial-Q&A ability doesn't predict trading performance.
[^3]: FINSABER (arXiv 2505.07078) — 2004–2024, 100+ symbols, survivorship and look-ahead controls; no statistically significant alpha over buy-and-hold (every p > 0.34). Agents run on gpt-4o-mini, so it speaks mainly to weak models.
[^4]: AI Trade Arena, "we ran LLMs for 8 months" — five LLMs, $100K each; the result tracked tech-factor exposure in a bull market, and the authors disclaim it as not statistically significant.

*Research*
- [Survey: LLM agents for financial trading (arXiv 2408.06361)](https://arxiv.org/html/2408.06361v1) — catalogs the field; most positive results are short, in-sample, and un-benchmarked.
- [TradingAgents (arXiv 2412.20138)](https://arxiv.org/abs/2412.20138) — multi-agent bull/bear debate; the useful mode is stress-testing a thesis, not predicting prices.
- [GuruAgents (arXiv 2510.01664)](https://arxiv.org/html/2510.01664v1) — encodes an investor persona as a prompt; supports thesis-as-prompt, not trading alpha.