Blog / March 20, 2026

An AI That Discovers Its Own Trading Factors. Does It Actually Work?

A recent paper from HKUST proposes an autonomous AI system that generates, tests, and refines its own equity factors without human guidance. The reported Sharpe ratio of 3.11 is exceptional, but the details raise questions any systematic investor should think through before getting excited.

The standard workflow for quantitative factor research has not changed much in decades. A human researcher forms a hypothesis, builds a signal, backtests it, checks the statistics, and decides whether it is worth pursuing. Machine learning accelerated parts of this process, but the human still drives. A new paper by Huang and Fan (2026) asks what happens when you take the human out of the loop almost entirely and let an AI agent run the full cycle on its own.

The answer, according to their results, is a long-short portfolio with a 3.11 annualized Sharpe ratio and 59.53% annualized returns over a four-year out-of-sample window. Those are numbers that deserve scrutiny, not just attention.

What the system actually does

The framework uses a large language model as a self-directed quant researcher. In each iteration, the agent proposes new factor hypotheses expressed as mathematical formulas over raw price and volume data, executes them as code, evaluates the results against a fixed statistical protocol, and then updates its search strategy based on what worked and what failed. The authors call this a "closed-loop" design, and the loop has five modules: hypothesis generation, deterministic execution, unified evaluation, gatekeeping, and memory update.
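The loop is easier to see as control flow. The toy sketch below renders the five modules with random numbers standing in for real backtest statistics; apart from the module names and the t > 3.0 hurdle, everything here (the Sharpe bar, the hold rule, the function bodies) is an illustrative assumption, not the authors' implementation.

```python
import random

T_STAT_MIN = 3.0   # IC t-statistic hurdle (Harvey et al., 2016)
SHARPE_MIN = 1.0   # assumed economic bar; the paper's exact value may differ

def generate_hypothesis(memory):
    # 1. Hypothesis generation: in the paper an LLM conditions on the full
    #    history of past attempts; here, just a placeholder formula id.
    return f"factor_{len(memory)}"

def execute_and_evaluate(hypothesis):
    # 2-3. Deterministic execution + unified evaluation: stand-in statistics
    #      in place of a real backtest.
    return {"t_stat": random.uniform(0, 5), "sharpe": random.uniform(-1, 3)}

def gatekeep(stats):
    # 4. Rule-based gatekeeping: promote, hold borderline cases, retire failures.
    if stats["t_stat"] > T_STAT_MIN and stats["sharpe"] > SHARPE_MIN:
        return "promote"
    if stats["t_stat"] > 2.0:   # assumed borderline band
        return "hold"
    return "retire"

def run_loop(n_iterations, seed=0):
    random.seed(seed)
    memory, library = [], []
    for _ in range(n_iterations):
        h = generate_hypothesis(memory)
        stats = execute_and_evaluate(h)
        verdict = gatekeep(stats)
        if verdict == "promote":
            library.append(h)
        memory.append((h, stats, verdict))  # 5. memory update feeds the next round
    return library, memory
```

The point of the structure is that step 5 closes the loop: the next hypothesis is conditioned on every prior attempt, which is what distinguishes this from a one-shot factor screen.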

The key constraint is that the agent works from a deliberately small set of raw inputs: stock returns, prices, trading volume, volatility, and market-level return series. No fundamental data, no alternative data, no text. Factor expressions are built from transparent operators like moving averages, lags, and cross-sectional ranks. Expression depth is bounded to keep formulas readable.
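To make the operator vocabulary concrete, here is a minimal sketch of the three operator types named above over a days-by-stocks array, composed into a bounded-depth expression in the spirit of the paper's "flow shock" factor. The specific composition is an invented example, not one of the paper's twelve factors.

```python
import numpy as np

def lag(x, k):
    # Shift each stock's series back k days (rows = days, cols = stocks).
    out = np.full_like(x, np.nan, dtype=float)
    out[k:] = x[:-k]
    return out

def moving_average(x, window):
    # Trailing mean over `window` days for each stock.
    out = np.full_like(x, np.nan, dtype=float)
    for t in range(window - 1, x.shape[0]):
        out[t] = x[t - window + 1 : t + 1].mean(axis=0)
    return out

def cs_rank(x):
    # Cross-sectional rank on each day, scaled to [0, 1].
    order = x.argsort(axis=1).argsort(axis=1)
    return order / (x.shape[1] - 1)

def flow_shock(volume):
    # Example bounded-depth expression: rank of one-day volume growth.
    growth = volume / lag(volume, 1) - 1.0
    return cs_rank(growth)
```

Keeping the search inside a small, readable operator set like this is what makes the resulting formulas auditable, at the cost of ruling out anything the operators cannot express.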

Gatekeeping is rule-based. A factor gets promoted only if its information coefficient t-statistic exceeds a threshold and its long-short Sharpe ratio clears a minimum economic bar. Failures are retired; borderline cases are held. The next round of hypothesis generation is conditioned on the full history of past attempts, which in principle lets the system avoid redundant trials and explore diverse corners of the signal space.

All factor discovery and agent learning use data through December 2020. The out-of-sample window runs from January 2021 to December 2024, and the primary reporting window starts in January 2023.

The reported performance

Twelve individual factors survive the screening process. All twelve produce positive long-short returns with Sharpe ratios above 1.3 in the out-of-sample period. The best individual performers (Factors 6 and 10) show annualized returns above 14% and information coefficients around 0.027, which is high for daily cross-sectional signals.
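Since the information coefficient does the screening work, it is worth being concrete about what it measures. Below is a generic sketch of the daily rank-based (Spearman-style) cross-sectional IC and a plain t-statistic on its mean; the arrays are synthetic, and this is a textbook estimator, not the authors' code or their exact specification (which may, for instance, correct for autocorrelation).

```python
import numpy as np

def daily_ic(factor, fwd_returns):
    # Rank both panels cross-sectionally (rows = days, cols = stocks),
    # then compute the per-day correlation of the ranks.
    def cs_rank(x):
        return x.argsort(axis=1).argsort(axis=1).astype(float)
    f, r = cs_rank(factor), cs_rank(fwd_returns)
    f = f - f.mean(axis=1, keepdims=True)
    r = r - r.mean(axis=1, keepdims=True)
    num = (f * r).sum(axis=1)
    den = np.sqrt((f ** 2).sum(axis=1) * (r ** 2).sum(axis=1))
    return num / den

def ic_t_stat(ic_series):
    # Plain t-statistic on the mean daily IC (no autocorrelation correction).
    return ic_series.mean() / (ic_series.std(ddof=1) / np.sqrt(len(ic_series)))
```

An average daily IC of 0.027 sounds small in absolute terms, but because it is earned every day across the whole cross-section, it compounds into the kind of Sharpe ratios reported here, which is why the t-statistic on the mean IC is the natural gatekeeping statistic.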

When the twelve signals are combined through a simple linear model, the composite long-short portfolio delivers a cumulative gross return of 543% over four years, with a Sharpe of 3.11 and a maximum drawdown of 10.84%. All sixteen quarters in the out-of-sample window are profitable on a gross basis. A LightGBM nonlinear aggregation produces a smoother drawdown profile and tighter statistical significance (t-statistics around 7.2 for risk-adjusted alpha), though the linear model generates higher raw returns.

Risk-adjusted alphas remain significant after controlling for CAPM, Fama-French three-factor, five-factor, and six-factor models. The daily alphas for the linear long-short portfolio range from 0.442% to 0.453% depending on the benchmark, with t-statistics above 6.0.

What the factors actually capture

This is where it gets interesting, and also where questions start. Table IX in the paper provides the economic interpretation of all twelve factors. They fall into a narrow thematic cluster: turnover dynamics, flow shocks, and liquidity demand. Factor 1 is a flow shock (one-day volume growth rank). Factor 2 is relative turnover. Factors 3 and 11 are the z-score of relative turnover. Factors 4, 8, and 12 are combinations of turnover level and turnover change. Factors 5 and 9 are short-horizon and medium-horizon turnover averages. Factor 7 is the z-score of relative turnover again. Only Factors 6 and 10 break the pattern, combining price level relative to trend with realized volatility.
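For readers who want to see what a "z-score of relative turnover" looks like in practice, here is one plausible rendering. The window length and the choice to normalize against the equal-weighted market average are assumptions on my part; Table IX gives the interpretation, not the exact formulas.

```python
import numpy as np

def relative_turnover_zscore(volume, shares_out, window=20):
    # rows = days, cols = stocks; window length is an assumed parameter.
    turnover = volume / shares_out                                    # daily turnover
    rel = turnover / (turnover.mean(axis=1, keepdims=True) + 1e-12)   # vs. market average
    mu = np.full_like(rel, np.nan)
    sd = np.full_like(rel, np.nan)
    for t in range(window - 1, rel.shape[0]):
        win = rel[t - window + 1 : t + 1]
        mu[t] = win.mean(axis=0)
        sd[t] = win.std(axis=0)
    return (rel - mu) / (sd + 1e-12)                                  # trailing z-score
```

The narrowness of the theme is visible even at this level: most of the twelve factors are small perturbations of the quantities computed in these few lines.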

The performance metrics reflect this overlap. Factors 2, 3, 7, and 11 have nearly identical Sharpe ratios (1.79 to 1.81), nearly identical IC values (0.0097 to 0.0099), and nearly identical risk-adjusted alphas. The same is true for the cluster of Factors 4, 8, and 12, and for the pair of Factors 6 and 10. The agent appears to have discovered three or four distinct signals and then generated minor variants of each.

This does not invalidate the results, but it complicates the claim that the system discovers a "diverse family of signals." The effective dimensionality of the factor library is closer to four than twelve. Whether the multi-factor aggregation actually benefits from having near-duplicates in the input set is an open question.

The data mining question

The authors take the overfitting concern seriously, which is appropriate given that an autonomous system running thousands of implicit hypothesis tests is exactly the scenario where false discoveries proliferate. Their defenses include a t-statistic threshold of 3.0 (following Harvey et al., 2016), strict temporal separation between discovery and evaluation periods, a requirement that each factor be accompanied by an economic rationale, and multi-model risk adjustment.

These are sound practices. The t > 3.0 hurdle is meaningfully stricter than the traditional 2.0, and the temporal isolation (learning frozen before 2021, primary reporting from 2023) provides genuine out-of-sample distance. The fact that alphas remain stable across CAPM through FF6 suggests the signals are not simply repackaging known factor exposures.

That said, the economic rationale requirement deserves some skepticism. The agent generates mathematical expressions and economic explanations jointly. But the explanations in Table IX are post-hoc rationalizations generated by a language model, not independent theoretical derivations. Calling a cross-sectional z-score of relative turnover a "liquidity anomaly" is not wrong, but it is also not a deep economic mechanism. The constraint filters out nonsensical formulas, which is useful, but it does not guarantee that surviving factors represent persistent structural effects rather than well-described statistical patterns.

Transaction costs and capacity

The strategy rebalances daily with average daily turnover around 110%. That is extremely high. The authors apply a linear transaction cost model of 3 basis points per dollar traded (one-way) and show that cumulative net returns over the 2023-2024 window drop from roughly 139% to 75%. The strategy remains profitable, but nearly half the gross return evaporates.
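A quick sanity check on that drop: under a linear cost model, 110% daily turnover at 3 bps per dollar traded compounds into a large annual drag. Whether reported turnover counts one leg of the trade or both is a convention choice, so both variants are computed below; neither is a reconstruction of the paper's exact accounting.

```python
# Back-of-the-envelope cost drag from daily rebalancing at linear costs.
DAILY_TURNOVER = 1.10     # 110% of portfolio value traded per day
COST_PER_DOLLAR = 0.0003  # 3 bps, one-way
TRADING_DAYS = 252

daily_cost = DAILY_TURNOVER * COST_PER_DOLLAR            # 3.3 bps per day

# Fraction of capital retained after a year of costs, before market impact.
retained_one_leg = (1 - daily_cost) ** TRADING_DAYS      # ~0.92: roughly 8% annual drag
retained_two_leg = (1 - 2 * daily_cost) ** TRADING_DAYS  # ~0.85: roughly 15% annual drag
```

The two-leg figure, about 15% of capital per year lost to costs before any market impact, is broadly consistent with the scale of the gross-to-net gap reported over 2023-2024.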

Three basis points is a reasonable estimate for large-cap US equities with electronic execution and good infrastructure. For smaller names, the real cost is likely higher. The paper also does not model market impact, which scales with order size and matters once the strategy manages meaningful capital. A 110% daily turnover strategy running several hundred million dollars would face substantially worse execution than the linear cost model implies.

The paper does not discuss capacity constraints directly. Given that the core signals are turnover- and volume-based, they are particularly sensitive to the paradox of volume-based factors: the signal is about unusual trading activity, but executing trades based on that signal adds to the very activity you are measuring. At small scale this is irrelevant. At institutional scale it becomes a real problem.

Agentic vs. traditional: what the comparison shows

One of the more useful results is the direct comparison between agentic and traditional AI factor generation (Section 7.4). The agentic framework with simple linear aggregation (58.80% annualized) outperforms traditional factors with LightGBM (50.84%). This suggests that the iterative, self-correcting search process produces better raw signals, and that the quality of inputs matters more than the complexity of the combination model.

The divergence grows after mid-2022, during a period of elevated volatility. The agentic factors adapt better because the memory-update mechanism steers subsequent hypothesis generation toward signals that work in changing conditions. Traditional factor mining, which fixes the factor set before combination, cannot adjust.

This is the most practically interesting finding in the paper. It says something about the value of closed-loop factor research regardless of whether you believe the specific performance numbers.

What to take away

This paper introduces a genuinely novel architecture for systematic factor discovery. The closed-loop design, where hypothesis generation, empirical testing, gatekeeping, and policy updates form a continuous cycle, is a meaningful step beyond static machine learning pipelines. The out-of-sample discipline is real. The transaction cost analysis is present, which is more than many academic papers offer.

The reported performance is exceptional, and that is both the selling point and the reason for caution. A 3.11 Sharpe ratio on a daily-rebalanced equity long-short portfolio, sustained over four years, would place this among the best quantitative strategies ever documented. The turnover is extreme. The factor library has less diversity than it first appears. The capacity question is unaddressed. And the paper is a preprint that has not yet been through peer review.

For anyone building systematic strategies, the relevant lesson is not about this specific portfolio. It is about the architecture. Autonomous iterative search with strict statistical gates, economic interpretability constraints, and cumulative memory represents a different paradigm from the traditional "human proposes, machine tests" workflow. Whether the specific numbers in this paper hold up under external replication and live trading is a separate question, and one that only time and independent verification can answer.


References:

Fama, E.F. and French, K.R. (2015) 'A five-factor asset pricing model', Journal of Financial Economics, 116(1), pp. 1-22.

Harvey, C.R., Liu, Y. and Zhu, H. (2016) '... and the cross-section of expected returns', The Review of Financial Studies, 29(1), pp. 5-68.

Huang, A.Y. and Fan, Z. (2026) 'Beyond Prompting: An Autonomous Framework for Systematic Factor Investing via Agentic AI', arXiv preprint, arXiv:2603.14288v1. Available at: https://arxiv.org/abs/2603.14288 (Accessed: 20 March 2026).