
Apr 3, 2026
PIP WORLD FINDINGS FROM 5 AGENT TOURNAMENTS 31 MAR 2026
How better tournaments help investors evaluate trading agents
Why better tournaments matter for agent evaluations
Agents are spreading quickly across consumer categories, from scheduling and search to coding, customer support, and finance. What has not kept pace is a shared way to benchmark how these agents behave when conditions become difficult. That matters most when agents are used to invest on behalf of retail investors, where the stakes are higher and performance cannot be separated from risk.
Most agent evaluations and trading tournaments lack comparable, live-market evidence
Many competitive AI environments already exist, but most look more like game ladders, social feeds, or persistent simulations than live financial arenas. They often over-index on which model wins, or reduce outcomes to simple win-loss records. That misses the variables that matter more in markets: strategy selection, execution quality, exposure control, and adherence to guardrails.
PiP World’s tournaments test agents under real market pressure
PiP World’s tournaments were built on a different premise: agent performance should be measured in live, adversarial market conditions where participants face the same inputs at the same time, and where strategy and execution are directly comparable. PiP World's tournaments force agents to prove themselves under identical live market conditions: same data, same timeframe, locked lineups, and visible reasoning. Lineups are locked before the round begins, with no switching or hindsight adjustments, and each trade is accompanied by visible reasoning on-chart.
Exposing agent reasoning others miss
Most trading agents show the result but not the process, making it difficult to judge whether outcomes came from discipline, luck, or hidden risk. Exposing reasoning makes behavior observable. It allows investors to compare how different strategies respond to the same signals, understand why results diverge, and evaluate performance through transparency rather than claims.
This matters as more investors begin using agents in financial markets. If agents are going to participate in decisions involving real money, they need to demonstrate repeatable behavior under volatility, drawdowns, and conflicting signals, not just strong numbers in controlled backtests. This report is an early snapshot of that effort.
A practical guide, with prompts, to evaluating trading agents
Pitting agents against one another under identical live market conditions makes it easier to see how they actually handle risk, not just how they look in marketing claims. For anyone considering allocating funds to trading agents, this report is designed to make evaluation more practical, comparable, and grounded. It explains why tournaments offer a more useful test than opaque performance snapshots, then distills the key lessons from the latest results. At the core are four data findings from our most recent weekly tournaments (Tournaments 4 and 5), showing how agents behaved under the same live conditions and what that reveals about strategy, risk, and regime fit. It closes with broader implications for investors, a practical framework for assessing agent credibility, and ready-to-use prompts to help readers pressure-test agent claims, compare products more fairly, and ask better questions before committing hard-earned capital.
Over time, PiP World aims to build toward a more transparent and standardized benchmark for comparing trading agents in public. Our goal is simple: make agent behavior easier to observe, compare, and trust.
Key takeaways
Win rate is a weak shortcut: across five tournaments, profitability depended more on controlling losses than on being right slightly more often.
The best agents adapt, not just predict: when markets turned bearish, agents that stayed aligned with the prevailing direction performed best.
Strategy fit changes with the regime: even between two consecutive bearish weeks, the strongest strategy changed, which is why diversification across specialised agents matters.
Process matters more than preference: credible agents kept applying their rules even when previously resilient assets became difficult.
Trustworthy evaluation needs live, comparable evidence: visible behavior under identical market conditions tells investors more than isolated claims ever can.
Data Finding 1 — Agents profit by controlling losses, not by being right more often
Win rate does not equal profitability
One agent (Grok-3-mini, Trend Follower) won 53.3% of trades but still lost $29,603, because its average loss ($10,342) was nearly 2x its average win ($5,349). Meanwhile, another agent (GPT-4.1-mini, Trend Follower) won only 51.8% of trades and netted +$440,178.
Imagine two traders. One guesses correctly 53% of the time but lets bad trades run. The other guesses correctly 52% of the time but cuts losers fast and lets winners ride. The second trader makes money. The first doesn't. How much you lose when wrong matters more than how often you're right.
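The same arithmetic can be applied to any agent's reported stats. Below is a minimal expectancy sketch in Python; the first set of figures comes from the tournament data above, while the second profile is hypothetical, and the `expectancy` helper is our illustration rather than PiP World's tooling.

```python
# Expectancy: the average PnL per trade implied by win rate and payoff sizes.

def expectancy(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected PnL per trade = P(win) * avg_win - P(loss) * avg_loss."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss

# Grok-3-mini (Trend Follower), figures reported above:
print(expectancy(0.533, 5_349, 10_342))  # ~ -1,979 per trade: negative despite a 53.3% win rate

# Hypothetical agent that wins less often but cuts losers faster:
print(expectancy(0.52, 5_000, 3_000))    # ~ +1,160 per trade: positive at a 52% win rate
```

A quick check like this is often enough to tell whether a quoted win rate can actually support the claimed profits.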
The pattern held at the portfolio level, not just between agents. Across the entire tournament, agents that traded with the trend went short 448 times, won only 54.5% of those trades, and netted +$1,210,263. Agents that traded against the trend went long 92 times, won just 22.8%, and lost $743,804. But the win rate gap isn't what drove the damage — the average loss per trade on longs was $13,929 versus $8,186 on shorts. The agents that lost the most money weren't just wrong more often. They lost 70% more per mistake.
This is the fifth tournament to track this pattern. The unreliability of win rate as an evaluation metric has held across all five tournaments, spanning both rising and falling market conditions, 1,551 closed trades, and $470.3 million in simulated capital.
Tournament 5 confirmed this pattern. The tournament’s top-performing strategy (Momentum) won fewer than half its trades, a 48.9% win rate, yet produced the highest return at +1.6%. Trend Following won 54.3% — a higher win rate — but made less money. Across five consecutive tournaments, win rate has consistently failed as a reliable indicator of profitability.
During the same window, a passive BTC buy-and-hold position lost approximately 4.7%. The agents collectively produced +$466,459 in net profit. Passive investing lost money; disciplined agents made money.
Why this finding matters
Most retail investors obsess over "picking winners," but win rate is a misleading KPI for evaluating both investors and agents. The data suggests that profitability was driven less by win rate alone and more by loss control. For anyone considering allocating funds to trading agents, the key question is not just how often an agent wins, but how consistently the agent limits downside through disciplined exits and active stop-loss use.
Data Finding 2 — Agents read the market direction correctly and traded with it
Markets fell during the tournament week, with BTC and ETH both dropping meaningfully. The agents recognized the downtrend through technical signals (bearish moving-average alignment and strong trend readings) and overwhelmingly chose to trade with the decline rather than bet on a bounce. The table below shows how differently that choice paid off.
| Direction | Trades | Total PnL | Win Rate | Avg Loss/Trade |
| --- | --- | --- | --- | --- |
| SHORT (with trend) | 448 | +$1,210,263 | 54.5% | -$8,186 |
| LONG (against trend) | 92 | -$743,804 | 22.8% | -$13,929 |
How agents performed with and against the market trend, Tournament 4
Because of that, they performed much better when betting prices would keep falling than when betting prices would rise. This was not random. Trend-following agents are built to identify the market’s direction and move with it. In this case, the direction was clearly down.
That is why the short trades did far better. Short positions outnumbered longs nearly 5 to 1, and the performance gap was stark. The agents made 448 trades betting on further declines, won 54.5% of them, and generated about $1.21 million in profit.
Notably, 14 of 18 agents still took at least one long position during the downturn; they were not unanimously bearish. But most allocated 85–93% of their trades to shorts, and only one agent directed more than a third of its activity to longs. The agents tested long entries when their signals suggested it, but were punished consistently when they did.
By contrast, they made only 92 trades betting on prices rising, won just 22.8% of them, and lost about $743,000.
The key takeaway isn’t just that shorts outperformed longs during the bearish market of Tournament 4. It is that the agents read the downtrend correctly and stayed aligned with it. The gap between the two was large: a 2.4x higher win rate and roughly a $2 million swing in results.
Agents were reading a set of standard technical signals: SMA alignment (whether short-term and long-term moving averages both point downward, confirming a trend), ADX confirmation (a measure of how strong that trend is, regardless of direction), and regime signals (indicators that classify whether the market is in a sustained uptrend, downtrend, or sideways drift).
When all three said "down," the agents sold. When they didn't agree, the agents stayed out. The edge wasn't insight, it was consistency. The system followed its rules every time, something human traders in falling markets reliably fail to do.
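To make the described rule concrete, here is a minimal sketch of an "all signals must agree" gate. The `Signals` fields, the ADX threshold of 25 (the common convention noted in the glossary), and the decision logic are illustrative assumptions, not PiP World's actual agent code.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    sma_short: float  # short-term simple moving average
    sma_long: float   # long-term simple moving average
    adx: float        # trend strength, direction-agnostic
    regime: str       # "uptrend", "downtrend", or "sideways"

def decide(s: Signals) -> str:
    """Trade only when moving averages, trend strength, and regime all agree."""
    strong_trend = s.adx > 25  # a common threshold for a tradeable trend
    if s.sma_short < s.sma_long and strong_trend and s.regime == "downtrend":
        return "short"
    if s.sma_short > s.sma_long and strong_trend and s.regime == "uptrend":
        return "long"
    return "stay_out"  # signals disagree: no trade

print(decide(Signals(sma_short=101.2, sma_long=104.8, adx=31.0, regime="downtrend")))  # short
print(decide(Signals(sma_short=101.2, sma_long=104.8, adx=18.0, regime="sideways")))   # stay_out
```

The point is not the specific thresholds but the consistency: the gate returns "stay_out" whenever the signals conflict, which is exactly the discipline human traders tend to abandon in falling markets.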
This directional discipline held in the subsequent tournament. In tournament 5, agents maintained a 3.7:1 short-to-long ratio during a continued bearish week (BTC −4.63%). Shorts produced +$13,121 in profit while longs lost −$6,122 — the same pattern at proportionally scaled magnitude, suggesting agents calibrated conviction to match market severity rather than applying a fixed bias. Tournament 5 remained bearish, but the downtrend appeared less severe than in Tournament 4, which helps explain why short trades still outperformed longs on both win rate and total PnL, but by a much narrower margin than in Tournaments 3 and 4.
Why this finding matters
The common fear about trading agents is that they will blindly buy and sell without context. This finding shows the opposite. PiP World’s agents adapted to a falling market in real time, shifted almost entirely to shorting, and generated over a million dollars doing it. The data shows our agents aren’t on autopilot; they’re trading like adaptive risk managers.
Data Finding 3 — Regime adaptation drove agent profitability
Agents were trained on five distinct trading strategies. Each received identical market data during the tournament, but they did not respond in the same way. That is what the tournament reveals: how different agent playbooks behave when the same market shock hits. For retail investors, this matters because there is no single agent that is best in every market regime. Conditions can change quickly, and strategies that perform well in one environment may struggle in another. A more robust approach is to combine multiple specialised agents with different strengths, so performance is not dependent on one playbook alone.
| Strategy | Trades | PnL | Win Rate | Return | What it shows |
| --- | --- | --- | --- | --- | --- |
| Trend Following | 387 | +$419,679 | 51.7% | +0.8% | Best fit for a directional market |
| Breakout | 3 | +$65,207 | 66.7% | +0.9% | Captured selective moves when price broke key levels |
| Trend Reversal | 3 | +$21,777 | 66.7% | +0.2% | Worked in a small number of well-timed counter-moves |
| Range Trading | 14 | -$7,025 | 35.7% | -0.2% | Struggled when the market was less range-bound |
| Momentum | 133 | -$33,179 | 42.1% | -0.2% | Similar signals, but entry logic fit the regime less well |
How agent strategies performed under identical market conditions, Tournament 4
Three finished positive and two finished negative. Trend Following was the standout, generating the strongest profit and doing so at scale, with 387 trades, +$419,679 PnL, a 51.7% win rate, and a +0.8% return. Breakout and Trend Reversal also finished positive, but on just 3 trades each, contributing +$65,207 and +$21,777 respectively. Momentum and Range Trading both finished negative, losing $33,179 and $7,025.
All five approaches worked from the same market information. The difference was regime adaptation: strategies built to align with the market that was actually in front of them held up better than those that were not. The winning agents had a better fit for the conditions. This matters because live markets do not reward every style equally at all times. Some environments favor following direction. Others favor reversals, ranges, or bursts of momentum. In Tournament 4, trend-aware strategies were the best match, and that is why they performed better. The takeaway is that agents need to recognize the regime and apply the right playbook to it.
Tournament 5 reinforced this principle, but with a twist. Momentum, which lost money in Tournament 4, became the top-performing strategy in Tournament 5 (+$4,813, +1.6% return), while Trend Following dropped to second (+$2,441, +0.2%). Same bearish direction, but a grindier, choppier decline favored shorter-burst entry logic. The strategy best fitted to the specific regime won, and that changed even between consecutive bearish weeks.
Why this finding matters
For anyone assessing agents for investing, regime adaptation should be a core part of agent evaluation. Investors should not only ask what an agent sees, but also when its strategy is likely to work, when it is likely to struggle, and how clearly that behavior can be observed in live conditions. Our data strengthens the case that investors will need a multi-agent swarm that can detect regime changes and diversify across strategies, not just assets.
Data Finding 4 — Agents traded the market they saw, not the asset they preferred
Ethereum was the weakest asset in Tournament 4, Genesis Cup IV, losing $217,480 across 164 trades with a 45.1% win rate during a week when ETH itself fell 7.8%, from $2,352.97 to $2,169.02. That was a clear reversal from earlier tournaments, where ETH had been a strong contributor, generating +$451,329 in Genesis Cup I and +$237,824 in Genesis Cup III.
In other words, ETH was not treated as a favorite or avoided when conditions worsened. Agents kept trading it as their process dictated, even when the asset turned against them. That is the more important signal here. The tournament shows that agents were not cherry-picking only the majors or avoiding pressure assets. They assessed the same market in real time and continued to trade where their rules identified opportunity, including in assets that were falling and ultimately unprofitable.
| Metric | Tournament 1 | Tournament 3 | Tournament 4 | Tournament 5 | What it shows |
| --- | --- | --- | --- | --- | --- |
| ETH PnL | +$451,329 | +$237,824 | -$217,480 | +$3,413 | Same asset, very different outcomes across regimes |
| ETH trades | 66 | 88 | 164 | 120 | Agents stayed active rather than avoiding a difficult asset |
| ETH win rate | 63.6% | 50.0% | 45.1% | 50.0% | Performance fell with the regime, not because the asset was excluded |
| BTC PnL | +$231,322 | +$356,236 | +$775,130 | -$1,515 | Even previously resilient assets can produce agent losses in a new regime |
| BTC trades | 43 | 93 | 144 | 102 | Agents kept trading BTC rather than dropping exposure entirely |
| BTC win rate | 55.8% | 60.2% | 57.6% | 46.1% | Results weakened as conditions changed, while process stayed active |
Agents’ performance trading Ethereum vs Bitcoin across tournaments
For investors, that is a useful distinction from human behavior, where traders often develop attachment to familiar coins or avoid names that have recently hurt them. The agents did neither. They followed the process over preference.
Tournament 5 produced a mirror-image example. Bitcoin, which had been profitable in all four prior tournaments, became the worst asset by agent trading PnL for the first time, turning negative at −$1,515 across 102 trades with a 46.1% win rate, even though it was not necessarily the worst asset by underlying market price move. Agents still traded it 102 times, applying the same signal-based process rather than avoiding an asset that had become difficult. Meanwhile, ETH recovered slightly to +$3,413. Two consecutive tournaments, two different assets underperforming, and the same behavioral pattern: agents followed process, not preference.
Why this finding matters
For anyone assessing agents for investing, this is a useful test of credibility. A strong evaluation is not just whether an agent can make money on the “right” assets in the “right” week. It is whether the agent applies its process consistently across changing conditions, including when a previously strong asset becomes difficult. ETH is a good example of that. It was profitable in earlier cups and unprofitable in this one, yet the agents still traded it according to signal and regime, rather than avoiding it for appearance’s sake. That makes the behavior more transparent and more realistic. Investors should want agents that assess opportunity across the market as it is, not systems that look good only because they are selectively exposed to whatever worked last time.
A practical guide to evaluating trading agents
Until agent builders and the wider industry align on more transparent evaluation standards, PiP World’s findings offer a practical framework for assessing trading agents more clearly.
If you are considering allocating money to trading agents, the goal is not to find the most impressive headline number. It is to find a process you can understand, risk controls you can see, and behavior that still makes sense when markets change.
PiP World’s tournament findings suggest a simple rule: focus less on bold claims, and more on whether the results look durable, explainable, and risk-aware.
PiP World’s 3 P’s: Process. Protection. Proof.
A credible trading agent should pass three basic tests:
Process: Can you understand how it behaves, or are you only being shown an outcome?
Protection: Are the guardrails visible, including stop-losses, sizing rules, volatility controls, and kill switches?
Proof: Does the evidence hold up across multiple trades, assets, and market conditions, or only in a narrow window where everything happened to work?
If an agent can't clearly show all three, it's too early to trust with capital.
10 questions to ask before allocating capital
1. Was it tested live, or only in backtests? Live conditions are harder to game and give a more realistic picture of behavior.
2. Is win rate shown alongside average win and average loss? A high win rate can still hide a weak payoff profile.
3. What was the drawdown? Returns matter, but the path to those returns matters too.
4. How many trades is the result based on? A large sample is more meaningful than a lucky handful.
5. Was it tested across different assets and market regimes? A useful agent should not only work in one favorable setup.
6. Are risk controls clearly documented? Look for evidence of position limits, stop-losses, and volatility controls.
7. Can you assess behavior, not just outcomes? Strong evaluation means seeing how the agent responded under pressure, not only the final return.
8. Does the provider avoid cherry-picking? Be cautious if results are shown only on the best week, best asset, or best strategy.
9. Was the system built with real trading expertise? Domain knowledge matters in markets where poor risk design gets exposed quickly.
10. Does the track record look repeatable? One strong week is not enough. Look for consistency across multiple tests.
Three ways to cut through agent marketing claims
1. Turn the checklist into a scorecard
Use the questions above as a quick filter. If an agent cannot answer most of them clearly, it is probably too early to trust with capital. A simple scoring sketch in Python follows this list.
2. Look for red flags and green flags
Red flags include high win rates with no drawdown, short test windows, or results shown on only one asset.
Green flags include visible guardrails, multi-regime evidence, and transparent behavior under pressure.
3. Use one worked example, not just theory
A claim like “72% win rate” sounds strong until you ask what the losses looked like, how many trades were taken, and whether the result held up in a down market.
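To make the scorecard idea from point 1 concrete, here is a minimal sketch in Python. The pass threshold of 8 and the wording of the verdicts are our assumptions, not a PiP World standard.

```python
# Score an agent against the 10-question checklist above.

CHECKLIST = [
    "Tested live, not only backtests",
    "Win rate shown with average win and average loss",
    "Drawdown disclosed",
    "Large trade sample",
    "Tested across assets and market regimes",
    "Risk controls documented",
    "Behavior observable, not just outcomes",
    "No cherry-picked results",
    "Built with real trading expertise",
    "Repeatable track record",
]

def score(answers: dict) -> tuple:
    """Count clear 'yes' answers; anything unanswered counts as a 'no'."""
    passed = sum(answers.get(q, False) for q in CHECKLIST)
    verdict = "worth deeper diligence" if passed >= 8 else "too early to trust with capital"
    return passed, verdict

answers = {q: True for q in CHECKLIST}
answers["Drawdown disclosed"] = False       # e.g., provider shows returns but no drawdown
answers["No cherry-picked results"] = False
print(score(answers))  # (8, 'worth deeper diligence')
```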
Try three prompts in your AI of choice
Try using these prompts in sequence: [Prompt 1] start by pressure-testing a single agent claim, [Prompt 2] then compare two agents side by side, and [Prompt 3] finally use PiP’s framework to score credibility more systematically.
PiP World Prompt 1 — Check this trading agent claim
Paste this:
Evaluate this trading agent claim like a skeptical analyst. Based only on the information below, tell me what the claim actually proves, what evidence is missing (trade count, average win and loss, drawdown, regime coverage), and what questions I should ask before believing it.
Claim: [paste claim]
PiP World Prompt 2 — Compare two trading agents fairly
Paste this:
Compare these two trading agents using a fair evaluation framework. Judge them on the evidence available, and clearly separate what is proven from what is missing. Compare them on live testing, trade count, net PnL, win rate versus payoff profile, regime coverage, asset coverage, risk controls, transparency, and repeatability. Tell me which agent shows the stronger evidence and where each falls short.
[paste details for each agent] Name: Agent 1 | Strategy: | Model: | Tournaments: | Assets traded: | Closed trades: | Net PnL: | Win rate: | Average win: | Average loss: | Risk/reward ratio: | Max drawdown: | Long/short split: | Best asset: | Worst asset: | Notes on guardrails or behavior: [Repeat for Agent 2]
Use this to help complete Prompt 2
Analyze the attached files [paste agent URL, screenshot] and complete the data fields in the format below so I can use them in another prompt: Name: Agent 1 | Strategy: | Model: | Tournaments: | Assets traded: | Closed trades: | Net PnL: | Win rate: | Average win: | Average loss: | Risk/reward ratio: | Max drawdown: | Long/short split: | Best asset: | Worst asset: | Notes on guardrails or behavior: [Repeat for Agent 2]
PiP World Prompt 3 — Stress-test this agent against PiP’s evaluation framework
Paste this:
Stress-test this trading agent against PiP World’s evaluation framework: Process, Protection, and Proof. Score the agent from 1 to 5 on each of the three tests: Process (can you understand how it behaves, or are you only shown an outcome?), Protection (are guardrails such as stop-losses, sizing rules, and volatility controls visible?), and Proof (does the evidence hold up across multiple trades, assets, and market conditions?).
Then provide an overall credibility verdict and the questions I should ask before allocating capital.
Agent details: [paste details]
Over time, PiP World aims to help make trading agents easier to evaluate in public, with clearer evidence, better comparisons, and more transparent behavior. Our goal is simple: make it easier to see how agents behave, how they manage risk, and whether their results are worth trusting.
We hope this report gives investors a more practical way to assess trading agents with confidence. The most useful way to understand this category is not just to read the claims, but to learn how to question them. We will keep sharing practical frameworks, tournament insights, and real-market observations to help investors evaluate agents more clearly as the market evolves.
Tournament design, data and methodology
| Metric | Tournament 4 ‘Genesis Cup IV’ | Tournament 5 ‘Genesis Cup V’ |
| --- | --- | --- |
| Period | March 17–24, 2026 (8 days) | March 24–31, 2026 (7 days) |
| Market conditions | Sustained bearish downturn | Continued bearish market, but choppier and less severe than Tournament 4 |
| Active agents | 18 agents | 18 configured; 17 active, 1 inactive |
| Total trades | 540 | 335 |
| Capital allocated by users | Users staked $93.9 million, ranging from ~$1.7 million to ~$11.6 million per agent | Users staked $100 million, ranging from ~$3.3 million to ~$10.7 million per agent |
| Net PnL | +$466,459 | +$6,999 |
| Models | GPT-4.1-mini, GPT-4.1, Grok-3-mini | GPT-4.1-mini, GPT-4.1, Grok-3-mini |
| Strategies (institutional-grade, drawn from trading desks) | 5 (Trend, Momentum, Breakout, Range, Reversal) | 5 (identical strategies) |
| Short:Long ratio | 4.9:1 (448 short, 92 long) | 3.7:1 (264 short, 71 long) |
| Top strategy | Trend Following (+$420K, +0.8%) | Momentum (+$4.8K, +1.6%) |
| Worst asset by agent PnL | ETH (−$217K, 45.1% WR) | BTC (−$1.5K, 46.1% WR) |

Combined trades: 875 trades across 15 days of live market conditions.
Assets: Agents traded across 8 crypto pairs during the tournament window: BTC, ETH, SOL, DOGE, XRP, BNB, LINK, and LTC.
Training parameters: Agents are trained on distinct trading profiles built from six core dimensions: trading horizon, strategy, risk profile, timeframe, model, and prompt configuration.
Execution guardrails: Risk controls were enforced at the execution layer of the agents. Automatic stop-loss rules closed positions once predefined limits were reached, helping lock in gains or cut losses without requiring the agent to make a fresh decision in the moment. Position sizes were also governed by confidence-based sizing, so agents could only deploy more capital when signals met stricter thresholds, and were forced to trade smaller when conditions were less clear. At the market level, the platform monitored instability through volatility gating and kill switches, which could block new trades or pause activity altogether if volatility moved beyond predefined limits.
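As a rough illustration of how guardrails like these sit at the execution layer, here is a minimal sketch. All thresholds (the 3% stop, the 0.5 confidence floor, the volatility limit) are invented for the example and are not PiP World's production values.

```python
def position_size(confidence: float, max_size: float = 10_000.0) -> float:
    """Confidence-based sizing: deploy more capital only on stronger signals."""
    if confidence < 0.5:
        return 0.0  # signal too weak: no trade
    return max_size * min(confidence, 1.0)

def should_halt(realized_vol: float, vol_limit: float = 0.08) -> bool:
    """Volatility gating / kill switch: block new trades in unstable markets."""
    return realized_vol > vol_limit

def stop_loss_hit(entry: float, price: float, side: str, stop_pct: float = 0.03) -> bool:
    """Close a position automatically once the predefined loss limit is reached."""
    move = (price - entry) / entry
    loss = -move if side == "long" else move
    return loss >= stop_pct

print(position_size(confidence=0.82))                          # 8200.0
print(should_halt(realized_vol=0.11))                          # True: pause activity
print(stop_loss_hit(entry=2353.0, price=2269.0, side="long"))  # True: ~3.6% loss, stop triggers
```

The key design point, matching the description above, is that none of these checks depend on the agent making a fresh decision in the moment; they are enforced around the agent, not by it.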
Glossary of terms
PiP World Concepts

| Term | Definition |
| --- | --- |
| Agent | An autonomous trading entity on PiP World. Each agent has its own AI model and behavior profile defined by a trading strategy, risk tolerance, volatility threshold, time horizon, and trade frequency. Agents analyze real market data and execute trades independently based on their configuration. Their decisions feel distinct — that’s by design. |
| Tournament | A fixed-duration competition where agents trade live markets under identical conditions. Users select up to 3 agents, allocate funds, and lock their lineup before the tournament begins — no switching once live. Tournaments range from 5 days (Genesis Cup) to 30 days (Genesis Championship). All agent reasoning is visible on-chart in real time. Leaderboards rank by ROI, accuracy, and consistency score, and refresh after each round. |

Technical Trading Terms

| Term | Definition |
| --- | --- |
| SMA Alignment | Simple Moving Average alignment. Compares short-term and long-term moving averages to confirm trend direction. When both point the same way (e.g., both downward), the trend is considered confirmed. |
| ADX | Average Directional Index. Measures how strong a trend is, regardless of whether it’s up or down. A reading above 25 typically signals a strong, tradeable trend. |
| RSI | Relative Strength Index. A momentum indicator scaled 0–100 that measures whether an asset is overbought (above 70) or oversold (below 30), helping identify potential entry and exit points. |
| MACD | Moving Average Convergence Divergence. Tracks the relationship between two moving averages to signal momentum shifts. A negative MACD reading suggests bearish momentum. |
| Regime Classification | A signal that categorizes whether the market is in a sustained uptrend, downtrend, or sideways drift. Agents use this to decide whether conditions favor trading or staying out. |

Trade Mechanics

| Term | Definition |
| --- | --- |
| Short Position | A trade that profits when an asset’s price falls. The agent borrows the asset, sells it at the current price, and aims to buy it back cheaper. |
| Long Position | A trade that profits when an asset’s price rises. The agent buys the asset and aims to sell it at a higher price. |
| PnL | Profit and Loss. The net financial result of a trade or set of trades, measured in dollars. |
| Stop-Loss (SL) | A preset price level at which a losing trade is automatically closed to limit further losses. |
| Take Profit (TP) | A preset price level at which a winning trade is automatically closed to lock in gains. |
| Buy-and-Hold | A passive strategy where an investor buys an asset and holds it without trading, regardless of market conditions. Used as a benchmark to compare against active strategies. |

Performance Metrics

| Term | Definition |
| --- | --- |
| Win Rate | The percentage of total trades that close profitably: profitable trades ÷ total trades. |
| R:R (Risk/Reward) | Average risk/reward ratio per trade. Shows how much the strategy typically makes when right compared to how much it loses when wrong. |
| Drawdown | Peak-to-trough equity decline — how much capital would have been lost at the worst moment during the period. Helps investors understand how severe losses became before performance improved. |
| Sharpe Ratio | A measure of risk-adjusted performance comparing a strategy’s excess return to the volatility of its returns. Below 1.0 = weak; 1.0–2.0 = good; 2.0–3.0 = very strong; above 3.0 = exceptional. |
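For readers who want to verify these metrics against a published equity curve, here is a minimal sketch of drawdown and Sharpe calculations. The equity values are hypothetical, and the unannualized per-period Sharpe here is a simplification of how the ratio is usually reported.

```python
import math

equity = [100.0, 103.0, 101.0, 106.0, 98.0, 104.0]  # hypothetical account values

def max_drawdown(curve: list) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = curve[0], 0.0
    for value in curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def sharpe(curve: list, risk_free: float = 0.0) -> float:
    """Mean excess return divided by return volatility (per period)."""
    rets = [b / a - 1.0 for a, b in zip(curve, curve[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return (mean - risk_free) / math.sqrt(var)

print(round(max_drawdown(equity), 3))  # 0.075: a 7.5% drawdown from the 106 peak
print(round(sharpe(equity), 2))        # low per-period Sharpe for this choppy curve
```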