
Requires the AI to use a phone. The market resolves YES if an AI is able to reach and talk with a human IRS representative (the conversation itself isn't important) by calling the IRS phone number. It's fine if a human types in the phone number for it and starts the call, but nothing more than that.
This is primarily a test of its speech recognition, its ability to navigate the IRS automated voice answering system, and its ability to operate over long periods (like being put on hold for over an hour).
If nobody posts evidence of this happening before the close date, I'll test a few mainstream AIs.
For a system tailored to this task to count, it needs to retain its generality across other tasks, and not be e.g. a Python script that sends DTMF tones.
Update 2026-03-06 (PST) (AI summary of creator comment): The AI is allowed to clear or compact its context or use subagents during a lengthy hold. What matters is that it reaches the IRS representative autonomously overall.
@singer If the AI loses continuity when it goes on a lengthy hold, does that mean it can't count for Yes that session even if it gets "woken up" or starts a new interaction when it hears a resolution to the hold (whether a human or a further robot menu)?
@Panfilo I'm honestly not so sure what the distinction there is. It's fine if it clears or compacts its context or uses subagents, the market is just whether it reaches the IRS rep autonomously.