τ³

Show HN: τ³-Bench is out – can agents handle complex docs and live calls?

40 AI Score
Show HN  ·  other  ·  Added Mar 25, 2026

Details

Sector: other
Total Funding: $0
Last Round: $0

About

τ-Bench is an open benchmark for evaluating AI agents on grounded, multi-turn customer service tasks with verifiable outcomes. It's been great to see the community adopt it since launch; this is now the third iteration. With τ³-Bench, we're extending it to two new settings: knowledge-intensive retrieval and full-duplex voice.

τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. Best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval; it's reasoning over complex, interlinked policies and executing the right actions in the right order.

τ-Voice: same grounded tasks, but over live full-duplex voice with realistic audio: accents, background noise, interruptions, compressed phone lines. Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): agent mishears a name or email during authentication, and everything downstream fails.

We also incorporated 75+ task fixes to the original airline, retail, and telecom domains, many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.

Code and leaderboard are open, and we'd welcome community submissions and feedback.

Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to-knowledge-and-voice

AI Score Reasoning

Heuristic score based on available signals. Funding: $0, Source: show_hn.

Source

Show HN