✓ accuracy / golden_dataset    0.98  +0.01
✕ behavior / llm_judge         0.73  -0.14
✓ latency / performance        0.95  +0.02
✓ cost / budget_guard          0.92  +0.04
✓ safety / policy_guard        0.99  +0.00
Agentura tests your agent on every pull request and tells you what broke before you merge.
Like pytest, but for AI agents.
WHY THIS EXISTS
A tone adjustment that passes review can silently change how edge cases are handled.
Model providers update their models without notice. Outputs change.
Without a log, there's no way to know what changed between a passing eval and a failing one.
THREE STEPS
Initialize
$ bunx agentura init
Generates agentura.yaml and stores a baseline snapshot from your main branch.
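A minimal agentura.yaml might look like the sketch below. The field names are illustrative assumptions, not Agentura's documented schema; the suite and evaluator names come from the scorecard above.

```yaml
# Illustrative config sketch -- field names are assumptions,
# not Agentura's documented schema.
baseline: main            # branch the reference snapshot is taken from
suites:
  accuracy:
    evaluator: golden_dataset
    threshold: 0.95       # block the merge below this score
  behavior:
    evaluator: llm_judge
    threshold: 0.80
```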
Gate every PR
$ agentura run --against main
↓ behavior  19/26  0.73  -0.14  regression
→ Merge blocked: behavior suite below threshold
Every pull request is scored against the baseline. Regressions block the merge automatically.
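The gate logic above can be sketched in a few lines. This is a minimal illustration of the idea, not Agentura's actual implementation; the names and the tolerated-regression value are assumptions.

```python
# Sketch of a regression gate: a suite fails when its branch score
# drops below the baseline by more than a tolerated delta.
# Names and thresholds are illustrative, not Agentura's API.
from dataclasses import dataclass

@dataclass
class SuiteResult:
    name: str
    baseline: float
    branch: float
    max_regression: float  # largest tolerated score drop

def gate(results):
    """Return (blocked, failures) for a list of suite results."""
    failures = [
        f"{r.name}: {r.branch - r.baseline:+.2f}"
        for r in results
        if r.baseline - r.branch > r.max_regression
    ]
    return (len(failures) > 0, failures)

blocked, failures = gate([
    SuiteResult("accuracy", baseline=0.97, branch=0.98, max_regression=0.05),
    SuiteResult("behavior", baseline=0.87, branch=0.73, max_regression=0.05),
])
# behavior dropped 0.14, beyond the 0.05 tolerance, so the merge is blocked
```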
Generate audit report
$ agentura report
Generated audit_2026-03-28.pdf
Eval history · Drift log · Policy decisions
Auto-generated audit trail with full provenance. Ready for compliance review.
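One plausible shape for a single audit-trail entry, shown here only to make "full provenance" concrete: each score is traceable to the commit, model version, and suite that produced it. The schema, the commit SHA, and the model identifier are all hypothetical; Agentura's real report format is not specified here.

```python
# Hypothetical audit-trail entry -- illustrative schema only,
# not Agentura's documented report format.
import json

entry = {
    "suite": "behavior",
    "score": 0.73,
    "baseline": 0.87,
    "delta": -0.14,
    "commit": "abc1234",                  # hypothetical commit SHA
    "model": "example-model-2026-03-01",  # hypothetical pinned model id
    "decision": "merge_blocked",
}
record = json.dumps(entry, sort_keys=True)  # one line per event in the log
```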
A GitHub Action runs your tests. Agentura is the tests.
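The CI wiring can be sketched as an ordinary workflow. The workflow and job names are illustrative; the only Agentura-specific line is the `run` command from step 2.

```yaml
# Illustrative workflow sketch -- names are assumptions, not an
# official Agentura action.
name: agent-evals
on: pull_request
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so `main` is available to compare against
      - run: bunx agentura run --against main
```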
Agentura also monitors behavioral drift over time against a frozen reference snapshot, not just PR-to-PR regression.
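A minimal sketch of why the frozen reference matters, assuming a simple score-per-suite snapshot format (not Agentura's actual one): each run is compared to the frozen scores, so small PR-to-PR deltas that individually pass the gate still trip an alert once they accumulate.

```python
# Drift check against a frozen reference snapshot -- illustrative
# snapshot format and threshold, not Agentura's API.
FROZEN = {"behavior": 0.87, "accuracy": 0.98}

def drift(current, frozen=FROZEN, threshold=0.10):
    """Suites whose score moved more than `threshold` away from the
    frozen reference, regardless of recent PR-to-PR deltas."""
    return {
        name: round(current[name] - ref, 2)
        for name, ref in frozen.items()
        if abs(current[name] - ref) > threshold
    }

# Three runs that each move behavior by only 0.04 -- no single
# PR-to-PR gate fires, but drift from the frozen snapshot grows.
runs = [{"behavior": 0.83, "accuracy": 0.98},
        {"behavior": 0.79, "accuracy": 0.98},
        {"behavior": 0.75, "accuracy": 0.98}]
alerts = [drift(run) for run in runs]
# only the third run has drifted far enough from the frozen reference
```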
LIVE DEMO
Run a baseline vs branch comparison in your browser. No install. No account. Live eval results.
Open Playground →
SEE IT IN ACTION
You made the tone friendlier. Policy refusals dropped 24%. Nobody noticed for two weeks.
| Metric | Baseline | Branch | Delta | Gate |
|---|---|---|---|---|
| Accuracy | 0.91 | 0.67 | -0.24 | BLOCK |
| Policy fidelity | 0.88 | 0.64 | -0.24 | BLOCK |
| Latency (p95) | 842ms | 902ms | +60ms | PASS |
OPEN SOURCE
MIT License · Self-host in minutes · Own your eval data