Run it overnight. Wake up to answers.

AutoScienceBench evaluates every AI model on your infrastructure against real scientific workflows — then generates a deployable routing policy backed by evidence, not intuition. One command. Full evaluation suite. Results by morning.

Autonomous evaluation · overnight run
18:00
Evaluation triggered. 4 models × 3 quantization levels × 50 tasks.
18:04
Layer 1 (General Gate) complete. 1 model filtered — fails minimum competency. 3 proceed.
20:30
Layer 2 (Domain Science) complete. 450 task evaluations scored (3 models × 3 quantization levels × 50 tasks). Partial results streaming to dashboard.
22:15
Layer 3 (Agentic Workflows) complete. Pipeline coherence scored across 9 end-to-end runs.
23:00
Layer 5 (Operational Metrics) complete. Throughput, latency, memory profiled under sustained load. (Layer 4 — human expert eval — runs quarterly, not overnight.)
06:00
Report ready. New routing policy generated. Diff: sequence_analysis moved from cloud to local (+11 score, −$0.08/req). Awaiting human approval.

One command. 20 experiments.

Systematically tests model variants, quantization levels, prompt strategies, and context window configurations against your full task suite. Every combination scored. Every result logged.
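A sketch of what that experiment matrix looks like in practice. The model names, quantization levels, and dimension sizes below are illustrative only, not ASB's actual configuration:

```python
# Hypothetical evaluation matrix -- every combination becomes one
# experiment that gets scored and logged.
from itertools import product

models = ["qwen-3.5-122b", "claude-opus-4.6"]
quantizations = ["Q4_K_M", "Q8_0"]        # applies to local models in practice
prompt_strategies = ["zero_shot", "chain_of_thought"]
context_windows = [32_768, 131_072]

experiments = [
    {"model": m, "quant": q, "prompt": p, "ctx": c}
    for m, q, p, c in product(models, quantizations, prompt_strategies, context_windows)
]
print(len(experiments))  # 2 * 2 * 2 * 2 = 16 experiments in this toy matrix
```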

📊

Evidence, not intuition.

Per-task scoring breakdowns. Head-to-head comparisons with statistical significance. A new routing_policy.json with a full diff showing exactly what moved and why.
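One way a head-to-head comparison can be tested for significance is a paired bootstrap over the score differences. This is a minimal sketch using the six category scores from the sample evaluation below; a real run would resample the 50 per-task scores, and ASB's actual statistical method may differ:

```python
# Paired bootstrap: how often does resampling the score differences
# still favor the cloud model on average?
import random

random.seed(0)
local_scores = [86, 79, 58, 81, 62, 71]   # per-category scores, local model
cloud_scores = [73, 91, 93, 78, 88, 76]   # per-category scores, cloud model
diffs = [c - l for l, c in zip(local_scores, cloud_scores)]

def bootstrap_cloud_wins(diffs, n=10_000):
    """Fraction of bootstrap resamples where the mean difference favors cloud."""
    wins = 0
    for _ in range(n):
        sample = [random.choice(diffs) for _ in diffs]
        if sum(sample) / len(sample) > 0:
            wins += 1
    return wins / n

p_cloud_better = bootstrap_cloud_wins(diffs)
```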

🔄

Re-run when anything changes.

New model release? Different quantization? Hardware upgrade? Same methodology, same task suite, new results. Your routing policy is never stale.

🛡️

Human approval gate.

ASB doesn't deploy automatically. It proposes changes with evidence and waits for review. Scientists make the calls.

The problem

Most hybrid AI deployments route by data sensitivity — confidential stays local, public goes to cloud. That's a security policy, not a quality policy. Nobody has measured which model is actually better at which scientific task on the infrastructure they already own.

Results

Sample evaluation: two models, six categories, one infrastructure.

50 tasks. Expert-validated references. Scored on the target hardware. Neither model wins everywhere.

Local · qwen-3.5-122b-a10b
Q4_K_M · Apple Silicon · 512 GB unified
Exo cluster · evaluated 2026-03-20
74.2 composite
sequence_analysis
86
literature_review
79
experiment_design
58
data_analysis
81
compliance
62
agentic_workflow
71
Wins: sequence analysis (+13), data analysis (+3). Zero marginal cost.
Cloud · claude-opus-4.6
API · OpenAI-compatible gateway
us-east-1 · evaluated 2026-03-20
82.7 composite
sequence_analysis
73
literature_review
91
experiment_design
93
data_analysis
78
compliance
88
agentic_workflow
76
Wins: experiment design (+35), compliance (+26), literature review (+12).

The local model beats cloud at sequence analysis by 13 points — at zero cost. The cloud model dominates experiment design by 35 points. Routing everything to one model wastes either money or quality. AutoScienceBench generates the routing config that uses each model where it's strongest.
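The per-category routing decision is just "highest score wins, ties stay local." Deriving it from the sample scores above looks like this (an illustrative sketch, not ASB's actual generator):

```python
# Per-category winner from the sample evaluation scores above.
local = {"sequence_analysis": 86, "literature_review": 79, "experiment_design": 58,
         "data_analysis": 81, "compliance": 62, "agentic_workflow": 71}
cloud = {"sequence_analysis": 73, "literature_review": 91, "experiment_design": 93,
         "data_analysis": 78, "compliance": 88, "agentic_workflow": 76}

routing = {
    task: ("local" if local[task] >= cloud[task] else "cloud")
    for task in local
}
# sequence_analysis and data_analysis route local; the other four route cloud.
```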


Problem Statement

Generic benchmarks don't test what your lab requires.

FAIL — Recall ≠ Reasoning

MMLU / GPQA

Multiple-choice biology trivia tells you nothing about whether a model can design a construct library, interpret dose-response curves, or identify confounders in HTS data.

FAIL — Coding ≠ Science

HumanEval

LeetCode is irrelevant when you need a BioPython pipeline that parses GenBank annotations, aligns against a custom reference, and flags frameshifts in a specific reading frame.

FAIL — Retrieval ≠ Synthesis

BioASQ

Answering PubMed questions is a solved problem. The hard part is multi-step pipelines where the model decides what to search, how to filter, and which contradictions to flag.

PASS — Your workflows, your hardware

AutoScienceBench

Sequence analysis. Multi-step experiment design. Literature synthesis with contradiction detection. 21 CFR Part 11 compliance. Agentic pipelines with 5+ chained tool calls. Measured on your infrastructure.


Methodology

Five evaluation layers. General to operational.

Each layer tests a different dimension. Together they produce a complete quality-per-dollar profile for every model on a given infrastructure.

1

General Gate

Automated screening using curated GPQA and BioASQ subsets. Filters out models that can't meet minimum domain competency before expensive evaluation begins.

~4 min/model · pass/fail · fully automated · zero expert time
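The gate logic amounts to a simple threshold check before the expensive layers run. The 0.55 cutoff, model names, and counts below are invented for illustration; ASB's actual threshold is not stated here:

```python
# Hypothetical Layer 1 gate: minimum fraction correct on the screening subset.
GATE_THRESHOLD = 0.55  # illustrative cutoff, not ASB's real value

def passes_gate(correct: int, total: int, threshold: float = GATE_THRESHOLD) -> bool:
    """True if the model clears minimum competency on the screen."""
    return total > 0 and correct / total >= threshold

# Four candidate models; one falls below the bar and is filtered out.
screen_results = {"model_a": (41, 60), "model_b": (52, 60),
                  "model_c": (29, 60), "model_d": (38, 60)}
survivors = [m for m, (c, t) in screen_results.items() if passes_gate(c, t)]
```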
2

Domain Science

50+ tasks across six categories — sequence analysis, literature retrieval, experiment design, data analysis, regulatory compliance, and agentic workflows. Each task has expert-validated reference answers with structured scoring rubrics. This is the core.

50+ tasks · 6 categories · scored 0–100 · expert-validated references · customizable per organization
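A structured rubric reduces to a weighted sum of per-criterion grades. The criteria and weights below are made up to show the shape; ASB's real rubrics are expert-authored per task:

```python
# Illustrative rubric scorer: weighted 0-100 score from per-criterion grades.
rubric = {  # criterion -> weight (weights sum to 1.0)
    "correct_answer": 0.5,
    "methodology": 0.25,
    "citations": 0.25,
}

def score_task(criterion_scores: dict) -> float:
    """Weighted 0-100 task score from 0-100 grades on each rubric criterion."""
    return sum(rubric[c] * criterion_scores.get(c, 0) for c in rubric)

example = score_task({"correct_answer": 90, "methodology": 80, "citations": 70})
# 0.5*90 + 0.25*80 + 0.25*70 = 45 + 20 + 17.5 = 82.5
```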
3

Agentic Workflows

End-to-end research pipelines requiring 5+ chained actions: search literature, extract data, run analysis, generate hypotheses, draft protocols. Evaluates tool use accuracy, error recovery, and reasoning coherence across multi-step sequences.

3+ pipeline benchmarks · 5+ steps each · tool use + coherence scoring
4

Human-in-the-Loop

Blind expert evaluation by domain scientists. Outputs from different models presented without attribution. Produces Elo ratings and labeled preference data suitable for fine-tuning. Reserved for top performers from Layers 1–3.

blind A/B/C comparison · Elo ratings · inter-rater agreement tracked · generates fine-tuning data
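Elo from blind pairwise preferences follows the standard update rule. A minimal sketch, using the conventional K-factor of 32 and a 1500 starting rating (defaults assumed here, not stated by ASB):

```python
# Standard Elo update from one blind pairwise preference.
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one A-vs-B judgment."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models: the winner of one comparison gains 16 points.
ra, rb = elo_update(1500.0, 1500.0, a_won=True)
```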
5

Operational Metrics

Throughput, latency (p50/p95/p99), memory footprint, cost-per-request — measured on your hardware, your quantization, your inference stack. A model that scores 90 but runs at 3 tok/s on your cluster isn't useful.

tok/s · TTFT · VRAM · $/req · sustained load profiles · quantization-aware
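Latency percentiles over recorded request samples can be computed with nothing but the standard library. A nearest-rank sketch (production profiling would use proper tooling; the sample latencies are invented):

```python
# Nearest-rank percentile over recorded request latencies (ms).
def percentile(samples: list, p: float) -> float:
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [120, 95, 130, 480, 110, 105, 990, 115, 100, 125]
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
# With only 10 samples, p95 and p99 both land on the worst outlier --
# which is exactly why sustained-load profiles matter.
```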

Output

A routing policy you can deploy. Not a leaderboard you can tweet.

Each task category maps to the model with the best quality-adjusted cost-performance on your infrastructure. Security overrides are preserved — restricted data stays local regardless of quality score.

routing_policy.json
{
  "routing_policy": {
    "sequence_analysis": { "model": "qwen-3.5-122b",   "tier": "local", "score": 86 },
    "literature_review": { "model": "claude-opus-4.6", "tier": "cloud", "score": 91 },
    "experiment_design": { "model": "claude-opus-4.6", "tier": "cloud", "score": 93 },
    "data_analysis":     { "model": "qwen-3.5-122b",   "tier": "local", "score": 81 },
    "compliance":        { "model": "claude-opus-4.6", "tier": "cloud", "score": 88 }
  },
  "override_rules": [
    { "condition": "classification in ['restricted', 'phi']", "action": "force_local" }
  ]
}

Override rules run first: security always wins.

Plugs directly into any OpenAI-compatible gateway — LiteLLM, Portkey, or custom. Re-run when you add a model, change quantization, or scale hardware.
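Client-side, consuming the policy is a lookup plus an override check. A hypothetical sketch: field names mirror the sample policy, but the override condition is expressed as a plain list here rather than the string expression in the JSON, and none of this is ASB's actual client code:

```python
# Hypothetical gateway-side lookup against a routing policy.
import json

POLICY = json.loads("""
{
  "routing_policy": {
    "sequence_analysis": {"model": "qwen-3.5-122b", "tier": "local", "score": 86},
    "compliance": {"model": "claude-opus-4.6", "tier": "cloud", "score": 88}
  },
  "override_rules": [
    {"classifications": ["restricted", "phi"], "action": "force_local"}
  ]
}
""")

def route(task: str, classification: str) -> str:
    """Pick a tier for one request; security overrides beat quality scores."""
    for rule in POLICY["override_rules"]:
        if classification in rule["classifications"]:
            return "local"                      # security always wins
    return POLICY["routing_policy"][task]["tier"]
```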

Why this matters for regulated environments

Auditors will eventually ask: "Why did this model handle that regulatory submission?" Without ASB, the answer is "because the vendor said it was good" or "we tested it once in a notebook." With ASB, the answer is "because we evaluated it against 12 compliance tasks on our infrastructure, it scored 88, here's the evidence path, and it's re-evaluated weekly for drift."

Every evaluation generates timestamped records with integrity checksums, model version fingerprints, complete task inputs/outputs/rubrics, and routing policy diffs. That's the documentation 21 CFR Part 11 and GxP frameworks require.
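The shape of such a record can be sketched in a few lines with standard hashing. Field names below are illustrative, not ASB's schema; the point is that the checksum covers the canonical form of everything an auditor would ask about:

```python
# Sketch of a timestamped, checksummed evaluation record.
import hashlib
import json
from datetime import datetime, timezone

def make_record(model: str, task_id: str, inputs: str, output: str, score: int) -> dict:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_fingerprint": hashlib.sha256(model.encode()).hexdigest()[:16],
        "task_id": task_id,
        "inputs": inputs,
        "output": output,
        "score": score,
    }
    # Integrity checksum over the canonical JSON form of the record.
    canonical = json.dumps(record, sort_keys=True).encode()
    record["checksum"] = hashlib.sha256(canonical).hexdigest()
    return record

rec = make_record("qwen-3.5-122b@Q4_K_M", "compliance_007", "prompt...", "answer...", 62)
```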


AutoScienceBench is in private development.

Early access for organizations running hybrid local/cloud scientific AI infrastructure.

Request Early Access →
Methodology informed by

autoresearch (Karpathy, 2026) · HELM (Stanford CRFM) · BioASQ

Compatible with

Ollama · vLLM · LiteLLM · Portkey · Exo · OpenAI API · any OpenAI-compatible endpoint