AutoScienceBench evaluates every AI model on your infrastructure against real scientific workflows — then generates a deployable routing policy backed by evidence, not intuition. One command. Full evaluation suite. Results by morning.
Systematically tests model variants, quantization levels, prompt strategies, and context window configurations against your full task suite. Every combination scored. Every result logged.
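In practice the sweep is a cross-product over configuration axes. A minimal sketch, assuming a hypothetical `evaluate()` scorer and illustrative model names and axes (the real ones come from your deployment):

```python
from itertools import product

# Illustrative configuration axes -- the real values come from your deployment.
models = ["llama-3.1-70b", "qwen2.5-72b"]
quantizations = ["fp16", "int8", "int4"]
prompt_strategies = ["zero-shot", "few-shot", "cot"]
context_windows = [8192, 32768]

def evaluate(model, quant, prompt, ctx):
    """Hypothetical scorer: run the full task suite for one configuration
    and return an aggregate score. Stubbed out here for illustration."""
    return 0.0

results = []
for model, quant, prompt, ctx in product(models, quantizations,
                                         prompt_strategies, context_windows):
    # Every combination scored, every result logged.
    results.append({
        "model": model, "quant": quant, "prompt": prompt, "ctx": ctx,
        "score": evaluate(model, quant, prompt, ctx),
    })
```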
Per-task scoring breakdowns. Head-to-head comparisons with statistical significance testing. A new routing_policy.json with a full diff showing exactly what moved and why.
New model release? Different quantization? Hardware upgrade? Same methodology, same task suite, new results. Your routing policy is never stale.
ASB doesn't deploy automatically. It proposes changes with evidence and waits for review. Scientists make the calls.
Most hybrid AI deployments route by data sensitivity — confidential stays local, public goes to cloud. That's a security policy, not a quality policy. Nobody has measured which model is actually better at which scientific task on the infrastructure they already own.
50 tasks. Expert-validated references. Scored on the target hardware. Neither model wins everywhere.
The local model beats cloud at sequence analysis by 13 points — at zero cost. The cloud model dominates experiment design by 35 points. Routing everything to one model wastes either money or quality. AutoScienceBench generates the routing config that uses each model where it's strongest.
Multiple-choice biology trivia tells you nothing about whether a model can design a construct library, interpret dose-response curves, or identify confounders in HTS data.
LeetCode is irrelevant when you need a BioPython pipeline that parses GenBank annotations, aligns against a custom reference, and flags frameshifts in a specific reading frame.
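For a sense of scale, the GenBank-parsing step alone looks roughly like the BioPython fragment below; it is a sketch with an illustrative file path, and the alignment and full frameshift-calling logic are omitted:

```python
from Bio import SeqIO

# Walk CDS annotations in a GenBank file (the path is illustrative).
for record in SeqIO.parse("construct_library.gb", "genbank"):
    for feature in record.features:
        if feature.type != "CDS":
            continue
        gene = feature.qualifiers.get("gene", ["unknown"])[0]
        cds = feature.extract(record.seq)
        # A CDS length that is not a multiple of 3 is a cheap first signal
        # of a possible frameshift; a real check needs the alignment step.
        if len(cds) % 3 != 0:
            print(f"{record.id}\t{gene}\tCDS length {len(cds)} not divisible by 3")
```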
Answering PubMed questions is a solved problem. The hard part is multi-step pipelines where the model decides what to search, how to filter, and which contradictions to flag.
Sequence analysis. Multi-step experiment design. Literature synthesis with contradiction detection. 21 CFR Part 11 compliance. Agentic pipelines with 5+ chained tool calls. Measured on your infrastructure.
Each layer tests a different dimension. Together they produce a complete quality-per-dollar profile for every model on a given infrastructure.
Automated screening using curated GPQA and BioASQ subsets. Filters out models that don't meet a minimum bar of domain competency before expensive evaluation begins.
50+ tasks across six categories — sequence analysis, literature retrieval, experiment design, data analysis, regulatory compliance, and agentic workflows. Each task has expert-validated reference answers with structured scoring rubrics. This is the core.
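The shape of a single task record, with hypothetical field names (not ASB's actual schema), looks roughly like this:

```python
task = {
    "id": "seq-analysis-017",                   # illustrative identifier
    "category": "sequence_analysis",
    "prompt": "Identify all ORFs in the attached plasmid map and ...",
    "reference_answer": "Three ORFs: ...",      # expert-validated, elided here
    "rubric": [
        {"criterion": "all orfs identified",    "weight": 0.5},
        {"criterion": "correct reading frames", "weight": 0.3},
        {"criterion": "promoter context noted", "weight": 0.2},
    ],
}

def satisfied(output: str, criterion: dict) -> bool:
    """Stub check; in practice each criterion is graded by an expert or a grader model."""
    return criterion["criterion"] in output.lower()

def rubric_score(output: str, task: dict) -> float:
    """Weighted sum over satisfied rubric criteria."""
    return sum(c["weight"] for c in task["rubric"] if satisfied(output, c))
```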
End-to-end research pipelines requiring 5+ chained actions: search literature, extract data, run analysis, generate hypotheses, draft protocols. Evaluates tool use accuracy, error recovery, and reasoning coherence across multi-step sequences.
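Schematically, a Layer 3 task is a chain in which each step consumes the previous step's output, so a single early error propagates to everything downstream. The stubs below are hypothetical stand-ins for real tool calls:

```python
# Hypothetical tool stubs; real pipelines call search APIs, parsers, and stats code.
def search_literature(query):   return ["paper-001", "paper-002"]
def extract_data(papers):       return [{"dose": 1.0, "response": 0.42}]
def run_analysis(table):        return {"ec50": 0.9}
def generate_hypotheses(stats): return ["EC50 shift suggests off-target binding"]
def draft_protocol(hypothesis): return f"Protocol draft testing: {hypothesis}"

def research_pipeline(question: str) -> str:
    papers = search_literature(question)        # step 1: retrieval
    table = extract_data(papers)                # step 2: structured extraction
    stats = run_analysis(table)                 # step 3: analysis
    hypotheses = generate_hypotheses(stats)     # step 4: hypothesis generation
    return draft_protocol(hypotheses[0])        # step 5: protocol drafting
```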
Blind expert evaluation by domain scientists. Outputs from different models presented without attribution. Produces Elo ratings and labeled preference data suitable for fine-tuning. Reserved for top performers from Layers 1–3.
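The Elo bookkeeping behind those ratings is standard; a minimal version, with the usual K-factor of 32 as a common but arbitrary choice:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one blind pairwise preference."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))
```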
Throughput, latency (p50/p95/p99), memory footprint, cost-per-request — measured on your hardware, your quantization, your inference stack. A model that scores 90 but runs at 3 tok/s on your cluster isn't useful.
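The percentile bookkeeping itself is the easy part; a sketch with a hypothetical `send_request()` standing in for your real inference client:

```python
import statistics
import time

def send_request(prompt: str) -> None:
    """Hypothetical call to your inference endpoint; replace with a real client."""
    time.sleep(0.01)  # stand-in for model latency

latencies = []
for prompt in ["benchmark prompt"] * 200:       # illustrative workload
    t0 = time.perf_counter()
    send_request(prompt)
    latencies.append(time.perf_counter() - t0)

cuts = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50*1000:.1f} ms  p95={p95*1000:.1f} ms  p99={p99*1000:.1f} ms")
```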
Each task category maps to the model with the best quality-adjusted cost-performance on your infrastructure. Security overrides are preserved — restricted data stays local regardless of quality score.
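Rendered as a Python dict, a hypothetical routing_policy.json (model names, scores, and costs are illustrative) reduces to a category-to-model map plus overrides that always win:

```python
# Illustrative policy; the real file is generated from your evaluation results.
routing_policy = {
    "sequence_analysis":    {"model": "local/llama-3.1-70b",  "score": 87, "cost_per_1k_req": 0.0},
    "experiment_design":    {"model": "cloud/frontier-model", "score": 91, "cost_per_1k_req": 14.0},
    "literature_synthesis": {"model": "cloud/frontier-model", "score": 84, "cost_per_1k_req": 11.0},
}
security_overrides = {"restricted": "local/llama-3.1-70b"}   # always wins over quality score

def route(task_category: str, data_classification: str) -> str:
    """Pick a model for one request; restricted data never leaves local infrastructure."""
    if data_classification in security_overrides:
        return security_overrides[data_classification]
    return routing_policy[task_category]["model"]
```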
Plugs directly into any OpenAI-compatible gateway — LiteLLM, Portkey, or custom. Re-run when you add a model, change quantization, or scale hardware.
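From the application side, nothing changes except which model name goes in the request; a sketch using the standard OpenAI Python client pointed at a gateway, with a placeholder URL, key, and model name:

```python
from openai import OpenAI

# Placeholder gateway URL and key; any OpenAI-compatible endpoint works the same way.
client = OpenAI(base_url="http://your-gateway:4000/v1", api_key="sk-local")

response = client.chat.completions.create(
    model="local/llama-3.1-70b",   # whichever model the routing policy picked for this category
    messages=[{"role": "user", "content": "Annotate the ORFs in this plasmid sequence: ..."}],
)
print(response.choices[0].message.content)
```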
Auditors will eventually ask: "Why did this model handle that regulatory submission?" Without ASB, the answer is "because the vendor said it was good" or "we tested it once in a notebook." With ASB, the answer is "because we evaluated it against 12 compliance tasks on our infrastructure, it scored 88, here's the evidence path, and it's re-evaluated weekly for drift."
Every evaluation generates timestamped records with integrity checksums, model version fingerprints, complete task inputs/outputs/rubrics, and routing policy diffs. That's the documentation 21 CFR Part 11 and GxP frameworks require.
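A minimal sketch of such a record, assuming hypothetical field names, with the checksum computed over the canonicalized record body:

```python
import datetime
import hashlib
import json

def audit_record(task: dict, output: str, score: float, model_fingerprint: str) -> dict:
    """Hypothetical evaluation record: timestamped, fingerprinted, content-addressed."""
    body = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_fingerprint": model_fingerprint,   # e.g. model name plus weights hash
        "task": task,                             # complete inputs and rubric
        "output": output,
        "score": score,
    }
    canonical = json.dumps(body, sort_keys=True).encode()
    body["checksum"] = hashlib.sha256(canonical).hexdigest()  # integrity checksum
    return body
```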
Early access for organizations running hybrid local/cloud scientific AI infrastructure.
Request Early Access →
autoresearch (Karpathy, 2026) · HELM (Stanford CRFM) · BioASQ
Ollama · vLLM · LiteLLM · Portkey · Exo · OpenAI API · any OpenAI-compatible endpoint