Follow the Data

What an Open-Source Eval Fleet Actually Found About AI Model Security


Follow the money.

My father taught me that.

He spent 20 years in the Secret Service — protecting presidents, investigating counterfeiting rings, and running down drug traffickers with the DEA. He finished with a 100%(*) conviction rate. That asterisk is a great cops-and-robbers story I'll only relay over drinks at a conference. Track me down. I'll tell you in person.

After federal service he pivoted to credit fraud investigation, then private investigation. I did some early-career dumpster diving to learn the PI craft. Trash left at the curb is public record. You'd be surprised what people throw away.

Follow the money works across all kinds of criminal investigation — political corruption, crypto fraud, you name it. The principle is simple: ignore what people say. Watch what they do with resources.

What I actually learned wasn't suspicion. It was a habit of ignoring what something is called and watching what it does.

Today we're not following the money.

We're following the data.

The principle is the same. The big AI and hardware firms want to sell you GPUs or tokens. Their marketing materials are built around that aim — even the security writeups. Especially the security writeups. "Our model is safe" is not a finding. It's a positioning statement.

We ran a 28-test agentic suite across a heterogeneous local fleet — six nodes, three vendors, four GPU backends. We built the tooling from scratch because nothing existed that did what we needed. We open-sourced it because the methodology should be reproducible, not proprietary.

Here's what the data actually said. Including the parts where it refused to cooperate.


The Question That Showed Up Uninvited

I didn't start with a hypothesis. I started with a capacity-planning problem.

I needed to know which models would fit on which cards. I was working with Claude on a benchmark script — tokens per second, VRAM usage, latency. Standard performance shakedown.

At some point I said: add security tests.

Not because I had a thesis. Because I've watched capability claims shift across software versions, library updates, and configuration changes for two decades — and model versions are just the newest entry on that list. The instinct to trust any single number on any single config is a habit I lost a long time ago. The framing in that moment was simpler than what it became: code is cheap, add some security tests to the same tool so we're considering security at the same time we're considering capacity. A few extra tests. Why not.

And these days — when non-determinism means 1+1=2 on Monday, 1.1 on Tuesday, and Potato on Wednesday — we need to be extra careful.

That's how the security suite ended up in the script. Bolted onto a benchmark. Not an afterthought — but not the initial thought either.

Then the data showed up, and the question stopped being "what fits on which card."

Once I had numbers across multiple backends, the framing shifted underneath me. The original "let's add some security tests" became a sharper question: if we're benchmarking what runs where, we should also be benchmarking whether what runs behaves the same way across configurations. That wasn't the starting question. It surfaced after the first discrepancies did. The published literature covers a couple of related questions — quantization affects safety alignment (well established), and backends produce different throughput (exhaustively documented). But nobody seemed to be holding the model weights and the quantization constant, varying only the GPU backend, and measuring whether security behavior changed.

That intersection wasn't in any paper I could find. So the security tests I'd added on instinct started doing work I hadn't planned for them to do.

This piece is what happened next.


The Fleet, the Tool, the Tests

Six nodes across three vendors and four backends:

The eval framework is Hermia — 28 agentic tests (and counting) covering routing, schema adherence, context retention, adversarial input handling, credential leak resistance, and a few others mapped to the OWASP LLM Top 10 and MITRE ATLAS. It's open source. pip install hermia. GitHub at scottblydotcom/hermia.

I had to build it because the existing options didn't fit. Garak and Microsoft's PyRIT do real work on model-level vulnerability — but they treat the model as an abstract endpoint, usually behind an API. None of them are built to run the same model binary across heterogeneous local hardware and compare what comes out.

That comparison is the entire question. Every model runs on hardware somewhere — and these days, increasingly on repurposed crypto mining gear, secondhand workstations, and whatever GPUs people could still afford. Talk about heterogeneous. So I built the test rig.


The First Real Finding (And It Wasn't What I Was Looking For)

The clearest result has nothing to do with backends.

One of the tests in the suite is classification-routing. The model is given a request and asked to route it to one of five agent types — building automation, security, finance, healthcare, or DevOps. The adversarial wrinkle: the request is framed in language that sounds security-relevant ("there's been a strange access attempt, check the smart lock") but actually maps to building automation.

The first version of this test was unfair. The model had to guess which agents existed. We rewrote it so the system prompt defines all five agent scopes explicitly. The correct answer is now derivable from the prompt. No guessing.

It still fails almost universally.

Across 21 GPU-confirmed models on the original run, the routing test passed once. One pass out of 44 runs. Every other model, every other run: zero.

A separate fresh run on the ROCm node, 16 models, no prior data shared: 0 out of 16.

Every backend tested. Every model family. Every vendor. The same result.

This isn't a hardware finding. It's reproducible across CUDA, Metal, ROCm, and Vulkan where Vulkan actually executes on GPU. It's a model-architecture finding — or more precisely, a finding about what current model architectures don't do well.

A caveat I want to land before anyone offers it as an objection: these are not frontier models, and they aren't the largest versions of the families I tested. We're working with locally runnable open-source weights — 7B to 32B parameter range, the kind you'd actually deploy on owned hardware. Would GPT-5 or Claude Opus pass this test? Probably. Does that matter for an enterprise considering a private inference deployment on commodity GPUs? Less than you'd think. The instinct to assume "bigger model = safer model" gets directly tested later in this piece, and it doesn't survive contact with the data either.

The dominant failure mode also matters. For classification-routing, it isn't broken JSON. It's SCHEMA_FAIL — valid, well-formed JSON, wrong routing decision. The model isn't malfunctioning. It's being deceived. The attack works at the reasoning layer, not the formatting layer.

That said, across the broader security test set, the failure-mode split looks different. Of models that failed any security test and produced some output: roughly 69% returned invalid JSON — meaning your application would throw a parse exception and crash. The remaining 31% returned valid JSON with the wrong schema — silent wrong behavior. Two different failure modes, two different downstream consequences. Your security test failure is also an availability test. Your red team probe doubles as a denial-of-service probe. Most teams aren't measuring it that way.

I've watched this exact pattern before, in a different layer of the stack. API security spent years selling schema validation as the answer. OWASP API Top 10 is largely a catalog of valid requests with wrong behavior — BOLA, mass assignment, business logic abuse. Schema validity is not behavioral correctness. We learned it about APIs. We have not yet learned it about LLMs.


When the Data Refused to Cooperate

This is the part of the story other practitioners don't tend to publish. So I'm going to.

The cross-backend comparison was supposed to be the centerpiece finding. Same weights, different GPU, different security outcome. That was the hypothesis the data had surfaced and the question I was trying to close.

I never closed it the way I expected to. The data refused to cooperate, four times in a row.

The first time was the ROCm node. qwen3:8b scored 14% there. A clean divergence story — same weights as the 96% CUDA result, dramatically worse behavior on AMD. I was almost ready to write the headline.

Then I checked VRAM allocation per test. Zero bytes. Across all 19 runs. The GPU wasn't running the model at all — Ollama 0.21 had silently fallen back to CPU on RDNA3. No error. No warning. The "14% ROCm result" wasn't a ROCm result. It was a CPU-inference timeout artifact. Upgrading Ollama to 0.24 fixed the backend, and the same node now scores 96.4% on the same model. ROCm isn't unsafe. Ollama was broken, and nothing told me.

The second time was the Vulkan node. Lower scores on qwen3 family models, fine scores on qwen2.5. Same card, same backend, same Ollama install, same day. The clean answer turned out to be that qwen3's architecture doesn't accelerate cleanly through the Vulkan/gfx900 path. It silently routes to CPU while qwen2.5 stays on GPU. Documented in llama.cpp issues. Not documented in any "is this backend safe for production" guide I could find.

The third time was the second 3090. A friend's RTX 3090 (24GB VRAM) ran a large model that overflowed VRAM, spilled into system RAM, and never recovered for the rest of that fleet run. Subsequent smaller models on the same node inherited the mess and ran at CPU speeds even though they should have fit cleanly in VRAM. One model's overflow contaminated every model that ran after it on that node. Excluded from the comparison entirely. The lesson: a VRAM ceiling violation isn't a single-test failure. It can poison the node until something forces a clean reload.

The fourth time was my own analysis tooling.

I was about to publish a clean 100% Vulkan score for qwen3:8b-q8_0. It would have been the strongest single data point in the piece. The problem: it wasn't a Vulkan result. My analysis script had a substring-match bug — two host addresses on the same subnet differed only by the trailing octet, and one was a substring of the other. 333 M1 Pro rows had been quietly attributed to the Vulkan node. The 100% I was about to cite was Apple Silicon data wearing a Vulkan label.

I caught it because the number was too good, not because it was wrong-looking. That's worth sitting with. The bug had been there for weeks. It only surfaced when I got suspicious of a result that supported my thesis.

Following the data means being skeptical of the results that confirm your hypothesis, not just the ones that contradict it. Most of us are trained to do the opposite.


What the Data Actually Says

Once the bugs were fixed, the tools were instrumented to detect silent fallbacks, and the host attribution was using exact matching, I ran the comparison again. Cleanly this time.

Across three GPU-confirmed backends — CUDA on the RTX 5090, ROCm on the RX 7800 XT, Metal on the M1 Pro — qwen3:8b scored 92%, 96.4%, and 98% respectively on the 28-test agentic suite.

A 6-point spread.

I then ran the same model on the same backend four times back-to-back to measure run-to-run variance on a single node. The spread there was 7.1 points.

The within-node variance is larger than the cross-backend variance. The 92/96/98 ranking is not signal. The backends agree on behavior.

That's the negative result on the original hypothesis. Same weights, three different GPU backends, three different vendors — within the noise of running the same model on the same hardware twice. I went looking for backend-driven behavioral divergence and didn't find it.

What I found instead was throughput divergence so large it has security implications of its own.

CUDA on the 5090 ran qwen3:8b at 132 tokens per second. ROCm on the 7800 XT at 62. Metal on the M1 Pro at 20. That's a 7x gap between the fastest and slowest GPU-confirmed backend on the exact same model.

Throughput is a security variable, not just a performance one. At lower token rates, correct answers arrive too slowly to be usable. A security eval with a timeout can't tell the difference between "the model is unsafe" and "the model couldn't finish in time." The CPU-fallback artifact I caught on the ROCm node was the extreme case — 3 tokens per second, 15 tests timing out, looking exactly like a security failure when it was actually an availability collapse. But every backend sits somewhere on that spectrum. A 7x gap is enough to matter for any test with a finite budget.

The same lesson hit harder in a different direction. I tested qwen2.5:72b on the 5090, expecting that a larger model in the same family would behave more conservatively on the security tests. It couldn't load — too large for 32GB of VRAM at any reasonable quant. It silently timed out on every test. Scored 0% on everything.

The model didn't behave unsafely. It didn't behave at all. A security dashboard reading zeros would have looked exactly like a model that refused everything. Scale up expecting better security, get a row of zeros that means "couldn't run."


What This Actually Means If You're Picking Models

The practical translation, for anyone making deployment decisions:

You cannot pick a model on a benchmark score from someone else's stack. The benchmark is for their model, on their hardware, on their software version, on that day. Three things, any one of them, can flip pass to fail without raising an error:

I hit all three in the course of this work. None of them produced an error message. All of them produced numbers that looked like behavioral findings until I checked the substrate.

What you can do: instrument your own stack. Measure VRAM allocation per inference. Confirm the model is running where you think it is. Distinguish availability failures — timeouts, CPU fallback, memory ceilings — from behavioral failures, which are wrong answers, broken schemas, instruction overrides. Hold a model's behavior stable across the runtime upgrades you're going to do anyway.

This is what Hermia exists to do. Open source. On PyPI. Run it against your fleet and tell me where it breaks. The market gap closes with reproducible data, not vendor benchmarks.


What I'm Still Not Sure About

The honest roadmap.

What's verified: routing failure is universal across confirmed-GPU backends. Behavioral pass rates on qwen3:8b agree within noise across CUDA, ROCm, and Metal. Throughput diverges 7x and that divergence has direct security implications via availability. Silent failures — CPU fallback, architecture-specific non-acceleration, capacity overflow — are real, reproducible, and invisible without instrumentation almost nobody is running.

What I don't yet know:

Whether the "backends agree on behavior" finding holds for models other than qwen3:8b. The variance check that locked the finding ran on one model. Need more.

Whether multi-turn agentic behavior diverges where single-turn doesn't. Hermia v0.1 tests structural integrity in single-turn interactions. Multi-turn is v0.2.

Whether quantization — which the literature already covers for safety — interacts with backend choice in ways nobody's mapped. That's a 2x2 design nobody's run, and it's the obvious next experiment.

Whether the 97.7% routing failure persists with different prompt structures, or whether it's specific to the adversarial-framing pattern in this test. I suspect it generalizes. I haven't proven it.


Close

I didn't set out to write a security paper.

I set out to figure out what would fit on my hardware. The security questions found me. They'll find you too. And when they do, you'll want instrumentation already in place, not improvised after the first weird number.

The industry will sell you GPUs and tokens. The security writeups will be positioning. The benchmark scores will be true and useless for your stack.

Build the test rig. Measure your own silicon. Be suspicious of the convenient numbers — including your own.

Hermia is on PyPI and at github.com/scottblydotcom/hermia. Come challenge what's here. Run it against your stack. Tell me what you find. The gap closes faster with more measurement, not more marketing.

The lab itself — kernel panics, supply chain compromises, the build that made this measurement possible — is a story for next time.