Does 4-bit Quantization Cost You Quality? Measuring NVFP4 Mistral-Small-4

A follow-up to my write-up Serving Mistral-Small-4-119B with vLLM on the DGX Spark, which covered the install and configuration. This one is about quality.

In my earlier post, I walked through getting Mistral-Small-4-119B in NVFP4 to serve reliably on a single DGX Spark (GB10, 128 GB unified memory) under vLLM — the install, the systemd service, and the consumer-Blackwell (SM121) quirks you have to work around. The model runs; it answers.

But every time I share that setup, the same question comes back — and a reader put it to me directly: it serves, sure, but it's 4-bit — what did you give up on quality?

It's the right question, and most answers to it are vibes. So I stopped guessing and measured.

The problem with "just trust the recovery numbers"

The official NVFP4 checkpoint ships without published accuracy-recovery figures against its BF16 base. And even when generic recovery numbers exist for a quantized model, they rarely answer the question that actually matters to you: does it hold for my task — text analysis, in French, on medium-sized contexts?

The only honest answer is a task-grounded A/B on the real use case.

The setup (with one caveat stated loudly, not buried)

Two endpoints, identical prompts, identical settings (temperature 0, reasoning off):

Candidate — the local NVFP4 model served by vLLM.
Reference — mistral-small-latest via the Mistral API, standing in for "full precision."

That reference deserves a caveat I'd rather state up front than hide: it's the hosted API, not a controlled local BF16 run. A 119B model in BF16 is roughly 238 GB — it doesn't fit on one 128 GB Spark, so a pristine local baseline was never on the table. The API is the closest accessible full-precision reference, but it may apply its own serving optimizations. Read this as NVFP4 vs full-precision serving, not NVFP4 vs pristine weights.

The methodology

I split the evaluation into two families.

Gold-scored tasks, where a correct answer exists and no judge is needed: extractive QA from PIAF (a native French reading-comprehension set, MIT-licensed), scored by token-level F1; plus classification (exact match) and extraction (set F1). Deterministic, cheap, objective.

Open-ended analysis, where there is no single right answer: summarize, identify the key entities, list the verifiable claims, and so on, over ~1000-word French Wikipedia articles (CC BY-SA). These I scored with a blind pairwise judge (GPT-4.1) — with one detail that matters: each pair is judged twice, with the two answers swapped between position A and B. A verdict only counts if it is consistent across both orderings; otherwise it's a tie. That cancels the well-known position bias of LLM judges, and the judge never knows which answer is the quantized one.

Everything ran at temperature 0 and reasoning_effort=none — the settings I'd actually deploy for analysis work. 200 examples in total: 150 open-ended, 50 gold QA.

The results

Gold QA (50 items)

Metric	NVFP4 (local)	Full precision (API)
token-F1	0.34	0.35

A 0.01 gap — i.e. nothing. (The absolute level is low because PIAF rewards minimal answer spans while the models answer in full sentences; that's a metric artifact, not a quality signal. What matters is that the two are level.)

Open-ended analysis (150 prompts)

Verdict	Count (of 150)
NVFP4 preferred	50
Tie	60
Full precision preferred	40

The quantized model was actually preferred slightly more often than full precision.

Is that lead real? No — and here is the cleanest way I can show it. In a smaller pilot (40 analysis prompts), the reference had edged ahead, 13 wins to 8. Scaling to 150 reversed it: the quantized model led, 50 to 40. A lead that changes sign when you add data is the textbook fingerprint of noise.

The statistics agree. Of the 90 decisive verdicts (ties excluded), 50–40 sits about one standard deviation from an even split (p ≈ 0.34; the 95% interval on the quantized model's win share is [0.45, 0.66], comfortably spanning 0.50). There is no statistically significant difference in either direction.

Put another way: in 73% of analysis prompts, NVFP4 was judged at least as good as full precision — and the remaining 27% don't form a consistent edge for the reference.

Reading this honestly

"No significant difference" is not "proven identical." It means: with 200 examples and this judge, I could not detect a quality gap. That's a meaningful result, but it has edges worth naming:

The reference is the API, not a controlled BF16 (see above).
n = 200 is modest; a gap smaller than ~5–10 points could hide in the noise.
A single judge has its own preferences; two-sided ordering cancels position bias, not model-identity bias.
reasoning_effort = none only; reasoning mode could behave differently.
The analysis texts are general-encyclopedic (Wikipedia), not domain-specific. For your domain, re-run it on your texts.

The takeaway

For medium-context text analysis, NVFP4 Mistral-Small-4 on a single GB10 gives you full-precision-grade output with no quality tax I could measure — on the same single-GB10 setup from the install guide, a machine that sits under your desk and keeps every token local.

For anyone weighing local inference for sovereignty or data-residency reasons, that's the number that was missing. On this workload, the 4-bit discount looks free.

Reproduce it: The harness is open-source (MIT) — the gold scoring, the two-sided judge, and a builder that assembles the eval set from open-licensed French sources so you can reproduce it, or point it at your own data: github.com/haruni-net/llm-quant-ab. If you do, I'd be curious what you find.

Interested in a similar project? I'm available for machine learning engineering missions — LLM deployment, inference optimization, model evaluation. Feel free to get in touch.