Introduction
The model: Mistral Small 4
mistralai/Mistral-Small-4-119B-2603-NVFP4 is a remarkable hybrid model released by Mistral AI in March 2026. It unifies three model families into a single checkpoint — Instruct, Reasoning (formerly Magistral), and Devstral — providing exceptional versatility for general-purpose use.
Its MoE (Mixture of Experts) architecture makes it particularly efficient:
- 119 billion total parameters, but only 6.5 billion active per token
- 128 experts, with 4 active at each inference step
- Context window of 256k tokens
- Multimodal: accepts both text and image inputs
- Multilingual: English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, and more
- Configurable reasoning: switches between instant response mode and reasoning mode (test-time compute) on a per-request basis
- Apache 2.0 License: free for commercial use
This particular checkpoint is an NVFP4-quantized version (post-training activation quantization), created in collaboration with the vLLM and Red Hat teams via llm-compressor. This quantization significantly reduces the required memory while maintaining good performance — which is precisely what makes it servable on a DGX Spark with 128 GiB of unified memory.
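A back-of-envelope check on those memory savings. The NVFP4 layout assumed here (4-bit values plus one FP8 scale shared per 16-element block, i.e. ~4.5 bits per parameter) is an assumption for illustration, not a figure from the model card:

```python
# Rough NVFP4 weight-memory estimate.
# Assumption: 4-bit values plus one FP8 scale shared by every 16 parameters,
# i.e. ~4.5 bits per parameter on average.
params = 119e9
bits_per_param = 4 + 8 / 16
est_gib = params * bits_per_param / 8 / 2**30
print(f"~{est_gib:.0f} GiB")  # ~62 GiB for the quantized tensors alone
```

Unquantized layers (embeddings, norms) and metadata plausibly account for the gap to the ~66 GiB observed on disk.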
In terms of performance, Mistral Small 4 delivers a 40% reduction in end-to-end completion time in latency-optimized configuration, and 3x more requests per second in throughput-optimized configuration, compared to Mistral Small 3.
The hardware: DGX Spark
The DGX Spark is a compact but capable machine: it features an NVIDIA GB10 GPU (Blackwell SM121 architecture) with 128 GiB of unified CPU+GPU memory. This is enough to run this quantized 119B-parameter model locally, whose weights occupy ~66 GiB in memory.
This article documents the complete vLLM installation and the configuration choices for serving this model efficiently, detailing the issues encountered with the consumer Blackwell GPU and the solutions found.
Hardware context
| Component | Value |
|---|---|
| GPU | NVIDIA GB10 (Blackwell, SM121) |
| CUDA Architecture | 12.1 (12.1a for the consumer variant) |
| Unified Memory | 128 GiB |
| CUDA | 13.0 |
| CPU Architecture | aarch64 |
| System | Ubuntu 24 |
Important: the GB10 uses SM121 (consumer Blackwell) and not SM100 (datacenter Blackwell). This distinction is crucial for CUDA kernel compatibility.
1. Downloading the model
hf download mistralai/Mistral-Small-4-119B-2603-NVFP4
The model weighs ~66 GiB. It will be stored in ~/.cache/huggingface/hub/.
Storage note: if the model is on a USB drive or external SSD (~475 MB/s), expect ~10 minutes of loading time at each startup. An internal NVMe would be ideal, but the DGX Spark has limited NVMe space (~900 GiB, much of which is occupied by the system).
2. Installing vLLM
2.1 Clone the repository
The official vLLM repository does not yet fully support Mistral v15 parsing. We use a patched fork:
git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git vllm-mistral
cd vllm-mistral
2.2 Create the virtual environment
uv venv
source .venv/bin/activate
2.3 Install vLLM with precompiled binaries
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
2.4 Install PyTorch for CUDA 13
uv pip install --index-url https://download.pytorch.org/whl/cu130 torch==2.10.0+cu130
uv pip install --index-url https://download.pytorch.org/whl/cu130 torchvision
2.5 Install the latest version of Transformers
The Mistral v15 tokenizer requires a recent version of mistral_common, included via Transformers:
uv pip install git+https://github.com/huggingface/transformers.git
uv pip install --upgrade mistral_common
2.6 Install FlashInfer
FlashInfer provides optimized kernels for attention. For CUDA 13:
uv pip install flashinfer-python flashinfer-cubin
uv pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu130
3. Systemd service configuration
3.1 Service file
sudo systemctl edit --force --full vllm.service
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
User=sb
Group=sb
WorkingDirectory=/mnt/data/sb/projects/vllm-mistral
Environment="PATH=/mnt/data/sb/projects/vllm-mistral/.venv/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
Environment="TORCH_CUDA_ARCH_LIST=12.1a"
Environment="TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas"
Environment="FLASHINFER_JIT_LOG_LEVEL=ERROR"
Environment="TRANSFORMERS_VERBOSITY=error"
Environment="VLLM_SKIP_P2P_CHECK=1"
ExecStart=/mnt/data/sb/projects/vllm-mistral/.venv/bin/vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
--max-model-len 262144 \
--tensor-parallel-size 1 \
--attention-backend TRITON_MLA \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--reasoning-parser mistral \
--max-num-batched-tokens 16384 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.8 \
--no-enable-flashinfer-autotune \
--cudagraph-capture-sizes 1 2 4 8 16 32 64 128 256 \
--max-cudagraph-capture-size 256
Restart=always
RestartSec=15
StandardOutput=journal
StandardError=journal
SyslogIdentifier=vllm
MemoryMax=120G
MemorySwapMax=0
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
3.2 Verification
# Real-time logs
journalctl -u vllm -f
# Verify the model is being served
curl http://localhost:8000/v1/models
4. Configuration choices explained
--attention-backend TRITON_MLA
The GB10 (SM121) is not compatible with FlashAttention or FlashInfer MLA, which are compiled for SM100 (datacenter Blackwell). Without this option, vLLM crashes with cudaErrorIllegalInstruction.
Triton recompiles the attention kernels on the fly for SM121, which solves the problem.
--no-enable-flashinfer-autotune
This is the most impactful optimization for startup time. Without this flag, the FlashInfer autotuner tests dozens of MoE tactics compiled for SM120, all of which fail on SM121. On the 2nd startup, this caused an additional delay of 1300 seconds (~22 minutes).
With --no-enable-flashinfer-autotune, this delay disappears entirely:
Skipping FlashInfer autotune because it is disabled.
--cudagraph-capture-sizes 1 2 4 8 16 32 64 128 256
By default, vLLM captures CUDA graphs for dozens of batch sizes (up to 512 in increments of 8). For interactive use with a single user, all these sizes are unnecessary. Reducing to 9 sizes (powers of 2) drastically cuts capture time.
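The difference can be counted directly. The default-like schedule below is an assumption reconstructed from the description above, not vLLM's exact internal list:

```python
# Sizes actually captured with the flag above: powers of two up to 256.
reduced = [2**i for i in range(9)]
# Assumed default-like schedule: small sizes, then up to 512 in steps of 8.
default_like = [1, 2, 4] + list(range(8, 513, 8))
print(len(reduced), "vs", len(default_like), "graphs to capture")
```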
--max-cudagraph-capture-size 256
Limits the maximum captured batch size; 256 comfortably covers the --max-num-seqs 128 limit.
--max-model-len 262144
With 128 GiB of unified memory (~66 GiB used by the weights), the memory available for the KV cache is limited (~13 GiB). Serving the model's full 256K context window (262144 tokens) still fits within that budget, so no reduction is needed here.
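A quick budget check on this choice, using the ~13 GiB KV-cache figure above. Whether the per-token cost actually fits depends on the model's MLA layout (layers, head dimensions), which is not detailed here:

```python
# Bytes of KV cache available per token if a single sequence
# uses the entire 262144-token window.
kv_cache_bytes = 13 * 2**30
max_model_len = 262_144
per_token = kv_cache_bytes / max_model_len
print(f"~{per_token / 1024:.0f} KiB per cached token")  # ~52 KiB/token
```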
--gpu-memory-utilization 0.8
80% of GPU memory is allocated for the model + KV cache. The remaining 20% is used for CUDA graphs and runtime overhead.
VLLM_SKIP_P2P_CHECK=1
Skips a P2P check that can take ~60 seconds on certain single-GPU configurations.
TORCH_CUDA_ARCH_LIST=12.1a
Tells PyTorch and Triton to compile kernels for SM121 (the a suffix designates the consumer GB10, as opposed to datacenter parts like the B200).
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
Points Triton to the CUDA 13 PTX compiler. Without this, Triton may use an incompatible version.
5. Startup performance
Here are the measured times for a startup with all optimizations active and a warm torch.compile cache:
| Step | Duration |
|---|---|
| Weight loading (13 shards, USB SSD) | 606s |
| torch.compile (cache hit) | 4s |
| Initial warmup | 3s |
| CUDA graph capture (9 sizes) | 13s |
| init engine total | 104s |
| Total startup | ~12 min 27s |
10:13:14 → service start
10:23:53 → Loading weights took 606.35 seconds
10:25:13 → torch.compile took 3.96 s (cache hit)
10:25:38 → Graph capturing finished in 13 secs
10:25:39 → init engine took 103.80 seconds
10:25:41 → Application startup complete
The torch.compile cache is automatically stored in ~/.cache/vllm/torch_compile_cache/. From the 2nd startup onward, compilation is near-instantaneous (~4s instead of ~15s).
Weight loading from a USB SSD (~475 MB/s) accounts for the majority of startup time (~10 min) and cannot be reduced without changing the storage.
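The measured numbers above imply an effective read throughput well below the drive's sequential peak, which is typical when many shards are loaded with deserialization in the loop:

```python
# Effective throughput implied by loading ~66 GiB of weights in 606 s.
weights_gib = 66
load_seconds = 606
mib_per_s = weights_gib * 1024 / load_seconds
print(f"~{mib_per_s:.0f} MiB/s")  # well under the ~475 MB/s sequential rating
```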
6. Quick test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
"messages": [{"role": "user", "content": "Explain what an LLM is in 3 sentences."}],
"max_tokens": 256
}'
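The same request can be issued from Python using only the standard library. `build_chat_request` is a hypothetical helper written for this article, not part of vLLM:

```python
import json
import urllib.request  # needed for the commented-out send below

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = json.dumps(build_chat_request("Explain what an LLM is in 3 sentences."))
# Sending it requires the server from section 3 to be running:
#   req = urllib.request.Request(
#       "http://localhost:8000/v1/chat/completions",
#       data=body.encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```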
7. Web interface with Open WebUI
For a graphical interface accessible from another machine:
docker run -d \
--network=host \
--name open-webui \
-e OPENAI_API_BASE_URL=http://localhost:8000/v1 \
-e OPENAI_API_KEY=none \
-v open-webui:/app/backend/data \
--restart always \
ghcr.io/open-webui/open-webui:main
Interface accessible at http://<IP_DGX>:8080.
8. Known issues
PyTorch warning on SM121
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)
This warning is harmless. PyTorch 2.10+cu130 was officially compiled for SM 8.0–12.0, but works correctly on SM 12.1 thanks to recompilation with TORCH_CUDA_ARCH_LIST=12.1a.
safetensors repo_utils error
ERROR: 'mistralai/Mistral-Small-4-119B-2603-NVFP4' is not a safetensors repo.
This error message is normal: this model uses Mistral's consolidated.safetensors format rather than the standard HuggingFace format. vLLM handles this correctly via fallback.
Tensorizer incompatible
Tensorizer (which could reduce loading time) is incompatible with NVFP4 / compressed-tensors quantized models. This avenue should be abandoned for this model.
Conclusion
Running a 119B-parameter model locally on a DGX Spark is entirely feasible, provided you work around a few quirks of the consumer Blackwell GB10 GPU: Triton-compiled attention kernels instead of FlashAttention/FlashInfer MLA, FlashInfer autotuning disabled, and a trimmed CUDA graph capture schedule.
The irreducible startup time remains dominated by weight loading (~10 min on a USB SSD), but once running, the server stays stable and performant for personal or team use.