Introduction
The model: Mistral Small 4
mistralai/Mistral-Small-4-119B-2603-NVFP4 is a remarkable hybrid model released by Mistral AI in March 2026. It unifies three model families into a single checkpoint — Instruct, Reasoning (formerly Magistral), and Devstral — providing exceptional versatility for general-purpose use.
Its MoE (Mixture of Experts) architecture makes it particularly efficient:
- 119 billion total parameters, but only 6.5 billion active per token
- 128 experts, with 4 active at each inference step
- Context window of 256k tokens
- Multimodal: accepts both text and image inputs
- Multilingual: English, French, Spanish, German, Chinese, Japanese, Korean, Arabic, and more
- Configurable reasoning: switches between instant response mode and reasoning mode (test-time compute) on a per-request basis
- Apache 2.0 License: free for commercial use
This particular checkpoint is an NVFP4-quantized version (post-training activation quantization), created in collaboration with the vLLM and Red Hat teams via llm-compressor. This quantization significantly reduces the required memory while maintaining good performance — which is precisely what makes it servable on a DGX Spark with 128 GiB of unified memory.
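A back-of-envelope check on those memory savings. The NVFP4 layout assumed here (4-bit values plus one FP8 scale shared per 16-element block, i.e. ~4.5 bits per parameter) is an assumption for illustration, not a figure from the model card:

```python
# Rough NVFP4 weight-memory estimate.
# Assumption: 4-bit values plus one FP8 scale shared by every 16 parameters,
# i.e. ~4.5 bits per parameter on average.
params = 119e9
bits_per_param = 4 + 8 / 16
est_gib = params * bits_per_param / 8 / 2**30
print(f"~{est_gib:.0f} GiB")  # ~62 GiB for the quantized tensors alone
```

Unquantized layers (embeddings, norms) and metadata plausibly account for the gap to the ~66 GiB observed on disk.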
In terms of performance, Mistral Small 4 delivers a 40% reduction in end-to-end completion time in latency-optimized configuration, and 3x more requests per second in throughput-optimized configuration, compared to Mistral Small 3.
The hardware: DGX Spark
The DGX Spark is a compact but capable machine: it features an NVIDIA GB10 GPU (Blackwell SM121 architecture) with 128 GiB of unified CPU+GPU memory. This is enough to run this quantized 119B-parameter model locally, whose weights occupy ~66 GiB in memory.
This article documents the complete vLLM installation and the configuration choices for serving this model efficiently, detailing the issues encountered with the consumer Blackwell GPU and the solutions found.
Hardware context
| Component | Value |
|---|---|
| GPU | NVIDIA GB10 (Blackwell, SM121) |
| CUDA Architecture | 12.1 (12.1a for the consumer variant) |
| Unified Memory | 128 GiB |
| CUDA | 13.0 |
| CPU Architecture | aarch64 |
| System | Ubuntu 24 |
Important: the GB10 uses SM121 (consumer Blackwell) and not SM100 (datacenter Blackwell). This distinction is crucial for CUDA kernel compatibility.
1. Downloading the model
hf download mistralai/Mistral-Small-4-119B-2603-NVFP4
The model weighs ~66 GiB. It will be stored in ~/.cache/huggingface/hub/.
Storage note: if the model is on a USB drive or external SSD (~475 MB/s), expect ~10 minutes of loading time at each startup. An internal NVMe would be ideal, but the DGX Spark has limited NVMe space (~900 GiB, much of which is occupied by the system).
2. Installing vLLM
2.1 Clone the repository
The official vLLM repository does not yet fully support Mistral v15 parsing. We use a patched fork:
git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git vllm-mistral
cd vllm-mistral
2.2 Create the virtual environment
uv venv
source .venv/bin/activate
2.3 Install vLLM with precompiled binaries
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
2.4 Install PyTorch for CUDA 13
uv pip install --index-url https://download.pytorch.org/whl/cu130 torch==2.10.0+cu130
uv pip install --index-url https://download.pytorch.org/whl/cu130 torchvision
2.5 Install the latest version of Transformers
The Mistral v15 tokenizer requires a recent version of mistral_common, included via Transformers:
uv pip install git+https://github.com/huggingface/transformers.git
uv pip install --upgrade mistral_common
2.6 Install FlashInfer
FlashInfer provides optimized kernels for attention. For CUDA 13:
uv pip install flashinfer-python flashinfer-cubin
uv pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu130
3. Systemd service configuration
3.1 Service file
sudo systemctl edit --force --full vllm.service
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
User=sb
Group=sb
WorkingDirectory=/mnt/data/sb/projects/vllm-mistral
Environment="PATH=/mnt/data/sb/projects/vllm-mistral/.venv/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
Environment="TORCH_CUDA_ARCH_LIST=12.1a"
Environment="TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas"
Environment="FLASHINFER_JIT_LOG_LEVEL=ERROR"
Environment="TRANSFORMERS_VERBOSITY=error"
Environment="VLLM_SKIP_P2P_CHECK=1"
ExecStart=/mnt/data/sb/projects/vllm-mistral/.venv/bin/vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
--max-model-len 262144 \
--tensor-parallel-size 1 \
--attention-backend TRITON_MLA \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--reasoning-parser mistral \
--max-num-batched-tokens 16384 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.8 \
--no-enable-flashinfer-autotune \
--cudagraph-capture-sizes 1 2 4 8 16 32 64 128 256 \
--max-cudagraph-capture-size 256
Restart=always
RestartSec=15
StandardOutput=journal
StandardError=journal
SyslogIdentifier=vllm
MemoryMax=120G
MemorySwapMax=0
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
3.2 Verification
# Real-time logs
journalctl -u vllm -f
# Verify the model is being served
curl http://localhost:8000/v1/models
4. Configuration choices explained
--attention-backend TRITON_MLA
The GB10 (SM121) is not compatible with FlashAttention or FlashInfer MLA, which are compiled for SM100 (datacenter Blackwell). Without this option, vLLM crashes with cudaErrorIllegalInstruction.
Triton recompiles the attention kernels on the fly for SM121, which solves the problem.
--no-enable-flashinfer-autotune
This is the most impactful optimization for startup time. Without this flag, the FlashInfer autotuner tests dozens of MoE tactics compiled for SM120, all of which fail on SM121. On the 2nd startup, this caused an additional delay of 1300 seconds (~22 minutes).
With --no-enable-flashinfer-autotune, this delay disappears entirely:
Skipping FlashInfer autotune because it is disabled.
--cudagraph-capture-sizes 1 2 4 8 16 32 64 128 256
By default, vLLM captures CUDA graphs for dozens of batch sizes (up to 512 in increments of 8). For interactive use with a single user, all these sizes are unnecessary. Reducing to 9 sizes (powers of 2) drastically cuts capture time.
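The difference can be counted directly. The default-like schedule below is an assumption reconstructed from the description above, not vLLM's exact internal list:

```python
# Sizes actually captured with the flag above: powers of two up to 256.
reduced = [2**i for i in range(9)]
# Assumed default-like schedule: small sizes, then up to 512 in steps of 8.
default_like = [1, 2, 4] + list(range(8, 513, 8))
print(len(reduced), "vs", len(default_like), "graphs to capture")
```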
--max-cudagraph-capture-size 256
Limits the maximum captured batch size; 256 comfortably covers the --max-num-seqs 128 limit.
--max-model-len 262144
With 128 GiB of unified memory (~66 GiB used by the weights), the memory available for the KV cache is limited (~13 GiB). Serving the model's full 256K context window (262144 tokens) still fits within that budget, so no reduction is needed here.
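A quick budget check on this choice, using the ~13 GiB KV-cache figure above. Whether the per-token cost actually fits depends on the model's MLA layout (layers, head dimensions), which is not detailed here:

```python
# Bytes of KV cache available per token if a single sequence
# uses the entire 262144-token window.
kv_cache_bytes = 13 * 2**30
max_model_len = 262_144
per_token = kv_cache_bytes / max_model_len
print(f"~{per_token / 1024:.0f} KiB per cached token")  # ~52 KiB/token
```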
--gpu-memory-utilization 0.8
80% of GPU memory is allocated for the model + KV cache. The remaining 20% is used for CUDA graphs and runtime overhead.
VLLM_SKIP_P2P_CHECK=1
Skips a P2P check that can take ~60 seconds on certain single-GPU configurations.
TORCH_CUDA_ARCH_LIST=12.1a
Tells PyTorch and Triton to compile kernels for SM121 (the a suffix designates the consumer GB10, as opposed to datacenter parts like the B200).
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
Points Triton to the CUDA 13 PTX compiler. Without this, Triton may use an incompatible version.
5. Startup performance
Here are the measured times for a startup with all optimizations active and a warm torch.compile cache:
| Step | Duration |
|---|---|
| Weight loading (13 shards, USB SSD) | 606s |
| torch.compile (cache hit) | 4s |
| Initial warmup | 3s |
| CUDA graph capture (9 sizes) | 13s |
| init engine total | 104s |
| Total startup | ~12 min 27s |
10:13:14 → service start
10:23:53 → Loading weights took 606.35 seconds
10:25:13 → torch.compile took 3.96 s (cache hit)
10:25:38 → Graph capturing finished in 13 secs
10:25:39 → init engine took 103.80 seconds
10:25:41 → Application startup complete
The torch.compile cache is automatically stored in ~/.cache/vllm/torch_compile_cache/. From the 2nd startup onward, compilation is near-instantaneous (~4s instead of ~15s).
Weight loading from a USB SSD (~475 MB/s) accounts for the majority of startup time (~10 min) and cannot be reduced without changing the storage.
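The measured numbers above imply an effective read throughput well below the drive's sequential peak, which is typical when many shards are loaded with deserialization in the loop:

```python
# Effective throughput implied by loading ~66 GiB of weights in 606 s.
weights_gib = 66
load_seconds = 606
mib_per_s = weights_gib * 1024 / load_seconds
print(f"~{mib_per_s:.0f} MiB/s")  # well under the ~475 MB/s sequential rating
```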
6. Quick test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
"messages": [{"role": "user", "content": "Explain what an LLM is in 3 sentences."}],
"max_tokens": 256
}'
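The same request can be issued from Python using only the standard library. `build_chat_request` is a hypothetical helper written for this article, not part of vLLM:

```python
import json
import urllib.request  # needed for the commented-out send below

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = json.dumps(build_chat_request("Explain what an LLM is in 3 sentences."))
# Sending it requires the server from section 3 to be running:
#   req = urllib.request.Request(
#       "http://localhost:8000/v1/chat/completions",
#       data=body.encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```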
7. Web interface with Open WebUI
For a graphical interface accessible from another machine:
docker run -d \
--network=host \
--name open-webui \
-e OPENAI_API_BASE_URL=http://localhost:8000/v1 \
-e OPENAI_API_KEY=none \
-v open-webui:/app/backend/data \
--restart always \
ghcr.io/open-webui/open-webui:main
Interface accessible at http://<IP_DGX>:8080.
8. Known issues
PyTorch warning on SM121
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)
This warning is harmless. PyTorch 2.10+cu130 was officially compiled for SM 8.0–12.0, but works correctly on SM 12.1 thanks to recompilation with TORCH_CUDA_ARCH_LIST=12.1a.
safetensors repo_utils error
ERROR: 'mistralai/Mistral-Small-4-119B-2603-NVFP4' is not a safetensors repo.
This error message is normal: this model uses Mistral's consolidated.safetensors format rather than the standard HuggingFace format. vLLM handles this correctly via fallback.
Tensorizer incompatible
Tensorizer (which could reduce loading time) is incompatible with NVFP4 / compressed-tensors quantized models. This avenue should be abandoned for this model.
Conclusion
Running a 119B-parameter model locally on a DGX Spark is entirely feasible, provided you work around a few quirks of the consumer Blackwell GB10 GPU: Triton-compiled attention kernels instead of FlashAttention/FlashInfer MLA, FlashInfer autotuning disabled, and a trimmed CUDA graph capture schedule.
The irreducible startup time remains dominated by weight loading (~10 min on a USB SSD), but once running, the server stays stable and performant for personal or team use.