Benchmark · Generative AI

Image Generation on the DGX Spark: 11 Models Compared

I tested the power of this desktop mini supercomputer on image generation: FLUX, Stable Diffusion, Qwen, Z-Image — all with the same prompt, the same seed, the same conditions.

Sébastien Burel · haruni.net · January 2026

Why this test?

The NVIDIA DGX Spark is a rather exceptional machine: a desktop-sized AI supercomputer powered by the GB10 Grace Blackwell Superchip, which combines a 20-core ARM CPU and a Blackwell GPU in a single package, with 128 GB of unified memory shared between the two. All of it plugged into a standard wall outlet.

On paper, it promises to run very large models locally, without cloud, without subscriptions, without sending your data anywhere. I wanted to verify this in practice, on a concrete use case: text-to-image generation.

I wrote an interactive notebook in Marimo (a modern alternative to Jupyter) that lets you pick a model, enter a prompt, adjust parameters, and generate an image with a real-time progress bar. I then ran 11 different models under the exact same conditions to compare speed, quality, and memory consumption.

The hardware: NVIDIA DGX Spark, GB10 Grace Blackwell chip, 128 GB LPDDR5x unified CPU+GPU memory, 4 TB NVMe, 20 ARM cores, up to 1 petaFLOP of AI performance in FP4.

What exactly is the DGX Spark?

The DGX Spark is not a souped-up gaming PC. It's a machine designed from the ground up for artificial intelligence, in a compact form factor that fits on a desk. Its main feature: unified memory. The CPU and GPU share the same 128 GB pool — which means you can load very large models without worrying about VRAM, which is often the bottleneck on traditional setups.

128 GB Unified CPU+GPU memory
1 PFLOP AI performance (FP4)
20 ARM Cortex cores
273 GB/s Memory bandwidth
4 TB NVMe storage
240 W Max power draw

What's notable is that the limit here isn't memory (128 GB is considerable) but memory bandwidth: 273 GB/s shared between CPU and GPU, compared to several TB/s on a datacenter H100. That's the real bottleneck of this machine for intensive AI workloads.
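A back-of-envelope calculation (my own rough estimate, treating the model's peak memory as a proxy for the data each diffusion step has to re-read, which is a simplification) shows how quickly bandwidth becomes the floor:

```python
# Back-of-envelope: lower bound on generation time if every diffusion
# step must stream the full working set over the memory bus.
# Assumption: the working set is re-read once per step.

BANDWIDTH_GBS = 273  # DGX Spark memory bandwidth, GB/s

def bandwidth_floor_s(model_gb: float, steps: int = 50) -> float:
    """Minimum seconds to stream `model_gb` of data `steps` times."""
    return model_gb * steps / BANDWIDTH_GBS

# FLUX.1-dev peaks around 95 GB: streaming that 50 times already
# costs ~17 s before a single multiply-accumulate is done.
print(round(bandwidth_floor_s(95), 1))  # → 17.4
```

On an H100 with several TB/s of bandwidth, the same floor would be a few seconds, which is exactly the gap this machine trades away for capacity.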

The test notebook

Rather than running command-line scripts, I built an interactive interface in Marimo. The idea: pick a model from a dropdown, adjust parameters (number of steps, guidance scale, seed, aspect ratio) and generate images with real-time visual feedback. After each generation, the parameters are automatically saved to a JSON file for reproducibility and comparison.
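That JSON logging step can be sketched as follows (a minimal illustration on my side; the directory layout and field names are my own, not the notebook's actual schema):

```python
import json
import time
from pathlib import Path

def save_run(model_name: str, params: dict, out_dir: str = "runs") -> Path:
    """Write one JSON file per generation so that every image can be
    reproduced later from its exact parameters."""
    Path(out_dir).mkdir(exist_ok=True)
    record = {"model": model_name, "timestamp": time.time(), **params}
    fname = f"{int(record['timestamp'])}_{model_name.replace('/', '_')}.json"
    path = Path(out_dir) / fname
    path.write_text(json.dumps(record, indent=2))
    return path

# Example: log the fixed settings used throughout this benchmark.
p = save_run("stabilityai/stable-diffusion-3.5-medium",
             {"steps": 50, "guidance": 4.0, "seed": 42, "size": "1024x1024"})
```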

The core loading function is intentionally simple — the Hugging Face diffusers library does the heavy lifting:

```python
from diffusers import DiffusionPipeline
import torch

def loadModel(model_name):
    pipeline = DiffusionPipeline.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        local_files_only=True,
        use_safetensors=True,
    ).to("cuda")
    return pipeline
```

All models are loaded with the bfloat16 data type to balance precision and memory usage. The local_files_only flag ensures models are loaded from local disk, with no network calls during generation.

For the benchmark, I kept exactly the same parameters across every model: the same prompt, 1024×1024 resolution, 50 inference steps, a guidance scale of 4.0, and seed 42.
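Concretely, the fixed settings translate into a single kwargs dict reused for every pipeline (a sketch on my side; the notebook's actual plumbing may differ, and a few pipelines interpret guidance slightly differently):

```python
def benchmark_kwargs() -> dict:
    """Fixed settings shared by every model in this benchmark,
    shaped as keyword arguments for a diffusers pipeline call."""
    return {
        "prompt": ("A 35 years young woman in paris during spring. "
                   "Blond hairs, green eyes. 35mm, professional photographer."),
        "num_inference_steps": 50,
        "guidance_scale": 4.0,
        "width": 1024,
        "height": 1024,
    }

# With a loaded pipeline, the seed is fixed through a generator object:
# image = pipeline(**benchmark_kwargs(),
#                  generator=torch.Generator("cuda").manual_seed(42)).images[0]
```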

The 11 models head to head

Here's the full measurement table. Generation time is the pure diffusion time (excluding model loading). Memory indicates the peak GPU memory measured during generation.

| Model | Repository | Family | Loading | Generation | Memory |
|---|---|---|---|---|---|
| SD 3.5 medium | stabilityai/stable-diffusion-3.5-medium | SD 3.5 | 66 s | 34 s | 22 GB |
| Z-Image-Turbo | Tongyi-MAI/Z-Image-Turbo | Z-Image | 63 s | 178 s | 22 GB |
| Z-Image | Tongyi-MAI/Z-Image | Z-Image | 120 s | 179 s | 22 GB |
| SD 3.5 large | stabilityai/stable-diffusion-3.5-large | SD 3.5 | 171 s | 82 s | 29 GB |
| SD 3.5 large turbo | stabilityai/stable-diffusion-3.5-large-turbo | SD 3.5 | 171 s | 82 s | 29 GB |
| FLUX.2-klein-9B | black-forest-labs/FLUX.2-klein-9B | FLUX.2 | 266 s | 95 s | 66 GB |
| FLUX.1-schnell | black-forest-labs/FLUX.1-schnell | FLUX.1 | 291 s | 110 s | 72 GB |
| FLUX.1-dev | black-forest-labs/FLUX.1-dev | FLUX.1 | 284 s | 111 s | 95 GB |
| FLUX.1-Kontext-dev | black-forest-labs/FLUX.1-Kontext-dev | FLUX.1 | 289 s | 111 s | 66 GB |
| Qwen-Image-2512 | Qwen/Qwen-Image-2512 | Qwen | 427 s | 212 s | 63 GB |
| FLUX.2-dev (4-bit) | diffusers/FLUX.2-dev-bnb-4bit | FLUX.2 | 265 s | 397 s | 34 GB |
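Separating loading time from pure generation time comes down to a simple timing wrapper (a minimal sketch, not the notebook's exact code):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, results: dict):
    """Measure wall-clock time of a block and store it under `label`.
    Wrapping loading and generation separately yields the two columns
    reported in the table above."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

results = {}
with timed("generation", results):
    time.sleep(0.1)  # stand-in for pipeline(**kwargs)

# Peak GPU memory would be read after the same block with
# torch.cuda.max_memory_allocated(), after calling
# torch.cuda.reset_peak_memory_stats() beforehand.
print(f"{results['generation']:.1f}s")
```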

The generated images

All images below were generated with the same prompt, the same seed (42) and the same parameters. Only the model changes — making the comparison directly readable.

- SD 3.5 medium · 34 s · 22 GB: fast, lightweight, very photographic
- Z-Image-Turbo · 178 s · 22 GB: highly saturated green eyes, "illustration" style
- Z-Image · 179 s · 22 GB: natural rendering, blue sweater, spring vibes
- SD 3.5 large · 82 s · 29 GB: film grain, tight framing, very realistic
- SD 3.5 large turbo · 82 s · 29 GB: overexposed, warm tones, commercial style
- FLUX.2-klein-9B · 95 s · 66 GB: professional composition, crossed arms, centered Eiffel Tower
- FLUX.1-schnell · 110 s · 72 GB: Parisian café vibes, scarf, soft gaze
- FLUX.1-dev · 111 s · 95 GB: highly detailed, textured skin, golden light
- FLUX.1-Kontext-dev · 111 s · 66 GB: sunny street, clean and bright style
- Qwen-Image-2512 · 212 s · 63 GB: sober portrait, Seine riverbanks, older-looking rendering
- FLUX.2-dev (4-bit) · 397 s · 34 GB: cherry blossoms, natural, but 397 s of waiting!

Prompt: "A 35 years young woman in paris during spring. Blond hairs, green eyes. 35mm, professional photographer." — 1024×1024, 50 steps, guidance 4.0, seed 42.

Benchmark takeaways

⚡
Fastest
SD 3.5 medium

34 seconds of generation, 22 GB of memory. Remarkable speed/quality ratio.

🐢
Slowest
FLUX.2-dev 4-bit

397 seconds — paradoxical for a quantized version meant to be lighter.

🧠
Most memory-hungry
FLUX.1-dev: 95 GB

Impossible to run on a setup with less than 80-90 GB of VRAM.

🪶
Lightest
Z-Image & SD 3.5 medium

Only 22 GB — accessible on far less expensive setups.

4-bit quantization: not always a win

The FLUX.2-dev-bnb-4bit checkpoint compresses the FLUX.2-dev weights to 4-bit precision (instead of 16-bit) to shrink the memory footprint. Result: 34 GB consumed instead of 66. On paper, that's a great idea.

But in practice, it's the slowest model in this benchmark: 397 seconds of generation, more than 4× slower than FLUX.2-klein-9B (95 seconds) despite using half the memory. De-quantizing 4-bit weights at inference time has a computational cost that the GB10 struggles to offset, likely amplified by the DGX Spark's limited memory bandwidth (273 GB/s).

SD 3.5 large = SD 3.5 large turbo?

Identical results on both: 171 seconds loading, 82 seconds generation, 29 GB memory. That's not surprising at a fixed 50 steps: the two models share the same architecture and per-step cost, so wall-clock time matches. The point of the "turbo" distillation is to reach good quality in far fewer steps, so its advantage would only show in a run at a reduced step count.

Z-Image vs Z-Image-Turbo: same story

Both Z-Image variants (from Tongyi-MAI) are virtually identical in timing: 178 and 179 seconds. The difference between them might show more with a reduced step count.

The DGX Spark: a machine built for large models

What this test clearly shows is that the DGX Spark shines through its memory capacity far more than its raw speed. Models like FLUX.1-dev (95 GB) or Qwen-Image (63 GB) are simply impossible to run on a standard GPU, even a high-end one. Here, they run — slowly at times, but they run.

For high-throughput production, this isn't the right machine. For exploration, prototyping, or testing massive models locally without cloud or subscriptions, it's a one-of-a-kind proposition.

My takeaway

The DGX Spark is a fascinating and somewhat unconventional machine. It's not the fastest GPU for image generation — an RTX 5090 with 32 GB of VRAM would likely be faster on models that fit in memory. But it's a machine that can run anything, locally, with no VRAM constraints, with NVIDIA's full CUDA stack.

For anyone who wants to explore state-of-the-art image generation — or LLMs — without depending on the cloud and without being limited by memory, it's a remarkable tool.

🏆
Best speed/memory ratio
SD 3.5 medium — 34s / 22 GB
🎯
Best balance
FLUX.1-schnell — 110s / 72 GB
🦾
Largest model supported
FLUX.1-dev — 95 GB out of 128 GB

The next step will be to test these same models with more complex prompts and varying step counts — to see if "turbo" really lives up to its name with 10 or 20 steps instead of 50.
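That sweep could look something like this (a sketch under my own assumptions: `generate` is a placeholder for a wrapper around a loaded pipeline with prompt and seed fixed, not a function from the notebook):

```python
import time

def sweep_steps(generate, step_counts=(10, 20, 50)):
    """Time one generation per step count. `generate` is any callable
    accepting num_inference_steps, e.g. a closure over a loaded
    diffusers pipeline with prompt, guidance, and seed already fixed."""
    timings = {}
    for steps in step_counts:
        start = time.perf_counter()
        generate(num_inference_steps=steps)
        timings[steps] = time.perf_counter() - start
    return timings

# With a real pipeline, the wrapper would look like:
# sweep_steps(lambda **kw: pipeline(prompt, guidance_scale=4.0, **kw))
```

Comparing image quality at 10 or 20 steps against the 50-step baselines above is what would let "turbo" models earn (or lose) their name.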

Interested in a similar project? I'm available for machine learning engineering engagements — modeling, data pipelines, production deployment. Feel free to get in touch.