Why this test?
The NVIDIA DGX Spark is a rather exceptional machine: a desktop-sized AI supercomputer, powered by the Grace Blackwell Superchip GB10. It's a chip that combines a 20-core ARM CPU and a Blackwell GPU on a single package, with 128 GB of unified memory shared between both. All plugged into a standard wall outlet.
On paper, it promises to run very large models locally, without cloud, without subscriptions, without sending your data anywhere. I wanted to verify this in practice, on a concrete use case: text-to-image generation.
I wrote an interactive notebook in Marimo (a modern alternative to Jupyter) that lets you pick a model, enter a prompt, adjust parameters, and generate an image with a real-time progress bar. I then ran 11 different models under the exact same conditions to compare speed, quality, and memory consumption.
What exactly is the DGX Spark?
The DGX Spark is not a souped-up gaming PC. It's a machine designed from the ground up for artificial intelligence, in a compact form factor that fits on a desk. Its main feature: unified memory. The CPU and GPU share the same 128 GB pool — which means you can load very large models without worrying about VRAM, which is often the bottleneck on traditional setups.
What's notable is that the limit here isn't memory (128 GB is considerable) but memory bandwidth: 273 GB/s shared between CPU and GPU, compared to several TB/s on a datacenter H100. That's the real bottleneck of this machine for intensive AI workloads.
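A back-of-the-envelope estimate makes this concrete. In the bandwidth-bound regime, each denoising step has to stream the model weights from memory at least once, which sets a hard floor on per-step latency. The sketch below uses a hypothetical 24 GB weight size and ignores caching and compute overlap:

```python
def min_step_time_s(weights_gb: float, bandwidth_gb_s: float = 273.0) -> float:
    """Lower bound on one denoising step: the time to stream the full
    model weights once over the available memory bandwidth."""
    return weights_gb / bandwidth_gb_s

# Hypothetical 24 GB of bf16 weights at the DGX Spark's 273 GB/s:
floor = 50 * min_step_time_s(24)
print(f"{floor:.1f} s minimum for 50 steps")  # → 4.4 s minimum for 50 steps
```

At multi-TB/s datacenter bandwidth the same floor drops by an order of magnitude, which is why raw bandwidth, not capacity, dominates here.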
The test notebook
Rather than running command-line scripts, I built an interactive interface in Marimo. The idea: pick a model from a dropdown, adjust parameters (number of steps, guidance scale, seed, aspect ratio) and generate images with real-time visual feedback. After each generation, the parameters are automatically saved to a JSON file for reproducibility and comparison.
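The saving step can be sketched as follows. This is a minimal illustration: the function name and JSON schema are mine, not the notebook's actual code.

```python
import json
import time
from pathlib import Path

def save_run_params(params: dict, out_dir: str = "runs") -> Path:
    """Persist one generation's parameters to a timestamped JSON file
    so any image can be reproduced later. (Illustrative helper.)"""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    record = dict(params, timestamp=time.strftime("%Y-%m-%dT%H:%M:%S"))
    path = Path(out_dir) / f"run_{int(time.time() * 1000)}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# Example: the settings of one benchmark run.
saved = save_run_params({
    "model": "stabilityai/stable-diffusion-3.5-medium",
    "seed": 42,
    "steps": 50,
    "guidance_scale": 4.0,
})
```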
The core loading function is intentionally simple — the Hugging Face diffusers library does the heavy lifting:
All models are loaded with the bfloat16 data type to balance precision and memory usage. The local_files_only flag ensures models are loaded from local disk, with no network calls during generation.
For the benchmark, I kept exactly the same parameters across every model:
- Identical prompt: "A 35 years young woman in paris during spring. Blond hairs, green eyes. 35mm, professional photographer."
- Fixed seed: 42 (for reproducibility)
- 50 inference steps
- Guidance scale: 4.0
- Resolution: 1024×1024 pixels
The 11 models head to head
Here's the full measurement table. Generation time is the pure diffusion time (excluding model loading). Memory indicates the peak GPU memory measured during generation.
| Model | Family | Loading | Generation | Peak memory |
|---|---|---|---|---|
| SD 3.5 medium (`stabilityai/stable-diffusion-3.5-medium`) | SD 3.5 | 66s | 34s | 22 GB |
| Z-Image-Turbo (`Tongyi-MAI/Z-Image-Turbo`) | Z-Image | 63s | 178s | 22 GB |
| Z-Image (`Tongyi-MAI/Z-Image`) | Z-Image | 120s | 179s | 22 GB |
| SD 3.5 large (`stabilityai/stable-diffusion-3.5-large`) | SD 3.5 | 171s | 82s | 29 GB |
| SD 3.5 large turbo (`stabilityai/stable-diffusion-3.5-large-turbo`) | SD 3.5 | 171s | 82s | 29 GB |
| FLUX.2-klein-9B (`black-forest-labs/FLUX.2-klein-9B`) | FLUX.2 | 266s | 95s | 66 GB |
| FLUX.1-schnell (`black-forest-labs/FLUX.1-schnell`) | FLUX.1 | 291s | 110s | 72 GB |
| FLUX.1-dev (`black-forest-labs/FLUX.1-dev`) | FLUX.1 | 284s | 111s | 95 GB |
| FLUX.1-Kontext-dev (`black-forest-labs/FLUX.1-Kontext-dev`) | FLUX.1 | 289s | 111s | 66 GB |
| Qwen-Image-2512 (`Qwen/Qwen-Image-2512`) | Qwen | 427s | 212s | 63 GB |
| FLUX.2-dev (4-bit) (`diffusers/FLUX.2-dev-bnb-4bit`) | FLUX.2 | 265s | 397s | 34 GB |
The generated images
All images below were generated with the same prompt, the same seed (42) and the same parameters. Only the model changes — making the comparison directly readable.
Prompt: "A 35 years young woman in paris during spring. Blond hairs, green eyes. 35mm, professional photographer." — 1024×1024, 50 steps, guidance 4.0, seed 42.
Benchmark takeaways
- Fastest: SD 3.5 medium, with 34 seconds of generation and 22 GB of memory. A remarkable speed/quality ratio.
- Slowest: FLUX.2-dev (4-bit), at 397 seconds. Paradoxical for a quantized version meant to be lighter.
- Most memory-hungry: FLUX.1-dev, peaking at 95 GB. Impossible to run on a setup with less than 80-90 GB of VRAM.
- Lightest: SD 3.5 medium and both Z-Image variants, at only 22 GB. Accessible on far less expensive setups.
4-bit quantization: not always a win
The FLUX.2-dev-bnb-4bit variant stores the FLUX.2-dev weights in 4 bits instead of 16 to reduce its memory footprint. Result: 34 GB consumed instead of 66. On paper, that's a great idea.
But in practice, it's the slowest model in this benchmark: 397 seconds of generation, more than 4× slower than FLUX.2-klein-9B (95 seconds), even though klein runs in full bfloat16. Quantization has a computational cost (weights are dequantized on the fly at every step) that the GB10 struggles to offset, and the DGX Spark's limited memory bandwidth (273 GB/s) likely amplifies this overhead.
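The memory side is easy to sanity-check with a toy model. Only part of the network is actually stored in 4 bits: text encoders, norm layers, and activations typically stay in higher precision, which is why 4-bit models don't shrink by a full 4×. The 65% fraction below is back-solved from the measured numbers, not a published figure:

```python
def quantized_footprint_gb(bf16_gb: float, quantized_fraction: float,
                           bits: int = 4) -> float:
    """Estimate the footprint when only a fraction of 16-bit weights
    is quantized; the rest stays at full width."""
    unquantized = bf16_gb * (1.0 - quantized_fraction)
    quantized = bf16_gb * quantized_fraction * bits / 16.0
    return unquantized + quantized

# Assuming ~65% of a 66 GB bf16 footprint is quantizable:
estimate = quantized_footprint_gb(66, 0.65)  # ≈ 33.8 GB, close to the measured 34 GB
```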
SD 3.5 large = SD 3.5 large turbo?
Identical results on both: 171 seconds loading, 82 seconds generation, 29 GB memory. That's expected at a fixed step count: "turbo" variants are distilled to need fewer steps for good quality, not to make each step faster, so a 50-step protocol hides the difference. Testable with fewer than 50 steps.
Z-Image vs Z-Image-Turbo: same story
Both Z-Image variants (from Tongyi-MAI) are virtually identical in timing: 178 and 179 seconds. The difference between them might show more with a reduced step count.
The DGX Spark: a machine built for large models
What this test clearly shows is that the DGX Spark shines through its memory capacity far more than its raw speed. Models like FLUX.1-dev (95 GB) or Qwen-Image (63 GB) are simply impossible to run on a standard GPU, even a high-end one. Here, they run — slowly at times, but they run.
For high-throughput production, this isn't the right machine. For exploration, prototyping, or testing massive models locally without cloud or subscriptions, it's a one-of-a-kind proposition.
My takeaway
The DGX Spark is a fascinating and somewhat unconventional machine. It's not the fastest GPU for image generation — an RTX 5090 with 32 GB of VRAM would likely be faster on models that fit in memory. But it's a machine that can run anything, locally, with no VRAM constraints, with NVIDIA's full CUDA stack.
For anyone who wants to explore state-of-the-art image generation — or LLMs — without depending on the cloud and without being limited by memory, it's a remarkable tool.
The next step will be to test these same models with more complex prompts and varying step counts — to see if "turbo" really lives up to its name with 10 or 20 steps instead of 50.