4-Spark Qwen3.5 397B FP8 Deployment

Distributed serving guide for the 397-billion-parameter Qwen3.5-A17B model with FP8 quantization across a 4-node DGX Spark cluster connected through a switched QSFP fabric.
Architectural Substrate & Memory Constraints

A single DGX Spark node has exactly 128 GB of unified LPDDR5x memory, so fitting the ~397 GB of FP8 weights strictly requires a minimum of 4 nodes (512 GB total). The usual multi-node preference for Pipeline Parallelism is inverted here: the 200 Gbps ConnectX-7 fabric is fast enough that Tensor Parallelism = 4 avoids the prefill latency penalties (pipeline bubbles) of Pipeline Parallelism.
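A back-of-envelope check of that budget (a sketch; the 0.85 figure is the `gpu-memory-utilization` cap applied at launch in section 04):

```shell
# Rough per-node memory budget for the TP=4 layout (all figures in GB).
MODEL_GB=397   # FP8 weights for the full model
NODES=4
NODE_GB=128    # unified LPDDR5x per DGX Spark
UTIL=0.85      # gpu-memory-utilization cap used at launch

USABLE=$(awk "BEGIN{printf \"%.2f\", $NODE_GB * $UTIL}")     # vLLM-claimable per node
WEIGHTS=$(awk "BEGIN{printf \"%.2f\", $MODEL_GB / $NODES}")  # TP=4 weight shard
KV=$(awk "BEGIN{printf \"%.2f\", $NODE_GB * $UTIL - $MODEL_GB / $NODES}")
echo "usable/node: ${USABLE} GB, weights/node: ${WEIGHTS} GB, KV headroom: ${KV} GB"
```

Roughly 9.5 GB per node (~38 GB cluster-wide) is all that remains for KV cache and activations, which is why every section below is about protecting memory.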
01 Network Fabric & Netplan Configuration

Sub-millisecond latency requires RoCE v2 over the ConnectX-7 NICs. Daisy-chaining is not recommended; use a managed 200Gbps QSFP switch. Jumbo frames (MTU 9000) are mandatory to prevent fragmentation.

/etc/netplan/40-cx7.yaml (Example for Spark-Alpha)
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses:
        - 192.168.100.10/24
      mtu: 9000
      # No static route is needed: the /24 on the address above already
      # creates the on-link subnet route.

Apply with sudo netplan apply on all nodes, incrementing the IP for Spark-Beta (.11), Gamma (.12), and Delta (.13).
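Since the per-node files differ only in the last octet, they can be generated in one pass. A minimal sketch (writes to a scratch directory for review, not to /etc/netplan; the alpha/beta/gamma/delta names are assumptions):

```shell
# Generate all four 40-cx7.yaml variants locally; copy each one to the
# matching node's /etc/netplan/ after review.
outdir=$(mktemp -d)
for pair in alpha:10 beta:11 gamma:12 delta:13; do
  name=${pair%%:*}; octet=${pair##*:}
  cat > "$outdir/40-cx7-${name}.yaml" <<EOF
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses:
        - 192.168.100.${octet}/24
      mtu: 9000
EOF
done
ls "$outdir"
```

After applying, verify the MTU end to end with a non-fragmenting ping, e.g. `ping -M do -s 8972 192.168.100.11` from Spark-Alpha.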

02 OS Stabilization & Memory Reclamation

Unified memory means the OS shares RAM with the GPU. You must aggressively disable GUI environments, minimize swapping to NVMe, and drop page caches so the Linux OOM killer does not terminate the Ray cluster mid-load.

Execute on all 4 nodes
# 1. Enforce headless mode
sudo systemctl isolate multi-user.target
sudo systemctl disable gdm lightdm

# 2. Prevent eager memory paging
sudo sysctl vm.swappiness=1
echo "* hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "* soft memlock unlimited" | sudo tee -a /etc/security/limits.conf

# 3. Drop caches immediately prior to launch
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
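A quick read-only sanity check that the tuning stuck (expected values assume the steps above ran; the commands only inspect procfs and ulimits):

```shell
# Verify the reclamation settings before starting Ray.
SWAPPINESS=$(cat /proc/sys/vm/swappiness)            # expect 1 after step 2
MEMLOCK=$(ulimit -l)                                 # expect "unlimited"
AVAIL_KB=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
echo "swappiness=${SWAPPINESS} memlock=${MEMLOCK} avail=$((AVAIL_KB / 1024 / 1024)) GB"
```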
03 Initialize Ray Cluster Topology

You must explicitly bind Ray to the ConnectX-7 interface (enp1s0f1np1). Otherwise, it will route through the slow 10GbE management port, causing catastrophic latency.

Execute on Spark-Alpha (Head Node)
export MN_IF_NAME=enp1s0f1np1
export VLLM_HOST_IP=192.168.100.10
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME

ray start --head --node-ip-address=$VLLM_HOST_IP --port=6379 --dashboard-host=0.0.0.0 --num-gpus=1
Execute on Spark-Beta, Gamma, Delta (Workers)
export MN_IF_NAME=enp1s0f1np1
export VLLM_HOST_IP=192.168.100.11   # Beta; use .12 on Gamma, .13 on Delta
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export GLOO_SOCKET_IFNAME=$MN_IF_NAME

ray start --address='192.168.100.10:6379' --node-ip-address=$VLLM_HOST_IP --num-gpus=1
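To avoid hand-editing VLLM_HOST_IP on each worker, the fabric IP can be derived from the hostname. A hypothetical helper (the spark-* hostnames are assumptions; adjust to your own naming):

```shell
# Map a node's hostname to its ConnectX-7 fabric IP.
fabric_ip() {
  case "$1" in
    spark-alpha) echo 192.168.100.10 ;;
    spark-beta)  echo 192.168.100.11 ;;
    spark-gamma) echo 192.168.100.12 ;;
    spark-delta) echo 192.168.100.13 ;;
    *) echo "unknown node: $1" >&2; return 1 ;;
  esac
}
# On a real node: export VLLM_HOST_IP=$(fabric_ip "$(hostname)")
VLLM_HOST_IP=$(fabric_ip spark-gamma)
echo "$VLLM_HOST_IP"
```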
04 Launch vLLM via Docker

Launch the OpenAI-compatible server on the Head Node using Docker. Host networking, together with the /dev/infiniband device passthrough, is required so the container can drive RDMA over the ConnectX-7 NICs.

Critical Memory Mathematics

Do NOT set --gpu-memory-utilization to 0.90 or higher. Because memory is unified, 0.85 is the absolute safety ceiling: it reserves 19.2 GB per node (15% of 128 GB) for the Linux OS and RoCE networking buffers, preventing immediate OOM crashes.
Execute on Spark-Alpha (Head Node)
docker run -it --rm --name=vllm-spark-matrix \
  --network=host \
  --ipc=host \
  --device=/dev/infiniband \
  --ulimit memlock=-1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:nightly \
  --model Qwen/Qwen3.5-397B-A17B-FP8 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 131072 \
  --max-num-batched-tokens 4096 \
  --block-size 128 \
  --language-model-only \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_xml \
  --attention-backend FLASHINFER
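Once the weights finish loading, the server can be smoke-tested through its OpenAI-compatible endpoint. A sketch: port 8000 is vLLM's default, and the request body is validated locally before the (commented) curl is sent to the head node:

```shell
# Build and sanity-check a chat-completions request for the deployed model.
payload='{
  "model": "Qwen/Qwen3.5-397B-A17B-FP8",
  "messages": [{"role": "user", "content": "Say hello."}],
  "max_tokens": 32
}'
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload ok"
# Send it to the head node (default vLLM port 8000):
# curl -s http://192.168.100.10:8000/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$payload"
```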
Expert Parallelism Hazard (Bug Mitigation)

Do not use the --enable-expert-parallel flag. Due to an upstream vLLM kernel bug with FP8 MoE sharding, it triggers a fatal dimension-size exception. Standard TP=4 automatically replicates and shards the MoE layers safely.
Performance Expectations

The GB10's 273 GB/s memory bandwidth must stream the ~17 GB of active FP8 expert weights for every decoded token (split four ways under TP=4), so expected decode throughput over the RoCE fabric is a stable 25 to 28 tokens/second.
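Those figures follow from a simple roofline estimate (a sketch; assumes the ~17 GB of active weights split evenly across the four TP shards):

```shell
# Decode-throughput ceiling from memory bandwidth alone.
ACTIVE_GB=17   # active FP8 expert weights read per token
NODES=4        # TP=4: each GB10 streams a quarter of them
BW=273         # GB/s LPDDR5x bandwidth per node
CEILING=$(awk "BEGIN{printf \"%.1f\", $BW / ($ACTIVE_GB / $NODES)}")
echo "bandwidth-only ceiling: ${CEILING} tok/s"
# The observed 25-28 tok/s is ~40% of this ceiling once per-token RoCE
# all-reduces and kernel launch overheads are paid.
```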