2-Spark QWEN3.5_397B int4

Max Stability & Offline Build & High-VRAM Optimization

Topology: 2× Nodes (No-Ray)
Model:    Qwen3.5-397B-A17B
Context:  262,144
Vision:   Enabled
Target:   112 GiB Memory

Objective

This guide ensures maximum stability in restricted or slow network environments (such as hotel Wi-Fi) by implementing an auto-resuming offline build workflow, paired with strict memory, swap, and clock controls.
1

Hardware & OS Stabilization

Both Sparks

Run these commands on both Spark A and Spark B to maximize available RAM, stabilize thermals, and prevent OS memory paging crashes.

Terminal (Run on A & B)
# 1. Switch to headless mode to free critical VRAM
sudo systemctl isolate multi-user.target

# 2. Apply GPU clock caps to prevent power-spike crashes
sudo nvidia-smi -lgc 200,2150

# 3. Reduce swappiness to prevent disk thrashing
sudo sysctl -w vm.swappiness=10

Note: You must re-run the nvidia-smi command if the machines are ever rebooted.
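If you would rather not re-run the command by hand, both settings can be persisted across reboots. A sketch, not part of the upstream repo: the sysctl drop-in filename and the gpu-clock-cap.service unit name are illustrative choices, and the unit simply replays the same nvidia-smi command at boot.

```shell
# Persist the swappiness setting (drop-in filename is an arbitrary choice)
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-spark-vllm.conf

# Illustrative systemd unit to re-apply the GPU clock cap at boot
sudo tee /etc/systemd/system/gpu-clock-cap.service <<'EOF'
[Unit]
Description=Lock GPU clocks to 200-2150 MHz
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -lgc 200,2150

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable gpu-clock-cap.service
```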

2

Authenticate Hugging Face

Spark A Only

To pull restricted or large models securely, you must authenticate with Hugging Face before downloading.

Get your token:

  1. Go to huggingface.co/settings/tokens in your browser.
  2. Log in (or create a free account).
  3. Click Create new token, select the Read permission, and copy the key it gives you.

Apply the token:

Back in your terminal, paste this command (replacing the placeholder with your actual token):

Terminal (Spark A)
export HF_TOKEN="hf_your_actual_token_here"
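Hugging Face tokens start with hf_, so a purely local sanity check (it does not contact the API) can catch an empty or truncated paste before the long download begins:

```shell
# Warn early if the exported token is empty or malformed
case "$HF_TOKEN" in
  (hf_?*) echo "Token format looks OK" ;;
  (*)     echo "WARNING: HF_TOKEN is empty or does not start with hf_" >&2 ;;
esac
```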
3

Resilient Offline Docker Build

Spark A Only

Network drops during a Docker build permanently discard partial downloads. This step decouples the massive PyTorch downloads from Docker. It uses an auto-resuming loop to cache wheels locally, patches the Dockerfile, and builds instantly offline.

Terminal (Spark A)
git clone https://github.com/eugr/spark-vllm-docker.git ~/spark-vllm-docker
cd ~/spark-vllm-docker
mkdir -p torch-wheelhouse

# Step A: Resilient download loop (auto-resumes on disconnect)
echo "Starting resilient download loop..."
while ! python3 -m pip download \
  --dest ./torch-wheelhouse \
  --index-url https://download.pytorch.org/whl/nightly/cu130 \
  --extra-index-url https://pypi.nvidia.com \
  torch torchvision torchaudio triton \
  nvidia-nvshmem-cu13 "apache-tvm-ffi<0.2" \
  filelock pynvml requests tqdm; do
    echo "Wi-Fi timeout detected. Retrying in 3s... (completed files are kept)"
    sleep 3
done

# Step B: Patch Dockerfile to mount the local wheels offline
cp Dockerfile Dockerfile.bak
python3 - <<'EOF'
import re
with open("Dockerfile", "r") as f: text = f.read()

# Route Base & Runner stages to use local /whl folder
text = re.sub(
    r'RUN\s+--mount=type=cache,id=uv-cache,target=/root/\.cache/uv[\s\\]+uv pip install torch torchvision torchaudio triton --index-url https://download\.pytorch\.org/whl/nightly/cu130 &&[\s\\]+uv pip install nvidia-nvshmem-cu13 "apache-tvm-ffi<0\.2" filelock pynvml requests tqdm',
    r'RUN --mount=type=bind,source=torch-wheelhouse,target=/whl \\\n    --mount=type=cache,id=uv-cache,target=/root/.cache/uv \\\n    uv pip install --no-index --find-links=/whl torch torchvision torchaudio triton nvidia-nvshmem-cu13 "apache-tvm-ffi<0.2" filelock pynvml requests tqdm',
    text)

text = re.sub(
    r'RUN\s+--mount=type=cache,id=uv-cache,target=/root/\.cache/uv[\s\\]+uv pip install torch torchvision torchaudio triton --index-url https://download\.pytorch\.org/whl/nightly/cu130 &&[\s\\]+uv pip install nvidia-nvshmem-cu13 "apache-tvm-ffi<0\.2"',
    r'RUN --mount=type=bind,source=torch-wheelhouse,target=/whl \\\n    --mount=type=cache,id=uv-cache,target=/root/.cache/uv \\\n    uv pip install --no-index --find-links=/whl torch torchvision torchaudio triton nvidia-nvshmem-cu13 "apache-tvm-ffi<0.2"',
    text)

with open("Dockerfile", "w") as f: f.write(text)
print("Patched Dockerfile for offline installation.")
EOF

# Step C: Build ONLY the required TF5 image (instantly uses local wheels)
./build-and-copy.sh --setup -t vllm-node-tf5 --tf5 --network host -c
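If the build unexpectedly reaches out to the network anyway, the regex patch probably did not match your Dockerfile revision. A quick way to confirm whether the patch landed (the --no-index marker is exactly what the replacement inserts):

```shell
# Check for the offline-install marker the patch adds
if grep -q -- '--no-index --find-links=/whl' Dockerfile; then
  echo "Dockerfile is patched for offline install"
else
  echo "Patch did not apply; restoring backup for inspection" >&2
  cp Dockerfile.bak Dockerfile
fi
```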
4

Model Sync & Network Discovery

Spark A Only

Ensure passwordless SSH is configured from Spark A to Spark B. Then download and sync the model across the cluster.

Terminal (Spark A)
# Download the model and parallel-copy to Spark B
./hf-download.sh happypatrick/Qwen3.5-397B-A17B-heretic-int4-AutoRound -c --copy-parallel
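Before moving on, it is worth spot-checking that every weight shard landed on both machines. A sketch, with assumptions: `spark-b` stands in for whatever SSH alias or address reaches Spark B, and the path matches the $HOME/models volume mounted at launch.

```shell
MODEL="happypatrick/Qwen3.5-397B-A17B-heretic-int4-AutoRound"

# Count safetensors shards locally and on Spark B (alias "spark-b" is illustrative)
local_count=$(ls "$HOME/models/$MODEL" | grep -c '\.safetensors$')
remote_count=$(ssh spark-b "ls ~/models/$MODEL" | grep -c '\.safetensors$')
echo "shards: local=$local_count remote=$remote_count"
[ "$local_count" = "$remote_count" ] || echo "MISMATCH: re-run the sync" >&2
```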

# Find your exact network interface for Spark A <-> Spark B traffic
ip route get <SPARK_B_IP>

Look closely at the output of the routing command for the dev field (e.g., eno1, enp1s0f1np1). You will need this exact interface name for the launch script.
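If you prefer to script it, the dev field can be pulled out of the routing output directly. The address below is a hypothetical stand-in; substitute Spark B's real IP.

```shell
# 192.168.1.2 is an example address; replace with Spark B's actual IP
IFACE=$(ip route get 192.168.1.2 | awk '{for (i = 1; i < NF; i++) if ($i == "dev") print $(i+1); exit}')
echo "Detected interface: $IFACE"
```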

5

Clear Caches & Launch Cluster

Both Sparks

CRITICAL: CLEAR OS CACHES

Right before you start the server, purge OS caches on BOTH Sparks to give vLLM a clean slate for its massive 112 GiB contiguous allocation.
Terminal (Run on A & B)
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
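After dropping caches, you can confirm how much memory is actually free; MemAvailable in /proc/meminfo is the figure that matters for the 112 GiB allocation:

```shell
# Report available memory in GiB (integer, rounded down)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "MemAvailable: $((avail_kb / 1048576)) GiB"
```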

Finally, execute the launch command on Spark A. Replace <YOUR_REAL_INTERFACE> and <YOUR_API_KEY> before running.

Terminal (Spark A)
MODEL="happypatrick/Qwen3.5-397B-A17B-heretic-int4-AutoRound"
MODELPATH="/root/models"
IFACE="<YOUR_REAL_INTERFACE>"
API_KEY="<YOUR_API_KEY>"

VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/models:/root/models --shm-size 10.24g" \
$HOME/spark-vllm-docker/launch-cluster.sh \
  --no-ray \
  -t vllm-node-tf5 \
  -e PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e OMP_NUM_THREADS=4 \
  -e MKL_NUM_THREADS=4 \
  -e TORCH_NUM_THREADS=4 \
  -e NCCL_P2P_DISABLE=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_SOCKET_IFNAME="=${IFACE}" \
  -j 8 \
  exec vllm serve \
  "$MODELPATH/$MODEL" \
    --served-model-name qwen397b-heretic \
    --host 0.0.0.0 \
    --port 40000 \
    --distributed-executor-backend ray \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization-gb 112 \
    --max-model-len 262144 \
    --max-num-seqs 1 \
    --max-num-batched-tokens 4176 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --chat-template unsloth.jinja \
    --mm-encoder-tp-mode weights \
    --mm-processor-cache-type shm \
    --api-key "$API_KEY"
6

OpenClaw Setup

Separate Machine

To preserve memory on the Sparks, keep OpenClaw completely off them. On your separate laptop, desktop, or gateway server, configure the frontend with these parameters:

OpenClaw Settings
Base URL: http://<SPARK_A_IP>:40000/v1
API key:  <YOUR_API_KEY>
Model:    qwen397b-heretic

Troubleshooting OOM (Out of Memory)

112 GiB is an aggressive, edge-of-the-envelope target. If the model throws an Out of Memory error during the startup profiling phase, kill the process, drop caches again on both Sparks, and lower --gpu-memory-utilization-gb from 112 to 111 or 110. Do not touch --mm-encoder-tp-mode weights: changing it to "data" mode will instantly OOM the system.