Max Stability, Offline Builds & High-VRAM Optimization
Run these commands on both Spark A and Spark B to maximize available RAM, stabilize thermals, and prevent OS memory paging crashes.
# 1. Switch to headless mode to free critical VRAM
sudo systemctl isolate multi-user.target
# 2. Apply GPU clock caps to prevent power-spike crashes
sudo nvidia-smi -lgc 200,2150
# 3. Reduce swappiness to prevent disk thrashing
sudo sysctl -w vm.swappiness=10
Note: Neither the nvidia-smi clock lock nor the sysctl setting persists across reboots; re-apply both whenever the machines are restarted.
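If you would rather persist these settings across reboots, a sketch is below. The sysctl.d filename and the systemd unit name are my own choices (not from the upstream repo); `set-default` makes the headless mode permanent, and the oneshot unit re-applies the clock caps at every boot.

```shell
# Boot straight into headless mode from now on
sudo systemctl set-default multi-user.target

# Persist the swappiness setting
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-spark-swappiness.conf

# Re-apply the GPU clock caps at every boot via a oneshot unit
sudo tee /etc/systemd/system/gpu-clock-caps.service <<'EOF'
[Unit]
Description=Lock GPU clocks to 200-2150 MHz
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -lgc 200,2150

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable gpu-clock-caps.service
```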
To pull restricted or large models securely, you must authenticate with Hugging Face before downloading.
In your terminal, export your Hugging Face access token (replacing the placeholder with your actual token):
export HF_TOKEN="hf_your_actual_token_here"
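Before kicking off a multi-hundred-gigabyte download, a cheap local sanity check that the token is actually exported can save a failed run. A minimal sketch; the `hf_` prefix check is a heuristic, not an official validation:

```python
import os

def hf_token_ok(token: str) -> bool:
    """Heuristic check: non-empty and carrying the usual 'hf_' prefix."""
    return token.startswith("hf_") and len(token) > 3

# Warn early rather than failing mid-download
if not hf_token_ok(os.environ.get("HF_TOKEN", "")):
    print("HF_TOKEN missing or malformed; export it before downloading.")
```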
Network drops during a Docker build permanently discard partial downloads. This step decouples the massive PyTorch downloads from Docker: an auto-resuming loop caches the wheels locally, the Dockerfile is patched to install from that cache, and the build then runs fully offline.
git clone https://github.com/eugr/spark-vllm-docker.git ~/spark-vllm-docker
cd ~/spark-vllm-docker
mkdir -p torch-wheelhouse
# Step A: Resilient download loop (auto-resumes on disconnect)
echo "Starting resilient download loop..."
while ! python3 -m pip download \
    --dest ./torch-wheelhouse \
    --index-url https://download.pytorch.org/whl/nightly/cu130 \
    --extra-index-url https://pypi.nvidia.com \
    torch torchvision torchaudio triton \
    nvidia-nvshmem-cu13 "apache-tvm-ffi<0.2" \
    filelock pynvml requests tqdm; do
  echo "Wi-Fi timeout detected. Retrying in 3s... (completed files are kept)"
  sleep 3
done
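Before patching the Dockerfile, you can sanity-check the wheelhouse: every wheel is a zip archive, so a truncated or corrupted file shows up as an invalid zip. A small belt-and-braces sketch (the directory name matches the step above):

```python
import zipfile
from pathlib import Path

def broken_wheels(wheelhouse: Path) -> list[str]:
    """Return the names of .whl files that are not valid zip archives
    (a truncated or corrupted download fails this check)."""
    return [whl.name for whl in sorted(wheelhouse.glob("*.whl"))
            if not zipfile.is_zipfile(whl)]

# List any corrupt wheels; delete them and re-run the download loop
for name in broken_wheels(Path("torch-wheelhouse")):
    print("corrupt wheel:", name)
```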
# Step B: Patch Dockerfile to mount the local wheels offline
cp Dockerfile Dockerfile.bak
python3 - <<'EOF'
import re

with open("Dockerfile") as f:
    text = f.read()
original = text

# Route the Base & Runner stages to use the local /whl folder
text = re.sub(
    r'RUN\s+--mount=type=cache,id=uv-cache,target=/root/\.cache/uv[\s\\]+uv pip install torch torchvision torchaudio triton --index-url https://download\.pytorch\.org/whl/nightly/cu130 &&[\s\\]+uv pip install nvidia-nvshmem-cu13 "apache-tvm-ffi<0\.2" filelock pynvml requests tqdm',
    r'RUN --mount=type=bind,source=torch-wheelhouse,target=/whl \\\n    --mount=type=cache,id=uv-cache,target=/root/.cache/uv \\\n    uv pip install --no-index --find-links=/whl torch torchvision torchaudio triton nvidia-nvshmem-cu13 "apache-tvm-ffi<0.2" filelock pynvml requests tqdm',
    text)
text = re.sub(
    r'RUN\s+--mount=type=cache,id=uv-cache,target=/root/\.cache/uv[\s\\]+uv pip install torch torchvision torchaudio triton --index-url https://download\.pytorch\.org/whl/nightly/cu130 &&[\s\\]+uv pip install nvidia-nvshmem-cu13 "apache-tvm-ffi<0\.2"',
    r'RUN --mount=type=bind,source=torch-wheelhouse,target=/whl \\\n    --mount=type=cache,id=uv-cache,target=/root/.cache/uv \\\n    uv pip install --no-index --find-links=/whl torch torchvision torchaudio triton nvidia-nvshmem-cu13 "apache-tvm-ffi<0.2"',
    text)

# Fail loudly instead of silently writing an unpatched file
# (e.g. if the upstream Dockerfile has changed)
if text == original:
    raise SystemExit("No install stage matched; Dockerfile left untouched.")

with open("Dockerfile", "w") as f:
    f.write(text)
print("Patched Dockerfile for offline installation.")
EOF
# Step C: Build ONLY the required TF5 image (instantly uses local wheels)
./build-and-copy.sh --setup -t vllm-node-tf5 --tf5 --network host -c
Ensure passwordless SSH is configured from Spark A to Spark B. Then download and sync the model across the cluster.
# Download the model and parallel-copy to Spark B
./hf-download.sh happypatrick/Qwen3.5-397B-A17B-heretic-int4-AutoRound -c --copy-parallel
# Find your exact network interface for Spark A <-> Spark B traffic
ip route get <SPARK_B_IP>
Look closely at the output of the routing command for the dev field (e.g., eno1, enp1s0f1np1). You will need this exact interface name for the launch script.
# Flush the page cache before loading the model to maximize free memory
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
Finally, execute the launch command on Spark A. Replace <YOUR_REAL_INTERFACE> and <YOUR_API_KEY> before running.
MODEL="happypatrick/Qwen3.5-397B-A17B-heretic-int4-AutoRound"
MODELPATH="/root/models"
IFACE="<YOUR_REAL_INTERFACE>"
API_KEY="<YOUR_API_KEY>"
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/models:/root/models --shm-size 10.24g" \
$HOME/spark-vllm-docker/launch-cluster.sh \
  --no-ray \
  -t vllm-node-tf5 \
  -e PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  -e OMP_NUM_THREADS=4 \
  -e MKL_NUM_THREADS=4 \
  -e TORCH_NUM_THREADS=4 \
  -e NCCL_P2P_DISABLE=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_SOCKET_IFNAME="=${IFACE}" \
  -j 8 \
  exec vllm serve \
    "$MODELPATH/$MODEL" \
    --served-model-name qwen397b-heretic \
    --host 0.0.0.0 \
    --port 40000 \
    --distributed-executor-backend ray \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization-gb 112 \
    --max-model-len 262144 \
    --max-num-seqs 1 \
    --max-num-batched-tokens 4176 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --chat-template unsloth.jinja \
    --mm-encoder-tp-mode weights \
    --mm-processor-cache-type shm \
    --api-key "$API_KEY"
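Once the cluster is up, a quick way to confirm the server is serving is to list models over the OpenAI-compatible API. A sketch: the <SPARK_A_IP> placeholder and the commented-out request are illustrative, while the parsing follows the standard /v1/models response shape.

```python
import json
import urllib.request

def model_ids(body: str) -> list[str]:
    """Pull model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in json.loads(body).get("data", [])]

# Usage (replace the placeholders with Spark A's address and your API key):
# req = urllib.request.Request(
#     "http://<SPARK_A_IP>:40000/v1/models",
#     headers={"Authorization": "Bearer <YOUR_API_KEY>"})
# with urllib.request.urlopen(req) as resp:
#     print(model_ids(resp.read().decode()))

sample = '{"object": "list", "data": [{"id": "qwen397b-heretic"}]}'
print(model_ids(sample))  # → ['qwen397b-heretic']
```

If the served-model name appears in the list, the frontend configuration below should work unchanged.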
To keep every byte of memory on the Sparks free for inference, run OpenClaw entirely off them. On a separate laptop, desktop, or gateway server, configure the frontend with these parameters:
Base URL: http://<SPARK_A_IP>:40000/v1
API key: the value you set for API_KEY in the launch script
Model: qwen397b-heretic