Running GLM-4.7-Flash MoE on Windows with a Small GPU

Using llama.cpp to punch above your hardware class

Learn how to run Z.ai’s massive 30B MoE reasoning model on a modest Windows GPU using llama.cpp and tensor overrides.

In this guide, we’ll look at how to run massive Mixture of Experts (MoE) models on a Windows machine with limited VRAM. By leveraging llama.cpp and a specific tensor-override trick, you can offload the memory-hungry “experts” to your CPU while keeping the core processing fast on your GPU.

Why This Works

Mixture of Experts (MoE) models are incredible. Instead of using every single parameter to answer a prompt, they route each token to a small set of “expert” sub-networks. This makes them fast, but every expert still has to sit in memory at once, which takes a massive amount of space.

A perfect example is GLM-4.7-Flash, Z.ai’s new 30B MoE reasoning model. It delivers best-in-class performance for coding, agentic workflows, and chat. While it is a 30B parameter model, it only uses about 3.6B active parameters during generation. It also supports a massive 200K context window.

If you try to load a model like this entirely into a smaller GPU’s VRAM, the load fails with out-of-memory errors. Even a 4-bit quantization of a 30B model needs far more than 8GB.
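To make the size gap concrete, here is a back-of-the-envelope estimate. The 4.85 bits per weight is an assumed average for a 4-bit K-quant, not a measured figure for this model, and the estimate ignores the KV cache and compute buffers.

```powershell
# Rough weight-memory estimate for a 30B model on an 8GB card.
# ASSUMPTION: 4.85 bits per weight is an approximate average for a 4-bit K-quant.
$TotalParams   = 30e9
$BitsPerWeight = 4.85
$VramGB        = 8

# bits -> bytes -> GB
$WeightsGB = $TotalParams * $BitsPerWeight / 8 / 1e9

"Weights: {0:N1} GB, VRAM: {1} GB" -f $WeightsGB, $VramGB
"Fits entirely in VRAM: {0}" -f ($WeightsGB -lt $VramGB)
```

The weights alone come out at more than double the card's VRAM, which is why part of the model has to live in system RAM.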

Does this actually perform well? Absolutely. On a machine with a 5th gen AMD Ryzen 7 processor, 64GB of system memory, and an NVIDIA RTX 3060 Ti with just 8GB of VRAM, this exact setup yields a highly usable 20 to 25 tokens per second (with good tuning).

Here is how to solve the VRAM bottleneck on Windows using llama.cpp.


Prerequisites

Before we start, you need the right foundation on your Windows machine:

  • An NVIDIA GPU with NVIDIA Drivers installed and up to date.
  • CUDA Toolkit 12.4 (this guide specifically targets the CUDA 12.4 binaries).
  • A basic understanding of PowerShell.

Step 1 - Download and Install llama.cpp for Windows

llama.cpp is a highly optimized C/C++ inference engine. It began as a port of the LLaMA inference code and now runs many other models in the GGUF format. We need the specific pre-compiled binaries for Windows and CUDA.

  1. Go to the official repository releases: https://github.com/ggml-org/llama.cpp/releases
  2. Download the main binary zip: llama-bXXXX-bin-win-cuda-12.4-x64.zip.
  3. Download the CUDA runtime dependencies: cudart-llama-bin-win-cuda-12.4-x64.zip.
  4. Extract both zip files into a single folder on your machine, for example: C:\Users\YourName\.llamacpp.
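A quick sanity check after extracting can save time later. The sketch below assumes the example folder from the last step (.llamacpp under your user profile); adjust the path if you extracted elsewhere. The PATH change only lasts for the current PowerShell session.

```powershell
# ASSUMPTION: the binaries were extracted to the example folder from this guide.
$LlamaDir = Join-Path $env:USERPROFILE ".llamacpp"

# Make llama-server resolvable by name for this PowerShell session only.
$env:Path = "$LlamaDir;$env:Path"

# Confirm the NVIDIA driver sees the GPU.
nvidia-smi

# Confirm the binary starts and can see the CUDA device.
llama-server --version
llama-server --list-devices
```

If the last command does not list your GPU, check that both zip files were extracted into the same folder.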

Step 2 - Download the GLM-4.7-Flash Model

Next, you need the model in the .gguf format.

You can download the Unsloth quantized GGUF versions directly from Hugging Face at https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF.

For a 4-bit quantized version, you will need around 18GB to 24GB of combined system RAM and VRAM to run the model. Download the .gguf file and place it somewhere accessible, like D:\LLM\models\GLM-4.7-Flash-Q4_K_M.gguf.
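If you are not sure how much combined memory your machine has, a rough check looks like this. It is only a sketch: it reads the first GPU that nvidia-smi reports and counts total rather than free memory.

```powershell
# Total system RAM in GB (Win32_ComputerSystem reports bytes).
$RamGB = (Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB

# Total VRAM of the first GPU; nvidia-smi reports MiB with these options.
$VramMiB = nvidia-smi --query-gpu=memory.total "--format=csv,noheader,nounits" |
    Select-Object -First 1
$VramGB = [double]$VramMiB / 1024

"RAM {0:N1} GB + VRAM {1:N1} GB = {2:N1} GB" -f $RamGB, $VramGB, ($RamGB + $VramGB)
```

Windows and other applications also need RAM, so treat the 18GB to 24GB figure as what must be free, not what is installed.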


Step 3 - The Secret Sauce: Tensor Overrides

When you load a model with full GPU offload, llama.cpp tries to put the whole thing into your GPU’s VRAM. For an MoE model, the Feed-Forward Network (FFN) layers, which house the “experts”, take up the bulk of the space.

We can tell llama.cpp to load the attention mechanisms and base layers into the GPU, but force the heavy FFN expert layers into your standard system RAM (CPU) using this specific flag:

--override-tensor ".ffn_.*_exps.=CPU"

Because an MoE model only activates a small portion of these experts per token, the CPU only has to read and multiply a few expert matrices from system RAM for each token, so keeping them there does not create a massive bottleneck.
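The part before the equals sign is a regular expression matched against tensor names, and the part after it is the target device. The sketch below shows which names the pattern catches. The tensor names are illustrative examples in the usual GGUF naming style, not a dump of this model's actual tensors.

```powershell
# The regex half of the override flag.
$Pattern = ".ffn_.*_exps."

# ASSUMPTION: illustrative GGUF-style tensor names, not read from the real file.
$TensorNames = @(
    "blk.5.attn_output.weight"
    "blk.5.ffn_gate_inp.weight"
    "blk.5.ffn_gate_exps.weight"
    "blk.5.ffn_up_exps.weight"
    "blk.5.ffn_down_exps.weight"
    "blk.5.ffn_up_shexp.weight"
)

# Names that match the pattern go to CPU; everything else stays on the GPU.
$Placement = foreach ($Name in $TensorNames) {
    $Target = if ($Name -match $Pattern) { "CPU" } else { "GPU" }
    [pscustomobject]@{ Tensor = $Name; Target = $Target }
}

$Placement | Format-Table -AutoSize
```

Only the three stacked expert tensors (gate, up, and down) match. Attention weights, the router, and any shared expert stay on the GPU.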


Step 4 - The PowerShell Launch Script

To make running the server easy and repeatable, let’s create a PowerShell launch script.

Create a file named start-server.ps1 and paste in the following. This script uses llama-server to host an OpenAI-compatible API on your local machine and applies the sampling parameters Z.ai recommends for GLM-4.7-Flash.

# Configuration
$ModelPath = "D:\LLM\models\GLM-4.7-Flash-Q4_K_M.gguf"
# 0.0.0.0 listens on all network interfaces; use "127.0.0.1" to keep the server local-only
$ServerHost = "0.0.0.0"
$Port = 8080

Write-Host "Starting Llama Server..." -ForegroundColor Green

# Build command arguments
$ProcessArgs = @(
    "--model", $ModelPath
    "--alias", "glm-4.7-flash"
    "--jinja"
    "--threads", "-1"
    "--batch-size", "512"
    "--n-gpu-layers", "99"
    "--ctx-size", "65536"
    "--cache-type-k", "q8_0"
    "--cache-type-v", "q8_0"
    "--flash-attn", "on"
    "--temp", "1.0"
    "--top-p", "0.95"
    "--min-p", "0.01"
    "--repeat-penalty", "1.0"
    "--host", $ServerHost
    "--port", $Port.ToString()
    "--no-webui"
    "--override-tensor", ".ffn_.*_exps.=CPU"
)

# Launch the server
$CommandLine = "llama-server $($ProcessArgs -join ' ')"
cmd /c $CommandLine

Understanding the Flags

Let’s break down the important parameters we just used:

  • --n-gpu-layers 99: Attempts to offload as many layers as possible to the GPU.
  • --override-tensor ".ffn_.*_exps.=CPU": The MoE trick that forces the heavy expert tensors back to system RAM.
  • --cache-type-k q8_0 & --cache-type-v q8_0: Quantizes the KV cache to 8-bit, which roughly halves its memory use compared with the default 16-bit cache and saves a ton of VRAM during long conversations.
  • --flash-attn on: Enables Flash Attention, heavily optimizing context processing speeds.
  • --ctx-size 65536: Sets the context window. While 32k (32768) is great for general chat, you should increase this to a minimum of 64k (65536) if you are coding with Kiro code. The model supports up to a 200K context window if you have the RAM for it!
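The KV cache saving can be worked out from the q8_0 block layout: each block stores 32 values as 8-bit integers plus one 16-bit scale. A small sketch of that arithmetic:

```powershell
# Bytes needed to store one block of 32 cache values.
$BytesF16  = 32 * 2        # 32 values at 2 bytes each
$BytesQ8_0 = 32 * 1 + 2    # 32 int8 values plus one 2-byte scale

$Ratio = $BytesQ8_0 / $BytesF16
"q8_0 cache size relative to f16: {0:P0}" -f $Ratio
```

That is a little over half the size, and the saving grows in absolute terms with the context size you set.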

Note: This same override trick works beautifully for the GPT-OSS-20B model as well, where you can push the context window even larger, to around 128k.

Crucial GLM-4.7-Flash Sampling Parameters: Z.ai highly recommends specific sampling settings to prevent the model from looping or outputting poor results.

  • --temp 1.0 and --top-p 0.95: The recommended baseline for general use cases. For tool-calling, use --temp 0.7 and --top-p 1.0 instead.
  • --repeat-penalty 1.0: You must disable the repeat penalty by setting it to 1.0.
  • --min-p 0.01: llama.cpp defaults to 0.05, so explicitly setting this to 0.01 provides the best results for this architecture.
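If you switch between general chat and tool-calling, one way to avoid editing numbers by hand is to keep both sets of values in the launch script. This is a sketch: the variable names are hypothetical, and it assumes the min-p and repeat-penalty values stay the same in both cases.

```powershell
# Hypothetical helper for start-server.ps1: pick sampling flags by use case.
$SamplingProfiles = @{
    general = @("--temp", "1.0", "--top-p", "0.95")
    tools   = @("--temp", "0.7", "--top-p", "1.0")
}

# ASSUMPTION: these two settings are shared by both use cases.
$SharedSampling = @("--min-p", "0.01", "--repeat-penalty", "1.0")

$UseCase      = "tools"   # or "general"
$SamplingArgs = $SamplingProfiles[$UseCase] + $SharedSampling

$SamplingArgs -join " "
```

You could then add $SamplingArgs to $ProcessArgs in place of the four hard-coded sampling entries.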

Recap and What’s Next

You just set up a highly optimized local AI server! By combining the raw speed of llama.cpp with smart tensor offloading, you can comfortably run advanced Mixture of Experts models without needing to buy a massive graphics card.

To use your new server, you can point any OpenAI-compatible frontend application to http://localhost:8080/v1 and start prompting.
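You can also test the server straight from PowerShell before wiring up a frontend. This sketch assumes the server from the launch script is running on port 8080. The model name matches the --alias value, and the prompt is just a placeholder.

```powershell
# Check that the server has finished loading the model.
Invoke-RestMethod -Uri "http://localhost:8080/health"

# Build an OpenAI-style chat request; "glm-4.7-flash" matches the --alias flag.
$Body = @{
    model    = "glm-4.7-flash"
    messages = @(
        @{ role = "user"; content = "Explain what a Mixture of Experts model is in two sentences." }
    )
} | ConvertTo-Json -Depth 5

$Response = Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" `
    -Method Post -ContentType "application/json" -Body $Body

# Print the assistant's reply.
$Response.choices[0].message.content
```

The first request after startup can take a while, since loading the model into RAM and VRAM has to finish before the server answers.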