How to Run Llama 4 Locally: Complete Setup Guide
Step-by-step guide to running Meta's Llama 4 Scout and Maverick models on your own hardware using Ollama, llama.cpp, and vLLM, with hardware requirements and optimization tips.

Meta's Llama 4 family represents a major leap forward in open-weight AI models, introducing a Mixture of Experts (MoE) architecture that delivers impressive performance while keeping compute requirements manageable. Why run a model locally when hosted APIs exist? Privacy, offline access, and zero per-token costs are the big three. Running Llama 4 locally is entirely achievable with the right hardware and software setup.
This guide walks you through every step — from choosing the right model variant to generating your first response on your own machine.
What You'll Learn
- How to choose between Llama 4 Scout and Maverick based on your hardware
- Three different methods to run Llama 4 locally (Ollama, llama.cpp, vLLM)
- Hardware requirements and optimization tips for each variant
- How to quantize models to fit smaller GPUs
Before diving into setup, it's important to understand which Llama 4 variant fits your use case and hardware.
Start with Scout: If this is your first time running Llama 4 locally, start with Llama 4 Scout. Its MoE architecture means only 17B parameters are active per inference pass, making it surprisingly efficient despite the 109B total parameter count.
So what hardware do you actually need? Here's the breakdown for a smooth local experience:
Llama 4 Scout (minimum):
| Component | Requirement |
|---|---|
| GPU | NVIDIA RTX 3090 (24 GB) or RTX 4070 Ti Super (16 GB) with CPU offloading |
| RAM | 32 GB |
| Storage | 80 GB free (for Q4_K_M quantized model files) |
| OS | Linux (recommended), Windows 11, macOS (Apple Silicon) |
Llama 4 Scout (recommended):
| Component | Requirement |
|---|---|
| GPU | 2x NVIDIA RTX 3090/4090 (48 GB total) or 1x NVIDIA A100 80 GB |
| RAM | 64 GB |
| Storage | 150 GB free |
| OS | Ubuntu 22.04+ or similar Linux distro |
Llama 4 Maverick:
| Component | Requirement |
|---|---|
| GPU | 4x NVIDIA A100 80 GB (320 GB total) or 8x NVIDIA A6000 48 GB |
| RAM | 128 GB+ |
| Storage | 800 GB+ free |
| OS | Linux with CUDA 12.x |
Apple Silicon Users: macOS with Apple Silicon (M2 Pro/Max/Ultra, M3, M4) can run Llama 4 Scout quantized models using unified memory. Since the Q4_K_M quantized Scout model is ~65 GB, you'll need a Mac with at least 96 GB unified memory for best results (e.g., M2 Ultra, M3 Max 96 GB, M4 Max 128 GB). A 64 GB Mac can work with smaller quantizations (Q2_K at ~40 GB) or heavy swap usage, but expect slower performance.
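To see why 96 GB is the comfortable threshold, here's the memory arithmetic as a quick sanity check. The 75% figure is an assumption based on how macOS typically caps the unified memory that Metal can use for the GPU; the real limit varies by machine and can be tuned.

```python
# Rough check: does a quantized model fit in a Mac's GPU-usable memory?
# Assumes macOS lets Metal use ~75% of unified memory (approximate).
METAL_USABLE_FRACTION = 0.75

def fits_in_unified_memory(model_gb: float, unified_gb: float) -> bool:
    """True if the model fits in the GPU-usable slice of unified memory."""
    return model_gb <= unified_gb * METAL_USABLE_FRACTION

print(fits_in_unified_memory(65, 96))  # Q4_K_M Scout on a 96 GB Mac -> True
print(fits_in_unified_memory(65, 64))  # Q4_K_M Scout on a 64 GB Mac -> False
print(fits_in_unified_memory(40, 64))  # Q2_K Scout on a 64 GB Mac -> True
```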
Ollama is the simplest way to run Llama 4 locally. It handles model downloading, quantization, and serving with a single command.
Linux / WSL:
curl -fsSL https://ollama.com/install.sh | sh
macOS (Homebrew):
brew install ollama
Windows:
Download the installer from the official Ollama website and run it. Ollama runs natively on Windows 11 with GPU support.
# Pull the model (this downloads ~65 GB for the default quantized version)
ollama pull llama4
# Start a chat session
ollama run llama4
For the larger Maverick model:
ollama pull llama4:maverick
ollama run llama4:maverick
Ollama automatically starts an OpenAI-compatible API server on port 11434:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama4",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}]
}'
You can integrate this with any application that supports the OpenAI API format (see our Ollama vs LM Studio comparison for how Ollama stacks up) by pointing it to http://localhost:11434/v1.
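For application code, the same endpoint can be called from Python with nothing but the standard library. The `build_request` and `chat` helpers below are illustrative names of my own, not part of Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Construct an OpenAI-style chat completion request for Ollama."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def chat(model: str, prompt: str) -> str:
    """Send the request and return the reply (requires Ollama running)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Calling `chat("llama4", "Explain quantum computing in simple terms.")` mirrors the curl example above.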
Create a custom Modelfile for fine-tuned settings:
FROM llama4
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER num_gpu 99
PARAMETER num_thread 8
Then build and run:
ollama create llama4-custom -f Modelfile
ollama run llama4-custom
llama.cpp gives you granular control over quantization, context length, and GPU offloading. It's ideal for squeezing maximum performance from limited hardware.
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
# Build with CUDA support (NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
For Apple Silicon:
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)
Download a pre-quantized GGUF file (we cover more options in our best GGUF models for local AI guide). Popular quantization levels for Llama 4 Scout:
# Using huggingface-cli (install with: pip install huggingface_hub)
huggingface-cli download unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF \
--include "*Q4_K_M*" \
--local-dir ./models/llama4-scout
| Quantization | Size (Scout) | Quality | Speed | VRAM Needed |
|---|---|---|---|---|
| Q2_K | ~40 GB | Low | Fastest | ~40 GB (full) / ~12 GB (partial offload) |
| Q4_K_M | ~65 GB | Good | Fast | ~65 GB (full) / ~16 GB (partial offload) |
| Q5_K_M | ~77 GB | Very Good | Medium | ~77 GB (full) / ~20 GB (partial offload) |
| Q6_K | ~88 GB | Excellent | Slower | ~88 GB (full) / ~24 GB (partial offload) |
| Q8_0 | ~115 GB | Near-Perfect | Slowest | ~115 GB (full) / ~24 GB (partial offload) |
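Those file sizes follow directly from the total parameter count times the bits per weight of each scheme. The per-scheme averages below are approximations I've plugged in (k-quants mix precisions internally), so treat the results as ballpark figures:

```python
# Rough GGUF file-size estimate: total parameters x bits per weight / 8.
# Bits-per-weight values are approximate averages for each quant scheme.
SCOUT_PARAMS = 109e9  # Llama 4 Scout total parameter count

BITS_PER_WEIGHT = {
    "Q2_K": 2.96, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.56, "Q8_0": 8.5,
}

def gguf_size_gb(params: float, quant: str) -> float:
    """Approximate on-disk size in GB for a given quantization scheme."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{gguf_size_gb(SCOUT_PARAMS, quant):.0f} GB")
```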
./build/bin/llama-cli \
-m ./models/llama4-scout/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
-c 4096 \
-ngl 99 \
--chat-template llama4 \
-cnv
Key flags:
- -c 4096 — context length (increase if you have the VRAM)
- -ngl 99 — number of layers to offload to GPU (99 means all)
- -cnv — interactive conversation mode
To expose the model over an API instead of chatting in the terminal, run the server binary:
./build/bin/llama-server \
-m ./models/llama4-scout/Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
-c 4096 \
-ngl 99 \
--host 0.0.0.0 \
--port 8080
This exposes an OpenAI-compatible API at http://localhost:8080/v1.
vLLM is optimized for high-throughput inference and works well with multi-GPU setups for running the full-precision models.
pip install vllm
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--enforce-eager
For Maverick with 4 GPUs:
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--tensor-parallel-size 4 \
--max-model-len 16384
Hugging Face Access: You need to accept Meta's license agreement on Hugging Face and set your HF_TOKEN environment variable before downloading Llama 4 models through vLLM or Hugging Face.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
messages=[{"role": "user", "content": "Write a Python function to detect palindromes."}],
temperature=0.7,
max_tokens=512
)
print(response.choices[0].message.content)
Once your model is running, these are the tweaks that actually matter for day-to-day usability.
If you're running out of VRAM:
# llama.cpp: reduce layers on GPU
./build/bin/llama-cli -m model.gguf -ngl 20 -c 2048
# This offloads only 20 layers to GPU, the rest use CPU
Llama 4 Scout supports a context window of up to 10M tokens, but local setups should stick to practical limits:
| Context Length | VRAM Overhead | Use Case |
|---|---|---|
| 2,048 | Minimal | Quick Q&A |
| 8,192 | Moderate | Code generation, short docs |
| 32,768 | Significant | Long documents |
| 131,072 | Very High | Book-length analysis |
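One way to see why longer contexts cost so much is to estimate the KV cache directly. This sketch assumes Scout's reported architecture (48 layers, 8 KV heads, head dimension 128) and an fp16 cache; the helper name is mine, and real memory use adds activation and framework overhead on top:

```python
# KV-cache size: 2 tensors (K and V) per layer, each
# n_kv_heads x head_dim elements per token.
def kv_cache_gb(n_tokens: int, n_layers: int = 48, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Approximate K/V cache size in GB (fp16 elements by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / 1e9

for ctx in (2_048, 8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB")
```

The jump from under half a gigabyte at 2K tokens to tens of gigabytes at 131K tracks the "Minimal" to "Very High" overhead labels in the table.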
Don't have a GPU at all? You can still run Llama 4 Scout on CPU:
# llama.cpp with CPU only
./build/bin/llama-cli \
-m model-Q4_K_M.gguf \
-c 2048 \
-ngl 0 \
-t 8 \
-cnv
Expect significantly slower speeds (1-3 tokens/second vs 20-60+ on GPU), but it works for testing and light usage. Note: CPU inference requires enough system RAM to hold the full model (~65 GB for Q4_K_M).
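To put those speeds in perspective, a quick bit of arithmetic (the token rates are just representative points from the ranges above):

```python
def generation_time_s(n_tokens: int, tokens_per_s: float) -> float:
    """Seconds to generate n_tokens at a given throughput."""
    return n_tokens / tokens_per_s

# A 512-token answer at CPU-typical vs GPU-typical speeds
print(f"CPU at 2 tok/s:  {generation_time_s(512, 2):.0f} s")   # over 4 minutes
print(f"GPU at 40 tok/s: {generation_time_s(512, 40):.1f} s")
```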
A web interface makes interacting with your local Llama 4 much more comfortable.
# If using Ollama
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in your browser for a ChatGPT-like interface connected to your local Llama 4.
Here are the issues that trip up most people on their first attempt.
Out of memory: Switch to a smaller quantization, lower the context length, or offload fewer layers to the GPU. Try Q4_K_M if Q8_0 doesn't fit.
Slow generation: Ensure GPU offloading is active (the -ngl flag in llama.cpp), and check that the correct CUDA drivers are installed with nvidia-smi.
Model download fails: For Hugging Face downloads, make sure you've accepted the Llama 4 license agreement on the model page and set your access token:
export HF_TOKEN=your_token_here
huggingface-cli login
Garbled or nonsensical output: Make sure you're using the correct chat template. Llama 4 uses a specific prompt format; with llama.cpp, pass --chat-template llama4 to apply it automatically.
Running Llama 4 locally is straightforward with modern tooling. Ollama offers the fastest setup, llama.cpp gives you the most control, and vLLM is the best choice for multi-GPU configurations. Start with Llama 4 Scout at Q4_K_M quantization — it hits the sweet spot of quality and hardware requirements for most users.
The MoE architecture means you get the intelligence of a much larger model with the compute efficiency of a 17B parameter model during inference -- only 17B parameters are active per token. However, all expert weights (109B for Scout, 400B for Maverick) must be loaded into memory, so storage and VRAM requirements reflect the total parameter count. With quantization and partial GPU offloading via llama.cpp, Scout remains one of the most accessible frontier-class models for local deployment. If you want to explore more open-source models to run locally, check out our guide to 8 open source LLMs for local AI in 2026.
Frequently Asked Questions
Can I run Llama 4 Scout without a GPU?
Yes, you can run Llama 4 Scout on CPU using llama.cpp with a quantized GGUF model. You'll need at least 80 GB of system RAM for the Q4_K_M quantization (~65 GB model file plus overhead) and should expect speeds of 1-3 tokens per second. Set -ngl 0 to keep all layers on CPU. For machines with less RAM, use Q2_K (~40 GB).
How much VRAM do I need to run Llama 4 Scout?
It depends on quantization and whether you fully load the model into VRAM. At Q4_K_M, the model is ~65 GB, so fully loading it requires a GPU setup with at least 65 GB VRAM (e.g., 2x RTX 3090 or 1x A100 80 GB). However, llama.cpp supports partial GPU offloading: you can run Q4_K_M on a single 24 GB GPU (like an RTX 3090 or 4090) by keeping most layers in system RAM, with slower but usable inference. Full-precision BF16 requires ~220 GB of VRAM, which needs a multi-GPU server.
What's the difference between Llama 4 Scout and Maverick?
Both use 17B active parameters per inference, but Scout has 16 experts (109B total parameters) while Maverick has 128 experts (400B total). Maverick delivers higher-quality responses but requires significantly more storage and memory. Scout is the practical choice for most local setups.
Should I use Ollama or llama.cpp?
Ollama is better for beginners and quick setup — it handles everything with a single command. llama.cpp gives more control over quantization, GPU offloading, and context length, making it better for users who want to fine-tune performance on specific hardware.
Does Llama 4 run on Apple Silicon Macs?
Yes. Llama 4 Scout runs on Apple Silicon Macs using Ollama or llama.cpp with Metal acceleration. Since the Q4_K_M quantized model is ~65 GB, you'll need a Mac with at least 96 GB unified memory for best results (M2 Ultra, M3 Max 96 GB, M4 Max 128 GB). A 64 GB Mac can use smaller quantizations like Q2_K (~40 GB) but performance will be limited.
What's the easiest way to get a web interface for local Llama 4?
The easiest way is to run Llama 4 through Ollama, then connect Open WebUI via Docker. Open WebUI provides a ChatGPT-like browser interface that communicates with your local model through Ollama's API on port 11434.