Ship Your LLM API on AWS: A 5-Step Guide
Learn how to deploy an LLM API on AWS using Bedrock, SageMaker, or EC2 with vLLM. Includes step-by-step code, GPU selection, autoscaling, and production hardening.

What happens when your prototype LLM app hits real traffic — and your laptop can't keep up?
That's the moment you need a proper deployment. And AWS, with its sprawling menu of ML services, gives you at least three solid ways to deploy an LLM API on AWS for production workloads. But the number of options can be paralyzing. SageMaker? EC2 with vLLM? Bedrock? Each path has trade-offs that actually matter. (For a comparison of managed alternatives, see our Railway vs AWS breakdown.)
To deploy an LLM API on AWS, you choose between three main paths: Amazon Bedrock for zero-infrastructure managed APIs, SageMaker endpoints for managed hosting with flexibility, or EC2 instances with vLLM for full control and cost optimization. This guide covers all three with working code.
We'll go deep on the two most practical approaches — with actual code you can copy, paste, and run today.
By the end of this tutorial, you'll have:
- A working LLM API endpoint on AWS (via Bedrock or EC2 with vLLM)
- An authenticated API gateway with rate limiting in front of it
- Autoscaling, monitoring, and a pre-launch test checklist
Before we start, make sure you have:
- An AWS account with credentials configured via `aws configure`
- Python 3 with `boto3` and `pip` installed

First up: choosing your deployment path. This is the most important decision you'll make. Think of it like choosing between renting a furnished apartment, a bare office, or buying your own building. Get it wrong and you'll either overpay by 10x or spend weeks fighting infrastructure.

Here's the honest breakdown:
| Path | Best For | Setup Time | Cost Control | Flexibility |
|---|---|---|---|---|
| Amazon Bedrock | Teams wanting zero infrastructure management | Minutes | Low (per-token pricing) | Limited to supported models |
| SageMaker Endpoints | ML teams needing managed hosting with flexibility | Hours | Medium | High |
| EC2 + vLLM/TGI | Teams needing full control and cost optimization | Days | High | Maximum |
If you're just calling a proprietary frontier model like Claude, use Bedrock. If you're deploying open-weight models and care about cost, go EC2 with vLLM. SageMaker sits in the middle — managed but flexible.
Bedrock is the easiest path to deploy an LLM API on AWS. You get API access to models from Anthropic, Meta, Mistral AI, and others without managing any infrastructure. As of April 8, 2026, Bedrock supports Claude Opus 4.6, Llama 4 Maverick, and Mistral Large 2, among others. But you're locked into per-token pricing with no way to optimize inference costs.
SageMaker gives you managed endpoints with autoscaling. You pick the instance type, deploy a model container, and AWS handles the rest. Pretty solid middle ground.
EC2 + vLLM is the power-user path. You get full control over the inference engine, GPU selection, batching configuration, and every other knob. More work upfront, but the cost savings at scale are massive — we're talking 40–60% cheaper than managed alternatives for sustained workloads.
If you just need a managed LLM API running in minutes, Bedrock is your answer. Complete setup:
```python
import boto3
import json

# Initialize the Bedrock runtime client
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def call_llm(prompt, model_id="meta.llama4-maverick-17b-instruct-v1:0"):
    """Call an LLM via Bedrock. Check the Bedrock console for current model IDs."""
    body = json.dumps({
        "prompt": prompt,
        "max_gen_len": 2048,
        "temperature": 0.7
    })
    response = bedrock.invoke_model(
        modelId=model_id,
        body=body,
        contentType="application/json"
    )
    result = json.loads(response["body"].read())
    return result["generation"]
```
That's it. No GPU provisioning, no Docker containers, no model downloads. You're calling a production LLM inference endpoint in about 15 lines of Python.
But here's the catch — you're paying per token with no volume optimization. For high-traffic applications (think thousands of requests per hour), this gets expensive fast. And you can't tune the inference engine at all.
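To see where that crossover happens for your traffic, a back-of-envelope comparison helps. All the prices below are placeholder assumptions for illustration, not current AWS rates:

```python
# Rough break-even sketch: per-token (Bedrock-style) vs. fixed-cost GPU instance.
# Every price here is a placeholder assumption -- check current AWS pricing.

def monthly_token_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Cost of a pay-per-token API at a given monthly volume."""
    return tokens_per_month / 1000 * price_per_1k_tokens

def monthly_instance_cost(hourly_rate: float, hours: int = 730) -> float:
    """Cost of keeping one GPU instance running all month (~730 hours)."""
    return hourly_rate * hours

# Assumed numbers: 500M tokens/month at $0.003 per 1K tokens,
# vs. an instance at an assumed $5.672/hour on-demand rate.
per_token = monthly_token_cost(tokens_per_month=500_000_000, price_per_1k_tokens=0.003)
instance = monthly_instance_cost(hourly_rate=5.672)
print(f"Per-token: ${per_token:,.0f}/mo, dedicated instance: ${instance:,.0f}/mo")
```

The takeaway isn't the specific numbers — it's that per-token pricing scales linearly with volume while a dedicated instance is flat, so sustained high traffic eventually favors self-hosting.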
Now we'll deploy an open-weight model on a GPU instance using vLLM, one of the fastest open-source LLM inference engines available.
As of April 8, 2026, these are your best GPU options on EC2:
| Instance | GPU | VRAM | Good For |
|---|---|---|---|
| g5.xlarge | A10G (1x) | 24 GB | Models up to 13B (quantized) |
| g5.12xlarge | A10G (4x) | 96 GB | 70B models (quantized) |
| p4d.24xlarge | A100 (8x) | 320 GB | Large models, high throughput |
| p5.48xlarge | H100 (8x) | 640 GB | Maximum performance |
For Mistral Large 2 (123B parameters) at INT4 quantization (AWQ or GPTQ), you need roughly 62 GB of VRAM. A g5.12xlarge with 4x A10G GPUs (96 GB total) handles this with room to spare. At full FP16 precision you'd need roughly 246 GB — requiring a p4d.24xlarge or larger. Check current pricing on the AWS EC2 pricing page — GPU instance costs shift quarterly.
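That arithmetic generalizes into a rule of thumb you can reuse for any model: parameter count times bytes per parameter, plus headroom for the KV cache and activations. A rough estimator (the 25% headroom figure is my assumption, not a spec):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     kv_headroom: float = 0.25) -> float:
    """Rough VRAM estimate: model weights plus headroom for KV cache/activations.

    bytes_per_param: 2.0 for FP16, 1.0 for INT8, 0.5 for INT4 (AWQ/GPTQ).
    kv_headroom: extra fraction reserved for the KV cache -- a rule of thumb,
    not an exact figure (actual need depends on batch size and context length).
    """
    weights_gb = params_billion * bytes_per_param  # 1B params at 1 byte = 1 GB
    return weights_gb * (1 + kv_headroom)

# Mistral Large 2 (123B) at INT4: 61.5 GB of weights, ~77 GB with headroom,
# which is why a 96 GB g5.12xlarge handles it with room to spare.
print(f"{estimate_vram_gb(123, 0.5):.1f} GB")
```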
First, launch an EC2 instance with the AWS Deep Learning AMI:
```bash
# Replace the AMI ID with the latest Deep Learning AMI for your region
# Find it at: aws ec2 describe-images --filters "Name=name,Values=*Deep Learning*"
aws ec2 run-instances \
  --image-id ami-YOUR-DEEP-LEARNING-AMI \
  --instance-type g5.12xlarge \
  --key-name your-key-pair \
  --security-group-ids sg-your-security-group \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200}}]'
```
Then SSH in and set up vLLM:
```bash
# Install vLLM
pip install vllm

# Download and serve the model
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Large-Instruct-2407 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --port 8000 \
  --host 0.0.0.0
```
And just like that, you have an OpenAI-compatible API running on your own hardware. The --tensor-parallel-size 4 flag splits the model across all four GPUs automatically.
vLLM's PagedAttention gives you 2–4x better throughput than naive inference, according to vLLM's benchmarks. This isn't a nice-to-have — it's the difference between serving 10 concurrent users and 40.
```bash
curl http://your-ec2-ip:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-Large-Instruct-2407",
    "messages": [{"role": "user", "content": "Explain transformers in one paragraph."}],
    "max_tokens": 256
  }'
```
If you get a JSON response with a coherent answer, your LLM deployment is working. Time to build the production layer on top.
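It helps to script that check so you can rerun it after every change. A minimal smoke-test helper against the vLLM server from the previous step, stdlib only (`extract_answer` and `smoke_test` are my names, not part of any API):

```python
import json
import urllib.request

def extract_answer(completion: dict) -> str:
    """Pull the assistant message out of an OpenAI-style chat completion,
    failing loudly if the response shape is wrong."""
    choices = completion.get("choices")
    if not choices:
        raise ValueError(f"no choices in response: {completion}")
    return choices[0]["message"]["content"]

def smoke_test(base_url: str, model: str, prompt: str) -> str:
    """POST one chat completion to an OpenAI-compatible server (e.g. vLLM)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_answer(json.load(resp))

# Example, against your own instance:
# print(smoke_test("http://your-ec2-ip:8000",
#                  "mistralai/Mistral-Large-Instruct-2407",
#                  "Explain transformers in one paragraph."))
```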
Raw model endpoints aren't production-ready. You need authentication, rate limiting, and proper error handling — the boring stuff that keeps your service alive at 3 AM. Here's a FastAPI wrapper that adds all three:
```python
from fastapi import FastAPI, HTTPException, Depends
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import httpx
import os

app = FastAPI(title="LLM API Gateway")
security = HTTPBearer()

VLLM_URL = os.getenv("VLLM_URL", "http://localhost:8000")
API_KEYS = set(os.getenv("API_KEYS", "").split(","))

async def verify_key(creds: HTTPAuthorizationCredentials = Depends(security)):
    if creds.credentials not in API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return creds.credentials

@app.post("/v1/chat/completions")
async def chat(request: dict, api_key: str = Depends(verify_key)):
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{VLLM_URL}/v1/chat/completions",
            json=request
        )
    if response.status_code != 200:
        raise HTTPException(
            status_code=response.status_code,
            detail="Model inference failed"
        )
    return response.json()
```
This gives you an authenticated proxy sitting in front of your model. Add rate limiting with something like SlowAPI or AWS API Gateway, and you've got a proper production gateway.
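If you'd rather not pull in another dependency, the core of a rate limiter is small enough to sketch inline. A minimal in-process token bucket (keep one per API key; this is an illustration, not a replacement for API Gateway throttling when you run multiple gateway instances):

```python
import time

class TokenBucket:
    """Minimal in-process rate limiter: `rate` requests/second refill,
    bursts up to `capacity`. For multi-instance deployments you'd want
    a shared store (e.g. Redis) or AWS API Gateway throttling instead."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # tokens refilled per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if this request fits the budget, False to reject (429)."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In the FastAPI gateway above, you'd keep a `dict` mapping API key to bucket and raise `HTTPException(status_code=429)` when `allow()` returns False.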
Deploying the model is half the battle. Keeping it running reliably under real traffic is the other half.
For SageMaker endpoints, autoscaling is built in:
```python
import boto3

client = boto3.client("application-autoscaling")

# Register the endpoint variant as a scalable target (1 to 4 instances)
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/your-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4
)

# Scale on invocations per instance, targeting 70% of capacity
client.put_scaling_policy(
    PolicyName="llm-scaling-policy",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/your-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60
    }
)
```
For EC2, use an Auto Scaling Group behind an Application Load Balancer. Set your scaling metric to GPU utilization (target around 70%), published as a custom CloudWatch metric, since EC2 doesn't report GPU utilization out of the box.
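Since the scaling metric has to come from the instance itself, here's a sketch that reads utilization from `nvidia-smi` and pushes it to CloudWatch (the `LLM/Production` namespace and `GPUUtilization` metric name are arbitrary choices; run it from cron or a systemd timer on each GPU instance):

```python
import subprocess

def parse_gpu_utilization(csv_output: str) -> float:
    """Average utilization across GPUs from
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`,
    whose output is one integer per GPU per line, e.g. "63\n71\n58\n66"."""
    values = [float(line.strip()) for line in csv_output.strip().splitlines()]
    return sum(values) / len(values)

def publish_gpu_metric(namespace: str = "LLM/Production") -> float:
    """Read current utilization and publish it as a custom CloudWatch metric."""
    import boto3  # deferred so the parser above works without AWS dependencies
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    util = parse_gpu_utilization(out)
    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,
        MetricData=[{"MetricName": "GPUUtilization",
                     "Value": util, "Unit": "Percent"}])
    return util
```

Point the ASG's target-tracking policy at this custom metric and it scales on real GPU load rather than CPU, which is usually idle on inference boxes.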
Set up CloudWatch alarms for your critical metrics: P99 inference latency, GPU utilization, and error rate. vLLM exposes Prometheus metrics on its `/metrics` endpoint that you can scrape and forward to CloudWatch. A latency alarm, for example:

```bash
aws cloudwatch put-metric-alarm \
  --alarm-name "LLM-High-Latency" \
  --metric-name "InferenceLatencyP99" \
  --namespace "LLM/Production" \
  --threshold 5000 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --period 60 \
  --statistic Average \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts
```
For a more production-grade setup, wrap everything in Docker and deploy on ECS:
```dockerfile
FROM vllm/vllm-openai:latest

ENV MODEL_NAME=mistralai/Mistral-Large-Instruct-2407
ENV TENSOR_PARALLEL_SIZE=4

# Shell form so the environment variables expand at container start
CMD python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --port 8000 \
    --host 0.0.0.0
```
Docker + ECS gives you reproducible deployments and easy rollbacks. If something breaks at 3 AM, you want to be rolling back a container version — not SSH-ing into a GPU instance debugging Python dependency conflicts.
1. Undersizing your instance. If your model barely fits in VRAM, you'll have no room for the KV cache and throughput will tank. Always leave 20–30% VRAM headroom.
2. Skipping quantization. Running a 70B model in FP16 when AWQ or GPTQ quantized versions exist is burning money. The quality difference is often negligible — typically less than 1% on standard benchmarks.
3. Forgetting cold starts. Large models take 2–5 minutes to load into VRAM. Keep at least one warm instance running, and use SageMaker's minimum instance count or a health-check loop on EC2.
4. No request timeouts. LLMs can occasionally hang on malformed inputs. Set inference timeouts at every layer — the model server, your API gateway, and the client.
5. Ignoring networking costs. Data transfer between availability zones and regions adds up quickly. Keep your API gateway and model endpoints in the same AZ.
Before routing real traffic, run these checks:
Smoke test — send 10 diverse prompts and verify that responses are coherent and complete.
Load test — use a tool like Locust or k6 to simulate realistic traffic patterns:
```python
# locustfile.py
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def chat(self):
        self.client.post("/v1/chat/completions", json={
            "model": "mistralai/Mistral-Large-Instruct-2407",
            "messages": [{"role": "user", "content": "What is machine learning?"}],
            "max_tokens": 128
        }, headers={"Authorization": "Bearer your-api-key"})
```
Latency benchmarks — measure P50, P95, and P99 latencies under load. For most applications, you want P95 under 3 seconds for short responses.
Memory leak detection — monitor VRAM usage over 24 hours. Some model-serving frameworks have slow leaks that only show up under sustained traffic.
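Computing those percentiles doesn't need a stats library. A nearest-rank sketch over raw latency samples (the sample list is made up for illustration):

```python
def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile -- good enough for load-test reporting."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank index, clamped to the valid range
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical per-request latencies in milliseconds from a load test
latencies_ms = [220, 310, 295, 2400, 260, 280, 3100, 240, 250, 270]
print(f"P50={percentile(latencies_ms, 50)}ms  "
      f"P95={percentile(latencies_ms, 95)}ms  "
      f"P99={percentile(latencies_ms, 99)}ms")
```

Note how two slow outliers barely move the P50 but dominate the P95/P99 — which is exactly why averaging latency hides the problems your users actually feel.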
Once your AWS LLM deployment is running, consider these enhancements: streaming responses for lower perceived latency, spot instances for fault-tolerant workloads, and blue-green deployments for zero-downtime model upgrades.
As of April 8, 2026, the open-weight model ecosystem is moving fast. New models drop monthly, and inference engines like vLLM get regular performance bumps. Build your infrastructure to swap models easily — you'll thank yourself later.
**How much does it cost to deploy an LLM API on AWS?**
Costs vary widely by deployment path. A g5.xlarge instance (single A10G GPU) for a small model runs roughly $730/month on-demand in us-east-1, while a p4d.24xlarge (8x A100) costs upwards of $24,000/month on-demand in us-east-1. Spot instances can cut EC2 costs by 60–70% for fault-tolerant workloads. Bedrock charges per token with no fixed infrastructure cost, which is cheaper at low volumes but expensive at scale. Check the AWS pricing calculator for current rates — GPU instance pricing changes quarterly.
**Can I run LLM inference on spot instances?**
Yes, but with caveats. Spot instances offer 60–70% savings on GPU compute, but AWS can reclaim them with just two minutes of notice. This works for batch inference or non-critical workloads where occasional interruptions are acceptable. For real-time APIs with SLA requirements, use a mix: keep a baseline of on-demand instances for guaranteed capacity and add spot instances to handle traffic spikes. SageMaker managed spot training is another option for fine-tuning workloads.
**Does vLLM support streaming responses?**
Yes. vLLM's OpenAI-compatible server supports streaming out of the box. Set "stream": true in your request body, and the server returns tokens via Server-Sent Events as they're generated. Your FastAPI proxy needs to forward the SSE stream rather than buffering the full response — use httpx's async streaming or Starlette's StreamingResponse. This reduces time-to-first-token from several seconds to under 500ms for most models.
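For the proxy side, the fiddly part is interpreting the SSE lines as they arrive. A sketch of a parser for OpenAI-style stream chunks (`parse_sse_chunk` is my name, not a vLLM API; you'd pair it with Starlette's `StreamingResponse` to forward tokens downstream):

```python
import json
from typing import Optional

def parse_sse_chunk(line: str) -> Optional[str]:
    """Extract the token text from one OpenAI-style SSE line.

    Data lines look like:  data: {"choices":[{"delta":{"content":"Hi"}}]}
    and the stream terminates with:  data: [DONE]
    Returns None for keep-alives, comments, and the [DONE] sentinel."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    delta = json.loads(payload)["choices"][0].get("delta", {})
    return delta.get("content")
```

In the gateway you'd iterate the upstream response line by line, yield each non-None token immediately, and stop at the `[DONE]` sentinel — never buffer the whole completion.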
**When should I use SageMaker async endpoints instead of real-time endpoints?**
Real-time endpoints keep instances warm and return responses synchronously, with a hard 60-second timeout. Async endpoints accept requests via S3, process them in the background, and write results back to S3 — no timeout limit. Use async endpoints when your prompts generate long outputs (over 30 seconds of inference), when you need to process large batches, or when you want to decouple request submission from response retrieval. Async endpoints also scale to zero, so you don't pay for idle time.
**How do I upgrade to a new model version without downtime?**
Use a blue-green deployment pattern. On SageMaker, create a new endpoint configuration with the updated model, then call UpdateEndpoint to switch traffic — SageMaker handles the gradual rollover. On EC2/ECS, deploy the new model version to a separate target group behind your ALB, run smoke tests, then shift traffic using weighted routing. Keep the old version running until the new one is verified. This approach gives you instant rollback if the new model underperforms.