Running Ollama on a Tesla P40

Bringing an old datacentre GPU back to life for local LLM inference. £150 on eBay, two evenings of fan-mod work, then you own a 24 GB inference box that runs Gemma 3 12B, Llama 3.3 70B (quantised), and Qwen 2.5 Coder. This is the actual setup I run at home.

Why the P40 specifically

The Tesla P40 is a Pascal-generation datacentre card from 2016. It shipped in DGX-adjacent boxes, did its eight-year tour, and is now flooding eBay at £100 to £180. The headline number is 24 GB of GDDR5, which is the same VRAM as an RTX 3090 and an RTX 4090, and half the 48 GB of an RTX A6000 that costs roughly forty times more. For local LLM inference, VRAM is the only number that matters until you run out of patience.

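Pascal has no native bfloat16 and no Tensor cores, so you give up performance per watt against Ampere and Ada. FP16 is no help on this particular chip either: the P40's GP102 runs FP16 at a small fraction of its FP32 rate, so inference engines do their maths in FP32 and lean on the card's fast INT8 dot-product instructions (DP4A) for quantised weights. What you keep is respectable FP32 throughput, CUDA 12.x support, and enough VRAM to hold most of a 70B-parameter model at 3-bit quantisation, with the remaining layers offloaded to system RAM. For a home lab where the goal is "can I run this at all, on my own hardware, for one evening of electricity", the P40 is unbeatable on cost per GB of VRAM.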

I paid £140 including shipping for mine, off a UK seller who had pulled it from a decommissioned Dell PowerEdge. Pick a seller with a returns policy. Dead P40s exist; the heatsinks corrode in damp data halls.

The P40 is not fast. It is enough. And enough, in your own house, on a card you own, is a very different feeling to renting tokens by the million from someone else.

Hardware around the card

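The P40 is a full-height, dual-slot, 250 W card with a single 8-pin EPS (CPU-style) power connector, not a PCIe 8-pin. That detail catches people out, and the two pinouts differ, so never force a PCIe plug into the card. You need either a server PSU with an EPS lead spare, or a cheap adapter that feeds the card's EPS socket from PCIe power leads (the common ones take two PCIe 8-pin plugs), or a workstation board that exposes a spare EPS rail.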

Minimum viable host: a Dell Precision T5810, HP Z440, or Lenovo P510 workstation. All three were Xeon E5 v3/v4 boxes that came with proper 685 W or 825 W power supplies, full-length PCIe x16 slots, and crucially, the BIOS bits to let the system POST with a card that has no display output. The P40 has no HDMI or DisplayPort. If your motherboard refuses to boot without a primary display adapter, you are stuck.

My build is a T5810 with a Xeon E5-1650 v4, 64 GB of DDR4 ECC, a 1 TB NVMe on a PCIe adapter, the P40 in the top x16 slot, and the stock 685 W PSU. Total cost including the GPU: under £350.

Cooling the fanless beast

The P40 has no fan. It is a passive aluminium brick designed to sit in a 1U server with 15,000 RPM screamers shovelling air down its throat. Plug it into a quiet workstation and it will hit 95 C in ninety seconds of inference and thermal-throttle to half speed, or trip its over-temp shutdown.

Two routes that work. First, a 3D-printed shroud that clamps to the back of the card and accepts a 60 mm or 80 mm fan. Files are on Thingiverse and Printables under "Tesla P40 shroud". I run a Noctua NF-A6x25 PWM (60 mm) bolted to a Printables shroud, wired to a motherboard PWM header. It is near silent and holds the card at 68 C under sustained load.

Second, a commercial blower kit such as the ones from Mining-Heaven or random AliExpress sellers, usually a 40 mm or 50 mm blower in a plastic duct. They are louder and uglier, but they work, and you do not need a 3D printer. Either way, wire the fan to PWM and set a fan curve based on GPU temperature, not CPU.

bash
# Watch GPU temp and fan response in real time
watch -n 1 nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,power.draw \
  --format=csv,noheader
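A fan curve keyed to GPU temperature can be sketched in a few lines of Python. This is a minimal sketch, not a record of any particular setup: the 40 C and 75 C breakpoints, the 30% floor, and the function names are assumptions for illustration. Writing the duty cycle to the motherboard's PWM control is left out because that path differs per board.

```python
import subprocess


def read_gpu_temp_c() -> int:
    """Ask nvidia-smi for the temperature of the first GPU, in degrees C."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0].strip())


def fan_duty_percent(temp_c: float, low_c: float = 40.0, high_c: float = 75.0,
                     min_duty: int = 30, max_duty: int = 100) -> int:
    """Linear fan curve: min_duty at or below low_c, max_duty at or above high_c.

    The breakpoints are illustrative assumptions; tune them for your shroud.
    """
    if temp_c <= low_c:
        return min_duty
    if temp_c >= high_c:
        return max_duty
    fraction = (temp_c - low_c) / (high_c - low_c)
    return round(min_duty + fraction * (max_duty - min_duty))
```

Calling `fan_duty_percent(read_gpu_temp_c())` every few seconds and writing the result to whatever PWM control your board exposes gives a curve that follows the GPU, not the CPU.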

Drivers on Ubuntu 22.04

Ubuntu 22.04 LTS is the path of least resistance. Ubuntu 24.04 also works but the NVIDIA repo packaging lagged for a while; 22.04 is boring and stable. Start with a clean install, then blacklist nouveau, then install the proprietary driver.

bash
# 1. Disable nouveau before anything else
echo "blacklist nouveau
options nouveau modeset=0" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo reboot

# 2. Install build tooling + headers
sudo apt update
sudo apt install -y build-essential dkms linux-headers-$(uname -r) \
  pkg-config dmidecode

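# 3. Let ubuntu-drivers pick the right datacentre driver
#    (-server drivers are listed and installed with --gpgpu)
sudo ubuntu-drivers list --gpgpu
sudo ubuntu-drivers install --gpgpu nvidia:550-server
sudo apt install -y nvidia-utils-550-server   # provides nvidia-smi
sudo reboot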

# 4. Verify
nvidia-smi

`nvidia-smi` should report the P40 with roughly 24 GB of total memory (24576 MiB, or a little less while ECC is still enabled), driver 550.x, and CUDA 12.4. If it says "No devices were found", check that the EPS power lead is fully seated; the P40 will appear on the PCIe bus but refuse to initialise without it. `lspci | grep -i nvidia` tells you whether the card enumerated.

Installing Ollama

Ollama on Linux is a one-line install. It detects CUDA automatically and downloads the matching runner.

bash
curl -fsSL https://ollama.com/install.sh | sh

# Confirm Ollama saw the GPU
ollama --version
journalctl -u ollama -n 50 --no-pager | grep -iE 'cuda|gpu'
# You want to see: "looking for compatible GPUs", "library=cuda variant=v12"

# Pull and run something small to prove the pipe
ollama pull gemma3:12b
ollama run gemma3:12b "Say hello in one sentence."

# In another shell, confirm the model is on the GPU
ollama ps

`ollama ps` should list the loaded model with a SIZE in GB and a PROCESSOR column reading 100% GPU. If it says 100% CPU, CUDA was not detected and Ollama silently fell back. Re-check nvidia-smi from inside the Ollama systemd context: `sudo -u ollama nvidia-smi`.

Model picks for 24 GB

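With 24 GB of VRAM you can pick from almost everything on Ollama's library, as long as you mind the quantisation. The rule of thumb: parameter count in billions, times bits per weight, divided by eight, gives the rough footprint of the weights in GB; add 1 to 3 GB for the KV cache at modest context lengths. K-quants carry some overhead, so q4_K_M averages closer to 5 bits per weight than 4.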
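The rule of thumb is easy to turn into a quick calculator. The sketch below uses nominal bit widths and a flat 2 GB KV-cache allowance, both simplifying assumptions, and the function names are made up for this example; real GGUF files come out somewhat larger.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     kv_cache_gb: float = 2.0) -> float:
    """Rule-of-thumb VRAM need: weights (params * bits / 8) plus a KV-cache allowance."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_cache_gb


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = 24.0, kv_cache_gb: float = 2.0) -> bool:
    """True if the estimate fits inside the given VRAM budget."""
    return estimate_vram_gb(params_billion, bits_per_weight, kv_cache_gb) <= vram_gb


for name, params, bits in [("12B at 4-bit", 12, 4), ("24B at 4-bit", 24, 4),
                           ("70B at 3-bit", 70, 3)]:
    print(f"{name}: ~{estimate_vram_gb(params, bits):.1f} GB, "
          f"fits in 24 GB: {fits(params, bits)}")
```

The 70B case lands above 24 GB even before the cache allowance, which is why that model needs some of its layers outside VRAM.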

My standing rotation:

Gemma 3 12B (q4_K_M), around 8 GB on disk, sits comfortably in VRAM with plenty of headroom for an 8K context. Best general-purpose model on the box. Around 22 to 28 tokens/sec on the P40.
Qwen 2.5 Coder 14B (q4_K_M), around 9 GB, my daily code companion. Tab-completion quality is genuinely competitive with hosted Copilot for non-trivial refactors. Around 18 to 22 tokens/sec.
Mistral Small 3 (24B, q4_K_M), around 14 GB, the heavy generalist when Gemma is not deep enough. Around 12 to 15 tokens/sec.
Llama 3.3 70B (q3_K_M), around 34 GB on disk, so it cannot sit entirely in 24 GB: with num_gpu tuned down, most layers load into VRAM and the rest run from system RAM. Around 5 to 7 tokens/sec, which is slow but usable for an overnight batch job.

bash
# Pull the rotation
ollama pull gemma3:12b
ollama pull qwen2.5-coder:14b
ollama pull mistral-small:24b
ollama pull llama3.3:70b-instruct-q3_K_M

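# Force a heavy model to fit by capping how many layers go to the GPU.
# num_gpu is a model parameter, not a CLI flag or environment variable:
# set it inside the session and tune it down until the load stops OOMing.
# (OLLAMA_NUM_PARALLEL belongs in the server's systemd unit, not here.)
ollama run llama3.3:70b-instruct-q3_K_M --verbose
# >>> /set parameter num_gpu 48    # example starting point, not a measured value
# >>> Summarise the Treaty of Westphalia in 200 words.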
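The num_gpu setting mentioned in the comments above can also be passed per request through the `options` field of Ollama's HTTP API, which makes it easy to script the tune-down loop. A minimal sketch, assuming a server on the default local port; the helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_payload(model: str, prompt: str, num_gpu: int) -> dict:
    """Body for POST /api/generate with a per-request GPU layer cap."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},  # number of layers to place on the GPU
    }


def generate(payload: dict, host: str = "http://127.0.0.1:11434") -> str:
    """Send the request to a local Ollama server and return the generated text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Lower `num_gpu` step by step until the model loads without an out-of-memory error, then keep the highest value that worked.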

Serving Ollama securely

By default Ollama binds to 127.0.0.1:11434. To use it from other machines on your network, leave that loopback binding alone and put Caddy on the same box in front of it with bearer-token auth. Setting OLLAMA_HOST=0.0.0.0:11434 would let any LAN client reach port 11434 directly and skip the token check; only do that if the proxy lives on a different machine, and firewall the port if so. Never expose port 11434 directly to the public internet; the API is unauthenticated by design.

bash
# /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=2"

sudo systemctl daemon-reload
sudo systemctl restart ollama

caddy
# /etc/caddy/Caddyfile, on the same box
llm.lan {
  tls internal
  @authed header Authorization "Bearer {env.OLLAMA_BEARER}"
  handle @authed {
    reverse_proxy localhost:11434
  }
  respond 401
}
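The Caddyfile expects an OLLAMA_BEARER value in Caddy's environment and the same token in each client's Authorization header. A minimal sketch of both halves follows. The helper names are hypothetical, and it assumes the client already trusts Caddy's internal CA; otherwise the TLS handshake fails before the token is ever checked.

```python
import json
import secrets
import urllib.request


def new_bearer_token() -> str:
    """Generate a random URL-safe token to export as OLLAMA_BEARER for Caddy."""
    return secrets.token_urlsafe(32)


def authed_request(token: str, prompt: str, model: str = "gemma3:12b",
                   base_url: str = "https://llm.lan") -> urllib.request.Request:
    """Build (but do not send) a generate request carrying the bearer token."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Pass the returned object to `urllib.request.urlopen` to send it; a request without the header should come back as a 401 from Caddy.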

Measuring performance

Ollama prints per-request stats when you run with --verbose. The numbers that matter are eval rate (tokens/sec of generation) and prompt eval rate (tokens/sec of prefill). On the P40, prefill is fast (hundreds of tokens/sec) but generation is the bottleneck because it is memory-bandwidth limited.

bash
ollama run gemma3:12b --verbose "Write a haiku about Cornish pasties."
# total duration:       3.42s
# load duration:        18.3ms
# prompt eval count:    18 token(s)
# prompt eval duration: 142ms
# prompt eval rate:     126.8 tokens/s
# eval count:           41 token(s)
# eval duration:        1.62s
# eval rate:            25.3 tokens/s
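The same figures are available programmatically: a non-streaming /api/generate reply carries the token counts and durations, with durations in nanoseconds. The sketch below recomputes both rates from those fields, using the numbers from the run above as sample input; the function name is made up for this example.

```python
def rates_from_response(resp: dict) -> tuple[float, float]:
    """Return (prefill tokens/s, generation tokens/s) from an /api/generate reply.

    Ollama reports durations in nanoseconds.
    """
    prompt_rate = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    eval_rate = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return round(prompt_rate, 1), round(eval_rate, 1)


# The figures from the --verbose run above, expressed as API fields.
sample = {
    "prompt_eval_count": 18, "prompt_eval_duration": 142_000_000,
    "eval_count": 41, "eval_duration": 1_620_000_000,
}
print(rates_from_response(sample))  # (126.8, 25.3)
```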

If a model feels too slow, drop one quantisation level (q4 to q3, or q3 to q2_K) before reaching for a smaller model. Quantisation costs you accuracy in subtle ways; switching models costs you behaviour in obvious ways. For long-context work, raise num_ctx but keep an eye on VRAM via nvidia-smi dmon; the KV cache grows linearly with context length.

Access from anywhere via Tailscale

I do not want my home GPU on the public internet. Tailscale gives every device on your account a stable IP on a private mesh; the P40 box is reachable from my laptop, my phone, and (with care) from a Vercel serverless function via a tailnet relay, and from nothing else.

bash
# On the GPU box
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --ssh --advertise-tags=tag:llm

# On any client (laptop, phone)
tailscale up
# Now you can talk to the box at its tailnet name, e.g.
curl -H "Authorization: Bearer $OLLAMA_BEARER" \
  https://gpu-box.tailnet-xxxx.ts.net/api/generate \
  -d '{"model":"gemma3:12b","prompt":"hello","stream":false}'

Combine Tailscale with a Caddy tls internal cert and a bearer token, and you have a setup that is private, encrypted end-to-end, and requires no port forwarding on your home router. The whole stack survives an ISP-assigned IP change because clients reach the box by its tailnet address, which does not depend on your public IP.

Pitfalls

Forgetting to disable ECC scrubbing

The P40 ships with ECC enabled, which costs you about 1 GB of usable VRAM and a measurable chunk of effective bandwidth. Disable with `sudo nvidia-smi -e 0` then reboot. You lose ECC protection; for inference workloads that is fine.

PSU sag under sustained load

A 250 W card with a 685 W PSU sounds comfortable on paper, but cheap or aging units sag on the 12 V rail when CPU and GPU both spike. Symptoms: random reboots mid-generation. Swap in a good Seasonic or Corsair 750 W before chasing software bugs.

The "GPU not found" cycle

A new install often fails to detect the P40 the first time. The usual causes are a half-seated EPS power lead, nouveau still loaded, or a BIOS option called "Above 4G Decoding" being disabled. Check all three before reinstalling drivers a fourth time.
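Two of those three checks can be scripted. The sketch below looks for nouveau in /proc/modules and for an NVIDIA entry in lspci output; the function names and messages are assumptions for illustration, and the BIOS setting still has to be checked by hand.

```python
import subprocess


def nouveau_loaded(proc_modules_text: str) -> bool:
    """True if the nouveau kernel module appears in /proc/modules content."""
    return any(line.split(" ", 1)[0] == "nouveau"
               for line in proc_modules_text.splitlines() if line)


def nvidia_on_pci_bus(lspci_text: str) -> bool:
    """True if any lspci line mentions NVIDIA, i.e. the card enumerated."""
    return any("nvidia" in line.lower() for line in lspci_text.splitlines())


def triage() -> list[str]:
    """Collect likely causes to check before reinstalling the driver."""
    findings = []
    with open("/proc/modules") as f:
        if nouveau_loaded(f.read()):
            findings.append("nouveau is still loaded: recheck the blacklist and initramfs")
    lspci = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    if nvidia_on_pci_bus(lspci):
        findings.append("card enumerates: if nvidia-smi finds no device, reseat the EPS lead")
    else:
        findings.append("card not on the PCIe bus: check seating and Above 4G Decoding")
    return findings
```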

Missing dmidecode breaks the driver installer

NVIDIA datacentre driver scripts can probe DMI to identify the platform, so install `dmidecode` before running ubuntu-drivers. If the install completes but the kernel module will not load on a Secure Boot system, the module is unsigned or its signing key is not enrolled: finish the MOK enrolment prompt at the next reboot, or disable Secure Boot in the BIOS.

Thermal throttling that looks like a slow model

If tokens/sec drops by 30 to 50 percent after the first minute of generation, it is the card hitting 88 C and clocking down. Fix the airflow, not the model. `nvidia-smi -q -d PERFORMANCE` shows the active throttle reasons, and `nvidia-smi -q -d CLOCK` shows current clocks against the maximum.
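A small check can tell a throttled card from a slow model by comparing the current SM clock with the maximum while the card is hot. This is a heuristic sketch: the 80% clock ratio and the 85 C threshold are assumptions chosen for illustration, not NVIDIA limits, and the function names are hypothetical.

```python
import subprocess

QUERY = "clocks.sm,clocks.max.sm,temperature.gpu"


def parse_clock_line(line: str) -> tuple[int, int, int]:
    """Parse 'current_mhz, max_mhz, temp_c' as printed with csv,noheader,nounits."""
    current_mhz, max_mhz, temp_c = (int(field.strip()) for field in line.split(","))
    return current_mhz, max_mhz, temp_c


def looks_throttled(current_mhz: int, max_mhz: int, temp_c: int,
                    clock_ratio: float = 0.8, hot_c: int = 85) -> bool:
    """Heuristic: a hot card running well below its maximum SM clock.

    Both thresholds are assumptions for illustration, not NVIDIA limits.
    """
    return temp_c >= hot_c and current_mhz < clock_ratio * max_mhz


def sample_gpu() -> tuple[int, int, int]:
    """Read one sample from the first GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_clock_line(out.splitlines()[0])
```

Run `looks_throttled(*sample_gpu())` during a long generation; the temperature condition keeps an idle, downclocked card from being flagged.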

KV cache eats your VRAM at long contexts

A 12B model at 32K context can use more VRAM for the KV cache than for the weights. If `ollama ps` shows the model partially on CPU at long prompts, lower `num_ctx` or use a smaller model.
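The arithmetic behind that claim is short: a full-attention KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below uses an illustrative shape for a 12B-class model with grouped-query attention; the layer count, head count, and head size are assumptions, so read the real values from the model card. Sliding-window attention or a quantised cache would shrink the result.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Size of a full-attention KV cache: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Illustrative shape for a 12B-class model with grouped-query attention.
# These numbers are assumptions; read the real ones from the model card.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 256

for ctx in (8_192, 32_768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"context {ctx}: {gib:.1f} GiB of KV cache at 16-bit")
```

With these assumed numbers the cache at 32K context is larger than an 8 GB set of 4-bit weights, and it scales linearly, so halving `num_ctx` halves the cache.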

Wrap up

A working P40 box is not glamorous. It is loud-ish, it draws 250 W at full tilt, and it will never match a 4090 on tokens per second. But it sits in a corner of my office, runs the models I actually use for code and writing, and costs less than a single month of a serious hosted-inference bill. The electricity it has burned in a year of daily use comes to roughly £40 at UK rates.

More importantly, the models live on a disk I own. The prompts never leave the house. If a hosted provider deprecates a checkpoint I depend on, I still have it. For anyone learning how LLMs actually behave under the hood, owning the box is the cheapest education available.

Want this done for you?

If you would rather skip the yak shave and have someone who has done this fifty times set it up properly, that is what I do for a living.
