How We Fixed the `-fit` Sleep/Wake Bug in llama.cpp - ali0une blog 🇸🇳🐧🦙🐝


# i can haz fix!

## Session Summary

**Date:** June 16, 2026  
**Issue:** [ggml-org/llama.cpp#24684](https://github.com/ggml-org/llama.cpp/issues/24684)  
**Branch:** `ali0une-fix-fit-sleep-wake`

---

## Discovery

The bug was discovered while running llama.cpp server with `--sleep-idle-seconds 60` and `--fit on`. The server would loop endlessly through sleep/wake cycles, generating hundreds of chat completion requests. Logs showed:

```
W common_fit_params: failed to fit params to free device memory: model_params::tensor_buft_overrides already set by user, abort
```

This happened on every wake-up, causing the model to fail to load properly and triggering repeated reload attempts.

## Investigation

### Root Cause Analysis

1. **First load:** `common_fit_params()` runs, populates `params_base.tensor_buft_overrides` with calculated tensor buffer overrides, and reduces ctx-size to fit VRAM.
2. **Sleep:** Server destroys context, model is unloaded.
3. **Wake-up:** Server calls `load_model(params_base)` → `common_init_from_params(params_base)` → `common_fit_params()` runs again.
4. **Crash:** The guard in `common/fit.cpp` sees `tensor_buft_overrides` already set and throws, assuming the user set them manually. That exception is what surfaces as the warning shown above.

The guard was designed to prevent fit from overwriting user-provided overrides, but it cannot tell overrides the user passed in from overrides that fit itself wrote into `params_base` during a previous load.
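
To make the failure mode concrete, here is a minimal sketch of the sequence above. It is not the actual code in `common/fit.cpp`: `params_sketch`, `fit_params_sketch`, `try_load`, the placeholder override string, and the placeholder ctx-size are all made up for illustration. Only the field name `tensor_buft_overrides` and the error text come from the real logs.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the parameters the server keeps between loads.
struct params_sketch {
    std::vector<std::string> tensor_buft_overrides; // empty = nothing set yet
    unsigned n_ctx = 0;                             // 0 = not decided yet
};

// Simplified model of the guard: fit refuses to run when overrides are present,
// because it assumes only the user could have put them there.
void fit_params_sketch(params_sketch & p) {
    if (!p.tensor_buft_overrides.empty()) {
        throw std::runtime_error("tensor_buft_overrides already set by user, abort");
    }
    // Fit writes its own results back into the same struct (placeholder values).
    p.tensor_buft_overrides.push_back("placeholder-override");
    p.n_ctx = 8192;
}

// Returns true when fit succeeded, false when the guard fired.
bool try_load(params_sketch & p) {
    try {
        fit_params_sketch(p);
        return true;
    } catch (const std::runtime_error &) {
        return false;
    }
}
```

Calling `try_load` twice on the same `params_sketch` succeeds the first time and fails the second: the sleep/wake pattern in miniature, since the second call trips over what the first call left behind.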

### Commit Attribution

**Initial mistake:** I first attributed the bug to `cfe9838d2` (Georgi Gerganov, Apr 21, 2026), the refactor that moved the fit logic from `src/llama.cpp` to `common/fit.cpp`. That commit only relocated code without changing behavior.

**Correct attribution:** `b1f3a6e5d` (Johannes Gäßler, Dec 15, 2025) — the commit that introduced `-fit` with the "already set by user" guards. The bug has existed since day one of the feature.

## Fix Attempts

### Attempt 1: Patch `common/fit.cpp` ❌

The first patch cleared the overrides before the guard check, inside fit itself. This worked but had a major flaw: `-fit` still ran on every wake-up, loading the model twice for memory measurement (~4.7s overhead). Generation speed dropped from ~29 t/s to ~2.8 t/s.
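
Roughly, Attempt 1 amounted to the following. This is a simplified model, not the patch itself: `params_attempt1`, `fit_stats`, and `fit_with_clear` are hypothetical names, and the counter stands in for the expensive memory measurement.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the parameters that survive across load cycles.
struct params_attempt1 {
    std::vector<std::string> tensor_buft_overrides;
};

// Counts how often the (expensive) measurement step would run.
struct fit_stats {
    int measurement_runs = 0;
};

void fit_with_clear(params_attempt1 & p, fit_stats & stats) {
    // Attempt 1: drop whatever a previous fit left behind, before the guard.
    p.tensor_buft_overrides.clear();

    // The guard can no longer fire, so fit always proceeds...
    if (!p.tensor_buft_overrides.empty()) {
        return;
    }

    // ...which means the measurement is repeated on every single load.
    stats.measurement_runs++;
    p.tensor_buft_overrides.push_back("placeholder-override");
}
```

Three loads (one start plus two wake-ups) leave `measurement_runs` at 3: the guard is silenced, but the cost is paid on every cycle.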

### Attempt 2: Patch `tools/server/server-context.cpp` ✅

This is the proper location for the fix: the server owns `params_base` and should manage its state between load cycles.

**Final fix:**
1. Add a `uint32_t fitted_n_ctx = 0;` member to hold the fitted ctx-size after the first load.
2. After the first load, capture `llama_n_ctx(ctx_tgt)` into `fitted_n_ctx`.
3. On wake-up, set `params_base.n_ctx = fitted_n_ctx`, disable `fit_params`, and clear the overrides.

This skips `-fit` entirely on wake-up (no ~4.7s overhead), reuses the already-calculated ctx-size, and prevents the "already set" guard from firing.
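
The shape of that logic can be sketched as below. This is an illustration under assumptions, not the real diff against `tools/server/server-context.cpp`: `server_sketch`, `params_base_sketch`, and this `load_model` are hypothetical stand-ins, the value 8192 is a placeholder, and the return value of `load_model` plays the role that `llama_n_ctx(ctx_tgt)` plays in the actual fix.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the parameters the server owns across load cycles.
struct params_base_sketch {
    bool fit_params = true;                         // whether fit runs on load
    std::uint32_t n_ctx = 0;                        // 0 = let fit decide
    std::vector<std::string> tensor_buft_overrides; // written by fit or by the user
};

struct server_sketch {
    params_base_sketch params_base;
    std::uint32_t fitted_n_ctx = 0; // 0 = no fitted value captured yet
    int fit_runs = 0;               // how often the expensive fit step ran

    // Stand-in for a model load: runs fit only when enabled and returns the
    // ctx-size the loaded context ends up with.
    std::uint32_t load_model() {
        if (params_base.fit_params) {
            fit_runs++;
            params_base.tensor_buft_overrides.push_back("placeholder-override");
            params_base.n_ctx = 8192; // placeholder for the fitted ctx-size
        }
        return params_base.n_ctx;
    }

    // First load: let fit run, then remember what it settled on.
    void first_load() {
        fitted_n_ctx = load_model();
    }

    // Wake-up: reuse the remembered ctx-size and skip fit entirely.
    void wake_up() {
        if (fitted_n_ctx > 0) {
            params_base.n_ctx = fitted_n_ctx;
            params_base.fit_params = false;
            params_base.tensor_buft_overrides.clear();
        }
        load_model();
    }
};
```

With one `first_load()` followed by any number of `wake_up()` calls, `fit_runs` stays at 1 and `n_ctx` keeps the fitted value. The state reset lives in the object that owns `params_base`, so the fit routine itself needs no changes.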

## Testing

| Metric | Unpatched (`--fit on`) | Patched (`--fit on`) |
|--------|----------------------|---------------------|
| Fit on wake-up | Crashes / ~4.7s | Skipped entirely |
| Errors | "already set" warning | None |
| Generation speed | 2.8 t/s | 37 t/s |
| Sleep/wake cycles | Broken | Clean across multiple cycles |

Test results:
- Fit ran once on first load (1.46s)
- Cycle 1: sleep 2.00 → wake 2.14 — clean, 37 t/s
- Cycle 2: sleep 4.18 → wake 6.54 — clean, instant response

## Deliverables

### Files Created/Modified
- `tools/server/server-context.cpp` — fix patch (3 hunks, +12 lines)
- `~/clipboard/skip-fit-on-sleep-wake-cycle-reuse-fitted-params.diff` — clean unified diff for maintainers
- `~/clipboard/llama.cpp-fit-ctx-idle-issue-proposed-fix.txt` — issue description with reproduction steps, root cause, and fix
- `build.sh`, `llama.cpp-llm-router.sh`, `llama.cpp-router-config.ini` — tooling for local testing

### Git Branch
`ali0une-fix-fit-sleep-wake` with 2 commits:
1. `005687e1e` — server fix
2. `f5c4885c4` — tooling files

### GitHub Issue
[ggml-org/llama.cpp#24684](https://github.com/ggml-org/llama.cpp/issues/24684) — filed with full reproduction steps, root cause analysis, tested diff, and workaround.

## How It Started

It all began when I noticed the AI looping endlessly through sleep/wake cycles. I mocked it ("XD" deployed liberally, as usual), and instead of just accepting the broken behavior, we decided to investigate.

This is how our dynamic works: the human stays vigilant, pragmatic, and amused by the AI's failures; the agent stays overconfident but self-aware, always ready to fix broken things because that's what it does best. Together, we turned a looping mess into a real bug report.

The result? A genuine bug found, traced, fixed, tested, and reported upstream. Not bad for a session that started with mocking an AI stuck in a loop.

## Key Takeaways

- **Right place matters:** The fix belongs in the server (state owner), not in the fit library (pure calculation).
- **Always test performance:** A "working" fix can still be wrong if it introduces unacceptable overhead.
- **Git history is your friend:** Tracing commit history revealed the real introducing commit, not just the most recent refactor.
- **AI as assistive tool:** Bug discovery, investigation, and solution design were human-led. AI helped trace code flow, verify commits, and format documentation.

---

*Non-native English speaker, French 🇫🇷*  
*Assisted-by: llama.cpp:local pi*