kv-cache: follow the source cache size when sharing cells#24267

Merged
ggerganov merged 1 commit into ggml-org:master from ServeurpersoCom:fix/kv-cache-share-size
Jun 7, 2026

@ServeurpersoCom
Contributor

Overview

With --fit, the trunk context can shrink below the draft default. The assistant then builds views sized for its own kv_size into the smaller shared K/V tensors and trips the ggml_view_4d assert during graph reserve. The fix is to follow the source cache size when sharing cells.
Reproduced and verified on CUDA (RTX PRO 6000 Blackwell, single GPU) and confirmed by @Stastez on ROCm (dual GPU) in the original report: #23398 (comment)
The override also normalizes a small base/SWA sizing mismatch between the two caches (4608 vs 4096) that exists independently of --fit.
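
To illustrate the failure and the fix described above, here is a minimal standalone sketch. It does not use llama.cpp or ggml code: `kv_buffer`, `kv_view`, `make_view`, and `effective_kv_size` are hypothetical names invented for this example, and the cell counts in the usage note are illustrative rather than taken from the report. The bounds check stands in for the ggml_view_4d size assert, and it throws instead of aborting so the behavior is easy to observe.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for an allocated K/V tensor: only the cell count matters here.
struct kv_buffer {
    uint32_t n_cells; // number of cells actually allocated
};

// Hypothetical stand-in for a view covering cells [0, n_cells) of a buffer.
struct kv_view {
    uint32_t n_cells;
};

// Size-checked view creation: refuse a view that would run past the buffer.
// (The real code asserts in this situation; this sketch throws instead.)
inline kv_view make_view(const kv_buffer & buf, uint32_t n_cells) {
    if (n_cells > buf.n_cells) {
        throw std::out_of_range("view exceeds shared K/V buffer");
    }
    return kv_view{n_cells};
}

// Size a cache should use for its views: its own requested size when it owns
// its cells, or the source cache's size when it shares the source's cells.
inline uint32_t effective_kv_size(uint32_t requested, const kv_buffer * shared_src) {
    return shared_src != nullptr ? shared_src->n_cells : requested;
}
```

For example, with a source buffer of 2048 cells and a draft that requests 4096, `make_view(src, 4096)` throws, while `make_view(src, effective_kv_size(4096, &src))` succeeds with a 2048-cell view.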

Requirements

A fitted target context can end up smaller than the draft default; the
oversized assistant views then overflow the shared K/V tensors and trip
the ggml_view_4d size assert during graph reserve.
@ggerganov ggerganov merged commit f0156d1 into ggml-org:master Jun 7, 2026
25 checks passed
@LiteWinOS

LiteWinOS commented Jun 7, 2026

llama-cli.exe

[ Prompt: 231.5 t/s | Generation: 67.2 t/s ] non-mtp

[ Prompt: 212.6 t/s | Generation: 94.5 t/s ] 1-mtp

[ Prompt: 186.7 t/s | Generation: 102.2 t/s ] 2-mtp

[ Prompt: 411.2 t/s | Generation: 90.9 t/s ] 4-mtp

[ Prompt: 220.4 t/s | Generation: 92.7 t/s ] 3-mtp
gemma-4-12b-it-Q4_K_M.gguf
gemma-4-12B-it-MTP-Q8_0.gguf

rtx 5070

llama-server.exe

1-mtp
task 0 | prompt eval time =     148.18 ms /    21 tokens (    7.06 ms per token,   141.72 tokens per second)
task 0 |        eval time =    7633.39 ms /   637 tokens (   11.98 ms per token,    83.45 tokens per second)

2-mtp
task 0 | prompt eval time =     170.56 ms /    21 tokens (    8.12 ms per token,   123.12 tokens per second)
task 0 |        eval time =    7711.39 ms /   681 tokens (   11.32 ms per token,    88.31 tokens per second)

The error still appears, but I guess it's working:
E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)

CLI - gemma-4-26B-A4B-it-MXFP4_MOE 
[ Prompt: 51.8 t/s | Generation: 55.9 t/s ] 2-mtp
[ Prompt: 49.4 t/s | Generation: 42.4 t/s ] non-mtp

Server -
2-mtp

task 8 | prompt eval time =    2386.10 ms /    41 tokens (   58.20 ms per token,    17.18 tokens per second)
task 8 |        eval time =   11418.25 ms /   589 tokens (   19.39 ms per token,    51.58 tokens per second)

non-mtp

task 0 | prompt eval time =     583.07 ms /    21 tokens (   27.77 ms per token,    36.02 tokens per second)
task 0 |        eval time =   18981.52 ms /   723 tokens (   26.25 ms per token,    38.09 tokens per second)

jimbothigpen added a commit to jimbothigpen/llama.cpp that referenced this pull request Jun 7, 2026
…x under ggml-org#24267 init reorder + harden draft-simple null-ctx fallback

Mainline merge (ggml-org#23398 gemma4-MTP + ggml-org#24267 kv-cache) added a construction-time
guard in llama-context that requires params.ctx_other for LLM_ARCH_GEMMA4_ASSISTANT.
The server path set cparams.ctx_other=ctx_tgt before init, but the speculative-simple
CLI did not, so the gemma4 external-assistant draft context threw during real init ->
null ctx_dft -> has_mtp collapsed -> draft-simple fallback -> n_batch() on null ctx
-> SIGSEGV (RC=139).

- speculative-simple.cpp: set cparams.ctx_other=ctx_tgt (+ n_rs_seq=0) before
  llama_init_from_model in the external-draft branch, mirroring tools/server.
- speculative.cpp: harden common_speculative_impl_draft_simple ctor to throw a clear
  error on null ctx_dft instead of segfaulting; suppress the draft-simple auto-fallback
  when a draft-context speculator (mtp/eagle3/dflash) was explicitly requested but its
  ctx failed to build; refuse to push draft-simple with a null draft ctx.
- speculative-simple.cpp: fail loudly if common_speculative_init returns null.

Gate (ai00 ROCm): Gemma4-26B-A4B external draft-mtp n_max=1 accept=78.000% (RC=0,
matches pre-merge 61178aa exactly); qwen35 internal-MTP n_max=4 accept=31.122%
(RC=0, no regression).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
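
The constructor hardening described in the commit message above (throw a clear error on a null draft context instead of crashing later) can be sketched as follows. This is a minimal illustration, not the code from that commit: `draft_ctx` and `simple_drafter` are hypothetical types made up for the example, and the error text is invented.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a draft context; only a batch size is modeled.
struct draft_ctx {
    int n_batch;
};

// Hypothetical drafter that validates its context once, at construction,
// so later accessors never dereference a null pointer.
class simple_drafter {
public:
    explicit simple_drafter(const draft_ctx * ctx) : ctx_(ctx) {
        if (ctx_ == nullptr) {
            throw std::invalid_argument("draft context is null; it failed to initialize");
        }
    }

    // Safe: ctx_ was checked in the constructor.
    int n_batch() const { return ctx_->n_batch; }

private:
    const draft_ctx * ctx_;
};
```

Failing at construction turns a later segfault into an exception the caller can report or handle at the point where the drafter is set up.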
3 participants