kv-cache: follow the source cache size when sharing cells by ServeurpersoCom · Pull Request #24267 · ggml-org/llama.cpp

ServeurpersoCom · 2026-06-07T15:14:21Z

Overview

With --fit, the trunk context can shrink below the draft default. The assistant then builds views sized for its own kv_size into the smaller shared K/V tensors and trips the ggml_view_4d assert during graph reserve. The fix is to follow the source cache size when sharing cells.
Reproduced and verified on CUDA (RTX PRO 6000 Blackwell, single GPU) and confirmed by @Stastez on ROCm (dual GPU) in the original report: #23398 (comment)
The override also normalizes a small base/SWA sizing mismatch between the two caches (4608 vs 4096) that exists independently of --fit.
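
The failure mode can be illustrated with a minimal sketch. It does not use the real ggml or llama.cpp types: toy_tensor, view_fits, make_view and shared_kv_size are hypothetical names, and the bounds check only approximates the kind of size assert that ggml_view_4d performs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a shared K/V tensor: a flat buffer of n_cells rows.
struct toy_tensor {
    size_t   row_bytes; // bytes per KV cell
    uint32_t n_cells;   // number of cells actually allocated by the source cache
    size_t nbytes() const { return row_bytes * n_cells; }
};

// A view of n_view cells starting at byte `offset` fits only if it ends
// inside the source buffer.
inline bool view_fits(const toy_tensor & src, uint32_t n_view, size_t offset = 0) {
    return offset + src.row_bytes * n_view <= src.nbytes();
}

// Mimics the assert: building a view larger than the source aborts.
inline size_t make_view(const toy_tensor & src, uint32_t n_view) {
    assert(view_fits(src, n_view) && "view exceeds source tensor");
    return src.row_bytes * n_view; // size of the view in bytes
}

// Assumed shape of the fix: when cells are shared, take the size from the
// source cache instead of the size the draft would pick on its own.
inline uint32_t shared_kv_size(uint32_t requested, const toy_tensor * src) {
    return src ? src->n_cells : requested;
}
```

With a source cache fitted to fewer cells than the draft requests, view_fits returns false for the draft's own size and true once the size is taken from shared_kv_size.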

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES

LiteWinOS · 2026-06-07T17:28:52Z

llama-cli.exe

[ Prompt: 231.5 t/s | Generation: 67.2 t/s ] non-mtp

[ Prompt: 212.6 t/s | Generation: 94.5 t/s ] 1-mtp

[ Prompt: 186.7 t/s | Generation: 102.2 t/s ] 2-mtp

[ Prompt: 411.2 t/s | Generation: 90.9 t/s ] 4-mtp

[ Prompt: 220.4 t/s | Generation: 92.7 t/s ] 3-mtp

gemma-4-12b-it-Q4_K_M.gguf
gemma-4-12B-it-MTP-Q8_0.gguf

RTX 5070

llama-server.exe

1-mtp
task 0 | prompt eval time =     148.18 ms /    21 tokens (    7.06 ms per token,   141.72 tokens per second)
task 0 |        eval time =    7633.39 ms /   637 tokens (   11.98 ms per token,    83.45 tokens per second)

2-mtp
task 0 | prompt eval time =     170.56 ms /    21 tokens (    8.12 ms per token,   123.12 tokens per second)
task 0 |        eval time =    7711.39 ms /   681 tokens (   11.32 ms per token,    88.31 tokens per second)

The error still appears, but I guess it's working:
E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)

CLI - gemma-4-26B-A4B-it-MXFP4_MOE
[ Prompt: 51.8 t/s | Generation: 55.9 t/s ] 2-mtp
[ Prompt: 49.4 t/s | Generation: 42.4 t/s ] non-mtp

Server -

2-mtp
task 8 | prompt eval time =    2386.10 ms /    41 tokens (   58.20 ms per token,    17.18 tokens per second)
task 8 |        eval time =   11418.25 ms /   589 tokens (   19.39 ms per token,    51.58 tokens per second)

non-mtp
task 0 | prompt eval time =     583.07 ms /    21 tokens (   27.77 ms per token,    36.02 tokens per second)
task 0 |        eval time =   18981.52 ms /   723 tokens (   26.25 ms per token,    38.09 tokens per second)

…x under ggml-org#24267 init reorder + harden draft-simple null-ctx fallback

Mainline merge (ggml-org#23398 gemma4-MTP + ggml-org#24267 kv-cache) added a construction-time guard in llama-context that requires params.ctx_other for LLM_ARCH_GEMMA4_ASSISTANT. The server path set cparams.ctx_other=ctx_tgt before init, but the speculative-simple CLI did not, so the gemma4 external-assistant draft context threw during real init -> null ctx_dft -> has_mtp collapsed -> draft-simple fallback -> n_batch() on null ctx -> SIGSEGV (RC=139).

- speculative-simple.cpp: set cparams.ctx_other=ctx_tgt (+ n_rs_seq=0) before llama_init_from_model in the external-draft branch, mirroring tools/server.
- speculative.cpp: harden common_speculative_impl_draft_simple ctor to throw a clear error on null ctx_dft instead of segfaulting; suppress the draft-simple auto-fallback when a draft-context speculator (mtp/eagle3/dflash) was explicitly requested but its ctx failed to build; refuse to push draft-simple with a null draft ctx.
- speculative-simple.cpp: fail loudly if common_speculative_init returns null.

Gate (ai00 ROCm): Gemma4-26B-A4B external draft-mtp n_max=1 accept=78.000% (RC=0, matches pre-merge 61178aa exactly); qwen35 internal-MTP n_max=4 accept=31.122% (RC=0, no regression).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
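
The hardening described in that commit message, throwing a clear error on a null draft context instead of dereferencing it later, follows a common pattern. The sketch below is an illustration under assumed names only: toy_ctx, toy_draft_simple and try_make_draft are hypothetical and do not reproduce the actual common_speculative_impl_draft_simple code.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a llama context; only the batch size matters here.
struct toy_ctx {
    uint32_t n_batch;
};

// Hypothetical draft speculator that validates its context at construction,
// so a failed context init surfaces as an exception rather than a SIGSEGV.
struct toy_draft_simple {
    const toy_ctx * ctx_dft;

    explicit toy_draft_simple(const toy_ctx * ctx) : ctx_dft(ctx) {
        if (ctx_dft == nullptr) {
            throw std::invalid_argument("draft-simple: draft context is null");
        }
    }

    // Safe to dereference: the constructor rejected null contexts.
    uint32_t n_batch() const { return ctx_dft->n_batch; }
};

// Returns true if the speculator could be built, false if the context was null.
inline bool try_make_draft(const toy_ctx * ctx) {
    try {
        toy_draft_simple d(ctx);
        (void) d.n_batch();
        return true;
    } catch (const std::invalid_argument &) {
        return false;
    }
}
```

Checking in the constructor keeps every later member call free of null checks and moves the failure to the point where the missing context is first observable.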

kv-cache: follow the source cache size when sharing cells

b2c60e5

A fitted target context can end up smaller than the draft default. The oversized assistant views then overflow the shared K/V tensors and trip the ggml_view_4d size assert during graph reserve.

ServeurpersoCom requested a review from ggerganov as a code owner June 7, 2026 15:14

ggerganov approved these changes Jun 7, 2026

ggerganov merged commit f0156d1 into ggml-org:master Jun 7, 2026
25 checks passed

ServeurpersoCom mentioned this pull request Jun 7, 2026

llama : add Gemma4 MTP #23398

Merged

sswtodo mentioned this pull request Jun 7, 2026

WIP: kv-cache apply_ubtach change from O(n) to O(1) #24270

Closed

1 task

ggerganov merged 1 commit into ggml-org:master from ServeurpersoCom:fix/kv-cache-share-size
