kv-cache: follow the source cache size when sharing cells #24267
Merged
ggerganov merged 1 commit on Jun 7, 2026
Conversation
A fitted target context can end up smaller than the draft default. The oversized assistant views then overflow the shared K/V tensors and trip the ggml_view_4d size assert during graph reserve.
ggerganov approved these changes on Jun 7, 2026
llama-cli.exe / llama-server.exe: the error still appears, but I guess it's working (server, non-MTP).
jimbothigpen added a commit to jimbothigpen/llama.cpp that referenced this pull request on Jun 7, 2026
…x under ggml-org#24267 init reorder + harden draft-simple null-ctx fallback

Mainline merge (ggml-org#23398 gemma4-MTP + ggml-org#24267 kv-cache) added a construction-time guard in llama-context that requires params.ctx_other for LLM_ARCH_GEMMA4_ASSISTANT. The server path set cparams.ctx_other=ctx_tgt before init, but the speculative-simple CLI did not, so the gemma4 external-assistant draft context threw during real init -> null ctx_dft -> has_mtp collapsed -> draft-simple fallback -> n_batch() on null ctx -> SIGSEGV (RC=139).

- speculative-simple.cpp: set cparams.ctx_other=ctx_tgt (+ n_rs_seq=0) before llama_init_from_model in the external-draft branch, mirroring tools/server.
- speculative.cpp: harden common_speculative_impl_draft_simple ctor to throw a clear error on null ctx_dft instead of segfaulting; suppress the draft-simple auto-fallback when a draft-context speculator (mtp/eagle3/dflash) was explicitly requested but its ctx failed to build; refuse to push draft-simple with a null draft ctx.
- speculative-simple.cpp: fail loudly if common_speculative_init returns null.

Gate (ai00 ROCm): Gemma4-26B-A4B external draft-mtp n_max=1 accept=78.000% (RC=0, matches pre-merge 61178aa exactly); qwen35 internal-MTP n_max=4 accept=31.122% (RC=0, no regression).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
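The null-context hardening described in that commit message (throw a clear error instead of dereferencing a null draft context) can be illustrated with a minimal sketch. This is not the llama.cpp code: `draft_context` and `draft_simple_speculator` are hypothetical stand-ins for the draft `llama_context` and for `common_speculative_impl_draft_simple`, and the error text is invented for illustration.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the draft llama_context; only the field
// needed for this illustration is modeled.
struct draft_context {
    int n_batch;
};

// Hypothetical stand-in for a draft-simple speculator. The constructor
// rejects a null draft context up front, so a failed draft-context init
// surfaces as a readable error rather than a crash on first use.
class draft_simple_speculator {
public:
    explicit draft_simple_speculator(draft_context * ctx_dft) : ctx_dft_(ctx_dft) {
        if (ctx_dft_ == nullptr) {
            throw std::invalid_argument(
                "draft-simple: draft context is null (did the draft context fail to initialize?)");
        }
        // Safe to dereference only after the null check above.
        n_batch_ = ctx_dft_->n_batch;
    }

    int n_batch() const { return n_batch_; }

private:
    draft_context * ctx_dft_;
    int             n_batch_ = 0;
};
```

Constructing the speculator with `nullptr` throws `std::invalid_argument`, which a caller can report and exit on. Constructing it with a valid context behaves as before.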
Overview
With --fit, the trunk context can shrink below the draft default. The assistant then builds views sized for its own kv_size into the smaller shared K/V tensors and trips the ggml_view_4d assert during graph reserve. The fix: follow the source cache size when sharing cells.
Reproduced and verified on CUDA (RTX PRO 6000 Blackwell, single GPU) and confirmed by @Stastez on ROCm (dual GPU) in the original report: #23398 (comment)
The override also normalizes a small base/SWA sizing mismatch between the two caches (4608 vs 4096) that exists independently of --fit.
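The sizing problem and the override can be sketched in isolation. The snippet below is a simplified model, not the actual kv-cache code: `shared_kv_tensor`, `check_view_fits`, and `effective_kv_size` are hypothetical names, the bounds check only mimics the spirit of the ggml_view_4d size assert, and the cell counts used with it are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a K/V tensor owned by the source (target) cache.
struct shared_kv_tensor {
    uint32_t n_cells;        // number of cells actually allocated
    size_t   bytes_per_cell; // bytes occupied by one cell

    size_t nbytes() const { return size_t(n_cells) * bytes_per_cell; }
};

// Hypothetical bounds check in the spirit of the ggml_view_4d size assert:
// a view spanning n_view_cells cells must fit inside the underlying tensor.
inline void check_view_fits(const shared_kv_tensor & t, uint32_t n_view_cells) {
    if (size_t(n_view_cells) * t.bytes_per_cell > t.nbytes()) {
        throw std::length_error("view exceeds shared K/V tensor");
    }
}

// Choose the cell count a cache uses when building views.
// When cells are shared (src != nullptr), follow the source cache size
// instead of the cache's own default; otherwise keep the own size.
inline uint32_t effective_kv_size(uint32_t own_kv_size, const shared_kv_tensor * src) {
    return src != nullptr ? src->n_cells : own_kv_size;
}
```

For example, with a source tensor of 4096 cells and an assistant whose own kv_size is 4608, `check_view_fits(src, 4608)` throws, while `check_view_fits(src, effective_kv_size(4608, &src))` passes because the view is sized to the 4096 cells that actually exist.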
Requirements