From 65b15cf0809675169bed7c353aaf4e61427d0604 Mon Sep 17 00:00:00 2001
From: Pringled
Date: Thu, 30 Apr 2026 09:24:31 +0200
Subject: [PATCH 1/2] docs: update semble docs with benchmarks, token efficiency, and new API features

---
 .../docs/packages/semble/benchmarks.mdx   | 79 +++++++++++++------
 .../docs/packages/semble/introduction.mdx |  2 +-
 src/content/docs/packages/semble/usage.mdx | 47 +++++++++++
 3 files changed, 105 insertions(+), 23 deletions(-)

diff --git a/src/content/docs/packages/semble/benchmarks.mdx b/src/content/docs/packages/semble/benchmarks.mdx
index dbebc4c..631981c 100644
--- a/src/content/docs/packages/semble/benchmarks.mdx
+++ b/src/content/docs/packages/semble/benchmarks.mdx
@@ -16,6 +16,8 @@ We benchmark quality and speed across all methods on ~1,250 queries over 63 repo
 | CodeRankEmbed | 0.765 | 57 s | 16 ms |
 | ColGREP | 0.693 | 5.8 s | 124 ms |
 | BM25 | 0.673 | 263 ms | 0.02 ms |
+| grepai | 0.561 | 35 s | 48 ms |
+| probe | 0.387 | — | 207 ms |
 | ripgrep | 0.126 | — | 12 ms |
 
 Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing **218× faster** and answering queries **11× faster** — entirely on CPU.
@@ -28,32 +30,63 @@ The charts below plot latency against NDCG@10. Marker size reflects model parame
 ![Speed vs quality (warm)](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/speed_vs_ndcg_warm.png)
 *Query latency on a warm index vs NDCG@10*
 
+## Token Efficiency
+
+Coding agents (Claude Code, OpenCode, etc.) typically find code by running `grep` on keywords and reading the matched files. We model that workflow and compare it against semble's chunk retrieval across our full benchmark of 1,251 queries.
+
+![Token efficiency: recall vs. retrieved tokens](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/token_efficiency.png)
+
+### Expected tokens per query
+
+For each query: tokens consumed at first relevant hit, or 32k if the method never finds anything. Averaged across all 1,251 queries.
+
+| Method | Expected tokens | Savings |
+|--------|----------------:|--------:|
+| ripgrep + read file | 45,692 | baseline |
+| **semble** | **566** | **98% fewer** |
+
+### Recall at fixed token budgets
+
+A relevant file is "covered" once any retrieved unit comes from it.
+
+| Method | 500 | 1k | 2k | 4k | 8k | 16k | 32k |
+|--------|----:|---:|---:|---:|---:|----:|----:|
+| **semble** | **0.685** | **0.849** | **0.938** | **0.976** | **0.991** | **0.996** | **0.996** |
+| ripgrep + read file | 0.001 | 0.008 | 0.037 | 0.088 | 0.212 | 0.379 | 0.583 |
+
+<details>
+<summary>Methodology</summary>
+
+Semble returns the top-50 ranked chunks. `ripgrep+read` splits the query into keywords (dropping stopwords and short words), runs `rg --fixed-strings --ignore-case` for each keyword, then reads matched files in full ranked by how many distinct keywords they contain. Both methods search the same set of file types and ignored directories. Tokens are counted with `cl100k_base` via `tiktoken`. A relevant file is "covered" once any retrieved unit overlaps its annotated span.
+
+</details>
+
 ## By Language
 
 NDCG@10 per language. Best score per row is bolded.
 
-| Language | semble | CRE Hybrid | CRE | ColGREP | ripgrep |
-|----------|-------:|-----------:|----:|--------:|--------:|
-| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.180 |
-| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.126 |
-| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.230 |
-| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.134 |
-| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.176 |
-| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.000 |
-| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.117 |
-| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.133 |
-| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.202 |
-| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.123 |
-| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.160 |
-| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.000 |
-| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.000 |
-| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.198 |
-| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.166 |
-| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.162 |
-| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.000 |
-| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.000 |
-| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.128 |
-| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.126** |
+| Language | semble | CRE Hybrid | CRE | ColGREP | grepai | probe | ripgrep |
+|----------|-------:|-----------:|----:|--------:|-------:|------:|--------:|
+| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.330 | 0.392 | 0.180 |
+| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.731 | 0.375 | 0.126 |
+| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.643 | 0.382 | 0.230 |
+| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.669 | 0.412 | 0.134 |
+| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.675 | 0.588 | 0.176 |
+| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.755 | 0.369 | 0.000 |
+| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.277 | 0.392 | 0.117 |
+| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.722 | 0.410 | 0.133 |
+| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.634 | 0.488 | 0.202 |
+| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.402 | 0.340 | 0.123 |
+| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.429 | 0.280 | 0.160 |
+| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.723 | 0.226 | 0.000 |
+| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.699 | 0.336 | 0.000 |
+| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.386 | 0.536 | 0.198 |
+| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.478 | 0.335 | 0.166 |
+| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.519 | 0.242 | 0.162 |
+| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.555 | 0.384 | 0.000 |
+| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.483 | 0.313 | 0.000 |
+| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.394 | 0.354 | 0.128 |
+| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.561** | **0.387** | **0.126** |
 
 ## Ablations
 
@@ -90,6 +123,8 @@ Languages covered: bash, C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Kotl
 ## Methods
 
 - **[ripgrep](https://github.com/BurntSushi/ripgrep)** — fast regex search, included as a raw keyword-match baseline.
+- **[probe](https://github.com/buger/probe)** — BM25 keyword ranking backed by tree-sitter parse trees. No persistent index; scans on the fly.
 - **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)** — late-interaction code retrieval with the LateOn-Code-edge model.
+- **[grepai](https://github.com/nicholasgasior/grepai)** — semantic search using [nomic-embed-text](https://huggingface.co/nomic-ai/nomic-embed-text-v1) (137M params) via a local Ollama daemon.
 - **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** — 137M-param transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
 - **semble** — [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
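Review note on the "Token Efficiency" section added to `benchmarks.mdx` above: the two metric definitions ("expected tokens per query" and "recall at fixed token budgets") could be read as the following minimal sketch. This is not the benchmark's actual code. The `Unit` type, both function names, the 32,000 value for "32k", file-level coverage, and the choice to drop (rather than truncate) a unit that would exceed the budget are all assumptions made for illustration.

```python
from dataclasses import dataclass

MISS_PENALTY = 32_000  # assumption: the "32k" charged when nothing relevant is found


@dataclass
class Unit:
    """One retrieved unit (a chunk or a whole file), listed in rank order (hypothetical type)."""
    file_path: str
    tokens: int


def tokens_to_first_hit(units, relevant_files, miss_penalty=MISS_PENALTY):
    """Tokens read up to and including the first unit that comes from a relevant file."""
    spent = 0
    for unit in units:
        spent += unit.tokens
        if unit.file_path in relevant_files:
            return spent
    return miss_penalty  # the method never surfaced a relevant file


def recall_at_budget(units, relevant_files, budget):
    """Fraction of relevant files covered by the units that fit within `budget` tokens."""
    spent = 0
    covered = set()
    for unit in units:
        if spent + unit.tokens > budget:
            break  # assumption: a unit that does not fit is dropped, not truncated
        spent += unit.tokens
        if unit.file_path in relevant_files:
            covered.add(unit.file_path)
    return len(covered) / len(relevant_files) if relevant_files else 0.0


# Toy ranking: three retrieved units, two relevant files.
units = [Unit("a.py", 200), Unit("b.py", 300), Unit("c.py", 400)]
first_hit = tokens_to_first_hit(units, {"b.py", "c.py"})
recall_500 = recall_at_budget(units, {"b.py", "c.py"}, 500)
```

Averaging `tokens_to_first_hit` over all queries would give the "Expected tokens" column, and averaging `recall_at_budget` at each budget would give one row of the recall table, under the assumptions stated above.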
diff --git a/src/content/docs/packages/semble/introduction.mdx b/src/content/docs/packages/semble/introduction.mdx
index bb1c524..c84b486 100644
--- a/src/content/docs/packages/semble/introduction.mdx
+++ b/src/content/docs/packages/semble/introduction.mdx
@@ -5,7 +5,7 @@ sidebar:
   icon: open-book
 ---
 
-[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, cutting both token usage and waiting time on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services.
+[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, using ~98% fewer tokens than grep+read and cutting latency on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services.
 
 Run it as an [MCP server](/packages/semble/mcp-server/) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo, cloned and indexed on demand.
 
diff --git a/src/content/docs/packages/semble/usage.mdx b/src/content/docs/packages/semble/usage.mdx
index 8b3b420..cc297e9 100644
--- a/src/content/docs/packages/semble/usage.mdx
+++ b/src/content/docs/packages/semble/usage.mdx
@@ -21,6 +21,28 @@ index = SembleIndex.from_git("https://github.com/MinishLab/model2vec")
 
 Indexing a full repo typically takes under 300 ms. Remote repos are cloned on first use and cached for the lifetime of the process.
 
+### Advanced options
+
+Both `from_path` and `from_git` accept optional parameters to control what gets indexed:
+
+```python
+index = SembleIndex.from_path(
+    "./my-project",
+    extensions=frozenset({".py", ".ts"}),  # only index these file types
+    ignore=frozenset({"dist", "node_modules"}),  # skip these directories
+    include_text_files=True,  # also index .md, .yaml, .json, etc.
+)
+```
+
+`from_git` additionally accepts a `ref` parameter to check out a specific branch or tag:
+
+```python
+index = SembleIndex.from_git(
+    "https://github.com/MinishLab/model2vec",
+    ref="v2.0.0",  # branch or tag; defaults to the remote HEAD
+)
+```
+
 ## Searching
 
 Search the index with a natural-language description or a code snippet:
@@ -34,6 +56,18 @@ for result in results:
     print()
 ```
 
+### Filtering
+
+Restrict results to specific languages or files using `filter_languages` and `filter_paths`:
+
+```python
+# Only return results from Python files
+results = index.search("parse config", filter_languages=["python"])
+
+# Only return results from specific files
+results = index.search("parse config", filter_paths=["src/config.py", "src/settings.py"])
+```
+
 ## Finding Related Code
 
 Given any search result, find other chunks that are semantically similar to it:
@@ -76,3 +110,16 @@ result.chunk.start_line # 42
 result.chunk.end_line  # 67
 result.chunk.content  # raw source code of the chunk
 ```
+
+## Index Stats
+
+Inspect the state of an index with the `stats` property:
+
+```python
+stats = index.stats
+
+stats.indexed_files  # number of files indexed
+stats.total_chunks  # total number of chunks
+stats.languages  # dict mapping language name to chunk count
+                 # e.g. {"python": 412, "typescript": 88}
+```

From f9a298ff37f4c53a744b9171a37363b028d91c67 Mon Sep 17 00:00:00 2001
From: Pringled
Date: Thu, 30 Apr 2026 09:27:43 +0200
Subject: [PATCH 2/2] docs: fix Python version to 3.10, remove em dashes from semble docs

---
 .../docs/packages/semble/benchmarks.mdx   | 20 +++++++++----------
 .../docs/packages/semble/installation.mdx |  4 ++--
 .../docs/packages/semble/introduction.mdx | 10 +++++-----
 src/content/docs/packages/semble/usage.mdx |  2 +-
 4 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/src/content/docs/packages/semble/benchmarks.mdx b/src/content/docs/packages/semble/benchmarks.mdx
index 631981c..c329ec0 100644
--- a/src/content/docs/packages/semble/benchmarks.mdx
+++ b/src/content/docs/packages/semble/benchmarks.mdx
@@ -17,10 +17,10 @@ We benchmark quality and speed across all methods on ~1,250 queries over 63 repo
 | ColGREP | 0.693 | 5.8 s | 124 ms |
 | BM25 | 0.673 | 263 ms | 0.02 ms |
 | grepai | 0.561 | 35 s | 48 ms |
-| probe | 0.387 | — | 207 ms |
-| ripgrep | 0.126 | — | 12 ms |
+| probe | 0.387 | - | 207 ms |
+| ripgrep | 0.126 | - | 12 ms |
 
-Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing **218× faster** and answering queries **11× faster** — entirely on CPU.
+Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing **218× faster** and answering queries **11× faster**, entirely on CPU.
 
 The charts below plot latency against NDCG@10. Marker size reflects model parameter count.
 
@@ -96,7 +96,7 @@ NDCG@10 per language. Best score per row is bolded.
 |-----------|----:|----------:|
 | BM25 | 0.675 | 0.834 |
 | potion-code-16M | 0.650 | 0.821 |
-| BM25 + potion-code-16M | — | **0.854** |
+| BM25 + potion-code-16M | - | **0.854** |
 
 By query category:
 
@@ -122,9 +122,9 @@ Languages covered: bash, C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Kotl
 
 ## Methods
 
-- **[ripgrep](https://github.com/BurntSushi/ripgrep)** — fast regex search, included as a raw keyword-match baseline.
-- **[probe](https://github.com/buger/probe)** — BM25 keyword ranking backed by tree-sitter parse trees. No persistent index; scans on the fly.
-- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)** — late-interaction code retrieval with the LateOn-Code-edge model.
-- **[grepai](https://github.com/nicholasgasior/grepai)** — semantic search using [nomic-embed-text](https://huggingface.co/nomic-ai/nomic-embed-text-v1) (137M params) via a local Ollama daemon.
-- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** — 137M-param transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
-- **semble** — [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
+- **[ripgrep](https://github.com/BurntSushi/ripgrep)**: fast regex search, included as a raw keyword-match baseline.
+- **[probe](https://github.com/buger/probe)**: BM25 keyword ranking backed by tree-sitter parse trees. No persistent index; scans on the fly.
+- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)**: late-interaction code retrieval with the LateOn-Code-edge model.
+- **[grepai](https://github.com/nicholasgasior/grepai)**: semantic search using [nomic-embed-text](https://huggingface.co/nomic-ai/nomic-embed-text-v1) (137M params) via a local Ollama daemon.
+- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)**: 137M-param transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
+- **semble**: [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
diff --git a/src/content/docs/packages/semble/installation.mdx b/src/content/docs/packages/semble/installation.mdx
index 595c402..f4fe719 100644
--- a/src/content/docs/packages/semble/installation.mdx
+++ b/src/content/docs/packages/semble/installation.mdx
@@ -7,8 +7,8 @@ sidebar:
 
 ## Requirements
 
-- Python 3.9 or higher
-- No GPU, API keys, or external services required — runs fully on CPU
+- Python 3.10 or higher
+- No GPU, API keys, or external services required. Runs fully on CPU.
 
 ## Install
 
diff --git a/src/content/docs/packages/semble/introduction.mdx b/src/content/docs/packages/semble/introduction.mdx
index c84b486..f671b2d 100644
--- a/src/content/docs/packages/semble/introduction.mdx
+++ b/src/content/docs/packages/semble/introduction.mdx
@@ -60,10 +60,10 @@ Semble splits each file into code-aware chunks using [Chonkie](https://github.co
 
 The two score lists are fused with Reciprocal Rank Fusion (RRF) and then reranked with a set of code-aware signals:
 
-- **Adaptive weighting** — symbol-like queries (`Foo::bar`, `getUserById`) get more lexical weight; natural-language queries stay balanced.
-- **Definition boosts** — a chunk that defines the queried symbol (`class`, `def`, `func`) ranks above chunks that merely reference it.
-- **Identifier stems** — query tokens are stemmed and matched against identifier stems, so `parse config` boosts chunks containing `parseConfig`, `ConfigParser`, or `config_parser`.
-- **File coherence** — when multiple chunks from the same file match, the file is boosted so the top result reflects broad file-level relevance.
-- **Noise penalties** — test files, `compat`/`legacy` shims, example code, and `.d.ts` stubs are down-ranked so canonical implementations surface first.
+- **Adaptive weighting**: symbol-like queries (`Foo::bar`, `getUserById`) get more lexical weight; natural-language queries stay balanced.
+- **Definition boosts**: a chunk that defines the queried symbol (`class`, `def`, `func`) ranks above chunks that merely reference it.
+- **Identifier stems**: query tokens are stemmed and matched against identifier stems, so `parse config` boosts chunks containing `parseConfig`, `ConfigParser`, or `config_parser`.
+- **File coherence**: when multiple chunks from the same file match, the file is boosted so the top result reflects broad file-level relevance.
+- **Noise penalties**: test files, `compat`/`legacy` shims, example code, and `.d.ts` stubs are down-ranked so canonical implementations surface first.
 
 Because the embedding model is static with no transformer forward pass at query time, all of this runs in milliseconds on CPU.
diff --git a/src/content/docs/packages/semble/usage.mdx b/src/content/docs/packages/semble/usage.mdx
index cc297e9..1450ba6 100644
--- a/src/content/docs/packages/semble/usage.mdx
+++ b/src/content/docs/packages/semble/usage.mdx
@@ -104,7 +104,7 @@ Each result object exposes:
 
 ```python
 result = results[0]
-result.score  # float — relevance score
+result.score  # float, relevance score
 result.chunk.file_path  # "src/config.py"
 result.chunk.start_line  # 42
 result.chunk.end_line  # 67
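
Review note on the `introduction.mdx` hunk above: the context line says the lexical and dense score lists "are fused with Reciprocal Rank Fusion (RRF)". For readers unfamiliar with RRF, a minimal generic sketch follows. It is not semble's implementation. The function name and the constant `k=60` (the value commonly used in the RRF literature) are assumptions, and the adaptive weighting and reranking signals listed in the hunk are not modelled.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. `k=60` is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy example: chunk ids ranked by a lexical scorer and by a dense scorer.
lexical = ["a", "b", "c"]
dense = ["b", "d", "a"]
fused = reciprocal_rank_fusion([lexical, dense])
# fused == ["b", "a", "d", "c"]
```

Because RRF uses only ranks, it needs no score normalisation between BM25 and embedding similarity, which is one common reason to choose it for hybrid retrieval.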