diff --git a/astro.config.mjs b/astro.config.mjs
index 5498437..678b1e9 100644
--- a/astro.config.mjs
+++ b/astro.config.mjs
@@ -78,8 +78,7 @@ gtag('config', 'G-LQWDNXKF2X');`,
           items: [
             { label: 'Introduction', link: '/packages/semble/introduction/' },
             { label: 'Installation', link: '/packages/semble/installation/' },
-            { label: 'MCP Server', link: '/packages/semble/mcp-server/' },
-            { label: 'CLI / AGENTS.md',link: '/packages/semble/usage/' },
+            { label: 'CLI', link: '/packages/semble/usage/' },
             { label: 'Benchmarks', link: '/packages/semble/benchmarks/' },
           ],
         },
diff --git a/src/content/docs/packages/semble/installation.mdx b/src/content/docs/packages/semble/installation.mdx
index 9799b6b..fa54c5b 100644
--- a/src/content/docs/packages/semble/installation.mdx
+++ b/src/content/docs/packages/semble/installation.mdx
@@ -1,61 +1,288 @@
 ---
 title: Installation
-description: Install Semble, set up the MCP server, and scaffold a sub-agent
+description: Install Semble, set up the MCP server, and wire it into your agent
 sidebar:
   icon: seti:config
 ---

-There are three things you can do to install Semble, which are independent of eachother. We recommend doing all three, but you can pick and choose based on your needs:
-
-1. [Install Semble](#1-install-semble) (for the CLI and AGENTS.md flow).
-2. [Set up the MCP server](#2-mcp-server) (so your top-level agent can call Semble as a tool).
-3. [Install the sub-agent](#3-sub-agent) (so sub-agents, which can't call MCP tools, can still search).
-
 ## Requirements

 - Python 3.10 or higher.
-- [uv](https://docs.astral.sh/uv/getting-started/installation/) (recommended for all three flows).
+- [uv](https://docs.astral.sh/uv/getting-started/installation/) (recommended).
 - No GPU, API keys, or external services required. Runs fully on CPU.

-## 1. Install Semble
+## Recommended: `semble install`
+
+The interactive installer detects your installed agents and configures any combination of three integrations:
+
+- **MCP server**: exposes Semble as a native tool your agent can call directly.
+- **Instructions**: adds CLI usage guidance to your agent's config file (`CLAUDE.md`, `AGENTS.md`, etc.).
+- **Sub-agent**: installs a dedicated `semble-search` sub-agent for harnesses that support it.
+
+Install the CLI with [uv](https://docs.astral.sh/uv/getting-started/installation/), then run:

-Install Semble with [`uv`](https://docs.astral.sh/uv/) (recommended) or `pip`:
+```bash
+uv tool install semble
+semble install
+```
+
+To undo:

 ```bash
-uv tool install semble # Recommended
-pip install semble # Or with pip
+semble uninstall
 ```

-This gives you the [`semble` CLI](/packages/semble/usage/).
+Supported agents: Claude Code, Cursor, Gemini CLI, Kiro, OpenCode, GitHub Copilot, Codex, VS Code, Windsurf, Zed, Reasonix, and Pi.

-### Optional: wire it into AGENTS.md
+> **Pi prerequisite:** Pi requires the MCP extension before Semble can connect. Run `pi install npm:pi-mcp-extension` once, then `semble install`.

-Once installed, drop the [AGENTS.md snippet](/packages/semble/usage/#agentsmd-snippet) into your `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`, or equivalent. This teaches any agent (including sub-agents) when to reach for `semble` instead of grep, and is the only setup needed for harnesses without MCP support.
+## Manual setup

-## 2. MCP Server
+### MCP server

-Install Semble as an [MCP server](/packages/semble/mcp-server/) for Claude Code:
+> Requires [uv](https://docs.astral.sh/uv/getting-started/installation/) to be installed.
+
+#### Claude Code

 ```bash
 claude mcp add semble -s user -- uvx --from "semble[mcp]" semble
 ```

-For other agents (Cursor, Codex, OpenCode, VS Code, Copilot CLI, Windsurf, Gemini, Kiro, Zed), see [MCP Server](/packages/semble/mcp-server/) for the per-harness config snippet.
+#### Cursor
+
+Add to `~/.cursor/mcp.json` (or `.cursor/mcp.json` in your project):
+
+```json
+{
+  "mcpServers": {
+    "semble": {
+      "command": "uvx",
+      "args": ["--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```
+
+#### Codex
+
+Add to `~/.codex/config.toml`:
+
+```toml
+[mcp_servers.semble]
+command = "uvx"
+args = ["--from", "semble[mcp]", "semble"]
+```
+
+#### OpenCode
+
+Add to `~/.config/opencode/opencode.jsonc`:
+
+```json
+{
+  "mcp": {
+    "semble": {
+      "type": "local",
+      "command": ["uvx", "--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```
+
+#### VS Code
+
+Add to `.vscode/mcp.json` in your project (or your user profile's `mcp.json`):
+
+```json
+{
+  "servers": {
+    "semble": {
+      "command": "uvx",
+      "args": ["--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```
+
+#### GitHub Copilot CLI
+
+Add to `~/.copilot/mcp-config.json`:
+
+```json
+{
+  "mcpServers": {
+    "semble": {
+      "command": "uvx",
+      "args": ["--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```
+
+#### Windsurf
+
+Add to `~/.codeium/windsurf/mcp_config.json`:
+
+```json
+{
+  "mcpServers": {
+    "semble": {
+      "command": "uvx",
+      "args": ["--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```
+
+#### Gemini CLI
+
+Add to `~/.gemini/settings.json`:
+
+```json
+{
+  "mcpServers": {
+    "semble": {
+      "command": "uvx",
+      "args": ["--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```
+
+#### Kiro
+
+Add to `~/.kiro/settings/mcp.json` (or `.kiro/settings/mcp.json` in your project):
+
+```json
+{
+  "mcpServers": {
+    "semble": {
+      "command": "uvx",
+      "args": ["--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```

-## 3. Sub-agent
+#### Zed

-Sub-agents typically cannot call MCP tools directly. To give a sub-agent access to Semble, run `semble init` once in your project root to scaffold a dedicated search sub-agent for your harness:
+Add to `~/.config/zed/settings.json` (or `.zed/settings.json` in your project):
+
+```json
+{
+  "context_servers": {
+    "semble": {
+      "source": "custom",
+      "command": "uvx",
+      "args": ["--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```
+
+#### Reasonix
+
+Add to `~/.reasonix/config.json`:
+
+```json
+{
+  "mcpServers": {
+    "semble": {
+      "command": "uvx",
+      "args": ["--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```
+
+#### Pi
+
+First install the Pi MCP extension (one-time prerequisite):
+
+```bash
+pi install npm:pi-mcp-extension
+```
+
+Then add to `~/.pi/agent/mcp.json`:
+
+```json
+{
+  "mcpServers": {
+    "semble": {
+      "command": "uvx",
+      "args": ["--from", "semble[mcp]", "semble"]
+    }
+  }
+}
+```
+
+#### Content types
+
+By default the MCP server indexes only code files. To also index documentation (markdown, rst, etc.), config files (yaml, toml, etc.), or everything, append `--content` to the server command. Valid values are `code` (default), `docs`, `config`, and `all`. For example, in Claude Code:

 ```bash
-semble init # Claude Code → .claude/agents/semble-search.md
-semble init --agent gemini # Gemini CLI → .gemini/agents/semble-search.md
-semble init --agent cursor # Cursor → .cursor/agents/semble-search.md
-semble init --agent opencode # OpenCode → .opencode/agents/semble-search.md
-semble init --agent copilot # Copilot CLI → .github/agents/semble-search.md
-semble init --agent kiro # Kiro → .kiro/agents/semble-search.md
+claude mcp add semble -s user -- uvx --from "semble[mcp]" semble --content all
 ```

-If `semble` is not on `$PATH`, prefix the command with `uvx --from "semble[mcp]"`.
+### Instructions (AGENTS.md / CLAUDE.md)
+
+Add the snippet below to your `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`, or equivalent so your agent knows when and how to call the Semble CLI:
+
+````markdown
+## Code Search
+
+Use `semble search` to find code by describing what it does or naming a symbol/identifier, instead of grep:
+
+```bash
+semble search "authentication flow" ./my-project
+semble search "save_pretrained" ./my-project
+semble search "save model to disk" ./my-project --top-k 10
+```
+
+The index is built on first run and cached automatically; it is invalidated when files change.
+
+Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:
+
+```bash
+semble search "deployment guide" ./my-project --content docs
+semble search "database host port" ./my-project --content config
+semble search "authentication" ./my-project --content all
+```
+
+Use `semble find-related` to discover code similar to a known location (pass `file_path` and `line` from a prior search result):
+
+```bash
+semble find-related src/auth.py 42 ./my-project
+```
+
+`path` defaults to the current directory when omitted; git URLs are accepted. If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.
+
+### Workflow
+
+1. Start with `semble search` to find relevant chunks.
+2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
+3. Inspect full files only when the returned chunk does not give enough context.
+4. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
+5. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
+````
+
+### Sub-agent
+
+Sub-agents typically cannot call MCP tools directly. `semble install` handles this automatically. For manual setup, copy the appropriate file from the table below to your agent's agents directory:
+
+> **Pi prerequisite:** Pi sub-agents require the Pi agents extension. Run `pi install npm:pi-agents` once before installing.
+
+| Agent | Source file | Destination |
+|---|---|---|
+| Claude Code | `claude.md` | `~/.claude/agents/semble-search.md` |
+| Cursor | `cursor.md` | `~/.cursor/agents/semble-search.md` |
+| Gemini CLI | `gemini.md` | `~/.gemini/agents/semble-search.md` |
+| Kiro | `kiro.md` | `~/.kiro/agents/semble-search.md` |
+| OpenCode | `opencode.md` | `~/.config/opencode/agents/semble-search.md` |
+| GitHub Copilot | `copilot.md` | `~/.copilot/agents/semble-search.agent.md` |
+| Reasonix | `reasonix.md` | `~/.reasonix/skills/semble-search.md` |
+| Pi | `pi.md` | `~/.pi/agents/semble-search.md` |
+
+Source files are in [`src/semble/agents/`](https://github.com/MinishLab/semble/tree/main/src/semble/agents) in the Semble repository.

 ## Updating Semble

diff --git a/src/content/docs/packages/semble/introduction.mdx b/src/content/docs/packages/semble/introduction.mdx
index b4bc2f6..c6445ec 100644
--- a/src/content/docs/packages/semble/introduction.mdx
+++ b/src/content/docs/packages/semble/introduction.mdx
@@ -5,32 +5,22 @@ sidebar:
   icon: open-book
 ---

-[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, using ~98% fewer tokens than grep+read. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services. Run it as an [MCP server](/packages/semble/mcp-server/) or call it from the shell via [AGENTS.md](/packages/semble/usage/) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo.
+[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, using ~98% fewer tokens than grep+read. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services. Use it as an MCP server, a CLI tool via AGENTS.md, or a dedicated sub-agent, and any coding agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo.

 ## Quickstart

-Your agent queries Semble in natural language (e.g. `"How is authentication handled?"`) and gets back only the relevant code snippets, without grepping or reading full files. You can set it up as an MCP server or via AGENTS.md. First, install [uv](https://docs.astral.sh/uv/getting-started/installation/) if you don't have it yet.
+Your agent queries Semble in natural language (e.g. `"How is authentication handled?"`) and gets back only the relevant code snippets, without grepping or reading full files.

-
-### MCP (Claude Code)
-
-Add Semble to Claude Code (requires [uv](https://docs.astral.sh/uv/getting-started/installation/)):
+The fastest way to get started is the interactive installer. Install [uv](https://docs.astral.sh/uv/getting-started/installation/), then run:

 ```bash
-claude mcp add semble -s user -- uvx --from "semble[mcp]" semble
+uv tool install semble
+semble install
 ```

-Using another agent harness? See [MCP Server](/packages/semble/mcp-server/) for per-agent setup.
-
-### Bash / AGENTS.md
-
-[Install Semble](/packages/semble/installation/), then add the [AGENTS.md snippet](/packages/semble/usage/#agentsmd-snippet) to your `AGENTS.md`, `CLAUDE.md`, or equivalent. This works for any agent and is the only option for sub-agents, which typically cannot call MCP tools directly.
-
-```bash
-uv tool install semble # Install with uv (recommended)
-pip install semble # Or install with pip
-```
+`semble install` detects your installed coding agents (Claude Code, Cursor, Codex, Gemini, OpenCode, and more) and lets you choose which integrations to enable: MCP server, CLI instructions in AGENTS.md, and a dedicated sub-agent. To undo, run `semble uninstall`.

+For manual setup (per-agent MCP config, AGENTS.md snippet, sub-agent files), see [Installation](/packages/semble/installation/).

 ## Main Features

@@ -41,6 +31,15 @@ pip install semble # Or install with pip
 - **MCP server**: works with Claude Code, Cursor, Codex, OpenCode, VS Code, and any other MCP-compatible agent.
 - **Local and remote**: pass a local path or a git URL.

+## MCP tools
+
+Once connected, the agent has access to two tools:
+
+| Tool | Description |
+|------|-------------|
+| `search` | Search a codebase with a natural-language or code query. Pass `repo` as a local path or an `https://` git URL. |
+| `find_related` | Given a `file_path` and `line` number, return chunks semantically similar to the code at that location. |
+
 ## How it works

 Semble splits each file into code-aware chunks using [tree-sitter](https://github.com/tree-sitter/py-tree-sitter), then scores every query against the chunks with two complementary retrievers: static [Model2Vec](https://github.com/MinishLab/model2vec) embeddings using the code-specialized [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) model for semantic similarity, and [BM25](https://github.com/xhluca/bm25s) for lexical matches on identifiers and API names. The two score lists are fused with Reciprocal Rank Fusion (RRF).
diff --git a/src/content/docs/packages/semble/mcp-server.mdx b/src/content/docs/packages/semble/mcp-server.mdx
deleted file mode 100644
index 30976a7..0000000
--- a/src/content/docs/packages/semble/mcp-server.mdx
+++ /dev/null
@@ -1,173 +0,0 @@
----
-title: MCP Server
-description: Using Semble as an MCP server with AI agents
-sidebar:
-  icon: puzzle
----
-
-Semble can run as an [MCP](https://modelcontextprotocol.io/) server so agents can search any codebase directly. Repos are cloned and indexed on demand, and indexes are cached for the lifetime of the session. Local paths are watched for file changes and re-indexed automatically.
-
-> Requires [uv](https://docs.astral.sh/uv/getting-started/installation/) to be installed.
-
-## Setup
-
-### Claude Code
-
-```bash
-claude mcp add semble -s user -- uvx --from "semble[mcp]" semble
-```
-
-### Cursor
-
-Add to `~/.cursor/mcp.json` (or `.cursor/mcp.json` in your project):
-
-```json
-{
-  "mcpServers": {
-    "semble": {
-      "command": "uvx",
-      "args": ["--from", "semble[mcp]", "semble"]
-    }
-  }
-}
-```
-
-### Codex
-
-Add to `~/.codex/config.toml`:
-
-```toml
-[mcp_servers.semble]
-command = "uvx"
-args = ["--from", "semble[mcp]", "semble"]
-```
-
-### OpenCode
-
-Add to `~/.opencode/config.json`:
-
-```json
-{
-  "mcp": {
-    "semble": {
-      "type": "local",
-      "command": ["uvx", "--from", "semble[mcp]", "semble"]
-    }
-  }
-}
-```
-
-### VS Code
-
-Add to `.vscode/mcp.json` in your project (or your user profile's `mcp.json`):
-
-```json
-{
-  "servers": {
-    "semble": {
-      "command": "uvx",
-      "args": ["--from", "semble[mcp]", "semble"]
-    }
-  }
-}
-```
-
-### GitHub Copilot CLI
-
-Add to `~/.copilot/mcp-config.json`:
-
-```json
-{
-  "mcpServers": {
-    "semble": {
-      "command": "uvx",
-      "args": ["--from", "semble[mcp]", "semble"]
-    }
-  }
-}
-```
-
-### Windsurf
-
-Add to `~/.codeium/windsurf/mcp_config.json`:
-
-```json
-{
-  "mcpServers": {
-    "semble": {
-      "command": "uvx",
-      "args": ["--from", "semble[mcp]", "semble"]
-    }
-  }
-}
-```
-
-### Gemini CLI
-
-Add to `~/.gemini/settings.json`:
-
-```json
-{
-  "mcpServers": {
-    "semble": {
-      "command": "uvx",
-      "args": ["--from", "semble[mcp]", "semble"]
-    }
-  }
-}
-```
-
-### Kiro
-
-Add to `~/.kiro/settings/mcp.json` (or `.kiro/settings/mcp.json` in your project):
-
-```json
-{
-  "mcpServers": {
-    "semble": {
-      "command": "uvx",
-      "args": ["--from", "semble[mcp]", "semble"]
-    }
-  }
-}
-```
-
-### Zed
-
-Add to `~/.config/zed/settings.json` (or `.zed/settings.json` in your project):
-
-```json
-{
-  "context_servers": {
-    "semble": {
-      "command": "uvx",
-      "args": ["--from", "semble[mcp]", "semble"]
-    }
-  }
-}
-```
-
-## Tools
-
-Once connected, the agent has access to two tools:
-
-| Tool | Description |
-|------|-------------|
-| `search` | Search a codebase with a natural-language or code query. Pass `repo` as a local directory path or an https:// git URL. |
-| `find_related` | Given a `file_path` and `line` number, return chunks semantically similar to the code at that location. |
-
-The index is built on the first call and reused for subsequent calls in the same session.
-
-## Content types
-
-By default the MCP server indexes only code files. To also index documentation (markdown, rst, etc.), config files (yaml, toml, etc.), or everything, append `--content` to the server command. Valid values are `code` (default), `docs`, `config`, and `all`, or any combination, e.g. `--content code docs`.
-
-For example, in Claude Code:
-
-```bash
-claude mcp add semble -s user -- uvx --from "semble[mcp]" semble --content all
-```
-
-## Sub-agents
-
-Sub-agents typically cannot call MCP tools directly. To give a sub-agent access to Semble, use the [CLI / AGENTS.md](/packages/semble/usage/) flow or scaffold a dedicated search sub-agent with `semble init`. See [Installation → Sub-agent](/packages/semble/installation/#3-sub-agent).
diff --git a/src/content/docs/packages/semble/usage.mdx b/src/content/docs/packages/semble/usage.mdx
index 85d1f70..ebcef00 100644
--- a/src/content/docs/packages/semble/usage.mdx
+++ b/src/content/docs/packages/semble/usage.mdx
@@ -1,23 +1,23 @@
 ---
-title: CLI / AGENTS.md
-description: Invoke Semble from the shell or wire it into AGENTS.md for any agent
+title: CLI
+description: Invoke Semble from the shell
 sidebar:
   icon: seti:shell
 ---

-Semble ships as a standalone CLI. This is the best fit for sub-agents (which typically cannot call MCP tools directly), scripts, and anywhere you want search results without an MCP session. It also pairs nicely with [MCP](/packages/semble/mcp-server/) for the top-level agent.
+Semble ships as a standalone CLI. This is the best fit for sub-agents (which typically cannot call MCP tools directly), scripts, and anywhere you want search results without an MCP session.

 [Install Semble](/packages/semble/installation/) first:

 ```bash
+uv tool install semble # with uv (recommended)
 pip install semble # with pip
-uv tool install semble # with uv
 ```

-## CLI
+## Commands

 ```bash
-# Search a local repo
+# Search a local repo (index is built and cached automatically)
 semble search "authentication flow" ./my-project

 # Search for a symbol or identifier
@@ -40,51 +40,54 @@ semble search "authentication" ./my-project --content all

 # Find code similar to a known location
 semble find-related src/auth.py 42 ./my-project
+
+# Clear cached indexes
+semble clear index
+
+# Clear saved token savings stats
+semble clear savings
+
+# Clear everything
+semble clear all
 ```

 `--content` accepts `code` (default), `docs`, `config`, or `all`. `path` defaults to the current directory when omitted; git URLs are accepted. If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.

-## AGENTS.md snippet
+## Controlling which files are indexed

-Append the snippet below to your `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`, or equivalent so the agent knows when to reach for Semble instead of grep:
+Semble reads `.gitignore` and `.sembleignore` files to determine which files to index. Both use standard gitignore syntax and their patterns are merged. `.sembleignore` lets you add Semble-specific rules without touching `.gitignore`.

-````markdown
-## Code Search
+**Excluding files:**

-Use `semble search` to find code by describing what it does or naming a symbol/identifier, instead of grep:
-
-```bash
-semble search "authentication flow" ./my-project
-semble search "save_pretrained" ./my-project
-semble search "save model to disk" ./my-project --top-k 10
+```
+# .sembleignore
+generated/ # exclude generated directory
+*.pb.go # exclude Go protobuf files
 ```

-Use `--content docs` to search documentation and prose, `--content config` for config files (yaml, toml, etc.), or `--content all` to search code, docs, and config:
+**Including non-default extensions** — prefix the pattern with `!` to force-include files Semble wouldn't index by default:

-```bash
-semble search "deployment guide" ./my-project --content docs
-semble search "database host port" ./my-project --content config
-semble search "authentication" ./my-project --content all
+```
+# .sembleignore
+!*.proto # include Protobuf files
+!*.cob # include COBOL files
 ```

-Use `semble find-related` to discover code similar to a known location (pass `file_path` and `line` from a prior search result):
+Semble also always skips well-known non-source directories (`node_modules/`, `.venv/`, `dist/`, `build/`, `__pycache__/`, and similar) regardless of ignore files.

-```bash
-semble find-related src/auth.py 42 ./my-project
-```
+## Storage

-`path` defaults to the current directory when omitted; git URLs are accepted. If `semble` is not on `$PATH`, use `uvx --from "semble[mcp]" semble` in its place.
+Indexes and token savings statistics are stored in the OS cache folder by default:

-### Workflow
+- macOS: `~/Library/Caches/semble/`
+- Linux: `~/.cache/semble/`
+- Windows: `%LOCALAPPDATA%\semble\Cache\`

-1. Start with `semble search` to find relevant chunks.
-2. Use `--content docs` for documentation, `--content config` for config files, or `--content all` for everything.
-3. Inspect full files only when the returned chunk is not enough context.
-4. Optionally use `semble find-related` with a promising result's `file_path` and `line` to discover related implementations.
-5. Use grep only when you need exhaustive literal matches or quick confirmation of an exact string.
-````
+To override the location, set `SEMBLE_CACHE_LOCATION` to a full path:

-For agents that support dedicated sub-agents (Claude Code, Gemini, Cursor, OpenCode, Copilot CLI, Kiro), `semble init` can scaffold one for you. See [Installation → Sub-agent](/packages/semble/installation/#3-sub-agent).
+```bash
+export SEMBLE_CACHE_LOCATION=~/my-folder/semble
+```

 ## Token savings

@@ -105,7 +108,7 @@ semble savings --verbose # also show breakdown by call type
 All time 1.4k [██████████████░░] ~1.2M tokens (89%)
 ```

-For each call, Semble records the total character count of the unique files containing returned chunks and the character count of the snippets returned. Estimated tokens saved is `(file chars − snippet chars) / 4` (4 chars per token). This is a conservative estimate: the baseline is reading matched files in full, which is how coding agents often explore unfamiliar code. Stats are stored in `~/.semble/savings.jsonl`.
+For each call, Semble records the total character count of the unique files containing returned chunks and the character count of the snippets returned. Estimated tokens saved is `(file chars − snippet chars) / 4` (4 chars per token). This is a conservative estimate: the baseline is reading matched files in full, which is how coding agents often explore unfamiliar code.

 ## Library usage