Indexing flow
What happens between POST /index and a search_code call returning a hit. Step by step, with the failure modes called out.
The indexer worker is the only component that ever sees your full source. It clones, parses, chunks, embeds, and persists — then hands off to row-level-security-scoped Postgres for every read. This page walks the full lifecycle so you can reason about latency, re-index behaviour, and where to look when something is off.
The seven steps
POST /index ──► 1. Clone ──► 2. Parse ──► 3. Chunk
                                             │
                                             ▼
7. Respond ◄── 6. Persist ◄── 5. Embed ◄── 4. Hash
- Clone. Shallow HTTPS clone of the Git URL at the requested ref into an ephemeral working directory on the worker. Single commit only, no history.
- Parse. Tree-sitter walks every file the language detector recognises (TS, JS, Python, Go, Rust, Java, C#, and more). Symbols, references, and imports are extracted into structured rows.
- Chunk. One chunk per top-level symbol (function, class, method), bounded by symbol start/end — not a sliding window. Short symbols are coalesced; oversized ones split on nested boundaries.
- Hash. Each chunk gets a SHA-256 of its text. Unchanged chunks skip the embed call entirely — that is why a re-index after a one-file change costs almost nothing.
- Embed. Surviving chunks are sent to the configured embedding provider in batches. Default model produces 1024-dimensional vectors. See Region mode for where this hop lands geographically.
- Persist. Symbols, references, chunk text, vectors, and a per-run metadata row are inserted under your tenant_id. The HNSW index on embeddings.embedding updates in place.
- Respond. The HTTP response carries the counts so your client can confirm without polling.
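Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration rather than the worker's actual code: the in-memory dict standing in for the content-hash cache, the `embed_batch` callable, and the batch size of 64 are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Sequence

Vector = List[float]


def chunk_hash(text: str) -> str:
    """Hex-encoded SHA-256 of the chunk text (step 4)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_changed(
    chunks: Sequence[str],
    cache: Dict[str, Vector],  # hypothetical hash -> vector store
    embed_batch: Callable[[List[str]], List[Vector]],  # hypothetical provider call
    batch_size: int = 64,  # assumed; the real batch size is not documented here
) -> int:
    """Embed only chunks whose hash is not cached; return how many were embedded."""
    pending: Dict[str, str] = {}
    for text in chunks:
        digest = chunk_hash(text)
        if digest not in cache:
            pending[digest] = text  # identical chunks collapse to one entry

    digests = list(pending)
    for group in batched(digests, batch_size):
        vectors = embed_batch([pending[d] for d in group])
        for digest, vector in zip(group, vectors):
            cache[digest] = vector
    return len(digests)
```

On a second run over mostly unchanged chunks, `embed_changed` returns a small number, which is the same effect the `embeddedCount` field reports.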
What the response tells you
{
"ok": true,
"fileCount": 124,
"symbolCount": 1812,
"chunkCount": 1812,
"embeddedCount": 87,
"embedMs": 320,
"dim": 1024,
"region": "us"
}
- embeddedCount < chunkCount means the rest hit the content-hash cache, which is common on re-runs.
- region reflects the embedding hop, not the storage region. Storage is always EU Frankfurt.
- dim should match the column dimension (1024). A mismatch surfaces as a hard error, not a silent truncation.
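A client might read these fields along the following lines. The function name and the returned keys are hypothetical; only the response field names and the 1024 dimension come from this page.

```python
EXPECTED_DIM = 1024  # column dimension stated on this page


def summarize_index_response(body: dict) -> dict:
    """Derive cache-hit figures from a POST /index response body."""
    if not body.get("ok"):
        raise ValueError("index run did not report ok: true")
    if body["dim"] != EXPECTED_DIM:
        # The hosted worker reports this as a hard error; a client-side
        # check is only a belt-and-braces assumption.
        raise ValueError(f"unexpected dim {body['dim']}, expected {EXPECTED_DIM}")

    chunk_count = body["chunkCount"]
    cache_hits = chunk_count - body["embeddedCount"]
    return {
        "cacheHits": cache_hits,
        "cacheHitRatio": cache_hits / chunk_count if chunk_count else 0.0,
        "embeddingRegion": body["region"],  # the embedding hop, not storage
    }


sample = {
    "ok": True, "fileCount": 124, "symbolCount": 1812, "chunkCount": 1812,
    "embeddedCount": 87, "embedMs": 320, "dim": 1024, "region": "us",
}
summary = summarize_index_response(sample)
print(summary["cacheHits"])  # 1725 chunks were served from the hash cache
```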
Re-index semantics
A workspace is identified by workspaceId, not by Git ref. Re-running
POST /index with a new ref replaces the index in place —
symbols and chunks that no longer exist in the new commit are deleted,
new ones are inserted, unchanged ones reuse their existing rows and
vectors. There is no "append" mode.
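The replace-in-place behaviour amounts to a set difference between the old and new commit. The sketch below assumes chunks are keyed by their content hash, which this page does not state explicitly; `ReindexPlan` and `plan_reindex` are hypothetical names.

```python
from typing import NamedTuple, Set


class ReindexPlan(NamedTuple):
    delete: Set[str]  # present before, gone in the new ref
    insert: Set[str]  # new in the new ref, needs embedding
    reuse: Set[str]   # unchanged, keeps existing rows and vectors


def plan_reindex(old_keys: Set[str], new_keys: Set[str]) -> ReindexPlan:
    """Split chunk keys into delete / insert / reuse for a re-index run."""
    return ReindexPlan(
        delete=old_keys - new_keys,
        insert=new_keys - old_keys,
        reuse=old_keys & new_keys,
    )


plan = plan_reindex({"a", "b", "c"}, {"b", "c", "d"})
print(sorted(plan.delete), sorted(plan.insert), sorted(plan.reuse))
# ['a'] ['d'] ['b', 'c']
```

Nothing from the old ref survives unless it also exists in the new one, which is why there is no way to accumulate two refs in one workspace.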
If you need parallel indices for two branches at the same time, mint a
second workspace and point each POST /index call at the corresponding
workspaceId. The two indices do not share storage.
Where failures surface
| Symptom | Where to look |
|---|---|
| 401 Unauthorized | API key invalid, revoked, or scoped to another tenant. |
| 404 workspace not found | workspaceId does not belong to the calling tenant. |
| 422 dim mismatch | The configured embedder returned vectors of a different dimension than the schema declares. We rotate models in lockstep with a migration, so this should never reach you on the hosted worker. |
| 503 after long delay | Embedding provider is degraded. The worker has a circuit breaker; retry in 60 s. |
| embeddedCount: 0 on a fresh index | Source language not supported by the parser yet, or all files filtered out by the workspace's exclude rules. |
For the full error-to-fix map see Troubleshooting.
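On the client side, the table above could translate into something like this. It is a sketch under assumptions: `send` is a hypothetical callable that performs the POST /index request and returns a status code plus parsed body, and the three-attempt limit is an arbitrary choice. Only the status codes and the 60-second delay come from the table.

```python
import time
from typing import Callable, Dict, Tuple

RETRY_DELAY_S = 60  # "retry in 60 s" from the table above

HINTS: Dict[int, str] = {
    401: "API key invalid, revoked, or scoped to another tenant",
    404: "workspaceId does not belong to the calling tenant",
    422: "embedder returned a dimension the schema does not declare",
    503: "embedding provider degraded; retry later",
}

Response = Tuple[int, dict]


def post_index_with_retry(
    send: Callable[[], Response],  # hypothetical: issues POST /index once
    max_attempts: int = 3,         # assumed limit, not a documented value
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry only on 503; every other status is returned immediately."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 503 or attempt == max_attempts:
            return status, body
        sleep(RETRY_DELAY_S)
    raise AssertionError("unreachable")


def explain(status: int, body: dict, fresh_index: bool = False) -> str:
    """Map a response to the matching row of the table."""
    if status in HINTS:
        return HINTS[status]
    if fresh_index and body.get("embeddedCount") == 0:
        return "nothing embedded: unsupported language or exclude rules"
    return "ok"
```

The `fresh_index` flag matters because `embeddedCount: 0` is normal on a re-run where every chunk hit the hash cache.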
What never leaves the worker
- The raw clone is deleted at the end of the request, on success and on failure. Disk is tmpfs.
- Commit history. We clone a single revision.
- Author identity, commit messages, or any metadata not directly needed for retrieval.
Only chunk text is sent to the embedding provider. Symbol names, file paths, and references stay inside Frankfurt.
Related
- POST /index: full request schema.
- Region mode: opt out of the US embedding hop.
- Security model: the isolation guarantees behind every step above.