
Indexing flow

What happens between POST /index and a search_code call returning a hit. Step by step, with the failure modes called out.

The indexer worker is the only component that ever sees your full source. It clones, parses, chunks, embeds, and persists — then hands off to row-level-security-scoped Postgres for every read. This page walks the full lifecycle so you can reason about latency, re-index behaviour, and where to look when something is off.

The seven steps

POST /index ──► 1. Clone     ──► 2. Parse  ──► 3. Chunk
                                                 │
                                                 ▼
              7. Respond  ◄── 6. Persist ◄── 5. Embed ◄── 4. Hash

1. Clone. Shallow HTTPS clone of the Git URL at the requested ref into an ephemeral working directory on the worker. Single commit only, no history.
2. Parse. Tree-sitter walks every file the language detector recognises (TS, JS, Python, Go, Rust, Java, C#, and more). Symbols, references, and imports are extracted into structured rows.
3. Chunk. One chunk per top-level symbol (function, class, method), bounded by symbol start/end, not a sliding window. Short symbols are coalesced; oversized ones split on nested boundaries.
4. Hash. Each chunk gets a SHA-256 of its text. Unchanged chunks skip the embed call entirely, which is why a re-index after a one-file change costs almost nothing.
5. Embed. Surviving chunks are sent to the configured embedding provider in batches. Default model produces 1024-dimensional vectors. See Region mode for where this hop lands geographically.
6. Persist. Symbols, references, chunk text, vectors, and a per-run metadata row are inserted under your tenant_id. The HNSW index on embeddings.embedding updates in place.
7. Respond. The HTTP response carries the counts so your client can confirm without polling.
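
To make steps 4 and 5 concrete, here is a minimal sketch of a content-hash cache in front of a batched embed call. It is an illustration under stated assumptions, not the worker's actual code: the `cache` dictionary, the `embed_batch` callable, and the batch size of 64 are all hypothetical stand-ins, since this page does not document the real storage layout, provider client, or batch size.

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Sequence

Vector = List[float]


def chunk_hash(text: str) -> str:
    """Return the hex-encoded SHA-256 of the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_changed_chunks(
    chunks: Sequence[str],
    cache: Dict[str, Vector],  # hypothetical store: chunk hash -> vector
    embed_batch: Callable[[Sequence[str]], List[Vector]],  # hypothetical provider call
    batch_size: int = 64,  # assumed value; the real batch size is not documented
) -> int:
    """Embed only chunks whose hash is not cached; return how many were embedded."""
    pending: Dict[str, str] = {}
    for text in chunks:
        digest = chunk_hash(text)
        if digest not in cache:
            pending[digest] = text  # identical chunks collapse to one entry

    digests = list(pending)
    for batch in batched(digests, batch_size):
        vectors = embed_batch([pending[d] for d in batch])
        for digest, vector in zip(batch, vectors):
            cache[digest] = vector
    return len(digests)
```

Running this twice over the same chunks embeds nothing the second time, which mirrors why a re-index after a small change reports a low embeddedCount.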

What the response tells you

{
  "ok": true,
  "fileCount": 124,
  "symbolCount": 1812,
  "chunkCount": 1812,
  "embeddedCount": 87,
  "embedMs": 320,
  "dim": 1024,
  "region": "us"
}

embeddedCount < chunkCount means the rest hit the content-hash cache — common on re-runs.
region reflects the embedding hop, not the storage region. Storage is always EU Frankfurt.
dim should match the column dimension (1024). A mismatch surfaces as a hard error, not a silent truncation.
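
A client can turn these fields into a quick sanity check. The sketch below is one possible way to do that; the function name and the shape of the returned summary are made up for illustration, and the dimension check is purely defensive, since the service already rejects a mismatch with a hard error.

```python
import json
from typing import Any, Dict

EXPECTED_DIM = 1024  # column dimension stated on this page


def summarize_index_response(body: str, expected_dim: int = EXPECTED_DIM) -> Dict[str, Any]:
    """Parse a POST /index response body and derive cache statistics from it."""
    data = json.loads(body)
    if not data.get("ok"):
        raise ValueError("index run did not report ok")
    if data["dim"] != expected_dim:
        raise ValueError(f"dim {data['dim']} does not match expected {expected_dim}")

    chunk_count = data["chunkCount"]
    cache_hits = chunk_count - data["embeddedCount"]  # chunks that skipped the embed call
    return {
        "cacheHits": cache_hits,
        "cacheHitRatio": cache_hits / chunk_count if chunk_count else 0.0,
        "embeddingRegion": data["region"],  # the embedding hop, not the storage region
    }
```

For the sample response above, this reports 1725 cache hits out of 1812 chunks.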

Re-index semantics

A workspace is identified by workspaceId, not by Git ref. Re-running POST /index with a new ref replaces the index in place — symbols and chunks that no longer exist in the new commit are deleted, new ones are inserted, unchanged ones reuse their existing rows and vectors. There is no "append" mode.
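
The replace-in-place behaviour amounts to a three-way set difference. The sketch below assumes rows are keyed by chunk content hash; that key is an assumption made for illustration, as this page does not say how the worker matches unchanged rows, and `ReindexPlan` and `plan_reindex` are hypothetical names.

```python
from typing import NamedTuple, Set


class ReindexPlan(NamedTuple):
    delete: Set[str]  # present in the old index, absent from the new commit
    insert: Set[str]  # new in this commit; these need an embed call
    reuse: Set[str]   # unchanged; existing rows and vectors are kept


def plan_reindex(existing: Set[str], incoming: Set[str]) -> ReindexPlan:
    """Split chunk keys into delete / insert / reuse sets for a re-index."""
    return ReindexPlan(
        delete=existing - incoming,
        insert=incoming - existing,
        reuse=existing & incoming,
    )
```

Under this model, only the `insert` set costs embedding time, and nothing from the old ref survives unless it also exists in the new one.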

If you need parallel indices for two branches at the same time, mint a second workspace and point each POST /index call at the corresponding workspaceId. The two indices do not share storage.

Where failures surface

Symptom	Where to look
`401 Unauthorized`	API key invalid, revoked, or scoped to another tenant.
`404 workspace not found`	`workspaceId` does not belong to the calling tenant.
`422 dim mismatch`	The configured embedder returned vectors of a different dimension than the schema declares. We rotate models in lockstep with a migration — this should never reach you on the hosted worker.
`503` after a long delay	Embedding provider is degraded. The worker has a circuit breaker; retry in 60 s.
`embeddedCount: 0` on a fresh index	Source language not supported by the parser yet, or all files filtered out by the workspace's exclude rules.

For the full error-to-fix map see Troubleshooting.
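
On the client side, the table above can be encoded as a small lookup that decides whether a retry is worthwhile. This is a sketch only: `Action` and `classify_index_failure` are hypothetical names, and the only retry delay taken from this page is the 60 s suggested for `503`.

```python
from typing import NamedTuple, Optional


class Action(NamedTuple):
    retry_after_s: Optional[int]  # None means do not retry automatically
    hint: str


def classify_index_failure(status: int) -> Action:
    """Map a POST /index HTTP status to a suggested client reaction."""
    if status == 401:
        return Action(None, "check that the API key is valid, not revoked, and scoped to this tenant")
    if status == 404:
        return Action(None, "check that workspaceId belongs to the calling tenant")
    if status == 422:
        return Action(None, "embedding dimension mismatch; not expected on the hosted worker")
    if status == 503:
        return Action(60, "embedding provider degraded; retry once the circuit breaker has had time to reset")
    return Action(None, "status not listed here; see Troubleshooting")
```

Only `503` is treated as retryable, because the other listed statuses point at configuration that a retry will not change.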

What never leaves the worker

The raw clone. It is deleted at the end of the request, on success and on failure, and the working directory lives on tmpfs.
Commit history. We clone a single revision.
Author identity, commit messages, and any other metadata not directly needed for retrieval.

Only chunk text is sent to the embedding provider. Symbol names, file paths, and references stay inside Frankfurt.