Language coverage

The 14 languages prom.codes parses today, the file extensions each one claims, and the measured retrieval quality behind them.

prom.codes does not chunk source by line windows. Every supported language is parsed with Tree-sitter into a real syntax tree, chunked along symbol boundaries (function, class, method, interface), and linked through a reference graph. That is why a query lands on a symbol, not a paragraph that happens to mention the right word.
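
To make the difference from line-window chunking concrete, here is a minimal sketch of chunking along symbol boundaries. It is not the prom.codes indexer: it swaps Tree-sitter for Python's built-in `ast` module so the example runs without dependencies, and the names `chunk_by_symbol`, `SOURCE`, and the chunk fields are hypothetical.

```python
import ast

SOURCE = '''
def greet(name):
    return "hi " + name

class Counter:
    def __init__(self):
        self.n = 0

    def bump(self):
        self.n += 1
        return self.n
'''

DEFS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def chunk_by_symbol(source):
    """Return one chunk per class, function, or method definition."""
    lines = source.splitlines()
    chunks = []

    def visit(node, prefix, in_class):
        for child in ast.iter_child_nodes(node):
            if not isinstance(child, DEFS):
                # Not a definition: keep descending with the same context.
                visit(child, prefix, in_class)
                continue
            is_class = isinstance(child, ast.ClassDef)
            name = f"{prefix}.{child.name}" if prefix else child.name
            chunks.append({
                "symbol": name,
                "kind": "class" if is_class else ("method" if in_class else "function"),
                # lineno and end_lineno are 1-based and inclusive.
                "text": "\n".join(lines[child.lineno - 1:child.end_lineno]),
            })
            visit(child, name, is_class)

    visit(ast.parse(source), "", False)
    return chunks


for chunk in chunk_by_symbol(SOURCE):
    print(chunk["kind"], chunk["symbol"])
```

Each chunk starts and ends exactly where a definition does, so a hit can be reported as `Counter.bump` rather than as an arbitrary span of lines.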

Supported languages

The indexer ships 14 language IDs today. Detection is by file extension; unknown code extensions are skipped, not guessed. Documentation and curated config files are handled separately — see Beyond code: prose & config below.

Language	Extensions
TypeScript	`.ts` `.mts` `.cts`
TSX	`.tsx`
JavaScript	`.js` `.mjs` `.cjs` `.jsx`
Python	`.py` `.pyi`
PHP	`.php`
Go	`.go`
Rust	`.rs`
Java	`.java`
C#	`.cs`
C	`.c` `.h`
C++	`.cpp` `.cxx` `.cc` `.hpp` `.hxx` `.hh`
Ruby	`.rb` `.rake` `.gemspec`
Kotlin	`.kt` `.kts`
HTML	`.html` `.htm`
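
The detection rule (extension lookup, unknown extensions skipped rather than guessed) can be sketched as a plain table lookup. The extensions are copied from the table above; the lowercase language IDs and the function name `detect_language` are assumptions for illustration, and the sketch matches extensions case-sensitively, which may differ from the real indexer.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extensions copied from the table above; the ID strings are hypothetical.
EXTENSIONS = {
    "typescript": [".ts", ".mts", ".cts"],
    "tsx": [".tsx"],
    "javascript": [".js", ".mjs", ".cjs", ".jsx"],
    "python": [".py", ".pyi"],
    "php": [".php"],
    "go": [".go"],
    "rust": [".rs"],
    "java": [".java"],
    "csharp": [".cs"],
    "c": [".c", ".h"],
    "cpp": [".cpp", ".cxx", ".cc", ".hpp", ".hxx", ".hh"],
    "ruby": [".rb", ".rake", ".gemspec"],
    "kotlin": [".kt", ".kts"],
    "html": [".html", ".htm"],
}

# Invert to extension -> language for constant-time lookup.
LANGUAGE_BY_EXT = {ext: lang for lang, exts in EXTENSIONS.items() for ext in exts}


def detect_language(path: str) -> Optional[str]:
    """Return the language ID for a path, or None if the extension is unknown."""
    return LANGUAGE_BY_EXT.get(PurePosixPath(path).suffix)


print(detect_language("src/app.tsx"))
print(detect_language("Sources/App.swift"))  # unknown extension: skipped, not guessed
```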

Each language has both a symbol extractor (definitions, exports, properties) and a reference extractor (imports, calls, member access). The reference graph is what powers find_callers, find_callees, and the 1-hop context expansion around a hit.
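
As a rough illustration of how a reference graph supports these lookups, the sketch below builds two adjacency maps from call edges. The edge list is invented, and the functions only borrow the names `find_callers` and `find_callees` from the text; their signatures, return types, and `expand_one_hop` are assumptions, not the prom.codes API.

```python
from collections import defaultdict

# Hypothetical (caller, callee) edges, as a reference extractor might emit them.
EDGES = [
    ("handle_request", "parse_body"),
    ("handle_request", "send_response"),
    ("parse_body", "decode_json"),
    ("retry", "handle_request"),
]

callees = defaultdict(set)  # symbol -> symbols it calls
callers = defaultdict(set)  # symbol -> symbols that call it
for src, dst in EDGES:
    callees[src].add(dst)
    callers[dst].add(src)


def find_callers(symbol):
    """Symbols that reference `symbol` (incoming edges)."""
    return sorted(callers.get(symbol, ()))


def find_callees(symbol):
    """Symbols that `symbol` references (outgoing edges)."""
    return sorted(callees.get(symbol, ()))


def expand_one_hop(symbol):
    """Direct neighbours of a hit in either direction."""
    return sorted(set(find_callers(symbol)) | set(find_callees(symbol)))


print(find_callers("handle_request"))
print(find_callees("handle_request"))
print(expand_one_hop("handle_request"))
```

Both maps are filled from the same edge list, so a missing or mis-joined edge shows up in caller and callee queries alike.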

Measured retrieval quality

We do not claim parity by assertion. Every non-trivial language is validated with the same self-index engine-lift harness: index a real open-source repo, synthesise queries from its own doc comments, and measure ranking quality against held-out gold targets using our production retrieval stack.

The headline metric is NDCG@10 for the pre-graph ranking (the surface an agent sees before graph expansion). Higher is better; 1.0 is a perfect ranking.

Language	Corpus repo	NDCG@10
C	wren	0.963
Go	gorilla/mux	0.924
Ruby	rack	0.921
TypeScript	prom.codes internal	0.910
Rust	serde/json	0.882
Kotlin	okhttp	0.872
C++	leveldb	0.835
C#	Newtonsoft.Json	0.832
Python	pydantic	0.831
Java	gson	0.827
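
For readers unfamiliar with the metric in the table above, this is the standard NDCG@k computation as a small sketch. The source does not say whether the harness uses binary or graded relevance, so the example assumes a hypothetical `gold` mapping from result ID to a relevance grade and uses linear gain.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), rank 1-based."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, gold, k=10):
    """NDCG@k for one query. `gold` maps id -> relevance grade; missing ids score 0."""
    gains = [gold.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(gold.values(), reverse=True)  # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


gold = {"Router.match": 1}
print(ndcg_at_k(["Router.match", "Router.add"], gold))            # gold target at rank 1
print(round(ndcg_at_k(["Router.add", "Router.match"], gold), 3))  # gold target at rank 2
```

With a single gold target, the score is 1.0 when it is ranked first and about 0.631 when it is ranked second, which shows how steeply the metric penalises a miss at the top of the list.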

Why the graph matters

Switching from hybrid-no-graph (ranking without graph expansion) to hybrid-full (with it) trades a little top-of-list precision for a clear gain in recall, and the pattern holds for every language we measured. For Ruby, Recall@100 climbs from 0.900 to 0.976 (+0.076); for Kotlin, from 0.774 to 0.874 (+0.100). That lift only appears if the reference extractor joins calls to their definitions correctly, so it doubles as a correctness signal for each new language we add.
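
A minimal sketch of the recall measurement behind those figures, assuming a set of gold IDs per query. The toy rankings are invented to show the mechanism; only the Ruby and Kotlin Recall@100 values are taken from the text.

```python
def recall_at_k(ranked_ids, gold_ids, k=100):
    """Fraction of gold targets that appear in the top k results."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(ranked_ids[:k]) & gold) / len(gold)


# Toy query: graph expansion pulls in one more gold target ("c").
gold = {"a", "b", "c", "d"}
without_graph = ["a", "x", "b", "y"]
with_graph = ["a", "x", "b", "y", "c"]
print(recall_at_k(without_graph, gold, k=5), recall_at_k(with_graph, gold, k=5))

# Absolute Recall@100 lift for the two languages quoted in the text.
REPORTED = {"Ruby": (0.900, 0.976), "Kotlin": (0.774, 0.874)}
for language, (no_graph, full) in REPORTED.items():
    print(language, round(full - no_graph, 3))
```

If a reference extractor fails to join calls to definitions, expansion adds nothing and the two rankings score the same recall.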

Beyond code: prose & config

A codebase is more than its source. READMEs, ADRs, specs and a handful of key config files (package.json, tsconfig.json, docker-compose.yml, pyproject.toml …) answer questions an agent asks constantly — "how is this configured?", "where is this documented?". Tree-sitter does not help here, so these files travel a separate grammar-less document path: pure-string extractors, no parser.

The hard part is deciding what not to index. Naively swallowing every .json (lockfiles, fixtures, generated blobs) floods the corpus with distractors and measurably lowers retrieval quality for the actual code. So the document path is deliberately selective, in three tiers:

Tier	What	How it is chunked	Default
1 — Prose	Markdown (`.md` `.mdx` `.markdown`), Text (`.txt`)	Markdown along headings (`#…` + Setext); text as one whole-file unit	on
2 — Config	JSON / YAML / TOML	along the top-level key/section path — only for basenames on a curated allowlist	on (allowlist-gated)
3 — Guards	everything above	a size cap (1 MiB) and a minification heuristic reject generated/minified blobs before extraction	always
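
The three tiers can be read as a single admission check, sketched below under stated assumptions. The 1 MiB cap and the prose extensions come from the table; the allowlist contains only the four basenames the text names (the real list is longer and is not reproduced here); the minification heuristic, its threshold, and all function names are invented for illustration.

```python
from pathlib import PurePosixPath

MAX_BYTES = 1024 * 1024  # 1 MiB size cap, from the table above
PROSE_EXTS = {".md", ".mdx", ".markdown", ".txt"}
# Only the basenames named in the text; the actual allowlist is not shown there.
CONFIG_ALLOWLIST = {"package.json", "tsconfig.json", "docker-compose.yml", "pyproject.toml"}


def looks_minified(text, max_avg_line_len=500):
    """Assumed heuristic: a very long average line length suggests a generated blob."""
    lines = text.splitlines() or [""]
    return len(text) / len(lines) > max_avg_line_len


def should_index_document(path, text):
    """Tier 1 or tier 2 must admit the file, then the tier 3 guards must pass."""
    p = PurePosixPath(path)
    is_prose = p.suffix.lower() in PROSE_EXTS
    is_allowed_config = p.name in CONFIG_ALLOWLIST  # gated by basename, not extension
    if not (is_prose or is_allowed_config):
        return False
    if len(text.encode("utf-8")) > MAX_BYTES:
        return False
    return not looks_minified(text)


print(should_index_document("package.json", '{"name": "demo"}'))
print(should_index_document("package-lock.json", "{}"))
```

Gating on the basename is what keeps `package-lock.json` and fixture files out even though they share an extension with `package.json`.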

Two outcomes follow from this. Every non-code file that is admitted yields one document unit, so it is retrievable as a whole even when it has no internal structure; richer files additionally yield section units (a Markdown heading block, a config key or table). And because config is allowlist-gated rather than extension-gated, the high-signal files land in the corpus while the noise stays out, so the code numbers above are protected, not diluted.
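
The split into one document unit plus optional section units might look like the following for Markdown. This is a simplified sketch with hypothetical names and unit fields: it recognises ATX (`#`) and Setext (underlined) headings but does not skip fenced code blocks, which a production extractor would need to handle.

```python
import re

ATX = re.compile(r"^ {0,3}#{1,6}(\s|$)")
SETEXT_UNDERLINE = re.compile(r"^ {0,3}(=+|-+)\s*$")


def markdown_units(path, text):
    """Return one whole-file document unit plus one section unit per heading."""
    lines = text.splitlines()
    starts = []  # index of the line where each heading begins
    for i, line in enumerate(lines):
        if ATX.match(line):
            starts.append(i)
        elif (SETEXT_UNDERLINE.match(line) and i > 0
              and lines[i - 1].strip() and not ATX.match(lines[i - 1])):
            starts.append(i - 1)  # Setext: the heading text is the line above

    units = [{"kind": "document", "title": path, "text": text}]
    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(lines)
        units.append({
            "kind": "section",
            "title": lines[start].strip().lstrip("#").strip(),
            "text": "\n".join(lines[start:end]),
        })
    return units


doc = "# Intro\nhello\n\nSetup\n-----\nrun it\n"
for unit in markdown_units("README.md", doc):
    print(unit["kind"], unit["title"])
```

A file with no headings still returns the single document unit, which matches the guarantee that an unstructured file stays retrievable as a whole.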

Not yet supported

Swift is deliberately deferred. Its current Tree-sitter grammar targets a newer ABI (15) than our pinned runtime, so wiring it in safely needs a tree-sitter@0.22+ upgrade first. We would rather ship it on a stable runtime than rush a fragile binding.

If you need a language that is not on the list, the per-language cost is small (a symbol extractor and a reference extractor that mirror the existing ones) — tell us which one and we will benchmark it.