# Indexing
FlowRAG uses a batch indexing pipeline: scan files, chunk them, extract entities via LLM, generate embeddings, and store everything.
## Basic Usage
```ts
await rag.index('./content');
```

This processes all files in the directory recursively.
## Incremental Indexing
By default, FlowRAG skips unchanged documents. Each document's content is hashed (SHA-256) after processing and stored in KV storage. On re-index, only modified or new files are processed.
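Conceptually, the skip check works like the sketch below (the KV key layout and helper names here are illustrative, not FlowRAG's internals):

```ts
import { createHash } from 'node:crypto';

// Illustrative sketch: re-index a document only when its content hash changed.
async function shouldReindex(
  kv: { get(key: string): Promise<string | undefined> },
  docId: string,
  content: string,
): Promise<boolean> {
  const hash = createHash('sha256').update(content).digest('hex');
  const stored = await kv.get(`doc-hash:${docId}`);
  return stored !== hash;
}
```

In practice you just call index() again: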
```ts
// Only processes changed/new documents
await rag.index('./content');

// Force re-process everything
await rag.index('./content', { force: true });
```

## Pipeline Stages
```
Files → Scanner → Chunker → Extractor (LLM) → Embedder → Storage → Entity Embedding
```

### 1. Scanner
Reads files from the input path. Supports text files (.txt, .md, .json, etc.).
### 2. Chunker
Splits documents into chunks using token-based splitting:
- Chunk size: ~1200 tokens (default)
- Overlap: 100 tokens between chunks
- Uses tiktoken for accurate tokenization
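As a rough sketch (not FlowRAG's exact implementation), token-based splitting with overlap looks like this, here using the js-tiktoken port:

```ts
import { getEncoding } from 'js-tiktoken';

// Sketch of overlapping token-window chunking; FlowRAG's internal chunker may differ.
function chunkByTokens(text: string, size = 1200, overlap = 100): string[] {
  const enc = getEncoding('cl100k_base');
  const tokens = enc.encode(text);
  const chunks: string[] = [];
  // Step forward by (size - overlap) so consecutive chunks share `overlap` tokens.
  for (let start = 0; start < tokens.length; start += size - overlap) {
    chunks.push(enc.decode(tokens.slice(start, start + size)));
  }
  return chunks;
}
```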
### 3. Extractor
The LLM reads each chunk and extracts:
- Entities: named things (services, databases, protocols...)
- Relations: connections between entities (uses, produces, depends on...)
- Custom fields: if defined in the schema
Extraction results are cached in KV storage to avoid re-processing identical chunks.
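The result for one chunk is structured data roughly along these lines (the type below is illustrative; FlowRAG's actual types may differ):

```ts
// Illustrative shape of one chunk's extraction result.
interface ExtractionResult {
  entities: { name: string; type: string; description?: string }[];
  relations: { source: string; target: string; type: string }[];
}

const example: ExtractionResult = {
  entities: [
    { name: 'Auth Service', type: 'service' },
    { name: 'PostgreSQL', type: 'database' },
  ],
  relations: [{ source: 'Auth Service', target: 'PostgreSQL', type: 'uses' }],
};
```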
### 4. Embedder
Generates vector embeddings for each chunk, enabling semantic search.
### 5. Storage
Saves everything to three stores:
- KV: documents, chunks, extraction cache, document hashes
- Vector: chunk embeddings for semantic search. Document metadata fields (from DocumentMetadata.fields) are automatically included in vector records, so they're available in search results without additional lookups (see the sketch after this list)
- Graph: entities and relations for knowledge graph traversal
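For illustration, a stored vector record might carry a payload like this (the field names are illustrative, not FlowRAG's storage schema):

```ts
// Illustrative vector record; actual field names are internal to FlowRAG.
const record = {
  id: 'chunk:readme:0',
  vector: [0.12, -0.03 /* ...remaining embedding values */],
  metadata: {
    docId: 'doc:readme',
    // DocumentMetadata.fields are copied in automatically, e.g.:
    author: 'jane',
    category: 'infra',
  },
};
```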
### 6. Entity Embedding
After all chunks are processed, every entity in the knowledge graph is embedded and stored in vector storage. This enables semantic entity search via searchEntities() — find entities by meaning rather than exact name.
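For example (the query string is arbitrary; see the searchEntities() docs for the result shape):

```ts
// Matches entities by semantic similarity, not exact name.
const results = await rag.searchEntities('authentication service');
```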
## Concurrency Control
Configure parallel processing via options:
```ts
const rag = createFlowRAG({
  schema,
  ...createLocalStorage('./data'),
  options: {
    indexing: {
      chunkSize: 1200, // tokens per chunk
      chunkOverlap: 100, // overlap between chunks
      maxParallelInsert: 2, // concurrent documents
      llmMaxAsync: 4, // concurrent LLM calls
      embeddingMaxAsync: 16, // concurrent embedding calls
    },
  },
});
```

## Human-in-the-Loop
When using the CLI with --interactive, you can review extracted entities before they're stored:
```bash
flowrag index ./content --interactive
```

This shows each extracted entity and relation, letting you accept, reject, or edit them. See CLI Reference for details.
## Document Deletion
Delete a document and automatically clean up its entities and relations from the knowledge graph:
```ts
await rag.deleteDocument('doc:readme');
```

Entities shared with other documents are preserved (only their sourceChunkIds are updated). Orphaned entities and relations are removed automatically.
During folder re-indexing, stale documents (files that no longer exist on disk) are detected and deleted automatically.
## Document Parsers
By default, FlowRAG reads text files (.txt, .md, .json, .yaml, etc.). For non-text documents, register custom parsers:
```ts
const rag = createFlowRAG({
  schema,
  ...createLocalStorage('./data'),
  parsers: [new PDFParser(), new DocxParser()],
});
```

Parsers implement the DocumentParser interface (see Interfaces for details). Files with matching extensions are parsed instead of read as plain text.
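As a rough sketch only (the real DocumentParser contract is defined in Interfaces; the member names below are assumptions):

```ts
// Hypothetical parser sketch; check the DocumentParser interface for the actual members.
class CsvParser {
  // File extensions this parser claims (assumed field name).
  extensions = ['.csv'];

  // Convert raw file bytes into plain text for chunking (assumed method name).
  async parse(data: Buffer): Promise<string> {
    // A real parser would do format-specific extraction here.
    return data.toString('utf-8');
  }
}
```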
## Extraction Gleaning
Run the LLM multiple times on the same chunk for higher extraction accuracy. Each additional pass receives previously found entities as context:
```ts
const rag = createFlowRAG({
  schema,
  ...createLocalStorage('./data'),
  options: {
    indexing: {
      extractionGleanings: 2, // 2 additional passes per chunk
    },
  },
});
```

Results are deduplicated automatically. More passes improve recall at the cost of additional LLM calls.
## Entity Merging
Merge duplicate entities extracted by the LLM:
```ts
await rag.mergeEntities({
  sources: ['Auth Service', 'AuthService', 'auth-service'],
  target: 'Auth Service',
});
```

All relations are redirected to the target entity, duplicates are removed, and source entities are deleted. Self-relations created by the merge are automatically skipped.
## Pipeline Hooks
For programmatic control, use the onEntitiesExtracted hook:
```ts
const rag = createFlowRAG({
  schema,
  ...createLocalStorage('./data'),
  hooks: {
    onEntitiesExtracted: async (extraction, context) => {
      // Filter, modify, or log extracted entities
      console.log(`Chunk ${context.chunkId}: ${extraction.entities.length} entities`);
      return extraction;
    },
  },
});
```
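The same hook can drop or rewrite entities before they reach storage. For instance (the filter criterion is arbitrary, and the name field is assumed from the examples above):

```ts
hooks: {
  onEntitiesExtracted: async (extraction) => ({
    ...extraction,
    // Drop very short, likely-noisy entity names before storage.
    entities: extraction.entities.filter((e) => e.name.length > 2),
  }),
},
```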