Fennec Community community/rag/multi_doc_rag.md

Multi-Document RAG — `multi_doc_rag` Module — Public API Reference

Module Overview
MultiDocumentRAGSystem
Response Schemas
Language Auto-Detection
Data Flow Diagram
Quick-Start Example

1. Module Overview

MultiDocumentRAGSystem is a production-grade RAG engine that manages multiple independent documents inside a single shared vector database. Unlike single-document RAG, every document is tracked individually — with its own metadata, chunk list, language tag, and ingestion timestamp — so you can add, update, remove, or query across any combination of documents at any time.

Key capabilities:

Per-document independent chunking: each document is split and stored under its own namespace, so removing a document deletes only its own chunks.
Automatic language detection: Arabic (U+0600–U+06FF) and Chinese (U+4E00–U+9FFF) are detected by character ratio, with English as the fallback; overridable per document.
Flexible ingestion formats: Dict[doc_id, text] or List[Dict] with per-document metadata.
Document-scoped queries: query across all documents or restrict to a specific subset via filter_docs.
Bilingual prompts: Arabic and English prompt templates are built in; the language is selected per query call.
JSON persistence: the document registry and stats are saved as JSON next to the serialised vector index, and the whole state is reloaded in one call.
Async wrappers: agenerate() and aretrieve() for non-blocking event-loop integration.

2. MultiDocumentRAGSystem

from fennec_community.rag.types.multi_doc_rag import MultiDocumentRAGSystem

2.1 Constructor

MultiDocumentRAGSystem(
    vector_db: Any,
    llm: Any,
    chunker: Any,
    context_manager: Any,
    config: Optional[Any] = None,
)

Purpose: Instantiate the multi-document RAG system by wiring together a vector database, a language model, a text chunker, and a context manager. No documents are indexed at construction time.

Parameter	Type	Required	Description
`vector_db`	`Any`	Yes	Vector database instance. Must expose `.add(chunks)`, `.search(query, top_k)`, `.remove_by_doc_id(doc_id)`, `.save(path)`, and `.load(path)`.
`llm`	`Any`	Yes	Language model instance. Must expose `.generate(prompt, **kwargs) -> str`.
`chunker`	`Any`	Yes	Text chunker. Must expose `.chunk(text, doc_id=...) -> List[chunk]`. Each chunk object must have an `.id` attribute and a `.metadata` dict attribute.
`context_manager`	`Any`	Yes	Context builder. Must expose `.build(query, results) -> str` where `results` is a list of `(chunk, score)` tuples.
`config`	`Optional[Any]`	No	Configuration object. Only `config.top_k` is read (fallback `top_k` for queries). Pass `None` to use the built-in default of `5`.
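
The interface requirements in the table above can be met by very small classes. The following is a minimal sketch, not part of fennec_community: `Chunk`, `FixedSizeChunker`, `InMemoryVectorDB`, `EchoLLM`, and `SimpleContextManager` are hypothetical stand-ins invented for illustration. Word-overlap scoring stands in for real embeddings, and the `text` and `doc_id` attributes on the chunk are assumptions that the table does not require.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Chunk:
    """Hypothetical chunk type; the table only requires .id and .metadata."""
    id: str
    text: str
    doc_id: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class FixedSizeChunker:
    """Splits text into fixed-size word windows (illustrative only)."""

    def __init__(self, words_per_chunk: int = 50) -> None:
        self.words_per_chunk = words_per_chunk

    def chunk(self, text: str, doc_id: str = "") -> List[Chunk]:
        words = text.split()
        step = self.words_per_chunk
        return [
            Chunk(id=f"{doc_id}_{i // step}", text=" ".join(words[i:i + step]), doc_id=doc_id)
            for i in range(0, len(words), step)
        ]


class InMemoryVectorDB:
    """Keeps chunks in a list and scores them by word overlap, not embeddings."""

    def __init__(self) -> None:
        self.chunks: List[Chunk] = []

    def add(self, chunks: List[Chunk]) -> None:
        self.chunks.extend(chunks)

    def search(self, query: str, top_k: int = 5) -> List[Tuple[Chunk, float]]:
        query_words = set(query.lower().split())
        scored = [
            (c, len(query_words & set(c.text.lower().split())) / max(len(query_words), 1))
            for c in self.chunks
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

    def remove_by_doc_id(self, doc_id: str) -> int:
        before = len(self.chunks)
        self.chunks = [c for c in self.chunks if c.doc_id != doc_id]
        return before - len(self.chunks)  # number of chunks deleted

    def save(self, path: str) -> None:  # persistence omitted in this sketch
        pass

    def load(self, path: str) -> None:  # persistence omitted in this sketch
        pass


class EchoLLM:
    """Returns the first line of the prompt instead of calling a model."""

    def generate(self, prompt: str, **kwargs: Any) -> str:
        return prompt.splitlines()[0] if prompt else ""


class SimpleContextManager:
    """Joins the retrieved chunk texts into one context string."""

    def build(self, query: str, results: List[Tuple[Chunk, float]]) -> str:
        return "\n".join(chunk.text for chunk, _score in results)
```

Stand-ins like these are mainly useful for unit tests, where the wiring can be exercised without a model or an embedding backend.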

Internal state initialised:

Attribute	Type	Description
`documents`	`Dict[str, Dict]`	Registry of all indexed documents. Key = `doc_id`.
`stats`	`Dict[str, int]`	Session-wide counters: `total_documents`, `total_chunks`, `total_queries`, `successful_queries`, `failed_queries`.

Example:

from fennec_community.rag.types.multi_doc_rag import MultiDocumentRAGSystem

system = MultiDocumentRAGSystem(
    vector_db=my_vector_db,
    llm=my_llm,
    chunker=my_chunker,
    context_manager=my_ctx_mgr,
)

2.2 Document Ingestion

`add_document()`

system.add_document(
    doc_id: str,
    text: str,
    metadata: Optional[Dict[str, Any]] = None,
    language: Optional[str] = None,
    chunk_independently: bool = True,
) -> Dict[str, Any]

Purpose: Index a single document by chunking it independently and writing all chunks to the vector database. Each chunk is tagged with language and original_doc_id in its metadata so the document can later be filtered or removed cleanly.

Validation checks (in order):

1. text is empty or whitespace-only → returns failure response immediately.
2. doc_id already exists in the registry → returns failure response instructing the caller to use update_document() instead.

Parameter	Type	Required	Description
`doc_id`	`str`	Yes	Unique document identifier. Must not already exist in the registry; use `update_document()` to replace.
`text`	`str`	Yes	Raw text content to index.
`metadata`	`Optional[Dict[str, Any]]`	No	Arbitrary key-value metadata stored in the document registry and returned with sources. Defaults to `{}`.
`language`	`Optional[str]`	No	Language code (`"arabic"`, `"english"`, `"chinese"`, or any custom string). Auto-detected from `text` if `None`.
`chunk_independently`	`bool`	No	Reserved for future conditional chunking strategies. Currently has no effect: every document is chunked independently.

Returns: Dict[str, Any] — see §3.1 add_document() response.

Side effects:

Chunks are written to vector_db.
Document info is stored in self.documents[doc_id].
stats["total_documents"] and stats["total_chunks"] are incremented.

Example:

result = system.add_document(
    doc_id="policy_2024",
    text="شروط الاستخدام تنطبق على جميع المستخدمين المسجلين...",
    metadata={"source": "hr_portal", "version": "3.1"},
    language="arabic",
)

if result["success"]:
    print(f"Indexed {result['num_chunks']} chunks in {result['language']}")
else:
    print(f"Failed: {result['error']}")

`add_documents()`

system.add_documents(
    documents: Union[Dict[str, str], List[Dict[str, Any]]],
    global_metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Dict[str, Any]]

Purpose: Batch-index multiple documents in a single call. Two input formats are accepted. In the list format, entries missing a doc_id key are skipped with a warning and do not appear in the result. global_metadata is merged into every document's metadata; per-document metadata takes precedence on conflicting keys.

Accepted input formats:

# Format A — simple dict
{"doc_id_1": "text one", "doc_id_2": "text two"}

# Format B — list of dicts with optional per-document fields
[
    {"doc_id": "id1", "text": "...", "metadata": {"dept": "legal"}, "language": "arabic"},
    {"doc_id": "id2", "text": "...", "metadata": {"dept": "hr"}},
]

Parameter	Type	Required	Description
`documents`	`Union[Dict[str, str], List[Dict[str, Any]]]`	Yes	Documents in either format above.
`global_metadata`	`Optional[Dict[str, Any]]`	No	Metadata merged into every document. Per-document `metadata` values override conflicting keys.

Returns: Dict[str, Dict[str, Any]] — mapping of doc_id → individual add_document() result dict. Documents that were skipped (missing doc_id) are not present in the output.
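
The two input formats and the metadata precedence rule can be sketched as one normalisation step. This is an illustrative sketch under stated assumptions, not the library's internal code: `normalise_documents` is a hypothetical helper name, and the empty-string default for a missing `text` key is an assumption.

```python
from typing import Any, Dict, List, Optional, Union

DocumentsInput = Union[Dict[str, str], List[Dict[str, Any]]]


def normalise_documents(
    documents: DocumentsInput,
    global_metadata: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Convert either input format into a list of add_document() keyword dicts."""
    shared = global_metadata or {}
    if isinstance(documents, dict):
        # Format A: {doc_id: text}
        entries = [{"doc_id": doc_id, "text": text} for doc_id, text in documents.items()]
    else:
        # Format B: list of dicts with optional per-document fields
        entries = list(documents)

    normalised = []
    for entry in entries:
        if "doc_id" not in entry:
            continue  # skipped entries never reach the result mapping
        normalised.append({
            "doc_id": entry["doc_id"],
            "text": entry.get("text", ""),
            # Later keys win, so per-document metadata overrides global_metadata.
            "metadata": {**shared, **(entry.get("metadata") or {})},
            "language": entry.get("language"),
        })
    return normalised
```

Each resulting dict could then be passed to `add_document(**entry)`, which is how the per-document result mapping described above would be assembled.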

Example:

results = system.add_documents(
    {
        "contract_001": "This agreement is entered into between Party A and Party B...",
        "policy_003": "All employees must comply with the following workplace rules...",
    },
    global_metadata={"department": "legal", "year": 2024},
)

for doc_id, result in results.items():
    status = "✅" if result["success"] else "❌"
    print(f"{status} {doc_id}: {result.get('num_chunks', result.get('error'))}")

2.3 Document Lifecycle Management

`update_document()`

system.update_document(
    doc_id: str,
    new_text: str,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]

Purpose: Replace the content of an existing document with new text. Internally performs a remove-then-add sequence: the old chunks are purged from the vector database, then the new text is chunked and re-indexed. The previous document's metadata is preserved and optionally extended with the metadata argument.

Validation: If doc_id does not exist in the registry, returns a failure response immediately without any side effects.

Parameter	Type	Required	Description
`doc_id`	`str`	Yes	Identifier of the document to replace. Must already exist.
`new_text`	`str`	Yes	New raw text content. Subject to the same validation as `add_document()`.
`metadata`	`Optional[Dict[str, Any]]`	No	Additional metadata to merge with the existing document's metadata. New keys are added; conflicting keys are overwritten.

Returns: Dict[str, Any] — same schema as add_document() response (see §3.3).

Important: The old document's metadata is read before remove_document() is called, so it is safely preserved even though the registry entry is deleted as part of the remove step.
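
The ordering described in the note above matters because the remove step deletes the registry entry that holds the metadata. A minimal sketch of that ordering follows, using a plain dict as a stand-in registry; `update_entry` and the exact failure dict are hypothetical and only illustrate the sequence.

```python
from typing import Any, Dict, Optional

# Stand-in for the document registry, keyed by doc_id.
registry: Dict[str, Dict[str, Any]] = {
    "policy_2024": {
        "text": "old text",
        "metadata": {"source": "hr_portal", "version": "3.1"},
    },
}


def update_entry(
    doc_id: str,
    new_text: str,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    if doc_id not in registry:
        # Hypothetical failure shape; no side effects have happened yet.
        return {"success": False, "doc_id": doc_id, "error": "Document not found"}

    # 1. Copy the old metadata first, because step 2 deletes the entry.
    merged = dict(registry[doc_id]["metadata"])
    merged.update(metadata or {})  # new keys added, conflicting keys overwritten

    # 2. Remove the old entry (stands in for remove_document()).
    del registry[doc_id]

    # 3. Re-add with the merged metadata (stands in for add_document()).
    registry[doc_id] = {"text": new_text, "metadata": merged}
    return {"success": True, "doc_id": doc_id, "metadata": merged}
```

If step 1 ran after step 2, the lookup would fail and the original `source` key would be lost.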

Example:

result = system.update_document(
    doc_id="policy_2024",
    new_text="Updated policy text effective January 2025...",
    metadata={"version": "4.0", "updated_by": "admin"},
)
print(result)

`remove_document()`

system.remove_document(doc_id: str) -> Dict[str, Any]

Purpose: Permanently remove a document and all of its associated chunks from both the vector database and the in-memory document registry. Statistics are updated to reflect the removal.

Validation: If doc_id is not in the registry, returns a failure response without touching the vector database.

Parameter	Type	Required	Description
`doc_id`	`str`	Yes	Identifier of the document to delete.

Returns: Dict[str, Any] — see §3.4 remove_document() response.

Side effects:

vector_db.remove_by_doc_id(doc_id) is called; the number of deleted vectors is recorded.
Entry removed from self.documents.
stats["total_documents"] and stats["total_chunks"] are decremented.

Example:

result = system.remove_document("policy_2024")
if result["success"]:
    print(f"Removed {result['chunks_removed']} chunks for doc '{result['doc_id']}'")

2.4 Querying

`query()`

system.query(
    query: str,
    top_k: Optional[int] = None,
    filter_docs: Optional[List[str]] = None,
    include_sources: bool = True,
    language: str = "ar",
    **llm_kwargs,
) -> Dict[str, Any]

Purpose: Retrieve the most relevant chunks across all (or a filtered subset of) indexed documents, build a grounded context, and generate a natural-language answer using the LLM.

Pipeline:

1. Validate query: return a failure dict immediately for empty or whitespace-only input.
2. Determine top_k (parameter → config.top_k → default 5).
3. Call vector_db.search(query, top_k=k), which returns a list of (chunk, score) tuples.
4. If filter_docs is provided, retain only tuples where chunk.doc_id is in the filter list.
5. If no results remain, return the "no answer" response.
6. Build context via context_manager.build(query, results).
7. Build the prompt in the requested language via _build_prompt(query, context, language).
8. Call llm.generate(prompt, **llm_kwargs) to produce the answer.
9. Extract source metadata via _extract_sources(results) (if include_sources=True).
10. Update stats.
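
The top_k resolution and the filter_docs step of the pipeline above can be sketched as two small functions. This is an assumption-laden illustration: `resolve_top_k` and `filter_results` are hypothetical names, and chunks are modelled as simple objects with a `doc_id` attribute.

```python
from types import SimpleNamespace
from typing import Any, List, Optional, Tuple

DEFAULT_TOP_K = 5  # built-in default named in the constructor table


def resolve_top_k(top_k: Optional[int], config: Optional[Any]) -> int:
    """Explicit parameter first, then config.top_k, then the default."""
    return top_k or getattr(config, "top_k", None) or DEFAULT_TOP_K


def filter_results(
    results: List[Tuple[Any, float]],
    filter_docs: Optional[List[str]],
) -> List[Tuple[Any, float]]:
    """Keep only (chunk, score) tuples whose chunk belongs to an allowed document."""
    if filter_docs is None:
        return results
    allowed = set(filter_docs)
    return [(chunk, score) for chunk, score in results if chunk.doc_id in allowed]


# Hypothetical search output: three hits from two documents.
hits = [
    (SimpleNamespace(doc_id="contract_001"), 0.91),
    (SimpleNamespace(doc_id="policy_003"), 0.84),
    (SimpleNamespace(doc_id="contract_001"), 0.40),
]
k = resolve_top_k(None, SimpleNamespace(top_k=3))
kept = filter_results(hits[:k], ["contract_001"])
print(k, len(kept))
```

Because the filter runs after the search, a filtered query can end up with fewer than top_k chunks even when the selected documents contain more relevant material. Passing a larger top_k alongside filter_docs is one way to compensate.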

Parameter	Type	Required	Default	Description
`query`	`str`	Yes	—	Natural-language question.
`top_k`	`Optional[int]`	No	`config.top_k` or `5`	Maximum number of chunks to retrieve from the vector database before filtering.
`filter_docs`	`Optional[List[str]]`	No	`None`	Restrict retrieval to chunks belonging to this list of `doc_id`s. When `None`, all documents are searched.
`include_sources`	`bool`	No	`True`	If `True`, the response includes a `sources` list with one entry per unique source document.
`language`	`str`	No	`"ar"`	Prompt language. `"ar"` selects the Arabic prompt template; any other value selects the English template.
`**llm_kwargs`	`Any`	No	—	Additional keyword arguments forwarded verbatim to `llm.generate()` (e.g., `temperature`, `max_tokens`).

Returns: Dict[str, Any] — see §3.5 query() response.

Edge-case return values:

Condition	`success`	`answer`
Empty query	`False`	—
No chunks retrieved	`False`	`"no answer for your query"`
Exception during processing	`False`	—
Normal result	`True`	LLM-generated answer string
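
Since the failure cases in the table above carry different keys, calling code should not assume that `answer` or `error` is always present. The helper below is a hypothetical sketch (`describe_result` is not part of the module) showing one defensive way to read the result dict.

```python
from typing import Any, Dict


def describe_result(result: Dict[str, Any]) -> str:
    """Turn any query() result dict into a printable string."""
    if result.get("success"):
        return result["answer"]
    if "error" in result:
        # Empty query or an exception during processing
        return f"Query failed: {result['error']}"
    # No chunks retrieved: the response carries a fixed "answer" message
    return result.get("answer", "no answer")
```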

Example — query all documents:

result = system.query(
    "ما هي شروط الفسخ في العقد؟",
    top_k=8,
    language="ar",
    include_sources=True,
)

if result["success"]:
    print(result["answer"])
    for src in result["sources"]:
        print(f"  [{src['score']:.3f}] {src['doc_id']} ({src['language']})")

Example — query only specific documents:

result = system.query(
    "What are the termination clauses?",
    filter_docs=["contract_001", "contract_002"],
    language="en",
    temperature=0.1,
)

2.5 Document Inspection & Statistics

`get_document_info()`

system.get_document_info(doc_id: str) -> Optional[Dict[str, Any]]

Purpose: Retrieve the full internal record for a single document from the registry. Useful for inspecting chunk counts, metadata, timestamps, and language.

Parameter	Type	Required	Description
`doc_id`	`str`	Yes	Document identifier to look up.

Returns: Dict[str, Any] if the document exists, None if it does not.

Returned dict fields:

Field	Type	Description
`text`	`str`	Original full text of the document.
`chunks`	`List[str]`	List of chunk IDs stored in the vector database.
`num_chunks`	`int`	Total number of chunks.
`metadata`	`Dict[str, Any]`	Document metadata as provided at ingestion.
`added_at`	`str`	ISO 8601 timestamp of when the document was first indexed.
`language`	`str`	Detected or explicitly provided language.
`status`	`str`	Always `"active"` for live documents.

Example:

info = system.get_document_info("policy_2024")
if info:
    print(f"Language : {info['language']}")
    print(f"Chunks   : {info['num_chunks']}")
    print(f"Added at : {info['added_at']}")
    print(f"Metadata : {info['metadata']}")

`list_documents()`

system.list_documents() -> List[Dict[str, Any]]

Purpose: Return a summary list of all documents currently indexed in the system. Does not include the full text or chunks fields (use get_document_info() for those).

Parameters: None.

Returns: List[Dict[str, Any]] — one entry per document with fields:

Field	Type	Description
`doc_id`	`str`	Document identifier.
`num_chunks`	`int`	Number of chunks stored in the vector DB.
`language`	`str`	Document language.
`added_at`	`str`	ISO 8601 ingestion timestamp.
`metadata`	`Dict[str, Any]`	Document metadata.

Returns an empty list [] if no documents have been indexed.

Example:

docs = system.list_documents()
print(f"Total indexed documents: {len(docs)}")
for doc in docs:
    print(f"  {doc['doc_id']} | {doc['language']} | {doc['num_chunks']} chunks | {doc['added_at'][:10]}")

`get_stats()`

system.get_stats() -> Dict[str, Any]

Purpose: Return all system-wide statistics counters plus a per-document breakdown of chunk counts and languages. This is the primary method for monitoring the health of the system.

Parameters: None.

Returns: Dict[str, Any] with the following structure:

{
    "total_documents":  int,   # Number of active documents in the registry
    "total_chunks":     int,   # Chunks currently stored in the vector DB
    "total_queries":    int,   # All query() calls (including failed ones)
    "successful_queries": int, # Queries that returned an answer
    "failed_queries":   int,   # Queries that failed or found no results
    "documents": {
        "<doc_id>": {
            "num_chunks": int,
            "language":   str,
        },
        ...
    }
}

Example:

import json
stats = system.get_stats()
print(json.dumps(stats, indent=2, ensure_ascii=False))

`get_document_stats_summary()`

system.get_document_stats_summary() -> Dict[str, Any]

Purpose: Return a high-level summary of document distribution — useful for dashboards and corpus health checks. Provides language counts and average chunk density without the per-document detail of get_stats().

Parameters: None.

Returns: Dict[str, Any] with the following fields:

Field	Type	Description
`total_documents`	`int`	Number of active documents.
`total_chunks`	`int`	Total chunks across all documents.
`languages`	`Dict[str, int]`	Mapping of language code → document count (e.g., `{"arabic": 4, "english": 2}`).
`avg_chunks_per_doc`	`float`	`total_chunks / total_documents`; `0.0` when no documents are indexed.
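
The fields in the table above can be derived from a registry of per-document records. The following sketch assumes each record has `num_chunks` and `language` keys, as listed for get_document_info(); `summarise` is a hypothetical function name.

```python
from collections import Counter
from typing import Any, Dict


def summarise(documents: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Compute language counts and average chunk density for a registry."""
    total_documents = len(documents)
    total_chunks = sum(doc["num_chunks"] for doc in documents.values())
    languages = Counter(doc["language"] for doc in documents.values())
    return {
        "total_documents": total_documents,
        "total_chunks": total_chunks,
        "languages": dict(languages),
        # Guard against division by zero when nothing is indexed.
        "avg_chunks_per_doc": total_chunks / total_documents if total_documents else 0.0,
    }
```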

Example:

summary = system.get_document_stats_summary()
print(f"Documents   : {summary['total_documents']}")
print(f"Total chunks: {summary['total_chunks']}")
print(f"Avg chunks  : {summary['avg_chunks_per_doc']:.1f}")
print(f"Languages   : {summary['languages']}")
# Languages: {'arabic': 5, 'english': 2, 'chinese': 1}

2.6 Persistence

`save()`

system.save(path: str) -> None

Purpose: Persist the complete system state to a directory so it can be fully restored later without re-indexing any documents. Creates the directory (and any parents) if they do not already exist.

Parameter	Type	Required	Description
`path`	`str`	Yes	Directory path where the system state will be written.

Returns: None

Directory layout created:

<path>/
├── vector_db/        ← serialised vector index (format depends on vector_db implementation)
├── documents.json    ← full document registry (doc IDs, chunk IDs, metadata, timestamps)
└── stats.json        ← session statistics counters
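
The JSON part of this layout can be sketched with the standard library alone. The code below is an illustration, not the module's implementation: `save_state` and `load_state` are hypothetical names, the vector_db/ step is left out because its format depends on the vector database, and `ensure_ascii=False` is an assumption chosen so that Arabic text stays readable on disk.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Tuple


def save_state(path: str, documents: Dict[str, Any], stats: Dict[str, int]) -> None:
    """Write documents.json and stats.json, creating parent directories."""
    root = Path(path)
    root.mkdir(parents=True, exist_ok=True)
    (root / "documents.json").write_text(
        json.dumps(documents, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    (root / "stats.json").write_text(json.dumps(stats, indent=2), encoding="utf-8")


def load_state(path: str) -> Tuple[Dict[str, Any], Dict[str, int]]:
    """Read both JSON files back; a missing directory is an error."""
    root = Path(path)
    if not root.exists():
        raise FileNotFoundError(f"No saved system at {root}")
    documents = json.loads((root / "documents.json").read_text(encoding="utf-8"))
    stats = json.loads((root / "stats.json").read_text(encoding="utf-8"))
    return documents, stats


# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "saved_systems" / "demo"
    save_state(str(target), {"d1": {"num_chunks": 2, "language": "arabic"}}, {"total_documents": 1})
    restored_docs, restored_stats = load_state(str(target))
    print(restored_docs, restored_stats)
```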

Example:

system.save("./saved_systems/legal_rag_v2")

`load()` (class method)

MultiDocumentRAGSystem.load(
    path: str,
    vector_db: Any,
    llm: Any,
    chunker: Any,
    context_manager: Any,
    config: Optional[Any] = None,
) -> MultiDocumentRAGSystem

Purpose: Restore a previously saved system from disk. The vector index is loaded into the provided (empty) vector_db instance, and the document registry and statistics are rehydrated from JSON. No re-indexing is required.

Parameter	Type	Required	Description
`path`	`str`	Yes	Directory path passed to `save()`.
`vector_db`	`Any`	Yes	A fresh, empty vector DB instance — the saved index will be loaded into it via `vector_db.load()`.
`llm`	`Any`	Yes	Language model instance.
`chunker`	`Any`	Yes	Text chunker instance (needed for future `add_document()` calls).
`context_manager`	`Any`	Yes	Context builder instance.
`config`	`Optional[Any]`	No	Override configuration.

Returns: MultiDocumentRAGSystem — fully initialised with restored documents and statistics, ready to serve queries.

Raises: FileNotFoundError — if path does not exist.

Example:

restored = MultiDocumentRAGSystem.load(
    path="./saved_systems/legal_rag_v2",
    vector_db=fresh_vector_db,
    llm=my_llm,
    chunker=my_chunker,
    context_manager=my_ctx_mgr,
)

print(repr(restored))
# MultiDocumentRAGSystem(documents=12, chunks=480, queries=0)

result = restored.query("ما هي مواعيد الدفع؟")
print(result["answer"])

2.7 Async API

`agenerate()`

await system.agenerate(query: str, **kwargs) -> str

Purpose: Async wrapper for query-and-generate. Delegates to the synchronous generate() method (if exposed on the system) via asyncio.to_thread, making it safe to await from an event loop without blocking.

Parameter	Type	Required	Description
`query`	`str`	Yes	Natural-language question.
`**kwargs`	`Any`	No	Forwarded to the underlying `generate()` call.

Returns: str — the generated answer string.

Note: This method calls self.generate() if it exists. In the default implementation, this is not the same as query() — query() returns a full result dict whereas generate() is expected to return a plain string. Ensure your subclass or composition layer exposes a generate() method if you use agenerate().
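
One way to satisfy the requirement in the note above is a thin subclass that derives a plain-string generate() from query(). The sketch below uses a hypothetical `FakeSystem` in place of the real class so that it runs standalone; how the real agenerate() is implemented internally is not shown here, only the `asyncio.to_thread` delegation pattern the section describes.

```python
import asyncio
from typing import Any, Dict


class FakeSystem:
    """Hypothetical stand-in exposing query() with the documented result shape."""

    def query(self, query: str, **kwargs: Any) -> Dict[str, Any]:
        return {"success": True, "answer": f"answer to: {query}", "sources": [], "num_results": 1}


class SystemWithGenerate(FakeSystem):
    def generate(self, query: str, **kwargs: Any) -> str:
        """Reduce the query() result dict to a plain answer string."""
        result = self.query(query, **kwargs)
        return result.get("answer", "") if result.get("success") else ""

    async def agenerate(self, query: str, **kwargs: Any) -> str:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.generate, query, **kwargs)


answer = asyncio.run(SystemWithGenerate().agenerate("What are the payment terms?"))
print(answer)
```

Returning an empty string on failure is a design choice of this sketch; raising an exception instead would make failures harder to miss.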

Example:

import asyncio

async def main():
    answer = await system.agenerate("What are the payment terms?")
    print(answer)

asyncio.run(main())

`aretrieve()`

await system.aretrieve(query: str, **kwargs)

Purpose: Async wrapper for retrieval. Calls retrieve_with_context() if available, otherwise falls back to retrieve(), running it in a thread pool via asyncio.to_thread. Returns an empty list [] if neither method is found on the system.

Parameter	Type	Required	Description
`query`	`str`	Yes	Query string.
`**kwargs`	`Any`	No	Forwarded to the retrieval method.

Returns: The raw return value of retrieve_with_context() or retrieve(), typically List[Tuple[chunk, score]]. Returns [] if no retrieval method is available.
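
The fallback order described above can be sketched as a standalone coroutine. This is a hypothetical illustration (`aretrieve_fallback` and the three demo classes are invented), showing only the lookup order and the empty-list default.

```python
import asyncio
from typing import Any


async def aretrieve_fallback(system: Any, query: str, **kwargs: Any) -> Any:
    """Prefer retrieve_with_context(), then retrieve(), else return []."""
    for name in ("retrieve_with_context", "retrieve"):
        method = getattr(system, name, None)
        if callable(method):
            # Run the synchronous retrieval in a worker thread.
            return await asyncio.to_thread(method, query, **kwargs)
    return []


class BothMethods:
    def retrieve_with_context(self, query: str, **kwargs: Any) -> list:
        return [("context hit for " + query, 0.9)]

    def retrieve(self, query: str, **kwargs: Any) -> list:
        return [("plain hit for " + query, 0.5)]


class OnlyRetrieve:
    def retrieve(self, query: str, **kwargs: Any) -> list:
        return [("plain hit for " + query, 0.5)]


class NoRetrieval:
    pass


for candidate in (BothMethods(), OnlyRetrieve(), NoRetrieval()):
    print(asyncio.run(aretrieve_fallback(candidate, "payments")))
```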

Example:

import asyncio

async def main():
    docs = await system.aretrieve("payment schedule")
    for chunk, score in docs[:3]:
        print(f"[{score:.3f}] {chunk.text[:80]}")

asyncio.run(main())

2.8 Context Manager

MultiDocumentRAGSystem supports the async context manager protocol:

import asyncio

async def main():
    async with MultiDocumentRAGSystem(
        vector_db=my_vdb,
        llm=my_llm,
        chunker=my_chunker,
        context_manager=my_ctx_mgr,
    ) as system:
        system.add_document("doc1", "Contract text here...")
        result = system.query("What are the obligations?", language="en")
        print(result["answer"])
    # __aexit__ is a no-op — used as a clean scope delimiter

asyncio.run(main())

The synchronous with statement is not supported. Instantiate directly for synchronous use.

2.9 Representation

`__repr__()`

repr(system)

Purpose: Return a concise machine-readable representation of the current system state. Ideal for logging, REPL inspection, and debugging.

Returns: str in the format:

MultiDocumentRAGSystem(documents=<n>, chunks=<n>, queries=<n>)

Example:

print(repr(system))
# MultiDocumentRAGSystem(documents=5, chunks=120, queries=17)

3. Response Schemas

3.1 `add_document()` response

Success:

{
    "success":    True,
    "doc_id":     "policy_2024",       # str   — document identifier
    "num_chunks": 8,                   # int   — number of chunks created
    "language":   "arabic",            # str   — detected or provided language
    "metadata":   {"source": "hr"},    # dict  — stored metadata
}

Failure:

{
    "success": False,
    "doc_id":  "policy_2024",
    "error":   "Document already exists. Use update_document()"
    # other possible values: "Empty document", "No chunks created", "<exception message>"
}

3.2 `add_documents()` response

{
    "policy_2024":  {"success": True,  "doc_id": "policy_2024",  "num_chunks": 8, ...},
    "contract_001": {"success": False, "doc_id": "contract_001", "error": "Empty document"},
}

3.3 `update_document()` response

Identical schema to add_document() response. "success": True means the old document was deleted and the new one indexed successfully.

3.4 `remove_document()` response

Success:

{
    "success":       True,
    "doc_id":        "policy_2024",   # str — identifier of the removed document
    "chunks_removed": 8,              # int — number of vectors deleted from vector DB
}

Failure:

{
    "success": False,
    "error":   "Document not found"   # or "<exception message>"
}

3.5 `query()` response

Success:

{
    "success":     True,
    "answer":      "وفقاً للعقد، يجب إشعار الطرف الآخر...",  # str — LLM answer
    "sources": [
        {
            "doc_id":   "contract_001",  # str   — source document ID
            "score":    0.873,           # float — highest similarity score from this doc
            "language": "arabic",        # str   — document language
            "metadata": {"dept": "legal"} # dict — document metadata
        },
        ...
    ],
    "num_results": 5,                    # int   — total retrieved chunks before filtering
}

Failure (empty query):

{"success": False, "error": "Empty query"}

Failure (no results):

{"success": False, "answer": "no answer for your query", "sources": []}

Failure (exception):

{"success": False, "error": "<exception message>"}
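
The `sources` list in the success schema holds one entry per document, carrying the highest score seen for that document. A minimal sketch of that deduplication follows; `extract_sources` is a hypothetical name, chunks are assumed to expose a `doc_id` attribute, and sorting the output by score is a choice of this sketch rather than documented behaviour.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple


def extract_sources(
    results: List[Tuple[Any, float]],
    documents: Dict[str, Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Collapse (chunk, score) tuples to one source entry per document."""
    best: Dict[str, Dict[str, Any]] = {}
    for chunk, score in results:
        doc_id = chunk.doc_id
        if doc_id not in best or score > best[doc_id]["score"]:
            info = documents.get(doc_id, {})
            best[doc_id] = {
                "doc_id": doc_id,
                "score": score,
                "language": info.get("language", "unknown"),
                "metadata": info.get("metadata", {}),
            }
    # Highest-scoring document first (ordering is an assumption of this sketch).
    return sorted(best.values(), key=lambda source: source["score"], reverse=True)


registry = {"contract_001": {"language": "arabic", "metadata": {"dept": "legal"}}}
hits = [
    (SimpleNamespace(doc_id="contract_001"), 0.52),
    (SimpleNamespace(doc_id="policy_003"), 0.70),
    (SimpleNamespace(doc_id="contract_001"), 0.87),
]
for source in extract_sources(hits, registry):
    print(source["doc_id"], source["score"])
```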

4. Language Auto-Detection

When language=None is passed to add_document(), the system detects the language automatically using Unicode character ratio analysis:

Language	Detection Criterion	Unicode Range
`"arabic"`	Arabic char ratio > 30% of total chars	`U+0600–U+06FF`
`"chinese"`	Chinese char ratio > 30% of total chars	`U+4E00–U+9FFF`
`"english"`	Neither ratio exceeds 30%	(fallback)
`"unknown"`	Empty text	—

Override: Pass language="french" (or any custom string) to bypass detection.
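
The ratio rule in the table above can be sketched in a few lines. This is an approximation under stated assumptions: `detect_language` is a hypothetical stand-in for the internal detector, and the ratio here is taken over all characters including whitespace and punctuation, which the table does not specify.

```python
def detect_language(text: str, threshold: float = 0.30) -> str:
    """Classify text by the share of Arabic or Chinese code points."""
    if not text:
        return "unknown"
    total = len(text)
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06ff")
    chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    if arabic / total > threshold:
        return "arabic"
    if chinese / total > threshold:
        return "chinese"
    return "english"  # fallback when neither ratio exceeds the threshold


for sample in ("شروط الاستخدام", "这是一个测试", "Hello world", ""):
    print(repr(sample), "->", detect_language(sample))
```

Because English is the fallback, text in any other script (French, Russian, and so on) is labelled "english" by a detector like this unless the language is passed explicitly.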

Prompt template selection in query(): The language parameter on query() is independent of the document language. Pass language="ar" to get an Arabic prompt/answer, regardless of what language the documents are in.

5. Data Flow Diagram

──────────────────────────────────────────────────────────────────
                     INGESTION PIPELINE
──────────────────────────────────────────────────────────────────

add_document(doc_id, text, metadata, language)
│
├─ Validate: empty text?  → return failure
├─ Validate: doc_id dup?  → return failure (suggest update_document)
│
├─ _detect_language(text)          [if language=None]
│    ├─ Arabic char ratio > 0.30   → "arabic"
│    ├─ Chinese char ratio > 0.30  → "chinese"
│    └─ default                    → "english"
│
├─ chunker.chunk(text, doc_id=doc_id)
│    └─ chunk.metadata["language"]           = language
│       chunk.metadata["original_doc_id"]   = doc_id
│
├─ vector_db.add(chunks)
│
├─ documents[doc_id] = {text, chunk_ids, num_chunks,
│                        metadata, added_at, language, status}
│
└─ stats["total_documents"] += 1
   stats["total_chunks"]    += len(chunks)

──────────────────────────────────────────────────────────────────
                       QUERY PIPELINE
──────────────────────────────────────────────────────────────────

query(query, top_k, filter_docs, include_sources, language)
│
├─ Validate: empty query? → return failure dict
│
├─ k = top_k or config.top_k or 5
│
├─ vector_db.search(query, top_k=k)
│    └─ List[(chunk, score)]
│
├─ filter_docs provided?
│    └─ retain only chunks where chunk.doc_id in filter_docs
│
├─ no results? → return {success:False, answer:"no answer..."}
│
├─ context_manager.build(query, results) → context_str
│
├─ _build_prompt(query, context_str, language)
│    ├─ language == "ar" → Arabic prompt template
│    └─ other            → English prompt template
│
├─ llm.generate(prompt, **llm_kwargs) → answer_str
│
├─ include_sources → _extract_sources(results)
│    └─ deduplicate by doc_id, keep highest score per doc
│       → [{doc_id, score, language, metadata}, ...]
│
├─ stats["total_queries"]     += 1
│  stats["successful_queries"] += 1
│
└─ return {success, answer, sources, num_results}

──────────────────────────────────────────────────────────────────
                    PERSISTENCE PIPELINE
──────────────────────────────────────────────────────────────────

save(path)
  └─ vector_db.save(path/"vector_db")
     json.dump(documents  → path/"documents.json")
     json.dump(stats      → path/"stats.json")

load(path, vector_db, llm, chunker, context_manager, config)
  └─ vector_db.load(path/"vector_db")
     system = MultiDocumentRAGSystem(vector_db, llm, chunker, ctx, config)
     system.documents = json.load("documents.json")
     system.stats     = json.load("stats.json")
     return system

6. Quick-Start Example

import asyncio
import json
from fennec_community.rag.types.multi_doc_rag import MultiDocumentRAGSystem

# ── 1. Instantiate ────────────────────────────────────────────────────────
system = MultiDocumentRAGSystem(
    vector_db=my_vector_db,
    llm=my_llm,
    chunker=my_chunker,
    context_manager=my_ctx_mgr,
)
print(repr(system))
# MultiDocumentRAGSystem(documents=0, chunks=0, queries=0)

# ── 2. Add a single document ──────────────────────────────────────────────
r = system.add_document(
    doc_id="contract_001",
    text="This agreement is entered into on January 1, 2025, between Party A and Party B...",
    metadata={"department": "legal", "version": "1.0"},
    language="english",
)
print(r)
# {'success': True, 'doc_id': 'contract_001', 'num_chunks': 4, 'language': 'english', ...}

# ── 3. Batch-add documents (dict format) ──────────────────────────────────
results = system.add_documents(
    {
        "policy_hr_2024": "جميع الموظفين ملزمون باتباع سياسة العمل التالية...",
        "policy_it_2024": "يُمنع استخدام الأجهزة الشخصية على شبكة الشركة...",
    },
    global_metadata={"year": 2024, "status": "approved"},
)
for doc_id, res in results.items():
    print(f"{'✅' if res['success'] else '❌'} {doc_id}")

# ── 4. Batch-add with per-document metadata (list format) ─────────────────
system.add_documents([
    {"doc_id": "report_q1", "text": "Q1 financial results show 15% revenue growth...",
     "metadata": {"quarter": "Q1"}, "language": "english"},
    {"doc_id": "report_q2", "text": "Q2 showed continued momentum with 18% growth...",
     "metadata": {"quarter": "Q2"}},
])

# ── 5. Inspect the registry ───────────────────────────────────────────────
docs = system.list_documents()
for doc in docs:
    print(f"  {doc['doc_id']:20} | {doc['language']:8} | {doc['num_chunks']} chunks")

info = system.get_document_info("contract_001")
print(f"Added at: {info['added_at']}")
print(f"Metadata: {info['metadata']}")

# ── 6. Query all documents (Arabic answer) ────────────────────────────────
result = system.query(
    "ما هي سياسة استخدام الأجهزة الشخصية؟",
    top_k=6,
    language="ar",
    include_sources=True,
)
if result["success"]:
    print(result["answer"])
    for src in result["sources"]:
        print(f"  [{src['score']:.3f}] {src['doc_id']}")

# ── 7. Query only specific documents ─────────────────────────────────────
result = system.query(
    "What are the payment terms?",
    filter_docs=["contract_001"],
    language="en",
    top_k=4,
)
print(result["answer"])

# ── 8. Update a document ──────────────────────────────────────────────────
system.update_document(
    doc_id="policy_hr_2024",
    new_text="نسخة محدّثة من سياسة العمل لعام 2025...",
    metadata={"year": 2025, "updated_by": "hr_admin"},
)

# ── 9. Remove a document ──────────────────────────────────────────────────
rem = system.remove_document("report_q1")
print(f"Removed {rem['chunks_removed']} chunks")

# ── 10. Statistics ────────────────────────────────────────────────────────
print(json.dumps(system.get_stats(), indent=2, ensure_ascii=False))

summary = system.get_document_stats_summary()
print(f"Avg chunks/doc : {summary['avg_chunks_per_doc']:.1f}")
print(f"Languages      : {summary['languages']}")

# ── 11. Persistence ───────────────────────────────────────────────────────
system.save("./saved_systems/multi_rag_v1")

# Later — restore without re-indexing
restored = MultiDocumentRAGSystem.load(
    path="./saved_systems/multi_rag_v1",
    vector_db=fresh_vector_db,
    llm=my_llm,
    chunker=my_chunker,
    context_manager=my_ctx_mgr,
)
print(repr(restored))
result = restored.query("What are the financial results?", language="en")
print(result["answer"])

# ── 12. Async usage ───────────────────────────────────────────────────────
async def async_demo():
    # Async retrieval
    docs = await system.aretrieve("payment schedule")
    print(f"Retrieved {len(docs)} chunks")

    # Async generation
    answer = await system.agenerate("What are the obligations?")
    print(answer)

asyncio.run(async_demo())

# ── 13. Async context manager ─────────────────────────────────────────────
async def ctx_demo():
    async with MultiDocumentRAGSystem(
        vector_db=my_vector_db,
        llm=my_llm,
        chunker=my_chunker,
        context_manager=my_ctx_mgr,
    ) as temp_system:  # avoids shadowing the standard-library module name "sys"
        temp_system.add_document("temp_doc", "Temporary document for testing purposes.")
        result = temp_system.query("What is this document about?", language="en")
        print(result["answer"])

asyncio.run(ctx_demo())

Simple Real Example


import os

from fennec_community.llm import MistralInterface
from fennec_community.document_loaders import TextLoader
from fennec_community.vector_database import FAISSVectorDatabase
from fennec_community.chunks import ArabicTextChunker
from fennec_community.context import ContextManager
from fennec_community.embeddings import OllamaEmbedder
from fennec_community.rag.types.multi_doc_rag import MultiDocumentRAGSystem

# Assumption: the Mistral API key is supplied through this environment variable.
llm_api = os.environ["MISTRAL_API_KEY"]

# Load each knowledge-base file and keep the first loaded document of each.
paths = [
    "./data_kn/faq.txt",
    "./data_kn/orders.txt",
    "./data_kn/returns_policy.txt",
    "./data_kn/technical_support.txt",
]
loaded_docs = [TextLoader(path).load()[0] for path in paths]

# Wire the components together.
chunker = ArabicTextChunker(chunk_size=100, overlap=20)
embedder = OllamaEmbedder()
vector_db = FAISSVectorDatabase(embedder=embedder)
llm = MistralInterface(api_key=llm_api)
context_manager = ContextManager()
multi_doc_rag = MultiDocumentRAGSystem(
    llm=llm, vector_db=vector_db, chunker=chunker, context_manager=context_manager
)

# Index all files in one batch call (list format with per-document metadata).
multi_doc_rag.add_documents([
    {"doc_id": doc.doc_id, "text": doc.page_content, "metadata": doc.metadata}
    for doc in loaded_docs
])

# Ask "What payment methods are available?" and print the answer on success.
result = multi_doc_rag.query("ما هي طرق الدفع المتاحة؟")
if result["success"]:
    print(result["answer"])
