
Chunks Module

Table of Contents

  1. Overview
  2. Architecture at a Glance
  3. Quick Start
  4. Core Entry Point — ChunkManager
  5. Chunkers
  6. Text Splitters
  7. Supporting Components
  8. Data Models
  9. Base Classes
  10. Configuration Reference — ChunkConfig
  11. ChunkMode Enum
  12. End-to-End Usage Examples

Overview

The chunks module provides a complete, production-ready pipeline for splitting documents into semantically meaningful units ("chunks") ready for vector embedding and retrieval in RAG systems. It handles:

  • Multiple splitting strategies — semantic, adaptive, structure-aware, context-aware, and hybrid.
  • Multilingual support — 17+ languages with dedicated BERT models; first-class Arabic support.
  • Post-processing — TF-IDF keyword extraction, importance scoring, and deduplication.
  • Query-aware routing — automatically selects the best strategy based on sample queries.
  • LangChain-compatible splitters — drop-in replacements for CharacterTextSplitter, RecursiveCharacterTextSplitter, etc.

Architecture at a Glance

Text / Documents
       │
       ▼
  ChunkManager          ← Orchestrator (recommended entry point)
       │
       ├── QueryPatternAnalyzer   (optional: picks strategy from queries)
       │
       ├── Chunker selection:
       │     ├── SemanticChunker         (embedding-based splits)
       │     ├── AdaptiveChunker         (density-adaptive sizing)
       │     ├── StructureAwareChunker   (respects headers, code, tables)
       │     ├── ContextAwareChunker     (sliding-window context)
       │     └── Hybrid                  (structural + semantic)
       │
       ├── MetadataEnricher       (keywords, scores)
       ├── Deduplicator           (hash + similarity)
       └── Finalize               (position, total_chunks)
             │
             ▼
      List[DocumentChunk]

Quick Start

from fennec_community.chunks import ChunkManager, ChunkConfig

# 1. Basic usage — auto mode
text = "Your document text here..."  # the document to chunk; reused in step 2
manager = ChunkManager()
chunks = manager.process(text, source="my_doc.pdf")

# 2. Semantic chunking with custom config
config = ChunkConfig(
    use_semantic_chunking=True,
    chunk_size=512,
    overlap=128,
    extract_keywords=True,
    deduplication_enabled=True,
)
manager = ChunkManager(config=config)
chunks = manager.process(text, source="report.pdf")

# 3. Inspect results
for chunk in chunks:
    print(chunk.text[:80])
    print(f"  keywords: {chunk.metadata.keywords}")
    print(f"  score:    {chunk.metadata.score:.3f}")

Core Entry Point — ChunkManager

ChunkManager is the recommended entry point for all production use. It orchestrates the full chunking lifecycle: strategy selection → chunking → enrichment → deduplication → finalization.

Constructor

ChunkManager(
    config: Optional[ChunkConfig] = None,
    mode: ChunkMode = ChunkMode.AUTO,
    *,
    device: Optional[str] = None,
    language: str = "auto",
)
Parameter Type Default Description
config ChunkConfig | None None Configuration object. Uses defaults if omitted.
mode ChunkMode ChunkMode.AUTO Chunking strategy. See ChunkMode.
device str | None None Torch device ("cuda", "cpu"). Auto-detected if None.
language str "auto" Language hint ("arabic", "english", etc.) or "auto" for detection.

process

Goal: Split a single document into chunks ready for embedding. This is the main entry point.

def process(
    text: str,
    doc_id: Optional[str] = None,
    source: str = "",
    *,
    sample_queries: Optional[List[str]] = None,
    mode_override: Optional[ChunkMode] = None,
) -> List[DocumentChunk]

Parameters:

Parameter Type Required Description
text str Yes The raw document text to be chunked.
doc_id str | None No Unique document identifier. Auto-generated UUID if not provided.
source str No Document origin label (filename, URL, etc.) stored in chunk metadata.
sample_queries List[str] | None No Representative queries used to auto-select the best chunking strategy when config.query_aware=True.
mode_override ChunkMode | None No Temporarily override the manager's default mode for this call only.

Returns: List[DocumentChunk] — Ordered, enriched, deduplicated chunks ready for embedding.

Returns an empty list if text is empty or contains only whitespace.

Example:

chunks = manager.process(
    text="Large document content...",
    source="annual_report_2024.pdf",
    sample_queries=["What is the revenue?", "Who are the key executives?"],
)

process_documents

Goal: Split a list of documents into chunks ready for embedding, in a single batch.

def process_documents(
    documents: List[Document],
    *,
    sample_queries: Optional[List[str]] = None,
    mode_override: Optional[ChunkMode] = None,
) -> List[DocumentChunk]

Parameters:

Parameter Type Required Description
documents List[Document] Yes List of Document objects (each holds page_content + metadata).
sample_queries List[str] | None No Queries used for strategy selection (applied uniformly to all documents).
mode_override ChunkMode | None No Override mode for all documents in this batch.

Returns: List[DocumentChunk] — All chunks from all documents combined in order.

Example:

from fennec_community.chunks import Document

docs = [
    Document(page_content="Doc 1 text...", metadata={"source": "file1.pdf"}),
    Document(page_content="Doc 2 text...", metadata={"source": "file2.pdf"}),
]
all_chunks = manager.process_documents(docs)

process_with_query_optimization

Goal: A simplified shortcut to enable the query-aware pipeline without needing to configure config.query_aware.

def process_with_query_optimization(
    text: str,
    queries: List[str],
    doc_id: Optional[str] = None,
    source: str = "",
) -> List[DocumentChunk]

Parameters:

Parameter Type Required Description
text str Yes Document text to chunk.
queries List[str] Yes Sample queries that will be analyzed to select the optimal chunking strategy.
doc_id str | None No Optional document ID.
source str No Document source label.

Returns: List[DocumentChunk]

Example:

chunks = manager.process_with_query_optimization(
    text=document_text,
    queries=["How does the API authenticate?", "What are the rate limits?"],
    source="api_docs.md",
)

get_stats

Goal: Extract summarized statistics from a set of chunks (useful for monitoring and diagnostics).

def get_stats(chunks: List[DocumentChunk]) -> Dict[str, Any]

Parameters:

Parameter Type Required Description
chunks List[DocumentChunk] Yes The chunk list returned by any process* method.

Returns: Dict[str, Any] with the following keys:

Key Type Description
count int Total number of chunks.
total_chars int Sum of all characters across chunks.
avg_chars float Average characters per chunk.
min_chars int Smallest chunk size in characters.
max_chars int Largest chunk size in characters.
avg_words float Average word count per chunk.
avg_score float Mean importance score (0–1).
max_score float Highest importance score in the set.
chunk_types Dict[str, int] Count of each ChunkType value.
unique_sections int Number of distinct document sections represented.
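
The size and score keys above can be approximated in plain Python. The sketch below is not the library's implementation: it works on (text, score) pairs instead of DocumentChunk objects, the helper name summarize_chunks is hypothetical, and the chunk_types and unique_sections keys are left out.

```python
from statistics import mean
from typing import Any, Dict, List, Tuple


def summarize_chunks(chunks: List[Tuple[str, float]]) -> Dict[str, Any]:
    """Summarize (text, score) pairs; a stand-in for DocumentChunk objects."""
    if not chunks:
        return {"count": 0}
    lengths = [len(text) for text, _ in chunks]
    words = [len(text.split()) for text, _ in chunks]
    scores = [score for _, score in chunks]
    return {
        "count": len(chunks),
        "total_chars": sum(lengths),
        "avg_chars": mean(lengths),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "avg_words": mean(words),
        "avg_score": mean(scores),
        "max_score": max(scores),
    }


stats = summarize_chunks([("alpha beta", 0.5), ("gamma", 1.0)])
print(stats["count"], stats["total_chars"], stats["avg_chars"])
```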

Example:

stats = manager.get_stats(chunks)
print(stats)
# {'count': 42, 'avg_chars': 487.3, 'avg_score': 0.6812, ...}

Chunkers

Each chunker can be used standalone (without ChunkManager) when you need direct control. All chunkers share the same public interface via BaseChunker.chunk().

Common Interface (all chunkers)

All chunkers inherit from BaseChunker and expose these two public methods:

chunk(text, doc_id="", source="")

Goal: Split a single text into a complete list of chunks while properly setting the position and total_chunks.

def chunk(self, text: str, doc_id: str = "", source: str = "") -> List[DocumentChunk]
Parameter Type Description
text str Raw text to split.
doc_id str Parent document ID stored in each chunk's metadata.
source str Document origin label (filename, URL, etc.)

Returns: List[DocumentChunk]

chunk_documents(documents)

Goal: Batch process a list of Document objects into chunks while preserving each document’s doc_id and source.

def chunk_documents(self, documents: List[Document]) -> List[DocumentChunk]
Parameter Type Description
documents List[Document] Documents to split. Each Document.metadata may include a "source" key.

Returns: List[DocumentChunk] — all chunks from all documents, in order.


SemanticChunker

Goal: Split text based on semantic similarity between sentences using embedding models. A new chunk is created when similarity drops below a defined threshold, ensuring each chunk contains a semantically coherent idea.

SemanticChunker(
    config: Optional[ChunkConfig] = None,
    *,
    similarity_threshold: float = 0.75,
    min_chunk_size: int = 50,
    max_chunk_size: int = 2048,
    model_name: Optional[str] = None,
    device: Optional[str] = None,
    language: str = "auto",
)
Parameter Type Default Description
config ChunkConfig | None None If provided, overrides individual parameters.
similarity_threshold float 0.75 Cosine similarity below this value triggers a new chunk. Higher = more splits; lower = fewer splits.
min_chunk_size int 50 Minimum characters per chunk; short chunks are merged with the next.
max_chunk_size int 2048 Hard cap: forces a new chunk regardless of similarity.
model_name str | None None Sentence-transformer model name. Defaults to multilingual MiniLM.
device str | None None "cuda" or "cpu".
language str "auto" Language tag stored in chunk metadata.

Best for: Conceptually diverse documents, FAQs, articles where topics shift frequently.

Example:

from fennec_community.chunks import SemanticChunker

chunker = SemanticChunker(similarity_threshold=0.70)
chunks = chunker.chunk(text, source="article.txt")
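
To make the threshold rule concrete, here is a minimal sketch of the splitting step alone. It assumes the adjacent-sentence similarities are already computed, ignores min_chunk_size and max_chunk_size, and uses a hypothetical function name; it is not SemanticChunker's actual code.

```python
from typing import List


def split_on_similarity_drops(
    sentences: List[str],
    similarities: List[float],
    threshold: float = 0.75,
) -> List[str]:
    """Group sentences, opening a new chunk when similarity < threshold.

    similarities[i] is the similarity between sentences[i] and sentences[i + 1].
    """
    if not sentences:
        return []
    if len(similarities) != len(sentences) - 1:
        raise ValueError("expected one similarity per adjacent sentence pair")
    groups = [[sentences[0]]]
    for sentence, sim in zip(sentences[1:], similarities):
        if sim < threshold:
            groups.append([sentence])    # topic shift: open a new chunk
        else:
            groups[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
sims = [0.9, 0.2, 0.8]
print(split_on_similarity_drops(sentences, sims, threshold=0.75))
```

With the same similarities, a threshold of 0.85 yields three chunks and a threshold of 0.1 yields one.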

AdaptiveChunker

Goal: Automatically adjust chunk size based on the information density of the text. Dense technical content is split into smaller chunks for more precise retrieval, while narrative text uses larger chunks to preserve context.

AdaptiveChunker(
    config: Optional[ChunkConfig] = None,
    *,
    base_size: int = 512,
    min_size: int = 100,
    max_size: int = 1500,
    overlap: int = 128,
    technical_threshold: float = 0.5,
    language: str = "auto",
)
Parameter Type Default Description
config ChunkConfig | None None If provided, reads adaptive_base_size, adaptive_min_size, adaptive_max_size, adaptive_technical_threshold fields.
base_size int 512 Starting chunk size before density adjustment.
min_size int 100 Minimum allowed chunk size (used for dense/technical text).
max_size int 1500 Maximum allowed chunk size (used for simple/narrative text).
overlap int 128 Character overlap between consecutive chunks.
technical_threshold float 0.5 Information density score above which text is considered "technical".
language str "auto" Language tag.

Best for: Mixed-content documents that combine technical specifications with narrative explanations.
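
How a density score might map to a chunk size can be sketched as below. The linear interpolation and the function name are assumptions made for illustration; the document does not specify how AdaptiveChunker measures density or derives the size.

```python
def adaptive_chunk_size(
    density: float,
    base_size: int = 512,
    min_size: int = 100,
    max_size: int = 1500,
    technical_threshold: float = 0.5,
) -> int:
    """Map a density score in [0, 1] to a chunk size (hypothetical linear rule)."""
    density = min(max(density, 0.0), 1.0)
    if density >= technical_threshold:
        # Dense text: shrink from base_size toward min_size as density nears 1.
        span = 1.0 - technical_threshold
        t = (density - technical_threshold) / span if span else 1.0
        size = base_size - t * (base_size - min_size)
    else:
        # Sparse text: grow from base_size toward max_size as density nears 0.
        t = (technical_threshold - density) / technical_threshold
        size = base_size + t * (max_size - base_size)
    return int(round(size))


for score in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(score, adaptive_chunk_size(score))
```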


StructureAwareChunker

Goal: Split the document while respecting its natural structure (Markdown/HTML headings, paragraphs, code blocks, tables, and lists). It ensures that important structural elements are not split in the middle.

StructureAwareChunker(
    config: Optional[ChunkConfig] = None,
    *,
    max_chunk_size: int = 1024,
    min_chunk_size: int = 50,
    language: str = "auto",
    split_on_headers: bool = True,
    preserve_code_blocks: bool = True,
)
Parameter Type Default Description
config ChunkConfig | None None Reads max_chunk_size, split_on_headers, preserve_code_blocks.
max_chunk_size int 1024 Sections exceeding this are further split by sentence.
min_chunk_size int 50 Short sections are merged with adjacent ones.
language str "auto" Language tag.
split_on_headers bool True Create a new chunk at each Markdown/HTML heading.
preserve_code_blocks bool True Keep code fences as single chunks regardless of size.

Best for: Markdown documentation, HTML pages, structured technical manuals.


ContextAwareChunker

Goal: Preserve context using sliding windows. Each chunk includes core sentences plus surrounding sentences before and after them, ensuring no loss of context at chunk boundaries.

ContextAwareChunker(
    config: Optional[ChunkConfig] = None,
    *,
    chunk_size: int = 512,
    overlap: int = 128,
    window_size: int = 2,
    min_chunk_size: int = 50,
    language: str = "auto",
)
Parameter Type Default Description
config ChunkConfig | None None Reads chunk_size, overlap, context_window_size, min_context_sentences.
chunk_size int 512 Target character count for primary sentence groups.
overlap int 128 Characters from previous chunk carried into the next.
window_size int 2 Number of sentences added as context on each side of a primary group.
min_chunk_size int 50 Chunks below this size are discarded.
language str "auto" Language tag.

Best for: Conversational text, narrative documents, Q&A content where answer context spans sentence boundaries.
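
The sliding-window idea can be sketched as follows. For brevity the primary groups here hold a fixed number of sentences (the hypothetical group_size parameter) rather than a target character count, and the character overlap is omitted; this is an illustration, not the chunker's implementation.

```python
from typing import List


def windowed_chunks(
    sentences: List[str],
    group_size: int = 3,
    window_size: int = 2,
) -> List[str]:
    """Build chunks from primary sentence groups plus neighboring context."""
    if group_size < 1 or window_size < 0:
        raise ValueError("group_size must be >= 1 and window_size >= 0")
    chunks = []
    for start in range(0, len(sentences), group_size):
        end = start + group_size                     # primary group: [start, end)
        lo = max(0, start - window_size)             # context before the group
        hi = min(len(sentences), end + window_size)  # context after the group
        chunks.append(" ".join(sentences[lo:hi]))
    return chunks


demo_sentences = [f"s{i}" for i in range(7)]
print(windowed_chunks(demo_sentences, group_size=3, window_size=1))
```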


ArabicTextChunker

Goal: Specialized processing for Arabic text, including normalization (unifying forms of Alef, Ta Marbuta, Ya, and removing diacritics), spacing correction, and semantic splitting using Arabic BERT models. It supports both pure Arabic text and mixed Arabic-English content.

ArabicTextChunker(
    chunk_size: int = 512,
    overlap: int = 128,
    min_chunk_size: int = 50,
    model_name: str = "CAMeL-Lab/bert-base-arabic-camelbert-mix",
    use_semantic_chunking: bool = False,
    device: Optional[str] = None,
    preserve_formatting: bool = True,
    fix_spacing: bool = True,
    strict_size_limit: bool = True,
    size_tolerance: float = 0.05,
    use_smart_overlap: bool = True,
    smart_overlap_threshold: float = 0.7,
)
Parameter Type Default Description
chunk_size int 512 Maximum characters per chunk.
overlap int 128 Character overlap between chunks.
min_chunk_size int 50 Minimum characters; smaller chunks are skipped.
model_name str "CAMeL-Lab/bert-base-arabic-camelbert-mix" Hugging Face model used for Arabic BERT embeddings.
use_semantic_chunking bool False Enable BERT-based semantic splitting (requires torch + transformers).
device str | None None "cuda" or "cpu".
preserve_formatting bool True Keep original whitespace and formatting structure.
fix_spacing bool True Auto-fix spacing issues common in Arabic text (e.g., missing spaces after punctuation).
strict_size_limit bool True Hard-enforce chunk_size; oversized chunks are force-split at word boundaries.
size_tolerance float 0.05 Allowed overshoot fraction (5% by default). Only applies when strict_size_limit=True.
use_smart_overlap bool True Use semantic similarity to choose the most relevant overlap sentences.
smart_overlap_threshold float 0.7 Minimum cosine similarity for a sentence to be included in smart overlap.
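
The normalization step mentioned in the goal can be illustrated with a few regular expressions. The exact character tables and the direction of each mapping (for example Ta Marbuta to Ha) are assumptions; ArabicTextChunker may normalize differently.

```python
import re

# Assumed character classes; the library's actual tables may differ.
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")  # tashkeel marks plus tatweel
_ALEF_FORMS = re.compile("[\u0622\u0623\u0625]")   # Alef with madda / hamza above / hamza below


def normalize_arabic(text: str) -> str:
    """Strip diacritics and unify common letter variants."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_FORMS.sub("\u0627", text)   # any Alef form -> bare Alef
    text = text.replace("\u0629", "\u0647")  # Ta Marbuta -> Ha
    text = text.replace("\u0649", "\u064A")  # Alef Maqsura -> Ya
    return text


print(normalize_arabic("\u0623\u064E\u062D\u0645\u062F"))
```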

Class methods:

ArabicTextChunker.create_safely(**kwargs)

Goal: Safely create an instance while automatically ignoring any unknown parameters (useful when passing a dynamic configuration).

@classmethod
def create_safely(cls, **kwargs) -> ArabicTextChunker

Returns: ArabicTextChunker

ArabicTextChunker.get_available_parameters()

Goal: Retrieve a list of all accepted parameters in the __init__ method.

@classmethod
def get_available_parameters(cls) -> List[str]

Returns: List[str]
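
One common way to implement this pair of class methods is to inspect the constructor signature. The mixin and demo class below are hypothetical and only show the pattern; they are not taken from ArabicTextChunker's source.

```python
import inspect
from typing import Any, List


class SafeInitMixin:
    """Hypothetical mixin showing signature-based keyword filtering."""

    @classmethod
    def get_available_parameters(cls) -> List[str]:
        params = inspect.signature(cls.__init__).parameters
        return [name for name in params if name != "self"]

    @classmethod
    def create_safely(cls, **kwargs: Any):
        allowed = set(cls.get_available_parameters())
        # Silently drop any keyword the constructor does not accept.
        return cls(**{k: v for k, v in kwargs.items() if k in allowed})


class DemoChunker(SafeInitMixin):
    def __init__(self, chunk_size: int = 512, overlap: int = 128):
        self.chunk_size = chunk_size
        self.overlap = overlap


demo = DemoChunker.create_safely(chunk_size=256, unknown_option=True)
print(demo.chunk_size, demo.overlap)
```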

Public utility methods:

get_arabic_ratio(text)

Goal: Calculate the ratio of Arabic characters to the total number of characters in the text.

def get_arabic_ratio(self, text: str) -> float

Returns: float in range [0.0, 1.0].

is_arabic_dominant(text, threshold=0.3)

Goal: Determine whether the text is primarily Arabic (used for automatic language detection).

def is_arabic_dominant(self, text: str, threshold: float = 0.3) -> bool

Returns: True if Arabic character ratio ≥ threshold.
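
A minimal sketch of both utilities, assuming that "Arabic characters" means code points in the basic Arabic block (U+0600 to U+06FF) and that whitespace and punctuation count toward the total. The library may use a different definition.

```python
def arabic_ratio(text: str) -> float:
    """Share of characters that fall in the basic Arabic Unicode block."""
    if not text:
        return 0.0
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    return arabic / len(text)


def is_arabic_dominant(text: str, threshold: float = 0.3) -> bool:
    """True when the Arabic ratio reaches the threshold."""
    return arabic_ratio(text) >= threshold


print(arabic_ratio("ab\u0645\u0631"), is_arabic_dominant("hello"))
```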


MultilanguageTextChunker

Goal: Process multilingual text (17+ languages) with automatic language detection and selection of the appropriate BERT model for each language. It supports both rule-based and semantic splitting.

MultilanguageTextChunker(
    chunk_size: int = 512,
    overlap: int = 128,
    min_chunk_size: int = 50,
    model_name: Optional[str] = None,
    language: str = "auto",
    use_semantic_chunking: bool = False,
    device: Optional[str] = None,
    use_smart_overlap: bool = True,
    smart_overlap_threshold: float = 0.7,
    strict_size_limit: bool = True,
    size_tolerance: float = 0.05,
)
Parameter Type Default Description
chunk_size int 512 Target chunk size in characters.
overlap int 128 Overlap in characters between consecutive chunks.
min_chunk_size int 50 Minimum chunk size; smaller chunks are discarded.
model_name str | None None Override the auto-selected language model.
language str "auto" Language code ("arabic", "chinese", "english", etc.) or "auto" for detection.
use_semantic_chunking bool False Enable BERT-based semantic splitting.
device str | None None "cuda" or "cpu".
use_smart_overlap bool True Use semantic-similarity-based overlap selection.
smart_overlap_threshold float 0.7 Similarity threshold for smart overlap.
strict_size_limit bool True Enforce hard size limits.
size_tolerance float 0.05 Allowed size overshoot fraction.

Supported languages: arabic, chinese, japanese, korean, russian, hindi, english, french, german, spanish, portuguese, italian, dutch, polish, turkish, vietnamese, multilingual.

Public utility methods:

get_supported_languages()

Goal: Retrieve the names of all supported languages. Each name maps internally to a corresponding BERT model.

def get_supported_languages(self) -> List[str]

Returns: List[str] — e.g. ["multilingual", "english", "arabic", "chinese", ...]

Example:

chunker = MultilanguageTextChunker()
print(chunker.get_supported_languages())
# ['multilingual', 'english', 'arabic', 'chinese', 'french', ...]

switch_language(language)

Goal: Switch to a different language and its corresponding BERT model at runtime without re-instantiating the object. Useful for multilingual pipelines operating on the same chunker instance.

def switch_language(self, language: str) -> None
Parameter Type Description
language str Target language code (e.g. "arabic", "french"). Falls back to "multilingual" if unsupported.

Returns: None. The new model is loaded in-place.

Example:

chunker = MultilanguageTextChunker(language="english")
chunks_en = chunker.chunk(english_text)

chunker.switch_language("arabic")
chunks_ar = chunker.chunk(arabic_text)

Text Splitters

Splitters that return List[str] instead of List[DocumentChunk]. They are useful as lightweight building blocks or as drop-in replacements.

Common Interface (all text splitters)

All splitters inherit from TextSplitter and expose these shared methods in addition to split_text:

split_documents(documents)

Goal: Split a list of Document objects and return a new list of chunked Document objects while preserving the original document metadata in each chunk.

def split_documents(self, documents: List[Document]) -> List[Document]
Parameter Type Description
documents List[Document] Input documents to split.

Returns: List[Document] — each item is a chunk as a Document with page_content = chunk_text and the original document's metadata copied.

create_documents(texts, metadatas=None)

Goal: Convert a list of raw texts (with optional metadata) into a list of chunked Document objects. Useful as a factory method when manually loading text data.

def create_documents(
    self,
    texts: List[str],
    metadatas: Optional[List[Dict[str, Any]]] = None,
) -> List[Document]
Parameter Type Description
texts List[str] Raw texts to split into documents.
metadatas List[dict] | None Optional metadata dicts, one per text. Defaults to empty dicts if omitted.

Returns: List[Document] — each split chunk becomes a Document carrying the corresponding metadata.

Example:

from fennec_community.chunks import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)

docs = splitter.create_documents(
    texts=["Long document one...", "Long document two..."],
    metadatas=[{"source": "file1.txt"}, {"source": "file2.txt"}],
)

CharacterTextSplitter

Goal: Split text using a fixed delimiter with optional overlap support. This is the simplest and fastest of the text splitters.

CharacterTextSplitter(
    separator: str = "\n\n",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)
Parameter Type Default Description
separator str "\n\n" String to split on (double newline = paragraph boundary).
chunk_size int 1000 Maximum characters per chunk.
chunk_overlap int 200 Characters to repeat at the start of the next chunk.
length_function Callable len Function to measure chunk size (can be replaced with token counter).

split_text(text)

def split_text(self, text: str) -> List[str]

Returns: List[str] — list of text chunks.


RecursiveCharacterTextSplitter

Goal: Hierarchical splitting that tries a sequence of separators from the coarsest (paragraph breaks) to the finest (individual characters). It preserves natural text boundaries as much as possible and includes special handling for Arabic text.

RecursiveCharacterTextSplitter(
    separators: Optional[List[str]] = None,
    arabic_mode: bool = False,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)
Parameter Type Default Description
separators List[str] | None None Custom separator hierarchy. Defaults to ["\n\n", "\n", ".", "؟", "!", "،", "؛", " ", ""].
arabic_mode bool False Use Arabic-optimized separator list when True.
chunk_size int 1000 Maximum characters per chunk.
chunk_overlap int 200 Overlap in characters.
length_function Callable len Size measurement function.

split_text(text)

def split_text(self, text: str) -> List[str]

Returns: List[str]
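
The separator hierarchy can be illustrated with a short recursive function. This is a simplified sketch under stated assumptions: separators are dropped from the output, small pieces are not merged back together, and overlap is ignored, so it will not reproduce the splitter's exact output.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator; recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep == "" or sep not in text:
        return recursive_split(text, rest, chunk_size)
    chunks: List[str] = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


print(recursive_split("aaaa bbbb\n\ncccc", ["\n\n", " ", ""], chunk_size=5))
```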


TokenTextSplitter

Goal: Split text based on token count (instead of character count) using tiktoken. This is essential when working with OpenAI models to ensure the context window limit is not exceeded.

Requires: pip install tiktoken

TokenTextSplitter(
    encoding_name: str = "cl100k_base",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)
Parameter Type Default Description
encoding_name str "cl100k_base" Tiktoken encoding (matches GPT-4 / text-embedding-ada-002). The default encoding can be overridden globally via ChunkConfig.model_token.
chunk_size int 1000 Maximum tokens per chunk.
chunk_overlap int 200 Overlap in tokens.

split_text(text)

def split_text(self, text: str) -> List[str]

Returns: List[str] — chunks guaranteed to be within chunk_size tokens.
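
The windowing arithmetic behind chunk_size and chunk_overlap can be shown without tiktoken. In this sketch whitespace-separated words stand in for real tokens, and the function name is hypothetical; a real tokenizer would produce different boundaries.

```python
from typing import List, Sequence


def token_windows(
    tokens: Sequence[str],
    chunk_size: int,
    chunk_overlap: int,
) -> List[List[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


tokens = "the quick brown fox jumps over the lazy dog".split()
for window in token_windows(tokens, chunk_size=4, chunk_overlap=1):
    print(" ".join(window))
```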


SentenceTextSplitter

Goal: Split text while ensuring sentences are never cut in the middle. This is suitable for cases where preserving complete sentences is essential for understanding, with full support for Arabic.

SentenceTextSplitter(
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)
Parameter Type Default Description
chunk_size int 1000 Maximum characters per chunk.
chunk_overlap int 200 Character overlap between chunks.
length_function Callable len Size measurement function.

split_text(text)

def split_text(self, text: str) -> List[str]

Returns: List[str] — chunks always ending at sentence boundaries.
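
A minimal sketch of sentence-preserving packing, assuming sentences end with ".", "!", "?", or the Arabic question mark and are followed by whitespace. Overlap is left out, and a single sentence longer than chunk_size is kept whole; the real splitter's boundary rules may differ.

```python
import re
from typing import List

# Assumed terminators: Latin . ! ? plus the Arabic question mark (U+061F).
_SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")


def pack_sentences(text: str, chunk_size: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(pack_sentences("One two. Three four. Five six.", chunk_size=20))
```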


Supporting Components


EmbeddingProvider

Goal: Provide text embeddings with smart caching (LRU) and batch processing to speed up operations. It uses sentence-transformers and automatically falls back to TF-IDF when unavailable.

Note: Implemented as a singleton per (model_name, device) pair — calling the constructor with the same arguments always returns the same cached instance.
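
The LRU caching behavior can be sketched with an OrderedDict wrapped around any embedding function. The class name and the toy embedding function below are hypothetical; EmbeddingProvider's internals are not shown in this document.

```python
from collections import OrderedDict
from typing import Callable, List


class CachedEmbedder:
    """Hypothetical LRU cache around an arbitrary embedding function."""

    def __init__(self, embed_fn: Callable[[str], List[float]], cache_size: int = 10_000):
        self._embed_fn = embed_fn
        self._cache: "OrderedDict[str, List[float]]" = OrderedDict()
        self._cache_size = cache_size

    def embed(self, text: str) -> List[float]:
        if text in self._cache:
            self._cache.move_to_end(text)    # mark as most recently used
            return self._cache[text]
        vector = self._embed_fn(text)
        self._cache[text] = vector
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return vector


calls: List[str] = []


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real model; records each call so cache hits are visible."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


embedder = CachedEmbedder(toy_embed, cache_size=2)
embedder.embed("a")
embedder.embed("a")  # served from the cache
print(calls)
```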


Module-level function: cosine_similarity_matrix

Goal: Compute a cosine similarity matrix between two embedding sets efficiently in a single batch using matrix operations. It is used internally in the Deduplicator for detecting near-duplicate content.

from fennec_community.chunks import cosine_similarity_matrix

cosine_similarity_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray
Parameter Type Description
A np.ndarray Matrix of shape (n, dim) — first set of embedding vectors.
B np.ndarray Matrix of shape (m, dim) — second set of embedding vectors.

Returns: np.ndarray of shape (n, m) — every cell [i, j] holds the cosine similarity between A[i] and B[j].
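
For reference, the quantity in each cell is the dot product of the two vectors divided by the product of their norms. The pure-Python sketch below computes the same matrix on nested lists; it is an illustration only (the library function operates on NumPy arrays), and returning 0.0 for zero-length vectors is an assumption.

```python
import math
from typing import List, Sequence


def cosine_matrix(A: Sequence[Sequence[float]], B: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return an n x m matrix whose cell [i][j] is cos(A[i], B[j])."""

    def unit(v: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm else [0.0] * len(v)

    a_unit = [unit(v) for v in A]
    b_unit = [unit(v) for v in B]
    # After normalization, cosine similarity reduces to a dot product.
    return [[sum(x * y for x, y in zip(a, b)) for b in b_unit] for a in a_unit]


matrix = cosine_matrix([[1, 0], [1, 1]], [[1, 0], [0, 1], [-1, 0]])
print(len(matrix), len(matrix[0]))
```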

Example:

import numpy as np
from fennec_community.chunks import cosine_similarity_matrix, EmbeddingProvider

provider = EmbeddingProvider("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
texts_a = ["Hello world", "Machine learning is great"]
texts_b = ["Hi there", "AI is fascinating", "Cooking recipes"]

A = np.array(provider.embed_batch(texts_a))
B = np.array(provider.embed_batch(texts_b))

sim_matrix = cosine_similarity_matrix(A, B)
# sim_matrix.shape == (2, 3)
print(sim_matrix)

EmbeddingProvider constructor:

EmbeddingProvider(
    model_name: str,
    device: Optional[str] = None,
    cache_size: int = 10_000,
)
Parameter Type Default Description
model_name str HuggingFace sentence-transformers model name.
device str | None None "cuda" or "cpu".
cache_size int 10,000 Maximum number of text embeddings to cache in memory (LRU eviction). Corresponds to ChunkConfig.embedding_cache_size.

embed(text)

Goal: Compute the embedding for a single text with caching support.

def embed(self, text: str) -> np.ndarray

Returns: np.ndarray — 1-D embedding vector.

embed_batch(texts, batch_size=64)

Goal: Efficiently compute embeddings for a list of texts, skipping any entries already stored in the cache.

def embed_batch(self, texts: List[str], batch_size: int = 64) -> List[np.ndarray]
Parameter Type Description
texts List[str] Texts to embed.
batch_size int Number of texts processed per inference call. Corresponds to ChunkConfig.embedding_batch_size.

Returns: List[np.ndarray] — Same order as input.

pairwise_similarity(a, b)

Goal: Compute the semantic similarity (cosine similarity) between two texts.

def pairwise_similarity(self, a: str, b: str) -> float

Returns: float in [-1.0, 1.0]. Values close to 1.0 indicate high similarity.

adjacent_similarities(sentences)

Goal: Compute cosine similarity between each sentence and the next one in a list (used internally in SemanticChunker).

def adjacent_similarities(self, sentences: List[str]) -> List[float]

Returns: List[float] of length len(sentences) - 1.

Property:

Property Type Description
dim int Embedding vector dimension.

MetadataEnricher

Goal: Enrich each chunk with keywords (TF-IDF), keyword density, and a composite importance score that considers position, length, and lexical diversity.

MetadataEnricher(
    max_keywords: int = 10,
    header_score_boost: float = 1.5,
)
Parameter Type Default Description
max_keywords int 10 Maximum number of keywords extracted per chunk.
header_score_boost float 1.5 Score multiplier applied to header/section-title chunks.

enrich(chunks)

Goal: Enrich a full list of chunks using corpus-level TF-IDF, producing more accurate results than per-chunk enrichment.

def enrich(self, chunks: List[DocumentChunk]) -> List[DocumentChunk]

Returns: The same List[DocumentChunk] with updated metadata.keywords, metadata.keyword_density, and metadata.score.
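
Corpus-level keyword extraction can be sketched as below. The tokenizer, the smoothed IDF formula, and the function name are assumptions chosen for illustration; the document does not state which TF-IDF variant MetadataEnricher uses, and stop-word filtering is omitted.

```python
import math
import re
from collections import Counter
from typing import List


def tfidf_keywords(chunks: List[str], max_keywords: int = 10) -> List[List[str]]:
    """Rank each chunk's terms by TF-IDF computed over all chunks."""
    docs = [re.findall(r"\w+", chunk.lower()) for chunk in chunks]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:max_keywords])
    return results


print(tfidf_keywords(["apple banana apple", "banana cherry"]))
```

Terms that appear in every chunk (here "banana") receive the lowest IDF weight, which is why corpus-level enrichment can rank keywords differently from per-chunk term frequency.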

enrich_single(chunk)

Goal: Independently enrich a single chunk (without corpus context) using TF only (no IDF). Suitable for real-time pipelines.

def enrich_single(self, chunk: DocumentChunk) -> DocumentChunk

Returns: The same DocumentChunk with enriched metadata.


Deduplicator

Goal: Remove duplicate chunks using a two-stage process: exact hash matching for literal duplicates (fast, O(n)), and cosine similarity for near-duplicate detection (slower, O(n²)).

Deduplicator(
    use_hash: bool = True,
    use_similarity: bool = False,
    similarity_threshold: float = 0.95,
    embedding_model: Optional[str] = None,
    device: Optional[str] = None,
)
Parameter Type Default Description
use_hash bool True Enable fast SHA-256 exact-match deduplication.
use_similarity bool False Enable cosine similarity near-duplicate detection. Requires embedding_model.
similarity_threshold float 0.95 Chunks with cosine similarity ≥ this value are considered duplicates. Corresponds to ChunkConfig.dedup_similarity_threshold.
embedding_model str | None None Model name for similarity-based dedup. Required if use_similarity=True.
device str | None None "cuda" or "cpu".

⚠️ Important — ChunkManager behaviour: When using ChunkManager, similarity-based deduplication (use_similarity) is always disabled internally regardless of ChunkConfig.dedup_similarity_threshold. To use similarity-based deduplication, instantiate Deduplicator directly and call deduplicate() on the chunks after chunking.

deduplicate(chunks)

Goal: Apply a two-stage deduplication process to a list of chunks and return a cleaned list.

def deduplicate(self, chunks: List[DocumentChunk]) -> List[DocumentChunk]

Returns: List[DocumentChunk] — deduplicated, preserving original order. The longer chunk is kept when two near-duplicates are found.
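
The exact-match stage can be illustrated with hashlib. This sketch works on plain strings, and the normalization (lowercasing and collapsing whitespace) is an assumption; the similarity stage is not shown.

```python
import hashlib
from typing import List


def dedupe_exact(texts: List[str]) -> List[str]:
    """Drop texts whose normalized SHA-256 digest was already seen, keeping order."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.lower().split())  # assumed normalization
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


print(dedupe_exact(["Hello  world", "hello world", "Other"]))
```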


DocumentStructureParser

Goal: Analyze the document structure (Markdown, HTML, or plain text) and extract a list of StructuredSection objects used by StructureAwareChunker to perform structure-aware splitting.

DocumentStructureParser()

detect_document_type(text)

Goal: Detect the document type.

def detect_document_type(self, text: str) -> DocumentType

Returns: DocumentType — one of MARKDOWN, HTML, ARABIC, PLAIN_TEXT.

Detection logic:

  • HTML tags → DocumentType.HTML
  • Markdown headers / code fences / bold → DocumentType.MARKDOWN
  • Arabic character ratio > 20% → DocumentType.ARABIC
  • Otherwise → DocumentType.PLAIN_TEXT
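
The detection rules above can be sketched as a chain of checks. The regular expressions, the enum name DocType, and the assumption that the rules are evaluated in the listed order are illustrative; the parser's real patterns are not documented here.

```python
import re
from enum import Enum


class DocType(Enum):
    """Stand-in for the library's DocumentType enum."""
    MARKDOWN = "markdown"
    HTML = "html"
    ARABIC = "arabic"
    PLAIN_TEXT = "plain_text"


_HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
# Headings, code fences at line start, or bold spans.
_MARKDOWN = re.compile(r"^#{1,6}\s|^`{3}|\*\*[^*\n]+\*\*", re.MULTILINE)


def detect_type(text: str) -> DocType:
    if _HTML_TAG.search(text):
        return DocType.HTML
    if _MARKDOWN.search(text):
        return DocType.MARKDOWN
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    if text and arabic / len(text) > 0.20:
        return DocType.ARABIC
    return DocType.PLAIN_TEXT


print(detect_type("# Title\nSome text"))
```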

parse(text, doc_type)

Goal: Analyze the document and return a list of structured sections.

def parse(self, text: str, doc_type: DocumentType) -> List[StructuredSection]
Parameter Type Description
text str Document text.
doc_type DocumentType Document type (use detect_document_type first if unknown).

Returns: List[StructuredSection] — see StructuredSection for field details.


QueryPatternAnalyzer

Goal: Analyze user query patterns and/or document content to recommend the optimal chunking strategy (chunk size, overlap, semantic vs. adaptive splitting).

QueryPatternAnalyzer(default_strategy: str = "general")
Parameter Type Default Description
default_strategy str "general" Fallback strategy name when no pattern is detected. One of: "faq", "documentation", "technical", "narrative", "general".

analyze_queries(queries)

Goal: Recommend a strategy based on a set of representative queries using majority voting.

def analyze_queries(self, queries: List[str]) -> ChunkingStrategy

Returns: ChunkingStrategy dataclass with name, chunk_size, overlap, use_semantic, use_adaptive, use_structure.
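
Majority voting over queries can be sketched as follows. The keyword cues and function names are invented for illustration, and the sketch returns only a strategy name rather than a ChunkingStrategy; the analyzer's real patterns are not documented here.

```python
from collections import Counter
from typing import List

# Hypothetical keyword cues per strategy.
_CUES = {
    "faq": ("what is", "who ", "when ", "how many"),
    "technical": ("api", "error", "function", "parameter"),
    "narrative": ("story", "character", "plot"),
}


def classify_query(query: str) -> str:
    """Label one query with the first strategy whose cue it contains."""
    lowered = query.lower()
    for name, cues in _CUES.items():
        if any(cue in lowered for cue in cues):
            return name
    return "general"


def vote_strategy(queries: List[str], default: str = "general") -> str:
    """Pick the label that the most queries voted for."""
    votes = Counter(classify_query(q) for q in queries)
    return votes.most_common(1)[0][0] if votes else default


print(vote_strategy([
    "What is the revenue?",
    "Who are the key executives?",
    "Which API parameter sets the limit?",
]))
```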

analyze_document(text)

Goal: Recommend a strategy based on the document’s own content.

def analyze_document(self, text: str) -> ChunkingStrategy

Returns: ChunkingStrategy

analyze(text, sample_queries=None)

Goal: Perform a combined analysis of both the document and the queries. Queries take priority if they indicate a specific (non-generic) pattern.

def analyze(
    self,
    text: str,
    sample_queries: Optional[List[str]] = None,
) -> ChunkingStrategy

Returns: ChunkingStrategy

get_strategy(name)

Goal: Return a specific strategy by name.

def get_strategy(self, name: str) -> ChunkingStrategy

Returns: ChunkingStrategy — returns the default strategy if name is unknown.

Available strategies:

Name chunk_size overlap Use Case
faq 256 32 Q&A documents, frequently asked questions
documentation 768 128 Technical docs, structured manuals
technical 400 80 Dense code, APIs, specifications
narrative 1024 200 Stories, novels, long-form prose
general 512 128 Balanced default for mixed content
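
The table and the fallback rule can be modeled as a dictionary lookup. The Strategy dataclass below is a reduced, hypothetical stand-in for ChunkingStrategy that carries only the three columns shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Reduced stand-in for ChunkingStrategy (name, chunk_size, overlap only)."""
    name: str
    chunk_size: int
    overlap: int


STRATEGIES = {
    s.name: s
    for s in (
        Strategy("faq", 256, 32),
        Strategy("documentation", 768, 128),
        Strategy("technical", 400, 80),
        Strategy("narrative", 1024, 200),
        Strategy("general", 512, 128),
    )
}


def get_strategy(name: str, default: str = "general") -> Strategy:
    """Return the named strategy, or the default one for unknown names."""
    return STRATEGIES.get(name, STRATEGIES[default])


print(get_strategy("faq"), get_strategy("unknown").name)
```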

Data Models

DocumentChunk

The core output unit. Every chunker and every ChunkManager.process* method returns DocumentChunk instances.

@dataclass
class DocumentChunk:
    text: str                          # The chunk's text content
    chunk_id: str                      # UUID (auto-generated)
    doc_id: str                        # Parent document ID
    embedding: Optional[np.ndarray]    # Vector embedding (if computed)
    metadata: ChunkMetadata            # Rich metadata

Read-only properties:

Property Type Description
id str Alias for chunk_id.
content_hash str SHA-256 hash of normalized text (for deduplication).
char_count int Character count of text.
word_count int Word count of text.

Methods:

Method Returns Description
to_dict() Dict[str, Any] JSON-serializable representation of the chunk and all its metadata.
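
To make the read-only properties concrete, here is a small stand-in class. It is a sketch under assumptions: ChunkSketch is a hypothetical name, and the normalization step (lowercasing and collapsing whitespace) is a guess at what "normalized text" means, since the exact rules are not documented here.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ChunkSketch:
    text: str

    @property
    def content_hash(self) -> str:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        normalized = " ".join(self.text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    @property
    def char_count(self) -> int:
        return len(self.text)

    @property
    def word_count(self) -> int:
        # Whitespace-delimited word count.
        return len(self.text.split())
```

Under this normalization, two chunks that differ only in case or spacing share one hash, which is what lets exact deduplication treat them as the same chunk.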

ChunkMetadata

@dataclass
class ChunkMetadata:
    source: str            # Document origin (filename, URL)
    page: Optional[int]    # Page number (if applicable)
    section: Optional[str] # Section/heading title
    position: int          # Index of this chunk in the document (0-based)
    total_chunks: int      # Total chunks in the parent document
    chunk_type: ChunkType  # Type classification
    document_type: DocumentType
    language: str
    keywords: List[str]    # Top TF-IDF keywords
    score: float           # Importance score [0.0 – 1.0]
    keyword_density: float # Ratio of meaningful words
    is_header: bool        # True if this chunk is a document heading
    heading_level: int     # 0 = not a header; 1–6 = H1–H6
    char_start: int        # Character offset in original document
    char_end: int          # Character offset end in original document
    extra: Dict[str, Any]  # Chunker-specific extra data

StructuredSection

The output unit of DocumentStructureParser.parse(). Represents a single structural unit within a document before it is converted to a DocumentChunk.

@dataclass
class StructuredSection:
    text: str                  # The section's raw text content
    heading: str               # Title of the nearest parent heading (empty if none)
    heading_level: int         # 0 = body text; 1–6 = H1–H6
    page: Optional[int]        # Page number if available (e.g. from PDF parsers)
    section_index: int         # Sequential index of this section within the document
    is_code_block: bool        # True if the section is a fenced code block
    is_list: bool              # True if the section is a list (ordered or unordered)
    is_table: bool             # True if the section is a table
    char_start: int            # Character offset of the section start in the original text
    char_end: int              # Character offset of the section end in the original text
    extra: Dict[str, Any]      # Parser-specific additional data

Example:

from fennec_community.chunks import DocumentStructureParser

parser = DocumentStructureParser()
doc_type = parser.detect_document_type(markdown_text)
sections = parser.parse(markdown_text, doc_type)

for s in sections:
    print(f"[H{s.heading_level}] {s.heading!r} — code={s.is_code_block} list={s.is_list}")
    print(f"  chars [{s.char_start}:{s.char_end}]: {s.text[:60]}")

Document

Input container compatible with LangChain conventions.

@dataclass
class Document:
    page_content: str            # The raw text
    metadata: Dict[str, Any]     # Arbitrary key-value metadata
    doc_id: str                  # Auto-generated UUID

Enumerations

ChunkType

PARAGRAPH, SENTENCE, SECTION, CODE_BLOCK, LIST_ITEM, HEADER, TABLE, SEMANTIC, ADAPTIVE, STRUCTURAL, WINDOW

DocumentType

PLAIN_TEXT, MARKDOWN, HTML, PDF, ARABIC, MIXED


Base Classes

These abstract base classes are exported from the package (from fennec_community.chunks import BaseChunker, TextSplitter, ChunkingStrategy) and are intended for building custom chunkers that integrate with the library's pipeline.

BaseChunker

Abstract base class for all chunkers. Subclass this to implement a custom splitting strategy that is compatible with ChunkManager and chunk_documents.

from fennec_community.chunks import BaseChunker, DocumentChunk
from typing import List

class MyChunker(BaseChunker):
    def _chunk_impl(self, text: str, doc_id: str, source: str) -> List[DocumentChunk]:
        # implement your splitting logic here
        ...

You must implement: _chunk_impl(text, doc_id, source) -> List[DocumentChunk]

You inherit for free: chunk(), chunk_documents(), and _finalize() (sets position and total_chunks on each chunk).
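
The bookkeeping that _finalize() performs can be sketched as a single pass over the chunk list. MiniChunk and finalize are hypothetical stand-ins used only for illustration; the real method operates on DocumentChunk metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MiniChunk:
    text: str
    position: int = 0
    total_chunks: int = 0


def finalize(chunks: List[MiniChunk]) -> List[MiniChunk]:
    """Stamp each chunk with its 0-based index and the total chunk count."""
    total = len(chunks)
    for index, chunk in enumerate(chunks):
        chunk.position = index
        chunk.total_chunks = total
    return chunks
```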


TextSplitter

Abstract base class for text splitters. Subclass this to implement a custom splitter that works with split_documents() and create_documents().

from fennec_community.chunks import TextSplitter
from typing import List

class MyTextSplitter(TextSplitter):
    def split_text(self, text: str) -> List[str]:
        # implement your splitting logic here
        ...

You must implement: split_text(text) -> List[str]

You inherit for free: split_documents() and create_documents().
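
As a concrete illustration of the single method a custom splitter has to provide, the sketch below splits on blank lines. To stay self-contained it subclasses a local stand-in (SplitterSketch) instead of the library's TextSplitter; both class names here are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class SplitterSketch(ABC):
    """Local stand-in for the library's TextSplitter base class (assumption)."""

    @abstractmethod
    def split_text(self, text: str) -> List[str]:
        ...


class ParagraphSplitter(SplitterSketch):
    def split_text(self, text: str) -> List[str]:
        # Split on blank lines and drop empty fragments.
        return [part.strip() for part in text.split("\n\n") if part.strip()]
```

In real code the same split_text body would go into a subclass of TextSplitter, which then supplies split_documents() and create_documents().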


ChunkingStrategy

Legacy abstract base class kept for backward compatibility. Represents a named chunking strategy. In current usage, QueryPatternAnalyzer returns plain dataclass instances (not subclasses of this ABC) that carry the same fields.

Field Type Description
name str Strategy name (e.g. "faq", "technical").
chunk_size int Recommended chunk size in characters.
overlap int Recommended overlap in characters.
use_semantic bool Whether to use semantic splitting.
use_adaptive bool Whether to use adaptive sizing.
use_structure bool Whether to respect document structure.

Configuration Reference — ChunkConfig

ChunkConfig is a dataclass with sensible defaults. Pass it to any chunker or ChunkManager to customize behaviour. All fields are optional — omit any field to use its default.

from fennec_community.chunks import ChunkConfig

config = ChunkConfig(
    chunk_size=512,
    overlap=128,
    use_semantic_chunking=True,
    extract_keywords=True,
    deduplication_enabled=True,
)

Basic Sizing

Field Type Default Description
chunk_size int 512 Target chunk size in characters.
overlap int 128 Overlap in characters between consecutive chunks.
min_chunk_size int 50 Chunks smaller than this are discarded or merged.
max_chunk_size int 2048 Hard maximum; chunks are force-split above this.
strict_size_limit bool True Enforce hard size limits. When True, chunks exceeding max_chunk_size are force-split at word boundaries.
size_tolerance float 0.05 Fractional overshoot allowed above chunk_size before a hard split is triggered (5% by default). Only applies when strict_size_limit=True.
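
The arithmetic behind size_tolerance and the word-boundary force-split can be sketched as follows. Both helpers (allowed_size, force_split) are hypothetical, and the exact way the library combines chunk_size, size_tolerance, and max_chunk_size is an assumption.

```python
from typing import List


def allowed_size(chunk_size: int = 512, size_tolerance: float = 0.05) -> int:
    """Largest length tolerated before a hard split, e.g. 512 * 1.05 -> 537."""
    return int(chunk_size * (1 + size_tolerance))


def force_split(text: str, limit: int) -> List[str]:
    """Greedily pack whole words into pieces of at most `limit` characters.

    A single word longer than `limit` is kept intact in its own piece.
    """
    pieces: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit or not current:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces
```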

Semantic Chunking

Field Type Default Description
use_semantic_chunking bool False Enable embedding-based semantic splitting.
semantic_similarity_threshold float 0.75 Similarity drop threshold for a new chunk boundary.
semantic_model str MiniLM Sentence-transformer model for semantic chunking.
model_name str MiniLM Alias for semantic_model kept for backward compatibility with ArabicTextChunker and MultilanguageTextChunker.
embedding_batch_size int 64 Number of sentences processed per embedding inference call. Increase for GPU throughput, decrease to save memory.
embedding_cache_size int 10_000 LRU cache size (number of text entries) shared across all EmbeddingProvider instances.
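
One way to read semantic_similarity_threshold is that a new chunk starts wherever the cosine similarity between two consecutive sentence embeddings falls below the threshold. The sketch below illustrates that reading with precomputed toy vectors; cosine and find_boundaries are hypothetical helpers, and the library's own boundary logic may be more elaborate.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_boundaries(vectors: List[Sequence[float]], threshold: float = 0.75) -> List[int]:
    """Return indices i where sentence i starts a new chunk."""
    return [
        i
        for i in range(1, len(vectors))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]
```

A higher threshold produces more boundaries and therefore smaller, more topically focused chunks.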

Adaptive Chunking

Field Type Default Description
use_adaptive_chunking bool False Enable density-adaptive chunk sizing.
adaptive_technical_threshold float 0.6 Information-density score above which text is classified as "technical". Technical text gets smaller chunks.
adaptive_min_size int 100 Minimum chunk size when content is dense/technical.
adaptive_max_size int 1500 Maximum chunk size when content is simple/narrative.
adaptive_base_size int 512 Starting size before density adjustment is applied.
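
How a density score might translate into a target size is sketched below. The linear interpolation is an assumption chosen for illustration, not the library's documented formula, and adaptive_size is a hypothetical name; only the default values come from the table above.

```python
def adaptive_size(
    density: float,
    base: int = 512,
    min_size: int = 100,
    max_size: int = 1500,
    technical_threshold: float = 0.6,
) -> int:
    """Map a density score in [0, 1] to a chunk size (assumed linear mapping)."""
    density = min(max(density, 0.0), 1.0)
    if density >= technical_threshold:
        # Dense or technical text: shrink from the base size toward min_size.
        frac = (density - technical_threshold) / (1.0 - technical_threshold)
        return round(base - frac * (base - min_size))
    # Simple or narrative text: grow from the base size toward max_size.
    frac = (technical_threshold - density) / technical_threshold
    return round(base + frac * (max_size - base))
```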

Smart Overlap

Field Type Default Description
use_smart_overlap bool True Use semantic similarity to select the most relevant sentences to carry over as overlap, instead of a fixed character window. Enabled by default.
smart_overlap_threshold float 0.7 Minimum cosine similarity for a sentence to be included in the smart overlap region.

Structure-Aware Chunking

Field Type Default Description
respect_document_structure bool True Honor Markdown/HTML structural elements.
split_on_headers bool True Create a new chunk at each Markdown/HTML heading.
split_on_paragraphs bool True Treat paragraph breaks (\n\n) as natural chunk boundaries.
preserve_code_blocks bool True Never split code blocks.
preserve_tables bool True Never split tables.
preserve_lists bool True Never split list blocks mid-way.

Context-Aware Chunking

Field Type Default Description
use_context_window bool False Enable sliding window context padding.
context_window_size int 2 Number of sentences added as context on each side of a primary group.
min_context_sentences int 3 Minimum number of sentences a group must contain before context padding is applied.
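
Context padding can be pictured as widening a slice of the sentence list. In this sketch, with_context is a hypothetical helper; it assumes the primary group is sentences[start:end] and that groups shorter than min_context_sentences are returned without padding.

```python
from typing import List


def with_context(
    sentences: List[str],
    start: int,
    end: int,
    window: int = 2,
    min_group: int = 3,
) -> List[str]:
    """Return sentences[start:end] padded with up to `window` sentences per side."""
    if end - start < min_group:
        return sentences[start:end]  # group too small: no padding applied
    lo = max(0, start - window)
    hi = min(len(sentences), end + window)
    return sentences[lo:hi]
```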

Metadata & Scoring

Field Type Default Description
extract_keywords bool True Run TF-IDF keyword extraction after chunking.
keyword_max_count int 10 Maximum keywords per chunk.
compute_chunk_scores bool True Compute importance scores after chunking.
header_score_boost float 1.5 Score multiplier for header chunks.
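
For readers unfamiliar with TF-IDF, the sketch below ranks the terms of one chunk against its sibling chunks using only the standard library. It is a generic TF-IDF variant with smoothed IDF, not necessarily the weighting the library uses, and top_keywords is a hypothetical name.

```python
import math
import re
from collections import Counter
from typing import List


def top_keywords(chunks: List[str], index: int, max_count: int = 10) -> List[str]:
    """Return up to `max_count` terms of chunks[index], highest TF-IDF first."""
    tokenized = [re.findall(r"\w+", chunk.lower()) for chunk in chunks]
    n_docs = len(tokenized)
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    term_freq = Counter(tokenized[index])
    total = sum(term_freq.values()) or 1
    scores = {
        term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:max_count]
```

Terms that occur in every chunk (such as "the") score lowest, while terms unique to one chunk rise to the top.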

Deduplication

Field Type Default Description
deduplication_enabled bool True Remove duplicate chunks automatically.
dedup_use_hash bool True Use fast SHA-256 exact deduplication.
dedup_similarity_threshold float 0.95 Cosine similarity threshold above which two chunks are considered duplicates. Note: this threshold is passed to Deduplicator at construction time, but ChunkManager always sets use_similarity=False — to use similarity-based deduplication you must call Deduplicator directly.
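
The effect of dedup_similarity_threshold can be illustrated with a greedy pass over precomputed embeddings: a chunk is kept only if it is not too similar to any chunk already kept. This is a sketch of the general technique with hypothetical helper names, not the Deduplicator implementation.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(
    vectors: List[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Return indices of chunks to keep; the first occurrence always wins."""
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        # Keep the chunk only if no already-kept chunk exceeds the threshold.
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```

The pass is quadratic in the number of kept chunks, which is one reason an exact hash check is usually run first.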

Query-Aware Strategy

Field Type Default Description
query_aware bool False Enable query-pattern-based strategy selection.
default_query_pattern str "general" Default strategy name when no pattern is detected. One of "faq", "documentation", "technical", "narrative", "general".

Performance

Field Type Default Description
async_processing bool False Enable asynchronous parallel processing of chunks.
n_workers int 4 Number of worker threads used when async_processing=True.
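
The role of n_workers can be sketched with a standard thread pool. Whether the library uses concurrent.futures internally is an assumption based on the phrase "worker threads" above; chunk_in_parallel and the chunk_fn callback are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk_in_parallel(
    texts: List[str],
    chunk_fn: Callable[[str], List[str]],
    n_workers: int = 4,
) -> List[List[str]]:
    """Apply chunk_fn to every text using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in input order, regardless of completion order.
        return list(pool.map(chunk_fn, texts))
```

Threads help most when the per-document work releases the GIL (for example, model inference in native code or I/O); for pure-Python splitting the speed-up may be small.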

Formatting & Text Cleanup

Field Type Default Description
preserve_formatting bool True Keep original whitespace and formatting structure intact.
fix_spacing bool True Auto-fix spacing issues (e.g. missing spaces after Arabic punctuation).

Tiktoken

Field Type Default Description
model_token str "cl100k_base" Tiktoken encoding name used by TokenTextSplitter as its global default. Common values: "cl100k_base" (GPT-4, text-embedding-ada-002), "p50k_base" (Codex).

Language-Specific BERT Models

These fields set the HuggingFace model name used for each language when use_semantic_chunking=True inside MultilanguageTextChunker or ArabicTextChunker. Override any field to use a different model for that language.

Field Default Model
multilingual xlm-roberta-base
arabic CAMeL-Lab/bert-base-arabic-camelbert-mix
english bert-base-uncased
chinese bert-base-chinese
french camembert-base
german bert-base-german-cased
spanish dccuchile/bert-base-spanish-wwm-uncased
russian DeepPavlov/rubert-base-cased
japanese cl-tohoku/bert-base-japanese
korean klue/bert-base
portuguese neuralmind/bert-base-portuguese-cased
italian dbmdz/bert-base-italian-cased
dutch GroNLP/bert-base-dutch-cased
polish dkleczek/bert-base-polish-cased
turkish dbmdz/bert-base-turkish-cased
vietnamese vinai/phobert-base
hindi ai4bharat/indic-bert

Example:

config = ChunkConfig(
    use_semantic_chunking=True,
    arabic="CAMeL-Lab/bert-base-arabic-camelbert-ca",   # swap Arabic model
    french="camembert/camembert-large",                  # swap French model
)

Transformer Parameters

Low-level parameters forwarded to the HuggingFace tokenizer during semantic chunking. These rarely need to change.

Field Type Default Description
max_len int 512 Maximum token length passed to the tokenizer.
padding bool True Enable padding to max_len.
truncation bool True Truncate inputs exceeding max_len.
return_tensor str "pt" Tensor framework returned by the tokenizer ("pt" = PyTorch).

Merge Parameters

Field Type Default Description
min_sentences int 30 Minimum sentence count a document must have before sentence-level merging is considered. Documents with fewer sentences skip the merge pass.

ChunkMode Enum

from fennec_community.chunks import ChunkMode

ChunkMode.AUTO        # ChunkManager picks strategy based on config flags
ChunkMode.SEMANTIC    # Force SemanticChunker
ChunkMode.ADAPTIVE    # Force AdaptiveChunker
ChunkMode.STRUCTURAL  # Force StructureAwareChunker
ChunkMode.CONTEXTUAL  # Force ContextAwareChunker
ChunkMode.HYBRID      # StructureAwareChunker → SemanticChunker on large sections

End-to-End Usage Examples

Example 1: Query-Aware Chunking for a RAG Pipeline

from fennec_community.chunks import ChunkManager, ChunkConfig, ChunkMode

config = ChunkConfig(
    chunk_size=512,
    overlap=128,
    use_semantic_chunking=True,
    extract_keywords=True,
    deduplication_enabled=True,
    query_aware=True,
)

manager = ChunkManager(config=config, mode=ChunkMode.AUTO)

with open("report.md", "r", encoding="utf-8") as f:
    text = f.read()

chunks = manager.process_with_query_optimization(
    text=text,
    queries=["What are the financial results?", "Who is the CEO?"],
    source="report.md",
)

print(manager.get_stats(chunks))

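# embed_model and vector_db are placeholders for your own embedding model
# and vector store client; they are not provided by this package.
for chunk in chunks:
    vector = embed_model.encode(chunk.text)
    vector_db.insert(chunk.chunk_id, vector, chunk.to_dict())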

Example 2: Arabic Document

from fennec_community.chunks import ArabicTextChunker

chunker = ArabicTextChunker(
    chunk_size=400,
    overlap=100,
    use_semantic_chunking=True,
    fix_spacing=True,
)

chunks = chunker.chunk(arabic_text, source="arabic_doc.pdf")

for ch in chunks:
    print(f"[{ch.metadata.position}] {ch.text[:60]}...")
    print(f"  keywords: {ch.metadata.keywords}")

Example 3: Multilingual Batch Processing

from fennec_community.chunks import ChunkManager, Document

manager = ChunkManager(language="auto")

documents = [
    Document(page_content=english_text, metadata={"source": "en.txt", "lang": "en"}),
    Document(page_content=arabic_text,  metadata={"source": "ar.txt", "lang": "ar"}),
    Document(page_content=french_text,  metadata={"source": "fr.txt", "lang": "fr"}),
]

all_chunks = manager.process_documents(documents)
print(f"Total chunks: {len(all_chunks)}")

Example 4: Basic Splitters

from fennec_community.chunks import RecursiveCharacterTextSplitter, TokenTextSplitter

# Token-aware splitting for OpenAI models
splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=50,
)
text_chunks = splitter.split_text(long_text)

# Arabic recursive splitting
arabic_splitter = RecursiveCharacterTextSplitter(
    arabic_mode=True,
    chunk_size=400,
    chunk_overlap=80,
)
arabic_chunks = arabic_splitter.split_text(arabic_text)

Example 5: Deduplication + Enrichment Standalone

from fennec_community.chunks import MetadataEnricher, Deduplicator

enricher = MetadataEnricher(max_keywords=8, header_score_boost=2.0)
deduplicator = Deduplicator(use_hash=True, use_similarity=False)

# After any chunking operation:
chunks = enricher.enrich(chunks)
chunks = deduplicator.deduplicate(chunks)

# Sort by importance score
chunks.sort(key=lambda c: c.metadata.score, reverse=True)
top_chunks = chunks[:10]

Example 6: Similarity-Based Deduplication

# ChunkManager always disables similarity dedup internally.
# Use Deduplicator directly when you need near-duplicate removal:
from fennec_community.chunks import ChunkManager, Deduplicator

manager = ChunkManager()
chunks = manager.process(text, source="doc.pdf")

deduplicator = Deduplicator(
    use_hash=True,
    use_similarity=True,
    similarity_threshold=0.92,
    embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
)
chunks = deduplicator.deduplicate(chunks)

Example 7: Custom BERT Models per Language

from fennec_community.chunks import ChunkConfig, MultilanguageTextChunker

config = ChunkConfig(
    use_semantic_chunking=True,
    arabic="CAMeL-Lab/bert-base-arabic-camelbert-ca",
    french="camembert/camembert-large",
    embedding_batch_size=32,   # lower batch size on limited GPU memory
    embedding_cache_size=5_000,
)

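# Pass the config so the custom models take effect; the config= keyword is
# assumed here to mirror ChunkManager(config=...).
chunker = MultilanguageTextChunker(language="arabic", config=config)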
chunks = chunker.chunk(arabic_text, source="doc.pdf")

Example 8: Structure Inspection via StructuredSection

from fennec_community.chunks import DocumentStructureParser

parser = DocumentStructureParser()
doc_type = parser.detect_document_type(md_text)
sections = parser.parse(md_text, doc_type)

code_sections = [s for s in sections if s.is_code_block]
headers = [s for s in sections if s.heading_level > 0]

print(f"Found {len(code_sections)} code blocks and {len(headers)} headings")