Chunks Module
Table of Contents
- Overview
- Architecture at a Glance
- Quick Start
- Core Entry Point — ChunkManager
- Chunkers
- Text Splitters
- Supporting Components
- Data Models
- Base Classes
- Configuration Reference — ChunkConfig
- ChunkMode Enum
- End-to-End Usage Examples
Overview
The chunks module provides a complete, production-ready pipeline for splitting documents into semantically meaningful units ("chunks") ready for vector embedding and retrieval in RAG systems. It handles:
- Multiple splitting strategies — semantic, adaptive, structure-aware, context-aware, and hybrid.
- Multilingual support — 17+ languages with dedicated BERT models; first-class Arabic support.
- Post-processing — TF-IDF keyword extraction, importance scoring, and deduplication.
- Query-aware routing — automatically selects the best strategy based on sample queries.
- LangChain-compatible splitters — drop-in replacements for CharacterTextSplitter, RecursiveCharacterTextSplitter, etc.
Architecture at a Glance
Text / Documents
│
▼
ChunkManager ← Orchestrator (recommended entry point)
│
├── QueryPatternAnalyzer (optional: picks strategy from queries)
│
├── Chunker selection:
│ ├── SemanticChunker (embedding-based splits)
│ ├── AdaptiveChunker (density-adaptive sizing)
│ ├── StructureAwareChunker (respects headers, code, tables)
│ ├── ContextAwareChunker (sliding-window context)
│ └── Hybrid (structural + semantic)
│
├── MetadataEnricher (keywords, scores)
├── Deduplicator (hash + similarity)
└── Finalize (position, total_chunks)
│
▼
List[DocumentChunk]
Quick Start
from fennec_community.chunks import ChunkManager, ChunkConfig
# 1. Basic usage — auto mode
manager = ChunkManager()
chunks = manager.process("Your document text here...", source="my_doc.pdf")
# 2. Semantic chunking with custom config
config = ChunkConfig(
    use_semantic_chunking=True,
    chunk_size=512,
    overlap=128,
    extract_keywords=True,
    deduplication_enabled=True,
)
manager = ChunkManager(config=config)
chunks = manager.process(text, source="report.pdf")
# 3. Inspect results
for chunk in chunks:
    print(chunk.text[:80])
    print(f" keywords: {chunk.metadata.keywords}")
    print(f" score: {chunk.metadata.score:.3f}")
Core Entry Point — ChunkManager
ChunkManager is the recommended entry point for all production use. It orchestrates the full chunking lifecycle: strategy selection → chunking → enrichment → deduplication → finalization.
Constructor
ChunkManager(
    config: Optional[ChunkConfig] = None,
    mode: ChunkMode = ChunkMode.AUTO,
    *,
    device: Optional[str] = None,
    language: str = "auto",
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| config | ChunkConfig \| None | None | Configuration object. Uses defaults if omitted. |
| mode | ChunkMode | ChunkMode.AUTO | Chunking strategy. See ChunkMode. |
| device | str \| None | None | Torch device ("cuda", "cpu"). Auto-detected if None. |
| language | str | "auto" | Language hint ("arabic", "english", etc.) or "auto" for detection. |
process
Goal: Split a single document into chunks ready for embedding.
def process(
    text: str,
    doc_id: Optional[str] = None,
    source: str = "",
    *,
    sample_queries: Optional[List[str]] = None,
    mode_override: Optional[ChunkMode] = None,
) -> List[DocumentChunk]
Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| text | str | ✅ | The raw document text to be chunked. |
| doc_id | str \| None | ❌ | Unique document identifier. Auto-generated UUID if not provided. |
| source | str | ❌ | Document origin label (filename, URL, etc.) stored in chunk metadata. |
| sample_queries | List[str] \| None | ❌ | Representative queries used to auto-select the best chunking strategy when config.query_aware=True. |
| mode_override | ChunkMode \| None | ❌ | Temporarily override the manager's default mode for this call only. |

Returns: List[DocumentChunk] — Ordered, enriched, deduplicated chunks ready for embedding.
Returns an empty list if text is empty or contains only whitespace.
Example:
chunks = manager.process(
    text="Large document content...",
    source="annual_report_2024.pdf",
    sample_queries=["What is the revenue?", "Who are the key executives?"],
)
process_documents
Goal: Split a list of documents into chunks ready for embedding, in one batch.
def process_documents(
    documents: List[Document],
    *,
    sample_queries: Optional[List[str]] = None,
    mode_override: Optional[ChunkMode] = None,
) -> List[DocumentChunk]
Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| documents | List[Document] | ✅ | List of Document objects (each holds page_content + metadata). |
| sample_queries | List[str] \| None | ❌ | Queries used for strategy selection (applied uniformly to all documents). |
| mode_override | ChunkMode \| None | ❌ | Override mode for all documents in this batch. |
Returns: List[DocumentChunk] — All chunks from all documents combined in order.
Example:
from fennec_community.chunks import Document
docs = [
    Document(page_content="Doc 1 text...", metadata={"source": "file1.pdf"}),
    Document(page_content="Doc 2 text...", metadata={"source": "file2.pdf"}),
]
all_chunks = manager.process_documents(docs)
process_with_query_optimization
Goal: A simplified shortcut to enable the query-aware pipeline without needing to configure config.query_aware.
def process_with_query_optimization(
    text: str,
    queries: List[str],
    doc_id: Optional[str] = None,
    source: str = "",
) -> List[DocumentChunk]
Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| text | str | ✅ | Document text to chunk. |
| queries | List[str] | ✅ | Sample queries that will be analyzed to select the optimal chunking strategy. |
| doc_id | str \| None | ❌ | Optional document ID. |
| source | str | ❌ | Document source label. |
Returns: List[DocumentChunk]
Example:
chunks = manager.process_with_query_optimization(
    text=document_text,
    queries=["How does the API authenticate?", "What are the rate limits?"],
    source="api_docs.md",
)
get_stats
Goal: Extract summarized statistics from a set of chunks (useful for monitoring and diagnostics).
def get_stats(chunks: List[DocumentChunk]) -> Dict[str, Any]
Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| chunks | List[DocumentChunk] | ✅ | The chunk list returned by any process* method. |

Returns: Dict[str, Any] with the following keys:

| Key | Type | Description |
|---|---|---|
| count | int | Total number of chunks. |
| total_chars | int | Sum of all characters across chunks. |
| avg_chars | float | Average characters per chunk. |
| min_chars | int | Smallest chunk size in characters. |
| max_chars | int | Largest chunk size in characters. |
| avg_words | float | Average word count per chunk. |
| avg_score | float | Mean importance score (0–1). |
| max_score | float | Highest importance score in the set. |
| chunk_types | Dict[str, int] | Count of each ChunkType value. |
| unique_sections | int | Number of distinct document sections represented. |
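As a rough illustration of how the size-related keys could be derived, here is a minimal sketch that works on plain chunk texts. The helper name summarize_chunks is hypothetical and not part of the library; the score, chunk-type, and section keys are left out because they depend on chunk metadata.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize_chunks(texts: List[str]) -> Dict[str, Any]:
    """Hypothetical helper: size statistics over raw chunk texts."""
    if not texts:
        return {"count": 0}
    char_counts = [len(t) for t in texts]
    word_counts = [len(t.split()) for t in texts]  # whitespace-delimited words
    return {
        "count": len(texts),
        "total_chars": sum(char_counts),
        "avg_chars": mean(char_counts),
        "min_chars": min(char_counts),
        "max_chars": max(char_counts),
        "avg_words": mean(word_counts),
    }
```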
Example:
stats = manager.get_stats(chunks)
print(stats)
# {'count': 42, 'avg_chars': 487.3, 'avg_score': 0.6812, ...}
Chunkers
Each chunker can be used standalone (without ChunkManager) when you need direct control. All chunkers share the same public interface via BaseChunker.chunk().
Common Interface (all chunkers)
All chunkers inherit from BaseChunker and expose these two public methods:
chunk(text, doc_id="", source="")
Goal: Split a single text into a complete list of chunks while properly setting the position and total_chunks.
def chunk(self, text: str, doc_id: str = "", source: str = "") -> List[DocumentChunk]

| Parameter | Type | Description |
|---|---|---|
| text | str | Raw text to split. |
| doc_id | str | Parent document ID stored in each chunk's metadata. |
| source | str | Document origin label (filename, URL, etc.) |
Returns: List[DocumentChunk]
chunk_documents(documents)
Goal: Batch process a list of Document objects into chunks while preserving each document’s doc_id and source.
def chunk_documents(self, documents: List[Document]) -> List[DocumentChunk]

| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Documents to split. Each Document.metadata may include a "source" key. |
Returns: List[DocumentChunk] — all chunks from all documents, in order.
SemanticChunker
Goal: Split text based on semantic similarity between sentences using embedding models. A new chunk is created when similarity drops below a defined threshold, ensuring each chunk contains a semantically coherent idea.
SemanticChunker(
    config: Optional[ChunkConfig] = None,
    *,
    similarity_threshold: float = 0.75,
    min_chunk_size: int = 50,
    max_chunk_size: int = 2048,
    model_name: Optional[str] = None,
    device: Optional[str] = None,
    language: str = "auto",
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| config | ChunkConfig \| None | None | If provided, overrides individual parameters. |
| similarity_threshold | float | 0.75 | Cosine similarity below this value triggers a new chunk. Higher = more splits. |
| min_chunk_size | int | 50 | Minimum characters per chunk; short chunks are merged with the next. |
| max_chunk_size | int | 2048 | Hard cap: forces a new chunk regardless of similarity. |
| model_name | str \| None | None | Sentence-transformer model name. Defaults to multilingual MiniLM. |
| device | str \| None | None | "cuda" or "cpu". |
| language | str | "auto" | Language tag stored in chunk metadata. |
Best for: Conceptually diverse documents, FAQs, articles where topics shift frequently.
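The threshold rule can be sketched without any embedding model by passing in precomputed adjacent similarities. This is a minimal illustration, not the library's implementation: split_on_similarity_drop is a hypothetical helper, and the min/max size handling is omitted.

```python
from typing import List


def split_on_similarity_drop(
    sentences: List[str],
    similarities: List[float],
    threshold: float = 0.75,
) -> List[List[str]]:
    """Group sentences, starting a new group when adjacent similarity < threshold.

    similarities[i] is the similarity between sentences[i] and sentences[i + 1].
    """
    if not sentences:
        return []
    if len(similarities) != len(sentences) - 1:
        raise ValueError("need exactly one similarity per adjacent pair")
    groups = [[sentences[0]]]
    for sentence, sim in zip(sentences[1:], similarities):
        if sim < threshold:
            groups.append([sentence])  # similarity dropped: open a new chunk
        else:
            groups[-1].append(sentence)  # still coherent: extend current chunk
    return groups
```

In this sketch, raising the threshold makes more adjacent pairs fall below it, so the text is split into more groups.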
Example:
from fennec_community.chunks import SemanticChunker
chunker = SemanticChunker(similarity_threshold=0.70)
chunks = chunker.chunk(text, source="article.txt")
AdaptiveChunker
Goal: Automatically adjust chunk size based on the information density of the text. Dense technical content is split into smaller chunks for more precise retrieval, while narrative text uses larger chunks to preserve context.
AdaptiveChunker(
    config: Optional[ChunkConfig] = None,
    *,
    base_size: int = 512,
    min_size: int = 100,
    max_size: int = 1500,
    overlap: int = 128,
    technical_threshold: float = 0.5,
    language: str = "auto",
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| config | ChunkConfig \| None | None | If provided, reads adaptive_base_size, adaptive_min_size, adaptive_max_size, adaptive_technical_threshold fields. |
| base_size | int | 512 | Starting chunk size before density adjustment. |
| min_size | int | 100 | Minimum allowed chunk size (used for dense/technical text). |
| max_size | int | 1500 | Maximum allowed chunk size (used for simple/narrative text). |
| overlap | int | 128 | Character overlap between consecutive chunks. |
| technical_threshold | float | 0.5 | Information density score above which text is considered "technical". |
| language | str | "auto" | Language tag. |
Best for: Mixed-content documents that combine technical specifications with narrative explanations.
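One way to picture density-adaptive sizing is a linear interpolation around the threshold. The sketch below is an assumption for illustration only: both information_density (a crude token-based proxy) and adaptive_size are hypothetical, and the library's actual density score and sizing rule may differ.

```python
import re


def information_density(text: str) -> float:
    """Hypothetical density proxy: share of tokens containing digits or symbols."""
    tokens = text.split()
    if not tokens:
        return 0.0
    technical = sum(1 for t in tokens if re.search(r"[\d_(){}\[\]=<>/\\]", t))
    return technical / len(tokens)


def adaptive_size(
    density: float,
    base_size: int = 512,
    min_size: int = 100,
    max_size: int = 1500,
    technical_threshold: float = 0.5,
) -> int:
    """Shrink toward min_size above the threshold, grow toward max_size below it."""
    density = min(max(density, 0.0), 1.0)
    if density >= technical_threshold:
        # t is 0.0 at the threshold and 1.0 at density == 1.0
        t = (density - technical_threshold) / max(1.0 - technical_threshold, 1e-9)
        return round(base_size - t * (base_size - min_size))
    # t is 0.0 at the threshold and 1.0 at density == 0.0
    t = (technical_threshold - density) / max(technical_threshold, 1e-9)
    return round(base_size + t * (max_size - base_size))
```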
StructureAwareChunker
Goal: Split the document while respecting its natural structure (Markdown/HTML headings, paragraphs, code blocks, tables, and lists). It ensures that important structural elements are not split in the middle.
StructureAwareChunker(
    config: Optional[ChunkConfig] = None,
    *,
    max_chunk_size: int = 1024,
    min_chunk_size: int = 50,
    language: str = "auto",
    split_on_headers: bool = True,
    preserve_code_blocks: bool = True,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| config | ChunkConfig \| None | None | Reads max_chunk_size, split_on_headers, preserve_code_blocks. |
| max_chunk_size | int | 1024 | Sections exceeding this are further split by sentence. |
| min_chunk_size | int | 50 | Short sections are merged with adjacent ones. |
| language | str | "auto" | Language tag. |
| split_on_headers | bool | True | Create a new chunk at each Markdown/HTML heading. |
| preserve_code_blocks | bool | True | Keep code fences as single chunks regardless of size. |
Best for: Markdown documentation, HTML pages, structured technical manuals.
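The core idea of splitting on headings while leaving code fences intact can be sketched as follows. This is a simplified illustration for Markdown only; split_markdown_sections is a hypothetical helper, and size limits, tables, lists, and HTML are not handled.

```python
import re
from typing import List

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"#{1,6}\s")  # ATX heading: 1-6 '#' followed by whitespace


def split_markdown_sections(text: str) -> List[str]:
    """Start a new section at each heading, ignoring '#' lines inside code fences."""
    sections: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence  # toggles on both opening and closing fence
        elif not in_fence and HEADING.match(stripped) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```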
ContextAwareChunker
Goal: Preserve context using sliding windows. Each chunk includes core sentences plus surrounding sentences before and after them, ensuring no loss of context at chunk boundaries.
ContextAwareChunker(
    config: Optional[ChunkConfig] = None,
    *,
    chunk_size: int = 512,
    overlap: int = 128,
    window_size: int = 2,
    min_chunk_size: int = 50,
    language: str = "auto",
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| config | ChunkConfig \| None | None | Reads chunk_size, overlap, context_window_size, min_context_sentences. |
| chunk_size | int | 512 | Target character count for primary sentence groups. |
| overlap | int | 128 | Characters from previous chunk carried into the next. |
| window_size | int | 2 | Number of sentences added as context on each side of a primary group. |
| min_chunk_size | int | 50 | Chunks below this size are discarded. |
| language | str | "auto" | Language tag. |
Best for: Conversational text, narrative documents, Q&A content where answer context spans sentence boundaries.
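The sliding-window idea can be sketched at the sentence level. In this illustration, windows_with_context is a hypothetical helper and group_size (a sentence count) stands in for the character-based chunk_size, so it shows the padding logic rather than the library's exact behaviour.

```python
from typing import List


def windows_with_context(
    sentences: List[str],
    group_size: int = 3,
    window_size: int = 2,
) -> List[str]:
    """Emit groups of core sentences padded with neighbours on each side."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    chunks = []
    for start in range(0, len(sentences), group_size):
        end = min(start + group_size, len(sentences))
        lo = max(0, start - window_size)             # context before the core group
        hi = min(len(sentences), end + window_size)  # context after the core group
        chunks.append(" ".join(sentences[lo:hi]))
    return chunks
```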
ArabicTextChunker
Goal: Specialized processing for Arabic text, including normalization (unifying forms of Alef, Ta Marbuta, Ya, and removing diacritics), spacing correction, and semantic splitting using Arabic BERT models. It supports both pure Arabic text and mixed Arabic-English content.
ArabicTextChunker(
    chunk_size: int = 512,
    overlap: int = 128,
    min_chunk_size: int = 50,
    model_name: str = "CAMeL-Lab/bert-base-arabic-camelbert-mix",
    use_semantic_chunking: bool = False,
    device: Optional[str] = None,
    preserve_formatting: bool = True,
    fix_spacing: bool = True,
    strict_size_limit: bool = True,
    size_tolerance: float = 0.05,
    use_smart_overlap: bool = True,
    smart_overlap_threshold: float = 0.7,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 512 | Maximum characters per chunk. |
| overlap | int | 128 | Character overlap between chunks. |
| min_chunk_size | int | 50 | Minimum characters; smaller chunks are skipped. |
| model_name | str | "CAMeL-Lab/bert-base-arabic-camelbert-mix" | HuggingFace model for Arabic BERT embeddings. |
| use_semantic_chunking | bool | False | Enable BERT-based semantic splitting (requires torch + transformers). |
| device | str \| None | None | "cuda" or "cpu". |
| preserve_formatting | bool | True | Keep original whitespace and formatting structure. |
| fix_spacing | bool | True | Auto-fix spacing issues common in Arabic text (e.g., missing spaces after punctuation). |
| strict_size_limit | bool | True | Hard-enforce chunk_size; oversized chunks are force-split at word boundaries. |
| size_tolerance | float | 0.05 | Allowed overshoot fraction (5% by default). Only applies when strict_size_limit=True. |
| use_smart_overlap | bool | True | Use semantic similarity to choose the most relevant overlap sentences. |
| smart_overlap_threshold | float | 0.7 | Minimum cosine similarity for a sentence to be included in smart overlap. |
Class methods:
ArabicTextChunker.create_safely(**kwargs)
Goal: Safely create an instance while automatically ignoring any unknown parameters (useful when passing a dynamic configuration).
@classmethod
def create_safely(cls, **kwargs) -> ArabicTextChunker
Returns: ArabicTextChunker
ArabicTextChunker.get_available_parameters()
Goal: Retrieve a list of all accepted parameters in the __init__ method.
@classmethod
def get_available_parameters(cls) -> List[str]
Returns: List[str]
Public utility methods:
get_arabic_ratio(text)
Goal: Calculate the ratio of Arabic characters to the total number of characters in the text.
def get_arabic_ratio(self, text: str) -> float
Returns: float in range [0.0, 1.0].
is_arabic_dominant(text, threshold=0.3)
Goal: Determine whether the text is primarily Arabic (used for automatic language detection).
def is_arabic_dominant(self, text: str, threshold: float = 0.3) -> bool
Returns: True if Arabic character ratio ≥ threshold.
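The ratio and normalization steps described in this section can be approximated with the standard library. This is a sketch under stated assumptions: arabic_ratio and normalize_arabic are hypothetical helpers, the ratio counts only the basic Arabic Unicode block against all characters (including whitespace), and the specific letter mappings are a common convention rather than the library's confirmed behaviour.

```python
import re

ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block
DIACRITICS = re.compile(r"[\u064B-\u0652]")    # fathatan through sukun


def arabic_ratio(text: str) -> float:
    """Share of characters that fall in the basic Arabic block."""
    if not text:
        return 0.0
    return len(ARABIC_CHARS.findall(text)) / len(text)


def normalize_arabic(text: str) -> str:
    """Strip diacritics and unify common letter variants (assumed mappings)."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    return text
```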
MultilanguageTextChunker
Goal: Process multilingual text (17+ languages) with automatic language detection and selection of the appropriate BERT model for each language. It supports both rule-based and semantic splitting.
MultilanguageTextChunker(
    chunk_size: int = 512,
    overlap: int = 128,
    min_chunk_size: int = 50,
    model_name: Optional[str] = None,
    language: str = "auto",
    use_semantic_chunking: bool = False,
    device: Optional[str] = None,
    use_smart_overlap: bool = True,
    smart_overlap_threshold: float = 0.7,
    strict_size_limit: bool = True,
    size_tolerance: float = 0.05,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 512 | Target chunk size in characters. |
| overlap | int | 128 | Overlap in characters between consecutive chunks. |
| min_chunk_size | int | 50 | Minimum chunk size; smaller chunks are discarded. |
| model_name | str \| None | None | Override the auto-selected language model. |
| language | str | "auto" | Language code ("arabic", "chinese", "english", etc.) or "auto" for detection. |
| use_semantic_chunking | bool | False | Enable BERT-based semantic splitting. |
| device | str \| None | None | "cuda" or "cpu". |
| use_smart_overlap | bool | True | Use semantic-similarity-based overlap selection. |
| smart_overlap_threshold | float | 0.7 | Similarity threshold for smart overlap. |
| strict_size_limit | bool | True | Enforce hard size limits. |
| size_tolerance | float | 0.05 | Allowed size overshoot fraction. |
Supported languages: arabic, chinese, japanese, korean, russian, hindi, english, french, german, spanish, portuguese, italian, dutch, polish, turkish, vietnamese, multilingual.
Public utility methods:
get_supported_languages()
Goal: Retrieve a list of all supported languages along with their corresponding BERT model names.
def get_supported_languages(self) -> List[str]
Returns: List[str] — e.g. ["multilingual", "english", "arabic", "chinese", ...]
Example:
chunker = MultilanguageTextChunker()
print(chunker.get_supported_languages())
# ['multilingual', 'english', 'arabic', 'chinese', 'french', ...]
switch_language(language)
Goal: Switch to a different language and its corresponding BERT model at runtime without re-instantiating the object. Useful for multilingual pipelines operating on the same chunker instance.
def switch_language(self, language: str) -> None

| Parameter | Type | Description |
|---|---|---|
| language | str | Target language code (e.g. "arabic", "french"). Falls back to "multilingual" if unsupported. |
Returns: None. The new model is loaded in-place.
Example:
chunker = MultilanguageTextChunker(language="english")
chunks_en = chunker.chunk(english_text)
chunker.switch_language("arabic")
chunks_ar = chunker.chunk(arabic_text)
Text Splitters
Splitters that return List[str] instead of List[DocumentChunk]. They are useful as lightweight building blocks or drop-in replacements.
Common Interface (all text splitters)
All splitters inherit from TextSplitter and expose these shared methods in addition to split_text:
split_documents(documents)
Goal: Split a list of Document objects and return a new list of chunked Document objects while preserving the original document metadata in each chunk.
def split_documents(self, documents: List[Document]) -> List[Document]

| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | Input documents to split. |
Returns: List[Document] — each item is a chunk as a Document with page_content = chunk_text and the original document's metadata copied.
create_documents(texts, metadatas=None)
Goal: Convert a list of raw texts (with optional metadata) into a list of chunked Document objects. Useful as a factory method when manually loading text data.
def create_documents(
    self,
    texts: List[str],
    metadatas: Optional[List[Dict[str, Any]]] = None,
) -> List[Document]

| Parameter | Type | Description |
|---|---|---|
| texts | List[str] | Raw texts to split into documents. |
| metadatas | List[dict] \| None | Optional metadata dicts, one per text. Defaults to empty dicts if omitted. |
Returns: List[Document] — each split chunk becomes a Document carrying the corresponding metadata.
Example:
from fennec_community.chunks import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
docs = splitter.create_documents(
    texts=["Long document one...", "Long document two..."],
    metadatas=[{"source": "file1.txt"}, {"source": "file2.txt"}],
)
CharacterTextSplitter
Goal: Split text using a fixed delimiter with optional overlap support. This is the simplest and fastest of the splitters.
CharacterTextSplitter(
    separator: str = "\n\n",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| separator | str | "\n\n" | String to split on (double newline = paragraph boundary). |
| chunk_size | int | 1000 | Maximum characters per chunk. |
| chunk_overlap | int | 200 | Characters to repeat at the start of the next chunk. |
| length_function | Callable | len | Function to measure chunk size (can be replaced with a token counter). |
split_text(text)
def split_text(self, text: str) -> List[str]
Returns: List[str] — list of text chunks.
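A minimal sketch of delimiter-based splitting with overlap is shown below. split_on_separator is a hypothetical helper; it assumes overlap is carried as whole separator-delimited pieces and that a single piece longer than chunk_size is kept as its own chunk, which may not match the library's exact rules.

```python
from typing import List


def split_on_separator(
    text: str,
    separator: str = "\n\n",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Carry over the longest tail of whole pieces that fits in chunk_overlap...
            tail: List[str] = []
            for prev in reversed(current):
                if len(separator.join([prev] + tail)) > chunk_overlap:
                    break
                tail.insert(0, prev)
            # ...but drop carried pieces that would push the next chunk over chunk_size.
            while tail and len(separator.join(tail + [piece])) > chunk_size:
                tail.pop(0)
            current = tail
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks
```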
RecursiveCharacterTextSplitter
Goal: Hierarchical splitting that tries a sequence of separators from the coarsest (paragraphs) to the finest (individual characters). It preserves natural text boundaries as much as possible and includes special handling for Arabic text.
RecursiveCharacterTextSplitter(
    separators: Optional[List[str]] = None,
    arabic_mode: bool = False,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| separators | List[str] \| None | None | Custom separator hierarchy. Defaults to ["\n\n", "\n", ".", "؟", "!", "،", "؛", " ", ""]. |
| arabic_mode | bool | False | Use Arabic-optimized separator list when True. |
| chunk_size | int | 1000 | Maximum characters per chunk. |
| chunk_overlap | int | 200 | Overlap in characters. |
| length_function | Callable | len | Size measurement function. |
split_text(text)
def split_text(self, text: str) -> List[str]
Returns: List[str]
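The recursive strategy can be sketched as: split on the coarsest separator, pack pieces up to the size limit, and recurse with finer separators only on pieces that are still too large. recursive_split is a hypothetical helper with a shortened separator list; overlap and Arabic-specific separators are omitted for brevity.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " ", ""),
) -> List[str]:
    """Split on the coarsest separator; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: fixed-width character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks: List[str] = []
    buffer = ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # piece still fits in the current chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks
```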
TokenTextSplitter
Goal: Split text based on token count (instead of character count) using tiktoken. This is essential when working with OpenAI models to ensure the context window limit is not exceeded.
Requires:
pip install tiktoken
TokenTextSplitter(
    encoding_name: str = "cl100k_base",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| encoding_name | str | "cl100k_base" | Tiktoken encoding (matches GPT-4 / text-embedding-ada-002). The default encoding can be overridden globally via ChunkConfig.model_token. |
| chunk_size | int | 1000 | Maximum tokens per chunk. |
| chunk_overlap | int | 200 | Overlap in tokens. |
split_text(text)
def split_text(self, text: str) -> List[str]
Returns: List[str] — chunks guaranteed to be within chunk_size tokens.
SentenceTextSplitter
Goal: Split text while ensuring sentences are never cut in the middle. This is suitable for cases where preserving complete sentences is essential for understanding, with full support for Arabic.
SentenceTextSplitter(
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 1000 | Maximum characters per chunk. |
| chunk_overlap | int | 200 | Character overlap between chunks. |
| length_function | Callable | len | Size measurement function. |
split_text(text)
def split_text(self, text: str) -> List[str]
Returns: List[str] — chunks always ending at sentence boundaries.
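Sentence-preserving packing can be sketched with a regular expression that splits after sentence-ending punctuation. split_by_sentences is a hypothetical helper; the punctuation set (including the Arabic question mark) is an assumption, overlap is omitted, and a single sentence longer than chunk_size is kept whole.

```python
import re
from typing import List

# Split after '.', '!', '?' or the Arabic question mark (U+061F) followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")


def split_by_sentences(text: str, chunk_size: int = 1000) -> List[str]:
    """Pack whole sentences into chunks without cutting any sentence."""
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```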
Supporting Components
EmbeddingProvider
Goal: Provide text embeddings with smart caching (LRU) and batch processing to speed up operations. It uses sentence-transformers and automatically falls back to TF-IDF when sentence-transformers is unavailable.
Note: Implemented as a singleton per (model_name, device) pair — calling the constructor with the same arguments always returns the same cached instance.
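The combination of a per-key singleton and an LRU cache can be illustrated with a toy class. CachedEmbedder is a hypothetical stand-in, not the library's EmbeddingProvider: the "embedding" is a fake two-number vector, and the calls counter exists only to make cache hits observable.

```python
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple


class CachedEmbedder:
    """Toy stand-in: one instance per (model_name, device), with an LRU cache."""

    _instances: Dict[Tuple[str, Optional[str]], "CachedEmbedder"] = {}

    def __new__(cls, model_name: str, device: Optional[str] = None,
                cache_size: int = 10_000):
        key = (model_name, device)
        if key not in cls._instances:
            instance = super().__new__(cls)
            instance._cache = OrderedDict()
            instance._cache_size = cache_size
            instance.calls = 0  # number of cache misses, for demonstration
            cls._instances[key] = instance
        return cls._instances[key]

    def embed(self, text: str) -> List[float]:
        if text in self._cache:
            self._cache.move_to_end(text)  # mark as most recently used
            return self._cache[text]
        self.calls += 1
        vector = [float(len(text)), float(text.count(" "))]  # fake embedding
        self._cache[text] = vector
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return vector
```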
Module-level function: cosine_similarity_matrix
Goal: Compute a cosine similarity matrix between two embedding sets efficiently in a single batch using matrix operations. It is used internally in the Deduplicator for detecting near-duplicate content.
from fennec_community.chunks import cosine_similarity_matrix
cosine_similarity_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray

| Parameter | Type | Description |
|---|---|---|
| A | np.ndarray | Matrix of shape (n, dim) — first set of embedding vectors. |
| B | np.ndarray | Matrix of shape (m, dim) — second set of embedding vectors. |
Returns: np.ndarray of shape (n, m) — every cell [i, j] holds the cosine similarity between A[i] and B[j].
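For readers who want to see the underlying arithmetic, the sketch below computes the same kind of (n, m) matrix in plain Python: normalize every row to unit length, then take dot products. cosine_matrix is a hypothetical helper written for clarity; the library function operates on NumPy arrays, and mapping zero vectors to similarity 0.0 is an assumption.

```python
import math
from typing import List, Sequence


def cosine_matrix(
    A: Sequence[Sequence[float]],
    B: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return an n x m matrix where cell [i][j] is the cosine of A[i] and B[j]."""

    def unit(v: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm else [0.0] * len(v)

    A_unit = [unit(v) for v in A]
    B_unit = [unit(v) for v in B]
    return [[sum(x * y for x, y in zip(a, b)) for b in B_unit] for a in A_unit]
```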
Example:
import numpy as np
from fennec_community.chunks import cosine_similarity_matrix, EmbeddingProvider
provider = EmbeddingProvider("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
texts_a = ["Hello world", "Machine learning is great"]
texts_b = ["Hi there", "AI is fascinating", "Cooking recipes"]
A = np.array(provider.embed_batch(texts_a))
B = np.array(provider.embed_batch(texts_b))
sim_matrix = cosine_similarity_matrix(A, B)
# sim_matrix.shape == (2, 3)
print(sim_matrix)
EmbeddingProvider(
    model_name: str,
    device: Optional[str] = None,
    cache_size: int = 10_000,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | (required) | HuggingFace sentence-transformers model name. |
| device | str \| None | None | "cuda" or "cpu". |
| cache_size | int | 10_000 | Maximum number of text embeddings to cache in memory (LRU eviction). Corresponds to ChunkConfig.embedding_cache_size. |
embed(text)
Goal: Compute the embedding for a single text with caching support.
def embed(self, text: str) -> np.ndarray
Returns: np.ndarray — 1-D embedding vector.
embed_batch(texts, batch_size=64)
Goal: Efficiently compute embeddings for a list of texts, skipping any entries already stored in the cache.
def embed_batch(self, texts: List[str], batch_size: int = 64) -> List[np.ndarray]

| Parameter | Type | Description |
|---|---|---|
| texts | List[str] | Texts to embed. |
| batch_size | int | Number of texts processed per inference call. Corresponds to ChunkConfig.embedding_batch_size. |
Returns: List[np.ndarray] — Same order as input.
pairwise_similarity(a, b)
Goal: Compute the semantic similarity (cosine similarity) between two texts.
def pairwise_similarity(self, a: str, b: str) -> float
Returns: float in [-1.0, 1.0]. Values close to 1.0 indicate high similarity.
adjacent_similarities(sentences)
Goal: Compute cosine similarity between each sentence and the next one in a list (used internally in SemanticChunker).
def adjacent_similarities(self, sentences: List[str]) -> List[float]
Returns: List[float] of length len(sentences) - 1.
Property:
| Property | Type | Description |
|---|---|---|
| dim | int | Embedding vector dimension. |
MetadataEnricher
Goal: Enrich each chunk with keywords (TF-IDF), keyword density, and a composite importance score that considers position, length, and lexical diversity.
MetadataEnricher(
    max_keywords: int = 10,
    header_score_boost: float = 1.5,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| max_keywords | int | 10 | Maximum number of keywords extracted per chunk. |
| header_score_boost | float | 1.5 | Score multiplier applied to header/section-title chunks. |
enrich(chunks)
Goal: Enrich a full list of chunks using corpus-level TF-IDF, producing more accurate results than per-chunk enrichment.
def enrich(self, chunks: List[DocumentChunk]) -> List[DocumentChunk]
Returns: The same List[DocumentChunk] with updated metadata.keywords, metadata.keyword_density, and metadata.score.
enrich_single(chunk)
Goal: Independently enrich a single chunk (without corpus context) using TF only (no IDF). Suitable for real-time pipelines.
def enrich_single(self, chunk: DocumentChunk) -> DocumentChunk
Returns: The same DocumentChunk with enriched metadata.
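Corpus-level TF-IDF keyword ranking can be sketched as follows. tfidf_keywords is a hypothetical helper; the tokenizer, the smoothed IDF formula, and the alphabetical tie-break are assumptions, and stop-word filtering and importance scoring are left out.

```python
import math
import re
from collections import Counter
from typing import List


def tfidf_keywords(chunks: List[str], max_keywords: int = 10) -> List[List[str]]:
    """Rank each chunk's words by TF * smoothed IDF computed over all chunks."""
    tokenized = [re.findall(r"\w+", c.lower()) for c in chunks]
    # Document frequency: in how many chunks does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n = len(chunks)
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            w: (count / len(tokens)) * (math.log((1 + n) / (1 + doc_freq[w])) + 1)
            for w, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:max_keywords])
    return result
```

Words that are frequent in one chunk but rare across the rest rank highest, which is why corpus-level enrichment can separate chunks better than term frequency alone.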
Deduplicator
Goal: Remove duplicate chunks using a two-stage process: exact hash matching for literal duplicates (fast, O(n)), and cosine similarity for near-duplicate detection (slower, O(n²)).
Deduplicator(
    use_hash: bool = True,
    use_similarity: bool = False,
    similarity_threshold: float = 0.95,
    embedding_model: Optional[str] = None,
    device: Optional[str] = None,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| use_hash | bool | True | Enable fast SHA-256 exact-match deduplication. |
| use_similarity | bool | False | Enable cosine similarity near-duplicate detection. Requires embedding_model. |
| similarity_threshold | float | 0.95 | Chunks with cosine similarity ≥ this value are considered duplicates. Corresponds to ChunkConfig.dedup_similarity_threshold. |
| embedding_model | str \| None | None | Model name for similarity-based dedup. Required if use_similarity=True. |
| device | str \| None | None | "cuda" or "cpu". |
⚠️ Important — ChunkManager behaviour: When using ChunkManager, similarity-based deduplication (use_similarity) is always disabled internally regardless of ChunkConfig.dedup_similarity_threshold. To use similarity-based deduplication, instantiate Deduplicator directly and call deduplicate() on the chunks after chunking.
deduplicate(chunks)
Goal: Apply a two-stage deduplication process to a list of chunks and return a cleaned list.
def deduplicate(self, chunks: List[DocumentChunk]) -> List[DocumentChunk]
Returns: List[DocumentChunk] — deduplicated, preserving original order. The longer chunk is kept when two near-duplicates are found.
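The exact-match stage can be sketched with hashlib alone. dedupe_exact is a hypothetical helper operating on plain strings; the normalization applied before hashing (lowercasing and collapsing whitespace) is an assumption, and the similarity stage is not shown.

```python
import hashlib
from typing import List


def dedupe_exact(texts: List[str]) -> List[str]:
    """Drop literal duplicates in a single pass, keeping the first occurrence."""
    seen = set()
    unique = []
    for text in texts:
        # Assumed normalization: case-insensitive, whitespace-insensitive.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```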
DocumentStructureParser
Goal: Analyze the document structure (Markdown, HTML, or plain text) and extract a list of StructuredSection objects used by StructureAwareChunker to perform structure-aware splitting.
DocumentStructureParser()
detect_document_type(text)
Goal: Detect the document type.
def detect_document_type(self, text: str) -> DocumentType
Returns: DocumentType — one of MARKDOWN, HTML, ARABIC, PLAIN_TEXT.
Detection logic:
- HTML tags → DocumentType.HTML
- Markdown headers / code fences / bold → DocumentType.MARKDOWN
- Arabic character ratio > 20% → DocumentType.ARABIC
- Otherwise → DocumentType.PLAIN_TEXT
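The detection rules can be sketched as a chain of checks. Everything here is an assumption for illustration: DocType is a hypothetical mirror of DocumentType, the regular expressions are simplified, and the checks are assumed to run in the order listed.

```python
import re
from enum import Enum


class DocType(Enum):  # hypothetical mirror of DocumentType
    MARKDOWN = "markdown"
    HTML = "html"
    ARABIC = "arabic"
    PLAIN_TEXT = "plain_text"


def detect_type(text: str) -> DocType:
    """Apply the rules in order: HTML, Markdown, Arabic ratio, plain text."""
    if re.search(r"</?[a-zA-Z][^>]*>", text):
        return DocType.HTML
    # Heading at line start, code fence at line start, or bold span.
    if re.search(r"^#{1,6}\s|^`{3}|\*\*[^*]+\*\*", text, flags=re.MULTILINE):
        return DocType.MARKDOWN
    arabic = len(re.findall(r"[\u0600-\u06FF]", text))
    if text and arabic / len(text) > 0.20:
        return DocType.ARABIC
    return DocType.PLAIN_TEXT
```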
parse(text, doc_type)
Goal: Analyze the document and return a list of structured sections.
def parse(self, text: str, doc_type: DocumentType) -> List[StructuredSection]

| Parameter | Type | Description |
|---|---|---|
| text | str | Document text. |
| doc_type | DocumentType | Document type (use detect_document_type first if unknown). |
Returns: List[StructuredSection] — see StructuredSection for field details.
QueryPatternAnalyzer
Goal: Analyze user query patterns and/or document content to recommend the optimal chunking strategy (chunk size, overlap, semantic vs. adaptive splitting).
QueryPatternAnalyzer(default_strategy: str = "general")

| Parameter | Type | Default | Description |
|---|---|---|---|
| default_strategy | str | "general" | Fallback strategy name when no pattern is detected. One of: "faq", "documentation", "technical", "narrative", "general". |
analyze_queries(queries)
Goal: Recommend a strategy based on a set of representative queries using majority voting.
def analyze_queries(self, queries: List[str]) -> ChunkingStrategy
Returns: ChunkingStrategy dataclass with name, chunk_size, overlap, use_semantic, use_adaptive, use_structure.
analyze_document(text)
Goal: Recommend a strategy based on the document’s own content.
def analyze_document(self, text: str) -> ChunkingStrategy
Returns: ChunkingStrategy
analyze(text, sample_queries=None)
Goal: Perform a combined analysis of both the document and the queries. Queries take priority if they indicate a specific (non-generic) pattern.
def analyze(
    self,
    text: str,
    sample_queries: Optional[List[str]] = None,
) -> ChunkingStrategy
Returns: ChunkingStrategy
get_strategy(name)
Goal: Return the strategy with the given name.
def get_strategy(self, name: str) -> ChunkingStrategy
Returns: ChunkingStrategy — returns the default strategy if name is unknown.
Available strategies:
| Name | chunk_size | overlap | Use Case |
|---|---|---|---|
| faq | 256 | 32 | Q&A documents, frequently asked questions |
| documentation | 768 | 128 | Technical docs, structured manuals |
| technical | 400 | 80 | Dense code, APIs, specifications |
| narrative | 1024 | 200 | Stories, novels, long-form prose |
| general | 512 | 128 | Balanced default for mixed content |
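The majority voting and query-priority behaviour described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the `Strategy` dataclass, the `CUES` keyword lists, and the `classify`, `vote`, and `combine` helpers are all hypothetical. Only the strategy names and sizes are taken from the table above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Strategy:
    """Hypothetical stand-in for ChunkingStrategy (name and sizes only)."""
    name: str
    chunk_size: int
    overlap: int


# Names and sizes copied from the table of available strategies.
STRATEGIES: Dict[str, Strategy] = {
    "faq": Strategy("faq", 256, 32),
    "documentation": Strategy("documentation", 768, 128),
    "technical": Strategy("technical", 400, 80),
    "narrative": Strategy("narrative", 1024, 200),
    "general": Strategy("general", 512, 128),
}

# Assumed keyword cues; the real detection rules are not documented here.
CUES: Dict[str, Tuple[str, ...]] = {
    "faq": ("how do i", "what is", "can i"),
    "technical": ("api", "parameter", "error code"),
    "documentation": ("install", "configure", "setup"),
    "narrative": ("story", "character", "chapter"),
}


def classify(text: str) -> str:
    """Label one query or document with the first strategy whose cue matches."""
    lowered = text.lower()
    for name, cues in CUES.items():
        if any(cue in lowered for cue in cues):
            return name
    return "general"


def vote(queries: List[str], default: str = "general") -> Strategy:
    """Majority vote over per-query labels; generic votes are ignored."""
    votes = Counter(classify(q) for q in queries)
    votes.pop("general", None)
    if not votes:
        return STRATEGIES[default]
    winner, _count = votes.most_common(1)[0]
    return STRATEGIES[winner]


def combine(text: str, sample_queries: Optional[List[str]] = None) -> Strategy:
    """Queries win when they point to a specific pattern; otherwise use the document."""
    if sample_queries:
        from_queries = vote(sample_queries)
        if from_queries.name != "general":
            return from_queries
    return STRATEGIES[classify(text)]


queries = [
    "What is the refund policy?",
    "How do I reset my password?",
    "Which API parameter sets the timeout?",
]
print(vote(queries).name)  # faq (two of the three queries match FAQ cues)
```

In real use you would call `QueryPatternAnalyzer.analyze(text, sample_queries)` and read `chunk_size` and `overlap` from the returned strategy.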
Data Models
DocumentChunk
The core output unit. Every chunker, and ChunkManager itself, ultimately produces DocumentChunk instances.
@dataclass
class DocumentChunk:
    text: str # The chunk's text content
    chunk_id: str # UUID (auto-generated)
    doc_id: str # Parent document ID
    embedding: Optional[np.ndarray] # Vector embedding (if computed)
    metadata: ChunkMetadata # Rich metadata
Read-only properties:
| Property | Type | Description |
|---|---|---|
| id | str | Alias for chunk_id. |
| content_hash | str | SHA-256 hash of normalized text (for deduplication). |
| char_count | int | Character count of text. |
| word_count | int | Word count of text. |
Methods:
| Method | Returns | Description |
|---|---|---|
| to_dict() | Dict[str, Any] | JSON-serializable representation of the chunk and all its metadata. |
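To illustrate how a content hash supports exact deduplication, here is a minimal sketch. The normalization rule (collapse whitespace, strip, lowercase) is an assumption, since this reference does not spell out how text is normalized before hashing. The helper names are hypothetical.

```python
import hashlib
import re
from typing import Iterable, List


def content_hash(text: str) -> str:
    """SHA-256 of the normalized text (assumed rule: collapse whitespace, lowercase)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(texts: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct hash, preserving order."""
    seen = set()
    unique: List[str] = []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing makes the duplicate check a set lookup per chunk, which is why exact deduplication is cheap compared with similarity-based deduplication.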
ChunkMetadata
@dataclass
class ChunkMetadata:
    source: str # Document origin (filename, URL)
    page: Optional[int] # Page number (if applicable)
    section: Optional[str] # Section/heading title
    position: int # Index of this chunk in the document (0-based)
    total_chunks: int # Total chunks in the parent document
    chunk_type: ChunkType # Type classification
    document_type: DocumentType
    language: str
    keywords: List[str] # Top TF-IDF keywords
    score: float # Importance score [0.0 – 1.0]
    keyword_density: float # Ratio of meaningful words
    is_header: bool # True if this chunk is a document heading
    heading_level: int # 0 = not a header; 1–6 = H1–H6
    char_start: int # Character offset in original document
    char_end: int # Character offset end in original document
    extra: Dict[str, Any] # Chunker-specific extra data
StructuredSection
The output unit of DocumentStructureParser.parse(). Represents a single structural unit within a document before it is converted to a DocumentChunk.
@dataclass
class StructuredSection:
    text: str # The section's raw text content
    heading: str # Title of the nearest parent heading (empty if none)
    heading_level: int # 0 = body text; 1–6 = H1–H6
    page: Optional[int] # Page number if available (e.g. from PDF parsers)
    section_index: int # Sequential index of this section within the document
    is_code_block: bool # True if the section is a fenced code block
    is_list: bool # True if the section is a list (ordered or unordered)
    is_table: bool # True if the section is a table
    char_start: int # Character offset of the section start in the original text
    char_end: int # Character offset of the section end in the original text
    extra: Dict[str, Any] # Parser-specific additional data
Example:
from fennec_community.chunks import DocumentStructureParser
parser = DocumentStructureParser()
doc_type = parser.detect_document_type(markdown_text)
sections = parser.parse(markdown_text, doc_type)
for s in sections:
    print(f"[H{s.heading_level}] {s.heading!r} - code={s.is_code_block} list={s.is_list}")
    print(f" chars [{s.char_start}:{s.char_end}]: {s.text[:60]}")
Document
Input container compatible with LangChain conventions.
@dataclass
class Document:
    page_content: str # The raw text
    metadata: Dict[str, Any] # Arbitrary key-value metadata
    doc_id: str # Auto-generated UUID
Enumerations
ChunkType
PARAGRAPH, SENTENCE, SECTION, CODE_BLOCK, LIST_ITEM, HEADER, TABLE, SEMANTIC, ADAPTIVE, STRUCTURAL, WINDOW
DocumentType
PLAIN_TEXT, MARKDOWN, HTML, PDF, ARABIC, MIXED
Base Classes
These abstract base classes are exported from the package (from fennec_community.chunks import BaseChunker, TextSplitter, ChunkingStrategy) and are intended for building custom chunkers that integrate with the library's pipeline.
BaseChunker
Abstract base class for all chunkers. Subclass this to implement a custom splitting strategy that is compatible with ChunkManager and chunk_documents.
from fennec_community.chunks import BaseChunker, DocumentChunk
from typing import List
class MyChunker(BaseChunker):
    def _chunk_impl(self, text: str, doc_id: str, source: str) -> List[DocumentChunk]:
        # implement your splitting logic here
        ...
You must implement: _chunk_impl(text, doc_id, source) -> List[DocumentChunk]
You inherit for free: chunk(), chunk_documents(), and _finalize() (sets position and total_chunks on each chunk).
TextSplitter
Abstract base class for text splitters. Subclass this to implement a custom splitter that works with split_documents() and create_documents().
from fennec_community.chunks import TextSplitter
from typing import List
class MyTextSplitter(TextSplitter):
    def split_text(self, text: str) -> List[str]:
        # implement your splitting logic here
        ...
You must implement: split_text(text) -> List[str]
You inherit for free: split_documents() and create_documents().
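As a concrete example of a custom splitter, the sketch below merges paragraphs up to a character budget. To keep it runnable on its own, it defines a local stand-in for `TextSplitter`. In real code you would subclass the class imported from `fennec_community.chunks` instead. `ParagraphSplitter` and its `max_chars` argument are hypothetical, and the base class constructor arguments are not documented here.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class TextSplitter(ABC):
    """Local stand-in for the library's TextSplitter so the sketch runs alone."""

    @abstractmethod
    def split_text(self, text: str) -> List[str]:
        ...


class ParagraphSplitter(TextSplitter):
    """Split on blank lines, then merge short paragraphs up to max_chars."""

    def __init__(self, max_chars: int = 512) -> None:
        self.max_chars = max_chars

    def split_text(self, text: str) -> List[str]:
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
        chunks: List[str] = []
        current = ""
        for para in paragraphs:
            candidate = f"{current}\n\n{para}" if current else para
            if len(candidate) <= self.max_chars or not current:
                # Fits, or it is a single oversized paragraph that is kept whole.
                current = candidate
            else:
                chunks.append(current)
                current = para
        if current:
            chunks.append(current)
        return chunks
```

Only `split_text` is implemented here; per the text above, `split_documents()` and `create_documents()` come from the real base class.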
ChunkingStrategy
Legacy abstract base class kept for backward compatibility. Represents a named chunking strategy. In current usage, QueryPatternAnalyzer returns plain dataclass instances (not subclasses of this ABC) that carry the same fields.
| Field | Type | Description |
|---|---|---|
| name | str | Strategy name (e.g. "faq", "technical"). |
| chunk_size | int | Recommended chunk size in characters. |
| overlap | int | Recommended overlap in characters. |
| use_semantic | bool | Whether to use semantic splitting. |
| use_adaptive | bool | Whether to use adaptive sizing. |
| use_structure | bool | Whether to respect document structure. |
Configuration Reference — ChunkConfig
ChunkConfig is a dataclass with sensible defaults. Pass it to any chunker or ChunkManager to customize behaviour. All fields are optional — omit any field to use its default.
from fennec_community.chunks import ChunkConfig
config = ChunkConfig(
    chunk_size=512,
    overlap=128,
    use_semantic_chunking=True,
    extract_keywords=True,
    deduplication_enabled=True,
)
Basic Sizing
| Field | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 512 | Target chunk size in characters. |
| overlap | int | 128 | Overlap in characters between consecutive chunks. |
| min_chunk_size | int | 50 | Chunks smaller than this are discarded or merged. |
| max_chunk_size | int | 2048 | Hard maximum; chunks are force-split above this. |
| strict_size_limit | bool | True | Enforce hard size limits. When True, chunks exceeding max_chunk_size are force-split at word boundaries. |
| size_tolerance | float | 0.05 | Fractional overshoot allowed above chunk_size before a hard split is triggered (5% by default). Only applies when strict_size_limit=True. |
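The arithmetic behind size_tolerance and a word-boundary force split can be sketched as below. Both helpers are hypothetical. The formula `chunk_size * (1 + size_tolerance)` is the natural reading of "fractional overshoot", but the library's exact rounding and splitting rules are an assumption here.

```python
from typing import List


def hard_limit(chunk_size: int = 512, size_tolerance: float = 0.05) -> int:
    """Largest length tolerated before a hard split, under the assumed formula."""
    return int(chunk_size * (1 + size_tolerance))


def force_split(text: str, limit: int) -> List[str]:
    """Greedily pack whole words into pieces of at most `limit` characters."""
    pieces: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit or not current:
            current = candidate  # a single over-long word is kept intact
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


print(hard_limit())  # 537
```

With the defaults, a chunk may therefore reach about 537 characters before a hard split would apply.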
Semantic Chunking
| Field | Type | Default | Description |
|---|---|---|---|
| use_semantic_chunking | bool | False | Enable embedding-based semantic splitting. |
| semantic_similarity_threshold | float | 0.75 | Similarity drop threshold for a new chunk boundary. |
| semantic_model | str | MiniLM | Sentence-transformer model for semantic chunking. |
| model_name | str | MiniLM | Alias for semantic_model kept for backward compatibility with ArabicTextChunker and MultilanguageTextChunker. |
| embedding_batch_size | int | 64 | Number of sentences processed per embedding inference call. Increase for GPU throughput, decrease to save memory. |
| embedding_cache_size | int | 10_000 | LRU cache size (number of text entries) shared across all EmbeddingProvider instances. |
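The role of semantic_similarity_threshold can be illustrated with a small sketch: a new group starts whenever the cosine similarity between consecutive sentence vectors falls below the threshold. This is one plausible reading of "similarity drop", not the library's confirmed algorithm. The helper names are hypothetical and the 2-D vectors are toy values, not real model output.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(
    sentences: List[str],
    vectors: List[Sequence[float]],
    threshold: float = 0.75,
) -> List[List[str]]:
    """Start a new group when adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    groups: List[List[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            groups.append([sentences[i]])
        else:
            groups[-1].append(sentences[i])
    return groups


sentences = ["Cats purr.", "Cats nap a lot.", "GPUs run matrix math."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy embeddings for illustration
print(group_by_similarity(sentences, vectors))
```

Raising the threshold produces more, smaller groups; lowering it merges more sentences together.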
Adaptive Chunking
| Field | Type | Default | Description |
|---|---|---|---|
| use_adaptive_chunking | bool | False | Enable density-adaptive chunk sizing. |
| adaptive_technical_threshold | float | 0.6 | Information-density score above which text is classified as "technical". Technical text gets smaller chunks. |
| adaptive_min_size | int | 100 | Minimum chunk size when content is dense/technical. |
| adaptive_max_size | int | 1500 | Maximum chunk size when content is simple/narrative. |
| adaptive_base_size | int | 512 | Starting size before density adjustment is applied. |
Smart Overlap
| Field | Type | Default | Description |
|---|---|---|---|
| use_smart_overlap | bool | True | Use semantic similarity to select the most relevant sentences to carry over as overlap, instead of a fixed character window. Enabled by default. |
| smart_overlap_threshold | float | 0.7 | Minimum cosine similarity for a sentence to be included in the smart overlap region. |
Structure-Aware Chunking
| Field | Type | Default | Description |
|---|---|---|---|
| respect_document_structure | bool | True | Honor Markdown/HTML structural elements. |
| split_on_headers | bool | True | Create a new chunk at each Markdown/HTML heading. |
| split_on_paragraphs | bool | True | Treat paragraph breaks (\n\n) as natural chunk boundaries. |
| preserve_code_blocks | bool | True | Never split code blocks. |
| preserve_tables | bool | True | Never split tables. |
| preserve_lists | bool | True | Never split list blocks mid-way. |
Context-Aware Chunking
| Field | Type | Default | Description |
|---|---|---|---|
| use_context_window | bool | False | Enable sliding window context padding. |
| context_window_size | int | 2 | Number of sentences added as context on each side of a primary group. |
| min_context_sentences | int | 3 | Minimum number of sentences a group must contain before context padding is applied. |
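A rough sketch of how the two context parameters could interact is shown below. The function is hypothetical and reflects only the descriptions in the table: a group of sentences is padded with neighbours on each side, unless the group is smaller than min_context_sentences.

```python
from typing import List


def pad_with_context(
    sentences: List[str],
    start: int,
    end: int,
    context_window_size: int = 2,
    min_context_sentences: int = 3,
) -> List[str]:
    """Return sentences[start:end] plus up to context_window_size neighbours per side."""
    if end - start < min_context_sentences:
        return sentences[start:end]  # group too small: no padding (assumed reading)
    left = max(0, start - context_window_size)
    right = min(len(sentences), end + context_window_size)
    return sentences[left:right]


doc = [f"S{i}" for i in range(10)]
print(pad_with_context(doc, 4, 7))  # ['S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8']
```

Padding is clipped at the document edges, so groups near the start or end receive fewer context sentences.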
Metadata & Scoring
| Field | Type | Default | Description |
|---|---|---|---|
| extract_keywords | bool | True | Run TF-IDF keyword extraction after chunking. |
| keyword_max_count | int | 10 | Maximum keywords per chunk. |
| compute_chunk_scores | bool | True | Compute importance scores after chunking. |
| header_score_boost | float | 1.5 | Score multiplier for header chunks. |
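For readers unfamiliar with TF-IDF, the sketch below ranks each chunk's terms by term frequency times a smoothed inverse document frequency, then applies a header boost. It is a generic textbook variant, not MetadataEnricher's actual code. The tokenizer, the smoothing, and the clamp to 1.0 (chosen to match the documented score range) are assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase ASCII word tokens; a real tokenizer would handle other scripts."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(chunks: List[str], max_count: int = 10) -> List[List[str]]:
    """Top terms per chunk, scored by TF times smoothed IDF across all chunks."""
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    results: List[List[str]] = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:max_count])
    return results


def boosted_score(base_score: float, is_header: bool, header_score_boost: float = 1.5) -> float:
    """Multiply header scores by the boost, clamped to 1.0 (assumed clamp)."""
    boosted = base_score * header_score_boost if is_header else base_score
    return min(boosted, 1.0)


chunks = ["the cat sat on the mat", "the dog ate the bone", "the bird sang"]
print(tfidf_keywords(chunks, max_count=2)[0])  # ['cat', 'mat']
```

Terms that occur in every chunk, such as "the" here, get an IDF of zero and fall to the bottom of the ranking.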
Deduplication
| Field | Type | Default | Description |
|---|---|---|---|
| deduplication_enabled | bool | True | Remove duplicate chunks automatically. |
| dedup_use_hash | bool | True | Use fast SHA-256 exact deduplication. |
| dedup_similarity_threshold | float | 0.95 | Cosine similarity threshold above which two chunks are considered duplicates. Note: this threshold is passed to Deduplicator at construction time, but ChunkManager always sets use_similarity=False. To use similarity-based deduplication you must call Deduplicator directly. |
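To make the threshold semantics concrete, here is a small near-duplicate filter. It uses cosine similarity over word-count vectors as a cheap stand-in for the embedding model, so it only illustrates how the threshold is applied. The helper names are hypothetical and this is not how Deduplicator is implemented.

```python
import math
from collections import Counter
from typing import List


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of lowercase word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def drop_near_duplicates(texts: List[str], threshold: float = 0.95) -> List[str]:
    """Keep a text only if no already-kept text is more similar than threshold."""
    kept: List[str] = []
    for text in texts:
        if all(bow_cosine(text, other) <= threshold for other in kept):
            kept.append(text)
    return kept
```

Each candidate is compared against every kept text, so the cost grows quadratically with the number of chunks. That cost is one plausible reason to keep similarity deduplication as an explicit opt-in step.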
Query-Aware Strategy
| Field | Type | Default | Description |
|---|---|---|---|
| query_aware | bool | False | Enable query-pattern-based strategy selection. |
| default_query_pattern | str | "general" | Default strategy name when no pattern is detected. One of "faq", "documentation", "technical", "narrative", "general". |
Performance
| Field | Type | Default | Description |
|---|---|---|---|
| async_processing | bool | False | Enable asynchronous parallel processing of chunks. |
| n_workers | int | 4 | Number of worker threads used when async_processing=True. |
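The effect of n_workers can be pictured with a standard-library thread pool. This sketch parallelises over whole documents with a caller-supplied chunking function. Both the unit of parallelism and the `process_all` helper are assumptions, since the reference does not describe how the library schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def process_all(
    texts: List[str],
    chunk_fn: Callable[[str], List[str]],
    n_workers: int = 4,
) -> List[List[str]]:
    """Run chunk_fn over each text on a pool of n_workers threads."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in input order, whatever the finish order.
        return list(pool.map(chunk_fn, texts))


print(process_all(["a b", "c d e"], str.split, n_workers=2))  # [['a', 'b'], ['c', 'd', 'e']]
```

Threads help most when the chunking function releases the GIL, for example during model inference or I/O; pure-Python splitting gains little from them.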
Formatting & Text Cleanup
| Field | Type | Default | Description |
|---|---|---|---|
| preserve_formatting | bool | True | Keep original whitespace and formatting structure intact. |
| fix_spacing | bool | True | Auto-fix spacing issues (e.g. missing spaces after Arabic punctuation). |
Tiktoken
| Field | Type | Default | Description |
|---|---|---|---|
| model_token | str | "cl100k_base" | Tiktoken encoding name used by TokenTextSplitter as its global default. Common values: "cl100k_base" (GPT-4, text-embedding-ada-002), "p50k_base" (Codex). |
Language-Specific BERT Models
These fields set the HuggingFace model name used for each language when use_semantic_chunking=True inside MultilanguageTextChunker or ArabicTextChunker. Override any field to use a different model for that language.
| Field | Default Model |
|---|---|
| multilingual | xlm-roberta-base |
| arabic | CAMeL-Lab/bert-base-arabic-camelbert-mix |
| english | bert-base-uncased |
| chinese | bert-base-chinese |
| french | camembert-base |
| german | bert-base-german-cased |
| spanish | dccuchile/bert-base-spanish-wwm-uncased |
| russian | DeepPavlov/rubert-base-cased |
| japanese | cl-tohoku/bert-base-japanese |
| korean | klue/bert-base |
| portuguese | neuralmind/bert-base-portuguese-cased |
| italian | dbmdz/bert-base-italian-cased |
| dutch | GroNLP/bert-base-dutch-cased |
| polish | dkleczek/bert-base-polish-cased |
| turkish | dbmdz/bert-base-turkish-cased |
| vietnamese | vinai/phobert-base |
| hindi | ai4bharat/indic-bert |
Example:
config = ChunkConfig(
    use_semantic_chunking=True,
    arabic="CAMeL-Lab/bert-base-arabic-camelbert-ca", # swap Arabic model
    french="camembert/camembert-large", # swap French model
)
Transformer Parameters
Low-level parameters forwarded to the HuggingFace tokenizer during semantic chunking. These rarely need to change.
| Field | Type | Default | Description |
|---|---|---|---|
| max_len | int | 512 | Maximum token length passed to the tokenizer. |
| padding | bool | True | Enable padding to max_len. |
| truncation | bool | True | Truncate inputs exceeding max_len. |
| return_tensor | str | "pt" | Tensor framework returned by the tokenizer ("pt" = PyTorch). |
Merge Parameters
| Field | Type | Default | Description |
|---|---|---|---|
| min_sentences | int | 30 | Minimum sentence count a document must have before sentence-level merging is considered. Documents with fewer sentences skip the merge pass. |
ChunkMode Enum
from fennec_community.chunks import ChunkMode
ChunkMode.AUTO # ChunkManager picks strategy based on config flags
ChunkMode.SEMANTIC # Force SemanticChunker
ChunkMode.ADAPTIVE # Force AdaptiveChunker
ChunkMode.STRUCTURAL # Force StructureAwareChunker
ChunkMode.CONTEXTUAL # Force ContextAwareChunker
ChunkMode.HYBRID # StructureAwareChunker → SemanticChunker on large sections
End-to-End Usage Examples
Example 1: RAG Pipeline (Recommended)
from fennec_community.chunks import ChunkManager, ChunkConfig, ChunkMode
config = ChunkConfig(
    chunk_size=512,
    overlap=128,
    use_semantic_chunking=True,
    extract_keywords=True,
    deduplication_enabled=True,
    query_aware=True,
)
manager = ChunkManager(config=config, mode=ChunkMode.AUTO)
with open("report.md", "r", encoding="utf-8") as f:
    text = f.read()
chunks = manager.process_with_query_optimization(
    text=text,
    queries=["What are the financial results?", "Who is the CEO?"],
    source="report.md",
)
print(manager.get_stats(chunks))
for chunk in chunks:
    vector = embed_model.encode(chunk.text) # your embedding model
    vector_db.insert(chunk.chunk_id, vector, chunk.to_dict()) # your vector store
Example 2: Arabic Document
from fennec_community.chunks import ArabicTextChunker
chunker = ArabicTextChunker(
    chunk_size=400,
    overlap=100,
    use_semantic_chunking=True,
    fix_spacing=True,
)
chunks = chunker.chunk(arabic_text, source="arabic_doc.pdf")
for ch in chunks:
    print(f"[{ch.metadata.position}] {ch.text[:60]}...")
    print(f" keywords: {ch.metadata.keywords}")
Example 3: Multilingual Batch Processing
from fennec_community.chunks import ChunkManager, Document
manager = ChunkManager(language="auto")
documents = [
    Document(page_content=english_text, metadata={"source": "en.txt", "lang": "en"}),
    Document(page_content=arabic_text, metadata={"source": "ar.txt", "lang": "ar"}),
    Document(page_content=french_text, metadata={"source": "fr.txt", "lang": "fr"}),
]
all_chunks = manager.process_documents(documents)
print(f"Total chunks: {len(all_chunks)}")
Example 4: Basic Splitters
from fennec_community.chunks import RecursiveCharacterTextSplitter, TokenTextSplitter
# Token-aware splitting for OpenAI models
splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=50,
)
text_chunks = splitter.split_text(long_text)
# Arabic recursive splitting
arabic_splitter = RecursiveCharacterTextSplitter(
    arabic_mode=True,
    chunk_size=400,
    chunk_overlap=80,
)
arabic_chunks = arabic_splitter.split_text(arabic_text)
Example 5: Deduplication + Enrichment Standalone
from fennec_community.chunks import MetadataEnricher, Deduplicator
enricher = MetadataEnricher(max_keywords=8, header_score_boost=2.0)
deduplicator = Deduplicator(use_hash=True, use_similarity=False)
# After any chunking operation:
chunks = enricher.enrich(chunks)
chunks = deduplicator.deduplicate(chunks)
# Sort by importance score
chunks.sort(key=lambda c: c.metadata.score, reverse=True)
top_chunks = chunks[:10]
Example 6: Similarity-Based Deduplication
# ChunkManager always disables similarity dedup internally.
# Use Deduplicator directly when you need near-duplicate removal:
from fennec_community.chunks import ChunkManager, Deduplicator
manager = ChunkManager()
chunks = manager.process(text, source="doc.pdf")
deduplicator = Deduplicator(
    use_hash=True,
    use_similarity=True,
    similarity_threshold=0.92,
    embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
)
chunks = deduplicator.deduplicate(chunks)
Example 7: Custom BERT Models per Language
from fennec_community.chunks import ChunkConfig, MultilanguageTextChunker
config = ChunkConfig(
    use_semantic_chunking=True,
    arabic="CAMeL-Lab/bert-base-arabic-camelbert-ca",
    french="camembert/camembert-large",
    embedding_batch_size=32, # lower batch size on limited GPU memory
    embedding_cache_size=5_000,
)
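# Assumption: the chunker accepts the config via a config= keyword, as ChunkManager does.
chunker = MultilanguageTextChunker(config=config, language="arabic")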
chunks = chunker.chunk(arabic_text, source="doc.pdf")
Example 8: Structure Inspection via StructuredSection
from fennec_community.chunks import DocumentStructureParser
parser = DocumentStructureParser()
doc_type = parser.detect_document_type(md_text)
sections = parser.parse(md_text, doc_type)
code_sections = [s for s in sections if s.is_code_block]
headers = [s for s in sections if s.heading_level > 0]
print(f"Found {len(code_sections)} code blocks and {len(headers)} headings")