Fennec Community community/chunks.md

Chunks Module

Overview
Architecture at a Glance
Quick Start
Core Entry Point — ChunkManager
Chunkers
Text Splitters
Supporting Components
Data Models
Base Classes
Configuration Reference — ChunkConfig
ChunkMode Enum
End-to-End Usage Examples

Overview

The chunks module provides a complete, production-ready pipeline for splitting documents into semantically meaningful units ("chunks") ready for vector embedding and retrieval in RAG systems. It handles:

Multiple splitting strategies — semantic, adaptive, structure-aware, context-aware, and hybrid.
Multilingual support — 17+ languages with dedicated BERT models; first-class Arabic support.
Post-processing — TF-IDF keyword extraction, importance scoring, and deduplication.
Query-aware routing — automatically selects the best strategy based on sample queries.
LangChain-compatible splitters — drop-in replacements for CharacterTextSplitter, RecursiveCharacterTextSplitter, etc.

Architecture at a Glance

Text / Documents
       │
       ▼
  ChunkManager          ← Orchestrator (recommended entry point)
       │
       ├── QueryPatternAnalyzer   (optional: picks strategy from queries)
       │
       ├── Chunker selection:
       │     ├── SemanticChunker         (embedding-based splits)
       │     ├── AdaptiveChunker         (density-adaptive sizing)
       │     ├── StructureAwareChunker   (respects headers, code, tables)
       │     ├── ContextAwareChunker     (sliding-window context)
       │     └── Hybrid                  (structural + semantic)
       │
       ├── MetadataEnricher       (keywords, scores)
       ├── Deduplicator           (hash + similarity)
       └── Finalize               (position, total_chunks)
             │
             ▼
      List[DocumentChunk]

Quick Start

from fennec_community.chunks import ChunkManager, ChunkConfig

# 1. Basic usage — auto mode
manager = ChunkManager()
chunks = manager.process("Your document text here...", source="my_doc.pdf")

# 2. Semantic chunking with custom config
config = ChunkConfig(
    use_semantic_chunking=True,
    chunk_size=512,
    overlap=128,
    extract_keywords=True,
    deduplication_enabled=True,
)
manager = ChunkManager(config=config)
chunks = manager.process(text, source="report.pdf")

# 3. Inspect results
for chunk in chunks:
    print(chunk.text[:80])
    print(f"  keywords: {chunk.metadata.keywords}")
    print(f"  score:    {chunk.metadata.score:.3f}")

Core Entry Point — `ChunkManager`

ChunkManager is the recommended entry point for all production use. It orchestrates the full chunking lifecycle: strategy selection → chunking → enrichment → deduplication → finalization.

Constructor

ChunkManager(
    config: Optional[ChunkConfig] = None,
    mode: ChunkMode = ChunkMode.AUTO,
    *,
    device: Optional[str] = None,
    language: str = "auto",
)

Parameter	Type	Default	Description
`config`	`ChunkConfig \| None`	`None`	Configuration object. Uses defaults if omitted.
`mode`	`ChunkMode`	`ChunkMode.AUTO`	Chunking strategy. See ChunkMode.
`device`	`str \| None`	`None`	Torch device (`"cuda"`, `"cpu"`). Auto-detected if `None`.
`language`	`str`	`"auto"`	Language hint (`"arabic"`, `"english"`, etc.) or `"auto"` for detection.

`process`

Goal: The main entry point for splitting a single document into chunks ready for embedding.

def process(
    text: str,
    doc_id: Optional[str] = None,
    source: str = "",
    *,
    sample_queries: Optional[List[str]] = None,
    mode_override: Optional[ChunkMode] = None,
) -> List[DocumentChunk]

Parameters:

Parameter	Type	Required	Description
`text`	`str`	✅	The raw document text to be chunked.
`doc_id`	`str \| None`	❌	Unique document identifier. Auto-generated UUID if not provided.
`source`	`str`	❌	Document origin label (filename, URL, etc.) stored in chunk metadata.
`sample_queries`	`List[str] \| None`	❌	Representative queries used to auto-select the best chunking strategy when `config.query_aware=True`.
`mode_override`	`ChunkMode \| None`	❌	Temporarily override the manager's default mode for this call only.

Returns: List[DocumentChunk] — Ordered, enriched, deduplicated chunks ready for embedding.

Returns an empty list if text is empty or contains only whitespace.

Example:

chunks = manager.process(
    text="Large document content...",
    source="annual_report_2024.pdf",
    sample_queries=["What is the revenue?", "Who are the key executives?"],
)

`process_documents`

Goal: Split a list of documents into chunks ready for embedding, in one batch.

def process_documents(
    documents: List[Document],
    *,
    sample_queries: Optional[List[str]] = None,
    mode_override: Optional[ChunkMode] = None,
) -> List[DocumentChunk]

Parameters:

Parameter	Type	Required	Description
`documents`	`List[Document]`	✅	List of `Document` objects (each holds `page_content` + `metadata`).
`sample_queries`	`List[str] \| None`	❌	Queries used for strategy selection (applied uniformly to all documents).
`mode_override`	`ChunkMode \| None`	❌	Override mode for all documents in this batch.

Returns: List[DocumentChunk] — All chunks from all documents combined in order.

Example:

from fennec_community.chunks import Document

docs = [
    Document(page_content="Doc 1 text...", metadata={"source": "file1.pdf"}),
    Document(page_content="Doc 2 text...", metadata={"source": "file2.pdf"}),
]
all_chunks = manager.process_documents(docs)

`process_with_query_optimization`

Goal: A simplified shortcut to enable the query-aware pipeline without needing to configure config.query_aware.

def process_with_query_optimization(
    text: str,
    queries: List[str],
    doc_id: Optional[str] = None,
    source: str = "",
) -> List[DocumentChunk]

Parameters:

Parameter	Type	Required	Description
`text`	`str`	✅	Document text to chunk.
`queries`	`List[str]`	✅	Sample queries that will be analyzed to select the optimal chunking strategy.
`doc_id`	`str \| None`	❌	Optional document ID.
`source`	`str`	❌	Document source label.

Returns: List[DocumentChunk]

Example:

chunks = manager.process_with_query_optimization(
    text=document_text,
    queries=["How does the API authenticate?", "What are the rate limits?"],
    source="api_docs.md",
)

`get_stats`

Goal: Extract summarized statistics from a set of chunks (useful for monitoring and diagnostics).

def get_stats(chunks: List[DocumentChunk]) -> Dict[str, Any]

Parameters:

Parameter	Type	Required	Description
`chunks`	`List[DocumentChunk]`	✅	The chunk list returned by any `process*` method.

Returns: Dict[str, Any] with the following keys:

Key	Type	Description
`count`	`int`	Total number of chunks.
`total_chars`	`int`	Sum of all characters across chunks.
`avg_chars`	`float`	Average characters per chunk.
`min_chars`	`int`	Smallest chunk size in characters.
`max_chars`	`int`	Largest chunk size in characters.
`avg_words`	`float`	Average word count per chunk.
`avg_score`	`float`	Mean importance score (0–1).
`max_score`	`float`	Highest importance score in the set.
`chunk_types`	`Dict[str, int]`	Count of each `ChunkType` value.
`unique_sections`	`int`	Number of distinct document sections represented.

Example:

stats = manager.get_stats(chunks)
print(stats)
# {'count': 42, 'avg_chars': 487.3, 'avg_score': 0.6812, ...}

Chunkers

Each chunker can be used standalone (without ChunkManager) when you need direct control. All chunkers share the same public interface via BaseChunker.chunk().

Common Interface (all chunkers)

All chunkers inherit from BaseChunker and expose these two public methods:

`chunk(text, doc_id="", source="")`

Goal: Split a single text into a complete list of chunks while properly setting the position and total_chunks.

def chunk(self, text: str, doc_id: str = "", source: str = "") -> List[DocumentChunk]

Parameter	Type	Description
`text`	`str`	Raw text to split.
`doc_id`	`str`	Parent document ID stored in each chunk's metadata.
`source`	`str`	Document origin label (filename, URL, etc.)

Returns: List[DocumentChunk]

`chunk_documents(documents)`

Goal: Batch process a list of Document objects into chunks while preserving each document’s doc_id and source.

def chunk_documents(self, documents: List[Document]) -> List[DocumentChunk]

Parameter	Type	Description
`documents`	`List[Document]`	Documents to split. Each `Document.metadata` may include a `"source"` key.

Returns: List[DocumentChunk] — all chunks from all documents, in order.

`SemanticChunker`

Goal: Split text based on semantic similarity between sentences using embedding models. A new chunk is created when similarity drops below a defined threshold, ensuring each chunk contains a semantically coherent idea.

SemanticChunker(
    config: Optional[ChunkConfig] = None,
    *,
    similarity_threshold: float = 0.75,
    min_chunk_size: int = 50,
    max_chunk_size: int = 2048,
    model_name: Optional[str] = None,
    device: Optional[str] = None,
    language: str = "auto",
)

Parameter	Type	Default	Description
`config`	`ChunkConfig \| None`	`None`	If provided, overrides individual parameters.
`similarity_threshold`	`float`	`0.75`	Cosine similarity below this value triggers a new chunk. Higher = more splits, because more adjacent sentence pairs fall below the threshold.
`min_chunk_size`	`int`	`50`	Minimum characters per chunk; short chunks are merged with the next.
`max_chunk_size`	`int`	`2048`	Hard cap: forces a new chunk regardless of similarity.
`model_name`	`str \| None`	`None`	Sentence-transformer model name. Defaults to multilingual MiniLM.
`device`	`str \| None`	`None`	`"cuda"` or `"cpu"`.
`language`	`str`	`"auto"`	Language tag stored in chunk metadata.

Best for: Conceptually diverse documents, FAQs, articles where topics shift frequently.

Example:

from fennec_community.chunks import SemanticChunker

chunker = SemanticChunker(similarity_threshold=0.70)
chunks = chunker.chunk(text, source="article.txt")
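
The threshold rule described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `bow_cosine` is a hypothetical bag-of-words stand-in for real sentence embeddings, and `semantic_split` is an illustrative name. It only shows how a similarity threshold turns a list of sentences into chunks.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Stand-in similarity: cosine between bag-of-words count vectors."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def semantic_split(sentences, similarity=bow_cosine, threshold=0.75):
    """Start a new chunk whenever adjacent-sentence similarity < threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if similarity(prev, cur) < threshold:
            chunks.append([cur])      # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]


sentences = ["cats chase mice", "cats chase birds", "stocks fell sharply"]
loose = semantic_split(sentences, threshold=0.5)
strict = semantic_split(sentences, threshold=0.75)
```

With these inputs the looser threshold keeps the two cat sentences together, while the stricter one separates all three, which matches the direction of the `similarity_threshold` parameter.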

`AdaptiveChunker`

Goal: Automatically adjust chunk size based on the information density of the text. Dense technical content is split into smaller chunks for more precise retrieval, while narrative text uses larger chunks to preserve context.

AdaptiveChunker(
    config: Optional[ChunkConfig] = None,
    *,
    base_size: int = 512,
    min_size: int = 100,
    max_size: int = 1500,
    overlap: int = 128,
    technical_threshold: float = 0.5,
    language: str = "auto",
)

Parameter	Type	Default	Description
`config`	`ChunkConfig \| None`	`None`	If provided, reads `adaptive_base_size`, `adaptive_min_size`, `adaptive_max_size`, `adaptive_technical_threshold` fields.
`base_size`	`int`	`512`	Starting chunk size before density adjustment.
`min_size`	`int`	`100`	Minimum allowed chunk size (used for dense/technical text).
`max_size`	`int`	`1500`	Maximum allowed chunk size (used for simple/narrative text).
`overlap`	`int`	`128`	Character overlap between consecutive chunks.
`technical_threshold`	`float`	`0.5`	Information density score above which text is considered "technical".
`language`	`str`	`"auto"`	Language tag.

Best for: Mixed-content documents that combine technical specifications with narrative explanations.
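
One way to picture density-adaptive sizing is a piecewise-linear mapping from a density score to a chunk size. The sketch below is an assumption for illustration only: the source does not say how `AdaptiveChunker` computes density or interpolates sizes, and `adaptive_size` is a hypothetical helper that reuses the documented defaults.

```python
def adaptive_size(
    density: float,
    base_size: int = 512,
    min_size: int = 100,
    max_size: int = 1500,
    technical_threshold: float = 0.5,
) -> int:
    """Map a density score in [0, 1] to a chunk size (illustrative only).

    density == threshold -> base_size
    density == 1.0       -> min_size (dense text, small chunks)
    density == 0.0       -> max_size (narrative text, large chunks)
    """
    density = min(max(density, 0.0), 1.0)
    if density >= technical_threshold:
        span = 1.0 - technical_threshold
        frac = (density - technical_threshold) / span if span > 0 else 1.0
        return round(base_size - frac * (base_size - min_size))
    frac = (technical_threshold - density) / technical_threshold
    return round(base_size + frac * (max_size - base_size))


sizes = [adaptive_size(d) for d in (0.0, 0.5, 0.75, 1.0)]
```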

`StructureAwareChunker`

Goal: Split the document while respecting its natural structure (Markdown/HTML headings, paragraphs, code blocks, tables, and lists). It ensures that important structural elements are not split in the middle.

StructureAwareChunker(
    config: Optional[ChunkConfig] = None,
    *,
    max_chunk_size: int = 1024,
    min_chunk_size: int = 50,
    language: str = "auto",
    split_on_headers: bool = True,
    preserve_code_blocks: bool = True,
)

Parameter	Type	Default	Description
`config`	`ChunkConfig \| None`	`None`	Reads `max_chunk_size`, `split_on_headers`, `preserve_code_blocks`.
`max_chunk_size`	`int`	`1024`	Sections exceeding this are further split by sentence.
`min_chunk_size`	`int`	`50`	Short sections are merged with adjacent ones.
`language`	`str`	`"auto"`	Language tag.
`split_on_headers`	`bool`	`True`	Create a new chunk at each Markdown/HTML heading.
`preserve_code_blocks`	`bool`	`True`	Keep code fences as single chunks regardless of size.

Best for: Markdown documentation, HTML pages, structured technical manuals.
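
The idea of splitting at headings while keeping code fences intact can be sketched as follows. This is a simplified illustration for Markdown only, not the library's parser; `split_on_headers` is a hypothetical name, and size limits and merging of short sections are left out.

```python
import re

FENCE = "`" * 3                        # Markdown code-fence marker
HEADER = re.compile(r"^#{1,6}\s+\S")   # ATX heading, levels 1 to 6


def split_on_headers(markdown: str) -> list[str]:
    """Start a new section at each heading, never inside a code fence."""
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence            # entering or leaving a fence
        elif not in_fence and HEADER.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


doc = "\n".join(
    ["# Title", "intro", FENCE, "# not a header", FENCE, "## Sub", "body"]
)
sections = split_on_headers(doc)
```

The `# not a header` line sits inside a fence, so it stays in the first section instead of opening a new one.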

`ContextAwareChunker`

Goal: Preserve context using sliding windows. Each chunk includes core sentences plus surrounding sentences before and after them, ensuring no loss of context at chunk boundaries.

ContextAwareChunker(
    config: Optional[ChunkConfig] = None,
    *,
    chunk_size: int = 512,
    overlap: int = 128,
    window_size: int = 2,
    min_chunk_size: int = 50,
    language: str = "auto",
)

Parameter	Type	Default	Description
`config`	`ChunkConfig \| None`	`None`	Reads `chunk_size`, `overlap`, `context_window_size`, `min_context_sentences`.
`chunk_size`	`int`	`512`	Target character count for primary sentence groups.
`overlap`	`int`	`128`	Characters from previous chunk carried into the next.
`window_size`	`int`	`2`	Number of sentences added as context on each side of a primary group.
`min_chunk_size`	`int`	`50`	Chunks below this size are discarded.
`language`	`str`	`"auto"`	Language tag.

Best for: Conversational text, narrative documents, Q&A content where answer context spans sentence boundaries.
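
The sliding-window idea can be sketched with sentence indices. In this illustration the primary groups are formed by a fixed sentence count (`group_size`, a hypothetical parameter standing in for the character-based `chunk_size`), and `window_size` sentences are added on each side. It is a sketch of the concept, not the library's algorithm.

```python
def with_context(
    sentences: list[str], group_size: int, window_size: int
) -> list[str]:
    """Build chunks of `group_size` core sentences plus surrounding context."""
    chunks = []
    for start in range(0, len(sentences), group_size):
        end = min(start + group_size, len(sentences))
        lo = max(0, start - window_size)              # context before
        hi = min(len(sentences), end + window_size)   # context after
        chunks.append(" ".join(sentences[lo:hi]))
    return chunks


sentences = [f"S{i}." for i in range(6)]
chunks = with_context(sentences, group_size=2, window_size=1)
```

Each chunk repeats one sentence from its neighbours, so text near a boundary appears in both adjacent chunks.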

`ArabicTextChunker`

Goal: Specialized processing for Arabic text, including normalization (unifying forms of Alef, Ta Marbuta, Ya, and removing diacritics), spacing correction, and semantic splitting using Arabic BERT models. It supports both pure Arabic text and mixed Arabic-English content.

ArabicTextChunker(
    chunk_size: int = 512,
    overlap: int = 128,
    min_chunk_size: int = 50,
    model_name: str = "CAMeL-Lab/bert-base-arabic-camelbert-mix",
    use_semantic_chunking: bool = False,
    device: Optional[str] = None,
    preserve_formatting: bool = True,
    fix_spacing: bool = True,
    strict_size_limit: bool = True,
    size_tolerance: float = 0.05,
    use_smart_overlap: bool = True,
    smart_overlap_threshold: float = 0.7,
)

Parameter	Type	Default	Description
`chunk_size`	`int`	`512`	Maximum characters per chunk.
`overlap`	`int`	`128`	Character overlap between chunks.
`min_chunk_size`	`int`	`50`	Minimum characters; smaller chunks are skipped.
`model_name`	`str`	CAMeL BERT	HuggingFace model for Arabic BERT embeddings.
`use_semantic_chunking`	`bool`	`False`	Enable BERT-based semantic splitting (requires `torch` + `transformers`).
`device`	`str \| None`	`None`	`"cuda"` or `"cpu"`.
`preserve_formatting`	`bool`	`True`	Keep original whitespace and formatting structure.
`fix_spacing`	`bool`	`True`	Auto-fix spacing issues common in Arabic text (e.g., missing spaces after punctuation).
`strict_size_limit`	`bool`	`True`	Hard-enforce `chunk_size`; oversized chunks are force-split at word boundaries.
`size_tolerance`	`float`	`0.05`	Allowed overshoot fraction (5% by default). Only applies when `strict_size_limit=True`.
`use_smart_overlap`	`bool`	`True`	Use semantic similarity to choose the most relevant overlap sentences.
`smart_overlap_threshold`	`float`	`0.7`	Minimum cosine similarity for a sentence to be included in smart overlap.
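
The normalization step mentioned in the goal above can be sketched with a few character mappings. The exact mappings used by `ArabicTextChunker` are not stated in this document, so the choices below (Alef variants to bare Alef, Ta Marbuta to Ha, Alef Maqsura to Ya, removal of the common tashkeel range) are assumptions that follow widespread practice, and `normalize_arabic` is a hypothetical helper.

```python
import re

# Assumed mappings; the library may normalize differently.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")   # tashkeel + dagger alef
_ALEF_FORMS = re.compile("[\u0622\u0623\u0625]")    # madda / hamza variants


def normalize_arabic(text: str) -> str:
    """Remove diacritics and unify common letter variants."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_FORMS.sub("\u0627", text)       # Alef variants -> bare Alef
    text = text.replace("\u0629", "\u0647")      # Ta Marbuta -> Ha
    text = text.replace("\u0649", "\u064A")      # Alef Maqsura -> Ya
    return text


# "Ahmad" written with hamza and diacritics.
raw = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"
clean = normalize_arabic(raw)
```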

Class methods:

`ArabicTextChunker.create_safely(**kwargs)`

Goal: Safely create an instance while automatically ignoring any unknown parameters (useful when passing a dynamic configuration).

@classmethod
def create_safely(cls, **kwargs) -> ArabicTextChunker

Returns: ArabicTextChunker

`ArabicTextChunker.get_available_parameters()`

Goal: Retrieve a list of all accepted parameters in the __init__ method.

@classmethod
def get_available_parameters(cls) -> List[str]

Returns: List[str]

Public utility methods:

`get_arabic_ratio(text)`

Goal: Calculate the ratio of Arabic characters to the total number of characters in the text.

def get_arabic_ratio(self, text: str) -> float

Returns: float in range [0.0, 1.0].

`is_arabic_dominant(text, threshold=0.3)`

Goal: Determine whether the text is primarily Arabic (used for automatic language detection).

def is_arabic_dominant(self, text: str, threshold: float = 0.3) -> bool

Returns: True if Arabic character ratio ≥ threshold.
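
The two utilities above amount to a character count and a comparison. The sketch below assumes that "Arabic characters" means the Unicode block U+0600 to U+06FF and that the denominator is every character in the text, whitespace included. Both are assumptions, since the source does not define them, and the function names are illustrative.

```python
def arabic_ratio(text: str) -> float:
    """Fraction of characters in the Arabic Unicode block (U+0600..U+06FF)."""
    if not text:
        return 0.0
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    return arabic / len(text)


def is_arabic_dominant(text: str, threshold: float = 0.3) -> bool:
    """True when the Arabic ratio reaches the threshold."""
    return arabic_ratio(text) >= threshold


mixed = "ab\u0645\u0631"   # two Latin letters, two Arabic letters
ratio = arabic_ratio(mixed)
```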

`MultilanguageTextChunker`

Goal: Process multilingual text (17+ languages) with automatic language detection and selection of the appropriate BERT model for each language. It supports both rule-based and semantic splitting.

MultilanguageTextChunker(
    chunk_size: int = 512,
    overlap: int = 128,
    min_chunk_size: int = 50,
    model_name: Optional[str] = None,
    language: str = "auto",
    use_semantic_chunking: bool = False,
    device: Optional[str] = None,
    use_smart_overlap: bool = True,
    smart_overlap_threshold: float = 0.7,
    strict_size_limit: bool = True,
    size_tolerance: float = 0.05,
)

Parameter	Type	Default	Description
`chunk_size`	`int`	`512`	Target chunk size in characters.
`overlap`	`int`	`128`	Overlap in characters between consecutive chunks.
`min_chunk_size`	`int`	`50`	Minimum chunk size; smaller chunks are discarded.
`model_name`	`str \| None`	`None`	Override the auto-selected language model.
`language`	`str`	`"auto"`	Language code (`"arabic"`, `"chinese"`, `"english"`, etc.) or `"auto"` for detection.
`use_semantic_chunking`	`bool`	`False`	Enable BERT-based semantic splitting.
`device`	`str \| None`	`None`	`"cuda"` or `"cpu"`.
`use_smart_overlap`	`bool`	`True`	Use semantic-similarity-based overlap selection.
`smart_overlap_threshold`	`float`	`0.7`	Similarity threshold for smart overlap.
`strict_size_limit`	`bool`	`True`	Enforce hard size limits.
`size_tolerance`	`float`	`0.05`	Allowed size overshoot fraction.

Supported languages: arabic, chinese, japanese, korean, russian, hindi, english, french, german, spanish, portuguese, italian, dutch, polish, turkish, vietnamese, multilingual.

Public utility methods:

`get_supported_languages()`

Goal: Retrieve the names of all supported languages, each of which maps to a corresponding BERT model.

def get_supported_languages(self) -> List[str]

Returns: List[str] — e.g. ["multilingual", "english", "arabic", "chinese", ...]

Example:

chunker = MultilanguageTextChunker()
print(chunker.get_supported_languages())
# ['multilingual', 'english', 'arabic', 'chinese', 'french', ...]

`switch_language(language)`

Goal: Switch to a different language and its corresponding BERT model at runtime without re-instantiating the object. Useful for multilingual pipelines operating on the same chunker instance.

def switch_language(self, language: str) -> None

Parameter	Type	Description
`language`	`str`	Target language code (e.g. `"arabic"`, `"french"`). Falls back to `"multilingual"` if unsupported.

Returns: None. The new model is loaded in-place.

Example:

chunker = MultilanguageTextChunker(language="english")
chunks_en = chunker.chunk(english_text)

chunker.switch_language("arabic")
chunks_ar = chunker.chunk(arabic_text)

Text Splitters

Lightweight splitters that return List[str] instead of List[DocumentChunk]. Useful as building blocks or drop-in replacements.

Common Interface (all text splitters)

All splitters inherit from TextSplitter and expose these shared methods in addition to split_text:

`split_documents(documents)`

Goal: Split a list of Document objects and return a new list of chunked Document objects while preserving the original document metadata in each chunk.

def split_documents(self, documents: List[Document]) -> List[Document]

Parameter	Type	Description
`documents`	`List[Document]`	Input documents to split.

Returns: List[Document] — each item is a chunk as a Document with page_content = chunk_text and the original document's metadata copied.

`create_documents(texts, metadatas=None)`

Goal: Convert a list of raw texts (with optional metadata) into a list of chunked Document objects. Useful as a factory method when manually loading text data.

def create_documents(
    self,
    texts: List[str],
    metadatas: Optional[List[Dict[str, Any]]] = None,
) -> List[Document]

Parameter	Type	Description
`texts`	`List[str]`	Raw texts to split into documents.
`metadatas`	`List[dict] \| None`	Optional metadata dicts, one per text. Defaults to empty dicts if omitted.

Returns: List[Document] — each split chunk becomes a Document carrying the corresponding metadata.

Example:

from fennec_community.chunks import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)

docs = splitter.create_documents(
    texts=["Long document one...", "Long document two..."],
    metadatas=[{"source": "file1.txt"}, {"source": "file2.txt"}],
)

`CharacterTextSplitter`

Goal: Split text using a fixed delimiter with optional overlap support. This is the simplest and fastest of the text splitters.

CharacterTextSplitter(
    separator: str = "\n\n",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)

Parameter	Type	Default	Description
`separator`	`str`	`"\n\n"`	String to split on (double newline = paragraph boundary).
`chunk_size`	`int`	`1000`	Maximum characters per chunk.
`chunk_overlap`	`int`	`200`	Characters to repeat at the start of the next chunk.
`length_function`	`Callable`	`len`	Function to measure chunk size (can be replaced with token counter).

`split_text(text)`

def split_text(self, text: str) -> List[str]

Returns: List[str] — list of text chunks.

`RecursiveCharacterTextSplitter`

Goal: Hierarchical splitting that tries a sequence of separators from the coarsest (paragraph breaks) to the finest (individual characters). It preserves natural text boundaries as much as possible and includes special handling for Arabic text.

RecursiveCharacterTextSplitter(
    separators: Optional[List[str]] = None,
    arabic_mode: bool = False,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)

Parameter	Type	Default	Description
`separators`	`List[str] \| None`	`None`	Custom separator hierarchy. Defaults to `["\n\n", "\n", ".", "؟", "!", "،", "؛", " ", ""]`.
`arabic_mode`	`bool`	`False`	Use Arabic-optimized separator list when `True`.
`chunk_size`	`int`	`1000`	Maximum characters per chunk.
`chunk_overlap`	`int`	`200`	Overlap in characters.
`length_function`	`Callable`	`len`	Size measurement function.

`split_text(text)`

def split_text(self, text: str) -> List[str]

Returns: List[str]
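
The separator hierarchy can be illustrated with a short recursive function. This is a simplified sketch of the general technique, not the library's code: it ignores `chunk_overlap` and any Arabic-specific handling, drops the separator between chunks, and `recursive_split` is a hypothetical name.

```python
def recursive_split(
    text: str, separators: list[str], chunk_size: int
) -> list[str]:
    """Split on the first separator; recurse with finer ones when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate                  # keep packing this chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            chunks.extend(recursive_split(part, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


pieces = recursive_split(
    "aaaa bbbb\n\ncccc dddd eeee", ["\n\n", " ", ""], chunk_size=10
)
```

The first paragraph fits as is, while the second is too long and falls through to the space separator.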

`TokenTextSplitter`

Goal: Split text based on token count (instead of character count) using tiktoken. This is essential when working with OpenAI models to ensure the context window limit is not exceeded.

Requires: pip install tiktoken

TokenTextSplitter(
    encoding_name: str = "cl100k_base",
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)

Parameter	Type	Default	Description
`encoding_name`	`str`	`"cl100k_base"`	Tiktoken encoding (matches GPT-4 / text-embedding-ada-002). The default encoding can be overridden globally via `ChunkConfig.model_token`.
`chunk_size`	`int`	`1000`	Maximum tokens per chunk.
`chunk_overlap`	`int`	`200`	Overlap in tokens.

`split_text(text)`

def split_text(self, text: str) -> List[str]

Returns: List[str] — chunks guaranteed to be within chunk_size tokens.
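
How `chunk_size` and `chunk_overlap` interact when both are measured in tokens can be shown on a plain list of tokens. The sketch below does not call tiktoken. It works on any token sequence (integers here), and `token_windows` is a hypothetical helper that only illustrates the windowing arithmetic.

```python
def token_windows(tokens: list, chunk_size: int, chunk_overlap: int) -> list[list]:
    """Slide a window of `chunk_size` tokens, advancing by size minus overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                      # the last window reached the end
    return windows


windows = token_windows(list(range(10)), chunk_size=4, chunk_overlap=1)
```

With a real tokenizer, each window of token IDs would be decoded back to text to form one chunk.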

`SentenceTextSplitter`

Goal: Split text while ensuring sentences are never cut in the middle. This is suitable for cases where preserving complete sentences is essential for understanding, with full support for Arabic.

SentenceTextSplitter(
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    length_function: Callable = len,
)

Parameter	Type	Default	Description
`chunk_size`	`int`	`1000`	Maximum characters per chunk.
`chunk_overlap`	`int`	`200`	Character overlap between chunks.
`length_function`	`Callable`	`len`	Size measurement function.

`split_text(text)`

def split_text(self, text: str) -> List[str]

Returns: List[str] — chunks always ending at sentence boundaries.

Supporting Components

`EmbeddingProvider`

Goal: Provide text embeddings with smart caching (LRU) and batch processing to speed up operations. It uses sentence-transformers and automatically falls back to TF-IDF when sentence-transformers is unavailable.

Note: Implemented as a singleton per (model_name, device) pair — calling the constructor with the same arguments always returns the same cached instance.

Module-level function: `cosine_similarity_matrix`

Goal: Compute a cosine similarity matrix between two embedding sets efficiently in a single batch using matrix operations. It is used internally in the Deduplicator for detecting near-duplicate content.

from fennec_community.chunks import cosine_similarity_matrix

cosine_similarity_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray

Parameter	Type	Description
`A`	`np.ndarray`	Matrix of shape `(n, dim)` — first set of embedding vectors.
`B`	`np.ndarray`	Matrix of shape `(m, dim)` — second set of embedding vectors.

Returns: np.ndarray of shape (n, m) — every cell [i, j] holds the cosine similarity between A[i] and B[j].

Example:

import numpy as np
from fennec_community.chunks import cosine_similarity_matrix, EmbeddingProvider

provider = EmbeddingProvider("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
texts_a = ["Hello world", "Machine learning is great"]
texts_b = ["Hi there", "AI is fascinating", "Cooking recipes"]

A = np.array(provider.embed_batch(texts_a))
B = np.array(provider.embed_batch(texts_b))

sim_matrix = cosine_similarity_matrix(A, B)
# sim_matrix.shape == (2, 3)
print(sim_matrix)
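
For readers who want to see the arithmetic behind the (n, m) result, here is a dependency-free sketch using plain lists instead of NumPy arrays. `cosine_matrix_py` is a hypothetical name, and the zero-vector handling (similarity 0.0) is an assumption, not documented library behaviour.

```python
import math


def cosine_matrix_py(A: list[list[float]], B: list[list[float]]) -> list[list[float]]:
    """Return an n x m matrix where cell [i][j] is cos(A[i], B[j])."""

    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm else [0.0] * len(v)

    A_unit = [unit(row) for row in A]
    B_unit = [unit(row) for row in B]
    # After normalization, cosine similarity is just a dot product.
    return [[sum(x * y for x, y in zip(a, b)) for b in B_unit] for a in A_unit]


A = [[1.0, 0.0], [0.0, 2.0]]
B = [[3.0, 0.0], [1.0, 1.0], [0.0, -1.0]]
sim = cosine_matrix_py(A, B)
```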

EmbeddingProvider(
    model_name: str,
    device: Optional[str] = None,
    cache_size: int = 10_000,
)

Parameter	Type	Default	Description
`model_name`	`str`	—	HuggingFace sentence-transformers model name.
`device`	`str \| None`	`None`	`"cuda"` or `"cpu"`.
`cache_size`	`int`	`10,000`	Maximum number of text embeddings to cache in memory (LRU eviction). Corresponds to `ChunkConfig.embedding_cache_size`.

`embed(text)`

Goal: Compute the embedding for a single text with caching support.

def embed(self, text: str) -> np.ndarray

Returns: np.ndarray — 1-D embedding vector.

`embed_batch(texts, batch_size=64)`

Goal: Efficiently compute embeddings for a list of texts, skipping any entries already stored in the cache.

def embed_batch(self, texts: List[str], batch_size: int = 64) -> List[np.ndarray]

Parameter	Type	Description
`texts`	`List[str]`	Texts to embed.
`batch_size`	`int`	Number of texts processed per inference call. Corresponds to `ChunkConfig.embedding_batch_size`.

Returns: List[np.ndarray] — Same order as input.
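
The combination of LRU caching and batching can be sketched with an `OrderedDict`. The class below is a hypothetical illustration, not `EmbeddingProvider` itself: `embed_fn` stands in for the real model call, and vectors are plain lists instead of NumPy arrays.

```python
from collections import OrderedDict


class CachedEmbedder:
    """Illustrative LRU-cached batch embedder."""

    def __init__(self, embed_fn, cache_size: int = 10_000):
        self._embed_fn = embed_fn        # list[str] -> list[vector]
        self._cache = OrderedDict()      # text -> vector, oldest first
        self._cache_size = cache_size

    def embed_batch(self, texts, batch_size: int = 64):
        # Unique texts that are not cached yet, in first-seen order.
        missing = list(dict.fromkeys(t for t in texts if t not in self._cache))
        fresh = {}
        for i in range(0, len(missing), batch_size):
            batch = missing[i:i + batch_size]
            fresh.update(zip(batch, self._embed_fn(batch)))

        # Resolve every result before touching the cache, so eviction
        # cannot remove an entry this call still needs.
        results = [fresh[t] if t in fresh else self._cache[t] for t in texts]
        for text, vec in zip(texts, results):
            self._cache[text] = vec
            self._cache.move_to_end(text)            # mark as recently used
            if len(self._cache) > self._cache_size:
                self._cache.popitem(last=False)      # evict least recent
        return results


calls = []


def fake_embed(batch):
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]


embedder = CachedEmbedder(fake_embed, cache_size=2)
first = embedder.embed_batch(["a", "bb", "a"])
second = embedder.embed_batch(["a", "ccc"])
```

The second call only sends `"ccc"` to the model, because `"a"` is still cached.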

`pairwise_similarity(a, b)`

Goal: Compute the semantic similarity (cosine similarity) between two texts.

def pairwise_similarity(self, a: str, b: str) -> float

Returns: float in [-1.0, 1.0]. Values close to 1.0 indicate high similarity.

`adjacent_similarities(sentences)`

Goal: Compute cosine similarity between each sentence and the next one in a list (used internally in SemanticChunker).

def adjacent_similarities(self, sentences: List[str]) -> List[float]

Returns: List[float] of length len(sentences) - 1.

Property:

Property	Type	Description
`dim`	`int`	Embedding vector dimension.

`MetadataEnricher`

Goal: Enrich each chunk with keywords (TF-IDF), keyword density, and a composite importance score that considers position, length, and lexical diversity.

MetadataEnricher(
    max_keywords: int = 10,
    header_score_boost: float = 1.5,
)

Parameter	Type	Default	Description
`max_keywords`	`int`	`10`	Maximum number of keywords extracted per chunk.
`header_score_boost`	`float`	`1.5`	Score multiplier applied to header/section-title chunks.

`enrich(chunks)`

Goal: Enrich a full list of chunks using corpus-level TF-IDF, producing more accurate results than per-chunk enrichment.

def enrich(self, chunks: List[DocumentChunk]) -> List[DocumentChunk]

Returns: The same List[DocumentChunk] with updated metadata.keywords, metadata.keyword_density, and metadata.score.
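
Corpus-level TF-IDF keyword extraction can be sketched in a few lines. The tokenizer, the smoothed IDF formula, and the function name `tfidf_keywords` are assumptions chosen for illustration; the library's scoring may differ.

```python
import math
import re
from collections import Counter


def tfidf_keywords(chunks: list[str], max_keywords: int = 10) -> list[list[str]]:
    """Return the top TF-IDF terms of each chunk, scored against all chunks."""
    docs = [re.findall(r"\w+", c.lower()) for c in chunks]
    n = len(docs)
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:max_keywords])
    return keywords


result = tfidf_keywords(["apple apple banana", "banana cherry", "banana date"])
```

Because `banana` occurs in every chunk, its IDF is low and rarer terms such as `apple` or `cherry` rank first. This is the benefit of scoring against the whole corpus rather than one chunk at a time.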

`enrich_single(chunk)`

Goal: Independently enrich a single chunk (without corpus context) using TF only (no IDF). Suitable for real-time pipelines.

def enrich_single(self, chunk: DocumentChunk) -> DocumentChunk

Returns: The same DocumentChunk with enriched metadata.

`Deduplicator`

Goal: Remove duplicate chunks using a two-stage process: exact hash matching for literal duplicates (fast, O(n)), and cosine similarity for near-duplicate detection (slower, O(n²)).

Deduplicator(
    use_hash: bool = True,
    use_similarity: bool = False,
    similarity_threshold: float = 0.95,
    embedding_model: Optional[str] = None,
    device: Optional[str] = None,
)

Parameter	Type	Default	Description
`use_hash`	`bool`	`True`	Enable fast SHA-256 exact-match deduplication.
`use_similarity`	`bool`	`False`	Enable cosine similarity near-duplicate detection. Requires `embedding_model`.
`similarity_threshold`	`float`	`0.95`	Chunks with cosine similarity ≥ this value are considered duplicates. Corresponds to `ChunkConfig.dedup_similarity_threshold`.
`embedding_model`	`str \| None`	`None`	Model name for similarity-based dedup. Required if `use_similarity=True`.
`device`	`str \| None`	`None`	`"cuda"` or `"cpu"`.

⚠️ Important — ChunkManager behaviour: When using ChunkManager, similarity-based deduplication (use_similarity) is always disabled internally regardless of ChunkConfig.dedup_similarity_threshold. To use similarity-based deduplication, instantiate Deduplicator directly and call deduplicate() on the chunks after chunking.

`deduplicate(chunks)`

Goal: Apply a two-stage deduplication process to a list of chunks and return a cleaned list.

def deduplicate(self, chunks: List[DocumentChunk]) -> List[DocumentChunk]

Returns: List[DocumentChunk] — deduplicated, preserving original order. The longer chunk is kept when two near-duplicates are found.
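
The two stages can be sketched on plain strings. This is an illustration under stated assumptions: text is normalized by lowercasing and collapsing whitespace before hashing, and `jaccard` is a cheap word-overlap stand-in for embedding cosine similarity. The names are hypothetical and this is not the library's code.

```python
import hashlib


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: word-set overlap."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def dedupe(texts, similarity=None, threshold=0.95):
    # Stage 1: exact duplicates via SHA-256 of normalized text, O(n).
    seen, kept = set(), []
    for text in texts:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    if similarity is None:
        return kept

    # Stage 2: near duplicates via pairwise similarity, O(n^2).
    result = []
    for text in kept:
        for i, prev in enumerate(result):
            if similarity(text, prev) >= threshold:
                if len(text) > len(prev):
                    result[i] = text     # keep the longer of the pair
                break
        else:
            result.append(text)
    return result


texts = [
    "Hello world",
    "hello   world",
    "the quick brown fox jumps",
    "the quick brown fox jumps high",
    "other",
]
hash_only = dedupe(texts)
both = dedupe(texts, similarity=jaccard, threshold=0.8)
```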

`DocumentStructureParser`

Goal: Analyze the document structure (Markdown, HTML, or plain text) and extract a list of StructuredSection objects used by StructureAwareChunker to perform structure-aware splitting.

DocumentStructureParser()

`detect_document_type(text)`

Goal: Detect the document type (Markdown, HTML, Arabic, or plain text).

def detect_document_type(self, text: str) -> DocumentType

Returns: DocumentType — one of MARKDOWN, HTML, ARABIC, PLAIN_TEXT.

Detection logic:

HTML tags → DocumentType.HTML
Markdown headers / code fences / bold → DocumentType.MARKDOWN
Arabic character ratio > 20% → DocumentType.ARABIC
Otherwise → DocumentType.PLAIN_TEXT
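
The detection order listed above can be sketched as a chain of checks. The regular expressions and the Arabic ratio calculation below are assumptions for illustration (the source does not give the actual patterns), and `DocType` and `detect` are hypothetical names that mirror `DocumentType` and `detect_document_type`.

```python
import re
from enum import Enum


class DocType(Enum):
    MARKDOWN = "markdown"
    HTML = "html"
    ARABIC = "arabic"
    PLAIN_TEXT = "plain_text"


HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
# Heading, code fence (three backticks), or bold text.
MARKDOWN_HINT = re.compile(r"^#{1,6}\s|^`{3}|\*\*[^*]+\*\*", re.MULTILINE)


def detect(text: str) -> DocType:
    """Apply the checks in the documented order; first match wins."""
    if HTML_TAG.search(text):
        return DocType.HTML
    if MARKDOWN_HINT.search(text):
        return DocType.MARKDOWN
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    if text and arabic / len(text) > 0.20:
        return DocType.ARABIC
    return DocType.PLAIN_TEXT


kinds = [
    detect("<p>hi</p>"),
    detect("# Title\ntext"),
    detect("\u0645\u0631\u062D\u0628\u0627"),
    detect("plain words"),
]
```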

`parse(text, doc_type)`

Goal: Analyze the document and return a list of structured sections.

def parse(self, text: str, doc_type: DocumentType) -> List[StructuredSection]

Parameter	Type	Description
`text`	`str`	Document text.
`doc_type`	`DocumentType`	Document type (use `detect_document_type` first if unknown).

Returns: List[StructuredSection] — see StructuredSection for field details.

`QueryPatternAnalyzer`

Goal: Analyze user query patterns and/or document content to recommend the optimal chunking strategy (chunk size, overlap, semantic vs. adaptive splitting).

QueryPatternAnalyzer(default_strategy: str = "general")

Parameter	Type	Default	Description
`default_strategy`	`str`	`"general"`	Fallback strategy name when no pattern is detected. One of: `"faq"`, `"documentation"`, `"technical"`, `"narrative"`, `"general"`.

`analyze_queries(queries)`

Goal: Recommend a strategy based on a set of representative queries using majority voting.

def analyze_queries(self, queries: List[str]) -> ChunkingStrategy

Returns: ChunkingStrategy dataclass with name, chunk_size, overlap, use_semantic, use_adaptive, use_structure.

`analyze_document(text)`

Goal: Recommend a strategy based on the document’s own content.

def analyze_document(self, text: str) -> ChunkingStrategy

Returns: ChunkingStrategy

`analyze(text, sample_queries=None)`

Goal: Perform a combined analysis of both the document and the queries. Queries take priority if they indicate a specific (non-generic) pattern.

def analyze(
    self,
    text: str,
    sample_queries: Optional[List[str]] = None,
) -> ChunkingStrategy

Returns: ChunkingStrategy

`get_strategy(name)`

Goal: Return a specific strategy by name.

def get_strategy(self, name: str) -> ChunkingStrategy

Returns: ChunkingStrategy — returns the default strategy if name is unknown.

Available strategies:

Name	chunk_size	overlap	Use Case
`faq`	256	32	Q&A documents, frequently asked questions
`documentation`	768	128	Technical docs, structured manuals
`technical`	400	80	Dense code, APIs, specifications
`narrative`	1024	200	Stories, novels, long-form prose
`general`	512	128	Balanced default for mixed content
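
The table and the fallback behaviour of `get_strategy` can be modelled as a dictionary lookup. The sketch below is an assumption about the shape of the lookup, not the library's code: `StrategySketch` carries only the sizing columns shown above (the `use_*` flags per strategy are not documented here), and `get_strategy_sketch` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StrategySketch:  # hypothetical; only the documented sizing fields
    name: str
    chunk_size: int
    overlap: int


# Values copied from the "Available strategies" table.
PRESETS: Dict[str, StrategySketch] = {
    s.name: s
    for s in (
        StrategySketch("faq", 256, 32),
        StrategySketch("documentation", 768, 128),
        StrategySketch("technical", 400, 80),
        StrategySketch("narrative", 1024, 200),
        StrategySketch("general", 512, 128),
    )
}


def get_strategy_sketch(name: str, default: str = "general") -> StrategySketch:
    """Return the named preset, falling back to the default for unknown names."""
    return PRESETS.get(name, PRESETS[default])
```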

Data Models

`DocumentChunk`

The core output unit. Every chunker, and ChunkManager itself, ultimately produces DocumentChunk instances (text splitters' split_text returns plain strings instead).

@dataclass
class DocumentChunk:
    text: str                          # The chunk's text content
    chunk_id: str                      # UUID (auto-generated)
    doc_id: str                        # Parent document ID
    embedding: Optional[np.ndarray]    # Vector embedding (if computed)
    metadata: ChunkMetadata            # Rich metadata

Read-only properties:

Property	Type	Description
`id`	`str`	Alias for `chunk_id`.
`content_hash`	`str`	SHA-256 hash of normalized text (for deduplication).
`char_count`	`int`	Character count of `text`.
`word_count`	`int`	Word count of `text`.

Methods:

Method	Returns	Description
`to_dict()`	`Dict[str, Any]`	JSON-serializable representation of the chunk and all its metadata.
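
The derived properties are simple to reproduce. The following sketch shows one plausible way to compute them; `ChunkSketch` is a hypothetical stand-in for DocumentChunk, and the normalization used before hashing (lowercasing and collapsing whitespace) is an assumption, since the exact normalization is not specified here.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class ChunkSketch:  # hypothetical stand-in for DocumentChunk
    text: str

    @property
    def content_hash(self) -> str:
        # Assumed normalization: collapse whitespace runs, trim, lowercase.
        normalized = re.sub(r"\s+", " ", self.text).strip().lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    @property
    def char_count(self) -> int:
        return len(self.text)

    @property
    def word_count(self) -> int:
        # Whitespace-delimited tokens; a tokenizer-based count would differ.
        return len(self.text.split())
```

Hashing a normalized form means chunks that differ only in spacing or letter case collapse to the same digest, which is what makes the hash usable for exact deduplication.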

`ChunkMetadata`

@dataclass
class ChunkMetadata:
    source: str            # Document origin (filename, URL)
    page: Optional[int]    # Page number (if applicable)
    section: Optional[str] # Section/heading title
    position: int          # Index of this chunk in the document (0-based)
    total_chunks: int      # Total chunks in the parent document
    chunk_type: ChunkType  # Type classification
    document_type: DocumentType
    language: str
    keywords: List[str]    # Top TF-IDF keywords
    score: float           # Importance score [0.0 – 1.0]
    keyword_density: float # Ratio of meaningful words
    is_header: bool        # True if this chunk is a document heading
    heading_level: int     # 0 = not a header; 1–6 = H1–H6
    char_start: int        # Character offset in original document
    char_end: int          # Character offset end in original document
    extra: Dict[str, Any]  # Chunker-specific extra data

`StructuredSection`

The output unit of DocumentStructureParser.parse(). Represents a single structural unit within a document before it is converted to a DocumentChunk.

@dataclass
class StructuredSection:
    text: str                  # The section's raw text content
    heading: str               # Title of the nearest parent heading (empty if none)
    heading_level: int         # 0 = body text; 1–6 = H1–H6
    page: Optional[int]        # Page number if available (e.g. from PDF parsers)
    section_index: int         # Sequential index of this section within the document
    is_code_block: bool        # True if the section is a fenced code block
    is_list: bool              # True if the section is a list (ordered or unordered)
    is_table: bool             # True if the section is a table
    char_start: int            # Character offset of the section start in the original text
    char_end: int              # Character offset of the section end in the original text
    extra: Dict[str, Any]      # Parser-specific additional data

Example:

from fennec_community.chunks import DocumentStructureParser

parser = DocumentStructureParser()
doc_type = parser.detect_document_type(markdown_text)
sections = parser.parse(markdown_text, doc_type)

for s in sections:
    print(f"[H{s.heading_level}] {s.heading!r} — code={s.is_code_block} list={s.is_list}")
    print(f"  chars [{s.char_start}:{s.char_end}]: {s.text[:60]}")

`Document`

Input container compatible with LangChain conventions.

@dataclass
class Document:
    page_content: str            # The raw text
    metadata: Dict[str, Any]     # Arbitrary key-value metadata
    doc_id: str                  # Auto-generated UUID

Enumerations

`ChunkType`

PARAGRAPH, SENTENCE, SECTION, CODE_BLOCK, LIST_ITEM, HEADER, TABLE, SEMANTIC, ADAPTIVE, STRUCTURAL, WINDOW

`DocumentType`

PLAIN_TEXT, MARKDOWN, HTML, PDF, ARABIC, MIXED

Base Classes

These abstract base classes are exported from the package (from fennec_community.chunks import BaseChunker, TextSplitter, ChunkingStrategy) and are intended for building custom chunkers and splitters that integrate with the library's pipeline.

`BaseChunker`

Abstract base class for all chunkers. Subclass this to implement a custom splitting strategy that is compatible with ChunkManager and chunk_documents.

from fennec_community.chunks import BaseChunker, DocumentChunk
from typing import List

class MyChunker(BaseChunker):
    def _chunk_impl(self, text: str, doc_id: str, source: str) -> List[DocumentChunk]:
        # implement your splitting logic here
        ...

You must implement: _chunk_impl(text, doc_id, source) -> List[DocumentChunk]

You inherit for free: chunk(), chunk_documents(), and _finalize() (sets position and total_chunks on each chunk).
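
The division of labour between `chunk()`, `_chunk_impl()`, and `_finalize()` follows the template-method pattern. The standalone sketch below mimics that pattern so it can run without the library; `MiniChunk`, `MiniBaseChunker`, and `ParagraphChunker` are hypothetical stand-ins, and the real base class may do more in each step.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class MiniChunk:  # hypothetical stand-in for DocumentChunk plus its metadata
    text: str
    position: int = 0
    total_chunks: int = 0


class MiniBaseChunker(ABC):  # hypothetical stand-in for BaseChunker
    def chunk(self, text: str, doc_id: str = "", source: str = "") -> List[MiniChunk]:
        """Public entry point: delegate to the subclass, then finalize."""
        return self._finalize(self._chunk_impl(text, doc_id, source))

    @abstractmethod
    def _chunk_impl(self, text: str, doc_id: str, source: str) -> List[MiniChunk]:
        ...

    def _finalize(self, chunks: List[MiniChunk]) -> List[MiniChunk]:
        """Stamp each chunk with its 0-based position and the total count."""
        for index, chunk in enumerate(chunks):
            chunk.position = index
            chunk.total_chunks = len(chunks)
        return chunks


class ParagraphChunker(MiniBaseChunker):
    def _chunk_impl(self, text: str, doc_id: str, source: str) -> List[MiniChunk]:
        # One chunk per non-empty paragraph.
        return [MiniChunk(p.strip()) for p in text.split("\n\n") if p.strip()]
```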

`TextSplitter`

Abstract base class for text splitters. Subclass this to implement a custom splitter that works with split_documents() and create_documents().

from fennec_community.chunks import TextSplitter
from typing import List

class MyTextSplitter(TextSplitter):
    def split_text(self, text: str) -> List[str]:
        # implement your splitting logic here
        ...

You must implement: split_text(text) -> List[str]

You inherit for free: split_documents() and create_documents().
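
As an illustration of how an inherited helper can be built on top of `split_text`, here is a standalone sketch. It is an assumption about the general shape, not the library's code: `MiniTextSplitter` and `LineSplitter` are hypothetical, and documents are represented as plain dictionaries rather than Document instances.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class MiniTextSplitter(ABC):  # hypothetical stand-in for TextSplitter
    @abstractmethod
    def split_text(self, text: str) -> List[str]:
        ...

    def create_documents(
        self,
        texts: List[str],
        metadatas: Optional[List[Dict[str, Any]]] = None,
    ) -> List[Dict[str, Any]]:
        """Split every text and attach a copy of its metadata to each piece."""
        metadatas = metadatas or [{} for _ in texts]
        docs: List[Dict[str, Any]] = []
        for text, meta in zip(texts, metadatas):
            for piece in self.split_text(text):
                # Copy the metadata so pieces do not share one dict.
                docs.append({"page_content": piece, "metadata": dict(meta)})
        return docs


class LineSplitter(MiniTextSplitter):
    def split_text(self, text: str) -> List[str]:
        return [line for line in text.splitlines() if line.strip()]
```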

`ChunkingStrategy`

Legacy abstract base class kept for backward compatibility. Represents a named chunking strategy. In current usage, QueryPatternAnalyzer returns plain dataclass instances (not subclasses of this ABC) that carry the same fields.

Field	Type	Description
`name`	`str`	Strategy name (e.g. `"faq"`, `"technical"`).
`chunk_size`	`int`	Recommended chunk size in characters.
`overlap`	`int`	Recommended overlap in characters.
`use_semantic`	`bool`	Whether to use semantic splitting.
`use_adaptive`	`bool`	Whether to use adaptive sizing.
`use_structure`	`bool`	Whether to respect document structure.

Configuration Reference — `ChunkConfig`

ChunkConfig is a dataclass with sensible defaults. Pass it to any chunker or ChunkManager to customize behaviour. All fields are optional — omit any field to use its default.

from fennec_community.chunks import ChunkConfig

config = ChunkConfig(
    chunk_size=512,
    overlap=128,
    use_semantic_chunking=True,
    extract_keywords=True,
    deduplication_enabled=True,
)

Basic Sizing

Field	Type	Default	Description
`chunk_size`	`int`	`512`	Target chunk size in characters.
`overlap`	`int`	`128`	Overlap in characters between consecutive chunks.
`min_chunk_size`	`int`	`50`	Chunks smaller than this are discarded or merged.
`max_chunk_size`	`int`	`2048`	Hard maximum; chunks are force-split above this.
`strict_size_limit`	`bool`	`True`	Enforce hard size limits. When `True`, chunks exceeding `max_chunk_size` are force-split at word boundaries.
`size_tolerance`	`float`	`0.05`	Fractional overshoot allowed above `chunk_size` before a hard split is triggered (5% by default). Only applies when `strict_size_limit=True`.
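
The arithmetic behind `size_tolerance` and the word-boundary force-split can be sketched as follows. Both helpers are hypothetical, and how the library combines the tolerance limit with `max_chunk_size` is not specified here, so the sketch treats the limit as a single number passed in by the caller.

```python
from typing import List


def hard_limit(chunk_size: int, size_tolerance: float) -> int:
    """Largest length tolerated before a forced split, e.g. 512 * 1.05."""
    return int(chunk_size * (1 + size_tolerance))


def force_split(text: str, limit: int) -> List[str]:
    """Greedily pack whole words into pieces of at most `limit` characters."""
    pieces: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit or not current:
            # A single word longer than the limit is kept whole.
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces
```

With the defaults from the table, `hard_limit(512, 0.05)` gives 537 characters.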

Semantic Chunking

Field	Type	Default	Description
`use_semantic_chunking`	`bool`	`False`	Enable embedding-based semantic splitting.
`semantic_similarity_threshold`	`float`	`0.75`	Similarity drop threshold for a new chunk boundary.
`semantic_model`	`str`	MiniLM	Sentence-transformer model for semantic chunking.
`model_name`	`str`	MiniLM	Alias for `semantic_model` kept for backward compatibility with `ArabicTextChunker` and `MultilanguageTextChunker`.
`embedding_batch_size`	`int`	`64`	Number of sentences processed per embedding inference call. Increase for GPU throughput, decrease to save memory.
`embedding_cache_size`	`int`	`10_000`	LRU cache size (number of text entries) shared across all `EmbeddingProvider` instances.
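
To show how a similarity threshold can produce chunk boundaries, the sketch below groups sentences by the cosine similarity of adjacent embeddings. It assumes a boundary is placed wherever that similarity falls below `semantic_similarity_threshold`; the embeddings are supplied by the caller (toy vectors work), and `cosine` and `group_by_similarity` are hypothetical helpers rather than library functions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(
    sentences: List[str],
    vectors: List[Sequence[float]],
    threshold: float = 0.75,
) -> List[List[str]]:
    """Start a new group whenever adjacent similarity falls below `threshold`."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            groups.append([sentences[i]])
        else:
            groups[-1].append(sentences[i])
    return groups
```

Raising the threshold yields more, smaller groups; lowering it merges more sentences together.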

Adaptive Chunking

Field	Type	Default	Description
`use_adaptive_chunking`	`bool`	`False`	Enable density-adaptive chunk sizing.
`adaptive_technical_threshold`	`float`	`0.6`	Information-density score above which text is classified as "technical". Technical text gets smaller chunks.
`adaptive_min_size`	`int`	`100`	Minimum chunk size when content is dense/technical.
`adaptive_max_size`	`int`	`1500`	Maximum chunk size when content is simple/narrative.
`adaptive_base_size`	`int`	`512`	Starting size before density adjustment is applied.
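
One way these four values could interact is a piecewise-linear mapping from density to chunk size. The formula below is purely an assumption for illustration (the library's actual adjustment is not documented in this section), and `adaptive_size` is a hypothetical helper.

```python
def adaptive_size(
    density: float,
    threshold: float = 0.6,
    min_size: int = 100,
    max_size: int = 1500,
    base_size: int = 512,
) -> int:
    """Map an information-density score in [0, 1] to a target chunk size."""
    density = min(max(density, 0.0), 1.0)
    if density >= threshold:
        # Technical text: shrink from base_size toward min_size as density rises to 1.
        span = 1.0 - threshold
        t = (density - threshold) / span if span else 1.0
        return round(base_size - t * (base_size - min_size))
    # Simpler text: grow from base_size toward max_size as density falls to 0.
    t = (threshold - density) / threshold
    return round(base_size + t * (max_size - base_size))
```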

Smart Overlap

Field	Type	Default	Description
`use_smart_overlap`	`bool`	`True`	Use semantic similarity to select the most relevant sentences to carry over as overlap, instead of a fixed character window. Enabled by default.
`smart_overlap_threshold`	`float`	`0.7`	Minimum cosine similarity for a sentence to be included in the smart overlap region.

Structure-Aware Chunking

Field	Type	Default	Description
`respect_document_structure`	`bool`	`True`	Honor Markdown/HTML structural elements.
`split_on_headers`	`bool`	`True`	Create a new chunk at each Markdown/HTML heading.
`split_on_paragraphs`	`bool`	`True`	Treat paragraph breaks (`\n\n`) as natural chunk boundaries.
`preserve_code_blocks`	`bool`	`True`	Never split code blocks.
`preserve_tables`	`bool`	`True`	Never split tables.
`preserve_lists`	`bool`	`True`	Never split list blocks mid-way.

Context-Aware Chunking

Field	Type	Default	Description
`use_context_window`	`bool`	`False`	Enable sliding window context padding.
`context_window_size`	`int`	`2`	Number of sentences added as context on each side of a primary group.
`min_context_sentences`	`int`	`3`	Minimum number of sentences a group must contain before context padding is applied.
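
The effect of `context_window_size` and `min_context_sentences` can be pictured with list slicing. This is a sketch of the idea under the assumption that padding is simply skipped for groups that are too short; `pad_with_context` is a hypothetical helper.

```python
from typing import List


def pad_with_context(
    sentences: List[str],
    start: int,
    end: int,
    window: int = 2,
    min_sentences: int = 3,
) -> List[str]:
    """Return sentences[start:end] plus up to `window` neighbours on each side."""
    if end - start < min_sentences:
        return sentences[start:end]  # group too small: no padding applied
    lo = max(0, start - window)
    hi = min(len(sentences), end + window)
    return sentences[lo:hi]
```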

Metadata & Scoring

Field	Type	Default	Description
`extract_keywords`	`bool`	`True`	Run TF-IDF keyword extraction after chunking.
`keyword_max_count`	`int`	`10`	Maximum keywords per chunk.
`compute_chunk_scores`	`bool`	`True`	Compute importance scores after chunking.
`header_score_boost`	`float`	`1.5`	Score multiplier for header chunks.

Deduplication

Field	Type	Default	Description
`deduplication_enabled`	`bool`	`True`	Remove duplicate chunks automatically.
`dedup_use_hash`	`bool`	`True`	Use fast SHA-256 exact deduplication.
`dedup_similarity_threshold`	`float`	`0.95`	Cosine similarity threshold above which two chunks are considered duplicates. Note: this threshold is passed to `Deduplicator` at construction time, but `ChunkManager` always sets `use_similarity=False` — to use similarity-based deduplication you must call `Deduplicator` directly.
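
The two deduplication stages (exact hash, then optional similarity) can be sketched on plain tuples. This is an illustration, not the Deduplicator implementation: the text normalization before hashing is assumed, embeddings are supplied by the caller, and `dedupe` is a hypothetical helper.

```python
import hashlib
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe(
    items: List[Tuple[str, Sequence[float]]],
    use_similarity: bool = False,
    threshold: float = 0.95,
) -> List[str]:
    """Keep the first occurrence of each text; items are (text, embedding) pairs."""
    seen_hashes = set()
    kept: List[Tuple[str, Sequence[float]]] = []
    for text, vector in items:
        # Stage 1: exact match on a SHA-256 of the normalized text (assumed normalization).
        digest = hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        # Stage 2 (optional): near-duplicate if similarity to a kept item is above the threshold.
        if use_similarity and any(_cosine(vector, v) > threshold for _, v in kept):
            continue
        seen_hashes.add(digest)
        kept.append((text, vector))
    return [text for text, _ in kept]
```

The similarity stage compares each candidate against every kept item, so its cost grows quadratically with the number of chunks, which is one plausible reason to leave it off by default.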

Query-Aware Strategy

Field	Type	Default	Description
`query_aware`	`bool`	`False`	Enable query-pattern-based strategy selection.
`default_query_pattern`	`str`	`"general"`	Default strategy name when no pattern is detected. One of `"faq"`, `"documentation"`, `"technical"`, `"narrative"`, `"general"`.

Performance

Field	Type	Default	Description
`async_processing`	`bool`	`False`	Enable asynchronous parallel processing of chunks.
`n_workers`	`int`	`4`	Number of worker threads used when `async_processing=True`.
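
A thread pool with `n_workers` threads is one common way to implement this kind of parallelism. The sketch below uses only the standard library; `process_all` and its `chunk_fn` parameter are hypothetical, and the library's own scheduling may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def process_all(
    texts: List[str],
    chunk_fn: Callable[[str], List[str]],
    n_workers: int = 4,
) -> List[List[str]]:
    """Run `chunk_fn` over all texts on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map returns results in input order, regardless of completion order.
        return list(pool.map(chunk_fn, texts))
```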

Formatting & Text Cleanup

Field	Type	Default	Description
`preserve_formatting`	`bool`	`True`	Keep original whitespace and formatting structure intact.
`fix_spacing`	`bool`	`True`	Auto-fix spacing issues (e.g. missing spaces after Arabic punctuation).

Tiktoken

Field	Type	Default	Description
`model_token`	`str`	`"cl100k_base"`	Tiktoken encoding name used by `TokenTextSplitter` as its global default. Common values: `"cl100k_base"` (GPT-4, text-embedding-ada-002), `"p50k_base"` (Codex).

Language-Specific BERT Models

These fields set the HuggingFace model name used for each language when use_semantic_chunking=True inside MultilanguageTextChunker or ArabicTextChunker. Override any field to use a different model for that language.

Field	Default Model
`multilingual`	`xlm-roberta-base`
`arabic`	`CAMeL-Lab/bert-base-arabic-camelbert-mix`
`english`	`bert-base-uncased`
`chinese`	`bert-base-chinese`
`french`	`camembert-base`
`german`	`bert-base-german-cased`
`spanish`	`dccuchile/bert-base-spanish-wwm-uncased`
`russian`	`DeepPavlov/rubert-base-cased`
`japanese`	`cl-tohoku/bert-base-japanese`
`korean`	`klue/bert-base`
`portuguese`	`neuralmind/bert-base-portuguese-cased`
`italian`	`dbmdz/bert-base-italian-cased`
`dutch`	`GroNLP/bert-base-dutch-cased`
`polish`	`dkleczek/bert-base-polish-cased`
`turkish`	`dbmdz/bert-base-turkish-cased`
`vietnamese`	`vinai/phobert-base`
`hindi`	`ai4bharat/indic-bert`

Example:

config = ChunkConfig(
    use_semantic_chunking=True,
    arabic="CAMeL-Lab/bert-base-arabic-camelbert-ca",   # swap Arabic model
    french="camembert/camembert-large",                  # swap French model
)

Transformer Parameters

Low-level parameters forwarded to the HuggingFace tokenizer during semantic chunking. These rarely need to be changed.

Field	Type	Default	Description
`max_len`	`int`	`512`	Maximum token length passed to the tokenizer.
`padding`	`bool`	`True`	Enable padding to `max_len`.
`truncation`	`bool`	`True`	Truncate inputs exceeding `max_len`.
`return_tensor`	`str`	`"pt"`	Tensor framework returned by the tokenizer (`"pt"` = PyTorch).

Merge Parameters

Field	Type	Default	Description
`min_sentences`	`int`	`30`	Minimum sentence count a document must have before sentence-level merging is considered. Documents with fewer sentences skip the merge pass.

ChunkMode Enum

from fennec_community.chunks import ChunkMode

ChunkMode.AUTO        # ChunkManager picks strategy based on config flags
ChunkMode.SEMANTIC    # Force SemanticChunker
ChunkMode.ADAPTIVE    # Force AdaptiveChunker
ChunkMode.STRUCTURAL  # Force StructureAwareChunker
ChunkMode.CONTEXTUAL  # Force ContextAwareChunker
ChunkMode.HYBRID      # StructureAwareChunker → SemanticChunker on large sections
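
The two-stage idea behind HYBRID mode (structural split first, then a second pass over oversized sections) can be sketched without the library. In this illustration the second pass is a plain paragraph split standing in for the semantic chunker, which is a deliberate simplification; `split_on_headers`, `hybrid_split`, and `max_section_chars` are hypothetical names.

```python
import re
from typing import List


def split_on_headers(text: str) -> List[str]:
    """Split before each Markdown heading line, keeping the heading with its body."""
    parts = re.split(r"(?m)^(?=#{1,6}\s)", text)
    return [p.strip() for p in parts if p.strip()]


def hybrid_split(text: str, max_section_chars: int = 512) -> List[str]:
    """Structural pass first; re-split only the sections that are too large."""
    chunks: List[str] = []
    for section in split_on_headers(text):
        if len(section) <= max_section_chars:
            chunks.append(section)
        else:
            # Stand-in for the semantic pass: fall back to paragraph breaks.
            chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return chunks
```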

End-to-End Usage Examples

Example 1: RAG Pipeline (Recommended)

from fennec_community.chunks import ChunkManager, ChunkConfig, ChunkMode

config = ChunkConfig(
    chunk_size=512,
    overlap=128,
    use_semantic_chunking=True,
    extract_keywords=True,
    deduplication_enabled=True,
    query_aware=True,
)

manager = ChunkManager(config=config, mode=ChunkMode.AUTO)

with open("report.md", "r", encoding="utf-8") as f:
    text = f.read()

chunks = manager.process_with_query_optimization(
    text=text,
    queries=["What are the financial results?", "Who is the CEO?"],
    source="report.md",
)

print(manager.get_stats(chunks))

# embed_model and vector_db are placeholders for your own embedding model and vector store
for chunk in chunks:
    vector = embed_model.encode(chunk.text)
    vector_db.insert(chunk.chunk_id, vector, chunk.to_dict())

Example 2: Arabic Document

from fennec_community.chunks import ArabicTextChunker

chunker = ArabicTextChunker(
    chunk_size=400,
    overlap=100,
    use_semantic_chunking=True,
    fix_spacing=True,
)

chunks = chunker.chunk(arabic_text, source="arabic_doc.pdf")

for ch in chunks:
    print(f"[{ch.metadata.position}] {ch.text[:60]}...")
    print(f"  keywords: {ch.metadata.keywords}")

Example 3: Multilingual Batch Processing

from fennec_community.chunks import ChunkManager, Document

manager = ChunkManager(language="auto")

documents = [
    Document(page_content=english_text, metadata={"source": "en.txt", "lang": "en"}),
    Document(page_content=arabic_text,  metadata={"source": "ar.txt", "lang": "ar"}),
    Document(page_content=french_text,  metadata={"source": "fr.txt", "lang": "fr"}),
]

all_chunks = manager.process_documents(documents)
print(f"Total chunks: {len(all_chunks)}")

Example 4: Basic Splitters

from fennec_community.chunks import RecursiveCharacterTextSplitter, TokenTextSplitter

# Token-aware splitting for OpenAI models
splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=500,
    chunk_overlap=50,
)
text_chunks = splitter.split_text(long_text)

# Arabic recursive splitting
arabic_splitter = RecursiveCharacterTextSplitter(
    arabic_mode=True,
    chunk_size=400,
    chunk_overlap=80,
)
arabic_chunks = arabic_splitter.split_text(arabic_text)

Example 5: Deduplication + Enrichment Standalone

from fennec_community.chunks import MetadataEnricher, Deduplicator

enricher = MetadataEnricher(max_keywords=8, header_score_boost=2.0)
deduplicator = Deduplicator(use_hash=True, use_similarity=False)

# After any chunking operation:
chunks = enricher.enrich(chunks)
chunks = deduplicator.deduplicate(chunks)

# Sort by importance score
chunks.sort(key=lambda c: c.metadata.score, reverse=True)
top_chunks = chunks[:10]

Example 6: Similarity-Based Deduplication

# ChunkManager always disables similarity dedup internally.
# Use Deduplicator directly when you need near-duplicate removal:
from fennec_community.chunks import ChunkManager, Deduplicator

manager = ChunkManager()
chunks = manager.process(text, source="doc.pdf")

deduplicator = Deduplicator(
    use_hash=True,
    use_similarity=True,
    similarity_threshold=0.92,
    embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
)
chunks = deduplicator.deduplicate(chunks)

Example 7: Custom BERT Models per Language

from fennec_community.chunks import ChunkConfig, MultilanguageTextChunker

config = ChunkConfig(
    use_semantic_chunking=True,
    arabic="CAMeL-Lab/bert-base-arabic-camelbert-ca",
    french="camembert/camembert-large",
    embedding_batch_size=32,   # lower batch size on limited GPU memory
    embedding_cache_size=5_000,
)

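# Pass the config so the custom models take effect (the `config` keyword is assumed to match ChunkManager's)
chunker = MultilanguageTextChunker(language="arabic", config=config)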
chunks = chunker.chunk(arabic_text, source="doc.pdf")

Example 8: Structure Inspection via `StructuredSection`

from fennec_community.chunks import DocumentStructureParser

parser = DocumentStructureParser()
doc_type = parser.detect_document_type(md_text)
sections = parser.parse(md_text, doc_type)

code_sections = [s for s in sections if s.is_code_block]
headers = [s for s in sections if s.heading_level > 0]

print(f"Found {len(code_sections)} code blocks and {len(headers)} headings")

Source: community/chunks.md
