Fennec Community community/embedding.md

Embeddings Module

A unified, production-grade Python embeddings module with first-class Arabic NLP support, multi-provider integration, and async-ready infrastructure.

High-Level Overview
Architecture Overview
Core Concepts
Module & Component Breakdown
API / Public Interfaces
Configuration System
Usage Guide
Code Examples
Design Decisions & Trade-offs
Extensibility Guide
Project Structure
Performance & Scalability

1. High-Level Overview

What It Does

This library provides a unified interface for generating text embeddings across six distinct backends: OpenAI, Google Gemini, Mistral AI, HuggingFace (local), Ollama (local), and a specialized Arabic NLP embedder. All backends share a common abstract contract, enabling seamless provider switching without changing application code.

Problem It Solves

Building embedding pipelines typically requires learning each provider's SDK, managing rate limiting, implementing retry logic, handling caching, and dealing with Arabic/multilingual text normalization — separately for each provider. This library eliminates that fragmentation by:

Exposing a single encode() interface across all providers
Centralizing caching, statistics tracking, and similarity computation in the base class
Providing Arabic-specific text normalization as a first-class preprocessing stage
Offering async variants of all critical operations out of the box

Key Design Ideas

Template Method pattern: BaseEmbedder defines the algorithm skeleton (caching, stats, async dispatch, cleanup), while concrete embedders implement only encode() and the embedding_dim property.
Provider-agnostic similarity layer: cosine, dot product, and Euclidean similarity are available uniformly regardless of which backend generated the embeddings.
Arabic NLP as a first-class citizen: The library was built with Arabic as a primary target language, with structured normalization levels rather than ad-hoc text cleaning.

Inheritance Hierarchy

BaseEmbedder  (ABC)
├── ArabicEmbedder           → sentence-transformers + Arabic normalization pipeline
├── HuggingFaceEmbedder      → sentence-transformers (general multilingual)
├── OllamaEmbedder           → Ollama REST API (local inference)
├── OpenAIEmbedder           → OpenAI REST API (cloud)
├── GeminiEmbedder           → Google GenAI SDK (cloud)
└── MistralEmbedder          → Mistral REST API (cloud)

Data Flow

Input Text(s)
     │
     ▼
[ArabicEmbedder only]
Arabic Normalization Pipeline
 (diacritics → hamza → alef_maksura → tatweel → taa_marbuta)
     │
     ▼
Cache Lookup (MD5 key)  ──── HIT ──► Return cached embedding
     │ MISS
     ▼
Provider-specific encode()
 (batching + rate limiting + retries)
     │
     ▼
[Optional] L2 Normalization
     │
     ▼
numpy.ndarray  (n, dim)
     │
     ├── Cache Store
     └── Statistics Update

Module Interaction

EmbedderConfig is consumed by all concrete embedders as a source of default values. The base class orchestrates caching, statistics accumulation, async dispatch, and resource cleanup. Concrete embedders own only the provider-specific network/model call and the embedding_dim property.

2. Core Concepts

Embeddings & Semantic Retrieval

An embedding is a dense numerical vector that encodes the semantic meaning of text. Texts with similar meaning have geometrically close vectors. This enables:

Semantic search: retrieve documents by meaning rather than keyword overlap
Clustering: group semantically related texts without labels
Classification: represent text in a fixed-size feature space for downstream models
RAG (Retrieval-Augmented Generation): embed a query, retrieve top-k relevant document chunks, and pass them to a language model as context

Similarity Metrics

The library supports three metrics via similarity() and batch_similarity():

Metric	Formula	When to Use
`cosine`	`dot(a, b)` (after L2 norm)	Default; magnitude-independent
`dot`	`dot(a, b)`	When embeddings are already normalized
`euclidean`	`1 / (1 + ‖a−b‖)`	When absolute scale matters

When normalize_embeddings=True (the default), cosine and dot product are numerically equivalent.
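The three metrics can be sketched in a few lines of NumPy. This is an illustration of the formulas in the table, not the library's own `similarity()` implementation; the standalone function name and the zero-vector handling are assumptions.

```python
import numpy as np


def similarity(a: np.ndarray, b: np.ndarray, metric: str = "cosine") -> float:
    """Compare two 1-D vectors with one of the three documented metrics."""
    if metric == "cosine":
        # Divide by both norms so the score ignores vector magnitude.
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0
    if metric == "dot":
        return float(np.dot(a, b))
    if metric == "euclidean":
        # Map a distance in [0, inf) to a similarity in (0, 1].
        return float(1.0 / (1.0 + np.linalg.norm(a - b)))
    raise ValueError(f"unknown metric: {metric!r}")


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)

print(similarity(a, b, "cosine"))         # cosine on raw vectors
print(similarity(a_unit, b_unit, "dot"))  # same value once both are L2-normalized
```

Because both inputs are unit length in the second call, the dot product and the cosine score coincide, which is the equivalence described above.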

Arabic Text Normalization

Arabic text presents unique normalization challenges not present in Latin scripts. The library implements three normalization levels:

minimal — Remove diacritics (tashkeel) only. Suitable when the model is sensitive to letter form but the application data is already clean.

standard (default) — Remove diacritics + normalize all Hamza forms (أإآ → ا, ؤ → و, ئ → ي) + normalize Alef Maksura (ى → ي) + remove Tatweel (ـ). This covers the most common inconsistencies in user-generated content.

aggressive — All standard normalizations + normalize Taa Marbuta (ة → ه) + collapse runs of 3+ repeated characters to 2. Use for noisy social-media text.
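A minimal sketch of the three levels, using only the standard library. The exact diacritic range (U+064B to U+0652 plus the superscript alef U+0670) and the function name `normalize_arabic` are assumptions for illustration; the library's own patterns may cover more or fewer code points.

```python
import re

# Assumed diacritic set: tashkeel marks plus superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = re.compile("\u0640")
REPEATS = re.compile(r"(.)\1{2,}")  # any character repeated 3 or more times

HAMZA_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0624": "\u0648",  # waw with hamza -> waw
    "\u0626": "\u064A",  # yeh with hamza -> yeh
})
ALEF_MAKSURA_MAP = str.maketrans({"\u0649": "\u064A"})  # alef maksura -> yeh
TAA_MARBUTA_MAP = str.maketrans({"\u0629": "\u0647"})   # taa marbuta -> heh


def normalize_arabic(text: str, level: str = "standard") -> str:
    """Apply the normalization steps for the requested level, in pipeline order."""
    if level not in ("minimal", "standard", "aggressive"):
        raise ValueError(f"unknown normalization level: {level!r}")
    text = DIACRITICS.sub("", text)
    if level == "minimal":
        return text
    text = text.translate(HAMZA_MAP).translate(ALEF_MAKSURA_MAP)
    text = TATWEEL.sub("", text)
    if level == "aggressive":
        text = text.translate(TAA_MARBUTA_MAP)
        text = REPEATS.sub(r"\1\1", text)  # collapse runs of 3+ down to 2
    return text
```

Each level is a strict superset of the previous one, so moving from `minimal` to `aggressive` only ever removes more surface variation.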

Caching Strategy

The base class implements an in-memory embedding cache keyed by MD5 hash of the raw text string. This design choice (hash over raw text) keeps the cache storage compact and O(1) for lookup while avoiding collision in practice for embedding workloads. The cache is intentionally session-scoped (no disk persistence) to prevent stale embeddings when switching models mid-session.
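The idea can be sketched as follows. `CachedEncoder` and its `encode_fn` parameter are hypothetical names for this illustration, and embeddings are plain lists rather than arrays to keep the example dependency-free; the real `encode_with_cache()` may differ in detail.

```python
import hashlib
from typing import Callable, Dict, List


class CachedEncoder:
    """Wrap an encode function with an in-memory cache keyed by MD5 of the text."""

    def __init__(self, encode_fn: Callable[[List[str]], List[List[float]]]):
        self._encode_fn = encode_fn
        self._cache: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Fixed-size key regardless of how long the text is.
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def encode_with_cache(self, texts: List[str]) -> List[List[float]]:
        keys = [self._key(t) for t in texts]
        # Collect each unseen text once, preserving first-seen order.
        missing: Dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._cache and key not in missing:
                missing[key] = text
        if missing:
            vectors = self._encode_fn(list(missing.values()))
            self._cache.update(zip(missing.keys(), vectors))
        self.misses += len(missing)
        self.hits += len(texts) - len(missing)
        return [self._cache[key] for key in keys]
```

Only the cache misses reach the underlying encoder, and they are sent as a single batch so the provider call count stays low.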

Rate Limiting

API-backed embedders (OpenAI, Mistral) implement sliding-window rate limiters using thread-safe RLock. The limiter tracks request timestamps and token counts within a 1-minute (OpenAI) or 1-second (Mistral) window. When limits are approached, the system sleeps the minimum necessary duration rather than raising an exception.
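A simplified sliding-window limiter might look like the sketch below. It tracks request counts only (no token budget), and the class name plus the injectable `clock` and `sleep` parameters are assumptions added so the behaviour can be tested without real waiting.

```python
import threading
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._timestamps = deque()
        self._lock = threading.RLock()

    def acquire(self) -> float:
        """Block until a slot is free and return the total seconds slept."""
        waited = 0.0
        with self._lock:
            while True:
                now = self._clock()
                # Drop timestamps that have left the window.
                while self._timestamps and now - self._timestamps[0] >= self.window:
                    self._timestamps.popleft()
                if len(self._timestamps) < self.max_requests:
                    self._timestamps.append(now)
                    return waited
                # Sleep only until the oldest request expires.
                delay = self.window - (now - self._timestamps[0])
                self._sleep(delay)
                waited += delay
```

Holding the lock while sleeping is deliberate in this sketch: other threads would have to wait for the same slot anyway, and it keeps the bookkeeping simple.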

Retry with Exponential Backoff

All API embedders implement retry loops with configurable max_retries and exponential backoff (delay * 2^attempt). Rate-limit errors are retried; hard API errors are re-raised immediately to avoid burning retry budget on unrecoverable failures.
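The retry policy can be illustrated with a small helper. `RateLimitError` and `call_with_retries` are hypothetical names for this sketch; each provider client raises its own exception types.

```python
import time


class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit (HTTP 429) exception."""


def call_with_retries(fn, max_retries: int = 3, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call fn(), retrying rate-limit errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retry budget exhausted
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
        # Any other exception propagates immediately without a retry.
```

With `base_delay=1.0` the waits are 1 s, 2 s, 4 s, and so on, while an unrecoverable error such as an invalid request surfaces on the first attempt.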

3. Module & Component Breakdown

`EmbedderConfig`

Purpose: Centralizes all default values in a single @dataclass, preventing scattered magic numbers.

Responsibilities: Provides defaults for batch size, normalization flags, max sequence length, cache settings, Ollama connection parameters, and progress display. Concrete embedders import a module-level config = EmbedderConfig() instance and use it as parameter defaults, making the library configurable without subclassing.

Key fields:

batch_size: int = 32
normalize_embeddings: bool = True
max_seq_length: int = 512
enable_cache: bool = False
base_url: str = "http://127.0.0.1:11434" (Ollama)

`BaseEmbedder`

Purpose: Abstract base that defines the contract and provides all shared infrastructure.

Responsibilities:

Declares encode() and embedding_dim as abstract
Implements encode_with_cache(), similarity(), batch_similarity()
Provides validate_connection(), get_model_info(), get_stats(), reset_stats(), clear_cache()
Implements async variants (aencode, aencode_with_cache, abatch_similarity) via asyncio.to_thread
Provides a timing() context manager for performance measurement
Implements __aenter__/__aexit__ for async context manager protocol
Implements __del__ for automatic GPU memory cleanup

Hidden design detail: The _stats dictionary supports both dict and dataclass forms. This is a backward-compatibility bridge — HuggingFaceEmbedder uses an EmbeddingStats dataclass while all other embedders use a plain dict. get_stats() detects the type at runtime and normalizes the output.
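The runtime detection could be sketched like this. The field names on `EmbeddingStats` are placeholders, and `dataclasses.asdict` stands in for the class's own `to_dict()` method, so treat it as an illustration of the bridging idea only.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Dict


@dataclass
class EmbeddingStats:
    # Placeholder fields; the real dataclass may track different counters.
    total_texts: int = 0
    cache_hits: int = 0


def stats_as_dict(stats: Any) -> Dict[str, Any]:
    """Return statistics as a plain dict whether stored as a dataclass or a dict."""
    if is_dataclass(stats) and not isinstance(stats, type):
        return asdict(stats)
    return dict(stats)


print(stats_as_dict(EmbeddingStats(total_texts=3)))
print(stats_as_dict({"total_texts": 3, "cache_hits": 0}))
```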

Device selection: _get_best_device() probes CUDA → MPS (Apple Silicon) → CPU in order, logging the selected device for debugging.

`ArabicEmbedder`

Purpose: Local sentence-transformers embedder with a full Arabic NLP preprocessing pipeline.

Responsibilities:

Aliases friendly model keys (multilingual, labse, arabert, etc.) to full HuggingFace model names
Applies a multi-stage Arabic normalization pipeline before encoding
Tracks normalization statistics (diacritics removed, hamzas normalized) via ArabicProcessingStats
Handles OOM errors by halving batch size and retrying automatically
Provides save_embeddings() / load_embeddings() for .npz persistence with optional text and metadata arrays
Provides find_most_similar() and benchmark() as high-level utilities
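The out-of-memory handling listed above can be sketched as a loop that halves the batch size and retries the same slice. The helper name and the `oom_error` parameter are assumptions; with PyTorch a caller would typically pass `torch.cuda.OutOfMemoryError`, while the default here is the built-in `MemoryError` so the sketch runs without a GPU.

```python
from typing import Callable, List, Sequence, Type


def encode_with_oom_backoff(
    texts: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List],
    batch_size: int = 32,
    oom_error: Type[BaseException] = MemoryError,
) -> List:
    """Encode texts in batches, halving the batch size whenever memory runs out."""
    results: List = []
    start = 0
    while start < len(texts):
        batch = texts[start:start + batch_size]
        try:
            results.extend(encode_batch(batch))
        except oom_error:
            if batch_size == 1:
                raise  # cannot shrink further
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results
```

The reduced batch size is kept for the remaining texts, which avoids hitting the same failure on every subsequent batch.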

Key classes:

ArabicProcessingStats: dataclass tracking normalization metrics per session
ARABIC_NORMALIZATION_PATTERNS: class-level compiled regex patterns (compiled once at class definition, not per call)
RECOMMENDED_MODELS: maps human-friendly keys to full model names + dimension info

Interaction: Inherits encode_with_cache, similarity, batch_similarity from BaseEmbedder. Uses sentence-transformers.SentenceTransformer internally.

`HuggingFaceEmbedder`

Purpose: General-purpose local sentence-transformers embedder with TTL-based cache and typed model registry.

Responsibilities:

Maintains ARABIC_MODELS registry mapping short names to ModelInfo dataclasses (dimensions, max_tokens, ArabicQuality enum, size)
Implements TTL-aware cache via _cache_timestamps dict alongside the base cache
Uses a retry_on_failure decorator (exponential backoff) on its internal _process_batch
Supports per-call normalization override via normalize parameter on encode()
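A TTL-aware cache of the kind described in the list above keeps a timestamp next to each value and treats old entries as misses. The class below is a hypothetical standalone sketch, not the embedder's internal `_cache_timestamps` logic; the injectable `clock` is an assumption added for testability.

```python
import time
from typing import Any, Dict, Hashable


class TTLCache:
    """Dictionary-backed cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._values: Dict[Hashable, Any] = {}
        self._timestamps: Dict[Hashable, float] = {}

    def set(self, key: Hashable, value: Any) -> None:
        self._values[key] = value
        self._timestamps[key] = self._clock()

    def get(self, key: Hashable, default: Any = None) -> Any:
        stamp = self._timestamps.get(key)
        if stamp is None:
            return default
        if self._clock() - stamp > self._ttl:
            # Entry is stale: evict it and report a miss.
            del self._values[key]
            del self._timestamps[key]
            return default
        return self._values[key]
```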

Key classes:

ArabicQuality(Enum): EXCELLENT | GOOD | FAIR | UNKNOWN — structured quality tagging
ModelInfo: dataclass carrying model metadata for the registry
EmbeddingStats: dataclass (not dict) for statistics; requires to_dict() bridge in BaseEmbedder.get_stats()

Design note: Unlike ArabicEmbedder, this class does not apply Arabic-specific text preprocessing. It is intended for general multilingual use where the model's tokenizer handles normalization.

`OllamaEmbedder`

Purpose: Embeds text using locally-running Ollama inference server via REST API — zero API key, zero data egress.

Responsibilities:

Pings the Ollama /api/tags endpoint at init to verify server reachability
Optionally auto-starts the ollama serve subprocess if the server is not running
Sends embedding requests to /api/embed with configurable timeout and retries
Stores MODEL_SPECS dict at module level (not class level) for known models, with graceful fallback for unknown models

Known models with specs: nomic-embed-text (768d), mxbai-embed-large (1024d), all-minilm (384d), snowflake-arctic-embed (1024d), bge-m3 (1024d, best Arabic support).

Key behavior: If auto_start_server=True and ollama serve is not running, the embedder will launch it via subprocess and wait server_start_wait seconds before proceeding.

`OpenAIEmbedder`

Purpose: Production-grade OpenAI embedding client with token-aware rate limiting, cost tracking, and tiktoken-based accurate token counting.

Responsibilities:

Validates the API key on init by making a minimal test request (fails fast vs. failing on first real call)
Uses tiktoken (cl100k_base encoding) for accurate pre-call token estimation; falls back to len(text) // 4 if tiktoken is unavailable
Supports dimension reduction on text-embedding-3-* models via the dimensions API parameter
Tracks cost per request using MODEL_SPECS.cost_per_1m_tokens with UsageStats dataclass
Implements two-level rate limiting: requests/min and tokens/min via RateLimiter
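The token estimation and cost arithmetic described above reduce to a few lines. This is a sketch under assumptions: the function names are illustrative, prices are copied from the supported-models table in this section, and the optional `encoding` argument is any object with an `encode()` method, such as the result of `tiktoken.get_encoding("cl100k_base")`.

```python
from typing import Any, Dict, List, Optional

# USD per 1M tokens, copied from the supported-models table.
COST_PER_1M_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "text-embedding-ada-002": 0.10,
}


def count_tokens(text: str, encoding: Optional[Any] = None) -> int:
    """Count tokens with a tokenizer if given, else use the len // 4 heuristic."""
    if encoding is not None:
        return len(encoding.encode(text))
    return len(text) // 4


def estimate_cost(texts: List[str], model: str,
                  encoding: Optional[Any] = None) -> Dict[str, Any]:
    """Estimate total tokens and cost before sending anything to the API."""
    total_tokens = sum(count_tokens(t, encoding) for t in texts)
    cost = total_tokens / 1_000_000 * COST_PER_1M_TOKENS[model]
    return {"total_tokens": total_tokens, "estimated_cost_usd": cost}


print(estimate_cost(["x" * 4000] * 1000, "text-embedding-3-large"))
```

The character-based fallback is only a rough guide; it tends to undercount for Arabic and other non-Latin scripts, where tokens are usually shorter than four characters.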

Key classes:

UsageStats: tracks total tokens, cost, requests; computes requests-by-minute breakdown
RateLimiter: sliding-window limiter with RLock for thread safety

Supported models:

Model	Dimensions	Cost/1M tokens	Max tokens
`text-embedding-3-large`	3072 (reducible)	$0.13	8191
`text-embedding-3-small`	1536 (reducible)	$0.02	8191
`text-embedding-ada-002`	1536	$0.10	8191

`GeminiEmbedder`

Purpose: Google Gemini embedding client with task-type–aware encoding and MRL (Matryoshka Representation Learning) dimension reduction.

Responsibilities:

Uses the new google-genai SDK (not the deprecated google-generativeai)
Exposes task_type parameter to signal the embedding's intended use to the model (RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING)
Supports output_dimensionality for MRL-based dimension reduction (valid for gemini-embedding-001)
Validates model status — deprecated models (text-embedding-004, embedding-001) emit warnings with sunset dates
Tracks requests/chars/errors via GeminiUsageStats
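To show what MRL-style reduction means in practice: Matryoshka-trained embeddings concentrate information in their leading dimensions, so a shorter vector is obtained by keeping a prefix and re-normalizing. The helper below is a client-side illustration of that idea only. With `GeminiEmbedder` the reduction is requested through `output_dimensionality`, and the function name here is an assumption.

```python
import numpy as np


def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each vector and rescale to unit length."""
    reduced = embeddings[..., :dim]
    norms = np.linalg.norm(reduced, axis=-1, keepdims=True)
    # Guard against division by zero for all-zero vectors.
    return reduced / np.where(norms == 0, 1.0, norms)


full = np.array([[3.0, 4.0, 12.0]])
print(truncate_and_renormalize(full, 2))  # shape (1, 2), unit length
```

Re-normalizing matters because a truncated vector is no longer unit length, which would otherwise skew dot-product comparisons.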

Key insight: Task-type hints allow the Gemini model to optimize the embedding distribution for the specific downstream task. For RAG, encode queries with RETRIEVAL_QUERY and documents with RETRIEVAL_DOCUMENT for best retrieval accuracy.

Supported models:

Model	Dimensions	Status
`gemini-embedding-001`	3072 (MRL-reducible)	GA
`text-embedding-004`	768	Deprecated 2026-01-14
`embedding-001`	768	Deprecated 2025-08-14

`MistralEmbedder`

Purpose: Mistral AI embedding client with per-second rate limiting and LRU caching.

Responsibilities:

Implements MistralRateLimiter with a 1-second sliding window (Mistral's tighter rate limit vs. OpenAI's per-minute)
Uses OrderedDict for LRU cache eviction (bounded by cache_size parameter)
Reads API key from MISTRAL_API_KEY env var or constructor parameter
Calls https://api.mistral.ai/v1/embeddings directly via requests (no official SDK dependency)
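The LRU eviction mentioned in the list above follows a well-known `OrderedDict` pattern: move a key to the end on every access and pop from the front when the cache is full. The class below is a generic sketch with hypothetical names, not the embedder's own cache code.

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """Bounded cache that evicts the least recently used entry first."""

    def __init__(self, max_size: int):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max_size:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self) -> int:
        return len(self._data)
```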

Only available model: mistral-embed (1024 dimensions, $0.10/1M tokens).

Public Surface

Purpose: Defines the public API, module-level metadata, and documentation constants.

Exports: ArabicEmbedder, BaseEmbedder, EmbedderConfig, GeminiEmbedder, HuggingFaceEmbedder, OllamaEmbedder, MistralEmbedder, OpenAIEmbedder.

Metadata constants:

__arabic_normalization__: list of all normalization operations in pipeline order
__valid_levels__: human-readable descriptions of normalization levels
__task_types_gemini__: bilingual (Arabic/English) task type documentation
__gemini_embedding_models__: complete model specifications for Gemini models

5. API / Public Interfaces

`BaseEmbedder` — Core Contract

class BaseEmbedder(ABC):

    def __init__(
        self,
        model_name: str,
        device: Optional[str] = None,          # 'cuda' | 'cpu' | 'mps' | None (auto)
        normalize_embeddings: bool = True,
        batch_size: int = 32,
        max_length: Optional[int] = 512,
        cache_embeddings: bool = False,
        show_progress: bool = False,
        **kwargs
    ): ...

    @abstractmethod
    def encode(
        self,
        texts: Union[str, List[str]],
        show_progress_bar: bool = False,
        convert_to_numpy: bool = True,
        **kwargs
    ) -> np.ndarray: ...
    # Returns: shape (dim,) for single text, (n, dim) for list

    @property
    @abstractmethod
    def embedding_dim(self) -> int: ...

    def encode_with_cache(self, texts, **kwargs) -> np.ndarray: ...

    def similarity(
        self,
        text1: Union[str, np.ndarray],
        text2: Union[str, np.ndarray],
        metric: str = 'cosine'           # 'cosine' | 'dot' | 'euclidean'
    ) -> float: ...

    def batch_similarity(
        self,
        query: Union[str, np.ndarray],
        texts: List[str],
        top_k: Optional[int] = None
    ) -> Union[np.ndarray, List[Tuple[int, float]]]: ...
    # Returns: similarity array if top_k=None, else [(index, score), ...]

    def validate_connection(
        self,
        test_text: str = "مرحباً Hello",
        detailed: bool = False
    ) -> Dict[str, Any]: ...

    def get_model_info(self) -> Dict[str, Any]: ...
    def get_stats(self) -> Dict[str, Any]: ...
    def reset_stats(self): ...
    def clear_cache(self): ...

    # Async API
    async def aencode(self, texts, **kwargs) -> np.ndarray: ...
    async def aencode_with_cache(self, texts, **kwargs) -> np.ndarray: ...
    async def abatch_similarity(self, query, candidates, **kwargs): ...

    # Context manager
    @contextmanager
    def timing(self, operation: str = "encoding"): ...

`ArabicEmbedder` — Extended API

ArabicEmbedder(
    model_name: Optional[str] = None,   # Key from RECOMMENDED_MODELS or full HF model name
    normalization_level: str = 'standard',  # 'minimal' | 'standard' | 'aggressive'
    enable_preprocessing: bool = True,
    track_processing_stats: bool = True,
    auto_download: bool = True,
    ...
)

.find_most_similar(
    query: str,
    candidates: List[str],
    top_k: int = 5,
    metric: str = 'cosine'
) -> List[Tuple[int, str, float]]     # [(index, text, score), ...]

.save_embeddings(texts, filepath, save_texts=True, save_metadata=True) -> None
.load_embeddings(filepath, load_texts=False, load_metadata=False) -> Union[np.ndarray, Tuple]
.benchmark(sample_texts=None, num_iterations=10, warmup_iterations=2) -> Dict
.get_processing_stats() -> Dict[str, Any]
.list_recommended_models() -> Dict[str, Dict]  # @staticmethod

`OpenAIEmbedder` — Extended API

OpenAIEmbedder(
    model_name: str = "text-embedding-3-small",
    api_key: Optional[str] = None,     # Falls back to OPENAI_API_KEY env var
    dimensions: Optional[int] = None,  # Dimension reduction (text-embedding-3-* only)
    enable_rate_limiting: bool = True,
    max_requests_per_minute: int = 3000,
    max_tokens_per_minute: int = 1_000_000,
    track_costs: bool = True,
    ...
)

.estimate_cost(texts: Union[str, List[str]]) -> Dict[str, Any]
.get_usage_stats() -> Dict[str, Any]

`GeminiEmbedder` — Extended API

GeminiEmbedder(
    model_name: str = "gemini-embedding-001",
    api_key: Optional[str] = None,             # Falls back to GOOGLE_API_KEY env var
    task_type: Optional[str] = None,           # 'RETRIEVAL_QUERY' | 'RETRIEVAL_DOCUMENT' | ...
    output_dimensionality: Optional[int] = None,  # MRL reduction
    ...
)

6. Configuration System

`EmbedderConfig` Dataclass

All embedders accept EmbedderConfig defaults. Override at the module level to change library-wide defaults:

from fennec_community.embeddings import EmbedderConfig

# Override defaults globally before importing embedders
config = EmbedderConfig(
    batch_size=64,
    normalize_embeddings=True,
    enable_cache=True,
    show_progress_bar=True,
)

Full Configuration Reference

Field	Default	Description
`model_name`	`"multilingual"`	Default model key
`device`	`None` (auto)	`'cuda'`, `'cpu'`, `'mps'`
`cache_dir`	`None`	HuggingFace model cache directory
`batch_size`	`32`	Texts per encode batch
`normalize_embeddings`	`True`	L2-normalize output vectors
`max_seq_length`	`512`	Token truncation limit
`enable_preprocessing`	`True`	Arabic normalization pipeline
`enable_arabic_normalization`	`True`	Arabic char normalization
`enable_cache`	`False`	In-memory embedding cache
`cache_size`	`10000`	Max cached entries
`show_progress_bar`	`False`	tqdm progress during encoding
`convert_to_numpy`	`True`	Always return `np.ndarray`
`skip_preprocessing`	`False`	Bypass Arabic preprocessing
`return_valid_indices`	`False`	Return `(embeddings, valid_idx)` tuple
`track_processing_stats`	`True`	Track normalization statistics
`auto_download`	`True`	Auto-download missing models
`trust_remote_code`	`False`	HuggingFace `trust_remote_code`
`base_url`	`"http://127.0.0.1:11434"`	Ollama server URL
`embedding_model`	`"nomic-embed-text"`	Default Ollama model

Environment Variables

Variable	Used By
`OPENAI_API_KEY`	`OpenAIEmbedder`
`GOOGLE_API_KEY`	`GeminiEmbedder`
`MISTRAL_API_KEY`	`MistralEmbedder`

7. Usage Guide

Quick Start

from fennec_community.embeddings import ArabicEmbedder

# Minimal setup — downloads model on first run
embedder = ArabicEmbedder()
embedding = embedder.encode("مرحبا بك في عالم الذكاء الاصطناعي")
print(embedding.shape)  # (embedding_dim,), e.g. (768,) for paraphrase-multilingual-mpnet-base-v2

Provider Selection Guide

Use Case	Recommended Embedder	Reason
Arabic NLP, on-premise	`ArabicEmbedder`	Built-in normalization pipeline
General multilingual, local	`HuggingFaceEmbedder`	Full model control, no cost
Production API, best quality	`OpenAIEmbedder` (3-large)	Highest accuracy, dimension reduction
Budget-conscious API	`OpenAIEmbedder` (3-small)	Good quality, about 6.5× cheaper than 3-large
Task-specific RAG	`GeminiEmbedder`	Task-type optimization
Air-gapped / privacy-first	`OllamaEmbedder`	Zero network egress
Mistral ecosystem	`MistralEmbedder`	Native Mistral stack integration

Basic Usage Pattern

# All embedders share the same core interface
from fennec_community.embeddings import OpenAIEmbedder

embedder = OpenAIEmbedder(model_name="text-embedding-3-small")

# Single text → 1D array of shape (dim,)
emb = embedder.encode("Hello world")

# Multiple texts → 2D array of shape (n, dim)
embs = embedder.encode(["First text", "Second text", "Third text"])

# Semantic similarity
score = embedder.similarity("cat", "kitten")       # float, ~0.85
score = embedder.similarity("cat", "automobile")   # float, ~0.2

# Top-k search
results = embedder.batch_similarity(
    query="artificial intelligence",
    texts=["machine learning", "cooking recipes", "neural networks"],
    top_k=2
)
# Returns: [(0, 0.91), (2, 0.88)]  — (index, score) pairs

Arabic NLP Advanced Usage

from fennec_community.embeddings import ArabicEmbedder

# Use aggressive normalization for social media text
with ArabicEmbedder(
    model_name='arabert',
    normalization_level='aggressive',
    cache_embeddings=True
) as embedder:

    # Texts with heavy diacritics + repeated chars + hamza variants
    messy_texts = [
        "أَهْلاً وَسَهْلاً بِكُمْ",
        "ههههههههه كتير كتيرررر",
        "اريد ان اتعلم البرمجة"
    ]
    embeddings = embedder.encode(messy_texts)

    # Inspect normalization statistics
    stats = embedder.get_processing_stats()
    print(f"Diacritics removed: {stats['removed_diacritics']}")
    print(f"Hamzas normalized: {stats['normalized_hamzas']}")

Async Usage

import asyncio
from fennec_community.embeddings import HuggingFaceEmbedder

async def embed_documents(texts):
    async with HuggingFaceEmbedder("bge-m3") as embedder:
        # Runs encode() in a thread pool — does not block event loop
        embeddings = await embedder.aencode(texts)
        return embeddings

embeddings = asyncio.run(embed_documents(["doc1", "doc2", "doc3"]))

Persistence (ArabicEmbedder)

embedder = ArabicEmbedder(model_name='multilingual')

# Save to disk
texts = ["النص الأول", "النص الثاني"]
embedder.save_embeddings(texts, "my_embeddings", save_texts=True, save_metadata=True)
# Creates: my_embeddings.npz

# Load from disk
embeddings, texts, metadata = embedder.load_embeddings(
    "my_embeddings",
    load_texts=True,
    load_metadata=True
)
print(metadata['model_name'])     # paraphrase-multilingual-mpnet-base-v2
print(metadata['timestamp'])      # ISO 8601 timestamp

Cost Estimation (OpenAI)

embedder = OpenAIEmbedder(model_name="text-embedding-3-large")

# Estimate before committing to the API call
estimate = embedder.estimate_cost(my_large_document_list)
print(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")
print(f"Total tokens: {estimate['total_tokens']:,}")

# After encoding, review actual usage
stats = embedder.get_usage_stats()
print(f"Actual cost: ${stats['api_usage']['total_cost_usd']:.6f}")

Performance Benchmarking

embedder = ArabicEmbedder(model_name='multilingual-mini')
results = embedder.benchmark(num_iterations=20, warmup_iterations=3)

print(f"Texts/second: {results['texts_per_second']:.1f}")
print(f"Avg latency:  {results['avg_time']*1000:.1f}ms")
print(f"Device:       {results['device']}")

8. Code Examples

RAG Document Indexing Pipeline

from fennec_community.embeddings import GeminiEmbedder
import numpy as np

# Index-time: embed documents
doc_embedder = GeminiEmbedder(
    model_name="gemini-embedding-001",
    task_type="RETRIEVAL_DOCUMENT",
    output_dimensionality=768,   # Reduce 3072 → 768 via MRL
    cache_embeddings=True,
)

documents = [
    "Python is a high-level programming language.",
    "Machine learning models require training data.",
    "Neural networks are inspired by the brain.",
]

doc_embeddings = doc_embedder.encode(documents)
print(doc_embeddings.shape)  # (3, 768)

# Query-time: embed query with matching task type
query_embedder = GeminiEmbedder(
    model_name="gemini-embedding-001",
    task_type="RETRIEVAL_QUERY",
    output_dimensionality=768,
)

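query = "What is Python used for?"
# Score the query against the document vectors computed at index time.
# With normalize_embeddings=True (the default), the dot product equals cosine similarity.
query_embedding = query_embedder.encode(query)   # shape (768,)
scores = doc_embeddings @ query_embedding        # shape (3,)
for idx in np.argsort(-scores)[:2]:
    print(f"[{scores[idx]:.3f}] {documents[idx]}")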

Multi-Provider Evaluation

from fennec_community.embeddings import ArabicEmbedder, OpenAIEmbedder

query = "ما هو الذكاء الاصطناعي؟"
candidates = [
    "الذكاء الاصطناعي هو محاكاة العقل البشري",
    "الطبخ فن جميل",
    "التعلم الآلي فرع من الذكاء الاصطناعي",
]

for EmbedderClass, kwargs in [
    (ArabicEmbedder, {"model_name": "multilingual"}),
    (OpenAIEmbedder, {"model_name": "text-embedding-3-small"}),
]:
    embedder = EmbedderClass(**kwargs)
    results = embedder.batch_similarity(query, candidates, top_k=2)
    print(f"\n{EmbedderClass.__name__}:")
    for idx, score in results:
        print(f"  [{score:.3f}] {candidates[idx]}")

Connection Health Check

from fennec_community.embeddings import OllamaEmbedder

embedder = OllamaEmbedder(model_name="bge-m3")
report = embedder.validate_connection(detailed=True)

if report['success']:
    print(f"✅ Connected — dim={report['embedding_dim']}, latency={report['encoding_time']}")
else:
    print(f"❌ Failed: {report['reason']}")

Timing and Statistics

from fennec_community.embeddings import HuggingFaceEmbedder

embedder = HuggingFaceEmbedder(model_name="bge-m3")

with embedder.timing("bulk_encode"):
    embeddings = embedder.encode(large_text_list)

stats = embedder.get_stats()
print(f"Total texts:    {stats['total_texts']}")
print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")
print(f"Avg time/text:  {stats['avg_time_per_text']*1000:.2f}ms")

9. Design Decisions & Trade-offs

Template Method Over Composition

Decision: Shared logic (caching, stats, async, cleanup) lives in BaseEmbedder via the Template Method pattern rather than being injected as collaborator objects.

Pros: Zero boilerplate in concrete embedders; all providers get caching, timing, and async for free. Adding a new provider means implementing two methods.

Cons: Tighter coupling between base and concrete classes. A provider that wants different cache semantics (e.g., Redis) must override multiple base methods. The _stats dict/dataclass dual-support is evidence of this friction accumulating.

In-Memory Cache Only

Decision: The cache is a session-scoped Python dict, never persisted to disk.

Pros: No deserialization overhead, no cache invalidation complexity, no stale embeddings when models change.

Cons: Cache is lost on process restart. For high-reuse workloads (e.g., document corpora that are repeatedly queried), callers must implement external caching or use save_embeddings() / load_embeddings().

Arabic Normalization as Preprocessing (Not Postprocessing)

Decision: Arabic normalization happens before encoding, not after.

Pros: The embedding model sees cleaner, more consistent input. Two surface-form variants of the same word (e.g., أحمد and احمد) produce the same token sequence and therefore similar (or identical) embeddings.

Cons: If a model was specifically fine-tuned on raw Arabic text including diacritics, aggressive normalization may degrade quality. The skip_preprocessing parameter and minimal level exist to mitigate this.

Separate `ArabicEmbedder` vs. `HuggingFaceEmbedder`

Decision: Arabic-specific functionality is isolated in its own class rather than added as flags to HuggingFaceEmbedder.

Pros: Cleaner separation of concerns. Arabic users get a purpose-built interface with save/load, benchmarking, and processing stats. General users are not burdened with Arabic-specific parameters.

Cons: Some code duplication between the two classes. The model-loading logic and SentenceTransformer call are essentially identical.

API Key in Constructor vs. Environment Only

Decision: All API embedders accept api_key as an explicit constructor parameter with environment variable fallback.

Pros: Enables testing with different credentials without environment manipulation; makes dependencies explicit.

Cons: Risk of accidentally logging or serializing the API key if the embedder object is repr'd or pickled carelessly.

Rate Limiting Inside the Embedder

Decision: Rate limiting logic is embedded within OpenAIEmbedder and MistralEmbedder rather than being a separate middleware.

Pros: Self-contained — users cannot accidentally bypass the limiter by calling the underlying client directly.

Cons: If multiple embedder instances are created for the same API account, each has an independent rate limiter, potentially allowing combined rates to exceed account limits.

10. Extensibility Guide

Adding a New Embedding Provider

Create a new file (e.g., cohere_embedder.py) and implement the two abstract members:

import os
from fennec_community.embeddings import BaseEmbedder
import numpy as np
from typing import Union, List, Optional

class CohereEmbedder(BaseEmbedder):

    def __init__(self, model_name: str = "embed-multilingual-v3.0",
                 api_key: Optional[str] = None, **kwargs):
        super().__init__(model_name=model_name, **kwargs)
        import cohere
        self.client = cohere.Client(api_key or os.getenv("COHERE_API_KEY"))
        self._dim = None  # Lazy-loaded on first encode

    def encode(self, texts: Union[str, List[str]],
               show_progress_bar: bool = False,
               convert_to_numpy: bool = True, **kwargs) -> np.ndarray:
        single = isinstance(texts, str)  # remember whether to return a 1-D vector
        if single:
            texts = [texts]
        response = self.client.embed(texts=texts, model=self.model_name,
                                     input_type="search_document")
        embeddings = np.array(response.embeddings, dtype=np.float32)
        if self._dim is None:
            self._dim = embeddings.shape[1]
        if self.normalize_embeddings:
            norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
            embeddings = embeddings / np.where(norms == 0, 1, norms)
        # Match the base contract: (dim,) for a single string, (n, dim) for a list
        return embeddings[0] if single else embeddings

    @property
    def embedding_dim(self) -> int:
        if self._dim is None:
            self.encode("init")  # trigger dimension detection
        return self._dim

Then register in __init__.py:

from .cohere_embedder import CohereEmbedder
__all__ = [..., "CohereEmbedder"]

The new embedder automatically gains: caching via encode_with_cache, async via aencode, similarity computation, stats tracking, context manager support, and validate_connection.

Adding a New Normalization Step

Extend ArabicEmbedder.ARABIC_NORMALIZATION_PATTERNS and update _normalize_arabic():

# In ArabicEmbedder
ARABIC_NORMALIZATION_PATTERNS = {
    ...
    'numerals': re.compile('[٠١٢٣٤٥٦٧٨٩]'),  # Arabic-Indic numerals
}

def _normalize_arabic(self, text: str) -> str:
    ...
    if self.normalization_level == 'aggressive':
        # Existing steps ...
        # New step: normalize Arabic-Indic numerals to ASCII
        text = self.ARABIC_NORMALIZATION_PATTERNS['numerals'].sub(
            lambda m: str(ord(m.group()) - 0x0660), text
        )
    return text

Plugging in External Cache (Redis)

Override encode_with_cache in a subclass to use an external store:

class RedisCachedEmbedder(OpenAIEmbedder):

    def __init__(self, redis_url: str, **kwargs):
        super().__init__(**kwargs)
        import redis
        self.redis = redis.from_url(redis_url)

    def encode_with_cache(self, texts, **kwargs):
        # Check Redis for each text, encode misses, store results
        ...

11. Performance & Scalability

Batch Processing

All embedders process texts in configurable batches (batch_size, default 32 for local models, 50–100 for API models). Batch size should be tuned per deployment:

GPU with ample VRAM: increase to 128–256 for local models
API models: OpenAI supports up to 2048 texts per call; the default of 100 is conservative
ArabicEmbedder OOM handling: automatically halves batch_size on CUDA OOM and retries — no intervention required

Cache Efficiency

Enable cache_embeddings=True when the same texts recur across requests (e.g., fixed document corpus, common query templates). The base class computes an MD5 digest per text — this hash operation costs ~1µs per text and is negligible vs. encoding latency. Monitor cache efficiency via get_stats()['cache_hit_rate'].

Async Concurrency

aencode(), aencode_with_cache(), and abatch_similarity() dispatch to asyncio.to_thread() (available from Python 3.9), which runs the blocking encode call in the event loop's default thread pool executor. This keeps the event loop responsive in async web servers (FastAPI, aiohttp). Concurrent encode calls run in parallel up to the default ThreadPoolExecutor size (min(32, cpu_count + 4) in Python 3.8+).
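The dispatch pattern can be sketched with a stand-in for the blocking encoder. `blocking_encode` is a hypothetical placeholder that sleeps instead of running a model, and the standalone `aencode` coroutine mirrors the method name only for readability.

```python
import asyncio
import time
from typing import List


def blocking_encode(texts: List[str]) -> List[List[float]]:
    """Stand-in for a synchronous encode() call (model inference or HTTP)."""
    time.sleep(0.05)
    return [[float(len(t))] for t in texts]


async def aencode(texts: List[str]) -> List[List[float]]:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_encode, texts)


async def main() -> list:
    # Both calls overlap instead of running back to back.
    return await asyncio.gather(aencode(["a"]), aencode(["bb", "ccc"]))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.gather` preserves argument order, so results line up with the inputs even when the worker threads finish in a different order.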

GPU Utilization

_get_best_device() probes CUDA → MPS → CPU. For maximum throughput on GPU:

Set batch_size to fill GPU VRAM (~70–80% utilization)
Use normalize_embeddings=True to avoid a separate normalization pass downstream
Call cleanup() or use as a context manager to release CUDA memory between sessions

Memory Considerations

The in-memory cache stores raw np.ndarray objects. At 768 dimensions × 4 bytes × 10,000 texts, the cache consumes ~30MB — acceptable for most deployments. For very large corpora, disable caching and use save_embeddings() / load_embeddings() for .npz-based batch retrieval instead.

Rate Limiting at Scale

For high-throughput API workloads:

OpenAIEmbedder defaults to 3,000 requests/min and 1M tokens/min — match these to your actual tier limits
Create one embedder instance per application (not per request) to share the rate limiter's sliding window across all concurrent requests
For multi-process deployments, rate limiters are per-process; coordinate externally (e.g., Redis-based token bucket) if sharing API quota across workers

Source: community/embedding.md
