
Embeddings Module

A unified, production-grade Python embeddings module with first-class Arabic NLP support, multi-provider integration, and async-ready infrastructure.


Table of Contents

  1. High-Level Overview
  2. Architecture Overview
  3. Core Concepts
  4. Module & Component Breakdown
  5. API / Public Interfaces
  6. Configuration System
  7. Usage Guide
  8. Code Examples
  9. Design Decisions & Trade-offs
  10. Extensibility Guide
  11. Project Structure
  12. Performance & Scalability

1. High-Level Overview

What It Does

This library provides a unified interface for generating text embeddings across six distinct backends: OpenAI, Google Gemini, Mistral AI, HuggingFace (local), Ollama (local), and a specialized Arabic NLP embedder. All backends share a common abstract contract, enabling seamless provider switching without changing application code.

Problem It Solves

Building embedding pipelines typically requires learning each provider's SDK, managing rate limiting, implementing retry logic, handling caching, and dealing with Arabic/multilingual text normalization — separately for each provider. This library eliminates that fragmentation by:

  • Exposing a single encode() interface across all providers
  • Centralizing caching, statistics tracking, and similarity computation in the base class
  • Providing Arabic-specific text normalization as a first-class preprocessing stage
  • Offering async variants of all critical operations out of the box

Key Design Ideas

  • Template Method pattern: BaseEmbedder defines the algorithm skeleton (caching, stats, async dispatch, cleanup), while concrete embedders implement only encode() and the embedding_dim property.
  • Provider-agnostic similarity layer: cosine, dot product, and Euclidean similarity are available uniformly regardless of which backend generated the embeddings.
  • Arabic NLP as a first-class citizen: The library was built with Arabic as a primary target language, with structured normalization levels rather than ad-hoc text cleaning.

2. Architecture Overview

Inheritance Hierarchy

BaseEmbedder  (ABC)
├── ArabicEmbedder           → sentence-transformers + Arabic normalization pipeline
├── HuggingFaceEmbedder      → sentence-transformers (general multilingual)
├── OllamaEmbedder           → Ollama REST API (local inference)
├── OpenAIEmbedder           → OpenAI REST API (cloud)
├── GeminiEmbedder           → Google GenAI SDK (cloud)
└── MistralEmbedder          → Mistral REST API (cloud)

Data Flow

Input Text(s)
     │
     ▼
[ArabicEmbedder only]
Arabic Normalization Pipeline
 (diacritics → hamza → alef_maksura → tatweel → taa_marbuta)
     │
     ▼
Cache Lookup (MD5 key)  ──── HIT ──► Return cached embedding
     │ MISS
     ▼
Provider-specific encode()
 (batching + rate limiting + retries)
     │
     ▼
[Optional] L2 Normalization
     │
     ▼
numpy.ndarray  (n, dim)
     │
     ├── Cache Store
     └── Statistics Update

Module Interaction

EmbedderConfig is consumed by all concrete embedders as a source of default values. The base class orchestrates caching, statistics accumulation, async dispatch, and resource cleanup. Concrete embedders own only the provider-specific network/model call and the embedding_dim property.


3. Core Concepts

Embeddings & Semantic Retrieval

An embedding is a dense numerical vector that encodes the semantic meaning of text. Texts with similar meaning have geometrically close vectors. This enables:

  • Semantic search: retrieve documents by meaning rather than keyword overlap
  • Clustering: group semantically related texts without labels
  • Classification: represent text in a fixed-size feature space for downstream models
  • RAG (Retrieval-Augmented Generation): embed a query, retrieve top-k relevant document chunks, and pass them to a language model as context

Similarity Metrics

The library supports three metrics via similarity() and batch_similarity():

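| Metric    | Formula                   | When to Use                            |
|-----------|---------------------------|----------------------------------------|
| cosine    | dot(a, b) (after L2 norm) | Default; magnitude-independent         |
| dot       | dot(a, b)                 | When embeddings are already normalized |
| euclidean | 1 / (1 + ‖a−b‖)           | When absolute scale matters            |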

When normalize_embeddings=True (the default), cosine and dot product are numerically equivalent.
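
A minimal NumPy sketch of the three metrics in the table, assuming plain 1-D vectors as input. The function names are illustrative and are not the library's internals; the sketch only shows why cosine and dot product coincide once both vectors are L2-normalized.

```python
import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm


def similarity(a: np.ndarray, b: np.ndarray, metric: str = "cosine") -> float:
    """Illustrative versions of the three metrics listed in the table."""
    if metric == "cosine":
        return float(np.dot(l2_normalize(a), l2_normalize(b)))
    if metric == "dot":
        return float(np.dot(a, b))
    if metric == "euclidean":
        # Maps a distance in [0, inf) to a score in (0, 1]
        return float(1.0 / (1.0 + np.linalg.norm(a - b)))
    raise ValueError(f"unknown metric: {metric}")


# With unit-length inputs, cosine and dot give the same number
a = l2_normalize(np.array([3.0, 4.0]))
b = l2_normalize(np.array([4.0, 3.0]))
cos_ab = similarity(a, b, "cosine")
dot_ab = similarity(a, b, "dot")
```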

Arabic Text Normalization

Arabic text presents unique normalization challenges not present in Latin scripts. The library implements three normalization levels:

minimal — Remove diacritics (tashkeel) only. Suitable when the model is sensitive to letter form but the application data is already clean.

standard (default) — Remove diacritics + normalize all Hamza forms (أإآ → ا, ؤ → و, ئ → ي) + normalize Alef Maksura (ى → ي) + remove Tatweel (ـ). This covers the most common inconsistencies in user-generated content.

aggressive — All standard normalizations + normalize Taa Marbuta (ة → ه) + collapse runs of 3+ repeated characters to 2. Use for noisy social-media text.
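
The three levels can be sketched with a handful of regular expressions. This is an approximation under stated assumptions, not the library's implementation: the diacritic range (U+064B to U+0652 plus U+0670) and the function name normalize_arabic are choices made for this example, and the step order follows the pipeline order given in the Data Flow diagram.

```python
import re

# Assumption: "diacritics" means the common tashkeel marks plus superscript alef
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
HAMZA_ON_ALEF = re.compile("[\u0623\u0625\u0622]")  # alef with hamza above/below, alef madda
TATWEEL = re.compile("\u0640")
REPEATS = re.compile(r"(.)\1{2,}")  # any character repeated 3 or more times


def normalize_arabic(text: str, level: str = "standard") -> str:
    if level not in ("minimal", "standard", "aggressive"):
        raise ValueError(f"unknown normalization level: {level}")

    text = DIACRITICS.sub("", text)               # all levels
    if level == "minimal":
        return text

    text = HAMZA_ON_ALEF.sub("\u0627", text)      # hamza-on-alef forms -> bare alef
    text = text.replace("\u0624", "\u0648")       # waw with hamza -> waw
    text = text.replace("\u0626", "\u064A")       # yeh with hamza -> yeh
    text = text.replace("\u0649", "\u064A")       # alef maksura -> yeh
    text = TATWEEL.sub("", text)                  # drop elongation marks

    if level == "aggressive":
        text = text.replace("\u0629", "\u0647")   # taa marbuta -> heh
        text = REPEATS.sub(r"\1\1", text)         # collapse runs of 3+ to 2
    return text
```

Calling normalize_arabic(text, "minimal") on a fully vocalized word strips only the marks, while "standard" additionally unifies the hamza forms, so the same word written with or without a hamza maps to one string.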

Caching Strategy

The base class implements an in-memory embedding cache keyed by the MD5 hash of the raw text string. Hashing the text, rather than using it as the key directly, keeps keys at a fixed, compact size with O(1) lookup, and collisions are not a practical concern for embedding workloads. The cache is intentionally session-scoped (no disk persistence), so embeddings computed by a previously used model cannot be reloaded and served as stale results.
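
A small sketch of an MD5-keyed cache wrapped around an arbitrary encode function. EmbeddingCache, cache_key, and the fake encoder are hypothetical names used for illustration; the library's encode_with_cache may differ in how it batches misses and counts hits.

```python
import hashlib
from typing import Callable, Dict, List

import numpy as np


def cache_key(text: str) -> str:
    """Fixed-size key derived from the raw text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Session-scoped, in-memory cache in front of an encode function."""

    def __init__(self, encode_fn: Callable[[List[str]], np.ndarray]):
        self._encode_fn = encode_fn
        self._store: Dict[str, np.ndarray] = {}
        self.hits = 0
        self.misses = 0

    def encode(self, texts: List[str]) -> np.ndarray:
        keys = [cache_key(t) for t in texts]
        missing: Dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key in self._store:
                self.hits += 1
            else:
                self.misses += 1
                missing.setdefault(key, text)  # encode each distinct miss once
        if missing:
            vectors = self._encode_fn(list(missing.values()))
            for key, vector in zip(missing.keys(), vectors):
                self._store[key] = vector
        return np.stack([self._store[key] for key in keys])


# Usage with a stand-in encoder that records what it was asked to encode
calls: List[List[str]] = []


def fake_encode(texts: List[str]) -> np.ndarray:
    calls.append(list(texts))
    return np.array([[float(len(t)), 1.0] for t in texts], dtype=np.float32)


cache = EmbeddingCache(fake_encode)
first = cache.encode(["alpha", "beta"])
second = cache.encode(["beta", "gamma"])  # only "gamma" reaches the encoder
```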

Rate Limiting

API-backed embedders (OpenAI, Mistral) implement sliding-window rate limiters guarded by a threading.RLock for thread safety. The limiter tracks request timestamps and token counts within a 1-minute (OpenAI) or 1-second (Mistral) window. When a limit is about to be exceeded, the limiter sleeps for the minimum necessary duration rather than raising an exception.
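
A simplified sketch of a sliding-window limiter of this kind. It tracks request counts only (the token budget is left out), and the class name plus the injectable clock and sleep parameters are assumptions made so the behavior can be exercised without real waiting.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.max_requests = max_requests
        self.window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.RLock()
        self._timestamps: Deque[float] = deque()

    def acquire(self) -> float:
        """Block until a slot is free; return the total seconds slept."""
        waited = 0.0
        with self._lock:
            while True:
                now = self._clock()
                # Forget requests that have left the window
                while self._timestamps and now - self._timestamps[0] >= self.window:
                    self._timestamps.popleft()
                if len(self._timestamps) < self.max_requests:
                    self._timestamps.append(now)
                    return waited
                # Sleep only until the oldest request expires
                delay = self.window - (now - self._timestamps[0])
                self._sleep(delay)
                waited += delay
```

Sleeping while holding the lock keeps the sketch short; it also means other threads queue behind the sleeping caller, which is acceptable for a single shared limiter but worth revisiting under heavy contention.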

Retry with Exponential Backoff

All API embedders implement retry loops with configurable max_retries and exponential backoff (delay * 2^attempt). Rate-limit errors are retried; hard API errors are re-raised immediately to avoid burning retry budget on unrecoverable failures.
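
The retry policy described above can be sketched as follows. RateLimitError and HardAPIError are hypothetical exception types standing in for whatever each provider raises, and call_with_retries is an illustrative helper, not a function from the library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical: the provider asked us to slow down (retryable)."""


class HardAPIError(Exception):
    """Hypothetical: an unrecoverable failure such as a bad API key."""


def call_with_retries(fn: Callable[[], T], max_retries: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn on rate-limit errors, waiting base_delay * 2**attempt between tries."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retry budget exhausted
            sleep(base_delay * 2 ** attempt)
        # Any other exception (for example HardAPIError) propagates immediately
    raise AssertionError("unreachable")
```

With base_delay=1.0 the waits are 1 s, 2 s, 4 s, and so on, one per failed attempt.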


4. Module & Component Breakdown

EmbedderConfig

Purpose: Centralizes all default values in a single @dataclass, preventing scattered magic numbers.

Responsibilities: Provides defaults for batch size, normalization flags, max sequence length, cache settings, Ollama connection parameters, and progress display. Concrete embedders import a module-level config = EmbedderConfig() instance and use it as parameter defaults, making the library configurable without subclassing.

Key fields:

  • batch_size: int = 32
  • normalize_embeddings: bool = True
  • max_seq_length: int = 512
  • enable_cache: bool = False
  • base_url: str = "http://127.0.0.1:11434" (Ollama)

BaseEmbedder

Purpose: Abstract base that defines the contract and provides all shared infrastructure.

Responsibilities:

  • Declares encode() and embedding_dim as abstract
  • Implements encode_with_cache(), similarity(), batch_similarity()
  • Provides validate_connection(), get_model_info(), get_stats(), reset_stats(), clear_cache()
  • Implements async variants (aencode, aencode_with_cache, abatch_similarity) via asyncio.to_thread
  • Provides a timing() context manager for performance measurement
  • Implements __aenter__/__aexit__ for async context manager protocol
  • Implements __del__ for automatic GPU memory cleanup

Hidden design detail: The _stats attribute may hold either a plain dict or a dataclass. This is a backward-compatibility bridge: HuggingFaceEmbedder uses an EmbeddingStats dataclass while all other embedders use a plain dict. get_stats() detects the type at runtime and normalizes the output.

Device selection: _get_best_device() probes CUDA → MPS (Apple Silicon) → CPU in order, logging the selected device for debugging.
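
A probe in that order might look like the sketch below. It assumes PyTorch is the backend being probed and falls back to CPU when torch is not installed; the standalone function name get_best_device is illustrative.

```python
import logging

logger = logging.getLogger(__name__)


def get_best_device() -> str:
    """Probe CUDA, then MPS (Apple Silicon), then fall back to CPU."""
    try:
        import torch
    except ImportError:
        logger.info("torch not installed; using cpu")
        return "cpu"

    if torch.cuda.is_available():
        device = "cuda"
    else:
        # torch.backends.mps is absent in older PyTorch releases
        mps = getattr(torch.backends, "mps", None)
        device = "mps" if mps is not None and mps.is_available() else "cpu"

    logger.info("selected device: %s", device)
    return device


device = get_best_device()
```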


ArabicEmbedder

Purpose: Local sentence-transformers embedder with a full Arabic NLP preprocessing pipeline.

Responsibilities:

  • Aliases friendly model keys (multilingual, labse, arabert, etc.) to full HuggingFace model names
  • Applies a multi-stage Arabic normalization pipeline before encoding
  • Tracks normalization statistics (diacritics removed, hamzas normalized) via ArabicProcessingStats
  • Handles OOM errors by halving batch size and retrying automatically
  • Provides save_embeddings() / load_embeddings() for .npz persistence with optional text and metadata arrays
  • Provides find_most_similar() and benchmark() as high-level utilities
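
The OOM-recovery behavior listed above can be sketched generically. The helper name and the oom_error parameter are assumptions for this example; with PyTorch one would typically pass the CUDA out-of-memory exception type instead of the default MemoryError.

```python
from typing import Callable, List, Sequence, Type

import numpy as np


def encode_with_oom_backoff(
    encode_batch: Callable[[Sequence[str]], np.ndarray],
    texts: Sequence[str],
    batch_size: int,
    oom_error: Type[BaseException] = MemoryError,
    min_batch_size: int = 1,
) -> np.ndarray:
    """Encode in batches, halving batch_size whenever a batch runs out of memory."""
    outputs: List[np.ndarray] = []
    start = 0
    while start < len(texts):
        batch = texts[start:start + batch_size]
        try:
            outputs.append(encode_batch(batch))
        except oom_error:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    if not outputs:
        return np.empty((0, 0), dtype=np.float32)
    return np.concatenate(outputs, axis=0)


# Usage with a stand-in encoder that "runs out of memory" above 2 texts
seen_batch_sizes: List[int] = []


def fake_encode_batch(batch: Sequence[str]) -> np.ndarray:
    if len(batch) > 2:
        raise MemoryError("simulated OOM")
    seen_batch_sizes.append(len(batch))
    return np.ones((len(batch), 3), dtype=np.float32)


out = encode_with_oom_backoff(fake_encode_batch, ["a", "b", "c", "d", "e"], batch_size=8)
```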

Key classes:

  • ArabicProcessingStats: dataclass tracking normalization metrics per session
  • ARABIC_NORMALIZATION_PATTERNS: class-level compiled regex patterns (compiled once at class definition, not per call)
  • RECOMMENDED_MODELS: maps human-friendly keys to full model names + dimension info

Interaction: Inherits encode_with_cache, similarity, batch_similarity from BaseEmbedder. Uses sentence-transformers.SentenceTransformer internally.


HuggingFaceEmbedder

Purpose: General-purpose local sentence-transformers embedder with TTL-based cache and typed model registry.

Responsibilities:

  • Maintains ARABIC_MODELS registry mapping short names to ModelInfo dataclasses (dimensions, max_tokens, ArabicQuality enum, size)
  • Implements TTL-aware cache via _cache_timestamps dict alongside the base cache
  • Uses a retry_on_failure decorator (exponential backoff) on its internal _process_batch
  • Supports per-call normalization override via normalize parameter on encode()

Key classes:

  • ArabicQuality(Enum): EXCELLENT | GOOD | FAIR | UNKNOWN — structured quality tagging
  • ModelInfo: dataclass carrying model metadata for the registry
  • EmbeddingStats: dataclass (not dict) for statistics; requires to_dict() bridge in BaseEmbedder.get_stats()

Design note: Unlike ArabicEmbedder, this class does not apply Arabic-specific text preprocessing. It is intended for general multilingual use where the model's tokenizer handles normalization.


OllamaEmbedder

Purpose: Embeds text using a locally running Ollama inference server via its REST API, with no API key and no data egress.

Responsibilities:

  • Pings the Ollama /api/tags endpoint at init to verify server reachability
  • Optionally auto-starts the ollama serve subprocess if the server is not running
  • Sends embedding requests to /api/embed with configurable timeout and retries
  • Stores MODEL_SPECS dict at module level (not class level) for known models, with graceful fallback for unknown models

Known models with specs: nomic-embed-text (768d), mxbai-embed-large (1024d), all-minilm (384d), snowflake-arctic-embed (1024d), bge-m3 (1024d, best Arabic support).

Key behavior: If auto_start_server=True and ollama serve is not running, the embedder will launch it via subprocess and wait server_start_wait seconds before proceeding.
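
For orientation, a raw call to the /api/embed endpoint can be sketched with the standard library alone. The helper names are hypothetical and this is not how OllamaEmbedder is necessarily written; it assumes the endpoint accepts a JSON body with "model" and "input" and answers with an "embeddings" list, and it omits the retries and server auto-start described above.

```python
import json
import urllib.request
from typing import List


def build_embed_request(texts: List[str], model: str = "nomic-embed-text",
                        base_url: str = "http://127.0.0.1:11434") -> urllib.request.Request:
    """Build the POST request for Ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts: List[str], timeout: float = 30.0, **kwargs) -> List[List[float]]:
    """Send the request; requires a running Ollama server with the model pulled."""
    request = build_embed_request(texts, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["embeddings"]
```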


OpenAIEmbedder

Purpose: Production-grade OpenAI embedding client with token-aware rate limiting, cost tracking, and tiktoken-based accurate token counting.

Responsibilities:

  • Validates the API key on init by making a minimal test request (fails fast vs. failing on first real call)
  • Uses tiktoken (cl100k_base encoding) for accurate pre-call token estimation; falls back to len(text) // 4 if tiktoken is unavailable
  • Supports dimension reduction on text-embedding-3-* models via the dimensions API parameter
  • Tracks cost per request using MODEL_SPECS.cost_per_1m_tokens with UsageStats dataclass
  • Implements two-level rate limiting: requests/min and tokens/min via RateLimiter

Key classes:

  • UsageStats: tracks total tokens, cost, requests; computes requests-by-minute breakdown
  • RateLimiter: sliding-window limiter with RLock for thread safety

Supported models:

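| Model                  | Dimensions       | Cost/1M tokens | Max tokens |
|------------------------|------------------|----------------|------------|
| text-embedding-3-large | 3072 (reducible) | $0.13          | 8191       |
| text-embedding-3-small | 1536 (reducible) | $0.02          | 8191       |
| text-embedding-ada-002 | 1536             | $0.10          | 8191       |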
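
A rough sketch of how token counting with a fallback and cost estimation fit together, using the prices from the table above. The helper names are illustrative; only the cl100k_base encoding, the len(text) // 4 fallback, and the result keys total_tokens and estimated_cost_usd come from this document.

```python
from typing import Any, Dict, List, Optional, Union

# Prices in USD per 1M tokens, copied from the table above
COST_PER_1M_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "text-embedding-ada-002": 0.10,
}


def load_encoder() -> Optional[Any]:
    """Return a tiktoken encoder, or None when tiktoken cannot be used."""
    try:
        import tiktoken
        return tiktoken.get_encoding("cl100k_base")
    except Exception:  # not installed, or the encoding data is unavailable
        return None


def count_tokens(text: str, encoder: Optional[Any] = None) -> int:
    """Exact count with an encoder, else the ~4 characters per token heuristic."""
    if encoder is not None:
        return len(encoder.encode(text))
    return len(text) // 4


def estimate_cost(texts: Union[str, List[str]], model: str,
                  encoder: Optional[Any] = None) -> Dict[str, Any]:
    if isinstance(texts, str):
        texts = [texts]
    total_tokens = sum(count_tokens(t, encoder) for t in texts)
    cost = total_tokens / 1_000_000 * COST_PER_1M_TOKENS[model]
    return {"total_tokens": total_tokens, "estimated_cost_usd": cost}
```

Passing encoder=load_encoder() gives exact counts when tiktoken is usable and silently degrades to the heuristic otherwise.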

GeminiEmbedder

Purpose: Google Gemini embedding client with task-type–aware encoding and MRL (Matryoshka Representation Learning) dimension reduction.

Responsibilities:

  • Uses the new google-genai SDK (not the deprecated google-generativeai)
  • Exposes task_type parameter to signal the embedding's intended use to the model (RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING)
  • Supports output_dimensionality for MRL-based dimension reduction (valid for gemini-embedding-001)
  • Validates model status — deprecated models (text-embedding-004, embedding-001) emit warnings with sunset dates
  • Tracks requests/chars/errors via GeminiUsageStats

Key insight: Task-type hints allow the Gemini model to optimize the embedding distribution for the specific downstream task. For RAG, encode queries with RETRIEVAL_QUERY and documents with RETRIEVAL_DOCUMENT for best retrieval accuracy.

Supported models:

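| Model                | Dimensions          | Status                |
|----------------------|---------------------|-----------------------|
| gemini-embedding-001 | 3072 (MRL-reducible) | GA                    |
| text-embedding-004   | 768                 | Deprecated 2026-01-14 |
| embedding-001        | 768                 | Deprecated 2025-08-14 |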

MistralEmbedder

Purpose: Mistral AI embedding client with per-second rate limiting and LRU caching.

Responsibilities:

  • Implements MistralRateLimiter with a 1-second sliding window (Mistral's tighter rate limit vs. OpenAI's per-minute)
  • Uses OrderedDict for LRU cache eviction (bounded by cache_size parameter)
  • Reads API key from MISTRAL_API_KEY env var or constructor parameter
  • Calls https://api.mistral.ai/v1/embeddings directly via requests (no official SDK dependency)
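
The OrderedDict-based eviction mentioned above follows a well-known pattern, sketched here. The LRUCache class is illustrative and not taken from MistralEmbedder; only the use of OrderedDict and a cache_size bound come from the description.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Bounded cache that evicts the least recently used entry first."""

    def __init__(self, cache_size: int):
        if cache_size <= 0:
            raise ValueError("cache_size must be positive")
        self.cache_size = cache_size
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.cache_size:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self) -> int:
        return len(self._data)


cache = LRUCache(cache_size=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used entry
cache.put("c", 3)    # evicts "b", the least recently used
```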

Only available model: mistral-embed (1024 dimensions, $0.10/1M tokens).


Public Surface

Purpose: Defines the public API, module-level metadata, and documentation constants.

Exports: ArabicEmbedder, BaseEmbedder, EmbedderConfig, GeminiEmbedder, HuggingFaceEmbedder, OllamaEmbedder, MistralEmbedder, OpenAIEmbedder.

Metadata constants:

  • __arabic_normalization__: list of all normalization operations in pipeline order
  • __valid_levels__: human-readable descriptions of normalization levels
  • __task_types_gemini__: bilingual (Arabic/English) task type documentation
  • __gemini_embedding_models__: complete model specifications for Gemini models

5. API / Public Interfaces

BaseEmbedder — Core Contract

class BaseEmbedder(ABC):

    def __init__(
        self,
        model_name: str,
        device: Optional[str] = None,          # 'cuda' | 'cpu' | 'mps' | None (auto)
        normalize_embeddings: bool = True,
        batch_size: int = 32,
        max_length: Optional[int] = 512,
        cache_embeddings: bool = False,
        show_progress: bool = False,
        **kwargs
    ): ...

    @abstractmethod
    def encode(
        self,
        texts: Union[str, List[str]],
        show_progress_bar: bool = False,
        convert_to_numpy: bool = True,
        **kwargs
    ) -> np.ndarray: ...
    # Returns: shape (dim,) for single text, (n, dim) for list

    @property
    @abstractmethod
    def embedding_dim(self) -> int: ...

    def encode_with_cache(self, texts, **kwargs) -> np.ndarray: ...

    def similarity(
        self,
        text1: Union[str, np.ndarray],
        text2: Union[str, np.ndarray],
        metric: str = 'cosine'           # 'cosine' | 'dot' | 'euclidean'
    ) -> float: ...

    def batch_similarity(
        self,
        query: Union[str, np.ndarray],
        texts: List[str],
        top_k: Optional[int] = None
    ) -> Union[np.ndarray, List[Tuple[int, float]]]: ...
    # Returns: similarity array if top_k=None, else [(index, score), ...]

    def validate_connection(
        self,
        test_text: str = "مرحباً Hello",
        detailed: bool = False
    ) -> Dict[str, Any]: ...

    def get_model_info(self) -> Dict[str, Any]: ...
    def get_stats(self) -> Dict[str, Any]: ...
    def reset_stats(self): ...
    def clear_cache(self): ...

    # Async API
    async def aencode(self, texts, **kwargs) -> np.ndarray: ...
    async def aencode_with_cache(self, texts, **kwargs) -> np.ndarray: ...
    async def abatch_similarity(self, query, candidates, **kwargs): ...

    # Context manager
    @contextmanager
    def timing(self, operation: str = "encoding"): ...

ArabicEmbedder — Extended API

ArabicEmbedder(
    model_name: Optional[str] = None,   # Key from RECOMMENDED_MODELS or full HF model name
    normalization_level: str = 'standard',  # 'minimal' | 'standard' | 'aggressive'
    enable_preprocessing: bool = True,
    track_processing_stats: bool = True,
    auto_download: bool = True,
    ...
)

.find_most_similar(
    query: str,
    candidates: List[str],
    top_k: int = 5,
    metric: str = 'cosine'
) -> List[Tuple[int, str, float]]     # [(index, text, score), ...]

.save_embeddings(texts, filepath, save_texts=True, save_metadata=True) -> None
.load_embeddings(filepath, load_texts=False, load_metadata=False) -> Union[np.ndarray, Tuple]
.benchmark(sample_texts=None, num_iterations=10, warmup_iterations=2) -> Dict
.get_processing_stats() -> Dict[str, Any]
.list_recommended_models() -> Dict[str, Dict]  # @staticmethod

OpenAIEmbedder — Extended API

OpenAIEmbedder(
    model_name: str = "text-embedding-3-small",
    api_key: Optional[str] = None,     # Falls back to OPENAI_API_KEY env var
    dimensions: Optional[int] = None,  # Dimension reduction (text-embedding-3-* only)
    enable_rate_limiting: bool = True,
    max_requests_per_minute: int = 3000,
    max_tokens_per_minute: int = 1_000_000,
    track_costs: bool = True,
    ...
)

.estimate_cost(texts: Union[str, List[str]]) -> Dict[str, Any]
.get_usage_stats() -> Dict[str, Any]

GeminiEmbedder — Extended API

GeminiEmbedder(
    model_name: str = "gemini-embedding-001",
    api_key: Optional[str] = None,             # Falls back to GOOGLE_API_KEY env var
    task_type: Optional[str] = None,           # 'RETRIEVAL_QUERY' | 'RETRIEVAL_DOCUMENT' | ...
    output_dimensionality: Optional[int] = None,  # MRL reduction
    ...
)

6. Configuration System

EmbedderConfig Dataclass

All embedders accept EmbedderConfig defaults. Override at the module level to change library-wide defaults:

from fennec_community.embeddings import EmbedderConfig

# Override defaults globally before importing embedders
config = EmbedderConfig(
    batch_size=64,
    normalize_embeddings=True,
    enable_cache=True,
    show_progress_bar=True,
)

Full Configuration Reference

| Field                       | Default                  | Description                          |
|-----------------------------|--------------------------|--------------------------------------|
| model_name                  | "multilingual"           | Default model key                    |
| device                      | None (auto)              | 'cuda', 'cpu', 'mps'                 |
| cache_dir                   | None                     | HuggingFace model cache directory    |
| batch_size                  | 32                       | Texts per encode batch               |
| normalize_embeddings        | True                     | L2-normalize output vectors          |
| max_seq_length              | 512                      | Token truncation limit               |
| enable_preprocessing        | True                     | Arabic normalization pipeline        |
| enable_arabic_normalization | True                     | Arabic char normalization            |
| enable_cache                | False                    | In-memory embedding cache            |
| cache_size                  | 10000                    | Max cached entries                   |
| show_progress_bar           | False                    | tqdm progress during encoding        |
| convert_to_numpy            | True                     | Always return np.ndarray             |
| skip_preprocessing          | False                    | Bypass Arabic preprocessing          |
| return_valid_indices        | False                    | Return (embeddings, valid_idx) tuple |
| track_processing_stats      | True                     | Track normalization statistics       |
| auto_download               | True                     | Auto-download missing models         |
| trust_remote_code           | False                    | HuggingFace trust_remote_code        |
| base_url                    | "http://127.0.0.1:11434" | Ollama server URL                    |
| embedding_model             | "nomic-embed-text"       | Default Ollama model                 |

Environment Variables

| Variable        | Used By         |
|-----------------|-----------------|
| OPENAI_API_KEY  | OpenAIEmbedder  |
| GOOGLE_API_KEY  | GeminiEmbedder  |
| MISTRAL_API_KEY | MistralEmbedder |

7. Usage Guide

Quick Start

from fennec_community.embeddings import ArabicEmbedder

# Minimal setup — downloads model on first run
embedder = ArabicEmbedder()
embedding = embedder.encode("مرحبا بك في عالم الذكاء الاصطناعي")
print(embedding.shape)  # (dim,), e.g. (768,) if the default resolves to paraphrase-multilingual-mpnet-base-v2

Provider Selection Guide

| Use Case                     | Recommended Embedder     | Reason                                                 |
|------------------------------|--------------------------|--------------------------------------------------------|
| Arabic NLP, on-premise       | ArabicEmbedder           | Built-in normalization pipeline                        |
| General multilingual, local  | HuggingFaceEmbedder      | Full model control, no cost                            |
| Production API, best quality | OpenAIEmbedder (3-large) | Highest accuracy, dimension reduction                  |
| Budget-conscious API         | OpenAIEmbedder (3-small) | Good quality; about 6.5× cheaper per token than 3-large |
| Task-specific RAG            | GeminiEmbedder           | Task-type optimization                                 |
| Air-gapped / privacy-first   | OllamaEmbedder           | Zero network egress                                    |
| Mistral ecosystem            | MistralEmbedder          | Native Mistral stack integration                       |

Basic Usage Pattern

# All embedders share the same core interface
from fennec_community.embeddings import OpenAIEmbedder

embedder = OpenAIEmbedder(model_name="text-embedding-3-small")

# Single text → 1D array of shape (dim,)
emb = embedder.encode("Hello world")

# Multiple texts → 2D array of shape (n, dim)
embs = embedder.encode(["First text", "Second text", "Third text"])

# Semantic similarity
score = embedder.similarity("cat", "kitten")       # float, ~0.85
score = embedder.similarity("cat", "automobile")   # float, ~0.2

# Top-k search
results = embedder.batch_similarity(
    query="artificial intelligence",
    texts=["machine learning", "cooking recipes", "neural networks"],
    top_k=2
)
# Returns (index, score) pairs, e.g. [(0, 0.91), (2, 0.88)]; actual scores depend on the model

Arabic NLP Advanced Usage

from fennec_community.embeddings import ArabicEmbedder

# Use aggressive normalization for social media text
with ArabicEmbedder(
    model_name='arabert',
    normalization_level='aggressive',
    cache_embeddings=True
) as embedder:

    # Texts with heavy diacritics + repeated chars + hamza variants
    messy_texts = [
        "أَهْلاً وَسَهْلاً بِكُمْ",
        "ههههههههه كتير كتيرررر",
        "اريد ان اتعلم البرمجة"
    ]
    embeddings = embedder.encode(messy_texts)

    # Inspect normalization statistics
    stats = embedder.get_processing_stats()
    print(f"Diacritics removed: {stats['removed_diacritics']}")
    print(f"Hamzas normalized: {stats['normalized_hamzas']}")

Async Usage

import asyncio
from fennec_community.embeddings import HuggingFaceEmbedder

async def embed_documents(texts):
    async with HuggingFaceEmbedder("bge-m3") as embedder:
        # Runs encode() in a thread pool — does not block event loop
        embeddings = await embedder.aencode(texts)
        return embeddings

embeddings = asyncio.run(embed_documents(["doc1", "doc2", "doc3"]))

Persistence (ArabicEmbedder)

from fennec_community.embeddings import ArabicEmbedder

embedder = ArabicEmbedder(model_name='multilingual')

# Save to disk
texts = ["النص الأول", "النص الثاني"]
embedder.save_embeddings(texts, "my_embeddings", save_texts=True, save_metadata=True)
# Creates: my_embeddings.npz

# Load from disk
embeddings, texts, metadata = embedder.load_embeddings(
    "my_embeddings",
    load_texts=True,
    load_metadata=True
)
print(metadata['model_name'])     # paraphrase-multilingual-mpnet-base-v2
print(metadata['timestamp'])      # ISO 8601 timestamp

Cost Estimation (OpenAI)

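from fennec_community.embeddings import OpenAIEmbedder

embedder = OpenAIEmbedder(model_name="text-embedding-3-large")
my_large_document_list = ["First document", "Second document"]  # placeholder corpus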

# Estimate before committing to the API call
estimate = embedder.estimate_cost(my_large_document_list)
print(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")
print(f"Total tokens: {estimate['total_tokens']:,}")

# After encoding, review actual usage
stats = embedder.get_usage_stats()
print(f"Actual cost: ${stats['api_usage']['total_cost_usd']:.6f}")

Performance Benchmarking

from fennec_community.embeddings import ArabicEmbedder

embedder = ArabicEmbedder(model_name='multilingual-mini')
results = embedder.benchmark(num_iterations=20, warmup_iterations=3)

print(f"Texts/second: {results['texts_per_second']:.1f}")
print(f"Avg latency:  {results['avg_time']*1000:.1f}ms")
print(f"Device:       {results['device']}")

8. Code Examples

RAG Document Indexing Pipeline

from fennec_community.embeddings import GeminiEmbedder
import numpy as np

# Index-time: embed documents
doc_embedder = GeminiEmbedder(
    model_name="gemini-embedding-001",
    task_type="RETRIEVAL_DOCUMENT",
    output_dimensionality=768,   # Reduce 3072 → 768 via MRL
    cache_embeddings=True,
)

documents = [
    "Python is a high-level programming language.",
    "Machine learning models require training data.",
    "Neural networks are inspired by the brain.",
]

doc_embeddings = doc_embedder.encode(documents)
print(doc_embeddings.shape)  # (3, 768)

# Query-time: embed query with matching task type
query_embedder = GeminiEmbedder(
    model_name="gemini-embedding-001",
    task_type="RETRIEVAL_QUERY",
    output_dimensionality=768,
)

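query = "What is Python used for?"
query_embedding = query_embedder.encode(query)   # shape (768,)

# Score the query against the document-side embeddings computed at index time,
# so each side keeps its own task type. With L2-normalized vectors (the default),
# the dot product equals cosine similarity.
scores = doc_embeddings @ query_embedding
for idx in np.argsort(scores)[::-1][:2]:
    print(f"[{scores[idx]:.3f}] {documents[idx]}")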

Multi-Provider Evaluation

from fennec_community.embeddings import ArabicEmbedder, OpenAIEmbedder

query = "ما هو الذكاء الاصطناعي؟"
candidates = [
    "الذكاء الاصطناعي هو محاكاة العقل البشري",
    "الطبخ فن جميل",
    "التعلم الآلي فرع من الذكاء الاصطناعي",
]

for EmbedderClass, kwargs in [
    (ArabicEmbedder, {"model_name": "multilingual"}),
    (OpenAIEmbedder, {"model_name": "text-embedding-3-small"}),
]:
    embedder = EmbedderClass(**kwargs)
    results = embedder.batch_similarity(query, candidates, top_k=2)
    print(f"\n{EmbedderClass.__name__}:")
    for idx, score in results:
        print(f"  [{score:.3f}] {candidates[idx]}")

Connection Health Check

from fennec_community.embeddings import OllamaEmbedder

embedder = OllamaEmbedder(model_name="bge-m3")
report = embedder.validate_connection(detailed=True)

if report['success']:
    print(f"✅ Connected — dim={report['embedding_dim']}, latency={report['encoding_time']}")
else:
    print(f"❌ Failed: {report['reason']}")

Timing and Statistics

from fennec_community.embeddings import HuggingFaceEmbedder

embedder = HuggingFaceEmbedder(model_name="bge-m3")
large_text_list = ["First text", "Second text", "Third text"]  # placeholder corpus

with embedder.timing("bulk_encode"):
    embeddings = embedder.encode(large_text_list)

stats = embedder.get_stats()
print(f"Total texts:    {stats['total_texts']}")
print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")
print(f"Avg time/text:  {stats['avg_time_per_text']*1000:.2f}ms")

9. Design Decisions & Trade-offs

Template Method Over Composition

Decision: Shared logic (caching, stats, async, cleanup) lives in BaseEmbedder via the Template Method pattern rather than being injected as collaborator objects.

Pros: Zero boilerplate in concrete embedders; all providers get caching, timing, and async for free. Adding a new provider means implementing two methods.

Cons: Tighter coupling between base and concrete classes. A provider that wants different cache semantics (e.g., Redis) must override multiple base methods. The _stats dict/dataclass dual-support is evidence of this friction accumulating.

In-Memory Cache Only

Decision: The cache is a session-scoped Python dict, never persisted to disk.

Pros: No deserialization overhead, no cache invalidation complexity, no stale embeddings when models change.

Cons: Cache is lost on process restart. For high-reuse workloads (e.g., document corpora that are repeatedly queried), callers must implement external caching or use save_embeddings() / load_embeddings().

Arabic Normalization as Preprocessing (Not Postprocessing)

Decision: Arabic normalization happens before encoding, not after.

Pros: The embedding model sees cleaner, more consistent input. Two surface-form variants of the same word (e.g., أحمد and احمد) normalize to the same string, so they produce the same token sequence and therefore identical embeddings.

Cons: If a model was specifically fine-tuned on raw Arabic text including diacritics, aggressive normalization may degrade quality. The skip_preprocessing parameter and minimal level exist to mitigate this.

Separate ArabicEmbedder vs. HuggingFaceEmbedder

Decision: Arabic-specific functionality is isolated in its own class rather than added as flags to HuggingFaceEmbedder.

Pros: Cleaner separation of concerns. Arabic users get a purpose-built interface with save/load, benchmarking, and processing stats. General users are not burdened with Arabic-specific parameters.

Cons: Some code duplication between the two classes. The model-loading logic and SentenceTransformer call are essentially identical.

API Key in Constructor vs. Environment Only

Decision: All API embedders accept api_key as an explicit constructor parameter with environment variable fallback.

Pros: Enables testing with different credentials without environment manipulation; makes dependencies explicit.

Cons: Risk of accidentally logging or serializing the API key if the embedder object is repr'd or pickled carelessly.

Rate Limiting Inside the Embedder

Decision: Rate limiting logic is embedded within OpenAIEmbedder and MistralEmbedder rather than being a separate middleware.

Pros: Self-contained — users cannot accidentally bypass the limiter by calling the underlying client directly.

Cons: If multiple embedder instances are created for the same API account, each has an independent rate limiter, potentially allowing combined rates to exceed account limits.


10. Extensibility Guide

Adding a New Embedding Provider

Create a new file (e.g., cohere_embedder.py) and implement the two abstract members:

import os
from typing import List, Optional, Union

import numpy as np

from fennec_community.embeddings import BaseEmbedder

class CohereEmbedder(BaseEmbedder):

    def __init__(self, model_name: str = "embed-multilingual-v3.0",
                 api_key: Optional[str] = None, **kwargs):
        super().__init__(model_name=model_name, **kwargs)
        import cohere  # imported lazily so the dependency stays optional
        self.client = cohere.Client(api_key or os.getenv("COHERE_API_KEY"))
        self._dim = None  # Lazy-loaded on first encode

    def encode(self, texts: Union[str, List[str]],
               show_progress_bar: bool = False,
               convert_to_numpy: bool = True, **kwargs) -> np.ndarray:
        single = isinstance(texts, str)
        if single:
            texts = [texts]
        response = self.client.embed(texts=texts, model=self.model_name,
                                     input_type="search_document")
        embeddings = np.array(response.embeddings, dtype=np.float32)
        if self._dim is None:
            self._dim = embeddings.shape[1]
        if self.normalize_embeddings:
            norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
            embeddings = embeddings / np.where(norms == 0, 1, norms)
        # Honor the base contract: (dim,) for a single text, (n, dim) for a list
        return embeddings[0] if single else embeddings

    @property
    def embedding_dim(self) -> int:
        if self._dim is None:
            self.encode("init")  # trigger dimension detection
        return self._dim

Then register in __init__.py:

from .cohere_embedder import CohereEmbedder
__all__ = [..., "CohereEmbedder"]

The new embedder automatically gains: caching via encode_with_cache, async via aencode, similarity computation, stats tracking, context manager support, and validate_connection.

Adding a New Normalization Step

Extend ArabicEmbedder.ARABIC_NORMALIZATION_PATTERNS and update _normalize_arabic():

# In ArabicEmbedder
ARABIC_NORMALIZATION_PATTERNS = {
    ...
    'numerals': re.compile('[٠١٢٣٤٥٦٧٨٩]'),  # Arabic-Indic numerals
}

def _normalize_arabic(self, text: str) -> str:
    ...
    if self.normalization_level == 'aggressive':
        # Existing steps ...
        # New step: normalize Arabic-Indic numerals to ASCII
        text = self.ARABIC_NORMALIZATION_PATTERNS['numerals'].sub(
            lambda m: str(ord(m.group()) - 0x0660), text
        )
    return text

Plugging in External Cache (Redis)

Override encode_with_cache in a subclass to use an external store:

class RedisCachedEmbedder(OpenAIEmbedder):

    def __init__(self, redis_url: str, **kwargs):
        super().__init__(**kwargs)
        import redis
        self.redis = redis.from_url(redis_url)

    def encode_with_cache(self, texts, **kwargs):
        # Check Redis for each text, encode misses, store results
        ...

12. Performance & Scalability

Batch Processing

All embedders process texts in configurable batches (batch_size, default 32 for local models, 50–100 for API models). Batch size should be tuned per deployment:

  • GPU with ample VRAM: increase to 128–256 for local models
  • API models: OpenAI supports up to 2048 texts per call; the default of 100 is conservative
  • ArabicEmbedder OOM handling: automatically halves batch_size on CUDA OOM and retries — no intervention required

Cache Efficiency

Enable cache_embeddings=True when the same texts recur across requests (e.g., fixed document corpus, common query templates). The base class computes an MD5 digest per text — this hash operation costs ~1µs per text and is negligible vs. encoding latency. Monitor cache efficiency via get_stats()['cache_hit_rate'].

Async Concurrency

aencode(), aencode_with_cache(), and abatch_similarity() dispatch to asyncio.to_thread() (available in Python 3.9+), which runs the blocking encode call in the event loop's default thread pool executor. This prevents blocking the event loop in async web servers (FastAPI, aiohttp). Multiple encode calls can run concurrently up to the default ThreadPoolExecutor size (min(32, cpu_count + 4) worker threads).
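
The dispatch pattern can be sketched with a stand-in for the blocking encode call. blocking_encode and the free function aencode below are illustrative placeholders, not the library's methods; the point is only that asyncio.to_thread keeps the event loop free while the work runs in a worker thread.

```python
import asyncio
import time
from typing import List


def blocking_encode(texts: List[str]) -> List[int]:
    """Stand-in for a synchronous encode() call that takes a while."""
    time.sleep(0.05)
    return [len(t) for t in texts]


async def aencode(texts: List[str]) -> List[int]:
    # Runs the blocking function in the default thread pool (Python 3.9+)
    return await asyncio.to_thread(blocking_encode, texts)


async def main() -> List[List[int]]:
    # Both calls are in flight at the same time; gather preserves order
    return await asyncio.gather(aencode(["a", "bb"]), aencode(["ccc"]))


results = asyncio.run(main())
```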

GPU Utilization

_get_best_device() probes CUDA → MPS → CPU. For maximum throughput on GPU:

  1. Set batch_size to fill GPU VRAM (~70–80% utilization)
  2. Use normalize_embeddings=True to avoid a separate normalization pass downstream
  3. Call cleanup() or use as a context manager to release CUDA memory between sessions

Memory Considerations

The in-memory cache stores raw np.ndarray objects. At 768 dimensions × 4 bytes × 10,000 texts, the cache consumes ~30MB — acceptable for most deployments. For very large corpora, disable caching and use save_embeddings() / load_embeddings() for .npz-based batch retrieval instead.
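
The estimate above is simple arithmetic and easy to redo for other dimensions. The helper below is only a calculator for the raw vector payload, assuming float32 values; it ignores Python object and dictionary overhead.

```python
def cache_size_bytes(dim: int, n_texts: int, bytes_per_value: int = 4) -> int:
    """Raw size of n_texts float32 vectors of length dim."""
    return dim * n_texts * bytes_per_value


small = cache_size_bytes(768, 10_000)    # 30,720,000 bytes, about 30.7 MB
large = cache_size_bytes(3072, 10_000)   # 122,880,000 bytes, about 122.9 MB
print(f"768-d:  {small / 1e6:.1f} MB")
print(f"3072-d: {large / 1e6:.1f} MB")
```

At 3072 dimensions the same 10,000 entries take roughly four times as much memory, which is one reason to consider the dimension-reduction options described earlier.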

Rate Limiting at Scale

For high-throughput API workloads:

  • OpenAIEmbedder defaults to 3,000 requests/min and 1M tokens/min — match these to your actual tier limits
  • Create one embedder instance per application (not per request) to share the rate limiter's sliding window across all concurrent requests
  • For multi-process deployments, rate limiters are per-process; coordinate externally (e.g., Redis-based token bucket) if sharing API quota across workers
Source: community/embedding.md