Embeddings Module
A unified, production-grade Python embeddings module with first-class Arabic NLP support, multi-provider integration, and async-ready infrastructure.
Table of Contents
- High-Level Overview
- Architecture Overview
- Core Concepts
- Module & Component Breakdown
- API / Public Interfaces
- Configuration System
- Usage Guide
- Code Examples
- Design Decisions & Trade-offs
- Extensibility Guide
- Project Structure
- Performance & Scalability
1. High-Level Overview
What It Does
This library provides a unified interface for generating text embeddings across six distinct backends: OpenAI, Google Gemini, Mistral AI, HuggingFace (local), Ollama (local), and a specialized Arabic NLP embedder. All backends share a common abstract contract, enabling seamless provider switching without changing application code.
Problem It Solves
Building embedding pipelines typically requires learning each provider's SDK, managing rate limiting, implementing retry logic, handling caching, and dealing with Arabic/multilingual text normalization — separately for each provider. This library eliminates that fragmentation by:
- Exposing a single encode() interface across all providers
- Centralizing caching, statistics tracking, and similarity computation in the base class
- Providing Arabic-specific text normalization as a first-class preprocessing stage
- Offering async variants of all critical operations out of the box
Key Design Ideas
- Template Method pattern: BaseEmbedder defines the algorithm skeleton (caching, stats, async dispatch, cleanup), while concrete embedders implement only encode() and the embedding_dim property.
- Provider-agnostic similarity layer: cosine, dot product, and Euclidean similarity are available uniformly regardless of which backend generated the embeddings.
- Arabic NLP as a first-class citizen: The library was built with Arabic as a primary target language, with structured normalization levels rather than ad-hoc text cleaning.
Inheritance Hierarchy
BaseEmbedder (ABC)
├── ArabicEmbedder → sentence-transformers + Arabic normalization pipeline
├── HuggingFaceEmbedder → sentence-transformers (general multilingual)
├── OllamaEmbedder → Ollama REST API (local inference)
├── OpenAIEmbedder → OpenAI REST API (cloud)
├── GeminiEmbedder → Google GenAI SDK (cloud)
└── MistralEmbedder → Mistral REST API (cloud)
Data Flow
Input Text(s)
│
▼
[ArabicEmbedder only]
Arabic Normalization Pipeline
(diacritics → hamza → alef_maksura → tatweel → taa_marbuta)
│
▼
Cache Lookup (MD5 key) ──── HIT ──► Return cached embedding
│ MISS
▼
Provider-specific encode()
(batching + rate limiting + retries)
│
▼
[Optional] L2 Normalization
│
▼
numpy.ndarray (n, dim)
│
├── Cache Store
└── Statistics Update
Module Interaction
EmbedderConfig is consumed by all concrete embedders as a source of default values. The base class orchestrates caching, statistics accumulation, async dispatch, and resource cleanup. Concrete embedders own only the provider-specific network/model call and the embedding_dim property.
2. Core Concepts
Embeddings & Semantic Retrieval
An embedding is a dense numerical vector that encodes the semantic meaning of text. Texts with similar meaning have geometrically close vectors. This enables:
- Semantic search: retrieve documents by meaning rather than keyword overlap
- Clustering: group semantically related texts without labels
- Classification: represent text in a fixed-size feature space for downstream models
- RAG (Retrieval-Augmented Generation): embed a query, retrieve top-k relevant document chunks, and pass them to a language model as context
Similarity Metrics
The library supports three metrics via similarity() and batch_similarity():
| Metric | Formula | When to Use |
|---|---|---|
| cosine | dot(a, b) (after L2 norm) | Default; magnitude-independent |
| dot | dot(a, b) | When embeddings are already normalized |
| euclidean | 1 / (1 + ‖a−b‖) | When absolute scale matters |
When normalize_embeddings=True (the default), cosine and dot product are numerically equivalent.
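The three formulas in the table can be sketched in a few lines of NumPy. This is a minimal stand-alone illustration, not the library's implementation: the free function `similarity` and the sample vectors are hypothetical.

```python
import numpy as np


def similarity(a: np.ndarray, b: np.ndarray, metric: str = "cosine") -> float:
    """Hypothetical stand-alone version of the three metrics in the table."""
    if metric == "cosine":
        # L2-normalize both vectors, then take the dot product.
        return float(np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)))
    if metric == "dot":
        return float(np.dot(a, b))
    if metric == "euclidean":
        # Map a distance in [0, inf) to a similarity in (0, 1].
        return float(1.0 / (1.0 + np.linalg.norm(a - b)))
    raise ValueError(f"unknown metric: {metric}")


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Unit-length copies, as produced when normalize_embeddings=True.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine_raw = similarity(a, b, "cosine")
dot_unit = similarity(a_unit, b_unit, "dot")
```

On unit-length vectors `dot_unit` matches `cosine_raw`, which is the equivalence stated above.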
Arabic Text Normalization
Arabic text presents unique normalization challenges not present in Latin scripts. The library implements three normalization levels:
minimal — Remove diacritics (tashkeel) only. Suitable when the model is sensitive to letter form but the application data is already clean.
standard (default) — Remove diacritics + normalize all Hamza forms (أإآ → ا, ؤ → و, ئ → ي) + normalize Alef Maksura (ى → ي) + remove Tatweel (ـ). This covers the most common inconsistencies in user-generated content.
aggressive — All standard normalizations + normalize Taa Marbuta (ة → ه) + collapse runs of 3+ repeated characters to 2. Use for noisy social-media text.
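The three levels can be approximated with the standard library alone. The sketch below is an assumption-laden illustration: the function `normalize_arabic`, the diacritic range, and the translation tables are hypothetical and may differ from the patterns the library actually ships.

```python
import re

# Hypothetical patterns; the library's own patterns may differ.
DIACRITICS = re.compile("[\u064B-\u0652]")  # tashkeel: fathatan through sukun
REPEATS = re.compile(r"(.)\1{2,}")          # runs of 3+ identical characters

STANDARD_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0624": "\u0648",  # waw with hamza -> waw
    "\u0626": "\u064A",  # yeh with hamza -> yeh
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0640": None,      # tatweel is removed
})
AGGRESSIVE_MAP = str.maketrans({"\u0629": "\u0647"})  # taa marbuta -> heh


def normalize_arabic(text: str, level: str = "standard") -> str:
    if level not in ("minimal", "standard", "aggressive"):
        raise ValueError(f"unknown normalization level: {level}")
    text = DIACRITICS.sub("", text)           # every level strips diacritics
    if level == "minimal":
        return text
    text = text.translate(STANDARD_MAP)       # hamza, alef maksura, tatweel
    if level == "aggressive":
        text = text.translate(AGGRESSIVE_MAP)
        text = REPEATS.sub(r"\1\1", text)     # collapse 3+ repeats to 2
    return text
```

Each level is a strict superset of the previous one, so choosing a level amounts to choosing how far down the function runs.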
Caching Strategy
The base class implements an in-memory embedding cache keyed by MD5 hash of the raw text string. This design choice (hash over raw text) keeps the cache storage compact and O(1) for lookup while avoiding collision in practice for embedding workloads. The cache is intentionally session-scoped (no disk persistence) to prevent stale embeddings when switching models mid-session.
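A hash-keyed, session-scoped cache of this kind might look like the following sketch. The class `EmbeddingCache`, its method names, and the fake encoder are hypothetical; only the MD5-of-raw-text keying comes from the description above.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Hypothetical in-memory cache keyed by the MD5 digest of the raw text."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(text: str) -> str:
        # Fixed-length 32-character hex key, regardless of text length.
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text: str,
                       compute: Callable[[str], List[float]]) -> List[float]:
        k = self.key(text)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        self._store[k] = compute(text)
        return self._store[k]


# Demo with a fake encoder that records how often it is really called.
calls: List[str] = []


def fake_encode(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache()
first = cache.get_or_compute("hello", fake_encode)
second = cache.get_or_compute("hello", fake_encode)
```

The second lookup is served from the dictionary, so the fake encoder runs only once.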
Rate Limiting
API-backed embedders (OpenAI, Mistral) implement sliding-window rate limiters using thread-safe RLock. The limiter tracks request timestamps and token counts within a 1-minute (OpenAI) or 1-second (Mistral) window. When limits are approached, the system sleeps the minimum necessary duration rather than raising an exception.
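The sleep-instead-of-raise behavior can be sketched as a request-count limiter. This is a simplified illustration under stated assumptions: the class `SlidingWindowLimiter` is hypothetical, it tracks requests only (not tokens), and the injectable `clock` and `sleep` exist purely so the demo runs without real waiting.

```python
import threading
import time
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical sketch: at most max_requests per window_seconds."""

    def __init__(self, max_requests, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()          # timestamps of recent requests
        self._lock = threading.RLock()  # guards _stamps across threads

    def _evict(self, now):
        # Drop timestamps that have left the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()

    def acquire(self) -> float:
        """Block until a slot is free; return the seconds slept."""
        with self._lock:
            now = self._clock()
            self._evict(now)
            waited = 0.0
            if len(self._stamps) >= self.max_requests:
                # Sleep only until the oldest request leaves the window.
                waited = self._stamps[0] + self.window - now
                self._sleep(waited)
                now = self._clock()
                self._evict(now)
            self._stamps.append(now)
            return waited


class FakeClock:
    """Deterministic clock for the demo: sleeping just advances time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
limiter = SlidingWindowLimiter(2, 1.0, clock=clock.time, sleep=clock.sleep)
waits = [limiter.acquire() for _ in range(3)]
```

With a limit of two requests per second, the third call is the first one that has to wait.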
Retry with Exponential Backoff
All API embedders implement retry loops with configurable max_retries and exponential backoff (delay * 2^attempt). Rate-limit errors are retried; hard API errors are re-raised immediately to avoid burning retry budget on unrecoverable failures.
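The retry policy described here could be sketched as follows. `RateLimitError` and `call_with_retry` are hypothetical names standing in for whatever the provider client raises and whatever helper the library uses; the `sleep` parameter is injectable only to keep the demo instantaneous.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's retryable rate-limit error."""


def call_with_retry(fn, max_retries=3, delay=1.0, sleep=time.sleep):
    """Retry fn on RateLimitError, sleeping delay * 2**attempt between tries."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retry budget exhausted
            sleep(delay * 2 ** attempt)
        # Any other exception propagates immediately (no retry).


# Demo: fail twice with a rate-limit error, then succeed.
slept = []
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429")
    return "ok"


result = call_with_retry(flaky, max_retries=3, delay=0.5, sleep=slept.append)
```

The recorded delays double on each attempt, and a non-retryable exception would leave the loop on its first occurrence.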
3. Module & Component Breakdown
EmbedderConfig
Purpose: Centralizes all default values in a single @dataclass, preventing scattered magic numbers.
Responsibilities: Provides defaults for batch size, normalization flags, max sequence length, cache settings, Ollama connection parameters, and progress display. Concrete embedders import a module-level config = EmbedderConfig() instance and use it as parameter defaults, making the library configurable without subclassing.
Key fields:
- batch_size: int = 32
- normalize_embeddings: bool = True
- max_seq_length: int = 512
- enable_cache: bool = False
- base_url: str = "http://127.0.0.1:11434" (Ollama)
BaseEmbedder
Purpose: Abstract base that defines the contract and provides all shared infrastructure.
Responsibilities:
- Declares encode() and embedding_dim as abstract
- Implements encode_with_cache(), similarity(), batch_similarity()
- Provides validate_connection(), get_model_info(), get_stats(), reset_stats(), clear_cache()
- Implements async variants (aencode, aencode_with_cache, abatch_similarity) via asyncio.to_thread
- Provides a timing() context manager for performance measurement
- Implements __aenter__ / __aexit__ for async context manager protocol
- Implements __del__ for automatic GPU memory cleanup
Hidden design detail: The _stats attribute may hold either a dict or a dataclass. This is a backward-compatibility bridge: HuggingFaceEmbedder uses an EmbeddingStats dataclass while all other embedders use a plain dict. get_stats() detects the type at runtime and normalizes the output.
Device selection: _get_best_device() probes CUDA → MPS (Apple Silicon) → CPU in order, logging the selected device for debugging.
ArabicEmbedder
Purpose: Local sentence-transformers embedder with a full Arabic NLP preprocessing pipeline.
Responsibilities:
- Aliases friendly model keys (multilingual, labse, arabert, etc.) to full HuggingFace model names
- Applies a multi-stage Arabic normalization pipeline before encoding
- Tracks normalization statistics (diacritics removed, hamzas normalized) via ArabicProcessingStats
- Handles OOM errors by halving batch size and retrying automatically
- Provides save_embeddings() / load_embeddings() for .npz persistence with optional text and metadata arrays
- Provides find_most_similar() and benchmark() as high-level utilities
Key classes:
- ArabicProcessingStats: dataclass tracking normalization metrics per session
- ARABIC_NORMALIZATION_PATTERNS: class-level compiled regex patterns (compiled once at class definition, not per call)
- RECOMMENDED_MODELS: maps human-friendly keys to full model names + dimension info
Interaction: Inherits encode_with_cache, similarity, batch_similarity from BaseEmbedder. Uses sentence-transformers.SentenceTransformer internally.
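The halve-and-retry OOM handling listed among the responsibilities above can be sketched generically. This is a hypothetical illustration: `encode_with_oom_fallback` is not a library function, and the built-in `MemoryError` stands in for the CUDA out-of-memory error a real model call would raise.

```python
def encode_with_oom_fallback(texts, encode_batch, batch_size=32, min_batch_size=1):
    """Encode texts in batches, halving batch_size whenever a batch runs out of memory."""
    results = []
    i = 0
    while i < len(texts):
        batch = texts[i:i + batch_size]
        try:
            results.extend(encode_batch(batch))
            i += len(batch)  # advance only after a successful batch
        except MemoryError:  # assumption: stand-in for a CUDA OOM error
            if batch_size <= min_batch_size:
                raise        # cannot shrink further; give up
            batch_size = max(min_batch_size, batch_size // 2)
    return results


# Demo: a fake encoder that "runs out of memory" above 2 texts per batch.
seen_sizes = []


def fake_encode_batch(batch):
    seen_sizes.append(len(batch))
    if len(batch) > 2:
        raise MemoryError("simulated OOM")
    return [len(t) for t in batch]


out = encode_with_oom_fallback(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_encode_batch, batch_size=8
)
```

The failed batch is retried from the same offset, so no text is skipped or encoded twice.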
HuggingFaceEmbedder
Purpose: General-purpose local sentence-transformers embedder with TTL-based cache and typed model registry.
Responsibilities:
- Maintains ARABIC_MODELS registry mapping short names to ModelInfo dataclasses (dimensions, max_tokens, ArabicQuality enum, size)
- Implements TTL-aware cache via _cache_timestamps dict alongside the base cache
- Uses a retry_on_failure decorator (exponential backoff) on its internal _process_batch
- Supports per-call normalization override via normalize parameter on encode()
Key classes:
- ArabicQuality (Enum): EXCELLENT | GOOD | FAIR | UNKNOWN, used for structured quality tagging
- ModelInfo: dataclass carrying model metadata for the registry
- EmbeddingStats: dataclass (not dict) for statistics; requires to_dict() bridge in BaseEmbedder.get_stats()
Design note: Unlike ArabicEmbedder, this class does not apply Arabic-specific text preprocessing. It is intended for general multilingual use where the model's tokenizer handles normalization.
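The TTL-aware cache mentioned in the responsibilities (a timestamp dict kept alongside the value cache) might work roughly like this. The class `TTLCache`, its methods, and the injectable `clock` are hypothetical; only the idea of a parallel timestamp dict comes from the description above.

```python
import time


class TTLCache:
    """Hypothetical cache whose entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock
        self._values = {}
        self._timestamps = {}  # key -> time the value was stored

    def put(self, key, value):
        self._values[key] = value
        self._timestamps[key] = self._clock()

    def get(self, key):
        stamp = self._timestamps.get(key)
        if stamp is None:
            return None
        if self._clock() - stamp > self.ttl:
            # Entry is stale: drop it from both dicts and report a miss.
            del self._values[key]
            del self._timestamps[key]
            return None
        return self._values[key]


# Demo with a controllable clock instead of real waiting.
now = [0.0]
ttl_cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
ttl_cache.put("k", [1.0, 2.0])
fresh = ttl_cache.get("k")   # read at t=0, within the TTL
now[0] = 11.0
stale = ttl_cache.get("k")   # read at t=11, past the TTL
```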
OllamaEmbedder
Purpose: Embeds text using a locally running Ollama inference server via its REST API, with no API key and no data egress.
Responsibilities:
- Pings the Ollama /api/tags endpoint at init to verify server reachability
- Optionally auto-starts the ollama serve subprocess if the server is not running
- Sends embedding requests to /api/embed with configurable timeout and retries
- Stores MODEL_SPECS dict at module level (not class level) for known models, with graceful fallback for unknown models
Known models with specs: nomic-embed-text (768d), mxbai-embed-large (1024d), all-minilm (384d), snowflake-arctic-embed (1024d), bge-m3 (1024d, best Arabic support).
Key behavior: If auto_start_server=True and ollama serve is not running, the embedder will launch it via subprocess and wait server_start_wait seconds before proceeding.
OpenAIEmbedder
Purpose: Production-grade OpenAI embedding client with token-aware rate limiting, cost tracking, and tiktoken-based accurate token counting.
Responsibilities:
- Validates the API key on init by making a minimal test request (fails fast vs. failing on first real call)
- Uses tiktoken (cl100k_base encoding) for accurate pre-call token estimation; falls back to len(text) // 4 if tiktoken is unavailable
- Supports dimension reduction on text-embedding-3-* models via the dimensions API parameter
- Tracks cost per request using MODEL_SPECS.cost_per_1m_tokens with UsageStats dataclass
- Implements two-level rate limiting: requests/min and tokens/min via RateLimiter
Key classes:
- UsageStats: tracks total tokens, cost, requests; computes requests-by-minute breakdown
- RateLimiter: sliding-window limiter with RLock for thread safety
Supported models:
| Model | Dimensions | Cost/1M tokens | Max tokens |
|---|---|---|---|
| text-embedding-3-large | 3072 (reducible) | $0.13 | 8191 |
| text-embedding-3-small | 1536 (reducible) | $0.02 | 8191 |
| text-embedding-ada-002 | 1536 | $0.10 | 8191 |
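The cost arithmetic behind these prices is simple enough to sketch. The snippet below is a rough illustration only: the helper names are hypothetical, prices are copied from the table above, and token counts use the len(text) // 4 fallback heuristic rather than tiktoken.

```python
# Prices copied from the table above (USD per 1M tokens).
COST_PER_1M_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "text-embedding-ada-002": 0.10,
}


def estimate_tokens(text: str) -> int:
    # Fallback heuristic used when tiktoken is unavailable: ~4 chars per token.
    return len(text) // 4


def estimate_cost(texts, model):
    """Hypothetical helper: approximate token count and cost for a list of texts."""
    total_tokens = sum(estimate_tokens(t) for t in texts)
    cost = total_tokens / 1_000_000 * COST_PER_1M_TOKENS[model]
    return {"total_tokens": total_tokens, "estimated_cost_usd": cost}


# 1,000 documents of 4,000 characters each is about 1M tokens under the heuristic.
docs = ["x" * 4000] * 1000
small = estimate_cost(docs, "text-embedding-3-small")
large = estimate_cost(docs, "text-embedding-3-large")
```

The character-based heuristic is coarse, especially for Arabic text, so treat the result as an order-of-magnitude estimate.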
GeminiEmbedder
Purpose: Google Gemini embedding client with task-type–aware encoding and MRL (Matryoshka Representation Learning) dimension reduction.
Responsibilities:
- Uses the new google-genai SDK (not the deprecated google-generativeai)
- Exposes task_type parameter to signal the embedding's intended use to the model (RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING)
- Supports output_dimensionality for MRL-based dimension reduction (valid for gemini-embedding-001)
- Validates model status: deprecated models (text-embedding-004, embedding-001) emit warnings with sunset dates
- Tracks requests/chars/errors via GeminiUsageStats
Key insight: Task-type hints allow the Gemini model to optimize the embedding distribution for the specific downstream task. For RAG, encode queries with RETRIEVAL_QUERY and documents with RETRIEVAL_DOCUMENT for best retrieval accuracy.
Supported models:
| Model | Dimensions | Status |
|---|---|---|
| gemini-embedding-001 | 3072 (MRL-reducible) | GA |
| text-embedding-004 | 768 | Deprecated 2026-01-14 |
| embedding-001 | 768 | Deprecated 2025-08-14 |
MistralEmbedder
Purpose: Mistral AI embedding client with per-second rate limiting and LRU caching.
Responsibilities:
- Implements MistralRateLimiter with a 1-second sliding window (Mistral's tighter rate limit vs. OpenAI's per-minute)
- Uses OrderedDict for LRU cache eviction (bounded by cache_size parameter)
- Reads API key from MISTRAL_API_KEY env var or constructor parameter
- Calls https://api.mistral.ai/v1/embeddings directly via requests (no official SDK dependency)
Only available model: mistral-embed (1024 dimensions, $0.10/1M tokens).
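The OrderedDict-based LRU eviction mentioned above is a common standard-library idiom. The sketch below shows one way it could be done; the class `LRUCache` and its method names are hypothetical, and only the use of OrderedDict bounded by cache_size comes from the description.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical bounded cache that evicts the least recently used entry."""

    def __init__(self, cache_size: int):
        self.cache_size = cache_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.cache_size:
            self._data.popitem(last=False)  # evict the oldest entry

    def keys(self):
        return list(self._data)


lru = LRUCache(cache_size=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")     # touching "a" makes "b" the least recently used
lru.put("c", 3)  # exceeds capacity, so "b" is evicted
```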
Public Surface
Purpose: Defines the public API, module-level metadata, and documentation constants.
Exports: ArabicEmbedder, BaseEmbedder, EmbedderConfig, GeminiEmbedder, HuggingFaceEmbedder, OllamaEmbedder, MistralEmbedder, OpenAIEmbedder.
Metadata constants:
- __arabic_normalization__: list of all normalization operations in pipeline order
- __valid_levels__: human-readable descriptions of normalization levels
- __task_types_gemini__: bilingual (Arabic/English) task type documentation
- __gemini_embedding_models__: complete model specifications for Gemini models
5. API / Public Interfaces
BaseEmbedder — Core Contract
class BaseEmbedder(ABC):
    def __init__(
        self,
        model_name: str,
        device: Optional[str] = None,  # 'cuda' | 'cpu' | 'mps' | None (auto)
        normalize_embeddings: bool = True,
        batch_size: int = 32,
        max_length: Optional[int] = 512,
        cache_embeddings: bool = False,
        show_progress: bool = False,
        **kwargs
    ): ...
    @abstractmethod
    def encode(
        self,
        texts: Union[str, List[str]],
        show_progress_bar: bool = False,
        convert_to_numpy: bool = True,
        **kwargs
    ) -> np.ndarray: ...
    # Returns: shape (dim,) for single text, (n, dim) for list
    @property
    @abstractmethod
    def embedding_dim(self) -> int: ...
    def encode_with_cache(self, texts, **kwargs) -> np.ndarray: ...
    def similarity(
        self,
        text1: Union[str, np.ndarray],
        text2: Union[str, np.ndarray],
        metric: str = 'cosine'  # 'cosine' | 'dot' | 'euclidean'
    ) -> float: ...
    def batch_similarity(
        self,
        query: Union[str, np.ndarray],
        texts: List[str],
        top_k: Optional[int] = None
    ) -> Union[np.ndarray, List[Tuple[int, float]]]: ...
    # Returns: similarity array if top_k=None, else [(index, score), ...]
    def validate_connection(
        self,
        test_text: str = "مرحباً Hello",
        detailed: bool = False
    ) -> Dict[str, Any]: ...
    def get_model_info(self) -> Dict[str, Any]: ...
    def get_stats(self) -> Dict[str, Any]: ...
    def reset_stats(self): ...
    def clear_cache(self): ...
    # Async API
    async def aencode(self, texts, **kwargs) -> np.ndarray: ...
    async def aencode_with_cache(self, texts, **kwargs) -> np.ndarray: ...
    async def abatch_similarity(self, query, candidates, **kwargs): ...
    # Context manager
    @contextmanager
    def timing(self, operation: str = "encoding"): ...
ArabicEmbedder — Extended API
ArabicEmbedder(
    model_name: Optional[str] = None,       # Key from RECOMMENDED_MODELS or full HF model name
    normalization_level: str = 'standard',  # 'minimal' | 'standard' | 'aggressive'
    enable_preprocessing: bool = True,
    track_processing_stats: bool = True,
    auto_download: bool = True,
    ...
)
.find_most_similar(
    query: str,
    candidates: List[str],
    top_k: int = 5,
    metric: str = 'cosine'
) -> List[Tuple[int, str, float]]  # [(index, text, score), ...]
.save_embeddings(texts, filepath, save_texts=True, save_metadata=True) -> None
.load_embeddings(filepath, load_texts=False, load_metadata=False) -> Union[np.ndarray, Tuple]
.benchmark(sample_texts=None, num_iterations=10, warmup_iterations=2) -> Dict
.get_processing_stats() -> Dict[str, Any]
.list_recommended_models() -> Dict[str, Dict]  # @staticmethod
OpenAIEmbedder — Extended API
OpenAIEmbedder(
    model_name: str = "text-embedding-3-small",
    api_key: Optional[str] = None,     # Falls back to OPENAI_API_KEY env var
    dimensions: Optional[int] = None,  # Dimension reduction (text-embedding-3-* only)
    enable_rate_limiting: bool = True,
    max_requests_per_minute: int = 3000,
    max_tokens_per_minute: int = 1_000_000,
    track_costs: bool = True,
    ...
)
.estimate_cost(texts: Union[str, List[str]]) -> Dict[str, Any]
.get_usage_stats() -> Dict[str, Any]
GeminiEmbedder — Extended API
GeminiEmbedder(
    model_name: str = "gemini-embedding-001",
    api_key: Optional[str] = None,                # Falls back to GOOGLE_API_KEY env var
    task_type: Optional[str] = None,              # 'RETRIEVAL_QUERY' | 'RETRIEVAL_DOCUMENT' | ...
    output_dimensionality: Optional[int] = None,  # MRL reduction
    ...
)
6. Configuration System
EmbedderConfig Dataclass
All embedders accept EmbedderConfig defaults. Override at the module level to change library-wide defaults:
from fennec_community.embeddings import EmbedderConfig
# Override defaults globally before importing embedders
config = EmbedderConfig(
    batch_size=64,
    normalize_embeddings=True,
    enable_cache=True,
    show_progress_bar=True,
)
Full Configuration Reference
| Field | Default | Description |
|---|---|---|
| model_name | "multilingual" | Default model key |
| device | None (auto) | 'cuda', 'cpu', 'mps' |
| cache_dir | None | HuggingFace model cache directory |
| batch_size | 32 | Texts per encode batch |
| normalize_embeddings | True | L2-normalize output vectors |
| max_seq_length | 512 | Token truncation limit |
| enable_preprocessing | True | Arabic normalization pipeline |
| enable_arabic_normalization | True | Arabic char normalization |
| enable_cache | False | In-memory embedding cache |
| cache_size | 10000 | Max cached entries |
| show_progress_bar | False | tqdm progress during encoding |
| convert_to_numpy | True | Always return np.ndarray |
| skip_preprocessing | False | Bypass Arabic preprocessing |
| return_valid_indices | False | Return (embeddings, valid_idx) tuple |
| track_processing_stats | True | Track normalization statistics |
| auto_download | True | Auto-download missing models |
| trust_remote_code | False | HuggingFace trust_remote_code |
| base_url | "http://127.0.0.1:11434" | Ollama server URL |
| embedding_model | "nomic-embed-text" | Default Ollama model |
Environment Variables
| Variable | Used By |
|---|---|
| OPENAI_API_KEY | OpenAIEmbedder |
| GOOGLE_API_KEY | GeminiEmbedder |
| MISTRAL_API_KEY | MistralEmbedder |
7. Usage Guide
Quick Start
from fennec_community.embeddings import ArabicEmbedder
# Minimal setup — downloads model on first run
embedder = ArabicEmbedder()
embedding = embedder.encode("مرحبا بك في عالم الذكاء الاصطناعي")
print(embedding.shape)  # e.g. (384,); the dimension depends on the loaded model
Provider Selection Guide
| Use Case | Recommended Embedder | Reason |
|---|---|---|
| Arabic NLP, on-premise | ArabicEmbedder | Built-in normalization pipeline |
| General multilingual, local | HuggingFaceEmbedder | Full model control, no cost |
| Production API, best quality | OpenAIEmbedder (3-large) | Highest accuracy, dimension reduction |
| Budget-conscious API | OpenAIEmbedder (3-small) | Good quality, 6× cheaper |
| Task-specific RAG | GeminiEmbedder | Task-type optimization |
| Air-gapped / privacy-first | OllamaEmbedder | Zero network egress |
| Mistral ecosystem | MistralEmbedder | Native Mistral stack integration |
Basic Usage Pattern
# All embedders share the same core interface
from fennec_community.embeddings import OpenAIEmbedder
embedder = OpenAIEmbedder(model_name="text-embedding-3-small")
# Single text → 1D array of shape (dim,)
emb = embedder.encode("Hello world")
# Multiple texts → 2D array of shape (n, dim)
embs = embedder.encode(["First text", "Second text", "Third text"])
# Semantic similarity
score = embedder.similarity("cat", "kitten") # float, ~0.85
score = embedder.similarity("cat", "automobile") # float, ~0.2
# Top-k search
results = embedder.batch_similarity(
    query="artificial intelligence",
    texts=["machine learning", "cooking recipes", "neural networks"],
    top_k=2
)
# Returns (index, score) pairs, e.g. [(0, 0.91), (2, 0.88)]
Arabic NLP Advanced Usage
from fennec_community.embeddings import ArabicEmbedder
# Use aggressive normalization for social media text
with ArabicEmbedder(
    model_name='arabert',
    normalization_level='aggressive',
    cache_embeddings=True
) as embedder:
    # Texts with heavy diacritics + repeated chars + hamza variants
    messy_texts = [
        "أَهْلاً وَسَهْلاً بِكُمْ",
        "ههههههههه كتير كتيرررر",
        "اريد ان اتعلم البرمجة"
    ]
    embeddings = embedder.encode(messy_texts)
    # Inspect normalization statistics
    stats = embedder.get_processing_stats()
    print(f"Diacritics removed: {stats['removed_diacritics']}")
    print(f"Hamzas normalized: {stats['normalized_hamzas']}")
Async Usage
import asyncio
from fennec_community.embeddings import HuggingFaceEmbedder
async def embed_documents(texts):
    async with HuggingFaceEmbedder("bge-m3") as embedder:
        # Runs encode() in a thread pool; does not block the event loop
        embeddings = await embedder.aencode(texts)
        return embeddings
embeddings = asyncio.run(embed_documents(["doc1", "doc2", "doc3"]))
Persistence (ArabicEmbedder)
embedder = ArabicEmbedder(model_name='multilingual')
# Save to disk
texts = ["النص الأول", "النص الثاني"]
embedder.save_embeddings(texts, "my_embeddings", save_texts=True, save_metadata=True)
# Creates: my_embeddings.npz
# Load from disk
embeddings, texts, metadata = embedder.load_embeddings(
    "my_embeddings",
    load_texts=True,
    load_metadata=True
)
print(metadata['model_name'])  # paraphrase-multilingual-mpnet-base-v2
print(metadata['timestamp'])   # ISO 8601 timestamp
Cost Estimation (OpenAI)
embedder = OpenAIEmbedder(model_name="text-embedding-3-large")
# Estimate before committing to the API call
estimate = embedder.estimate_cost(my_large_document_list)
print(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")
print(f"Total tokens: {estimate['total_tokens']:,}")
# After encoding, review actual usage
stats = embedder.get_usage_stats()
print(f"Actual cost: ${stats['api_usage']['total_cost_usd']:.6f}")
Performance Benchmarking
embedder = ArabicEmbedder(model_name='multilingual-mini')
results = embedder.benchmark(num_iterations=20, warmup_iterations=3)
print(f"Texts/second: {results['texts_per_second']:.1f}")
print(f"Avg latency: {results['avg_time']*1000:.1f}ms")
print(f"Device: {results['device']}")
8. Code Examples
RAG Document Indexing Pipeline
from fennec_community.embeddings import GeminiEmbedder
import numpy as np
# Index-time: embed documents
doc_embedder = GeminiEmbedder(
    model_name="gemini-embedding-001",
    task_type="RETRIEVAL_DOCUMENT",
    output_dimensionality=768,  # Reduce 3072 → 768 via MRL
    cache_embeddings=True,
)
documents = [
    "Python is a high-level programming language.",
    "Machine learning models require training data.",
    "Neural networks are inspired by the brain.",
]
doc_embeddings = doc_embedder.encode(documents)
print(doc_embeddings.shape)  # (3, 768)
# Query-time: embed query with matching task type
query_embedder = GeminiEmbedder(
    model_name="gemini-embedding-001",
    task_type="RETRIEVAL_QUERY",
    output_dimensionality=768,
)
query = "What is Python used for?"
results = query_embedder.batch_similarity(query, documents, top_k=2)
for idx, score in results:
    print(f"[{score:.3f}] {documents[idx]}")
Multi-Provider Evaluation
from fennec_community.embeddings import ArabicEmbedder, OpenAIEmbedder
query = "ما هو الذكاء الاصطناعي؟"
candidates = [
    "الذكاء الاصطناعي هو محاكاة العقل البشري",
    "الطبخ فن جميل",
    "التعلم الآلي فرع من الذكاء الاصطناعي",
]
for EmbedderClass, kwargs in [
    (ArabicEmbedder, {"model_name": "multilingual"}),
    (OpenAIEmbedder, {"model_name": "text-embedding-3-small"}),
]:
    embedder = EmbedderClass(**kwargs)
    results = embedder.batch_similarity(query, candidates, top_k=2)
    print(f"\n{EmbedderClass.__name__}:")
    for idx, score in results:
        print(f"  [{score:.3f}] {candidates[idx]}")
Connection Health Check
from fennec_community.embeddings import OllamaEmbedder
embedder = OllamaEmbedder(model_name="bge-m3")
report = embedder.validate_connection(detailed=True)
if report['success']:
    print(f"✅ Connected — dim={report['embedding_dim']}, latency={report['encoding_time']}")
else:
    print(f"❌ Failed: {report['reason']}")
Timing and Statistics
from fennec_community.embeddings import HuggingFaceEmbedder
embedder = HuggingFaceEmbedder(model_name="bge-m3")
with embedder.timing("bulk_encode"):
    embeddings = embedder.encode(large_text_list)
stats = embedder.get_stats()
print(f"Total texts: {stats['total_texts']}")
print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")
print(f"Avg time/text: {stats['avg_time_per_text']*1000:.2f}ms")
9. Design Decisions & Trade-offs
Template Method Over Composition
Decision: Shared logic (caching, stats, async, cleanup) lives in BaseEmbedder via the Template Method pattern rather than being injected as collaborator objects.
Pros: Zero boilerplate in concrete embedders; all providers get caching, timing, and async for free. Adding a new provider means implementing two methods.
Cons: Tighter coupling between base and concrete classes. A provider that wants different cache semantics (e.g., Redis) must override multiple base methods. The _stats dict/dataclass dual-support is evidence of this friction accumulating.
In-Memory Cache Only
Decision: The cache is a session-scoped Python dict, never persisted to disk.
Pros: No deserialization overhead, no cache invalidation complexity, no stale embeddings when models change.
Cons: Cache is lost on process restart. For high-reuse workloads (e.g., document corpora that are repeatedly queried), callers must implement external caching or use save_embeddings() / load_embeddings().
Arabic Normalization as Preprocessing (Not Postprocessing)
Decision: Arabic normalization happens before encoding, not after.
Pros: The embedding model sees cleaner, more consistent input. Two surface-form variants of the same word (e.g., أحمد and احمد) produce the same token sequence and therefore similar (or identical) embeddings.
Cons: If a model was specifically fine-tuned on raw Arabic text including diacritics, aggressive normalization may degrade quality. The skip_preprocessing parameter and minimal level exist to mitigate this.
Separate ArabicEmbedder vs. HuggingFaceEmbedder
Decision: Arabic-specific functionality is isolated in its own class rather than added as flags to HuggingFaceEmbedder.
Pros: Cleaner separation of concerns. Arabic users get a purpose-built interface with save/load, benchmarking, and processing stats. General users are not burdened with Arabic-specific parameters.
Cons: Some code duplication between the two classes. The model-loading logic and SentenceTransformer call are essentially identical.
API Key in Constructor vs. Environment Only
Decision: All API embedders accept api_key as an explicit constructor parameter with environment variable fallback.
Pros: Enables testing with different credentials without environment manipulation; makes dependencies explicit.
Cons: Risk of accidentally logging or serializing the API key if the embedder object is repr'd or pickled carelessly.
Rate Limiting Inside the Embedder
Decision: Rate limiting logic is embedded within OpenAIEmbedder and MistralEmbedder rather than being a separate middleware.
Pros: Self-contained — users cannot accidentally bypass the limiter by calling the underlying client directly.
Cons: If multiple embedder instances are created for the same API account, each has an independent rate limiter, potentially allowing combined rates to exceed account limits.
10. Extensibility Guide
Adding a New Embedding Provider
Create a new file (e.g., cohere_embedder.py) and implement the two abstract members:
import os
from typing import List, Optional, Union
import numpy as np
from fennec_community.embeddings import BaseEmbedder
class CohereEmbedder(BaseEmbedder):
    def __init__(self, model_name: str = "embed-multilingual-v3.0",
                 api_key: Optional[str] = None, **kwargs):
        super().__init__(model_name=model_name, **kwargs)
        import cohere  # imported lazily so the dependency stays optional
        self.client = cohere.Client(api_key or os.getenv("COHERE_API_KEY"))
        self._dim = None  # Lazy-loaded on first encode
    def encode(self, texts: Union[str, List[str]],
               show_progress_bar: bool = False,
               convert_to_numpy: bool = True, **kwargs) -> np.ndarray:
        single = isinstance(texts, str)
        if single:
            texts = [texts]
        response = self.client.embed(texts=texts, model=self.model_name,
                                     input_type="search_document")
        embeddings = np.array(response.embeddings, dtype=np.float32)
        if self._dim is None:
            self._dim = embeddings.shape[1]
        if self.normalize_embeddings:
            norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
            embeddings = embeddings / np.where(norms == 0, 1, norms)
        # Match the base contract: (dim,) for a single text, (n, dim) for a list
        return embeddings[0] if single else embeddings
    @property
    def embedding_dim(self) -> int:
        if self._dim is None:
            self.encode("init")  # trigger dimension detection
        return self._dim
Then register in __init__.py:
from .cohere_embedder import CohereEmbedder
__all__ = [..., "CohereEmbedder"]
The new embedder automatically gains: caching via encode_with_cache, async via aencode, similarity computation, stats tracking, context manager support, and validate_connection.
Adding a New Normalization Step
Extend ArabicEmbedder.ARABIC_NORMALIZATION_PATTERNS and update _normalize_arabic():
# In ArabicEmbedder
ARABIC_NORMALIZATION_PATTERNS = {
    ...
    'numerals': re.compile('[٠١٢٣٤٥٦٧٨٩]'),  # Arabic-Indic numerals
}
def _normalize_arabic(self, text: str) -> str:
    ...
    if self.normalization_level == 'aggressive':
        # Existing steps ...
        # New step: normalize Arabic-Indic numerals to ASCII
        text = self.ARABIC_NORMALIZATION_PATTERNS['numerals'].sub(
            lambda m: str(ord(m.group()) - 0x0660), text
        )
    return text
Plugging in External Cache (Redis)
Override encode_with_cache in a subclass to use an external store:
class RedisCachedEmbedder(OpenAIEmbedder):
    def __init__(self, redis_url: str, **kwargs):
        super().__init__(**kwargs)
        import redis
        self.redis = redis.from_url(redis_url)
    def encode_with_cache(self, texts, **kwargs):
        # Check Redis for each text, encode misses, store results
        ...
11. Performance & Scalability
Batch Processing
All embedders process texts in configurable batches (batch_size, default 32 for local models, 50–100 for API models). Batch size should be tuned per deployment:
- GPU with ample VRAM: increase to 128–256 for local models
- API models: OpenAI supports up to 2048 texts per call; the default of 100 is conservative
- ArabicEmbedder OOM handling: automatically halves batch_size on CUDA OOM and retries, so no intervention is required
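Batching itself is a small amount of code. As a minimal sketch of how a text list is split into batch_size chunks (the helper name `iter_batches` is hypothetical, not a library function):

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 70 texts with the default batch size of 32 give two full batches and a remainder.
sample_texts = [f"text {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(sample_texts, 32)]
```

Larger batches mean fewer model or API calls for the same corpus, at the price of more memory per call.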
Cache Efficiency
Enable cache_embeddings=True when the same texts recur across requests (e.g., fixed document corpus, common query templates). The base class computes an MD5 digest per text — this hash operation costs ~1µs per text and is negligible vs. encoding latency. Monitor cache efficiency via get_stats()['cache_hit_rate'].
Async Concurrency
aencode(), aencode_with_cache(), and abatch_similarity() dispatch to asyncio.to_thread() (available from Python 3.9), running the blocking encode call in the event loop's default thread pool executor. This prevents blocking the event loop in async web servers (FastAPI, aiohttp). Multiple concurrent encode calls can overlap up to the executor's worker limit (default: min(32, cpu_count + 4) since Python 3.8).
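The dispatch pattern can be illustrated with a stand-in for the blocking call. In this sketch `blocking_encode` is a hypothetical placeholder for a synchronous encode(), and the module-level `aencode` only mirrors the name of the library method; it is not the library's code.

```python
import asyncio
import time
from typing import List


def blocking_encode(texts: List[str]) -> List[int]:
    """Hypothetical stand-in for a synchronous, blocking encode() call."""
    time.sleep(0.05)  # simulate model or network latency
    return [len(t) for t in texts]


async def aencode(texts: List[str]) -> List[int]:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_encode, texts)


async def main() -> List[List[int]]:
    # Both calls are in flight at the same time; gather preserves argument order.
    return await asyncio.gather(aencode(["a", "bb"]), aencode(["ccc"]))


results = asyncio.run(main())
```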
GPU Utilization
_get_best_device() probes CUDA → MPS → CPU. For maximum throughput on GPU:
- Set batch_size to fill GPU VRAM (~70–80% utilization)
- Use normalize_embeddings=True to avoid a separate normalization pass downstream
- Call cleanup() or use as a context manager to release CUDA memory between sessions
Memory Considerations
The in-memory cache stores raw np.ndarray objects. At 768 dimensions × 4 bytes × 10,000 texts, the cache consumes ~30MB — acceptable for most deployments. For very large corpora, disable caching and use save_embeddings() / load_embeddings() for .npz-based batch retrieval instead.
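The estimate above is plain arithmetic and easy to redo for other sizes. The helper below is a hypothetical back-of-the-envelope calculation; it counts only the float32 payload and ignores per-array and dictionary overhead.

```python
def cache_payload_bytes(num_texts: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw embedding payload in bytes (float32 by default), excluding overhead."""
    return num_texts * dim * bytes_per_value


# 10,000 cached texts at 768 dimensions: 30.72 MB, i.e. the ~30MB quoted above.
default_mb = cache_payload_bytes(10_000, 768) / 1e6

# The same cache at 3072 dimensions is four times larger.
large_mb = cache_payload_bytes(10_000, 3072) / 1e6
```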
Rate Limiting at Scale
For high-throughput API workloads:
- OpenAIEmbedder defaults to 3,000 requests/min and 1M tokens/min; match these to your actual tier limits
- Create one embedder instance per application (not per request) to share the rate limiter's sliding window across all concurrent requests
- For multi-process deployments, rate limiters are per-process; coordinate externally (e.g., Redis-based token bucket) if sharing API quota across workers