Fennec Community community/llm.md

LLM Modular

A unified, provider-agnostic Python interface for interacting with large language models.
Supports synchronous generation, asynchronous generation, and native token streaming across six major LLM providers with a single, consistent API surface.

High-Level Overview
Architecture Overview
Core Concepts
Module & Component Breakdown
API / Public Interfaces
Configuration System
Usage Guide
Code Examples
Design Decisions & Trade-offs
Extensibility Guide
Project Structure
Performance & Scalability

1. High-Level Overview

What It Does

The llm package is a provider-agnostic LLM client library that wraps six distinct LLM backends — OpenAI, Anthropic (Claude), Google Gemini, Groq, Mistral, and Ollama — behind a single, stable interface. Consumers of this library can swap providers, switch models, or run experiments across backends without changing application-level code.

Problem It Solves

Integrating LLMs into production systems typically means writing bespoke client code for each provider: different SDKs, different authentication models, different streaming APIs, different retry semantics, and different async patterns. This library eliminates that friction by:

Providing one consistent method surface (generate, generate_async, astream) regardless of provider.
Encapsulating all provider-specific quirks — rate limits, retry logic, server lifecycle management, response parsing — inside each adapter.
Enabling runtime provider selection via a shared configuration dataclass.

Key Design Ideas

Template Method pattern via BaseLLMInterface — the abstract base defines the contract; concrete subclasses fulfill it per-provider.
Adapter pattern — each provider's SDK is wrapped inside a thin adapter that normalizes inputs and outputs.
Graceful degradation — where native streaming is unsupported (Gemini, Mistral fallback path), the base class provides a word-by-word simulation so callers always get an async generator.
Dual-client architecture — for providers supporting it (OpenAI, Anthropic, Groq), both a sync and an async client are instantiated at construction time, avoiding per-request setup overhead.

2. Architecture Overview

Component Hierarchy

BaseLLMInterface (ABC)
│
├── OpenAIInterface         — OpenAI GPT-4/3.5, native sync+async+streaming
├── AnthropicInterface      — Claude 3/4 series, native sync+async+streaming
├── GroqInterface           — Llama/Mixtral via Groq, native sync+async+streaming
├── MistralInterface        — Mistral models, native sync+async+streaming
├── GeminiInterface         — Google Gemini, sync+async, simulated streaming
└── OllamaInterface         — Local Ollama server, HTTP-based, full lifecycle management

Data Flow

Caller
  │
  ▼
[ProviderInterface].generate(prompt) / generate_async() / astream()
  │
  ├─ _build_messages(prompt, messages)     # normalize input (where applicable)
  │
  ├─ provider SDK call                     # provider-specific I/O
  │
  ├─ response extraction                   # normalize output to str
  │
  └─ return str / AsyncIterator[str]

Module Interaction

config_llm.py is consumed by every module at import time via llm_config_ = llm_config(). All concrete interfaces inherit from BaseLLMInterface, which itself imports from config_llm. The __init__.py re-exports all public symbols, making the entire library accessible from a single import.

3. Core Concepts

Adapter Pattern

Each provider class is an adapter — it translates the library's normalized interface (generate(prompt, max_tokens, temperature)) into the provider's specific SDK calls, response shapes, and error models. The caller never interacts with a provider SDK directly.
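As a rough illustration of the adapter idea, the sketch below wraps a made-up SDK behind the normalized `generate(prompt, max_tokens, temperature)` call. `FakeSDKClient`, `BaseAdapter`, and `FakeProviderAdapter` are hypothetical names invented for this example; they are not part of the library or of any real provider SDK.

```python
from abc import ABC, abstractmethod


class FakeSDKClient:
    """Hypothetical provider SDK with its own call shape (not a real library)."""

    def complete(self, text: str, limit: int, temp: float) -> dict:
        return {"output": {"content": f"echo: {text}"}, "limit": limit, "temp": temp}


class BaseAdapter(ABC):
    """Simplified stand-in for BaseLLMInterface: only the normalized contract."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3) -> str:
        ...


class FakeProviderAdapter(BaseAdapter):
    """Translates the normalized call into the SDK's shape and back to str."""

    def __init__(self) -> None:
        self._client = FakeSDKClient()

    def generate(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3) -> str:
        # Map normalized parameter names onto the SDK's own names.
        raw = self._client.complete(prompt, limit=max_tokens, temp=temperature)
        # Normalize the provider-specific response shape to a plain string.
        return raw["output"]["content"]
```

The caller only ever sees `generate(...) -> str`; the SDK's parameter names and nested response shape stay inside the adapter.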

Dual Execution Modes

Every provider exposes both synchronous (generate) and asynchronous (generate_async) generation paths. This is intentional: sync execution is simpler for scripts and notebooks; async is required for high-throughput services, concurrent request batching, and frameworks like FastAPI or asyncio-based pipelines.

Async Token Streaming

The astream method is an async generator that yields string tokens as they are produced. This enables real-time output display (e.g., chat UIs, CLI spinners) without waiting for full response completion.

There are two streaming tiers:

Native streaming (OpenAI, Anthropic, Groq, Mistral): tokens are yielded as the model produces them via SSE or streaming SDK support.
Simulated streaming (Gemini, Ollama fallback, Mistral fallback path, base class default): the full response is generated first, then split word-by-word and yielded with asyncio.sleep(0) between yields. This preserves the async generator contract without true token streaming.
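The simulated tier can be sketched as follows. This is an illustrative approximation, not the library's code: the function name `simulated_stream` and the choice to re-attach a trailing space to each word are assumptions.

```python
import asyncio
from typing import AsyncIterator


async def simulated_stream(full_text: str) -> AsyncIterator[str]:
    """Yield an already-complete response word by word."""
    words = full_text.split()
    for i, word in enumerate(words):
        # Assumption: re-attach a separating space to every word except the last.
        yield word if i == len(words) - 1 else word + " "
        await asyncio.sleep(0)  # hand control back to the event loop; no real delay


async def collect(text: str) -> list[str]:
    """Helper for the example: gather all chunks into a list."""
    return [chunk async for chunk in simulated_stream(text)]
```

Because the input is already complete before the first yield, this keeps the async generator contract but does nothing to shorten the time to the first token.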

Rate Limiting (Gemini-specific)

GeminiInterface implements client-side rate limiting via a timestamp-based gate: it enforces a minimum 5-second interval between requests, which works out to at most 12 requests per minute and so stays within Gemini's free-tier cap of 15 req/min. This is applied to both the sync and async paths.
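A timestamp-based gate of this kind might look like the sketch below. The class name `MinIntervalGate` and the injectable `clock` and `sleep` parameters are assumptions added so the behavior can be checked without real waiting; only the `_last_request_time` attribute name comes from this document.

```python
import time
from typing import Callable, Optional


class MinIntervalGate:
    """Blocks until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request_time: Optional[float] = None

    def wait(self) -> float:
        """Sleep if needed, record the new timestamp, and return the time slept."""
        waited = 0.0
        if self._last_request_time is not None:
            elapsed = self._clock() - self._last_request_time
            if elapsed < self.min_interval:
                waited = self.min_interval - elapsed
                self._sleep(waited)
        self._last_request_time = self._clock()
        return waited
```

Calling `gate.wait()` immediately before each request enforces the floor. An async variant would await `asyncio.sleep` instead of calling a blocking sleep.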

Exponential Backoff with Jitter (Gemini-specific)

On 503 UNAVAILABLE or 429 Too Many Requests responses, GeminiInterface retries up to 5 times with exponential backoff (base_delay * 2^attempt) plus random jitter (random.uniform(0, 1.5) seconds). This is the standard pattern for handling transient cloud API errors: the growing delay gives the service time to recover, and the jitter keeps many clients from retrying in lockstep (the thundering-herd problem).
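The retry schedule can be sketched as below. `TransientError` is a hypothetical stand-in for the 503 and 429 cases, and `base_delay=1.0` is an assumed value, since this document does not state the actual base delay.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a 503 or 429 response from the provider."""


def backoff_delay(attempt: int, base_delay: float = 1.0, max_jitter: float = 1.5) -> float:
    """Delay before the next retry: base_delay * 2^attempt plus uniform jitter."""
    return base_delay * (2 ** attempt) + random.uniform(0, max_jitter)


def call_with_retries(fn: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base_delay))
    raise RuntimeError("max_attempts must be at least 1")
```

With these assumed values the waits fall in the ranges 1 to 2.5 s, 2 to 3.5 s, 4 to 5.5 s, and 8 to 9.5 s before the fifth and final attempt.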

Server Lifecycle Management (Ollama-specific)

OllamaInterface uniquely manages the full lifecycle of a local Ollama server process:

On construction, it checks server health via GET /api/version.
If the server is not running and auto_start=True, it spawns ollama serve as a detached subprocess (cross-platform: DETACHED_PROCESS on Windows, start_new_session=True on Unix).
It tracks ownership: _we_started_server ensures that stop_server() never terminates a pre-existing Ollama instance that the user started manually.
Smart polling (0.5s intervals) replaces a blind sleep for faster startup detection.
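The detached spawn and the polling loop described above could be approximated like this. Both helper names are hypothetical, and the health check is passed in as a callable so the sketch does not depend on a running server.

```python
import subprocess
import sys
import time
from typing import Callable


def detached_popen_kwargs(platform: str = sys.platform) -> dict:
    """Keyword arguments that detach a child process from the parent."""
    if platform.startswith("win"):
        # DETACHED_PROCESS only exists on Windows builds of Python;
        # 0x00000008 is its documented value, used here as a fallback.
        flag = getattr(subprocess, "DETACHED_PROCESS", 0x00000008)
        return {"creationflags": flag}
    return {"start_new_session": True}


def wait_until_healthy(is_healthy: Callable[[], bool], timeout: float = 10.0,
                       interval: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `is_healthy` every `interval` seconds until it passes or time runs out."""
    attempts = max(1, int(timeout / interval))
    for _ in range(attempts):
        if is_healthy():
            return True
        sleep(interval)
    return False
```

Under these assumptions, starting the server would be roughly `subprocess.Popen(["ollama", "serve"], **detached_popen_kwargs())` followed by `wait_until_healthy(...)` with a check against GET /api/version. Polling returns as soon as the server answers instead of always waiting the full timeout.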

Async Context Manager Protocol

BaseLLMInterface implements __aenter__ / __aexit__, enabling all subclasses to be used as async with context managers. On exit, acleanup() is called, allowing each adapter to close HTTP sessions or SDK clients. This prevents resource leaks in long-running async applications.
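A minimal sketch of that wiring, using a toy class rather than the real base class (`ManagedClient` and its `closed` flag are hypothetical and exist only for the demonstration):

```python
import asyncio


class ManagedClient:
    """Toy adapter showing the __aenter__ / __aexit__ / acleanup wiring."""

    def __init__(self) -> None:
        self.closed = False

    async def acleanup(self) -> None:
        # A real adapter would close its HTTP session or SDK client here.
        self.closed = True

    async def __aenter__(self) -> "ManagedClient":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on exceptions; returning None re-raises any error.
        await self.acleanup()


async def demo() -> bool:
    async with ManagedClient() as client:
        assert client.closed is False
    return client.closed
```

Because `__aexit__` returns None, exceptions raised inside the `async with` body still propagate after cleanup has run.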

Connection Validation

BaseLLMInterface.validate_connection() provides a health-check method that sends a minimal test prompt and returns a structured dict {"success": bool, "reason": str, "response": str}. This is useful for startup checks, CI/CD pipeline validation, and provider fallback logic.
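One way such a health check could be structured is shown below as a free function that takes the generate callable as an argument. This is a sketch under that assumption; in the library it is a method on the base class, and its exact internals are not shown in this document.

```python
from typing import Callable


def validate_connection(generate: Callable[..., str], test_prompt: str = "test",
                        max_tokens: int = 10, temperature: float = 0.7) -> dict:
    """Send a small prompt and report the outcome as a dict instead of raising."""
    try:
        response = generate(test_prompt, max_tokens=max_tokens, temperature=temperature)
    except Exception as exc:  # broad on purpose: any failure means "not healthy"
        return {"success": False, "reason": str(exc), "response": None}
    if not response:
        return {"success": False, "reason": "Empty response", "response": None}
    return {"success": True, "reason": "Connection successful", "response": response}
```

Returning a dict rather than raising lets startup code try several providers in turn and pick the first one that reports `success`.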

4. Module & Component Breakdown

`LLM Config` — Centralized Defaults

Purpose: Single source of truth for all default generation parameters and provider-specific settings.

Key class: llm_config (Python dataclass)

| Field | Default | Description |
| --- | --- | --- |
| `max_token` | `2048` | Default max output tokens |
| `temperature` | `0.3` | Default sampling temperature |
| `top_p` | `0.9` | Nucleus sampling cutoff |
| `top_k` | `50` | Top-k sampling |
| `gemini_model` | `gemini-3-flash-preview` | Default Gemini model |
| `mistral_model` | `mistral-large-latest` | Default Mistral model |
| `groq_model` | `llama-3.3-70b-versatile` | Default Groq model |
| `ollama_model` | `llama2` | Default local Ollama model |
| `ollama_base_url` | `http://127.0.0.1:11434` | Ollama server address |
| `time_out` | `200` | Ollama request read timeout (seconds) |
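Read as code, the table corresponds to a dataclass along these lines. The field names and defaults are taken from the table; the type annotations are inferred from the default values and are an assumption.

```python
from dataclasses import dataclass


@dataclass
class llm_config:  # lowercase class name kept to match the document
    max_token: int = 2048
    temperature: float = 0.3
    top_p: float = 0.9
    top_k: int = 50
    gemini_model: str = "gemini-3-flash-preview"
    mistral_model: str = "mistral-large-latest"
    groq_model: str = "llama-3.3-70b-versatile"
    ollama_model: str = "llama2"
    ollama_base_url: str = "http://127.0.0.1:11434"
    time_out: int = 200  # Ollama read timeout, in seconds
```

Individual fields can be overridden at construction, for example `llm_config(temperature=0.7)`, without touching the other defaults.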

`Base LLM Interface` — Abstract Base Class

Purpose: Defines the mandatory contract for all LLM adapters and provides shared utility implementations.

Key responsibilities:

Declares generate and generate_async as @abstractmethod — subclasses must implement both.
Provides a default astream implementation (word-by-word simulation) that subclasses can override with native streaming.
Implements validate_connection() as a concrete, reusable health-check.
Implements the async context manager protocol (__aenter__ / __aexit__ / acleanup).

Interactions: Imported by all six provider adapters. llm_config_ is instantiated at module level and used for default parameter values.

`OpenAI Interface` — OpenAI Adapter

Purpose: Wraps openai.OpenAI (sync) and openai.AsyncOpenAI (async) clients.

Key design detail: _build_messages is a static helper that accepts either a raw prompt: str or a pre-formatted messages: list[dict]. This allows the adapter to support both simple single-turn usage and multi-turn conversation history natively.
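The dual-input normalization might be written roughly as follows. This is a standalone sketch, not the library's `_build_messages`; the priority of `messages` over `prompt` and the ValueError when both are missing follow the behavior described elsewhere in this document, while the error text is invented.

```python
from typing import Optional


def build_messages(prompt: Optional[str] = None,
                   messages: Optional[list[dict]] = None) -> list[dict]:
    """Return a chat-style message list from either input form."""
    if messages is not None:
        return messages  # a pre-formatted list takes priority
    if prompt is not None:
        # Wrap a bare string as a single user turn.
        return [{"role": "user", "content": prompt}]
    raise ValueError("Either 'prompt' or 'messages' must be provided.")
```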

Streaming: Native via stream=True on chat.completions.create; yields chunk.choices[0].delta.content tokens.

Cleanup: Both sync and async clients are explicitly closed. The async client uses await client.close().

Error handling: Catches all exceptions, logs via logging, and returns {"error": str(e)} instead of raising — allowing callers to handle errors gracefully without try/except at every call site.

`Anthropic Interface` — Anthropic (Claude) Adapter

Purpose: Wraps anthropic.Anthropic (sync) and anthropic.AsyncAnthropic (async) clients.

Key design detail: Mirrors the OpenAI interface exactly — including the _build_messages static method and the prompt / messages dual-input pattern. This intentional API symmetry means code written for OpenAI can be redirected to Claude with minimal changes.

Streaming: Native via async_client.messages.stream() context manager; yields from stream.text_stream.

Default model: claude-sonnet-4-20250514 — pinned to a specific version rather than a floating alias, which is important for reproducibility.

`Groq Interface` — Groq Adapter

Purpose: Wraps groq.Groq (sync) and groq.AsyncGroq (async) — Groq's OpenAI-compatible SDK.

Key design detail: Because Groq's API is OpenAI-compatible, this adapter is structurally nearly identical to OpenAIInterface. The abstraction value is in encapsulating the different SDK import, client types, and model namespace.

Streaming: Native, same pattern as OpenAI.

Default model: llama-3.3-70b-versatile — a high-performance open model served at Groq's inference speeds.

`Mistral Interface` — Mistral AI Adapter

Purpose: Wraps mistralai.Mistral — Mistral's official Python SDK.

Notable difference from OpenAI/Groq: Mistral uses a single client (self._client) that exposes both sync and async methods (chat.complete vs chat.complete_async), rather than separate sync/async client objects. The streaming path uses chat.stream_async.

Streaming fallback: If native streaming fails, it falls back to generate_async + word-by-word yield, catching the secondary exception independently.

Input normalization: Unlike OpenAI/Anthropic/Groq, MistralInterface always constructs messages inline from prompt — it does not expose a messages parameter in its public signature. Multi-turn usage requires callers to use the generate path with a pre-formatted prompt string.

`Gemini Interface` — Google Gemini Adapter

Purpose: Wraps google.genai.Client to call Google's Gemini models.

Notable behaviors:

Rate limiting: enforces a 5-second floor between requests via _last_request_time tracking.
Retry loop: up to 5 attempts on 503/429 errors with exponential backoff + jitter.
Response parsing: iterates response.candidates[].content.parts[].text to handle Gemini's multi-candidate, multi-part response structure, with a .text attribute fallback.
Short response guard: responses under 3 characters are retried (except on the last attempt), distinguishing between a truncated generation and a valid short factual answer.
Async implementation: generate_async wraps the synchronous SDK call in asyncio.get_running_loop().run_in_executor(None, ...) rather than using a native async client. Running the blocking call in the default thread pool keeps the event loop responsive while the request is in flight.
Streaming: simulated (word-by-word), not native.
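The response-parsing step in the list above can be illustrated with plain objects. `extract_text` is a hypothetical helper, and the test objects are built with `SimpleNamespace` rather than real SDK response types, so this shows only the traversal order and the fallback.

```python
from types import SimpleNamespace


def extract_text(response) -> str:
    """Join the text of every part of every candidate, else fall back to .text."""
    pieces = []
    for candidate in getattr(response, "candidates", None) or []:
        content = getattr(candidate, "content", None)
        for part in getattr(content, "parts", None) or []:
            text = getattr(part, "text", None)
            if text:
                pieces.append(text)
    if pieces:
        return "".join(pieces)
    # Fallback: some responses expose the text directly on the response object.
    return getattr(response, "text", None) or ""


# Example response shaped like candidates[].content.parts[].text
example = SimpleNamespace(
    candidates=[SimpleNamespace(content=SimpleNamespace(
        parts=[SimpleNamespace(text="Hello, "), SimpleNamespace(text="world")]))],
    text=None,
)
```

Using `getattr` with defaults at each level means a missing or empty `candidates`, `content`, or `parts` field degrades to the fallback instead of raising AttributeError.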

`Ollama Interface` — Ollama (Local) Adapter

Purpose: Interfaces with a locally-running Ollama server via its REST API (/api/generate, /api/tags, /api/pull).

Notable behaviors:

Server auto-start: spawns ollama serve if not already running, with cross-platform subprocess flags.
Ownership tracking: _we_started_server flag prevents stopping a server the library didn't start.
HTTP streaming for sync generation: even the synchronous generate() uses requests with stream=True and iterates NDJSON lines. This avoids read timeout on long generations — each received token resets the underlying TCP socket's read deadline.
Async generation: uses aiohttp for native async HTTP, falling back to asyncio.to_thread(self.generate, ...) if aiohttp is unavailable.
Model management: exposes list_models() and pull_model() methods not present in other adapters — necessary because local model availability is not guaranteed.
Cleanup via __del__: the destructor calls stop_server(), ensuring the server is terminated when the object is garbage collected — though explicit stop_server() calls are preferred.
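The NDJSON handling mentioned above can be sketched independently of any HTTP client. `join_ndjson_stream` is a hypothetical helper; it assumes each streamed line is a JSON object carrying a `response` fragment and a `done` flag, which is how Ollama's /api/generate endpoint documents its streaming output.

```python
import json
from typing import Iterable


def join_ndjson_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments from newline-delimited JSON lines."""
    parts = []
    for raw in lines:
        if not raw:
            continue  # skip empty keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the generation
    return "".join(parts)
```

With requests, the `lines` argument would typically be `resp.iter_lines()` on a response obtained with `stream=True`, so each line is processed as soon as it arrives.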

`__init__.py` — Package Entry Point

Purpose: Exposes the public API and maintains metadata about supported providers and known model families.

Exports: BaseLLMInterface, all six provider interfaces, and llm_config.

Provider registry: __llm_providers__ list can be used by application code for validation or dynamic instantiation.

HuggingFace registry: __hugginface_models__ lists known open model family prefixes — suggests future planned support for HuggingFace-hosted models.

5. API / Public Interfaces

`BaseLLMInterface`

class BaseLLMInterface(ABC):
    def __init__(self, model_name: str, api_key: str, **kwargs): ...

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3, **kwargs) -> str: ...

    @abstractmethod
    async def generate_async(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3, **kwargs) -> str: ...

    async def astream(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3, **kwargs) -> AsyncIterator[str]: ...

    def validate_connection(self, test_prompt: str = "test", max_tokens: int = 10, temperature: float = 0.7, async_mode: bool = False) -> dict: ...

    async def acleanup(self): ...

Provider Interfaces (OpenAI, Anthropic, Groq)

These three share an identical extended signature that supports multi-turn conversation:

def generate(
    self,
    messages: Optional[list[dict]] = None,   # pre-formatted message list (takes priority)
    prompt: Optional[str] = None,             # simple string prompt (converted to messages internally)
    max_tokens: int = 2048,
    temperature: float = 0.3,
    **kwargs,                                 # passed through to provider SDK
) -> Union[str, dict]: ...                    # str on success, {"error": "..."} on failure

Gemini / Mistral / Ollama

These use a simpler signature:

def generate(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3, **kwargs) -> str: ...

`validate_connection` Return Type

{
    "success": True,
    "reason": "Connection successful",
    "response": "<model's response to test_prompt>"
}

# On failure:
{
    "success": False,
    "reason": "<exception message or 'Empty response'>",
    "response": None
}

`OllamaInterface` — Additional Methods

def list_models(self) -> list[str]: ...
# Returns: ["llama2", "mistral", "phi3", ...] — names of locally available models

def pull_model(self, model_name: str = None) -> bool: ...
# Downloads a model from Ollama's registry. Uses self.model_name if model_name is None.

def stop_server(self) -> None: ...
# Terminates the Ollama server process, only if this instance started it.

6. Configuration System

`llm_config` Dataclass

All defaults live in a single dataclass, llm_config, defined in config_llm.py. It is instantiated once per module:

from fennec_community.llm import llm_config
config = llm_config()

Because it is a plain dataclass, instances can be modified at runtime:

config = llm_config()
config.max_token = 4096
config.temperature = 0.7

However, since each module instantiates its own llm_config() at import time, runtime mutations to one module's instance do not affect others. To apply global overrides, pass parameters explicitly to each interface constructor and method.
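The isolation between instances is easy to see with a cut-down stand-in. `Config` below is a hypothetical two-field version of llm_config, and the two variables represent what two different modules would each create at import time.

```python
from dataclasses import dataclass


@dataclass
class Config:  # stand-in for llm_config with two of its fields
    max_token: int = 2048
    temperature: float = 0.3


module_a_config = Config()  # what one module would create at import time
module_b_config = Config()  # a second module's independent instance

# Mutating one instance leaves the other untouched.
module_a_config.max_token = 4096
print(module_a_config.max_token, module_b_config.max_token)
```

This is why per-call arguments such as `max_tokens=4096` are the reliable way to override a default everywhere.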

Environment Variables

The library does not currently read from environment variables directly. API keys must be passed explicitly:

import os
OpenAIInterface(model_name="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])

This is a deliberate design choice for explicitness — integrating with python-dotenv or os.environ is the caller's responsibility.

Per-Provider Defaults

| Provider | Default Model | Notes |
| --- | --- | --- |
| OpenAI | `gpt-4` | Must be overridden for newer models |
| Anthropic | `claude-sonnet-4-20250514` | Version-pinned |
| Gemini | `gemini-3-flash-preview` | Via `config.gemini_model` |
| Groq | `llama-3.3-70b-versatile` | Via `config.groq_model` |
| Mistral | `mistral-large-latest` | Via `config.mistral_model` |
| Ollama | `llama2` | Via `config.ollama_model` |

Ollama-Specific Configuration

OllamaInterface(
    model_name="llama3",
    base_url="http://192.168.1.10:11434",  # remote server
    auto_start=False,                        # don't try to spawn server
    server_start_wait=20,                    # seconds to poll for startup
)

7. Usage Guide

Quick Start

pip install openai anthropic google-genai mistralai groq
# For local models:
pip install requests aiohttp
# Install Ollama binary from https://ollama.com/download
pip install fennec-community

from fennec_community.llm import OpenAIInterface

llm = OpenAIInterface(model_name="gpt-4o", api_key="sk-...")
response = llm.generate(prompt="Explain the Transformer architecture in 3 sentences.")
print(response)

Basic Usage

Synchronous generation:

from fennec_community.llm import AnthropicInterface

llm = AnthropicInterface(model_name="claude-sonnet-4-20250514", api_key="sk-ant-...")
result = llm.generate(prompt="What is the capital of France?")
# e.g. "Paris" (exact wording depends on the model)

Asynchronous generation:

import asyncio
from fennec_community.llm import GroqInterface

async def main():
    llm = GroqInterface(api_key="gsk_...")
    result = await llm.generate_async(prompt="Write a haiku about inference speed.")
    print(result)

asyncio.run(main())

Streaming:

import asyncio
from fennec_community.llm import MistralInterface

async def stream_response():
    llm = MistralInterface(api_key="...")
    async for token in llm.astream("Describe quantum entanglement"):
        print(token, end="", flush=True)
    print()  # newline at end

asyncio.run(stream_response())

Advanced Usage

Multi-turn conversation (OpenAI/Anthropic/Groq):

messages = [
    {"role": "system", "content": "You are a Python expert."},
    {"role": "user", "content": "What is a decorator?"},
    {"role": "assistant", "content": "A decorator is a function that wraps another function..."},
    {"role": "user", "content": "Show me an example."},
]
response = llm.generate(messages=messages)

Async context manager (automatic cleanup):

async def summarize() -> str:
    async with AnthropicInterface(model_name="claude-sonnet-4-20250514", api_key="...") as llm:
        result = await llm.generate_async(prompt="Summarize the BERT paper.")
    # acleanup() was called on exit, so the HTTP client is closed
    return result

Connection validation at startup:

llm = OpenAIInterface(api_key="sk-...")
status = llm.validate_connection()
if not status["success"]:
    raise RuntimeError(f"LLM unavailable: {status['reason']}")

Local inference with Ollama:

from fennec_community.llm import OllamaInterface

llm = OllamaInterface(model_name="llama3", auto_start=True)
# Server starts automatically if not running

available = llm.list_models()
if "llama3" not in available:
    llm.pull_model("llama3")  # download ~4GB

response = llm.generate("Translate 'hello' to Arabic.")
print(response)
llm.stop_server()  # only if we started it

Provider switching at runtime:

from fennec_community.llm import BaseLLMInterface, OpenAIInterface, AnthropicInterface

def get_llm(provider: str, api_key: str) -> BaseLLMInterface:
    providers = {
        "openai": OpenAIInterface,
        "anthropic": AnthropicInterface,
    }
    cls = providers.get(provider)
    if not cls:
        raise ValueError(f"Unknown provider: {provider}")
    return cls(api_key=api_key)

llm = get_llm("anthropic", api_key="...")

8. Code Examples

Parallel async calls across providers

import asyncio
from fennec_community.llm import OpenAIInterface, AnthropicInterface, GroqInterface

async def compare_providers(prompt: str) -> dict:
    providers = {
        "openai": OpenAIInterface(api_key="sk-..."),
        "anthropic": AnthropicInterface(api_key="sk-ant-..."),
        "groq": GroqInterface(api_key="gsk_..."),
    }
    tasks = {
        name: llm.generate_async(prompt=prompt)
        for name, llm in providers.items()
    }
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    return dict(zip(tasks.keys(), results))

responses = asyncio.run(compare_providers("What is RAG?"))
for provider, response in responses.items():
    print(f"\n=== {provider.upper()} ===\n{response}")

FastAPI streaming endpoint

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from fennec_community.llm import AnthropicInterface

app = FastAPI()
llm = AnthropicInterface(api_key="sk-ant-...")

@app.get("/stream")
async def stream(prompt: str):
    async def generate():
        async for token in llm.astream(prompt=prompt):
            yield token

    return StreamingResponse(generate(), media_type="text/plain")

Gemini with custom generation parameters

from fennec_community.llm import GeminiInterface

llm = GeminiInterface(model_name="gemini-3-flash-preview", api_key="AIza...")
result = llm.generate(
    prompt="Write a detailed analysis of transformer attention mechanisms.",
    max_tokens=4096,
    temperature=0.2,
    top_p=0.85,
    top_k=40,
)
print(result)

Local Ollama model management

from fennec_community.llm import OllamaInterface

llm = OllamaInterface(
    model_name="phi3",
    base_url="http://localhost:11434",
    auto_start=True,
    server_start_wait=15,
)

print("Available models:", llm.list_models())

if "phi3" not in llm.list_models():
    success = llm.pull_model("phi3")
    if not success:
        raise RuntimeError("Failed to download model.")

response = llm.generate("What is chain-of-thought prompting?")
print(response)

9. Design Decisions & Trade-offs

Adapter over direct SDK usage

Why: Provider SDKs change frequently. Wrapping them in adapters localizes breaking changes to a single file per provider. Callers are insulated.

Trade-off: A thin abstraction layer adds one indirection hop. For high-throughput systems where every microsecond counts, the overhead is measurable but negligible relative to network latency.

Module-level config instantiation

Why: llm_config_ = llm_config() at module level avoids re-instantiating the dataclass on every call. It's a minor optimization that also makes the default values visible in IDEs.

Trade-off: Module-level instances are not thread-safe if mutated. Since the config is effectively read-only after import, this is acceptable. If dynamic config changes are needed, pass parameters explicitly per-call.

Dual-input pattern (`prompt` vs `messages`)

Why: Simple prompt strings cover 80% of use cases. Exposing messages directly supports multi-turn and system prompt workflows without requiring callers to manually wrap strings in [{"role": "user", "content": ...}].

Trade-off: The Optional[str] / Optional[list[dict]] signature is slightly more complex. A ValueError is raised if both are None, which is a runtime check rather than a type-checked invariant.

Error returns vs raised exceptions

Why: Provider interfaces return {"error": "..."} on failure rather than raising. This makes it easier to handle failures in batch processing pipelines without wrapping every call in try/except.

Trade-off: Callers must check the return type. A function annotated as -> Union[str, dict] is less ergonomic than one that always returns str or always raises on error. Future versions might standardize on typed exceptions.
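Callers who prefer exceptions can convert the error-dict convention at the call site. The helper below is a sketch; `LLMCallError` and `unwrap` are hypothetical names that the library does not provide.

```python
from typing import Union


class LLMCallError(RuntimeError):
    """Hypothetical exception type for callers who prefer raising."""


def unwrap(result: Union[str, dict]) -> str:
    """Return the text, or raise if the adapter returned an error dict."""
    if isinstance(result, dict) and "error" in result:
        raise LLMCallError(result["error"])
    if not isinstance(result, str):
        raise TypeError(f"Unexpected result type: {type(result).__name__}")
    return result
```

Usage would look like `text = unwrap(llm.generate(prompt="..."))`, which restores a plain `str` return type for downstream code.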

Simulated streaming fallback in base class

Why: GeminiInterface does not use a native streaming call (at the time of writing). Providing a simulated word-by-word stream in BaseLLMInterface.astream means all subclasses are streaming-compatible from day one, even when an adapter has no native streaming path.

Trade-off: Simulated streaming is not true token streaming — the full response is buffered first, then split on whitespace. For very long responses this delays the first token. Subword tokens may not align with word boundaries. This is clearly documented in the base class.

Ollama auto-start

Why: Local inference users often forget to start the server. Auto-starting reduces onboarding friction and avoids cryptic connection error messages.

Trade-off: Auto-starting a subprocess from within a library is opinionated behavior. It is therefore opt-out (auto_start=False) and ownership-aware (_we_started_server). The destructor stop is a best-effort safety net, not a guarantee.

10. Extensibility Guide

Adding a New Provider

Create llm/newprovider_interface.py.
Subclass BaseLLMInterface.
Implement generate and generate_async.
Optionally override astream for native streaming.
Optionally override acleanup to close clients.

# llm/newprovider_interface.py
import asyncio
from typing import Optional, Union

from fennec_community.llm import BaseLLMInterface, llm_config

config = llm_config()  # per-module defaults, mirroring the built-in adapters


class NewProviderInterface(BaseLLMInterface):
    def __init__(self, model_name: str = "default-model", api_key: Optional[str] = None, **kwargs):
        super().__init__(model_name, api_key, **kwargs)
        try:
            import newprovider_sdk  # imported lazily so the dependency stays optional
        except ImportError as exc:
            raise ImportError("Missing dependency; run: pip install newprovider-sdk") from exc
        self._client = newprovider_sdk.Client(api_key=self.api_key)

    def generate(self, prompt: str, max_tokens: int = config.max_token,
                 temperature: float = config.temperature, **kwargs) -> Union[str, dict]:
        try:
            response = self._client.complete(prompt, max_tokens=max_tokens, temperature=temperature)
            return response.text
        except Exception as e:
            # Follow the library convention: report failures as an error dict.
            return {"error": str(e)}

    async def generate_async(self, prompt: str, max_tokens: int = config.max_token,
                             temperature: float = config.temperature, **kwargs) -> Union[str, dict]:
        # No native async client here, so run the blocking call in a worker thread.
        return await asyncio.to_thread(self.generate, prompt, max_tokens, temperature, **kwargs)

11. Performance & Scalability

Async Concurrency

For concurrent requests, use asyncio.gather with generate_async. Because each provider interface maintains its own client instances, multiple concurrent calls share the underlying connection pool managed by the SDK (e.g., httpx for OpenAI/Anthropic/Groq).

results = await asyncio.gather(
    llm.generate_async(prompt=p1),
    llm.generate_async(prompt=p2),
    llm.generate_async(prompt=p3),
)

Client Reuse

Constructing an interface object creates SDK clients (and HTTP connection pools) once. Reuse the same interface object across requests rather than constructing a new one per call. For Ollama specifically, this also avoids repeated server health checks.

Ollama Timeout Strategy

Ollama uses a split timeout (connect=10, read=200). The read timeout is per-read, not per-response: because the implementation uses streaming internally, each received token resets the read clock. This means very long local model responses will not time out even with the 200-second cap, as long as no single gap between tokens exceeds it.

Gemini Rate Limiting

The 5-second inter-request floor in GeminiInterface serializes requests on a per-instance basis. Instantiating multiple GeminiInterface objects (each with its own rate limiter state) and distributing requests across them raises client-side throughput, but it only shards the local gate: the provider-side quota still applies, so instances sharing one API key can still receive 429 responses.

Memory Considerations

All adapters except Ollama operate on complete response strings (no generator-based intermediate buffering for non-streaming calls). For very large responses, consider using astream on an adapter with native streaming to process tokens incrementally; simulated streaming still buffers the full output before yielding.

Provider-Specific Latency Profile

| Provider | Latency | Throughput | Notes |
| --- | --- | --- | --- |
| Groq | Very low (~200ms) | High | Hardware-accelerated inference |
| OpenAI | Low–Medium | High | Varies by model tier |
| Anthropic | Medium | Medium | Higher on complex reasoning |
| Gemini | Medium + rate limit | Limited (free tier) | 5s floor enforced |
| Mistral | Medium | Medium | EU-hosted option available |
| Ollama | Depends on hardware | Unlimited (local) | No API costs, hardware-bound |

