LLM Modular
A unified, provider-agnostic Python interface for interacting with large language models.
Supports synchronous generation, asynchronous generation, and native token streaming across six major LLM providers with a single, consistent API surface.
Table of Contents
- High-Level Overview
- Architecture Overview
- Core Concepts
- Module & Component Breakdown
- API / Public Interfaces
- Configuration System
- Usage Guide
- Code Examples
- Design Decisions & Trade-offs
- Extensibility Guide
- Project Structure
- Performance & Scalability
1. High-Level Overview
What It Does
The llm package is a provider-agnostic LLM client library that wraps six distinct LLM backends — OpenAI, Anthropic (Claude), Google Gemini, Groq, Mistral, and Ollama — behind a single, stable interface. Consumers of this library can swap providers, switch models, or run experiments across backends without changing application-level code.
Problem It Solves
Integrating LLMs into production systems typically means writing bespoke client code for each provider: different SDKs, different authentication models, different streaming APIs, different retry semantics, and different async patterns. This library eliminates that friction by:
- Providing one consistent method surface (generate, generate_async, astream) regardless of provider.
- Encapsulating all provider-specific quirks (rate limits, retry logic, server lifecycle management, response parsing) inside each adapter.
- Enabling runtime provider selection via a shared configuration dataclass.
Key Design Ideas
- Template Method pattern via BaseLLMInterface: the abstract base defines the contract; concrete subclasses fulfill it per provider.
- Adapter pattern: each provider's SDK is wrapped inside a thin adapter that normalizes inputs and outputs.
- Graceful degradation: where native streaming is unsupported (Gemini, Mistral fallback path), the base class provides a word-by-word simulation so callers always get an async generator.
- Dual-client architecture: for providers supporting it (OpenAI, Anthropic, Groq), both a sync and an async client are instantiated at construction time, avoiding per-request setup overhead.
2. Architecture Overview
Component Hierarchy
BaseLLMInterface (ABC)
│
├── OpenAIInterface: OpenAI GPT-4/3.5, native sync+async+streaming
├── AnthropicInterface: Claude 3/4 series, native sync+async+streaming
├── GroqInterface: Llama/Mixtral via Groq, native sync+async+streaming
├── MistralInterface: Mistral models, native sync+async+streaming
├── GeminiInterface: Google Gemini, sync+async, simulated streaming
└── OllamaInterface: Local Ollama server, HTTP-based, full lifecycle management
Data Flow
Caller
│
▼
[ProviderInterface].generate(prompt) / generate_async() / astream()
│
├─ _build_messages(prompt, messages) # normalize input (where applicable)
│
├─ provider SDK call # provider-specific I/O
│
├─ response extraction # normalize output to str
│
└─ return str / AsyncIterator[str]
Module Interaction
config_llm.py is consumed by every module at import time via llm_config_ = llm_config(). All concrete interfaces inherit from BaseLLMInterface, which itself imports from config_llm. The __init__.py re-exports all public symbols, making the entire library accessible from a single import.
3. Core Concepts
Adapter Pattern
Each provider class is an adapter — it translates the library's normalized interface (generate(prompt, max_tokens, temperature)) into the provider's specific SDK calls, response shapes, and error models. The caller never interacts with a provider SDK directly.
Dual Execution Modes
Every provider exposes both synchronous (generate) and asynchronous (generate_async) generation paths. This is intentional: sync execution is simpler for scripts and notebooks; async is required for high-throughput services, concurrent request batching, and frameworks like FastAPI or asyncio-based pipelines.
Async Token Streaming
The astream method is an async generator that yields string tokens as they are produced. This enables real-time output display (e.g., chat UIs, CLI spinners) without waiting for full response completion.
There are two streaming tiers:
- Native streaming (OpenAI, Anthropic, Groq, Mistral): tokens are yielded as the model produces them via SSE or streaming SDK support.
- Simulated streaming (Gemini, Ollama fallback, base class default): the full response is generated first, then split word-by-word and yielded with asyncio.sleep(0) between yields. This preserves the async generator contract without true token streaming.
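The simulated tier can be pictured with a short sketch. This is not the library's code: simulate_stream and collect are hypothetical names, and the exact way the real base class re-attaches whitespace is an assumption.

```python
import asyncio
from typing import AsyncIterator


async def simulate_stream(full_text: str) -> AsyncIterator[str]:
    """Yield an already-complete response one word at a time (hypothetical helper)."""
    words = full_text.split()
    for i, word in enumerate(words):
        # Assumption: re-attach a single space to every word except the last.
        yield word if i == len(words) - 1 else word + " "
        # Hand control back to the event loop between yields.
        await asyncio.sleep(0)


async def collect(text: str) -> list[str]:
    return [token async for token in simulate_stream(text)]


tokens = asyncio.run(collect("streaming is simulated here"))
print(tokens)
```

Note that the first token only appears after the whole response exists, which is the latency cost of this tier.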
Rate Limiting (Gemini-specific)
GeminiInterface implements client-side rate limiting via a timestamp-based gate: it enforces a minimum 5-second interval between requests, which allows at most 60 / 5 = 12 requests per minute and so stays within Gemini's free-tier cap of 15 req/min. This is applied to both the sync and async paths.
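A timestamp-based gate of this kind might look like the sketch below. MinIntervalGate and FakeClock are hypothetical names introduced for illustration; the injectable clock and sleep exist only so the example runs instantly and are not claimed to be part of GeminiInterface.

```python
import time
from typing import Callable, Optional


class MinIntervalGate:
    """Block until at least `min_interval` seconds have passed since the last request."""

    def __init__(self, min_interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request_time: Optional[float] = None  # no request made yet

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return the seconds slept."""
        waited = 0.0
        if self._last_request_time is not None:
            elapsed = self._clock() - self._last_request_time
            if elapsed < self.min_interval:
                waited = self.min_interval - elapsed
                self._sleep(waited)
        self._last_request_time = self._clock()
        return waited


class FakeClock:
    """Deterministic clock so the demo does not really sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
gate = MinIntervalGate(5.0, clock=fake.time, sleep=fake.sleep)
first_wait = gate.wait()    # first request passes immediately
fake.now += 2.0             # two seconds of other work
second_wait = gate.wait()   # must wait out the remaining three seconds
print(first_wait, second_wait)
```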
Exponential Backoff with Jitter (Gemini-specific)
On 503 UNAVAILABLE or 429 Too Many Requests responses, GeminiInterface retries up to 5 times with exponential backoff (base_delay * 2^attempt) plus random jitter (random.uniform(0, 1.5) seconds). This is the standard pattern for handling transient cloud API errors: the growing delay gives the service time to recover, and the jitter keeps many clients from retrying in lockstep (the thundering-herd problem).
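The delay formula above can be sketched as follows. TransientError, backoff_delay and call_with_retries are hypothetical names, and the base delay of 1.0 second is an assumption, since the value the library uses is not stated here.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a 503 UNAVAILABLE / 429 response."""


def backoff_delay(attempt: int, base_delay: float = 1.0, max_jitter: float = 1.5) -> float:
    # base_delay * 2^attempt, plus uniform jitter in [0, max_jitter] seconds.
    return base_delay * (2 ** attempt) + random.uniform(0, max_jitter)


def call_with_retries(fn: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base_delay))
    raise RuntimeError("max_attempts must be at least 1")


# Demo: fail twice, then succeed. Delays are recorded instead of slept.
delays: list[float] = []
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503 UNAVAILABLE")
    return "ok"


result = call_with_retries(flaky, sleep=delays.append)
print(result, [round(d, 2) for d in delays])
```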
Server Lifecycle Management (Ollama-specific)
OllamaInterface uniquely manages the full lifecycle of a local Ollama server process:
- On construction, it checks server health via GET /api/version.
- If the server is not running and auto_start=True, it spawns ollama serve as a detached subprocess (cross-platform: DETACHED_PROCESS on Windows, start_new_session=True on Unix).
- It tracks ownership: _we_started_server ensures that stop_server() never terminates a pre-existing Ollama instance that the user started manually.
- Smart polling (0.5s intervals) replaces a blind sleep for faster startup detection.
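The polling step in the list above could be approximated like this. wait_until_ready, FakeClock and FakeServer are hypothetical; in the real adapter the readiness check would be the GET /api/version request, which is replaced here by a stub so the sketch runs without a server.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool], timeout: float = 15.0,
                     interval: float = 0.5,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `is_ready` every `interval` seconds until it succeeds or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True   # return as soon as the server answers
        if clock() >= deadline:
            return False  # gave up: the server never became healthy
        sleep(interval)


class FakeClock:
    """Deterministic clock so the demo finishes instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


class FakeServer:
    """Stub health check that starts succeeding on the Nth call."""

    def __init__(self, ready_after: int) -> None:
        self.checks = 0
        self.ready_after = ready_after

    def is_ready(self) -> bool:
        self.checks += 1
        return self.checks >= self.ready_after


clock = FakeClock()
server = FakeServer(ready_after=3)
started = wait_until_ready(server.is_ready, clock=clock.time, sleep=clock.sleep)

slow_clock = FakeClock()
never_ready = FakeServer(ready_after=10**6)
timed_out = not wait_until_ready(never_ready.is_ready, timeout=1.0,
                                 clock=slow_clock.time, sleep=slow_clock.sleep)
print(started, clock.now, timed_out)
```

Compared with a fixed sleep of the full wait time, the caller here resumes one second after launch instead of fifteen.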
Async Context Manager Protocol
BaseLLMInterface implements __aenter__ / __aexit__, enabling all subclasses to be used as async with context managers. On exit, acleanup() is called, allowing each adapter to close HTTP sessions or SDK clients. This prevents resource leaks in long-running async applications.
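A minimal sketch of that protocol, assuming nothing beyond what is described above. AdapterSketch is a hypothetical stand-in for a subclass, and its closed flag stands in for closing a real SDK client.

```python
import asyncio


class AdapterSketch:
    """Hypothetical adapter showing __aenter__ / __aexit__ delegating to acleanup()."""

    def __init__(self) -> None:
        self.closed = False

    async def acleanup(self) -> None:
        # A real adapter would close its HTTP session or SDK client here.
        self.closed = True

    async def __aenter__(self) -> "AdapterSketch":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self.acleanup()
        # Returning None lets any exception from the body propagate.


async def demo() -> tuple:
    async with AdapterSketch() as llm:
        closed_inside = llm.closed
    return closed_inside, llm.closed


async def demo_error() -> bool:
    llm = AdapterSketch()
    try:
        async with llm:
            raise ValueError("boom")
    except ValueError:
        pass
    return llm.closed  # cleanup ran even though the body raised


closed_inside, closed_after = asyncio.run(demo())
closed_after_error = asyncio.run(demo_error())
print(closed_inside, closed_after, closed_after_error)
```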
Connection Validation
BaseLLMInterface.validate_connection() provides a health-check method that sends a minimal test prompt and returns a structured dict {"success": bool, "reason": str, "response": str}. This is useful for startup checks, CI/CD pipeline validation, and provider fallback logic.
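The health check can be sketched as a free function over any generate callable. This is an illustration of the documented return shape, not the library's implementation; treating an {"error": ...} return value as a failure is an assumption.

```python
from typing import Callable


def validate_connection(generate: Callable[..., object], test_prompt: str = "test",
                        max_tokens: int = 10, temperature: float = 0.7) -> dict:
    """Send a tiny prompt and report the outcome as a structured dict."""
    try:
        response = generate(test_prompt, max_tokens=max_tokens, temperature=temperature)
    except Exception as exc:
        return {"success": False, "reason": str(exc), "response": None}
    # Assumption: adapters that return {"error": ...} count as failed checks.
    if isinstance(response, dict) and "error" in response:
        return {"success": False, "reason": str(response["error"]), "response": None}
    if not response or not str(response).strip():
        return {"success": False, "reason": "Empty response", "response": None}
    return {"success": True, "reason": "Connection successful", "response": response}


def failing(prompt: str, **kwargs) -> str:
    raise TimeoutError("timed out")


ok = validate_connection(lambda prompt, **kw: "pong")
down = validate_connection(failing)
blank = validate_connection(lambda prompt, **kw: "   ")
bad_key = validate_connection(lambda prompt, **kw: {"error": "invalid api key"})
print(ok, down, blank, bad_key, sep="\n")
```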
4. Module & Component Breakdown
LLM Config — Centralized Defaults
Purpose: Single source of truth for all default generation parameters and provider-specific settings.
Key class: llm_config (Python dataclass)
| Field | Default | Description |
|---|---|---|
| max_token | 2048 | Default max output tokens |
| temperature | 0.3 | Default sampling temperature |
| top_p | 0.9 | Nucleus sampling cutoff |
| top_k | 50 | Top-k sampling |
| gemini_model | gemini-3-flash-preview | Default Gemini model |
| mistral_model | mistral-large-latest | Default Mistral model |
| groq_model | llama-3.3-70b-versatile | Default Groq model |
| ollama_model | llama2 | Default local Ollama model |
| ollama_base_url | http://127.0.0.1:11434 | Ollama server address |
| time_out | 200 | Ollama request read timeout (seconds) |
Base LLM Interface — Abstract Base Class
Purpose: Defines the mandatory contract for all LLM adapters and provides shared utility implementations.
Key responsibilities:
- Declares generate and generate_async as @abstractmethod; subclasses must implement both.
- Provides a default astream implementation (word-by-word simulation) that subclasses can override with native streaming.
- Implements validate_connection() as a concrete, reusable health-check.
- Implements the async context manager protocol (__aenter__ / __aexit__ / acleanup).
Interactions: Imported by all six provider adapters. llm_config_ is instantiated at module level and used for default parameter values.
OpenAI Interface — OpenAI Adapter
Purpose: Wraps openai.OpenAI (sync) and openai.AsyncOpenAI (async) clients.
Key design detail: _build_messages is a static helper that accepts either a raw prompt: str or a pre-formatted messages: list[dict]. This allows the adapter to support both simple single-turn usage and multi-turn conversation history natively.
Streaming: Native via stream=True on chat.completions.create; yields chunk.choices[0].delta.content tokens.
Cleanup: Both sync and async clients are explicitly closed. The async client uses await client.close().
Error handling: Catches all exceptions, logs via logging, and returns {"error": str(e)} instead of raising — allowing callers to handle errors gracefully without try/except at every call site.
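The dual-input normalization attributed to _build_messages might be sketched as below. build_messages is a hypothetical free-function version; the priority of messages over prompt and the ValueError when both are missing follow what this document states elsewhere, while the exact error text is an assumption.

```python
from typing import Optional


def build_messages(prompt: Optional[str] = None,
                   messages: Optional[list[dict]] = None) -> list[dict]:
    """Normalize either a raw prompt or a message list into a message list."""
    if messages is not None:
        return messages  # a pre-formatted list takes priority
    if prompt is not None:
        return [{"role": "user", "content": prompt}]  # wrap a single-turn prompt
    raise ValueError("Either 'prompt' or 'messages' must be provided.")


single_turn = build_messages(prompt="What is a decorator?")
history = [
    {"role": "system", "content": "You are a Python expert."},
    {"role": "user", "content": "What is a decorator?"},
]
multi_turn = build_messages(prompt="ignored", messages=history)
print(single_turn)
print(multi_turn)
```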
Anthropic Interface — Anthropic (Claude) Adapter
Purpose: Wraps anthropic.Anthropic (sync) and anthropic.AsyncAnthropic (async) clients.
Key design detail: Mirrors the OpenAI interface exactly — including the _build_messages static method and the prompt / messages dual-input pattern. This intentional API symmetry means code written for OpenAI can be redirected to Claude with minimal changes.
Streaming: Native via async_client.messages.stream() context manager; yields from stream.text_stream.
Default model: claude-sonnet-4-20250514 — pinned to a specific version rather than a floating alias, which is important for reproducibility.
Groq Interface — Groq Adapter
Purpose: Wraps groq.Groq (sync) and groq.AsyncGroq (async) — Groq's OpenAI-compatible SDK.
Key design detail: Because Groq's API is OpenAI-compatible, this adapter is structurally nearly identical to OpenAIInterface. The abstraction value is in encapsulating the different SDK import, client types, and model namespace.
Streaming: Native, same pattern as OpenAI.
Default model: llama-3.3-70b-versatile — a high-performance open model served at Groq's inference speeds.
Mistral Interface — Mistral AI Adapter
Purpose: Wraps mistralai.Mistral — Mistral's official Python SDK.
Notable difference from OpenAI/Groq: Mistral uses a single client (self._client) that exposes both sync and async methods (chat.complete vs chat.complete_async), rather than separate sync/async client objects. The streaming path uses chat.stream_async.
Streaming fallback: If native streaming fails, it falls back to generate_async + word-by-word yield, catching the secondary exception independently.
Input normalization: Unlike OpenAI/Anthropic/Groq, MistralInterface always constructs messages inline from prompt — it does not expose a messages parameter in its public signature. Multi-turn usage requires callers to use the generate path with a pre-formatted prompt string.
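The streaming fallback described above could be structured roughly as follows. stream_with_fallback and the three stub coroutines are hypothetical, and ending the stream silently when the fallback also fails is an assumption about behavior this document does not spell out.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def stream_with_fallback(
    native_stream: Callable[[str], AsyncIterator[str]],
    generate_async: Callable[[str], Awaitable[str]],
    prompt: str,
) -> AsyncIterator[str]:
    try:
        async for token in native_stream(prompt):
            yield token
        return
    except Exception:
        # Caveat: if the native stream fails midway, the tokens already
        # yielded are followed by the full fallback text.
        pass
    try:
        text = await generate_async(prompt)
    except Exception:
        return  # assumption: end the stream quietly on a secondary failure
    for word in text.split():
        yield word + " "
        await asyncio.sleep(0)


async def working_stream(prompt: str) -> AsyncIterator[str]:
    for token in ("na", "tive"):
        yield token


async def broken_stream(prompt: str) -> AsyncIterator[str]:
    raise ConnectionError("stream unavailable")
    yield  # unreachable, but makes this an async generator


async def fake_generate(prompt: str) -> str:
    return "fallback answer"


async def failing_generate(prompt: str) -> str:
    raise ConnectionError("still unavailable")


async def collect(stream_fn, generate_fn) -> list[str]:
    return [t async for t in stream_with_fallback(stream_fn, generate_fn, "hi")]


native_tokens = asyncio.run(collect(working_stream, fake_generate))
fallback_tokens = asyncio.run(collect(broken_stream, fake_generate))
no_tokens = asyncio.run(collect(broken_stream, failing_generate))
print(native_tokens, fallback_tokens, no_tokens)
```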
Gemini Interface — Google Gemini Adapter
Purpose: Wraps google.genai.Client to call Google's Gemini models.
Notable behaviors:
- Rate limiting: enforces a 5-second floor between requests via _last_request_time tracking.
- Retry loop: up to 5 attempts on 503/429 errors with exponential backoff + jitter.
- Response parsing: iterates response.candidates[].content.parts[].text to handle Gemini's multi-candidate, multi-part response structure, with a .text attribute fallback.
- Short response guard: responses under 3 characters are treated as likely truncated and retried. On the last attempt the short response is accepted, so a valid short factual answer still gets through.
- Async implementation: generate_async wraps the synchronous SDK call in asyncio.get_running_loop().run_in_executor(None, ...) rather than using a native async SDK. This is a standard pattern for calling a blocking SDK from async code.
- Streaming: simulated (word-by-word), not native.
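The response-parsing behavior in the list above can be illustrated on stand-in objects. extract_text is a hypothetical helper, the SimpleNamespace objects only mimic the candidates / content / parts shape named above, and concatenating every part of every candidate is an assumption about how the pieces are combined.

```python
from types import SimpleNamespace


def extract_text(response) -> str:
    """Collect text from candidates[].content.parts[], else fall back to .text."""
    pieces = []
    for candidate in getattr(response, "candidates", None) or []:
        content = getattr(candidate, "content", None)
        for part in getattr(content, "parts", None) or []:
            text = getattr(part, "text", None)
            if text:
                pieces.append(text)
    if pieces:
        return "".join(pieces)
    # Fallback: some responses only expose a flat .text attribute.
    return getattr(response, "text", None) or ""


multi_part = SimpleNamespace(
    candidates=[
        SimpleNamespace(content=SimpleNamespace(parts=[
            SimpleNamespace(text="Hello, "),
            SimpleNamespace(text="world"),
        ]))
    ],
    text=None,
)
flat = SimpleNamespace(candidates=None, text="plain answer")
empty = SimpleNamespace(candidates=[], text=None)

print(extract_text(multi_part), extract_text(flat), repr(extract_text(empty)))
```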
Ollama Interface — Ollama (Local) Adapter
Purpose: Interfaces with a locally-running Ollama server via its REST API (/api/generate, /api/tags, /api/pull).
Notable behaviors:
- Server auto-start: spawns ollama serve if not already running, with cross-platform subprocess flags.
- Ownership tracking: the _we_started_server flag prevents stopping a server the library didn't start.
- HTTP streaming for sync generation: even the synchronous generate() uses requests with stream=True and iterates NDJSON lines. This avoids read timeouts on long generations, because the read timeout applies to each wait for data and every received chunk resets it.
- Async generation: uses aiohttp for native async HTTP, falling back to asyncio.to_thread(self.generate, ...) if aiohttp is unavailable.
- Model management: exposes list_models() and pull_model() methods not present in other adapters, which is necessary because local model availability is not guaranteed.
- Cleanup via __del__: the destructor calls stop_server(), so the server is normally terminated when the object is garbage collected, though explicit stop_server() calls are preferred.
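The NDJSON handling mentioned above can be sketched without a running server. join_ndjson_response is a hypothetical helper, and the sample lines imitate the streamed /api/generate format, where each line is a JSON object carrying a "response" fragment and a "done" flag; in the adapter the lines would come from iterating the streamed HTTP response.

```python
import json
from typing import Iterable


def join_ndjson_response(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments of an NDJSON stream until done."""
    pieces = []
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        pieces.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the server marks the final chunk with done=true
    return "".join(pieces)


sample_lines = [
    b'{"response": "Hel", "done": false}',
    b"",
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true}',
    b'{"response": "never read", "done": false}',
]
text = join_ndjson_response(sample_lines)
print(text)
```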
__init__.py — Package Entry Point
Purpose: Exposes the public API and maintains metadata about supported providers and known model families.
Exports: BaseLLMInterface, all six provider interfaces, and llm_config.
Provider registry: __llm_providers__ list can be used by application code for validation or dynamic instantiation.
HuggingFace registry: __hugginface_models__ lists known open model family prefixes — suggests future planned support for HuggingFace-hosted models.
5. API / Public Interfaces
BaseLLMInterface
class BaseLLMInterface(ABC):
    def __init__(self, model_name: str, api_key: str, **kwargs): ...
    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3, **kwargs) -> str: ...
    @abstractmethod
    async def generate_async(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3, **kwargs) -> str: ...
    async def astream(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3, **kwargs) -> AsyncIterator[str]: ...
    def validate_connection(self, test_prompt: str = "test", max_tokens: int = 10, temperature: float = 0.7, async_mode: bool = False) -> dict: ...
    async def acleanup(self): ...
Provider Interfaces (OpenAI, Anthropic, Groq)
These three share an identical extended signature that supports multi-turn conversation:
def generate(
    self,
    messages: Optional[list[dict]] = None,  # pre-formatted message list (takes priority)
    prompt: Optional[str] = None,  # simple string prompt (converted to messages internally)
    max_tokens: int = 2048,
    temperature: float = 0.3,
    **kwargs,  # passed through to provider SDK
) -> Union[str, dict]: ...  # str on success, {"error": "..."} on failure
Gemini / Mistral / Ollama
These use a simpler signature:
def generate(self, prompt: str, max_tokens: int = 2048, temperature: float = 0.3, **kwargs) -> str: ...
validate_connection Return Type
{
    "success": True,
    "reason": "Connection successful",
    "response": "<model's response to test_prompt>"
}
# On failure:
{
    "success": False,
    "reason": "<exception message or 'Empty response'>",
    "response": None
}
OllamaInterface: Additional Methods
def list_models(self) -> list[str]: ...
# Returns: ["llama2", "mistral", "phi3", ...] — names of locally available models
def pull_model(self, model_name: str = None) -> bool: ...
# Downloads a model from Ollama's registry. Uses self.model_name if model_name is None.
def stop_server(self) -> None: ...
# Terminates the Ollama server process, only if this instance started it.
6. Configuration System
llm_config Dataclass
All defaults live in a single dataclass, llm_config, defined in config_llm.py. Each module instantiates it once at import time:
from fennec_community.llm import llm_config
config = llm_config()
Because it is a plain dataclass, instances can be modified at runtime:
config = llm_config()
config.max_token = 4096
config.temperature = 0.7
However, since each module instantiates its own llm_config() at import time, runtime mutations to one module's instance do not affect others. To apply global overrides, pass parameters explicitly to each interface constructor and method.
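The isolation between per-module instances is ordinary dataclass behavior and can be shown with a stand-in. ConfigSketch is a hypothetical two-field copy of the real dataclass, and the two variables play the role of the module-level instances held by two different adapter modules.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for llm_config with two of its fields."""
    max_token: int = 2048
    temperature: float = 0.3


# Each adapter module creates its own instance at import time.
module_a_config = ConfigSketch()
module_b_config = ConfigSketch()

# Mutating one instance leaves the other untouched.
module_a_config.max_token = 4096

print(module_a_config.max_token, module_b_config.max_token)
```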
Environment Variables
The library does not currently read from environment variables directly. API keys must be passed explicitly:
import os
OpenAIInterface(model_name="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
This is a deliberate design choice for explicitness: integrating with python-dotenv or os.environ is the caller's responsibility.
Per-Provider Defaults
| Provider | Default Model | Notes |
|---|---|---|
| OpenAI | gpt-4 | Must be overridden for newer models |
| Anthropic | claude-sonnet-4-20250514 | Version-pinned |
| Gemini | gemini-3-flash-preview | Via config.gemini_model |
| Groq | llama-3.3-70b-versatile | Via config.groq_model |
| Mistral | mistral-large-latest | Via config.mistral_model |
| Ollama | llama2 | Via config.ollama_model |
Ollama-Specific Configuration
OllamaInterface(
    model_name="llama3",
    base_url="http://192.168.1.10:11434",  # remote server
    auto_start=False,  # don't try to spawn server
    server_start_wait=20,  # seconds to poll for startup
)
7. Usage Guide
Quick Start
pip install openai anthropic google-genai mistralai groq
# For local models:
pip install requests aiohttp
# Install Ollama binary from https://ollama.com/download
pip install fennec-community

from fennec_community.llm import OpenAIInterface
llm = OpenAIInterface(model_name="gpt-4o", api_key="sk-...")
response = llm.generate(prompt="Explain the Transformer architecture in 3 sentences.")
print(response)
Basic Usage
Synchronous generation:
from fennec_community.llm import AnthropicInterface
llm = AnthropicInterface(model_name="claude-sonnet-4-20250514", api_key="sk-ant-...")
result = llm.generate(prompt="What is the capital of France?")
# e.g. "Paris" (exact wording depends on the model)
Asynchronous generation:
import asyncio
from fennec_community.llm import GroqInterface
async def main():
    llm = GroqInterface(api_key="gsk_...")
    result = await llm.generate_async(prompt="Write a haiku about inference speed.")
    print(result)
asyncio.run(main())
Streaming:
import asyncio
from fennec_community.llm import MistralInterface
async def stream_response():
    llm = MistralInterface(api_key="...")
    async for token in llm.astream("Describe quantum entanglement"):
        print(token, end="", flush=True)
    print()  # newline at end
asyncio.run(stream_response())
Advanced Usage
Multi-turn conversation (OpenAI/Anthropic/Groq):
messages = [
    {"role": "system", "content": "You are a Python expert."},
    {"role": "user", "content": "What is a decorator?"},
    {"role": "assistant", "content": "A decorator is a function that wraps another function..."},
    {"role": "user", "content": "Show me an example."},
]
response = llm.generate(messages=messages)
Async context manager (automatic cleanup):
async with AnthropicInterface(model_name="claude-sonnet-4-20250514", api_key="...") as llm:
    result = await llm.generate_async(prompt="Summarize the BERT paper.")
# acleanup() called automatically; HTTP client closed
Connection validation at startup:
llm = OpenAIInterface(api_key="sk-...")
status = llm.validate_connection()
if not status["success"]:
    raise RuntimeError(f"LLM unavailable: {status['reason']}")
Local inference with Ollama:
from fennec_community.llm import OllamaInterface
llm = OllamaInterface(model_name="llama3", auto_start=True)
# Server starts automatically if not running
available = llm.list_models()
if "llama3" not in available:
    llm.pull_model("llama3")  # download ~4GB
response = llm.generate("Translate 'hello' to Arabic.")
print(response)
llm.stop_server()  # only if we started it
Provider switching at runtime:
from fennec_community.llm import BaseLLMInterface, OpenAIInterface, AnthropicInterface
def get_llm(provider: str, api_key: str) -> BaseLLMInterface:
    providers = {
        "openai": OpenAIInterface,
        "anthropic": AnthropicInterface,
    }
    cls = providers.get(provider)
    if not cls:
        raise ValueError(f"Unknown provider: {provider}")
    return cls(api_key=api_key)
llm = get_llm("anthropic", api_key="...")
8. Code Examples
Parallel async calls across providers
import asyncio
from fennec_community.llm import OpenAIInterface, AnthropicInterface, GroqInterface
async def compare_providers(prompt: str) -> dict:
    providers = {
        "openai": OpenAIInterface(api_key="sk-..."),
        "anthropic": AnthropicInterface(api_key="sk-ant-..."),
        "groq": GroqInterface(api_key="gsk_..."),
    }
    tasks = {
        name: llm.generate_async(prompt=prompt)
        for name, llm in providers.items()
    }
    # return_exceptions=True keeps one failing provider from cancelling the others
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    return dict(zip(tasks.keys(), results))
responses = asyncio.run(compare_providers("What is RAG?"))
for provider, response in responses.items():
    print(f"\n=== {provider.upper()} ===\n{response}")
FastAPI streaming endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from fennec_community.llm import AnthropicInterface
app = FastAPI()
llm = AnthropicInterface(api_key="sk-ant-...")
@app.get("/stream")
async def stream(prompt: str):
    async def generate():
        async for token in llm.astream(prompt=prompt):
            yield token
    return StreamingResponse(generate(), media_type="text/plain")
Gemini with custom generation parameters
from fennec_community.llm import GeminiInterface
llm = GeminiInterface(model_name="gemini-3-flash-preview", api_key="AIza...")
result = llm.generate(
    prompt="Write a detailed analysis of transformer attention mechanisms.",
    max_tokens=4096,
    temperature=0.2,
    top_p=0.85,
    top_k=40,
)
print(result)
Local Ollama model management
from fennec_community.llm import OllamaInterface
llm = OllamaInterface(
    model_name="phi3",
    base_url="http://localhost:11434",
    auto_start=True,
    server_start_wait=15,
)
print("Available models:", llm.list_models())
if "phi3" not in llm.list_models():
    success = llm.pull_model("phi3")
    if not success:
        raise RuntimeError("Failed to download model.")
response = llm.generate("What is chain-of-thought prompting?")
print(response)
9. Design Decisions & Trade-offs
Adapter over direct SDK usage
Why: Provider SDKs change frequently. Wrapping them in adapters localizes breaking changes to a single file per provider. Callers are insulated.
Trade-off: A thin abstraction layer adds one indirection hop. Its cost is a few extra Python function calls per request, which is negligible relative to network and inference latency even in high-throughput systems.
Module-level config instantiation
Why: llm_config_ = llm_config() at module level avoids re-instantiating the dataclass on every call. It's a minor optimization that also makes the default values visible in IDEs.
Trade-off: Module-level instances are not thread-safe if mutated. Since the config is effectively read-only after import, this is acceptable. If dynamic config changes are needed, pass parameters explicitly per-call.
Dual-input pattern (prompt vs messages)
Why: Simple prompt strings cover 80% of use cases. Exposing messages directly supports multi-turn and system prompt workflows without requiring callers to manually wrap strings in [{"role": "user", "content": ...}].
Trade-off: The Optional[str] / Optional[list[dict]] signature is slightly more complex. A ValueError is raised if both are None, which is a runtime check rather than a type-checked invariant.
Error returns vs raised exceptions
Why: Provider interfaces return {"error": "..."} on failure rather than raising. This makes it easier to handle failures in batch processing pipelines without wrapping every call in try/except.
Trade-off: Callers must check the return type. A function annotated as -> Union[str, dict] is less ergonomic than one that always returns str or always raises on error. Future versions might standardize on typed exceptions.
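Callers who prefer exceptions can convert the error-dict convention at the call site. The helper below is a sketch, not part of the library: unwrap and LLMCallError are hypothetical names, and it assumes the only dict an adapter ever returns is the {"error": ...} shape described above.

```python
from typing import Union


class LLMCallError(RuntimeError):
    """Hypothetical exception raised when an adapter returns {"error": ...}."""


def unwrap(result: Union[str, dict]) -> str:
    """Return the generated text, or raise if the adapter reported an error."""
    if isinstance(result, dict):
        if "error" in result:
            raise LLMCallError(str(result["error"]))
        raise TypeError(f"Unexpected dict result: {result!r}")
    return result


text = unwrap("Paris is the capital of France.")

try:
    unwrap({"error": "rate limited"})
    error_message = None
except LLMCallError as exc:
    error_message = str(exc)

print(text, error_message)
```

With such a wrapper, batch code can keep using the raw return values while request handlers get a single str type to work with.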
Simulated streaming fallback in base class
Why: Gemini's SDK (at the time of writing) does not expose a native streaming API. Providing a simulated word-by-word stream in BaseLLMInterface.astream means all subclasses are streaming-compatible from day one, even if the underlying provider doesn't support it natively.
Trade-off: Simulated streaming is not true token streaming — the full response is buffered first, then split on whitespace. For very long responses this delays the first token. Subword tokens may not align with word boundaries. This is clearly documented in the base class.
Ollama auto-start
Why: Local inference users often forget to start the server. Auto-starting reduces onboarding friction and avoids cryptic connection error messages.
Trade-off: Auto-starting a subprocess from within a library is opinionated behavior. It is therefore opt-out (auto_start=False) and ownership-aware (_we_started_server). The destructor stop is a best-effort safety net, not a guarantee.
10. Extensibility Guide
Adding a New Provider
- Create llm/newprovider_interface.py.
- Subclass BaseLLMInterface.
- Implement generate and generate_async.
- Optionally override astream for native streaming.
- Optionally override acleanup to close clients.
# llm/newprovider_interface.py
import asyncio
from typing import Optional, Union
from fennec_community.llm import BaseLLMInterface
from fennec_community.llm import llm_config
config = llm_config()
class NewProviderInterface(BaseLLMInterface):
    def __init__(self, model_name: str = "default-model", api_key: Optional[str] = None, **kwargs):
        super().__init__(model_name, api_key, **kwargs)
        try:
            import newprovider_sdk
            self._client = newprovider_sdk.Client(api_key=self.api_key)
        except ImportError as exc:
            raise ImportError("pip install newprovider-sdk") from exc
    def generate(self, prompt: str, max_tokens: int = config.max_token,
                 temperature: float = config.temperature, **kwargs) -> Union[str, dict]:
        try:
            response = self._client.complete(prompt, max_tokens=max_tokens, temperature=temperature)
            return response.text
        except Exception as e:
            # Mirror the other adapters: report failures as {"error": ...} instead of raising
            return {"error": str(e)}
    async def generate_async(self, prompt: str, max_tokens: int = config.max_token,
                             temperature: float = config.temperature, **kwargs) -> Union[str, dict]:
        # Run the blocking call in a worker thread so the event loop stays responsive
        return await asyncio.to_thread(self.generate, prompt, max_tokens, temperature, **kwargs)
11. Performance & Scalability
Async Concurrency
For concurrent requests, use asyncio.gather with generate_async. Because each provider interface maintains its own client instances, multiple concurrent calls share the underlying connection pool managed by the SDK (e.g., httpx for OpenAI/Anthropic/Groq).
results = await asyncio.gather(
    llm.generate_async(prompt=p1),
    llm.generate_async(prompt=p2),
    llm.generate_async(prompt=p3),
)
Client Reuse
Constructing an interface object creates SDK clients (and HTTP connection pools) once. Reuse the same interface object across requests rather than constructing a new one per call. For Ollama specifically, this also avoids repeated server health checks.
Ollama Timeout Strategy
OllamaInterface uses a split timeout (connect=10 s, read=200 s). The read timeout applies to each read, not to the whole response: because the implementation streams internally, each received token resets the read clock. Very long local model responses therefore do not time out, as long as consecutive tokens arrive within 200 seconds of each other.
Gemini Rate Limiting
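The 5-second inter-request floor in GeminiInterface serializes requests on a per-instance basis. For higher throughput, instantiate multiple GeminiInterface objects (each with its own rate limiter state) and distribute requests across them, effectively sharding the client-side rate limit. This only relaxes the client-side gate: the provider's own server-side quota still applies, so requests beyond it will hit the 429 retry path.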
Memory Considerations
All adapters except Ollama operate on complete response strings (no generator-based intermediate buffering for non-streaming calls). For very large responses, consider using astream to process tokens incrementally rather than buffering the full output.
Provider-Specific Latency Profile
| Provider | Latency | Throughput | Notes |
|---|---|---|---|
| Groq | Very low (~200ms) | High | Hardware-accelerated inference |
| OpenAI | Low–Medium | High | Varies by model tier |
| Anthropic | Medium | Medium | Higher on complex reasoning |
| Gemini | Medium + rate limit | Limited (free tier) | 5s floor enforced |
| Mistral | Medium | Medium | EU-hosted option available |
| Ollama | Depends on hardware | Unlimited (local) | No API costs, hardware-bound |