Document Loader
Purpose: Unified, extensible document ingestion layer for RAG pipelines and NLP preprocessing
1. High-Level Overview
document_loaders is a module in the fennec_community library that ingests documents from heterogeneous sources (local files, directories, and web URLs) and normalizes them into a single data structure (LoadedDocument) ready for downstream NLP tasks such as vector embedding, chunking, and retrieval-augmented generation (RAG).
Problem It Solves
Real-world AI pipelines need to ingest documents from many formats (PDF, Word, CSV, JSON, HTML, etc.) and sources (disk, web, entire directories). Without a unified abstraction, each format requires bespoke parsing logic with inconsistent metadata, encoding handling, and error behavior. This library provides:
- A single LoadedDocument output type regardless of source
- Deterministic document IDs (SHA-256 content-based hashing) enabling stable deduplication and caching
- A factory entry point (AutoLoader) that eliminates format-detection boilerplate
- A config-driven architecture for fine-grained control without subclassing
Design Philosophy
The library follows the Adapter pattern: each loader is an adapter between a specific format/source and the universal LoadedDocument contract. A layered configuration system (per-loader Config dataclasses + a master LoaderConfig) separates concerns between behavior and identity, keeping loaders testable and composable.
2. Architecture Overview
┌──────────────────────────────────────────────────────────┐
│                        AutoLoader                        │
│       (smart dispatcher - recommended entry point)       │
└────────────┬────────────────┬──────────────┬─────────────┘
             │                │              │
    ┌────────▼───────┐  ┌─────▼─────┐   ┌────▼──────────┐
    │  File Loaders  │  │ WebLoader │   │DirectoryLoader│
    │  Text/MD/PDF/  │  │ MultiURL  │   │ (concurrent)  │
    │ DOCX/CSV/Excel │  └─────┬─────┘   └────┬──────────┘
    │ JSON/JSONL/HTML│        │              │
    └────────┬───────┘        │   dispatches to File Loaders
             │                │
             └────────┬───────┘
                      │
              ┌───────▼────────┐
              │ LoadedDocument │
              │ (output model) │
              └───────┬────────┘
                      │
           ┌──────────▼──────────┐
           │ Downstream Pipeline │
           │ (splitter, embed,   │
           │ vector store, RAG)  │
           └─────────────────────┘
Data Flow
- Input: a file path, URL, or directory path
- Dispatch: AutoLoader inspects the input and selects the correct loader
- Parse: the loader reads the source using the appropriate backend/library
- Normalize: raw content is wrapped in LoadedDocument with rich metadata
- Output: a List[LoadedDocument] is returned to the caller
3. Core Concepts
3.1 LoadedDocument — The Universal Output Unit
Every loader produces LoadedDocument instances, regardless of source format. This contract enables downstream components (chunkers, embedders, vector stores) to operate generically.
@dataclass
class LoadedDocument:
    page_content: str           # The extracted text
    metadata: Dict[str, Any]    # Source, loader, page number, file stats, etc.
    doc_id: Optional[str]       # Deterministic SHA-256 fingerprint
Deterministic doc_id: The ID is computed as SHA-256(source + page_number + content)[:16]. This is critical for caching and deduplication: the same document always produces the same ID across runs.
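The derivation described above can be sketched with the standard library. This is a minimal illustration, not the library's actual code: the helper name `make_doc_id`, the field separator, and the way the `doc_` prefix is attached are assumptions made for the example.

```python
import hashlib
from typing import Optional


def make_doc_id(source: str, content: str, page_number: Optional[int] = None) -> str:
    """Return an ID that depends only on the inputs (hypothetical helper)."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing to the same value.
    page = "" if page_number is None else str(page_number)
    raw = "\x1f".join([source, page, content])
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    # Keep the first 16 hex characters and add the "doc_" prefix (assumed layout).
    return "doc_" + digest[:16]


id_first = make_doc_id("report.pdf", "Quarterly results", page_number=3)
id_again = make_doc_id("report.pdf", "Quarterly results", page_number=3)
id_other = make_doc_id("copy.pdf", "Quarterly results", page_number=3)
```

Calling the helper twice with the same inputs yields the same ID, while changing only the source yields a different one, which is the property that caching and deduplication rely on.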
3.2 Chunking Integration
BaseDocumentLoader.load_and_split(text_splitter) is the integration point for chunking. Any object implementing split_text(str) → List[str] (compatible with LangChain splitters) can be passed. Chunk metadata automatically inherits the parent document's metadata and adds chunk_index, total_chunks, and original_doc_id fields.
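The metadata propagation described above might look roughly like the following. It is a self-contained sketch: `Doc` is a stand-in for LoadedDocument, and `FixedWidthSplitter` and `split_document` are hypothetical names used only to illustrate the `split_text` contract.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for LoadedDocument, reduced to the fields used here."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: str = ""


class FixedWidthSplitter:
    """Toy splitter exposing the split_text(str) -> List[str] interface."""

    def __init__(self, width: int) -> None:
        self.width = width

    def split_text(self, text: str) -> List[str]:
        return [text[i:i + self.width] for i in range(0, len(text), self.width)]


def split_document(doc: Doc, splitter: Any) -> List[Doc]:
    """Split one document and copy its metadata onto every chunk."""
    pieces = splitter.split_text(doc.page_content)
    chunks = []
    for index, piece in enumerate(pieces):
        meta = dict(doc.metadata)  # copy so chunks do not share one dict
        meta.update(
            chunk_index=index,
            total_chunks=len(pieces),
            original_doc_id=doc.doc_id,
        )
        chunks.append(Doc(page_content=piece, metadata=meta))
    return chunks


parent = Doc("abcdefghij", {"source": "notes.txt"}, doc_id="doc_0123456789abcdef")
chunks = split_document(parent, FixedWidthSplitter(4))
```

Because the splitter is only required to expose `split_text`, any object with that method can be swapped in without changing `split_document`.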
3.3 Lazy Loading / Streaming
All loaders implement lazy_load() → Iterator[LoadedDocument]. For large corpora (big CSV files, JSONL datasets, directories), lazy loading avoids loading the entire corpus into memory. By default, lazy_load() delegates to load(), but loaders like CSVLoader, JSONLinesLoader, and DirectoryLoader override it for true streaming.
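The difference between delegating to load() and true streaming can be illustrated with a generator. This sketch uses only the standard csv module; `iter_rows` and `load_rows` are hypothetical names, not methods of the library.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO


def iter_rows(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one CSV row at a time instead of building a full list."""
    for row in csv.DictReader(handle):
        yield dict(row)


def load_rows(handle: TextIO) -> List[Dict[str, str]]:
    """Eager counterpart: materializes every row in memory."""
    return list(iter_rows(handle))


sample = io.StringIO("name,price\nwidget,3\ngadget,5\n")
stream = iter_rows(sample)
first = next(stream)      # pulls a single row from the reader
remaining = list(stream)  # drains whatever is left
```

A caller that processes rows one by one never holds more than the current row, which is the behavior a streaming lazy_load() aims for.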
3.4 Multi-Backend PDF Loading
PDFLoader implements a fallback chain across three backends:
| Priority | Backend | Strength |
|---|---|---|
| 1 | PyMuPDF (fitz) | Fastest, best Arabic/RTL support |
| 2 | pdfplumber | Best table extraction |
| 3 | pypdf | Pure Python, zero native dependencies |
The first available backend is used. This makes the library deployable even in constrained environments where native libraries cannot be installed.
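A fallback chain of this kind can be sketched as a loop over candidate module names that catches ImportError. The function name `pick_backend` is hypothetical; the only library facts assumed are the import names (`fitz` for PyMuPDF, `pdfplumber`, `pypdf`).

```python
import importlib
from typing import Optional, Sequence

# Import names in priority order; PyMuPDF is imported as "fitz".
PDF_BACKENDS = ("fitz", "pdfplumber", "pypdf")


def pick_backend(candidates: Sequence[str] = PDF_BACKENDS) -> Optional[str]:
    """Return the first module name that imports cleanly, or None."""
    for name in candidates:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # not installed: fall through to the next candidate
        return name
    return None


backend = pick_backend()
print(backend or "no PDF backend installed")
```

Recording the chosen name (for example in document metadata) makes a silent downgrade to a slower backend visible to the caller.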
3.5 Concurrent Loading
DirectoryLoader and MultiURLLoader use ThreadPoolExecutor for parallel I/O. MultiURLLoader additionally implements a global rate limiter using threading.Semaphore to enforce minimum inter-request delays across threads — a property that per-thread time.sleep() cannot guarantee.
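One way to picture the global limiter is a small class that serializes only the waiting phase. This is a sketch under assumptions: the class name `GlobalRateLimiter`, its `wait` method, and the `fake_fetch` stand-in are illustrative and do not reproduce MultiURLLoader's internals.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


class GlobalRateLimiter:
    """Enforce a minimum gap between request starts across all threads."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._gate = threading.Semaphore(1)        # one thread at a time in wait()
        self._last_start: Optional[float] = None   # shared across threads

    def wait(self) -> float:
        """Block until the interval has elapsed, then return the start time."""
        with self._gate:
            now = time.monotonic()
            if self._last_start is not None:
                remaining = self.min_interval - (now - self._last_start)
                if remaining > 0:
                    time.sleep(remaining)
            self._last_start = time.monotonic()
            return self._last_start


limiter = GlobalRateLimiter(min_interval=0.05)


def fake_fetch(url: str) -> float:
    started = limiter.wait()  # serialized delay phase
    time.sleep(0.01)          # stands in for the HTTP request, runs in parallel
    return started


urls = [f"https://example.com/{i}" for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    starts: List[float] = sorted(pool.map(fake_fetch, urls))
gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
```

With a per-thread `time.sleep()` all four workers could start at the same instant; here every gap between consecutive starts is at least the configured interval because the timestamp is shared.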
4. Module & Component Breakdown
Base Loader
Purpose: Defines the foundational abstractions all loaders build on.
| Class | Responsibility |
|---|---|
| LoadedDocument | Output model. Holds text, metadata, and deterministic ID. Implements __len__, __eq__, __hash__ for use in sets/dicts. |
| BaseDocumentLoader | Abstract base with load(), lazy_load(), load_and_split(), and _build_metadata(). |
| BaseFileLoader | Extends BaseDocumentLoader with file validation, extension enforcement, and _build_file_metadata() (adds file name, type, size). |
Config Loader
Purpose: Centralizes all configuration as typed dataclasses.
| Class | Controls |
|---|---|
| LoaderType | Enum of all supported source types |
| EXTENSION_MAP | Dict[str, LoaderType]; drives AutoLoader and DirectoryLoader dispatch |
| TextLoaderConfig | Encoding, error handling, auto-detection |
| PDFLoaderConfig | Per-page mode, page range, password, separator |
| DocxLoaderConfig | Tables, headers, footers inclusion |
| CSVLoaderConfig | Delimiter, column selection, row limits |
| JSONLoaderConfig | jq schema, content key, metadata function |
| HTMLLoaderConfig | Parser choice, tag filtering, link/table extraction |
| WebLoaderConfig | Timeout, SSL, retry policy, encoding |
| DirectoryLoaderConfig | Glob pattern, exclusions, concurrency, progress |
| LoaderConfig | Master config composing all sub-configs |
Auto Loader
Purpose: Single entry point for all loading operations.
AutoLoader is a factory class (not a loader itself — it does not inherit BaseDocumentLoader). It inspects the source string in this order:
- URL: starts with http://, https://, or ftp:// → WebLoader
- Directory: Path.is_dir() → DirectoryLoader
- File by extension: looks up EXTENSION_MAP → appropriate file loader
Key methods: AutoLoader.load(source, **kwargs), AutoLoader.get_loader(source, **kwargs), AutoLoader.detect_type(source).
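The inspection order above can be sketched as a small classifier. This is an illustration rather than AutoLoader's code: the `EXTENSIONS` dict is a hypothetical subset of the extension map, and plain strings stand in for LoaderType members.

```python
import tempfile
from pathlib import Path

# Hypothetical subset of the extension map, with strings as type labels.
EXTENSIONS = {".txt": "text", ".md": "markdown", ".pdf": "pdf", ".csv": "csv", ".json": "json"}


def detect_type(source: str) -> str:
    """Classify a source string using the URL -> directory -> extension order."""
    if source.lower().startswith(("http://", "https://", "ftp://")):
        return "web"
    path = Path(source)
    if path.is_dir():
        return "directory"
    return EXTENSIONS.get(path.suffix.lower(), "unknown")


with tempfile.TemporaryDirectory() as tmp_dir:
    candidates = ("https://example.com/a.pdf", tmp_dir, "report.PDF", "archive.xyz")
    kinds = [detect_type(candidate) for candidate in candidates]
```

Checking the URL prefix first matters: a link that ends in `.pdf` is still routed to the web path instead of being treated as a local PDF file.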
Text Loader
Purpose: Load plain text and Markdown files.
- TextLoader: reads .txt/.log/.rst, with chardet-based encoding auto-detection and a latin-1 last-resort fallback.
- MarkdownLoader: reads .md/.markdown, optionally stripping Markdown syntax via regex. Extracts the H1 as title metadata and detects the presence of code blocks.
PDF Loader
Purpose: Load PDF files with multi-backend fallback.
PDFLoader supports per-page splitting (one LoadedDocument per page) or full-document mode. Password-protected PDFs are handled across all three backends. Page range selection (start_page, end_page) allows loading sub-sections of large documents.
DOCX Loader
Purpose: Load Microsoft Word documents.
DocxLoader uses python-docx for .docx files and applies a structured extraction order: headers → paragraphs (with heading level detection) → tables → footers. Heading styles are converted to Markdown-style # prefixes, preserving document structure for downstream chunking.
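The heading conversion can be pictured as a mapping from paragraph style names to `#` prefixes. The sketch below assumes the built-in style names follow the "Heading N" pattern that python-docx exposes for default templates; `to_markdown_line` is a hypothetical helper and the input pairs are made-up sample data.

```python
import re

# Matches built-in heading style names such as "Heading 1" through "Heading 9".
_HEADING_STYLE = re.compile(r"^Heading ([1-9])$")


def to_markdown_line(style_name: str, text: str) -> str:
    """Prefix heading paragraphs with '#' markers; leave other styles unchanged."""
    match = _HEADING_STYLE.match(style_name)
    if match:
        return "#" * int(match.group(1)) + " " + text
    return text


paragraphs = [("Heading 1", "Overview"), ("Normal", "Body text."), ("Heading 2", "Scope")]
lines = [to_markdown_line(style, text) for style, text in paragraphs]
```

Keeping the heading level in the text lets a downstream splitter cut on `#` boundaries without access to the original Word styles.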
For legacy .doc files, a conversion chain is applied: LibreOffice (if installed) → textract → descriptive ImportError with user instructions. The original bug of using PdfReader on a .doc file has been corrected.
CSV Loader
Purpose: Load CSV/TSV and Excel files.
- CSVLoader: streams rows from .csv/.tsv. Each row becomes one LoadedDocument. Supports flexible column selection: content_columns controls which fields form the document text; metadata_columns controls which fields go into metadata. Encoding is resolved via a two-attempt loop ([configured_encoding, "latin-1"]), eliminating the recursive retry bug of the original.
- ExcelLoader: wraps openpyxl for .xlsx/.xlsm and pandas + xlrd for legacy .xls. Supports multi-sheet loading and sheet name validation.
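The two-attempt encoding loop mentioned for CSVLoader can be sketched as a plain `for` over candidate encodings, which avoids recursion entirely. `read_with_fallback` is a hypothetical helper written for this example, not the loader's actual method.

```python
import os
import tempfile
from typing import Tuple


def read_with_fallback(path: str, encoding: str = "utf-8") -> Tuple[str, str]:
    """Try the configured encoding, then latin-1; return (text, encoding_used)."""
    for candidate in (encoding, "latin-1"):
        try:
            with open(path, "r", encoding=candidate) as handle:
                return handle.read(), candidate
        except UnicodeDecodeError:
            continue  # fall through to the next candidate
    # latin-1 maps every byte to a character, so this is only a defensive guard.
    raise ValueError(f"could not decode {path}")


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "wb") as handle:
        handle.write("name\ncafé\n".encode("latin-1"))  # byte 0xE9 is invalid UTF-8 here
    text, used = read_with_fallback(sample_path)
```

The loop always terminates after at most two attempts, whereas a retry implemented by calling the same function again needs extra care to avoid unbounded recursion.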
JSON Loader
Purpose: Load JSON and JSONL files.
- JSONLoader: handles both JSON arrays and objects. Supports jq-style path extraction (via the jq library, with a built-in dot-path fallback). A metadata_func callable allows arbitrary metadata extraction per record.
- JSONLinesLoader: streams JSONL/NDJSON files line by line, supporting max_records limits and skip_invalid for tolerant parsing of noisy datasets (common in ML training data).
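A dot-path fallback of the kind mentioned for JSONLoader could look like the sketch below. It handles only a small jq-like subset (`.key`, `.a.b`, and a `[]` suffix for iterating a list); the function name `extract_path` and the supported syntax are assumptions, since the library's fallback is not shown in this document.

```python
from typing import Any, List


def extract_path(data: Any, path: str) -> List[Any]:
    """Resolve a small jq-like subset: '.key', '.a.b', and a '[]' suffix."""
    results = [data]
    for part in path.strip(".").split("."):
        if not part:
            continue  # empty segment, e.g. when the path is just "."
        expand = part.endswith("[]")
        key = part[:-2] if expand else part
        next_results: List[Any] = []
        for item in results:
            value = item[key] if key else item
            if expand:
                next_results.extend(value)  # iterate the list's elements
            else:
                next_results.append(value)
        results = next_results
    return results


payload = {"items": [{"body": "first"}, {"body": "second"}], "meta": {"lang": "en"}}
bodies = extract_path(payload, ".items[].body")
lang = extract_path(payload, ".meta.lang")
```

Each returned value would then become the content of one document, with metadata_func applied to the corresponding record.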
HTML Loader
Purpose: Load HTML files and parse HTML strings.
- HTMLLoader: reads .html/.htm files, removes noise elements (scripts, styles, nav, footer, header), extracts page metadata (title, OG tags, language), and optionally extracts tables as separate documents.
- HTMLStringLoader: applies the same parsing logic to an in-memory HTML string. It uses the internal _HTMLParser helper, which bypasses file validation so the parsing logic can be reused without a file.
Web Loader
Purpose: Fetch and load web content.
- WebLoader: fetches a single URL using requests with exponential-backoff retry (delay * 2^attempt). Delegates HTML parsing to HTMLStringLoader. Injects url and source metadata.
- MultiURLLoader: fetches multiple URLs using ThreadPoolExecutor. A threading.Semaphore(1) with a shared last_request_time enforces global rate limiting between fetches.
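The `delay * 2^attempt` retry schedule can be sketched independently of any HTTP client. In this illustration `retry_with_backoff` and `flaky` are hypothetical names, and the `sleep` parameter is injectable so the schedule can be observed without actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action; after each failure wait retry_delay * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(retry_delay * 2 ** attempt)
    raise ValueError("max_retries must be non-negative")


waits: List[float] = []
calls = {"count": 0}


def flaky() -> str:
    """Fail twice, then succeed, to exercise the retry path."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky, max_retries=3, retry_delay=0.5, sleep=waits.append)
```

With a base delay of 0.5 seconds the waits recorded are 0.5 and then 1.0 seconds before the third call succeeds; a production version would normally catch only network-related exceptions.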
Directory Loader
Purpose: Recursively load an entire directory.
DirectoryLoader collects files matching a glob pattern, filters by supported extensions and exclusion patterns, then dispatches each file to the appropriate loader. Supports both sequential and multithreaded loading. The get_stats() method returns a breakdown by file type without loading any content.
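A statistics pass that never opens file contents can be built from filesystem metadata alone. The sketch below mirrors the result keys shown for get_stats() elsewhere in this document (`total_files`, `by_type`, `total_size_mb`); the function name `directory_stats` and the use of file extensions as `by_type` keys are assumptions.

```python
import tempfile
from collections import Counter
from pathlib import Path
from typing import Any, Dict


def directory_stats(root: str, glob_pattern: str = "**/*") -> Dict[str, Any]:
    """Summarize files under root by extension using only stat() metadata."""
    by_type: Counter = Counter()
    total_bytes = 0
    for path in Path(root).glob(glob_pattern):
        if path.is_file():
            by_type[path.suffix.lower() or "(none)"] += 1
            total_bytes += path.stat().st_size
    return {
        "total_files": sum(by_type.values()),
        "by_type": dict(by_type),
        "total_size_mb": round(total_bytes / (1024 * 1024), 3),
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "a.txt").write_text("hello")
    Path(tmp_dir, "sub").mkdir()
    Path(tmp_dir, "sub", "b.txt").write_text("world")
    Path(tmp_dir, "sub", "c.md").write_text("# title")
    stats = directory_stats(tmp_dir)
```

Running such a pass before load() gives a cheap estimate of how much work a directory represents.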
5. API / Public Interfaces
AutoLoader
# Load from any source
docs: List[LoadedDocument] = AutoLoader.load(source: str, **kwargs)
# Get the loader without executing it
loader: BaseDocumentLoader = AutoLoader.get_loader(source: str, **kwargs)
# Detect source type without loading
source_type: str = AutoLoader.detect_type(source: str)
# Returns: "text", "pdf", "csv", "web", "directory", "unknown", etc.
LoadedDocument
doc.page_content  # str: extracted text
doc.metadata      # Dict[str, Any]: source, file_name, loader_type, loaded_at, etc.
doc.doc_id        # str: deterministic "doc_" + SHA-256[:16]
doc.to_dict()     # Dict with all three fields
len(doc)          # int: character count
doc1 == doc2      # bool: based on doc_id
hash(doc)         # int: for set/dict usage
BaseDocumentLoader
loader.load() -> List[LoadedDocument]
loader.lazy_load() -> Iterator[LoadedDocument]
loader.load_and_split(text_splitter=None) -> List[LoadedDocument]
PDFLoader
PDFLoader(
    file_path: str,
    config: Optional[PDFLoaderConfig] = None,
    per_page: bool = True,  # One document per page
    password: Optional[str] = None,
)
CSVLoader
CSVLoader(
    file_path: str,
    content_columns: Optional[List[str]] = None,  # None = all columns
    metadata_columns: Optional[List[str]] = None,
    source_column: Optional[str] = None,
    encoding: str = "utf-8",
    delimiter: str = ",",
)
JSONLoader
JSONLoader(
    file_path: str,
    content_key: Optional[str] = None,  # e.g., "body", "text"
    metadata_func: Optional[Callable[[Dict], Dict]] = None,
    jq_schema: Optional[str] = None,  # e.g., ".items[]"
    encoding: str = "utf-8",
)
WebLoader
WebLoader(
    url: str,
    config: Optional[WebLoaderConfig] = None,
    headers: Optional[Dict[str, str]] = None,
    timeout: int = 10,
)
DirectoryLoader
DirectoryLoader(
    path: str,
    config: Optional[DirectoryLoaderConfig] = None,
    glob_pattern: str = "**/*",
    recursive: bool = True,
    silent_errors: bool = False,
    show_progress: bool = True,
)
loader.get_stats() -> Dict  # {"total_files": N, "by_type": {...}, "total_size_mb": X}
6. Configuration System
Per-Loader Configs
Each loader accepts an optional typed config dataclass. When not provided, a default instance is created from keyword arguments.
from fennec_community.document_loaders import PDFLoaderConfig, PDFLoader
config = PDFLoaderConfig(
    per_page=True,
    start_page=0,
    end_page=50,
    password="secret",
    page_separator="\n---\n",
)
loader = PDFLoader("report.pdf", config=config)
Master Config
LoaderConfig composes all sub-configs for scenarios requiring system-wide configuration:
from fennec_community.document_loaders import LoaderConfig
master = LoaderConfig()
master.pdf.per_page = True
master.web.timeout = 30
master.web.max_retries = 5
master.directory.max_concurrency = 8
master.directory.silent_errors = True
Extension Map
EXTENSION_MAP is a public Dict[str, LoaderType] that controls how AutoLoader and DirectoryLoader resolve file types. It can be inspected to discover supported extensions:
from fennec_community.document_loaders import EXTENSION_MAP
print(sorted(EXTENSION_MAP.keys()))
# ['.csv', '.doc', '.docx', '.htm', '.html', '.json', '.jsonl', '.md', ...]
Key Config Options by Loader
| Loader | Key Options |
|---|---|
| TextLoader | autodetect_encoding, errors ("strict"/"ignore"/"replace") |
| PDFLoader | per_page, start_page, end_page, password, page_separator |
| DocxLoader | include_tables, include_headers, include_footers |
| CSVLoader | content_columns, metadata_columns, source_column, max_rows, skip_rows |
| JSONLoader | content_key, jq_schema, metadata_func, text_content |
| HTMLLoader | parser, tags_to_remove, extract_links, extract_tables |
| WebLoader | timeout, max_retries, retry_delay, verify_ssl, headers |
| DirectoryLoader | glob_pattern, exclude_patterns, recursive, max_concurrency, silent_errors |
7. Usage Guide
Quick Start
from fennec_community.document_loaders import AutoLoader
# File
docs = AutoLoader.load("report.pdf")
# Web URL
docs = AutoLoader.load("https://docs.python.org/3/library/pathlib.html")
# Directory
docs = AutoLoader.load("./knowledge_base/")
print(f"Loaded {len(docs)} documents")
print(docs[0].page_content[:200])
print(docs[0].metadata)
Basic Usage: Specific Loaders
from fennec_community.document_loaders import PDFLoader, CSVLoader, JSONLoader
# PDF: per-page loading
pdf_docs = PDFLoader("annual_report.pdf", per_page=True).load()
# CSV: select specific columns
csv_docs = CSVLoader(
    "products.csv",
    content_columns=["name", "description"],
    metadata_columns=["sku", "category"],
).load()
# JSON: extract specific field
json_docs = JSONLoader(
    "articles.json",
    content_key="body",
    metadata_func=lambda r: {"title": r.get("title"), "author": r.get("author")},
).load()
Advanced Usage: Chunking Pipeline
from fennec_community.document_loaders import AutoLoader
from fennec_community.chunks import TokenTextSplitter
# Any LangChain-compatible splitter can be used instead, for example:
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
splitter = TokenTextSplitter()
loader = AutoLoader.get_loader("deep_learning.pdf")
chunks = loader.load_and_split(text_splitter=splitter)
print(f"{len(chunks)} chunks")
print(chunks[0].metadata)
# {'source': '...', 'chunk_index': 0, 'total_chunks': 14, 'original_doc_id': 'doc_...'}
Advanced Usage: Concurrent Web Scraping
from fennec_community.document_loaders import MultiURLLoader, WebLoaderConfig
config = WebLoaderConfig(timeout=15, max_retries=3, verify_ssl=True)
urls = [
    "https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)",
    "https://en.wikipedia.org/wiki/BERT_(language_model)",
    "https://en.wikipedia.org/wiki/GPT-4",
]
loader = MultiURLLoader(
    urls=urls,
    config=config,
    max_workers=3,
    delay_between_requests=1.0,
    continue_on_error=True,
)
docs = loader.load()
Advanced Usage: Directory with Stats
from fennec_community.document_loaders import DirectoryLoader, DirectoryLoaderConfig
config = DirectoryLoaderConfig(
    glob_pattern="**/*.pdf",
    recursive=True,
    silent_errors=True,
    max_concurrency=6,
    show_progress=True,
)
loader = DirectoryLoader("./documents", config=config)
# Inspect before loading
stats = loader.get_stats()
print(f"Total: {stats['total_size_mb']} MB across {stats['total_files']} files")
# Load
docs = loader.load()
Advanced Usage: Lazy Loading a Large JSONL Dataset
from fennec_community.document_loaders import JSONLinesLoader
loader = JSONLinesLoader(
    "dataset.jsonl",
    content_key="text",
    metadata_func=lambda r: {"label": r.get("label"), "id": r.get("id")},
    max_records=10_000,
    skip_invalid=True,
)
# Process without loading everything into memory
for doc in loader.lazy_load():
    embed_and_store(doc)  # your indexing function
8. Code Examples
Deduplication Using Set Semantics
from fennec_community.document_loaders import AutoLoader
docs_a = AutoLoader.load("./folder_a/")
docs_b = AutoLoader.load("./folder_b/")
unique_docs = list(set(docs_a + docs_b))
print(f"Unique documents: {len(unique_docs)}")
Building a RAG Corpus from Mixed Sources
from fennec_community.document_loaders import AutoLoader, MultiURLLoader
sources = [
    "research_papers/",  # local directory
    "notes.md",          # markdown file
    "data/faq.json",     # JSON dataset
]
all_docs = []
for source in sources:
    all_docs.extend(AutoLoader.load(source))
web_loader = MultiURLLoader(["https://company.com/blog", "https://docs.company.com"])
all_docs.extend(web_loader.load())
print(f"Total corpus: {len(all_docs)} documents")
print(f"Total characters: {sum(len(d) for d in all_docs):,}")
Custom Metadata Injection
from fennec_community.document_loaders import PDFLoader
loader = PDFLoader("legal_contract.pdf", per_page=True)
docs = loader.load()
# Enrich metadata post-load
for doc in docs:
    doc.metadata["project"] = "contract_2024"
    doc.metadata["confidential"] = True
Inspecting the EXTENSION_MAP
from fennec_community.document_loaders import EXTENSION_MAP, LoaderType
pdf_extensions = [ext for ext, t in EXTENSION_MAP.items() if t == LoaderType.PDF]
print(pdf_extensions)  # ['.pdf']
9. Design Decisions & Trade-offs
Content-Addressed Document IDs
Decision: doc_id is a SHA-256 hash of (source + page_number + content), not a timestamp or UUID.
Rationale: Time-based IDs break caching and deduplication — loading the same file twice would yield different IDs. Content-based IDs are idempotent: re-indexing a corpus skips unchanged documents without additional bookkeeping.
Trade-off: Two documents with identical content from different sources get different IDs (the source is part of the hash). Purely content-identical documents (same source and page) do collide — which is the intended behavior.
Fallback Chain for PDF Backends
Decision: Try PyMuPDF → pdfplumber → pypdf in order, catching ImportError at each step.
Rationale: Different environments have different native library constraints. A data science environment likely has PyMuPDF; a serverless function might only have pure-Python pypdf. The library works in either context without configuration, using whichever of the three backends is installed.
Trade-off: The best backend (PyMuPDF) may not be available, silently degrading to a slower one. The backend key in document metadata reveals which was used.
Config Dataclasses over **kwargs
Decision: All configuration is expressed as typed @dataclass instances.
Rationale: Typed configs make IDEs provide autocompletion, make configuration serializable/inspectable, and make the public API self-documenting. Arbitrary **kwargs would hide configuration options.
Trade-off: Adding a new config option requires updating the dataclass. This is a minor friction compared to the discoverability benefits.
HTMLStringLoader + _HTMLParser Internal Pattern
Decision: HTMLStringLoader delegates to an internal _HTMLParser subclass that bypasses file validation.
Rationale: WebLoader needs to parse HTML strings (not files), but the parsing logic in HTMLLoader is non-trivial. Duplicating it would violate DRY. The _HTMLParser internal class allows reuse while keeping the public API clean.
Trade-off: _HTMLParser is a semi-public internal class. Subclassing HTMLLoader while skipping __init__ validation is a fragile pattern that could break if HTMLLoader.__init__ changes significantly.
Global Rate Limiting in MultiURLLoader
Decision: A threading.Semaphore(1) with a shared last_time dict enforces inter-request delays globally, not per-thread.
Rationale: Per-thread delays are ineffective when threads run concurrently — all threads could fire simultaneously with no inter-request spacing. The semaphore serializes the delay phase while allowing the actual HTTP fetch to run in parallel.
Trade-off: The semaphore briefly serializes the start of each request. For very fast servers, this may underutilize parallelism. A more sophisticated token-bucket rate limiter would be better for high-throughput scenarios.
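For comparison, the token-bucket alternative mentioned in the trade-off can be sketched as follows. This is not part of the library: `TokenBucket`, `try_acquire`, and the injectable `clock` are hypothetical, and the fake clock in the usage lines exists only to make the behavior easy to follow.

```python
import threading
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()
        self._lock = threading.Lock()  # protects the token count across threads

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self._updated
            self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
            self._updated = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False


fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]  # two tokens available, third refused
fake_now[0] += 0.5                                # half a second refills one token at 2/s
after_wait = bucket.try_acquire()
```

Unlike the fixed inter-request delay, a bucket lets several workers start immediately when the server has been idle, while still bounding the long-run request rate.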
10. Extensibility Guide
Adding a New File Loader
- Create your_loader.py inheriting BaseFileLoader:
from fennec_community.document_loaders import BaseFileLoader, LoadedDocument
from typing import List
class YourFormatLoader(BaseFileLoader):
    SUPPORTED_EXTENSIONS = [".yourext"]
    def load(self) -> List[LoadedDocument]:
        # Parse self.file_path
        text = your_parsing_logic(str(self.file_path))
        meta = self._build_file_metadata(custom_key="value")
        return [LoadedDocument(page_content=text, metadata=meta)]
- Register the extension in config_loader.py:
class LoaderType(Enum):
    YOUR_FORMAT = "yourformat"
EXTENSION_MAP[".yourext"] = LoaderType.YOUR_FORMAT
- Add dispatch logic in AutoLoader._create_file_loader and DirectoryLoader._get_loader_for_file.
- Export from __init__.py.
Adding a New Config
Add a dataclass to config_loader.py and a field to LoaderConfig:
from dataclasses import dataclass, field
@dataclass
class YourFormatLoaderConfig:
    option_a: bool = True
    option_b: str = "default"
@dataclass
class LoaderConfig:
    # ... existing fields ...
    your_format: YourFormatLoaderConfig = field(default_factory=YourFormatLoaderConfig)
Adding a Non-File Source Loader
Inherit directly from BaseDocumentLoader:
from fennec_community.document_loaders import BaseDocumentLoader, LoadedDocument
from typing import List
class DatabaseLoader(BaseDocumentLoader):
    def __init__(self, connection_string: str, query: str):
        super().__init__()
        self.connection_string = connection_string
        self.query = query
    def load(self) -> List[LoadedDocument]:
        # fetch rows, convert to LoadedDocument
        ...
Custom Text Splitter Integration
Any object with a split_text(text: str) -> List[str] method works with load_and_split():
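import re
from typing import List
from fennec_community.document_loaders import PDFLoader
class SentenceSplitter:
    def split_text(self, text: str) -> List[str]:
        # Split after sentence-ending punctuation followed by whitespace
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
docs = PDFLoader("paper.pdf").load_and_split(SentenceSplitter())
11. Performance & Scalability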
Memory
- Lazy loading: CSVLoader, JSONLinesLoader, and DirectoryLoader implement lazy_load() as true streaming. For large files, prefer lazy_load() over load() when processing documents sequentially.
- PDF per-page mode (per_page=True) prevents the entire PDF text from being held in memory as a single string when page-level processing is sufficient.
- ExcelLoader uses openpyxl with read_only=True, which streams the workbook rather than loading the entire file into memory.
Concurrency
- DirectoryLoader: controlled by max_concurrency (default: 4 threads). Set higher for I/O-bound loads from fast storage; the GIL is not a bottleneck for I/O operations.
- MultiURLLoader: controlled by max_workers (default: 4 threads) with a global semaphore-based rate limiter. Do not set max_workers above the rate-limit ceiling of the target server.
Async
The library is synchronous and thread-based. For integration with async frameworks (FastAPI, asyncio pipelines), wrap calls in run_in_executor() on the running event loop:
import asyncio
from fennec_community.document_loaders import AutoLoader
async def load_async(source: str):
    # Inside a coroutine, get_running_loop() is preferred over get_event_loop()
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool executor
    return await loop.run_in_executor(None, AutoLoader.load, source)
Caching
Because doc_id is deterministic, a simple content-addressed cache layer can be built on top:
import shelve
from fennec_community.document_loaders import AutoLoader, LoadedDocument
def cached_load(source: str, cache_path="doc_cache") -> list:
    docs = AutoLoader.load(source)
    with shelve.open(cache_path) as cache:
        result = []
        for doc in docs:
            if doc.doc_id not in cache:
                cache[doc.doc_id] = doc
            result.append(cache[doc.doc_id])
    return result
Optional Dependencies
The library uses optional dependencies to avoid forcing heavy installs on minimal environments. Import errors are caught and surfaced as descriptive ImportError messages rather than crashes.
| Feature | Required Package |
|---|---|
| PDF (fast) | PyMuPDF (pip install pymupdf) |
| PDF (tables) | pdfplumber |
| PDF (fallback) | pypdf |
| Word documents | python-docx |
| Excel | openpyxl (.xlsx), xlrd (.xls) |
| HTML/Web | beautifulsoup4, requests |
| JSON path | jq (optional; built-in fallback available) |
| Encoding detection | chardet (optional; latin-1 fallback available) |
| Legacy .doc | LibreOffice (system) or textract |