Fennec Community community/document_loaders.md

Document Loader

Purpose: Unified, extensible document ingestion layer for RAG pipelines and NLP preprocessing


1. High-Level Overview

document_loaders is a module in the fennec_community library that ingests documents from heterogeneous sources (local files, directories, and web URLs) and normalizes them into a single data structure (LoadedDocument) ready for downstream NLP tasks such as vector embedding, chunking, and retrieval-augmented generation (RAG).

Problem It Solves

Real-world AI pipelines need to ingest documents from many formats (PDF, Word, CSV, JSON, HTML, etc.) and sources (disk, web, entire directories). Without a unified abstraction, each format requires bespoke parsing logic with inconsistent metadata, encoding handling, and error behavior. This library provides:

  • A single LoadedDocument output type regardless of source
  • Deterministic document IDs (SHA-256 content-based hashing) enabling stable deduplication and caching
  • A factory entry point (AutoLoader) that eliminates format-detection boilerplate
  • A config-driven architecture for fine-grained control without subclassing

Design Philosophy

The library follows the Adapter pattern: each loader is an adapter between a specific format/source and the universal LoadedDocument contract. A layered configuration system (per-loader Config dataclasses + a master LoaderConfig) separates concerns between behavior and identity, keeping loaders testable and composable.


2. Architecture Overview

┌──────────────────────────────────────────────────────────┐
│                        AutoLoader                        │
│         (smart dispatcher — recommended entry point)     │
└────────────┬────────────────┬──────────────┬─────────────┘
             │                │              │
    ┌────────▼───────┐  ┌─────▼─────┐  ┌────▼──────────┐
    │  File Loaders  │  │ WebLoader │  │DirectoryLoader│
    │ Text/MD/PDF/   │  │MultiURL   │  │(concurrent)   │
    │ DOCX/CSV/Excel │  └─────┬─────┘  └────┬──────────┘
    │ JSON/JSONL/HTML│        │              │
    └────────┬───────┘        │    dispatches to File Loaders
             │                │
             └────────┬───────┘
                      │
              ┌───────▼────────┐
              │ LoadedDocument │
              │ (output model) │
              └───────┬────────┘
                      │
           ┌──────────▼──────────┐
           │  Downstream Pipeline │
           │  (splitter, embed,  │
           │   vector store, RAG) │
           └──────────────────────┘

Data Flow

  1. Input: a file path, URL, or directory path
  2. Dispatch: AutoLoader inspects the input and selects the correct loader
  3. Parse: the loader reads the source using the appropriate backend/library
  4. Normalize: raw content is wrapped in LoadedDocument with rich metadata
  5. Output: a List[LoadedDocument] is returned to the caller

3. Core Concepts

3.1 LoadedDocument — The Universal Output Unit

Every loader produces LoadedDocument instances, regardless of source format. This contract enables downstream components (chunkers, embedders, vector stores) to operate generically.

from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class LoadedDocument:
    page_content: str          # The extracted text
    metadata: Dict[str, Any]   # Source, loader, page number, file stats, etc.
    doc_id: Optional[str]      # Deterministic SHA-256 fingerprint

Deterministic doc_id: The ID is computed as SHA-256(source + page_number + content)[:16]. This is critical for caching and deduplication — the same document always produces the same ID across runs.
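The hashing scheme can be sketched with the standard library. This is a minimal illustration, not the library's code: the helper name make_doc_id, the "|" separator between fields, and the use of hex digits are assumptions; only the hashed inputs, the 16-character truncation, and the doc_ prefix come from this document.

```python
import hashlib
from typing import Optional


def make_doc_id(source: str, content: str, page_number: Optional[int] = None) -> str:
    """Hypothetical helper: derive a stable ID from source, page number, and content."""
    # Assumption: fields are joined with "|" so that ("ab", "c") and ("a", "bc") differ.
    raw = f"{source}|{page_number}|{content}"
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return "doc_" + digest[:16]


first = make_doc_id("report.pdf", "Quarterly results", page_number=3)
second = make_doc_id("report.pdf", "Quarterly results", page_number=3)
other = make_doc_id("copy/report.pdf", "Quarterly results", page_number=3)
print(first == second)  # True: same inputs give the same ID
print(first == other)   # False: a different source gives a different ID
```

Because nothing time-dependent enters the hash, the same call on a later run yields the same string.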

3.2 Chunking Integration

BaseDocumentLoader.load_and_split(text_splitter) is the integration point for chunking. Any object implementing split_text(str) → List[str] (compatible with LangChain splitters) can be passed. Chunk metadata automatically inherits the parent document's metadata and adds chunk_index, total_chunks, and original_doc_id fields.
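The metadata inheritance described above might look roughly like the following. Doc is a stand-in for LoadedDocument, and FixedWidthSplitter and split_document are hypothetical names used only for illustration; the three chunk fields are the ones named in this section.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for LoadedDocument, reduced to the fields used here."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: str = ""


class FixedWidthSplitter:
    """Toy splitter exposing the split_text(str) -> List[str] interface."""

    def __init__(self, width: int) -> None:
        self.width = width

    def split_text(self, text: str) -> List[str]:
        return [text[i:i + self.width] for i in range(0, len(text), self.width)]


def split_document(doc: Doc, splitter) -> List[Doc]:
    pieces = splitter.split_text(doc.page_content)
    chunks = []
    for index, piece in enumerate(pieces):
        # Copy the parent metadata, then add the chunk-specific fields.
        meta = dict(doc.metadata)
        meta.update(chunk_index=index, total_chunks=len(pieces), original_doc_id=doc.doc_id)
        chunks.append(Doc(page_content=piece, metadata=meta))
    return chunks


parent = Doc("abcdefghij", {"source": "notes.txt"}, doc_id="doc_0123456789abcdef")
chunks = split_document(parent, FixedWidthSplitter(width=4))
print([c.page_content for c in chunks])  # ['abcd', 'efgh', 'ij']
print(chunks[2].metadata["chunk_index"], chunks[2].metadata["total_chunks"])  # 2 3
```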

3.3 Lazy Loading / Streaming

All loaders implement lazy_load() → Iterator[LoadedDocument]. For large corpora (big CSV files, JSONL datasets, directories), lazy loading avoids loading the entire corpus into memory. By default, lazy_load() delegates to load(), but loaders like CSVLoader, JSONLinesLoader, and DirectoryLoader override it for true streaming.
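The relationship between the default lazy_load() and a streaming override can be sketched as follows. BaseLoader and LineLoader are simplified stand-ins, and they yield plain strings rather than LoadedDocument objects to keep the example short.

```python
from typing import Iterable, Iterator, List


class BaseLoader:
    """Stand-in base class: lazy_load() falls back to load() unless overridden."""

    def load(self) -> List[str]:
        raise NotImplementedError

    def lazy_load(self) -> Iterator[str]:
        # Default behavior: materialize everything, then iterate.
        yield from self.load()


class LineLoader(BaseLoader):
    """Streams one item per non-empty line instead of building a full list first."""

    def __init__(self, lines: Iterable[str]) -> None:
        self.lines = lines  # any iterable of strings, for example an open file

    def lazy_load(self) -> Iterator[str]:
        for line in self.lines:
            if line.strip():
                yield line.strip()

    def load(self) -> List[str]:
        return list(self.lazy_load())


loader = LineLoader(iter(["first\n", "\n", "second\n", "third\n"]))
stream = loader.lazy_load()
print(next(stream))  # first
print(next(stream))  # second
```

Only the lines consumed so far have been read from the underlying iterator, which is the property that matters for large inputs.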

3.4 Multi-Backend PDF Loading

PDFLoader implements a fallback chain across three backends:

| Priority | Backend | Strength |
|---|---|---|
| 1 | PyMuPDF (fitz) | Fastest, best Arabic/RTL support |
| 2 | pdfplumber | Best table extraction |
| 3 | pypdf | Pure Python, zero native dependencies |

The first available backend is used. This makes the library deployable even in constrained environments where native libraries cannot be installed.
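One way to pick the first installed backend is to attempt each import in priority order. The sketch below shows that idea only; first_available is a hypothetical helper, and how PDFLoader performs the selection internally is not shown in this document.

```python
import importlib
from typing import Optional, Sequence


def first_available(modules: Sequence[str]) -> Optional[str]:
    """Return the first module name that imports cleanly, or None if none do."""
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # backend not installed: try the next one
        return name
    return None


# Import names in priority order; PyMuPDF is imported as "fitz".
PDF_BACKENDS = ["fitz", "pdfplumber", "pypdf"]
backend = first_available(PDF_BACKENDS)
print(backend or "no PDF backend installed")
```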

3.5 Concurrent Loading

DirectoryLoader and MultiURLLoader use ThreadPoolExecutor for parallel I/O. MultiURLLoader additionally implements a global rate limiter using threading.Semaphore to enforce minimum inter-request delays across threads — a property that per-thread time.sleep() cannot guarantee.
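A global limiter of this kind can be sketched as below. GlobalRateLimiter and fetch are illustrative names, and the HTTP request is replaced by a short sleep; the point is that the spacing decision happens inside a single shared gate while the simulated fetch runs outside it.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


class GlobalRateLimiter:
    """Enforce a minimum gap between request starts across all threads."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._gate = threading.Semaphore(1)  # one thread at a time passes the gate
        self._last_request_time: Optional[float] = None
        self.start_times: List[float] = []   # recorded for the demonstration below

    def wait(self) -> None:
        with self._gate:
            if self._last_request_time is not None:
                remaining = self.min_interval - (time.monotonic() - self._last_request_time)
                if remaining > 0:
                    time.sleep(remaining)
            self._last_request_time = time.monotonic()
            self.start_times.append(self._last_request_time)


def fetch(limiter: GlobalRateLimiter, url: str) -> str:
    limiter.wait()     # serialized: spacing is decided here
    time.sleep(0.01)   # stand-in for the HTTP request, which runs in parallel
    return url


limiter = GlobalRateLimiter(min_interval=0.05)
urls = [f"https://example.com/{i}" for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda u: fetch(limiter, u), urls))

gaps = [b - a for a, b in zip(limiter.start_times, limiter.start_times[1:])]
print([round(gap, 3) for gap in gaps])
```

With a per-thread time.sleep() instead, all four workers would start at nearly the same instant, so the gaps would be close to zero.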


4. Module & Component Breakdown

Base Loader

Purpose: Defines the foundational abstractions all loaders build on.

| Class | Responsibility |
|---|---|
| LoadedDocument | Output model. Holds text, metadata, and deterministic ID. Implements __len__, __eq__, __hash__ for use in sets/dicts. |
| BaseDocumentLoader | Abstract base with load(), lazy_load(), load_and_split(), and _build_metadata(). |
| BaseFileLoader | Extends BaseDocumentLoader with file validation, extension enforcement, and _build_file_metadata() (adds file name, type, size). |
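The __len__, __eq__, and __hash__ behavior listed for LoadedDocument can be illustrated with a small stand-in class. Doc below is an assumption-laden sketch, not the library's class; it only shows identity being defined by doc_id.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(eq=False)  # eq=False: rely on the hand-written __eq__ and __hash__ below
class Doc:
    """Stand-in for LoadedDocument with identity defined by doc_id."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    doc_id: str = ""

    def __len__(self) -> int:
        return len(self.page_content)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Doc) and self.doc_id == other.doc_id

    def __hash__(self) -> int:
        return hash(self.doc_id)


a = Doc("hello", {"loaded_at": "run 1"}, doc_id="doc_aaaa")
b = Doc("hello", {"loaded_at": "run 2"}, doc_id="doc_aaaa")
c = Doc("world", doc_id="doc_bbbb")
print(len({a, b, c}))  # 2
print(len(a))          # 5
```

Metadata that varies between runs, such as a load timestamp, does not affect equality here because only doc_id is compared.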

Config Loader

Purpose: Centralizes all configuration as typed dataclasses.

| Class | Controls |
|---|---|
| LoaderType | Enum of all supported source types |
| EXTENSION_MAP | Dict[str, LoaderType]; drives AutoLoader and DirectoryLoader dispatch |
| TextLoaderConfig | Encoding, error handling, auto-detection |
| PDFLoaderConfig | Per-page mode, page range, password, separator |
| DocxLoaderConfig | Tables, headers, footers inclusion |
| CSVLoaderConfig | Delimiter, column selection, row limits |
| JSONLoaderConfig | jq schema, content key, metadata function |
| HTMLLoaderConfig | Parser choice, tag filtering, link/table extraction |
| WebLoaderConfig | Timeout, SSL, retry policy, encoding |
| DirectoryLoaderConfig | Glob pattern, exclusions, concurrency, progress |
| LoaderConfig | Master config composing all sub-configs |

Auto Loader

Purpose: Single entry point for all loading operations.

AutoLoader is a factory class (not a loader itself — it does not inherit BaseDocumentLoader). It inspects the source string in this order:

  1. URL: starts with http://, https://, or ftp:// → WebLoader
  2. Directory: Path.is_dir() → DirectoryLoader
  3. File by extension: looks up EXTENSION_MAP → appropriate file loader
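The inspection order can be sketched as a small function. The EXTENSIONS dictionary is a hypothetical subset with string values (the real EXTENSION_MAP holds LoaderType members), and the case-insensitive suffix match is an assumption.

```python
from pathlib import Path

# Hypothetical subset of the extension map, with plain strings as values.
EXTENSIONS = {".txt": "text", ".md": "markdown", ".pdf": "pdf", ".csv": "csv", ".json": "json"}
URL_PREFIXES = ("http://", "https://", "ftp://")


def detect_type(source: str) -> str:
    if source.startswith(URL_PREFIXES):
        return "web"                       # 1. URL check comes first
    path = Path(source)
    if path.is_dir():
        return "directory"                 # 2. existing directory
    # 3. extension lookup (assumption: suffixes are matched case-insensitively)
    return EXTENSIONS.get(path.suffix.lower(), "unknown")


print(detect_type("https://example.com/page.pdf"))  # web
print(detect_type("."))                             # directory
print(detect_type("report.PDF"))                    # pdf
print(detect_type("archive.xyz"))                   # unknown
```

Checking the URL prefix first matters: a URL ending in .pdf would otherwise be routed to a file loader.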

Key methods: AutoLoader.load(source, **kwargs), AutoLoader.get_loader(source, **kwargs), AutoLoader.detect_type(source).

Text Loader

Purpose: Load plain text and Markdown files.

  • TextLoader — reads .txt/.log/.rst, with chardet-based encoding auto-detection and latin-1 last-resort fallback.
  • MarkdownLoader — reads .md/.markdown, optionally stripping Markdown syntax via regex. Extracts H1 as title metadata. Detects code block presence.
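A rough idea of the Markdown handling, under stated assumptions: describe_markdown and strip_markdown are hypothetical helpers, the metadata key names are invented for the example, and the regular expressions cover only a few common constructs rather than the full syntax.

```python
import re
from typing import Any, Dict

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed


def describe_markdown(text: str) -> Dict[str, Any]:
    """Extract the first H1 as the title and note whether fenced code is present."""
    heading = re.search(r"^#[ \t]+(.+)$", text, flags=re.MULTILINE)
    return {
        "title": heading.group(1).strip() if heading else None,
        "has_code_blocks": FENCE in text,
    }


def strip_markdown(text: str) -> str:
    """Remove a few common markers; an approximation, not a full parser."""
    text = re.sub(r"^#{1,6}[ \t]+", "", text, flags=re.MULTILINE)  # heading marks
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)                   # bold
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)           # links keep their label
    return text


sample = "# Setup Guide\n\nInstall the **latest** build from [the site](https://example.com).\n"
print(describe_markdown(sample))  # {'title': 'Setup Guide', 'has_code_blocks': False}
print(strip_markdown(sample))
```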

PDF Loader

Purpose: Load PDF files with multi-backend fallback.

PDFLoader supports per-page splitting (one LoadedDocument per page) or full-document mode. Password-protected PDFs are handled across all three backends. Page range selection (start_page, end_page) allows loading sub-sections of large documents.

DOCX Loader

Purpose: Load Microsoft Word documents.

DocxLoader uses python-docx for .docx files and applies a structured extraction order: headers → paragraphs (with heading level detection) → tables → footers. Heading styles are converted to Markdown-style # prefixes, preserving document structure for downstream chunking.
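The heading conversion can be sketched without python-docx by working on the style name alone. heading_to_markdown is a hypothetical helper; it assumes Word's built-in style names of the form "Heading N" and caps the level at six.

```python
import re


def heading_to_markdown(style_name: str, text: str) -> str:
    """Prefix text with '#' marks when the Word style name is 'Heading N'."""
    match = re.fullmatch(r"Heading (\d+)", style_name)
    if match is None:
        return text  # ordinary paragraph: leave unchanged
    level = max(1, min(int(match.group(1)), 6))  # Markdown defines six heading levels
    return "#" * level + " " + text


print(heading_to_markdown("Heading 1", "Introduction"))  # # Introduction
print(heading_to_markdown("Heading 3", "Scope"))         # ### Scope
print(heading_to_markdown("Normal", "Body text."))       # Body text.
```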

For legacy .doc files, a conversion chain is applied: LibreOffice (if installed) → textract → descriptive ImportError with user instructions. The original bug of using PdfReader on a .doc file has been corrected.

CSV Loader

Purpose: Load CSV/TSV and Excel files.

  • CSVLoader — streams rows from .csv/.tsv. Each row becomes one LoadedDocument. Supports flexible column selection: content_columns controls which fields form the document text; metadata_columns controls which fields go into metadata. Encoding is resolved via a two-attempt loop ([configured_encoding, "latin-1"]), eliminating the recursive retry bug of the original.
  • ExcelLoader — wraps openpyxl for .xlsx/.xlsm and pandas+xlrd for legacy .xls. Supports multi-sheet loading and sheet name validation.
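The row-to-document mapping with content_columns and metadata_columns might be sketched as follows. rows_to_documents is a hypothetical helper, the "column: value" text layout and the "row" metadata key are assumptions, and (text, metadata) tuples stand in for LoadedDocument.

```python
import csv
import io
from typing import Any, Dict, Iterator, List, Optional, Tuple


def rows_to_documents(
    handle,
    content_columns: Optional[List[str]] = None,
    metadata_columns: Optional[List[str]] = None,
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (text, metadata) pairs, one per CSV row."""
    for row_number, row in enumerate(csv.DictReader(handle)):
        columns = content_columns or list(row.keys())  # None means all columns
        text = "\n".join(f"{col}: {row[col]}" for col in columns)
        metadata: Dict[str, Any] = {col: row[col] for col in (metadata_columns or [])}
        metadata["row"] = row_number
        yield text, metadata


data = io.StringIO("sku,name,description\nA1,Lamp,Desk lamp\nB2,Chair,Office chair\n")
docs = list(rows_to_documents(data, content_columns=["name", "description"], metadata_columns=["sku"]))
print(docs[0][0])  # two lines: "name: Lamp" and "description: Desk lamp"
print(docs[1][1])  # {'sku': 'B2', 'row': 1}
```

Because the function is a generator over csv.DictReader, rows are produced one at a time rather than collected up front.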

JSON Loader

Purpose: Load JSON and JSONL files.

  • JSONLoader — handles both JSON arrays and objects. Supports jq-style path extraction (via the jq library, with a built-in dot-path fallback). A metadata_func callable allows arbitrary metadata extraction per record.
  • JSONLinesLoader — streams JSONL/NDJSON files line by line, supporting max_records limits and skip_invalid for tolerant parsing of noisy datasets (common in ML training data).
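The line-by-line behavior with max_records and skip_invalid can be sketched like this. iter_jsonl is a hypothetical function, and raising ValueError in strict mode is an assumption about error handling.

```python
import io
import json
from typing import Any, Iterable, Iterator, Optional


def iter_jsonl(
    lines: Iterable[str],
    max_records: Optional[int] = None,
    skip_invalid: bool = True,
) -> Iterator[Any]:
    """Yield one parsed record per non-empty line, stopping after max_records."""
    count = 0
    for line_number, line in enumerate(lines, start=1):
        if max_records is not None and count >= max_records:
            break
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            if skip_invalid:
                continue  # tolerate a malformed line
            raise ValueError(f"invalid JSON on line {line_number}")
        count += 1
        yield record


raw = io.StringIO('{"text": "a"}\nnot json\n\n{"text": "b"}\n{"text": "c"}\n')
records = list(iter_jsonl(raw, max_records=2))
print(records)  # [{'text': 'a'}, {'text': 'b'}]
```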

HTML Loader

Purpose: Load HTML files and parse HTML strings.

  • HTMLLoader — reads .html/.htm files, removes noise elements (scripts, styles, nav, footer, header), extracts page metadata (title, OG tags, language), and optionally extracts tables as separate documents.
  • HTMLStringLoader — same parsing logic applied to an in-memory HTML string. Uses the internal _HTMLParser helper which bypasses file validation, allowing reuse of parsing logic without a file.
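Noise removal and title extraction can be illustrated with the standard library's html.parser. This is a stand-in for the idea only: the library lists beautifulsoup4 as its HTML dependency, and TextExtractor is a hypothetical class.

```python
from html.parser import HTMLParser
from typing import List

NOISE_TAGS = {"script", "style", "nav", "footer", "header"}


class TextExtractor(HTMLParser):
    """Collect visible text and the <title>, skipping content inside noise tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.parts: List[str] = []
        self._noise_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif self._noise_depth == 0 and data.strip():
            self.parts.append(data.strip())


html = """<html><head><title>Pricing</title><style>p {color: red}</style></head>
<body><nav>Home | About</nav><p>Plans start at $5.</p>
<script>track()</script><footer>Copyright</footer></body></html>"""

parser = TextExtractor()
parser.feed(html)
print(parser.title)            # Pricing
print(" ".join(parser.parts))  # Plans start at $5.
```

Since the parser works on a string, the same logic serves both a file-based loader (after reading the file) and an in-memory one.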

Web Loader

Purpose: Fetch and load web content.

  • WebLoader — fetches a single URL using requests with exponential-backoff retry (delay * 2^attempt). Delegates HTML parsing to HTMLStringLoader. Injects url and source metadata.
  • MultiURLLoader — fetches multiple URLs using ThreadPoolExecutor. A threading.Semaphore(1) with shared last_request_time enforces global rate limiting between fetches.
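The retry formula delay * 2^attempt can be sketched independently of any HTTP client. with_retries is a hypothetical helper; the sleep function is injectable so the example can record waits instead of actually pausing.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); after each failure wait retry_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(retry_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


calls = {"count": 0}
waits: List[float] = []


def flaky_fetch() -> str:
    """Fails twice, then succeeds, to exercise the retry path."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


body = with_retries(flaky_fetch, max_retries=3, retry_delay=0.5, sleep=waits.append)
print(body)   # <html>ok</html>
print(waits)  # [0.5, 1.0]
```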

Directory Loader

Purpose: Recursively load an entire directory.

DirectoryLoader collects files matching a glob pattern, filters by supported extensions and exclusion patterns, then dispatches each file to the appropriate loader. Supports both sequential and multithreaded loading. The get_stats() method returns a breakdown by file type without loading any content.
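A statistics pass that reads only file-system metadata might look like the sketch below. directory_stats is a hypothetical function; its return keys mirror the get_stats() shape shown in the API section, but the rounding and the "(none)" label for extensionless files are assumptions.

```python
import tempfile
from collections import Counter
from pathlib import Path
from typing import Any, Dict


def directory_stats(root: str, glob_pattern: str = "**/*") -> Dict[str, Any]:
    """Summarize matching files using only file-system metadata; no file is opened."""
    files = [p for p in Path(root).glob(glob_pattern) if p.is_file()]
    by_type = Counter(p.suffix.lower() or "(none)" for p in files)
    total_bytes = sum(p.stat().st_size for p in files)
    return {
        "total_files": len(files),
        "by_type": dict(by_type),
        "total_size_mb": round(total_bytes / (1024 * 1024), 3),
    }


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("hello")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.md").write_text("# title")
    (Path(tmp) / "sub" / "c.md").write_text("# other")
    stats = directory_stats(tmp)

print(stats["total_files"])  # 3
print(stats["by_type"])
```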


5. API / Public Interfaces

AutoLoader

# Load from any source
docs: List[LoadedDocument] = AutoLoader.load(source: str, **kwargs)

# Get the loader without executing it
loader: BaseDocumentLoader = AutoLoader.get_loader(source: str, **kwargs)

# Detect source type without loading
source_type: str = AutoLoader.detect_type(source: str)
# Returns: "text", "pdf", "csv", "web", "directory", "unknown", etc.

LoadedDocument

doc.page_content       # str — extracted text
doc.metadata           # Dict[str, Any] — source, file_name, loader_type, loaded_at, etc.
doc.doc_id             # str — deterministic "doc_" + SHA-256[:16]
doc.to_dict()          # Dict with all three fields
len(doc)               # int — character count
doc1 == doc2           # bool — based on doc_id
hash(doc)              # int — for set/dict usage

BaseDocumentLoader

loader.load() -> List[LoadedDocument]
loader.lazy_load() -> Iterator[LoadedDocument]
loader.load_and_split(text_splitter=None) -> List[LoadedDocument]

PDFLoader

PDFLoader(
    file_path: str,
    config: Optional[PDFLoaderConfig] = None,
    per_page: bool = True,          # One document per page
    password: Optional[str] = None,
)

CSVLoader

CSVLoader(
    file_path: str,
    content_columns: Optional[List[str]] = None,   # None = all columns
    metadata_columns: Optional[List[str]] = None,
    source_column: Optional[str] = None,
    encoding: str = "utf-8",
    delimiter: str = ",",
)

JSONLoader

JSONLoader(
    file_path: str,
    content_key: Optional[str] = None,             # e.g., "body", "text"
    metadata_func: Optional[Callable[[Dict], Dict]] = None,
    jq_schema: Optional[str] = None,               # e.g., ".items[]"
    encoding: str = "utf-8",
)

WebLoader

WebLoader(
    url: str,
    config: Optional[WebLoaderConfig] = None,
    headers: Optional[Dict[str, str]] = None,
    timeout: int = 10,
)

DirectoryLoader

DirectoryLoader(
    path: str,
    config: Optional[DirectoryLoaderConfig] = None,
    glob_pattern: str = "**/*",
    recursive: bool = True,
    silent_errors: bool = False,
    show_progress: bool = True,
)

loader.get_stats() -> Dict   # {"total_files": N, "by_type": {...}, "total_size_mb": X}

6. Configuration System

Per-Loader Configs

Each loader accepts an optional typed config dataclass. When not provided, a default instance is created from keyword arguments.

from fennec_community.document_loaders import PDFLoaderConfig, PDFLoader

config = PDFLoaderConfig(
    per_page=True,
    start_page=0,
    end_page=50,
    password="secret",
    page_separator="\n---\n",
)
loader = PDFLoader("report.pdf", config=config)
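How a loader might reconcile an explicit config with keyword arguments is sketched below. PDFConfig and PDFLoaderSketch are stand-ins, and the precedence rule (an explicit config wins over keywords) is an assumption; this document only states that a default config is built from keyword arguments when none is passed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PDFConfig:
    """Stand-in for PDFLoaderConfig with a subset of its options."""
    per_page: bool = True
    password: Optional[str] = None


class PDFLoaderSketch:
    def __init__(
        self,
        file_path: str,
        config: Optional[PDFConfig] = None,
        per_page: bool = True,
        password: Optional[str] = None,
    ) -> None:
        self.file_path = file_path
        # Assumption: an explicit config wins; otherwise one is built from the keywords.
        if config is not None:
            self.config = config
        else:
            self.config = PDFConfig(per_page=per_page, password=password)


from_kwargs = PDFLoaderSketch("report.pdf", per_page=False)
from_config = PDFLoaderSketch("report.pdf", config=PDFConfig(password="secret"))
print(from_kwargs.config)  # PDFConfig(per_page=False, password=None)
print(from_config.config)  # PDFConfig(per_page=True, password='secret')
```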

Master Config

LoaderConfig composes all sub-configs for scenarios requiring system-wide configuration:

from fennec_community.document_loaders import LoaderConfig

master = LoaderConfig()
master.pdf.per_page = True
master.web.timeout = 30
master.web.max_retries = 5
master.directory.max_concurrency = 8
master.directory.silent_errors = True

Extension Map

EXTENSION_MAP is a public Dict[str, LoaderType] that controls how AutoLoader and DirectoryLoader resolve file types. It can be inspected to discover supported extensions:

from fennec_community.document_loaders import EXTENSION_MAP
print(sorted(EXTENSION_MAP.keys()))
# ['.csv', '.doc', '.docx', '.htm', '.html', '.json', '.jsonl', '.md', ...]

Key Config Options by Loader

| Loader | Key Options |
|---|---|
| TextLoader | autodetect_encoding, errors ("strict"/"ignore"/"replace") |
| PDFLoader | per_page, start_page, end_page, password, page_separator |
| DocxLoader | include_tables, include_headers, include_footers |
| CSVLoader | content_columns, metadata_columns, source_column, max_rows, skip_rows |
| JSONLoader | content_key, jq_schema, metadata_func, text_content |
| HTMLLoader | parser, tags_to_remove, extract_links, extract_tables |
| WebLoader | timeout, max_retries, retry_delay, verify_ssl, headers |
| DirectoryLoader | glob_pattern, exclude_patterns, recursive, max_concurrency, silent_errors |

7. Usage Guide

Quick Start

from fennec_community.document_loaders import AutoLoader

# File
docs = AutoLoader.load("report.pdf")

# Web URL
docs = AutoLoader.load("https://docs.python.org/3/library/pathlib.html")

# Directory
docs = AutoLoader.load("./knowledge_base/")

print(f"Loaded {len(docs)} documents")
print(docs[0].page_content[:200])
print(docs[0].metadata)

Basic Usage — Specific Loaders

from fennec_community.document_loaders import PDFLoader, CSVLoader, JSONLoader

# PDF: per-page loading
pdf_docs = PDFLoader("annual_report.pdf", per_page=True).load()

# CSV: select specific columns
csv_docs = CSVLoader(
    "products.csv",
    content_columns=["name", "description"],
    metadata_columns=["sku", "category"],
).load()

# JSON: extract specific field
json_docs = JSONLoader(
    "articles.json",
    content_key="body",
    metadata_func=lambda r: {"title": r.get("title"), "author": r.get("author")},
).load()

Advanced Usage — Chunking Pipeline

from fennec_community.document_loaders import AutoLoader
from fennec_community.chunks import TokenTextSplitter

# Any LangChain-compatible splitter also works, for example:
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

splitter = TokenTextSplitter()
loader = AutoLoader.get_loader("deep_learning.pdf")
chunks = loader.load_and_split(text_splitter=splitter)

print(f"{len(chunks)} chunks")
print(chunks[0].metadata)
# {'source': '...', 'chunk_index': 0, 'total_chunks': 14, 'original_doc_id': 'doc_...'}

Advanced Usage — Concurrent Web Scraping

from fennec_community.document_loaders import MultiURLLoader, WebLoaderConfig

config = WebLoaderConfig(timeout=15, max_retries=3, verify_ssl=True)

urls = [
    "https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)",
    "https://en.wikipedia.org/wiki/BERT_(language_model)",
    "https://en.wikipedia.org/wiki/GPT-4",
]

loader = MultiURLLoader(
    urls=urls,
    config=config,
    max_workers=3,
    delay_between_requests=1.0,
    continue_on_error=True,
)
docs = loader.load()

Advanced Usage — Directory with Stats

from fennec_community.document_loaders import DirectoryLoader, DirectoryLoaderConfig

config = DirectoryLoaderConfig(
    glob_pattern="**/*.pdf",
    recursive=True,
    silent_errors=True,
    max_concurrency=6,
    show_progress=True,
)
loader = DirectoryLoader("./documents", config=config)

# Inspect before loading
stats = loader.get_stats()
print(f"Total: {stats['total_size_mb']} MB across {stats['total_files']} files")

# Load
docs = loader.load()

Advanced Usage — Lazy Loading a Large JSONL Dataset

from fennec_community.document_loaders import JSONLinesLoader

loader = JSONLinesLoader(
    "dataset.jsonl",
    content_key="text",
    metadata_func=lambda r: {"label": r.get("label"), "id": r.get("id")},
    max_records=10_000,
    skip_invalid=True,
)

# Process without loading everything into memory
for doc in loader.lazy_load():
    embed_and_store(doc)  # your indexing function

8. Code Examples

Deduplication Using Set Semantics

from fennec_community.document_loaders import AutoLoader

docs_a = AutoLoader.load("./folder_a/")
docs_b = AutoLoader.load("./folder_b/")

unique_docs = list(set(docs_a + docs_b))
print(f"Unique documents: {len(unique_docs)}")

Building a RAG Corpus from Mixed Sources

from fennec_community.document_loaders import AutoLoader, MultiURLLoader

sources = [
    "research_papers/",           # local directory
    "notes.md",                   # markdown file
    "data/faq.json",              # JSON dataset
]

all_docs = []
for source in sources:
    all_docs.extend(AutoLoader.load(source))

web_loader = MultiURLLoader(["https://company.com/blog", "https://docs.company.com"])
all_docs.extend(web_loader.load())

print(f"Total corpus: {len(all_docs)} documents")
print(f"Total characters: {sum(len(d) for d in all_docs):,}")

Custom Metadata Injection

from fennec_community.document_loaders import PDFLoader

loader = PDFLoader("legal_contract.pdf", per_page=True)
docs = loader.load()

# Enrich metadata post-load
for doc in docs:
    doc.metadata["project"] = "contract_2024"
    doc.metadata["confidential"] = True

Inspecting the EXTENSION_MAP

from fennec_community.document_loaders import EXTENSION_MAP, LoaderType

pdf_extensions = [ext for ext, t in EXTENSION_MAP.items() if t == LoaderType.PDF]
print(pdf_extensions)  # ['.pdf']

9. Design Decisions & Trade-offs

Content-Addressed Document IDs

Decision: doc_id is a SHA-256 hash of (source + page_number + content), not a timestamp or UUID.

Rationale: Time-based IDs break caching and deduplication — loading the same file twice would yield different IDs. Content-based IDs are idempotent: re-indexing a corpus skips unchanged documents without additional bookkeeping.

Trade-off: Because the source is part of the hash, two documents with identical content from different sources get different IDs, so doc_id alone does not detect cross-source duplicates. Documents with the same source, page, and content share an ID, which is the intended behavior.

Fallback Chain for PDF Backends

Decision: Try PyMuPDF → pdfplumber → pypdf in order, catching ImportError at each step.

Rationale: Different environments have different native library constraints. A data science environment likely has PyMuPDF; a serverless function might only have pure-Python pypdf. The library works in either environment without configuration.

Trade-off: The best backend (PyMuPDF) may not be available, silently degrading to a slower one. The backend key in document metadata reveals which was used.

Config Dataclasses over **kwargs

Decision: All configuration is expressed as typed @dataclass instances.

Rationale: Typed configs give IDEs autocompletion, make configuration serializable and inspectable, and make the public API self-documenting. Arbitrary **kwargs would hide the available options.

Trade-off: Adding a new config option requires updating the dataclass. This is a minor friction compared to the discoverability benefits.

HTMLStringLoader + _HTMLParser Internal Pattern

Decision: HTMLStringLoader delegates to an internal _HTMLParser subclass that bypasses file validation.

Rationale: WebLoader needs to parse HTML strings (not files), but the parsing logic in HTMLLoader is non-trivial. Duplicating it would violate DRY. The _HTMLParser internal class allows reuse while keeping the public API clean.

Trade-off: _HTMLParser is a semi-public internal class. Subclassing HTMLLoader while skipping __init__ validation is a fragile pattern that could break if HTMLLoader.__init__ changes significantly.

Global Rate Limiting in MultiURLLoader

Decision: A threading.Semaphore(1) with a shared last_time dict enforces inter-request delays globally, not per-thread.

Rationale: Per-thread delays are ineffective when threads run concurrently — all threads could fire simultaneously with no inter-request spacing. The semaphore serializes the delay phase while allowing the actual HTTP fetch to run in parallel.

Trade-off: The semaphore briefly serializes the start of each request. For very fast servers, this may underutilize parallelism. A more sophisticated token-bucket rate limiter would be better for high-throughput scenarios.
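For comparison, a token-bucket limiter of the kind mentioned in the trade-off could be sketched as follows. This is not part of the library; TokenBucket, its parameters, and the injectable clock are all illustrative assumptions.

```python
import threading
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._updated = clock()
        self._lock = threading.Lock()  # protects the token count across threads

    def try_acquire(self) -> bool:
        with self._lock:
            now = self._clock()
            # Refill in proportion to the time elapsed, capped at the bucket size.
            self._tokens = min(self.capacity, self._tokens + (now - self._updated) * self.rate)
            self._updated = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False


fake_time = [0.0]  # controllable clock so the demonstration is deterministic
bucket = TokenBucket(rate=2.0, capacity=2, clock=lambda: fake_time[0])
print([bucket.try_acquire() for _ in range(3)])  # [True, True, False]
fake_time[0] = 0.5                               # half a second later: one token refilled
print(bucket.try_acquire())                      # True
```

Unlike a fixed inter-request delay, the bucket lets short bursts through at full parallelism and throttles only sustained load.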


10. Extensibility Guide

Adding a New File Loader

  1. Create your_loader.py inheriting BaseFileLoader:

from fennec_community.document_loaders import BaseFileLoader, LoadedDocument
from typing import List

class YourFormatLoader(BaseFileLoader):
    SUPPORTED_EXTENSIONS = [".yourext"]

    def load(self) -> List[LoadedDocument]:
        # Parse self.file_path (your_parsing_logic is a placeholder for your own parser)
        text = your_parsing_logic(str(self.file_path))
        meta = self._build_file_metadata(custom_key="value")
        return [LoadedDocument(page_content=text, metadata=meta)]

  2. Register the extension in config_loader.py:

class LoaderType(Enum):
    # ... existing members ...
    YOUR_FORMAT = "yourformat"

EXTENSION_MAP[".yourext"] = LoaderType.YOUR_FORMAT

  3. Add dispatch logic in AutoLoader._create_file_loader and DirectoryLoader._get_loader_for_file.

  4. Export from __init__.py.

Adding a New Config

Add a dataclass to config_loader.py and a field to LoaderConfig:

from dataclasses import dataclass, field

@dataclass
class YourFormatLoaderConfig:
    option_a: bool = True
    option_b: str = "default"

@dataclass
class LoaderConfig:
    # ... existing fields ...
    your_format: YourFormatLoaderConfig = field(default_factory=YourFormatLoaderConfig)

Adding a Non-File Source Loader

Inherit directly from BaseDocumentLoader:

from fennec_community.document_loaders import BaseDocumentLoader, LoadedDocument
from typing import List

class DatabaseLoader(BaseDocumentLoader):
    def __init__(self, connection_string: str, query: str):
        super().__init__()
        self.connection_string = connection_string
        self.query = query

    def load(self) -> List[LoadedDocument]:
        # fetch rows, convert to LoadedDocument
        ...

Custom Text Splitter Integration

Any object with a split_text(text: str) -> List[str] method works with load_and_split():

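import re
from typing import List

from fennec_community.document_loaders import PDFLoader

class SentenceSplitter:
    def split_text(self, text: str) -> List[str]:
        # Split after sentence-ending punctuation that is followed by whitespace
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

docs = PDFLoader("paper.pdf").load_and_split(SentenceSplitter())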

11. Performance & Scalability

Memory

  • True streaming via lazy_load() is implemented by CSVLoader, JSONLinesLoader, and DirectoryLoader; other loaders fall back to load(). For large files, prefer lazy_load() over load() when processing documents sequentially.
  • PDF per-page mode (per_page=True) prevents entire PDF text from being held in memory as a single string when page-level processing is sufficient.
  • ExcelLoader uses openpyxl with read_only=True, which streams the workbook rather than loading the entire file into memory.

Concurrency

  • DirectoryLoader — controlled by max_concurrency (default: 4 threads). Set higher for I/O-bound loads from fast storage; the GIL is not a bottleneck for I/O operations.
  • MultiURLLoader — controlled by max_workers (default: 4 threads) with a global semaphore-based rate limiter. Do not set max_workers above the rate-limit ceiling of the target server.

Async

The library is synchronous and thread-based. For integration with async frameworks (FastAPI, asyncio pipelines), run blocking calls in a thread pool via loop.run_in_executor():

import asyncio
from fennec_community.document_loaders import AutoLoader

async def load_async(source: str):
    # get_running_loop() is the recommended way to obtain the loop inside a coroutine
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor
    return await loop.run_in_executor(None, AutoLoader.load, source)

Caching

Because doc_id is deterministic, a simple content-addressed cache layer can be built on top:

import shelve
from typing import List

from fennec_community.document_loaders import AutoLoader, LoadedDocument

def cached_load(source: str, cache_path: str = "doc_cache") -> List[LoadedDocument]:
    docs = AutoLoader.load(source)
    result = []
    with shelve.open(cache_path) as cache:
        for doc in docs:
            # doc_id is deterministic, so unchanged documents map to existing entries
            if doc.doc_id not in cache:
                cache[doc.doc_id] = doc
            result.append(cache[doc.doc_id])
    return result

Optional Dependencies

The library uses optional dependencies to avoid forcing heavy installs on minimal environments. Import errors are caught and surfaced as descriptive ImportError messages rather than crashes.

| Feature | Required Package |
|---|---|
| PDF (fast) | PyMuPDF (pip install pymupdf) |
| PDF (tables) | pdfplumber |
| PDF (fallback) | pypdf |
| Word documents | python-docx |
| Excel | openpyxl (.xlsx), xlrd (.xls) |
| HTML/Web | beautifulsoup4, requests |
| JSON path | jq (optional; built-in fallback available) |
| Encoding detection | chardet (optional; latin-1 fallback available) |
| Legacy .doc | LibreOffice (system) or textract |