<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kreuzberg</title>
    <description>The latest articles on Forem by Kreuzberg (@kreuzberg).</description>
    <link>https://forem.com/kreuzberg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12062%2F3bf62621-afa0-40a7-8e8e-31858840ed6e.png</url>
      <title>Forem: Kreuzberg</title>
      <link>https://forem.com/kreuzberg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kreuzberg"/>
    <language>en</language>
    <item>
      <title>Document Structure Extraction with Kreuzberg</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Tue, 31 Mar 2026 10:57:04 +0000</pubDate>
      <link>https://forem.com/kreuzberg/document-structure-extraction-with-kreuzberg-44cj</link>
      <guid>https://forem.com/kreuzberg/document-structure-extraction-with-kreuzberg-44cj</guid>
      <description>&lt;p&gt;Extracting structured data from PDFs is one of the hardest problems in AI infrastructure. Most tools give you a text dump but no headings, no table boundaries, no distinction between a caption and a footnote. When Docling launched, it changed the game with a genuinely good layout model.&lt;/p&gt;

&lt;p&gt;We want to be clear: Docling is a great project, and we have the greatest respect for the team at IBM for putting it out there. It’s also fully open source under a permissive Apache-2.0 license. We integrated their model into Kreuzberg and embedded it in a Rust-native pipeline, where it currently runs 2.8× faster with a fraction of the memory footprint.&lt;/p&gt;

&lt;p&gt;This post covers the behind-the-scenes part: what we used, what we rebuilt from scratch, and where the speed comes from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Document Structure Matters for AI and RAG Pipelines&lt;/strong&gt;&lt;br&gt;
If you’re building AI infrastructure like RAG pipelines, document processing workflows, or any AI application that ingests PDFs at scale, flat text extraction isn’t enough anymore.&lt;/p&gt;

&lt;p&gt;Consider what happens when you feed an LLM a PDF that’s been extracted as a single blob of text. The model can’t distinguish a section heading from body text. It can’t tell if a number belongs to a table cell or a footnote. It merges multi-column layouts into nonsense. The retrieval quality of your entire pipeline degrades because the source data has no structure.&lt;/p&gt;

&lt;p&gt;Docling, IBM’s open-source document understanding library, addressed this head-on. Their RT-DETR v2 layout model (called Docling Heron) classifies 17 different document element types: headings, paragraphs, tables, figures, captions, page headers, footers, and more. It produces a structural representation that downstream systems can actually work with.&lt;/p&gt;

&lt;p&gt;The model is excellent. The issue lies in what’s around it.&lt;/p&gt;

&lt;p&gt;Docling is a Python library built on deep learning inference. Model loading takes time. Processing is sequential. Memory usage scales with document complexity. For a single document or a research prototype, that’s fine. For thousands of documents in a production pipeline, especially if your stack isn’t Python, it starts to matter. That’s the gap we set out to close.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Foundation&lt;/strong&gt;&lt;br&gt;
Starting with Kreuzberg v4.5.0, we integrated Docling’s RT-DETR v2 layout model directly into our Rust-native pipeline. The model is open-source under Apache-2.0, and we want to be transparent about its use. Docling’s team built something excellent. But the model is only one piece of a document extraction system. The inference runtime, the text extraction layer, the page processing strategy, and the table reconstruction pipeline are all built in Rust. The result is a system that uses Docling’s layout intelligence but runs it through an entirely different execution engine.&lt;/p&gt;

&lt;p&gt;Here’s where the engineering differences live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering the Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ONNX Runtime for Layout Inference&lt;/strong&gt;&lt;br&gt;
The RT-DETR v2 model runs through ONNX Runtime rather than PyTorch. There’s no Python dependency and no GIL contention, and hardware acceleration (CPU, CUDA, CoreML, and TensorRT) is supported natively. All of this is configurable through a typed &lt;code&gt;AccelerationConfig&lt;/code&gt; that works across every language binding Kreuzberg supports.&lt;/p&gt;

&lt;p&gt;This alone eliminates the cold-start penalty. The ONNX session loads once and stays resident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel Page Processing&lt;/strong&gt;&lt;br&gt;
Layout inference processes page batches in a single &lt;code&gt;session.run()&lt;/code&gt; call. SLANet-Plus (the table structure recognition model) and layout inference run in parallel using thread-local model instances and Rayon workers. Each page is processed independently and released after extraction, keeping memory usage flat even on 500-page documents.&lt;/p&gt;

&lt;p&gt;Docling processes pages sequentially through Python. Kreuzberg processes them concurrently through Rust. On a 100-page PDF, that difference compounds fast.&lt;/p&gt;
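&lt;p&gt;In Python terms, the per-page strategy looks roughly like the sketch below. The real pipeline is Rust with Rayon workers and thread-local model instances; &lt;code&gt;render_page&lt;/code&gt; and &lt;code&gt;extract_page&lt;/code&gt; here are hypothetical stand-ins for illustration only.&lt;/p&gt;

```python
# Conceptual sketch (not Kreuzberg code) of processing pages concurrently
# while keeping peak memory flat: each page is rendered, extracted, and
# released independently on its worker.
from concurrent.futures import ThreadPoolExecutor


def render_page(page_number: int) -> bytes:
    # Stand-in for rendering one page to an image buffer.
    return b"pixels-%d" % page_number


def extract_page(page_number: int) -> str:
    pixels = render_page(page_number)
    result = f"page {page_number}: {len(pixels)} bytes"
    del pixels  # page buffer freed before this worker takes the next page
    return result


def extract_document(num_pages: int) -> list[str]:
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map preserves page order even though pages finish out of order
        return list(pool.map(extract_page, range(num_pages)))
```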

&lt;p&gt;&lt;strong&gt;Native PDF Text Extraction via PDFium&lt;/strong&gt;&lt;br&gt;
This is where most of the quality gains come from, and it’s the biggest architectural divergence from Docling.&lt;/p&gt;

&lt;p&gt;Instead of relying on the layout model’s pipeline to also handle text extraction, Kreuzberg reads text directly from the PDF’s native text layer using PDFium’s character-level API. This preserves exact character positions, font metadata (bold, italic, size), and Unicode encoding. The layout model then classifies and organizes this high-fidelity text according to the document’s visual structure.&lt;/p&gt;

&lt;p&gt;The distinction matters because Docling’s pipeline treats the rendered page image as the primary input for both layout detection and text extraction. Kreuzberg uses the page image only for layout detection, then pulls text from the PDF’s native layer. You get neural-network-quality structure classification with lossless text fidelity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure Tree Integration&lt;/strong&gt;&lt;br&gt;
When a PDF contains a tagged structure tree (common in PDF/A and accessibility-compliant documents), Kreuzberg uses the author’s original paragraph boundaries and heading hierarchy, then applies layout model predictions as classification overrides. The structure tree gives you the author’s intent; the layout model gives you visual classification. Combining both produces better results than either alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixing Edge Cases in PDFs&lt;/strong&gt;&lt;br&gt;
The single biggest quality improvement came not from the layout model integration, but from rewriting how we extract text from PDFs at the character level.&lt;/p&gt;

&lt;p&gt;Before v4.5.0, Kreuzberg used &lt;code&gt;PdfiumParagraph::from_objects()&lt;/code&gt;, a paragraph-level extraction approach that relied on PDFium’s built-in text segmentation. It worked on clean documents but broke down on anything with non-standard font matrices, complex column layouts, or broken CMap encodings. And PDFs are full of exactly these problems.&lt;/p&gt;

&lt;p&gt;We replaced it with per-character text extraction using PDFium’s &lt;code&gt;PdfPageText::chars()&lt;/code&gt; API. Every character is read individually with its exact position, font size, and baseline coordinates. From there, we rebuild the text structure ourselves.&lt;/p&gt;

&lt;p&gt;This unlocked a chain of fixes that would have been impossible at the paragraph level:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broken font metrics.&lt;/strong&gt; Many PDFs report incorrect font sizes due to font matrix scaling: PDFium might say &lt;code&gt;font_size=1&lt;/code&gt; when the rendered text is clearly 12pt. Our old 4pt minimum filter would silently drop all content from these pages. Now, when the filter would remove everything on a page, it is skipped automatically. The same logic applies to margin filtering: when it would remove all text on a page (PDFs with baseline values outside the expected bands), the filter falls back gracefully.&lt;/p&gt;
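&lt;p&gt;The fallback rule is simple to sketch. The threshold below is illustrative, and this is a toy version of the idea rather than Kreuzberg’s actual filter:&lt;/p&gt;

```python
# Sketch of a filter that falls back gracefully: if applying it would
# drop every character on a page, the filter is skipped entirely,
# because an empty page is worse than an unfiltered one.
def filter_with_fallback(chars: list[dict], min_font_size: float = 4.0) -> list[dict]:
    kept = [c for c in chars if c["font_size"] >= min_font_size]
    if not kept and chars:
        # Broken font matrices can report font_size=1 for 12pt text;
        # keep the original characters instead of silently losing the page.
        return chars
    return kept
```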

&lt;p&gt;&lt;strong&gt;Ligature corruption.&lt;/strong&gt; LaTeX-generated PDFs with broken ToUnicode CMaps produce garbled text: &lt;em&gt;different&lt;/em&gt; becomes &lt;em&gt;di!erent&lt;/em&gt;, &lt;em&gt;offices&lt;/em&gt; becomes &lt;em&gt;o”ces&lt;/em&gt;. We repair these inline during character iteration, using a vowel/consonant heuristic to disambiguate ambiguous ligature mappings. Fixing this during extraction rather than in a post-processing pass improved both accuracy and performance.&lt;/p&gt;
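&lt;p&gt;As a toy illustration of the idea (the real candidate tables and rules are more involved than this sketch, and the glyph set here is an assumption):&lt;/p&gt;

```python
# Sketch of vowel/consonant ligature repair: a broken CMap maps several
# "ff"-family ligatures to one garbage glyph, and the character that
# follows hints at which ligature was intended. The mapping below is a
# two-rule illustration, not Kreuzberg's actual table.
VOWELS = set("aeiou")


def repair_ligatures(text: str, bad_glyphs: str = '!"') -> str:
    out = []
    for i, ch in enumerate(text):
        if ch in bad_glyphs:
            nxt = text[i + 1 : i + 2]  # "" at end of string
            # vowel next -> "ff" ("di!erent" -> "different");
            # consonant next -> "ffi" ('o"ces' -> "offices")
            out.append("ff" if nxt in VOWELS else "ffi")
        else:
            out.append(ch)
    return "".join(out)
```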

&lt;p&gt;&lt;strong&gt;Word spacing artifacts.&lt;/strong&gt; PDFium sometimes inserts spurious spaces mid-word: &lt;em&gt;shall be active&lt;/em&gt; becomes &lt;em&gt;s hall a b e active&lt;/em&gt;. Pages with detected broken spacing are re-extracted using character-level gap analysis (a &lt;code&gt;font_size × 0.33&lt;/code&gt; threshold), while clean pages use the fast single-call path. On the ISO 21111-10 test document, this reduced garbled lines from 406 to zero.&lt;/p&gt;
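&lt;p&gt;Conceptually, the gap analysis looks like the sketch below: spaces are inferred from horizontal gaps between adjacent characters rather than trusted from PDFium’s output. The character tuples are simplified for illustration, not Kreuzberg’s actual data structures.&lt;/p&gt;

```python
# Rebuild a line from (char, x position, width, font size) tuples.
# A horizontal gap wider than font_size * 0.33 becomes a space;
# spurious mid-word spaces in the original text stream are simply ignored
# because spacing is re-derived from geometry.
def rebuild_line(chars: list[tuple[str, float, float, float]]) -> str:
    out = []
    prev_end = None
    for ch, x, width, font_size in chars:
        if prev_end is not None and (x - prev_end) > font_size * 0.33:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)
```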

&lt;p&gt;&lt;strong&gt;Multi-column reading order.&lt;/strong&gt; Federal Register-style multi-column PDFs jumped from 69.9% to 90.7% F1 after we switched to PDFium’s text API, which handles column reading order natively, with no column detection heuristics needed on our side.&lt;/p&gt;

&lt;p&gt;The final result: Kreuzberg’s PDF markdown extraction hit 91.0% average F1 across 16 test PDFs, compared to Docling’s 91.4%. Effectively at parity, while being 10–50× faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Table Extraction Works&lt;/strong&gt;&lt;br&gt;
Table extraction runs in two stages.&lt;/p&gt;

&lt;p&gt;First, the RT-DETR v2 layout model identifies table regions on the page image. Then, Kreuzberg crops each detected region and runs SLANet-Plus, a specialized model that predicts internal table structure: rows, columns, cells, including colspan and rowspan.&lt;/p&gt;

&lt;p&gt;The predicted cell grid is matched against native PDF text positions to reconstruct accurate markdown tables. This hybrid approach (neural structure prediction plus native text extraction) avoids the OCR-like quality loss you get when working only with rendered page images.&lt;/p&gt;

&lt;p&gt;We also tightened the detection heuristics. Table detection now requires at least 3 aligned columns, which eliminates false positives from two-column text layouts like academic papers and newsletters. Post-processing rejects tables with 2 or fewer columns, tables where more than 50% of cells contain long text, or tables with an average cell length above 50 characters. These rules cut false positive detections significantly without hurting recall.&lt;/p&gt;
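&lt;p&gt;As a sketch, those rejection rules can be expressed as a single predicate. The 50-character cutoff for a &lt;em&gt;long text&lt;/em&gt; cell below is an assumption for illustration; the post only states the 50% ratio and the average-length threshold.&lt;/p&gt;

```python
# Sketch of the table false-positive filter described above: reject
# candidates with 2 or fewer columns, more than 50% long-text cells,
# or an average cell length above 50 characters.
LONG_CELL = 50  # assumed character cutoff for a "long text" cell


def is_real_table(rows: list[list[str]]) -> bool:
    num_cols = max((len(r) for r in rows), default=0)
    if num_cols in (0, 1, 2):
        return False  # fewer than 3 aligned columns: likely two-column prose
    cells = [c for r in rows for c in r]
    if not cells:
        return False
    long_cells = sum(1 for c in cells if len(c) > LONG_CELL)
    if long_cells / len(cells) > 0.5:
        return False  # mostly long text: reads like paragraphs, not a table
    avg_len = sum(len(c) for c in cells) / len(cells)
    if avg_len > 50:
        return False
    return True
```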

&lt;p&gt;&lt;strong&gt;Benchmarks: How We Measured This&lt;/strong&gt;&lt;br&gt;
We benchmarked Kreuzberg against Docling on 171 PDF documents spanning academic papers, government and legal documents, invoices, OCR scans, and edge cases. F1 score is the harmonic mean of precision and recall: it captures both how much of the expected content was correctly extracted and how much of what was extracted was actually correct.&lt;/p&gt;
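&lt;p&gt;For reference, the metric itself is just:&lt;/p&gt;

```python
# F1 is the harmonic mean of precision and recall; it is high only when
# both are high, so it penalizes extractors that dump everything (high
# recall, low precision) or extract very little (the reverse).
def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```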

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Kreuzberg&lt;/th&gt;&lt;th&gt;Docling&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;Structure F1&lt;/td&gt;&lt;td&gt;42.1%&lt;/td&gt;&lt;td&gt;41.7%&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Text F1&lt;/td&gt;&lt;td&gt;88.9%&lt;/td&gt;&lt;td&gt;86.7%&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Avg. processing time&lt;/td&gt;&lt;td&gt;1,032 ms/doc&lt;/td&gt;&lt;td&gt;2,894 ms/doc&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Structure F1 measures how accurately document elements such as headings, paragraphs, and tables are detected. The 2.8× speed advantage comes from four angles: Rust’s native memory management, PDFium character-level text extraction (no Python overhead), ONNX Runtime inference (no PyTorch), and Rayon parallelism across pages.&lt;/p&gt;

&lt;p&gt;In broader &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt;, we compared Kreuzberg to Apache Tika, Docling, MarkItDown, Unstructured.io, PDFPlumber, MinerU, MuPDF4LLM, and more. There, you can see that Kreuzberg is substantially faster on average, with much lower memory usage and a smaller installation footprint: the Kreuzberg Docker image is around 1–1.3GB, versus a Docling Python installation of 1GB+ before you even add your application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Means for Your Stack&lt;/strong&gt;&lt;br&gt;
Already using Docling and happy with the quality? You’ll get equivalent extraction accuracy from Kreuzberg with less infrastructure overhead. The layout model is the same, the execution is faster, and the memory is lower.&lt;/p&gt;

&lt;p&gt;Running a polyglot stack? If your backend is Rust, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno), Kreuzberg gives you the same layout detection capabilities without wrapping a Python service behind an HTTP endpoint. Native bindings for 12 languages, same Rust core underneath.&lt;/p&gt;

&lt;p&gt;Processing at scale? The combination of parallel page processing, native text extraction, and efficient ONNX inference means significantly higher document throughput on the same hardware. No GPU required for layout detection; CPU inference is fast enough for most production workloads.&lt;/p&gt;

&lt;p&gt;Layout detection is available across all 12 language bindings, the CLI, the REST API, and the MCP server. Models auto-download from HuggingFace on first use and are cached locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;# CLI
kreuzberg extract document.pdf --layout-detection
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;# Python
from kreuzberg import extract_file, ExtractionConfig

result = await extract_file("document.pdf", ExtractionConfig(
    layout_detection=True,
    output_format="markdown",
))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Document structure extraction is becoming table stakes for production AI pipelines. Modern AI systems depend on structured document data, and the faster you can extract it, the more scalable your pipeline becomes.&lt;/p&gt;

&lt;p&gt;We’re grateful to the &lt;a href="https://www.docling.ai/" rel="noopener noreferrer"&gt;Docling&lt;/a&gt; team at IBM for the truly great foundation they’ve provided. If you’re running Docling in production today, try &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; against it on your actual documents and let us know what you think.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>rag</category>
      <category>webdev</category>
    </item>
    <item>
      <title>BM25 + Vector Search in One Query: kreuzberg-surrealdb + SurrealDB v3</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Mon, 23 Mar 2026 08:08:59 +0000</pubDate>
      <link>https://forem.com/kreuzberg/bm25-vector-search-in-one-query-kreuzberg-surrealdb-surrealdb-v3-5abi</link>
      <guid>https://forem.com/kreuzberg/bm25-vector-search-in-one-query-kreuzberg-surrealdb-surrealdb-v3-5abi</guid>
      <description>&lt;p&gt;Author: &lt;a href="https://www.linkedin.com/in/androidvarun/" rel="noopener noreferrer"&gt;Varun Tandon&lt;/a&gt;, Software Engineer at &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Search in 40 Lines: kreuzberg-surrealdb + SurrealDB v3
&lt;/h2&gt;

&lt;p&gt;Every hybrid search tutorial starts with clean text already in the database: ten toy documents, never scanned, never duplicated, never OCR'd. Real pipelines start somewhere else: a directory of client PDFs, some scanned, some protected, plus legacy DOCX files and an ingestion layer you've assembled from LangChain loaders, Unstructured subprocesses, and filename-based IDs that inflate your vector store on every re-run.&lt;/p&gt;

&lt;p&gt;kreuzberg-surrealdb replaces that entire pre-query layer. Two calls get you to a working hybrid search pipeline: &lt;code&gt;setup_schema()&lt;/code&gt; creates the HNSW vector index and BM25 full-text index in SurrealDB; &lt;code&gt;ingest_directory()&lt;/code&gt; handles format detection, OCR, chunking, embedding, and deduplication across 88+ file formats. Then SurrealDB's &lt;code&gt;search::rrf()&lt;/code&gt; runs hybrid BM25 + vector search in a single query. It requires SurrealDB v3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start: Connection to Hybrid Search
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;kreuzberg-surrealdb
&lt;span class="c"&gt;# Requires SurrealDB v3: surreal start --user root --pass root&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surrealdb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncSurreal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg_surrealdb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentPipeline&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncSurreal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ws://localhost:8000/rpc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signin&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# "balanced" preset = bge-base-en-v1.5, 768 dims
&lt;/span&gt;        &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Creates HNSW vector index + BM25 full-text index
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Format detection, OCR, chunking, embedding, dedup
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regulatory compliance Q4 2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# search::rrf() is SurrealDB — not kreuzberg-surrealdb
&lt;/span&gt;        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT * FROM search::rrf([
              (SELECT id, content FROM chunks
               WHERE embedding &amp;lt;|10,COSINE|&amp;gt; $embedding),
              (SELECT id, content, search::score(1) AS score FROM chunks
               WHERE content @1@ $query
               ORDER BY score DESC LIMIT 10)
            ], 10, 60);
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building Ingestion for SurrealDB
&lt;/h2&gt;

&lt;p&gt;Hybrid search with RRF improved Mean Reciprocal Rank from 0.410 to 0.486: an 18.5% gain over single-mode retrieval in a production RAG system. That gain depends entirely on both indexes being correctly populated. Getting there from scratch means solving four problems.&lt;/p&gt;
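&lt;p&gt;For intuition, Reciprocal Rank Fusion can be sketched in a few lines: each document scores &lt;code&gt;1 / (k + rank)&lt;/code&gt; per result list it appears in, with &lt;code&gt;k = 60&lt;/code&gt; being the conventional constant. This is the standard RRF formula, not SurrealDB’s internal implementation.&lt;/p&gt;

```python
# Standard Reciprocal Rank Fusion over several ranked result lists.
# Documents appearing near the top of multiple lists accumulate the
# highest combined score, which is why hybrid BM25 + vector retrieval
# favors results both retrievers agree on.
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```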

&lt;p&gt;&lt;strong&gt;Format extraction.&lt;/strong&gt; LangChain's PDFLoader returns empty strings or raises errors on scanned PDFs (GitHub issue #6376). LibreOffice in Unstructured runs single-threaded, so concurrent ingestion creates silent race conditions on file handles. Missing libmagic on the host causes DOCX files to be misidentified as &lt;code&gt;application/zip&lt;/code&gt;, bypassing all DOCX-specific extraction logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HNSW DDL.&lt;/strong&gt; You specify &lt;code&gt;DIMENSION&lt;/code&gt;, &lt;code&gt;DIST&lt;/code&gt;, &lt;code&gt;EFC&lt;/code&gt;, and &lt;code&gt;M&lt;/code&gt; manually. Wrong values silently produce an underperforming index. &lt;code&gt;DimensionMismatchError&lt;/code&gt; fires at insert time, not schema creation. Switch embedding models after writing records and every subsequent insert fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduplication.&lt;/strong&gt; LlamaIndex document IDs default to filename-based hashing (GitHub issue #13461). Re-running the pipeline on unchanged files creates new vector records, triggers re-embedding API calls, and inflates the vector store. Content-hash dedup isn't in LlamaIndex's default configuration.&lt;/p&gt;
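&lt;p&gt;A minimal sketch of the content-hash approach (this illustrates the general technique, not kreuzberg-surrealdb’s internals): key each record on a digest of the file bytes, so re-runs over unchanged files become no-ops instead of duplicate inserts and re-embedding calls.&lt;/p&gt;

```python
# Content-hash deduplication: identical bytes always produce the same
# key, regardless of filename, path, or ingestion order.
import hashlib


def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class DedupStore:
    def __init__(self) -> None:
        self.records: dict[str, bytes] = {}

    def ingest(self, data: bytes) -> bool:
        """Return True if the document was new and stored."""
        key = content_id(data)
        if key in self.records:
            return False  # unchanged content: skip insert and re-embedding
        self.records[key] = data
        return True
```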

&lt;p&gt;&lt;strong&gt;The LangChain SurrealDBVectorStore&lt;/strong&gt; covers retrieval only. Schema creation, chunking, embedding, and batched inserts remain on you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up kreuzberg-surrealdb
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;kreuzberg-surrealdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You own the &lt;code&gt;AsyncSurreal&lt;/code&gt; connection — authenticate, select namespace and database, then pass it to &lt;code&gt;DocumentPipeline&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surrealdb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncSurreal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg_surrealdb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentPipeline&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncSurreal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ws://localhost:8000/rpc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signin&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What &lt;code&gt;setup_schema()&lt;/code&gt; Generates
&lt;/h2&gt;

&lt;p&gt;One call creates everything SurrealDB needs to run both retrievers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- documents table&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="n"&gt;SCHEMAFULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;       &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;content_hash&lt;/span&gt;  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;ingested_at&lt;/span&gt;   &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;-- OCR confidence (0.0–1.0) for scanned content&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;         &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt;       &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;      &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="n"&gt;FLEXIBLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- chunks table&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="n"&gt;SCHEMAFULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;     &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;   &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;chunk_index&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;word_count&lt;/span&gt;  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;page_number&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;char_start&lt;/span&gt;  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;char_end&lt;/span&gt;    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- HNSW vector index&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_chunk_embedding&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="n"&gt;FIELDS&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
  &lt;span class="n"&gt;HNSW&lt;/span&gt; &lt;span class="n"&gt;DIMENSION&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;F32&lt;/span&gt; &lt;span class="n"&gt;DIST&lt;/span&gt; &lt;span class="n"&gt;COSINE&lt;/span&gt; &lt;span class="n"&gt;EFC&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- BM25 full-text index&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;ANALYZER&lt;/span&gt; &lt;span class="n"&gt;text_analyzer&lt;/span&gt; &lt;span class="n"&gt;TOKENIZERS&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt;
  &lt;span class="n"&gt;FILTERS&lt;/span&gt; &lt;span class="n"&gt;lowercase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stemmer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;english&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_chunk_content&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="n"&gt;FIELDS&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
  &lt;span class="k"&gt;SEARCH&lt;/span&gt; &lt;span class="n"&gt;ANALYZER&lt;/span&gt; &lt;span class="n"&gt;text_analyzer&lt;/span&gt; &lt;span class="n"&gt;BM25&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Embedding Presets
&lt;/h2&gt;

&lt;p&gt;The preset determines the &lt;code&gt;DIMENSION&lt;/code&gt; value in the HNSW DDL — you never specify it manually:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Dimensions&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"fast"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;all-MiniLM-L6-v2&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;td&gt;Low-latency, resource-constrained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"balanced"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bge-base-en-v1.5&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;Default; best general-purpose trade-off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"quality"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bge-large-en-v1.5&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;High-recall when compute is available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"multilingual"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;multilingual-e5-base&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;Non-English or mixed-language corpora&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a custom model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EmbeddingModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fastembed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;One important constraint:&lt;/strong&gt; SurrealDB enforces vector dimension server-wide. All tables on the same instance must use the same dimension. Pick the preset before first ingest — changing it later means dropping the HNSW index, re-running &lt;code&gt;setup_schema()&lt;/code&gt;, and re-embedding the entire corpus.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Chunking Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChunkingConfig&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ChunkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Smaller = more precise retrieval, more records
&lt;/span&gt;        &lt;span class="n"&gt;max_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;  &lt;span class="c1"&gt;# Prevents context loss at chunk boundaries
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ingesting a Mixed Document Corpus
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ingest_directory()&lt;/code&gt; walks the directory, detects each file's format, extracts text (with OCR for scanned content), chunks, embeds, and writes to SurrealDB. No Tesseract configuration required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;glob&lt;/code&gt; parameter follows &lt;code&gt;pathlib.Path.glob()&lt;/code&gt; syntax — &lt;code&gt;**/*&lt;/code&gt; walks all subdirectories recursively (default), &lt;code&gt;**/*.pdf&lt;/code&gt; scopes to PDFs only.&lt;/p&gt;

&lt;p&gt;For targeted ingestion or upload flows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Single file
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./reports/q4-2025.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Bytes — e.g. from an HTTP upload handler
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pdf_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upload://q4-2025.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deduplication
&lt;/h2&gt;

&lt;p&gt;kreuzberg-surrealdb computes a SHA-256 hash of the extracted text content and combines it with the chunk index to form the SurrealDB RecordID (pattern: &lt;code&gt;{content_hash}_{chunk_index}&lt;/code&gt;). All inserts use &lt;code&gt;INSERT IGNORE&lt;/code&gt;. Running &lt;code&gt;ingest_directory()&lt;/code&gt; twice on unchanged content is a complete no-op: zero new records, zero re-embedding calls.&lt;/p&gt;

&lt;p&gt;This differs meaningfully from LlamaIndex's &lt;code&gt;filename_as_id=True&lt;/code&gt; default. When you re-ingest the same file from a different path, LlamaIndex generates a new RecordID from the new path and creates a duplicate. kreuzberg-surrealdb hashes the content itself — same text from any path, same RecordID, same no-op.&lt;/p&gt;
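&lt;p&gt;The scheme is easy to reproduce. Here is a minimal sketch of the RecordID derivation (illustrative only; the library's exact text normalization and id encoding may differ):&lt;/p&gt;

```python
import hashlib

def chunk_record_id(extracted_text: str, chunk_index: int) -> str:
    # Hash the content, not the file path, so the same text ingested
    # from two different locations maps to the same RecordID.
    content_hash = hashlib.sha256(extracted_text.encode("utf-8")).hexdigest()
    return f"{content_hash}_{chunk_index}"

# Same text, different source paths: identical id, so INSERT IGNORE is a no-op.
assert chunk_record_id("Q4 revenue grew 12%.", 0) == chunk_record_id("Q4 revenue grew 12%.", 0)
```
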

&lt;p&gt;&lt;strong&gt;Honest limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sequential ingestion.&lt;/strong&gt; &lt;code&gt;ingest_files()&lt;/code&gt; and &lt;code&gt;ingest_directory()&lt;/code&gt; process files in a sequential loop. For high-throughput pipelines, use a queue-based architecture where independent workers each call &lt;code&gt;ingest_file()&lt;/code&gt; per document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No orphan deletion.&lt;/strong&gt; Files removed from the source directory stay in the database. Manual cleanup: &lt;code&gt;DELETE FROM documents WHERE source NOT IN $active_sources;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exact-match dedup only.&lt;/strong&gt; Two slightly different versions of the same document create two separate records. Near-duplicate detection isn't supported.&lt;/li&gt;
&lt;/ul&gt;
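&lt;p&gt;The queue-based workaround for the sequential loop fits in a few lines of asyncio. The &lt;code&gt;ingest&lt;/code&gt; callable below is a stand-in for &lt;code&gt;pipeline.ingest_file()&lt;/code&gt;; the worker count and sentinel handling are illustrative, not part of the library:&lt;/p&gt;

```python
import asyncio

async def _worker(queue: asyncio.Queue, ingest) -> int:
    # Drain the queue until a None sentinel arrives; one document per call.
    done = 0
    while (path := await queue.get()) is not None:
        await ingest(path)  # stand-in for pipeline.ingest_file(path)
        done += 1
    return done

async def ingest_concurrently(paths, ingest, workers: int = 4) -> int:
    queue: asyncio.Queue = asyncio.Queue()
    for p in paths:
        queue.put_nowait(p)
    for _ in range(workers):
        queue.put_nowait(None)  # one stop sentinel per worker
    counts = await asyncio.gather(*(_worker(queue, ingest) for _ in range(workers)))
    return sum(counts)
```

&lt;p&gt;Content-hash deduplication makes this safe: if two workers race on identical content, the second &lt;code&gt;INSERT IGNORE&lt;/code&gt; is a no-op.&lt;/p&gt;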

&lt;h2&gt;
  
  
  The Hybrid Search Payoff: How &lt;code&gt;search::rrf()&lt;/code&gt; Works
&lt;/h2&gt;

&lt;p&gt;Both indexes are now in SurrealDB — HNSW for vector retrieval, BM25 for full-text. SurrealDB's &lt;code&gt;search::rrf()&lt;/code&gt; combines them in a single query using Reciprocal Rank Fusion (RRF).&lt;/p&gt;

&lt;p&gt;Because RRF operates on ranked positions rather than raw scores, BM25's unbounded values and cosine similarity's 0–1 range are never directly compared. No score normalization. No alpha parameter. The formula (Cormack, Clarke &amp;amp; Buettcher, SIGIR 2009):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RRF_score(d) = Σ 1 / (k + rank_i(d))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;k=60&lt;/code&gt; is the smoothing constant from the original paper — not a tunable weight. It prevents top-ranked documents from dominating when they appear near rank 1 in only one list.&lt;/p&gt;
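&lt;p&gt;A toy reimplementation makes the behavior concrete (for illustration only; &lt;code&gt;search::rrf()&lt;/code&gt; does this server-side):&lt;/p&gt;

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: sum 1/(k + rank) across every ranked list
    # a document appears in, then sort by the fused score.
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
assert rrf([bm25_hits, vector_hits])[0] == "b"
```

&lt;p&gt;Note how &lt;code&gt;b&lt;/code&gt; wins: it is not rank 1 in either list, but it ranks high in both, which is exactly the behavior RRF rewards.&lt;/p&gt;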

&lt;h2&gt;
  
  
  Attribution: What Owns What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Owner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Extraction from 88+ formats, OCR&lt;/td&gt;
&lt;td&gt;kreuzberg (via kreuzberg-surrealdb)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunking and embedding&lt;/td&gt;
&lt;td&gt;kreuzberg (via kreuzberg-surrealdb)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW + BM25 index creation&lt;/td&gt;
&lt;td&gt;kreuzberg-surrealdb (&lt;code&gt;setup_schema()&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consistent query embedding&lt;/td&gt;
&lt;td&gt;kreuzberg-surrealdb (&lt;code&gt;embed_query()&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid fusion&lt;/td&gt;
&lt;td&gt;SurrealDB (&lt;code&gt;search::rrf()&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector + full-text retrieval&lt;/td&gt;
&lt;td&gt;SurrealDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk → document traversal&lt;/td&gt;
&lt;td&gt;SurrealDB record links&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  All Three Search Modes
&lt;/h2&gt;

&lt;p&gt;Always call &lt;code&gt;embed_query()&lt;/code&gt; before a vector or hybrid search. It ensures the query vector uses the same model and dimension as the stored chunk embeddings. A mismatch makes cosine similarity scores meaningless without raising an error.&lt;/p&gt;
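&lt;p&gt;A cheap guard catches that failure mode before the query reaches the database. This is a hypothetical helper, not part of the library; &lt;code&gt;EXPECTED_DIM&lt;/code&gt; must match your preset's dimension:&lt;/p&gt;

```python
EXPECTED_DIM = 768  # "balanced" preset; 384 for "fast", 1024 for "quality"

def check_query_embedding(vec: list, expected_dim: int = EXPECTED_DIM) -> list:
    # Fail loudly instead of letting a mismatched vector produce
    # silently meaningless cosine scores.
    if len(vec) != expected_dim:
        raise ValueError(
            f"query embedding has {len(vec)} dims, index expects {expected_dim}; "
            "was it produced by a different model?"
        )
    return vec
```
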

&lt;p&gt;&lt;strong&gt;Hybrid (BM25 + vector):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regulatory compliance Q4 2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT * FROM search::rrf([
      (SELECT id, content FROM chunks
       WHERE embedding &amp;lt;|10,COSINE|&amp;gt; $embedding),
      (SELECT id, content, search::score(1) AS score FROM chunks
       WHERE content @1@ $query
       ORDER BY score DESC LIMIT 10)
    ], 10, 60);
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vector-only:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;|&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;BM25-only:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Chunk → parent document traversal&lt;/strong&gt; (no JOIN, no second query):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;document&lt;/code&gt; field on each chunk is a SurrealDB record link — dot notation traverses it inline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Each Retriever Fails
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;BM25 fails on:&lt;/strong&gt; paraphrased queries, vocabulary mismatch ("car" vs "automobile"), semantic synonyms, conceptual proximity without term overlap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector fails on:&lt;/strong&gt; exact product codes, named entities, precise version strings, rare technical terms, regulation IDs, serial numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid RRF covers both.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Filtering by OCR Quality
&lt;/h2&gt;

&lt;p&gt;Low-quality extraction degrades both retrievers. Filter on &lt;code&gt;quality_score&lt;/code&gt; before retrieval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rrf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;|&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tuning HNSW and BM25 Parameters
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;setup_schema()&lt;/code&gt; exposes four tunable index parameters, plus the distance metric. The defaults work well for 256–512 token chunks in typical document corpora.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;hnsw_efc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# Higher = better recall, slower index build
&lt;/span&gt;    &lt;span class="n"&gt;hnsw_m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Higher = better recall, more memory per node
&lt;/span&gt;    &lt;span class="n"&gt;distance_metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COSINE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bm25_k1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# Term-frequency saturation
&lt;/span&gt;    &lt;span class="n"&gt;bm25_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;                &lt;span class="c1"&gt;# Length normalization
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;When to tune&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hnsw_efc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;Large corpora (100K+ docs) where recall matters more than indexing speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hnsw_m&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;High-dimensional embeddings (1024-dim); memory is available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bm25_k1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;td&gt;Technical corpora with high term repetition (code, legal docs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bm25_b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;Corpora with highly variable document lengths&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Parameters are fixed at schema creation time. Changing them requires dropping and recreating the indexes and re-embedding the full corpus. Pick before ingesting production data.&lt;/p&gt;
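&lt;p&gt;The DDL those parameters feed into can be sketched as a plain string template, which makes the coupling explicit. This rendering is hypothetical; the actual statement &lt;code&gt;setup_schema()&lt;/code&gt; emits is the one shown at the top of this post:&lt;/p&gt;

```python
def hnsw_index_ddl(dimension: int = 768, efc: int = 150, m: int = 12,
                   metric: str = "COSINE") -> str:
    # Mirrors the DEFINE INDEX statement shown earlier; changing any
    # parameter means dropping and redefining the index.
    return (
        "DEFINE INDEX idx_chunk_embedding ON chunks FIELDS embedding "
        f"HNSW DIMENSION {dimension} TYPE F32 DIST {metric} EFC {efc} M {m};"
    )

assert "DIMENSION 768" in hnsw_index_ddl()
```
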

&lt;h2&gt;
  
  
  Why Not pgvector + Qdrant?
&lt;/h2&gt;

&lt;p&gt;Running pgvector and Qdrant separately means two write paths, two uptime SLAs, and no ACID guarantees across them. Here's a failure mode every engineer hits eventually: a Qdrant write succeeds; a Postgres write fails during a network partition. The vector store now holds an embedding whose parent document record doesn't exist. Your search returns a chunk with no context — no source, no metadata, no document link. The retry wrapper is still on the backlog.&lt;/p&gt;

&lt;p&gt;kreuzberg-surrealdb's &lt;code&gt;ingest_directory()&lt;/code&gt; writes to &lt;code&gt;documents&lt;/code&gt; and &lt;code&gt;chunks&lt;/code&gt; in the same database. Both the HNSW index and the BM25 index are maintained within the same transaction. &lt;code&gt;search::rrf()&lt;/code&gt; runs inside that same database — no cross-service retrieval latency, no dual-write coordination. The record link from &lt;code&gt;chunks.document&lt;/code&gt; to &lt;code&gt;documents&lt;/code&gt; is always consistent because both were written in the same transaction.&lt;/p&gt;

&lt;p&gt;The LangChain &lt;code&gt;EnsembleRetriever&lt;/code&gt; compounds the problem: two separate HTTP calls to two separate systems, merged in Python with a hardcoded &lt;code&gt;weights&lt;/code&gt; parameter. Weights don't apply to a rank-based algorithm; that mismatch is baked into the design. &lt;code&gt;search::rrf()&lt;/code&gt; doesn't have this problem.&lt;/p&gt;
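
&lt;p&gt;To see why per-retriever weights have no place in a rank-based algorithm, here is a minimal Python sketch of Reciprocal Rank Fusion. It only illustrates the scoring rule, not SurrealDB's internal implementation; the constant &lt;code&gt;k=60&lt;/code&gt; comes from the original RRF paper:&lt;/p&gt;

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF).
# Each result list contributes 1 / (k + rank) per document; scores are summed.
# Only ranks matter, which is why score weights from different retrievers
# (BM25 scores vs. cosine distances) never need to be reconciled.

def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["a", "b", "c"]    # ranked output of the keyword retriever
vector_hits = ["b", "c", "d"]  # ranked output of the vector retriever
print(rrf([bm25_hits, vector_hits]))  # ['b', 'c', 'a', 'd']
```

&lt;p&gt;Document &lt;code&gt;b&lt;/code&gt; wins because it ranks highly in both lists, without any weight tuning.&lt;/p&gt;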

&lt;p&gt;&lt;strong&gt;Honest trade-offs:&lt;/strong&gt; SurrealDB isn't Elasticsearch. At very large scale — hundreds of millions of vectors — specialized vector databases have more managed hosting options and operational tooling. &lt;code&gt;ingest_files()&lt;/code&gt; is sequential; high-throughput batch ingestion requires a queue-based architecture regardless of which database you're using. As of SurrealDB v3, there's no managed cloud option at scale. Verify current hosting options before adopting this stack for production infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What SurrealDB version is required?&lt;/strong&gt; &lt;code&gt;search::rrf()&lt;/code&gt; requires SurrealDB v3. It is not available in v2. BM25 and vector search work separately on v2, but not the combined hybrid query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use a custom embedding model?&lt;/strong&gt; Yes, via &lt;code&gt;EmbeddingModelType.fastembed()&lt;/code&gt; or &lt;code&gt;EmbeddingModelType.custom()&lt;/code&gt;. You must provide &lt;code&gt;embedding_dimensions&lt;/code&gt; explicitly. All chunks and queries must use the same model and dimension, since SurrealDB enforces dimension consistency server-wide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is ingestion concurrent or sequential?&lt;/strong&gt; &lt;code&gt;ingest_files()&lt;/code&gt; and &lt;code&gt;ingest_directory()&lt;/code&gt; are sequential. For high-throughput pipelines, use a queue-based architecture with one worker per document. &lt;code&gt;ingest_bytes()&lt;/code&gt; can be called concurrently from multiple coroutines.&lt;/p&gt;
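
&lt;p&gt;A minimal sketch of the queue-based pattern, assuming an async per-document ingest step. The &lt;code&gt;ingest_one&lt;/code&gt; placeholder stands in for a call such as &lt;code&gt;ingest_bytes()&lt;/code&gt;; the worker count and names are illustrative:&lt;/p&gt;

```python
import asyncio

async def ingest_one(doc: str) -> str:
    # Placeholder for the real per-document ingest call (e.g. ingest_bytes()).
    await asyncio.sleep(0)
    return f"ingested:{doc}"

async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker drains the shared queue until cancelled.
    while True:
        doc = await queue.get()
        results.append(await ingest_one(doc))
        queue.task_done()

async def ingest_all(docs, concurrency: int = 4) -> list:
    queue, results = asyncio.Queue(), []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    for doc in docs:
        queue.put_nowait(doc)
    await queue.join()  # wait until every queued document is processed
    for w in workers:
        w.cancel()
    return results

print(sorted(asyncio.run(ingest_all(["a.pdf", "b.pdf", "c.pdf"]))))
```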

&lt;p&gt;&lt;strong&gt;What happens to records for deleted files?&lt;/strong&gt; Nothing automatic. Records remain until manually removed. See orphan cleanup above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pip install kreuzberg-surrealdb&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/kreuzberg-dev/kreuzberg-surrealdb" rel="noopener noreferrer"&gt;github.com/kreuzberg-dev/kreuzberg-surrealdb&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Deduplication demo: &lt;code&gt;examples/incremental_ingest.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>vectordatabase</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How to Extract Text from PDF in Python (2026)</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Mon, 16 Mar 2026 09:21:35 +0000</pubDate>
      <link>https://forem.com/kreuzberg/how-to-extract-text-from-pdf-in-python-2026-3a97</link>
      <guid>https://forem.com/kreuzberg/how-to-extract-text-from-pdf-in-python-2026-3a97</guid>
      <description>&lt;p&gt;Extracting text from PDFs is still one of the most common tasks in data engineering, AI pipelines, and automation workflows. Whether you're building a search system, a retrieval-augmented generation (RAG) pipeline, or simply processing reports, the first step is turning PDFs into clean, usable text.&lt;/p&gt;

&lt;p&gt;At first glance this sounds simple, but PDFs were never designed to be machine-readable in the way modern formats are. A PDF is essentially a set of instructions describing how a page should look, not a structured representation of paragraphs, headings, or tables. That means text may be stored in fragments, positioned arbitrarily, or embedded as images.&lt;/p&gt;

&lt;p&gt;Because of this, native extraction often produces broken sentences, incorrect reading order, or missing content. Modern tools try to reconstruct structure rather than just reading raw text streams, which is why the choice of extraction method matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  How PDF Text Extraction Works
&lt;/h2&gt;

&lt;p&gt;Most PDF extraction pipelines follow the same high-level process. First, the document is parsed page by page. Then text blocks are detected and assembled into a readable order. If the document contains scanned pages instead of selectable text, OCR is applied. Finally, the output is normalized so it can be indexed, searched, or passed to downstream systems.&lt;/p&gt;

&lt;p&gt;Even though this workflow sounds straightforward, each step contains a surprising amount of complexity. Reading order detection, for example, becomes difficult in multi-column layouts or technical documents. Tables introduce another layer of difficulty, because the visual structure does not always map cleanly to text.&lt;br&gt;
This is why many teams eventually move beyond simple PDF libraries to more complete document processing frameworks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Extracting Text from a PDF in Python
&lt;/h2&gt;

&lt;p&gt;In Python, the basic workflow for extracting text usually looks the same regardless of the library being used. A document is loaded, parsed, and converted into text that can be printed, stored, or processed further. Different libraries use different APIs, but the general pattern remains consistent. The real differences appear in how well they handle layout, performance, and OCR.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using Kreuzberg for PDF Extraction
&lt;/h2&gt;

&lt;p&gt;Modern document pipelines often require more than just reading text streams. They need consistent metadata, reliable handling of different formats, and good performance when processing large batches of files.&lt;/p&gt;

&lt;p&gt;Kreuzberg is designed for this type of workload. It uses a Rust-based extraction engine with Python bindings (and supports 11 other programming languages as of March 2026), enabling efficient document processing while integrating smoothly into Python pipelines.&lt;/p&gt;

&lt;p&gt;Here is how to get started with Kreuzberg in Python. First, install the package:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install kreuzberg&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For the simplest case — extracting text from a PDF synchronously — use &lt;code&gt;extract_file_sync&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
print(f"Pages: {result.metadata['page_count']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you are working in an async context, the async variant works identically:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
from kreuzberg import extract_file

async def main():
    result = await extract_file("document.pdf")
    print(result.content)
    print(f"Tables found: {len(result.tables)}")

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ExtractionResult&lt;/code&gt; object returned by both variants gives you &lt;code&gt;result.content&lt;/code&gt; for the extracted text, &lt;code&gt;result.tables&lt;/code&gt; for any detected tables, and &lt;code&gt;result.metadata&lt;/code&gt; for document properties like page count and format type. To process multiple PDFs at once, use the batch extraction functions, which handle concurrency automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;batch_extract_files_sync&lt;/span&gt;

&lt;span class="n"&gt;paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;batch_extract_files_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; characters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For scanned PDFs, enable OCR by passing an &lt;code&gt;ExtractionConfig&lt;/code&gt; with an &lt;code&gt;OcrConfig&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_file_sync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OcrConfig&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ocr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OcrConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tesseract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_file_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scanned.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Chinese documents, you can also use PaddleOCR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_file_sync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OcrConfig&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ocr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OcrConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paddleocr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_file_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scanned.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted content (preview): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total characters: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a &lt;code&gt;libonnxruntime.so&lt;/code&gt; loading error, install &lt;code&gt;onnxruntime&lt;/code&gt; first:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python -m pip install --upgrade onnxruntime&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If the error still persists on Linux, add the &lt;code&gt;onnxruntime/capi&lt;/code&gt; directory to &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; before running your script (replace the path with your actual venv location):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export LD_LIBRARY_PATH="&amp;lt;venv&amp;gt;/lib/pythonX.Y/site-packages/onnxruntime/capi:$LD_LIBRARY_PATH"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Kreuzberg supports Tesseract, EasyOCR, and PaddleOCR as backends, which is useful for multilingual documents where backend quality varies by language.&lt;/p&gt;
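
&lt;p&gt;One way to exploit this is a small selection policy that picks a backend per language. The mapping below is an illustrative assumption for this example, not behavior built into Kreuzberg:&lt;/p&gt;

```python
# Illustrative backend-selection policy for multilingual pipelines.
# The language-to-backend mapping here is an assumption, not Kreuzberg's
# built-in behavior; tune it against your own documents.
PREFERRED_BACKEND = {"zh": "paddleocr", "ja": "easyocr", "ko": "easyocr"}

def pick_ocr_backend(language: str) -> str:
    # Fall back to Tesseract, which covers a broad range of languages.
    return PREFERRED_BACKEND.get(language, "tesseract")

print(pick_ocr_backend("zh"))   # paddleocr
print(pick_ocr_backend("eng"))  # tesseract
```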

&lt;h2&gt;
  
  
  Handling Scanned PDFs
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges in real-world workflows is dealing with scanned documents. These files contain images instead of selectable text, so extraction requires optical character recognition.&lt;/p&gt;

&lt;p&gt;A modern pipeline typically detects when text is missing and automatically runs OCR before merging the results into the document structure. The quality of OCR depends heavily on language, resolution, and document quality, which is why systems that allow different OCR backends are often more reliable in multilingual environments.&lt;/p&gt;
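
&lt;p&gt;Detecting "text is missing" can be approximated with a cheap heuristic on the native extraction output before falling back to OCR. The threshold below is an illustrative assumption, not a tuned value:&lt;/p&gt;

```python
def needs_ocr(extracted_text: str, page_count: int = 1,
              min_chars_per_page: int = 25) -> bool:
    # If native extraction yields almost no selectable text per page,
    # the pages are likely scanned images and should go through OCR.
    visible = "".join(extracted_text.split())
    has_enough_text = len(visible) >= min_chars_per_page * page_count
    return not has_enough_text

print(needs_ocr("", page_count=3))  # True: no selectable text at all
```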

&lt;h2&gt;
  
  
  Extracting Tables and Structured Content
&lt;/h2&gt;

&lt;p&gt;Tables are another area where simple extraction approaches struggle. Even when the text is captured correctly, the relationships between rows and columns may be lost.&lt;br&gt;
More advanced extraction pipelines attempt to detect table regions and preserve structure so that data remains usable. This is particularly important in financial reports, research papers, and operational documents where tables often contain the most important information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and Scaling Considerations
&lt;/h2&gt;

&lt;p&gt;Performance becomes increasingly important as soon as you begin processing more than a handful of files. Batch ingestion, RAG pipelines, and search indexing workflows may involve thousands or millions of documents, and inefficiencies at the parsing stage quickly become expensive.&lt;/p&gt;

&lt;p&gt;Several factors influence performance, including how the parsing engine is implemented, how memory is managed, and how well the system supports concurrency. Tools that rely heavily on interpreted execution or external subprocesses often encounter bottlenecks at scale, while native parsing engines tend to perform better under sustained workloads.&lt;br&gt;
This is one reason many modern document processing tools use compiled cores with language bindings on top.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where PDF Extraction Fits in a Modern Pipeline
&lt;/h2&gt;

&lt;p&gt;In most real systems, text extraction is only the first step. Once text is available, it is typically split into chunks, converted into embeddings, and stored in a vector database for retrieval.&lt;br&gt;
This architecture has become standard for document search and RAG systems because it allows large collections of documents to be queried efficiently. Reliable extraction is the foundation that makes everything else possible.&lt;/p&gt;
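
&lt;p&gt;The chunking step can be as simple as a fixed-size splitter with overlap. The sizes here are illustrative; production chunkers typically split on sentence or section boundaries instead:&lt;/p&gt;

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size chunks with a small overlap so context isn't cut mid-thought.
    chunks, start = [], 0
    while len(text) > start:
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

doc = "x" * 1200
print([len(c) for c in chunk_text(doc)])  # [500, 500, 300]
```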

&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;Developers new to PDF extraction often assume that all PDFs behave the same way. In reality, documents vary widely in structure and quality, and a pipeline that works well for one dataset may fail on another.&lt;br&gt;
It is always worth testing extraction using a mix of documents, including scanned files, multi-column layouts, and large reports. Problems usually appear quickly under realistic conditions.&lt;br&gt;
Another common mistake is ignoring metadata. Information such as page numbers, titles, and document structure often becomes critical later, especially when building retrieval systems that need to cite sources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Extracting text from PDFs in Python is easier than it was a few years ago, but the fundamental challenges of document structure and layout remain. Choosing tools that handle these complexities well can significantly improve the quality of downstream systems, from search to RAG to analytics. Once the ingestion layer is reliable, the rest of the pipeline becomes far easier to design and maintain.&lt;/p&gt;

</description>
      <category>python</category>
      <category>pdf</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Kreuzberg vs. Unstructured.io: Benchmarks and Architecture Comparison (March 2026)</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Mon, 02 Mar 2026 14:33:56 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzberg-vs-unstructuredio-benchmarks-and-architecture-comparison-march-2026-2ogf</link>
      <guid>https://forem.com/kreuzberg/kreuzberg-vs-unstructuredio-benchmarks-and-architecture-comparison-march-2026-2ogf</guid>
      <description>&lt;h1&gt;
  
  
  Kreuzberg vs Unstructured: Benchmarks and Architecture Comparison (March 2026)
&lt;/h1&gt;

&lt;p&gt;When building document pipelines, the choice of extraction library directly impacts performance, infrastructure costs, and reliability. &lt;a href="https://medium.com/@premchandak_11/stop-using-unstructured-io-kreuzberg-v4-0-is-the-rust-powered-successor-db1c154ab21a" rel="noopener noreferrer"&gt;Two tools that developers compare&lt;/a&gt; are &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; and &lt;a href="https://github.com/Unstructured-IO/unstructured" rel="noopener noreferrer"&gt;Unstructured.io&lt;/a&gt;. Both can extract text and metadata from documents, but they differ significantly in architecture and behavior under load. The purpose of this comparison is not to declare a universal winner but to explain where the differences come from and when they matter in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Difference: Architecture
&lt;/h2&gt;

&lt;p&gt;The most important distinction between the two systems is architectural. &lt;a href="https://github.com/Unstructured-IO/unstructured" rel="noopener noreferrer"&gt;Unstructured.io&lt;/a&gt; is primarily Python-based and designed around flexible pipelines and integrations, which makes &lt;a href="https://github.com/Unstructured-IO/unstructured" rel="noopener noreferrer"&gt;its open source library&lt;/a&gt; convenient for rapid prototyping and experimentation in Python-centric environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; takes a different approach. Its extraction engine is implemented in Rust and exposed through bindings to multiple languages. In March 2026, Kreuzberg supports 12 programming languages: Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, WASM, and TypeScript/Node.js. This design moves performance-critical work into compiled native code while keeping the developer experience accessible in Python and other stacks.&lt;/p&gt;

&lt;p&gt;This is important for performance because compiled native code runs directly on the CPU without an interpreter or runtime in between, unlike Python, which adds an extra layer of execution overhead (bytecode interpretation, dynamic typing, and runtime dispatch).&lt;/p&gt;

&lt;p&gt;In practical terms, this architectural difference affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how memory is allocated and reused (e.g. predictable allocation patterns vs Python object overhead)
&lt;/li&gt;
&lt;li&gt;how concurrency is implemented (native threads vs multiprocessing or async workarounds)
&lt;/li&gt;
&lt;li&gt;how much work happens across process boundaries (serialization/deserialization costs)
&lt;/li&gt;
&lt;li&gt;and how much runtime initialization is required before processing begins
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These factors become increasingly noticeable when processing large batches of files or running pipelines continuously in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Benchmarks Measure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;Document-processing benchmarks&lt;/a&gt; are meaningful only when they measure more than raw speed. Throughput, latency percentiles, memory usage, installation size, cold start time, and extraction reliability all contribute to real-world performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;Kreuzberg’s published benchmarks&lt;/a&gt; use reproducible tests and real-world datasets containing PDFs, Office documents, images, and multilingual text. The benchmark harness runs continuously and is designed to isolate extraction performance from external bottlenecks (e.g. network or I/O), so results reflect the behavior of the extraction engine itself.&lt;/p&gt;

&lt;p&gt;This is important because pipelines often behave very differently depending on file type and layout complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Throughput in Practice
&lt;/h2&gt;

&lt;p&gt;In comparative benchmarks, Kreuzberg has demonstrated significantly higher throughput than many Python-only pipelines, in some cases processing documents roughly 9–50x faster on average across tested workloads. Results vary depending on document type and configuration, but the pattern is consistent: architecture matters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5u6wwsfszxmp44vwq97f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5u6wwsfszxmp44vwq97f.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Snapshot from 2 March 2026 of the Kreuzberg (Rust) PDF benchmarks, showing duration and quality score. See the current benchmarks here: &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;https://kreuzberg.dev/benchmarks&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These differences arise from several factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native parsing avoids interpreter overhead and reduces per-operation cost
&lt;/li&gt;
&lt;li&gt;Tight loops and parsing routines can be optimized at compile time
&lt;/li&gt;
&lt;li&gt;Memory locality and cache efficiency are improved in compiled code
&lt;/li&gt;
&lt;li&gt;Parallel execution can be implemented without the constraints of the Python GIL
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When processing large datasets, these advantages compound. Even small per-document overhead differences (e.g. a few milliseconds) can translate into minutes or hours of total runtime at scale.&lt;/p&gt;
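
&lt;p&gt;A quick back-of-envelope calculation (with illustrative numbers) shows how this compounds:&lt;/p&gt;

```python
# Back-of-envelope: a 5 ms per-document overhead across one million documents
# adds roughly 83 minutes of extra sequential runtime.
docs = 1_000_000
overhead_ms = 5
total_minutes = docs * overhead_ms / 1000 / 60
print(f"{total_minutes:.1f} extra minutes")  # 83.3 extra minutes
```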

&lt;h2&gt;
  
  
  Installation Size and Operational Considerations
&lt;/h2&gt;

&lt;p&gt;If you’ve ever deployed a large-scale system, you know that another practical difference appears in installation size and dependency complexity. Benchmarks have shown smaller installation footprints and fewer dependencies for Kreuzberg compared with heavier Python frameworks like Unstructured.&lt;/p&gt;

&lt;p&gt;Python-based pipelines often depend on multiple layers of libraries (parsers, OCR tools, system packages), which increases container size and introduces more potential points of failure or version conflicts. In contrast, Kreuzberg packages much of its functionality into a compiled core, reducing the need for large runtime dependency chains.&lt;/p&gt;

&lt;p&gt;While this may seem like a minor detail, it affects container build times, CI pipelines, and cold start latency. In large distributed systems, operational efficiency often matters as much as raw processing speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cold Start and Resource Efficiency
&lt;/h2&gt;

&lt;p&gt;Cold start time becomes particularly important in serverless or autoscaling environments. Systems that rely on large dependency stacks or complex initialization routines may take longer to start, increasing latency and costs.&lt;/p&gt;

&lt;p&gt;Native engines with smaller runtime requirements tend to start faster because there is less dynamic initialization (e.g. importing modules, resolving dependencies, initializing interpreters). They also tend to use memory more efficiently due to tighter control over allocation and fewer intermediary objects.&lt;/p&gt;

&lt;p&gt;Lower memory usage also allows higher parallelism and better performance on smaller machines, which directly impacts cost efficiency in production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extraction Quality
&lt;/h2&gt;

&lt;p&gt;Performance alone is not enough. Extraction quality depends on document type, layout complexity, language, and OCR requirements. Some pipelines may perform better on certain classes of documents than others, which is why testing on representative datasets is essential. Benchmarks provide useful signals, but they cannot replace real-world evaluation.&lt;/p&gt;

&lt;p&gt;In our own benchmarks (early 2026; &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;see the benchmark run and live dashboard&lt;/a&gt;), these tradeoffs show up clearly in practice. Across mixed real-world datasets (PDFs, Office documents, HTML, and scanned images), we observe that performance and extraction behavior vary significantly by document class.&lt;/p&gt;

&lt;p&gt;On clean, text-based PDFs, multiple pipelines achieve high success rates (often &amp;gt;95%) with relatively stable latency. In contrast, on OCR-heavy or layout-complex documents (e.g. scanned pages, tables, multilingual content), both latency and extraction consistency diverge more noticeably.&lt;/p&gt;

&lt;p&gt;In our runs, throughput differences between approaches range from ~9× to as high as ~50× depending on the workload, while structured extraction success rates can drop by 10–30 percentage points on harder document types. These effects are often linked to how each system handles layout detection, OCR integration, and post-processing heuristics.&lt;/p&gt;

&lt;p&gt;These figures are drawn from our late February 2026 benchmark runs and will evolve, but they consistently reinforce the same point: performance and quality are highly dependent on file type, language, and processing mode (single-file vs batch), which is why evaluating on representative datasets is critical before making production decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing Between the Two
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Unstructured-IO/unstructured" rel="noopener noreferrer"&gt;Unstructured.io&lt;/a&gt; has built a reputation as a flexible, developer-friendly toolkit for working with unstructured data. Its open-source library can be a strong choice for rapid prototyping, especially in Python-heavy environments where flexibility and ecosystem integration are the primary concerns. For smaller workloads or experimental projects, this can be an excellent fit. For production use at scale, bigger teams rely on &lt;a href="https://unstructured.io/" rel="noopener noreferrer"&gt;Unstructured’s hosted platform&lt;/a&gt;, which is tailored more toward enterprise deployments and managed workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg open source&lt;/a&gt; tends to be particularly attractive in production ingestion pipelines, where throughput, resource efficiency, and predictable performance become more important. In these environments, architectural differences translate directly into cost savings and faster processing. Many teams and companies are already using the OSS in their workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg Cloud&lt;/a&gt; will be a fully managed, core-available solution for teams who need reduced complexity and excellent results in a single API. The library will remain MIT-licensed (permissive open-source) forever. The commercial offering is being built around the core library, not by restricting the library itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Document processing is the foundation of modern AI and search systems. In production environments, pipelines often need to handle millions, or even tens of millions, of documents, where even small inefficiencies compound quickly.&lt;/p&gt;

&lt;p&gt;The ingestion layer, including extraction, chunking, and embedding preparation, determines how fast data can be processed, how reliably it can be retrieved, and how much infrastructure is required to operate the system at scale.&lt;/p&gt;

&lt;p&gt;Testing with real data remains the most reliable way to decide, and you’re welcome to rerun Kreuzberg’s newly published comparative benchmarks from GitHub, either in batch mode or on a single file. The benchmarks run continuously in CI.&lt;/p&gt;

&lt;p&gt;Benchmarks consistently show that architecture plays a major role in performance. Native engines, efficient memory usage, and minimal dependencies often lead to higher throughput and lower operational cost. At the same time, the best tool always depends on the documents being processed and the requirements of the system.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>document</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building a RAG pipeline with Kreuzberg and LangChain</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Mon, 23 Feb 2026 09:28:56 +0000</pubDate>
      <link>https://forem.com/kreuzberg/building-a-rag-pipeline-with-kreuzberg-and-langchain-3gj2</link>
      <guid>https://forem.com/kreuzberg/building-a-rag-pipeline-with-kreuzberg-and-langchain-3gj2</guid>
      <description>&lt;p&gt;Most discussions about retrieval-augmented generation (RAG) focus on choosing the right model, tuning prompts, or experimenting with vector databases. In practice, these are rarely the hardest parts. The real bottleneck appears much earlier: getting clean, reliable text out of messy documents.&lt;/p&gt;

&lt;p&gt;There is a real challenge in ingestion, chunking, and embeddings. PDFs preserve visual layout rather than logical structure, Office files rely on completely different internal formats, and scanned documents require OCR before any text exists at all. Metadata is often incomplete or inconsistent, and small problems at this stage propagate downstream. If the extraction quality is poor, retrieval becomes unreliable, and the language model begins to produce weak or misleading answers.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; plays a central role, covering the entire early-stage data flow: document ingestion, text chunking, and embedding generation. A typical RAG pipeline can combine Kreuzberg for ingestion, chunking, and embeddings with LangChain as the orchestration layer, alongside a vector database and an LLM. While the architecture is fairly standard, the quality of the early steps determines everything that follows.&lt;/p&gt;

&lt;p&gt;Embeddings are numerical vector representations of text. An embedding model converts a piece of text, such as a sentence, paragraph, or document, into a list of numbers that captures its semantic meaning. Texts with similar meanings end up close to each other in this high-dimensional vector space, making it possible to search by meaning rather than exact keywords. If you haven’t seen this before, the &lt;a href="https://projector.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow Embedding Projector&lt;/a&gt; is a useful way to visualize how embeddings cluster similar concepts together.&lt;/p&gt;

&lt;p&gt;Here are the steps to a RAG pipeline with &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; and &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract text from a sample PDF and DOCX using Kreuzberg &lt;/li&gt;
&lt;li&gt;Inspect the raw output and metadata to understand what high-quality extraction looks like&lt;/li&gt;
&lt;li&gt;Chunk the text using a concrete strategy (recursive splitting with overlap) with Kreuzberg&lt;/li&gt;
&lt;li&gt;Generate embeddings with Kreuzberg and store them in a vector database such as Chroma or FAISS&lt;/li&gt;
&lt;li&gt;Wire everything together with LangChain and run a query end-to-end&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the examples below, we'll use Kreuzberg's Python bindings.&lt;/p&gt;

&lt;p&gt;Begin by installing dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install kreuzberg langchain chromadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, extract text from your document.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kreuzberg import extract

# Extract from a PDF
pdf_result = extract("sample.pdf")

# Extract from a DOCX
docx_result = extract("sample.docx")

print(pdf_result.text[:500])
print(pdf_result.metadata)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this stage, you receive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clean extracted text&lt;/li&gt;
&lt;li&gt;structured metadata&lt;/li&gt;
&lt;li&gt;page-level and document-level information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After that, chunk the extracted text. Instead of manually splitting strings, use Kreuzberg’s built-in chunking configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kreuzberg import extract, ChunkingConfig

result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50
    )
)

# Access generated chunks
for chunk in result.chunks[:3]:
    print(chunk.content)
    print(chunk.metadata)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Embeddings with Kreuzberg are the next step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kreuzberg import extract, ChunkingConfig, EmbeddingConfig

result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50
    ),
    embedding=EmbeddingConfig(
        preset="sentence-transformers/all-MiniLM-L6-v2"
    )
)

# Each chunk now contains an embedding vector
first_chunk = result.chunks[0]

print(len(first_chunk.embedding))  # vector dimension
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, store the embeddings in a vector database (for example, &lt;a href="https://www.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(anonymized_telemetry=False))
collection = client.create_collection("documents")

for chunk in result.chunks:
    collection.add(
        documents=[chunk.content],
        metadatas=[chunk.metadata],
        embeddings=[chunk.embedding],
        ids=[chunk.id]
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And query with LangChain. LangChain orchestrates retrieval and generation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

vectorstore = Chroma(
    collection_name="documents",
    embedding_function=None  # embeddings already computed
)

retriever = vectorstore.as_retriever()

llm = ChatOpenAI(model="gpt-4o-mini")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever
)

response = qa_chain.run("What is this document about?")
print(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangChain connects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the retriever (vector database)&lt;/li&gt;
&lt;li&gt;the prompt template&lt;/li&gt;
&lt;li&gt;the LLM&lt;/li&gt;
&lt;li&gt;the final response pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What You Just Built
&lt;/h2&gt;

&lt;p&gt;You now have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;document ingestion (Kreuzberg)&lt;/li&gt;
&lt;li&gt;structured chunking (Kreuzberg)&lt;/li&gt;
&lt;li&gt;embedding generation (Kreuzberg)&lt;/li&gt;
&lt;li&gt;vector storage (Chroma)&lt;/li&gt;
&lt;li&gt;retrieval orchestration (LangChain)&lt;/li&gt;
&lt;li&gt;answer synthesis (LLM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a complete, end-to-end RAG pipeline, and a solid skeleton to harden for production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Document Processing Can Be the Hardest Part of RAG
&lt;/h2&gt;

&lt;p&gt;Many tutorials focus heavily on embeddings and prompting, but teams that deploy real systems quickly discover that data preparation is the bottleneck. Production pipelines must deal with complex layouts, multiple file formats, scanned documents, large batches, and multilingual content.&lt;/p&gt;

&lt;p&gt;Kreuzberg is designed specifically for this layer. It transforms heterogeneous documents into clean, structured outputs that downstream systems can reliably use. In a typical RAG pipeline, Kreuzberg sits at the beginning, extracting text, structuring metadata, chunking content, and generating embeddings in a consistent and unified way.&lt;/p&gt;

&lt;p&gt;A useful way to visualize the flow is as a sequence of transformations: documents are extracted, divided into smaller segments, converted into embeddings, stored in a vector database, retrieved in response to a query, and finally synthesized by a language model. Every stage depends on the quality of the one before it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture of a RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;Although implementations differ, most pipelines follow the same logical progression. Documents are first ingested and normalized. The extracted text is then split into chunks of manageable size, after which embeddings are generated and stored in a searchable index. When a user asks a question, the system retrieves the most relevant chunks and passes them to an LLM for synthesis.&lt;/p&gt;

&lt;p&gt;One of the strengths of the RAG pattern is that each stage can be swapped independently. The ingestion engine, embedding model, database, and LLM can all be replaced without redesigning the entire system. Keeping these concerns separated makes pipelines easier to evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extracting Text from Documents
&lt;/h2&gt;

&lt;p&gt;The first stage is always extraction. In practice, this involves reading files in multiple formats, detecting whether text is embedded or must be recovered through OCR, and preserving structural or metadata information whenever possible.&lt;/p&gt;

&lt;p&gt;After this step, the system has clean text, document metadata, and often page-level or structural information. This output becomes the foundation for everything that follows, and in Kreuzberg’s case, it directly feeds into chunking and embedding generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunking and Embeddings
&lt;/h2&gt;

&lt;p&gt;Once text has been extracted, it must be divided into smaller segments. Large documents cannot be embedded or retrieved efficiently as a single block. The goal of chunking is not only to reduce size but also to preserve meaning. Splitting in the wrong place can destroy context and reduce retrieval accuracy.&lt;/p&gt;
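
&lt;p&gt;To make the overlap idea concrete, here is a minimal character-window chunker in plain Python. It is a simplified stand-in for illustration, not Kreuzberg's actual recursive strategy:&lt;/p&gt;

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into fixed-size windows that share `overlap` characters."""
    step = max(1, chunk_size - overlap)  # guard against overlap as large as chunk_size
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Adjacent chunks share their boundary characters, so a sentence cut at a
# window edge still appears intact in one of its two neighbors.
chunks = chunk_with_overlap("a" * 1000)
print(len(chunks), [len(c) for c in chunks])
```

&lt;p&gt;Recursive strategies improve on this by preferring paragraph and sentence boundaries first, falling back to fixed windows only when a segment is still too large.&lt;/p&gt;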

&lt;p&gt;This step is especially critical because the semantic models used in RAG systems are designed to capture relationships across sequences of text. Many models effectively learn patterns in both directions, allowing them to understand context beyond individual tokens. The way text is chunked directly affects how well these relationships are preserved in the resulting embeddings.&lt;/p&gt;

&lt;p&gt;After chunking, each segment is converted into a vector representation. At this point, each chunk becomes a structured record consisting of text, metadata, and an embedding vector. Kreuzberg handles both chunking and embedding generation, reducing complexity and ensuring consistency across the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval and Answer Generation
&lt;/h2&gt;

&lt;p&gt;When a user submits a query, the pipeline converts it into an embedding and searches the vector database for similar entries. In practice, this means finding the chunks whose representations are closest to the query in semantic space.&lt;/p&gt;
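
&lt;p&gt;"Closest in semantic space" usually means highest cosine similarity. The sketch below shows the idea in plain Python with toy 2-D vectors; production vector databases use optimized approximate-nearest-neighbor indexes instead of this brute-force scan:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# The query vector points mostly along the first axis, so the second
# chunk vector (index 1) is its nearest neighbor.
print(top_k([1.0, 0.1], [[0.0, 1.0], [0.9, 0.2], [-1.0, 0.0]], k=1))
```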

&lt;p&gt;Frameworks like &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; orchestrate this process, connecting retrieval, prompting, and generation into a single workflow. They also make it possible to refine retrieval, for example, through filtering, ranking, or hybrid search, so that the most relevant context is passed to the language model.&lt;/p&gt;

&lt;p&gt;An important detail is that the model never sees the entire dataset. It only receives a carefully selected subset of chunks. The quality of this selection determines the quality of the final answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling a RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;Once a pipeline works on a small dataset, real-world deployments introduce additional requirements. Ingestion must handle large volumes of files and often run in parallel. Retrieval systems benefit from metadata filtering and hybrid search strategies, and generation layers often include structured prompts or citation mechanisms.&lt;/p&gt;

&lt;p&gt;At scale, another challenge emerges: as data grows, it becomes increasingly difficult to understand or navigate the information at all. Large document collections quickly exceed what humans can manually organize or search effectively. This is exactly where RAG systems become so important: they make massive, unstructured datasets usable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;One of the most frequent mistakes is treating ingestion as a trivial preprocessing step. Teams often invest heavily in prompt engineering while overlooking extraction quality, only to discover that retrieval accuracy is limited by poor source data. Inconsistent chunking and missing metadata create similar issues.&lt;/p&gt;

&lt;p&gt;A good rule of thumb is to design this early stage carefully. Because extraction, chunking, and embedding happen at the beginning, mistakes here propagate forward. Poor extraction leads to weaker chunking, lower-quality embeddings, less accurate retrieval, and ultimately worse answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;RAG systems succeed or fail based on the quality of their data pipeline. Reliable document parsing, chunking, and consistent embedding generation form the foundation on which retrieval and generation depend. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; fits naturally into this architecture because it addresses the first part of the workflow: turning messy, real-world documents into clean, structured, and semantically meaningful data ready for retrieval and generation. &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; provides the glue between components, letting you compose retrieval, prompts, and LLMs into a single, production-ready pipeline.&lt;/p&gt;

&lt;p&gt;Don't hesitate to submit issues or make contributions to Kreuzberg &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Kreuzberg v4.3.0 and comparative benchmarks</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Thu, 12 Feb 2026 09:02:29 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzberg-v430-and-benchmarks-500b</link>
      <guid>https://forem.com/kreuzberg/kreuzberg-v430-and-benchmarks-500b</guid>
      <description>&lt;p&gt;Hi all, we have two important announcements related to &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt;: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We released our new &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;comparative benchmarks&lt;/a&gt;. These have a slick UI and we have been working hard on them for a while now (more on this below), and we'd love to hear your impressions and get some feedback from the community!&lt;/li&gt;
&lt;li&gt;We released v4.3.0, which brings in a bunch of improvements including PaddleOCR as an optional backend, document structure extraction, and native Word97 format support. More details below.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What is Kreuzberg?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; is an open-source (MIT license) polyglot document intelligence framework written in Rust, with bindings for Python, TypeScript/JavaScript (Node/Bun/WASM), PHP, Ruby, Java, C#, Golang and Elixir. It's also available as a docker image and standalone CLI tool you can install via homebrew.&lt;/p&gt;

&lt;p&gt;If the above is unintelligible to you (understandably so), here is the TL;DR: Kreuzberg allows users to extract text from 75+ formats (and growing), perform OCR, create embeddings and quite a few other things as well. This is necessary for many AI applications, data pipelines, machine learning, and basically any use case that requires processing documents and images as sources for textual outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparative Benchmarks
&lt;/h2&gt;

&lt;p&gt;Our new comparative benchmarks UI is live here: &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;https://kreuzberg.dev/benchmarks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The comparative benchmarks compare Kreuzberg with several of the top open-source alternatives: Apache Tika, Docling, Markitdown, Unstructured.io, PDFPlumber, Mineru, and MuPDF4LLM. In a nutshell, Kreuzberg is 9x faster on average, uses substantially less memory, has a much better cold start, and has a smaller installation footprint. It also requires fewer system dependencies to function (its only &lt;strong&gt;optional&lt;/strong&gt; system dependency is onnxruntime, for embeddings/PaddleOCR).&lt;/p&gt;

&lt;p&gt;The benchmarks measure throughput, duration, p50/p95/p99 latency, memory, installation size, and cold start across more than 50 different file formats. They run in GitHub CI on ubuntu-latest machines, and the results are published as GitHub releases (here is an &lt;a href="https://github.com/kreuzberg-dev/kreuzberg/releases/tag/benchmark-run-21923145045" rel="noopener noreferrer"&gt;example&lt;/a&gt;). The &lt;a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/tools/benchmark-harness" rel="noopener noreferrer"&gt;source code&lt;/a&gt; for the benchmarks and the full data is available on GitHub, and you are invited to check it out.&lt;/p&gt;

&lt;h2&gt;
  
  
  V4.3.0 Changes
&lt;/h2&gt;

&lt;p&gt;The v4.3.0 full release notes can be found here: &lt;a href="https://github.com/kreuzberg-dev/kreuzberg/releases/tag/v4.3.0" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg/releases/tag/v4.3.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key highlights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;PaddleOCR optional backend - in Rust. Yes, you read this right, Kreuzberg now supports PaddleOCR in Rust and by extension - across all languages and bindings except WASM. This is a big one, especially for Chinese speakers and other east Asian languages, at which these models excel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Document structure extraction - while we already had page hierarchy extraction, we had requests for document structure extraction similar to Docling, which is very good at it. We now have a different but up-to-par implementation that extracts document structure from a huge variety of text documents - yes, including PDFs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Native Word97 format extraction - wait, what? Yes, we now support the legacy &lt;code&gt;.doc&lt;/code&gt; and &lt;code&gt;.ppt&lt;/code&gt; formats directly in Rust. This means we no longer need LibreOffice as an optional system dependency, which saves a lot of space. Who cares, you may ask? Usually enterprises and governmental orgs - we still live in a world where legacy is a thing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to get involved with Kreuzberg
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kreuzberg is an open-source project, and as such, contributions are welcome. You can check us out on GitHub, open issues or discussions, and of course, submit fixes and pull requests. Here is the GitHub: &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We have a &lt;a href="https://discord.gg/rzGzur3kj4" rel="noopener noreferrer"&gt;Discord Server&lt;/a&gt; and you are all invited to join (and lurk)!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it for now. As always, if you like it, star it on GitHub, it helps us get visibility for the project.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>rust</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Kreuzberg.dev now supports PHP and Elixir- and covers most of the backend landscape</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Sat, 03 Jan 2026 12:24:24 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzbergdev-now-supports-php-and-elixir-and-covers-most-of-the-backend-landscape-1hee</link>
      <guid>https://forem.com/kreuzberg/kreuzbergdev-now-supports-php-and-elixir-and-covers-most-of-the-backend-landscape-1hee</guid>
      <description>&lt;p&gt;Kreuzberg.dev now supports PHP and Elixir 🎉&lt;/p&gt;

&lt;p&gt;We’ve added PHP and Elixir bindings to Kreuzberg.dev, our open-source document intelligence engine.&lt;/p&gt;

&lt;p&gt;With this release, Kreuzberg is now available for:&lt;/p&gt;

&lt;p&gt;Rust, Python, Ruby, Go, PHP, Elixir, and TypeScript/Node.js&lt;/p&gt;

&lt;p&gt;This covers most modern backend and web development stacks, making it easier to integrate high-performance document processing into existing systems without forcing teams into a single language.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kreuzberg.dev?
&lt;/h2&gt;

&lt;p&gt;Kreuzberg is an MIT-licensed engine for extracting and structuring data from 56+ document formats, including PDFs, Office files, images, archives, and emails.&lt;/p&gt;

&lt;p&gt;Typical use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feeding documents into search or RAG pipelines&lt;/li&gt;
&lt;li&gt;extracting structured data from unstructured files&lt;/li&gt;
&lt;li&gt;building ingestion systems for large document collections&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;We’re continuing to improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;performance and memory usage&lt;/li&gt;
&lt;li&gt;format coverage and extraction quality&lt;/li&gt;
&lt;li&gt;documentation and real-world examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re also very interested in feedback from people running Kreuzberg.dev in production — especially around scaling, fault tolerance, and integration patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it out
&lt;/h2&gt;

&lt;p&gt;The library is open-source and self-hostable.&lt;/p&gt;

&lt;p&gt;Repo and docs: &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg&lt;/a&gt;&lt;br&gt;
Join our Discord community: &lt;a href="https://discord.gg/JraV699cKj" rel="noopener noreferrer"&gt;https://discord.gg/JraV699cKj&lt;/a&gt; &lt;br&gt;
Issues, questions, and PRs are always welcome.&lt;/p&gt;

&lt;p&gt;If you’re using Kreuzberg.dev already (or trying it now), we’d love to hear what you’re building with it. Have a great start to 2026!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>php</category>
      <category>elixir</category>
      <category>rust</category>
    </item>
    <item>
      <title>Kreuzberg v4.0.0-rc14 released: optimization phase and stable v4 ahead</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Sun, 21 Dec 2025 11:22:16 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzberg-v400-rc14-released-optimization-phase-and-stable-v4-ahead-1nji</link>
      <guid>https://forem.com/kreuzberg/kreuzberg-v400-rc14-released-optimization-phase-and-stable-v4-ahead-1nji</guid>
      <description>&lt;p&gt;We’ve just released Kreuzberg v4.0.0-rc14, now working across all release channels (language bindings, Docker, CLI).&lt;/p&gt;

&lt;p&gt;With the core feature set in place, focus is shifting to performance optimization — profiling and improving bindings, followed by comparative benchmarks and a documentation refresh.&lt;/p&gt;

&lt;p&gt;If you have time to test rc14, we’d be happy to receive any feedback (bugs, encouragement, design critique, or anything else) as we prepare for a stable v4 release next month. Thank you!&lt;/p&gt;

&lt;p&gt;Kreuzberg's position: As an open-source library, Kreuzberg provides a self-hosted alternative with no per-document API costs, making it suitable for high-volume workloads where cost efficiency matters.&lt;/p&gt;

&lt;p&gt;Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: star us at &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Discord: join our community server at &lt;a href="https://discord.gg/JraV699cKj" rel="noopener noreferrer"&gt;https://discord.gg/JraV699cKj&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Subreddit: join the discussion at &lt;a href="https://www.reddit.com/r/kreuzberg_dev/" rel="noopener noreferrer"&gt;https://www.reddit.com/r/kreuzberg_dev/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;https://kreuzberg.dev/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'd love to see your contributions!&lt;/p&gt;

&lt;p&gt;For more background on Kreuzberg, competitive comparison, and recent changes, see the earlier deep dive from v4.0.0-rc8:&lt;br&gt;
&lt;a href="https://dev.to/kreuzberg-dev/kreuzberg-v400-rc8-is-available-4fma"&gt;https://dev.to/kreuzberg-dev/kreuzberg-v400-rc8-is-available-4fma&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>product</category>
      <category>python</category>
      <category>rust</category>
    </item>
    <item>
      <title>Kreuzberg v4.0.0-RC.8 is Available</title>
      <dc:creator>Na'aman Hirschfeld (Goldziher)</dc:creator>
      <pubDate>Mon, 15 Dec 2025 13:06:14 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzberg-v400-rc8-is-available-4fma</link>
      <guid>https://forem.com/kreuzberg/kreuzberg-v400-rc8-is-available-4fma</guid>
      <description>&lt;p&gt;Hi Peeps,&lt;/p&gt;

&lt;p&gt;I'm excited to announce that &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; v4.0.0 is coming very soon. We will release v4.0.0 at the beginning of next year - in just a couple of weeks' time. For now, v4.0.0-rc.8 has been released to all channels.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kreuzberg?
&lt;/h2&gt;

&lt;p&gt;Kreuzberg is a document intelligence toolkit for extracting text, metadata, tables, images, and structured data from 56+ file formats. It was originally written in Python (v1-v3), where it demonstrated strong performance characteristics compared to alternatives in the ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's new in V4?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A Complete Rust Rewrite with Polyglot Bindings
&lt;/h3&gt;

&lt;p&gt;The new version of Kreuzberg represents a massive architectural evolution. &lt;strong&gt;Kreuzberg has been completely rewritten in Rust&lt;/strong&gt; - leveraging Rust's memory safety, zero-cost abstractions, and native performance. The new architecture consists of a high-performance Rust core with native bindings to multiple languages. That's right - it's no longer just a Python library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kreuzberg v4 is now available for 7 languages across 8 runtime bindings:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rust&lt;/strong&gt; (native library)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; (PyO3 native bindings)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript&lt;/strong&gt; - Node.js (NAPI-RS native bindings) + Deno/Browser/Edge (WASM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ruby&lt;/strong&gt; (Magnus FFI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Java 25+&lt;/strong&gt; (Panama Foreign Function &amp;amp; Memory API)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C#&lt;/strong&gt; (P/Invoke)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go&lt;/strong&gt; (cgo bindings)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Post v4.0.0 roadmap includes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PHP&lt;/li&gt;
&lt;li&gt;Elixir (via Rustler - with Erlang and Gleam interop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, it's available as a &lt;strong&gt;CLI&lt;/strong&gt; (installable via &lt;code&gt;cargo&lt;/code&gt; or &lt;code&gt;homebrew&lt;/code&gt;), &lt;strong&gt;HTTP REST API server&lt;/strong&gt;, &lt;strong&gt;Model Context Protocol (MCP) server&lt;/strong&gt; for Claude Desktop/Continue.dev, and as &lt;strong&gt;public Docker images&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the Rust Rewrite? Performance and Architecture
&lt;/h3&gt;

&lt;p&gt;The Rust rewrite wasn't just about performance - though that's a major benefit. It was an opportunity to fundamentally rethink the architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural improvements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-copy operations&lt;/strong&gt; via Rust's ownership model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True async concurrency&lt;/strong&gt; with Tokio runtime (no GIL limitations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming parsers&lt;/strong&gt; for constant memory usage on multi-GB files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIMD-accelerated text processing&lt;/strong&gt; for token reduction and string operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory-safe FFI boundaries&lt;/strong&gt; for all language bindings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin system&lt;/strong&gt; with trait-based extensibility&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  v3 vs v4: What Changed?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;v3 (Python)&lt;/th&gt;
&lt;th&gt;v4 (Rust Core)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure Python&lt;/td&gt;
&lt;td&gt;Rust 2024 edition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File Formats&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30-40+ (via Pandoc)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;56+ (native parsers)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python only&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;7 languages&lt;/strong&gt; (Rust/Python/TS/Ruby/Java/Go/C#)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires Pandoc (system binary)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Zero system dependencies&lt;/strong&gt; (all native)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;✓ FastEmbed with ONNX (3 presets + custom)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Chunking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via semantic-text-splitter library&lt;/td&gt;
&lt;td&gt;✓ Built-in (text + markdown-aware)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token Reduction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in (TF-IDF based)&lt;/td&gt;
&lt;td&gt;✓ Enhanced with 3 modes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language Detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional (fast-langdetect)&lt;/td&gt;
&lt;td&gt;✓ Built-in (68 languages)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Keyword Extraction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional (KeyBERT)&lt;/td&gt;
&lt;td&gt;✓ Built-in (YAKE + RAKE algorithms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OCR Backends&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tesseract/EasyOCR/PaddleOCR&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Same + better integration&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plugin System&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited extractor registry&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Full trait-based&lt;/strong&gt; (4 plugin types)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Page Tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Character-based indices&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Byte-based with O(1) lookup&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Servers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST API (Litestar)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;HTTP (Axum) + MCP + MCP-SSE&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Installation Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~100MB base&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16-31 MB complete&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python heap management&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;RAII with streaming&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;asyncio (GIL-limited)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Tokio work-stealing&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Replacement of Pandoc - Native Performance
&lt;/h3&gt;

&lt;p&gt;Kreuzberg v3 relied on &lt;strong&gt;Pandoc&lt;/strong&gt; - an amazing tool, but one that had to be invoked via subprocess because of its GPL license. This came with significant drawbacks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v3 Pandoc limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System dependency (installation required)&lt;/li&gt;
&lt;li&gt;Subprocess overhead on every document&lt;/li&gt;
&lt;li&gt;No streaming support&lt;/li&gt;
&lt;li&gt;Limited metadata extraction&lt;/li&gt;
&lt;li&gt;~500MB+ installation footprint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;v4 native parsers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero external dependencies&lt;/strong&gt; - everything is native Rust&lt;/li&gt;
&lt;li&gt;Direct parsing with full control over extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Substantially more metadata&lt;/strong&gt; extracted (e.g., DOCX document properties, section structure, style information)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming support&lt;/strong&gt; for massive files (tested on multi-GB XML documents with stable memory)&lt;/li&gt;
&lt;li&gt;Example: the PPTX extractor is now a &lt;strong&gt;fully streaming parser&lt;/strong&gt; capable of handling gigabyte-scale presentations with constant memory usage and high throughput&lt;/li&gt;
&lt;/ul&gt;
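&lt;p&gt;The streaming idea is easiest to see in miniature. The Python sketch below (the real parsers are Rust; this is purely illustrative) processes a document one line at a time, so memory stays roughly constant no matter how large the input is:&lt;/p&gt;

```python
# Concept sketch (not Kreuzberg's Rust code): streaming a large document
# piece by piece keeps memory roughly constant regardless of file size,
# because only one line is held at a time.
import io

def count_words(stream) -> int:
    """Count words without ever loading the whole document into memory."""
    total = 0
    for line in stream:  # file-like objects yield one line at a time
        total += len(line.split())
    return total

# a stand-in for a multi-gigabyte file on disk
doc = io.StringIO("hello world\n" * 1000)
print(count_words(doc))  # 2000
```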

&lt;h3&gt;
  
  
  New File Format Support
&lt;/h3&gt;

&lt;p&gt;v4 expands format support to &lt;strong&gt;56+ file formats&lt;/strong&gt;, including:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Added legacy format support:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.doc&lt;/code&gt; (Word 97-2003)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.ppt&lt;/code&gt; (PowerPoint 97-2003)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.xls&lt;/code&gt; (Excel 97-2003)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.eml&lt;/code&gt; (Email messages)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.msg&lt;/code&gt; (Outlook messages)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Added academic/technical formats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LaTeX (&lt;code&gt;.tex&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;BibTeX (&lt;code&gt;.bib&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Typst (&lt;code&gt;.typ&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;JATS XML (scientific articles)&lt;/li&gt;
&lt;li&gt;DocBook XML&lt;/li&gt;
&lt;li&gt;FictionBook (&lt;code&gt;.fb2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;OPML (&lt;code&gt;.opml&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Better Office support:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XLSB, XLSM (Excel binary/macro formats)&lt;/li&gt;
&lt;li&gt;Better structured metadata extraction from DOCX/PPTX/XLSX&lt;/li&gt;
&lt;li&gt;Full table extraction from presentations&lt;/li&gt;
&lt;li&gt;Image extraction with deduplication&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  New Features: Full Document Intelligence Solution
&lt;/h3&gt;

&lt;p&gt;The v4 rewrite was also an opportunity to close gaps with commercial alternatives and add features specifically designed for &lt;strong&gt;RAG applications and LLM workflows&lt;/strong&gt;:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;Embeddings (NEW)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastEmbed integration&lt;/strong&gt; with full ONNX Runtime acceleration&lt;/li&gt;
&lt;li&gt;Three presets: &lt;code&gt;"fast"&lt;/code&gt; (384d), &lt;code&gt;"balanced"&lt;/code&gt; (512d), &lt;code&gt;"quality"&lt;/code&gt; (768d/1024d)&lt;/li&gt;
&lt;li&gt;Custom model support (bring your own ONNX model)&lt;/li&gt;
&lt;li&gt;Local generation (no API calls, no rate limits)&lt;/li&gt;
&lt;li&gt;Automatic model downloading and caching&lt;/li&gt;
&lt;li&gt;Per-chunk embedding generation
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingModelType&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EmbeddingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EmbeddingModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# result.embeddings contains vectors for each chunk
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. &lt;strong&gt;Semantic Text Chunking (NOW BUILT-IN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Now integrated directly into the core (v3 relied on the external semantic-text-splitter library):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure-aware chunking&lt;/strong&gt; that respects document semantics&lt;/li&gt;
&lt;li&gt;Two strategies:

&lt;ul&gt;
&lt;li&gt;Generic text chunker (whitespace/punctuation-aware)&lt;/li&gt;
&lt;li&gt;Markdown chunker (preserves headings, lists, code blocks, tables)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Configurable chunk size and overlap&lt;/li&gt;

&lt;li&gt;Unicode-safe (handles CJK, emojis correctly)&lt;/li&gt;

&lt;li&gt;Automatic chunk-to-page mapping&lt;/li&gt;

&lt;li&gt;Per-chunk metadata with byte offsets&lt;/li&gt;

&lt;/ul&gt;
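&lt;p&gt;To make the size/overlap mechanics concrete, here is a minimal Python sketch of a generic chunker. It is illustrative only - the function name and defaults are invented for this example, and the built-in chunkers are structure-aware Rust implementations:&lt;/p&gt;

```python
# Minimal sketch of size/overlap chunking (illustrative only; Kreuzberg's
# built-in chunkers are structure-aware and implemented in Rust).
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most max_chars characters, with
    consecutive chunks overlapping by roughly `overlap` characters,
    preferring to break on whitespace so words stay intact."""
    chunks = []
    start = 0
    while len(text) > start:
        end = min(start + max_chars, len(text))
        if len(text) > end:
            cut = text.rfind(" ", start, end)  # last space inside the window
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = max(end - overlap, start + 1)  # step back to create overlap
    return chunks
```

&lt;p&gt;The overlap means each chunk carries a little trailing context from its predecessor, which helps retrieval when an answer straddles a chunk boundary.&lt;/p&gt;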

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Byte-Accurate Page Tracking (BREAKING CHANGE)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is a critical improvement for LLM applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v3&lt;/strong&gt;: Character-based indices (&lt;code&gt;char_start&lt;/code&gt;/&lt;code&gt;char_end&lt;/code&gt;), which diverge from byte positions whenever UTF-8 multi-byte characters appear&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v4&lt;/strong&gt;: Byte-based indices (&lt;code&gt;byte_start&lt;/code&gt;/&lt;code&gt;byte_end&lt;/code&gt;), which are correct for all string operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additional page features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;O(1) lookup: "which page is byte offset X on?" → instant answer&lt;/li&gt;
&lt;li&gt;Per-page content extraction&lt;/li&gt;
&lt;li&gt;Page markers in combined text (e.g., &lt;code&gt;--- Page 5 ---&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Automatic chunk-to-page mapping for citations&lt;/li&gt;
&lt;/ul&gt;
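&lt;p&gt;A small Python example shows why byte offsets matter and how a page lookup over byte boundaries works. The &lt;code&gt;bisect&lt;/code&gt; search below is an illustrative O(log n) stand-in for the O(1) structure used internally:&lt;/p&gt;

```python
# Sketch: byte offsets vs. char offsets, plus a page-boundary lookup.
# (Kreuzberg keeps byte-based page boundaries with O(1) lookup; bisect
# here is an illustrative stand-in for that structure.)
import bisect

page_texts = ["Grüße von Seite eins. ", "Page two is ASCII. "]
combined = "".join(page_texts)

# char index of the second page differs from its byte index, because
# "ü" and "ß" each occupy 2 bytes in UTF-8
char_boundary = len(page_texts[0])
byte_boundary = len(page_texts[0].encode("utf-8"))
print(char_boundary, byte_boundary)  # 22 24

# cumulative byte offsets of each page start
byte_starts = []
offset = 0
for page in page_texts:
    byte_starts.append(offset)
    offset += len(page.encode("utf-8"))

def page_of_byte(byte_offset: int) -> int:
    """1-based page number containing the given byte offset."""
    return bisect.bisect_right(byte_starts, byte_offset)

print(page_of_byte(0), page_of_byte(byte_boundary))  # 1 2
```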

&lt;h4&gt;
  
  
  4. &lt;strong&gt;Enhanced Token Reduction for LLM Context&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Enhanced from v3 with three configurable modes to save on LLM costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Light mode&lt;/strong&gt;: ~15% reduction (preserve most detail)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Moderate mode&lt;/strong&gt;: ~30% reduction (balanced)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggressive mode&lt;/strong&gt;: ~50% reduction (key information only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Uses TF-IDF sentence scoring with position-aware weighting and language-specific stopword filtering. SIMD-accelerated for improved performance over v3.&lt;/p&gt;
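&lt;p&gt;For intuition, here is a toy Python version of TF-IDF sentence scoring with a position boost. All names, weights, and the tiny stopword list are invented for this example - the shipped implementation is SIMD-accelerated Rust with per-language stopwords:&lt;/p&gt;

```python
# Illustrative sketch of TF-IDF-style extractive reduction: score each
# sentence by the average rarity of its terms, boost lead sentences,
# and keep only the top fraction.
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}

def reduce_text(text: str, keep_ratio: float = 0.5) -> str:
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    n = len(sentences)
    # document frequency of each term across sentences
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tfidf = sum(math.log(1 + n / df[t]) for t in toks) / (len(toks) or 1)
        position_boost = 1.2 if i == 0 else 1.0  # lead sentence weighted up
        scores.append(tfidf * position_boost)
    keep = max(1, round(n * keep_ratio))
    # keep the top-scoring sentences, restored to document order
    top = sorted(sorted(range(n), key=lambda i: -scores[i])[:keep])
    return " ".join(sentences[i] for i in top)
```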

&lt;h4&gt;
  
  
  5. &lt;strong&gt;Language Detection (NOW BUILT-IN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Support for 68 languages with confidence scoring&lt;/li&gt;
&lt;li&gt;Multi-language detection (documents with mixed languages)&lt;/li&gt;
&lt;li&gt;ISO 639-1 and ISO 639-3 code support&lt;/li&gt;
&lt;li&gt;Configurable confidence thresholds&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. &lt;strong&gt;Keyword Extraction (NOW BUILT-IN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Now built into the core (previously an optional KeyBERT integration in v3):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;YAKE&lt;/strong&gt; (Yet Another Keyword Extractor): Unsupervised, language-independent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAKE&lt;/strong&gt; (Rapid Automatic Keyword Extraction): Fast statistical method&lt;/li&gt;
&lt;li&gt;Configurable n-grams (1-3 word phrases)&lt;/li&gt;
&lt;li&gt;Relevance scoring with language-specific stopwords&lt;/li&gt;
&lt;/ul&gt;
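&lt;p&gt;To illustrate the RAKE idea, here is a compact Python sketch of the classic degree/frequency heuristic. It is illustrative only - the shipped implementation is Rust with proper per-language stopword lists:&lt;/p&gt;

```python
# Minimal RAKE-style sketch: candidate phrases are maximal runs of
# non-stopwords; each word is scored degree/frequency and a phrase
# scores the sum of its word scores.
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "for", "with"}

def rake_keywords(text: str, top_k: int = 3) -> list[str]:
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(tuple(current))
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(tuple(current))
    # word score = degree / frequency (classic RAKE heuristic)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    scores = {p: sum(degree[w] / freq[w] for w in p) for p in phrases}
    return [" ".join(p) for p, _ in
            sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```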

&lt;h4&gt;
  
  
  7. &lt;strong&gt;Plugin System (NEW)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Four extensible plugin types for customization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DocumentExtractor&lt;/strong&gt; - Custom file format handlers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OcrBackend&lt;/strong&gt; - Custom OCR engines (integrate your own Python models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostProcessor&lt;/strong&gt; - Data transformation and enrichment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validator&lt;/strong&gt; - Pre-extraction validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plugins defined in Rust work across all language bindings. Python/TypeScript can define custom plugins with thread-safe callbacks into the Rust core.&lt;/p&gt;
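&lt;p&gt;As a flavor of what a Python-side plugin can look like, here is a hypothetical post-processor that redacts e-mail addresses from extracted text. The &lt;code&gt;Protocol&lt;/code&gt; shape and the class names are stand-ins for this example, not the actual plugin interface - see the docs for the real registration API:&lt;/p&gt;

```python
# Hypothetical sketch of a post-processor plugin in Python. The
# PostProcessor protocol and the class names are stand-ins; the real
# plugin interface is documented at kreuzberg.dev.
import re
from typing import Protocol

class PostProcessor(Protocol):
    def process(self, text: str, metadata: dict) -> str: ...

class RedactEmails:
    """Replace e-mail addresses in extracted text before it is returned."""
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def process(self, text: str, metadata: dict) -> str:
        return self.EMAIL.sub("[redacted]", text)

print(RedactEmails().process("Contact: team@kreuzberg.dev", {}))  # Contact: [redacted]
```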

&lt;h4&gt;
  
  
  8. &lt;strong&gt;Production-Ready Servers (NEW)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTTP REST API&lt;/strong&gt;: Production-grade Axum server with OpenAPI docs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server&lt;/strong&gt;: Direct integration with Claude Desktop, Continue.dev, and other MCP clients&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP-SSE Transport&lt;/strong&gt; (RC.8): Server-Sent Events for cloud deployments without WebSocket support&lt;/li&gt;
&lt;li&gt;All three server modes support the same feature set: extraction, batch processing, and caching&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance: Benchmarked Against the Competition
&lt;/h2&gt;

&lt;p&gt;We maintain &lt;strong&gt;continuous benchmarks&lt;/strong&gt; comparing Kreuzberg against the leading OSS alternatives:&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform&lt;/strong&gt;: Ubuntu 22.04 (GitHub Actions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Suite&lt;/strong&gt;: 30+ documents covering all formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Latency (p50, p95), throughput (MB/s), memory usage, success rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitors&lt;/strong&gt;: Apache Tika, Docling, Unstructured, MarkItDown&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Kreuzberg Compares
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Installation Size&lt;/strong&gt; (critical for containers/serverless):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kreuzberg&lt;/strong&gt;: &lt;strong&gt;16-31 MB complete&lt;/strong&gt; (CLI: 16 MB, Python wheel: 22 MB, Java JAR: 31 MB - all features included)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MarkItDown&lt;/strong&gt;: ~251 MB installed (58.3 KB wheel, 25 dependencies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured&lt;/strong&gt;: ~146 MB minimal (open source base) - &lt;strong&gt;several GB with ML models&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docling&lt;/strong&gt;: ~1 GB base, &lt;strong&gt;9.74GB Docker image&lt;/strong&gt; (includes PyTorch CUDA)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Tika&lt;/strong&gt;: ~55 MB (tika-app JAR) + dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GROBID&lt;/strong&gt;: 500MB (CRF-only) to &lt;strong&gt;8GB&lt;/strong&gt; (full deep learning)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Formats&lt;/th&gt;
&lt;th&gt;Installation&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kreuzberg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Fast (Rust-native)&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;56+&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16-31 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;General-purpose, production-ready&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Fast (3.1s/pg x86, 1.27s/pg ARM)&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;td&gt;7+&lt;/td&gt;
&lt;td&gt;1-9.74 GB&lt;/td&gt;
&lt;td&gt;Complex documents, when accuracy &amp;gt; size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GROBID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡⚡ Very Fast (10.6 PDF/s)&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;td&gt;PDF only&lt;/td&gt;
&lt;td&gt;0.5-8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Academic/scientific papers only&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unstructured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Moderate&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;25-65+&lt;/td&gt;
&lt;td&gt;146 MB-several GB&lt;/td&gt;
&lt;td&gt;Python-native LLM pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MarkItDown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Fast (small files)&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;11+&lt;/td&gt;
&lt;td&gt;~251 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Lightweight Markdown conversion&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Tika&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Moderate&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1000+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~55 MB&lt;/td&gt;
&lt;td&gt;Enterprise, broadest format support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Kreuzberg's sweet spot:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Smallest full-featured installation&lt;/strong&gt;: 16-31 MB complete (vs 146 MB-9.74 GB for competitors)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5-15x smaller&lt;/strong&gt; than Unstructured/MarkItDown, &lt;strong&gt;30-300x smaller&lt;/strong&gt; than Docling/GROBID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust-native performance&lt;/strong&gt; without ML model overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad format support&lt;/strong&gt; (56+ formats) with native parsers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language support&lt;/strong&gt; unique in the space (7 languages vs Python-only for most)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-ready&lt;/strong&gt; with general-purpose design (vs specialized tools like GROBID)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Is Kreuzberg a SaaS Product?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No.&lt;/strong&gt; Kreuzberg is and will remain &lt;strong&gt;MIT-licensed open source&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, we are building &lt;strong&gt;Kreuzberg.cloud&lt;/strong&gt; - a commercial SaaS and self-hosted document intelligence solution built &lt;em&gt;on top of&lt;/em&gt; Kreuzberg. This follows the proven open-core model: the library stays free and open, while we offer a cloud service for teams that want managed infrastructure, APIs, and enterprise features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will Kreuzberg become commercially licensed?&lt;/strong&gt; Absolutely not. There is no BSL (Business Source License) in Kreuzberg's future. The library was MIT-licensed and will remain MIT-licensed. We're building the commercial offering as a separate product around the core library, not by restricting the library itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Target Audience
&lt;/h2&gt;

&lt;p&gt;Any developer or data scientist who needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document text extraction (PDF, Office, images, email, archives, etc.)&lt;/li&gt;
&lt;li&gt;OCR (Tesseract, EasyOCR, PaddleOCR)&lt;/li&gt;
&lt;li&gt;Metadata extraction (authors, dates, properties, EXIF)&lt;/li&gt;
&lt;li&gt;Table and image extraction&lt;/li&gt;
&lt;li&gt;Document pre-processing for RAG pipelines&lt;/li&gt;
&lt;li&gt;Text chunking with embeddings&lt;/li&gt;
&lt;li&gt;Token reduction for LLM context windows&lt;/li&gt;
&lt;li&gt;Multi-language document intelligence in production systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ideal for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG application developers&lt;/li&gt;
&lt;li&gt;Data engineers building document pipelines&lt;/li&gt;
&lt;li&gt;ML engineers preprocessing training data&lt;/li&gt;
&lt;li&gt;Enterprise developers handling document workflows&lt;/li&gt;
&lt;li&gt;DevOps teams needing lightweight, performant extraction in containers/serverless&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison with Alternatives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Open Source Python Libraries
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Unstructured.io&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Established, modular, broad format support (25+ open source, 65+ enterprise), LLM-focused, good Python ecosystem integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Python GIL performance constraints, 146 MB minimal installation (several GB with ML models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache-2.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Python-only projects where ecosystem fit &amp;gt; performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MarkItDown (Microsoft)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Fast for small files, Markdown-optimized, simple API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Limited format support (11 formats), less structured metadata, ~251 MB installed (despite small wheel), requires OpenAI API for images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Markdown-only conversion, LLM consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Docling (IBM)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Excellent accuracy on complex documents (97.9% cell-level accuracy on tested sustainability report tables), state-of-the-art AI models for technical documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Massive installation (1-9.74 GB), high memory usage, GPU-optimized (underutilized on CPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Accuracy on complex documents &amp;gt; deployment size/speed, have GPU infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Open Source Java/Academic Tools
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Apache Tika&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Mature, stable, broadest format support (1000+ types), proven at scale, Apache Foundation backing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Java/JVM required, slower on large files, older architecture, complex dependency management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache-2.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Enterprise environments with JVM infrastructure, need for maximum format coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GROBID&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Best-in-class for academic papers (F1 0.87-0.90), extremely fast (10.6 PDF/sec sustained), proven at scale (34M+ documents at CORE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Academic papers only, large installation (500MB-8GB), complex Java+Python setup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache-2.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Scientific/academic document processing exclusively&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Commercial APIs
&lt;/h3&gt;

&lt;p&gt;There are numerous commercial options from startups (LlamaIndex, Unstructured.io paid tiers) to big cloud providers (AWS Textract, Azure Form Recognizer, Google Document AI). These are not OSS but offer managed infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kreuzberg's position&lt;/strong&gt;: As an open-source library, Kreuzberg provides a self-hosted alternative with no per-document API costs, making it suitable for high-volume workloads where cost efficiency matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: Star us at &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discord&lt;/strong&gt;: Join our community server at &lt;a href="https://discord.gg/pXxagNK2zN" rel="noopener noreferrer"&gt;discord.gg/pXxagNK2zN&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subreddit&lt;/strong&gt;: Join the discussion at &lt;a href="https://www.reddit.com/r/kreuzberg_dev/" rel="noopener noreferrer"&gt;r/kreuzberg_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://kreuzberg.dev" rel="noopener noreferrer"&gt;kreuzberg.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'd love to hear your feedback, use cases, and contributions!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Kreuzberg v4 is a complete Rust rewrite of a document intelligence library, offering native bindings for 7 languages (8 runtime targets), 56+ file formats, Rust-native performance, embeddings, semantic chunking, and production-ready servers - all in a 16-31 MB complete package (5-15x smaller than alternatives). Releasing January 2026. MIT licensed forever.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>programming</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
