<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kreuzberg</title>
    <description>The latest articles on Forem by Kreuzberg (@kreuzberg).</description>
    <link>https://forem.com/kreuzberg</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12062%2F3bf62621-afa0-40a7-8e8e-31858840ed6e.png</url>
      <title>Forem: Kreuzberg</title>
      <link>https://forem.com/kreuzberg</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kreuzberg"/>
    <language>en</language>
    <item>
      <title>Document Structure Extraction with Kreuzberg</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Tue, 31 Mar 2026 10:57:04 +0000</pubDate>
      <link>https://forem.com/kreuzberg/document-structure-extraction-with-kreuzberg-44cj</link>
      <guid>https://forem.com/kreuzberg/document-structure-extraction-with-kreuzberg-44cj</guid>
      <description>&lt;p&gt;Extracting structured data from PDFs is one of the hardest problems in AI infrastructure. Most tools give you a text dump but no headings, no table boundaries, no distinction between a caption and a footnote. When Docling launched, it changed the game with a genuinely good layout model.&lt;/p&gt;

&lt;p&gt;We want to be clear: Docling is a great project, and we have the greatest respect for the team at IBM for putting it out there. It’s also fully open source under a permissive Apache-2.0 license. We integrated their model into Kreuzberg and embedded it in a Rust-native pipeline, where it currently runs 2.8× faster with a fraction of the memory footprint.&lt;/p&gt;

&lt;p&gt;This post covers the behind-the-scenes part: what we used, what we rebuilt from scratch, and where the speed comes from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Document Structure Matters for AI and RAG Pipelines&lt;/strong&gt;&lt;br&gt;
If you’re building AI infrastructure like RAG pipelines, document processing workflows, or any AI application that ingests PDFs at scale, flat text extraction isn’t enough anymore.&lt;/p&gt;

&lt;p&gt;Consider what happens when you feed an LLM a PDF that’s been extracted as a single blob of text. The model can’t distinguish a section heading from body text. It can’t tell if a number belongs to a table cell or a footnote. It merges multi-column layouts into nonsense. The retrieval quality of your entire pipeline degrades because the source data has no structure.&lt;/p&gt;

&lt;p&gt;Docling, IBM’s open-source document understanding library, addressed this head-on. Their RT-DETR v2 layout model (called Docling Heron) classifies 17 different document element types: headings, paragraphs, tables, figures, captions, page headers, footers, and more. It produces a structural representation that downstream systems can actually work with.&lt;/p&gt;

&lt;p&gt;The model is excellent. The issue lies in what’s around it.&lt;/p&gt;

&lt;p&gt;Docling is a Python library built on deep learning inference. Model loading takes time. Processing is sequential. Memory usage scales with document complexity. For a single document or a research prototype, that’s fine. For thousands of documents in a production pipeline, especially if your stack isn’t Python, it starts to matter. That’s the gap we set out to close.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Foundation&lt;/strong&gt;&lt;br&gt;
Starting with Kreuzberg v4.5.0, we integrated Docling’s RT-DETR v2 layout model directly into our Rust-native pipeline. The model is open-source under Apache-2.0, and we want to be transparent about its use. Docling’s team built something excellent. But the model is only one piece of a document extraction system. The inference runtime, the text extraction layer, the page processing strategy, and the table reconstruction pipeline are all built in Rust. The result is a system that uses Docling’s layout intelligence but runs it through an entirely different execution engine.&lt;/p&gt;

&lt;p&gt;Here’s where the engineering differences live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering the Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ONNX Runtime for Layout Inference&lt;/strong&gt;&lt;br&gt;
The RT-DETR v2 model runs through ONNX Runtime rather than PyTorch. There’s no Python dependency and no GIL contention, and hardware acceleration (CPU, CUDA, CoreML, and TensorRT) is supported natively. All of this is configurable through a typed &lt;code&gt;AccelerationConfig&lt;/code&gt; that works across every language binding Kreuzberg supports.&lt;/p&gt;

&lt;p&gt;This alone eliminates the cold-start penalty. The ONNX session loads once and stays resident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel Page Processing&lt;/strong&gt;&lt;br&gt;
Layout inference processes page batches in a single &lt;code&gt;session.run()&lt;/code&gt; call. SLANet-Plus (the table structure recognition model) and layout inference run in parallel using thread-local model instances and Rayon workers. Each page is processed independently and released after extraction, keeping memory usage flat even on 500-page documents.&lt;/p&gt;

&lt;p&gt;Docling processes pages sequentially through Python. Kreuzberg processes them concurrently through Rust. On a 100-page PDF, that difference compounds fast.&lt;/p&gt;
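&lt;p&gt;In Python terms, the per-page strategy looks roughly like the sketch below. The real pipeline is Rust with Rayon workers and thread-local model instances; &lt;code&gt;render_page&lt;/code&gt; and &lt;code&gt;extract_page&lt;/code&gt; here are hypothetical stand-ins for illustration only.&lt;/p&gt;

```python
# Conceptual sketch (not Kreuzberg code) of processing pages concurrently
# while keeping peak memory flat: each page is rendered, extracted, and
# released independently on its worker.
from concurrent.futures import ThreadPoolExecutor


def render_page(page_number: int) -> bytes:
    # Stand-in for rendering one page to an image buffer.
    return b"pixels-%d" % page_number


def extract_page(page_number: int) -> str:
    pixels = render_page(page_number)
    result = f"page {page_number}: {len(pixels)} bytes"
    del pixels  # page buffer freed before this worker takes the next page
    return result


def extract_document(num_pages: int) -> list[str]:
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map preserves page order even though pages finish out of order
        return list(pool.map(extract_page, range(num_pages)))
```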

&lt;p&gt;&lt;strong&gt;Native PDF Text Extraction via PDFium&lt;/strong&gt;&lt;br&gt;
This is where most of the quality gains come from, and it’s the biggest architectural divergence from Docling.&lt;/p&gt;

&lt;p&gt;Instead of relying on the layout model’s pipeline to also handle text extraction, Kreuzberg reads text directly from the PDF’s native text layer using PDFium’s character-level API. This preserves exact character positions, font metadata (bold, italic, size), and Unicode encoding. The layout model then classifies and organizes this high-fidelity text according to the document’s visual structure.&lt;/p&gt;

&lt;p&gt;The distinction matters because Docling’s pipeline treats the rendered page image as the primary input for both layout detection and text extraction. Kreuzberg uses the page image only for layout detection, then pulls text from the PDF’s native layer. You get neural-network-quality structure classification with lossless text fidelity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure Tree Integration&lt;/strong&gt;&lt;br&gt;
When a PDF contains a tagged structure tree (common in PDF/A and accessibility-compliant documents), Kreuzberg uses the author’s original paragraph boundaries and heading hierarchy, then applies layout model predictions as classification overrides. The structure tree gives you the author’s intent; the layout model gives you visual classification. Combining both produces better results than either alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixing Edge Cases in PDFs&lt;/strong&gt;&lt;br&gt;
The single biggest quality improvement came not from the layout model integration, but from rewriting how we extract text from PDFs at the character level.&lt;/p&gt;

&lt;p&gt;Before v4.5.0, Kreuzberg used &lt;code&gt;PdfiumParagraph::from_objects()&lt;/code&gt;, a paragraph-level extraction approach that relied on PDFium’s built-in text segmentation. It worked on clean documents but broke down on anything with non-standard font matrices, complex column layouts, or broken CMap encodings. And PDFs are full of exactly these problems.&lt;/p&gt;

&lt;p&gt;We replaced it with per-character text extraction using PDFium’s &lt;code&gt;PdfPageText::chars()&lt;/code&gt; API. Every character is read individually with its exact position, font size, and baseline coordinates. From there, we rebuild the text structure ourselves.&lt;/p&gt;

&lt;p&gt;This unlocked a chain of fixes that would have been impossible at the paragraph level:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broken font metrics.&lt;/strong&gt; Many PDFs report incorrect font sizes due to font matrix scaling: PDFium might say &lt;code&gt;font_size=1&lt;/code&gt; when the rendered text is clearly 12pt. Our old 4pt minimum filter would silently drop all content from these pages. Now, when the filter would remove everything on a page, it is skipped automatically. The same logic applies to margin filtering: when it would remove all text on a page (PDFs with baseline values outside the expected bands), the filter falls back gracefully.&lt;/p&gt;
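&lt;p&gt;The fallback rule is simple to sketch. The threshold below is illustrative, and this is a toy version of the idea rather than Kreuzberg’s actual filter:&lt;/p&gt;

```python
# Sketch of a filter that falls back gracefully: if applying it would
# drop every character on a page, the filter is skipped entirely,
# because an empty page is worse than an unfiltered one.
def filter_with_fallback(chars: list[dict], min_font_size: float = 4.0) -> list[dict]:
    kept = [c for c in chars if c["font_size"] >= min_font_size]
    if not kept and chars:
        # Broken font matrices can report font_size=1 for 12pt text;
        # keep the original characters instead of silently losing the page.
        return chars
    return kept
```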

&lt;p&gt;&lt;strong&gt;Ligature corruption.&lt;/strong&gt; LaTeX-generated PDFs with broken ToUnicode CMaps produce garbled text: &lt;em&gt;different&lt;/em&gt; becomes &lt;em&gt;di!erent&lt;/em&gt;, &lt;em&gt;offices&lt;/em&gt; becomes &lt;em&gt;o”ces&lt;/em&gt;. We repair these inline during character iteration, using a vowel/consonant heuristic to disambiguate ambiguous ligature mappings. Fixing this during extraction rather than in a post-processing pass improved both accuracy and performance.&lt;/p&gt;
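&lt;p&gt;As a toy illustration of the idea (the real candidate tables and rules are more involved than this sketch, and the glyph set here is an assumption):&lt;/p&gt;

```python
# Sketch of vowel/consonant ligature repair: a broken CMap maps several
# "ff"-family ligatures to one garbage glyph, and the character that
# follows hints at which ligature was intended. The mapping below is a
# two-rule illustration, not Kreuzberg's actual table.
VOWELS = set("aeiou")


def repair_ligatures(text: str, bad_glyphs: str = '!"') -> str:
    out = []
    for i, ch in enumerate(text):
        if ch in bad_glyphs:
            nxt = text[i + 1 : i + 2]  # "" at end of string
            # vowel next -> "ff" ("di!erent" -> "different");
            # consonant next -> "ffi" ('o"ces' -> "offices")
            out.append("ff" if nxt in VOWELS else "ffi")
        else:
            out.append(ch)
    return "".join(out)
```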

&lt;p&gt;&lt;strong&gt;Word spacing artifacts.&lt;/strong&gt; PDFium sometimes inserts spurious spaces mid-word: &lt;em&gt;shall be active&lt;/em&gt; becomes &lt;em&gt;s hall a b e active&lt;/em&gt;. Pages with detected broken spacing are re-extracted using character-level gap analysis (a &lt;code&gt;font_size × 0.33&lt;/code&gt; threshold), while clean pages use the fast single-call path. On the ISO 21111-10 test document, this reduced garbled lines from 406 to zero.&lt;/p&gt;
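&lt;p&gt;Conceptually, the gap analysis looks like the sketch below: spaces are inferred from horizontal gaps between adjacent characters rather than trusted from PDFium’s output. The character tuples are simplified for illustration, not Kreuzberg’s actual data structures.&lt;/p&gt;

```python
# Rebuild a line from (char, x position, width, font size) tuples.
# A horizontal gap wider than font_size * 0.33 becomes a space;
# spurious mid-word spaces in the original text stream are simply ignored
# because spacing is re-derived from geometry.
def rebuild_line(chars: list[tuple[str, float, float, float]]) -> str:
    out = []
    prev_end = None
    for ch, x, width, font_size in chars:
        if prev_end is not None and (x - prev_end) > font_size * 0.33:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)
```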

&lt;p&gt;&lt;strong&gt;Multi-column reading order.&lt;/strong&gt; Federal Register-style multi-column PDFs jumped from 69.9% to 90.7% F1 after we switched to PDFium’s text API, which handles column reading order natively, with no column detection heuristics needed on our side.&lt;/p&gt;

&lt;p&gt;The final result: Kreuzberg’s PDF markdown extraction hit 91.0% average F1 across 16 test PDFs, compared to Docling’s 91.4%. Effectively at parity, while being 10–50× faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Table Extraction Works&lt;/strong&gt;&lt;br&gt;
Table extraction runs in two stages.&lt;/p&gt;

&lt;p&gt;First, the RT-DETR v2 layout model identifies table regions on the page image. Then, Kreuzberg crops each detected region and runs SLANet-Plus, a specialized model that predicts internal table structure: rows, columns, cells, including colspan and rowspan.&lt;/p&gt;

&lt;p&gt;The predicted cell grid is matched against native PDF text positions to reconstruct accurate markdown tables. This hybrid approach (neural structure prediction plus native text extraction) avoids the OCR-like quality loss you get when working only with rendered page images.&lt;/p&gt;

&lt;p&gt;We also tightened the detection heuristics. Table detection now requires at least 3 aligned columns, which eliminates false positives from two-column text layouts like academic papers and newsletters. Post-processing rejects tables with 2 or fewer columns, tables where more than 50% of cells contain long text, or tables with an average cell length above 50 characters. These rules cut false positive detections significantly without hurting recall.&lt;/p&gt;
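&lt;p&gt;As a sketch, those rejection rules can be expressed as a single predicate. The 50-character cutoff for a &lt;em&gt;long text&lt;/em&gt; cell below is an assumption for illustration; the post only states the 50% ratio and the average-length threshold.&lt;/p&gt;

```python
# Sketch of the table false-positive filter described above: reject
# candidates with 2 or fewer columns, more than 50% long-text cells,
# or an average cell length above 50 characters.
LONG_CELL = 50  # assumed character cutoff for a "long text" cell


def is_real_table(rows: list[list[str]]) -> bool:
    num_cols = max((len(r) for r in rows), default=0)
    if num_cols in (0, 1, 2):
        return False  # fewer than 3 aligned columns: likely two-column prose
    cells = [c for r in rows for c in r]
    if not cells:
        return False
    long_cells = sum(1 for c in cells if len(c) > LONG_CELL)
    if long_cells / len(cells) > 0.5:
        return False  # mostly long text: reads like paragraphs, not a table
    avg_len = sum(len(c) for c in cells) / len(cells)
    if avg_len > 50:
        return False
    return True
```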

&lt;p&gt;&lt;strong&gt;Benchmarks: How We Measured This&lt;/strong&gt;&lt;br&gt;
We benchmarked Kreuzberg against Docling on 171 PDF documents spanning academic papers, government and legal documents, invoices, OCR scans, and edge cases. F1 score is the harmonic mean of precision and recall: it captures both how much of the expected content was correctly extracted and how much of what was extracted was actually correct.&lt;/p&gt;
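&lt;p&gt;For reference, the metric itself is just:&lt;/p&gt;

```python
# F1 is the harmonic mean of precision and recall; it is high only when
# both are high, so it penalizes extractors that dump everything (high
# recall, low precision) or extract very little (the reverse).
def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```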

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Kreuzberg&lt;/th&gt;&lt;th&gt;Docling&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;Structure F1&lt;/td&gt;&lt;td&gt;42.1%&lt;/td&gt;&lt;td&gt;41.7%&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Text F1&lt;/td&gt;&lt;td&gt;88.9%&lt;/td&gt;&lt;td&gt;86.7%&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Avg. processing time&lt;/td&gt;&lt;td&gt;1,032 ms/doc&lt;/td&gt;&lt;td&gt;2,894 ms/doc&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Structure F1 measures how accurately document elements such as headings, paragraphs, and tables are detected. The 2.8× speed advantage comes from four angles: Rust’s native memory management, PDFium character-level text extraction (no Python overhead), ONNX Runtime inference (no PyTorch), and Rayon parallelism across pages.&lt;/p&gt;

&lt;p&gt;In broader &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt;, we compared Kreuzberg to Apache Tika, Docling, MarkItDown, Unstructured.io, PDFPlumber, MinerU, MuPDF4LLM, and more. There, you can see that Kreuzberg is substantially faster on average, with much lower memory usage and a smaller installation footprint: the Kreuzberg Docker image is around 1–1.3GB, versus a Docling Python installation of 1GB+ before you even add your application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Means for Your Stack&lt;/strong&gt;&lt;br&gt;
Already using Docling and happy with the quality? You’ll get equivalent extraction accuracy from Kreuzberg with less infrastructure overhead. The layout model is the same, the execution is faster, and the memory is lower.&lt;/p&gt;

&lt;p&gt;Running a polyglot stack? If your backend is Rust, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno), Kreuzberg gives you the same layout detection capabilities without wrapping a Python service behind an HTTP endpoint. Native bindings for 12 languages, same Rust core underneath.&lt;/p&gt;

&lt;p&gt;Processing at scale? The combination of parallel page processing, native text extraction, and efficient ONNX inference means significantly higher document throughput on the same hardware. No GPU required for layout detection; CPU inference is fast enough for most production workloads.&lt;/p&gt;

&lt;p&gt;Layout detection is available across all 12 language bindings, the CLI, the REST API, and the MCP server. Models auto-download from HuggingFace on first use and are cached locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;# CLI
kreuzberg extract document.pdf --layout-detection
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;# Python
from kreuzberg import extract_file, ExtractionConfig

result = await extract_file("document.pdf", ExtractionConfig(
    layout_detection=True,
    output_format="markdown",
))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Document structure extraction is becoming table stakes for production AI pipelines. Modern AI systems depend on structured document data, and the faster you can extract it, the more scalable your pipeline becomes.&lt;/p&gt;

&lt;p&gt;We’re grateful to the &lt;a href="https://www.docling.ai/" rel="noopener noreferrer"&gt;Docling&lt;/a&gt; team at IBM for the truly great foundation they’ve provided. If you’re running Docling in production today, try &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; against it on your actual documents and let us know what you think.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>rag</category>
      <category>webdev</category>
    </item>
    <item>
      <title>BM25 + Vector Search in One Query: kreuzberg-surrealdb + SurrealDB v3</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Mon, 23 Mar 2026 08:08:59 +0000</pubDate>
      <link>https://forem.com/kreuzberg/bm25-vector-search-in-one-query-kreuzberg-surrealdb-surrealdb-v3-5abi</link>
      <guid>https://forem.com/kreuzberg/bm25-vector-search-in-one-query-kreuzberg-surrealdb-surrealdb-v3-5abi</guid>
      <description>&lt;p&gt;Author: &lt;a href="https://www.linkedin.com/in/androidvarun/" rel="noopener noreferrer"&gt;Varun Tandon&lt;/a&gt;, Software Engineer at &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Search in 40 Lines: kreuzberg-surrealdb + SurrealDB v3
&lt;/h2&gt;

&lt;p&gt;Every hybrid search tutorial starts with clean text already in the database: ten toy documents, never scanned, never duplicated, never OCR'd. Real pipelines start somewhere else: a directory of client PDFs, some scanned, some protected, plus legacy DOCX files and an ingestion layer you've assembled from LangChain loaders, Unstructured subprocesses, and filename-based IDs that inflate your vector store on every re-run.&lt;/p&gt;

&lt;p&gt;kreuzberg-surrealdb replaces that entire pre-query layer. Two calls get you to a working hybrid search pipeline: &lt;code&gt;setup_schema()&lt;/code&gt; creates the HNSW vector index and BM25 full-text index in SurrealDB; &lt;code&gt;ingest_directory()&lt;/code&gt; handles format detection, OCR, chunking, embedding, and deduplication across 88+ file formats. Then SurrealDB's &lt;code&gt;search::rrf()&lt;/code&gt; runs hybrid BM25 + vector search in a single query. It requires SurrealDB v3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start: Connection to Hybrid Search
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;kreuzberg-surrealdb
&lt;span class="c"&gt;# Requires SurrealDB v3: surreal start --user root --pass root&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surrealdb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncSurreal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg_surrealdb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentPipeline&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncSurreal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ws://localhost:8000/rpc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signin&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# "balanced" preset = bge-base-en-v1.5, 768 dims
&lt;/span&gt;        &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Creates HNSW vector index + BM25 full-text index
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Format detection, OCR, chunking, embedding, dedup
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regulatory compliance Q4 2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# search::rrf() is SurrealDB — not kreuzberg-surrealdb
&lt;/span&gt;        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT * FROM search::rrf([
              (SELECT id, content FROM chunks
               WHERE embedding &amp;lt;|10,COSINE|&amp;gt; $embedding),
              (SELECT id, content, search::score(1) AS score FROM chunks
               WHERE content @1@ $query
               ORDER BY score DESC LIMIT 10)
            ], 10, 60);
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building Ingestion for SurrealDB
&lt;/h2&gt;

&lt;p&gt;Hybrid search with RRF improved Mean Reciprocal Rank from 0.410 to 0.486: an 18.5% gain over single-mode retrieval in a production RAG system. That gain depends entirely on both indexes being correctly populated. Getting there from scratch means solving four problems.&lt;/p&gt;
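&lt;p&gt;For intuition, Reciprocal Rank Fusion can be sketched in a few lines: each document scores &lt;code&gt;1 / (k + rank)&lt;/code&gt; per result list it appears in, with &lt;code&gt;k = 60&lt;/code&gt; being the conventional constant. This is the standard RRF formula, not SurrealDB’s internal implementation.&lt;/p&gt;

```python
# Standard Reciprocal Rank Fusion over several ranked result lists.
# Documents appearing near the top of multiple lists accumulate the
# highest combined score, which is why hybrid BM25 + vector retrieval
# favors results both retrievers agree on.
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```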

&lt;p&gt;&lt;strong&gt;Format extraction.&lt;/strong&gt; LangChain's PDFLoader returns empty strings or raises errors on scanned PDFs (GitHub issue #6376). LibreOffice in Unstructured runs single-threaded, so concurrent ingestion creates silent race conditions on file handles. Missing libmagic on the host causes DOCX files to be misidentified as &lt;code&gt;application/zip&lt;/code&gt;, bypassing all DOCX-specific extraction logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HNSW DDL.&lt;/strong&gt; You specify &lt;code&gt;DIMENSION&lt;/code&gt;, &lt;code&gt;DIST&lt;/code&gt;, &lt;code&gt;EFC&lt;/code&gt;, and &lt;code&gt;M&lt;/code&gt; manually. Wrong values silently produce an underperforming index. &lt;code&gt;DimensionMismatchError&lt;/code&gt; fires at insert time, not schema creation. Switch embedding models after writing records and every subsequent insert fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduplication.&lt;/strong&gt; LlamaIndex document IDs default to filename-based hashing (GitHub issue #13461). Re-running the pipeline on unchanged files creates new vector records, triggers re-embedding API calls, and inflates the vector store. Content-hash dedup isn't in LlamaIndex's default configuration.&lt;/p&gt;
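&lt;p&gt;A minimal sketch of the content-hash approach (this illustrates the general technique, not kreuzberg-surrealdb’s internals): key each record on a digest of the file bytes, so re-runs over unchanged files become no-ops instead of duplicate inserts and re-embedding calls.&lt;/p&gt;

```python
# Content-hash deduplication: identical bytes always produce the same
# key, regardless of filename, path, or ingestion order.
import hashlib


def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class DedupStore:
    def __init__(self) -> None:
        self.records: dict[str, bytes] = {}

    def ingest(self, data: bytes) -> bool:
        """Return True if the document was new and stored."""
        key = content_id(data)
        if key in self.records:
            return False  # unchanged content: skip insert and re-embedding
        self.records[key] = data
        return True
```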

&lt;p&gt;&lt;strong&gt;The LangChain SurrealDBVectorStore&lt;/strong&gt; covers retrieval only. Schema creation, chunking, embedding, and batched inserts remain on you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up kreuzberg-surrealdb
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;kreuzberg-surrealdb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You own the &lt;code&gt;AsyncSurreal&lt;/code&gt; connection — authenticate, select namespace and database, then pass it to &lt;code&gt;DocumentPipeline&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surrealdb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncSurreal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg_surrealdb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentPipeline&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;AsyncSurreal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ws://localhost:8000/rpc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signin&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What &lt;code&gt;setup_schema()&lt;/code&gt; Generates
&lt;/h2&gt;

&lt;p&gt;One call creates everything SurrealDB needs to run both retrievers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- documents table&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="n"&gt;SCHEMAFULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;       &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;content_hash&lt;/span&gt;  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;ingested_at&lt;/span&gt;   &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;-- OCR confidence (0.0–1.0) for scanned content&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;         &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt;       &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;      &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="n"&gt;FLEXIBLE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- chunks table&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="n"&gt;SCHEMAFULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;     &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;   &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;chunk_index&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;word_count&lt;/span&gt;  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;page_number&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;char_start&lt;/span&gt;  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;FIELD&lt;/span&gt; &lt;span class="n"&gt;char_end&lt;/span&gt;    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- HNSW vector index&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_chunk_embedding&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="n"&gt;FIELDS&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
  &lt;span class="n"&gt;HNSW&lt;/span&gt; &lt;span class="n"&gt;DIMENSION&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;F32&lt;/span&gt; &lt;span class="n"&gt;DIST&lt;/span&gt; &lt;span class="n"&gt;COSINE&lt;/span&gt; &lt;span class="n"&gt;EFC&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- BM25 full-text index&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;ANALYZER&lt;/span&gt; &lt;span class="n"&gt;text_analyzer&lt;/span&gt; &lt;span class="n"&gt;TOKENIZERS&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt;
  &lt;span class="n"&gt;FILTERS&lt;/span&gt; &lt;span class="n"&gt;lowercase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stemmer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;english&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_chunk_content&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="n"&gt;FIELDS&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
  &lt;span class="k"&gt;SEARCH&lt;/span&gt; &lt;span class="n"&gt;ANALYZER&lt;/span&gt; &lt;span class="n"&gt;text_analyzer&lt;/span&gt; &lt;span class="n"&gt;BM25&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Embedding Presets
&lt;/h2&gt;

&lt;p&gt;The preset determines the &lt;code&gt;DIMENSION&lt;/code&gt; value in the HNSW DDL — you never specify it manually:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Preset&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Dimensions&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"fast"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;all-MiniLM-L6-v2&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;td&gt;Low-latency, resource-constrained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"balanced"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bge-base-en-v1.5&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;Default; best general-purpose trade-off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"quality"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bge-large-en-v1.5&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;High-recall when compute is available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;"multilingual"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;multilingual-e5-base&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;Non-English or mixed-language corpora&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a custom model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EmbeddingModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fastembed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;One important constraint:&lt;/strong&gt; SurrealDB enforces vector dimension server-wide. All tables on the same instance must use the same dimension. Pick the preset before first ingest — changing it later means dropping the HNSW index, re-running &lt;code&gt;setup_schema()&lt;/code&gt;, and re-embedding the entire corpus.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Chunking Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChunkingConfig&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ChunkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Smaller = more precise retrieval, more records
&lt;/span&gt;        &lt;span class="n"&gt;max_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;  &lt;span class="c1"&gt;# Prevents context loss at chunk boundaries
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ingesting a Mixed Document Corpus
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ingest_directory()&lt;/code&gt; walks the directory, detects each file's format, extracts text (with OCR for scanned content), chunks, embeds, and writes to SurrealDB. No Tesseract configuration required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;glob&lt;/code&gt; parameter follows &lt;code&gt;pathlib.Path.glob()&lt;/code&gt; syntax — &lt;code&gt;**/*&lt;/code&gt; walks all subdirectories recursively (default), &lt;code&gt;**/*.pdf&lt;/code&gt; scopes to PDFs only.&lt;/p&gt;

&lt;p&gt;For targeted ingestion or upload flows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Single file
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./reports/q4-2025.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Bytes — e.g. from an HTTP upload handler
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pdf_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upload://q4-2025.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deduplication
&lt;/h2&gt;

&lt;p&gt;kreuzberg-surrealdb computes a SHA-256 hash of the extracted text content and combines it with the chunk index to form the SurrealDB RecordID (pattern: &lt;code&gt;{content_hash}_{chunk_index}&lt;/code&gt;). All inserts use &lt;code&gt;INSERT IGNORE&lt;/code&gt;. Running &lt;code&gt;ingest_directory()&lt;/code&gt; twice on unchanged content is a complete no-op: zero new records, zero re-embedding calls.&lt;/p&gt;

&lt;p&gt;This differs meaningfully from LlamaIndex's &lt;code&gt;filename_as_id=True&lt;/code&gt; default. When you re-ingest the same file from a different path, LlamaIndex generates a new RecordID from the new path and creates a duplicate. kreuzberg-surrealdb hashes the content itself — same text from any path, same RecordID, same no-op.&lt;/p&gt;
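&lt;p&gt;The scheme is easy to reproduce. Here is a minimal sketch of the RecordID derivation (illustrative only; the library's exact text normalization and id encoding may differ):&lt;/p&gt;

```python
import hashlib

def chunk_record_id(extracted_text: str, chunk_index: int) -> str:
    # Hash the content, not the file path, so the same text ingested
    # from two different locations maps to the same RecordID.
    content_hash = hashlib.sha256(extracted_text.encode("utf-8")).hexdigest()
    return f"{content_hash}_{chunk_index}"

# Same text, different source paths: identical id, so INSERT IGNORE is a no-op.
assert chunk_record_id("Q4 revenue grew 12%.", 0) == chunk_record_id("Q4 revenue grew 12%.", 0)
```
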

&lt;p&gt;&lt;strong&gt;Honest limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sequential ingestion.&lt;/strong&gt; &lt;code&gt;ingest_files()&lt;/code&gt; and &lt;code&gt;ingest_directory()&lt;/code&gt; process files in a sequential loop. For high-throughput pipelines, use a queue-based architecture where independent workers each call &lt;code&gt;ingest_file()&lt;/code&gt; per document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No orphan deletion.&lt;/strong&gt; Files removed from the source directory stay in the database. Manual cleanup: &lt;code&gt;DELETE FROM documents WHERE source NOT IN $active_sources;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exact-match dedup only.&lt;/strong&gt; Two slightly different versions of the same document create two separate records. Near-duplicate detection isn't supported.&lt;/li&gt;
&lt;/ul&gt;
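&lt;p&gt;The queue-based workaround for the sequential loop fits in a few lines of asyncio. The &lt;code&gt;ingest&lt;/code&gt; callable below is a stand-in for &lt;code&gt;pipeline.ingest_file()&lt;/code&gt;; the worker count and sentinel handling are illustrative, not part of the library:&lt;/p&gt;

```python
import asyncio

async def _worker(queue: asyncio.Queue, ingest) -> int:
    # Drain the queue until a None sentinel arrives; one document per call.
    done = 0
    while (path := await queue.get()) is not None:
        await ingest(path)  # stand-in for pipeline.ingest_file(path)
        done += 1
    return done

async def ingest_concurrently(paths, ingest, workers: int = 4) -> int:
    queue: asyncio.Queue = asyncio.Queue()
    for p in paths:
        queue.put_nowait(p)
    for _ in range(workers):
        queue.put_nowait(None)  # one stop sentinel per worker
    counts = await asyncio.gather(*(_worker(queue, ingest) for _ in range(workers)))
    return sum(counts)
```

&lt;p&gt;Content-hash deduplication makes this safe: if two workers race on identical content, the second &lt;code&gt;INSERT IGNORE&lt;/code&gt; is a no-op.&lt;/p&gt;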

&lt;h2&gt;
  
  
  The Hybrid Search Payoff: How &lt;code&gt;search::rrf()&lt;/code&gt; Works
&lt;/h2&gt;

&lt;p&gt;Both indexes are now in SurrealDB — HNSW for vector retrieval, BM25 for full-text. SurrealDB's &lt;code&gt;search::rrf()&lt;/code&gt; combines them in a single query using Reciprocal Rank Fusion (RRF).&lt;/p&gt;

&lt;p&gt;Because RRF operates on ranked positions rather than raw scores, BM25's unbounded values and cosine similarity's 0–1 range are never directly compared. No score normalization. No alpha parameter. The formula (Cormack, Clarke &amp;amp; Buettcher, SIGIR 2009):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RRF_score(d) = Σ 1 / (k + rank_i(d))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;k=60&lt;/code&gt; is the smoothing constant from the original paper — not a tunable weight. It prevents top-ranked documents from dominating when they appear near rank 1 in only one list.&lt;/p&gt;
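&lt;p&gt;A toy reimplementation makes the behavior concrete (for illustration only; &lt;code&gt;search::rrf()&lt;/code&gt; does this server-side):&lt;/p&gt;

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: sum 1/(k + rank) across every ranked list
    # a document appears in, then sort by the fused score.
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
assert rrf([bm25_hits, vector_hits])[0] == "b"
```

&lt;p&gt;Note how &lt;code&gt;b&lt;/code&gt; wins: it is not rank 1 in either list, but it ranks high in both, which is exactly the behavior RRF rewards.&lt;/p&gt;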

&lt;h2&gt;
  
  
  Attribution: What Owns What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Owner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Extraction from 88+ formats, OCR&lt;/td&gt;
&lt;td&gt;kreuzberg (via kreuzberg-surrealdb)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunking and embedding&lt;/td&gt;
&lt;td&gt;kreuzberg (via kreuzberg-surrealdb)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HNSW + BM25 index creation&lt;/td&gt;
&lt;td&gt;kreuzberg-surrealdb (&lt;code&gt;setup_schema()&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consistent query embedding&lt;/td&gt;
&lt;td&gt;kreuzberg-surrealdb (&lt;code&gt;embed_query()&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid fusion&lt;/td&gt;
&lt;td&gt;SurrealDB (&lt;code&gt;search::rrf()&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector + full-text retrieval&lt;/td&gt;
&lt;td&gt;SurrealDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk → document traversal&lt;/td&gt;
&lt;td&gt;SurrealDB record links&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  All Three Search Modes
&lt;/h2&gt;

&lt;p&gt;Always call &lt;code&gt;embed_query()&lt;/code&gt; before a vector or hybrid search. It ensures the query vector uses the same model and dimension as the stored chunk embeddings. A mismatch makes cosine similarity scores meaningless without raising an error.&lt;/p&gt;
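&lt;p&gt;A cheap guard catches that failure mode before the query reaches the database. This is a hypothetical helper, not part of the library; &lt;code&gt;EXPECTED_DIM&lt;/code&gt; must match your preset's dimension:&lt;/p&gt;

```python
EXPECTED_DIM = 768  # "balanced" preset; 384 for "fast", 1024 for "quality"

def check_query_embedding(vec: list, expected_dim: int = EXPECTED_DIM) -> list:
    # Fail loudly instead of letting a mismatched vector produce
    # silently meaningless cosine scores.
    if len(vec) != expected_dim:
        raise ValueError(
            f"query embedding has {len(vec)} dims, index expects {expected_dim}; "
            "was it produced by a different model?"
        )
    return vec
```
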

&lt;p&gt;&lt;strong&gt;Hybrid (BM25 + vector):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regulatory compliance Q4 2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT * FROM search::rrf([
      (SELECT id, content FROM chunks
       WHERE embedding &amp;lt;|10,COSINE|&amp;gt; $embedding),
      (SELECT id, content, search::score(1) AS score FROM chunks
       WHERE content @1@ $query
       ORDER BY score DESC LIMIT 10)
    ], 10, 60);
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vector-only:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;knn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;|&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;BM25-only:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Chunk → parent document traversal&lt;/strong&gt; (no JOIN, no second query):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;document&lt;/code&gt; field on each chunk is a SurrealDB record link — dot notation traverses it inline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Each Retriever Fails
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;BM25 fails on:&lt;/strong&gt; paraphrased queries, vocabulary mismatch ("car" vs "automobile"), semantic synonyms, conceptual proximity without term overlap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector fails on:&lt;/strong&gt; exact product codes, named entities, precise version strings, rare technical terms, regulation IDs, serial numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid RRF covers both.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Filtering by OCR Quality
&lt;/h2&gt;

&lt;p&gt;Low-quality extraction degrades both retrievers. Filter on &lt;code&gt;quality_score&lt;/code&gt; before retrieval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;rrf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;|&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;search&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tuning HNSW and BM25 Parameters
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;setup_schema()&lt;/code&gt; exposes four tunable index parameters, plus the distance metric. The defaults work well for 256–512 token chunks in typical document corpora.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;hnsw_efc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# Higher = better recall, slower index build
&lt;/span&gt;    &lt;span class="n"&gt;hnsw_m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Higher = better recall, more memory per node
&lt;/span&gt;    &lt;span class="n"&gt;distance_metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COSINE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bm25_k1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# Term-frequency saturation
&lt;/span&gt;    &lt;span class="n"&gt;bm25_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;                &lt;span class="c1"&gt;# Length normalization
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;When to tune&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hnsw_efc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;td&gt;Large corpora (100K+ docs) where recall matters more than indexing speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hnsw_m&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;High-dimensional embeddings (1024-dim); memory is available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bm25_k1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;td&gt;Technical corpora with high term repetition (code, legal docs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bm25_b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;Corpora with highly variable document lengths&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Parameters are fixed at schema creation time. Changing them requires dropping and recreating the indexes and re-embedding the full corpus. Pick before ingesting production data.&lt;/p&gt;
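&lt;p&gt;The DDL those parameters feed into can be sketched as a plain string template, which makes the coupling explicit. This rendering is hypothetical; the actual statement &lt;code&gt;setup_schema()&lt;/code&gt; emits is the one shown at the top of this post:&lt;/p&gt;

```python
def hnsw_index_ddl(dimension: int = 768, efc: int = 150, m: int = 12,
                   metric: str = "COSINE") -> str:
    # Mirrors the DEFINE INDEX statement shown earlier; changing any
    # parameter means dropping and redefining the index.
    return (
        "DEFINE INDEX idx_chunk_embedding ON chunks FIELDS embedding "
        f"HNSW DIMENSION {dimension} TYPE F32 DIST {metric} EFC {efc} M {m};"
    )

assert "DIMENSION 768" in hnsw_index_ddl()
```
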

&lt;h2&gt;
  
  
  Why Not pgvector + Qdrant?
&lt;/h2&gt;

&lt;p&gt;Running pgvector and Qdrant separately means two write paths, two uptime SLAs, and no ACID guarantees across them. Here's a failure mode every engineer hits eventually: a Qdrant write succeeds; a Postgres write fails during a network partition. The vector store now holds an embedding whose parent document record doesn't exist. Your search returns a chunk with no context — no source, no metadata, no document link. The retry wrapper is still on the backlog.&lt;/p&gt;

&lt;p&gt;kreuzberg-surrealdb's &lt;code&gt;ingest_directory()&lt;/code&gt; writes to &lt;code&gt;documents&lt;/code&gt; and &lt;code&gt;chunks&lt;/code&gt; in the same database. Both the HNSW index and the BM25 index are maintained within the same transaction. &lt;code&gt;search::rrf()&lt;/code&gt; runs inside that same database — no cross-service retrieval latency, no dual-write coordination. The record link from &lt;code&gt;chunks.document&lt;/code&gt; to &lt;code&gt;documents&lt;/code&gt; is always consistent because both were written in the same transaction.&lt;/p&gt;

&lt;p&gt;The LangChain &lt;code&gt;EnsembleRetriever&lt;/code&gt; compounds the problem: two separate HTTP calls to two separate systems, merged in Python with a hardcoded &lt;code&gt;weights&lt;/code&gt; parameter. Weights don't apply to a rank-based algorithm; that mismatch is baked into the design. &lt;code&gt;search::rrf()&lt;/code&gt; doesn't have this problem.&lt;/p&gt;
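
&lt;p&gt;To see why per-retriever weights have no place in a rank-based algorithm, here is a minimal Python sketch of Reciprocal Rank Fusion. It only illustrates the scoring rule, not SurrealDB's internal implementation; the constant &lt;code&gt;k=60&lt;/code&gt; comes from the original RRF paper:&lt;/p&gt;

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF).
# Each result list contributes 1 / (k + rank) per document; scores are summed.
# Only ranks matter, which is why score weights from different retrievers
# (BM25 scores vs. cosine distances) never need to be reconciled.

def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["a", "b", "c"]    # ranked output of the keyword retriever
vector_hits = ["b", "c", "d"]  # ranked output of the vector retriever
print(rrf([bm25_hits, vector_hits]))  # ['b', 'c', 'a', 'd']
```

&lt;p&gt;Document &lt;code&gt;b&lt;/code&gt; wins because it ranks highly in both lists, without any weight tuning.&lt;/p&gt;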

&lt;p&gt;&lt;strong&gt;Honest trade-offs:&lt;/strong&gt; SurrealDB isn't Elasticsearch. At very large scale — hundreds of millions of vectors — specialized vector databases have more managed hosting options and operational tooling. &lt;code&gt;ingest_files()&lt;/code&gt; is sequential; high-throughput batch ingestion requires a queue-based architecture regardless of which database you're using. As of SurrealDB v3, there's no managed cloud option at scale. Verify current hosting options before adopting this stack for production infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What SurrealDB version is required?&lt;/strong&gt; &lt;code&gt;search::rrf()&lt;/code&gt; requires SurrealDB v3. It is not available in v2. BM25 and vector search work separately on v2, but not the combined hybrid query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use a custom embedding model?&lt;/strong&gt; Yes, via &lt;code&gt;EmbeddingModelType.fastembed()&lt;/code&gt; or &lt;code&gt;EmbeddingModelType.custom()&lt;/code&gt;. You must provide &lt;code&gt;embedding_dimensions&lt;/code&gt; explicitly. All chunks and queries must use the same model and dimension, since SurrealDB enforces dimension consistency server-wide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is ingestion concurrent or sequential?&lt;/strong&gt; &lt;code&gt;ingest_files()&lt;/code&gt; and &lt;code&gt;ingest_directory()&lt;/code&gt; are sequential. For high-throughput pipelines, use a queue-based architecture with one worker per document. &lt;code&gt;ingest_bytes()&lt;/code&gt; can be called concurrently from multiple coroutines.&lt;/p&gt;
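
&lt;p&gt;A minimal sketch of the queue-based pattern, assuming an async per-document ingest step. The &lt;code&gt;ingest_one&lt;/code&gt; placeholder stands in for a call such as &lt;code&gt;ingest_bytes()&lt;/code&gt;; the worker count and names are illustrative:&lt;/p&gt;

```python
import asyncio

async def ingest_one(doc: str) -> str:
    # Placeholder for the real per-document ingest call (e.g. ingest_bytes()).
    await asyncio.sleep(0)
    return f"ingested:{doc}"

async def worker(queue: asyncio.Queue, results: list) -> None:
    # Each worker drains the shared queue until cancelled.
    while True:
        doc = await queue.get()
        results.append(await ingest_one(doc))
        queue.task_done()

async def ingest_all(docs, concurrency: int = 4) -> list:
    queue, results = asyncio.Queue(), []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    for doc in docs:
        queue.put_nowait(doc)
    await queue.join()  # wait until every queued document is processed
    for w in workers:
        w.cancel()
    return results

print(sorted(asyncio.run(ingest_all(["a.pdf", "b.pdf", "c.pdf"]))))
```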

&lt;p&gt;&lt;strong&gt;What happens to records for deleted files?&lt;/strong&gt; Nothing automatic. Records remain until manually removed. See orphan cleanup above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pip install kreuzberg-surrealdb&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/kreuzberg-dev/kreuzberg-surrealdb" rel="noopener noreferrer"&gt;github.com/kreuzberg-dev/kreuzberg-surrealdb&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Deduplication demo: &lt;code&gt;examples/incremental_ingest.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>vectordatabase</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How to Extract Text from PDF in Python (2026)</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Mon, 16 Mar 2026 09:21:35 +0000</pubDate>
      <link>https://forem.com/kreuzberg/how-to-extract-text-from-pdf-in-python-2026-3a97</link>
      <guid>https://forem.com/kreuzberg/how-to-extract-text-from-pdf-in-python-2026-3a97</guid>
      <description>&lt;p&gt;Extracting text from PDFs is still one of the most common tasks in data engineering, AI pipelines, and automation workflows. Whether you're building a search system, a retrieval-augmented generation (RAG) pipeline, or simply processing reports, the first step is turning PDFs into clean, usable text.&lt;/p&gt;

&lt;p&gt;At first glance this sounds simple, but PDFs were never designed to be machine-readable in the way modern formats are. A PDF is essentially a set of instructions describing how a page should look, not a structured representation of paragraphs, headings, or tables. That means text may be stored in fragments, positioned arbitrarily, or embedded as images.&lt;/p&gt;

&lt;p&gt;Because of this, native extraction often produces broken sentences, incorrect reading order, or missing content. Modern tools try to reconstruct structure rather than just reading raw text streams, which is why the choice of extraction method matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  How PDF Text Extraction Works
&lt;/h2&gt;

&lt;p&gt;Most PDF extraction pipelines follow the same high-level process. First, the document is parsed page by page. Then text blocks are detected and assembled into a readable order. If the document contains scanned pages instead of selectable text, OCR is applied. Finally, the output is normalized so it can be indexed, searched, or passed to downstream systems.&lt;/p&gt;

&lt;p&gt;Even though this workflow sounds straightforward, each step contains a surprising amount of complexity. Reading order detection, for example, becomes difficult in multi-column layouts or technical documents. Tables introduce another layer of difficulty, because the visual structure does not always map cleanly to text.&lt;br&gt;
This is why many teams eventually move beyond simple PDF libraries to more complete document processing frameworks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Extracting Text from a PDF in Python
&lt;/h2&gt;

&lt;p&gt;In Python, the basic workflow for extracting text usually looks the same regardless of the library being used. A document is loaded, parsed, and converted into text that can be printed, stored, or processed further. Different libraries use different APIs, but the general pattern remains consistent. The real differences appear in how well they handle layout, performance, and OCR.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using Kreuzberg for PDF Extraction
&lt;/h2&gt;

&lt;p&gt;Modern document pipelines often require more than just reading text streams. They need consistent metadata, reliable handling of different formats, and good performance when processing large batches of files.&lt;/p&gt;

&lt;p&gt;Kreuzberg is designed for this type of workload. It uses a Rust-based extraction engine with Python bindings (and supports 11 other programming languages as of March 2026), enabling efficient document processing while integrating smoothly into Python pipelines.&lt;/p&gt;

&lt;p&gt;Here is how to get started with Kreuzberg in Python. First, install the package:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install kreuzberg&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For the simplest case — extracting text from a PDF synchronously — use &lt;code&gt;extract_file_sync&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
print(f"Pages: {result.metadata['page_count']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you are working in an async context, the async variant works identically:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
from kreuzberg import extract_file

async def main():
    result = await extract_file("document.pdf")
    print(result.content)
    print(f"Tables found: {len(result.tables)}")

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ExtractionResult&lt;/code&gt; object returned by both variants gives you &lt;code&gt;result.content&lt;/code&gt; for the extracted text, &lt;code&gt;result.tables&lt;/code&gt; for any detected tables, and &lt;code&gt;result.metadata&lt;/code&gt; for document properties like page count and format type. To process multiple PDFs at once, use the batch extraction functions, which handle concurrency automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;batch_extract_files_sync&lt;/span&gt;

&lt;span class="n"&gt;paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;batch_extract_files_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; characters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For scanned PDFs, enable OCR by passing an &lt;code&gt;ExtractionConfig&lt;/code&gt; with an &lt;code&gt;OcrConfig&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_file_sync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OcrConfig&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ocr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OcrConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tesseract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_file_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scanned.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Chinese documents, you can also use PaddleOCR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_file_sync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OcrConfig&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ocr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OcrConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paddleocr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_file_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scanned.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted content (preview): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total characters: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a &lt;code&gt;libonnxruntime.so&lt;/code&gt; loading error, install &lt;code&gt;onnxruntime&lt;/code&gt; first:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python -m pip install --upgrade onnxruntime&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If the error still persists on Linux, add the &lt;code&gt;onnxruntime/capi&lt;/code&gt; directory to &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; before running your script (replace the path with your actual venv location):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export LD_LIBRARY_PATH="&amp;lt;venv&amp;gt;/lib/pythonX.Y/site-packages/onnxruntime/capi:$LD_LIBRARY_PATH"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Kreuzberg supports Tesseract, EasyOCR, and PaddleOCR as backends, which is useful for multilingual documents where backend quality varies by language.&lt;/p&gt;
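
&lt;p&gt;One way to exploit this is a small selection policy that picks a backend per language. The mapping below is an illustrative assumption for this example, not behavior built into Kreuzberg:&lt;/p&gt;

```python
# Illustrative backend-selection policy for multilingual pipelines.
# The language-to-backend mapping here is an assumption, not Kreuzberg's
# built-in behavior; tune it against your own documents.
PREFERRED_BACKEND = {"zh": "paddleocr", "ja": "easyocr", "ko": "easyocr"}

def pick_ocr_backend(language: str) -> str:
    # Fall back to Tesseract, which covers a broad range of languages.
    return PREFERRED_BACKEND.get(language, "tesseract")

print(pick_ocr_backend("zh"))   # paddleocr
print(pick_ocr_backend("eng"))  # tesseract
```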

&lt;h2&gt;
  
  
  Handling Scanned PDFs
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges in real-world workflows is dealing with scanned documents. These files contain images instead of selectable text, so extraction requires optical character recognition.&lt;/p&gt;

&lt;p&gt;A modern pipeline typically detects when text is missing and automatically runs OCR before merging the results into the document structure. The quality of OCR depends heavily on language, resolution, and document quality, which is why systems that allow different OCR backends are often more reliable in multilingual environments.&lt;/p&gt;
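
&lt;p&gt;Detecting "text is missing" can be approximated with a cheap heuristic on the native extraction output before falling back to OCR. The threshold below is an illustrative assumption, not a tuned value:&lt;/p&gt;

```python
def needs_ocr(extracted_text: str, page_count: int = 1,
              min_chars_per_page: int = 25) -> bool:
    # If native extraction yields almost no selectable text per page,
    # the pages are likely scanned images and should go through OCR.
    visible = "".join(extracted_text.split())
    has_enough_text = len(visible) >= min_chars_per_page * page_count
    return not has_enough_text

print(needs_ocr("", page_count=3))  # True: no selectable text at all
```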

&lt;h2&gt;
  
  
  Extracting Tables and Structured Content
&lt;/h2&gt;

&lt;p&gt;Tables are another area where simple extraction approaches struggle. Even when the text is captured correctly, the relationships between rows and columns may be lost.&lt;br&gt;
More advanced extraction pipelines attempt to detect table regions and preserve structure so that data remains usable. This is particularly important in financial reports, research papers, and operational documents where tables often contain the most important information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and Scaling Considerations
&lt;/h2&gt;

&lt;p&gt;Performance becomes increasingly important as soon as you begin processing more than a handful of files. Batch ingestion, RAG pipelines, and search indexing workflows may involve thousands or millions of documents, and inefficiencies at the parsing stage quickly become expensive.&lt;/p&gt;

&lt;p&gt;Several factors influence performance, including how the parsing engine is implemented, how memory is managed, and how well the system supports concurrency. Tools that rely heavily on interpreted execution or external subprocesses often encounter bottlenecks at scale, while native parsing engines tend to perform better under sustained workloads.&lt;br&gt;
This is one reason many modern document processing tools use compiled cores with language bindings on top.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where PDF Extraction Fits in a Modern Pipeline
&lt;/h2&gt;

&lt;p&gt;In most real systems, text extraction is only the first step. Once text is available, it is typically split into chunks, converted into embeddings, and stored in a vector database for retrieval.&lt;br&gt;
This architecture has become standard for document search and RAG systems because it allows large collections of documents to be queried efficiently. Reliable extraction is the foundation that makes everything else possible.&lt;/p&gt;
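
&lt;p&gt;The chunking step can be as simple as a fixed-size splitter with overlap. The sizes here are illustrative; production chunkers typically split on sentence or section boundaries instead:&lt;/p&gt;

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size chunks with a small overlap so context isn't cut mid-thought.
    chunks, start = [], 0
    while len(text) > start:
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

doc = "x" * 1200
print([len(c) for c in chunk_text(doc)])  # [500, 500, 300]
```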

&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;Developers new to PDF extraction often assume that all PDFs behave the same way. In reality, documents vary widely in structure and quality, and a pipeline that works well for one dataset may fail on another.&lt;br&gt;
It is always worth testing extraction using a mix of documents, including scanned files, multi-column layouts, and large reports. Problems usually appear quickly under realistic conditions.&lt;br&gt;
Another common mistake is ignoring metadata. Information such as page numbers, titles, and document structure often becomes critical later, especially when building retrieval systems that need to cite sources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Extracting text from PDFs in Python is easier than it was a few years ago, but the fundamental challenges of document structure and layout remain. Choosing tools that handle these complexities well can significantly improve the quality of downstream systems, from search to RAG to analytics. Once the ingestion layer is reliable, the rest of the pipeline becomes far easier to design and maintain.&lt;/p&gt;

</description>
      <category>python</category>
      <category>pdf</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Kreuzberg vs. Unstructured.io: Benchmarks and Architecture Comparison (March 2026)</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Mon, 02 Mar 2026 14:33:56 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzberg-vs-unstructuredio-benchmarks-and-architecture-comparison-march-2026-2ogf</link>
      <guid>https://forem.com/kreuzberg/kreuzberg-vs-unstructuredio-benchmarks-and-architecture-comparison-march-2026-2ogf</guid>
      <description>&lt;h1&gt;
  
  
  Kreuzberg vs Unstructured: Benchmarks and Architecture Comparison (March 2026)
&lt;/h1&gt;

&lt;p&gt;When building document pipelines, the choice of extraction library directly impacts performance, infrastructure costs, and reliability. &lt;a href="https://medium.com/@premchandak_11/stop-using-unstructured-io-kreuzberg-v4-0-is-the-rust-powered-successor-db1c154ab21a" rel="noopener noreferrer"&gt;Two tools that developers compare&lt;/a&gt; are &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; and &lt;a href="https://github.com/Unstructured-IO/unstructured" rel="noopener noreferrer"&gt;Unstructured.io&lt;/a&gt;. Both can extract text and metadata from documents, but they differ significantly in architecture and behavior under load. The purpose of this comparison is not to declare a universal winner but to explain where the differences come from and when they matter in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Difference: Architecture
&lt;/h2&gt;

&lt;p&gt;The most important distinction between the two systems is architectural. &lt;a href="https://github.com/Unstructured-IO/unstructured" rel="noopener noreferrer"&gt;Unstructured.io&lt;/a&gt; is primarily Python-based and designed around flexible pipelines and integrations, which makes &lt;a href="https://github.com/Unstructured-IO/unstructured" rel="noopener noreferrer"&gt;its open source library&lt;/a&gt; convenient for rapid prototyping and experimentation in Python-centric environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; takes a different approach. Its extraction engine is implemented in Rust and exposed through bindings to multiple languages. In March 2026, Kreuzberg supports 12 programming languages: Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, WASM, and TypeScript/Node.js. This design moves performance-critical work into compiled native code while keeping the developer experience accessible in Python and other stacks.&lt;/p&gt;

&lt;p&gt;This is important for performance because compiled native code runs directly on the CPU without an interpreter or runtime in between, unlike Python, which adds an extra layer of execution overhead (bytecode interpretation, dynamic typing, and runtime dispatch).&lt;/p&gt;

&lt;p&gt;In practical terms, this architectural difference affects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how memory is allocated and reused (e.g. predictable allocation patterns vs Python object overhead)
&lt;/li&gt;
&lt;li&gt;how concurrency is implemented (native threads vs multiprocessing or async workarounds)
&lt;/li&gt;
&lt;li&gt;how much work happens across process boundaries (serialization/deserialization costs)
&lt;/li&gt;
&lt;li&gt;and how much runtime initialization is required before processing begins
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These factors become increasingly noticeable when processing large batches of files or running pipelines continuously in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Benchmarks Measure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;Document-processing benchmarks&lt;/a&gt; are meaningful only when they measure more than raw speed. Throughput, latency percentiles, memory usage, installation size, cold start time, and extraction reliability all contribute to real-world performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;Kreuzberg’s published benchmarks&lt;/a&gt; use reproducible tests and real-world datasets containing PDFs, Office documents, images, and multilingual text. The benchmark harness runs continuously and is designed to isolate extraction performance from external bottlenecks (e.g. network or I/O), so results reflect the behavior of the extraction engine itself.&lt;/p&gt;

&lt;p&gt;This is important because pipelines often behave very differently depending on file type and layout complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Throughput in Practice
&lt;/h2&gt;

&lt;p&gt;In comparative benchmarks, Kreuzberg has demonstrated significantly higher throughput than many Python-only pipelines, in some cases processing documents roughly 9–50x faster on average across tested workloads. Results vary depending on document type and configuration, but the pattern is consistent: architecture matters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5u6wwsfszxmp44vwq97f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5u6wwsfszxmp44vwq97f.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Snapshot from 2 March 2026 of the Kreuzberg (Rust) PDF benchmarks, showing duration and quality score. See the current benchmarks here: &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;https://kreuzberg.dev/benchmarks&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These differences arise from several factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native parsing avoids interpreter overhead and reduces per-operation cost
&lt;/li&gt;
&lt;li&gt;Tight loops and parsing routines can be optimized at compile time
&lt;/li&gt;
&lt;li&gt;Memory locality and cache efficiency are improved in compiled code
&lt;/li&gt;
&lt;li&gt;Parallel execution can be implemented without the constraints of the Python GIL
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When processing large datasets, these advantages compound. Even small per-document overhead differences (e.g. a few milliseconds) can translate into minutes or hours of total runtime at scale.&lt;/p&gt;
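
&lt;p&gt;A quick back-of-envelope calculation (with illustrative numbers) shows how this compounds:&lt;/p&gt;

```python
# Back-of-envelope: a 5 ms per-document overhead across one million documents
# adds roughly 83 minutes of extra sequential runtime.
docs = 1_000_000
overhead_ms = 5
total_minutes = docs * overhead_ms / 1000 / 60
print(f"{total_minutes:.1f} extra minutes")  # 83.3 extra minutes
```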

&lt;h2&gt;
  
  
  Installation Size and Operational Considerations
&lt;/h2&gt;

&lt;p&gt;If you’ve ever deployed a large-scale system, you know that another practical difference appears in installation size and dependency complexity. Benchmarks have shown smaller installation footprints and fewer dependencies for Kreuzberg compared with heavier Python frameworks like Unstructured.&lt;/p&gt;

&lt;p&gt;Python-based pipelines often depend on multiple layers of libraries (parsers, OCR tools, system packages), which increases container size and introduces more potential points of failure or version conflicts. In contrast, Kreuzberg packages much of its functionality into a compiled core, reducing the need for large runtime dependency chains.&lt;/p&gt;

&lt;p&gt;While this may seem like a minor detail, it affects container build times, CI pipelines, and cold start latency. In large distributed systems, operational efficiency often matters as much as raw processing speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cold Start and Resource Efficiency
&lt;/h2&gt;

&lt;p&gt;Cold start time becomes particularly important in serverless or autoscaling environments. Systems that rely on large dependency stacks or complex initialization routines may take longer to start, increasing latency and costs.&lt;/p&gt;

&lt;p&gt;Native engines with smaller runtime requirements tend to start faster because there is less dynamic initialization (e.g. importing modules, resolving dependencies, initializing interpreters). They also tend to use memory more efficiently due to tighter control over allocation and fewer intermediary objects.&lt;/p&gt;

&lt;p&gt;Lower memory usage also allows higher parallelism and better performance on smaller machines, which directly impacts cost efficiency in production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extraction Quality
&lt;/h2&gt;

&lt;p&gt;Performance alone is not enough. Extraction quality depends on document type, layout complexity, language, and OCR requirements. Some pipelines may perform better on certain classes of documents than others, which is why testing on representative datasets is essential. Benchmarks provide useful signals, but they cannot replace real-world evaluation.&lt;/p&gt;

&lt;p&gt;In our own benchmarks (early 2026; &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;see the benchmark run and live dashboard&lt;/a&gt;), these tradeoffs show up clearly in practice. Across mixed real-world datasets (PDFs, Office documents, HTML, and scanned images), we observe that performance and extraction behavior vary significantly by document class.&lt;/p&gt;

&lt;p&gt;On clean, text-based PDFs, multiple pipelines achieve high success rates (often &amp;gt;95%) with relatively stable latency. In contrast, on OCR-heavy or layout-complex documents (e.g. scanned pages, tables, multilingual content), both latency and extraction consistency diverge more noticeably.&lt;/p&gt;

&lt;p&gt;In our runs, throughput differences between approaches range from ~9× to as high as ~50× depending on the workload, while structured extraction success rates can drop by 10–30 percentage points on harder document types. These effects are often linked to how each system handles layout detection, OCR integration, and post-processing heuristics.&lt;/p&gt;

&lt;p&gt;These figures are drawn from our late February 2026 benchmark runs and will evolve, but they consistently reinforce the same point: performance and quality are highly dependent on file type, language, and processing mode (single-file vs batch), which is why evaluating on representative datasets is critical before making production decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing Between the Two
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Unstructured-IO/unstructured" rel="noopener noreferrer"&gt;Unstructured.io&lt;/a&gt; has built a reputation as a flexible, developer-friendly toolkit for working with unstructured data. Its open-source library can be a strong choice for rapid prototyping, especially in Python-heavy environments where flexibility and ecosystem integration are the primary concerns. For smaller workloads or experimental projects, this can be an excellent fit. For production use at scale, bigger teams rely on &lt;a href="https://unstructured.io/" rel="noopener noreferrer"&gt;Unstructured’s hosted platform&lt;/a&gt;, which is tailored more toward enterprise deployments and managed workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg open source&lt;/a&gt; tends to be particularly attractive in production ingestion pipelines, where throughput, resource efficiency, and predictable performance become more important. In these environments, architectural differences translate directly into cost savings and faster processing. Many teams and companies are already using the OSS in their workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg Cloud&lt;/a&gt; will be a fully managed, core-available solution for teams who need reduced complexity and excellent results in a single API. The library will remain MIT-licensed (permissive open-source) forever. The commercial offering is being built around the core library, not by restricting the library itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Document processing is the foundation of modern AI and search systems. In production environments, pipelines often need to handle millions, or even tens of millions, of documents, where even small inefficiencies compound quickly.&lt;/p&gt;

&lt;p&gt;The ingestion layer, including extraction, chunking, and embedding preparation, determines how fast data can be processed, how reliably it can be retrieved, and how much infrastructure is required to operate the system at scale.&lt;/p&gt;

&lt;p&gt;Testing with real data remains the most reliable way to decide, and you’re welcome to rerun Kreuzberg’s newly published comparative benchmarks from GitHub, either in batch mode or on a single file. The benchmarks run continuously in CI.&lt;/p&gt;

&lt;p&gt;Benchmarks consistently show that architecture plays a major role in performance. Native engines, efficient memory usage, and minimal dependencies often lead to higher throughput and lower operational cost. At the same time, the best tool always depends on the documents being processed and the requirements of the system.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>document</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building a RAG pipeline with Kreuzberg and LangChain</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Mon, 23 Feb 2026 09:28:56 +0000</pubDate>
      <link>https://forem.com/kreuzberg/building-a-rag-pipeline-with-kreuzberg-and-langchain-3gj2</link>
      <guid>https://forem.com/kreuzberg/building-a-rag-pipeline-with-kreuzberg-and-langchain-3gj2</guid>
      <description>&lt;p&gt;Most discussions about retrieval-augmented generation (RAG) focus on choosing the right model, tuning prompts, or experimenting with vector databases. In practice, these are rarely the hardest parts. The real bottleneck appears much earlier: getting clean, reliable text out of messy documents.&lt;/p&gt;

&lt;p&gt;There is a real challenge in ingestion, chunking, and embeddings. PDFs preserve visual layout rather than logical structure, Office files rely on completely different internal formats, and scanned documents require OCR before any text exists at all. Metadata is often incomplete or inconsistent, and small problems at this stage propagate downstream. If the extraction quality is poor, retrieval becomes unreliable, and the language model begins to produce weak or misleading answers.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; plays a central role, covering the entire early-stage data flow: document ingestion, text chunking, and embedding generation. A typical RAG pipeline can combine Kreuzberg for ingestion, chunking, and embeddings with LangChain as the orchestration layer, alongside a vector database and an LLM. While the architecture is fairly standard, the quality of the early steps determines everything that follows.&lt;/p&gt;

&lt;p&gt;Embeddings are numerical vector representations of text. An embedding model converts a piece of text, such as a sentence, paragraph, or document, into a list of numbers that captures its semantic meaning. Texts with similar meanings end up close to each other in this high-dimensional vector space, making it possible to search by meaning rather than exact keywords. If you haven’t seen this before, the &lt;a href="https://projector.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow Embedding Projector&lt;/a&gt; is a useful way to visualize how embeddings cluster similar concepts together.&lt;/p&gt;

&lt;p&gt;Here are the steps to a RAG pipeline with &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; and &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract text from a sample PDF and DOCX using Kreuzberg &lt;/li&gt;
&lt;li&gt;Inspect the raw output and metadata to understand what high-quality extraction looks like&lt;/li&gt;
&lt;li&gt;Chunk the text using a concrete strategy (recursive splitting with overlap) with Kreuzberg&lt;/li&gt;
&lt;li&gt;Generate embeddings with Kreuzberg and store them in a vector database such as Chroma or FAISS&lt;/li&gt;
&lt;li&gt;Wire everything together with LangChain and run a query end-to-end&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the examples below, we'll use Kreuzberg's Python bindings.&lt;/p&gt;

&lt;p&gt;Begin by installing dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install kreuzberg langchain chromadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, extract text from your document.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kreuzberg import extract

# Extract from a PDF
pdf_result = extract("sample.pdf")

# Extract from a DOCX
docx_result = extract("sample.docx")

print(pdf_result.text[:500])
print(pdf_result.metadata)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this stage, you receive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clean extracted text&lt;/li&gt;
&lt;li&gt;structured metadata&lt;/li&gt;
&lt;li&gt;page-level and document-level information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After that, chunk the extracted text. Instead of manually splitting strings, use Kreuzberg’s built-in chunking configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kreuzberg import extract, ChunkingConfig

result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50
    )
)

# Access generated chunks
for chunk in result.chunks[:3]:
    print(chunk.content)
    print(chunk.metadata)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Embeddings with Kreuzberg are the next step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kreuzberg import extract, ChunkingConfig, EmbeddingConfig

result = extract(
    "sample.pdf",
    chunking=ChunkingConfig(
        strategy="recursive",
        chunk_size=500,
        chunk_overlap=50
    ),
    embedding=EmbeddingConfig(
        preset="sentence-transformers/all-MiniLM-L6-v2"
    )
)

# Each chunk now contains an embedding vector
first_chunk = result.chunks[0]

print(len(first_chunk.embedding))  # vector dimension
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, store the embeddings in a vector database (for example, &lt;a href="https://www.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(anonymized_telemetry=False))
collection = client.create_collection("documents")

for chunk in result.chunks:
    collection.add(
        documents=[chunk.content],
        metadatas=[chunk.metadata],
        embeddings=[chunk.embedding],
        ids=[chunk.id]
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And query with LangChain. LangChain orchestrates retrieval and generation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

vectorstore = Chroma(
    collection_name="documents",
    embedding_function=None  # embeddings already computed
)

retriever = vectorstore.as_retriever()

llm = ChatOpenAI(model="gpt-4o-mini")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever
)

response = qa_chain.run("What is this document about?")
print(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangChain connects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the retriever (vector database)&lt;/li&gt;
&lt;li&gt;the prompt template&lt;/li&gt;
&lt;li&gt;the LLM&lt;/li&gt;
&lt;li&gt;the final response pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What You Just Built
&lt;/h2&gt;

&lt;p&gt;You now have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;document ingestion (Kreuzberg)&lt;/li&gt;
&lt;li&gt;structured chunking (Kreuzberg)&lt;/li&gt;
&lt;li&gt;embedding generation (Kreuzberg)&lt;/li&gt;
&lt;li&gt;vector storage (Chroma)&lt;/li&gt;
&lt;li&gt;retrieval orchestration (LangChain)&lt;/li&gt;
&lt;li&gt;answer synthesis (LLM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a complete, end-to-end RAG pipeline, and a solid skeleton to harden for production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Document Processing Can Be the Hardest Part of RAG
&lt;/h2&gt;

&lt;p&gt;Many tutorials focus heavily on embeddings and prompting, but teams that deploy real systems quickly discover that data preparation is the bottleneck. Production pipelines must deal with complex layouts, multiple file formats, scanned documents, large batches, and multilingual content.&lt;/p&gt;

&lt;p&gt;Kreuzberg is designed specifically for this layer. It transforms heterogeneous documents into clean, structured outputs that downstream systems can reliably use. In a typical RAG pipeline, Kreuzberg sits at the beginning, extracting text, structuring metadata, chunking content, and generating embeddings in a consistent and unified way.&lt;/p&gt;

&lt;p&gt;A useful way to visualize the flow is as a sequence of transformations: documents are extracted, divided into smaller segments, converted into embeddings, stored in a vector database, retrieved in response to a query, and finally synthesized by a language model. Every stage depends on the quality of the one before it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture of a RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;Although implementations differ, most pipelines follow the same logical progression. Documents are first ingested and normalized. The extracted text is then split into chunks of manageable size, after which embeddings are generated and stored in a searchable index. When a user asks a question, the system retrieves the most relevant chunks and passes them to an LLM for synthesis.&lt;/p&gt;

&lt;p&gt;One of the strengths of the RAG pattern is that each stage can be swapped independently. The ingestion engine, embedding model, database, and LLM can all be replaced without redesigning the entire system. Keeping these concerns separated makes pipelines easier to evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extracting Text from Documents
&lt;/h2&gt;

&lt;p&gt;The first stage is always extraction. In practice, this involves reading files in multiple formats, detecting whether text is embedded or must be recovered through OCR, and preserving structural or metadata information whenever possible.&lt;/p&gt;

&lt;p&gt;After this step, the system has clean text, document metadata, and often page-level or structural information. This output becomes the foundation for everything that follows, and in Kreuzberg’s case, it directly feeds into chunking and embedding generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chunking and Embeddings
&lt;/h2&gt;

&lt;p&gt;Once text has been extracted, it must be divided into smaller segments. Large documents cannot be embedded or retrieved efficiently as a single block. The goal of chunking is not only to reduce size but also to preserve meaning. Splitting in the wrong place can destroy context and reduce retrieval accuracy.&lt;/p&gt;
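
&lt;p&gt;To make the overlap idea concrete, here is a minimal character-window chunker in plain Python. It is a simplified stand-in for illustration, not Kreuzberg's actual recursive strategy:&lt;/p&gt;

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into fixed-size windows that share `overlap` characters."""
    step = max(1, chunk_size - overlap)  # guard against overlap as large as chunk_size
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Adjacent chunks share their boundary characters, so a sentence cut at a
# window edge still appears intact in one of its two neighbors.
chunks = chunk_with_overlap("a" * 1000)
print(len(chunks), [len(c) for c in chunks])
```

&lt;p&gt;Recursive strategies improve on this by preferring paragraph and sentence boundaries first, falling back to fixed windows only when a segment is still too large.&lt;/p&gt;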

&lt;p&gt;This step is especially critical because the semantic models used in RAG systems are designed to capture relationships across sequences of text. Many models effectively learn patterns in both directions, allowing them to understand context beyond individual tokens. The way text is chunked directly affects how well these relationships are preserved in the resulting embeddings.&lt;/p&gt;

&lt;p&gt;After chunking, each segment is converted into a vector representation. At this point, each chunk becomes a structured record consisting of text, metadata, and an embedding vector. Kreuzberg handles both chunking and embedding generation, reducing complexity and ensuring consistency across the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval and Answer Generation
&lt;/h2&gt;

&lt;p&gt;When a user submits a query, the pipeline converts it into an embedding and searches the vector database for similar entries. In practice, this means finding the chunks whose representations are closest to the query in semantic space.&lt;/p&gt;
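
&lt;p&gt;"Closest in semantic space" usually means highest cosine similarity. The sketch below shows the idea in plain Python with toy 2-D vectors; production vector databases use optimized approximate-nearest-neighbor indexes instead of this brute-force scan:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# The query vector points mostly along the first axis, so the second
# chunk vector (index 1) is its nearest neighbor.
print(top_k([1.0, 0.1], [[0.0, 1.0], [0.9, 0.2], [-1.0, 0.0]], k=1))
```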

&lt;p&gt;Frameworks like &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; orchestrate this process, connecting retrieval, prompting, and generation into a single workflow. They also make it possible to refine retrieval, for example, through filtering, ranking, or hybrid search, so that the most relevant context is passed to the language model.&lt;/p&gt;

&lt;p&gt;An important detail is that the model never sees the entire dataset. It only receives a carefully selected subset of chunks. The quality of this selection determines the quality of the final answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling a RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;Once a pipeline works on a small dataset, real-world deployments introduce additional requirements. Ingestion must handle large volumes of files and often run in parallel. Retrieval systems benefit from metadata filtering and hybrid search strategies, and generation layers often include structured prompts or citation mechanisms.&lt;/p&gt;

&lt;p&gt;At scale, another challenge emerges: as data grows, it becomes increasingly difficult to understand or navigate the information at all. Large document collections quickly exceed what humans can manually organize or search effectively. This is exactly where RAG systems become so important: they make massive, unstructured datasets usable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;One of the most frequent mistakes is treating ingestion as a trivial preprocessing step. Teams often invest heavily in prompt engineering while overlooking extraction quality, only to discover that retrieval accuracy is limited by poor source data. Inconsistent chunking and missing metadata create similar issues.&lt;/p&gt;

&lt;p&gt;A good rule of thumb is to design this early stage carefully. Because extraction, chunking, and embedding happen at the beginning, mistakes here propagate forward. Poor extraction leads to weaker chunking, lower-quality embeddings, less accurate retrieval, and ultimately worse answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;RAG systems succeed or fail based on the quality of their data pipeline. Reliable document parsing, chunking, and consistent embedding generation form the foundation on which retrieval and generation depend. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; fits naturally into this architecture because it addresses the first part of the workflow: turning messy, real-world documents into clean, structured, and semantically meaningful data ready for retrieval and generation. &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; provides the glue between components, letting you compose retrieval, prompts, and LLMs into a single, production-ready pipeline.&lt;/p&gt;

&lt;p&gt;Don't hesitate to submit issues or make contributions to Kreuzberg &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Kreuzberg v4.3.0 and comparative benchmarks</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Thu, 12 Feb 2026 09:02:29 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzberg-v430-and-benchmarks-500b</link>
      <guid>https://forem.com/kreuzberg/kreuzberg-v430-and-benchmarks-500b</guid>
      <description>&lt;p&gt;Hi all, we have two important announcements related to &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt;: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We released our new &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;comparative benchmarks&lt;/a&gt;. These have a slick UI and we have been working hard on them for a while now (more on this below), and we'd love to hear your impressions and get some feedback from the community!&lt;/li&gt;
&lt;li&gt;We released v4.3.0, which brings in a bunch of improvements including PaddleOCR as an optional backend, document structure extraction, and native Word97 format support. More details below.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What is Kreuzberg?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; is an open-source (MIT license) polyglot document intelligence framework written in Rust, with bindings for Python, TypeScript/JavaScript (Node/Bun/WASM), PHP, Ruby, Java, C#, Golang and Elixir. It's also available as a docker image and standalone CLI tool you can install via homebrew.&lt;/p&gt;

&lt;p&gt;If the above is unintelligible to you (understandably so), here is the TL;DR: Kreuzberg allows users to extract text from 75+ formats (and growing), perform OCR, create embeddings and quite a few other things as well. This is necessary for many AI applications, data pipelines, machine learning, and basically any use case that requires processing documents and images as sources for textual outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparative Benchmarks
&lt;/h2&gt;

&lt;p&gt;Our new comparative benchmarks UI is live here: &lt;a href="https://kreuzberg.dev/benchmarks" rel="noopener noreferrer"&gt;https://kreuzberg.dev/benchmarks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The comparative benchmarks compare Kreuzberg with several of the top open-source alternatives: Apache Tika, Docling, Markitdown, Unstructured.io, PDFPlumber, Mineru, and MuPDF4LLM. In a nutshell, Kreuzberg is 9x faster on average, uses substantially less memory, has a much better cold start, and has a smaller installation footprint. It also requires fewer system dependencies to function (its only &lt;strong&gt;optional&lt;/strong&gt; system dependency is onnxruntime, for embeddings/PaddleOCR).&lt;/p&gt;

&lt;p&gt;The benchmarks measure throughput, duration, p50/p95/p99 latency, memory, installation size, and cold start across more than 50 different file formats. They run in GitHub CI on ubuntu-latest machines, and the results are published as GitHub releases (here is an &lt;a href="https://github.com/kreuzberg-dev/kreuzberg/releases/tag/benchmark-run-21923145045" rel="noopener noreferrer"&gt;example&lt;/a&gt;). The &lt;a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/tools/benchmark-harness" rel="noopener noreferrer"&gt;source code&lt;/a&gt; for the benchmarks and the full data is available on GitHub, and you are invited to check it out.&lt;/p&gt;

&lt;h2&gt;
  
  
  V4.3.0 Changes
&lt;/h2&gt;

&lt;p&gt;The v4.3.0 full release notes can be found here: &lt;a href="https://github.com/kreuzberg-dev/kreuzberg/releases/tag/v4.3.0" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg/releases/tag/v4.3.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key highlights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;PaddleOCR optional backend - in Rust. Yes, you read this right, Kreuzberg now supports PaddleOCR in Rust and by extension - across all languages and bindings except WASM. This is a big one, especially for Chinese speakers and other east Asian languages, at which these models excel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Document structure extraction - while we already had page hierarchy extraction, we had requests for document structure extraction similar to Docling, which is very good at it. We now have a different but up-to-par implementation that extracts document structure from a huge variety of text documents - yes, including PDFs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Native Word97 format extraction - wait, what? Yes, we now support the legacy &lt;code&gt;.doc&lt;/code&gt; and &lt;code&gt;.ppt&lt;/code&gt; formats directly in Rust. This means we no longer need LibreOffice as an optional system dependency, which saves a lot of space. Who cares, you may ask? Usually enterprises and governmental orgs - we still live in a world where legacy is a thing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to get involved with Kreuzberg
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kreuzberg is an open-source project, and as such, contributions are welcome. You can check us out on GitHub, open issues or discussions, and of course, submit fixes and pull requests. Here is the GitHub: &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We have a &lt;a href="https://discord.gg/rzGzur3kj4" rel="noopener noreferrer"&gt;Discord Server&lt;/a&gt; and you are all invited to join (and lurk)!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it for now. As always, if you like it, star it on GitHub, it helps us get visibility for the project.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>rust</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Kreuzberg.dev now supports PHP and Elixir- and covers most of the backend landscape</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Sat, 03 Jan 2026 12:24:24 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzbergdev-now-supports-php-and-elixir-and-covers-most-of-the-backend-landscape-1hee</link>
      <guid>https://forem.com/kreuzberg/kreuzbergdev-now-supports-php-and-elixir-and-covers-most-of-the-backend-landscape-1hee</guid>
      <description>&lt;p&gt;Kreuzberg.dev now supports PHP and Elixir 🎉&lt;/p&gt;

&lt;p&gt;We’ve added PHP and Elixir bindings to Kreuzberg.dev, our open-source document intelligence engine.&lt;/p&gt;

&lt;p&gt;With this release, Kreuzberg is now available for:&lt;/p&gt;

&lt;p&gt;Rust, Python, Ruby, Go, PHP, Elixir, and TypeScript/Node.js&lt;/p&gt;

&lt;p&gt;This covers most modern backend and web development stacks, making it easier to integrate high-performance document processing into existing systems without forcing teams into a single language.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kreuzberg.dev?
&lt;/h2&gt;

&lt;p&gt;Kreuzberg is an MIT-licensed engine for extracting and structuring data from 56+ document formats, including PDFs, Office files, images, archives, and emails.&lt;/p&gt;

&lt;p&gt;Typical use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;feeding documents into search or RAG pipelines&lt;/li&gt;
&lt;li&gt;extracting structured data from unstructured files&lt;/li&gt;
&lt;li&gt;building ingestion systems for large document collections&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;We’re continuing to improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;performance and memory usage&lt;/li&gt;
&lt;li&gt;format coverage and extraction quality&lt;/li&gt;
&lt;li&gt;documentation and real-world examples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re also very interested in feedback from people running Kreuzberg.dev in production — especially around scaling, fault tolerance, and integration patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it out
&lt;/h2&gt;

&lt;p&gt;The library is open-source and self-hostable.&lt;/p&gt;

&lt;p&gt;Repo and docs: &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg&lt;/a&gt;&lt;br&gt;
Join our Discord community: &lt;a href="https://discord.gg/JraV699cKj" rel="noopener noreferrer"&gt;https://discord.gg/JraV699cKj&lt;/a&gt; &lt;br&gt;
Issues, questions, and PRs are always welcome.&lt;/p&gt;

&lt;p&gt;If you’re using Kreuzberg.dev already (or trying it now), we’d love to hear what you’re building with it. Have a great start to 2026!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>php</category>
      <category>elixir</category>
      <category>rust</category>
    </item>
    <item>
      <title>Kreuzberg v4.0.0-rc14 released: optimization phase and stable v4 ahead</title>
      <dc:creator>TI</dc:creator>
      <pubDate>Sun, 21 Dec 2025 11:22:16 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzberg-v400-rc14-released-optimization-phase-and-stable-v4-ahead-1nji</link>
      <guid>https://forem.com/kreuzberg/kreuzberg-v400-rc14-released-optimization-phase-and-stable-v4-ahead-1nji</guid>
      <description>&lt;p&gt;We’ve just released Kreuzberg v4.0.0-rc14, now working across all release channels (language bindings, Docker, CLI).&lt;/p&gt;

&lt;p&gt;With the core feature set in place, focus is shifting to performance optimization — profiling and improving bindings, followed by comparative benchmarks and a documentation refresh.&lt;/p&gt;

&lt;p&gt;If you have time to test rc14, we’d be happy to receive any feedback (bugs, encouragement, design critique, or anything else) as we prepare for a stable v4 release next month. Thank you!&lt;/p&gt;

&lt;p&gt;Kreuzberg's position: As an open-source library, Kreuzberg provides a self-hosted alternative with no per-document API costs, making it suitable for high-volume workloads where cost efficiency matters.&lt;/p&gt;

&lt;p&gt;Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: star us at &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Discord: join our community server at &lt;a href="https://discord.gg/JraV699cKj" rel="noopener noreferrer"&gt;https://discord.gg/JraV699cKj&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Subreddit: join the discussion at &lt;a href="https://www.reddit.com/r/kreuzberg_dev/" rel="noopener noreferrer"&gt;https://www.reddit.com/r/kreuzberg_dev/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href="https://kreuzberg.dev/" rel="noopener noreferrer"&gt;https://kreuzberg.dev/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'd love to see your contributions!&lt;/p&gt;

&lt;p&gt;For more background on Kreuzberg, competitive comparison, and recent changes, see the earlier deep dive from v4.0.0-rc8:&lt;br&gt;
&lt;a href="https://dev.to/kreuzberg-dev/kreuzberg-v400-rc8-is-available-4fma"&gt;https://dev.to/kreuzberg-dev/kreuzberg-v400-rc8-is-available-4fma&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>product</category>
      <category>python</category>
      <category>rust</category>
    </item>
    <item>
      <title>Kreuzberg v4.0.0-RC.8 is Available</title>
      <dc:creator>Na'aman Hirschfeld (Goldziher)</dc:creator>
      <pubDate>Mon, 15 Dec 2025 13:06:14 +0000</pubDate>
      <link>https://forem.com/kreuzberg/kreuzberg-v400-rc8-is-available-4fma</link>
      <guid>https://forem.com/kreuzberg/kreuzberg-v400-rc8-is-available-4fma</guid>
      <description>&lt;p&gt;Hi Peeps,&lt;/p&gt;

&lt;p&gt;I'm excited to announce that &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;Kreuzberg&lt;/a&gt; v4.0.0 is coming very soon. We will release v4.0.0 at the beginning of next year - in just a couple of weeks' time. For now, v4.0.0-rc.8 has been released to all channels.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kreuzberg?
&lt;/h2&gt;

&lt;p&gt;Kreuzberg is a document intelligence toolkit for extracting text, metadata, tables, images, and structured data from 56+ file formats. It was originally written in Python (v1-v3), where it demonstrated strong performance characteristics compared to alternatives in the ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's new in V4?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A Complete Rust Rewrite with Polyglot Bindings
&lt;/h3&gt;

&lt;p&gt;The new version of Kreuzberg represents a massive architectural evolution. &lt;strong&gt;Kreuzberg has been completely rewritten in Rust&lt;/strong&gt; - leveraging Rust's memory safety, zero-cost abstractions, and native performance. The new architecture consists of a high-performance Rust core with native bindings to multiple languages. That's right - it's no longer just a Python library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kreuzberg v4 is now available for 7 languages across 8 runtime bindings:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rust&lt;/strong&gt; (native library)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; (PyO3 native bindings)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript&lt;/strong&gt; - Node.js (NAPI-RS native bindings) + Deno/Browser/Edge (WASM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ruby&lt;/strong&gt; (Magnus FFI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Java 25+&lt;/strong&gt; (Panama Foreign Function &amp;amp; Memory API)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C#&lt;/strong&gt; (P/Invoke)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go&lt;/strong&gt; (cgo bindings)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Post v4.0.0 roadmap includes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PHP&lt;/li&gt;
&lt;li&gt;Elixir (via Rustler - with Erlang and Gleam interop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, it's available as a &lt;strong&gt;CLI&lt;/strong&gt; (installable via &lt;code&gt;cargo&lt;/code&gt; or &lt;code&gt;homebrew&lt;/code&gt;), &lt;strong&gt;HTTP REST API server&lt;/strong&gt;, &lt;strong&gt;Model Context Protocol (MCP) server&lt;/strong&gt; for Claude Desktop/Continue.dev, and as &lt;strong&gt;public Docker images&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the Rust Rewrite? Performance and Architecture
&lt;/h3&gt;

&lt;p&gt;The Rust rewrite wasn't just about performance - though that's a major benefit. It was an opportunity to fundamentally rethink the architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural improvements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-copy operations&lt;/strong&gt; via Rust's ownership model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True async concurrency&lt;/strong&gt; with Tokio runtime (no GIL limitations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming parsers&lt;/strong&gt; for constant memory usage on multi-GB files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIMD-accelerated text processing&lt;/strong&gt; for token reduction and string operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory-safe FFI boundaries&lt;/strong&gt; for all language bindings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin system&lt;/strong&gt; with trait-based extensibility&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  v3 vs v4: What Changed?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;v3 (Python)&lt;/th&gt;
&lt;th&gt;v4 (Rust Core)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pure Python&lt;/td&gt;
&lt;td&gt;Rust 2024 edition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File Formats&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30-40+ (via Pandoc)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;56+ (native parsers)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python only&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;7 languages&lt;/strong&gt; (Rust/Python/TS/Ruby/Java/Go/C#)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires Pandoc (system binary)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Zero system dependencies&lt;/strong&gt; (all native)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;✓ FastEmbed with ONNX (3 presets + custom)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Chunking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via semantic-text-splitter library&lt;/td&gt;
&lt;td&gt;✓ Built-in (text + markdown-aware)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token Reduction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in (TF-IDF based)&lt;/td&gt;
&lt;td&gt;✓ Enhanced with 3 modes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language Detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional (fast-langdetect)&lt;/td&gt;
&lt;td&gt;✓ Built-in (68 languages)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Keyword Extraction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional (KeyBERT)&lt;/td&gt;
&lt;td&gt;✓ Built-in (YAKE + RAKE algorithms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OCR Backends&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tesseract/EasyOCR/PaddleOCR&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Same + better integration&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plugin System&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited extractor registry&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Full trait-based&lt;/strong&gt; (4 plugin types)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Page Tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Character-based indices&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Byte-based with O(1) lookup&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Servers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST API (Litestar)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;HTTP (Axum) + MCP + MCP-SSE&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Installation Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~100MB base&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16-31 MB complete&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python heap management&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;RAII with streaming&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;asyncio (GIL-limited)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Tokio work-stealing&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Replacement of Pandoc - Native Performance
&lt;/h3&gt;

&lt;p&gt;Kreuzberg v3 relied on &lt;strong&gt;Pandoc&lt;/strong&gt; - an amazing tool, but one that had to be invoked via subprocess because of its GPL license. This came with significant drawbacks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v3 Pandoc limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System dependency (installation required)&lt;/li&gt;
&lt;li&gt;Subprocess overhead on every document&lt;/li&gt;
&lt;li&gt;No streaming support&lt;/li&gt;
&lt;li&gt;Limited metadata extraction&lt;/li&gt;
&lt;li&gt;~500MB+ installation footprint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;v4 native parsers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero external dependencies&lt;/strong&gt; - everything is native Rust&lt;/li&gt;
&lt;li&gt;Direct parsing with full control over extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Substantially more metadata&lt;/strong&gt; extracted (e.g., DOCX document properties, section structure, style information)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming support&lt;/strong&gt; for massive files (tested on multi-GB XML documents with stable memory)&lt;/li&gt;
&lt;li&gt;Example: the PPTX extractor is now a &lt;strong&gt;fully streaming parser&lt;/strong&gt; capable of handling gigabyte-scale presentations with constant memory usage and high throughput&lt;/li&gt;
&lt;/ul&gt;
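&lt;p&gt;The streaming idea is easiest to see in miniature. The Python sketch below (the real parsers are Rust; this is purely illustrative) processes a document one line at a time, so memory stays roughly constant no matter how large the input is:&lt;/p&gt;

```python
# Concept sketch (not Kreuzberg's Rust code): streaming a large document
# piece by piece keeps memory roughly constant regardless of file size,
# because only one line is held at a time.
import io

def count_words(stream) -> int:
    """Count words without ever loading the whole document into memory."""
    total = 0
    for line in stream:  # file-like objects yield one line at a time
        total += len(line.split())
    return total

# a stand-in for a multi-gigabyte file on disk
doc = io.StringIO("hello world\n" * 1000)
print(count_words(doc))  # 2000
```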

&lt;h3&gt;
  
  
  New File Format Support
&lt;/h3&gt;

&lt;p&gt;v4 expands format support to &lt;strong&gt;56+ file formats&lt;/strong&gt;, including:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Added legacy format support:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.doc&lt;/code&gt; (Word 97-2003)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.ppt&lt;/code&gt; (PowerPoint 97-2003)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.xls&lt;/code&gt; (Excel 97-2003)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.eml&lt;/code&gt; (Email messages)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.msg&lt;/code&gt; (Outlook messages)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Added academic/technical formats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LaTeX (&lt;code&gt;.tex&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;BibTeX (&lt;code&gt;.bib&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Typst (&lt;code&gt;.typ&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;JATS XML (scientific articles)&lt;/li&gt;
&lt;li&gt;DocBook XML&lt;/li&gt;
&lt;li&gt;FictionBook (&lt;code&gt;.fb2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;OPML (&lt;code&gt;.opml&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Better Office support:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XLSB, XLSM (Excel binary/macro formats)&lt;/li&gt;
&lt;li&gt;Better structured metadata extraction from DOCX/PPTX/XLSX&lt;/li&gt;
&lt;li&gt;Full table extraction from presentations&lt;/li&gt;
&lt;li&gt;Image extraction with deduplication&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  New Features: Full Document Intelligence Solution
&lt;/h3&gt;

&lt;p&gt;The v4 rewrite was also an opportunity to close gaps with commercial alternatives and add features specifically designed for &lt;strong&gt;RAG applications and LLM workflows&lt;/strong&gt;:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;Embeddings (NEW)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastEmbed integration&lt;/strong&gt; with full ONNX Runtime acceleration&lt;/li&gt;
&lt;li&gt;Three presets: &lt;code&gt;"fast"&lt;/code&gt; (384d), &lt;code&gt;"balanced"&lt;/code&gt; (512d), &lt;code&gt;"quality"&lt;/code&gt; (768d/1024d)&lt;/li&gt;
&lt;li&gt;Custom model support (bring your own ONNX model)&lt;/li&gt;
&lt;li&gt;Local generation (no API calls, no rate limits)&lt;/li&gt;
&lt;li&gt;Automatic model downloading and caching&lt;/li&gt;
&lt;li&gt;Per-chunk embedding generation
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingModelType&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExtractionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;EmbeddingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EmbeddingModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;balanced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kreuzberg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# result.embeddings contains vectors for each chunk
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. &lt;strong&gt;Semantic Text Chunking (NOW BUILT-IN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Now integrated directly into the core (v3 relied on the external semantic-text-splitter library):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure-aware chunking&lt;/strong&gt; that respects document semantics&lt;/li&gt;
&lt;li&gt;Two strategies:

&lt;ul&gt;
&lt;li&gt;Generic text chunker (whitespace/punctuation-aware)&lt;/li&gt;
&lt;li&gt;Markdown chunker (preserves headings, lists, code blocks, tables)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Configurable chunk size and overlap&lt;/li&gt;

&lt;li&gt;Unicode-safe (handles CJK, emojis correctly)&lt;/li&gt;

&lt;li&gt;Automatic chunk-to-page mapping&lt;/li&gt;

&lt;li&gt;Per-chunk metadata with byte offsets&lt;/li&gt;

&lt;/ul&gt;
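&lt;p&gt;To make the size/overlap mechanics concrete, here is a minimal Python sketch of a generic chunker. It is illustrative only - the function name and defaults are invented for this example, and the built-in chunkers are structure-aware Rust implementations:&lt;/p&gt;

```python
# Minimal sketch of size/overlap chunking (illustrative only; Kreuzberg's
# built-in chunkers are structure-aware and implemented in Rust).
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most max_chars characters, with
    consecutive chunks overlapping by roughly `overlap` characters,
    preferring to break on whitespace so words stay intact."""
    chunks = []
    start = 0
    while len(text) > start:
        end = min(start + max_chars, len(text))
        if len(text) > end:
            cut = text.rfind(" ", start, end)  # last space inside the window
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = max(end - overlap, start + 1)  # step back to create overlap
    return chunks
```

&lt;p&gt;The overlap means each chunk carries a little trailing context from its predecessor, which helps retrieval when an answer straddles a chunk boundary.&lt;/p&gt;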

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Byte-Accurate Page Tracking (BREAKING CHANGE)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is a critical improvement for LLM applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v3&lt;/strong&gt;: Character-based indices (&lt;code&gt;char_start&lt;/code&gt;/&lt;code&gt;char_end&lt;/code&gt;), which diverge from byte positions whenever UTF-8 multi-byte characters appear&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v4&lt;/strong&gt;: Byte-based indices (&lt;code&gt;byte_start&lt;/code&gt;/&lt;code&gt;byte_end&lt;/code&gt;), which are correct for all string operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additional page features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;O(1) lookup: "which page is byte offset X on?" → instant answer&lt;/li&gt;
&lt;li&gt;Per-page content extraction&lt;/li&gt;
&lt;li&gt;Page markers in combined text (e.g., &lt;code&gt;--- Page 5 ---&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Automatic chunk-to-page mapping for citations&lt;/li&gt;
&lt;/ul&gt;
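&lt;p&gt;A small Python example shows why byte offsets matter and how a page lookup over byte boundaries works. The &lt;code&gt;bisect&lt;/code&gt; search below is an illustrative O(log n) stand-in for the O(1) structure used internally:&lt;/p&gt;

```python
# Sketch: byte offsets vs. char offsets, plus a page-boundary lookup.
# (Kreuzberg keeps byte-based page boundaries with O(1) lookup; bisect
# here is an illustrative stand-in for that structure.)
import bisect

page_texts = ["Grüße von Seite eins. ", "Page two is ASCII. "]
combined = "".join(page_texts)

# char index of the second page differs from its byte index, because
# "ü" and "ß" each occupy 2 bytes in UTF-8
char_boundary = len(page_texts[0])
byte_boundary = len(page_texts[0].encode("utf-8"))
print(char_boundary, byte_boundary)  # 22 24

# cumulative byte offsets of each page start
byte_starts = []
offset = 0
for page in page_texts:
    byte_starts.append(offset)
    offset += len(page.encode("utf-8"))

def page_of_byte(byte_offset: int) -> int:
    """1-based page number containing the given byte offset."""
    return bisect.bisect_right(byte_starts, byte_offset)

print(page_of_byte(0), page_of_byte(byte_boundary))  # 1 2
```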

&lt;h4&gt;
  
  
  4. &lt;strong&gt;Enhanced Token Reduction for LLM Context&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Enhanced from v3 with three configurable modes to save on LLM costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Light mode&lt;/strong&gt;: ~15% reduction (preserve most detail)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Moderate mode&lt;/strong&gt;: ~30% reduction (balanced)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggressive mode&lt;/strong&gt;: ~50% reduction (key information only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Uses TF-IDF sentence scoring with position-aware weighting and language-specific stopword filtering. SIMD-accelerated for improved performance over v3.&lt;/p&gt;
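&lt;p&gt;For intuition, here is a toy Python version of TF-IDF sentence scoring with a position boost. All names, weights, and the tiny stopword list are invented for this example - the shipped implementation is SIMD-accelerated Rust with per-language stopwords:&lt;/p&gt;

```python
# Illustrative sketch of TF-IDF-style extractive reduction: score each
# sentence by the average rarity of its terms, boost lead sentences,
# and keep only the top fraction.
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}

def reduce_text(text: str, keep_ratio: float = 0.5) -> str:
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    n = len(sentences)
    # document frequency of each term across sentences
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tfidf = sum(math.log(1 + n / df[t]) for t in toks) / (len(toks) or 1)
        position_boost = 1.2 if i == 0 else 1.0  # lead sentence weighted up
        scores.append(tfidf * position_boost)
    keep = max(1, round(n * keep_ratio))
    # keep the top-scoring sentences, restored to document order
    top = sorted(sorted(range(n), key=lambda i: -scores[i])[:keep])
    return " ".join(sentences[i] for i in top)
```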

&lt;h4&gt;
  
  
  5. &lt;strong&gt;Language Detection (NOW BUILT-IN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Support for 68 languages with confidence scoring&lt;/li&gt;
&lt;li&gt;Multi-language detection (documents with mixed languages)&lt;/li&gt;
&lt;li&gt;ISO 639-1 and ISO 639-3 code support&lt;/li&gt;
&lt;li&gt;Configurable confidence thresholds&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. &lt;strong&gt;Keyword Extraction (NOW BUILT-IN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Now built into the core (previously an optional KeyBERT integration in v3):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;YAKE&lt;/strong&gt; (Yet Another Keyword Extractor): Unsupervised, language-independent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAKE&lt;/strong&gt; (Rapid Automatic Keyword Extraction): Fast statistical method&lt;/li&gt;
&lt;li&gt;Configurable n-grams (1-3 word phrases)&lt;/li&gt;
&lt;li&gt;Relevance scoring with language-specific stopwords&lt;/li&gt;
&lt;/ul&gt;
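&lt;p&gt;To illustrate the RAKE idea, here is a compact Python sketch of the classic degree/frequency heuristic. It is illustrative only - the shipped implementation is Rust with proper per-language stopword lists:&lt;/p&gt;

```python
# Minimal RAKE-style sketch: candidate phrases are maximal runs of
# non-stopwords; each word is scored degree/frequency and a phrase
# scores the sum of its word scores.
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "for", "with"}

def rake_keywords(text: str, top_k: int = 3) -> list[str]:
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(tuple(current))
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(tuple(current))
    # word score = degree / frequency (classic RAKE heuristic)
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    scores = {p: sum(degree[w] / freq[w] for w in p) for p in phrases}
    return [" ".join(p) for p, _ in
            sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```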

&lt;h4&gt;
  
  
  7. &lt;strong&gt;Plugin System (NEW)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Four extensible plugin types for customization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DocumentExtractor&lt;/strong&gt; - Custom file format handlers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OcrBackend&lt;/strong&gt; - Custom OCR engines (integrate your own Python models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostProcessor&lt;/strong&gt; - Data transformation and enrichment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validator&lt;/strong&gt; - Pre-extraction validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plugins defined in Rust work across all language bindings. Python/TypeScript can define custom plugins with thread-safe callbacks into the Rust core.&lt;/p&gt;
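&lt;p&gt;As a flavor of what a Python-side plugin can look like, here is a hypothetical post-processor that redacts e-mail addresses from extracted text. The &lt;code&gt;Protocol&lt;/code&gt; shape and the class names are stand-ins for this example, not the actual plugin interface - see the docs for the real registration API:&lt;/p&gt;

```python
# Hypothetical sketch of a post-processor plugin in Python. The
# PostProcessor protocol and the class names are stand-ins; the real
# plugin interface is documented at kreuzberg.dev.
import re
from typing import Protocol

class PostProcessor(Protocol):
    def process(self, text: str, metadata: dict) -> str: ...

class RedactEmails:
    """Replace e-mail addresses in extracted text before it is returned."""
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def process(self, text: str, metadata: dict) -> str:
        return self.EMAIL.sub("[redacted]", text)

print(RedactEmails().process("Contact: team@kreuzberg.dev", {}))  # Contact: [redacted]
```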

&lt;h4&gt;
  
  
  8. &lt;strong&gt;Production-Ready Servers (NEW)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTTP REST API&lt;/strong&gt;: Production-grade Axum server with OpenAPI docs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server&lt;/strong&gt;: Direct integration with Claude Desktop, Continue.dev, and other MCP clients&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP-SSE Transport&lt;/strong&gt; (RC.8): Server-Sent Events for cloud deployments without WebSocket support&lt;/li&gt;
&lt;li&gt;All three server modes support the same feature set: extraction, batch processing, and caching&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance: Benchmarked Against the Competition
&lt;/h2&gt;

&lt;p&gt;We maintain &lt;strong&gt;continuous benchmarks&lt;/strong&gt; comparing Kreuzberg against the leading OSS alternatives:&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform&lt;/strong&gt;: Ubuntu 22.04 (GitHub Actions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Suite&lt;/strong&gt;: 30+ documents covering all formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Latency (p50, p95), throughput (MB/s), memory usage, success rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitors&lt;/strong&gt;: Apache Tika, Docling, Unstructured, MarkItDown&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Kreuzberg Compares
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Installation Size&lt;/strong&gt; (critical for containers/serverless):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kreuzberg&lt;/strong&gt;: &lt;strong&gt;16-31 MB complete&lt;/strong&gt; (CLI: 16 MB, Python wheel: 22 MB, Java JAR: 31 MB - all features included)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MarkItDown&lt;/strong&gt;: ~251 MB installed (58.3 KB wheel, 25 dependencies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured&lt;/strong&gt;: ~146 MB minimal (open source base) - &lt;strong&gt;several GB with ML models&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docling&lt;/strong&gt;: ~1 GB base, &lt;strong&gt;9.74GB Docker image&lt;/strong&gt; (includes PyTorch CUDA)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Tika&lt;/strong&gt;: ~55 MB (tika-app JAR) + dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GROBID&lt;/strong&gt;: 500MB (CRF-only) to &lt;strong&gt;8GB&lt;/strong&gt; (full deep learning)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Formats&lt;/th&gt;
&lt;th&gt;Installation&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kreuzberg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Fast (Rust-native)&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;56+&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16-31 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;General-purpose, production-ready&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Fast (3.1s/pg x86, 1.27s/pg ARM)&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;td&gt;7+&lt;/td&gt;
&lt;td&gt;1-9.74 GB&lt;/td&gt;
&lt;td&gt;Complex documents, when accuracy &amp;gt; size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GROBID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡⚡ Very Fast (10.6 PDF/s)&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;td&gt;PDF only&lt;/td&gt;
&lt;td&gt;0.5-8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Academic/scientific papers only&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unstructured&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Moderate&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;25-65+&lt;/td&gt;
&lt;td&gt;146 MB-several GB&lt;/td&gt;
&lt;td&gt;Python-native LLM pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MarkItDown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Fast (small files)&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;11+&lt;/td&gt;
&lt;td&gt;~251 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Lightweight Markdown conversion&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Tika&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚡ Moderate&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1000+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~55 MB&lt;/td&gt;
&lt;td&gt;Enterprise, broadest format support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Kreuzberg's sweet spot:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Smallest full-featured installation&lt;/strong&gt;: 16-31 MB complete (vs 146 MB-9.74 GB for competitors)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5-15x smaller&lt;/strong&gt; than Unstructured/MarkItDown, &lt;strong&gt;30-300x smaller&lt;/strong&gt; than Docling/GROBID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust-native performance&lt;/strong&gt; without ML model overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad format support&lt;/strong&gt; (56+ formats) with native parsers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language support&lt;/strong&gt; unique in the space (7 languages vs Python-only for most)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-ready&lt;/strong&gt; with general-purpose design (vs specialized tools like GROBID)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Is Kreuzberg a SaaS Product?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No.&lt;/strong&gt; Kreuzberg is and will remain &lt;strong&gt;MIT-licensed open source&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, we are building &lt;strong&gt;Kreuzberg.cloud&lt;/strong&gt; - a commercial SaaS and self-hosted document intelligence solution built &lt;em&gt;on top of&lt;/em&gt; Kreuzberg. This follows the proven open-core model: the library stays free and open, while we offer a cloud service for teams that want managed infrastructure, APIs, and enterprise features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will Kreuzberg become commercially licensed?&lt;/strong&gt; Absolutely not. There is no BSL (Business Source License) in Kreuzberg's future. The library was MIT-licensed and will remain MIT-licensed. We're building the commercial offering as a separate product around the core library, not by restricting the library itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Target Audience
&lt;/h2&gt;

&lt;p&gt;Any developer or data scientist who needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document text extraction (PDF, Office, images, email, archives, etc.)&lt;/li&gt;
&lt;li&gt;OCR (Tesseract, EasyOCR, PaddleOCR)&lt;/li&gt;
&lt;li&gt;Metadata extraction (authors, dates, properties, EXIF)&lt;/li&gt;
&lt;li&gt;Table and image extraction&lt;/li&gt;
&lt;li&gt;Document pre-processing for RAG pipelines&lt;/li&gt;
&lt;li&gt;Text chunking with embeddings&lt;/li&gt;
&lt;li&gt;Token reduction for LLM context windows&lt;/li&gt;
&lt;li&gt;Multi-language document intelligence in production systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ideal for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG application developers&lt;/li&gt;
&lt;li&gt;Data engineers building document pipelines&lt;/li&gt;
&lt;li&gt;ML engineers preprocessing training data&lt;/li&gt;
&lt;li&gt;Enterprise developers handling document workflows&lt;/li&gt;
&lt;li&gt;DevOps teams needing lightweight, performant extraction in containers/serverless&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison with Alternatives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Open Source Python Libraries
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Unstructured.io&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Established, modular, broad format support (25+ open source, 65+ enterprise), LLM-focused, good Python ecosystem integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Python GIL performance constraints, 146 MB minimal installation (several GB with ML models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache-2.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Python-only projects where ecosystem fit &amp;gt; performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MarkItDown (Microsoft)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Fast for small files, Markdown-optimized, simple API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Limited format support (11 formats), less structured metadata, ~251 MB installed (despite small wheel), requires OpenAI API for images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Markdown-only conversion, LLM consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Docling (IBM)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Excellent accuracy on complex documents (97.9% cell-level accuracy on tested sustainability report tables), state-of-the-art AI models for technical documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Massive installation (1-9.74 GB), high memory usage, GPU-optimized (underutilized on CPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Accuracy on complex documents &amp;gt; deployment size/speed, have GPU infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Open Source Java/Academic Tools
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Apache Tika&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Mature, stable, broadest format support (1000+ types), proven at scale, Apache Foundation backing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Java/JVM required, slower on large files, older architecture, complex dependency management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache-2.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Enterprise environments with JVM infrastructure, need for maximum format coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GROBID&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt;: Best-in-class for academic papers (F1 0.87-0.90), extremely fast (10.6 PDF/sec sustained), proven at scale (34M+ documents at CORE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-offs&lt;/strong&gt;: Academic papers only, large installation (500MB-8GB), complex Java+Python setup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache-2.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to choose&lt;/strong&gt;: Scientific/academic document processing exclusively&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Commercial APIs
&lt;/h3&gt;

&lt;p&gt;There are numerous commercial options from startups (LlamaIndex, Unstructured.io paid tiers) to big cloud providers (AWS Textract, Azure Form Recognizer, Google Document AI). These are not OSS but offer managed infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kreuzberg's position&lt;/strong&gt;: As an open-source library, Kreuzberg provides a self-hosted alternative with no per-document API costs, making it suitable for high-volume workloads where cost efficiency matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community &amp;amp; Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: Star us at &lt;a href="https://github.com/kreuzberg-dev/kreuzberg" rel="noopener noreferrer"&gt;https://github.com/kreuzberg-dev/kreuzberg&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discord&lt;/strong&gt;: Join our community server at &lt;a href="https://discord.gg/pXxagNK2zN" rel="noopener noreferrer"&gt;discord.gg/pXxagNK2zN&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subreddit&lt;/strong&gt;: Join the discussion at &lt;a href="https://www.reddit.com/r/kreuzberg_dev/" rel="noopener noreferrer"&gt;r/kreuzberg_dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://kreuzberg.dev" rel="noopener noreferrer"&gt;kreuzberg.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'd love to hear your feedback, use cases, and contributions!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Kreuzberg v4 is a complete Rust rewrite of a document intelligence library, offering native bindings for 7 languages (8 runtime targets), 56+ file formats, Rust-native performance, embeddings, semantic chunking, and production-ready servers - all in a 16-31 MB complete package (5-15x smaller than alternatives). Releasing January 2026. MIT licensed forever.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>programming</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
