<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Santiago Fernández</title>
    <description>The latest articles on Forem by Santiago Fernández (@bytaro).</description>
    <link>https://forem.com/bytaro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3562817%2Fbae83da8-8169-4192-a5bf-853fd1429343.png</url>
      <title>Forem: Santiago Fernández</title>
      <link>https://forem.com/bytaro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bytaro"/>
    <language>en</language>
    <item>
      <title>Implementing Parallel PDF Batch Processing in Rust</title>
      <dc:creator>Santiago Fernández</dc:creator>
      <pubDate>Wed, 15 Oct 2025 19:19:55 +0000</pubDate>
      <link>https://forem.com/bytaro/implementing-parallel-pdf-batch-processing-in-rust-330j</link>
      <guid>https://forem.com/bytaro/implementing-parallel-pdf-batch-processing-in-rust-330j</guid>
      <description>&lt;p&gt;Processing large batches of PDFs is a common requirement in document management systems, data extraction pipelines, and archive digitization projects. The challenge is doing it efficiently while handling the inevitable edge cases: corrupted files, encryption, or malformed structures.&lt;br&gt;
I've been working on oxidize-pdf, a native Rust library for PDF processing. Recently, I added parallel batch processing to handle production-scale workloads.&lt;/p&gt;
&lt;h2&gt;
  Why Another PDF Library?
&lt;/h2&gt;

&lt;p&gt;There are existing PDF libraries in the Rust ecosystem, notably lopdf. However, oxidize-pdf addresses a different use case:&lt;br&gt;
&lt;strong&gt;lopdf&lt;/strong&gt; is a general-purpose PDF manipulation library. It provides low-level access to PDF structures and is excellent for creating, modifying, and inspecting PDFs.&lt;br&gt;
&lt;strong&gt;oxidize-pdf&lt;/strong&gt; is optimized for document content extraction and processing. It focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-performance text extraction&lt;/li&gt;
&lt;li&gt;OCR integration with Tesseract for scanned documents&lt;/li&gt;
&lt;li&gt;Structured data extraction (tables, forms, metadata)&lt;/li&gt;
&lt;li&gt;Batch processing at scale&lt;/li&gt;
&lt;li&gt;Production-ready error handling and recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The library is designed for systems that need to process thousands of documents daily: invoice processing, contract analysis, document classification, and RAG (Retrieval-Augmented Generation) pipelines.&lt;/p&gt;
&lt;h2&gt;
  Requirements
&lt;/h2&gt;

&lt;p&gt;The batch processing implementation needed to satisfy several constraints:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Performance: Sequential processing is impractical for batches of 500+ files&lt;/li&gt;
&lt;li&gt;Reliability: Individual file failures must not halt the entire batch&lt;/li&gt;
&lt;li&gt;Observability: Clear progress indication and detailed error reporting&lt;/li&gt;
&lt;li&gt;Integration: Machine-readable output for automation&lt;/li&gt;
&lt;/ol&gt;
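&lt;p&gt;The processing function shown below takes a &lt;code&gt;BatchConfig&lt;/code&gt; that the post doesn't define. As a rough sketch of how those requirements could map onto a config type (the field names and the default of 4 workers are my assumptions, not the library's actual API):&lt;/p&gt;

```rust
// Hypothetical sketch of a BatchConfig capturing the requirements above.
// Field names and defaults are illustrative, not the library's actual API.
#[derive(Debug, Clone)]
pub struct BatchConfig {
    /// Worker thread count (requirement 1: performance).
    pub workers: usize,
    /// Keep going after individual failures (requirement 2: reliability).
    pub continue_on_error: bool,
    /// Emit machine-readable JSON instead of the text report (requirement 4).
    pub json_output: bool,
}

impl Default for BatchConfig {
    fn default() -> Self {
        BatchConfig {
            workers: 4, // illustrative; real code might query available_parallelism()
            continue_on_error: true,
            json_output: false,
        }
    }
}

fn main() {
    let config = BatchConfig::default();
    println!("{} workers, json={}", config.workers, config.json_output);
}
```

&lt;p&gt;In real code the worker default would more likely come from &lt;code&gt;std::thread::available_parallelism()&lt;/code&gt; than a constant.&lt;/p&gt;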
&lt;h2&gt;
  Core Implementation
&lt;/h2&gt;

&lt;p&gt;The batch processor uses Rayon for parallelism. Here's the main processing function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rust
pub fn process_batch(files: &amp;amp;[PathBuf], config: &amp;amp;BatchConfig) -&amp;gt; BatchResult {
    let start = Instant::now();
    let progress = ProgressBar::new(files.len() as u64);
    let results = Arc::new(Mutex::new(Vec::new()));

    // Parallel processing with Rayon
    files.par_iter().for_each(|path| {
        let file_start = Instant::now();

        // Derive the display name once; fall back to the full path for
        // paths without a final component (e.g. "..") instead of panicking.
        let filename = path
            .file_name()
            .map(|n| n.to_string_lossy().to_string())
            .unwrap_or_else(|| path.display().to_string());

        let result = match process_single_pdf(path) {
            Ok(data) =&amp;gt; ProcessingResult {
                filename,
                success: true,
                pages: Some(data.page_count),
                text_chars: Some(data.text.len()),
                duration_ms: file_start.elapsed().as_millis() as u64,
                error: None,
            },
            Err(e) =&amp;gt; ProcessingResult {
                filename,
                success: false,
                pages: None,
                text_chars: None,
                duration_ms: file_start.elapsed().as_millis() as u64,
                error: Some(e.to_string()),
            },
        };

        results.lock().unwrap().push(result);
        progress.inc(1);
    });

    progress.finish();

    let all_results = results.lock().unwrap();
    aggregate_results(&amp;amp;all_results, start.elapsed())
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;par_iter()&lt;/code&gt; enables Rayon's parallel iteration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error isolation&lt;/strong&gt; through individual &lt;code&gt;match&lt;/code&gt; on each file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread-safe result collection&lt;/strong&gt; using &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;Vec&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progress tracking&lt;/strong&gt; with &lt;code&gt;indicatif&lt;/code&gt; crate&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  Processing Individual PDFs
&lt;/h2&gt;

&lt;p&gt;Each PDF is processed independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rust
fn process_single_pdf(path: &amp;amp;Path) -&amp;gt; Result&amp;lt;DocumentData, PdfError&amp;gt; {
    let document = Document::load(path)?;
    let text = document.extract_text()?;

    Ok(DocumentData {
        page_count: document.get_pages().len(),
        text,
    })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The simplicity here is intentional. If &lt;code&gt;load()&lt;/code&gt; or &lt;code&gt;extract_text()&lt;/code&gt; fails, the error propagates up, gets caught in the main loop, and processing continues with the next file.&lt;/p&gt;

&lt;h3&gt;
  Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bash
# Process directory with default settings
cargo run --example batch_processing --features rayon -- --dir ./pdfs

# Control parallelism
cargo run --example batch_processing --features rayon -- --dir ./pdfs --workers 8

# JSON output for automation
cargo run --example batch_processing --features rayon -- --dir ./pdfs --json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  Performance Results
&lt;/h2&gt;

&lt;p&gt;Testing with 772 PDFs on an Intel i9 MacBook Pro:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential: ~10 minutes&lt;/li&gt;
&lt;li&gt;Parallel: ~1 minute&lt;/li&gt;
&lt;li&gt;Speedup: ~10x&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Throughput varies with file complexity (15+ docs/sec for simple files, 1-2 docs/sec for complex ones), but the parallelization benefit is consistent.&lt;/p&gt;

&lt;h2&gt;
  Error Handling
&lt;/h2&gt;

&lt;p&gt;The design decision was file independence. Each PDF is processed in isolation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rust
// This continues even if some files fail
files.par_iter().for_each(|path| {
    match process_single_pdf(path) {
        Ok(data) =&amp;gt; { /* record success */ },
        Err(e) =&amp;gt; { /* record error, continue */ },
    }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At completion, you receive a full report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╔═══════════════════════════════════════╗
         BATCH SUMMARY REPORT
╚═══════════════════════════════════════╝

📊 Statistics:
   Total files:     772
   ✅ Successful:   749 (97.0%)
   ❌ Failed:       23 (3.0%)

⏱️  Performance:
   Total time:      62.4s
   Throughput:      12.4 docs/sec

❌ Failed files:
   • corrupted.pdf - Invalid PDF structure
   • locked.pdf - Permission denied
   • encrypted.pdf - Encryption not supported
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is essential for production systems where restarting a 1-hour batch job because of file #437 is unacceptable.&lt;/p&gt;

&lt;h2&gt;
  JSON Output for Automation
&lt;/h2&gt;

&lt;p&gt;For pipeline integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rust
#[derive(Serialize)]
struct BatchResult {
    total: usize,
    successful: usize,
    failed: usize,
    total_duration_ms: u128,
    throughput_docs_per_sec: f64,
    results: Vec&amp;lt;ProcessingResult&amp;gt;,
}

#[derive(Serialize)]
struct ProcessingResult {
    filename: String,
    success: bool,
    pages: Option&amp;lt;usize&amp;gt;,
    text_chars: Option&amp;lt;usize&amp;gt;,
    duration_ms: u64,
    error: Option&amp;lt;String&amp;gt;,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json
{
  "total": 772,
  "successful": 749,
  "failed": 23,
  "throughput_docs_per_sec": 12.4,
  "results": [
    {
      "filename": "document1.pdf",
      "success": true,
      "pages": 25,
      "text_chars": 15234,
      "duration_ms": 145,
      "error": null
    },
    {
      "filename": "corrupted.pdf",
      "success": false,
      "pages": null,
      "text_chars": null,
      "duration_ms": 23,
      "error": "Invalid PDF structure"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This integrates easily with &lt;code&gt;jq&lt;/code&gt;, Python scripts, or monitoring systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bash
# Extract failed files
cat results.json | jq -r '.results[] | select(.success == false) | .filename'

# Calculate success rate
cat results.json | jq '.successful / .total * 100'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  Why Rust Over Python
&lt;/h2&gt;

&lt;p&gt;The decision to implement this in Rust rather than Python was driven by practical considerations:&lt;br&gt;
&lt;strong&gt;Memory efficiency:&lt;/strong&gt; Python PDF libraries typically consume 2-3GB when processing large batches. Rust keeps memory usage predictable and significantly lower.&lt;br&gt;
&lt;strong&gt;True parallelism:&lt;/strong&gt; Python's GIL limits parallel processing. While &lt;code&gt;multiprocessing&lt;/code&gt; works, the overhead is substantial. Rayon provides efficient work-stealing parallelism without process spawning costs.&lt;br&gt;
&lt;strong&gt;Deployment simplicity:&lt;/strong&gt; A single compiled binary eliminates dependency management issues in production environments.&lt;br&gt;
Python remains excellent for exploratory work. But for daily production pipelines processing thousands of documents, the operational benefits of Rust are substantial.&lt;/p&gt;

&lt;h2&gt;
  Current Limitations
&lt;/h2&gt;

&lt;p&gt;The current implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processes a single directory (non-recursive)&lt;/li&gt;
&lt;li&gt;Loads complete PDFs into memory&lt;/li&gt;
&lt;li&gt;Text extraction only (no images/metadata in this example)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For very large files (1GB+), a streaming approach would be more appropriate.&lt;/p&gt;

&lt;h2&gt;
  What's Next
&lt;/h2&gt;

&lt;p&gt;The library is under active development. Planned features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Advanced structured data extraction with templates&lt;/li&gt;
&lt;li&gt;Enhanced OCR quality detection and preprocessing&lt;/li&gt;
&lt;li&gt;Memory-efficient streaming for large documents&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  Try It
&lt;/h2&gt;

&lt;p&gt;The code is available at &lt;a href="https://github.com/bzsanti/oxidizePdf" rel="noopener noreferrer"&gt;github.com/bzsanti/oxidizePdf&lt;/a&gt;. The batch processing example is in &lt;code&gt;examples/batch_processing.rs&lt;/code&gt;.&lt;br&gt;
If you're processing PDFs at scale, I'd be interested in hearing about your use cases and any edge cases this implementation doesn't handle.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>programming</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
