<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: BookWyrm.ai</title>
    <description>The latest articles on Forem by BookWyrm.ai (@bookwyrm-ai).</description>
    <link>https://forem.com/bookwyrm-ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3615367%2Ff8afdc66-92ec-4b3d-86d3-a7a2191925d9.png</url>
      <title>Forem: BookWyrm.ai</title>
      <link>https://forem.com/bookwyrm-ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bookwyrm-ai"/>
    <language>en</language>
    <item>
      <title>Automate Document Workflows with AI</title>
      <dc:creator>BookWyrm.ai</dc:creator>
      <pubDate>Mon, 17 Nov 2025 12:48:41 +0000</pubDate>
      <link>https://forem.com/bookwyrm-ai/automate-document-workflows-with-ai-oed</link>
      <guid>https://forem.com/bookwyrm-ai/automate-document-workflows-with-ai-oed</guid>
      <description>&lt;h1&gt;
  
  
  PDF Structured Output Server: Technical Overview
&lt;/h1&gt;

&lt;p&gt;There's no getting away from the fact that AI is everywhere and most businesses are experimenting in an attempt to boost productivity and competitive advantage.&lt;/p&gt;

&lt;p&gt;I work for an AI startup called BookWyrm and we have a set of API endpoints and a Python client to let customers build agentic pipelines to create automation, workflow, and accurate chat tools based on business documents. &lt;a href="https://bookwyrm.ai/about" rel="noopener noreferrer"&gt;Learn about the endpoints here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The aim of BookWyrm is to extract the pain of extracting data from business documents so devs don't have to mess around spending countless hours preparing data. They get to build straight way.&lt;/p&gt;

&lt;p&gt;This article focuses on the PDF Structured Output Server we built, initially for product data enrichment to send to OpenAI for its Agentic Commerce Protocol. However, on reflection, we've extended the use of this, as by setting the Pydantic model you use, you can get any structured output that you want from documents. &lt;/p&gt;

&lt;p&gt;The premise is, set up the server, you then have a service to handle manual document processing based on the requirements you have. Let the front-end create its own schema, or predefine them with agreed output.&lt;/p&gt;

&lt;p&gt;Cloneable examples to play with:&lt;/p&gt;

&lt;h3&gt;
  
  
  Invoice Processing:
&lt;/h3&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/MVY8MYSYPTQ"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h3&gt;
  
  
  Product Data Enrichment:
&lt;/h3&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/cptXpf52dCc"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sorry about the AI voice, it sure beats my dulcet tones!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can clone and play with the server and front-end examples from GitHub:&lt;/p&gt;

&lt;p&gt;Server Repo: &lt;a href="https://github.com/scidonia/pdf-structured-output-server" rel="noopener noreferrer"&gt;https://github.com/scidonia/pdf-structured-output-server&lt;/a&gt;&lt;br&gt;
Invoice Processing Front-end: &lt;a href="https://github.com/scidonia/demo-frontend-invoice-processing" rel="noopener noreferrer"&gt;https://github.com/scidonia/demo-frontend-invoice-processing&lt;/a&gt;&lt;br&gt;
Product Enrichment Front-end: &lt;a href="https://github.com/scidonia/demo-frontend-product-enrichment" rel="noopener noreferrer"&gt;https://github.com/scidonia/demo-frontend-product-enrichment&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Intro to the Tutorial/Tech Overview
&lt;/h2&gt;

&lt;p&gt;The PDF Structured Output Server is a FastAPI service that extracts structured data from PDFs using the BookWyrm API. It processes product brochures, catalogs, and invoices (and any other PDFs you need to process), returning structured JSON via a streaming API.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The server is a FastAPI application that orchestrates a three-step pipeline using BookWyrm's API endpoints. The architecture is designed for flexibility. You define the output schema, and the server handles the document processing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI Server&lt;/strong&gt; (&lt;code&gt;server.py&lt;/code&gt;): Handles HTTP requests, manages parallel processing, and streams results via Server-Sent Events (SSE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ProductFeedGenerator&lt;/strong&gt; (&lt;code&gt;product_feed_generator.py&lt;/code&gt;): Orchestrates the BookWyrm API workflow.&lt;em&gt;It probably should be named something else, we initially started building a product enrichment example, before expanding it to other uses.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic Models&lt;/strong&gt; (&lt;code&gt;models.py&lt;/code&gt;): Define your structured output schema&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The BookWyrm Processing Pipeline
&lt;/h2&gt;

&lt;p&gt;The server uses three BookWyrm endpoints in sequence to transform raw PDFs into structured data:&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: PDF Text Extraction
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_extract_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pdf_bytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pdf_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Extract all pages
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;BookWyrm's &lt;code&gt;/extract_pdf&lt;/code&gt; endpoint handles PDF parsing, extracting text while preserving structure and layout. This handles complex layouts, tables, and multi-column formats.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Text to Phrasal Processing
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_process_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WITH_OFFSETS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;/process_text&lt;/code&gt; endpoint converts raw text into a phrasal representation. Semantic chunks that preserve context and relationships. This improves extraction accuracy.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Structured Extraction
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;phrases&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;phrases&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;summary_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ProductExtractionModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Your Pydantic model
&lt;/span&gt;    &lt;span class="n"&gt;model_strength&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wise&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;/summarize/sse&lt;/code&gt; endpoint extracts structured data matching your Pydantic model or JSON schema. BookWyrm uses the schema to identify and extract relevant fields.&lt;/p&gt;
&lt;h2&gt;
  
  
  Flexible Schema Definition
&lt;/h2&gt;

&lt;p&gt;You can define your output schema in two ways:&lt;/p&gt;
&lt;h3&gt;
  
  
  Option 1: Pydantic Models (CLI Mode)
&lt;/h3&gt;

&lt;p&gt;Define a Pydantic model in &lt;code&gt;models/models.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductExtractionModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Product title as mentioned in the document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Regular price with currency code (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;79.99 USD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Overall dimensions with units (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;12x8x5 in&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2: JSON Schema (API Mode)
&lt;/h3&gt;

&lt;p&gt;Pass a JSON schema directly in the API request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Product name or title"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Product price as a numeric value"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets front-end applications define schema dynamically without code changes.&lt;/p&gt;

&lt;p&gt;Note that within the schema, the description is important. It tells BookWyrm exactly what you want to extract, so treat it like a prompt to ensure the effectiveness of your app.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Starting the Server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run pdf-server serve &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Processing PDFs via API
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;/process&lt;/code&gt; endpoint accepts multiple PDFs and returns streaming results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"http://localhost:8000/process"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$BOOKWYRM_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"files=@product_brochure.pdf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"schema_name=ProductSummary"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s1"&gt;'json_schema={"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}},"required":["title"]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Streaming Response Format
&lt;/h3&gt;

&lt;p&gt;The API returns Server-Sent Events (SSE) with different message types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Status&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;update&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Starting processing of 2 files"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Progress&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;update&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"progress"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"completed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"percentage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Result&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"product_brochure.pdf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Completion&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"complete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Processed 2 of 2 files successfully"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enables real-time progress updates in frontend applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Processing
&lt;/h2&gt;

&lt;p&gt;The server processes multiple PDFs concurrently using a &lt;code&gt;ThreadPoolExecutor&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_process_single_pdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pdf_paths&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results stream as they complete, improving throughput for batch processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Error Handling &amp;amp; Resilience
&lt;/h2&gt;

&lt;p&gt;The server includes error handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Individual PDF failures don't stop batch processing&lt;/li&gt;
&lt;li&gt;Graceful fallbacks when extraction fails&lt;/li&gt;
&lt;li&gt;Clear error messages in the streaming response&lt;/li&gt;
&lt;li&gt;Validation of input files and schemas&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Product Data Enrichment
&lt;/h3&gt;

&lt;p&gt;Extract product information from brochures and catalogs for e-commerce feeds. The example model includes fields for title, price, dimensions, specifications, and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Invoice Processing
&lt;/h3&gt;

&lt;p&gt;Extract structured data from invoices—vendor, date, line items, totals—by defining an invoice-specific schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Custom Document Processing
&lt;/h3&gt;

&lt;p&gt;Define any Pydantic model or JSON schema to extract data from contracts, reports, or forms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits for Developers
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;No prompt engineering: Define your schema; BookWyrm handles extraction&lt;/li&gt;
&lt;li&gt;Type safety: Pydantic models provide validation and IDE support&lt;/li&gt;
&lt;li&gt;Streaming API: Real-time progress for better UX&lt;/li&gt;
&lt;li&gt;Parallel processing: Efficient batch handling&lt;/li&gt;
&lt;li&gt;Flexible schemas: JSON schema support for dynamic front-end needs&lt;/li&gt;
&lt;li&gt;Production-ready: Error handling, validation, and logging&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/scidonia/pdf-structured-output-server
   &lt;span class="nb"&gt;cd &lt;/span&gt;pdf-structured-output-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Install dependencies:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Set your BookWyrm API key (not needed if sending authorization headers via the front-end):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BOOKWYRM_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Start the server:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   uv run pdf-server serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Process your first PDF:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"http://localhost:8000/process"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$BOOKWYRM_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"files=@your_document.pdf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"schema_name=MySchema"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s1"&gt;'json_schema={"type":"object","properties":{"title":{"type":"string"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Try it for yourself and let us know what you think
&lt;/h2&gt;

&lt;p&gt;The PDF Structured Output Server demonstrates how BookWyrm's API endpoints can be combined to build production-ready document processing pipelines. By abstracting away the complexity of PDF parsing and AI extraction, developers can focus on defining their data requirements and building applications.&lt;/p&gt;

&lt;p&gt;The streaming API, parallel processing, and flexible schema system make it suitable for both manual document processing workflows and automated pipelines. Whether you're enriching product catalogs, processing invoices, or extracting custom data from business documents, this server provides a solid foundation.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/scidonia/pdf-structured-output-server" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; to explore the code, and visit &lt;a href="https://bookwyrm.ai/about" rel="noopener noreferrer"&gt;BookWyrm's documentation&lt;/a&gt; to learn more about the API endpoints powering this solution.&lt;/p&gt;

&lt;p&gt;The actual server was built in half a day using Aider assistance and our AI-integration docs. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
