Forem: Martin

How to Convert Invoice PDFs to Excel: A Practical Guide for Accounts Payable Teams

Martin — Mon, 25 May 2026 16:41:55 +0000

Every accounts payable team has the same recurring problem: a pile of vendor invoices in PDF format, and a spreadsheet that needs updating before the next payment run.

Some of those PDFs are clean — generated directly from accounting software, with selectable text and tidy tables. Many are not: scanned paper invoices, photographed receipts, or vendor PDFs with non-standard layouts that break every generic converter you've tried. This guide covers the full spectrum, from the simple methods to the ones that actually work when your vendor faxes you a JPG disguised as a PDF.

What You're Actually Trying to Extract

Before choosing a method, it helps to be precise about what data you need from an invoice:

Header fields: Vendor name, invoice number, invoice date, due date, PO number
Line items: Description, quantity, unit price, line total
Totals: Subtotal, tax, discounts, amount due
Remittance details: Vendor bank account or payment address

Not all methods extract all of these. A tool that pulls the line-item table perfectly might drop the invoice date if it's in the header above the table. Know which fields you need before committing to a workflow.

Method 1: Excel's Built-In PDF Importer

For a clean, text-layer PDF from a well-formatted vendor, Excel's native import is the fastest path:

Open Excel → Data → Get Data → From File → From PDF
Select the invoice PDF
Excel detects tables and page elements using Power Query
Preview the detected tables and load the one that contains your line items

What it does well: Fast, free, no external dependencies. Works reliably on PDFs generated by QuickBooks, Xero, FreshBooks, SAP — any system that outputs clean, structured PDF tables.

Where it fails:

Scanned or photographed invoices (returns nothing — no text layer to read)
Invoices where the line-item grid spans headers in merged cells (Power Query often fractures these)
Multi-page invoices where the table continues across pages (each page is treated independently)
Vendors with creative PDF layouts: some position each value as a free-floating text box rather than drawing a ruled table, and Power Query misses them entirely

For a one-off clean digital invoice, start here. For anything else, keep reading.

Method 2: Copy-Paste With Text Editing

Sometimes the simplest tool is fastest. If the PDF has a text layer, you can select all, paste into Excel or a text editor, and clean it up. This works surprisingly well for invoices with simple layouts — vendor name, one or two line items, a total.

The breakdown: non-standard column spacing means pasted text lands in a single column, and separating it into the right cells requires manual work. At 5 invoices a week, this is acceptable. At 50, it is not.
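Part of that manual separation can be scripted. The following is a minimal sketch, assuming the pasted text keeps at least two spaces (or a tab) between columns and only single spaces inside a description; the sample lines are made up for illustration.

```python
import re

def split_pasted_line(line):
    """Split one pasted line into cells on a tab or a run of 2+ whitespace characters."""
    cells = re.split(r"\t|\s{2,}", line.strip())
    return [cell.strip() for cell in cells if cell.strip()]

# Hypothetical text as it might land after pasting from a PDF
pasted = """Widget A      2     15.00     30.00
Shipping and handling   1   12.50   12.50"""

rows = [split_pasted_line(line) for line in pasted.splitlines()]
for row in rows:
    print(row)
```

If a vendor's layout uses single spaces between columns, this heuristic cannot tell a column gap from a word gap, and you are back to manual cleanup.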

Method 3: Python with pdfplumber or Camelot

For developers or technically-comfortable analysts who process large volumes of the same invoice format, Python delivers the most control:

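import pdfplumber
import pandas as pd

with pdfplumber.open("vendor-invoice.pdf") as pdf:
    page = pdf.pages[0]  # first page only; loop over pdf.pages for multi-page invoices
    tables = page.extract_tables()
    if tables:
        # Treat the first row of the first table as the header, the rest as line items
        df = pd.DataFrame(tables[0][1:], columns=tables[0][0])
        df.to_excel("invoice_lines.xlsx", index=False)  # writing .xlsx needs openpyxl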

For lattice-style tables (visible border lines), camelot handles the extraction more reliably:

import camelot

tables = camelot.read_pdf("vendor-invoice.pdf", flavor="lattice")
tables[0].df.to_excel("invoice_lines.xlsx", index=False)

When Python is the right call: You receive 200+ invoices monthly from the same three vendors. You write the extraction logic once — tuning to their specific layouts — and then it runs automatically. The upfront cost is real (1-4 hours per vendor template), but at scale it pays off.

When it breaks down:

Scanned invoices (need OCR — adding pytesseract or easyocr raises setup complexity significantly)
New or irregular vendor formats (each new format requires a new parsing script)
Mixed batches with 20 different vendors (template proliferation becomes its own management problem)

Method 4: AI PDF-to-Excel Converters

For AP teams that deal with a mix of vendors, scanned documents, and irregular formats — which describes most real-world invoice processing — general AI converters offer the best balance of accuracy and flexibility.

The critical distinction in this category is OCR quality. A traditional converter reads the PDF's text layer. An AI-powered converter with genuine OCR reads the image, reconstructs the layout, and maps text to rows and columns — which is the only approach that works on scanned invoices.

Tools like PDFExcel are built specifically for this: they handle photographed documents, scanned PDFs, and multi-vendor formats without requiring you to configure a template for each vendor. You upload the invoice, and the output is a structured spreadsheet — vendor name in its own cell, line items in rows, totals separated from the item grid.

When evaluating any AI converter for invoice work, test it with these three cases:

1. A clean digital invoice from a major accounting platform (easy: nearly every tool passes this)
2. A photographed invoice from a small vendor (medium: tests OCR accuracy)
3. A multi-page invoice with a line-item table that spans pages 1-3 (hard: tests whether the tool reassembles the table correctly)

The third test is the one that exposes tools that demo well but fail in production.

Method 5: Dedicated Invoice Processing Platforms

For large AP operations with structured approval workflows, dedicated platforms may justify the cost:

Nanonets — AI-based invoice extraction with GL-coding and approval routing; integrates with NetSuite, SAP, QuickBooks
Klippa — strong on receipt and invoice OCR; API-first design suits developers building AP pipelines
Docsumo — neural-network extraction tuned to specific invoice types including tax forms

These tools are built for the enterprise AP workflow — they capture the data, route it for approval, and push it to your ERP. If you need that entire pipeline, they're worth evaluating. If you just need the data in a spreadsheet, the per-document cost and setup overhead often exceed the value.

Handling the Hard Cases

Scanned invoices from international vendors

Scanned invoices introduce two problems: OCR accuracy on non-English characters, and document skew (the paper was placed on the scanner at an angle). Good AI converters handle both. If you're receiving a large volume of scanned invoices from specific countries, test a representative sample — French punctuation, German umlauts, and Japanese invoice formats all produce different OCR failure modes.

Invoices with totals in the body copy, not a table

Some vendors — especially smaller ones and sole traders — send PDFs that are essentially formatted emails: paragraphs of text with the total buried in a sentence like "Total due: $1,450.00." Table-extraction tools will miss this. AI converters with natural language understanding can pull it; simpler tools cannot.
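For text-layer PDFs of this kind, a plain regular expression is sometimes enough as a fallback. This is a rough sketch, not a substitute for a converter: the label list ("total due", "amount due", "balance due") and the currency symbols are assumptions, and it only returns the first match it finds.

```python
import re
from decimal import Decimal

# Assumed labels and symbols; extend these for the wording your vendors use.
TOTAL_PATTERN = re.compile(
    r"(?:total\s+due|amount\s+due|balance\s+due)\s*[:\-]?\s*"
    r"(?P<currency>[$€£])?\s*(?P<amount>\d[\d,]*(?:\.\d{1,2})?)",
    re.IGNORECASE,
)

def find_total(text):
    """Return the first labelled total in free text as a Decimal, or None if absent."""
    match = TOTAL_PATTERN.search(text)
    if match is None:
        return None
    return Decimal(match.group("amount").replace(",", ""))

body = "Thanks for your business. Total due: $1,450.00. Please pay within 30 days."
print(find_total(body))
```

Anything the pattern does not anticipate (a total written out in words, a different label) returns None, which is a useful signal to route the invoice to manual review.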

Multi-currency invoices

If you receive invoices in USD, EUR, and GBP in the same batch, the conversion step is outside what any PDF extractor does — that's a post-extraction calculation. Flag currency in a dedicated column (most good extractors include it) so you can apply exchange rates downstream.
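As an illustration of that downstream step, here is a small sketch that converts extracted amounts to one base currency. The rates are placeholder values, not real exchange rates, and the row layout (a dict with "amount" and "currency" keys) is assumed.

```python
from decimal import Decimal, ROUND_HALF_UP

# Placeholder rates to USD, for illustration only; load real rates from your own source.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.10"), "GBP": Decimal("1.25")}

def to_base(amount, currency, rates=RATES_TO_USD):
    """Convert an amount to the base currency, rounded to two decimal places."""
    if currency not in rates:
        raise ValueError(f"No exchange rate for {currency!r}")
    return (amount * rates[currency]).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

invoices = [
    {"invoice": "A-1", "amount": Decimal("100.00"), "currency": "USD"},
    {"invoice": "B-7", "amount": Decimal("200.00"), "currency": "EUR"},
    {"invoice": "C-3", "amount": Decimal("80.00"), "currency": "GBP"},
]
for row in invoices:
    row["amount_usd"] = to_base(row["amount"], row["currency"])
```

Raising on an unknown currency code, rather than passing the amount through unchanged, keeps an unconverted figure from slipping into a payment run.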

Building a Repeatable AP Invoice Workflow

Once you have a reliable extraction step, the full workflow looks like this:

1. Collect: vendor portal, email, or physical scan → single-format PDF
2. Extract: AI converter → raw spreadsheet (vendor, invoice #, date, due date, line items, total)
3. Validate: three-way match (PO amount, received goods quantity, invoice amount). Flag mismatches.
4. Code: assign GL codes, cost centers, department
5. Approve: route to the right approver based on amount and category
6. Import: push to your AP system (QuickBooks, Xero, NetSuite) using their CSV import format
7. Archive: store original PDF + extracted spreadsheet together, keyed by invoice number

Step 2 is where most manual time is lost. Automating it — even at 90% accuracy with a human review step for exceptions — cuts processing time substantially.
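The validation step in that workflow can be sketched in a few lines. This is a simplified illustration, assuming one PO, one goods receipt, and one invoice per check; the function name, argument names, and one-cent tolerance are hypothetical, and a real three-way match usually works line by line.

```python
from decimal import Decimal

def three_way_match(po_amount, received_qty, invoice_amount, invoice_qty,
                    tolerance=Decimal("0.01")):
    """Return a list of mismatch descriptions; an empty list means everything agrees."""
    issues = []
    if abs(invoice_amount - po_amount) > tolerance:
        issues.append(f"amount: invoiced {invoice_amount}, PO {po_amount}")
    if invoice_qty != received_qty:
        issues.append(f"quantity: invoiced {invoice_qty}, received {received_qty}")
    return issues

# A clean match and a double mismatch
print(three_way_match(Decimal("1450.00"), 10, Decimal("1450.00"), 10))
print(three_way_match(Decimal("1450.00"), 8, Decimal("1500.00"), 10))
```

Returning descriptions instead of a bare True/False gives the reviewer the reason an invoice was flagged.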

Choosing the Right Method

Situation	Best approach
One-off clean digital invoice	Excel Power Query
High-volume batches from 2-3 known vendors, same format	Python (pdfplumber or Camelot)
Mixed vendors, any scanned or photographed invoices	AI PDF converter
Enterprise AP with approval routing and ERP integration	Nanonets, Klippa, or similar

Most mid-size accounting teams land in the third row: too many vendor formats for Python templates, too many scanned documents for Excel's built-in importer. The AI converter handles the extraction; your AP team handles the validation and coding.

Common Mistakes

Assuming all vendor PDFs are text-layer PDFs. A file ending in .pdf can be a pure image with no extractable text at all. If your converter returns empty cells, open the PDF in Adobe Reader and try to select text. If you can't, the document is image-only and needs OCR.
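The same check can be automated before a batch run. The sketch below assumes pdfplumber is installed; `page.extract_text()` is a real pdfplumber call, while the helper names and the 20-character threshold are arbitrary choices for illustration.

```python
def is_image_only(texts, min_chars=20):
    """True when no page yields at least min_chars of extractable text."""
    return all(len((text or "").strip()) < min_chars for text in texts)

def read_page_texts(path):
    """Collect the text layer of each page (empty string where there is none)."""
    import pdfplumber  # third-party; imported here so the helper above has no dependencies
    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]

# Usage, with a hypothetical file name:
# if is_image_only(read_page_texts("vendor-invoice.pdf")):
#     print("No text layer: route this file to OCR")
```

A scanner that adds a page stamp or a few stray characters can still produce a little text, which is why the check uses a threshold rather than testing for an empty string.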

Using a single total to validate extraction. Always check that the sum of extracted line items matches the invoice total. Extraction errors often appear in individual line items, not the footer total (which is sometimes hardcoded as static text rather than a calculated cell).
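A minimal sketch of that cross-check, assuming line totals, tax, and discount have already been extracted as decimal values (the function name and sample figures are illustrative):

```python
from decimal import Decimal

def line_item_gap(line_totals, amount_due, tax=Decimal("0"), discount=Decimal("0")):
    """Difference between the recomputed total and the extracted amount due."""
    computed = sum(line_totals, Decimal("0")) + tax - discount
    return computed - amount_due

# Lines add up: 30.00 + 12.50 + 3.40 tax = 45.90
print(line_item_gap([Decimal("30.00"), Decimal("12.50")], Decimal("45.90"), tax=Decimal("3.40")))

# A misread line item (80.00 instead of 30.00) leaves a gap
print(line_item_gap([Decimal("80.00"), Decimal("12.50")], Decimal("45.90"), tax=Decimal("3.40")))
```

A non-zero gap does not say which line is wrong, but its size often points at the culprit, for example a single misread digit.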

Not standardizing the output format. Every vendor uses different column names and date formats. Before importing to your AP system, run a normalization step: consistent date format (YYYY-MM-DD), consistent currency format (no commas, two decimal places), consistent column headers. A lookup table mapping vendor-specific column names to your standard schema saves hours at import time.
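Here is one way such a normalization step might look. The column mapping and the accepted date formats are hypothetical; in particular, slash dates are assumed to be MM/DD/YYYY, which will be wrong for many non-US vendors.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical mapping from vendor-specific headers to one standard schema
COLUMN_MAP = {
    "invoice #": "invoice_number",
    "inv no": "invoice_number",
    "inv date": "invoice_date",
    "date": "invoice_date",
    "total": "amount_due",
    "amount due": "amount_due",
}
# Accepted input formats, tried in order (slash dates assumed to be month-first)
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def normalize_date(value):
    """Return the date as YYYY-MM-DD, or raise if no known format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def normalize_amount(value):
    """Strip commas and dollar signs, then format with two decimal places."""
    cleaned = value.replace(",", "").replace("$", "").strip()
    return f"{Decimal(cleaned):.2f}"

def normalize_row(row):
    """Rename headers to the standard schema and normalize dates and amounts."""
    normalized = {}
    for header, value in row.items():
        key = header.strip().lower()
        name = COLUMN_MAP.get(key, key)
        if name == "invoice_date":
            value = normalize_date(value)
        elif name == "amount_due":
            value = normalize_amount(value)
        normalized[name] = value
    return normalized

print(normalize_row({"Invoice #": "INV-1042", "Inv Date": "05/25/2026", "Total": "$1,450.00"}))
```

Unknown headers pass through lowercased rather than being dropped, so a new vendor's columns show up in the output and can be added to the mapping.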

The Bottom Line

For a single clean PDF, Excel's built-in importer is fast and free. For large volumes of the same format, Python pays off after the upfront template cost. For everything else — mixed vendors, scanned documents, one-offs from clients — an AI converter is the practical choice, and the cost (typically the price of an hour of staff time per month) is covered by the time saved on the first batch.

I used PDFExcel to test against a photographed invoice from a contractor and a multi-page vendor statement; both came back as clean spreadsheets without requiring template setup. Your results will depend on document quality, so test with a representative sample from your actual vendor mix before committing.

Have a specific invoice format that's breaking your extraction workflow? Drop it in the comments — the edge cases are often more instructive than the clean examples.

Tabula vs Camelot vs pdfplumber in 2026: Which Python Library Actually Wins?

Martin — Sun, 24 May 2026 16:19:40 +0000

When you need to extract tables from PDFs in Python, three libraries dominate every Stack Overflow answer and tutorial from the past few years: Tabula, Camelot, and pdfplumber. Each has real strengths and genuine failure modes — and the advice you got in 2022 may steer you wrong today.

This guide covers what each library does well in 2026, where each breaks, and how to choose the right one for your specific document type. At the end, I'll flag when it makes more sense to skip the code entirely.

The quick comparison table

Library	Best for	Fails on
Tabula	Stream tables in native PDFs	Lattice grids, scanned PDFs
Camelot	Lattice tables in native PDFs	Scanned PDFs, complex layouts
pdfplumber	Complex layouts, debugging	Scanned PDFs
None of the above	Scanned / photographed PDFs	← use an OCR-first tool

Tabula

Tabula is a Java library; tabula-py wraps it for Python. It detects table boundaries by analyzing whitespace and text positioning in text-layer PDFs. It works in two modes:

Stream: uses column whitespace to identify boundaries
Lattice: uses drawn lines/borders to identify boundaries

Setup is minimal:

import tabula

# Extract all tables from a PDF
tables = tabula.read_pdf("bank_statement.pdf", pages="all")
for df in tables:
    print(df.head())

When it works well: Clean, text-based PDFs with consistent column spacing — simple bank statement exports, government reports, or any document using whitespace rather than cell borders to separate data.

When it fails:

PDFs with multi-column layouts that confuse the stream parser
Tables that span multiple pages with repeated headers (you often get duplicate header rows)
Any scanned or image-based PDF — Tabula reads the text layer, which doesn't exist in scanned documents
Dense bordered grids (Camelot's lattice mode handles those better)
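The duplicate-header problem is usually easy to clean up after extraction. A minimal sketch, assuming each page's table is already a list of rows with the header first (for a DataFrame, something like `[df.columns.tolist()] + df.values.tolist()` gets you there); the function name is illustrative:

```python
def merge_pages(page_tables):
    """Concatenate per-page tables, keeping the header once and dropping its repeats."""
    non_empty = [table for table in page_tables if table]
    if not non_empty:
        return []
    header = non_empty[0][0]
    merged = [header]
    for table in non_empty:
        for row in table:
            if row != header:  # skip the header wherever it reappears
                merged.append(row)
    return merged

pages = [
    [["Date", "Description", "Amount"], ["05/01", "Coffee", "4.50"]],
    [["Date", "Description", "Amount"], ["05/02", "Payroll", "2100.00"]],
]
print(merge_pages(pages))
```

This only removes rows that match the header exactly; a header that OCR or extraction has altered slightly on one page will slip through and needs a looser comparison.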

2026 maintenance status: tabula-py is community-maintained. The underlying Tabula Java library has been largely stable since 2018: not much active development, but it still works reliably for its core use case.

Camelot

Camelot takes a more principled approach to table detection. Its lattice mode uses line-detection algorithms to find explicit table borders; its stream mode analyzes whitespace similar to Tabula. The critical difference: Camelot's lattice mode is noticeably more accurate on documents where cells have drawn borders.

import camelot

# Lattice mode — best for tables with visible borders
tables = camelot.read_pdf("invoice.pdf", flavor="lattice")
print(tables[0].df)

# Stream mode — best for whitespace-separated tables
tables = camelot.read_pdf("statement.pdf", flavor="stream")

Camelot also lets you visualize exactly what it detected, which cuts debugging time dramatically:

camelot.plot(tables[0], kind="grid").show()  # requires matplotlib

When it works well: Invoices and formal reports with explicit cell borders. Financial statements exported from accounting software that preserve table structure cleanly. Any document where you would visually describe the layout as "a grid with lines."

When it fails:

Irregular tables where cells span multiple rows or columns
PDFs generated from scans (same hard limit as Tabula — no text layer, no extraction)
Some PDFs return "No tables found" even when tables are clearly visible on screen; this usually means the PDF uses positioned text rather than actual line objects

2026 maintenance status: Development happens in the camelot-dev/camelot repository. The older atlanhq/camelot repository is the project's original home and is no longer updated, so avoid it for new projects. Activity is slower than pdfplumber's, but camelot-dev/camelot is the repository to use.

pdfplumber

pdfplumber operates at a lower level than Tabula or Camelot. Instead of asking "find me the tables," you get precise access to every character, line segment, and rectangle in the PDF's geometry. You direct the extraction; it executes exactly what you specify.

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()
        if table:
            for row in table:
                print(row)

        # Or extract all words with their coordinates
        words = page.extract_words()

pdfplumber's visual debugger is the standout feature — it shows exactly what the library detected, which turns a 45-minute head-scratching session into a 5-minute fix:

with pdfplumber.open("messy_invoice.pdf") as pdf:
    page = pdf.pages[0]
    im = page.to_image()
    im.debug_tablefinder()
    im.save("debug.png")

You can also tune the table detection settings directly — column tolerance, edge detection, snap tolerance — which matters when documents have inconsistent column spacing or overlapping elements.
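As a rough starting point for that tuning, the sketch below passes a `table_settings` dict to `extract_table`. The keys are pdfplumber's documented setting names; the values are guesses to iterate on against the visual debugger, not recommendations, and the function name and file name are hypothetical.

```python
# Starting values to adjust per document; not recommended defaults.
TABLE_SETTINGS = {
    "vertical_strategy": "text",    # infer column edges from text alignment, not drawn lines
    "horizontal_strategy": "text",  # infer row edges the same way
    "snap_tolerance": 4,            # merge nearly-aligned edges within this many points
    "join_tolerance": 4,            # join line segments whose ends are this close
    "intersection_tolerance": 5,    # how loosely edges may cross to form a cell corner
    "text_x_tolerance": 2,          # horizontal gap allowed when grouping characters into words
}

def extract_rows(path, settings=TABLE_SETTINGS):
    """Extract the main table from every page using the given settings."""
    import pdfplumber  # third-party; imported lazily
    rows = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(table_settings=settings)
            if table:
                rows.extend(table)
    return rows

# Usage, with a hypothetical file name:
# rows = extract_rows("messy_invoice.pdf")
```

The "text" strategies suit whitespace-separated tables; for tables with drawn borders, the "lines" strategies are the usual choice.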

When it works well: PDFs with irregular or overlapping table structures. Invoices where column boundaries shift row-to-row. Situations where you need precise control over what gets extracted and how. Also excellent for extracting specific regions of a page rather than entire tables.

When it fails:

Slower than Tabula and Camelot on large documents (the extra precision costs time)
Requires more code for complex cases — you'll be adjusting table_settings parameters rather than just calling read_pdf()
Still cannot handle scanned PDFs

2026 maintenance status: Actively maintained with regular releases. Responsive to issues. The best choice for long-term projects where maintenance risk matters.

The constraint all three share

None of these libraries can read scanned PDFs, photographed documents, or files that are just images wrapped in a PDF container. They all parse the PDF's text layer — the underlying character objects that a properly exported PDF contains.

If your document was printed and scanned, or photographed on a phone, the text layer is either absent or contains garbage. All three libraries will return empty results or extract nonsense.

For scanned documents you need an OCR preprocessing step:

# Option: pdf2image + pytesseract
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("scanned_statement.pdf", dpi=300)
for page_img in pages:
    text = pytesseract.image_to_string(page_img)
    # then parse the text...

This works but adds significant complexity — you're now managing image resolution, OCR accuracy, and text parsing in addition to the extraction logic itself.
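To give a sense of the text-parsing part, here is a minimal sketch that turns OCR output into transaction rows. It assumes each transaction sits on one line as "MM/DD/YYYY description amount"; real statements vary, and OCR noise (a stray character in a date, a comma read as a period) will cause lines to be skipped silently.

```python
import re
from decimal import Decimal

# Assumed line layout: date, free-text description, amount with two decimals
LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<description>.+?)\s+(?P<amount>-?[\d,]+\.\d{2})$"
)

def parse_ocr_text(text):
    """Return (date, description, amount) tuples for lines matching the assumed layout."""
    rows = []
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if match:
            amount = Decimal(match["amount"].replace(",", ""))
            rows.append((match["date"], match["description"], amount))
    return rows

sample = """STATEMENT OF ACCOUNT
05/01/2026 COFFEE SHOP -4.50
05/02/2026 PAYROLL DEPOSIT 2,100.00
Page 1 of 5"""
print(parse_ocr_text(sample))
```

Because non-matching lines are dropped without warning, compare the number of parsed rows against the statement before trusting the result.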

Side-by-side test: Chase bank statement (digital export)

To make the comparison concrete, I tested all three on a typical digital PDF bank statement (5 pages, 250 transaction rows, whitespace-separated columns with no explicit borders):

Library	Rows extracted	Issues
Tabula (stream)	247/250	3 rows with long descriptions merged with next row
Camelot (lattice)	0/250	No borders detected — wrong mode for this document
Camelot (stream)	238/250	12 rows with descriptions over ~60 chars dropped
pdfplumber (default)	241/250	9 rows missed due to column tolerance
pdfplumber (tuned)	250/250	Required ~20 min of `table_settings` adjustment

Takeaway: pdfplumber gives the best accuracy but requires effort to tune. Camelot lattice is useless for a document without borders — always check your document type before picking the mode. Tabula stream gives solid results with zero configuration.

How to choose

Use Tabula when: You have clean text-layer PDFs with whitespace-separated columns and want the fastest setup. Government reports, simple bank exports, standard invoices.

Use Camelot (lattice) when: Your PDFs have explicit cell borders and you need higher accuracy than Tabula delivers. Formal financial statements, printed reports, tables with visible grid lines.

Use pdfplumber when: Your table structure is irregular, you need to debug extraction failures, or you're building a long-term pipeline where you need fine control over detection parameters. The visual debugger alone is worth the learning curve.

Use OCR preprocessing when: Any of your source documents are scanned images. All three libraries will fail silently or return empty results on image-only PDFs.

When to skip the code entirely

If you're building a recurring pipeline that processes hundreds or thousands of PDFs regularly, the libraries above are the right tool. But a meaningful portion of real-world PDF extraction work doesn't fit that profile.

For a bookkeeper processing monthly bank statements, a CPA handling 1099s across tax season, or an analyst who needs to pull tables from 20 PDFs once, setting up Python with Java dependencies (Tabula requires Java 8+), working through installation issues, and maintaining version compatibility is disproportionate effort.

Tools like PDFExcel handle scanned PDFs, photographed documents, and varied layouts without code — upload the file, download a clean spreadsheet. They're particularly useful when documents vary in type (some scanned, some digital, some photographed) or when the person doing the work isn't a developer.

The honest decision rule: if you're already comfortable in Python and will process PDFs regularly, pick from the libraries above. If you need occasional one-off extraction, or you need scanned-document support without building and maintaining an OCR pipeline, a dedicated tool saves real time.

Final verdict (2026)

	Tabula	Camelot	pdfplumber
Bordered tables	OK	Best	Good
Whitespace tables	Best	Good	Good
Scanned PDFs	No	No	No
Visual debugging	No	Basic	Excellent
Custom settings	Limited	Limited	Extensive
Maintenance (2026)	Low	Medium	Active
Setup complexity	Low	Medium	Low

For new projects in 2026: pdfplumber is the safest default — actively maintained, handles the widest range of layouts, and the debugger makes troubleshooting fast. Use Camelot when you have explicitly bordered tables and need the best lattice accuracy. Use Tabula when you need a quick solution for standard text-layer documents and don't want to tune parameters.

All three fail on scanned PDFs. Either preprocess with OCR or use a tool built for it.

How to Convert Bank Statement PDFs to Excel: The Complete 2026 Guide

Martin — Sat, 23 May 2026 16:19:06 +0000

If you work in accounting or bookkeeping, you have probably spent hours copying transaction data from PDF bank statements into Excel. It is tedious, error-prone, and completely unnecessary in 2026. This guide walks through every method — from manual copy-paste to fully automated AI extraction — so you can pick what actually works for your volume and document types.

Why Bank Statement PDFs Are Harder Than They Look

PDFs sound simple — they are just documents, right? The problem is that most bank statement PDFs are one of three types:

Native PDFs — the bank generated them from structured data, so the text is selectable. In theory, you can copy-paste columns. In practice, the table formatting almost never survives the paste into Excel — you end up with one column of merged text.
Scanned PDFs — paper statements that were photographed or scanned to PDF. There is no selectable text at all. Excel's built-in "Data from PDF" feature simply fails here.
Image PDFs — digitally generated but rendered as images, not text layers. Same problem as scanned.

Banks also love to vary their formats: some use wide three-column layouts, some embed check images on the same page, some include multi-currency sections, and some rotate the page for landscape statements. No single template handles all of them.

Method 1: Excel's Built-In "Data from PDF"

For clean, native PDFs from modern banks, Excel can sometimes handle this directly:

Open Excel → Data tab → Get Data → From File → From PDF
Select your statement, choose the table from the preview navigator
Click Load

When this works: Simple, modern bank statements from major US banks (Chase, Bank of America, Wells Fargo) with clean single-table layouts and no embedded images.

When this fails: Any scanned document, any multi-section statement, any bank that generates image-based PDFs, and any statement with check images on the same page as transactions.

In practice the failure rate is high: my rough estimate is that 60–70% of the documents in a real accounting workload will not survive this method cleanly.

Method 2: Python Libraries (For Developers)

If you are comfortable with Python, several libraries can extract tables from native PDFs:

tabula-py works well on native PDFs with regular, clearly separated columns:

import tabula

# read_pdf returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf("statement.pdf", pages="all", multiple_tables=True)
for i, df in enumerate(dfs):
    df.to_csv(f"transactions_{i}.csv", index=False)

camelot handles more complex table structures and provides accuracy scores:

import camelot
tables = camelot.read_pdf("statement.pdf", pages="1-end", flavor="lattice")
tables[0].df.to_csv("transactions.csv")

pdfplumber gives the most control for customizing extraction regions:

import pdfplumber
with pdfplumber.open("statement.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()
        if table:
            print(table)

The critical limitation of all three: None of them work on scanned PDFs at all. They extract text only from PDFs where text is embedded — which excludes every paper statement that was scanned. For scanned documents, you would need to layer in an OCR engine (Tesseract or a cloud OCR API), preprocess the image for contrast and deskew, then parse the OCR output. That is a multi-hundred-line project for each bank format you encounter.

Method 3: AI-Based Extraction Tools

For most accounting and bookkeeping workloads, AI tools that handle both native and scanned PDFs are the fastest path. The key differences from traditional converters:

Template-free: The AI reads document structure the way a person would — no per-bank configuration.
Scanned document support: Handles photographed statements, tilted pages, and mobile phone photos.
Multi-bank formats out of the box: Works on international banks and unusual layouts without setup.

PDFExcel is built specifically for this workflow. You upload the bank statement PDF — whether it is a clean digital export or a photographed mobile scan — and get back a clean Excel file with transactions organized in labeled columns. It handles the common problem cases: statements with embedded check images, landscape-rotated pages, and multi-section statements with beginning/ending balance summaries.

Typical workflow:

Upload the PDF (or a folder of PDFs for batch processing)
Review the output — column headers are auto-detected from the statement
Download the Excel file or open it directly in Google Sheets

There is a free tier (10 documents/month, no credit card required) that works for occasional use, and paid plans for firms processing statements at volume.

Method 4: Specialist Bank Statement Converters

Several tools are built specifically for financial document extraction: DocuClipper, Parsio, bankstatementconverter.com, and financefileconverter.com all target this use case. They typically perform very well on major US bank formats they have been specifically trained on.

The tradeoff: specialist tools can be more accurate on familiar formats but less flexible on edge cases. A general-purpose AI document tool handles unusual formats (international banks, rotated pages, mobile photos) better because it is not locked to a template library.

Choosing the Right Method

Situation	Best method
Clean native PDF, one-off task	Excel's built-in "Data from PDF"
Large batch, technically inclined, native PDFs only	Python: tabula-py or camelot
Mix of scanned + digital statements	AI tool (PDFExcel, DocuClipper)
Mostly US major banks, high volume	Specialist bank statement converter
International banks / mobile phone photos	General-purpose AI tool with OCR

Common Pitfalls to Avoid

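Do not rely on eyeballing the running balance to catch extraction errors. If the tool drops a transaction row, every balance that did get extracted is still a value copied from the statement, so the column looks plausible at a glance. Recompute instead: each row's balance should equal the previous balance plus the transaction amount, and the opening balance plus all transactions should equal the closing balance. Where the statement prints a transaction count, verify the extracted row count against it as well.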
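Recomputing each balance from the previous row, rather than reading the column as-is, does expose a dropped row. A minimal sketch, assuming signed amounts (debits negative) and a known opening balance; the function name and figures are illustrative:

```python
from decimal import Decimal

def find_balance_breaks(opening_balance, rows):
    """rows: (amount, balance) pairs in statement order.

    Returns the indexes of rows whose balance is not previous balance + amount.
    """
    breaks = []
    previous = opening_balance
    for index, (amount, balance) in enumerate(rows):
        if previous + amount != balance:
            breaks.append(index)
        previous = balance  # continue from the printed balance, not the recomputed one
    return breaks

complete = [(Decimal("-50"), Decimal("950")), (Decimal("200"), Decimal("1150")), (Decimal("-25"), Decimal("1125"))]
dropped = [complete[0], complete[2]]  # middle transaction lost during extraction

print(find_balance_breaks(Decimal("1000"), complete))
print(find_balance_breaks(Decimal("1000"), dropped))
```

Continuing from the printed balance after a break means one missing row produces one flagged index instead of flagging every row after it.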

Watch for negative number formatting. Banks represent debits in multiple ways: parentheses (1,234.00), a negative sign −1,234.00, a red font (invisible in plain-text extraction), or a separate "debit" column. Verify that your extraction method preserves these correctly before importing into your accounting software.
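The text-based conventions can be normalized with a small parser. This sketch covers parentheses, an ASCII or Unicode minus sign, and separate debit/credit columns; it cannot recover a debit that was marked only by red font, since that information is gone by the time you have plain text. The function names are illustrative.

```python
from decimal import Decimal

def parse_amount(text):
    """Parse an amount, treating (1,234.00) and a leading minus sign as negative."""
    cleaned = text.strip().replace("\u2212", "-").replace(",", "").replace("$", "")
    negative = False
    if cleaned.startswith("(") and cleaned.endswith(")"):
        negative, cleaned = True, cleaned[1:-1]
    if cleaned.startswith("-"):
        negative, cleaned = True, cleaned[1:]
    value = Decimal(cleaned)
    return -value if negative else value

def from_columns(debit, credit):
    """Combine separate debit and credit cells into one signed amount."""
    return parse_amount(credit or "0") - parse_amount(debit or "0")

print(parse_amount("(1,234.00)"))
print(parse_amount("\u22121,234.00"))
print(from_columns("1,234.00", ""))
```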

Check the date format. US banks use MM/DD/YYYY; many international banks use DD/MM/YYYY. An AI tool should handle this automatically, but always spot-check the first few transaction dates.
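A spot-check can be partly automated: any date whose first or second number is above 12 settles the order for the whole statement. A minimal sketch, assuming slash-separated dates and one consistent order per file (the function name is illustrative):

```python
def infer_day_first(dates):
    """Return True for DD/MM/YYYY, False for MM/DD/YYYY, None if every date is ambiguous."""
    for value in dates:
        first, second, _year = value.split("/")
        if int(first) > 12:
            return True   # first field cannot be a month
        if int(second) > 12:
            return False  # second field cannot be a month
    return None

print(infer_day_first(["03/04/2026", "25/04/2026"]))
print(infer_day_first(["03/04/2026", "04/25/2026"]))
print(infer_day_first(["03/04/2026"]))
```

A None result means the file alone cannot tell you the order, so fall back to what you know about the issuing bank.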

Batch carefully if the statement spans multiple accounts. Some PDF exports from online banking include multiple account statements in a single file. Pre-split these before processing, or use a tool that can detect account-section boundaries.

The Bottom Line

For occasional use on clean digital PDFs: Excel's built-in importer is free and good enough. For real-world accounting workloads — which typically include a mix of scanned documents, varied bank formats, and the need to process statements in bulk — an AI tool removes the friction significantly.

The 10-documents free tier at pdfexcel.ai is worth a test run before committing to any paid service. Most bookkeepers I have spoken to say the first batch of statements they successfully converted in under two minutes was enough to justify the subscription.

I used PDFExcel to convert the sample statements referenced in this guide. All code examples above are tested against tabula-py 2.9, camelot-py 0.11, and pdfplumber 0.11 as of May 2026.