# PDF Invoice Extraction Solution
## Library Choice
**LangChain + OpenAI (GPT-4o-mini)** combined with **pypdf** for text extraction.
Since your 200 invoices have **different layouts**, rule-based tools like `pdfplumber` or regex would require custom parsers per vendor. An LLM handles layout variability natively, extracting structured fields regardless of format. Using **Pydantic schemas** with LangChain's `with_structured_output()` ensures reliable, typed JSON responses ready for CSV export.
## Installation
```bash
pip install langchain langchain-openai pypdf pydantic pandas
export OPENAI_API_KEY="your-key-here"
```
## Code
```python
import os
import glob
import pandas as pd
from typing import List
from pypdf import PdfReader
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
# --- Define structured output schema ---
class LineItem(BaseModel):
    description: str = Field(description="Line item description")
    quantity: float = Field(description="Quantity ordered")
    price: float = Field(description="Unit price")

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number/ID")
    date: str = Field(description="Invoice date (YYYY-MM-DD)")
    vendor_name: str = Field(description="Vendor/supplier name")
    line_items: List[LineItem] = Field(description="All line items")
    total_amount: float = Field(description="Final total amount")

# --- Initialize LLM with structured output ---
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(Invoice)

def extract_pdf_text(pdf_path: str) -> str:
    """Extract raw text from a PDF file."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def parse_invoice(pdf_path: str) -> dict:
    """Send invoice text to LLM and return structured data."""
    text = extract_pdf_text(pdf_path)
    prompt = f"Extract invoice details from the following text:\n\n{text}"
    try:
        invoice: Invoice = structured_llm.invoke(prompt)
        return invoice.model_dump()
    except Exception as e:
        print(f"Error parsing {pdf_path}: {e}")
        return None

# --- Process all invoices in folder ---
INVOICE_FOLDER = "./invoices"
rows = []
for pdf_file in glob.glob(os.path.join(INVOICE_FOLDER, "*.pdf")):
    print(f"Processing: {pdf_file}")
    data = parse_invoice(pdf_file)
    if not data:
        continue
    # Flatten line items into one row per item
    for item in data["line_items"]:
        rows.append({
            "source_file": os.path.basename(pdf_file),
            "invoice_number": data["invoice_number"],
            "date": data["date"],
            "vendor_name": data["vendor_name"],
            "description": item["description"],
            "quantity": item["quantity"],
            "price": item["price"],
            "line_total": item["quantity"] * item["price"],
            "invoice_total": data["total_amount"],
        })
# --- Export to CSV ---
df = pd.DataFrame(rows)
df.to_csv("invoices_extracted.csv", index=False)
print(f"Done. Extracted {len(df)} line items from {df['source_file'].nunique()} invoices.")
```
## How It Works
1. **`pypdf`** extracts raw text from each PDF (fast and free).
2. **Pydantic models** define the exact schema (`Invoice` + `LineItem`), giving the LLM a strict contract.
3. **`ChatOpenAI.with_structured_output()`** forces GPT-4o-mini to return JSON matching the schema — no regex or prompt engineering needed per vendor.
4. The script **iterates** all PDFs, flattens nested line items into tabular rows, and writes a clean CSV.
**Cost estimate:** ~$0.01–0.03 per invoice with `gpt-4o-mini` (~$2–6 total for 200 invoices). For scanned/image-based PDFs, swap `pypdf` for `pytesseract` OCR or use GPT-4o's vision API directly.
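For the scanned-PDF case mentioned above, a minimal OCR fallback might look like the sketch below. It assumes `pdf2image` and `pytesseract` are installed (`pip install pdf2image pytesseract`), along with the Poppler and Tesseract system binaries they wrap.

```python
def extract_pdf_text_ocr(pdf_path: str, dpi: int = 300) -> str:
    """OCR fallback for scanned PDFs: rasterize each page, then run Tesseract.
    Requires pdf2image + pytesseract plus the Poppler and Tesseract binaries."""
    from pdf2image import convert_from_path  # lazy import: only needed for scans
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(img) for img in pages)
```

Drop this in as a replacement for `extract_pdf_text` when `pypdf` returns empty text; the rest of the pipeline is unchanged.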
# How to Extract Structured Data from PDFs Using Python
Tested prompts for extracting data from PDFs using Python, compared across 5 leading AI models.
Extracting data from PDFs with Python is one of those tasks that sounds simple until you actually try it. PDFs were designed for printing, not data portability, so the same file format that renders beautifully on screen can be a nightmare to parse programmatically. Whether you are pulling invoice totals, scraping research paper tables, or batch-processing hundreds of scanned forms, the core challenge is the same: getting structured, usable data out of a format that actively resists it.
Python has a mature ecosystem for this problem. Libraries like PyMuPDF, pdfplumber, pdfminer.six, and Camelot each handle different PDF types and structures. Text-based PDFs with clean formatting are the easiest case. Scanned PDFs require OCR via tools like pytesseract or cloud vision APIs. Tables embedded in PDFs are their own category entirely, often needing layout-aware parsers to preserve column relationships correctly.
This page shows you exactly how an AI-assisted approach accelerates the process. Instead of writing boilerplate extraction and cleaning code from scratch, you can use a tested prompt to generate Python scripts tuned to your specific PDF structure. Below you will find real model outputs, a comparison of how different models handle the task, and practical guidance on which approach fits your situation.
## When to use this
This approach works best when you have PDFs with consistent structure and need to automate extraction at scale. If you are dealing with invoices that follow a template, reports with predictable section headers, or forms with labeled fields, Python-based extraction with AI-generated code will save you significant time compared to manual parsing or one-off scripts.
- Batch extracting invoice line items, totals, and vendor details from hundreds of PDFs
- Pulling structured tables from financial reports or scientific papers for downstream analysis
- Parsing government or legal forms where fields follow a known layout
- Converting product catalog PDFs into structured JSON or CSV for database ingestion
- Automating data entry workflows by extracting key fields from uploaded PDF documents
## When this format breaks down
- Scanned PDFs with low resolution or skewed text: OCR accuracy drops sharply and extracted data will require heavy manual correction before it is usable.
- PDFs with highly variable or unpredictable layouts: if each document looks different, template-based extraction breaks and you will spend more time on edge cases than you save on automation.
- Legally sensitive documents where extraction errors carry real risk: always validate AI-generated parsing code against a human-reviewed sample before deploying to production.
- PDFs that are password-protected or DRM-locked: most Python libraries cannot decrypt these without the owner password, and bypassing DRM may violate terms of service or law.
## The prompt we tested

> You are a Python developer specializing in PDF data extraction. Based on the user's PDF extraction scenario below, provide a complete, working Python solution that extracts the requested data. Instructions: Recommend the best-fit Python library (e.g., pdfplumber, PyPDF2, pymupdf, unstructured, or LLM-based approaches like LangChain with OpenAI) based on the use case, then provide runnable code with comments and pip install commands. Keep the explanation concise (under 150 words) and structure the response with clear sections: Library Choice, Installation, Code, and How It Works. User's PDF extraction scenario: I have a folder of 200 vendor invoices in PDF format. Each invoice has a different layout but contains the same fields: invoice number, date, vendor name, line items (description, quantity, price), and total amount. I need to extract this data into a structured CSV file for accounting. Provide the recommended library choice, installation command, complete working code, and a brief explanation of how it works.
## Model comparison
| Model | Quality | Cost | Latency | Tokens |
|---|---|---|---|---|
| Claude Opus 4.7 | 9 | $0.1189 | 17.0s | 339 / 1518 |
| Claude Haiku 4.5 | 9 | $0.0072 | 9.1s | 238 / 1393 |
| GPT-5.4 | 9 | $0.0313 | 24.2s | 213 / 2048 |
| Gemini 2.5 Pro | Not run yet | |||
| Grok 4.1 Fast Reasoning | 8 | $0.0004 | 15.4s | 361 / 640 |
## Outputs
# PDF Vendor Invoice Extraction Solution
## Library Choice
**Recommendation: `pdfplumber`**
`pdfplumber` excels at extracting structured data (tables, text) from PDFs with varying layouts. It handles text positioning well and works reliably with invoices. For highly inconsistent layouts, pair it with `LangChain + GPT-4` for semantic understanding, but `pdfplumber` is ideal for this use case since invoices typically follow predictable patterns.
---
## Installation
```bash
pip install pdfplumber pandas openpyxl
```
---
## Code
```python
import pdfplumber
import pandas as pd
import os
import re
from pathlib import Path
def extract_invoice_data(pdf_path):
    """
    Extract invoice data from a single PDF file.
    Handles varying layouts by searching for key patterns.
    """
    invoice_data = {
        'invoice_number': None,
        'date': None,
        'vendor_name': None,
        'line_items': [],
        'total_amount': None,
        'file_name': os.path.basename(pdf_path)
    }

    try:
        with pdfplumber.open(pdf_path) as pdf:
            # Extract text from first page (invoices typically fit on 1 page)
            page = pdf.pages[0]
            text = page.extract_text()

            # Try to extract tables (common in invoices)
            tables = page.extract_tables()
            if tables:
                # Process first table (usually line items)
                for row in tables[0][1:]:  # Skip header
                    if len(row) >= 3:
                        invoice_data['line_items'].append({
                            'description': row[0],
                            'quantity': row[1],
                            'price': row[2]
                        })

            # Extract key fields using regex patterns
            # Invoice Number
            inv_match = re.search(r'(?:Invoice|Invoice #|Inv #)[\s#:]*([A-Z0-9\-]+)', text, re.IGNORECASE)
            if inv_match:
                invoice_data['invoice_number'] = inv_match.group(1).strip()

            # Date
            date_match = re.search(r'(?:Date|Invoice Date)[\s:]*(\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})', text, re.IGNORECASE)
            if date_match:
                invoice_data['date'] = date_match.group(1).strip()

            # Vendor Name (usually near top)
            lines = text.split('\n')
            if lines:
                invoice_data['vendor_name'] = lines[0].strip()

            # Total Amount
            total_match = re.search(r'(?:Total|Total Amount)[\s$:]*(\d+[.,]\d{2})', text, re.IGNORECASE)
            if total_match:
                invoice_data['total_amount'] = total_match.group(1).strip()

    except Exception as e:
        print(f"Error processing {pdf_path}: {str(e)}")

    return invoice_data


def process_invoice_folder(folder_path, output_csv='invoices_data.csv'):
    """
    Process all PDFs in a folder and export to CSV.
    """
    all_invoices = []
    pdf_files = list(Path(folder_path).glob('*.pdf'))

    print(f"Found {len(pdf_files)} PDF files. Processing...")

    for idx, pdf_file in enumerate(pdf_files, 1):
        print(f"[{idx}/{len(pdf_files)}] Processing: {pdf_file.name}")
        data = extract_invoice_data(str(pdf_file))
        all_invoices.append(data)

    # Create DataFrame
    df = pd.DataFrame(all_invoices)

    # Expand line_items into separate rows if needed
    expanded_rows = []
    for _, row in df.iterrows():
        if row['line_items']:
            for item in row['line_items']:
                expanded_row = row.copy()
                expanded_row.update(item)
                expanded_rows.append(expanded_row)
        else:
            expanded_rows.append(row)

    df_expanded = pd.DataFrame(expanded_rows)

    # Export to CSV
    df_expanded.to_csv(output_csv, index=False)
    print(f"\n✓ Data exported to {output_csv}")
    print(f"Total rows: {len(df_expanded)}")

    return df_expanded


# Usage
if __name__ == "__main__":
    folder_path = "./invoices"  # Change to your folder path
    df = process_invoice_folder(folder_path, 'vendor_invoices.csv')
    print("\nFirst few rows:")
    print(df.head())
```
---
## How It Works
1. **Iterates** through all PDFs in the folder
2. **Extracts tables** automatically (line items)
3. **Uses regex patterns** to find invoice number, date, vendor, and total
4. **Handles varying layouts** by searching for key patterns rather than fixed positions
5. **Flattens line items** into separate rows for accounting systems
6. **Exports to CSV** with all extracted fields
**Limitations:** Highly inconsistent layouts may require manual regex tuning per vendor. For 100% accuracy, use **LangChain + GPT-4** (adds cost but handles any layout).
## Library Choice
**Best fit: `unstructured` + `pdfplumber` + regex**
Because the invoices have **different layouts**, a pure fixed-coordinate/table approach is brittle. `unstructured` helps normalize text across varied PDFs, while `pdfplumber` is strong for extracting line-item tables when present. This hybrid approach is practical, local, and cheaper/simpler than an LLM pipeline for 200 files.
## Installation
```bash
pip install unstructured pdfplumber pandas python-dateutil
```
## Code
```python
import os
import re
import csv
import pdfplumber
import pandas as pd
from dateutil import parser as date_parser
from unstructured.partition.pdf import partition_pdf
# =========================
# CONFIG
# =========================
PDF_FOLDER = "invoices" # Folder containing PDF invoices
OUTPUT_CSV = "extracted_invoices.csv"
# Common label patterns across invoices
FIELD_PATTERNS = {
    "invoice_number": [
        r"invoice\s*(?:number|no\.?|#)\s*[:\-]?\s*([A-Z0-9\-\/]+)",
        r"inv\s*(?:number|no\.?|#)\s*[:\-]?\s*([A-Z0-9\-\/]+)",
    ],
    "date": [
        r"(?:invoice\s*)?date\s*[:\-]?\s*([A-Za-z0-9,\/\-. ]+)",
    ],
    "total_amount": [
        r"(?:grand\s+total|invoice\s+total|total\s+due|amount\s+due|total)\s*[:\-]?\s*\$?\s*([0-9,]+\.\d{2})",
    ],
}

# Keywords often found in invoice headers/footers that we should ignore as vendor candidates
IGNORE_VENDOR_WORDS = {
    "invoice", "bill to", "ship to", "remit to", "purchase order", "po number",
    "date", "invoice number", "total", "subtotal", "tax", "amount due",
}
# =========================
# TEXT EXTRACTION
# =========================
def extract_text_unstructured(pdf_path):
    """Extract text blocks from PDF using unstructured."""
    try:
        elements = partition_pdf(filename=pdf_path)
        text = "\n".join(
            el.text.strip()
            for el in elements
            if hasattr(el, "text") and el.text and el.text.strip()
        )
        return text
    except Exception as e:
        print(f"[WARN] unstructured failed on {pdf_path}: {e}")
        return ""


def extract_text_pdfplumber(pdf_path):
    """Fallback plain text extraction using pdfplumber."""
    text_parts = []
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                txt = page.extract_text() or ""
                text_parts.append(txt)
    except Exception as e:
        print(f"[WARN] pdfplumber text extraction failed on {pdf_path}: {e}")
    return "\n".join(text_parts)
# =========================
# FIELD EXTRACTION
# =========================
def find_first_pattern(text, patterns, flags=re.IGNORECASE):
    for pat in patterns:
        m = re.search(pat, text, flags)
        if m:
            return m.group(1).strip()
    return ""


def normalize_date(date_str):
    if not date_str:
        return ""
    try:
        dt = date_parser.parse(date_str, fuzzy=True, dayfirst=False)
        return dt.strftime("%Y-%m-%d")
    except Exception:
        return date_str.strip()


def normalize_amount(amount_str):
    if not amount_str:
        return ""
    amt = amount_str.replace(",", "").replace("$", "").strip()
    return amt


def extract_vendor_name(text):
    """
    Heuristic:
    - Take first ~15 non-empty lines
    - Choose first line that is not a known invoice label and contains letters
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for line in lines[:15]:
        low = line.lower()
        if any(word in low for word in IGNORE_VENDOR_WORDS):
            continue
        if re.search(r"[A-Za-z]", line) and len(line) <= 80:
            return line
    return ""
# =========================
# LINE ITEM EXTRACTION
# =========================
def clean_table(df):
    """Clean extracted table DataFrame."""
    df = df.fillna("")
    # Remove fully empty rows
    df = df.loc[~(df.apply(lambda r: all(str(x).strip() == "" for x in r), axis=1))]
    # Strip whitespace
    df = df.applymap(lambda x: str(x).strip())
    return df


def identify_line_item_columns(df):
    """
    Try to map columns to description, quantity, price using header names.
    Returns dict with detected columns.
    """
    if df.empty:
        return None
    header = [str(c).lower().strip() for c in df.iloc[0].tolist()]
    desc_idx = qty_idx = price_idx = None
    for i, col in enumerate(header):
        if any(k in col for k in ["description", "item", "product", "details"]):
            desc_idx = i
        elif any(k in col for k in ["qty", "quantity", "hours", "units"]):
            qty_idx = i
        elif any(k in col for k in ["price", "rate", "unit price", "amount"]):
            price_idx = i
    if desc_idx is not None and (qty_idx is not None or price_idx is not None):
        return {
            "header_row": 0,
            "description": desc_idx,
            "quantity": qty_idx,
            "price": price_idx,
        }
    return None
def extract_line_items(pdf_path):
    """
    Extract line items from tables in the PDF using pdfplumber.
    Returns a list of dicts: [{description, quantity, price}, ...]
    """
    items = []
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                tables = page.extract_tables()
                for table in tables:
                    if not table or len(table) < 2:
                        continue
                    df = pd.DataFrame(table)
                    df = clean_table(df)
                    if df.empty or len(df) < 2:
                        continue
                    mapping = identify_line_item_columns(df)
                    if not mapping:
                        continue
                    start_row = mapping["header_row"] + 1
                    desc_idx = mapping["description"]
                    qty_idx = mapping["quantity"]
                    price_idx = mapping["price"]
                    for i in range(start_row, len(df)):
                        row = df.iloc[i].tolist()
                        description = row[desc_idx].strip() if desc_idx is not None and desc_idx < len(row) else ""
                        quantity = row[qty_idx].strip() if qty_idx is not None and qty_idx < len(row) else ""
                        price = row[price_idx].strip() if price_idx is not None and price_idx < len(row) else ""
                        # Skip empty/bad rows
                        if not description:
                            continue
                        if re.search(r"(subtotal|tax|total|amount due)", description, re.IGNORECASE):
                            continue
                        items.append({
                            "description": description,
                            "quantity": quantity,
                            "price": normalize_amount(price),
                        })
    except Exception as e:
        print(f"[WARN] Table extraction failed on {pdf_path}: {e}")
    return items
# =========================
# INVOICE PARSING
# =========================
def parse_invoice(pdf_path):
    text = extract_text_unstructured(pdf_path)
    if not text.strip():
        text = extract_text_pdfplumber(pdf_path)

    invoice_number = find_first_pattern(text, FIELD_PATTERNS["invoice_number"])
    invoice_date = normalize_date(find_first_pattern(text, FIELD_PATTERNS["date"]))
    total_amount = normalize_amount(find_first_pattern(text, FIELD_PATTERNS["total_amount"]))
    vendor_name = extract_vendor_name(text)
    line_items = extract_line_items(pdf_path)

    return {
        "file_name": os.path.basename(pdf_path),
        "invoice_number": invoice_number,
        "date": invoice_date,
        "vendor_name": vendor_name,
        "total_amount": total_amount,
        "line_items": line_items,
    }
# =========================
# CSV EXPORT
# =========================
def write_csv(results, output_csv):
    """
    Writes one row per line item.
    If no line items found, still writes one row with invoice-level data.
    """
    fieldnames = [
        "file_name",
        "invoice_number",
        "date",
        "vendor_name",
        "description",
        "quantity",
        "price",
        "total_amount",
    ]
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for inv in results:
            if inv["line_items"]:
                for item in inv["line_items"]:
                    writer.writerow({
                        "file_name": inv["file_name"],
                        "invoice_number": inv["invoice_number"],
                        "date": inv["date"],
                        "vendor_name": inv["vendor_name"],
                        "description": item.get("description", ""),
                        "quantity": item.get("quantity", ""),
                        "price": item.get("price", ""),
                        "total_amount": inv["total_amount"],
                    })
            else:
                writer.writerow({
                    "file_name": inv["file_name"],
                    "invoice_number": inv["invoice_number"],
                    "date": inv["date"],
                    "vendor_name": inv["vendor_name"],
                    "description": "",
                    "quantity": "",
                    "price": "",
                    "total_amount": inv["total_amount"],
                })


# =========================
# MAIN
# =========================
if __name__ == "__main__":
    pdf_files = sorted(
        os.path.join(PDF_FOLDER, f)
        for f in os.listdir(PDF_FOLDER)
        if f.lower().endswith(".pdf")
    )
    results = [parse_invoice(p) for p in pdf_files]
    write_csv(results, OUTPUT_CSV)
    print(f"Done: wrote {OUTPUT_CSV} for {len(results)} invoices.")
```
### Library Choice
LangChain with OpenAI (LLM-based) + PyMuPDF. Best for varying layouts: converts PDF to text, uses a prompt to semantically extract fields (handles tables as lists), outputs JSON for CSV structuring. Accurate, flexible; rule-based libs fail on layout variance.
### Installation
```bash
pip install langchain langchain_openai pymupdf pandas python-dotenv
```
Set `OPENAI_API_KEY` in `.env` file.
### Code
```python
import os
import json
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
import fitz # PyMuPDF
load_dotenv()
# Initialize LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Prompt for extraction
prompt = ChatPromptTemplate.from_template("""
Extract from this invoice text: invoice_number (str), date (YYYY-MM-DD), vendor_name (str),
line_items (list of dicts: {{"description": str, "quantity": float, "price": float}}),
total_amount (float).
Output ONLY valid JSON: {{"invoice_number": "...", "date": "...", "vendor_name": "...", "line_items": [...], "total_amount": ...}}
Text: {text}
""")
chain = prompt | llm
def extract_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()

    response = chain.invoke({"text": text})
    try:
        data = json.loads(response.content)
        # Flatten for CSV: one row per line item, repeat invoice fields
        rows = []
        for item in data["line_items"]:
            rows.append({
                "invoice_number": data["invoice_number"],
                "date": data["date"],
                "vendor_name": data["vendor_name"],
                "description": item["description"],
                "quantity": item["quantity"],
                "price": item["price"],
                "total_amount": data["total_amount"]
            })
        return rows
    except (json.JSONDecodeError, KeyError) as e:
        print(f"Extraction failed for {pdf_path}: {e}")
        return []
# Process folder
folder_path = Path("invoices") # Change to your folder
all_rows = []
for pdf_file in folder_path.glob("*.pdf"):
    rows = extract_from_pdf(pdf_file)
    all_rows.extend(rows)
# Save to CSV
df = pd.DataFrame(all_rows)
df.to_csv("invoices_extracted.csv", index=False)
print("Extraction complete: invoices_extracted.csv")
```
### How It Works
PyMuPDF extracts text from each PDF (fast, layout-agnostic). LangChain prompts GPT-4o-mini to parse fields into JSON, handling varying layouts/tables semantically. Flattens line items into CSV rows (one per item, repeating invoice metadata). Processes 200 PDFs in batch; ~1-2s/doc. Tune prompt for edge cases. (78 words)
## What makes these work

**1. Match the library to the PDF type.** pdfplumber and PyMuPDF work well for text-based PDFs with clean encoding. Camelot is purpose-built for tables. pytesseract plus pdf2image handles scanned image PDFs. Choosing the wrong library for your PDF type is the single biggest source of poor extraction results, so inspect a sample file manually before writing any code.
**2. Use coordinate-based extraction for complex layouts.** When regex on raw text fails because fields shift position between documents, switch to bounding-box extraction. Both pdfplumber and PyMuPDF let you extract text within a defined rectangular region of a page. This is reliable for forms where the label position is fixed even if the content length varies.
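A sketch of region-based extraction using pdfplumber's `crop` method (assumed installed via `pip install pdfplumber`); the bounding-box coordinates are hypothetical and would need to be measured from a sample page in PDF points.

```python
def extract_region(pdf_path, bbox, page_number=0):
    """Extract text from a fixed rectangular region of one page.
    bbox is (x0, top, x1, bottom) in PDF points. Useful when a field's
    position is stable even though the surrounding text shifts."""
    import pdfplumber  # lazy import; assumed installed

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        return (page.crop(bbox).extract_text() or "").strip()

# Hypothetical example: the invoice-number box sits in the top-right
# corner of page 1 on these forms.
# invoice_no = extract_region("invoice.pdf", (400, 40, 580, 90))
```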
**3. Always build a validation step into your pipeline.** AI-generated extraction code will handle the common case well but miss edge cases your training data did not cover. Add assertions or checksums where possible, for example verifying that extracted line item amounts sum to the extracted total. Log rows that fail validation rather than silently dropping them.
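That line-item checksum is a few lines of plain Python. A sketch, assuming each extracted row carries numeric `quantity` and `price` fields:

```python
def validate_invoice(line_items, invoice_total, tolerance=0.01):
    """Check that line-item totals sum to the invoice total (within rounding).
    Returns (ok, computed_sum) so failures can be logged instead of dropped."""
    computed = sum(item["quantity"] * item["price"] for item in line_items)
    return abs(computed - invoice_total) <= tolerance, round(computed, 2)

items = [{"quantity": 2, "price": 9.99}, {"quantity": 1, "price": 5.02}]
ok, computed = validate_invoice(items, 25.00)  # 2 * 9.99 + 5.02 = 25.00 → ok
```

The tolerance absorbs float rounding; rows where `ok` is False should go to a review log, not the output CSV.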
**4. Normalize output before storage.** Dates extracted from PDFs arrive in dozens of formats, currency values may include symbols or commas, and whitespace is often inconsistent. Run a normalization pass using dateutil.parser for dates and locale-aware number parsing for currency before writing to your database or CSV. Dirty data at this stage compounds downstream.
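A minimal normalization pass along those lines. It assumes `python-dateutil` is installed, and the currency cleanup assumes US-style separators (swap in locale-aware parsing for European formats):

```python
import re
from dateutil import parser as date_parser  # pip install python-dateutil

def normalize_date(raw: str) -> str:
    """Parse any common date format into ISO YYYY-MM-DD."""
    return date_parser.parse(raw, fuzzy=True).strftime("%Y-%m-%d")

def normalize_currency(raw: str) -> float:
    """Strip currency symbols, thousands separators, and whitespace."""
    return float(re.sub(r"[^\d.\-]", "", raw))

normalize_date("March 5, 2024")   # → "2024-03-05"
normalize_currency("$1,234.50")   # → 1234.5
```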
## More example scenarios
> I have a folder of 200 supplier invoices in PDF format. Each invoice has a vendor name, invoice number, date, line items with descriptions and amounts, and a total due. The layout is consistent but not identical across vendors. I need to extract these fields into a CSV using Python.
Use pdfplumber to open each PDF and extract text by page. Define a regex pattern set for each field: invoice number (INV-\d+), date (common date formats), and total (currency pattern near the word 'Total'). Loop over the folder with os.scandir, write matched fields to a csv.DictWriter, and log files where any field returns None for manual review.
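The regex side of that pipeline can be sketched as below; the sample string and patterns are illustrative, and any field that comes back `None` flags the file for manual review.

```python
import re

PATTERNS = {
    "invoice_number": re.compile(r"INV-\d+"),
    "date": re.compile(r"\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2}"),
    "total": re.compile(r"total[^\d$]*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}

def extract_fields(text):
    """Run each pattern over raw page text; None marks a field for review."""
    m_inv = PATTERNS["invoice_number"].search(text)
    m_date = PATTERNS["date"].search(text)
    m_total = PATTERNS["total"].search(text)
    return {
        "invoice_number": m_inv.group(0) if m_inv else None,
        "date": m_date.group(0) if m_date else None,
        "total": m_total.group(1).replace(",", "") if m_total else None,
    }

sample = "Acme Corp\nInvoice INV-1042  Date: 03/15/2024\nTotal Due: $1,980.00"
fields = extract_fields(sample)
# → {'invoice_number': 'INV-1042', 'date': '03/15/2024', 'total': '1980.00'}
```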
> I need to extract the income statement table from a 40-page public company 10-Q filing. The table has rows like Revenue, Cost of Goods Sold, and Net Income with three columns of quarterly figures. I want the output as a pandas DataFrame.
Use Camelot with the lattice flavor if the table has visible grid lines, or the stream flavor if it uses whitespace alignment. Call camelot.read_pdf('report.pdf', pages='12', flavor='lattice') then access tables[0].df to get the DataFrame directly. Validate by checking that column sums match the reported totals in the document.
> We have scanned patient intake forms as PDFs. Each page is an image. We need to extract patient name, date of birth, and chief complaint fields. The forms are a standard template but scanned at varying quality.
Convert each PDF page to an image using pdf2image, then run pytesseract.image_to_string on each image. Apply a preprocessing step with OpenCV to deskew and increase contrast before OCR. Use regex anchored to known label text like 'Date of Birth:' to isolate field values, and flag any extraction with low confidence scores for human review.
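A rough sketch of that OpenCV preprocessing step (assumes `opencv-python` and `numpy` are installed). Otsu's method picks the binarization threshold automatically; the deskew angle correction follows the OpenCV 4.5+ `minAreaRect` convention and its sign should be verified against real scans before trusting it.

```python
def preprocess_for_ocr(page_image):
    """Binarize and deskew a scanned page (RGB numpy array) before OCR.
    A sketch: thresholds and the rotation sign need tuning per scanner."""
    import cv2  # lazy imports; both assumed installed
    import numpy as np

    gray = cv2.cvtColor(page_image, cv2.COLOR_RGB2GRAY)
    # Otsu's threshold separates ink from paper without a manual cutoff
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Estimate skew from the minimum-area rectangle around the ink pixels
    coords = np.column_stack(np.where(binary < 255)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:  # OpenCV >= 4.5 reports angles in (0, 90]
        angle -= 90

    h, w = binary.shape
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(binary, M, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
```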
> I want to extract the title, authors, abstract, and DOI from a batch of academic PDF papers downloaded from arXiv. Papers follow typical academic formatting but were generated by different LaTeX templates.
Use PyMuPDF (fitz) to extract text with layout coordinates. The title is typically the largest font on page 1, authors follow below it in smaller text, and the abstract is the first labeled section. Extract the DOI using re.search for the doi.org URL pattern. Store results in a JSON file per paper and log failures for manual inspection.
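The largest-font heuristic splits naturally into PyMuPDF glue and a pure selection step. A sketch; `page_spans` assumes `pymupdf` is installed, and the helper names are illustrative:

```python
def largest_font_line(spans):
    """Given (text, font_size) pairs from page 1, return the text of the
    largest-font span — usually the paper title."""
    return max(spans, key=lambda s: s[1])[0] if spans else None

def page_spans(pdf_path):
    """Collect (text, size) for every text span on page 1 via PyMuPDF."""
    import fitz  # pip install pymupdf

    spans = []
    with fitz.open(pdf_path) as doc:
        for block in doc[0].get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no lines
                for span in line["spans"]:
                    if span["text"].strip():
                        spans.append((span["text"].strip(), span["size"]))
    return spans

# title = largest_font_line(page_spans("paper.pdf"))
```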
> A manufacturer sent a 300-page product catalog as a PDF. Each product has a name, SKU, dimensions, weight, and price on its own section. I need to get all products into a database-ready CSV.
Use pdfplumber to extract text page by page and identify product section boundaries by detecting the bold SKU pattern at the start of each block. Build a state machine that resets field collection each time a new SKU is found. Output rows to CSV with csv.writer after each product block is complete. Expect some cleanup needed for multi-line dimension fields.
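A toy version of that state machine, operating on already-extracted text lines. It assumes a hypothetical `SKU:` label format rather than bold-text detection, purely to show the reset-on-new-SKU pattern:

```python
import re

SKU_RE = re.compile(r"^SKU[:\s]+([A-Z0-9\-]+)")

def parse_catalog(lines):
    """Tiny state machine: a SKU line starts a new product; labeled lines
    fill its fields until the next SKU appears."""
    products, current = [], None
    for line in lines:
        m = SKU_RE.match(line)
        if m:
            if current:
                products.append(current)   # close out the previous product
            current = {"sku": m.group(1)}  # reset field collection
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip().lower()] = value.strip()
    if current:
        products.append(current)
    return products

sample = ["SKU: AB-100", "Name: Widget", "Price: $9.99",
          "SKU: AB-101", "Name: Gadget", "Price: $14.50"]
products = parse_catalog(sample)
```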
## Common mistakes to avoid

**Treating all PDFs as text-layer documents.** Scanned PDFs contain images, not selectable text, so pdfplumber and similar tools return empty strings. Always check whether your PDF has a text layer by trying to select text manually in a viewer before writing your extraction code. If it does not, you need an OCR step first.
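A quick programmatic version of that check, sketched with `pypdf`; the 20-character threshold is an arbitrary heuristic for distinguishing a real text layer from stray artifacts.

```python
def has_text_layer(pdf_path, min_chars=20):
    """Heuristic: if the first page yields almost no text, the PDF is
    probably a scan and needs OCR instead of text extraction."""
    from pypdf import PdfReader  # pip install pypdf

    first_page = PdfReader(pdf_path).pages[0]
    return len((first_page.extract_text() or "").strip()) >= min_chars
```

Run this once per file at the top of the pipeline and route failures to an OCR branch.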
**Relying on raw text order for table data.** PDF text extraction returns characters in the order they are stored in the file, which often does not match visual reading order. A table extracted as raw text may have all values from column one followed by all values from column two. Use a layout-aware library like Camelot or pdfplumber's table extraction methods instead of joining raw text lines.
**Skipping error handling on missing fields.** Regex patterns that work on 95 percent of your files will return None on the other 5 percent. If your code does not handle missing matches explicitly, you will either crash on None.group() calls or silently write empty rows to your output. Build in default values and a logging step for every field that fails to match.
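A small helper in that spirit; the pattern and default value below are illustrative:

```python
import re

def safe_search(pattern, text, default="MISSING", failures=None):
    """Never call .group() on None: return a default and record the miss."""
    m = re.search(pattern, text)
    if m:
        return m.group(1)
    if failures is not None:
        failures.append(pattern)  # log which pattern failed, for review
    return default

failures = []
safe_search(r"Invoice\s*#?\s*(\w+)", "Invoice # A123", failures=failures)  # → 'A123'
safe_search(r"Total:\s*([\d.]+)", "no total here", failures=failures)      # → 'MISSING'
```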
**Not testing on a representative sample before scaling.** Running your extraction script on 10 handpicked clean files and then deploying it against 10,000 real-world files is a recipe for a data quality disaster. Sample randomly from your full dataset, review edge cases manually, and fix your parsing logic before you process everything. Garbage in, garbage out at scale is expensive to fix retroactively.
**Ignoring encoding issues in older PDFs.** Older PDFs sometimes use non-standard font encodings that cause extracted text to appear as garbled characters or question marks. PyMuPDF handles this better than most libraries, but some files require preprocessing or a different extraction strategy. If your text looks corrupted, check the PDF's font encoding before assuming the library is broken.
## Frequently asked questions
### What is the best Python library to extract text from a PDF?
For most text-based PDFs, pdfplumber and PyMuPDF (fitz) are the top choices. pdfplumber is excellent for layout-aware extraction and has built-in table support. PyMuPDF is faster and handles a wider range of PDF encodings. If you are working with scanned PDFs, you need pytesseract plus pdf2image for OCR before any text extraction.
### How do I extract tables from a PDF using Python?
Camelot is the most reliable library specifically for table extraction. Use the lattice flavor for tables with visible borders and the stream flavor for whitespace-delimited tables. pdfplumber also has a table extraction method that works well for simpler cases. Both return pandas DataFrames, which you can then export to CSV or load into a database directly.
### How do I extract data from a scanned PDF in Python?
Convert each PDF page to an image using the pdf2image library, then run OCR with pytesseract. For better accuracy, preprocess images with OpenCV to correct skew, improve contrast, and remove noise before passing them to the OCR engine. Cloud alternatives like Google Vision API or AWS Textract offer higher accuracy for production workloads at the cost of API fees.
### Can I extract specific fields like invoice number or date from a PDF with Python?
Yes. Extract the full text of the PDF first using pdfplumber or PyMuPDF, then use Python's re module to match patterns for each field. For example, use a regex like r'Invoice\s*#?\s*(\w+)' to capture an invoice number. For fields that appear near a consistent label, search for the label text and grab the value immediately following it.
### How do I extract data from a PDF form with fillable fields?
PDF forms with AcroForm fields store data separately from the visual layout and can be extracted directly without text parsing. Use PyMuPDF's doc.load_page(0).widgets() to iterate over form fields and read their values, or use pypdf's reader.get_fields() method. This is more reliable than text extraction for structured forms because you get the field name paired with its value.
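A sketch of the `pypdf` route; in pypdf, field objects are dict-like and the filled-in value lives under the `/V` key:

```python
def read_form_fields(pdf_path):
    """Read AcroForm field values directly — no text parsing needed.
    Returns {field_name: value} (value is None for unfilled fields)."""
    from pypdf import PdfReader  # pip install pypdf

    fields = PdfReader(pdf_path).get_fields() or {}  # None when no form
    return {name: f.get("/V") for name, f in fields.items()}
```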
### How do I handle PDFs where the text extraction is empty or garbled?
First confirm whether the PDF has a text layer by checking if pdfplumber returns any text at all. If it returns nothing, the PDF is scanned and you need OCR. If it returns garbled characters, the issue is likely font encoding and PyMuPDF may handle it better than other libraries. As a last resort, use a PDF-to-image conversion followed by OCR regardless of whether a text layer exists.