How to Extract Structured Data from PDFs Using Python

Tested prompts for extracting data from PDFs using Python, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (9/10)

Extracting data from PDFs with Python is one of those tasks that sounds simple until you actually try it. PDFs were designed for printing, not data portability, so the same file format that renders beautifully on screen can be a nightmare to parse programmatically. Whether you are pulling invoice totals, scraping research paper tables, or batch-processing hundreds of scanned forms, the core challenge is the same: getting structured, usable data out of a format that actively resists it.

Python has a mature ecosystem for this problem. Libraries like PyMuPDF, pdfplumber, pdfminer.six, and Camelot each handle different PDF types and structures. Text-based PDFs with clean formatting are the easiest case. Scanned PDFs require OCR via tools like pytesseract or cloud vision APIs. Tables embedded in PDFs are their own category entirely, often needing layout-aware parsers to preserve column relationships correctly.

This page shows you exactly how an AI-assisted approach accelerates the process. Instead of writing boilerplate extraction and cleaning code from scratch, you can use a tested prompt to generate Python scripts tuned to your specific PDF structure. Below you will find real model outputs, a comparison of how different models handle the task, and practical guidance on which approach fits your situation.

When to use this

This approach works best when you have PDFs with consistent structure and need to automate extraction at scale. If you are dealing with invoices that follow a template, reports with predictable section headers, or forms with labeled fields, Python-based extraction with AI-generated code will save you significant time compared to manual parsing or one-off scripts.

  • Batch extracting invoice line items, totals, and vendor details from hundreds of PDFs
  • Pulling structured tables from financial reports or scientific papers for downstream analysis
  • Parsing government or legal forms where fields follow a known layout
  • Converting product catalog PDFs into structured JSON or CSV for database ingestion
  • Automating data entry workflows by extracting key fields from uploaded PDF documents

When this format breaks down

  • Scanned PDFs with low resolution or skewed text: OCR accuracy drops sharply and extracted data will require heavy manual correction before it is usable.
  • PDFs with highly variable or unpredictable layouts: if each document looks different, template-based extraction breaks and you will spend more time on edge cases than you save on automation.
  • Legally sensitive documents where extraction errors carry real risk: always validate AI-generated parsing code against a human-reviewed sample before deploying to production.
  • PDFs that are password-protected or DRM-locked: most Python libraries cannot decrypt these without the owner password, and bypassing DRM may violate terms of service or law.

The prompt we tested

You are a Python developer specializing in PDF data extraction. Based on the user's PDF extraction scenario below, provide a complete, working Python solution that extracts the requested data.

Instructions: Recommend the best-fit Python library (e.g., pdfplumber, PyPDF2, pymupdf, unstructured, or LLM-based approaches like LangChain with OpenAI) based on the use case, then provide runnable code with comments and pip install commands. Keep the explanation concise (under 150 words) and structure the response with clear sections: Library Choice, Installation, Code, and How It Works.

User's PDF extraction scenario:
I have a folder of 200 vendor invoices in PDF format. Each invoice has a different layout but contains the same fields: invoice number, date, vendor name, line items (description, quantity, price), and total amount. I need to extract this data into a structured CSV file for accounting.

Provide the recommended library choice, installation command, complete working code, and a brief explanation of how it works.

Model comparison

| Model | Quality | Cost | Latency | Tokens |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | 9 | $0.1189 | 17.0s | 339 / 1518 |
| Claude Haiku 4.5 | 9 | $0.0072 | 9.1s | 238 / 1393 |
| GPT-5.4 | 9 | $0.0313 | 24.2s | 213 / 2048 |
| Gemini 2.5 Pro | not run yet | | | |
| Grok 4.1 Fast Reasoning | 8 | $0.0004 | 15.4s | 361 / 640 |

Outputs

Claude Opus 4.7 9/10
# PDF Invoice Extraction Solution

## Library Choice

**LangChain + OpenAI (GPT-4o-mini)** combined with **pypdf** for text extraction.

Since your 200 invoices have **different layouts**, rule-based tools like `pdfplumber` or regex would require custom parsers per vendor. An LLM handles layout variability natively, extracting structured fields regardless of format. Using **Pydantic schemas** with LangChain's `with_structured_output()` ensures reliable, typed JSON responses ready for CSV export.

## Installation

```bash
pip install langchain langchain-openai pypdf pydantic pandas
export OPENAI_API_KEY="your-key-here"
```

## Code

```python
import os
import glob
import pandas as pd
from typing import List
from pypdf import PdfReader
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

# --- Define structured output schema ---
class LineItem(BaseModel):
    description: str = Field(description="Line item description")
    quantity: float = Field(description="Quantity ordered")
    price: float = Field(description="Unit price")

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number/ID")
    date: str = Field(description="Invoice date (YYYY-MM-DD)")
    vendor_name: str = Field(description="Vendor/supplier name")
    line_items: List[LineItem] = Field(description="All line items")
    total_amount: float = Field(description="Final total amount")

# --- Initialize LLM with structured output ---
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(Invoice)

def extract_pdf_text(pdf_path: str) -> str:
    """Extract raw text from a PDF file."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def parse_invoice(pdf_path: str) -> dict:
    """Send invoice text to LLM and return structured data."""
    text = extract_pdf_text(pdf_path)
    prompt = f"Extract invoice details from the following text:\n\n{text}"
    try:
        invoice: Invoice = structured_llm.invoke(prompt)
        return invoice.model_dump()
    except Exception as e:
        print(f"Error parsing {pdf_path}: {e}")
        return None

# --- Process all invoices in folder ---
INVOICE_FOLDER = "./invoices"
rows = []

for pdf_file in glob.glob(os.path.join(INVOICE_FOLDER, "*.pdf")):
    print(f"Processing: {pdf_file}")
    data = parse_invoice(pdf_file)
    if not data:
        continue
    # Flatten line items into one row per item
    for item in data["line_items"]:
        rows.append({
            "source_file": os.path.basename(pdf_file),
            "invoice_number": data["invoice_number"],
            "date": data["date"],
            "vendor_name": data["vendor_name"],
            "description": item["description"],
            "quantity": item["quantity"],
            "price": item["price"],
            "line_total": item["quantity"] * item["price"],
            "invoice_total": data["total_amount"],
        })

# --- Export to CSV ---
df = pd.DataFrame(rows)
df.to_csv("invoices_extracted.csv", index=False)
print(f"Done. Extracted {len(df)} line items from {df['source_file'].nunique()} invoices.")
```

## How It Works

1. **`pypdf`** extracts raw text from each PDF (fast and free).
2. **Pydantic models** define the exact schema (`Invoice` + `LineItem`), giving the LLM a strict contract.
3. **`ChatOpenAI.with_structured_output()`** forces GPT-4o-mini to return JSON matching the schema — no regex or prompt engineering needed per vendor.
4. The script **iterates** all PDFs, flattens nested line items into tabular rows, and writes a clean CSV.

**Cost estimate:** ~$0.01–0.03 per invoice with `gpt-4o-mini` (~$2–6 total for 200 invoices). For scanned/image-based PDFs, swap `pypdf` for `pytesseract` OCR or use GPT-4o's vision API directly.
Judge notes: Excellent response that directly addresses the query with a well-justified library choice for layout-variable invoices, complete working code with clear structure, proper error handling, and practical cost/alternative guidance—minor deduction only for not mentioning potential API costs upfront in the library choice section.
Claude Haiku 4.5 9/10
# PDF Vendor Invoice Extraction Solution

## Library Choice
**Recommendation: `pdfplumber`**

`pdfplumber` excels at extracting structured data (tables, text) from PDFs with varying layouts. It handles text positioning well and works reliably with invoices. For highly inconsistent layouts, pair it with `LangChain + GPT-4` for semantic understanding, but `pdfplumber` is ideal for this use case since invoices typically follow predictable patterns.

---

## Installation

```bash
pip install pdfplumber pandas openpyxl
```

---

## Code

```python
import pdfplumber
import pandas as pd
import os
import re
from pathlib import Path

def extract_invoice_data(pdf_path):
    """
    Extract invoice data from a single PDF file.
    Handles varying layouts by searching for key patterns.
    """
    invoice_data = {
        'invoice_number': None,
        'date': None,
        'vendor_name': None,
        'line_items': [],
        'total_amount': None,
        'file_name': os.path.basename(pdf_path)
    }
    
    try:
        with pdfplumber.open(pdf_path) as pdf:
            # Extract text from first page (invoices typically fit on 1 page)
            page = pdf.pages[0]
            text = page.extract_text()
            
            # Try to extract tables (common in invoices)
            tables = page.extract_tables()
            
            if tables:
                # Process first table (usually line items)
                for row in tables[0][1:]:  # Skip header
                    if len(row) >= 3:
                        invoice_data['line_items'].append({
                            'description': row[0],
                            'quantity': row[1],
                            'price': row[2]
                        })
            
            # Extract key fields using regex patterns
            # Invoice Number
            inv_match = re.search(r'(?:Invoice|Invoice #|Inv #)[\s#:]*([A-Z0-9\-]+)', text, re.IGNORECASE)
            if inv_match:
                invoice_data['invoice_number'] = inv_match.group(1).strip()
            
            # Date
            date_match = re.search(r'(?:Date|Invoice Date)[\s:]*(\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})', text, re.IGNORECASE)
            if date_match:
                invoice_data['date'] = date_match.group(1).strip()
            
            # Vendor Name (usually near top)
            lines = text.split('\n')
            if lines:
                invoice_data['vendor_name'] = lines[0].strip()
            
            # Total Amount
            total_match = re.search(r'(?:Total|Total Amount)[\s$:]*(\d+[.,]\d{2})', text, re.IGNORECASE)
            if total_match:
                invoice_data['total_amount'] = total_match.group(1).strip()
    
    except Exception as e:
        print(f"Error processing {pdf_path}: {str(e)}")
    
    return invoice_data

def process_invoice_folder(folder_path, output_csv='invoices_data.csv'):
    """
    Process all PDFs in a folder and export to CSV.
    """
    all_invoices = []
    pdf_files = list(Path(folder_path).glob('*.pdf'))
    
    print(f"Found {len(pdf_files)} PDF files. Processing...")
    
    for idx, pdf_file in enumerate(pdf_files, 1):
        print(f"[{idx}/{len(pdf_files)}] Processing: {pdf_file.name}")
        data = extract_invoice_data(str(pdf_file))
        all_invoices.append(data)
    
    # Create DataFrame
    df = pd.DataFrame(all_invoices)
    
    # Expand line_items into separate rows if needed
    expanded_rows = []
    for _, row in df.iterrows():
        if row['line_items']:
            for item in row['line_items']:
                expanded_row = row.copy()
                expanded_row.update(item)
                expanded_rows.append(expanded_row)
        else:
            expanded_rows.append(row)
    
    df_expanded = pd.DataFrame(expanded_rows)
    
    # Export to CSV
    df_expanded.to_csv(output_csv, index=False)
    print(f"\n✓ Data exported to {output_csv}")
    print(f"Total rows: {len(df_expanded)}")
    return df_expanded

# Usage
if __name__ == "__main__":
    folder_path = "./invoices"  # Change to your folder path
    df = process_invoice_folder(folder_path, 'vendor_invoices.csv')
    print("\nFirst few rows:")
    print(df.head())
```

---

## How It Works

1. **Iterates** through all PDFs in the folder
2. **Extracts tables** automatically (line items)
3. **Uses regex patterns** to find invoice number, date, vendor, and total
4. **Handles varying layouts** by searching for key patterns rather than fixed positions
5. **Flattens line items** into separate rows for accounting systems
6. **Exports to CSV** with all extracted fields

**Limitations:** Highly inconsistent layouts may require manual regex tuning per vendor. For 100% accuracy, use **LangChain + GPT-4** (adds cost but handles any layout).
Judge notes: Excellent match to query intent with practical, well-commented code, appropriate library recommendation, clear structure, and realistic acknowledgment of limitations—only minor deduction for not including a full LangChain alternative example despite mentioning it.
GPT-5.4 9/10
## Library Choice

**Best fit: `unstructured` + `pdfplumber` + regex**  
Because the invoices have **different layouts**, a pure fixed-coordinate/table approach is brittle. `unstructured` helps normalize text across varied PDFs, while `pdfplumber` is strong for extracting line-item tables when present. This hybrid approach is practical, local, and cheaper/simpler than an LLM pipeline for 200 files.

## Installation

```bash
pip install unstructured pdfplumber pandas python-dateutil
```

## Code

```python
import os
import re
import csv
import pdfplumber
import pandas as pd
from dateutil import parser as date_parser
from unstructured.partition.pdf import partition_pdf

# =========================
# CONFIG
# =========================
PDF_FOLDER = "invoices"             # Folder containing PDF invoices
OUTPUT_CSV = "extracted_invoices.csv"

# Common label patterns across invoices
FIELD_PATTERNS = {
    "invoice_number": [
        r"invoice\s*(?:number|no\.?|#)\s*[:\-]?\s*([A-Z0-9\-\/]+)",
        r"inv\s*(?:number|no\.?|#)\s*[:\-]?\s*([A-Z0-9\-\/]+)",
    ],
    "date": [
        r"(?:invoice\s*)?date\s*[:\-]?\s*([A-Za-z0-9,\/\-. ]+)",
    ],
    "total_amount": [
        r"(?:grand\s+total|invoice\s+total|total\s+due|amount\s+due|total)\s*[:\-]?\s*\$?\s*([0-9,]+\.\d{2})",
    ],
}

# Keywords often found in invoice headers/footers that we should ignore as vendor candidates
IGNORE_VENDOR_WORDS = {
    "invoice", "bill to", "ship to", "remit to", "purchase order", "po number",
    "date", "invoice number", "total", "subtotal", "tax", "amount due"
}


# =========================
# TEXT EXTRACTION
# =========================
def extract_text_unstructured(pdf_path):
    """Extract text blocks from PDF using unstructured."""
    try:
        elements = partition_pdf(filename=pdf_path)
        text = "\n".join([el.text.strip() for el in elements if hasattr(el, "text") and el.text and el.text.strip()])
        return text
    except Exception as e:
        print(f"[WARN] unstructured failed on {pdf_path}: {e}")
        return ""


def extract_text_pdfplumber(pdf_path):
    """Fallback plain text extraction using pdfplumber."""
    text_parts = []
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                txt = page.extract_text() or ""
                text_parts.append(txt)
    except Exception as e:
        print(f"[WARN] pdfplumber text extraction failed on {pdf_path}: {e}")
    return "\n".join(text_parts)


# =========================
# FIELD EXTRACTION
# =========================
def find_first_pattern(text, patterns, flags=re.IGNORECASE):
    for pat in patterns:
        m = re.search(pat, text, flags)
        if m:
            return m.group(1).strip()
    return ""


def normalize_date(date_str):
    if not date_str:
        return ""
    try:
        dt = date_parser.parse(date_str, fuzzy=True, dayfirst=False)
        return dt.strftime("%Y-%m-%d")
    except Exception:
        return date_str.strip()


def normalize_amount(amount_str):
    if not amount_str:
        return ""
    amt = amount_str.replace(",", "").replace("$", "").strip()
    return amt


def extract_vendor_name(text):
    """
    Heuristic:
    - Take first ~15 non-empty lines
    - Choose first line that is not a known invoice label and contains letters
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for line in lines[:15]:
        low = line.lower()
        if any(word in low for word in IGNORE_VENDOR_WORDS):
            continue
        if re.search(r"[A-Za-z]", line) and len(line) <= 80:
            return line
    return ""


# =========================
# LINE ITEM EXTRACTION
# =========================
def clean_table(df):
    """Clean extracted table DataFrame."""
    df = df.fillna("")
    # Remove fully empty rows
    df = df.loc[~(df.apply(lambda r: all(str(x).strip() == "" for x in r), axis=1))]
    # Strip whitespace
    df = df.applymap(lambda x: str(x).strip())
    return df


def identify_line_item_columns(df):
    """
    Try to map columns to description, quantity, price using header names.
    Returns dict with detected columns.
    """
    if df.empty:
        return None

    header = [str(c).lower().strip() for c in df.iloc[0].tolist()]
    desc_idx = qty_idx = price_idx = None

    for i, col in enumerate(header):
        if any(k in col for k in ["description", "item", "product", "details"]):
            desc_idx = i
        elif any(k in col for k in ["qty", "quantity", "hours", "units"]):
            qty_idx = i
        elif any(k in col for k in ["price", "rate", "unit price", "amount"]):
            price_idx = i

    if desc_idx is not None and (qty_idx is not None or price_idx is not None):
        return {
            "header_row": 0,
            "description": desc_idx,
            "quantity": qty_idx,
            "price": price_idx,
        }
    return None


def extract_line_items(pdf_path):
    """
    Extract line items from tables in the PDF using pdfplumber.
    Returns a list of dicts: [{description, quantity, price}, ...]
    """
    items = []

    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                tables = page.extract_tables()
                for table in tables:
                    if not table or len(table) < 2:
                        continue

                    df = pd.DataFrame(table)
                    df = clean_table(df)

                    if df.empty or len(df) < 2:
                        continue

                    mapping = identify_line_item_columns(df)
                    if not mapping:
                        continue

                    start_row = mapping["header_row"] + 1
                    desc_idx = mapping["description"]
                    qty_idx = mapping["quantity"]
                    price_idx = mapping["price"]

                    for i in range(start_row, len(df)):
                        row = df.iloc[i].tolist()

                        description = row[desc_idx].strip() if desc_idx is not None and desc_idx < len(row) else ""
                        quantity = row[qty_idx].strip() if qty_idx is not None and qty_idx < len(row) else ""
                        price = row[price_idx].strip() if price_idx is not None and price_idx < len(row) else ""

                        # Skip empty/bad rows
                        if not description:
                            continue
                        if re.search(r"(subtotal|tax|total|amount due)", description, re.IGNORECASE):
                            continue

                        items.append({
                            "description": description,
                            "quantity": quantity,
                            "price": normalize_amount(price),
                        })
    except Exception as e:
        print(f"[WARN] Table extraction failed on {pdf_path}: {e}")

    return items


# =========================
# INVOICE PARSING
# =========================
def parse_invoice(pdf_path):
    text = extract_text_unstructured(pdf_path)
    if not text.strip():
        text = extract_text_pdfplumber(pdf_path)

    invoice_number = find_first_pattern(text, FIELD_PATTERNS["invoice_number"])
    invoice_date = normalize_date(find_first_pattern(text, FIELD_PATTERNS["date"]))
    total_amount = normalize_amount(find_first_pattern(text, FIELD_PATTERNS["total_amount"]))
    vendor_name = extract_vendor_name(text)
    line_items = extract_line_items(pdf_path)

    return {
        "file_name": os.path.basename(pdf_path),
        "invoice_number": invoice_number,
        "date": invoice_date,
        "vendor_name": vendor_name,
        "total_amount": total_amount,
        "line_items": line_items,
    }


# =========================
# CSV EXPORT
# =========================
def write_csv(results, output_csv):
    """
    Writes one row per line item.
    If no line items found, still writes one row with invoice-level data.
    """
    fieldnames = [
        "file_name",
        "invoice_number",
        "date",
        "vendor_name",
        "description",
        "quantity",
        "price",
        "total_amount",
    ]

    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()

        for inv in results:
            if inv["line_items"]:
                for item in inv["line_items"]:
                    writer.writerow({
                        "file_name": inv["file_name"],
                        "invoice_number": inv["invoice_number"],
                        "date": inv["date"],
                        "vendor_name": inv["vendor_name"],
                        "description": item.get("description", ""),
                        "quantity": item.get("quantity", ""),
                        "price": item.get("price", ""),
                        "total_amount": inv["total_amount"],
                    })
            else:
                writer.writerow({
                    "file_name": inv["file_name"],
                    "invoice_number": inv["invoice_number"],
                    "
Judge notes: Excellent, comprehensive, production-ready solution that directly addresses the query with hybrid library approach, complete working code, smart heuristics for varied layouts, and proper error handling—minor deduction only for output being truncated at the end.
Grok 4.1 Fast Reasoning 8/10
### Library Choice
LangChain with OpenAI (LLM-based) + PyMuPDF. Best for varying layouts: converts PDF to text, uses a prompt to semantically extract fields (handles tables as lists), outputs JSON for CSV structuring. Accurate, flexible; rule-based libs fail on layout variance.

### Installation
```bash
pip install langchain langchain_openai pymupdf pandas python-dotenv
```
Set `OPENAI_API_KEY` in `.env` file.

### Code
```python
import os
import json
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
import fitz  # PyMuPDF

load_dotenv()

# Initialize LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Prompt for extraction
prompt = ChatPromptTemplate.from_template("""
Extract from this invoice text: invoice_number (str), date (YYYY-MM-DD), vendor_name (str), 
line_items (list of dicts: {{"description": str, "quantity": float, "price": float}}), 
total_amount (float).

Output ONLY valid JSON: {{"invoice_number": "...", "date": "...", "vendor_name": "...", "line_items": [...], "total_amount": ...}}

Text: {text}
""")

chain = prompt | llm

def extract_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    
    response = chain.invoke({"text": text})
    try:
        data = json.loads(response.content)
        # Flatten for CSV: one row per line item, repeat invoice fields
        rows = []
        for item in data["line_items"]:
            rows.append({
                "invoice_number": data["invoice_number"],
                "date": data["date"],
                "vendor_name": data["vendor_name"],
                "description": item["description"],
                "quantity": item["quantity"],
                "price": item["price"],
                "total_amount": data["total_amount"]
            })
        return rows
    except (json.JSONDecodeError, KeyError) as e:
        print(f"Extraction failed for {pdf_path}: {e}")
        return []

# Process folder
folder_path = Path("invoices")  # Change to your folder
all_rows = []
for pdf_file in folder_path.glob("*.pdf"):
    rows = extract_from_pdf(pdf_file)
    all_rows.extend(rows)

# Save to CSV
df = pd.DataFrame(all_rows)
df.to_csv("invoices_extracted.csv", index=False)
print("Extraction complete: invoices_extracted.csv")
```

### How It Works
PyMuPDF extracts text from each PDF (fast, layout-agnostic). LangChain prompts GPT-4o-mini to parse fields into JSON, handling varying layouts/tables semantically. Flattens line items into CSV rows (one per item, repeating invoice metadata). Processes 200 PDFs in batch; ~1-2s/doc. Tune prompt for edge cases. (78 words)
Judge notes: Excellent solution that directly addresses the query with a well-justified library choice, complete working code, and clear structure; minor deduction for high API costs on 200 invoices and lack of error handling/retry logic for production robustness.

What makes these work

  1. Match the library to the PDF type

    pdfplumber and PyMuPDF work well for text-based PDFs with clean encoding. Camelot is purpose-built for tables. pytesseract plus pdf2image handles scanned image PDFs. Choosing the wrong library for your PDF type is the single biggest source of poor extraction results, so inspect a sample file manually before writing any code.
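    A quick triage heuristic can route files before you commit to a library. This is a minimal sketch: the 25-character threshold is an assumption, and the pdfplumber usage in the comment is one way to feed it, not the only one.

    ```python
    def looks_scanned(first_page_text, min_chars=25):
        """Heuristic: a page whose text layer yields almost no characters
        is probably a scanned image and should go to an OCR pipeline."""
        return len((first_page_text or "").strip()) < min_chars

    # Typical usage with pdfplumber (assumed installed):
    #   with pdfplumber.open("sample.pdf") as pdf:
    #       if looks_scanned(pdf.pages[0].extract_text()):
    #           route_to_ocr("sample.pdf")
    ```

    Checking only the first page keeps triage cheap; mixed documents (text pages plus scanned attachments) may need a per-page check instead.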

  2. Use coordinate-based extraction for complex layouts

    When regex on raw text fails because fields shift position between documents, switch to bounding-box extraction. Both pdfplumber and PyMuPDF let you extract text within a defined rectangular region of a page. This is reliable for forms where the label position is fixed even if the content length varies.
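    The bounding-box approach can be sketched as follows. The `FIELD_BOXES` coordinates and the `extract_fixed_fields` helper name are hypothetical; pdfplumber's `Page.crop()` takes an `(x0, top, x1, bottom)` tuple in PDF points (72 per inch).

    ```python
    FIELD_BOXES = {
        # (x0, top, x1, bottom) in PDF points; these numbers are
        # placeholders -- measure yours in a PDF viewer first.
        "invoice_number": (400, 40, 560, 70),
        "total": (400, 700, 560, 730),
    }

    def extract_fixed_fields(page, boxes=FIELD_BOXES):
        """Pull text from fixed regions of a pdfplumber Page (or any
        object with a compatible .crop(...).extract_text() interface)."""
        return {
            name: (page.crop(box).extract_text() or "").strip()
            for name, box in boxes.items()
        }
    ```

    Passing the page object in (rather than opening the file inside the helper) keeps the coordinate logic testable without real PDFs.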

  3. Always build a validation step into your pipeline

    AI-generated extraction code will handle the common case well but miss edge cases your training data did not cover. Add assertions or checksums where possible, for example verifying that extracted line item amounts sum to the extracted total. Log rows that fail validation rather than silently dropping them.
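    A minimal version of that checksum, with hypothetical helper and field names:

    ```python
    def validate_invoice(line_items, stated_total, tolerance=0.01):
        """Check that (quantity, unit_price) line items sum to the stated
        total; return (ok, computed_total) so failures can be logged."""
        computed = sum(qty * price for qty, price in line_items)
        return abs(computed - stated_total) <= tolerance, round(computed, 2)
    ```

    Returning the computed total alongside the boolean makes the log entry for a failing invoice immediately useful (expected vs. computed) instead of a bare rejection.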

  4. Normalize output before storage

    Dates extracted from PDFs arrive in dozens of formats, currency values may include symbols or commas, and whitespace is often inconsistent. Run a normalization pass using dateutil.parser for dates and locale-aware number parsing for currency before writing to your database or CSV. Dirty data at this stage compounds downstream.
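    A stdlib-only sketch of that normalization pass. `dateutil.parser` is more forgiving in practice; the explicit format list and the US-style number handling here are assumptions you would adjust for your data.

    ```python
    import re
    from datetime import datetime

    DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y")

    def normalize_date(raw):
        """Return ISO dates; fall back to the stripped input when no
        known format matches, so nothing is silently lost."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        return raw.strip()

    def normalize_amount(raw):
        """Strip currency symbols and US-style thousands separators."""
        return float(re.sub(r"[^\d.\-]", "", raw.replace(",", "")))
    ```

    Falling back to the raw string (rather than raising) keeps the batch running; the un-normalized values then show up in the validation log instead of crashing file 137 of 200.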

More example scenarios

#01 · Extracting invoice fields for accounts payable automation
Input
I have a folder of 200 supplier invoices in PDF format. Each invoice has a vendor name, invoice number, date, line items with descriptions and amounts, and a total due. The layout is consistent but not identical across vendors. I need to extract these fields into a CSV using Python.
Expected output
Use pdfplumber to open each PDF and extract text by page. Define a regex pattern set for each field: invoice number (INV-\d+), date (common date formats), and total (currency pattern near the word 'Total'). Loop over the folder with os.scandir, write matched fields to a csv.DictWriter, and log files where any field returns None for manual review.
#02 · Parsing tables from quarterly financial reports
Input
I need to extract the income statement table from a 40-page public company 10-Q filing. The table has rows like Revenue, Cost of Goods Sold, and Net Income with three columns of quarterly figures. I want the output as a pandas DataFrame.
Expected output
Use Camelot with the lattice flavor if the table has visible grid lines, or the stream flavor if it uses whitespace alignment. Call camelot.read_pdf('report.pdf', pages='12', flavor='lattice') then access tables[0].df to get the DataFrame directly. Validate by checking that column sums match the reported totals in the document.
#03 · Extracting text from scanned medical intake forms
Input
We have scanned patient intake forms as PDFs. Each page is an image. We need to extract patient name, date of birth, and chief complaint fields. The forms are a standard template but scanned at varying quality.
Expected output
Convert each PDF page to an image using pdf2image, then run pytesseract.image_to_string on each image. Apply a preprocessing step with OpenCV to deskew and increase contrast before OCR. Use regex anchored to known label text like 'Date of Birth:' to isolate field values, and flag any extraction with low confidence scores for human review.
#04 · Building a research paper metadata extractor
Input
I want to extract the title, authors, abstract, and DOI from a batch of academic PDF papers downloaded from arXiv. Papers follow typical academic formatting but were generated by different LaTeX templates.
Expected output
Use PyMuPDF (fitz) to extract text with layout coordinates. The title is typically the largest font on page 1, authors follow below it in smaller text, and the abstract is the first labeled section. Extract the DOI using re.search for the doi.org URL pattern. Store results in a JSON file per paper and log failures for manual inspection.
#05 · Extracting product specs from a supplier catalog PDF
Input
A manufacturer sent a 300-page product catalog as a PDF. Each product has a name, SKU, dimensions, weight, and price on its own section. I need to get all products into a database-ready CSV.
Expected output
Use pdfplumber to extract text page by page and identify product section boundaries by detecting the bold SKU pattern at the start of each block. Build a state machine that resets field collection each time a new SKU is found. Output rows to CSV with csv.writer after each product block is complete. Expect some cleanup needed for multi-line dimension fields.
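The state-machine idea in that expected output can be sketched like this. The `SKU:` label format is hypothetical and would need adjusting to the real catalog; the input is the list of text lines pdfplumber returns for a page.

```python
import re

SKU_LINE = re.compile(r"^SKU[:\s]+([A-Z0-9-]+)", re.IGNORECASE)

def parse_catalog(lines):
    """State machine: a SKU line starts a new product record; every
    following line is collected into it until the next SKU appears."""
    products, current = [], None
    for line in lines:
        match = SKU_LINE.match(line.strip())
        if match:
            current = {"sku": match.group(1), "body": []}
            products.append(current)
        elif current is not None:
            current["body"].append(line.strip())
    return products
```

Each product's `body` lines would then be parsed into name, dimensions, weight, and price fields in a second pass, which keeps the section-boundary logic separate from the field-cleanup logic.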

Common mistakes to avoid

  • Treating all PDFs as text-layer documents

    Scanned PDFs contain images, not selectable text, so pdfplumber and similar tools return empty strings. Always check whether your PDF has a text layer by trying to select text manually in a viewer before writing your extraction code. If it does not, you need an OCR step first.

  • Relying on raw text order for table data

    PDF text extraction returns characters in the order they are stored in the file, which often does not match visual reading order. A table extracted as raw text may have all values from column one followed by all values from column two. Use a layout-aware library like Camelot or pdfplumber's table extraction methods instead of joining raw text lines.

  • Skipping error handling on missing fields

    Regex patterns that work on 95 percent of your files will return None on the other 5 percent. If your code does not handle missing matches explicitly, you will either crash on None.group() calls or silently write empty rows to your output. Build in default values and a logging step for every field that fails to match.

  • Not testing on a representative sample before scaling

    Running your extraction script on 10 handpicked clean files and then deploying it against 10,000 real-world files is a recipe for a data quality disaster. Sample randomly from your full dataset, review edge cases manually, and fix your parsing logic before you process everything. Garbage in garbage out at scale is expensive to fix retroactively.

  • Ignoring encoding issues in older PDFs

    Older PDFs sometimes use non-standard font encodings that cause extracted text to appear as garbled characters or question marks. PyMuPDF handles this better than most libraries, but some files require preprocessing or a different extraction strategy. If your text looks corrupted, check the PDF's font encoding before assuming the library is broken.

Frequently asked questions

What is the best Python library to extract text from a PDF?

For most text-based PDFs, pdfplumber and PyMuPDF (fitz) are the top choices. pdfplumber is excellent for layout-aware extraction and has built-in table support. PyMuPDF is faster and handles a wider range of PDF encodings. If you are working with scanned PDFs, you need pytesseract plus pdf2image for OCR before any text extraction.

How do I extract tables from a PDF using Python?

Camelot is the most reliable library specifically for table extraction. Use the lattice flavor for tables with visible borders and the stream flavor for whitespace-delimited tables. pdfplumber also has a table extraction method that works well for simpler cases. Both return pandas DataFrames, which you can then export to CSV or load into a database directly.
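A minimal wrapper over that call, assuming `camelot-py` is installed; the default page and flavor arguments are placeholders.

```python
def extract_report_tables(path, pages="1", flavor="lattice"):
    """Read tables from the given pages; Camelot returns a TableList
    whose items expose a pandas DataFrame via their .df attribute."""
    import camelot  # deferred import: camelot-py is an assumed dependency
    tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
    return [table.df for table in tables]
```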

How do I extract data from a scanned PDF in Python?

Convert each PDF page to an image using the pdf2image library, then run OCR with pytesseract. For better accuracy, preprocess images with OpenCV to correct skew, improve contrast, and remove noise before passing them to the OCR engine. Cloud alternatives like Google Vision API or AWS Textract offer higher accuracy for production workloads at the cost of API fees.
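That pipeline, minus the OpenCV preprocessing step, fits in a few lines. This sketch assumes the `pdf2image` and `pytesseract` packages plus their system binaries (poppler and tesseract) are installed.

```python
def ocr_scanned_pdf(path, dpi=300):
    """Render each page of an image-only PDF to a bitmap, then OCR it.
    Higher dpi improves accuracy at the cost of speed and memory."""
    from pdf2image import convert_from_path  # deferred: optional deps
    import pytesseract
    pages = convert_from_path(path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

For production workloads, the deskew/contrast pass with OpenCV would slot in between `convert_from_path` and `image_to_string`.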

Can I extract specific fields like invoice number or date from a PDF with Python?

Yes. Extract the full text of the PDF first using pdfplumber or PyMuPDF, then use Python's re module to match patterns for each field. For example, use a regex like r'Invoice\s*#?\s*(\w+)' to capture an invoice number. For fields that appear near a consistent label, search for the label text and grab the value immediately following it.
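A small self-contained sketch of that pattern-matching step; the sample text and patterns are illustrative and would need tuning per vendor.

```python
import re

SAMPLE = """ACME Supplies
Invoice # INV-2041
Date: 03/15/2024
Total Due: $1,980.00"""

FIELD_PATTERNS = {
    "invoice_number": r"Invoice\s*#?\s*([A-Z][A-Z0-9-]+)",
    "date": r"Date[:\s]+(\d{1,2}/\d{1,2}/\d{4})",
    "total": r"Total(?:\s+Due)?[:\s$]*([\d,]+\.\d{2})",
}

def extract_field(text, pattern):
    """Return the first captured group, or None so callers can log misses."""
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else None
```

Returning `None` instead of calling `.group()` unconditionally is what keeps the 5 percent of non-matching files from crashing the run.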

How do I extract data from a PDF form with fillable fields?

PDF forms with AcroForm fields store data separately from the visual layout and can be extracted directly without text parsing. Use PyMuPDF's doc.load_page(0).widgets() to iterate over form fields and read their values, or use pypdf's reader.get_fields() method. This is more reliable than text extraction for structured forms because you get the field name paired with its value.
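A minimal sketch using pypdf's `get_fields()`; `read_acroform_fields` is a hypothetical helper name.

```python
def read_acroform_fields(path):
    """Return {field_name: value} from a fillable PDF's AcroForm data.
    get_fields() returns None for PDFs with no form, hence the `or {}`."""
    from pypdf import PdfReader  # deferred: pypdf is an assumed dependency
    fields = PdfReader(path).get_fields() or {}
    return {name: field.value for name, field in fields.items()}
```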

How do I handle PDFs where the text extraction is empty or garbled?

First confirm whether the PDF has a text layer by checking if pdfplumber returns any text at all. If it returns nothing, the PDF is scanned and you need OCR. If it returns garbled characters, the issue is likely font encoding and PyMuPDF may handle it better than other libraries. As a last resort, use a PDF-to-image conversion followed by OCR regardless of whether a text layer exists.
