AI-Powered Invoice Data Extraction from PDF Documents

Tested prompts for extracting data from invoice PDFs, compared across 5 leading AI models.

Best by judge score: Claude Opus 4.7 (10/10)

If you have a stack of invoice PDFs and need the data inside them in a spreadsheet, database, or accounting system, you are looking at a manual copy-paste problem that eats hours every week. Invoice data extraction pulls structured fields like vendor name, invoice number, date, line items, totals, and tax amounts directly out of PDF documents without manual re-entry. That means fewer transcription errors, faster processing, and no backlog of unprocessed bills sitting in your inbox.

The challenge is that invoices are not standardized. A PDF from one vendor lays out the total in the bottom-right corner; another buries it in a summary table halfway down the page. Some PDFs have a machine-readable text layer; others are scanned images. AI models handle this variation far better than rigid template-based OCR tools, because they interpret fields from context rather than relying on fixed field coordinates.

This page shows you exactly how to prompt an AI model to extract invoice data reliably, compares how different models handle the same invoice, and explains where the approach works well and where it falls short. Whether you are processing one invoice or thousands, the techniques here will get you structured, usable data fast.

When to use this

This approach fits any situation where you need structured data out of PDF invoices faster than manual entry allows. It works across invoice formats, languages, and layouts without building custom templates for each vendor. It is especially valuable when invoice volume is irregular or when vendors frequently change their PDF layouts.

  • Accounts payable teams processing 20 to 500 invoices per week from multiple vendors with different PDF formats
  • Freelancers or small businesses who receive client invoices and need to log amounts, dates, and vendors into a spreadsheet or accounting tool like QuickBooks or Xero
  • Finance teams needing to audit or reconcile historical invoice data stored as PDFs in an archive folder
  • Developers building an invoice ingestion pipeline who want to prototype field extraction logic before writing custom parsing code
  • Procurement departments tracking line-item detail such as unit price, quantity, and SKU across supplier invoices for spend analysis

When this format breaks down

  • Scanned invoices with very low image resolution or heavy skew will produce unreliable extractions. If a human struggles to read the PDF clearly, the AI will too. Pre-process with a proper OCR cleaning step first.
  • If your invoices contain highly sensitive financial data subject to strict data residency regulations, sending PDFs to a third-party API may violate compliance requirements. Verify your vendor's data handling policies before automating.
  • Invoices with complex multi-page tables that span pages in unpredictable ways, such as large purchase orders with 200 line items, can cause truncation or missed rows. These need chunking logic and validation checks, not a single prompt call.
  • If you need the output to feed directly into a production system with zero tolerance for error, such as automated payment triggering, raw AI extraction is not sufficient on its own. You need a human review step or confidence-score filtering before any downstream action.

The prompt we tested

You are an expert invoice data extraction assistant. Extract all structured data from the following invoice content and return it in a clean, organized format.

Extraction rules:
Extract these fields into valid JSON: invoice_number, invoice_date, due_date, vendor (name, address, tax_id), customer (name, address), line_items (array with description, quantity, unit_price, total), subtotal, tax, total_amount, currency, and payment_terms. Use null for any missing fields, normalize dates to YYYY-MM-DD format, and output only the JSON object with no commentary or markdown fences.

Invoice content:
Invoice #INV-2024-0892 from Acme Supplies Ltd, 45 Market Street, London EC1A 1BB, VAT GB123456789. Billed to Brightway Cafe, 12 Oak Lane, Manchester M1 2AB on 15 March 2024, due 14 April 2024. Items: 20x Ceramic Mugs @ £4.50 = £90.00, 5x Espresso Beans 1kg @ £18.00 = £90.00. Subtotal £180.00, VAT 20% £36.00, Total £216.00. Payment terms: Net 30.

Return the extracted invoice data now.
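
For repeated use, the prompt above can be assembled from a template so that only the invoice text changes between calls. The following is a minimal sketch in Python; the names `EXTRACTION_RULES`, `PROMPT_TEMPLATE`, and `build_prompt` are illustrative and not part of any library.

```python
# Hypothetical helper: wraps arbitrary invoice text in the prompt tested above.
EXTRACTION_RULES = (
    "Extract these fields into valid JSON: invoice_number, invoice_date, due_date, "
    "vendor (name, address, tax_id), customer (name, address), "
    "line_items (array with description, quantity, unit_price, total), "
    "subtotal, tax, total_amount, currency, and payment_terms. "
    "Use null for any missing fields, normalize dates to YYYY-MM-DD format, "
    "and output only the JSON object with no commentary or markdown fences."
)

PROMPT_TEMPLATE = """You are an expert invoice data extraction assistant. Extract all structured data from the following invoice content and return it in a clean, organized format.

Extraction rules:
{rules}

Invoice content:
{invoice_text}

Return the extracted invoice data now."""


def build_prompt(invoice_text: str, rules: str = EXTRACTION_RULES) -> str:
    """Return the full prompt for one invoice."""
    # Braces inside invoice_text are safe: str.format substitutes the value
    # without re-parsing it as a template.
    return PROMPT_TEMPLATE.format(rules=rules, invoice_text=invoice_text.strip())
```

Keeping the rules in one constant means every invoice is extracted against the same field list, which makes the outputs easier to compare and validate.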

Model comparison

Model                    | Quality | Cost    | Latency | Tokens
Claude Opus 4.7          | 10      | $0.0334 | 4.1s    | 421 / 361
Claude Haiku 4.5         | 9       | $0.0019 | 1.5s    | 309 / 314
GPT-5.4                  | 10      | $0.0045 | 3.2s    | 268 / 257
Gemini 2.5 Pro           | 9       | $0.0021 | 9.4s    | 315 / 336
Grok 4.1 Fast Reasoning  | 10      | $0.0002 | 3.9s    | 417 / 169

Outputs

Claude Opus 4.7 10/10
{
  "invoice_number": "INV-2024-0892",
  "invoice_date": "2024-03-15",
  "due_date": "2024-04-14",
  "vendor": {
    "name": "Acme Supplies Ltd",
    "address": "45 Market Street, London EC1A 1BB",
    "tax_id": "GB123456789"
  },
  "customer": {
    "name": "Brightway Cafe",
    "address": "12 Oak Lane, Manchester M1 2AB"
  },
  "line_items": [
    {
      "description": "Ceramic Mugs",
      "quantity": 20,
      "unit_price": 4.50,
      "total": 90.00
    },
    {
      "description": "Espresso Beans 1kg",
      "quantity": 5,
      "unit_price": 18.00,
      "total": 90.00
    }
  ],
  "subtotal": 180.00,
  "tax": 36.00,
  "total_amount": 216.00,
  "currency": "GBP",
  "payment_terms": "Net 30"
}
Judge notes: The model perfectly extracted all specified fields into valid JSON format with correct data types, proper date normalization, accurate currency identification, and zero extraneous commentary—fully satisfying the query intent.
Claude Haiku 4.5 9/10
```json
{
  "invoice_number": "INV-2024-0892",
  "invoice_date": "2024-03-15",
  "due_date": "2024-04-14",
  "vendor": {
    "name": "Acme Supplies Ltd",
    "address": "45 Market Street, London EC1A 1BB",
    "tax_id": "GB123456789"
  },
  "customer": {
    "name": "Brightway Cafe",
    "address": "12 Oak Lane, Manchester M1 2AB"
  },
  "line_items": [
    {
      "description": "Ceramic Mugs",
      "quantity": 20,
      "unit_price": 4.50,
      "total": 90.00
    },
    {
      "description": "Espresso Beans 1kg",
      "quantity": 5,
      "unit_price": 18.00,
      "total": 90.00
    }
  ],
  "subtotal": 180.00,
  "tax": 36.00,
  "total_amount": 216.00,
  "currency": "GBP",
  "payment_terms": "Net 30"
}
```
Judge notes: Excellently extracts all required fields into clean, valid JSON with proper formatting and data normalization, though it includes markdown code fences despite the instruction to output JSON only.
GPT-5.4 10/10
{
  "invoice_number": "INV-2024-0892",
  "invoice_date": "2024-03-15",
  "due_date": "2024-04-14",
  "vendor": {
    "name": "Acme Supplies Ltd",
    "address": "45 Market Street, London EC1A 1BB",
    "tax_id": "GB123456789"
  },
  "customer": {
    "name": "Brightway Cafe",
    "address": "12 Oak Lane, Manchester M1 2AB"
  },
  "line_items": [
    {
      "description": "Ceramic Mugs",
      "quantity": 20,
      "unit_price": 4.5,
      "total": 90.0
    },
    {
      "description": "Espresso Beans 1kg",
      "quantity": 5,
      "unit_price": 18.0,
      "total": 90.0
    }
  ],
  "subtotal": 180.0,
  "tax": 36.0,
  "total_amount": 216.0,
  "currency": "GBP",
  "payment_terms": "Net 30"
}
Judge notes: The model perfectly extracted all required fields in valid JSON format, correctly normalized dates, properly structured nested objects, and included all specified data points with accurate values.
Gemini 2.5 Pro 9/10
```json
{
  "invoice_number": "INV-2024-0892",
  "invoice_date": "2024-03-15",
  "due_date": "2024-04-14",
  "vendor": {
    "name": "Acme Supplies Ltd",
    "address": "45 Market Street, London EC1A 1BB",
    "tax_id": "GB123456789"
  },
  "customer": {
    "name": "Brightway Cafe",
    "address": "12 Oak Lane, Manchester M1 2AB"
  },
  "line_items": [
    {
      "description": "Ceramic Mugs",
      "quantity": 20,
      "unit_price": 4.50,
      "total": 90.00
    },
    {
      "description": "Espresso Beans 1kg",
      "quantity": 5,
      "unit_price": 18.00,
      "total": 90.00
    }
  ],
  "subtotal": 180.00,
  "tax": 36.00,
  "total_amount": 216.00,
  "currency": "GBP",
  "payment_terms": "Net 30"
}
```
Judge notes: Flawlessly extracts all required fields in valid JSON format with correct data normalization, though the instructions asked for no markdown fences yet the output includes them.
Grok 4.1 Fast Reasoning 10/10
{"invoice_number":"INV-2024-0892","invoice_date":"2024-03-15","due_date":"2024-04-14","vendor":{"name":"Acme Supplies Ltd","address":"45 Market Street, London EC1A 1BB","tax_id":"GB123456789"},"customer":{"name":"Brightway Cafe","address":"12 Oak Lane, Manchester M1 2AB"},"line_items":[{"description":"Ceramic Mugs","quantity":20,"unit_price":4.50,"total":90.00},{"description":"Espresso Beans 1kg","quantity":5,"unit_price":18.00,"total":90.00}],"subtotal":180.00,"tax":36.00,"total_amount":216.00,"currency":"GBP","payment_terms":"Net 30"}
Judge notes: The output flawlessly extracts all required fields in valid JSON format with correct data types, proper date normalization, accurate monetary values, and complete field population matching the invoice content exactly.
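
Two of the five outputs above arrived wrapped in markdown fences despite the instruction, so a script consuming these replies probably should not assume bare JSON. A minimal sketch of a tolerant parser, using only the Python standard library; `parse_model_json` is a hypothetical helper name.

```python
import json
import re

# Matches an optional ```json ... ``` wrapper around the whole reply.
_FENCE = re.compile(r"^\s*```(?:json)?\s*\n(.*?)\n\s*```\s*$", re.DOTALL | re.IGNORECASE)


def parse_model_json(reply: str) -> dict:
    """Parse a model reply as JSON, tolerating a surrounding markdown fence."""
    match = _FENCE.match(reply)
    payload = match.group(1) if match else reply
    data = json.loads(payload)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data
```

Letting `json.loads` raise on malformed output is deliberate here: a failed parse is a signal to retry or route the invoice to manual review rather than to guess.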

What makes these work

  1. Specify every field you need

    Listing exact field names in your prompt eliminates guesswork and stops the model from returning a generic summary instead of structured data. If you need line items separately from totals, say so explicitly. A vague instruction like 'extract invoice details' returns inconsistent results across different invoice layouts.

  2. Lock in your output format

    Tell the model exactly whether you want JSON, CSV, or pipe-delimited text before it starts. This is especially important if the output feeds into a script or spreadsheet. JSON works best for nested data like line items with multiple sub-fields; CSV or delimited text is faster to paste directly into a spreadsheet.

  3. Instruct handling of missing fields

    Some invoices omit fields like PO number, tax ID, or due date. If you do not tell the model what to do when a field is absent, it may hallucinate a value or skip the field entirely. Adding a rule like 'if a field is not present, return null' gives you consistent, trustworthy output you can validate programmatically.

  4. Use a validation pass for critical numbers

    After extraction, prompt the model again with just the numbers and ask it to verify that line item totals sum to the subtotal and that subtotal plus tax equals the grand total. This second-pass arithmetic check catches the most dangerous extraction errors, such as transposed digits or missed line items, before they hit your books.
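
The arithmetic check described in the last tip can also be run in code, which avoids relying on the model's own arithmetic. The sketch below assumes the JSON schema from the tested prompt and that the numeric fields are present (not null); `check_totals` and the one-cent tolerance are illustrative choices.

```python
from decimal import Decimal


def check_totals(invoice: dict, tolerance: str = "0.01") -> list[str]:
    """Return a list of arithmetic problems; an empty list means the numbers agree."""
    tol = Decimal(tolerance)

    def dec(value) -> Decimal:
        # Go through str() so a float like 4.5 becomes Decimal("4.5"), not binary noise.
        return Decimal(str(value))

    problems = []
    line_sum = Decimal("0")
    for i, item in enumerate(invoice.get("line_items") or []):
        expected = dec(item["quantity"]) * dec(item["unit_price"])
        if abs(expected - dec(item["total"])) > tol:
            problems.append(f"line {i}: quantity x unit_price != total")
        line_sum += dec(item["total"])

    if abs(line_sum - dec(invoice["subtotal"])) > tol:
        problems.append("line item totals do not sum to subtotal")
    if abs(dec(invoice["subtotal"]) + dec(invoice["tax"]) - dec(invoice["total_amount"])) > tol:
        problems.append("subtotal + tax != total_amount")
    return problems
```

For the Acme Supplies invoice above, all three checks pass: 20 x 4.50 and 5 x 18.00 each give 90.00, the lines sum to 180.00, and 180.00 + 36.00 = 216.00.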

More example scenarios

#01 · SaaS company extracting vendor invoices for accounts payable
Input
Extract all invoice fields from this PDF. Return: vendor name, vendor address, invoice number, invoice date, due date, line items (description, quantity, unit price, line total), subtotal, tax rate, tax amount, and total amount due. Format as JSON. PDF content: [AWS invoice showing cloud services charges, dated March 2024, net-30 terms, with three line items for EC2, S3, and data transfer]
Expected output
{"vendor_name": "Amazon Web Services", "invoice_number": "INV-2024-03-88421", "invoice_date": "2024-03-01", "due_date": "2024-03-31", "line_items": [{"description": "EC2 Compute", "quantity": 1, "unit_price": 1240.00, "line_total": 1240.00}, {"description": "S3 Storage", "quantity": 1, "unit_price": 87.50, "line_total": 87.50}, {"description": "Data Transfer", "quantity": 1, "unit_price": 34.20, "line_total": 34.20}], "subtotal": 1361.70, "tax_rate": 0, "tax_amount": 0, "total_due": 1361.70}
#02 · Freelance designer logging client invoices into a spreadsheet
Input
I receive invoices from clients as PDFs. Extract the following fields so I can paste them into my Google Sheet: client company name, invoice number, invoice date, payment due date, project description, and total amount. Here is the invoice text: [Invoice from Bright Studio LLC, invoice #2024-047, issued April 10 2024, due April 25 2024, for brand identity project phase 2, total $4,500]
Expected output
Client: Bright Studio LLC | Invoice #: 2024-047 | Invoice Date: April 10, 2024 | Due Date: April 25, 2024 | Description: Brand identity project phase 2 | Total: $4,500.00
#03 · Procurement team extracting line items from supplier purchase order invoices
Input
Extract every line item from this supplier invoice as a CSV-formatted table with columns: SKU, product description, quantity ordered, unit price, and line total. If any field is missing or unclear, mark it as MISSING. Invoice is from a manufacturing parts supplier with 8 line items including fasteners, brackets, and gaskets.
Expected output
SKU,Product Description,Quantity,Unit Price,Line Total
FT-2201,M8 Hex Bolt Stainless,500,0.12,60.00
BR-0044,L-Bracket Galvanized 50mm,200,1.85,370.00
GS-1102,Rubber Gasket 75mm,150,0.95,142.50
FT-2205,M8 Nut Stainless,500,0.08,40.00
BR-0048,Flat Bracket 80mm,100,2.10,210.00
GS-1105,Silicone Gasket 100mm,75,1.40,105.00
FT-2210,M10 Hex Bolt Zinc,300,0.18,54.00
WS-0021,M10 Washer Zinc,300,0.04,12.00
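
CSV output like the table above can be loaded and sanity-checked with Python's standard `csv` module. This is a sketch that assumes the exact column headers shown and the `MISSING` marker requested in the prompt; `load_line_items` is a hypothetical name, and non-numeric values in the numeric columns would raise an error rather than be reported.

```python
import csv
import io
from decimal import Decimal


def load_line_items(csv_text: str) -> tuple[list[dict], list[str]]:
    """Parse model CSV output; return (rows, warnings)."""
    rows, warnings = [], []
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        rows.append(row)
        # DictReader yields None for short rows and a list for surplus cells,
        # so anything that is not a plain string counts as malformed.
        if any(not isinstance(v, str) or v.strip() == "MISSING" for v in row.values()):
            warnings.append(f"row {line_no}: missing or malformed field")
            continue
        expected = Decimal(row["Quantity"]) * Decimal(row["Unit Price"])
        if expected != Decimal(row["Line Total"]):
            warnings.append(f"row {line_no}: quantity x unit price != line total")
    return rows, warnings
```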
#04 · Accountant reconciling international invoices with VAT
Input
Extract vendor name, country of origin, currency, invoice number, invoice date, net amount, VAT percentage, VAT amount, and gross total from this EU supplier invoice. The invoice is in euros from a German software vendor including 19% VAT.
Expected output
{"vendor_name": "DataSoft GmbH", "country": "Germany", "currency": "EUR", "invoice_number": "DS-2024-1183", "invoice_date": "2024-02-15", "net_amount": 2100.00, "vat_percentage": 19, "vat_amount": 399.00, "gross_total": 2499.00}
#05 · Property management company extracting contractor maintenance invoices
Input
Pull the following from this contractor invoice PDF: contractor business name, contractor license number if present, property address serviced, work description, labor hours, labor rate, materials cost, and invoice total. Invoice is from a plumbing contractor who repaired a water heater at a rental property.
Expected output
Contractor: Rivera Plumbing Services | License: PLM-44821 | Property Address: 1402 Elm Street Unit 3B, Austin TX 78701 | Work Description: Water heater element replacement and pressure valve inspection | Labor: 2.5 hours at $95/hr = $237.50 | Materials: $148.00 (element, valve, fittings) | Invoice Total: $385.50

Common mistakes to avoid

  • Not specifying output format

    Asking for data extraction without specifying JSON, CSV, or another format returns prose descriptions that are hard to parse programmatically. You end up post-processing text instead of working with clean data. Always declare the format and structure in your prompt.

  • Assuming scanned PDFs work like text PDFs

    A scanned invoice is an image embedded in a PDF wrapper. If you copy the text out of it and feed it to the AI, you get nothing or garbage because there is no text layer. You need to run OCR first using a tool like Google Document AI, AWS Textract, or Adobe Acrobat before AI extraction can work reliably.

  • Ignoring hallucinated totals

    AI models occasionally invent plausible-looking numbers when they cannot clearly read a value, especially in dense tables or low-quality scans. Accepting extracted totals without a cross-check against source documents or an arithmetic verification step can push incorrect amounts into your accounting system.

  • Sending entire multi-page PDFs as one chunk

    Very long invoices with dozens of line items can exceed the model's useful context window or cause it to drop rows from the middle of a table. Split large invoices into logical sections, header data and line items separately, and merge the results rather than relying on one extraction pass.

  • Using the same prompt for all invoice types

    A prompt built for simple vendor invoices will perform poorly on complex invoices like contractor time-and-materials bills, recurring subscription invoices with proration, or invoices with credit memos attached. Maintain separate prompt templates for distinct invoice categories rather than forcing one prompt to handle everything.
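
The split-and-merge approach for long invoices might look like the sketch below. `extract_header` and `extract_items` are hypothetical callables standing in for your own model calls, the chunk size of 40 rows is arbitrary, and the completeness check assumes one text line per line item.

```python
from typing import Callable


def chunk_lines(lines: list[str], size: int = 40) -> list[list[str]]:
    """Split table rows into consecutive chunks of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def extract_long_invoice(
    header_text: str,
    table_lines: list[str],
    extract_header: Callable[[str], dict],
    extract_items: Callable[[str], list[dict]],
    size: int = 40,
) -> dict:
    """Extract header fields once, line items chunk by chunk, then merge."""
    invoice = extract_header(header_text)
    items: list[dict] = []
    for chunk in chunk_lines(table_lines, size):
        items.extend(extract_items("\n".join(chunk)))
    invoice["line_items"] = items
    # Cheap completeness check (assumes one input line per line item).
    invoice["row_count_matches"] = len(items) == len(table_lines)
    return invoice
```

A mismatch in `row_count_matches` does not say which row was dropped, but it is a cheap signal that a chunk needs to be re-run or reviewed.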

Frequently asked questions

Can AI extract data from scanned invoice PDFs?

The text-based workflow on this page needs a text layer, so a scanned PDF that is just an image requires an OCR step first. Tools like AWS Textract, Google Document AI, or Adobe Acrobat can convert the scanned image to a text layer. Once you have the text, you can feed it to an AI model for structured field extraction. Some end-to-end platforms like Azure AI Document Intelligence (formerly Form Recognizer) combine OCR and extraction in a single API call.

How accurate is AI invoice data extraction compared to manual entry?

On clean, machine-readable PDFs, well-prompted AI extraction can reach accuracy above 95% for standard fields like vendor, date, and total. Line-item extraction on complex tables can be lower, especially for invoices with merged cells or non-standard layouts. Manual data entry typically has an error rate of 2 to 5 percent, so AI is competitive for most use cases when combined with a basic validation check.

What fields can be extracted from an invoice PDF?

Standard extractable fields include vendor name, vendor address, vendor tax ID, invoice number, invoice date, due date, purchase order number, billing address, ship-to address, line item descriptions, quantities, unit prices, line totals, subtotal, discount, shipping, tax rate, tax amount, and grand total. Currency, payment terms, and bank details are also extractable when present on the document.

How do I extract invoice data from a PDF and put it into Excel?

Prompt the AI to return the extracted data in CSV format with headers matching your spreadsheet columns. Save the CSV output and import it into Excel with the From Text/CSV option on the Data tab, or paste it directly and use Text to Columns if needed. For recurring workflows, use a tool like Make or Zapier to automate PDF ingestion, AI extraction, and row insertion into Google Sheets or Excel.
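
If the model has already returned JSON rather than CSV, the conversion can be done in code instead of re-prompting. A minimal sketch, assuming the JSON schema from the tested prompt; `invoice_to_csv` and the chosen column set are illustrative.

```python
import csv
import io


def invoice_to_csv(invoice: dict) -> str:
    """Flatten one extracted invoice into CSV text, one row per line item."""
    columns = ["invoice_number", "invoice_date", "vendor", "description",
               "quantity", "unit_price", "total"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for item in invoice.get("line_items") or []:
        writer.writerow({
            "invoice_number": invoice.get("invoice_number"),
            "invoice_date": invoice.get("invoice_date"),
            "vendor": (invoice.get("vendor") or {}).get("name"),
            "description": item.get("description"),
            "quantity": item.get("quantity"),
            "unit_price": item.get("unit_price"),
            "total": item.get("total"),  # null fields are written as empty cells
        })
    return buffer.getvalue()
```

Repeating the invoice number and vendor on every row keeps each line item self-describing, which suits filtering and pivot tables in a spreadsheet.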

Is it safe to send invoice PDFs to an AI API?

It depends on the API provider and your data sensitivity. Most major providers like OpenAI and Anthropic do not train on API-submitted data by default, but you should verify this in their current data usage policies. For invoices containing bank account numbers, tax IDs, or proprietary pricing, consider redacting sensitive fields before sending or using a self-hosted model. Check your company's data governance policies before automating.

What is the best way to extract data from hundreds of invoices automatically?

Build a pipeline that watches a folder or email inbox for new PDF invoices, extracts the text layer using a PDF library like PyMuPDF or pdfplumber, sends the text to an AI API with a structured extraction prompt, validates the returned JSON against expected field types and arithmetic checks, and writes the results to your database or spreadsheet. No-code platforms like Make or n8n can wire this together without custom code for moderate volumes.
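
The per-invoice core of such a pipeline could be wired together roughly as follows. This is a sketch, not a definitive implementation: `read_text`, `call_model`, and `store` are hypothetical callables you would supply (for example a pdfplumber- or PyMuPDF-based reader, a wrapper around your AI API, and a database writer), and `SCHEMA` mirrors the fields from the tested prompt.

```python
import json
from typing import Callable

# Expected top-level fields and their allowed types (null is always allowed).
SCHEMA = {
    "invoice_number": str, "invoice_date": str, "due_date": str,
    "vendor": dict, "customer": dict, "line_items": list,
    "subtotal": (int, float), "tax": (int, float), "total_amount": (int, float),
    "currency": str, "payment_terms": str,
}


def validate_fields(invoice: dict) -> list[str]:
    """Return schema problems: absent keys or values of the wrong type."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in invoice:
            problems.append(f"missing key: {field}")
        elif invoice[field] is not None and not isinstance(invoice[field], expected):
            problems.append(f"wrong type: {field}")
    return problems


def process_invoice(path: str,
                    read_text: Callable[[str], str],
                    call_model: Callable[[str], str],
                    store: Callable[[dict], None]) -> list[str]:
    """Run one PDF through text extraction, the model, validation, and storage."""
    text = read_text(path)      # supplied by the caller, e.g. a PDF text reader
    reply = call_model(text)    # hypothetical wrapper around your AI API
    invoice = json.loads(reply)
    problems = validate_fields(invoice)
    if not problems:
        store(invoice)          # only records that pass validation are stored
    return problems
```

Passing the three steps in as callables keeps the validation logic testable without a PDF library or API key, and lets the folder or inbox watcher be added separately.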
