Automate Purchase Order Data Extraction from PDFs

Tested prompts for extracting data from purchase order PDFs, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (9/10)

When a purchase order arrives as a PDF, someone has to get the data out of it: vendor name, PO number, line items, quantities, unit prices, delivery dates. Doing that by hand is slow, error-prone, and completely unscalable once volume picks up. Whether you are processing five POs a day or five hundred, you need a repeatable method that pulls the right fields every time without manual re-keying.

The approach on this page uses a structured prompt fed to a large language model. You paste or pipe in the text content of the PDF, and the model returns clean JSON or a table with every field you care about. No custom OCR pipeline to build, no vendor-specific template to maintain, no code to write for each new supplier format.

This works because modern LLMs understand document structure well enough to find a PO number on line 3 or buried in a header block, match line items to their quantities and prices, and handle the inconsistencies that come from dealing with dozens of different supplier templates. The comparison table below shows how the tested models handle the same PO input, so you can pick the right one for your volume and accuracy requirements.

When to use this

This approach is the right fit when you are dealing with PDFs from multiple suppliers with different layouts, need structured output you can feed into an ERP or spreadsheet, or want to automate a manual data-entry step without writing a custom parser for every vendor template. It handles both native text PDFs and, with an OCR pre-step, scanned documents.

  • Accounts payable teams receiving POs from dozens of suppliers with inconsistent formatting
  • Procurement analysts who need to aggregate line-item data from POs into a single spreadsheet or database
  • Developers building an intake workflow where PO data must populate an ERP like SAP, NetSuite, or QuickBooks
  • Operations teams validating that received goods match what was ordered by comparing extracted PO fields against packing slips
  • Small businesses replacing a manual data-entry step that currently takes hours per week

When this format breaks down

  • The PDF is a scanned image with no text layer and you have no OCR step in place. The model reads text, not pixels. You must run the file through Tesseract, AWS Textract, or a similar tool first.
  • You need legally defensible extraction with zero tolerance for hallucination, such as a contract PO that triggers automatic payment. Always add a human review step before money moves.
  • Your POs are in a single, fixed, machine-readable format you control end to end. A simple regex or a CSV export from your own system is faster and cheaper.
  • The document is over 50 pages of dense line items and the model's context window cannot hold it all. You will need to chunk the document or use a purpose-built document AI service.

The prompt we tested

You are an expert data extraction assistant specializing in parsing purchase order documents. Extract all relevant structured data from the purchase order content provided below and return it in a clean, organized format.

Instructions: Return the extracted data as a structured JSON object with clearly labeled keys. Use null for missing fields and preserve original formatting for numbers, dates (ISO 8601), and currency. Keep line items as an array of objects and do not invent data not present in the source.

Purchase Order Content:
Purchase Order #PO-2024-00891 dated March 15, 2024 from Acme Corp (123 Main St, Boston MA 02101) to Global Supplies Inc (456 Warehouse Rd, Dallas TX 75201). Items: 50 units of Widget-A @ $12.50 each, 25 units of Gadget-B @ $45.00 each. Subtotal $1,750.00, tax $140.00, shipping $75.00, total $1,965.00 USD. Net 30 payment terms, deliver by April 10, 2024.

Extract and return the following fields when present: PO Number, PO Date, Buyer/Company Name, Buyer Address, Vendor/Supplier Name, Vendor Address, Ship-To Address, Bill-To Address, Payment Terms, Delivery Date, Line Items (SKU, Description, Quantity, Unit Price, Total), Subtotal, Tax, Shipping, Total Amount, Currency, and any Notes or Special Instructions. Flag any missing or ambiguous fields.
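In an automated pipeline, the prompt above is usually assembled per document rather than pasted by hand. The sketch below shows one way to do that, assuming the PDF text has already been extracted into a string; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced here, not part of any library.

```python
# Sketch only: PROMPT_TEMPLATE and build_prompt are hypothetical names.
# The wording is the prompt tested on this page, with a slot for the PO text.
PROMPT_TEMPLATE = (
    "You are an expert data extraction assistant specializing in parsing "
    "purchase order documents. Extract all relevant structured data from the "
    "purchase order content provided below and return it in a clean, "
    "organized format.\n\n"
    "Instructions: Return the extracted data as a structured JSON object with "
    "clearly labeled keys. Use null for missing fields and preserve original "
    "formatting for numbers, dates (ISO 8601), and currency. Keep line items "
    "as an array of objects and do not invent data not present in the source."
    "\n\n"
    "Purchase Order Content:\n{po_text}\n\n"
    "Extract and return the following fields when present: PO Number, PO Date, "
    "Buyer/Company Name, Buyer Address, Vendor/Supplier Name, Vendor Address, "
    "Ship-To Address, Bill-To Address, Payment Terms, Delivery Date, "
    "Line Items (SKU, Description, Quantity, Unit Price, Total), Subtotal, "
    "Tax, Shipping, Total Amount, Currency, and any Notes or Special "
    "Instructions. Flag any missing or ambiguous fields."
)


def build_prompt(po_text: str) -> str:
    """Insert the extracted PO text into the prompt template."""
    po_text = po_text.strip()
    if not po_text:
        # An empty string usually means the PDF had no text layer.
        raise ValueError("PO text is empty; the PDF may need an OCR step first")
    return PROMPT_TEMPLATE.format(po_text=po_text)
```

Failing early on empty text keeps a scanned PDF with no text layer from reaching the model at all.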

Model comparison

| Model | Quality | Cost | Latency | Tokens |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | 9 | $0.0692 | 8.0s | 520 / 819 |
| Claude Haiku 4.5 | 9 | $0.0027 | 2.7s | 352 / 464 |
| GPT-5.4 | 9 | $0.0069 | 4.3s | 310 / 409 |
| Gemini 2.5 Pro | Not run yet | | | |
| Grok 4.1 Fast Reasoning | 9 | $0.0003 | 12.5s | 459 / 323 |

Outputs

Claude Opus 4.7 9/10
```json
{
  "purchase_order": {
    "po_number": "PO-2024-00891",
    "po_date": "2024-03-15",
    "buyer": {
      "company_name": "Acme Corp",
      "address": {
        "street": "123 Main St",
        "city": "Boston",
        "state": "MA",
        "postal_code": "02101",
        "country": null
      }
    },
    "vendor": {
      "company_name": "Global Supplies Inc",
      "address": {
        "street": "456 Warehouse Rd",
        "city": "Dallas",
        "state": "TX",
        "postal_code": "75201",
        "country": null
      }
    },
    "ship_to_address": null,
    "bill_to_address": null,
    "payment_terms": "Net 30",
    "delivery_date": "2024-04-10",
    "line_items": [
      {
        "sku": null,
        "description": "Widget-A",
        "quantity": 50,
        "unit_price": 12.50,
        "total": 625.00
      },
      {
        "sku": null,
        "description": "Gadget-B",
        "quantity": 25,
        "unit_price": 45.00,
        "total": 1125.00
      }
    ],
    "financials": {
      "subtotal": 1750.00,
      "tax": 140.00,
      "shipping": 75.00,
      "total_amount": 1965.00,
      "currency": "USD"
    },
    "notes": null,
    "special_instructions": null
  },
  "data_quality_flags": {
    "missing_fields": [
      "ship_to_address",
      "bill_to_address",
      "line_item_skus",
      "country (buyer and vendor)",
      "notes",
      "special_instructions"
    ],
    "ambiguous_fields": [
      {
        "field": "ship_to_address",
        "note": "Not explicitly stated; may be assumed to match buyer address (Acme Corp, 123 Main St, Boston MA 02101) but not confirmed."
      },
      {
        "field": "bill_to_address",
        "note": "Not explicitly stated; may be assumed to match buyer address but not confirmed."
      },
      {
        "field": "line_item_totals",
        "note": "Per-line totals were calculated (qty × unit price) since they were not explicitly provided in the source. Sum matches stated subtotal of $1,750.00."
      }
    ]
  }
}
```
Judge notes: Excellent execution that fully extracts all available data in clean JSON format, correctly handles missing fields with null values, accurately calculates line item totals, and transparently flags ambiguities and missing information without inventing data.
Claude Haiku 4.5 9/10
```json
{
  "purchase_order": {
    "po_number": "PO-2024-00891",
    "po_date": "2024-03-15",
    "buyer": {
      "company_name": "Acme Corp",
      "address": "123 Main St, Boston MA 02101"
    },
    "vendor": {
      "company_name": "Global Supplies Inc",
      "address": "456 Warehouse Rd, Dallas TX 75201"
    },
    "ship_to_address": null,
    "bill_to_address": null,
    "payment_terms": "Net 30",
    "delivery_date": "2024-04-10",
    "line_items": [
      {
        "sku": null,
        "description": "Widget-A",
        "quantity": 50,
        "unit": "units",
        "unit_price": 12.50,
        "total": 625.00
      },
      {
        "sku": null,
        "description": "Gadget-B",
        "quantity": 25,
        "unit": "units",
        "unit_price": 45.00,
        "total": 1125.00
      }
    ],
    "financial_summary": {
      "subtotal": 1750.00,
      "tax": 140.00,
      "shipping": 75.00,
      "total_amount": 1965.00,
      "currency": "USD"
    },
    "notes_and_special_instructions": null,
    "flags": {
      "missing_fields": [
        "ship_to_address",
        "bill_to_address",
        "line_item_skus",
        "notes_and_special_instructions"
      ],
      "ambiguous_fields": []
    }
  }
}
```
Judge notes: Excellent extraction with proper JSON structure, correct data parsing, accurate calculations, appropriate null values for missing fields, and clear flagging of absent information—minor deduction only because ship-to/bill-to addresses weren't explicitly stated as potentially defaulting to buyer/vendor addresses.
GPT-5.4 9/10
```json
{
  "po_number": "PO-2024-00891",
  "po_date": "2024-03-15",
  "buyer_company_name": "Acme Corp",
  "buyer_address": "123 Main St, Boston MA 02101",
  "vendor_supplier_name": "Global Supplies Inc",
  "vendor_address": "456 Warehouse Rd, Dallas TX 75201",
  "ship_to_address": null,
  "bill_to_address": null,
  "payment_terms": "Net 30",
  "delivery_date": "2024-04-10",
  "line_items": [
    {
      "sku": null,
      "description": "Widget-A",
      "quantity": "50",
      "unit_price": "$12.50",
      "total": "$625.00"
    },
    {
      "sku": null,
      "description": "Gadget-B",
      "quantity": "25",
      "unit_price": "$45.00",
      "total": "$1,125.00"
    }
  ],
  "subtotal": "$1,750.00",
  "tax": "$140.00",
  "shipping": "$75.00",
  "total_amount": "$1,965.00",
  "currency": "USD",
  "notes_or_special_instructions": null,
  "missing_or_ambiguous_fields": [
    {
      "field": "ship_to_address",
      "reason": "Not provided in the source content."
    },
    {
      "field": "bill_to_address",
      "reason": "Not provided in the source content."
    },
    {
      "field": "line_items[].sku",
      "reason": "SKU values were not provided; item names appear to be descriptions only."
    },
    {
      "field": "notes_or_special_instructions",
      "reason": "No notes or special instructions were provided in the source content."
    }
  ]
}
```
Judge notes: Excellent extraction with proper JSON structure, correct date formatting, accurate data preservation, and transparent flagging of missing fields—minor deduction only because currency symbols could have been numeric values for better data consistency.
Grok 4.1 Fast Reasoning 9/10
```json
{
  "PO Number": "PO-2024-00891",
  "PO Date": "2024-03-15",
  "Buyer Company Name": "Acme Corp",
  "Buyer Address": "123 Main St, Boston MA 02101",
  "Vendor Supplier Name": "Global Supplies Inc",
  "Vendor Address": "456 Warehouse Rd, Dallas TX 75201",
  "Ship To Address": null,
  "Bill To Address": null,
  "Payment Terms": "Net 30",
  "Delivery Date": "2024-04-10",
  "Line Items": [
    {
      "SKU": null,
      "Description": "Widget-A",
      "Quantity": 50,
      "Unit Price": "12.50",
      "Total": null
    },
    {
      "SKU": null,
      "Description": "Gadget-B",
      "Quantity": 25,
      "Unit Price": "45.00",
      "Total": null
    }
  ],
  "Subtotal": "1,750.00",
  "Tax": "140.00",
  "Shipping": "75.00",
  "Total Amount": "1,965.00",
  "Currency": "USD",
  "Notes or Special Instructions": null,
  "flags": {
    "missing_fields": [
      "Ship-To Address",
      "Bill-To Address",
      "SKU (all line items)",
      "Line item totals (not explicitly provided)"
    ],
    "ambiguous_fields": []
  }
}
```
Judge notes: Excellent execution with accurate data extraction, proper JSON formatting, correct ISO 8601 dates, appropriate null values for missing data, and helpful flagging of absent fields; only minor issue is that line item totals could have been calculated (50×12.50=625, 25×45=1125) despite the instruction not to invent data.
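
The four outputs above represent money in three different ways: JSON numbers (both Claude models), strings with a currency symbol (GPT-5.4), and bare strings with thousands separators (Grok). If you compare models or switch between them, a small normalization step keeps downstream code from caring which one it got. The sketch below is a minimal example; `to_decimal` is a hypothetical helper, and it assumes US-style formatting (`$`, comma as thousands separator, period as decimal point), as in the sample PO.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(value):
    """Convert 12.5, "12.50", or "$1,125.00" to Decimal; None stays None.

    Assumption: US-style amounts only ("$", "," thousands, "." decimal).
    """
    if value is None:
        return None
    if isinstance(value, bool):
        raise TypeError("booleans are not monetary amounts")
    if isinstance(value, (int, float)):
        # Going through str() avoids binary float noise such as 0.1 + 0.2.
        return Decimal(str(value))
    if not isinstance(value, str):
        raise TypeError(f"unsupported type: {type(value).__name__}")
    cleaned = value.strip().replace("$", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a monetary amount: {value!r}") from exc
```

`Decimal` is used instead of `float` so that sums of line totals compare exactly against the stated subtotal.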

What makes these work

  1. Always specify your output schema

    Tell the model exactly which fields you want and in what format before you paste the document. A prompt like 'Return a JSON object with these keys: po_number, vendor, line_items (array with description, qty, unit_price, total), po_total, delivery_date' produces consistent output you can parse programmatically. Without a schema, different POs will return fields with different names and structures, breaking any downstream automation.

  2. Extract text from the PDF before prompting

    LLMs read text, not PDFs. Use a library like PyMuPDF (fitz), pdfplumber, or pdftotext to pull the raw text layer first, then pass that string to the model. For scanned POs, run the file through an OCR tool first. Feeding clean text instead of a file path means better extraction accuracy and no dependency on the model's file-handling capabilities.

  3. Include validation instructions in the prompt

    Ask the model to cross-check its own arithmetic. A prompt addition like 'Verify that each line_total equals qty multiplied by unit_price, and that the sum of line_totals equals po_total. Flag any discrepancy with a mismatch field set to true' catches data-entry errors in the original document and extraction errors from the model itself.

  4. Handle multi-page and multi-item POs with chunking

    If a PO has dozens of line items spread across several pages, concatenate all the extracted text and pass it in one prompt if it fits the context window. If it does not, split by page and merge the line_items arrays in your code afterward. Always include a system instruction that the document may continue across chunks so the model does not try to infer a false total from an incomplete view.
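
The merge step for chunked documents can be sketched as follows. This is a minimal example, assuming each chunk was extracted into a dict with a `line_items` array and header fields set to null when the chunk did not contain them; `merge_chunks` is a hypothetical helper, and the key names are illustrative.

```python
def merge_chunks(chunk_results):
    """Merge per-chunk extraction dicts into one PO dict.

    Line items are concatenated in chunk order. For every other key, the
    first non-null value wins, so a header field found on page 1 is not
    overwritten by a null from page 3.
    """
    merged = {"line_items": []}
    for result in chunk_results:
        for key, value in result.items():
            if key == "line_items":
                merged["line_items"].extend(value or [])
            elif merged.get(key) is None:
                merged[key] = value
    return merged


# Usage: page 1 carries the header, page 2 carries the grand total.
pages = [
    {"po_number": "PO-2024-00891", "total_amount": None,
     "line_items": [{"description": "Widget-A", "quantity": 50}]},
    {"po_number": None, "total_amount": 1965.00,
     "line_items": [{"description": "Gadget-B", "quantity": 25}]},
]
po = merge_chunks(pages)
```

A line item that straddles a page break can still be split across two chunks; this sketch does not try to detect that case.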

More example scenarios

#01 · Manufacturing supplier PO with multiple line items
Input
PO Number: PO-2024-08821. Date: 2024-11-14. Vendor: Apex Industrial Supply. Ship To: 400 Commerce Dr, Detroit MI 48201. Line 1: Steel Hex Bolts M8x30, Qty 5000, Unit Price $0.04, Total $200.00. Line 2: Nylon Lock Nuts M8, Qty 5000, Unit Price $0.03, Total $150.00. Requested Delivery: 2024-12-01. Payment Terms: Net 30.
Expected output
{"po_number":"PO-2024-08821","date":"2024-11-14","vendor":"Apex Industrial Supply","ship_to":"400 Commerce Dr, Detroit MI 48201","payment_terms":"Net 30","requested_delivery":"2024-12-01","line_items":[{"description":"Steel Hex Bolts M8x30","qty":5000,"unit_price":0.04,"total":200.00},{"description":"Nylon Lock Nuts M8","qty":5000,"unit_price":0.03,"total":150.00}],"po_total":350.00}
#02 · Retail buyer PO for apparel with size/color variants
Input
Purchase Order from: Westfield Retail Group. PO: WRG-55293. Vendor: BlueLine Apparel Co. Order Date: Jan 8 2025. Item: Men's Classic Crewneck Sweatshirt, Style BL-CC-401. Color: Navy. Sizes: S x20, M x40, L x35, XL x15. Unit Cost: $18.50. Total Units: 110. Total Cost: $2,035.00. Deliver By: Feb 15 2025.
Expected output
{"po_number":"WRG-55293","buyer":"Westfield Retail Group","vendor":"BlueLine Apparel Co","order_date":"2025-01-08","deliver_by":"2025-02-15","line_items":[{"style":"BL-CC-401","description":"Men's Classic Crewneck Sweatshirt","color":"Navy","sizes":{"S":20,"M":40,"L":35,"XL":15},"total_units":110,"unit_cost":18.50,"line_total":2035.00}]}
#03 · SaaS company PO for software licenses and professional services
Input
Purchase Order. Issued by: Thornton Financial LLC. PO #: TF-2025-0042. To: CloudBase Software Inc. Date Issued: March 3, 2025. Line 1: Enterprise License, 50 seats, $120/seat/year, Subtotal $6,000. Line 2: Onboarding Professional Services, 10 hours, $250/hr, Subtotal $2,500. Payment Terms: 50% upfront, 50% on delivery. Total: $8,500.
Expected output
{"po_number":"TF-2025-0042","buyer":"Thornton Financial LLC","vendor":"CloudBase Software Inc","issue_date":"2025-03-03","payment_terms":"50% upfront, 50% on delivery","line_items":[{"description":"Enterprise License","qty":50,"unit":"seats","unit_price":120,"period":"year","subtotal":6000},{"description":"Onboarding Professional Services","qty":10,"unit":"hours","unit_price":250,"subtotal":2500}],"po_total":8500}
#04 · Restaurant chain PO for food and beverage inventory
Input
PO Date: 04/02/2025. PO Number: RC-0882. From: Harvest Table Restaurants. To: Pacific Coast Distributors. Items: Olive Oil Extra Virgin 4L x 24 units @ $14.00 = $336.00; Canned Tomatoes 2.5kg x 48 units @ $3.20 = $153.60; Dried Pasta 5kg x 30 units @ $6.50 = $195.00. Delivery Required: 04/07/2025. Total: $684.60.
Expected output
{"po_number":"RC-0882","po_date":"2025-04-02","buyer":"Harvest Table Restaurants","vendor":"Pacific Coast Distributors","delivery_required":"2025-04-07","line_items":[{"description":"Olive Oil Extra Virgin 4L","qty":24,"unit_price":14.00,"total":336.00},{"description":"Canned Tomatoes 2.5kg","qty":48,"unit_price":3.20,"total":153.60},{"description":"Dried Pasta 5kg","qty":30,"unit_price":6.50,"total":195.00}],"po_total":684.60}
#05 · Construction firm PO for materials with project code tracking
Input
Purchase Order No: CPO-2025-1147. Project: Riverside Commercial Build. Project Code: RCB-7. Contractor: Summit Build Co. Supplier: BuildRight Materials Inc. Date: 05 May 2025. Ready-Mix Concrete 30MPa, 18 cubic meters, $210/m3, $3,780. Rebar 12mm D500N, 2000kg, $1.85/kg, $3,700. Delivery Address: 88 River Rd, Auckland. Required On Site: 12 May 2025. PO Total: $7,480.
Expected output
{"po_number":"CPO-2025-1147","project":"Riverside Commercial Build","project_code":"RCB-7","contractor":"Summit Build Co","supplier":"BuildRight Materials Inc","po_date":"2025-05-05","required_on_site":"2025-05-12","delivery_address":"88 River Rd, Auckland","line_items":[{"description":"Ready-Mix Concrete 30MPa","qty":18,"unit":"cubic meters","unit_price":210,"total":3780},{"description":"Rebar 12mm D500N","qty":2000,"unit":"kg","unit_price":1.85,"total":3700}],"po_total":7480}

Common mistakes to avoid

  • Assuming the PDF has a text layer

    Many POs are scanned documents with no extractable text. If you pass raw bytes or a file path without first confirming there is a text layer, the model gets nothing useful and will hallucinate or return empty fields. Always check programmatically with pdfplumber or similar before deciding whether OCR is needed.

  • Not pinning the output format

    Without an explicit schema in the prompt, the model may return line items as a flat table in one response and a nested JSON array in the next, depending on how the PO is laid out. This inconsistency breaks parsers downstream. Define the exact field names, data types, and nesting structure in every prompt.

  • Ignoring extraction confidence on messy documents

    Poorly scanned POs, faxed documents, or files with overlapping text can produce garbage OCR output that the model will try to make sense of, sometimes confidently wrong. Ask the model to return a low_confidence flag on any field it is uncertain about, and build a human review queue for those records.

  • Skipping totals reconciliation

    A model might correctly read each line item but miss that the document has a freight charge or tax line at the bottom, making the extracted total wrong. Always instruct the model to extract every numeric total it finds, including subtotals, tax, freight, and grand total, and never assume po_total equals the sum of visible line items.

  • Running extraction without a test suite across vendor formats

    A prompt tuned on your top supplier's PO format will likely fail on a supplier who uses a completely different layout. Before deploying to production, test against at least ten real POs from different vendors. Identify the failure cases and add examples or clarifying instructions to the prompt until accuracy is acceptable across all formats.
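
Totals reconciliation is also easy to repeat in code after the model responds, rather than relying only on the model's own arithmetic. The sketch below assumes a flat result with numeric `subtotal`, `tax`, `shipping`, and `total_amount` keys and line items carrying `quantity`, `unit_price`, and `total`; those key names and the `reconcile` helper are illustrative, so adapt them to whatever schema your prompt pins.

```python
from decimal import Decimal


def _d(value):
    """Decimal from a JSON number; a missing (None) amount counts as zero."""
    return Decimal("0") if value is None else Decimal(str(value))


def reconcile(po, tolerance=Decimal("0.01")):
    """Return a list of mismatch descriptions; an empty list means consistent."""
    problems = []
    line_sum = Decimal("0")
    for i, item in enumerate(po["line_items"], start=1):
        expected = _d(item["quantity"]) * _d(item["unit_price"])
        stated = item.get("total")
        # A null line total is allowed: the source may not have printed one.
        if stated is not None and abs(_d(stated) - expected) > tolerance:
            problems.append(f"line {i}: stated total {stated}, expected {expected}")
        line_sum += expected
    if abs(line_sum - _d(po.get("subtotal"))) > tolerance:
        problems.append(f"subtotal {po.get('subtotal')} != sum of lines {line_sum}")
    grand = _d(po.get("subtotal")) + _d(po.get("tax")) + _d(po.get("shipping"))
    if abs(grand - _d(po.get("total_amount"))) > tolerance:
        problems.append(f"total_amount {po.get('total_amount')} != {grand}")
    return problems


# Usage with the sample PO from this page.
sample = {
    "line_items": [
        {"quantity": 50, "unit_price": 12.50, "total": 625.00},
        {"quantity": 25, "unit_price": 45.00, "total": 1125.00},
    ],
    "subtotal": 1750.00, "tax": 140.00, "shipping": 75.00,
    "total_amount": 1965.00,
}
issues = reconcile(sample)
```

Any record with a non-empty result is a candidate for the human review queue rather than automatic import.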

Frequently asked questions

Can I extract data from a scanned purchase order PDF?

Yes, but you need an OCR step first. Tools like AWS Textract, Google Document AI, Azure AI Document Intelligence (formerly Form Recognizer), or open-source options like Tesseract convert the scanned image to text. Once you have the text, you can pass it to an LLM for structured extraction. The quality of your OCR output directly affects extraction accuracy, so higher-resolution scans produce better results.

What fields can be reliably extracted from a purchase order PDF?

Standard fields like PO number, issue date, vendor name, ship-to address, payment terms, requested delivery date, and line items (description, quantity, unit price, total) are reliably extracted from well-formatted PDFs. More complex fields like freight terms, tax breakdowns, or coded product SKUs work well when you explicitly name them in your prompt. Accuracy drops on handwritten annotations or fields printed in unusual locations.

How do I get the extracted data into Excel or my ERP system?

Ask the model to return JSON, then write a short script (using Python's json and csv modules, for example) to convert that JSON to a CSV or call your ERP's API. Tools like Zapier or Make can also receive JSON output from an AI step and push it directly to Google Sheets, NetSuite, or QuickBooks without custom code. The structured output from the model is the bridge into those systems.
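
As a rough illustration of the script route, the sketch below flattens the line items from a model's JSON response into CSV text that Excel can open. It assumes a flat schema with `po_number` and a `line_items` array of objects; `line_items_to_csv` and the column names are hypothetical and should match your own schema.

```python
import csv
import io
import json

# Assumed column names; extra keys on a line item (such as "sku") are ignored.
COLUMNS = ["po_number", "description", "quantity", "unit_price", "total"]


def line_items_to_csv(model_output: str) -> str:
    """Turn the model's JSON response into CSV text, one row per line item."""
    po = json.loads(model_output)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for item in po.get("line_items", []):
        # Repeat the PO number on every row so rows stay traceable.
        writer.writerow({"po_number": po.get("po_number"), **item})
    return buffer.getvalue()


# Usage with a trimmed-down response.
response = json.dumps({
    "po_number": "PO-2024-00891",
    "line_items": [
        {"sku": None, "description": "Widget-A", "quantity": 50,
         "unit_price": 12.50, "total": 625.00},
        {"sku": None, "description": "Gadget-B", "quantity": 25,
         "unit_price": 45.00, "total": 1125.00},
    ],
})
csv_text = line_items_to_csv(response)
```

Writing `csv_text` to a `.csv` file is enough for a spreadsheet import; an ERP integration would replace the CSV writer with calls to that system's API.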

How accurate is AI-based purchase order data extraction?

On clean, native-text PDFs with consistent formatting, top models can reach extraction accuracy above 95 percent for standard fields. Accuracy drops on scanned documents, multilingual POs, or documents with complex tables. Running validation checks inside the prompt (totals reconciliation, required-field checks) and routing low-confidence records to a human review queue can bring effective accuracy close to 100 percent in production workflows.
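
The required-field check and review routing can also run in code after extraction. The sketch below is one possible shape, assuming the prompt asks the model to list uncertain fields under a `low_confidence_fields` key; that key, the `REQUIRED` tuple, and the queue names are all hypothetical.

```python
# Assumed minimum set of fields a record needs before automatic import.
REQUIRED = ("po_number", "po_date", "line_items", "total_amount")


def review_reasons(po: dict) -> list:
    """Collect reasons why a record should not be imported automatically."""
    reasons = [
        f"missing required field: {key}"
        for key in REQUIRED
        if po.get(key) in (None, "", [])
    ]
    # Hypothetical key: fields the model itself marked as uncertain.
    for field in po.get("low_confidence_fields") or []:
        reasons.append(f"low confidence: {field}")
    return reasons


def route(po: dict):
    """Return (queue_name, reasons); queue names are placeholders."""
    reasons = review_reasons(po)
    return ("human_review", reasons) if reasons else ("auto_import", [])


# Usage: a record with no PO date goes to the review queue.
record = {"po_number": "RC-0882", "po_date": None,
          "line_items": [{"description": "Dried Pasta 5kg"}],
          "total_amount": 684.60}
queue, why = route(record)
```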

Is this better than using a dedicated document parsing tool like Rossum or Hyperscience?

For low-to-medium volume and varied supplier formats, an LLM prompt approach is faster to set up, requires no training data, and costs less. Purpose-built tools offer advantages at high volume, where they provide audit trails, pre-built ERP connectors, and SLA guarantees. Start with the prompt approach to validate your workflow, then evaluate specialized tools once you know your throughput and accuracy requirements.

Can this handle POs in languages other than English?

Leading models handle French, German, Spanish, Japanese, and other common languages well for structured document extraction. Include a note in your prompt specifying the document language if it is not English, and define your output field names in English if your downstream systems expect English keys. Edge cases arise with right-to-left languages or documents that mix two languages, which may require additional prompt tuning.
