Extracting Structured Data from PDFs with ChatGPT Prompts

Tested prompts for use chatgpt to extract data from pdf compared across 5 leading AI models.

BEST BY JUDGE SCORE Claude Haiku 4.5 9/10

You have a PDF and you need the data out of it in a usable format. Maybe it is a contract full of key dates and dollar amounts, a research paper packed with statistics, or a stack of invoices your accounting team needs in a spreadsheet. Copy-pasting is slow and error-prone. Writing a custom parser takes engineering time you do not have. ChatGPT gives you a third option: paste or upload the content and describe exactly what you want extracted.

The core workflow is straightforward. You give ChatGPT the text from your PDF, either by copying it directly or uploading the file in supported interfaces, then write a prompt that defines the fields you want, the format you want them in, and any rules for handling ambiguous or missing values. The model reads the unstructured text and returns structured output like JSON, CSV, or a markdown table you can drop straight into another tool.

This page shows you the exact prompts that work, the model outputs you can expect, and a comparison of how different approaches perform. If you have ever Googled how to get structured data out of a PDF without writing code, you are in the right place.

When to use this

This approach is the right tool when you have text-based PDFs and need to pull specific, repeatable fields without building a pipeline. It works especially well for one-off extractions, small-to-medium batches where manual work would take hours, or situations where the document structure changes enough that a rigid parser would break constantly.

Pulling invoice fields (vendor name, date, line items, totals) from supplier PDFs into a spreadsheet
Extracting key clauses, party names, and dates from legal contracts or NDAs
Collecting data tables from research papers or financial reports into clean CSV format
Grabbing product specs, prices, and SKUs from manufacturer data sheets
Summarizing patient or client intake forms into structured records for a database

When this format breaks down

Scanned PDFs that are image-only with no OCR layer: ChatGPT cannot read pixel images of text, so you will get nothing back unless you run OCR first with a tool like Adobe Acrobat or Tesseract.
High-volume automated pipelines processing thousands of documents daily: prompt-based extraction does not scale cost-effectively or reliably compared to a dedicated document AI service like AWS Textract or Azure Form Recognizer.
Documents with legally sensitive data where sending content to a third-party API violates your compliance requirements, such as HIPAA-protected health records or confidential M&A documents under NDA.
Heavily formatted PDFs where critical data lives inside complex multi-level tables, charts, or embedded objects that lose their structure when converted to plain text.

The prompt we tested

You are a data extraction assistant specialized in parsing PDF documents. Follow these rules precisely:

Identify the document type (invoice, resume, contract, report, form, etc.) and extract all relevant fields into a clean JSON object with clear key names. Preserve exact values (dates, amounts, names, IDs) without paraphrasing, use null for missing fields, and include a brief 'document_type' and 'confidence_notes' field flagging any ambiguous or unreadable sections.

Below is the text content extracted from a PDF file. Analyze it and extract the requested structured data.

PDF CONTENT:
INVOICE #INV-2024-0892
Date: March 15, 2024
Bill To: Acme Corp, 123 Main St, Boston MA 02101
Items: Web Design Services - $2,500.00; Hosting (12 months) - $480.00
Subtotal: $2,980.00 | Tax (8%): $238.40 | Total Due: $3,218.40
Payment Terms: Net 30 | Due Date: April 14, 2024

Return the extracted data now.

Model comparison

Model	Quality	Cost	Latency	Tokens
Claude Opus 4.7	9	$0.0443	7.0s	372 / 516
Claude Haiku 4.5	9	$0.0018	4.5s	267 / 313
GPT-5.4	9	$0.0046	4.3s	240 / 269
Gemini 2.5 Pro	2	$0.0004	15.8s	271 / 17
Grok 4.1 Fast Reasoning	9	$0.0002	4.2s	390 / 209

Outputs

Claude Opus 4.7 9/10

```json
{
  "document_type": "invoice",
  "invoice_number": "INV-2024-0892",
  "invoice_date": "March 15, 2024",
  "bill_to": {
    "company_name": "Acme Corp",
    "street_address": "123 Main St",
    "city": "Boston",
    "state": "MA",
    "postal_code": "02101",
    "country": null
  },
  "vendor": {
    "name": null,
    "address": null,
    "contact": null
  },
  "line_items": [
    {
      "description": "Web Design Services",
      "quantity": null,
      "unit_price": null,
      "amount": "$2,500.00"
    },
    {
      "description": "Hosting (12 months)",
      "quantity": null,
      "unit_price": null,
      "amount": "$480.00"
    }
  ],
  "subtotal": "$2,980.00",
  "tax_rate": "8%",
  "tax_amount": "$238.40",
  "total_due": "$3,218.40",
  "currency": "USD",
  "payment_terms": "Net 30",
  "due_date": "April 14, 2024",
  "payment_instructions": null,
  "notes": null,
  "confidence_notes": "All key invoice fields (invoice number, dates, bill-to address, line items, totals) were clearly extractable. Vendor/supplier information is not present in the source text. Currency is inferred as USD based on the '$' symbol. Line-item quantities and unit prices were not itemized separately in the source."
}
```

Claude Haiku 4.5 9/10

```json
{
  "document_type": "invoice",
  "invoice_number": "INV-2024-0892",
  "invoice_date": "March 15, 2024",
  "bill_to": {
    "company_name": "Acme Corp",
    "street_address": "123 Main St",
    "city": "Boston",
    "state": "MA",
    "zip_code": "02101"
  },
  "line_items": [
    {
      "description": "Web Design Services",
      "amount": 2500.00
    },
    {
      "description": "Hosting (12 months)",
      "amount": 480.00
    }
  ],
  "subtotal": 2980.00,
  "tax_rate": 0.08,
  "tax_amount": 238.40,
  "total_due": 3218.40,
  "payment_terms": "Net 30",
  "due_date": "April 14, 2024",
  "bill_from": null,
  "purchase_order_number": null,
  "confidence_notes": "All key invoice fields clearly identified and extracted. No ambiguities detected. Bill_from information not provided in source document."
}
```

GPT-5.4 9/10

{
  "document_type": "invoice",
  "invoice_number": "INV-2024-0892",
  "invoice_date": "March 15, 2024",
  "bill_to": {
    "name": "Acme Corp",
    "address": "123 Main St, Boston MA 02101"
  },
  "items": [
    {
      "description": "Web Design Services",
      "amount": "$2,500.00",
      "quantity": null,
      "unit_price": null
    },
    {
      "description": "Hosting (12 months)",
      "amount": "$480.00",
      "quantity": null,
      "unit_price": null
    }
  ],
  "subtotal": "$2,980.00",
  "tax": {
    "rate": "8%",
    "amount": "$238.40"
  },
  "total_due": "$3,218.40",
  "payment_terms": "Net 30",
  "due_date": "April 14, 2024",
  "currency": "USD",
  "confidence_notes": "High confidence extraction. Currency inferred from dollar symbols as USD. No seller/vendor information was present in the provided text. Item quantities and unit prices were not explicitly provided."
}

Gemini 2.5 Pro 2/10

```json
{
  "document_type": "invoice",
  "

Grok 4.1 Fast Reasoning 9/10

{
  "document_type": "invoice",
  "invoice_number": "INV-2024-0892",
  "invoice_date": "March 15, 2024",
  "bill_to": {
    "name": "Acme Corp",
    "address": "123 Main St, Boston MA 02101"
  },
  "line_items": [
    {
      "description": "Web Design Services",
      "amount": "$2,500.00"
    },
    {
      "description": "Hosting (12 months)",
      "amount": "$480.00"
    }
  ],
  "subtotal": "$2,980.00",
  "tax_rate": "8%",
  "tax_amount": "$238.40",
  "total_due": "$3,218.40",
  "payment_terms": "Net 30",
  "due_date": "April 14, 2024",
  "confidence_notes": "All fields clearly readable and unambiguous; high extraction confidence."
}

What makes these work

01

Define every output field explicitly
Do not ask ChatGPT to 'extract the important data.' Name every field you want and specify the data type where it matters, for example 'invoice_date as YYYY-MM-DD format.' Vague prompts produce inconsistent field names and structures that break downstream processing.
02

Specify a null or missing-value strategy
Tell the model what to return when a field does not exist in the document, such as null, 'N/A', or an empty string. Without this instruction, the model may hallucinate a plausible-sounding value to fill the gap, which is the most dangerous failure mode in data extraction.
03

Request one output format and stick to it
Choose JSON for programmatic use, CSV for spreadsheet imports, or a markdown table for human review, then ask for only that format with no surrounding explanation. Adding 'Return only the JSON, no commentary' prevents the model from wrapping your data in prose that breaks parsing.
04

Chunk large documents before prompting
If your PDF is long, paste it in sections and extract per section, then merge. Overloading the context window with a 50-page document degrades accuracy on fields that appear late in the text. Splitting by logical section, for example one invoice at a time, keeps extraction precise.

More example scenarios

#01 · Invoice data extraction for accounts payable

Input

Here is the text from a vendor invoice. Extract the following fields and return them as JSON: vendor_name, invoice_number, invoice_date, due_date, line_items (array with description, quantity, unit_price, total), subtotal, tax, total_due. If any field is missing, return null for that field.

[Paste invoice text here]

Expected output

{ "vendor_name": "Apex Supply Co.", "invoice_number": "INV-2024-0891", "invoice_date": "2024-03-15", "due_date": "2024-04-14", "line_items": [{"description": "Office chairs", "quantity": 10, "unit_price": 149.00, "total": 1490.00}], "subtotal": 1490.00, "tax": 119.20, "total_due": 1609.20 }

#02 · Contract clause extraction for legal review

Input

Read the following contract text and extract these fields into a markdown table: party_1_name, party_2_name, effective_date, termination_date, governing_law, payment_terms, any clauses tagged as limitations of liability. Quote the exact clause text for limitations of liability.

[Paste contract text here]

Expected output

| Field | Value |
|---|---|
| Party 1 | Northgate Technologies Inc. |
| Party 2 | Riverdale Consulting LLC |
| Effective Date | January 1, 2024 |
| Termination Date | December 31, 2025 |
| Governing Law | State of Delaware |
| Payment Terms | Net 30 |
| Limitation of Liability | "In no event shall either party be liable for indirect, incidental, or consequential damages..." |

#03 · Research paper statistics extraction

Input

From the following academic paper excerpt, extract every numerical result reported in the Results section. Return a CSV with columns: metric_name, value, unit, sample_size, confidence_interval. If confidence interval is not reported, leave that column blank.

[Paste results section text here]

Expected output

metric_name,value,unit,sample_size,confidence_interval
Mean response time,342,ms,120,"[318, 366]"
Accuracy improvement,14.3,%,120,
Drop-out rate,8.2,%,120,"[5.1, 11.3]"

#04 · Product data sheet extraction for e-commerce catalog

Input

Extract product specifications from the text below. Return JSON with these fields: product_name, sku, dimensions (object with height, width, depth, unit), weight, weight_unit, materials (array), color_options (array), certifications (array), msrp_usd.

[Paste spec sheet text here]

Expected output

{ "product_name": "ErgoLift Monitor Arm", "sku": "EL-ARM-002", "dimensions": {"height": 18, "width": 4, "depth": 12, "unit": "inches"}, "weight": 3.2, "weight_unit": "lbs", "materials": ["aluminum", "steel"], "color_options": ["matte black", "silver"], "certifications": ["UL Listed", "RoHS"], "msrp_usd": 89.99 }

#05 · Real estate lease abstraction

Input

Abstract the following commercial lease. Extract: tenant_name, landlord_name, property_address, lease_start, lease_end, monthly_rent_usd, annual_escalation_percent, security_deposit_usd, permitted_use, renewal_options. Return as JSON. Mark any field not found as null.

[Paste lease text here]

Expected output

{ "tenant_name": "Blue Horizon Cafe LLC", "landlord_name": "Westpark Properties Group", "property_address": "412 Commerce St, Suite 3, Austin TX 78701", "lease_start": "2024-06-01", "lease_end": "2027-05-31", "monthly_rent_usd": 4200, "annual_escalation_percent": 3.0, "security_deposit_usd": 8400, "permitted_use": "Food and beverage retail", "renewal_options": "One 2-year option at market rate" }

Common mistakes to avoid

Trusting extracted numbers without checking
ChatGPT can misread or transpose digits, especially in dense tables. Always spot-check numeric fields like totals, dates, and quantities against the source document before importing into any financial or legal system. Treat output as a draft, not ground truth.
Pasting scanned PDFs expecting text output
If your PDF was created by scanning a physical document and no OCR has been applied, the file contains images, not text. ChatGPT cannot extract from images pasted as file text. Run the document through OCR first, then extract from the resulting text layer.
Using an inconsistent prompt across document batches
If you are extracting the same fields from 50 invoices, use the exact same prompt every time. Tweaking the wording between runs changes which fields get returned and how they are named, making batch output impossible to merge cleanly into a single dataset.
Ignoring document formatting loss during copy-paste
When you copy text from a PDF, tables often collapse into a single line of scrambled values. Preview the pasted text before sending it to the model. If the structure is broken, you will need to reformat it manually or use a PDF-to-text tool that preserves layout before prompting.
Asking for too many fields in one prompt on complex documents
Requesting 30 fields from a dense 20-page document in a single prompt increases the chance of fields being skipped or merged. Break complex extractions into two or three focused prompts, each targeting a logical section of the document, then combine the outputs.

Related queries

Frequently asked questions

Can ChatGPT read a PDF file directly without copying the text?

In ChatGPT with the file upload feature enabled (available on Plus and Team plans), you can upload a PDF and the system extracts the text layer automatically before passing it to the model. If you are using the API or a plan without file upload, you need to extract the text yourself first using a tool like pdfplumber, PyMuPDF, or Adobe Acrobat, then paste it into the prompt.

How accurate is ChatGPT at extracting data from PDFs?

Accuracy is high for clearly structured text-based PDFs when the prompt is specific about field names and formats. Accuracy drops for scanned documents, multi-column layouts that lose order when copied, and tables with merged cells. Numeric fields carry the most risk of small errors, so always verify critical values like totals or dates against the source.

What is the best output format to request for extracted PDF data?

JSON is the best choice if you are feeding the output into code or a database because it is machine-readable and handles nested data like line item arrays cleanly. CSV works well for flat data you want to open directly in Excel or Google Sheets. Markdown tables are useful for human review but require extra parsing if you need to process them programmatically.

Can I use ChatGPT to extract data from multiple PDFs at once?

Not natively in a single prompt, but you can build a simple loop in Python using the OpenAI API that reads each PDF, extracts the text, sends it with your standard prompt, and writes the JSON output to a file. For batches under a few hundred documents this works well. For larger volumes, dedicated document processing services are more cost-effective and reliable.

Does ChatGPT work for extracting data from password-protected PDFs?

No. You need to unlock the PDF first using the password before you can copy the text or upload it. Once the document is unlocked and the text layer is accessible, the extraction process works the same as any other PDF. ChatGPT has no ability to bypass encryption.

What should I do if ChatGPT returns different field names each time I run the same extraction?

Add an explicit schema definition to your prompt. List the exact field names you expect, for example 'use snake_case and these exact keys: vendor_name, invoice_date, total_due,' and instruct the model to return only those keys. You can also provide a one-shot example in your prompt showing the exact JSON structure you want, which significantly stabilizes output consistency across runs.

Try it with a real tool

Run this prompt in one of these tools. Affiliate links help keep Gridlyx free.

Perplexity Pro AI-powered answer engine

Try Perplexity →

CustomGPT ChatGPT trained on your content

Try CustomGPT →