```json
{
  "document_type": "invoice",
  "invoice_number": "INV-2024-0892",
  "invoice_date": "March 15, 2024",
  "bill_to": {
    "company_name": "Acme Corp",
    "street_address": "123 Main St",
    "city": "Boston",
    "state": "MA",
    "postal_code": "02101",
    "country": null
  },
  "vendor": {
    "name": null,
    "address": null,
    "contact": null
  },
  "line_items": [
    {
      "description": "Web Design Services",
      "quantity": null,
      "unit_price": null,
      "amount": "$2,500.00"
    },
    {
      "description": "Hosting (12 months)",
      "quantity": null,
      "unit_price": null,
      "amount": "$480.00"
    }
  ],
  "subtotal": "$2,980.00",
  "tax_rate": "8%",
  "tax_amount": "$238.40",
  "total_due": "$3,218.40",
  "currency": "USD",
  "payment_terms": "Net 30",
  "due_date": "April 14, 2024",
  "payment_instructions": null,
  "notes": null,
  "confidence_notes": "All key invoice fields (invoice number, dates, bill-to address, line items, totals) were clearly extractable. Vendor/supplier information is not present in the source text. Currency is inferred as USD based on the '$' symbol. Line-item quantities and unit prices were not itemized separately in the source."
}
```
Extracting Structured Data from PDFs with ChatGPT Prompts
Tested prompts for using ChatGPT to extract data from PDFs, compared across 5 leading AI models.
You have a PDF and you need the data out of it in a usable format. Maybe it is a contract full of key dates and dollar amounts, a research paper packed with statistics, or a stack of invoices your accounting team needs in a spreadsheet. Copy-pasting is slow and error-prone. Writing a custom parser takes engineering time you do not have. ChatGPT gives you a third option: paste or upload the content and describe exactly what you want extracted.
The core workflow is straightforward. You give ChatGPT the text from your PDF, either by copying it directly or uploading the file in supported interfaces, then write a prompt that defines the fields you want, the format you want them in, and any rules for handling ambiguous or missing values. The model reads the unstructured text and returns structured output like JSON, CSV, or a markdown table you can drop straight into another tool.
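As a rough illustration of that workflow, here is a minimal sketch that assembles a prompt from a field list, an output format, and a missing-value rule. The function name `build_extraction_prompt` and the exact wording of the instructions are assumptions made for this example, not a fixed recipe.

```python
def build_extraction_prompt(pdf_text: str, fields: list[str], output_format: str = "JSON") -> str:
    """Assemble a prompt: requested fields, output format, missing-value rule, then the text."""
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields and return them as {output_format}: {field_list}.\n"
        "Preserve exact values without paraphrasing.\n"
        "If any field is missing, return null for that field.\n"
        f"Return only the {output_format}, no commentary.\n\n"
        f"PDF CONTENT:\n{pdf_text}"
    )


# Hypothetical usage with a fragment of the invoice text used on this page.
prompt = build_extraction_prompt(
    "INVOICE #INV-2024-0892 Date: March 15, 2024",
    ["invoice_number", "invoice_date", "total_due"],
)
print(prompt)
```

Keeping the prompt in one function means every document in a batch gets identical instructions, and only the pasted text changes.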
This page shows you the exact prompts that work, the model outputs you can expect, and a comparison of how different approaches perform. If you have ever Googled how to get structured data out of a PDF without writing code, you are in the right place.
When to use this
This approach is the right tool when you have text-based PDFs and need to pull specific, repeatable fields without building a pipeline. It works especially well for one-off extractions, small-to-medium batches where manual work would take hours, or situations where the document structure changes enough that a rigid parser would break constantly.
- Pulling invoice fields (vendor name, date, line items, totals) from supplier PDFs into a spreadsheet
- Extracting key clauses, party names, and dates from legal contracts or NDAs
- Collecting data tables from research papers or financial reports into clean CSV format
- Grabbing product specs, prices, and SKUs from manufacturer data sheets
- Summarizing patient or client intake forms into structured records for a database
When this format breaks down
- Scanned PDFs that are image-only with no OCR layer: ChatGPT cannot read pixel images of text, so you will get nothing back unless you run OCR first with a tool like Adobe Acrobat or Tesseract.
- High-volume automated pipelines processing thousands of documents daily: prompt-based extraction does not scale cost-effectively or reliably compared to a dedicated document AI service like AWS Textract or Azure Form Recognizer (now Azure AI Document Intelligence).
- Documents with legally sensitive data where sending content to a third-party API violates your compliance requirements, such as HIPAA-protected health records or confidential M&A documents under NDA.
- Heavily formatted PDFs where critical data lives inside complex multi-level tables, charts, or embedded objects that lose their structure when converted to plain text.
The prompt we tested
You are a data extraction assistant specialized in parsing PDF documents. Follow these rules precisely: Identify the document type (invoice, resume, contract, report, form, etc.) and extract all relevant fields into a clean JSON object with clear key names. Preserve exact values (dates, amounts, names, IDs) without paraphrasing, use null for missing fields, and include a brief 'document_type' and 'confidence_notes' field flagging any ambiguous or unreadable sections.

Below is the text content extracted from a PDF file. Analyze it and extract the requested structured data.

PDF CONTENT:
INVOICE #INV-2024-0892
Date: March 15, 2024
Bill To: Acme Corp, 123 Main St, Boston MA 02101
Items: Web Design Services - $2,500.00; Hosting (12 months) - $480.00
Subtotal: $2,980.00 | Tax (8%): $238.40 | Total Due: $3,218.40
Payment Terms: Net 30 | Due Date: April 14, 2024

Return the extracted data now.
Model comparison
| Model | Quality | Cost | Latency | Tokens (input / output) |
|---|---|---|---|---|
| Claude Opus 4.7 | 9 | $0.0443 | 7.0s | 372 / 516 |
| Claude Haiku 4.5 | 9 | $0.0018 | 4.5s | 267 / 313 |
| GPT-5.4 | 9 | $0.0046 | 4.3s | 240 / 269 |
| Gemini 2.5 Pro | 2 | $0.0004 | 15.8s | 271 / 17 |
| Grok 4.1 Fast Reasoning | 9 | $0.0002 | 4.2s | 390 / 209 |
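To put the cost column in batch terms, here is a small worked calculation. It assumes every document costs about the same as the single test invoice in the table, which will not hold for longer PDFs because cost grows with token count; the batch size of 500 is an arbitrary example.

```python
from decimal import Decimal

# Per-request costs copied from the comparison table.
COST_PER_REQUEST = {
    "Claude Opus 4.7": Decimal("0.0443"),
    "Claude Haiku 4.5": Decimal("0.0018"),
    "GPT-5.4": Decimal("0.0046"),
    "Gemini 2.5 Pro": Decimal("0.0004"),
    "Grok 4.1 Fast Reasoning": Decimal("0.0002"),
}


def batch_cost(model: str, documents: int) -> Decimal:
    """Estimated cost of running the same prompt over `documents` similar invoices."""
    return COST_PER_REQUEST[model] * documents


for model in COST_PER_REQUEST:
    print(f"{model}: ${batch_cost(model, 500):.2f}")
```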
Outputs
```json
{
  "document_type": "invoice",
  "invoice_number": "INV-2024-0892",
  "invoice_date": "March 15, 2024",
  "bill_to": {
    "company_name": "Acme Corp",
    "street_address": "123 Main St",
    "city": "Boston",
    "state": "MA",
    "zip_code": "02101"
  },
  "line_items": [
    {
      "description": "Web Design Services",
      "amount": 2500.00
    },
    {
      "description": "Hosting (12 months)",
      "amount": 480.00
    }
  ],
  "subtotal": 2980.00,
  "tax_rate": 0.08,
  "tax_amount": 238.40,
  "total_due": 3218.40,
  "payment_terms": "Net 30",
  "due_date": "April 14, 2024",
  "bill_from": null,
  "purchase_order_number": null,
  "confidence_notes": "All key invoice fields clearly identified and extracted. No ambiguities detected. Bill_from information not provided in source document."
}
```
{
  "document_type": "invoice",
  "invoice_number": "INV-2024-0892",
  "invoice_date": "March 15, 2024",
  "bill_to": {
    "name": "Acme Corp",
    "address": "123 Main St, Boston MA 02101"
  },
  "items": [
    {
      "description": "Web Design Services",
      "amount": "$2,500.00",
      "quantity": null,
      "unit_price": null
    },
    {
      "description": "Hosting (12 months)",
      "amount": "$480.00",
      "quantity": null,
      "unit_price": null
    }
  ],
  "subtotal": "$2,980.00",
  "tax": {
    "rate": "8%",
    "amount": "$238.40"
  },
  "total_due": "$3,218.40",
  "payment_terms": "Net 30",
  "due_date": "April 14, 2024",
  "currency": "USD",
  "confidence_notes": "High confidence extraction. Currency inferred from dollar symbols as USD. No seller/vendor information was present in the provided text. Item quantities and unit prices were not explicitly provided."
}
```json
{
  "document_type": "invoice",
  "
```
{
  "document_type": "invoice",
  "invoice_number": "INV-2024-0892",
  "invoice_date": "March 15, 2024",
  "bill_to": {
    "name": "Acme Corp",
    "address": "123 Main St, Boston MA 02101"
  },
  "line_items": [
    {
      "description": "Web Design Services",
      "amount": "$2,500.00"
    },
    {
      "description": "Hosting (12 months)",
      "amount": "$480.00"
    }
  ],
  "subtotal": "$2,980.00",
  "tax_rate": "8%",
  "tax_amount": "$238.40",
  "total_due": "$3,218.40",
  "payment_terms": "Net 30",
  "due_date": "April 14, 2024",
  "confidence_notes": "All fields clearly readable and unambiguous; high extraction confidence."
}
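The outputs above agree on the values but not on the types: one returns `2500.00` and `0.08`, the others return `"$2,500.00"` and `"8%"`. If you need to merge results from different runs, a normalization step along these lines can help. This is a sketch under assumptions: the helper names are made up for this example, amounts are assumed to be US dollars with comma thousands separators, and dates are assumed to be in the "March 15, 2024" style seen here.

```python
from datetime import datetime
from decimal import Decimal


def parse_amount(value):
    """Accept 2500.0 or "$2,500.00" and return a Decimal; None passes through."""
    if value is None:
        return None
    if isinstance(value, str):
        value = value.replace("$", "").replace(",", "").strip()
    return Decimal(str(value))


def parse_rate(value):
    """Accept 0.08 or "8%" and return the rate as a Decimal fraction."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().endswith("%"):
        return Decimal(value.strip().rstrip("%")) / 100
    return Decimal(str(value))


def parse_long_date(value):
    """Convert a date like "March 15, 2024" to ISO format (English month names assumed)."""
    return datetime.strptime(value, "%B %d, %Y").date().isoformat()


print(parse_amount("$2,500.00"), parse_rate("8%"), parse_long_date("March 15, 2024"))
```

Asking for the target types in the prompt itself is still the cleaner fix; normalizing afterwards is a fallback for output you already have.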
What makes these work
1. Define every output field explicitly: Do not ask ChatGPT to 'extract the important data.' Name every field you want and specify the data type where it matters, for example 'invoice_date as YYYY-MM-DD format.' Vague prompts produce inconsistent field names and structures that break downstream processing.
2. Specify a null or missing-value strategy: Tell the model what to return when a field does not exist in the document, such as null, 'N/A', or an empty string. Without this instruction, the model may hallucinate a plausible-sounding value to fill the gap, which is the most dangerous failure mode in data extraction.
3. Request one output format and stick to it: Choose JSON for programmatic use, CSV for spreadsheet imports, or a markdown table for human review, then ask for only that format with no surrounding explanation. Adding 'Return only the JSON, no commentary' prevents the model from wrapping your data in prose that breaks parsing.
4. Chunk large documents before prompting: If your PDF is long, paste it in sections and extract per section, then merge. Overloading the context window with a 50-page document degrades accuracy on fields that appear late in the text. Splitting by logical section, for example one invoice at a time, keeps extraction precise.
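One simple way to chunk is to group paragraphs until a size limit is reached. The sketch below assumes the pasted text separates paragraphs with blank lines; the name `chunk_text` and the default limit of 6,000 characters are arbitrary choices for illustration, not a recommended setting. Splitting on real document boundaries, such as one invoice per chunk, is better when you can detect them.

```python
def chunk_text(text: str, max_chars: int = 6000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
print(len(chunk_text(sample, max_chars=25)))
```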
More example scenarios
Here is the text from a vendor invoice. Extract the following fields and return them as JSON: vendor_name, invoice_number, invoice_date, due_date, line_items (array with description, quantity, unit_price, total), subtotal, tax, total_due. If any field is missing, return null for that field. [Paste invoice text here]
{ "vendor_name": "Apex Supply Co.", "invoice_number": "INV-2024-0891", "invoice_date": "2024-03-15", "due_date": "2024-04-14", "line_items": [{"description": "Office chairs", "quantity": 10, "unit_price": 149.00, "total": 1490.00}], "subtotal": 1490.00, "tax": 119.20, "total_due": 1609.20 }
Read the following contract text and extract these fields into a markdown table: party_1_name, party_2_name, effective_date, termination_date, governing_law, payment_terms, any clauses tagged as limitations of liability. Quote the exact clause text for limitations of liability. [Paste contract text here]
| Field | Value |
|---|---|
| Party 1 | Northgate Technologies Inc. |
| Party 2 | Riverdale Consulting LLC |
| Effective Date | January 1, 2024 |
| Termination Date | December 31, 2025 |
| Governing Law | State of Delaware |
| Payment Terms | Net 30 |
| Limitation of Liability | "In no event shall either party be liable for indirect, incidental, or consequential damages..." |
From the following academic paper excerpt, extract every numerical result reported in the Results section. Return a CSV with columns: metric_name, value, unit, sample_size, confidence_interval. If confidence interval is not reported, leave that column blank. [Paste results section text here]
metric_name,value,unit,sample_size,confidence_interval
Mean response time,342,ms,120,"[318, 366]"
Accuracy improvement,14.3,%,120,
Drop-out rate,8.2,%,120,"[5.1, 11.3]"
Extract product specifications from the text below. Return JSON with these fields: product_name, sku, dimensions (object with height, width, depth, unit), weight, weight_unit, materials (array), color_options (array), certifications (array), msrp_usd. [Paste spec sheet text here]
{ "product_name": "ErgoLift Monitor Arm", "sku": "EL-ARM-002", "dimensions": {"height": 18, "width": 4, "depth": 12, "unit": "inches"}, "weight": 3.2, "weight_unit": "lbs", "materials": ["aluminum", "steel"], "color_options": ["matte black", "silver"], "certifications": ["UL Listed", "RoHS"], "msrp_usd": 89.99 }
Abstract the following commercial lease. Extract: tenant_name, landlord_name, property_address, lease_start, lease_end, monthly_rent_usd, annual_escalation_percent, security_deposit_usd, permitted_use, renewal_options. Return as JSON. Mark any field not found as null. [Paste lease text here]
{ "tenant_name": "Blue Horizon Cafe LLC", "landlord_name": "Westpark Properties Group", "property_address": "412 Commerce St, Suite 3, Austin TX 78701", "lease_start": "2024-06-01", "lease_end": "2027-05-31", "monthly_rent_usd": 4200, "annual_escalation_percent": 3.0, "security_deposit_usd": 8400, "permitted_use": "Food and beverage retail", "renewal_options": "One 2-year option at market rate" }
Common mistakes to avoid
- Trusting extracted numbers without checking: ChatGPT can misread or transpose digits, especially in dense tables. Always spot-check numeric fields like totals, dates, and quantities against the source document before importing into any financial or legal system. Treat output as a draft, not ground truth.
- Pasting scanned PDFs expecting text output: If your PDF was created by scanning a physical document and no OCR has been applied, the file contains images, not text. ChatGPT cannot extract from images pasted as file text. Run the document through OCR first, then extract from the resulting text layer.
- Using an inconsistent prompt across document batches: If you are extracting the same fields from 50 invoices, use the exact same prompt every time. Tweaking the wording between runs changes which fields get returned and how they are named, making batch output impossible to merge cleanly into a single dataset.
- Ignoring document formatting loss during copy-paste: When you copy text from a PDF, tables often collapse into a single line of scrambled values. Preview the pasted text before sending it to the model. If the structure is broken, you will need to reformat it manually or use a PDF-to-text tool that preserves layout before prompting.
- Asking for too many fields in one prompt on complex documents: Requesting 30 fields from a dense 20-page document in a single prompt increases the chance of fields being skipped or merged. Break complex extractions into two or three focused prompts, each targeting a logical section of the document, then combine the outputs.
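If you do split an extraction across several focused prompts, combining the outputs can be as simple as merging dictionaries while watching for disagreements. The following is a sketch with a hypothetical helper, `merge_extractions`; it assumes each prompt returned a flat JSON object and that a null from one prompt may be filled by a value from another.

```python
def merge_extractions(parts: list[dict]) -> tuple[dict, list[str]]:
    """Combine per-prompt results; return the merged record and any conflicting keys."""
    merged: dict = {}
    conflicts: list[str] = []
    for part in parts:
        for key, value in part.items():
            if key not in merged or merged[key] is None:
                merged[key] = value  # first non-null value wins
            elif value is not None and value != merged[key]:
                conflicts.append(key)  # two prompts disagree: review by hand
    return merged, conflicts


# Hypothetical outputs from two prompts run against the same lease.
merged, conflicts = merge_extractions([
    {"tenant_name": "Blue Horizon Cafe LLC", "lease_end": None},
    {"lease_end": "2027-05-31", "monthly_rent_usd": 4200},
])
print(merged, conflicts)
```

Any key that lands in `conflicts` is worth checking against the source document rather than resolving automatically.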
Frequently asked questions
Can ChatGPT read a PDF file directly without copying the text?
In ChatGPT with the file upload feature enabled (available on Plus and Team plans), you can upload a PDF and the system extracts the text layer automatically before passing it to the model. If you are using the API or a plan without file upload, you need to extract the text yourself first using a tool like pdfplumber, PyMuPDF, or Adobe Acrobat, then paste it into the prompt.
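For the do-it-yourself route, a minimal sketch with pdfplumber might look like this. It assumes the third-party `pdfplumber` package is installed; `has_text_layer` and its 20-character threshold are a made-up heuristic for spotting scanned, image-only files, not part of any library.

```python
def extract_pdf_text(path: str) -> str:
    """Join the text layer of every page. Requires the third-party pdfplumber package."""
    import pdfplumber  # imported lazily so has_text_layer works without it installed

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages with no text, hence the fallback.
        return "\n\n".join(page.extract_text() or "" for page in pdf.pages)


def has_text_layer(text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumed threshold): near-empty output suggests a scanned PDF needing OCR."""
    return len(text.strip()) >= min_chars


print(has_text_layer("INVOICE #INV-2024-0892 Date: March 15, 2024"))
```

Checking `has_text_layer(extract_pdf_text(path))` before prompting avoids sending an empty document to the model.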
How accurate is ChatGPT at extracting data from PDFs?
Accuracy is high for clearly structured text-based PDFs when the prompt is specific about field names and formats. Accuracy drops for scanned documents, multi-column layouts that lose order when copied, and tables with merged cells. Numeric fields carry the most risk of small errors, so always verify critical values like totals or dates against the source.
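Part of that verification can be automated, because invoice numbers have to agree with each other. The sketch below checks the arithmetic of an extracted invoice using the sample values from this page. The function name is hypothetical, and rounding tax to the cent with Decimal's default rounding is an assumption; real invoices may round differently. It catches internal inconsistencies only, so it complements a comparison against the source rather than replacing it.

```python
from decimal import Decimal


def check_invoice_totals(line_amounts, subtotal, tax_rate, tax_amount, total_due):
    """Return a list of problems; an empty list means the arithmetic is consistent."""
    problems = []
    if sum(line_amounts, Decimal("0")) != subtotal:
        problems.append("line items do not sum to subtotal")
    if (subtotal * tax_rate).quantize(Decimal("0.01")) != tax_amount:
        problems.append("tax_amount does not match subtotal * tax_rate")
    if subtotal + tax_amount != total_due:
        problems.append("subtotal + tax_amount does not equal total_due")
    return problems


# Values from the sample invoice: 2,500.00 + 480.00 = 2,980.00; 8% tax = 238.40.
problems = check_invoice_totals(
    [Decimal("2500.00"), Decimal("480.00")],
    Decimal("2980.00"),
    Decimal("0.08"),
    Decimal("238.40"),
    Decimal("3218.40"),
)
print(problems)
```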
What is the best output format to request for extracted PDF data?
JSON is the best choice if you are feeding the output into code or a database because it is machine-readable and handles nested data like line item arrays cleanly. CSV works well for flat data you want to open directly in Excel or Google Sheets. Markdown tables are useful for human review but require extra parsing if you need to process them programmatically.
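If you ask for JSON but ultimately want a spreadsheet, nested line items need to be flattened into rows. Here is one way that could look using only the standard library; the function name and the choice of columns are assumptions for this example, and it expects the `invoice_number` and `line_items` keys used in the outputs on this page.

```python
import csv
import io
import json


def line_items_to_csv(extracted_json: str) -> str:
    """Flatten an invoice's line_items array into CSV, one row per item."""
    data = json.loads(extracted_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["invoice_number", "description", "amount"])
    for item in data["line_items"]:
        writer.writerow([data["invoice_number"], item["description"], item["amount"]])
    return buffer.getvalue()


sample_json = """
{
  "invoice_number": "INV-2024-0892",
  "line_items": [
    {"description": "Web Design Services", "amount": "$2,500.00"},
    {"description": "Hosting (12 months)", "amount": "$480.00"}
  ]
}
"""
csv_text = line_items_to_csv(sample_json)
print(csv_text)
```

The `csv` module quotes values that contain commas, such as "$2,500.00", so the amounts survive a round trip into Excel or Google Sheets.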
Can I use ChatGPT to extract data from multiple PDFs at once?
Not natively in a single prompt, but you can build a simple loop in Python using the OpenAI API that reads each PDF, extracts the text, sends it with your standard prompt, and writes the JSON output to a file. For batches under a few hundred documents this works well. For larger volumes, dedicated document processing services are more cost-effective and reliable.
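A loop like that might be structured as follows. To keep the sketch runnable without network access, the PDF reader and the model call are passed in as functions: `extract_text` could wrap a library such as pdfplumber or PyMuPDF, and `call_model` could wrap an OpenAI API request. Both are hypothetical placeholders here, as are the `{pdf_text}` placeholder convention and the stand-in functions in the demo.

```python
import json
import tempfile
from pathlib import Path

FENCE = "`" * 3  # models often wrap JSON in a Markdown code fence


def strip_code_fence(reply: str) -> str:
    """Remove a leading and trailing Markdown fence if the reply has one."""
    lines = reply.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)


def process_batch(pdf_paths, extract_text, call_model, prompt_template, out_dir):
    """Run one fixed prompt over every PDF; return names whose reply was not valid JSON."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failures = []
    for pdf_path in map(Path, pdf_paths):
        prompt = prompt_template.replace("{pdf_text}", extract_text(pdf_path))
        try:
            record = json.loads(strip_code_fence(call_model(prompt)))
        except json.JSONDecodeError:
            failures.append(pdf_path.name)  # e.g. a truncated or chatty reply
            continue
        out_file = out_dir / f"{pdf_path.stem}.json"
        out_file.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return failures


# Demo with stand-ins for the PDF reader and the model call.
def fake_extract_text(path):
    return f"INVOICE text from {path.name}"


def fake_call_model(prompt):
    if "bad" in prompt:
        return "Sorry, I could not read that."
    return FENCE + 'json\n{"document_type": "invoice"}\n' + FENCE


with tempfile.TemporaryDirectory() as tmp:
    failed = process_batch(
        ["a.pdf", "bad.pdf"], fake_extract_text, fake_call_model,
        "Extract fields as JSON.\n\nPDF CONTENT:\n{pdf_text}", tmp,
    )
    written = sorted(p.name for p in Path(tmp).glob("*.json"))
print(failed, written)
```

Collecting failures instead of stopping on the first bad reply lets you rerun or hand-check only the documents that did not parse.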
Does ChatGPT work for extracting data from password-protected PDFs?
No. You need to unlock the PDF first using the password before you can copy the text or upload it. Once the document is unlocked and the text layer is accessible, the extraction process works the same as any other PDF. ChatGPT has no ability to bypass encryption.
What should I do if ChatGPT returns different field names each time I run the same extraction?
Add an explicit schema definition to your prompt. List the exact field names you expect, for example 'use snake_case and these exact keys: vendor_name, invoice_date, total_due,' and instruct the model to return only those keys. You can also provide a one-shot example in your prompt showing the exact JSON structure you want, which significantly stabilizes output consistency across runs.
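You can also catch drifting field names after the fact by comparing each response against the keys you asked for. The outputs earlier on this page show the kind of drift to expect, such as `items` in one response and `line_items` in another. This is a minimal sketch with a hypothetical helper; it checks top-level keys only.

```python
def check_keys(record: dict, expected_keys: list[str]) -> dict:
    """Report expected keys that are missing and returned keys that were not requested."""
    actual = set(record)
    expected = set(expected_keys)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


expected = ["document_type", "invoice_number", "line_items", "total_due"]
record = {
    "document_type": "invoice",
    "invoice_number": "INV-2024-0892",
    "items": [],  # the model renamed line_items
    "total_due": "$3,218.40",
}
report = check_keys(record, expected)
print(report)
```

An empty report on both sides means the response matches your schema and can be merged with the rest of the batch.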