AI Workflows to Convert PDF Documents into JSON Data

Tested prompts for converting PDF to JSON with AI, compared across 5 leading AI models.

BEST BY JUDGE SCORE Claude Opus 4.7 9/10

You have a PDF — maybe an invoice, a research report, a product catalog, or a form — and you need the data inside it as structured JSON so your code, database, or API can actually use it. Copy-pasting by hand is error-prone and does not scale. Writing a custom parser breaks on every slightly different layout. AI models can read the document, understand its structure, and return clean, typed JSON in one step.

The approach works by sending the PDF content — either as extracted text or as an image — to a large language model with a prompt that defines the exact JSON schema you want back. The model handles messy formatting, varied layouts, and implicit structure that rules-based tools miss entirely.
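As a minimal sketch of that idea, the snippet below assembles a prompt from extraction rules, a schema template, and the PDF text. The `INVOICE_TEMPLATE` fields and the `build_prompt` helper are hypothetical names chosen for illustration, not part of any model's API; adapt the template to your own schema.

```python
import json

# Hypothetical target schema: a filled example the model should mirror.
INVOICE_TEMPLATE = {
    "invoiceNumber": "INV-0000",
    "date": "YYYY-MM-DD",
    "items": [
        {"description": "string", "quantity": 0, "unitPrice": 0.0, "amount": 0.0}
    ],
    "totalDue": 0.0,
}


def build_prompt(pdf_text: str, template: dict) -> str:
    """Combine the rules, the schema template, and the PDF text into one prompt."""
    return "\n\n".join([
        "Convert the PDF content below into JSON that matches this template exactly.",
        "Template:\n" + json.dumps(template, indent=2),
        "Use null for any field that is not present. Return only the JSON object.",
        "PDF content to convert:\n" + pdf_text.strip(),
    ])


prompt = build_prompt("INVOICE #INV-2024-0892 | Date: March 15, 2024", INVOICE_TEMPLATE)
print(prompt)
```

The resulting string is what you would send as the user message to whichever model you choose.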

This page shows you exactly how that prompt is built, compares outputs from five leading models on the same PDF, and gives you the practical details you need to use this in a real workflow: which models handle tables and nested fields best, where the approach breaks down, and how to make the output reliable enough to parse programmatically.

When to use this

This approach fits any situation where you need machine-readable data extracted from a PDF that has inconsistent formatting, mixed content types, or fields that a naive text extractor would mangle. It is especially useful when you need the output to match a specific schema and do not want to write brittle regex or layout-dependent parsing code.

Extracting line items, totals, and vendor details from supplier invoices into a standard billing schema
Pulling structured fields from scanned or digital intake forms, applications, or contracts
Converting product spec sheets or catalogs into JSON records for database import
Parsing research reports or clinical documents into structured data for downstream analysis
Processing batches of varying-format PDFs where a single rigid parser would fail on layout differences

When this format breaks down

The PDF contains large tables with hundreds of rows — most models have context limits and will truncate or hallucinate values mid-table; use a dedicated table extraction tool first.
You need 100% guaranteed field accuracy for legal or financial records without a human review step — AI models do occasionally misread digits or merge adjacent fields, which is unacceptable without validation.
The PDF is a scanned image with poor scan quality, handwriting, or stamps over text — OCR accuracy drops sharply and the model inherits every OCR error before it even starts structuring the data.
Your PDF is hundreds of pages long — per-page token costs and latency make AI extraction impractical compared to a pipeline that chunks and processes pages selectively.

The prompt we tested

You are an expert document parser that converts unstructured PDF text into clean, structured JSON data.

Rules:
Analyze the PDF text and infer a logical schema with clear field names in camelCase; output valid, minified-friendly JSON with properly typed values (strings, numbers, booleans, arrays, nested objects) and use null for missing fields. Preserve line items, tables, and repeated sections as arrays of objects, and never invent data that is not present in the source.

PDF content to convert:
INVOICE #INV-2024-0892 | Date: March 15, 2024 | Bill To: Acme Corp, 123 Main St, Austin TX 78701. Items: Consulting Services - 10 hrs @ $150/hr = $1,500; Software License - 1 unit @ $299 = $299. Subtotal: $1,799 | Tax (8.25%): $148.42 | Total Due: $1,947.42 | Payment Terms: Net 30.

Return only the JSON object, with no commentary, explanations, or markdown fences.

Model comparison

Model	Quality (/10)	Cost	Latency	Tokens (input / output)
Claude Opus 4.7	9	$0.0221	2.7s	353 / 224
Claude Haiku 4.5	8	$0.0010	1.3s	259 / 156
GPT-5.4	9	$0.0028	2.4s	230 / 147
Gemini 2.5 Pro	9	$0.0011	9.3s	260 / 162
Grok 4.1 Fast Reasoning	9	$0.0001	10.1s	377 / 138

Outputs

Claude Opus 4.7 9/10

{"invoiceNumber":"INV-2024-0892","date":"March 15, 2024","billTo":{"name":"Acme Corp","address":"123 Main St","city":"Austin","state":"TX","postalCode":"78701"},"items":[{"description":"Consulting Services","quantity":10,"unit":"hrs","unitPrice":150,"amount":1500},{"description":"Software License","quantity":1,"unit":"unit","unitPrice":299,"amount":299}],"subtotal":1799,"tax":{"rate":0.0825,"amount":148.42},"totalDue":1947.42,"paymentTerms":"Net 30"}

Claude Haiku 4.5 8/10

```json
{"invoiceNumber":"INV-2024-0892","date":"2024-03-15","billTo":{"company":"Acme Corp","address":"123 Main St","city":"Austin","state":"TX","zipCode":"78701"},"lineItems":[{"description":"Consulting Services","quantity":10,"unit":"hrs","unitPrice":150,"total":1500},{"description":"Software License","quantity":1,"unit":"unit","unitPrice":299,"total":299}],"subtotal":1799,"taxRate":8.25,"taxAmount":148.42,"totalDue":1947.42,"paymentTerms":"Net 30"}
```

GPT-5.4 9/10

{"documentType":"invoice","invoiceNumber":"INV-2024-0892","date":"2024-03-15","billTo":{"name":"Acme Corp","address":{"street":"123 Main St","city":"Austin","state":"TX","postalCode":"78701"}},"items":[{"description":"Consulting Services","quantity":10,"unit":"hrs","unitPrice":150,"lineTotal":1500},{"description":"Software License","quantity":1,"unit":"unit","unitPrice":299,"lineTotal":299}],"subtotal":1799,"tax":{"rate":8.25,"amount":148.42},"totalDue":1947.42,"paymentTerms":"Net 30"}

Gemini 2.5 Pro 9/10

{"invoiceId":"INV-2024-0892","invoiceDate":"March 15, 2024","billTo":{"name":"Acme Corp","address":"123 Main St, Austin TX 78701"},"items":[{"description":"Consulting Services","quantity":10,"unit":"hrs","unitPrice":150,"lineTotal":1500},{"description":"Software License","quantity":1,"unit":"unit","unitPrice":299,"lineTotal":299}],"subtotal":1799,"tax":{"rate":8.25,"amount":148.42},"totalDue":1947.42,"paymentTerms":"Net 30"}

Grok 4.1 Fast Reasoning 9/10

{"invoiceNumber":"INV-2024-0892","invoiceDate":"March 15, 2024","billTo":{"companyName":"Acme Corp","address":{"street":"123 Main St","city":"Austin","state":"TX","zipCode":"78701"}},"items":[{"description":"Consulting Services","quantity":10,"unit":"hrs","rate":150,"total":1500},{"description":"Software License","quantity":1,"unit":"unit","rate":299,"total":299}],"subtotal":1799,"tax":{"rate":0.0825,"amount":148.42},"totalDue":1947.42,"paymentTerms":"Net 30"}
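The outputs above differ in ways your code has to absorb: one reply arrived wrapped in a markdown fence despite the instruction, and the tax rate came back as 0.0825 from some models and 8.25 from others. The sketch below shows one way to tolerate both. The helper names are hypothetical, and the "values above 1 are percentages" rule is an assumption that only holds for rates like the one in this invoice.

```python
import json

FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def parse_model_json(raw: str) -> dict:
    """Parse a model reply as JSON, tolerating an optional markdown fence around it."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, including a language tag such as "json".
        text = text[text.find("\n") + 1:]
        # Drop the closing fence if present.
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    return json.loads(text)


def tax_rate_as_fraction(rate: float) -> float:
    """Heuristic (an assumption): treat values above 1 as percentages."""
    return rate / 100 if rate > 1 else rate


fenced_reply = FENCE + 'json\n{"taxRate": 8.25, "totalDue": 1947.42}\n' + FENCE
data = parse_model_json(fenced_reply)
print(data["totalDue"], tax_rate_as_fraction(data["taxRate"]))
```

Giving the model an explicit schema, as described in the next section, removes most of this variation at the source; the cleanup step is a safety net rather than a substitute.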

What makes these work

01

Define your schema in the prompt
Do not ask the model to guess what structure you want. Paste your exact target JSON schema or a filled example into the prompt. When the model has a concrete template to match, field names stay consistent across documents and you can parse the output directly without remapping keys.
02

Use strict output instructions
Tell the model to return only valid JSON with no explanation, no markdown fences, and no extra commentary. Add an instruction like 'If a field is not found, return null for that key rather than omitting it.' This keeps your downstream JSON.parse calls from throwing on unexpected wrapper text, and keeps the code that reads the parsed object from failing on missing keys.
03

Pre-extract text before sending to the model
For digital PDFs, extract the raw text with a library like PyMuPDF or pdfplumber before passing it to the AI. This avoids burning vision-model tokens on clean digital text, reduces cost, and gives the model a cleaner input than rendering the page as an image would.
04

Validate the output programmatically
Run the returned JSON through a schema validator like Zod, Pydantic, or AJV immediately after the model responds. If validation fails, send the error message back to the model with an instruction to fix it. This retry loop catches the small percentage of malformed outputs without manual intervention.
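The validate-and-retry loop from the last point can be sketched as follows. To stay dependency-free, this uses a small hand-written check as a stand-in for Zod, Pydantic, or AJV. `REQUIRED_FIELDS`, `validate`, and `extract_with_retry` are hypothetical names, and `call_model` is a placeholder for whatever function sends a prompt to your model and returns its text reply.

```python
import json
from typing import Callable

# Hypothetical minimal schema: required key -> accepted Python type(s).
REQUIRED_FIELDS = {"invoiceNumber": str, "totalDue": (int, float), "items": list}


def validate(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif data[key] is not None and not isinstance(data[key], expected):
            errors.append(f"wrong type for {key}: {type(data[key]).__name__}")
    return errors


def extract_with_retry(prompt: str, call_model: Callable[[str], str],
                       max_attempts: int = 3) -> dict:
    """Call the model, validate its reply, and feed errors back until it passes."""
    current = prompt
    errors: list[str] = []
    for _ in range(max_attempts):
        reply = call_model(current)
        try:
            data = json.loads(reply)
            if isinstance(data, dict):
                errors = validate(data)
            else:
                errors = ["top-level value is not an object"]
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        if not errors:
            return data
        # Send the validation errors back with an instruction to fix them.
        current = (f"{prompt}\n\nYour previous reply failed validation: "
                   f"{'; '.join(errors)}. Return corrected JSON only.")
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {errors}")
```

Capping the number of attempts matters: a document the model genuinely cannot parse should surface as an error for human review instead of looping indefinitely.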

More example scenarios

#01 · Accounts payable: extract invoice fields

Input

PDF content: Invoice #INV-2041 from Apex Supply Co., dated March 14 2024. Bill to: Northgate LLC. Line items: 200x Widget A at $4.50 each, 50x Widget B at $12.00 each. Subtotal $1500.00, tax $120.00, total due $1620.00. Payment due April 13 2024.

Expected output

{"invoice_number": "INV-2041", "vendor": "Apex Supply Co.", "bill_to": "Northgate LLC", "date": "2024-03-14", "due_date": "2024-04-13", "line_items": [{"description": "Widget A", "quantity": 200, "unit_price": 4.50, "total": 900.00}, {"description": "Widget B", "quantity": 50, "unit_price": 12.00, "total": 600.00}], "subtotal": 1500.00, "tax": 120.00, "total_due": 1620.00}

#02 · HR: parse job application form

Input

PDF content: Applicant name: Sarah Chen. Position applied for: Senior Data Engineer. Email: s.chen@email.com. Phone: 415-555-0192. Years of experience: 7. Skills listed: Python, Spark, dbt, Snowflake, Airflow. Available start date: June 1 2024. References available on request.

Expected output

{"applicant": {"name": "Sarah Chen", "email": "s.chen@email.com", "phone": "415-555-0192"}, "position": "Senior Data Engineer", "experience_years": 7, "skills": ["Python", "Spark", "dbt", "Snowflake", "Airflow"], "available_start_date": "2024-06-01", "references_available": true}

#03 · E-commerce: convert product spec sheet

Input

PDF content: Product: UltraGrip Work Glove Model UG-7. Material: Nitrile-coated polyester. Sizes available: S, M, L, XL. Colors: Black, Hi-Vis Yellow. EN388 rating: 4121X. Weight per pair: 85g. Case pack: 144 pairs. MSRP: $8.99 per pair.

Expected output

{"product_name": "UltraGrip Work Glove", "model": "UG-7", "material": "Nitrile-coated polyester", "sizes": ["S", "M", "L", "XL"], "colors": ["Black", "Hi-Vis Yellow"], "certifications": {"EN388": "4121X"}, "weight_per_pair_grams": 85, "case_pack_pairs": 144, "msrp_usd": 8.99}

#04 · Healthcare: extract patient intake form data

Input

PDF content: Patient: James Okafor, DOB 1978-09-03. Chief complaint: persistent lower back pain for 3 weeks. Current medications: Lisinopril 10mg daily, Vitamin D 1000IU. Allergies: Penicillin. Insurance: BlueCross BlueShield, Member ID BCX-449201. Primary care physician: Dr. Lin.

Expected output

{"patient": {"name": "James Okafor", "dob": "1978-09-03"}, "chief_complaint": "persistent lower back pain for 3 weeks", "medications": [{"name": "Lisinopril", "dose": "10mg", "frequency": "daily"}, {"name": "Vitamin D", "dose": "1000IU", "frequency": null}], "allergies": ["Penicillin"], "insurance": {"provider": "BlueCross BlueShield", "member_id": "BCX-449201"}, "primary_physician": "Dr. Lin"}

#05 · Legal: extract key contract terms

Input

PDF content: Service Agreement between TechBuild Inc. (Service Provider) and Orion Retail Group (Client). Effective date: February 1 2024. Contract term: 24 months. Monthly fee: $3,500. Termination notice required: 60 days. Governing law: State of Delaware. Liability cap: $50,000.

Expected output

{"agreement_type": "Service Agreement", "parties": {"service_provider": "TechBuild Inc.", "client": "Orion Retail Group"}, "effective_date": "2024-02-01", "term_months": 24, "monthly_fee_usd": 3500, "termination_notice_days": 60, "governing_law": "State of Delaware", "liability_cap_usd": 50000}

Common mistakes to avoid

Asking for JSON without a schema
Prompting the model to 'convert this to JSON' without specifying field names produces different structures every time. One call returns 'invoice_date', the next returns 'date' or 'InvoiceDate'. Your parsing code breaks on the variation, and you spend time normalizing instead of using the data.
Sending scanned PDFs as raw text
If you run OCR on a scanned PDF and send the garbled output to the model, errors in character recognition get baked into the JSON. An '8' misread as a '3' in a dollar amount is a real financial error. Either use a high-accuracy OCR tool first or use a vision-capable model that reads the page image directly.
Ignoring token limits on long documents
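Sending a 40-page PDF as one prompt can exceed the model's context window. Depending on the API or tool, the request is either rejected outright or the input is truncated before the model sees it. In the truncated case nothing warns you: the model extracts nothing from the missing portion and you get incomplete JSON with no error. Long replies can also be cut off at the output token limit. Split long documents into pages or logical sections and merge the results.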
Treating AI output as ground truth
Models occasionally hallucinate a plausible-looking value for a field they cannot clearly read, rather than returning null. For any data that drives financial, legal, or medical decisions, build a confidence-flagging step or spot-check sample outputs against the source documents before trusting the pipeline at scale.
Not escaping special characters in the output
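PDF content often contains double quotes, backslashes, or line breaks that break JSON syntax when the model includes them unescaped inside a string. Curly quotes are valid inside a JSON string, but a model that uses them in place of the straight quotes around keys and values produces invalid JSON. Instruct the model to escape all strings properly, and always run JSON.parse or equivalent in a try-catch so malformed output fails loudly rather than silently corrupting your data.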
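For the long-document case above, splitting and merging can be sketched like this. The character budget is a rough stand-in for a token budget (an assumption; use your model's tokenizer for real limits), and `chunk_pages` and `merge_results` are hypothetical helpers that assume each chunk returns the same flat schema with one list field.

```python
def chunk_pages(pages: list[str], max_chars: int) -> list[list[str]]:
    """Group consecutive pages so each group stays within a rough character budget."""
    chunks, current, size = [], [], 0
    for page in pages:
        # Start a new chunk when adding this page would exceed the budget.
        if current and size + len(page) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(page)
        size += len(page)
    if current:
        chunks.append(current)
    return chunks


def merge_results(results: list[dict], list_key: str = "items") -> dict:
    """Merge per-chunk JSON: concatenate the list field, keep the first non-null scalar."""
    merged: dict = {list_key: []}
    for result in results:
        for key, value in result.items():
            if key == list_key:
                merged[list_key].extend(value or [])
            elif merged.get(key) is None:
                merged[key] = value
    return merged


pages = ["page one text", "page two text", "page three"]
print(chunk_pages(pages, max_chars=30))
print(merge_results([
    {"invoiceNumber": "INV-2041", "items": [{"description": "Widget A"}], "totalDue": None},
    {"invoiceNumber": None, "items": [{"description": "Widget B"}], "totalDue": 1620.0},
]))
```

A single page larger than the budget still becomes its own chunk here, so very dense pages may need a finer split than page boundaries.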

Frequently asked questions

Which AI model is best for converting PDF to JSON?

For digital PDFs with clean text, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro all perform well. For scanned or image-based PDFs, GPT-4o and Gemini 1.5 Pro have strong vision capabilities that handle layout and tables better than text-only models. The comparison table on this page shows side-by-side output quality on the same test document.

Can AI extract tables from PDFs into JSON?

Yes, but with caveats. Simple tables with clear headers convert reliably. Complex tables with merged cells, nested headers, or multi-page spans cause errors in most models. For critical table data, extract the table with a dedicated tool like Camelot or pdfplumber first, then pass the structured result to the AI for final JSON formatting.
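Once a table tool has given you rows, turning them into JSON records is mechanical and does not need the model at all. The sketch below assumes the rows arrive as a list of lists of strings with `None` for empty cells and the header in the first row, a shape similar to what table extraction libraries typically return. `rows_to_records` is a hypothetical helper, and the sample rows are taken from the Apex Supply example on this page with one cell blanked for illustration.

```python
import json


def rows_to_records(rows: list[list]) -> list[dict]:
    """Turn [header, row, row, ...] into a list of dicts, with None for empty cells."""
    header, *body = rows
    keys = [(cell or "").strip() for cell in header]
    records = []
    for row in body:
        record = {}
        for key, cell in zip(keys, row):
            text = cell.strip() if isinstance(cell, str) else ""
            record[key] = text or None
        records.append(record)
    return records


rows = [
    ["Description", "Quantity", "Unit Price"],
    ["Widget A", "200", "4.50"],
    ["Widget B", "50", None],  # empty cell blanked for illustration
]
records = rows_to_records(rows)
print(json.dumps(records, indent=2))
```

Values stay as strings at this stage; passing the records to the model, or to a typed validator, is where "200" becomes the number 200.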

How do I convert a scanned PDF to JSON with AI?

Use a vision-capable model like GPT-4o or Gemini 1.5 Pro and send the PDF page as an image rather than extracted text. Render each page to a PNG at 150-300 DPI, include your target JSON schema in the prompt, and the model reads the visual layout directly. This bypasses OCR preprocessing but costs more tokens per page than text-based extraction.

Is there an API I can use to automate PDF to JSON conversion?

Yes. OpenAI, Anthropic, and Google all expose REST APIs for their models. For a production pipeline, you would extract text or render page images from the PDF programmatically, call the API with your extraction prompt, parse and validate the JSON response, and store or forward the result. No third-party service is required.

How accurate is AI at extracting data from PDFs?

On clean digital PDFs with standard layouts, accuracy is very high — typically above 95% for structured fields. Accuracy drops on handwriting, low-quality scans, complex table layouts, and documents with non-standard formatting. For high-stakes use cases, always validate extracted values against the source or build a human review step for flagged records.
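For invoices, one inexpensive validation is arithmetic: the figures must agree with each other, so a misread digit usually shows up as a mismatch. The sketch below flags records for review when they do not add up. Field names follow the Claude Opus 4.7 output shown earlier; `check_invoice_totals` and the one-cent tolerance are assumptions for illustration.

```python
from decimal import Decimal


def _d(value) -> Decimal:
    """Convert via str to avoid binary floating-point noise in money values."""
    return Decimal(str(value))


def check_invoice_totals(invoice: dict, tolerance: str = "0.01") -> list[str]:
    """Return a list of arithmetic inconsistencies; an empty list means consistent."""
    tol = Decimal(tolerance)
    problems = []
    for item in invoice["items"]:
        expected = _d(item["quantity"]) * _d(item["unitPrice"])
        if abs(expected - _d(item["amount"])) > tol:
            problems.append(f"line item '{item['description']}' does not multiply out")
    items_sum = sum((_d(item["amount"]) for item in invoice["items"]), Decimal("0"))
    subtotal = _d(invoice["subtotal"])
    if abs(items_sum - subtotal) > tol:
        problems.append("line items do not sum to subtotal")
    if abs(subtotal + _d(invoice["tax"]["amount"]) - _d(invoice["totalDue"])) > tol:
        problems.append("subtotal plus tax does not equal total due")
    return problems


invoice = {
    "items": [
        {"description": "Consulting Services", "quantity": 10, "unitPrice": 150, "amount": 1500},
        {"description": "Software License", "quantity": 1, "unitPrice": 299, "amount": 299},
    ],
    "subtotal": 1799,
    "tax": {"rate": 0.0825, "amount": 148.42},
    "totalDue": 1947.42,
}
print(check_invoice_totals(invoice))
```

A check like this catches internal inconsistencies only. A value that was misread consistently, or a field with no arithmetic relationship to others such as an invoice number, still needs comparison against the source document.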

Can I extract nested or hierarchical data from a PDF into JSON?

Yes, and this is one area where AI outperforms rule-based parsers. If your prompt includes an example of the nested structure you want — such as a line_items array inside an invoice object — the model will map hierarchical relationships from the document into your schema. Provide a concrete example schema rather than describing the nesting in words.
