Pull Form Field Values from Completed PDF Forms with AI

Tested prompts for extracting form field data from PDFs, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (9/10)

If you have a stack of completed PDF forms and need the data out of them, you already know the problem: PDFs are designed to display information, not share it. Whether someone filled out an AcroForm in Adobe Acrobat or printed and scanned a paper form, getting those field values into a spreadsheet, database, or downstream system is painful without the right approach.

AI-assisted extraction changes that workflow. Instead of manually copying values or wrestling with brittle regex scripts, you describe what you need and the model reads the form, identifies named fields, and returns structured data you can actually use. This works on standard interactive PDFs with embedded field metadata as well as flat, image-like PDFs where the form was printed and scanned.

This page shows you exactly how to prompt an AI model to pull form field data from a completed PDF, compares how five leading models handle the task, and gives you the practical context to judge when this approach fits your situation.

When to use this

This approach fits best when you have one or more completed PDF forms and need the field values in a structured format fast. It is especially useful when the forms vary slightly in layout across batches, when you lack access to the original form software, or when you need to process forms at volume without writing custom parsing code.

  • Extracting applicant data from submitted PDF job applications or intake forms
  • Pulling invoice line items, totals, and vendor details from PDF invoices received by email
  • Processing government or compliance forms where field names map to required database columns
  • Reading survey or registration forms returned by clients as completed PDF attachments
  • Digitizing scanned paper forms from legacy filing systems into structured records

When this format breaks down

  • The PDF contains hundreds of pages or thousands of form instances in a single batch run -- at that scale you need a dedicated document processing pipeline like AWS Textract, Azure Form Recognizer, or a purpose-built OCR service rather than a prompt-per-document AI approach.
  • You need legally certified, auditable extraction with a chain of custody record. AI extraction does not produce a verifiable log that would satisfy compliance requirements in regulated industries like healthcare billing or legal document processing without additional tooling around it.
  • The form fields are deeply nested inside complex tables or multi-column layouts with merged cells. AI models can misalign values to wrong fields in dense tabular structures, producing silently incorrect data.
  • Your PDF is a heavily degraded scan with low DPI, skewed pages, or significant ink bleed. OCR accuracy on poor-quality scans is unreliable regardless of the AI model used, and errors in extraction will propagate into your data without obvious flags.

The prompt we tested

You are an expert at extracting structured form field data from filled PDF forms. Analyze the PDF content provided below and extract every form field along with its user-entered value.

Instructions: Output a clean JSON object where each key is the exact field label from the PDF and each value is the user-entered response (use null for empty fields, true/false for checkboxes, and arrays for multi-select fields); group related fields under nested objects when the form has clear sections, and do not include any commentary outside the JSON.

PDF Form Content:
Filled PDF form titled 'Employee Onboarding Form' containing: Full Name: Sarah Chen, Date of Birth: 1990-04-12, Email: sarah.chen@example.com, Department: [X] Engineering [ ] Sales [ ] Marketing, Remote Work: Yes, Emergency Contact Name: David Chen, Emergency Contact Phone: 555-234-8891, T-Shirt Size: [ ] S [X] M [ ] L [ ] XL.

Return the extracted form field data now.

Model comparison

Model                    Quality  Cost     Latency  Tokens
Claude Opus 4.7          9        $0.0213  4.0s     353 / 214
Claude Haiku 4.5         9        $0.0008  924ms    234 / 110
GPT-5.4                  9        $0.0021  1.6s     217 / 101
Gemini 2.5 Pro           9        $0.0010  7.7s     228 / 138
Grok 4.1 Fast Reasoning  9        $0.0001  7.7s     365 / 85

Outputs

Claude Opus 4.7 9/10
{
  "Employee Onboarding Form": {
    "Personal Information": {
      "Full Name": "Sarah Chen",
      "Date of Birth": "1990-04-12",
      "Email": "sarah.chen@example.com"
    },
    "Employment Details": {
      "Department": "Engineering",
      "Remote Work": "Yes"
    },
    "Emergency Contact": {
      "Emergency Contact Name": "David Chen",
      "Emergency Contact Phone": "555-234-8891"
    },
    "Additional Information": {
      "T-Shirt Size": "M"
    }
  }
}
Judge notes: Excellent extraction with proper JSON structure, logical field grouping, and accurate value mapping; only minor deviation from instructions is using "Yes" instead of true for the boolean Remote Work field.
Claude Haiku 4.5 9/10
```json
{
  "Full Name": "Sarah Chen",
  "Date of Birth": "1990-04-12",
  "Email": "sarah.chen@example.com",
  "Department": "Engineering",
  "Remote Work": true,
  "Emergency Contact": {
    "Name": "David Chen",
    "Phone": "555-234-8891"
  },
  "T-Shirt Size": "M"
}
```
Judge notes: Excellently structured JSON output with proper nesting, accurate field extraction, correct data type handling (boolean for yes/no, string for selections), and clean formatting that fully satisfies the query intent with only minor deviation from checkbox array format requested.
GPT-5.4 9/10
{
  "Employee Onboarding Form": {
    "Full Name": "Sarah Chen",
    "Date of Birth": "1990-04-12",
    "Email": "sarah.chen@example.com",
    "Department": "Engineering",
    "Remote Work": true,
    "Emergency Contact": {
      "Name": "David Chen",
      "Phone": "555-234-8891"
    },
    "T-Shirt Size": "M"
  }
}
Judge notes: Excellently extracts all form field data into clean, well-organized JSON with appropriate data types and logical nesting; minor deduction only for wrapping the entire form under a top-level "Employee Onboarding Form" key when the instructions implied direct field mapping.
Gemini 2.5 Pro 9/10
```json
{
  "Personal Information": {
    "Full Name": "Sarah Chen",
    "Date of Birth": "1990-04-12",
    "Email": "sarah.chen@example.com"
  },
  "Employment Details": {
    "Department": "Engineering",
    "Remote Work": "Yes"
  },
  "Emergency Contact": {
    "Name": "David Chen",
    "Phone": "555-234-8891"
  },
  "T-Shirt Size": "M"
}
```
Judge notes: Excellent extraction with proper JSON structure, correct field values, and logical grouping into sections; only minor point deduction for not explicitly showing checkbox/boolean conversion methodology in the output format.
Grok 4.1 Fast Reasoning 9/10
{
  "Full Name": "Sarah Chen",
  "Date of Birth": "1990-04-12",
  "Email": "sarah.chen@example.com",
  "Department": ["Engineering"],
  "Remote Work": "Yes",
  "Emergency Contact Name": "David Chen",
  "Emergency Contact Phone": "555-234-8891",
  "T-Shirt Size": ["M"]
}
Judge notes: Excellent extraction with proper JSON formatting and correct data types, though checkbox fields could have used true/false boolean values per the stated instructions rather than string/array alternatives.

What makes these work

  1. Name every field explicitly

    Telling the model exactly which fields to extract, using the same labels that appear on the form, dramatically reduces hallucination and missed values. Vague instructions like 'get all the data' produce inconsistent key names across documents. Explicit field lists give you a predictable schema you can pipe directly into a database or CSV.

  2. Specify your output format upfront

    Ask for JSON, CSV, or a specific table structure in the same sentence where you describe the task. Models will default to prose descriptions of the data if you do not constrain the format. For downstream automation, JSON with consistent key names is almost always the right choice -- it requires no further parsing.
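
One caveat: models often wrap JSON in Markdown code fences (two of the outputs above do exactly that), so a small cleanup step before parsing is worth having. A minimal sketch in Python:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Strip optional Markdown code fences, then parse the JSON payload."""
    # Models sometimes return ```json ... ``` around the object they were
    # asked for; remove an opening fence at the start and a closing fence
    # at the end before handing the string to json.loads.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)

fenced = '```json\n{"Full Name": "Sarah Chen"}\n```'
print(parse_model_json(fenced)["Full Name"])  # Sarah Chen
```

The same function passes plain unfenced JSON through untouched, so it is safe to run on every response.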

  3. Handle blanks and ambiguity explicitly

    Tell the model what to return when a field is empty (null, an empty string, or a placeholder like 'N/A') and what to do when a value is unreadable (flag it versus skip it). Without this instruction, different models handle missing data differently, breaking any schema validation you have downstream.

  4. Ask for date and number normalization

    Forms filled by humans contain dates in a dozen formats and numbers written with inconsistent comma and decimal usage. Instruct the model to normalize dates to ISO 8601 (YYYY-MM-DD) and return currency values as plain numbers without symbols. This saves a transformation step later and makes your extracted data immediately joinable with other records.
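
If you prefer to normalize after extraction rather than trusting the model to do it, a small post-processing step works too. A sketch using only the Python standard library; the format list here is illustrative and should be extended to match the formats your forms actually contain:

```python
from datetime import datetime

# Formats commonly seen in human-filled forms; extend as needed.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y"]

def to_iso_date(value: str) -> str:
    """Normalize a date string to ISO 8601 (YYYY-MM-DD); raise if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def to_number(value: str) -> float:
    """Strip currency symbols and thousands separators: '$1,234.50' -> 1234.5."""
    return float(value.replace("$", "").replace(",", "").strip())

print(to_iso_date("April 12, 1990"))  # 1990-04-12
print(to_number("$7,200.00"))         # 7200.0
```

Raising on an unrecognized format, rather than guessing, keeps a bad value from slipping into your records unflagged.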

More example scenarios

#01 · Employee onboarding form extraction
Input
Here is a completed employee onboarding PDF. Extract all form field values and return them as a JSON object. The form includes fields for: Full Name, Date of Birth, Start Date, Job Title, Department, Emergency Contact Name, Emergency Contact Phone, and Direct Deposit Bank Name. If a field is blank, return null for that key.
Expected output
{"full_name": "Marcus T. Okafor", "date_of_birth": "1988-04-12", "start_date": "2024-09-02", "job_title": "Senior Analyst", "department": "Finance", "emergency_contact_name": "Adaeze Okafor", "emergency_contact_phone": "312-555-0174", "direct_deposit_bank_name": null}
#02 · Medical patient intake form
Input
This is a scanned patient intake form from a clinic. Extract the following fields and return them as a flat JSON object: Patient Name, Date of Birth, Insurance Provider, Policy Number, Primary Complaint, Referring Physician. Standardize dates to YYYY-MM-DD format. If the handwriting is unclear for any field, flag it with a 'review_needed' note instead of guessing.
Expected output
{"patient_name": "Linda Ferreira", "date_of_birth": "1972-11-03", "insurance_provider": "Blue Cross Blue Shield", "policy_number": "BCB-00482917", "primary_complaint": "Persistent lower back pain, 3 weeks", "referring_physician": "review_needed: handwriting unclear"}
#03 · Vendor invoice data extraction
Input
Extract the following data from this PDF invoice and return it as JSON: Vendor Name, Invoice Number, Invoice Date, Due Date, Line Items (as an array with description, quantity, unit price, and line total for each), Subtotal, Tax Amount, and Total Due.
Expected output
{"vendor_name": "Apex Office Supplies LLC", "invoice_number": "INV-20948", "invoice_date": "2024-08-15", "due_date": "2024-09-14", "line_items": [{"description": "Printer Paper A4 Case", "quantity": 10, "unit_price": 42.00, "line_total": 420.00}, {"description": "Ballpoint Pens Box", "quantity": 5, "unit_price": 8.50, "line_total": 42.50}], "subtotal": 462.50, "tax_amount": 36.99, "total_due": 499.49}
#04 · Real estate lease application
Input
This is a completed rental application PDF. Extract: Applicant Full Name, Current Address, Monthly Income, Employer Name, Years at Current Job, Co-Applicant Name (if present), Desired Move-In Date, and Unit Applied For. Return as JSON. Use null for any field left blank by the applicant.
Expected output
{"applicant_full_name": "Priya Sharma", "current_address": "2210 N Halsted St, Apt 4B, Chicago, IL 60614", "monthly_income": 7200, "employer_name": "Meridian Health Group", "years_at_current_job": 3.5, "co_applicant_name": null, "desired_move_in_date": "2024-10-01", "unit_applied_for": "Unit 12C"}
#05 · Customer feedback survey from PDF
Input
Extract all answered fields from this customer satisfaction survey PDF. The form uses a 1-5 rating scale for several questions and includes open text fields. Return a JSON object where each key is the question label and each value is the response given. Preserve the exact text of any open-ended answers.
Expected output
{"overall_satisfaction": 4, "product_quality": 5, "shipping_speed": 3, "customer_service": 4, "likelihood_to_recommend": 5, "open_feedback": "The packaging was excellent and the product arrived in perfect condition. Shipping took a day longer than expected but I was kept informed throughout."}

Common mistakes to avoid

  • Accepting output without schema validation

    AI models occasionally swap values between adjacent fields, especially on dense forms. If you pipe extracted JSON directly into a database without validating that each key exists and the value type matches your schema, bad data enters silently. Always run at least a basic type check on extracted output before writing it anywhere persistent.
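
A basic type check of this kind needs very little code. A sketch, using illustrative field names from the examples on this page rather than any fixed schema:

```python
# Expected keys and their allowed Python types; tuples allow alternatives.
EXPECTED_SCHEMA = {
    "full_name": str,
    "date_of_birth": str,
    "monthly_income": (int, float),
    "co_applicant_name": (str, type(None)),  # nullable field
}

def validate(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for key, expected_type in schema.items():
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], expected_type):
            problems.append(f"wrong type for {key}: {type(record[key]).__name__}")
    return problems

record = {"full_name": "Priya Sharma", "date_of_birth": "1990-04-12",
          "monthly_income": "7200", "co_applicant_name": None}
print(validate(record, EXPECTED_SCHEMA))  # ['wrong type for monthly_income: str']
```

Records that fail go to a review queue instead of the database; records that pass can be written automatically.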

  • Ignoring field label variations across form versions

    If your organization uses multiple versions of the same form, field labels often differ slightly between versions. Prompting for 'Policy Number' will miss the field on a version that labels it 'Policy ID' or 'Pol. No.'. Audit your form versions upfront and include all known label variants in your prompt, or instruct the model to match semantically similar labels.
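
One way to keep variant labels from fragmenting your schema is a small alias table applied after extraction. The variants below are illustrative; build the real table from an audit of your own form versions:

```python
# Map every known label variant to one canonical key.
LABEL_ALIASES = {
    "policy number": "policy_number",
    "policy id": "policy_number",
    "pol. no.": "policy_number",
    "patient name": "patient_name",
    "name of patient": "patient_name",
}

def canonicalize(extracted: dict) -> dict:
    """Rename extracted keys to canonical names; keep unknown labels as-is."""
    return {LABEL_ALIASES.get(k.strip().lower(), k): v
            for k, v in extracted.items()}

print(canonicalize({"Policy ID": "BCB-00482917"}))
# {'policy_number': 'BCB-00482917'}
```

Unknown labels pass through unchanged, which makes new form versions visible in the output instead of silently dropped.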

  • Assuming scanned PDFs work the same as digital ones

    Interactive PDFs with embedded AcroForm fields are fundamentally different from scanned image PDFs. A scanned PDF is effectively a photograph of a form. Extraction from scans depends entirely on OCR quality, and the model cannot access embedded field metadata because none exists. Set accuracy expectations accordingly and build a review step for high-stakes scanned data.

  • Not chunking multi-page forms

    Very long PDF forms that exceed a model's context window will be silently truncated, and the model will extract data only from the portion it received. If your form is more than 10-15 pages, split it into sections and extract each section separately, then merge the results. Never assume the model processed every page unless you verify the output covers all expected fields.
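
The chunk-and-merge step is plain bookkeeping once you have the per-page text. A sketch that splits pages into fixed-size chunks and merges per-chunk extractions, flagging any key that two chunks disagree on rather than silently overwriting it:

```python
def chunk_pages(pages: list, chunk_size: int = 10) -> list:
    """Split a list of page texts into fixed-size chunks for separate prompts."""
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]

def merge_results(chunks: list) -> dict:
    """Merge per-chunk extraction dicts; record conflicting duplicate keys."""
    merged, conflicts = {}, []
    for chunk in chunks:
        for key, value in chunk.items():
            if key in merged and merged[key] != value:
                conflicts.append(key)  # same field, different values
            merged[key] = value
    if conflicts:
        merged["_conflicts"] = conflicts  # surface for human review
    return merged

print(len(chunk_pages(list(range(23)), 10)))  # 3
```

After merging, check the result against your expected field list to confirm every page actually contributed.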

  • Using free-text prompts for repeating field groups

    Many forms have repeating sections -- multiple dependents, multiple line items, multiple previous employers. If your prompt does not explicitly ask the model to return these as an array, it will often return only the first instance or merge values together. Specify 'return as an array of objects' for any field group that can repeat.

Frequently asked questions

Can AI extract form fields from a scanned PDF, not just a digital one?

Yes, but with lower reliability than a native digital PDF. Scanned PDFs require the AI to perform OCR before extraction, and scan quality directly affects accuracy. Low-resolution scans, skewed pages, or handwritten fields all increase error rates. For high-stakes scanned data, build a human review step into your workflow for flagged or low-confidence fields.

What is the best output format for extracted PDF form data?

JSON is the most useful output format for most workflows because it preserves field names as keys, handles nested structures like line item arrays naturally, and integrates directly with APIs and databases without additional parsing. If your end destination is Excel or Google Sheets, ask for CSV instead. Avoid asking for plain prose descriptions of the data.

How do I extract form data from multiple PDFs at once?

Most AI models process one document at a time through a prompt interface. For batch processing, you need to loop over your files programmatically, sending each PDF with the same extraction prompt, and collect the structured outputs. Tools like the OpenAI API, Claude API, or no-code platforms with PDF connectors can automate this loop. For very large batches, purpose-built document intelligence services scale better.

Will the AI invent data if a field is blank on the form?

A well-instructed model should not hallucinate data for blank fields if you explicitly tell it to return null or a placeholder for empty fields. Without that instruction, some models will attempt to infer or fill in plausible values, which is dangerous for factual form data. Always include explicit blank-field handling in your prompt and spot-check outputs against source documents.

How is AI extraction different from using Adobe Acrobat to export form data?

Adobe Acrobat can export embedded AcroForm field data to FDF or XML formats natively, which is faster and more reliable than AI for interactive digital PDFs you created yourself. AI extraction becomes more valuable when you receive PDFs from external sources with inconsistent form versions, when forms are scanned rather than digital, or when you need to normalize and restructure the data as part of the extraction step.

How accurate is AI at extracting handwritten form fields?

Accuracy on handwritten fields varies significantly with handwriting clarity, typically ranging from roughly 85 to 95 percent on neat handwriting down to 60 to 75 percent on poor handwriting. For any use case where field-level accuracy matters, treat handwritten field extraction as draft output that requires a human review pass rather than a fully automated pipeline.
