Pull Form Field Values from Completed PDF Forms with AI

Tested prompts for extracting form field data from PDFs, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (9/10)

If you have a stack of completed PDF forms and need the data out of them, you already know the problem: PDFs are designed to display information, not share it. Whether someone filled out an AcroForm in Adobe Acrobat or printed and scanned a paper form, getting those field values into a spreadsheet, database, or downstream system is painful without the right approach.

AI-assisted extraction changes that workflow. Instead of manually copying values or wrestling with brittle regex scripts, you describe what you need and the model reads the form, identifies named fields, and returns structured data you can actually use. This works on standard interactive PDFs with embedded field metadata as well as flat, image-like PDFs where the form was printed and scanned.

This page shows you exactly how to prompt an AI model to pull form field data from a completed PDF, compares how five leading models handle the task, and gives you the practical context to judge when this approach fits your situation.

When to use this

This approach fits best when you have one or more completed PDF forms and need the field values in a structured format fast. It is especially useful when the forms vary slightly in layout across batches, when you lack access to the original form software, or when you need to process forms at volume without writing custom parsing code.

  • Extracting applicant data from submitted PDF job applications or intake forms
  • Pulling invoice line items, totals, and vendor details from PDF invoices received by email
  • Processing government or compliance forms where field names map to required database columns
  • Reading survey or registration forms returned by clients as completed PDF attachments
  • Digitizing scanned paper forms from legacy filing systems into structured records

When this format breaks down

  • The PDF contains hundreds of pages or thousands of form instances in a single batch run -- at that scale you need a dedicated document processing pipeline like AWS Textract, Azure Form Recognizer, or a purpose-built OCR service rather than a prompt-per-document AI approach.
  • You need legally certified, auditable extraction with a chain of custody record. AI extraction does not produce a verifiable log that would satisfy compliance requirements in regulated industries like healthcare billing or legal document processing without additional tooling around it.
  • The form fields are deeply nested inside complex tables or multi-column layouts with merged cells. AI models can misalign values to wrong fields in dense tabular structures, producing silently incorrect data.
  • Your PDF is a heavily degraded scan with low DPI, skewed pages, or significant ink bleed. OCR accuracy on poor-quality scans is unreliable regardless of the AI model used, and errors in extraction will propagate into your data without obvious flags.

The prompt we tested

You are an expert at extracting structured form field data from filled PDF forms. Analyze the PDF content provided below and extract every form field along with its user-entered value.

Instructions: Output a clean JSON object where each key is the exact field label from the PDF and each value is the user-entered response (use null for empty fields, true/false for checkboxes, and arrays for multi-select fields); group related fields under nested objects when the form has clear sections, and do not include any commentary outside the JSON.

PDF Form Content:
Filled PDF form titled 'Employee Onboarding Form' containing: Full Name: Sarah Chen, Date of Birth: 1990-04-12, Email: sarah.chen@example.com, Department: [X] Engineering [ ] Sales [ ] Marketing, Remote Work: Yes, Emergency Contact Name: David Chen, Emergency Contact Phone: 555-234-8891, T-Shirt Size: [ ] S [X] M [ ] L [ ] XL.

Return the extracted form field data now.

Model comparison

Model                    Quality  Cost     Latency  Tokens
Claude Opus 4.7          9        $0.0213  4.0s     353 / 214
Claude Haiku 4.5         9        $0.0008  924ms    234 / 110
GPT-5.4                  9        $0.0021  1.6s     217 / 101
Gemini 2.5 Pro           9        $0.0010  7.7s     228 / 138
Grok 4.1 Fast Reasoning  9        $0.0001  7.7s     365 / 85

Outputs

Claude Opus 4.7 9/10
{
  "Employee Onboarding Form": {
    "Personal Information": {
      "Full Name": "Sarah Chen",
      "Date of Birth": "1990-04-12",
      "Email": "sarah.chen@example.com"
    },
    "Employment Details": {
      "Department": "Engineering",
      "Remote Work": "Yes"
    },
    "Emergency Contact": {
      "Emergency Contact Name": "David Chen",
      "Emergency Contact Phone": "555-234-8891"
    },
    "Additional Information": {
      "T-Shirt Size": "M"
    }
  }
}
Judge notes: Excellent extraction with proper JSON structure, logical field grouping, and accurate value mapping; only minor deviation from instructions is using "Yes" instead of true for the boolean Remote Work field.
Claude Haiku 4.5 9/10
```json
{
  "Full Name": "Sarah Chen",
  "Date of Birth": "1990-04-12",
  "Email": "sarah.chen@example.com",
  "Department": "Engineering",
  "Remote Work": true,
  "Emergency Contact": {
    "Name": "David Chen",
    "Phone": "555-234-8891"
  },
  "T-Shirt Size": "M"
}
```
Judge notes: Excellently structured JSON output with proper nesting, accurate field extraction, correct data type handling (boolean for yes/no, string for selections), and clean formatting that fully satisfies the query intent with only minor deviation from checkbox array format requested.
GPT-5.4 9/10
{
  "Employee Onboarding Form": {
    "Full Name": "Sarah Chen",
    "Date of Birth": "1990-04-12",
    "Email": "sarah.chen@example.com",
    "Department": "Engineering",
    "Remote Work": true,
    "Emergency Contact": {
      "Name": "David Chen",
      "Phone": "555-234-8891"
    },
    "T-Shirt Size": "M"
  }
}
Judge notes: Excellently extracts all form field data into clean, well-organized JSON with appropriate data types and logical nesting; minor deduction only for wrapping the entire form under a top-level "Employee Onboarding Form" key when the instructions implied direct field mapping.
Gemini 2.5 Pro 9/10
```json
{
  "Personal Information": {
    "Full Name": "Sarah Chen",
    "Date of Birth": "1990-04-12",
    "Email": "sarah.chen@example.com"
  },
  "Employment Details": {
    "Department": "Engineering",
    "Remote Work": "Yes"
  },
  "Emergency Contact": {
    "Name": "David Chen",
    "Phone": "555-234-8891"
  },
  "T-Shirt Size": "M"
}
```
Judge notes: Excellent extraction with proper JSON structure, correct field values, and logical grouping into sections; only minor point deduction for not explicitly showing checkbox/boolean conversion methodology in the output format.
Grok 4.1 Fast Reasoning 9/10
{
  "Full Name": "Sarah Chen",
  "Date of Birth": "1990-04-12",
  "Email": "sarah.chen@example.com",
  "Department": ["Engineering"],
  "Remote Work": "Yes",
  "Emergency Contact Name": "David Chen",
  "Emergency Contact Phone": "555-234-8891",
  "T-Shirt Size": ["M"]
}
Judge notes: Excellent extraction with proper JSON formatting and correct data types, though checkbox fields could have used true/false boolean values per the stated instructions rather than string/array alternatives.

What makes these work

  1. Name every field explicitly

    Telling the model exactly which fields to extract, using the same labels that appear on the form, dramatically reduces hallucination and missed values. Vague instructions like 'get all the data' produce inconsistent key names across documents. Explicit field lists give you a predictable schema you can pipe directly into a database or CSV.

  2. Specify your output format upfront

    Ask for JSON, CSV, or a specific table structure in the same sentence where you describe the task. Models will default to prose descriptions of the data if you do not constrain the format. For downstream automation, JSON with consistent key names is almost always the right choice -- it requires no further parsing.
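
One caveat: models often wrap JSON in Markdown code fences (two of the outputs above do exactly that), so a small cleanup step before parsing is worth having. A minimal sketch in Python:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Strip optional Markdown code fences, then parse the JSON payload."""
    # Models sometimes return ```json ... ``` around the object they were
    # asked for; remove an opening fence at the start and a closing fence
    # at the end before handing the string to json.loads.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    return json.loads(cleaned)

fenced = '```json\n{"Full Name": "Sarah Chen"}\n```'
print(parse_model_json(fenced)["Full Name"])  # Sarah Chen
```

The same function passes plain unfenced JSON through untouched, so it is safe to run on every response.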

  3. Handle blanks and ambiguity explicitly

    Tell the model what to return when a field is empty (null, an empty string, or a placeholder like 'N/A') and what to do when a value is unreadable (flag it versus skip it). Without this instruction, different models handle missing data differently, breaking any schema validation you have downstream.

  4. Ask for date and number normalization

    Forms filled by humans contain dates in a dozen formats and numbers written with inconsistent comma and decimal usage. Instruct the model to normalize dates to ISO 8601 (YYYY-MM-DD) and return currency values as plain numbers without symbols. This saves a transformation step later and makes your extracted data immediately joinable with other records.
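
If you prefer to normalize after extraction rather than trusting the model to do it, a small post-processing step works too. A sketch using only the Python standard library; the format list here is illustrative and should be extended to match the formats your forms actually contain:

```python
from datetime import datetime

# Formats commonly seen in human-filled forms; extend as needed.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y"]

def to_iso_date(value: str) -> str:
    """Normalize a date string to ISO 8601 (YYYY-MM-DD); raise if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def to_number(value: str) -> float:
    """Strip currency symbols and thousands separators: '$1,234.50' -> 1234.5."""
    return float(value.replace("$", "").replace(",", "").strip())

print(to_iso_date("April 12, 1990"))  # 1990-04-12
print(to_number("$7,200.00"))         # 7200.0
```

Raising on an unrecognized format, rather than guessing, keeps a bad value from slipping into your records unflagged.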

More example scenarios

#01 · Employee onboarding form extraction
Input
Here is a completed employee onboarding PDF. Extract all form field values and return them as a JSON object. The form includes fields for: Full Name, Date of Birth, Start Date, Job Title, Department, Emergency Contact Name, Emergency Contact Phone, and Direct Deposit Bank Name. If a field is blank, return null for that key.
Expected output
{"full_name": "Marcus T. Okafor", "date_of_birth": "1988-04-12", "start_date": "2024-09-02", "job_title": "Senior Analyst", "department": "Finance", "emergency_contact_name": "Adaeze Okafor", "emergency_contact_phone": "312-555-0174", "direct_deposit_bank_name": null}
#02 · Medical patient intake form
Input
This is a scanned patient intake form from a clinic. Extract the following fields and return them as a flat JSON object: Patient Name, Date of Birth, Insurance Provider, Policy Number, Primary Complaint, Referring Physician. Standardize dates to YYYY-MM-DD format. If the handwriting is unclear for any field, flag it with a 'review_needed' note instead of guessing.
Expected output
{"patient_name": "Linda Ferreira", "date_of_birth": "1972-11-03", "insurance_provider": "Blue Cross Blue Shield", "policy_number": "BCB-00482917", "primary_complaint": "Persistent lower back pain, 3 weeks", "referring_physician": "review_needed: handwriting unclear"}
#03 · Vendor invoice data extraction
Input
Extract the following data from this PDF invoice and return it as JSON: Vendor Name, Invoice Number, Invoice Date, Due Date, Line Items (as an array with description, quantity, unit price, and line total for each), Subtotal, Tax Amount, and Total Due.
Expected output
{"vendor_name": "Apex Office Supplies LLC", "invoice_number": "INV-20948", "invoice_date": "2024-08-15", "due_date": "2024-09-14", "line_items": [{"description": "Printer Paper A4 Case", "quantity": 10, "unit_price": 42.00, "line_total": 420.00}, {"description": "Ballpoint Pens Box", "quantity": 5, "unit_price": 8.50, "line_total": 42.50}], "subtotal": 462.50, "tax_amount": 36.99, "total_due": 499.49}
#04 · Real estate lease application
Input
This is a completed rental application PDF. Extract: Applicant Full Name, Current Address, Monthly Income, Employer Name, Years at Current Job, Co-Applicant Name (if present), Desired Move-In Date, and Unit Applied For. Return as JSON. Use null for any field left blank by the applicant.
Expected output
{"applicant_full_name": "Priya Sharma", "current_address": "2210 N Halsted St, Apt 4B, Chicago, IL 60614", "monthly_income": 7200, "employer_name": "Meridian Health Group", "years_at_current_job": 3.5, "co_applicant_name": null, "desired_move_in_date": "2024-10-01", "unit_applied_for": "Unit 12C"}
#05 · Customer feedback survey from PDF
Input
Extract all answered fields from this customer satisfaction survey PDF. The form uses a 1-5 rating scale for several questions and includes open text fields. Return a JSON object where each key is the question label and each value is the response given. Preserve the exact text of any open-ended answers.
Expected output
{"overall_satisfaction": 4, "product_quality": 5, "shipping_speed": 3, "customer_service": 4, "likelihood_to_recommend": 5, "open_feedback": "The packaging was excellent and the product arrived in perfect condition. Shipping took a day longer than expected but I was kept informed throughout."}

Common mistakes to avoid

  • Accepting output without schema validation

    AI models occasionally swap values between adjacent fields, especially on dense forms. If you pipe extracted JSON directly into a database without validating that each key exists and the value type matches your schema, bad data enters silently. Always run at least a basic type check on extracted output before writing it anywhere persistent.
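
A basic type check of this kind needs very little code. A sketch, using illustrative field names from the examples on this page rather than any fixed schema:

```python
# Expected keys and their allowed Python types; tuples allow alternatives.
EXPECTED_SCHEMA = {
    "full_name": str,
    "date_of_birth": str,
    "monthly_income": (int, float),
    "co_applicant_name": (str, type(None)),  # nullable field
}

def validate(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for key, expected_type in schema.items():
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], expected_type):
            problems.append(f"wrong type for {key}: {type(record[key]).__name__}")
    return problems

record = {"full_name": "Priya Sharma", "date_of_birth": "1990-04-12",
          "monthly_income": "7200", "co_applicant_name": None}
print(validate(record, EXPECTED_SCHEMA))  # ['wrong type for monthly_income: str']
```

Records that fail go to a review queue instead of the database; records that pass can be written automatically.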

  • Ignoring field label variations across form versions

    If your organization uses multiple versions of the same form, field labels often differ slightly between versions. Prompting for 'Policy Number' will miss the field on a version that labels it 'Policy ID' or 'Pol. No.'. Audit your form versions upfront and include all known label variants in your prompt, or instruct the model to match semantically similar labels.
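
One way to keep variant labels from fragmenting your schema is a small alias table applied after extraction. The variants below are illustrative; build the real table from an audit of your own form versions:

```python
# Map every known label variant to one canonical key.
LABEL_ALIASES = {
    "policy number": "policy_number",
    "policy id": "policy_number",
    "pol. no.": "policy_number",
    "patient name": "patient_name",
    "name of patient": "patient_name",
}

def canonicalize(extracted: dict) -> dict:
    """Rename extracted keys to canonical names; keep unknown labels as-is."""
    return {LABEL_ALIASES.get(k.strip().lower(), k): v
            for k, v in extracted.items()}

print(canonicalize({"Policy ID": "BCB-00482917"}))
# {'policy_number': 'BCB-00482917'}
```

Unknown labels pass through unchanged, which makes new form versions visible in the output instead of silently dropped.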

  • Assuming scanned PDFs work the same as digital ones

    Interactive PDFs with embedded AcroForm fields are fundamentally different from scanned image PDFs. A scanned PDF is effectively a photograph of a form. Extraction from scans depends entirely on OCR quality, and the model cannot access embedded field metadata because none exists. Set accuracy expectations accordingly and build a review step for high-stakes scanned data.

  • Not chunking multi-page forms

    Very long PDF forms that exceed a model's context window will be silently truncated, and the model will extract data only from the portion it received. If your form is more than 10-15 pages, split it into sections and extract each section separately, then merge the results. Never assume the model processed every page unless you verify the output covers all expected fields.
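
The chunk-and-merge step is plain bookkeeping once you have the per-page text. A sketch that splits pages into fixed-size chunks and merges per-chunk extractions, flagging any key that two chunks disagree on rather than silently overwriting it:

```python
def chunk_pages(pages: list, chunk_size: int = 10) -> list:
    """Split a list of page texts into fixed-size chunks for separate prompts."""
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]

def merge_results(chunks: list) -> dict:
    """Merge per-chunk extraction dicts; record conflicting duplicate keys."""
    merged, conflicts = {}, []
    for chunk in chunks:
        for key, value in chunk.items():
            if key in merged and merged[key] != value:
                conflicts.append(key)  # same field, different values
            merged[key] = value
    if conflicts:
        merged["_conflicts"] = conflicts  # surface for human review
    return merged

print(len(chunk_pages(list(range(23)), 10)))  # 3
```

After merging, check the result against your expected field list to confirm every page actually contributed.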

  • Using free-text prompts for repeating field groups

    Many forms have repeating sections -- multiple dependents, multiple line items, multiple previous employers. If your prompt does not explicitly ask the model to return these as an array, it will often return only the first instance or merge values together. Specify 'return as an array of objects' for any field group that can repeat.

Frequently asked questions

Can AI extract form fields from a scanned PDF, not just a digital one?

Yes, but with lower reliability than a native digital PDF. Scanned PDFs require the AI to perform OCR before extraction, and scan quality directly affects accuracy. Low-resolution scans, skewed pages, or handwritten fields all increase error rates. For high-stakes scanned data, build a human review step into your workflow for flagged or low-confidence fields.

What is the best output format for extracted PDF form data?

JSON is the most useful output format for most workflows because it preserves field names as keys, handles nested structures like line item arrays naturally, and integrates directly with APIs and databases without additional parsing. If your end destination is Excel or Google Sheets, ask for CSV instead. Avoid asking for plain prose descriptions of the data.

How do I extract form data from multiple PDFs at once?

Most AI models process one document at a time through a prompt interface. For batch processing, you need to loop over your files programmatically, sending each PDF with the same extraction prompt, and collect the structured outputs. Tools like the OpenAI API, Claude API, or no-code platforms with PDF connectors can automate this loop. For very large batches, purpose-built document intelligence services scale better.

Will the AI invent data if a field is blank on the form?

A well-instructed model should not hallucinate data for blank fields if you explicitly tell it to return null or a placeholder for empty fields. Without that instruction, some models will attempt to infer or fill in plausible values, which is dangerous for factual form data. Always include explicit blank-field handling in your prompt and spot-check outputs against source documents.

How is AI extraction different from using Adobe Acrobat to export form data?

Adobe Acrobat can export embedded AcroForm field data to FDF or XML formats natively, which is faster and more reliable than AI for interactive digital PDFs you created yourself. AI extraction becomes more valuable when you receive PDFs from external sources with inconsistent form versions, when forms are scanned rather than digital, or when you need to normalize and restructure the data as part of the extraction step.

How accurate is AI at extracting handwritten form fields?

Accuracy on handwritten fields varies significantly with handwriting clarity, typically ranging from roughly 85 to 95 percent on neat handwriting down to 60 to 75 percent on poor handwriting. For any use case where field-level accuracy matters, treat handwritten field extraction as draft output that requires a human review pass rather than a fully automated pipeline.
