AI Extraction for W-2, 1099, and Other Tax Form PDFs

Tested prompts for extracting data from tax form PDFs, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (9/10)

If you have a stack of W-2s, 1099-NECs, 1099-INTs, or other tax PDFs and need the numbers out of them fast, you are dealing with a structured extraction problem. The forms are standardized by the IRS, which means the field positions and labels are predictable — making them a strong candidate for AI-assisted extraction rather than manual keying or brittle template-based OCR tools.

Most people searching for this are doing one of three things: processing their own documents for a tax preparer, building an internal workflow to handle client documents at a firm, or automating data ingestion into accounting software like QuickBooks, Xero, or a custom spreadsheet. In all three cases, the goal is the same — get structured, accurate field values out of a PDF without typing them by hand.

This page walks through exactly how to prompt an AI model to extract tax form data, which fields to target, where models perform well, and where they can fail. The tested prompts and model comparison below give you something you can copy and adapt immediately.

When to use this

AI extraction works well when your tax PDFs are digitally generated or cleanly scanned, and when you need consistent field-level output across a batch of documents. It is especially useful if you are dealing with multiple form types and do not want to build a separate template for each one.

  • Processing 10 to 500 W-2 or 1099 PDFs per tax season without manual data entry
  • Extracting specific fields — EIN, SSN (masked), Box 1 wages, Box 3 Social Security wages — for import into payroll or accounting software
  • Building a client onboarding workflow at a tax or accounting firm where clients upload their own forms
  • Validating that totals on submitted tax documents match what a client reported on an intake form
  • Pulling 1099-INT or 1099-DIV data into a spreadsheet to reconcile investment income across multiple brokerages

When this format breaks down

  • Handwritten or heavily degraded scanned forms — AI models struggle with low-resolution scans below 150 DPI and will produce hallucinated field values that look plausible but are wrong
  • When you need IRS-audit-grade accuracy without a human review step — AI extraction should be treated as a first pass, not a final record, for any document with legal or financial consequences
  • Processing more than a few hundred documents per day, where a dedicated IDP (intelligent document processing) platform like AWS Textract, Google Document AI, or Azure AI Document Intelligence (formerly Form Recognizer) is more cost-effective and auditable than chat-based AI prompting
  • Forms that are password-protected or contain embedded images rather than selectable text — the model receives no usable input and will either refuse or fabricate output
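
A quick pre-check can catch the empty or garbled text layers described above before any model call. The sketch below is a heuristic only: the function name, the character whitelist, and both thresholds are assumptions for illustration, not values tested on this page.

```python
import re

# Punctuation treated as "readable" on a tax form (an illustrative choice).
_PUNCT = set(".,$-:/()#'&%")


def text_layer_looks_usable(text: str, min_chars: int = 200, min_ratio: float = 0.9) -> bool:
    """Heuristic: True if extracted PDF text is long enough, mostly readable,
    and mentions a tax form identifier. Thresholds are assumed, not measured."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False  # empty or near-empty layer, typical of image-only PDFs
    readable = sum(ch.isalnum() or ch.isspace() or ch in _PUNCT for ch in stripped)
    if readable / len(stripped) < min_ratio:
        return False  # mostly junk characters, typical of a poor OCR pass
    return re.search(r"\b(W-2|1099|1098)\b", stripped) is not None
```

Documents that fail the check are better routed to OCR or manual review than sent to the model.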

The prompt we tested

You are a tax document data extraction specialist. Extract structured data from the tax form content provided below.

Follow these instructions exactly:
Identify the tax form type (e.g., W-2, 1099-NEC, 1099-MISC, 1099-INT, 1098, etc.) and output a structured JSON object containing every labeled field with its box number, field name, and value, using null for any field that is missing or illegible. Preserve exact formatting for EINs, SSNs (masked as XXX-XX-####), dollar amounts (as numbers without $ or commas), and tax year; after the JSON, add a brief 'notes' array flagging any ambiguous values, OCR concerns, or multi-form documents.

Tax form content:
Form W-2 Wage and Tax Statement 2023. Employer: Acme Corp, EIN 12-3456789, 100 Main St, Springfield IL 62701. Employee: Jane Doe, SSN XXX-XX-4567, 22 Oak Ave, Springfield IL 62702. Box 1 Wages: $78,450.00, Box 2 Federal income tax withheld: $11,230.50, Box 3 Social security wages: $78,450.00, Box 4 SS tax withheld: $4,863.90, Box 5 Medicare wages: $78,450.00, Box 6 Medicare tax: $1,137.53, Box 15 State: IL, Box 16 State wages: $78,450.00, Box 17 State income tax: $3,884.28.

Return the extracted data now.

Model comparison

| Model | Quality | Cost | Latency | Tokens |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | 9 | $0.0821 | 11.5s | 516 / 991 |
| Claude Haiku 4.5 | 9 | $0.0029 | 2.6s | 387 / 495 |
| GPT-5.4 | 9 | $0.0122 | 8.7s | 347 / 757 |
| Gemini 2.5 Pro | 8 | $0.0035 | 11.1s | 403 / 606 |
| Grok 4.1 Fast Reasoning | 9 | $0.0003 | 6.9s | 492 / 366 |

Outputs

Claude Opus 4.7 9/10
```json
{
  "form_type": "W-2",
  "tax_year": "2023",
  "employer": {
    "name": "Acme Corp",
    "ein": "12-3456789",
    "address": "100 Main St, Springfield IL 62701"
  },
  "employee": {
    "name": "Jane Doe",
    "ssn": "XXX-XX-4567",
    "address": "22 Oak Ave, Springfield IL 62702"
  },
  "fields": [
    { "box": "1", "name": "Wages, tips, other compensation", "value": 78450.00 },
    { "box": "2", "name": "Federal income tax withheld", "value": 11230.50 },
    { "box": "3", "name": "Social security wages", "value": 78450.00 },
    { "box": "4", "name": "Social security tax withheld", "value": 4863.90 },
    { "box": "5", "name": "Medicare wages and tips", "value": 78450.00 },
    { "box": "6", "name": "Medicare tax withheld", "value": 1137.53 },
    { "box": "7", "name": "Social security tips", "value": null },
    { "box": "8", "name": "Allocated tips", "value": null },
    { "box": "10", "name": "Dependent care benefits", "value": null },
    { "box": "11", "name": "Nonqualified plans", "value": null },
    { "box": "12", "name": "Codes", "value": null },
    { "box": "13", "name": "Statutory employee / Retirement plan / Third-party sick pay", "value": null },
    { "box": "14", "name": "Other", "value": null },
    { "box": "15", "name": "State / Employer's state ID number", "value": "IL" },
    { "box": "16", "name": "State wages, tips, etc.", "value": 78450.00 },
    { "box": "17", "name": "State income tax", "value": 3884.28 },
    { "box": "18", "name": "Local wages, tips, etc.", "value": null },
    { "box": "19", "name": "Local income tax", "value": null },
    { "box": "20", "name": "Locality name", "value": null }
  ],
  "notes": [
    "Box 4 SS tax withheld ($4,863.90) does not equal the standard 6.2% of Box 3 ($78,450.00 × 6.2% = $4,863.90) — matches correctly; no discrepancy.",
    "Box 6 Medicare tax ($1,137.53) matches expected 1.45% of Box 5.",
    "Employer's state ID number (part of Box 15) not provided in source; only state abbreviation 'IL' captured.",
    "Boxes 7, 8, 9 (obsolete), 10, 11, 12, 13, 14, and 18–20 were not present in the source and are set to null.",
    "Control number (Box a) and employer/employee identifying box letters (a–f) were not explicitly labeled in the source text."
  ]
}
```
Judge notes: The extraction is accurate, well-structured, and comprehensively addresses the query with proper JSON formatting, correct field mappings, thoughtful null handling, and helpful validation notes that add credibility.
Claude Haiku 4.5 9/10
```json
{
  "form_type": "W-2",
  "tax_year": 2023,
  "employer": {
    "name": "Acme Corp",
    "ein": "12-3456789",
    "address": "100 Main St, Springfield IL 62701"
  },
  "employee": {
    "name": "Jane Doe",
    "ssn": "XXX-XX-4567",
    "address": "22 Oak Ave, Springfield IL 62702"
  },
  "boxes": {
    "box_1": {
      "label": "Wages, tips, other compensation",
      "value": 78450.00
    },
    "box_2": {
      "label": "Federal income tax withheld",
      "value": 11230.50
    },
    "box_3": {
      "label": "Social security wages",
      "value": 78450.00
    },
    "box_4": {
      "label": "Social security tax withheld",
      "value": 4863.90
    },
    "box_5": {
      "label": "Medicare wages and tips",
      "value": 78450.00
    },
    "box_6": {
      "label": "Medicare tax withheld",
      "value": 1137.53
    },
    "box_15": {
      "label": "State",
      "value": "IL"
    },
    "box_16": {
      "label": "State wages, tips, etc.",
      "value": 78450.00
    },
    "box_17": {
      "label": "State income tax",
      "value": 3884.28
    }
  },
  "notes": [
    "All required W-2 fields present and legible",
    "SSN properly masked in XXX-XX-#### format",
    "EIN and address information complete",
    "All monetary values extracted without formatting symbols"
  ]
}
```
Judge notes: Excellent extraction with proper JSON structure, correct field mapping, masked SSN, accurate numeric formatting, and appropriate notes; only minor gap is lacking explicit null values for unmentioned optional boxes that could exist on a standard W-2.
GPT-5.4 9/10
{
  "form_type": "W-2",
  "tax_year": "2023",
  "employer": {
    "name": "Acme Corp",
    "ein": "12-3456789",
    "address": "100 Main St, Springfield IL 62701"
  },
  "employee": {
    "name": "Jane Doe",
    "ssn": "XXX-XX-4567",
    "address": "22 Oak Ave, Springfield IL 62702"
  },
  "fields": [
    {
      "box": "1",
      "field_name": "Wages, tips, other compensation",
      "value": 78450.00
    },
    {
      "box": "2",
      "field_name": "Federal income tax withheld",
      "value": 11230.50
    },
    {
      "box": "3",
      "field_name": "Social security wages",
      "value": 78450.00
    },
    {
      "box": "4",
      "field_name": "Social security tax withheld",
      "value": 4863.90
    },
    {
      "box": "5",
      "field_name": "Medicare wages and tips",
      "value": 78450.00
    },
    {
      "box": "6",
      "field_name": "Medicare tax withheld",
      "value": 1137.53
    },
    {
      "box": "7",
      "field_name": "Social security tips",
      "value": null
    },
    {
      "box": "8",
      "field_name": "Allocated tips",
      "value": null
    },
    {
      "box": "9",
      "field_name": "Verification code",
      "value": null
    },
    {
      "box": "10",
      "field_name": "Dependent care benefits",
      "value": null
    },
    {
      "box": "11",
      "field_name": "Nonqualified plans",
      "value": null
    },
    {
      "box": "12",
      "field_name": "Codes",
      "value": null
    },
    {
      "box": "13",
      "field_name": "Statutory employee / Retirement plan / Third-party sick pay",
      "value": null
    },
    {
      "box": "14",
      "field_name": "Other",
      "value": null
    },
    {
      "box": "15",
      "field_name": "State / Employer's state ID number",
      "value": "IL"
    },
    {
      "box": "16",
      "field_name": "State wages, tips, etc.",
      "value": 78450.00
    },
    {
      "box": "17",
      "field_name": "State income tax",
      "value": 3884.28
    },
    {
      "box": "18",
      "field_name": "Local wages, tips, etc.",
      "value": null
    },
    {
      "box": "19",
      "field_name": "Local income tax",
      "value": null
    },
    {
      "box": "20",
      "field_name": "Locality name",
      "value": null
    }
  ],
  "notes": [
    "Box 15 on Form W-2 typically includes both state abbreviation and employer state ID number; only state 'IL' was provided.",
    "Single W-2 form detected."
  ]
}
Judge notes: Excellent extraction with proper JSON structure, accurate field mapping, correct data formatting, comprehensive null handling for missing fields, and appropriate contextual notes—minor deduction only for not explicitly flagging that employer state ID number was absent from Box 15.
Gemini 2.5 Pro 8/10
```json
{
  "form_type": "W-2",
  "tax_year": 2023,
  "employer": {
    "name": "Acme Corp",
    "ein": "12-3456789",
    "address": {
      "street": "100 Main St",
      "city": "Springfield",
      "state": "IL",
      "zip": "62701"
    }
  },
  "employee": {
    "name": "Jane Doe",
    "ssn": "XXX-XX-4567",
    "address": {
      "street": "22 Oak Ave",
      "city": "Springfield",
      "state": "IL",
      "zip": "62702"
    }
  },
  "boxes": [
    {
      "box_number": "1",
      "field_name": "Wages, tips, other compensation",
      "value": 78450.00
    },
    {
      "box_number": "2",
      "field_name": "Federal income tax withheld",
      "value": 11230.50
    },
    {
      "box_number": "3",
      "field_name": "Social security wages",
      "value": 78450.00
    },
    {
      "box_number": "4",
      "field_name": "Social security tax withheld",
      "value": 4863.90
    },
    {
      "box_number": "5",
      "field_name": "Medicare wages and tips",
      "value": 78450.00
    },
    {
      "box_number": "6",
      "field_name": "Medicare tax withheld",
      "value": 1137.53
    },
    {
      "box_number": "15",
      "field_name": "State",
      "value": "IL"
    },
    {
      "box_number": "16",
      "field_name": "State wages, tips, etc.",
      "value": 78450.00
    },
    {
      "box_number": "17",
      "field_name": "State income tax",
      "value": 3884.28
    }
  ]
}
```
```json
{
  "notes": []
}
```
Judge notes: Excellent structural extraction and JSON formatting with proper field mapping, though the output lacks the requested 'notes' array with flagging details and could have included more comprehensive validation checks.
Grok 4.1 Fast Reasoning 9/10
{
  "form_type": "W-2",
  "tax_year": 2023,
  "employer": {
    "name": "Acme Corp",
    "ein": "12-3456789",
    "address": "100 Main St, Springfield IL 62701"
  },
  "employee": {
    "name": "Jane Doe",
    "ssn": "XXX-XX-4567",
    "address": "22 Oak Ave, Springfield IL 62702"
  },
  "boxes": {
    "1": {
      "field_name": "Wages",
      "value": 78450.00
    },
    "2": {
      "field_name": "Federal income tax withheld",
      "value": 11230.50
    },
    "3": {
      "field_name": "Social security wages",
      "value": 78450.00
    },
    "4": {
      "field_name": "SS tax withheld",
      "value": 4863.90
    },
    "5": {
      "field_name": "Medicare wages",
      "value": 78450.00
    },
    "6": {
      "field_name": "Medicare tax",
      "value": 1137.53
    },
    "15": {
      "field_name": "State",
      "value": "IL"
    },
    "16": {
      "field_name": "State wages",
      "value": 78450.00
    },
    "17": {
      "field_name": "State income tax",
      "value": 3884.28
    }
  }
}
["Clear extraction, single W-2 form, no OCR concerns or ambiguities."]
Judge notes: Excellent extraction with proper JSON structure, accurate field mapping, correct data formatting, and appropriate notes; only minor deduction for not explicitly flagging the SSN masking as a security preservation detail in notes.
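
The five outputs above carry the same values in four different shapes: a `fields` list keyed by `box`, a `boxes` list keyed by `box_number`, a `boxes` object keyed by `box_1`, and a `boxes` object keyed by `"1"`. If you compare or switch models, a small normalizer keeps downstream code from depending on any one shape. This is a minimal sketch that handles only the shapes seen on this page; `normalize_boxes` is a hypothetical helper name.

```python
from typing import Any


def normalize_boxes(result: dict[str, Any]) -> dict[str, Any]:
    """Flatten one extraction result into {box_number: value}."""
    container = result.get("fields", result.get("boxes", []))
    flat: dict[str, Any] = {}
    if isinstance(container, dict):
        # Shapes like {"box_1": {...}} or {"1": {...}}
        for key, entry in container.items():
            flat[key.removeprefix("box_")] = entry.get("value")
    else:
        # Shapes like [{"box": "1", ...}] or [{"box_number": "1", ...}]
        for entry in container:
            box = entry.get("box", entry.get("box_number"))
            flat[str(box)] = entry.get("value")
    return flat
```

Pinning the exact schema in the prompt is still the more robust fix; the normalizer is a fallback for output you do not control.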

What makes these work

  1. Name every field you want

    Do not ask the model to extract 'all the data from the form.' Tax forms have dozens of boxes and most runs will miss or skip fields if you are vague. List each field by its official box number and label. This also makes output validation easier because you know exactly what keys to expect in the JSON.

  2. Specify the output format before the document

    Put your JSON schema or field list in the prompt before the document text. Models tend to follow structure instructions more reliably when they see the expected format first. Include instructions for null values when a box is blank — otherwise models sometimes omit the key entirely, which breaks downstream parsing.

  3. Mask sensitive data before sending to third-party APIs

    Full SSNs and EINs appear on tax forms. Before passing raw PDF text to any external AI API, redact or mask the full SSN to last-4 only at the extraction or pre-processing step. Include this instruction explicitly in your prompt so the model does not surface the full number in its output.

  4. Use a PDF text extraction layer first

    Text-only models need text, not images. Run your PDF through a text extraction library like pdfplumber or PyMuPDF, or a service like Adobe's PDF Extract API, to get the text layer, then pass that text to the model. Sending page images to a vision-capable model also works at low volume, but it tends to be slower and more expensive, and it can be less accurate on dense numerical tables.
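
Three of these points (named fields, format before document, masking before transmission) can be combined in one small pre-processing step. The sketch below is illustrative: `mask_ssns` and `build_prompt` are hypothetical helpers, the prompt wording is an assumption rather than the prompt tested above, and the pattern only catches SSNs written with hyphens.

```python
import re

# Matches hyphenated SSNs only (123-45-6789). Undelimited 9-digit SSNs are
# NOT caught; that is a known limitation of this sketch.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssns(text: str) -> str:
    """Replace each hyphenated SSN with XXX-XX-#### (last four digits kept)."""
    return SSN_PATTERN.sub(r"XXX-XX-\1", text)


def build_prompt(fields: list[str], document_text: str) -> str:
    """Put the field list and null rule first, then the masked document text."""
    field_lines = "\n".join(f"- {name}" for name in fields)
    return (
        "Extract the following fields from the tax form and return one JSON object.\n"
        "Use null for any field that is blank or illegible.\n"
        f"Fields:\n{field_lines}\n\n"
        f"Tax form content:\n{mask_ssns(document_text)}"
    )
```

Because masking happens before the prompt is assembled, the full SSN never leaves your machine even if the model ignores the masking instruction.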

More example scenarios

#01 · Extracting key fields from a W-2 for payroll reconciliation
Input
Here is the text extracted from an employee W-2. Pull out the following fields and return them as JSON: employer name, employer EIN, employee name, employee SSN (last 4 digits only), Box 1 wages, Box 2 federal income tax withheld, Box 3 social security wages, Box 4 social security tax withheld, Box 5 Medicare wages, Box 6 Medicare tax withheld, state in Box 15, and Box 16 state wages.
Expected output
{"employer_name": "Acme Distribution LLC", "employer_ein": "45-1234567", "employee_name": "Jordan M. Rivera", "employee_ssn_last4": "8821", "box_1_wages": 74200.00, "box_2_federal_withheld": 9856.00, "box_3_ss_wages": 74200.00, "box_4_ss_withheld": 4600.40, "box_5_medicare_wages": 74200.00, "box_6_medicare_withheld": 1075.90, "box_15_state": "OH", "box_16_state_wages": 74200.00}
#02 · Pulling 1099-NEC contractor income for an accounting firm's client intake
Input
Extract the following from this 1099-NEC PDF text: payer name, payer TIN, recipient name, recipient TIN (masked), Box 1 nonemployee compensation, and Box 4 federal income tax withheld. Return a clean JSON object. If a field is blank on the form, return null.
Expected output
{"payer_name": "Brightline Consulting Group", "payer_tin": "82-4471009", "recipient_name": "Sasha T. Okonkwo", "recipient_tin_masked": "***-**-4432", "box_1_nonemployee_compensation": 18500.00, "box_4_federal_withheld": null}
#03 · Reconciling 1099-INT interest income from multiple bank PDFs
Input
I have the text from three 1099-INT forms from different banks. For each one, extract: payer name, Box 1 interest income, Box 4 federal tax withheld, and Box 8 tax-exempt interest. Return an array of JSON objects, one per form.
Expected output
[{"payer_name": "First National Bank", "box_1_interest": 412.87, "box_4_withheld": 0, "box_8_tax_exempt": 0}, {"payer_name": "Ally Bank", "box_1_interest": 1204.33, "box_4_withheld": 240.87, "box_8_tax_exempt": 0}, {"payer_name": "Vanguard Brokerage", "box_1_interest": 89.10, "box_4_withheld": 0, "box_8_tax_exempt": 3200.00}]
#04 · Small business owner extracting 1099-K data to report gross payment volume
Input
From this 1099-K PDF, extract: payment settlement entity name, merchant category code if shown, Box 1a gross amount of payment card transactions, Box 1b card-not-present transactions, and the tax year this form covers. Return JSON.
Expected output
{"pse_name": "Stripe Payments Company", "merchant_category_code": "5045", "box_1a_gross_transactions": 94330.00, "box_1b_card_not_present": 94330.00, "tax_year": 2023}
#05 · CPA firm validating student loan interest reported on 1098-E
Input
Extract the following from this 1098-E student loan interest statement: lender name, lender TIN, borrower name, and Box 1 student loan interest received. Confirm whether Box 2 is checked indicating origination fees are not included. Return JSON.
Expected output
{"lender_name": "EdFinancial Services", "lender_tin": "62-1559700", "borrower_name": "Marcus D. Patel", "box_1_interest": 1847.52, "box_2_origination_fees_excluded": false}
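
For W-2 output like scenario #01, a cheap arithmetic check can flag misread digits: Box 4 is normally 6.2% of Box 3, and Box 6 is normally 1.45% of Box 5. The sketch below uses the key names from scenario #01; the function name and tolerance are assumptions. It ignores the Social Security wage base limit and Additional Medicare Tax on high earners, so a mismatch is a reason to review the document, not proof of an extraction error.

```python
from decimal import Decimal, ROUND_HALF_UP

SS_RATE = Decimal("0.062")         # employee Social Security rate
MEDICARE_RATE = Decimal("0.0145")  # employee Medicare rate


def _expected(wages: float, rate: Decimal) -> Decimal:
    """Expected withholding, rounded to cents."""
    return (Decimal(str(wages)) * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def check_w2_withholding(w2: dict, tolerance: str = "0.01") -> list[str]:
    """Return review notes where withheld tax differs from the standard rate."""
    notes = []
    pairs = [
        ("box_3_ss_wages", "box_4_ss_withheld", SS_RATE),
        ("box_5_medicare_wages", "box_6_medicare_withheld", MEDICARE_RATE),
    ]
    for wage_key, tax_key, rate in pairs:
        wages, tax = w2.get(wage_key), w2.get(tax_key)
        if wages is None or tax is None:
            continue  # nothing to compare
        expected = _expected(wages, rate)
        if abs(Decimal(str(tax)) - expected) > Decimal(tolerance):
            notes.append(f"{tax_key}: got {tax}, expected about {expected}")
    return notes
```

`Decimal` is used instead of floats so that cent-level rounding behaves predictably.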

Common mistakes to avoid

  • Trusting output without spot-checking numbers

    AI models can transpose, drop, or add digits in numeric fields, especially on lower-quality scans or when box labels are ambiguous. Always spot-check at least a sample of extracted values against the source document before importing into any financial system. A $7,420 wage misread as $74,200 passes format validation but causes real problems.

  • Sending scanned image PDFs as raw text

    If your PDF was created by scanning a paper form rather than digitally generated, the 'text' layer may be empty or contain garbage characters from a poor OCR pass. Running an AI model on that input produces hallucinated field values. You need to run OCR first with a tool like Tesseract or an IDP service, then pass the corrected text.

  • Using one generic prompt for all form types

    A W-2, 1099-NEC, and 1099-INT have completely different box structures. A single prompt that says 'extract all tax data' will mix up field labels across form types. Write a separate extraction prompt for each form type, or include a step that identifies the form type before routing to the correct extraction template.

  • Ignoring amended or corrected forms

    Corrected 1099s are marked with a 'CORRECTED' checkbox at the top of the form. If you process both the original and the corrected version in a batch, you will end up with duplicate records. Build a check for the corrected flag into your extraction prompt and handle duplicates explicitly in your workflow.

  • Not logging the raw extracted text alongside output

    If an extraction produces a wrong value and you only stored the JSON output, you have no way to audit what the model actually received. Store the raw PDF text that was sent to the model alongside each extraction result. This makes debugging and compliance review far faster.
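
One way to handle the corrected-form case in a batch is to group records by form identity and let a corrected copy replace the original. This is a sketch under assumptions: the record keys (`form_type`, `tax_year`, `payer_tin`, `recipient_tin_masked`, `corrected`) are hypothetical names you would need to request in your own prompt, and the grouping key would wrongly merge two legitimate forms from the same payer to the same recipient, for example separate accounts.

```python
def drop_superseded(records: list[dict]) -> list[dict]:
    """Keep one record per form identity, preferring the CORRECTED copy."""
    best: dict[tuple, dict] = {}
    for rec in records:
        key = (
            rec["form_type"],
            rec["tax_year"],
            rec["payer_tin"],
            rec["recipient_tin_masked"],
        )
        current = best.get(key)
        # Take the first record seen, or upgrade an original to a corrected copy.
        if current is None or (rec.get("corrected") and not current.get("corrected")):
            best[key] = rec
    return list(best.values())
```

Logging each dropped original, rather than discarding it silently, keeps the audit trail intact.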

Frequently asked questions

Can AI extract data from scanned tax form PDFs, not just digital ones?

Yes, but a text-based pipeline needs an OCR step first, because a model that only receives text cannot read the page image. Run the scanned PDF through an OCR engine (Tesseract, AWS Textract, or Adobe PDF Extract) to convert it to text, then pass that text to the AI for structured extraction. Quality drops significantly with scans below 200 DPI or forms with handwritten entries.

Is it safe to send W-2 or 1099 PDFs to an AI API for extraction?

It depends on the API and your compliance obligations. Consumer-facing AI tools often use submitted data for training by default, so check the provider's data processing terms before sending any documents with SSNs or EINs. Enterprise API tiers from OpenAI, Anthropic, and Google typically offer no-training agreements. Always mask full SSNs before transmission, and if you are handling client documents, confirm that the provider holds a current SOC 2 report or an equivalent attestation.

What fields can AI reliably extract from a W-2?

On a clean, digitally generated W-2, AI models reliably extract employer name, employer EIN, employee name, all Box 1 through Box 6 wage and withholding figures, state fields in Boxes 15 through 17, and a single code in Box 12. More error-prone fields include Box 12 when multiple codes appear and Box 14 miscellaneous employer notes, which vary by employer and have no standardized format.

How do I extract data from multiple tax PDFs in a batch?

The most reliable pattern is to process one document per API call rather than concatenating multiple forms into one prompt. Write a script that loops through each PDF, extracts the text layer, sends it with your structured prompt, parses the JSON response, and appends it to a results file. Concatenating multiple forms in one call causes models to mix up which values belong to which taxpayer.
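
The loop described above might look like the following sketch. The model call is injected as `call_model`, a hypothetical callable that takes a prompt string and returns the response text, so no particular provider SDK is assumed; text extraction from each PDF is assumed to have happened already. Some models wrap JSON in a Markdown code fence, as several outputs on this page do, so the sketch strips one before parsing.

```python
import json
from typing import Callable

FENCE = "`" * 3  # Markdown code fence marker


def strip_fence(raw: str) -> str:
    """Remove a surrounding Markdown code fence, if present."""
    body = raw.strip()
    if body.startswith(FENCE):
        body = body.split("\n", 1)[1] if "\n" in body else ""
        body = body.rsplit(FENCE, 1)[0]
    return body


def extract_batch(
    texts: dict[str, str],
    call_model: Callable[[str], str],
    instructions: str,
) -> tuple[list[dict], list[dict]]:
    """Send one document per call; collect parsed results and failures."""
    results, failures = [], []
    for filename, text in texts.items():
        raw = call_model(f"{instructions}\n\nTax form content:\n{text}")
        try:
            data = json.loads(strip_fence(raw))
        except json.JSONDecodeError as exc:
            failures.append({"file": filename, "error": str(exc), "raw_response": raw})
            continue
        # Keep the source text next to the output for later audits.
        results.append({"file": filename, "source_text": text, "data": data})
    return results, failures
```

Keeping failures in a separate list means one unparseable response does not stop the batch, and the stored `source_text` shows exactly what the model received.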

Can AI tell the difference between a 1099-NEC and a 1099-MISC automatically?

Yes, if the form type is present in the extracted text, which it almost always is on IRS-standard forms. Add a prompt step that asks the model to identify the form type before extraction, or include a routing instruction like 'first identify whether this is a 1099-NEC, 1099-MISC, 1099-INT, or 1099-DIV, then extract the appropriate fields for that form type.' This prevents the model from applying the wrong field schema.
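
The same routing can also be done in code before the model is called. The sketch below picks the first form identifier found in the text and looks up a field list; the field lists are taken from the example prompts on this page, while `detect_form_type`, `FIELDS_BY_FORM`, and the regular expression are illustrative assumptions. First-match detection can be fooled by a form whose text mentions another form number, so treat the result as a hint.

```python
import re
from typing import Optional

FORM_PATTERN = re.compile(
    r"\b(W-2|1099-(?:NEC|MISC|INT|DIV|K)|1098(?:-E)?)\b", re.IGNORECASE
)

FIELDS_BY_FORM = {
    "1099-NEC": [
        "payer name", "payer TIN", "recipient name", "recipient TIN (masked)",
        "Box 1 nonemployee compensation", "Box 4 federal income tax withheld",
    ],
    "1099-INT": [
        "payer name", "Box 1 interest income",
        "Box 4 federal tax withheld", "Box 8 tax-exempt interest",
    ],
}


def detect_form_type(text: str) -> Optional[str]:
    """Return the first form identifier found in the text, or None."""
    match = FORM_PATTERN.search(text)
    return match.group(1).upper() if match else None


def fields_for(text: str) -> Optional[list[str]]:
    """Look up the field list for the detected form type, if one is defined."""
    form_type = detect_form_type(text)
    return FIELDS_BY_FORM.get(form_type) if form_type else None
```

When `fields_for` returns `None`, falling back to manual review is safer than running a generic prompt.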

What is the best AI model for extracting data from tax form PDFs?

GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro all perform well on structured tax form extraction from clean digital PDFs. The comparison table on this page covers a different set of models (Claude Opus 4.7, Claude Haiku 4.5, GPT-5.4, Gemini 2.5 Pro, and Grok 4.1 Fast Reasoning) on a single W-2 test document: four scored 9/10 and Gemini 2.5 Pro scored 8/10, so cost and latency separated them more than quality did. For scanned forms with vision input, GPT-4o and Gemini 1.5 Pro have the strongest document image comprehension. For pure text extraction accuracy and JSON formatting consistency, Claude 3.5 Sonnet tends to be the most precise.
