AI Tools to Extract Tables from PDF into Excel Spreadsheets

Tested prompts for extracting tables from PDF to Excel, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (10/10)

You have a PDF with tables locked inside it. Maybe it is a financial report, a supplier price list, a government dataset, or an invoice summary. You need those numbers in Excel so you can sort, filter, run formulas, or feed the data into another system. The problem is that PDFs were designed for printing, not for data extraction. Copying and pasting from a PDF into Excel usually produces a mangled mess of merged cells, missing columns, or text strings where numbers should be.

AI tools solve this differently from old-school PDF converters. Instead of guessing at column boundaries based on pixel positions, modern language models read the table structure semantically. They understand that a header row means something, that merged cells span multiple columns, and that a currency symbol belongs with its number. The result is clean, structured output you can paste directly into Excel or save as a CSV.

This page tests a specific extraction prompt across five leading AI models so you can see exactly what output each one produces from the same input. Use the comparison table to pick the model that fits your accuracy needs, then follow the tips below to get clean Excel-ready output on the first try.

When to use this

This approach works best when you have a PDF that contains clearly structured tables and you need the data in a spreadsheet format quickly. It is ideal for one-off extractions, irregular documents you receive from outside your organization, or tables embedded in longer reports where you only need specific sections pulled out.

  • Extracting financial tables from annual reports or 10-K filings into Excel for analysis
  • Pulling product and pricing tables from supplier PDFs into a price comparison spreadsheet
  • Converting government or research data tables from PDF publications into a workable dataset
  • Grabbing invoice line-item tables from scanned or exported PDF invoices for accounting reconciliation
  • Extracting schedule or roster tables from PDF documents distributed by a client or partner

When this format breaks down

  • Scanned PDFs with low image quality or skewed pages will produce unreliable extractions because the AI cannot accurately read distorted or blurry text without a strong OCR preprocessing step first.
  • If the PDF contains hundreds of pages each with multiple dense tables, a manual AI prompt workflow is too slow and error-prone. Use a dedicated programmatic tool like Camelot, Tabula, or an API-based extraction service instead.
  • Tables with highly irregular or artistic layouts, such as nested tables inside tables or tables formatted as visual infographics, often confuse AI models and produce structurally wrong output.
  • When the data is sensitive or confidential, such as patient records or legal documents under NDA, pasting the content into a public AI model may violate compliance or privacy requirements.

The prompt we tested

You are a data extraction specialist who converts tables from PDF documents into clean, Excel-ready spreadsheet data.

Follow these instructions precisely:
Output each table as tab-separated values (TSV) with the header row first, followed by data rows, so it can be pasted directly into Excel. Preserve original column names, keep numbers unformatted (no currency symbols or thousands separators unless present in source), and if multiple tables exist, separate them with a blank line and a title line like '### Table 1: [name]'. Do not add commentary, explanations, or markdown formatting around the data.

Here is the PDF content containing the table(s) to extract:
Quarterly Sales Report 2024 — Page 3

Region     Q1 Revenue   Q2 Revenue   Q3 Revenue   Q4 Revenue
North      125000       138000       142500       156000
South      98500        102000       115000       128500
East       110000       118500       124000       135000
West       145000       152000       160500       175000

Extract this table so I can analyze it in Excel.

Return the extracted table(s) in a format ready to paste into Excel, preserving all rows, columns, headers, and numeric values exactly as they appear in the source.

Model comparison

Model                     Quality   Cost      Latency   Tokens
Claude Opus 4.7           9         $0.0134   2.0s      406 / 98
Claude Haiku 4.5          10        $0.0007   1.6s      291 / 80
GPT-5.4                   9         $0.0020   1.5s      284 / 84
Gemini 2.5 Pro            10        $0.0011   6.1s      336 / 136
Grok 4.1 Fast Reasoning   9         $0.0001   3.1s      434 / 69

Outputs

Claude Opus 4.7 9/10
Region	Q1 Revenue	Q2 Revenue	Q3 Revenue	Q4 Revenue
North	125000	138000	142500	156000
South	98500	102000	115000	128500
East	110000	118500	124000	135000
West	145000	152000	160500	175000
Judge notes: Output perfectly matches requirements with clean TSV formatting ready for Excel paste, correct header/data structure, and preserved numeric values, though slightly minimal since it's a straightforward single-table example without demonstrating multi-table handling.
Claude Haiku 4.5 10/10
Region	Q1 Revenue	Q2 Revenue	Q3 Revenue	Q4 Revenue
North	125000	138000	142500	156000
South	98500	102000	115000	128500
East	110000	118500	124000	135000
West	145000	152000	160500	175000
Judge notes: Output perfectly matches the specified TSV format, preserves all headers and data exactly as required, and is directly pasteable into Excel with no extraneous commentary.
GPT-5.4 9/10
### Table 1: Quarterly Sales Report 2024
Region	Q1 Revenue	Q2 Revenue	Q3 Revenue	Q4 Revenue
North	125000	138000	142500	156000
South	98500	102000	115000	128500
East	110000	118500	124000	135000
West	145000	152000	160500	175000
Judge notes: Excellent execution with proper TSV formatting, preserved headers and values, and direct Excel-ready output; minor deduction only for including the title line when instructions specified "no markdown formatting around the data."
Gemini 2.5 Pro 10/10
Region	Q1 Revenue	Q2 Revenue	Q3 Revenue	Q4 Revenue
North	125000	138000	142500	156000
South	98500	102000	115000	128500
East	110000	118500	124000	135000
West	145000	152000	160500	175000
Judge notes: Output perfectly matches requirements—clean TSV format ready to paste into Excel, preserves all headers and numeric values exactly as sourced, no extraneous commentary, fully satisfies the query intent.
Grok 4.1 Fast Reasoning 9/10
Region	Q1 Revenue	Q2 Revenue	Q3 Revenue	Q4 Revenue
North	125000	138000	142500	156000
South	98500	102000	115000	128500
East	110000	118500	124000	135000
West	145000	152000	160500	175000
Judge notes: Output perfectly matches query intent with clean TSV formatting ready for Excel import, preserving all data accurately; only minor deduction for lacking a table title line as specified in instructions.
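
The outputs above differ in one detail: some models emit a '### Table N: name' title line and others do not. If you post-process the text in code rather than pasting it by hand, a small parser can accept both shapes. The following is a minimal sketch, assuming the title and blank-line conventions from the tested prompt; the function name `split_tables` is hypothetical.

```python
def split_tables(model_output: str) -> list[tuple[str, list[list[str]]]]:
    """Split TSV model output into (title, rows) pairs.

    A line starting with '###' names the table that follows; a blank line
    ends the current table. Untitled tables are named by position.
    """
    tables: list[tuple[str, list[list[str]]]] = []
    title = None
    block: list[list[str]] = []

    def flush() -> None:
        nonlocal title
        if block:
            tables.append((title or f"Table {len(tables) + 1}", list(block)))
            block.clear()
            title = None

    for line in model_output.splitlines():
        if line.startswith("###"):
            flush()
            title = line.lstrip("#").strip()
        elif line.strip():
            block.append(line.split("\t"))   # one cell per tab-separated field
        else:
            flush()
    flush()
    return tables
```

Each returned pair holds the table title and its rows, with the header as the first row, so the same code handles the GPT-5.4 output with a title and the other four without one.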

What makes these work

  1. Specify your output format explicitly

    Always tell the model whether you want CSV, tab-separated values, or a markdown table. CSV works best for direct Excel import via the Data tab. Tab-separated values paste cleanly into an open Excel sheet without triggering a format wizard. If you leave the format unspecified, the model will guess and you may get inconsistent results.

  2. Clean numeric columns before pasting

    Ask the model to strip currency symbols, percent signs, and thousands separators from numeric columns, and either record them in a separate notes row or reapply them in Excel with cell formatting. Numbers that arrive as text strings with dollar signs attached can be skipped silently by SUM and AVERAGE, which produces wrong totals that are hard to debug later.

  3. Label merged or multi-row headers explicitly

    PDF tables often use merged cells for grouped headers, such as a single header spanning Q1 through Q4. Tell the model to flatten merged headers into separate columns with descriptive names. Merged header structures collapse badly on import and cause data to land in the wrong columns.

  4. Request a row count confirmation

    At the end of your prompt, ask the model to state how many data rows it extracted. Compare that number against the original PDF table. AI models occasionally skip rows that have unusual formatting or wrap onto a second line. A row count check catches silent data loss before you build anything on top of the spreadsheet.
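
If you would rather enforce these four tips in code than trust the model to follow them, each one reduces to a few lines of Python. The helpers below are a sketch only: the function names are hypothetical, `clean_number` assumes US-style numbers (comma for thousands, dot for decimals), and `flatten_headers` assumes a blank cell in the upper header row means "still inside the previous merged group".

```python
import csv
import io
import re


def tsv_to_csv(tsv_text: str) -> str:
    """Tip 1: convert TSV to CSV, quoting any cell that contains a comma."""
    rows = [line.split("\t") for line in tsv_text.splitlines() if line.strip()]
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def clean_number(cell: str) -> str:
    """Tip 2: strip currency symbols, thousands commas, and percent signs.

    Cells that are not plain numbers after stripping are returned unchanged,
    so labels such as 'Steel Bracket' or '9-5' pass through untouched.
    """
    stripped = re.sub(r"[$€£,%\s]", "", cell)
    return stripped if re.fullmatch(r"-?\d+(\.\d+)?", stripped) else cell


def flatten_headers(group_row: list[str], sub_row: list[str]) -> list[str]:
    """Tip 3: merge a grouped header row with the header row beneath it."""
    flat, current = [], ""
    for group, sub in zip(group_row, sub_row):
        if group.strip():              # a non-blank cell starts a new group
            current = group.strip()
        flat.append(f"{current} {sub.strip()}".strip())
    return flat


def count_data_rows(tsv_text: str) -> int:
    """Tip 4: count data rows, ignoring blanks, '###' titles, and the header."""
    lines = [
        line for line in tsv_text.splitlines()
        if line.strip() and not line.startswith("###")
    ]
    return max(len(lines) - 1, 0)
```

For example, `clean_number("$4,200,000")` returns `"4200000"`, and `clean_number("12%")` returns `"12"`, so a percentage keeps its printed magnitude rather than becoming 0.12. Compare the result of `count_data_rows` with the row count you see in the PDF and with the count the model reports.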

More example scenarios

#01 · Quarterly revenue table from an investor report
Input
Extract the following table from a PDF into Excel-ready format with clean column headers. Table content: Q1 Revenue $4.2M, Q2 Revenue $5.1M, Q3 Revenue $4.8M, Q4 Revenue $6.3M. Each row also shows YoY growth: 12%, 18%, 9%, 24%. Format as tab-separated values.
Expected output
Quarter	Revenue	YoY Growth
Q1	$4,200,000	12%
Q2	$5,100,000	18%
Q3	$4,800,000	9%
Q4	$6,300,000	24%
#02 · Supplier product and pricing list
Input
Here is a table from a supplier PDF. Convert it to CSV format suitable for Excel import. Columns appear to be: Part Number, Description, Unit Price, MOQ. Rows include: A-101 Steel Bracket $2.50 100, A-102 Zinc Bolt $0.18 500, A-103 Rubber Gasket $1.10 250. Clean up spacing and align numeric columns.
Expected output
Part Number,Description,Unit Price,MOQ
A-101,Steel Bracket,$2.50,100
A-102,Zinc Bolt,$0.18,500
A-103,Rubber Gasket,$1.10,250
#03 · Clinical trial results table from a research paper
Input
Extract this results table from a medical PDF and output it as tab-separated values for Excel. The table shows Treatment Group, Sample Size, Mean Score, Standard Deviation, and P-Value. Data: Placebo 45 62.3 8.1 -, Drug A 47 74.6 7.4 0.003, Drug B 46 71.2 9.0 0.021. Preserve all columns and flag any missing values.
Expected output
Treatment Group	Sample Size	Mean Score	Std Deviation	P-Value
Placebo	45	62.3	8.1	N/A
Drug A	47	74.6	7.4	0.003
Drug B	46	71.2	9.0	0.021
#04 · Employee shift schedule from an HR PDF
Input
Convert this weekly schedule table from a PDF into Excel format. The table has employee names as rows and days of the week as columns. Values are shift times or OFF. Data: Alice Mon 9-5 Tue 9-5 Wed OFF Thu 1-9 Fri 9-5. Bob Mon OFF Tue 1-9 Wed 1-9 Thu 9-5 Fri OFF. Output as tab-separated.
Expected output
Employee	Monday	Tuesday	Wednesday	Thursday	Friday
Alice	9-5	9-5	OFF	1-9	9-5
Bob	OFF	1-9	1-9	9-5	OFF
#05 · Tax summary table from an accountant-prepared PDF
Input
Extract the following income tax summary table from a PDF for import into Excel. Table has columns: Income Category, Gross Amount, Deductions, Taxable Income. Rows: Salary $95,000 $12,500 $82,500. Freelance $18,400 $3,200 $15,200. Investment $4,700 $0 $4,700. Output as CSV with no currency symbols in numeric columns.
Expected output
Income Category,Gross Amount,Deductions,Taxable Income
Salary,95000,12500,82500
Freelance,18400,3200,15200
Investment,4700,0,4700

Common mistakes to avoid

  • Pasting raw PDF text without structure hints

    When you copy text from a PDF and paste it into an AI prompt with no instructions, the model receives a wall of text with no reliable column separators. It will attempt to infer structure but will guess wrong on ambiguous spacing. Always describe the expected columns and row pattern in your prompt, even briefly.

  • Ignoring footnotes that modify table values

    PDF tables frequently use footnote markers like asterisks or superscript numbers to indicate restated figures, excluded outliers, or currency conversions. If you extract only the table and ignore the footnotes, those modified values land in Excel without context. Ask the model to append footnote text as a separate notes column.

  • Trusting currency and number formatting blindly

    Different PDFs use different regional number formats. A European PDF may show 1.234,56 where an American spreadsheet expects 1234.56. If you paste without checking, Excel may interpret decimal commas as text or thousands separators as decimal points, corrupting every calculated total silently.

  • Not checking column alignment on multi-page tables

    Tables that span multiple PDF pages often repeat the header row at the top of each page. If you extract the full text and feed it to an AI model at once, it may treat each repeated header as a data row. Tell the model to treat repeated headers as page breaks and consolidate everything into one clean table.

  • Using a single model without spot-checking output

    No AI model is perfectly accurate on every PDF table. Numeric transpositions, dropped rows, and misaligned columns all happen. Always spot-check three to five rows against the original PDF after extraction, especially for financial or compliance data where errors have real consequences.
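
Three of the mistakes above (regional number formats, repeated page headers, and unverified single-model output) can be partly guarded against in code. The sketch below uses hypothetical function names and works on tables already parsed into lists of rows. `parse_european_number` is only safe when you already know the whole column uses the European convention, because a value like 1.234 is ambiguous on its own.

```python
def parse_european_number(text: str) -> float:
    """Convert '1.234,56' (dot = thousands, comma = decimal) to 1234.56."""
    return float(text.strip().replace(".", "").replace(",", "."))


def drop_repeated_headers(rows: list[list[str]]) -> list[list[str]]:
    """Keep the first row as the header and drop later exact copies of it."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]


def diff_tables(a: list[list[str]], b: list[list[str]]) -> list[tuple]:
    """List (row, column, value_a, value_b) for every cell where a and b differ.

    Missing rows or cells are reported with None on the shorter side.
    """
    diffs = []
    for r in range(max(len(a), len(b))):
        row_a = a[r] if r < len(a) else []
        row_b = b[r] if r < len(b) else []
        for c in range(max(len(row_a), len(row_b))):
            value_a = row_a[c] if c < len(row_a) else None
            value_b = row_b[c] if c < len(row_b) else None
            if value_a != value_b:
                diffs.append((r, c, value_a, value_b))
    return diffs
```

Running `diff_tables` on the outputs of two different models points you at the cells worth checking first. Agreement between models does not prove the values are right, so it narrows the manual spot-check against the PDF rather than replacing it.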

Frequently asked questions

Can AI extract tables from scanned PDFs into Excel?

It depends on the scan quality. The copy-and-paste workflow described on this page works on the text layer of a PDF. If your PDF was created by scanning a paper document, it may have no text layer at all, just an image. In that case you need OCR software to create a text layer first, such as Adobe Acrobat, ABBYY FineReader, or Google Drive's built-in OCR. Once the text layer exists, AI extraction works normally.

What is the best free tool to extract tables from PDF to Excel?

For simple tables, Tabula is a free open-source desktop app purpose-built for this task and works well on straightforward layouts. For more complex or irregular tables, prompting a free-tier AI model like ChatGPT or Google Gemini with the copied table text often produces cleaner results. For programmatic batch extraction, the Python library Camelot is free and handles both lattice and stream table types.

Why does my PDF table look wrong after I copy it into Excel?

PDFs store content as positioned text objects, not as structured rows and columns. When you copy-paste directly into Excel, the application tries to map those positioned text fragments into cells using character spacing as a guide. It frequently misaligns columns, merges values that should be separate, and places numbers in the wrong cells. Running the copied text through an AI extraction prompt first fixes the structure before it reaches Excel.

How do I extract multiple tables from one PDF file?

Extract each table separately in its own prompt and label them clearly. Mixing multiple tables into a single prompt often causes the AI to conflate their structures or merge rows across tables. If you are processing many PDFs with multiple tables programmatically, Camelot or pdfplumber in Python can loop through pages and detect table boundaries automatically.

Can I automate PDF table extraction to Excel without doing it manually each time?

Yes. You can build an automated pipeline using a PDF parsing library like pdfplumber or Camelot in Python combined with OpenAI's API for structure correction, then export results to Excel with the openpyxl or pandas library. For a no-code option, tools like Zapier, Make, or Microsoft Power Automate have PDF parsing actions that can route extracted data directly into Excel or Google Sheets on a trigger.
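
As a rough illustration of the Python route, the sketch below reads every table pdfplumber detects and writes each one to its own worksheet with pandas. It assumes pdfplumber, pandas, and openpyxl are installed, that the first row of each detected table is its header, and that header names within a table are unique. The function names, sheet naming scheme, and file paths are hypothetical, and the AI structure-correction step is left out.

```python
def table_to_records(table: list[list]) -> list[dict]:
    """Turn pdfplumber-style rows (header first, cells may be None) into dicts."""
    header = [(cell or "").strip() for cell in table[0]]
    return [
        {name: (cell or "").strip() for name, cell in zip(header, row)}
        for row in table[1:]
    ]


def pdf_tables_to_excel(pdf_path: str, xlsx_path: str) -> int:
    """Write each detected table to its own sheet; return the number written."""
    # Third-party imports are kept local so table_to_records works without them.
    import pandas as pd
    import pdfplumber

    frames = {}
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for index, table in enumerate(page.extract_tables(), start=1):
                if len(table) >= 2:    # need a header plus at least one data row
                    name = f"p{page_number}_t{index}"
                    frames[name] = pd.DataFrame(table_to_records(table))

    if frames:                         # skip creating a workbook with no sheets
        with pd.ExcelWriter(xlsx_path) as writer:
            for sheet_name, frame in frames.items():
                frame.to_excel(writer, sheet_name=sheet_name, index=False)
    return len(frames)
```

A call such as `pdf_tables_to_excel("report.pdf", "report.xlsx")` would produce one sheet per table, named by page and position. Cells arrive as text, so numeric cleanup and any model-based correction would still run between extraction and export.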

Does AI extraction preserve formulas from PDF tables?

No. PDFs do not store formulas. They only store the calculated values that were printed to the page. What the AI extracts are the static numbers as they appear in the PDF. You will need to recreate any formulas yourself in Excel after the data is imported. This is expected behavior and not a limitation specific to AI extraction.
