How to Extract Email Addresses from PDF Documents

Tested prompts for extracting email addresses from PDFs, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (10/10)

You have a PDF file with email addresses buried inside it. Maybe it is a scanned conference attendee list, a vendor contract, a research report, or an exported CRM dump. Copying them out one by one is slow and error-prone. What you need is a reliable way to pull every email address out of that document automatically.

The AI approach on this page works by feeding the raw text content of your PDF into a language model with a structured extraction prompt. The model picks out the strings that look like email addresses, ignores surrounding noise, and returns a clean list. It also copes with inconsistent formatting, mixed-language documents, and emails embedded mid-sentence.

This page shows you the exact prompt used, how five different models handled the same input, and which performed best for accuracy and deduplication. Whether you have a single contract or hundreds of scanned reports, the workflow scales the same way. Read the examples and tips below to adapt it to your specific document type.

When to use this

This approach fits any situation where email addresses are embedded in unstructured or semi-structured PDF text and manual copying is not practical. It works on native PDFs whose text layer you can copy, and on OCR-processed scans where the text has already been extracted. It is the right tool when accuracy and speed both matter.

  • Pulling speaker or attendee email addresses from a conference program or registration export saved as PDF
  • Extracting contact details from a vendor list, partnership agreement, or legal contract delivered as a PDF
  • Harvesting lead emails from a scraped or downloaded PDF directory of businesses or professionals
  • Collecting author correspondence emails from a batch of academic papers or research reports
  • Recovering contact lists from archived PDF newsletters or email digests

When this format breaks down

  • The PDF is a pure image scan with no OCR text layer applied. This workflow operates on extracted text and has nothing to read in an image-only file. You must run the file through an OCR tool first and extract the text before this workflow can help.
  • The document contains only a handful of visible email addresses you can copy in under a minute. Running an AI extraction workflow adds unnecessary overhead for trivial tasks.
  • You need to extract emails at industrial scale from thousands of PDFs automatically. A dedicated pipeline with programmatic PDF parsing and regex is faster and cheaper than calling an LLM per document.
  • The PDF is protected or encrypted and you cannot copy its text content. The AI has nothing to process if the text layer is locked.
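
For the high-volume case mentioned above, a plain regex pass is the usual baseline. The sketch below is a minimal illustration, not a full RFC 5322 validator: the pattern, the function name find_emails, and the sample string are all assumptions made for this example.

```python
import re

# Deliberately simple pattern: local part, "@", dotted domain, alphabetic TLD.
# It approximates common addresses and will not cover every valid form.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def find_emails(text: str) -> list[str]:
    """Return lowercase, de-duplicated addresses in order of first appearance."""
    seen: dict[str, None] = {}
    for match in EMAIL_RE.finditer(text):
        seen.setdefault(match.group(0).lower(), None)
    return list(seen)


sample = (
    "Contact procurement@northfieldsupplies.com or Accounts@Waverly-Log.com. "
    "Again: accounts@waverly-log.com."
)
print(find_emails(sample))
```

A pass like this costs nothing per document, but it only catches addresses that are already written in standard form.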

The prompt we tested

You are an expert data extraction assistant specialized in finding email addresses within text extracted from PDF documents. Carefully scan the provided content and identify every valid email address, including those that may be obfuscated, split across lines, or formatted unusually due to PDF text extraction artifacts.

Instructions: Return only a deduplicated list of valid email addresses, one per line, with no commentary, numbering, or surrounding text. Normalize emails to lowercase, repair any line-break splits (e.g., 'john@\nexample.com' becomes 'john@example.com'), and convert obfuscated forms like '[at]' or '(dot)' into standard format. If no email addresses are found, respond with exactly: No email addresses found.

PDF text content:
Please contact our support team at help@acmecorp.com for assistance. For partnership inquiries, reach out to Sarah Johnson (sarah.johnson [at] acmecorp.com) or our VP of Sales at m.chen@acme
corp.com. Invoices should be sent to billing(dot)dept(at)acmecorp(dot)com.

Return the extracted email addresses now.
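
If you want to reuse the prompt above from a script, it can be assembled around any block of extracted text. This is a minimal sketch: the names ROLE, INSTRUCTIONS, and build_prompt are made up for illustration, and the wording is copied from the prompt as tested.

```python
# Prompt wording copied from the tested prompt. The one fragment containing a
# literal backslash-n is a raw string so it is not turned into a real newline.
ROLE = (
    "You are an expert data extraction assistant specialized in finding email "
    "addresses within text extracted from PDF documents. Carefully scan the "
    "provided content and identify every valid email address, including those "
    "that may be obfuscated, split across lines, or formatted unusually due to "
    "PDF text extraction artifacts."
)
INSTRUCTIONS = (
    "Instructions: Return only a deduplicated list of valid email addresses, "
    "one per line, with no commentary, numbering, or surrounding text. "
    "Normalize emails to lowercase, repair any line-break splits "
    r"(e.g., 'john@\nexample.com' becomes 'john@example.com'), and convert "
    "obfuscated forms like '[at]' or '(dot)' into standard format. If no email "
    "addresses are found, respond with exactly: No email addresses found."
)


def build_prompt(pdf_text: str) -> str:
    """Assemble the full prompt around one block of extracted PDF text."""
    return "\n\n".join([
        ROLE,
        INSTRUCTIONS,
        "PDF text content:\n" + pdf_text.strip(),
        "Return the extracted email addresses now.",
    ])
```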

Model comparison

Model                    Quality  Cost     Latency  Tokens
Claude Opus 4.7          10       $0.0099  1.7s     381 / 56
Claude Haiku 4.5         10       $0.0005  853ms    264 / 44
GPT-5.4                  9        $0.0012  1.0s     234 / 42
Gemini 2.5 Pro           10       $0.0005  7.8s     240 / 39
Grok 4.1 Fast Reasoning  10       $0.0001  2.2s     381 / 35

Outputs

Claude Opus 4.7 10/10
help@acmecorp.com
sarah.johnson@acmecorp.com
m.chen@acmecorp.com
billing.dept@acmecorp.com
Judge notes: The output perfectly extracts all email addresses from the PDF text, correctly deobfuscates multiple formats ([at], dot notation, line breaks), deduplicates results, and presents them in the exact requested format with no extraneous commentary.

Claude Haiku 4.5 10/10
help@acmecorp.com
sarah.johnson@acmecorp.com
m.chen@acmecorp.com
billing.dept@acmecorp.com
Judge notes: Flawlessly extracts all email addresses from the PDF text, correctly handles obfuscated formats and line breaks, provides clean deduplicated output in the exact requested format.

GPT-5.4 9/10
help@acmecorp.com
sarah.johnson@acmecorp.com
m.chen@acmecorp.com
billing.dept@acmecorp.com
Judge notes: Excellently extracted all four email addresses correctly, including proper normalization of obfuscated formats and line-break repairs, with only the minor limitation that it doesn't handle actual PDF files directly (requires pre-extracted text).

Gemini 2.5 Pro 10/10
help@acmecorp.com
sarah.johnson@acmecorp.com
m.chen@acmecorp.com
billing.dept@acmecorp.com
Judge notes: The model perfectly identified all four email addresses, correctly normalized obfuscated formats, repaired line breaks, and returned clean deduplicated output with no extraneous commentary.

Grok 4.1 Fast Reasoning 10/10
help@acmecorp.com
sarah.johnson@acmecorp.com
m.chen@acmecorp.com
billing.dept@acmecorp.com
Judge notes: The model perfectly extracted all four email addresses, correctly deobfuscated alternative formats ([at], (dot)), repaired line breaks, normalized to lowercase, and presented results in exactly the requested format with no extraneous content.

What makes these work

  1. Paste raw text, not file paths

    Language models process text, not files. Copy the text content out of your PDF using a reader or a tool like pdfplumber or Adobe Acrobat, then paste that text directly into the prompt. Sending a filename or a file path will produce no useful result.

  2. Ask for one email per line

    Specify in your prompt that output should be one email address per line with no extra characters. This makes the result immediately usable for import into a spreadsheet, CRM, or email tool without further cleanup. Without this instruction, some models wrap results in sentences or bullet points.

  3. Request deduplication explicitly

    If the same email appears multiple times in a document, the model may return duplicates unless told not to. Add 'remove duplicates' to your prompt. This is especially important for long documents like attendee lists where the same contact appears in multiple sections.

  4. Chunk very long documents

    Every model has a finite context window. For long PDFs, split the extracted text into sections and run the prompt on each chunk separately, then combine the outputs. The right threshold depends on the model; 10-15 pages is a conservative starting point. Trying to process an oversized document in one call risks truncation and missed emails near the end.
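
The chunking and deduplication tips above can be sketched in a few lines. This is one possible approach under stated assumptions: the function names and the default sizes (8000 characters per chunk, 200 characters of overlap) are arbitrary illustrations, not tested recommendations. The overlap is there so that an address sitting on a chunk boundary appears whole in at least one chunk, which means it should be longer than any address you expect.

```python
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
        start += max_chars - overlap
    return chunks


def merge_results(per_chunk: list[list[str]]) -> list[str]:
    """Combine per-chunk email lists, lowercased and de-duplicated in order."""
    seen: dict[str, None] = {}
    for emails in per_chunk:
        for email in emails:
            seen.setdefault(email.strip().lower(), None)
    return list(seen)
```

Because the windows overlap, the same address can come back from two neighboring chunks, so the merge step is what keeps the final list free of duplicates.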

More example scenarios

#01 · Conference attendee roster PDF
Input
The following text was extracted from a conference attendee PDF. Pull out every email address:

Attendee List - TechSummit 2024
Jane Hooper, Lead Developer, Apex Systems - j.hooper@apexsystems.com
Marco Reyes | Product Manager | marco.reyes@bridgeworks.io
Sarah Kim (sarah_kim@nova-labs.org) - Researcher
Dr. L. Patel, lpatel@medinstitute.edu, Keynote Speaker
Expected output
j.hooper@apexsystems.com
marco.reyes@bridgeworks.io
sarah_kim@nova-labs.org
lpatel@medinstitute.edu

#02 · Vendor contract with embedded contacts
Input
Extract all email addresses from this contract text:

This agreement is between Northfield Supplies Inc. (contact: procurement@northfieldsupplies.com) and Waverly Logistics LLC. For billing queries contact accounts@waverly-log.com. Legal notices should be sent to legal.notices@northfieldsupplies.com and copied to compliance@waverly-log.com.
Expected output
procurement@northfieldsupplies.com
accounts@waverly-log.com
legal.notices@northfieldsupplies.com
compliance@waverly-log.com

#03 · Academic paper author contacts
Input
From this research paper header extract all emails:

Authors: Dr. Amara Osei (a.osei@ghanatech.edu.gh), Prof. Liu Wei (lwei@pku.edu.cn), and corresponding author Dr. Fatima Al-Rashid. Correspondence: f.alrashid@qataruni.edu.qa. Supplementary contact: lab-manager@ghanatech.edu.gh
Expected output
a.osei@ghanatech.edu.gh
lwei@pku.edu.cn
f.alrashid@qataruni.edu.qa
lab-manager@ghanatech.edu.gh

#04 · Real estate agent directory page
Input
Pull every email from this agency directory:

Region: Pacific Northwest
Tom Garrett - Residential Specialist - tgarrett@coastlineprops.com - (503) 555-0142
Deanna Park | Commercial | dpark@coastlineprops.com
Ruben Flores, ruben.flores@nwrealtygroup.net, Luxury Division
Contact office: info@coastlineprops.com
Expected output
tgarrett@coastlineprops.com
dpark@coastlineprops.com
ruben.flores@nwrealtygroup.net
info@coastlineprops.com

#05 · Nonprofit grant report with multiple stakeholders
Input
Extract emails from this grant report section:

Project leads are Maria Souza (msouza@hopefoundation.org) and Kevin Ito (k.ito@hopefoundation.org). External evaluator: dr.chen@impactmetrics.co. Funder contact at the Greenway Trust: grants@greenwaytrust.org. Media inquiries: press@hopefoundation.org.
Expected output
msouza@hopefoundation.org
k.ito@hopefoundation.org
dr.chen@impactmetrics.co
grants@greenwaytrust.org
press@hopefoundation.org

Common mistakes to avoid

  • Skipping the text extraction step

    Many people try to upload a PDF file directly and expect the model to read it. Unless the interface explicitly supports PDF parsing, the model may see only the filename, or nothing at all. Always extract the text from the PDF first using a PDF reader, command-line tool, or API before sending it to the model.

  • Not specifying output format

    Without format instructions, models often return emails embedded in sentences like 'The emails found are...' This requires extra parsing to use the results. Always tell the model exactly how to format the output, such as a plain list with one address per line.

  • Assuming OCR output is clean

    Scanned PDFs processed through OCR frequently introduce character errors. An email like contact@company.com might be read as c0ntact@company.corn. Review AI-extracted emails from scanned documents before use, especially the domain portion where OCR errors are most common.

  • Ignoring partial or malformed emails

    Some PDFs contain line breaks mid-email, for example contact@company at the end of one line and .com at the start of the next. The AI may or may not reconstruct these correctly depending on how the text was extracted. If your document has unusual formatting, check the raw extracted text first for broken strings.

  • Running one large prompt on a 50-page document

    Feeding an entire long document into a single prompt frequently causes the model to miss emails that appear in the later portions of the text due to context limits or attention drift. Break long documents into page ranges and merge results afterward.
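
Several of the mistakes above can be caught with a quick triage pass over the model's reply before you import anything. The sketch below is a rough heuristic under stated assumptions: the function name triage, the strict pattern, and the EXPECTED_TLDS allowlist are all hypothetical, and you would replace the allowlist with the endings you actually expect in your documents.

```python
import re

# Strict shape check; the capture group is the final domain label (the TLD).
STRICT = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.([a-z]{2,})")

# Hypothetical allowlist for illustration only; extend it for your documents.
EXPECTED_TLDS = {"com", "org", "net", "edu", "io", "co", "gh", "cn", "qa"}


def triage(model_output: str) -> tuple[list[str], list[str]]:
    """Split a one-per-line reply into (accepted, needs_review) lists."""
    accepted: list[str] = []
    review: list[str] = []
    for line in model_output.splitlines():
        candidate = line.strip().lower()
        if not candidate or candidate == "no email addresses found.":
            continue
        match = STRICT.fullmatch(candidate)
        if match and match.group(1) in EXPECTED_TLDS:
            if candidate not in accepted:
                accepted.append(candidate)
        elif candidate not in review:
            review.append(candidate)
    return accepted, review


accepted, review = triage("help@acmecorp.com\nc0ntact@company.corn\n")
print(accepted)  # addresses that passed both checks
print(review)    # addresses a person should look at
```

This flags an unexpected ending such as .corn, but it cannot tell that c0ntact should have been contact, so a manual check of scanned sources is still needed.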

Frequently asked questions

Can I extract email addresses from a scanned PDF?

Yes, but you need to run OCR on the scanned PDF first to create a text layer. Tools like Adobe Acrobat, ABBYY FineReader, or open-source options like Tesseract can do this. Once you have the extracted text, the AI prompt approach on this page works the same way. Check the OCR output for character errors before trusting the results.

What is the best free tool to extract text from a PDF before using AI?

For native PDFs with a text layer, pdfplumber and PyMuPDF are free Python libraries that extract text accurately. For a no-code option, uploading the PDF to Google Drive and opening it in Google Docs will convert it to editable text. For scanned PDFs, Tesseract is the most widely used free OCR engine.
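
As a minimal sketch of the pdfplumber route for native PDFs, the helper below opens a file and joins the text of every page. The function names are made up for this example, pdfplumber is a third-party package that must be installed separately, and the import sits inside the function so the rest of a script still loads without it.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, treating pages with no text (None or '') as empty."""
    return "\n".join(text or "" for text in page_texts)


def extract_pdf_text(path: str) -> str:
    """Return the text layer of a native PDF, one page after another."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() may return no text for image-only pages.
        return join_pages(page.extract_text() for page in pdf.pages)
```

Calling extract_pdf_text on a file of your own returns a single string you can paste into the prompt. If the result is empty or nearly empty, the PDF is probably a scan and needs OCR first.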

How do I extract email addresses from a PDF without software?

Open the PDF in any browser or reader, use Ctrl+A to select all text, then Ctrl+C to copy it. Paste the text into the AI prompt on this page. This works for native PDFs only. Scanned image-based PDFs give you nothing to select or copy, and you will need an OCR step first.

Can this method extract emails from a password-protected PDF?

No. If the PDF is encrypted or password-protected, you cannot copy the text without the password. You must unlock or decrypt the file first using authorized credentials. Attempting to extract content from a protected PDF without permission may also violate terms of use or applicable law.

How accurate is AI at extracting emails compared to regex?

For well-formatted text, a strict regex pattern and AI perform comparably. AI has an advantage in noisy or inconsistently formatted documents because it understands context and can handle emails written in unusual ways, such as 'john dot smith at company dot com'. Regex requires the email to match an exact syntactic pattern and will miss obfuscated addresses entirely.
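
To make the comparison concrete, the sketch below runs a strict pattern over the sample text from the tested prompt, before and after a small normalization pass. The pattern and the deobfuscate helper are assumptions written for this example. The helper only understands bracketed forms such as [at] and (dot), and it can misfire on ordinary prose that happens to contain "(at)".

```python
import re

EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def deobfuscate(text: str) -> str:
    """Rewrite bracketed '[at]' / '(dot)' tokens into '@' and '.'."""
    text = re.sub(r"\s*[\[\(]\s*at\s*[\]\)]\s*", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*[\[\(]\s*dot\s*[\]\)]\s*", ".", text, flags=re.IGNORECASE)
    return text


def strict_find(text: str) -> list[str]:
    """Find standard-form addresses, lowercased, in order of appearance."""
    return [m.group(0) for m in EMAIL_RE.finditer(text.lower())]


sample = (
    "Please contact our support team at help@acmecorp.com for assistance. "
    "For partnership inquiries, reach out to Sarah Johnson "
    "(sarah.johnson [at] acmecorp.com) or our VP of Sales at m.chen@acme\n"
    "corp.com. Invoices should be sent to billing(dot)dept(at)acmecorp(dot)com."
)

print(strict_find(sample))               # strict pattern on the raw text
print(strict_find(deobfuscate(sample)))  # same pattern after normalization
```

On this sample the strict pattern alone finds one of the four addresses, and the normalization pass brings it to three. The address split across a line break is still missed, which is the kind of case where the model's reading of context helps.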

Can I automate this to process hundreds of PDFs at once?

Yes. Use a PDF parsing library like pdfplumber to extract text from each file programmatically, then send each extracted text block to an LLM API such as OpenAI or Anthropic with the extraction prompt. Collect and deduplicate the results. This removes manual copying for hundreds of documents. At thousands of documents, per-call cost and latency add up, and a pipeline that uses regex first and sends only the messy files to the model is usually faster and cheaper.
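
The loop described above might look like the sketch below. To avoid guessing at any vendor's API, the text extractor and the model call are passed in as plain functions. The names run_batch, extract_text, and ask_model are hypothetical, and you would supply your own implementations, for example a pdfplumber wrapper and a function that sends the extraction prompt to your chosen API and returns the reply text.

```python
from typing import Callable, Iterable


def run_batch(
    pdf_paths: Iterable[str],
    extract_text: Callable[[str], str],
    ask_model: Callable[[str], str],
) -> dict[str, list[str]]:
    """Map each PDF path to the de-duplicated emails the model returned."""
    results: dict[str, list[str]] = {}
    for path in pdf_paths:
        text = extract_text(path)
        if not text.strip():
            # No text layer: probably a scan that needs OCR before it can be used.
            results[path] = []
            continue
        reply = ask_model(text)
        emails: list[str] = []
        for line in reply.splitlines():
            candidate = line.strip().lower()
            # Skip blank lines and the "No email addresses found." sentinel.
            if "@" in candidate and candidate not in emails:
                emails.append(candidate)
        results[path] = emails
    return results
```

Keeping the results per file makes it easy to spot documents that came back empty and may need OCR or a manual look.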
