Scrape and Extract Email Addresses from Websites
Tested prompts for extracting emails from a website, compared across 5 leading AI models.
When you land on a website and need to pull out every email address buried in the page source, contact sections, or body text, you are facing a classic data extraction problem. Maybe you are building a list of journalists for PR outreach, compiling vendor contacts from a supplier directory, or auditing your own site to find exposed addresses before scrapers do. The core challenge is the same: email addresses rarely live in a clean column. They are mixed into paragraphs, hidden behind obfuscation tricks, or scattered across dozens of pages.
AI models handle this well because email extraction is a pattern-matching task. A well-prompted model reads raw HTML, plain text, or copied page content and returns only the valid email strings, stripping the surrounding noise. No regex debugging, no custom scraper scripts, no manual hunting.
This page shows you the exact prompt to use, how five leading models perform on the same input, and how to get consistent, clean results whether you are processing a single page or batching hundreds of them.
When to use this
This approach fits best when you have raw text or HTML from a webpage and need a clean list of email addresses fast, without writing code. It works for one-off extractions and for batch workflows where an automated pipeline feeds page content to the model. Use it when the source material is messy, semi-structured, or mixed with unrelated text.
- Pulling contact emails from a business directory or industry association member page
- Extracting press or media contact addresses from a news outlet's staff page
- Auditing your own website's source code to find exposed email addresses before spammers do
- Compiling speaker or panelist contact info from a conference website
- Harvesting vendor or supplier emails from a wholesale marketplace listing page
When this format breaks down
- When emails are loaded dynamically via JavaScript after page render and you only have the raw HTML before execution, the addresses will not appear in the text you feed the model.
- When the site uses email obfuscation techniques like splitting the address into separate DOM elements or encoding it as an image, the model cannot reconstruct addresses it cannot see as text.
- When you need to crawl hundreds of pages automatically at scale, a prompt-based approach becomes a bottleneck. A dedicated scraping tool or script with regex is more practical for large-volume pipelines.
- When the emails are behind a login wall or CAPTCHA, you cannot access the source text in the first place, making any extraction method irrelevant.
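For the large-volume case mentioned above, a regex baseline might look like the following minimal sketch. The pattern and the `extract_emails` name are illustrative assumptions, not part of the tested prompt, and the pattern is deliberately pragmatic rather than a full address validator.

```python
import re

# Pragmatic pattern for common addresses; an assumption for illustration,
# not a complete RFC-grade validator.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def extract_emails(text: str) -> list[str]:
    """Return unique, lowercased addresses found in text, sorted alphabetically."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})


sample = (
    "Support: <a href='mailto:support@acmecorp.io'>support@acmecorp.io</a> | "
    "Sales: sales@acmecorp.io | Careers: jobs@acmecorp.io"
)
print(extract_emails(sample))
```

A pattern like this also accepts strings such as `logo@2x.png`, and it does not handle obfuscated addresses, so some filtering is still needed on messy pages.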
The prompt we tested
You are an expert email address extraction assistant. Your task is to scan the provided website content and extract every valid email address found within it. Follow these instructions carefully:

Extract all valid email addresses from the provided content and return them as a deduplicated, alphabetically sorted plain-text list with one email per line. Include obfuscated formats (e.g., 'name [at] domain [dot] com') by normalizing them to standard email syntax, and exclude any invalid strings, image filenames, or placeholder examples like 'example@example.com'. If no valid emails are found, respond with exactly: 'No email addresses found.'

Website content to analyze:
Contact Us page from acme-consulting.com: For general inquiries, reach out to info@acme-consulting.com or call us at (555) 123-4567. Our sales team can be reached at sales [at] acme-consulting [dot] com, and for press, email press@acme-consulting.com. You can also contact our CEO directly at jane.doe@acme-consulting.com.

Return the extracted email addresses now.
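To reuse the prompt above with other pages, one option is to keep the instructions fixed and swap in the page content. The sketch below assumes a hypothetical `build_prompt` helper. The instruction text is copied from the prompt above, while the line breaks between sections are a layout choice rather than something the test depended on.

```python
# Instruction text copied from the tested prompt; names are hypothetical.
ROLE = (
    "You are an expert email address extraction assistant. Your task is to scan "
    "the provided website content and extract every valid email address found "
    "within it. Follow these instructions carefully:"
)
RULES = (
    "Extract all valid email addresses from the provided content and return them "
    "as a deduplicated, alphabetically sorted plain-text list with one email per "
    "line. Include obfuscated formats (e.g., 'name [at] domain [dot] com') by "
    "normalizing them to standard email syntax, and exclude any invalid strings, "
    "image filenames, or placeholder examples like 'example@example.com'. If no "
    "valid emails are found, respond with exactly: 'No email addresses found.'"
)


def build_prompt(page_content: str) -> str:
    """Wrap arbitrary page text in the extraction prompt."""
    return (
        f"{ROLE}\n\n{RULES}\n\n"
        f"Website content to analyze:\n{page_content.strip()}\n\n"
        "Return the extracted email addresses now."
    )


print(build_prompt("Contact: info@acme-consulting.com"))
```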
Model comparison
| Model | Quality | Cost | Latency | Tokens |
|---|---|---|---|---|
| Claude Opus 4.7 | 10 | $0.0092 | 5.0s | 351 / 52 |
| Claude Haiku 4.5 | 9 | $0.0005 | 911ms | 253 / 40 |
| GPT-5.4 | 10 | $0.0012 | 1.2s | 237 / 41 |
| Gemini 2.5 Pro | 9 | $0.0005 | 9.1s | 247 / 41 |
| Grok 4.1 Fast Reasoning | 10 | $0.0001 | 2.6s | 386 / 38 |
Outputs
info@acme-consulting.com jane.doe@acme-consulting.com press@acme-consulting.com sales@acme-consulting.com
info@acme-consulting.com jane.doe@acme-consulting.com press@acme-consulting.com sales@acme-consulting.com
info@acme-consulting.com jane.doe@acme-consulting.com press@acme-consulting.com sales@acme-consulting.com
info@acme-consulting.com jane.doe@acme-consulting.com press@acme-consulting.com sales@acme-consulting.com
What makes these work
01. Feed raw text, not screenshots
AI models extract emails from text, not images. Always paste the actual page source, copied body text, or raw HTML into the prompt. If you have a screenshot, run it through OCR first to get a text string the model can process.
02. Ask for structured output explicitly
Telling the model to return one email per line, or to pair emails with associated names or company labels, prevents it from returning a paragraph of prose. Specify the output format in your prompt and the results are immediately usable without cleanup.
03. Request deduplication in the prompt
Large directory pages often repeat the same contact email multiple times. Adding a simple instruction like 'return only unique addresses' removes duplicates at the extraction stage, so you do not have to filter the list yourself afterward.
04. Validate format with a second instruction
Some pages contain partial or malformed strings that look like emails but are not. Ask the model to return only addresses matching the standard user@domain.tld format. This filters out artifacts like truncated text or code snippets that mimic email patterns.
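Even with these instructions in the prompt, it can be worth checking the model's reply in code before using it. The sketch below is one possible post-processing step. It assumes the reply follows the one-email-per-line format and uses the 'No email addresses found.' fallback from the tested prompt. The `clean_response` name and the loose `user@domain.tld` check are illustrative assumptions.

```python
import re

NO_RESULT = "No email addresses found."  # fallback string from the tested prompt
# Loose user@domain.tld shape check; an assumption, not a strict validator.
SIMPLE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")


def clean_response(response: str) -> list[str]:
    """Parse a one-email-per-line reply into unique, format-checked addresses."""
    if response.strip() == NO_RESULT:
        return []
    seen, result = set(), []
    for line in response.splitlines():
        candidate = line.strip().lower()
        # Keep only lines that look like an address and have not appeared yet.
        if SIMPLE_EMAIL.match(candidate) and candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


reply = "info@acme-consulting.com\nInfo@acme-consulting.com\nnot-an-email\nsales@acme-consulting.com\n"
print(clean_response(reply))
```

Lowercasing before comparison treats `Info@...` and `info@...` as the same address, which is usually what you want for a contact list.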
More example scenarios
Here is the copied text from the 'Our Team' page of a regional news website. Extract all email addresses: Jane Ortega, Editor-in-Chief | jane.ortega@citypress.com | Sarah Lin covers metro news, reach her at s.lin@citypress.com. For advertising, contact Mark Booker: mbooker@citypress.com. Tips? Send to tips@citypress.com.
jane.ortega@citypress.com s.lin@citypress.com mbooker@citypress.com tips@citypress.com
Scan this raw HTML snippet from our homepage footer and list every email address present: <footer><p>Support: <a href='mailto:support@acmecorp.io'>support@acmecorp.io</a> | Sales: sales@acmecorp.io | Do not reply to noreply@acmecorp.io</p><p>Careers: jobs@acmecorp.io</p></footer>
support@acmecorp.io sales@acmecorp.io noreply@acmecorp.io jobs@acmecorp.io
Extract all email addresses from this text copied from the speakers page of a tech conference: Dr. Priya Nair (keynote) can be reached at p.nair@techconf.org. Workshop host Tom Reyes: treyes@designhub.net. For speaker inquiries contact the events team at speakers@summit2024.com.
p.nair@techconf.org treyes@designhub.net speakers@summit2024.com
Below is pasted text from a B2B supplier directory page. Pull every email address and pair it with the company name if visible: GreenLeaf Supplies - orders@greenleafsupplies.com | Metro Parts Co. - sales@metroparts.com, returns@metroparts.com | Horizon Wholesale: contact@horizonwholesale.net
GreenLeaf Supplies: orders@greenleafsupplies.com Metro Parts Co.: sales@metroparts.com, returns@metroparts.com Horizon Wholesale: contact@horizonwholesale.net
Extract only unique email addresses from this alumni directory text, ignoring duplicates: Alex Torres alex.torres@gmail.com | Maria Chen: m.chen@outlook.com | James Wu jwu@company.org | Alex Torres alex.torres@gmail.com | Lisa Park: lisa.park@techfirm.io
alex.torres@gmail.com m.chen@outlook.com jwu@company.org lisa.park@techfirm.io
Common mistakes to avoid
Pasting rendered page text instead of source
When you copy text from a browser's visible page, you often miss emails that are in HTML attributes like mailto links but not displayed as visible text. Copy the page source or use browser developer tools to get the full HTML before pasting into your prompt.
Skipping deduplication instructions
Forgetting to ask for unique results is the most common cause of inflated lists. A 50-contact directory page can easily return 200 rows if the same email appears in navigation, body text, and footer. Always ask the model to deduplicate.
Treating all extracted emails as opted-in contacts
Extracting an email from a public webpage does not grant permission to email that person for marketing. Using scraped emails for cold outreach without a lawful basis can violate GDPR, CAN-SPAM, and similar regulations, depending on the jurisdiction and how the messages are sent. Know the legal context before you use the list.
Not specifying what to do with obfuscated formats
Some sites write emails as 'name [at] domain [dot] com' to avoid scrapers. If you do not instruct the model to normalize these formats into standard addresses, they get returned as-is or skipped entirely. Add a line in your prompt asking the model to convert common obfuscation patterns.
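Two of the mistakes above, missed mailto links and unhandled obfuscation, can also be addressed in code before the text reaches the model. The following is a minimal sketch using only the Python standard library. `MailtoCollector` and `deobfuscate` are hypothetical names, and the obfuscation patterns covered are limited to bracketed or parenthesized "at" and "dot".

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote


class MailtoCollector(HTMLParser):
    """Collect addresses from href="mailto:..." attributes in page source."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                address = unquote(value[len("mailto:"):].split("?", 1)[0])
                if address:
                    self.addresses.append(address)


def deobfuscate(text: str) -> str:
    """Rewrite 'name [at] domain [dot] com' style text as 'name@domain.com'."""
    text = re.sub(r"\s*[\[(]\s*at\s*[\])]\s*", "@", text, flags=re.IGNORECASE)
    return re.sub(r"\s*[\[(]\s*dot\s*[\])]\s*", ".", text, flags=re.IGNORECASE)


collector = MailtoCollector()
collector.feed("<a href='mailto:support@acmecorp.io?subject=Hi'>Email us</a>")
print(collector.addresses)
print(deobfuscate("sales [at] acme-consulting [dot] com"))
```

The substitution is blunt: ordinary prose such as "meet (at) noon" would also be rewritten, so a format check on the final list remains useful.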
Frequently asked questions
Can AI extract emails from a website URL directly?
Not on its own. AI language models do not browse the web or fetch URLs unless connected to a tool that does. You need to retrieve the page content first, either by copying the text, pulling the HTML, or using a scraping tool, then feed that content to the model for extraction.
What is the best way to extract emails from multiple pages at once?
Build a simple pipeline: scrape each page's HTML using a tool like Playwright, Puppeteer, or a basic Python requests script, then pass each page's text to the AI model with your extraction prompt. You can batch multiple pages in one prompt if the combined text stays within the model's context window, or process them sequentially.
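The batching step can be sketched as below, assuming the pages have already been fetched into a dictionary of URL to text. `batch_pages` and `max_chars` are hypothetical names, and a character budget is only a rough stand-in for a model's token-based context limit.

```python
def batch_pages(pages: dict[str, str], max_chars: int) -> list[str]:
    """Group page texts into batches that each stay within max_chars.

    A single page longer than max_chars is still emitted as its own batch,
    so very long pages would need to be split separately.
    """
    batches, current, size = [], [], 0
    for url, text in pages.items():
        block = f"--- {url} ---\n{text}\n"
        # Start a new batch when adding this page would exceed the budget.
        if current and size + len(block) > max_chars:
            batches.append("".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        batches.append("".join(current))
    return batches


pages = {"a": "x" * 50, "b": "y" * 50, "c": "z" * 50}
for batch in batch_pages(pages, max_chars=130):
    print(len(batch))
```

Each batch can then be passed to the model as the content portion of the extraction prompt. The `--- url ---` separators are there so results can be traced back to a page if the prompt asks for that.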
Is it legal to extract emails from websites?
It depends on the jurisdiction, the site's terms of service, and how you intend to use the data. Publicly visible emails can generally be read, but using them for unsolicited marketing may violate GDPR in Europe, CAN-SPAM in the US, or CASL in Canada. Always check the site's terms and applicable law before using extracted emails commercially.
How do I extract emails from a website that uses JavaScript to load content?
Standard HTML fetching will not capture emails loaded by JavaScript after the initial page render. Use a headless browser tool like Playwright or Puppeteer that executes JavaScript and returns the fully rendered DOM, then extract the email addresses from that output.
How accurate is AI at extracting emails compared to regex?
For clean, well-structured text, both methods perform similarly. AI has an edge on messy, obfuscated, or natural-language-heavy content where regex patterns miss contextual variations. Regex is faster and cheaper at high volume on structured data. For one-off or variable-format extractions, AI is more reliable and requires no code.
Can I extract emails from a PDF or document linked on a website?
Yes, but you need to convert the PDF to text first. Tools like pdfplumber, Adobe Acrobat, or online converters can extract the raw text layer from a PDF. Once you have the text, the same AI extraction prompt works exactly as it does for webpage content.