Pull Email Addresses from Raw HTML and Page Source

Tested prompts for extracting emails from HTML source, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (10/10)

When you have a chunk of raw HTML and need to pull out every email address buried inside it, you are not looking for a tutorial on web scraping libraries or a lecture on regex theory. You need to paste in the source, get the emails out, and move on. That is exactly what this page covers: using an AI prompt to extract email addresses directly from HTML source code, whether that is a full page dump, a snippet of markup, or a messy blob of inline styles and script tags mixed with contact info.

The HTML source of a webpage is noisy. Email addresses hide inside mailto: href attributes, inside plain text nodes, inside JavaScript strings, inside meta tags, and occasionally inside comments. A naive regex will miss the entity-encoded and obfuscated ones and can match strings that are not addresses at all. AI models handle this well because they read the surrounding context instead of only matching a pattern.

This page shows you the exact prompt to use, compares how five different models handle the same HTML input, and gives you practical guidance on edge cases like obfuscated emails, duplicate filtering, and what to do when the source is minified or encoded.

When to use this

This approach works best when you have HTML source in hand and need a fast, accurate list of email addresses without writing code. It fits one-off tasks, quick audits, and situations where the HTML is irregular enough that a simple regex would need heavy customization to be reliable.

Auditing a scraped webpage to find all contact emails before importing into a CRM
Extracting mailto: links and plain-text emails from a competitor or partner site's source for outreach lists
Pulling contact info from a saved HTML file of a forum, directory, or member listing
Reviewing the source of your own site to find emails accidentally exposed in comments or meta tags
Processing a batch of HTML snippets exported from a content management system or email template builder

When this format breaks down

If the HTML is hundreds of kilobytes or megabytes long, it will exceed the context window of most models. Split it into sections first or use a dedicated parsing script.
If the emails are loaded dynamically via JavaScript after the initial page render, the raw HTML source will not contain them. You need a headless browser to capture the rendered DOM instead.
If you need to process thousands of HTML files automatically on a schedule, a coded solution using a regex or an HTML parser library is faster and cheaper than sending each file to an AI model.
If the emails are encoded in base64 or heavily obfuscated with character entity substitution, the model may miss some. Decode the source before passing it to the prompt.
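
For the first case, one option is to split the source into overlapping chunks and run the prompt on each one. The sketch below is a minimal illustration, not a tuned implementation: the function name `split_html`, the 20,000-character chunk size, and the 320-character overlap are all assumptions made for the example. The overlap is meant to be longer than any address you expect, so an email cut by one chunk boundary appears whole in the next chunk.

```python
def split_html(source: str, max_chars: int = 20_000, overlap: int = 320) -> list[str]:
    """Split source into chunks of at most max_chars that overlap by `overlap` characters."""
    if max_chars < 2 * overlap:
        raise ValueError("max_chars should be at least twice the overlap")
    chunks = []
    start = 0
    step = max_chars - overlap  # each new chunk re-reads the tail of the previous one
    while start < len(source):
        chunks.append(source[start:start + max_chars])
        if start + max_chars >= len(source):
            break  # this chunk already reaches the end of the source
        start += step
    return chunks
```

Because of the overlap, the same address can come back from two neighboring chunks, so deduplicate the combined results afterwards.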

The prompt we tested

You are an email extraction tool. Scan the following HTML source code and pull out every unique email address it contains, including those inside mailto: links, attributes, comments, obfuscated formats (e.g., 'name [at] domain [dot] com'), and plain text.

Rules:
Return only a deduplicated list of valid email addresses, one per line, in lowercase, with no commentary, numbering, or surrounding text. Decode common obfuscations (e.g., [at], (dot), HTML entities like &#64;) into standard email format, and omit anything that is not a valid email address.

HTML source:
<html><body><p>Contact our team at <a href="mailto:Sales@Acme.com">Sales</a> or support&#64;acme.com for help.</p><!-- admin [at] acme [dot] com --><footer>Press: press@acme.com</footer></body></html>
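
If you run this prompt on more than one file, it helps to keep the instructions fixed and swap in only the HTML. The sketch below only builds the prompt text; the names `PROMPT_HEADER` and `build_prompt` are hypothetical, and how you send the result to a model depends on the tool you use.

```python
# The instruction text is copied from the prompt above; only the HTML changes per run.
PROMPT_HEADER = (
    "You are an email extraction tool. Scan the following HTML source code and "
    "pull out every unique email address it contains, including those inside "
    "mailto: links, attributes, comments, obfuscated formats "
    "(e.g., 'name [at] domain [dot] com'), and plain text.\n\n"
    "Rules:\n"
    "Return only a deduplicated list of valid email addresses, one per line, in "
    "lowercase, with no commentary, numbering, or surrounding text. Decode common "
    "obfuscations (e.g., [at], (dot), HTML entities like &#64;) into standard "
    "email format, and omit anything that is not a valid email address.\n\n"
    "HTML source:\n"
)


def build_prompt(html_source: str) -> str:
    """Append the HTML to the fixed instructions (plain concatenation, no templating)."""
    return PROMPT_HEADER + html_source
```

Plain concatenation is used instead of string formatting so that braces or percent signs inside the HTML cannot interfere with the template.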

Model comparison

Model	Quality (/10)	Cost	Latency	Tokens (input / output)
Claude Opus 4.7	10	$0.0069	1.7s	293 / 33
Claude Haiku 4.5	10	$0.0004	761ms	223 / 30
GPT-5.4	10	$0.0009	1.2s	201 / 26
Gemini 2.5 Pro	10	$0.0004	8.6s	195 / 27
Grok 4.1 Fast Reasoning	10	$0.0001	2.6s	349 / 23

Outputs

Claude Opus 4.7 10/10

sales@acme.com
support@acme.com
admin@acme.com
press@acme.com

Claude Haiku 4.5 10/10

sales@acme.com
support@acme.com
admin@acme.com
press@acme.com

GPT-5.4 10/10

sales@acme.com
support@acme.com
admin@acme.com
press@acme.com

Gemini 2.5 Pro 10/10

sales@acme.com
support@acme.com
admin@acme.com
press@acme.com

Grok 4.1 Fast Reasoning 10/10

sales@acme.com
support@acme.com
admin@acme.com
press@acme.com
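
All five outputs here are already clean, but a model can still add a stray sentence or repeat an address on other inputs. A small post-processing step can enforce the prompt's rules on whatever comes back. This is a sketch under assumptions: `clean_model_output` is a hypothetical helper, and the pattern is the common regex quoted in the FAQ on this page, not a full validator.

```python
import re

# Lowercase-only variant of the common pattern, applied after lowercasing each line.
EMAIL_LINE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}")


def clean_model_output(text: str) -> list[str]:
    """Keep lines that are exactly one email address, lowercased, in first-seen order."""
    unique = {}
    for line in text.splitlines():
        candidate = line.strip().lower()
        if EMAIL_LINE.fullmatch(candidate):
            unique.setdefault(candidate, None)  # dicts preserve insertion order
    return list(unique)
```

Lines that carry anything besides the address, such as a bullet marker or a trailing note, are dropped by this sketch, which matches the "no commentary" rule in the prompt.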

What makes these work

#01 · Ask for deduplication explicitly
HTML pages often repeat the same email in a mailto link and again in plain text. Tell the prompt to return unique emails only. Without that instruction, many models will list duplicates and you will need to clean the output yourself.

#02 · Request the source context alongside each email
Adding 'include where each email was found' to your prompt gets you output like 'jane@co.com (mailto href, team section)'. This is useful when you need to verify the extraction or explain to a client where contact info was located on the page.

#03 · Flag obfuscated formats in your prompt
Tell the model to look for common obfuscation patterns like 'name [at] domain [dot] com' and JavaScript string concatenation. Without this instruction, some models treat obfuscated addresses as plain text and skip them entirely.

#04 · Trim the HTML before pasting
Remove large third-party script blocks, CSS, and base64-encoded image data before submitting. These sections rarely contain real emails, they waste tokens, and they can push a long input past the model's context limit so that addresses near the end are missed. Keep short inline scripts, since addresses are sometimes assembled there from string fragments.
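
The trimming step can be sketched with the standard library. This is a rough regex-based pass, not a real HTML parser, and the names and the 500-character threshold are assumptions for the example. It drops style blocks and large script blocks but keeps short inline scripts, on the assumption that addresses are sometimes built there from string fragments.

```python
import re

SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
BASE64_DATA_URI = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")


def trim_html(source: str, keep_scripts_under: int = 500) -> str:
    """Drop style blocks, large script blocks, and base64 data URIs."""

    def drop_if_noise(match: re.Match) -> str:
        block = match.group(0)
        is_short_script = (
            match.group(1).lower() == "script" and len(block) <= keep_scripts_under
        )
        return block if is_short_script else ""

    trimmed = SCRIPT_OR_STYLE.sub(drop_if_noise, source)
    # Replace the encoded payload with an empty data URI so the markup stays readable.
    return BASE64_DATA_URI.sub("data:,", trimmed)
```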

More example scenarios

#01 · SaaS company team page HTML snippet

Input

<div class='team-member'><h3>Jane Doe</h3><p>VP of Sales</p><a href='mailto:jane.doe@acme.com'>Contact Jane</a></div><div class='team-member'><h3>Bob Smith</h3><p>Support Lead</p><p>Reach Bob at bob.smith@acme.com for help.</p></div>

Expected output

jane.doe@acme.com
bob.smith@acme.com

#02 · Law firm directory page with mixed formats

Input

<!-- Partner emails updated 2024 --><p>For billing inquiries contact billing&#64;lawfirm.com or call us. General matters go to info@lawfirm.com. Partner John Reeves can be reached at <a href='mailto:j.reeves@lawfirm.com'>his direct line</a>.</p>

Expected output

billing@lawfirm.com
info@lawfirm.com
j.reeves@lawfirm.com

#03 · E-commerce footer with obfuscated support email

Input

<footer><p>Questions? Email us at support [at] shopexample [dot] com or use the form below. Press inquiries: press@shopexample.com</p><script>var e='returns'+'@'+'shopexample.com';</script></footer>

Expected output

support@shopexample.com (obfuscated in text)
press@shopexample.com
returns@shopexample.com (constructed in inline script)
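
For comparison, the two obfuscations in this scenario can also be undone in code. The sketch below is illustrative only: the patterns cover just the bracketed [at] / [dot] style and adjacent quoted string fragments joined with +, and `deobfuscate` is a hypothetical helper name.

```python
import re

AT = re.compile(r"\s*[\[(]\s*at\s*[\])]\s*", re.IGNORECASE)
DOT = re.compile(r"\s*[\[(]\s*dot\s*[\])]\s*", re.IGNORECASE)
JS_CONCAT = re.compile(r"""['"]\s*\+\s*['"]""")  # the '+' glue between string literals
EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def deobfuscate(text: str) -> str:
    """Rewrite '[at]' / '[dot]' markers and merge concatenated string literals."""
    text = AT.sub("@", text)
    text = DOT.sub(".", text)
    return JS_CONCAT.sub("", text)


footer = (
    "<footer><p>Questions? Email us at support [at] shopexample [dot] com or use "
    "the form below. Press inquiries: press@shopexample.com</p>"
    "<script>var e='returns'+'@'+'shopexample.com';</script></footer>"
)
emails = EMAIL.findall(deobfuscate(footer))
# emails == ['support@shopexample.com', 'press@shopexample.com', 'returns@shopexample.com']
```

These rewrites are blunt and can alter ordinary prose that happens to contain "(at)" or "(dot)", so treat the result as text to search, not as cleaned HTML.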

#04 · Nonprofit event page with multiple departments

Input

<section id='contact'><ul><li>Volunteer sign-ups: <a href='mailto:volunteer@helpnow.org'>volunteer@helpnow.org</a></li><li>Donations: donate@helpnow.org</li><li>Media: media@helpnow.org</li><li>General: info@helpnow.org</li></ul></section>

Expected output

volunteer@helpnow.org
donate@helpnow.org
media@helpnow.org
info@helpnow.org

#05 · University faculty listing with duplicates and role labels

Input

<p>Department Chair: dr.lee@university.edu | Admissions: admissions@university.edu</p><p>For academic advising contact dr.lee@university.edu. Graduate office: gradoffice@university.edu. Admissions inquiries also go to admissions@university.edu.</p>

Expected output

dr.lee@university.edu
admissions@university.edu
gradoffice@university.edu

(Duplicates removed: dr.lee@university.edu and admissions@university.edu each appeared twice)
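
The duplicate counts in this scenario are easy to verify locally. A minimal sketch, assuming the common email regex from the FAQ on this page and a hypothetical helper named `count_emails`:

```python
import re
from collections import Counter

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def count_emails(html_source: str) -> Counter:
    """Count each lowercased address; Counter keeps first-seen order."""
    return Counter(match.lower() for match in EMAIL.findall(html_source))


listing = (
    "<p>Department Chair: dr.lee@university.edu | Admissions: admissions@university.edu</p>"
    "<p>For academic advising contact dr.lee@university.edu. Graduate office: "
    "gradoffice@university.edu. Admissions inquiries also go to admissions@university.edu.</p>"
)
counts = count_emails(listing)
unique = list(counts)                                 # deduplicated, in page order
repeated = [e for e, n in counts.items() if n > 1]    # addresses seen more than once
```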

Common mistakes to avoid

Pasting the entire rendered page source
Full page HTML from a browser's 'View Source' often includes analytics scripts, ad pixels, and tracking libraries that add thousands of tokens of noise. Strip or collapse these sections first. Leaner input produces faster and more accurate output.
Assuming all emails will be in mailto: links
A significant portion of email addresses on real pages appear in plain text paragraphs, inside alt attributes, or embedded in JavaScript variables. A prompt that only checks href attributes will miss these. Make sure your prompt specifies searching all text content, not just links.
Not specifying output format
Without a format instruction, models may return emails as a bulleted list, a comma-separated string, or embedded in a sentence. If you need a clean list for a spreadsheet or script, specify 'return one email per line, no other text' to avoid manual reformatting.
Ignoring HTML entity encoding
The at-sign is sometimes encoded as &#64; or &commat; in HTML to reduce spam scraping. If you paste encoded HTML directly, some models will decode and find these. Others will not. Explicitly mention entity-encoded emails in your prompt to be safe.
Treating every extracted string as a valid contact email
HTML source sometimes contains example emails like user@example.com, test@test.com, or placeholder addresses inside templates. Ask the model to flag or exclude obvious placeholder and example addresses, especially if you are feeding the output directly into an outreach tool.
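
If you would rather filter placeholders after extraction than rely on the prompt, a simple blocklist is one option. The domain and local-part lists below are assumptions for illustration, not an exhaustive set, and `split_placeholders` is a hypothetical helper name.

```python
# Illustrative blocklists; extend them for your own data.
PLACEHOLDER_DOMAINS = {"example.com", "example.org", "example.net", "test.com"}
PLACEHOLDER_LOCALS = {"yourname", "your.email", "username"}


def split_placeholders(emails: list[str]) -> tuple[list[str], list[str]]:
    """Return (kept, flagged), flagging addresses that look like placeholders."""
    kept, flagged = [], []
    for email in emails:
        local, _, domain = email.lower().rpartition("@")
        if domain in PLACEHOLDER_DOMAINS or local in PLACEHOLDER_LOCALS:
            flagged.append(email)
        else:
            kept.append(email)
    return kept, flagged
```

Flagging instead of silently deleting keeps the decision reviewable before anything reaches an outreach tool.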

Frequently asked questions

Can I extract emails from HTML source without writing code?

Yes. Pasting the HTML into an AI prompt with a clear extraction instruction is the fastest no-code method. You get a clean list in seconds. For recurring or large-scale tasks, a simple Python script using the re module or BeautifulSoup is more practical, but for one-off extractions AI is faster.

What is the best regex to extract emails from HTML?

A commonly used pattern is [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}. It catches most standard addresses but will miss obfuscated formats and HTML-entity-encoded at-signs. For messy real-world HTML, an AI model handles edge cases better than regex alone because it understands context around the text.

How do I extract emails from a mailto href specifically?

In raw HTML, mailto links look like href='mailto:name@domain.com'. A regex targeting href="mailto:([^"]+)" or href='mailto:([^']+)' will pull just those, though the captured value can also include a query string such as ?subject=Hello that needs trimming. If you would rather not write the regex yourself, pass the HTML to an AI prompt and ask for mailto addresses only.
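
A short Python sketch of the mailto approach follows. It assumes both quote styles can occur, trims any ?subject= style query string, and splits comma-separated recipients; `mailto_addresses` is a hypothetical helper name.

```python
import re
from urllib.parse import unquote

# Group 1 is the opening quote, group 2 is everything after "mailto:" up to the same quote.
MAILTO = re.compile(r"""href\s*=\s*(["'])mailto:(.*?)\1""", re.IGNORECASE)


def mailto_addresses(html_source: str) -> list[str]:
    """Return unique, lowercased addresses found in mailto: href attributes."""
    results = []
    for _quote, target in MAILTO.findall(html_source):
        recipients = target.split("?", 1)[0]        # drop ?subject=... and similar
        for address in recipients.split(","):       # mailto allows several recipients
            address = unquote(address).strip().lower()  # decode %40 and similar escapes
            if address and address not in results:
                results.append(address)
    return results
```

This only looks at href attributes, so plain-text and obfuscated addresses elsewhere on the page are out of scope for it.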

Will this work on minified or compressed HTML?

Minified HTML removes whitespace but keeps all attributes and text content intact, so email extraction still works. Heavily compressed or binary-encoded content needs to be decoded first. If your source looks like a scrambled string rather than recognizable HTML tags, decompress or decode it before extraction.

How do I extract emails from HTML source in Python?

Use the re module with a pattern like re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', html_string). Pair it with BeautifulSoup to decode HTML entities first, so encoded addresses are normalized before the regex runs. This covers standard and entity-encoded addresses; obfuscated formats such as 'name [at] domain [dot] com' still need separate handling.
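
Here is a minimal sketch of that approach, run on the sample HTML from the prompt section. To stay dependency-free it swaps BeautifulSoup for the standard library's html.unescape, which handles the entity decoding step; `extract_emails` is a hypothetical helper name.

```python
import html
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract_emails(html_string: str) -> list[str]:
    """Decode HTML entities, then return unique lowercased matches in page order."""
    decoded = html.unescape(html_string)  # turns &#64; and &commat; into @
    return list(dict.fromkeys(match.lower() for match in EMAIL.findall(decoded)))


sample = (
    '<html><body><p>Contact our team at <a href="mailto:Sales@Acme.com">Sales</a> '
    "or support&#64;acme.com for help.</p><!-- admin [at] acme [dot] com -->"
    "<footer>Press: press@acme.com</footer></body></html>"
)
found = extract_emails(sample)
# found == ['sales@acme.com', 'support@acme.com', 'press@acme.com']
```

The address in the comment, written as admin [at] acme [dot] com, is not found by this sketch, which is the kind of gap the AI prompt closes.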

Can AI find emails hidden in HTML comments or JavaScript?

Yes, when instructed to. HTML comments are enclosed in <!-- and --> delimiters and are part of the raw source, so they are visible in the text passed to the model. JavaScript string variables that contain email addresses are also visible in source. Include a note in your prompt to check comments and script blocks to make sure none are missed.
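
To check for yourself what a page's comments and inline scripts contain, you can pull them out with Python's built-in html.parser module. This is a small sketch, and the class name `HiddenTextCollector` is made up for the example.

```python
from html.parser import HTMLParser


class HiddenTextCollector(HTMLParser):
    """Collect the text of HTML comments and inline <script> blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.comments = []
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_comment(self, data):
        self.comments.append(data.strip())

    def handle_data(self, data):
        # Only keep text that sits inside a <script> element.
        if self._in_script and data.strip():
            self.scripts.append(data.strip())


collector = HiddenTextCollector()
collector.feed(
    "<p>hi</p><!-- admin [at] acme [dot] com -->"
    "<script>var e='returns'+'@'+'shopexample.com';</script>"
)
collector.close()
```

After this runs, collector.comments and collector.scripts hold the raw comment and script text, which you can paste into the prompt on their own or search separately.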
