Pull Email Addresses from Raw HTML and Page Source
Tested prompts for extracting emails from HTML source, compared across 5 leading AI models.
When you have a chunk of raw HTML and need to pull out every email address buried inside it, you are not looking for a tutorial on web scraping libraries or a lecture on regex theory. You need to paste in the source, get the emails out, and move on. That is exactly what this page covers: using an AI prompt to extract email addresses directly from HTML source code, whether that is a full page dump, a snippet of markup, or a messy blob of inline styles and script tags mixed with contact info.
The HTML source of a webpage is noisy. Email addresses hide inside mailto: href attributes, plain text nodes, JavaScript strings, meta tags, and occasionally comments. A naive regex misses obfuscated or entity-encoded addresses and can match strings that only look like emails, such as an image filename like logo@2x.png. AI models tend to handle this well because they read the surrounding context rather than matching a pattern alone.
This page shows you the exact prompt to use, compares how five different models handle the same HTML input, and gives you practical guidance on edge cases like obfuscated emails, duplicate filtering, and what to do when the source is minified or encoded.
When to use this
This approach works best when you have HTML source in hand and need a fast, accurate list of email addresses without writing code. It fits one-off tasks, quick audits, and situations where the HTML is irregular enough that a simple regex would need heavy customization to be reliable.
- Auditing a scraped webpage to find all contact emails before importing into a CRM
- Extracting mailto: links and plain-text emails from a competitor or partner site's source for outreach lists
- Pulling contact info from a saved HTML file of a forum, directory, or member listing
- Reviewing the source of your own site to find emails accidentally exposed in comments or meta tags
- Processing a batch of HTML snippets exported from a content management system or email template builder
When this format breaks down
- If the HTML is hundreds of kilobytes or megabytes long, it can exceed the context window of many models. Split it into sections first or use a dedicated parsing script.
- If the emails are loaded dynamically via JavaScript after the initial page render, the raw HTML source will not contain them. You need a headless browser to capture the rendered DOM instead.
- If you need to process thousands of HTML files automatically on a schedule, a coded solution using a regex or an HTML parser library is faster and cheaper than sending each file to an AI model.
- If the emails are encoded in base64 or heavily obfuscated with character entity substitution, the model may miss some. Decode the source before passing it to the prompt.
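For the oversized-input case in the list above, one way to split the source before prompting can be sketched as follows. The helper name `split_html` and both default sizes are illustrative assumptions, not limits taken from any model's documentation. Chunks overlap so that an address straddling a boundary still appears whole in the next chunk, provided the overlap is at least as long as the longest address you expect.

```python
def split_html(source, max_chars=20000, overlap=256):
    """Split source into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that an email cut
    in half at the end of one chunk is intact at the start of the next.
    Both defaults are placeholder assumptions; tune them to your model.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(source):
        chunks.append(source[start:start + max_chars])
        if start + max_chars >= len(source):
            break  # the chunk just added reached the end of the source
        start += max_chars - overlap
    return chunks
```

Each chunk would then go through the prompt separately, with the per-chunk results merged and deduplicated afterwards.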
The prompt we tested
You are an email extraction tool. Scan the following HTML source code and pull out every unique email address it contains, including those inside mailto: links, attributes, comments, obfuscated formats (e.g., 'name [at] domain [dot] com'), and plain text. Rules: Return only a deduplicated list of valid email addresses, one per line, in lowercase, with no commentary, numbering, or surrounding text. Decode common obfuscations (e.g., [at], (dot), HTML entities like &#64;) into standard email format, and omit anything that is not a valid email address. HTML source: <html><body><p>Contact our team at <a href="mailto:Sales@Acme.com">Sales</a> or support@acme.com for help.</p><!-- admin [at] acme [dot] com --><footer>Press: press@acme.com</footer></body></html>
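If a model's reply is going straight into a spreadsheet or script, it may be worth checking that the reply actually followed the prompt's rules (one address per line, lowercase, deduplicated, no commentary). The sketch below is one hypothetical way to do that; the function name `check_reply` and the simplified address pattern are assumptions for illustration, not part of the tested prompt.

```python
import re

# Simplified, lowercase-only address pattern (an assumption for this sketch,
# not a full RFC 5322 validator).
EMAIL_LINE_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}")

def check_reply(reply):
    """Return (emails, rejected_lines) for a model reply.

    A line is accepted only if the whole line is a single lowercase
    address; anything else (commentary, numbering, uppercase) is rejected.
    """
    emails, rejected = [], []
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if not EMAIL_LINE_RE.fullmatch(line):
            rejected.append(line)
        elif line not in emails:
            emails.append(line)  # keep first occurrence, drop repeats
    return emails, rejected
```

A non-empty `rejected` list is a signal that the reply needs a manual look or a re-run with a stricter format instruction.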
Model comparison
| Model | Quality | Cost | Latency | Tokens (in / out) |
|---|---|---|---|---|
| Claude Opus 4.7 | 10 | $0.0069 | 1.7s | 293 / 33 |
| Claude Haiku 4.5 | 10 | $0.0004 | 761ms | 223 / 30 |
| GPT-5.4 | 10 | $0.0009 | 1.2s | 201 / 26 |
| Gemini 2.5 Pro | 10 | $0.0004 | 8.6s | 195 / 27 |
| Grok 4.1 Fast Reasoning | 10 | $0.0001 | 2.6s | 349 / 23 |
Outputs
sales@acme.com support@acme.com admin@acme.com press@acme.com
sales@acme.com support@acme.com admin@acme.com press@acme.com
sales@acme.com support@acme.com admin@acme.com press@acme.com
sales@acme.com support@acme.com admin@acme.com press@acme.com
What makes these work
01. Ask for deduplication explicitly
HTML pages often repeat the same email in a mailto link and again in plain text. Tell the prompt to return unique emails only. Without that instruction, many models will list duplicates and you will need to clean the output yourself.
02. Request the source context alongside each email
Adding 'include where each email was found' to your prompt gets you output like 'jane@co.com (mailto href, team section)'. This is useful when you need to verify the extraction or explain to a client where contact info was located on the page.
03. Flag obfuscated formats in your prompt
Tell the model to look for common obfuscation patterns like 'name [at] domain [dot] com' and JavaScript string concatenation. Without this instruction, some models treat obfuscated addresses as plain text and skip them entirely.
04. Trim the HTML before pasting
Remove large script blocks, CSS, and base64-encoded image data before submitting. These sections rarely contain real emails, they waste tokens, and they can push addresses near the end of a long input past the model's context limit. Keep short inline scripts if you expect addresses assembled from JavaScript strings.
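The trimming step in the last tip can be approximated with a few regular expressions, shown here as a rough sketch. Regex-based stripping is not a real HTML parser and can misfire on unusual markup; the function name `trim_html` and the 2000-character script threshold are assumptions chosen for illustration. Short inline scripts are kept on purpose, since they sometimes hold addresses built from string concatenation.

```python
import re

STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
DATA_URI_RE = re.compile(r"data:[\w/+.-]+;base64,[A-Za-z0-9+/=]+")

def trim_html(source, max_script_chars=2000):
    """Drop CSS blocks, base64 data URIs, and large script blocks."""
    source = STYLE_RE.sub("", source)
    source = DATA_URI_RE.sub("data:removed", source)

    def drop_if_large(match):
        block = match.group(0)
        # Keep small inline scripts; they may contain concatenated emails.
        return "" if len(block) > max_script_chars else block

    return SCRIPT_RE.sub(drop_if_large, source)
```

Raising or lowering `max_script_chars` trades token savings against the risk of discarding a script that contains an address.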
More example scenarios
<div class='team-member'><h3>Jane Doe</h3><p>VP of Sales</p><a href='mailto:jane.doe@acme.com'>Contact Jane</a></div><div class='team-member'><h3>Bob Smith</h3><p>Support Lead</p><p>Reach Bob at bob.smith@acme.com for help.</p></div>
jane.doe@acme.com bob.smith@acme.com
<!-- Partner emails updated 2024 --><p>For billing inquiries contact billing@lawfirm.com or call us. General matters go to info@lawfirm.com. Partner John Reeves can be reached at <a href='mailto:j.reeves@lawfirm.com'>his direct line</a>.</p>
billing@lawfirm.com info@lawfirm.com j.reeves@lawfirm.com
<footer><p>Questions? Email us at support [at] shopexample [dot] com or use the form below. Press inquiries: press@shopexample.com</p><script>var e='returns'+'@'+'shopexample.com';</script></footer>
support@shopexample.com (obfuscated in text) press@shopexample.com returns@shopexample.com (constructed in inline script)
<section id='contact'><ul><li>Volunteer sign-ups: <a href='mailto:volunteer@helpnow.org'>volunteer@helpnow.org</a></li><li>Donations: donate@helpnow.org</li><li>Media: media@helpnow.org</li><li>General: info@helpnow.org</li></ul></section>
volunteer@helpnow.org donate@helpnow.org media@helpnow.org info@helpnow.org
<p>Department Chair: dr.lee@university.edu | Admissions: admissions@university.edu</p><p>For academic advising contact dr.lee@university.edu. Graduate office: gradoffice@university.edu. Admissions inquiries also go to admissions@university.edu.</p>
dr.lee@university.edu admissions@university.edu gradoffice@university.edu (Duplicates removed: dr.lee@university.edu and admissions@university.edu each appeared twice)
Common mistakes to avoid
- Pasting the entire page source
Full page HTML from a browser's 'View Source' often includes analytics scripts, ad pixels, and tracking libraries that add thousands of tokens of noise. Strip or collapse these sections first. Leaner input produces faster and more accurate output.
- Assuming all emails will be in mailto: links
Many email addresses on real pages appear in plain text paragraphs, inside alt attributes, or embedded in JavaScript variables. A prompt that only checks href attributes will miss these. Make sure your prompt specifies searching all text content, not just links.
- Not specifying output format
Without a format instruction, models may return emails as a bulleted list, a comma-separated string, or embedded in a sentence. If you need a clean list for a spreadsheet or script, specify 'return one email per line, no other text' to avoid manual reformatting.
- Ignoring HTML entity encoding
The at-sign is sometimes encoded as &#64; or &#x40; in HTML to reduce spam scraping. If you paste encoded HTML directly, some models will decode these entities and find the addresses; others will not. Explicitly mention entity-encoded emails in your prompt to be safe.
- Treating every extracted string as a valid contact email
HTML source sometimes contains example emails like user@example.com, test@test.com, or placeholder addresses inside templates. Ask the model to flag or exclude obvious placeholder and example addresses, especially if you are feeding the output directly into an outreach tool.
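The placeholder check from the last item can also be done after extraction. The sketch below separates likely placeholders from the rest rather than deleting them; the function name and both lookup sets are assumptions to adapt, seeded with the examples named above plus the reserved example.org and example.net domains.

```python
# Illustrative lists only (assumptions); extend them for your own data.
PLACEHOLDER_DOMAINS = {"example.com", "example.org", "example.net", "test.com"}
PLACEHOLDER_LOCALS = {"yourname", "your.name", "youremail"}

def split_placeholders(emails):
    """Return (kept, flagged): flagged holds likely placeholder addresses."""
    kept, flagged = [], []
    for email in emails:
        # rpartition splits on the last '@', giving local part and domain.
        local, _, domain = email.lower().rpartition("@")
        if domain in PLACEHOLDER_DOMAINS or local in PLACEHOLDER_LOCALS:
            flagged.append(email)
        else:
            kept.append(email)
    return kept, flagged
```

Reviewing the `flagged` list by hand before discarding it avoids dropping a real address that happens to match one of the rules.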
Frequently asked questions
Can I extract emails from HTML source without writing code?
Yes. Pasting the HTML into an AI prompt with a clear extraction instruction is the fastest no-code method. You get a clean list in seconds. For recurring or large-scale tasks, a simple Python script using the re module or BeautifulSoup is more practical, but for one-off extractions AI is faster.
What is the best regex to extract emails from HTML?
A commonly used pattern is [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}. It catches most standard addresses but will miss obfuscated formats and HTML-entity-encoded at-signs. For messy real-world HTML, an AI model handles edge cases better than regex alone because it understands context around the text.
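To make the limits of that pattern concrete, here is a small sketch that runs it over a made-up snippet. The sample markup is an assumption written for this illustration: it mixes a mailto link, a plain-text address, an obfuscated comment, an entity-encoded at-sign, and an image filename.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

sample = (
    '<a href="mailto:Sales@Acme.com">Sales</a> support@acme.com '
    "<!-- admin [at] acme [dot] com --> press&#64;acme.com "
    '<img src="logo@2x.png">'
)

found = EMAIL_RE.findall(sample)
print(found)
# → ['Sales@Acme.com', 'support@acme.com', 'logo@2x.png']
```

The obfuscated and entity-encoded addresses are missed, and the image filename is returned as a false positive, which is the kind of cleanup the pattern alone does not handle.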
How do I extract emails from a mailto href specifically?
In raw HTML, mailto links look like href='mailto:name@domain.com'. A regex targeting href="mailto:([^"]+)" or href='mailto:([^']+)' will pull just those. The captured value can include a query string such as ?subject=Hello, so drop everything from the first ? onward. Alternatively, pass the HTML itself to an AI prompt if you want the extraction done without writing the regex yourself.
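As an alternative to the two quote-specific regexes, the standard library's `html.parser.HTMLParser` can read href attributes regardless of quoting. This is a minimal sketch, not a full mailto URI parser; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

class MailtoCollector(HTMLParser):
    """Collect addresses from href="mailto:..." attributes."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop any ?subject=... query, then decode %-escapes.
                recipients = unquote(value[len("mailto:"):].split("?", 1)[0])
                # A mailto link may list several comma-separated recipients.
                for address in recipients.split(","):
                    address = address.strip().lower()
                    if address and address not in self.addresses:
                        self.addresses.append(address)

def extract_mailto(source):
    parser = MailtoCollector()
    parser.feed(source)
    parser.close()
    return parser.addresses
```

Calling `extract_mailto(html_string)` returns the lowercase addresses in the order they first appear.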
Will this work on minified or compressed HTML?
Minified HTML removes whitespace but keeps all attributes and text content intact, so email extraction still works. Heavily compressed or binary-encoded content needs to be decoded first. If your source looks like a scrambled string rather than recognizable HTML tags, decompress or decode it before extraction.
How do I extract emails from HTML source in Python?
Use the re module with a pattern like re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', html_string). Pair it with BeautifulSoup to first decode HTML entities so encoded addresses are normalized before the regex runs. This handles the majority of real-world cases reliably.
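A minimal sketch of that approach follows. To keep the example dependency-free it uses the standard library's `html.unescape` for the entity-decoding step instead of BeautifulSoup; the function name `extract_emails` is an assumption.

```python
import html
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def extract_emails(html_string):
    """Decode HTML entities, then return unique lowercase addresses."""
    # Turns &#64; and &#x40; into '@' so the regex can see them.
    decoded = html.unescape(html_string)
    seen = set()
    result = []
    for match in EMAIL_RE.findall(decoded):
        email = match.lower()
        if email not in seen:  # preserve first-seen order
            seen.add(email)
            result.append(email)
    return result
```

This still will not catch '[at]' style obfuscation or addresses assembled in JavaScript, so it covers the common cases rather than every case.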
Can AI find emails hidden in HTML comments or JavaScript?
Yes, when instructed to. HTML comments are enclosed in <!-- and --> delimiters and are part of the raw source, so they are visible in the text passed to the model. JavaScript string variables that contain email addresses are also visible in source. Include a note in your prompt to check comments and script blocks to make sure none are missed.
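For comparison, a scripted version of the same idea can be sketched with the standard library. It gathers comment text and inline script text, undoes two simple obfuscations ('[at]' / '[dot]' tokens and quoted-string concatenation), and then applies an address pattern. All names and the obfuscation patterns here are assumptions covering only the simplest cases; real pages use many more variants.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
AT_RE = re.compile(r"\s*[\[(]\s*at\s*[\])]\s*", re.IGNORECASE)
DOT_RE = re.compile(r"\s*[\[(]\s*dot\s*[\])]\s*", re.IGNORECASE)
JS_CONCAT_RE = re.compile(r"""['"]\s*\+\s*['"]""")  # matches '+' between literals

class HiddenTextCollector(HTMLParser):
    """Collect text from HTML comments and inline <script> blocks."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_comment(self, data):
        self.pieces.append(data)

    def handle_data(self, data):
        if self.in_script:
            self.pieces.append(data)

def hidden_emails(source):
    collector = HiddenTextCollector()
    collector.feed(source)
    collector.close()
    found = []
    for piece in collector.pieces:
        # Join 'a'+'b' literals, then rewrite [at] and [dot] tokens.
        text = JS_CONCAT_RE.sub("", piece)
        text = DOT_RE.sub(".", AT_RE.sub("@", text))
        for email in EMAIL_RE.findall(text):
            if email.lower() not in found:
                found.append(email.lower())
    return found
```

Addresses in ordinary visible text are deliberately left out here, so this complements a plain-text extraction pass instead of replacing it.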