# Email Extraction Script
```python
import re
# Input text containing email addresses
text = """Please contact our sales team at sales@example.com or reach out to
john.doe+inquiries@company.co.uk for partnership opportunities. You can also
email support@my-site.org for technical issues, and invalid addresses like
test@.com should be ignored."""
# Robust regex pattern (simplified RFC 5322-compatible)
# - Local part: letters, digits, and allowed special chars (._%+-)
# - @ separator
# - Domain: must start with alphanumeric, allows hyphens/dots, ends with 2+ letter TLD
email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?(?:\.[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?)*\.[A-Za-z]{2,}'
# Find all matches and deduplicate while preserving order
matches = re.findall(email_pattern, text)
unique_emails = list(dict.fromkeys(matches))
# Display results
print("Extracted Email Addresses:")
print(unique_emails)
```
## Sample Output
```python
Extracted Email Addresses:
['sales@example.com', 'john.doe+inquiries@company.co.uk', 'support@my-site.org']
```
## How the Regex Works
The pattern matches a **local part** (letters, digits, and `._%+-`, which covers tags like `john.doe+inquiries`), followed by `@`, then a **domain** in which every dot-separated label must begin and end with an alphanumeric character. That rules out cases like `test@.com`, where the domain starts with a dot. The final label must be a **top-level domain of at least 2 letters** (e.g., `.com`, or the `.uk` in `.co.uk`), and `dict.fromkeys()` removes duplicates while preserving the original order of appearance.
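To see these rules in action, the sketch below runs the pattern from the script against a handful of probe strings. The probe strings and the `probes`/`results` names are illustrative assumptions, and `re.fullmatch` is used here (instead of `re.findall`) only so that each string is judged as a whole.

```python
import re

# Pattern copied from the script above.
email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?(?:\.[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?)*\.[A-Za-z]{2,}'

# Hypothetical probe strings; True means the whole string is expected to match.
probes = {
    "sales@example.com": True,
    "john.doe+inquiries@company.co.uk": True,
    "test@.com": False,       # domain starts with a dot
    "user@-dash.org": False,  # first label starts with a hyphen
    "user@example.c": False,  # final label is shorter than 2 letters
}

results = {s: re.fullmatch(email_pattern, s) is not None for s in probes}
for s, matched in results.items():
    print(f"{s!r:40} {matched}")
```

A table of probes like this is a cheap way to check a pattern against your own edge cases before relying on it.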
Python Script to Extract Email Addresses from Any Text
Tested prompts for extracting email addresses from text in Python, compared across 5 leading AI models.
If you have a block of text and need to pull out every email address it contains, you are looking for a regex-based extraction pattern in Python. This is one of the most common text-processing tasks in the language, and the core approach has not changed much in years: write a regular expression that matches the structure of an email address, run it against your string with the re module, and collect the results. The challenge is that naive patterns miss edge cases like subdomains, plus-addressed emails, or quoted local parts.
The page below tests several AI-generated Python solutions against the same prompt so you can see which one handles real-world messiness. Whether you are scraping a web page, parsing a CSV export, processing customer support tickets, or pulling contacts out of a PDF-converted text file, the right script should handle all of the above without requiring you to clean the input first.
Pick the output that matches your use case, drop it into your project, and move on. Each solution is runnable, tested, and annotated so you understand what the regex is actually doing rather than just copying a pattern you cannot debug later.
When to use this
Use a Python email extraction script when you have unstructured or semi-structured text and need a list of email addresses from it programmatically. This fits any situation where emails are embedded in prose, HTML, log files, spreadsheets, or exported documents and you cannot rely on a fixed column or field to find them.
- Scraping contact information from a set of web pages or HTML dumps
- Parsing exported CRM or helpdesk data where emails appear inside free-text fields
- Processing server or application logs to identify user accounts involved in an event
- Extracting recipient or sender addresses from a batch of raw .eml or .txt email files
- Pulling contributor emails from a large codebase's commit messages or README files
When this format breaks down
- If your data is already structured (a database column, a CSV field labeled 'email'), use a direct field access or pandas column read instead of regex on raw text. Regex adds complexity with no benefit.
- If you need to validate that extracted addresses actually exist and accept mail, regex extraction alone is not enough. You need an SMTP verification step or a third-party validation API on top.
- If the text is in a PDF with complex formatting, multi-column layouts, or scanned images, the text extraction layer (pdfplumber, pytesseract) can produce garbled output that breaks even a good regex. Fix the extraction layer first.
- If you are processing untrusted input at scale and plan to store or send to every found address, a regex script with no deduplication or domain-allowlist check can produce noisy, harmful results. Add filtering logic before acting on the output.
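For the structured-data case in the first point, a direct column read might look like the sketch below. It assumes a hypothetical export whose column is labeled `email`; the sample data and the helper name `emails_from_column` are made up for illustration, and the standard-library `csv` module stands in for pandas.

```python
import csv
import io

# Hypothetical CSV export with a labeled 'email' column.
csv_text = """name,email
Alice,alice@example.com
Bob,bob@example.org
"""

def emails_from_column(handle, column="email"):
    """Return the non-empty values of one named column from a CSV file-like object."""
    reader = csv.DictReader(handle)
    return [row[column].strip() for row in reader if row.get(column)]

emails = emails_from_column(io.StringIO(csv_text))
print(emails)
```

No regex is involved: the field label already tells you where the addresses are.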
The prompt we tested
You are an expert Python developer and data extraction specialist. Your task is to help the user extract email addresses from the provided text using Python. Follow these instructions carefully: Provide a clean, runnable Python script using the `re` module with a robust regex pattern (RFC 5322-compatible simplified version) to extract all unique email addresses from the input text. Include a short 2-3 sentence explanation of the regex pattern and show sample output as a Python list. Keep the code concise, commented, and ready to copy-paste. Here is the text or scenario from the user: Please contact our sales team at sales@example.com or reach out to john.doe+inquiries@company.co.uk for partnership opportunities. You can also email support@my-site.org for technical issues, and invalid addresses like test@.com should be ignored. Provide a complete, working Python solution that extracts all valid email addresses from the input, along with a brief explanation of how the code works.
Model comparison
| Model | Quality | Cost | Latency | Tokens |
|---|---|---|---|---|
| Claude Opus 4.7 | 9 | $0.0548 | 8.0s | 315 / 667 |
| Claude Haiku 4.5 | 9 | $0.0029 | 3.7s | 222 / 531 |
| GPT-5.4 | 9 | $0.0050 | 5.3s | 203 / 301 |
| Gemini 2.5 Pro | Not run yet | | | |
| Grok 4.1 Fast Reasoning | 9 | $0.0003 | 14.9s | 352 / 417 |
Outputs
# Email Extraction Script
```python
import re
# Input text containing email addresses
text = """Please contact our sales team at sales@example.com or reach out to
john.doe+inquiries@company.co.uk for partnership opportunities. You can also
email support@my-site.org for technical issues, and invalid addresses like
test@.com should be ignored."""
# Regex pattern for extracting valid email addresses (RFC 5322-compatible simplified)
email_pattern = r'\b[A-Za-z0-9.+_-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
# Extract all email addresses and convert to set for uniqueness
emails = list(set(re.findall(email_pattern, text)))
# Sort for consistent output
emails.sort()
print("Extracted Email Addresses:")
print(emails)
```
**Output:**
```
Extracted Email Addresses:
['john.doe+inquiries@company.co.uk', 'sales@example.com', 'support@my-site.org']
```
---
## Explanation
**Regex Pattern Breakdown:** `r'\b[A-Za-z0-9.+_-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'`
- **`\b`** – Word boundary ensures we don't match partial email addresses
- **`[A-Za-z0-9.+_-]+`** – Local part (before @) allows alphanumeric, dots, plus signs, underscores, and hyphens
- **`@`** – Literal @ symbol separator
- **`[A-Za-z0-9.-]+`** – Domain name allows alphanumeric, dots, and hyphens
- **`\.[A-Za-z]{2,}`** – Final label requires a dot followed by 2+ letters (e.g., `.com`, or the `.uk` in `.co.uk`)
- **`\b`** – Word boundary at the end
This pattern successfully extracts valid emails like `john.doe+inquiries@company.co.uk` while rejecting invalid ones like `test@.com` (missing domain name before the TLD).
```python
import re
text = """
Please contact our sales team at sales@example.com or reach out to
john.doe+inquiries@company.co.uk for partnership opportunities.
You can also email support@my-site.org for technical issues,
and invalid addresses like test@.com should be ignored.
"""
# Simplified RFC 5322-compatible email regex
email_pattern = re.compile(
    r'\b[a-zA-Z0-9.!#$%&\'*+/=?^_`{|}~-]+@'
    r'[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?'
    r'(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)+\b'
)
# Extract unique email addresses while preserving order
emails = list(dict.fromkeys(email_pattern.findall(text)))
print(emails)
```
This regex matches a valid email local part before `@`, allowing common characters such as dots, plus signs, and underscores. It then enforces a proper domain structure with valid labels and at least one dot, which helps reject invalid addresses like `test@.com`.
Sample output:
```python
['sales@example.com', 'john.doe+inquiries@company.co.uk', 'support@my-site.org']
```
```python
import re
# Input text containing emails
text = """Please contact our sales team at sales@example.com or reach out to john.doe+inquiries@company.co.uk for partnership opportunities. You can also email support@my-site.org for technical issues, and invalid addresses like test@.com should be ignored."""
# Robust regex pattern (simplified RFC 5322): word boundary, local-part ([A-Za-z0-9._%+-]+), @, domain ([A-Za-z0-9.-]+), dot, TLD ([A-Za-z]{2,}), word boundary.
# Handles + subaddressing, -, _, etc.; ignores test@.com because nothing sits between @ and its only dot, so the domain part ([A-Za-z0-9.-]+) has no character to match before the required \.
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
# Extract all matches
matches = re.findall(pattern, text)
# Unique emails preserving order
unique_emails = []
seen = set()
for email in matches:
    if email not in seen:
        unique_emails.append(email)
        seen.add(email)
print(unique_emails)
```
**Regex Explanation**: This pattern starts/ends with word boundaries (`\b`) to avoid partial matches, captures alphanumeric + special chars (`._%+-`) in the local part, requires `@`, then domain labels (`[A-Za-z0-9.-]+`), a literal dot (`\.`), and a 2+ letter TLD. It's RFC 5322-simplified, robust for real-world use, and rejects `test@.com` since the domain part fails to match before the required `\.`.
**How the Code Works**: Uses `re.findall()` to extract all matching emails, then iterates to build a list of unique ones (preserving first-seen order via a set). Run the script to output: `['sales@example.com', 'john.doe+inquiries@company.co.uk', 'support@my-site.org']`.
What makes these work
1. Use re.findall, not re.search
   re.search stops at the first match. re.findall returns every non-overlapping match in the string as a list, which is almost always what you want when extracting emails. If your character classes list only one case (for example [a-z0-9]), pair it with re.IGNORECASE so mixed-case addresses are not missed.
2. Account for plus-addressing in local parts
   Many corporate and developer emails use the user+tag@domain.com format. A pattern that only allows word characters before the @ will silently drop these. Include the plus sign in your character class for the local part: [a-zA-Z0-9._%+\-]+.
3. Support multi-part TLDs explicitly
   Domains like .co.uk, .com.au, and .gov.au are common in non-US data. A pattern that allows only one dot-separated segment after the domain name will miss them. Let the domain part match several dot-separated labels, for example by including the dot in the domain character class ([A-Za-z0-9.-]+) ahead of a final label of 2 or more letters.
4. Deduplicate before returning results
   Raw text often repeats the same address multiple times. Wrap your re.findall result in list(set(...)) for an unordered list of unique addresses, or use dict.fromkeys() to preserve the order in which addresses first appeared, which is more useful for audit trails.
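The four points above can be sketched together as follows. The sample string and the name `EMAIL_RE` are illustrative assumptions, and the pattern is a simplified one in the same style as the outputs above, not a full RFC 5322 matcher.

```python
import re

# Simplified pattern: plus sign allowed in the local part, dots allowed in the
# domain so multi-part suffixes such as .co.uk match, final label of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

sample = (
    "Write to a.user+tag@mail.example.co.uk or ops@example.com. "
    "Again: ops@example.com"
)

first = EMAIL_RE.search(sample).group(0)   # search stops at the first match
all_matches = EMAIL_RE.findall(sample)     # findall returns every match
unique = list(dict.fromkeys(all_matches))  # dedupe, keeping first-seen order

print(first)
print(all_matches)
print(unique)
```

Because the pattern must end on letters, the sentence-ending period after `ops@example.com` is left out of the match.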
More example scenarios
The HTML body of an agency's contact page contains: '<p>For project inquiries contact <a href="mailto:hello@brightmindagency.com">hello@brightmindagency.com</a> or our lead designer at j.torres+projects@brightmindagency.com. Media requests go to press@brightmind.agency.</p>'
['hello@brightmindagency.com', 'j.torres+projects@brightmindagency.com', 'press@brightmind.agency'] -- the script correctly captures the plus-addressed variant and the .agency TLD, which a pattern limited to word characters in the local part or to 2-4 letter TLDs would miss.
A plain-text application log excerpt: '2024-03-15 09:12:44 INFO Login attempt user=alice.smith@corp.internal ip=10.0.0.4 SUCCESS 2024-03-15 09:13:01 WARN Failed login user=b.jones@corp.internal ip=10.0.0.7 FAIL 2024-03-15 09:13:45 INFO Password reset triggered for carol_82@external-vendor.co.uk'
['alice.smith@corp.internal', 'b.jones@corp.internal', 'carol_82@external-vendor.co.uk'] -- all three addresses extracted cleanly including the internal .internal TLD and the two-part .co.uk country code domain.
A repository README section: 'Maintainers: Dr. Priya Nair <priya.nair@datasciencelab.io>, Tom Wieczorek <t.wieczorek@uni-berlin.de>. Bug reports to bugs@ossproject.org. Do not contact contributors directly at their personal addresses such as tomw1987@gmail.com.'
['priya.nair@datasciencelab.io', 't.wieczorek@uni-berlin.de', 'bugs@ossproject.org', 'tomw1987@gmail.com'] -- four distinct addresses returned; the caller can then filter by domain if personal addresses should be excluded.
A support team paste: 'Ticket #1041 - customer wrote from m.okonkwo@yahoo.co.uk asking about refund. Ticket #1042 - order placed by Zhang_Wei_88@outlook.com, escalated. CC on both: support-lead@ourcompany.com. Duplicate entry: m.okonkwo@yahoo.co.uk.'
['m.okonkwo@yahoo.co.uk', 'Zhang_Wei_88@outlook.com', 'support-lead@ourcompany.com'] -- the script deduplicates by default so the repeated address appears only once in the output list.
The raw text of an email file header block: 'From: newsletter@updates.shopbrand.com To: first.customer@gmail.com, second.customer@hotmail.com CC: analytics-tracker+campaign42@shopbrand.com Subject: Your order is on its way'
['newsletter@updates.shopbrand.com', 'first.customer@gmail.com', 'second.customer@hotmail.com', 'analytics-tracker+campaign42@shopbrand.com'] -- all four addresses captured including the subdomain sender and the plus-tagged analytics address.
Common mistakes to avoid
- Regex that rejects valid TLDs
  Many copy-pasted patterns only match TLDs of 2-4 characters, which excludes .museum, .international, and newer gTLDs like .agency or .studio. Use {2,} as your TLD length quantifier to avoid silently dropping legitimate addresses; a cap such as {2,10} still excludes .international, which is 13 letters long.
- Not stripping surrounding punctuation
  When an email appears at the end of a sentence, a loose pattern such as \S+@\S+ absorbs the period or comma that follows it: 'contact@example.com.' The trailing period becomes part of the extracted string and breaks any downstream use. Either end the pattern on a letters-only TLD, which cannot match the period, or strip trailing punctuation from each match after extraction.
- Treating extraction as validation
  A regex that extracts an email confirms it looks like an email, not that it is real or deliverable. If your workflow depends on the address being valid, add a separate validation step. Acting on extracted addresses without this step wastes resources and risks sending to dead addresses.
- Running on un-decoded bytes
  If your input text comes from reading a file in binary mode or from an HTTP response without decoding, the regex runs on a bytes object: a str pattern raises a TypeError, and a bytes pattern returns bytes matches. Always decode your input to a Python str before running the extraction.
- Ignoring HTML entities in scraped content
  Email addresses inside HTML are sometimes obfuscated as user&#64;domain.com, where &#64; is the entity for the at-sign. A plain regex will not match this. If your source is HTML, parse it with BeautifulSoup or unescape the entities (for example with html.unescape) before running extraction.
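Two of the mistakes above, un-decoded bytes and HTML entities, can be handled in one small wrapper. This is a minimal sketch: the function name `extract_emails`, the UTF-8 default, and the sample markup are assumptions, and it uses the standard library's `html.unescape` instead of a full HTML parser.

```python
import html
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(raw, encoding="utf-8"):
    """Decode bytes if needed, unescape HTML entities, then extract unique emails."""
    if isinstance(raw, (bytes, bytearray)):
        # Assumed encoding; undecodable bytes are replaced instead of raising.
        raw = raw.decode(encoding, errors="replace")
    text = html.unescape(raw)  # turns &#64; and &#x40; back into '@'
    return list(dict.fromkeys(EMAIL_RE.findall(text)))

raw_html = b"<p>Mail user&#64;domain.com or admin&#x40;example.org.</p>"
print(extract_emails(raw_html))
```

Unescaping before matching matters: run on the raw markup, the pattern would find no `@` at all in this sample.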
Frequently asked questions
What is the best Python regex pattern to extract email addresses?
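A solid general-purpose pattern is r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,10}'. It covers plus-addressing, hyphenated domains, and TLDs of up to 10 letters; change {2,10} to {2,} if you expect longer ones such as .international. Use it with re.findall(pattern, text). Adding re.IGNORECASE is harmless but redundant here, because the character classes already list both cases. No single pattern is perfect for every input, so test it against your specific data before shipping to production.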
How do I extract emails from a file in Python, not just a string?
Open the file with open('yourfile.txt', 'r', encoding='utf-8') and read its contents into a string with .read(). Then pass that string to re.findall exactly as you would any other text. For large files, read line by line with a for loop and extend a results list incrementally to avoid loading everything into memory.
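The line-by-line variant might look like the sketch below. The helper name `extract_from_file` and the temporary demo file are assumptions for illustration; the pattern is the same simplified style used elsewhere on this page.

```python
import re
import tempfile
from pathlib import Path

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_from_file(path, encoding="utf-8"):
    """Scan a text file one line at a time and return unique emails in first-seen order."""
    seen = {}
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            for address in EMAIL_RE.findall(line):
                seen.setdefault(address, None)  # dict keys keep insertion order
    return list(seen)

# Demo on a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "yourfile.txt"
    demo_path.write_text(
        "first@example.com\nno address here\nsecond@example.org first@example.com\n",
        encoding="utf-8",
    )
    found = extract_from_file(demo_path)

print(found)
```

Only one line is held in memory at a time, so the approach scales to large files; it does assume that no address is split across a line break.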
Can I extract emails from a PDF using Python?
Yes, but the PDF must be text-based, not a scanned image. Use pdfplumber or PyMuPDF to extract the raw text from each page into a string, then run your regex over that string. Scanned PDFs require OCR first via pytesseract. The quality of your email extraction will directly depend on the quality of the text extraction step.
How do I remove duplicate emails from the extracted list?
Use list(set(re.findall(pattern, text))) to get a unique unordered list. If you need to preserve the order in which addresses first appeared, use list(dict.fromkeys(re.findall(pattern, text))) instead. The dict.fromkeys approach is order-preserving in Python 3.7 and above.
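A short comparison of the two approaches, using a made-up list of matches in place of a real `re.findall` result:

```python
# Hypothetical findall output with one repeated address.
matches = ["b@example.com", "a@example.com", "b@example.com"]

unordered = set(matches)               # unique, but iteration order is not guaranteed
ordered = list(dict.fromkeys(matches)) # unique, first-seen order preserved

print(sorted(unordered))  # sorted only to make the printed result deterministic
print(ordered)
```

If you need stable output from the set version, sort it explicitly as shown.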
Does Python's re module support extracting emails from HTML directly?
It can match email-shaped strings in HTML source, but raw HTML contains noise: mailto: prefixes, HTML entities, attributes, and escaped characters. For cleaner results, parse the HTML with BeautifulSoup first, call .get_text() to strip tags, then run your regex on the resulting plain text.
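The same tag-stripping idea can be sketched without third-party packages. Note the substitution: the example below uses the standard library's `html.parser.HTMLParser` in place of BeautifulSoup, and the class name `TextCollector`, the helper `visible_text`, and the sample markup are all illustrative assumptions.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class TextCollector(HTMLParser):
    """Collects text nodes only; tags and attributes (such as href) are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities like &#64; are decoded
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def visible_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    # Join with spaces so text from adjacent elements does not run together.
    return " ".join(parser.chunks)

markup = (
    '<p>Contact <a href="mailto:hello@example.com">hello@example.com</a> '
    "or press&#64;example.org.</p>"
)
emails = list(dict.fromkeys(EMAIL_RE.findall(visible_text(markup))))
print(emails)
```

Because attributes are skipped, the `mailto:` copy of the address never reaches the regex. If you also want addresses that appear only in `href` attributes, you would need to collect those separately.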
How do I extract only emails from a specific domain in Python?
Extract all emails first using re.findall, then filter the result list with a list comprehension: [e for e in emails if e.endswith('@yourdomain.com')]. To include subdomains as well, compare the part after the @: keep an address when e.split('@')[1] == 'yourdomain.com' or e.split('@')[1].endswith('.yourdomain.com').
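A small helper along these lines is sketched below. The function name `filter_by_domain`, its `include_subdomains` flag, and the sample addresses are assumptions; lowercasing the host before comparing is an added design choice, since domain names are case-insensitive.

```python
def filter_by_domain(emails, domain, include_subdomains=False):
    """Keep addresses whose host equals `domain` (optionally also its subdomains)."""
    domain = domain.lower()
    kept = []
    for address in emails:
        host = address.rsplit("@", 1)[-1].lower()
        # The leading dot in the suffix check prevents 'notyourdomain.com'
        # from being treated as a subdomain of 'yourdomain.com'.
        if host == domain or (include_subdomains and host.endswith("." + domain)):
            kept.append(address)
    return kept

emails = [
    "a@yourdomain.com",
    "b@mail.yourdomain.com",
    "c@notyourdomain.com",
    "D@YourDomain.com",
]
exact = filter_by_domain(emails, "yourdomain.com")
with_subdomains = filter_by_domain(emails, "yourdomain.com", include_subdomains=True)
print(exact)
print(with_subdomains)
```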