Using Regex to Extract Email Addresses from Text

Tested prompts for regex to extract email addresses compared across 5 leading AI models.

BEST BY JUDGE SCORE Claude Haiku 4.5 9/10

When you have a block of text, a CSV, a log file, or a scraped webpage and you need to pull out every email address, regex is the standard tool for the job. A well-written regular expression scans the input string and returns every match that fits the pattern of a valid email address, without you having to write custom parsing logic from scratch.

The challenge is that email addresses are more varied than they look. A pattern that catches john@example.com will miss j.o+hn@sub.domain.co.uk or user123@company.io if it is too narrow. Go too broad and you start pulling in false positives from structured data that resembles an email but is not one.

This page walks through the most reliable regex patterns for extracting email addresses, shows you what each AI model produces when given a real extraction task, and compares their outputs so you can pick the right approach for your specific input. Whether you are working in Python, JavaScript, or just need a pattern to drop into a text editor's find tool, you will leave with something you can use immediately.

When to use this

Regex email extraction is the right approach when you have unstructured or semi-structured text and need a fast, programmatic way to pull addresses without setting up a full parsing pipeline. It works well in scripts, data cleaning steps, and automation workflows where the input format is consistent enough that a pattern will not produce large numbers of false positives.

Extracting contact emails from scraped HTML pages or raw website dumps
Parsing log files or error reports to identify notified or affected users
Cleaning a CSV or spreadsheet column that mixes emails with other text
Pulling recipient addresses out of exported email threads or .eml files
Building a quick data pipeline step that feeds downstream deduplication or CRM import

When this format breaks down

Validating whether an email address is real or deliverable: regex confirms format only, not existence. Use an email verification API for that.
Parsing RFC 5321/5322 edge cases at scale: addresses with quoted local parts like '"john doe"@example.com' or embedded comments fall outside what most practical regex patterns match, so they are skipped or truncated.
Inputs where the same character sequence appears in non-email contexts, such as version strings or file paths formatted like 'v2.0@release', leading to high false-positive rates.
Production systems where email accuracy is legally or financially critical: a missed or malformed extraction can cause compliance failures that regex alone cannot guard against.

The prompt we tested

You are a regex and text-extraction assistant. Your task is to extract all email addresses from the provided text using a reliable regular expression.

Follow these instructions exactly:
Use a robust email regex such as [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,} and present it inside a fenced code block. Keep the explanation to 3-5 short bullet points, then output the extracted emails as a plain bullet list with duplicates removed and order preserved. If no emails are found, state that explicitly.

Text to process:
Please reach out to our support team at help@acme.co or escalate to Jane Doe <jane.doe+urgent@sub.example.com>. For billing questions, email billing@acme.co or call 555-123-4567. Old contact bob_smith99@mail-server.io is no longer monitored.

Return:
1. The regex pattern used (in a code block).
2. A brief explanation of what each part of the pattern matches.
3. A deduplicated list of all email addresses found in the text.

Model comparison

Model	Quality	Cost	Latency	Tokens
Claude Opus 4.7	9	$0.0331	5.2s	359 / 370
Claude Haiku 4.5	9	$0.0016	2.1s	264 / 265
GPT-5.4	9	$0.0031	2.6s	234 / 167
Gemini 2.5 Pro	9	$0.0015	9.1s	253 / 233
Grok 4.1 Fast Reasoning	9	$0.0002	5.5s	383 / 215

Outputs

Claude Opus 4.7 9/10

## Regex Pattern

```regex
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

## Explanation

- `[A-Za-z0-9._%+-]+` — matches the local part (before `@`), allowing letters, digits, dots, underscores, percent signs, plus, and hyphens.
- `@` — matches the literal `@` separator between local and domain parts.
- `[A-Za-z0-9.-]+` — matches the domain name, allowing letters, digits, dots, and hyphens (covers subdomains like `sub.example`).
- `\.[A-Za-z]{2,}` — matches the top-level domain: a dot followed by at least two letters (e.g., `.co`, `.com`, `.io`).
- The `+` quantifiers ensure each section has one or more valid characters, preventing matches on malformed strings.

## Extracted Emails

- help@acme.co
- jane.doe+urgent@sub.example.com
- billing@acme.co
- bob_smith99@mail-server.io

Claude Haiku 4.5 9/10

# Email Extraction Results

## Regex Pattern

```
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

## Pattern Explanation

• **`[A-Za-z0-9._%+-]+`** – Matches the local part (before @): letters, digits, dots, underscores, percent signs, plus signs, and hyphens
• **`@`** – Matches the literal @ symbol that separates local and domain parts
• **`[A-Za-z0-9.-]+`** – Matches the domain name: letters, digits, dots, and hyphens
• **`\.`** – Matches the literal dot before the top-level domain
• **`[A-Za-z]{2,}`** – Matches the TLD (top-level domain): 2 or more alphabetic characters

## Extracted Email Addresses

• help@acme.co
• jane.doe+urgent@sub.example.com
• billing@acme.co
• bob_smith99@mail-server.io

GPT-5.4 9/10

```regex
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

- `[A-Za-z0-9._%+-]+` matches the local part before `@`, including letters, digits, and common email symbols.
- `@` matches the required at-sign separating local and domain parts.
- `[A-Za-z0-9.-]+` matches the domain/subdomain portion.
- `\.` matches the literal dot before the top-level domain.
- `[A-Za-z]{2,}` matches a top-level domain with at least 2 letters.

- help@acme.co
- jane.doe+urgent@sub.example.com
- billing@acme.co
- bob_smith99@mail-server.io

Gemini 2.5 Pro 9/10

```regex
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

*   `[A-Za-z0-9._%+-]+` Matches the local part (username) of the email address before the `@` symbol.
*   `@` Matches the literal "@" symbol that separates the local part from the domain.
*   `[A-Za-z0-9.-]+` Matches the domain name (e.g., "example", "sub.domain").
*   `\.` Matches the literal dot separating the domain from the top-level domain.
*   `[A-Za-z]{2,}` Matches the top-level domain (TLD) like ".com" or ".co", ensuring it has at least two letters.

*   help@acme.co
*   jane.doe+urgent@sub.example.com
*   billing@acme.co
*   bob_smith99@mail-server.io

Grok 4.1 Fast Reasoning 9/10

1. The regex pattern used (in a code block).
```
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

2. A brief explanation of what each part of the pattern matches.
- `[A-Za-z0-9._%+-]+`: Matches one or more alphanumeric characters or allowed special chars (._%+-) in the local part before @.
- `@`: Matches the literal @ symbol.
- `[A-Za-z0-9.-]+`: Matches one or more alphanumeric characters, dots, or hyphens in the domain name.
- `\.`: Matches a literal dot separating domain and TLD.
- `[A-Za-z]{2,}`: Matches two or more alphabetic characters for the top-level domain (TLD).

3. A deduplicated list of all email addresses found in the text.
- help@acme.co
- jane.doe+urgent@sub.example.com
- billing@acme.co
- bob_smith99@mail-server.io

What makes these work

01

Use a word-boundary anchor
Wrapping your pattern with word boundaries or lookahead/lookbehind assertions prevents matches where an email-shaped fragment is cut out of a longer token. Without a trailing boundary, the base pattern returns 'user@example.com' from 'user@example.com2024'. A leading \b also forces the match to start on a letter, digit, or underscore, so stray leading dots and hyphens are not captured. In Python, pass the pattern wrapped in \b...\b to re.findall; where \b is unavailable or too coarse for your flavor, use a negative lookahead such as (?![A-Za-z0-9-]) after the TLD.
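As a rough illustration of the boundary tip, the sketch below runs the page's base pattern with and without \b on a made-up sample string. The sample text and variable names are assumptions chosen for the demonstration, not part of the tested prompt.

```python
import re

# Base pattern from this page, with and without \b word boundaries.
BASE = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
BOUNDED = r"\b" + BASE + r"\b"

# Hypothetical sample: the second token has digits glued to the TLD.
sample = "Write to ops@example.com or see build tag ci@example.com2024 for details."

unbounded_hits = re.findall(BASE, sample)
bounded_hits = re.findall(BOUNDED, sample)

print(unbounded_hits)  # -> ['ops@example.com', 'ci@example.com']
print(bounded_hits)    # -> ['ops@example.com']
```

Whether the glued token should be rejected or trimmed depends on your data; the bounded version simply refuses to guess.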
02

Account for subdomains and country TLDs
Many real-world addresses use multi-part domains such as user@mail.company.co.uk. A domain pattern that only allows one dot will silently truncate or drop these. Match the domain portion with something like (?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,} to handle arbitrary depth; the group is non-capturing so that functions such as Python's re.findall return the whole address rather than only the group. Set the TLD minimum to 2 characters and avoid capping the maximum too low, since newer TLDs can be long.
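A minimal sketch of the difference, assuming Python's re module and an invented sample sentence. It also shows what happens if the repeated domain group is left as a capturing group.

```python
import re

# Narrow: the domain class has no dot, so only one label is allowed before the TLD.
NARROW = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+\.[A-Za-z]{2,}"
# Multi-segment: one or more dot-terminated labels, then the TLD.
# (?:...) is non-capturing, so re.findall returns whole matches.
MULTI = r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}"
# Same idea with a capturing group: re.findall returns only the group text.
CAPTURING = r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,}"

text = "Contact user@mail.company.co.uk for access."

narrow_hits = re.findall(NARROW, text)
multi_hits = re.findall(MULTI, text)
captured_only = re.findall(CAPTURING, text)

print(narrow_hits)    # -> ['user@mail.company']
print(multi_hits)     # -> ['user@mail.company.co.uk']
print(captured_only)  # -> ['co.']
```

Note that the narrow pattern does not fail loudly: it returns a plausible-looking but truncated address.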
03

Support plus-sign and dot addressing in the local part
Services like Gmail support both dots and plus signs in the local part, for example first.last+filter@gmail.com. A local part pattern restricted to alphanumeric characters only will not capture these in full; it typically returns just the tail after the last unsupported character, such as filter@gmail.com. Use [a-zA-Z0-9._%+\-]+ as the local part character class to cover the most common valid characters without going so broad that you pick up garbage.
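To make the effect concrete, here is a small sketch comparing an alphanumeric-only local part with the broader class. The sample sentence and the names ALNUM_ONLY and FULL_CLASS are illustrative assumptions.

```python
import re

# Local part limited to letters and digits (too narrow).
ALNUM_ONLY = r"[A-Za-z0-9]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
# Local part with the common extra characters: . _ % + -
FULL_CLASS = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

text = "Send to first.last+filter@gmail.com today."

alnum_hits = re.findall(ALNUM_ONLY, text)
full_hits = re.findall(FULL_CLASS, text)

print(alnum_hits)  # -> ['filter@gmail.com']
print(full_hits)   # -> ['first.last+filter@gmail.com']
```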
04

Test against both positive and negative examples
Before shipping a regex into production, run it against a set of known-good emails and a set of known-invalid strings. Include edge cases like double at-signs, trailing dots, missing TLDs, and IP-address domains. A pattern that passes only the happy path will produce noisy output on real-world messy data and quietly corrupt downstream processes.
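One way to set up such a check is sketched below. The fixture lists are hypothetical and should be replaced with samples from your own data; the helper name check is also an assumption.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical fixtures; swap in strings from your real inputs.
SHOULD_MATCH = [
    "john@example.com",
    "j.o+hn@sub.domain.co.uk",
    "user123@company.io",
]
SHOULD_NOT_MATCH = [
    "user@@example.com",   # double at-sign
    "user@example.com.",   # trailing dot
    "user@example",        # missing TLD
    "user@192.168.1.4",    # IP-address domain
    "noatsign",
]

def check(pattern, good, bad):
    """Return (missed, false_positives) using whole-string matching."""
    missed = [s for s in good if not pattern.fullmatch(s)]
    false_positives = [s for s in bad if pattern.fullmatch(s)]
    return missed, false_positives

missed, false_positives = check(EMAIL_RE, SHOULD_MATCH, SHOULD_NOT_MATCH)
print(missed, false_positives)  # -> [] []
```

Whole-string matching with fullmatch is used here so that a partial hit inside a bad string does not count as a pass.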

More example scenarios

#01 · Extracting contacts from a scraped B2B company page

Input

Reach our sales team at sales@acmecorp.com or contact our support lead directly: j.harrison+support@acmecorp.co.uk. For press inquiries email media.relations@acmecorp.com. Invalid strings like @@broken and noatsign should be ignored.

Expected output

sales@acmecorp.com, j.harrison+support@acmecorp.co.uk, media.relations@acmecorp.com

#02 · Pulling user emails from a server access log

Input

2024-03-15 08:42:11 INFO Login attempt user=marco.rossi@techfirm.it ip=192.168.1.4
2024-03-15 08:43:02 ERROR Auth failed user=a.patel@vendor.org ip=10.0.0.9
2024-03-15 08:44:55 INFO Session started user=chen_wei@enterprise.com.cn ip=172.16.0.2

Expected output

marco.rossi@techfirm.it, a.patel@vendor.org, chen_wei@enterprise.com.cn

#03 · Cleaning a marketing export where emails are mixed with other data

Input

ID:1042 | Name: Priya Nair | Contact: priya.nair@retailbrand.in | Tier: Gold | Referral: n/a
ID:1043 | Name: Tom Briggs | Contact: tombriggs_AT_example.com | Tier: Silver
ID:1044 | Name: Sara Okonkwo | Contact: s.okonkwo@ngohub.org | Tier: Bronze

Expected output

priya.nair@retailbrand.in, s.okonkwo@ngohub.org
Note: tombriggs_AT_example.com is not a valid email format and is excluded. If obfuscated emails need handling, a pre-processing substitution step is required before regex extraction.
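A pre-processing step of that kind might look like the sketch below. The obfuscation styles it handles (`_AT_`, `[at]`, `(at)`) are assumptions about what the data contains, and the helper name deobfuscate is hypothetical; substitutions like these can create false positives, so keep the list as narrow as your data allows.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed obfuscation styles: "_AT_" (uppercase only) and bracketed
# "[at]" / "(at)" with optional surrounding spaces.
AT_VARIANTS = re.compile(r"_AT_|\s*[\[(](?:at|AT)[\])]\s*")

def deobfuscate(text):
    """Replace the assumed obfuscation markers with a literal @."""
    return AT_VARIANTS.sub("@", text)

row = "ID:1043 | Name: Tom Briggs | Contact: tombriggs_AT_example.com | Tier: Silver"

before = EMAIL_RE.findall(row)
after = EMAIL_RE.findall(deobfuscate(row))

print(before)  # -> []
print(after)   # -> ['tombriggs@example.com']
```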

#04 · Extracting emails from a legal contract document

Input

Notices shall be sent to Acme Inc. at legal@acmeinc.com and to the Counterparty representative, dr.linda.walsh@lawgroup.net. Copy to compliance@acmeinc.com. Service of process address: 123 Main St, not an email.

Expected output

legal@acmeinc.com, dr.linda.walsh@lawgroup.net, compliance@acmeinc.com

#05 · Parsing a developer forum post to find mentioned contacts

Input

Hey, I forwarded this to devops-lead@startup.io already. You can also ping the vendor at support+tier2@cloudvendor.com. The old address (archived@legacy.startup.io) still works but responses are slow. Avoid sending to test@@broken.com.

Expected output

devops-lead@startup.io, support+tier2@cloudvendor.com, archived@legacy.startup.io
Note: test@@broken.com is excluded as it contains a double at-sign and does not match a valid email pattern.

Common mistakes to avoid

Overly greedy domain matching
Using .+ for the domain portion instead of a restricted character class causes the pattern to consume trailing punctuation, HTML tags, or whitespace as part of the email. The result looks like a valid match but the extracted string is unusable. Always restrict the domain character class and terminate the match explicitly at a word boundary or non-email character.
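The following sketch contrasts the two approaches on an invented HTML fragment, assuming Python's re module.

```python
import re

# Anti-pattern: .+ after the @ swallows everything to the end of the line.
GREEDY = r"[A-Za-z0-9._%+-]+@.+"
# Restricted domain class: stops at the first character that cannot be in a domain.
RESTRICTED = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

text = "Mail <b>info@example.com</b> today"

greedy_hits = re.findall(GREEDY, text)
restricted_hits = re.findall(RESTRICTED, text)

print(greedy_hits)      # -> ['info@example.com</b> today']
print(restricted_hits)  # -> ['info@example.com']
```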
Forgetting case-insensitive flag
Domains are case-insensitive, and although local parts are technically case-sensitive, most providers treat them as case-insensitive, so real data often contains mixed case like User@Example.COM. If your character classes list only lowercase letters and you do not apply the case-insensitive flag (re.IGNORECASE in Python, /i in JavaScript), you will miss uppercase variants. This is a silent failure: your script runs without errors but returns an incomplete list.
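A short sketch of the failure, assuming a pattern written with lowercase-only character classes and a made-up sample sentence:

```python
import re

# Lowercase-only classes: correct only when combined with re.IGNORECASE.
LOWER = r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}"

text = "Reply to User@Example.COM please"

without_flag = re.findall(LOWER, text)
with_flag = re.findall(LOWER, text, re.IGNORECASE)

print(without_flag)  # -> []
print(with_flag)     # -> ['User@Example.COM']
```

Patterns that already spell out A-Za-z, like the one used in the tested prompt, do not need the flag, though adding it is harmless.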
Using a pattern that cannot handle plus addressing
Many users add plus tags to their addresses for filtering, such as newsletter+unsubscribe@domain.com. Patterns that only allow alphanumeric characters in the local part do not capture these in full and may return only the part after the plus sign, which is a different address. This matters most when extracting from marketing or transactional email data where plus addressing is common.
Not stripping surrounding punctuation after extraction
In prose, an email address often appears inside parentheses, followed by a comma, or wrapped in angle brackets as in <contact@example.com>. A restricted character class stops at brackets, commas, and whitespace, but characters the local part allows, such as a leading dot or hyphen, can still be captured from the surrounding text, and a looser pattern may pull in more. Trim or post-process matched strings to remove non-email characters from the start and end of each result.
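A possible post-processing step is sketched below. The set of characters in STRIP_CHARS is an assumption and should be tuned to your data; the sample sentence is invented.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed set of characters that should never start or end an address here.
STRIP_CHARS = ".,;:-<>()[]\"'"

text = "Reach us at ...sales@example.com, or <info@example.com>."

raw = EMAIL_RE.findall(text)
cleaned = [m.strip(STRIP_CHARS) for m in raw]

print(raw)      # -> ['...sales@example.com', 'info@example.com']
print(cleaned)  # -> ['sales@example.com', 'info@example.com']
```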
Assuming regex validates deliverability
A regex match only confirms that a string looks like an email address structurally. It cannot tell you whether the domain has valid MX records, whether the mailbox exists, or whether the address will actually receive mail. Treating regex output as a clean deliverable list without a verification step leads to bounce rates and potential sender reputation damage.

Frequently asked questions

What is the best Python regex to extract email addresses?

A reliable starting pattern in Python is r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}' used with re.findall(pattern, text, re.IGNORECASE). This covers standard addresses, plus-sign addressing, and multi-level domains. For stricter RFC compliance you need a more complex pattern, but this one handles the vast majority of real-world cases without excessive false positives.
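As a minimal sketch, here is that pattern applied to the sample text from the prompt tested above. The variable names are illustrative.

```python
import re

PATTERN = r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"

text = (
    "Please reach out to our support team at help@acme.co or escalate to "
    "Jane Doe <jane.doe+urgent@sub.example.com>. For billing questions, "
    "email billing@acme.co or call 555-123-4567. Old contact "
    "bob_smith99@mail-server.io is no longer monitored."
)

emails = re.findall(PATTERN, text, re.IGNORECASE)
print(emails)
# -> ['help@acme.co', 'jane.doe+urgent@sub.example.com',
#     'billing@acme.co', 'bob_smith99@mail-server.io']
```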

How do I extract email addresses from a string in JavaScript?

Use the String match method with a global regex: text.match(/[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/gi). The g flag returns all matches and the i flag handles case. The result is an array of matched strings or null if there are no matches, so check for null before iterating.

Can a single regex extract all email addresses from an HTML page?

Yes, but you will likely get noise. HTML pages contain email addresses in mailto: links, in visible text, and sometimes in JavaScript strings. A regex run against raw HTML will find all of them, including addresses in hidden or template sections. If you only want visible contact emails, parse the HTML first with a library like BeautifulSoup, extract visible text, then run the regex on that.
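The sketch below illustrates the parse-then-extract idea. To stay self-contained it uses the standard library's html.parser in place of BeautifulSoup, and the VisibleText class and sample markup are assumptions. It skips script, style, and template content only; it does not detect elements hidden with CSS.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class VisibleText(HTMLParser):
    """Collect text nodes, skipping <script>, <style>, and <template> content."""
    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

html = (
    '<p>Contact <a href="mailto:hidden@example.com">sales@example.com</a></p>'
    '<script>var debug = "dev@example.com";</script>'
)

parser = VisibleText()
parser.feed(html)
parser.close()
visible = " ".join(parser.parts)

raw_hits = EMAIL_RE.findall(html)         # every address in the markup
visible_hits = EMAIL_RE.findall(visible)  # only addresses in text nodes

print(raw_hits)      # -> ['hidden@example.com', 'sales@example.com', 'dev@example.com']
print(visible_hits)  # -> ['sales@example.com']
```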

Why does my regex miss emails with country-code TLDs like .co.uk?

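This happens when the domain part of the pattern allows only one segment before the TLD, for example [a-zA-Z0-9\-]+\.[a-zA-Z]{2,}. Against user@company.co.uk it stops at user@company.co, treating .co as the TLD and leaving .uk behind. Fix this by matching the domain portion as one or more dot-separated segments: (?:[a-zA-Z0-9\-]+\.)+[a-zA-Z]{2,}. This handles arbitrary subdomain and TLD depth. The group is non-capturing so that Python's re.findall returns the full address rather than only the group.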

Is there a regex that validates email addresses rather than just extracting them?

For validation, a stricter pattern or a dedicated library is better than a general extraction regex. In Python, the email-validator package handles RFC compliance. For quick validation in any language, the pattern ^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$ with start and end anchors prevents partial matches, but it still cannot verify that the address is real or deliverable.
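In Python, a format check along these lines could be sketched with re.fullmatch, which requires the entire string to match without explicit anchors. The helper name looks_like_email is hypothetical, and this remains a format check only.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def looks_like_email(value: str) -> bool:
    """Format check only: True if the whole string fits the pattern."""
    return EMAIL_RE.fullmatch(value) is not None

print(looks_like_email("user@example.com"))          # -> True
print(looks_like_email("see user@example.com now"))  # -> False
print(looks_like_email("user@example.com\n"))        # -> False
```

One reason to prefer fullmatch over ^...$ in Python is that $ also matches just before a trailing newline, so an anchored re.match would accept the third example.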

How do I extract unique email addresses and remove duplicates?

After running re.findall in Python, deduplicate while keeping first-seen order with list(dict.fromkeys(re.findall(pattern, text, re.IGNORECASE))). Wrapping the result in set() also removes duplicates but discards the original order. In JavaScript, use new Set(text.match(...)) and spread it back to an array; a Set keeps insertion order. If matches can differ only by letter case, normalize them to lowercase before deduplicating, since user@Example.com and user@example.com are the same address.
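A sketch of order-preserving, case-insensitive deduplication follows. The helper name unique_emails and the sample text are assumptions, and lowercasing the whole address for comparison assumes your mail providers treat local parts as case-insensitive.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def unique_emails(text):
    """Return matches in first-seen order, comparing case-insensitively."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            result.append(match)  # keep the first spelling encountered
    return result

text = "Ping help@acme.co, billing@acme.co, then Help@Acme.co again."
print(unique_emails(text))  # -> ['help@acme.co', 'billing@acme.co']
```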
