Using Regex to Extract Email Addresses from Text

Tested prompts for regex to extract email addresses compared across 5 leading AI models.

BEST BY JUDGE SCORE Claude Haiku 4.5 9/10

When you have a block of text, a CSV, a log file, or a scraped webpage and you need to pull out every email address, regex is the standard tool for the job. A well-written regular expression scans the input string and returns every match that fits the pattern of a valid email address, without you having to write custom parsing logic from scratch.

The challenge is that email addresses are more varied than they look. A pattern that catches john@example.com will miss j.o+hn@sub.domain.co.uk or user123@company.io if it is too narrow. Go too broad and you start pulling in false positives from structured data that resembles an email but is not one.

This page walks through the most reliable regex patterns for extracting email addresses, shows you what each AI model produces when given a real extraction task, and compares their outputs so you can pick the right approach for your specific input. Whether you are working in Python, JavaScript, or just need a pattern to drop into a text editor's find tool, you will leave with something you can use immediately.

When to use this

Regex email extraction is the right approach when you have unstructured or semi-structured text and need a fast, programmatic way to pull addresses without setting up a full parsing pipeline. It works well in scripts, data cleaning steps, and automation workflows where the input format is consistent enough that a pattern will not produce large numbers of false positives.

  • Extracting contact emails from scraped HTML pages or raw website dumps
  • Parsing log files or error reports to identify notified or affected users
  • Cleaning a CSV or spreadsheet column that mixes emails with other text
  • Pulling recipient addresses out of exported email threads or .eml files
  • Building a quick data pipeline step that feeds downstream deduplication or CRM import

When this format breaks down

  • Validating whether an email address is real or deliverable: regex confirms format only, not existence. Use an email verification API for that.
  • Parsing RFC 5322-compliant edge cases at scale: addresses with quoted local parts like '"john doe"@example.com' or comments will break most practical regex patterns.
  • Inputs where the same character sequence appears in non-email contexts, such as version strings or file paths formatted like 'v2.0@release', leading to high false-positive rates.
  • Production systems where email accuracy is legally or financially critical: a missed or malformed extraction can cause compliance failures that regex alone cannot guard against.

The prompt we tested

You are a regex and text-extraction assistant. Your task is to extract all email addresses from the provided text using a reliable regular expression.

Follow these instructions exactly:
Use a robust email regex such as [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,} and present it inside a fenced code block. Keep the explanation to 3-5 short bullet points, then output the extracted emails as a plain bullet list with duplicates removed and order preserved. If no emails are found, state that explicitly.

Text to process:
Please reach out to our support team at help@acme.co or escalate to Jane Doe <jane.doe+urgent@sub.example.com>. For billing questions, email billing@acme.co or call 555-123-4567. Old contact bob_smith99@mail-server.io is no longer monitored.

Return:
1. The regex pattern used (in a code block).
2. A brief explanation of what each part of the pattern matches.
3. A deduplicated list of all email addresses found in the text.

Model comparison

Model                    Quality  Cost     Latency  Tokens
Claude Opus 4.7          9        $0.0331  5.2s     359 / 370
Claude Haiku 4.5         9        $0.0016  2.1s     264 / 265
GPT-5.4                  9        $0.0031  2.6s     234 / 167
Gemini 2.5 Pro           9        $0.0015  9.1s     253 / 233
Grok 4.1 Fast Reasoning  9        $0.0002  5.5s     383 / 215

Outputs

Claude Opus 4.7 9/10
## Regex Pattern

```regex
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

## Explanation

- `[A-Za-z0-9._%+-]+` — matches the local part (before `@`), allowing letters, digits, dots, underscores, percent signs, plus, and hyphens.
- `@` — matches the literal `@` separator between local and domain parts.
- `[A-Za-z0-9.-]+` — matches the domain name, allowing letters, digits, dots, and hyphens (covers subdomains like `sub.example`).
- `\.[A-Za-z]{2,}` — matches the top-level domain: a dot followed by at least two letters (e.g., `.co`, `.com`, `.io`).
- The `+` quantifiers ensure each section has one or more valid characters, preventing matches on malformed strings.

## Extracted Emails

- help@acme.co
- jane.doe+urgent@sub.example.com
- billing@acme.co
- bob_smith99@mail-server.io
Judge notes: Excellent response that fully satisfies the query with a robust regex pattern, clear explanation, and accurate extraction of all emails from the provided text with proper deduplication.
Claude Haiku 4.5 9/10
# Email Extraction Results

## Regex Pattern

```
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

## Pattern Explanation

• **`[A-Za-z0-9._%+-]+`** – Matches the local part (before @): letters, digits, dots, underscores, percent signs, plus signs, and hyphens
• **`@`** – Matches the literal @ symbol that separates local and domain parts
• **`[A-Za-z0-9.-]+`** – Matches the domain name: letters, digits, dots, and hyphens
• **`\.`** – Matches the literal dot before the top-level domain
• **`[A-Za-z]{2,}`** – Matches the TLD (top-level domain): 2 or more alphabetic characters

## Extracted Email Addresses

• help@acme.co
• jane.doe+urgent@sub.example.com
• billing@acme.co
• bob_smith99@mail-server.io
Judge notes: Excellent response that fully satisfies the query with a robust regex pattern, clear explanation of each component, and accurate extraction of all emails from the provided text with proper deduplication.
GPT-5.4 9/10
```regex
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

- `[A-Za-z0-9._%+-]+` matches the local part before `@`, including letters, digits, and common email symbols.
- `@` matches the required at-sign separating local and domain parts.
- `[A-Za-z0-9.-]+` matches the domain/subdomain portion.
- `\.` matches the literal dot before the top-level domain.
- `[A-Za-z]{2,}` matches a top-level domain with at least 2 letters.

- help@acme.co
- jane.doe+urgent@sub.example.com
- billing@acme.co
- bob_smith99@mail-server.io
Judge notes: Excellent response that directly addresses the query with a robust regex pattern, clear explanation of each component, and correctly extracted all four unique email addresses from the text.
Gemini 2.5 Pro 9/10
```regex
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

*   `[A-Za-z0-9._%+-]+` Matches the local part (username) of the email address before the `@` symbol.
*   `@` Matches the literal "@" symbol that separates the local part from the domain.
*   `[A-Za-z0-9.-]+` Matches the domain name (e.g., "example", "sub.domain").
*   `\.` Matches the literal dot separating the domain from the top-level domain.
*   `[A-Za-z]{2,}` Matches the top-level domain (TLD) like ".com" or ".co", ensuring it has at least two letters.

*   help@acme.co
*   jane.doe+urgent@sub.example.com
*   billing@acme.co
*   bob_smith99@mail-server.io
Judge notes: The output fully satisfies the query with a reliable regex pattern, clear explanation of each component, and correctly extracted all four email addresses with proper deduplication—minor deduction only for the regex lacking some edge-case coverage (e.g., hyphens in domain names) but this is negligible for practical use.
Grok 4.1 Fast Reasoning 9/10
1. The regex pattern used (in a code block).
```
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
```

2. A brief explanation of what each part of the pattern matches.
- `[A-Za-z0-9._%+-]+`: Matches one or more alphanumeric characters or allowed special chars (._%+-) in the local part before @.
- `@`: Matches the literal @ symbol.
- `[A-Za-z0-9.-]+`: Matches one or more alphanumeric characters, dots, or hyphens in the domain name.
- `\.`: Matches a literal dot separating domain and TLD.
- `[A-Za-z]{2,}`: Matches two or more alphabetic characters for the top-level domain (TLD).

3. A deduplicated list of all email addresses found in the text.
- help@acme.co
- jane.doe+urgent@sub.example.com
- billing@acme.co
- bob_smith99@mail-server.io
Judge notes: Excellent execution—provides a solid, practical regex pattern with clear explanation and correctly extracted all four emails from the text; only minor deduction for the regex not being the most RFC-compliant variant, though it's appropriate for most real-world use cases.

What makes these work

  1. Use a word-boundary anchor

    Wrapping your pattern in boundary assertions prevents matches that start or end inside a longer token. Without them, a pattern can return 'user@example.com' from a token like 'user@example.com2', or only the ASCII tail of a local part that contains characters the pattern does not allow. A plain \b is unreliable next to the local part, which may begin or end with non-word characters such as + or -. Where your regex flavor supports them, lookbehind and lookahead assertions such as (?<![\w.%+-]) before the local part and (?![\w-]) after the TLD are usually the safer choice.

  2. Account for subdomains and country TLDs

    Many real-world addresses use multi-part domains such as user@mail.company.co.uk. A domain segment that only allows one dot will silently truncate or drop these. Match the domain portion with something like (?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,} to handle arbitrary depth. The group is non-capturing so that functions which return group contents, such as Python's re.findall, still return the whole address. Set the TLD minimum to 2 characters and avoid capping the maximum too low, since new TLDs can be long.

  3. Support plus-sign and dot addressing in the local part

    Services like Gmail support both dots and plus signs in the local part, for example first.last+filter@gmail.com. A local part pattern restricted to alphanumeric characters only will miss these. Use [a-zA-Z0-9._%+\-]+ as the local part character class to cover the most common valid characters without going so broad that you pick up garbage.

  4. Test against both positive and negative examples

    Before shipping a regex into production, run it against a set of known-good emails and a set of known-invalid strings. Include edge cases like double at-signs, trailing dots, missing TLDs, and IP-address domains. A pattern that passes only the happy path will produce noisy output on real-world messy data and quietly corrupt downstream processes.
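
The four points above can be combined into one pattern and checked against a small test set. The following is a minimal sketch rather than a definitive implementation: the name extract_emails, the lookaround character sets, and the sample strings are assumptions chosen for illustration.

```python
import re

# Assumed hardened variant of the page's pattern. The lookaround sets are a
# judgment call and may need tuning for your data.
EMAIL_RE = re.compile(
    r"(?<![\w.%+-])"         # not preceded by a character that could extend the local part
    r"[A-Za-z0-9._%+-]+"     # local part, including dots and plus tags
    r"@"
    r"(?:[A-Za-z0-9-]+\.)+"  # one or more dot-separated domain labels (non-capturing)
    r"[A-Za-z]{2,}"          # final label: letters only, no upper length cap
    r"(?![\w-])"             # not followed by a character that could extend the label
)


def extract_emails(text: str) -> list[str]:
    """Return every substring of text that matches EMAIL_RE, in order."""
    return EMAIL_RE.findall(text)


# Known-good strings should come back unchanged.
known_good = ["user@mail.company.co.uk", "first.last+filter@gmail.com"]
# Known-bad strings: double at-sign, missing TLD, IP-address domain, no at-sign.
known_bad = ["test@@broken.com", "user@localhost", "user@192.168.1.4", "noatsign"]

for candidate in known_good:
    assert extract_emails(candidate) == [candidate]
for candidate in known_bad:
    assert extract_emails(candidate) == []
```

Because the trailing lookahead only rejects word characters and hyphens, an address followed by a sentence-ending period or a closing bracket is still extracted without the punctuation.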

More example scenarios

#01 · Extracting contacts from a scraped B2B company page
Input
Reach our sales team at sales@acmecorp.com or contact our support lead directly: j.harrison+support@acmecorp.co.uk. For press inquiries email media.relations@acmecorp.com. Invalid strings like @@broken and noatsign should be ignored.
Expected output
sales@acmecorp.com, j.harrison+support@acmecorp.co.uk, media.relations@acmecorp.com
#02 · Pulling user emails from a server access log
Input
2024-03-15 08:42:11 INFO Login attempt user=marco.rossi@techfirm.it ip=192.168.1.4
2024-03-15 08:43:02 ERROR Auth failed user=a.patel@vendor.org ip=10.0.0.9
2024-03-15 08:44:55 INFO Session started user=chen_wei@enterprise.com.cn ip=172.16.0.2
Expected output
marco.rossi@techfirm.it, a.patel@vendor.org, chen_wei@enterprise.com.cn
#03 · Cleaning a marketing export where emails are mixed with other data
Input
ID:1042 | Name: Priya Nair | Contact: priya.nair@retailbrand.in | Tier: Gold | Referral: n/a
ID:1043 | Name: Tom Briggs | Contact: tombriggs_AT_example.com | Tier: Silver
ID:1044 | Name: Sara Okonkwo | Contact: s.okonkwo@ngohub.org | Tier: Bronze
Expected output
priya.nair@retailbrand.in, s.okonkwo@ngohub.org
Note: tombriggs_AT_example.com is not a valid email format and is excluded. If obfuscated emails need handling, a pre-processing substitution step is required before regex extraction.
#04 · Extracting emails from a legal contract document
Input
Notices shall be sent to Acme Inc. at legal@acmeinc.com and to the Counterparty representative, dr.linda.walsh@lawgroup.net. Copy to compliance@acmeinc.com. Service of process address: 123 Main St, not an email.
Expected output
legal@acmeinc.com, dr.linda.walsh@lawgroup.net, compliance@acmeinc.com
#05 · Parsing a developer forum post to find mentioned contacts
Input
Hey, I forwarded this to devops-lead@startup.io already. You can also ping the vendor at support+tier2@cloudvendor.com. The old address (archived@legacy.startup.io) still works but responses are slow. Avoid sending to test@@broken.com.
Expected output
devops-lead@startup.io, support+tier2@cloudvendor.com, archived@legacy.startup.io
Note: test@@broken.com is excluded as it contains a double at-sign and does not match a valid email pattern.
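
The expected outputs above can be checked mechanically. This sketch runs the pattern from the tested prompt against scenarios #03 and #05, assuming Python's re module; the variable names are illustrative.

```python
import re

# Pattern from the tested prompt.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

scenario_03 = (
    "ID:1042 | Name: Priya Nair | Contact: priya.nair@retailbrand.in | Tier: Gold | Referral: n/a\n"
    "ID:1043 | Name: Tom Briggs | Contact: tombriggs_AT_example.com | Tier: Silver\n"
    "ID:1044 | Name: Sara Okonkwo | Contact: s.okonkwo@ngohub.org | Tier: Bronze"
)
scenario_05 = (
    "Hey, I forwarded this to devops-lead@startup.io already. You can also ping "
    "the vendor at support+tier2@cloudvendor.com. The old address "
    "(archived@legacy.startup.io) still works but responses are slow. "
    "Avoid sending to test@@broken.com."
)

found_03 = EMAIL_RE.findall(scenario_03)
found_05 = EMAIL_RE.findall(scenario_05)

# The obfuscated and double-at-sign strings produce no match.
assert found_03 == ["priya.nair@retailbrand.in", "s.okonkwo@ngohub.org"]
assert found_05 == [
    "devops-lead@startup.io",
    "support+tier2@cloudvendor.com",
    "archived@legacy.startup.io",
]
```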

Common mistakes to avoid

  • Overly greedy domain matching

    Using .+ for the domain portion instead of a restricted character class causes the pattern to consume trailing punctuation, HTML tags, or whitespace as part of the email. The result looks like a valid match but the extracted string is unusable. Always restrict the domain character class and terminate the match explicitly at a word boundary or non-email character.

  • Forgetting case-insensitive flag

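    Domains are case-insensitive, and although local parts are technically case-sensitive under the SMTP standard, most providers treat them as case-insensitive. Real data often contains mixed case like User@Example.COM. If your character classes list only lowercase ranges such as [a-z0-9], you will miss uppercase variants unless you either apply the case-insensitive flag (re.IGNORECASE in Python, /i in JavaScript) or spell out both ranges as [A-Za-z]. This is a silent failure: your script runs without errors but returns an incomplete list.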

  • Using a pattern that cannot handle plus addressing

    A significant share of modern email addresses use plus signs for filtering, such as newsletter+unsubscribe@domain.com. Patterns that only allow alphanumeric characters in the local part drop these silently. This matters most when extracting from marketing or transactional email data where plus addressing is common.

  • Not stripping surrounding punctuation after extraction

    In prose, an email address often appears inside parentheses, followed by a comma, or wrapped in angle brackets as in <contact@example.com>. The restricted character classes recommended on this page exclude those characters, but a looser pattern can capture them, and even the recommended local-part class accepts a leading dot or hyphen that actually belongs to the surrounding text. Always trim or post-process matched strings to remove non-email characters from the start and end of each result.

  • Assuming regex validates deliverability

    A regex match only confirms that a string looks like an email address structurally. It cannot tell you whether the domain has valid MX records, whether the mailbox exists, or whether the address will actually receive mail. Treating regex output as a clean deliverable list without a verification step leads to bounce rates and potential sender reputation damage.
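
To illustrate the first mistake in this list, overly greedy domain matching, the sketch below compares a .+ domain with the restricted character class on one sample sentence. The sentence and variable names are made up for the example.

```python
import re

text = "Contact jane@example.com, then call us."

# Greedy domain: .+ consumes everything up to the end of the line.
greedy = re.findall(r"[A-Za-z0-9._%+-]+@.+", text)

# Restricted domain: the match stops at the first character outside the class.
restricted = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", text)

print(greedy)      # ['jane@example.com, then call us.']
print(restricted)  # ['jane@example.com']
```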

Frequently asked questions

What is the best Python regex to extract email addresses?

A reliable starting pattern in Python is r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}' used with re.findall(pattern, text, re.IGNORECASE). This covers standard addresses, plus-sign addressing, and multi-level domains. For stricter RFC compliance you need a more complex pattern, but this one handles the vast majority of real-world cases without excessive false positives.
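
As a minimal usage sketch, here is that call applied to part of the sample text from the tested prompt. The variable names are illustrative.

```python
import re

PATTERN = r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"

text = (
    "Please reach out to our support team at help@acme.co or escalate to "
    "Jane Doe <jane.doe+urgent@sub.example.com>. For billing questions, "
    "email billing@acme.co or call 555-123-4567."
)

# re.IGNORECASE is redundant here because the classes list both cases,
# but it is harmless and guards against a lowercase-only edit later.
emails = re.findall(PATTERN, text, re.IGNORECASE)
print(emails)  # ['help@acme.co', 'jane.doe+urgent@sub.example.com', 'billing@acme.co']
```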

How do I extract email addresses from a string in JavaScript?

Use the String match method with a global regex: text.match(/[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/gi). The g flag returns all matches and the i flag handles case. The result is an array of matched strings or null if there are no matches, so check for null before iterating.

Can a single regex extract all email addresses from an HTML page?

Yes, but you will likely get noise. HTML pages contain email addresses in mailto: links, in visible text, and sometimes in JavaScript strings. A regex run against raw HTML will find all of them, including addresses in hidden or template sections. If you only want visible contact emails, parse the HTML first with a library like BeautifulSoup, extract visible text, then run the regex on that.
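
The parse-then-extract approach might look like the sketch below. To keep the example dependency-free it swaps BeautifulSoup for the standard library's html.parser. The VisibleText class and the sample markup are assumptions, and "visible" here only means text outside script, style, and template elements; content hidden with CSS is not detected.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class VisibleText(HTMLParser):
    """Collects text nodes, skipping script, style, and template content."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


html = (
    '<p>Write to <a href="mailto:sales@example.com">sales@example.com</a></p>'
    '<script>var debug = "dev@internal.example";</script>'
)

parser = VisibleText()
parser.feed(html)
parser.close()
visible = " ".join(parser.chunks)

from_visible = EMAIL_RE.findall(visible)  # text nodes only
from_raw = EMAIL_RE.findall(html)         # also sees the mailto: link and the script string
print(from_visible)  # ['sales@example.com']
print(from_raw)      # ['sales@example.com', 'sales@example.com', 'dev@internal.example']
```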

Why does my regex miss emails with country-code TLDs like .co.uk?

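This happens when the domain part of the pattern allows only a single label before the TLD, for example [a-zA-Z0-9-]+\.[a-zA-Z]{2,} with no dot in the character class. The address user@company.co.uk is then cut to user@company.co, because .co is treated as the TLD and .uk is left over. Fix this by matching the domain portion as one or more dot-separated segments: (?:[a-zA-Z0-9\-]+\.)+[a-zA-Z]{2,}. This handles arbitrary subdomain and TLD depth. The group is non-capturing because Python's re.findall returns only the group contents when a pattern contains a capturing group.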
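
A small comparison, assuming Python's re module, shows the difference. The pattern names are illustrative. The segmented pattern is written with a non-capturing group, since re.findall returns only the group contents when a pattern has a capturing group.

```python
import re

LOCAL = r"[A-Za-z0-9._%+-]+"

# Single label before the TLD: no dot allowed in the domain class.
narrow = re.compile(LOCAL + r"@[A-Za-z0-9-]+\.[A-Za-z]{2,}")
# One or more dot-separated labels, non-capturing.
segmented = re.compile(LOCAL + r"@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")
# Same idea with a capturing group, to show the findall pitfall.
capturing = re.compile(LOCAL + r"@([A-Za-z0-9-]+\.)+[A-Za-z]{2,}")

address = "user@company.co.uk"

print(narrow.findall(address))     # ['user@company.co']
print(segmented.findall(address))  # ['user@company.co.uk']
print(capturing.findall(address))  # ['co.']
```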

Is there a regex that validates email addresses rather than just extracting them?

For validation, a stricter pattern or a dedicated library is better than a general extraction regex. In Python, the email-validator package handles RFC compliance. For quick validation in any language, the pattern ^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$ with start and end anchors prevents partial matches, but it still cannot verify that the address is real or deliverable.
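
For a quick format check in Python, one option is re.fullmatch, which requires the whole string to match and so plays the role of the ^ and $ anchors. It also avoids a quirk of $, which matches just before a trailing newline. The function name looks_like_email is an assumption for this sketch, and the check is purely structural.

```python
import re

VALID_RE = re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}")


def looks_like_email(candidate: str) -> bool:
    """Return True if the entire string has the shape of an email address."""
    return VALID_RE.fullmatch(candidate) is not None


print(looks_like_email("user@example.com"))          # True
print(looks_like_email("see user@example.com now"))  # False: extra text around the address
print(looks_like_email("user@example"))              # False: no TLD
```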

How do I extract unique email addresses and remove duplicates?

After running re.findall in Python, list(set(re.findall(pattern, text, re.IGNORECASE))) removes duplicates but does not preserve the order in which the addresses appeared. To keep first-seen order, use list(dict.fromkeys(re.findall(pattern, text, re.IGNORECASE))) instead. In JavaScript, use new Set(text.match(...)) and spread it back to an array. To treat addresses that differ only in case as duplicates, normalize all matches to lowercase before deduplicating, since user@Example.com and user@example.com are the same address.
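
A minimal sketch of order-preserving, case-insensitive deduplication follows. The helper name unique_emails is hypothetical, and lowercasing the whole address (local part included) for comparison is an assumption that suits most providers.

```python
import re

PATTERN = r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"


def unique_emails(text: str) -> list[str]:
    """Return matches in first-seen order, ignoring case when comparing."""
    seen = set()
    unique = []
    for match in re.findall(PATTERN, text):
        key = match.lower()  # assumption: addresses differing only in case are duplicates
        if key not in seen:
            seen.add(key)
            unique.append(match)  # keep the spelling of the first occurrence
    return unique


print(unique_emails("a@x.com, b@y.org, A@X.com, a@x.com"))  # ['a@x.com', 'b@y.org']
```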
