AI Resume Parsers to Extract Candidate Data from PDF CVs

Tested prompts for extracting data from resume PDFs, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (9/10)

When you receive a stack of PDF resumes, copying candidate data into a spreadsheet or ATS by hand is slow, error-prone, and a poor use of anyone's time. AI resume parsers solve this by reading the raw text of a PDF CV and returning structured data: name, contact details, work history, education, skills, and more. The result is a clean JSON object or table row you can pipe directly into a database, hiring tool, or screening workflow.

The challenge is that resumes have no standard format. One candidate submits a two-column design with a sidebar; another uses a plain Word-to-PDF export. A good AI parsing prompt needs to handle both without breaking. The models compared on this page were tested against the same varied set of real-world resume PDFs, so you can see exactly where each one succeeds and where it falls short.

This page gives you the prompt, the model outputs, a side-by-side comparison, and the editorial context you need to pick the right approach for your volume, budget, and downstream data requirements. Whether you are building a recruiting pipeline, doing one-off candidate research, or automating a hiring workflow, the answer is here.

When to use this

AI resume parsing is the right tool when you need structured candidate data out of unstructured PDF files at a speed or volume that makes manual entry impractical. It fits anywhere the output needs to feed a system: a spreadsheet, an ATS, a database, or a scoring model. It works on single files and on batches.

  • Bulk importing candidates from a job board into an internal ATS or CRM
  • Screening applicants by extracting years of experience or specific skills before a human review
  • Building a candidate database from archived PDF resumes that were never digitized
  • Automating pre-interview data sheets so recruiters walk in with structured notes already prepared
  • Comparing candidates side by side by normalizing their PDFs into a shared schema

When this format breaks down

  • Resumes submitted as scanned image PDFs with no embedded text layer: the model reads text, not pixels, so you need an OCR preprocessing step first or the output will be empty or hallucinated.
  • Highly regulated hiring contexts where every data extraction decision must be auditable and traceable to a specific rule: LLM outputs are probabilistic and can silently misread a date or job title without flagging it.
  • Real-time parsing at the moment a candidate submits a form if your latency budget is under two seconds: API round-trips for large PDFs add up, and a dedicated parsing microservice will be faster.
  • Resumes written in languages the chosen model handles poorly: quality drops sharply for low-resource languages, and you may get confident-looking but wrong extractions.

The prompt we tested

You are an expert resume parser. Extract structured candidate data from the resume text provided below and return it in a clean, organized format.

Rules and formatting requirements:
Output a single JSON object with these keys: full_name, email, phone, location, linkedin, summary, skills (array), work_experience (array of objects with company, title, start_date, end_date, location, responsibilities as array), education (array of objects with institution, degree, field, graduation_date, gpa), certifications (array), and languages (array). Use null for missing fields, preserve original wording for responsibilities, and normalize all dates to YYYY-MM format. Do not include any commentary outside the JSON.

Resume content to parse:
Jane Doe
jane.doe@email.com | (415) 555-0199 | San Francisco, CA | linkedin.com/in/janedoe

SUMMARY
Senior software engineer with 7+ years building scalable backend systems in Python and Go.

EXPERIENCE
Stripe — Senior Software Engineer (Mar 2021 – Present), San Francisco, CA
- Led migration of payments service to Go, reducing p99 latency by 40%
- Mentored 4 junior engineers and ran weekly architecture reviews

Airbnb — Software Engineer (Jun 2017 – Feb 2021), San Francisco, CA
- Built internal tooling used by 200+ engineers

EDUCATION
UC Berkeley — B.S. Computer Science, 2017, GPA 3.8

SKILLS: Python, Go, PostgreSQL, Kubernetes, AWS
CERTIFICATIONS: AWS Solutions Architect Associate (2022)

Return the extracted data now.

Model comparison

Model                      Quality   Cost      Latency   Tokens
Claude Opus 4.7            9         $0.0493   4.6s      576 / 542
Claude Haiku 4.5           9         $0.0025   1.9s      375 / 434
GPT-5.4                    9         $0.0063   3.7s      339 / 362
Gemini 2.5 Pro             not run yet
Grok 4.1 Fast Reasoning    9         $0.0003   11.5s     487 / 346

Outputs

Claude Opus 4.7 9/10
{
  "full_name": "Jane Doe",
  "email": "jane.doe@email.com",
  "phone": "(415) 555-0199",
  "location": "San Francisco, CA",
  "linkedin": "linkedin.com/in/janedoe",
  "summary": "Senior software engineer with 7+ years building scalable backend systems in Python and Go.",
  "skills": ["Python", "Go", "PostgreSQL", "Kubernetes", "AWS"],
  "work_experience": [
    {
      "company": "Stripe",
      "title": "Senior Software Engineer",
      "start_date": "2021-03",
      "end_date": "Present",
      "location": "San Francisco, CA",
      "responsibilities": [
        "Led migration of payments service to Go, reducing p99 latency by 40%",
        "Mentored 4 junior engineers and ran weekly architecture reviews"
      ]
    },
    {
      "company": "Airbnb",
      "title": "Software Engineer",
      "start_date": "2017-06",
      "end_date": "2021-02",
      "location": "San Francisco, CA",
      "responsibilities": [
        "Built internal tooling used by 200+ engineers"
      ]
    }
  ],
  "education": [
    {
      "institution": "UC Berkeley",
      "degree": "B.S.",
      "field": "Computer Science",
      "graduation_date": "2017",
      "gpa": "3.8"
    }
  ],
  "certifications": ["AWS Solutions Architect Associate (2022)"],
  "languages": []
}
Judge notes: Excellent extraction with proper JSON formatting and field organization; only minor issue is graduation_date should be "2017-06" or similar normalized format per the YYYY-MM requirement, and "Present" could arguably be null or a specific date, but overall highly satisfactory.
Claude Haiku 4.5 9/10
```json
{
  "full_name": "Jane Doe",
  "email": "jane.doe@email.com",
  "phone": "(415) 555-0199",
  "location": "San Francisco, CA",
  "linkedin": "linkedin.com/in/janedoe",
  "summary": "Senior software engineer with 7+ years building scalable backend systems in Python and Go.",
  "skills": [
    "Python",
    "Go",
    "PostgreSQL",
    "Kubernetes",
    "AWS"
  ],
  "work_experience": [
    {
      "company": "Stripe",
      "title": "Senior Software Engineer",
      "start_date": "2021-03",
      "end_date": null,
      "location": "San Francisco, CA",
      "responsibilities": [
        "Led migration of payments service to Go, reducing p99 latency by 40%",
        "Mentored 4 junior engineers and ran weekly architecture reviews"
      ]
    },
    {
      "company": "Airbnb",
      "title": "Software Engineer",
      "start_date": "2017-06",
      "end_date": "2021-02",
      "location": "San Francisco, CA",
      "responsibilities": [
        "Built internal tooling used by 200+ engineers"
      ]
    }
  ],
  "education": [
    {
      "institution": "UC Berkeley",
      "degree": "B.S.",
      "field": "Computer Science",
      "graduation_date": "2017",
      "gpa": "3.8"
    }
  ],
  "certifications": [
    "AWS Solutions Architect Associate (2022)"
  ],
  "languages": []
}
```
Judge notes: Excellent extraction with proper JSON formatting, correct date normalization, and all required fields; minor issue is graduation_date should be "2017-01" or similar YYYY-MM format per the stated rule, though this is a borderline interpretive case.
GPT-5.4 9/10
{
  "full_name": "Jane Doe",
  "email": "jane.doe@email.com",
  "phone": "(415) 555-0199",
  "location": "San Francisco, CA",
  "linkedin": "linkedin.com/in/janedoe",
  "summary": "Senior software engineer with 7+ years building scalable backend systems in Python and Go.",
  "skills": [
    "Python",
    "Go",
    "PostgreSQL",
    "Kubernetes",
    "AWS"
  ],
  "work_experience": [
    {
      "company": "Stripe",
      "title": "Senior Software Engineer",
      "start_date": "2021-03",
      "end_date": null,
      "location": "San Francisco, CA",
      "responsibilities": [
        "Led migration of payments service to Go, reducing p99 latency by 40%",
        "Mentored 4 junior engineers and ran weekly architecture reviews"
      ]
    },
    {
      "company": "Airbnb",
      "title": "Software Engineer",
      "start_date": "2017-06",
      "end_date": "2021-02",
      "location": "San Francisco, CA",
      "responsibilities": [
        "Built internal tooling used by 200+ engineers"
      ]
    }
  ],
  "education": [
    {
      "institution": "UC Berkeley",
      "degree": "B.S.",
      "field": "Computer Science",
      "graduation_date": "2017-01",
      "gpa": "3.8"
    }
  ],
  "certifications": [
    "AWS Solutions Architect Associate (2022)"
  ],
  "languages": []
}
Judge notes: Excellent extraction with proper JSON formatting and accurate data parsing; only minor issue is graduation_date normalized to "2017-01" instead of "2017" since only year was provided.
Grok 4.1 Fast Reasoning 9/10
{
  "full_name": "Jane Doe",
  "email": "jane.doe@email.com",
  "phone": "(415) 555-0199",
  "location": "San Francisco, CA",
  "linkedin": "linkedin.com/in/janedoe",
  "summary": "Senior software engineer with 7+ years building scalable backend systems in Python and Go.",
  "skills": ["Python", "Go", "PostgreSQL", "Kubernetes", "AWS"],
  "work_experience": [
    {
      "company": "Stripe",
      "title": "Senior Software Engineer",
      "start_date": "2021-03",
      "end_date": null,
      "location": "San Francisco, CA",
      "responsibilities": [
        "Led migration of payments service to Go, reducing p99 latency by 40%",
        "Mentored 4 junior engineers and ran weekly architecture reviews"
      ]
    },
    {
      "company": "Airbnb",
      "title": "Software Engineer",
      "start_date": "2017-06",
      "end_date": "2021-02",
      "location": "San Francisco, CA",
      "responsibilities": [
        "Built internal tooling used by 200+ engineers"
      ]
    }
  ],
  "education": [
    {
      "institution": "UC Berkeley",
      "degree": "B.S.",
      "field": "Computer Science",
      "graduation_date": "2017-01",
      "gpa": "3.8"
    }
  ],
  "certifications": ["AWS Solutions Architect Associate (2022)"],
  "languages": []
}
Judge notes: Excellent extraction with proper JSON formatting and date normalization; only minor issue is graduation_date normalized to "2017-01" instead of "2017" since only year was provided in source.

What makes these work

  1. Define your output schema upfront

    Tell the model exactly which fields you want and in what format before it sees the resume text. A prompt that says 'return a JSON object with these exact keys: name, email, phone, skills, experience, education' produces far more consistent output than one that says 'extract the important information.' Consistency matters most when you are processing batches and downstream code expects the same structure every time.

  2. Handle missing fields explicitly

    Instruct the model to return null for any field it cannot find rather than guessing or omitting the key. If you do not specify this, models often skip missing fields silently, which breaks downstream code that expects a full object. Adding a parse_notes field for ambiguous cases also helps your team know when to do a manual check.

  3. Normalize dates and tenure at extraction time

    Ask the model to calculate total years of experience and convert all date formats to ISO 8601 (YYYY-MM) inside the prompt. Resumes use every date format imaginable: 'Jan 2020', '01/2020', '2020-present'. Normalizing during extraction avoids a messy cleanup step later and makes sorting or filtering candidates by experience reliable.

  4. Separate extraction from judgment

    Use one prompt to extract raw data and a second prompt to score or evaluate the candidate. Combining them in a single prompt makes both tasks worse: the model tries to serve two goals and the structured output becomes less reliable. Extract first, evaluate second, with the structured JSON as the input to the scoring step.
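The first three rules above can also be enforced deterministically after the model responds, so a malformed reply never reaches your database. Below is a minimal post-processing sketch in Python (standard library only). The schema keys mirror the tested prompt on this page; the function names and regex patterns are our own illustration, not part of any library:

```python
import json
import re

# Top-level keys required by the prompt's schema.
SCHEMA_KEYS = [
    "full_name", "email", "phone", "location", "linkedin", "summary",
    "skills", "work_experience", "education", "certifications", "languages",
]

MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def normalize_date(raw):
    """Best-effort normalization of a resume date string to YYYY-MM.

    Returns None for empty values and "Present"-style markers, and a
    bare "YYYY" unchanged when no month is available.
    """
    if not raw or raw.strip().lower() in {"present", "current", "now"}:
        return None
    s = raw.strip()
    m = re.match(r"([A-Za-z]{3})[a-z]*\.?\s+(\d{4})$", s)   # "Mar 2021"
    if m and m.group(1).lower() in MONTHS:
        return f"{m.group(2)}-{MONTHS[m.group(1).lower()]:02d}"
    m = re.match(r"(\d{1,2})[/-](\d{4})$", s)               # "03/2021"
    if m:
        return f"{m.group(2)}-{int(m.group(1)):02d}"
    if re.match(r"\d{4}-\d{2}$", s) or re.match(r"\d{4}$", s):
        return s                                            # already clean
    return None

def validate_parsed(payload):
    """Parse model output, fill missing keys with null, normalize job dates."""
    data = json.loads(payload)
    for key in SCHEMA_KEYS:
        data.setdefault(key, None)
    for job in data.get("work_experience") or []:
        job["start_date"] = normalize_date(job.get("start_date"))
        job["end_date"] = normalize_date(job.get("end_date"))
    return data
```

Running model output through a validator like this is cheap insurance: even a model that silently drops a key or returns "Present" instead of null produces a record your downstream code can still consume.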

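The extract-then-evaluate split from rule 4 looks like this in outline. `call_model` is a hypothetical stub standing in for whatever LLM client you use; it returns canned JSON so the control flow is runnable, and only the two-step structure is the point:

```python
import json

def call_model(prompt):
    """Hypothetical stub for your LLM client; returns canned JSON replies
    so the control flow below is runnable as-is."""
    if "Extract structured candidate data" in prompt:
        return json.dumps({"full_name": "Jane Doe", "skills": ["Python", "Go"]})
    return json.dumps({"score": 8, "reason": "Strong backend match"})

def extract(resume_text):
    """Step 1: factual extraction only -- no scoring, no inference."""
    prompt = ("Extract structured candidate data as JSON from the resume "
              "below. Output JSON only.\n\n" + resume_text)
    return json.loads(call_model(prompt))

def evaluate(candidate, job_description):
    """Step 2: scoring sees only the structured JSON, never the raw PDF text."""
    prompt = (f"Score this candidate from 1-10 for the role below.\n"
              f"Candidate: {json.dumps(candidate)}\nRole: {job_description}")
    return json.loads(call_model(prompt))

candidate = extract("Jane Doe ... Senior Software Engineer ...")
verdict = evaluate(candidate, "Senior backend engineer, Go preferred")
```

A side benefit of the split: the extraction output can be cached and re-scored against different job descriptions without re-parsing the PDF.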
More example scenarios

#01 · Software engineering recruiter screening backend candidates
Input
Resume PDF for Priya Nair. She lists 6 years at Amazon as a Senior Software Engineer working on distributed systems, 2 years at a startup as a backend engineer using Go and Kafka, a BS in Computer Science from UT Austin, and skills including Python, Go, Kubernetes, and AWS. She includes her LinkedIn and GitHub URLs.
Expected output
{ "name": "Priya Nair", "email": null, "linkedin": "linkedin.com/in/priyanair", "github": "github.com/priyanair", "total_experience_years": 8, "skills": ["Python", "Go", "Kubernetes", "AWS", "Kafka"], "education": { "degree": "BS Computer Science", "school": "UT Austin" }, "last_title": "Senior Software Engineer", "last_employer": "Amazon" }
#02 · Healthcare staffing agency parsing nursing CVs
Input
CV for Marcus Bell, RN. He has 4 years in ICU nursing at Mount Sinai Hospital in New York, 2 years in a step-down unit at a regional hospital in New Jersey, holds a BSN from Rutgers, and lists certifications including CCRN and BLS. His phone number and email are in the header.
Expected output
{ "name": "Marcus Bell", "credentials": "RN", "phone": "provided", "email": "provided", "certifications": ["CCRN", "BLS"], "specialties": ["ICU", "Step-Down"], "experience": [ { "employer": "Mount Sinai Hospital", "role": "ICU Nurse", "years": 4 }, { "employer": "Regional Hospital NJ", "role": "Step-Down Nurse", "years": 2 } ], "education": { "degree": "BSN", "school": "Rutgers" } }
#03 · Executive search firm parsing C-suite PDF bios
Input
Resume for Sandra Okoye. She was CFO at a publicly traded retail company for 5 years, VP of Finance at a logistics firm for 4 years, and started as a CPA at Deloitte for 3 years. She holds an MBA from Wharton and a BA in Accounting from Howard University. Board memberships at two nonprofits are listed at the bottom.
Expected output
{ "name": "Sandra Okoye", "current_title": "CFO", "education": [ { "degree": "MBA", "school": "Wharton" }, { "degree": "BA Accounting", "school": "Howard University" } ], "certifications": ["CPA"], "career": [ { "title": "CFO", "type": "public company", "tenure_years": 5 }, { "title": "VP Finance", "industry": "logistics", "tenure_years": 4 }, { "title": "Associate", "employer": "Deloitte", "tenure_years": 3 } ], "board_memberships": 2 }
#04 · University career center digitizing alumni paper resumes
Input
Scanned-then-OCR'd PDF for James Trevino, class of 2018. Economics degree from the university. He worked at a regional bank as a credit analyst for 3 years, then at a fintech startup as a product analyst for 2 years. Skills listed: SQL, Excel, Tableau, Python basics. No email found in the document.
Expected output
{ "name": "James Trevino", "graduation_year": 2018, "degree": "Economics", "email": null, "skills": ["SQL", "Excel", "Tableau", "Python"], "experience": [ { "title": "Credit Analyst", "industry": "banking", "years": 3 }, { "title": "Product Analyst", "industry": "fintech", "years": 2 } ], "total_experience_years": 5, "parse_note": "Email not found in document" }
#05 · Freelance recruiter doing rapid candidate comparison for a client
Input
PDF resume for Amara Diallo. She is a UX designer with 7 years of experience at agencies and in-house teams. Tools listed: Figma, Sketch, Adobe XD, Maze, and Hotjar. She has worked on mobile apps, SaaS dashboards, and e-commerce sites. She holds a BFA in Graphic Design and has a portfolio URL in the header.
Expected output
{ "name": "Amara Diallo", "title": "UX Designer", "experience_years": 7, "tools": ["Figma", "Sketch", "Adobe XD", "Maze", "Hotjar"], "domains": ["mobile apps", "SaaS", "e-commerce"], "education": { "degree": "BFA Graphic Design" }, "portfolio_url": "provided", "work_type": "agency and in-house" }

Common mistakes to avoid

  • Feeding image-only PDFs directly

    Many scanned resumes are image PDFs with no embedded text. Sending these directly to an LLM API that expects text produces empty outputs or confident hallucinations. Run these files through an OCR tool first to generate a text layer, then pass the extracted text to the parsing prompt.

  • Asking for too many inferences at once

    Prompts that ask the model to extract data and infer soft skills and predict seniority level and suggest a salary range all at once produce unreliable results across the board. Each additional inference task dilutes the accuracy of the structured extraction. Stick to factual extraction in the parsing prompt.

  • Not validating email and phone formats

    Models sometimes extract garbled emails or phone numbers if the PDF formatting caused strange character spacing during text extraction. Always run a regex validation pass on contact fields after parsing. A recruiter who reaches out to a malformed email address wastes time and looks unprofessional to the candidate.

  • Ignoring multi-column PDF layouts

    Two-column resume designs often produce scrambled text when the PDF is extracted because the text reader pulls columns left to right rather than column by column. The model then tries to parse nonsense. Pre-processing with a PDF library that handles column detection before passing text to the model significantly improves output quality on these layouts.

  • Storing raw model output without a review step

    LLM outputs are probabilistic. A model can confidently return the wrong employer name or misattribute a date. For any hiring decision that depends on this data, build a spot-check review step into your workflow, especially for senior roles or high-volume batches where errors compound quickly.
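The contact-field check described above is a one-function regex pass. These patterns are deliberately loose sanity checks meant to catch extraction garbage, not full RFC-level validation of deliverability:

```python
import re

# Loose sanity checks: the goal is to catch extraction garbage like
# "j ane.doe@ema il.com", not to verify that the address is deliverable.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,20}$")

def check_contact(parsed):
    """Return the list of contact fields that failed validation."""
    problems = []
    email = parsed.get("email")
    if email is not None and not EMAIL_RE.match(email):
        problems.append("email")
    phone = parsed.get("phone")
    if phone is not None and not PHONE_RE.match(phone):
        problems.append("phone")
    return problems
```

Anything this flags goes to the manual review queue rather than straight into the ATS.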

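For routing image-only PDFs to OCR, one cheap heuristic is to extract the text layer with your PDF library of choice and check how much text actually came back before deciding. The 200-character threshold below is an assumption; tune it on your own files:

```python
def needs_ocr(extracted_text, min_chars=200):
    """Heuristic: a near-empty text layer usually means a scanned image PDF
    that should be routed through OCR before the parsing prompt runs."""
    return len("".join(extracted_text.split())) < min_chars
```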
Frequently asked questions

What is the best way to extract data from a resume PDF using Python?

Use a PDF text extraction library like PyMuPDF or pdfplumber to pull the raw text from the file, then send that text to an LLM API with a structured extraction prompt. PyMuPDF handles most standard PDFs well. For scanned image PDFs, add pytesseract for OCR before the LLM step. This two-step approach is more reliable than sending the raw file bytes directly.

Can AI resume parsers handle non-English CVs?

Large models like GPT-4o and Claude 3.5 Sonnet handle Western European languages well and produce usable output for many others. Quality drops for low-resource languages. If you are processing CVs in Arabic, Hindi, or less common languages, test your specific model on a sample batch before committing to it in production. Do not assume multilingual capability without validation.

How accurate is AI resume parsing compared to dedicated ATS parsers?

For clean, text-based PDFs, modern LLMs match or exceed the accuracy of rule-based ATS parsers on standard fields like name, email, and job titles. Where LLMs sometimes fall short is on complex multi-column layouts or heavily designed resumes where the extracted text is out of order. Dedicated resume parsing APIs like Affinda or Sovren add pre-processing layers that handle layout issues, which is worth the cost at high volume.

Is it legal to use AI to parse resumes for hiring?

Legality depends on your jurisdiction and how you use the output. In the United States, using automated tools in hiring decisions is subject to Equal Employment Opportunity requirements, and some states like Illinois and New York City have specific AI hiring laws. Always consult legal counsel before using AI-parsed resume data to screen or rank candidates at scale.

How do I extract resume data from a PDF without copying and pasting?

Use a script to extract the text layer from the PDF programmatically, then pass it to an AI model with a structured prompt. Libraries like pdfplumber in Python handle text extraction in one or two lines of code. You can also use an API that accepts file uploads directly. Either way, the process takes seconds per resume and requires no manual copying.

What fields can AI reliably extract from a resume PDF?

Name, email, phone number, LinkedIn URL, job titles, employer names, employment dates, and education details extract reliably from clean PDFs. Skills lists extract well when they appear in a dedicated section. Fields like total years of experience, industry, or seniority level require the model to calculate or infer, which adds some variability. Soft skills and cultural fit signals are not reliably extractable and should not be treated as ground truth.
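Rather than asking the model to infer total experience, you can compute it deterministically from the normalized YYYY-MM dates it extracts. A sketch of that calculation; note it double-counts overlapping jobs, which you may need to handle separately:

```python
from datetime import date

def months_between(start, end):
    """Months between two YYYY-MM strings; end=None means 'Present'."""
    sy, sm = map(int, start.split("-"))
    if end is None:
        today = date.today()
        ey, em = today.year, today.month
    else:
        ey, em = map(int, end.split("-"))
    return (ey - sy) * 12 + (em - sm)

def total_experience_years(work_experience):
    """Sum tenure across jobs. Overlapping jobs are double-counted;
    merge intervals first if that matters for your screening rules."""
    months = sum(
        months_between(job["start_date"], job.get("end_date"))
        for job in work_experience
        if job.get("start_date")
    )
    return round(months / 12, 1)
```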
