AI Tools to Extract Patient Data from Medical Record PDFs

Tested prompts for extracting data from medical record PDFs, compared across 5 leading AI models.

Best by judge score: Claude Haiku 4.5 (9/10)

Medical record PDFs are notoriously messy: scanned handwriting, inconsistent formatting, multi-page discharge summaries, lab result tables buried inside dense clinical notes. If you are trying to pull structured data out of them, whether for a research dataset, billing audit, care coordination workflow, or EHR migration, doing it manually is slow and error-prone at any scale beyond a handful of files.

AI language models can now read these documents and return structured outputs: patient demographics, diagnosis codes, medication lists, lab values, dates of service, and more. The key is giving the model a precise extraction prompt so it returns consistent JSON or tabular data rather than a narrative summary.

This page shows you exactly how to prompt AI models for medical record PDF extraction, compares how leading models handle the same document, and explains where the approach works reliably versus where you need extra caution. Whether you are a health IT developer, a clinical researcher, or a solo practitioner cleaning up paper charts, the guidance here applies directly to your workflow.

When to use this

This approach works best when you have a defined set of fields to extract and a repeating document structure, even if that structure varies slightly across records. It fits both one-off extractions and batch pipelines where you process dozens or thousands of PDFs and need consistent, machine-readable output every time.

Pulling patient demographics, admit/discharge dates, and primary diagnoses from hospital discharge summaries for a research cohort
Extracting current medication lists and dosages from referral packets before a specialist appointment
Scraping lab values (HbA1c, creatinine, lipid panels) across thousands of records for a population health analysis
Converting paper-based intake forms or insurance EOB PDFs into structured rows for a billing database
Identifying ICD-10 or CPT codes mentioned in clinical notes during a coding audit

When this format breaks down

Scanned documents with very poor image quality or heavy handwriting where OCR accuracy falls below roughly 85 percent. The model will hallucinate values it cannot read clearly, which is dangerous in a clinical context.
Any workflow where the extracted data feeds directly into clinical decision-making without a human review step. AI extraction errors on medication names or dosages can cause patient harm.
Documents containing highly sensitive mental health, substance use disorder, or HIV-related records that carry special legal protections (42 CFR Part 2, state confidentiality laws). Standard AI API pipelines may not meet the required data handling standards.
When you need legally defensible extraction for billing disputes or malpractice cases. AI output alone is not audit-proof; you need traceable, human-verified extraction with source citations.

The prompt we tested

You are a clinical data extraction assistant specialized in parsing medical record PDFs. Extract structured patient data from the medical record content provided below, following the rules exactly.

Rules:
Output a single JSON object with these keys: patient_demographics (name, DOB, sex, MRN, address, phone), encounter_info (date, provider, facility, visit_type), chief_complaint, diagnoses (array of {code, description, type}), medications (array of {name, dosage, frequency, route}), allergies, vital_signs, lab_results (array of {test, value, unit, reference_range, date}), procedures, and clinical_notes_summary. Use null for any missing field, preserve exact values and units from the source, and do not infer or fabricate data. Append a 'confidence_flags' array listing any fields that were ambiguous or illegible.

Medical Record Content:
Patient: Maria Gonzalez, DOB 04/12/1968, MRN 88231-A. Seen by Dr. Patel at Riverside Family Clinic on 10/15/2024 for follow-up on Type 2 Diabetes (E11.9) and hypertension (I10). BP 142/88, HR 76, A1c 8.2% (ref 4.0-5.6). Current meds: Metformin 1000mg PO BID, Lisinopril 20mg PO daily. NKDA. Plan: increase Metformin, add Empagliflozin 10mg daily, recheck labs in 3 months.

Return the extracted data now.

Model comparison

Model	Quality	Cost	Latency	Tokens
Claude Opus 4.7	9	$0.0666	7.3s	505 / 787
Claude Haiku 4.5	9	$0.0040	3.6s	381 / 719
GPT-5.4	9	$0.0096	6.3s	324 / 583
Gemini 2.5 Pro	Not run yet
Grok 4.1 Fast Reasoning	9	$0.0003	6.3s	474 / 391

Outputs

Claude Opus 4.7 9/10

```json
{
  "patient_demographics": {
    "name": "Maria Gonzalez",
    "DOB": "04/12/1968",
    "sex": null,
    "MRN": "88231-A",
    "address": null,
    "phone": null
  },
  "encounter_info": {
    "date": "10/15/2024",
    "provider": "Dr. Patel",
    "facility": "Riverside Family Clinic",
    "visit_type": "follow-up"
  },
  "chief_complaint": "Follow-up on Type 2 Diabetes and hypertension",
  "diagnoses": [
    {
      "code": "E11.9",
      "description": "Type 2 Diabetes",
      "type": null
    },
    {
      "code": "I10",
      "description": "Hypertension",
      "type": null
    }
  ],
  "medications": [
    {
      "name": "Metformin",
      "dosage": "1000mg",
      "frequency": "BID",
      "route": "PO"
    },
    {
      "name": "Lisinopril",
      "dosage": "20mg",
      "frequency": "daily",
      "route": "PO"
    },
    {
      "name": "Empagliflozin",
      "dosage": "10mg",
      "frequency": "daily",
      "route": null
    }
  ],
  "allergies": "NKDA",
  "vital_signs": {
    "BP": "142/88",
    "HR": "76",
    "temperature": null,
    "respiratory_rate": null,
    "SpO2": null
  },
  "lab_results": [
    {
      "test": "A1c",
      "value": "8.2",
      "unit": "%",
      "reference_range": "4.0-5.6",
      "date": "10/15/2024"
    }
  ],
  "procedures": null,
  "clinical_notes_summary": "Patient seen for follow-up of Type 2 Diabetes and hypertension. BP elevated at 142/88 and A1c elevated at 8.2%. Plan: increase Metformin, add Empagliflozin 10mg daily, recheck labs in 3 months.",
  "confidence_flags": [
    "diagnoses.type (not specified in source)",
    "Empagliflozin route (not specified in source)",
    "lab_results.date (assumed from encounter date)"
  ]
}
```

Claude Haiku 4.5 9/10

```json
{
  "patient_demographics": {
    "name": "Maria Gonzalez",
    "DOB": "04/12/1968",
    "sex": null,
    "MRN": "88231-A",
    "address": null,
    "phone": null
  },
  "encounter_info": {
    "date": "10/15/2024",
    "provider": "Dr. Patel",
    "facility": "Riverside Family Clinic",
    "visit_type": "follow-up"
  },
  "chief_complaint": null,
  "diagnoses": [
    {
      "code": "E11.9",
      "description": "Type 2 Diabetes",
      "type": "chronic"
    },
    {
      "code": "I10",
      "description": "hypertension",
      "type": "chronic"
    }
  ],
  "medications": [
    {
      "name": "Metformin",
      "dosage": "1000mg",
      "frequency": "BID",
      "route": "PO"
    },
    {
      "name": "Lisinopril",
      "dosage": "20mg",
      "frequency": "daily",
      "route": "PO"
    },
    {
      "name": "Empagliflozin",
      "dosage": "10mg",
      "frequency": "daily",
      "route": null
    }
  ],
  "allergies": "NKDA",
  "vital_signs": {
    "blood_pressure": "142/88",
    "heart_rate": "76",
    "heart_rate_unit": "bpm"
  },
  "lab_results": [
    {
      "test": "A1c",
      "value": "8.2",
      "unit": "%",
      "reference_range": "4.0-5.6",
      "date": "10/15/2024"
    }
  ],
  "procedures": null,
  "clinical_notes_summary": "Patient seen for follow-up on Type 2 Diabetes and hypertension. Plan includes increasing Metformin dosage and adding Empagliflozin 10mg daily. Labs to be rechecked in 3 months.",
  "confidence_flags": [
    "patient_demographics.sex - not provided in source",
    "patient_demographics.address - not provided in source",
    "patient_demographics.phone - not provided in source",
    "chief_complaint - not explicitly stated in source",
    "medications.Empagliflozin.route - not specified in source",
    "vital_signs.temperature - not recorded in source",
    "vital_signs.respiratory_rate - not recorded in source",
    "vital_signs.oxygen_saturation - not recorded in source",
    "procedures - not mentioned in source"
  ]
}
```

GPT-5.4 9/10

```json
{
  "patient_demographics": {
    "name": "Maria Gonzalez",
    "DOB": "04/12/1968",
    "sex": null,
    "MRN": "88231-A",
    "address": null,
    "phone": null
  },
  "encounter_info": {
    "date": "10/15/2024",
    "provider": "Dr. Patel",
    "facility": "Riverside Family Clinic",
    "visit_type": "follow-up"
  },
  "chief_complaint": "follow-up on Type 2 Diabetes and hypertension",
  "diagnoses": [
    {
      "code": "E11.9",
      "description": "Type 2 Diabetes",
      "type": null
    },
    {
      "code": "I10",
      "description": "hypertension",
      "type": null
    }
  ],
  "medications": [
    {
      "name": "Metformin",
      "dosage": "1000mg",
      "frequency": "BID",
      "route": "PO"
    },
    {
      "name": "Lisinopril",
      "dosage": "20mg",
      "frequency": "daily",
      "route": "PO"
    },
    {
      "name": "Empagliflozin",
      "dosage": "10mg",
      "frequency": "daily",
      "route": null
    }
  ],
  "allergies": "NKDA",
  "vital_signs": {
    "BP": "142/88",
    "HR": "76"
  },
  "lab_results": [
    {
      "test": "A1c",
      "value": "8.2",
      "unit": "%",
      "reference_range": "4.0-5.6",
      "date": "10/15/2024"
    }
  ],
  "procedures": null,
  "clinical_notes_summary": "Patient seen for follow-up on Type 2 Diabetes and hypertension. Blood pressure was 142/88 and heart rate was 76. A1c was 8.2% (ref 4.0-5.6). Current medications included Metformin 1000mg PO BID and Lisinopril 20mg PO daily. NKDA. Plan was to increase Metformin, add Empagliflozin 10mg daily, and recheck labs in 3 months.",
  "confidence_flags": [
    "Sex not provided.",
    "Address not provided.",
    "Phone not provided.",
    "Diagnosis type not specified.",
    "Route for Empagliflozin not specified.",
    "Metformin increase mentioned in plan, but updated dosage not specified."
  ]
}
```

Grok 4.1 Fast Reasoning 9/10

```json
{
  "patient_demographics": {
    "name": "Maria Gonzalez",
    "DOB": "04/12/1968",
    "sex": null,
    "MRN": "88231-A",
    "address": null,
    "phone": null
  },
  "encounter_info": {
    "date": "10/15/2024",
    "provider": "Dr. Patel",
    "facility": "Riverside Family Clinic",
    "visit_type": "follow-up"
  },
  "chief_complaint": null,
  "diagnoses": [
    {
      "code": "E11.9",
      "description": "Type 2 Diabetes",
      "type": null
    },
    {
      "code": "I10",
      "description": "hypertension",
      "type": null
    }
  ],
  "medications": [
    {
      "name": "Metformin",
      "dosage": "1000mg",
      "frequency": "BID",
      "route": "PO"
    },
    {
      "name": "Lisinopril",
      "dosage": "20mg",
      "frequency": "daily",
      "route": "PO"
    }
  ],
  "allergies": "NKDA",
  "vital_signs": "BP 142/88, HR 76",
  "lab_results": [
    {
      "test": "A1c",
      "value": "8.2",
      "unit": "%",
      "reference_range": "4.0-5.6",
      "date": null
    }
  ],
  "procedures": null,
  "clinical_notes_summary": "Plan: increase Metformin, add Empagliflozin 10mg daily, recheck labs in 3 months.",
  "confidence_flags": []
}
```

What makes these work

#01 · Define every field upfront in the prompt
Tell the model exactly which fields you want and what to return when a field is missing. For example: 'If discharge date is not present, return null.' Without this instruction, models will invent plausible-sounding dates or skip the field inconsistently across a batch, breaking any downstream database insert.

#02 · Request JSON output with a schema example
Paste a skeleton JSON object into your prompt as the expected output format. Models match the structure you show them far more reliably than when you describe it in prose. This also eliminates narrative filler sentences that appear when the model defaults to a summary response instead of structured data.

#03 · Chunk long documents before sending
Multi-page records that exceed a model's effective context window produce worse extraction quality near the end of the document. Split records by section (e.g., demographics, problem list, medications, labs) and run separate extraction calls per section, then merge results. This also makes error tracing easier when a field comes back wrong.

#04 · Always include a confidence or source-quote field
Ask the model to return the verbatim text snippet it used to derive each extracted value. This lets a human reviewer spot-check high-risk fields like medication dosages or diagnosis codes in seconds rather than re-reading the entire document, and it gives you an audit trail if a value is questioned later.
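
Source quotes can also be checked by a script before a human looks at them. The sketch below is one possible approach, not part of the tested prompt: it assumes you have asked the model to return each field as an object with hypothetical `value` and `source_quote` keys, and it flags any field whose quote does not appear in the source text or does not contain the extracted value.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_fields(extracted: dict, source_text: str) -> list[str]:
    """Return names of fields whose quote is absent from the source or lacks the value."""
    haystack = normalize(source_text)
    problems = []
    for field, item in extracted.items():
        if item.get("value") is None:
            continue  # null means "not in the document", so there is nothing to verify
        quote = normalize(item.get("source_quote") or "")
        value = normalize(str(item["value"]))
        if not quote or quote not in haystack or value not in quote:
            problems.append(field)
    return problems


# Sample text taken from the test record on this page; the field names are hypothetical.
source = "Current meds: Metformin 1000mg PO BID, Lisinopril 20mg PO daily. NKDA."
extracted = {
    "metformin_dosage": {"value": "1000mg", "source_quote": "Metformin 1000mg PO BID"},
    "lisinopril_dosage": {"value": "40mg", "source_quote": "Lisinopril 40mg PO daily"},
}
print(unverified_fields(extracted, source))  # ['lisinopril_dosage']
```

An exact substring match is deliberately strict. It will also flag quotes the model paraphrased, which is usually what you want for dosages and codes.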

More example scenarios

#01 · Extracting structured data from a hospital discharge summary

Input

Patient: Maria Lopez, DOB 03/14/1968. Admitted 11/02/2024, discharged 11/06/2024. Principal diagnosis: Community-acquired pneumonia (ICD-10: J18.9). Secondary: Type 2 diabetes mellitus (E11.9). Medications on discharge: Amoxicillin-clavulanate 875mg twice daily x7 days, Metformin 1000mg twice daily, Lisinopril 10mg daily. Follow-up with Dr. Patel in 10 days.

Expected output

{"patient_name": "Maria Lopez", "dob": "1968-03-14", "admit_date": "2024-11-02", "discharge_date": "2024-11-06", "primary_diagnosis": {"description": "Community-acquired pneumonia", "icd10": "J18.9"}, "secondary_diagnoses": [{"description": "Type 2 diabetes mellitus", "icd10": "E11.9"}], "discharge_medications": [{"name": "Amoxicillin-clavulanate", "dose": "875mg", "frequency": "BID", "duration": "7 days"}, {"name": "Metformin", "dose": "1000mg", "frequency": "BID"}, {"name": "Lisinopril", "dose": "10mg", "frequency": "daily"}], "follow_up": "Dr. Patel, 10 days"}

#02 · Pulling lab values from a cardiology panel report

Input

Lipid Panel dated 09/15/2024. Total Cholesterol: 214 mg/dL. LDL: 138 mg/dL (HIGH). HDL: 42 mg/dL (LOW). Triglycerides: 170 mg/dL. Non-HDL Cholesterol: 172 mg/dL. Ordering physician: Dr. Karen Yee, MD. Patient fasting: Yes.

Expected output

{"test_name": "Lipid Panel", "date": "2024-09-15", "ordering_physician": "Dr. Karen Yee, MD", "fasting": true, "results": [{"marker": "Total Cholesterol", "value": 214, "unit": "mg/dL", "flag": null}, {"marker": "LDL", "value": 138, "unit": "mg/dL", "flag": "HIGH"}, {"marker": "HDL", "value": 42, "unit": "mg/dL", "flag": "LOW"}, {"marker": "Triglycerides", "value": 170, "unit": "mg/dL", "flag": null}, {"marker": "Non-HDL Cholesterol", "value": 172, "unit": "mg/dL", "flag": null}]}

#03 · Extracting medication reconciliation data from a primary care visit note

Input

Mr. James Thornton, 72 y/o male, presents for annual wellness visit. Current medications reviewed: Atorvastatin 40mg nightly, Amlodipine 5mg daily, Aspirin 81mg daily, Omeprazole 20mg before breakfast. Patient reports stopping Metformin 500mg 3 months ago due to GI side effects. Allergies: Penicillin (hives), Sulfa (rash).

Expected output

{"patient": "James Thornton", "age": 72, "active_medications": [{"name": "Atorvastatin", "dose": "40mg", "frequency": "nightly"}, {"name": "Amlodipine", "dose": "5mg", "frequency": "daily"}, {"name": "Aspirin", "dose": "81mg", "frequency": "daily"}, {"name": "Omeprazole", "dose": "20mg", "frequency": "before breakfast"}], "discontinued_medications": [{"name": "Metformin", "dose": "500mg", "reason": "GI side effects", "stopped": "approx 3 months ago"}], "allergies": [{"allergen": "Penicillin", "reaction": "hives"}, {"allergen": "Sulfa", "reaction": "rash"}]}

#04 · Extracting coding data from an operative report for a billing audit

Input

Procedure performed: Laparoscopic cholecystectomy with intraoperative cholangiogram. Surgeon: Dr. Elena Rossi. Facility: Mercy General Hospital. Date of service: 10/22/2024. Diagnosis: Acute calculous cholecystitis (K80.00). CPT codes billed: 47563, 74300. Anesthesia time: 95 minutes. Complications: None.

Expected output

{"procedure": "Laparoscopic cholecystectomy with intraoperative cholangiogram", "surgeon": "Dr. Elena Rossi", "facility": "Mercy General Hospital", "dos": "2024-10-22", "diagnosis": {"description": "Acute calculous cholecystitis", "icd10": "K80.00"}, "cpt_codes": ["47563", "74300"], "anesthesia_minutes": 95, "complications": "None"}

#05 · Summarizing problem list from a multi-page geriatric assessment PDF

Input

Active problem list for Eleanor Chang, 84F. 1. Moderate Alzheimer dementia (F02.80) 2. Atrial fibrillation on anticoagulation (I48.91) 3. Osteoporosis with history of right hip fracture (M80.011A) 4. Chronic kidney disease Stage 3a (N18.31) 5. Major depressive disorder, recurrent, moderate (F33.1). Last updated by Dr. Singh 08/30/2024.

Expected output

{"patient": "Eleanor Chang", "age": 84, "sex": "F", "problem_list_date": "2024-08-30", "updating_provider": "Dr. Singh", "problems": [{"description": "Moderate Alzheimer dementia", "icd10": "F02.80"}, {"description": "Atrial fibrillation on anticoagulation", "icd10": "I48.91"}, {"description": "Osteoporosis with history of right hip fracture", "icd10": "M80.011A"}, {"description": "Chronic kidney disease Stage 3a", "icd10": "N18.31"}, {"description": "Major depressive disorder, recurrent, moderate", "icd10": "F33.1"}]}

Common mistakes to avoid

Skipping OCR quality checks
Sending a poorly scanned PDF directly to an LLM without first verifying OCR output quality is the most common failure mode. Low-confidence OCR text produces garbled medication names and transposed numeric values that the model will extract with false confidence. Run an OCR quality score or human spot-check before batching scanned records.
Assuming 100 percent recall on all fields
Models miss fields that appear in unusual locations, use non-standard abbreviations, or are split across page breaks. If your downstream process requires every field to be populated, you need a fallback flagging step that alerts a human when required fields return null rather than silently propagating gaps into your database.
Ignoring HIPAA and BAA requirements
Sending real patient data to a public AI API endpoint without a signed Business Associate Agreement is a HIPAA violation regardless of how the data is used afterward. Before running any production pipeline with identifiable records, confirm your AI vendor offers a BAA and that data is not used for model training.
Using a single generic prompt for all document types
A prompt tuned for discharge summaries will perform poorly on lab reports, operative notes, and insurance EOBs because the field names, layouts, and terminology differ significantly. Build and test a separate prompt for each document type in your corpus rather than forcing one prompt to handle everything.
Not validating extracted codes against reference sets
Models occasionally hallucinate plausible but invalid ICD-10 or CPT codes, especially for less common diagnoses. Run all extracted codes through a validation lookup against the official CMS code sets before loading them into billing or research systems. A downstream validation step costs almost nothing compared to a rejected claim or a corrupted research dataset.
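
Two of the checks above, flagging required fields that come back null and validating extracted codes, can be sketched in a few lines. This is an illustration under stated assumptions: `REQUIRED_FIELDS` and `VALID_ICD10` are hypothetical placeholders (in practice you would load the current official code set from a file), and the regular expression only checks the general shape of an ICD-10-CM code, not whether the code exists.

```python
import re

# General shape of an ICD-10-CM code: letter, digit, alphanumeric, optional ".xxxx".
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")

# Hypothetical stand-in for the official code set, which you would load from a file.
VALID_ICD10 = {"E11.9", "I10", "J18.9"}

# Hypothetical list of fields your downstream process cannot do without.
REQUIRED_FIELDS = [
    ("patient_demographics", "name"),
    ("patient_demographics", "MRN"),
    ("encounter_info", "date"),
]


def missing_required(record: dict) -> list[str]:
    """Return dotted names of required fields that are null, empty, or absent."""
    missing = []
    for section, key in REQUIRED_FIELDS:
        value = (record.get(section) or {}).get(key)
        if value in (None, ""):
            missing.append(f"{section}.{key}")
    return missing


def invalid_codes(record: dict) -> list[str]:
    """Return diagnosis codes that are malformed or not in the reference set."""
    bad = []
    for dx in record.get("diagnoses") or []:
        code = (dx.get("code") or "").strip().upper()
        if not ICD10_SHAPE.match(code) or code not in VALID_ICD10:
            bad.append(code)
    return bad


record = {
    "patient_demographics": {"name": "Maria Gonzalez", "MRN": None},
    "encounter_info": {"date": "10/15/2024"},
    "diagnoses": [{"code": "E11.9"}, {"code": "E11.99X"}],
}
print(missing_required(record))  # ['patient_demographics.MRN']
print(invalid_codes(record))     # ['E11.99X']
```

Records that return anything from either function can be routed to a human review queue instead of being loaded.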

Frequently asked questions

Is it HIPAA compliant to use AI to extract data from medical record PDFs?

It can be, but only if your AI vendor has signed a Business Associate Agreement (BAA) with your organization and the platform does not use your data for model training. Most major enterprise AI platforms (Azure OpenAI, Amazon Bedrock, Google Cloud Vertex AI) offer BAA-covered tiers. Consumer-tier endpoints generally do not qualify. Always confirm before processing real patient data.

How accurate is AI extraction from medical record PDFs compared to manual abstraction?

On clearly typed, well-structured documents, modern LLMs achieve field-level accuracy in the 90 to 97 percent range for common fields like demographics, dates, and medication names. Accuracy drops significantly on handwritten notes, low-quality scans, and highly specialized terminology. Human review remains essential for any high-stakes fields such as drug dosages or allergy lists.
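
To measure field-level accuracy on your own records, you can compare model output against a small manually abstracted sample. The sketch below assumes exact-match scoring and flat records; the function name and the sample values are made up for illustration.

```python
def field_accuracy(extracted: list[dict], gold: list[dict], fields: list[str]) -> dict[str, float]:
    """Share of records where the extracted value exactly equals the manually abstracted one."""
    if len(extracted) != len(gold):
        raise ValueError("extracted and gold must contain the same records in the same order")
    scores = {}
    for field in fields:
        hits = sum(1 for e, g in zip(extracted, gold) if e.get(field) == g.get(field))
        scores[field] = hits / len(gold)
    return scores


# Made-up sample: the second MRN has a transposed digit in the model output.
gold = [{"dob": "04/12/1968", "mrn": "88231-A"}, {"dob": "03/14/1968", "mrn": "12345"}]
extracted = [{"dob": "04/12/1968", "mrn": "88231-A"}, {"dob": "03/14/1968", "mrn": "12845"}]
print(field_accuracy(extracted, gold, ["dob", "mrn"]))  # {'dob': 1.0, 'mrn': 0.5}
```

Reporting accuracy per field rather than as one overall number shows which fields need human review most.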

What is the best AI model for extracting data from medical record PDFs?

GPT-4o and Claude 3.5 Sonnet currently perform best on structured medical extraction tasks, particularly when documents include tables, mixed formatting, or clinical abbreviations. Gemini 1.5 Pro has an advantage on very long documents due to its larger context window. The best model depends on your document types; testing on a sample of your actual records is the only reliable way to compare.

Can AI extract data from handwritten medical records or scanned paper charts?

Partially. AI vision models can read legible handwriting with reasonable accuracy, but performance degrades quickly with hurried or stylized script. The practical workflow for handwritten records is to run a dedicated OCR tool first (such as Google Document AI or Amazon Textract), review the OCR output, and then pass the cleaned text to an LLM for structured extraction.

How do I handle medical records that span 20 or more pages?

Split the document into logical sections before extraction. Most records have predictable sections: face sheet, problem list, medications, operative or procedure notes, lab results, and discharge instructions. Extract each section in a separate API call with a section-specific prompt, then merge the JSON outputs. This improves accuracy and makes it easier to identify which section produced a bad value.
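
The split-extract-merge flow can be sketched as follows. This is a minimal illustration, assuming the record text has already been extracted from the PDF and that section headings sit on their own lines. `SECTION_HEADINGS` is a hypothetical list, and `extract_section` stands in for whatever function calls your model with a section-specific prompt.

```python
import re
from typing import Callable

# Hypothetical headings; adjust to the headings your records actually use.
SECTION_HEADINGS = ["FACE SHEET", "PROBLEM LIST", "MEDICATIONS", "LAB RESULTS", "DISCHARGE INSTRUCTIONS"]


def split_sections(text: str) -> dict[str, str]:
    """Split record text at lines that consist only of a known heading."""
    alternatives = "|".join(re.escape(h) for h in SECTION_HEADINGS)
    pattern = re.compile(rf"^({alternatives})[ \t]*:?[ \t]*$", re.MULTILINE)
    matches = list(pattern.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections  # note: text before the first heading is not included


def extract_record(text: str, extract_section: Callable[[str, str], dict]) -> dict:
    """Run one extraction call per section and merge the results under section keys."""
    merged = {}
    for name, body in split_sections(text).items():
        merged[name.lower().replace(" ", "_")] = extract_section(name, body)
    return merged


def fake_extract(name: str, body: str) -> dict:
    """Placeholder for a real model call; it only counts lines."""
    return {"line_count": len(body.splitlines())}


record_text = "MEDICATIONS\nMetformin 1000mg PO BID\nLisinopril 20mg PO daily\nLAB RESULTS\nA1c 8.2%\n"
print(extract_record(record_text, fake_extract))
# {'medications': {'line_count': 2}, 'lab_results': {'line_count': 1}}
```

Keeping each section's output under its own key is what makes a bad value traceable to the call that produced it.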

What structured output format should I request when extracting medical record data?

JSON is the most practical format for downstream use because it integrates directly with databases, ETL pipelines, and EHR APIs. Ask the model to return a flat or nested JSON object with field names that match your destination schema. If you are loading into a spreadsheet for manual review, you can also request CSV, but JSON handles nested structures like medication lists far more cleanly.
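
One way to keep field names aligned with your destination schema is to generate the prompt from a skeleton object and check the reply against the same skeleton. The sketch below is illustrative only: `SKELETON` is a hypothetical schema, and sending the prompt to a model is left out because it depends on your vendor's API.

```python
import json

# Hypothetical destination schema; replace with the fields your database expects.
SKELETON = {
    "patient_name": None,
    "dob": None,
    "medications": [{"name": None, "dose": None, "frequency": None}],
}

FENCE = "`" * 3  # Markdown code fence that some models wrap around JSON replies


def build_prompt(record_text: str) -> str:
    """Embed the skeleton in the prompt so the model sees the exact structure to return."""
    skeleton = json.dumps(SKELETON, indent=2)
    return (
        "Extract data from the medical record below.\n"
        "Return only a JSON object with exactly this structure. "
        "Use null for any missing field and do not infer values.\n\n"
        f"{skeleton}\n\nMedical record:\n{record_text}"
    )


def parse_reply(reply: str) -> dict:
    """Parse the model reply, tolerating a code fence, and check the top-level keys."""
    text = reply.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    missing = set(SKELETON) - set(data)
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data


reply = FENCE + 'json\n{"patient_name": "Maria Lopez", "dob": "1968-03-14", "medications": []}\n' + FENCE
print(parse_reply(reply)["patient_name"])  # Maria Lopez
```

Raising an error on missing keys, rather than filling them in, keeps a malformed reply from silently reaching the database.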
