# AI-Assisted Scoping Review Workflow: Wearables for Adolescent Mental Health

## 1. **Identify the Research Question (Stage 1)**
- **Action:** Refine your question using the PCC framework (Population, Concept, Context): adolescents (12–18), wearable devices, mental health monitoring, 2014–2024.
- **AI Tools:** ChatGPT-4 for PCC refinement; Elicit for scoping feasibility.
- **Sample Prompt:**
```
Act as a scoping review methodologist. Refine this research question using JBI's PCC framework and suggest 3 sub-questions mapping interventions, outcomes, and gaps: "How are wearable devices used to monitor mental health in adolescents aged 12–18?"
```
- **Validation:** Cross-check PCC output against JBI Manual Chapter 11; have your supervisor review before locking the protocol. Register on OSF.

## 2. **Identify Relevant Studies (Stage 2)**
- **Action:** Build a reproducible search strategy for PubMed, Scopus, and PsycINFO with MeSH/Emtree terms, then supplement with AI-driven discovery.
- **AI Tools:** ChatGPT for Boolean string drafting; **Research Rabbit** and **Connected Papers** for citation chasing; **Elicit** for semantic search.
- **Sample Prompt:**
```
Generate a PubMed search string combining MeSH and free-text terms for: (1) wearable devices/biosensors/smartwatches, (2) mental health/depression/anxiety/stress, (3) adolescents 12–18. Include 2014–2024 filter and output in PubMed syntax.
```
- **Validation:** Ask a health sciences librarian to peer-review using the **PRESS checklist**. Never rely on AI-generated citations—export raw results directly from databases. Run the string through all three databases and save the exact date/time.

## 3. **Study Selection (Stage 3)**
- **Action:** Deduplicate, then screen titles/abstracts and full texts against pre-specified inclusion/exclusion criteria.
- **AI Tools:** **Covidence** or **Rayyan** (AI-assisted prioritization) for screening; ChatGPT for criteria clarification; **SciSpace** for full-text Q&A.
- **Sample Prompt (for screening assistance only):**
```
Given these inclusion criteria [paste], classify this abstract as INCLUDE, EXCLUDE, or UNCERTAIN and justify in one sentence: [paste abstract].
```
- **Validation:** Dual human screening on ≥20% of records; AI suggestions are advisory, not decisive. Calculate Cohen's kappa (≥0.6). Document all exclusions with reasons for the PRISMA-ScR flow diagram.

## 4. **Chart the Data (Stage 4)**
- **Action:** Develop and pilot a data-charting form (authors, year, country, sample size, age, device type, mental health outcome, measurement approach, findings, gaps).
- **AI Tools:** **Elicit** (auto-extraction into tables); **SciSpace Chat with PDF**; ChatGPT for structured extraction into CSV.
- **Sample Prompt:**
```
Extract the following fields from this study into a JSON object. If a field is not reported, write "NR". Do not infer. Fields: [study_design, n, mean_age, wearable_device_brand, sensor_modality, mental_health_construct, validated_scale_used, key_finding, reported_limitation].
```
- **Validation:** Pilot the form on 5 studies with two reviewers. Verify every AI-extracted field against the source PDF—AI hallucinates numerical data and device names frequently. Log all corrections.

## 5. **Collate, Summarize, and Report (Stage 5)**
- **Action:** Synthesize narratively and visually; map intervention types, outcome domains, and gaps.
- **AI Tools:** ChatGPT for thematic clustering and narrative drafts; Python/R (via ChatGPT Advanced Data Analysis) for frequency tables and evidence-gap maps; **Litmaps** for temporal trend visualization.
- **Sample Prompt:**
```
Using this charted dataset [attach CSV], cluster studies into themes by wearable modality and mental health construct. Identify under-researched intersections and produce an evidence-gap matrix (rows=devices, columns=outcomes).
```
- **Validation:** Verify all AI-generated summaries against primary data. Have a co-reviewer audit themes. Never let AI write claims not traceable to your extraction table.

---

## PRISMA-ScR Reproducibility Checklist
- [ ] Protocol registered on OSF/Figshare before screening
- [ ] Full search strings archived per database with run dates
- [ ] PRISMA-ScR flow diagram with n at each stage
- [ ] Inter-rater reliability (kappa) reported
- [ ] AI tool names, versions, and prompt logs saved as supplementary files
- [ ] All AI-extracted data verified against source PDFs (document % checked)
- [ ] Charting form and raw extraction CSV shared openly
- [ ] Limitations section explicitly addresses AI-assisted steps and hallucination risk
- [ ] Report conforms to the 20-item **PRISMA-ScR** checklist (Tricco et al., 2018)
Running a Scoping Review Using AI-Assisted Workflows
Tested prompts for conducting a scoping review with AI, compared across 5 leading AI models.
A scoping review maps the existing evidence on a topic without the strict eligibility criteria of a systematic review. Researchers, clinicians, and policy analysts use scoping reviews to identify gaps, clarify concepts, and summarize a body of literature before committing to a full review. The problem is that they are notoriously time-consuming: screening hundreds of abstracts, extracting data from dozens of papers, and synthesizing findings across sources can take months of manual work.
AI-assisted workflows compress that timeline significantly. Large language models can draft inclusion/exclusion rationale, extract structured data from article abstracts, generate synthesis paragraphs, and flag thematic gaps in ways that used to require a full research team. The catch is that AI does not replace methodological rigor. It accelerates the repeatable cognitive tasks while the researcher maintains judgment over decisions that affect validity.
This page walks you through exactly how to run a scoping review using AI tools, which prompts produce useful outputs, where different models perform well or fall short, and the mistakes that will undermine your results if you are not careful. If you have a literature search already pulled from PubMed or Scopus and need to move from raw results to a structured synthesis, this is where to start.
When to use this
AI-assisted scoping review workflows work best when you have a defined research question, a manageable corpus of sources (roughly 50 to 500 abstracts), and a need to produce a structured synthesis rather than a definitive clinical recommendation. They are especially strong when time is constrained and the output does not require the evidentiary standards of a Cochrane-level systematic review.
- You pulled 200-400 abstracts from PubMed and need a first-pass screening with justification for each inclusion or exclusion decision
- You are drafting a grant proposal and need a rapid evidence map of what has been studied in a new therapeutic area
- A policy team needs a summary of regulatory frameworks across jurisdictions before a deadline that rules out a six-month manual review
- A PhD student needs to scope a dissertation topic and determine whether enough primary research exists to justify a specific angle
- A consulting team is onboarding to an unfamiliar industry and needs a structured overview of the published landscape before client meetings
When this format breaks down
- Clinical guideline development requiring GRADE-level evidence certainty: AI extraction errors on outcomes data or study design can propagate into recommendations that affect patient safety
- Full systematic reviews with pre-registered PRISMA protocols: AI-assisted screening without dual independent human verification does not meet the methodological standards required for publication in most systematic review journals
- Corpus sizes under 15 sources: the overhead of structuring AI prompts and verifying outputs exceeds the time saved versus reading the papers yourself
- Highly specialized technical domains where the model lacks sufficient training exposure, such as very recent regulatory filings or niche engineering standards, producing confident-sounding but factually incorrect extractions
The prompt we tested
You are an expert research methodologist specializing in scoping reviews and AI-assisted literature synthesis. Based on the researcher's topic and context below, produce a step-by-step guide for conducting a scoping review using AI tools, following the Arksey & O'Malley and JBI frameworks (identify research question, identify relevant studies, select studies, chart data, collate/summarize/report results). For each stage, specify: (1) the concrete action to take, (2) which AI tools or techniques to use (e.g., Elicit, Research Rabbit, Consensus, SciSpace, Claude/GPT for deduplication and charting, Covidence), (3) example prompts to run, and (4) validation/quality-control steps to avoid hallucination or bias. End with a reproducibility checklist and a note on PRISMA-ScR reporting.

Instructions: Format as a numbered 5-stage workflow with bold stage headings and bullet points under each. Keep the total response under 700 words, use concrete tool names and sample prompts in code blocks, and conclude with a short PRISMA-ScR reproducibility checklist.

Researcher's topic and context: I'm a public health PhD student planning a scoping review on the use of wearable devices for monitoring mental health in adolescents aged 12-18. I have access to PubMed, Scopus, and PsycINFO, plus a ChatGPT Plus and Elicit subscription. I want to map the range of interventions, outcomes measured, and research gaps over the past 10 years.
Model comparison
| Model | Quality | Cost | Latency | Tokens (in / out) |
|---|---|---|---|---|
| Claude Opus 4.7 | 8 | $0.1484 | 31.2s | 507 / 1877 |
| Claude Haiku 4.5 | 8 | $0.0091 | 18.5s | 364 / 1741 |
| GPT-5.4 | 8 | $0.0223 | 26.8s | 320 / 1436 |
| Gemini 2.5 Pro | 1 | $0.0009 | 21.7s | 325 / 89 |
| Grok 4.1 Fast Reasoning | 7 | $0.0005 | 12.5s | 460 / 787 |
Outputs
# AI-Assisted Scoping Review Workflow: Wearables & Adolescent Mental Health

## **Stage 1: Define & Refine Research Question**

**Action:** Develop a clear, structured research question using the PICOT framework (Population, Intervention, Comparison, Outcome, Timeframe).

**AI Tools:** ChatGPT Plus, Elicit

**Example Prompt:**
```
I'm conducting a scoping review on wearables for mental health monitoring in adolescents (12–18 years), 2014–2024. Help me refine this PICOT question:
Population: adolescents aged 12–18
Intervention: wearable devices (smartwatches, fitness trackers, biosensors)
Outcome: mental health outcomes (depression, anxiety, stress, sleep, physical activity, social connectivity)
Timeframe: 10 years
Generate 3 alternative framings and identify key inclusion/exclusion criteria.
```

**Quality Control:**
- Validate question specificity with your supervisor
- Cross-check PICOT components against existing scoping reviews (use PubMed/Scopus)
- Ensure no overly restrictive terms that eliminate relevant studies

---

## **Stage 2: Identify Relevant Studies**

**Action:** Conduct systematic searches across PubMed, Scopus, and PsycINFO using controlled vocabulary and keywords.

**AI Tools:** Elicit (automated PubMed search), Research Rabbit (citation mapping), ChatGPT for search string refinement

**Example Prompt for Search String:**
```
Build a comprehensive search string for PubMed and Scopus combining:
- Wearable terms: wearable* OR smartwatch* OR fitness track* OR biosensor*
- Mental health: (depress* OR anxi* OR stress* OR mood OR well-being)
- Adolescent: adolescent* OR teen* OR youth OR "12-18 years"
Include MeSH terms and field restrictions. Output as copy-paste ready strings.
```

**Example Elicit Prompt:**
```
Search PubMed for: (wearable device* OR smartwatch*) AND (mental health OR depression OR anxiety) AND (adolescent* OR teen*)
Filter: 2014–2024, English language
Export results with title, abstract, DOI
```

**Quality Control:**
- Run identical searches in all 3 databases; document exact strings used
- Manually verify first 50 results for relevance
- Check Elicit export accuracy against PubMed interface (sample 10 citations)

---

## **Stage 3: Study Selection**

**Action:** Screen titles/abstracts and full texts using dual review; use AI for deduplication and initial filtering.

**AI Tools:** Covidence (or Rayyan, free), ChatGPT for deduplication logic, Elicit for screening

**Deduplication Prompt:**
```
I have 450 citations from 3 databases (PubMed, Scopus, PsycINFO) with possible duplicates. Create a deduplication rule: match on (1) title similarity >95%, (2) author/year, (3) DOI. Output a Python-ready algorithm or mark duplicates in the attached CSV.
```

**Screening Protocol:**
- Create explicit inclusion criteria checklist in Covidence
- Pilot 50 abstracts with a co-reviewer; calculate Cohen's κ (target >0.60)
- Use AI (ChatGPT) to flag borderline abstracts, but final decisions remain human
- Full-text review: use standardized form in Covidence

**Quality Control:**
- Document all exclusion reasons at full-text stage
- Maintain a "uncertain/appeal" list reviewed by supervisor
- Track AI-assisted vs. manual screening decisions

---

## **Stage 4: Data Charting**

**Action:** Extract data on study characteristics, wearable devices, mental health outcomes, and findings into a standardized table.

**AI Tools:** ChatGPT (structured extraction), Covidence or Excel/Google Sheets (storage)

**Example Extraction Prompt:**
```
From this abstract/full text, extract:
1. Author, year, country
2. Study design, sample size, age range
3. Wearable type (specify device model/brand)
4. Mental health outcomes measured (e.g., PHQ-9 for depression)
5. Key findings (effect size if available)
6. Limitations noted
Return as JSON with null fields if data absent.
```

**Process:**
- Pilot extraction on 10 papers; assess consistency
- Use ChatGPT with extracted PDF text (copy-paste relevant sections)
- Cross-check AI extractions with original text (10% sample audit)
- Maintain version-controlled master spreadsheet

**Quality Control:**
- Flag missing/unclear data; mark for manual review
- Verify numeric values (e.g., sample sizes, p-values) against source
- Generate summary statistics by AI: "Summarize the n=45 studies by geography, device type, and outcome measure"

---

## **Stage 5: Collate, Summarize & Report Results**

**Action:** Synthesize findings narratively, map intervention/outcome clusters, identify gaps.

**AI Tools:** ChatGPT for thematic synthesis, Research Rabbit for citation networks, Consensus for systematic summary

**Synthesis Prompt:**
```
I have a charted dataset of 45 studies on wearables for adolescent mental health. Generate:
1. Frequency table: devices used (n by type)
2. Outcome categories: cluster by mental health domain (depression, anxiety, etc.)
3. Intervention characteristics: 3–4 key themes
4. Research gaps: identify underexplored populations, designs, outcomes
Return as structured report with counts and quotes from included studies.
```

**Reporting:**
- Create summary tables and flow diagram (PRISMA-ScR)
- Develop narrative synthesis organized by research question themes
- Generate visual (e.g., bubble chart: device type vs. outcome vs. sample size)

**Quality Control:**
- Validate AI-generated thematic codes against original papers (manual spot-check, n=10)
- Ensure all conclusions supported by data counts
- Have supervisor review draft synthesis before final write-up

---

## **PRISMA-ScR Reproducibility Checklist**
- [ ] Research question and PICOT registered/documented pre-review
- [ ] Search strategies (all strings, filters, dates) documented and reproducible
- [ ] Inclusion/exclusion criteria operationalized and piloted (κ ≥ 0.60)
- [ ] Study selection: PRISMA flow diagram with counts at each stage
- [ ] Data extraction form piloted and version-controlled
- [ ] AI tool usage disclosed: specify prompts, models (GPT-4 vs. 3.5), audit trail
- [ ] All included studies listed with full citations
- [ ] Charted data table provided (supplementary)
- [ ] Thematic synthesis tied to source data (not speculative)
- [ ] Limitations section addresses AI tool limitations and bias risk
- [ ] Protocol published or registered (OSF, PROSPERO-ScR when available)

---

**Final Note:** Transparently report AI use per PRISMA-ScR 2022 updates. AI accelerates synthesis but does not replace expert judgment; maintain human oversight at screening, extraction validation, and interpretation stages.
1. **Stage 1: Identify the Research Question**
- **Concrete action**
- Define scope using PCC (Population, Concept, Context).
- Draft 2–4 scoping questions and eligibility criteria.
- Example PCC:
- **Population:** adolescents aged 12–18
- **Concept:** wearable devices used to monitor mental health
- **Context:** any setting, last 10 years
- **AI tools/techniques**
- **ChatGPT / Claude** for question refinement and inclusion/exclusion drafting
- **Elicit** to surface key concepts, synonyms, and likely outcome domains
- **Example prompts**
```text
Help me formulate 3 scoping review questions using PCC for:
Population = adolescents aged 12-18
Concept = wearable devices for monitoring mental health
Context = any setting, published in last 10 years.
Also draft inclusion/exclusion criteria.
```
```text
List synonyms and related terms for:
wearable devices, mental health monitoring, adolescents, depression, anxiety, stress, mood, sleep, digital phenotyping.
```
- **Validation / quality control**
- Manually confirm definitions from JBI guidance.
- Ensure “monitoring” is distinguished from treatment-only interventions.
- Pilot criteria on 10–20 papers; revise ambiguous rules.
- Record all scope decisions in a protocol or review log.
2. **Stage 2: Identify Relevant Studies**
- **Concrete action**
- Build database-specific search strategies for **PubMed, Scopus, PsycINFO**.
- Run searches for the last 10 years; export results in RIS/CSV.
- Supplement with citation chasing.
- **AI tools/techniques**
- **ChatGPT** for Boolean query drafting and translation across databases
- **Elicit** for seed-paper discovery
- **Research Rabbit** for citation mapping and related-paper expansion
- **PubMed/Scopus/PsycINFO** as primary sources
- **Example prompts**
```text
Create a PubMed search strategy for a scoping review on wearable devices for monitoring mental health in adolescents aged 12-18 over the last 10 years. Include MeSH and free-text terms.
```
```text
Translate this PubMed query into Scopus and PsycINFO syntax.
```
```text
Using these seed papers, suggest adjacent keywords, wearable types, and mental health outcomes I may be missing.
```
- **Validation / quality control**
- Test whether known sentinel papers are retrieved.
- Have a librarian or supervisor review the strategy if possible.
- Document exact search dates, databases, and full queries.
- Do not let AI replace database searching; use it only to refine strategy.
3. **Stage 3: Select Studies**
- **Concrete action**
- Import all records into a screening tool; deduplicate.
- Conduct title/abstract screening, then full-text screening.
- Use dual review for at least a subset; resolve conflicts.
- **AI tools/techniques**
- **Covidence** or **Rayyan** for screening workflow and deduplication
- **ChatGPT / Claude** to help operationalize borderline decisions and create a screening manual
- **Example prompts**
```text
Create a title/abstract screening decision rule set for this review. Include examples of include, exclude, and unsure cases.
```
```text
Given this abstract and my criteria, explain whether it should be included, excluded, or marked unsure. Do not invent details; use only the text provided.
```
- **Validation / quality control**
- Never rely on AI alone for inclusion/exclusion decisions.
- Double-screen at least 20% and calculate agreement.
- Keep reasons for exclusion at full-text stage.
- Manually verify duplicates missed by software.
4. **Stage 4: Chart the Data**
- **Concrete action**
- Create a standardized extraction form.
- Extract study characteristics, wearable type, sensors, mental health construct, outcomes, setting, design, sample age, and key findings.
- **AI tools/techniques**
- **Covidence** extraction forms or Excel/Sheets
- **ChatGPT / Claude** for converting PDFs/abstracts into draft structured tables
- **SciSpace** for quick paper interrogation
- **Example prompts**
```text
Design a scoping review data-charting form for studies on wearable monitoring of adolescent mental health. Include bibliographic details, country, study design, age range, wearable device, sensor type, mental health outcomes, validation metrics, feasibility, and limitations.
```
```text
From the pasted abstract/full text, extract data into this table only. If information is missing, write NR (not reported). Do not infer.
```
- **Validation / quality control**
- Pilot the form on 5 studies and refine fields.
- Verify every AI-extracted field against the paper.
- Use “NR” rather than filling gaps with plausible guesses.
- Keep quotations/page numbers for critical variables.
5. **Stage 5: Collate, Summarize, and Report Results**
- **Concrete action**
- Produce descriptive summaries: study counts by year, country, device type, outcomes, and design.
- Map intervention/monitoring approaches, outcome measures, and research gaps.
- Prepare PRISMA-ScR flow diagram and narrative synthesis.
- **AI tools/techniques**
- **ChatGPT / Claude** for thematic grouping and drafting summary text
- **Elicit** for cross-paper comparison
- Excel/R for simple frequency tables and plots
- **Example prompts**
```text
Using this extracted dataset, group studies by wearable type, mental health outcome, and study design. Suggest 4-6 evidence map categories suitable for a scoping review.
```
```text
Draft a neutral results narrative summarizing patterns and research gaps without overstating effectiveness. Base the summary only on the table provided.
```
- **Validation / quality control**
- Check all counts against the master spreadsheet.
- Distinguish evidence mapping from effectiveness claims.
- Note gaps in populations, settings, validation, and equity.
- Review AI-written summaries against source tables to remove unsupported claims.
**PRISMA-ScR reproducibility checklist**
- Review question and PCC clearly stated
- Full inclusion/exclusion criteria documented
- Exact database names, dates, and complete search strings saved
- Deduplication method reported
- Screening process and conflict resolution described
- Data-charting form included or referenced
- AI tools used, versions, dates, and tasks documented
- Human verification steps explicitly reported
- PRISMA-ScR flow diagram completed
- Limitations of AI assistance acknowledged
Use **PRISMA-ScR** when writing the final report, and explicitly state where AI assisted search development, screening support, extraction drafting, or synthesis.
Of course. Here is a step-by-step guide for conducting your scoping review on wearable devices for adolescent mental health, integrating AI tools within the JBI and Arksey & O'Malley frameworks.

### **AI-Assisted Scoping Review Workflow**

**Stage 1: Identifying the Research Question**

* **Action:** Refine your research question using the Population, Concept, Context (PCC) framework to ensure
1. **Identify Research Question**
- **Action**: Refine broad question to map wearables for adolescent mental health monitoring (e.g., "What is known about wearable devices for monitoring mental health outcomes in adolescents aged 12-18 from 2014-2024, including interventions, outcomes, and gaps?").
- **AI Tools**: ChatGPT Plus for iterative refinement.
- **Prompt**:
```
Refine this scoping review question using Arksey & O'Malley: Map wearables for mental health in adolescents 12-18 (2014-2024). Focus on interventions, outcomes, gaps. Suggest 3 variants.
```
- **Validation**: Cross-check with JBI checklist; consult supervisor for alignment.
2. **Identify Relevant Studies**
- **Action**: Develop/search databases with terms like ("wearable*" OR "smartwatch*") AND ("mental health" OR depress* OR anx*) AND (adolescen* OR "12-18") AND (2014:2024).
- **AI Tools**: Elicit for lit search; export to RIS for Scopus/PubMed/PsycINFO.
- **Prompt** (Elicit):
```
Search: wearable devices mental health monitoring adolescents 12-18 since 2014. Limit to peer-reviewed. Export top 200.
```
- **Validation**: Run manual database searches; compare yields (aim >500 records); document search strings in log.
3. **Select Studies**
- **Action**: Dual-screen titles/abstracts/full-texts for relevance (include any design on wearables, mental health, adolescents).
- **AI Tools**: ChatGPT Plus for initial deduplication/prioritization; Covidence (free trial) or Elicit for screening.
- **Prompt** (ChatGPT):
```
From this RIS file [upload], deduplicate and score relevance (0-10) for scoping review on wearables for adolescent mental health. Output: title, score, reason.
```
- **Validation**: Dual human review 20% sample (Kappa >0.7); resolve AI flags manually.
4. **Chart Data**
- **Action**: Extract: study design, population, wearables, outcomes (e.g., anxiety via HRV), gaps.
- **AI Tools**: Elicit/Consensus for extraction; ChatGPT for tabular charting.
- **Prompt** (ChatGPT):
```
From these 50 PDFs [upload summaries], chart in table: Author/year, Design, Wearable type, Outcomes measured, Gaps noted. Format Markdown.
```
- **Validation**: Verify 10% extracts against originals; flag inconsistencies for manual check.
5. **Collate/Summarize/Report Results**
- **Action**: Thematic map (e.g., 60% wristbands, depression>anxiety outcomes, gaps in RCTs); visualize gaps.
- **AI Tools**: ChatGPT for synthesis; Research Rabbit for citation networks.
- **Prompt** (ChatGPT):
```
Summarize charted data: Themes in wearables/interventions/outcomes for adolescent mental health. Identify top 3 gaps. Suggest figures.
```
- **Validation**: Triangulate with manual thematic analysis; bias check via JBI tools.
**Reproducibility Checklist**:
- [ ] Document all search strings/databases/dates.
- [ ] Archive AI prompts/outputs/RIS files (e.g., OSF).
- [ ] Log versions: Elicit/ChatGPT (e.g., GPT-4o).
- [ ] Dual-review percentages reported.
**PRISMA-ScR Note**: Report per PRISMA-ScR flowchart (searches n=XX, included n=XX); include AI methods transparently in methods section. (428 words)
What makes these work
1. Give the model your criteria explicitly
Do not assume the AI knows what a scoping review inclusion criterion looks like. State each criterion as a numbered list before pasting the abstract or text. When criteria are explicit, the model's classification decisions are consistent enough to audit. When they are implied, outputs drift and you cannot trace errors back to a source.
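For instance, a first-pass screening prompt that states criteria explicitly might look like this (the criteria shown are illustrative, not taken from any of the tested runs):

```
Inclusion criteria:
1. Peer-reviewed empirical study
2. Published 2014-2024
3. Participants aged 12-18
4. Uses a wearable device to measure a mental health construct
Classify the abstract below as INCLUDE, EXCLUDE, or UNCERTAIN. Cite the criterion numbers that drove your decision in one sentence.
[paste abstract]
```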
2. Use structured extraction over open-ended summarization
Asking 'summarize this paper' produces narrative text that is hard to compare across 80 sources. Instead, give the model a fixed schema with named fields and tell it to write 'not reported' when data is absent. This produces a table-ready output that feeds directly into your charting phase without a reformatting step.
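A minimal Python sketch of the same idea, assuming the model returns a JSON object and using illustrative field names: it enforces a fixed schema so every study lands in the same table-ready shape, with "NR" for anything absent.

```python
# Sketch: enforce a fixed extraction schema on model output so each study
# yields identical, comparable fields. Field names are illustrative.
import json

FIELDS = ["study_design", "n", "mean_age", "device_type",
          "mental_health_construct", "key_finding"]

def validate_extraction(raw_json: str) -> dict:
    record = json.loads(raw_json)
    # Reject outputs where the model invented fields outside the schema
    extra = set(record) - set(FIELDS)
    if extra:
        raise ValueError(f"Unexpected fields from model: {extra}")
    # Map missing or empty values to "NR" rather than guessing
    return {f: record.get(f) or "NR" for f in FIELDS}

model_output = ('{"study_design": "cohort", "n": 120, "mean_age": 14.2, '
                '"device_type": "wrist-worn", "mental_health_construct": '
                '"anxiety", "key_finding": "HRV tracked self-reported anxiety"}')
print(validate_extraction(model_output))
```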
3. Run uncertain cases through a second prompt
Build a two-stage screening process. First pass: classify each abstract as Include, Exclude, or Uncertain. Second pass: for every Uncertain result, paste the abstract again with a prompt that asks the model to reason through each criterion one at a time before reaching a decision. This catches borderline cases without manual review of every item.
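Sketched in Python, the two-pass loop might look like the following. `call_model` is a placeholder for whichever LLM client you use, and both prompt templates are illustrative.

```python
# Sketch of a two-stage screening loop: quick classification first,
# criterion-by-criterion reasoning only for uncertain cases.

FIRST_PASS = ("Given these criteria:\n{criteria}\n"
              "Classify this abstract as INCLUDE, EXCLUDE, or UNCERTAIN:\n{abstract}")
SECOND_PASS = ("Given these criteria:\n{criteria}\n"
               "Reason through each criterion one at a time for this abstract, "
               "then give a final INCLUDE or EXCLUDE decision:\n{abstract}")

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def screen(abstracts: dict[str, str], criteria: str) -> dict[str, str]:
    decisions = {}
    for record_id, text in abstracts.items():
        label = call_model(FIRST_PASS.format(criteria=criteria, abstract=text))
        if "UNCERTAIN" in label.upper():
            # Second pass forces explicit reasoning before a final decision
            label = call_model(SECOND_PASS.format(criteria=criteria, abstract=text))
        decisions[record_id] = label
    return decisions
```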
4. Verify a random sample before scaling
Before processing your full corpus, run 15-20 abstracts through your prompt and manually check every output against your own judgment. Calculate your agreement rate. If you agree on less than 85% of cases, revise the prompt before scaling. Catching a systematic error at 20 abstracts costs minutes; catching it at 300 costs hours.
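If your labels live in two parallel lists, a short Python sketch can compute both the raw agreement rate and Cohen's kappa with no dependencies (the example labels are made up):

```python
# Calibration check: compare AI screening labels against your own judgment
# on a pilot sample before scaling to the full corpus.
from collections import Counter

def raw_agreement(ai, human):
    return sum(a == h for a, h in zip(ai, human)) / len(ai)

def cohens_kappa(ai, human):
    n = len(ai)
    observed = raw_agreement(ai, human)
    ai_freq, human_freq = Counter(ai), Counter(human)
    # Chance agreement from each rater's marginal label frequencies
    expected = sum((ai_freq[label] / n) * (human_freq[label] / n)
                   for label in set(ai) | set(human))
    return (observed - expected) / (1 - expected)

ai_labels = ["include", "exclude", "include", "uncertain", "exclude"]
my_labels = ["include", "exclude", "exclude", "uncertain", "exclude"]

print(f"Raw agreement: {raw_agreement(ai_labels, my_labels):.0%}")  # 80%
print(f"Cohen's kappa: {cohens_kappa(ai_labels, my_labels):.2f}")   # 0.69
```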
More example scenarios
Prompt:
I am conducting a scoping review on the relationship between urban heat exposure and mental health outcomes in adults. My inclusion criteria are: peer-reviewed empirical studies, published 2010-2024, human subjects, English language, reporting a measurable association between temperature or heat events and any mental health outcome. Here is an abstract: [paste abstract]. Classify it as Include, Exclude, or Uncertain, and give a one-sentence reason.
Sample response:
Include. The study is a peer-reviewed longitudinal analysis published in 2019 examining emergency department visits for psychiatric conditions during heat waves in a U.S. urban cohort, directly meeting all stated inclusion criteria.
Prompt:
Extract the following fields from this methods and results section of a study on AI-assisted radiology diagnosis: (1) AI model type, (2) imaging modality, (3) sample size, (4) reported sensitivity, (5) reported specificity, (6) comparator used. If a field is not reported, write 'not reported'. Here is the text: [paste excerpt].
Sample response:
1. Convolutional neural network (ResNet-50). 2. Chest X-ray. 3. n=1,842 patients. 4. Sensitivity: 91.3%. 5. Specificity: 88.7%. 6. Compared to board-certified radiologist consensus read.
Prompt:
Below are summaries of 12 included studies on remote learning outcomes for K-12 students during 2020-2022. Each summary includes the population, intervention, and key finding. Identify the 3-4 dominant themes across these studies, name each theme, and cite which study summaries support it. [paste summaries].
Sample response:
Theme 1: Equity gaps widened for low-income students (Studies 2, 5, 9, 11). Theme 2: Synchronous instruction outperformed asynchronous for younger learners (Studies 1, 3, 7). Theme 3: Teacher training moderated outcome quality (Studies 4, 6, 10). Theme 4: Parental involvement was a significant protective factor in elementary cohorts (Studies 8, 12).
Prompt:
You are helping with a scoping review on data privacy legislation. For each legal document excerpt I provide, extract: (1) jurisdiction, (2) year enacted or amended, (3) primary regulatory mechanism (consent-based, rights-based, or enforcement-based), (4) sector scope (all sectors or specific). Here is the first excerpt: [paste text].
Sample response:
1. European Union. 2. 2018 (GDPR, with 2021 guidance update). 3. Rights-based, with consent and legitimate interest provisions. 4. All sectors with specific carve-outs for law enforcement and national security.
Prompt:
Based on the following 8 study summaries from a scoping review on consumer trust in autonomous vehicles, identify what research questions remain unanswered or understudied. Be specific about population gaps, geographic gaps, and methodological gaps. [paste summaries].
Sample response:
Population gap: No studies examined trust in adults over 65, despite this group being a primary target demographic. Geographic gap: All 8 studies were conducted in North America or Western Europe; no data from Southeast Asia or Latin America. Methodological gap: All studies used self-reported trust scales; no behavioral or telemetry-based trust measures were used.
Common mistakes to avoid
Skipping the verification sample
Researchers feed their full abstract set into the AI without testing prompt accuracy first. If the model misinterprets one inclusion criterion, every downstream decision is corrupted and you may not detect it until a reviewer questions your methods. Always calibrate on a small batch before processing everything.
Treating AI output as a first draft of conclusions
AI synthesis paragraphs describe patterns in the text you gave it, not validated findings from the literature. If you paste summaries that contain errors, the synthesis inherits them. The researcher must verify that synthesized claims map to actual study findings before any output goes into a manuscript.
Using a single long prompt for the entire review
Trying to do screening, extraction, and synthesis in one prompt produces muddled outputs where the model prioritizes some tasks over others. Break the workflow into discrete stages with separate prompts. Each stage should have one job: classify, extract, or synthesize.
Ignoring context window limits on long documents
Pasting a 10,000-word full-text paper into a model with an 8,000-token limit causes the model to silently truncate the input and extract from an incomplete document. Know your model's context limit and either chunk long documents or use a model with a larger window. Errors from truncation are invisible without manual checking.
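One guard against silent truncation, sketched in Python with the tiktoken library; the cl100k_base encoding and the 8,000-token limit are examples, so substitute your model's actual tokenizer and context window:

```python
# Sketch: count tokens before sending a document, and chunk it when it
# exceeds the model's window rather than letting the input truncate.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # example encoding
LIMIT = 8000                                # example window; check your model

def chunk_text(text: str, limit: int = LIMIT) -> list[str]:
    tokens = enc.encode(text)
    if len(tokens) <= limit:
        return [text]
    # Naive split at raw token offsets; a production version would split
    # on section boundaries (methods, results) instead.
    return [enc.decode(tokens[i:i + limit])
            for i in range(0, len(tokens), limit)]
```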
Not documenting the prompts used
Scoping reviews require a reproducible methods section. If you cannot report exactly what prompt was used at each stage, your AI-assisted process cannot be audited or replicated. Log every prompt version, the model and version used, and the date it was run, the same way you would log a database search string.
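A lightweight way to keep that log, sketched in Python: append one JSON Lines record per prompt run to an audit file (the field names and example values are illustrative):

```python
# Sketch: append every prompt run to a JSON Lines audit log so the methods
# section can report exact prompts, model versions, and run dates.
import json
from datetime import datetime, timezone

def log_prompt(path, stage, model, prompt_version, prompt_text):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,                  # e.g., "screening", "extraction"
        "model": model,                  # e.g., "gpt-4o-2024-08-06"
        "prompt_version": prompt_version,
        "prompt": prompt_text,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_prompt("prompt_log.jsonl", "screening", "gpt-4o", "v3",
           "Classify this abstract as INCLUDE, EXCLUDE, or UNCERTAIN ...")
```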
Frequently asked questions
Can AI replace human screening in a scoping review?
Not entirely. AI can handle a reliable first-pass screen and flag uncertain cases, but the researcher still needs to review borderline decisions and a random validation sample. For published scoping reviews, most journals expect human involvement in final inclusion decisions. AI works best as a time-saving layer, not a complete replacement for researcher judgment.
Which AI tools are best for scoping reviews?
General-purpose large language models like GPT-4 or Claude work well for abstract screening and data extraction because they follow structured instructions reliably. Dedicated tools like Rayyan, Covidence with AI features, or Elicit are built specifically for literature review workflows and handle citation management alongside AI screening. The right choice depends on whether you need a standalone AI or an integrated review platform.
How do I report AI use in a scoping review methods section?
Report the tool name and version, the stage at which AI was used (screening, extraction, synthesis), the prompt or prompt structure used, the validation process you applied, and how disagreements between AI and human review were resolved. Transparency here is the same standard you would apply to any other methodological tool. Several journals now have explicit reporting guidelines for AI-assisted reviews.
How many abstracts can I realistically screen with AI?
There is no hard ceiling. Researchers have used AI-assisted screening on corpora of 5,000 or more abstracts by processing them in batches. The practical limit is your time to verify outputs, not the AI's capacity. At larger scales, invest more effort upfront in calibrating your prompt against a validation set so errors do not compound across thousands of items.
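At that scale, a batched loop with a built-in audit sample keeps verification tractable. Here is a rough Python sketch where `classify` stands in for your screening call, and the batch size and audit rate are arbitrary defaults:

```python
# Sketch: screen a large corpus in batches, flagging a random sample from
# each batch for human verification so calibration errors surface early.
import random

def screen_in_batches(abstracts, classify, batch_size=100, audit_rate=0.05):
    results, audit_queue = {}, []
    items = list(abstracts.items())
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        for record_id, text in batch:
            results[record_id] = classify(text)
        # Hold back a few records per batch for manual checking
        audit_queue += random.sample(batch, max(1, int(len(batch) * audit_rate)))
    return results, audit_queue
```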
Is AI-assisted scoping review acceptable in peer-reviewed journals?
Acceptance is growing but not universal. Many journals now permit AI-assisted methods if disclosed transparently and validated against human judgment. Some require that AI screening be treated as a single reviewer with a second human reviewer for all inclusions. Check the target journal's author guidelines and any relevant reporting standards like PRISMA-ScR before designing your workflow.
What is the difference between using AI for a scoping review versus a systematic review?
Scoping reviews have more flexible methods and lower evidentiary stakes, making them a better fit for current AI tools. Systematic reviews require rigorous dual independent screening, formal risk-of-bias assessment, and often quantitative synthesis, all areas where AI error rates are not yet low enough to meet most reporting standards without extensive human verification. Use AI more cautiously in systematic review workflows.