```python
# --- Monthly revenue trends for top 5 products ---
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Ensure 'date' is datetime
sales_df['date'] = pd.to_datetime(sales_df['date'])
# Identify top 5 products by total revenue
top5_products = (
    sales_df.groupby('product')['revenue']
    .sum()
    .nlargest(5)
    .index
    .tolist()
)
print("Top 5 products:", top5_products)
# Filter to top 5 and aggregate revenue by month
top5_df = sales_df[sales_df['product'].isin(top5_products)].copy()
top5_df['month'] = top5_df['date'].dt.to_period('M').dt.to_timestamp()
monthly = (
    top5_df.groupby(['month', 'product'], as_index=False)['revenue']
    .sum()
)
# Pivot for plotting (months x products)
pivot = monthly.pivot(index='month', columns='product', values='revenue').sort_index()
# Plot
sns.set_style("whitegrid")
plt.figure(figsize=(12, 6))
for product in top5_products:  # preserve ranking order in legend
    if product in pivot.columns:
        plt.plot(pivot.index, pivot[product], marker='o', label=product)
plt.title("Monthly Revenue Trends — Top 5 Products")
plt.xlabel("Month")
plt.ylabel("Total Revenue")
plt.legend(title="Product")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
pivot
```
**What it does:** Converts `date` to datetime, finds the top 5 products by total revenue, filters the DataFrame, buckets dates into months using `to_period('M')`, sums revenue per product/month, pivots to a wide format, and plots one line per product. The final `pivot` display shows the underlying monthly totals for reference.
**Requirements:** `pandas`, `matplotlib`, and `seaborn` — all standard in most Jupyter environments. If missing: `%pip install pandas matplotlib seaborn`.
# Write Python in Jupyter Notebooks with AI Copilots
Tested prompts for an AI assistant for Jupyter Notebook, compared across five leading AI models.
If you typed 'ai assistant for jupyter notebook' into Google, you're probably staring at a notebook full of half-written cells and wondering if there's a faster way to get from idea to working code. There is. AI copilots can generate pandas pipelines, write matplotlib charts, debug stack traces, and explain what a piece of code actually does, all without leaving your notebook environment.
The problem most data scientists and analysts hit is context-switching. You write a cell, get stuck, open a browser tab, dig through Stack Overflow or ChatGPT, copy something back, tweak it, and repeat. An AI assistant embedded in or paired with Jupyter cuts that loop dramatically. Instead of describing your problem to a chat window, you paste your dataframe schema and ask for exactly the transformation you need.
This page tests a specific Python-writing prompt across five leading AI models and compares the results directly. The goal is to show you which model produces the most notebook-ready code, what prompt structure gets you there fastest, and where each tool falls short so you can make a practical choice for your daily workflow.
## When to use this
AI assistants paired with Jupyter notebooks are the right call when you are working iteratively with data and need code fast but accuracy matters more than raw speed. If your work involves wrangling messy datasets, building quick visualizations, writing repetitive boilerplate like train-test splits or API calls, or explaining inherited code to yourself or a team, this workflow consistently saves time.
- Cleaning and reshaping a pandas DataFrame when you know what output you want but not the exact method chain
- Generating a first-draft matplotlib or seaborn chart that you then style manually
- Writing boilerplate for ML pipelines: scalers, encoders, cross-validation loops
- Explaining an unfamiliar code block inherited from a colleague or open-source repo
- Drafting regex patterns, string formatters, or custom aggregation functions you would otherwise look up
## When this format breaks down
- Production-grade code that will run unreviewed in automated pipelines: AI-generated cells often skip input validation and edge-case handling that production code requires.
- Highly specialized scientific domains like numerical PDE solvers or domain-specific bioinformatics libraries where the AI lacks reliable training coverage and errors are subtle and dangerous.
- Regulated environments where copy-pasting code from an external model violates data governance or IP policy, particularly if you are pasting actual data samples into a prompt.
- When you need the fastest possible runtime performance: AI assistants optimize for readable correctness, not vectorization or memory efficiency, and the generated code often needs profiling and rewriting before scale.
## The prompt we tested
You are an expert Python and data science AI assistant embedded in a Jupyter Notebook. Help the user write, debug, or explain Python code for their notebook based on the request below. Produce clean, well-commented Python code optimized for Jupyter Notebook execution (use pandas, numpy, matplotlib/seaborn where appropriate, and display outputs with print() or direct variable references). Keep the explanation under 150 words and list any %pip install commands at the top if non-standard libraries are needed. User request: I have a pandas DataFrame called sales_df with columns 'date', 'product', and 'revenue'. I want to create a line chart showing monthly total revenue trends for the top 5 products by total revenue. Can you write the code to do this in my notebook? Respond with a clearly labeled Python code cell (inside ```python fences) ready to paste into Jupyter, followed by a brief explanation of what the code does and any required imports or pip installs.
## Model comparison
| Model | Quality | Cost | Latency | Tokens (in / out) |
|---|---|---|---|---|
| Claude Opus 4.7 | 8 | $0.0603 | 8.9s | 323 / 740 |
| Claude Haiku 4.5 | 7 | $0.0028 | 4.2s | 221 / 516 |
| GPT-5.4 | 8 | $0.0075 | 6.6s | 203 / 469 |
| Gemini 2.5 Pro | 7 | $0.0047 | 17.2s | 201 / 881 |
| Grok 4.1 Fast Reasoning | 8 | $0.0003 | 7.9s | 349 / 379 |
## Outputs
# Pip Install (if needed)
```
%pip install pandas numpy matplotlib seaborn
```
# Code
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Convert 'date' column to datetime if not already
sales_df['date'] = pd.to_datetime(sales_df['date'])
# Find top 5 products by total revenue
top_5_products = sales_df.groupby('product')['revenue'].sum().nlargest(5).index.tolist()
# Filter for top 5 products and create year-month column
df_filtered = sales_df[sales_df['product'].isin(top_5_products)].copy()
df_filtered['year_month'] = df_filtered['date'].dt.to_period('M')
# Calculate monthly revenue by product
monthly_revenue = df_filtered.groupby(['year_month', 'product'])['revenue'].sum().reset_index()
monthly_revenue['year_month'] = monthly_revenue['year_month'].astype(str)
# Create line chart
plt.figure(figsize=(12, 6))
for product in top_5_products:
    product_data = monthly_revenue[monthly_revenue['product'] == product]
    plt.plot(product_data['year_month'], product_data['revenue'], marker='o', label=product, linewidth=2)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Revenue', fontsize=12)
plt.title('Monthly Revenue Trends - Top 5 Products', fontsize=14, fontweight='bold')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
## Explanation
This code identifies the **top 5 products by total revenue**, then plots their monthly revenue trends as separate lines. It converts dates to year-month format, groups by product and month, and displays trends over time. The chart includes markers, legend, and grid for easy interpretation of revenue patterns across your top performers.
```python
# Required imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# --- Prepare data ---
# Make a copy to avoid modifying the original DataFrame
df = sales_df.copy()
# Ensure 'date' is in datetime format
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Drop rows with missing dates or revenue
df = df.dropna(subset=['date', 'revenue', 'product'])
# Find top 5 products by total revenue
top_products = (
    df.groupby('product', as_index=False)['revenue']
    .sum()
    .sort_values('revenue', ascending=False)
    .head(5)['product']
    .tolist()
)
print("Top 5 products by total revenue:", top_products)
# Filter to top 5 products only
df_top = df[df['product'].isin(top_products)].copy()
# Create a monthly period column
df_top['month'] = df_top['date'].dt.to_period('M').dt.to_timestamp()
# Aggregate monthly total revenue by product
monthly_revenue = (
    df_top.groupby(['month', 'product'], as_index=False)['revenue']
    .sum()
    .sort_values(['month', 'product'])
)
# --- Plot ---
sns.set_style("whitegrid")
plt.figure(figsize=(12, 6))
sns.lineplot(
    data=monthly_revenue,
    x='month',
    y='revenue',
    hue='product',
    marker='o'
)
plt.title('Monthly Total Revenue Trends for Top 5 Products')
plt.xlabel('Month')
plt.ylabel('Total Revenue')
plt.legend(title='Product', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Optional: display the aggregated data used for plotting
monthly_revenue.head()
```
This code converts `date` to datetime, finds the top 5 products by total revenue, aggregates revenue by month for those products, and plots monthly revenue trends as a line chart. It uses only standard notebook libraries: `pandas`, `matplotlib`, and `seaborn`, so no `%pip install` is needed unless they are missing.
Of course. Here is the Python code to generate a line chart of monthly revenue for your top 5 products.
This solution first identifies the top 5 products by total revenue. It then filters the DataFrame to include only these products, aggregates their sales data by month, and finally plots the trends using `seaborn` for a clear and informative visualization.
```python
# %pip install pandas matplotlib seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# --- Create a sample sales_df DataFrame (replace with your actual data) ---
# This simulates a year of sales data for 10 different products.
np.random.seed(42)
dates = pd.to_datetime(pd.date_range(start='2023-01-01', end='2023-12-31', freq='D'))
products = [f'Product_{chr(65+i)}' for i in range(10)]
data = []
for date in dates:
    num_sales = np.random.randint(5, 15)
    for _ in range(num_sales):
        product = np.random.choice(products)
        revenue = np.random.uniform(20, 500)
        data.append([date, product, revenue])
sales_df = pd.DataFrame(data, columns=['date', 'product', 'revenue'])
print("Sample DataFrame created. Head:")
display(sales_df.head())
print("-" * 50)
# -------------------------------------------------------------------------
# 1. Ensure 'date' column is in datetime format
sales_df['date'] = pd.to_datetime(sales_df['date'])
# 2. Find the top 5 products by total revenue
top_5_products = sales_df.groupby('product')['revenue'].sum().nlargest(5).index
print(f"Top 5 products by total revenue: {list(top_5_products)}")
# 3. Filter the DataFrame to include only the top 5 products
top_5_df = sales_df[sales_df['product'].isin(top_5_products)].copy()
# 4. Aggregate revenue by month for each product
# We use pd.Grouper to group dates by month ('M')
monthly_revenue = top_5_df.groupby(['product', pd.Grouper(key='date', freq='M')])['revenue'].sum().reset_index()
# 5. Plot the data
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(14, 7))
# Use seaborn to create a line plot with a different line for each product
sns.lineplot(
    data=monthly_revenue,
    x='date',
    y='revenue',
    hue='product',  # Creates separate lines for each product
    marker='o'
)
# Formatting the plot
plt.title('Monthly Revenue Trends for Top 5 Products', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Total Monthly Revenue', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```
### Explanation
This code accomplishes your goal in five steps:
1. **Data Preparation**: It ensures the `date` column is a proper datetime type, which is essential for time-series analysis.
2. **Identify Top Products**: It groups the data by `product`, calculates the total revenue for each, and selects the names of the top 5.
3. **Filter Data**: A new DataFrame is created containing sales records for only these top 5 products.
4. **Monthly Aggregation**: It groups the filtered data by both `product` and month, summing the `revenue` to get the total monthly sales for each top product.
5. **Visualization**: `seaborn.lineplot` draws one line per product (via `hue='product'`), with markers, a title, and rotated axis labels for readability.
```python
# No %pip installs needed (uses standard pandas/matplotlib)
import pandas as pd
import matplotlib.pyplot as plt
# Ensure 'date' is datetime and create year-month periods
sales_df['date'] = pd.to_datetime(sales_df['date'])
sales_df['year_month'] = sales_df['date'].dt.to_period('M')
# Get top 5 products by total revenue
product_totals = sales_df.groupby('product')['revenue'].sum().nlargest(5)
top_products = product_totals.index
# Aggregate monthly revenue for top products (unstack to wide format)
monthly_data = (sales_df[sales_df['product'].isin(top_products)]
                .groupby(['year_month', 'product'])['revenue']
                .sum()
                .unstack(fill_value=0)
                .sort_index())  # Sort months chronologically
# Create line plot
# Pass figsize to .plot(): a separate plt.figure() call would be left empty,
# since pandas creates its own figure when no axes are supplied
ax = monthly_data.plot(kind='line', figsize=(12, 6), linewidth=2.5, marker='o', markersize=6)
plt.title('Monthly Revenue Trends for Top 5 Products', fontsize=16)
plt.xlabel('Year-Month', fontsize=12)
plt.ylabel('Total Revenue', fontsize=12)
plt.legend(title='Product', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Display top products for reference
print("Top 5 products by total revenue:\n", product_totals)
```
This code processes `sales_df` by converting dates, identifying top 5 products by total revenue, aggregating monthly totals, and plotting line trends with markers. It auto-sorts months and displays the top products. Outputs plot inline in Jupyter and prints top products. No pip installs needed (97 words).
## What makes these work
1. **Paste your schema, not a description**
   The single biggest prompt improvement is replacing phrases like "I have a sales dataset" with the actual column names, dtypes, and a sample row. AI models generate far more accurate pandas code when they know whether a date column is already datetime or still a string. Run `df.dtypes` and `df.head(2)` in a cell and paste the output directly into your prompt.
2. **Specify the output shape explicitly**
   Tell the model exactly what the resulting DataFrame, variable, or return value should look like, including column names and types. Vague requests like "summarize the data" produce generic code. Specific requests like "return a DataFrame with columns month (Period) and revenue (float) sorted ascending" produce notebook-ready cells you can run immediately.
3. **Ask for one cell, not a script**
   Jupyter workflows are iterative. Prompting for a single focused cell that does one thing, like a merge, a groupby, or a plot, keeps AI output manageable and testable. Asking for an entire analysis script in one shot usually produces code that silently assumes variable names or import states that do not match your notebook.
4. **Include your error message verbatim**
   When debugging, copy the full traceback including the line number and paste it with the relevant cell. The AI can pinpoint dtype mismatches, index errors, and API changes far faster when it sees the exact failure message. Do not paraphrase the error; the specific exception class often carries the information the model needs to fix it.
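The schema-pasting tip is mechanical enough to script. The cell below is a minimal sketch (the `df` here is a made-up example frame, not from the article) that assembles dtypes and sample rows into one prompt-ready string:

```python
import pandas as pd

# Hypothetical example frame -- replace with your own DataFrame
df = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-04"],
    "product": ["Widget", "Gadget"],
    "revenue": [120.5, 89.0],
})

# Assemble schema + sample rows into a single string to paste into a prompt
context = (
    "Columns and dtypes:\n" + df.dtypes.to_string()
    + "\n\nSample rows:\n" + df.head(2).to_string()
)
print(context)
```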
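And for the output-shape tip, here is roughly the cell that the specific request above ("a DataFrame with columns month (Period) and revenue (float) sorted ascending") should produce — the sample data is invented for illustration:

```python
import pandas as pd

# Invented sample data -- replace with your own
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-02-10", "2024-01-05", "2024-01-20"]),
    "revenue": [200.0, 100.0, 50.0],
})

# Exactly the requested shape: columns 'month' (Period) and 'revenue' (float), ascending
monthly = (
    sales.assign(month=sales["date"].dt.to_period("M"))
    .groupby("month", as_index=False)["revenue"]
    .sum()
    .sort_values("month")
)
print(monthly)
```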
## More example scenarios
**Prompt:** I have two pandas DataFrames: orders (columns: order_id, customer_id, order_date, total_amount) and customers (columns: customer_id, signup_date, country). I want to merge them, filter to UK customers only, then calculate monthly revenue grouped by month. Return a clean DataFrame with columns: month, total_revenue.
**Expected output:** A working pandas snippet that performs pd.merge on customer_id, filters with .loc where country equals 'UK', converts order_date to datetime, extracts the month period with dt.to_period, and runs groupby with sum on total_amount, renaming columns cleanly at the end.
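A minimal sketch of that snippet, using invented sample rows that match the stated schema:

```python
import pandas as pd

# Hypothetical frames matching the scenario's schema
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "order_date": ["2024-01-15", "2024-01-20", "2024-02-02"],
    "total_amount": [100.0, 50.0, 75.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "signup_date": ["2023-05-01", "2023-06-01"],
    "country": ["UK", "DE"],
})

# Merge, filter to UK, bucket by month, and sum revenue
merged = pd.merge(orders, customers, on="customer_id")
uk = merged.loc[merged["country"] == "UK"].copy()
uk["order_date"] = pd.to_datetime(uk["order_date"])
uk["month"] = uk["order_date"].dt.to_period("M")
result = (
    uk.groupby("month", as_index=False)["total_amount"]
    .sum()
    .rename(columns={"total_amount": "total_revenue"})
)
print(result)
```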
**Prompt:** Write a scikit-learn pipeline that scales features with StandardScaler, trains a LogisticRegression classifier, and runs 5-fold stratified cross-validation on X_train and y_train. Print mean accuracy and standard deviation. Assume X_train and y_train are already defined.
**Expected output:** A concise Pipeline definition combining StandardScaler and LogisticRegression, passed into cross_val_score with cv=StratifiedKFold(n_splits=5), followed by print statements reporting cv_scores.mean() and cv_scores.std() rounded to four decimal places.
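A sketch of what good output for that prompt looks like — the `X_train`/`y_train` here are random stand-ins for your real training data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Hypothetical training data -- replace with your own X_train / y_train
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Scale, then classify, inside one Pipeline so scaling is fit per CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
cv_scores = cross_val_score(pipe, X_train, y_train, cv=StratifiedKFold(n_splits=5))
print(f"Mean accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
```

Putting the scaler inside the Pipeline matters: fitting it once on all of `X_train` before cross-validating would leak validation-fold statistics into training.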
**Prompt:** I have a DataFrame called df_prices with columns: date (datetime) and close_price (float) for a stock. Plot the closing price as a line chart using matplotlib. Annotate the single highest closing price point with its date and value. Use a clean style with gridlines.
**Expected output:** A matplotlib block that sets plt.style.use('seaborn-v0_8-whitegrid'), plots date vs close_price, finds the max row with idxmax(), and uses plt.annotate with an arrow to mark that point, displaying the formatted date and price value in the annotation text.
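A sketch of that plot, with an invented `df_prices` (one deliberately high day) standing in for real stock data; the `seaborn-v0_8-whitegrid` style name requires matplotlib 3.6+:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented price series -- replace with your df_prices; day 18 is a deliberate peak
df_prices = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "close_price": [100 + i + (20 if i == 17 else 0) for i in range(30)],
})

plt.style.use("seaborn-v0_8-whitegrid")
plt.figure(figsize=(10, 5))
plt.plot(df_prices["date"], df_prices["close_price"])

# Find the single highest close and annotate it with date and value
peak = df_prices.loc[df_prices["close_price"].idxmax()]
plt.annotate(
    f"{peak['date']:%Y-%m-%d}: {peak['close_price']:.2f}",
    xy=(peak["date"], peak["close_price"]),
    xytext=(peak["date"], peak["close_price"] + 4),
    arrowprops=dict(arrowstyle="->"),
)
plt.xlabel("Date")
plt.ylabel("Close Price")
plt.tight_layout()
plt.show()
```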
**Prompt:** I have a CSV where one column called 'measurements' contains strings like '12.3 mg | 4.5 cm | 0.8 g'. Write Python to split this column into three new columns: weight_mg, length_cm, mass_g, converting each to float. Handle rows where the format is missing or malformed by returning NaN.
**Expected output:** A solution using str.split with expand=True or str.extract with a named-group regex, wrapping the conversion in pd.to_numeric with errors='coerce' to safely produce NaN for malformed entries, assigned back to three clearly named DataFrame columns.
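A sketch of the named-group-regex approach described above, with a deliberately malformed third row:

```python
import pandas as pd

# Invented rows, including one malformed entry that should become NaN
df = pd.DataFrame({"measurements": [
    "12.3 mg | 4.5 cm | 0.8 g",
    "7.1 mg | 3.0 cm | 0.5 g",
    "bad row",
]})

# Named-group regex: rows that do not match yield NaN in every group
pattern = (
    r"(?P<weight_mg>[\d.]+)\s*mg\s*\|\s*"
    r"(?P<length_cm>[\d.]+)\s*cm\s*\|\s*"
    r"(?P<mass_g>[\d.]+)\s*g"
)
parts = df["measurements"].str.extract(pattern)

# Convert each extracted column to float; coerce leaves NaN for bad rows
df[["weight_mg", "length_cm", "mass_g"]] = parts.apply(pd.to_numeric, errors="coerce")
print(df)
```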
**Prompt:** Write a Python function that calls the Open-Meteo weather API for a given latitude, longitude, and date range. Return a pandas DataFrame with columns: date, temperature_max, temperature_min. Use the requests library. Handle HTTP errors gracefully.
**Expected output:** A function build_weather_df(lat, lon, start_date, end_date) that constructs the Open-Meteo URL with a params dict, calls requests.get with a timeout, raises for status on failure, parses the JSON response into a DataFrame from the daily key, and converts the time field to datetime.
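A sketch of such a function. The endpoint URL and parameter names below follow Open-Meteo's published archive API, but verify them against the current docs before relying on this:

```python
import requests
import pandas as pd

def build_weather_df(lat, lon, start_date, end_date):
    """Fetch daily max/min temperature from the Open-Meteo archive API.

    Endpoint and parameter names are based on Open-Meteo's public docs;
    check the current API reference before using in anger.
    """
    url = "https://archive-api.open-meteo.com/v1/archive"
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": start_date,   # 'YYYY-MM-DD'
        "end_date": end_date,
        "daily": "temperature_2m_max,temperature_2m_min",
        "timezone": "UTC",
    }
    resp = requests.get(url, params=params, timeout=10)
    resp.raise_for_status()  # surface HTTP errors instead of failing silently
    daily = resp.json()["daily"]
    return pd.DataFrame({
        "date": pd.to_datetime(daily["time"]),
        "temperature_max": daily["temperature_2m_max"],
        "temperature_min": daily["temperature_2m_min"],
    })

# Usage (requires network access):
# build_weather_df(51.5, -0.12, "2024-01-01", "2024-01-07")
```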
## Common mistakes to avoid
- **Running AI output without inspection.** AI-generated cells look syntactically correct and often run without errors while doing the wrong thing, like an inner join silently dropping rows you needed. Always print the shape and a `head()` after any merge, filter, or aggregation to verify the row count and values match your expectations before moving on.
- **Prompting without import context.** If your notebook already imports numpy as `np` and pandas as `pd`, the model will usually match that. But if you switch models or start a new session, the generated code may use different aliases or import libraries you have not installed. Specify your environment or paste your import block into the prompt to avoid mismatches.
- **Asking for explanations of code you then ignore.** A common workflow failure is using the AI to explain a block, nodding along, then reusing that code without actually reading it. Explanations are useful, but they do not substitute for understanding what the code does to your actual data. Run intermediate steps and inspect the output at each stage.
- **Using live data samples in prompts sent to external models.** Pasting actual customer records, financial data, or PII into an AI prompt to get better code is a data governance risk. Use fake or anonymized sample rows that preserve the schema and dtypes but not real values. The model only needs the structure, not the content, to write correct pandas or scikit-learn code.
- **Expecting the model to know your custom library versions.** Jupyter environments vary wildly. An AI trained on data through a certain date will not know about breaking changes in newer versions of pandas, scikit-learn, or matplotlib. If generated code fails with an unexpected AttributeError or deprecation warning, include your library version in the prompt or search specifically for version-specific syntax.
## Frequently asked questions
### What is the best AI assistant that works directly inside Jupyter Notebook?
The most widely used options are GitHub Copilot (via the JupyterLab extension), Jupyter AI (an official JupyterLab extension supporting multiple model backends including OpenAI and Anthropic), and Continue.dev. Jupyter AI is free to set up with your own API key and integrates the most natively, giving you a chat panel alongside your notebook cells without leaving the interface.
### Does GitHub Copilot work in Jupyter Notebook?
Yes, GitHub Copilot works in JupyterLab through its official extension and in VS Code notebooks natively. It provides inline autocomplete as you type in cells. It does not have a chat interface inside the classic Jupyter Notebook interface, so if you need conversational prompting, JupyterLab with the Copilot Chat extension or Jupyter AI is a better fit.
### Can I use ChatGPT as an AI assistant for Jupyter notebooks?
Not natively embedded, but it is a common pairing. You write code in Jupyter, copy a failing cell or describe a task in ChatGPT, and paste the result back. Jupyter AI lets you connect GPT-4 directly inside JupyterLab via an API key, which removes most of the copy-paste friction and keeps your conversation context tied to your notebook.
### Is there a free AI assistant for Jupyter Notebook?
Jupyter AI is open source and free to install. You pay only for API calls to whichever model you connect it to. If you use it with a local model via Ollama, the entire setup is free with no API costs. Google Colab also has an AI code assistant built in that is free for standard users.
### How do I install an AI assistant in JupyterLab?
The quickest path is running `pip install jupyter-ai` in your environment, then launching JupyterLab. A chat icon appears in the left sidebar, and you configure an API key for your preferred model provider under Settings. For GitHub Copilot, install its separate JupyterLab extension and authenticate with your GitHub account.
### Can an AI assistant in Jupyter see my existing variables and dataframes?
Some can, with limitations. Jupyter AI has a context feature that can include notebook cells in the prompt. GitHub Copilot reads the cells above the cursor for inline suggestions. Neither has full kernel introspection, meaning they cannot automatically read the current state of a variable in memory. You need to paste relevant output like df.head() or df.dtypes into the prompt manually for the model to use that information.