Matplotlib Tutorial¶
Matplotlib: Used for creating basic, customizable plots from scratch.
Seaborn: Built on Matplotlib; used for making attractive and statistical plots easily.
Topics to be covered:
Matplotlib:
1- basic plotting
2- figures & axes
3- subplots
4- customization
5- histograms
6- boxplots
7- scatter
8- heatmaps
9- time-series
# Setup: imports and sample data creation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Display settings
pd.options.display.max_columns = 50
# Create a reproducible random dataset for examples
np.random.seed(0)
dates = pd.date_range('2024-01-01', periods=20)
df_time = pd.DataFrame({
'date': dates,
'sales_A': np.random.randint(50, 200, size=len(dates)),
'sales_B': np.random.randint(40, 180, size=len(dates))
})
# Small "tips-like" dataset (for seaborn categorical examples)
tips = pd.DataFrame({
# Gaussian (normal) distribution: mean (μ) = 20, standard deviation (σ) = 8,
# then clipped so every value lies between 3 (minimum) and 60 (maximum)
'total_bill': np.random.normal(20, 8, 200).clip(3, 60),
'tip': np.random.normal(3, 1, 200).clip(0, 12),
'size': np.random.randint(1,6,200),
'sex': np.random.choice(['Male','Female'], 200),
'day': np.random.choice(['Thur','Fri','Sat','Sun'],200),
'time': np.random.choice(['Lunch','Dinner'],200)
})
# Small "iris-like" dataset for pairplots and regression
iris_like = pd.DataFrame({
'sepal_length': np.random.normal(5.8, 0.8, 150),
'sepal_width': np.random.normal(3.0, 0.4, 150),
'petal_length': np.random.normal(3.7, 1.5, 150),
'petal_width': np.random.normal(1.2, 0.5, 150),
'species': np.random.choice(['setosa','versicolor','virginica'], 150)
})
print("df_time: " + str(len(df_time)))
print("tips: " + str(len(tips)))
print("iris_like: " + str(len(iris_like)))
print('\nDatasets created: df_time (time-series), tips (categorical), iris_like (pair/regression)')
print(df_time.head())
print()
print(tips.head())
print()
print(iris_like.head())
df_time: 20
tips: 200
iris_like: 150
Datasets created: df_time (time-series), tips (categorical), iris_like (pair/regression)
date sales_A sales_B
0 2024-01-01 97 155
1 2024-01-02 167 119
2 2024-01-03 117 122
3 2024-01-04 153 139
4 2024-01-05 59 69
total_bill tip size sex day time
0 19.097609 1.481971 1 Female Sat Lunch
1 27.258768 1.106955 3 Female Thur Dinner
2 26.522159 2.214913 2 Female Fri Dinner
3 21.832784 1.394706 5 Female Sat Lunch
4 11.790570 4.431840 5 Male Thur Dinner
sepal_length sepal_width petal_length petal_width species
0 5.973746 2.404899 2.093385 0.739210 virginica
1 4.592443 3.358717 3.225073 0.755476 setosa
2 6.040855 3.772222 2.087491 1.164487 versicolor
3 6.100678 2.836652 2.341414 1.079713 virginica
4 5.254899 3.261387 3.942793 1.203688 setosa
# Basic line plot with matplotlib: figures and axes
fig, ax = plt.subplots(figsize=(8,4))
ax.plot(df_time['date'], df_time['sales_A'], label='Sales A', marker='o')
ax.plot(df_time['date'], df_time['sales_B'], label='Sales B', marker='s')
ax.set_title('Sales over time (basic)')
ax.set_xlabel('Date')
ax.set_ylabel('Sales')
ax.legend()
plt.tight_layout()
plt.show()
Subplots and multiple plots¶
# Subplots: multiple axes
fig, axes = plt.subplots(2, 1, figsize=(8,6), sharex=True) #Makes both subplots share the same X-axis (same scale and ticks).
axes[0].plot(df_time['date'], df_time['sales_A'], label='Sales A', color='tab:blue')
axes[0].set_title('Sales A')
axes[1].plot(df_time['date'], df_time['sales_B'], label='Sales B', color='tab:orange')
axes[1].set_title('Sales B')
for ax in axes:
    ax.grid(True)
plt.tight_layout()
plt.show()
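The same `plt.subplots` call can also place Axes next to each other instead of stacking them. A minimal sketch, assuming made-up arrays `x`, `y_a`, and `y_b` (hypothetical stand-ins for two series you want to compare):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data standing in for two series to compare
x = np.arange(20)
y_a = 100 + 5 * x
y_b = 150 - 3 * x

# One Figure (the canvas) holding a 1x2 grid of Axes (the individual plots)
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].plot(x, y_a, color='tab:blue')
axes[0].set_title('Series A')
axes[1].plot(x, y_b, color='tab:orange')
axes[1].set_title('Series B')
for ax in axes:
    ax.grid(True)
fig.tight_layout()
# In a script, call plt.show() here; a notebook renders the figure automatically.
```

With `sharey=True` both panels use the same y-scale, which is usually what you want when comparing magnitudes.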
Concept: A Figure is the overall canvas, and Axes are the individual plots inside it; the subplots example above places two Axes in one Figure.
Application: Compare multiple graphs in a single figure, stacked or side-by-side.
Customizing plots (styles, colors, fonts)¶
# Customizing style, markers, linewidth, font sizes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(8,4))
ax.plot(df_time['date'], df_time['sales_A'], marker='o', linestyle='--', linewidth=2, label='A')
ax.plot(df_time['date'], df_time['sales_B'], marker='d', linestyle='-', linewidth=1.5, label='B')
ax.set_title('Customized sales plot', fontsize=14)
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Sales', fontsize=12)
ax.tick_params(axis='x', rotation=30)
ax.legend(fontsize=10)
plt.tight_layout()
plt.show()
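Note that `plt.style.use` changes the style for every figure created afterwards. If you only want a style for one plot, `plt.style.context` can scope it to a `with` block. A small sketch, assuming the built-in `'ggplot'` style and made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for illustration
x = np.arange(10)
y = x ** 2

# The style applies only to figures created inside this block
with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(x, y, marker='o', linestyle='--', linewidth=2, label='y = x^2')
    ax.set_title('Temporarily styled plot', fontsize=14)
    ax.legend(fontsize=10)
    fig.tight_layout()
# In a script, call plt.show() here; a notebook renders the figure automatically.
```

The list of style names installed with your Matplotlib version is available as `plt.style.available`.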
Histogram¶
Concept: Displays data distribution using frequency bars.
Application: Understand how scores or values are spread across intervals. Histograms are a standard tool for univariate analysis, which looks at how the values of a single column (feature) are distributed.
# Histogram of total_bill (tips-like data)
plt.figure(figsize=(7,4))
plt.hist(tips['total_bill'], bins=20, edgecolor='black')  # distribution of total_bill across 20 equal-width bins
plt.title('Histogram of total_bill')
plt.xlabel('total_bill')
plt.ylabel('Frequency')
plt.show()
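The bar heights in a histogram are just counts per interval, and `np.histogram` returns those counts without drawing anything. A minimal sketch, assuming an independent random sample generated the same way as `total_bill` (the name `values` is illustrative only):

```python
import numpy as np

# Independent sample drawn like total_bill: normal(20, 8), clipped to [3, 60]
rng = np.random.default_rng(0)
values = rng.normal(20, 8, 200).clip(3, 60)

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(values, bins=20)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n}")

# Every value falls in exactly one bin, so the counts add up to the sample size
print("total:", counts.sum())
```

Twenty bins produce 21 edges, because each bin needs a left and a right boundary.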
Boxplot¶
Concept: Shows data distribution through quartiles and detects outliers.
Application: Compare income ranges or test scores between groups.
# Boxplot to compare distributions
plt.figure(figsize=(6,4))
# tick_labels replaces the deprecated labels parameter (Matplotlib 3.9+; use labels= on older versions)
plt.boxplot([tips['total_bill'], tips['tip']], tick_labels=['total_bill','tip'])
plt.title('Boxplot: total_bill vs tip')
plt.ylabel('Value')
plt.show()
For Reference
Calculate Q1, Q3, IQR, lower_bound, upper_bound, and identify outliers.¶
import pandas as pd
# Example data
data = {'Salary': [25, 30, 35, 40, 45, 50, 55, 100, 120, 180]}
df = pd.DataFrame(data)
# Calculate Q1 and Q3
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print("lower_bound: ", lower_bound)
print("upper_bound: ", upper_bound)
# Identify outliers
outliers = df[(df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)]
print("Q1:", Q1)
print("Q3:", Q3)
print("IQR:", IQR)
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)
print("\nOutliers:\n", outliers)
lower_bound: -42.5
upper_bound: 167.5
Q1: 36.25
Q3: 88.75
IQR: 52.5
Lower Bound: -42.5
Upper Bound: 167.5
Outliers:
Salary
9 180
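The bounds computed above also explain where a boxplot's whiskers end. Under Matplotlib's default rule (`whis=1.5`), each whisker stops at the most extreme data point that still lies inside the bounds, not at the bound itself. A small sketch with the same salary values, using NumPy only (the variable names are illustrative):

```python
import numpy as np

salary = np.array([25, 30, 35, 40, 45, 50, 55, 100, 120, 180])

# Quartiles with linear interpolation, matching pandas' default quantile()
q1, q3 = np.percentile(salary, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the bounds
inside = salary[(salary >= lower_bound) & (salary <= upper_bound)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the bounds are drawn individually as outliers ("fliers")
fliers = salary[(salary < lower_bound) | (salary > upper_bound)]

print(whisker_low, whisker_high, fliers)  # 25 120 [180]
```

So although the upper bound is 167.5, the upper whisker would be drawn at 120, and 180 appears as a separate point.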
Scatter plot¶
Concept: Plots two continuous variables to show relationships or correlations.
Application: Analyze relationship between height and weight.
# Scatter plot: tip vs total_bill
plt.figure(figsize=(6,4))
plt.scatter(tips['total_bill'], tips['tip'])
plt.title('Scatter: tip vs total_bill')
plt.xlabel('total_bill')
plt.ylabel('tip')
plt.show()
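A scatter plot shows a relationship visually; the Pearson correlation coefficient puts a number on how linear it is. A minimal sketch, assuming synthetic height and weight data (the arrays and the linear formula below are made up for illustration):

```python
import numpy as np

# Hypothetical data: weight depends roughly linearly on height, plus noise
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 200)
weight_kg = 0.9 * height_cm - 90 + rng.normal(0, 5, 200)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(height_cm, weight_kg)[0, 1]
print(f"Pearson r = {r:.2f}")
```

Values of r near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.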
Heatmap (matrix)¶
Concept: Represents data values as colors on a grid.
Application: Visualize correlation between dataset features.
# Heatmap with matplotlib (correlation/confusion matrix)
corr = df_time[['sales_A','sales_B']].corr()
fig, ax = plt.subplots(figsize=(4,3))
cax = ax.imshow(corr, interpolation='nearest', cmap='coolwarm')  # 'nearest' keeps cells crisp; 'bilinear' would blur them
ax.set_xticks([0,1])
ax.set_yticks([0,1])
ax.set_xticklabels(['sales_A','sales_B'])
ax.set_yticklabels(['sales_A','sales_B'])
fig.colorbar(cax)
ax.set_title('Correlation matrix (matplotlib)')
plt.tight_layout()
plt.show()
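Colors alone can be hard to read precisely, so heatmaps are often annotated with the underlying numbers. A sketch of one way to do this with `ax.text`, assuming a hypothetical 2x2 correlation matrix (the 0.30 value is made up):

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ['sales_A', 'sales_B']
# Hypothetical correlation values for illustration
corr = np.array([[1.0, 0.3],
                 [0.3, 1.0]])

fig, ax = plt.subplots(figsize=(4, 3))
# Fix the color scale to the full correlation range [-1, 1]
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write each value in the center of its cell (x is the column, y is the row)
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr[i, j]:.2f}", ha='center', va='center')

fig.colorbar(im)
fig.tight_layout()
# In a script, call plt.show() here; a notebook renders the figure automatically.
```

Setting `vmin=-1` and `vmax=1` keeps the color scale comparable between different correlation matrices.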
Time-series plotting¶
Concept: A line plot with time on the x-axis, showing how values change over time.
Application: Track sales or weather data across months.
# Time-series: rolling mean example
ts = df_time.set_index('date').sort_index()
ts['sales_A_roll3'] = ts['sales_A'].rolling(window=3, min_periods=1).mean()  # 3-point window; min_periods=1 avoids NaN at the start
fig, ax = plt.subplots(figsize=(8,4))
ax.plot(ts.index, ts['sales_A'], label='Sales A', alpha=.5)
ax.plot(ts.index, ts['sales_A_roll3'], label='3-point Rolling Mean', color='red')
ax.set_title('Time-series with rolling mean')
ax.legend()
plt.tight_layout()
plt.show()
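To see what `rolling(window=3, min_periods=1).mean()` computes, it helps to run it on a tiny series where the averages can be checked by hand. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# Each output is the mean of the current value and up to two previous ones.
# min_periods=1 lets the first rows use however many values are available.
rolled = s.rolling(window=3, min_periods=1).mean()

print(rolled.tolist())  # [10.0, 15.0, 20.0, 30.0]
```

The first two results average only one and two values respectively; from the third row on, the window is full. Without `min_periods=1`, those first two rows would be `NaN`.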
Matplotlib vs Seaborn (Comparison)¶
Matplotlib¶
- Low-level plotting library.
- Gives full control over every element.
- Good for custom, detailed, or complex plots.
- Looks basic by default.
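# Any two equal-length sequences work as x and y; here, the sales data from the setup cell
plt.plot(df_time['date'], df_time['sales_A'])
plt.title("Line Plot")
plt.show()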
Seaborn¶
- Built on top of Matplotlib.
- Used for quick, beautiful, and statistical plots.
- Works directly with Pandas DataFrames.
- Automatically handles styles, colors, and legends.
sns.scatterplot(data=tips, x='total_bill', y='tip')  # column names are read straight from the DataFrame
plt.show()
When to Use¶
| Task | Use |
|---|---|
| Detailed control / custom layout | Matplotlib |
| Quick, pretty, statistical visualization | Seaborn |
✅ Tip: Use Seaborn for quick EDA and Matplotlib for final customization.
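The two libraries combine naturally because Seaborn draws onto ordinary Matplotlib Axes. A sketch of the tip in practice, assuming a made-up DataFrame `demo` with hypothetical `age` and `income` columns:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data for illustration
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'age': rng.integers(20, 65, 100),
    'income': rng.normal(50, 15, 100),
})

# Seaborn draws the plot quickly...
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=demo, x='age', y='income', ax=ax)

# ...and Matplotlib methods on the same Axes handle the final touches
ax.set_title('Income vs age')
ax.set_xlabel('Age (years)')
ax.set_ylabel('Income (thousands)')
fig.tight_layout()
# In a script, call plt.show() here; a notebook renders the figure automatically.
```

Passing `ax=ax` tells Seaborn which Axes to draw on, so everything learned in the Matplotlib sections above still applies to the result.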