Matplotlib Tutorial¶

Matplotlib: Used for creating basic, customizable plots from scratch.
Seaborn: Built on Matplotlib; used for making attractive and statistical plots easily.

Topics to be covered:
Matplotlib:
1- basic plotting
2- figures & axes
3- subplots
4- customization
5- histograms
6- boxplots
7- scatter
8- heatmaps
9- time-series

In [88]:

# Setup: imports and sample data creation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.options.display.max_columns = 50

# Create a reproducible random dataset for examples
np.random.seed(0)
dates = pd.date_range('2024-01-01', periods=20)

df_time = pd.DataFrame({
    'date': dates,
    'sales_A': np.random.randint(50, 200, size=len(dates)),
    'sales_B': np.random.randint(40, 180, size=len(dates))
})

# Small "tips-like" dataset (for seaborn categorical examples)
tips = pd.DataFrame({
    # Gaussian (normal) distribution: mean μ = 20, standard deviation σ = 8,
    # clipped to a minimum of 3 and a maximum of 60
    'total_bill': np.random.normal(20, 8, 200).clip(3, 60),
    'tip': np.random.normal(3, 1, 200).clip(0, 12),
    'size': np.random.randint(1,6,200),
    'sex': np.random.choice(['Male','Female'], 200),
    'day': np.random.choice(['Thur','Fri','Sat','Sun'],200),
    'time': np.random.choice(['Lunch','Dinner'],200)
})

# Small "iris-like" dataset for pairplots and regression
iris_like = pd.DataFrame({
    'sepal_length': np.random.normal(5.8, 0.8, 150),
    'sepal_width': np.random.normal(3.0, 0.4, 150),
    'petal_length': np.random.normal(3.7, 1.5, 150),
    'petal_width': np.random.normal(1.2, 0.5, 150),
    'species': np.random.choice(['setosa','versicolor','virginica'], 150)
})

print("df_time: " + str(len(df_time)))
print("tips: " + str(len(tips)))
print("iris_like: " + str(len(iris_like)))

print('\nDatasets created: df_time (time-series), tips (categorical), iris_like (pair/regression)')
print(df_time.head())
print()
print(tips.head())
print()
print(iris_like.head())

df_time: 20
tips: 200
iris_like: 150

Datasets created: df_time (time-series), tips (categorical), iris_like (pair/regression)
        date  sales_A  sales_B
0 2024-01-01       97      155
1 2024-01-02      167      119
2 2024-01-03      117      122
3 2024-01-04      153      139
4 2024-01-05       59       69

   total_bill       tip  size     sex   day    time
0   19.097609  1.481971     1  Female   Sat   Lunch
1   27.258768  1.106955     3  Female  Thur  Dinner
2   26.522159  2.214913     2  Female   Fri  Dinner
3   21.832784  1.394706     5  Female   Sat   Lunch
4   11.790570  4.431840     5    Male  Thur  Dinner

   sepal_length  sepal_width  petal_length  petal_width     species
0      5.973746     2.404899      2.093385     0.739210   virginica
1      4.592443     3.358717      3.225073     0.755476      setosa
2      6.040855     3.772222      2.087491     1.164487  versicolor
3      6.100678     2.836652      2.341414     1.079713   virginica
4      5.254899     3.261387      3.942793     1.203688      setosa
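
The setup cell draws total_bill from a normal distribution and then clips it to a fixed range. The sketch below shows what that clipping step does on its own. It uses a local generator (rng) rather than the notebook's np.random.seed(0), so the numbers are an illustration and will not match the tips table above.

```python
import numpy as np

# Assumption: a local generator, independent of the notebook's global seed
rng = np.random.default_rng(0)

raw = rng.normal(loc=20, scale=8, size=200)   # mean 20, standard deviation 8
clipped = raw.clip(3, 60)                     # values below 3 become 3, values above 60 become 60

n_changed = int((raw != clipped).sum())       # how many samples the clip actually touched
print("raw range:    ", raw.min().round(2), "to", raw.max().round(2))
print("clipped range:", clipped.min().round(2), "to", clipped.max().round(2))
print("values changed by clipping:", n_changed)
```

Clipping keeps the sample size at 200 but piles any out-of-range values onto the boundaries, so a heavily clipped column is no longer strictly normal.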

Matplotlib — Basic plotting, Figures & Axes¶

Line Plot¶

Concept: Shows how one continuous variable changes with another, most often a trend over time.
Application: Visualize stock prices or temperature changes.

In [105]:

# Basic line plot with matplotlib: figures and axes
fig, ax = plt.subplots(figsize=(8,4))
ax.plot(df_time['date'], df_time['sales_A'], label='Sales A', marker='o')
ax.plot(df_time['date'], df_time['sales_B'], label='Sales B', marker='s')
ax.set_title('Sales over time (basic)')
ax.set_xlabel('Date')
ax.set_ylabel('Sales')
ax.legend()
plt.tight_layout()
plt.show()
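
The cell above relies on plt.subplots() returning two objects: a Figure and an Axes. As a minimal sketch of how they relate, the snippet below inspects both objects with toy data (the values and the 'demo' label are made up for illustration). The figure is closed at the end because only the objects are being inspected.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))        # one Figure containing one Axes
line, = ax.plot([1, 2, 3], [2, 4, 3], marker='o', label='demo')

print(type(fig).__name__)      # class name of the canvas object
print(ax.figure is fig)        # the Axes knows which Figure it belongs to
print(len(fig.axes))           # number of Axes on this Figure
print(len(ax.get_lines()))     # number of line artists drawn on this Axes

plt.close(fig)                 # release the figure without displaying it
```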

Subplots and multiple plots¶

Concept: A Figure is the overall canvas, and Axes are the individual plots inside it.

Application: Compare multiple graphs in a single figure.

In [98]:

# Subplots: multiple axes in one figure
# sharex=True makes both subplots share the same x-axis (same scale and ticks)
fig, axes = plt.subplots(2, 1, figsize=(8,6), sharex=True)
axes[0].plot(df_time['date'], df_time['sales_A'], label='Sales A', color='tab:blue')
axes[0].set_title('Sales A')
axes[1].plot(df_time['date'], df_time['sales_B'], label='Sales B', color='tab:orange')
axes[1].set_title('Sales B')
for ax in axes:
    ax.grid(True)
plt.tight_layout()
plt.show()

Customizing plots (styles, colors, fonts)¶

Concept: Styles, markers, line styles, line widths, and font sizes change how a plot looks without changing the data.

Application: Make overlapping series easy to tell apart and keep labels readable.

In [46]:

# Customizing style, markers, linewidth, font sizes
plt.style.use('seaborn-v0_8-darkgrid')
fig, ax = plt.subplots(figsize=(8,4))
ax.plot(df_time['date'], df_time['sales_A'], marker='o', linestyle='--', linewidth=2, label='A')
ax.plot(df_time['date'], df_time['sales_B'], marker='d', linestyle='-', linewidth=1.5, label='B')
ax.set_title('Customized sales plot', fontsize=14)
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Sales', fontsize=12)
ax.tick_params(axis='x', rotation=30)
ax.legend(fontsize=10)
plt.tight_layout()
plt.show()
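
Note that plt.style.use() changes the style for every plot drawn afterwards in the same session. If a style should apply to one plot only, one option is the plt.style.context() context manager. The sketch below assumes the built-in 'ggplot' style and uses toy data.

```python
import matplotlib.pyplot as plt

before = plt.rcParams['axes.facecolor']       # setting in effect before the block

with plt.style.context('ggplot'):             # 'ggplot' is one of Matplotlib's built-in styles
    inside = plt.rcParams['axes.facecolor']   # setting while the style is active
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot([1, 2, 3, 4], [3, 1, 4, 2], marker='o', linestyle='--', linewidth=2)
    ax.set_title('Styled only inside the with-block', fontsize=12)
    plt.show()

after = plt.rcParams['axes.facecolor']        # restored once the block ends
print("style restored:", before == after)
```

The names in plt.style.available show which styles a given Matplotlib installation provides.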

Histogram¶

Concept: Displays data distribution using frequency bars.

Application: Understand how scores or values are spread across intervals. Histograms are a standard tool for univariate analysis, which looks at how the values of a single column are distributed.

In [47]:

# Histogram of total_bill (tips-like data)
plt.figure(figsize=(7,4))
plt.hist(tips['total_bill'], bins=20, edgecolor='black')  # distribution of total_bill across 20 bins
plt.title('Histogram of total_bill')
plt.xlabel('total_bill')
plt.ylabel('Frequency')
plt.show()
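
The bar heights in a histogram are counts per bin. As a rough sketch of the numbers behind such a plot, np.histogram returns those counts together with the bin edges. The data here is a stand-in generated with a local generator, not the notebook's tips['total_bill'].

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(20, 8, 200).clip(3, 60)   # assumption: stand-in for tips['total_bill']

counts, edges = np.histogram(values, bins=20) # 20 bins need 21 edges
print("bins:", len(counts), "edges:", len(edges))
print("total count:", counts.sum())           # every value falls into exactly one bin

# First three bins as intervals with their frequencies
for left, right, n in zip(edges[:3], edges[1:4], counts[:3]):
    print(f"[{left:.2f}, {right:.2f}): {n}")
```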

Boxplot¶

Concept: Shows data distribution through quartiles and detects outliers.

Application: Compare income ranges or test scores between groups.

In [49]:

# Boxplot to compare distributions
# Matplotlib 3.9 renamed the 'labels' parameter to 'tick_labels';
# on older versions, pass labels=[...] instead.
plt.figure(figsize=(6,4))
plt.boxplot([tips['total_bill'], tips['tip']], tick_labels=['total_bill','tip'])
plt.title('Boxplot: total_bill vs tip')
plt.ylabel('Value')
plt.show()

For Reference

Calculate Q1, Q3, IQR, lower_bound, and upper_bound, and identify outliers.¶

In [87]:

import pandas as pd

# Example data
data = {'Salary': [25, 30, 35, 40, 45, 50, 55, 100, 120, 180]}
df = pd.DataFrame(data)

# Calculate Q1 and Q3
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("lower_bound: ", lower_bound)
print("upper_bound: ", upper_bound)

# Identify outliers
outliers = df[(df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)]

print("Q1:", Q1)
print("Q3:", Q3)
print("IQR:", IQR)
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)
print("\nOutliers:\n", outliers)

lower_bound:  -42.5
upper_bound:  167.5
Q1: 36.25
Q3: 88.75
IQR: 52.5
Lower Bound: -42.5
Upper Bound: 167.5

Outliers:
    Salary
9     180
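
The same 1.5 × IQR rule is what a Matplotlib boxplot uses by default (whis=1.5) to decide which points to draw as outliers. The sketch below reuses the Salary values from the cell above and compares the manual result with the "fliers" that boxplot() reports. The variable names are chosen here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

salary = np.array([25, 30, 35, 40, 45, 50, 55, 100, 120, 180])

# Manual IQR rule, as in the pandas cell above
q1, q3 = np.percentile(salary, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
manual = salary[(salary < lower) | (salary > upper)]

# Points Matplotlib draws beyond the whiskers
fig, ax = plt.subplots(figsize=(4, 4))
result = ax.boxplot(salary)                   # default whis=1.5
fliers = result['fliers'][0].get_ydata()

print("manual outliers:", manual)
print("boxplot fliers: ", fliers)
ax.set_title('Salary boxplot')
plt.show()
```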

Scatter plot¶

Concept: Plots two continuous variables to show relationships or correlations.

Application: Analyze relationship between height and weight.

In [51]:

# Scatter plot: tip vs total_bill
plt.figure(figsize=(6,4))
plt.scatter(tips['total_bill'], tips['tip'])
plt.title('Scatter: tip vs total_bill')
plt.xlabel('total_bill')
plt.ylabel('tip')
plt.show()
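
In the setup cell, tip and total_bill are drawn independently, so the scatter plot above should show no clear trend. To put a number on what a scatter plot suggests, one common choice is the Pearson correlation coefficient. The sketch below builds two hypothetical tip columns, one that depends on the bill and one that does not, and compares their correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
bill = rng.normal(20, 8, 200).clip(3, 60)

demo = pd.DataFrame({
    'bill': bill,
    'tip_related': 0.15 * bill + rng.normal(0, 1, 200),  # hypothetical: tip grows with the bill
    'tip_unrelated': rng.normal(3, 1, 200),              # hypothetical: tip ignores the bill
})

r_related = demo['bill'].corr(demo['tip_related'])       # Pearson r by default
r_unrelated = demo['bill'].corr(demo['tip_unrelated'])
print(f"related:   r = {r_related:.2f}")
print(f"unrelated: r = {r_unrelated:.2f}")
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none.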

Heatmap (matrix)¶

Concept: Represents data values as colors on a grid.

Application: Visualize correlation between dataset features.

In [100]:

# Heatmap with matplotlib (works for correlation or confusion matrices)
corr = df_time[['sales_A','sales_B']].corr()
fig, ax = plt.subplots(figsize=(4,3))
# vmin/vmax pin the color scale to the full correlation range [-1, 1];
# interpolation='nearest' draws crisp cells ('bilinear' would blur them)
cax = ax.imshow(corr, interpolation='nearest', cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks([0,1])
ax.set_yticks([0,1])
ax.set_xticklabels(['sales_A','sales_B'])
ax.set_yticklabels(['sales_A','sales_B'])
fig.colorbar(cax)
ax.set_title('Correlation matrix (matplotlib)')
plt.tight_layout()
plt.show()
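
Colors alone make exact values hard to read, so a heatmap is often annotated with the number in each cell. The sketch below does this with ax.text(). The correlation values are hypothetical placeholders, not the ones computed from df_time.

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ['sales_A', 'sales_B']
corr = np.array([[1.0, 0.3],
                 [0.3, 1.0]])                 # hypothetical values for illustration only

fig, ax = plt.subplots(figsize=(4, 3))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write each value in the middle of its cell (x = column index, y = row index)
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr[i, j]:.2f}", ha='center', va='center')

fig.colorbar(im)
plt.tight_layout()
plt.show()
```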

Time-series plotting¶

Concept: Special line plot showing data points over time.

Application: Track sales or weather data across months.

In [68]:

# Time-series: rolling mean example
ts = df_time.set_index('date').sort_index()
# 3-point rolling mean; min_periods=1 averages whatever is available at the start
ts['sales_A_roll3'] = ts['sales_A'].rolling(window=3, min_periods=1).mean()
fig, ax = plt.subplots(figsize=(8,4))
ax.plot(ts.index, ts['sales_A'], label='Sales A', alpha=.5)
ax.plot(ts.index, ts['sales_A_roll3'], label='3-point Rolling Mean', color='red')
ax.set_title('Time-series with rolling mean')
ax.legend()
plt.tight_layout()
plt.show()
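
To see what the rolling mean computes, here is a small worked example on five made-up values. It compares the default behavior, which yields NaN until the window is full, with min_periods=1 as used in the cell above.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])           # toy values for illustration

full = s.rolling(window=3).mean()                     # NaN until 3 values are available
partial = s.rolling(window=3, min_periods=1).mean()   # averages whatever is available so far

print(full.tolist())      # [nan, nan, 20.0, 30.0, 40.0]
print(partial.tolist())   # [10.0, 15.0, 20.0, 30.0, 40.0]
```

With min_periods=1, the first two points are averages over one and two values, so the smoothed line starts at the first date instead of the third.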

Matplotlib vs Seaborn (Comparison)¶

Matplotlib¶

Low-level plotting library.
Gives full control over every element.
Good for custom, detailed, or complex plots.
Looks basic by default.

plt.plot(df_time['date'], df_time['sales_A'])
plt.title("Line Plot")
plt.show()

Seaborn¶

Built on top of Matplotlib.
Used for quick, beautiful, and statistical plots.
Works directly with Pandas DataFrames.
Automatically handles styles, colors, and legends.

sns.scatterplot(data=tips, x='total_bill', y='tip')
plt.show()

When to Use¶

Task	Use
Detailed control / custom layout	Matplotlib
Quick, pretty, statistical visualization	Seaborn

✅ Tip: Use Seaborn for quick EDA and Matplotlib for final customization.
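
The two libraries can be combined in one plot, because Seaborn's axes-level functions such as scatterplot draw onto a Matplotlib Axes. The sketch below lets Seaborn handle the grouping and colors and then uses Matplotlib calls for the final touches. The demo DataFrame is a made-up stand-in for the tips data, and the titles are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
demo = pd.DataFrame({                         # assumption: small stand-in for the tips dataset
    'total_bill': rng.normal(20, 8, 100).clip(3, 60),
    'tip': rng.normal(3, 1, 100).clip(0, 12),
    'time': rng.choice(['Lunch', 'Dinner'], 100),
})

fig, ax = plt.subplots(figsize=(6, 4))
# Seaborn: one call handles the grouping by 'time', the colors, and the legend
sns.scatterplot(data=demo, x='total_bill', y='tip', hue='time', ax=ax)

# Matplotlib: fine-tune the result on the same Axes
ax.set_title('Tip vs total bill')
ax.set_xlabel('Total bill')
ax.set_ylabel('Tip')
ax.get_legend().set_title('Time of day')
plt.tight_layout()
plt.show()
```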

Comprehensive Tutorial on EDA, Matplotlib, and Seaborn