Watch this video for a hands-on explanation and deeper insights:

1.ProjectNotebook

What is Pandas?

Pandas is a powerful and popular open-source Python library used for data manipulation and analysis

Data structures:

Series: A 1-dimensional labeled array (like a column).
DataFrame: A 2-dimensional labeled data structure (like a csv file, table).

Load and Explore IPL 2024 Dataset¶

Download Dataset: https://www.kaggle.com/datasets/adityabhaumik/ipl-2024-matches

In [1]:

import pandas as pd #Install pandas if error

In [63]:

df = pd.read_csv('ipl2024 Matches.csv')

In [64]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   id                   75 non-null     int64 
 1   date                 75 non-null     object
 2   team1                75 non-null     object
 3   team2                75 non-null     object
 4   toss_winner          75 non-null     object
 5   decision             72 non-null     object
 6   first_score          75 non-null     int64 
 7   first_wkts           75 non-null     int64 
 8   second_score         75 non-null     int64 
 9   second_wkts          75 non-null     int64 
 10  winner               75 non-null     object
 11  player_of_the_match  72 non-null     object
 12  most_runs            72 non-null     object
 13  most_wkts            72 non-null     object
dtypes: int64(5), object(9)
memory usage: 8.3+ KB

In [65]:

df.describe()

Out[65]:

	id	first_score	first_wkts	second_score	second_wkts
count	75.000000	75.000000	75.000000	75.000000	75.000000
mean	37.986667	179.653333	6.200000	168.320000	5.733333
std	21.771847	52.091264	2.493234	47.753353	2.946979
min	1.000000	0.000000	0.000000	0.000000	0.000000
25%	19.500000	162.000000	4.500000	145.000000	3.000000
50%	38.000000	182.000000	6.000000	173.000000	6.000000
75%	56.500000	208.000000	8.000000	197.500000	8.000000
max	74.000000	277.000000	10.000000	262.000000	10.000000

In [66]:

print(df)

    id           date      team1      team2 toss_winner decision  first_score  \
0    1  March 22,2024  Banglore     Chennai    Banglore      Bat          173   
1    2  March 23,2024      Delhi     Punjab      Punjab    Field          174   
2    3  March 23,2024    Kolkata  Hyderabad   Hyderabad    Field          208   
3    4  March 24,2024  Rajasthan    Lucknow   Rajasthan      Bat          193   
4    5  March 24,2024   Gujarat      Mumbai      Mumbai    Field          168   
..  ..            ...        ...        ...         ...      ...          ...   
70  71    May 21,2024  Hyderabad    Kolkata   Hyderabad      Bat          159   
71  72    May 22,2024   Banglore  Rajasthan   Rajasthan    Field          172   
72  73    May 24,2024  Hyderabad  Rajasthan   Rajasthan    Field          175   
73  74    May 26,2024  Hyderabad    Kolkata   Hyderabad      Bat          113   
74  74    May 26,2024  Hyderabad    Kolkata   Hyderabad      Bat          113   

    first_wkts  second_score  second_wkts     winner  player_of_the_match  \
0            6           176            4    Chennai    Mustafizur Rahman   
1            9           177            6     Punjab           Sam Curran   
2            7           204            7    Kolkata        Andre Russell   
3            4           173            6  Rajasthan         Sanju Samson   
4            6           162            9    Gujarat        Sai Sudharsan   
..         ...           ...          ...        ...                  ...   
70          10           164            2    Kolkata       Mitchell Starc   
71           8           174            6  Rajasthan  Ravichandran Ashwin   
72           9           139            7  Hyderabad        Shahbaz Ahmed   
73          10           114            2    Kolkata       Mitchell Starc   
74          10           114            2    Kolkata       Mitchell Starc   

           most_runs          most_wkts  
0         Anuj Rawat  Mustafizur Rahman  
1         Sam Curran      Kuldeep Yadav  
2      Andre Russell        T Natarajan  
3       Sanju Samson        Trent Boult  
4      Dewald Brevis     Jasprit Bumrah  
..               ...                ...  
70      Shreyas Iyer     Mitchell Starc  
71  Yashasvi Jaiswal         Avesh Khan  
72      Dhruv Jurel       Shahbaz Ahmed  
73    Venkatesh Iyer      Andre Russell  
74    Venkatesh Iyer      Andre Russell  

[75 rows x 14 columns]

In [67]:

df.head(2)

Out[67]:

	id	date	team1	team2	toss_winner	decision	first_score	first_wkts	second_score	second_wkts	winner	player_of_the_match	most_runs	most_wkts
0	1	March 22,2024	Banglore	Chennai	Banglore	Bat	173	6	176	4	Chennai	Mustafizur Rahman	Anuj Rawat	Mustafizur Rahman
1	2	March 23,2024	Delhi	Punjab	Punjab	Field	174	9	177	6	Punjab	Sam Curran	Sam Curran	Kuldeep Yadav

In [68]:

df.tail(2)

Out[68]:

	id	date	team1	team2	toss_winner	decision	first_score	first_wkts	second_score	second_wkts	winner	player_of_the_match	most_runs	most_wkts
73	74	May 26,2024	Hyderabad	Kolkata	Hyderabad	Bat	113	10	114	2	Kolkata	Mitchell Starc	Venkatesh Iyer	Andre Russell
74	74	May 26,2024	Hyderabad	Kolkata	Hyderabad	Bat	113	10	114	2	Kolkata	Mitchell Starc	Venkatesh Iyer	Andre Russell

In [69]:

print(len(df))

In [70]:

print(df)

    id           date      team1      team2 toss_winner decision  first_score  \
0    1  March 22,2024  Banglore     Chennai    Banglore      Bat          173   
1    2  March 23,2024      Delhi     Punjab      Punjab    Field          174   
2    3  March 23,2024    Kolkata  Hyderabad   Hyderabad    Field          208   
3    4  March 24,2024  Rajasthan    Lucknow   Rajasthan      Bat          193   
4    5  March 24,2024   Gujarat      Mumbai      Mumbai    Field          168   
..  ..            ...        ...        ...         ...      ...          ...   
70  71    May 21,2024  Hyderabad    Kolkata   Hyderabad      Bat          159   
71  72    May 22,2024   Banglore  Rajasthan   Rajasthan    Field          172   
72  73    May 24,2024  Hyderabad  Rajasthan   Rajasthan    Field          175   
73  74    May 26,2024  Hyderabad    Kolkata   Hyderabad      Bat          113   
74  74    May 26,2024  Hyderabad    Kolkata   Hyderabad      Bat          113   

    first_wkts  second_score  second_wkts     winner  player_of_the_match  \
0            6           176            4    Chennai    Mustafizur Rahman   
1            9           177            6     Punjab           Sam Curran   
2            7           204            7    Kolkata        Andre Russell   
3            4           173            6  Rajasthan         Sanju Samson   
4            6           162            9    Gujarat        Sai Sudharsan   
..         ...           ...          ...        ...                  ...   
70          10           164            2    Kolkata       Mitchell Starc   
71           8           174            6  Rajasthan  Ravichandran Ashwin   
72           9           139            7  Hyderabad        Shahbaz Ahmed   
73          10           114            2    Kolkata       Mitchell Starc   
74          10           114            2    Kolkata       Mitchell Starc   

           most_runs          most_wkts  
0         Anuj Rawat  Mustafizur Rahman  
1         Sam Curran      Kuldeep Yadav  
2      Andre Russell        T Natarajan  
3       Sanju Samson        Trent Boult  
4      Dewald Brevis     Jasprit Bumrah  
..               ...                ...  
70      Shreyas Iyer     Mitchell Starc  
71  Yashasvi Jaiswal         Avesh Khan  
72      Dhruv Jurel       Shahbaz Ahmed  
73    Venkatesh Iyer      Andre Russell  
74    Venkatesh Iyer      Andre Russell  

[75 rows x 14 columns]

Remove Duplicate Match Entries¶

In [71]:

df.drop_duplicates(inplace=True)

In [72]:

len(df)

Out[72]:

In [73]:

df.tail(2)

Out[73]:

	id	date	team1	team2	toss_winner	decision	first_score	first_wkts	second_score	second_wkts	winner	player_of_the_match	most_runs	most_wkts
72	73	May 24,2024	Hyderabad	Rajasthan	Rajasthan	Field	175	9	139	7	Hyderabad	Shahbaz Ahmed	Dhruv Jurel	Shahbaz Ahmed
73	74	May 26,2024	Hyderabad	Kolkata	Hyderabad	Bat	113	10	114	2	Kolkata	Mitchell Starc	Venkatesh Iyer	Andre Russell

Identify and Handle Missing Values (NaN/Blank/None)¶

In [74]:

print(df['player_of_the_match'].isnull().sum())

In [75]:

df['player_of_the_match'].fillna(0, inplace=True)

missing_count = df['player_of_the_match'].isnull().sum()

print(f"Missing Count: {missing_count}")

Missing Count: 0

In [76]:

a = df['second_score'].max()

print(round(a,0))

In [77]:

df['player_of_the_match']

Out[77]:

0       Mustafizur Rahman
1              Sam Curran
2           Andre Russell
3            Sanju Samson
4           Sai Sudharsan
             ...         
69                      0
70         Mitchell Starc
71    Ravichandran Ashwin
72          Shahbaz Ahmed
73         Mitchell Starc
Name: player_of_the_match, Length: 74, dtype: object

Clean Team Names and Text Columns¶

In [78]:

df['team1'] = df['team1'].str.strip()

In [79]:

df['team1']

Out[79]:

0      Banglore
1         Delhi
2       Kolkata
3     Rajasthan
4       Gujarat
        ...    
69    Rajasthan
70    Hyderabad
71     Banglore
72    Hyderabad
73    Hyderabad
Name: team1, Length: 74, dtype: object

Fix Typos/Wrong Formats
(Convert Data Types – Scores)

In [80]:

df['first_score'] = pd.to_numeric(df['first_score'])

In [81]:

df['first_score']

Out[81]:

0     173
1     174
2     208
3     193
4     168
     ... 
69      0
70    159
71    172
72    175
73    113
Name: first_score, Length: 74, dtype: int64

Add New Columns (WinMargin)¶

In [82]:

df['winMargin'] = abs(df['first_score'] - df['second_score'])

In [83]:

df.head(3)

Out[83]:

	id	date	team1	team2	toss_winner	decision	first_score	first_wkts	second_score	second_wkts	winner	player_of_the_match	most_runs	most_wkts	winMargin
0	1	March 22,2024	Banglore	Chennai	Banglore	Bat	173	6	176	4	Chennai	Mustafizur Rahman	Anuj Rawat	Mustafizur Rahman	3
1	2	March 23,2024	Delhi	Punjab	Punjab	Field	174	9	177	6	Punjab	Sam Curran	Sam Curran	Kuldeep Yadav	3
2	3	March 23,2024	Kolkata	Hyderabad	Hyderabad	Field	208	7	204	7	Kolkata	Andre Russell	Andre Russell	T Natarajan	4

Export Cleaned Dataset¶

In [84]:

df.to_csv('IPL2024_CleanedData.csv', index=False)

Quiz Time¶

Q1. After loading a dataset, which method would you use to quickly see the statistical summary like mean, median, min, and max?

a) df.head()
b) df.describe()
c) df.info()
d) df.value_counts()

Q2. What will happen if we run the following code?

df[“player_of_the_match”].fillna(0, inplace=True)

a) Replace NaN with None
b) Replace NaN with 0
c) Delete rows with NaN
d) Nothing happens

Q3. Which function counts missing values in a column?

a) df.isna().count()
b) df.null().count()
c) df.isnull().sum()
d) df.countnull()

Q4. If a column contains values like “100”, “200”, “300”, and you want to convert it into integers, which function is best?

a) astype(str)
b) pd.to_numeric()
c) astype(float)
d) str.strip()

Q5. Which of the following correctly saves the dataframe without index column?

a) df.to_csv(“file.csv”)
b) df.save(“file.csv”)
c) df.to_csv(“file.csv”, index=False)
d) df.write_csv(“file.csv”)

Write down the answers on a paper or notepad and cross verify later!

Answers:

Q1- b) df.describe()
Q2- b) Replace NaN with 0
Q3- c) df.isnull().sum()
Q4- b) pd.to_numeric()
Q5- c) df.to_csv(“file.csv”, index=False)