Watch this video for a hands-on explanation and deeper insights:
What is Pandas?
- Pandas is a powerful and popular open-source Python library used for data manipulation and analysis
Data structures:
- Series: A 1-dimensional labeled array (like a column).
- DataFrame: A 2-dimensional labeled data structure (like a csv file, table).
Load and Explore IPL 2024 Dataset¶
Download Dataset: https://www.kaggle.com/datasets/adityabhaumik/ipl-2024-matches
import pandas as pd #Install pandas if error
df = pd.read_csv('ipl2024 Matches.csv')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 75 entries, 0 to 74 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 75 non-null int64 1 date 75 non-null object 2 team1 75 non-null object 3 team2 75 non-null object 4 toss_winner 75 non-null object 5 decision 72 non-null object 6 first_score 75 non-null int64 7 first_wkts 75 non-null int64 8 second_score 75 non-null int64 9 second_wkts 75 non-null int64 10 winner 75 non-null object 11 player_of_the_match 72 non-null object 12 most_runs 72 non-null object 13 most_wkts 72 non-null object dtypes: int64(5), object(9) memory usage: 8.3+ KB
df.describe()
id | first_score | first_wkts | second_score | second_wkts | |
---|---|---|---|---|---|
count | 75.000000 | 75.000000 | 75.000000 | 75.000000 | 75.000000 |
mean | 37.986667 | 179.653333 | 6.200000 | 168.320000 | 5.733333 |
std | 21.771847 | 52.091264 | 2.493234 | 47.753353 | 2.946979 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 19.500000 | 162.000000 | 4.500000 | 145.000000 | 3.000000 |
50% | 38.000000 | 182.000000 | 6.000000 | 173.000000 | 6.000000 |
75% | 56.500000 | 208.000000 | 8.000000 | 197.500000 | 8.000000 |
max | 74.000000 | 277.000000 | 10.000000 | 262.000000 | 10.000000 |
print(df)
id date team1 team2 toss_winner decision first_score \ 0 1 March 22,2024 Banglore Chennai Banglore Bat 173 1 2 March 23,2024 Delhi Punjab Punjab Field 174 2 3 March 23,2024 Kolkata Hyderabad Hyderabad Field 208 3 4 March 24,2024 Rajasthan Lucknow Rajasthan Bat 193 4 5 March 24,2024 Gujarat Mumbai Mumbai Field 168 .. .. ... ... ... ... ... ... 70 71 May 21,2024 Hyderabad Kolkata Hyderabad Bat 159 71 72 May 22,2024 Banglore Rajasthan Rajasthan Field 172 72 73 May 24,2024 Hyderabad Rajasthan Rajasthan Field 175 73 74 May 26,2024 Hyderabad Kolkata Hyderabad Bat 113 74 74 May 26,2024 Hyderabad Kolkata Hyderabad Bat 113 first_wkts second_score second_wkts winner player_of_the_match \ 0 6 176 4 Chennai Mustafizur Rahman 1 9 177 6 Punjab Sam Curran 2 7 204 7 Kolkata Andre Russell 3 4 173 6 Rajasthan Sanju Samson 4 6 162 9 Gujarat Sai Sudharsan .. ... ... ... ... ... 70 10 164 2 Kolkata Mitchell Starc 71 8 174 6 Rajasthan Ravichandran Ashwin 72 9 139 7 Hyderabad Shahbaz Ahmed 73 10 114 2 Kolkata Mitchell Starc 74 10 114 2 Kolkata Mitchell Starc most_runs most_wkts 0 Anuj Rawat Mustafizur Rahman 1 Sam Curran Kuldeep Yadav 2 Andre Russell T Natarajan 3 Sanju Samson Trent Boult 4 Dewald Brevis Jasprit Bumrah .. ... ... 70 Shreyas Iyer Mitchell Starc 71 Yashasvi Jaiswal Avesh Khan 72 Dhruv Jurel Shahbaz Ahmed 73 Venkatesh Iyer Andre Russell 74 Venkatesh Iyer Andre Russell [75 rows x 14 columns]
df.head(2)
id | date | team1 | team2 | toss_winner | decision | first_score | first_wkts | second_score | second_wkts | winner | player_of_the_match | most_runs | most_wkts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | March 22,2024 | Banglore | Chennai | Banglore | Bat | 173 | 6 | 176 | 4 | Chennai | Mustafizur Rahman | Anuj Rawat | Mustafizur Rahman |
1 | 2 | March 23,2024 | Delhi | Punjab | Punjab | Field | 174 | 9 | 177 | 6 | Punjab | Sam Curran | Sam Curran | Kuldeep Yadav |
df.tail(2)
id | date | team1 | team2 | toss_winner | decision | first_score | first_wkts | second_score | second_wkts | winner | player_of_the_match | most_runs | most_wkts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | 74 | May 26,2024 | Hyderabad | Kolkata | Hyderabad | Bat | 113 | 10 | 114 | 2 | Kolkata | Mitchell Starc | Venkatesh Iyer | Andre Russell |
74 | 74 | May 26,2024 | Hyderabad | Kolkata | Hyderabad | Bat | 113 | 10 | 114 | 2 | Kolkata | Mitchell Starc | Venkatesh Iyer | Andre Russell |
print(len(df))
75
print(df)
id date team1 team2 toss_winner decision first_score \ 0 1 March 22,2024 Banglore Chennai Banglore Bat 173 1 2 March 23,2024 Delhi Punjab Punjab Field 174 2 3 March 23,2024 Kolkata Hyderabad Hyderabad Field 208 3 4 March 24,2024 Rajasthan Lucknow Rajasthan Bat 193 4 5 March 24,2024 Gujarat Mumbai Mumbai Field 168 .. .. ... ... ... ... ... ... 70 71 May 21,2024 Hyderabad Kolkata Hyderabad Bat 159 71 72 May 22,2024 Banglore Rajasthan Rajasthan Field 172 72 73 May 24,2024 Hyderabad Rajasthan Rajasthan Field 175 73 74 May 26,2024 Hyderabad Kolkata Hyderabad Bat 113 74 74 May 26,2024 Hyderabad Kolkata Hyderabad Bat 113 first_wkts second_score second_wkts winner player_of_the_match \ 0 6 176 4 Chennai Mustafizur Rahman 1 9 177 6 Punjab Sam Curran 2 7 204 7 Kolkata Andre Russell 3 4 173 6 Rajasthan Sanju Samson 4 6 162 9 Gujarat Sai Sudharsan .. ... ... ... ... ... 70 10 164 2 Kolkata Mitchell Starc 71 8 174 6 Rajasthan Ravichandran Ashwin 72 9 139 7 Hyderabad Shahbaz Ahmed 73 10 114 2 Kolkata Mitchell Starc 74 10 114 2 Kolkata Mitchell Starc most_runs most_wkts 0 Anuj Rawat Mustafizur Rahman 1 Sam Curran Kuldeep Yadav 2 Andre Russell T Natarajan 3 Sanju Samson Trent Boult 4 Dewald Brevis Jasprit Bumrah .. ... ... 70 Shreyas Iyer Mitchell Starc 71 Yashasvi Jaiswal Avesh Khan 72 Dhruv Jurel Shahbaz Ahmed 73 Venkatesh Iyer Andre Russell 74 Venkatesh Iyer Andre Russell [75 rows x 14 columns]
Remove Duplicate Match Entries¶
df.drop_duplicates(inplace=True)
len(df)
74
df.tail(2)
id | date | team1 | team2 | toss_winner | decision | first_score | first_wkts | second_score | second_wkts | winner | player_of_the_match | most_runs | most_wkts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
72 | 73 | May 24,2024 | Hyderabad | Rajasthan | Rajasthan | Field | 175 | 9 | 139 | 7 | Hyderabad | Shahbaz Ahmed | Dhruv Jurel | Shahbaz Ahmed |
73 | 74 | May 26,2024 | Hyderabad | Kolkata | Hyderabad | Bat | 113 | 10 | 114 | 2 | Kolkata | Mitchell Starc | Venkatesh Iyer | Andre Russell |
Identify and Handle Missing Values (NaN/Blank/None)¶
print(df['player_of_the_match'].isnull().sum())
3
df['player_of_the_match'].fillna(0, inplace=True)
missing_count = df['player_of_the_match'].isnull().sum()
print(f"Missing Count: {missing_count}")
Missing Count: 0
a = df['second_score'].max()
print(round(a,0))
262
df['player_of_the_match']
0 Mustafizur Rahman 1 Sam Curran 2 Andre Russell 3 Sanju Samson 4 Sai Sudharsan ... 69 0 70 Mitchell Starc 71 Ravichandran Ashwin 72 Shahbaz Ahmed 73 Mitchell Starc Name: player_of_the_match, Length: 74, dtype: object
Clean Team Names and Text Columns¶
df['team1'] = df['team1'].str.strip()
df['team1']
0 Banglore 1 Delhi 2 Kolkata 3 Rajasthan 4 Gujarat ... 69 Rajasthan 70 Hyderabad 71 Banglore 72 Hyderabad 73 Hyderabad Name: team1, Length: 74, dtype: object
Fix Typos/Wrong Formats
(Convert Data Types – Scores)
df['first_score'] = pd.to_numeric(df['first_score'])
df['first_score']
0 173 1 174 2 208 3 193 4 168 ... 69 0 70 159 71 172 72 175 73 113 Name: first_score, Length: 74, dtype: int64
Add New Columns (WinMargin)¶
df['winMargin'] = abs(df['first_score'] - df['second_score'])
df.head(3)
id | date | team1 | team2 | toss_winner | decision | first_score | first_wkts | second_score | second_wkts | winner | player_of_the_match | most_runs | most_wkts | winMargin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | March 22,2024 | Banglore | Chennai | Banglore | Bat | 173 | 6 | 176 | 4 | Chennai | Mustafizur Rahman | Anuj Rawat | Mustafizur Rahman | 3 |
1 | 2 | March 23,2024 | Delhi | Punjab | Punjab | Field | 174 | 9 | 177 | 6 | Punjab | Sam Curran | Sam Curran | Kuldeep Yadav | 3 |
2 | 3 | March 23,2024 | Kolkata | Hyderabad | Hyderabad | Field | 208 | 7 | 204 | 7 | Kolkata | Andre Russell | Andre Russell | T Natarajan | 4 |
Export Cleaned Dataset¶
df.to_csv('IPL2024_CleanedData.csv', index=False)
Quiz Time¶
Q1. After loading a dataset, which method would you use to quickly see the statistical summary like mean, median, min, and max?
a) df.head()
b) df.describe()
c) df.info()
d) df.value_counts()
Q2. What will happen if we run the following code?
df[“player_of_the_match”].fillna(0, inplace=True)
a) Replace NaN with None
b) Replace NaN with 0
c) Delete rows with NaN
d) Nothing happens
Q3. Which function counts missing values in a column?
a) df.isna().count()
b) df.null().count()
c) df.isnull().sum()
d) df.countnull()
Q4. If a column contains values like “100”, “200”, “300”, and you want to convert it into integers, which function is best?
a) astype(str)
b) pd.to_numeric()
c) astype(float)
d) str.strip()
Q5. Which of the following correctly saves the dataframe without index column?
a) df.to_csv(“file.csv”)
b) df.save(“file.csv”)
c) df.to_csv(“file.csv”, index=False)
d) df.write_csv(“file.csv”)
Write down the answers on a paper or notepad and cross verify later!
Answers:
Q1- b) df.describe()
Q2- b) Replace NaN with 0
Q3- c) df.isnull().sum()
Q4- b) pd.to_numeric()
Q5- c) df.to_csv(“file.csv”, index=False)
Write in Comment Box:
Your Name: Quiz-4/5, 80%, 60%