Learn Pandas for Data Science (Course IV)
July 10, 2020 2020-08-04 10:47
Inspecting data using Pandas
When working with data, it is important to inspect it first. Summary properties such as counts, means, standard deviations, minimum and maximum values, and data types tell us a great deal about the data we’re working with. Pandas provides simple methods for extracting these basic insights from a DataFrame, and this chapter covers several of them.
For this chapter, we will use the COVID-19 dataset from Kaggle. Download the data from this link and save the file as data.csv in the same folder as your Python notebook. You can then load the data into the notebook as follows:
# Making necessary imports
import pandas as pd

# Loading the dataset
df = pd.read_csv("data.csv")
Note: This dataset is updated frequently, so the values shown in this chapter may differ slightly from what you see when you try the dataset yourself. The process remains the same.
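A quick way to confirm that a file loaded as expected is to check the shape and column names right after reading it. The sketch below is only an illustration: it reads a small, made-up CSV from an in-memory string (the name csv_text and its values are assumptions, not part of the Kaggle dataset) so that it runs without data.csv on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for data.csv (values invented for illustration)
csv_text = """Country/Region,Confirmed,Deaths
A,10,1
B,20,2
C,30,3
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO
df_check = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
n_rows, n_cols = df_check.shape
print(n_rows, n_cols)
print(list(df_check.columns))
```

With the real file, the same check would be df.shape after pd.read_csv("data.csv").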
Display top n rows of a Pandas DataFrame
The pandas.DataFrame.head method is used to display the top n rows of the DataFrame.
# Display top 3 rows
df.head(n=3)
If the number of rows (n) is not specified, the top 5 rows are displayed by default.
# Displays top 5 rows by default
df.head()
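The behaviour of head can be checked on a small DataFrame. This is a minimal sketch using a made-up frame (the name sample and its values are assumptions for illustration, not rows from the COVID-19 data).

```python
import pandas as pd

# Hypothetical stand-in for the COVID-19 data (values invented for illustration)
sample = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D", "E", "F", "G"],
    "Confirmed": [10, 20, 30, 40, 50, 60, 70],
})

top_three = sample.head(n=3)   # first 3 rows
top_default = sample.head()    # first 5 rows, because n defaults to 5

print(top_three)
print(len(top_default))
```

Note that head returns a new DataFrame, so the result can be stored in a variable and inspected further.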
Display bottom n rows of a Pandas DataFrame
The pandas.DataFrame.tail method is used to display the bottom n rows of the DataFrame. As with the pandas.DataFrame.head method, if no number is passed to it, it displays the bottom 5 rows of the DataFrame.
# Display bottom 5 rows
df.tail()
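As with head, an explicit n can be passed to tail. The following sketch uses a made-up frame (sample and its values are assumptions for illustration) to show that the returned rows keep their original index labels.

```python
import pandas as pd

# Hypothetical stand-in for the COVID-19 data (values invented for illustration)
sample = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D", "E", "F", "G"],
    "Confirmed": [10, 20, 30, 40, 50, 60, 70],
})

bottom_two = sample.tail(n=2)   # last 2 rows
bottom_default = sample.tail()  # last 5 rows, because n defaults to 5

# The original row labels (5 and 6 here) are preserved
print(bottom_two)
print(len(bottom_default))
```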
Display all the Column Names
When a DataFrame has many columns, printing the whole DataFrame just to see the column names may not be feasible. In such cases, we can use the pandas.DataFrame.columns attribute to get all the column names.
# Display all the column names
df.columns
Index(['Country/Region', 'Confirmed', 'Deaths', 'Recovered', 'Active', 'New cases', 'New deaths', 'New recovered', 'Deaths / 100 Cases', 'Recovered / 100 Cases', 'Deaths / 100 Recovered', 'Confirmed last week', '1 week change', '1 week % increase', 'WHO Region'], dtype='object')
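The Index object shown above can be converted to a plain Python list or used for membership checks. A minimal sketch, assuming a small made-up frame (sample is a hypothetical name, and only three of the dataset's column names are reused):

```python
import pandas as pd

# Hypothetical frame reusing a few column names from the dataset
sample = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Confirmed": [10, 20],
    "WHO Region": ["X", "Y"],
})

# columns is an attribute, so it is accessed without parentheses
column_names = sample.columns.tolist()  # plain Python list of names
has_confirmed = "Confirmed" in sample.columns
n_columns = len(sample.columns)

print(column_names)
print(has_confirmed, n_columns)
```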
Display Descriptive Statistics of the DataFrame
The pandas.DataFrame.describe method displays descriptive statistics for the columns of the DataFrame, such as count, mean, standard deviation, minimum value, and maximum value. By default, only numeric columns are summarized. Such statistics help us understand the data well.
# Display descriptive statistics of the DataFrame
df.describe()
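When called without arguments on a frame with mixed column types, describe summarizes the numeric columns only; passing include="all" adds the remaining columns as well. The sketch below illustrates this on a made-up frame (sample and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical stand-in for the COVID-19 data (values invented for illustration)
sample = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D", "E", "F", "G"],
    "Confirmed": [10, 20, 30, 40, 50, 60, 70],
})

# Default: numeric columns only
numeric_summary = sample.describe()

# include="all" also summarizes non-numeric columns (unique, top, freq)
full_summary = sample.describe(include="all")

# Individual statistics can be looked up by row label and column name
mean_confirmed = numeric_summary.loc["mean", "Confirmed"]

print(numeric_summary)
print(full_summary.columns.tolist())
```

The result is itself a DataFrame, so a single statistic such as the mean can be read with .loc rather than copied from the printed table.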
Display Data Types, Non-Null Counts and Memory Usage of a Pandas DataFrame
The pandas.DataFrame.info method is used to display the index data type and column data type, the number of non-null values, and memory usage.
# Display further information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Country/Region          187 non-null    object
 1   Confirmed               187 non-null    int64
 2   Deaths                  187 non-null    int64
 3   Recovered               187 non-null    int64
 4   Active                  187 non-null    int64
 5   New cases               187 non-null    int64
 6   New deaths              187 non-null    int64
 7   New recovered           187 non-null    int64
 8   Deaths / 100 Cases      187 non-null    float64
 9   Recovered / 100 Cases   187 non-null    float64
 10  Deaths / 100 Recovered  187 non-null    float64
 11  Confirmed last week     187 non-null    int64
 12  1 week change           187 non-null    int64
 13  1 week % increase       187 non-null    float64
 14  WHO Region              187 non-null    object
dtypes: float64(4), int64(9), object(2)
memory usage: 22.0+ KB
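Because info prints its report rather than returning it, the individual pieces are sometimes easier to work with through separate attributes and methods. The sketch below is one possible way to do that on a made-up frame (sample, its values, and the variable names are assumptions for illustration); it also shows capturing the info report as a string via the buf parameter.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value (values invented for illustration)
sample = pd.DataFrame({
    "Country/Region": ["A", "B", "C"],
    "Confirmed": [10, 20, 30],
    "Deaths": [1.0, None, 3.0],
})

column_types = sample.dtypes                       # data type of each column
non_null_counts = sample.notna().sum()             # non-null values per column
bytes_per_column = sample.memory_usage(deep=True)  # memory use in bytes

# info() writes to stdout by default; buf redirects the report to a buffer
buffer = io.StringIO()
sample.info(buf=buffer)
report = buffer.getvalue()

print(column_types)
print(non_null_counts)
print(report)
```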
In this chapter, you learned about Pandas methods that help you understand your data. In the next chapter, you will learn about Pandas methods for manipulating data during preprocessing.