Learn Pandas for Data Science (Course IV)

Preprocessing data using Pandas

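In this chapter, you will learn some of the most commonly used Pandas functions for preprocessing data. The functions covered in this chapter are dropna(), fillna(), nunique(), melt(), pivot(), concat(), rename(), sort_values(), filter(), and groupby() with get_group().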

For this chapter, we will be preprocessing the Emission Data, which can be loaded as:

# Making necessary imports
import pandas as pd
import numpy as np

# Loading Emission data to Dataframe
data = pd.read_csv("https://github.com/plotly/datasets/raw/master/Emissions%20Data.csv")

After loading the dataset, we can view the columns or the features of the dataset as:

# View the available dataframe columns i.e., features
data.columns

OUTPUT:

Index(['Year', 'Country', 'Continent', 'Emission'], dtype='object') 

The dataset may contain rows with missing values. Handling missing values is one of the most important steps in data preprocessing, and there are several methods for doing it. A detailed treatment of those methods is beyond the scope of this chapter; here we look only at the two simplest options, dropping and filling.
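Before choosing how to handle missing values, it helps to see how many there are. The following is a minimal sketch, assuming a small made-up table named sample in place of the real Emission data (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Emission data; values are made up
sample = pd.DataFrame({
    "Year": [2008, 2009, 2008, 2009],
    "Country": ["Aruba", "Aruba", "Andorra", None],
    "Continent": ["South America", "South America", "Europe", "Europe"],
    "Emission": [24.75, np.nan, 6.30, np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value
rows_with_missing = int(sample.isna().any(axis=1).sum())
print(rows_with_missing)
```

The same two calls can be run on data to decide whether dropping rows would discard too much of the dataset.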

dropna()

In our example, we can simply drop the rows that contain null values with the dropna() function. Note that it returns a new DataFrame and leaves data itself unchanged:

# Drop any rows that have missing data, i.e., NaN
data.dropna(how='any')

OUTPUT:

[Image: output of data.dropna(how='any'); only a small part of the output is shown]
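The how argument is not the only way to control which rows are dropped. The sketch below, which assumes a small made-up table named sample rather than the real Emission data, compares how='any', how='all', and the subset parameter:

```python
import numpy as np
import pandas as pd

# Hypothetical table; the last row is missing in every column
sample = pd.DataFrame({
    "Year": [2008, 2009, 2010, np.nan],
    "Country": ["Aruba", "Aruba", None, None],
    "Emission": [24.75, np.nan, 6.30, np.nan],
})

# Drop a row if ANY of its values is missing
any_dropped = sample.dropna(how="any")

# Drop a row only if ALL of its values are missing
all_dropped = sample.dropna(how="all")

# Drop a row only if the Emission value is missing
subset_dropped = sample.dropna(subset=["Emission"])

print(len(sample), len(any_dropped), len(all_dropped), len(subset_dropped))
```

Each call returns a new DataFrame, so sample still has all four rows afterwards.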

fillna()

Alternatively, the missing (NaN) values can be replaced with a specified value using the pandas.DataFrame.fillna function, as illustrated below:

# Filling missing data with specified value
data.fillna(value=0)

OUTPUT:

[Image: output of data.fillna(value=0); only part of the output is shown]
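A single fill value such as 0 is rarely right for every column. As a rough sketch, assuming a made-up table named sample, fillna() can also take a dictionary that maps each column to its own replacement, here a placeholder label for Country and the column mean for Emission:

```python
import numpy as np
import pandas as pd

# Hypothetical table; values are made up
sample = pd.DataFrame({
    "Country": ["Aruba", None, "Andorra"],
    "Emission": [24.0, np.nan, 6.0],
})

# Choose a different replacement for each column
replacements = {
    "Country": "Unknown",                    # placeholder label (an assumption)
    "Emission": sample["Emission"].mean(),   # mean of the non-missing values
}
filled = sample.fillna(value=replacements)
print(filled)
```

Whether the mean is a sensible replacement depends on the data; it is used here only to show the mechanism.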

nunique()

The nunique function counts the distinct values in a column. Called on a single column (a Series), as below, it returns one number; missing values are not counted by default:

# Count distinct values in a column
data['Country'].nunique() # gives the count of unique country names in the dataset

OUTPUT:

 197 
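For comparison, here is a small sketch on a made-up table named sample showing nunique() on one column, on the whole DataFrame, and with missing values counted:

```python
import pandas as pd

# Hypothetical table; values are made up
sample = pd.DataFrame({
    "Country": ["Aruba", "Aruba", "Andorra", None],
    "Continent": ["South America", "South America", "Europe", "Europe"],
})

# Distinct values in one column (missing values are ignored by default)
country_count = sample["Country"].nunique()

# Count the missing value as its own distinct entry
country_count_with_missing = sample["Country"].nunique(dropna=False)

# Distinct values in every column at once (returns a Series)
per_column = sample.nunique()

print(country_count, country_count_with_missing)
print(per_column)
```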

melt()

The pandas.DataFrame.melt function reshapes a DataFrame from wide to long format by gathering columns into rows. Called without further arguments, as below, it gathers every column into two columns named variable and value:

# Shows data by gathering columns into rows
pd.melt(data)

OUTPUT:

[Image: output of pd.melt(data); only part of the output is shown]
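melt() is easier to follow on a table that is genuinely wide. The sketch below assumes a made-up table named wide with one column per year; the id_vars, var_name, and value_name parameters keep Country as an identifier and name the two gathered columns:

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    "Country": ["Aruba", "Andorra"],
    "2008": [24.75, 6.30],
    "2009": [24.88, 6.05],
})

# Keep Country as an identifier and gather the year columns into rows
long = pd.melt(
    wide,
    id_vars=["Country"],
    var_name="Year",
    value_name="Emission",
)
print(long)
```

The result has one row per country and year, which is the same layout as the Emission dataset.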

pivot()

The pandas.DataFrame.pivot function reshapes a DataFrame by spreading the values of one column into new column headers, which can make the relationships in a dataset easier to see. The idea of a pivot column can be hard to grasp at first. In the example below, each distinct Year becomes a column holding the Emission values, and the existing row index is kept:

# Spread rows into columns
data.pivot(columns='Year', values='Emission')

OUTPUT:

[Image: output of data.pivot(columns='Year', values='Emission'); only part of the output is shown]
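Passing an index as well usually gives a more readable table. This is a minimal sketch on a made-up long table named long, assuming each (Country, Year) pair appears only once, since pivot() raises a ValueError on duplicate pairs:

```python
import pandas as pd

# Hypothetical long table: one row per country and year
long = pd.DataFrame({
    "Year": [2008, 2009, 2008, 2009],
    "Country": ["Aruba", "Aruba", "Andorra", "Andorra"],
    "Emission": [24.75, 24.88, 6.30, 6.05],
})

# One row per country, one column per year, Emission as the cell values
wide = long.pivot(index="Country", columns="Year", values="Emission")
print(wide)
```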

concat()

The pandas.concat() function is used to concatenate DataFrames as illustrated below:

# Creating two new DataFrames by slicing the original DataFrame
D1 = data[:2] # First two rows of dataset
print("D1: \n", D1)

D2 = data[3:5] # Rows at positions 3 and 4
print("\nD2: \n", D2)

# Concatenating the two DataFrames
D3 = pd.concat([D1, D2])
print("\nD3: \n", D3)

OUTPUT:

D1:
    Year  Country   Continent   Emission
0  2008   Aruba  South America  24.750133
1  2009   Aruba  South America  24.876706

D2:
    Year  Country   Continent     Emission
3  2011   Aruba   South America  23.922412
4  2008  Andorra        Europe   6.296125

D3:
    Year  Country      Continent   Emission
0  2008    Aruba  South America  24.750133
1  2009    Aruba  South America  24.876706
3  2011    Aruba  South America  23.922412
4  2008  Andorra         Europe   6.296125
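Notice that D3 keeps the original row labels (0, 1, 3, 4). If a fresh 0-based index is preferred, concat() accepts ignore_index=True. A small sketch with two made-up DataFrames named first and second:

```python
import pandas as pd

# Two hypothetical pieces with non-contiguous row labels
first = pd.DataFrame({"Country": ["Aruba", "Aruba"], "Emission": [24.75, 24.88]})
second = pd.DataFrame({"Country": ["Andorra"], "Emission": [6.30]}, index=[4])

# Default: the original row labels are kept
kept = pd.concat([first, second])

# ignore_index=True renumbers the rows 0, 1, 2, ...
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```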

rename()

The pandas.DataFrame.rename function is used to rename the columns of a DataFrame:

print(data.columns) #prints columns of original data

modified = data.rename(columns = {'Continent':'Region'})

print(modified.columns) #prints column of modified data

OUTPUT:

Index(['Year', 'Country', 'Continent', 'Emission'], dtype='object')
Index(['Year', 'Country', 'Region', 'Emission'], dtype='object')
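rename() can change several columns in one call, and it also accepts a function that is applied to every column name. A brief sketch on a made-up one-row table named sample (the new name Nation is chosen only for illustration):

```python
import pandas as pd

# Hypothetical one-row table with the same kind of columns as the dataset
sample = pd.DataFrame({
    "Year": [2008],
    "Country": ["Aruba"],
    "Continent": ["South America"],
})

# Rename several columns at once with a dictionary
renamed = sample.rename(columns={"Continent": "Region", "Country": "Nation"})

# Apply a function to every column name, here lower-casing them
lowered = sample.rename(columns=str.lower)

print(list(renamed.columns))
print(list(lowered.columns))
```

As with the example above, the original DataFrame keeps its column names unless the result is assigned back.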

sort_values()

The pandas.DataFrame.sort_values function sorts the rows of a DataFrame by the values of a column, in ascending order (low to high) by default:

# Ordering rows by the values of the Continent column
ordered = data.sort_values('Continent')

ordered.head(3) # Display top 3 rows

OUTPUT:

[Image: output of ordered.head(3)]
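Sorting can also run from high to low, or use more than one column. The sketch below assumes a made-up table named sample and uses the ascending parameter for both cases:

```python
import pandas as pd

# Hypothetical table; values are made up
sample = pd.DataFrame({
    "Country": ["Aruba", "Andorra", "Angola", "Albania"],
    "Continent": ["South America", "Europe", "Africa", "Europe"],
    "Emission": [24.75, 6.30, 1.37, 1.50],
})

# Highest emission first
highest_first = sample.sort_values("Emission", ascending=False)

# Continent A to Z, then highest emission first within each continent
by_two = sample.sort_values(["Continent", "Emission"], ascending=[True, False])

print(highest_first)
print(by_two)
```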

filter()

The pandas.DataFrame.filter function selects a subset of the rows or columns of a DataFrame according to their labels. It does not filter on the contents of the DataFrame.

# Selecting the columns whose names appear in the given list
f = data.filter(["Country","Emission"]) 
print(f)

OUTPUT:

[Image: output of print(f); only part of the output is shown]
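Besides an explicit list of labels, filter() can match labels by substring or regular expression, and it can work on row labels with axis=0. A short sketch on a made-up table named sample:

```python
import pandas as pd

# Hypothetical table; values are made up
sample = pd.DataFrame({
    "Year": [2008, 2009],
    "Country": ["Aruba", "Andorra"],
    "Continent": ["South America", "Europe"],
    "Emission": [24.75, 6.30],
})

# Columns named explicitly
by_items = sample.filter(items=["Country", "Emission"])

# Columns whose name contains the substring "Co"
by_like = sample.filter(like="Co")

# Columns whose name matches a regular expression (starts with "E")
by_regex = sample.filter(regex="^E")

# Rows selected by index label instead of columns
by_rows = sample.filter(items=[0], axis=0)

print(list(by_like.columns), list(by_regex.columns))
```

Because filter() looks only at labels, selecting rows by their values (for example, emissions above some threshold) needs boolean indexing instead.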

groupby() & get_group()

The pandas.DataFrame.groupby method groups the rows of a DataFrame by the values in a column. Its output is a GroupBy object, not a DataFrame. A particular group can then be retrieved from that object with the get_group() function:

# Grouping data object by the names of Continents
d = data.groupby(by='Continent') # group by column
print(type(d)) # print the type 

# Showing the first non-null value of each column within each group
print(d.first())

# Showing particular group from group object
group1=d.get_group('Asia')
print("\nAsia:(First 5 rows) \n", group1.head())

OUTPUT:

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
               Year              Country   Emission
Continent                                          
Africa         2008               Angola   1.369425
Asia           2008          Afghanistan   0.158962
Europe         2008              Andorra   6.296125
North America  2008  Antigua And Barbuda   5.628319
Oceania        2008            Australia  17.704080
South America  2008                Aruba  24.750133

Asia:(First 5 rows)
     Year               Country   Emission
8   2008           Afghanistan   0.158962
9   2009           Afghanistan   0.249074
10  2010           Afghanistan   0.302936
11  2011           Afghanistan   0.425262
20  2008  United Arab Emirates  23.033600
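A GroupBy object is most often used to compute a summary per group. The following is a minimal sketch on a made-up table named sample, computing the mean emission per continent and a small count and max summary:

```python
import pandas as pd

# Hypothetical table; values are made up
sample = pd.DataFrame({
    "Country": ["Aruba", "Andorra", "Albania", "Angola"],
    "Continent": ["South America", "Europe", "Europe", "Africa"],
    "Emission": [24.0, 6.0, 2.0, 1.0],
})

grouped = sample.groupby("Continent")

# Mean emission per continent (a Series indexed by Continent)
mean_emission = grouped["Emission"].mean()

# Several statistics at once (a DataFrame indexed by Continent)
summary = grouped["Emission"].agg(["count", "max"])

# Rows belonging to a single group
europe = grouped.get_group("Europe")

print(mean_emission)
print(summary)
```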

That is all for the data preprocessing functions in pandas covered in this chapter. Head on to the final chapter, Visualizing data using Pandas.
