Data pre-processing is a series of data preparation steps used to remove unwanted noise and to extract the necessary data from a dataset. Learn how to preprocess data in this article by reading about six different ways to handle missing data in Python.

Data Preprocessing in Python - Handling Missing Data

A common rule of thumb says that almost 80% of one’s time is spent pre-processing data, whereas only 20% is spent building the actual ML model. Data pre-processing is therefore a vital step in building robust ML models.

Techniques For Handling Missing Data

Data may not always be complete, i.e. some of the values in the data may be missing or null. There are several standard ways to handle the missing data and make the dataset complete.

The following example shows that the ‘Years of Experience’ of ‘Employee’ is missing. Also, the ‘Salary (in USD per year)’ of ‘Junior Manager’ is missing.

import pandas as pd

# Creating the dataframe as shown above
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Viewing the contents of the dataframe
df.head()
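
Before choosing a technique, it helps to confirm where the gaps actually are. The following is a minimal sketch, assuming the same dataframe as above; the variable names `missing_per_column` and `rows_with_missing` are ours and are used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Here each of the two numeric columns has one missing value, located in the rows with index labels 2 and 3.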

Some of the ways to handle missing data are listed below:

1. Data Removal

Remove the rows (data points) that contain missing data from the dataset. However, this technique shrinks the available dataset, which makes any model built on it less robust if the dataset was small to begin with.

# Dropping the rows with index labels 2 and 3 (the rows that contain missing values)
dropped_df = df.drop([2,3],axis=0)

# Viewing the dataframe
dropped_df
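
Hard-coding the index labels works for a five-row example, but on a larger dataset we usually do not know in advance which rows are incomplete. As a sketch of an alternative, pandas' `dropna` finds those rows itself; the name `dropped_salary_only` is ours and is used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Drop every row that has a missing value in any column
dropped_df = df.dropna()

# Drop only the rows where 'Salary' is missing, keeping rows with other gaps
dropped_salary_only = df.dropna(subset=['Salary'])

print(dropped_df)
print(dropped_salary_only)
```

The `subset` argument is useful when only some columns matter for the model, since it avoids discarding rows unnecessarily.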

2. Fill missing value through statistical imputation

Fill the missing data with the mean or median of the available data points. The median is generally preferred because, unlike the mean, it is not heavily affected by outliers. The example below uses the mean, which is a reasonable choice here because the data contains no extreme values.

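# Filling each column with its mean value
# (working on a copy so that df keeps its missing values for the later examples)
filled_df = df.copy()

filled_df['Years of Experience'] = filled_df['Years of Experience'].fillna(filled_df['Years of Experience'].mean())

filled_df['Salary'] = filled_df['Salary'].fillna(filled_df['Salary'].mean())

# Viewing the dataframe
filled_df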
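
To see why the median is often the safer choice, the sketch below first compares the two statistics on a hypothetical salary column that contains one extreme value (these numbers are made up for illustration and are not part of the example dataset), and then fills the example dataframe with column medians.

```python
import pandas as pd

# Hypothetical salaries with one outlier and one missing value
salaries = pd.Series([20000, 40000, 80000, 100000, 5000000, None])

mean_value = salaries.mean()      # missing values are skipped; pulled up by the outlier
median_value = salaries.median()  # middle value; barely affected by the outlier
filled_with_median = salaries.fillna(median_value)

print(mean_value, median_value)

# The same idea applied to the example dataframe
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Fill each numeric column with its own median
median_filled_df = df.fillna(df.median(numeric_only=True))
print(median_filled_df)
```

In the hypothetical column the mean is 1048000 while the median is 80000, so the median is much closer to a typical salary.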

3. Fill missing value using observation

Manually fill in the missing data from observation. This may be possible sometimes for small datasets but for larger datasets it is very difficult to do so.
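
As a sketch of what a manual fill can look like in pandas, the snippet below writes values into specific cells with `.loc`. The values 2 and 60000 are assumptions made purely for illustration, as if they had been looked up in another record; they are not derived from the data.

```python
import pandas as pd

df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

manual_df = df.copy()

# Assumed values, entered by hand after inspecting the data
manual_df.loc[manual_df['Job Position'] == 'Employee', 'Years of Experience'] = 2
manual_df.loc[manual_df['Job Position'] == 'Junior Manager', 'Salary'] = 60000

print(manual_df)
```

Selecting the row by 'Job Position' rather than by index label keeps the assignment readable and makes it clear which record is being corrected.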


4. Fill in the most repeated value

Fill in the missing value with the most frequently occurring value (the mode) of the column. This is appropriate when one value dominates the column and there is good reason to believe the missing entry shares it. Since there are no repeated values in the example, every value is equally frequent, so we can fill the gap with any one of the numbers in the respective column.
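
Because the example dataframe has no repeated values, the sketch below uses a hypothetical 'department' column (not part of the example dataset) to show the technique with pandas' `mode`.

```python
import pandas as pd

# Hypothetical column in which one value clearly dominates
departments = pd.Series(['Sales', 'Sales', None, 'Engineering', 'Sales', None])

# mode() ignores missing values and returns the most frequent value(s), sorted
most_common = departments.mode()[0]

filled_departments = departments.fillna(most_common)
print(filled_departments)
```

When several values are tied for most frequent, `mode()` returns all of them, and taking the first element simply picks the smallest one, so the choice should be checked rather than trusted blindly.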


5. Fill in with random value within the range of available data

Take the range of the available data points in a column (its minimum and maximum) and fill in each missing value with a value selected at random from that range.
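
One possible way to do this is sketched below, assuming a uniform draw between the column minimum and maximum; the uniform distribution and the fixed seed are our choices for illustration, not a requirement of the technique.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Fixed seed so that the random fill is reproducible
rng = np.random.default_rng(seed=0)

random_df = df.copy()
for column in ['Years of Experience', 'Salary']:
    # min() and max() skip missing values by default
    low, high = random_df[column].min(), random_df[column].max()
    missing = random_df[column].isna()
    # Draw one random value per missing cell from the observed range
    random_df.loc[missing, column] = rng.uniform(low, high, size=missing.sum())

print(random_df)
```

The filled values are guaranteed to stay inside the observed range, but they carry no information about the row they are placed in.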


6. Fill in by regression

Use regression analysis to find the most probable data point for filling in the dataset.

from sklearn.linear_model import LinearRegression

# Excluding the rows with the null data
train_df = df.drop([2,3],axis=0)

# Creating linear regression model
regr = LinearRegression()

# Here the target is the Salary and the feature is Years of Experience
regr.fit(train_df[['Years of Experience']], train_df[['Salary']])

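# Predicting for 3 years of experience
# (passing a dataframe with the same column name used in fit avoids a feature-name warning)
regr.predict(pd.DataFrame({'Years of Experience': [3]}))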

Therefore, the salary predicted by regression for 3 years of experience is 60000. The same approach works in the other direction, to find the years of experience based on salary.

from sklearn.linear_model import LinearRegression

# Excluding the rows with the null data
train_df = df.drop([2,3],axis=0)

# Creating linear regression model
regr = LinearRegression()

# Here the target is the Years of Experience and the feature is Salary
regr.fit(train_df[['Salary']], train_df[['Years of Experience']])

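# Predicting for 40000 salary
# (passing a dataframe with the same column name used in fit avoids a feature-name warning)
regr.predict(pd.DataFrame({'Salary': [40000.0]}))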

Therefore, the years of experience predicted by regression for a salary of 40000 is 2.
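
The two predictions above still have to be written back into the dataframe. The sketch below shows one way to automate that step. Note that it swaps in NumPy's `polyfit` (a degree-1 least-squares fit) in place of scikit-learn's `LinearRegression` to keep the snippet dependency-light, and that the helper `impute_by_linear_fit` is a hypothetical function written for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

def impute_by_linear_fit(frame, feature, target):
    """Return a copy of `target` with gaps filled from a straight-line fit on `feature`."""
    # Rows where both values are known are used to fit the line
    known = frame[feature].notna() & frame[target].notna()
    # Rows where the feature is known but the target is missing get predicted
    to_fill = frame[feature].notna() & frame[target].isna()
    slope, intercept = np.polyfit(
        frame.loc[known, feature].to_numpy(dtype=float),
        frame.loc[known, target].to_numpy(dtype=float),
        deg=1,
    )
    filled = frame[target].copy()
    filled[to_fill] = slope * frame.loc[to_fill, feature] + intercept
    return filled

# Both fits use only the originally complete rows, as in the examples above
regression_df = df.copy()
regression_df['Salary'] = impute_by_linear_fit(df, 'Years of Experience', 'Salary')
regression_df['Years of Experience'] = impute_by_linear_fit(df, 'Salary', 'Years of Experience')

print(regression_df)
```

Fitting both directions on the original `df` means a value imputed in one column is never used as training data for the other.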

In Conclusion

Do you have any problems handling missing data in Python? Let us know in the comment section below.
