Data pre-processing is a series of data preparation steps used to remove unwanted noise and to retain only the necessary data from a dataset. Learn how to preprocess data in this article by reading about six different ways to handle missing data in Python.

A common rule of thumb holds that roughly 80% of one’s time is spent pre-processing data, whereas only 20% is spent building the actual ML model. Data pre-processing is therefore a vital step in building robust ML models.

**Techniques for Handling Missing Data**

Data may not always be complete, i.e. some of the values may be missing or null. There are several ways to handle the missing values and make the data complete.

In the following example, the ‘Years of Experience’ value for ‘Employee’ is missing, and so is the ‘Salary’ (in USD per year) of ‘Junior Manager’.

    import pandas as pd

    # Creating the dataframe described above
    df = pd.DataFrame({
        'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
        'Years of Experience': [5, 4, 3, None, 1],
        'Salary': [100000, 80000, None, 40000, 20000],
    })

    # Viewing the contents of the dataframe
    df.head()

Some of the ways to handle missing data are listed below:

**1. Data Removal**

Remove the rows (data points) that contain missing data from the dataset. However, this technique shrinks the available dataset, which makes the results less robust if the dataset was small to begin with.

    # Dropping the rows with index labels 2 and 3 (the rows with missing values)
    dropped_df = df.drop([2, 3], axis=0)

    # Viewing the dataframe
    dropped_df
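Dropping rows by hard-coded index only works when you already know where the gaps are. As a minimal sketch, assuming the same sample data, pandas' `dropna` can locate and remove the incomplete rows for you:

```python
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Count the missing values in each column before removing anything
missing_per_column = df.isna().sum()

# Drop every row that contains at least one missing value
dropped_df = df.dropna(axis=0, how='any')
```

Checking `missing_per_column` first shows how much data the removal will cost.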

**2. Fill missing value through statistical imputation**

Fill the missing data with the mean or median of the available data points. The median is generally preferred because, unlike the mean, it is not heavily affected by outliers. The example here uses the mean, since the sample contains no outliers; replace `.mean()` with `.median()` to fill with the median instead.

    # Filling each column with their mean values
    df['Years of Experience'] = df['Years of Experience'].fillna(df['Years of Experience'].mean())
    df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

    # Viewing the dataframe
    df
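To see why the median is usually the safer choice, here is a small sketch using hypothetical salaries (not the article's data) that include one extreme outlier:

```python
import pandas as pd

# Hypothetical salaries with one extreme outlier and one missing value
salaries = pd.Series([20000, 40000, None, 80000, 1000000])

# Both statistics skip the missing entry by default
mean_value = salaries.mean()      # 285000.0, pulled upward by the outlier
median_value = salaries.median()  # 60000.0, the middle of the observed values

# Fill the gap with the more robust statistic
filled_with_median = salaries.fillna(median_value)
```

The mean lands far above every typical salary in the series, while the median stays among them.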

**3. Fill missing value using observation**

Manually fill in the missing data from observation. This may be possible sometimes for small datasets but for larger datasets it is very difficult to do so.
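As a minimal sketch of a manual fill, the snippet below writes values into the exact cells that are empty. The numbers 2 and 60000 are placeholders for whatever you actually observe; they are assumptions, not real records.

```python
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Placeholder values standing in for manually observed facts (assumptions)
observed_years_for_employee = 2
observed_salary_for_junior_manager = 60000

# Select each cell by row condition and column name, then assign the value
df.loc[df['Job Position'] == 'Employee', 'Years of Experience'] = observed_years_for_employee
df.loc[df['Job Position'] == 'Junior Manager', 'Salary'] = observed_salary_for_junior_manager
```

Selecting by `Job Position` rather than by row number keeps the fill correct even if the rows are reordered.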

**4. Fill in the most repeated value**

Fill in the missing value with the most frequently repeated value (the mode) in the dataset. This is appropriate when many of the values repeat and there is good reason to expect the missing one to match them. Since there are no repeated values in the example, there is no single most frequent value, so we could only fill the gap with any one of the existing numbers in the respective column.
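Because the article's sample has no repeated values, the sketch below uses a hypothetical column in which one value clearly dominates:

```python
import pandas as pd

# Hypothetical column in which the value 3 clearly dominates
years = pd.Series([3, 5, 3, None, 1, 3], name='Years of Experience')

# mode() ignores missing entries and returns a Series because ties are possible;
# take the first entry as the fill value
most_common = years.mode().iloc[0]

# Replace the missing entry with the most repeated value
filled = years.fillna(most_common)
```

When every value is unique, `mode()` returns all of them, so the first entry is an arbitrary pick rather than a meaningful one.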

**5. Fill in with random value within the range of available data**

Take the range of the available data points (minimum to maximum) and fill in each missing value by randomly selecting a value from within that range.
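One way to sketch this, assuming a uniform draw between each column's minimum and maximum (the article does not specify a distribution), is with NumPy's random generator:

```python
import numpy as np
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Fixed seed so the fill is reproducible (the seed value is an arbitrary choice)
rng = np.random.default_rng(seed=0)

for column in ['Years of Experience', 'Salary']:
    # Range of the values that are actually present
    low, high = df[column].min(), df[column].max()
    # Boolean mask marking the missing cells in this column
    missing = df[column].isna()
    # Draw one uniform random value per missing cell
    df.loc[missing, column] = rng.uniform(low, high, size=missing.sum())
```

Drawing from the observed values themselves, for example with `rng.choice`, would be an equally reasonable reading of this technique.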

**6. Fill in by regression**

Use regression analysis to find the most probable value for each missing entry: fit a model on the complete rows, then predict the missing value from the column that is known.

    from sklearn.linear_model import LinearRegression

    # Excluding the rows that originally had null data
    train_df = df.drop([2, 3], axis=0)

    # Creating linear regression model
    regr = LinearRegression()

    # Here the target is the Salary and the feature is Years of Experience
    regr.fit(train_df[['Years of Experience']], train_df[['Salary']])

    # Predicting for 3 years of experience
    # (passing a DataFrame with the training column name avoids scikit-learn's feature-name warning)
    regr.predict(pd.DataFrame({'Years of Experience': [3]}))

Therefore, the salary predicted by regression for 3 years of experience is 60000. Next, we find the years of experience based on salary.

    from sklearn.linear_model import LinearRegression

    # Excluding the rows that originally had null data
    train_df = df.drop([2, 3], axis=0)

    # Creating linear regression model
    regr = LinearRegression()

    # Here the target is the Years of Experience and the feature is Salary
    regr.fit(train_df[['Salary']], train_df[['Years of Experience']])

    # Predicting for 40000 salary
    # (passing a DataFrame with the training column name avoids scikit-learn's feature-name warning)
    regr.predict(pd.DataFrame({'Salary': [40000.0]}))

Therefore, the predicted years of experience for a salary of 40000 is 2.
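The two predictions above still have to be written back into the dataframe. A minimal sketch of that last step follows; the helper `predict_missing` is a hypothetical name introduced here for illustration, and it fits each model only on rows where every value is present.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same sample data as in the article
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Rows with every value present (index 0, 1 and 4) serve as training data
complete = df.dropna()

def predict_missing(feature, target):
    """Hypothetical helper: predict missing `target` values from `feature`."""
    model = LinearRegression().fit(complete[[feature]], complete[target])
    # Rows where the target is missing but the feature is known
    to_fill = df[target].isna() & df[feature].notna()
    return to_fill, model.predict(df.loc[to_fill, [feature]])

# Compute both sets of predictions before modifying the dataframe
salary_mask, salary_pred = predict_missing('Years of Experience', 'Salary')
years_mask, years_pred = predict_missing('Salary', 'Years of Experience')

# Write the predictions into the empty cells
df.loc[salary_mask, 'Salary'] = salary_pred
df.loc[years_mask, 'Years of Experience'] = years_pred
```

Computing both predictions before writing either one keeps an imputed value from leaking into the other model's inputs.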

**In Conclusion**

Do you have any problems handling missing data in Python? Let us know in the comment section below.