Analyzing the sinking of the Titanic – Data Analysis with Python (Course V)

Exploratory Data Analysis – Part 1

First of all, we import the libraries needed to work with the dataset in Python.

# Importing necessary libraries

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

As you may recall from your knowledge of Python,

  • NumPy is used for numerical computations
  • Pandas is used for data processing as well as for CSV file I/O
  • Matplotlib and Seaborn are used for data visualization

Next, we read in the data using the read_csv() function of Pandas and look at the first five rows using the head() method.

# Reading in the data

df = pd.read_csv('train.csv')
df.head()
[Image: DataFrame head]

This gives us a first impression of the kind of data we are dealing with. Let us look at the shape of the dataframe to see how many rows and columns there are.

# Finding the shape of the dataframe

df.shape
Output: (891, 12)

There are a total of 891 rows and 12 columns.
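Because shape is a plain (rows, columns) tuple, it can be unpacked into two variables. The sketch below uses a tiny made-up frame named toy (an assumption for illustration, not the contents of train.csv):

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, not taken from train.csv
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.93]})

# shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```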

Now, we get a basic statistical description of the numeric columns in the dataset using the describe() method.

df.describe()
[Image: Statistical description of the dataset]

Interesting! The ‘count’ of ‘Age’ is not 891, which means that there are missing values in the dataset. We should check the entire dataframe for missing values as a first step.
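The same comparison can be made in code: the ‘count’ row of describe() only counts non-null entries, so any column whose count is below the number of rows has missing values. A minimal sketch, assuming a small made-up frame (toy and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Age value
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# describe() reports the number of non-null entries in its 'count' row
counts = toy.describe().loc["count"]

# Columns whose count falls short of the row count contain missing values
incomplete = counts[counts < len(toy)].index.tolist()
print(incomplete)
```

Note that describe() covers only numeric columns by default, so this check would not flag missing values in text columns.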

The isnull() method is useful in finding which data values are null in the dataframe.

# Checking for null values in the dataframe

df.isnull()
[Image: Checking for null values in Pandas]

Each ‘True’ marks a position where the dataset contains null data. Summing the ‘True’ values in each column gives the number of missing values per column. We will use the sum() method for this.

# Checking for total null values per column

df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

As we can see, there are 177 missing values in the ‘Age’ column, 687 missing values in the ‘Cabin’ column and 2 missing values in the ‘Embarked’ column.
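Raw counts are easier to judge alongside percentages. Since isnull() returns booleans, mean() gives the fraction of missing entries per column. The sketch below is one way to build such a summary; the frame toy and the name summary are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up frame; column names mirror the dataset, values do not
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

missing = toy.isnull().sum()             # count of nulls per column
missing_pct = toy.isnull().mean() * 100  # share of nulls per column, in percent

# Keep only columns that actually have missing values, worst first
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```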

Real-life datasets often contain many missing values, and there are different ways to fill them in. If you are interested in learning about that in a separate course, please let us know in the comment section!
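As a small taste of what filling in can look like, the sketch below replaces missing numbers with the column median and missing categories with the most frequent value. This is only one common option among several, shown on a made-up frame (toy and filled are illustrative names), and not necessarily the approach this course would choose:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 40.0],
    "Embarked": ["S", None, "S", "C"],
})

filled = toy.copy()  # leave the original frame untouched

# Numeric column: fill with the median of the observed values
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Categorical column: fill with the most frequent value (the mode)
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled)
```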

Types Of Features

By now, you should have a feel for the data, so this is a good time to talk about the different types of features you are looking at.

Numerical/Continuous features: A feature is said to be numerical or continuous if it can take any value between the minimum and maximum values in the feature's column. For example, ‘Age’ is a continuous feature in the dataset.

Categorical features: A categorical feature is one that has two or more categories and each value in that feature can be categorised by them. For example, gender is a categorical variable having two categories (male and female). ‘Sex’ and ‘Embarked’ are categorical features in the dataset.

Ordinal features: An ordinal feature is similar to a categorical feature, but its values have a relative ordering, so they can be sorted. For example, if we have a feature like Height with the values Tall, Medium and Short, then Height is an ordinal variable. ‘Pclass’ is an ordinal feature in the dataset.
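Pandas can represent such an ordering explicitly through an ordered categorical. The sketch below uses the Height example from above; the sample values and the names heights and height_cat are made up for illustration:

```python
import pandas as pd

# Made-up sample of the Height feature described above
heights = pd.Series(["Tall", "Short", "Medium", "Tall"])

# Declare the categories from lowest to highest and mark them as ordered
ordered = pd.Categorical(
    heights, categories=["Short", "Medium", "Tall"], ordered=True
)
height_cat = pd.Series(ordered)

# With an ordering in place, min/max and comparisons are meaningful
print(height_cat.min(), height_cat.max())
print((height_cat > "Short").tolist())
```

Without ordered=True, the same comparisons would raise an error, which is the practical difference between a plain categorical and an ordinal feature.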

DateTime features: A feature is said to be a DateTime feature if the feature holds DateTime values. For example, a feature with the value ‘2020/02/01 01:01:00’ is a DateTime feature. There are no DateTime features in the given dataset.
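When a dataset does have such a column, it usually arrives as text and has to be parsed first. A minimal sketch using pd.to_datetime, assuming made-up timestamps in the format of the example value above:

```python
import pandas as pd

# Made-up timestamps stored as text, in the format of the example above
stamps = pd.Series(["2020/02/01 01:01:00", "2020/02/03 14:30:00"])

# Parse the strings into datetime values using an explicit format
parsed = pd.to_datetime(stamps, format="%Y/%m/%d %H:%M:%S")

# The .dt accessor exposes the individual date and time components
print(parsed.dt.year.tolist(), parsed.dt.hour.tolist())
```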

Co-ordinate features: A feature is said to be a co-ordinate feature if the feature holds co-ordinate values. For example, a feature with the value ‘(27.7172, 85.3240)’ is a co-ordinate feature. There are no co-ordinate features in the given dataset.

Frequency features: A feature is said to be a frequency feature if the feature holds a count of items as its value. For example, a feature with the value ‘200’ is a frequency feature if it represents the count of 200 people who are on the Titanic. ‘SibSp’ is a frequency feature.
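A quick, rough way to start sorting columns into these types is to look at each column's dtype together with its number of distinct values. The sketch below does this on a made-up frame whose column names mirror the dataset; toy and overview are illustrative names, and the result is only a hint, since the final call depends on what each column means:

```python
import pandas as pd

# Made-up rows; column names mirror the dataset, values do not
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "SibSp": [1, 1, 0, 0],
})

# Storage type and number of distinct values for each column
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
})
print(overview)
```

A column like ‘Pclass’ shows why the dtype alone is not enough: it is stored as integers, yet it is an ordinal feature.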

Now, time for your first quiz! Be prepared to answer which column represents which type of feature in the dataset.
