Analyzing the sinking of the Titanic – Data Analysis with Python (Course V)
July 12, 2020 2020-08-04 10:49
Exploratory Data Analysis – Part 1
First, let us import the libraries needed to work with the dataset in Python.
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
As you may recall from your knowledge of Python,
- NumPy is used for numerical computations
- Pandas is used for data processing as well as for CSV file I/O
- Matplotlib and Seaborn are used for data visualization
# Reading in the data
df = pd.read_csv('train.csv')
df.head()
The first few rows give us a sense of the kind of data we are dealing with. Let us look at the shape of the dataframe to see how many rows and columns there are.
# Finding the shape of the dataframe
df.shape
Output: (891, 12)
There are a total of 891 rows and 12 columns.
Now, let us get a basic statistical description of the numeric columns in the dataset using the describe() method.
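As a minimal sketch of what describe() returns, here is the same call on a tiny made-up dataframe. The `sample` frame and its values are hypothetical and are not rows from train.csv; on the real data the call would simply be df.describe().

```python
import numpy as np
import pandas as pd

# Tiny made-up sample (hypothetical values, not rows from train.csv)
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

# By default, describe() summarises only the numeric columns
summary = sample.describe()
print(summary)

# 'count' excludes missing values, so it can be smaller than the row count
print(summary.loc["count", "Age"], "of", len(sample), "ages are present")
```

Note that the text column ‘Sex’ does not appear in the summary, and that the ‘count’ row only counts non-missing values.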
Interesting! The ‘count’ of ‘Age’ is not 891, which means that this column has missing values. We should check the entire dataframe for missing values as a first step.
The isnull() method is useful in finding which data values are null in the dataframe.
# Checking for null values in the dataframe
df.isnull()
Every place where the value is ‘True’ is a place where the dataset contains null data. Next, we sum the ‘True’ values in each column with the sum() method to find the number of missing values per column.
# Checking for total null values per column
df.isnull().sum()
Output:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
As we can see, there are 177 missing values in the ‘Age’ column, 687 missing values in the ‘Cabin’ column, and 2 missing values in the ‘Embarked’ column.
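Raw counts are easier to judge as a share of the rows: out of 891 rows, 177 missing ages is roughly 19.9%, and 687 missing cabins is roughly 77.1%. The sketch below shows one way to compute such percentages, using a small hypothetical frame called `sample` in place of df (the values are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for df (values are made up)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives True/False; the mean of booleans is the fraction of True values
missing_pct = sample.isnull().mean() * 100
print(missing_pct.round(1))
```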
In most real-life datasets, there can be a lot of missing values and there are different ways to fill in these missing values. If you are interested in learning about that in a separate course, please let us know in the comment section!
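To give a flavour of what filling can look like, here is a minimal sketch of one common approach: the median for a numeric column and the most frequent value for a categorical one. This is an illustrative choice and not necessarily the method this course would use; the `sample` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with gaps (values are made up)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

filled = sample.copy()
# Numeric column: replace missing values with the column median
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical column: replace missing values with the most frequent value
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled)
```

Working on a copy keeps the original frame untouched, so the filled and unfilled versions can still be compared.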
Types Of Features
By now, you should have a feel for the data, so it is a good time to talk about the different types of features you are looking at.
Numerical/Continuous features: A feature is said to be numerical or continuous if it can take any value between two points, that is, between the minimum and maximum values in the feature's column. For example, ‘Age’ is a continuous feature in the dataset.
Categorical features: A categorical feature is one that has two or more categories and each value in that feature can be categorised by them. For example, gender is a categorical variable having two categories (male and female). ‘Sex’ and ‘Embarked’ are categorical features in the dataset.
Ordinal features: An ordinal feature is similar to a categorical feature, but the difference is that its values have a relative ordering and can be sorted. For example, if we have a feature like Height with the values Tall, Medium, and Short, then Height is an ordinal variable. ‘Pclass’ is an ordinal feature in the dataset.
DateTime features: A feature is said to be a DateTime feature if the feature holds DateTime values. For example, a feature with the value ‘2020/02/01 01:01:00’ is a DateTime feature. There are no DateTime features in the given dataset.
Co-ordinate features: A feature is said to be a co-ordinate feature if the feature holds co-ordinate values. For example, a feature with the value ‘(27.7172, 85.3240)’ is a co-ordinate feature. There are no co-ordinate features in the given dataset.
Frequency features: A feature is said to be a frequency feature if the feature holds a count of items as its value. For example, a feature with the value ‘200’ is a frequency feature if it represents the count of 200 people who are on the Titanic. ‘SibSp’ is a frequency feature.
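When deciding which type a column belongs to, its dtype and its number of distinct values are a useful first hint. The following is a minimal sketch on a hypothetical `sample` frame (made-up values, with column names borrowed from the dataset). It also shows one way to make the ordering of an ordinal feature explicit, under the assumption that class 1 ranks highest.

```python
import pandas as pd

# Hypothetical mini-frame (values are made up; column names follow the dataset)
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
    "SibSp": [1, 1, 0],
})

# dtypes and the number of distinct values are a first hint at the feature type
print(sample.dtypes)
print(sample.nunique())

# An ordinal feature can be stored as an ordered categorical.
# Assumption: class 1 is treated as the highest, so the order runs 3 < 2 < 1.
pclass = pd.Categorical(sample["Pclass"], categories=[3, 2, 1], ordered=True)
print(pclass.min(), pclass.max())
```

A dtype alone does not settle the question: here ‘Pclass’ is stored as integers even though its values are really ordered categories.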
Now, time for your first quiz! Be prepared to answer which column represents which type of feature in the dataset.