Learn Pandas for Data Science (Course IV)

Data Structures in Pandas: DataFrame

A Pandas DataFrame is a labeled data structure in which data is arranged in a table of rows and columns. Its format is similar to that of an Excel spreadsheet or a SQL table. The columns of a DataFrame can be heterogeneous, i.e., different columns can hold different data types. In this chapter, you will learn several ways to create DataFrames and how to select data from them.

The general syntax for creating a simple DataFrame is similar to that of creating a Pandas Series:

df = pandas.DataFrame(data=data, index=index)

A Pandas DataFrame accepts many different kinds of input data: a dictionary of 1-D ndarrays, lists, dicts, or Series; a 2-D NumPy ndarray; a structured or record ndarray; or another DataFrame. Along with the data, you can optionally pass index (row labels) and columns (column labels).
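
As a minimal sketch of the optional index and columns arguments, the snippet below builds a DataFrame from a 2-D NumPy array. The array values and the labels ('r1', 'r2', 'x', 'y', 'z') are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 array; the values are chosen only for illustration
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# index supplies the row labels, columns supplies the column labels
df_labeled = pd.DataFrame(data=arr, index=['r1', 'r2'], columns=['x', 'y', 'z'])

print(df_labeled)
print(df_labeled.shape)
```

If index or columns is omitted, Pandas falls back to integer labels starting at 0.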

How to Create a Pandas DataFrame?

There are many ways to create a Pandas DataFrame. Some of the most widely used ones are discussed below:

Creating a Pandas DataFrame from a dictionary of ndarrays or lists

A Pandas DataFrame can be created from a dictionary whose values are lists or ndarrays of equal length. The keys of the dictionary become the column names of the DataFrame.

# Making necessary imports
import pandas as pd

# Defining a dictionary of lists (ndarrays work the same way)
d = {'one': [1, 2, 3, 4],
    'two': [4, 3, 2, 1]}

# Creating a DataFrame from the Dictionary
df0 = pd.DataFrame(d)

# Printing dataframe and its index
print(df0)
print(df0.index)

# Show well formatted dataframe in jupyter notebook or jupyter lab
df0

OUTPUT:

   one  two
0    1    4
1    2    3
2    3    2
3    4    1
RangeIndex(start=0, stop=4, step=1)

Creating a Pandas DataFrame from a Dictionary of Pandas Series

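We can also pass a dictionary whose values are Pandas Series to the pandas.DataFrame constructor. The resulting row index is the union of the indexes of the individual Series, and positions with no value are filled with NaN.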

# Making necessary imports
import pandas as pd

# Defining a dictionary of Pandas Series
dict_data = {'Column_one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
             'Column_two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

# Creating a DataFrame from the Dictionary
df1 = pd.DataFrame(dict_data)

# Printing dataframe and its index
print(df1)
print(df1.index)

OUTPUT:

   Column_one  Column_two
a         1.0           1
b         2.0           2
c         3.0           3
d         NaN           4
Index(['a', 'b', 'c', 'd'], dtype='object')

Creating a Pandas DataFrame from a Structured or Record array

Structured arrays are ndarrays whose datatype is a composition of simpler datatypes organized as a sequence of named fields. A structured array can also be passed to pandas.DataFrame() to create a Pandas DataFrame.

# Making necessary imports
import pandas as pd
import numpy as np

# Defining a structured array with an int field, a float field,
# and a 10-byte string field ('S10'; the older 'a10' alias was removed in NumPy 2.0)
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]

# Creating a DataFrame
df2 = pd.DataFrame(data)
print(df2)  # Printing the DataFrame

OUTPUT:

   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'
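
The b'...' prefix in column C shows that the string field is stored as bytes. As a hedged sketch, assuming the bytes are UTF-8 encoded, the column can be converted to ordinary strings with the .str.decode accessor. The array is the one from the example above, with the bytes field written as 'S10'.

```python
import numpy as np
import pandas as pd

# Structured array from the example above; 'S10' is a 10-byte string field
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]

df2 = pd.DataFrame(data)

# Assumption: the bytes are UTF-8 encoded; decode them into Python strings
df2['C'] = df2['C'].str.decode('utf-8')

print(df2)
```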

How to select from a DataFrame?

After creating a DataFrame, you need to be able to select particular data from it. Pandas provides concise ways to do this. The methods for selecting particular rows and columns are discussed below.

Selecting columns from a Pandas DataFrame

The syntax for selecting a particular column is:

DataFrameName['column_name']

EXAMPLE:

# Making necessary imports
import pandas as pd

# Defining a DataFrame with three columns namely 'one', 'two' and 'three'
df0 = pd.DataFrame({'one': [1, 2, 3, 4],
                'two': [4, 3, 2, 1],
                'three': [5, 6, 7, 8]})

# Selecting the column 'one'
df0['one'] 

OUTPUT:

0    1
1    2
2    3
3    4
Name: one, dtype: int64

We can also select multiple columns from the DataFrame by passing a list of column names. In the above example, we can select only the columns ‘one’ and ‘two’ with:

df0[['one', 'two']]

OUTPUT:

   one  two
0    1    4
1    2    3
2    3    2
3    4    1

Note: When only one column is selected, the result is a Pandas Series whereas when multiple columns are selected, the result is a Pandas DataFrame.
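
The difference described in the note can be checked directly. The sketch below reuses the df0 frame from the example above; the variable names single and as_frame are introduced here only for illustration.

```python
import pandas as pd

df0 = pd.DataFrame({'one': [1, 2, 3, 4],
                    'two': [4, 3, 2, 1],
                    'three': [5, 6, 7, 8]})

single = df0['one']       # a string key returns a Series
as_frame = df0[['one']]   # a one-element list returns a one-column DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

Passing a one-element list is a handy way to keep a single column as a DataFrame when later code expects one.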

Selecting rows from a Pandas DataFrame

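Pandas provides the iloc indexer to select rows (and columns) by integer position. Positions start at 0 and do not depend on the index labels shown in the DataFrame. Note that iloc is used with square brackets, not called like a method. To understand iloc, let us take the following example: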

# Making necessary imports
import pandas as pd

data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]

df = pd.DataFrame(data)
print(df) #printing the DataFrame

OUTPUT:

      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000

Some examples of selecting particular data from the DataFrame using iloc are shown below:

# Selecting value 2
print("First:\n",df.iloc[0,1])

# Selecting column index 1 i.e., second column of data frame 'b'
print("\nSecond:\n",df.iloc[:,1])

# Selecting rows at positions 1 to 2 and columns at positions 1 to 2
# (the end of the slice, 3, is excluded)
print("\nThird:\n",df.iloc[1:3, 1:3])

OUTPUT:

First:
2

Second:
0       2
1     200
2    2000
Name: b, dtype: int64

Third:
      b     c
1   200   300
2  2000  3000
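
To illustrate that iloc counts positions rather than reading index labels, here is a small sketch with a DataFrame whose rows are labeled with strings. The frame df_pos and its labels are hypothetical and not part of the example above.

```python
import pandas as pd

# Hypothetical frame whose row labels are strings rather than 0, 1, 2
df_pos = pd.DataFrame({'a': [1, 100, 1000],
                       'b': [2, 200, 2000]},
                      index=['x', 'y', 'z'])

first_row = df_pos.iloc[0]    # row at position 0, i.e. the row labeled 'x'
last_b = df_pos.iloc[-1, 1]   # last row, second column

print(first_row)
print(last_b)
```

Negative positions count from the end, just as with Python lists.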

This is how you create DataFrames and select particular rows or columns from them. In the next chapter, you will learn how to load external data into a Pandas Series or DataFrame.