Python Pandas: A complete tutorial for Beginners

In the following post, we show the most useful functions to take advantage of our data stored in dataframes.

The previous week we were investigating the functions that are most used to handle data in NumPy array structures, however, sometimes we need to use a different format to perform the cleansing and transformation of data faster. Pandas is one of the libraries that better improves NumPy, besides being developed as an extension of NumPy, because it allows us to read different formats and solve mathematical or numerical problems.

One of the most important advantages of Pandas is its integrated indexation that allows us to work with very big quantities of data being able to read and write very fast.

Pandas Series VS Pandas Dataframe

NumPy uses to store data in ndarray format, however, Pandas has different formats that can be used according to the functionality we are looking for.

Pandas Series

This type is very similar to Python lists but has the advantages of Pandas row indexing that can be modified according to the interest we have. The two most common ways to create a series are from a dictionary or using lists. 

First, we import the Pandas package.

import numpy as np

We create the lists and the dictionary with some data we want to add to our series and then, create the object.

labels = ['name', 'info', 'phone']
data = ['Safari', 'Playground', '1234567890']

dict_data = {
    'name':'Safari',
    'info':'Playground',
    'phone':'1234567890'
}

pd.Series(data, labels) #or pd.Series(dict_data)

Return

name Safari
info Playground
phone 1234567890

Pandas Dataframe

Create a Dataframe

Dataframes are the most widely used data structures in Pandas because they provide us more information that we can use. We are going to create a Dataframe that will contain people.

data = [['William', 'Male', 'Data Engineer'],['Sophia', 'Female', 'Software Engineer']]
labels = ['Name', 'Gender', 'Job']

pd.DataFrame(data, None,labels)

Return

  Name Gender Job
0 William Male Data Engineer
1 Sophie Female Software Engineer

When we are creating a Dataframe we must the next input parameters:

  • The data you want to load
  • The index list, in this case, 'None' because we want a default numerical index
  • A list with the labels of our data

However, the normal is that we create the Dataframes from data sets that we have stored in a file or database. We have functions that allow us to read data from different sources, such as:

  • read_csv()
  • read_excel()
  • read_html() #This function looks for HTML tables

Also, in each function, you can specify if you want to load the header of the data (if it exists) and the codification of the data if you use data in languages such as Spanish.

Once the dataframe has been processed, it can be saved again using the functions pd.to_csv(), pd.to_excel()...

Select columns in Pandas

To select one of more columns of the Dataframe, it is done in the same way as the indexes of the list.

#If we want to get it as Dataframe
df[['Name']]
#If we want to get it as serie
df['Name']

Select rows in Pandas

If you want to select the rows that you are interested in, the way to do it is similar to a 'where' in SQL or a filter in Spark.

Depending on the type of selection we want, there are two main functions. The 'loc' is used to search by the row name and 'iloc' to search by the row index. In our example, both functions need the same parameter because we are using the default indexing.

df.loc[0]
df.iloc[0]

Return

Name William
Gender Male
Job Data Engineer

If we want to get a specific value, we can search rows and columns at the same time.

df.loc[0, 'Job']

Return

Data Engineer

Although this search can be helpful, it is more useful to be able to filter by the features that interest us. To do this, you can apply filters on the columns that affect the whole Dataframe.

To find all the people in the Dataframe who are a woman

df[df['Gender'] == 'Female']

Return

Name Gender Job
Sophia Female Software Engineer

Create a new column in Pandas

We will add a column with the age of each one.

df['Age'] = pd.Series([29,25])

Return

  Name Gender Job Age
0 William Male Data Engineer 29
1 Sophia Female Software Engineer 25

Drop a column in Pandas

To remove a column or a row you must use the drop function specified which axis it belongs to.

#Drop column
df.drop('Age', axis=1, inplace=True)
#Drop row
df.drop(0, axis=0, inplace=True)

GroupBy in Pandas

The functionality of GroupBy is the same as in other programming languages or databases. It consists of grouping on an attribute and calculating all the values of each group.

First, as always, we will create our Dataframe. In this small test, we will use the previous Dataframe but adding more people and a column with the age of each one.

data = [['William', 'Male', 'Data Engineer', 32], ['Sophia', 'Female', 'Software Engineer', 18], ['Daniel', 'Male', 'Data Scientist', 29]]
labels = ['Name', 'Gender', 'Job', 'Age']

df = pd.DataFrame(data, None,labels)

Return

  Name Gender Job Age
0 William Male Data Engineer 32
1 Sophia Female Software Engineer 18
2 Daniel Male Data Scientist 29

We create a GroupBy object using gender label and then we calculate the average of each group.

df.groupby('Gender').mean()

In addition to the average, other calculations can be used, or all of them can be performed using the describe() function.

Relevant functions

Delete rows that have any column with a NaN value.

df.dropna()

Delete columns that have any NaN value.

df.dropna(axis=1)

Replace each of the values that are NaN.

df.fillna('Missing this value')

Return the first 5 rows.

df.head()

Return the single values of a column (they are not repeated)

df['Gender'].unique()

Count the number of values of each type in a column.

df['Gender'].value_counts()

Sort the Dataframe by column values.

df.sort_values('Age')

Transforming the structure by pivoting in Pandas

This function is very useful when we want to restructure the data to make a better visualization and show the data in table format. To do this, you must specify what values we want to pivot from our dataframe as column, values, and indexes.

df.pivot_table(values='Age', index=['Gender'], columns=['Job'])

Return

Job Data Engineer Data Scientist Software Engineer
Gender      
Female NaN NaN 18.0
Male 32.0 29.0 NaN

Pandas multi-index

This is a more advanced feature of Pandas, however, it can be very useful in some situations. Besides the numerical index that dataframe has, we can establish more indexes that help us to filter better.

In this case, we are going to create a Dataframe that will have multiple indexes if it belongs to a cloth or electronics database.

database = ['Electronics DB', 'Electronics DB', 'Clothes DB', 'Clothes DB']
index = [1,2,1,2]
multiple_index = list(zip(database, index))
multiple_index = pd.MultiIndex.from_tuples(multiple_index)

data = [['Mobile'],['TV'], ['T-Shirt'], ['Jeans']]

pd.DataFrame(data, multiple_index, ['Product'])

Return

    Product
Electronics DB 1 Mobile
Electronics DB 2 TV
Clothes DB 1 T-Shirt
Clothes DB 2 Jeans

Conclusions

Pandas give us solutions to most of our problems in data manipulation. Therefore, it is important to use it because it saves us a lot of time in exploratory analysis and data cleansing.
Besides, we usually use many Dataframes that contain different data. To be able to put these Dataframes together when they have data in common or even when they are different, we use concatenation and joins. However, we will see this in the next post.


Your subscription could not be saved. Please try again.
Your subscription has been successful. Thank you for joining this great data world.

GET OUR NEWSLETTER

You'll get the latest posts delivered to your inbox.