In this post, we show the most useful functions to take advantage of our data stored in Dataframes.
Last week we looked at the functions most commonly used to handle data in NumPy array structures; however, sometimes we need a different format to perform data cleansing and transformation faster. Pandas is built as an extension of NumPy and improves on it, because it allows us to read data in many different formats and to solve mathematical or numerical problems.
One of the most important advantages of Pandas is its built-in indexing, which allows us to work with very large amounts of data and to read and write it very quickly.
Pandas Series vs Pandas Dataframe
NumPy usually stores data in the ndarray format; Pandas, however, offers different structures that we can choose according to the functionality we are looking for.
Pandas Series
This type is very similar to a Python list, but with the advantage of Pandas row indexing, whose labels we can change to whatever suits us. The two most common ways to create a Series are from a dictionary or from lists.
First, we import the Pandas package.
import pandas as pd
We create the lists and the dictionary with the data we want to add to our Series and then create the object.
labels = ['name', 'info', 'phone']
data = ['Safari', 'Playground', '1234567890']
dict_data = {
'name':'Safari',
'info':'Playground',
'phone':'1234567890'
}
pd.Series(data, labels) #or pd.Series(dict_data)
Return
name | Safari |
info | Playground |
phone | 1234567890 |
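Once the Series is created, its labels work like dictionary keys, so a value can be read directly by its label. A minimal sketch using the Series above:
s = pd.Series(data, labels)
s['phone']   #Returns '1234567890'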
Pandas Dataframe
Create a Dataframe
Dataframes are the most widely used data structure in Pandas because they give us more information to work with. We are going to create a Dataframe that contains people.
data = [['William', 'Male', 'Data Engineer'],['Sophia', 'Female', 'Software Engineer']]
labels = ['Name', 'Gender', 'Job']
pd.DataFrame(data, None, labels)
Return
| Name | Gender | Job |
0 | William | Male | Data Engineer |
1 | Sophia | Female | Software Engineer |
When creating a Dataframe, we must provide the following input parameters:
- The data you want to load
- The index list, in this case, 'None' because we want a default numerical index
- A list with the labels of our data
Normally, however, we create Dataframes from data sets stored in a file or a database. Pandas has functions that allow us to read data from different sources, such as:
- read_csv()
- read_excel()
- read_html() #This function looks for HTML tables
Also, in each of these functions, you can specify whether to load the header of the data (if it exists) and the encoding of the data if you work with languages such as Spanish.
Once the Dataframe has been processed, it can be saved again using the methods df.to_csv(), df.to_excel()...
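As a minimal sketch, assuming a hypothetical file called 'people.csv' stored with Latin-1 encoding:
#Read the file, using the first row as header and specifying the encoding
df = pd.read_csv('people.csv', header=0, encoding='latin-1')
#...cleansing and transformation...
#Save the result without the numerical index
df.to_csv('people_clean.csv', index=False)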
Select columns in Pandas
Selecting one or more columns of the Dataframe works in the same way as indexing a list, using the column labels.
#If we want to get it as a Dataframe
df[['Name']]
#If we want to get it as a Series
df['Name']
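We can also pass several labels at once to get a smaller Dataframe with only those columns; a short sketch with the Dataframe above:
df[['Name', 'Job']]   #Dataframe with only the Name and Job columns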
Select rows in Pandas
If you want to select the rows that you are interested in, the way to do it is similar to a 'where' in SQL or a filter in Spark.
Depending on the type of selection we want, there are two main functions: 'loc' searches by row label and 'iloc' by row position. In our example, both functions take the same parameter because we are using the default numerical indexing.
df.loc[0]
df.iloc[0]
Return
Name | William |
Gender | Male |
Job | Data Engineer |
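The difference between the two functions becomes clearer when the index is not the default one. A small sketch, assuming we set the Name column as the index:
df_named = df.set_index('Name')
df_named.loc['William']   #Search by row label
df_named.iloc[0]          #Search by row position, same row in this case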
If we want to get a specific value, we can search rows and columns at the same time.
df.loc[0, 'Job']
Return
Data Engineer |
Although this kind of lookup can be helpful, it is often more useful to filter by the features that interest us. To do this, you can apply conditions on columns to filter the whole Dataframe.
To find all the people in the Dataframe who are women:
df[df['Gender'] == 'Female']
Return
| Name | Gender | Job |
1 | Sophia | Female | Software Engineer |
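Several conditions can be combined with the & (and) and | (or) operators, wrapping each condition in parentheses. A minimal sketch:
df[(df['Gender'] == 'Female') & (df['Job'] == 'Software Engineer')]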
Create a new column in Pandas
We will add a column with the age of each one.
df['Age'] = pd.Series([29,25])
Return
| Name | Gender | Job | Age |
0 | William | Male | Data Engineer | 29 |
1 | Sophia | Female | Software Engineer | 25 |
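New columns can also be derived from existing ones; the following is just an illustrative sketch with a hypothetical 'Age in Months' column:
df['Age in Months'] = df['Age'] * 12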
Drop a column in Pandas
To remove a column or a row, you must use the drop function, specifying which axis it belongs to (axis=1 for columns, axis=0 for rows). By default, drop returns a new Dataframe; with inplace=True it modifies the existing one.
#Drop column
df.drop('Age', axis=1, inplace=True)
#Drop row
df.drop(0, axis=0, inplace=True)
GroupBy in Pandas
The functionality of GroupBy is the same as in other programming languages or databases: it groups the rows by an attribute and then computes an aggregate over the values of each group.
First, as always, we create our Dataframe. For this small test, we reuse the previous Dataframe, adding more people and a column with the age of each one.
data = [['William', 'Male', 'Data Engineer', 32], ['Sophia', 'Female', 'Software Engineer', 18], ['Daniel', 'Male', 'Data Scientist', 29]]
labels = ['Name', 'Gender', 'Job', 'Age']
df = pd.DataFrame(data, None, labels)
Return
| Name | Gender | Job | Age |
0 | William | Male | Data Engineer | 32 |
1 | Sophia | Female | Software Engineer | 18 |
2 | Daniel | Male | Data Scientist | 29 |
We create a GroupBy object using the Gender label and then calculate the average of each group.
df.groupby('Gender').mean(numeric_only=True)   #Average only the numeric columns (here, Age)
In addition to the average, other aggregations can be used, or a full statistical summary can be obtained with the describe() function.
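For example, several aggregations can be requested at once with agg(), or a full summary with describe(); a short sketch over the Age column:
df.groupby('Gender')['Age'].agg(['mean', 'min', 'max'])
df.groupby('Gender')['Age'].describe()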
Relevant functions
Delete rows that have any column with a NaN value.
df.dropna()
Delete columns that have any NaN value.
df.dropna(axis=1)
Replace each of the values that are NaN.
df.fillna('Missing this value')
Return the first 5 rows.
df.head()
Return the unique values of a column (without repetitions).
df['Gender'].unique()
Count the number of values of each type in a column.
df['Gender'].value_counts()
Sort the Dataframe by column values.
df.sort_values('Age')
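Since most of these functions return a new Dataframe, they can be chained together. A small illustrative sketch:
#Drop incomplete rows, sort by Age and keep the first 5 results
df.dropna().sort_values('Age').head()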
Transforming the structure by pivoting in Pandas
This function is very useful when we want to restructure the data for better visualization and show it in table format. To do this, you must specify which columns of our Dataframe to use as the values, the index, and the columns of the pivoted table.
df.pivot_table(values='Age', index=['Gender'], columns=['Job'])
Return
Job | Data Engineer | Data Scientist | Software Engineer |
Gender | | | |
Female | NaN | NaN | 18.0 |
Male | 32.0 | 29.0 | NaN |
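The pivot_table() function also accepts the aggregation function to apply and a value to replace the NaN cells; a short sketch with assumed parameters:
df.pivot_table(values='Age', index=['Gender'], columns=['Job'], aggfunc='mean', fill_value=0)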
Pandas multi-index
This is a more advanced feature of Pandas, but it can be very useful in some situations. Besides the numerical index that the Dataframe has, we can establish additional index levels that help us filter better.
In this case, we are going to create a Dataframe with a multiple index indicating whether each product belongs to a clothes or an electronics database.
database = ['Electronics DB', 'Electronics DB', 'Clothes DB', 'Clothes DB']
index = [1,2,1,2]
multiple_index = list(zip(database, index))
multiple_index = pd.MultiIndex.from_tuples(multiple_index)
data = [['Mobile'],['TV'], ['T-Shirt'], ['Jeans']]
pd.DataFrame(data, multiple_index, ['Product'])
Return
| | Product |
Electronics DB | 1 | Mobile |
Electronics DB | 2 | TV |
Clothes DB | 1 | T-Shirt |
Clothes DB | 2 | Jeans |
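The extra index level can then be used for selection, for example keeping one database or taking a cross-section by level with xs(). A sketch, assuming the Dataframe above is stored in a hypothetical df_multi variable:
df_multi = pd.DataFrame(data, multiple_index, ['Product'])
df_multi.loc['Electronics DB']   #All the products of the electronics database
df_multi.xs(1, level=1)          #The first product of each database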
Conclusions
Pandas gives us solutions to most of our data-manipulation problems, and it is worth using because it saves us a lot of time in exploratory analysis and data cleansing.
In addition, we usually work with many Dataframes containing different data. To put these Dataframes together, whether they share data in common or not, we use concatenation and joins; we will cover this in the next post.