An in-depth analysis can take days or even months. However, with a single Python command, we can get a global view of our data.
Before manipulating data in the Big Data world or building predictive models to solve our problems, we have to check what data we are using and what its potential is, because bad data produces bad solutions.
There are many techniques for data cleansing, feature removal, statistical analysis, and descriptive analysis, among others. However, the pandas-profiling package gives us a first exploratory data analysis very quickly, which helps us decide how to orient the rest of our analysis.
Pandas Profiling Execution
First, we have to install the package.
pip install pandas-profiling
For this example, we will use the New York Airbnb dataset, which is available at the following link: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
We create a Pandas dataframe from the .csv dataset.
import pandas as pd
from pandas_profiling import ProfileReport

# Load the downloaded CSV file into a dataframe
df = pd.read_csv('AB_NYC_2019.csv')
Creating the report with the analysis is very simple. Use the command below and keep the result in a variable so that we can display or save it later.
profile = ProfileReport(df, title="New York Report")
To view the report, you can display it in Jupyter or save it directly as an HTML document.
# Jupyter
profile.to_widgets()
# Save HTML
profile.to_file("your_report.html")
What statistical metrics does the report provide?
The report opens with an overview of the dataframe and flags common warnings in our data that can damage our model or analysis.
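To get a feel for the kind of figures an overview like this summarizes, the sketch below computes a few of them by hand with plain pandas. It is not how the package works internally. The small dataframe is a made-up stand-in that borrows three column names from the Airbnb dataset, and its values are invented for illustration.

```python
import pandas as pd

# Made-up stand-in for the Airbnb data; the values are illustrative only.
df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Private room"],
    "price": [149, 225, 149, 80],
    "reviews_per_month": [0.21, None, 0.21, 4.64],
})

# A few dataset-level figures of the kind shown in an overview.
overview = {
    "rows": len(df),
    "columns": df.shape[1],
    "missing_cells": int(df.isna().sum().sum()),   # NaN cells across all columns
    "duplicate_rows": int(df.duplicated().sum()),  # rows repeating an earlier row
}
print(overview)
```

Duplicate rows and missing cells are typical examples of the warnings worth checking before modelling.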
[Figure: Price variable in the variables section]
Unlike the general overview, the variables section gives a detailed analysis of each variable, with its statistical indicators and its distribution.
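As a rough idea of what a per-variable analysis involves, the following sketch computes some indicators for a single numeric column with plain pandas. The price values are invented, and the chosen indicators are only an assumption about what is useful to look at, not a list of what the report contains.

```python
import pandas as pd

# Invented prices, including one extreme value, to mimic a skewed variable.
prices = pd.Series([80, 149, 149, 225, 1200], name="price")

stats = {
    "mean": float(prices.mean()),
    "median": float(prices.median()),
    "min": int(prices.min()),
    "max": int(prices.max()),
    "distinct": int(prices.nunique()),
    "zeros": int((prices == 0).sum()),
}
print(stats)

# Counting values per bin approximates the histogram of a numeric variable.
print(prices.value_counts(bins=3, sort=False))
```

A mean far above the median, as in this toy series, is a quick hint that a few extreme values dominate the variable.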
Interactions & Correlations
In these sections, we find the interactions and correlations between the variables, which indicate the dependence between each pair of variables and their density.
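A pairwise correlation matrix can also be reproduced directly in pandas, which helps when we want to double-check a value seen in the report. The sketch below uses the Pearson coefficient as one possible measure. The dataframe and its values are made up for illustration.

```python
import pandas as pd

# Made-up data; only the column names come from the Airbnb dataset.
df = pd.DataFrame({
    "price": [80, 149, 225, 300],
    "minimum_nights": [1, 2, 3, 4],
    "room_type": ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
})

# Keep the numeric columns only, then compute the pairwise Pearson correlation.
corr = df.select_dtypes(include="number").corr(method="pearson")
print(corr)
```

Values close to 1 or -1 point to strongly dependent pairs, which may be redundant as model features.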
Missing values & Sample
[Figure: Missing values section]
These sections are very useful for checking which features have the most missing values and, therefore, which data must be transformed afterwards. They also show some sample rows of the data.
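The same check can be sketched in plain pandas: count the missing values per column, express them as a percentage, and look at a few rows. The dataframe is again a made-up stand-in with invented values.

```python
import pandas as pd

# Made-up data with some gaps; the values are illustrative only.
df = pd.DataFrame({
    "name": ["Cozy room", "Loft", None, "Studio"],
    "price": [149, 225, 150, 89],
    "reviews_per_month": [0.21, None, None, 4.64],
})

# Missing values per column, as a count and as a percentage of the rows.
missing = df.isna().sum()
missing_pct = df.isna().mean().mul(100).round(1)
table = pd.DataFrame({"missing": missing, "percent": missing_pct})
table = table.sort_values("missing", ascending=False)
print(table)

# A small sample of rows, for a quick look at the raw data.
print(df.head(2))
```

Columns at the top of this table are the first candidates for imputation or removal.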
This package gives us a quick first contact with the data, so that we know how to use it. Here we used the default configuration with a single command, but pandas-profiling offers a multitude of configuration options to adapt the report to our requirements.
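As one example of such a configuration, the sketch below builds a lighter report by passing minimal=True, which switches off the most expensive computations and can help with large datasets. The helper name build_minimal_report and the default title are hypothetical, chosen only for this illustration. The code assumes that pandas-profiling is installed under the import name used in this article (newer releases of the package are published as ydata-profiling).

```python
import pandas as pd


def build_minimal_report(df: pd.DataFrame, title: str = "New York Report (minimal)"):
    """Return a lighter profile report for df (hypothetical helper)."""
    # Imported here so that the helper can be defined even if the package
    # is not installed; it is only needed when the function is called.
    from pandas_profiling import ProfileReport

    # minimal=True disables the costly parts of the analysis.
    return ProfileReport(df, title=title, minimal=True)
```

Calling build_minimal_report(df).to_file("minimal_report.html") would then save the lighter report in the same way as before.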