Exploratory Data Analysis is a critical step in the data science process, aimed at understanding the main characteristics of a dataset through visualizations and statistics. In this video, we will illustrate how to perform exploratory data analysis using Python libraries to visualize and understand the data stored in CrateDB.
Exploratory data analysis is an approach to analyzing datasets with various statistical methods and visualizations. This isn't just about making calculations or plotting charts; it's a thorough process of 'getting to know' the data. We look for patterns and trends that give us insights into how our variables relate to each other, and we seek to understand the underlying structure of the dataset. It's also crucial that we identify any deviations from expected trends, which could be anomalies or outliers that might indicate errors, or unique, significant occurrences that warrant further investigation. Alongside numerical analysis, visualization is a powerful tool in EDA. Through charts, graphs, and plots, we make complex data more accessible and understandable.
In this notebook, we explore weather data using the PyCaret library, an open-source machine learning library in Python that significantly reduces the time from hypothesis to insight. PyCaret stands out with its low-code approach, providing an efficient end-to-end machine learning pipeline that's easy for beginners and robust enough for expert users. PyCaret, coupled with CrateDB, is a powerful duo for handling large-scale data analytics and ML projects. CrateDB excels at handling massive amounts of data in real time, while PyCaret applies its machine learning algorithms directly to this data for an array of analytics tasks.
The journey through this notebook will take us through data extraction from CrateDB, preprocessing with PyCaret, and visualization to understand data distributions and relationships. We start by setting up our environment, connecting to CrateDB, and preparing our data for analysis. All necessary dependencies are listed in the requirements.txt file, and they can be installed by running the pip install command. If you are running this notebook in an environment like Google Colab, use the absolute path to the file, as illustrated here.
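A minimal sketch of the install step; the Colab path shown in the comment is a hypothetical example, not the actual location from the video:

    pip install -r requirements.txt

    # In Google Colab, reference the file by its absolute path, for example:
    # pip install -r /content/requirements.txt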
As a first step, we are looking into connecting to a CrateDB instance using SQLAlchemy, a popular SQL toolkit and Object-Relational Mapping (ORM) library for Python. Firstly, we establish a connection string and we can have two scenarios: one for connecting to a local CrateDB instance which is often used for development and testing purposes, and another for connecting to a CrateDB Cloud instance, which is typically utilized for production databases hosted in the cloud.
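As a sketch, the two variants of the connection string might look as follows; the host and credential placeholders are illustrative, and CONNECTION_STRING is simply the name we use in the snippets below:

    # Connect to a local CrateDB instance (development and testing).
    CONNECTION_STRING = "crate://localhost:4200"

    # Connect to a CrateDB Cloud instance (production); placeholders are illustrative.
    # CONNECTION_STRING = "crate://<user>:<password>@<your-cluster>.cratedb.net:4200?ssl=true"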
Once the connection string is defined, we create an engine object in SQLAlchemy. It's a common interface to the database that we are querying. The following code demonstrates fetching data from the CrateDB instance and loading it into a pandas DataFrame. We execute a simple SQL query to select all records from weather_data. After successfully fetching the data, we use df.head(5) to preview the first five rows of our DataFrame.
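A minimal sketch of that step, assuming the CrateDB SQLAlchemy dialect is installed and using the CONNECTION_STRING from above:

    import pandas as pd
    import sqlalchemy as sa

    # Create the engine, our common interface to CrateDB.
    engine = sa.create_engine(CONNECTION_STRING)

    # Fetch all records from the weather_data table into a DataFrame.
    with engine.connect() as connection:
        df = pd.read_sql("SELECT * FROM weather_data", connection)

    # Preview the first five rows.
    df.head(5)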
In our dataset, the timestamp column is crucial, as it holds the key to time-based analysis. It is stored as a big integer value that isn't automatically recognized as a datetime by pandas. We use the to_datetime function from pandas to convert the timestamp column into datetime objects, which pandas and Python can understand and manipulate as time-based data.
Once the conversion is complete, we set this timestamp column as the index of our DataFrame, which unlocks the potential to resample the data by period, compute rolling statistics, and much more. It's an essential step in preparing our data for time-series forecasting or any other kind of temporal analysis.
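A sketch of the conversion, assuming the big integer values are Unix epoch milliseconds, which is how CrateDB stores timestamps:

    # Convert epoch milliseconds into pandas datetime objects.
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")

    # Use the timestamp as the DataFrame index, keeping it sorted.
    df = df.set_index("timestamp").sort_index()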
In time series datasets, like weather records, it's common to encounter missing values. Here, we implement time-weighted interpolation to estimate missing values in our temperature, humidity, and wind_speed columns. Interpolation is a statistical method used to estimate unknown values that fall within the range of a set of known values. By specifying the method as 'time', we instruct pandas to take the time dimension into account: the estimates for missing data points are interpolated linearly, weighted by the time difference between the known points, using the datetime index as a reference.
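A minimal sketch of that step; interpolate with method="time" relies on the datetime index we set above:

    # Estimate missing values, weighting by the time elapsed between known points.
    for column in ["temperature", "humidity", "wind_speed"]:
        df[column] = df[column].interpolate(method="time")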
Before setting up the environment for the analysis, we perform a final preprocessing step. This piece of code extracts the data for a specific city (Berlin, in our case), calculates daily averages, and organizes the DataFrame with a timestamp index for further analysis or visualization.
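A sketch of that step under one assumption: the city is stored in a column named location, which is a hypothetical name for illustration:

    # Filter for Berlin; "location" is a hypothetical column name.
    df_berlin = df[df["location"] == "Berlin"]

    # Resample to daily averages; the timestamp index carries over.
    daily = df_berlin[["temperature", "humidity", "wind_speed"]].resample("D").mean()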
Now, we initialize our forecasting environment using PyCaret. We start by defining the forecast_horizon, which is set to 10. This parameter is crucial, as it tells our model how far into the future we want to predict; in this case, 10 days. Next, we set fold to 3, which indicates that we will be using a 3-fold cross-validation process.
The TSForecastingExperiment object serves as the primary tool for time series analysis. To its setup method we also pass the target variable, which is temperature, and a list of seasonal periods to consider. The seasonal period parameter is a list containing 1, 5, and 20, the cycle lengths we hypothesize might be inherent in the temperature data.
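A sketch of the setup under those settings, assuming the daily DataFrame prepared above; the session_id seed is an added assumption for reproducibility:

    from pycaret.time_series import TSForecastingExperiment

    forecast_horizon = 10
    fold = 3

    exp = TSForecastingExperiment()
    exp.setup(
        data=daily,                  # daily averages prepared above
        target="temperature",        # the variable we want to forecast
        fh=forecast_horizon,         # predict 10 days into the future
        fold=fold,                   # 3-fold cross-validation
        seasonal_period=[1, 5, 20],  # candidate cycle lengths
        session_id=42,               # hypothetical seed for reproducibility
    )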
After running this code, we can observe a table with a summary of the settings and many findings from the time series forecasting setup. We won't go into detail here, but it is worth mentioning that the model reports possible seasonality for the 5- and 20-day options.
By calling the plot_model function, we instruct PyCaret to plot the original data. It's an interactive plot, allowing us to select a particular area on the graph and expand it to a more granular level than what we can initially see in the plot.
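Assuming the experiment object from the setup step, the default call renders the original series:

    # With no arguments, plot_model shows the original time series interactively.
    exp.plot_model()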
PyCaret comes with a variety of tools and visualization options. The next snippet shows how to generate the Autocorrelation Function (ACF) plot, a tool used to visualize how the data points in a time series are related to each other. For our example, the graph indicates that recent temperatures correlate more strongly with today's temperature than temperatures measured further in the past.
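A sketch of that call, using PyCaret's named plot types:

    # Plot the Autocorrelation Function for the target series.
    exp.plot_model(plot="acf")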
Finally, PyCaret offers an easy way to render all of these graphs in a single multi-plot by using the diagnostics plot. The diagnostics also include the Periodogram, Histogram, Q-Q Plot, and Partial Autocorrelation Function plots. From these plots, we can further learn about the distribution of our data, its frequencies, and its correlations.
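The corresponding call, again assuming the experiment object from above:

    # Render the combined diagnostics: time series, ACF, PACF, periodogram,
    # histogram, and Q-Q plot in a single figure.
    exp.plot_model(plot="diagnostics")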
With this overview, you are equipped to perform your own EDA on time series data, leverage the integration between PyCaret and CrateDB, and be well on your way to drawing insightful forecasts from your datasets.