In this section, we will explore how SQL can be used to access and manipulate time-series data, focusing on the functions and techniques that make such analysis efficient and robust.
In the upcoming practical exercise, we will use a Jupyter notebook to interact directly with CrateDB. This hands-on approach will enable us to create tables and import the sample data discussed earlier. We will then use SQL to query this data and visualize the results using pandas DataFrames and Plotly.
Throughout this exercise, we will explore the extensive SQL querying capabilities of CrateDB. This includes the use of aggregations to summarize our data, Common Table Expressions (CTEs) to simplify complex queries, and window functions for advanced calculations. We will also cover the interpolation of missing values to ensure data continuity, and the calculation of moving averages for trend analysis. Furthermore, we will discuss the use of JOINs to combine data from different tables based on common columns.
Lastly, we will go over the use of JSON in pandas DataFrames for managing semi-structured data. This exercise will provide a comprehensive overview of the practical applications of CrateDB in handling time-series data.
We start by setting up our environment, connecting to CrateDB, and preparing our data for analysis. All necessary dependencies can be installed with pip. To work with the code examples, you need a connection to a CrateDB cluster, either a managed CrateDB Cloud instance or a local setup, for example with Docker. Replace the placeholders with your actual CrateDB credentials to establish a secure connection.
Once the connection string is defined, we create an engine object in SQLAlchemy. We turn to SQL to define the structure for our weather data storage. CrateDB speaks SQL, so it's straightforward to create a table with a structure that fits our dataset. If you're using CrateDB Cloud, there is a convenient option to import your weather data directly through an integrated import feature available in the tutorials section of the CrateDB Cloud Console.
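A minimal sketch of this setup, assuming the `sqlalchemy-cratedb` dialect and a hypothetical `weather_data` schema; the host, credentials, and column names are placeholders to adapt to your own cluster and dataset:

```python
# Sketch of the setup step; host, credentials, and the weather_data schema
# are placeholder assumptions -- adapt them to your cluster and dataset.
# Assumed dependencies: pip install sqlalchemy-cratedb pandas plotly

# For CrateDB Cloud keep ssl=true; a local Docker instance is typically
# reachable as crate://localhost:4200 without credentials.
CONNECTION_STRING = "crate://<user>:<password>@<host>:4200?ssl=true"

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS weather_data (
    "timestamp"  TIMESTAMP,
    location     TEXT,
    temperature  DOUBLE PRECISION,
    humidity     DOUBLE PRECISION
);
"""

# With the dialect installed, the DDL is executed through the engine:
#   import sqlalchemy as sa
#   engine = sa.create_engine(CONNECTION_STRING)
#   with engine.connect() as conn:
#       conn.execute(sa.text(CREATE_TABLE))
```

The engine object is created once and reused for all subsequent queries in the notebook.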
If you haven't used the CrateDB Cloud tutorial's sample data, you can load your dataset using the COPY FROM statement. Here's an example command that loads data from a CSV file hosted on GitHub. The result will indicate how many records were loaded successfully. For instance, a response like the following signifies a successful data import of 70,000 records.
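Such a `COPY FROM` statement could look as follows; the URL is a placeholder for the actual CSV file, and `weather_data` is the hypothetical table name used throughout these sketches:

```python
# Bulk-load a CSV file into the table. The URL below is a placeholder,
# not the actual location of the sample dataset.
COPY_STATEMENT = """
COPY weather_data
FROM 'https://example.com/path/to/weather_data.csv'
WITH (format = 'csv');
"""
# Executed through the engine, CrateDB's response reports how many
# records were imported -- 70,000 for the full sample dataset.
```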
Let's now turn to data querying. With SQL, we can easily retrieve data and use Python to load it into a pandas DataFrame for further analysis. For clarity, we convert the timestamps, which are stored in CrateDB as big integers, to the Python datetime format.
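The conversion step can be sketched with pandas alone; the column name and the millisecond resolution are assumptions about the dataset:

```python
import pandas as pd

# In the notebook the DataFrame comes from pd.read_sql(query, engine);
# here we fabricate two rows to demonstrate the conversion step.
df = pd.DataFrame({
    "timestamp": [1672531200000, 1672534800000],  # epoch milliseconds
    "temperature": [3.4, 3.1],
})

# CrateDB returns TIMESTAMP values as epoch milliseconds (big integers);
# convert them to Python datetimes for readable output and plotting.
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
print(df["timestamp"].iloc[0])  # 2023-01-01 00:00:00
```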
CrateDB is optimized for fast aggregations, thanks to its columnar storage engine. Here's an example where we calculate the average temperature for each location using the AVG function. We can further enhance the query to capture the highest and lowest temperatures along with their corresponding times. This is easily achieved with the MAX_BY and MIN_BY aggregation functions, which streamline the process into a single query.
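One way such a query could look, using the hypothetical `weather_data` table and column names from these sketches:

```python
# Aggregation sketch: AVG per location, plus MAX_BY/MIN_BY to fetch the
# timestamp at which the extreme temperature occurred -- all in one pass.
QUERY = """
SELECT location,
       AVG(temperature)                 AS avg_temp,
       MAX(temperature)                 AS max_temp,
       MAX_BY("timestamp", temperature) AS time_of_max,
       MIN(temperature)                 AS min_temp,
       MIN_BY("timestamp", temperature) AS time_of_min
FROM weather_data
GROUP BY location;
"""
# In the notebook: df = pd.read_sql(QUERY, engine)
```

`MAX_BY(a, b)` returns the value of `a` from the row where `b` is maximal, which is exactly what makes the single-query formulation possible.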
The next section of the notebook illustrates how to use the plotly library to visualize data. Such visualizations can quickly reveal patterns and areas where data might be missing.
The rest of the notebook illustrates advanced SQL queries and interpolation techniques. It gives detailed instructions on how to fill in missing values using advanced SQL features of CrateDB, such as window functions and Common Table Expressions. You will also learn how to calculate moving averages, and how CrateDB's support for JSON as a native data type provides the flexibility to handle dynamic schema requirements, both in queries and in pandas DataFrames.
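As a rough illustration of what gap filling and moving averages do, here is a pandas-only sketch; the notebook performs the equivalent in SQL with window functions and CTEs:

```python
import pandas as pd

# A short temperature series with two missing readings.
series = pd.Series([10.0, None, None, 13.0, 14.0])

# Linear interpolation fills the gaps from the neighboring values.
filled = series.interpolate(method="linear")
print(filled.tolist())    # [10.0, 11.0, 12.0, 13.0, 14.0]

# A centered 3-point moving average smooths short-term fluctuations;
# the edges have no full window and stay NaN.
smoothed = filled.rolling(window=3, center=True).mean()
print(smoothed.tolist())  # [nan, 11.0, 12.0, 13.0, nan]
```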
There are additional useful functions available in CrateDB that have not been covered in the Jupyter notebook.
Data binning is useful for grouping data into time-based intervals, e.g. every 10 seconds. It can also be used, for example, to harmonize non-equidistant events so they can be correlated with equidistant ones.
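A sketch of such a binning query using CrateDB's `DATE_BIN` function, again against the hypothetical `weather_data` table:

```python
# Group readings into 10-second buckets and average each bucket.
# DATE_BIN(interval, timestamp, origin) snaps each timestamp to the start
# of its bucket; 0 (the epoch) serves as the bucket origin here.
BINNING_QUERY = """
SELECT DATE_BIN('10 seconds'::INTERVAL, "timestamp", 0) AS bucket,
       location,
       AVG(temperature) AS avg_temp
FROM weather_data
GROUP BY bucket, location
ORDER BY bucket;
"""
```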
Some use cases might require sampling the dataset; this can easily be achieved with a modulo operation on the internal _docid attribute.
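For example, a query along these lines returns roughly every tenth row; since `_docid` is an internal identifier, treat this as a pragmatic sample rather than a statistically rigorous one:

```python
# Roughly 10% sample: keep rows whose internal _docid is divisible by 10.
SAMPLING_QUERY = """
SELECT "timestamp", location, temperature
FROM weather_data
WHERE _docid % 10 = 0;
"""
```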
Domain-specific logic can be encapsulated in user-defined functions (UDFs) to make it reusable in an easy and consistent manner. Here we calculate the temperature in Fahrenheit from degrees Celsius in a dedicated function.
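Such a UDF could be defined as follows; CrateDB user-defined functions are written in JavaScript, and the function name here is our own choice:

```python
# CrateDB UDFs are written in JavaScript; the outer SQL registers the
# function, the inner string holds its implementation.
CREATE_UDF = """
CREATE FUNCTION celsius_to_fahrenheit(temp DOUBLE PRECISION)
RETURNS DOUBLE PRECISION
LANGUAGE JAVASCRIPT
AS 'function celsius_to_fahrenheit(temp) {
    return temp * 9.0 / 5.0 + 32.0;
}';
"""
# Once created, it is used like any built-in function:
#   SELECT celsius_to_fahrenheit(temperature) FROM weather_data;

# Sanity check of the conversion formula itself, in Python:
assert 25 * 9.0 / 5.0 + 32.0 == 77.0
```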
Please check the CrateDB documentation for further aggregation functions, like HYPERLOGLOG_DISTINCT for approximate distinct counts, or array aggregation functions.