Apache Parquet is a free and open-source, column-oriented data storage format. Thanks to its efficient compression and encoding schemes, it provides optimized storage and retrieval and handles complex data structures in bulk. Even though this tutorial uses Python, Parquet files are compatible with many languages and data processing frameworks. Here, Parquet will be used to transfer data from data storage to CrateDB.
Besides the crate client library, we are also going to use the Pandas, SQLAlchemy, and Apache Arrow (PyArrow) libraries, so you should install them. To do so, you can use the following pip install command. To check the latest SQLAlchemy version supported by CrateDB, have a look at the CrateDB documentation - SQLAlchemy.
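Something along these lines should work; the `crate[sqlalchemy]` extra pulls in the CrateDB SQLAlchemy dialect (exact package names and version pins may differ over time, so check the documentation linked above):

```bash
pip install pandas pyarrow sqlalchemy "crate[sqlalchemy]"
```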
Before getting started, make sure to create the table that will be populated. In this tutorial, we will use a Parquet file containing data on yellow taxi rides in New York City from January 2022; refer to this link to download the file. To create the corresponding table, you can use the following SQL statement, which can be executed in the CrateDB console, for example.
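The statement below is a sketch: it assumes a table named `ny_taxi` and an abbreviated set of columns from the TLC yellow taxi schema, plus an `id` primary key that will be generated on the client side (SQLAlchemy's ORM requires a primary key, and CrateDB does not auto-generate one). Adjust names and types to the columns you actually want to import:

```sql
CREATE TABLE IF NOT EXISTS "ny_taxi" (
    "id" TEXT PRIMARY KEY,
    "vendor_id" INTEGER,
    "pickup_datetime" TIMESTAMP WITH TIME ZONE,
    "dropoff_datetime" TIMESTAMP WITH TIME ZONE,
    "passenger_count" DOUBLE PRECISION,
    "trip_distance" DOUBLE PRECISION,
    "fare_amount" DOUBLE PRECISION,
    "tip_amount" DOUBLE PRECISION,
    "total_amount" DOUBLE PRECISION
);
```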
If you are not using the data described above for this tutorial, that is fine; just make sure to create the corresponding table before proceeding.
Once everything is installed and the table has been successfully created, import the required resources, as seen below.
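Assuming the setup sketched above and SQLAlchemy 1.4 or later, the imports look roughly like this:

```python
import pyarrow.parquet as pq
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base, sessionmaker
from uuid import uuid4
```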
The first step is to read the Parquet file, which will be saved in the ny_taxi_parquet object.
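A minimal sketch, assuming the downloaded file sits in the working directory under the name used by the TLC download (the filename here is an assumption):

```python
# Opens the file and reads its metadata; the actual data is
# only loaded later, batch by batch.
ny_taxi_parquet = pq.ParquetFile('yellow_tripdata_2022-01.parquet')
```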
Now, set up the SQLAlchemy engine and session as seen below. If you are not running CrateDB on localhost, remember to replace the URI string with your own.
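With the crate dialect installed, the connection URI follows the `crate://<host>:<port>` scheme; 4200 is CrateDB's default HTTP port:

```python
# Create the engine and bind a session factory to it.
engine = sa.create_engine('crate://localhost:4200')
Session = sessionmaker(bind=engine)
session = Session()
```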
Before processing the newly imported file, its structure must be replicated as a Python class so the SQLAlchemy library can work its magic. With that goal, the class NY_taxi is implemented, as seen below. Note that the variable __tablename__ corresponds to the name of the table created previously.
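A sketch of the class, matching the abbreviated table definition above; since CrateDB does not auto-generate primary keys, a UUID is generated on the client for each row (gen_key is a helper introduced here for illustration):

```python
Base = declarative_base()

def gen_key():
    # Client-side primary key, since CrateDB has no auto-increment.
    return str(uuid4())

class NY_taxi(Base):
    __tablename__ = 'ny_taxi'  # must match the table created earlier

    id = sa.Column(sa.String, primary_key=True, default=gen_key)
    vendor_id = sa.Column(sa.Integer)
    pickup_datetime = sa.Column(sa.DateTime)
    dropoff_datetime = sa.Column(sa.DateTime)
    passenger_count = sa.Column(sa.Float)
    trip_distance = sa.Column(sa.Float)
    fare_amount = sa.Column(sa.Float)
    tip_amount = sa.Column(sa.Float)
    total_amount = sa.Column(sa.Float)
```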
For further details on how to use SQLAlchemy with CrateDB, you can refer to the documentation here.
Next, for each record represented in the Parquet file, an object should be instantiated and added to the session, to be committed afterward. However, depending on the size of the file and the number of records, the best approach is to process it in batches. In the following step, we read each batch of the Parquet file, transform it into a Pandas data frame, and process each row, instantiating an object and adding it to the session before committing.
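A sketch of that loop, using PyArrow's iter_batches; the batch size of 10,000 and the column mapping are assumptions, so map the Parquet column names to whatever schema you created:

```python
for batch in ny_taxi_parquet.iter_batches(batch_size=10_000):
    # Convert the Arrow record batch into a Pandas data frame.
    df = batch.to_pandas()
    for _, row in df.iterrows():
        record = NY_taxi(
            vendor_id=row['VendorID'],
            pickup_datetime=row['tpep_pickup_datetime'],
            dropoff_datetime=row['tpep_dropoff_datetime'],
            passenger_count=row['passenger_count'],
            trip_distance=row['trip_distance'],
            fare_amount=row['fare_amount'],
            tip_amount=row['tip_amount'],
            total_amount=row['total_amount'],
        )
        session.add(record)
    # Commit once per batch rather than once per row.
    session.commit()
```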
While the batches are being processed, you should already see the new records in your CrateDB instance.
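To check the progress, you can count the imported rows from the same session, for example (CrateDB refreshes tables periodically, so the count may lag by a moment):

```python
count = session.execute(sa.text('SELECT COUNT(*) FROM ny_taxi')).scalar()
print(f'{count} rows imported so far')
```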
This tutorial presented one of many strategies for importing Parquet files into CrateDB using Python, PyArrow, SQLAlchemy, and Pandas. To further understand how to fully explore SQLAlchemy with CrateDB, have a look at the CrateDB documentation - SQLAlchemy, or keep exploring the tutorial topics presented on CrateDB Community.