Full-Text: Exploring the Netflix Catalog

CrateDB is an exceptional choice for handling complex queries and large-scale data sets. One of its standout features is its full-text search capabilities, built on top of the powerful Lucene library. This makes it a great fit for organizing, searching, and analyzing extensive datasets.

In this tutorial, we will explore how to manage a dataset of Netflix titles, making use of CrateDB Cloud’s full-text search capabilities. Each entry in our imaginary dataset will have the following attributes:

  • show_id: A unique identifier for each show or movie.

  • type: Specifies whether the title is a movie, TV show, or another format.

  • title: The title of the movie or show.

  • director: The name of the director.

  • cast: An array listing the cast members.

  • country: The country where the title was produced.

  • date_added: A timestamp indicating when the title was added to the catalog.

  • release_year: The year the title was released.

  • rating: The content rating (e.g., PG, R, etc.).

  • duration: The duration of the title in minutes or seasons.

  • listed_in: An array containing genres that the title falls under.

  • description: A textual description of the title, indexed using full-text search.

To begin, let’s create the schema for this dataset:

Creating the Table

CrateDB uses SQL, a powerful and familiar language for database management. To store the data, create a table with columns tailored to the dataset using the CREATE TABLE command. Importantly, you will also take advantage of CrateDB’s full-text search capabilities by setting up a full-text index on the description column. This will enable you to perform complex textual queries later on.

CREATE TABLE "netflix_catalog" (
   "show_id" TEXT PRIMARY KEY,
   "type" TEXT,
   "title" TEXT,
   "director" TEXT,
   "cast" ARRAY(TEXT),
   "country" TEXT,
   "date_added" TIMESTAMP,
   "release_year" TEXT,
   "rating" TEXT,
   "duration" TEXT,
   "listed_in"  ARRAY(TEXT),
   "description" TEXT INDEX using fulltext
);

Run the above SQL command in CrateDB to set up your table. With the table ready, you’re now set to insert the dataset.

Inserting Data

Now, insert data into the table you just created, by using the COPY FROM SQL statement.

COPY netflix_catalog
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_netflix.json.gz'
WITH (format = 'json', compression='gzip');

Run the above SQL command in CrateDB to import the dataset. After this commands finishes, you are now ready to start querying the dataset.