From Documents to Dialogue: Unlocking PDF Data with a Smart Chatbot

Mountains of valuable information are locked away inside PDF files. Whether it’s business reports, regulatory documents, user manuals, or research papers, the ability to extract and utilize insights from these documents is becoming essential.  

In this live session, Simon Prickett, Developer Advocate at CrateDB, will begin by showing you how to extract data from text and images in PDF files and store it in CrateDB. From there, you’ll see how to generate embeddings using AI models and perform hybrid semantic and keyword searches with SQL queries. Finally, we’ll put it all together and demonstrate a natural language chatbot that takes questions in plain English and returns responses from a large language model.

We’ll share the code for a complete Python project that you’ll be able to try out with a free CrateDB cloud database and your own PDF files.  

Register now and learn how CrateDB can fit into your organization’s retrieval-augmented generation (RAG) pipelines.

What You'll Learn

  • Data Preparation: Discover how to extract text from PDF documents, generate textual descriptions of images, and store both in CrateDB for vector and full-text searching.
  • Data Retrieval and Augmentation: Learn how natural language search queries are converted to vector embeddings and used in semantic and hybrid searches. You’ll also see how to augment a large language model (LLM) prompt with data from the database.
  • Response Generation: In the final step of the pipeline, we’ll introduce techniques for generating coherent and fluent responses to users. 
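
To make the retrieval and augmentation steps above more concrete, here is a minimal Python sketch. It is not the code from the session: it assumes a semantic search and a keyword search have each already returned a ranked list of chunk IDs, merges them with reciprocal rank fusion (one common way to build a hybrid ranking, chosen here for illustration), and places the top chunks into an LLM prompt. The function names, the chunk IDs, the prompt wording, and the constant k=60 are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one hybrid ranking.

    Each chunk earns 1 / (k + rank) for every list it appears in, so
    chunks ranked well by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def build_prompt(question, chunks):
    """Augment the user's question with retrieved text chunks."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Hypothetical results: chunk IDs as ranked by each search.
    semantic_hits = ["c2", "c1", "c3"]
    keyword_hits = ["c1", "c4", "c2"]
    # Hypothetical chunk texts that would normally come from the database.
    texts = {
        "c1": "The pump must be serviced every 6 months.",
        "c2": "Service intervals depend on operating hours.",
        "c3": "Warranty terms are listed in Appendix B.",
        "c4": "Pump model numbers start with the prefix P.",
    }
    top_ids = reciprocal_rank_fusion([semantic_hits, keyword_hits])[:2]
    print(build_prompt("How often is the pump serviced?",
                       [texts[cid] for cid in top_ids]))
```

In a real pipeline, the two input rankings would come from vector and full-text queries against the database, and the finished prompt would be sent to the LLM for response generation.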

Date and Time

Wednesday 12th February 2025 at: 

  • 5pm CET (Berlin) 
  • 4pm GMT (London) 
  • 11am EST (New York) 
  • 8am PST (Los Angeles) 

Duration: 45 minutes.

Details

February 12, 2025

Webinar

CrateDB

Venue