CrateDB Blog | Development, integrations, IoT, & more

Step by Step Guide to Building a PDF Knowledge Assistant

Written by Wierd van der Haar | 2025-01-15

This guide outlines how to build a PDF Knowledge Assistant, covering:

  • Setting up a project folder.
  • Installing dependencies.
  • Using two Python scripts (one for extracting data from PDFs, and one for creating a chatbot).
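To give a feel for what the first script does: after text is extracted from a PDF, it is typically split into overlapping chunks before embeddings are generated, so that each chunk fits the embedding model's input limits while preserving context across boundaries. A minimal sketch of such a chunking helper (the function name and sizes are illustrative, not taken from the guide):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted PDF text into overlapping chunks for embedding.

    Overlapping windows keep sentences that straddle a chunk boundary
    retrievable from at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

The second script would then embed the user's question the same way and retrieve the most similar chunks as context for the chatbot's answer.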

*** This article is part of a blog series. If you haven't read the previous articles yet, be sure to check them out first.

Disclaimer

This guide provides a basic example to help you get started with building a PDF Knowledge Assistant. It is intended as a starting point and does not cover advanced use cases, optimizations, or production-grade considerations. Be sure to customize and enhance the implementation based on your specific needs and requirements.

Important Note: This script uses OpenAI’s API for image descriptions and embedding generation. As such, the content of your PDFs (including text and images) may be sent to OpenAI’s servers for processing. Do not use this script for confidential or sensitive PDFs unless you are certain it complies with your data privacy and security requirements.

For processing sensitive data, consider using local or self-hosted Large Language Models (LLMs) such as:

  • OpenLLM: A framework for running open-source LLMs locally.
  • Hugging Face Transformers: Offers pre-trained models like BERT, GPT-2, and more.
  • LLaMA (Large Language Model Meta AI): An efficient model designed for local use, available via Meta’s research initiative.
  • Falcon: A highly performant open-source LLM optimized for inference and fine-tuning.
  • Rasa: Focused on building local conversational AI.

By using local models, you can retain complete control over your data while still leveraging advanced capabilities for text and image processing.
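Whichever model produces the embeddings, hosted or local, the retrieval step of the assistant works the same way: the question's embedding is compared against the stored chunk embeddings, typically by cosine similarity. A minimal, self-contained sketch of that comparison (pure Python with illustrative values; real embeddings would come from one of the models above):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k stored chunks most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query, chunks[c]), reverse=True)
    return ranked[:k]
```

In practice a vector store such as CrateDB performs this similarity search server-side, but the principle is the same: swapping OpenAI for a local model only changes where the vectors come from, not how they are compared.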


>> The full guide, including both scripts, is available on GitHub.


*** Continue reading: Making a Production-Ready AI Knowledge Assistant