
Simple Introduction to Building a Small Language Model for Your Documents Using S3 and PDFs

  • mschneider90265
  • Apr 27
  • 3 min read

Overview

Ever wanted to build your own chatbot that understands your internal documents? With the power of language models and document embeddings, you can turn a collection of PDFs stored in Amazon S3 into an intelligent, searchable assistant. This article walks you through how to do that step-by-step using open-source tools and a retrieval-augmented generation (RAG) pipeline.


I tested the code with my own data; after generalizing the commands and removing my personal information, I only did a visual inspection of the generalized version shown here.


Step 1: Download PDFs from S3

What it's doing:

  • Connects to your AWS S3 bucket using boto3 (AWS SDK for Python)

  • Looks for files within a specific prefix (like a folder path)

  • Downloads all .pdf files from that prefix to your local machine (or wherever your code is running)

Why it's important:

  • Your data lives in S3, but to use it in a chatbot, you need the raw content from those files.

  • PDFs aren't directly usable by language models—you need to extract the text.
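
A minimal sketch of this step using boto3. The bucket name, prefix, and local folder below are placeholders rather than values from the original post:

import os
import boto3

# Placeholder values; replace with your own bucket, prefix, and download folder.
BUCKET = "my-documents-bucket"
PREFIX = "knowledge-base/"
LOCAL_DIR = "pdfs"

os.makedirs(LOCAL_DIR, exist_ok=True)
s3 = boto3.client("s3")

# Page through every object under the prefix and download the PDFs.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.lower().endswith(".pdf"):
            local_path = os.path.join(LOCAL_DIR, os.path.basename(key))
            s3.download_file(BUCKET, key, local_path)
            print(f"Downloaded {key} -> {local_path}")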



Step 2: Extract Text from PDFs

What it's doing:

  • Uses PyMuPDF (or similar library) to read each PDF and extract text page-by-page

  • Compiles all the extracted text into one long string

Why it's important:

  • PDFs are binary files, not plain text.

  • You must extract readable text from them before a model can understand or generate answers based on the content.
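
A sketch of the extraction step with PyMuPDF (imported as fitz), assuming the pdfs folder created in Step 1:

import os
import fitz  # PyMuPDF

def extract_text(pdf_dir="pdfs"):
    """Read every PDF page by page and return all of the text as one string."""
    pages = []
    for name in sorted(os.listdir(pdf_dir)):
        if name.lower().endswith(".pdf"):
            with fitz.open(os.path.join(pdf_dir, name)) as doc:
                for page in doc:
                    pages.append(page.get_text())
    return "\n".join(pages)

full_text = extract_text()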



Step 3: Chunk the Text

What it's doing:

  • Breaks the long string of text into smaller chunks (e.g., 500-character sections with some overlap between chunks)

  • Overlap (e.g., 50 characters) helps maintain context between chunks

Why it's important:

  • Embedding models and language models work better on short, focused text segments.

  • Large documents can overwhelm models, and splitting helps maintain performance and retrieval accuracy.
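
A simple character-based chunker along these lines, using the 500-character chunks and 50-character overlap described above:

def chunk_text(text, chunk_size=500, overlap=50):
    """Split one long string into overlapping character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap for context
    return chunks

chunks = chunk_text(full_text)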



Step 4: Generate Embeddings and Store

What it's doing:

  • Transforms each text chunk into a vector (a numerical representation of meaning) using a sentence embedding model

  • Stores those vectors in a FAISS index (a fast similarity search engine)

Why it's important:

  • When a user asks a question, you don’t search the entire document—you search the vector space for the most semantically similar text chunks.

  • This lets the chatbot find relevant information fast, even if the wording is different from the user's question.
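
A sketch using sentence-transformers and FAISS. The all-MiniLM-L6-v2 model is a common small embedding model chosen for illustration, not necessarily the one in the original code:

import faiss
from sentence_transformers import SentenceTransformer

# Load a small sentence embedding model (illustrative choice).
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Turn every chunk into a float32 vector, as FAISS expects.
embeddings = embedder.encode(chunks, convert_to_numpy=True).astype("float32")

# Build an exact (flat) L2 index sized to the embedding dimension and add the vectors.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)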



Step 5: Build a Simple Retrieval-Augmented Chatbot

What it's doing:

  • When a user asks a question:

    1. Embed the question using the same embedding model

    2. Search the FAISS index for the top-matching chunks of text

    3. Construct a prompt by combining the context with the user's question

    4. Feed that prompt into a language model (local or API) to generate a coherent answer

Why it's important:

  • Language models don’t “know” your PDFs out of the box.

  • This approach retrieves relevant knowledge from your data and lets the LLM generate natural answers based on that information.
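
A sketch that ties the pieces together, continuing from the previous steps. It uses distilgpt2 (the small local model mentioned in the deployment section below) as the generator; the answer() helper name is just for illustration:

from transformers import pipeline

# A small local text-generation model; swap in any other local model or hosted API.
generator = pipeline("text-generation", model="distilgpt2")

def answer(question, k=3, max_new_tokens=150):
    """Embed the question, retrieve the top-k chunks, and generate an answer."""
    q_vec = embedder.encode([question], convert_to_numpy=True).astype("float32")
    _, idx = index.search(q_vec, k)                  # nearest chunks in the FAISS index
    context = "\n".join(chunks[i] for i in idx[0])   # stitch the retrieved chunks together
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    result = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    return result[0]["generated_text"][len(prompt):].strip()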



Ready to Run

This whole setup is called RAG (Retrieval-Augmented Generation). It's like giving a model short-term memory by letting it “look up” relevant chunks from your documents when it needs them.

Now you can call the chatbot directly. For example, using the answer() helper from the Step 5 sketch (the question is only an illustration):
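
# Ask a question against your indexed documents.
print(answer("What does the onboarding guide say about security training?"))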



Conclusion

You now have a working example pipeline for building a lightweight chatbot grounded in your own documents using a small language model and smart retrieval. Whether you're making a company knowledge bot, an internal assistant, or just experimenting, this setup is fast, scalable, and cost-effective.


Want to Deploy This?

You can:

  • Wrap it in a Flask API

  • Run it locally or deploy on AWS Lambda or SageMaker

  • Replace distilgpt2 with a hosted model such as the ChatGPT (OpenAI) or Mistral API for better answers
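
For example, a minimal Flask wrapper around the answer() helper from the Step 5 sketch might look like this (the route name and port are arbitrary):

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    # Expects JSON like {"question": "..."} and returns the generated answer.
    # answer() is the retrieval-plus-generation helper sketched in Step 5.
    question = request.get_json()["question"]
    return jsonify({"answer": answer(question)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)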


I've also added a bonus version that streams the chatbot's response for a more "live typing" feel.
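
The bonus code isn't reproduced here, but one way to get that streaming effect with a local transformers model is the TextStreamer class, reusing the embedder, index, and chunks from the earlier sketches:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
streamer = TextStreamer(tokenizer, skip_prompt=True)  # prints tokens as they are generated

def answer_streaming(question, k=3, max_new_tokens=150):
    """Same retrieval as Step 5, but the answer streams to the console."""
    q_vec = embedder.encode([question], convert_to_numpy=True).astype("float32")
    _, idx = index.search(q_vec, k)
    context = "\n".join(chunks[i] for i in idx[0])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    model.generate(**inputs, max_new_tokens=max_new_tokens, streamer=streamer)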





 

 
 
 
