
Simple Introduction to Building a Small Language Model for Your Documents Using S3 and PDFs

  • mschneider90265
  • Apr 27
  • 3 min read

Overview

Ever wanted to build your own chatbot that understands your internal documents? With the power of language models and document embeddings, you can turn a collection of PDFs stored in Amazon S3 into an intelligent, searchable assistant. This article walks you through how to do that step-by-step using open-source tools and a retrieval-augmented generation (RAG) pipeline.


I tested the code with my own data; after generalizing the commands and removing my personal information, I only did a visual inspection of the generalized version shown here.


Step 1: Download PDFs from S3

What it's doing:

  • Connects to your AWS S3 bucket using boto3 (AWS SDK for Python)

  • Looks for files within a specific prefix (like a folder path)

  • Downloads all .pdf files from that prefix to your local machine (or wherever your code is running)

Why it's important:

  • Your data lives in S3, but to use it in a chatbot, you need the raw content from those files.

  • PDFs aren't directly usable by language models—you need to extract the text.
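
A minimal sketch of this step using boto3. The bucket name, prefix, and local folder below are placeholders rather than values from the original post:

import os
import boto3

# Placeholder values; replace with your own bucket, prefix, and download folder.
BUCKET = "my-documents-bucket"
PREFIX = "knowledge-base/"
LOCAL_DIR = "pdfs"

os.makedirs(LOCAL_DIR, exist_ok=True)
s3 = boto3.client("s3")

# Page through every object under the prefix and download the PDFs.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.lower().endswith(".pdf"):
            local_path = os.path.join(LOCAL_DIR, os.path.basename(key))
            s3.download_file(BUCKET, key, local_path)
            print(f"Downloaded {key} -> {local_path}")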



Step 2: Extract Text from PDFs

What it's doing:

  • Uses PyMuPDF (or similar library) to read each PDF and extract text page-by-page

  • Compiles all the extracted text into one long string

Why it's important:

  • PDFs are binary files, not plain text.

  • You must extract readable text from them before a model can understand or generate answers based on the content.
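
A sketch of the extraction step with PyMuPDF (imported as fitz), assuming the pdfs folder created in Step 1:

import os
import fitz  # PyMuPDF

def extract_text(pdf_dir="pdfs"):
    """Read every PDF page by page and return all of the text as one string."""
    pages = []
    for name in sorted(os.listdir(pdf_dir)):
        if name.lower().endswith(".pdf"):
            with fitz.open(os.path.join(pdf_dir, name)) as doc:
                for page in doc:
                    pages.append(page.get_text())
    return "\n".join(pages)

full_text = extract_text()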



Step 3: Chunk the Text

What it's doing:

  • Breaks the long string of text into smaller chunks (e.g., 500-character sections with some overlap between chunks)

  • Overlap (e.g., 50 characters) helps maintain context between chunks

Why it's important:

  • Embedding models and language models work better on short, focused text segments.

  • Large documents can overwhelm models, and splitting helps maintain performance and retrieval accuracy.
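
A simple character-based chunker along these lines, using the 500-character chunks and 50-character overlap described above:

def chunk_text(text, chunk_size=500, overlap=50):
    """Split one long string into overlapping character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap for context
    return chunks

chunks = chunk_text(full_text)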



Step 4: Generate Embeddings and Store

What it's doing:

  • Transforms each text chunk into a vector (a numerical representation of meaning) using a sentence embedding model

  • Stores those vectors in a FAISS index (a fast similarity search engine)

Why it's important:

  • When a user asks a question, you don’t search the entire document—you search the vector space for the most semantically similar text chunks.

  • This lets the chatbot find relevant information fast, even if the wording is different from the user's question.
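
A sketch using sentence-transformers and FAISS. The all-MiniLM-L6-v2 model is a common small embedding model chosen for illustration, not necessarily the one in the original code:

import faiss
from sentence_transformers import SentenceTransformer

# Load a small sentence embedding model (illustrative choice).
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Turn every chunk into a float32 vector, as FAISS expects.
embeddings = embedder.encode(chunks, convert_to_numpy=True).astype("float32")

# Build an exact (flat) L2 index sized to the embedding dimension and add the vectors.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)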



Step 5: Build a Simple Retrieval-Augmented Chatbot

What it's doing:

  • When a user asks a question:

    1. Embed the question using the same embedding model

    2. Search the FAISS index for the top-matching chunks of text

    3. Construct a prompt by combining the context with the user's question

    4. Feed that prompt into a language model (local or API) to generate a coherent answer

Why it's important:

  • Language models don’t “know” your PDFs out of the box.

  • This approach retrieves relevant knowledge from your data and lets the LLM generate natural answers based on that information.
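
A sketch that ties the pieces together, continuing from the previous steps. It uses distilgpt2 (the small local model mentioned in the deployment section below) as the generator; the answer() helper name is just for illustration:

from transformers import pipeline

# A small local text-generation model; swap in any other local model or hosted API.
generator = pipeline("text-generation", model="distilgpt2")

def answer(question, k=3, max_new_tokens=150):
    """Embed the question, retrieve the top-k chunks, and generate an answer."""
    q_vec = embedder.encode([question], convert_to_numpy=True).astype("float32")
    _, idx = index.search(q_vec, k)                  # nearest chunks in the FAISS index
    context = "\n".join(chunks[i] for i in idx[0])   # stitch the retrieved chunks together
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    result = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    return result[0]["generated_text"][len(prompt):].strip()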



Ready to Run

This whole setup is called RAG (Retrieval-Augmented Generation). It's like giving a model short-term memory by letting it “look up” relevant chunks from your documents when it needs them.

Now you can call the chatbot directly. For example, using the answer() helper from the Step 5 sketch (the question is only an illustration):
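
# Ask a question against your indexed documents.
print(answer("What does the onboarding guide say about security training?"))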



Conclusion

You now have a working example pipeline for building a lightweight chatbot grounded in your own documents using a small language model and smart retrieval. Whether you're making a company knowledge bot, an internal assistant, or just experimenting, this setup is fast, scalable, and cost-effective.


Want to Deploy This?

You can:

  • Wrap it in a Flask API

  • Run it locally or deploy on AWS Lambda or SageMaker

  • Replace distilgpt2 with a hosted model such as the ChatGPT (OpenAI) or Mistral API for better answers
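
For example, a minimal Flask wrapper around the answer() helper from the Step 5 sketch might look like this (the route name and port are arbitrary):

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    # Expects JSON like {"question": "..."} and returns the generated answer.
    # answer() is the retrieval-plus-generation helper sketched in Step 5.
    question = request.get_json()["question"]
    return jsonify({"answer": answer(question)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)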


I've also added a bonus version that streams the chatbot's response for a more "live typing" feel.
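
The bonus code isn't reproduced here, but one way to get that streaming effect with a local transformers model is the TextStreamer class, reusing the embedder, index, and chunks from the earlier sketches:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
streamer = TextStreamer(tokenizer, skip_prompt=True)  # prints tokens as they are generated

def answer_streaming(question, k=3, max_new_tokens=150):
    """Same retrieval as Step 5, but the answer streams to the console."""
    q_vec = embedder.encode([question], convert_to_numpy=True).astype("float32")
    _, idx = index.search(q_vec, k)
    context = "\n".join(chunks[i] for i in idx[0])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    model.generate(**inputs, max_new_tokens=max_new_tokens, streamer=streamer)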





 

 
 
 
