Simple Introduction to Building a Small Language Model for Your Documents Using S3 and PDFs
- mschneider90265
- Apr 27
- 3 min read

Overview
Ever wanted to build your own chatbot that understands your internal documents? With the power of language models and document embeddings, you can turn a collection of PDFs stored in Amazon S3 into an intelligent, searchable assistant. This article walks you through how to do that step-by-step using open-source tools and a retrieval-augmented generation (RAG) pipeline.
I tested the code with my own data, but after generalizing the commands and removing my personal data, I only verified the results with a visual inspection.
Step 1: Download PDFs from S3
What it's doing:
Connects to your AWS S3 bucket using boto3 (AWS SDK for Python)
Looks for files within a specific prefix (like a folder path)
Downloads all .pdf files from that prefix to your local machine (or wherever your code is running)
Why it's important:
Your data lives in S3, but to use it in a chatbot, you need the raw content from those files.
PDFs aren't directly usable by language models—you need to extract the text.
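For example, a minimal sketch with boto3 might look like the following; the bucket name, prefix, and local folder are placeholders you would swap for your own:

```python
# A minimal sketch, assuming boto3 is installed and AWS credentials are configured.
# Bucket name, prefix, and local directory are placeholders.
import os
import boto3

def download_pdfs(bucket_name: str, prefix: str, local_dir: str = "pdfs") -> list[str]:
    """Download every .pdf object under the given prefix to a local folder."""
    s3 = boto3.client("s3")
    os.makedirs(local_dir, exist_ok=True)
    local_paths = []

    # Paginate in case the prefix contains more than 1,000 objects
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.lower().endswith(".pdf"):
                local_path = os.path.join(local_dir, os.path.basename(key))
                s3.download_file(bucket_name, key, local_path)
                local_paths.append(local_path)
    return local_paths
```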

Step 2: Extract Text from PDFs
What it's doing:
Uses PyMuPDF (or a similar library) to read each PDF and extract text page-by-page
Compiles all the extracted text into one long string
Why it's important:
PDFs are binary files, not plain text.
You must extract readable text from them before a model can understand or generate answers based on the content.
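A minimal sketch with PyMuPDF (imported as fitz) could look like this; it assumes the PDFs contain selectable text rather than scanned images, which would need OCR instead:

```python
# A minimal sketch using PyMuPDF. Scanned documents would need OCR instead.
import fitz  # PyMuPDF

def extract_text(pdf_paths: list[str]) -> str:
    """Read each PDF page-by-page and return all text as one long string."""
    all_text = []
    for path in pdf_paths:
        with fitz.open(path) as doc:
            for page in doc:
                all_text.append(page.get_text())
    return "\n".join(all_text)
```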

Step 3: Chunk the Text
What it's doing:
Breaks the long string of text into smaller chunks (e.g., 500-character sections with some overlap between chunks)
Overlap (e.g., 50 characters) helps maintain context between chunks
Why it's important:
Embedding models and language models work better on short, focused text segments.
Large documents can overwhelm models, and splitting helps maintain performance and retrieval accuracy.
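A simple character-based splitter is enough to illustrate the idea; the 500/50 numbers below mirror the sizes mentioned above and are worth tuning for your own documents:

```python
# A minimal sketch of fixed-size chunking with overlap so context carries
# across chunk boundaries. Chunk size and overlap are tunable assumptions.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```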

Step 4: Generate Embeddings and Store
What it's doing:
Transforms each text chunk into a vector (a numerical representation of meaning) using a sentence embedding model
Stores those vectors in a FAISS index (a fast similarity search engine)
Why it's important:
When a user asks a question, you don’t search the entire document—you search the vector space for the most semantically similar text chunks.
This lets the chatbot find relevant information fast, even if the wording is different from the user's question.
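One way to do this is with sentence-transformers and FAISS; the model name below is a common default for sentence embeddings, not necessarily what the original code used:

```python
# A minimal sketch using sentence-transformers and FAISS. The embedding model
# name is an assumption; any sentence-embedding model would work similarly.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> faiss.IndexFlatL2:
    """Embed every chunk and store the vectors in a FAISS index."""
    vectors = embedder.encode(chunks, convert_to_numpy=True).astype("float32")
    index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over the vectors
    index.add(vectors)
    return index
```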

Step 5: Build a Simple Retrieval-Augmented Chatbot
What it's doing:
When a user asks a question:
Embed the question using the same embedding model
Search the FAISS index for the top-matching chunks of text
Construct a prompt by combining those retrieved chunks (the context) with the user's question
Feed that prompt into a language model (local or API) to generate a coherent answer
Why it's important:
Language models don’t “know” your PDFs out of the box.
This approach retrieves relevant knowledge from your data and lets the LLM generate natural answers based on that information.
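Putting it together, a sketch of the retrieval-and-generation loop might look like this. It reuses the embedder, index, and chunks from the earlier sketches and uses distilgpt2 (mentioned in the deployment notes below) as a small local generator; the prompt format and function name are illustrative:

```python
# A minimal sketch of the retrieval + generation loop. Assumes the embedder
# from the embedding step above is in scope; distilgpt2 is a small local model.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

def answer_question(question: str, index, chunks: list[str], top_k: int = 3) -> str:
    """Embed the question, retrieve the closest chunks, and generate an answer."""
    # Embed the question with the same model used for the chunks
    q_vec = embedder.encode([question], convert_to_numpy=True).astype("float32")
    # Search the FAISS index for the top-matching chunks
    _, indices = index.search(q_vec, top_k)
    context = "\n\n".join(chunks[i] for i in indices[0])
    # Combine the retrieved context with the user's question
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    result = generator(prompt, max_new_tokens=150, do_sample=False)
    # The pipeline returns the prompt plus the completion; strip the prompt off
    return result[0]["generated_text"][len(prompt):].strip()
```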

Ready to Run
This whole setup is called RAG (Retrieval-Augmented Generation). It's like giving a model short-term memory by letting it “look up” relevant chunks from your documents when it needs them.
Now you can wire the steps together and call the chatbot end to end.
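Assuming the helper functions from the sketches above, that call might look like this (the bucket name, prefix, and question are placeholders):

```python
# Tie the steps together; function names follow the sketches above.
pdf_paths = download_pdfs("my-bucket", "documents/")   # placeholder bucket/prefix
chunks = chunk_text(extract_text(pdf_paths))
index = build_index(chunks)

print(answer_question("What does the onboarding guide say about laptops?", index, chunks))
```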

Conclusion
You now have a complete example pipeline for building a lightweight chatbot grounded in your own documents, using a small language model and smart retrieval techniques. Whether you're making a company knowledge bot, internal assistant, or just experimenting, this setup is fast, scalable, and cost-effective.
Want to Deploy This?
You can:
Wrap it in a Flask API (see the sketch after this list)
Run it locally or deploy on AWS Lambda or SageMaker
Replace distilgpt2 with the ChatGPT or Mistral API for better answers
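As a rough illustration of the Flask option, a thin wrapper around the answer_question helper sketched earlier might look like this; the route and payload shape are assumptions, not something from the original post:

```python
# A minimal sketch of a Flask wrapper around answer_question. Assumes the
# index and chunks built earlier are already in scope.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    question = request.get_json().get("question", "")
    return jsonify({"answer": answer_question(question, index, chunks)})

if __name__ == "__main__":
    app.run(port=5000)
```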
I've also added a bonus version that streams the chatbot's response for a more "live typing" feel.
