Introduction

I wanted a Q&A bot that knows my field—computer science, scientific articles, or internal engineering docs—without sending queries to a general-purpose cloud model. By hosting a local LLM, I get on-premise privacy and sub-second responses. Here’s how I set up a domain-specific Q&A assistant tailored to my needs.

Why Build a Domain-Specific Assistant

  • Privacy & Compliance: Sensitive docs never leave my network.
  • Accuracy in Context: The model focuses on domain data rather than general web knowledge.
  • Cost Efficiency: No per-query API fees when I’m debugging or prototyping.

High-Level Steps

  1. Collect & Clean Domain Data – Gather relevant docs (PDFs, wikis, spreadsheets).
  2. Option A: Fine-Tuning – Continue-training the base LLM on in-domain text.
  3. Option B: Prompt Engineering – Use few-shot context injection without retraining.
  4. Deploy & Query – Serve via an API or simple CLI for quick lookups.

I’ll walk through both fine-tuning (Option A) and prompt-based (Option B) approaches so I can compare what works best.


1. Collect & Clean Domain Data

I dumped all my domain docs into a data/ folder. For PDFs I used pdfplumber, for Word docs python-docx, and for CSVs pandas. Then I sanitized the text: stripped headers and footers, and manually redacted PII.

import os
import pandas as pd
from pdfplumber import open as open_pdf
from docx import Document

def extract_text(path):
    if path.endswith('.pdf'):
        with open_pdf(path) as pdf:
            # extract_text() can return None on image-only pages
            return '\n'.join([(p.extract_text() or '') for p in pdf.pages])
    elif path.endswith('.docx'):
        return '\n'.join([p.text for p in Document(path).paragraphs])
    elif path.endswith('.csv'):
        df = pd.read_csv(path)
        # my CSVs keep free text in a 'notes' column; cast to str so join never fails
        return '\n'.join(df['notes'].astype(str).tolist())
    return ''

docs = []
for f in os.listdir('data/'):
    docs.append(extract_text(os.path.join('data/', f)))
print(f"Loaded {len(docs)} domain documents.")

A quick regex pass helped remove common headers or sensitive tokens.
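For reference, this is roughly what that pass looked like. The specific patterns (a repeated page header and simple email redaction) are illustrative stand-ins, not the exact ones I used:

import re

def sanitize(text):
    # Drop a boilerplate header that repeats on every page (pattern is illustrative)
    text = re.sub(r'^ACME Engineering Handbook.*$', '', text, flags=re.MULTILINE)
    # Redact email addresses as a simple stand-in for PII scrubbing
    text = re.sub(r'[\w.+-]+@[\w-]+\.[\w.-]+', '[REDACTED_EMAIL]', text)
    # Collapse the blank lines left behind
    return re.sub(r'\n{3,}', '\n\n', text).strip()

docs = [sanitize(d) for d in docs]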


2. Option A: Fine-Tuning the LLM

Why: If my domain data is large enough (at least a few hundred MB of text), fine-tuning can embed domain knowledge directly into the model’s weights.

# Using Hugging Face Trainer and LoRA for efficiency
# pip install transformers peft datasets
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import get_peft_model, LoraConfig, TaskType

model_name = "gpt2-medium"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model     = AutoModelForCausalLM.from_pretrained(model_name)

# Set up LoRA so only the small adapter matrices are trained
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05
)
model = get_peft_model(model, peft_config)

# Prepare dataset: tokenize the cleaned docs; the collator pads batches and
# copies input_ids into labels for causal-LM training
from datasets import Dataset
encodings = tokenizer(docs, truncation=True)
dataset   = Dataset.from_dict(encodings)
collator  = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Training arguments
t_args = TrainingArguments(
    output_dir='fine_tuned',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    logging_steps=50,
    save_steps=200
)

trainer = Trainer(
    model=model,
    args=t_args,
    train_dataset=dataset,
    data_collator=collator
)
trainer.train()

model.save_pretrained('domain-qa-model')

This takes some GPU time, but the resulting model feels way more knowledgeable on my docs.
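To sanity-check the result, I reload the base model with the saved adapter and generate an answer. This is a minimal sketch; the prompt wording and generation settings are just what I happened to use:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base      = AutoModelForCausalLM.from_pretrained("gpt2-medium")
tuned     = PeftModel.from_pretrained(base, "domain-qa-model")  # adapter saved above
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")

prompt = "Q: What is the policy on data retention?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = tuned.generate(**inputs, max_new_tokens=80, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))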


3. Option B: Prompt Engineering

Why: If I don’t want to fine-tune or my data is small, I can just feed relevant context in the prompt.

from transformers import pipeline

# The base model stays frozen; retrieved context is injected into the prompt
generator = pipeline('text-generation', model='gpt2-medium')

def answer_question(question, context):
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    out = generator(prompt, max_new_tokens=100, do_sample=False,
                    return_full_text=False)[0]['generated_text']
    return out.strip()

# Retrieve relevant docs or chunks (using our earlier semantic search)
question     = 'What is the policy on data retention?'
ctx_chunks   = retrieve(question, k=3)
full_context = '\n\n'.join(ctx_chunks)
print(answer_question(question, full_context))

Less upfront cost, but I might have to tweak prompt length or chunk selection to get precise answers.
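The retrieve helper is the semantic search from earlier. If you're following along without it, a minimal stand-in with sentence-transformers works; the embedding model and the naive fixed-size chunking here are my assumptions:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

embedder  = SentenceTransformer('all-MiniLM-L6-v2')  # assumed embedding model
# naive fixed-size chunking of the cleaned docs
chunks    = [d[i:i+1000] for d in docs for i in range(0, len(d), 1000)]
chunk_emb = embedder.encode(chunks, convert_to_tensor=True)

def retrieve(query, k=3):
    # Rank chunks by cosine similarity to the query and return the top k
    q_emb = embedder.encode(query, convert_to_tensor=True)
    hits  = util.semantic_search(q_emb, chunk_emb, top_k=k)[0]
    return [chunks[h['corpus_id']] for h in hits]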


4. Deploy & Query

I wrapped both approaches behind a small FastAPI app so I can hit POST /ask with JSON:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post('/ask')
def ask(q: Query):
    if use_finetuned:              # config flag: serve the LoRA-tuned model
        ans = model.generate(...)  # tokenize, generate, decode as shown earlier
    else:                          # otherwise fall back to retrieval + prompting
        ctx = retrieve(q.question)
        ans = answer_question(q.question, '\n\n'.join(ctx))
    return {'answer': ans}

Now I have a local Q&A service that respects my data’s privacy and works offline.
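To smoke-test it, I run the app with uvicorn and post a question from Python; the port and module name are just what I used locally:

# uvicorn app:app --port 8000   (assuming the FastAPI code above lives in app.py)
import requests

resp = requests.post(
    'http://localhost:8000/ask',
    json={'question': 'What is the policy on data retention?'}
)
print(resp.json()['answer'])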


Wrapping Up

This setup lets me query my own domain knowledge quickly and privately. Next up: integrating this into a Slack bot so I can ask questions right from my chat client.