# concepts you can explore here

this stack lets you explore several pieces of the modern AI stack hands-on. here's what each piece is and how to poke at it.
## inference
what it is: running a trained model to generate output. the model's weights are frozen - no learning happens, just prediction.
where it lives: ollama
explore it:

```bash
# see what's loaded
docker exec ollama ollama list

# run a prompt directly
docker exec ollama ollama run tinyllama "explain insurance in one sentence"

# watch resource usage during generation
just stats
```
what to notice:
- first response is slow (loading model into RAM)
- subsequent responses faster (model stays loaded)
- tokens/sec depends on model size and your CPU
- bigger model = smarter but slower
try different models:

```bash
just pull qwen2:0.5b   # tiny, fast, dumb
just pull tinyllama    # small, decent
just pull phi3:mini    # bigger, slower, smarter
just pull gemma2:2b    # google's small model
```
## embeddings
what it is: converting text into vectors (arrays of numbers) where similar meanings are close together in vector space.
where it lives: open-webui runs all-MiniLM-L6-v2 when you upload docs
explore it:

```bash
# check chroma has collections after uploading a doc
# (the v2 API scopes collections under a tenant and database)
curl http://localhost:8007/api/v2/tenants/default_tenant/databases/default_database/collections | jq
```
what to notice:
- embedding is fast (small model, ~80MB)
- same text always produces same vector
- "car" and "automobile" vectors are close
- "car" and "banana" vectors are far
the math: vectors have 384 dimensions. similarity is measured by cosine similarity - how closely two vectors point in the same direction.
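the notion of "close" above is cosine similarity, and it fits in a few lines. a minimal sketch in plain python - the toy 3-d vectors are made up for illustration; real MiniLM embeddings have 384 dimensions:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 = same direction, 0.0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy 3-d vectors standing in for real 384-d embeddings
car        = [0.9, 0.1, 0.0]
automobile = [0.85, 0.15, 0.05]
banana     = [0.1, 0.05, 0.9]

print(cosine_similarity(car, automobile))  # close to 1.0
print(cosine_similarity(car, banana))      # much smaller
```

same idea at 384 dimensions, just a longer sum.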
## vector database
what it is: a database optimized for "find me things similar to X" instead of "find me things equal to X".
where it lives: chroma on port 8007
explore it:

```bash
# health check
curl http://localhost:8007/api/v2/heartbeat

# list collections (each uploaded doc becomes one)
curl http://localhost:8007/api/v2/tenants/default_tenant/databases/default_database/collections | jq

# get collection details
curl http://localhost:8007/api/v2/tenants/default_tenant/databases/default_database/collections/{collection_id} | jq
```
what to notice:
- stores vectors + metadata + original text
- uses HNSW algorithm for fast approximate nearest neighbor search
- "approximate" because exact search is O(n), HNSW is O(log n)
## RAG (retrieval-augmented generation)
what it is: instead of asking the LLM to know everything, retrieve relevant context first and inject it into the prompt.
where it lives: open-webui orchestrates the full flow
the flow:

```
your question
    ↓
embed question → vector
    ↓
search chroma → similar chunks
    ↓
build prompt: "given this context: {chunks}, answer: {question}"
    ↓
send to ollama → generate answer
```
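the "build prompt" step is the whole trick: retrieved chunks are just pasted into the prompt text. a minimal sketch in python - the template and example chunks here are made up, not open-webui's actual template:

```python
def build_rag_prompt(question, chunks):
    # the model never "learns" the document - the retrieved chunks are
    # injected into the prompt, which is why retrieval quality matters
    context = "\n\n".join(chunks)
    return f"given this context:\n{context}\n\nanswer: {question}"

chunks = [
    "the deductible is $500 per incident.",
    "claims must be filed within 30 days.",
]
prompt = build_rag_prompt("what is my deductible?", chunks)
print(prompt)
```

if the retrieval step hands this function the wrong chunks, the model answers from the wrong context - no amount of model quality fixes that.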
explore it:
- upload a PDF to open-webui
- start a chat, attach the document
- ask something specific from the doc
- notice it quotes/references the content
what to notice:
- model doesn't "know" your doc - it's just in the prompt
- retrieval is fast, generation is slow
- quality depends on: chunk size, number of chunks retrieved, model capability
- if wrong chunks retrieved, answer will be wrong
## tokenization
what it is: breaking text into pieces (tokens) the model understands. not words - subword units.
explore it:

```bash
# ollama shows token counts in verbose mode
docker exec ollama ollama run tinyllama "hello world" --verbose
```
what to notice:
- "hello" might be 1 token, "unconstitutional" might be 3
- ~4 chars per token on average for english
- context window = max tokens model can see at once
- tinyllama: 2048 tokens, phi3: 4096 tokens
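the ~4 chars/token rule of thumb is enough for a quick estimator. a toy sketch in python - real tokenizers (BPE and friends) split on learned subword units, so treat this as a rough guess, not a count:

```python
def estimate_tokens(text, chars_per_token=4):
    # crude heuristic for english text; actual token counts
    # depend on the model's tokenizer vocabulary
    return max(1, round(len(text) / chars_per_token))

prompt = "explain insurance in one sentence"
print(estimate_tokens(prompt))  # compare against --verbose output

# will it fit? tinyllama's context window is 2048 tokens
assert estimate_tokens(prompt) < 2048
```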
## quantization
what it is: using smaller numbers (4-bit integers instead of 16-bit floats) to represent weights. makes models smaller and faster with minimal quality loss.
where it lives: the models you pull are already quantized
explore it:

```bash
# model names often indicate quantization
# tinyllama is Q4_0 by default (4-bit)
docker exec ollama ollama show tinyllama
```
what to notice:
- q4 = 4-bit, q8 = 8-bit, f16 = full precision
- 4-bit is ~4x smaller than 16-bit
- quality loss is usually small for 4-bit
- your CPU thanks you
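the whole idea fits in a few lines: map floats onto a small integer range, keep the scale so you can map back. a toy sketch in python - real schemes like Q4_0 quantize weights in blocks of 32 with a scale per block, this just does one flat array:

```python
def quantize_4bit(weights):
    # map floats onto integers 0..15 (4 bits), keeping scale + offset
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize(q, scale, lo):
    return [v * scale + lo for v in q]

weights = [0.12, -0.53, 0.88, 0.01, -0.07]
q, scale, lo = quantize_4bit(weights)
restored = dequantize(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)        # small ints, 4 bits each instead of a 16-bit float
print(max_err)  # quality loss: at most half a quantization step
```

that bounded rounding error is the "minimal quality loss" - each weight moves by at most half a step, and the model shrinks ~4x.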
## what this stack does NOT cover
these require GPUs, money, or both:
- training: teaching a model from scratch. needs 100s of GPUs, millions of dollars
- fine-tuning: even LoRA needs a decent GPU (8GB+ VRAM minimum)
- MLOps: experiment tracking, model registries, CI/CD for models
- multi-GPU inference: tensor parallelism, pipeline parallelism
this stack is the "use pretrained models" side of the diagram:

```
[someone else trained it] → [we download it] → [we run inference]
                                                      ↓
                                          [we add RAG for context]
```
## experiments to try
- compare models: ask the same question to tinyllama vs phi3:mini. notice the quality/speed tradeoff
- RAG quality: upload a doc, ask questions. see when it gets things right vs wrong
- context limits: paste a huge prompt, see when the model refuses or truncates
- embedding similarity: upload two similar docs, see if asking about one retrieves chunks from the other
- token counting: write a long prompt, estimate tokens, check with --verbose
## glossary
| term | meaning |
|---|---|
| inference | running a model to get output |
| embedding | text → vector |
| vector | array of numbers representing meaning |
| RAG | retrieve context, then generate |
| token | subword unit models understand |
| quantization | compress weights to smaller numbers |
| context window | max tokens model can process |
| VRAM | GPU memory (we don't have this) |
| HNSW | algorithm for fast similarity search |