# concepts you can explore here

this stack lets you explore several pieces of the modern AI stack hands-on. here's what each piece is and how to poke at it.
## inference
what it is: running a trained model to generate output. the model's weights are frozen - no learning happens, just prediction.
where it lives: ollama
explore it:

```bash
# see what's loaded
docker exec ollama ollama list

# run a prompt directly
docker exec ollama ollama run tinyllama "explain insurance in one sentence"

# watch resource usage during generation
just stats
```
what to notice:
- first response is slow (loading model into RAM)
- subsequent responses faster (model stays loaded)
- tokens/sec depends on model size and your CPU
- bigger model = smarter but slower
try different models:

```bash
just pull qwen2:0.5b   # tiny, fast, dumb
just pull tinyllama    # small, decent
just pull phi3:mini    # bigger, slower, smarter
just pull gemma2:2b    # google's small model
```
## embeddings
what it is: converting text into vectors (arrays of numbers) where similar meanings are close together in vector space.
where it lives: open-webui runs all-MiniLM-L6-v2 when you upload docs
explore it:

```bash
# check chroma has collections after uploading a doc
# (the v2 API scopes collections under a tenant and database)
curl http://localhost:8007/api/v2/tenants/default_tenant/databases/default_database/collections | jq
```
what to notice:
- embedding is fast (small model, ~80MB)
- same text always produces same vector
- "car" and "automobile" vectors are close
- "car" and "banana" vectors are far
the math: vectors have 384 dimensions. similarity is measured by cosine similarity - how closely two vectors point in the same direction.
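the notion of "close" above is cosine similarity, and it fits in a few lines. a minimal sketch in plain python - the toy 3-d vectors are made up for illustration; real MiniLM embeddings have 384 dimensions:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 = same direction, 0.0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy 3-d vectors standing in for real 384-d embeddings
car        = [0.9, 0.1, 0.0]
automobile = [0.85, 0.15, 0.05]
banana     = [0.1, 0.05, 0.9]

print(cosine_similarity(car, automobile))  # close to 1.0
print(cosine_similarity(car, banana))      # much smaller
```

same idea at 384 dimensions, just a longer sum.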
## vector database
what it is: a database optimized for "find me things similar to X" instead of "find me things equal to X".
where it lives: chroma on port 8007
explore it:

```bash
# health check
curl http://localhost:8007/api/v2/heartbeat

# list collections (each uploaded doc becomes one)
curl http://localhost:8007/api/v2/tenants/default_tenant/databases/default_database/collections | jq

# get collection details
curl http://localhost:8007/api/v2/tenants/default_tenant/databases/default_database/collections/{collection_id} | jq
```
what to notice:
- stores vectors + metadata + original text
- uses HNSW algorithm for fast approximate nearest neighbor search
- "approximate" because exact search is O(n), HNSW is O(log n)
## RAG (retrieval-augmented generation)
what it is: instead of asking the LLM to know everything, retrieve relevant context first and inject it into the prompt.
where it lives: open-webui orchestrates the full flow
the flow:

```
your question
    ↓
embed question → vector
    ↓
search chroma → similar chunks
    ↓
build prompt: "given this context: {chunks}, answer: {question}"
    ↓
send to ollama → generate answer
```
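the "build prompt" step is the whole trick: retrieved chunks are just pasted into the prompt text. a minimal sketch in python - the template and example chunks here are made up, not open-webui's actual template:

```python
def build_rag_prompt(question, chunks):
    # the model never "learns" the document - the retrieved chunks are
    # injected into the prompt, which is why retrieval quality matters
    context = "\n\n".join(chunks)
    return f"given this context:\n{context}\n\nanswer: {question}"

chunks = [
    "the deductible is $500 per incident.",
    "claims must be filed within 30 days.",
]
prompt = build_rag_prompt("what is my deductible?", chunks)
print(prompt)
```

if the retrieval step hands this function the wrong chunks, the model answers from the wrong context - no amount of model quality fixes that.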
explore it:
- upload a PDF to open-webui
- start a chat, attach the document
- ask something specific from the doc
- notice it quotes/references the content
what to notice:
- model doesn't "know" your doc - it's just in the prompt
- retrieval is fast, generation is slow
- quality depends on: chunk size, number of chunks retrieved, model capability
- if wrong chunks retrieved, answer will be wrong
## tokenization
what it is: breaking text into pieces (tokens) the model understands. not words - subword units.
explore it:

```bash
# ollama shows token counts in verbose mode
docker exec ollama ollama run tinyllama "hello world" --verbose
```
what to notice:
- "hello" might be 1 token, "unconstitutional" might be 3
- ~4 chars per token on average for english
- context window = max tokens model can see at once
- tinyllama: 2048 tokens, phi3: 4096 tokens
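the ~4 chars/token rule of thumb is enough for a quick estimator. a toy sketch in python - real tokenizers (BPE and friends) split on learned subword units, so treat this as a rough guess, not a count:

```python
def estimate_tokens(text, chars_per_token=4):
    # crude heuristic for english text; actual token counts
    # depend on the model's tokenizer vocabulary
    return max(1, round(len(text) / chars_per_token))

prompt = "explain insurance in one sentence"
print(estimate_tokens(prompt))  # compare against --verbose output

# will it fit? tinyllama's context window is 2048 tokens
assert estimate_tokens(prompt) < 2048
```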
## quantization
what it is: using smaller numbers (4-bit integers instead of 16-bit floats) to represent weights. makes models smaller and faster with minimal quality loss.
where it lives: the models you pull are already quantized
explore it:

```bash
# model names often indicate quantization
# tinyllama is Q4_0 by default (4-bit)
docker exec ollama ollama show tinyllama
```
what to notice:
- q4 = 4-bit, q8 = 8-bit, f16 = full precision
- 4-bit is ~4x smaller than 16-bit
- quality loss is usually small for 4-bit
- your CPU thanks you
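the whole idea fits in a few lines: map floats onto a small integer range, keep the scale so you can map back. a toy sketch in python - real schemes like Q4_0 quantize weights in blocks of 32 with a scale per block, this just does one flat array:

```python
def quantize_4bit(weights):
    # map floats onto integers 0..15 (4 bits), keeping scale + offset
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize(q, scale, lo):
    return [v * scale + lo for v in q]

weights = [0.12, -0.53, 0.88, 0.01, -0.07]
q, scale, lo = quantize_4bit(weights)
restored = dequantize(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)        # small ints, 4 bits each instead of a 16-bit float
print(max_err)  # quality loss: at most half a quantization step
```

that bounded rounding error is the "minimal quality loss" - each weight moves by at most half a step, and the model shrinks ~4x.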
## what this stack does NOT cover
these require GPUs, money, or both:
- training: teaching a model from scratch. needs 100s of GPUs, millions of dollars
- fine-tuning: even LoRA needs a decent GPU (8GB+ VRAM minimum)
- MLOps: experiment tracking, model registries, CI/CD for models
- multi-GPU inference: tensor parallelism, pipeline parallelism
this stack is the "use pretrained models" side of the diagram:

```
[someone else trained it] → [we download it] → [we run inference]
                                                      ↓
                                          [we add RAG for context]
```
## experiments to try
- compare models: ask the same question to tinyllama vs phi3:mini. notice the quality/speed tradeoff
- RAG quality: upload a doc, ask questions. see when it gets things right vs wrong
- context limits: paste a huge prompt, see when the model refuses or truncates
- embedding similarity: upload two similar docs, see if asking about one retrieves chunks from the other
- token counting: write a long prompt, estimate tokens, check with --verbose
## glossary
| term | meaning |
|---|---|
| inference | running a model to get output |
| embedding | text → vector |
| vector | array of numbers representing meaning |
| RAG | retrieve context, then generate |
| token | subword unit models understand |
| quantization | compress weights to smaller numbers |
| context window | max tokens model can process |
| VRAM | GPU memory (we don't have this) |
| HNSW | algorithm for fast similarity search |