Add a concepts doc

Jared Miller 2026-01-28 14:37:18 -05:00
parent ac8c2a490c
commit 0361b64502
Signed by: shmup
GPG key ID: 22B5C6D66A38B06C
2 changed files with 197 additions and 0 deletions

@@ -71,6 +71,15 @@ just nuke # delete everything including data
4. start a chat, click the + button, attach the document
5. ask questions about it
## learning
see [docs/concepts.md](docs/concepts.md) for hands-on exploration of:
- inference and tokenization
- embeddings and vector search
- RAG flow end-to-end
- quantization
- experiments to try
## what this isn't
this is inference, not training. we're not teaching the model anything - just running models that others trained. the "learning" in machine learning already happened, elsewhere, on gpu clusters. we're just using the results.

docs/concepts.md (new file, 188 additions)

@@ -0,0 +1,188 @@
# concepts you can explore here
this stack is a hands-on way to explore several pieces of modern AI.
here's what each piece is and how to poke at it.
## inference
**what it is**: running a trained model to generate output. the model's weights are frozen - no learning happens, just prediction.
**where it lives**: ollama
**explore it**:
```bash
# see what's loaded
docker exec ollama ollama list
# run a prompt directly
docker exec ollama ollama run tinyllama "explain insurance in one sentence"
# watch resource usage during generation
just stats
```
**what to notice**:
- first response is slow (loading model into RAM)
- subsequent responses faster (model stays loaded)
- tokens/sec depends on model size and your CPU
- bigger model = smarter but slower
**try different models**:
```bash
just pull qwen2:0.5b # tiny, fast, dumb
just pull tinyllama # small, decent
just pull phi3:mini # bigger, slower, smarter
just pull gemma2:2b # google's small model
```
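besides `docker exec`, ollama also serves an HTTP API. a minimal python sketch of calling it, assuming the stack publishes ollama's default port 11434 (check the compose file - the mapping may differ):

```python
import json
import urllib.request

# ollama's generate endpoint - 11434 is the default port;
# this stack's compose file may map it differently
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "tinyllama",
    "prompt": "explain insurance in one sentence",
    "stream": False,  # one JSON object instead of a token stream
}

def generate():
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# print(generate())  # uncomment once the stack is running
```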
## embeddings
**what it is**: converting text into vectors (arrays of numbers) where similar meanings are close together in vector space.
**where it lives**: open-webui runs `all-MiniLM-L6-v2` when you upload docs
**explore it**:
```bash
# check chroma has collections after uploading a doc
curl http://localhost:8007/api/v2/collections | jq
```
**what to notice**:
- embedding is fast (small model, ~80MB)
- same text always produces same vector
- "car" and "automobile" vectors are close
- "car" and "banana" vectors are far
**the math**: vectors have 384 dimensions. similarity is measured with cosine distance - 1 minus the cosine of the angle between two vectors, so a distance near 0 means they point in almost the same direction.
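the "close vs far" idea is easy to demo. a toy python sketch of cosine similarity - 3-d vectors stand in for the real 384-d embeddings, and the numbers are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy 3-d vectors standing in for real 384-d embeddings
car = [0.9, 0.8, 0.1]
automobile = [0.85, 0.82, 0.15]
banana = [0.1, 0.2, 0.95]

print(cosine_similarity(car, automobile))  # high - near 1.0
print(cosine_similarity(car, banana))      # much lower
```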
## vector database
**what it is**: a database optimized for "find me things similar to X" instead of "find me things equal to X".
**where it lives**: chroma on port 8007
**explore it**:
```bash
# health check
curl http://localhost:8007/api/v2/heartbeat
# list collections (each uploaded doc becomes one)
curl http://localhost:8007/api/v2/collections | jq
# get collection details
curl http://localhost:8007/api/v2/collections/{collection_id} | jq
```
**what to notice**:
- stores vectors + metadata + original text
- uses HNSW algorithm for fast approximate nearest neighbor search
- "approximate" because exact search is O(n), HNSW is O(log n)
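what "exact search is O(n)" means concretely: scan every stored vector, keep the closest. a brute-force python sketch (toy data, not chroma's actual internals) - HNSW exists to skip this full scan:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def exact_nearest(query, store):
    """O(n): measure against every stored vector, keep the closest."""
    return min(store, key=lambda item: cosine_distance(query, item[1]))

# (text, vector) pairs - roughly what the store holds per chunk
store = [
    ("doc-1 chunk about cars", [0.9, 0.8, 0.1]),
    ("doc-2 chunk about fruit", [0.1, 0.2, 0.95]),
]
label, _ = exact_nearest([0.85, 0.75, 0.2], store)
print(label)  # → doc-1 chunk about cars
```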
## RAG (retrieval-augmented generation)
**what it is**: instead of asking the LLM to know everything, retrieve relevant context first and inject it into the prompt.
**where it lives**: open-webui orchestrates the full flow
**the flow**:
```
your question
    ↓
embed question → vector
    ↓
search chroma → similar chunks
    ↓
build prompt: "given this context: {chunks}, answer: {question}"
    ↓
send to ollama → generate answer
```
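the flow above, sketched as plain python. `embed`, `search_chroma`, and `generate` are placeholders for the real components, named here only for illustration:

```python
def build_rag_prompt(question, chunks):
    """inject retrieved chunks into the prompt - the model never 'learns' them."""
    context = "\n\n".join(chunks)
    return f"given this context:\n{context}\n\nanswer: {question}"

def answer(question, embed, search_chroma, generate):
    # embed / search_chroma / generate stand in for the real components
    vector = embed(question)              # question → 384-d vector
    chunks = search_chroma(vector, k=3)   # top-k most similar chunks
    prompt = build_rag_prompt(question, chunks)
    return generate(prompt)               # send to ollama
```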
**explore it**:
1. upload a PDF to open-webui
2. start a chat, attach the document
3. ask something specific from the doc
4. notice it quotes/references the content
**what to notice**:
- model doesn't "know" your doc - it's just in the prompt
- retrieval is fast, generation is slow
- quality depends on: chunk size, number of chunks retrieved, model capability
- if wrong chunks retrieved, answer will be wrong
## tokenization
**what it is**: breaking text into pieces (tokens) the model understands. not words - subword units.
**explore it**:
```bash
# ollama shows token count in verbose mode
docker exec ollama ollama run tinyllama "hello world" --verbose
```
**what to notice**:
- "hello" might be 1 token, "unconstitutional" might be 3
- ~4 chars per token on average for english
- context window = max tokens model can see at once
- tinyllama: 2048 tokens, phi3: 4096 tokens
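the ~4-chars-per-token rule of thumb makes a usable back-of-envelope estimator. a sketch (real tokenizers differ per model, so treat this as a rough guess):

```python
def estimate_tokens(text):
    """rough rule of thumb: ~4 characters per token for english."""
    return max(1, round(len(text) / 4))

CONTEXT_WINDOWS = {"tinyllama": 2048, "phi3": 4096}  # max tokens

def fits(text, model):
    """will this prompt fit in the model's context window?"""
    return estimate_tokens(text) <= CONTEXT_WINDOWS[model]

print(estimate_tokens("hello world"))  # → 3
```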
## quantization
**what it is**: using smaller numbers (4-bit integers instead of 16-bit floats) to represent weights. makes models smaller and faster with minimal quality loss.
**where it lives**: the models you pull are already quantized
**explore it**:
```bash
# model names often indicate quantization
# tinyllama is Q4_0 by default (4-bit)
docker exec ollama ollama show tinyllama
```
**what to notice**:
- q4 = 4-bit, q8 = 8-bit, f16 = 16-bit floats (the unquantized baseline)
- 4-bit is ~4x smaller than 16-bit
- quality loss is usually small for 4-bit
- your CPU thanks you
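the core trick fits in a few lines: pick a scale so the largest weight maps to the biggest 4-bit integer, then round everything. a toy sketch of symmetric 4-bit-style quantization (real Q4_0 works per block of 32 weights, this simplifies to one scale):

```python
def quantize_q4(weights):
    """symmetric 4-bit style: map floats onto integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0  # assumes not all zero
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.7, 0.33, 0.05]
q, scale = quantize_q4(weights)
restored = dequantize(q, scale)
# each restored weight is within scale/2 of the original -
# small error, but each weight now fits in 4 bits instead of 16
```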
## what this stack does NOT cover
these require GPUs, money, or both:
- **training**: teaching a model from scratch. needs 100s of GPUs, millions of dollars
- **fine-tuning**: even LoRA needs a decent GPU (8GB+ VRAM minimum)
- **MLOps**: experiment tracking, model registries, CI/CD for models
- **multi-GPU inference**: tensor parallelism, pipeline parallelism
this stack is the "use pretrained models" side of the diagram:
```
[someone else trained it] → [we download it] → [we run inference]
[we add RAG for context]
```
## experiments to try
1. **compare models**: ask same question to tinyllama vs phi3:mini. notice quality/speed tradeoff
2. **RAG quality**: upload a doc, ask questions. see when it gets things right vs wrong
3. **context limits**: paste a huge prompt and watch the model lose track of the beginning - input beyond the context window gets truncated
4. **embedding similarity**: upload two similar docs, see if asking about one retrieves chunks from the other
5. **token counting**: write a long prompt, estimate tokens, check with --verbose
## glossary
| term | meaning |
|------|---------|
| inference | running a model to get output |
| embedding | text → vector |
| vector | array of numbers representing meaning |
| RAG | retrieve context, then generate |
| token | subword unit models understand |
| quantization | compress weights to smaller numbers |
| context window | max tokens model can process |
| VRAM | GPU memory (we don't have this) |
| HNSW | algorithm for fast similarity search |