diff --git a/README.md b/README.md
index e2bb803..468f060 100644
--- a/README.md
+++ b/README.md
@@ -71,6 +71,15 @@ just nuke # delete everything including data
 4. start a chat, click the + button, attach the document
 5. ask questions about it
 
+## learning
+
+see [docs/concepts.md](docs/concepts.md) for hands-on exploration of:
+- inference and tokenization
+- embeddings and vector search
+- RAG flow end-to-end
+- quantization
+- experiments to try
+
 ## what this isn't
 
 this is inference, not ML. we're not training anything - just running models that others trained. the "learning" in machine learning happened elsewhere on gpu clusters. we're just using the results.
diff --git a/docs/concepts.md b/docs/concepts.md
new file mode 100644
index 0000000..cf04d70
--- /dev/null
+++ b/docs/concepts.md
@@ -0,0 +1,188 @@
+# concepts you can explore here
+
+this stack lets you explore, hands-on, several pieces of the modern AI stack.
+here's what each piece is and how to poke at it.
+
+## inference
+
+**what it is**: running a trained model to generate output. the model's weights are frozen - no learning happens, just prediction.
+
+**where it lives**: ollama
+
+**explore it**:
+```bash
+# see what's loaded
+docker exec ollama ollama list
+
+# run a prompt directly
+docker exec ollama ollama run tinyllama "explain insurance in one sentence"
+
+# watch resource usage during generation
+just stats
+```
+
+**what to notice**:
+- the first response is slow (loading the model into RAM)
+- subsequent responses are faster (the model stays loaded)
+- tokens/sec depends on model size and your CPU
+- bigger model = smarter but slower
+
+**try different models**:
+```bash
+just pull qwen2:0.5b  # tiny, fast, dumb
+just pull tinyllama   # small, decent
+just pull phi3:mini   # bigger, slower, smarter
+just pull gemma2:2b   # google's small model
+```
+
+## embeddings
+
+**what it is**: converting text into vectors (arrays of numbers) where similar meanings end up close together in vector space.
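that "close together" claim can be checked with a toy cosine-similarity function. the 3-dimensional vectors below are made up for illustration - real `all-MiniLM-L6-v2` embeddings have 384 dimensions:

```python
import math

def cosine_similarity(a, b):
    # how much two vectors point in the same direction:
    # 1.0 = same direction, 0.0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# made-up toy vectors; a real embedding model would produce these from text
car        = [0.9, 0.8, 0.1]
automobile = [0.8, 0.9, 0.2]
banana     = [0.1, 0.2, 0.9]

print(cosine_similarity(car, automobile))  # high, close to 1
print(cosine_similarity(car, banana))      # much lower
```

same idea at 384 dimensions, just with vectors a model computed instead of ones we typed in.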
+
+**where it lives**: open-webui runs `all-MiniLM-L6-v2` when you upload docs
+
+**explore it**:
+```bash
+# check that chroma has collections after uploading a doc
+curl http://localhost:8007/api/v2/collections | jq
+```
+
+**what to notice**:
+- embedding is fast (small model, ~80MB)
+- the same text always produces the same vector
+- "car" and "automobile" vectors are close
+- "car" and "banana" vectors are far
+
+**the math**: vectors have 384 dimensions. similarity is measured by cosine distance - how much two vectors point in the same direction.
+
+## vector database
+
+**what it is**: a database optimized for "find me things similar to X" instead of "find me things equal to X".
+
+**where it lives**: chroma on port 8007
+
+**explore it**:
+```bash
+# health check
+curl http://localhost:8007/api/v2/heartbeat
+
+# list collections (each uploaded doc becomes one)
+curl http://localhost:8007/api/v2/collections | jq
+
+# get collection details
+curl http://localhost:8007/api/v2/collections/{collection_id} | jq
+```
+
+**what to notice**:
+- stores vectors + metadata + the original text
+- uses the HNSW algorithm for fast approximate nearest-neighbor search
+- "approximate" because exact search is O(n); HNSW is O(log n)
+
+## RAG (retrieval-augmented generation)
+
+**what it is**: instead of asking the LLM to know everything, retrieve relevant context first and inject it into the prompt.
+
+**where it lives**: open-webui orchestrates the full flow
+
+**the flow**:
+```
+your question
+    ↓
+embed question → vector
+    ↓
+search chroma → similar chunks
+    ↓
+build prompt: "given this context: {chunks}, answer: {question}"
+    ↓
+send to ollama → generate answer
+```
+
+**explore it**:
+1. upload a PDF to open-webui
+2. start a chat, attach the document
+3. ask something specific from the doc
+4. notice how it quotes/references the content
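the flow above can be sketched end-to-end in a few lines. everything here is a stand-in - word overlap instead of MiniLM embeddings, a python list instead of chroma, printing the prompt instead of calling ollama - but the retrieve → augment → generate shape is the real one:

```python
def embed(text):
    # stand-in for an embedding model: a set of lowercase words
    return {w.strip("?.,") for w in text.lower().split()}

def similarity(a, b):
    # word overlap instead of cosine distance
    return len(a & b) / len(a | b)

# stand-in for chroma: chunks of an uploaded "document"
chunks = [
    "the policy covers water damage from burst pipes",
    "claims must be filed within 30 days of the incident",
    "the deductible for storm damage is 500 dollars",
]

question = "when must claims be filed?"

# retrieve: find the chunk most similar to the question
q = embed(question)
best_chunk = max(chunks, key=lambda c: similarity(q, embed(c)))

# augment: inject the retrieved context into the prompt
prompt = f"given this context: {best_chunk}, answer: {question}"

# generate: this prompt is what would actually be sent to ollama
print(prompt)
```

if the wrong chunk wins that `max`, the model answers from the wrong context - the whole pipeline is only as good as retrieval.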
+
+**what to notice**:
+- the model doesn't "know" your doc - it's just in the prompt
+- retrieval is fast, generation is slow
+- quality depends on: chunk size, number of chunks retrieved, model capability
+- if the wrong chunks are retrieved, the answer will be wrong
+
+## tokenization
+
+**what it is**: breaking text into pieces (tokens) the model understands. not words - subword units.
+
+**explore it**:
+```bash
+# ollama shows token counts in verbose mode
+docker exec ollama ollama run tinyllama "hello world" --verbose
+```
+
+**what to notice**:
+- "hello" might be 1 token, "unconstitutional" might be 3
+- ~4 chars per token on average for english
+- context window = max tokens the model can see at once
+- tinyllama: 2048 tokens, phi3: 4096 tokens
+
+## quantization
+
+**what it is**: using smaller numbers (4-bit integers instead of 16-bit floats) to represent weights. makes models smaller and faster with minimal quality loss.
+
+**where it lives**: the models you pull are already quantized
+
+**explore it**:
+```bash
+# model names often indicate quantization
+# tinyllama is Q4_0 by default (4-bit)
+docker exec ollama ollama show tinyllama
+```
+
+**what to notice**:
+- q4 = 4-bit, q8 = 8-bit, f16 = full precision
+- 4-bit is ~4x smaller than 16-bit
+- quality loss is usually small for 4-bit
+- your CPU thanks you
+
+## what this stack does NOT cover
+
+these require GPUs, money, or both:
+
+- **training**: teaching a model from scratch. needs 100s of GPUs, millions of dollars
+- **fine-tuning**: even LoRA needs a decent GPU (8GB+ VRAM minimum)
+- **MLOps**: experiment tracking, model registries, CI/CD for models
+- **multi-GPU inference**: tensor parallelism, pipeline parallelism
+
+this stack is the "use pretrained models" side of the diagram:
+
+```
+[someone else trained it] → [we download it] → [we run inference]
+                                                     ↓
+                                        [we add RAG for context]
+```
+
+## experiments to try
+
+1. **compare models**: ask the same question to tinyllama and phi3:mini. notice the quality/speed tradeoff
+
+2. **RAG quality**: upload a doc, ask questions. see when it gets things right vs wrong
+
+3. **context limits**: paste a huge prompt, see when the model refuses or truncates
+
+4. **embedding similarity**: upload two similar docs, see if asking about one retrieves chunks from the other
+
+5. **token counting**: write a long prompt, estimate the tokens, check with --verbose
+
+## glossary
+
+| term | meaning |
+|------|---------|
+| inference | running a model to get output |
+| embedding | text → vector |
+| vector | array of numbers representing meaning |
+| RAG | retrieve context, then generate |
+| token | subword unit models understand |
+| quantization | compress weights to smaller numbers |
+| context window | max tokens a model can process |
+| VRAM | GPU memory (we don't have this) |
+| HNSW | algorithm for fast similarity search |
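one last hands-on bit: the quantization tradeoff is easy to simulate yourself - round weights to 16 levels (4 bits) and measure the error. this is a toy symmetric quantizer, not ollama's actual Q4_0 block format:

```python
import random

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(1000)]  # fake model weights

# 4 bits = 16 levels: signed integers in [-8, 7], plus one shared scale factor
scale = max(abs(w) for w in weights) / 7
quantized = [max(-8, min(7, round(w / scale))) for w in weights]  # what gets stored
restored = [q * scale for q in quantized]  # dequantized at inference time

worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale: {scale:.4f}, worst-case error: {worst:.4f}")
# 4 bits per weight instead of 16 -> ~4x smaller on disk and in RAM,
# and the worst-case rounding error is at most scale/2
```

real formats like Q4_0 quantize in small blocks, each with its own scale, which keeps the error even lower - but the size/accuracy tradeoff is exactly this.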