# concepts you can explore here

this stack lets you hands-on explore several pieces of the modern AI stack. here's what each thing is and how to poke at it.

## inference

**what it is**: running a trained model to generate output. the model's weights are frozen - no learning happens, just prediction.

**where it lives**: ollama

**explore it**:

```bash
# see what's loaded
docker exec ollama ollama list

# run a prompt directly
docker exec ollama ollama run tinyllama "explain insurance in one sentence"

# watch resource usage during generation
just stats
```

**what to notice**:
- first response is slow (loading the model into RAM)
- subsequent responses are faster (the model stays loaded)
- tokens/sec depends on model size and your CPU
- bigger model = smarter but slower

**try different models**:

```bash
just pull qwen2:0.5b   # tiny, fast, dumb
just pull tinyllama    # small, decent
just pull phi3:mini    # bigger, slower, smarter
just pull gemma2:2b    # google's small model
```

## embeddings

**what it is**: converting text into vectors (arrays of numbers) where similar meanings end up close together in vector space.

**where it lives**: open-webui runs `all-MiniLM-L6-v2` when you upload docs

**explore it**:

```bash
# check chroma has collections after uploading a doc
curl http://localhost:8007/api/v2/collections | jq
```

**what to notice**:
- embedding is fast (small model, ~80MB)
- the same text always produces the same vector
- "car" and "automobile" vectors are close
- "car" and "banana" vectors are far apart

**the math**: vectors have 384 dimensions. similarity is measured by cosine distance - how much two vectors point in the same direction.

## vector database

**what it is**: a database optimized for "find me things similar to X" instead of "find me things equal to X".
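to make "find me things similar to X" concrete, here's a minimal brute-force sketch in python. the vectors are made-up 3-dimensional toys (real MiniLM embeddings have 384 dimensions), and this is the exact O(n) scan that a vector db avoids with HNSW:

```python
import math

def cosine_similarity(a, b):
    # how much two vectors point in the same direction: 1.0 = same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query, store):
    # exact search: score every stored vector, keep the best.
    # this is O(n) - HNSW gets roughly the same answer without scanning everything
    return max(store, key=lambda item: cosine_similarity(query, item[1]))

# toy "embeddings" - similar words get similar made-up vectors
store = [
    ("car",        [0.90, 0.80, 0.10]),
    ("automobile", [0.85, 0.82, 0.15]),
    ("banana",     [0.10, 0.20, 0.95]),
]
query = [0.88, 0.79, 0.12]  # pretend this is the embedded word "vehicle"
print(nearest(query, store)[0])  # → car
```

notice "car" and "automobile" score nearly identically against the query while "banana" scores far lower - that's the whole trick behind semantic search.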
**where it lives**: chroma on port 8007

**explore it**:

```bash
# health check
curl http://localhost:8007/api/v2/heartbeat

# list collections (each uploaded doc becomes one)
curl http://localhost:8007/api/v2/collections | jq

# get collection details
curl http://localhost:8007/api/v2/collections/{collection_id} | jq
```

**what to notice**:
- stores vectors + metadata + original text
- uses the HNSW algorithm for fast approximate nearest-neighbor search
- "approximate" because exact search is O(n); HNSW is roughly O(log n)

## RAG (retrieval-augmented generation)

**what it is**: instead of asking the LLM to know everything, retrieve relevant context first and inject it into the prompt.

**where it lives**: open-webui orchestrates the full flow

**the flow**:

```
your question
  ↓
embed question → vector
  ↓
search chroma → similar chunks
  ↓
build prompt: "given this context: {chunks}, answer: {question}"
  ↓
send to ollama → generate answer
```

**explore it**:
1. upload a PDF to open-webui
2. start a chat, attach the document
3. ask something specific from the doc
4. notice it quotes/references the content

**what to notice**:
- the model doesn't "know" your doc - it's just in the prompt
- retrieval is fast, generation is slow
- quality depends on chunk size, number of chunks retrieved, and model capability
- if the wrong chunks are retrieved, the answer will be wrong

## tokenization

**what it is**: breaking text into pieces (tokens) the model understands. not words - subword units.

**explore it**:

```bash
# ollama shows token counts in verbose mode
docker exec ollama ollama run tinyllama "hello world" --verbose
```

**what to notice**:
- "hello" might be 1 token; "unconstitutional" might be 3
- ~4 chars per token on average for english
- context window = max tokens the model can see at once
- tinyllama: 2048 tokens, phi3: 4096 tokens

## quantization

**what it is**: using smaller numbers (4-bit integers instead of 16-bit floats) to represent weights. makes models smaller and faster with minimal quality loss.
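to see why quality loss is small, here's a toy sketch of symmetric 4-bit quantization in python. this is an illustration of the idea, not ollama's actual GGUF scheme (which quantizes in blocks with per-block scales):

```python
def quantize_4bit(weights):
    # map floats onto signed 4-bit ints (-8..7) using one shared scale factor
    scale = max(abs(w) for w in weights) / 7
    return [round(w / scale) for w in weights], scale

def dequantize(qweights, scale):
    # recover approximate floats - this is what inference actually multiplies with
    return [q * scale for q in qweights]

weights = [0.42, -1.30, 0.07, 0.91]       # pretend these are model weights
q, scale = quantize_4bit(weights)          # 4 bits each instead of 16
restored = dequantize(q, scale)

for orig, rest in zip(weights, restored):
    print(f"{orig:+.2f} -> {rest:+.2f}")
```

each restored weight is slightly off from the original, but the storage cost dropped ~4x - that rounding error is the "minimal quality loss".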
**where it lives**: the models you pull are already quantized

**explore it**:

```bash
# model names often indicate quantization
# tinyllama is Q4_0 by default (4-bit)
docker exec ollama ollama show tinyllama
```

**what to notice**:
- q4 = 4-bit, q8 = 8-bit, f16 = full precision
- 4-bit is ~4x smaller than 16-bit
- quality loss is usually small at 4-bit
- your CPU thanks you

## what this stack does NOT cover

these require GPUs, money, or both:

- **training**: teaching a model from scratch. needs 100s of GPUs and millions of dollars
- **fine-tuning**: even LoRA needs a decent GPU (8GB+ VRAM minimum)
- **MLOps**: experiment tracking, model registries, CI/CD for models
- **multi-GPU inference**: tensor parallelism, pipeline parallelism

this stack is the "use pretrained models" side of the diagram:

```
[someone else trained it] → [we download it] → [we run inference]
                                                      ↓
                                        [we add RAG for context]
```

## experiments to try

1. **compare models**: ask the same question to tinyllama and phi3:mini. notice the quality/speed tradeoff
2. **RAG quality**: upload a doc, ask questions. see when it gets things right vs wrong
3. **context limits**: paste a huge prompt, see when the model refuses or truncates
4. **embedding similarity**: upload two similar docs, see if asking about one retrieves chunks from the other
5. **token counting**: write a long prompt, estimate the token count, check with --verbose

## glossary

| term | meaning |
|------|---------|
| inference | running a model to get output |
| embedding | text → vector |
| vector | array of numbers representing meaning |
| RAG | retrieve context, then generate |
| token | subword unit models understand |
| quantization | compress weights to smaller numbers |
| context window | max tokens a model can process |
| VRAM | GPU memory (we don't have this) |
| HNSW | algorithm for fast similarity search |
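for the token-counting experiment, a quick estimator built on the ~4 chars/token rule of thumb. it's a heuristic for english prose, not a real tokenizer - sanity-check its guesses against `--verbose`:

```python
def estimate_tokens(text: str) -> int:
    # rough heuristic: ~4 characters per token for english text
    return max(1, round(len(text) / 4))

def fits_context(text: str, context_window: int = 2048) -> bool:
    # tinyllama's context window is 2048 tokens; phi3's is 4096
    return estimate_tokens(text) <= context_window

prompt = "explain insurance in one sentence"
print(estimate_tokens(prompt))          # → 8
print(fits_context(prompt))             # → True
```

the real count varies with vocabulary - rare words like "unconstitutional" split into more tokens than the heuristic predicts, which is exactly what the experiment is meant to show.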