Add a concepts doc

commit 0361b64502 (parent ac8c2a490c)
2 changed files with 197 additions and 0 deletions

@@ -71,6 +71,15 @@ just nuke # delete everything including data
4. start a chat, click the + button, attach the document
5. ask questions about it

## learning

see [docs/concepts.md](docs/concepts.md) for hands-on exploration of:

- inference and tokenization
- embeddings and vector search
- RAG flow end-to-end
- quantization
- experiments to try

## what this isn't

this is inference, not ML. we're not training anything - just running models that others trained. the "learning" in machine learning happened elsewhere on gpu clusters. we're just using the results.

docs/concepts.md (new file)
@@ -0,0 +1,188 @@

# concepts you can explore here

this stack lets you hands-on explore several pieces of the modern AI stack.
here's what each thing is and how to poke at it.

## inference

**what it is**: running a trained model to generate output. the model's weights are frozen - no learning happens, just prediction.

**where it lives**: ollama

**explore it**:

```bash
# see what's loaded
docker exec ollama ollama list

# run a prompt directly
docker exec ollama ollama run tinyllama "explain insurance in one sentence"

# watch resource usage during generation
just stats
```

**what to notice**:

- first response is slow (loading model into RAM)
- subsequent responses faster (model stays loaded)
- tokens/sec depends on model size and your CPU
- bigger model = smarter but slower

**try different models**:

```bash
just pull qwen2:0.5b  # tiny, fast, dumb
just pull tinyllama   # small, decent
just pull phi3:mini   # bigger, slower, smarter
just pull gemma2:2b   # google's small model
```
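the CLI goes through ollama's HTTP API, which you can also hit directly. a minimal sketch - it assumes the compose file publishes ollama's default port 11434 on the host, which may not be true in this stack:

```python
import json
import urllib.request

# assumption: ollama's default port 11434 is published on the host
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def ask(model: str, prompt: str) -> str:
    """Send the prompt and return the generated text (stack must be running)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        body = json.load(resp)
    # the response also carries timing fields (eval_count, eval_duration)
    # you can use to compute tokens/sec yourself
    return body["response"]

# with the stack up:
# print(ask("tinyllama", "explain insurance in one sentence"))
```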

## embeddings

**what it is**: converting text into vectors (arrays of numbers) where similar meanings are close together in vector space.

**where it lives**: open-webui runs `all-MiniLM-L6-v2` when you upload docs

**explore it**:

```bash
# check chroma has collections after uploading a doc
curl http://localhost:8007/api/v2/collections | jq
```

**what to notice**:

- embedding is fast (small model, ~80MB)
- same text always produces same vector
- "car" and "automobile" vectors are close
- "car" and "banana" vectors are far

**the math**: vectors have 384 dimensions. similarity is measured by cosine distance - how much two vectors point in the same direction.
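the math fits in a few lines of plain python. toy 3-d vectors stand in for the real 384-d ones, and the values below are made up for illustration (cosine *similarity* is just 1 minus cosine distance):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by their lengths.
    1.0 = same direction, 0.0 = unrelated, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy 3-dimensional "embeddings" (invented numbers, not real model output)
car        = [0.9, 0.8, 0.1]
automobile = [0.8, 0.9, 0.2]
banana     = [0.1, 0.2, 0.9]

print(cosine_similarity(car, automobile))  # close to 1
print(cosine_similarity(car, banana))      # much lower
```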

## vector database

**what it is**: a database optimized for "find me things similar to X" instead of "find me things equal to X".

**where it lives**: chroma on port 8007

**explore it**:

```bash
# health check
curl http://localhost:8007/api/v2/heartbeat

# list collections (each uploaded doc becomes one)
curl http://localhost:8007/api/v2/collections | jq

# get collection details
curl http://localhost:8007/api/v2/collections/{collection_id} | jq
```

**what to notice**:

- stores vectors + metadata + original text
- uses HNSW algorithm for fast approximate nearest neighbor search
- "approximate" because exact search is O(n), HNSW is O(log n)
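to see what HNSW is optimizing away, here's the exact O(n) version - scan every stored vector, keep the best match. the store below is a made-up toy, not real chroma data:

```python
import math

def exact_nearest(query: list[float], vectors: dict[str, list[float]]) -> str:
    """Exact nearest-neighbor search: compare the query against every stored
    vector. This linear scan is the O(n) cost that HNSW's graph avoids."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    return max(vectors, key=lambda name: cos(query, vectors[name]))

# toy store (hypothetical vectors for illustration)
store = {
    "chunk about cars":    [0.9, 0.1, 0.0],
    "chunk about fruit":   [0.1, 0.9, 0.1],
    "chunk about weather": [0.0, 0.1, 0.9],
}
print(exact_nearest([0.8, 0.2, 0.1], store))  # → chunk about cars
```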

## RAG (retrieval-augmented generation)

**what it is**: instead of asking the LLM to know everything, retrieve relevant context first and inject it into the prompt.

**where it lives**: open-webui orchestrates the full flow

**the flow**:

```
your question
      ↓
embed question → vector
      ↓
search chroma → similar chunks
      ↓
build prompt: "given this context: {chunks}, answer: {question}"
      ↓
send to ollama → generate answer
```
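the flow above can be sketched in a few lines. everything here is a toy stand-in: `embed` uses letter frequencies instead of a real model, the chunks are invented, and the final ollama call is left as a comment:

```python
def embed(text: str) -> list[float]:
    # stand-in embedding: letter frequencies (real embeddings capture meaning)
    return [text.lower().count(c) / max(len(text), 1) for c in "abcdefghij"]

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question, keep the top k."""
    q = embed(question)
    def score(chunk: str) -> float:
        c = embed(chunk)
        return sum(x * y for x, y in zip(q, c))
    return sorted(chunks, key=score, reverse=True)[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved context into the prompt - this is the whole trick."""
    context = "\n".join(chunks)
    return f"given this context: {context}\nanswer: {question}"

# toy document chunks
chunks = ["deductible is $500", "policy covers flood damage", "renewal is annual"]
question = "what is the deductible?"
prompt = build_prompt(question, retrieve(question, chunks))
print(prompt)
# this prompt would then go to ollama for generation
```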

**explore it**:

1. upload a PDF to open-webui
2. start a chat, attach the document
3. ask something specific from the doc
4. notice it quotes/references the content

**what to notice**:

- model doesn't "know" your doc - it's just in the prompt
- retrieval is fast, generation is slow
- quality depends on: chunk size, number of chunks retrieved, model capability
- if wrong chunks retrieved, answer will be wrong

## tokenization

**what it is**: breaking text into pieces (tokens) the model understands. not words - subword units.

**explore it**:

```bash
# ollama shows token count in verbose mode
docker exec ollama ollama run tinyllama "hello world" --verbose
```

**what to notice**:

- "hello" might be 1 token, "unconstitutional" might be 3
- ~4 chars per token on average for english
- context window = max tokens model can see at once
- tinyllama: 2048 tokens, phi3: 4096 tokens
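the ~4 chars/token rule is enough to sanity-check whether a prompt fits the context window. a rough estimator (ballpark only - real BPE tokenizers vary by text):

```python
def estimate_tokens(text: str) -> int:
    """Rule of thumb: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, context_window: int = 2048) -> bool:
    """Will the prompt (roughly) fit tinyllama's 2048-token window?"""
    return estimate_tokens(text) <= context_window

print(estimate_tokens("explain insurance in one sentence"))  # ballpark count
print(fits_in_context("x" * 100_000))  # False - way over 2048 tokens
```

compare the estimate against the real count from `--verbose` to see how far off the rule of thumb is.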

## quantization

**what it is**: using smaller numbers (4-bit integers instead of 16-bit floats) to represent weights. makes models smaller and faster with minimal quality loss.

**where it lives**: the models you pull are already quantized

**explore it**:

```bash
# model names often indicate quantization
# tinyllama is Q4_0 by default (4-bit)
docker exec ollama ollama show tinyllama
```

**what to notice**:

- q4 = 4-bit, q8 = 8-bit, f16 = full precision
- 4-bit is ~4x smaller than 16-bit
- quality loss is usually small for 4-bit
- your CPU thanks you
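the size math is simple enough to do by hand. a sketch assuming tinyllama's roughly 1.1B parameters (real files run slightly larger because of quantization block overhead):

```python
PARAMS = 1.1e9  # assumption: tinyllama has ~1.1B parameters

def model_size_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage: params * bits / 8 bytes, in GB."""
    return params * bits_per_weight / 8 / 1e9

for name, bits in [("f16", 16), ("q8", 8), ("q4", 4)]:
    print(f"{name}: ~{model_size_gb(PARAMS, bits):.1f} GB")
# f16 ~2.2 GB vs q4 ~0.6 GB - the ~4x shrink that makes CPU inference workable
```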

## what this stack does NOT cover

these require GPUs, money, or both:

- **training**: teaching a model from scratch. needs 100s of GPUs, millions of dollars
- **fine-tuning**: even LoRA needs a decent GPU (8GB+ VRAM minimum)
- **MLOps**: experiment tracking, model registries, CI/CD for models
- **multi-GPU inference**: tensor parallelism, pipeline parallelism

this stack is the "use pretrained models" side of the diagram:

```
[someone else trained it] → [we download it] → [we run inference]
                                                      ↓
                                          [we add RAG for context]
```

## experiments to try

1. **compare models**: ask same question to tinyllama vs phi3:mini. notice quality/speed tradeoff
2. **RAG quality**: upload a doc, ask questions. see when it gets things right vs wrong
3. **context limits**: paste a huge prompt, see when model refuses or truncates
4. **embedding similarity**: upload two similar docs, see if asking about one retrieves chunks from the other
5. **token counting**: write a long prompt, estimate tokens, check with --verbose

## glossary

| term | meaning |
|------|---------|
| inference | running a model to get output |
| embedding | text → vector |
| vector | array of numbers representing meaning |
| RAG | retrieve context, then generate |
| token | subword unit models understand |
| quantization | compress weights to smaller numbers |
| context window | max tokens model can process |
| VRAM | GPU memory (we don't have this) |
| HNSW | algorithm for fast similarity search |