# local-llm-stack
run llms locally on cpu. slow but complete.
## what's in the box
```
ollama (port 11434) - runs the models
open-webui (port 3001) - chat interface + RAG
chroma (port 8007) - vector database for document retrieval
```
## quickstart
```bash
just up # start everything
just pull tinyllama # download a small model (~600MB)
just open # open web ui at localhost:3001
```
## the stack explained
**ollama** - inference engine. downloads models, loads them into memory, generates tokens. uses llama.cpp under the hood, which is optimized for cpu inference.
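ollama also exposes an http api on its port (11434). a minimal sketch of calling its `/api/generate` endpoint from python — assumes the stack is up and the model is already pulled:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434"  # default ollama port

def build_generate_request(model: str, prompt: str) -> tuple[str, bytes]:
    """build the url and json body for ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return f"{OLLAMA_URL}/api/generate", body.encode()

def generate(model: str, prompt: str) -> str:
    """send the request and return the generated text.
    only works when ollama is actually running locally."""
    url, body = build_generate_request(model, prompt)
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

with `stream: false` the server returns one json object with the whole response; drop it to get streamed tokens instead.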
**open-webui** - web interface for chatting. also handles:
- document upload (pdf, txt, etc)
- embedding documents into vectors
- RAG (retrieval-augmented generation)
- conversation history
**chroma** - vector database. when you upload docs:
1. open-webui chunks the text
2. embedding model converts chunks to vectors
3. vectors stored in chroma
4. when you ask a question, similar chunks retrieved
5. chunks injected into prompt as context
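the retrieval steps above can be sketched with a toy bag-of-words "embedding" — real setups use a neural embedding model and chroma does the storage, but the similarity math works the same way:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """toy bag-of-words vector; stands in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """steps 4-5: rank chunks by similarity to the question, keep top k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "ollama runs models on cpu",
    "chroma stores document vectors",
    "the web ui runs on port 3001",
]
context = retrieve("where are vectors stored?", chunks, k=1)
prompt = "context:\n" + "\n".join(context) + "\n\nquestion: where are vectors stored?"
```

the assembled `prompt` is what actually gets sent to the model — the model never sees the whole document, just the retrieved chunks.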
## models for cpu
| model | params | ram needed | speed |
|-------|--------|------------|-------|
| qwen2:0.5b | 0.5B | ~1GB | fast |
| tinyllama | 1.1B | ~2GB | fast |
| gemma2:2b | 2B | ~3GB | ok |
| phi3:mini | 3.8B | ~4GB | slow |
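the ram column roughly follows from weight size: ollama models are typically 4-bit quantized (~0.5 bytes per parameter), plus overhead for the kv cache and runtime. a back-of-the-envelope estimate (the ~1gb overhead figure is a guess; real usage varies with context length):

```python
def ram_estimate_gb(params_billion: float,
                    bytes_per_weight: float = 0.5,
                    overhead_gb: float = 1.0) -> float:
    """rough rule of thumb: weights at 4-bit quantization plus fixed overhead."""
    return params_billion * bytes_per_weight + overhead_gb

for name, params in [("qwen2:0.5b", 0.5), ("tinyllama", 1.1),
                     ("gemma2:2b", 2.0), ("phi3:mini", 3.8)]:
    print(f"{name}: ~{ram_estimate_gb(params):.1f} GB")
```

this lands in the same ballpark as the table; the same rule says a 7b model needs ~4.5GB at 4-bit, which is why 16gb machines can run them.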
```bash
just pull qwen2:0.5b
just pull tinyllama
just recommend # see all options
```
## useful commands
```bash
just up # start
just down # stop
just logs # watch all logs
just models # list downloaded models
just stats # cpu/mem usage
just nuke # delete everything including data
```
## testing rag
1. open http://localhost:3001
2. click workspace (top left) > documents
3. upload a pdf or txt file
4. start a chat, click the + button, attach the document
5. ask questions about it
## learning
see [docs/concepts.md](docs/concepts.md) for hands-on exploration of:
- inference and tokenization
- embeddings and vector search
- RAG flow end-to-end
- quantization
- experiments to try
## what this isn't
this is inference, not training. we're not training anything - just running models that others trained. the "learning" in machine learning happened elsewhere, on gpu clusters. we're just using the results.
## hardware notes
tested on intel i5-6500t (no gpu). expect:
- ~2-5 tokens/sec with tinyllama
- ~1-2 tokens/sec with phi3:mini
- first response slow (model loading)
- subsequent responses faster (model stays in ram)
more ram = can run bigger models. 16gb should handle 7b models (slowly).
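those throughput numbers translate directly into wait time. a quick way to set expectations for a reply of a given length (ignores the prompt-processing phase, which adds to the first response):

```python
def response_seconds(tokens: int, tokens_per_sec: float) -> float:
    """time to generate a reply at a given throughput."""
    return tokens / tokens_per_sec

# a ~150-token answer at tinyllama's low end vs phi3:mini's low end
fast = response_seconds(150, 2.0)  # 75 seconds
slow = response_seconds(150, 1.0)  # 150 seconds
```

so even a short paragraph takes a minute or two on this hardware — pick the smallest model that does the job.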