# local-llm-stack

run llms locally on cpu. slow but complete.

## what's in the box

```
ollama (port 11434) - runs the models
        ↓
open-webui (port 3001) - chat interface + RAG
        ↓
chroma (port 8007) - vector database for document retrieval
```

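each service answers over plain http, so you can probe them without the web ui. a minimal sketch - the health paths here are assumptions based on each project's usual defaults, not verified against this repo:

```python
import urllib.request

# ports match the stack above; the paths are assumed defaults:
# ollama's /api/tags lists pulled models, chroma exposes a heartbeat
SERVICES = {
    "ollama": "http://localhost:11434/api/tags",
    "open-webui": "http://localhost:3001/health",
    "chroma": "http://localhost:8007/api/v1/heartbeat",
}

def check(url, timeout=3):
    """return True if the service answers with http 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    for name, url in SERVICES.items():
        print(f"{name}: {'up' if check(url) else 'down'}")
```
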
## quickstart

```bash
just up              # start everything
just pull tinyllama  # download a small model (~600MB)
just open            # open web ui at localhost:3001
```

## the stack explained
**ollama** - inference engine. downloads models, loads them into memory, generates tokens. uses llama.cpp under the hood which is optimized for cpu.
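open-webui drives ollama through its http api, and you can hit it directly too. a minimal sketch against the `/api/generate` endpoint (assumes you've already pulled `tinyllama`):

```python
import json
import urllib.request

def build_payload(prompt, model="tinyllama"):
    # stream=False asks for one json object instead of a chunk stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="tinyllama", host="http://localhost:11434"):
    """send a prompt to ollama and return the generated text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

expect the first call to block for a while - that's the model being loaded into ram.
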
**open-webui** - web interface for chatting. also handles:

- document upload (pdf, txt, etc)
- embedding documents into vectors
- RAG (retrieval-augmented generation)
- conversation history

**chroma** - vector database. when you upload docs:

1. open-webui chunks the text
2. embedding model converts chunks to vectors
3. vectors stored in chroma
4. when you ask a question, similar chunks retrieved
5. chunks injected into prompt as context

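the five steps above, sketched end-to-end with a toy bag-of-words "embedding" standing in for the real neural embedding model:

```python
import math
from collections import Counter

def chunk(text, size=50):
    # step 1: naive fixed-size word chunks (open-webui splits more carefully)
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # step 2: toy word counts in place of a neural embedding vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    # steps 3-4: score every stored chunk against the question, keep the top k
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, context_chunks):
    # step 5: inject the retrieved chunks into the prompt as context
    context = "\n".join(context_chunks)
    return f"use this context to answer:\n{context}\n\nquestion: {question}"
```

a real embedding model matches meaning rather than shared words - "cpu" would retrieve a chunk about "processors" - but the flow is the same.
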
## models for cpu

| model      | params | ram needed | speed |
|------------|--------|------------|-------|
| qwen2:0.5b | 0.5B   | ~1GB       | fast  |
| tinyllama  | 1.1B   | ~2GB       | fast  |
| gemma2:2b  | 2B     | ~3GB       | ok    |
| phi3:mini  | 3.8B   | ~4GB       | slow  |

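the ram column roughly tracks a back-of-envelope rule: quantized weights cost well under a byte per parameter, plus runtime overhead for the kv cache and buffers. the constants below are hand-tuned guesses, not measurements:

```python
def ram_estimate_gb(params_billion, bytes_per_weight=0.7, overhead_gb=1.0):
    # ~4-bit quantization is 0.5 bytes/weight before scales and padding;
    # 0.7 and the 1GB overhead are rough guesses, not measured values
    return params_billion * bytes_per_weight + overhead_gb

for name, params in [("qwen2:0.5b", 0.5), ("tinyllama", 1.1),
                     ("gemma2:2b", 2.0), ("phi3:mini", 3.8)]:
    print(f"{name}: ~{ram_estimate_gb(params):.1f} GB")
```
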
```bash
just pull qwen2:0.5b
just pull tinyllama
just recommend  # see all options
```

## useful commands

```bash
just up      # start
just down    # stop
just logs    # watch all logs
just models  # list downloaded models
just stats   # cpu/mem usage
just nuke    # delete everything including data
```

## testing rag

1. open http://localhost:3001
2. click workspace (top left) > documents
3. upload a pdf or txt file
4. start a chat, click the + button, attach the document
5. ask questions about it

## learning

see [docs/concepts.md](docs/concepts.md) for hands-on exploration of:

- inference and tokenization
- embeddings and vector search
- RAG flow end-to-end
- quantization
- experiments to try

## what this isn't

this is inference, not ML. we're not training anything - just running models that others trained. the "learning" in machine learning happened elsewhere on gpu clusters. we're just using the results.

## hardware notes
tested on intel i5-6500t (no gpu). expect:

- ~2-5 tokens/sec with tinyllama
- ~1-2 tokens/sec with phi3:mini
- first response slow (model loading)
- subsequent responses faster (model stays in ram)

more ram = can run bigger models. 16gb should handle 7b models (slowly).
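
at those speeds, latency is dominated by how long the answer is. a quick calculator (ignores prompt processing, which also grows with context length):

```python
def response_time_s(tokens_out, tokens_per_s):
    # generation is sequential: one token at a time, so time scales linearly
    return tokens_out / tokens_per_s

# a ~150-token answer at tinyllama speeds (2-5 tok/s) on this cpu:
print(f"{response_time_s(150, 5):.0f}-{response_time_s(150, 2):.0f} seconds")
```
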