commit a79cfcaca3
Add initial generation

4 changed files with 203 additions and 0 deletions
4 .gitignore vendored Normal file
@@ -0,0 +1,4 @@
# docker volumes are external, nothing to ignore really
# but good practice to have this file
*.log
.env
86 README.md Normal file
@@ -0,0 +1,86 @@
# local-llm-stack

run llms locally on cpu. slow but complete.

## what's in the box

```
ollama (port 11434) - runs the models
        ↓
open-webui (port 3001) - chat interface + RAG
        ↓
chroma (port 8007) - vector database for document retrieval
```

## quickstart

```bash
just up              # start everything
just pull tinyllama  # download a small model (~600MB)
just open            # open web ui at localhost:3001
```

## the stack explained

**ollama** - inference engine. downloads models, loads them into memory, generates tokens. uses llama.cpp under the hood, which is optimized for cpu.

**open-webui** - web interface for chatting. also handles:

- document upload (pdf, txt, etc)
- embedding documents into vectors
- RAG (retrieval-augmented generation)
- conversation history

**chroma** - vector database. when you upload docs:

1. open-webui chunks the text
2. an embedding model converts chunks to vectors
3. vectors are stored in chroma
4. when you ask a question, similar chunks are retrieved
5. chunks are injected into the prompt as context
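the five steps above, sketched end to end in plain python. this is a toy stand-in only - word counts instead of the all-MiniLM-L6-v2 embedding model, a list instead of chroma - just to show the shape of RAG:

```python
import math
from collections import Counter

def chunk(text, size=6):
    """1. split text into fixed-size word chunks (open-webui splits smarter than this)"""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """2. chunk -> vector. word counts stand in for the real embedding model."""
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 3. "store" the vectors (chroma's job in the real stack)
corpus = ("ollama runs the models on cpu. "
          "chroma stores document vectors. open-webui is the chat interface.")
index = [(c, embed(c)) for c in chunk(corpus)]

# 4. retrieve the chunk most similar to the question
question = "what stores the vectors?"
best = max(index, key=lambda item: cosine(embed(question), item[1]))
context = best[0]

# 5. inject it into the prompt as context
prompt = f"Context: {context}\n\nQuestion: {question}"
```

swap in a real embedding model and a real vector store and this is the whole pipeline.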

## models for cpu

| model | params | ram needed | speed |
|-------|--------|------------|-------|
| qwen2:0.5b | 0.5B | ~1GB | fast |
| tinyllama | 1.1B | ~2GB | fast |
| gemma2:2b | 2B | ~3GB | ok |
| phi3:mini | 3.8B | ~4GB | slow |

```bash
just pull qwen2:0.5b
just pull tinyllama
just recommend   # see all options
```
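the ram column follows a rough rule of thumb: ollama pulls 4-bit quantized weights by default, so figure on the order of a byte per parameter plus a fixed allowance for kv cache and runtime. the constants below are eyeballed to reproduce the table, not measured:

```python
import math

def ram_gb(params_billion, bytes_per_param=0.9, overhead_gb=0.5):
    """very rough estimate: quantized weights plus a fixed kv-cache/runtime
    allowance. both constants are assumptions fitted to the table above."""
    return math.ceil(params_billion * bytes_per_param + overhead_gb)

for name, p in [("qwen2:0.5b", 0.5), ("tinyllama", 1.1),
                ("gemma2:2b", 2.0), ("phi3:mini", 3.8)]:
    print(f"{name:<11} ~{ram_gb(p)}GB")
```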

## useful commands

```bash
just up      # start
just down    # stop
just logs    # watch all logs
just models  # list downloaded models
just stats   # cpu/mem usage
just nuke    # delete everything including data
```

## testing rag

1. open http://localhost:3001
2. click workspace (top left) > documents
3. upload a pdf or txt file
4. start a chat, click the + button, attach the document
5. ask questions about it

## what this isn't

this is inference, not ML. we're not training anything - just running models that others trained. the "learning" in machine learning happened elsewhere, on gpu clusters. we're just using the results.

## hardware notes

tested on an intel i5-6500t (no gpu). expect:

- ~2-5 tokens/sec with tinyllama
- ~1-2 tokens/sec with phi3:mini
- first response slow (model loading)
- subsequent responses faster (model stays in ram)

more ram = can run bigger models. 16gb should handle 7b models (slowly).
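those rates translate straight into wait time, which is worth internalizing before firing off a long prompt:

```python
def eta_seconds(tokens, tokens_per_sec):
    """how long a response takes at a given generation rate"""
    return tokens / tokens_per_sec

# a 200-token answer on this hardware:
print(f"tinyllama (~4 tok/s):   {eta_seconds(200, 4):.0f}s")
print(f"phi3:mini (~1.5 tok/s): {eta_seconds(200, 1.5):.0f}s")
```

so a long phi3:mini answer can take a couple of minutes. plan prompts accordingly.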
54 docker-compose.yml Normal file
@@ -0,0 +1,54 @@
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
    # no gpu, cpu only - it'll be slow but it works
    # add this if you ever get a gpu:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open_webui_data:/app/backend/data
    ports:
      - "3001:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=false
      # use external chroma for RAG
      - CHROMA_HTTP_HOST=chroma
      - CHROMA_HTTP_PORT=8000
      - RAG_EMBEDDING_MODEL=all-MiniLM-L6-v2
    depends_on:
      - ollama
      - chroma
    restart: unless-stopped

  chroma:
    image: chromadb/chroma:latest
    container_name: chroma
    volumes:
      - chroma_data:/chroma/chroma
    ports:
      - "8007:8000"
    environment:
      - IS_PERSISTENT=TRUE
      - ANONYMIZED_TELEMETRY=FALSE
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
  chroma_data:
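one optional hardening, not part of the committed file: compose healthchecks can gate open-webui on ollama actually answering rather than merely having started. a sketch, assuming the ollama image's own cli works as a probe:

```yaml
# overlay sketch - merge into the services above if wanted
services:
  ollama:
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 10s
      timeout: 5s
      retries: 5
  open-webui:
    depends_on:
      ollama:
        condition: service_healthy
```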
59 justfile Normal file
@@ -0,0 +1,59 @@
# local-llm-stack commands

# start everything
up:
    docker compose up -d

# stop everything
down:
    docker compose down

# view logs (all services)
logs:
    docker compose logs -f

# view logs for specific service
log service:
    docker compose logs -f {{service}}

# pull a model (default: tinyllama - small and fast on cpu)
pull model="tinyllama":
    docker exec ollama ollama pull {{model}}

# list downloaded models
models:
    docker exec ollama ollama list

# run a quick test prompt
test model="tinyllama":
    docker exec ollama ollama run {{model}} "Say hello in exactly 5 words"

# open the web ui
open:
    xdg-open http://localhost:3001

# check status of all services
status:
    docker compose ps

# restart a service
restart service:
    docker compose restart {{service}}

# nuke everything (volumes too)
nuke:
    docker compose down -v

# show resource usage
stats:
    docker stats ollama open-webui chroma --no-stream

# recommended small models for cpu
recommend:
    @echo "models that won't melt your cpu:"
    @echo "  tinyllama  - 1.1B params, very fast"
    @echo "  phi3:mini  - 3.8B params, smart for size"
    @echo "  qwen2:0.5b - 0.5B params, tiny"
    @echo "  gemma2:2b  - 2B params, decent"
    @echo ""
    @echo "pull with: just pull tinyllama"
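the recipes all shell out through docker, but ollama also listens on plain http, so you can skip `just` entirely. a minimal sketch against `/api/generate` using only the stdlib (the stack must be up for `generate` to actually work):

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"

def build_payload(prompt, model="tinyllama"):
    # stream=False returns one json object instead of a stream of chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="tinyllama"):
    """POST to ollama's /api/generate endpoint and return the response text."""
    req = urllib.request.Request(
        f"{OLLAMA}/api/generate",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# with the stack up: print(generate("Say hello in exactly 5 words"))
```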