local-llm-stack

run llms locally on cpu. slow but complete.

what's in the box

ollama (port 11434)     - runs the models
    ↓
open-webui (port 3001)  - chat interface + RAG
    ↓
chroma (port 8007)      - vector database for document retrieval

quickstart

just up              # start everything
just pull tinyllama  # download a small model (~600MB)
just open            # open web ui at localhost:3001

the stack explained

ollama - inference engine. downloads models, loads them into memory, generates tokens. uses llama.cpp under the hood, which is optimized for cpu inference.
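you can also talk to ollama directly over its http api, skipping the web ui. a minimal stdlib-only sketch; assumes the stack is up and tinyllama has been pulled:

```python
import json
import urllib.request

def build_generate_request(model, prompt):
    """Build the JSON payload for ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

def generate(prompt, model="tinyllama", host="http://localhost:11434"):
    """POST a prompt to ollama and return the generated text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_generate_request(model, prompt).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("why is the sky blue?")  # needs the stack running
```

with `stream` set to true (the default), ollama returns one json object per token instead of a single response - handy for watching a slow cpu generate in real time.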

open-webui - web interface for chatting. also handles:

  • document upload (pdf, txt, etc)
  • embedding documents into vectors
  • RAG (retrieval-augmented generation)
  • conversation history

chroma - vector database. when you upload docs:

  1. open-webui chunks the text
  2. embedding model converts chunks to vectors
  3. vectors stored in chroma
  4. when you ask a question, the most similar chunks are retrieved
  5. retrieved chunks are injected into the prompt as context
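the steps above can be sketched in a few lines of python. the bag-of-words "embedding" here is a toy stand-in for the real embedding model, and the list stands in for chroma - but the retrieval math (cosine similarity between vectors) is the same idea:

```python
import math
import re

def embed(text):
    """Toy 'embedding': word counts. A stand-in for the real dense
    embedding model open-webui uses; enough to show the retrieval math."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    return {w: words.count(w) for w in set(words)}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# steps 1-3: chunk documents and store their vectors
chunks = [
    "chroma stores document vectors for retrieval",
    "ollama runs quantized models on the cpu",
    "open-webui provides the chat interface",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# step 4: embed the question, find the most similar chunk
question = "which component runs models on the cpu?"
best = max(store, key=lambda item: cosine(embed(question), item[1]))

# step 5: inject the retrieved chunk into the prompt as context
prompt = f"context: {best[0]}\n\nquestion: {question}"
```

the model never "reads" your whole document - it only sees whichever chunks scored highest for your question, which is why chunking strategy matters for RAG quality.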

models for cpu

model        params  ram needed  speed
qwen2:0.5b   0.5B    ~1GB        fast
tinyllama    1.1B    ~2GB        fast
gemma2:2b    2B      ~3GB        ok
phi3:mini    3.8B    ~4GB        slow

just pull qwen2:0.5b
just pull tinyllama
just recommend  # see all options

useful commands

just up        # start
just down      # stop
just logs      # watch all logs
just models    # list downloaded models
just stats     # cpu/mem usage
just nuke      # delete everything including data

testing rag

  1. open http://localhost:3001
  2. click workspace (top left) > documents
  3. upload a pdf or txt file
  4. start a chat, click the + button, attach the document
  5. ask questions about it

learning

see docs/concepts.md for hands-on exploration of:

  • inference and tokenization
  • embeddings and vector search
  • RAG flow end-to-end
  • quantization
  • experiments to try

what this isn't

this is inference, not training. nothing here learns anything - we're just running models that others already trained. the "learning" in machine learning happened elsewhere, on gpu clusters. we're just using the results.

hardware notes

tested on intel i5-6500t (no gpu). expect:

  • ~2-5 tokens/sec with tinyllama
  • ~1-2 tokens/sec with phi3:mini
  • first response slow (model loading)
  • subsequent responses faster (model stays in ram)
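those rates translate directly into wait times. a quick back-of-the-envelope (the 200-token answer length is just an illustrative guess):

```python
def response_seconds(n_tokens, tokens_per_sec):
    """Wall-clock time to generate a response at a given rate."""
    return n_tokens / tokens_per_sec

# a ~200-token answer from tinyllama at 3 tok/s: about a minute
print(round(response_seconds(200, 3)))  # 67
```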

more ram = can run bigger models. 16gb should handle 7b models (slowly).
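the ram figures follow from quantization: a 4-bit model stores roughly half a byte per parameter, plus room for the kv cache and runtime. a rough rule of thumb, not an exact formula (the 1GB overhead is an assumption):

```python
def est_ram_gb(params_billion, bits_per_weight=4, overhead_gb=1.0):
    """Rough RAM estimate for a quantized model: weight bytes plus a
    flat allowance for kv cache and runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

print(round(est_ram_gb(7), 1))  # 7B at q4: ~4.5GB, fits in 16GB
```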