# local-llm-stack

run llms locally on cpu. slow but complete.

## what's in the box

```
ollama (port 11434) - runs the models
        ↓
open-webui (port 3001) - chat interface + RAG
        ↓
chroma (port 8007) - vector database for document retrieval
```

## quickstart

```bash
just up              # start everything
just pull tinyllama  # download a small model (~600MB)
just open            # open web ui at localhost:3001
```

## the stack explained

**ollama** - inference engine. downloads models, loads them into memory, generates tokens. uses llama.cpp under the hood, which is optimized for cpu.

**open-webui** - web interface for chatting. also handles:

- document upload (pdf, txt, etc)
- embedding documents into vectors
- RAG (retrieval-augmented generation)
- conversation history

**chroma** - vector database. when you upload docs:

1. open-webui chunks the text
2. an embedding model converts each chunk to a vector
3. the vectors are stored in chroma
4. when you ask a question, similar chunks are retrieved
5. those chunks are injected into the prompt as context

## models for cpu

| model | params | ram needed | speed |
|-------|--------|------------|-------|
| qwen2:0.5b | 0.5B | ~1GB | fast |
| tinyllama | 1.1B | ~2GB | fast |
| gemma2:2b | 2B | ~3GB | ok |
| phi3:mini | 3.8B | ~4GB | slow |

```bash
just pull qwen2:0.5b
just pull tinyllama
just recommend        # see all options
```

## useful commands

```bash
just up      # start
just down    # stop
just logs    # watch all logs
just models  # list downloaded models
just stats   # cpu/mem usage
just nuke    # delete everything including data
```

## testing rag

1. open http://localhost:3001
2. click workspace (top left) > documents
3. upload a pdf or txt file
4. start a chat, click the + button, attach the document
5. ask questions about it

## learning

see [docs/concepts.md](docs/concepts.md) for hands-on exploration of:

- inference and tokenization
- embeddings and vector search
- RAG flow end-to-end
- quantization
- experiments to try

## what this isn't

this is inference, not ML.
we're not training anything - just running models that others trained. the "learning" in machine learning happened elsewhere, on gpu clusters. we're just using the results.

## hardware notes

tested on an intel i5-6500t (no gpu). expect:

- ~2-5 tokens/sec with tinyllama
- ~1-2 tokens/sec with phi3:mini
- first response slow (model loading)
- subsequent responses faster (the model stays in ram)

more ram = can run bigger models. 16gb should handle 7b models (slowly).
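to check the tokens/sec numbers on your own hardware, you can hit ollama's HTTP API directly: a non-streaming call to `/api/generate` returns `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). a minimal sketch, assuming the stack is up and tinyllama is pulled:

```python
import json
import urllib.request


def tokens_per_sec(resp: dict) -> float:
    """Throughput from an ollama /api/generate response.

    eval_duration is reported in nanoseconds, so convert to seconds first.
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def benchmark(prompt: str, model: str = "tinyllama") -> float:
    """Run one non-streaming generation against the local ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return tokens_per_sec(json.load(r))


if __name__ == "__main__":
    print(f"{benchmark('why is the sky blue?'):.1f} tokens/sec")
```

run it a second time to see the warm-model speedup: the first call pays the model-loading cost, later ones don't.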