local-llm-stack/docs/concepts.md

concepts you can explore here

this stack lets you explore several pieces of the modern AI stack hands-on. here's what each piece is and how to poke at it.

inference

what it is: running a trained model to generate output. the model's weights are frozen - no learning happens, just prediction.

where it lives: ollama

explore it:

# see what's loaded
docker exec ollama ollama list

# run a prompt directly
docker exec ollama ollama run tinyllama "explain insurance in one sentence"

# watch resource usage during generation
just stats

what to notice:

  • first response is slow (loading model into RAM)
  • subsequent responses faster (model stays loaded)
  • tokens/sec depends on model size and your CPU
  • bigger model = smarter but slower

try different models:

just pull qwen2:0.5b    # tiny, fast, dumb
just pull tinyllama     # small, decent
just pull phi3:mini     # bigger, slower, smarter
just pull gemma2:2b     # google's small model

embeddings

what it is: converting text into vectors (arrays of numbers) where similar meanings are close together in vector space.

where it lives: open-webui runs all-MiniLM-L6-v2 when you upload docs

explore it:

# check chroma has collections after uploading a doc
curl http://localhost:8007/api/v2/collections | jq

what to notice:

  • embedding is fast (small model, ~80MB)
  • same text always produces same vector
  • "car" and "automobile" vectors are close
  • "car" and "banana" vectors are far

the math: vectors are 384 dimensions. similarity is measured by cosine similarity - how much two vectors point in the same direction (cosine distance is just 1 minus that).
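
the cosine math fits in a few lines of python. this is a toy sketch with made-up 3-dimensional vectors, not real all-MiniLM-L6-v2 output (which would be 384-dimensional):

```python
import math

# cosine similarity: dot product divided by the product of magnitudes.
# 1.0 = pointing the same direction, 0.0 = orthogonal (unrelated).
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# invented toy "embeddings" - real ones come from the embedding model
car = [0.9, 0.1, 0.3]
automobile = [0.85, 0.15, 0.35]
banana = [0.1, 0.9, 0.2]

print(cosine_similarity(car, automobile))  # close to 1.0
print(cosine_similarity(car, banana))      # much lower
```

same idea at 384 dimensions, just a longer loop.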

vector database

what it is: a database optimized for "find me things similar to X" instead of "find me things equal to X".

where it lives: chroma on port 8007

explore it:

# health check
curl http://localhost:8007/api/v2/heartbeat

# list collections (each uploaded doc becomes one)
curl http://localhost:8007/api/v2/collections | jq

# get collection details
curl http://localhost:8007/api/v2/collections/{collection_id} | jq

what to notice:

  • stores vectors + metadata + original text
  • uses HNSW algorithm for fast approximate nearest neighbor search
  • "approximate" because exact search is O(n), HNSW is O(log n)

RAG (retrieval-augmented generation)

what it is: instead of asking the LLM to know everything, retrieve relevant context first and inject it into the prompt.

where it lives: open-webui orchestrates the full flow

the flow:

your question
    ↓
embed question → vector
    ↓
search chroma → similar chunks
    ↓
build prompt: "given this context: {chunks}, answer: {question}"
    ↓
send to ollama → generate answer
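
the "build prompt" step is nothing magic - it's string assembly. a minimal sketch (function name, template, and example chunks are all made up for illustration):

```python
# paste retrieved chunks into a template ahead of the user's question.
# this is all "augmentation" means: the doc text rides along in the prompt.
def build_rag_prompt(chunks, question):
    context = "\n\n".join(chunks)
    return f"given this context: {context}, answer: {question}"

chunks = [
    "the deductible is the amount you pay before coverage kicks in.",
    "premiums are paid monthly.",
]
prompt = build_rag_prompt(chunks, "what is a deductible?")
print(prompt)
```

the final prompt is what actually goes to ollama - the model never sees your document any other way.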

explore it:

  1. upload a PDF to open-webui
  2. start a chat, attach the document
  3. ask something specific from the doc
  4. notice it quotes/references the content

what to notice:

  • model doesn't "know" your doc - it's just in the prompt
  • retrieval is fast, generation is slow
  • quality depends on: chunk size, number of chunks retrieved, model capability
  • if wrong chunks retrieved, answer will be wrong

tokenization

what it is: breaking text into pieces (tokens) the model understands. not words - subword units.

explore it:

# ollama shows token count in verbose mode
docker exec ollama ollama run tinyllama "hello world" --verbose

what to notice:

  • "hello" might be 1 token, "unconstitutional" might be 3
  • ~4 chars per token on average for english
  • context window = max tokens model can see at once
  • tinyllama: 2048 tokens, phi3: 4096 tokens
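
the ~4 chars/token rule of thumb is enough for a rough estimator. a sketch (real BPE tokenizers will differ, sometimes a lot):

```python
# crude token estimate for english text using the ~4 chars/token average.
def estimate_tokens(text):
    return max(1, len(text) // 4)

prompt = "explain insurance in one sentence"
print(estimate_tokens(prompt))  # ballpark, compare against --verbose

# will this fit in tinyllama's 2048-token context window?
print(estimate_tokens(prompt) < 2048)  # True
```

useful for sanity-checking whether a big paste will blow the context window before you send it.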

quantization

what it is: using smaller numbers (4-bit integers instead of 16-bit floats) to represent weights. makes models smaller and faster with minimal quality loss.

where it lives: the models you pull are already quantized

explore it:

# model names often indicate quantization
# tinyllama is Q4_0 by default (4-bit)
docker exec ollama ollama show tinyllama

what to notice:

  • q4 = 4-bit, q8 = 8-bit, f16 = full precision
  • 4-bit is ~4x smaller than 16-bit
  • quality loss is usually small for 4-bit
  • your CPU thanks you
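
the size math is back-of-envelope: bytes = parameters × bytes per weight. tinyllama has ~1.1B parameters, so:

```python
params = 1.1e9  # tinyllama, roughly

f16_gb = params * 2 / 1e9    # 16-bit floats: 2 bytes per weight, ~2.2 GB
q8_gb = params * 1 / 1e9     # 8-bit ints: 1 byte per weight, ~1.1 GB
q4_gb = params * 0.5 / 1e9   # 4-bit ints: half a byte per weight, ~0.55 GB

print(f"f16: {f16_gb:.1f} GB, q4: {q4_gb:.2f} GB")
print(f16_gb / q4_gb)  # 4.0 - the "~4x smaller" from above
```

(real file sizes are a bit bigger - embeddings, some layers kept at higher precision - but the ratio holds.)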

what this stack does NOT cover

these require GPUs, money, or both:

  • training: teaching a model from scratch. needs 100s of GPUs, millions of dollars
  • fine-tuning: even LoRA needs a decent GPU (8GB+ VRAM minimum)
  • MLOps: experiment tracking, model registries, CI/CD for models
  • multi-GPU inference: tensor parallelism, pipeline parallelism

this stack is the "use pretrained models" side of the diagram:

[someone else trained it] → [we download it] → [we run inference]
                                    ↓
                            [we add RAG for context]

experiments to try

  1. compare models: ask same question to tinyllama vs phi3:mini. notice quality/speed tradeoff

  2. RAG quality: upload a doc, ask questions. see when it gets things right vs wrong

  3. context limits: paste a huge prompt, see what happens when it exceeds the context window (expect truncation or degraded answers rather than an outright refusal)

  4. embedding similarity: upload two similar docs, see if asking about one retrieves chunks from the other

  5. token counting: write a long prompt, estimate tokens, check with --verbose

glossary

inference: running a model to get output
embedding: text → vector
vector: array of numbers representing meaning
RAG: retrieve context, then generate
token: subword unit models understand
quantization: compress weights to smaller numbers
context window: max tokens a model can process
VRAM: GPU memory (we don't have this)
HNSW: algorithm for fast similarity search