lofivor/OPTIMIZATIONS.md

# lofivor optimizations

organized by performance goal. see journal.txt for detailed benchmarks.

## current ceiling

- **~700k entities @ 60fps** (i5-6500T / HD 530 integrated, SSBO)
- **~950k entities @ ~57fps** (i5-6500T / HD 530 integrated, SSBO)
- bottleneck: GPU-bound (update loop stays <5ms even at 950k)

---

## completed optimizations

### rendering pipeline (GPU)

#### baseline: individual drawCircle

- technique: `rl.drawCircle()` per entity
- result: ~5k entities @ 60fps
- problem: each call = separate GPU draw call

#### optimization 1: texture blitting

- technique: pre-render circle to 16x16 texture, `drawTexture()` per entity
- result: ~50k entities @ 60fps
- improvement: **10x** over baseline
- why it works: raylib batches same-texture draws internally

#### optimization 2: rlgl quad batching

- technique: bypass `drawTexture()`, submit vertices directly via `rl.gl`
- result: ~100k entities @ 60fps
- improvement: **2x** over texture blitting, **20x** total
- why it works: eliminates per-call overhead, vertices go straight to GPU buffer

#### optimization 3: increased batch buffer

- technique: increase raylib batch buffer from 8192 to 32768 vertices
- result: ~140k entities @ 60fps (i5-6500T)
- improvement: **~40%** over default buffer
- why it works: fewer GPU flushes per frame

#### optimization 4: GPU instancing (tested, minimal gain)

- technique: `drawMeshInstanced()` with per-entity transform matrices
- result: ~150k entities @ 60fps (i5-6500T) - similar to rlgl batching
- improvement: **negligible** on integrated graphics
- why it didn't help:
  - integrated GPU shares system RAM (no PCIe transfer savings)
  - 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
  - bottleneck is memory bandwidth, not draw call overhead
  - rlgl batching already minimizes draw calls effectively
- note: may help more on discrete GPUs with dedicated VRAM

#### optimization 5: SSBO instance data

- technique: pack entity data (x, y, color) into 12-byte struct, upload via SSBO
- result: **~700k entities @ 60fps** (i5-6500T / HD 530)
- improvement: **~5x** over previous best, **~140x** total from baseline
- comparison:
  - batch buffer (0.3.1): 60fps @ ~140k
  - GPU instancing (0.4.0): 60fps @ ~150k
  - SSBO: 60fps @ ~700k, ~57fps @ 950k
- why it works:
  - 12 bytes vs 64 bytes (matrices) = 5.3x less bandwidth
  - 12 bytes vs 80 bytes (rlgl vertices) = 6.7x less bandwidth
  - no CPU-side matrix calculations
  - GPU does NDC conversion and color unpacking
- implementation notes:
  - custom vertex shader reads from SSBO using `gl_InstanceID`
  - single `rlDrawVertexArrayInstanced()` call for all entities
  - gotcha: don't use `rlSetUniformSampler()` for custom GL code - use `rlSetUniform()` with int type instead (see `docs/raylib_rlSetUniformSampler_bug.md`)

---

## future optimizations

### milestone: push GPU ceiling higher

these target the rendering bottleneck since update loop is already fast.

| technique              | description                                                          | expected gain                   |
| ---------------------- | -------------------------------------------------------------------- | ------------------------------- |
| ~~SSBO instance data~~ | ~~pack (x, y, color) = 12 bytes instead of 64-byte matrices~~        | **done** - see optimization 5   |
| compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync            | significant                     |
| OpenGL vs Vulkan       | test raylib's Vulkan backend                                         | unknown                         |
| discrete GPU testing   | test on dedicated GPU where instancing/SSBO shine                    | significant (different hw)      |

#### rendering culling

| technique          | description                              | expected gain          |
| ------------------ | ---------------------------------------- | ---------------------- |
| frustum culling    | skip entities outside view               | depends on game design |
| LOD rendering      | reduce detail for distant/small entities | moderate               |
| temporal rendering | update/render subset per frame           | moderate               |

---

### milestone: push CPU ceiling (when it becomes the bottleneck)

currently not the bottleneck - update stays <1ms at 100k. these become relevant when adding game logic, AI, or collision.

#### collision detection

| technique          | description                               | expected gain          |
| ------------------ | ----------------------------------------- | ---------------------- |
| uniform grid       | spatial hash, O(1) neighbor lookup        | high for dense scenes  |
| quadtree           | adaptive spatial partitioning             | high for sparse scenes |
| broad/narrow phase | cheap AABB check before precise collision | moderate               |

#### update loop

| technique        | description                                     | expected gain       |
| ---------------- | ----------------------------------------------- | ------------------- |
| SIMD (AVX2/SSE)  | vectorized position/velocity math               | 2-4x on update      |
| struct-of-arrays | cache-friendly memory layout for SIMD           | enables better SIMD |
| multithreading   | thread pool for parallel entity updates         | scales with cores   |
| fixed-point math | integer math, deterministic, potentially faster | minor-moderate      |

#### memory layout

| technique             | description                           | expected gain               |
| --------------------- | ------------------------------------- | --------------------------- |
| cache-friendly layout | hot data together, cold data separate | reduces cache misses        |
| entity pools          | pre-allocated, reusable entity slots  | reduces allocation overhead |
| component packing     | minimize struct padding               | better cache utilization    |

---

## testing methodology

1. set target entity count
2. run for 30+ seconds
3. record frame times (target: stable 16.7ms)
4. note when 60fps breaks
5. compare update_ms vs render_ms to identify bottleneck

see journal.txt for raw benchmark data.