# lofivor optimizations

organized by performance goal. see journal.txt for detailed benchmarks.

## current ceiling

- **~700k entities @ 60fps** (i5-6500T / HD 530 integrated, SSBO)
- **~950k entities @ ~57fps** (i5-6500T / HD 530 integrated, SSBO)
- bottleneck: GPU-bound (update loop stays <5ms even at 950k)

---

## completed optimizations

### rendering pipeline (GPU)

#### baseline: individual drawCircle

- technique: `rl.drawCircle()` per entity
- result: ~5k entities @ 60fps
- problem: each call = separate GPU draw call

#### optimization 1: texture blitting

- technique: pre-render circle to 16x16 texture, `drawTexture()` per entity
- result: ~50k entities @ 60fps
- improvement: **10x** over baseline
- why it works: raylib batches same-texture draws internally

#### optimization 2: rlgl quad batching

- technique: bypass `drawTexture()`, submit vertices directly via `rl.gl`
- result: ~100k entities @ 60fps
- improvement: **2x** over texture blitting, **20x** total
- why it works: eliminates per-call overhead, vertices go straight to GPU buffer

#### optimization 3: increased batch buffer

- technique: increase raylib batch buffer from 8192 to 32768 vertices
- result: ~140k entities @ 60fps (i5-6500T)
- improvement: **~40%** over default buffer
- why it works: fewer GPU flushes per frame

#### optimization 4: GPU instancing (tested, minimal gain)

- technique: `drawMeshInstanced()` with per-entity transform matrices
- result: ~150k entities @ 60fps (i5-6500T) - similar to rlgl batching
- improvement: **negligible** on integrated graphics
- why it didn't help:
  - integrated GPU shares system RAM (no PCIe transfer savings)
  - 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
  - bottleneck is memory bandwidth, not draw call overhead
  - rlgl batching already minimizes draw calls effectively
- note: may help more on discrete GPUs with dedicated VRAM

#### optimization 5: SSBO instance data

- technique: pack entity data (x, y, color) into 12-byte struct,
  upload via SSBO
- result: **~700k entities @ 60fps** (i5-6500T / HD 530)
- improvement: **~5x** over previous best, **~140x** total from baseline
- comparison:
  - batch buffer (0.3.1): 60fps @ ~140k
  - GPU instancing (0.4.0): 60fps @ ~150k
  - SSBO: 60fps @ ~700k, ~57fps @ 950k
- why it works:
  - 12 bytes vs 64 bytes (matrices) = 5.3x less bandwidth
  - 12 bytes vs 80 bytes (rlgl vertices) = 6.7x less bandwidth
  - no CPU-side matrix calculations
  - GPU does NDC conversion and color unpacking
- implementation notes:
  - custom vertex shader reads from SSBO using `gl_InstanceID`
  - single `rlDrawVertexArrayInstanced()` call for all entities
  - gotcha: don't use `rlSetUniformSampler()` for custom GL code - use `rlSetUniform()` with int type instead (see `docs/raylib_rlSetUniformSampler_bug.md`)

---

## future optimizations

### milestone: push GPU ceiling higher

these target the rendering bottleneck since update loop is already fast.

| technique              | description                                                   | expected gain                 |
| ---------------------- | ------------------------------------------------------------- | ----------------------------- |
| ~~SSBO instance data~~ | ~~pack (x, y, color) = 12 bytes instead of 64-byte matrices~~ | **done** - see optimization 5 |
| compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync     | significant                   |
| OpenGL vs Vulkan       | test raylib's Vulkan backend                                  | unknown                       |
| discrete GPU testing   | test on dedicated GPU where instancing/SSBO shine             | significant (different hw)    |

#### rendering culling

| technique          | description                              | expected gain          |
| ------------------ | ---------------------------------------- | ---------------------- |
| frustum culling    | skip entities outside view               | depends on game design |
| LOD rendering      | reduce detail for distant/small entities | moderate               |
| temporal rendering | update/render subset per frame           | moderate               |

---

### milestone: push CPU ceiling (when it becomes the bottleneck)

currently not the bottleneck - update stays <1ms at 100k.
these become relevant when adding game logic, AI, or collision.

#### collision detection

| technique          | description                               | expected gain          |
| ------------------ | ----------------------------------------- | ---------------------- |
| uniform grid       | spatial hash, O(1) neighbor lookup        | high for dense scenes  |
| quadtree           | adaptive spatial partitioning             | high for sparse scenes |
| broad/narrow phase | cheap AABB check before precise collision | moderate               |

#### update loop

| technique        | description                                     | expected gain       |
| ---------------- | ----------------------------------------------- | ------------------- |
| SIMD (AVX2/SSE)  | vectorized position/velocity math               | 2-4x on update      |
| struct-of-arrays | cache-friendly memory layout for SIMD           | enables better SIMD |
| multithreading   | thread pool for parallel entity updates         | scales with cores   |
| fixed-point math | integer math, deterministic, potentially faster | minor-moderate      |

#### memory layout

| technique             | description                           | expected gain               |
| --------------------- | ------------------------------------- | --------------------------- |
| cache-friendly layout | hot data together, cold data separate | reduces cache misses        |
| entity pools          | pre-allocated, reusable entity slots  | reduces allocation overhead |
| component packing     | minimize struct padding               | better cache utilization    |

---

## testing methodology

1. set target entity count
2. run for 30+ seconds
3. record frame times (target: stable 16.7ms)
4. note when 60fps breaks
5. compare update_ms vs render_ms to identify bottleneck

see journal.txt for raw benchmark data.
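
the methodology steps above could be sketched in C roughly as follows. this is a minimal illustration, not lofivor's actual benchmark code: `FrameSample`, `BenchSummary`, `summarize()`, `find_bottleneck()` and `holds_60fps()` are hypothetical names, and it assumes frame time is simply update_ms + render_ms (ignoring vsync wait and other overhead).

```c
#include <stddef.h>

/* 60fps leaves roughly 16.7 ms per frame. */
#define TARGET_FRAME_MS (1000.0 / 60.0)

/* Hypothetical per-frame sample; fields mirror update_ms / render_ms. */
typedef struct {
    double update_ms; /* CPU time spent in the update loop */
    double render_ms; /* time spent submitting and drawing the frame */
} FrameSample;

typedef enum { BOTTLENECK_UPDATE, BOTTLENECK_RENDER } Bottleneck;

typedef struct {
    double avg_update_ms;
    double avg_render_ms;
    size_t over_budget; /* frames where update + render exceeded the budget */
} BenchSummary;

/* Average the samples and count frames that missed the 60fps budget.
 * Assumption: frame time is approximated as update_ms + render_ms. */
BenchSummary summarize(const FrameSample *samples, size_t count)
{
    BenchSummary s = {0.0, 0.0, 0};
    if (count == 0) return s;

    for (size_t i = 0; i < count; i++) {
        s.avg_update_ms += samples[i].update_ms;
        s.avg_render_ms += samples[i].render_ms;
        if (samples[i].update_ms + samples[i].render_ms > TARGET_FRAME_MS)
            s.over_budget++;
    }
    s.avg_update_ms /= (double)count;
    s.avg_render_ms /= (double)count;
    return s;
}

/* Step 5: whichever side costs more on average is the bottleneck. */
Bottleneck find_bottleneck(const BenchSummary *s)
{
    return (s->avg_render_ms >= s->avg_update_ms) ? BOTTLENECK_RENDER
                                                  : BOTTLENECK_UPDATE;
}

/* Step 4: 60fps "holds" only if no recorded frame went over budget. */
int holds_60fps(const BenchSummary *s)
{
    return s->over_budget == 0;
}
```

in practice the samples would come from timers around the update and draw calls over a 30+ second run. requiring zero over-budget frames is a strict choice; a percentile threshold may suit noisy runs better.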