diff --git a/OPTIMIZATIONS.md b/OPTIMIZATIONS.md
new file mode 100644
index 0000000..d3d4a63
--- /dev/null
+++ b/OPTIMIZATIONS.md
@@ -0,0 +1,101 @@
+# lofivor optimizations
+
+organized by performance goal. see journal.txt for detailed benchmarks.
+
+## current ceiling
+
+- **100k entities @ 60fps** (AMD Radeon)
+- **50k entities @ 60fps** (i5-6500T integrated)
+- bottleneck: GPU-bound (update loop stays <1ms even at 100k)
+
+---
+
+## completed optimizations
+
+### rendering pipeline (GPU)
+
+#### baseline: individual drawCircle
+
+- technique: `rl.drawCircle()` per entity
+- result: ~5k entities @ 60fps
+- problem: each call = separate GPU draw call
+
+#### optimization 1: texture blitting
+
+- technique: pre-render circle to 16x16 texture, `drawTexture()` per entity
+- result: ~50k entities @ 60fps
+- improvement: **10x** over baseline
+- why it works: raylib batches same-texture draws internally
+
+#### optimization 2: rlgl quad batching
+
+- technique: bypass `drawTexture()`, submit vertices directly via `rl.gl`
+- result: ~100k entities @ 60fps
+- improvement: **2x** over texture blitting, **20x** total
+- why it works: eliminates per-call overhead, vertices go straight to GPU buffer
+
+---
+
+## future optimizations
+
+### milestone: push GPU ceiling higher
+
+these target the rendering bottleneck since update loop is already fast.
+
+| technique | description | expected gain |
+| ---------------------- | -------------------------------------------------------------------- | ------------- |
+| increase batch buffer | raylib default is 8192 vertices (2048 quads). larger = fewer flushes | moderate |
+| GPU instancing | single draw call for all entities, GPU handles transforms | significant |
+| compute shader updates | move entity positions to GPU entirely | significant |
+| OpenGL vs Vulkan | test raylib's Vulkan backend | unknown |
+
+#### rendering culling
+
+| technique | description | expected gain |
+| ------------------ | ---------------------------------------- | ---------------------- |
+| frustum culling | skip entities outside view | depends on game design |
+| LOD rendering | reduce detail for distant/small entities | moderate |
+| temporal rendering | update/render subset per frame | moderate |
+
+---
+
+### milestone: push CPU ceiling (when it becomes the bottleneck)
+
+currently not the bottleneck - update stays <1ms at 100k. these become relevant when adding game logic, AI, or collision.
+
+#### collision detection
+
+| technique | description | expected gain |
+| ------------------ | ----------------------------------------- | ---------------------- |
+| uniform grid | spatial hash, O(1) neighbor lookup | high for dense scenes |
+| quadtree | adaptive spatial partitioning | high for sparse scenes |
+| broad/narrow phase | cheap AABB check before precise collision | moderate |
+
+#### update loop
+
+| technique | description | expected gain |
+| ---------------- | ----------------------------------------------- | ------------------- |
+| SIMD (AVX2/SSE) | vectorized position/velocity math | 2-4x on update |
+| struct-of-arrays | cache-friendly memory layout for SIMD | enables better SIMD |
+| multithreading | thread pool for parallel entity updates | scales with cores |
+| fixed-point math | integer math, deterministic, potentially faster | minor-moderate |
+
+#### memory layout
+
+| technique | description | expected gain |
+| --------------------- | ------------------------------------- | --------------------------- |
+| cache-friendly layout | hot data together, cold data separate | reduces cache misses |
+| entity pools | pre-allocated, reusable entity slots | reduces allocation overhead |
+| component packing | minimize struct padding | better cache utilization |
+
+---
+
+## testing methodology
+
+1. set target entity count
+2. run for 30+ seconds
+3. record frame times (target: stable 16.7ms)
+4. note when 60fps breaks
+5. compare update_ms vs render_ms to identify bottleneck
+
+see journal.txt for raw benchmark data.