lofivor optimization journal
============================

goal: maximize entity count at 60fps for a survivor-like game

baseline: individual drawCircle calls
-------------------------------------
technique: rl.drawCircle() per entity in loop
code: sandbox_main.zig:144-151
bottleneck: render-bound (update <1ms even at 30k entities)

benchmark1.log results (AMD Radeon):
- 60fps stable: ~4000 entities
- 60fps breaks: ~5000 entities (19.9ms frame)
- 10k entities: ~43ms frame
- 20k entities: ~77ms frame
- 25k entities: ~97ms frame

analysis: frame time scales linearly with entity count; each drawCircle()
issues its own GPU draw call.

---

optimization 1: texture blitting
--------------------------------
technique: pre-render circle to 16x16 texture, drawTexture() per entity
code: sandbox_main.zig:109-124, 170-177

benchmark2.log results:
- 60fps stable: ~50,000 entities
- 60fps breaks: ~52,000-55,000 entities (18-21ms frame)
- 23k entities: 16.7ms frame (still vsync-locked)
- 59k entities: 20.6ms frame

extended benchmark (benchmark3):
- 50k entities: 16.7ms (vsync-locked, briefly touches 19ms)
- 60k entities: 20.7ms
- 70k entities: 23.7ms
- 80k entities: 30.1ms
- 100k entities: 33-37ms (~30fps)

comparison to baseline:
- baseline broke 60fps at ~5,000 entities
- texture blitting breaks at ~50,000 entities
- ~10x improvement in entity ceiling

analysis: raylib batches texture draws internally when the same texture is
reused. each drawCircle() is a separate draw call, while drawTexture() with
a shared texture gets batched into far fewer GPU calls.

notes: render_ms stays ~16-18ms up to ~50k, then scales roughly linearly.
at 100k entities we're at ~30fps, which is still playable. the update loop
remains negligible (<0.6ms even at 100k).
---

optimization 2: rlgl quad batching
----------------------------------
technique: bypass drawTexture(), submit vertices directly via rlgl
code: sandbox_main.zig:175-197
- rl.gl.rlSetTexture() once
- rl.gl.rlBegin(rl_quads)
- loop: rlTexCoord2f + rlVertex2f for 4 vertices per entity
- rl.gl.rlEnd()

benchmark3.log results:
- 40k entities: 16.7ms (vsync-locked)
- 100k entities: 16.7-19.2ms (~55-60fps)

comparison to optimization 1:
- texture blitting: 100k @ 33-37ms (~30fps)
- rlgl batching: 100k @ 16.7-19ms (~55-60fps)
- ~2x improvement

total improvement from baseline:
- baseline: 60fps @ ~5k entities
- final: 60fps @ ~100k entities
- ~20x improvement overall

analysis: drawTexture() has per-call overhead (type conversions, batch
state checks). rlgl submits vertices directly to the GPU buffer. raylib's
internal batch (8192 vertices = ~2048 quads) auto-flushes, so 100k
entities = ~49 draw calls instead of 100k drawTexture() calls with their
per-call overhead.

---

hardware comparison: i5-6500T (integrated graphics)
---------------------------------------------------
system: Intel i5-6500T with Intel HD Graphics 530
baseline version: 0.3.0 (quad batching optimized)

benchmark.log results:
- 30k entities: 16.7ms mostly, occasional spikes to 20-27ms
- 40k entities: 16.7ms stable
- 50k entities: 16.7ms mostly, breaks to 20ms occasionally
- 60k entities: 47ms spike, settles ~20ms (breaks 60fps)
- 100k entities: ~31ms frame
- 200k entities: ~57ms frame (render=48.8ms, update=8.0ms)

comparison to AMD Radeon (optimization 2 results):
- AMD Radeon: 60fps stable @ ~100k entities
- i5-6500T HD 530: 60fps stable @ ~50k entities
- ~2x difference due to integrated vs discrete GPU

notes: the update loop scales well on both (update_ms < 10ms even at
200k); the bottleneck is purely GPU-bound. integrated graphics has fewer
shader units and less memory bandwidth, which explains the ~2x entity
ceiling difference.
---

optimization 3: increased batch buffer
--------------------------------------
technique: increase raylib batch buffer from 8192 to 32768 vertices
code: sandbox_main.zig:150-156
version: 0.3.1

benchmark_0.3.1.log results (i5-6500T / HD 530):
- 60fps stable: ~140k entities
- 140k entities: 16.7ms (vsync-locked)
- 170k entities: 19.3ms (breaks 60fps)
- 190k entities: 18.8ms
- 240k entities: 23ms (benchmark exit)

comparison to 0.3.0 on same hardware:
- 0.3.0: 60fps breaks at ~64k entities
- 0.3.1: 60fps stable at ~140k entities
- ~2x improvement from larger batch buffer

analysis: fewer GPU flushes per frame. the default buffer (8192 verts =
2048 quads) means 100k entities = ~49 flushes; the larger buffer (32768
verts = 8192 quads) means 100k entities = ~12 flushes. less driver
overhead per frame.

---

optimization 4: GPU instancing (tested, minimal gain)
-----------------------------------------------------
technique: drawMeshInstanced() with per-entity 4x4 transform matrices
code: sandbox_main.zig:173-192, 245-257
version: 0.4.0

benchmark_0.4.0.log results (i5-6500T / HD 530):
- 60fps stable: ~150k entities (with occasional spikes)
- 150k entities: 16.7ms (vsync-locked, spiky)
- 190k entities: 18-19ms
- 240k entities: 23ms
- 270k entities: 28ms (benchmark exit)

comparison to 0.3.1 (batch buffer) on same hardware:
- 0.3.1: 60fps @ ~140k, exits @ ~240k
- 0.4.0: 60fps @ ~150k, exits @ ~270k
- negligible improvement (~7% more entities)

analysis: GPU instancing didn't help much on integrated graphics because:
1. shared system RAM means no PCIe transfer savings (discrete GPUs benefit here)
2. 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
3. the bottleneck is memory bandwidth, not draw call overhead
4. rlgl batching already minimizes draw calls effectively

conclusion: for integrated graphics, rlgl quad batching is already
near-optimal. GPU instancing shines on discrete GPUs where PCIe transfer
is the bottleneck.
keep both paths available for testing on different hardware.

---

optimization 5: SSBO instance data
----------------------------------
technique: pack entity data into 12-byte struct (x, y, color), upload via SSBO
code: ssbo_renderer.zig, shaders/entity.vert, shaders/entity.frag
version: ssbo branch

struct GpuEntity {
    x: f32,     // 4 bytes
    y: f32,     // 4 bytes
    color: u32, // 4 bytes (0xRRGGBB)
}; // = 12 bytes total

vs GPU instancing: 64-byte Matrix per entity
vs rlgl batching: ~80 bytes per entity (4 vertices * 5 floats * 4 bytes)

benchmark_ssbo.log results (i5-6500T / HD 530):
- 200k entities: 16.7ms (60fps)
- 350k entities: 16.7ms (60fps)
- 450k entities: 16.7ms (60fps)
- 550k entities: 16.7ms (60fps)
- 700k entities: 16.7ms (60fps)
- 950k entities: 17.5-18.9ms (~53-57fps, benchmark exit)

comparison to previous optimizations on same hardware:
- 0.3.1 (batch buffer): 60fps @ ~140k entities
- 0.4.0 (GPU instancing): 60fps @ ~150k entities
- SSBO: 60fps @ ~700k entities
- ~5x improvement over previous best!

analysis: SSBO massively reduces per-entity bandwidth:
- 12 bytes vs 64 bytes (matrices) = 5.3x less data
- 12 bytes vs 80 bytes (rlgl) = 6.7x less data
- single instanced draw call, no CPU-side transform calculations
- shader does NDC conversion and color unpacking on GPU

gotcha found: raylib's rlSetUniformSampler() doesn't work with custom GL
code. use rlSetUniform() with RL_SHADER_UNIFORM_INT instead. see
docs/raylib_rlSetUniformSampler_bug.md for details.

total improvement from baseline:
- baseline: 60fps @ ~5k entities (individual drawCircle)
- SSBO: 60fps @ ~700k entities
- ~140x improvement overall!
---

optimization 6: compute shader updates
--------------------------------------
technique: move entity position + respawn logic from CPU to GPU compute shader
code: compute.zig, shaders/entity_update.comp, ssbo_renderer.zig
version: 0.7.0

struct GpuEntity {
    x: f32,          // 4 bytes
    y: f32,          // 4 bytes
    packed_vel: i32, // 4 bytes (vx/vy in fixed-point 8.8)
    color: u32,      // 4 bytes
}; // = 16 bytes total (was 12)

changes:
- entity_update.comp: position update, center check, edge respawn, velocity calc
- GPU RNG: PCG-style PRNG seeded with entity id + frame number
- ssbo_renderer: renderComputeMode() only uploads NEW entities (when count grows)
- CPU update loop skipped entirely when compute enabled

benchmark results (i5-6500T / HD 530):
- update time: ~5ms → ~0ms at 150k entities
- render time unchanged (GPU-bound as before)
- total frame time improvement at high entity counts

analysis: the CPU was doing ~150k position updates + distance checks +
respawn logic per frame. now the GPU does that work in parallel via
256-thread workgroups. the CPU only uploads new entities when the user
adds them, not every frame. a memory barrier ensures compute writes are
visible to the vertex shader before the draw.

flags:
- --compute: GPU compute updates (now default)
- --cpu: fallback to CPU update path for comparison