lofivor optimization journal
============================

goal: maximize entity count at 60fps for survivor-like game

baseline: individual drawCircle calls
-------------------------------------
technique: rl.drawCircle() per entity in loop
code: sandbox_main.zig:144-151
bottleneck: render-bound (update <1ms even at 30k entities)

benchmark1.log results (AMD Radeon):
- 60fps stable: ~4000 entities
- 60fps breaks: ~5000 entities (19.9ms frame)
- 10k entities: ~43ms frame
- 20k entities: ~77ms frame
- 25k entities: ~97ms frame

analysis: linear scaling, each drawCircle = separate GPU draw call
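
a minimal sketch of the baseline loop, assuming raylib-zig bindings imported as `rl` (the `Entity` struct and radius/color are illustrative, not the actual sandbox types):

```zig
const rl = @import("raylib");

const Entity = struct { x: f32, y: f32 };

fn renderBaseline(entities: []const Entity) void {
    for (entities) |e| {
        // each call becomes its own GPU draw call -> linear scaling with count
        rl.drawCircle(@intFromFloat(e.x), @intFromFloat(e.y), 4.0, rl.Color.red);
    }
}
```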

---

optimization 1: texture blitting
--------------------------------
technique: pre-render circle to 16x16 texture, drawTexture() per entity
code: sandbox_main.zig:109-124, 170-177

benchmark2.log results:
- 60fps stable: ~50,000 entities
- 60fps breaks: ~52,000-55,000 entities (18-21ms frame)
- 23k entities: 16.7ms frame (still vsync-locked)
- 59k entities: 20.6ms frame

extended benchmark (benchmark3):
- 50k entities: 16.7ms (vsync-locked, briefly touches 19ms)
- 60k entities: 20.7ms
- 70k entities: 23.7ms
- 80k entities: 30.1ms
- 100k entities: 33-37ms (~30fps)

comparison to baseline:
- baseline broke 60fps at ~5,000 entities
- texture blitting breaks at ~50,000 entities
- ~10x improvement in entity ceiling

analysis: raylib batches texture draws internally when using the same texture.
individual drawCircle() = separate draw call each. drawTexture() with the same
texture = batched into fewer GPU calls.

notes: render_ms stays ~16-18ms up to ~50k, then scales roughly linearly.
at 100k entities we're at ~30fps, which is still playable. update loop
remains negligible (<0.6ms even at 100k).
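
the technique can be sketched like this, assuming raylib-zig bindings as `rl` and the same illustrative `Entity` struct (exact binding names may differ):

```zig
const rl = @import("raylib");

const Entity = struct { x: f32, y: f32 };

// pre-render a circle into a small render texture once, at startup
fn makeCircleTexture() rl.RenderTexture2D {
    const rt = rl.loadRenderTexture(16, 16);
    rl.beginTextureMode(rt);
    rl.clearBackground(rl.Color.blank);
    rl.drawCircle(8, 8, 7.0, rl.Color.red);
    rl.endTextureMode();
    return rt;
}

fn renderBlitted(tex: rl.Texture2D, entities: []const Entity) void {
    for (entities) |e| {
        // same texture every call -> raylib batches these internally
        rl.drawTexture(tex, @intFromFloat(e.x), @intFromFloat(e.y), rl.Color.white);
    }
}
```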

---

optimization 2: rlgl quad batching
----------------------------------
technique: bypass drawTexture(), submit vertices directly via rlgl
code: sandbox_main.zig:175-197
- rl.gl.rlSetTexture() once
- rl.gl.rlBegin(rl_quads)
- loop: rlTexCoord2f + rlVertex2f for 4 vertices per entity
- rl.gl.rlEnd()

benchmark3.log results:
- 40k entities: 16.7ms (vsync-locked)
- 100k entities: 16.7-19.2ms (~55-60fps)

comparison to optimization 1:
- texture blitting: 100k @ 33-37ms (~30fps)
- rlgl batching: 100k @ 16.7-19ms (~55-60fps)
- ~2x improvement

total improvement from baseline:
- baseline: 60fps @ ~5k entities
- final: 60fps @ ~100k entities
- ~20x improvement overall

analysis: drawTexture() has per-call overhead (type conversions, batch state
checks). rlgl submits vertices directly to the GPU buffer. raylib's internal
batch (8192 vertices = ~2048 quads) auto-flushes, so 100k entities = ~49 draw
calls vs 100k drawTexture calls with their overhead.
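
the four bullets above can be sketched as one loop, assuming raylib-zig's `rl.gl` namespace for rlgl (constant and binding names are illustrative):

```zig
const rl = @import("raylib");

const Entity = struct { x: f32, y: f32 };

fn renderRlglQuads(tex: rl.Texture2D, entities: []const Entity, size: f32) void {
    rl.gl.rlSetTexture(tex.id); // bind once for the whole batch
    rl.gl.rlBegin(rl.gl.rl_quads);
    for (entities) |e| {
        // 4 corners per entity, full texture mapped onto the quad
        rl.gl.rlTexCoord2f(0, 0); rl.gl.rlVertex2f(e.x, e.y);
        rl.gl.rlTexCoord2f(0, 1); rl.gl.rlVertex2f(e.x, e.y + size);
        rl.gl.rlTexCoord2f(1, 1); rl.gl.rlVertex2f(e.x + size, e.y + size);
        rl.gl.rlTexCoord2f(1, 0); rl.gl.rlVertex2f(e.x + size, e.y);
    }
    rl.gl.rlEnd();
    rl.gl.rlSetTexture(0);
}
```

vertices still go through raylib's internal batch, so flushes happen every ~2048 quads rather than per entity.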

---

hardware comparison: i5-6500T (integrated graphics)
---------------------------------------------------
system: Intel i5-6500T with Intel HD Graphics 530
baseline version: 0.3.0 (quad batching optimized)

benchmark.log results:
- 30k entities: 16.7ms mostly, occasional spikes to 20-27ms
- 40k entities: 16.7ms stable
- 50k entities: 16.7ms mostly, breaks to 20ms occasionally
- 60k entities: 47ms spike, settles ~20ms (breaks 60fps)
- 100k entities: ~31ms frame
- 200k entities: ~57ms frame (render=48.8ms, update=8.0ms)

comparison to AMD Radeon (optimization 2 results):
- AMD Radeon: 60fps stable @ ~100k entities
- i5-6500T HD 530: 60fps stable @ ~50k entities
- ~2x difference due to integrated vs discrete GPU

notes: update loop scales well on both (update_ms < 10ms even at 200k).
the bottleneck is purely the GPU. integrated graphics has fewer shader
units and less memory bandwidth, explaining the ~2x entity ceiling
difference.

---

optimization 3: increased batch buffer
--------------------------------------
technique: increase raylib batch buffer from 8192 to 32768 vertices
code: sandbox_main.zig:150-156
version: 0.3.1

benchmark_0.3.1.log results (i5-6500T / HD 530):
- 60fps stable: ~140k entities
- 140k entities: 16.7ms (vsync-locked)
- 170k entities: 19.3ms (breaks 60fps)
- 190k entities: 18.8ms
- 240k entities: 23ms (benchmark exit)

comparison to 0.3.0 on same hardware:
- 0.3.0: 60fps breaks at ~64k entities
- 0.3.1: 60fps stable at ~140k entities
- ~2x improvement from larger batch buffer

analysis: fewer GPU flushes per frame. default buffer (8192 verts = 2048 quads)
means 100k entities = ~49 flushes. larger buffer (32768 verts = 8192 quads)
means 100k entities = ~12 flushes. less driver overhead per frame.
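
one way to get the larger batch without recompiling raylib is a custom render batch via rlgl. a sketch, assuming raylib-zig exposes rlgl's render-batch API under `rl.gl` (rlLoadRenderBatch / rlSetRenderBatchActive / rlUnloadRenderBatch exist in rlgl.h; the alternative is rebuilding raylib with RL_DEFAULT_BATCH_BUFFER_ELEMENTS set to 32768):

```zig
const rl = @import("raylib");

var big_batch: rl.gl.rlRenderBatch = undefined;

fn initBigBatch() void {
    // 1 buffer of 32768 vertex slots (~8192 quads) instead of the default 8192
    big_batch = rl.gl.rlLoadRenderBatch(1, 32768);
    rl.gl.rlSetRenderBatchActive(&big_batch);
}

fn deinitBigBatch() void {
    // passing null restores raylib's default internal batch
    rl.gl.rlSetRenderBatchActive(null);
    rl.gl.rlUnloadRenderBatch(big_batch);
}
```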

---

optimization 4: GPU instancing (tested, minimal gain)
-----------------------------------------------------
technique: drawMeshInstanced() with per-entity 4x4 transform matrices
code: sandbox_main.zig:173-192, 245-257
version: 0.4.0

benchmark_0.4.0.log results (i5-6500T / HD 530):
- 60fps stable: ~150k entities (with occasional spikes)
- 150k entities: 16.7ms (vsync-locked, spiky)
- 190k entities: 18-19ms
- 240k entities: 23ms
- 270k entities: 28ms (benchmark exit)

comparison to 0.3.1 (batch buffer) on same hardware:
- 0.3.1: 60fps @ ~140k, exits @ ~240k
- 0.4.0: 60fps @ ~150k, exits @ ~270k
- negligible improvement (~7% more entities)

analysis: GPU instancing didn't help much on integrated graphics because:
1. shared system RAM means no PCIe transfer savings (discrete GPUs benefit here)
2. 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
3. bottleneck is memory bandwidth, not draw call overhead
4. rlgl batching already minimizes draw calls effectively

conclusion: for integrated graphics, rlgl quad batching is already near-optimal.
GPU instancing shines on discrete GPUs where PCIe transfer is the bottleneck.
keep both paths available for testing on different hardware.
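
the instanced path, sketched against raylib's DrawMeshInstanced (the zig binding names, `Entity` struct, and `matrixTranslate` helper are illustrative; the material needs an instancing-capable shader):

```zig
const rl = @import("raylib");

const Entity = struct { x: f32, y: f32 };

fn renderInstanced(
    mesh: rl.Mesh,
    material: rl.Material, // shader must read instance transforms
    transforms: []rl.Matrix, // scratch buffer, one Matrix per entity
    entities: []const Entity,
) void {
    for (entities, 0..) |e, i| {
        // translation-only transform: 64 bytes per entity on the wire
        transforms[i] = rl.math.matrixTranslate(e.x, e.y, 0);
    }
    // single draw call for all instances
    rl.drawMeshInstanced(mesh, material, transforms[0..entities.len]);
}
```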

---

optimization 5: SSBO instance data
----------------------------------
technique: pack entity data into 12-byte struct (x, y, color), upload via SSBO
code: ssbo_renderer.zig, shaders/entity.vert, shaders/entity.frag
version: ssbo branch

struct GpuEntity {
    x: f32,     // 4 bytes
    y: f32,     // 4 bytes
    color: u32, // 4 bytes (0xRRGGBB)
};              // = 12 bytes total

vs GPU instancing: 64-byte Matrix per entity
vs rlgl batching: ~80 bytes per entity (4 vertices * 5 floats * 4 bytes)

benchmark_ssbo.log results (i5-6500T / HD 530):
- 200k entities: 16.7ms (60fps)
- 350k entities: 16.7ms (60fps)
- 450k entities: 16.7ms (60fps)
- 550k entities: 16.7ms (60fps)
- 700k entities: 16.7ms (60fps)
- 950k entities: 17.5-18.9ms (~53-57fps, benchmark exit)

comparison to previous optimizations on same hardware:
- 0.3.1 (batch buffer): 60fps @ ~140k entities
- 0.4.0 (GPU instancing): 60fps @ ~150k entities
- SSBO: 60fps @ ~700k entities
- ~5x improvement over previous best!

analysis: SSBO massively reduces per-entity bandwidth:
- 12 bytes vs 64 bytes (matrices) = 5.3x less data
- 12 bytes vs 80 bytes (rlgl) = 6.7x less data
- single instanced draw call, no CPU-side transform calculations
- shader does NDC conversion and color unpacking on GPU

gotcha found: raylib's rlSetUniformSampler() doesn't work with custom GL code.
use rlSetUniform() with RL_SHADER_UNIFORM_INT instead.
see docs/raylib_rlSetUniformSampler_bug.md for details.

total improvement from baseline:
- baseline: 60fps @ ~5k entities (individual drawCircle)
- SSBO: 60fps @ ~700k entities
- ~140x improvement overall!
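
the upload side can be sketched with rlgl's shader-buffer calls (rlLoadShaderBuffer / rlBindShaderBuffer exist in rlgl.h; the zig-side names, the binding index, and the usage-hint constant here are assumptions, not the actual ssbo_renderer.zig code):

```zig
const rl = @import("raylib");

const GpuEntity = extern struct {
    x: f32,
    y: f32,
    color: u32, // 0xRRGGBB, unpacked in the vertex shader
};

const GL_DYNAMIC_COPY: i32 = 0x88EA; // raw GL usage-hint enum

fn uploadEntities(entities: []const GpuEntity) u32 {
    const size: u32 = @intCast(entities.len * @sizeOf(GpuEntity));
    // allocate + fill the SSBO in one call
    const ssbo = rl.gl.rlLoadShaderBuffer(size, entities.ptr, GL_DYNAMIC_COPY);
    // binding point 0, matching the layout(binding = 0) block in entity.vert
    rl.gl.rlBindShaderBuffer(ssbo, 0);
    return ssbo;
}
```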

---

optimization 6: compute shader updates
--------------------------------------
technique: move entity position + respawn logic from CPU to GPU compute shader
code: compute.zig, shaders/entity_update.comp, ssbo_renderer.zig
version: 0.7.0

struct GpuEntity {
    x: f32,          // 4 bytes
    y: f32,          // 4 bytes
    packed_vel: i32, // 4 bytes (vx/vy in fixed-point 8.8)
    color: u32,      // 4 bytes
};                   // = 16 bytes total (was 12)

changes:
- entity_update.comp: position update, center check, edge respawn, velocity calc
- GPU RNG: PCG-style PRNG seeded with entity id + frame number
- ssbo_renderer: renderComputeMode() only uploads NEW entities (when count grows)
- CPU update loop skipped entirely when compute enabled

benchmark results (i5-6500T / HD 530):
- update time: ~5ms → ~0ms at 150k entities
- render time unchanged (GPU-bound as before)
- total frame time improvement at high entity counts

analysis: CPU was doing ~150k position updates + distance checks + respawn logic
per frame. now the GPU does it in parallel via 256-thread workgroups. CPU only
uploads new entities when the user adds them, not per-frame. a memory barrier
ensures compute writes are visible to the vertex shader before the draw.

flags:
- --compute: GPU compute updates (now default)
- --cpu: fallback to CPU update path for comparison
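
the per-frame dispatch can be sketched with rlgl's compute API (rlEnableShader / rlBindShaderBuffer / rlComputeShaderDispatch exist in rlgl.h; the zig-side names and binding index are assumptions, and group size 256 matches the workgroup size described above):

```zig
const rl = @import("raylib");

fn dispatchEntityUpdate(program: u32, ssbo: u32, entity_count: u32) void {
    rl.gl.rlEnableShader(program); // compute program from rlLoadComputeShaderProgram
    rl.gl.rlBindShaderBuffer(ssbo, 0);
    // round up so a partial workgroup still covers the tail entities
    const groups = (entity_count + 255) / 256;
    rl.gl.rlComputeShaderDispatch(groups, 1, 1);
    rl.gl.rlDisableShader();
    // a glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) is still needed before
    // the draw; rlgl doesn't wrap it, so it has to come from the GL loader.
}
```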
|