Update docs with ssbo discovery
parent 64760d0a35
commit 91a60ba632
3 changed files with 121 additions and 10 deletions
@@ -4,9 +4,9 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 
 ## current ceiling
 
-- **~150k entities @ 60fps** (i5-6500T / HD 530 integrated)
-- **~260k entities @ 60fps** (AMD Radeon discrete)
-- bottleneck: GPU-bound (update loop stays <1ms even at 200k+)
+- **~700k entities @ 60fps** (i5-6500T / HD 530 integrated, SSBO)
+- **~950k entities @ ~57fps** (i5-6500T / HD 530 integrated, SSBO)
+- bottleneck: GPU-bound (update loop stays <5ms even at 950k)
 
 ---
 
@@ -53,6 +53,25 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 - rlgl batching already minimizes draw calls effectively
 - note: may help more on discrete GPUs with dedicated VRAM
 
+#### optimization 5: SSBO instance data
+
+- technique: pack entity data (x, y, color) into 12-byte struct, upload via SSBO
+- result: **~700k entities @ 60fps** (i5-6500T / HD 530)
+- improvement: **~5x** over previous best, **~140x** total from baseline
+- comparison:
+  - batch buffer (0.3.1): 60fps @ ~140k
+  - GPU instancing (0.4.0): 60fps @ ~150k
+  - SSBO: 60fps @ ~700k, ~57fps @ 950k
+- why it works:
+  - 12 bytes vs 64 bytes (matrices) = 5.3x less bandwidth
+  - 12 bytes vs 80 bytes (rlgl vertices) = 6.7x less bandwidth
+  - no CPU-side matrix calculations
+  - GPU does NDC conversion and color unpacking
+- implementation notes:
+  - custom vertex shader reads from SSBO using `gl_InstanceID`
+  - single `rlDrawVertexArrayInstanced()` call for all entities
+  - gotcha: don't use `rlSetUniformSampler()` for custom GL code - use `rlSetUniform()` with int type instead (see `docs/raylib_rlSetUniformSampler_bug.md`)
+
 ---
 
 ## future optimizations
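
The bandwidth ratios quoted for optimization 5 follow directly from the record sizes. The sketch below is a hypothetical C mirror of the 12-byte record (the project's renderer is written in Zig and is not shown in this diff); the helper names `bandwidth_ratio` and `frame_upload_bytes` are made up for illustration, and it assumes a 4-byte `float`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C mirror of the 12-byte (x, y, color) record.
   Field names follow the struct sketched in journal.txt. */
typedef struct {
    float    x;      /* 4 bytes */
    float    y;      /* 4 bytes */
    uint32_t color;  /* 4 bytes, 0xRRGGBB */
} GpuEntity;

/* Fail the build if padding or a wider float ever changes the layout. */
_Static_assert(sizeof(GpuEntity) == 12, "GpuEntity must stay 12 bytes");

/* Per-entity sizes of the two older paths, as stated in the notes. */
#define MATRIX_BYTES (16 * sizeof(float))    /* 4x4 float matrix = 64 */
#define QUAD_BYTES   (4 * 5 * sizeof(float)) /* 4 vertices * 5 floats = 80 */

/* How many times larger `other_bytes` is than one GpuEntity. */
static double bandwidth_ratio(size_t other_bytes) {
    return (double)other_bytes / (double)sizeof(GpuEntity);
}

/* Bytes uploaded per frame for `count` entities of `bytes_each` bytes. */
static size_t frame_upload_bytes(size_t count, size_t bytes_each) {
    return count * bytes_each;
}
```

Under these assumptions, `frame_upload_bytes(700000, sizeof(GpuEntity))` is 8,400,000 bytes per frame, against 44,800,000 for 64-byte matrices and 56,000,000 for 80-byte quads.
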
@@ -63,7 +82,7 @@ these target the rendering bottleneck since update loop is already fast.
 
 | technique | description | expected gain |
 | ---------------------- | -------------------------------------------------------------------- | ------------------------------- |
-| SSBO instance data | pack (x, y, color) = 12 bytes instead of 64-byte matrices | moderate (less bandwidth) |
+| ~~SSBO instance data~~ | ~~pack (x, y, color) = 12 bytes instead of 64-byte matrices~~ | **done** - see optimization 5 |
 | compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync | significant |
 | OpenGL vs Vulkan | test raylib's Vulkan backend | unknown |
 | discrete GPU testing | test on dedicated GPU where instancing/SSBO shine | significant (different hw) |
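
The optimization 5 notes say the GPU does the NDC conversion and color unpacking, but the shader source is not part of this diff. As a rough CPU-side illustration of that arithmetic, here is a minimal C sketch. It assumes positions are stored in pixels with a top-left origin (so y is flipped) and that color is packed as 0xRRGGBB; the type and function names are hypothetical, and the real `shaders/entity.vert` may differ.

```c
#include <stdint.h>

typedef struct { float r, g, b; } Rgb;   /* hypothetical helper type */
typedef struct { float x, y; } Vec2f;    /* hypothetical helper type */

/* Pack 8-bit channels into 0xRRGGBB, the layout noted in journal.txt. */
static uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b) {
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Unpack 0xRRGGBB into floats in [0, 1]. */
static Rgb unpack_rgb(uint32_t c) {
    Rgb out;
    out.r = (float)((c >> 16) & 0xFFu) / 255.0f;
    out.g = (float)((c >> 8) & 0xFFu) / 255.0f;
    out.b = (float)(c & 0xFFu) / 255.0f;
    return out;
}

/* Map pixel coordinates to NDC in [-1, 1].
   ASSUMPTION: origin is top-left with y pointing down, so y is flipped. */
static Vec2f pixel_to_ndc(float px, float py, float screen_w, float screen_h) {
    Vec2f out;
    out.x = (px / screen_w) * 2.0f - 1.0f;
    out.y = 1.0f - (py / screen_h) * 2.0f;
    return out;
}
```

With an 800x600 window, `pixel_to_ndc(400, 300, 800, 600)` lands on the NDC origin, and the top-left pixel maps to (-1, 1).
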
TODO.md (2 lines changed)
@@ -58,7 +58,7 @@ further options (if needed):
 
 - [x] increase raylib batch buffer (currently 8192 vertices = 2048 quads)
 - [x] GPU instancing (single draw call for all entities)
-- [ ] SSBO instance data (12 bytes vs 64-byte matrices)
+- [x] SSBO instance data (12 bytes vs 64-byte matrices)
 - [ ] compute shader entity updates (if raylib supports)
 - [ ] compare OpenGL vs Vulkan backend
 
journal.txt (102 lines changed)
@@ -108,9 +108,101 @@ difference.
 
 ---
 
-optimization 3: [pending]
--------------------------
-technique:
-results:
-notes:
+optimization 3: increased batch buffer
+--------------------------------------
+technique: increase raylib batch buffer from 8192 to 32768 vertices
+code: sandbox_main.zig:150-156
+version: 0.3.1
 
+benchmark_0.3.1.log results (i5-6500T / HD 530):
+- 60fps stable: ~140k entities
+- 140k entities: 16.7ms (vsync-locked)
+- 170k entities: 19.3ms (breaks 60fps)
+- 190k entities: 18.8ms
+- 240k entities: 23ms (benchmark exit)
+
+comparison to 0.3.0 on same hardware:
+- 0.3.0: 60fps breaks at ~64k entities
+- 0.3.1: 60fps stable at ~140k entities
+- ~2x improvement from larger batch buffer
+
+analysis: fewer GPU flushes per frame. default buffer (8192 verts = 2048 quads)
+means 100k entities = ~49 flushes. larger buffer (32768 verts = 8192 quads)
+means 100k entities = ~12 flushes. less driver overhead per frame.
+
+---
+
+optimization 4: GPU instancing (tested, minimal gain)
+-----------------------------------------------------
+technique: drawMeshInstanced() with per-entity 4x4 transform matrices
+code: sandbox_main.zig:173-192, 245-257
+version: 0.4.0
+
+benchmark_0.4.0.log results (i5-6500T / HD 530):
+- 60fps stable: ~150k entities (with occasional spikes)
+- 150k entities: 16.7ms (vsync-locked, spiky)
+- 190k entities: 18-19ms
+- 240k entities: 23ms
+- 270k entities: 28ms (benchmark exit)
+
+comparison to 0.3.1 (batch buffer) on same hardware:
+- 0.3.1: 60fps @ ~140k, exits @ ~240k
+- 0.4.0: 60fps @ ~150k, exits @ ~270k
+- negligible improvement (~7% more entities)
+
+analysis: GPU instancing didn't help much on integrated graphics because:
+1. shared system RAM means no PCIe transfer savings (discrete GPUs benefit here)
+2. 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
+3. bottleneck is memory bandwidth, not draw call overhead
+4. rlgl batching already minimizes draw calls effectively
+
+conclusion: for integrated graphics, rlgl quad batching is already near-optimal.
+GPU instancing shines on discrete GPUs where PCIe transfer is the bottleneck.
+keep both paths available for testing on different hardware.
+
+---
+
+optimization 5: SSBO instance data
+----------------------------------
+technique: pack entity data into 12-byte struct (x, y, color), upload via SSBO
+code: ssbo_renderer.zig, shaders/entity.vert, shaders/entity.frag
+version: ssbo branch
+
+struct GpuEntity {
+    x: f32, // 4 bytes
+    y: f32, // 4 bytes
+    color: u32, // 4 bytes (0xRRGGBB)
+}; // = 12 bytes total
+
+vs GPU instancing: 64-byte Matrix per entity
+vs rlgl batching: ~80 bytes per entity (4 vertices * 5 floats * 4 bytes)
+
+benchmark_ssbo.log results (i5-6500T / HD 530):
+- 200k entities: 16.7ms (60fps)
+- 350k entities: 16.7ms (60fps)
+- 450k entities: 16.7ms (60fps)
+- 550k entities: 16.7ms (60fps)
+- 700k entities: 16.7ms (60fps)
+- 950k entities: 17.5-18.9ms (~53-57fps, benchmark exit)
+
+comparison to previous optimizations on same hardware:
+- 0.3.1 (batch buffer): 60fps @ ~140k entities
+- 0.4.0 (GPU instancing): 60fps @ ~150k entities
+- SSBO: 60fps @ ~700k entities
+- ~5x improvement over previous best!
+
+analysis: SSBO massively reduces per-entity bandwidth:
+- 12 bytes vs 64 bytes (matrices) = 5.3x less data
+- 12 bytes vs 80 bytes (rlgl) = 6.7x less data
+- single instanced draw call, no CPU-side transform calculations
+- shader does NDC conversion and color unpacking on GPU
+
+gotcha found: raylib's rlSetUniformSampler() doesn't work with custom GL code.
+use rlSetUniform() with RL_SHADER_UNIFORM_INT instead.
+see docs/raylib_rlSetUniformSampler_bug.md for details.
+
+total improvement from baseline:
+- baseline: 60fps @ ~5k entities (individual drawCircle)
+- SSBO: 60fps @ ~700k entities
+- ~140x improvement overall!
+
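
The flush counts in the optimization 3 analysis can be reproduced with a little arithmetic. The sketch below assumes one quad (4 vertices) per entity and that a flush happens only when the batch buffer fills, ignoring flushes caused by texture or state changes; the function names are hypothetical and are not part of raylib.

```c
/* Quads that fit in one batch buffer, at 4 vertices per quad. */
static unsigned quads_per_batch(unsigned buffer_vertices) {
    return buffer_vertices / 4u;
}

/* Approximate buffer fills per frame for `entities` quads.
   Returned as a fraction to match the journal's rounded "~49" and "~12". */
static double flushes_per_frame(unsigned entities, unsigned buffer_vertices) {
    return (double)entities / (double)quads_per_batch(buffer_vertices);
}
```

For 100k entities this gives roughly 48.8 fills with the default 8192-vertex buffer and roughly 12.2 with 32768 vertices, matching the journal's rounded figures.
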