Update TODO and OPTIMIZATIONS with gpu discovery

Jared Miller 2025-12-16 06:06:43 -05:00
parent 1c0f552032
commit 9e926e0646
2 changed files with 38 additions and 11 deletions


@ -4,9 +4,9 @@ organized by performance goal. see journal.txt for detailed benchmarks.
## current ceiling ## current ceiling
- **100k entities @ 60fps** (AMD Radeon) - **~150k entities @ 60fps** (i5-6500T / HD 530 integrated)
- **50k entities @ 60fps** (i5-6500T integrated) - **~260k entities @ 60fps** (AMD Radeon discrete)
- bottleneck: GPU-bound (update loop stays <1ms even at 100k) - bottleneck: GPU-bound (update loop stays <1ms even at 200k+)
--- ---
@ -34,6 +34,25 @@ organized by performance goal. see journal.txt for detailed benchmarks.
- improvement: **2x** over texture blitting, **20x** total - improvement: **2x** over texture blitting, **20x** total
- why it works: eliminates per-call overhead, vertices go straight to GPU buffer - why it works: eliminates per-call overhead, vertices go straight to GPU buffer
#### optimization 3: increased batch buffer
- technique: increase raylib batch buffer from 8192 to 32768 vertices
- result: ~140k entities @ 60fps (i5-6500T)
- improvement: **~40%** over default buffer
- why it works: fewer GPU flushes per frame
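a rough sketch of the flush arithmetic behind this, not a measurement. it takes the buffer sizes quoted above at face value and assumes one quad (4 vertices) per entity and one flush each time the buffer fills; `flushes_per_frame` is a hypothetical helper, not a raylib call.

```python
import math


def flushes_per_frame(entities: int, buffer_vertices: int, vertices_per_entity: int = 4) -> int:
    """Estimate how many times a full batch buffer is flushed in one frame.

    Assumption: every entity is one quad and a flush happens only when
    the buffer is full (plus one final partial flush).
    """
    quads_per_batch = buffer_vertices // vertices_per_entity
    return math.ceil(entities / quads_per_batch)


# buffer sizes as quoted in the notes, entity count from the i5-6500T result
default_flushes = flushes_per_frame(140_000, 8192)
larger_flushes = flushes_per_frame(140_000, 32768)
print(default_flushes, larger_flushes)
```

under those assumptions the larger buffer needs roughly a quarter of the flushes per frame; how much frame time that saves still has to be measured.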
#### optimization 4: GPU instancing (tested, minimal gain)
- technique: `drawMeshInstanced()` with per-entity transform matrices
- result: ~150k entities @ 60fps (i5-6500T) - similar to rlgl batching
- improvement: **negligible** on integrated graphics
- why it didn't help:
- integrated GPU shares system RAM (no PCIe transfer savings)
- 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
- bottleneck is memory bandwidth, not draw call overhead
- rlgl batching already minimizes draw calls effectively
- note: may help more on discrete GPUs with dedicated VRAM
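a back-of-envelope check of the bandwidth comparison, using only the per-entity sizes quoted in these notes (~80 bytes of rlgl vertices, a 64-byte matrix, and the 12-byte packed layout listed under future optimizations). it counts only per-frame instance data uploaded at 60fps, not texture or framebuffer traffic; `upload_bytes_per_second` is a hypothetical helper.

```python
def upload_bytes_per_second(entities: int, bytes_per_entity: int, fps: int = 60) -> int:
    """Bytes of per-entity data sent to the GPU each second.

    Assumption: all entity data is re-uploaded every frame.
    """
    return entities * bytes_per_entity * fps


ENTITIES = 150_000  # roughly the i5-6500T result above

# per-entity sizes taken from the notes; the rlgl figure is approximate
layouts = {
    "rlgl vertices": 80,
    "instance matrix": 64,
    "packed (x, y, color)": 12,
}

for name, size in layouts.items():
    mb_per_s = upload_bytes_per_second(ENTITIES, size) / 1e6
    print(f"{name}: {mb_per_s:.0f} MB/s")
```

the matrix path moves about 80% of what rlgl batching moves, which fits the similar results; the 12-byte layout would cut the upload to about 15%.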
--- ---
## future optimizations ## future optimizations
@ -42,12 +61,12 @@ organized by performance goal. see journal.txt for detailed benchmarks.
these target the rendering bottleneck since update loop is already fast. these target the rendering bottleneck since update loop is already fast.
| technique | description | expected gain | | technique | description | expected gain |
| ---------------------- | -------------------------------------------------------------------- | ------------- | | ---------------------- | -------------------------------------------------------------------- | ------------------------------- |
| increase batch buffer | raylib default is 8192 vertices (2048 quads). larger = fewer flushes | moderate | | SSBO instance data | pack (x, y, color) = 12 bytes instead of 64-byte matrices | moderate (less bandwidth) |
| GPU instancing | single draw call for all entities, GPU handles transforms | significant | | compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync | significant |
| compute shader updates | move entity positions to GPU entirely | significant | | OpenGL vs Vulkan | test raylib's Vulkan backend | unknown |
| OpenGL vs Vulkan | test raylib's Vulkan backend | unknown | | discrete GPU testing | test on dedicated GPU where instancing/SSBO shine | significant (different hw) |
#### rendering culling #### rendering culling

12
TODO.md

@ -56,11 +56,19 @@ further options (if needed):
## phase 5: rendering experiments ## phase 5: rendering experiments
- [ ] increase raylib batch buffer (currently 8192 vertices = 2048 quads) - [x] increase raylib batch buffer (currently 8192 vertices = 2048 quads)
- [ ] GPU instancing (single draw call for all entities) - [x] GPU instancing (single draw call for all entities)
- [ ] SSBO instance data (12 bytes vs 64-byte matrices)
- [ ] compute shader entity updates (if raylib supports) - [ ] compute shader entity updates (if raylib supports)
- [ ] compare OpenGL vs Vulkan backend - [ ] compare OpenGL vs Vulkan backend
findings (i5-6500T / HD 530):
- batch buffer increase: ~140k @ 60fps (was ~100k)
- GPU instancing: ~150k @ 60fps - negligible gain over rlgl batching
- instancing doesn't help on integrated graphics (shared RAM, no PCIe savings)
- bottleneck is memory bandwidth, not draw call overhead
- rlgl batching is already near-optimal for this hardware
## future optimization concepts ## future optimization concepts
- [ ] SIMD entity updates (AVX2/SSE) - [ ] SIMD entity updates (AVX2/SSE)