Update TODO and OPTIMIZATIONS with gpu discovery

2025-12-16 06:06:43 -05:00
commit 9e926e0646
parent 1c0f552032
2 changed files with 38 additions and 11 deletions
--- a/OPTIMIZATIONS.md
+++ b/OPTIMIZATIONS.md
@@ -4,9 +4,9 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 ## current ceiling
- **100k entities @ 60fps** (AMD Radeon)
+- **~150k entities @ 60fps** (i5-6500T / HD 530 integrated)
- **50k entities @ 60fps** (i5-6500T integrated)
+- **~260k entities @ 60fps** (AMD Radeon discrete)
- bottleneck: GPU-bound (update loop stays <1ms even at 100k)
+- bottleneck: GPU-bound (update loop stays <1ms even at 200k+)
 ---
@@ -34,6 +34,25 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 - improvement: **2x** over texture blitting, **20x** total
 - why it works: eliminates per-call overhead, vertices go straight to GPU buffer
 #### optimization 3: increased batch buffer
 - technique: increase raylib batch buffer from 8192 to 32768 vertices
 - result: ~140k entities @ 60fps (i5-6500T)
 - improvement: **~40%** over default buffer
 - why it works: fewer GPU flushes per frame
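
A rough sketch of the "fewer GPU flushes" arithmetic behind optimization 3. It assumes one quad per entity and one flush each time the batch buffer fills; both are simplifications for illustration, not measurements from this commit. The function name `flushes_per_frame` is hypothetical.

```python
import math

VERTICES_PER_QUAD = 4  # from the notes: 8192 vertices = 2048 quads


def flushes_per_frame(entities: int, buffer_vertices: int) -> int:
    """Estimate batch flushes per frame, assuming one quad per entity
    and a flush only when the batch buffer is full."""
    quads_per_batch = buffer_vertices // VERTICES_PER_QUAD
    return math.ceil(entities / quads_per_batch)


# Compare the default buffer with the enlarged one at ~140k entities.
for buffer_vertices in (8192, 32768):
    count = flushes_per_frame(140_000, buffer_vertices)
    print(f"{buffer_vertices} vertices: {count} flushes/frame")
```

Under these assumptions the larger buffer cuts the flush count to roughly a quarter, which is in line with the note above about fewer flushes per frame.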
 #### optimization 4: GPU instancing (tested, minimal gain)
 - technique: `drawMeshInstanced()` with per-entity transform matrices
 - result: ~150k entities @ 60fps (i5-6500T) - similar to rlgl batching
 - improvement: **negligible** on integrated graphics
 - why it didn't help:
  - integrated GPU shares system RAM (no PCIe transfer savings)
  - 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
  - bottleneck is memory bandwidth, not draw call overhead
  - rlgl batching already minimizes draw calls effectively
 - note: may help more on discrete GPUs with dedicated VRAM
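
A back-of-the-envelope sketch of the per-entity data comparison in optimization 4, using only the figures quoted in the notes (64 bytes per instance matrix, ~80 bytes of rlgl vertex data per entity, ~150k entities). It estimates upload volume only and is not a profile of the actual renderer; `upload_bytes_per_frame` is a hypothetical helper.

```python
ENTITIES = 150_000        # ~150k entities @ 60fps, from the notes
BYTES_PER_INSTANCE = 64   # one 4x4 float32 matrix per entity
BYTES_PER_BATCHED = 80    # approximate rlgl vertex data per entity, from the notes


def upload_bytes_per_frame(entities: int, bytes_per_entity: int) -> int:
    """Bytes of per-entity data submitted to the GPU each frame."""
    return entities * bytes_per_entity


instanced = upload_bytes_per_frame(ENTITIES, BYTES_PER_INSTANCE)
batched = upload_bytes_per_frame(ENTITIES, BYTES_PER_BATCHED)

print(f"instanced: {instanced / 1e6:.1f} MB/frame")
print(f"batched:   {batched / 1e6:.1f} MB/frame")
print(f"ratio:     {instanced / batched:.2f}")
```

The two paths move data volumes of the same order per frame, which fits the observation that instancing brought only a negligible gain on this hardware.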
 ---
 ## future optimizations
@@ -42,12 +61,12 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 these target the rendering bottleneck since update loop is already fast.
-| technique              | description                                                          | expected gain |
+| technique              | description                                                          | expected gain                   |
-| ---------------------- | -------------------------------------------------------------------- | ------------- |
+| ---------------------- | -------------------------------------------------------------------- | ------------------------------- |
-| increase batch buffer  | raylib default is 8192 vertices (2048 quads). larger = fewer flushes | moderate      |
+| SSBO instance data     | pack (x, y, color) = 12 bytes instead of 64-byte matrices            | moderate (less bandwidth)       |
-| GPU instancing         | single draw call for all entities, GPU handles transforms            | significant   |
+| compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync            | significant                     |
-| compute shader updates | move entity positions to GPU entirely                                | significant   |
+| OpenGL vs Vulkan       | test raylib's Vulkan backend                                         | unknown                         |
-| OpenGL vs Vulkan       | test raylib's Vulkan backend                                         | unknown       |
+| discrete GPU testing   | test on dedicated GPU where instancing/SSBO shine                    | significant (different hw)      |
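
A minimal sketch of the 12-byte instance record proposed in the "SSBO instance data" row. The layout is an assumption: the notes only say (x, y, color) = 12 bytes, so this uses two float32 positions plus an RGBA8 color packed into one uint32. `pack_instance` and the format names are hypothetical, and the real shader-side layout may differ.

```python
import struct

MATRIX_FORMAT = "<16f"  # 4x4 float32 transform, as used for instancing
PACKED_FORMAT = "<ffI"  # assumed layout: x, y as float32 + RGBA8 in a uint32


def pack_instance(x: float, y: float, r: int, g: int, b: int, a: int) -> bytes:
    """Pack one entity into the assumed 12-byte record."""
    color = (r << 24) | (g << 16) | (b << 8) | a
    return struct.pack(PACKED_FORMAT, x, y, color)


record = pack_instance(12.5, 40.0, 255, 128, 0, 255)
print(f"packed record: {len(record)} bytes")
print(f"matrix record: {struct.calcsize(MATRIX_FORMAT)} bytes")
```

With this layout each entity needs under a fifth of the bytes of a full matrix, which is the bandwidth saving the table row is hoping for.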
 #### rendering culling
--- a/TODO.md
+++ b/TODO.md
@@ -56,11 +56,19 @@ further options (if needed):
 ## phase 5: rendering experiments
- [ ] increase raylib batch buffer (currently 8192 vertices = 2048 quads)
+- [x] increase raylib batch buffer (8192 → 32768 vertices; default was 2048 quads)
- [ ] GPU instancing (single draw call for all entities)
+- [x] GPU instancing (single draw call for all entities)
 - [ ] SSBO instance data (12 bytes vs 64-byte matrices)
 - [ ] compute shader entity updates (if raylib supports)
 - [ ] compare OpenGL vs Vulkan backend
 findings (i5-6500T / HD 530):
 - batch buffer increase: ~140k @ 60fps (was ~100k)
 - GPU instancing: ~150k @ 60fps - negligible gain over rlgl batching
 - instancing doesn't help on integrated graphics (shared RAM, no PCIe savings)
 - bottleneck is memory bandwidth, not draw call overhead
 - rlgl batching is already near-optimal for this hardware
 ## future optimization concepts
 - [ ] SIMD entity updates (AVX2/SSE)