From 6ac5b10b9342d7acc0b199f6f130ee8f7a4207a6 Mon Sep 17 00:00:00 2001
From: Jared Miller
Date: Tue, 16 Dec 2025 06:06:43 -0500
Subject: [PATCH] Update TODO and OPTIMIZATIONS with gpu discovery

---
 OPTIMIZATIONS.md | 37 ++++++++++++++++++++++++++++---------
 TODO.md          | 12 ++++++++++--
 2 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/OPTIMIZATIONS.md b/OPTIMIZATIONS.md
index d3d4a63..57da516 100644
--- a/OPTIMIZATIONS.md
+++ b/OPTIMIZATIONS.md
@@ -4,9 +4,9 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 
 ## current ceiling
 
-- **100k entities @ 60fps** (AMD Radeon)
-- **50k entities @ 60fps** (i5-6500T integrated)
-- bottleneck: GPU-bound (update loop stays <1ms even at 100k)
+- **~150k entities @ 60fps** (i5-6500T / HD 530 integrated)
+- **~260k entities @ 60fps** (AMD Radeon discrete)
+- bottleneck: GPU-bound (update loop stays <1ms even at 200k+)
 
 ---
 
@@ -34,6 +34,25 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 - improvement: **2x** over texture blitting, **20x** total
 - why it works: eliminates per-call overhead, vertices go straight to GPU buffer
 
+#### optimization 3: increased batch buffer
+
+- technique: increase raylib batch buffer from 8192 to 32768 vertices
+- result: ~140k entities @ 60fps (i5-6500T)
+- improvement: **~40%** over default buffer
+- why it works: fewer GPU flushes per frame
+
+#### optimization 4: GPU instancing (tested, minimal gain)
+
+- technique: `drawMeshInstanced()` with per-entity transform matrices
+- result: ~150k entities @ 60fps (i5-6500T) - similar to rlgl batching
+- improvement: **negligible** on integrated graphics
+- why it didn't help:
+  - integrated GPU shares system RAM (no PCIe transfer savings)
+  - 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
+  - bottleneck is memory bandwidth, not draw call overhead
+  - rlgl batching already minimizes draw calls effectively
+- note: may help more on discrete GPUs with dedicated VRAM
+
 ---
 
 ## future optimizations
@@ -42,12 +61,12 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 
 these target the rendering bottleneck since update loop is already fast.
 
-| technique | description | expected gain |
-| ---------------------- | -------------------------------------------------------------------- | ------------- |
-| increase batch buffer | raylib default is 8192 vertices (2048 quads). larger = fewer flushes | moderate |
-| GPU instancing | single draw call for all entities, GPU handles transforms | significant |
-| compute shader updates | move entity positions to GPU entirely | significant |
-| OpenGL vs Vulkan | test raylib's Vulkan backend | unknown |
+| technique | description | expected gain |
+| ---------------------- | -------------------------------------------------------------------- | ------------------------------- |
+| SSBO instance data | pack (x, y, color) = 12 bytes instead of 64-byte matrices | moderate (less bandwidth) |
+| compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync | significant |
+| OpenGL vs Vulkan | test raylib's Vulkan backend | unknown |
+| discrete GPU testing | test on dedicated GPU where instancing/SSBO shine | significant (different hw) |
 
 #### rendering culling
 
diff --git a/TODO.md b/TODO.md
index be4bebb..103889e 100644
--- a/TODO.md
+++ b/TODO.md
@@ -56,11 +56,19 @@ further options (if needed):
 
 ## phase 5: rendering experiments
 
-- [ ] increase raylib batch buffer (currently 8192 vertices = 2048 quads)
-- [ ] GPU instancing (single draw call for all entities)
+- [x] increase raylib batch buffer (currently 8192 vertices = 2048 quads)
+- [x] GPU instancing (single draw call for all entities)
+- [ ] SSBO instance data (12 bytes vs 64-byte matrices)
 - [ ] compute shader entity updates (if raylib supports)
 - [ ] compare OpenGL vs Vulkan backend
 
+findings (i5-6500T / HD 530):
+- batch buffer increase: ~140k @ 60fps (was ~100k)
+- GPU instancing: ~150k @ 60fps - negligible gain over rlgl batching
+- instancing doesn't help on integrated graphics (shared RAM, no PCIe savings)
+- bottleneck is memory bandwidth, not draw call overhead
+- rlgl batching is already near-optimal for this hardware
+
 ## future optimization concepts
 
 - [ ] SIMD entity updates (AVX2/SSE)