From 6ac5b10b9342d7acc0b199f6f130ee8f7a4207a6 Mon Sep 17 00:00:00 2001
From: Jared Miller <jared@smell.flowers>
Date: Tue, 16 Dec 2025 06:06:43 -0500
Subject: [PATCH] Update TODO and OPTIMIZATIONS with GPU findings

---
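note: the per-entity sizes compared in this patch (12-byte packed record, 64-byte matrix, ~80 bytes of rlgl vertices) can be sanity-checked with a little arithmetic. a minimal sketch, assuming float32 x/y plus a 4-byte packed color for the SSBO record; the layout names and the `report` helper are hypothetical and not taken from the project's code. it only counts per-entity upload data at 60fps, not texture or framebuffer traffic.

```python
import struct

# Hypothetical layouts for illustration only (assumptions, not the project's code).
PACKED_INSTANCE = struct.Struct("<ffI")  # x, y as float32 + color packed into a uint32
MATRIX_INSTANCE = struct.Struct("<16f")  # 4x4 float32 transform matrix

# Approximate per-entity figure quoted in OPTIMIZATIONS.md for rlgl vertices.
RLGL_BYTES_PER_ENTITY = 80


def bytes_per_second(bytes_per_entity, entities, fps=60):
    """Per-entity data uploaded each second at the given entity count and frame rate."""
    return bytes_per_entity * entities * fps


def report(entities=150_000):
    """Return upload rate in MB/s (1 MB = 10**6 bytes) for each layout."""
    layouts = {
        "packed (x, y, color)": PACKED_INSTANCE.size,
        "matrix instancing": MATRIX_INSTANCE.size,
        "rlgl vertices": RLGL_BYTES_PER_ENTITY,
    }
    return {
        name: bytes_per_second(size, entities) / 1e6
        for name, size in layouts.items()
    }


if __name__ == "__main__":
    for name, mb_per_s in report().items():
        print(f"{name}: {mb_per_s:.0f} MB/s")
```

the 64-byte and ~80-byte layouts land within about 25% of each other, which matches the "similar bandwidth" observation, while the packed record is roughly a fifth of the matrix size.
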
 OPTIMIZATIONS.md | 37 ++++++++++++++++++++++++++++---------
 TODO.md          | 12 ++++++++++--
 2 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/OPTIMIZATIONS.md b/OPTIMIZATIONS.md
index d3d4a63..57da516 100644
--- a/OPTIMIZATIONS.md
+++ b/OPTIMIZATIONS.md
@@ -4,9 +4,9 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 
 ## current ceiling
 
-- **100k entities @ 60fps** (AMD Radeon)
-- **50k entities @ 60fps** (i5-6500T integrated)
-- bottleneck: GPU-bound (update loop stays <1ms even at 100k)
+- **~150k entities @ 60fps** (i5-6500T / HD 530 integrated)
+- **~260k entities @ 60fps** (AMD Radeon discrete)
+- bottleneck: GPU-bound (update loop stays <1ms even at 200k+)
 
 ---
 
@@ -34,6 +34,25 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 - improvement: **2x** over texture blitting, **20x** total
 - why it works: eliminates per-call overhead, vertices go straight to GPU buffer
 
+#### optimization 3: increased batch buffer
+
+- technique: increase raylib batch buffer from 8192 to 32768 vertices
+- result: ~140k entities @ 60fps (i5-6500T)
+- improvement: **~40%** over default buffer
+- why it works: fewer GPU flushes per frame
+
+#### optimization 4: GPU instancing (tested, minimal gain)
+
+- technique: `drawMeshInstanced()` with per-entity transform matrices
+- result: ~150k entities @ 60fps (i5-6500T) - similar to rlgl batching
+- improvement: **negligible** on integrated graphics
+- why it didn't help:
+  - integrated GPU shares system RAM (no PCIe transfer savings)
+  - 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
+  - bottleneck is memory bandwidth, not draw call overhead
+  - rlgl batching already minimizes draw calls effectively
+- note: may help more on discrete GPUs with dedicated VRAM
+
 ---
 
 ## future optimizations
@@ -42,12 +61,12 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 
 these target the rendering bottleneck since update loop is already fast.
 
-| technique              | description                                                          | expected gain |
-| ---------------------- | -------------------------------------------------------------------- | ------------- |
-| increase batch buffer  | raylib default is 8192 vertices (2048 quads). larger = fewer flushes | moderate      |
-| GPU instancing         | single draw call for all entities, GPU handles transforms            | significant   |
-| compute shader updates | move entity positions to GPU entirely                                | significant   |
-| OpenGL vs Vulkan       | test raylib's Vulkan backend                                         | unknown       |
+| technique              | description                                                          | expected gain                   |
+| ---------------------- | -------------------------------------------------------------------- | ------------------------------- |
+| SSBO instance data     | pack (x, y, color) = 12 bytes instead of 64-byte matrices            | moderate (less bandwidth)       |
+| compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync            | significant                     |
+| OpenGL vs Vulkan       | test raylib's Vulkan backend                                         | unknown                         |
+| discrete GPU testing   | test on dedicated GPU where instancing/SSBO may help more            | significant (different hw)      |
 
 #### rendering culling
 
diff --git a/TODO.md b/TODO.md
index be4bebb..103889e 100644
--- a/TODO.md
+++ b/TODO.md
@@ -56,11 +56,19 @@ further options (if needed):
 
 ## phase 5: rendering experiments
 
-- [ ] increase raylib batch buffer (currently 8192 vertices = 2048 quads)
-- [ ] GPU instancing (single draw call for all entities)
+- [x] increase raylib batch buffer (8192 → 32768 vertices)
+- [x] GPU instancing (single draw call for all entities)
+- [ ] SSBO instance data (12 bytes vs 64-byte matrices)
 - [ ] compute shader entity updates (if raylib supports)
 - [ ] compare OpenGL vs Vulkan backend
 
+findings (i5-6500T / HD 530):
+- batch buffer increase: ~140k @ 60fps (was ~100k)
+- GPU instancing: ~150k @ 60fps - negligible gain over rlgl batching
+- instancing doesn't help on integrated graphics (shared RAM, no PCIe savings)
+- bottleneck is memory bandwidth, not draw call overhead
+- rlgl batching is already near-optimal for this hardware
+
 ## future optimization concepts
 
 - [ ] SIMD entity updates (AVX2/SSE)