diff --git a/OPTIMIZATIONS.md b/OPTIMIZATIONS.md
index 57da516..911110f 100644
--- a/OPTIMIZATIONS.md
+++ b/OPTIMIZATIONS.md
@@ -4,9 +4,9 @@ organized by performance goal. see journal.txt for detailed benchmarks.
 
 ## current ceiling
 
-- **~150k entities @ 60fps** (i5-6500T / HD 530 integrated)
-- **~260k entities @ 60fps** (AMD Radeon discrete)
-- bottleneck: GPU-bound (update loop stays <1ms even at 200k+)
+- **~700k entities @ 60fps** (i5-6500T / HD 530 integrated, SSBO)
+- **~950k entities @ ~53-57fps** (i5-6500T / HD 530 integrated, SSBO)
+- bottleneck: GPU-bound (update loop stays <5ms even at 950k)
 
 ---
 
@@ -53,6 +53,25 @@ organized by performance goal. see journal.txt for detailed benchmarks.
   - rlgl batching already minimizes draw calls effectively
 - note: may help more on discrete GPUs with dedicated VRAM
 
+#### optimization 5: SSBO instance data
+
+- technique: pack entity data (x, y, color) into 12-byte struct, upload via SSBO
+- result: **~700k entities @ 60fps** (i5-6500T / HD 530)
+- improvement: **~5x** over previous best, **~140x** total from baseline
+- comparison:
+  - batch buffer (0.3.1): 60fps @ ~140k
+  - GPU instancing (0.4.0): 60fps @ ~150k
+  - SSBO: 60fps @ ~700k, ~53-57fps @ 950k
+- why it works:
+  - 12 bytes vs 64 bytes (matrices) = 5.3x less bandwidth
+  - 12 bytes vs 80 bytes (rlgl vertices) = 6.7x less bandwidth
+  - no CPU-side matrix calculations
+  - GPU does NDC conversion and color unpacking
+- implementation notes:
+  - custom vertex shader reads from SSBO using `gl_InstanceID`
+  - single `rlDrawVertexArrayInstanced()` call for all entities
+  - gotcha: don't use `rlSetUniformSampler()` for custom GL code; use `rlSetUniform()` with `RL_SHADER_UNIFORM_INT` instead (see `docs/raylib_rlSetUniformSampler_bug.md`)
+
 ---
 
 ## future optimizations
@@ -63,7 +82,7 @@ these target the rendering bottleneck since update loop is already fast.
 
 | technique              | description                                                          | expected gain                   |
 | ---------------------- | -------------------------------------------------------------------- | ------------------------------- |
-| SSBO instance data     | pack (x, y, color) = 12 bytes instead of 64-byte matrices            | moderate (less bandwidth)       |
+| ~~SSBO instance data~~ | ~~pack (x, y, color) = 12 bytes instead of 64-byte matrices~~        | **done** - see optimization 5   |
 | compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync            | significant                     |
 | OpenGL vs Vulkan       | test raylib's Vulkan backend                                         | unknown                         |
 | discrete GPU testing   | test on dedicated GPU where instancing/SSBO shine                    | significant (different hw)      |
diff --git a/TODO.md b/TODO.md
index 103889e..928bdfc 100644
--- a/TODO.md
+++ b/TODO.md
@@ -58,7 +58,7 @@ further options (if needed):
 
 - [x] increase raylib batch buffer (currently 8192 vertices = 2048 quads)
 - [x] GPU instancing (single draw call for all entities)
-- [ ] SSBO instance data (12 bytes vs 64-byte matrices)
+- [x] SSBO instance data (12 bytes vs 64-byte matrices)
 - [ ] compute shader entity updates (if raylib supports)
 - [ ] compare OpenGL vs Vulkan backend
 
diff --git a/journal.txt b/journal.txt
index 3b1cdad..34c1234 100644
--- a/journal.txt
+++ b/journal.txt
@@ -108,9 +108,101 @@ difference.
 
 ---
 
-optimization 3: [pending]
--------------------------
-technique:
-results:
-notes:
+optimization 3: increased batch buffer
+--------------------------------------
+technique: increase raylib batch buffer from 8192 to 32768 vertices
+code: sandbox_main.zig:150-156
+version: 0.3.1
+
+benchmark_0.3.1.log results (i5-6500T / HD 530):
+- 60fps stable: ~140k entities
+- 140k entities: 16.7ms (vsync-locked)
+- 170k entities: 19.3ms (breaks 60fps)
+- 190k entities: 18.8ms
+- 240k entities: 23ms (benchmark exit)
+
+comparison to 0.3.0 on same hardware:
+- 0.3.0: 60fps breaks at ~64k entities
+- 0.3.1: 60fps stable at ~140k entities
+- ~2x improvement from larger batch buffer
+
+analysis: fewer GPU flushes per frame. default buffer (8192 verts = 2048 quads)
+means 100k entities = ~49 flushes. larger buffer (32768 verts = 8192 quads)
+means 100k entities = ~13 flushes (~4x fewer). less driver overhead per frame.
+
+---
+
+optimization 4: GPU instancing (tested, minimal gain)
+-----------------------------------------------------
+technique: drawMeshInstanced() with per-entity 4x4 transform matrices
+code: sandbox_main.zig:173-192, 245-257
+version: 0.4.0
+
+benchmark_0.4.0.log results (i5-6500T / HD 530):
+- 60fps stable: ~150k entities (with occasional spikes)
+- 150k entities: 16.7ms (vsync-locked, spiky)
+- 190k entities: 18-19ms
+- 240k entities: 23ms
+- 270k entities: 28ms (benchmark exit)
+
+comparison to 0.3.1 (batch buffer) on same hardware:
+- 0.3.1: 60fps @ ~140k, exits @ ~240k
+- 0.4.0: 60fps @ ~150k, exits @ ~270k
+- negligible improvement (~7% more entities)
+
+analysis: GPU instancing didn't help much on integrated graphics because:
+1. shared system RAM means no PCIe transfer savings (discrete GPUs benefit here)
+2. 64-byte Matrix per entity vs ~80 bytes for rlgl vertices (similar bandwidth)
+3. bottleneck is memory bandwidth, not draw call overhead
+4. rlgl batching already minimizes draw calls effectively
+
+conclusion: for integrated graphics, rlgl quad batching is already near-optimal.
+GPU instancing shines on discrete GPUs where PCIe transfer is the bottleneck.
+keep both paths available for testing on different hardware.
+
+---
+
+optimization 5: SSBO instance data
+----------------------------------
+technique: pack entity data into 12-byte struct (x, y, color), upload via SSBO
+code: ssbo_renderer.zig, shaders/entity.vert, shaders/entity.frag
+version: ssbo branch
+
+const GpuEntity = extern struct { // extern (assumed here): fixed C layout for GPU upload
+    x: f32,      // 4 bytes
+    y: f32,      // 4 bytes
+    color: u32,  // 4 bytes (0xRRGGBB)
+};              // = 12 bytes total
+
+vs GPU instancing: 64-byte Matrix per entity
+vs rlgl batching: ~80 bytes per entity (4 vertices * 5 floats * 4 bytes)
+
+benchmark_ssbo.log results (i5-6500T / HD 530):
+- 200k entities: 16.7ms (60fps)
+- 350k entities: 16.7ms (60fps)
+- 450k entities: 16.7ms (60fps)
+- 550k entities: 16.7ms (60fps)
+- 700k entities: 16.7ms (60fps)
+- 950k entities: 17.5-18.9ms (~53-57fps, benchmark exit)
+
+comparison to previous optimizations on same hardware:
+- 0.3.1 (batch buffer): 60fps @ ~140k entities
+- 0.4.0 (GPU instancing): 60fps @ ~150k entities
+- SSBO: 60fps @ ~700k entities
+- ~5x improvement over previous best!
+
+analysis: SSBO massively reduces per-entity bandwidth:
+- 12 bytes vs 64 bytes (matrices) = 5.3x less data
+- 12 bytes vs 80 bytes (rlgl) = 6.7x less data
+- single instanced draw call, no CPU-side transform calculations
+- shader does NDC conversion and color unpacking on GPU
+
+gotcha found: raylib's rlSetUniformSampler() doesn't work with custom GL code.
+use rlSetUniform() with RL_SHADER_UNIFORM_INT instead.
+see docs/raylib_rlSetUniformSampler_bug.md for details.
+
+total improvement from baseline:
+- baseline: 60fps @ ~5k entities (individual drawCircle)
+- SSBO: 60fps @ ~700k entities
+- ~140x improvement overall!