Add hd530 notes with point sprite experience

2025-12-17 21:22:02 -05:00
commit 55b0d7fab7
parent a842800ede
2 changed files with 208 additions and 0 deletions
--- a/docs/hd530_optimization_guide.md
+++ b/docs/hd530_optimization_guide.md
@@ -0,0 +1,119 @@
+# intel hd 530 optimization guide for lofivor
+
+based on hardware specs and empirical testing.
+
+## hardware constraints
+
+from `intel_hd_graphics_530.txt`:
+
+| resource   | value               | implication                                 |
+| ---------- | -------             | -------------                               |
+| ROPs       | 3                   | fill rate limited - this is our ceiling     |
+| TMUs       | 24                  | texture sampling is relatively fast         |
+| memory     | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM              |
+| pixel rate | 2.85 GPixel/s       | max theoretical throughput                  |
+| EUs        | 24 (192 ALUs)       | decent compute, weak vs discrete            |
+| L3 cache   | 768 KB              | small, cache misses hurt                    |
+
+the bottleneck is ROPs (fill rate), not vertices or compute.
+
+## what works (proven)
+
+### SSBO instance data
+- 16 bytes per entity vs 64 bytes (matrices)
+- minimizes bandwidth on shared memory bus
+- result: ~5x improvement over matrix-based instancing (64 bytes per instance)
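+
+a minimal sketch of what a 16-byte record could look like on the GLSL side. the field layout (`x`, `y`, `packedVel`, `color`), the binding point, and the uniform and attribute names are assumptions for illustration, not the actual lofivor shader:
+
+```glsl
+#version 430
+
+// hypothetical 16-byte entity record (4 scalars x 4 bytes)
+struct Entity {
+    float x;         // world position
+    float y;
+    uint  packedVel; // two fp16 velocity components
+    uint  color;     // RGBA8
+};
+
+layout(std430, binding = 0) readonly buffer Entities {
+    Entity entities[];
+};
+
+layout(location = 0) in vec2 quadPos; // unit quad corner, -0.5..0.5
+layout(location = 1) in vec2 quadUV;
+
+uniform mat4  mvp;  // assumed uniform names
+uniform float size; // quad edge length in world units
+
+out vec2 fragTexCoord;
+out vec3 fragColor;
+
+void main() {
+    // one record per instance, fetched straight from the SSBO
+    Entity e = entities[gl_InstanceID];
+    vec2 world = vec2(e.x, e.y) + quadPos * size;
+    fragTexCoord = quadUV;
+    fragColor = unpackUnorm4x8(e.color).rgb;
+    gl_Position = mvp * vec4(world, 0.0, 1.0);
+}
+```
+
+four 4-byte scalars in a `std430` block pack to exactly 16 bytes with no padding, a quarter of what a `mat4` per instance costs.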
+
+### compute shader updates
+- GPU does position/velocity updates
+- no CPU→GPU sync per frame
+- result: update time essentially free
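+
+the update step could be sketched as a compute shader along these lines. the record layout, the `dt` and `entityCount` uniforms, and the workgroup size of 64 are assumptions, not the project's actual code:
+
+```glsl
+#version 430
+layout(local_size_x = 64) in;
+
+// hypothetical 16-byte entity record
+struct Entity {
+    float x;
+    float y;
+    uint  packedVel; // two fp16 velocity components
+    uint  color;
+};
+
+layout(std430, binding = 0) buffer Entities {
+    Entity entities[];
+};
+
+uniform float dt;          // frame time in seconds
+uniform uint  entityCount; // number of live records
+
+void main() {
+    uint i = gl_GlobalInvocationID.x;
+    if (i >= entityCount) return; // guard the last partial workgroup
+
+    vec2 vel = unpackHalf2x16(entities[i].packedVel);
+    entities[i].x += vel.x * dt;
+    entities[i].y += vel.y * dt;
+}
+```
+
+a shader like this would be dispatched with `ceil(entityCount / 64)` workgroups, followed by `glMemoryBarrier` with `GL_SHADER_STORAGE_BARRIER_BIT` before the draw reads the buffer. the data never leaves GPU memory, which is where the "no sync per frame" saving comes from.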
+
+### texture sampling
+- 22.8 GTexel/s is fast relative to other units
+- pre-baked circle texture beats procedural math
+- result: 2x faster than procedural fragment shader
+
+### instanced triangles/quads
+- most optimized driver path
+- intel mesa heavily optimizes this
+- result: baseline, hard to beat
+
+## what doesn't work (proven)
+
+### point sprites
+- theoretically 6x fewer vertices
+- reality: 2.4x SLOWER on this hardware
+- triangle rasterizer is more optimized
+- see `docs/point_sprites_experiment.md`
+
+### procedural fragment shaders
+- `length()`, `smoothstep()`, `discard` are expensive
+- EUs are weaker than discrete GPUs
+- `discard` breaks early-z optimization
+- result: procedural point sprites ran 3.7x slower than textured quads (350k entities)
+
+### complex fragment math
+- only 24 EUs, each running 8 ALUs
+- transcendentals (sqrt, sin, cos) are 4x slower than FMAD
+- avoid in hot path
+
+## what to try next (theoretical)
+
+### likely to help
+
+| technique                            | why it should work                      | expected gain            |
+| -----------                          | -------------------                     | ---------------          |
+| frustum culling (GPU)                | reduce fill rate, which is bottleneck   | 10-30% depending on view |
+| smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40%                   |
+| early-z / depth pre-pass             | skip fragment work for occluded pixels  | moderate                 |
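+
+GPU frustum culling is untested here, but one way it might look is a compute pass that compacts visible entity indices and fills in an indirect draw command. everything in this sketch is an assumption: the record layout, the buffer bindings, the `viewBounds` and `radius` uniforms, and the use of an indirect draw at all. note that clipped geometry produces no fragments, so the saving would come from skipped vertex and setup work; whether that matters on this hardware needs measuring.
+
+```glsl
+#version 430
+layout(local_size_x = 64) in;
+
+// hypothetical 16-byte entity record
+struct Entity {
+    float x;
+    float y;
+    uint  packedVel;
+    uint  color;
+};
+
+layout(std430, binding = 0) readonly buffer Entities {
+    Entity entities[];
+};
+
+layout(std430, binding = 1) writeonly buffer Visible {
+    uint visibleIndices[];
+};
+
+// same field order as the GL DrawArraysIndirectCommand struct
+layout(std430, binding = 2) buffer DrawCmd {
+    uint vertexCount;
+    uint instanceCount; // must be reset to 0 before each dispatch
+    uint firstVertex;
+    uint baseInstance;
+};
+
+uniform vec4  viewBounds;  // xy = min corner, zw = max corner (world space)
+uniform float radius;      // entity radius in world units
+uniform uint  entityCount;
+
+void main() {
+    uint i = gl_GlobalInvocationID.x;
+    if (i >= entityCount) return;
+
+    vec2 p = vec2(entities[i].x, entities[i].y);
+    bool visible = all(greaterThanEqual(p + radius, viewBounds.xy)) &&
+                   all(lessThanEqual(p - radius, viewBounds.zw));
+
+    if (visible) {
+        // claim the next free slot in the compacted index list
+        uint slot = atomicAdd(instanceCount, 1u);
+        visibleIndices[slot] = i;
+    }
+}
+```
+
+the vertex shader would then read `entities[visibleIndices[gl_InstanceID]]`, and the draw would be issued with `glDrawArraysIndirect` so the instance count never travels back to the CPU.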
+
+### unlikely to help
+
+| technique                | why it won't help                         |
+| -----------              | ------------------                        |
+| more vertex optimization | already fill rate bound, not vertex bound |
+| SIMD on CPU              | updates already on GPU                    |
+| multithreading           | CPU isn't the bottleneck                  |
+| different vertex layouts | negligible vs fill rate                   |
+
+### uncertain (need to test)
+
+| technique           | notes                                                 |
+| -----------         | -------                                               |
+| vulkan backend      | might have less driver overhead, or might not matter  |
+| indirect draw calls | GPU decides what to render, but we're not CPU bound   |
+| fp16 in shaders     | HD 530 has 2:1 fp16 ratio, might help fragment shader |
+
+## key insights
+
+1. fill rate is king - with only 3 ROPs, everything comes down to how many
+   pixels we're writing. optimizations that don't reduce pixel count won't
+   help.
+
+2. shared memory hurts - no dedicated VRAM means CPU and GPU compete for
+   bandwidth. keep data transfers minimal.
+
+3. driver optimization matters - the "common path" (triangles) is more
+   optimized than alternatives (points). don't be clever.
+
+4. texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture
+   lookups over ALU math in fragment shaders.
+
+5. avoid discard - breaks early-z, causes pipeline stalls. alpha blending
+   is faster than discard.
+
+## current ceiling
+
+~950k entities @ 57fps (SSBO + compute + quads)
+
+to go higher, we need to reduce fill rate:
+- cull offscreen entities
+- reduce entity size when zoomed out
+- or accept lower fps at higher counts
+
+## references
+
+- intel gen9 compute architecture whitepaper
+- empirical benchmarks in `benchmark_current_i56500t.log`
+- point sprites experiment in `docs/point_sprites_experiment.md`
--- a/docs/point_sprites_experiment.md
+++ b/docs/point_sprites_experiment.md
@@ -0,0 +1,89 @@
+# point sprites experiment
+
+branch: `point-sprites`
+date: 2024-12
+hardware: intel hd 530 (skylake gt2, i5-6500T)
+
+## hypothesis
+
+point sprites should be faster than quads because:
+- 1 vertex per entity instead of 6 (quad = 2 triangles)
+- less vertex processing work per frame
+- `gl_PointCoord` provides texture coords automatically
+
+## implementation
+
+### vertex shader changes
+- removed quad vertex attributes (position, texcoord)
+- use `gl_PointSize = 16.0 * zoom` for size control
+- position calculated from SSBO data only
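+
+a sketch of a vertex shader matching the changes above. `gl_PointSize = 16.0 * zoom` comes from this experiment; the record layout, binding point, and uniform names are assumptions:
+
+```glsl
+#version 430
+
+// hypothetical 16-byte entity record
+struct Entity {
+    float x;
+    float y;
+    uint  packedVel;
+    uint  color; // RGBA8
+};
+
+layout(std430, binding = 0) readonly buffer Entities {
+    Entity entities[];
+};
+
+uniform mat4  mvp;  // assumed uniform names
+uniform float zoom;
+
+out vec3 fragColor;
+
+void main() {
+    // no vertex attributes: the instance index is the only input
+    Entity e = entities[gl_InstanceID];
+    fragColor = unpackUnorm4x8(e.color).rgb;
+    gl_Position = mvp * vec4(e.x, e.y, 0.0, 1.0);
+    gl_PointSize = 16.0 * zoom; // needs GL_PROGRAM_POINT_SIZE enabled
+}
+```
+
+with `glDrawArraysInstanced(GL_POINTS, 0, 1, count)` each instance emits a single vertex, so `gl_InstanceID` selects the entity.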
+
+### fragment shader changes
+- use `gl_PointCoord` instead of vertex texcoord
+- sample circle texture for alpha
+
+### renderer changes
+- load `glEnable` and `glDrawArraysInstanced` via `rlGetProcAddress`
+- enable `GL_PROGRAM_POINT_SIZE`
+- draw with `glDrawArraysInstanced(GL_POINTS, 0, 1, count)`
+- removed VBO (no vertex data needed)
+
+## results
+
+### attempt 1: procedural circle in fragment shader
+
+```glsl
+vec2 coord = gl_PointCoord - vec2(0.5);
+float dist = length(coord);
+float alpha = 1.0 - smoothstep(0.4, 0.5, dist);
+if (alpha < 0.01) discard;
+```
+
+**benchmark @ 350k entities:**
+- point sprites: 23ms render, 43fps
+- quads (main): 6.2ms render, 151fps
+- **result: 3.7x SLOWER**
+
+**why:** `discard` breaks early-z optimization, `length()` and `smoothstep()` are ALU-heavy, and intel integrated GPUs are weak at fragment shader math.
+
+### attempt 2: texture sampling
+
+```glsl
+float alpha = texture(circleTexture, gl_PointCoord).r;
+finalColor = vec4(fragColor, alpha);
+```
+
+**benchmark @ 450k entities:**
+- point sprites: 19.1ms render, 52fps
+- quads (main): 8.0ms render, 122fps
+- **result: 2.4x SLOWER**
+
+better than procedural, but still significantly slower than quads.
+
+## analysis
+
+the theoretical advantage (1/6 the vertices) doesn't translate to real performance because:
+
+1. **triangle path is more optimized** - intel's driver heavily optimizes the standard triangle rasterization path. point sprites use a less-traveled code path.
+
+2. **fill rate is the bottleneck** - HD 530 has only 3 ROPs. we're bound by how fast we can write pixels, not by vertex count. reducing vertices from 6 to 1 doesn't help when fill rate is the constraint. this explains why there is no speedup, though not the slowdown by itself.
+
+3. **point size overhead** - each point requires computing `gl_PointSize` and setting up the point sprite rasterization, which may have per-vertex overhead.
+
+4. **texture cache behavior** - `gl_PointCoord` may have worse cache locality than explicit vertex texcoords.
+
+## conclusion
+
+**point sprites are a regression on intel hd 530.**
+
+the optimization makes theoretical sense but fails in practice on this hardware. the quad/triangle path is simply more optimized in intel's mesa driver.
+
+**keep this branch for testing on discrete GPUs** where point sprites might actually help (nvidia/amd have different optimization priorities).
+
+## lessons learned
+
+1. always benchmark, don't assume
+2. "fewer vertices" doesn't always mean faster
+3. integrated GPU optimization is different from discrete
+4. the most optimized path is usually the most common path (triangles)
+5. fill rate matters more than vertex count at high entity counts