Add hd530 notes with point sprite experience

2025-12-17 21:22:02 -05:00 · 55b0d7fab7
commit 55b0d7fab7
parent a842800ede
2 changed files with 208 additions and 0 deletions
--- a/docs/hd530_optimization_guide.md
+++ b/docs/hd530_optimization_guide.md
@@ -0,0 +1,119 @@
 # intel hd 530 optimization guide for lofivor
 based on hardware specs and empirical testing.
 ## hardware constraints
 from `intel_hd_graphics_530.txt`:
 | resource   | value               | implication                                 |
 | ---------- | -------             | -------------                               |
 | ROPs       | 3                   | fill rate limited - this is our ceiling     |
 | TMUs       | 24                  | texture sampling is relatively fast         |
 | memory     | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM              |
 | pixel rate | 2.85 GPixel/s       | max theoretical throughput                  |
 | EUs        | 24 (192 ALUs)       | decent compute, weak vs discrete            |
 | L3 cache   | 768 KB              | small, cache misses hurt                    |
 the bottleneck is ROPs (fill rate), not vertices or compute.
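
the pixel rate in the table, and the 22.8 GTexel/s figure quoted further down, are consistent with the unit counts if each ROP writes one pixel per clock and each TMU samples one texel per clock. that per-clock behaviour is an assumption, and the clock below is derived from the table rather than read from the spec file. a quick sanity check:

```python
# sanity check of the hardware table.
# assumption: 1 pixel per ROP per clock, 1 texel per TMU per clock.
ROPS = 3
TMUS = 24
PIXEL_RATE_GPIX = 2.85  # GPixel/s, from the table

# implied GPU clock in GHz
clock_ghz = PIXEL_RATE_GPIX / ROPS

# implied texel rate in GTexel/s
texel_rate_gtex = TMUS * clock_ghz

print(f"implied clock: {clock_ghz * 1000:.0f} MHz")
print(f"texel rate:    {texel_rate_gtex:.1f} GTexel/s")
print(f"TMUs per ROP:  {TMUS // ROPS}")
```

under these assumptions the chip can sample eight texels for every pixel it writes, which is why sampling is cheap relative to fill.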
 ## what works (proven)
 ### SSBO instance data
 - 16 bytes per entity vs 64 bytes (matrices)
 - minimizes bandwidth on shared memory bus
 - result: ~5x improvement over instancing
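
a rough estimate of what the smaller record means for the shared bus. entity count, fps and bus bandwidth are taken from this guide; reading each instance record exactly once per frame is an assumption, so treat the output as an order-of-magnitude sketch:

```python
# per-second instance-data traffic at the current ceiling.
# assumption: every instance record is read exactly once per frame.
ENTITIES = 950_000
FPS = 57
BUS_GB_PER_S = 30.0  # shared DDR4, approximate


def traffic_gb_per_s(bytes_per_entity: int) -> float:
    """instance bytes read per second, in GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_entity * ENTITIES * FPS / 1e9


for label, size in (("SSBO, 16 B", 16), ("matrix, 64 B", 64)):
    gbps = traffic_gb_per_s(size)
    print(f"{label:>12}: {gbps:.2f} GB/s ({gbps / BUS_GB_PER_S:.1%} of bus)")
```

the byte count drops 4x. the ~5x figure above is a measured result, so this arithmetic does not fully explain it on its own.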
 ### compute shader updates
 - GPU does position/velocity updates
 - no CPU→GPU sync per frame
 - result: update time essentially free
 ### texture sampling
 - 22.8 GTexel/s is fast relative to other units
 - pre-baked circle texture beats procedural math
 - result: 2x faster than procedural fragment shader
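
a minimal sketch of baking such a texture on the CPU. it reuses the falloff from the procedural shader in the point sprites experiment (`1.0 - smoothstep(0.4, 0.5, dist)`). the texture size and function names are illustrative; lofivor's actual bake code is not shown here:

```python
# bake a circle alpha mask once, instead of evaluating it per fragment.
# SIZE and the function names are hypothetical, not taken from lofivor.
import math

SIZE = 32  # texels per side (assumed)


def smoothstep(edge0: float, edge1: float, x: float) -> float:
    """GLSL-style smoothstep: clamp to [0, 1], then Hermite interpolate."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def bake_circle(size: int = SIZE) -> bytes:
    """return size*size single-channel (8-bit) texels, row-major."""
    texels = bytearray()
    for y in range(size):
        for x in range(size):
            # texel centre in [0, 1], shifted so (0, 0) is the middle
            u = (x + 0.5) / size - 0.5
            v = (y + 0.5) / size - 0.5
            alpha = 1.0 - smoothstep(0.4, 0.5, math.hypot(u, v))
            texels.append(round(alpha * 255))
    return bytes(texels)


mask = bake_circle()
centre = mask[(SIZE // 2) * SIZE + SIZE // 2]
print(f"{len(mask)} texels, corner alpha {mask[0]}, centre alpha {centre}")
```

the bytes would be uploaded as a single-channel texture and read through the red channel in the fragment shader, replacing the `length()` and `smoothstep()` calls with one sample.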
 ### instanced triangles/quads
 - most optimized driver path
 - intel mesa heavily optimizes this
 - result: baseline, hard to beat
 ## what doesn't work (proven)
 ### point sprites
 - theoretically 6x fewer vertices
 - reality: 2.4x SLOWER on this hardware with texture sampling (3.7x slower with a procedural shader)
 - triangle rasterizer is more optimized
 - see `docs/point_sprites_experiment.md`
 ### procedural fragment shaders
 - `length()`, `smoothstep()`, `discard` are expensive
 - EUs are weaker than discrete GPUs
 - `discard` breaks early-z optimization
 - result: 3.7x slower than textured quads (point sprite experiment, 350k entities)
 ### complex fragment math
 - only 24 EUs, each running 8 ALUs
 - transcendentals (sqrt, sin, cos) are 4x slower than FMAD
 - avoid in hot path
 ## what to try next (theoretical)
 ### likely to help
 | technique                            | why it should work                      | expected gain            |
 | -----------                          | -------------------                     | ---------------          |
 | frustum culling (GPU)                | reduce fill rate, which is the bottleneck | 10-30% depending on view |
 | smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40%                   |
 | early-z / depth pre-pass             | skip fragment work for occluded pixels  | moderate                 |
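
a CPU reference for the per-entity visibility test that a GPU culling pass could run. it assumes a 2D camera with an axis-aligned view rectangle in world units; `View`, `is_visible` and `cull` are hypothetical names, not existing lofivor code:

```python
# conservative circle-vs-rectangle visibility test.
# assumption: 2D world, axis-aligned view rectangle, y grows downward.
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    left: float
    top: float
    right: float
    bottom: float


def is_visible(x: float, y: float, radius: float, view: View) -> bool:
    """true if a circle at (x, y) may overlap the view (rect grown by radius)."""
    return (view.left - radius <= x <= view.right + radius
            and view.top - radius <= y <= view.bottom + radius)


def cull(positions, radius, view):
    """keep only entities that can touch at least one on-screen pixel."""
    return [(x, y) for x, y in positions if is_visible(x, y, radius, view)]


view = View(left=0.0, top=0.0, right=1280.0, bottom=720.0)
entities = [(640.0, 360.0), (-4.0, 100.0), (-50.0, 100.0), (1300.0, 800.0)]
visible = cull(entities, radius=8.0, view=view)
print(f"{len(visible)} of {len(entities)} entities survive culling")
```

the gain depends entirely on how many entities sit outside the view. note that the rasterizer already clips offscreen primitives, so the saving may come from skipped vertex and setup work rather than from ROP writes; this is worth measuring before relying on the estimate.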
 ### unlikely to help
 | technique                | why it won't help                         |
 | -----------              | ------------------                        |
 | more vertex optimization | already fill rate bound, not vertex bound |
 | SIMD on CPU              | updates already on GPU                    |
 | multithreading           | CPU isn't the bottleneck                  |
 | different vertex layouts | negligible vs fill rate                   |
 ### uncertain (need to test)
 | technique           | notes                                                 |
 | -----------         | -------                                               |
 | vulkan backend      | might have less driver overhead, or might not matter  |
 | indirect draw calls | GPU decides what to render, but we're not CPU bound   |
 | fp16 in shaders     | HD 530 has 2:1 fp16 ratio, might help fragment shader |
 ## key insights
 1. fill rate is king - with only 3 ROPs, everything comes down to how many
   pixels we're writing. optimizations that don't reduce pixel count won't
   help.
 2. shared memory hurts - no dedicated VRAM means CPU and GPU compete for
   bandwidth. keep data transfers minimal.
 3. driver optimization matters - the "common path" (triangles) is more
   optimized than alternatives (points). don't be clever.
 4. texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture
   lookups over ALU math in fragment shaders.
 5. avoid discard - breaks early-z, causes pipeline stalls. alpha blending
   is faster than discard.
 ## current ceiling
 ~950k entities @ 57fps (SSBO + compute + quads)
 to go higher, we need to reduce fill rate:
 - cull offscreen entities
 - reduce entity size when zoomed out
 - or accept lower fps at higher counts
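
a worked estimate of the pixel budget behind that ceiling. it assumes the ROPs sustain the peak 2.85 GPixel/s and that every pixel is written once (no overdraw), so real budgets are lower than what this prints:

```python
# theoretical pixel budget per entity at the current ceiling.
# assumption: peak fill rate is sustained and there is no overdraw.
import math

PIXEL_RATE = 2.85e9  # pixels/s, from the hardware table
FPS = 57
ENTITIES = 950_000

pixels_per_frame = PIXEL_RATE / FPS
pixels_per_entity = pixels_per_frame / ENTITIES
max_square_side = math.sqrt(pixels_per_entity)


def max_entities(side_px: float, fps: float = FPS) -> int:
    """entities that fit the peak budget when each covers side_px^2 pixels."""
    return int(PIXEL_RATE / fps / (side_px * side_px))


print(f"{pixels_per_frame / 1e6:.0f}M pixels/frame, "
      f"{pixels_per_entity:.1f} px/entity (~{max_square_side:.1f} px square)")
for side in (8, 4, 2):
    print(f"{side} px square -> up to {max_entities(side):,} entities at {FPS} fps")
```

covered area scales with the square of the on-screen size, so halving entity size quadruples the theoretical entity budget. that is the argument for shrinking entities when zoomed out.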
 ## references
 - intel gen9 compute architecture whitepaper
 - empirical benchmarks in `benchmark_current_i56500t.log`
 - point sprites experiment in `docs/point_sprites_experiment.md`
--- a/docs/point_sprites_experiment.md
+++ b/docs/point_sprites_experiment.md
@@ -0,0 +1,89 @@
 # point sprites experiment
 branch: `point-sprites`
 date: 2024-12
 hardware: intel hd 530 (skylake gt2, i5-6500T)
 ## hypothesis
 point sprites should be faster than quads because:
 - 1 vertex per entity instead of 6 (quad = 2 triangles)
 - less vertex throughput
 - `gl_PointCoord` provides texture coords automatically
 ## implementation
 ### vertex shader changes
 - removed quad vertex attributes (position, texcoord)
 - use `gl_PointSize = 16.0 * zoom` for size control
 - position calculated from SSBO data only
 ### fragment shader changes
 - use `gl_PointCoord` instead of vertex texcoord
 - sample circle texture for alpha
 ### renderer changes
 - load `glEnable` and `glDrawArraysInstanced` via `rlGetProcAddress`
 - enable `GL_PROGRAM_POINT_SIZE`
 - draw with `glDrawArraysInstanced(GL_POINTS, 0, 1, count)`
 - removed VBO (no vertex data needed)
 ## results
 ### attempt 1: procedural circle in fragment shader
 ```glsl
 vec2 coord = gl_PointCoord - vec2(0.5);
 float dist = length(coord);
 float alpha = 1.0 - smoothstep(0.4, 0.5, dist);
 if (alpha < 0.01) discard;
 ```
 **benchmark @ 350k entities:**
 - point sprites: 23ms render, 43fps
 - quads (main): 6.2ms render, 151fps
 - **result: 3.7x SLOWER**
 **why:** `discard` breaks early-z optimization, `length()` and `smoothstep()` are ALU-heavy, intel integrated GPUs are weak at fragment shader math.
 ### attempt 2: texture sampling
 ```glsl
 float alpha = texture(circleTexture, gl_PointCoord).r;
 finalColor = vec4(fragColor, alpha);
 ```
 **benchmark @ 450k entities:**
 - point sprites: 19.1ms render, 52fps
 - quads (main): 8.0ms render, 122fps
 - **result: 2.4x SLOWER**
 better than procedural (2.4x vs 3.7x slower than quads, though measured at a different entity count), but still significantly slower than quads.
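
the two attempts were benchmarked at different entity counts, so a per-entity cost makes them easier to compare. the sketch below only rearranges the numbers reported above and measures nothing new:

```python
# normalise the two benchmark runs to render time per entity.
# all inputs are copied from the results in this document.
RUNS = {
    # name: (entities, point_sprite_ms, quad_ms)
    "procedural": (350_000, 23.0, 6.2),
    "textured": (450_000, 19.1, 8.0),
}


def ns_per_entity(render_ms: float, entities: int) -> float:
    """render time per entity in nanoseconds."""
    return render_ms * 1e6 / entities


for name, (count, points_ms, quads_ms) in RUNS.items():
    print(f"{name:>10}: points {ns_per_entity(points_ms, count):.1f} ns/entity, "
          f"quads {ns_per_entity(quads_ms, count):.1f} ns/entity, "
          f"slowdown {points_ms / quads_ms:.1f}x")
```

the quad path costs nearly the same per entity in both runs, which suggests render time scales roughly linearly over this range. per entity, the textured point sprite is about 1.5x cheaper than the procedural one, yet still well behind quads.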
 ## analysis
 the theoretical advantage (1/6 vertices) doesn't translate to real performance because:
 1. **triangle path is more optimized** - intel's driver heavily optimizes the standard triangle rasterization path. point sprites use a less-traveled code path.
 2. **fill rate is the bottleneck** - HD 530 has only 3 ROPs. we're bound by how fast we can write pixels, not by vertex count. reducing vertices from 6 to 1 doesn't help when fill rate is the constraint.
 3. **point size overhead** - each point requires computing `gl_PointSize` and setting up the point sprite rasterization, which may have per-vertex overhead.
 4. **texture cache behavior** - `gl_PointCoord` may have worse cache locality than explicit vertex texcoords.
 ## conclusion
 **point sprites are a regression on intel hd 530.**
 the optimization makes theoretical sense but fails in practice on this hardware. the quad/triangle path is simply more optimized in intel's mesa driver.
 **keep this branch for testing on discrete GPUs** where point sprites might actually help (nvidia/amd have different optimization priorities).
 ## lessons learned
 1. always benchmark, don't assume
 2. "fewer vertices" doesn't always mean faster
 3. integrated GPU optimization is different from discrete
 4. the most optimized path is usually the most common path (triangles)
 5. fill rate matters more than vertex count at high entity counts