diff --git a/docs/hd530_optimization_guide.md b/docs/hd530_optimization_guide.md
new file mode 100644
index 0000000..1689f37
--- /dev/null
+++ b/docs/hd530_optimization_guide.md
@@ -0,0 +1,119 @@
+# intel hd 530 optimization guide for lofivor
+
+based on hardware specs and empirical testing.
+
+## hardware constraints
+
+from `intel_hd_graphics_530.txt`:
+
+| resource | value | implication |
+| ---------- | ------- | ------------- |
+| ROPs | 3 | fill rate limited - this is our ceiling |
+| TMUs | 24 | texture sampling is relatively fast |
+| memory | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM |
+| pixel rate | 2.85 GPixel/s | max theoretical throughput |
+| EUs | 24 (192 ALUs) | decent compute, weak vs discrete |
+| L3 cache | 768 KB | small, cache misses hurt |
+
+the bottleneck is ROPs (fill rate), not vertices or compute.
+
+## what works (proven)
+
+### SSBO instance data
+- 16 bytes per entity vs 64 bytes (matrices)
+- minimizes bandwidth on shared memory bus
+- result: ~5x improvement over instancing
+
+### compute shader updates
+- GPU does position/velocity updates
+- no CPU→GPU sync per frame
+- result: update time essentially free
+
+### texture sampling
+- 22.8 GTexel/s is fast relative to other units
+- pre-baked circle texture beats procedural math
+- result: 2x faster than procedural fragment shader
+
+### instanced triangles/quads
+- most optimized driver path
+- intel mesa heavily optimizes this
+- result: baseline, hard to beat
+
+## what doesn't work (proven)
+
+### point sprites
+- theoretically 6x fewer vertices
+- reality: 2.4x SLOWER on this hardware
+- triangle rasterizer is more optimized
+- see `docs/point_sprites_experiment.md`
+
+### procedural fragment shaders
+- `length()`, `smoothstep()`, `discard` are expensive
+- EUs are weaker than discrete GPUs
+- `discard` breaks early-z optimization
+- result: 3.7x slower than quads on main (point sprite test at 350k entities)
+
+### complex fragment math
+- only 24 EUs, each running 8 ALUs
+- transcendentals (sqrt, sin, cos) are 4x slower than FMAD
+- avoid in hot path
+
+## what to try next (theoretical)
+
+### likely to help
+
+| technique | why it should work | expected gain |
+| ----------- | ------------------- | --------------- |
+| frustum culling (GPU) | reduce fill rate, which is bottleneck | 10-30% depending on view |
+| smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40% |
+| early-z / depth pre-pass | skip fragment work for occluded pixels | moderate |
+
+### unlikely to help
+
+| technique | why it won't help |
+| ----------- | ------------------ |
+| more vertex optimization | already fill rate bound, not vertex bound |
+| SIMD on CPU | updates already on GPU |
+| multithreading | CPU isn't the bottleneck |
+| different vertex layouts | negligible vs fill rate |
+
+### uncertain (need to test)
+
+| technique | notes |
+| ----------- | ------- |
+| vulkan backend | might have less driver overhead, or might not matter |
+| indirect draw calls | GPU decides what to render, but we're not CPU bound |
+| fp16 in shaders | HD 530 has 2:1 fp16 ratio, might help fragment shader |
+
+## key insights
+
+1. fill rate is king - with only 3 ROPs, everything comes down to how many
+   pixels we're writing. optimizations that don't reduce pixel count won't
+   help.
+
+2. shared memory hurts - no dedicated VRAM means CPU and GPU compete for
+   bandwidth. keep data transfers minimal.
+
+3. driver optimization matters - the "common path" (triangles) is more
+   optimized than alternatives (points). don't be clever.
+
+4. texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture
+   lookups over ALU math in fragment shaders.
+
+5. avoid discard - breaks early-z, causes pipeline stalls. alpha blending
+   is faster than discard.
+
+## current ceiling
+
+~950k entities @ 57fps (SSBO + compute + quads)
+
+to go higher, we need to reduce fill rate:
+- cull offscreen entities
+- reduce entity size when zoomed out
+- or accept lower fps at higher counts
+
+## references
+
+- intel gen9 compute architecture whitepaper
+- empirical benchmarks in `benchmark_current_i56500t.log`
+- point sprites experiment in `docs/point_sprites_experiment.md`
diff --git a/docs/point_sprites_experiment.md b/docs/point_sprites_experiment.md
new file mode 100644
index 0000000..b9fe40e
--- /dev/null
+++ b/docs/point_sprites_experiment.md
@@ -0,0 +1,89 @@
+# point sprites experiment
+
+branch: `point-sprites` (point-sprites work)
+date: 2024-12
+hardware: intel hd 530 (skylake gt2, i5-6500T)
+
+## hypothesis
+
+point sprites should be faster than quads because:
+- 1 vertex per entity instead of 6 (quad = 2 triangles)
+- less vertex throughput
+- `gl_PointCoord` provides texture coords automatically
+
+## implementation
+
+### vertex shader changes
+- removed quad vertex attributes (position, texcoord)
+- use `gl_PointSize = 16.0 * zoom` for size control
+- position calculated from SSBO data only
+
+### fragment shader changes
+- use `gl_PointCoord` instead of vertex texcoord
+- sample circle texture for alpha
+
+### renderer changes
+- load `glEnable` and `glDrawArraysInstanced` via `rlGetProcAddress`
+- enable `GL_PROGRAM_POINT_SIZE`
+- draw with `glDrawArraysInstanced(GL_POINTS, 0, 1, count)`
+- removed VBO (no vertex data needed)
+
+## results
+
+### attempt 1: procedural circle in fragment shader
+
+```glsl
+vec2 coord = gl_PointCoord - vec2(0.5);
+float dist = length(coord);
+float alpha = 1.0 - smoothstep(0.4, 0.5, dist);
+if (alpha < 0.01) discard;
+```
+
+**benchmark @ 350k entities:**
+- point sprites: 23ms render, 43fps
+- quads (main): 6.2ms render, 151fps
+- **result: 3.7x SLOWER**
+
+**why:** `discard` breaks early-z optimization, `length()` and `smoothstep()` are ALU-heavy, intel integrated GPUs are weak at fragment shader math.
+
+### attempt 2: texture sampling
+
+```glsl
+float alpha = texture(circleTexture, gl_PointCoord).r;
+finalColor = vec4(fragColor, alpha);
+```
+
+**benchmark @ 450k entities:**
+- point sprites: 19.1ms render, 52fps
+- quads (main): 8.0ms render, 122fps
+- **result: 2.4x SLOWER**
+
+better than procedural, but still significantly slower than quads.
+
+## analysis
+
+the theoretical advantage (1/6 vertices) doesn't translate to real performance because:
+
+1. **triangle path is more optimized** - intel's driver heavily optimizes the standard triangle rasterization path. point sprites use a less-traveled code path.
+
+2. **fill rate is the bottleneck** - HD 530 has only 3 ROPs. we're bound by how fast we can write pixels, not by vertex count. reducing vertices from 6 to 1 doesn't help when fill rate is the constraint.
+
+3. **point size overhead** - each point requires computing `gl_PointSize` and setting up the point sprite rasterization, which may have per-vertex overhead.
+
+4. **texture cache behavior** - `gl_PointCoord` may have worse cache locality than explicit vertex texcoords.
+
+## conclusion
+
+**point sprites are a regression on intel hd 530.**
+
+the optimization makes theoretical sense but fails in practice on this hardware. the quad/triangle path is simply more optimized in intel's mesa driver.
+
+**keep this branch for testing on discrete GPUs** where point sprites might actually help (nvidia/amd have different optimization priorities).
+
+## lessons learned
+
+1. always benchmark, don't assume
+2. "fewer vertices" doesn't always mean faster
+3. integrated GPU optimization is different from discrete
+4. the most optimized path is usually the most common path (triangles)
+5. fill rate matters more than vertex count at high entity counts
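
the two attempts in the point sprites experiment were measured at different entity counts (350k and 450k), so their headline ratios are not directly comparable with each other. a minimal sketch of the arithmetic, using only the render times reported in the experiment. it assumes render time scales roughly linearly with entity count, which the notes do not confirm, and the names `RUNS`, `slowdown` and `ns_per_entity` are made up for illustration:

```python
# render times copied from the two benchmark runs in the experiment notes
RUNS = {
    "procedural": {"entities": 350_000, "points_ms": 23.0, "quads_ms": 6.2},
    "textured": {"entities": 450_000, "points_ms": 19.1, "quads_ms": 8.0},
}


def slowdown(run):
    """how many times slower point sprites rendered than quads in one run."""
    return run["points_ms"] / run["quads_ms"]


def ns_per_entity(render_ms, entities):
    """render cost per entity in nanoseconds (assumes linear scaling)."""
    return render_ms * 1e6 / entities


for name, run in RUNS.items():
    points = ns_per_entity(run["points_ms"], run["entities"])
    quads = ns_per_entity(run["quads_ms"], run["entities"])
    print(f"{name}: {slowdown(run):.1f}x slower, "
          f"points {points:.1f} ns/entity, quads {quads:.1f} ns/entity")
```

under that assumption quads cost about the same per entity in both runs (roughly 17.7 ns), which makes them a usable baseline across the two counts, while textured point sprites come out around 1.5x cheaper per entity than procedural ones.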
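
the guide's "fill rate is king" point can be turned into a rough per-entity pixel budget. a back-of-envelope sketch using only figures quoted in the guide (2.85 GPixel/s peak pixel rate, ~30GB/s shared bandwidth, ~950k entities @ 57fps, 16 vs 64 bytes per entity). it assumes the peak pixel rate is actually reachable and that each entity's instance data is read once per frame, both optimistic simplifications, and the constant and function names are hypothetical:

```python
import math

PIXEL_RATE = 2.85e9  # pixels/s, peak figure from the hardware table
BANDWIDTH = 30e9     # bytes/s, approximate shared DDR4 figure
ENTITIES = 950_000   # reported ceiling
FPS = 57             # frame rate at that ceiling


def pixels_per_entity(pixel_rate, entities, fps):
    """upper bound on pixels each entity may write per frame."""
    return pixel_rate / (entities * fps)


def bandwidth_fraction(bytes_per_entity, entities, fps, bandwidth):
    """share of memory bandwidth spent reading instance data once per frame."""
    return bytes_per_entity * entities * fps / bandwidth


budget = pixels_per_entity(PIXEL_RATE, ENTITIES, FPS)
print(f"pixel budget: {budget:.1f} px/entity (~{math.sqrt(budget):.1f} px square)")
for size in (16, 64):
    share = bandwidth_fraction(size, ENTITIES, FPS, BANDWIDTH)
    print(f"{size} bytes/entity: {share:.1%} of bandwidth")
```

at the reported ceiling this leaves about 52 pixels per entity per frame at best, roughly a 7x7 pixel quad. if the renderer really is ROP-bound, that fits the guide's advice that only reducing pixel count (culling, smaller entities when zoomed out) can raise the ceiling. on the same assumptions, instance data reads use under 3% of the quoted bandwidth at 16 bytes per entity, versus about 12% at 64 bytes.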