lofivor/docs/hd530_optimization_guide.md

119 lines
4.5 KiB
Markdown

# intel hd 530 optimization guide for lofivor
based on hardware specs and empirical testing.
## hardware constraints
from `intel_hd_graphics_530.txt`:
| resource | value | implication |
| ---------- | ------- | ------------- |
| ROPs | 3 | fill rate limited - this is our ceiling |
| TMUs | 24 | texture sampling is relatively fast |
| memory | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM |
| pixel rate | 2.85 GPixel/s | max theoretical throughput |
| EUs | 24 (192 ALUs) | decent compute, weak vs discrete |
| L3 cache | 768 KB | small, cache misses hurt |
the bottleneck is ROPs (fill rate), not vertices or compute.
## what works (proven)
### SSBO instance data
- 16 bytes per entity vs 64 bytes (matrices)
- minimizes bandwidth on shared memory bus
- result: ~5x improvement over instancing
### compute shader updates
- GPU does position/velocity updates
- no CPU→GPU sync per frame
- result: update time essentially free
### texture sampling
- 22.8 GTexel/s is fast relative to other units
- pre-baked circle texture beats procedural math
- result: 2x faster than procedural fragment shader
### instanced triangles/quads
- most optimized driver path
- intel mesa heavily optimizes this
- result: baseline, hard to beat
## what doesn't work (proven)
### point sprites
- theoretically 6x fewer vertices
- reality: 2.4x SLOWER on this hardware
- triangle rasterizer is more optimized
- see `docs/point_sprites_experiment.md`
### procedural fragment shaders
- `length()`, `smoothstep()`, `discard` are expensive
- EUs are weaker than discrete GPUs
- `discard` breaks early-z optimization
- result: 3.7x slower than texture sampling
### complex fragment math
- only 24 EUs, each running 8 ALUs
- transcendentals (sqrt, sin, cos) are 4x slower than FMAD
- avoid in hot path
## what to try next (theoretical)
### likely to help
| technique | why it should work | expected gain |
| ----------- | ------------------- | --------------- |
| frustum culling (GPU) | reduce fill rate, which is bottleneck | 10-30% depending on view |
| smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40% |
| early-z / depth pre-pass | skip fragment work for occluded pixels | moderate |
### unlikely to help
| technique | why it won't help |
| ----------- | ------------------ |
| more vertex optimization | already fill rate bound, not vertex bound |
| SIMD on CPU | updates already on GPU |
| multithreading | CPU isn't the bottleneck |
| different vertex layouts | negligible vs fill rate |
### uncertain (need to test)
| technique | notes |
| ----------- | ------- |
| vulkan backend | might have less driver overhead, or might not matter |
| indirect draw calls | GPU decides what to render, but we're not CPU bound |
| fp16 in shaders | HD 530 has 2:1 fp16 ratio, might help fragment shader |
## key insights
1. fill rate is king - with only 3 ROPs, everything comes down to how many
pixels we're writing. optimizations that don't reduce pixel count won't
help.
2. shared memory hurts - no dedicated VRAM means CPU and GPU compete for
bandwidth. keep data transfers minimal.
3. driver optimization matters - the "common path" (triangles) is more
optimized than alternatives (points). don't be clever.
4. texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture
lookups over ALU math in fragment shaders.
5. avoid discard - breaks early-z, causes pipeline stalls. alpha blending
is faster than discard.
## current ceiling
~950k entities @ 57fps (SSBO + compute + quads)
to go higher, we need to reduce fill rate:
- cull offscreen entities
- reduce entity size when zoomed out
- or accept lower fps at higher counts
## references
- intel gen9 compute architecture whitepaper
- empirical benchmarks in `benchmark_current_i56500t.log`
- point sprites experiment in `docs/point_sprites_experiment.md`