4.5 KiB
intel hd 530 optimization guide for lofivor
based on hardware specs and empirical testing.
hardware constraints
from intel_hd_graphics_530.txt:
| resource | value | implication |
|---|---|---|
| ROPs | 3 | fill rate limited - this is our ceiling |
| TMUs | 24 | texture sampling is relatively fast |
| memory | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM |
| pixel rate | 2.85 GPixel/s | max theoretical throughput |
| EUs | 24 (192 ALUs) | decent compute, weak vs discrete |
| L3 cache | 768 KB | small, cache misses hurt |
the bottleneck is ROPs (fill rate), not vertices or compute.
what works (proven)
SSBO instance data
- 16 bytes per entity vs 64 bytes (matrices)
- minimizes bandwidth on shared memory bus
- result: ~5x improvement over instancing
compute shader updates
- GPU does position/velocity updates
- no CPU→GPU sync per frame
- result: update time essentially free
texture sampling
- 22.8 GTexel/s is fast relative to other units
- pre-baked circle texture beats procedural math
- result: 2x faster than procedural fragment shader
instanced triangles/quads
- most optimized driver path
- intel mesa heavily optimizes this
- result: baseline, hard to beat
what doesn't work (proven)
point sprites
- theoretically 6x fewer vertices
- reality: 2.4x SLOWER on this hardware
- triangle rasterizer is more optimized
- see
docs/point_sprites_experiment.md
procedural fragment shaders
length(),smoothstep(),discardare expensive- EUs are weaker than discrete GPUs
discardbreaks early-z optimization- result: 3.7x slower than texture sampling
complex fragment math
- only 24 EUs, each running 8 ALUs
- transcendentals (sqrt, sin, cos) are 4x slower than FMAD
- avoid in hot path
what to try next (theoretical)
likely to help
| technique | why it should work | expected gain |
|---|---|---|
| frustum culling (GPU) | reduce fill rate, which is bottleneck | 10-30% depending on view |
| smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40% |
| early-z / depth pre-pass | skip fragment work for occluded pixels | moderate |
unlikely to help
| technique | why it won't help |
|---|---|
| more vertex optimization | already fill rate bound, not vertex bound |
| SIMD on CPU | updates already on GPU |
| multithreading | CPU isn't the bottleneck |
| different vertex layouts | negligible vs fill rate |
uncertain (need to test)
| technique | notes |
|---|---|
| vulkan backend | might have less driver overhead, or might not matter |
| indirect draw calls | GPU decides what to render, but we're not CPU bound |
| fp16 in shaders | HD 530 has 2:1 fp16 ratio, might help fragment shader |
key insights
-
fill rate is king - with only 3 ROPs, everything comes down to how many pixels we're writing. optimizations that don't reduce pixel count won't help.
-
shared memory hurts - no dedicated VRAM means CPU and GPU compete for bandwidth. keep data transfers minimal.
-
driver optimization matters - the "common path" (triangles) is more optimized than alternatives (points). don't be clever.
-
texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture lookups over ALU math in fragment shaders.
-
avoid discard - breaks early-z, causes pipeline stalls. alpha blending is faster than discard.
current ceiling
~950k entities @ 57fps (SSBO + compute + quads)
to go higher, we need to reduce fill rate:
- cull offscreen entities
- reduce entity size when zoomed out
- or accept lower fps at higher counts
references
- intel gen9 compute architecture whitepaper
- empirical benchmarks in
benchmark_current_i56500t.log - point sprites experiment in
docs/point_sprites_experiment.md