119 lines
4.5 KiB
Markdown
119 lines
4.5 KiB
Markdown
# intel hd 530 optimization guide for lofivor
|
|
|
|
based on hardware specs and empirical testing.
|
|
|
|
## hardware constraints
|
|
|
|
from `intel_hd_graphics_530.txt`:
|
|
|
|
| resource | value | implication |
|
|
| ---------- | ------- | ------------- |
|
|
| ROPs | 3 | fill rate limited - this is our ceiling |
|
|
| TMUs | 24 | texture sampling is relatively fast |
|
|
| memory | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM |
|
|
| pixel rate | 2.85 GPixel/s | max theoretical throughput |
|
|
| EUs | 24 (192 ALUs) | decent compute, weak vs discrete |
|
|
| L3 cache | 768 KB | small, cache misses hurt |
|
|
|
|
the bottleneck is ROPs (fill rate), not vertices or compute.
|
|
|
|
## what works (proven)
|
|
|
|
### SSBO instance data
|
|
- 16 bytes per entity vs 64 bytes (matrices)
|
|
- minimizes bandwidth on shared memory bus
|
|
- result: ~5x improvement over instancing
|
|
|
|
### compute shader updates
|
|
- GPU does position/velocity updates
|
|
- no CPU→GPU sync per frame
|
|
- result: update time essentially free
|
|
|
|
### texture sampling
|
|
- 22.8 GTexel/s is fast relative to other units
|
|
- pre-baked circle texture beats procedural math
|
|
- result: 2x faster than procedural fragment shader
|
|
|
|
### instanced triangles/quads
|
|
- most optimized driver path
|
|
- intel mesa heavily optimizes this
|
|
- result: baseline, hard to beat
|
|
|
|
## what doesn't work (proven)
|
|
|
|
### point sprites
|
|
- theoretically 6x fewer vertices
|
|
- reality: 2.4x SLOWER on this hardware
|
|
- triangle rasterizer is more optimized
|
|
- see `docs/point_sprites_experiment.md`
|
|
|
|
### procedural fragment shaders
|
|
- `length()`, `smoothstep()`, `discard` are expensive
|
|
- EUs are weaker than discrete GPUs
|
|
- `discard` breaks early-z optimization
|
|
- result: 3.7x slower than texture sampling
|
|
|
|
### complex fragment math
|
|
- only 24 EUs, each running 8 ALUs
|
|
- transcendentals (sqrt, sin, cos) are 4x slower than FMAD
|
|
- avoid in hot path
|
|
|
|
## what to try next (theoretical)
|
|
|
|
### likely to help
|
|
|
|
| technique | why it should work | expected gain |
|
|
| ----------- | ------------------- | --------------- |
|
|
| frustum culling (GPU) | reduce fill rate, which is bottleneck | 10-30% depending on view |
|
|
| smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40% |
|
|
| early-z / depth pre-pass | skip fragment work for occluded pixels | moderate |
|
|
|
|
### unlikely to help
|
|
|
|
| technique | why it won't help |
|
|
| ----------- | ------------------ |
|
|
| more vertex optimization | already fill rate bound, not vertex bound |
|
|
| SIMD on CPU | updates already on GPU |
|
|
| multithreading | CPU isn't the bottleneck |
|
|
| different vertex layouts | negligible vs fill rate |
|
|
|
|
### uncertain (need to test)
|
|
|
|
| technique | notes |
|
|
| ----------- | ------- |
|
|
| vulkan backend | might have less driver overhead, or might not matter |
|
|
| indirect draw calls | GPU decides what to render, but we're not CPU bound |
|
|
| fp16 in shaders | HD 530 has 2:1 fp16 ratio, might help fragment shader |
|
|
|
|
## key insights
|
|
|
|
1. fill rate is king - with only 3 ROPs, everything comes down to how many
|
|
pixels we're writing. optimizations that don't reduce pixel count won't
|
|
help.
|
|
|
|
2. shared memory hurts - no dedicated VRAM means CPU and GPU compete for
|
|
bandwidth. keep data transfers minimal.
|
|
|
|
3. driver optimization matters - the "common path" (triangles) is more
|
|
optimized than alternatives (points). don't be clever.
|
|
|
|
4. texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture
|
|
lookups over ALU math in fragment shaders.
|
|
|
|
5. avoid discard - breaks early-z, causes pipeline stalls. alpha blending
|
|
is faster than discard.
|
|
|
|
## current ceiling
|
|
|
|
~950k entities @ 57fps (SSBO + compute + quads)
|
|
|
|
to go higher, we need to reduce fill rate:
|
|
- cull offscreen entities
|
|
- reduce entity size when zoomed out
|
|
- or accept lower fps at higher counts
|
|
|
|
## references
|
|
|
|
- intel gen9 compute architecture whitepaper
|
|
- empirical benchmarks in `benchmark_current_i56500t.log`
|
|
- point sprites experiment in `docs/point_sprites_experiment.md`
|