lofivor/docs/hd530_optimization_guide.md

# intel hd 530 optimization guide for lofivor

based on hardware specs and empirical testing.

## hardware constraints

from `intel_hd_graphics_530.txt`:

| resource   | value               | implication                                 |
| ---------- | -------             | -------------                               |
| ROPs       | 3                   | fill rate limited - this is our ceiling     |
| TMUs       | 24                  | texture sampling is relatively fast         |
| memory     | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM              |
| pixel rate | 2.85 GPixel/s       | max theoretical throughput                  |
| EUs        | 24 (192 ALUs)       | decent compute, weak vs discrete            |
| L3 cache   | 768 KB              | small, cache misses hurt                    |

the bottleneck is ROPs (fill rate), not vertices or compute.

## what works (proven)

### SSBO instance data
- 16 bytes per entity vs 64 bytes (matrices)
- minimizes bandwidth on shared memory bus
- result: ~5x improvement over instancing

### compute shader updates
- GPU does position/velocity updates
- no CPU→GPU sync per frame
- result: update time essentially free

### texture sampling
- 22.8 GTexel/s is fast relative to other units
- pre-baked circle texture beats procedural math
- result: 2x faster than procedural fragment shader

### instanced triangles/quads
- most optimized driver path
- intel mesa heavily optimizes this
- result: baseline, hard to beat

## what doesn't work (proven)

### point sprites
- theoretically 6x fewer vertices
- reality: 2.4x SLOWER on this hardware
- triangle rasterizer is more optimized
- see `docs/point_sprites_experiment.md`

### procedural fragment shaders
- `length()`, `smoothstep()`, `discard` are expensive
- EUs are weaker than discrete GPUs
- `discard` breaks early-z optimization
- result: 3.7x slower than texture sampling

### complex fragment math
- only 24 EUs, each running 8 ALUs
- transcendentals (sqrt, sin, cos) are 4x slower than FMAD
- avoid in hot path

## what to try next (theoretical)

### likely to help

| technique                            | why it should work                      | expected gain            |
| -----------                          | -------------------                     | ---------------          |
| frustum culling (GPU)                | reduce fill rate, which is bottleneck   | 10-30% depending on view |
| smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40%                   |
| early-z / depth pre-pass             | skip fragment work for occluded pixels  | moderate                 |

### unlikely to help

| technique                | why it won't help                         |
| -----------              | ------------------                        |
| more vertex optimization | already fill rate bound, not vertex bound |
| SIMD on CPU              | updates already on GPU                    |
| multithreading           | CPU isn't the bottleneck                  |
| different vertex layouts | negligible vs fill rate                   |

### uncertain (need to test)

| technique           | notes                                                 |
| -----------         | -------                                               |
| vulkan backend      | might have less driver overhead, or might not matter  |
| indirect draw calls | GPU decides what to render, but we're not CPU bound   |
| fp16 in shaders     | HD 530 has 2:1 fp16 ratio, might help fragment shader |

## key insights

1. fill rate is king - with only 3 ROPs, everything comes down to how many
   pixels we're writing. optimizations that don't reduce pixel count won't
   help.

2. shared memory hurts - no dedicated VRAM means CPU and GPU compete for
   bandwidth. keep data transfers minimal.

3. driver optimization matters - the "common path" (triangles) is more
   optimized than alternatives (points). don't be clever.

4. texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture
   lookups over ALU math in fragment shaders.

5. avoid discard - breaks early-z, causes pipeline stalls. alpha blending
   is faster than discard.

## current ceiling

~950k entities @ 57fps (SSBO + compute + quads)

to go higher, we need to reduce fill rate:
- cull offscreen entities
- reduce entity size when zoomed out
- or accept lower fps at higher counts

## references

- intel gen9 compute architecture whitepaper
- empirical benchmarks in `benchmark_current_i56500t.log`
- point sprites experiment in `docs/point_sprites_experiment.md`