Add hd530 notes with point sprite experience
parent
a842800ede
commit
55b0d7fab7
2 changed files with 208 additions and 0 deletions
119
docs/hd530_optimization_guide.md
Normal file
# intel hd 530 optimization guide for lofivor

based on hardware specs and empirical testing.

## hardware constraints

from `intel_hd_graphics_530.txt`:

| resource | value | implication |
| ---------- | ------- | ------------- |
| ROPs | 3 | fill rate limited - this is our ceiling |
| TMUs | 24 | texture sampling is relatively fast |
| memory | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM |
| pixel rate | 2.85 GPixel/s | max theoretical throughput |
| EUs | 24 (192 ALUs) | decent compute, weak vs discrete |
| L3 cache | 768 KB | small, cache misses hurt |

the bottleneck is ROPs (fill rate), not vertices or compute.
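
the figures in the table can be cross-checked against each other. a minimal sketch, assuming each ROP writes one pixel per clock and each TMU samples one texel per clock (a common simplification, not something the spec file is quoted as saying); the variable names are illustrative:

```python
# figures taken from the hardware table above
ROPS = 3
TMUS = 24
EUS = 24
PIXEL_RATE = 2.85e9  # pixels per second

# assumption: one pixel per ROP per clock, so the clock falls out of the pixel rate
implied_clock_hz = PIXEL_RATE / ROPS

# assumption: one texel per TMU per clock
texel_rate = TMUS * implied_clock_hz

# 8 ALUs per EU gives the ALU total listed in the table
alus = EUS * 8

print(f"implied clock: {implied_clock_hz / 1e6:.0f} MHz")
print(f"texel rate:    {texel_rate / 1e9:.1f} GTexel/s")
print(f"texel:pixel:   {texel_rate / PIXEL_RATE:.0f}:1")
print(f"ALUs:          {alus}")
```

under those assumptions the sampler can fetch eight texels for every pixel the ROPs can write, which is the imbalance the rest of this guide leans on.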

## what works (proven)

### SSBO instance data
- 16 bytes per entity vs 64 bytes (matrices)
- minimizes bandwidth on shared memory bus
- result: ~5x improvement over instancing
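
to put the 16 vs 64 byte difference in context, here is a rough sketch of the instance-data traffic. it assumes every entity's record is read exactly once per frame and uses the entity count and frame rate from the "current ceiling" section of this guide; `instance_traffic` is a hypothetical helper, not part of lofivor:

```python
ENTITIES = 950_000        # current ceiling reported in this guide
FPS = 57
BUS_BYTES_PER_S = 30e9    # ~30 GB/s shared DDR4, from the hardware table

def instance_traffic(bytes_per_entity: int) -> tuple[float, float]:
    """Return (MB per frame, fraction of the bus per second) for one full read."""
    per_frame = ENTITIES * bytes_per_entity
    per_second = per_frame * FPS
    return per_frame / 1e6, per_second / BUS_BYTES_PER_S

for size in (16, 64):
    mb, share = instance_traffic(size)
    print(f"{size:2d} B/entity: {mb:5.1f} MB/frame, {share:.1%} of bus")
```

the byte count alone drops 4x, so it does not fully account for the measured ~5x; this sketch only covers the data-size part of the gain.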

### compute shader updates
- GPU does position/velocity updates
- no CPU→GPU sync per frame
- result: update time essentially free

### texture sampling
- 22.8 GTexel/s is fast relative to other units
- pre-baked circle texture beats procedural math
- result: 2x faster than procedural fragment shader

### instanced triangles/quads
- most optimized driver path
- intel mesa heavily optimizes this
- result: baseline, hard to beat

## what doesn't work (proven)

### point sprites
- theoretically 6x fewer vertices
- reality: 2.4x SLOWER on this hardware
- triangle rasterizer is more optimized
- see `docs/point_sprites_experiment.md`

### procedural fragment shaders
- `length()`, `smoothstep()`, `discard` are expensive
- EUs are weaker than discrete GPUs
- `discard` breaks early-z optimization
- result: 3.7x slower than textured quads at 350k entities (attempt 1 in `docs/point_sprites_experiment.md`)

### complex fragment math
- only 24 EUs, each running 8 ALUs
- transcendentals (sqrt, sin, cos) are 4x slower than FMAD
- avoid in hot path

## what to try next (theoretical)

### likely to help

| technique | why it should work | expected gain |
| ----------- | ------------------- | --------------- |
| frustum culling (GPU) | reduce fill rate, which is bottleneck | 10-30% depending on view |
| smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40% |
| early-z / depth pre-pass | skip fragment work for occluded pixels | moderate |
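
the culling row boils down to a rectangle overlap test per entity. below is a CPU-side reference sketch of that test, not the GPU implementation the table proposes; `View`, `is_visible` and `cull` are hypothetical names, and the view is assumed to be an axis-aligned rectangle in world units:

```python
from dataclasses import dataclass

@dataclass
class View:
    """Axis-aligned visible region in world units (hypothetical camera model)."""
    left: float
    right: float
    bottom: float
    top: float

def is_visible(x: float, y: float, half_size: float, view: View) -> bool:
    """True if a square entity centered at (x, y) overlaps the view rectangle."""
    return (x + half_size >= view.left and x - half_size <= view.right
            and y + half_size >= view.bottom and y - half_size <= view.top)

def cull(entities, half_size, view):
    """Keep only the entities whose square overlaps the view."""
    return [(x, y) for x, y in entities if is_visible(x, y, half_size, view)]

view = View(left=0.0, right=1280.0, bottom=0.0, top=720.0)
entities = [(640.0, 360.0), (-4.0, 100.0), (-50.0, 100.0), (2000.0, 360.0)]
print(cull(entities, half_size=8.0, view=view))
# → [(640.0, 360.0), (-4.0, 100.0)]
```

the entity at x = -4 survives because half of its square still overlaps the left edge; padding the test by the entity's half size avoids popping at the borders. a GPU version would run the same comparison per entity in a compute pass and write the survivors to a compacted buffer.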

### unlikely to help

| technique | why it won't help |
| ----------- | ------------------ |
| more vertex optimization | already fill rate bound, not vertex bound |
| SIMD on CPU | updates already on GPU |
| multithreading | CPU isn't the bottleneck |
| different vertex layouts | negligible vs fill rate |

### uncertain (need to test)

| technique | notes |
| ----------- | ------- |
| vulkan backend | might have less driver overhead, or might not matter |
| indirect draw calls | GPU decides what to render, but we're not CPU bound |
| fp16 in shaders | HD 530 has 2:1 fp16 ratio, might help fragment shader |
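
before testing fp16 it helps to know which values survive the narrower format. the sketch below only illustrates IEEE 754 half precision using Python's `struct` module; whether the driver actually runs reduced-precision GLSL at double rate is a separate question it does not answer. `to_fp16` is a hypothetical helper:

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# color/alpha channels in [0, 1]: check that every 8-bit level survives the round trip
alpha_ok = all(round(to_fp16(i / 255) * 255) == i for i in range(256))

# large world coordinates are riskier: fp16 has an 11-bit significand,
# so above 2048 it can no longer represent every integer
coord = 2049.0
print(alpha_ok, to_fp16(coord) == coord)
```

this suggests fp16 is a reasonable candidate for color and alpha math in the fragment shader, while positions are better left in fp32.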

## key insights

1. fill rate is king - with only 3 ROPs, everything comes down to how many
   pixels we're writing. optimizations that don't reduce pixel count won't
   help.

2. shared memory hurts - no dedicated VRAM means CPU and GPU compete for
   bandwidth. keep data transfers minimal.

3. driver optimization matters - the "common path" (triangles) is more
   optimized than alternatives (points). don't be clever.

4. texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture
   lookups over ALU math in fragment shaders.

5. avoid discard - breaks early-z, causes pipeline stalls. alpha blending
   is faster than discard.

## current ceiling

~950k entities @ 57fps (SSBO + compute + quads)

to go higher, we need to reduce fill rate:
- cull offscreen entities
- reduce entity size when zoomed out
- or accept lower fps at higher counts
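
the ceiling can be turned into a per-entity pixel budget. a back-of-envelope sketch, assuming the theoretical 2.85 GPixel/s peak is actually reached, every entity is fully on screen, and each pixel is written once (no overdraw or blending cost); `entity_ceiling` is a hypothetical helper:

```python
import math

PIXEL_RATE = 2.85e9   # pixels per second, theoretical peak from the hardware table
FPS = 57
ENTITIES = 950_000

pixels_per_frame = PIXEL_RATE / FPS
pixels_per_entity = pixels_per_frame / ENTITIES
side = math.sqrt(pixels_per_entity)

print(f"{pixels_per_frame / 1e6:.0f} Mpixels/frame")
print(f"{pixels_per_entity:.1f} pixels/entity (~{side:.1f} px square)")

def entity_ceiling(side_px: float, fps: float = FPS) -> int:
    """Entities that fit in the peak fill budget at a given on-screen size."""
    return int(PIXEL_RATE / fps / (side_px * side_px))

print(entity_ceiling(8.0), entity_ceiling(4.0))
# → 781250 3125000
```

in this model the budget works out to roughly 53 pixels per entity, and halving the on-screen side length quarters the pixel count, which quadruples the ceiling. real numbers will be lower than the peak, so treat the output as an upper bound.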

## references

- intel gen9 compute architecture whitepaper
- empirical benchmarks in `benchmark_current_i56500t.log`
- point sprites experiment in `docs/point_sprites_experiment.md`
89
docs/point_sprites_experiment.md
Normal file
# point sprites experiment

branch: `point-sprites` (point-sprites work)
date: 2024-12
hardware: intel hd 530 (skylake gt2, i5-6500T)

## hypothesis

point sprites should be faster than quads because:
- 1 vertex per entity instead of 6 (quad = 2 triangles)
- less vertex throughput
- `gl_PointCoord` provides texture coords automatically

## implementation

### vertex shader changes
- removed quad vertex attributes (position, texcoord)
- use `gl_PointSize = 16.0 * zoom` for size control
- position calculated from SSBO data only

### fragment shader changes
- use `gl_PointCoord` instead of vertex texcoord
- sample circle texture for alpha

### renderer changes
- load `glEnable` and `glDrawArraysInstanced` via `rlGetProcAddress`
- enable `GL_PROGRAM_POINT_SIZE`
- draw with `glDrawArraysInstanced(GL_POINTS, 0, 1, count)`
- removed VBO (no vertex data needed)

## results

### attempt 1: procedural circle in fragment shader

```glsl
// center the point coordinate so (0, 0) is the middle of the sprite
vec2 coord = gl_PointCoord - vec2(0.5);
float dist = length(coord);
// soft edge: fully opaque inside 0.4, fading to transparent at 0.5
float alpha = 1.0 - smoothstep(0.4, 0.5, dist);
// drop nearly transparent fragments
if (alpha < 0.01) discard;
```

**benchmark @ 350k entities:**
- point sprites: 23ms render, 43fps
- quads (main): 6.2ms render, 151fps
- **result: 3.7x SLOWER**

**why:** `discard` breaks early-z optimization, `length()` and `smoothstep()` are ALU-heavy, intel integrated GPUs are weak at fragment shader math.

### attempt 2: texture sampling

```glsl
float alpha = texture(circleTexture, gl_PointCoord).r;
finalColor = vec4(fragColor, alpha);
```

**benchmark @ 450k entities:**
- point sprites: 19.1ms render, 52fps
- quads (main): 8.0ms render, 122fps
- **result: 2.4x SLOWER**

better than procedural, but still significantly slower than quads.
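
the two runs used different entity counts, so comparing them fairly means normalizing per entity. a small sketch using only the render times reported above; `per_entity_ns` is a hypothetical helper and assumes cost scales linearly with entity count:

```python
# (entities, point-sprite render ms, quad render ms) as reported above
RUNS = {
    "procedural": (350_000, 23.0, 6.2),
    "textured":   (450_000, 19.1, 8.0),
}

def per_entity_ns(render_ms: float, entities: int) -> float:
    """Render time divided evenly across entities, in nanoseconds."""
    return render_ms * 1e6 / entities

for name, (count, points_ms, quads_ms) in RUNS.items():
    slowdown = points_ms / quads_ms
    print(f"{name}: {slowdown:.1f}x slower, "
          f"points {per_entity_ns(points_ms, count):.1f} ns/entity, "
          f"quads {per_entity_ns(quads_ms, count):.1f} ns/entity")
```

quads come out at about 17.7 to 17.8 ns per entity in both runs, which is consistent with roughly linear scaling over this range (two data points only). on the same basis the textured point sprite costs about 42 ns per entity against about 66 ns for the procedural one.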

## analysis

the theoretical advantage (1/6 vertices) doesn't translate to real performance because:

1. **triangle path is more optimized** - intel's driver heavily optimizes the standard triangle rasterization path. point sprites use a less-traveled code path.

2. **fill rate is the bottleneck** - HD 530 has only 3 ROPs. we're bound by how fast we can write pixels, not by vertex count. reducing vertices from 6 to 1 doesn't help when fill rate is the constraint.

3. **point size overhead** - each point requires computing `gl_PointSize` and setting up the point sprite rasterization, which may have per-vertex overhead.

4. **texture cache behavior** - `gl_PointCoord` may have worse cache locality than explicit vertex texcoords.

## conclusion

**point sprites are a regression on intel hd 530.**

the optimization makes theoretical sense but fails in practice on this hardware. the quad/triangle path is simply more optimized in intel's mesa driver.

**keep this branch for testing on discrete GPUs** where point sprites might actually help (nvidia/amd have different optimization priorities).

## lessons learned

1. always benchmark, don't assume
2. "fewer vertices" doesn't always mean faster
3. integrated GPU optimization is different from discrete
4. the most optimized path is usually the most common path (triangles)
5. fill rate matters more than vertex count at high entity counts