point sprites experiment

branch: point-sprites (point-sprites work) date: 2024-12 hardware: intel hd 530 (skylake gt2, i5-6500T)

hypothesis

point sprites should be faster than quads because:

1 vertex per entity instead of 6 (quad = 2 triangles)
less vertex throughput
gl_PointCoord provides texture coords automatically

implementation

vertex shader changes

removed quad vertex attributes (position, texcoord)
use gl_PointSize = 16.0 * zoom for size control
position calculated from SSBO data only

fragment shader changes

use gl_PointCoord instead of vertex texcoord
sample circle texture for alpha

renderer changes

load glEnable and glDrawArraysInstanced via rlGetProcAddress
enable GL_PROGRAM_POINT_SIZE
draw with glDrawArraysInstanced(GL_POINTS, 0, 1, count)
removed VBO (no vertex data needed)

results

attempt 1: procedural circle in fragment shader

vec2 coord = gl_PointCoord - vec2(0.5);
float dist = length(coord);
float alpha = 1.0 - smoothstep(0.4, 0.5, dist);
if (alpha < 0.01) discard;

benchmark @ 350k entities:

point sprites: 23ms render, 43fps
quads (main): 6.2ms render, 151fps
result: 3.7x SLOWER

why: discard breaks early-z optimization, length() and smoothstep() are ALU-heavy, intel integrated GPUs are weak at fragment shader math.

attempt 2: texture sampling

float alpha = texture(circleTexture, gl_PointCoord).r;
finalColor = vec4(fragColor, alpha);

benchmark @ 450k entities:

point sprites: 19.1ms render, 52fps
quads (main): 8.0ms render, 122fps
result: 2.4x SLOWER

better than procedural, but still significantly slower than quads.

analysis

the theoretical advantage (1/6 vertices) doesn't translate to real performance because:

triangle path is more optimized - intel's driver heavily optimizes the standard triangle rasterization path. point sprites use a less-traveled code path.
fill rate is the bottleneck - HD 530 has only 3 ROPs. we're bound by how fast we can write pixels, not by vertex count. reducing vertices from 6 to 1 doesn't help when fill rate is the constraint.
point size overhead - each point requires computing gl_PointSize and setting up the point sprite rasterization, which may have per-vertex overhead.
texture cache behavior - gl_PointCoord may have worse cache locality than explicit vertex texcoords.

conclusion

point sprites are a regression on intel hd 530.

the optimization makes theoretical sense but fails in practice on this hardware. the quad/triangle path is simply more optimized in intel's mesa driver.

keep this branch for testing on discrete GPUs where point sprites might actually help (nvidia/amd have different optimization priorities).

lessons learned

always benchmark, don't assume
"fewer vertices" doesn't always mean faster
integrated GPU optimization is different from discrete
the most optimized path is usually the most common path (triangles)
fill rate matters more than vertex count at high entity counts

3 KiB Raw Blame History