Add glossary and rops doc

Jared Miller 2025-12-19 07:36:23 -05:00
parent 55b0d7fab7
commit 5b890b18e4
2 changed files with 493 additions and 0 deletions

docs/GLOSSARY.txt Normal file

@@ -0,0 +1,292 @@
lofivor glossary
================
terms that come up when optimizing graphics.
clock cycle
-----------
one "tick" of the processor's internal clock.
the clock signal is derived from a reference oscillator (usually a quartz
crystal) and multiplied up to the chip's operating frequency.
each tick of that signal = one cycle. the processor does some work each cycle.
1 GHz = 1 billion cycles per second
1 MHz = 1 million cycles per second
so a 1 GHz processor has 1 billion opportunities to do work per second.
"one operation per cycle" is idealized. real work often takes multiple
cycles (memory access: 100+ cycles, division: 10-20 cycles, add: 1 cycle).
your HD 530 runs at ~950 MHz, so roughly 950 million cycles per second.
at 60fps, that's about 15.8 million cycles per frame.
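
the per-frame figure above is just clock rate divided by frame rate. a minimal
python sketch (the function name is made up for illustration; 950 MHz is the
figure quoted above):

```python
def cycles_per_frame(clock_hz: float, fps: float) -> float:
    """Clock cycles that elapse during one frame at the given frame rate."""
    return clock_hz / fps


hd530_clock_hz = 950e6  # ~950 MHz, as quoted in the text
millions = cycles_per_frame(hd530_clock_hz, 60) / 1e6
print(f"{millions:.1f} million cycles per frame")
```
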
fill rate
---------
pixels written per second. measured in megapixels/s or gigapixels/s.
fill rate = ROPs * clock speed * pixels per clock
your HD 530: 3 ROPs * 950 MHz * 1 = 2.85 GPixels/s theoretical max.
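
the formula above as a small python sketch. the function and parameter names
are illustrative only; the HD 530 inputs are the ones from the text:

```python
def fill_rate_pixels_per_sec(rops: int, clock_hz: float,
                             pixels_per_clock: float = 1.0) -> float:
    """Theoretical peak fill rate: ROPs * clock speed * pixels per clock."""
    return rops * clock_hz * pixels_per_clock


hd530 = fill_rate_pixels_per_sec(rops=3, clock_hz=950e6)
print(f"HD 530 theoretical max: {hd530 / 1e9:.2f} GPixels/s")
```

this is a ceiling, not a measurement. real throughput also depends on memory
bandwidth and on how fast fragments arrive from the shaders.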
overdraw
--------
drawing the same pixel multiple times per frame.
if two entities overlap, the back one gets drawn, then the front one
overwrites it. the back one's work was wasted.
overdraw ratio = total pixels drawn / screen pixels
1080p = 2.07M pixels. if you draw 20M pixels, overdraw = ~10x.
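
the ratio above, sketched in python (names are illustrative; the 20M figure is
the example from the text):

```python
def overdraw_ratio(pixels_drawn: float, width: int, height: int) -> float:
    """Total pixels drawn in a frame divided by the pixels on screen."""
    return pixels_drawn / (width * height)


ratio = overdraw_ratio(20_000_000, 1920, 1080)
print(f"overdraw at 1080p with 20M pixels drawn: {ratio:.1f}x")
```
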
bandwidth
---------
data transfer rate. measured in bytes/second (GB/s, MB/s).
memory bandwidth = how fast data moves between processor and RAM.
your HD 530 shares DDR4 with the CPU: ~30 GB/s total.
a discrete GPU has dedicated VRAM: 200-900 GB/s.
latency
-------
time delay. measured in nanoseconds (ns) or cycles.
memory latency = time to fetch data from RAM.
- L1 cache: ~4 cycles
- L2 cache: ~12 cycles
- L3 cache: ~40 cycles
- main RAM: ~200 cycles
this is why cache matters. a miss that goes all the way to main RAM is
~50x slower than an L1 hit (200 cycles vs 4).
throughput vs latency
---------------------
latency = how long ONE thing takes.
throughput = how many things per second.
a pipeline can have high latency but high throughput.
example: a car wash takes 10 minutes (latency).
but if cars enter every 1 minute, throughput is 60 cars/hour.
GPUs hide latency with throughput. one thread waits for memory?
switch to another thread. thousands of threads keep the GPU busy.
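
the car wash numbers can be checked with a short python sketch. it also shows
how many cars must be inside the wash at once to sustain that throughput
(latency divided by the entry interval, i.e. Little's law), which is the same
reason a GPU needs many threads in flight. function names are illustrative:

```python
def throughput_per_hour(entry_interval_min: float) -> float:
    """Items finished per hour once the pipeline is full."""
    return 60.0 / entry_interval_min


def items_in_flight(latency_min: float, entry_interval_min: float) -> float:
    """How many items are inside the pipeline at the same time."""
    return latency_min / entry_interval_min


print(throughput_per_hour(1.0), "cars/hour")
print(items_in_flight(10.0, 1.0), "cars in the wash at once")
```
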
draw call
---------
one command from CPU to GPU: "draw this batch of geometry."
each draw call has overhead:
- CPU prepares command buffer
- driver validates state
- GPU applies any state changes
1 draw call for 1M triangles: fast.
1M draw calls for 1M triangles: slow.
lofivor uses 1 draw call for all entities (instanced rendering).
instancing
----------
drawing many copies of the same geometry in one draw call.
instead of: draw triangle, draw triangle, draw triangle...
you say: draw this triangle 1 million times, here are the positions.
the GPU handles the replication. massively more efficient.
shader
------
a small program that runs on the GPU.
the name is historical - early shaders calculated shading/lighting.
but today: a shader is just software running on GPU hardware.
it doesn't have to do with shading at all.
more precisely: a shader turns one piece of data into another piece of data.
- vertex shader: positions → screen coordinates
- fragment shader: fragments → pixel colors
- compute shader: data → data (anything)
GPUs are massively parallel, so shaders run on thousands of inputs at once.
single-threaded CPU speed has grown slowly in recent years, while GPU
throughput keeps climbing. modern engines like UE5 increasingly use
shaders for work that used to be CPU-only.
SSBO (shader storage buffer object)
-----------------------------------
a block of GPU memory that shaders can read/write.
unlike uniforms (small, read-only), SSBOs can be large and writable.
lofivor stores all entity data in an SSBO: positions, velocities, colors.
compute shader
--------------
a shader that does general computation, not rendering.
runs on GPU cores but doesn't output pixels. just processes data.
lofivor uses compute shaders to update entity positions.
because compute exists, shaders can be anything: physics, AI, sorting,
image processing. the GPU is a general-purpose parallel processor.
fragment / pixel shader
-----------------------
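program that runs once per "fragment": a candidate pixel produced by
rasterizing a triangle. overlapping triangles produce several fragments
for the same pixel.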
determines the final color of each pixel. this is where:
- texture sampling happens
- lighting calculations happen
- the expensive math lives
lofivor's fragment shader: sample texture, multiply by color. trivial.
AAA game fragment shader: 500+ instructions. expensive.
vertex shader
-------------
program that runs once per vertex.
transforms 3D positions to screen positions. lofivor's vertex shader
reads from SSBO and positions the quad corners.
ROP (render output unit)
------------------------
final stage of GPU pipeline. writes pixels to framebuffer.
handles: depth test, stencil test, blending, antialiasing.
your bottleneck on HD 530. see docs/rops.txt.
TMU (texture mapping unit)
--------------------------
samples textures. reads pixel colors from texture memory.
your HD 530 has 24 TMUs. they're fast (22.8 GTexels/s).
texture sampling is cheap relative to ROPs on this hardware.
EU (execution unit)
-------------------
intel's term for shader cores.
your HD 530 has 24 EUs, each with 8 ALUs = 192 ALUs total.
these run your vertex, fragment, and compute shaders.
ALU (arithmetic logic unit)
---------------------------
does math. add, multiply, compare, bitwise operations.
one ALU can do one operation per cycle (simple ops).
complex ops (sqrt, sin, cos) take multiple cycles.
framebuffer
-----------
the image being rendered. lives in GPU memory.
at 1080p with 32-bit color: 1920 * 1080 * 4 = 8.3 MB.
double-buffered (front + back): 16.6 MB.
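
the size arithmetic above as a python sketch (names are illustrative; MB here
means 10^6 bytes, which matches the 8.3 figure):

```python
def framebuffer_bytes(width: int, height: int, bytes_per_pixel: int = 4,
                      buffers: int = 1) -> int:
    """Memory needed for `buffers` color buffers of the given size."""
    return width * height * bytes_per_pixel * buffers


single = framebuffer_bytes(1920, 1080)
double = framebuffer_bytes(1920, 1080, buffers=2)
print(f"single: {single / 1e6:.1f} MB, double-buffered: {double / 1e6:.1f} MB")
```
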
vsync
-----
synchronizing frame presentation with monitor refresh.
without vsync: tearing (half old frame, half new frame).
with vsync: smooth, but if a frame misses the refresh deadline (16.7 ms at
60 Hz), it waits for the next refresh.
frame budget
------------
time available per frame.
60 fps = 16.67 ms per frame
30 fps = 33.33 ms per frame
everything (CPU + GPU) must complete within budget or frames drop.
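
the budget is 1000 ms divided by the target frame rate. a minimal python
sketch (the function name is made up for illustration):

```python
def frame_budget_ms(fps: float) -> float:
    """Milliseconds available per frame at the target frame rate."""
    return 1000.0 / fps


for target in (60, 30):
    print(f"{target} fps = {frame_budget_ms(target):.2f} ms per frame")
```
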
pipeline stall
--------------
GPU waiting for something. bad for performance.
causes:
- waiting for memory (cache miss)
- waiting for previous stage to finish
- synchronization points (barriers)
- `discard` in fragment shader (breaks early-z)
early-z
-------
optimization: test depth BEFORE running fragment shader.
if pixel will be occluded, skip the expensive shader work.
`discard` breaks this because the GPU can't know whether the fragment
survives until the shader runs, so it can't commit the depth write early.
LOD (level of detail)
---------------------
using simpler geometry/textures for distant objects.
far away = fewer pixels = less detail needed.
saves vertices, texture bandwidth, and fill rate.
frustum culling
---------------
don't draw what's outside the camera view.
the "frustum" is the visible region: a pyramid with its tip cut off,
bounded by the near and far planes.
anything outside = wasted work. cull it before sending to GPU.
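
a minimal python sketch of the idea, assuming a 2D scene where the visible
region is an axis-aligned rectangle and each entity is a small square. this is
an illustration only, not lofivor's actual culling code; `Entity`,
`is_visible` and `cull` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class Entity:
    x: float          # center x
    y: float          # center y
    half_size: float  # half the side length of the entity's square


def is_visible(e: Entity, min_x: float, min_y: float,
               max_x: float, max_y: float) -> bool:
    """True if the entity's square overlaps the view rectangle at all."""
    return (e.x + e.half_size >= min_x and e.x - e.half_size <= max_x and
            e.y + e.half_size >= min_y and e.y - e.half_size <= max_y)


def cull(entities: list[Entity], min_x: float, min_y: float,
         max_x: float, max_y: float) -> list[Entity]:
    """Keep only the entities worth sending to the GPU."""
    return [e for e in entities if is_visible(e, min_x, min_y, max_x, max_y)]


scene = [Entity(10, 10, 2), Entity(-50, 10, 2), Entity(1921, 500, 2)]
visible = cull(scene, 0, 0, 1920, 1080)
print(len(visible), "of", len(scene), "entities survive culling")
```

the third entity is centered just outside the view but still overlaps the
edge, so it is kept. a 3D version tests against the six frustum planes instead
of four rectangle edges.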
spatial partitioning
--------------------
organizing entities by position for fast queries.
types: grid, quadtree, octree, BVH.
"which entities are near point X?" goes from O(n) to roughly O(log n) with
a tree, or to a handful of cell lookups with a grid.
essential for collision detection at scale.
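
a minimal python sketch of the simplest type listed above, a uniform grid.
this is an illustration under assumed names (`UniformGrid`, `insert`,
`query_radius`), not lofivor's implementation. a radius query only looks at
the cells the search circle can touch instead of scanning every entity:

```python
from collections import defaultdict


class UniformGrid:
    def __init__(self, cell_size: float):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cell_x, cell_y) -> [(id, x, y), ...]

    def _cell(self, x: float, y: float) -> tuple[int, int]:
        # Floor division keeps negative coordinates in the correct cell.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, entity_id: int, x: float, y: float) -> None:
        self.cells[self._cell(x, y)].append((entity_id, x, y))

    def query_radius(self, x: float, y: float, radius: float) -> list[int]:
        """Ids of all entities within `radius` of the point (x, y)."""
        reach = int(radius // self.cell_size) + 1  # cells to scan each way
        cx, cy = self._cell(x, y)
        found = []
        for gx in range(cx - reach, cx + reach + 1):
            for gy in range(cy - reach, cy + reach + 1):
                # .get() avoids creating empty cells in the defaultdict.
                for eid, ex, ey in self.cells.get((gx, gy), ()):
                    if (ex - x) ** 2 + (ey - y) ** 2 <= radius ** 2:
                        found.append(eid)
        return found


grid = UniformGrid(cell_size=10)
grid.insert(1, 5, 5)
grid.insert(2, 12, 5)
grid.insert(3, 100, 100)
print(sorted(grid.query_radius(6, 5, 8)))
```

a grid works well when entities are roughly the same size and evenly spread.
trees (quadtree, BVH) cope better with clustered or mixed-size data.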

docs/rops.txt Normal file

@@ -0,0 +1,201 @@
rops: render output units
=========================
what they are, where they came from, and what yours can do.
what is a rop?
--------------
ROP = Render Output Unit (originally "Raster Operations Pipeline")
it's the final stage of the GPU pipeline. after all the fancy shader
math is done, the ROP is the unit that actually writes pixels to memory.
think of it as the bottleneck between "calculated" and "visible."
a ROP does:
- depth testing (is this pixel in front of what's already there?)
- stencil testing (mask operations)
- blending (alpha, additive, etc)
- anti-aliasing resolve
- writing the final color to the framebuffer
one ROP can write one pixel per clock cycle (roughly).
the first rop
-------------
the term comes from the IBM 8514/A (1987), which had dedicated hardware
for "raster operations" - bitwise operations on pixels (AND, OR, XOR).
this was revolutionary because before this, the CPU did all pixel math.
but the modern ROP as we know it emerged with:
NVIDIA NV1 (1995)
one of the first chips with dedicated pixel output hardware
could do ~1 million textured pixels/second
3dfx Voodoo (1996)
the card that defined the modern GPU pipeline
had 1 TMU + 1 pixel pipeline (essentially 1 ROP)
could push 45 million pixels/second
that ONE pipeline ran Quake at 640x480
NVIDIA GeForce 256 (1999)
"the first GPU" - named itself with that term
4 pixel pipelines = 4 ROPs
480 million pixels/second
so the original consumer 3D cards had... 1 ROP. and they ran Quake.
what one rop can do
-------------------
let's do the math.
one ROP at 100 MHz (a round number for the late 90s; the original Voodoo
was clocked at about half that):
100 million cycles/second
~1 pixel per cycle
= 100 megapixels/second
at 640x480 @ 60fps:
640 * 480 * 60 = 18.4 megapixels/second needed
so ONE ROP at 100MHz could handle 640x480 with ~5x headroom for overdraw.
at 1024x768 @ 60fps:
1024 * 768 * 60 = 47 megapixels/second
now you're at 2x overdraw max. still playable, but tight.
one modern rop
--------------
a single modern ROP runs at ~1-2 GHz and can do more per cycle:
- multiple color outputs (MRT)
- 64-bit or 128-bit color formats
- compressed writes
rough estimate for one ROP at 1.5 GHz:
~1.5 billion pixels/second base throughput
at 1920x1080 @ 60fps:
1920 * 1080 * 60 = 124 megapixels/second
one ROP could handle 1080p with 12x overdraw headroom.
at 4K @ 60fps:
3840 * 2160 * 60 = 497 megapixels/second
one ROP could handle 4K with 3x overdraw. tight, but possible.
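
the headroom numbers in this section and the previous one all come from the
same division: fill rate over pixels needed per second. a python sketch, with
illustrative function names and the fill rates assumed in the text:

```python
def pixels_needed_per_sec(width: int, height: int, fps: int) -> int:
    """Pixels per second needed to touch every screen pixel once per frame."""
    return width * height * fps


def overdraw_headroom(fill_rate: float, width: int, height: int,
                      fps: int = 60) -> float:
    """How many times each pixel could be written per frame at this fill rate."""
    return fill_rate / pixels_needed_per_sec(width, height, fps)


cases = [
    ("100 MHz ROP, 640x480", 100e6, 640, 480),
    ("100 MHz ROP, 1024x768", 100e6, 1024, 768),
    ("1.5 GHz ROP, 1080p", 1.5e9, 1920, 1080),
    ("1.5 GHz ROP, 4K", 1.5e9, 3840, 2160),
]
for label, rate, w, h in cases:
    print(f"{label}: {overdraw_headroom(rate, w, h):.1f}x overdraw headroom")
```
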
your three rops (intel hd 530)
------------------------------
HD 530 specs:
- 3 ROPs
- ~950 MHz boost clock
- theoretical: 2.85 GPixels/second
let's break that down:
at 1080p @ 60fps (124 MP/s needed):
2850 MP/s / 124 MP/s = 23x overdraw budget
that's actually generous! you could draw each pixel 23 times.
so why does lofivor struggle at 1M entities?
because 1M entities at 4x4 pixels = 16M pixels minimum.
but with overlap? let's say average 10x overdraw:
160M pixels/frame
at 60fps = 9.6 billion pixels/second
your ceiling is 2.85 billion.
so you're 3.4x over budget. that's why you top out around 300k-400k
before frame drops (which matches empirical testing).
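
the budget math above as a python sketch. the 10x factor is the text's assumed
average, not a measurement, and the function names are illustrative:

```python
def required_pixel_rate(entities: int, pixels_per_entity: int,
                        overdraw_factor: float, fps: int) -> float:
    """Pixels per second the ROPs would have to write."""
    return entities * pixels_per_entity * overdraw_factor * fps


def max_entities(fill_rate: float, pixels_per_entity: int,
                 overdraw_factor: float, fps: int) -> float:
    """Entity count at which the required rate equals the fill rate."""
    return fill_rate / (pixels_per_entity * overdraw_factor * fps)


HD530_FILL_RATE = 2.85e9  # theoretical max from the specs above
PIXELS_PER_ENTITY = 4 * 4
ASSUMED_OVERDRAW = 10     # the "let's say" figure from the text

needed = required_pixel_rate(1_000_000, PIXELS_PER_ENTITY, ASSUMED_OVERDRAW, 60)
print(f"needed: {needed / 1e9:.1f} GPixels/s")
print(f"over budget by: {needed / HD530_FILL_RATE:.1f}x")
print(f"entities at budget: "
      f"{max_entities(HD530_FILL_RATE, PIXELS_PER_ENTITY, ASSUMED_OVERDRAW, 60):,.0f}")
```

changing the assumed factor or the entity size moves the ceiling in
proportion, so the result is only as good as those two inputs.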
the real constraint
-------------------
ROPs don't work in isolation. they're limited by:
1. MEMORY BANDWIDTH
each pixel write = memory access
HD 530 shares DDR4 with CPU (~30 GB/s)
at 32-bit color: 30GB/s / 4 bytes = 7.5 billion pixels/second max
but you're competing with CPU, texture reads, etc.
realistic: maybe 2-3 billion pixels/second for framebuffer writes
2. TEXTURE SAMPLING
if fragment shader samples textures, TMUs must keep up
HD 530 has 24 TMUs, so this isn't the bottleneck
3. SHADER EXECUTION
ROPs wait for fragments to be shaded
if shaders are slow, ROPs starve
lofivor's shaders are trivial, so this isn't the bottleneck
for lofivor specifically: your 3 ROPs are THE ceiling.
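
the memory bandwidth bound from item 1 above, sketched in python next to the
ROP bound. names are illustrative. the effective ceiling is whichever limit is
lower:

```python
def bandwidth_pixel_ceiling(bandwidth_bytes_per_sec: float,
                            bytes_per_pixel: int) -> float:
    """Pixel writes per second if ALL bandwidth went to the framebuffer."""
    return bandwidth_bytes_per_sec / bytes_per_pixel


def effective_ceiling(rop_rate: float, bandwidth_rate: float) -> float:
    """The slower of the two limits is the one you actually hit."""
    return min(rop_rate, bandwidth_rate)


rop_rate = 3 * 950e6                         # 3 ROPs at ~950 MHz
bw_rate = bandwidth_pixel_ceiling(30e9, 4)   # ~30 GB/s shared DDR4, 32-bit color
print(f"bandwidth bound: {bw_rate / 1e9:.1f} GPixels/s")
print(f"ROP bound:       {rop_rate / 1e9:.2f} GPixels/s")
print(f"effective:       {effective_ceiling(rop_rate, bw_rate) / 1e9:.2f} GPixels/s")
```

the bandwidth figure here is the best case. as noted above, the CPU and
texture reads take their share, so the real bandwidth bound sits lower.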
what could you do with more rops?
---------------------------------
comparison:
Intel HD 530: 3 ROPs, 2.85 GPixels/s
GTX 1060: 48 ROPs, 72 GPixels/s
RTX 3080: 96 ROPs, 164 GPixels/s
RTX 4090: 176 ROPs, 443 GPixels/s
with a GTX 1060 (25x your fill rate):
lofivor could probably hit 5-10 million entities
with an RTX 4090 (155x your fill rate):
tens of millions, limited by other factors
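
the multipliers above are the ratio of each card's fill rate to the HD 530's.
a python sketch using the figures from the comparison table (names are
illustrative). it computes fill-rate ratios only; the entity counts above are
estimates, not derived here:

```python
FILL_RATES_GPIX = {
    "Intel HD 530": 2.85,
    "GTX 1060": 72.0,
    "RTX 3080": 164.0,
    "RTX 4090": 443.0,
}


def relative_fill_rate(gpu: str, baseline: str = "Intel HD 530") -> float:
    """How many times more fill rate `gpu` has than `baseline`."""
    return FILL_RATES_GPIX[gpu] / FILL_RATES_GPIX[baseline]


for name in FILL_RATES_GPIX:
    print(f"{name}: {relative_fill_rate(name):.0f}x")
```
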
perspective: what 3 rops means historically
-------------------------------------------
your HD 530 has roughly the fill rate of:
- GeForce 4 Ti 4600 (2002): 4 ROPs, 1.2 GPixels/s
- Radeon 9700 Pro (2002): 8 ROPs, 2.6 GPixels/s
you're running hardware that, in raw pixel output, matches GPUs from
20+ years ago. but with modern features (compute shaders, SSBO, etc).
this is why lofivor is interesting: you're achieving 700k+ entities
on fill-rate-equivalent hardware that originally ran games with
maybe 10,000 triangles on screen.
the difference is technique. those 2002 games did complex per-pixel
lighting, shadows, multiple texture passes. lofivor does one texture
sample and one blend. same fill rate, 100x the entities.
the lesson
----------
ROPs are simple: they write pixels.
the number you have determines your pixel budget.
everything else (shaders, vertices, CPU logic) only matters if
the ROPs aren't your bottleneck.
with 3 ROPs, you have roughly 2.85 billion pixels/second.
spend them wisely:
- cull what's offscreen (don't spend pixels on invisible things)
- shrink distant objects (LOD saves pixels)
- reduce overlap (spatial organization)
- keep shaders simple (don't starve the ROPs)
your 3 ROPs can do remarkable things. Quake ran on 1.