From 5b890b18e45c5ea76d046d1ea6abfc8ac16c6c79 Mon Sep 17 00:00:00 2001
From: Jared Miller
Date: Fri, 19 Dec 2025 07:36:23 -0500
Subject: [PATCH] Add glossary and rops doc

---
 docs/GLOSSARY.txt | 292 ++++++++++++++++++++++++++++++++++++++++++++++
 docs/rops.txt     | 201 +++++++++++++++++++++++++++++++
 2 files changed, 493 insertions(+)
 create mode 100644 docs/GLOSSARY.txt
 create mode 100644 docs/rops.txt

diff --git a/docs/GLOSSARY.txt b/docs/GLOSSARY.txt
new file mode 100644
index 0000000..ed821ed
--- /dev/null
+++ b/docs/GLOSSARY.txt
@@ -0,0 +1,292 @@
+lofivor glossary
+================
+
+terms that come up when optimizing graphics.
+
+
+clock cycle
+-----------
+
+one "tick" of the processor's internal clock.
+
+a CPU or GPU has a crystal oscillator that vibrates at a fixed rate.
+each vibration = one cycle. the processor does some work each cycle.
+
+  1 GHz = 1 billion cycles per second
+  1 MHz = 1 million cycles per second
+
+so a 1 GHz processor has 1 billion opportunities to do work per second.
+
+"one operation per cycle" is idealized. real work often takes multiple
+cycles (memory access: 100+ cycles, division: 10-20 cycles, add: 1 cycle).
+
+your HD 530 runs at ~950 MHz, so roughly 950 million cycles per second.
+at 60fps, that's about 15.8 million cycles per frame.
+
+
+fill rate
+---------
+
+pixels written per second. measured in megapixels/s or gigapixels/s.
+
+  fill rate = ROPs * clock speed * pixels per clock
+
+your HD 530: 3 ROPs * 950 MHz * 1 = 2.85 GPixels/s theoretical max.
+
+
+overdraw
+--------
+
+drawing the same pixel multiple times per frame.
+
+if two entities overlap, the back one gets drawn, then the front one
+overwrites it. the back one's work was wasted.
+
+  overdraw ratio = total pixels drawn / screen pixels
+
+1080p = 2.07M pixels. if you draw 20M pixels, overdraw = ~10x.
+
+
+bandwidth
+---------
+
+data transfer rate. measured in bytes/second (GB/s, MB/s).
+
+memory bandwidth = how fast data moves between processor and RAM.
+
+your HD 530 shares DDR4 with the CPU: ~30 GB/s total.
+a discrete GPU has dedicated VRAM: 200-900 GB/s.
+
+
+latency
+-------
+
+time delay. measured in nanoseconds (ns) or cycles.
+
+memory latency = time to fetch data from RAM.
+  - L1 cache: ~4 cycles
+  - L2 cache: ~12 cycles
+  - L3 cache: ~40 cycles
+  - main RAM: ~200 cycles
+
+this is why cache matters. a cache miss = 50x slower than a hit.
+
+
+throughput vs latency
+---------------------
+
+latency = how long ONE thing takes.
+throughput = how many things per second.
+
+a pipeline can have high latency but high throughput.
+
+example: a car wash takes 10 minutes (latency).
+but if cars enter every 1 minute, throughput is 60 cars/hour.
+
+GPUs hide latency with throughput. one thread waits for memory?
+switch to another thread. thousands of threads keep the GPU busy.
+
+
+draw call
+---------
+
+one command from CPU to GPU: "draw this batch of geometry."
+
+each draw call has overhead:
+  - CPU prepares command buffer
+  - driver validates state
+  - GPU switches context
+
+1 draw call for 1M triangles: fast.
+1M draw calls for 1M triangles: slow.
+
+lofivor uses 1 draw call for all entities (instanced rendering).
+
+
+instancing
+----------
+
+drawing many copies of the same geometry in one draw call.
+
+instead of: draw triangle, draw triangle, draw triangle...
+you say: draw this triangle 1 million times, here are the positions.
+
+the GPU handles the replication. massively more efficient.
+
+
+shader
+------
+
+a small program that runs on the GPU.
+
+the name is historical - early shaders calculated shading/lighting.
+but today: a shader is just software running on GPU hardware.
+it doesn't have to do with shading at all.
+
+more precisely: a shader turns one piece of data into another piece of data.
+  - vertex shader: positions → screen coordinates
+  - fragment shader: fragments → pixel colors
+  - compute shader: data → data (anything)
+
+GPUs are massively parallel, so shaders run on thousands of inputs at once.
+CPUs have stagnated; GPUs keep getting faster. modern engines like UE5
+increasingly use shaders for work that used to be CPU-only.
+
+
+SSBO (shader storage buffer object)
+-----------------------------------
+
+a block of GPU memory that shaders can read/write.
+
+unlike uniforms (small, read-only), SSBOs can be large and writable.
+lofivor stores all entity data in an SSBO: positions, velocities, colors.
+
+
+compute shader
+--------------
+
+a shader that does general computation, not rendering.
+
+runs on GPU cores but doesn't output pixels. just processes data.
+lofivor uses compute shaders to update entity positions.
+
+because compute exists, shaders can be anything: physics, AI, sorting,
+image processing. the GPU is a general-purpose parallel processor.
+
+
+fragment / pixel shader
+-----------------------
+
+program that runs once per pixel (actually per "fragment").
+
+determines the final color of each pixel. this is where:
+  - texture sampling happens
+  - lighting calculations happen
+  - the expensive math lives
+
+lofivor's fragment shader: sample texture, multiply by color. trivial.
+AAA game fragment shader: 500+ instructions. expensive.
+
+
+vertex shader
+-------------
+
+program that runs once per vertex.
+
+transforms 3D positions to screen positions. lofivor's vertex shader
+reads from SSBO and positions the quad corners.
+
+
+ROP (render output unit)
+------------------------
+
+final stage of GPU pipeline. writes pixels to framebuffer.
+
+handles: depth test, stencil test, blending, antialiasing.
+your bottleneck on HD 530. see docs/rops.txt.
+
+
+TMU (texture mapping unit)
+--------------------------
+
+samples textures. reads pixel colors from texture memory.
+
+your HD 530 has 24 TMUs. they're fast (22.8 GTexels/s).
+texture sampling is cheap relative to ROPs on this hardware.
+
+
+EU (execution unit)
+-------------------
+
+intel's term for shader cores.
+
+your HD 530 has 24 EUs, each with 8 ALUs = 192 ALUs total.
+these run your vertex, fragment, and compute shaders.
+
+
+ALU (arithmetic logic unit)
+---------------------------
+
+does math. add, multiply, compare, bitwise operations.
+
+one ALU can do one operation per cycle (simple ops).
+complex ops (sqrt, sin, cos) take multiple cycles.
+
+
+framebuffer
+-----------
+
+the image being rendered. lives in GPU memory.
+
+at 1080p with 32-bit color: 1920 * 1080 * 4 = 8.3 MB.
+double-buffered (front + back): 16.6 MB.
+
+
+vsync
+-----
+
+synchronizing frame presentation with monitor refresh.
+
+without vsync: tearing (half old frame, half new frame).
+with vsync: smooth, but if you miss 16.7ms, you wait for next refresh.
+
+
+frame budget
+------------
+
+time available per frame.
+
+  60 fps = 16.67 ms per frame
+  30 fps = 33.33 ms per frame
+
+everything (CPU + GPU) must complete within budget or frames drop.
+
+
+pipeline stall
+--------------
+
+GPU waiting for something. bad for performance.
+
+causes:
+  - waiting for memory (cache miss)
+  - waiting for previous stage to finish
+  - synchronization points (barriers)
+  - `discard` in fragment shader (breaks early-z)
+
+
+early-z
+-------
+
+optimization: test depth BEFORE running fragment shader.
+
+if pixel will be occluded, skip the expensive shader work.
+`discard` breaks this because GPU can't know depth until shader runs.
+
+
+LOD (level of detail)
+---------------------
+
+using simpler geometry/textures for distant objects.
+
+far away = fewer pixels = less detail needed.
+saves vertices, texture bandwidth, and fill rate.
+
+
+frustum culling
+---------------
+
+don't draw what's outside the camera view.
+
+the "frustum" is the pyramid-shaped visible region.
+anything outside = wasted work. cull it before sending to GPU.
+
+
+spatial partitioning
+--------------------
+
+organizing entities by position for fast queries.
+
+types: grid, quadtree, octree, BVH.
+
+"which entities are near point X?" goes from O(n) to O(log n).
+essential for collision detection at scale.
diff --git a/docs/rops.txt b/docs/rops.txt
new file mode 100644
index 0000000..30c82f1
--- /dev/null
+++ b/docs/rops.txt
@@ -0,0 +1,201 @@
+rops: render output units
+=========================
+
+what they are, where they came from, and what yours can do.
+
+
+what is a rop?
+--------------
+
+ROP = Render Output Unit (originally "Raster Operations Pipeline")
+
+it's the final stage of the GPU pipeline. after all the fancy shader
+math is done, the ROP is the unit that actually writes pixels to memory.
+
+think of it as the bottleneck between "calculated" and "visible."
+
+a ROP does:
+  - depth testing (is this pixel in front of what's already there?)
+  - stencil testing (mask operations)
+  - blending (alpha, additive, etc)
+  - anti-aliasing resolve
+  - writing the final color to the framebuffer
+
+one ROP can write one pixel per clock cycle (roughly).
+
+
+the first rop
+-------------
+
+the term comes from the IBM 8514/A (1987), which had dedicated hardware
+for "raster operations" - bitwise operations on pixels (AND, OR, XOR).
+this was revolutionary because before this, the CPU did all pixel math.
+
+but the modern ROP as we know it emerged with:
+
+  NVIDIA NV1 (1995)
+    one of the first chips with dedicated pixel output hardware
+    could do ~1 million textured pixels/second
+
+  3dfx Voodoo (1996)
+    the card that defined the modern GPU pipeline
+    had 1 TMU + 1 pixel pipeline (essentially 1 ROP)
+    could push 45 million pixels/second
+    that ONE pipeline ran Quake at 640x480
+
+  NVIDIA GeForce 256 (1999)
+    "the first GPU" - named itself with that term
+    4 pixel pipelines = 4 ROPs
+    480 million pixels/second
+
+so the original consumer 3D cards had... 1 ROP. and they ran Quake.
+
+
+what one rop can do
+-------------------
+
+let's do the math.
+
+one ROP at 100 MHz (3dfx Voodoo era):
+  100 million cycles/second
+  ~1 pixel per cycle
+  = 100 megapixels/second
+
+at 640x480 @ 60fps:
+  640 * 480 * 60 = 18.4 megapixels/second needed
+
+so ONE ROP at 100MHz could handle 640x480 with ~5x headroom for overdraw.
+
+at 1024x768 @ 60fps:
+  1024 * 768 * 60 = 47 megapixels/second
+
+now you're at 2x overdraw max. still playable, but tight.
+
+
+one modern rop
+--------------
+
+a single modern ROP runs at ~1-2 GHz and can do more per cycle:
+  - multiple color outputs (MRT)
+  - 64-bit or 128-bit color formats
+  - compressed writes
+
+rough estimate for one ROP at 1.5 GHz:
+  ~1.5 billion pixels/second base throughput
+
+at 1920x1080 @ 60fps:
+  1920 * 1080 * 60 = 124 megapixels/second
+
+one ROP could handle 1080p with 12x overdraw headroom.
+
+at 4K @ 60fps:
+  3840 * 2160 * 60 = 497 megapixels/second
+
+one ROP could handle 4K with 3x overdraw. tight, but possible.
+
+
+your three rops (intel hd 530)
+------------------------------
+
+HD 530 specs:
+  - 3 ROPs
+  - ~950 MHz boost clock
+  - theoretical: 2.85 GPixels/second
+
+let's break that down:
+
+at 1080p @ 60fps (124 MP/s needed):
+  2850 / 124 = 23x overdraw budget
+
+that's actually generous! you could draw each pixel 23 times.
+
+so why does lofivor struggle at 1M entities?
+
+because 1M entities at 4x4 pixels = 16M pixels minimum.
+but with overlap? let's say average 10x overdraw:
+  160M pixels/frame
+  at 60fps = 9.6 billion pixels/second
+
+your ceiling is 2.85 billion.
+
+so you're 3.4x over budget. that's why you top out around 300k-400k
+before frame drops (which matches empirical testing).
+
+
+the real constraint
+-------------------
+
+ROPs don't work in isolation. they're limited by:
+
+  1. MEMORY BANDWIDTH
+     each pixel write = memory access
+     HD 530 shares DDR4 with CPU (~30 GB/s)
+     at 32-bit color: 30GB/s / 4 bytes = 7.5 billion pixels/second max
+     but you're competing with CPU, texture reads, etc.
+     realistic: maybe 2-3 billion pixels for framebuffer writes
+
+  2. TEXTURE SAMPLING
+     if fragment shader samples textures, TMUs must keep up
+     HD 530 has 24 TMUs, so this isn't the bottleneck
+
+  3. SHADER EXECUTION
+     ROPs wait for fragments to be shaded
+     if shaders are slow, ROPs starve
+     lofivor's shaders are trivial, so this isn't the bottleneck
+
+for lofivor specifically: your 3 ROPs are THE ceiling.
+
+
+what could you do with more rops?
+---------------------------------
+
+comparison:
+
+  Intel HD 530: 3 ROPs, 2.85 GPixels/s
+  GTX 1060: 48 ROPs, 72 GPixels/s
+  RTX 3080: 96 ROPs, 164 GPixels/s
+  RTX 4090: 176 ROPs, 443 GPixels/s
+
+with a GTX 1060 (25x your fill rate):
+  lofivor could probably hit 5-10 million entities
+
+with an RTX 4090 (155x your fill rate):
+  tens of millions, limited by other factors
+
+
+perspective: what 3 rops means historically
+-------------------------------------------
+
+your HD 530 has roughly the fill rate of:
+  - GeForce 4 Ti 4600 (2002): 4 ROPs, 1.2 GPixels/s
+  - Radeon 9700 Pro (2002): 8 ROPs, 2.6 GPixels/s
+
+you're running hardware that, in raw pixel output, matches GPUs from
+20+ years ago. but with modern features (compute shaders, SSBO, etc).
+
+this is why lofivor is interesting: you're achieving 700k+ entities
+on fill-rate-equivalent hardware that originally ran games with
+maybe 10,000 triangles on screen.
+
+the difference is technique. those 2002 games did complex per-pixel
+lighting, shadows, multiple texture passes. lofivor does one texture
+sample and one blend. same fill rate, 100x the entities.
+
+
+the lesson
+----------
+
+ROPs are simple: they write pixels.
+
+the number you have determines your pixel budget.
+everything else (shaders, vertices, CPU logic) only matters if
+the ROPs aren't your bottleneck.
+
+with 3 ROPs, you have roughly 2.85 billion pixels/second.
+spend them wisely:
+  - cull what's offscreen (don't spend pixels on invisible things)
+  - shrink distant objects (LOD saves pixels)
+  - reduce overlap (spatial organization)
+  - keep shaders simple (don't starve the ROPs)
+
+your 3 ROPs can do remarkable things. Quake ran on 1.
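
the budget arithmetic in rops.txt can be sketched in a few lines of python. this is only an illustration under the doc's own assumptions (3 ROPs, ~950 MHz, 1 pixel per clock); the helper names `fill_rate`, `screen_pixels_per_second`, and `overdraw_budget` are hypothetical and not part of lofivor.

```python
def fill_rate(rops: int, clock_hz: int, pixels_per_clock: int = 1) -> int:
    """Theoretical peak fill rate in pixels/second: ROPs * clock * pixels per clock."""
    return rops * clock_hz * pixels_per_clock


def screen_pixels_per_second(width: int, height: int, fps: int) -> int:
    """Pixels/second needed to write every screen pixel exactly once per frame."""
    return width * height * fps


def overdraw_budget(rate: int, width: int, height: int, fps: int) -> float:
    """How many times each screen pixel could be written per frame at peak fill rate."""
    return rate / screen_pixels_per_second(width, height, fps)


# figures quoted in the doc; 1 pixel per clock is the doc's rough assumption
hd530 = fill_rate(rops=3, clock_hz=950_000_000)
budget_1080p = overdraw_budget(hd530, 1920, 1080, 60)

print(f"HD 530 peak fill rate: {hd530 / 1e9:.2f} GPixels/s")
print(f"1080p @ 60fps overdraw budget: {budget_1080p:.1f}x")
```

plugging in other ROP counts or clocks gives a rough ceiling for other hardware. real throughput will likely be lower once memory bandwidth is shared with the CPU, as the "real constraint" section notes.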