Add glossary and rops doc

2025-12-19 07:36:23 -05:00 · 5b890b18e4
commit 5b890b18e4
parent 55b0d7fab7
2 changed files with 493 additions and 0 deletions
--- a/docs/GLOSSARY.txt
+++ b/docs/GLOSSARY.txt
@@ -0,0 +1,292 @@
+lofivor glossary
+================
+
+terms that come up when optimizing graphics.
+
+
+clock cycle
+-----------
+
+one "tick" of the processor's internal clock.
+
+a CPU or GPU derives its clock from a crystal oscillator, multiplied up
+to the core frequency. each tick = one cycle, and some work gets done per cycle.
+
+  1 GHz = 1 billion cycles per second
+  1 MHz = 1 million cycles per second
+
+so a 1 GHz processor has 1 billion opportunities to do work per second.
+
+"one operation per cycle" is idealized. real work often takes multiple
+cycles (memory access: 100+ cycles, division: 10-20 cycles, add: 1 cycle).
+
+your HD 530 runs at ~950 MHz, so roughly 950 million cycles per second.
+at 60fps, that's about 15.8 million cycles per frame.
+
+
+fill rate
+---------
+
+pixels written per second. measured in megapixels/s or gigapixels/s.
+
+  fill rate = ROPs * clock speed * pixels per clock (per ROP)
+
+your HD 530: 3 ROPs * 950 MHz * 1 = 2.85 GPixels/s theoretical max.
+
+
+overdraw
+--------
+
+drawing the same pixel multiple times per frame.
+
+if two entities overlap, the back one gets drawn, then the front one
+overwrites it. the back one's work was wasted.
+
+  overdraw ratio = total pixels drawn / screen pixels
+
+1080p = 2.07M pixels. if you draw 20M pixels, overdraw = ~10x.
+
+
+bandwidth
+---------
+
+data transfer rate. measured in bytes/second (GB/s, MB/s).
+
+memory bandwidth = how fast data moves between processor and RAM.
+
+your HD 530 shares DDR4 with the CPU: ~30 GB/s total.
+a discrete GPU has dedicated VRAM: 200-900 GB/s.
+
+
+latency
+-------
+
+time delay. measured in nanoseconds (ns) or cycles.
+
+memory latency = time to fetch data from RAM.
+  - L1 cache: ~4 cycles
+  - L2 cache: ~12 cycles
+  - L3 cache: ~40 cycles
+  - main RAM: ~200 cycles
+
+this is why cache matters. a miss that goes to RAM = ~50x slower than an L1 hit.
+
+
+throughput vs latency
+---------------------
+
+latency = how long ONE thing takes.
+throughput = how many things per second.
+
+a pipeline can have high latency but high throughput.
+
+example: a car wash takes 10 minutes (latency).
+but if cars enter every 1 minute, throughput is 60 cars/hour.
+
+GPUs hide latency with throughput. one thread waits for memory?
+switch to another thread. thousands of threads keep the GPU busy.
+
+
+draw call
+---------
+
+one command from CPU to GPU: "draw this batch of geometry."
+
+each draw call has overhead:
+  - CPU prepares command buffer
+  - driver validates state
+  - GPU may have to switch state
+
+1 draw call for 1M triangles: fast.
+1M draw calls for 1M triangles: slow.
+
+lofivor uses 1 draw call for all entities (instanced rendering).
+
+
+instancing
+----------
+
+drawing many copies of the same geometry in one draw call.
+
+instead of: draw triangle, draw triangle, draw triangle...
+you say: draw this triangle 1 million times, here are the positions.
+
+the GPU handles the replication. massively more efficient.
+
+
+shader
+------
+
+a small program that runs on the GPU.
+
+the name is historical - early shaders calculated shading/lighting.
+but today: a shader is just software running on GPU hardware.
+it doesn't have to involve shading at all.
+
+more precisely: a shader turns one piece of data into another piece of data.
+  - vertex shader: positions → screen coordinates
+  - fragment shader: fragments → pixel colors
+  - compute shader: data → data (anything)
+
+GPUs are massively parallel, so shaders run on thousands of inputs at once.
+single-core CPU speed has plateaued while GPU throughput keeps growing,
+so engines like UE5 increasingly move formerly CPU-only work into shaders.
+
+
+SSBO (shader storage buffer object)
+-----------------------------------
+
+a block of GPU memory that shaders can read/write.
+
+unlike uniforms (small, read-only), SSBOs can be large and writable.
+lofivor stores all entity data in an SSBO: positions, velocities, colors.
+
+
+compute shader
+--------------
+
+a shader that does general computation, not rendering.
+
+runs on GPU cores but doesn't output pixels. just processes data.
+lofivor uses compute shaders to update entity positions.
+
+because compute exists, shaders can be anything: physics, AI, sorting,
+image processing. the GPU is a general-purpose parallel processor.
+
+
+fragment / pixel shader
+-----------------------
+
+program that runs once per pixel (actually per "fragment").
+
+determines the final color of each pixel. this is where:
+  - texture sampling happens
+  - lighting calculations happen
+  - the expensive math lives
+
+lofivor's fragment shader: sample texture, multiply by color. trivial.
+AAA game fragment shader: 500+ instructions. expensive.
+
+
+vertex shader
+-------------
+
+program that runs once per vertex.
+
+transforms 3D positions to screen positions. lofivor's vertex shader
+reads from SSBO and positions the quad corners.
+
+
+ROP (render output unit)
+------------------------
+
+final stage of GPU pipeline. writes pixels to framebuffer.
+
+handles: depth test, stencil test, blending, antialiasing.
+your bottleneck on HD 530. see docs/rops.txt.
+
+
+TMU (texture mapping unit)
+--------------------------
+
+samples textures. reads pixel colors from texture memory.
+
+your HD 530 has 24 TMUs. they're fast (22.8 GTexels/s).
+texture sampling is cheap relative to ROPs on this hardware.
+
+
+EU (execution unit)
+-------------------
+
+intel's term for shader cores.
+
+your HD 530 has 24 EUs, each with 8 ALUs = 192 ALUs total.
+these run your vertex, fragment, and compute shaders.
+
+
+ALU (arithmetic logic unit)
+---------------------------
+
+does math. add, multiply, compare, bitwise operations.
+
+one ALU can do one operation per cycle (simple ops).
+complex ops (sqrt, sin, cos) take multiple cycles.
+
+
+framebuffer
+-----------
+
+the image being rendered. lives in GPU memory.
+
+at 1080p with 32-bit color: 1920 * 1080 * 4 = 8.3 MB.
+double-buffered (front + back): 16.6 MB.
+
+
+vsync
+-----
+
+synchronizing frame presentation with monitor refresh.
+
+without vsync: tearing (half old frame, half new frame).
+with vsync: no tearing, but miss the 16.7ms deadline (at 60 Hz) and you wait for the next refresh.
+
+
+frame budget
+------------
+
+time available per frame.
+
+  60 fps = 16.67 ms per frame
+  30 fps = 33.33 ms per frame
+
+everything (CPU + GPU) must complete within budget or frames drop.
+
+
+pipeline stall
+--------------
+
+GPU waiting for something. bad for performance.
+
+causes:
+  - waiting for memory (cache miss)
+  - waiting for previous stage to finish
+  - synchronization points (barriers)
+  - (related: `discard` in a fragment shader doesn't stall, but it can disable early-z)
+
+
+early-z
+-------
+
+optimization: test depth BEFORE running fragment shader.
+
+if the pixel will be occluded, skip the expensive shader work. `discard` can
+break this: the GPU can't commit a fragment's depth until it knows the shader kept it.
+
+
+LOD (level of detail)
+---------------------
+
+using simpler geometry/textures for distant objects.
+
+far away = fewer pixels = less detail needed.
+saves vertices, texture bandwidth, and fill rate.
+
+
+frustum culling
+---------------
+
+don't draw what's outside the camera view.
+
+the "frustum" is the visible region: a pyramid with its tip cut off (near plane to far plane).
+anything outside = wasted work. cull it before sending to GPU.
+
+
+spatial partitioning
+--------------------
+
+organizing entities by position for fast queries.
+
+types: grid, quadtree, octree, BVH.
+
+"which entities are near point X?" goes from O(n) to roughly O(log n) (trees) or O(1) per grid cell.
+essential for collision detection at scale.
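
a minimal python sketch of the glossary's arithmetic (cycles per frame, fill rate, overdraw ratio, framebuffer size, frame budget). the hardware numbers are the HD 530 figures quoted in the glossary; the function and constant names are made up for illustration and are not part of lofivor.

```python
# sketch only: names are hypothetical, numbers are the glossary's HD 530 figures.

CLOCK_HZ = 950e6         # ~950 MHz, as quoted in the glossary
ROPS = 3                 # as quoted in the glossary
PIXELS_PER_CLOCK = 1     # per ROP, idealized
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 4      # 32-bit color


def cycles_per_frame(clock_hz: float, fps: float) -> float:
    """Clock cycles available in one frame."""
    return clock_hz / fps


def fill_rate(rops: int, clock_hz: float, pixels_per_clock: int = 1) -> float:
    """Theoretical pixels written per second."""
    return rops * clock_hz * pixels_per_clock


def overdraw_ratio(pixels_drawn: float, width: int, height: int) -> float:
    """Total pixels drawn divided by screen pixels."""
    return pixels_drawn / (width * height)


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps


def framebuffer_bytes(width: int, height: int,
                      bytes_per_pixel: int = 4, buffers: int = 1) -> int:
    """Memory used by one or more color buffers."""
    return width * height * bytes_per_pixel * buffers


if __name__ == "__main__":
    print(f"cycles per frame @ 60fps: {cycles_per_frame(CLOCK_HZ, 60) / 1e6:.1f} million")
    print(f"fill rate: {fill_rate(ROPS, CLOCK_HZ, PIXELS_PER_CLOCK) / 1e9:.2f} GPixels/s")
    print(f"overdraw for 20M pixels at 1080p: {overdraw_ratio(20e6, WIDTH, HEIGHT):.1f}x")
    print(f"frame budget @ 60fps: {frame_budget_ms(60):.2f} ms")
    print(f"framebuffer: {framebuffer_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL) / 1e6:.1f} MB")
```

the fill rate here is the theoretical maximum from the formula; it says nothing about what the hardware sustains once blending and memory traffic are involved.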
--- a/docs/rops.txt
+++ b/docs/rops.txt
@@ -0,0 +1,201 @@
+rops: render output units
+=========================
+
+what they are, where they came from, and what yours can do.
+
+
+what is a rop?
+--------------
+
+ROP = Render Output Unit (originally "Raster Operations Pipeline")
+
+it's the final stage of the GPU pipeline. after all the fancy shader
+math is done, the ROP is the unit that actually writes pixels to memory.
+
+think of it as the bottleneck between "calculated" and "visible."
+
+a ROP does:
+  - depth testing (is this pixel in front of what's already there?)
+  - stencil testing (mask operations)
+  - blending (alpha, additive, etc)
+  - anti-aliasing resolve
+  - writing the final color to the framebuffer
+
+one ROP can write one pixel per clock cycle (roughly).
+
+
+the first rop
+-------------
+
+the term comes from "raster operations": bitwise operations on pixels
+(AND, OR, XOR). the IBM 8514/A (1987) had dedicated hardware for them, a big
+step for PCs, where the CPU had mostly been doing the pixel math.
+
+but the modern ROP as we know it emerged with:
+
+  NVIDIA NV1 (1995)
+    one of the first chips with dedicated pixel output hardware
+    could do ~1 million textured pixels/second
+
+  3dfx Voodoo (1996)
+    the card that defined the modern GPU pipeline
+    had 1 TMU + 1 pixel pipeline (essentially 1 ROP)
+    could push 45 million pixels/second
+    that ONE pipeline ran Quake at 640x480
+
+  NVIDIA GeForce 256 (1999)
+    "the first GPU" - named itself with that term
+    4 pixel pipelines = 4 ROPs
+    480 million pixels/second
+
+so the original consumer 3D cards had... 1 ROP. and they ran Quake.
+
+
+what one rop can do
+-------------------
+
+let's do the math.
+
+one ROP at 100 MHz (a round number; the original Voodoo ran at ~50 MHz):
+  100 million cycles/second
+  ~1 pixel per cycle
+  = 100 megapixels/second
+
+at 640x480 @ 60fps:
+  640 * 480 * 60 = 18.4 megapixels/second needed
+
+so ONE ROP at 100MHz could handle 640x480 with ~5x headroom for overdraw.
+
+at 1024x768 @ 60fps:
+  1024 * 768 * 60 = 47 megapixels/second
+
+now you're at 2x overdraw max. still playable, but tight.
+
+
+one modern rop
+--------------
+
+a single modern ROP runs at ~1-2 GHz and can do more per cycle:
+  - multiple color outputs (MRT)
+  - 64-bit or 128-bit color formats
+  - compressed writes
+
+rough estimate for one ROP at 1.5 GHz:
+  ~1.5 billion pixels/second base throughput
+
+at 1920x1080 @ 60fps:
+  1920 * 1080 * 60 = 124 megapixels/second
+
+one ROP could handle 1080p with 12x overdraw headroom.
+
+at 4K @ 60fps:
+  3840 * 2160 * 60 = 497 megapixels/second
+
+one ROP could handle 4K with 3x overdraw. tight, but possible.
+
+
+your three rops (intel hd 530)
+------------------------------
+
+HD 530 specs:
+  - 3 ROPs
+  - ~950 MHz boost clock
+  - theoretical: 2.85 GPixels/second
+
+let's break that down:
+
+at 1080p @ 60fps (124 MP/s needed):
+  2850 / 124 = 23x overdraw budget
+
+that's actually generous! you could draw each pixel 23 times.
+
+so why does lofivor struggle at 1M entities?
+
+because 1M entities at 4x4 pixels = 16M pixel writes per frame,
+which on a 2.07M-pixel screen is already ~7.7x overdraw:
+  16M pixels/frame
+  at 60fps = 960 million pixels/second
+
+that is a third of the 2.85 billion ceiling, and the ceiling is theoretical.
+
+blending and shared memory bandwidth (next section) pull the sustained rate
+lower, and empirical testing tops out around 300k-400k before frame drops.
+
+
+the real constraint
+-------------------
+
+ROPs don't work in isolation. they're limited by:
+
+  1. MEMORY BANDWIDTH
+     each pixel write = memory access
+     HD 530 shares DDR4 with CPU (~30 GB/s)
+     at 32-bit color: 30GB/s / 4 bytes = 7.5 billion pixels/second max
+     but you're competing with CPU, texture reads, etc.
+     realistic: maybe 2-3 billion pixels/second for framebuffer writes
+
+  2. TEXTURE SAMPLING
+     if fragment shader samples textures, TMUs must keep up
+     HD 530 has 24 TMUs, so this isn't the bottleneck
+
+  3. SHADER EXECUTION
+     ROPs wait for fragments to be shaded
+     if shaders are slow, ROPs starve
+     lofivor's shaders are trivial, so this isn't the bottleneck
+
+for lofivor specifically: your 3 ROPs are THE ceiling.
+
+
+what could you do with more rops?
+---------------------------------
+
+comparison:
+
+  Intel HD 530:     3 ROPs,  2.85 GPixels/s
+  GTX 1060:        48 ROPs,  72 GPixels/s
+  RTX 3080:        96 ROPs, 164 GPixels/s
+  RTX 4090:       176 ROPs, 443 GPixels/s
+
+with a GTX 1060 (25x your fill rate):
+  lofivor could probably hit 5-10 million entities
+
+with an RTX 4090 (155x your fill rate):
+  tens of millions, limited by other factors
+
+
+perspective: what 3 rops means historically
+-------------------------------------------
+
+your HD 530 has roughly the fill rate of:
+  - GeForce 4 Ti 4600 (2002): 4 ROPs, 1.2 GPixels/s
+  - Radeon 9700 Pro (2002): 8 ROPs, 2.6 GPixels/s
+
+you're running hardware that, in raw pixel output, matches GPUs from
+20+ years ago. but with modern features (compute shaders, SSBO, etc).
+
+this is why lofivor is interesting: you're achieving 700k+ entities
+on fill-rate-equivalent hardware that originally ran games with
+maybe 10,000 triangles on screen.
+
+the difference is technique. those 2002 games did complex per-pixel
+lighting, shadows, multiple texture passes. lofivor does one texture
+sample and one blend. same fill rate, 100x the entities.
+
+
+the lesson
+----------
+
+ROPs are simple: they write pixels.
+
+the number you have determines your pixel budget.
+everything else (shaders, vertices, CPU logic) only matters if
+the ROPs aren't your bottleneck.
+
+with 3 ROPs, you have roughly 2.85 billion pixels/second.
+spend them wisely:
+  - cull what's offscreen (don't spend pixels on invisible things)
+  - shrink distant objects (LOD saves pixels)
+  - reduce overlap (spatial organization)
+  - keep shaders simple (don't starve the ROPs)
+
+your 3 ROPs can do remarkable things. Quake ran on 1.
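
a rough python sketch of the pixel-budget math in the rops doc: overdraw headroom at a given resolution, the raw pixel-write rate for N entities, and the memory-bandwidth ceiling. the inputs are the doc's own figures; the function names are invented for illustration. note it counts raw pixel writes only (entities * pixels each * fps), with no extra multiplier for overlap or blending cost.

```python
# sketch only: hypothetical helper names, inputs taken from the rops doc.


def pixels_needed_per_second(width: int, height: int, fps: int) -> int:
    """Pixels per second needed to touch every screen pixel once per frame."""
    return width * height * fps


def overdraw_headroom(fill_rate: float, width: int, height: int, fps: int) -> float:
    """How many times each screen pixel could be written at this fill rate."""
    return fill_rate / pixels_needed_per_second(width, height, fps)


def entity_write_rate(entities: int, pixels_per_entity: int, fps: int) -> int:
    """Raw pixel writes per second if every entity pixel is drawn."""
    return entities * pixels_per_entity * fps


def bandwidth_pixel_ceiling(bandwidth_bytes_per_s: float, bytes_per_pixel: int) -> float:
    """Pixel writes per second if all memory bandwidth went to the framebuffer."""
    return bandwidth_bytes_per_s / bytes_per_pixel


if __name__ == "__main__":
    hd530 = 3 * 950e6  # 3 ROPs * ~950 MHz * 1 pixel/clock, theoretical

    print(f"one 100 MHz ROP at 640x480@60: {overdraw_headroom(100e6, 640, 480, 60):.1f}x")
    print(f"one 1.5 GHz ROP at 4K@60: {overdraw_headroom(1.5e9, 3840, 2160, 60):.1f}x")
    print(f"HD 530 at 1080p@60: {overdraw_headroom(hd530, 1920, 1080, 60):.0f}x")

    demand = entity_write_rate(1_000_000, 4 * 4, 60)
    print(f"1M entities at 4x4 px: {demand / 1e6:.0f} MPixels/s "
          f"({demand / hd530:.0%} of theoretical fill rate)")

    ceiling = bandwidth_pixel_ceiling(30e9, 4)
    print(f"bandwidth ceiling at 32-bit color: {ceiling / 1e9:.1f} GPixels/s")
```

swapping in the ROP counts and clocks of another GPU gives the same comparison for different hardware, still as theoretical peaks only.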