Add glossary and rops doc
commit 5b890b18e4 (parent 55b0d7fab7)
2 changed files with 493 additions and 0 deletions

docs/GLOSSARY.txt (normal file, 292 lines added)
@ -0,0 +1,292 @

lofivor glossary
================

terms that come up when optimizing graphics.


clock cycle
-----------

one "tick" of the processor's internal clock.

a CPU or GPU runs off a clock signal, usually derived from a crystal
oscillator and multiplied up to the target frequency.
each tick = one cycle. the processor does some work each cycle.

1 GHz = 1 billion cycles per second
1 MHz = 1 million cycles per second

so a 1 GHz processor has 1 billion opportunities to do work per second.

"one operation per cycle" is idealized. real work often takes multiple
cycles (memory access: 100+ cycles, division: 10-20 cycles, add: 1 cycle).

your HD 530 runs at ~950 MHz, so roughly 950 million cycles per second.
at 60fps, that's about 15.8 million cycles per frame.


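the cycles-per-frame arithmetic above, as a small sketch. the function
name is made up for illustration; the 950 MHz figure is the one used in
this glossary.

```python
def cycles_per_frame(clock_hz: float, fps: float) -> float:
    """Clock cycles available in one frame at the given frame rate."""
    return clock_hz / fps


HD530_CLOCK_HZ = 950e6  # ~950 MHz, the figure used in this glossary

for fps in (30, 60, 144):
    budget = cycles_per_frame(HD530_CLOCK_HZ, fps)
    print(f"{fps:>3} fps: {budget / 1e6:.1f} million cycles per frame")
```

raising the target frame rate shrinks the per-frame cycle budget in
direct proportion.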
fill rate
---------

pixels written per second. measured in megapixels/s or gigapixels/s.

fill rate = ROPs * clock speed * pixels per clock

your HD 530: 3 ROPs * 950 MHz * 1 = 2.85 GPixels/s theoretical max.


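the fill rate formula above, written out as a sketch. the function and
parameter names are illustrative only.

```python
def fill_rate(rops: int, clock_hz: float, pixels_per_clock: float = 1.0) -> float:
    """Theoretical peak fill rate in pixels per second."""
    return rops * clock_hz * pixels_per_clock


hd530 = fill_rate(rops=3, clock_hz=950e6)
print(f"HD 530: {hd530 / 1e9:.2f} GPixels/s")  # HD 530: 2.85 GPixels/s
```

this is a theoretical peak. real throughput is usually lower.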
overdraw
--------

drawing the same pixel multiple times per frame.

if two entities overlap, the back one gets drawn, then the front one
overwrites it. the back one's work was wasted.

overdraw ratio = total pixels drawn / screen pixels

1080p = 2.07M pixels. if you draw 20M pixels, overdraw = ~10x.


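a sketch of the overdraw ratio, using the 20M-pixel example above plus a
second case (1M quads of 4x4 pixels) taken from docs/rops.txt. the
function name is illustrative.

```python
def overdraw_ratio(pixels_drawn: float, width: int, height: int) -> float:
    """Average number of times each screen pixel is written per frame."""
    return pixels_drawn / (width * height)


# the example from the text: 20M pixels drawn at 1080p
print(f"{overdraw_ratio(20e6, 1920, 1080):.1f}x")

# 1M entities as 4x4 quads, all on screen
print(f"{overdraw_ratio(1_000_000 * 4 * 4, 1920, 1080):.1f}x")
```

the ratio is an average over the screen. clustered entities give much
higher overdraw in some regions and none in others.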
bandwidth
---------

data transfer rate. measured in bytes/second (GB/s, MB/s).

memory bandwidth = how fast data moves between processor and RAM.

your HD 530 shares DDR4 with the CPU: ~30 GB/s total.
a discrete GPU has dedicated VRAM: 200-900 GB/s.


latency
-------

time delay. measured in nanoseconds (ns) or cycles.

memory latency = time to fetch data from memory. rough figures:
- L1 cache: ~4 cycles
- L2 cache: ~12 cycles
- L3 cache: ~40 cycles
- main RAM: ~200 cycles

this is why cache matters. a miss that goes all the way to RAM is ~50x
slower than an L1 hit (200 vs 4 cycles).


throughput vs latency
---------------------

latency = how long ONE thing takes.
throughput = how many things per second.

a pipeline can have high latency but high throughput.

example: a car wash takes 10 minutes (latency).
but if cars enter every 1 minute, throughput is 60 cars/hour.

GPUs hide latency with throughput. one thread waits for memory?
switch to another thread. thousands of threads keep the GPU busy.


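the car wash example as a toy model. it assumes an idealized pipeline
where a new item can enter every `interval` minutes regardless of how
many are already inside; the names are made up for illustration.

```python
def pipeline_finish_times(n_items: int, latency: float, interval: float) -> list:
    """Finish time of each item when a new one enters every `interval`."""
    return [i * interval + latency for i in range(n_items)]


LATENCY_MIN = 10   # one wash takes 10 minutes
INTERVAL_MIN = 1   # a new car enters every minute

finish = pipeline_finish_times(60, LATENCY_MIN, INTERVAL_MIN)
pipelined_total = finish[-1]      # 59 * 1 + 10 = 69 minutes for 60 cars
serial_total = 60 * LATENCY_MIN   # 600 minutes if washed one at a time

print(f"first car done after {finish[0]} min")
print(f"60 cars: {pipelined_total} min pipelined vs {serial_total} min serial")
print(f"steady-state throughput: {60 / INTERVAL_MIN:.0f} cars/hour")
```

each car still waits the full 10 minutes. only the rate at which
finished cars come out improves.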
draw call
---------

one command from CPU to GPU: "draw this batch of geometry."

each draw call has overhead:
- CPU prepares command buffer
- driver validates state
- GPU may need to change state

1 draw call for 1M triangles: fast.
1M draw calls for 1M triangles: slow.

lofivor uses 1 draw call for all entities (instanced rendering).


instancing
----------

drawing many copies of the same geometry in one draw call.

instead of: draw triangle, draw triangle, draw triangle...
you say: draw this triangle 1 million times, here are the positions.

the GPU handles the replication. massively more efficient.


shader
------

a small program that runs on the GPU.

the name is historical - early shaders calculated shading/lighting.
but today: a shader is just software running on GPU hardware.
it doesn't have to have anything to do with shading.

more precisely: a shader turns one piece of data into another piece of data.
- vertex shader: positions → screen coordinates
- fragment shader: fragments → pixel colors
- compute shader: data → data (anything)

GPUs are massively parallel, so shaders run on thousands of inputs at once.
single-thread CPU speed has grown slowly in recent years, while GPU
throughput keeps climbing. modern engines like UE5
increasingly use shaders for work that used to be CPU-only.


SSBO (shader storage buffer object)
-----------------------------------

a block of GPU memory that shaders can read/write.

unlike uniforms (small, read-only), SSBOs can be large and writable.
lofivor stores all entity data in an SSBO: positions, velocities, colors.


compute shader
--------------

a shader that does general computation, not rendering.

runs on GPU cores but doesn't output pixels. just processes data.
lofivor uses compute shaders to update entity positions.

because compute exists, shaders can be anything: physics, AI, sorting,
image processing. the GPU is a general-purpose parallel processor.


fragment / pixel shader
-----------------------

program that runs once per pixel (actually per "fragment").

determines the final color of each pixel. this is where:
- texture sampling happens
- lighting calculations happen
- the expensive math lives

lofivor's fragment shader: sample texture, multiply by color. trivial.
AAA game fragment shader: 500+ instructions. expensive.


vertex shader
-------------

program that runs once per vertex.

transforms 3D positions to screen positions. lofivor's vertex shader
reads from SSBO and positions the quad corners.


ROP (render output unit)
------------------------

final stage of GPU pipeline. writes pixels to framebuffer.

handles: depth test, stencil test, blending, antialiasing.
your bottleneck on HD 530. see docs/rops.txt.


TMU (texture mapping unit)
--------------------------

samples textures. reads pixel colors from texture memory.

your HD 530 has 24 TMUs. they're fast (22.8 GTexels/s).
texture sampling is cheap relative to ROPs on this hardware.


EU (execution unit)
-------------------

intel's term for shader cores.

your HD 530 has 24 EUs, each with 8 ALUs = 192 ALUs total.
these run your vertex, fragment, and compute shaders.


ALU (arithmetic logic unit)
---------------------------

does math. add, multiply, compare, bitwise operations.

one ALU can do one operation per cycle (simple ops).
complex ops (sqrt, sin, cos) take multiple cycles.


framebuffer
-----------

the image being rendered. lives in GPU memory.

at 1080p with 32-bit color: 1920 * 1080 * 4 = 8.3 MB.
double-buffered (front + back): 16.6 MB.


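the framebuffer size arithmetic above, as a sketch. MB here means
10^6 bytes, matching the figures in the text; the function name is
illustrative.

```python
def framebuffer_bytes(width: int, height: int,
                      bytes_per_pixel: int = 4, buffers: int = 1) -> int:
    """Memory needed for `buffers` color buffers of the given size."""
    return width * height * bytes_per_pixel * buffers


single = framebuffer_bytes(1920, 1080)             # 8,294,400 bytes
double = framebuffer_bytes(1920, 1080, buffers=2)  # 16,588,800 bytes

print(f"single: {single / 1e6:.1f} MB, double-buffered: {double / 1e6:.1f} MB")
```

a depth buffer, if used, adds to this total.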
vsync
-----

synchronizing frame presentation with monitor refresh.

without vsync: tearing (half old frame, half new frame).
with vsync: smooth, but if you miss the 16.7 ms deadline (at 60 Hz),
you wait for the next refresh.


frame budget
------------

time available per frame.

60 fps = 16.67 ms per frame
30 fps = 33.33 ms per frame

everything (CPU + GPU) must complete within budget or frames drop.


pipeline stall
--------------

GPU waiting for something. bad for performance.

causes:
- waiting for memory (cache miss)
- waiting for previous stage to finish
- synchronization points (barriers)
- `discard` in fragment shader (not a stall as such, but it defeats
  early-z, so more fragments get shaded)


early-z
-------

optimization: test depth BEFORE running fragment shader.

if pixel will be occluded, skip the expensive shader work.
`discard` breaks this because the GPU can't know whether a fragment
survives until the shader runs, so it can't update the depth buffer early.


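a toy model of early-z for a single pixel, to show why draw order
matters. this is a simplification for illustration, not how any
particular GPU schedules work. smaller depth means closer to the camera.

```python
def shaded_fragments(depths, early_z=True):
    """Count fragment shader runs for fragments landing on ONE pixel.

    depths: fragment depths in draw order (smaller = closer to camera).
    """
    nearest = float("inf")  # cleared depth buffer
    runs = 0
    for d in depths:
        passes = d < nearest
        if early_z and not passes:
            continue        # rejected before shading: no shader cost
        runs += 1           # the fragment shader executes
        if passes:
            nearest = d     # depth buffer keeps the closest fragment
    return runs


back_to_front = [0.9, 0.7, 0.5, 0.3, 0.1]
front_to_back = sorted(back_to_front)

print(shaded_fragments(front_to_back))                 # 1
print(shaded_fragments(back_to_front))                 # 5
print(shaded_fragments(front_to_back, early_z=False))  # 5
```

early-z only saves work when closer fragments arrive first. drawn
back-to-front, every fragment passes the test and gets shaded anyway.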
LOD (level of detail)
---------------------

using simpler geometry/textures for distant objects.

far away = fewer pixels = less detail needed.
saves vertices, texture bandwidth, and fill rate.


frustum culling
---------------

don't draw what's outside the camera view.

the "frustum" is the visible region: a pyramid with its tip cut off
(near plane to far plane).
anything outside = wasted work. cull it before sending to GPU.


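a minimal culling sketch. it assumes a 2D camera, so the "frustum"
collapses to an axis-aligned view rectangle; a 3D version would test
against the six frustum planes instead. `Rect`, `visible` and
`half_size` are hypothetical names, not lofivor's API.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    w: float
    h: float


def visible(entities, view: Rect, half_size: float):
    """Keep entities whose square (centre +/- half_size) overlaps the view."""
    return [
        (ex, ey)
        for ex, ey in entities
        if ex + half_size >= view.x
        and ex - half_size <= view.x + view.w
        and ey + half_size >= view.y
        and ey - half_size <= view.y + view.h
    ]


view = Rect(0, 0, 1920, 1080)
entities = [(100, 100), (-50, 100), (1919, 1079), (3000, 500), (-1, 10)]
kept = visible(entities, view, half_size=2)
print(len(kept), "of", len(entities), "entities survive culling")
```

the entity at (-1, 10) is kept because its 4x4 quad still pokes into
the view. testing only the centre point would pop it out too early.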
spatial partitioning
--------------------

organizing entities by position for fast queries.

types: grid, quadtree, octree, BVH.

"which entities are near point X?" goes from O(n) to roughly O(log n)
for trees, or a constant number of cell lookups for a uniform grid.
essential for collision detection at scale.

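a minimal uniform-grid sketch of the "which entities are near point X?"
query. it assumes the query radius is no larger than the cell size, so
checking the 3x3 block of cells around the point is enough. the class
and method names are made up for illustration.

```python
from collections import defaultdict


class Grid:
    def __init__(self, cell_size: float):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cx, cy) -> [(id, x, y), ...]

    def _cell(self, x: float, y: float):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, entity_id, x: float, y: float):
        self.cells[self._cell(x, y)].append((entity_id, x, y))

    def near(self, x: float, y: float, radius: float):
        """Ids within `radius` of (x, y). Assumes radius <= cell_size."""
        cx, cy = self._cell(x, y)
        found = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for eid, ex, ey in self.cells.get((cx + dx, cy + dy), ()):
                    if (ex - x) ** 2 + (ey - y) ** 2 <= radius ** 2:
                        found.append(eid)
        return found


grid = Grid(cell_size=10)
grid.insert(1, 5, 5)
grid.insert(2, 12, 5)
grid.insert(3, 50, 50)
grid.insert(4, -3, 5)

print(sorted(grid.near(6, 5, radius=8)))  # [1, 2]
```

the query touches at most 9 cells no matter how many entities exist, so
its cost depends on local density rather than the total count.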

docs/rops.txt (normal file, 201 lines added)
@ -0,0 +1,201 @

rops: render output units
=========================

what they are, where they came from, and what yours can do.


what is a rop?
--------------

ROP = Render Output Unit (originally "Raster Operations Pipeline")

it's the final stage of the GPU pipeline. after all the fancy shader
math is done, the ROP is the unit that actually writes pixels to memory.

think of it as the bottleneck between "calculated" and "visible."

a ROP does:
- depth testing (is this pixel in front of what's already there?)
- stencil testing (mask operations)
- blending (alpha, additive, etc)
- anti-aliasing resolve
- writing the final color to the framebuffer

one ROP can write one pixel per clock cycle (roughly).


the first rop
-------------

the term goes back to 2D "raster operations": bitwise operations on
pixels (AND, OR, XOR). accelerators such as the IBM 8514/A (1987) put
these in dedicated hardware instead of leaving the pixel math to the CPU.

but the modern ROP as we know it emerged with:

NVIDIA NV1 (1995)
  one of the first chips with dedicated pixel output hardware
  could do ~1 million textured pixels/second

3dfx Voodoo (1996)
  the card that defined the modern GPU pipeline
  had 1 TMU + 1 pixel pipeline (essentially 1 ROP)
  could push 45 million pixels/second
  that ONE pipeline ran Quake at 640x480

NVIDIA GeForce 256 (1999)
  "the first GPU" - named itself with that term
  4 pixel pipelines = 4 ROPs
  480 million pixels/second

so the original consumer 3D cards had... 1 ROP. and they ran Quake.


what one rop can do
-------------------

let's do the math.

one ROP at 100 MHz (a round number for the late-90s Voodoo era; the
original Voodoo was clocked lower):
  100 million cycles/second
  ~1 pixel per cycle
  = 100 megapixels/second

at 640x480 @ 60fps:
  640 * 480 * 60 = 18.4 megapixels/second needed

so ONE ROP at 100MHz could handle 640x480 with ~5x headroom for overdraw.

at 1024x768 @ 60fps:
  1024 * 768 * 60 = 47 megapixels/second

now you're at 2x overdraw max. still playable, but tight.


one modern rop
--------------

a single modern ROP runs at ~1-2 GHz and can do more per cycle:
- multiple color outputs (MRT)
- 64-bit or 128-bit color formats
- compressed writes

rough estimate for one ROP at 1.5 GHz:
  ~1.5 billion pixels/second base throughput

at 1920x1080 @ 60fps:
  1920 * 1080 * 60 = 124 megapixels/second

one ROP could handle 1080p with 12x overdraw headroom.

at 4K @ 60fps:
  3840 * 2160 * 60 = 497 megapixels/second

one ROP could handle 4K with 3x overdraw. tight, but possible.


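the headroom figures from the last two sections can be reproduced with
a few lines. a sketch assuming 1 pixel per clock per ROP, as the text
does; the function name is illustrative.

```python
def overdraw_headroom(fill_rate_px_s: float, width: int, height: int,
                      fps: float = 60) -> float:
    """How many times each screen pixel could be written per frame."""
    return fill_rate_px_s / (width * height * fps)


CASES = [
    ("100 MHz ROP, 640x480", 100e6, 640, 480),
    ("100 MHz ROP, 1024x768", 100e6, 1024, 768),
    ("1.5 GHz ROP, 1920x1080", 1.5e9, 1920, 1080),
    ("1.5 GHz ROP, 3840x2160", 1.5e9, 3840, 2160),
]

for label, rate, w, h in CASES:
    print(f"{label}: {overdraw_headroom(rate, w, h):.1f}x")
```

this prints 5.4x, 2.1x, 12.1x and 3.0x, matching the rounded numbers
in the text.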
your three rops (intel hd 530)
------------------------------

HD 530 specs:
- 3 ROPs
- ~950 MHz boost clock
- theoretical: 2.85 GPixels/second

let's break that down:

at 1080p @ 60fps (124 MP/s needed):
  2850 / 124 = 23x overdraw budget

that's actually generous! you could draw each pixel 23 times.

so why does lofivor struggle at 1M entities?

because 1M entities at 4x4 pixels = 16M pixel writes per frame minimum
(already ~7.7x overdraw at 1080p).
and that's the best case. suppose the effective cost is 10x that
(an assumed factor, not a measurement):
  160M pixels/frame
  at 60fps = 9.6 billion pixels/second

your ceiling is 2.85 billion.

so you're 3.4x over budget. that's why you top out around 300k-400k
before frame drops (which matches empirical testing).


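the budget arithmetic above, turned around to ask "how many entities
fit?". `cost_factor` stands in for the 10x multiplier used in the text;
it is an assumption, not a measured value, and the function name is
illustrative.

```python
def max_entities(fill_rate_px_s: float, fps: float,
                 pixels_per_entity: float, cost_factor: float = 1.0) -> float:
    """Entities per frame that fit in the fill-rate budget."""
    pixels_per_frame = fill_rate_px_s / fps
    return pixels_per_frame / (pixels_per_entity * cost_factor)


HD530_FILL = 2.85e9  # 3 ROPs * 950 MHz

with_factor = max_entities(HD530_FILL, 60, pixels_per_entity=16, cost_factor=10)
best_case = max_entities(HD530_FILL, 60, pixels_per_entity=16)

print(f"with the assumed 10x factor: {with_factor:,.0f} entities")
print(f"with no extra cost:          {best_case:,.0f} entities")
```

with the 10x factor the estimate lands just under 300k, in line with
the 300k-400k range quoted above. without it the same formula gives
about 3 million, so the result is only as good as that factor.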
the real constraint
-------------------

ROPs don't work in isolation. they're limited by:

1. MEMORY BANDWIDTH
   each pixel write = memory access
   HD 530 shares DDR4 with CPU (~30 GB/s)
   at 32-bit color: 30GB/s / 4 bytes = 7.5 billion pixels/second max
   but you're competing with CPU, texture reads, etc.
   realistic: maybe 2-3 billion pixels/second for framebuffer writes

2. TEXTURE SAMPLING
   if fragment shader samples textures, TMUs must keep up
   HD 530 has 24 TMUs, so this isn't the bottleneck

3. SHADER EXECUTION
   ROPs wait for fragments to be shaded
   if shaders are slow, ROPs starve
   lofivor's shaders are trivial, so this isn't the bottleneck

for lofivor specifically: your 3 ROPs are THE ceiling.


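a sketch comparing the ROP limit with the bandwidth limit from point 1.
`share` is a hypothetical knob for the fraction of memory bandwidth
left over for framebuffer writes; the 40% used here is an assumption
picked to land inside the "2-3 billion" range quoted above.

```python
def bandwidth_pixel_limit(bandwidth_bytes_s: float, bytes_per_pixel: int = 4,
                          share: float = 1.0) -> float:
    """Pixels/second that memory bandwidth alone allows.

    share: fraction of bandwidth available for framebuffer writes
    (hypothetical; the CPU and texture reads take the rest).
    """
    return bandwidth_bytes_s * share / bytes_per_pixel


ROP_LIMIT = 3 * 950e6                             # 2.85e9 pixels/s
peak = bandwidth_pixel_limit(30e9)                # 7.5e9 pixels/s
shared = bandwidth_pixel_limit(30e9, share=0.4)   # assumed 40% share

limits = {"rops": ROP_LIMIT, "bandwidth": shared}
bottleneck = min(limits, key=limits.get)

print(f"peak bandwidth limit: {peak / 1e9:.1f} GPixels/s")
print(f"bottleneck: {bottleneck} at {limits[bottleneck] / 1e9:.2f} GPixels/s")
```

the lowest limit sets the ceiling. with these numbers it is the ROPs,
though a smaller bandwidth share would flip the result.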
what could you do with more rops?
---------------------------------

comparison:

  Intel HD 530:    3 ROPs,   2.85 GPixels/s
  GTX 1060:       48 ROPs,  72 GPixels/s
  RTX 3080:       96 ROPs, 164 GPixels/s
  RTX 4090:      176 ROPs, 443 GPixels/s

with a GTX 1060 (25x your fill rate):
  lofivor could probably hit 5-10 million entities

with an RTX 4090 (155x your fill rate):
  tens of millions, limited by other factors


perspective: what 3 rops means historically
-------------------------------------------

your HD 530 has roughly the fill rate of:
- GeForce 4 Ti 4600 (2002): 4 ROPs, 1.2 GPixels/s
- Radeon 9700 Pro (2002): 8 ROPs, 2.6 GPixels/s

you're running hardware that, in raw pixel output, matches GPUs from
20+ years ago. but with modern features (compute shaders, SSBO, etc).

this is why lofivor is interesting: you're achieving 700k+ entities
on fill-rate-equivalent hardware that originally ran games with
maybe 10,000 triangles on screen.

the difference is technique. those 2002 games did complex per-pixel
lighting, shadows, multiple texture passes. lofivor does one texture
sample and one blend. same fill rate, 100x the entities.


the lesson
----------

ROPs are simple: they write pixels.

the number you have determines your pixel budget.
everything else (shaders, vertices, CPU logic) only matters if
the ROPs aren't your bottleneck.

with 3 ROPs, you have roughly 2.85 billion pixels/second.
spend them wisely:
- cull what's offscreen (don't spend pixels on invisible things)
- shrink distant objects (LOD saves pixels)
- reduce overlap (spatial organization)
- keep shaders simple (don't starve the ROPs)

your 3 ROPs can do remarkable things. Quake ran on 1.