Add glossary and rops doc
parent 55b0d7fab7
commit 5b890b18e4
2 changed files with 493 additions and 0 deletions

docs/GLOSSARY.txt (normal file, 292 lines added)
@@ -0,0 +1,292 @@

lofivor glossary
================

terms that come up when optimizing graphics.


clock cycle
-----------

one "tick" of the processor's internal clock.

a CPU or GPU is driven by a clock signal, derived from a crystal
oscillator and multiplied up to the core frequency. each tick = one
cycle. the processor does some work each cycle.

  1 GHz = 1 billion cycles per second
  1 MHz = 1 million cycles per second

so a 1 GHz processor has 1 billion opportunities to do work per second.

"one operation per cycle" is idealized. real work often takes multiple
cycles (memory access: 100+ cycles, division: 10-20 cycles, add: 1 cycle).

your HD 530 runs at ~950 MHz, so roughly 950 million cycles per second.
at 60fps, that's about 15.8 million cycles per frame.
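
a minimal python sketch of the per-frame arithmetic above. the function
name is made up for illustration; the clock and frame rate are the ones
quoted in this entry.

```python
def cycles_per_frame(clock_hz: float, fps: float) -> float:
    """Clock cycles available during one frame."""
    return clock_hz / fps


HD530_CLOCK_HZ = 950e6  # ~950 MHz, the figure quoted above

budget = cycles_per_frame(HD530_CLOCK_HZ, 60)
print(f"{budget / 1e6:.1f} million cycles per frame")
```

halving the frame rate target doubles the budget, which is why dropping
to 30fps is such a common escape hatch.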


fill rate
---------

pixels written per second. measured in megapixels/s or gigapixels/s.

  fill rate = ROPs * clock speed * pixels per ROP per clock

your HD 530: 3 ROPs * 950 MHz * 1 = 2.85 GPixels/s theoretical max.
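
the formula above as a small python sketch. the function and parameter
names are illustrative, and the result is a theoretical peak, not
something you will measure in practice.

```python
def fill_rate(rops: int, clock_hz: float, pixels_per_clock: float = 1.0) -> float:
    """Theoretical peak fill rate in pixels per second."""
    return rops * clock_hz * pixels_per_clock


hd530 = fill_rate(rops=3, clock_hz=950e6)
print(f"{hd530 / 1e9:.2f} GPixels/s")
```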


overdraw
--------

drawing the same pixel multiple times per frame.

if two entities overlap, the back one gets drawn, then the front one
overwrites it. the back one's work was wasted.

  overdraw ratio = total pixels drawn / screen pixels

1080p = 2.07M pixels. if you draw 20M pixels, overdraw = ~10x.
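
the ratio above as a python sketch, using the 1080p example from this
entry. the function name is illustrative.

```python
def overdraw_ratio(pixels_drawn: float, width: int, height: int) -> float:
    """Average number of times each screen pixel is written per frame."""
    return pixels_drawn / (width * height)


ratio = overdraw_ratio(20e6, 1920, 1080)
print(f"{ratio:.1f}x overdraw")
```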


bandwidth
---------

data transfer rate. measured in bytes/second (GB/s, MB/s).

memory bandwidth = how fast data moves between processor and RAM.

your HD 530 shares DDR4 with the CPU: ~30 GB/s total.
a discrete GPU has dedicated VRAM: 200-900 GB/s.


latency
-------

time delay. measured in nanoseconds (ns) or cycles.

memory latency = time to fetch data from memory. typical CPU figures:
- L1 cache: ~4 cycles
- L2 cache: ~12 cycles
- L3 cache: ~40 cycles
- main RAM: ~200 cycles

this is why cache matters. a miss that goes all the way to RAM
(~200 cycles) is ~50x slower than an L1 hit (~4 cycles).


throughput vs latency
---------------------

latency = how long ONE thing takes.
throughput = how many things finish per second.

a pipeline can have high latency and still have high throughput.

example: a car wash takes 10 minutes (latency).
but if a car enters every minute, throughput is 60 cars/hour.

GPUs hide latency with throughput. one thread waits for memory?
switch to another thread. thousands of threads keep the GPU busy.
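
the car wash example can be checked with a few lines of python. this is
a sketch with made-up function names; it assumes cars arrive at a fixed
interval and never block each other inside the wash.

```python
def serial_minutes(cars: int, latency_min: float) -> float:
    """One car at a time: the next car waits for the previous one to leave."""
    return cars * latency_min


def pipelined_minutes(cars: int, latency_min: float, interval_min: float) -> float:
    """A new car enters every interval_min while earlier cars are still inside."""
    if cars == 0:
        return 0.0
    return latency_min + (cars - 1) * interval_min


cars = 60
print(serial_minutes(cars, 10))        # 600 minutes
print(pipelined_minutes(cars, 10, 1))  # 69 minutes
```

each car still takes 10 minutes (latency is unchanged), but 60 cars are
done in a little over an hour instead of ten hours.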


draw call
---------

one command from CPU to GPU: "draw this batch of geometry."

each draw call has overhead:
- CPU prepares command buffer
- driver validates state
- GPU may have to switch state

1 draw call for 1M triangles: fast.
1M draw calls for 1M triangles: slow.

lofivor uses 1 draw call for all entities (instanced rendering).


instancing
----------

drawing many copies of the same geometry in one draw call.

instead of: draw triangle, draw triangle, draw triangle...
you say: draw this triangle 1 million times, here are the positions.

the GPU handles the replication. massively more efficient.
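
a CPU-side python emulation of what an instanced draw expands to. this
is only a sketch of the idea, not lofivor's renderer: the quad size,
the names, and the positions are all made up for illustration.

```python
# one quad, as four corner offsets around its center (hypothetical 4x4 size)
QUAD = [(-2, -2), (2, -2), (2, 2), (-2, 2)]


def draw_instanced(vertices, instance_positions):
    """Emulate one instanced draw: every instance reuses the same vertices,
    offset by its own per-instance position."""
    out = []
    for px, py in instance_positions:   # on a GPU: one instance id each
        for vx, vy in vertices:         # on a GPU: one vertex shader run each
            out.append((px + vx, py + vy))
    return out


positions = [(10, 10), (50, 20), (30, 80)]
corners = draw_instanced(QUAD, positions)  # one "call", three quads
print(len(corners))
```

the geometry is stored once; only the per-instance positions grow with
the entity count.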


shader
------

a small program that runs on the GPU.

the name is historical - early shaders calculated shading/lighting.
but today: a shader is just software running on GPU hardware.
it doesn't have to have anything to do with shading.

more precisely: a shader turns one piece of data into another piece of data.
- vertex shader: positions → screen coordinates
- fragment shader: fragments → pixel colors
- compute shader: data → data (anything)

GPUs are massively parallel, so shaders run on thousands of inputs at once.
single-threaded CPU speedups have slowed down; GPU throughput keeps growing.
modern engines like UE5 increasingly use shaders for work that used to be
CPU-only.


SSBO (shader storage buffer object)
-----------------------------------

a block of GPU memory that shaders can read/write.

unlike uniforms (small, read-only), SSBOs can be large and writable.
lofivor stores all entity data in an SSBO: positions, velocities, colors.


compute shader
--------------

a shader that does general computation, not rendering.

runs on GPU cores but doesn't output pixels. just processes data.
lofivor uses compute shaders to update entity positions.

because compute exists, shaders can do anything: physics, AI, sorting,
image processing. the GPU is a general-purpose parallel processor.


fragment / pixel shader
-----------------------

program that runs once per pixel (actually per "fragment": a candidate
pixel produced by rasterizing a triangle).

determines the final color of each pixel. this is where:
- texture sampling happens
- lighting calculations happen
- the expensive math lives

lofivor's fragment shader: sample texture, multiply by color. trivial.
AAA game fragment shader: 500+ instructions. expensive.


vertex shader
-------------

program that runs once per vertex.

transforms 3D positions to screen positions. lofivor's vertex shader
reads from the SSBO and positions the quad corners.


ROP (render output unit)
------------------------

final stage of GPU pipeline. writes pixels to framebuffer.

handles: depth test, stencil test, blending, antialiasing.
your bottleneck on HD 530. see docs/rops.txt.


TMU (texture mapping unit)
--------------------------

samples textures. reads texel colors from texture memory.

your HD 530 has 24 TMUs. they're fast: 24 * 950 MHz = 22.8 GTexels/s.
texture sampling is cheap relative to ROPs on this hardware.


EU (execution unit)
-------------------

intel's term for shader cores.

your HD 530 has 24 EUs, each with 8 ALUs = 192 ALUs total.
these run your vertex, fragment, and compute shaders.


ALU (arithmetic logic unit)
---------------------------

does math. add, multiply, compare, bitwise operations.

one ALU can do one operation per cycle (simple ops).
complex ops (sqrt, sin, cos) take multiple cycles.


framebuffer
-----------

the image being rendered. lives in GPU memory.

at 1080p with 32-bit color: 1920 * 1080 * 4 bytes = 8.3 MB.
double-buffered (front + back): 16.6 MB.
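
the size arithmetic above as a python sketch. the function name and
defaults are illustrative; MB here means 10^6 bytes, which is what the
figures in this entry use.

```python
def framebuffer_bytes(width: int, height: int,
                      bytes_per_pixel: int = 4, buffers: int = 1) -> int:
    """Memory needed for one or more color buffers, in bytes."""
    return width * height * bytes_per_pixel * buffers


single = framebuffer_bytes(1920, 1080)
double = framebuffer_bytes(1920, 1080, buffers=2)
print(f"{single / 1e6:.1f} MB, {double / 1e6:.1f} MB")
```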


vsync
-----

synchronizing frame presentation with monitor refresh.

without vsync: tearing (half old frame, half new frame).
with vsync: no tearing, but if a frame misses the 16.7 ms deadline
(at 60 Hz), it waits for the next refresh.
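
a python sketch of what a missed refresh costs. it assumes simple
vsync where every frame takes the same time to render and presentation
always waits for the next refresh; real drivers (triple buffering,
adaptive sync) behave differently. the function name is made up.

```python
import math


def presented_frame_ms(render_ms: float, refresh_hz: float = 60.0) -> float:
    """Time between presented frames when every frame takes render_ms and
    presentation waits for the next refresh."""
    refresh_ms = 1000.0 / refresh_hz
    refreshes = max(1, math.ceil(render_ms / refresh_ms))
    return refreshes * refresh_ms


print(round(presented_frame_ms(10.0), 2))  # fits in one refresh
print(round(presented_frame_ms(17.0), 2))  # just misses, waits for the second
```

under these assumptions a 17 ms frame is shown every 33.3 ms, so
slightly missing the deadline halves the frame rate.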


frame budget
------------

time available per frame.

  60 fps = 16.67 ms per frame
  30 fps = 33.33 ms per frame

everything (CPU + GPU) must complete within budget or frames drop.


pipeline stall
--------------

GPU waiting for something. bad for performance.

causes:
- waiting for memory (cache miss)
- waiting for previous stage to finish
- synchronization points (barriers)
- `discard` in fragment shader (disables early-z, so hidden fragments
  still get shaded)


early-z
-------

optimization: test depth BEFORE running fragment shader.

if pixel will be occluded, skip the expensive shader work.
`discard` breaks this because the GPU can't know whether a fragment
survives (and so whether to write its depth) until the shader has run.
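
a toy python model of why the order of the depth test matters. it only
counts fragment shader runs for fragments landing on the same pixel;
the names and depth values are made up, and real hardware works on
tiles and hierarchical depth rather than one fragment at a time.

```python
def shader_runs(fragments, early_z: bool) -> int:
    """Count fragment shader invocations for (pixel, depth) fragments
    processed in order. Smaller depth = closer to the camera."""
    depth = {}
    runs = 0
    for pixel, z in fragments:
        visible = z < depth.get(pixel, float("inf"))
        if early_z and not visible:
            continue          # rejected before the shader runs
        runs += 1             # the fragment shader runs here
        if visible:
            depth[pixel] = z  # late or early, a passing fragment updates depth
    return runs


front_to_back = [(0, 0.1), (0, 0.5), (0, 0.9)]
print(shader_runs(front_to_back, early_z=True))   # only the nearest one is shaded
print(shader_runs(front_to_back, early_z=False))  # all three are shaded
```

note that early-z only helps when nearer fragments arrive first: drawn
back to front, every fragment passes the test and gets shaded anyway.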


LOD (level of detail)
---------------------

using simpler geometry/textures for distant objects.

far away = fewer pixels = less detail needed.
saves vertices, texture bandwidth, and fill rate.


frustum culling
---------------

don't draw what's outside the camera view.

the "frustum" is the visible region: a pyramid with its tip cut off.
anything outside = wasted work. cull it before sending to GPU.
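
a simplified python sketch of culling. it assumes a 2D view rectangle
instead of a real 3D frustum, and treats entities as points with a
small margin for their size; the names and coordinates are made up.

```python
def cull(entities, view_min, view_max, margin=0.0):
    """Keep only entities inside the view rectangle, grown by margin so
    that entities straddling the edge are not dropped."""
    (x0, y0), (x1, y1) = view_min, view_max
    return [
        (x, y) for x, y in entities
        if x0 - margin <= x <= x1 + margin and y0 - margin <= y <= y1 + margin
    ]


entities = [(100, 100), (-50, 200), (1900, 1000), (2500, 300)]
visible = cull(entities, (0, 0), (1920, 1080), margin=2)
print(visible)
```

the margin errs on the side of drawing too much: keeping an entity that
is barely offscreen is cheap, dropping one that is barely onscreen is a
visible bug.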


spatial partitioning
--------------------

organizing entities by position for fast queries.

types: grid, quadtree, octree, BVH.

"which entities are near point X?" goes from O(n) to O(log n) with a
tree, or close to O(1) with a grid.
essential for collision detection at scale.
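
a minimal python sketch of the simplest type, a uniform grid. it is not
lofivor's implementation; the function names, cell size, and points are
assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def build_grid(points, cell_size):
    """Bucket point indices by the grid cell that contains them."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(math.floor(x / cell_size), math.floor(y / cell_size))].append(i)
    return grid


def query_near(grid, points, cell_size, x, y, radius):
    """Indices of points within radius of (x, y), checking only nearby cells."""
    cx0 = math.floor((x - radius) / cell_size)
    cx1 = math.floor((x + radius) / cell_size)
    cy0 = math.floor((y - radius) / cell_size)
    cy1 = math.floor((y + radius) / cell_size)
    found = []
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            for i in grid.get((cx, cy), ()):
                px, py = points[i]
                if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                    found.append(i)
    return found


points = [(5, 5), (12, 7), (90, 90), (14, 11)]
grid = build_grid(points, cell_size=10)
print(sorted(query_near(grid, points, 10, x=10, y=10, radius=6)))
```

the query never looks at the point at (90, 90): its cell is not among
the ones overlapping the search radius.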


docs/rops.txt (normal file, 201 lines added)
@@ -0,0 +1,201 @@

rops: render output units
=========================

what they are, where they came from, and what yours can do.


what is a rop?
--------------

ROP = Render Output Unit (originally "Raster Operations Pipeline")

it's the final stage of the GPU pipeline. after all the fancy shader
math is done, the ROP is the unit that actually writes pixels to memory.

think of it as the bottleneck between "calculated" and "visible."

a ROP does:
- depth testing (is this pixel in front of what's already there?)
- stencil testing (mask operations)
- blending (alpha, additive, etc)
- anti-aliasing resolve
- writing the final color to the framebuffer

one ROP can write one pixel per clock cycle (roughly).


the first rop
-------------

the name goes back to "raster operations": bitwise operations on pixels
(AND, OR, XOR). the IBM 8514/A (1987) was an early PC graphics card with
dedicated hardware for them. this was a big deal because before that,
the CPU did all the pixel math on a PC.

but the modern ROP as we know it emerged with:

NVIDIA NV1 (1995)
  one of the first chips with dedicated pixel output hardware
  could do ~1 million textured pixels/second

3dfx Voodoo (1996)
  the card that defined the modern GPU pipeline
  had 1 TMU + 1 pixel pipeline (essentially 1 ROP)
  could push 45 million pixels/second
  that ONE pipeline ran Quake at 640x480

NVIDIA GeForce 256 (1999)
  "the first GPU" - NVIDIA marketed it with that term
  4 pixel pipelines = 4 ROPs
  480 million pixels/second

so the original consumer 3D cards had... 1 ROP. and they ran Quake.


what one rop can do
-------------------

let's do the math.

one ROP at 100 MHz (3dfx Voodoo era):
  100 million cycles/second
  ~1 pixel per cycle
  = 100 megapixels/second

at 640x480 @ 60fps:
  640 * 480 * 60 = 18.4 megapixels/second needed

so ONE ROP at 100MHz could handle 640x480 with ~5x headroom for overdraw.

at 1024x768 @ 60fps:
  1024 * 768 * 60 = 47 megapixels/second

now you're at ~2x overdraw max. still playable, but tight.
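
the headroom math above as a python sketch. the function name is made
up, and the 100 megapixels/second figure is the idealized one ROP at
100 MHz assumed in this section.

```python
def overdraw_headroom(fill_rate_px_s: float, width: int, height: int,
                      fps: int = 60) -> float:
    """How many times each screen pixel could be written per frame."""
    return fill_rate_px_s / (width * height * fps)


ONE_ROP_100MHZ = 100e6  # ~1 pixel per cycle at 100 MHz, as assumed above

print(f"{overdraw_headroom(ONE_ROP_100MHZ, 640, 480):.1f}x")
print(f"{overdraw_headroom(ONE_ROP_100MHZ, 1024, 768):.1f}x")
```

the same function works for any fill rate and resolution, so it can be
reused to sanity-check the later sections.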


one modern rop
--------------

a single modern ROP runs at ~1-2 GHz and can do more per cycle:
- multiple color outputs (MRT)
- 64-bit or 128-bit color formats
- compressed writes

rough estimate for one ROP at 1.5 GHz:
  ~1.5 billion pixels/second base throughput

at 1920x1080 @ 60fps:
  1920 * 1080 * 60 = 124 megapixels/second

one ROP could handle 1080p with 12x overdraw headroom.

at 4K @ 60fps:
  3840 * 2160 * 60 = 497 megapixels/second

one ROP could handle 4K with 3x overdraw. tight, but possible.


your three rops (intel hd 530)
------------------------------

HD 530 specs:
- 3 ROPs
- ~950 MHz boost clock
- theoretical: 2.85 GPixels/second

let's break that down:

at 1080p @ 60fps (124 MP/s needed):
  2850 / 124 = 23x overdraw budget

that's actually generous! you could draw each pixel 23 times.

so why does lofivor struggle at 1M entities?

because 1M entities at 4x4 pixels = 16M pixels per frame, minimum.
but with overlap? let's assume 10x that as a rough worst case
(an estimate, not a measurement):
  160M pixels/frame
  at 60fps = 9.6 billion pixels/second

your ceiling is 2.85 billion.

so you're 3.4x over budget. that's why you top out around 300k-400k
before frame drops (which matches empirical testing).
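
the budget estimate above as a python sketch. every constant is taken
from this section, including the 10x factor, which is the text's rough
assumption rather than a measured value. the names are illustrative.

```python
FILL_RATE = 2.85e9        # HD 530 theoretical ceiling, pixels/second
PIXELS_PER_ENTITY = 4 * 4
OVERDRAW_FACTOR = 10      # the 10x figure assumed in the text, not measured
FPS = 60


def pixels_per_second(entities: int) -> float:
    """Pixel writes per second demanded by this many entities."""
    return entities * PIXELS_PER_ENTITY * OVERDRAW_FACTOR * FPS


def max_entities() -> int:
    """Largest entity count that stays under the fill rate ceiling."""
    return int(FILL_RATE // (PIXELS_PER_ENTITY * OVERDRAW_FACTOR * FPS))


demand = pixels_per_second(1_000_000)
print(f"{demand / 1e9:.1f} GPixels/s, {demand / FILL_RATE:.1f}x over budget")
print(max_entities())
```

under these assumptions the ceiling lands just under 300k entities. the
result scales directly with the assumed factor, so treat it as a
ballpark only.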


the real constraint
-------------------

ROPs don't work in isolation. they're limited by:

1. MEMORY BANDWIDTH
   each pixel write = memory access
   HD 530 shares DDR4 with CPU (~30 GB/s)
   at 32-bit color: 30 GB/s / 4 bytes = 7.5 billion pixels/second max
   but you're competing with CPU, texture reads, etc.
   realistic: maybe 2-3 billion pixels/second for framebuffer writes

2. TEXTURE SAMPLING
   if fragment shader samples textures, TMUs must keep up
   HD 530 has 24 TMUs, so this isn't the bottleneck

3. SHADER EXECUTION
   ROPs wait for fragments to be shaded
   if shaders are slow, ROPs starve
   lofivor's shaders are trivial, so this isn't the bottleneck

for lofivor specifically: your 3 ROPs are THE ceiling.
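
a python sketch comparing the two theoretical limits from this section.
it assumes write-only 32-bit pixels and ignores compression, caches,
and other bandwidth users, so both numbers are upper bounds. the names
are made up.

```python
def bandwidth_pixel_rate(bandwidth_bytes_s: float, bytes_per_pixel: int = 4) -> float:
    """Upper bound on pixel writes/second if all bandwidth went to the framebuffer."""
    return bandwidth_bytes_s / bytes_per_pixel


rop_limit = 3 * 950e6                         # 2.85 GPixels/s
bandwidth_limit = bandwidth_pixel_rate(30e9)  # 7.5 GPixels/s
ceiling = min(rop_limit, bandwidth_limit)
print(f"{ceiling / 1e9:.2f} GPixels/s")
```

whichever limit is lower sets the ceiling. on paper that is the ROPs,
and the "realistic" 2-3 billion bandwidth figure lands in the same range.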


what could you do with more rops?
---------------------------------

comparison:

  Intel HD 530:    3 ROPs, 2.85 GPixels/s
  GTX 1060:       48 ROPs, 72 GPixels/s
  RTX 3080:       96 ROPs, 164 GPixels/s
  RTX 4090:      176 ROPs, 443 GPixels/s

with a GTX 1060 (25x your fill rate):
  lofivor could probably hit 5-10 million entities

with an RTX 4090 (155x your fill rate):
  tens of millions, limited by other factors
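
the fill rate ratios quoted above can be reproduced with a few lines of
python. the numbers are the ones listed in this section; the dict and
variable names are illustrative.

```python
FILL_RATES_GPIX = {
    "Intel HD 530": 2.85,
    "GTX 1060": 72,
    "RTX 3080": 164,
    "RTX 4090": 443,
}

baseline = FILL_RATES_GPIX["Intel HD 530"]
ratios = {name: rate / baseline for name, rate in FILL_RATES_GPIX.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}x")
```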


perspective: what 3 rops means historically
-------------------------------------------

your HD 530 has roughly the fill rate of:
- GeForce 4 Ti 4600 (2002): 4 ROPs, 1.2 GPixels/s
- Radeon 9700 Pro (2002): 8 ROPs, 2.6 GPixels/s

you're running hardware that, in raw pixel output, matches GPUs from
20+ years ago. but with modern features (compute shaders, SSBO, etc).

this is why lofivor is interesting: you're achieving 700k+ entities
on fill-rate-equivalent hardware that originally ran games with
maybe 10,000 triangles on screen.

the difference is technique. those 2002 games did complex per-pixel
lighting, shadows, multiple texture passes. lofivor does one texture
sample and one blend. same fill rate, 100x the entities.


the lesson
----------

ROPs are simple: they write pixels.

the number you have determines your pixel budget.
everything else (shaders, vertices, CPU logic) only matters if
the ROPs aren't your bottleneck.

with 3 ROPs, you have roughly 2.85 billion pixels/second.
spend them wisely:
- cull what's offscreen (don't spend pixels on invisible things)
- shrink distant objects (LOD saves pixels)
- reduce overlap (spatial organization)
- keep shaders simple (don't starve the ROPs)

your 3 ROPs can do remarkable things. Quake ran on 1.