Add doc "why rendering millions of entities is hard"
parent
9f3495b882
commit
6dcafc8f3c
1 changed files with 316 additions and 0 deletions
316
docs/why-millions-is-hard.txt
Normal file
@ -0,0 +1,316 @@
why rendering millions of entities is hard
=========================================

and what "hard" actually means, from first principles.


the simple answer
-----------------

every frame, your computer does work. work takes time. you have 16.7
milliseconds to do all the work before the next frame (at 60fps).

if the work takes longer than 16.7ms, you miss the deadline. frames drop.
the game stutters.

10 million entities means 10 million units of work. whether that fits in
16.7ms depends on how much work each unit is.
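
a rough sketch of that budget in python. the entity counts are
illustrative, and it pretends the entire frame is spent on entities,
which a real frame never is:

```python
def per_entity_budget_ns(fps: float, entities: int) -> float:
    """Nanoseconds available per entity if the whole frame went to entities."""
    frame_seconds = 1.0 / fps
    return frame_seconds * 1e9 / entities


if __name__ == "__main__":
    for count in (10_000, 1_000_000, 10_000_000):
        budget = per_entity_budget_ns(60, count)
        print(f"{count:>10,} entities: {budget:8.1f} ns each")
```

at 10 million entities the budget is under 2 nanoseconds each, which is
only a handful of clock cycles on a 3GHz core.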


what is "work" anyway?
----------------------

let's trace what happens when you draw one entity:

1. CPU: "here's an entity at position (340, 512), color cyan"
2. that data travels over a bus to the GPU
3. GPU: receives the data, stores it in memory
4. GPU: runs a vertex shader (figures out where on screen)
5. GPU: runs a fragment shader (figures out what color each pixel is)
6. GPU: writes pixels to the framebuffer
7. framebuffer gets sent to your monitor

each step has a speed limit. the slowest step is your bottleneck.


the bottlenecks, explained simply
---------------------------------

MEMORY BANDWIDTH
how fast data can move around. measured in GB/s.

think of it like a highway. you can have a fast car (processor), but
if the highway is jammed, you're stuck in traffic.

an integrated GPU (like Intel HD 530) shares the highway with the CPU.
a discrete GPU (like an RTX card) has its own private highway.

this is why lofivor's SSBO optimization helped so much: shrinking
entity data from 64 bytes to 12 bytes means 5x less traffic.
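
to make the 64 vs 12 comparison concrete, here is a small python sketch.
the field layout (two float32 values plus one packed RGBA color) is an
assumption for illustration, not necessarily lofivor's actual struct, and
pack_entity is a made-up helper:

```python
import struct

# Assumed layouts, for illustration only.
MATRIX_FORMAT = "<16f"  # a full 4x4 float32 transform matrix
PACKED_FORMAT = "<ffI"  # x, y as float32, color as one packed uint32


def pack_entity(x: float, y: float, r: int, g: int, b: int, a: int = 255) -> bytes:
    """Pack a position and an 8-bit-per-channel RGBA color into 12 bytes."""
    color = (r << 24) | (g << 16) | (b << 8) | a
    return struct.pack(PACKED_FORMAT, x, y, color)


if __name__ == "__main__":
    matrix_size = struct.calcsize(MATRIX_FORMAT)
    packed_size = struct.calcsize(PACKED_FORMAT)
    entity = pack_entity(340.0, 512.0, 0, 255, 255)  # cyan at (340, 512)
    print(matrix_size, packed_size, len(entity))
    print(f"traffic ratio: {matrix_size / packed_size:.2f}x")
```

the GPU side would rebuild the on-screen position from x and y in the
vertex shader instead of reading a precomputed matrix.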

DRAW CALLS
every time you say "GPU, draw this thing", there's overhead.
the CPU and GPU have to synchronize, state gets set up, etc.

1 draw call for 1 million entities: fast
1 million draw calls for 1 million entities: slow

this is why batching matters. not the drawing itself, but the
*coordination* of drawing.

FILL RATE
how many pixels the GPU can color per second.

a 4x4 pixel entity = 16 pixels
1 million entities = 16 million pixels minimum

but your screen is only ~2 million pixels (1920x1080). so entities
overlap. "overdraw" means coloring the same pixel multiple times.

10 million 4x4 entities is 160 million pixel writes. spread over ~2 million
screen pixels, that's each pixel colored about 77 times per frame.
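
the overdraw arithmetic as a python sketch. it assumes every entity is
a full 4x4 quad that lands entirely on a 1920x1080 screen, so treat the
result as an average, not a measurement:

```python
def overdraw(entities: int, pixels_per_entity: int,
             screen_w: int = 1920, screen_h: int = 1080) -> tuple[int, float]:
    """Return (total pixel writes, average writes per screen pixel)."""
    writes = entities * pixels_per_entity
    return writes, writes / (screen_w * screen_h)


if __name__ == "__main__":
    for count in (1_000_000, 10_000_000):
        writes, per_pixel = overdraw(count, 4 * 4)
        print(f"{count:>10,} entities: {writes:,} writes, "
              f"{per_pixel:.1f} per screen pixel")
```

under those assumptions 10 million entities come out well above 50
writes per screen pixel.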

SHADER COMPLEXITY
the GPU runs a tiny program for each vertex and each pixel.

simple: "put it here, color it this" = fast
complex: "calculate lighting from 8 sources, sample 4 textures,
apply normal mapping, do fresnel..." = slow

lofivor's shaders are trivial. AAA game shaders are not.

CPU-GPU SYNCHRONIZATION
the CPU and GPU work in parallel, but sometimes they have to wait
for each other.

if the CPU needs to read GPU results, it stalls.
if the GPU needs new data and the CPU is busy, it stalls.

good code keeps them both busy without waiting.


why "real games" hit CPU walls
------------------------------

rendering is just putting colors on pixels. that's the GPU's job.

but games aren't just rendering. they're also:

- COLLISION DETECTION
does entity A overlap entity B?

naive approach: check every pair
1,000 entities = 500,000 checks (n squared / 2)
10,000 entities = 50,000,000 checks
1,000,000 entities = 500,000,000,000 checks

that's 500 billion. per frame. not happening.

smart approach: spatial partitioning (grids, quadtrees)
only check nearby entities. but still, at millions of entities,
even "nearby" is a lot.
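
a minimal python sketch of the two approaches. the uniform grid below is
one common form of spatial partitioning, and the function names and cell
size are made up for this example. it only finds candidate pairs; a real
collision test would still run on each candidate:

```python
from collections import defaultdict


def naive_pair_count(n: int) -> int:
    """Exact number of unordered pairs among n entities."""
    return n * (n - 1) // 2


def grid_candidate_pairs(points, cell_size: float) -> set:
    """Index pairs whose points share a grid cell or sit in adjacent cells."""
    grid = defaultdict(list)
    for index, (x, y) in enumerate(points):
        grid[(int(x // cell_size), int(y // cell_size))].append(index)

    pairs = set()
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get() avoids creating empty cells while iterating
                for other in grid.get((cx + dx, cy + dy), ()):
                    for index in members:
                        if index < other:
                            pairs.add((index, other))
    return pairs


if __name__ == "__main__":
    points = [(0, 0), (1, 1), (100, 100), (101, 100)]
    print(naive_pair_count(len(points)), "naive checks")
    print(len(grid_candidate_pairs(points, 10.0)), "grid candidates")
    print(f"{naive_pair_count(1_000_000):,} naive checks at 1 million")
```

the grid only pays off when entities are spread out. if everything piles
into a few cells, the candidate count climbs back toward the naive number.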

- AI / BEHAVIOR
each entity decides what to do.

simple: move toward player. cheap.
complex: pathfind around obstacles, consider threats, coordinate
with allies, remember state. expensive.

lofivor entities just drift in a direction. no decisions.
a real game enemy makes decisions every frame.

- PHYSICS
entities push each other, bounce, have mass and friction.
every interaction is math. lots of entities = lots of math.

- GAME LOGIC
damage calculations, spawning, leveling, cooldowns, buffs...
all of this runs on the CPU, every frame.

so: lofivor can render 700k entities because they don't DO anything.
a game with 700k entities that think, collide, and interact would
need god-tier optimization or would simply not run.


what makes AAA games slow on old hardware?
------------------------------------------

it's not entity count. most AAA games have maybe hundreds of
"entities" on screen. it's everything else:

TEXTURE RESOLUTION
a 4K texture (4096x4096) is ~16.8 million pixels, or 67 MB uncompressed
at 4 bytes per pixel. per texture.
one character might have 10+ textures (diffuse, normal, specular,
roughness, ambient occlusion...).
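
the texture arithmetic as a python sketch. it assumes uncompressed
textures at 4 bytes per pixel with no mipmaps. shipped games usually use
compressed formats, so real numbers are lower:

```python
def texture_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Size of one uncompressed texture with no mipmaps."""
    return width * height * bytes_per_pixel


if __name__ == "__main__":
    one = texture_bytes(4096, 4096)
    print(f"one 4K texture:   {one / 1e6:.0f} MB")
    print(f"ten per character: {10 * one / 1e6:.0f} MB")
```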

old hardware: less VRAM, slower texture sampling.

SHADER COMPLEXITY
modern materials simulate light physics. subsurface scattering,
global illumination, ray-traced reflections.

each pixel might do hundreds of math operations.

POST-PROCESSING
bloom, motion blur, depth of field, ambient occlusion, anti-aliasing.
full-screen passes that touch every pixel multiple times.

MESH COMPLEXITY
a character might be 100,000 triangles.
10 characters = 1 million triangles.
every vertex of every triangle goes through the vertex shader.

SHADOWS
render the scene again from the light's perspective.
for each light. every frame.

AAA games are doing 100x more work per pixel than lofivor.
lofivor is doing 100x more pixels than AAA games.

different problems.


the "abuse" vs "respect" distinction
------------------------------------

abuse: making the hardware do unnecessary work.
respect: achieving your goal with minimal waste.

examples of abuse (that lofivor fixed):

- sending 64 bytes (a full matrix) when you need 12 bytes (x, y, color)
- one draw call per entity when you could batch
- calculating transforms on CPU when GPU could do it
- clearing the screen twice
- uploading the same data every frame

examples of abuse in the wild:

- electron apps using a whole browser to show a chat window
- games that re-render static UI every frame
- loading 4K textures for objects that appear 20 pixels tall
- running AI pathfinding for off-screen entities

the hardware has limits. respecting them means fitting your game
within those limits through smart decisions. abusing them means
throwing cycles at problems you created yourself.


so can you do 1 million entities with juice on old hardware?
------------------------------------------------------------

yes, with the right decisions.

what "juice" typically means:
- screen shake (free, just offset the camera)
- particle effects (separate system, heavily optimized)
- flash/hit feedback (change a color value)
- sound (different system entirely)

particles are special: they're designed for millions of tiny things.
they don't collide, don't think, often don't even persist (spawn,
drift, fade, die). GPU particle systems are essentially what lofivor
became: minimal data, instanced rendering.

what would kill you at 1 million:
- per-entity collision
- per-entity AI
- per-entity sprite variety (texture switches)
- per-entity complex shaders

what you could do:
- 1 million particles (visual only, no logic)
- 10,000 enemies with collision/AI + 990,000 particles
- 100,000 enemies with simple behavior + spatial hash collision

the secret: most of what looks like "millions of things" in games
is actually a small number of meaningful entities + a large number
of dumb particles.


the laws of physics (sort of)
-----------------------------

there are hard limits:

MEMORY BUS BANDWIDTH
a DDR4 system might move 25 GB/s.
1 million entities at 12 bytes each = 12 MB.
at 60fps = 720 MB/s just for entity data.
that's only 3% of bandwidth. plenty of room.

but a naive approach (64 bytes, plus overhead) could be
10x worse. suddenly you're at 30%.
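
the same bandwidth calculation as a python sketch. the 25 GB/s figure and
the entity sizes come from the text above. the function name and defaults
are made up for this example, and it assumes every entity is re-uploaded
every frame:

```python
def bandwidth_fraction(entities: int, bytes_per_entity: int,
                       fps: int = 60, bus_gb_per_s: float = 25.0) -> float:
    """Fraction of the memory bus used by re-sending all entities each frame."""
    bytes_per_second = entities * bytes_per_entity * fps
    return bytes_per_second / (bus_gb_per_s * 1e9)


if __name__ == "__main__":
    for size in (12, 64):
        share = bandwidth_fraction(1_000_000, size)
        print(f"{size:>2} bytes per entity: {share:.1%} of the bus")
```

the raw 64-byte payload alone comes to roughly 15% here. the rest of the
gap to 30% would have to come from the "plus overhead" part, which this
sketch does not model.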

CLOCK CYCLES
a 3GHz CPU core runs 3 billion cycles per second. call it roughly
3 billion simple operations.
at 60fps, that's 50 million operations per frame.
1 million entities = 50 operations each.

50 operations is: a few multiplies, some loads/stores, a branch.
that's barely enough for "move in a direction".
pathfinding? AI? collision? not a chance.

PARALLELISM
GPUs have thousands of cores but they're simple.
CPUs have few cores but they're smart.

entity rendering: perfectly parallel (GPU wins)
entity decision-making: often sequential (CPU bound)

so yes, physics constrains us. but "physics" here means:
- how fast electrons move through silicon
- how much data fits on a wire
- how many transistors fit on a chip

within those limits, there's room. lots of room, if you're clever.
lofivor went from 5k to 700k by being clever, not by breaking physics.


the actual lesson
-----------------

the limit isn't really "the hardware can't do it."

the limit is "the hardware can't do it THE WAY YOU'RE DOING IT."

every optimization in lofivor was finding a different way:
- don't draw circles, blit textures
- don't call functions, submit vertices directly
- don't send matrices, send packed structs
- don't update on CPU, use compute shaders

the hardware was always capable of 700k. the code wasn't asking right.

this is true at every level. that old laptop struggling with 10k
entities in some game? probably not the laptop's fault. probably
the game is doing something wasteful that doesn't need to be.

"runs poorly on old hardware" often means "we didn't try to make
it run on old hardware" not "it's impossible on old hardware."


closing thought
---------------

10 million is a lot. but 1 million? 2 million?

with discipline: yes.
with decisions that respect the hardware: yes.
with awareness of what's actually expensive: yes.

the knowledge of what's expensive is the key.

most developers don't have it. they use high-level abstractions
that hide the cost. they've never seen a frame budget or a
bandwidth calculation.

lofivor is a learning tool. the journey from 5k to 700k teaches
where the costs are. once you see them, you can't unsee them.

you start asking: "what is this actually doing? what does it cost?
is there a cheaper way?"

that's the skill. not the specific techniques; those change with
hardware. the skill is asking the questions.