Add doc "why rendering millions of entities is hard"

Jared Miller 2025-12-17 14:02:41 -05:00
parent 9f3495b882
commit 6dcafc8f3c

why rendering millions of entities is hard
==========================================
and what "hard" actually means, from first principles.
the simple answer
-----------------
every frame, your computer does work. work takes time. you have 16.7
milliseconds to do all the work before the next frame (at 60fps).
if the work takes longer than 16.7ms, you miss the deadline. frames drop.
the game stutters.
10 million entities means 10 million units of work. whether that fits in
16.7ms depends on how much work each unit is.
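that arithmetic fits in a few lines (a sketch; 60fps and 10 million
entities are the numbers from above):

```python
# how much time each entity may cost before the frame deadline slips
FRAME_BUDGET_MS = 1000.0 / 60.0  # ~16.7 ms per frame at 60fps
entities = 10_000_000

budget_per_entity_ns = FRAME_BUDGET_MS * 1_000_000 / entities
# each entity gets well under 2 nanoseconds of total time
```

under 2 nanoseconds per entity is a handful of clock cycles. that alone
says the per-entity work has to be nearly free.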
what is "work" anyway?
----------------------
let's trace what happens when you draw one entity:
1. CPU: "here's an entity at position (340, 512), color cyan"
2. that data travels over a bus to the GPU
3. GPU: receives the data, stores it in memory
4. GPU: runs a vertex shader (figures out where on screen)
5. GPU: runs a fragment shader (figures out what color each pixel is)
6. GPU: writes pixels to the framebuffer
7. framebuffer gets sent to your monitor
each step has a speed limit. the slowest step is your bottleneck.
the bottlenecks, explained simply
---------------------------------
MEMORY BANDWIDTH
how fast data can move around. measured in GB/s.
think of it like a highway. you can have a fast car (processor), but
if the highway is jammed, you're stuck in traffic.
an integrated GPU (like Intel HD 530) shares the highway with the CPU.
a discrete GPU (like an RTX card) has its own private highway.
this is why lofivor's SSBO optimization helped so much: shrinking
entity data from 64 bytes to 12 bytes means 5x less traffic.
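the size difference is easy to check with python's struct module (a
sketch; the 12-byte layout here, x and y floats plus one packed color,
is an assumption about what "x, y, color" packs down to):

```python
import struct

fat = struct.calcsize("16f")    # a full 4x4 float matrix: 64 bytes
lean = struct.calcsize("<ffI")  # x, y floats + one packed RGBA u32: 12 bytes
ratio = fat / lean              # ~5.3x less data per entity
```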
DRAW CALLS
every time you say "GPU, draw this thing", there's overhead.
the CPU and GPU have to synchronize, state gets set up, etc.
1 draw call for 1 million entities: fast
1 million draw calls for 1 million entities: slow
this is why batching matters. not the drawing itself, but the
*coordination* of drawing.
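a toy model makes the point (the overhead numbers here are made up;
only the shape of the result matters):

```python
CALL_OVERHEAD_US = 10.0  # assumed fixed cost per draw call (microseconds)
PER_VERTEX_US = 0.001    # assumed cost per submitted vertex

def frame_cost_us(draw_calls, entities, verts_per_entity=6):
    # total cost = coordination overhead + actual drawing work
    return (draw_calls * CALL_OVERHEAD_US
            + entities * verts_per_entity * PER_VERTEX_US)

naive = frame_cost_us(draw_calls=1_000_000, entities=1_000_000)  # one call each
batched = frame_cost_us(draw_calls=1, entities=1_000_000)        # one call total
```

the drawing work is identical in both cases. the naive version spends
about ten seconds per frame on coordination alone; the batched version
fits inside a frame.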
FILL RATE
how many pixels the GPU can color per second.
a 4x4 pixel entity = 16 pixels
1 million entities = 16 million pixels minimum
but your screen is only ~2 million pixels (1920x1080). so entities
overlap. "overdraw" means coloring the same pixel multiple times.
10 million overlapping entities might touch each pixel 50+ times.
that's 100 million pixel operations.
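the overdraw arithmetic, written out (assuming every entity lands
on-screen, which is the worst case):

```python
screen_pixels = 1920 * 1080  # ~2 million pixels
pixels_per_entity = 16       # a 4x4 entity

touches_1m = 1_000_000 * pixels_per_entity    # 16 million pixel writes
overdraw_1m = touches_1m / screen_pixels      # each pixel colored ~8 times

touches_10m = 10_000_000 * pixels_per_entity  # 160 million pixel writes
overdraw_10m = touches_10m / screen_pixels    # ~77 times, well into "50+"
```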
SHADER COMPLEXITY
the GPU runs a tiny program for each vertex and each pixel.
simple: "put it here, color it this" = fast
complex: "calculate lighting from 8 sources, sample 4 textures,
apply normal mapping, do fresnel..." = slow
lofivor's shaders are trivial. AAA game shaders are not.
CPU-GPU SYNCHRONIZATION
the CPU and GPU work in parallel, but sometimes they have to wait
for each other.
if the CPU needs to read GPU results, it stalls.
if the GPU needs new data and the CPU is busy, it stalls.
good code keeps them both busy without waiting.
why "real games" hit CPU walls
------------------------------
rendering is just putting colors on pixels. that's the GPU's job.
but games aren't just rendering. they're also:
- COLLISION DETECTION
does entity A overlap entity B?
naive approach: check every pair
1,000 entities = 500,000 checks (n squared / 2)
10,000 entities = 50,000,000 checks
1,000,000 entities = 500,000,000,000 checks
that's 500 billion. per frame. not happening.
smart approach: spatial partitioning (grids, quadtrees)
only check nearby entities. but still, at millions of entities,
even "nearby" is a lot.
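a minimal spatial hash sketch (the cell size and api are assumptions,
not lofivor code): bucket entities by grid cell, then only compare
entities in the same or adjacent cells.

```python
from collections import defaultdict

CELL = 32.0  # grid cell size; should cover the largest collision diameter

def build_grid(positions):
    # bucket each entity index by the cell its position falls in
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(int(x // CELL), int(y // CELL))].append(i)
    return grid

def candidate_pairs(positions):
    # only entities in the same or neighboring cells can possibly touch,
    # so these are the only pairs worth a real overlap test
    grid = build_grid(positions)
    pairs = set()
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), []):
                    for i in members:
                        if i < j:
                            pairs.add((i, j))
    return pairs
```

with n entities spread evenly, each cell holds a handful, so the work
stays close to linear instead of n squared.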
- AI / BEHAVIOR
each entity decides what to do.
simple: move toward player. cheap.
complex: pathfind around obstacles, consider threats, coordinate
with allies, remember state. expensive.
lofivor entities just drift in a direction. no decisions.
a real game enemy makes decisions every frame.
- PHYSICS
entities push each other, bounce, have mass and friction.
every interaction is math. lots of entities = lots of math.
- GAME LOGIC
damage calculations, spawning, leveling, cooldowns, buffs...
all of this runs on the CPU, every frame.
so: lofivor can render 700k entities because they don't DO anything.
a game with 700k entities that think, collide, and interact would
need god-tier optimization or would simply not run.
what makes AAA games slow on old hardware?
------------------------------------------
it's not entity count. most AAA games have maybe hundreds of
"entities" on screen. it's everything else:
TEXTURE RESOLUTION
a 4K texture (4096x4096) is about 17 million pixels of data. per texture.
one character might have 10+ textures (diffuse, normal, specular,
roughness, ambient occlusion...).
old hardware: less VRAM, slower texture sampling.
SHADER COMPLEXITY
modern materials simulate light physics. subsurface scattering,
global illumination, ray-traced reflections.
each pixel might do hundreds of math operations.
POST-PROCESSING
bloom, motion blur, depth of field, ambient occlusion, anti-aliasing.
full-screen passes that touch every pixel multiple times.
MESH COMPLEXITY
a character might be 100,000 triangles.
10 characters = 1 million triangles.
each triangle goes through the vertex shader.
SHADOWS
render the scene again from the light's perspective.
for each light. every frame.
AAA games are doing 100x more work per pixel than lofivor.
lofivor is doing 100x more pixels than AAA games.
different problems.
the "abuse" vs "respect" distinction
------------------------------------
abuse: making the hardware do unnecessary work.
respect: achieving your goal with minimal waste.
examples of abuse (that lofivor fixed):
- sending 64 bytes (a full matrix) when you need 12 bytes (x, y, color)
- one draw call per entity when you could batch
- calculating transforms on CPU when GPU could do it
- clearing the screen twice
- uploading the same data every frame
examples of abuse in the wild:
- electron apps using a whole browser to show a chat window
- games that re-render static UI every frame
- loading 4K textures for objects that appear 20 pixels tall
- running AI pathfinding for off-screen entities
the hardware has limits. respecting them means fitting your game
within those limits through smart decisions. abusing them means
throwing cycles at problems you created yourself.
so can you do 1 million entities with juice on old hardware?
------------------------------------------------------------
yes, with the right decisions.
what "juice" typically means:
- screen shake (free, just offset the camera)
- particle effects (separate system, heavily optimized)
- flash/hit feedback (change a color value)
- sound (different system entirely)
particles are special: they're designed for millions of tiny things.
they don't collide, don't think, often don't even persist (spawn,
drift, fade, die). GPU particle systems are essentially what lofivor
became: minimal data, instanced rendering.
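the whole lifecycle fits in a few lines (a sketch of the idea, not
lofivor's implementation):

```python
import random

def spawn(n, x=0.0, y=0.0):
    # minimal particle: position, velocity, remaining life. nothing else.
    return [{"x": x, "y": y,
             "vx": random.uniform(-1, 1), "vy": random.uniform(-1, 1),
             "life": 1.0} for _ in range(n)]

def update(particles, dt):
    for p in particles:
        p["x"] += p["vx"] * dt  # drift
        p["y"] += p["vy"] * dt
        p["life"] -= dt * 0.5   # fade
    return [p for p in particles if p["life"] > 0]  # die
```

no decisions per particle beyond the life check, and no interaction
between particles. that's what makes millions of them cheap.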
what would kill you at 1 million:
- per-entity collision
- per-entity AI
- per-entity sprite variety (texture switches)
- per-entity complex shaders
what you could do:
- 1 million particles (visual only, no logic)
- 10,000 enemies with collision/AI + 990,000 particles
- 100,000 enemies with simple behavior + spatial hash collision
the secret: most of what looks like "millions of things" in games
is actually a small number of meaningful entities + a large number
of dumb particles.
the laws of physics (sort of)
-----------------------------
there are hard limits:
MEMORY BUS BANDWIDTH
a DDR4 system might move 25 GB/s.
1 million entities at 12 bytes each = 12 MB.
at 60fps = 720 MB/s just for entity data.
that's only 3% of bandwidth. plenty of room.
but a naive approach (64 bytes, plus overhead) could be
10x worse. suddenly you're at 30%.
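the same budget, as code (same assumed numbers as above):

```python
bandwidth_bytes_per_s = 25e9  # a DDR4-ish system: 25 GB/s
entities = 1_000_000
bytes_per_entity = 12

per_frame = entities * bytes_per_entity        # 12 MB of entity data
per_second = per_frame * 60                    # 720 MB/s at 60fps
fraction = per_second / bandwidth_bytes_per_s  # ~3% of the bus
```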
CLOCK CYCLES
a 3GHz CPU runs 3 billion cycles per second, call it one simple
operation each.
at 60fps, that's 50 million operations per frame.
1 million entities = 50 operations each.
50 operations is: a few multiplies, some loads/stores, a branch.
that's barely enough for "move in a direction".
pathfinding? AI? collision? not a chance.
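the cycle budget, written out:

```python
cpu_cycles_per_s = 3e9  # a 3GHz core
fps = 60
entities = 1_000_000

cycles_per_frame = cpu_cycles_per_s / fps        # 50 million per frame
cycles_per_entity = cycles_per_frame / entities  # 50 cycles each
```

for scale: a single cache miss to main memory typically costs more
than this entire per-entity budget.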
PARALLELISM
GPUs have thousands of cores but they're simple.
CPUs have few cores but they're smart.
entity rendering: perfectly parallel (GPU wins)
entity decision-making: often sequential (CPU bound)
so yes, physics constrains us. but "physics" here means:
- how fast electrons move through silicon
- how much data fits on a wire
- how many transistors fit on a chip
within those limits, there's room. lots of room, if you're clever.
lofivor went from 5k to 700k by being clever, not by breaking physics.
the actual lesson
-----------------
the limit isn't really "the hardware can't do it."
the limit is "the hardware can't do it THE WAY YOU'RE DOING IT."
every optimization in lofivor was finding a different way:
- don't draw circles, blit textures
- don't call functions, submit vertices directly
- don't send matrices, send packed structs
- don't update on CPU, use compute shaders
the hardware was always capable of 700k. the code wasn't asking right.
this is true at every level. that old laptop struggling with 10k
entities in some game? probably not the laptop's fault. probably
the game is doing something wasteful that doesn't need to be.
"runs poorly on old hardware" often means "we didn't try to make
it run on old hardware" not "it's impossible on old hardware."
closing thought
---------------
10 million is a lot. but 1 million? 2 million?
with discipline: yes.
with decisions that respect the hardware: yes.
with awareness of what's actually expensive: yes.
the knowledge of what's expensive is the key.
most developers don't have it. they use high-level abstractions
that hide the cost. they've never seen a frame budget or a
bandwidth calculation.
lofivor is a learning tool. the journey from 5k to 700k teaches
where the costs are. once you see them, you can't unsee them.
you start asking: "what is this actually doing? what does it cost?
is there a cheaper way?"
that's the skill. not the specific techniques—those change with
hardware. the skill is asking the questions.