Compare commits

7 commits
0.7.0 ... main

18 changed files with 1186 additions and 23 deletions

@@ -1,12 +1,14 @@
 name: release
 on:
-  release:
-    types: [published]
+  push:
+    tags:
+      - '*'
 jobs:
   build:
-    runs-on: codeberg-small
+    runs-on: ubuntu-latest
+    container: catthehacker/ubuntu:act-latest
     steps:
       - uses: actions/checkout@v4
@@ -35,16 +37,32 @@ jobs:
       - name: Upload to release
         env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          FORGEJO_TOKEN: ${{ secrets.FORGEJO_TOKEN }}
         run: |
-          RELEASE_ID="${{ github.event.release.id }}"
-          API_URL="${{ github.api_url }}/repos/${{ github.repository }}/releases/${RELEASE_ID}/assets"
+          TAG="${{ github.ref_name }}"
+          API_BASE="${{ github.server_url }}/api/v1"
+          REPO="${{ github.repository }}"
+          # check if release exists
+          RELEASE_ID=$(curl -sf \
+            -H "Authorization: token ${FORGEJO_TOKEN}" \
+            "${API_BASE}/repos/${REPO}/releases/tags/${TAG}" | jq -r '.id // empty')
+          if [ -z "$RELEASE_ID" ]; then
+            echo "Creating release for ${TAG}..."
+            RELEASE_ID=$(curl -sf \
+              -H "Authorization: token ${FORGEJO_TOKEN}" \
+              -H "Content-Type: application/json" \
+              -d '{"tag_name":"'"${TAG}"'","name":"'"${TAG}"'"}' \
+              "${API_BASE}/repos/${REPO}/releases" | jq -r '.id')
+          fi
+          echo "Release ID: ${RELEASE_ID}"
           for file in lofivor-linux-x86_64 lofivor-windows-x86_64.exe; do
             echo "Uploading $file..."
-            curl -X POST \
-              -H "Authorization: token ${GITHUB_TOKEN}" \
-              -H "Content-Type: application/octet-stream" \
-              --data-binary @"$file" \
-              "${API_URL}?name=${file}"
+            curl -sf \
+              -H "Authorization: token ${FORGEJO_TOKEN}" \
+              -F "attachment=@${file}" \
+              "${API_BASE}/repos/${REPO}/releases/${RELEASE_ID}/assets?name=${file}"
           done
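For reference, the `jq -r '.id // empty'` lookup pattern in the new step can be exercised offline. The JSON literals below are stand-ins for what the Forgejo releases API returns, not captured output; `jq` is assumed to be installed:

```shell
# stand-in for a successful GET /repos/{owner}/{repo}/releases/tags/{tag}
response='{"id": 42, "tag_name": "0.7.0"}'

# '.id // empty' prints the id if present, or nothing at all,
# so an empty result means "release does not exist yet"
RELEASE_ID=$(echo "$response" | jq -r '.id // empty')
echo "found release id: ${RELEASE_ID}"

# a 404 body (no "id" key) yields an empty string, which
# triggers the create-release branch in the workflow
MISSING=$(echo '{"message": "Not Found"}' | jq -r '.id // empty')
[ -z "$MISSING" ] && echo "would create release"
```

The `// empty` alternative operator is what makes the `[ -z "$RELEASE_ID" ]` test reliable: without it, a missing key would print the literal string `null`.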

@@ -82,8 +82,8 @@ these target the rendering bottleneck since update loop is already fast.
 | technique              | description                                                          | expected gain                   |
 | ---------------------- | -------------------------------------------------------------------- | ------------------------------- |
-| ~~SSBO instance data~~ | ~~pack (x, y, color) = 12 bytes instead of 64-byte matrices~~        | **done** - see optimization 5   |
-| compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync            | significant                     |
+| SSBO instance data     | pack (x, y, color) = 12 bytes instead of 64-byte matrices            | done - see optimization 5       |
+| compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync            | done - see optimization 6       |
 | OpenGL vs Vulkan       | test raylib's Vulkan backend                                         | unknown                         |
 | discrete GPU testing   | test on dedicated GPU where instancing/SSBO shine                    | significant (different hw)      |
@@ -126,6 +126,33 @@ currently not the bottleneck - update stays <1ms at 100k. these become relevant
 | entity pools           | pre-allocated, reusable entity slots                                 | reduces allocation overhead     |
 | component packing      | minimize struct padding                                              | better cache utilization        |
+
+#### estimated gains summary
+
+| Optimization           | Expected Gain | Why                                               |
+|------------------------|---------------|---------------------------------------------------|
+| SIMD updates           | 0%            | Update already on GPU                             |
+| Multithreaded update   | 0%            | Update already on GPU                             |
+| Cache-friendly layouts | 0%            | CPU doesn't iterate entities                      |
+| Fixed-point math       | 0% or worse   | GPUs are optimized for float                      |
+| SoA vs AoS             | ~5%           | Only helps data upload, not bottleneck            |
+| Frustum culling        | 5-15%         | Most entities converge to center anyway           |
+| LOD rendering          | 20-40%        | Real gains - fewer fragments for distant entities |
+| Temporal techniques    | ~50%          | But with visual artifacts (flickering)            |
+
+Realistic total if you did everything: ~30-50% improvement.
+That'd take you from ~1.4M @ 38fps to maybe ~1.8-2M @ 38fps, or ~1.4M @ 50-55fps.
+
+What would actually move the needle:
+
+- GPU-side frustum culling in compute shader (cull before render, not after)
+- point sprites instead of quads for distant entities (4 vertices → 1)
+- indirect draw calls (GPU decides what to render, CPU never touches entity data)
+
+Your real bottleneck is fill rate and vertex throughput on HD 530 integrated graphics. The CPU side is already essentially free.
 ---
 ## testing methodology

TODO.md

@@ -59,7 +59,7 @@ further options (if needed):
 - [x] increase raylib batch buffer (currently 8192 vertices = 2048 quads)
 - [x] GPU instancing (single draw call for all entities)
 - [x] SSBO instance data (12 bytes vs 64-byte matrices)
-- [ ] compute shader entity updates (if raylib supports)
+- [x] compute shader entity updates (raylib supports via rlgl)
 - [ ] compare OpenGL vs Vulkan backend
 findings (i5-6500T / HD 530):
@@ -68,14 +68,18 @@ findings (i5-6500T / HD 530):
 - instancing doesn't help on integrated graphics (shared RAM, no PCIe savings)
 - bottleneck is memory bandwidth, not draw call overhead
 - rlgl batching is already near-optimal for this hardware
+- compute shaders: update time ~5ms → ~0ms at 150k entities (CPU freed entirely)
-## future optimization concepts
-- [ ] SIMD entity updates (AVX2/SSE)
-- [ ] struct-of-arrays vs array-of-structs benchmark
-- [ ] multithreaded update loop (thread pool)
-- [ ] cache-friendly memory layouts
-- [ ] LOD rendering (skip distant entities or reduce detail)
-- [ ] frustum culling (only render visible)
-- [ ] temporal techniques (update subset per frame)
-- [ ] fixed-point vs floating-point math
+## future optimization concepts (GPU-focused)
+- [ ] GPU-side frustum culling in compute shader
+- [ ] point sprites for distant/small entities (4 verts → 1)
+- [ ] indirect draw calls (glDrawArraysIndirect)
+
+## future optimization concepts (CPU - not currently bottleneck)
+- [ ] SIMD / SoA / multithreading (if game logic makes CPU hot again)
+
+## other ideas that aren't about optimization
+- [ ] scanline shader

docs/GLOSSARY.txt (new file)

@ -0,0 +1,292 @@
lofivor glossary
================
terms that come up when optimizing graphics.
clock cycle
-----------
one "tick" of the processor's internal clock.
a CPU or GPU has a crystal oscillator that vibrates at a fixed rate.
each vibration = one cycle. the processor does some work each cycle.
1 GHz = 1 billion cycles per second
1 MHz = 1 million cycles per second
so a 1 GHz processor has 1 billion opportunities to do work per second.
"one operation per cycle" is idealized. real work often takes multiple
cycles (memory access: 100+ cycles, division: 10-20 cycles, add: 1 cycle).
your HD 530 runs at ~950 MHz, so roughly 950 million cycles per second.
at 60fps, that's about 15.8 million cycles per frame.
fill rate
---------
pixels written per second. measured in megapixels/s or gigapixels/s.
fill rate = ROPs * clock speed * pixels per clock
your HD 530: 3 ROPs * 950 MHz * 1 = 2.85 GPixels/s theoretical max.
overdraw
--------
drawing the same pixel multiple times per frame.
if two entities overlap, the back one gets drawn, then the front one
overwrites it. the back one's work was wasted.
overdraw ratio = total pixels drawn / screen pixels
1080p = 2.07M pixels. if you draw 20M pixels, overdraw = ~10x.
bandwidth
---------
data transfer rate. measured in bytes/second (GB/s, MB/s).
memory bandwidth = how fast data moves between processor and RAM.
your HD 530 shares DDR4 with the CPU: ~30 GB/s total.
a discrete GPU has dedicated VRAM: 200-900 GB/s.
latency
-------
time delay. measured in nanoseconds (ns) or cycles.
memory latency = time to fetch data from RAM.
- L1 cache: ~4 cycles
- L2 cache: ~12 cycles
- L3 cache: ~40 cycles
- main RAM: ~200 cycles
this is why cache matters. a cache miss = 50x slower than a hit.
throughput vs latency
---------------------
latency = how long ONE thing takes.
throughput = how many things per second.
a pipeline can have high latency but high throughput.
example: a car wash takes 10 minutes (latency).
but if cars enter every 1 minute, throughput is 60 cars/hour.
GPUs hide latency with throughput. one thread waits for memory?
switch to another thread. thousands of threads keep the GPU busy.
draw call
---------
one command from CPU to GPU: "draw this batch of geometry."
each draw call has overhead:
- CPU prepares command buffer
- driver validates state
- GPU switches context
1 draw call for 1M triangles: fast.
1M draw calls for 1M triangles: slow.
lofivor uses 1 draw call for all entities (instanced rendering).
instancing
----------
drawing many copies of the same geometry in one draw call.
instead of: draw triangle, draw triangle, draw triangle...
you say: draw this triangle 1 million times, here are the positions.
the GPU handles the replication. massively more efficient.
shader
------
a small program that runs on the GPU.
the name is historical - early shaders calculated shading/lighting.
but today: a shader is just software running on GPU hardware.
it doesn't have to involve shading at all.
more precisely: a shader turns one piece of data into another piece of data.
- vertex shader: positions → screen coordinates
- fragment shader: fragments → pixel colors
- compute shader: data → data (anything)
GPUs are massively parallel, so shaders run on thousands of inputs at once.
CPUs have stagnated; GPUs keep getting faster. modern engines like UE5
increasingly use shaders for work that used to be CPU-only.
SSBO (shader storage buffer object)
-----------------------------------
a block of GPU memory that shaders can read/write.
unlike uniforms (small, read-only), SSBOs can be large and writable.
lofivor stores all entity data in an SSBO: positions, velocities, colors.
compute shader
--------------
a shader that does general computation, not rendering.
runs on GPU cores but doesn't output pixels. just processes data.
lofivor uses compute shaders to update entity positions.
because compute exists, shaders can be anything: physics, AI, sorting,
image processing. the GPU is a general-purpose parallel processor.
fragment / pixel shader
-----------------------
program that runs once per pixel (actually per "fragment").
determines the final color of each pixel. this is where:
- texture sampling happens
- lighting calculations happen
- the expensive math lives
lofivor's fragment shader: sample texture, multiply by color. trivial.
AAA game fragment shader: 500+ instructions. expensive.
vertex shader
-------------
program that runs once per vertex.
transforms 3D positions to screen positions. lofivor's vertex shader
reads from SSBO and positions the quad corners.
ROP (render output unit)
------------------------
final stage of GPU pipeline. writes pixels to framebuffer.
handles: depth test, stencil test, blending, antialiasing.
your bottleneck on HD 530. see docs/rops.txt.
TMU (texture mapping unit)
--------------------------
samples textures. reads pixel colors from texture memory.
your HD 530 has 24 TMUs. they're fast (22.8 GTexels/s).
texture sampling is cheap relative to ROPs on this hardware.
EU (execution unit)
-------------------
intel's term for shader cores.
your HD 530 has 24 EUs, each with 8 ALUs = 192 ALUs total.
these run your vertex, fragment, and compute shaders.
ALU (arithmetic logic unit)
---------------------------
does math. add, multiply, compare, bitwise operations.
one ALU can do one operation per cycle (simple ops).
complex ops (sqrt, sin, cos) take multiple cycles.
framebuffer
-----------
the image being rendered. lives in GPU memory.
at 1080p with 32-bit color: 1920 * 1080 * 4 = 8.3 MB.
double-buffered (front + back): 16.6 MB.
vsync
-----
synchronizing frame presentation with monitor refresh.
without vsync: tearing (half old frame, half new frame).
with vsync: smooth, but if you miss 16.7ms, you wait for next refresh.
frame budget
------------
time available per frame.
60 fps = 16.67 ms per frame
30 fps = 33.33 ms per frame
everything (CPU + GPU) must complete within budget or frames drop.
pipeline stall
--------------
GPU waiting for something. bad for performance.
causes:
- waiting for memory (cache miss)
- waiting for previous stage to finish
- synchronization points (barriers)
- `discard` in fragment shader (breaks early-z)
early-z
-------
optimization: test depth BEFORE running fragment shader.
if pixel will be occluded, skip the expensive shader work.
`discard` breaks this because GPU can't know depth until shader runs.
LOD (level of detail)
---------------------
using simpler geometry/textures for distant objects.
far away = fewer pixels = less detail needed.
saves vertices, texture bandwidth, and fill rate.
frustum culling
---------------
don't draw what's outside the camera view.
the "frustum" is the pyramid-shaped visible region.
anything outside = wasted work. cull it before sending to GPU.
spatial partitioning
--------------------
organizing entities by position for fast queries.
types: grid, quadtree, octree, BVH.
"which entities are near point X?" goes from O(n) to O(log n).
essential for collision detection at scale.

@ -0,0 +1,119 @@
# intel hd 530 optimization guide for lofivor
based on hardware specs and empirical testing.
## hardware constraints
from `intel_hd_graphics_530.txt`:
| resource | value | implication |
| ---------- | ------- | ------------- |
| ROPs | 3 | fill rate limited - this is our ceiling |
| TMUs | 24 | texture sampling is relatively fast |
| memory | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM |
| pixel rate | 2.85 GPixel/s | max theoretical throughput |
| EUs | 24 (192 ALUs) | decent compute, weak vs discrete |
| L3 cache | 768 KB | small, cache misses hurt |
the bottleneck is ROPs (fill rate), not vertices or compute.
## what works (proven)
### SSBO instance data
- 16 bytes per entity vs 64 bytes (matrices)
- minimizes bandwidth on shared memory bus
- result: ~5x improvement over instancing
### compute shader updates
- GPU does position/velocity updates
- no CPU→GPU sync per frame
- result: update time essentially free
### texture sampling
- 22.8 GTexel/s is fast relative to other units
- pre-baked circle texture beats procedural math
- result: 2x faster than procedural fragment shader
### instanced triangles/quads
- most optimized driver path
- intel mesa heavily optimizes this
- result: baseline, hard to beat
## what doesn't work (proven)
### point sprites
- theoretically 6x fewer vertices
- reality: 2.4x SLOWER on this hardware
- triangle rasterizer is more optimized
- see `docs/point_sprites_experiment.md`
### procedural fragment shaders
- `length()`, `smoothstep()`, `discard` are expensive
- EUs are weaker than discrete GPUs
- `discard` breaks early-z optimization
- result: 3.7x slower than texture sampling
### complex fragment math
- only 24 EUs, each running 8 ALUs
- transcendentals (sqrt, sin, cos) are 4x slower than FMAD
- avoid in hot path
## what to try next (theoretical)
### likely to help
| technique | why it should work | expected gain |
| ----------- | ------------------- | --------------- |
| frustum culling (GPU) | reduce fill rate, which is bottleneck | 10-30% depending on view |
| smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40% |
| early-z / depth pre-pass | skip fragment work for occluded pixels | moderate |
### unlikely to help
| technique | why it won't help |
| ----------- | ------------------ |
| more vertex optimization | already fill rate bound, not vertex bound |
| SIMD on CPU | updates already on GPU |
| multithreading | CPU isn't the bottleneck |
| different vertex layouts | negligible vs fill rate |
### uncertain (need to test)
| technique | notes |
| ----------- | ------- |
| vulkan backend | might have less driver overhead, or might not matter |
| indirect draw calls | GPU decides what to render, but we're not CPU bound |
| fp16 in shaders | HD 530 has 2:1 fp16 ratio, might help fragment shader |
## key insights
1. fill rate is king - with only 3 ROPs, everything comes down to how many
pixels we're writing. optimizations that don't reduce pixel count won't
help.
2. shared memory hurts - no dedicated VRAM means CPU and GPU compete for
bandwidth. keep data transfers minimal.
3. driver optimization matters - the "common path" (triangles) is more
optimized than alternatives (points). don't be clever.
4. texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture
lookups over ALU math in fragment shaders.
5. avoid discard - breaks early-z, causes pipeline stalls. alpha blending
is faster than discard.
## current ceiling
~950k entities @ 57fps (SSBO + compute + quads)
to go higher, we need to reduce fill rate:
- cull offscreen entities
- reduce entity size when zoomed out
- or accept lower fps at higher counts
## references
- intel gen9 compute architecture whitepaper
- empirical benchmarks in `benchmark_current_i56500t.log`
- point sprites experiment in `docs/point_sprites_experiment.md`

@ -0,0 +1,89 @@
# point sprites experiment
branch: `point-sprites`
date: 2024-12
hardware: intel hd 530 (skylake gt2, i5-6500T)
## hypothesis
point sprites should be faster than quads because:
- 1 vertex per entity instead of 6 (quad = 2 triangles)
- less vertex throughput
- `gl_PointCoord` provides texture coords automatically
## implementation
### vertex shader changes
- removed quad vertex attributes (position, texcoord)
- use `gl_PointSize = 16.0 * zoom` for size control
- position calculated from SSBO data only
### fragment shader changes
- use `gl_PointCoord` instead of vertex texcoord
- sample circle texture for alpha
### renderer changes
- load `glEnable` and `glDrawArraysInstanced` via `rlGetProcAddress`
- enable `GL_PROGRAM_POINT_SIZE`
- draw with `glDrawArraysInstanced(GL_POINTS, 0, 1, count)`
- removed VBO (no vertex data needed)
## results
### attempt 1: procedural circle in fragment shader
```glsl
vec2 coord = gl_PointCoord - vec2(0.5);
float dist = length(coord);
float alpha = 1.0 - smoothstep(0.4, 0.5, dist);
if (alpha < 0.01) discard;
```
**benchmark @ 350k entities:**
- point sprites: 23ms render, 43fps
- quads (main): 6.2ms render, 151fps
- **result: 3.7x SLOWER**
**why:** `discard` breaks early-z optimization, `length()` and `smoothstep()` are ALU-heavy, intel integrated GPUs are weak at fragment shader math.
### attempt 2: texture sampling
```glsl
float alpha = texture(circleTexture, gl_PointCoord).r;
finalColor = vec4(fragColor, alpha);
```
**benchmark @ 450k entities:**
- point sprites: 19.1ms render, 52fps
- quads (main): 8.0ms render, 122fps
- **result: 2.4x SLOWER**
better than procedural, but still significantly slower than quads.
## analysis
the theoretical advantage (1/6 vertices) doesn't translate to real performance because:
1. **triangle path is more optimized** - intel's driver heavily optimizes the standard triangle rasterization path. point sprites use a less-traveled code path.
2. **fill rate is the bottleneck** - HD 530 has only 3 ROPs. we're bound by how fast we can write pixels, not by vertex count. reducing vertices from 6 to 1 doesn't help when fill rate is the constraint.
3. **point size overhead** - each point requires computing `gl_PointSize` and setting up the point sprite rasterization, which may have per-vertex overhead.
4. **texture cache behavior** - `gl_PointCoord` may have worse cache locality than explicit vertex texcoords.
## conclusion
**point sprites are a regression on intel hd 530.**
the optimization makes theoretical sense but fails in practice on this hardware. the quad/triangle path is simply more optimized in intel's mesa driver.
**keep this branch for testing on discrete GPUs** where point sprites might actually help (nvidia/amd have different optimization priorities).
## lessons learned
1. always benchmark, don't assume
2. "fewer vertices" doesn't always mean faster
3. integrated GPU optimization is different from discrete
4. the most optimized path is usually the most common path (triangles)
5. fill rate matters more than vertex count at high entity counts

docs/rops.txt (new file)

@ -0,0 +1,201 @@
rops: render output units
=========================
what they are, where they came from, and what yours can do.
what is a rop?
--------------
ROP = Render Output Unit (originally "Raster Operations Pipeline")
it's the final stage of the GPU pipeline. after all the fancy shader
math is done, the ROP is the unit that actually writes pixels to memory.
think of it as the bottleneck between "calculated" and "visible."
a ROP does:
- depth testing (is this pixel in front of what's already there?)
- stencil testing (mask operations)
- blending (alpha, additive, etc)
- anti-aliasing resolve
- writing the final color to the framebuffer
one ROP can write one pixel per clock cycle (roughly).
the first rop
-------------
the term comes from the IBM 8514/A (1987), which had dedicated hardware
for "raster operations" - bitwise operations on pixels (AND, OR, XOR).
this was revolutionary because before this, the CPU did all pixel math.
but the modern ROP as we know it emerged with:
NVIDIA NV1 (1995)
one of the first chips with dedicated pixel output hardware
could do ~1 million textured pixels/second
3dfx Voodoo (1996)
the card that defined the modern GPU pipeline
had 1 TMU + 1 pixel pipeline (essentially 1 ROP)
could push 45 million pixels/second
that ONE pipeline ran Quake at 640x480
NVIDIA GeForce 256 (1999)
"the first GPU" - named itself with that term
4 pixel pipelines = 4 ROPs
480 million pixels/second
so the original consumer 3D cards had... 1 ROP. and they ran Quake.
what one rop can do
-------------------
let's do the math.
one ROP at 100 MHz (3dfx Voodoo era):
100 million cycles/second
~1 pixel per cycle
= 100 megapixels/second
at 640x480 @ 60fps:
640 * 480 * 60 = 18.4 megapixels/second needed
so ONE ROP at 100MHz could handle 640x480 with ~5x headroom for overdraw.
at 1024x768 @ 60fps:
1024 * 768 * 60 = 47 megapixels/second
now you're at 2x overdraw max. still playable, but tight.
one modern rop
--------------
a single modern ROP runs at ~1-2 GHz and can do more per cycle:
- multiple color outputs (MRT)
- 64-bit or 128-bit color formats
- compressed writes
rough estimate for one ROP at 1.5 GHz:
~1.5 billion pixels/second base throughput
at 1920x1080 @ 60fps:
1920 * 1080 * 60 = 124 megapixels/second
one ROP could handle 1080p with 12x overdraw headroom.
at 4K @ 60fps:
3840 * 2160 * 60 = 497 megapixels/second
one ROP could handle 4K with 3x overdraw. tight, but possible.
your three rops (intel hd 530)
------------------------------
HD 530 specs:
- 3 ROPs
- ~950 MHz boost clock
- theoretical: 2.85 GPixels/second
let's break that down:
at 1080p @ 60fps (124 MP/s needed):
2850 / 124 = 23x overdraw budget
that's actually generous! you could draw each pixel 23 times.
so why does lofivor struggle at 1M entities?
because 1M entities at 4x4 pixels = 16M pixels minimum.
but with overlap? let's say average 10x overdraw:
160M pixels/frame
at 60fps = 9.6 billion pixels/second
your ceiling is 2.85 billion.
so you're 3.4x over budget. that's why you top out around 300k-400k
before frame drops (which matches empirical testing).
the real constraint
-------------------
ROPs don't work in isolation. they're limited by:
1. MEMORY BANDWIDTH
each pixel write = memory access
HD 530 shares DDR4 with CPU (~30 GB/s)
at 32-bit color: 30GB/s / 4 bytes = 7.5 billion pixels/second max
but you're competing with CPU, texture reads, etc.
realistic: maybe 2-3 billion pixels for framebuffer writes
2. TEXTURE SAMPLING
if fragment shader samples textures, TMUs must keep up
HD 530 has 24 TMUs, so this isn't the bottleneck
3. SHADER EXECUTION
ROPs wait for fragments to be shaded
if shaders are slow, ROPs starve
lofivor's shaders are trivial, so this isn't the bottleneck
for lofivor specifically: your 3 ROPs are THE ceiling.
what could you do with more rops?
---------------------------------
comparison:
Intel HD 530: 3 ROPs, 2.85 GPixels/s
GTX 1060: 48 ROPs, 72 GPixels/s
RTX 3080: 96 ROPs, 164 GPixels/s
RTX 4090: 176 ROPs, 443 GPixels/s
with a GTX 1060 (25x your fill rate):
lofivor could probably hit 5-10 million entities
with an RTX 4090 (155x your fill rate):
tens of millions, limited by other factors
perspective: what 3 rops means historically
-------------------------------------------
your HD 530 has roughly the fill rate of:
- GeForce 4 Ti 4600 (2002): 4 ROPs, 1.2 GPixels/s
- Radeon 9700 Pro (2002): 8 ROPs, 2.6 GPixels/s
you're running hardware that, in raw pixel output, matches GPUs from
20+ years ago. but with modern features (compute shaders, SSBO, etc).
this is why lofivor is interesting: you're achieving 700k+ entities
on fill-rate-equivalent hardware that originally ran games with
maybe 10,000 triangles on screen.
the difference is technique. those 2002 games did complex per-pixel
lighting, shadows, multiple texture passes. lofivor does one texture
sample and one blend. same fill rate, 100x the entities.
the lesson
----------
ROPs are simple: they write pixels.
the number you have determines your pixel budget.
everything else (shaders, vertices, CPU logic) only matters if
the ROPs aren't your bottleneck.
with 3 ROPs, you have roughly 2.85 billion pixels/second.
spend them wisely:
- cull what's offscreen (don't spend pixels on invisible things)
- shrink distant objects (LOD saves pixels)
- reduce overlap (spatial organization)
- keep shaders simple (don't starve the ROPs)
your 3 ROPs can do remarkable things. Quake ran on 1.

@ -0,0 +1,316 @@
why rendering millions of entities is hard
=========================================
and what "hard" actually means, from first principles.
the simple answer
-----------------
every frame, your computer does work. work takes time. you have 16.7
milliseconds to do all the work before the next frame (at 60fps).
if the work takes longer than 16.7ms, you miss the deadline. frames drop.
the game stutters.
10 million entities means 10 million units of work. whether that fits in
16.7ms depends on how much work each unit is.
what is "work" anyway?
----------------------
let's trace what happens when you draw one entity:
1. CPU: "here's an entity at position (340, 512), color cyan"
2. that data travels over a bus to the GPU
3. GPU: receives the data, stores it in memory
4. GPU: runs a vertex shader (figures out where on screen)
5. GPU: runs a fragment shader (figures out what color each pixel is)
6. GPU: writes pixels to the framebuffer
7. framebuffer gets sent to your monitor
each step has a speed limit. the slowest step is your bottleneck.
the bottlenecks, explained simply
---------------------------------
MEMORY BANDWIDTH
how fast data can move around. measured in GB/s.
think of it like a highway. you can have a fast car (processor), but
if the highway is jammed, you're stuck in traffic.
an integrated GPU (like Intel HD 530) shares the highway with the CPU.
a discrete GPU (like an RTX card) has its own private highway.
this is why lofivor's SSBO optimization helped so much: shrinking
entity data from 64 bytes to 12 bytes means 5x less traffic.
DRAW CALLS
every time you say "GPU, draw this thing", there's overhead.
the CPU and GPU have to synchronize, state gets set up, etc.
1 draw call for 1 million entities: fast
1 million draw calls for 1 million entities: slow
this is why batching matters. not the drawing itself, but the
*coordination* of drawing.
FILL RATE
how many pixels the GPU can color per second.
a 4x4 pixel entity = 16 pixels
1 million entities = 16 million pixels minimum
but your screen is only ~2 million pixels (1920x1080). so entities
overlap. "overdraw" means coloring the same pixel multiple times.
10 million overlapping entities might touch each pixel 50+ times.
that's 100 million pixel operations.
SHADER COMPLEXITY
the GPU runs a tiny program for each vertex and each pixel.
simple: "put it here, color it this" = fast
complex: "calculate lighting from 8 sources, sample 4 textures,
apply normal mapping, do fresnel..." = slow
lofivor's shaders are trivial. AAA game shaders are not.
CPU-GPU SYNCHRONIZATION
the CPU and GPU work in parallel, but sometimes they have to wait
for each other.
if the CPU needs to read GPU results, it stalls.
if the GPU needs new data and the CPU is busy, it stalls.
good code keeps them both busy without waiting.
why "real games" hit CPU walls
------------------------------
rendering is just putting colors on pixels. that's the GPU's job.
but games aren't just rendering. they're also:
- COLLISION DETECTION
does entity A overlap entity B?
naive approach: check every pair
1,000 entities = 500,000 checks (n squared / 2)
10,000 entities = 50,000,000 checks
1,000,000 entities = 500,000,000,000 checks
that's 500 billion. per frame. not happening.
smart approach: spatial partitioning (grids, quadtrees)
only check nearby entities. but still, at millions of entities,
even "nearby" is a lot.
- AI / BEHAVIOR
each entity decides what to do.
simple: move toward player. cheap.
complex: pathfind around obstacles, consider threats, coordinate
with allies, remember state. expensive.
lofivor entities just drift in a direction. no decisions.
a real game enemy makes decisions every frame.
- PHYSICS
entities push each other, bounce, have mass and friction.
every interaction is math. lots of entities = lots of math.
- GAME LOGIC
damage calculations, spawning, leveling, cooldowns, buffs...
all of this runs on the CPU, every frame.
so: lofivor can render 700k entities because they don't DO anything.
a game with 700k entities that think, collide, and interact would
need god-tier optimization or would simply not run.
what makes AAA games slow on old hardware?
------------------------------------------
it's not entity count. most AAA games have maybe hundreds of
"entities" on screen. it's everything else:
TEXTURE RESOLUTION
a 4K texture is 67 million pixels of data. per texture.
one character might have 10+ textures (diffuse, normal, specular,
roughness, ambient occlusion...).
old hardware: less VRAM, slower texture sampling.
SHADER COMPLEXITY
modern materials simulate light physics. subsurface scattering,
global illumination, ray-traced reflections.
each pixel might do hundreds of math operations.
POST-PROCESSING
bloom, motion blur, depth of field, ambient occlusion, anti-aliasing.
full-screen passes that touch every pixel multiple times.
MESH COMPLEXITY
a character might be 100,000 triangles.
10 characters = 1 million triangles.
each triangle goes through the vertex shader.
SHADOWS
render the scene again from the light's perspective.
for each light. every frame.
AAA games are doing 100x more work per pixel than lofivor.
lofivor is doing 100x more pixels than AAA games.
different problems.
the "abuse" vs "respect" distinction
------------------------------------
abuse: making the hardware do unnecessary work.
respect: achieving your goal with minimal waste.
examples of abuse (that lofivor fixed):
- sending 64 bytes (a full matrix) when you need 12 bytes (x, y, color)
- one draw call per entity when you could batch
- calculating transforms on CPU when GPU could do it
- clearing the screen twice
- uploading the same data every frame
examples of abuse in the wild:
- electron apps using a whole browser to show a chat window
- games that re-render static UI every frame
- loading 4K textures for objects that appear 20 pixels tall
- running AI pathfinding for off-screen entities
the hardware has limits. respecting them means fitting your game
within those limits through smart decisions. abusing them means
throwing cycles at problems you created yourself.
so can you do 1 million entities with juice on old hardware?
------------------------------------------------------------
yes, with the right decisions.
what "juice" typically means:
- screen shake (free, just offset the camera)
- particle effects (separate system, heavily optimized)
- flash/hit feedback (change a color value)
- sound (different system entirely)
particles are special: they're designed for millions of tiny things.
they don't collide, don't think, often don't even persist (spawn,
drift, fade, die). GPU particle systems are essentially what lofivor
became: minimal data, instanced rendering.
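the spawn/drift/fade/die lifecycle fits in a few lines. a generic sketch in C, not lofivor's code; names are hypothetical:

```c
#include <stddef.h>

// hypothetical minimal particle: no collision, no AI, just drift and fade
typedef struct {
    float x, y;    // position
    float vx, vy;  // velocity
    float life;    // seconds remaining; <= 0 means dead
} Particle;

// advance every particle by dt seconds; returns the number still alive.
// dead slots are filled by swapping in the last live particle, so the
// live range stays contiguous.
size_t update_particles(Particle *p, size_t count, float dt) {
    size_t i = 0;
    while (i < count) {
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].life -= dt;
        if (p[i].life <= 0.0f) {
            p[i] = p[--count];   // swap-remove: overwrite with last live
        } else {
            i++;
        }
    }
    return count;
}
```

swap-removal keeps live particles contiguous in memory, which is part of what makes iterating millions of them cheap.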
what would kill you at 1 million:
- per-entity collision
- per-entity AI
- per-entity sprite variety (texture switches)
- per-entity complex shaders
what you could do:
- 1 million particles (visual only, no logic)
- 10,000 enemies with collision/AI + 990,000 particles
- 100,000 enemies with simple behavior + spatial hash collision
the secret: most of what looks like "millions of things" in games
is actually a small number of meaningful entities + a large number
of dumb particles.
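the spatial hash mentioned above, sketched in C with hypothetical names and sizes. the point: each entity only tests the 3x3 cells around it instead of all n^2 pairs (positions are assumed to lie inside the grid):

```c
#include <string.h>

#define CELL_SIZE    16.0f   // cell edge >= max collision diameter
#define GRID_W       64
#define GRID_H       64
#define MAX_PER_CELL 8

typedef struct { float x, y; } Vec2;

typedef struct {
    int count[GRID_W * GRID_H];
    int slot[GRID_W * GRID_H][MAX_PER_CELL]; // entity indices per cell
} Grid;

void grid_build(Grid *g, const Vec2 *pos, int n) {
    memset(g->count, 0, sizeof g->count);
    for (int i = 0; i < n; i++) {
        int cx = (int)(pos[i].x / CELL_SIZE);
        int cy = (int)(pos[i].y / CELL_SIZE);
        int c = cy * GRID_W + cx;
        if (g->count[c] < MAX_PER_CELL)
            g->slot[c][g->count[c]++] = i;
    }
}

// count entities within `radius` of entity i, scanning only nearby cells
int grid_neighbors(const Grid *g, const Vec2 *pos, int i, float radius) {
    int found = 0;
    int cx = (int)(pos[i].x / CELL_SIZE);
    int cy = (int)(pos[i].y / CELL_SIZE);
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int nx = cx + dx, ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= GRID_W || ny >= GRID_H) continue;
            int c = ny * GRID_W + nx;
            for (int s = 0; s < g->count[c]; s++) {
                int j = g->slot[c][s];
                if (j == i) continue;
                float ddx = pos[j].x - pos[i].x, ddy = pos[j].y - pos[i].y;
                if (ddx * ddx + ddy * ddy <= radius * radius) found++;
            }
        }
    return found;
}
```

the cost per query stops depending on total entity count and starts depending on local density, which is what makes 100k colliding entities plausible.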
the laws of physics (sort of)
-----------------------------
there are hard limits:
MEMORY BUS BANDWIDTH
a DDR4 system might move 25 GB/s.
1 million entities at 12 bytes each = 12 MB.
at 60fps = 720 MB/s just for entity data.
that's only 3% of bandwidth. plenty of room.
but a naive approach (64 bytes, plus overhead) could be
10x worse. suddenly you're at 30%.
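the bandwidth arithmetic above as a check, in C (25 GB/s is the same assumption as above):

```c
// fraction of the memory bus eaten by streaming entity data:
// entities * bytes each * frames per second / bus bytes per second
double bus_fraction(double entities, double bytes_per_entity,
                    double fps, double bus_bytes_per_sec) {
    return entities * bytes_per_entity * fps / bus_bytes_per_sec;
}
```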
CLOCK CYCLES
a 3GHz CPU does 3 billion operations per second.
at 60fps, that's 50 million operations per frame.
1 million entities = 50 operations each.
50 operations is: a few multiplies, some loads/stores, a branch.
that's barely enough for "move in a direction".
pathfinding? AI? collision? not a chance.
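same arithmetic for the cycle budget:

```c
// operations available per entity per frame:
// clock rate / frames per second / entity count
double ops_per_entity(double hz, double fps, double entities) {
    return hz / fps / entities;
}
```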
PARALLELISM
GPUs have thousands of cores but they're simple.
CPUs have few cores but they're smart.
entity rendering: perfectly parallel (GPU wins)
entity decision-making: often sequential (CPU bound)
so yes, physics constrains us. but "physics" here means:
- how fast electrons move through silicon
- how much data fits on a wire
- how many transistors fit on a chip
within those limits, there's room. lots of room, if you're clever.
lofivor went from 5k to 700k by being clever, not by breaking physics.
the actual lesson
-----------------
the limit isn't really "the hardware can't do it."
the limit is "the hardware can't do it THE WAY YOU'RE DOING IT."
every optimization in lofivor was finding a different way:
- don't draw circles, blit textures
- don't call functions, submit vertices directly
- don't send matrices, send packed structs
- don't update on CPU, use compute shaders
the hardware was always capable of 700k. the code wasn't asking right.
this is true at every level. that old laptop struggling with 10k
entities in some game? probably not the laptop's fault. probably
the game is doing something wasteful that doesn't need to be.
"runs poorly on old hardware" often means "we didn't try to make
it run on old hardware" not "it's impossible on old hardware."
closing thought
---------------
10 million is a lot. but 1 million? 2 million?
with discipline: yes.
with decisions that respect the hardware: yes.
with awareness of what's actually expensive: yes.
the knowledge of what's expensive is the key.
most developers don't have it. they use high-level abstractions
that hide the cost. they've never seen a frame budget or a
bandwidth calculation.
lofivor is a learning tool. the journey from 5k to 700k teaches
where the costs are. once you see them, you can't unsee them.
you start asking: "what is this actually doing? what does it cost?
is there a cheaper way?"
that's the skill. not the specific techniques—those change with
hardware. the skill is asking the questions.


@ -0,0 +1,8 @@
the baseline: one draw call per entity, pure and simple
- individual rl.drawCircle() calls in a loop
- ~5k entities at 60fps before frame times tank
- roughly linear scaling: 10k = ~43ms, 20k = ~77ms
- render-bound (update loop stays under 1ms even at 30k)
- each circle is its own GPU draw call
- the starting point for optimization experiments


@ -0,0 +1,8 @@
pre-render once, blit many: 10x improvement
- render circle to 16x16 texture at startup
- drawTexture() per entity instead of drawCircle()
- raylib batches same-texture draws internally
- ~50k entities at 60fps
- simple change, big win
- still one function call per entity, but GPU work is batched


@ -0,0 +1,9 @@
bypass the wrapper, go straight to rlgl: 2x more
- skip drawTexture(), submit vertices directly via rl.gl
- manually build quads: rlTexCoord2f + rlVertex2f per corner
- rlBegin/rlEnd wraps the whole entity loop
- ~100k entities at 60fps
- eliminates per-call function overhead
- vertices go straight to GPU buffer
- 20x improvement over baseline


@ -0,0 +1,11 @@
bigger buffer, fewer flushes: squeezing out more headroom
- increased raylib batch buffer from 8192 to 32768 vertices
- ~140k entities at 60fps on i5-6500T
- ~40% improvement over default buffer
- fewer GPU flushes per frame
- also added: release workflows for github and forgejo
- added OPTIMIZATIONS.md documenting the journey
- added README, UI panel with FPS display
- heap allocated entity array to support 1 million entities
- per-entity RGB colors
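one way a change like this is made, assuming it went through raylib's config.h (the define exists there; whether lofivor patched it exactly this way is an assumption):

```c
// raylib src/config.h: the internal render batch size knob.
// stock desktop default is 8192; raising it means rlgl flushes
// to the GPU less often per frame.
#define RL_DEFAULT_BATCH_BUFFER_ELEMENTS   32768
```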


@ -0,0 +1,13 @@
gpu instancing: a disappointing discovery
- drawMeshInstanced() with per-entity transform matrices
- ~150k entities at 60fps - barely better than rlgl batching
- negligible improvement on integrated graphics
- why it didn't help:
- integrated GPU shares system RAM (no PCIe transfer savings)
- 64-byte matrix per entity vs ~80 bytes for rlgl vertices
- bottleneck is memory bandwidth, not draw call overhead
- rlgl batching already minimizes draw calls effectively
- orthographic camera setup for 2D-like rendering
- heap-allocated transforms buffer (64MB too big for stack)
- lesson learned: not all "advanced" techniques are wins
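the 64-byte figure, checked in C:

```c
#include <stdint.h>

// the per-entity cost that sank instancing here: a full 4x4 float
// transform is 64 bytes, so a million of them is a 64 MB upload
typedef struct { float m[16]; } Mat4;

uint64_t matrix_buffer_bytes(uint64_t entities) {
    return entities * sizeof(Mat4);
}
```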


@ -0,0 +1,17 @@
ssbo breakthrough: 5x gain by shrinking the data
- pack entity data (x, y, color) into 12-byte struct
- upload via shader storage buffer object (SSBO)
- ~700k entities at 60fps (i5-6500T / HD 530)
- ~950k entities at ~57fps
- 5x improvement over previous best
- 140x total from baseline
- why it works:
- 12 bytes vs 64 bytes (matrices) = 5.3x less bandwidth
- 12 bytes vs 80 bytes (rlgl vertices) = 6.7x less bandwidth
- no CPU-side matrix calculations
- GPU does NDC conversion and color unpacking
- custom vertex/fragment shaders
- single rlDrawVertexArrayInstanced() call for all entities
- shaders embedded at build time
- removed FPS cap, added optional vsync arg
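a C-side sketch of that 12-byte layout; field names are hypothetical, not lofivor's actual identifiers:

```c
#include <stdint.h>

typedef struct {
    float    x;     // 4 bytes, world position
    float    y;     // 4 bytes
    uint32_t rgba;  // 4 bytes, packed color the shader unpacks per instance
} PackedEntity;

// the whole optimization depends on this staying 12 bytes
_Static_assert(sizeof(PackedEntity) == 12, "SSBO stride must stay 12 bytes");
```

one detail worth knowing: with three scalar members like these, a GLSL std430 buffer keeps the 12-byte array stride; a vec2 member would bump the struct's alignment and pad the stride to 16.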


@ -0,0 +1,5 @@
cross-platform release: adding windows to the party
- updated github release workflow
- builds for both linux and windows now
- no code changes, just CI/CD work


@ -0,0 +1,10 @@
zoom and pan: making millions of entities explorable
- mouse wheel zoom
- click and drag panning
- orthographic camera transforms
- memory panel showing entity buffer sizes
- background draws immediately (no flicker)
- tab key toggles UI panels
- explained "lofivor" name in README (lo-fi survivor)
- shader updated for zoom/pan transforms
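the zoom-about-cursor math, assuming a `screen = world * zoom + pan` camera model (lofivor's actual shader transform may differ):

```c
typedef struct { float x, y; } V2;

// invert the assumed camera model: world = (screen - pan) / zoom
static V2 screen_to_world(V2 s, V2 pan, float zoom) {
    return (V2){ (s.x - pan.x) / zoom, (s.y - pan.y) / zoom };
}

// zoom about the mouse: pick the new pan so the world point under the
// cursor stays under the cursor after the zoom change
static V2 zoom_about(V2 mouse, V2 pan, float old_zoom, float new_zoom) {
    float k = new_zoom / old_zoom;
    return (V2){ mouse.x - (mouse.x - pan.x) * k,
                 mouse.y - (mouse.y - pan.y) * k };
}
```

the invariant is easy to state: the world point under the cursor before the zoom equals the world point under the cursor after it.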


@ -0,0 +1,5 @@
quick exit: zoom out then quit
- q key first zooms out, second press quits
- nice way to see the full entity field before closing
- minor UI text fix


@ -0,0 +1,11 @@
compute shader: moving physics to the GPU
- entity position updates now run on GPU via compute shader
- GPU-based RNG for entity velocity randomization
- full simulation loop stays on GPU, no CPU roundtrip
- new compute.zig module for shader management
- GpuEntity struct with position, velocity, and color
- tracy profiling integration
- FPS display turns green (good) or red (bad)
- added design docs for zoom/pan and compute shader work
- cross-platform alignment fixes for shader data
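GPU shader RNGs are usually stateless hashes: each entity derives its own random stream from its index, with no shared state to synchronize. lofivor's exact choice isn't shown here; pcg_hash is a common one and illustrates the idea (shown in C, but it ports to GLSL line for line):

```c
#include <stdint.h>

// pcg_hash, a widely used GPU hash (Jarzynski & Olano). stateless:
// same input always gives the same output, so it parallelizes freely.
static uint32_t pcg_hash(uint32_t input) {
    uint32_t state = input * 747796405u + 2891336453u;
    uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

// map to [0, 1) for e.g. velocity randomization
static float rand01(uint32_t seed) {
    return (float)(pcg_hash(seed) >> 8) / 16777216.0f; // 24-bit mantissa
}
```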