Compare commits
No commits in common. "main" and "0.7.0" have entirely different histories.
18 changed files with 23 additions and 1186 deletions
@@ -1,14 +1,12 @@
 name: release

 on:
-  push:
-    tags:
-      - '*'
+  release:
+    types: [published]

 jobs:
   build:
-    runs-on: ubuntu-latest
-    container: catthehacker/ubuntu:act-latest
+    runs-on: codeberg-small

     steps:
       - uses: actions/checkout@v4
@@ -37,32 +35,16 @@ jobs:

       - name: Upload to release
         env:
-          FORGEJO_TOKEN: ${{ secrets.FORGEJO_TOKEN }}
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
         run: |
-          TAG="${{ github.ref_name }}"
-          API_BASE="${{ github.server_url }}/api/v1"
-          REPO="${{ github.repository }}"
-
-          # check if release exists
-          RELEASE_ID=$(curl -sf \
-            -H "Authorization: token ${FORGEJO_TOKEN}" \
-            "${API_BASE}/repos/${REPO}/releases/tags/${TAG}" | jq -r '.id // empty')
-
-          if [ -z "$RELEASE_ID" ]; then
-            echo "Creating release for ${TAG}..."
-            RELEASE_ID=$(curl -sf \
-              -H "Authorization: token ${FORGEJO_TOKEN}" \
-              -H "Content-Type: application/json" \
-              -d '{"tag_name":"'"${TAG}"'","name":"'"${TAG}"'"}' \
-              "${API_BASE}/repos/${REPO}/releases" | jq -r '.id')
-          fi
-
-          echo "Release ID: ${RELEASE_ID}"
+          RELEASE_ID="${{ github.event.release.id }}"
+          API_URL="${{ github.api_url }}/repos/${{ github.repository }}/releases/${RELEASE_ID}/assets"

           for file in lofivor-linux-x86_64 lofivor-windows-x86_64.exe; do
             echo "Uploading $file..."
-            curl -sf \
-              -H "Authorization: token ${FORGEJO_TOKEN}" \
-              -F "attachment=@${file}" \
-              "${API_BASE}/repos/${REPO}/releases/${RELEASE_ID}/assets?name=${file}"
+            curl -X POST \
+              -H "Authorization: token ${GITHUB_TOKEN}" \
+              -H "Content-Type: application/octet-stream" \
+              --data-binary @"$file" \
+              "${API_URL}?name=${file}"
           done
@@ -82,8 +82,8 @@ these target the rendering bottleneck since update loop is already fast.

 | technique | description | expected gain |
 | ---------------------- | -------------------------------------------------------------------- | ------------------------------- |
-| SSBO instance data | pack (x, y, color) = 12 bytes instead of 64-byte matrices | done - see optimization 5 |
-| compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync | done - see optimization 6 |
+| ~~SSBO instance data~~ | ~~pack (x, y, color) = 12 bytes instead of 64-byte matrices~~ | **done** - see optimization 5 |
+| compute shader updates | move entity positions to GPU entirely, avoid CPU→GPU sync | significant |
 | OpenGL vs Vulkan | test raylib's Vulkan backend | unknown |
 | discrete GPU testing | test on dedicated GPU where instancing/SSBO shine | significant (different hw) |

@@ -126,33 +126,6 @@ currently not the bottleneck - update stays <1ms at 100k. these become relevant
 | entity pools | pre-allocated, reusable entity slots | reduces allocation overhead |
 | component packing | minimize struct padding | better cache utilization |

-#### estimated gains summary
-
-| Optimization | Expected Gain | Why |
-|------------------------|---------------|---------------------------------------------------|
-| SIMD updates | 0% | Update already on GPU |
-| Multithreaded update | 0% | Update already on GPU |
-| Cache-friendly layouts | 0% | CPU doesn't iterate entities |
-| Fixed-point math | 0% or worse | GPUs are optimized for float |
-| SoA vs AoS | ~5% | Only helps data upload, not bottleneck |
-| Frustum culling | 5-15% | Most entities converge to center anyway |
-| LOD rendering | 20-40% | Real gains - fewer fragments for distant entities |
-| Temporal techniques | ~50% | But with visual artifacts (flickering) |
-
-Realistic total if you did everything: ~30-50% improvement
-
-That'd take you from ~1.4M @ 38fps to maybe ~1.8-2M @ 38fps, or ~1.4M @ 50-55fps.
-
-What would actually move the needle:
-- GPU-side frustum culling in compute shader (cull before render, not after)
-- Point sprites instead of quads for distant entities (4 vertices → 1)
-- Indirect draw calls (GPU decides what to render, CPU never touches entity data)
-
-Your real bottleneck is fill rate and vertex throughput on HD 530 integrated
-graphics. The CPU side is already essentially free.
-
-
-
 ---

 ## testing methodology
24
TODO.md
@@ -59,7 +59,7 @@ further options (if needed):
 - [x] increase raylib batch buffer (currently 8192 vertices = 2048 quads)
 - [x] GPU instancing (single draw call for all entities)
 - [x] SSBO instance data (12 bytes vs 64-byte matrices)
-- [x] compute shader entity updates (raylib supports via rlgl)
+- [ ] compute shader entity updates (if raylib supports)
 - [ ] compare OpenGL vs Vulkan backend

 findings (i5-6500T / HD 530):
@@ -68,18 +68,14 @@ findings (i5-6500T / HD 530):
 - instancing doesn't help on integrated graphics (shared RAM, no PCIe savings)
 - bottleneck is memory bandwidth, not draw call overhead
 - rlgl batching is already near-optimal for this hardware
-- compute shaders: update time ~5ms → ~0ms at 150k entities (CPU freed entirely)

-## future optimization concepts (GPU-focused)
+## future optimization concepts

-- [ ] GPU-side frustum culling in compute shader
-- [ ] point sprites for distant/small entities (4 verts → 1)
-- [ ] indirect draw calls (glDrawArraysIndirect)
-
-## future optimization concepts (CPU - not currently bottleneck)
-
-- [ ] SIMD / SoA / multithreading (if game logic makes CPU hot again)
-
-## other ideas that aren't about optimization
-
-- [ ] scanline shader
+- [ ] SIMD entity updates (AVX2/SSE)
+- [ ] struct-of-arrays vs array-of-structs benchmark
+- [ ] multithreaded update loop (thread pool)
+- [ ] cache-friendly memory layouts
+- [ ] LOD rendering (skip distant entities or reduce detail)
+- [ ] frustum culling (only render visible)
+- [ ] temporal techniques (update subset per frame)
+- [ ] fixed-point vs floating-point math
@@ -1,292 +0,0 @@
-lofivor glossary
-================
-
-terms that come up when optimizing graphics.
-
-
-clock cycle
------------
-
-one "tick" of the processor's internal clock.
-
-a CPU or GPU has a crystal oscillator that vibrates at a fixed rate.
-each vibration = one cycle. the processor does some work each cycle.
-
-1 GHz = 1 billion cycles per second
-1 MHz = 1 million cycles per second
-
-so a 1 GHz processor has 1 billion opportunities to do work per second.
-
-"one operation per cycle" is idealized. real work often takes multiple
-cycles (memory access: 100+ cycles, division: 10-20 cycles, add: 1 cycle).
-
-your HD 530 runs at ~950 MHz, so roughly 950 million cycles per second.
-at 60fps, that's about 15.8 million cycles per frame.
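
a minimal sketch of the cycles-per-frame arithmetic from the clock cycle entry, in python. the function name is made up for illustration; the ~950 MHz clock and the 60fps target are the figures quoted in the glossary.

```python
def cycles_per_frame(clock_hz: float, fps: float) -> float:
    """Clock cycles available in one frame at the given frame rate."""
    return clock_hz / fps


# figures quoted in the glossary: HD 530 at ~950 MHz, 60fps target
hd530_cycles = cycles_per_frame(950e6, 60)
print(f"{hd530_cycles / 1e6:.1f} million cycles per frame")
```

the same function works for any clock/fps pair, e.g. a 30fps target doubles the per-frame cycle count.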
-
-
-fill rate
----------
-
-pixels written per second. measured in megapixels/s or gigapixels/s.
-
-fill rate = ROPs * clock speed * pixels per clock
-
-your HD 530: 3 ROPs * 950 MHz * 1 = 2.85 GPixels/s theoretical max.
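
the fill rate formula can be written out as a small python sketch. `fill_rate` is a hypothetical helper, not something from the lofivor codebase; the inputs are the HD 530 numbers quoted in the entry.

```python
def fill_rate(rops: int, clock_hz: float, pixels_per_clock: float = 1.0) -> float:
    """Theoretical peak fill rate in pixels per second."""
    return rops * clock_hz * pixels_per_clock


# HD 530 figures from the glossary: 3 ROPs at ~950 MHz, 1 pixel per clock
hd530 = fill_rate(rops=3, clock_hz=950e6)
print(f"{hd530 / 1e9:.2f} GPixels/s theoretical max")
```

this is an upper bound only; real throughput also depends on memory bandwidth and blending.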
-
-
-overdraw
---------
-
-drawing the same pixel multiple times per frame.
-
-if two entities overlap, the back one gets drawn, then the front one
-overwrites it. the back one's work was wasted.
-
-overdraw ratio = total pixels drawn / screen pixels
-
-1080p = 2.07M pixels. if you draw 20M pixels, overdraw = ~10x.
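
a sketch of the overdraw ratio calculation, assuming a 1920x1080 screen as in the entry. the helper name is illustrative only.

```python
def overdraw_ratio(pixels_drawn: float, width: int, height: int) -> float:
    """Total pixels drawn in a frame divided by the number of screen pixels."""
    return pixels_drawn / (width * height)


# 20M pixels drawn on a 1080p screen (2,073,600 pixels)
ratio = overdraw_ratio(20e6, 1920, 1080)
print(f"overdraw = {ratio:.1f}x")
```

a ratio of 1.0 means every screen pixel was written exactly once.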
-
-
-bandwidth
----------
-
-data transfer rate. measured in bytes/second (GB/s, MB/s).
-
-memory bandwidth = how fast data moves between processor and RAM.
-
-your HD 530 shares DDR4 with the CPU: ~30 GB/s total.
-a discrete GPU has dedicated VRAM: 200-900 GB/s.
-
-
-latency
--------
-
-time delay. measured in nanoseconds (ns) or cycles.
-
-memory latency = time to fetch data from RAM.
-- L1 cache: ~4 cycles
-- L2 cache: ~12 cycles
-- L3 cache: ~40 cycles
-- main RAM: ~200 cycles
-
-this is why cache matters. a cache miss = 50x slower than a hit.
-
-
-throughput vs latency
----------------------
-
-latency = how long ONE thing takes.
-throughput = how many things per second.
-
-a pipeline can have high latency but high throughput.
-
-example: a car wash takes 10 minutes (latency).
-but if cars enter every 1 minute, throughput is 60 cars/hour.
-
-GPUs hide latency with throughput. one thread waits for memory?
-switch to another thread. thousands of threads keep the GPU busy.
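
the car wash example can be checked with a few lines of python. both helper names are hypothetical; the second one applies Little's law (items in flight = latency * arrival rate), which the glossary does not mention but which follows from the same numbers.

```python
def throughput_per_hour(interval_minutes: float) -> float:
    """Items completed per hour when a new item enters every `interval_minutes`."""
    return 60.0 / interval_minutes


def in_flight(latency_minutes: float, interval_minutes: float) -> float:
    """Items inside the pipeline at once (Little's law: latency * arrival rate)."""
    return latency_minutes / interval_minutes


# car wash: 10 minutes per car, a new car enters every minute
print(throughput_per_hour(1.0), "cars/hour")
print(in_flight(10.0, 1.0), "cars in the wash at once")
```

the second number is the point of the GPU analogy: high throughput with high latency only works if many items are in flight at the same time.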
-
-
-draw call
----------
-
-one command from CPU to GPU: "draw this batch of geometry."
-
-each draw call has overhead:
-- CPU prepares command buffer
-- driver validates state
-- GPU switches context
-
-1 draw call for 1M triangles: fast.
-1M draw calls for 1M triangles: slow.
-
-lofivor uses 1 draw call for all entities (instanced rendering).
-
-
-instancing
-----------
-
-drawing many copies of the same geometry in one draw call.
-
-instead of: draw triangle, draw triangle, draw triangle...
-you say: draw this triangle 1 million times, here are the positions.
-
-the GPU handles the replication. massively more efficient.
-
-
-shader
-------
-
-a small program that runs on the GPU.
-
-the name is historical - early shaders calculated shading/lighting.
-but today: a shader is just software running on GPU hardware.
-it doesn't have to do with shading at all.
-
-more precisely: a shader turns one piece of data into another piece of data.
-- vertex shader: positions → screen coordinates
-- fragment shader: fragments → pixel colors
-- compute shader: data → data (anything)
-
-GPUs are massively parallel, so shaders run on thousands of inputs at once.
-CPUs have stagnated; GPUs keep getting faster. modern engines like UE5
-increasingly use shaders for work that used to be CPU-only.
-
-
-SSBO (shader storage buffer object)
------------------------------------
-
-a block of GPU memory that shaders can read/write.
-
-unlike uniforms (small, read-only), SSBOs can be large and writable.
-lofivor stores all entity data in an SSBO: positions, velocities, colors.
-
-
-compute shader
---------------
-
-a shader that does general computation, not rendering.
-
-runs on GPU cores but doesn't output pixels. just processes data.
-lofivor uses compute shaders to update entity positions.
-
-because compute exists, shaders can be anything: physics, AI, sorting,
-image processing. the GPU is a general-purpose parallel processor.
-
-
-fragment / pixel shader
------------------------
-
-program that runs once per pixel (actually per "fragment").
-
-determines the final color of each pixel. this is where:
-- texture sampling happens
-- lighting calculations happen
-- the expensive math lives
-
-lofivor's fragment shader: sample texture, multiply by color. trivial.
-AAA game fragment shader: 500+ instructions. expensive.
-
-
-vertex shader
--------------
-
-program that runs once per vertex.
-
-transforms 3D positions to screen positions. lofivor's vertex shader
-reads from SSBO and positions the quad corners.
-
-
-ROP (render output unit)
-------------------------
-
-final stage of GPU pipeline. writes pixels to framebuffer.
-
-handles: depth test, stencil test, blending, antialiasing.
-your bottleneck on HD 530. see docs/rops.txt.
-
-
-TMU (texture mapping unit)
---------------------------
-
-samples textures. reads pixel colors from texture memory.
-
-your HD 530 has 24 TMUs. they're fast (22.8 GTexels/s).
-texture sampling is cheap relative to ROPs on this hardware.
-
-
-EU (execution unit)
--------------------
-
-intel's term for shader cores.
-
-your HD 530 has 24 EUs, each with 8 ALUs = 192 ALUs total.
-these run your vertex, fragment, and compute shaders.
-
-
-ALU (arithmetic logic unit)
----------------------------
-
-does math. add, multiply, compare, bitwise operations.
-
-one ALU can do one operation per cycle (simple ops).
-complex ops (sqrt, sin, cos) take multiple cycles.
-
-
-framebuffer
------------
-
-the image being rendered. lives in GPU memory.
-
-at 1080p with 32-bit color: 1920 * 1080 * 4 = 8.3 MB.
-double-buffered (front + back): 16.6 MB.
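
a short sketch of the framebuffer size arithmetic, assuming 4 bytes per pixel (32-bit color) and decimal megabytes as in the entry. `framebuffer_bytes` is an illustrative name.

```python
def framebuffer_bytes(width: int, height: int,
                      bytes_per_pixel: int = 4, buffers: int = 1) -> int:
    """Memory needed for `buffers` color buffers of the given resolution."""
    return width * height * bytes_per_pixel * buffers


single = framebuffer_bytes(1920, 1080)
double = framebuffer_bytes(1920, 1080, buffers=2)
print(f"single: {single / 1e6:.1f} MB, double-buffered: {double / 1e6:.1f} MB")
```

depth or stencil buffers, if used, would add to this; they are left out here because the entry only counts color.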
-
-
-vsync
------
-
-synchronizing frame presentation with monitor refresh.
-
-without vsync: tearing (half old frame, half new frame).
-with vsync: smooth, but if you miss 16.7ms, you wait for next refresh.
-
-
-frame budget
-------------
-
-time available per frame.
-
-60 fps = 16.67 ms per frame
-30 fps = 33.33 ms per frame
-
-everything (CPU + GPU) must complete within budget or frames drop.
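
the frame budget numbers follow from 1000 ms divided by the target fps. a minimal python sketch, with hypothetical helper names:

```python
def frame_budget_ms(fps: float) -> float:
    """Milliseconds available per frame at the target frame rate."""
    return 1000.0 / fps


def within_budget(frame_ms: float, fps: float) -> bool:
    """True if a measured frame time fits the budget for the target fps."""
    return frame_ms <= frame_budget_ms(fps)


for target in (60, 30):
    print(f"{target} fps = {frame_budget_ms(target):.2f} ms per frame")
```

for example, a 20 ms frame misses the 60fps budget but fits the 30fps one.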
-
-
-pipeline stall
---------------
-
-GPU waiting for something. bad for performance.
-
-causes:
-- waiting for memory (cache miss)
-- waiting for previous stage to finish
-- synchronization points (barriers)
-- `discard` in fragment shader (breaks early-z)
-
-
-early-z
--------
-
-optimization: test depth BEFORE running fragment shader.
-
-if pixel will be occluded, skip the expensive shader work.
-`discard` breaks this because GPU can't know depth until shader runs.
-
-
-LOD (level of detail)
----------------------
-
-using simpler geometry/textures for distant objects.
-
-far away = fewer pixels = less detail needed.
-saves vertices, texture bandwidth, and fill rate.
-
-
-frustum culling
----------------
-
-don't draw what's outside the camera view.
-
-the "frustum" is the pyramid-shaped visible region.
-anything outside = wasted work. cull it before sending to GPU.
-
-
-spatial partitioning
---------------------
-
-organizing entities by position for fast queries.
-
-types: grid, quadtree, octree, BVH.
-
-"which entities are near point X?" goes from O(n) to O(log n).
-essential for collision detection at scale.
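
as an illustration of the simplest structure in that list, here is a sketch of a uniform grid in python. it is not lofivor's code; the function names, the 2D tuple layout and the restriction `radius <= cell_size` are assumptions made to keep the example short. with evenly spread entities a grid query only touches the 3x3 block of cells around the point, instead of scanning every entity.

```python
from collections import defaultdict
from math import floor


def build_grid(points, cell_size):
    """Map each (cell_x, cell_y) to the indices of the points inside that cell."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(floor(x / cell_size), floor(y / cell_size))].append(i)
    return grid


def query_near(grid, points, cell_size, x, y, radius):
    """Indices of points within `radius` of (x, y); assumes radius <= cell_size."""
    cx, cy = floor(x / cell_size), floor(y / cell_size)
    found = []
    # only the 3x3 block of cells around the query point can contain matches
    for gx in range(cx - 1, cx + 2):
        for gy in range(cy - 1, cy + 2):
            for i in grid.get((gx, gy), ()):
                px, py = points[i]
                if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                    found.append(i)
    return sorted(found)


points = [(1, 1), (2, 2), (50, 50), (9.5, 9.5), (10.5, 10.5)]
grid = build_grid(points, cell_size=10)
print(query_near(grid, points, 10, 10, 10, radius=1.0))
```

trees (quadtree, octree, BVH) handle unevenly distributed entities better, at the cost of more bookkeeping.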
@@ -1,119 +0,0 @@
-# intel hd 530 optimization guide for lofivor
-
-based on hardware specs and empirical testing.
-
-## hardware constraints
-
-from `intel_hd_graphics_530.txt`:
-
-| resource | value | implication |
-| ---------- | ------- | ------------- |
-| ROPs | 3 | fill rate limited - this is our ceiling |
-| TMUs | 24 | texture sampling is relatively fast |
-| memory | shared DDR4 ~30GB/s | bandwidth is precious, no VRAM |
-| pixel rate | 2.85 GPixel/s | max theoretical throughput |
-| EUs | 24 (192 ALUs) | decent compute, weak vs discrete |
-| L3 cache | 768 KB | small, cache misses hurt |
-
-the bottleneck is ROPs (fill rate), not vertices or compute.
-
-## what works (proven)
-
-### SSBO instance data
-- 16 bytes per entity vs 64 bytes (matrices)
-- minimizes bandwidth on shared memory bus
-- result: ~5x improvement over instancing
-
-### compute shader updates
-- GPU does position/velocity updates
-- no CPU→GPU sync per frame
-- result: update time essentially free
-
-### texture sampling
-- 22.8 GTexel/s is fast relative to other units
-- pre-baked circle texture beats procedural math
-- result: 2x faster than procedural fragment shader
-
-### instanced triangles/quads
-- most optimized driver path
-- intel mesa heavily optimizes this
-- result: baseline, hard to beat
-
-## what doesn't work (proven)
-
-### point sprites
-- theoretically 6x fewer vertices
-- reality: 2.4x SLOWER on this hardware
-- triangle rasterizer is more optimized
-- see `docs/point_sprites_experiment.md`
-
-### procedural fragment shaders
-- `length()`, `smoothstep()`, `discard` are expensive
-- EUs are weaker than discrete GPUs
-- `discard` breaks early-z optimization
-- result: 3.7x slower than texture sampling
-
-### complex fragment math
-- only 24 EUs, each running 8 ALUs
-- transcendentals (sqrt, sin, cos) are 4x slower than FMAD
-- avoid in hot path
-
-## what to try next (theoretical)
-
-### likely to help
-
-| technique | why it should work | expected gain |
-| ----------- | ------------------- | --------------- |
-| frustum culling (GPU) | reduce fill rate, which is bottleneck | 10-30% depending on view |
-| smaller points when zoomed out (LOD) | fewer pixels per entity = less ROP work | 20-40% |
-| early-z / depth pre-pass | skip fragment work for occluded pixels | moderate |
-
-### unlikely to help
-
-| technique | why it won't help |
-| ----------- | ------------------ |
-| more vertex optimization | already fill rate bound, not vertex bound |
-| SIMD on CPU | updates already on GPU |
-| multithreading | CPU isn't the bottleneck |
-| different vertex layouts | negligible vs fill rate |
-
-### uncertain (need to test)
-
-| technique | notes |
-| ----------- | ------- |
-| vulkan backend | might have less driver overhead, or might not matter |
-| indirect draw calls | GPU decides what to render, but we're not CPU bound |
-| fp16 in shaders | HD 530 has 2:1 fp16 ratio, might help fragment shader |
-
-## key insights
-
-1. fill rate is king - with only 3 ROPs, everything comes down to how many
-pixels we're writing. optimizations that don't reduce pixel count won't
-help.
-
-2. shared memory hurts - no dedicated VRAM means CPU and GPU compete for
-bandwidth. keep data transfers minimal.
-
-3. driver optimization matters - the "common path" (triangles) is more
-optimized than alternatives (points). don't be clever.
-
-4. texture sampling is cheap - 22.8 GTexel/s is fast. prefer texture
-lookups over ALU math in fragment shaders.
-
-5. avoid discard - breaks early-z, causes pipeline stalls. alpha blending
-is faster than discard.
-
-## current ceiling
-
-~950k entities @ 57fps (SSBO + compute + quads)
-
-to go higher, we need to reduce fill rate:
-- cull offscreen entities
-- reduce entity size when zoomed out
-- or accept lower fps at higher counts
-
-## references
-
-- intel gen9 compute architecture whitepaper
-- empirical benchmarks in `benchmark_current_i56500t.log`
-- point sprites experiment in `docs/point_sprites_experiment.md`
@@ -1,89 +0,0 @@
-# point sprites experiment
-
-branch: `point-sprites` (point-sprites work)
-date: 2024-12
-hardware: intel hd 530 (skylake gt2, i5-6500T)
-
-## hypothesis
-
-point sprites should be faster than quads because:
-- 1 vertex per entity instead of 6 (quad = 2 triangles)
-- less vertex throughput
-- `gl_PointCoord` provides texture coords automatically
-
-## implementation
-
-### vertex shader changes
-- removed quad vertex attributes (position, texcoord)
-- use `gl_PointSize = 16.0 * zoom` for size control
-- position calculated from SSBO data only
-
-### fragment shader changes
-- use `gl_PointCoord` instead of vertex texcoord
-- sample circle texture for alpha
-
-### renderer changes
-- load `glEnable` and `glDrawArraysInstanced` via `rlGetProcAddress`
-- enable `GL_PROGRAM_POINT_SIZE`
-- draw with `glDrawArraysInstanced(GL_POINTS, 0, 1, count)`
-- removed VBO (no vertex data needed)
-
-## results
-
-### attempt 1: procedural circle in fragment shader
-
-```glsl
-vec2 coord = gl_PointCoord - vec2(0.5);
-float dist = length(coord);
-float alpha = 1.0 - smoothstep(0.4, 0.5, dist);
-if (alpha < 0.01) discard;
-```
-
-**benchmark @ 350k entities:**
-- point sprites: 23ms render, 43fps
-- quads (main): 6.2ms render, 151fps
-- **result: 3.7x SLOWER**
-
-**why:** `discard` breaks early-z optimization, `length()` and `smoothstep()` are ALU-heavy, intel integrated GPUs are weak at fragment shader math.
-
-### attempt 2: texture sampling
-
-```glsl
-float alpha = texture(circleTexture, gl_PointCoord).r;
-finalColor = vec4(fragColor, alpha);
-```
-
-**benchmark @ 450k entities:**
-- point sprites: 19.1ms render, 52fps
-- quads (main): 8.0ms render, 122fps
-- **result: 2.4x SLOWER**
-
-better than procedural, but still significantly slower than quads.
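
the "3.7x" and "2.4x" figures are the ratio of the measured render times. a small python sketch that reproduces them from the numbers quoted in the two benchmarks (`slowdown` is an illustrative helper, not part of the benchmark tooling):

```python
def slowdown(candidate_ms: float, baseline_ms: float) -> float:
    """How many times slower the candidate render time is than the baseline."""
    return candidate_ms / baseline_ms


# render times quoted in the two attempts: (point sprites, quads on main)
attempts = {
    "attempt 1 (procedural, 350k)": (23.0, 6.2),
    "attempt 2 (texture, 450k)": (19.1, 8.0),
}
for name, (sprites_ms, quads_ms) in attempts.items():
    print(f"{name}: {slowdown(sprites_ms, quads_ms):.1f}x slower")
```

note that the two attempts were measured at different entity counts, so the ratios are comparable only within each attempt.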
-
-## analysis
-
-the theoretical advantage (1/6 vertices) doesn't translate to real performance because:
-
-1. **triangle path is more optimized** - intel's driver heavily optimizes the standard triangle rasterization path. point sprites use a less-traveled code path.
-
-2. **fill rate is the bottleneck** - HD 530 has only 3 ROPs. we're bound by how fast we can write pixels, not by vertex count. reducing vertices from 6 to 1 doesn't help when fill rate is the constraint.
-
-3. **point size overhead** - each point requires computing `gl_PointSize` and setting up the point sprite rasterization, which may have per-vertex overhead.
-
-4. **texture cache behavior** - `gl_PointCoord` may have worse cache locality than explicit vertex texcoords.
-
-## conclusion
-
-**point sprites are a regression on intel hd 530.**
-
-the optimization makes theoretical sense but fails in practice on this hardware. the quad/triangle path is simply more optimized in intel's mesa driver.
-
-**keep this branch for testing on discrete GPUs** where point sprites might actually help (nvidia/amd have different optimization priorities).
-
-## lessons learned
-
-1. always benchmark, don't assume
-2. "fewer vertices" doesn't always mean faster
-3. integrated GPU optimization is different from discrete
-4. the most optimized path is usually the most common path (triangles)
-5. fill rate matters more than vertex count at high entity counts
201
docs/rops.txt
@@ -1,201 +0,0 @@
-rops: render output units
-=========================
-
-what they are, where they came from, and what yours can do.
-
-
-what is a rop?
---------------
-
-ROP = Render Output Unit (originally "Raster Operations Pipeline")
-
-it's the final stage of the GPU pipeline. after all the fancy shader
-math is done, the ROP is the unit that actually writes pixels to memory.
-
-think of it as the bottleneck between "calculated" and "visible."
-
-a ROP does:
-- depth testing (is this pixel in front of what's already there?)
-- stencil testing (mask operations)
-- blending (alpha, additive, etc)
-- anti-aliasing resolve
-- writing the final color to the framebuffer
-
-one ROP can write one pixel per clock cycle (roughly).
-
-
-the first rop
--------------
-
-the term comes from the IBM 8514/A (1987), which had dedicated hardware
-for "raster operations" - bitwise operations on pixels (AND, OR, XOR).
-this was revolutionary because before this, the CPU did all pixel math.
-
-but the modern ROP as we know it emerged with:
-
-NVIDIA NV1 (1995)
-one of the first chips with dedicated pixel output hardware
-could do ~1 million textured pixels/second
-
-3dfx Voodoo (1996)
-the card that defined the modern GPU pipeline
-had 1 TMU + 1 pixel pipeline (essentially 1 ROP)
-could push 45 million pixels/second
-that ONE pipeline ran Quake at 640x480
-
-NVIDIA GeForce 256 (1999)
-"the first GPU" - named itself with that term
-4 pixel pipelines = 4 ROPs
-480 million pixels/second
-
-so the original consumer 3D cards had... 1 ROP. and they ran Quake.
-
-
-what one rop can do
--------------------
-
-let's do the math.
-
-one ROP at 100 MHz (3dfx Voodoo era):
-100 million cycles/second
-~1 pixel per cycle
-= 100 megapixels/second
-
-at 640x480 @ 60fps:
-640 * 480 * 60 = 18.4 megapixels/second needed
-
-so ONE ROP at 100MHz could handle 640x480 with ~5x headroom for overdraw.
-
-at 1024x768 @ 60fps:
-1024 * 768 * 60 = 47 megapixels/second
-
-now you're at 2x overdraw max. still playable, but tight.
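
the headroom figures in this section are the fill rate divided by the pixels a resolution needs per second. a python sketch under the same assumption as the text (one ROP, ~1 pixel per cycle, 100 MHz); the helper names are made up for illustration.

```python
def pixels_needed_per_second(width: int, height: int, fps: int) -> int:
    """Pixels written per second when every screen pixel is drawn once per frame."""
    return width * height * fps


def overdraw_headroom(fill_rate: float, width: int, height: int, fps: int) -> float:
    """How many times each pixel could be drawn per frame at the given fill rate."""
    return fill_rate / pixels_needed_per_second(width, height, fps)


one_rop_100mhz = 100e6  # ~1 pixel per cycle at 100 MHz
for w, h in ((640, 480), (1024, 768)):
    need = pixels_needed_per_second(w, h, 60)
    room = overdraw_headroom(one_rop_100mhz, w, h, 60)
    print(f"{w}x{h} @ 60fps: {need / 1e6:.1f} MP/s needed, ~{room:.1f}x headroom")
```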
|
|
||||||
|
|
||||||
one modern rop
|
|
||||||
--------------
|
|
||||||
|
|
||||||
a single modern ROP runs at ~1-2 GHz and can do more per cycle:
|
|
||||||
- multiple color outputs (MRT)
|
|
||||||
- 64-bit or 128-bit color formats
|
|
||||||
- compressed writes
|
|
||||||
|
|
||||||
rough estimate for one ROP at 1.5 GHz:
|
|
||||||
~1.5 billion pixels/second base throughput
|
|
||||||
|
|
||||||
at 1920x1080 @ 60fps:
|
|
||||||
1920 * 1080 * 60 = 124 megapixels/second
|
|
||||||
|
|
||||||
one ROP could handle 1080p with 12x overdraw headroom.
|
|
||||||
|
|
||||||
at 4K @ 60fps:
|
|
||||||
3840 * 2160 * 60 = 497 megapixels/second
|
|
||||||
|
|
||||||
one ROP could handle 4K with 3x overdraw. tight, but possible.
|
|
||||||
|
|
||||||
|
|
||||||
your three rops (intel hd 530)
|
|
||||||
------------------------------
|
|
||||||
|
|
||||||
HD 530 specs:
|
|
||||||
- 3 ROPs
|
|
||||||
- ~950 MHz boost clock
|
|
||||||
- theoretical: 2.85 GPixels/second
|
|
||||||
|
|
||||||
let's break that down:
|
|
||||||
|
|
||||||
at 1080p @ 60fps (124 MP/s needed):
|
|
||||||
2850 / 124 = 23x overdraw budget
|
|
||||||
|
|
||||||
that's actually generous! you could draw each pixel 23 times.
|
|
||||||
|
|
||||||
so why does lofivor struggle at 1M entities?
|
|
||||||
|
|
||||||
because 1M entities at 4x4 pixels = 16M pixels minimum.
|
|
||||||
but with overlap? let's say average 10x overdraw:
|
|
||||||
160M pixels/frame
|
|
||||||
at 60fps = 9.6 billion pixels/second
|
|
||||||
|
|
||||||
your ceiling is 2.85 billion.
|
|
||||||
|
|
||||||
so you're 3.4x over budget. that's why you top out around 300k-400k
|
|
||||||
before frame drops (which matches empirical testing).
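
the same budget can be turned around to estimate an entity count. a rough sketch using only the figures above (4x4 pixels per entity, the guessed 10x overdraw, 60fps); the function and variable names are illustrative:

```python
ROP_CEILING = 3 * 950e6   # 3 ROPs x ~950 MHz = 2.85 Gpixels/s (figure from the text)


def pixel_demand(entities, pixels_per_entity, overdraw, fps):
    """Pixels/second the ROPs must write for a given entity count."""
    return entities * pixels_per_entity * overdraw * fps


demand = pixel_demand(1_000_000, 4 * 4, 10, 60)
print(demand / 1e9)                     # 9.6 (billion pixels/second)
print(round(demand / ROP_CEILING, 1))   # 3.4 (times over budget)

# entity count at which demand equals the ceiling, same assumptions
max_entities = ROP_CEILING / (4 * 4 * 10 * 60)
print(round(max_entities))              # 296875
```

the 10x overdraw factor is a guess, so the ~300k result is only as good as that guess.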
|
|
||||||
|
|
||||||
|
|
||||||
the real constraint
|
|
||||||
-------------------
|
|
||||||
|
|
||||||
ROPs don't work in isolation. they're limited by:
|
|
||||||
|
|
||||||
1. MEMORY BANDWIDTH
|
|
||||||
each pixel write = memory access
|
|
||||||
HD 530 shares DDR4 with CPU (~30 GB/s)
|
|
||||||
at 32-bit color: 30GB/s / 4 bytes = 7.5 billion pixels/second max
|
|
||||||
but you're competing with CPU, texture reads, etc.
|
|
||||||
realistic: maybe 2-3 billion pixels/second for framebuffer writes
|
|
||||||
|
|
||||||
2. TEXTURE SAMPLING
|
|
||||||
if fragment shader samples textures, TMUs must keep up
|
|
||||||
HD 530 has 24 TMUs, so this isn't the bottleneck
|
|
||||||
|
|
||||||
3. SHADER EXECUTION
|
|
||||||
ROPs wait for fragments to be shaded
|
|
||||||
if shaders are slow, ROPs starve
|
|
||||||
lofivor's shaders are trivial, so this isn't the bottleneck
|
|
||||||
|
|
||||||
for lofivor specifically: your 3 ROPs are THE ceiling.
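
to see which limit bites first, compare the two theoretical ceilings side by side. a sketch under the numbers quoted above (3 ROPs at ~950 MHz, ~30 GB/s shared DDR4, 32-bit color); `pixel_ceilings` is a hypothetical helper:

```python
def pixel_ceilings(rops, clock_hz, bandwidth_bytes_per_s, bytes_per_pixel):
    """Return the two pixel/second ceilings discussed above."""
    return {
        # one pixel per ROP per clock
        "rop": rops * clock_hz,
        # every byte of bandwidth spent on framebuffer writes
        "bandwidth": bandwidth_bytes_per_s / bytes_per_pixel,
    }


limits = pixel_ceilings(rops=3, clock_hz=950e6,
                        bandwidth_bytes_per_s=30e9, bytes_per_pixel=4)
bottleneck = min(limits, key=limits.get)
print(limits["rop"] / 1e9, limits["bandwidth"] / 1e9)   # 2.85 7.5
print(bottleneck)                                       # rop
```

with the "realistic" 2-3 billion figure for bandwidth, the two ceilings land close together, so treat this as a rough comparison rather than a proof.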
|
|
||||||
|
|
||||||
|
|
||||||
what could you do with more rops?
|
|
||||||
---------------------------------
|
|
||||||
|
|
||||||
comparison:
|
|
||||||
|
|
||||||
Intel HD 530: 3 ROPs, 2.85 GPixels/s
|
|
||||||
GTX 1060: 48 ROPs, 72 GPixels/s
|
|
||||||
RTX 3080: 96 ROPs, 164 GPixels/s
|
|
||||||
RTX 4090: 176 ROPs, 443 GPixels/s
|
|
||||||
|
|
||||||
with a GTX 1060 (25x your fill rate):
|
|
||||||
lofivor could probably hit 5-10 million entities
|
|
||||||
|
|
||||||
with an RTX 4090 (155x your fill rate):
|
|
||||||
tens of millions, limited by other factors
|
|
||||||
|
|
||||||
|
|
||||||
perspective: what 3 rops means historically
|
|
||||||
-------------------------------------------
|
|
||||||
|
|
||||||
your HD 530 has roughly the fill rate of:
|
|
||||||
- GeForce 4 Ti 4600 (2002): 4 ROPs, 1.2 GPixels/s
|
|
||||||
- Radeon 9700 Pro (2002): 8 ROPs, 2.6 GPixels/s
|
|
||||||
|
|
||||||
you're running hardware that, in raw pixel output, matches GPUs from
|
|
||||||
20+ years ago. but with modern features (compute shaders, SSBO, etc).
|
|
||||||
|
|
||||||
this is why lofivor is interesting: you're achieving 700k+ entities
|
|
||||||
on fill-rate-equivalent hardware that originally ran games with
|
|
||||||
maybe 10,000 triangles on screen.
|
|
||||||
|
|
||||||
the difference is technique. those 2002 games did complex per-pixel
|
|
||||||
lighting, shadows, multiple texture passes. lofivor does one texture
|
|
||||||
sample and one blend. same fill rate, 100x the entities.
|
|
||||||
|
|
||||||
|
|
||||||
the lesson
|
|
||||||
----------
|
|
||||||
|
|
||||||
ROPs are simple: they write pixels.
|
|
||||||
|
|
||||||
the number you have determines your pixel budget.
|
|
||||||
everything else (shaders, vertices, CPU logic) only matters if
|
|
||||||
the ROPs aren't your bottleneck.
|
|
||||||
|
|
||||||
with 3 ROPs, you have roughly 2.85 billion pixels/second.
|
|
||||||
spend them wisely:
|
|
||||||
- cull what's offscreen (don't spend pixels on invisible things)
|
|
||||||
- shrink distant objects (LOD saves pixels)
|
|
||||||
- reduce overlap (spatial organization)
|
|
||||||
- keep shaders simple (don't starve the ROPs)
|
|
||||||
|
|
||||||
your 3 ROPs can do remarkable things. Quake ran on 1.
|
|
||||||
|
|
@ -1,316 +0,0 @@
|
||||||
why rendering millions of entities is hard
|
|
||||||
=========================================
|
|
||||||
|
|
||||||
and what "hard" actually means, from first principles.
|
|
||||||
|
|
||||||
|
|
||||||
the simple answer
|
|
||||||
-----------------
|
|
||||||
|
|
||||||
every frame, your computer does work. work takes time. you have 16.7
|
|
||||||
milliseconds to do all the work before the next frame (at 60fps).
|
|
||||||
|
|
||||||
if the work takes longer than 16.7ms, you miss the deadline. frames drop.
|
|
||||||
the game stutters.
|
|
||||||
|
|
||||||
10 million entities means 10 million units of work. whether that fits in
|
|
||||||
16.7ms depends on how much work each unit is.
|
|
||||||
|
|
||||||
|
|
||||||
what is "work" anyway?
|
|
||||||
----------------------
|
|
||||||
|
|
||||||
let's trace what happens when you draw one entity:
|
|
||||||
|
|
||||||
1. CPU: "here's an entity at position (340, 512), color cyan"
|
|
||||||
2. that data travels over a bus to the GPU
|
|
||||||
3. GPU: receives the data, stores it in memory
|
|
||||||
4. GPU: runs a vertex shader (figures out where on screen)
|
|
||||||
5. GPU: runs a fragment shader (figures out what color each pixel is)
|
|
||||||
6. GPU: writes pixels to the framebuffer
|
|
||||||
7. framebuffer gets sent to your monitor
|
|
||||||
|
|
||||||
each step has a speed limit. the slowest step is your bottleneck.
|
|
||||||
|
|
||||||
|
|
||||||
the bottlenecks, explained simply
|
|
||||||
---------------------------------
|
|
||||||
|
|
||||||
MEMORY BANDWIDTH
|
|
||||||
how fast data can move around. measured in GB/s.
|
|
||||||
|
|
||||||
think of it like a highway. you can have a fast car (processor), but
|
|
||||||
if the highway is jammed, you're stuck in traffic.
|
|
||||||
|
|
||||||
an integrated GPU (like Intel HD 530) shares the highway with the CPU.
|
|
||||||
a discrete GPU (like an RTX card) has its own private highway.
|
|
||||||
|
|
||||||
this is why lofivor's SSBO optimization helped so much: shrinking
|
|
||||||
entity data from 64 bytes to 12 bytes means 5x less traffic.
|
|
||||||
|
|
||||||
DRAW CALLS
|
|
||||||
every time you say "GPU, draw this thing", there's overhead.
|
|
||||||
the CPU and GPU have to synchronize, state gets set up, etc.
|
|
||||||
|
|
||||||
1 draw call for 1 million entities: fast
|
|
||||||
1 million draw calls for 1 million entities: slow
|
|
||||||
|
|
||||||
this is why batching matters. not the drawing itself, but the
|
|
||||||
*coordination* of drawing.
|
|
||||||
|
|
||||||
FILL RATE
|
|
||||||
how many pixels the GPU can color per second.
|
|
||||||
|
|
||||||
a 4x4 pixel entity = 16 pixels
|
|
||||||
1 million entities = 16 million pixels minimum
|
|
||||||
|
|
||||||
but your screen is only ~2 million pixels (1920x1080). so entities
|
|
||||||
overlap. "overdraw" means coloring the same pixel multiple times.
|
|
||||||
|
|
||||||
10 million overlapping entities might touch each pixel 50+ times.
|
|
||||||
that's 100 million pixel operations.
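
average overdraw is just pixels drawn divided by pixels on screen. a small sketch, assuming every entity is fully on screen and evenly spread (real scenes won't be):

```python
def average_overdraw(entities, entity_w, entity_h, screen_w, screen_h):
    """Average number of times each screen pixel is written,
    assuming every entity is fully on screen and spread evenly."""
    drawn = entities * entity_w * entity_h
    return drawn / (screen_w * screen_h)


print(round(average_overdraw(1_000_000, 4, 4, 1920, 1080), 1))    # 7.7
print(round(average_overdraw(10_000_000, 4, 4, 1920, 1080), 1))   # 77.2
```

the 10 million case lands in the same range as the "50+" above; the exact figure depends on how many entities are actually visible.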
|
|
||||||
|
|
||||||
SHADER COMPLEXITY
|
|
||||||
the GPU runs a tiny program for each vertex and each pixel.
|
|
||||||
|
|
||||||
simple: "put it here, color it this" = fast
|
|
||||||
complex: "calculate lighting from 8 sources, sample 4 textures,
|
|
||||||
apply normal mapping, do fresnel..." = slow
|
|
||||||
|
|
||||||
lofivor's shaders are trivial. AAA game shaders are not.
|
|
||||||
|
|
||||||
CPU-GPU SYNCHRONIZATION
|
|
||||||
the CPU and GPU work in parallel, but sometimes they have to wait
|
|
||||||
for each other.
|
|
||||||
|
|
||||||
if the CPU needs to read GPU results, it stalls.
|
|
||||||
if the GPU needs new data and the CPU is busy, it stalls.
|
|
||||||
|
|
||||||
good code keeps them both busy without waiting.
|
|
||||||
|
|
||||||
|
|
||||||
why "real games" hit CPU walls
|
|
||||||
------------------------------
|
|
||||||
|
|
||||||
rendering is just putting colors on pixels. that's the GPU's job.
|
|
||||||
|
|
||||||
but games aren't just rendering. they're also:
|
|
||||||
|
|
||||||
- COLLISION DETECTION
|
|
||||||
does entity A overlap entity B?
|
|
||||||
|
|
||||||
naive approach: check every pair
|
|
||||||
1,000 entities = 500,000 checks (n squared / 2)
|
|
||||||
10,000 entities = 50,000,000 checks
|
|
||||||
1,000,000 entities = 500,000,000,000 checks
|
|
||||||
|
|
||||||
that's 500 billion. per frame. not happening.
|
|
||||||
|
|
||||||
smart approach: spatial partitioning (grids, quadtrees)
|
|
||||||
only check nearby entities. but still, at millions of entities,
|
|
||||||
even "nearby" is a lot.
|
|
||||||
|
|
||||||
- AI / BEHAVIOR
|
|
||||||
each entity decides what to do.
|
|
||||||
|
|
||||||
simple: move toward player. cheap.
|
|
||||||
complex: pathfind around obstacles, consider threats, coordinate
|
|
||||||
with allies, remember state. expensive.
|
|
||||||
|
|
||||||
lofivor entities just drift in a direction. no decisions.
|
|
||||||
a real game enemy makes decisions every frame.
|
|
||||||
|
|
||||||
- PHYSICS
|
|
||||||
entities push each other, bounce, have mass and friction.
|
|
||||||
every interaction is math. lots of entities = lots of math.
|
|
||||||
|
|
||||||
- GAME LOGIC
|
|
||||||
damage calculations, spawning, leveling, cooldowns, buffs...
|
|
||||||
all of this runs on the CPU, every frame.
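
the collision numbers above can be checked directly, along with what spatial partitioning buys you. a sketch only: the uniform grid below is one possible partitioning scheme, not necessarily what any particular game uses, and the function names are made up for illustration. it assumes the cell size is at least as large as the collision distance.

```python
from collections import defaultdict
from itertools import combinations


def naive_pair_count(n):
    """Number of pair checks when every entity is tested against every other."""
    return n * (n - 1) // 2


def grid_candidate_pairs(points, cell_size):
    """Yield index pairs that share a grid cell or sit in adjacent cells.

    Any two points closer than cell_size are guaranteed to appear,
    so only these candidates need a real distance check.
    """
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // cell_size), int(y // cell_size))].append(i)

    for (cx, cy), members in grid.items():
        # pairs inside the same cell
        yield from combinations(members, 2)
        # pairs with half of the neighbouring cells, so each pair appears once
        for dx, dy in ((1, 0), (0, 1), (1, 1), (1, -1)):
            neighbour = grid.get((cx + dx, cy + dy), ())
            for a in members:
                for b in neighbour:
                    yield (a, b)


print(naive_pair_count(1_000))        # 499500
print(naive_pair_count(1_000_000))    # 499999500000
```

the grid only helps when entities are spread out. if a million entities pile into a few cells, the candidate count climbs back toward the naive figure.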
|
|
||||||
|
|
||||||
so: lofivor can render 700k entities because they don't DO anything.
|
|
||||||
a game with 700k entities that think, collide, and interact would
|
|
||||||
need god-tier optimization or would simply not run.
|
|
||||||
|
|
||||||
|
|
||||||
what makes AAA games slow on old hardware?
|
|
||||||
------------------------------------------
|
|
||||||
|
|
||||||
it's not entity count. most AAA games have maybe hundreds of
|
|
||||||
"entities" on screen. it's everything else:
|
|
||||||
|
|
||||||
TEXTURE RESOLUTION
|
|
||||||
a 4K texture (4096x4096) is ~16.8 million pixels, about 67 MB uncompressed. per texture.
|
|
||||||
one character might have 10+ textures (diffuse, normal, specular,
|
|
||||||
roughness, ambient occlusion...).
|
|
||||||
|
|
||||||
old hardware: less VRAM, slower texture sampling.
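
the texture arithmetic, as a sketch. it assumes "4K" means 4096x4096 at 4 bytes per texel with no mipmaps or compression; real games usually use block-compressed formats, so shipped sizes are lower.

```python
def texture_bytes(width, height, bytes_per_texel=4):
    """Uncompressed size of one texture level (no mipmaps)."""
    return width * height * bytes_per_texel


one = texture_bytes(4096, 4096)
print(4096 * 4096)        # 16777216 texels
print(one / 1e6)          # 67.108864 (MB, uncompressed RGBA8)
print(10 * one / 1e6)     # 671.08864 (MB for ten such maps)
```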
|
|
||||||
|
|
||||||
SHADER COMPLEXITY
|
|
||||||
modern materials simulate light physics. subsurface scattering,
|
|
||||||
global illumination, ray-traced reflections.
|
|
||||||
|
|
||||||
each pixel might do hundreds of math operations.
|
|
||||||
|
|
||||||
POST-PROCESSING
|
|
||||||
bloom, motion blur, depth of field, ambient occlusion, anti-aliasing.
|
|
||||||
full-screen passes that touch every pixel multiple times.
|
|
||||||
|
|
||||||
MESH COMPLEXITY
|
|
||||||
a character might be 100,000 triangles.
|
|
||||||
10 characters = 1 million triangles.
|
|
||||||
each vertex of those triangles goes through the vertex shader.
|
|
||||||
|
|
||||||
SHADOWS
|
|
||||||
render the scene again from the light's perspective.
|
|
||||||
for each light. every frame.
|
|
||||||
|
|
||||||
AAA games are doing 100x more work per pixel than lofivor.
|
|
||||||
lofivor is doing 100x more pixels than AAA games.
|
|
||||||
|
|
||||||
different problems.
|
|
||||||
|
|
||||||
|
|
||||||
the "abuse" vs "respect" distinction
|
|
||||||
------------------------------------
|
|
||||||
|
|
||||||
abuse: making the hardware do unnecessary work.
|
|
||||||
respect: achieving your goal with minimal waste.
|
|
||||||
|
|
||||||
examples of abuse (that lofivor fixed):
|
|
||||||
|
|
||||||
- sending 64 bytes (a full matrix) when you need 12 bytes (x, y, color)
|
|
||||||
- one draw call per entity when you could batch
|
|
||||||
- calculating transforms on CPU when GPU could do it
|
|
||||||
- clearing the screen twice
|
|
||||||
- uploading the same data every frame
|
|
||||||
|
|
||||||
examples of abuse in the wild:
|
|
||||||
|
|
||||||
- electron apps using a whole browser to show a chat window
|
|
||||||
- games that re-render static UI every frame
|
|
||||||
- loading 4K textures for objects that appear 20 pixels tall
|
|
||||||
- running AI pathfinding for off-screen entities
|
|
||||||
|
|
||||||
the hardware has limits. respecting them means fitting your game
|
|
||||||
within those limits through smart decisions. abusing them means
|
|
||||||
throwing cycles at problems you created yourself.
|
|
||||||
|
|
||||||
|
|
||||||
so can you do 1 million entities with juice on old hardware?
|
|
||||||
------------------------------------------------------------
|
|
||||||
|
|
||||||
yes, with the right decisions.
|
|
||||||
|
|
||||||
what "juice" typically means:
|
|
||||||
- screen shake (free, just offset the camera)
|
|
||||||
- particle effects (separate system, heavily optimized)
|
|
||||||
- flash/hit feedback (change a color value)
|
|
||||||
- sound (different system entirely)
|
|
||||||
|
|
||||||
particles are special: they're designed for millions of tiny things.
|
|
||||||
they don't collide, don't think, often don't even persist (spawn,
|
|
||||||
drift, fade, die). GPU particle systems are essentially what lofivor
|
|
||||||
became: minimal data, instanced rendering.
|
|
||||||
|
|
||||||
what would kill you at 1 million:
|
|
||||||
- per-entity collision
|
|
||||||
- per-entity AI
|
|
||||||
- per-entity sprite variety (texture switches)
|
|
||||||
- per-entity complex shaders
|
|
||||||
|
|
||||||
what you could do:
|
|
||||||
- 1 million particles (visual only, no logic)
|
|
||||||
- 10,000 enemies with collision/AI + 990,000 particles
|
|
||||||
- 100,000 enemies with simple behavior + spatial hash collision
|
|
||||||
|
|
||||||
the secret: most of what looks like "millions of things" in games
|
|
||||||
is actually a small number of meaningful entities + a large number
|
|
||||||
of dumb particles.
|
|
||||||
|
|
||||||
|
|
||||||
the laws of physics (sort of)
|
|
||||||
-----------------------------
|
|
||||||
|
|
||||||
there are hard limits:
|
|
||||||
|
|
||||||
MEMORY BUS BANDWIDTH
|
|
||||||
a DDR4 system might move 25 GB/s.
|
|
||||||
1 million entities at 12 bytes each = 12 MB.
|
|
||||||
at 60fps = 720 MB/s just for entity data.
|
|
||||||
that's only 3% of bandwidth. plenty of room.
|
|
||||||
|
|
||||||
but a naive approach (64 bytes, plus overhead) could be
|
|
||||||
10x worse. suddenly you're at 30%.
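
the bandwidth share is a one-line calculation. a sketch using the 25 GB/s figure from above; the 64-byte case counts payload only, while the 30% in the text assumes extra overhead on top of that.

```python
def bandwidth_share(entities, bytes_per_entity, fps, bus_bytes_per_s):
    """Fraction of the memory bus used to move entity data once per frame."""
    return entities * bytes_per_entity * fps / bus_bytes_per_s


BUS = 25e9  # 25 GB/s, the DDR4 figure used above

print(f"{bandwidth_share(1_000_000, 12, 60, BUS):.1%}")   # 2.9%
print(f"{bandwidth_share(1_000_000, 64, 60, BUS):.1%}")   # 15.4%
```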
|
|
||||||
|
|
||||||
CLOCK CYCLES
|
|
||||||
a 3GHz CPU core runs 3 billion cycles per second (call it 3 billion simple operations).
|
|
||||||
at 60fps, that's 50 million operations per frame.
|
|
||||||
1 million entities = 50 operations each.
|
|
||||||
|
|
||||||
50 operations is: a few multiplies, some loads/stores, a branch.
|
|
||||||
that's barely enough for "move in a direction".
|
|
||||||
pathfinding? AI? collision? not a chance.
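
the per-entity budget, as a sketch. it treats one clock cycle as one operation on a single core, which is the same simplification the text makes:

```python
def cycles_per_entity(clock_hz, fps, entities):
    """Clock cycles available per entity per frame on one core.

    Treats one cycle as one operation, as the text does; real CPUs
    retire more or fewer instructions per cycle depending on the code.
    """
    cycles_per_frame = clock_hz / fps
    return cycles_per_frame / entities


print(cycles_per_entity(3e9, 60, 1_000_000))    # 50.0
print(cycles_per_entity(3e9, 60, 10_000_000))   # 5.0
```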
|
|
||||||
|
|
||||||
PARALLELISM
|
|
||||||
GPUs have thousands of cores but they're simple.
|
|
||||||
CPUs have few cores but they're smart.
|
|
||||||
|
|
||||||
entity rendering: perfectly parallel (GPU wins)
|
|
||||||
entity decision-making: often sequential (CPU bound)
|
|
||||||
|
|
||||||
so yes, physics constrains us. but "physics" here means:
|
|
||||||
- how fast electrons move through silicon
|
|
||||||
- how much data fits on a wire
|
|
||||||
- how many transistors fit on a chip
|
|
||||||
|
|
||||||
within those limits, there's room. lots of room, if you're clever.
|
|
||||||
lofivor went from 5k to 700k by being clever, not by breaking physics.
|
|
||||||
|
|
||||||
|
|
||||||
the actual lesson
|
|
||||||
-----------------
|
|
||||||
|
|
||||||
the limit isn't really "the hardware can't do it."
|
|
||||||
|
|
||||||
the limit is "the hardware can't do it THE WAY YOU'RE DOING IT."
|
|
||||||
|
|
||||||
every optimization in lofivor was finding a different way:
|
|
||||||
- don't draw circles, blit textures
|
|
||||||
- don't call functions, submit vertices directly
|
|
||||||
- don't send matrices, send packed structs
|
|
||||||
- don't update on CPU, use compute shaders
|
|
||||||
|
|
||||||
the hardware was always capable of 700k. the code wasn't asking right.
|
|
||||||
|
|
||||||
this is true at every level. that old laptop struggling with 10k
|
|
||||||
entities in some game? probably not the laptop's fault. probably
|
|
||||||
the game is doing something wasteful that doesn't need to be.
|
|
||||||
|
|
||||||
"runs poorly on old hardware" often means "we didn't try to make
|
|
||||||
it run on old hardware" not "it's impossible on old hardware."
|
|
||||||
|
|
||||||
|
|
||||||
closing thought
|
|
||||||
---------------
|
|
||||||
|
|
||||||
10 million is a lot. but 1 million? 2 million?
|
|
||||||
|
|
||||||
with discipline: yes.
|
|
||||||
with decisions that respect the hardware: yes.
|
|
||||||
with awareness of what's actually expensive: yes.
|
|
||||||
|
|
||||||
the knowledge of what's expensive is the key.
|
|
||||||
|
|
||||||
most developers don't have it. they use high-level abstractions
|
|
||||||
that hide the cost. they've never seen a frame budget or a
|
|
||||||
bandwidth calculation.
|
|
||||||
|
|
||||||
lofivor is a learning tool. the journey from 5k to 700k teaches
|
|
||||||
where the costs are. once you see them, you can't unsee them.
|
|
||||||
|
|
||||||
you start asking: "what is this actually doing? what does it cost?
|
|
||||||
is there a cheaper way?"
|
|
||||||
|
|
||||||
that's the skill. not the specific techniques—those change with
|
|
||||||
hardware. the skill is asking the questions.
|
|
||||||
|
|
@ -1,8 +0,0 @@
|
||||||
the baseline: one draw call per entity, pure and simple
|
|
||||||
|
|
||||||
- individual rl.drawCircle() calls in a loop
|
|
||||||
- ~5k entities at 60fps before frame times tank
|
|
||||||
- linear scaling: 10k = ~43ms, 20k = ~77ms
|
|
||||||
- render-bound (update loop stays under 1ms even at 30k)
|
|
||||||
- each circle is its own GPU draw call
|
|
||||||
- the starting point for optimization experiments
|
|
||||||
|
|
@ -1,8 +0,0 @@
|
||||||
pre-render once, blit many: 10x improvement
|
|
||||||
|
|
||||||
- render circle to 16x16 texture at startup
|
|
||||||
- drawTexture() per entity instead of drawCircle()
|
|
||||||
- raylib batches same-texture draws internally
|
|
||||||
- ~50k entities at 60fps
|
|
||||||
- simple change, big win
|
|
||||||
- still one function call per entity, but GPU work is batched
|
|
||||||
|
|
@ -1,9 +0,0 @@
|
||||||
bypass the wrapper, go straight to rlgl: 2x more
|
|
||||||
|
|
||||||
- skip drawTexture(), submit vertices directly via rl.gl
|
|
||||||
- manually build quads: rlTexCoord2f + rlVertex2f per corner
|
|
||||||
- rlBegin/rlEnd wraps the whole entity loop
|
|
||||||
- ~100k entities at 60fps
|
|
||||||
- eliminates per-call function overhead
|
|
||||||
- vertices go straight to GPU buffer
|
|
||||||
- 20x improvement over baseline
|
|
||||||
|
|
@ -1,11 +0,0 @@
|
||||||
bigger buffer, fewer flushes: squeezing out more headroom
|
|
||||||
|
|
||||||
- increased raylib batch buffer from 8192 to 32768 vertices
|
|
||||||
- ~140k entities at 60fps on i5-6500T
|
|
||||||
- ~40% improvement over default buffer
|
|
||||||
- fewer GPU flushes per frame
|
|
||||||
- also added: release workflows for github and forgejo
|
|
||||||
- added OPTIMIZATIONS.md documenting the journey
|
|
||||||
- added README, UI panel with FPS display
|
|
||||||
- heap allocated entity array to support 1 million entities
|
|
||||||
- per-entity RGB colors
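
a rough picture of why the bigger buffer means fewer flushes. this is a sketch, not raylib's actual accounting: it assumes 4 vertices per entity quad and that the capacities are counted in vertices, as the note above words it.

```python
import math


def flushes_per_frame(entities, units_per_entity, buffer_capacity):
    """Batch flushes needed when the buffer is emptied each time it fills."""
    return math.ceil(entities * units_per_entity / buffer_capacity)


# assumes 4 vertices per entity quad and capacities counted in vertices
for capacity in (8192, 32768):
    print(capacity, flushes_per_frame(140_000, 4, capacity))
# 8192 69
# 32768 18
```

if the capacity is really counted in quads, the absolute numbers change but the roughly 4x reduction does not.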
|
|
||||||
|
|
@ -1,13 +0,0 @@
|
||||||
gpu instancing: a disappointing discovery
|
|
||||||
|
|
||||||
- drawMeshInstanced() with per-entity transform matrices
|
|
||||||
- ~150k entities at 60fps - barely better than rlgl batching
|
|
||||||
- negligible improvement on integrated graphics
|
|
||||||
- why it didn't help:
|
|
||||||
- integrated GPU shares system RAM (no PCIe transfer savings)
|
|
||||||
- 64-byte matrix per entity vs ~80 bytes for rlgl vertices
|
|
||||||
- bottleneck is memory bandwidth, not draw call overhead
|
|
||||||
- rlgl batching already minimizes draw calls effectively
|
|
||||||
- orthographic camera setup for 2D-like rendering
|
|
||||||
- heap-allocated transforms buffer (64MB too big for stack)
|
|
||||||
- lesson learned: not all "advanced" techniques are wins
|
|
||||||
|
|
@ -1,17 +0,0 @@
|
||||||
ssbo breakthrough: 5x gain by shrinking the data
|
|
||||||
|
|
||||||
- pack entity data (x, y, color) into 12-byte struct
|
|
||||||
- upload via shader storage buffer object (SSBO)
|
|
||||||
- ~700k entities at 60fps (i5-6500T / HD 530)
|
|
||||||
- ~950k entities at ~57fps
|
|
||||||
- 5x improvement over previous best
|
|
||||||
- 140x total from baseline
|
|
||||||
- why it works:
|
|
||||||
- 12 bytes vs 64 bytes (matrices) = 5.3x less bandwidth
|
|
||||||
- 12 bytes vs 80 bytes (rlgl vertices) = 6.7x less bandwidth
|
|
||||||
- no CPU-side matrix calculations
|
|
||||||
- GPU does NDC conversion and color unpacking
|
|
||||||
- custom vertex/fragment shaders
|
|
||||||
- single rlDrawVertexArrayInstanced() call for all entities
|
|
||||||
- shaders embedded at build time
|
|
||||||
- removed FPS cap, added optional vsync arg
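
the 12-byte vs 64-byte comparison can be reproduced with a quick packing sketch. the layout here (two float32 values plus one uint32 holding packed RGBA) is an assumption for illustration; the notes above only say x, y and color fit in 12 bytes.

```python
import struct

# assumed layout: float32 x, float32 y, uint32 packed RGBA (little-endian)
ENTITY = struct.Struct("<ffI")
# a 4x4 float32 transform matrix, as used for instancing
MATRIX = struct.Struct("<16f")


def pack_entity(x, y, r, g, b, a=255):
    """Pack one entity into 12 bytes."""
    color = r | (g << 8) | (b << 16) | (a << 24)
    return ENTITY.pack(x, y, color)


blob = pack_entity(340.0, 512.0, 0, 255, 255)   # cyan
print(len(blob), MATRIX.size)                   # 12 64
print(round(MATRIX.size / ENTITY.size, 1))      # 5.3
print(1_000_000 * ENTITY.size / 1e6)            # 12.0 (MB for a million entities)
```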
|
|
||||||
|
|
@ -1,5 +0,0 @@
|
||||||
cross-platform release: adding windows to the party
|
|
||||||
|
|
||||||
- updated github release workflow
|
|
||||||
- builds for both linux and windows now
|
|
||||||
- no code changes, just CI/CD work
|
|
||||||
|
|
@ -1,10 +0,0 @@
|
||||||
zoom and pan: making millions of entities explorable
|
|
||||||
|
|
||||||
- mouse wheel zoom
|
|
||||||
- click and drag panning
|
|
||||||
- orthographic camera transforms
|
|
||||||
- memory panel showing entity buffer sizes
|
|
||||||
- background draws immediately (no flicker)
|
|
||||||
- tab key toggles UI panels
|
|
||||||
- explained "lofivor" name in README (lo-fi survivor)
|
|
||||||
- shader updated for zoom/pan transforms
|
|
||||||
|
|
@ -1,5 +0,0 @@
|
||||||
quick exit: zoom out then quit
|
|
||||||
|
|
||||||
- q key first zooms out, second press quits
|
|
||||||
- nice way to see the full entity field before closing
|
|
||||||
- minor UI text fix
|
|
||||||
|
|
@ -1,11 +0,0 @@
|
||||||
compute shader: moving physics to the GPU
|
|
||||||
|
|
||||||
- entity position updates now run on GPU via compute shader
|
|
||||||
- GPU-based RNG for entity velocity randomization
|
|
||||||
- full simulation loop stays on GPU, no CPU roundtrip
|
|
||||||
- new compute.zig module for shader management
|
|
||||||
- GpuEntity struct with position, velocity, and color
|
|
||||||
- tracy profiling integration
|
|
||||||
- FPS display turns green (good) or red (bad)
|
|
||||||
- added design docs for zoom/pan and compute shader work
|
|
||||||
- cross-platform alignment fixes for shader data
|
|
||||||