rops: render output units
=========================

what they are, where they came from, and what yours can do.


what is a rop?
--------------

ROP = Render Output Unit (originally "Raster Operations Pipeline")

it's the final stage of the GPU pipeline. after all the fancy shader
math is done, the ROP is the unit that actually writes pixels to memory.
think of it as the bottleneck between "calculated" and "visible."

a ROP does:
  - depth testing (is this pixel in front of what's already there?)
  - stencil testing (mask operations)
  - blending (alpha, additive, etc)
  - anti-aliasing resolve
  - writing the final color to the framebuffer

one ROP can write one pixel per clock cycle (roughly).


the first rop
-------------

in PC graphics, the term goes back to the IBM 8514/A (1987), which had
dedicated hardware for "raster operations" - bitwise operations on
pixels (AND, OR, XOR). this was a big deal because before this, the CPU
on a typical PC did nearly all pixel math.

but the modern ROP as we know it emerged with:

NVIDIA NV1 (1995)
  one of the first chips with dedicated pixel output hardware
  could do ~1 million textured pixels/second

3dfx Voodoo (1996)
  the card that defined the modern GPU pipeline
  had 1 TMU + 1 pixel pipeline (essentially 1 ROP)
  could push 45 million pixels/second
  that ONE pipeline ran Quake at 640x480

NVIDIA GeForce 256 (1999)
  "the first GPU" - named itself with that term
  4 pixel pipelines = 4 ROPs
  480 million pixels/second

so the original consumer 3D cards had... 1 ROP. and they ran Quake.


what one rop can do
-------------------

let's do the math.

one ROP at 100 MHz (a round number for the late 90s; the original
Voodoo itself ran at about 50 MHz, hence its 45 megapixels/second):
  100 million cycles/second
  ~1 pixel per cycle
  = 100 megapixels/second

at 640x480 @ 60fps:
  640 * 480 * 60 = 18.4 megapixels/second needed

so ONE ROP at 100 MHz could handle 640x480 with ~5x headroom for
overdraw.

at 1024x768 @ 60fps:
  1024 * 768 * 60 = 47 megapixels/second

now you're at 2x overdraw max. still playable, but tight.
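
the arithmetic above can be sketched in a few lines of python. this is a
minimal sketch, not a hardware model: the function names are made up for
illustration, and it assumes exactly 1 pixel per ROP per clock with no
memory-bandwidth or blending penalty.

```python
def required_pixel_rate(width, height, fps):
    """Pixels per second needed to write every screen pixel once per frame."""
    return width * height * fps


def overdraw_headroom(rops, clock_hz, width, height, fps, pixels_per_clock=1.0):
    """How many times each screen pixel could be written per frame.

    Assumes an idealized ROP: pixels_per_clock pixels per cycle, nothing else
    (memory, shaders, texture units) getting in the way.
    """
    fill_rate = rops * clock_hz * pixels_per_clock  # pixels per second
    return fill_rate / required_pixel_rate(width, height, fps)


if __name__ == "__main__":
    # one ROP at 100 MHz, the two resolutions worked through above
    for w, h in [(640, 480), (1024, 768)]:
        headroom = overdraw_headroom(1, 100e6, w, h, 60)
        print(f"{w}x{h} @ 60fps: {headroom:.1f}x overdraw headroom")
```

running it gives about 5.4x for 640x480 and about 2.1x for 1024x768,
matching the hand math. swap in other ROP counts and clocks to
sanity-check any fill-rate figure in these notes.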

one modern rop
--------------

a single modern ROP runs at ~1-2 GHz and can do more per cycle:
  - multiple color outputs (MRT)
  - 64-bit or 128-bit color formats
  - compressed writes

rough estimate for one ROP at 1.5 GHz:
  ~1.5 billion pixels/second base throughput

at 1920x1080 @ 60fps:
  1920 * 1080 * 60 = 124 megapixels/second

one ROP could handle 1080p with 12x overdraw headroom.

at 4K @ 60fps:
  3840 * 2160 * 60 = 497 megapixels/second

one ROP could handle 4K with 3x overdraw. tight, but possible.


your three rops (intel hd 530)
------------------------------

HD 530 specs:
  - 3 ROPs
  - ~950 MHz boost clock
  - theoretical: 3 * 0.95 GHz = 2.85 GPixels/second

let's break that down:

at 1080p @ 60fps (124 MP/s needed):
  2850 / 124 = 23x overdraw budget

that's actually generous! you could draw each pixel 23 times.

so why does lofivor struggle at 1M entities?

because 1M entities at 4x4 pixels = 16M pixels minimum.
but with overlap? let's say average 10x overdraw (a rough guess):
  160M pixels/frame
  at 60fps = 9.6 billion pixels/second

your ceiling is 2.85 billion. so you're 3.4x over budget.

that's why you top out around 300k-400k before frame drops
(which matches empirical testing).


the real constraint
-------------------

ROPs don't work in isolation. they're limited by:

1. MEMORY BANDWIDTH
   each pixel write = memory access
   HD 530 shares DDR4 with CPU (~30 GB/s)
   at 32-bit color: 30 GB/s / 4 bytes = 7.5 billion pixels/second max
   but you're competing with CPU, texture reads, etc.
   realistic: maybe 2-3 billion pixels/second for framebuffer writes

2. TEXTURE SAMPLING
   if the fragment shader samples textures, TMUs must keep up
   HD 530 has 24 TMUs, so this isn't the bottleneck

3. SHADER EXECUTION
   ROPs wait for fragments to be shaded
   if shaders are slow, ROPs starve
   lofivor's shaders are trivial, so this isn't the bottleneck

for lofivor specifically: your 3 ROPs are THE ceiling.


what could you do with more rops?
---------------------------------

comparison:

  Intel HD 530:    3 ROPs,  2.85 GPixels/s
  GTX 1060:       48 ROPs,    72 GPixels/s
  RTX 3080:       96 ROPs,   164 GPixels/s
  RTX 4090:      176 ROPs,   443 GPixels/s

with a GTX 1060 (25x your fill rate):
  lofivor could probably hit 5-10 million entities

with an RTX 4090 (155x your fill rate):
  tens of millions, limited by other factors


perspective: what 3 rops means historically
-------------------------------------------

your HD 530 has roughly the fill rate of:
  - GeForce 4 Ti 4600 (2002): 4 ROPs, 1.2 GPixels/s
  - Radeon 9700 Pro (2002): 8 ROPs, 2.6 GPixels/s

you're running hardware that, in raw pixel output, matches GPUs from
20+ years ago. but with modern features (compute shaders, SSBO, etc).

this is why lofivor is interesting: you're achieving 700k+ entities on
fill-rate-equivalent hardware that originally ran games with maybe
10,000 triangles on screen.

the difference is technique. those 2002 games did complex per-pixel
lighting, shadows, multiple texture passes. lofivor does one texture
sample and one blend. same fill rate, 100x the entities.


the lesson
----------

ROPs are simple: they write pixels. the number you have determines your
pixel budget. everything else (shaders, vertices, CPU logic) only
matters if the ROPs aren't your bottleneck.

with 3 ROPs, you have roughly 2.85 billion pixels/second. spend them
wisely:
  - cull what's offscreen (don't spend pixels on invisible things)
  - shrink distant objects (LOD saves pixels)
  - reduce overlap (spatial organization)
  - keep shaders simple (don't starve the ROPs)

your 3 ROPs can do remarkable things. Quake ran on 1.
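
for reference, the entity-budget estimate used in these notes can be
written as a short python sketch. the function name and the GPUS table
are illustrative only; the defaults (4x4 pixel sprites, 10x average
overdraw, 60fps) are the same rough assumptions made above, and the
model ignores memory bandwidth and every other limit.

```python
def max_entities(fill_rate_pps, sprite_pixels=16, overdraw=10, fps=60):
    """Rough count of entities per frame that fit in a fill-rate budget.

    fill_rate_pps: theoretical pixels per second (ROPs * clock).
    sprite_pixels, overdraw, fps: assumptions taken from the notes above.
    """
    pixels_per_entity_per_second = sprite_pixels * overdraw * fps
    return fill_rate_pps / pixels_per_entity_per_second


# theoretical fill rates from the comparison list, in pixels/second
GPUS = {
    "Intel HD 530": 2.85e9,
    "GTX 1060": 72e9,
    "RTX 4090": 443e9,
}

if __name__ == "__main__":
    for name, rate in GPUS.items():
        print(f"{name}: ~{max_entities(rate):,.0f} entities at 60fps")
```

under those assumptions the HD 530 lands near 300k entities, the
GTX 1060 near 7.5 million, and the RTX 4090 near 46 million, which
lines up with the ballpark figures quoted earlier. halving sprite size
or overdraw doubles the budget, which is the point of the culling and
LOD advice.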