Built for
4K at 60fps
A brush stroke on a 4096×4096 canvas should feel the same as painting on a thumbnail. Every layer of the architecture — memory layout, GPU compute, readback strategy — exists to make that true.
at 4K (vs ~33MB full)
canvas (COW tiles)
brush rasterization
shader pipelines
Hybrid CPU + GPU
Interactive brush previews stay on CPU for zero-latency responsiveness. Heavy compositing and filter ops move to GPU compute shaders via wgpu (Vulkan / Metal / DX12 / OpenGL ES).
GPU Compositor
All layers are uploaded as wgpu textures and composited in a single render pass. 25 blend modes are implemented directly in WGSL — the CPU never touches composite pixels. Cached blend uniform buffers avoid reallocation between frames.
Dirty-Rect Readback
After GPU compositing, only the sub-region that changed is read back to CPU, typically the size of a single brush stamp. At 4K, a 40×40px brush stamp reads ~6KB (40 × 40 × 4 bytes) instead of ~33MB. The persistent composite_cpu_buffer is patched in place.
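The patch step can be sketched on the CPU side as below. This is a minimal illustration, not PaintFE's code: `DirtyRect`, `readback_bytes`, and `patch_in_place` are hypothetical names, and the sketch assumes the readback rows are tightly packed RGBA8 (a real wgpu texture-to-buffer copy pads each row to a multiple of 256 bytes, which would have to be stripped first).

```rust
/// Axis-aligned dirty rectangle in pixel coordinates (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct DirtyRect {
    x: usize,
    y: usize,
    w: usize,
    h: usize,
}

const BYTES_PER_PIXEL: usize = 4; // RGBA8

/// Number of bytes that must be read back for a dirty rectangle.
fn readback_bytes(rect: DirtyRect) -> usize {
    rect.w * rect.h * BYTES_PER_PIXEL
}

/// Copy tightly packed `patch` rows into the persistent full-canvas buffer.
/// Only the rows and columns covered by `rect` are touched.
fn patch_in_place(full: &mut [u8], canvas_w: usize, rect: DirtyRect, patch: &[u8]) {
    let row_bytes = rect.w * BYTES_PER_PIXEL;
    assert_eq!(patch.len(), row_bytes * rect.h, "patch size must match rect");
    for row in 0..rect.h {
        let dst = ((rect.y + row) * canvas_w + rect.x) * BYTES_PER_PIXEL;
        let src = row * row_bytes;
        full[dst..dst + row_bytes].copy_from_slice(&patch[src..src + row_bytes]);
    }
}
```

For a 40×40 rectangle, `readback_bytes` returns 6400, which is where the ~6KB figure comes from.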
Async Double Buffer
During interactive previews, GPU readback uses ping-pong staging buffers (B1). While frame N renders, frame N−1's data is already mapped to CPU, so a brush stroke never waits on the GPU. Commits switch to synchronous readback so the final result is available immediately.
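The ping-pong pattern can be modeled without any GPU code. The sketch below is an assumption-laden stand-in: `PingPong` is a hypothetical type, and the generic `T` takes the place of a mapped staging buffer. It only shows the one-frame-delayed hand-off, not the wgpu mapping calls.

```rust
/// Two staging slots used alternately (hypothetical stand-in for GPU staging buffers).
struct PingPong<T> {
    slots: [Option<T>; 2],
    write_idx: usize,
}

impl<T> PingPong<T> {
    fn new() -> Self {
        Self { slots: [None, None], write_idx: 0 }
    }

    /// Store frame N's result and hand back frame N-1's result, if there is one.
    /// The caller never waits on the slot it has just written.
    fn submit(&mut self, frame: T) -> Option<T> {
        self.slots[self.write_idx] = Some(frame);
        self.write_idx ^= 1; // flip to the other slot
        self.slots[self.write_idx].take()
    }
}
```

The first `submit` returns `None` because no earlier frame exists; every later call returns the frame submitted one call before. That one-frame delay is the cost of the design, which is presumably why commits use a synchronous path.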
Zero-Copy Display
GPU readback bytes are reinterpreted directly as Color32 slices via bytemuck::cast_slice. At 4K (8.3M pixels) the cast itself costs only a length and alignment check: no per-pixel conversion loop, no allocation, no copy.
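What such a cast amounts to can be shown with the standard library alone. This is a sketch, not the bytemuck implementation: `Rgba8` is a hypothetical stand-in for Color32, assumed here to be four `u8` channels with alignment 1, and `as_pixels` is an illustrative helper.

```rust
/// Four-byte RGBA pixel with alignment 1 (hypothetical stand-in for Color32).
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rgba8 {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

/// Reinterpret a byte slice as pixels without copying.
/// Returns None if the length is not a multiple of 4.
fn as_pixels(bytes: &[u8]) -> Option<&[Rgba8]> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    // SAFETY: Rgba8 is repr(C), exactly 4 bytes, has alignment 1,
    // and every bit pattern is a valid value. The returned slice
    // borrows `bytes`, so it cannot outlive the source data.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const Rgba8, bytes.len() / 4)
    })
}
```

The returned slice points at the same memory as the input, so the cost is independent of pixel count.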
Architecture Dataflow
How data moves through PaintFE on each frame — from user input through the tile engine, GPU pipelines, and back to the display.
Core Stack
wgpu 0.20 (WebGPU)
GPU compositing and compute shaders using WGSL. Runs on Vulkan, Metal, DirectX 12, and OpenGL ES. All blend modes, gradient rasterization, liquify warp, and mesh warp run as compute dispatches.
rayon 1.7 (Parallelism)
Composites, filter cores, and flip/rotate are all parallelized on rayon's work-stealing thread pool. Row-level parallel composite on a 4K canvas saturates all available CPU cores.
bytemuck (Zero-Copy)
GPU readback bytes are cast directly to Color32 via bytemuck::cast_slice. At 4K this replaces a per-pixel conversion loop over 8.3M pixels with a single slice reinterpretation.
Copy-on-Write Tiles
Images stored as Arc<RgbaImage> tile grids. Cloning a 4K layer costs ~36KB (pointer copy). Only the modified chunk is duplicated on write via Arc::make_mut().
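A minimal sketch of the copy-on-write behavior, using only `std::sync::Arc`. The names are illustrative assumptions: `Tile` is a plain `Vec<u8>` standing in for an RgbaImage tile, and `TiledLayer` is a hypothetical container, not PaintFE's actual type.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for one RgbaImage tile.
type Tile = Vec<u8>;

/// A layer stored as a flat list of shared tiles.
/// Cloning the layer clones only the Arc pointers, not the pixel data.
#[derive(Clone)]
struct TiledLayer {
    tiles: Vec<Arc<Tile>>,
}

impl TiledLayer {
    fn new(tile_count: usize, tile_bytes: usize) -> Self {
        let tiles = (0..tile_count)
            .map(|_| Arc::new(vec![0u8; tile_bytes]))
            .collect();
        Self { tiles }
    }

    /// Write one byte into a tile. Arc::make_mut duplicates the tile
    /// only if another owner (for example an undo snapshot) still shares it.
    fn write(&mut self, tile: usize, offset: usize, value: u8) {
        Arc::make_mut(&mut self.tiles[tile])[offset] = value;
    }
}
```

After `let snapshot = layer.clone()`, a write to one tile leaves every other tile shared between `layer` and `snapshot`, and the snapshot keeps the old contents of the modified tile.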
TexturePool
GPU textures are never freed between frames. A dimension-keyed pool recycles textures, eliminating the allocate-deallocate churn that causes GPU pipeline stalls in other editors.
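A dimension-keyed pool can be sketched as a map from `(width, height)` to a free list. Everything here is hypothetical: `Texture` is a placeholder struct rather than a wgpu texture, and the `id` field exists only so reuse is observable.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a GPU texture.
struct Texture {
    width: u32,
    height: u32,
    id: u64,
}

/// Recycles textures by exact dimensions instead of freeing them.
#[derive(Default)]
struct TexturePool {
    free: HashMap<(u32, u32), Vec<Texture>>,
    next_id: u64,
}

impl TexturePool {
    /// Reuse a pooled texture of the same size, or create a new one.
    fn acquire(&mut self, width: u32, height: u32) -> Texture {
        if let Some(tex) = self.free.get_mut(&(width, height)).and_then(Vec::pop) {
            return tex;
        }
        self.next_id += 1;
        Texture { width, height, id: self.next_id }
    }

    /// Return a texture to the pool instead of dropping it.
    fn release(&mut self, tex: Texture) {
        self.free.entry((tex.width, tex.height)).or_default().push(tex);
    }
}
```

Keying on exact dimensions keeps lookup trivial, at the cost of holding memory for sizes that may not be requested again.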
Brush Alpha LUT
A 256-entry lookup table maps dist² / radius² to alpha. Every per-pixel sqrt() and smoothstep is eliminated. The LUT rebuilds only when brush size or hardness changes.
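The LUT idea can be sketched as follows. The falloff curve is an assumption (a smoothstep from the hard core to the edge); the source does not specify PaintFE's exact curve, and `build_alpha_lut` and `alpha_at` are illustrative names.

```rust
const LUT_SIZE: usize = 256;

fn smoothstep(edge0: f32, edge1: f32, x: f32) -> f32 {
    let t = ((x - edge0) / (edge1 - edge0)).clamp(0.0, 1.0);
    t * t * (3.0 - 2.0 * t)
}

/// Build the table. Entry i corresponds to dist² / radius² = i / 255.
/// `hardness` is the fraction of the radius that stays fully opaque (assumed model).
fn build_alpha_lut(hardness: f32) -> [f32; LUT_SIZE] {
    let hardness = hardness.clamp(0.0, 0.999); // avoid a zero-width falloff
    let mut lut = [0.0f32; LUT_SIZE];
    for (i, slot) in lut.iter_mut().enumerate() {
        let ratio_sq = i as f32 / (LUT_SIZE - 1) as f32;
        let dist = ratio_sq.sqrt(); // sqrt is paid once per entry, not per pixel
        *slot = 1.0 - smoothstep(hardness, 1.0, dist);
    }
    lut
}

/// Per-pixel lookup: only multiplies, one divide, and an index.
fn alpha_at(lut: &[f32; LUT_SIZE], dx: f32, dy: f32, radius: f32) -> f32 {
    let ratio_sq = (dx * dx + dy * dy) / (radius * radius);
    if ratio_sq >= 1.0 {
        return 0.0; // outside the brush
    }
    lut[(ratio_sq * (LUT_SIZE - 1) as f32) as usize]
}
```

Indexing by the squared ratio is what removes the per-pixel square root: the table is built over squared distance, so the lookup never needs the true distance.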
B-Series & A-Series
PaintFE’s performance work follows a numbered plan tracked in the codebase. The B-series are architectural overhauls; the A-series are incremental micro-optimizations.
TextureHandle::set_partial() uploads only the dirty rectangle. A 40×40 brush writes ~6KB to the GPU instead of ~33MB for a full clone.
Arc<RgbaImage>. Snapshot = pointer copy. Mutation = lazy-copy only the touched tile. Undo stack stores history practically for free.
dist² / radius² to alpha. Eliminates all per-pixel sqrt() calls. Rebuilt only when properties change.

| ID | Optimization | Impact |
|---|---|---|
| A1 | Zero-copy swap for GPU readback | Eliminates a redundant 33MB clone at 4K on every composite |
| A2 | Cached staging buffer | Reuses GPU readback buffer across frames, no realloc |
| A3 | Cached blend uniform slots | GPU blend buffers reused via queue.write_buffer() |
| A5 | Chunk-level prefetch in composite_partial | Pre-fetches all layer chunk data per column, CPU cache friendly |
| A7 | extract_region_rgba_fast buffer reuse | std::mem::take eliminates allocation on every region extract |
| A8 | Selective LOD invalidation | Only active layer’s LOD rebuilt on dirty, not all layers |
| A9 | VecDeque history stacks | O(1) undo history prune + cached memory_usage counter |
| A14 | SingleLayerSnapshotCommand | Dialog commits save only affected layer — 1/N memory per step |
| A16 | Chunk-level flip / rotate | Block transforms operate on whole tiles with par_iter |
| A17 | Flat visited array (flood fill) | Vec<bool> replaces HashMap — 10–20× faster large fills |
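As an illustration of row A17, here is a minimal flood fill that tracks visited pixels in a flat `Vec<bool>` indexed by `y * width + x`. It is a sketch under simplifying assumptions (single-channel pixels, 4-connectivity, exact color match); `flood_fill` is a hypothetical function, not PaintFE's implementation.

```rust
/// 4-connected flood fill over a single-channel image.
/// Returns the number of pixels that were filled.
fn flood_fill(
    pixels: &mut [u8],
    width: usize,
    height: usize,
    start: (usize, usize),
    new_value: u8,
) -> usize {
    let target = pixels[start.1 * width + start.0];
    // Flat visited array: one bool per pixel, no hashing on lookup.
    let mut visited = vec![false; width * height];
    let mut stack = vec![start];
    let mut filled = 0;

    while let Some((x, y)) = stack.pop() {
        let idx = y * width + x;
        if visited[idx] || pixels[idx] != target {
            continue;
        }
        visited[idx] = true;
        pixels[idx] = new_value;
        filled += 1;

        if x > 0 { stack.push((x - 1, y)); }
        if x + 1 < width { stack.push((x + 1, y)); }
        if y > 0 { stack.push((x, y - 1)); }
        if y + 1 < height { stack.push((x, y + 1)); }
    }
    filled
}
```

The visited array costs one byte per pixel up front, but each check is a direct index, which is the trade the table describes against a hash-based set.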
GPU Compute Pipelines
When CPU parallelism hits its wall, PaintFE dispatches to GPU compute shaders. Each pipeline has a CPU fallback that activates automatically on headless systems or integrated GPUs.
GpuGradientPipeline
Rasterizes linear, reflected, radial, and diamond gradients. Color stop LUT (256×4 RGBA) and params uniform buffer cached across frames.
GpuLiquifyPipeline
Bilinear-interpolated displacement warp. Source snapshot uploaded once per stroke; only the displacement field is re-uploaded each frame as a storage buffer.
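A CPU-side sketch of a bilinear displacement warp, for reference. The conventions are assumptions: a single-channel `f32` image, edge clamping, and a "pull" model in which each output pixel samples the source at its own position plus the displacement. `sample_bilinear` and `warp` are illustrative names.

```rust
/// Sample a single-channel image with bilinear interpolation, clamping at the edges.
fn sample_bilinear(src: &[f32], w: usize, h: usize, x: f32, y: f32) -> f32 {
    let x = x.clamp(0.0, (w - 1) as f32);
    let y = y.clamp(0.0, (h - 1) as f32);
    let (x0, y0) = (x.floor() as usize, y.floor() as usize);
    let (x1, y1) = ((x0 + 1).min(w - 1), (y0 + 1).min(h - 1));
    let (fx, fy) = (x - x0 as f32, y - y0 as f32);

    let top = src[y0 * w + x0] * (1.0 - fx) + src[y0 * w + x1] * fx;
    let bottom = src[y1 * w + x0] * (1.0 - fx) + src[y1 * w + x1] * fx;
    top * (1.0 - fy) + bottom * fy
}

/// Warp the source snapshot: output pixel (x, y) pulls from (x + dx, y + dy).
fn warp(src: &[f32], displacement: &[(f32, f32)], w: usize, h: usize) -> Vec<f32> {
    (0..w * h)
        .map(|i| {
            let (x, y) = ((i % w) as f32, (i / w) as f32);
            let (dx, dy) = displacement[i];
            sample_bilinear(src, w, h, x + dx, y + dy)
        })
        .collect()
}
```

The source image is read-only throughout, which matches the description: the snapshot can stay resident while only the displacement field changes between frames.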
GpuMeshWarpDisplacement
Evaluates the Catmull-Rom bicubic spline surface on GPU. Uploads ~200 bytes of deformed grid points, dispatches 16×16 workgroups, reads back the full displacement field.
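The surface evaluation can be sketched on the CPU with the standard uniform Catmull-Rom basis. This shows the math only, evaluated for one scalar component over a single 4×4 patch; `catmull_rom` and `patch_eval` are illustrative names, and how PaintFE lays out its grid or workgroups is not assumed here.

```rust
/// Uniform Catmull-Rom interpolation between p1 and p2 for t in [0, 1].
fn catmull_rom(p0: f32, p1: f32, p2: f32, p3: f32, t: f32) -> f32 {
    let t2 = t * t;
    let t3 = t2 * t;
    0.5 * (2.0 * p1
        + (p2 - p0) * t
        + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t2
        + (3.0 * p1 - p0 - 3.0 * p2 + p3) * t3)
}

/// Evaluate a bicubic patch: interpolate each of the 4 rows along u,
/// then interpolate the 4 row results along v.
fn patch_eval(grid: &[[f32; 4]; 4], u: f32, v: f32) -> f32 {
    let rows: Vec<f32> = grid
        .iter()
        .map(|r| catmull_rom(r[0], r[1], r[2], r[3], u))
        .collect();
    catmull_rom(rows[0], rows[1], rows[2], rows[3], v)
}
```

The curve passes through its control points (`t = 0` gives `p1`, `t = 1` gives `p2`), so an undeformed grid yields an identity mapping.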
Compositor (Render Pass)
Owns all wgpu render passes. Constructs per-layer bind groups, caches blend uniform buffers, and runs a single render pass over all visible layers. All 25 blend modes are implemented in WGSL.
Smart Memory Usage
Deep undo history at 4K is expensive — unless your undo system is built on the same COW structure as the canvas itself.
Estimates for a typical painting workflow on a 4096×4096 RGBA canvas with 4 layers.
TexturePool
GPU textures are never freed between frames. A dimension-keyed pool recycles textures, eliminating the allocation-deallocation cycle that causes GPU pipeline stalls.
Reusable Compute Buffers
GPU readback staging, displacement Vecs, preview flat buffers, and per-frame pixel caches all persist across frames. Near-zero heap allocation at steady state.
Incremental Preview Cache
Brush strokes update only the dirty rect in a persistent premultiplied Color32 buffer. Per-frame work scales with brush size, not stroke length.
See it run.
Download PaintFE and feel the difference at full resolution.