Built for
4K at 60fps

A brush stroke on a 4096×4096 canvas should feel the same as painting on a thumbnail. Every layer of the architecture — memory layout, GPU compute, readback strategy — exists to make that true.

Explore pipeline View source
~1MB: readback per stroke at 4K (vs ~33MB full)
~0: cost to clone a 4K canvas (COW tiles)
0: per-pixel sqrt() calls in brush rasterization
6: GPU compute shader pipelines

Hybrid CPU + GPU

Interactive brush previews stay on CPU for zero-latency responsiveness. Heavy compositing and filter ops move to GPU compute shaders via wgpu (Vulkan / Metal / DX12 / OpenGL ES).

Per-Frame Compositing Path
1. Layer Store: TiledImage / COW
2. GPU Upload: wgpu texture
3. WGSL Compositor: blend + merge
4. Dirty Readback: sub-region only
5. bytemuck cast: zero-copy → Color32
6. Partial Upload: set_partial()
7. Display: egui texture
Brush Stroke Path
1. Input Event: pointer / stylus
2. LUT Alpha: no sqrt()
3. Chunk COW: Arc::make_mut()
4. Incr. Cache: dirty rect only
5. set_partial(): ~6KB vs ~33MB
6. Display: immediate

GPU Compositor

All layers are uploaded as wgpu textures and composited in a single render pass. 25 blend modes are implemented directly in WGSL — the CPU never touches composite pixels. Cached blend uniform buffers avoid reallocation between frames.

Dirty-Rect Readback

After GPU compositing, only the sub-region that changed is read back to CPU — typically the size of a single brush stamp. At 4K, a 40px brush reads ~6KB instead of ~33MB. The persistent composite_cpu_buffer is patched in-place.
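The savings are simple arithmetic for an RGBA8 surface (4 bytes per pixel); a quick sketch, using 3840×2160 as the "4K" frame the ~33MB figure implies:

```rust
/// Bytes to read back for an RGBA8 region (4 bytes per pixel).
/// Real wgpu readback also pads rows to a 256-byte pitch, ignored here.
fn readback_bytes(w: u64, h: u64) -> u64 {
    w * h * 4
}

fn main() {
    // A 40px brush stamp vs a full 3840×2160 frame.
    let stamp = readback_bytes(40, 40); // 6_400 bytes ≈ 6 KB
    let full = readback_bytes(3840, 2160); // 33_177_600 bytes ≈ 33 MB
    println!("stamp = {stamp} B, full = {full} B, ratio = {}x", full / stamp);
}
```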

Async Double Buffer

During interactive previews, GPU readback uses ping-pong staging buffers (B1). While frame N renders, frame N−1's data is already mapped on the CPU, so brush strokes never stall the GPU. Commits switch to synchronous readback for an instant final result.
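At its core the ping-pong scheme is an index toggle between two staging slots. A minimal dependency-free sketch (names hypothetical; plain values stand in for GPU staging buffers):

```rust
/// Two staging slots that alternate roles each frame: one receives this
/// frame's GPU copy while the other, holding last frame's result, is mapped.
struct PingPong<T> {
    slots: [T; 2],
    frame: usize,
}

impl<T> PingPong<T> {
    fn new(a: T, b: T) -> Self {
        Self { slots: [a, b], frame: 0 }
    }
    /// Slot the GPU writes into this frame.
    fn write_slot(&mut self) -> &mut T {
        &mut self.slots[self.frame % 2]
    }
    /// Slot holding the previous frame's completed readback, safe to map.
    fn read_slot(&self) -> &T {
        &self.slots[(self.frame + 1) % 2]
    }
    fn advance(&mut self) {
        self.frame += 1;
    }
}
```

Because the CPU only ever maps the slot the GPU finished last frame, neither side waits on the other during a stroke.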

Zero-Copy Display

GPU readback bytes are transmuted directly to Color32 slices via bytemuck::cast_slice. At 4K (8.3M pixels) this is effectively free: no per-pixel conversion loop, no allocation, no copy.
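bytemuck::cast_slice performs this reinterpretation with size and alignment checks. The same mechanics can be sketched with the standard library alone via slice::align_to; the Rgba struct below is a hypothetical stand-in for Color32:

```rust
/// Stand-in for egui's Color32: four u8 channels, no padding.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(C)]
struct Rgba {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

/// Reinterpret raw readback bytes as pixels without copying.
/// bytemuck::cast_slice does this safely; align_to shows the mechanics.
fn as_pixels(bytes: &[u8]) -> &[Rgba] {
    // Sound here because Rgba is #[repr(C)] with size 4 and alignment 1:
    // any byte slice whose length is a multiple of 4 reinterprets cleanly.
    assert_eq!(bytes.len() % 4, 0);
    let (head, pixels, tail) = unsafe { bytes.align_to::<Rgba>() };
    assert!(head.is_empty() && tail.is_empty());
    pixels
}
```

The returned slice borrows the original bytes, so the whole conversion is pointer reinterpretation, not work proportional to pixel count.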

Architecture Dataflow

How data moves through PaintFE on each frame — from user input through the tile engine, GPU pipelines, and back to the display.

1. Input: pointer / stylus / keyboard
2. ToolsPanel: handle_input()
3. TiledImage: COW Arc chunks
4. GPU Upload: ensure_layer_texture
5. Compositor: WGSL blend pass
6. Readback: dirty-region staging
7. bytemuck: zero-copy cast
8. set_partial: egui texture patch
9. Display: final frame
10. History: PixelPatch / Snapshot

Core Stack

GPU

wgpu 0.20 (WebGPU)

GPU compositing and compute shaders using WGSL. Runs on Vulkan, Metal, DirectX 12, and OpenGL ES. All blend modes, gradient rasterization, liquify warp, and mesh warp run as compute dispatches.

Vulkan • Metal • DX12 • GL ES
CPU

rayon 1.7 (Parallelism)

Composites, filter cores, and flip/rotate are all work-stealing parallelized. Row-level parallel composite on a 4K canvas saturates all available CPU cores.

par_chunks_mut • par_iter • spawn
0×COPY

bytemuck (Zero-Copy)

GPU readback bytes are cast directly to Color32 via bytemuck::cast_slice. At 4K this replaces a per-pixel conversion loop over 8.3M pixels with effectively zero CPU cycles.

cast_slice::<u8, Color32>()
COW

Copy-on-Write Tiles

Images stored as Arc<RgbaImage> tile grids. Cloning a 4K layer costs ~36KB (pointer copy). Only the modified chunk is duplicated on write via Arc::make_mut().

clone() = Arc::clone • write = make_mut
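The COW behavior falls straight out of std's Arc::make_mut; a self-contained sketch (tile size and grid shape are illustrative):

```rust
use std::sync::Arc;

// Each tile is an Arc; a hypothetical 64×64 RGBA tile is ~16 KB of pixels.
type Tile = Arc<Vec<u8>>;

fn main() {
    let tile = Arc::new(vec![0u8; 64 * 64 * 4]);
    let mut layer: Vec<Tile> = vec![tile; 16]; // a "layer" = grid of shared tiles

    // Snapshot: 16 pointer copies, no pixel data duplicated.
    let snapshot: Vec<Tile> = layer.clone();

    // Write: Arc::make_mut deep-copies only the one touched tile.
    Arc::make_mut(&mut layer[3])[0] = 255;

    assert_eq!(snapshot[3][0], 0); // snapshot unchanged
    assert_eq!(layer[3][0], 255); // live layer modified
    assert!(Arc::ptr_eq(&snapshot[0], &layer[0])); // untouched tiles still shared
}
```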

TexturePool

GPU textures are never freed between frames. A dimension-keyed pool recycles textures, eliminating the allocate-deallocate churn that causes GPU pipeline stalls in other editors.
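The pooling idea reduces to a map keyed by dimensions. A minimal sketch with a hypothetical Texture stand-in and an allocation counter added purely for illustration:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a GPU texture handle.
#[derive(Debug)]
struct Texture {
    width: u32,
    height: u32,
}

/// Dimension-keyed pool: release() banks a texture, acquire() reuses one
/// of the same size instead of allocating a fresh one.
#[derive(Default)]
struct TexturePool {
    free: HashMap<(u32, u32), Vec<Texture>>,
    allocations: usize, // counts real allocations, for illustration only
}

impl TexturePool {
    fn acquire(&mut self, w: u32, h: u32) -> Texture {
        if let Some(tex) = self.free.get_mut(&(w, h)).and_then(|v| v.pop()) {
            return tex; // recycled: no GPU allocation
        }
        self.allocations += 1;
        Texture { width: w, height: h }
    }

    fn release(&mut self, tex: Texture) {
        self.free.entry((tex.width, tex.height)).or_default().push(tex);
    }
}
```

At steady state every acquire() for a recurring canvas size hits the free list, so the per-frame allocation count drops to zero.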

Brush Alpha LUT

A 256-entry lookup table maps dist² / radius² to alpha. Every per-pixel sqrt() and smoothstep is eliminated. The LUT rebuilds only when brush size or hardness changes.
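The shape of such a table can be sketched as follows; the falloff curve and the meaning of hardness here are illustrative assumptions, not PaintFE's exact formula:

```rust
/// 256-entry alpha LUT indexed by dist²/radius².
struct AlphaLut {
    table: [f32; 256],
}

impl AlphaLut {
    /// Rebuilt only when brush size or hardness changes: the sqrt happens
    /// 256 times per rebuild, never per pixel.
    fn new(hardness: f32) -> Self {
        let mut table = [0.0f32; 256];
        for (i, a) in table.iter_mut().enumerate() {
            // i encodes dist²/radius² in [0, 1); recover normalized distance.
            let t = (i as f32 / 255.0).sqrt();
            // Illustrative curve: opaque core up to `hardness`, then linear falloff.
            *a = ((1.0 - t) / (1.0 - hardness).max(1e-4)).clamp(0.0, 1.0);
        }
        Self { table }
    }

    /// Per-pixel lookup: squared distances only, no sqrt, no smoothstep.
    fn alpha(&self, dist_sq: f32, radius_sq: f32) -> f32 {
        let idx = ((dist_sq / radius_sq) * 255.0) as usize;
        self.table[idx.min(255)]
    }
}
```

The per-pixel cost is one divide, one multiply, and one array load, which is why the rasterizer can afford to call it for every pixel of every stamp.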

B-Series & A-Series

PaintFE’s performance work follows a numbered plan tracked in the codebase. The B-series are architectural overhauls; the A-series are incremental micro-optimizations.

B1
Async GPU Readback
Double-buffered staging buffers ping-pong between frames. While frame N renders, frame N−1’s data is already mapped to CPU. Zero GPU stalls during previews.
B2
Partial Texture Upload
TextureHandle::set_partial() uploads only the dirty rectangle. A 40×40 brush writes ~6KB to the GPU instead of ~33MB for a full clone.
B5
COW Arc Tiles
Each tile is Arc<RgbaImage>. Snapshot = pointer copy. Mutation = lazy-copy only the touched tile. Undo stack stores history practically for free.
B6
Brush Alpha LUT
Pre-computed lookup table maps dist² / radius² to alpha. Eliminates all per-pixel sqrt() calls. Rebuilt only when properties change.
B3 / B4
Superseded by B5
Delta undo (B3) and async commits (B4) became unnecessary once COW tiles made snapshots near-free and filter jobs were already on rayon threads.
GPU
Compute Pipelines
Gradient rasterization, liquify warp, mesh warp displacement, Gaussian blur, and HSL — all have WGSL compute shaders with automatic CPU fallbacks.
ID | Optimization | Impact
A1 | Zero-copy swap for GPU readback | Eliminates a redundant ~33MB clone at 4K on every composite
A2 | Cached staging buffer | Reuses the GPU readback buffer across frames, no realloc
A3 | Cached blend uniform slots | GPU blend buffers reused via queue.write_buffer()
A5 | Chunk-level prefetch in composite_partial | Pre-fetches all layer chunk data per column; CPU cache friendly
A7 | extract_region_rgba_fast buffer reuse | std::mem::take eliminates allocation on every region extract
A8 | Selective LOD invalidation | Only the active layer's LOD is rebuilt on dirty, not all layers
A9 | VecDeque history stacks | O(1) undo history prune + cached memory_usage counter
A14 | SingleLayerSnapshotCommand | Dialog commits save only the affected layer: 1/N memory per step
A16 | Chunk-level flip / rotate | Block transforms operate on whole tiles with par_iter
A17 | Flat visited array (flood fill) | Vec<bool> replaces HashMap, 10–20× faster large fills
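The A17 trick is easy to show concretely: a flat Vec<bool> indexed by y * w + x replaces per-pixel hashing. A minimal stack-based 4-neighbor fill (the scanline details of the real implementation are omitted):

```rust
/// Flood fill using a flat Vec<bool> visited array: direct indexing
/// instead of a HashMap lookup per pixel.
fn flood_fill(pixels: &mut [u32], w: usize, h: usize, x: usize, y: usize, fill: u32) {
    let target = pixels[y * w + x];
    if target == fill {
        return;
    }
    let mut visited = vec![false; w * h]; // one bool per pixel, cache-friendly
    let mut stack = vec![(x, y)];
    while let Some((cx, cy)) = stack.pop() {
        let i = cy * w + cx;
        if visited[i] || pixels[i] != target {
            continue;
        }
        visited[i] = true;
        pixels[i] = fill;
        // Push the 4-connected neighbors that exist.
        if cx > 0 { stack.push((cx - 1, cy)); }
        if cx + 1 < w { stack.push((cx + 1, cy)); }
        if cy > 0 { stack.push((cx, cy - 1)); }
        if cy + 1 < h { stack.push((cx, cy + 1)); }
    }
}
```

On a large fill the visited test is a single indexed load, where a HashMap would hash the coordinate and chase buckets for every one of millions of pixels.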

GPU Compute Pipelines

When CPU parallelism hits its wall, PaintFE dispatches to GPU compute shaders. Each pipeline has a CPU fallback that activates automatically on headless systems or integrated GPUs.

GpuGradientPipeline

Rasterizes linear, reflected, radial, and diamond gradients. Color stop LUT (256×4 RGBA) and params uniform buffer cached across frames.

```wgsl
@compute @workgroup_size(16, 16)
fn gradient_main(@builtin(global_invocation_id) gid: vec3u) {
    let uv = vec2f(gid.xy) / vec2f(params.size);
    let t = compute_gradient_t(uv, params.shape);
    let color = sample_lut(t);
    textureStore(output, gid.xy, color);
}
```

GpuLiquifyPipeline

Bilinear-interpolated displacement warp. Source snapshot uploaded once per stroke; only the displacement field is re-uploaded each frame as a storage buffer.

```wgsl
@compute @workgroup_size(16, 16)
fn liquify_main(@builtin(global_invocation_id) gid: vec3u) {
    let disp = displacement[gid.y * width + gid.x];
    let src_uv = vec2f(gid.xy) + disp;
    let color = bilinear_sample(source, src_uv);
    textureStore(output, gid.xy, color);
}
```

GpuMeshWarpDisplacement

Evaluates the Catmull-Rom bicubic spline surface on GPU. Uploads ~200 bytes of deformed grid points, dispatches 16×16 workgroups, reads back the full displacement field.

```wgsl
@compute @workgroup_size(16, 16)
fn mesh_warp_main(@builtin(global_invocation_id) gid: vec3u) {
    let cell = find_cell(gid.xy);
    let weights = catmull_rom_weights(cell.local_uv);
    let deformed = evaluate_surface(weights, cell.points);
    displacement[gid.y * width + gid.x] = deformed - original;
}
```
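For reference, the Catmull-Rom basis a catmull_rom_weights step evaluates can be written on the CPU as below. This is a sketch: the standard uniform Catmull-Rom form (tension 0.5) is an assumption, and the real shader interpolates in two dimensions.

```rust
/// Uniform Catmull-Rom basis weights for the four control points
/// surrounding parameter t in [0, 1].
fn catmull_rom_weights(t: f32) -> [f32; 4] {
    let (t2, t3) = (t * t, t * t * t);
    [
        0.5 * (-t3 + 2.0 * t2 - t),
        0.5 * (3.0 * t3 - 5.0 * t2 + 2.0),
        0.5 * (-3.0 * t3 + 4.0 * t2 + t),
        0.5 * (t3 - t2),
    ]
}
```

The weights always sum to 1 (a partition of unity), which is why a deformed point is a plain weighted sum of the four surrounding grid points in each axis.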

Compositor (Render Pass)

Owns all WGPU render passes. Constructs per-layer bind groups, caches blend uniform buffers, runs a single render pass over all visible layers. All 25 blend modes implemented in WGSL.

```wgsl
fn blend_normal(base: vec4f, top: vec4f) -> vec4f {
    let a = top.a + base.a * (1.0 - top.a);
    let rgb = (top.rgb * top.a + base.rgb * base.a * (1.0 - top.a)) / max(a, 0.001);
    return vec4f(rgb, a);
}
```
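A line-for-line CPU port of the blend_normal shader above, useful as a reference for the straight-alpha source-over math (arrays stand in for vec4f):

```rust
/// CPU reference for the WGSL blend_normal: straight-alpha source-over
/// compositing, channels [r, g, b, a] in 0.0..=1.0.
fn blend_normal(base: [f32; 4], top: [f32; 4]) -> [f32; 4] {
    let a = top[3] + base[3] * (1.0 - top[3]);
    let denom = a.max(0.001); // matches the shader's divide-by-zero guard
    let mut out = [0.0; 4];
    for c in 0..3 {
        out[c] = (top[c] * top[3] + base[c] * base[3] * (1.0 - top[3])) / denom;
    }
    out[3] = a;
    out
}
```

An opaque top layer returns exactly the top color; a fully transparent top leaves the base untouched, which makes the function easy to sanity-check against the shader.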

Smart Memory Usage

Deep undo history at 4K is expensive — unless your undo system is built on the same COW structure as the canvas itself.

Undo Memory — 4K Canvas, 20-Step History
Full snapshot undo (naive, before COW): ~1.2 GB
COW Arc tiles undo (current, per modified chunk): ~40 MB
PixelPatch brush undo (stroke delta only): ~5 MB

Estimates for a typical painting workflow on a 4096×4096 RGBA canvas with 4 layers.
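The naive figure is straightforward arithmetic: one full RGBA copy of a 4096×4096 canvas per undo step. A sketch of that worst case (the COW and PixelPatch rows depend on how many tiles a step touches, so they are not derived here):

```rust
fn main() {
    // Naive full-snapshot undo: one full RGBA copy of the canvas per step.
    let full_canvas = 4096u64 * 4096 * 4; // 67_108_864 bytes = 64 MiB per step
    let naive = full_canvas * 20; // 20-step history
    println!("naive: {:.2} GiB", naive as f64 / (1024.0 * 1024.0 * 1024.0));
    // 64 MiB × 20 = 1.25 GiB, the "~1.2 GB" row above.
}
```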


Reusable Compute Buffers

GPU readback staging, displacement Vecs, preview flat buffers, and per-frame pixel caches all persist across frames. Near-zero heap allocation at steady state.

Incremental Preview Cache

Brush strokes update only the dirty rect in a persistent premultiplied Color32 buffer. Per-frame work scales with brush size, not stroke length.

See it run.

Download PaintFE and feel the difference at full resolution.

Download Free View Source on GitHub