Built for
4K at 60fps

A brush stroke on a 4096×4096 canvas should feel the same as painting on a thumbnail. Every layer of the architecture — memory layout, GPU compute, readback strategy — exists to make that true.

Explore pipeline View source
~1MB: readback per stroke at 4K (vs ~33MB full)
~0: cost to clone a 4K canvas (COW tiles)
0: per-pixel sqrt() calls in brush rasterization
6: GPU compute shader pipelines

Hybrid CPU + GPU

Interactive brush previews stay on CPU for zero-latency responsiveness. Heavy compositing and filter ops move to GPU compute shaders via wgpu (Vulkan / Metal / DX12 / OpenGL ES).

Per-Frame Compositing Path
1. Layer Store: TiledImage / COW
2. GPU Upload: wgpu texture
3. WGSL Compositor: blend + merge
4. Dirty Readback: sub-region only
5. bytemuck cast: zero-copy → Color32
6. Partial Upload: set_partial()
7. Display: egui texture
Brush Stroke Path
1. Input Event: pointer / stylus
2. LUT Alpha: no sqrt()
3. Chunk COW: Arc::make_mut()
4. Incr. Cache: dirty rect only
5. set_partial(): ~6KB vs ~33MB
6. Display: immediate

GPU Compositor

All layers are uploaded as wgpu textures and composited in a single render pass. 25 blend modes are implemented directly in WGSL — the CPU never touches composite pixels. Cached blend uniform buffers avoid reallocation between frames.

Dirty-Rect Readback

After GPU compositing, only the sub-region that changed is read back to CPU — typically the size of a single brush stamp. At 4K, a 40px brush reads ~6KB instead of ~33MB. The persistent composite_cpu_buffer is patched in-place.
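The savings are simple arithmetic for an RGBA8 surface (4 bytes per pixel); a quick sketch, using 3840×2160 as the "4K" frame the ~33MB figure implies:

```rust
/// Bytes to read back for an RGBA8 region (4 bytes per pixel).
/// Real wgpu readback also pads rows to a 256-byte pitch, ignored here.
fn readback_bytes(w: u64, h: u64) -> u64 {
    w * h * 4
}

fn main() {
    // A 40px brush stamp vs a full 3840×2160 frame.
    let stamp = readback_bytes(40, 40); // 6_400 bytes ≈ 6 KB
    let full = readback_bytes(3840, 2160); // 33_177_600 bytes ≈ 33 MB
    println!("stamp = {stamp} B, full = {full} B, ratio = {}x", full / stamp);
}
```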

Async Double Buffer

During interactive previews, GPU readback uses ping-pong staging buffers (B1). While frame N renders, frame N−1's data is already mapped on the CPU, so brush strokes never stall the GPU. Commits switch to synchronous readback for an instant final result.
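At its core the ping-pong scheme is an index toggle between two staging slots. A minimal dependency-free sketch (names hypothetical; plain values stand in for GPU staging buffers):

```rust
/// Two staging slots that alternate roles each frame: one receives this
/// frame's GPU copy while the other, holding last frame's result, is mapped.
struct PingPong<T> {
    slots: [T; 2],
    frame: usize,
}

impl<T> PingPong<T> {
    fn new(a: T, b: T) -> Self {
        Self { slots: [a, b], frame: 0 }
    }
    /// Slot the GPU writes into this frame.
    fn write_slot(&mut self) -> &mut T {
        &mut self.slots[self.frame % 2]
    }
    /// Slot holding the previous frame's completed readback, safe to map.
    fn read_slot(&self) -> &T {
        &self.slots[(self.frame + 1) % 2]
    }
    fn advance(&mut self) {
        self.frame += 1;
    }
}
```

Because the CPU only ever maps the slot the GPU finished last frame, neither side waits on the other during a stroke.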

Zero-Copy Display

GPU readback bytes are transmuted directly to Color32 slices via bytemuck::cast_slice. At 4K (8.3M pixels) this is effectively free: no per-pixel conversion loop, no allocation, no copy.
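bytemuck::cast_slice performs this reinterpretation with size and alignment checks. The same mechanics can be sketched with the standard library alone via slice::align_to; the Rgba struct below is a hypothetical stand-in for Color32:

```rust
/// Stand-in for egui's Color32: four u8 channels, no padding.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(C)]
struct Rgba {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

/// Reinterpret raw readback bytes as pixels without copying.
/// bytemuck::cast_slice does this safely; align_to shows the mechanics.
fn as_pixels(bytes: &[u8]) -> &[Rgba] {
    // Sound here because Rgba is #[repr(C)] with size 4 and alignment 1:
    // any byte slice whose length is a multiple of 4 reinterprets cleanly.
    assert_eq!(bytes.len() % 4, 0);
    let (head, pixels, tail) = unsafe { bytes.align_to::<Rgba>() };
    assert!(head.is_empty() && tail.is_empty());
    pixels
}
```

The returned slice borrows the original bytes, so the whole conversion is pointer reinterpretation, not work proportional to pixel count.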

Architecture Dataflow

How data moves through PaintFE on each frame — from user input through the tile engine, GPU pipelines, and back to the display.

1. Input: pointer / stylus / keyboard
2. ToolsPanel: handle_input()
3. TiledImage: COW Arc chunks
4. GPU Upload: ensure_layer_texture
5. Compositor: WGSL blend pass
6. Readback: dirty-region staging
7. bytemuck: zero-copy cast
8. set_partial: egui texture patch
9. Display: final frame
10. History: PixelPatch / Snapshot

Core Stack

GPU

wgpu 0.20 (WebGPU)

GPU compositing and compute shaders using WGSL. Runs on Vulkan, Metal, DirectX 12, and OpenGL ES. All blend modes, gradient rasterization, liquify warp, and mesh warp run as compute dispatches.

Vulkan • Metal • DX12 • GL ES
CPU

rayon 1.7 (Parallelism)

Composites, filter cores, and flip/rotate are all work-stealing parallelized. Row-level parallel composite on a 4K canvas saturates all available CPU cores.

par_chunks_mut • par_iter • spawn
0×COPY

bytemuck (Zero-Copy)

GPU readback bytes are cast directly to Color32 via bytemuck::cast_slice. At 4K this replaces a per-pixel conversion loop over 8.3M pixels with effectively zero CPU cycles.

cast_slice::<u8, Color32>()
COW

Copy-on-Write Tiles

Images stored as Arc<RgbaImage> tile grids. Cloning a 4K layer costs ~36KB (pointer copy). Only the modified chunk is duplicated on write via Arc::make_mut().

clone() = Arc::clone • write = make_mut
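The COW behavior falls straight out of std's Arc::make_mut; a self-contained sketch (tile size and grid shape are illustrative):

```rust
use std::sync::Arc;

// Each tile is an Arc; a hypothetical 64×64 RGBA tile is ~16 KB of pixels.
type Tile = Arc<Vec<u8>>;

fn main() {
    let tile = Arc::new(vec![0u8; 64 * 64 * 4]);
    let mut layer: Vec<Tile> = vec![tile; 16]; // a "layer" = grid of shared tiles

    // Snapshot: 16 pointer copies, no pixel data duplicated.
    let snapshot: Vec<Tile> = layer.clone();

    // Write: Arc::make_mut deep-copies only the one touched tile.
    Arc::make_mut(&mut layer[3])[0] = 255;

    assert_eq!(snapshot[3][0], 0); // snapshot unchanged
    assert_eq!(layer[3][0], 255); // live layer modified
    assert!(Arc::ptr_eq(&snapshot[0], &layer[0])); // untouched tiles still shared
}
```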

TexturePool

GPU textures are never freed between frames. A dimension-keyed pool recycles textures, eliminating the allocate-deallocate churn that causes GPU pipeline stalls in other editors.
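The pooling idea reduces to a map keyed by dimensions. A minimal sketch with a hypothetical Texture stand-in and an allocation counter added purely for illustration:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a GPU texture handle.
#[derive(Debug)]
struct Texture {
    width: u32,
    height: u32,
}

/// Dimension-keyed pool: release() banks a texture, acquire() reuses one
/// of the same size instead of allocating a fresh one.
#[derive(Default)]
struct TexturePool {
    free: HashMap<(u32, u32), Vec<Texture>>,
    allocations: usize, // counts real allocations, for illustration only
}

impl TexturePool {
    fn acquire(&mut self, w: u32, h: u32) -> Texture {
        if let Some(tex) = self.free.get_mut(&(w, h)).and_then(|v| v.pop()) {
            return tex; // recycled: no GPU allocation
        }
        self.allocations += 1;
        Texture { width: w, height: h }
    }

    fn release(&mut self, tex: Texture) {
        self.free.entry((tex.width, tex.height)).or_default().push(tex);
    }
}
```

At steady state every acquire() for a recurring canvas size hits the free list, so the per-frame allocation count drops to zero.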

Brush Alpha LUT

A 256-entry lookup table maps dist² / radius² to alpha. Every per-pixel sqrt() and smoothstep is eliminated. The LUT rebuilds only when brush size or hardness changes.
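The shape of such a table can be sketched as follows; the falloff curve and the meaning of hardness here are illustrative assumptions, not PaintFE's exact formula:

```rust
/// 256-entry alpha LUT indexed by dist²/radius².
struct AlphaLut {
    table: [f32; 256],
}

impl AlphaLut {
    /// Rebuilt only when brush size or hardness changes: the sqrt happens
    /// 256 times per rebuild, never per pixel.
    fn new(hardness: f32) -> Self {
        let mut table = [0.0f32; 256];
        for (i, a) in table.iter_mut().enumerate() {
            // i encodes dist²/radius² in [0, 1); recover normalized distance.
            let t = (i as f32 / 255.0).sqrt();
            // Illustrative curve: opaque core up to `hardness`, then linear falloff.
            *a = ((1.0 - t) / (1.0 - hardness).max(1e-4)).clamp(0.0, 1.0);
        }
        Self { table }
    }

    /// Per-pixel lookup: squared distances only, no sqrt, no smoothstep.
    fn alpha(&self, dist_sq: f32, radius_sq: f32) -> f32 {
        let idx = ((dist_sq / radius_sq) * 255.0) as usize;
        self.table[idx.min(255)]
    }
}
```

The per-pixel cost is one divide, one multiply, and one array load, which is why the rasterizer can afford to call it for every pixel of every stamp.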

B-Series & A-Series

PaintFE’s performance work follows a numbered plan tracked in the codebase. The B-series are architectural overhauls; the A-series are incremental micro-optimizations.

B1
Async GPU Readback
Double-buffered staging buffers ping-pong between frames. While frame N renders, frame N−1’s data is already mapped to CPU. Zero GPU stalls during previews.
B2
Partial Texture Upload
TextureHandle::set_partial() uploads only the dirty rectangle. A 40×40 brush writes ~6KB to the GPU instead of ~33MB for a full clone.
B5
COW Arc Tiles
Each tile is Arc<RgbaImage>. Snapshot = pointer copy. Mutation = lazy-copy only the touched tile. Undo stack stores history practically for free.
B6
Brush Alpha LUT
Pre-computed lookup table maps dist² / radius² to alpha. Eliminates all per-pixel sqrt() calls. Rebuilt only when properties change.
B3 / B4
Superseded by B5
Delta undo (B3) and async commits (B4) became unnecessary once COW tiles made snapshots near-free and filter jobs were already on rayon threads.
GPU
Compute Pipelines
Gradient rasterization, liquify warp, mesh warp displacement, Gaussian blur, and HSL — all have WGSL compute shaders with automatic CPU fallbacks.
ID | Optimization | Impact
A1 | Zero-copy swap for GPU readback | Eliminates a redundant ~33MB clone at 4K on every composite
A2 | Cached staging buffer | Reuses the GPU readback buffer across frames, no realloc
A3 | Cached blend uniform slots | GPU blend buffers reused via queue.write_buffer()
A5 | Chunk-level prefetch in composite_partial | Pre-fetches all layer chunk data per column; CPU cache friendly
A7 | extract_region_rgba_fast buffer reuse | std::mem::take eliminates allocation on every region extract
A8 | Selective LOD invalidation | Only the active layer's LOD is rebuilt on dirty, not all layers
A9 | VecDeque history stacks | O(1) undo history prune + cached memory_usage counter
A14 | SingleLayerSnapshotCommand | Dialog commits save only the affected layer: 1/N memory per step
A16 | Chunk-level flip / rotate | Block transforms operate on whole tiles with par_iter
A17 | Flat visited array (flood fill) | Vec<bool> replaces HashMap, 10–20× faster large fills
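The A17 trick is easy to show concretely: a flat Vec<bool> indexed by y * w + x replaces per-pixel hashing. A minimal stack-based 4-neighbor fill (the scanline details of the real implementation are omitted):

```rust
/// Flood fill using a flat Vec<bool> visited array: direct indexing
/// instead of a HashMap lookup per pixel.
fn flood_fill(pixels: &mut [u32], w: usize, h: usize, x: usize, y: usize, fill: u32) {
    let target = pixels[y * w + x];
    if target == fill {
        return;
    }
    let mut visited = vec![false; w * h]; // one bool per pixel, cache-friendly
    let mut stack = vec![(x, y)];
    while let Some((cx, cy)) = stack.pop() {
        let i = cy * w + cx;
        if visited[i] || pixels[i] != target {
            continue;
        }
        visited[i] = true;
        pixels[i] = fill;
        // Push the 4-connected neighbors that exist.
        if cx > 0 { stack.push((cx - 1, cy)); }
        if cx + 1 < w { stack.push((cx + 1, cy)); }
        if cy > 0 { stack.push((cx, cy - 1)); }
        if cy + 1 < h { stack.push((cx, cy + 1)); }
    }
}
```

On a large fill the visited test is a single indexed load, where a HashMap would hash the coordinate and chase buckets for every one of millions of pixels.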

GPU Compute Pipelines

When CPU parallelism hits its wall, PaintFE dispatches to GPU compute shaders. Each pipeline has a CPU fallback that activates automatically on headless systems or integrated GPUs.

GpuGradientPipeline

Rasterizes linear, reflected, radial, and diamond gradients. Color stop LUT (256×4 RGBA) and params uniform buffer cached across frames.

```wgsl
@compute @workgroup_size(16, 16)
fn gradient_main(@builtin(global_invocation_id) gid: vec3u) {
    let uv = vec2f(gid.xy) / vec2f(params.size);
    let t = compute_gradient_t(uv, params.shape);
    let color = sample_lut(t);
    textureStore(output, gid.xy, color);
}
```

GpuLiquifyPipeline

Bilinear-interpolated displacement warp. Source snapshot uploaded once per stroke; only the displacement field is re-uploaded each frame as a storage buffer.

```wgsl
@compute @workgroup_size(16, 16)
fn liquify_main(@builtin(global_invocation_id) gid: vec3u) {
    let disp = displacement[gid.y * width + gid.x];
    let src_uv = vec2f(gid.xy) + disp;
    let color = bilinear_sample(source, src_uv);
    textureStore(output, gid.xy, color);
}
```

GpuMeshWarpDisplacement

Evaluates the Catmull-Rom bicubic spline surface on GPU. Uploads ~200 bytes of deformed grid points, dispatches 16×16 workgroups, reads back the full displacement field.

```wgsl
@compute @workgroup_size(16, 16)
fn mesh_warp_main(@builtin(global_invocation_id) gid: vec3u) {
    let cell = find_cell(gid.xy);
    let weights = catmull_rom_weights(cell.local_uv);
    let deformed = evaluate_surface(weights, cell.points);
    displacement[gid.y * width + gid.x] = deformed - original;
}
```
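For reference, the Catmull-Rom basis a catmull_rom_weights step evaluates can be written on the CPU as below. This is a sketch: the standard uniform Catmull-Rom form (tension 0.5) is an assumption, and the real shader interpolates in two dimensions.

```rust
/// Uniform Catmull-Rom basis weights for the four control points
/// surrounding parameter t in [0, 1].
fn catmull_rom_weights(t: f32) -> [f32; 4] {
    let (t2, t3) = (t * t, t * t * t);
    [
        0.5 * (-t3 + 2.0 * t2 - t),
        0.5 * (3.0 * t3 - 5.0 * t2 + 2.0),
        0.5 * (-3.0 * t3 + 4.0 * t2 + t),
        0.5 * (t3 - t2),
    ]
}
```

The weights always sum to 1 (a partition of unity), which is why a deformed point is a plain weighted sum of the four surrounding grid points in each axis.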

Compositor (Render Pass)

Owns all WGPU render passes. Constructs per-layer bind groups, caches blend uniform buffers, runs a single render pass over all visible layers. All 25 blend modes implemented in WGSL.

```wgsl
fn blend_normal(base: vec4f, top: vec4f) -> vec4f {
    let a = top.a + base.a * (1.0 - top.a);
    let rgb = (top.rgb * top.a + base.rgb * base.a * (1.0 - top.a)) / max(a, 0.001);
    return vec4f(rgb, a);
}
```
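A line-for-line CPU port of the blend_normal shader above, useful as a reference for the straight-alpha source-over math (arrays stand in for vec4f):

```rust
/// CPU reference for the WGSL blend_normal: straight-alpha source-over
/// compositing, channels [r, g, b, a] in 0.0..=1.0.
fn blend_normal(base: [f32; 4], top: [f32; 4]) -> [f32; 4] {
    let a = top[3] + base[3] * (1.0 - top[3]);
    let denom = a.max(0.001); // matches the shader's divide-by-zero guard
    let mut out = [0.0; 4];
    for c in 0..3 {
        out[c] = (top[c] * top[3] + base[c] * base[3] * (1.0 - top[3])) / denom;
    }
    out[3] = a;
    out
}
```

An opaque top layer returns exactly the top color; a fully transparent top leaves the base untouched, which makes the function easy to sanity-check against the shader.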

Smart Memory Usage

Deep undo history at 4K is expensive — unless your undo system is built on the same COW structure as the canvas itself.

Undo Memory — 4K Canvas, 20-Step History
Full snapshot undo (naive, before COW): ~1.2 GB
COW Arc tiles undo (current, per modified chunk): ~40 MB
PixelPatch brush undo (stroke delta only): ~5 MB

Estimates for a typical painting workflow on a 4096×4096 RGBA canvas with 4 layers.
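The naive figure is straightforward arithmetic: one full RGBA copy of a 4096×4096 canvas per undo step. A sketch of that worst case (the COW and PixelPatch rows depend on how many tiles a step touches, so they are not derived here):

```rust
fn main() {
    // Naive full-snapshot undo: one full RGBA copy of the canvas per step.
    let full_canvas = 4096u64 * 4096 * 4; // 67_108_864 bytes = 64 MiB per step
    let naive = full_canvas * 20; // 20-step history
    println!("naive: {:.2} GiB", naive as f64 / (1024.0 * 1024.0 * 1024.0));
    // 64 MiB × 20 = 1.25 GiB, the "~1.2 GB" row above.
}
```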


Reusable Compute Buffers

GPU readback staging, displacement Vecs, preview flat buffers, and per-frame pixel caches all persist across frames. Near-zero heap allocation at steady state.

Incremental Preview Cache

Brush strokes update only the dirty rect in a persistent premultiplied Color32 buffer. Per-frame work scales with brush size, not stroke length.

See it run.

Download PaintFE and feel the difference at full resolution.

Download Free View Source on GitHub