Summary
On a busy multi-tenant production node, Electric.Shapes.Consumer processes accumulate ~7–9 MB resident heaps each, while their actual live state is ~8 KB. The heap is floating garbage in the old generation that is never reclaimed, because (a) consumers rarely hibernate and (b) they spawn with the BEAM default fullsweep_after (65535), so generational GC never performs a shrinking full sweep. With several thousand such processes per node this reached ~36 GB of unreclaimed heap on a single node. A manual :erlang.garbage_collect/1 on any consumer drops its heap ~90%, confirming the retained memory is unreachable garbage rather than live state.
Evidence (single production node)
- 5,043 consumers; 4,455 with heap > 4 MB; aggregate ~36 GB.
- hibernating = 0 / 5,043 (none in :gen_server.loop_hibernate / :erlang.hibernate).
- Representative per-consumer snapshot (:sys.get_state/1 + Process.info/2):
  - total_heap_size ≈ 6.7–9.2 MB
  - :erts_debug.size/1 of the entire GenServer state ≈ 8 KB (shape ~3 KB, writer ~1 KB; buffer, txn_offset_mapping, transaction_builder all empty; pending_txn: nil)
  - hibernate_after = 600000 ms (10 min)
  - garbage_collection[:fullsweep_after] = 65535 (BEAM default)
  - garbage_collection[:minor_gcs] ≈ 40–50 (≪ 65535 ⇒ a full sweep has never run)
- Manual :erlang.garbage_collect(pid) reclaims ~90% (heap → ~0.2 MB).
The shapes involved are correctly indexed (equality on a high-cardinality column) and process only a few dozen tiny transactions per day each, so this is not driven by transaction size, throughput, or unindexed WHERE evaluation — it is purely a GC-policy issue against a large fleet of long-lived, mostly-idle processes.
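For anyone wanting to reproduce the per-consumer numbers above, here is a minimal sketch of the measurement. HeapProbe is a hypothetical helper (not part of Electric); it assumes the target is a local OTP process that answers :sys.get_state/1, and it converts the word counts returned by the runtime into bytes.

```elixir
defmodule HeapProbe do
  @moduledoc "Hypothetical diagnostic helper; not part of Electric."

  # Returns heap vs. live-state sizes and GC counters for one local process,
  # or nil if the process is no longer alive.
  def snapshot(pid) when is_pid(pid) do
    word = :erlang.system_info(:wordsize)

    case Process.info(pid, [:total_heap_size, :garbage_collection, :current_function]) do
      nil ->
        nil

      info ->
        gc = info[:garbage_collection]

        %{
          # total_heap_size and :erts_debug.size/1 are both reported in words
          heap_bytes: info[:total_heap_size] * word,
          state_bytes: :erts_debug.size(:sys.get_state(pid)) * word,
          fullsweep_after: gc[:fullsweep_after],
          minor_gcs: gc[:minor_gcs],
          hibernating?: hibernating?(info[:current_function])
        }
    end
  end

  # The two hibernation entry points named in the evidence list.
  defp hibernating?({:erlang, :hibernate, _arity}), do: true
  defp hibernating?({:gen_server, :loop_hibernate, _arity}), do: true
  defp hibernating?(_other), do: false
end
```

Mapping HeapProbe.snapshot/1 over the consumer pids and comparing heap_bytes with state_bytes should show the same gap, assuming the pids are obtained by whatever registry lookup the deployment already has.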
Root cause
fullsweep_after defaults to 65535: the old heap is only fully collected on a full sweep, which won't happen until fullsweep_after minor GCs or a hibernation. Consumers:
- rarely hibernate: shape_hibernate_after is large (10 min in this deployment) and any received message re-arms the GenServer inactivity timeout, so active-ish consumers seldom idle that long; and
- never approach 65535 minor GCs.
So promoted floating garbage accumulates on the old heap indefinitely. This is the same class of problem already addressed for Bandit handler processes via handler_fullsweep_after (whose doc comment in config.ex describes exactly this old-heap accumulation), but Electric.Shapes.Consumer was never wired into a fullsweep_after spawn option — Consumer.start_link/1 is a plain GenServer.start_link/3 with no :spawn_opt.
Proposed fix
Spawn consumers with a tunable fullsweep_after, mirroring ShapeLogCollector, which already does spawn_opt: Electric.StackConfig.spawn_opts(stack_id, :shape_log_collector). Thread a :consumer entry through the existing process_spawn_opts config (or add a dedicated knob analogous to handler_fullsweep_after) into Consumer.start_link/1:
    GenServer.start_link(__MODULE__, init_arg,
      name: name(stack_id, shape_handle),
      spawn_opt: Electric.StackConfig.spawn_opts(stack_id, :consumer)
    )
A modest fullsweep_after bounds each consumer's heap to its true working set with negligible extra GC cost (live state is ~8 KB, so full sweeps are cheap). Consider applying the same to Consumer.Snapshotter and Consumer.Materializer.
Note on tuning: the per-process garbage-vs-minor-GC ratio observed here is high (~8 MB accumulated over only ~40–50 minor GCs), so the effective fullsweep_after should be low (order of 10–20) to actually bound the heap — values in the hundreds/thousands still let substantial garbage accumulate between sweeps for these low-allocation processes.
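As a standalone illustration of the mechanism (not the Electric code itself), the sketch below starts a plain GenServer with fullsweep_after passed through :spawn_opt and reads the setting back with Process.info/2. SweepDemo and the default of 20 are assumptions chosen for the example; the real value would come from the process_spawn_opts config discussed above.

```elixir
defmodule SweepDemo do
  @moduledoc "Hypothetical demo module; stands in for a consumer process."
  use GenServer

  # :spawn_opt is forwarded to the process spawn, so fullsweep_after applies
  # from the first GC onwards. The default of 20 is illustrative only.
  def start_link(opts \\ []) do
    fullsweep_after = Keyword.get(opts, :fullsweep_after, 20)
    GenServer.start_link(__MODULE__, :ok, spawn_opt: [fullsweep_after: fullsweep_after])
  end

  @impl true
  def init(:ok), do: {:ok, %{}}
end

{:ok, pid} = SweepDemo.start_link(fullsweep_after: 20)
{:garbage_collection, gc} = Process.info(pid, :garbage_collection)
IO.inspect(gc[:fullsweep_after], label: "fullsweep_after")
```

The same Process.info/2 check can be run against a live consumer after the change ships to confirm the option actually reached the process.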
Operational mitigation (no deploy)
Periodically running :erlang.garbage_collect/1 over all registered consumers reclaims the garbage immediately (heaps slowly refill until the spawn-opt fix ships).
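A minimal sketch of such a sweep, assuming the caller already has the list of consumer pids (how they are enumerated is deployment-specific and not shown). ManualSweep is a hypothetical module; it forces a full GC on each local pid and returns the approximate number of bytes reclaimed, skipping processes that have exited in the meantime.

```elixir
defmodule ManualSweep do
  @moduledoc "Hypothetical mitigation helper; not part of Electric."

  # Forces a GC on every pid and returns the total heap bytes reclaimed.
  # Dead processes (Process.info/2 returns nil, garbage_collect/1 returns
  # false) contribute nothing to the total.
  def run(pids) when is_list(pids) do
    word = :erlang.system_info(:wordsize)

    Enum.reduce(pids, 0, fn pid, reclaimed ->
      with {:total_heap_size, before_gc} <- Process.info(pid, :total_heap_size),
           true <- :erlang.garbage_collect(pid),
           {:total_heap_size, after_gc} <- Process.info(pid, :total_heap_size) do
        reclaimed + (before_gc - after_gc) * word
      else
        _ -> reclaimed
      end
    end)
  end
end
```

With several thousand consumers it may be worth sweeping in batches with a pause between them, since each call blocks until the target process has finished collecting.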
Affected code
packages/sync-service/lib/electric/shapes/consumer.ex — start_link/1 (no :spawn_opt)
packages/sync-service/lib/electric/replication/shape_log_collector.ex — existing spawn_opts pattern to mirror
packages/sync-service/lib/electric/config.ex — handler_fullsweep_after / process_spawn_opts precedent