Shape consumers accumulate multi-MB heaps of unreclaimed floating garbage (default fullsweep_after, rarely hibernate) #4476

@erik-the-implementer

Summary

On a busy multi-tenant production node, Electric.Shapes.Consumer processes accumulate ~7–9 MB resident heaps each, while their actual live state is ~8 KB. The heap is floating garbage on the old generation that is never reclaimed, because (a) consumers rarely hibernate and (b) they spawn with the BEAM default fullsweep_after (65535), so generational GC never performs a shrinking full sweep. With several thousand such processes per node this reached ~36 GB of unreclaimable heap on a single node. A manual :erlang.garbage_collect/1 on any consumer drops its heap ~90%, confirming the retained memory is unreachable garbage rather than live state.

Evidence (single production node)

  • 5,043 consumers; 4,455 with heap > 4 MB; aggregate ~36 GB.
  • hibernating = 0 / 5043 (none in :gen_server.loop_hibernate / :erlang.hibernate).
  • Representative per-consumer snapshot (:sys.get_state/1 + Process.info/2):
    • total_heap_size: 6.7–9.2 MB
    • :erts_debug.size/1 of the entire GenServer state ≈ 8 KB (shape ~3 KB, writer ~1 KB; buffer, txn_offset_mapping, transaction_builder all empty; pending_txn: nil)
    • hibernate_after = 600000 ms (10 min)
    • garbage_collection[:fullsweep_after] = 65535 (BEAM default)
    • garbage_collection[:minor_gcs] = 40–50 (≪ 65535 ⇒ a full sweep has never run)
  • Manual :erlang.garbage_collect(pid) reclaims ~90% (heap → ~0.2 MB).
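
The per-consumer figures above can be collected from a remote shell along these lines. This is a minimal sketch: `HeapProbe` and its function names are hypothetical and not part of Electric. It assumes only `Process.info/2` and `:erlang.garbage_collect/1`, and converts `total_heap_size` (reported in words) to bytes.

```elixir
defmodule HeapProbe do
  # Hypothetical helper for illustration; not part of Electric.

  @doc "Heap size in bytes plus GC settings for `pid`, or nil if it has exited."
  def snapshot(pid) do
    case Process.info(pid, [:total_heap_size, :garbage_collection]) do
      nil ->
        nil

      info ->
        gc = info[:garbage_collection]

        %{
          # total_heap_size is reported in words, so scale by the word size
          heap_bytes: info[:total_heap_size] * :erlang.system_info(:wordsize),
          fullsweep_after: gc[:fullsweep_after],
          minor_gcs: gc[:minor_gcs]
        }
    end
  end

  @doc "Forces a full sweep of a live `pid`; returns heap bytes before and after."
  def reclaim(pid) do
    before = snapshot(pid)
    :erlang.garbage_collect(pid)
    %{before: before.heap_bytes, after: snapshot(pid).heap_bytes}
  end
end
```

Calling `HeapProbe.reclaim(pid)` on a consumer pid would show the before/after drop described above. `reclaim/1` assumes the process is still alive and raises otherwise.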

The shapes involved are correctly indexed (equality on a high-cardinality column) and process only a few dozen tiny transactions per day each, so this is not driven by transaction size, throughput, or unindexed WHERE evaluation — it is purely a GC-policy issue against a large fleet of long-lived, mostly-idle processes.

Root cause

fullsweep_after defaults to 65535. The old heap is only collected by a full sweep, which does not run until the process has performed fullsweep_after minor GCs or hibernates. Consumers:

  1. rarely hibernate: shape_hibernate_after is large (10 min in this deployment) and any received message re-arms the GenServer inactivity timeout, so active-ish consumers seldom idle that long; and
  2. never approach 65535 minor GCs.

So promoted floating garbage accumulates on the old heap indefinitely. This is the same class of problem already addressed for Bandit handler processes via handler_fullsweep_after (whose doc comment in config.ex describes exactly this old-heap accumulation), but Electric.Shapes.Consumer was never wired into a fullsweep_after spawn option — Consumer.start_link/1 is a plain GenServer.start_link/3 with no :spawn_opt.

Proposed fix

Spawn consumers with a tunable fullsweep_after, mirroring ShapeLogCollector, which already does spawn_opt: Electric.StackConfig.spawn_opts(stack_id, :shape_log_collector). Thread a :consumer entry through the existing process_spawn_opts config (or add a dedicated knob analogous to handler_fullsweep_after) into Consumer.start_link/1:

GenServer.start_link(__MODULE__, init_arg,
  name: name(stack_id, shape_handle),
  spawn_opt: Electric.StackConfig.spawn_opts(stack_id, :consumer)
)

A modest fullsweep_after bounds each consumer's heap to its true working set with negligible extra GC cost (live state is ~8 KB, so full sweeps are cheap). Consider applying the same to Consumer.Snapshotter and Consumer.Materializer.
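
To check that a `:spawn_opt` passed to `GenServer.start_link/3` actually reaches the process, a toy GenServer is enough. The sketch below uses a hypothetical `SweepDemo` module as a stand-in for a consumer; it is not Electric code, and the value 20 is only an example.

```elixir
defmodule SweepDemo do
  # Hypothetical stand-in for a consumer; not Electric code.
  use GenServer

  def start_link(fullsweep_after) do
    # :spawn_opt is forwarded to the underlying process spawn
    GenServer.start_link(__MODULE__, :ok, spawn_opt: [fullsweep_after: fullsweep_after])
  end

  @impl true
  def init(:ok), do: {:ok, %{}}
end

{:ok, pid} = SweepDemo.start_link(20)
{:garbage_collection, gc} = Process.info(pid, :garbage_collection)
IO.inspect(gc[:fullsweep_after], label: "fullsweep_after")
```

The same `Process.info(pid, :garbage_collection)` check can be run against a real consumer after the change to confirm the configured value took effect.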

Note on tuning: the per-process garbage-vs-minor-GC ratio observed here is high (~8 MB accumulated over only ~40–50 minor GCs), so the effective fullsweep_after should be low (order of 10–20) to actually bound the heap — values in the hundreds/thousands still let substantial garbage accumulate between sweeps for these low-allocation processes.

Operational mitigation (no deploy)

Periodically running :erlang.garbage_collect/1 over all registered consumers reclaims the garbage immediately (heaps slowly refill until the spawn-opt fix ships).
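
One possible shape for such a periodic sweep is sketched below. `ConsumerSweeper`, the `:list_pids` callback, and the 5-minute default interval are all hypothetical; how consumer pids are enumerated in a given deployment is left to the caller, and the pids are assumed to be local to the node.

```elixir
defmodule ConsumerSweeper do
  # Hypothetical mitigation helper; not part of Electric.
  use GenServer

  # opts: :list_pids (zero-arity fun returning local pids), :interval_ms
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @doc "Forces a full sweep of each pid; returns how many were still alive."
  def sweep(pids) do
    # :erlang.garbage_collect/1 returns false when the process has already exited
    Enum.count(pids, fn pid -> :erlang.garbage_collect(pid) end)
  end

  @impl true
  def init(opts) do
    state = %{
      list_pids: Keyword.fetch!(opts, :list_pids),
      # Arbitrary default; tune to how quickly heaps refill
      interval_ms: Keyword.get(opts, :interval_ms, :timer.minutes(5))
    }

    schedule(state)
    {:ok, state}
  end

  @impl true
  def handle_info(:sweep, state) do
    sweep(state.list_pids.())
    schedule(state)
    {:noreply, state}
  end

  defp schedule(state), do: Process.send_after(self(), :sweep, state.interval_ms)
end
```

Each `:erlang.garbage_collect/1` call is synchronous, so a sweep over several thousand processes runs them one after another inside the sweeper process rather than stalling the consumers' own supervisor.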

Affected code

  • packages/sync-service/lib/electric/shapes/consumer.ex: start_link/1 (no :spawn_opt)
  • packages/sync-service/lib/electric/replication/shape_log_collector.ex: existing spawn_opts pattern to mirror
  • packages/sync-service/lib/electric/config.ex: handler_fullsweep_after / process_spawn_opts precedent
