bloat: virtualize per-chipset writer dispatch in fl::Channel::showPixels (stop inlining 3 writer templates into one symbol) #2981

@zackees

Tracked in #2974 (item TBD-E from the 1-4KB band audit).

Current state

fl::Channel::showPixels(PixelController<RGB, 1, -1>&) is 1,884 B on ESP32-S3 Blink. The bloat breakdown by inlined callee:

| Inlined callee | Bytes | Notes |
|---|---|---|
| fl::Channel::resolveDynamicDriver() | 927 B | called x43 from inside showPixels |
| fl::(anonymous namespace)::ReorderingPixelIteratorAny::ReorderingPixelIteratorAny(...) | 743 B | XYMap reordering + iterator ctor |
| fl::(anonymous namespace)::writeUCS7604(fl::vector_psram<u8>*, fl::PixelIterator&, ...) | 507 B | gated by FASTLED_DISABLE_UCS7604 |
| fl::(anonymous namespace)::emitDisabledDriverError(fl::string const&, fl::string const&, ...) | 490 B | gated FL_NO_INLINE already (#2773 follow-up to #2832) |
| fl::PixelIterator::writeWS2812<...>(...) | inlined | clockless dispatch |
| fl::PixelIterator::writeSK9822<...>(...) | inlined | SPI dispatch |
| fl::PixelIterator::writeAPA102<...>(...) | inlined | SPI dispatch |

So a single showPixels symbol carries: the pre-bound vs. dynamic-driver dispatch, three chipset writer template specializations, the dynamic-driver resolution chain, the disabled-driver diagnostic, and the XYMap-reordering iterator construction --- all welded together.

Where the body lives

src/fl/channels/channel.cpp.hpp lines 449-673 (Channel::showPixels). The per-chipset dispatch is the inner switch blocks:

  • Lines 523-547: ClocklessChipset switch on clockless->encoder -> pixelIterator.writeWS2812(&data) / writeUCS7604(...)
  • Lines 549-619: SpiChipsetConfig switch on config.chipset (11 cases) -> writeAPA102 / writeSK9822 / writeWS2801 / writeP9813 / writeLPD8806 / writeLPD6803 / writeSM16716 / writeHD108

All writeXXX methods live in src/fl/chipsets/encoders/pixel_iterator.h (lines 202, 223, 255, ...) and are fully inlined templates that the compiler folds into showPixels via the call sites above.

The x43 resolveDynamicDriver call count: diagnosis

resolveDynamicDriver() itself is already FL_NO_INLINE (channel.cpp.hpp:390). It is statically called from exactly one site in showPixels (line 493).

The x43 figure in the backref graph is almost certainly a machine-level count, not 43 distinct call sites in C++: it tallies every branch to resolveDynamicDriver in the disassembly, plus the per-edge counts the symbol graph implies once the switch tables are flattened. The symptom: FL_NO_INLINE keeps the body out of line, but the compiler duplicates the call-site sequence (build the arguments, save caller-saved registers, branch, restore) at every branch fan-in inside the dispatch switches. With ~12-14 writeXXX cases, each with its own restore path, that short sequence multiplies.

The actionable read: FL_NO_INLINE is doing its job on the body, but the call site itself is being duplicated by the switch dispatch.

Proposed fix

Move per-chipset writer selection off the inline switch and onto the IChannelDriver interface (src/fl/channels/driver.h) --- or to a function-pointer table held by ChannelData:

Option A: virtual writer on IChannelDriver

class IChannelDriver {
public:
    // ... existing enqueue/show/poll ...

    // Default: dispatch via the current inline switch (back-compat).
    // Override per-driver to call exactly the writeXXX the driver supports.
    virtual void encodePixels(PixelIterator& it,
                              fl::vector_psram<u8>* out,
                              const ChipsetVariant& chipset) FL_NOEXCEPT;
};

Each concrete driver implements just the encoders it needs. Channel::showPixels becomes:

driver->encodePixels(pixelIterator, &data, mChipset);

The 3+ writer templates each become one out-of-line symbol (the virtual override body), not 3 specializations inlined into one ~700 B blob.
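To make the shape of Option A concrete, here is a minimal, self-contained sketch of the dispatch pattern. It is not the FastLED code: PixelSource, ByteBuffer, DriverSketch, the two concrete drivers, and both writer functions are hypothetical stand-ins, the byte layouts are simplified, and the ChipsetVariant parameter and FL_NOEXCEPT from the proposed signature are left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins, NOT the FastLED types: PixelSource plays the role of
// fl::PixelIterator and ByteBuffer the role of fl::vector_psram<u8>.
struct PixelSource {
    std::vector<std::uint8_t> rgb;  // packed R,G,B triples
};
using ByteBuffer = std::vector<std::uint8_t>;

// Simplified clockless-style writer: emits each pixel in G,R,B order.
void writeClocklessSketch(const PixelSource& src, ByteBuffer* out) {
    for (std::size_t i = 0; i + 2 < src.rgb.size(); i += 3) {
        out->push_back(src.rgb[i + 1]);
        out->push_back(src.rgb[i]);
        out->push_back(src.rgb[i + 2]);
    }
}

// Simplified APA102-style writer: 4-byte start frame, then a header byte plus
// B,G,R per pixel, then a 4-byte end frame.
void writeSpiSketch(const PixelSource& src, ByteBuffer* out) {
    for (int k = 0; k < 4; ++k) out->push_back(0x00);
    for (std::size_t i = 0; i + 2 < src.rgb.size(); i += 3) {
        out->push_back(0xFF);
        out->push_back(src.rgb[i + 2]);
        out->push_back(src.rgb[i + 1]);
        out->push_back(src.rgb[i]);
    }
    for (int k = 0; k < 4; ++k) out->push_back(0xFF);
}

// Hypothetical interface mirroring the proposed encodePixels hook.
class DriverSketch {
public:
    virtual ~DriverSketch() = default;
    virtual void encodePixels(const PixelSource& src, ByteBuffer* out) = 0;
};

// Each driver references only the writer it supports.
class ClocklessDriverSketch final : public DriverSketch {
public:
    void encodePixels(const PixelSource& src, ByteBuffer* out) override {
        writeClocklessSketch(src, out);
    }
};

class SpiDriverSketch final : public DriverSketch {
public:
    void encodePixels(const PixelSource& src, ByteBuffer* out) override {
        writeSpiSketch(src, out);
    }
};

// The showPixels-side call site: one indirect call, no per-chipset switch.
ByteBuffer showPixelsSketch(DriverSketch& driver, const PixelSource& src) {
    ByteBuffer data;
    driver.encodePixels(src, &data);
    return data;
}
```

Because each override names exactly one writer, a build that never links SpiDriverSketch has no remaining reference to writeSpiSketch, which is the property the proposal relies on for dead-stripping.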

Option B: function-pointer table on ChipsetVariant / ChannelData

A static constexpr table indexed by the ClocklessEncoder / SpiChipset enum, with entries pointing at &PixelIterator::writeXXX. This avoids the vtable cost while still reducing the call site to one indirect branch.
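A rough sketch of the table approach, under stated assumptions: SpiChipsetSketch, WriterFn, kWriters, encodeViaTable, and the two writer functions are hypothetical names invented for illustration, the writers are free functions rather than PixelIterator members, and the byte layouts are simplified.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical enum standing in for the SpiChipset enum named above.
enum class SpiChipsetSketch : std::uint8_t { APA102 = 0, WS2801 = 1, Count };

using Bytes = std::vector<std::uint8_t>;
using WriterFn = void (*)(const Bytes& rgb, Bytes* out);

// Simplified APA102-style writer: start frame, header + B,G,R per pixel, end frame.
void writeApa102Sketch(const Bytes& rgb, Bytes* out) {
    for (int k = 0; k < 4; ++k) out->push_back(0x00);
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
        out->push_back(0xFF);
        out->push_back(rgb[i + 2]);
        out->push_back(rgb[i + 1]);
        out->push_back(rgb[i]);
    }
    for (int k = 0; k < 4; ++k) out->push_back(0xFF);
}

// Simplified WS2801-style writer: raw R,G,B bytes with no framing.
void writeWs2801Sketch(const Bytes& rgb, Bytes* out) {
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
        out->push_back(rgb[i]);
        out->push_back(rgb[i + 1]);
        out->push_back(rgb[i + 2]);
    }
}

// One table entry per enum value, in enum order.
constexpr std::array<WriterFn, static_cast<std::size_t>(SpiChipsetSketch::Count)>
    kWriters = {{&writeApa102Sketch, &writeWs2801Sketch}};

// The call site is a table load plus one indirect call.
void encodeViaTable(SpiChipsetSketch chip, const Bytes& rgb, Bytes* out) {
    kWriters[static_cast<std::size_t>(chip)](rgb, out);
}
```

One design point worth checking before choosing this option: a table that names every writer also references every writer, so --gc-sections alone may not drop unused ones unless the entries are compiled out, for example behind the existing FASTLED_DISABLE_* gates.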

Either way, the goal is to make the writer dispatch a single indirect call out of showPixels, splitting the writer bodies into their own symbols where dead-code elimination (--gc-sections) can drop the unused ones.

Estimated savings

~600-1000 B off this symbol (showPixels itself), with a small portion of that re-spent in the out-of-line writer specializations. Net savings: ~400-800 B once those out-of-line bodies are counted, because writer code now lives once per chipset as its own symbol rather than once per instantiation site inside showPixels.

This is additive with the gates already shipped: FASTLED_DISABLE_UCS7604 (#2920), FASTLED_DISABLE_SPI_CHIPSETS (#2913), FASTLED_DISABLE_DYNAMIC_DRIVER (#2926).

Perf trade-off

One virtual call per FastLED.show(), a sub-microsecond cost on every supported target:

  • ESP32-S3 @ 240 MHz: one indirect jump is ~3-5 cycles, or ~13-21 ns (call it ~20 ns)
  • Even at 60 Hz, this is 20 ns / 16.7 ms = 1.2 ppm of frame time
  • WS2812 timing budget for 100 LEDs is ~3 ms (100 LEDs x 24 bits x 1.25 us per bit); the virtual call is 0.00067% of one frame's encode time

Verdict: free. The encode loop runs numLeds * bytes_per_led * 8 bit-bangs; one extra branch in the prologue is invisible.
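The figures above can be reproduced with a few constants. This is only a worked check of the arithmetic: the 5-cycle value is the upper end of the 3-5 cycle estimate, the 1.25 us WS2812 bit time assumes the usual 800 kHz data rate, and all constant names are made up for this example.

```cpp
#include <cassert>

// Upper end of the 3-5 cycle estimate for one indirect call at 240 MHz.
constexpr double kCpuHz = 240e6;
constexpr double kCallCycles = 5.0;
constexpr double kCallNs = kCallCycles / kCpuHz * 1e9;  // about 20.8 ns

// Share of a 60 Hz frame period, in parts per million.
constexpr double kFrameNs = 1e9 / 60.0;                       // about 16.7 ms
constexpr double kPpmOfFrame = kCallNs / kFrameNs * 1e6;      // about 1.25 ppm

// WS2812 wire time for 100 LEDs, assuming 24 bits per LED at 1.25 us per bit.
constexpr double kBitNs = 1250.0;
constexpr double kEncodeNs = 100.0 * 24.0 * kBitNs;           // 3 ms
constexpr double kPercentOfEncode = kCallNs / kEncodeNs * 100.0;  // about 0.0007 %

static_assert(kCallNs < 1000.0, "the call cost stays sub-microsecond");
```

Rounding the call cost down to 20 ns gives the 1.2 ppm and 0.00067% quoted in the list.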

Constraint preservation

Per #2974, logging stays enabled. This fix is purely a dispatch-shape change --- every existing FL_ERROR / FL_WARN site continues to fire as before. The emitDisabledDriverError cold helper (already FL_NO_INLINE per #2773) remains untouched.

Acceptance criteria

  • Channel::showPixels symbol drops below ~1,000 B on ESP32-S3 Blink
  • writeWS2812, writeSK9822, writeAPA102 appear as distinct symbols in the bloat report (not folded into showPixels)
  • Unused writer specializations are dead-stripped by --gc-sections when their driver isn't linked
  • All existing tests pass: bash test --cpp
  • No measurable frame-time regression on a 60 Hz WS2812 + 100 LED Blink sketch
