Web Frontend -- v1.3.5.2 Addendum

This addendum extends the Web Frontend Guided Demo, the Phase 14 Addendum, and the Phase 15 Addendum with the v1.3.5.2 WebSocket stress-test and backpressure scenario. The focus here is operational behaviour under load: how the ws://localhost:8765/ws/events bridge handles a sustained 1000-events/sec burst, how failing subscribers route to the dead-letter queue, and how OpenTelemetry spans narrate the end-to-end path.


Scenario Overview

Goals

  • Exercise the WebSocket bridge at a workload roughly two orders of magnitude above the typical demo rate.
  • Verify that slow or unresponsive subscribers do not block the publish path.
  • Capture evidence -- spans, metrics, DLQ entries -- that the operator can use to diagnose a real incident.

Moving Parts

Component                 Role                            Source
EventBus                  In-process pub/sub              src/fcc/messaging/bus.py
DeadLetterQueue           Captures undeliverable events   src/fcc/messaging/dlq.py
WS bridge                 Protocol adapter                fcc protocol ws-bridge CLI
FccTracer                 Span emission                   src/fcc/observability/tracing.py
FccMetrics                Counter/gauge/histogram         src/fcc/observability/metrics.py
StressTestPanel (design)  React UI for driving the burst  frontend/src/pages/StressTestPanel.tsx (scheduled for v1.3.7 alongside the web admin UI; see ROADMAP.md v1.3.7 section)

Note on StressTestPanel.tsx: at v1.3.5.2 the panel was a design proposal; at v1.3.5.4 it was re-scoped into v1.3.7, where it ships alongside the new /admin React route and consumes the same server-side harness documented below. The server-side stress-test harness is fully functional today -- only the UI surface waits for v1.3.7, and the underlying protocol contract does not change.


Starting the Bridge

The backend WebSocket bridge is started via the fcc CLI or via the Docker backend image. Either entry point yields the same endpoint on port 8765 with an HTTP health check on /health served from the same process.

# Local development
fcc protocol ws-bridge --host 0.0.0.0 --port 8765

# Docker
docker compose up -d fcc-backend
curl http://localhost:8765/health

Once the bridge is live, the frontend's Events page connects to ws://localhost:8765/ws/events and begins receiving the default event firehose. The stress-test scenario injects additional synthetic traffic on top of this live stream.
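
Outside the browser, the same firehose can be tailed from a short Python client. The sketch below assumes the third-party websockets package and JSON-encoded text frames; the field names in parse_frame are illustrative, so check them against the actual wire format:

```python
import asyncio
import json

def parse_frame(raw: str) -> dict:
    """Decode one WebSocket text frame into a small event dict.
    The field names are illustrative, not a wire-format contract."""
    event = json.loads(raw)
    return {
        "type": event.get("event_type"),
        "source": event.get("source"),
        "trace_id": event.get("trace_id"),
    }

async def tail_events(url: str = "ws://localhost:8765/ws/events") -> None:
    # Third-party dependency, imported lazily: pip install websockets
    import websockets

    async with websockets.connect(url) as ws:
        async for raw in ws:
            print(parse_frame(raw))

# Against a live bridge:
# asyncio.run(tail_events())
```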


Driving the 1000-Events/sec Burst

Client-Side Harness

The burst is produced by a small Python harness that publishes directly to the EventBus while the bridge is connected. The harness does not need to know anything about the WebSocket layer -- the bridge observes the bus and forwards to connected clients.

import time
from fcc.messaging.bus import EventBus
from fcc.messaging.events import Event, EventType

bus = EventBus.default()

def burst(events_per_second: int = 1000, duration_s: int = 10) -> None:
    interval = 1.0 / events_per_second
    start = time.time()
    end = start + duration_s
    count = 0
    while time.time() < end:
        bus.publish(
            Event(
                event_type=EventType.WORKFLOW_STEP_COMPLETED,
                source="stress_harness",
                payload={"seq": count},
            )
        )
        count += 1
        # Deadline-based pacing: sleeping a fixed interval would let
        # publish overhead erode the target rate, so sleep only until
        # the next scheduled tick.
        remaining = start + count * interval - time.time()
        if remaining > 0:
            time.sleep(remaining)
    print(f"Published {count} events in {duration_s}s")

Server-Side Throttles

The bridge applies three throttles in front of any subscriber:

Throttle                   Default                Purpose
Per-connection send queue  2,000 events           Prevents an individual client from exhausting memory
Subscriber invoke timeout  250 ms                 Bounds blocking per subscriber
Global DLQ capacity        Unbounded (in-memory)  Captures undeliverable events for later review

When the per-connection queue is full, the bridge applies backpressure: newer events for that client are dropped, and each drop increments the events.dropped counter. The bus itself is never slowed down; other clients continue to receive events normally.
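
The drop-newest behaviour can be sketched with a bounded queue. ConnectionQueue below is an illustrative stand-in, not the bridge's actual class:

```python
from collections import deque

class ConnectionQueue:
    """Illustrative per-connection send queue: bounded, never blocking
    the publisher, dropping the newest event when full."""

    def __init__(self, capacity: int = 2000):
        self._queue = deque()
        self._capacity = capacity
        self.dropped = 0  # mirrors the events.dropped counter

    def offer(self, event) -> bool:
        """Enqueue for this client; return False (and count the drop)
        instead of blocking when the queue is at capacity."""
        if len(self._queue) >= self._capacity:
            self.dropped += 1
            return False
        self._queue.append(event)
        return True

    def drain(self, n: int) -> list:
        """Pop up to n events for the writer task, oldest first."""
        out = []
        while self._queue and len(out) < n:
            out.append(self._queue.popleft())
        return out
```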


Backpressure and the Dead-Letter Queue

Capture Path

When a subscriber either raises an exception or exceeds the invoke timeout, the bus delivers the event to the DeadLetterQueue via DLQ.capture. Each entry is a frozen DeadLetterEntry dataclass that carries the original Event, a human-readable error message, a UTC ISO-8601 timestamp, and a retry counter.

from fcc.messaging.dlq import DeadLetterQueue, DeadLetterEntry

dlq = DeadLetterQueue()

# The bus invokes dlq.capture(event, error) internally.
# Inspection is synchronous and thread-safe.
entries = dlq.entries()
for entry in entries:
    print(entry.event.event_type.value, entry.error_message, entry.retry_count)

Inspection from the UI

The proposed StressTestPanel.tsx exposes DLQ depth as a live gauge and provides a table view of the most recent entries. The panel queries the backend through two endpoints:

Endpoint                  Response Shape                     Purpose
GET /dlq/summary          { depth, oldest_ts, newest_ts }    Aggregate metrics
GET /dlq/entries?limit=N  list of DeadLetterEntry.to_dict()  Detail table
POST /dlq/retry           { retried: int }                   Re-publish all entries
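
These endpoints can also be exercised without the panel, using only the standard library. The is_backlogged threshold below is an illustrative choice, not a shipped default:

```python
import json
from urllib.request import urlopen

def dlq_summary(base: str = "http://localhost:8765") -> dict:
    """Fetch GET /dlq/summary from a running bridge."""
    with urlopen(f"{base}/dlq/summary") as resp:
        return json.loads(resp.read())

def is_backlogged(summary: dict, threshold: int = 1000) -> bool:
    """A simple operator check: a DLQ depth above some threshold
    suggests failures are accumulating faster than retries clear them."""
    return summary["depth"] > threshold

# Against a live bridge:
# print(is_backlogged(dlq_summary()))
```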

Retry Semantics

Retrying a DLQ entry re-publishes the original Event through the bus. If the subscriber that originally failed is still unhealthy, the event lands back in the DLQ with retry_count incremented. A separate operator tool can apply an exponential backoff between retries; the core DLQ API is deliberately simple.
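
Such an operator tool might derive its delay schedule from the entry's retry_count. A minimal sketch; the function and its defaults are hypothetical, not part of the DLQ API:

```python
def backoff_delays(retry_count: int, base_s: float = 0.5,
                   cap_s: float = 30.0) -> list[float]:
    """Exponential backoff schedule for DLQ retries: the delay doubles
    after each failed attempt and is capped so a long-failing entry
    does not stall the retry sweep indefinitely."""
    return [min(cap_s, base_s * (2 ** i)) for i in range(retry_count)]
```

An entry with retry_count of 4 would wait 0.5 s, 1 s, 2 s, then 4 s between attempts; the cap keeps later waits bounded.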


Observability During the Burst

OpenTelemetry Spans

Three span names are emitted during the scenario. Each span carries attributes that make it easy to filter traces by subsystem when inspecting an incident.

Span Name          Emitted By          Key Attributes
workflow.step      action_engine       persona_id, action_type, duration_ms
event.publish      EventBus            event_type, source, subscriber_count
subscriber.invoke  Subscriber adapter  subscriber_id, event_type, status

Spans are emitted via FccTracer.start_span() and exported through the console or JSON file exporter by default. When opentelemetry-sdk is installed, spans are forwarded to the configured OTLP endpoint.
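
The span pattern can be illustrated with a self-contained stand-in. MiniTracer below is not FccTracer; it only mimics a context-manager span so the attribute flow for subscriber.invoke is visible:

```python
import time
from contextlib import contextmanager

class MiniTracer:
    """Stand-in for FccTracer, for illustration only."""

    def __init__(self):
        self.finished = []  # completed spans, oldest first

    @contextmanager
    def start_span(self, name, attributes=None):
        span = {"name": name, "attributes": dict(attributes or {})}
        start = time.perf_counter()
        try:
            yield span
        finally:
            # Record the measured duration and retire the span.
            span["attributes"]["duration_ms"] = (time.perf_counter() - start) * 1000
            self.finished.append(span)

tracer = MiniTracer()
with tracer.start_span(
    "subscriber.invoke",
    {"subscriber_id": "frontend", "event_type": "workflow.step_completed"},
) as span:
    span["attributes"]["status"] = "ok"  # set by the adapter on success
```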

Metrics to Watch

Metric                      Type       Healthy Range              Alert Threshold
events.published            Counter    rising steadily            flat (publisher stuck)
events.delivered            Counter    tracks published           divergence > 5%
events.dropped              Counter    near zero                  > 1% of published
events.dlq.size             Gauge      < 100                      > 1000 and rising
events.delivery.latency_ms  Histogram  p95 < 50 ms                p95 > 250 ms
ws.connected_clients        Gauge      matches frontend sessions  sharp drop

Trace Correlation

Every event carries a trace_id attribute that survives WebSocket serialization. A frontend consumer can cross-reference a UI-visible event with the full server-side span tree by filtering traces on this ID. This is the primary mechanism for diagnosing a slow render in the live UI during the burst.
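
Assuming the JSON file exporter writes one span object per line (check the actual export format), correlating a UI-visible trace_id with its server-side spans takes only a few lines:

```python
import json

def spans_for_trace(path: str, trace_id: str) -> list[dict]:
    """Collect every exported span belonging to one trace, preserving
    the file's emission order."""
    matches = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            span = json.loads(line)
            if span.get("trace_id") == trace_id:
                matches.append(span)
    return matches
```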


Reading the Stress-Test Panel (Design)

The panel is organized into four regions. All regions poll the backend at a 500 ms cadence while the burst is active and drop to a 2 s cadence when idle.

Region 1 -- Throughput Strip

A sparkline chart showing events.published and events.delivered counters over the last 60 seconds. A divergence between the two lines is the earliest visual indicator that a subscriber is struggling.

Region 2 -- Connection Table

Per-connection queue depth, last message age, and current state. Rows in the backpressure state are highlighted. A click on a row opens a drawer with the last 10 events sent to that specific client.

Region 3 -- DLQ Panel

Live DLQ depth, a sortable table of entries (event type, error, age, retry count), and a retry button. Retry operations are confirmed with a toast that reports the count returned by POST /dlq/retry.

Region 4 -- Span Timeline

A flame-graph-style view of the last trace captured during the burst. Spans are colour-coded by name (workflow.step, event.publish, subscriber.invoke). Hovering a span reveals its attributes and the trace ID.


Running the Scenario End-to-End

Terminal 1 -- Backend

fcc protocol ws-bridge --host 0.0.0.0 --port 8765 \
    --metrics-exporter console \
    --trace-exporter json:/tmp/fcc-spans.json

Terminal 2 -- Frontend

cd frontend
npm run dev
# Visit http://localhost:5173/stress-test  (once the panel is shipped)

Terminal 3 -- Stress Harness

python -m fcc.tools.stress_harness \
    --events-per-second 1000 \
    --duration 10 \
    --failing-subscriber-ratio 0.05

Terminal 4 -- Live DLQ Tail

curl -s http://localhost:8765/dlq/summary | jq .
# { "depth": 47, "oldest_ts": "...", "newest_ts": "..." }

Expected Outcomes

A healthy run produces the following observable outcomes:

Indicator                            Expected Value                    Meaning
Publisher tick rate                  ~1000/sec sustained               Harness not throttled
events.delivered / events.published  >= 0.99                           Subscribers keeping up
DLQ depth after 10 s burst           ~500 (with 5% failure injection)  Matches injection rate
p95 events.delivery.latency_ms       < 50 ms                           No queue buildup
Frontend framerate                   >= 30 fps                         UI rendering stays responsive
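
The ~500 DLQ expectation is just the injection arithmetic:

```python
rate = 1000           # events/sec from the harness
duration_s = 10       # --duration
failure_ratio = 0.05  # --failing-subscriber-ratio

# Every injected failure routes one event to the DLQ.
expected_dlq_depth = rate * duration_s * failure_ratio
print(expected_dlq_depth)  # 500.0
```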

An unhealthy run surfaces on the stress-test panel as one or more of: divergent publish/deliver sparklines, backpressure-highlighted connection rows, a DLQ depth that rises faster than the failure injection rate, or a span timeline dominated by long subscriber.invoke bars.


Troubleshooting

Symptom                                       Probable Cause                        Next Step
events.dropped rising on a single connection  Slow client or network saturation     Close and reconnect that client; inspect the client-side profiler
DLQ depth rising for one event type           One subscriber failing on that type   Filter DLQ.entries() by event_type and inspect error_message
subscriber.invoke spans exceed 250 ms         Subscriber doing I/O in the hot path  Move I/O off the event bus; use an async adapter
Frontend disconnects and cannot reconnect     Bridge crashed or port conflict       Check the /health endpoint and the bridge logs
Missing trace data                            OTel SDK not installed                pip install opentelemetry-sdk and restart the bridge

Tips

  • Keep the failure-injection ratio low (< 10%) during initial runs to avoid saturating the DLQ while you are still learning the panel.
  • Use the trace_id surfaced in the stress panel to jump directly to the full span tree in the trace viewer -- this is much faster than scrolling through server logs.
  • The stress harness is deterministic when seeded: pass --seed 42 to reproduce a specific failure pattern during debugging.
  • When profiling the frontend, set VITE_EVENT_BATCH_SIZE=50 so the React render loop coalesces events rather than re-rendering per event.

See also