Web Frontend -- v1.3.5.2 Addendum

This addendum extends the Web Frontend Guided Demo, the Phase 14 Addendum, and the Phase 15 Addendum with the v1.3.5.2 WebSocket stress-test and backpressure scenario. The focus here is operational behaviour under load: how the ws://localhost:8765/ws/events bridge handles a sustained 1000-events/sec burst, how failing subscribers route to the dead-letter queue, and how OpenTelemetry spans narrate the end-to-end path.


Scenario Overview

Goals

  • Exercise the WebSocket bridge at a workload roughly two orders of magnitude above the typical demo rate.
  • Verify that slow or unresponsive subscribers do not block the publish path.
  • Capture evidence -- spans, metrics, DLQ entries -- that the operator can use to diagnose a real incident.

Moving Parts

Component                 Role                            Source
EventBus                  In-process pub/sub              src/fcc/messaging/bus.py
DeadLetterQueue           Captures undeliverable events   src/fcc/messaging/dlq.py
WS bridge                 Protocol adapter                fcc protocol ws-bridge CLI
FccTracer                 Span emission                   src/fcc/observability/tracing.py
FccMetrics                Counter/gauge/histogram         src/fcc/observability/metrics.py
StressTestPanel (design)  React UI for driving the burst  frontend/src/pages/StressTestPanel.tsx (scheduled for v1.3.7 alongside the web admin UI; see ROADMAP.md v1.3.7 section)

Note on StressTestPanel.tsx: at v1.3.5.2 the panel was a design proposal; at v1.3.5.4 it was re-scoped into v1.3.7, where it ships alongside the new /admin React route and consumes the same server-side harness documented below. The server-side stress-test harness is fully functional today -- only the UI surface waits for v1.3.7, and the underlying protocol contract does not change.


Starting the Bridge

The backend WebSocket bridge is started via the fcc CLI or via the Docker backend image. Either entry point yields the same endpoint on port 8765 with an HTTP health check on /health served from the same process.

# Local development
fcc protocol ws-bridge --host 0.0.0.0 --port 8765

# Docker
docker compose up -d fcc-backend
curl http://localhost:8765/health

Once the bridge is live, the frontend's Events page connects to ws://localhost:8765/ws/events and begins receiving the default event firehose. The stress-test scenario injects additional synthetic traffic on top of this live stream.
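
Outside the browser, the same firehose can be tailed from a short Python client. The sketch below assumes the third-party websockets package and JSON-encoded text frames; the field names in parse_frame are illustrative, so check them against the actual wire format:

```python
import asyncio
import json

def parse_frame(raw: str) -> dict:
    """Decode one WebSocket text frame into a small event dict.
    The field names are illustrative, not a wire-format contract."""
    event = json.loads(raw)
    return {
        "type": event.get("event_type"),
        "source": event.get("source"),
        "trace_id": event.get("trace_id"),
    }

async def tail_events(url: str = "ws://localhost:8765/ws/events") -> None:
    # Third-party dependency, imported lazily: pip install websockets
    import websockets

    async with websockets.connect(url) as ws:
        async for raw in ws:
            print(parse_frame(raw))

# Against a live bridge:
# asyncio.run(tail_events())
```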


Driving the 1000-Events/sec Burst

Client-Side Harness

The burst is produced by a small Python harness that publishes directly to the EventBus while the bridge is connected. The harness does not need to know anything about the WebSocket layer -- the bridge observes the bus and forwards to connected clients.

import time
from fcc.messaging.bus import EventBus
from fcc.messaging.events import Event, EventType

bus = EventBus.default()

def burst(events_per_second: int = 1000, duration_s: int = 10) -> None:
    interval = 1.0 / events_per_second
    start = time.time()
    end = start + duration_s
    count = 0
    while time.time() < end:
        bus.publish(
            Event(
                event_type=EventType.WORKFLOW_STEP_COMPLETED,
                source="stress_harness",
                payload={"seq": count},
            )
        )
        count += 1
        # Deadline-based pacing: sleeping a fixed interval would let
        # publish overhead erode the target rate, so sleep only until
        # the next scheduled tick.
        remaining = start + count * interval - time.time()
        if remaining > 0:
            time.sleep(remaining)
    print(f"Published {count} events in {duration_s}s")

Server-Side Throttles

The bridge applies three throttles in front of any subscriber:

Throttle                   Default                Purpose
Per-connection send queue  2,000 events           Prevents an individual client from exhausting memory
Subscriber invoke timeout  250 ms                 Bounds blocking per subscriber
Global DLQ capacity        Unbounded (in-memory)  Captures undeliverable events for later review

When the per-connection queue is full, the bridge applies backpressure: newer events for that client are dropped, and each drop increments the events.dropped counter. The bus itself is never slowed down; other clients continue to receive events normally.
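
The drop-newest behaviour can be sketched with a bounded queue. ConnectionQueue below is an illustrative stand-in, not the bridge's actual class:

```python
from collections import deque

class ConnectionQueue:
    """Illustrative per-connection send queue: bounded, never blocking
    the publisher, dropping the newest event when full."""

    def __init__(self, capacity: int = 2000):
        self._queue = deque()
        self._capacity = capacity
        self.dropped = 0  # mirrors the events.dropped counter

    def offer(self, event) -> bool:
        """Enqueue for this client; return False (and count the drop)
        instead of blocking when the queue is at capacity."""
        if len(self._queue) >= self._capacity:
            self.dropped += 1
            return False
        self._queue.append(event)
        return True

    def drain(self, n: int) -> list:
        """Pop up to n events for the writer task, oldest first."""
        out = []
        while self._queue and len(out) < n:
            out.append(self._queue.popleft())
        return out
```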


Backpressure and the Dead-Letter Queue

Capture Path

When a subscriber either raises an exception or exceeds the invoke timeout, the bus delivers the event to the DeadLetterQueue via DLQ.capture. Each entry is a frozen DeadLetterEntry dataclass that carries the original Event, a human-readable error message, a UTC ISO-8601 timestamp, and a retry counter.

from fcc.messaging.dlq import DeadLetterQueue, DeadLetterEntry

dlq = DeadLetterQueue()

# The bus invokes dlq.capture(event, error) internally.
# Inspection is synchronous and thread-safe.
entries = dlq.entries()
for entry in entries:
    print(entry.event.event_type.value, entry.error_message, entry.retry_count)

Inspection from the UI

The proposed StressTestPanel.tsx exposes DLQ depth as a live gauge and provides a table view of the most recent entries. The panel queries the backend through two endpoints:

Endpoint                  Response Shape                     Purpose
GET /dlq/summary          { depth, oldest_ts, newest_ts }    Aggregate metrics
GET /dlq/entries?limit=N  list of DeadLetterEntry.to_dict()  Detail table
POST /dlq/retry           { retried: int }                   Re-publish all entries
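
These endpoints can also be exercised without the panel, using only the standard library. The is_backlogged threshold below is an illustrative choice, not a shipped default:

```python
import json
from urllib.request import urlopen

def dlq_summary(base: str = "http://localhost:8765") -> dict:
    """Fetch GET /dlq/summary from a running bridge."""
    with urlopen(f"{base}/dlq/summary") as resp:
        return json.loads(resp.read())

def is_backlogged(summary: dict, threshold: int = 1000) -> bool:
    """A simple operator check: a DLQ depth above some threshold
    suggests failures are accumulating faster than retries clear them."""
    return summary["depth"] > threshold

# Against a live bridge:
# print(is_backlogged(dlq_summary()))
```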

Retry Semantics

Retrying a DLQ entry re-publishes the original Event through the bus. If the subscriber that originally failed is still unhealthy, the event lands back in the DLQ with retry_count incremented. A separate operator tool can apply an exponential backoff between retries; the core DLQ API is deliberately simple.
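
Such an operator tool might derive its delay schedule from the entry's retry_count. A minimal sketch; the function and its defaults are hypothetical, not part of the DLQ API:

```python
def backoff_delays(retry_count: int, base_s: float = 0.5,
                   cap_s: float = 30.0) -> list[float]:
    """Exponential backoff schedule for DLQ retries: the delay doubles
    after each failed attempt and is capped so a long-failing entry
    does not stall the retry sweep indefinitely."""
    return [min(cap_s, base_s * (2 ** i)) for i in range(retry_count)]
```

An entry with retry_count of 4 would wait 0.5 s, 1 s, 2 s, then 4 s between attempts; the cap keeps later waits bounded.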


Observability During the Burst

OpenTelemetry Spans

Three span names are emitted during the scenario. Each span carries attributes that make it easy to filter traces by subsystem when inspecting an incident.

Span Name          Emitted By          Key Attributes
workflow.step      action_engine       persona_id, action_type, duration_ms
event.publish      EventBus            event_type, source, subscriber_count
subscriber.invoke  Subscriber adapter  subscriber_id, event_type, status

Spans are emitted via FccTracer.start_span() and exported through the console or JSON file exporter by default. When opentelemetry-sdk is installed, spans are forwarded to the configured OTLP endpoint.
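
The span pattern can be illustrated with a self-contained stand-in. MiniTracer below is not FccTracer; it only mimics a context-manager span so the attribute flow for subscriber.invoke is visible:

```python
import time
from contextlib import contextmanager

class MiniTracer:
    """Stand-in for FccTracer, for illustration only."""

    def __init__(self):
        self.finished = []  # completed spans, oldest first

    @contextmanager
    def start_span(self, name, attributes=None):
        span = {"name": name, "attributes": dict(attributes or {})}
        start = time.perf_counter()
        try:
            yield span
        finally:
            # Record the measured duration and retire the span.
            span["attributes"]["duration_ms"] = (time.perf_counter() - start) * 1000
            self.finished.append(span)

tracer = MiniTracer()
with tracer.start_span(
    "subscriber.invoke",
    {"subscriber_id": "frontend", "event_type": "workflow.step_completed"},
) as span:
    span["attributes"]["status"] = "ok"  # set by the adapter on success
```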

Metrics to Watch

Metric                      Type       Healthy Range              Alert Threshold
events.published            Counter    rising steadily            flat (publisher stuck)
events.delivered            Counter    tracks published           divergence > 5%
events.dropped              Counter    near zero                  > 1% of published
events.dlq.size             Gauge      < 100                      > 1000 and rising
events.delivery.latency_ms  Histogram  p95 < 50 ms                p95 > 250 ms
ws.connected_clients        Gauge      matches frontend sessions  sharp drop

Trace Correlation

Every event carries a trace_id attribute that survives WebSocket serialization. A frontend consumer can cross-reference a UI-visible event with the full server-side span tree by filtering traces on this ID. This is the primary mechanism for diagnosing a slow render in the live UI during the burst.
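
Assuming the JSON file exporter writes one span object per line (check the actual export format), correlating a UI-visible trace_id with its server-side spans takes only a few lines:

```python
import json

def spans_for_trace(path: str, trace_id: str) -> list[dict]:
    """Collect every exported span belonging to one trace, preserving
    the file's emission order."""
    matches = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            span = json.loads(line)
            if span.get("trace_id") == trace_id:
                matches.append(span)
    return matches
```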


Reading the Stress-Test Panel (Design)

The panel is organized into four regions. All regions poll the backend at a 500 ms cadence while the burst is active and drop to a 2 s cadence when idle.

Region 1 -- Throughput Strip

A sparkline chart showing events.published and events.delivered counters over the last 60 seconds. A divergence between the two lines is the earliest visual indicator that a subscriber is struggling.

Region 2 -- Connection Table

Per-connection queue depth, last message age, and current state. Rows in the backpressure state are highlighted. A click on a row opens a drawer with the last 10 events sent to that specific client.

Region 3 -- DLQ Panel

Live DLQ depth, a sortable table of entries (event type, error, age, retry count), and a retry button. Retry operations are confirmed with a toast that reports the count returned by POST /dlq/retry.

Region 4 -- Span Timeline

A flame-graph-style view of the last trace captured during the burst. Spans are colour-coded by name (workflow.step, event.publish, subscriber.invoke). Hovering a span reveals its attributes and the trace ID.


Running the Scenario End-to-End

Terminal 1 -- Backend

fcc protocol ws-bridge --host 0.0.0.0 --port 8765 \
    --metrics-exporter console \
    --trace-exporter json:/tmp/fcc-spans.json

Terminal 2 -- Frontend

cd frontend
npm run dev
# Visit http://localhost:5173/stress-test  (once the panel is shipped)

Terminal 3 -- Stress Harness

python -m fcc.tools.stress_harness \
    --events-per-second 1000 \
    --duration 10 \
    --failing-subscriber-ratio 0.05

Terminal 4 -- Live DLQ Tail

curl -s http://localhost:8765/dlq/summary | jq .
# { "depth": 47, "oldest_ts": "...", "newest_ts": "..." }

Expected Outcomes

A healthy run produces the following observable outcomes:

Indicator                            Expected Value                    Meaning
Publisher tick rate                  ~1000/sec sustained               Harness not throttled
events.delivered / events.published  >= 0.99                           Subscribers keeping up
DLQ depth after 10 s burst           ~500 (with 5% failure injection)  Matches injection rate
p95 events.delivery.latency_ms       < 50 ms                           No queue buildup
Frontend framerate                   >= 30 fps                         UI rendering stays responsive
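
The ~500 DLQ expectation is just the injection arithmetic:

```python
rate = 1000           # events/sec from the harness
duration_s = 10       # --duration
failure_ratio = 0.05  # --failing-subscriber-ratio

# Every injected failure routes one event to the DLQ.
expected_dlq_depth = rate * duration_s * failure_ratio
print(expected_dlq_depth)  # 500.0
```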

An unhealthy run surfaces on the stress-test panel as one or more of: divergent publish/deliver sparklines, backpressure-highlighted connection rows, a DLQ depth that rises faster than the failure injection rate, or a span timeline dominated by long subscriber.invoke bars.


Troubleshooting

Symptom                                       Probable Cause                        Next Step
events.dropped rising on a single connection  Slow client or network saturation     Close and reconnect that client; inspect the client-side profiler
DLQ depth rising for one event type           One subscriber failing on that type   Filter DLQ.entries() by event_type and inspect error_message
subscriber.invoke spans exceed 250 ms         Subscriber doing I/O in the hot path  Move I/O off the event bus; use an async adapter
Frontend disconnects and cannot reconnect     Bridge crashed or port conflict       Check the /health endpoint and the bridge logs
Missing trace data                            OTel SDK not installed                pip install opentelemetry-sdk and restart the bridge

Tips

  • Keep the failure-injection ratio low (< 10%) during initial runs to avoid saturating the DLQ while you are still learning the panel.
  • Use the trace_id surfaced in the stress panel to jump directly to the full span tree in the trace viewer -- this is much faster than scrolling through server logs.
  • The stress harness is deterministic when seeded: pass --seed 42 to reproduce a specific failure pattern during debugging.
  • When profiling the frontend, set VITE_EVENT_BATCH_SIZE=50 so the React render loop coalesces events rather than re-rendering per event.

See also