Web Frontend -- v1.3.5.2 Addendum¶
This addendum extends the Web Frontend Guided Demo,
the Phase 14 Addendum, and the
Phase 15 Addendum with the
v1.3.5.2 WebSocket stress-test and backpressure scenario. The focus here
is operational behaviour under load: how the ws://localhost:8765/ws/events
bridge handles a sustained 1000-events/sec burst, how failing subscribers
route to the dead-letter queue, and how OpenTelemetry spans narrate the
end-to-end path.
Scenario Overview¶
Goals¶
- Exercise the WebSocket bridge at a workload roughly two orders of magnitude above the typical demo rate.
- Verify that slow or unresponsive subscribers do not block the publish path.
- Capture evidence -- spans, metrics, DLQ entries -- that the operator can use to diagnose a real incident.
Moving Parts¶
| Component | Role | Source |
|---|---|---|
| EventBus | In-process pub/sub | src/fcc/messaging/bus.py |
| DeadLetterQueue | Captures undeliverable events | src/fcc/messaging/dlq.py |
| WS bridge | Protocol adapter | fcc protocol ws-bridge CLI |
| FccTracer | Span emission | src/fcc/observability/tracing.py |
| FccMetrics | Counter/gauge/histogram | src/fcc/observability/metrics.py |
| StressTestPanel (design) | React UI for driving the burst | frontend/src/pages/StressTestPanel.tsx (scheduled for v1.3.7 alongside the web admin UI; see ROADMAP.md v1.3.7 section) |
Note on StressTestPanel.tsx: at v1.3.5.2 the panel was a design
proposal; at v1.3.5.4 it was re-scoped into v1.3.7, where it ships
together with the new /admin React route and consumes the same
server-side harness documented below. The server-side stress-test
harness is fully functional today -- the UI surface arrives in v1.3.7
without breaking the underlying protocol contract.
Starting the Bridge¶
The backend WebSocket bridge is started via the fcc CLI or via the
Docker backend image. Either entry point yields the same endpoint on
port 8765 with an HTTP health check on /health served from the same
process.
```bash
# Local development
fcc protocol ws-bridge --host 0.0.0.0 --port 8765

# Docker
docker compose up -d fcc-backend
curl http://localhost:8765/health
```
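Before driving load it is convenient to block until the bridge reports healthy. A minimal sketch using only the Python standard library -- the endpoint matches the health check above, while the helper name and timing values are our own:

```python
import time
import urllib.request

def wait_for_bridge(url: str = "http://localhost:8765/health", timeout_s: float = 30.0) -> None:
    """Poll the bridge health endpoint until it answers 200, or give up."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass  # bridge not up yet; retry shortly
        time.sleep(0.5)
    raise RuntimeError(f"bridge not healthy after {timeout_s}s")
```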
Once the bridge is live, the frontend's Events page connects to
ws://localhost:8765/ws/events and begins receiving the default event
firehose. The stress-test scenario injects additional synthetic
traffic on top of this live stream.
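The same endpoint can also be consumed outside the browser for scripted checks. A minimal sketch, assuming each event arrives as a JSON text frame with event_type and source fields, and using the third-party websockets package (an assumption, not a documented project dependency):

```python
import asyncio
import json

import websockets  # pip install websockets

async def consume(url: str = "ws://localhost:8765/ws/events") -> None:
    # Print a one-line summary of every event the bridge forwards.
    async with websockets.connect(url) as ws:
        async for frame in ws:
            event = json.loads(frame)
            print(event.get("event_type"), event.get("source"))

asyncio.run(consume())
```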
Driving the 1000-Events/sec Burst¶
Client-Side Harness¶
The burst is produced by a small Python harness that publishes directly
to the EventBus while the bridge is connected. The harness does not
need to know anything about the WebSocket layer -- the bridge observes
the bus and forwards to connected clients.
```python
import time

from fcc.messaging.bus import EventBus
from fcc.messaging.events import Event, EventType

bus = EventBus.default()

def burst(events_per_second: int = 1000, duration_s: int = 10) -> None:
    """Publish synthetic workflow events at a fixed rate for duration_s seconds."""
    interval = 1.0 / events_per_second
    end = time.time() + duration_s
    count = 0
    while time.time() < end:
        bus.publish(
            Event(
                event_type=EventType.WORKFLOW_STEP_COMPLETED,
                source="stress_harness",
                payload={"seq": count},
            )
        )
        count += 1
        # time.sleep() granularity means the achieved rate can land slightly
        # under the target; the final print shows what was actually published.
        time.sleep(interval)
    print(f"Published {count} events in {duration_s}s")
```
Server-Side Throttles¶
The bridge applies three throttles in front of any subscriber:
| Throttle | Default | Purpose |
|---|---|---|
| Per-connection send queue | 2,000 events | Prevents an individual client from exhausting memory |
| Subscriber invoke timeout | 250 ms | Bounded blocking per subscriber |
| Global DLQ capacity | Unbounded (in-memory) | Captures undeliverable events for later review |
When the per-connection queue is full, the bridge applies
back-pressure: newer events are dropped for that client and a single
events.dropped counter increment is recorded. The bus itself is not
slowed down; other clients continue to receive events normally.
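The drop-newest behaviour can be illustrated with a short sketch. This is not the bridge's actual implementation -- just the policy described above, assuming one bounded queue per connection and a per-connection drop counter (all names here are illustrative):

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class ConnectionState:
    """Per-connection send queue with a drop-newest overflow policy."""
    queue: asyncio.Queue = field(default_factory=lambda: asyncio.Queue(maxsize=2000))
    dropped: int = 0  # feeds the events.dropped counter for this connection

def offer(conn: ConnectionState, frame: str) -> bool:
    """Enqueue a frame for one client; discard it if that client's queue is full."""
    try:
        conn.queue.put_nowait(frame)
        return True
    except asyncio.QueueFull:
        # Backpressure: only this client loses the frame; the bus is never blocked.
        conn.dropped += 1
        return False
```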
Backpressure and the Dead-Letter Queue¶
Capture Path¶
When a subscriber either raises an exception or exceeds the invoke
timeout, the bus delivers the event to the DeadLetterQueue via
DLQ.capture. Each entry is a frozen DeadLetterEntry dataclass
that carries the original Event, a human-readable error message, a
UTC ISO-8601 timestamp, and a retry counter.
```python
from fcc.messaging.dlq import DeadLetterQueue, DeadLetterEntry

dlq = DeadLetterQueue()

# The bus captures failed deliveries into the DLQ internally;
# inspection is synchronous and thread-safe.
entries = dlq.entries()
for entry in entries:
    print(entry.event.event_type.value, entry.error_message, entry.retry_count)
```
Inspection from the UI¶
The proposed StressTestPanel.tsx exposes DLQ depth as a live gauge and
provides a table view of the most recent entries. The panel queries the
backend through three endpoints:
| Endpoint | Response | Purpose |
|---|---|---|
| GET /dlq/summary | { depth, oldest_ts, newest_ts } | Aggregate metrics |
| GET /dlq/entries?limit=N | list of DeadLetterEntry.to_dict() | Detail table |
| POST /dlq/retry | { retried: int } | Re-publishes all entries |
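Until the panel ships, the same read path can be exercised from a script. A minimal sketch against the two read endpoints above, using only the standard library (the base URL and limit are illustrative):

```python
import json
import urllib.request

BASE = "http://localhost:8765"

def fetch_json(path: str):
    """GET a backend endpoint and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE}{path}", timeout=5) as resp:
        return json.load(resp)

summary = fetch_json("/dlq/summary")           # {"depth": ..., "oldest_ts": ..., "newest_ts": ...}
entries = fetch_json("/dlq/entries?limit=20")  # list of DeadLetterEntry.to_dict()
print(f"DLQ depth {summary['depth']}, showing {len(entries)} most recent entries")
```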
Retry Semantics¶
Retrying a DLQ entry re-publishes the original Event through the
bus. If the subscriber that originally failed is still unhealthy, the
event lands back in the DLQ with retry_count incremented. A separate
operator tool can apply an exponential backoff between retries; the
core DLQ API is deliberately simple.
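Such an operator tool can be as small as a loop around POST /dlq/retry and GET /dlq/summary. A sketch of the backoff policy described above -- the endpoints come from the table; the attempt count and delays are arbitrary choices:

```python
import json
import time
import urllib.request

def retry_with_backoff(base_url: str = "http://localhost:8765", max_attempts: int = 5) -> None:
    """Re-publish DLQ entries, backing off exponentially while the queue keeps refilling."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        req = urllib.request.Request(f"{base_url}/dlq/retry", method="POST")
        with urllib.request.urlopen(req, timeout=10) as resp:
            retried = json.load(resp)["retried"]
        with urllib.request.urlopen(f"{base_url}/dlq/summary", timeout=5) as resp:
            depth = json.load(resp)["depth"]
        print(f"attempt {attempt}: retried {retried}, DLQ depth now {depth}")
        if depth == 0:
            return
        time.sleep(delay)
        delay *= 2  # exponential backoff between retry sweeps
```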
Observability During the Burst¶
OpenTelemetry Spans¶
Three span names are emitted during the scenario. Each span carries attributes that make it easy to filter traces by subsystem when inspecting an incident.
| Span Name | Emitted By | Key Attributes |
|---|---|---|
| workflow.step | action_engine | persona_id, action_type, duration_ms |
| event.publish | EventBus | event_type, source, subscriber_count |
| subscriber.invoke | Subscriber adapter | subscriber_id, event_type, status |
Spans are emitted via FccTracer.start_span() and exported through the
console or JSON file exporter by default. When opentelemetry-sdk is
installed, spans are forwarded to the configured OTLP endpoint.
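For orientation, this is the shape of a subscriber.invoke span expressed directly with the standard OpenTelemetry API rather than FccTracer; the attribute values are illustrative, and FccTracer wraps the same concepts:

```python
from opentelemetry import trace

tracer = trace.get_tracer("fcc.demo")

with tracer.start_as_current_span("subscriber.invoke") as span:
    span.set_attribute("subscriber_id", "ui_forwarder")        # illustrative subscriber name
    span.set_attribute("event_type", "workflow_step_completed")  # illustrative event type value
    # ... subscriber work happens here ...
    span.set_attribute("status", "ok")
```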
Metrics to Watch¶
| Metric | Type | Healthy Range | Alert Threshold |
|---|---|---|---|
| events.published | Counter | rising steadily | flat (publisher stuck) |
| events.delivered | Counter | tracks published | divergence > 5% |
| events.dropped | Counter | near zero | > 1% of published |
| events.dlq.size | Gauge | < 100 | > 1000 and rising |
| events.delivery.latency_ms | Histogram | p95 < 50 ms | p95 > 250 ms |
| ws.connected_clients | Gauge | matches frontend sessions | sharp drop |
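The thresholds in the table translate directly into a scripted check. The counter snapshots are read from whichever exporter is configured; the helper below is illustrative and not part of FccMetrics:

```python
def delivery_alerts(published: int, delivered: int, dropped: int) -> list[str]:
    """Return alert messages for counter snapshots that breach the thresholds above."""
    alerts: list[str] = []
    if published == 0:
        return ["publisher stuck: events.published is not rising"]
    if (published - delivered) / published > 0.05:
        alerts.append("events.delivered diverges from events.published by more than 5%")
    if dropped / published > 0.01:
        alerts.append("events.dropped exceeds 1% of events.published")
    return alerts

# Example counter snapshot taken mid-burst; an empty list means no alert fires.
print(delivery_alerts(published=10_000, delivered=9_950, dropped=20))
```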
Trace Correlation¶
Every event carries a trace_id attribute that survives WebSocket
serialization. A frontend consumer can cross-reference a UI-visible
event with the full server-side span tree by filtering traces on this
ID. This is the primary mechanism for diagnosing a slow render in the
live UI during the burst.
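When the JSON file exporter is in use (as in the end-to-end run below, which writes /tmp/fcc-spans.json), this cross-reference can be scripted. The sketch assumes the exporter emits one JSON span object per line with a trace_id field -- adjust the parsing to the actual export format:

```python
import json

def spans_for_trace(path: str, trace_id: str) -> list[dict]:
    """Collect every exported span belonging to one trace."""
    matches = []
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue
            span = json.loads(line)
            if span.get("trace_id") == trace_id:
                matches.append(span)
    return matches

# spans = spans_for_trace("/tmp/fcc-spans.json", "<trace_id copied from the UI>")
```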
Reading the Stress-Test Panel (Design)¶
The panel is organized into four regions. All regions poll the backend at a 500 ms cadence while the burst is active and drop to a 2 s cadence when idle.
Region 1 -- Throughput Strip¶
A sparkline chart showing events.published and events.delivered
counters over the last 60 seconds. A divergence between the two lines
is the earliest visual indicator that a subscriber is struggling.
Region 2 -- Connection Table¶
Per-connection queue depth, last message age, and current state. Rows
in the backpressure state are highlighted. A click on a row opens a
drawer with the last 10 events sent to that specific client.
Region 3 -- DLQ Panel¶
Live DLQ depth, a sortable table of entries (event type, error, age,
retry count), and a retry button. Retry operations are confirmed with
a toast that reports the count returned by POST /dlq/retry.
Region 4 -- Span Timeline¶
A flame-graph-style view of the last trace captured during the burst.
Spans are colour-coded by name (workflow.step, event.publish,
subscriber.invoke). Hovering a span reveals its attributes and the
trace ID.
Running the Scenario End-to-End¶
Terminal 1 -- Backend¶
```bash
fcc protocol ws-bridge --host 0.0.0.0 --port 8765 \
  --metrics-exporter console \
  --trace-exporter json:/tmp/fcc-spans.json
```
Terminal 2 -- Frontend¶
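Start the frontend dev server so the Events page can connect to ws://localhost:8765/ws/events. The exact command depends on your checkout; for the Vite-based frontend this is typically npm run dev (or the project's equivalent script) run from the frontend/ directory.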
Terminal 3 -- Stress Harness¶
```bash
python -m fcc.tools.stress_harness \
  --events-per-second 1000 \
  --duration 10 \
  --failing-subscriber-ratio 0.05
```
Terminal 4 -- Live DLQ Tail¶
```bash
curl -s http://localhost:8765/dlq/summary | jq .
# { "depth": 47, "oldest_ts": "...", "newest_ts": "..." }
```
Expected Outcomes¶
A healthy run produces the following observable outcomes:
| Indicator | Expected Value | Meaning |
|---|---|---|
| Publisher tick rate | ~1000/sec sustained | Harness not throttled |
| events.delivered / events.published | >= 0.99 | Subscribers keeping up |
| DLQ depth after 10 s burst | ~500 (with 5% failure injection) | Matches injection rate |
| p95 events.delivery.latency_ms | < 50 ms | No queue buildup |
| Frontend framerate | >= 30 fps | UI remains responsive |

The ~500 DLQ figure follows directly from the injection parameters: 1000 events/sec × 10 s × 5% = 500 captured entries.
An unhealthy run surfaces on the stress-test panel as one or more of:
divergent publish/deliver sparklines, backpressure-highlighted
connection rows, a DLQ depth that rises faster than the failure
injection rate, or a span timeline dominated by long
subscriber.invoke bars.
Troubleshooting¶
| Symptom | Probable Cause | Next Step |
|---|---|---|
| events.dropped rising on a single connection | Slow client, network saturation | Close and reconnect that client; inspect the client-side profiler |
| DLQ depth rising for one event type | One subscriber failing on that event type | Filter DLQ.entries() by event_type and inspect error_message |
| subscriber.invoke spans exceed 250 ms | Subscriber is doing I/O in the hot path | Move I/O off the event bus; use an async adapter |
| Frontend disconnects and cannot reconnect | Bridge crashed or port conflict | Check the /health endpoint and bridge logs |
| Missing trace data | OTel SDK not installed | pip install opentelemetry-sdk and restart the bridge |
Tips¶
- Keep the failure-injection ratio low (< 10%) during initial runs to avoid saturating the DLQ while you are still learning the panel.
- Use the trace_id surfaced in the stress panel to jump directly to the full span tree in the trace viewer -- this is much faster than scrolling through server logs.
- The stress harness is deterministic when seeded: pass --seed 42 to reproduce a specific failure pattern during debugging.
- When profiling the frontend, set VITE_EVENT_BATCH_SIZE=50 so the React render loop coalesces events rather than re-rendering per event.
See also¶
- Web Frontend Guided Demo -- Base demo and protocol overview
- Web Frontend Phase 15 Addendum -- Object model, compliance heatmap, federation
- Messaging Phase 15 Addendum -- Priority, DLQ, and circuit-breaker walkthrough
- Full-Stack Ecosystem v1.3.5.2 Addendum -- End-to-end wiring
- Event Bus Pub/Sub Sequence Diagram -- Publish path narrative
- Plugin Hierarchy Class Diagram -- EventSubscriberPlugin section