An async API can accept a request while the real work is still queued, retrying, or stuck behind
one bad message. This case keeps that distance visible: one correlation ID, explicit state,
bounded retries, DLQ movement, alarms, and runbook output tied back to the request.
The API call may succeed while downstream work stalls. A queue can hold work for minutes,
retries can happen in bursts, and one bad message can turn the system into guesswork unless
the workflow records what happened at each step.
Correlation has to survive the request, queue message, state history, and telemetry.
Retryable failures should keep the workflow alive without hiding that something went wrong.
Poison messages need a clear DLQ path, not a quiet loop.
Queue depth and age matter because they are often the first user-impact signal.
02 — Architecture
Keep intake, processing, state, and signals separate.
The local model is deliberately small. It keeps the moving parts visible: submit the work,
store state, process from the queue, emit telemetry, and ask for status later.
GET /workflows/{requestId} reads stored request state
The client submits and polls later. The API accepting the request is not the same thing as
the work being done.
03 — Failure scenarios
Five scenarios that cover the failure paths easy to miss.
Each scenario has a concrete operating habit attached. The test that locks each one in is
collapsed by default so the section reads as operations, not a wall of syntax.
One request. One correlation ID. Submit, queue, process, complete.
The standard case: a request is submitted, put on the queue, picked up by the worker, and completed. The correlation ID stays on every step, from the initial store through the queue message, state transitions, and telemetry events.
What this shows
Correlation has to be explicit at submission, not retrofitted later. The test asserts the same ID on the request, the queue message, and the final state.
Operating habit
When a request goes missing, I start from the correlation ID and follow one thread: request.submitted, worker.processing_started, state.update, completion. That path is faster than guessing which component dropped it.
Show the test that locks this in
WorkflowEngineTests.cs
Happy path
C# xUnit test for this scenario.
public sealed class WorkflowEngineTests
{
[Fact]
public void KeepsCorrelationAcrossRequestAndState()
{
var telemetry = new InMemoryTelemetry();
var engine = new WorkflowEngine(telemetry);
var submit = engine.Submit(new WorkflowRequestInput(
WorkflowType: "analytics-export",
CorrelationId: "corr-docs-happy",
Payload: new Dictionary<string, object?> { ["accountId"] = "acct-1" }));
var message = engine.GetQueue()[0];
var request = engine.GetRequest(submit.RequestId);
var history = engine.StateHistory(submit.RequestId);
Assert.Equal("corr-docs-happy", submit.CorrelationId);
Assert.Equal("corr-docs-happy", request?.CorrelationId);
Assert.Equal("corr-docs-happy", message.CorrelationId);
Assert.All(history, state => Assert.Equal("corr-docs-happy", state.CorrelationId));
var result = engine.HandleNextMessage(_ => new WorkerResult(WorkerResultKind.Success, "complete"));
Assert.Equal(WorkerResultKind.Success, result?.Kind);
Assert.Equal(WorkflowStatus.Completed, engine.GetRequest(submit.RequestId)?.Status);
}
}
A dependency fails. The request retries. The workflow survives.
A retryable downstream failure does not kill the workflow. The worker records the retry, requeues the message with a decremented attempt count, and emits worker.retry_scheduled so the retry is visible before the next attempt starts.
What this shows
The retry path stays inside the state machine. The transition is recorded so support can see that work is retrying, not stalled silently.
Operating habit
If workflow.worker_retries_total is climbing but workflow.failed_total is flat, I read that as retries doing their job, not a hang. I check the attempt counter on the state record before escalating.
Show the test that locks this in
WorkflowEngineTests.cs
Transient failure
C# xUnit test for this scenario.
public sealed class WorkflowEngineTransientTests
{
[Fact]
public void RetriesOnTransientAndCompletes()
{
var telemetry = new InMemoryTelemetry();
var engine = new WorkflowEngine(telemetry, defaultMaxAttempts: 3);
var submit = engine.Submit(new WorkflowRequestInput(
WorkflowType: "analytics-export",
CorrelationId: "corr-docs-transient",
Payload: new Dictionary<string, object?> { ["accountId"] = "acct-2" }));
engine.HandleNextMessage(_ => new WorkerResult(WorkerResultKind.RetryableFailure, "timeout"));
engine.HandleNextMessage(_ => new WorkerResult(WorkerResultKind.Success, "recovered"));
Assert.Equal(WorkflowStatus.Completed, engine.GetRequest(submit.RequestId)?.Status);
Assert.Equal(1, telemetry.CountByName("workflow.worker_retries_total", "corr-docs-transient"));
}
}
Retries exhausted. Request marked failed. DLQ has the message.
When a message fails beyond the retry limit, the worker marks the request failed, moves the message to the dead-letter queue, and attaches failure context. The workflow does not loop, and support does not start from scratch.
What this shows
A failed request and a DLQ entry are both asserted. The failure path is explicit: workflow.failed_total increments and WorkQueueDlqAlarm has a source to fire on.
Operating habit
When WorkQueueDlqAlarm fires and workflow.failed_total is above threshold, I pull the DLQ, read the failure context on the request state, and work from the correlation ID rather than the raw message.
Show the test that locks this in
WorkflowEngineTests.cs
Poison message
C# xUnit test for this scenario.
public sealed class WorkflowEnginePoisonTests
{
[Fact]
public void SendsPoisonMessagesToDlqAndMarksFailed()
{
var telemetry = new InMemoryTelemetry();
var engine = new WorkflowEngine(telemetry, defaultMaxAttempts: 1);
var submit = engine.Submit(new WorkflowRequestInput(
WorkflowType: "analytics-export",
CorrelationId: "corr-docs-poison",
Payload: new Dictionary<string, object?> { ["accountId"] = "acct-3" },
MaxAttempts: 1));
var result = engine.HandleNextMessage(_ => new WorkerResult(
WorkerResultKind.PermanentFailure,
"schema mismatch"));
Assert.Equal(WorkerResultKind.PermanentFailure, result?.Kind);
Assert.Equal(WorkflowStatus.Failed, engine.GetRequest(submit.RequestId)?.Status);
Assert.Single(engine.GetDlq());
Assert.Equal(1, telemetry.CountByName("workflow.failed_total", "corr-docs-poison"));
}
}
Queue depth and age become an operational signal, not a mystery.
When queue depth exceeds the configured threshold, the engine emits workflow.backlog_warning. When message age crosses the latency boundary, it emits workflow.backlog_age_breach. Both fire before a user complaint about slowness becomes the first indicator of a problem.
What this shows
Backlog signals use real metric names that map directly to WorkQueueAgeAlarm thresholds in the CDK stack. The test asserts both signals are emitted by the in-memory telemetry sink when the queue crosses the configured limits.
Operating habit
I treat workflow.backlog_age_breach with no matching retry spike as a depth problem, not a worker failure. I look at queue age and worker capacity before escalating to the dependency.
Show the test that locks this in
WorkflowEngineTests.cs
Backlog / latency
C# xUnit test for this scenario.
public sealed class WorkflowEngineBacklogTests
{
[Fact]
public void EmitsBacklogSignalOnQueueLatency()
{
var telemetry = new InMemoryTelemetry();
var engine = new WorkflowEngine(telemetry, backlogWarningAgeMs: 100, backlogThreshold: 3);
for (var i = 0; i < 4; i += 1)
{
engine.Submit(new WorkflowRequestInput(
WorkflowType: "analytics-export",
CorrelationId: $"corr-docs-backlog-{i + 1}",
IdempotencyKey: $"idem-docs-backlog-{i + 1}",
Payload: new Dictionary<string, object?> { ["accountId"] = $"acct-{i + 1}" }));
}
engine.RunBacklogChecks(DateTime.UtcNow.AddMinutes(5));
Assert.Equal(1, telemetry.CountByName("workflow.backlog_age_breach", "corr-docs-backlog-1"));
Assert.Equal(1, telemetry.CountByName("workflow.backlog_warning", "corr-docs-backlog-1"));
}
}
workflow-scenarios.test.ts
Backlog / latency
TypeScript test covering the same behavior.
describe("backlog signal", () => {
it("emits backlog metric and log once queue is deep and aged", () => {
const telemetry = new InMemoryTelemetry();
const engine = new WorkflowEngine(telemetry, { backlogWarningAgeMs: 100, queueBacklogThreshold: 3 });
for (let i = 0; i < 4; i += 1) {
engine.submit({
workflowType: "analytics-export",
correlationId: `corr-backlog-${i + 1}`,
idempotencyKey: `idem-${i + 1}`,
payload: { accountId: `${i + 1}` }
});
}
engine.runBacklogChecks(new Date(Date.now() + 5_000));
expect(telemetry.countByName("workflow.backlog_age_breach", "corr-backlog-1")).toBe(1);
expect(telemetry.countByName("workflow.backlog_warning", "corr-backlog-1")).toBe(1);
});
});
The second submit with the same key returns the first request.
A client retry or duplicate event with the same idempotency key returns the existing request instead of creating duplicate queue work. The reuse boundary, shown in the implementation section, is what keeps the queue from getting a second message.
What this shows
Idempotency is enforced at the intake boundary. The test asserts the same requestId is returned and the queue still has exactly one message.
Operating habit
I read request.idempotent_reused as a normal signal when clients retry. If it spikes without a matching workflow.submit_total increase, I check whether upstream retry logic is too aggressive before touching the workflow.
Show the test that locks this in
WorkflowEngineTests.cs
Idempotency
C# xUnit test for this scenario.
public sealed class WorkflowEngineIdempotencyTests
{
[Fact]
public void ReusesRequestForDuplicateIdempotencyKey()
{
var telemetry = new InMemoryTelemetry();
var engine = new WorkflowEngine(telemetry);
var first = engine.Submit(new WorkflowRequestInput(
WorkflowType: "analytics-export",
CorrelationId: "corr-docs-idem-1",
IdempotencyKey: "idem-reuse",
Payload: new Dictionary<string, object?> { ["accountId"] = "acct-4" }));
var second = engine.Submit(new WorkflowRequestInput(
WorkflowType: "analytics-export",
CorrelationId: "corr-docs-idem-2",
IdempotencyKey: "idem-reuse",
Payload: new Dictionary<string, object?> { ["accountId"] = "acct-4" }));
Assert.Equal(first.RequestId, second.RequestId);
Assert.Equal("corr-docs-idem-1", second.CorrelationId);
Assert.Equal(1, engine.GetQueue().Count);
}
}
workflow-scenarios.test.ts
Idempotency
TypeScript test covering the same behavior.
describe("idempotent submit", () => {
it("reuses the same request for repeated idempotency keys", () => {
const telemetry = new InMemoryTelemetry();
const engine = new WorkflowEngine(telemetry);
const first = engine.submit({
workflowType: "analytics-export",
correlationId: "corr-idem-1",
idempotencyKey: "idem-repeat",
payload: { accountId: "acct-4" }
});
const second = engine.submit({
workflowType: "analytics-export",
correlationId: "corr-idem-2",
idempotencyKey: "idem-repeat",
payload: { accountId: "acct-4" }
});
expect(second.reused).toBe(true);
expect(second.requestId).toBe(first.requestId);
expect(engine.getQueue()).toHaveLength(1);
});
});
04 — Implementation
Keep the workflow behavior testable before adding AWS clients.
The code keeps state transitions away from AWS clients. That makes the behavior easy to test
locally, then the CDK stack shows how the same model would sit inside AWS.
workflow-engine.ts
State transitions stay in the workflow model.
The AWS boundary can change later; the state transition rules remain testable locally.
The runbook output keeps the failure state, evidence, and next checks together.
public runbook(requestId: string): Runbook | undefined {
const request = this.requests.get(requestId);
if (!request) return undefined;
return {
requestId,
correlationId: request.correlationId,
state: request.status,
summary: request.status === "failed"
? "Failure path observed. Check DLQ, upstream dependency, and downstream error detail."
: "Workflow still in progress. Check state history and queue age until it advances.",
evidence: [
{
correlationId: request.correlationId,
requestId,
scenario: "workflow.failure",
nextSteps: [
"Confirm state in request store.",
"Check the DLQ for this correlation ID.",
"Re-run only after the downstream dependency is healthy."
]
}
],
dlqSize: this.dlq.length
};
}
05 — AWS architecture and IAM boundaries
CDK defines the AWS boundary around the same workflow.
Only the infrastructure layer is TypeScript CDK. The C# path stays local in this version.
The CDK stack shows where the local model would sit in AWS, with IAM scoped per Lambda role.
See observability-workflow-stack.ts for the full construct.
Signals are useful only if they help narrow the failure.
Signals are emitted through an in-memory telemetry sink in local runs and tests. They are not
real production telemetry. The names and shapes are what you would wire to CloudWatch Metrics
and structured logs in a deployed version.
Metric nameWhat it tells an operator
workflow.submit_total
Total accepted requests. A flatline here while the client is active usually means the API boundary is unhealthy.
workflow.worker_retries_total
Retry pressure on the worker. Rising without matching completions means a dependency is struggling.
workflow.failed_total
Requests that exhausted retries and moved to the DLQ. Each increment should produce a runbook entry.
workflow.backlog_age_breach
A message waited longer than the configured age threshold. Maps to WorkQueueAgeAlarm in CDK.
workflow.backlog_warning
Queue depth crossed the warning threshold. Often the first signal before latency reaches users.
Trace and log event names:
request.submitted · request.correlation · worker.processing_started · worker.retry_scheduled · state.update · request.idempotent_reused.
Every event carries the correlation ID so a support thread starts with a single value.
Runbook output shape
Failed requests produce a runbook entry tied to the correlation ID. Support does not start
from a blank page.
WorkflowEngine.cs
Give failed requests a runbook shape.
The runbook output keeps failure state, evidence, and next checks together under the same correlation ID.
public Runbook BuildRunbook(WorkflowEngine engine, string requestId)
{
return engine.BuildRunbook(requestId);
}
workflow-engine.ts
Give failed requests a runbook shape.
The TypeScript runbook uses the same correlation ID thread: submit, queue message, state transitions, telemetry, runbook.
public runbook(requestId: string): Runbook | undefined {
const request = this.requests.get(requestId);
if (!request) return undefined;
return {
requestId,
correlationId: request.correlationId,
state: request.status,
summary: request.status === "failed"
? "Failure path observed. Check DLQ, upstream dependency, and downstream error detail."
: "Workflow still in progress. Check state history and queue age until it advances.",
evidence: [
{
correlationId: request.correlationId,
requestId,
scenario: "workflow.failure",
nextSteps: [
"Confirm state in request store.",
"Check the DLQ for this correlation ID.",
"Re-run only after the downstream dependency is healthy."
]
}
],
dlqSize: this.dlq.length
};
}
07 — Tests and checks
Tests keep the operational story honest.
Five scenarios, two runtimes. The asserted signals use the same metric names that would appear
in CloudWatch. Run with: npm test, npm run build, npm run cdk:test, npm run synth, plus dotnet test for the C# path.
ScenarioBehaviorAsserted signalFiles
Happy path
Submit, queue, process, complete; correlation ID continuous
What I chose, what I left out, and what would come next.
01
Local scope first
The implementation focuses on behavior and operating signals before adding production concerns such as identity, secret rotation, and cost controls.
02
Correlation-first debugging
The model keeps a stable correlation ID across request, queue, telemetry, and state transitions. That makes support work easier, but it also means event fields need discipline.
03
Signal clarity vs alert noise
Backlog and retry signals are explicit, with thresholds tied to user impact. The tradeoff is that dashboards need tuning before noisy conditions make the alerts easy to ignore.
Intentionally absent
No live AWS deployment in this version. Confidence comes from local workflow behavior, CDK synthesis, and CDK assertion tests.
No production identity layer. The example stays focused on async workflow behavior and failure mechanics.
Queue visibility and worker parallelism use representative defaults, not a full load profile.
Production path
Identity and auth. Add an identity provider and gateway authorizer before the submit function; scope per-tenant.
Secret rotation and cost controls. Wire AWS Secrets Manager rotation and set per-function concurrency limits and SQS batch size.
CloudWatch wiring. Connect the metric names to real CloudWatch EMF emissions; tune WorkflowObservabilityDashboard thresholds against real traffic.
Load and concurrency profile. Run a load profile to find the queue depth and visibility timeout that keeps latency predictable without over-provisioning.