AWS serverless operations case study

Operating an event-driven workflow under failure.

An async API can accept a request while the real work is still queued, retrying, or stuck behind one bad message. This case keeps that distance visible: one correlation ID, explicit state, bounded retries, DLQ movement, alarms, and runbook output tied back to the request.


Problem: Async work can fail after intake
Code paths: C# + TypeScript + CDK
Operational intent: Find and explain failures
Scope: Local model + CDK synth, no live deploy
Validation: Behavior tests + CDK assertions

Source files in the companion repository

TypeScript + CDK

TypeScript workflow domain (src/workflow-engine.ts): Workflow state transitions, queue messages, failure branches, and runbook output.
TypeScript scenario and telemetry runner tests (test/workflow-scenarios.test.ts): Behavior tests for correlation, retry, DLQ, backlog, and idempotency.
CDK architecture (iac/lib/observability-workflow-stack.ts): API Gateway, Lambda roles, DynamoDB, SQS/DLQ, alarms, dashboard, and scoped permissions.
CDK assertions (iac/test/observability-workflow-stack.test.ts): Infrastructure tests for API, compute, queues, DLQ policy, alarms, dashboard, and IAM boundaries.

C#/.NET

C# workflow domain (csharp/WorkflowObservabilityWorkflow/WorkflowEngine.cs): The same workflow model in .NET, with in-memory telemetry and local scenarios.
C# scenario tests (csharp/WorkflowObservabilityWorkflow.Tests/WorkflowEngineTests.cs): xUnit coverage for retry, DLQ, backlog, idempotency, and correlation continuity.

Problem
Architecture
Failure scenarios
Implementation
AWS architecture
Observability
Tests
Tradeoffs

01 — Problem

The hard part is not accepting the request.

The API call may succeed while downstream work stalls. A queue can hold work for minutes, retries can happen in bursts, and one bad message can turn the system into guesswork unless the workflow records what happened at each step.

Correlation has to survive the request, queue message, state history, and telemetry.
Retryable failures should keep the workflow alive without hiding that something went wrong.
Poison messages need a clear DLQ path, not a quiet loop.
Queue depth and age matter because they are often the first user-impact signal.

02 — Architecture

Keep intake, processing, state, and signals separate.

The local model is deliberately small. It keeps the moving parts visible: submit the work, store state, process from the queue, emit telemetry, and ask for status later.

POST /workflows: API boundary

queued: state stored + queue message

processing: worker picks up message

completed: state closed, transition recorded

retryable error: requeued, attempt incremented

exhausted → DLQ + failed: WorkQueueDlq, runbook attached

Out of band

GET /workflows/{requestId} reads stored request state

The client submits and polls later. The API accepting the request is not the same thing as the work being done.
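The flow above can be read as a small state machine. The following is a minimal sketch of that reading, not the repository's code: the status names come from this case study, while the transition table and the `canTransition` helper are assumptions added for illustration.

```typescript
// Status names follow the case study; the table itself is an assumed reading of the flow.
type WorkflowStatus = "submitted" | "queued" | "processing" | "completed" | "failed";

// Each status maps to the statuses it may move to next.
const allowedTransitions: Record<WorkflowStatus, readonly WorkflowStatus[]> = {
  submitted: ["queued"],
  queued: ["processing"],
  processing: ["completed", "queued", "failed"], // success, retry, or DLQ
  completed: [], // terminal
  failed: [] // terminal
};

// Hypothetical guard: true when the move from one status to another is in the table.
function canTransition(from: WorkflowStatus, to: WorkflowStatus): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}
```

Keeping the table as data makes an illegal move, such as completed back to queued, something a test can reject without touching any AWS client.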

03 — Failure scenarios

Five scenarios that cover the failure paths easy to miss.

Each scenario has a concrete operating habit attached, followed by the test that locks it in.

Happy path: Submit to completed, correlation intact.
Transient failure: Retryable error, bounded attempts, workflow stays alive.
Poison message: Permanent failure lands in the DLQ with context.
Backlog / latency: Queue age emits a signal before users start complaining.
Idempotency: Repeated submit returns the existing request, no duplicate work.

One request. One correlation ID. Submit, queue, process, complete.

The standard case: a request is submitted, put on the queue, picked up by the worker, and completed. The correlation ID stays on every step, from the initial store through the queue message, state transitions, and telemetry events.

What this shows

Correlation has to be explicit at submission, not retrofitted later. The test asserts the same ID on the request, the queue message, and the final state.

Operating habit

When a request goes missing, I start from the correlation ID and follow one thread: request.submitted, worker.processing_started, state.update, completion. That path is faster than guessing which component dropped it.
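A rough sketch of that habit as code, assuming telemetry events are available as plain records. The `TelemetryEvent` shape and both helper names are hypothetical; only the event names come from this case study.

```typescript
// Hypothetical event shape; the field names are assumptions for illustration.
interface TelemetryEvent {
  name: string;
  correlationId: string;
  at: number; // epoch milliseconds
}

// Return the time-ordered event names recorded for one correlation ID.
function threadFor(events: readonly TelemetryEvent[], correlationId: string): string[] {
  return events
    .filter((event) => event.correlationId === correlationId)
    .sort((a, b) => a.at - b.at)
    .map((event) => event.name);
}

// Report the first expected step that never shows up, in order, in the thread.
function firstMissingStep(thread: readonly string[], expected: readonly string[]): string | undefined {
  let cursor = 0;
  for (const step of expected) {
    const index = thread.indexOf(step, cursor);
    if (index === -1) return step;
    cursor = index + 1;
  }
  return undefined;
}
```

Given a thread that stops after worker.processing_started, `firstMissingStep` returns the next expected step, which points at the component to inspect first.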

WorkflowEngineTests.cs

Happy path

C# xUnit test for this scenario.

public sealed class WorkflowEngineTests
{
    [Fact]
    public void KeepsCorrelationAcrossRequestAndState()
    {
        var telemetry = new InMemoryTelemetry();
        var engine = new WorkflowEngine(telemetry);

        var submit = engine.Submit(new WorkflowRequestInput(
            WorkflowType: "analytics-export",
            CorrelationId: "corr-docs-happy",
            Payload: new Dictionary<string, object?> { ["accountId"] = "acct-1" }));

        var message = engine.GetQueue()[0];
        var request = engine.GetRequest(submit.RequestId);
        var history = engine.StateHistory(submit.RequestId);

        Assert.Equal("corr-docs-happy", submit.CorrelationId);
        Assert.Equal("corr-docs-happy", request?.CorrelationId);
        Assert.Equal("corr-docs-happy", message.CorrelationId);
        Assert.All(history, state => Assert.Equal("corr-docs-happy", state.CorrelationId));

        var result = engine.HandleNextMessage(_ => new WorkerResult(WorkerResultKind.Success, "complete"));
        Assert.Equal(WorkerResultKind.Success, result?.Kind);
        Assert.Equal(WorkflowStatus.Completed, engine.GetRequest(submit.RequestId)?.Status);
    }
}

workflow-scenarios.test.ts

Happy path

TypeScript test covering the same behavior.

describe("happy path", () => {
  it("keeps correlationId through submit, queue, and completion", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry, { defaultMaxAttempts: 3 });

    const submitResult = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-happy",
      payload: { accountId: "acct-1" }
    });

    expect(submitResult.status).toBe("queued");
    expect(submitResult.correlationId).toBe("corr-happy");

    const result = engine.handleNextMessage(() => ({
      kind: "success",
      detail: "completed"
    }));
    const finalRequest = engine.getRequest(submitResult.requestId);

    expect(result?.kind).toBe("success");
    expect(finalRequest?.status).toBe("completed");
    expect(finalRequest?.correlationId).toBe("corr-happy");
    expect(engine.getDlq()).toHaveLength(0);
  });
});

A dependency fails. The request retries. The workflow survives.

A retryable downstream failure does not kill the workflow. The worker records the retry, requeues the message with an incremented attempt count, and emits worker.retry_scheduled so the retry is visible before the next attempt starts.

What this shows

The retry path stays inside the state machine. The transition is recorded so support can see that work is retrying, not stalled silently.

Operating habit

If workflow.worker_retries_total is climbing but workflow.failed_total is flat, I read that as retries doing their job, not a hang. I check the attempt counter on the state record before escalating.
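The bounded-retry rule can be isolated as a pure function. This sketch mirrors the condition used by handleNextMessage in the implementation section (`attempts + 1 < maxAttempts`); the function name and the `NextAction` labels are hypothetical.

```typescript
type ResultKind = "success" | "retryableFailure" | "permanentFailure";
type NextAction = "complete" | "requeue" | "deadLetter";

// attempts = attempts already recorded on the message before the current one.
function decideNextAction(kind: ResultKind, attempts: number, maxAttempts: number): NextAction {
  if (kind === "success") return "complete";
  // The current attempt counts as attempts + 1; requeue only while that stays under the limit.
  if (kind === "retryableFailure" && attempts + 1 < maxAttempts) return "requeue";
  // Permanent failures and exhausted retries both leave the queue.
  return "deadLetter";
}
```

With maxAttempts set to 3, the handler runs at most three times for one request: the first two retryable failures requeue, and a third goes to the DLQ.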

WorkflowEngineTests.cs

Transient failure

C# xUnit test for this scenario.

public sealed class WorkflowEngineTransientTests
{
    [Fact]
    public void RetriesOnTransientAndCompletes()
    {
        var telemetry = new InMemoryTelemetry();
        var engine = new WorkflowEngine(telemetry, defaultMaxAttempts: 3);

        var submit = engine.Submit(new WorkflowRequestInput(
            WorkflowType: "analytics-export",
            CorrelationId: "corr-docs-transient",
            Payload: new Dictionary<string, object?> { ["accountId"] = "acct-2" }));

        engine.HandleNextMessage(_ => new WorkerResult(WorkerResultKind.RetryableFailure, "timeout"));
        engine.HandleNextMessage(_ => new WorkerResult(WorkerResultKind.Success, "recovered"));

        Assert.Equal(WorkflowStatus.Completed, engine.GetRequest(submit.RequestId)?.Status);
        Assert.Equal(1, telemetry.CountByName("workflow.worker_retries_total", "corr-docs-transient"));
    }
}

workflow-scenarios.test.ts

Transient failure

TypeScript test covering the same behavior.

describe("transient retry", () => {
  it("retries and succeeds on second attempt", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry);

    const submitResult = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-transient",
      payload: { accountId: "acct-2" }
    });

    const retry = engine.handleNextMessage(() => ({
      kind: "retryableFailure",
      detail: "dependency timeout"
    }));

    expect(retry?.kind).toBe("retryableFailure");
    expect(engine.getQueue()).toHaveLength(1);
    expect(engine.getRequest(submitResult.requestId)?.status).toBe("queued");
    expect(telemetry.countByName("workflow.worker_retries_total", "corr-transient")).toBe(1);

    const success = engine.handleNextMessage(() => ({
      kind: "success",
      detail: "finalized"
    }));

    expect(success?.kind).toBe("success");
    expect(engine.getRequest(submitResult.requestId)?.status).toBe("completed");
  });
});

Retries exhausted. Request marked failed. DLQ has the message.

When a message fails permanently or exhausts its retry limit, the worker marks the request failed, moves the message to the dead-letter queue, and attaches failure context. The workflow does not loop, and support does not start from scratch.

What this shows

A failed request and a DLQ entry are both asserted. The failure path is explicit: workflow.failed_total increments and WorkQueueDlqAlarm has a source to fire on.

Operating habit

When WorkQueueDlqAlarm fires and workflow.failed_total is above threshold, I pull the DLQ, read the failure context on the request state, and work from the correlation ID rather than the raw message.
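One way to picture the failure context is a DLQ entry that wraps the message rather than replacing it. This is a sketch under assumptions: the repository's moveToDlq is not shown in this case study, so the shapes and helper names below are hypothetical.

```typescript
// Hypothetical shapes: field names are assumptions, not the repository's types.
interface FailedMessage {
  messageId: string;
  requestId: string;
  correlationId: string;
  attempts: number;
}

interface DeadLetterEntry {
  message: FailedMessage;
  reason: string; // worker-supplied detail, e.g. "schema mismatch"
  failedAt: string; // ISO timestamp of the final attempt
}

// Wrap the message with the context support needs before it lands in the DLQ.
function toDeadLetterEntry(message: FailedMessage, reason: string, failedAt: Date): DeadLetterEntry {
  return { message, reason, failedAt: failedAt.toISOString() };
}

// Look up DLQ entries by correlation ID instead of scanning raw message bodies.
function deadLettersFor(dlq: readonly DeadLetterEntry[], correlationId: string): DeadLetterEntry[] {
  return dlq.filter((entry) => entry.message.correlationId === correlationId);
}
```

Because the correlation ID travels inside the entry, the DLQ lookup and the request-state lookup start from the same value.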

WorkflowEngineTests.cs

Poison message

C# xUnit test for this scenario.

public sealed class WorkflowEnginePoisonTests
{
    [Fact]
    public void SendsPoisonMessagesToDlqAndMarksFailed()
    {
        var telemetry = new InMemoryTelemetry();
        var engine = new WorkflowEngine(telemetry, defaultMaxAttempts: 1);

        var submit = engine.Submit(new WorkflowRequestInput(
            WorkflowType: "analytics-export",
            CorrelationId: "corr-docs-poison",
            Payload: new Dictionary<string, object?> { ["accountId"] = "acct-3" },
            MaxAttempts: 1));

        var result = engine.HandleNextMessage(_ => new WorkerResult(
            WorkerResultKind.PermanentFailure,
            "schema mismatch"));

        Assert.Equal(WorkerResultKind.PermanentFailure, result?.Kind);
        Assert.Equal(WorkflowStatus.Failed, engine.GetRequest(submit.RequestId)?.Status);
        Assert.Single(engine.GetDlq());
        Assert.Equal(1, telemetry.CountByName("workflow.failed_total", "corr-docs-poison"));
    }
}

workflow-scenarios.test.ts

Poison message

TypeScript test covering the same behavior.

describe("poison handling", () => {
  it("moves exhausted failures to dlq and marks failed", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry, { defaultMaxAttempts: 1 });

    const submitResult = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-poison",
      payload: { accountId: "acct-3" }
    });

    const failure = engine.handleNextMessage(() => ({
      kind: "permanentFailure",
      detail: "schema mismatch"
    }));

    expect(failure?.kind).toBe("permanentFailure");
    expect(engine.getRequest(submitResult.requestId)?.status).toBe("failed");
    expect(engine.getDlq()).toHaveLength(1);
    expect(telemetry.countByName("workflow.failed_total", "corr-poison")).toBe(1);
  });
});

Queue depth and age become an operational signal, not a mystery.

When queue depth exceeds the configured threshold, the engine emits workflow.backlog_warning. When message age crosses the latency boundary, it emits workflow.backlog_age_breach. Both fire before a user complaint about slowness becomes the first indicator of a problem.

What this shows

Backlog signals use real metric names that map directly to WorkQueueAgeAlarm thresholds in the CDK stack. The test asserts both signals are emitted by the in-memory telemetry sink when the queue crosses the configured limits.

Operating habit

I treat workflow.backlog_age_breach with no matching retry spike as a depth problem, not a worker failure. I look at queue age and worker capacity before escalating to the dependency.
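The two backlog checks are independent, which a small sketch makes explicit. The signal names come from this case study; the `QueuedItem` and `BacklogLimits` shapes, the function name, and the strict greater-than comparisons are assumptions.

```typescript
// Hypothetical shapes for illustration only.
interface QueuedItem {
  correlationId: string;
  enqueuedAt: number; // epoch milliseconds
}

interface BacklogLimits {
  maxDepth: number; // depth above this emits a warning
  maxAgeMs: number; // oldest-message age above this emits a breach
}

type BacklogSignal = "workflow.backlog_warning" | "workflow.backlog_age_breach";

// Evaluate depth and age separately so each signal can fire on its own.
function backlogSignals(queue: readonly QueuedItem[], limits: BacklogLimits, nowMs: number): BacklogSignal[] {
  const signals: BacklogSignal[] = [];
  if (queue.length > limits.maxDepth) signals.push("workflow.backlog_warning");

  // The oldest message drives the age check; an empty queue cannot breach.
  const oldest = queue.reduce((min, item) => Math.min(min, item.enqueuedAt), Infinity);
  if (queue.length > 0 && nowMs - oldest > limits.maxAgeMs) signals.push("workflow.backlog_age_breach");
  return signals;
}
```

A short queue with one old message yields only the age breach, which is the case worth checking against worker capacity first.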

WorkflowEngineTests.cs

Backlog / latency

C# xUnit test for this scenario.

public sealed class WorkflowEngineBacklogTests
{
    [Fact]
    public void EmitsBacklogSignalOnQueueLatency()
    {
        var telemetry = new InMemoryTelemetry();
        var engine = new WorkflowEngine(telemetry, backlogWarningAgeMs: 100, backlogThreshold: 3);

        for (var i = 0; i < 4; i += 1)
        {
            engine.Submit(new WorkflowRequestInput(
                WorkflowType: "analytics-export",
                CorrelationId: $"corr-docs-backlog-{i + 1}",
                IdempotencyKey: $"idem-docs-backlog-{i + 1}",
                Payload: new Dictionary<string, object?> { ["accountId"] = $"acct-{i + 1}" }));
        }

        engine.RunBacklogChecks(DateTime.UtcNow.AddMinutes(5));

        Assert.Equal(1, telemetry.CountByName("workflow.backlog_age_breach", "corr-docs-backlog-1"));
        Assert.Equal(1, telemetry.CountByName("workflow.backlog_warning", "corr-docs-backlog-1"));
    }
}

workflow-scenarios.test.ts

Backlog / latency

TypeScript test covering the same behavior.

describe("backlog signal", () => {
  it("emits backlog metric and log once queue is deep and aged", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry, { backlogWarningAgeMs: 100, queueBacklogThreshold: 3 });

    for (let i = 0; i < 4; i += 1) {
      engine.submit({
        workflowType: "analytics-export",
        correlationId: `corr-backlog-${i + 1}`,
        idempotencyKey: `idem-${i + 1}`,
        payload: { accountId: `${i + 1}` }
      });
    }

    engine.runBacklogChecks(new Date(Date.now() + 5_000));

    expect(telemetry.countByName("workflow.backlog_age_breach", "corr-backlog-1")).toBe(1);
    expect(telemetry.countByName("workflow.backlog_warning", "corr-backlog-1")).toBe(1);
  });
});

The second submit with the same key returns the first request.

A client retry or duplicate event with the same idempotency key returns the existing request instead of creating duplicate queue work. The reuse boundary, shown in the implementation section, is what keeps the queue from getting a second message.

What this shows

Idempotency is enforced at the intake boundary. The test asserts the same requestId is returned and the queue still has exactly one message.

Operating habit

I read request.idempotent_reused as a normal signal when clients retry. If it spikes without a matching workflow.submit_total increase, I check whether upstream retry logic is too aggressive before touching the workflow.
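A minimal sketch of an intake boundary that checks the idempotency index before creating work. The `IntakeSketch` class is hypothetical and far smaller than the real engine; only the `idempotencyIndex` name and the `reused` flag are taken from this case study.

```typescript
interface IntakeResult {
  requestId: string;
  reused: boolean;
}

// Hypothetical intake boundary: look up the key first, enqueue only for new work.
class IntakeSketch {
  private readonly idempotencyIndex = new Map<string, string>();
  private readonly queue: string[] = [];
  private nextId = 1;

  submit(idempotencyKey?: string): IntakeResult {
    if (idempotencyKey !== undefined) {
      const existing = this.idempotencyIndex.get(idempotencyKey);
      // A known key returns the original request and leaves the queue untouched.
      if (existing !== undefined) return { requestId: existing, reused: true };
    }

    const requestId = `req-${this.nextId++}`;
    if (idempotencyKey !== undefined) this.idempotencyIndex.set(idempotencyKey, requestId);
    this.queue.push(requestId);
    return { requestId, reused: false };
  }

  queueDepth(): number {
    return this.queue.length;
  }
}
```

The ordering is the point: the lookup happens before any state is written or any message is queued, so a duplicate never reaches the worker.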

WorkflowEngineTests.cs

Idempotency

C# xUnit test for this scenario.

public sealed class WorkflowEngineIdempotencyTests
{
    [Fact]
    public void ReusesRequestForDuplicateIdempotencyKey()
    {
        var telemetry = new InMemoryTelemetry();
        var engine = new WorkflowEngine(telemetry);

        var first = engine.Submit(new WorkflowRequestInput(
            WorkflowType: "analytics-export",
            CorrelationId: "corr-docs-idem-1",
            IdempotencyKey: "idem-reuse",
            Payload: new Dictionary<string, object?> { ["accountId"] = "acct-4" }));

        var second = engine.Submit(new WorkflowRequestInput(
            WorkflowType: "analytics-export",
            CorrelationId: "corr-docs-idem-2",
            IdempotencyKey: "idem-reuse",
            Payload: new Dictionary<string, object?> { ["accountId"] = "acct-4" }));

        Assert.Equal(first.RequestId, second.RequestId);
        Assert.Equal("corr-docs-idem-1", second.CorrelationId);
        Assert.Equal(1, engine.GetQueue().Count);
    }
}

workflow-scenarios.test.ts

Idempotency

TypeScript test covering the same behavior.

describe("idempotent submit", () => {
  it("reuses the same request for repeated idempotency keys", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry);

    const first = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-idem-1",
      idempotencyKey: "idem-repeat",
      payload: { accountId: "acct-4" }
    });

    const second = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-idem-2",
      idempotencyKey: "idem-repeat",
      payload: { accountId: "acct-4" }
    });

    expect(second.reused).toBe(true);
    expect(second.requestId).toBe(first.requestId);
    expect(engine.getQueue()).toHaveLength(1);
  });
});

04 — Implementation

Keep the workflow behavior testable before adding AWS clients.

The code keeps state transitions away from AWS clients. That makes the behavior easy to test locally, then the CDK stack shows how the same model would sit inside AWS.

workflow-engine.ts

State transitions stay in the workflow model.

The AWS boundary can change later; the state transition rules remain testable locally.

private applyStateTransition(requestId: string, next: WorkflowStatus, updatedAt: string, eventName: string): void {
  const current = this.getRequest(requestId);
  const nextRequest = { ...current, status: next, updatedAt };
  this.requests.set(requestId, nextRequest);
  this.history.set(requestId, [...(this.history.get(requestId) ?? []), nextRequest]);
  this.telemetry.emitTrace(eventName, nextRequest.correlationId, {
    requestId,
    from: current.status,
    to: next
  });
  this.telemetry.emitLog("state.update", nextRequest.correlationId, {
    requestId,
    status: next
  });
}

C#/.NET

WorkflowEngine.cs

Accept work and create a queue message.

Submit stores the request and creates the queue message with the same correlation ID.

public SubmitResult Submit(WorkflowRequestInput input)
{
    var correlationId = input.CorrelationId ?? Guid.NewGuid().ToString("N");
    var requestId = Guid.NewGuid().ToString("N");
    var request = new WorkflowRequest(
        requestId,
        correlationId,
        input.WorkflowType,
        input.Payload,
        input.IdempotencyKey,
        DateTime.UtcNow,
        DateTime.UtcNow,
        WorkflowStatus.Submitted,
        0,
        input.MaxAttempts,
        null);

    requests[requestId] = request;
    ApplyTransition(requestId, WorkflowStatus.Queued, DateTime.UtcNow, "request.submitted");
    queue.Enqueue(new QueueMessage(
        Guid.NewGuid().ToString("N"),
        requestId,
        correlationId,
        input.WorkflowType,
        input.Payload,
        0,
        input.MaxAttempts,
        request.CreatedAt));

    telemetry.EmitMetric("workflow.submit_total", correlationId, new Dictionary<string, object?>
    {
        ["requestId"] = requestId
    });

    return new SubmitResult(requestId, correlationId, WorkflowStatus.Queued, false);
}

WorkflowEngine.cs

Process one queued message at a time.

The worker path records processing, success, retry, and failed transitions in one testable place.

public WorkerResult? ProcessMessage(WorkflowEngine engine, Func<WorkerContext, WorkerResult> handler)
{
    return engine.HandleNextMessage(handler);
}

WorkflowEngine.cs

Keep duplicate intake idempotent.

A repeated idempotency key returns the original request instead of creating duplicate work.

public SubmitResult? ReuseIdempotentRequest(string idempotencyKey)
{
    if (!idempotencyIndex.TryGetValue(idempotencyKey, out var existingRequestId) ||
        !requests.TryGetValue(existingRequestId, out var existing))
    {
        return null;
    }

    telemetry.EmitLog("request.idempotent_reused", existing.CorrelationId, new Dictionary<string, object?>
    {
        ["requestId"] = existing.RequestId,
        ["idempotencyKey"] = idempotencyKey,
        ["status"] = existing.Status.ToString().ToLowerInvariant()
    });

    return new SubmitResult(existing.RequestId, existing.CorrelationId, existing.Status, true);
}

WorkflowEngine.cs

Turn backlog into an explicit signal.

The queue check emits workflow.backlog_warning when queue depth crosses its threshold and workflow.backlog_age_breach when message age crosses its limit.

public void CheckBacklog(WorkflowEngine engine, DateTime nowUtc)
{
    engine.RunBacklogChecks(nowUtc);
}

WorkflowEngine.cs

Give failed requests a runbook shape.

The runbook output keeps the failure state, evidence, and next checks together.

public Runbook BuildRunbook(WorkflowEngine engine, string requestId)
{
    return engine.BuildRunbook(requestId);
}

TypeScript / Node.js

workflow-engine.ts

Accept work and create a queue message.

Submit stores the request and creates the queue message with the same correlation ID.

public submit(input: WorkflowRequestInput): SubmitResult {
  const correlationId = input.correlationId ?? randomUUID();
  const requestId = randomUUID();
  const request: WorkflowRequest = {
    requestId,
    correlationId,
    workflowType: input.workflowType,
    payload: input.payload,
    idempotencyKey: input.idempotencyKey,
    createdAt: new Date().toISOString(),
    updatedAt: new Date().toISOString(),
    status: "submitted",
    attempts: 0,
    maxAttempts: input.maxAttempts ?? 3
  };

  this.requests.set(requestId, request);
  this.applyStateTransition(requestId, "queued", request.updatedAt, "request.submitted");
  this.queue.push({
    messageId: randomUUID(),
    requestId,
    correlationId,
    workflowType: input.workflowType,
    payload: input.payload,
    attempts: 0,
    maxAttempts: request.maxAttempts,
    enqueuedAt: request.createdAt
  });

  this.telemetry.emitMetric("workflow.submit_total", correlationId, { requestId });
  this.telemetry.emitTrace("request.correlation", correlationId, { requestId, path: "submit->queued" });
  return { requestId, correlationId, status: "queued", reused: false };
}

workflow-engine.ts

Process one queued message at a time.

The worker path records retry, DLQ, and completed outcomes through explicit transitions.

public handleNextMessage(handler: (context: WorkerContext) => WorkerResult): WorkerResult | undefined {
  const message = this.queue.shift();
  if (!message) return undefined;

  const request = this.getRequest(message.requestId);
  this.applyStateTransition(request.requestId, "processing", new Date().toISOString(), "worker.processing_started");
  const result = handler({ request, message });

  if (result.kind === "success") {
    this.completeRequest(request.requestId, message, result);
    return result;
  }

  // Requeue only while this attempt (attempts + 1) stays under maxAttempts.
  if (result.kind === "retryableFailure" && message.attempts + 1 < message.maxAttempts) {
    this.telemetry.emitMetric("workflow.worker_retries_total", request.correlationId, {
      requestId: request.requestId,
      attempt: message.attempts + 1
    });
    this.queue.push({ ...message, messageId: randomUUID(), attempts: message.attempts + 1 });
    this.applyStateTransition(request.requestId, "queued", new Date().toISOString(), "worker.retry_scheduled");
    return result;
  }

  // Permanent failures and exhausted retries both end in the DLQ.
  this.moveToDlq({ ...message, attempts: message.attempts + 1 }, request, result);
  return result;
}

workflow-engine.ts

Keep duplicate intake idempotent.

The TypeScript path uses the same idempotency boundary as the C# version.

private reuseIdempotentRequest(idempotencyKey: string): SubmitResult | undefined {
  const existingRequestId = this.idempotencyIndex.get(idempotencyKey);
  const existing = existingRequestId ? this.requests.get(existingRequestId) : undefined;
  if (!existing) return undefined;

  this.telemetry.emitLog("request.idempotent_reused", existing.correlationId, {
    requestId: existing.requestId,
    idempotencyKey,
    status: existing.status
  });

  return {
    requestId: existing.requestId,
    correlationId: existing.correlationId,
    status: existing.status,
    reused: true
  };
}

workflow-engine.ts

Give failed requests a runbook shape.

The runbook output keeps the failure state, evidence, and next checks together.

public runbook(requestId: string): Runbook | undefined {
  const request = this.requests.get(requestId);
  if (!request) return undefined;

  return {
    requestId,
    correlationId: request.correlationId,
    state: request.status,
    summary: request.status === "failed"
      ? "Failure path observed. Check DLQ, upstream dependency, and downstream error detail."
      : "Workflow still in progress. Check state history and queue age until it advances.",
    evidence: [
      {
        correlationId: request.correlationId,
        requestId,
        scenario: "workflow.failure",
        nextSteps: [
          "Confirm state in request store.",
          "Check the DLQ for this correlation ID.",
          "Re-run only after the downstream dependency is healthy."
        ]
      }
    ],
    dlqSize: this.dlq.length
  };
}

05 — AWS architecture and IAM boundaries

CDK defines the AWS boundary around the same workflow.

Only the infrastructure layer is TypeScript CDK. The C# path stays local in this version. The CDK stack shows where the local model would sit in AWS, with IAM scoped per Lambda role. See observability-workflow-stack.ts for the full construct.

Edge

POST /workflows GET /workflows/{requestId}

Compute

SubmitFunction StatusFunction WorkerFunction

State & queue

WorkflowStateTable WorkQueue WorkQueueDlq

Signals

WorkQueueAgeAlarm WorkQueueDlqAlarm WorkerErrorRateAlarm WorkflowObservabilityDashboard

06 — Observability and runbook output

Signals are useful only if they help narrow the failure.

Signals are emitted through an in-memory telemetry sink in local runs and tests. They are not real production telemetry. The names and shapes are what you would wire to CloudWatch Metrics and structured logs in a deployed version.
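As a rough picture of such a sink, the sketch below records every signal in memory and counts by name and correlation ID. The method names match the calls shown in this case study (emitMetric, emitLog, emitTrace, countByName); the class name and the recorded shape are assumptions, not the repository's InMemoryTelemetry.

```typescript
type SignalKind = "metric" | "log" | "trace";

// Hypothetical record shape kept by the sketch sink.
interface RecordedSignal {
  kind: SignalKind;
  name: string;
  correlationId: string;
  fields: Record<string, unknown>;
}

class TelemetrySinkSketch {
  private readonly signals: RecordedSignal[] = [];

  emitMetric(name: string, correlationId: string, fields: Record<string, unknown> = {}): void {
    this.record("metric", name, correlationId, fields);
  }

  emitLog(name: string, correlationId: string, fields: Record<string, unknown> = {}): void {
    this.record("log", name, correlationId, fields);
  }

  emitTrace(name: string, correlationId: string, fields: Record<string, unknown> = {}): void {
    this.record("trace", name, correlationId, fields);
  }

  // Count signals of any kind that share a name and a correlation ID.
  countByName(name: string, correlationId: string): number {
    return this.signals.filter((s) => s.name === name && s.correlationId === correlationId).length;
  }

  private record(kind: SignalKind, name: string, correlationId: string, fields: Record<string, unknown>): void {
    this.signals.push({ kind, name, correlationId, fields });
  }
}
```

Requiring the correlation ID as a positional argument on every emit call is a deliberate choice in this sketch: a signal without one cannot be written by accident.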

| Metric name | What it tells an operator |
| --- | --- |
| workflow.submit_total | Total accepted requests. A flatline here while the client is active usually means the API boundary is unhealthy. |
| workflow.worker_retries_total | Retry pressure on the worker. Rising without matching completions means a dependency is struggling. |
| workflow.failed_total | Requests that exhausted retries and moved to the DLQ. Each increment should produce a runbook entry. |
| workflow.backlog_age_breach | A message waited longer than the configured age threshold. Maps to WorkQueueAgeAlarm in CDK. |
| workflow.backlog_warning | Queue depth crossed the warning threshold. Often the first signal before latency reaches users. |

Trace and log event names: request.submitted · request.correlation · worker.processing_started · worker.retry_scheduled · state.update · request.idempotent_reused. Every event carries the correlation ID so a support thread starts with a single value.

Runbook output shape

Failed requests produce a runbook entry tied to the correlation ID. Support does not start from a blank page.

WorkflowEngine.cs

Give failed requests a runbook shape.

The runbook output keeps failure state, evidence, and next checks together under the same correlation ID.

public Runbook BuildRunbook(WorkflowEngine engine, string requestId)
{
    return engine.BuildRunbook(requestId);
}

workflow-engine.ts

Give failed requests a runbook shape.

The TypeScript runbook uses the same correlation ID thread: submit, queue message, state transitions, telemetry, runbook.

public runbook(requestId: string): Runbook | undefined {
  const request = this.requests.get(requestId);
  if (!request) return undefined;

  return {
    requestId,
    correlationId: request.correlationId,
    state: request.status,
    summary: request.status === "failed"
      ? "Failure path observed. Check DLQ, upstream dependency, and downstream error detail."
      : "Workflow still in progress. Check state history and queue age until it advances.",
    evidence: [
      {
        correlationId: request.correlationId,
        requestId,
        scenario: "workflow.failure",
        nextSteps: [
          "Confirm state in request store.",
          "Check the DLQ for this correlation ID.",
          "Re-run only after the downstream dependency is healthy."
        ]
      }
    ],
    dlqSize: this.dlq.length
  };
}

07 — Tests and checks

Tests keep the operational story honest.

Five scenarios, two runtimes. The asserted signals use the same metric names that would appear in CloudWatch. Run with: npm test, npm run build, npm run cdk:test, npm run synth, plus dotnet test for the C# path.

| Scenario | Behavior | Asserted signal | Files |
| --- | --- | --- | --- |
| Happy path | Submit, queue, process, complete; correlation ID continuous | request.submitted + state.update (completed) | WorkflowEngineTests.cs / workflow-scenarios.test.ts |
| Transient failure | Worker retries, attempt count increments, workflow survives | workflow.worker_retries_total + worker.retry_scheduled | WorkflowEngineTests.cs / workflow-scenarios.test.ts |
| Poison message | Retries exhausted, request failed, DLQ has 1 message | workflow.failed_total == 1, DLQ.count == 1 | WorkflowEngineTests.cs / workflow-scenarios.test.ts |
| Backlog / latency | Queue depth and age cross thresholds, signals emitted | workflow.backlog_warning + workflow.backlog_age_breach | WorkflowEngineTests.cs / workflow-scenarios.test.ts |
| Idempotency | Duplicate key returns same requestId, queue unchanged | request.idempotent_reused, queue.count == 1 | WorkflowEngineTests.cs / workflow-scenarios.test.ts |

Full TypeScript test file

describe("happy path", () => {
  it("keeps correlationId through submit, queue, and completion", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry, { defaultMaxAttempts: 3 });

    const submitResult = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-happy",
      payload: { accountId: "acct-1" }
    });

    expect(submitResult.status).toBe("queued");
    expect(submitResult.correlationId).toBe("corr-happy");

    const result = engine.handleNextMessage(() => ({
      kind: "success",
      detail: "completed"
    }));
    const finalRequest = engine.getRequest(submitResult.requestId);

    expect(result?.kind).toBe("success");
    expect(finalRequest?.status).toBe("completed");
    expect(finalRequest?.correlationId).toBe("corr-happy");
    expect(engine.getDlq()).toHaveLength(0);
  });
});
// snippet:test-happy-path-end

// snippet:test-transient-start
describe("transient retry", () => {
  it("retries and succeeds on second attempt", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry);

    const submitResult = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-transient",
      payload: { accountId: "acct-2" }
    });

    const retry = engine.handleNextMessage(() => ({
      kind: "retryableFailure",
      detail: "dependency timeout"
    }));

    expect(retry?.kind).toBe("retryableFailure");
    expect(engine.getQueue()).toHaveLength(1);
    expect(engine.getRequest(submitResult.requestId)?.status).toBe("queued");
    expect(telemetry.countByName("workflow.worker_retries_total", "corr-transient")).toBe(1);

    const success = engine.handleNextMessage(() => ({
      kind: "success",
      detail: "finalized"
    }));

    expect(success?.kind).toBe("success");
    expect(engine.getRequest(submitResult.requestId)?.status).toBe("completed");
  });
});
// snippet:test-transient-end

// snippet:test-poison-start
describe("poison handling", () => {
  it("moves exhausted failures to dlq and marks failed", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry, { defaultMaxAttempts: 1 });

    const submitResult = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-poison",
      payload: { accountId: "acct-3" }
    });

    const failure = engine.handleNextMessage(() => ({
      kind: "permanentFailure",
      detail: "schema mismatch"
    }));

    expect(failure?.kind).toBe("permanentFailure");
    expect(engine.getRequest(submitResult.requestId)?.status).toBe("failed");
    expect(engine.getDlq()).toHaveLength(1);
    expect(telemetry.countByName("workflow.failed_total", "corr-poison")).toBe(1);
  });
});
// snippet:test-poison-end

// snippet:test-backlog-start
describe("backlog signal", () => {
  it("emits backlog metric and log once queue is deep and aged", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry, { backlogWarningAgeMs: 100, queueBacklogThreshold: 3 });

    for (let i = 0; i < 4; i += 1) {
      engine.submit({
        workflowType: "analytics-export",
        correlationId: `corr-backlog-${i + 1}`,
        idempotencyKey: `idem-${i + 1}`,
        payload: { accountId: `${i + 1}` }
      });
    }

    engine.runBacklogChecks(new Date(Date.now() + 5_000));

    expect(telemetry.countByName("workflow.backlog_age_breach", "corr-backlog-1")).toBe(1);
    expect(telemetry.countByName("workflow.backlog_warning", "corr-backlog-1")).toBe(1);
  });
});
// snippet:test-backlog-end

// snippet:test-idempotency-start
describe("idempotent submit", () => {
  it("reuses the same request for repeated idempotency keys", () => {
    const telemetry = new InMemoryTelemetry();
    const engine = new WorkflowEngine(telemetry);

    const first = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-idem-1",
      idempotencyKey: "idem-repeat",
      payload: { accountId: "acct-4" }
    });

    const second = engine.submit({
      workflowType: "analytics-export",
      correlationId: "corr-idem-2",
      idempotencyKey: "idem-repeat",
      payload: { accountId: "acct-4" }
    });

    expect(second.reused).toBe(true);
    expect(second.requestId).toBe(first.requestId);
    expect(engine.getQueue()).toHaveLength(1);
  });
});
// snippet:test-idempotency-end

08 — Tradeoffs

What I chose, what I left out, and what would come next.

01

Local scope first

The implementation focuses on behavior and operating signals before adding production concerns such as identity, secret rotation, and cost controls.

02

Correlation-first debugging

The model keeps one stable correlation ID across the request, the queue message, telemetry, and state transitions. That makes support work easier, but only if every event carries the field under the same name and with the same value.

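One way to enforce that discipline is to make the correlation ID a required field on the event type and bind it once per request, so a call site cannot forget or mistype it. The sketch below is illustrative only: `TelemetryEvent`, `makeEmitter`, and the `sink` array are hypothetical names, not the repository's `InMemoryTelemetry` API.

```typescript
// Hypothetical event shape: every event must carry the correlation ID.
interface TelemetryEvent {
  name: string;           // e.g. "workflow.worker_retries_total"
  correlationId: string;  // required, never optional
  timestamp: string;      // ISO-8601
  attributes?: Record<string, string | number>;
}

// Binds the correlation ID once so later emissions cannot drop or change it.
function makeEmitter(correlationId: string, sink: TelemetryEvent[]) {
  if (correlationId.trim() === "") {
    throw new Error("correlationId must be a non-empty string");
  }
  return (name: string, attributes?: Record<string, string | number>): TelemetryEvent => {
    const event: TelemetryEvent = {
      name,
      correlationId,
      timestamp: new Date().toISOString(),
      // Only attach attributes when the caller supplied them.
      ...(attributes ? { attributes } : {})
    };
    sink.push(event);
    return event;
  };
}

// Usage: both events share the correlation ID without repeating it.
const sink: TelemetryEvent[] = [];
const emit = makeEmitter("corr-transient", sink);
emit("workflow.worker_retries_total", { attempt: 1 });
emit("workflow.failed_total");
```

Binding the ID in a closure trades a little flexibility for consistency: an event emitted through `emit` cannot carry a different correlation ID from the request it belongs to.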
03

Signal clarity vs alert noise

Backlog and retry signals are explicit, with thresholds tied to user impact. The tradeoff is that dashboards need tuning before noisy conditions make the alerts easy to ignore.
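A minimal sketch of a backlog check that only alerts when the queue is both deep and aged, which is one way to keep a brief burst from paging anyone. The option names `queueBacklogThreshold` and `backlogWarningAgeMs` come from the test file; `QueuedMessage`, `BacklogVerdict`, `evaluateBacklog`, and the strict greater-than comparisons are assumptions made for illustration, not the engine's actual `runBacklogChecks` logic.

```typescript
// Hypothetical shapes for illustration; not the repository's types.
interface QueuedMessage {
  correlationId: string;
  enqueuedAt: Date;
}

interface BacklogThresholds {
  queueBacklogThreshold: number; // depth above which the queue counts as deep
  backlogWarningAgeMs: number;   // age above which the oldest message counts as aged
}

interface BacklogVerdict {
  depth: number;
  oldestAgeMs: number;
  oldestCorrelationId?: string;
  deep: boolean;
  aged: boolean;
  shouldAlert: boolean; // true only when both conditions hold
}

function evaluateBacklog(
  queue: QueuedMessage[],
  now: Date,
  thresholds: BacklogThresholds
): BacklogVerdict {
  const first = queue[0];
  if (first === undefined) {
    return { depth: 0, oldestAgeMs: 0, deep: false, aged: false, shouldAlert: false };
  }
  // Find the oldest message; on ties the earliest queue position wins.
  let oldest = first;
  for (const message of queue) {
    if (message.enqueuedAt.getTime() < oldest.enqueuedAt.getTime()) {
      oldest = message;
    }
  }
  const oldestAgeMs = now.getTime() - oldest.enqueuedAt.getTime();
  const deep = queue.length > thresholds.queueBacklogThreshold;
  const aged = oldestAgeMs > thresholds.backlogWarningAgeMs;
  return {
    depth: queue.length,
    oldestAgeMs,
    oldestCorrelationId: oldest.correlationId,
    deep,
    aged,
    shouldAlert: deep && aged
  };
}

// Usage: four messages, checked five seconds after they were enqueued.
const enqueuedAt = new Date("2024-01-01T00:00:00.000Z");
const backlog: QueuedMessage[] = [1, 2, 3, 4].map((i) => ({
  correlationId: `corr-backlog-${i}`,
  enqueuedAt
}));
const verdict = evaluateBacklog(backlog, new Date(enqueuedAt.getTime() + 5_000), {
  queueBacklogThreshold: 3,
  backlogWarningAgeMs: 100
});
```

Returning the oldest message's correlation ID with the verdict keeps the alert tied to a concrete request, so the first step after an alarm is a lookup rather than a search.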

Intentionally absent

No live AWS deployment in this version. Confidence comes from local workflow behavior, CDK synthesis, and CDK assertion tests.
No production identity layer. The example stays focused on async workflow behavior and failure mechanics.
Queue visibility and worker parallelism use representative defaults, not a full load profile.

Production path

Identity and auth. Add an identity provider and a gateway authorizer in front of the submit function, and scope access per tenant.
Secret rotation and cost controls. Wire up AWS Secrets Manager rotation, and set per-function concurrency limits and the SQS batch size.
CloudWatch wiring. Emit the existing metric names through CloudWatch Embedded Metric Format (EMF); tune WorkflowObservabilityDashboard thresholds against real traffic.
Load and concurrency profile. Run a load profile to find the queue depth and visibility timeout that keep latency predictable without over-provisioning.
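For the CloudWatch wiring step, a rough sketch of what one EMF log line could look like for an existing metric name. The `_aws` envelope follows the Embedded Metric Format structure; the namespace `WorkflowObservability`, the `WorkflowType` dimension, and the `toEmfLine` helper are assumptions, not values from the repository. The correlation ID is kept as a plain property rather than a dimension, because a dimension per request would create a separate metric for each one.

```typescript
// Hypothetical input shape for building one EMF record.
interface EmfInput {
  namespace: string;
  metricName: string;
  value: number;
  unit: "Count" | "Milliseconds" | "Seconds";
  dimensions: Record<string, string>;  // low-cardinality values only
  properties?: Record<string, string>; // searchable in logs, not metric dimensions
  timestampMs?: number;
}

// Builds a single-line JSON record in Embedded Metric Format.
function toEmfLine(input: EmfInput): string {
  const record = {
    ...(input.properties ?? {}),
    ...input.dimensions,
    _aws: {
      Timestamp: input.timestampMs ?? Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: input.namespace,
          Dimensions: [Object.keys(input.dimensions)],
          Metrics: [{ Name: input.metricName, Unit: input.unit }]
        }
      ]
    },
    // The metric value lives at the root under the metric's name.
    [input.metricName]: input.value
  };
  return JSON.stringify(record);
}

// Usage: in a Lambda function, writing this line to stdout would be enough
// for CloudWatch Logs to extract the metric.
const line = toEmfLine({
  namespace: "WorkflowObservability",
  metricName: "workflow.failed_total",
  value: 1,
  unit: "Count",
  dimensions: { WorkflowType: "analytics-export" },
  properties: { correlationId: "corr-poison" },
  timestampMs: 1_700_000_000_000
});
console.log(line);
```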
