Detecting ECS Task Failures with EventBridge

#aws

I recently had to add some EventBridge rules for detecting and responding to ECS task failures. I spent a bit of time trying to find examples online, and ended up coming up with a couple of my own. I’m posting them here in the hope of saving others the trouble.

TL;DR: See the EventBridge rule below.

Stopped Task Error Codes

If there is an issue running a container, ECS sets the stoppedReason field to one of a few possible values. The full list is here; however, most contain the terms “Error” or “Failure”, so we can use a wildcard pattern to match these failures.
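
As a minimal sketch, the wildcard match on its own could look like the fragment below. The three wildcard strings are the ones used in the full rule later in this post. EventBridge matching is case-sensitive, which is presumably why both “Error” and “error” are listed.

```json
{
  "detail": {
    "stoppedReason": [
      { "wildcard": "*Error*" },
      { "wildcard": "*error*" },
      { "wildcard": "*Failed*" }
    ]
  }
}
```

The array acts as an OR: the event matches if stoppedReason satisfies any one of the wildcards.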

Detecting Exit Codes

Like most programs, the containers in an ECS task return exit codes indicating their status. For some of the codes you are likely to encounter, see the “Common exit codes” section here.

However, all we need to know is whether the exit code is non-zero, which indicates a failure. The EventBridge anything-but pattern lets us match only tasks where a container exited with a code other than 0.
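
Taken in isolation, the exit code filter might be sketched as follows. It is the same anything-but clause that appears in the full rule, shown here without the other conditions.

```json
{
  "detail": {
    "containers": {
      "exitCode": [{ "anything-but": [0] }]
    }
  }
}
```

Because containers is an array in the event, this should match when at least one container in the task reports a non-zero exit code.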

EventBridge Rule

Combining both scenarios with the $or operator, this rule should detect most of the possible failure scenarios for an ECS task.

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "lastStatus": ["STOPPED"],
    "$or": [
      {
        "stoppedReason": [{
          "wildcard": "*Error*"
        }, {
          "wildcard": "*error*"
        }, {
          "wildcard": "*Failed*"
        }]
      },
      {
        "containers": {
          "exitCode": [{
            "anything-but": [0]
          }]
        },
        "stoppedReason": ["Essential container in task exited"]
      }
    ]
  }
}
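
To sanity-check the rule, it can help to have a sample event to test against. The event below is a hypothetical, heavily trimmed ECS Task State Change event: the account ID, task ARN, timestamp, and container name are made-up placeholders, and a real event carries many more fields. It should match the second branch of the $or, since the task is STOPPED, the stoppedReason is the exact string in the rule, and one container has a non-zero exit code.

```json
{
  "version": "0",
  "id": "00000000-0000-0000-0000-000000000000",
  "detail-type": "ECS Task State Change",
  "source": "aws.ecs",
  "account": "111122223333",
  "time": "2024-03-01T12:00:00Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:ecs:us-east-1:111122223333:task/example-cluster/0123456789abcdef0123456789abcdef"
  ],
  "detail": {
    "lastStatus": "STOPPED",
    "stoppedReason": "Essential container in task exited",
    "containers": [
      { "name": "app", "lastStatus": "STOPPED", "exitCode": 1 }
    ]
  }
}
```

One way to try this out is the aws events test-event-pattern CLI command, which takes an event pattern and an event and reports whether they match. Changing exitCode to 0 in the sample should make the match fail.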

Change Log

  • 3/1/2024 - Initial Revision
