AWS ECS Exec on ECS Fargate with Terraform


Balancing security trade-offs is a time-honored and often dreaded activity that all development shops must contend with. On one hand, applications should be secure enough for users to be confident that their data is protected from prying eyes and unwanted manipulation. On the other hand, too many security controls can bog down developers to the point where they are spending more time securing the app than actually developing it.

At Simple Thread, we are continually striving to operate within that sweet spot that values both a firm foundation in security, and a development environment that still produces rapid results.  It’s rare for us to find a new service or technology that meets both of those requirements: improving security, while at the same time, making it easier for developers to do their jobs.  But the new Amazon ECS Exec feature does exactly that!

Simple Thread’s bread and butter development comes in the form of data-driven web applications that run on Amazon Web Services (AWS) infrastructure. These applications typically run inside Docker containers that are deployed as Amazon Elastic Container Service (ECS) tasks. And those tasks are hosted in a Fargate cluster, which is a service that allows developers to run containers without any knowledge of, or access to, the underlying infrastructure that hosts them. While this infrastructure abstraction can often lead to reduced costs and less administrative overhead, it does make it more difficult if you need to connect directly to a task running in Fargate.

During the course of our application development, we do occasionally need to connect to a running container, in order to set up, test, or debug a feature of an application. In a traditional Docker architecture, a developer could connect to the underlying host machine and then run a docker exec -it command to invoke an interactive shell inside the container. But in a Fargate cluster, the developer doesn’t have the ability to connect to the host infrastructure, and thus, no ability to exec into a task.

In order to get around this limitation, Simple Thread projects typically used a bastion host EC2 instance. Developers could tunnel into the bastion host and then create an SSH connection to a Fargate task. It was a multi-step process that had several drawbacks.

In addition to maintaining the bastion host EC2 instance itself, the environment had to be configured to allow SSH traffic between the user, the bastion host, and the target ECS task. This meant opening up ports in both internal and external security groups, as well as maintaining a set of SSH keys to allow sessions to authenticate. Handling the storage and distribution of these keys is always a tricky matter, because in the wrong hands, a bad actor could do some real damage.

Enter ECS Exec

That was the landscape before March of 2021, when Amazon released ECS Exec, a new feature that allows for a direct connection to ECS tasks, including those running on Fargate clusters. This was exactly what many ECS users (and especially Fargate users) had been clamoring for, because it obviates the need for those bastion host instances. There was much rejoicing, but it gets even better!

Under the covers, ECS Exec uses AWS Systems Manager (SSM) Session Manager. SSM creates an interactive session that doesn’t use SSH, and therefore doesn’t require any SSH keys. It also doesn’t require any inbound ports to be opened in your security groups. And that’s not all! Access to the SSM service, like any other AWS service, can be tightly controlled using IAM policies, ensuring that only users with the proper permissions will be able to connect. And last, but certainly not least, SSM sessions can be logged, in their entirety, to CloudWatch Logs or S3, and the API calls that start them are recorded in CloudTrail, so you’ll always know who did what, and when. Needless to say, implementing this new feature in our applications quickly became a high priority.

As it turns out, it didn’t take too much modification to add ECS Exec into our baseline infrastructure code. We use Terraform, an infrastructure as code (IaC) tool, to manage all of our AWS infrastructure. As such, some of the code samples I share below are Terraform snippets, but it should be fairly straightforward to convert the logic into other popular tools for AWS control, like the AWS CLI or Python Boto, should you prefer one of those.

First and foremost, in order to create an SSM connection to an AWS resource, that resource must have the SSM agent installed. In the case of ECS tasks, the underlying host must have the agent running. Yet another great feature of Fargate is that you get this for free: tasks launched on Fargate automatically have access to a running SSM agent without you lifting a finger. If, however, you run your ECS tasks in an EC2 cluster, you will need to ensure that those EC2 instances all have the agent installed and running.
You can view a complete list of ECS Exec prerequisites in the Amazon ECS documentation.

Task Definition Changes

For our use case, I created a new task definition that would be dedicated to this feature. When developers want to connect into the environment, they spin up an instance of this new task, which uses the same image as our production application, but is specifically configured to allow for these SSM connections. These tasks are intended to be ephemeral, and as soon as the developer session is terminated, the task is immediately stopped. If instead you wish to create long-running tasks with ECS Exec enabled, you may want to consider adding the following snippet into your task definition:


"linuxParameters": {
  "initProcessEnabled": true
}

This will start the init process in the container, and can help to clean up zombie processes that can be created over time when running the SSM agent.
This and other considerations for using ECS Exec are discussed in the Amazon ECS documentation.

So, before creating a new task definition, I first needed to create a new task role. This role needs some extra permissions to enable SSM. The minimum set of permissions would look something like this:


{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ssmmessages:CreateControlChannel",
      "ssmmessages:CreateDataChannel",
      "ssmmessages:OpenControlChannel",
      "ssmmessages:OpenDataChannel"
    ],
    "Resource": "*"
  }]
}

You can simply create the above IAM policy and attach it to your task role. Alternatively, Amazon provides an AWS-managed policy that has all of these permissions (and some more). The policy is called AmazonSSMManagedInstanceCore and it can be attached to the role of either EC2 instances or ECS tasks to enable all of the SSM capabilities. Once you have the new task role in place, you can create your task definition. My Terraform code for both the role and the task looks like this:


resource "aws_iam_role" "ecs_exec_task_role" {
  name = "${local.environment}-ecs-exec-task-role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

  tags = local.common_tags
}

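# Look up the AWS-managed policy that the attachment below refers to.
data "aws_iam_policy" "AmazonSSMManagedInstanceCore" {
  arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_role_policy_attachment" "ecs-ssm-role-policy-attach" {
  role       = aws_iam_role.ecs_exec_task_role.name
  policy_arn = data.aws_iam_policy.AmazonSSMManagedInstanceCore.arn
}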

resource "aws_ecs_task_definition" "ecs-exec" {
  family                   = "${local.environment}-ecs-exec"
  execution_role_arn       = aws_iam_role.ecs_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_exec_task_role.arn
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = local.api_ssh_cpu
  memory                   = local.api_ssh_memory
  container_definitions = templatefile("templates/task_definition_template.json.tpl",
    {
      name            = "${local.environment}-ecs-exec"
      environment     = local.environment
      app_image       = aws_ecr_repository.api.repository_url
      aws_region      = "us-east-1"
      command         = ["tail", "-f", "/dev/null"]
      port_mappings   = []
      log_group       = aws_cloudwatch_log_group.session_manager_log_group.name
    }
  )
}

Tying It All Together

Once I had my task definition created, the next thing was to get the developers set up with permissions to actually run one of these things.  Users need extra privileges to be able to run an ECS Exec command on a task.  So, you will probably need to add something like the following snippet to the role or permissions set of any user who needs to make this sort of connection.


{
  "Sid": "ExecuteCommandOnExecTask",
  "Effect": "Allow",
  "Resource": [
    "*"
  ],
  "Action": [
    "ecs:ExecuteCommand"
  ],
  "Condition": {
    "StringEquals": {
      "ecs:container-name": "${local.environment}-ecs-exec"
    }
  }
}

And finally, in order for developers to actually run the task, we run a bash script, which uses the AWS CLI.  The most important section of that script is:


#!/bin/bash

set -e

...

echo "Starting task..."

# JSON for the container override that passes the caller's username into the task.
OVERRIDES="{\"containerOverrides\":[{\"name\":\"$TASK_DEFINITION\",\"environment\":[{\"name\":\"AWS_USER\",\"value\":\"$AWS_USER\"}]}]}"

# Build the command as an array so each argument survives word splitting intact.
COMMAND=(aws ecs run-task --region "$REGION"
  --cluster "$CLUSTER"
  --task-definition "$TASK_DEFINITION"
  --count 1
  --launch-type FARGATE
  --network-configuration "$NETWORK_CONFIG"
  --enable-execute-command
  --overrides "$OVERRIDES"
  --tags "key=Owner,value=$AWS_USER")

RESPONSE=$("${COMMAND[@]}")

...

Not too different from any other aws ecs run-task command, but with one key difference.  It’s that --enable-execute-command flag.  That is crucial, and ECS Exec won’t work without it.  This means that you can’t use this new feature on already-running tasks that weren’t started with this flag.  If you want to use ECS Exec on one of these tasks, you’ll have to restart it first.
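The excerpt above stops once the task has been requested. A minimal sketch of the remaining steps might look like the following, assuming the AWS CLI and its Session Manager plugin are installed locally. The function name exec_into_task and the choice of /bin/bash as the shell are illustrative assumptions, not part of our actual script.

```shell
# Hypothetical helper: wait for a task, confirm ECS Exec is enabled, then open a shell.
# Arguments: cluster name, task ARN or ID, container name.
exec_into_task() {
  local cluster="$1" task="$2" container="$3"

  # Block until ECS reports the task as RUNNING.
  aws ecs wait tasks-running --cluster "$cluster" --tasks "$task" || return 1

  # Tasks started without --enable-execute-command report false here.
  local enabled
  enabled=$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task" \
    --query 'tasks[0].enableExecuteCommand' --output text)
  case "$enabled" in
    [Tt]rue) ;;
    *) echo "ECS Exec is not enabled on task $task" >&2; return 1 ;;
  esac

  # Open an interactive shell (assumes the image ships /bin/bash).
  aws ecs execute-command --cluster "$cluster" --task "$task" \
    --container "$container" --interactive --command "/bin/bash"
}
```

A call such as exec_into_task "$CLUSTER" "$TASK_ARN" "$TASK_DEFINITION" would then drop the developer into the container, where TASK_ARN is a placeholder for the ARN returned by run-task. The agent inside the container can take a few moments to become ready after the task reaches RUNNING, so a short retry around the final command may be worthwhile.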

Log All The Things

I mentioned one other key feature of ECS Exec: the ability to log sessions.  This is a fantastic feature, which logs literally every keystroke of a user’s Exec session.  However, there is one limitation to this logging.

Since all ECS Exec sessions are conducted as the root user within the container, there is no reference in the CloudWatch logs to the individual AWS user who invoked the session. After a little digging I found that ExecuteCommand events are also logged in CloudTrail, and those entries do contain the AWS username. So it is possible to correlate the CloudTrail ExecuteCommand events with the CloudWatch SSM logs, to piece together an attributable audit trail. However, CloudTrail events, especially in busy AWS accounts, can occur at the rate of hundreds per second. That can make searching for these events in the console much like looking for the proverbial needle in a haystack. And even then, it seems that the timestamps of the two log entries (CloudTrail and CloudWatch) don’t exactly match. So, in the event of a security audit, you could be left correlating these entries with some “fuzzy” logic.
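One way to narrow that haystack from the command line, rather than the console, is to filter CloudTrail by event name and time window. The sketch below is only an illustration: the function name find_exec_events is hypothetical, and it assumes the ExecuteCommand events are available to the lookup-events API in the region you are querying.

```shell
# Hypothetical helper: list ExecuteCommand events with their time and caller.
# Arguments: start and end of the search window (any timestamp the AWS CLI accepts).
find_exec_events() {
  local start="$1" end="$2"

  # Filter by event name so only ExecuteCommand calls are returned,
  # then print just the event time and the username for each match.
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=ExecuteCommand \
    --start-time "$start" --end-time "$end" \
    --query 'Events[].[EventTime,Username]' --output text
}
```

Calling find_exec_events with a window a few minutes either side of a CloudWatch session log entry should leave only a handful of candidates to match by hand.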

In order to ameliorate (but not entirely alleviate) this problem, I implemented a couple of things to make our lives a little easier. As you can see in the aws ecs run-task command above, I added an --overrides section that sets an environment variable in the task with the AWS username of the person running the script. Within the Docker image for the task, I added a .bash_profile that echoes this environment variable to standard out. Thus, under normal circumstances, the username now shows up in the CloudWatch session logs.
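The .bash_profile entry can be very small. The following is a sketch rather than our exact file: the function name and the wording of the message are assumptions, while AWS_USER is the variable set by the --overrides section.

```shell
# Sketch of a .bash_profile entry that records who opened the session.
announce_session_user() {
  # Fall back to a placeholder when the variable was not supplied.
  echo "ECS Exec session started by: ${AWS_USER:-unknown}"
}

# Runs once when the login shell starts, so the line lands in the session log.
announce_session_user
```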

However, this is definitely more of a convenience than a bona fide security control. Since the script is run client-side, a malicious user could simply change the script, removing their username, or even worse, adding someone else’s. Definitely don’t rely on this in the event of a real security investigation.

I also added an Amazon EventBridge (formerly CloudWatch Events) rule that sends off an email on an Amazon SNS alert topic whenever any user starts an ExecuteCommand session. The alert contains key information about the event (including the username), and lets us see when people are connecting into the environment so that we can watch for any problematic patterns of activity. The Terraform code for the EventBridge rule follows. The most important thing in the code is the event_pattern, which allows you to trigger alerts off of the right events.


resource "aws_cloudwatch_event_rule" "ecs-exec-notify" {
  name        = "notify-on-ecs-exec-events"
  description = "Sends alert email to SNS topic when any user uses ExecuteCommand on a running container"

  event_pattern = <<EOF
{
  "source": ["aws.ecs"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": ["ExecuteCommand"]
  }
}
EOF
}

resource "aws_cloudwatch_event_target" "ecs-exec-sns" {
  rule      = aws_cloudwatch_event_rule.ecs-exec-notify.name
  target_id = "SendToSNS"
  arn       = aws_sns_topic.alerts.arn

  input_transformer {
    input_paths = {
      ...
    }
    input_template = <<EOF
      ...
EOF
  }
}

Happy Exec’ing

And there you have it. Your infrastructure needs will likely have their own nuances and tweaks required, but I tried to give an example of each piece of the puzzle and how it all works together. Hopefully that’s all you’ll need to know to get started with ECS Exec, and start streamlining your network connectivity and security posture.

Now feel free to go terminate that bastion host instance and while you’re at it, go ahead and burn those SSH keys too.  They have no place in our ECS-Exec-enabled future.  Good luck, and happy Exec’ing!


Comments (2)

  1. 100% agreed with the main takeaway of this post: ECSExec is an absolute gamechanger. I’ve had tinker with it and I’m so happy about it that when I saw this post on my RSS I skimmed and went straight into the comment section just to reenforce the message. Start. Using. It.

    1. Thanks, Alvaro. It’s a great service, and definitely has the potential to make a lot of DevOps engineers’ lives easier!
