
Why You Should Deploy Services with ECS

This article was auto-translated from Chinese. Some nuances may be lost in translation.

I really hate using AWS in small teams, but if you unfortunately have to build a web service on AWS, then I’d strongly recommend using ECS.

As for why it’s unfortunate, we’ll talk about that next time 🥲.

What ECS Is

ECS (Elastic Container Service) is AWS’s container orchestration service. It manages the lifecycle of your Docker containers for you. Before getting started, there are a few core concepts:

Cluster

A logical grouping of compute resources. You can think of it as a data center, where all your containers run inside this cluster. The cluster itself doesn’t cost anything; what costs money are the compute resources running inside it.

Task Definition

A container spec sheet. It defines which Docker image to use, how much CPU and memory to allocate, environment variables, port mappings, log settings, and more. Every change creates a new revision, which makes tracking and rollback easy.

Task

An actual container instance running according to a Task Definition. One Task Definition can run many Tasks at the same time. Tasks can be long-running or one-off.

Service

A higher-level abstraction that manages Tasks. It ensures a specified number of Tasks keep running, and if one dies, it automatically starts a new one. It also integrates with ALB, networking, and Security Groups to route traffic to healthy Tasks.

Fargate

AWS’s serverless compute engine. With Fargate, you don’t need to manage EC2 instances yourself. You just tell AWS how much CPU and memory your container needs, and AWS handles the underlying machines. The other option is the EC2 launch type, where you manage the machines yourself. It offers more flexibility, but also more operational burden.

In short, the relationship is: a Cluster contains multiple Services, each Service runs multiple Tasks based on a Task Definition, and Tasks actually run on Fargate or EC2.
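
The relationship above can be sketched in Terraform. This is a minimal illustration rather than a complete setup: the names, sizes, and image URI are placeholders, and networking and IAM roles are omitted.

```hcl
# A cluster is just a logical namespace; it costs nothing by itself.
resource "aws_ecs_cluster" "main" {
  name = "main"
}

# The task definition is the container spec sheet: image, CPU, memory.
resource "aws_ecs_task_definition" "app" {
  family                   = "app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"
  # execution_role_arn (needed to pull from ECR) omitted

  container_definitions = jsonencode([{
    name         = "app"
    image        = "<account>.dkr.ecr.ap-northeast-1.amazonaws.com/app:abc1234"
    portMappings = [{ containerPort = 3000 }]
  }])
}

# The service keeps the desired number of tasks running on Fargate.
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"
  # network_configuration (subnets, security groups) omitted
}
```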

If You Must Use AWS, Just Consider ECS

Looking at the list above, the instinctive reaction is that simply spinning up an EC2 instance and running everything there would be more straightforward.

On the surface, yes. But managing a single EC2 instance is like raising a pet: it gets sick, it ages, and it needs constant care from you (all of the following are real disasters 😢):

  • OS security updates: kernel patches, glibc vulnerabilities, OpenSSL updates. If you don’t update, you’re basically running naked; if you do update, you might break your application
  • Memory management: too little swap and you get OOM-killed; too much and performance slows down. When OOM happens, you may not even be able to SSH in, and can only stare helplessly at the console
  • Log rotation: forget to configure it, and six months later the disk fills up and the service dies
  • Process management: if the application crashes, who restarts it? systemd? pm2? Each one requires extra setup
  • SSH key management: who has permission to log in? Has the key been removed after someone left? (Better not know the answer)
  • Deployment method: rsync? scp? Writing your own deploy script? Every deployment becomes a prayer
  • Environment drift: an EC2 instance that has been running for half a year will not be the same as a newly created one. Nobody remembers what was installed or manually changed, and nobody dares to rebuild it

The philosophy of containers is the opposite: every deployment starts from a clean image, the environment doesn’t drift, and if something breaks, just delete it and start over.

Some people will say EC2 can also prevent environment drift through Launch Templates or AMIs, and you can use Ansible to keep every instance in the same state. But in practice, you still need to consider:

  • ECS fundamentally starts containers from Docker images, and Dockerfiles usually live in the same repository as the application code, so it’s easy to track and debug locally. Ansible, however, is often managed separately, which indirectly adds mental overhead for developers
  • To manage AMIs / Launch Templates, you also need to introduce another tool specifically for “managing AMIs,” such as Packer, which is overkill for many applications

Based on my own experience and that of people around me (my impression over three years, plus four friends), most larger companies nowadays have basically moved toward containerization.

If configured well, it can even be cheaper than VPS instances. Containerization also has major operational advantages, such as environment consistency, reproducible deployments, better resource utilization, and horizontal scaling.

The next decision is ECS or EKS.

EKS Has a Higher Barrier Than You Think

EKS is AWS’s managed Kubernetes.

It sounds great, but Kubernetes itself is already highly complex: Pods, Deployments, Services, Ingress, ConfigMaps, Secrets, Namespaces, Helm charts, plus all kinds of CRDs. EKS only manages the control plane for you; everything else is still on you.

If your team has dedicated SREs and already uses Kubernetes, EKS is a reasonable choice. But for most teams of 3–10 people, it’s overkill. ECS is much simpler, and compared with EC2, it has several advantages.

If You Don’t Want to Manage Machines, Use Fargate

With ECS + Fargate, you only need to define which image your container runs and how much CPU and memory it needs, and AWS handles the underlying machines. All the OS patching, disk monitoring, and SSH key management mentioned above become unnecessary. You pay for the time the container actually runs, not for an idle machine.

Built-in Blue/Green Deployments

The concept of Blue/Green deployment is straightforward: the version currently running in production is Blue, and the new version you deploy runs on a separate set of containers called Green. Both sets exist at the same time, but traffic still goes to Blue. You can first verify that Green works through a test listener (for example, one listening on port 8080), and once you've confirmed it, switch traffic over.

Compared with traditional rolling updates, the biggest advantages of Blue/Green are:

  • You can test before switching: the new version is already running in the production environment, with the same database and the same environment variables. You can verify functionality through the test listener instead of discovering it’s broken only after deployment
  • Traffic switching can be gradual: CodeDeploy supports Canary and Linear strategies. For example, you can shift 10% of traffic first and observe for 5 minutes, then switch all traffic if everything is fine, instead of going all in at once
  • One-click rollback when something goes wrong: Green blew up? Just click a button in the CodeDeploy console and switch back to Blue. No git revert, no rerunning the pipeline, and no manual edits to the task definition
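
As a sketch, the canary strategy described above maps onto a CodeDeploy deployment group roughly like this in Terraform; the names are placeholders, and the ECS service and load balancer wiring are omitted.

```hcl
resource "aws_codedeploy_deployment_group" "app" {
  app_name              = aws_codedeploy_app.app.name
  deployment_group_name = "app"
  service_role_arn      = aws_iam_role.codedeploy.arn

  # Shift 10% of traffic to Green, observe for 5 minutes, then shift the rest.
  deployment_config_name = "CodeDeployDefault.ECSCanary10Percent5Minutes"

  deployment_style {
    deployment_type   = "BLUE_GREEN"
    deployment_option = "WITH_TRAFFIC_CONTROL"
  }

  # One-click (in fact automatic) rollback when the deployment fails.
  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE"]
  }

  # ecs_service and load_balancer_info (prod + test listeners) omitted
}
```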

By comparison, rollback on EC2 means SSHing in, manually switching back to the old version, restarting the process, and hoping for the best. Blue/Green is much better in this regard.

The core spirit of Blue/Green deployment can be found in this article, where my excellent former colleague Henry shares how Hahow implemented Blue/Green deployments. Even though it’s been 9 years, the concept still applies.

Docker Image Management and ECR Integration

ECS integrates natively with ECR. You can specify the ECR image URI directly in the Task Definition, and once IAM permissions are set, it can be pulled.

An important principle: do not use the latest tag. Use the commit hash as the tag for every build, so you always know which version of the code is running in production and can trace issues when they happen. ECR supports tag immutability, which enforces at the platform level that an existing tag can never be overwritten.

ECR can be configured with a Lifecycle Policy to retain Docker images according to rules you define and automatically delete unused or expired ones.
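
Both ideas fit in a few lines of Terraform; the repository name and retention count here are just examples.

```hcl
resource "aws_ecr_repository" "app" {
  name                 = "app"
  image_tag_mutability = "IMMUTABLE" # existing tags can never be overwritten
}

resource "aws_ecr_lifecycle_policy" "app" {
  repository = aws_ecr_repository.app.name
  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep only the most recent 50 images"
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 50
      }
      action = { type = "expire" }
    }]
  })
}
```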

From a security standpoint, development and production environments should be fully isolated. In AWS, this is usually done by separating them into different Accounts, so ECR should also be separated by environment.

But that means the same Docker image ends up stored in different ECRs, which is a trade-off made for security reasons. Personally, I prefer sharing a single ECR source for everything. There is no absolute right answer; it depends on your team’s current goals.

Auto Scaling

ECS Services natively support auto scaling. You can automatically adjust the number of tasks based on CPU usage, memory, or ALB request count, and you can also customize it—for example, scaling dynamically based on the number of messages in SQS.

The setup is not complicated; you just define a target tracking policy.
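
For illustration, a CPU-based target tracking policy looks roughly like this in Terraform; the cluster and service names and the 60% target are placeholders.

```hcl
resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/main/app" # cluster/service names are examples
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.app.service_namespace
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = aws_appautoscaling_target.app.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 60 # keep average CPU utilization around 60%
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```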

EC2 also has Auto Scaling Groups, but then you still need to maintain AMIs, launch templates, and ensure new instances are consistent with the existing ones. ECS + Fargate scaling is just spinning up more containers: clean and straightforward.

Note that auto scaling pauses during Blue/Green deployments. You can handle this by using scripts before and after the pipeline to control the scaling policy.

Logging and Monitoring

ECS natively supports sending container stdout/stderr to CloudWatch Logs. Just set the awslogs log driver in the Task Definition. No need to install the CloudWatch Agent on EC2, set up log groups, or handle log rotation—when the container is killed and restarted, the logs are still in CloudWatch.
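
The driver is one small block inside the container definition; the log group name and region below are examples.

```hcl
# Inside the task definition's container_definitions (jsonencode'd):
{
  name  = "app"
  image = "<image URI>"
  logConfiguration = {
    logDriver = "awslogs"
    options = {
      "awslogs-group"         = "/ecs/app"
      "awslogs-region"        = "ap-northeast-1"
      "awslogs-stream-prefix" = "app"
    }
  }
}
```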

With CloudWatch Container Insights, you can directly see CPU / memory / network usage for each service and each task, without installing node exporters or setting up Prometheus yourself.

Security Considerations

ECS + Fargate has a frequently overlooked security advantage: there is no SSH.

That may sound restrictive, but it’s actually a good thing. No SSH means nobody can “temporarily” log in and change things, nobody can secretly install software, and nobody can forget to log out. All changes must go through the Task Definition and CI/CD pipeline, which makes it immutable infrastructure by design.

If you need to debug, you can use ECS Exec (based on SSM Session Manager) to enter the container, and every session is audited.

Summary

ECS is not the kind of technology choice that gets people excited, but it's sufficient, stable, and has a reasonable learning curve. In the AWS world, the boring choice is often the best choice (and usually the cheapest one).

Design the Workflow Starting from Deployment

If you ask me what the most important thing is when building container services on AWS, I’d say: get deployment right first. First figure out how code goes from a git push to running in production.

ECS deployment refers to the process of packaging the latest version of code into a Docker image, updating the Task Definition, and triggering deployment.

The deployment flow for testing and production environments should be considered separately based on their goals:

  • Test environments should prioritize flexibility and minimize the time and complexity between code merge and deployment
  • Staging environments should keep the same process as production as much as possible, ensuring that any errors can be found before release
  • Production environments should be strictly separated from test environments so developers do not directly access production

Deployment is the part of the development lifecycle that affects you for the longest time. Network design, once done, usually doesn’t change much; monitoring can be added gradually later. But deployment is something you deal with every day, every PR. If the deployment experience is bad, the entire team’s development efficiency will suffer.

In the AWS world, deployment usually goes through CodePipeline + CodeBuild + CodeDeploy. If you’re worried about vendor lock-in, you can also adjust parts of the flow.

For example, with GitHub Actions, AWS provides official Actions that can be integrated. But whichever path you choose, you still need to handle:

  1. Build the image and push it to ECR
  2. Update the Task Definition’s image URI
  3. Update the ECS Service to trigger deployment
  4. Wait for the service to stabilize

If it’s a production environment, you may also need Blue/Green deployment (via CodeDeploy), approval mechanisms, and cross-account ECR image synchronization. Each extra layer adds another chance for something to go wrong.
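
As a sketch, the four steps above map onto AWS's official Actions roughly like this; the repository, role, container name, and file names are all placeholders.

```yaml
name: deploy
on:
  push:
    branches: [main]
permissions:
  id-token: write # for OIDC auth to AWS
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<account>:role/deploy
          aws-region: ap-northeast-1
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      # 1. Build the image, tagged with the commit hash, and push it to ECR
      - run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/app:${{ github.sha }} .
          docker push ${{ steps.ecr.outputs.registry }}/app:${{ github.sha }}
      # 2. Update the Task Definition's image URI
      - id: taskdef
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: app
          image: ${{ steps.ecr.outputs.registry }}/app:${{ github.sha }}
      # 3 & 4. Trigger the deployment and wait for the service to stabilize
      - uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.taskdef.outputs.task-definition }}
          service: app
          cluster: main
          wait-for-service-stability: true
```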

The Earlier You Introduce IaC, the Better

If you manage these AWS resources manually in the Console, you will live in fear.

One day someone may accidentally change a Security Group inbound rule, or an IAM Policy may be “temporarily” modified and then forgotten. You might spend an entire day just finding the problem. Not to mention when you need to create a second environment (staging), you’ll realize you can’t even remember how production was configured.

Managing AWS resources with Terraform should be a day-one task, not something you “clean up later when you have time.”

There is also a trade-off to weigh between deployment convenience and IaC. The core idea of Terraform is to use declarative definitions to bring infra to the desired state. But in real-world usage, an ECS deployment usually only involves:

  • Declaring a new task definition with the Docker image tag for this deployment
  • Updating the Service and triggering deployment

If every deployment has to go through Terraform, it gets cumbersome, because the infrastructure itself usually isn't changing; only the image tag is. Here's the setup I use most often.

Terraform’s lifecycle block can ignore changes to the Task Definition’s image URI, so Terraform manages infrastructure and the CI/CD pipeline manages application deployment, and the two don’t fight each other. This is the most practical ECS deployment trick in my opinion:

```hcl
resource "aws_ecs_service" "app" {
  # ...
  lifecycle {
    ignore_changes = [task_definition]
  }
}
```

Without IaC, your infrastructure is a black box. Only the person who set it up in the first place knows what it looks like inside, and that person has usually already left.

How Deployment Complexity Eats Your Money

A lot of teams think “a slower deployment is fine” or “it’s just a few extra steps,” but these seemingly small frictions can be quantified when they accumulate.

  • Slow deployments
  • Too many steps per deployment
  • More manual operations, more chances for errors
  • Increased mental burden for developers
  • Reluctance to make major changes
  • Bigger mistakes waiting to happen

Once you quantify these hidden costs:

Time to Production

Assume a 5-person team, with each engineer costing about 2 million TWD per year including benefits and equipment, which comes out to roughly 1,000 TWD per hour.

| Metric | Simple Deployment | Complex Deployment |
| --- | --- | --- |
| Time per deployment | 10 minutes | 45 minutes |
| Deployments per week | 15 | 5 |
| Deployment failure rate | 3% | 15% |
| Recovery time after failure | 15 minutes | 2 hours |
| Total time spent on deployment per week | ~3 hours | ~8 hours |

Just on deployment itself, complex deployment costs 5 more hours per week than simple deployment. For a 5-person team, that’s about 25 hours wasted per week, or 1,300 hours a year, which is roughly 1.3 million TWD.
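
The arithmetic is easy to check; this snippet just reproduces the article's own numbers, not any AWS pricing.

```python
HOURLY_RATE_TWD = 1_000  # ~2M TWD/year per engineer is roughly 1,000 TWD/hour
TEAM_SIZE = 5
WEEKS_PER_YEAR = 52

# Per-engineer weekly deployment time, from the table above
simple_hours_per_week = 3
complex_hours_per_week = 8

extra_per_engineer = complex_hours_per_week - simple_hours_per_week  # 5 hours
extra_team_hours_per_week = extra_per_engineer * TEAM_SIZE           # 25 hours
extra_hours_per_year = extra_team_hours_per_week * WEEKS_PER_YEAR    # 1,300 hours
extra_cost_per_year = extra_hours_per_year * HOURLY_RATE_TWD         # 1.3M TWD

print(extra_hours_per_year, extra_cost_per_year)
```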

The Hidden Cost of Bugs

Another hidden cost of difficult deployment is a higher bug rate. Suppose each failed change requires an extra 8 hours of investigation and repair:

| Metric | High-Frequency Deployment | Low-Frequency Deployment |
| --- | --- | --- |
| Deployments per month | 60 | 20 |
| Change failure rate | 5% | 30% |
| Failures per month | 3 | 6 |
| Repair cost per failure (person-hours) | 8 hours | 16 hours (the problem is usually more complex) |
| Total monthly repair cost | 24 hours | 96 hours |

Low-frequency deployment spends 72 more hours per month fixing bugs. Over a year, that’s 864 hours, or about 860,000 TWD.
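
Again, reproducing the table's arithmetic:

```python
HOURLY_RATE_TWD = 1_000
MONTHS_PER_YEAR = 12

# failures per month (deployments x failure rate) x repair hours per failure
high_freq_repair_hours = 60 * 0.05 * 8   # 24 hours/month
low_freq_repair_hours = 20 * 0.30 * 16   # 96 hours/month

extra_hours_per_month = low_freq_repair_hours - high_freq_repair_hours  # 72
extra_hours_per_year = extra_hours_per_month * MONTHS_PER_YEAR          # 864
extra_cost_per_year = extra_hours_per_year * HOURLY_RATE_TWD            # ~860K TWD

print(extra_hours_per_year, extra_cost_per_year)
```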

And that still doesn’t include:

  • Opportunity cost: time engineers spend debugging could have been used to build new features
  • User churn: bugs reaching production affect user experience, and the loss from MAU decline is hard to quantify but absolutely real
  • Psychological cost: a team that feels like it’s defusing bombs every time it deploys cannot have good morale

The Negative Loop

The boss doesn’t understand technology and thinks the dev team keeps making mistakes every time.

When the team asks to improve the deployment process, it’s hard to see immediate product impact in the short term, so it gets postponed. Since the deployment process doesn’t improve, the costs are borne directly by the developers on the ground, trust keeps eroding, and the boss no longer believes the team’s proposals.

The cost of this cycle is much higher than it seems:

  • Compound cost of delayed improvement: from the earlier estimate, the hidden annual cost of complex deployment is about 2.25 million. If improvement is delayed by one year, that money is simply burned; delayed by two years, it’s 4.5 million
  • Turnover cost: the most direct result of low morale is people quitting. The replacement cost of an engineer is roughly 50%–150% of their annual salary (recruiting, interviews, onboarding, ramp-up). With an annual salary cost of 2 million, each replacement costs about 1–3 million. If a 5-person team loses one extra person per year because deployment feels awful, the extra cost is 1–3 million/year
  • Decision cost from trust deficit: the boss doesn’t trust the technical team, rejects technical proposals, and technical debt keeps piling up. Suppose one improvement proposal per quarter is blocked, and each would save 500,000/year. Four blocked proposals a year means missing out on about 2 million in savings

| Item | Annual Cost |
| --- | --- |
| Ongoing hidden deployment cost | 2.25 million |
| Extra turnover | 1–3 million |
| Opportunity cost from delayed technical improvements | ~2 million |
| Total | ~5.25–7.25 million/year |

That’s about the annual salary of 2–3 engineers, and it gets worse over time—the later you improve it, the more costs accumulate, the fewer people you can keep, and the harder it becomes to break the vicious cycle.

Grand Total

Conservatively speaking, for a 5-person team, the hidden annual cost of a complex deployment process is around 2–2.5 million TWD. This doesn’t even include the few thousand dollars a month silently burned by NAT Gateway, the cost of idle resources, or the time you spend trying to understand AWS bills.

That’s roughly the annual salary of a junior engineer.

Deployment is worth investing in from day one. The earlier you get this right, the more cost you save every day and every PR through compounding.

Conclusion

If your team is already tied to AWS, ECS is currently the container service option I recommend most. It’s less of a headache than EC2, more pragmatic than EKS, and with Fargate the operational burden drops significantly.

But ECS itself is only one piece of the puzzle. In practice, you still need to handle VPC network design, ALB traffic distribution, IAM permission control, ECR image management, and CloudWatch monitoring configuration.

These components are interconnected, and any misconfiguration in one of them can leave you debugging for hours. Choosing ECS is only the starting point; planning the surrounding infrastructure together is what makes it complete.

I’ve recently had similar needs across several projects, so this article is basically an integration of my own thoughts. If this article gets responses from readers, I’ll introduce how to design ECS deployment workflows and architecture in practice.

Or maybe, you don’t need AWS?