With a small amount of code, you can expose running Fargate tasks to the internet directly,
each on an individual subdomain — for example ${taskId}.eu-west-2.browser.reflow.io.
This is almost always a bad idea. If you really need to do it, this guide should help.
Why you shouldn’t do this
AWS provides battle-tested patterns for containerized workloads that are easier to configure, less likely to break, and right for the vast majority of use-cases:
- Application Load Balanced Fargate Service: a Fargate service fronted by an application load balancer.
  - Health checks come built in, diverting traffic from unhealthy instances before replacing them.
  - Deployments via CodePipeline monitor newly provisioned services before gradually shifting traffic.
  - Traffic distributes across availability zones for data-center resilience.
- Queue Processing Fargate Service: a Fargate service auto-scaled from an SQS queue.
  - Failed jobs retry automatically.
  - Scales up and down with asynchronous workload.
  - Handles long-lived jobs.
Why we do it anyway
Reflow runs web browsers that record and execute end-to-end tests. Recording needs a websocket to a server holding transient state. ECS on Fargate fits — no server fleet to operate, none of the complexity of Kubernetes — and our first design used the ALB-fronted pattern above. It ran into three walls:
- Isolation. We run untrusted customer code, which demands customer workloads physically isolated from each other on zero-privilege servers. ALB-fronted services make per-customer physical isolation non-trivial.
- Cost. We wanted multi-region, but keeping one warm instance plus a NAT gateway in every region is significant money for a bootstrapped startup — and there was no way to let clusters scale to zero when unused.
- Affinity. Team members can share one browser instance to collaborate on a recording. Behind a load balancer, we couldn’t guarantee two users in the same team reach the same server.
Routing directly to tasks gives us:
- Clusters that scale to zero when not in use — no warm-instance fee per region.
- Physical isolation per customer workload, while still sharing transient state within a team.
- Expected customer ids baked into each server process’s environment, simplifying authentication.
- No NAT gateway per availability zone.
The costs:
- DNS propagation means the first use of a recording instance waits roughly a minute before the server is reachable.
- More moving parts to monitor.
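One way a client could absorb the propagation delay is to poll until the task hostname resolves. This is a minimal sketch, not our production code: `waitForDns`, its options, and the injected `resolve` and `sleep` functions are hypothetical names chosen for illustration.

```typescript
// Hypothetical helper: poll until a task hostname resolves, or give up.
type Resolve = (hostname: string) => Promise<string[]>;
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise<void>((done) => setTimeout(done, ms));

async function waitForDns(
  hostname: string,
  resolve: Resolve,
  {
    attempts = 30,
    intervalMs = 5000,
    sleep = defaultSleep,
  }: { attempts?: number; intervalMs?: number; sleep?: Sleep } = {},
): Promise<string> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const [first] = await resolve(hostname);
      if (first !== undefined) {
        return first; // first resolved address
      }
    } catch {
      // Lookup failed: assume the record has not propagated yet and retry.
    }
    if (attempt < attempts) {
      await sleep(intervalMs);
    }
  }
  throw new Error(`${hostname} did not resolve after ${attempts} attempts`);
}
```

In Node, `resolve` could be `(h) => dns.promises.resolve4(h)` from the built-in `dns` module. A browser has no DNS API, so a web client would have to probe with a request to the server instead.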
Logical components
- A LetsEncrypt wildcard certificate on the task domain, e.g. *.eu-west-2.browser.reflow.io.
- A Lambda that renews the certificate monthly and alerts on failure.
- A Lambda that creates and destroys DNS records for every task in the cluster, driven by ECS state-change events.
- An ECS cluster and task configuration that runs the service on demand with a public IP.
- Application logic to deliver task hostnames to clients. We store the hostname in DynamoDB when a server boots, read it from web clients over AppSync, and bake a TeamId into the server’s environment: a Cognito custom attribute that must be signed into the client JWT on every request.
The first three components are generic: once configured, every task in the target cluster is exposed via DNS. They are what this guide covers. The last two are domain-specific; reach out if you have questions about them.
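Although the application logic is out of scope, the TeamId check is small enough to sketch. The code below assumes the JWT signature has already been verified elsewhere and only compares claims. `isAuthorizedForTask` is a hypothetical name, and the claim key assumes the attribute is called TeamId, which Cognito would expose as `custom:TeamId`.

```typescript
// Claims from a Cognito token whose signature was ALREADY verified elsewhere.
type VerifiedClaims = Record<string, unknown>;

// Assumption: the custom attribute is named TeamId, so the claim is "custom:TeamId".
const TEAM_CLAIM = 'custom:TeamId';

function isAuthorizedForTask(claims: VerifiedClaims, expectedTeamId: string | undefined): boolean {
  // Fail closed if the task was started without a team id in its environment.
  if (!expectedTeamId) {
    return false;
  }
  return claims[TEAM_CLAIM] === expectedTeamId;
}
```

A server might call this as `isAuthorizedForTask(claims, process.env.TEAM_ID)` on every request, where `TEAM_ID` is an assumed variable name.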
LetsEncrypt certificates
CDK
We manage all infrastructure with CDK. This construct owns the renewal lambda. It creates:
- an S3 bucket holding the issued certificates
- an SNS topic notifying us on renewal
- the Lambda function itself.
ReamplifyLambdaFunction is our wrapper that pre-compiles code outside CDK; a NodejsFunction works just as well.
The construct takes the hosted zone for task DNS records and the domain suffix, a workspace parameter (dev / prod) so multiple instances can live in one account, a notification email, and the target region and account.
import { Construct } from 'constructs';
import { Duration, RemovalPolicy, StackProps, Tags } from 'aws-cdk-lib';
import { BlockPublicAccess, Bucket, BucketEncryption, ObjectOwnership } from 'aws-cdk-lib/aws-s3';
import { Topic } from 'aws-cdk-lib/aws-sns';
import { EmailSubscription } from 'aws-cdk-lib/aws-sns-subscriptions';
import { ReamplifyLambdaFunction } from './reamplifyLambdaFunction';
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';
import { IHostedZone } from 'aws-cdk-lib/aws-route53';
import { Rule, Schedule } from 'aws-cdk-lib/aws-events';
import { LambdaFunction } from 'aws-cdk-lib/aws-events-targets';

interface CertbotProps {
  adminNotificationEmail: string;
  hostedZone: IHostedZone;
  domain: string;
  workspace: string;
  env: {
    region: string;
    account: string;
  };
}

export class Certbot extends Construct {
  public readonly certBucket: Bucket;

  constructor(scope: Construct, id: string, props: StackProps & CertbotProps) {
    super(scope, id);
    Tags.of(this).add('construct', 'Certbot');

    const certBucket = new Bucket(this, 'bucket', {
      bucketName: `certs.${props.env.region}.${props.workspace}.reflow.io`,
      objectOwnership: ObjectOwnership.BUCKET_OWNER_PREFERRED,
      removalPolicy: RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
      versioned: true,
      lifecycleRules: [
        {
          enabled: true,
          abortIncompleteMultipartUploadAfter: Duration.days(1),
        },
      ],
      encryption: BucketEncryption.S3_MANAGED,
      enforceSSL: true,
      blockPublicAccess: BlockPublicAccess.BLOCK_ALL,
    });
    this.certBucket = certBucket;

    const topic = new Topic(this, 'CertAdminTopic');
    topic.addSubscription(new EmailSubscription(props.adminNotificationEmail));

    const fn = new ReamplifyLambdaFunction(this, 'LambdaFn', {
      workspace: props.workspace,
      lambdaConfig: 'deploy/browserCerts.ts',
      timeout: Duration.minutes(15),
      environment: {
        NOTIFY_EMAIL: props.adminNotificationEmail,
        CERTIFICATES: JSON.stringify([
          {
            domains: [`*.${props.domain}`],
            zoneId: props.hostedZone.hostedZoneId,
            certStorageBucketName: certBucket.bucketName,
            certStoragePrefix: 'browser/',
            successSnsTopicArn: topic.topicArn,
            failureSnsTopicArn: topic.topicArn,
          },
        ]),
      },
    });

    fn.addToRolePolicy(
      new PolicyStatement({
        actions: ['route53:ListHostedZones'],
        resources: ['*'],
      })
    );
    fn.addToRolePolicy(
      new PolicyStatement({
        actions: ['route53:GetChange', 'route53:ChangeResourceRecordSets'],
        resources: ['arn:aws:route53:::change/*'].concat(props.hostedZone.hostedZoneArn),
      })
    );
    fn.addToRolePolicy(
      new PolicyStatement({
        actions: ['ssm:GetParameter', 'ssm:PutParameter'],
        resources: ['*'],
      })
    );
    certBucket.grantWrite(fn);
    topic.grantPublish(fn);

    new Rule(this, 'trigger', {
      schedule: Schedule.cron({ minute: '32', hour: '17', day: '3', month: '*', year: '*' }),
      targets: [new LambdaFunction(fn)],
    });
  }
}

The renewal lambda
Dependencies: acme-client@4.2.3.
This leans heavily on acme-client, with a scattering of logic to:
- maintain SSM parameters so one LetsEncrypt account is reused across runs, while still bootstrapping cleanly in a fresh environment
- answer LetsEncrypt DNS-01 challenges with Route53 records to prove domain ownership
- store issued certificates in S3
- notify an admin of success or failure
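The challenge-create and challenge-remove callbacks in the handler below send the same Route53 request apart from the `Action` field. As a sketch of that shared shape (the helper name `buildChallengeChange` and the zone id in the usage note are hypothetical, not part of our code), the request could be built by one function:

```typescript
type ChallengeAction = 'UPSERT' | 'DELETE';

// Builds the Route53 change request for a DNS-01 challenge TXT record.
function buildChallengeChange(
  zoneId: string,
  identifier: string, // the domain being validated, e.g. authz.identifier.value
  keyAuthorization: string,
  action: ChallengeAction,
) {
  return {
    HostedZoneId: zoneId,
    ChangeBatch: {
      Changes: [
        {
          Action: action,
          ResourceRecordSet: {
            // DNS-01 challenges are answered at _acme-challenge.<domain>.
            Name: `_acme-challenge.${identifier}`,
            Type: 'TXT',
            TTL: 60,
            // Route53 expects TXT values wrapped in double quotes.
            ResourceRecords: [{ Value: `"${keyAuthorization}"` }],
          },
        },
      ],
    },
  };
}
```

For example, `buildChallengeChange('Z123EXAMPLE', 'eu-west-2.browser.reflow.io', token, 'UPSERT')` would produce the request to create the record, and the same call with `'DELETE'` would remove it.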
import AWS from 'aws-sdk';
import acme from 'acme-client';

const route53 = new AWS.Route53();
const s3 = new AWS.S3();
const sns = new AWS.SNS();

export function assertEnv(key: string): string {
  if (process.env[key] !== undefined) {
    console.log('env', key, 'resolved by process.env as', process.env[key]!);
    return process.env[key]!;
  }
  throw new Error(`expected environment variable ${key}`);
}
export const assertEnvOrSSM = async (key: string, shouldThrow = true): Promise<string> => {
  const workspace = assertEnv('workspace');

  if (process.env[key] !== undefined) {
    console.log('env', key, 'resolved by process.env as', process.env[key]!);
    return Promise.resolve(process.env[key]!);
  } else {
    const SSMLocation = `/${workspace}/${key}`;
    console.log('env', key, 'resolving via SSM at', SSMLocation);

    const SSM = new AWS.SSM();
    try {
      const ssmResponse = await SSM.getParameter({
        Name: SSMLocation,
      }).promise();
      if (!ssmResponse.Parameter || !ssmResponse.Parameter.Value) {
        throw new Error(`env ${key} missing`);
      }
      // Do not log the value: it may be the LetsEncrypt account private key.
      console.log('env', key, 'resolved by SSM');
      process.env[key] = ssmResponse.Parameter.Value;
      return ssmResponse.Parameter.Value;
    } catch (e) {
      console.error(`SSM.getParameter({Name: ${SSMLocation}}):`, e);
      if (shouldThrow) {
        throw e;
      }
      return '';
    }
  }
};
export const writeSSM = async (key: string, value: string): Promise<void> => {
  const workspace = assertEnv('workspace');

  const SSMLocation = `/${workspace}/${key}`;
  // Do not log the value: it may be the LetsEncrypt account private key.
  console.log('env', key, 'writing to SSM at', SSMLocation);

  const SSM = new AWS.SSM();
  await SSM.putParameter({
    Name: SSMLocation,
    Value: value,
    Overwrite: true,
    DataType: 'text',
    Tier: 'Standard',
    Type: 'String',
  }).promise();
};
async function getOrCreateAccountPrivateKey() {
  let accountKey = await assertEnvOrSSM('LETSENCRYPT_ACCOUNT_KEY', false);
  if (accountKey) {
    return accountKey;
  }
  console.log('Generating Account Key');
  accountKey = (await acme.forge.createPrivateKey()).toString();
  await writeSSM('LETSENCRYPT_ACCOUNT_KEY', accountKey);
  return accountKey;
}
export const handler = async function (event) {
  const maintainerEmail = assertEnv('NOTIFY_EMAIL');
  const accountURL = await assertEnvOrSSM('LETSENCRYPT_ACCOUNT_URL', false);
  const certificates = JSON.parse(assertEnv('CERTIFICATES'));
  const accountPrivateKey = await getOrCreateAccountPrivateKey();

  acme.setLogger(console.log);
  const client = new acme.Client({
    directoryUrl: acme.directory.letsencrypt.production,
    accountKey: accountPrivateKey,
    accountUrl: accountURL ? accountURL : undefined,
  });

  const certificateRuns = certificates.map(async (certificate) => {
    const { domains, zoneId, certStorageBucketName, certStoragePrefix, successSnsTopicArn, failureSnsTopicArn } =
      certificate;

    try {
      const [certificateKey, certificateCsr] = await acme.forge.createCsr({
        commonName: domains[0],
        altNames: domains.slice(1),
      });

      const certificate = await client.auto({
        csr: certificateCsr,
        email: maintainerEmail,
        termsOfServiceAgreed: true,
        challengeCreateFn: async (authz, challenge, keyAuthorization) => {
          console.log(authz, challenge, keyAuthorization);
          const dnsRecord = `_acme-challenge.${authz.identifier.value}`;

          if (challenge.type !== 'dns-01') {
            throw new Error('Only DNS-01 challenges are supported');
          }
          const changeReq = {
            ChangeBatch: {
              Changes: [
                {
                  Action: 'UPSERT',
                  ResourceRecordSet: {
                    Name: dnsRecord,
                    ResourceRecords: [{ Value: '"' + keyAuthorization + '"' }],
                    TTL: 60,
                    Type: 'TXT',
                  },
                },
              ],
            },
            HostedZoneId: zoneId,
          };
          console.log('Sending create request', JSON.stringify(changeReq));
          const response = await route53.changeResourceRecordSets(changeReq).promise();
          const changeId = response.ChangeInfo.Id;
          console.log(`Create request sent for ${dnsRecord} (Change id ${changeId}); waiting for it to complete`);
          const waitRequest = route53.waitFor('resourceRecordSetsChanged', { Id: changeId });
          const waitResponse = await waitRequest.promise();
          console.log(
            `Create request complete for ${dnsRecord}: (Change id ${waitResponse.ChangeInfo.Id}) ${waitResponse.ChangeInfo.Status}`
          );
        },
        challengeRemoveFn: async (authz, challenge, keyAuthorization) => {
          const dnsRecord = `_acme-challenge.${authz.identifier.value}`;

          const deleteReq = {
            ChangeBatch: {
              Changes: [
                {
                  Action: 'DELETE',
                  ResourceRecordSet: {
                    Name: dnsRecord,
                    ResourceRecords: [{ Value: '"' + keyAuthorization + '"' }],
                    TTL: 60,
                    Type: 'TXT',
                  },
                },
              ],
            },
            HostedZoneId: zoneId,
          };
          console.log('Sending delete request', JSON.stringify(deleteReq));
          const response = await route53.changeResourceRecordSets(deleteReq).promise();
          const changeId = response.ChangeInfo.Id;
          console.log(`Delete request sent for ${dnsRecord} (Change id ${changeId}); waiting for it to complete`);
          const waitRequest = route53.waitFor('resourceRecordSetsChanged', { Id: changeId });
          const waitResponse = await waitRequest.promise();
          console.log(
            `Delete request complete for ${dnsRecord}: (Change id ${waitResponse.ChangeInfo.Id}) ${waitResponse.ChangeInfo.Status}`
          );
        },
        challengePriority: ['dns-01'],
      });

      // Write private key & certificate to S3
      const certKeyWritingPromise = s3
        .putObject({
          Body: certificateKey.toString(),
          Bucket: certStorageBucketName,
          Key: certStoragePrefix + 'key.pem',
          ServerSideEncryption: 'AES256',
        })
        .promise();
      const certChainWritingPromise = s3
        .putObject({
          Body: certificate,
          Bucket: certStorageBucketName,
          Key: certStoragePrefix + 'cert.pem',
        })
        .promise();

      await Promise.all([certKeyWritingPromise, certChainWritingPromise]);
      console.log('Completed with certificate for ', domains);

      // after client.auto, an account should be available
      if (!accountURL) {
        await writeSSM('LETSENCRYPT_ACCOUNT_URL', client.getAccountUrl());
      }

      if (successSnsTopicArn) {
        await sns
          .publish({
            TopicArn: successSnsTopicArn,
            Message: `Certificate for ${JSON.stringify(domains)} issued`,
            Subject: 'Certificate Issue Success',
          })
          .promise();
      }
    } catch (err) {
      console.log('Error ', err);
      if (failureSnsTopicArn) {
        await sns
          .publish({
            TopicArn: failureSnsTopicArn,
            Message: `Certificate for ${JSON.stringify(domains)} issue failure\n${err}`,
            Subject: 'Certificate Issue Failure',
          })
          .promise();
      }
      throw err;
    }
  });

  await Promise.all(certificateRuns);
};

Automatic DNS records
CDK
This wires an EventBridge rule to a lambda. It references the clusterArn whose task
state-change events we want, the serviceDiscoveryTLD to suffix records with (for us,
browser.${props.env.region}.reflow.io), and the hosted zone to write records into.
import { Rule } from 'aws-cdk-lib/aws-events';
import { LambdaFunction } from 'aws-cdk-lib/aws-events-targets';
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';

// ...

const eventRule = new Rule(this, 'ECSChangeRule', {
  eventPattern: {
    source: ['aws.ecs'],
    detailType: ['ECS Task State Change'],
    detail: {
      clusterArn: [cluster.clusterArn],
    },
  },
});

const ecsChangeFn = new ReamplifyLambdaFunction(this, 'ECSStreamLambda', {
  ...props,
  lambdaConfig: 'stream/ecsChangeStream.ts',
  unreservedConcurrency: true,
  memorySize: 128,
  environment: {
    DOMAIN_PREFIX: props.serviceDiscoveryTLD,
    HOSTED_ZONE_ID: props.hostedZone.hostedZoneId,
  },
});

eventRule.addTarget(new LambdaFunction(ecsChangeFn));

ecsChangeFn.addToRolePolicy(
  new PolicyStatement({
    actions: ['route53:GetChange', 'route53:ChangeResourceRecordSets', 'route53:ListResourceRecordSets'],
    resources: ['arn:aws:route53:::change/*'].concat(props.hostedZone.hostedZoneArn),
  })
);
ecsChangeFn.addToRolePolicy(
  new PolicyStatement({
    actions: ['ec2:DescribeNetworkInterfaces'],
    resources: ['*'],
  })
);

The DNS lambda
The function sanity-checks each event, then:
- if the task is currently RUNNING and desired RUNNING: looks up the task’s public IP and upserts an A record at ${taskId}.${DOMAIN_PREFIX}
- otherwise: deletes the task’s A record
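The decision above can be sketched as a pure function, separate from the AWS calls, which makes it easy to test against sample events. This is an illustration only: `TaskDetail`, `DnsPlan`, `planDnsChange`, and the sample ARN and ENI id are hypothetical, and the event shape is reduced to the fields the lambda reads.

```typescript
// Reduced shape of the `detail` field of an ECS Task State Change event.
interface TaskDetail {
  taskArn?: string;
  lastStatus?: string;
  desiredStatus?: string;
  attachments?: { type?: string; details?: { name?: string; value?: string }[] }[];
}

type DnsPlan =
  | { action: 'UPSERT'; hostname: string; eniId: string }
  | { action: 'DELETE'; hostname: string }
  | { action: 'SKIP' };

function planDnsChange(task: TaskDetail, domainPrefix: string): DnsPlan {
  const taskId = task.taskArn?.split('/').pop();
  if (!taskId || !task.lastStatus || !task.desiredStatus) {
    return { action: 'SKIP' }; // malformed event: do nothing
  }
  const hostname = `${taskId}.${domainPrefix}`;
  if (task.lastStatus !== 'RUNNING' || task.desiredStatus !== 'RUNNING') {
    return { action: 'DELETE', hostname };
  }
  // The public IP is looked up later from the task's network interface.
  const eniId = task.attachments
    ?.find((attachment) => attachment.type === 'eni')
    ?.details?.find((detail) => detail.name === 'networkInterfaceId')?.value;
  return eniId ? { action: 'UPSERT', hostname, eniId } : { action: 'SKIP' };
}

// Made-up sample values for illustration.
const sampleTask: TaskDetail = {
  taskArn: 'arn:aws:ecs:eu-west-2:123456789012:task/browser-cluster/0123456789abcdef0123456789abcdef',
  lastStatus: 'RUNNING',
  desiredStatus: 'RUNNING',
  attachments: [{ type: 'eni', details: [{ name: 'networkInterfaceId', value: 'eni-0abc1234' }] }],
};
```

With this sample, `planDnsChange(sampleTask, 'eu-west-2.browser.reflow.io')` plans an upsert for `0123456789abcdef0123456789abcdef.eu-west-2.browser.reflow.io`. The same task with both statuses set to `STOPPED` plans a delete.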
import type { EventBridgeHandler } from 'aws-lambda';
import AWS from 'aws-sdk';
import { Task } from 'aws-sdk/clients/ecs';

export function assertEnv(key: string): string {
  if (process.env[key] !== undefined) {
    console.log('env', key, 'resolved by process.env as', process.env[key]!);
    return process.env[key]!;
  }
  throw new Error(`expected environment variable ${key}`);
}

const ec2 = new AWS.EC2();
const route53 = new AWS.Route53();
const DOMAIN_PREFIX = assertEnv('DOMAIN_PREFIX');
const HOSTED_ZONE_ID = assertEnv('HOSTED_ZONE_ID');
export const handler: EventBridgeHandler<string, Task, unknown> = async (event) => {
  console.log('event', JSON.stringify(event));
  const task = event.detail;
  const clusterArn = task.clusterArn;
  const lastStatus = task.lastStatus;
  const desiredStatus = task.desiredStatus;

  if (!clusterArn) {
    return;
  }

  if (!lastStatus) {
    return;
  }

  if (!desiredStatus) {
    return;
  }

  const taskArn = task.taskArn;
  if (!taskArn) {
    return;
  }
  const taskId = taskArn.split('/').pop();
  if (!taskId) {
    return;
  }

  const clusterName = clusterArn.split(':cluster/')[1];
  if (!clusterName) {
    return;
  }
  const containerDomain = `${taskId}.${DOMAIN_PREFIX}`;

  if (lastStatus === 'RUNNING' && desiredStatus === 'RUNNING') {
    const eniId = getEniId(task);
    if (!eniId) {
      return;
    }

    const taskPublicIp = await fetchEniPublicIp(eniId);
    if (!taskPublicIp) {
      return;
    }

    const recordSet = createRecordSet(containerDomain, taskPublicIp);

    await updateDnsRecord(clusterName, HOSTED_ZONE_ID, recordSet);

    console.log(`DNS record update finished for ${taskId} (${taskPublicIp})`);
  } else {
    const recordSet = await route53
      .listResourceRecordSets({
        HostedZoneId: HOSTED_ZONE_ID,
        StartRecordName: containerDomain,
        StartRecordType: 'A',
      })
      .promise();
    console.log('listRecordSets', JSON.stringify(recordSet));
    const found = recordSet.ResourceRecordSets.find((record) => record.Name === containerDomain + '.');
    if (found && found.ResourceRecords?.[0].Value) {
      await route53
        .changeResourceRecordSets({
          HostedZoneId: HOSTED_ZONE_ID,
          ChangeBatch: {
            Changes: [
              {
                Action: 'DELETE',
                ResourceRecordSet: {
                  Name: containerDomain,
                  Type: 'A',
                  ResourceRecords: [
                    {
                      Value: found.ResourceRecords[0].Value,
                    },
                  ],
                  TTL: found.TTL,
                },
              },
            ],
          },
        })
        .promise();
    }
  }
};
// Returns the id of the task's network interface, if the event carries one.
function getEniId(task: Task): string | undefined {
  const eniAttachment = task.attachments?.find((attachment) => attachment.type === 'eni');
  const networkInterfaceIdDetail = eniAttachment?.details?.find((detail) => detail.name === 'networkInterfaceId');
  return networkInterfaceIdDetail?.value;
}

// Looks up the public IP associated with the network interface's primary private address.
async function fetchEniPublicIp(eniId: string): Promise<string | undefined> {
  const data = await ec2
    .describeNetworkInterfaces({
      NetworkInterfaceIds: [eniId],
    })
    .promise();
  console.log(data);

  return data.NetworkInterfaces?.[0]?.PrivateIpAddresses?.[0]?.Association?.PublicIp;
}

function createRecordSet(domain: string, publicIp: string) {
  return {
    Action: 'UPSERT',
    ResourceRecordSet: {
      Name: domain,
      Type: 'A',
      TTL: 60,
      ResourceRecords: [
        {
          Value: publicIp,
        },
      ],
    },
  };
}
async function updateDnsRecord(clusterName, hostedZoneId, changeRecordSet) {
  let param = {
    ChangeBatch: {
      Comment: `Auto generated Record for ECS Fargate cluster ${clusterName}`,
      Changes: [changeRecordSet],
    },
    HostedZoneId: hostedZoneId,
  };
  await route53.changeResourceRecordSets(param).promise();
}

Running this in production
Two months in production. Not perfect, but working well.
Things we worried about unnecessarily:
- Record accumulation. We expected error conditions to leak DNS records that never get removed. Many thousands of records later, it hasn’t been an issue.
- Route53 throttling. We’ve seen it a handful of times; the lambdas retry automatically and the change eventually lands.
Real negatives:
- Browsers sometimes refuse to see a new DNS record until refresh, even past the TTL — this surfaced as flakiness in our end-to-end tests, and we had to automate around it.
- Orchestration logic is considerably more complex when you manage individual ECS tasks instead of a service behind a load balancer.