Forem: Konstantin Troshin

Fortifying federated access to AWS via OIDC

Konstantin Troshin — Fri, 12 Aug 2022 22:02:00 +0000

To avoid managing numerous long-term IAM users, AWS provides federated access options that include SAML 2.0 and OIDC identity providers (IDPs). Whereas the SAML option is used by many of our customers and there are numerous examples of how to set it up, examples of OIDC use are much scarcer. Thus, while selecting our own method of access federation, we decided to try OIDC out to get a better understanding of its limits and advantages and to be able to better advise our customers.

Differences between SAML and OIDC identity federation

To demonstrate the key differences between OIDC and SAML, I have created a small repo that lets you deploy Keycloak on an EC2 instance and then configure the SAML and OIDC clients for use with AWS.
For those unfamiliar with Keycloak, it is an open source Identity and Access Management tool sponsored by Red Hat and widely used as an identity provider by many of our customers and by ourselves. Among other features, Keycloak supports the SAML and OIDC protocols for identity management and provides user federation via LDAP, which makes it possible to use it with an existing user base from an Active Directory. After deploying Keycloak and configuring the SAML and OIDC clients, we can use Keycloak to log in to AWS.
The SAML login can be performed by going to https://auth.${TF_VAR_root_dn}/realms/awsfed/protocol/saml/clients/amazon-aws where ${TF_VAR_root_dn} is the subdomain you need to create before the deployment. After entering the credentials for the user testuser that is created by the deployment scripts, we get redirected to the AWS console for the AWS account to which Keycloak has been deployed. If we had assigned multiple roles to the same Keycloak group (or multiple groups to testuser), a page like the one below would appear (it will look familiar to everyone who has already used SAML federation with AWS).

If you like to experiment and have deployed everything from the repo, you can go to the network tab of the browser's developer tools, find the saml document there and copy its contents.

Save the contents as aws-saml/assertion and run saml.sh from the same folder. If you are fast enough (by default, the SAML assertion for AWS is valid for only 5 minutes), assuming the role should work for the first role but fail for the second. If you look at the trust policies for the corresponding roles (whose names should end with _Federated_Admin-SAML and _Federated_Admin-SAML2, respectively), you will see that they are identical and allow the AssumeRoleWithSAML operation for the same SAML provider. So, why is access granted for the first role and denied for the second? This is because AWS actually checks the SAML assertion itself for the presence of the role that you try to assume. Looking at the script we ran to configure Keycloak, we can see these two lines:

kcadm create "clients/$clientId/roles" -r ${REALM_NAME} -s "name=$(terraform output -raw role_arn),$(terraform output -raw provider_arn)" -s 'description=AWS Access'
kcadm add-roles -r ${REALM_NAME} --gname "${GROUP_NAME}" --cclientid 'urn:amazon:webservices'  --rolename "$(terraform output -raw role_arn),$(terraform output -raw provider_arn)"

These lines create an entry for the first role (the one without 2) in Keycloak and map this role to a group aws_access that is later assigned to our testuser. Thus, this role shows up in the SAML assertion and can be assumed. Since the same thing does not happen for the second role, the access to it is denied to testuser (of course, this would change if you created the corresponding entry and mapping in Keycloak for this one too).
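The check that AWS performs on the assertion can be approximated locally. The following is a minimal sketch, assuming the copied SAML response is base64-encoded (as it is in the browser form field) and that each role entry has the order role ARN, provider ARN used by the script above. The function names and the sample ARNs are made up for illustration; only the attribute name is the one AWS expects.

```python
import base64
import xml.etree.ElementTree as ET

# Attribute under which AWS expects the role/provider ARN pairs.
ROLE_ATTRIBUTE = "https://aws.amazon.com/SAML/Attributes/Role"
SAML_ASSERTION_NS = "urn:oasis:names:tc:SAML:2.0:assertion"


def roles_in_response(saml_response_b64: str) -> list[tuple[str, str]]:
    """Return the (role_arn, provider_arn) pairs listed in a base64-encoded SAML response."""
    root = ET.fromstring(base64.b64decode(saml_response_b64))
    pairs = []
    for attribute in root.iter(f"{{{SAML_ASSERTION_NS}}}Attribute"):
        if attribute.get("Name") != ROLE_ATTRIBUTE:
            continue
        for value in attribute.iter(f"{{{SAML_ASSERTION_NS}}}AttributeValue"):
            text = (value.text or "").strip()
            if "," in text:
                # Assumed order: role ARN first, provider ARN second.
                role_arn, provider_arn = text.split(",", 1)
                pairs.append((role_arn, provider_arn))
    return pairs


def assertion_allows(saml_response_b64: str, role_arn: str) -> bool:
    """True if the role shows up in the assertion, i.e. Keycloak mapped it to the user."""
    return any(role == role_arn for role, _ in roles_in_response(saml_response_b64))


# Hypothetical sample response with a single role entry (made-up ARNs).
SAMPLE_XML = f"""<samlp:Response xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"
    xmlns:saml="{SAML_ASSERTION_NS}">
  <saml:Assertion><saml:AttributeStatement>
    <saml:Attribute Name="{ROLE_ATTRIBUTE}">
      <saml:AttributeValue>arn:aws:iam::123456789012:role/demo_Federated_Admin-SAML,arn:aws:iam::123456789012:saml-provider/keycloak</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement></saml:Assertion>
</samlp:Response>"""
SAMPLE_B64 = base64.b64encode(SAMPLE_XML.encode()).decode()

print(roles_in_response(SAMPLE_B64))
```

With the sample above, the first role is found in the assertion while a role ending in -SAML2 is not, which mirrors the behavior observed with saml.sh.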

But what about OIDC? Running the ./oidc.sh script from the aws-oidc folder, we can see that our testuser can assume the role for which our OIDC provider is listed in the trust policy. A closer look at this policy shows that it contains only two things: the ARN of the OIDC provider and the client ID as aud. This corresponds to what the AWS Console does if a role is created there.

Also note that (as opposed to the SAML case), there was no need to do anything in Keycloak after running terraform scripts in the aws-oidc folder. What does this mean? Well, in the case of OIDC, AWS does not check for any role or group assignments in the ID token. The only two things that matter with the default settings are the IDP itself (which is defined by the URL and the thumbprint as you can clearly see from the openid.tf file) and the client ID (defined in the aud section of the trust policy).

{
  "exp": 1657326250,
  "iat": 1657322650,
  "auth_time": 0,
  "jti": "a valid id must be here",
  "iss": "https://our.domain/realms/somerealm",
  "aud": "THISISWHATMATTERS",
  "sub": "typically_this_is_the_user_id",
  "typ": "ID",
  "azp": "the_same_as_aud",
  "session_state": "another id is here",
  "at_hash": "some stuff",
  "sid": "and yet another id",
  "email_verified": true,
  "groups": [
    "group1",
    "group2",
    "group3",
    "group4",
    "group5"
  ],
  "preferred_username": "some_user",
  "email": "some_user@our.domain",
  "username": "some_user"
}
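If you want to look at the claims of your own token, the payload can be decoded with a few lines of Python. This is a minimal sketch that does not verify the signature, so it is suitable for inspection only; the helper names and the demo token are assumptions made for illustration.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload (second segment) of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore the padding first.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def _segment(data: dict) -> str:
    """Encode a dict the way JWT segments are encoded (helper for the demo token only)."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical, unsigned demo token; a real ID token comes from Keycloak.
demo_token = ".".join([
    _segment({"alg": "none"}),
    _segment({"aud": "THISISWHATMATTERS", "sub": "some-user-id"}),
    "signature-placeholder",
])

claims = decode_jwt_payload(demo_token)
print(claims["aud"], claims["sub"])
```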

This all means that any user who has access to the corresponding Keycloak realm can assume any role that trusts the IDP, which is not very granular or secure and way inferior to SAML, right? Well, that would be so if not for one very important thing: the way I used OIDC in this example is not how it is supposed to be used. Let's look at the oidc.sh script more closely.

function getClientSecret(){
  kcadm get -r ${REALM_NAME} "clients/$(getClientId ${1})/client-secret" | jq -r '.value'
}

Here, I use kcadm.sh (which is containerized and somewhat hidden behind source ../kcadm.sh) to get the client secret for the Keycloak OIDC client. This operation requires admin rights and is equivalent to a Keycloak administrator handing a client secret to a user. This secret is then used together with the username and password of testuser to directly obtain the ID token, which is in turn submitted to AWS STS. Of course, as a Keycloak admin I would never do this outside a test environment, because the client secret (which is bound to the client ID that is, in turn, specified in the IAM trust policy) is not meant to be available to users. But what is it for, then? The AWS documentation on OIDC mentions an identity broker. And this identity broker (which, unlike in the SAML case, is not provided by AWS) is what the client ID and secret are actually intended for.
So, what is an identity broker anyway? An identity broker (IB) is an application that functions as a link between AWS and Keycloak and takes over the management of user rights (it should know which user is allowed to assume which role). A proper OIDC login flow is started by the IB, which redirects the user to the IDP (Keycloak in our case). After verifying the user credentials, the IDP provides the ID token for that user to the IB. The IB uses the client ID and secret to authenticate itself against the IDP. As you can also see from the oidc.sh script, it would be a bad idea to hand the ID token to the user, because a combination of the role ARN and the ID token is all you need to assume a role with OIDC.
Instead, the IB should check whether the user has access to the requested role, then use the ID token to get temporary AWS credentials (via the AssumeRoleWithWebIdentity operation), and finally return these credentials to the user (or use them to get the login URL for the AWS console). In my demo above, I use cURL as an IB, which is obviously a very poor choice for a production environment since it grants any user access to any role.
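The core decision an IB has to make can be sketched in a few lines. The following is only an illustration under stated assumptions, not the broker we use: the ROLE_ACCESS table, the function names and the ARNs are hypothetical, and the STS call is injected as a callable so that the sketch runs without AWS access.

```python
from typing import Callable, Iterable, Mapping

# Hypothetical policy table: which Keycloak groups may assume which role.
ROLE_ACCESS: Mapping[str, set] = {
    "arn:aws:iam::123456789012:role/demo_Federated_Admin-OIDC": {"aws_access"},
}

# (role_arn, session_name, id_token) -> temporary credentials
AssumeFn = Callable[[str, str, str], dict]


class AccessDenied(Exception):
    """Raised when the user is not entitled to the requested role."""


def broker_credentials(username: str, groups: Iterable[str], role_arn: str,
                       id_token: str, assume: AssumeFn) -> dict:
    """Check the user's entitlement first, and only then exchange the ID token."""
    allowed_groups = ROLE_ACCESS.get(role_arn, set())
    if not allowed_groups & set(groups):
        raise AccessDenied(f"{username} may not assume {role_arn}")
    # The ID token never leaves the broker; the user only receives the credentials.
    return assume(role_arn, username, id_token)


# With boto3 installed, `assume` could wrap the real STS call, for example:
#   sts = boto3.client("sts")
#   assume = lambda role, session, token: sts.assume_role_with_web_identity(
#       RoleArn=role, RoleSessionName=session, WebIdentityToken=token)["Credentials"]
def fake_assume(role_arn: str, session_name: str, id_token: str) -> dict:
    """Stand-in for STS so that the sketch can run offline."""
    return {"AccessKeyId": "ASIAEXAMPLE", "SessionName": session_name}
```

Calling broker_credentials for testuser with the aws_access group returns the (fake) credentials, whereas a user without that group gets an AccessDenied exception before STS is ever contacted.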

Hardening the OIDC-based roles

Whereas a proper identity broker minimizes the risk of OIDC access to AWS being misused, the experiments above made me wonder whether it is possible to get AWS STS to look at user attributes from the ID token and not only at the client ID (aud) and the IDP itself. Looking at the documentation for GitHub as an IDP (which also uses OIDC), I saw that there is another attribute, sub, that is used in trust policies. For Keycloak, the default value of sub is the user ID, which is not very useful, but Keycloak has mappers that can be assigned to clients and can override the defaults. Experimenting with mappers, I discovered that it is indeed possible to get Keycloak to provide any LDAP user attribute (we use LDAP user federation in our environment) as sub to AWS. The only caveat is that this attribute needs to be a string, so group memberships (which would be arrays) cannot be used directly to additionally secure the trust policies. It is, however, possible to use the StringLike operator to match substrings. With this operator, AWS STS can check for LDAP groups as long as they are stringified. For instance, the following trust policy checks for a certain group (provided by terraform as ${var.group}) in a group string that looks like this:
-group1-group2-group3-...

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": aws_iam_openid_connect_provider.oidc.arn
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${var.oidc_provider}:aud": var.client_id
        },
        "StringLike":{
          "${var.oidc_provider}:sub": ["*-${var.group}-*"]
        }
      }
    }
  ]
}
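To see why the group string is wrapped in dashes, the matching can be simulated locally. This is a rough sketch, not AWS's implementation: it approximates StringLike with Python's fnmatchcase (which is case-sensitive and supports the * and ? wildcards, but additionally interprets square brackets, which IAM does not), and the function names are made up for illustration.

```python
from fnmatch import fnmatchcase


def stringify_groups(groups: list) -> str:
    """Build a string like -group1-group2-group3- from a list of group names."""
    return "".join(f"-{group}" for group in groups) + "-"


def string_like(pattern: str, value: str) -> bool:
    """Approximation of the IAM StringLike operator for patterns using * and ?."""
    return fnmatchcase(value, pattern)


sub = stringify_groups(["aws_access", "developers"])
print(sub)
print(string_like("*-aws_access-*", sub))
print(string_like("*-aws_access_exclusive-*", sub))
```

The surrounding dashes are what keeps aws_access from matching inside aws_access_exclusive: a bare pattern such as \*aws_access\* would match both. The flip side is that this scheme assumes group names do not themselves contain dashes, since a group called aws_access-foo would also satisfy the pattern for aws_access.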

So, where could this group string come from? One option would be to write a custom plugin for Keycloak, and another would be to let the IB (which is a custom app) handle this. My repo actually contains such a custom mapper (the next section discusses it in a bit more detail) that should be active in your Keycloak if you deployed it as described above. To see the mapper in action, we can run the ./oidc_protected.sh script from the aws-oidc folder. As you will see, you are able to assume the first role but not the second one.

Why? Let's take a look at the trust policies: the one for the first role contains the aws_access group, which as we know is assigned to our testuser, whereas the one for the second role refers to the aws_access_exclusive group, which does not even exist in Keycloak yet. So, even though our user had a valid ID token, it was not possible to assume a protected role because this token did not contain the correct group. If you want to verify that access is granted once you create the corresponding group and assign it to testuser, and also to take a look at the new Keycloak UI (which is in preview for Keycloak 18.0.2), you can do so at https://auth.${TF_VAR_root_dn}.
In this case, you need to use the admin credentials (admin and ${TF_VAR_keycloak_password}, defined in export.sh). Once the group is created and assigned, access works as expected. Sweet!

Developing custom mappers for Keycloak

The Keycloak documentation mentions "JavaScript" providers which can be used to create custom mappers. Having read "JavaScript", I was expecting to write something like this:

function stringifyGroups(groups){ 
    return groups.reduce((current, element)=>{ 
        return current+"-"+element; 
    }, "")+"-"; 
} 
token.setOtherClaims("sub",stringifyGroups(token.groups));

and then place this into a .jar file as described in the documentation.
It turned out that it does not work like that at all. Firstly, custom scripts are disabled by default, as I found out by looking at the Keycloak logs. To fix this, one needs either to activate the preview features or to enable the scripts option alone as described here.
Secondly, "JavaScript" turned out to be very Java-based and needs to call methods of the corresponding Keycloak Java classes:

var res="";
var forEach = Array.prototype.forEach;
forEach.call(user.getGroupsStream().toArray(), function (group) {
  res=res+"-"+group.getName();
});
res=res+"-";
exports=res;

The repo shows how it all comes together.

Conclusions

In conclusion, both SAML2 and OIDC are great options for access federation with AWS, and each has its advantages and drawbacks. If you decide to use OIDC like us, you need an identity broker (IB) that provides a link between an IDP (such as Keycloak) and AWS. It would be unwise and potentially dangerous to provide ID tokens directly to the federated users, because a combination of such a token with a role ARN is usually enough to assume that role. Of course, it would be even more unwise to provide the secret of an AWS-trusted client to the users. A combination of the StringLike operator and Keycloak mappers can be used to increase the security of OIDC-federated AWS accounts by restricting access to roles based on user attributes such as group membership, similar to how SAML2 federation works.

Certbot as an init container for AWS ECS.

Konstantin Troshin — Mon, 08 Aug 2022 17:00:00 +0000

Encryption in transit has become a security standard for most network-based applications and is requested by the majority of our customers for all applications we help them build or manage. Most modern applications support TLS out of the box but require the certificate and the corresponding private key to be provided externally.
In some cases (for example, for intranet apps), self-signed certificates (or certificates signed by an internal CA) are sufficient, but if the application is internet-facing and needs to be usable without additional steps on the client side, a certificate signed by a commonly trusted certificate authority (CA) is required. For AWS-based applications (as you may have guessed from the title, AWS is the main focus of this post), AWS Certificate Manager (ACM) can be used in combination with a load balancer to provide an Amazon-signed certificate. This simple and efficient method is not applicable, however, if the certificate and the corresponding private key need to be provided to the application directly instead of to an AWS-managed load balancer. This can be the case if the application uses TLS in combination with its own protocol, which would make TLS termination on the load balancer impossible. Let's Encrypt is an open CA that provides trusted certificates, which can be acquired with any tool that supports the ACME protocol. The certificate and private key can then be provided to the application directly and also used for custom TLS-based protocols. Certbot is one such tool and can be used to obtain the TLS credentials.

The use case

Recently, I was asked to provide a publicly accessible Neo4j database for development purposes. Since Neo4j is available as a docker container, I chose to run it on AWS ECS (a Kubernetes-based solution such as EKS would be overkill for such a simple use case). To get started, I deployed a Network Load Balancer (NLB) with three listeners and the corresponding target groups for ports 7474 (HTTP), 7473 (HTTPS), and 7687 (bolt). To improve the security of the database, I decided to activate TLS for the bolt and HTTPS endpoints.
Neo4j supports both out of the box, but requires the certificates to be provided externally. My initial approach was to use TLS listeners in combination with an Amazon-signed ACM certificate and TLS target groups to talk to the Neo4j container. I used openssl to create a self-signed certificate and provided it to Neo4j via an ECS mount point. This worked just fine for the HTTPS endpoint but not for bolt, which is crucial for the Neo4j clients. It became clear that TLS termination would not work for this use case and that I needed to use TCP listeners and target groups and to provide the public-facing certificate directly to Neo4j. Since the customer wanted the database to be easily accessible to clients without much configuration on their side, I also wanted this certificate to be publicly trusted.
In many of our k8s-based projects, we use cert-manager, which can directly obtain Let's Encrypt (LE) certificates, and this gave me the idea of using LE for my current task. Thinking about k8s and init containers, I also remembered reading about container dependencies in ECS, so I came up with the idea of using a certbot docker container as such an "init container" for my Neo4j database. The schematic architecture is depicted below and includes an EC2 ECS host on which three containers run. First, the certbot container is started and requests the certificate for the corresponding domain. Once the certificate and the private key are there, the certbot container exits successfully, upon which the second container (copier) is started. This container just needs a shell (I decided to use debian:latest for this purpose), and its job is to copy the certificate and the private key into the folders and under the file names Neo4j expects. Upon the successful exit of this container, the Neo4j container is finally started.

To achieve the correct order of the containers, AWS ECS supports the dependsOn attribute, a list of ContainerDependency objects that in turn consist of containerName and condition. The condition attribute specifies whether the referenced container should have started (START), exited (COMPLETE), exited successfully (SUCCESS), or be passing its docker health check (HEALTHY). In the present use case, SUCCESS is the correct condition, since both certificate retrieval and copying are crucial for the Neo4j container to work properly (the copier container is called debian here):

  {
    "dependsOn": [
      {
        "containerName": "certbot",
        "condition": "SUCCESS"
      },
      {
        "containerName": "debian",
        "condition": "SUCCESS"
      }
    ],
...
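The resulting start order can be checked with a short script. The following is a minimal sketch, assuming a task with the three containers described above. The image names, the essential flags and the dependsOn entry of the copier are my assumptions for illustration (only the dependsOn block of the Neo4j container is taken from the fragment above), and the script does not talk to AWS.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical, heavily shortened container definitions for one ECS task.
container_definitions = [
    {"name": "certbot", "image": "certbot/certbot", "essential": False},
    {
        "name": "debian",  # the copier container
        "image": "debian:latest",
        "essential": False,
        "dependsOn": [{"containerName": "certbot", "condition": "SUCCESS"}],
    },
    {
        "name": "neo4j",
        "image": "neo4j",
        "essential": True,
        "dependsOn": [
            {"containerName": "certbot", "condition": "SUCCESS"},
            {"containerName": "debian", "condition": "SUCCESS"},
        ],
    },
]


def start_order(definitions: list) -> list:
    """Derive the order in which the containers can start from their dependsOn lists."""
    graph = {
        definition["name"]: {dep["containerName"] for dep in definition.get("dependsOn", [])}
        for definition in definitions
    }
    # static_order() yields every container after all containers it depends on
    # and raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


print(start_order(container_definitions))
print(json.dumps(container_definitions[2]["dependsOn"], indent=2))
```

Marking the two helper containers as non-essential matters here: they are expected to exit, and the exit of an essential container would stop the whole task.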

Routing to certbot

A small challenge for the architecture above is to ensure that certbot can solve the HTTP challenge of Let's Encrypt, which is a part of the ACME protocol (this is needed to verify that the domain for which the certificate is requested is indeed controlled by us). The problem here is that if targets of type instance are used with the load balancer (which makes sense for an ECS EC2 host), health checks are mandatory. On the other hand, since certbot runs only for a short time, it cannot itself be used for health checks on port 80. Also, LE expects the domain to be already routable to the certbot instance requesting the certificate, which means that the typical registration delay of load balancers is not acceptable in this case. As a result, the instance should be registered with the corresponding target group of the NLB and already be healthy before the certbot container is even started. To address this issue, I decided to use a simple trick based on a small handshaker app. This app provides a Go-based HTTP server listening on a specified port that simply replies "OK" to any request and can be deployed as a scratch-based docker container (or a binary). Since the app must not block port 80 (which is required by certbot once it is ready to accept the HTTP challenge), I configured the corresponding target group (TG80) to forward to port 80 but to run its health check on another port (6666), which I then assigned to handshaker. To ensure the correct timing, I added the start of the app to the user data script of the ECS EC2 instance and had terraform (with which the whole infrastructure is built) register the auto scaling group that deploys these instances with TG80.

docker run -d -e HEALTH_CHECK_PORT=6666 -p 6666:6666 \
${SOME_ACCOUNT_ID}.dkr.ecr.eu-central-1.amazonaws.com/handshaker:latest
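The handshaker itself is written in Go and is not shown here, but the idea fits into a few lines. The following Python sketch is an illustrative equivalent, not the actual app: it answers "OK" to every GET request, and the make_server helper and its host parameter are assumptions made for this example.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Reply 200 "OK" to every GET request, whatever the path."""

    def do_GET(self):
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Health checks arrive every few seconds; keep the log quiet.
        pass


def make_server(port: int, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Create (but do not start) the health check server on the given port."""
    return ThreadingHTTPServer((host, port), OkHandler)


# As a container entrypoint, one would read the port from the environment, e.g.:
#   import os
#   make_server(int(os.environ.get("HEALTH_CHECK_PORT", "6666"))).serve_forever()
```

Because the server only ever listens on the health check port, port 80 stays free for certbot.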

As expected, shortly after terraform apply, the instance was registered with TG80 and became healthy. After this, I used the AWS CLI to scale the ECS service to 1 task (I usually deploy ECS services with an initial task count of 0, so that the whole infrastructure, such as load balancers, instances and Route53 entries, is available before the containers are even started).

To my delight, certbot successfully requested the certificate, passed the HTTP challenge and stored the results in a shared folder mounted via a mount point. After this, the following script ran in the copier container followed by the successful start of Neo4j.

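#!/bin/bash
#Abort on the first error so that the container exits non-zero and the SUCCESS condition is not met
set -euo pipefail

#The le folder will be mounted from the host and filled by certbot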
cp /le/live/"${DOMAIN}"/cert.pem /home/neo4j/certificates/bolt/public.crt
cp /le/live/"${DOMAIN}"/privkey.pem /home/neo4j/certificates/bolt/private.key
#from here, we just need to create some more copies
cp /home/neo4j/certificates/bolt/private.key /home/neo4j/certificates/https/
cp /home/neo4j/certificates/bolt/public.crt /home/neo4j/certificates/https/
cp /home/neo4j/certificates/bolt/public.crt /home/neo4j/certificates/bolt/trusted/
cp /home/neo4j/certificates/bolt/public.crt /home/neo4j/certificates/https/trusted/

chown -R 7474 /home/neo4j/certificates #so that the neo4j user can read 'em

Alternatives

Of course, the described approach is not the only way to get a certificate from LE and provide it to a Neo4j container (or another application). Some of the simple alternatives I can immediately think of would be:

Run certbot directly on the EC2 host instead of a container
Use k8s/k3s/k0s in combination with cert-manager
Build a custom container that has certbot inside of it

That being said, I think that the init container approach shows a way of using ECS similarly to k8s pods and can be successfully applied to other ECS-based solutions. It also makes it possible to use the upstream containers, which keeps upgrades seamless as opposed to the "one custom container" approach. In case you wonder how the hell I could run a custom bash script within an upstream debian container: I just used a mount point to mount a folder from the host that is created and filled by the user data script during the EC2 deployment.

...
mkdir -p /home/xtra #prepare the xtra folder that will be attached to the debian container
echo "${CERT_SCRIPT}" | base64 -d >/home/xtra/copy_certs.sh
chmod +x /home/xtra/copy_certs.sh
...

Scaling

In the current example, I used an auto scaling group with just one instance in it, which allows all the mount points to be folders on this instance. Of course, the local folder solution would not scale well to multiple instances. In that case, however, EFS can be used instead, so that the certificate and the key are requested just once by one of the certbots (certbot exits automatically if a valid certificate is already present) and can then be used by all of the corresponding Neo4j containers. All other services used in the infrastructure above (NLB, NAT Gateway, ECS) support horizontal scaling, so a solution based on this approach can be scaled out with ease.

Conclusions

In conclusion, AWS ECS provides a nice way to implement k8s-like "init containers" by using container dependencies and non-essential containers. These can be employed for a variety of purposes, including the generation of TLS certificates with a certbot container. The TLS credentials can then be immediately provided to an application running in the essential container on the same host, resulting in a publicly trusted, secured app.