<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Richard Fan</title>
    <description>The latest articles on Forem by Richard Fan (@richardfan1126).</description>
    <link>https://forem.com/richardfan1126</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F500520%2F0400143e-d07a-4ffb-8755-07ae024b6cbd.jpeg</url>
      <title>Forem: Richard Fan</title>
      <link>https://forem.com/richardfan1126</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/richardfan1126"/>
    <language>en</language>
    <item>
      <title>What You See is What You Get - Building a Verifiable Enclave Image</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Sun, 03 Mar 2024 10:30:06 +0000</pubDate>
      <link>https://forem.com/aws-builders/what-you-see-is-what-you-get-building-a-verifiable-enclave-image-36j0</link>
      <guid>https://forem.com/aws-builders/what-you-see-is-what-you-get-building-a-verifiable-enclave-image-36j0</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt; 1. Obstacle of proving TEE&lt;br&gt;
       1.1. Image digest is meaningless&lt;br&gt;
       1.2. Stable image digest is difficult&lt;br&gt;
 2. Solution - Trusted build pipeline&lt;br&gt;
       2.1. GitHub provides the service suite we need&lt;br&gt;
       2.2. Use SigStore to sign and endorse the image&lt;br&gt;
       2.3. Putting everything together&lt;br&gt;
       2.4. How can service consumers verify the PCRs&lt;br&gt;
 3. What's beyond&lt;br&gt;
       3.1. Build log retention&lt;br&gt;
       3.2. Build pipeline still needs to be simple&lt;br&gt;
 4. Wrap up&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Link to the GitHub Action discussed in this post&lt;/strong&gt;: &lt;a href="https://github.com/marketplace/actions/aws-nitro-enclaves-eif-build-action"&gt;https://github.com/marketplace/actions/aws-nitro-enclaves-eif-build-action&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;AWS Nitro Enclaves is a Trusted Execution Environment (TEE) in which service consumers can verify that the environment is running what it claims to be running.&lt;/p&gt;

&lt;p&gt;I've previously posted about how to achieve this using attestation documents and why we should care about their content:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.richardfan.xyz/2020/11/22/how-to-use-aws-nitro-enclaves-attestation-documenta.html"&gt;How to Use AWS Nitro Enclaves Attestation Document&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.richardfan.xyz/2022/12/22/aws-nitro-enclaves-ecosystem-1-chain-of-trust.html"&gt;AWS Nitro Enclaves Ecosystem (1) - Chain of trust&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog post, I want to dive deep into achieving zero-trust between service providers and consumers on TEE, particularly AWS Nitro Enclaves.&lt;/p&gt;
&lt;h2&gt;
  
  
  Obstacle of proving TEE
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Image digest is meaningless
&lt;/h3&gt;

&lt;p&gt;Platform configuration registers (PCRs) are just the application image digests; they are generated by a one-way hashing function against the image.&lt;/p&gt;

&lt;p&gt;We cannot see what is inside the image by looking at the hash value. So &lt;strong&gt;without knowing what generated the PCRs, it's meaningless&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Service consumers who have no oversight of the application source code and build process can do nothing with this, even if they can validate the attestation document. They can only trust whoever says &lt;strong&gt;"This PCR value 'abcdef' is generated by a secure and safe application"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Service providers may ask a third-party auditor to attest to the above statement. But that is no different from getting SOC2 or ISO 27001 certified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If we are satisfied with this level of trust model, we can stop talking about TEE already. Why don't we send the SOC2 certificate to the consumers instead of the attestation document?&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Stable image digest is difficult
&lt;/h3&gt;

&lt;p&gt;If service consumers can access the application source code and the build pipeline definition, they may build the enclave image and compare the digest with the one provided in the attestation document.&lt;/p&gt;

&lt;p&gt;The problem is that generating a stable image digest is difficult; &lt;strong&gt;even a small, trivial difference at build time can make the digest entirely different&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hX8bDAJK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/d6ecba66-0607-4d84-b9f2-405005ced8e0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hX8bDAJK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/d6ecba66-0607-4d84-b9f2-405005ced8e0.jpg" alt="Build time difference can make the PCR value different" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some common sources of trivial build-time differences are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Timestamp&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some build steps inject the current timestamp into the environment (e.g. &lt;a href="https://github.com/pypa/pip/issues/5648#issuecomment-410446975"&gt;embedded timestamp in &lt;code&gt;.pyc&lt;/code&gt; files when installing Python dependencies&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This makes the resulting image dependent on the time of build.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;External dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even if we pin all dependencies to the exact version, using external sources may still cause image differences.&lt;/p&gt;

&lt;p&gt;E.g., when running &lt;code&gt;apt update&lt;/code&gt; on Ubuntu, the manifest pulled from an external source may be different than previously pulled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Other build-time randomness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are more examples that can cause image differences.&lt;/p&gt;

&lt;p&gt;E.g., Using random strings as temporary folder names.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By looking at the image digest difference alone, &lt;strong&gt;we cannot tell whether it's caused by trivial differences or by the service provider changing their source code&lt;/strong&gt;.&lt;/p&gt;
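&lt;p&gt;A toy sketch of why this happens (not the actual EIF build; just an illustration with Python's &lt;code&gt;hashlib&lt;/code&gt;): two "builds" of identical code that differ only by an embedded timestamp produce completely unrelated digests.&lt;/p&gt;

```python
import hashlib

def image_digest(payload: bytes) -> str:
    # Stand-in for an enclave image digest: a one-way hash of the image bytes.
    return hashlib.sha256(payload).hexdigest()

# Two "builds" of identical code that only differ by an embedded build timestamp.
build_a = b"app-code-v1 built_at=2024-03-03T10:00:00Z"
build_b = b"app-code-v1 built_at=2024-03-03T10:00:01Z"

digest_a = image_digest(build_a)
digest_b = image_digest(build_b)

# A one-second difference yields a completely unrelated digest.
print(digest_a)
print(digest_b)
print(digest_a == digest_b)  # False
```

&lt;p&gt;Nothing about the two digests hints that the underlying code is the same; that is exactly why digest comparison alone cannot distinguish a trivial rebuild from a malicious change.&lt;/p&gt;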
&lt;h2&gt;
  
  
  Solution - Trusted build pipeline
&lt;/h2&gt;

&lt;p&gt;To avoid the hiccups of creating a reproducible build process, we can instead &lt;strong&gt;create a trusted build pipeline that service consumers can see and trust&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To make it work on AWS Nitro Enclaves images, I have created a GitHub Action: &lt;a href="https://github.com/marketplace/actions/aws-nitro-enclaves-eif-build-action"&gt;AWS Nitro Enclaves EIF Build Action&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.richardfan.xyz/assets/images/8734c91a-5130-4590-88b7-b93684affa4a.jpg"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DPpayyG2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/8734c91a-5130-4590-88b7-b93684affa4a.jpg" alt="Using GitHub and SigStore to achieve trusted build pipeline" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  GitHub provides the service suite we need
&lt;/h3&gt;

&lt;p&gt;To achieve an end-to-end chain of trust from source code, build process, to the resulting enclave image, we need a publicly accessible and trusted code repository, build environment, and artifact store.&lt;/p&gt;

&lt;p&gt;Undoubtedly, GitHub is currently the most popular platform to host open-source code. GitHub also provides GitHub Actions as the build environment and GitHub Packages as the artifact store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By putting the entire build pipeline into GitHub, we can minimize the number of parties we need to trust.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Use SigStore to sign and endorse the image
&lt;/h3&gt;

&lt;p&gt;The other main component of the solution is SigStore.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.sigstore.dev/"&gt;SigStore&lt;/a&gt; is a set of open-source technologies to handle the digital signing of software.&lt;/p&gt;

&lt;p&gt;Using SigStore, we can easily sign the enclave image and prove to the public that this image is built by a specific pipeline run, from a particular code repository commit.&lt;/p&gt;
&lt;h3&gt;
  
  
  Putting everything together
&lt;/h3&gt;

&lt;p&gt;In this &lt;a href="https://github.com/richardfan1126/nitro-enclaves-cosign-sandbox"&gt;sample repository&lt;/a&gt;, I use the &lt;strong&gt;AWS Nitro Enclaves EIF Build Action&lt;/strong&gt; to build a Nitro Enclave image from the source code.&lt;/p&gt;

&lt;p&gt;After the artifacts are built and pushed to the GitHub Container Registry (GHCR), there will be a &lt;code&gt;cosign&lt;/code&gt; command to sign the artifact.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--umezEImZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/c53c4a6b-b2c1-4e3e-aa26-f6b0720cdf36.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--umezEImZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/c53c4a6b-b2c1-4e3e-aa26-f6b0720cdf36.png" alt="Use cosign to sign the artifact" width="759" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Several things are happening behind this command:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The OIDC token of the GitHub workflow run is used to request a signing certificate from Fulcio&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The digest of the uploaded artifacts (In this scenario, the Nitro Enclave EIF and its information) is signed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The signature is pushed to the artifact store (i.e., GHCR)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The signing certificate and the artifact signature are recorded in the Rekor transparency log&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
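&lt;p&gt;The four steps above can be sketched roughly as follows. This is a heavily simplified toy, not the real Sigstore flow: the real flow exchanges the OIDC token for a short-lived X.509 certificate from Fulcio and uses asymmetric signatures, whereas this sketch uses an ephemeral HMAC key and an in-memory list in place of Rekor (and stores the key in the log entry purely so the example can verify itself).&lt;/p&gt;

```python
import hashlib
import hmac
import secrets

rekor_log = []  # stand-in for the Rekor transparency log

def sign_artifact(artifact: bytes, workflow_identity: str) -> dict:
    # 1. In the real flow, the workflow's OIDC token is exchanged for a
    #    short-lived signing certificate from Fulcio. Here an ephemeral
    #    secret plays that role (a gross simplification).
    ephemeral_key = secrets.token_bytes(32)
    digest = hashlib.sha256(artifact).hexdigest()
    # 2. Sign the artifact digest.
    signature = hmac.new(ephemeral_key, digest.encode(), hashlib.sha256).hexdigest()
    # 3./4. Record the signing identity, digest, and signature in the log.
    entry = {
        "identity": workflow_identity,
        "digest": digest,
        "signature": signature,
        "key": ephemeral_key,  # illustrative only; Rekor stores certs, not keys
    }
    rekor_log.append(entry)
    return entry

def verify_artifact(artifact: bytes, entry: dict) -> bool:
    # Recompute the digest and check it matches the logged signature.
    digest = hashlib.sha256(artifact).hexdigest()
    expected = hmac.new(entry["key"], digest.encode(), hashlib.sha256).hexdigest()
    return digest == entry["digest"] and hmac.compare_digest(expected, entry["signature"])

eif = b"example enclave image"
entry = sign_artifact(eif, "https://github.com/user/repo/.github/workflows/build.yml")
print(verify_artifact(eif, entry))          # True
print(verify_artifact(b"tampered", entry))  # False
```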
&lt;h3&gt;
  
  
  How can service consumers verify the PCRs
&lt;/h3&gt;

&lt;p&gt;Service consumers can audit the code once the artifact is signed and pushed to the registry.&lt;/p&gt;

&lt;p&gt;To verify the PCRs they get from the attestation document are &lt;strong&gt;indeed the same as what was built by the said build pipeline&lt;/strong&gt;, they can do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Use &lt;code&gt;cosign&lt;/code&gt; to verify the artifact against the signature stored in Rekor&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign verify ghcr.io/username/repo:tag &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--certificate-identity-regexp&lt;/span&gt; https://github.com/&amp;lt;username&amp;gt;/&amp;lt;repo&amp;gt;/ &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--certificate-oidc-issuer&lt;/span&gt; https://token.actions.githubusercontent.com
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fFYRDKO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/12a8d843-4622-4056-ac48-e728b579ba70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFYRDKO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/12a8d843-4622-4056-ac48-e728b579ba70.png" alt="Use cosign to verify artifact signature" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;Validate the information on the signing certificate&lt;/p&gt;

&lt;p&gt;Users can search for the signing entry on &lt;a href="https://search.sigstore.dev/"&gt;Rekor Search&lt;/a&gt; by its log index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q0RccZVt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/d3dd1538-bba7-4d7a-8442-68469caad1fd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q0RccZVt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/d3dd1538-bba7-4d7a-8442-68469caad1fd.png" alt="Rekor search" width="712" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h9DOThP9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/c3779cbb-86f3-4aef-92da-4030d0ccaa48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h9DOThP9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/c3779cbb-86f3-4aef-92da-4030d0ccaa48.png" alt="Rekor search" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We should look carefully at the following attributes&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OIDC Issuer&lt;/strong&gt;: The token must be issued by the trusted build environment.&lt;/p&gt;

&lt;p&gt;(In this example, it must be the GitHub Actions OIDC issuer &lt;code&gt;https://token.actions.githubusercontent.com&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GitHub Workflow SHA&lt;/strong&gt;: This indicates which particular Git commit the build pipeline run is from.&lt;/p&gt;

&lt;p&gt;This helps us identify which commit to look at when auditing the source code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Build Config URI&lt;/strong&gt;: This points to the file that defines the build workflow.&lt;/p&gt;

&lt;p&gt;We should also check if the build configuration is safe, just like how we audit the application code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Runner Environment&lt;/strong&gt;: We should also ensure the build was run on GitHub-hosted runners instead of self-hosted ones, which cannot be trusted.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;Audit the code based on the information on the certificate&lt;/p&gt;

&lt;p&gt;After learning how the artifact was built, we can go to the specific commit of the code repository to audit the code.&lt;/p&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;Pull the artifact and get the PCRs&lt;/p&gt;

&lt;p&gt;After all the validation, we can use &lt;a href="https://oras.land/"&gt;ORAS&lt;/a&gt; to pull the EIF and its information.&lt;/p&gt;

&lt;p&gt;The PCR values are inside the signed text file; they can be compared with the ones given by the attestation document from the running service.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;oras pull ghcr.io/username/repo:tag@sha256:&amp;lt;digest&amp;gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q5Vj4_FE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/f25adb11-ac30-4cc4-820e-e7ee75e61cb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q5Vj4_FE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/f25adb11-ac30-4cc4-820e-e7ee75e61cb4.png" alt="Use ORAS to pull the artifact" width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;/li&gt;
&lt;/ol&gt;
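&lt;p&gt;The final step boils down to a simple comparison: every PCR value published by the trusted build pipeline must match the corresponding value in the attestation document. A minimal sketch (the PCR values here are hypothetical placeholders, not real digests):&lt;/p&gt;

```python
# PCR values published by the trusted build pipeline (hypothetical values).
pipeline_pcrs = {
    "PCR0": "a1b2c3",
    "PCR1": "d4e5f6",
    "PCR2": "0a1b2c",
}

# PCR values parsed from the running enclave's attestation document.
attestation_pcrs = {
    "PCR0": "a1b2c3",
    "PCR1": "d4e5f6",
    "PCR2": "0a1b2c",
}

def pcrs_match(expected: dict, attested: dict) -> bool:
    # Every PCR the pipeline published must appear unchanged in the attestation.
    return all(attested.get(name) == value for name, value in expected.items())

print(pcrs_match(pipeline_pcrs, attestation_pcrs))  # True
```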

&lt;h2&gt;
  
  
  What's beyond
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Build log retention
&lt;/h3&gt;

&lt;p&gt;GitHub Actions runs on public repositories can be viewed by anyone; this gives service consumers &lt;strong&gt;more confidence in the enclave application by letting them see exactly how it was built&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, the GitHub Actions log can only be retained for up to 90 days.&lt;/p&gt;

&lt;p&gt;If the service consumers want utmost scrutiny over the enclave application, service providers may need to rebuild the enclave image every 90 days so that &lt;strong&gt;build logs can be audited at any point in time&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build pipeline still needs to be simple
&lt;/h3&gt;

&lt;p&gt;Although service consumers can audit the build process in this design, that doesn't mean service providers are free to let their build process grow complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The more complex a build pipeline is, the more difficult it can be to understand what's being done under the hood&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;E.g., if the build pipeline pulls source code from an external source instead of the code repository, how can we tell from the build log what that code contains?&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Three years after AWS announced Nitro Enclaves, the support from AWS is still minimal. &lt;em&gt;(Sidetrack: My &lt;a href="https://github.com/aws/aws-nitro-enclaves-sdk-c/pull/132"&gt;PR&lt;/a&gt; on &lt;code&gt;kmstool&lt;/code&gt; is still pending review)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There is still little to no discussion on how to utilize Nitro Enclaves to achieve TEE in the real world. I hope the tools I've built can at least offer some help to the community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Link to the GitHub Action&lt;/strong&gt;: &lt;a href="https://github.com/marketplace/actions/aws-nitro-enclaves-eif-build-action"&gt;https://github.com/marketplace/actions/aws-nitro-enclaves-eif-build-action&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>nitroenclaves</category>
      <category>sigstore</category>
      <category>supplychainsecurity</category>
    </item>
    <item>
      <title>What You Need to Know About the NIST Guideline on Differential Privacy</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Thu, 22 Feb 2024 03:30:44 +0000</pubDate>
      <link>https://forem.com/richardfan1126/what-you-need-to-know-about-the-nist-guideline-on-differential-privacy-2mb6</link>
      <guid>https://forem.com/richardfan1126/what-you-need-to-know-about-the-nist-guideline-on-differential-privacy-2mb6</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt; 1. Highlights&lt;br&gt;
 2. What is the current state of privacy protection&lt;br&gt;
       2.1. Input Privacy vs Output Privacy&lt;br&gt;
       2.2. Current De-identification method doesn't work&lt;br&gt;
 3. What is Differential Privacy&lt;br&gt;
       3.1. Differential Privacy is not an absolute guarantee&lt;br&gt;
       3.2. This is one of the first major guidelines for implementation&lt;br&gt;
 4. Differential Privacy Foundations&lt;br&gt;
       4.1. Epsilon (ε)&lt;br&gt;
       4.2. Privacy Unit&lt;br&gt;
 5. Differential Privacy in practice&lt;br&gt;
       5.1. Privacy Budget to limit privacy loss&lt;br&gt;
       5.2. Adding noise to comply with the privacy budget&lt;br&gt;
 6. Challenges&lt;br&gt;
       6.1. Reduced accuracy and utility&lt;br&gt;
       6.2. Applications are still limited&lt;br&gt;
       6.3. Reduced accuracy amplifying bias&lt;br&gt;
       6.4. Security challenges&lt;br&gt;
 7. Back to the basics, data protection is paramount to privacy protection&lt;/p&gt;




&lt;p&gt;In December 2023, NIST published the first public draft of NIST SP 800-226 &lt;a href="https://csrc.nist.gov/pubs/sp/800/226/ipd" rel="noopener noreferrer"&gt;Guidelines for Evaluating Differential Privacy Guarantees&lt;/a&gt;. This is a huge milestone for the digital privacy domain.&lt;/p&gt;

&lt;p&gt;In this blog post, I'm going to tell you why it matters and what you need to know from the guideline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Highlights
&lt;/h2&gt;

&lt;p&gt;I'm trying to summarize a sixty-page guideline into one blog post, but it's still too long. So, I'm putting the highlights at the beginning for your convenience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Differential Privacy (DP) is a Statistical measurement of privacy loss&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Epsilon (ε) is an important parameter to measure the privacy loss from a data output&lt;/li&gt;
&lt;li&gt;DP limits total privacy loss by setting thresholds of the ε (i.e., Privacy Budget)&lt;/li&gt;
&lt;li&gt;Defining the Privacy Unit is important. (i.e., do we want to protect the privacy of a person? Or the privacy of a transaction?)&lt;/li&gt;
&lt;li&gt;In practice, we add random noise to the output to meet the expected ε (or Privacy Budget)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Challenges&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applications are still limited to simple models (e.g., Analytic queries, simple ML models and synthetic data)&lt;/li&gt;
&lt;li&gt;The reduced accuracy from added noise impacts complex analytic models a lot&lt;/li&gt;
&lt;li&gt;DP on unstructured data is still very difficult&lt;/li&gt;
&lt;li&gt;Bias is introduced or amplified by DP, mainly from the added noise&lt;/li&gt;
&lt;li&gt;Conventional security models also apply to DP implementation. Privacy vs accuracy is an extra consideration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Data protection and data minimization are still important fundamentals even though we have DP.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  What is the current state of privacy protection
&lt;/h2&gt;

&lt;p&gt;To understand the importance of &lt;strong&gt;Differential Privacy (DP)&lt;/strong&gt;, we first need to understand the current privacy protection approaches and some basic concepts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Input Privacy vs Output Privacy
&lt;/h3&gt;

&lt;p&gt;In the past, when people wanted to conduct research on data related to individuals, we used different methods to minimize the exposure of the raw data (e.g., Relying on a trusted 3rd party to curate the data, distributing the data curation process to different parties, etc.) These methods prevent privacy leaks from the raw data &lt;strong&gt;Input&lt;/strong&gt;; we call it &lt;strong&gt;Input Privacy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But in some cases, we may also want to publish the research outputs to a broader audience or even the general public. We then also need to ensure that an individual's private information cannot be derived from the result data &lt;strong&gt;Output&lt;/strong&gt;; this is called &lt;strong&gt;Output Privacy&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Current De-identification method doesn't work
&lt;/h3&gt;

&lt;p&gt;The main problem Differential Privacy wants to address is Output Privacy. It is about preventing individual information from being derived by combining different results and reverse engineering.&lt;/p&gt;

&lt;p&gt;The most common method we have been using for decades is &lt;strong&gt;De-identification&lt;/strong&gt;. We always talk about &lt;strong&gt;Personal Identifiable Information (PII)&lt;/strong&gt;, and try to remove them from raw data before doing data research.&lt;/p&gt;

&lt;p&gt;But this method has been proven vulnerable time and again; some prominent examples include &lt;a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2076397" rel="noopener noreferrer"&gt;The Re-Identification of Governor William Weld's Medical Information&lt;/a&gt; and the &lt;a href="https://arxiv.org/pdf/cs/0610105.pdf" rel="noopener noreferrer"&gt;De-anonymization attacks on the Netflix Prize Dataset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Clearly, with enough auxiliary data, we can reconstruct individual information from data that is supposed to be &lt;em&gt;Anonymized&lt;/em&gt;. Under this assumption, &lt;strong&gt;every piece of information about an individual should be considered PII&lt;/strong&gt;.&lt;/p&gt;
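&lt;p&gt;A toy illustration of such a linkage attack (all records here are made up): joining a "de-identified" dataset with public auxiliary data on shared quasi-identifiers is enough to recover a name, exactly the technique behind the Weld re-identification.&lt;/p&gt;

```python
# "De-identified" medical records: direct identifiers removed,
# but quasi-identifiers (zip code, birth date, sex) remain.
medical_records = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "90210", "dob": "1980-01-02", "sex": "M", "diagnosis": "asthma"},
]

# Public auxiliary data (e.g., a voter roll) with names attached.
voter_roll = [
    {"name": "Alice", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "Bob",   "zip": "90210", "dob": "1980-01-02", "sex": "M"},
]

def reidentify(record: dict, auxiliary: list):
    # Join on the quasi-identifiers shared by both datasets.
    matches = [
        person["name"] for person in auxiliary
        if (person["zip"], person["dob"], person["sex"])
        == (record["zip"], record["dob"], record["sex"])
    ]
    # A unique match re-identifies the record's owner.
    return matches[0] if len(matches) == 1 else None

print(reidentify(medical_records[0], voter_roll))  # Alice
```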

&lt;h2&gt;
  
  
  What is Differential Privacy
&lt;/h2&gt;

&lt;p&gt;People have been trying to define what PII is for decades and failed repeatedly. Clearly, we need a more robust framework for measuring how much privacy we're preserving when performing anonymization.&lt;/p&gt;

&lt;p&gt;And Differential Privacy is the framework we need. It is a &lt;strong&gt;Statistical measurement of how much an individual's privacy is lost when exposing the data&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Differential Privacy is not an absolute guarantee
&lt;/h3&gt;

&lt;p&gt;The guideline makes it clear at the very beginning that &lt;strong&gt;&lt;em&gt;Differential privacy does not prevent somebody from making inferences about you.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The word &lt;strong&gt;Differential&lt;/strong&gt; means that the guarantee DP provides is relative to the situation where an individual doesn't participate in the dataset. DP can guarantee one's privacy will not face greater risk by participating in the dataset, but it &lt;strong&gt;doesn't mean there will be no risk at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's consider the following example:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Medical research found that smokers have a higher risk of lung cancer, so their insurance premiums are usually higher than those of other people.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's say a smoker, John, didn't participate in that medical research; the result is probably still the same. So, no matter whether he participates or not, his insurance company can still learn that he has a higher risk of lung cancer and charge him a higher premium.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this example, although the medical research lets the insurance company learn that John has a higher risk of lung cancer, we can still say DP guarantees John's privacy in the medical research because it &lt;strong&gt;makes no difference to him whether he participates or not&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  This is one of the first major guidelines for implementation
&lt;/h3&gt;

&lt;p&gt;Although Differential Privacy has formally existed for almost 20 years, the NIST SP 800-226 guideline is probably the first guideline published by a major institution covering the considerations when implementing it.&lt;/p&gt;

&lt;p&gt;This is a milestone in bringing DP from R&amp;amp;D into the discussion among practitioners and preparing us for broader adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Differential Privacy Foundations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Epsilon (ε)
&lt;/h3&gt;

&lt;p&gt;The formal definition of ε is the following formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F3fa3ef79-9c14-4adc-8aa5-fdeba8239f19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F3fa3ef79-9c14-4adc-8aa5-fdeba8239f19.png" alt="Definition of Epsilon"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It might be too difficult to understand, but it roughly means &lt;strong&gt;The chance where the datasets with and without an individual would produce different outputs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To understand it, we can assume a very small (or even zero) ε: there is then little or no difference whether an individual participates in the research, so there is less chance anyone can learn whether that individual is in the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In theory, a smaller ε provides a stronger privacy guarantee but less accuracy&lt;/strong&gt;.&lt;/p&gt;
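&lt;p&gt;Formally, a mechanism M is ε-differentially private if, for any neighbouring datasets D1 and D2 (differing in one individual) and any output set S, Pr[M(D1) ∈ S] ≤ e&lt;sup&gt;ε&lt;/sup&gt; · Pr[M(D2) ∈ S]. A quick numerical sketch of what the e&lt;sup&gt;ε&lt;/sup&gt; bound means:&lt;/p&gt;

```python
import math

# For an ε-differentially-private mechanism, exp(ε) is the worst-case ratio
# by which one individual's presence can shift the probability of any output.
for eps in (0.01, 0.1, 1.0, 5.0):
    ratio = math.exp(eps)
    print(f"ε = {eps}: output probabilities can differ by at most a factor of {ratio:.3f}")
```

&lt;p&gt;At ε = 0.01 the two output distributions are nearly indistinguishable; at ε = 5 an observer may see outputs roughly 148 times more likely with the individual present, which is a very weak guarantee.&lt;/p&gt;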

&lt;h3&gt;
  
  
  Privacy Unit
&lt;/h3&gt;

&lt;p&gt;Another concept the guideline calls out is the Privacy Unit.&lt;/p&gt;

&lt;p&gt;DP describes the difference between outputs from datasets with or without an individual, but it doesn't define &lt;strong&gt;what is an individual&lt;/strong&gt;. It can be an individual transaction, or a person.&lt;/p&gt;

&lt;p&gt;Since the common concern of data privacy is always about people, the guideline suggests we always use the &lt;strong&gt;User as the Privacy Unit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means that when we apply DP, we should measure the ε between datasets where &lt;strong&gt;ALL records related to one person&lt;/strong&gt; are present or absent.&lt;/p&gt;
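&lt;p&gt;The choice of privacy unit changes the sensitivity of even a simple count query. With User as the Privacy Unit, removing "one individual" means removing all of that person's records, so the sensitivity is the largest number of records any single person contributes. A small sketch with hypothetical data:&lt;/p&gt;

```python
from collections import Counter

# Transactions, several of which belong to the same person.
transactions = [
    {"user": "alice"}, {"user": "alice"}, {"user": "alice"},
    {"user": "bob"},
    {"user": "carol"}, {"user": "carol"},
]

# Record-level sensitivity of a count query: removing one transaction
# changes the count by at most 1.
record_sensitivity = 1

# User-level sensitivity: removing ALL of one person's records changes
# the count by up to the largest number of records any one user holds.
user_sensitivity = max(Counter(t["user"] for t in transactions).values())

print(record_sensitivity, user_sensitivity)  # 1 3
```

&lt;p&gt;Higher sensitivity means more noise is needed to reach the same ε, which is the price of protecting people rather than individual rows.&lt;/p&gt;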

&lt;h2&gt;
  
  
  Differential Privacy in practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Privacy Budget to limit privacy loss
&lt;/h3&gt;

&lt;p&gt;With a mathematical measurement of privacy, we can limit privacy exposure more quantitatively.&lt;/p&gt;

&lt;p&gt;ε represents the amount of privacy loss from an output; we can sum the ε from all the outputs published from a dataset to measure the total privacy loss.&lt;/p&gt;

&lt;p&gt;This allows us to limit the privacy loss by &lt;strong&gt;setting an upper bound of the total ε allowed for all published outputs from a dataset&lt;/strong&gt;, or we can call it the &lt;strong&gt;Privacy budget&lt;/strong&gt;.&lt;/p&gt;
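&lt;p&gt;As a sketch of the idea (using the simplest composition rule, where the ε of sequential outputs add up), a privacy budget can be enforced like this:&lt;/p&gt;

```python
class PrivacyBudget:
    """Track cumulative ε spent on outputs published from one dataset."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> bool:
        # Refuse the query if it would push total privacy loss past the budget.
        if self.spent + epsilon > self.total:
            return False
        self.spent += epsilon
        return True

budget = PrivacyBudget(total_epsilon=1.0)
print(budget.charge(0.4))  # True  (0.4 spent)
print(budget.charge(0.4))  # True  (0.8 spent)
print(budget.charge(0.4))  # False (would exceed the 1.0 budget)
```

&lt;p&gt;Real deployments often use tighter composition theorems than simple summation, but the principle is the same: once the budget is exhausted, no further outputs may be published from the dataset.&lt;/p&gt;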

&lt;h3&gt;
  
  
  Adding noise to comply with the privacy budget
&lt;/h3&gt;

&lt;p&gt;ε is defined by the difference between outputs from datasets with or without an individual; it depends on how impactful an individual is to the output.&lt;/p&gt;

&lt;p&gt;If an individual record is very &lt;em&gt;special&lt;/em&gt; in the dataset, the ε of one output may already exceed the total privacy budget.&lt;/p&gt;

&lt;p&gt;So, in practice, we'll add random noise into the output to fulfill the ε requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding random noise lowers the difference between outputs from datasets with or without an individual, thus lowering the ε&lt;/strong&gt;.&lt;/p&gt;
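&lt;p&gt;This is what the classic Laplace mechanism does. A minimal sketch for a count query (sensitivity 1): noise is drawn from a Laplace distribution with scale 1/ε, so a smaller ε (stronger privacy) means wider noise and a less accurate answer.&lt;/p&gt;

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # Sample from Laplace(0, scale) by inverting the CDF.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    # A count query has sensitivity 1 (one record changes the count by 1),
    # so Laplace noise with scale 1/ε gives ε-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
print(dp_count(1000, epsilon=0.1, rng=rng))  # noisy answer, typically within tens of 1000
print(dp_count(1000, epsilon=10.0, rng=rng)) # much closer to 1000, but weaker privacy
```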

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reduced accuracy and utility
&lt;/h3&gt;

&lt;p&gt;Accuracy and utility of an output may be related but not necessarily the same.&lt;/p&gt;

&lt;p&gt;The guideline calls this out: an output may be accurate but not useful if most attributes are redacted, and an output may be less accurate but still useful if the survey base is large.&lt;/p&gt;

&lt;p&gt;But either way, DP impacts both the accuracy and utility of the outputs. The primary reason is the &lt;strong&gt;added random noise to the outputs&lt;/strong&gt;, especially when the data size is small and more noise is required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Applications are still limited
&lt;/h3&gt;

&lt;p&gt;The guideline lists several applications of Differential Privacy; I would group them into the following 3 categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Analytic queries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This category includes most commonly used aggregation queries (e.g., Count, Summation, Min, Max, etc.)&lt;/p&gt;

&lt;p&gt;Because the output of these queries is numbers, it's &lt;strong&gt;easy to measure the privacy loss&lt;/strong&gt; and &lt;strong&gt;add random noise to comply with the privacy budget&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In fact, these queries are the most commonly adopted application of DP and have the most detailed guidelines.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Synthetic data and Machine learning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The guideline puts these 2 into separate categories, but I would group them together to simplify things.&lt;/p&gt;

&lt;p&gt;Generating synthetic data or training ML model from the dataset can &lt;strong&gt;give the curated output more correlation between attributes&lt;/strong&gt; (The guideline uses an example of the type of coffee vs purchases' age), which analytic queries are not good at.&lt;/p&gt;

&lt;p&gt;There are some well-known methods for applying them to DP, like Marginal distributions and Differentially-private stochastic gradient descent (DP-SGD).&lt;/p&gt;

&lt;p&gt;However, they face a similar problem: &lt;strong&gt;the accuracy and utility of the output are easily affected by the model's complexity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main reason is that the random noise added to the DP output is amplified when the analysis goal becomes more complex (e.g., more dimensions in the synthetic data, a more complex deep learning model, etc.).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unstructured data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unstructured data are things like text, pictures, audio, video, etc. These data make it difficult to identify the owner (e.g., a video can contain multiple people's faces).&lt;/p&gt;

&lt;p&gt;The major obstacle to applying DP to these data is &lt;strong&gt;the difficulty of identifying a meaningful privacy unit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Currently, there is very little research on applying DP to unstructured data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Reduced accuracy amplifying bias
&lt;/h3&gt;

&lt;p&gt;The 3 biases introduced or amplified by DP are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Systemic bias&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The smaller a dataset is, the more impact an individual can have on the result.&lt;/p&gt;

&lt;p&gt;That's why when dealing with smaller groups (e.g., a minority population), the noise needed for DP is proportionally larger than for others.&lt;/p&gt;

&lt;p&gt;This larger noise can significantly impact the outputs of the already small dataset.&lt;/p&gt;

&lt;p&gt;In some extreme cases, &lt;strong&gt;the noise added to the output can even make a minority group non-existent in a research output&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This would amplify the public bias towards minority populations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Human Bias&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DP can make the output even worse than erasing the entire group: the added noise can produce unrealistic results.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Random noise can be a fractional number, thus making countable measurements (e.g., population) fractional.&lt;/li&gt;
&lt;li&gt;Random noise can also be larger than the original data (especially when data size and ε are small). Adding negative noise to the output may result in a negative number, which is impossible in measurements like population.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;These unrealistic outputs may affect the public's view towards DP and give them the impression that DP is not a reliable method.&lt;/strong&gt;&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Statistical Bias&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This bias is partly introduced when tackling Human Bias.&lt;/p&gt;

&lt;p&gt;When we post-process the DP output to make unrealistic output realistic, &lt;strong&gt;the overall accuracy and utility may be affected by the change&lt;/strong&gt;.&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;
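
&lt;p&gt;A minimal sketch (my own, not from the guideline) of the post-processing described above: rounding and clamping make the output realistic, but they systematically push small noisy counts toward zero, which is exactly the statistical bias being described:&lt;/p&gt;

```python
def post_process(noisy_count: float) -> int:
    # Make an unrealistic DP output realistic: round the noisy count to an
    # integer and clamp negative values to zero. This removes fractional and
    # negative populations, but biases small counts toward zero.
    return max(0, round(noisy_count))
```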

&lt;h3&gt;
  
  
  Security challenges
&lt;/h3&gt;

&lt;p&gt;Although the guideline focuses on Differential Privacy, it also reminds us that general security principles still apply to the implementation.&lt;/p&gt;

&lt;p&gt;Some of the guidelines given are similar to conventional risk management, but we'll need to deal with more kinds of vulnerabilities, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Interactive Query&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Allowing data consumers to run their own queries makes DP implementation difficult because data consumers may be untrusted and may try to issue &lt;strong&gt;malicious queries to break the DP guarantee&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Data custodians also need to store the raw data for real-time queries, which &lt;strong&gt;increases data leak risk&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In my opinion, &lt;strong&gt;this is similar to a conventional application protecting the database behind it&lt;/strong&gt;. But in the DP case, we'll also &lt;strong&gt;take the privacy budget into account&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Trust Boundary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The guideline explains 2 different threat models: the local model and the central model.&lt;/p&gt;

&lt;p&gt;Depending on where we put the trust boundary, we will apply DP on different layers, either when &lt;strong&gt;data is sent from data subject to data curator&lt;/strong&gt;, or &lt;strong&gt;from data curator to data consumers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The same principles apply as in conventional threat modeling. But in the DP case, we also need to balance output accuracy against risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The earlier we apply DP, the less risk we take. However, the accuracy of the final output also decreases.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
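
&lt;p&gt;To make the local model concrete, here is a classic example (my own sketch, not from the guideline): randomized response, where each data subject flips coins before answering, so the data curator never has to be trusted with the raw value:&lt;/p&gt;

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    # Local-model DP: the data subject randomizes the answer before it ever
    # leaves their device, so the curator never sees the raw value.
    # Coin flip 1: on heads, answer truthfully.
    if random.random() >= 0.5:
        return truthful_answer
    # Coin flip 2: otherwise, answer uniformly at random.
    return random.random() >= 0.5
```

&lt;p&gt;Because a truthful answer is reported correctly with probability 3/4, the curator can still estimate the population-level rate while any individual answer remains deniable.&lt;/p&gt;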

&lt;p&gt;While some challenges may look similar to conventional security frameworks, some are specific to DP.&lt;/p&gt;

&lt;p&gt;I'm not going into detail because they are quite implementation-specific, but the guideline includes the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Floating-Point Arithmetic&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Timing Channels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Backend Issues&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Back to the basics: data protection is paramount to privacy protection
&lt;/h2&gt;

&lt;p&gt;Last but not least, the guideline closes with the 2 most fundamental yet important things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Security and Access Control&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Collection Exposure&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simply put, if we cannot protect the raw data in the first place, all privacy protections would become meaningless.&lt;/p&gt;

&lt;p&gt;And taking one more step back, &lt;strong&gt;data protection and privacy protection can minimize but not eliminate privacy risk&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the data is not needed for research purposes, we shouldn't collect it in the first place.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>differentialprivacy</category>
      <category>privacy</category>
      <category>privacyenhancingtechnologies</category>
      <category>dataprivacy</category>
    </item>
    <item>
      <title>Security Implication of Giving Examples</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Fri, 16 Feb 2024 03:37:27 +0000</pubDate>
      <link>https://forem.com/richardfan1126/security-implication-of-giving-examples-3g4h</link>
      <guid>https://forem.com/richardfan1126/security-implication-of-giving-examples-3g4h</guid>
      <description>&lt;p&gt;In this post, I want to share my thoughts on giving examples in technical writing and the security implications behind it, no matter whether the impact is real or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;We will likely give examples when writing technical documents, formal or informal, from user manuals to personal blog posts.&lt;/p&gt;

&lt;p&gt;And it's inevitable that the examples contain sensitive or even secret values.&lt;/p&gt;

&lt;p&gt;There are many ways we deal with those values (e.g., redacting, modifying, etc.)&lt;/p&gt;

&lt;p&gt;I have tried many ways of dealing with them throughout my journey, and I have slowly built my own convention.&lt;/p&gt;

&lt;p&gt;And it all started with this &lt;a href="https://www.linkedin.com/posts/richardfan1126_aws-activity-7163779862250373121-iybZ?utm_source=share&amp;amp;utm_medium=member_desktop"&gt;Linkedin post&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;AWS rolled back its managed IAM policy &lt;strong&gt;AmazonEC2ReadOnlyAccess&lt;/strong&gt;, but it turned out it was because &lt;a href="https://www.linkedin.com/in/scott-piper-security/"&gt;Scott Piper&lt;/a&gt;, Principal Cloud Security Researcher at Wiz, mistakenly thought the &lt;strong&gt;ec2:GetPasswordData&lt;/strong&gt; permission allows users to get the EC2 instance password. And the mistake was due to the poor example AWS gives in its &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_GetPasswordData.html"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But instead of blaming AWS for their poor example, I think I should also formalize my own convention and get feedback from others.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am confident we should follow
&lt;/h2&gt;

&lt;p&gt;The following rules are those I'm pretty confident about:&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not use mosaic to hide secrets
&lt;/h3&gt;

&lt;p&gt;When we want to hide secrets (e.g., passwords) in a screenshot, simply redact them with a solid box; &lt;strong&gt;DON'T&lt;/strong&gt; use mosaic.&lt;/p&gt;

&lt;p&gt;There are many techniques and tools available to reveal text under the mosaic.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.toolify.ai/ai-news/avoid-this-password-blur-mistake-95545"&gt;https://www.toolify.ai/ai-news/avoid-this-password-blur-mistake-95545&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/HypoX64/DeepMosaics"&gt;https://github.com/HypoX64/DeepMosaics&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't want to reveal your password through your blog post, so just redact it; don't trust the mosaic anymore.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not show a fake secret
&lt;/h3&gt;

&lt;p&gt;If we want to show a secret in a screenshot or example code without redacting it, do not make a confusing fake. Make it evident that it's fake.&lt;/p&gt;

&lt;p&gt;For example, when we want to give an example of an OAuth token request call:&lt;/p&gt;

&lt;p&gt;Instead of using this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://example.com/v1/oauth/token?grant_type=authorization_code
  &amp;amp;code=b87c3c60ca2b54ae
  &amp;amp;client_id=9af83a008718df9b
  &amp;amp;client_secret=af8c86cb8bca211d
  &amp;amp;redirect_uri=https://example.com/callback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try using this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://example.com/v1/oauth/token?grant_type=authorization_code
  &amp;amp;code=b87c3c60ca2b54ae
  &amp;amp;client_id=9af83a008718df9b
  &amp;amp;client_secret=&amp;lt;your_client_secret&amp;gt;
  &amp;amp;redirect_uri=https://example.com/callback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://example.com/v1/oauth/token?grant_type=authorization_code
  &amp;amp;code=b87c3c60ca2b54ae
  &amp;amp;client_id=9af83a008718df9b
  &amp;amp;client_secret=****************
  &amp;amp;redirect_uri=https://example.com/callback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 3 examples do no harm to us because the &lt;code&gt;client_secret&lt;/code&gt; values are all fake.&lt;/p&gt;

&lt;p&gt;But readers with little knowledge of OAuth may not know that &lt;code&gt;client_secret&lt;/code&gt; is something they shouldn't expose.&lt;/p&gt;

&lt;p&gt;And by seeing us show the secret in the example, they may just follow along and show their &lt;strong&gt;REAL&lt;/strong&gt; secret to others.&lt;/p&gt;

&lt;p&gt;The other implication, I believe, is that many people &lt;em&gt;(including me)&lt;/em&gt; are willing to inform others when they find something sensitive posted online &lt;em&gt;(not just technical stuff; I've DM'd many people on social media to take down photos of their boarding passes)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If I message a blog owner to be careful with their secret and get a reply that it's fake, I would feel fooled and be less willing to do the same thing next time, when it might be a real secret.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am doing but you may have better options
&lt;/h2&gt;

&lt;p&gt;The following rules are what I am following, but I'm not quite sure if they are the best options.&lt;/p&gt;

&lt;p&gt;You may argue that my reasons are wrong, or you may have better options.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use common patterns for personal values
&lt;/h3&gt;

&lt;p&gt;This is similar to "Do not show a fake secret", but for personal data (e.g., AWS account ID, AWS resource ARN).&lt;/p&gt;

&lt;p&gt;These data are not secrets, but we still don't want to expose them to the public.&lt;/p&gt;

&lt;p&gt;We can use the same method as dealing with secrets, but it may make the example difficult to read.&lt;/p&gt;

&lt;p&gt;So, I would use some common patterns to replace those data.&lt;/p&gt;

&lt;p&gt;E.g., If I were to give an AWS CLI command example of creating an EC2 instance, I can write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 run-instances \
   --image-id &amp;lt;ami_id&amp;gt; \
   --subnet-id &amp;lt;subnet_id&amp;gt; \
   --instance-type &amp;lt;instance_type&amp;gt; \
   --key-name &amp;lt;key_pair_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's still useful, but the following format would be more useful because the reader can understand the format of each value and find their own more easily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ec2 run-instances \
   --image-id ami-11111111111111111 \
   --subnet-id subnet-22222222 \
   --instance-type c5.xlarge \
   --key-name my-key-pair-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dealing with encoded values
&lt;/h3&gt;

&lt;p&gt;For encoded or even encrypted values, I still don't have a good option to make the example similar to the real one yet obvious to the reader that it's fake.&lt;/p&gt;

&lt;p&gt;E.g., If I use the same method as dealing with secret values, I may write this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;password_b64: &amp;lt;your_password&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But then the reader doesn't know it's a base64-encoded value.&lt;/p&gt;

&lt;p&gt;If I use the base64-encoded &lt;code&gt;&amp;lt;your_password&amp;gt;&lt;/code&gt;, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;password_b64: PHlvdXJfcGFzc3dvcmQ+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, the readers may not know it's a secret that they shouldn't expose.&lt;/p&gt;

&lt;p&gt;So right now, what I would write is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;password_b64: &amp;lt;base64_encoded_password&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have more explanatory options, please let me know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;These are just the rules I find easy for readers to understand without raising security concerns.&lt;/p&gt;

&lt;p&gt;I see many ways of making examples, even across AWS service teams.&lt;/p&gt;

&lt;p&gt;I really hope we'll have a more standardized way of giving examples (especially when secrets are involved) in technical writing, like the one for &lt;a href="https://www.conventionalcommits.org/en/v1.0.0/"&gt;Git commit messages&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Please feel free to share your thoughts.&lt;/p&gt;

</description>
      <category>writing</category>
      <category>writingtips</category>
      <category>technicalwriting</category>
    </item>
    <item>
      <title>When Automation Meets Authentication</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Tue, 06 Feb 2024 16:20:44 +0000</pubDate>
      <link>https://forem.com/richardfan1126/when-automation-meets-authentication-ii2</link>
      <guid>https://forem.com/richardfan1126/when-automation-meets-authentication-ii2</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;This post is not about sharing my success story or lecturing you about some new things. It's more about summarizing my questions about the conflict between automation and authentication.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Recent Trends
&lt;/h3&gt;

&lt;p&gt;Over the past decades, there have been more and more &lt;em&gt;XxxOps&lt;/em&gt;: &lt;strong&gt;DevOps&lt;/strong&gt;, &lt;strong&gt;CloudOps&lt;/strong&gt;, &lt;strong&gt;GitOps&lt;/strong&gt;, &lt;strong&gt;AIOps&lt;/strong&gt;. Recently, I even heard &lt;strong&gt;NoOps&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The common theme among them is to &lt;strong&gt;Automate everything&lt;/strong&gt;. We want people to do as little ops work as possible. In the ideal state, we shouldn't even allow people to touch the system.&lt;/p&gt;

&lt;p&gt;But at the same time, we have another trend: everything should be verifiable and traceable, and people should be accountable.&lt;/p&gt;

&lt;p&gt;We are getting rid of shared accounts and long-term credentials, and using MFA and even hardware keys to prevent spoofing.&lt;/p&gt;

&lt;p&gt;But aren't they contradictory? We don't want humans to be involved, but we want humans to be accountable.&lt;/p&gt;

&lt;h3&gt;
  
  
  My recent story
&lt;/h3&gt;

&lt;p&gt;As a cybersecurity practitioner, I'm a fan of hardware keys. I have my own YubiKey, and I use it to sign all my git commits so people can verify that my work is done by me.&lt;/p&gt;

&lt;p&gt;As an engineer, I'm also a fan of automation. I often use IaC and CI/CD to help me deploy stuff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But recently, I'm facing a dilemma.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of my projects is using the IaC repository as the single deployment point. We also use it to deploy application configuration.&lt;/p&gt;

&lt;p&gt;But the problem is that another repository generates the application configuration.&lt;/p&gt;

&lt;p&gt;So I have these options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Merge two repositories.&lt;/p&gt;

&lt;p&gt;But it will make the repository too big and difficult to maintain.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deploy the configurations separately.&lt;/p&gt;

&lt;p&gt;However, it will fragment my AWS resources and make it difficult to track the state of my environment.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;i.e., no single source of truth for what the current environment state looks like&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Have the application repository generate the configuration and push it to the IaC repository for deployment.&lt;/p&gt;

&lt;p&gt;This one looks pretty reasonable to me. So, I picked this route.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Problems Come
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I sign the git commit?
&lt;/h3&gt;

&lt;p&gt;If the app repository is pushing files, it has to make a commit. As a security engineer, I would like all the commits in my repository to be signed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Use GitHub's key?
&lt;/h4&gt;

&lt;p&gt;Now you may say the GitHub bot can sign the commit for me.&lt;/p&gt;

&lt;p&gt;But as a security engineer &lt;em&gt;(Or you can say I'm too paranoid)&lt;/em&gt;, I don't trust the GitHub.com GPG key because who knows how many accounts I'm sharing that same key with?&lt;/p&gt;

&lt;h4&gt;
  
  
  Use stored key?
&lt;/h4&gt;

&lt;p&gt;You may also say I can put the GPG private key into GitHub Actions and use it to sign the commit. But this is prone to spoofing because anyone who can access the stored key can use it to sign other things.&lt;/p&gt;

&lt;h4&gt;
  
  
  Hardware key?
&lt;/h4&gt;

&lt;p&gt;Hardware keys can prevent private key leaks, but I can't plug my YubiKey into the GitHub data center and use it in my GitHub Actions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cloud services?
&lt;/h4&gt;

&lt;p&gt;There are many Cloud HSM/KMS offerings, but I can't find any that provide an easy way to integrate with git.&lt;/p&gt;

&lt;p&gt;I see that HashiCorp Vault supports acting as a &lt;a href="https://developer.hashicorp.com/vault/docs/enterprise/pkcs11-provider"&gt;PKCS#11 provider&lt;/a&gt;, so it can be used like a hardware key with GPG.&lt;/p&gt;

&lt;p&gt;I also found an &lt;a href="https://github.com/hf/kmspgp"&gt;open-source project&lt;/a&gt; wrapping pgp with AWS KMS.&lt;/p&gt;

&lt;p&gt;But both options look immature to me, and I'm not sure what the security model should look like for them to behave as similarly as possible to an actual hardware key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who can access the IaC repository?
&lt;/h2&gt;

&lt;p&gt;If the app repository workflow wants to push files to the IaC repository, it must have access to it.&lt;/p&gt;

&lt;p&gt;How can I grant it access?&lt;/p&gt;

&lt;h3&gt;
  
  
  Interesting GitHub access model
&lt;/h3&gt;

&lt;p&gt;GitHub Actions supports &lt;a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect"&gt;OIDC authentication&lt;/a&gt;, so we can grant the workflow access over other cloud environments (e.g., AWS account) as the workflow itself. (&lt;strong&gt;Without&lt;/strong&gt; long-term credentials)&lt;/p&gt;
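
&lt;p&gt;As a sketch, a workflow fragment using OIDC to assume an AWS role looks like this (the role ARN and region are placeholders; no long-term secret is stored in the repository):&lt;/p&gt;

```yaml
permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111111111111:role/example-deploy-role
          aws-region: us-east-1
```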

&lt;p&gt;You may think the same should apply to accessing other repositories. Well, the answer is &lt;strong&gt;No&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To programmatically access a GitHub repository, we can use &lt;a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens"&gt;Personal Access Token&lt;/a&gt; or &lt;a href="https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/about-authentication-with-a-github-app"&gt;GitHub App&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Guess what? Both methods involve long-term credentials.&lt;/p&gt;

&lt;p&gt;And unlike OIDC, both methods are not directly tied to the workflow itself.&lt;/p&gt;

&lt;p&gt;I even joked with my colleague that a GitHub workflow integrates better with other cloud providers than with GitHub itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitLab is better in this area
&lt;/h3&gt;

&lt;p&gt;GitLab provides two methods for cross-repository workflow.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://docs.gitlab.com/ee/ci/pipelines/downstream_pipelines.html#multi-project-pipelines"&gt;Multi-project pipelines&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This method allows a pipeline to trigger another pipeline in another project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://docs.gitlab.com/ee/ci/jobs/ci_job_token.html#add-a-project-to-the-job-token-allowlist"&gt;Job token allowlist&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This method allows job tokens from allowlisted projects to access the current project.&lt;/p&gt;

&lt;p&gt;So the pipelines from those projects can access it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  This is not unique to GitOps but critical to GitOps
&lt;/h2&gt;

&lt;p&gt;The automation vs authentication issue is not unique to GitOps. There are many companies using automation to sign their software build.&lt;/p&gt;

&lt;p&gt;The reasons I think this issue is more critical for GitOps are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Git commit is the first step of defense&lt;/p&gt;

&lt;p&gt;The first step code (whether for software or infrastructure) takes into the codebase is when developers commit it.&lt;/p&gt;

&lt;p&gt;No matter how much defense we build around the system, all of it is useless if we cannot verify who created the code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The scope is broader&lt;/p&gt;

&lt;p&gt;We may have ten software release pipelines.&lt;/p&gt;

&lt;p&gt;But we may also have thousands of developers committing code and hundreds of workflows around them.&lt;/p&gt;

&lt;p&gt;Managing the keys and validating them is more challenging than other use cases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Git is everything nowadays&lt;/p&gt;

&lt;p&gt;With the rise of DevOps, IaC, GitOps, etc., we now have more and more kinds of stuff written as code.&lt;/p&gt;

&lt;p&gt;We have application code, configuration, infrastructure, access control list, etc.&lt;/p&gt;

&lt;p&gt;We may face a total system breakdown or takeover if unauthorized code is injected into the repository.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;While I was asking all these questions and doing research, I realized it's not about which method to use, but more about &lt;strong&gt;"Who is the automation?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the differences between the 2 GitLab cross-repository workflow methods is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-project pipelines&lt;/strong&gt; requires the user triggering the first workflow to have permission on the second repository. And &lt;strong&gt;Job token allowlist&lt;/strong&gt; requires the first repository's job to have permission on the second repository.&lt;/p&gt;

&lt;p&gt;This also prompts me to think: "Is the automation just a representative of the user, or does it have its own identity?"&lt;/p&gt;

&lt;p&gt;Nowadays, we are discouraging shared accounts because we want clear accountability and responsibility. But in the end, automation is still a different form of shared account.&lt;/p&gt;

&lt;p&gt;So, what is the line between a shared account and an automation? I don't have a clear answer.&lt;/p&gt;

&lt;p&gt;What do you think?&lt;/p&gt;

</description>
      <category>gitops</category>
      <category>cloudsecurity</category>
      <category>devops</category>
    </item>
    <item>
      <title>Can We Use aws:SourceVpc Condition Without a VPC Endpoint?</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Thu, 18 Jan 2024 16:18:09 +0000</pubDate>
      <link>https://forem.com/aws-builders/can-we-use-awssourcevpc-condition-without-a-vpc-endpoint-44do</link>
      <guid>https://forem.com/aws-builders/can-we-use-awssourcevpc-condition-without-a-vpc-endpoint-44do</guid>
      <description>&lt;p&gt; 1. Background&lt;br&gt;
 2. Why is VPC Endpoint required?&lt;br&gt;
       2.1. The route of a network request goes within AWS&lt;br&gt;
       2.2. The way IAM knows the API request's context&lt;br&gt;
 3. How AWS documentation fails to make its users understand&lt;br&gt;
       3.1. Does it really mean the source VPC?&lt;br&gt;
 4. Call for action to AWS&lt;/p&gt;
&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Yesterday, I had a discussion with a guy on Slack about "Does IAM &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourcevpc" rel="noopener noreferrer"&gt;&lt;code&gt;aws:SourceVpc&lt;/code&gt; condition&lt;/a&gt; requires a VPC endpoint to work?".&lt;/p&gt;

&lt;p&gt;Although the documentation states that &lt;em&gt;This key is included in the request context only if the requester uses a VPC endpoint to make the request&lt;/em&gt;, it's not obvious that a request originated from a VPC doesn't always have the source VPC information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F429bc301-b282-47b9-a94d-01ee7adb55ca.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F429bc301-b282-47b9-a94d-01ee7adb55ca.jpg" alt="The documentation states that VPC endpoint is required, but the story doens't stop here"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The documentation states that a VPC endpoint is required, but the story doesn't stop here&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Although the documentation states the prerequisite of &lt;code&gt;aws:SourceVpc&lt;/code&gt;, there is still some confusion. Luckily, I attended a chalk talk session about this topic at last year's AWS re:Inforce. So, I think it's time to share what I've learned.&lt;/p&gt;
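
&lt;p&gt;For context, this condition key is typically used in a policy like the following sketch (the VPC ID is a placeholder), which denies S3 access unless the request carries the expected source-VPC context:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideMyVpc",
      "Effect": "Deny",
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpc": "vpc-11111111"
        }
      }
    }
  ]
}
```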

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F63374aa3-5367-4183-b489-c7d484039f52.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F63374aa3-5367-4183-b489-c7d484039f52.jpg" alt="The Chalk Talk session about IAM that I attended in AWS re:Inforce"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The Chalk Talk session about IAM that I attended in AWS re:Inforce&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Why is VPC Endpoint required?
&lt;/h2&gt;

&lt;p&gt;The reason is based on 2 aspects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The route of a network request goes within AWS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The way IAM knows the API request's context&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  The route of a network request goes within AWS
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F4e17f41a-9536-4c49-b6b7-dc5800ec39cf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F4e17f41a-9536-4c49-b6b7-dc5800ec39cf.jpg" alt="AWS API endpoint is outside the VPC"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AWS API endpoint is outside the VPC&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many AWS services can be deployed in a VPC (e.g., EC2 instance, RDS instance, ECS task, Elasticache cluster, etc.)&lt;/p&gt;

&lt;p&gt;For those resources, we can configure the VPC so that network traffic routes through, or stays entirely within, the VPC to reach the resources. For example, a SQL connection from an EC2 instance to an RDS instance within the same VPC (the green line in the above diagram).&lt;/p&gt;

&lt;p&gt;But when it comes to AWS API calls (Let's say an AWS CLI call &lt;code&gt;aws rds stop-db-instance&lt;/code&gt; from the EC2 instance), it cannot stay within the VPC.&lt;/p&gt;

&lt;p&gt;The AWS API call is not going to the resource itself (i.e., &lt;em&gt;The CLI is not talking to the RDS instance "Hey! I want to stop you"&lt;/em&gt;). Instead, the API is going to an AWS API endpoint (&lt;em&gt;in this case, &lt;code&gt;rds.us-east-1.amazonaws.com&lt;/code&gt;&lt;/em&gt;), which is owned by AWS and sits outside of the VPC. (i.e., &lt;em&gt;The CLI is talking to AWS, "Hey! I want to stop that instance, please do it"&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;To reach the AWS API endpoint, the traffic must either go through the AWS backbone network or a VPC endpoint inside the VPC. (The blue line in the above diagram).&lt;/p&gt;

&lt;p&gt;We CANNOT create an AWS API endpoint inside a VPC, so there is no such thing as an "AWS API call within a VPC" (the red line in the above diagram doesn't exist).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Correction: The previous version wrongly stated that traffic going out of Internet Gateway to the AWS API endpoint is through the public Internet. But in fact, it is routed through the AWS backbone network.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;However, this change doesn't affect the conclusion of this blog post.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The way IAM knows the API request's context
&lt;/h3&gt;

&lt;p&gt;AWS IAM policy allows us to define permission based on different criteria, like &lt;em&gt;"Who is making the request?"&lt;/em&gt;, &lt;em&gt;"Where is the request coming from?"&lt;/em&gt;, &lt;em&gt;"How is the requester authenticated in the first place?"&lt;/em&gt;, etc.&lt;/p&gt;

&lt;p&gt;But the AWS IAM service doesn't magically know all this context for every API request; it relies on the context attached to the request to perform IAM policy evaluation.&lt;/p&gt;

&lt;p&gt;These context keys are not all attached in one place; where each one is added depends on what it describes.&lt;/p&gt;

&lt;p&gt;For example, &lt;code&gt;aws:MultiFactorAuthPresent&lt;/code&gt; is added inside the session token: when we sign in, the STS service knows whether we used MFA and injects this information into the token.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws:SourceIp&lt;/code&gt; is added when the request reaches the API endpoint, because the endpoint can inspect the IP header and determine which IP the request came from.&lt;/p&gt;

&lt;p&gt;We cannot expect the API endpoint to add &lt;code&gt;aws:MultiFactorAuthPresent&lt;/code&gt; because it doesn't know how the user logged in in the first place. We also cannot expect the STS service to add &lt;code&gt;aws:SourceIp&lt;/code&gt; to the session token, because it can't know where the token will later be used to sign subsequent API requests.&lt;/p&gt;

&lt;p&gt;So, let's come back to the &lt;code&gt;aws:SourceVpc&lt;/code&gt; context. Who should add this to the request?&lt;/p&gt;

&lt;p&gt;Can the EC2 instance do it? It seems possible, because AWS knows where the EC2 instance sits. But is it trustworthy? What if the user generates the API request on the EC2 instance, copies it to a laptop, and sends it over the Internet? Should AWS still treat that as "coming from the VPC"? So this is not feasible.&lt;/p&gt;

&lt;p&gt;Can the Internet Gateway add this context? The API request travels inside an HTTPS connection; the Internet Gateway cannot decrypt it, add the context, and re-encrypt it. This is also not feasible.&lt;/p&gt;

&lt;p&gt;Can the AWS API endpoint check if the request comes from the EC2 instance's public IP? It seems possible, but keeping track of all public IP addresses is a considerable overhead and would cause performance issues. So this is also not feasible.&lt;/p&gt;

&lt;p&gt;So, the only possible way to do it is to let the VPC endpoint add this context to the request.&lt;/p&gt;

&lt;p&gt;And according to the chalk talk session, the &lt;code&gt;aws:SourceVpc&lt;/code&gt; context is added when the API call goes through the VPC endpoint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F85d12612-0bb7-45c9-9bab-3d03d445dd1e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F85d12612-0bb7-45c9-9bab-3d03d445dd1e.jpg" alt="Request contexts are added at different stages of the traffic path"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Request contexts are added at different stages of the traffic path&lt;/em&gt;&lt;/p&gt;
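&lt;p&gt;The division of responsibility above can be sketched in a few lines of Python. This is purely conceptual (the condition key names are real IAM keys, but the pipeline itself is my illustration, not AWS's implementation):&lt;/p&gt;

```python
# Conceptual sketch: each hop attaches only the context it can observe.
# The condition key names are real IAM keys; everything else is illustrative.

def sts_sign_in(used_mfa: bool) -> dict:
    # STS knows how the user authenticated, so it bakes that
    # fact into the session token at sign-in time.
    return {"aws:MultiFactorAuthPresent": used_mfa}

def vpc_endpoint_forward(request: dict, endpoint_vpc_id: str) -> dict:
    # The VPC endpoint adds its OWN VPC ID -- it cannot know
    # where the request originally came from.
    return {**request, "aws:SourceVpc": endpoint_vpc_id}

def api_endpoint_receive(request: dict, source_ip: str) -> dict:
    # The API endpoint sees the IP header of the incoming connection.
    return {**request, "aws:SourceIp": source_ip}

def evaluate(context: dict, condition: dict) -> bool:
    # Simplified StringEquals evaluation.
    return all(context.get(k) == v for k, v in condition.items())

# A request signed with an MFA-backed session token...
ctx = sts_sign_in(used_mfa=True)
# ...routed through a VPC endpoint in vpc-aaaaaaa...
ctx = vpc_endpoint_forward(ctx, "vpc-aaaaaaa")
# ...and finally received by the API endpoint.
ctx = api_endpoint_receive(ctx, "10.0.0.186")

print(evaluate(ctx, {"aws:SourceVpc": "vpc-aaaaaaa"}))  # True
print(evaluate(ctx, {"aws:SourceVpc": "vpc-bbbbbbb"}))  # False
```

&lt;p&gt;Note how only the VPC endpoint is in a position to set &lt;code&gt;aws:SourceVpc&lt;/code&gt;, and the value it sets is its own VPC ID.&lt;/p&gt;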
&lt;h2&gt;
  
  
  How the AWS documentation fails its users
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Does it really mean the source VPC?
&lt;/h3&gt;

&lt;p&gt;Now we know the &lt;code&gt;aws:SourceVpc&lt;/code&gt; context is added by the VPC endpoint. So does it really mean "Source VPC"?&lt;/p&gt;

&lt;p&gt;Consider the following scenario:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2Fdbde9353-9842-4f1c-ac81-0620eae7bc5e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2Fdbde9353-9842-4f1c-ac81-0620eae7bc5e.jpg" alt="VPC endpoint sharing"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;VPC endpoint sharing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I have 2 VPCs (&lt;code&gt;vpc-aaaaaaa&lt;/code&gt; and &lt;code&gt;vpc-bbbbbbb&lt;/code&gt;) connected by VPC peering, with an STS VPC endpoint in &lt;code&gt;vpc-aaaaaaa&lt;/code&gt; and an EC2 instance in &lt;code&gt;vpc-bbbbbbb&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, I want to restrict an IAM role so that it can only be assumed through the blue route; what should I specify in the IAM policy?&lt;/p&gt;

&lt;p&gt;Imagine if I didn't attend the chalk talk session and just read the AWS documentation, which states &lt;em&gt;Use this key to check whether the request comes from the VPC that you specify in the policy&lt;/em&gt;. I would definitely write my policy as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"aws:SourceVpc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vpc-bbbbbbb"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But does it work? I did an experiment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;I created an EC2 instance in &lt;code&gt;vpc-05c07e7f&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F84a706bf-459e-4be8-83ae-7fb64c05d21d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F84a706bf-459e-4be8-83ae-7fb64c05d21d.png"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I created an STS VPC endpoint in another VPC, &lt;code&gt;vpc-0c3610a65f744e73f&lt;/code&gt;, which is peered with the first VPC.&lt;br&gt;
Its private IP address is &lt;code&gt;10.0.0.186&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2Faadd01b8-d839-491e-b112-ecfbb7196528.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2Faadd01b8-d839-491e-b112-ecfbb7196528.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F18e21b8d-63e6-49cd-bd44-d9f6755bd57b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F18e21b8d-63e6-49cd-bd44-d9f6755bd57b.png"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then I attached an IAM policy to the EC2 instance's IAM role, specifying &lt;code&gt;vpc-05c07e7f&lt;/code&gt;, the VPC containing the EC2 instance&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F0b9b4e21-c66d-4261-8ccb-ff5f95d8b0e8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F0b9b4e21-c66d-4261-8ccb-ff5f95d8b0e8.png"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I logged into the EC2 instance and verified that STS requests go to the VPC endpoint's IP address.&lt;br&gt;
Then my &lt;code&gt;sts:assumeRole&lt;/code&gt; CLI command was denied&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2Ff6f206b4-3b55-4f21-b039-2d29108aa814.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2Ff6f206b4-3b55-4f21-b039-2d29108aa814.png"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then I changed the IAM policy to use &lt;code&gt;vpc-0c3610a65f744e73f&lt;/code&gt;, which contains the VPC endpoint&lt;br&gt;
The CLI command was successful this time.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F00fe33ce-c499-47ef-b76e-4851d8ac3f73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2F00fe33ce-c499-47ef-b76e-4851d8ac3f73.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2Fe1a980e4-e34b-48b2-bc18-84b60c59872e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.richardfan.xyz%2Fassets%2Fimages%2Fe1a980e4-e34b-48b2-bc18-84b60c59872e.png"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Of course, after learning how request contexts are added, I know why &lt;code&gt;aws:SourceVpc&lt;/code&gt; doesn't reflect where the request really comes from.&lt;/p&gt;

&lt;p&gt;The context is added by the VPC endpoint, which doesn't care where the request comes from. As long as the request goes through the VPC endpoint, the endpoint adds its own VPC ID.&lt;/p&gt;

&lt;p&gt;But that clearly doesn't match the documentation's description.&lt;/p&gt;
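&lt;p&gt;Given this behaviour, if the intent is "only through this particular endpoint", the &lt;code&gt;aws:SourceVpce&lt;/code&gt; global condition key, which matches the VPC endpoint ID directly, is arguably the less ambiguous choice (the endpoint ID below is a placeholder):&lt;/p&gt;

```json
"Condition": {
    "StringEquals": {
        "aws:SourceVpce": "vpce-0123456789abcdef0"
    }
}
```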

&lt;h2&gt;
  
  
  A call to action for AWS
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Make the documentation more accurate&lt;/strong&gt;&lt;br&gt;
Clearly, &lt;code&gt;aws:SourceVpc&lt;/code&gt; doesn't actually indicate &lt;em&gt;whether the request comes from the VPC....&lt;/em&gt; So, the IAM team must change the wording.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Publish the process under the hood&lt;/strong&gt;&lt;br&gt;
The AWS environment is complex, and it's difficult to explain everything in a few lines.&lt;br&gt;
If someone really wants to customize their AWS environment, the best way to help them understand is to publish the system details.&lt;br&gt;
I believe that if I can learn the request-context and IAM condition-matching process from a chalk talk session, it's not a secret. So why doesn't AWS publish the whole process in its documentation and let architects read it and decide what their IAM policy and VPC configuration should look like?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Let the appropriate team write the documentation&lt;/strong&gt;&lt;br&gt;
One of the points of contention in the discussion was: "Does this &lt;code&gt;aws:SourceVpc&lt;/code&gt; condition only work on S3?"&lt;br&gt;
The reason for this argument is that when we read the documentation and want to see more details, it directs us to the S3 documentation: &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies-vpc-endpoint.html#example-bucket-policies-restrict-access-vpc" rel="noopener noreferrer"&gt;Restricting Access to a Specific VPC&lt;/a&gt;&lt;br&gt;
Then I asked myself: the VPC endpoint is the VPC team's product, and the IAM policy is managed by the IAM team, especially since this is a global condition key. So why does the responsibility for explaining it fall to the S3 team?&lt;br&gt;
I understand that maybe the S3 team has written excellent documentation and the IAM team wants to borrow it.&lt;br&gt;
But can the IAM team at least give it a stamp of approval and move it into the IAM documentation? Then we, as AWS users, would be less confused about whether a feature is specific to one service or common to all services.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>iam</category>
      <category>vpc</category>
      <category>cloudsecurity</category>
    </item>
    <item>
      <title>Start building my AWS Clean Rooms lab</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Tue, 02 Jan 2024 02:10:48 +0000</pubDate>
      <link>https://forem.com/aws-builders/start-building-my-aws-clean-rooms-lab-1dii</link>
      <guid>https://forem.com/aws-builders/start-building-my-aws-clean-rooms-lab-1dii</guid>
<description>&lt;p&gt;Last month, I had a &lt;a href="https://www.linkedin.com/posts/richardfan1126_from-privacy-to-partnership-the-royal-society-activity-7142354202655084544-ikbi"&gt;post on Linkedin&lt;/a&gt; about AWS Clean Rooms Differential Privacy. But I was not comfortable sharing something I had never used. So I spent some time trying it, and then hit a wall, hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is it so challenging to try a clean room service?
&lt;/h2&gt;

&lt;p&gt;First of all, the name &lt;strong&gt;Clean Room&lt;/strong&gt; is not coined by AWS. &lt;strong&gt;Data clean room&lt;/strong&gt; is a concept of analyzing data in an isolated environment so multiple parties can bring their data together to produce insight without compromising data privacy.&lt;/p&gt;

&lt;p&gt;The difficulties of getting started are not specific to AWS Clean Rooms. It's more about the nature of a data clean room:&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-party collaboration
&lt;/h3&gt;

&lt;p&gt;Data clean room is about collaboration between different parties. To simulate this environment, we must utilize multiple AWS accounts to get a sense of the service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliance on good data
&lt;/h3&gt;

&lt;p&gt;We can't feed random data into a data clean room and get meaningful output. First, we must have 2 different datasets, because we are simulating a multi-party collaboration. Second, the two datasets must be related.&lt;/p&gt;

&lt;p&gt;Obviously, we can't bring a list of Netflix movies and a bus route table together and hope to get meaningful insight from them.&lt;/p&gt;
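&lt;p&gt;To make the "related datasets" requirement concrete, here is a toy Python sketch (stdlib only, with made-up data) of what a clean-room query conceptually does: join two parties' tables on a shared key, and release only an aggregate rather than the row-level overlap:&lt;/p&gt;

```python
from collections import Counter

# Party A: an airline's loyalty members (member id -> tier)  [toy data]
party_a = {"m1": "Gold", "m2": "Silver", "m3": "Gold"}

# Party B: a retailer's purchases (member id -> spend)  [toy data]
party_b = {"m1": 120.0, "m3": 80.0, "m4": 30.0}

# A clean room joins on the shared key but releases only aggregates,
# never the row-level overlap itself.
overlap = party_a.keys() & party_b.keys()
spend_by_tier = Counter()
for member in overlap:
    spend_by_tier[party_a[member]] += party_b[member]

print(dict(spend_by_tier))  # {'Gold': 200.0}
```

&lt;p&gt;Without a shared key (like the member IDs here), there is nothing to join on, and no collaboration to simulate.&lt;/p&gt;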

&lt;h3&gt;
  
  
  Lack of online resources
&lt;/h3&gt;

&lt;p&gt;This is probably the major reason.&lt;/p&gt;

&lt;p&gt;I tried to search on &lt;a href="https://aws.amazon.com/clean-rooms/"&gt;AWS official website&lt;/a&gt; to find resources. What I got is a lovely architecture diagram and a &lt;a href="https://aws.amazon.com/clean-rooms/resources/#Demo"&gt;pre-recorded demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I tried to search on &lt;a href="https://workshops.aws/"&gt;AWS workshop website&lt;/a&gt; using the keyword &lt;strong&gt;Clean&lt;/strong&gt;. The only thing that popped up is &lt;strong&gt;Service Cloud Voice Series: Cleaning up your environment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I can try &lt;em&gt;ClickOps&lt;/em&gt; on the console without a tutorial and figure it out myself. But I still need some good data to play with.&lt;/p&gt;

&lt;p&gt;I tried searching on &lt;a href="https://www.kaggle.com/"&gt;Kaggle&lt;/a&gt;, and also on Google using keywords like &lt;em&gt;"data clean room lab csv"&lt;/em&gt;, &lt;em&gt;"data clean room sample data"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But the datasets I found are either not clean enough or have only one table, with which I can't simulate a data collaboration.&lt;/p&gt;

&lt;h2&gt;
  
  
  That's why I'm creating my own lab
&lt;/h2&gt;

&lt;p&gt;I was frustrated, and I don't want other people to go through the same frustration. So, I decided to build an easy-to-follow lab on AWS Clean Rooms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding a suitable dataset
&lt;/h3&gt;

&lt;p&gt;After trying harder and harder &lt;em&gt;(I learned this during my OSCP course)&lt;/em&gt;, I finally found some useful sample data from &lt;a href="https://mavenanalytics.io/data-playground"&gt;Maven Analytics&lt;/a&gt;. And more importantly, their data is in the public domain, meaning I can freely use it in my lab. I picked the &lt;strong&gt;Airline Loyalty Program&lt;/strong&gt; data in my lab.&lt;/p&gt;

&lt;h3&gt;
  
  
  IaC everything
&lt;/h3&gt;

&lt;p&gt;Another intimidating thing about AWS Clean Rooms is that we must jump between AWS accounts to finish the setup. That complicates not just &lt;em&gt;ClickOps&lt;/em&gt; but also IaC.&lt;/p&gt;

&lt;p&gt;I usually use CloudFormation when working on public AWS projects because it's native to AWS. But this time, I'm mixing CloudFormation with Terraform because of Terraform's easy-to-set-up multi-account deployment. I hope AWS can learn from HashiCorp in this aspect and make it easier to deploy to other accounts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here's the link
&lt;/h2&gt;

&lt;p&gt;After talking so much, here's the link to my still-in-progress AWS Clean Rooms Lab: &lt;a href="https://github.com/richardfan1126/aws-clean-rooms-lab"&gt;https://github.com/richardfan1126/aws-clean-rooms-lab&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The lab is not complete yet, but it already has 2 sessions that you can go through, start playing with, and get meaningful results from.&lt;/p&gt;

&lt;p&gt;What's more exciting is that if you just want to play with the analysis rules and queries and don't want to deal with all the infrastructure hassle, you can simply run a few commands, and everything will be set up for you.&lt;/p&gt;

&lt;p&gt;I will continue creating more sessions on more complex analysis rules. And more interestingly, the differential privacy part.&lt;/p&gt;

</description>
      <category>cloudsecurity</category>
      <category>dataprivacy</category>
      <category>privacyenhancingtechnologies</category>
      <category>aws</category>
    </item>
    <item>
      <title>A playground to practice differential privacy - Antigranular</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Tue, 26 Dec 2023 14:47:09 +0000</pubDate>
      <link>https://forem.com/richardfan1126/a-playground-to-practice-differential-privacy-antigranular-32aj</link>
      <guid>https://forem.com/richardfan1126/a-playground-to-practice-differential-privacy-antigranular-32aj</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt; 1. Background&lt;br&gt;
 2. What is Antigranular&lt;br&gt;
 3. My quick walkthrough - as a non-data engineer&lt;br&gt;
       3.1. Create a Jupyter Notebook&lt;br&gt;
       3.2. Running some basic data engineering tasks&lt;br&gt;
       3.3. Do some machine learning tasks&lt;br&gt;
 4. Why I think it is useful&lt;br&gt;
 5. How I think as a security engineer&lt;br&gt;
       5.1. Verifiable TEE&lt;br&gt;
       5.2. Threat modelling&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;I knew Jack from &lt;a href="https://www.oblivious.com/"&gt;Oblivious&lt;/a&gt; (His company was called Oblivious AI then) early this year when I was researching companies that use &lt;a href="https://aws.amazon.com/ec2/nitro/nitro-enclaves/"&gt;AWS Nitro Enclaves&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At that time, their tool was just helping users deploy simple applications in the enclaves, and I didn't understand how it was related to data science or even AI.&lt;/p&gt;

&lt;p&gt;Last month in AWS re:Invent, I met Jack in person for the first time. After a great chat with him, I finally understood what his company was trying to achieve.&lt;/p&gt;

&lt;p&gt;And today's post is to share my first-glance view on the Oblivious platform - &lt;a href="https://www.antigranular.com/"&gt;Antigranular&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Antigranular
&lt;/h2&gt;

&lt;p&gt;Antigranular is a Kaggle-like platform where we can play with various datasets and join machine learning and data science competitions that use them.&lt;/p&gt;

&lt;p&gt;The difference from Kaggle is that Antigranular's datasets are not freely accessible. Instead, there are restrictions on how users can access the data, in order to guarantee data privacy.&lt;/p&gt;

&lt;p&gt;There is another &lt;a href="https://pub.towardsai.net/antigranular-how-to-access-sensitive-datasets-without-looking-at-them-44090cb22d8a"&gt;blog post&lt;/a&gt; by Bex T. explaining what Antigranular is, what techniques it applies, and how to get started. You can read it if you are interested in the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  My quick walkthrough - as a non-data engineer
&lt;/h2&gt;

&lt;p&gt;I'm not a data engineer, and I don't even know the difference between &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But I still tried to create my Jupyter notebook to play with one sandbox competition on Antigranular.&lt;/p&gt;

&lt;p&gt;If you are also not a data engineer and have no idea how DataFrame works, my walkthrough may help you understand Antigranular and differential privacy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a Jupyter Notebook
&lt;/h3&gt;

&lt;p&gt;To play with the dataset, we first must create a Jupyter notebook, a powerful and popular tool among data engineers. I created mine on &lt;a href="https://colab.research.google.com/"&gt;Google Colab&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VKlufFQM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/c45ce549-fdff-4adc-a770-72573805d5cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VKlufFQM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/c45ce549-fdff-4adc-a770-72573805d5cf.png" alt="Using Google Colab to create a Jupyter notebook" width="800" height="867"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jupyter notebooks can run different programming languages. Since Antigranular provides a Python library, I will be using Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running some basic data engineering tasks
&lt;/h3&gt;

&lt;p&gt;Before playing with the dataset, I need to mention a major difference between Antigranular and other data platforms: the data is not loaded into our Jupyter notebook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UhYXS9EO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/4e0c878c-e9da-42e1-b031-9cf45cfacb9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UhYXS9EO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/4e0c878c-e9da-42e1-b031-9cf45cfacb9a.png" alt="Data cannot be accessed on local notebook" width="653" height="659"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see from the screenshot that if I access the data and try to get its metadata, it will raise an error.&lt;/p&gt;

&lt;p&gt;Instead, the data is being loaded in a trusted execution environment (TEE) hosted by Antigranular.&lt;/p&gt;

&lt;p&gt;To access the TEE, we must add the &lt;code&gt;%%ag&lt;/code&gt; magic command to the code block. The magic is that only a limited set of libraries and functions can be used in those code blocks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4LOZEnas--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/2fe7b0c5-15d6-41ba-b145-de0405231427.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4LOZEnas--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/2fe7b0c5-15d6-41ba-b145-de0405231427.png" alt="The operation in the TEE is limited" width="612" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data engineers usually use the &lt;code&gt;head()&lt;/code&gt; function to preview the data. But with &lt;code&gt;op_pandas&lt;/code&gt;, this action is blocked.&lt;/p&gt;

&lt;p&gt;With these restrictions, the Antigranular platform can assure data providers that the privacy of individuals in the dataset is protected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do some machine learning tasks
&lt;/h3&gt;

&lt;p&gt;Now, we know that Antigranular runtime is an environment with limited visibility to the dataset. But what is our goal?&lt;/p&gt;

&lt;p&gt;Inside the dataset, there are training data and testing data. We need to use our limited access to the training data to train an ML model, then use that model to predict the outcome from the testing data.&lt;/p&gt;

&lt;p&gt;The catch is that our privacy budget is consumed whenever we access the training data.&lt;/p&gt;

&lt;p&gt;I'm not a data scientist, so I won’t explain privacy budget and differential privacy in detail.&lt;/p&gt;

&lt;p&gt;But the idea is that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If we run enough targeted queries on a dataset, we can infer details about individual records.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And differential privacy is all about limiting that scenario: the less privacy budget we use, the less likely individual records can be inferred.&lt;/strong&gt;&lt;/p&gt;
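&lt;p&gt;For intuition, here is a minimal stdlib-only Python sketch of the Laplace mechanism, the classic building block behind this idea (this is my own illustration, not Antigranular's implementation): each query spends part of a fixed epsilon budget, and queries are refused once the budget is exhausted:&lt;/p&gt;

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # Sample from Laplace(0, scale) via the inverse CDF.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

class PrivateDataset:
    def __init__(self, values, total_epsilon: float, seed: int = 0):
        self._values = list(values)
        self.budget_left = total_epsilon
        self._rng = random.Random(seed)

    def noisy_count(self, epsilon: float) -> float:
        # Counting has sensitivity 1, so the noise scale is 1 / epsilon:
        # a smaller epsilon spends less budget but adds more noise.
        if epsilon > self.budget_left:
            raise RuntimeError("privacy budget exhausted")
        self.budget_left -= epsilon
        return len(self._values) + laplace_noise(1.0 / epsilon, self._rng)

ds = PrivateDataset(range(1000), total_epsilon=0.3)
print(round(ds.noisy_count(epsilon=0.1)))  # roughly 1000; exact value depends on the noise
print(round(ds.budget_left, 2))            # 0.2
```

&lt;p&gt;This is the trade-off the competitions are built around: answers get noisier as you spend less epsilon, and once the budget is gone, no more queries are allowed.&lt;/p&gt;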

&lt;p&gt;Of course, if we want to train a good ML model, we should train it with accurate data. But the catch here is that we also want to protect individual privacy.&lt;/p&gt;

&lt;p&gt;So, the competitions on Antigranular are about training an ML model using as little privacy budget as possible, then using it to predict the test data as accurately as possible and submitting that prediction result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SXsSnT_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/2c1fef93-b645-4e9f-84e6-420b26073f2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SXsSnT_p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/2c1fef93-b645-4e9f-84e6-420b26073f2c.png" alt="Using Gaussian Naive Bayes" width="588" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I know little about supervised ML, so I used a simple Gaussian Naive Bayes model trained on the training data with a privacy budget (&lt;strong&gt;epsilon&lt;/strong&gt;) of 0.1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vXKcutXm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/af095714-5b0d-4430-9ebf-ab24a99307e0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vXKcutXm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/af095714-5b0d-4430-9ebf-ab24a99307e0.png" alt="Submit the prediction and get the score" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, I used the model to predict the outcome from the test data and submitted it.&lt;/p&gt;

&lt;p&gt;As expected, I got around &lt;strong&gt;0.27&lt;/strong&gt; points, far lower than other submissions, at around 0.7.&lt;/p&gt;

&lt;p&gt;Another thing we can see here is the privacy budget I've used so far (i.e. &lt;code&gt;total_epsilon_used&lt;/code&gt;) on the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I think it is useful
&lt;/h2&gt;

&lt;p&gt;Many Privacy Enhancing Technologies (PETs) are emerging, like Trusted Execution Environments, Homomorphic Encryption, synthetic data, etc. Most of them require skills and knowledge only from the developers.&lt;/p&gt;

&lt;p&gt;Differential privacy, however, also requires skill from the users. We can see from the walkthrough that how we query the data, and even how many queries we run, affects the privacy budget we consume.&lt;/p&gt;

&lt;p&gt;It's not just engineers; data analysts also need to learn how to interact with differentially private datasets. And I think Antigranular is a great place to play and learn.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I think as a security engineer
&lt;/h2&gt;

&lt;p&gt;After talking about data engineering, let me come back to my security engineer role. How do I think about it?&lt;/p&gt;

&lt;h3&gt;
  
  
  Verifiable TEE
&lt;/h3&gt;

&lt;p&gt;The core of Trusted Execution Environment (TEE) is to ensure data is being processed in a &lt;strong&gt;trusted hardware&lt;/strong&gt; that is running a &lt;strong&gt;trusted software&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core part of the &lt;code&gt;oblv_client&lt;/code&gt; library, which the &lt;code&gt;antigranular&lt;/code&gt; library uses to connect to the TEE runtime on Antigranular, is shipped in compiled form, so I can't see whether it follows the documented &lt;a href="https://docs.aws.amazon.com/enclaves/latest/user/verify-root.html"&gt;process&lt;/a&gt; to verify that the code is running on genuine AWS Nitro Enclaves. But I tend to believe it does.&lt;/p&gt;

&lt;p&gt;The other question is the &lt;strong&gt;trusted software&lt;/strong&gt;. From the &lt;a href="https://docs.antigranular.com/"&gt;documentation&lt;/a&gt; and the &lt;a href="https://github.com/orgs/ObliviousAI/repositories"&gt;GitHub page&lt;/a&gt; of Antigranular, I cannot find any source code for their TEE. The fingerprint of the TEE, which the client verifies against during Jupyter notebook initialization, comes from an Antigranular API. So, we can only trust that the software inside the TEE is safe and honest.&lt;/p&gt;

&lt;p&gt;Even if we trust Antigranular, or some parties gain access to the TEE's source code, there is still another problem: &lt;strong&gt;reproducible builds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To verify whether a TEE is running the exact same software, we must ensure the fingerprint is always the same.&lt;/p&gt;

&lt;p&gt;But many factors can make the compiled software differ, e.g. build time and software dependencies, especially since the Antigranular runtime relies on many Python libraries, whose builds are rarely byte-for-byte identical.&lt;/p&gt;
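&lt;p&gt;The comparison itself is trivial; the hard part is making two independent builds produce identical bytes. A toy Python illustration, with &lt;code&gt;hashlib&lt;/code&gt; standing in for the enclave measurement (Nitro Enclaves PCRs are SHA-384 digests of the image; the "build" bytes here are made up):&lt;/p&gt;

```python
import hashlib

def measure(image_bytes: bytes) -> str:
    # Stand-in for the enclave measurement (Nitro PCRs use SHA-384).
    return hashlib.sha384(image_bytes).hexdigest()

# Two builds of the "same" software:
build_1 = b"app-v1.0 deps=libA==1.2" + b" built=2023-12-01"
build_2 = b"app-v1.0 deps=libA==1.2" + b" built=2023-12-02"

# A single embedded timestamp changes the whole fingerprint,
# so the client can no longer match it against the expected value.
print(measure(build_1) == measure(build_2))  # False

# A reproducible build strips such nondeterminism, so anyone
# can rebuild from source and arrive at the identical fingerprint.
reproducible = b"app-v1.0 deps=libA==1.2"
print(measure(reproducible) == measure(reproducible))  # True
```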

&lt;h3&gt;
  
  
  Threat modelling
&lt;/h3&gt;

&lt;p&gt;A common way to do threat modelling in cybersecurity is to ask: Is it a risk? How critical is it? How to mitigate it? How do we detect it?&lt;/p&gt;

&lt;p&gt;But for differential privacy, it's a little bit tricky.&lt;/p&gt;

&lt;p&gt;Is the data critical? &lt;em&gt;Yes, of course! There are many PII&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, there is a risk of data breach. Let's lock it up. &lt;em&gt;No, we need to share it with other parties to do research&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OK, but how do you mitigate the risk of a data breach? &lt;em&gt;We can set a differential privacy policy, but you need to figure out the parameters&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Can we set an alarm when someone accesses the sensitive data? &lt;em&gt;Our counterpart is supposed to have some access to the data. How can we define what is sensitive?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I can't imagine how I would react if the data team asked me to do threat modelling for their data clean room with differential privacy today.&lt;/p&gt;

&lt;p&gt;However, I think differential privacy will definitely change how we protect data in the future, and we must learn its capability and limitations.&lt;/p&gt;

</description>
      <category>cloudsecurity</category>
      <category>dataprivacy</category>
      <category>dataprotection</category>
    </item>
    <item>
      <title>First Try on AWS Security Hub Central Configuration</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Tue, 26 Dec 2023 14:36:14 +0000</pubDate>
      <link>https://forem.com/aws-builders/first-try-on-aws-security-hub-central-configuration-4kni</link>
      <guid>https://forem.com/aws-builders/first-try-on-aws-security-hub-central-configuration-4kni</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt; 1. Help us manage security controls in one place&lt;br&gt;
 2. The caveats&lt;br&gt;
       2.1. Don't forget to enable AWS Config if you want to get findings&lt;br&gt;
       2.2. Use the right template&lt;br&gt;
 3. Painful experiment&lt;/p&gt;

&lt;p&gt;In my previous &lt;a href="https://dev.to/aws-builders/my-thoughts-on-aws-reinvent-2023-announcements-1m3b#aws-security-hub-central-configuration"&gt;post&lt;/a&gt;, I mentioned the new AWS Security Hub Central Configuration feature. I thought AWS had finally solved the headache we face when managing Security Hub in cross-account, cross-region environments. It's kind of true, but only partly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Help us manage security controls in one place
&lt;/h2&gt;

&lt;p&gt;Let's talk about the good first. Security Hub central configuration helps us manage security controls across different accounts and regions.&lt;/p&gt;

&lt;p&gt;When we enable central configuration, we can pick the regions, and the policy we create later will be deployed to the selected regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TZZKN2H3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/b1693d52-aba7-47d0-903b-6b70c267d01e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TZZKN2H3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/b1693d52-aba7-47d0-903b-6b70c267d01e.png" alt="Select regions to deploy configuration" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can then create different policies on the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What security standards to deploy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What controls to enable/disable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customize control parameters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MRtq7ahn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/f4bd32ea-5255-412a-8f94-a19b695b9f73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MRtq7ahn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/f4bd32ea-5255-412a-8f94-a19b695b9f73.png" alt="Setting configuration policy" width="784" height="807"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These policies can be deployed to all accounts or the accounts we specify so that we can configure different accounts differently.&lt;/p&gt;
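&lt;p&gt;As a rough sketch of what creating such a policy looks like in code (the payload shape follows my reading of the Security Hub &lt;code&gt;CreateConfigurationPolicy&lt;/code&gt; API; the standard ARN and control ID below are illustrative, not recommendations):&lt;/p&gt;

```python
# Hypothetical sketch: build the request for the delegated administrator
# account; the shape follows the CreateConfigurationPolicy API.
policy_request = {
    "Name": "prod-accounts-policy",
    "Description": "Enable CIS and tune controls for production accounts",
    "ConfigurationPolicy": {
        "SecurityHub": {
            "ServiceEnabled": True,
            "EnabledStandardIdentifiers": [
                "arn:aws:securityhub:::standards/cis-aws-foundations-benchmark/v/1.4.0"
            ],
            "SecurityControlsConfiguration": {
                # Example: disable one control we accept the risk for.
                "DisabledSecurityControlIds": ["CloudTrail.5"],
            },
        }
    },
}

# In a real environment you would send it with boto3:
# boto3.client("securityhub").create_configuration_policy(**policy_request)
```

&lt;p&gt;After creating the policy, you would associate it with target accounts or OUs to actually deploy it.&lt;/p&gt;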

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x4exnuJj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/8a7d7a12-5151-40e8-a59d-79142fc1dd44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x4exnuJj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/8a7d7a12-5151-40e8-a59d-79142fc1dd44.png" alt="Deploy policy to specified accounts" width="791" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The caveats
&lt;/h2&gt;

&lt;p&gt;OK, we've finished talking about the good part. Let's talk about the dark side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't forget to enable AWS Config if you want to get findings
&lt;/h3&gt;

&lt;p&gt;So the AWS &lt;a href="https://aws.amazon.com/blogs/security/introducing-new-central-configuration-capabilities-in-aws-security-hub/"&gt;blog post&lt;/a&gt; claimed we can use &lt;em&gt;"a single action to enable Security Hub across your organization"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f8-kYUKQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/8827cf83-ce12-4996-b611-f488cd699889.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f8-kYUKQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/8827cf83-ce12-4996-b611-f488cd699889.png" alt="AWS blog claimed we can enable Security Hub across organization using a single action" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Right, but it only turns on Security Hub. If we want to get findings, we still need to enable AWS Config on all the accounts, ... manually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SxaB34_3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/2f5d6826-345d-4ef9-9ef1-c4c32d7fc1fc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SxaB34_3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/2f5d6826-345d-4ef9-9ef1-c4c32d7fc1fc.png" alt="Enabling AWS Config is still manually" width="787" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OK, fine!! So I scrolled down a little bit and found this.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"if AWS Config is not yet enabled in an account, the policy will have a failed status."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--shzeXxiu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/f2fbee64-e81f-486f-813b-9cded119c0da.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--shzeXxiu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/f2fbee64-e81f-486f-813b-9cded119c0da.png" alt="Failure when AWS Config is not enabled" width="792" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then tried to deploy Security Hub on my AWS Organization, in which I had enabled Config in only 1 account.&lt;/p&gt;

&lt;p&gt;Guess what? I got the green lights for all 3 accounts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BcBaohy2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/b2da5bbb-631e-4aa1-8ecd-b4dd2a858abc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BcBaohy2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/b2da5bbb-631e-4aa1-8ecd-b4dd2a858abc.png" alt="Deployment success even some accounts don't have Config enabled" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Maybe I forgot that I had enabled Config on these accounts, or maybe Security Hub helped me turn them on?&lt;/p&gt;

&lt;p&gt;So, I waited 2 days for the findings to come. The account that already had Config enabled got many findings, but the 2 without Config got only 17 findings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g0aBaPp5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/3e828951-ecc2-49d9-b360-a690d3d86af4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g0aBaPp5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/3e828951-ecc2-49d9-b360-a690d3d86af4.png" alt="Accounts without Config only got 17 findings" width="771" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I went on and used CloudFormation StackSet to enable AWS Config for these 2 accounts.&lt;/p&gt;

&lt;p&gt;At that point, I was pretty sure AWS Config had not been enabled before, because the StackSet deployment would have failed otherwise.&lt;/p&gt;

&lt;p&gt;I don't know what went wrong, but after enabling AWS Config, the findings finally came.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eDHddolk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/abaeb29b-0991-4dda-a43b-e59de511afe3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eDHddolk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/abaeb29b-0991-4dda-a43b-e59de511afe3.png" alt="Findings started coming after enabling AWS Config" width="776" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I still don't understand why the error message didn't come.&lt;/p&gt;

&lt;p&gt;But the main takeaway is: &lt;strong&gt;Make sure you have AWS Config enabled on all relevant accounts if you want to get findings from AWS Security Hub&lt;/strong&gt;.&lt;/p&gt;
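&lt;p&gt;One hedged way to verify that takeaway per account, using the AWS Config &lt;code&gt;DescribeConfigurationRecorderStatus&lt;/code&gt; API (the helper names here are my own):&lt;/p&gt;

```python
def recording_enabled(status_response: dict) -> bool:
    # True if at least one configuration recorder is actively recording.
    return any(
        s.get("recording", False)
        for s in status_response.get("ConfigurationRecordersStatus", [])
    )

def check_account(session) -> bool:
    # session is a boto3.Session for the target account/region.
    # DescribeConfigurationRecorderStatus reports whether the Config
    # recorder is actually running, not just whether it exists.
    resp = session.client("config").describe_configuration_recorder_status()
    return recording_enabled(resp)
```

&lt;p&gt;Running a check like this in every account and region before trusting the Security Hub dashboard would have caught my missing-findings problem early.&lt;/p&gt;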

&lt;h3&gt;
  
  
  Use the right template
&lt;/h3&gt;

&lt;p&gt;Another interesting point (but not related to this new feature) is the template we use to enable AWS Config.&lt;/p&gt;

&lt;p&gt;The CloudFormation StackSet console has a sample template called "Enable AWS Config".&lt;/p&gt;

&lt;p&gt;But if you only want to get AWS Security Hub findings, &lt;strong&gt;DON'T&lt;/strong&gt; use it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VVfTT_2e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/4278c5ad-012d-48c1-a1fc-5152ac73b7e5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VVfTT_2e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.richardfan.xyz/assets/images/4278c5ad-012d-48c1-a1fc-5152ac73b7e5.png" alt="Don't use the default StackSet template to enable AWS Config" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is another StackSet template &lt;a href="https://github.com/aws-samples/aws-cfn-for-optimizing-aws-config-for-aws-security-hub/blob/main/AWS-Config-optimized-for-AWS-Security-Hub.yaml"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This template only enables configuration recording on resource types that Security Hub cares about.&lt;/p&gt;

&lt;p&gt;Using this one could help you save money by not recording resources that Security Hub doesn't look at.&lt;/p&gt;
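&lt;p&gt;The idea behind that optimized template can be sketched as a recorder recording-group setting; the resource types below are illustrative examples, not the template's full list:&lt;/p&gt;

```python
# Sketch of the cost-saving idea: instead of allSupported=True (record
# everything), record only resource types that Security Hub evaluates.
recording_group = {
    "allSupported": False,
    "includeGlobalResourceTypes": False,
    "resourceTypes": [
        # Example subset only; the real template lists many more types.
        "AWS::IAM::User",
        "AWS::S3::Bucket",
        "AWS::EC2::SecurityGroup",
    ],
}
```

&lt;p&gt;Since AWS Config bills per configuration item recorded, narrowing the recorded types is what produces the savings.&lt;/p&gt;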

&lt;h2&gt;
  
  
  Painful experiment
&lt;/h2&gt;

&lt;p&gt;So, I still can't figure out why my child accounts could pass the check even though AWS Config was not enabled.&lt;/p&gt;

&lt;p&gt;I'll need to create another clean AWS Organization to test out.&lt;/p&gt;

&lt;p&gt;Experimenting with things on Cloud Governance is really a painful task.&lt;/p&gt;

&lt;p&gt;I can't simply nuke the resources and restart, because what I'm testing is the Organization itself and its accounts.&lt;/p&gt;

&lt;p&gt;And now, I need to restart everything again.&lt;/p&gt;


</description>
      <category>aws</category>
      <category>cloudsecurity</category>
      <category>securityhub</category>
    </item>
    <item>
      <title>Operation Fire Valley Rocks</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Sat, 16 Dec 2023 03:20:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/operation-fire-valley-rocks-4mi6</link>
      <guid>https://forem.com/aws-builders/operation-fire-valley-rocks-4mi6</guid>
      <description>&lt;p&gt;I'm not a poet, and it's difficult for me to write a 10-verse poem.&lt;/p&gt;

&lt;p&gt;But luckily, this year is all about GenAI. I used GenAI to help me rewrite the poem &lt;a href="https://dev.to/jennworks40"&gt;Jenn Bergstrom&lt;/a&gt; has written.&lt;/p&gt;

&lt;p&gt;Using Step Functions to make a Christmas Tree reminds me of the grand old days of ASCII art. And here is the Step Functions Christmas Tree I've created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iO6wjbOu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qdklnw2r51atkk740cug.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iO6wjbOu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qdklnw2r51atkk740cug.jpg" alt="My Step Functions Christmas Tree" width="800" height="656"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2023 has been an adventurous year for me. I'm so thankful to the AWS Community Builders and those who make this program great.&lt;/p&gt;

&lt;p&gt;This year, I've met many CBs on different occasions. The CB program makes me feel like I am part of the community; I know where I can find people to talk to and where I can participate.&lt;/p&gt;

&lt;p&gt;Of course, being recognized as an AWS Security Hero is the biggest thing for me this year. Although it was just a month ago, it already feels like much longer.&lt;/p&gt;

&lt;p&gt;Again, I'm so thankful to the people who have helped me, and pushed me along the way, Jason Dunn, Lily Kerns, Taylor Jacobsen, Johannes Koch, Chris Williams, ... &lt;em&gt;(No specific order)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And, of course, thanks to the AWS community in Hong Kong, Amy Wong, Gabriel Koo, Anthony Lai. My fellow AWS Heroes from HK, Cyrus Wong, and Alex Lau. &lt;em&gt;(No specific order)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>communitybuilders</category>
      <category>thankfulforest2024</category>
      <category>firevalleyrocks</category>
    </item>
    <item>
      <title>My thoughts on AWS re:Invent 2023 announcements</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Thu, 07 Dec 2023 15:37:47 +0000</pubDate>
      <link>https://forem.com/aws-builders/my-thoughts-on-aws-reinvent-2023-announcements-1m3b</link>
      <guid>https://forem.com/aws-builders/my-thoughts-on-aws-reinvent-2023-announcements-1m3b</guid>
      <description>&lt;p&gt; 1. Preface&lt;br&gt;
 2. Good ones&lt;br&gt;
       2.1. AWS Security Hub central configuration&lt;br&gt;
       2.2. AWS Security Hub custom control parameters&lt;br&gt;
       2.3. Amazon GuardDuty ECS Runtime Monitoring&lt;br&gt;
       2.4. Amazon Inspector agentless vulnerability assessments&lt;br&gt;
 3. Still good ones, but just ... disappointed&lt;br&gt;
       3.5. AWS Config periodic recording&lt;br&gt;
       3.6. Amazon S3 Access Grants&lt;br&gt;
 4. GenAI&lt;br&gt;
       4.7. Guardrails for Amazon Bedrock&lt;br&gt;
       4.8. Responsible AI - Amazon Titan image watermark&lt;br&gt;
       4.9. GenAI help cybersecurity&lt;/p&gt;

&lt;h2&gt;
  
  
  Preface
&lt;/h2&gt;

&lt;p&gt;This year is all about GenAI, and AWS re:Invent is no exception: almost half of the announcements are about GenAI, especially &lt;a href="https://aws.amazon.com/blogs/aws/introducing-amazon-q-a-new-generative-ai-powered-assistant-preview/" rel="noopener noreferrer"&gt;Amazon Q&lt;/a&gt; &lt;em&gt;(I still don't like this name)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;However, as a cloud security guy, some other announcements also interest me, and here are my thoughts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Good ones
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS Security Hub central configuration
&lt;/h3&gt;

&lt;p&gt;Yes! This one! It's not a fancy one, you probably didn't notice it, but this one tops my list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fyh4xowuerybgvcrvxm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fyh4xowuerybgvcrvxm.jpeg" alt="Werner Vogels Keynote - Non-functional Requirements"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a photo I took during Werner Vogels' keynote. Security is one of the non-functional requirements: it's not a feature we can choose; it's about coverage.&lt;/p&gt;

&lt;p&gt;I always find it challenging to maintain the security posture within an AWS Organization; there are so many accounts and regions to take care of.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://aws.amazon.com/blogs/security/introducing-new-central-configuration-capabilities-in-aws-security-hub/" rel="noopener noreferrer"&gt;AWS Security Hub central configuration&lt;/a&gt;, we can now configure security controls across accounts, across regions, all in the same place.&lt;/p&gt;

&lt;p&gt;What I love to see in the future is the same feature in Amazon Inspector, GuardDuty, and AWS Config.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Security Hub custom control parameters
&lt;/h3&gt;

&lt;p&gt;It's Security Hub again. &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/customize-security-controls-aws-security-hub/" rel="noopener noreferrer"&gt;Custom control parameters&lt;/a&gt; are also a feature that I love to see.&lt;/p&gt;

&lt;p&gt;Before this, all Security Hub controls were hard-coded and mostly followed industry standards like CIS, PCI-DSS, and NIST 800-53. &lt;/p&gt;

&lt;p&gt;However, most standards only outline the minimum security requirements, and many organizations want to do better. E.g., the &lt;a href="https://docs.aws.amazon.com/securityhub/latest/userguide/iam-controls.html#iam-7" rel="noopener noreferrer"&gt;IAM user password policy&lt;/a&gt; is set to a minimum of 8 characters because of the NIST 800-53 standard. But I think most organizations would like their employees to use a longer password.&lt;/p&gt;

&lt;p&gt;Now, we can customize the control to check if all the AWS accounts meet the stronger password policy that we set.&lt;/p&gt;
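&lt;p&gt;For illustration, a hedged sketch of such a customization via the Security Hub &lt;code&gt;UpdateSecurityControl&lt;/code&gt; API (the parameter name and value shape are my assumption from the control's documentation; 16 characters is just an example):&lt;/p&gt;

```python
# Hypothetical sketch: raise the IAM.7 minimum password length above the
# NIST 800-53 default of 8 characters.
request = {
    "SecurityControlId": "IAM.7",
    "Parameters": {
        "MinimumPasswordLength": {
            "ValueType": "CUSTOM",
            "Value": {"Integer": 16},
        }
    },
    "LastUpdateReason": "Require 16+ character passwords org-wide",
}

# In a real account you would send it with boto3:
# boto3.client("securityhub").update_security_control(**request)
```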

&lt;h3&gt;
  
  
  Amazon GuardDuty ECS Runtime Monitoring
&lt;/h3&gt;

&lt;p&gt;The EKS runtime monitoring has already been available since early this year. This time, it's &lt;a href="https://aws.amazon.com/blogs/aws/introducing-amazon-guardduty-ecs-runtime-monitoring-including-aws-fargate/" rel="noopener noreferrer"&gt;expanded to ECS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many companies have neither the talent to set up threat detection systems themselves nor the skills to use Kubernetes. Having GuardDuty monitor ECS workloads is a nice way to increase their monitoring coverage.&lt;/p&gt;

&lt;p&gt;Besides this, the runtime monitoring for EC2 is also in preview now!&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Inspector agentless vulnerability assessments
&lt;/h3&gt;

&lt;p&gt;Historically, if we wanted Amazon Inspector to scan EC2 instances for software vulnerabilities, we needed to install the SSM agent on them. The agent also consumes some of the instance's resources to perform the scanning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-inspector-agentless-assessments-ec2-preview/" rel="noopener noreferrer"&gt;Agentless scanning&lt;/a&gt; allows Amazon Inspector to scan the instances without impacting the running instance.&lt;/p&gt;

&lt;p&gt;This is not a new feature and has been offered by several 3rd party cloud security vendors. But having an AWS-native tool to do it makes it more accessible to customers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Still good ones, but just ... disappointed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS Config periodic recording
&lt;/h3&gt;

&lt;p&gt;The high cost has always been my major complaint about AWS Config. Last week, when AWS announced &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/aws-config-periodic-recording/" rel="noopener noreferrer"&gt;AWS Config periodic recording&lt;/a&gt;, I thought it would alleviate some of our pain. But after digging into the details, I found it probably won't.&lt;/p&gt;

&lt;p&gt;First, most of the cost incurred by AWS Config is from the amount of resources we have in the account, not the frequency of changes. So, having a lower recording frequency doesn't really help reduce the cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7j3beu8t5nhf94lpxjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7j3beu8t5nhf94lpxjt.png" alt="AWS Config recording price"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Second, the per-item price of periodic recording is 4x that of continuous recording. So, if your resources change frequently, but less than about four times a day on average, periodic recording can actually cost you more.&lt;/p&gt;
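&lt;p&gt;A back-of-envelope comparison makes the break-even point concrete. The prices below are assumptions for illustration (based on the 4x ratio); check the current AWS Config pricing page for real figures:&lt;/p&gt;

```python
# Assumed illustrative prices per configuration item (CI) recorded.
CONTINUOUS_PER_CI = 0.003
PERIODIC_PER_CI = 0.012  # the 4x figure

def monthly_cost_continuous(changes_per_day: float, days: int = 30) -> float:
    # Continuous recording captures every change as a CI.
    return CONTINUOUS_PER_CI * changes_per_day * days

def monthly_cost_periodic(changes_per_day: float, days: int = 30) -> float:
    # Periodic recording captures at most one CI per resource per day,
    # and only if the resource changed at all.
    recorded_per_day = 1 if changes_per_day > 0 else 0
    return PERIODIC_PER_CI * recorded_per_day * days

# A resource changing once a day: periodic costs 4x more.
assert monthly_cost_periodic(1) > monthly_cost_continuous(1)
# A resource changing 10 times a day: periodic is cheaper.
assert monthly_cost_periodic(10) < monthly_cost_continuous(10)
```

&lt;p&gt;Under these assumptions, the break-even is around four changes per resource per day; below that, periodic recording is the more expensive option.&lt;/p&gt;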

&lt;h3&gt;
  
  
  Amazon S3 Access Grants
&lt;/h3&gt;

&lt;p&gt;I am having issues granting data access through AWS IAM Identity Center (i.e. AWS SSO). The problem is that permission can only be assigned via a &lt;a href="https://docs.aws.amazon.com/singlesignon/latest/userguide/permissionsetsconcept.html" rel="noopener noreferrer"&gt;Permission Set&lt;/a&gt;, and the more granular I want the data access control to be, the more Permission Sets I have to create.&lt;/p&gt;

&lt;p&gt;When I saw the &lt;a href="https://aws.amazon.com/blogs/storage/scaling-data-access-with-amazon-s3-access-grants/" rel="noopener noreferrer"&gt;Amazon S3 Access Grants&lt;/a&gt; announcement last week, I thought it would be my savior.&lt;/p&gt;

&lt;p&gt;However, after a few trials, I discovered it's quite difficult to set up.&lt;/p&gt;

&lt;p&gt;First, we need to create an app in AWS IAM Identity Center to exchange identity tokens, then assume a temporary role, which in turn requests the S3 access grant that finally gives you access to the data. (Doc is &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants-directory-ids.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Second, none of these steps are available in the console, so my data analysts would have to run all the complex CLI commands themselves to get the data.&lt;/p&gt;

&lt;p&gt;This is a good feature for access control, but it's just too difficult to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  GenAI
&lt;/h2&gt;

&lt;p&gt;GenAI is cool. It's the focus this year. But I think we are still uncertain about how it relates to cybersecurity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Guardrails for Amazon Bedrock
&lt;/h3&gt;

&lt;p&gt;With &lt;a href="https://aws.amazon.com/blogs/aws/guardrails-for-amazon-bedrock-helps-implement-safeguards-customized-to-your-use-cases-and-responsible-ai-policies-preview/" rel="noopener noreferrer"&gt;Guardrails for Amazon Bedrock&lt;/a&gt;, we can set policies to restrict our Bedrock model from engaging with certain topics or content. We can also use it to redact PII.&lt;/p&gt;

&lt;p&gt;I would love to test how accurate and robust it is, and how it compares with ChatGPT against all the bypass tricks out on the Internet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Responsible AI - Amazon Titan image watermark
&lt;/h3&gt;

&lt;p&gt;Last week, AWS announced a new foundation model, &lt;a href="https://aws.amazon.com/blogs/aws/amazon-titan-image-generator-multimodal-embeddings-and-text-models-are-now-available-in-amazon-bedrock/" rel="noopener noreferrer"&gt;Amazon Titan Image Generator&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;AWS claims that all images generated by this model carry an invisible watermark, which can be used to detect whether an image was AI-generated. It is a great feature to help fight fake information.&lt;/p&gt;

&lt;p&gt;However, to date, I still can't find any details on how to verify a given image, or how well the watermark withstands image distortion.&lt;/p&gt;

&lt;h3&gt;
  
  
  GenAI helps cybersecurity
&lt;/h3&gt;

&lt;p&gt;There were many announcements last week on GenAI integration with different services, like &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-cloudwatch-ai-powered-natural-language-query-generation-preview/" rel="noopener noreferrer"&gt;CloudWatch log query generation&lt;/a&gt; and &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/aws-config-generative-ai-powered-natural-language-querying-preview/" rel="noopener noreferrer"&gt;AWS Config query generation&lt;/a&gt;. I think these capabilities lower the bar for security operations on AWS: with more help from GenAI, we no longer need every engineer to know different query languages to investigate security incidents. And with Amazon Q, we can easily find out which security controls are possible on AWS without digging through the documentation.&lt;/p&gt;

&lt;p&gt;But still, I would love to see how AWS can use GenAI to improve cloud security in a more proactive way.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsreinvent</category>
      <category>cloudsecurity</category>
    </item>
    <item>
      <title>AWS Nitro Enclaves Ecosystem (3) - Anjuna</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Sat, 29 Jul 2023 14:06:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-nitro-enclaves-ecosystem-3-anjuna-52og</link>
      <guid>https://forem.com/aws-builders/aws-nitro-enclaves-ecosystem-3-anjuna-52og</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt; 1. Background&lt;br&gt;
 2. What is Anjuna&lt;br&gt;
       2.1. Tools as a Service&lt;br&gt;
 3. Features&lt;br&gt;
       3.2. Network Proxy&lt;br&gt;
       3.3. More handy tools - Secret Storing, Persistent Storage&lt;br&gt;
 4. Most Powerful Feature - Kubernetes plugin&lt;br&gt;
 5. Data Privacy&lt;br&gt;
       5.4. Operate in Private Network&lt;br&gt;
       5.5. Licensing Model&lt;br&gt;
 6. Final Thought&lt;br&gt;
       6.6. Trust Model&lt;br&gt;
       6.7. Target Audience&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;After my last &lt;a href="https://dev.to/aws-builders/aws-nitro-enclaves-ecosystem-2-evervault-48pm"&gt;post on Evervault&lt;/a&gt; was published, I didn't have time to try out other AWS Nitro Enclaves service providers. But luckily, Anjuna, which was also on my list to review, reached out and offered me a free trial of its Nitro Enclaves offering.&lt;br&gt;&lt;br&gt;
So in this blog post, I will share my takes.&lt;br&gt;&lt;br&gt;
&lt;em&gt;If you are unfamiliar with AWS Nitro Enclaves, please read these AWS documents first. Otherwise, you may find it challenging to understand the rest of this post: &lt;a href="https://docs.aws.amazon.com/enclaves/latest/user/nitro-enclave.html" rel="noopener noreferrer"&gt;What is AWS Nitro Enclaves?&lt;/a&gt; / &lt;a href="https://docs.aws.amazon.com/enclaves/latest/user/nitro-enclave-concepts.html" rel="noopener noreferrer"&gt;Nitro Enclaves concepts&lt;/a&gt;&lt;/em&gt;  &lt;/p&gt;




&lt;h2&gt;
  
  
  What is Anjuna
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.anjuna.io/" rel="noopener noreferrer"&gt;Anjuna Security&lt;/a&gt; is a company offering a software platform that automates the creation of confidential computing environments in the public cloud. Besides AWS Nitro Enclaves, they also support other cloud platforms (e.g., Azure, GCP) based on various hardware chipsets (e.g., Intel SGX, AMD SEV).  &lt;/p&gt;

&lt;h3&gt;
  
  
  Tools as a Service
&lt;/h3&gt;

&lt;p&gt;My initial expectation of Anjuna was that it would be a cloud service integrated with my AWS account through permission grants, a common approach among cloud service providers.  &lt;/p&gt;

&lt;p&gt;However, Anjuna doesn't take this path. Instead, it provides a complete software platform to customers, helping them build and run their applications on AWS Nitro Enclaves.  &lt;/p&gt;

&lt;p&gt;During the process, no communication is needed between the workloads and Anjuna. With the tools downloaded upfront, customers can even build and deploy the application in a private VPC without Internet access.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;The Anjuna Nitro Enclaves toolset consists of several useful tools to help developers build enclave applications.  &lt;/p&gt;

&lt;p&gt;Most of the tools act as replacements for commonly used tools like &lt;code&gt;docker&lt;/code&gt; and &lt;code&gt;nitro-cli&lt;/code&gt;. The magic is that when you run a command, the tool embeds Anjuna-built runtimes or services alongside your app to accomplish certain tasks.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Diagram from Anjuna&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.anjuna.io%2Fnitro%2Flatest%2F_images%2Fnitro-default-anjuna-combined.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.anjuna.io%2Fnitro%2Flatest%2F_images%2Fnitro-default-anjuna-combined.png" alt="Diagram from Anjuna"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h3&gt;
  
  
  Network Proxy
&lt;/h3&gt;

&lt;p&gt;Without Anjuna, we need to create proxies in both the parent EC2 instance and the enclave runtime to forward traffic between them (see my &lt;a href="https://dev.to/aws-builders/running-python-app-on-aws-nitro-enclaves-3lhp"&gt;example&lt;/a&gt; in another post).  &lt;/p&gt;

&lt;p&gt;Instead of using AWS-provided &lt;code&gt;nitro-cli&lt;/code&gt;, we can use &lt;code&gt;anjuna-nitro-cli build-enclave&lt;/code&gt; to build the enclave image. The tool embeds the Anjuna Nitro Runtime into the image. This customized runtime provides more than just a network proxy on the enclave side. It also provides the proxy service for other functions I’ll discuss later.  &lt;/p&gt;

&lt;p&gt;Before running the enclave app, we need to run the command &lt;code&gt;anjuna-nitro-netd-parent --enclave-name &amp;lt;enclave_name&amp;gt; --daemonize&lt;/code&gt; to start the network proxy on the parent instance side.   &lt;/p&gt;

&lt;p&gt;By running two commands, we are ready to run an enclave app with network connections, a convenient experience for software developers.  &lt;/p&gt;

&lt;h3&gt;
  
  
  More handy tools - Secret Storing, Persistent Storage
&lt;/h3&gt;

&lt;p&gt;AWS KMS is one of only two AWS services with native support for AWS Nitro Enclaves. When it comes to storage, developers need to get creative with their solutions.  &lt;/p&gt;

&lt;p&gt;Anjuna provides two solutions. The first is secret storing, which uses S3 as the storage and KMS as the encryption service.  &lt;/p&gt;

&lt;p&gt;The tool &lt;code&gt;anjuna-nitro-encrypt&lt;/code&gt; uses your AWS KMS key to encrypt the secret and upload it to an S3 bucket you specify.  &lt;/p&gt;

&lt;p&gt;When running the enclave app, we can specify the location of the encrypted file in the enclave config file. The Anjuna runtime in the enclave downloads the file, decrypts it with AWS KMS, and provides the secret to the app runtime.  &lt;/p&gt;

&lt;p&gt;Anjuna also provides seamless persistent block storage on AWS Nitro Enclaves. With a daemon running on the parent instance and the mount point configured in the enclave config file, the Anjuna Nitro tool can mount block storage in the enclave runtime backed by a file on the parent instance.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Most Powerful Feature - Kubernetes plugin
&lt;/h2&gt;

&lt;p&gt;After discussing some handy tools, I want to dedicate a separate section to one of the most powerful ones: the Kubernetes plugin.  &lt;/p&gt;

&lt;p&gt;All the tools I have mentioned make deploying enclave applications easy, but only on a single instance. When it comes to large-scale application deployment, Kubernetes is the most popular way to go, and Anjuna brings AWS Nitro Enclaves into this area.  &lt;/p&gt;

&lt;p&gt;Like the previously mentioned tools, the Anjuna Nitro Kubernetes toolset embeds a proxy into your workloads. But in this case, besides the Anjuna runtime (they call it &lt;strong&gt;Anjuna Nitro Launcher&lt;/strong&gt;), there are two additional Kubernetes resources: &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook" rel="noopener noreferrer"&gt;MutatingWebhookConfiguration&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/" rel="noopener noreferrer"&gt;DevicePlugin&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49o6eu8wyf2z4s31kpwf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49o6eu8wyf2z4s31kpwf.jpg" alt="Anjuna Nitro Kubernetes toolset"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;We only need to add an annotation to the pod definition to deploy an application into the enclave. But under the hood, there are a series of events happening.  &lt;/p&gt;

&lt;p&gt;First, the Anjuna Nitro Webhook intercepts the request and modifies it in two main ways: it embeds the app image into the Anjuna Nitro Launcher runtime, which provides services to the enclave app, and it specifies the pod's enclave requirement inside the &lt;code&gt;resources&lt;/code&gt; section.  &lt;/p&gt;

&lt;p&gt;Anjuna Device Manager is registered as a &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/" rel="noopener noreferrer"&gt;device plugin&lt;/a&gt; of the Kubernetes cluster, so when a pod has an enclave requirement, it can assign the pod to a Nitro enclave through the Nitro Enclaves kernel API (i.e. &lt;code&gt;/dev/nitro_enclaves&lt;/code&gt;).  &lt;/p&gt;
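&lt;p&gt;To make this concrete, here is a minimal sketch of what such a pod spec could look like. Note that the annotation key and the device-plugin resource name below are illustrative assumptions, not taken from Anjuna's documentation; check their docs for the exact names.&lt;/p&gt;

```yaml
# Illustrative sketch only: the annotation key and the device-plugin
# resource name are assumed names, not Anjuna's documented ones.
apiVersion: v1
kind: Pod
metadata:
  name: enclave-app
  annotations:
    enclave.anjuna.io/enabled: "true"   # assumed annotation the mutating webhook intercepts
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:latest
      resources:
        limits:
          # assumed resource name advertised by the Anjuna Device Manager,
          # backed by /dev/nitro_enclaves on the node
          anjuna.io/nitro-enclave: 1
```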

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1axiteh7qr5gmyxq8yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1axiteh7qr5gmyxq8yf.png" alt="Anjuna Nitro Webhook modify pod definition"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblhrm95699cfyr7r1p6c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblhrm95699cfyr7r1p6c.png" alt="Anjuna Device Manager"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;This is a standard approach to customizing Kubernetes clusters. But with all these tweaks, the Anjuna Nitro Kubernetes toolset helps us deploy enclave applications in a scalable way.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Data Privacy
&lt;/h2&gt;

&lt;p&gt;Enclave applications usually process sensitive data, so privacy is the most critical concern.  &lt;/p&gt;

&lt;p&gt;With the Anjuna tool binary running inside the same enclave as the application, there is little we can do to prevent it from accessing the data. But unlike other cloud services, Anjuna Nitro tools don’t require any communication between customers’ workloads and Anjuna’s servers. This gives customers the option to use Anjuna services without the risk of data exposure.   &lt;/p&gt;

&lt;h3&gt;
  
  
  Operate in Private Network
&lt;/h3&gt;

&lt;p&gt;Although the official documentation doesn’t emphasize it, we can actually use Anjuna tools in a private network.  &lt;/p&gt;

&lt;p&gt;During my review, I built a private VPC and ran Anjuna inside it. Here’s the result.  &lt;/p&gt;

&lt;p&gt;Firstly, I downloaded the Anjuna tools and the necessary container images into my EC2 instance.  &lt;/p&gt;

&lt;p&gt;Then, I created the VPC endpoints necessary for me to access the instance via AWS SSM.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Create VPC endpoints for SSM access&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6wxu9ouauiylbe6otrm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6wxu9ouauiylbe6otrm.png" alt="Create VPC endpoints for SSM access"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;And then, I restricted the instance security group outbound traffic to the VPC endpoints only.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Restrict instance access to the Internet&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04uf2tirz1u7owwf4xhv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04uf2tirz1u7owwf4xhv.png" alt="Restrict instance access to the Internet"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;From that point, my EC2 instance had no access to the Internet. But even in this environment, I could still use Anjuna tools to run the enclave application.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Anjuna tools can run without Internet&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fif1ufrq7absrii1h1upo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fif1ufrq7absrii1h1upo.png" alt="Anjuna tools can run without Internet"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Another fun fact: I tried building a self-managed Kubernetes cluster (without using EKS) in a private subnet without Internet access, and the Anjuna Nitro Kubernetes toolset still ran correctly.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Licensing Model
&lt;/h3&gt;

&lt;p&gt;With Anjuna software running entirely offline, customer licenses are only checked locally. I think this is a carefully considered decision by Anjuna.   &lt;/p&gt;

&lt;p&gt;Given the privacy focus of Anjuna's customers, it would have been hard for them to accept tools running inside a sensitive workload and sending data out, unless Anjuna could prove it has no access to customers' data. But that would require Anjuna to open source its tools, which doesn’t seem to fit its business model.   &lt;/p&gt;

&lt;p&gt;This licensing model works for Anjuna’s customers, as Anjuna provides not just the tools but also customer support.   &lt;/p&gt;

&lt;p&gt;They are also planning transparent metric collection from customers' workloads, so Anjuna can better understand usage, and customers can see exactly what data is sent to Anjuna.   &lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trust Model
&lt;/h3&gt;

&lt;p&gt;Enclave applications usually have access to sensitive data, so the users always pay attention to who has potential access to the data.  &lt;/p&gt;

&lt;p&gt;Most Nitro Enclaves use cases fall into three categories:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Building their own enclave applications (e.g., Dashlane)
&lt;/li&gt;
&lt;li&gt;Completely open-source (e.g., EdgeBit Enclaver)
&lt;/li&gt;
&lt;li&gt;Providing managed service to customers (e.g., Evervault, Oblivious AI)
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the first two cases, we don’t trust anyone and need complete control of, or visibility into, the application source code. In the last case, we trust the vendors and use their service to minimize data exposure.  &lt;/p&gt;

&lt;p&gt;Anjuna sits in between: we trust Anjuna enough to install its tools into the enclave without reviewing the source code, but not entirely, so we may want to ensure no data is sent to Anjuna from our workloads.  &lt;/p&gt;

&lt;p&gt;This makes me think that, as a Security Engineer, I always face the build-versus-buy question. Sometimes, I need to ask myself: Is this SOC 2 or ISO 27001 certificate trustworthy? Can I trust the vendors to safeguard our data? Especially when I see so many remarks and accepted risks in the audit reports.  &lt;/p&gt;

&lt;p&gt;But even with these doubts, we often still choose a vendor because building our own solution is simply too expensive.  &lt;/p&gt;

&lt;p&gt;Having the choice to host the application entirely in our own environment is definitely a plus in these trade-offs. And I think Anjuna is positioning its services smartly here: it doesn't disclose the tool logic, but you are free to decide where to deploy.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Target Audience
&lt;/h3&gt;

&lt;p&gt;The current licensing model of Anjuna and the technical skills required to use the tools (especially the knowledge of deploying resources on AWS) suit enterprises whose primary focus is not developing software.  &lt;/p&gt;

&lt;p&gt;On the one hand, those companies have enough technical personnel to deploy the applications. On the other hand, they don’t have enough resources or incentives to hire and train engineers specifically on enclave technology.  &lt;/p&gt;

&lt;p&gt;For Anjuna itself, managing those customers is also easier because it can build close relationships with a small number of big companies. Anjuna can even make special arrangements for its most important customers to audit the tools’ source code.   &lt;/p&gt;

&lt;p&gt;I am interested in what direction Anjuna will go in the future. Will they explore other business models to expand the customer base? How will they make their tools more accessible to customers with fewer technical skills without putting customers’ data at risk?  &lt;/p&gt;

&lt;p&gt;With more types of service offerings, more public awareness of enclave technology, and the pros and cons of different options, I believe there will be more adoption in the future. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>ec2</category>
      <category>nitroenclaves</category>
    </item>
    <item>
      <title>AWS Nitro Enclaves Ecosystem (2) - Evervault</title>
      <dc:creator>Richard Fan</dc:creator>
      <pubDate>Thu, 09 Feb 2023 16:26:06 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-nitro-enclaves-ecosystem-2-evervault-48pm</link>
      <guid>https://forem.com/aws-builders/aws-nitro-enclaves-ecosystem-2-evervault-48pm</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt; 1. Background&lt;br&gt;
 2. What is Evervault&lt;br&gt;
       2.1. Encryption service&lt;br&gt;
       2.2. Runtime provisioning&lt;br&gt;
             a. Evervault Functions&lt;br&gt;
             b. Evervault Cages&lt;br&gt;
 3. Deep dive&lt;br&gt;
       3.3. Less infrastructure overhead&lt;br&gt;
       3.4. TLS Attestation&lt;br&gt;
       3.5. Unknown sidecar&lt;br&gt;
       3.6. Insufficient access control&lt;br&gt;
 4. My thought&lt;br&gt;
 5. Final thought&lt;/p&gt;
&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;If you haven't read my previous post, please read &lt;a href="https://dev.to/aws-builders/aws-nitro-enclaves-ecosystem-1-chain-of-trust-10ia"&gt;AWS Nitro Enclaves Ecosystem (1) - Chain of trust&lt;/a&gt; on how I see services built on top of AWS Nitro Enclaves and the importance of Attestation Document.&lt;/p&gt;

&lt;p&gt;This time, I'm going to talk about my thought on Evervault.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Evervault
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://evervault.com/" rel="noopener noreferrer"&gt;Evervault&lt;/a&gt; provides transparent encryption using relay webhooks.&lt;/p&gt;
&lt;h3&gt;
  
  
  Encryption service
&lt;/h3&gt;

&lt;p&gt;The idea is that before sensitive data goes into the system, you can route the traffic through Evervault &lt;a href="https://docs.evervault.com/products/inbound-relay" rel="noopener noreferrer"&gt;Inbound Relay&lt;/a&gt; to encrypt it so that the system can only get the encrypted data.&lt;/p&gt;

&lt;p&gt;To use the encrypted data, Evervault provides &lt;a href="https://docs.evervault.com/products/outbound-relay" rel="noopener noreferrer"&gt;Outbound Relay&lt;/a&gt; to decrypt the data before sending it to the external components.&lt;/p&gt;

&lt;p&gt;Using it, developers can build applications that handle sensitive data without worrying about encryption or changing the code to protect it.&lt;/p&gt;

&lt;p&gt;Evervault states that the encryption is performed by the &lt;a href="https://docs.evervault.com/security/evervault-encryption" rel="noopener noreferrer"&gt;Evervault Encryption Engine (E3)&lt;/a&gt;, which runs on Nitro Enclaves. However, there is no way for us to verify whether that's true, and no independent audit is available either.&lt;/p&gt;
&lt;h3&gt;
  
  
  Runtime provisioning
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Evervault Functions
&lt;/h4&gt;

&lt;p&gt;Besides simply encrypting data, Evervault also provides the environment for developers to run simple functions on sensitive data.&lt;/p&gt;

&lt;p&gt;The current offering is &lt;a href="https://docs.evervault.com/products/functions" rel="noopener noreferrer"&gt;Evervault Functions&lt;/a&gt;, in which you can invoke your custom Python or Node.js application with the encrypted data. Your application will be given decrypted data as parameters so you can perform your business logic on it.&lt;/p&gt;

&lt;p&gt;The example Evervault provides is to &lt;a href="https://docs.evervault.com/guides/validate-phone-numbers" rel="noopener noreferrer"&gt;validate an encrypted phone number&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Evervault Cages
&lt;/h4&gt;

&lt;p&gt;Evervault Functions doesn't run on Nitro Enclaves, so I won't discuss it further. But Evervault has a beta offering called &lt;a href="https://docs.evervault.com/products/cages" rel="noopener noreferrer"&gt;Evervault Cages&lt;/a&gt;, which provides a similar feature on Nitro Enclaves, and that is what this post will focus on.&lt;/p&gt;
&lt;h2&gt;
  
  
  Deep dive
&lt;/h2&gt;

&lt;p&gt;I tried Evervault Cages by following their &lt;a href="https://docs.evervault.com/products/cages#getting-started" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;, with help from the Evervault team to understand how it works.&lt;/p&gt;

&lt;p&gt;This section covers the key points worth considering.&lt;/p&gt;
&lt;h3&gt;
  
  
  Less infrastructure overhead
&lt;/h3&gt;

&lt;p&gt;Using the Cages CLI, you can quickly build your Docker application and deploy it into a Nitro Enclave. You don't need to provision EC2 instances or configure Nitro Enclaves; Evervault provisions the infrastructure in their AWS account for you during deployment.&lt;/p&gt;

&lt;p&gt;You also don't need to handle external traffic, as Evervault handles it for you. The Cage application endpoint forwards to the exposed port of your enclave application, and Evervault can also forward egress traffic from the enclave to the Internet.&lt;/p&gt;
&lt;h3&gt;
  
  
  TLS Attestation
&lt;/h3&gt;

&lt;p&gt;Another feature Evervault Cages provides is &lt;a href="https://docs.evervault.com/products/cages#tls-attestation" rel="noopener noreferrer"&gt;TLS Attestation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When TLS termination is enabled in your Cage application, the attestation document of the Nitro Enclave will be embedded inside the Cage endpoint TLS certificate.&lt;/p&gt;

&lt;p&gt;According to the documentation, you can only use the Evervault SDK or CLI to validate the embedded attestation document. The tools use an undocumented API to retrieve the attestation, and we can use the same API to validate the attestation document ourselves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xrzia58918pn3c65h2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xrzia58918pn3c65h2a.png" alt="Connect to the Cage endpoint with a nonce" width="800" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho0z2uix2of8ua80vybq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho0z2uix2of8ua80vybq.png" alt="Attestation document is embedded in the TLS certificate" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When connecting to the Cage application endpoint &lt;code&gt;&amp;lt;cage_name&amp;gt;.&amp;lt;cage_id&amp;gt;.cages.evervault.com&lt;/code&gt; (or &lt;code&gt;&amp;lt;nonce&amp;gt;.attest.&amp;lt;cage_name&amp;gt;.&amp;lt;cage_id&amp;gt;.cages.evervault.com&lt;/code&gt; if you want to use nonce on the attestation), the Evervault-signed TLS cert will contain a Nitro Enclave attestation document in the &lt;strong&gt;Subject Alternative Name (SAN)&lt;/strong&gt; section, in hex code format.&lt;/p&gt;

&lt;p&gt;Besides using TLS Attestation, Cage environment also provides an &lt;a href="https://docs.evervault.com/products/cages#attestation-document" rel="noopener noreferrer"&gt;internal API&lt;/a&gt; &lt;code&gt;http://127.0.0.1:9999/attestation-doc&lt;/code&gt; for developers to retrieve attestation document within the enclave.&lt;/p&gt;

&lt;p&gt;These two features help application developers use attestation documents to validate enclave identity without writing their code to retrieve attestation documents.&lt;/p&gt;
&lt;h3&gt;
  
  
  Unknown sidecar
&lt;/h3&gt;

&lt;p&gt;To achieve features like the egress proxy and TLS Attestation, Evervault Cages installs a proxy sidecar, which they call the &lt;strong&gt;Data Plane&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw0xxfa5lxd2e8jqwcpk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw0xxfa5lxd2e8jqwcpk.png" alt="Comparison between original Dockerfile and the version ev-cage has modified" width="772" height="681"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we run the following command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ev-cage build &lt;span class="nt"&gt;--write&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ev-cage&lt;/code&gt; will modify our original &lt;code&gt;Dockerfile&lt;/code&gt;, adding two files into it (one is the runtime dependency, the other is the sidecar).&lt;/p&gt;

&lt;p&gt;As of the time of writing, the source code of the &lt;strong&gt;Data Plane&lt;/strong&gt; sidecar is still not publicly available. So when using Evervault Cages, keep in mind that an unknown binary is running alongside your application in the enclave.&lt;/p&gt;

&lt;p&gt;There is a risk that Evervault does something malicious in the sidecar, or that the sidecar contains vulnerabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Insufficient access control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API key&lt;/strong&gt; is the only &lt;a href="https://docs.evervault.com/sdks/cli#authentication" rel="noopener noreferrer"&gt;authentication&lt;/a&gt; method Evervault provides for programmatic access.&lt;/p&gt;

&lt;p&gt;In a simple platform, this is not an issue. But if I use Nitro Enclaves (or Evervault Cages in this case), I would expect additional data protection.&lt;/p&gt;

&lt;p&gt;The issues I can see are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lack of permission separation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each Evervault app only has one API key, which can be used across different services (i.e. Relay, Functions, Cages).&lt;/p&gt;

&lt;p&gt;An API key used by Cages can also be used by Functions, so we cannot guarantee that encrypted data can only be decrypted by a Cage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lack of attestation document support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.evervault.com/products/cages#decrypt" rel="noopener noreferrer"&gt;decryption API&lt;/a&gt; takes the API key as the sole authentication method. There is no control similar to an AWS KMS &lt;a href="https://docs.aws.amazon.com/kms/latest/developerguide/conditions-nitro-enclaves.html" rel="noopener noreferrer"&gt;key policy&lt;/a&gt;, where we can specify &lt;em&gt;"only this enclave image can decrypt my secret"&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
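&lt;p&gt;For contrast, this is roughly what the missing control looks like in AWS KMS: a key policy statement that allows &lt;code&gt;kms:Decrypt&lt;/code&gt; only when the request carries an attestation document from a specific enclave image. This is a sketch; the role ARN and the image digest are placeholders:&lt;/p&gt;

```json
{
  "Sid": "Allow decryption only from one enclave image",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::111122223333:role/enclave-parent-role" },
  "Action": "kms:Decrypt",
  "Resource": "*",
  "Condition": {
    "StringEqualsIgnoreCase": {
      "kms:RecipientAttestation:ImageSha384": "EXPECTED_IMAGE_SHA384_PLACEHOLDER"
    }
  }
}
```

&lt;p&gt;With a policy like this, even a caller holding valid credentials cannot decrypt unless the request originates from the enclave image whose measurement matches the digest.&lt;/p&gt;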

&lt;p&gt;Imagine I have a system that handles both phone numbers and credit card numbers. I want to validate phone numbers on Evervault Functions because they are not very sensitive.&lt;/p&gt;

&lt;p&gt;But I want to validate the credit card numbers on Cages because they are more sensitive than phone numbers.&lt;/p&gt;

&lt;p&gt;In this case, I have no way to protect the credit card numbers, because the &lt;em&gt;phone number validation developers&lt;/em&gt; can decrypt them using their API key (both services share the same key).&lt;/p&gt;

&lt;p&gt;Even if the API keys were separated, the &lt;em&gt;credit card number validation developers&lt;/em&gt; could still decrypt the credit card numbers: they could write a rogue app (e.g. a reverse shell), deploy it to Cages, and then call the decrypt API. Since the API performs no attestation-based authentication, we cannot specify which enclave image is allowed to decrypt the secret.&lt;/p&gt;
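&lt;p&gt;The missing check can be sketched in a few lines. This is a hypothetical illustration, not any real Evervault or AWS API: &lt;code&gt;authorize_decrypt&lt;/code&gt; and &lt;code&gt;EXPECTED_PCRS&lt;/code&gt; are invented names, and it assumes the attestation document's signature has already been verified upstream:&lt;/p&gt;

```python
# Hypothetical sketch of attestation-gated decryption authorization.
# PCR0 is a digest of the whole enclave image; the value here is a placeholder.
EXPECTED_PCRS = {
    "PCR0": "a1b2c3d4",
}

def authorize_decrypt(attested_pcrs, expected_pcrs=EXPECTED_PCRS):
    """Allow decryption only if every expected PCR matches the value
    presented in the (already signature-verified) attestation document."""
    return all(attested_pcrs.get(k) == v for k, v in expected_pcrs.items())

# A request from the genuine enclave image passes ...
print(authorize_decrypt({"PCR0": "a1b2c3d4", "PCR1": "ffff"}))  # True
# ... while a rogue app with a different image digest is rejected,
# even if it holds a valid API key.
print(authorize_decrypt({"PCR0": "deadbeef"}))  # False
```

&lt;p&gt;The point is that authorization binds to &lt;em&gt;what code is running&lt;/em&gt;, not just to &lt;em&gt;who holds a key&lt;/em&gt;; a rogue deployment changes the image measurement and is rejected.&lt;/p&gt;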

&lt;h2&gt;
  
  
  My thought
&lt;/h2&gt;

&lt;p&gt;The idea behind Evervault is good: making data protection as easy as possible. Abstracting the protection of data-in-use away from developers through Functions and Cages is a boost to adoption.&lt;/p&gt;

&lt;p&gt;However, the current state of Evervault Cages still has a long way to go. I would say Cages is only as good as the current Functions offering in terms of security and privacy, with no significant extra benefit on top of it.&lt;/p&gt;

&lt;p&gt;I would suggest the following if Evervault is targeting first-time users who are not familiar with confidential computing and want a quick start:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Provide permission controls for the API key so users can have finer-grained control over data at different privacy levels.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Evervault response: Evervault is now working on refining the scopes for API keys, specifically for decoupling Cages from the surrounding products.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If Evervault wants to target more advanced users who take sensitive data seriously, they can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Open source the &lt;strong&gt;Data Plane&lt;/strong&gt; sidecar so users can review its security.&lt;/p&gt;

&lt;p&gt;Alternatively, if Evervault wants to avoid publishing the source code, they can commission a reputable 3rd-party audit, or open source a lightweight version of the sidecar with fewer features, so users can choose to minimise their risk by using it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Evervault response: Evervault is now undergoing a 3rd Party audit for Cages. Open sourcing is also in their roadmap.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Provide attestation document authentication so users can specify which enclave image can decrypt specific data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Evervault response: Evervault is now working on the Cages auth for encryption/decryption to include an attestation step.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;To be fair, Evervault Cages is a new release, and it is not expected to be perfect yet. The Evervault team has done a great job of democratising the use of Nitro Enclaves, and they are open to feedback as well.&lt;/p&gt;

&lt;p&gt;I suggest you try it out and get a taste of how confidential computing works.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
  </channel>
</rss>
