Forem: Arash Afshar

Secure Machine Learning

Arash Afshar — Sun, 16 Feb 2020 00:21:06 +0000

In previous posts (Zero Trust, MPC-1), I described the importance of Secure Multiparty Computation and showed a simple use case for it, a secure weather application. In this post, I want to skip ahead to a far more advanced application to showcase the amazing things that one can do given the existing tools and libraries. More specifically, I will talk about secure machine learning. I will use a popular service such as Grammarly as an example and will describe how one can approach implementing a more secure version of it.

Grammarly can be installed as a browser plugin. It will monitor the text that you type in any website, find grammatical mistakes, and then offer you fixes for those mistakes. It is a great product but, for their product to work, they need to collect all the content that you type and analyze it (see their privacy policy). I am NOT accusing them of any bad intention, but the simple fact that they need to collect what I type means that I cannot use it for anything related to my work (writing emails, describing architectures, etc.), which is the main use case of their service for me. Therefore, I am in a situation where I have a need for such a product, but my privacy requirements prevent me from using it. Now the question is, is it possible to create such a product that satisfies my privacy requirements? I believe the answer is yes and I will describe the steps and tools that are needed in this post.

Broadly speaking, products such as Grammarly are backed by a Machine Learning (ML) model. This model needs to be trained on a huge dataset and constantly kept updated with new data. Once a model is trained, it is deployed and used to respond to queries from the users. Therefore, to implement a secure version of this service, at the very least we need to secure each of these steps.

Secure Training: In the training phase, the data must be kept private
Secure Model: The resulting trained model must not reveal anything about the private data
Secure Querying: When querying the model, the query data must be private

Secure Training

In the training phase, one or more parties have private data. In most cases, these data owners do not have the computational power to run the training and therefore, they will outsource their data to a "cloud". Thus, the goal is to make sure that the "cloud" does not learn the private input. Moreover, the cloud should still be able to perform computation on that private data. This is a great use case for Secure Multiparty Computation. The current state-of-the-art papers (ABY3, SecureNN) propose a 3-party setting. In this setting many data owners will outsource their data to these parties (think of them as three "clouds") and these parties will run the MPC protocols on behalf of the data owners. The important assumption here is that these parties are non-colluding.
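To give a feel for how three "clouds" can compute on data none of them can read, here is a minimal sketch of additive secret sharing. It is only an illustration: ABY3 and SecureNN use more elaborate sharing schemes, and the names `share`, `reconstruct`, `add_shared` and the modulus are assumptions made for this example.

```python
import secrets

MODULUS = 2 ** 32  # assumed choice; all arithmetic is done modulo this value

def share(value, parties=3):
    """Split `value` into random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Only someone holding all the shares can recover the value."""
    return sum(shares) % MODULUS

def add_shared(x_shares, y_shares):
    """Each party adds the two shares it holds; no communication is needed."""
    return [(x + y) % MODULUS for x, y in zip(x_shares, y_shares)]

alice_input = share(25)  # party i receives alice_input[i]
bob_input = share(17)    # party i receives bob_input[i]
print(reconstruct(add_shared(alice_input, bob_input)))  # 42
```

Each individual share is a uniformly random number, so a single party learns nothing about 25 or 17, yet the three parties together hold a sharing of the sum. Multiplication on shares needs interaction between the parties, which is where the protocols in the papers come in.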

If you are not interested in reading the papers and implementing them yourself, you can check out the great work done by the folks at Dropout Labs. They are working on a secure version of TensorFlow, called tf-encrypted, which implements the above papers in addition to the SPDZ protocol.

If you want to implement the protocols yourself, you can check out the MPC-SoK repository. Marcella Hastings et al. have done an amazing job of collecting, compiling, documenting, and comparing most of the existing MPC frameworks.

Secure Model

From the above, we have a model that has been computed in a secure manner and is shared between three non-colluding parties. But that is not enough! A model that is trained in this fashion can still leak information. An example of an attack that can reveal information about the private inputs from the model is the "Model Inversion" attack (e.g., Secret Sharer). Assume Alice is a data owner and one of the inputs she has sent to the cloud has this format: "Alice A, credit card number: 1234-5678". Now the attacker can start a brute-force attack by querying the model with "Alice A, credit card number: XXXX-XXXX" where XXXX-XXXX is brute-forced. In other applications, this would be even worse. Consider an attacker that types "Alice A, credit card number: 1234-" and the model auto-completes the rest of the credit card number for the attacker!

There are different techniques to defend against this kind of attack, including sanitizing the data, anonymizing the data, or using techniques such as Differential Privacy (DP).
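To illustrate the basic idea behind DP, here is a minimal sketch of the Laplace mechanism applied to a simple count query. This is not how DP is applied when training a model (there the noise is typically added during training), and the function names are assumptions made for this example. It only shows the core trick: add noise calibrated to how much one record can change the answer.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Release a noisy count. Adding or removing one record changes a count
    by at most 1, so the noise scale is 1 / epsilon."""
    rng = rng or random.SystemRandom()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [23, 35, 41, 52, 67]
print(private_count(ages, lambda age: age > 40, epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy. The released value is close to the true count of 3 on average, but any single run does not reveal whether one particular person is in the dataset.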

Secure Querying

Given a secure training phase and assuming that the model itself does not leak any information, we can focus on the deployment. In terms of the security of the implementation, this is very similar to the training phase in that MPC can be used to ensure the privacy of the query data.
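As a rough sketch of what a private query could look like, the example below secret-shares a query vector among three parties and lets each of them score its share against a linear model. It makes a simplifying assumption that is NOT the setting described above: the model weights are known to the parties and only the query is hidden. With a shared model, the parties would need an interactive multiplication protocol, which is not shown. All names are hypothetical.

```python
import secrets

MODULUS = 2 ** 32  # assumed choice of modulus

def share_vector(values, parties=3):
    """Additively share every entry; party i receives row i of the result."""
    rows = [[secrets.randbelow(MODULUS) for _ in values] for _ in range(parties - 1)]
    last = [(v - sum(col)) % MODULUS for v, col in zip(values, zip(*rows))]
    return rows + [last]

def local_score(weights, query_share):
    """Run by a single party: dot product of the weights with its own share."""
    return sum(w * q for w, q in zip(weights, query_share)) % MODULUS

def reconstruct(partial_scores):
    """Run by the user: combine the three partial scores."""
    return sum(partial_scores) % MODULUS

weights = [3, 1, 2]  # model held by the parties (assumed public to them)
query = [4, 0, 5]    # the user's private features
partials = [local_score(weights, s) for s in share_vector(query)]
print(reconstruct(partials))  # 22, the same as 3*4 + 1*0 + 2*5
```

No single party sees the query, because each row it receives is a vector of random-looking numbers, and only the user, who collects all three partial scores, learns the result.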

Final Words

Of course, as I have mentioned in my previous post, having the theoretical solution is not enough and there are many more things that need to be considered before such an application can be ready for secure production use.

I encourage you to check out the tutorials on tf-encrypted with Keras and DP, and PySyft, for a more hands-on experience.

MPC Part 1: Oblivious Transfer

Arash Afshar — Mon, 20 May 2019 18:22:00 +0000

Consider a weather app that you have on your phone. For most users, this app records the current GPS location, sends it to a server and receives and displays the temperature of the user's location. This means that if the server chooses to, it can create a profile of the user location history and track their movement which is a breach of privacy for most users. Therefore, a privacy-aware user might want to obtain weather information without sharing their location. Let's call this User Security Property.

Designing an application that satisfies this security property is very easy. In fixed intervals, the server sends all the weather information it has for all the cities and regions from all over the world to the user's device and the user does a local lookup to find the weather information that is interesting to them. As you can see, with this approach the server has no way of knowing the user's location. Unfortunately, this approach has two main problems.

The amount of data that is sent to the user's device is too large.
More importantly, the user learns all the weather information which is the backbone of the business that the server is running.

Therefore, we are interested in a solution that is fast and efficient enough to be used in practice (this is a very subjective criterion and depends on the use case). We are also interested in a solution that protects the privacy of the server and only sends the temperature of the city that the user has asked for. Let's call this Server Security Property.

To sum up, we have two parties: a server and a user. The server has a list of private temperature data and the user has a private input which indicates the city that the user is interested in. We would like to offer this functionality such that the User Security Property and Server Security Property are satisfied and that the solution is more efficient than sending all the server data to the user.

Oblivious Transfer can help with achieving this goal. To describe Oblivious Transfer (OT), we first consider a simple case where the server only holds the weather information about two cities and the user chooses one of those cities. This case is called 1-out-of-2 OT. In what follows, I'll describe the theory and some code snippets and then describe how to extend it to more than two cities.

Theory

One of the simplest OTs (especially if you know the Diffie-Hellman key exchange protocol) is proposed by Chou and Orlandi (2015). The overall protocol is shown in the figure below. But it is not immediately clear what is happening there and what a, b, g, Hash, Encrypt, and Decrypt are.

What is `g`?

g is the generator of a "simple group" of prime order p. As a toy example, consider the multiplicative group of integers modulo the prime 11 (written Z11 in this post). It has 11-1 = 10 members {1,2,...,10}, so strictly speaking its order is 10 rather than a prime, but it is small enough to follow by hand. This is a cyclic group if you consider g=2, since you can create all the members of the group by starting from 2 and repeatedly multiplying by 2. In other words, {2¹,2²,2³, ..., 2¹⁰} modulo 11 produces the same set as {1,2,...,10}.

To see it for yourself, run the following program.

def generate_group(g, p):
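  # A list (not a set) is used so that any duplicates would be visible
  members = list()
  for i in range(1, p):
    # pow(g, i, p) computes g**i % p without building a huge intermediate number
    members.append(pow(g, i, p))
  return sorted(members)

print(generate_group(2, 11))
# => prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]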

What are `a` and `b`?

a and b are two integers that are selected at random from Z11. Note that both of these values appear as the exponents of g and therefore g^a and g^b each result in a member of the cyclic group.

What are `Hash` and `Encrypt/Decrypt`?

The proper definition of the Hash and Encrypt functions can be found in the paper, but for our purposes assume Hash is SHA1 and Encrypt/Decrypt is a symmetric key encryption scheme such as AES.

Does the Protocol Work?

To show that this protocol is doing what it claims, let's follow it with an example. In this example, we use the same group Z11 with g=2 as its generator. Also, assume that a=4 is chosen uniformly at random and similarly, b=7 is chosen uniformly at random. The following code computes the steps required for the user to obtain k and for the server to obtain k0 and k1. You will notice that if the user sets c=0, then k will be the same as k0 and if the user sets c=1, then k will be equal to k1. Therefore, the user can either decrypt e0 or e1 based on their choice, but they CANNOT decrypt both.

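# Divides x by y modulo p, i.e., multiplies x by the modular inverse of y.
# The inverse is computed as y^(p-2) mod p (Fermat's little theorem), so p must be prime.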
def div(x, y, p):
  xp = x % p
  yip = pow(y, p-2, p)
  return (xp * yip) % p

def examine_case_c_0(g, a, b, p):
  A=pow(g, a, p)
  B=pow(g, b, p)
  k=pow(A, b, p)
  k0=pow(B, a, p)
  k1=pow(div(B, A, p), a, p)
  print(k, k0, k1, k == k0)

def examine_case_c_1(g, a, b, p):
  A=pow(g, a, p)
  B=(A * pow(g, b, p)) % p
  k=pow(A, b, p)
  k0=pow(B, a, p)
  k1=pow(div(B, A, p), a, p)
  print(k, k0, k1, k == k1)

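examine_case_c_0(2, 4, 7, 11)
# => prints 3 3 4 True (Python 3; Python 2 prints the tuple (3, 3, 4, True))

examine_case_c_1(2, 4, 7, 11)
# => prints 3 5 3 True (Python 3; Python 2 prints the tuple (3, 5, 3, True))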
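The checks above stop at the keys. Here is a minimal sketch of the full 1-out-of-2 exchange in the same toy group, including the Hash and Encrypt/Decrypt steps. It makes several simplifying assumptions that are not from the paper: SHA-256 of the group element stands in for Hash, a plain XOR with the hash output stands in for Encrypt/Decrypt (so messages must be at most 32 bytes), and a and b are fixed to the values used above so that the run is reproducible. All function names are made up for this example.

```python
import hashlib

P, G = 11, 2  # the toy group from this post; far too small for real use

def derive_key(group_element, length):
    """Stand-in for Hash: turn a group element into `length` key bytes."""
    return hashlib.sha256(str(group_element).encode()).digest()[:length]

def xor_bytes(data, key):
    """Stand-in for Encrypt/Decrypt: XOR the data with the key bytes."""
    return bytes(d ^ k for d, k in zip(data, key))

def user_message(A, b, c):
    """User -> server: B = g^b if c == 0, otherwise B = A * g^b."""
    gb = pow(G, b, P)
    return gb if c == 0 else (A * gb) % P

def server_encrypt(a, B, m0, m1):
    """Server -> user: e0 and e1, encrypted under k0 and k1."""
    A_inv = pow(pow(G, a, P), P - 2, P)  # inverse of A modulo the prime P
    k0 = derive_key(pow(B, a, P), len(m0))
    k1 = derive_key(pow(B * A_inv % P, a, P), len(m1))
    return xor_bytes(m0, k0), xor_bytes(m1, k1)

def user_decrypt(A, b, c, e0, e1):
    """The user's key k = Hash(A^b) opens only the chosen ciphertext."""
    chosen = e0 if c == 0 else e1
    return xor_bytes(chosen, derive_key(pow(A, b, P), len(chosen)))

a, b = 4, 7       # fixed for reproducibility; sample them at random in practice
A = pow(G, a, P)  # server -> user
m0, m1 = b"city0: 21C", b"city1: 08C"
for c in (0, 1):
    B = user_message(A, b, c)
    e0, e1 = server_encrypt(a, B, m0, m1)
    print(c, user_decrypt(A, b, c, e0, e1))
```

Running it prints `0 b'city0: 21C'` and then `1 b'city1: 08C'`: in each run the user recovers exactly the message that matches c, and applying the same key to the other ciphertext yields garbage.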

So far, we have demonstrated that the protocol is correct. To actually prove its correctness, you can just write down the formulas and go through the math. Next, we will talk about the security of the protocol and try to argue that it satisfies both of the security requirements.

Is the Protocol Secure?

To examine the security of the protocol, imagine that you are the attacker and see what you can do! For example, let's assume that you are the server and your goal is to find out the user's choice of the city. Looking at the protocol, you'll notice that as the server you are getting only one message. Depending on the user's choice, you either get B=g^b or B=Ag^b. Therefore, if you can somehow identify which message you got, you have succeeded in your attack. Let's consider a couple of different ways that you can perform your attack. In other words, we want to find ways that we can violate the User Security Property. Our attacks will involve finding or guessing b. Note that both the user and the server know g. Therefore, if we can find/guess b, then we can compute g^b and compare the result with B. If they are the same, we know that the user has chosen city0. We (the server) can perform this attack in two ways.

By finding a flaw in the way b is generated: if the user chooses b randomly and implements the random number generation properly, then the user is safe against this type of attack. Therefore, we define

User Security Requirement 1: Use secure random to choose "b" and do not reuse "b".

By trying all possible values of b (i.e., a brute-force attack). If our cyclic group is big enough, then this attack would be impractical. Therefore, we define

User Security Requirement 2: Choose a large enough cyclic group such that brute-forcing "b" is impractical for the duration that "b" is valid.

We can also make the same kind of arguments about the requirements for satisfying the Server Security Property, which I leave for you to explore and think about. In particular, I encourage you to read about the hardness of the discrete logarithm problem and how it relates to the Diffie-Hellman problem.
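As a starting point for that exploration, here is a small sketch of why the group size matters for the server. In the toy group, a cheating user can recover the server's secret a from A by simply trying every exponent, and can then compute both keys. The function names are made up for this example, and the numbers are the ones from the walkthrough above (A=5, and B=7 from the c=0 run).

```python
def brute_force_dlog(g, target, p):
    """Return the smallest x in [1, p-1] with g^x mod p == target, or None."""
    for x in range(1, p):
        if pow(g, x, p) == target:
            return x
    return None

def both_keys(g, A, B, p):
    """What a cheating user could compute once `a` is recovered from A."""
    a = brute_force_dlog(g, A, p)
    A_inv = pow(A, p - 2, p)  # modular inverse, valid because p is prime
    k0 = pow(B, a, p)
    k1 = pow(B * A_inv % p, a, p)
    return k0, k1

print(brute_force_dlog(2, 5, 11))  # 4, the server's secret a
print(both_keys(2, 5, 7, 11))      # (3, 4), i.e. both k0 and k1
```

The loop takes at most p-1 steps, which is trivial for p=11 and hopeless for a group whose size has hundreds of digits. That gap is exactly what the discrete logarithm assumption is about.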

From the above arguments, we have identified that to satisfy User Security Property, the protocol implementation must be configured such that it satisfies the following requirements.

The implementation must generate b using secure random every time.
The implementation must use a cyclic group with a very large size to prohibit the brute-force attack.
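For the first requirement, a minimal sketch in Python is to draw b from the operating system's cryptographic random source through the `secrets` module, rather than from the `random` module, and to draw a fresh value for every run of the protocol. The helper name `sample_exponent` is made up for this example.

```python
import secrets

def sample_exponent(group_order):
    """Return a fresh, uniformly random exponent in [1, group_order - 1]."""
    return 1 + secrets.randbelow(group_order - 1)

# A new b for every protocol run; never cache or reuse it.
b = sample_exponent(10)  # 10 is the order of the toy group used in this post
print(1 <= b <= 9)       # True
```

The second requirement cannot be shown with a toy group: it comes down to picking a standardized large group from a well-reviewed cryptographic library instead of choosing parameters by hand.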

Now, are these arguments enough to prove security and more importantly, is the approach that we have taken so far a good approach for proving security? The above arguments are informal and are not accurate. For example, we have not defined the "large enough" size for the cyclic group, nor have we defined the "hardness" property of the discrete logarithm in the cyclic group. Moreover, we have described the random number generator as "secure" without specifying what it means. Nevertheless, this approach towards proving security is a correct approach and it is what real proofs look like. Namely, going over each message that a party receives and proving that the message leaks no information about the private input of the parties. I will write about the proof model in a separate post. In the meantime, you can read about it in a concise tutorial by Yehuda Lindell, or get more in-depth knowledge by reading the wonderful books by Oded Goldreich, Foundations of Cryptography, Vol I and II.

Back to the Application

We started with the goal of creating a weather reporting service that preserves the privacy of the user and the server and introduced Oblivious Transfer (OT) as a potential solution. We then showed how an OT protocol can be designed for the case where the server has only two city temperatures and the user chooses one of them (1-out-of-2 OT). Now, we want to extend this to a 1-out-of-n OT for some large n. A naive approach is to create a network of 1-out-of-2 OTs, where each pair of initial temperatures is fed to an OT, and then create another layer of OTs such that the output of each pair of OTs from the first layer is fed to an OT in the second layer, and so on. This forms a binary tree and requires approximately n OTs. There are much faster solutions which can achieve this with a constant number of 1-out-of-2 OTs.
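To make the idea of building 1-out-of-n from 1-out-of-2 more concrete, here is a sketch of one classic construction. It is not the tree described above and not necessarily what libOTe does internally. The server picks one key pair per bit of the city index, masks every message with the keys that match the bits of its index, and the user fetches one key per bit through a 1-out-of-2 OT, so about log2(n) OTs are needed. The function `ideal_ot` is a placeholder for a real OT such as the one in this post, SHA-256 stands in for a pseudorandom function, and messages are assumed to be at most 32 bytes. All names are hypothetical.

```python
import hashlib
import secrets

def ideal_ot(m0, m1, c):
    """Placeholder for a real 1-out-of-2 OT: the receiver learns only m_c."""
    return m1 if c else m0

def xor_bytes(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

def pad_for(keys, index, length):
    """XOR together one pseudorandom pad per key, bound to the message index."""
    pad = bytes(length)
    for key in keys:
        block = hashlib.sha256(key + index.to_bytes(4, "big")).digest()[:length]
        pad = xor_bytes(pad, block)
    return pad

def server_encrypt_all(messages, key_pairs):
    """Mask message i with the keys selected by the bits of i."""
    ciphertexts = []
    for i, m in enumerate(messages):
        keys = [pair[(i >> j) & 1] for j, pair in enumerate(key_pairs)]
        ciphertexts.append(xor_bytes(m, pad_for(keys, i, len(m))))
    return ciphertexts

def user_fetch(choice, key_pairs, ciphertexts):
    """One OT per bit of `choice`; key_pairs is only touched through ideal_ot."""
    keys = [ideal_ot(k0, k1, (choice >> j) & 1) for j, (k0, k1) in enumerate(key_pairs)]
    c = ciphertexts[choice]
    return xor_bytes(c, pad_for(keys, choice, len(c)))

cities = [b"Oslo: -3C", b"Cairo: 31C", b"Lima: 19C", b"Pune: 28C"]
key_pairs = [(secrets.token_bytes(16), secrets.token_bytes(16)) for _ in range(2)]
ciphertexts = server_encrypt_all(cities, key_pairs)  # sent to the user in full
print(user_fetch(2, key_pairs, ciphertexts))         # b'Lima: 19C'
```

The user still downloads all n ciphertexts, but can only unmask the one whose index bits match the keys it obtained, and the server never learns which keys were picked.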

Finally, the following code shows an implementation of this application using libOTe. You can find a Dockerfile which sets up and runs this program on my repo.

Final Remarks

Similar to the previous post, just implementing a secure protocol is not enough and there are many more things that one needs to take into account. For example, on preserving the privacy of the user, note that the user's location can be found (or at least estimated) through the source IP or network delays. Moreover, based on the frequency of the weather checks and the requests to the server, the server can guess whether the user is traveling on the road or not. Nevertheless, using a secure protocol is far better than a non-secure one.

Absolute Security with No Trust

Arash Afshar — Fri, 17 May 2019 02:35:39 +0000

I have started a blog on Theoretical Security and Software Engineering. Here is my first post. I hope you enjoy it :)

Designing and implementing secure software solutions usually involves a discussion about the level of security and the effort and cost of achieving that level of security. The cost and effort are a function of the upfront cost of development, time to market, and the cost of fixing security problems when they are exploited. Depending on the product, the company, and the severity of the problem, the latter cost could also involve regaining the trust of the customers (e.g., numerous security breaches of Facebook) or losing a significant part of the business as in the case of VFEmail. Based on the risk tolerance and other factors, software products fall somewhere in the following spectrum of security.

No security! It is fast, easy, and cheap to develop but carries significant risk and will require a huge cost to make it secure later.
Some levels of security have been considered and implemented and the rest is protected with NDAs or other non-technical means.
Absolute security without trusting any third party (more formally known as Information-theoretic security or Unconditional security).

In this post, I will talk about the last item and discuss how feasible or practical it is. I will argue that although absolute security is not possible in practice, it is important that we put effort into research in fields such as "Secure Multiparty Computation" to provide the tools necessary for making a reasonable compromise with regard to the security of the product and the number of things that we have to trust.

Let's start with a simple example to demonstrate whether absolute security is achievable or not. In this example, we have a client which wants to encrypt a secret file and outsource it to a cloud for storage such that the cloud can never find the content of the file.

Theory

You can achieve information-theoretic security by using a simple XOR function. Consider the case where the content of the secret file is a binary string 101010. Now assume the client has picked a password which is also a binary string, say 100101. The client XORs the file content with the password (using the following function) and sends the result to the cloud.

def encrypt(content, password):
  m = int(content, 2)
  k = int(password, 2)
  c = m ^ k
  return "{0:b}".format(c).zfill(6)

encrypt("101010", "100101")
#=> prints '001111'

Why is this way of encryption secure? Assume you are the cloud and you have received 001111 from the client. Also assume that you have unlimited CPU and memory and you decide to try all possible passwords (i.e., perform a brute-force attack) to find the content of the secret file. You notice that the message is 6 characters long and therefore you try all the passwords from 000000 to 111111 using the following function.

def brute_force(secret):
  c = int(secret, 2)
  possible_results = []
  for possible_password in range(int("111111", 2)+1):
    possibility = "{0:b}".format(possible_password ^ c)
    possible_results.append(possibility.zfill(6))
  return possible_results

brute_force("001111")
#=> Can you guess what it prints?

If you run this function, you will notice that it is printing all values between 000000 and 111111. In other words, it will print all the possible contents of the secret file! This means that brute-forcing did not help at all and you (i.e., the cloud) still have no idea which of the possibilities is more likely! This is what we call information-theoretic security: regardless of your computation power, the best you can do is guess which one of all the possible contents is the answer.
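The same observation can be checked mechanically. The sketch below (the helper name is made up for this example) collects the candidate contents into a set and confirms that every 6-bit string appears exactly once, so each possible content is explained by exactly one password and none stands out.

```python
def candidate_plaintexts(ciphertext_bits):
    """Decrypt the ciphertext under every possible password of the same length."""
    c = int(ciphertext_bits, 2)
    n = len(ciphertext_bits)
    return {format(c ^ k, "0{}b".format(n)) for k in range(2 ** n)}

candidates = candidate_plaintexts("001111")
print(len(candidates))         # 64, i.e. every 6-bit string
print("101010" in candidates)  # True, but so is every other string
```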

Alright, we have achieved information-theoretic security and we are done, right? No! We have yet to examine the practical side.

Practice

In our brute-force example, the cloud is looking for the correct answer among all the possibilities. An observant reader might have noticed that the content of the secret file is 101010 or 42 and of course 42 is the answer! Joking aside, this shows an attack when the attacker has some knowledge about the form of the secret. Even if we do not consider this type of attack, there can still be problems with the implementations. For example, consider the following implementation of the encryption function.

def bad_encrypt(content, password):
  m = int(content, 2)
  k = int(password, 2)
  c = m ^ k
  print("DEBUG: Encrypting {0} with {1} resulting in {2}".format(m, k, c))
  return "{0:b}".format(c).zfill(6)

bad_encrypt("101010", "100101")
#=> prints DEBUG: Encrypting 42 with 37 resulting in 15
#          '001111'

Hmmm, that does not look very secure! As is evident, even when using algorithms that give you information-theoretic security, you are still trusting that the developers have implemented a secure program. Now, let's assume that the program is written securely and there is no vulnerability in the code. The next question is where to store the password. If an attacker can find the password, then all bets are off. Therefore, even with theoretically secure algorithms and assuming that your programs have no vulnerabilities, you are still trusting that your operational security is perfect!

Now, assume that your code is secure, your operational practices are secure, and you have no malicious insider leaking your secrets. Would you choose the XOR function for your encryption needs? Probably not! Note that to encrypt a message of length 6, you needed to create a password of length 6. Similarly, to encrypt 10 terabytes of data that you are outsourcing to a cloud, you would need to store 10 terabytes of passwords locally, which is a huge and pointless price to pay to achieve information-theoretic security.
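The storage cost is easy to see in a byte-oriented sketch of the same XOR scheme. The function names are made up for this example, and the password is generated with Python's `secrets` module.

```python
import secrets

def otp_encrypt(message):
    """XOR the message with a fresh random password of the same length."""
    key = secrets.token_bytes(len(message))
    ciphertext = bytes(m ^ k for m, k in zip(message, key))
    return key, ciphertext

def otp_decrypt(key, ciphertext):
    return bytes(c ^ k for c, k in zip(ciphertext, key))

message = b"architecture notes for the new payment service"
key, ciphertext = otp_encrypt(message)
print(len(message), len(key), len(ciphertext))  # all three lengths are equal
print(otp_decrypt(key, ciphertext) == message)  # True
```

Whatever you outsource, you have to keep an equally large password at home, so the cloud has not actually saved you any storage.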

Reality

At this point, you might be wondering why we are even bothering with theoretical security! Well, the picture that I have painted so far was an extreme case to make a point about the importance of security in all aspects of software development, and also to point out that a high level of security is not cheap: in the vast majority of cases you pay for it through more computation or more memory/network usage. The solution is research on making theoretical approaches more efficient and on finding reasonable compromises on practical security. In the past few decades, there has been significant research that is bringing us closer to efficient and reasonable theoretical and practical security.

In my PhD thesis, I have done a small part in furthering such research (AMPR14, AHMR15, AMR17) in the field of "Secure Multiparty Computation" and I believe it is one of the most promising fields. In my future blog posts, I will introduce this field from the point of view of a Software Engineer with the goal of encouraging other Software Engineers to adopt and use the results of the amazing research that is being done in this field.
