## What is Differential Privacy?

### A simplistic look at a relatively recent tool in the world of data science to enhance consumer privacy.

Published
Categorized as Written by Zach Johnson

Imagine that you are a student at a university that wishes to know what percentage of their student body, as well as percentages of different subgroups within the student body, has cheated on an exam at some point during their undergraduate experience. They go about answering this question by picking a representative random sample of the student body and asking them the following questions: their expected graduation year, their major of study, their residence hall, and whether or not they have ever cheated on an exam. The university assures those being polled that their answers will remain ‘anonymous’—specifically, that their names will not be released along with their answers to the university. Now imagine that you are one of these students being polled, and you have cheated on an exam at some point during your studies. Would you feel completely comfortable answering this poll given the assurance that you will remain ‘anonymous’ to the school?

Most people naturally would answer no. Why is there still an underlying sense of insecurity to truthfully answering this poll when the university is guaranteeing ‘anonymity’?

It is because the ‘anonymity’ promised by the university is a facade.  Say that you are one of 10 students in your residence hall who is in the class of 2022. Now, say that of those ten students, you are the only one who is a psychology major. With these seemingly trivial pieces of demographic information you gave along with your confession to cheating, you are no longer anonymous to the university. The university can simply access their student directory, where they can tie all of this information, including the fact that you have cheated, to your name.

While this ‘data breach’ of sorts is relatively low risk—perhaps you are put on academic probation, or you receive a letter on your permanent record—the act of tracing anonymous datasets to individual people using complementary data or background information, formally defined as a linkage attack, is an incredible threat in the world of data privacy. Linkage attacks enable adversaries to use seemingly harmless datasets to perform large-scale, destructive data breaches. The promise of ‘deanonymizing’ data merely through removing names from a dataset is not enough to protect people from having their own information used against them. According to a paper written by Latanya Sweeney, the professor of the Practice of Government and Technology at Harvard University, 87% of Americans can be uniquely identified by just three pieces of information: their zip code, sex, and birth date.1 Thus, in an attempt to resolve the threat of linkage attacks, differential privacy was born.

Conceptual Definition of Differential Privacy

Imagine the same scenario above, but with a slight alteration. Now, when being asked whether or not you have cheated, the pollster tells you the following: flip a coin. If the coin lands on heads, answer truthfully whether or not you have cheated. If the coin lands tails, flip it one more time and answer yes if it lands heads, and no if it lands tails.

Now say that you are called in to the Board of Academic Integrity, who tells you that you will have a letter put on your academic record because you answered yes to the survey (for the sake of this example, let us dismiss that this is a clear violation of standard polling practices, as the school guaranteed that answers would remain anonymous and not be traced to individual students). With reasonably high (we will later on define mathematically what ‘reasonably high’ means) plausibility, you could argue that you have in fact never cheated on an exam, but simply answered yes per the sequence of coin flips. Now, despite having their answers traced back to them, students are protected from any repercussions from the use of their data as a result of this buffer of uncertainty being added to the questionnaire process.

This is an example of implementing differential privacy. From a conceptual perspective, the definition of differential privacy lies on several key principles. According to Michael Kearns in his book The Ethical Algorithm, the first is that “differential privacy requires that adding or removing the data record of a single individual not change the probability of any outcome by ‘much.’”3 We will address just how much ‘much’ is numerically when we discuss the mathematical definition of differential privacy. The second principle is that “no outside observer can learn very much about any individual because of that person’s specific data.”3 The last key principle is that “for every individual in the dataset, and for any observer no matter what their initial beliefs about the world were, after observing the output of a differentially private computation, their posterior belief about anything is close to what it would have been had they observed the output of the same computation run without the individual’s data.”3 Again, ‘much’ and ‘close’ are loose terms that will be more objectively defined later on. To clarify this last principle, let us refer back to our example with the university. Assume that some observer would like to know whether Student A was polled for this study. Differential privacy guarantees that if the observer is shown the final computation (in this example, the percentage of the university’s student body that has cheated), whether this computation was calculated with Student A’s data or without, they will not be able to guess whether the dataset contained Student A significantly more accurately than by randomly guessing.

Essentially, an algorithm that is differentially private injects a predetermined amount of ‘noise’ into a dataset (in our example, the ‘noise’ inserted is determined by the coin flip; in real world situations, something more complex like a Laplacian distribution is used to insert noise). This noise is what guarantees plausible deniability, and thus protection from harm, to the people whose data is being used. However, because the data scientists who are deploying these algorithms know precisely how this noise (in other words, error) was introduced into the data, they can work backwards to compute the metrics they are seeking with high confidence. Using differentially private polling practices, the university can roughly calculate what percentage of their student body has cheated with high accuracy, while guaranteeing that any student whose data was used in this computation will be protected from any repercussions as a result of the use of their data.

Mathematical Definition of Differential Privacy

For the formula-oriented folk, this section will be much more relevant to you. From a conceptual perspective, we understand that a differentially private algorithm guarantees that an adversary can learn virtually nothing more about an individual than they would be able to if said individual’s data were removed from the dataset powering the algorithm. However, if you were a data analyst, it would be foolish for a company to ask you to design a predictive machine learning model such that one can learn ‘virtually nothing more’ about any individual in the dataset by using their data. At the end of the day, a machine learning algorithm boils down to numbers, matrices (which, in themselves, are just arrays of more numbers), and functions. It has no concept of loose quantities such as ‘virtually nothing more’ or ‘not much.’

It turns out, however, that these loose terms are encapsulated by a parameter, ε, also referred to as the privacy budget. The privacy budget, according to Medium, can be thought of as “an individual’s limit of how much privacy they are ok with leaking.”4 It follows, then, that a small ε would result in greater privacy, while a larger ε would result in less. To state this more mathematically, a model M is ε-differentially private if for all pairs of datasets x, y that differ in exactly one person’s data entry, and all events S,

$Pr[M(x) \in S] \leq e^{\epsilon}Pr[M(y) \in S]$

For a small value of ε, this can be approximated as

$Pr[M(x) \in S] \leq (1+ \epsilon)Pr[M(y) \in S]$

Once you inject noise into a dataset, you must work backwards in order to calculate the metric you originally sought out. Let us take a look at an oversimplified example using the differentially private algorithm from the university scenario above. Say that the true ratio of students at the university who have cheated is x, and that 5/12 of our random sample answered yes to the poll (in the world of statistics, the chances of ratios being represented as whole number fractions is nearly zero, but we will ignore that for the sake of this example). We therefore know that the percentage of students at the university who have not cheated is 1-x, since the query was binary (in other words, the only other option was to answer no). We also know that students will respond to the question truthfully 3/4 of the time, as there is a 1/2 chance of their first flip being heads (in which case they are required to answer truthfully), and a 1/2 * 1/2 = 1/4 chance of answering truthfully if their first flip was tails. Of the 1-x percentage of the students who have not cheated, 1/4 will answer yes to the poll, per their sequence of coin flips. Putting these together, 3/4x students truthfully responded yes to the poll, while 1/4(1-x) students untruthfully answered yes to the poll, and together these sum up to 5/12. Solving for x, we get that with high confidence, 1/3 of the students at the university have cheated. In calculating this ratio, however, we have greatly reduced a privacy breach on behalf of any individual student whose data was used in this study, thanks to the implementation of differential privacy.

If differential privacy being used as a cut-and-dried tool to combat data attacks seems too good to be true, it’s because it is. Differential privacy is not a blackbox solution against all adversarial threats and breaches. We as computer scientists, data analysts, and engineers tend to favor solutions that are methodical, formulaic, and can be applied uniformly. But, when we are dealing with issues that cannot merely be quantified—privacy, ethics, fairness, justice, etc.—there is no single method, no single formula, that is correct above all others and can be applied in every situation. Differential privacy, like all other solutions to subjective moral dilemmas, has its strengths and its drawbacks, and like these solutions, the data analyst must consider the costs and benefits in every unique situation.

Costs of Differential Privacy

In the realm of computation, as well as the realm of humanity, it is inevitable that any solution to a large-scale societal problem will have its imperfections. Just as there are virtually no laws or regulations that optimize justice, practicality, efficacy, and cost simultaneously, there are virtually no algorithms that can optimize all of these factors without any drawbacks. Differential privacy is no stranger to this concept.

One critical property of differentially private algorithms that is both beneficial but also poses a clear burden is that they are compositional. Michael Kearns elaborates further upon this property, explaining that “if you’ve got two differentially private algorithms, you can run them both and the result is still differentially private. You can give the output of one as the input to another, and the result is still differentially private.”3 In many ways, this is incredibly useful. This enables data scientists to create large, complex algorithms by simply concatenating various differentially private building blocks. It makes differential privacy much more practical, as its real-world applications tend to be very large in scale. Conversely, however, just as differential privacy can be aggregated, privacy loss can be as well. Merging two differentially private algorithms to construct a new, more complex one results in a compounding of the two former algorithms’ privacy losses. In other words, while the new algorithm is more robust, its privacy guarantee weakens. The tradeoff in this situation is complexity for privacy, and, as was mentioned before, it is up to whomever is deploying this algorithm (whether this be an individual data scientist, a large tech corporation, etc.) to decide how much algorithmic complexity versus how much consumer privacy they are willing to sacrifice.

Similarly, multiple queries to the same differentially private dataset reduce the level of anonymization overtime, because the results of these queries can be aggregated to reconstruct the original dataset through filtering out noise.5 Thus, the person behind this data must decide whether they more highly value longevity and repeated use of their data, or privacy for the individuals contained in the data.

The last cost of differential privacy we will discuss is more blatant: accuracy. It should come as no surprise that if we are deliberately injecting noise into a dataset, our computations will be less accurate than if we used entirely truthful data. This is why data scientists use the classic buzz phrases such as ‘with high confidence,’ or ‘it is reasonable to assume.’ They cannot guarantee metrics with certainty, because they consciously made the data less accurate in order to protect the individuals contained. Mathematically speaking, privacy and accuracy share an inversely proportional relationship. The higher the level of privacy (low ε), the lower the accuracy of our results. In order to combat this blow to accuracy while still guaranteeing privacy, a bigger dataset must be used. In the business world, this is synonymous with a higher monetary cost, a metric that no business seeks to maximize. Again, there is no differentially private algorithm that is inherently superior relative to these metrics. It is dependent on the values and business practices of the people and groups that deploy them. Some companies prioritize privacy above all else, regardless of the costs incurred, and thus seek to minimize ε (maximize privacy). Others seek to cut costs, regardless of the repercussions placed on individuals and their data, and will therefore try to maximize ε (minimize privacy). For many companies, they seek to find a middle-ground, deliberately choosing ε such that there is a relatively low threat to privacy, while still avoiding a monetary blow. This is a decision that cannot merely be automated; it is up to humans rather than machines to consciously decide what values they prioritize.

Examples of Differential Privacy in Society

Although differential privacy is a relatively recent discovery in the world of data science, its applications can already be seen in our society. One major sector in which differential privacy has inhabited is the industry of Big Tech. Apple has implemented differential privacy in its iOS 10 and macOS Sierra software to collect data for several different purposes. These include determining which websites are using the most power, the contexts in which various ‘emojis’ are used, and which words people are frequently typing that are not contained in the keyboard’s dictionary. Google has deployed open-sourced differentially private algorithms in order to track and detect malware in its browsers, as well as collecting information on traffic in large metropolitan areas for its maps feature.6 Another major sector in which differential privacy can be found is the federal government. The U.S. Census Bureau announced in 2017 that “all statistical analyses published as part of the 2020 Census will be protected with differential privacy.”3 The goal of this was to protect the individual privacy of the American people. Some have argued that this will compromise the accuracy of the Census results, while others have praised the firm stance the federal government has taken on protecting privacy. This is a clear example of what was mentioned above—the specific deployment and parameters chosen of differential privacy is entirely dependent on the specific task at hand, what the goals of the task are, and which costs people are willing to sacrifice.

Conclusion

Differential privacy is a critical property of machine learning algorithms and large datasets that can vastly improve the protection of privacy of the individuals contained. By deliberately introducing noise into a dataset, we are able to guarantee plausible deniability to any individual who may have their data used to harm them, while still being able to calculate desired statistics with high certainty.

This is not to say, however, that differential privacy is a one-size-fits-all solution to data breaches. Differential privacy sacrifices accuracy for privacy, and it is up to those deploying the differentially private algorithm to determine just how much they are willing to give up in order to ensure the protection of the people. Mathematically speaking, data scientists must consciously set the parameter ε to reflect their values and priorities (in more complex implementations of differential privacy, there are more parameters, but for now we will just focus on ε). While the idea of automating the entire process of data analysis is incredibly appealing, this is a crucial point of human intervention that cannot be overlooked or automated.

We have seen differential privacy deployed in several sectors of society, from Big Tech to the federal government. It is still a relatively recent discovery, however, and as we continue with time to see technology permeate into more areas of our daily lives, there is no doubt that differential privacy will manifest itself in many new sectors.

While differential privacy, like many algorithms used to address societal issues, has its clear drawbacks, the use of automation and machine learning to address subjective notions of fairness, ethics, and privacy is an incredibly groundbreaking discovery in the world of computation, and is inching us as a society closer to addressing the inherent moral dilemmas of technology.

References

1. Sweeney, Latanya. “Simple demographics often identify people uniquely.” Health (San Francisco) 671.2000 (2000): 1-34.
2. Nguyen, An (2019, June 30). Understanding Differential Privacy. Towards Data Science. https://towardsdatascience.com/understanding-differential-privacy-85ce191e198a.
3. Kearns, Michael (2019). The Ethical Algorithm. Oxford University Press.
4. Fathima, Shaistha (2020, September 14). Differential Privacy Definition. Medium. https://medium.com/@shaistha24/differential-privacy-definition-bbd638106242.
5. Sartor, Nicolas (2019, May 10). Explaining Differential Privacy in 3 Levels of Difficulty. Air Cloak. https://aircloak.com/explaining-differential-privacy/.
6. Simply Explained. “Differential Privacy – Simply Explained.” YouTube, commentary by Xavier Decuyper, 25 January 2018, https://www.youtube.com/watch?v=gI0wk1CXlsQ&t=225s.