\chapter{Introduction}\label{c:introduction}
\epigraph{Arguing that you don't care about the right to privacy because you have nothing to hide is no different than saying you don't care about free speech because you have nothing to say.}{\textit{Edward Snowden \cite{snowden15}}}
The volume of big data is growing exponentially; an estimated 90 percent of it has been created in the past few years \cite{kim2014big}.
People are constantly producing and publishing information about themselves.
Such data comes from browsing the web, chatting with someone online, moving around while emitting GPS signals, being recorded by cameras or credit card transactions, and even from wearables and IoT devices.
Many companies and organizations, such as Google and Facebook, profit from collecting, storing, and analyzing big data.
One of the most common examples is personalized advertising, which is driven by massive-scale data analytics.
It is becoming increasingly common for data about our location, music and movies we like, private conversations, and any other online trace we leave behind, to be linked to our purchasing preferences.
Such tracking and profiling amounts to a breach of individuals' privacy.
Privacy is the ability of individuals to have control over how their personal information is collected and used, and thereby express themselves selectively.
Nowadays, the need to preserve one's privacy is more pressing than ever.
For instance, financial information can be sensitive.
Such information includes a person’s holdings, debts and transactions (\textit{e.g.} purchases).
This information, if compromised, can lead to criminal activity such as fraud or identity theft.
Also, one’s purchases can be linked to places they visit, people they contact and so on; thus, such data should remain private.
A noteworthy example that renders privacy of critical importance is \emph{medical data}.
People may not be comfortable sharing their medical records with others, for several reasons.
For instance, it could affect their employment or their insurance coverage, or they may simply not want others to know about their medical or psychological conditions or treatments.
Medical data reveals a great deal about a patient's personal life and should therefore be protected.
An argument adopted by many is that there is no need for privacy if one has nothing to hide.
This reflects a failure to understand that privacy is a human right.
There is no need to justify why such a right is needed; the burden of justification falls on the one seeking to violate it.
Moreover, even when a right is of no use to you personally, you cannot waive it on behalf of others.
% TODO: expand here; perhaps cite https://en.wikipedia.org/wiki/Nothing_to_hide_argument
\section{Privacy Issues in the Cloud \& Multi-Party Computing}\label{s:privacy-cloud-multiparty}
The rapid growth of information has fueled the steadily increasing popularity of cloud computing, which offers strong computational power to both individuals and companies.
At the same time, all data uploaded to the cloud can be exposed to attacks from both the cloud provider and third parties.
Especially in the case of financial and medical data, people are not comfortable sharing their sensitive information and, more importantly, do not trust any third party with it.
There are many real-world use cases and business models that use information from different parties to jointly compute meaningful results, but due to the aforementioned limitations, some are avoided altogether and others do not always respect data privacy.
The solution to this is a technique called secure multi-party computation (SMPC or MPC) \cite{yao1982protocols, goldreich1998secure}, which leverages cryptographic primitives to carry out computations on confidential data.
Given $N$ parties with private inputs, the goal is to compute a function of those inputs such that the parties learn nothing more than they would have if a separate trusted party had collected their inputs, computed the same function for them, and then returned the result to all parties.
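To make this idea concrete, the simplest MPC building block is additive secret sharing: each party splits its private input into random shares that sum to the input modulo a large prime, so the parties can jointly compute a sum without any single party ever seeing another's value. The following Python sketch is illustrative only; the three-party setting, the choice of modulus, and the helper names (share, reconstruct) are our assumptions for exposition, not the actual protocol of any particular framework.

```python
import random

P = 2**61 - 1  # a large prime; all arithmetic is done modulo P

def share(secret, n_parties=3):
    """Split a secret into n additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Recombine shares; any strict subset of them reveals nothing."""
    return sum(shares) % P

# Three parties with private inputs (e.g. confidential salaries).
inputs = [52_000, 61_000, 47_000]

# Each party splits its input and hands one share to every party.
distributed = [share(x) for x in inputs]

# Party i locally adds the shares it holds -- one from each input.
local_sums = [sum(d[i] for d in distributed) % P for i in range(3)]

# Publishing only the local partial sums reveals the total and nothing else.
total = reconstruct(local_sums)
print(total)  # 160000: the joint sum, with no individual input disclosed
```

Each individual share is uniformly random, which is why observing fewer than all of them gives no information about the underlying secret; this is exactly the "trusted third party" guarantee stated above, achieved without any trusted party.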
Real-world examples abound; for instance, Sharemind \cite{bogdanov2008sharemind} -- a platform for secure computations -- mentions the example of satellite collision avoidance \cite{kamm2015secure}: as the number of satellites orbiting the planet grows, so does the danger of collisions.
Indeed, two satellites crashed in 2009.
Satellite owners are not willing to make the orbits of their satellites public.
However, such collisions -- present and future -- could be avoided by sharing information about satellite orbits.
Using MPC, the parties can cooperate and learn whether a collision is going to happen and nothing else.
No information about the actual orbits would leak, since computations are carried out on encrypted data.
Another interesting example is presented in \cite{lindell2000privacy}: in the late 1990s, the Canadian government maintained a massive federal database that pooled citizen data from a number of different government ministries, with the aim of conducting governmental research that would arguably improve the services citizens receive.
This database became known as the ``big brother'' database, despite being officially called the Longitudinal Labor Force File.
Fortunately, citizens protested and the project was discontinued due to privacy concerns.
As in the example of medical research, individuals' data privacy would have been compromised, rendering privacy-preserving algorithms crucially important.
\section{Our Contribution}\label{s:our-contribution}
In this thesis, our primary concern is to create an end-to-end infrastructure for computing privacy-preserving analytics such as \cite{lindell2000privacy, agrawal2000privacy}.
We have developed algorithms specifically tailored to encrypted architectures and the SMPC scenario, and we have also focused on the coordination and communication among all involved parties: those who provide their data, those who perform the secure computation, and those who initiate new computations.
In our view, this thesis provides an end-to-end system for discovering useful information with respect to data privacy.
In our system, we have developed some essential analytics algorithms -- such as aggregators and decision trees.
Our goal is to provide the building blocks for potentially more elaborate algorithms to be implemented with respect to data privacy.
In the context of this thesis, our study is focused on medical data, which has been a popular data mining topic of late \cite{chaurasia2017data, erickson2017machine}.
However, the primary reason we focus on medical data is that the privacy protection of medical records is taken more seriously than in other data mining tasks \cite{bertino2005privacy}.
Medical records are related to humans, which renders privacy of critical importance.
Thus, medical data constitute an example that demonstrates the necessity of privacy-preserving algorithms, as well as of a comprehensive infrastructure that incorporates and facilitates all participating parties.
Although in this thesis we have focused on medical data, our end-to-end infrastructure is oblivious to the type of data that it processes.
The same analytics apply whether the system performs medical research on hospital data, computes highly classified statistics on government data, or even carries out private computations to prevent satellite collisions.
The variety of applications that can be served through our system is mainly tied to two data types -- continuous\footnote{\textit{Continuous data} is data whose values can change continuously, so the number of different possible values is uncountable.} and categorical\footnote{\textit{Categorical data} refers to data where there is a distinction between different groups; the number of possible values/categories is small and countable.} data.
Examples of the former include weight, price, profits, etc., while examples of the latter include product type, gender, age group, etc.
This heterogeneity of data types has separated our privacy\hyp preserving algorithms into two corresponding categories, since different data types are managed in different ways.
The algorithms we have developed for privacy-preserving analytics can deal with both quantitative and categorical data.
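The split between the two data types can be illustrated with plain, unencrypted aggregators: continuous attributes call for numeric aggregation (sums, means, extrema), whereas categorical attributes call for frequency counts per category. In the privacy-preserving setting the same operations would run on secret-shared values rather than plaintext; the sketch below, with hypothetical record fields, only shows the shape of the computation, not our secure implementation.

```python
from collections import Counter

# Hypothetical patient records mixing the two data types:
# "weight" is continuous, "gender" is categorical.
records = [
    {"weight": 70.5, "gender": "F"},
    {"weight": 82.0, "gender": "M"},
    {"weight": 64.2, "gender": "F"},
]

# Continuous attribute: numeric aggregation (here, the mean).
weights = [r["weight"] for r in records]
mean_weight = sum(weights) / len(weights)

# Categorical attribute: frequency count per category.
gender_counts = Counter(r["gender"] for r in records)

print(round(mean_weight, 2))  # 72.23
print(dict(gender_counts))    # {'F': 2, 'M': 1}
```

The distinction matters because a secure protocol for summation says nothing about how to count group memberships obliviously, and vice versa, which is why the two families of algorithms are developed separately.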
\section{Thesis Structure}\label{s:thesis-structure}
The rest of the thesis is organized as follows: In Chapter \ref{c:preliminaries}, we examine some fundamental cryptographic protocols that are essential for the subsequent chapters.
In Chapter \ref{c:sharemind} we present the Sharemind secure multi-party computation framework, while in Chapter \ref{c:medical-study} we elaborate on our end\hyp to\hyp end medical case study.
Subsequently, in Chapter \ref{c:pp-algorithms} we present the basic notion of privacy-preserving algorithms and how they differ from their textbook equivalents.
Moreover, we elaborate on the details of the algorithms in the two major categories we have developed: secure aggregation and secure classification.
In Chapter \ref{c:implementation} we delve into the various implementation details of our system.
Our experimental evaluation is presented in Chapter \ref{c:evaluations} and, finally, our conclusions and future work are summarized in Chapter \ref{c:conclusions}.