by Dirk Helbing (ETH Zurich, dhelbing@ethz.ch)
(an almost identical version has been
forwarded to some Members of the European Parliament on April 7, 2013)
Some serious, fundamental problems to be solved
The first
problem is that combining two or more anonymized data sets may allow
deanonymization, i.e. the identification of the individuals whose data have
been recorded. Mobility data, in particular, can be easily deanonymized.
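To illustrate the mechanism (with purely hypothetical data and field names), the following sketch shows how two data sets that each look harmless on their own can be joined on shared attributes to re-identify a person:

```python
# Minimal sketch (hypothetical data and field names) of a linkage attack:
# two data sets that are each "anonymous" on their own are joined on
# quasi-identifiers, which may single out an individual.

anonymized_mobility = [
    # (pseudonymous_id, home_cell, work_cell)
    ("a91f", "cell_1042", "cell_2310"),
    ("b7c2", "cell_1042", "cell_9921"),
]

public_directory = [
    # (name, home_cell, work_cell) -- e.g. gathered from public profiles
    ("Alice Example", "cell_1042", "cell_2310"),
]

def link(records, directory):
    """Re-identify pseudonymous records whose quasi-identifiers match."""
    matches = []
    for uid, home, work in records:
        for name, d_home, d_work in directory:
            if home == d_home and work == d_work:
                matches.append((uid, name))
    return matches

print(link(anonymized_mobility, public_directory))
# [('a91f', 'Alice Example')] -- the "anonymous" record is re-identified
```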
A second fundamental problem is that it must be assumed that the large
majority of people in developed countries, including the countries of the European
Union, have already been profiled in detail, given that individual devices can
be identified with high accuracy through their individual configurations (including
the software installed and its settings). There are currently about 700 million
commercial data sets about users, specifying an estimated 1,500 variables
per user.
A third problem is that both the CIA and the FBI have revealed that,
besides publicly or semi-publicly available data on the Web or in social media,
they are or will be storing and processing private data, including Gmail and Dropbox
data. The same applies to many secret services around the world. It has also
become public that the NSA seems to collect all the data it can get hold of.
A fourth fundamental problem is that Europe currently lacks the
technical means (algorithms, software, and data) and the laws to counter foreign
dominance regarding Big Data and its potential misuse.
General principles and suggested approach to
address the above problems
The age of information will only be sustainable if people can trust
that their data are being used in their interest. The spirit and goal of data
regulations should be to ensure this.
Personal data are data characterizing individuals or data derived from
them. People should be the primary owners of their personal data. Individuals, companies,
or government agencies that gather, produce, process, store, or buy data should
be considered secondary owners. Whenever personal data are from European citizens,
or are being stored, processed, or used in a European country or by a company
operating in a European country, European law should apply.
Individuals should be allowed to use their own personal data in any way
compatible with fundamental rights, including sharing them with others, for
free or at least for a small monthly fee covering the use of ALL their personal
data (comparable to the radio and TV fee). [Note: This is important to unleash the
power of personal data for the benefit of society and to close the data gap that
Europe has.]
Individuals should have a right to access a full copy of all their
personal data through a central service and be suitably protected from misuse
of these data.
They should have a right to limit the use of their personal data at any
time and to request their correction or deletion in a simple and timely way and
for free.
Fines should apply to any person, company, or institution gaining or
creating financial or other advantages through the misuse of personal data.
Misuse includes, in particular, sensitive uses that carry a certain
probability of violating human rights or justified personal interests. Therefore,
the error rate of the processing (and, in particular, of the classification) of
personal data must be recorded, specifying what share (per mille) of users feel
disadvantaged.
A central institution (which might be an open Web platform) is needed to
collect user complaints. Sufficient transparency and decentralized institutions
are required to take efficient, timely, and affordable action to protect the
interests of users.
The exercise of user rights must be easy, not time-consuming, and cheap
(essentially free). For example, users must not be flooded with requests
regarding their personal data. They must be able to ensure a self-determined
use of their personal data effectively and with little individual effort.
To limit misuse, transparency is crucial. For example, it should be
required that large-scale processing of personal data (i.e., at least the
queries that were executed) be made public in a machine-readable form,
so that public institutions and NGOs can determine how dangerous such queries
might be for individuals.
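As an illustration of what such a machine-readable publication could look like (the JSON format and field names below are assumptions, not a prescribed standard), consider this sketch:

```python
# Minimal sketch (illustrative field names, not a prescribed standard)
# of publishing executed large-scale queries in a machine-readable form.
import json
from datetime import datetime, timezone

query_log_entry = {
    "controller": "Example Analytics Ltd.",           # who ran the query
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "purpose": "ad targeting",                         # declared purpose
    "query": "SELECT age_band, postcode_area, interests FROM profiles",
    "records_touched": 1_250_000,
    "personal_data_categories": ["age", "location", "interests"],
}

# Published as JSON so that public institutions and NGOs can parse the log
# and assess how sensitive such queries are for individuals.
print(json.dumps(query_log_entry, indent=2))
```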
Proposed definitions
As indicated above, there are practically no data that cannot be deanonymized if combined with
other data. However, the following may be considered a
practical definition of anonymity:
Anonymous data are data in which a person of interest can only be identified with a
probability smaller than 1/2000, i.e. there is no way to find out which one among
two thousand individuals has the property of interest.
Hence, the principle is to dilute persons with a certain
property of interest among 2000 persons with significantly different properties, in
order to make it unlikely that persons with the property of interest can be identified. This
principle is guided by the way election data and other sensitive data are being
used by public authorities. It also makes sure that private companies do not
have a data-processing advantage over public institutions (including research
institutions).
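This dilution criterion is close in spirit to k-anonymity with k = 2000. The following sketch (with illustrative attribute names) checks whether a released data set satisfies it, i.e. whether every combination of released attributes is shared by at least 2000 individuals:

```python
# Minimal sketch of the dilution principle: every combination of
# quasi-identifiers must be shared by at least 2000 individuals, so that a
# person of interest can only be pinned down with probability < 1/2000.
# (Attribute names and the exact threshold handling are illustrative.)
from collections import Counter

K = 2000  # group size implied by the 1/2000 identification probability

def is_sufficiently_anonymous(records, quasi_identifiers):
    """Return True if every quasi-identifier combination occurs >= K times."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(count >= K for count in groups.values())

# Example: a released data set with coarse attributes only.
records = [{"age_band": "30-39", "region": "Zurich"} for _ in range(2500)]
print(is_sufficiently_anonymous(records, ["age_band", "region"]))  # True
```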
I would propose to characterize pseudonymous
data as data not suited to reveal or track the user, or properties
correlated with the user that he or she has not explicitly chosen to reveal in
the specific context. I would furthermore suggest characterizing pseudonymous
transactions as processing and storing the minimum amount of data required to
perform a service requested by a user (which in particular implies not
processing or storing technical details that would allow one to identify the device
and software of the user). Essentially, pseudonymous transactions should not be
suited to identify the user or variables that might identify him or her.
Typically, a pseudonym is a random or user-specified variable that allows one to
sell a product or perform a service for a user anonymously, typically in exchange
for an anonymous money transfer.
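The following sketch (with illustrative field names) shows what a pseudonymous transaction record along these lines could look like: a random pseudonym plus only the data needed to perform the service:

```python
# Minimal sketch (illustrative field names) of a pseudonymous transaction:
# a random pseudonym stands in for the user, and only the minimum data
# needed to perform the requested service is stored -- no device or
# software fingerprint, no real identity.
import secrets

def new_pseudonym():
    """Random, unlinkable pseudonym for a single service relationship."""
    return secrets.token_hex(16)

def pseudonymous_order(product_id, delivery_locker):
    return {
        "pseudonym": new_pseudonym(),
        "product_id": product_id,            # needed to perform the service
        "delivery_locker": delivery_locker,  # anonymous pick-up point
        # deliberately absent: name, address, IP, device/browser fingerprint
    }

print(pseudonymous_order("SKU-4711", "locker_042"))
```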
To allow users to check pseudonymity, the data processed and stored
should be fully shared with the user via an encrypted webpage (or similar) that
is accessible for a limited, but sufficiently long time period through a unique
and confidential decryption key made accessible only to the respective user. It
should be possible for the user to easily decrypt, view, copy, download and
transfer the data processed and stored by the pseudonymous transaction in a way
that is not being tracked.
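A minimal sketch of how such per-user encrypted access could be realized, assuming the third-party Python package "cryptography" (an illustrative choice, not a mandated tool): the stored record is encrypted with a key that only the respective user receives.

```python
# Minimal sketch of per-user encrypted access to transaction data, using
# the third-party "cryptography" package (an assumption, not a mandated
# tool). The decryption key is shared confidentially with the user only.
from cryptography.fernet import Fernet

# Key generated per user/transaction and handed to the user confidentially.
user_key = Fernet.generate_key()

stored_record = b'{"pseudonym": "a91f", "product_id": "SKU-4711"}'
ciphertext = Fernet(user_key).encrypt(stored_record)  # what the service keeps

# Only the holder of the key (the user) can inspect what was processed.
plaintext = Fernet(user_key).decrypt(ciphertext)
print(plaintext.decode())
```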
Further information:
- Difficulty to anonymize data
- Danger of a surveillance society
- New deal on data: how to consider consumer interests
- HP software allowing personalized advertisement without revealing personal data to companies; contact: Prof. Dr. Bernardo Huberman, huberman@hpl.hp.com
- FuturICT initiative: www.futurict.eu
Information on the proposer
Dirk Helbing is
Professor of Sociology, in particular of Modeling and Simulation, and a member of
the Computer Science Department at ETH Zurich. He is also an elected member of the
German Academy of Sciences. He earned a PhD in physics and was Managing
Director of the Institute of Transport & Economics at Dresden University of
Technology in Germany. He is internationally well known for his work on
pedestrian crowds, vehicle traffic, and agent-based models of social systems.
Furthermore, he is coordinating the FuturICT Initiative (www.futurict.eu), which focuses on the understanding of
techno-socio-economic systems using Big Data. His work is documented by
hundreds of well-cited scientific articles, dozens of keynote talks, and
hundreds of media reports in all major languages. Helbing is also chairman of
the Physics of Socio-Economic Systems Division of the German Physical Society,
co-founder of ETH Zurich’s Risk Center, and an elected member of the World Economic
Forum’s Global Agenda Council on Complex Systems.