\chapter{Defining Anonymity}
\label{chap:anon}

Many anonymous publication systems claim `anonymity'
without specifying a precise definition.
%  Indeed, we present an extensive related works section:
%most of these works claim an anonymous network or some other
%catchphrase involving anonymity,
Indeed, they often fail to specify
what protections users and operators receive from their system, as
well as what protections users and operators do not receive. These
protections are a function of both the actual publication system and
the communications channel which it utilizes.

%These protections are a function of both the design of the publication
%system -- fulfilling the requirements of publishing, storage, and
%retrieval -- and the communications medium which it utilizes.

While the anonymity requirements of communications channels have been
considered previously in depth \cite{gold-anon} \cite{bert-anon},
%\footnote{Ian Goldberg, David Wagner,
%and Eric Brewer. Privacy-enhancing technologies for the Internet. {\it
%Proceedings of IEEE COMPCON '97};
%Oliver Berthold, Hannes Federrath, and Marit Kohntopp. Project
%``Anonymity and Unobservability in the Internet.''  Workshop on
%Freedom and Privacy by Design / CFP2000.}
this chapter addresses anonymity at a higher level: the publication systems themselves.

The current Free Haven project design does not achieve all of the
attributes of anonymity that we might want to provide. This chapter is
thus not an enumeration of design goals which Free Haven meets, but rather
an enumeration of design goals for the ideal anonymous publication system.

The following sections start out by describing what a speaker might expect
to achieve. We then list some agents in an anonymous publication system,
and address anonymity for each of these agents separately. We present
some intuitive notions of types or degrees of anonymity, as well as some
models of adversaries that help in thinking about how anonymous a system
as a whole might be. Finally, we attempt to assess real-world projects
in the face of these new definitions and models.

\section{Defining Privacy}

Privacy is the ability to control dissemination of information about
yourself. The level of anonymity you have in a given scenario is a result
of the choices you make about your privacy.

In particular, the speaker may have control in several different dimensions
over dissemination of information about his own speech:

\begin{enumerate}
\item {\bf Linkability:} The speaker controls the ability of readers to
link his utterances.  If there are distinguishing characteristics that
provide this linkability, these characteristics are called a pseudonym.
\item {\bf Replies:} The speaker controls whether readers can reply to
his utterances.  Further parameters include whether the reply is private,
and persistence of the ability to reply (perhaps the ability to reply
expires a week after the publication).
\item {\bf Content Leaks:} The speaker controls how much partial
information about himself is leaked based on the content of his speech. For instance,
the speaker may choose to reveal certain information such as credentials
for being authorized to read the New York Times on the web.
\item {\bf Channel Leaks:} The speaker controls how much partial
information is leaked based on the communications channel he chooses
to use. For example, a satellite TV broadcaster may use an encrypted
communications channel in an attempt to allow only paying customers
access to programming. Alternatively, a speaker may intentionally use
an imperfectly anonymous channel for the sake of efficiency or
convenience.
% Alternatively, a speaker may purposely use 
%an imperfectly encrypted channel in order to spread disinformation
%to an eavesdropping adversary.
% is that really what we are getting at? -dmolnar
\item {\bf Persistence:} The speaker controls how long his speech persists
after the publication.
\item {\bf Readers:} The speaker controls which other parties will be
able to read his speech.
\end{enumerate}

\section{Agents and Operations}

The above list describes some freedoms which a given {\em speaker} may have
available to him. A number of these freedoms have analogs when considered
in the context of other parties in the communication.

In general, there are several agents in an anonymous publication
system: these include the author, the reader, and the server. There
are a number of operations that the system might support, such as
(at a minimum) inserting a document into the system, and retrieving the
document from the system. We address each of these agents and operations
separately, to try to build an intuitive notion about which acts allow
dissemination of information.

\subsection{Author-anonymity}

Author-anonymity means that the original author of a given document
should not be known. This characteristic of anonymity is one of the
integral parts of almost any anonymous network or service.
Even so-called `anonymous remailers', which are simply anonymous
forwarders and don't support persistence or storage of the data, provide
author-anonymity. Indeed, anonymous remailers can be combined with public
storage and distribution systems such as Usenet to offer a rudimentary
but very easy to construct and deploy publication service that allows persistently
available data and provides author-anonymity.

\subsection{Publisher-anonymity}

While author-anonymity addresses the original author of the document
itself, publisher-anonymity addresses the agent that originally introduces
the document into the system. In some cases the author and the publisher
may be the same entity, but in the general case the author may make use of
a separate individual, either a third party or a server in the system,
to introduce the document. Separating these two notions of `origin'
makes it clearer what protections the system provides.

\subsection{Reader-anonymity}

Reader-anonymity means that readers requesting a document should not
have to identify themselves to anyone. In particular, this means that
when a reader performs a document request at a given server, this server
is unable to determine the identity or location of the reader.

This class of anonymity is crucial for protecting people from disclosing
that they are interested in or accessing certain types of material.
For instance, a user of the system might not want it known whether she is
downloading material from the Democrats' web page, or from the Republicans'
web page. Reader-anonymity ensures the privacy of the vast majority of
the system's {\em users}, a concern that is often ignored.

\subsection{Server-anonymity}

Server-anonymity means that the location of the document should not
be known or knowable. Specifically, given a document's name or other
identifier, an adversary is no closer to knowing which server or servers
on the network currently possess this document (or shares of it). This restriction implies that the
retrieved documents do not provably pass through any given server that
receives a request. This protection is crucial for materials where mere
possession is cause for action against the server. The most obvious
example of such materials is child pornography, but in fact any material
which is the subject of an injunction or criminalized by law needs
this kind of protection. Depending upon jurisdiction, this material
runs the gamut from Amnesty International pamphlets to DeCSS to neo-Nazi
propaganda to advertisements for alcohol. 
%%% added more examples. 

Many services rely on sheer volume of servers, each containing the data,
to dissuade organizations from attacking any given server for possessing
the data. However, this approach to constructing a decentralized service
also dissuades large corporations from participating in these networks, due to liability and reputation
concerns. Also, there may be some circumstances, such as the
OpenDVD suits\cite{OPENDVD}, where adversaries are willing to expend the
energy to
track down
all servers which offer a given document. Indeed, making an example out
of even a few high profile server operators can go a long way towards
reducing the availability of a document.

\subsection{Document-anonymity}

Document-anonymity means that the server does not know the contents
of the document that it is storing or helping to store. This is broken
down into two scenarios. Isolated-server document-anonymity means that
if the server is allowed to look only at the data that it is storing,
it is unable to figure out the contents of the document. Generally this
is achieved via some sort of secret sharing mechanism, either sharing
the document or sharing the key for recreating the document (or both)
across servers.  On the other hand, connected-server document-anonymity
means that the server is allowed to communicate and compare data with all
other servers in the system, but is still unable to determine the contents
of the document. Since a connected server may well be able to act as a
reader and do a document request itself, it seems that connected-server
document-anonymity is difficult to achieve without some trusted party
acting as intermediary and authenticating and authorizing readers.

Notice that merely encrypting the document before publishing it into the
system is not sufficient to achieve document-anonymity: we are concerned
here not with confidentiality of the published document, but instead
with whether the given server can recreate the bits that were inserted
into the system.

\subsection{Query-anonymity}

Query-anonymity refers to the notion that over the course of a given
document query or request, the `identity' of the document itself is not
revealed to the server. In short, this means that although a server may
have many different documents stored, which document was served for a
given request is not knowable by the server. This process of private
information retrieval (PIR) 
 is typically done by a computationally
intensive (or bandwidth-intensive) process of downloading lots of other
documents at the same time to mask which document is being retrieved,
or perhaps by using lots of servers in the transaction. Alternatively,
the use of massive amounts of resources might be solved by particularly
clever mathematics. For an overview of PIR, we suggest Malkin's thesis
\cite{MALKIN-THESIS}.

%Query-anonymity is sort of a special case, in that it addresses anonymity
%for only one of many operations that are generally expected to be supported
%by an anonymous publishing or storage service. 

\section{Linkability: Anonymity vs. Pseudonymity}

The above classes of anonymity describe the issues regarding each
of the agents or operations in the system. However, there are some other broader
characteristics of anonymity to consider.

Anonymity, when compared to pseudonymity, means that the agent
performing the operation has no observable persistent characteristics.
For instance, turning on a radio is a nice way of receiving
information anonymously (modulo Tempest attacks --- actually,
\cite{mobiltrak} describes a company which records which radio
station cars on the highway are tuned to. So far, they claim to have
surveyed over 125 million cars.). Reading the advertisements in the
New York Times is another good example, though again it is not perfect,
since it seems conceivable that somebody could track who purchases a given
newspaper and somehow extend this to who reads a given newspaper. In general,
broadcast media are good ways to achieve
anonymity. A Usenet feed is a bad example here, because if the user is
not the systems administrator of that node, he can be monitored by the
systems administrator; if he is the systems administrator of that node,
then in some sense his upstream provider can `prove' that he received
the information that he read (even though the upstream provider has no
idea which articles he actually physically read).

Pseudonymity, on the other hand, means there is some characteristic associated
with the agent for that transaction, and this characteristic provides some
mechanism for recognizing that other transactions also involved this party.

Both anonymity and pseudonymity in this context retain the notion of
privacy of location. Location describes the actual physical connection to
the communications medium: the speaker is in some sense physically {\em at} his
location. Anonymity is in some sense `more private' than
pseudonymity, because there's less to trace, but having a pseudonym
does not necessarily imply that location is public -- the pseudonym
could well be a reply block on a mixnet, or even simply a keypair which
an author uses to sign all of his documents.

%free haven uses pseudonymity for readers and for servers. we could
%theoretically support anonymity for authors. that really isn't speced
%out yet. (we should fix that.)

\section{Partial Anonymity vs. Full Anonymity}

%The notion of full anonymity is really only defined over an ideal world,
%where the adversary has an infinite set of indistinguishable suspects.
The notion of full anonymity is similar to the use of one-time passwords
in encryption: fully anonymous speech means that hearing the speech
gives a listener no more information about the identity of the speaker
than he had beforehand.
In reality, though, every set of candidates is limited in size, and indeed
the adversary often has partial information about the suspect -- for
instance, `he or she has a high-bandwidth Internet connection', or
`he or she probably lives in California based on activity patterns and
routing analysis'.

So in this clearer context, the question shifts from `is it anonymous?' to
`is it anonymous enough?' If the original set of suspects has $n$ members,
then for sufficiently large $n$ a system which leaks no information that
might reduce the set of suspects seems to be `anonymous enough'. However,
we have to bear in mind that we may well be trying to protect the
identities of all the users of a given service -- that is, even evidence
implying that a given user is one of $n$ users of the service may be
sufficient to make him suspicious, thus compromising his anonymity. This
may discourage corporations and persons who are particularly concerned
about their reputation from participating in a given anonymous service,
and indeed it may put ordinary users at risk as well.

It is not even as simple as whether a user is inside or outside the
`set of suspects'.  Often there is no clearly delineated set, and for
each user no boolean value of `suspected' or not. A given member of this
set might be more suspicious than another member; if the adversary has
this knowledge beforehand, then the system can still be fully anonymous
if it does not leak any new information to confirm or deny that adversary's
initial guesses.

\section{Characteristics of Communication Channels}

The speaker has control over whether to publish his speech over
a given channel, based on the characteristics of that particular
channel. This means that he might tailor his speech, or choose not
to speak at all, based on how much protection the channel provides
and how much anonymity he desires.

\subsection{Computational vs. Information-Theoretic Anonymity}

One issue to consider is the notion of how protected a given address
is: does it rely on computational complexity to protect its anonymity
(e.g., a reply block address on a conventional mixnet), or does it use
some other technique to make the address unknowable even in the face of
a computationally powerful adversary?

%Actually, we've glossed over a very important technical point here:
There are really two alternatives to computational anonymity. The
first alternative is that the adversary has the transcript of the
communication, but is still unable to break its anonymity -- this is
what we call information-theoretic anonymity. This might be modeled by
a trusted third party (with trusted channels) acting as moderator, or it
might perhaps be implemented in some mechanism similar to a one-time pad.

The second option is that the adversary is simply unable to obtain
a transcript of the communication, or perhaps the usefulness of the
transcript `expires' quickly enough to be equivalent to no transcript
at all. This might happen in the case of some physical medium which is
untappable (perhaps quantum channels give us some starting point for
this), or in the case of an improved mixnet where a given address no
longer maps to or is traceable to its destination after the transaction.
This also implies that there is no mechanism for the adversary to take
a snapshot of the entire network at that point -- if he could, then he
might be able to go offline with that snapshot and break the anonymity
of the channels.

\subsection{Perfect Forward Anonymity}

Perfect forward secrecy means that after a given transaction is done,
there is nothing new that the adversary can `get' to help him decrypt
the transcript. Similarly, perfect forward anonymity is the notion
that after a given transaction is done, there is nothing new that the
adversary can get that can help him identify the location or identity
of either of the communicating parties.

In effect, this might be achieved by negotiating a `session location'
(the anonymity analog of a session key) between the parties for the
purposes of that transaction. For instance, this session location might be
a double-blinded virtual location in a high-entropy onion-routed network,
where the transaction takes place effectively instantaneously, and then
all records of paths to and from the virtual location are destroyed.

In this case, a snapshot could in fact be taken of the system at that
point, but this falls under the realm of computational anonymity as
described above.

\section{What does it mean to `break' an anonymous system?}

This question also comes up in the context of encryption: what does it mean
to break an encryption system? Goldreich argues \cite{GOLDREICH-BOOK}
% I'm citing Goldreich's fragments. I don't actually have it on me 
% right now, but it's in there somewhere. maybe better to cite
% goldwasser-micali?
that since we do not
know which parts of the plaintext might be useful to a given adversary
or in a given situation, we must protect every aspect of the plaintext.
More precisely, we must preserve the amount of partial information
that the adversary may have a priori about the
plaintext (due to a non-uniform distribution, for example); that is, we
must prevent the adversary from determining any new information beyond
what he starts with.

It is tempting to claim that even this characterization is insufficient
for anonymous systems, as we describe above regarding Partial Anonymity:
even knowing that an individual is among the group of users may be
sufficient to make him suspicious, and thus sufficient to break the
anonymity of the system. On the other hand, even an ideal channel could
not prevent this disclosure: if there is a method of enumerating the
users of a service, then it seems that there is no means of preventing
some amount of information from leaking. Similarly, if there is no
method of enumerating any users of the service, then it seems that we
have achieved ideal anonymity.

There are issues to bear in mind from the social engineering perspective
as well. No matter how powerful or effective your anonymity
protection is, if a user signs his name to his document his anonymity
is broken. Also, more subtle attacks such as word choice correlation or
writing analysis might yield clues that allow better than even chances
of guessing. All of the above models could be based on the assumption
that a given document is a random distribution of bits. However, we
might instead assume that there is some amount of a priori information
that the adversary can guess or assume about the document, and as long
as our communications don't leak {\em more than} this information,
our system is anonymous.

We can picture the ideal anonymous system as a trusted third party
(TTP) with secure channels to each party in the system. This TTP would
receive confidential messages, strip off the source information, and
confidentially forward the messages to their destination in an unlinkable
fashion. Therefore, our goal is to come up with a decentralized system
that is able to simulate this TTP for each operation. Equivalently,
if Alice is communicating through Ted to Bob, a set of protocols which
allows Alice to communicate directly to Bob without Ted is said to be
anonymous if the transcripts of communication are indistinguishable
from those which include Ted. If they are distinguishable, then that
difference is exactly the `break' that causes the system to leak privacy.

\section{Modeling the Adversary}

%This notion of anonymity can be phrased in terms of games. 

%if Alice is talking to some unknown party Bob,
%can our adversary Eve distinguish with better than even probability between
%whether Alice is talking to Bob or whether Alice is talking to any other
%party Chuck?

Potential adversaries have a variety of capabilities, and similarly
they have a variety of goals.

We break down basic capabilities into active attacks versus passive
attacks -- whether or not the adversary has the ability to actively modify
or prevent communications. However, it is not just a matter of whether
an adversary has full access to some part of the system or no access: it
is more reasonable to model the adversary as controlling some fraction
of the edges in the communications channel (`edge' adversaries), and
some other fraction of the servers in the publication system (`server'
adversaries). This would in turn result in the adversary having some
probability of achieving his goal; this probability increases as
percentage control over the system increases.

While some adversaries might be able to monitor or alter the link between
two servers, another class of capability includes the ability to observe
or even modify the private state of a mixnet node or Free Haven server --
that is, data which is internal to the server and does not get sent over
the network.

Further still, adversaries have capabilities describing their capacity
to perform computation -- an adversary is said to be computationally
bounded, or unbounded, as per the normal definitions.

We also need some notion of the time interval over which the adversary
may collect information or modify state. If an adversary has the ability
to watch or affect edges or servers over a long period of time, he may
have a larger overall impact on the system than if he only had control
for a short amount of time.

Aside from the capabilities that an adversary might show, there is also
the matter of building a taxonomy of goals that an adversary might hope
to achieve. In very broad form, an adversary attacks either to identify
a target or to destroy functionality of the system.

Attacks to identify can be further broken down into attacks to determine
the identity of the target, and attacks to provide proof to a third party
of the identity of the target. Such potential targets include publishers,
readers, and servers.

Attacks to destroy include attacks to modify or prevent communications,
attacks against the functionality or owners of servers, and attacks
against the integrity of documents themselves.

\section{Toward a formal definition of anonymity}

Attempting to formalize anonymity can be more subtle than attempting
to formalize encryption or secrecy goals. When assessing the security
of a given encryption scheme, there are generally two parties who might
have information: the sender and the receiver. The sender generates some
bits, and sends these bits over some channel to the receiver; there are
generally measures in place to ensure that the bits are not modified in
the channel, but the exact characteristics of the channel are not very
important. In our case, however, formalizing the notion of anonymity
involves identifying the characteristics of the communications channel
that might leak information about the identity of either of the parties
involved in the transaction.

To put it in more concrete terms, if Alice is running a server at her
company, what about the cached data in the corporate-run intrusion
detection system down the hall? Do we model that as part of Alice, such
that so-called `edge' adversaries do not have access to it?  What if
one of the packets in Alice's transmission gets sent off course, and 120
seconds later an ICMP destination unreachable packet returns to her with
its data segment an exact copy of one of the packets that she sent to Bob?
Is 120 seconds negligible? What isn't negligible?

The notion of anonymity becomes complex very quickly as we introduce
real-world factors and considerations such as these. We develop a notion
of `publisher indistinguishability' as the beginning of a mechanism for
assessing the anonymity that a system provides.  Informally, the adversary
knows that a document $\Phi$ was published by either party $P_0$ or party
$P_1$. We say that a system provides publisher indistinguishability
if the adversary has negligible advantage over guessing whether $P_0$
or $P_1$ is the real publisher.

We might alternatively have defined this aspect of anonymity in terms of
two documents $\Phi_0$ and $\Phi_1$ and a single user Alice who is known
to be a publisher; the system may have achieved some different level of
anonymity if the adversary has only negligible advantage over guessing
in determining which of $\Phi_0$ and $\Phi_1$ Alice published.

Clearly this topic has ample room for future research.

%\section{Application to real-world projects}

%it sounds like there are 9 characteristics of anonymity we want from

%an ideal anonymous service: basic anonymity for each of the three agents,
%plus information-theoretic and perfect forward anonymity for each class
%of transaction that the service supports [does that actually map well to
%"for each of the three agents" also?]

%most of the related works get 1 or 2 out of these 9. free haven gets
%4 or 5 (basic plus some pretty good protection for authors depending on
%how we implement submission), and if we
%switched to a scintillating mixnet (they don't exist,
%but the notion is that addresses materialize and dematerialize
%instantaneously, so you can have a transaction and a moment later it's
%as if the tunnel you just used never existed) we could probably get
%8 (information-theoretical anonymity for servers is very tough?)

%anyway, i haven't thought this far yet. but there's plenty of ideas
%here.