\section{Publication Services}

There are a number of projects and papers which discuss anonymous publication
services.
We start this section by providing an overview of some of the related projects
and papers. After this overview, we examine in detail the amount of anonymity
and privacy protection that each project offers.

\subsection{The Eternity Service}

Ross Anderson's paper on the Eternity Service \cite{eternity}
is the motivation
for this entire project. It includes a wonderful vision of how the
world might work in the future, in terms of data havens and distributed
decentralized data storage.  The overall goal is to build a system that
provides highly available data: as Anderson phrases, it ``[t]he basic
idea is to use redundancy and scattering techniques to replicate data
across a large set of machines (such as the Internet), and add anonymity
mechanisms to drive up the cost of selective service denial attacks.''

A publisher would upload a document and some digital cash, along with a
requested file duration (cost would be based on document size and desired
duration). In the simple design, a publisher would upload the document to
100 servers, and remember ten of these servers for the purposes of auditing
their performance. Because he does not record most of the servers to whom he
submitted the file, there is no way to identify which of the participating
eternity servers are storing his file. Document queries are done via broadcast,
and document delivery is achieved through one-way anonymous remailers.

On the other hand, it relies on a stable
digital cash scheme, which is certainly not available today. Further, it has a strong
correlation between ability to store data into the system and amount of real-world
capital available. While our proposal does have a loose correlation between available
resources and amount of influence over the servnet, the correlation is not nearly as direct.

There are also a number of contradictions and other issues which are not addressed in
his brief paper: for instance, if documents are submitted anonymously but publishers
are expected to remember a random sample of servers so they can audit them, what do they do when
they find that some server is cheating? Since publishers are anonymous, it would seem that they
have no power at all. Anderson passes this responsibility on to the digital cash
itself, so servers do not receive payment if they stop providing the associated service. He
does not elaborate on the possible implications of this increased accountability to the anonymity of
the authors.

Another problem is that there is no consideration at all to maintaining a dynamic
list of available servers and allowing servers to smoothly join and leave.

He proposes that a directory of files in the system should itself be a file in the system.
However, without a mechanism for updating or revising files, this would appear very difficult
to achieve.

\subsection{The {\tt .cz} implementation}

A team of students at Charles University in Czechoslovakia decided to implement their
version \cite{eternitycz}
of Anderson's
idea. They use a mixnet and made overall reasonable design decisions.
Tonda Bene wrote his PhD
thesis \cite{benethesis}
on this design, and provided many more
concrete explanations of the details that Anderson skips over in his
original paper.  However, there
are a number of issues that we have with their implementation:

\begin{itemize}
\item They implement everything they need by themselves. They do not exist in a
vacuum -- for example, the world already has `good enough' loose time
synchronization applications such as xntpd, so there is no need to implement a new
API and designs. This leads to code bloat and too
many layers of abstraction, which makes verifying security very difficult.
\item Almost all of their files are binary. It would be easier to examine or modify
their configuration files if they were in a human-readable and human-writable format.
\item BSD-limited. Their code is not ported to Linux yet, much less other platforms.
First of all, the port is probably
difficult if they have not yet successfully ported it to another platform. Secondly,
a well-designed system (whether in perl, C, or
Java) should be extremely portable already -- this does not bode well.
\item A centralized (single or few) trusted bank system is assumed.
\end{itemize}

\subsection{Eternity Usenet}

Adam Back proposed \cite{eternityusenet}
a simpler implementation of the Eternity Service, using the existing Usenet
infrastructure to distribute the posted files all around the world. This is an
excellent idea, but on the other hand this limits the participating servers to systems
which already host Usenet news. Further, news administrators much specifically choose to
participate in this variant of the eternity service, and so they may well choose not
to carry the `alt' groups that
comprise the service.

Eternity Usenet uses normal Usenet mechanisms for retrieval, posting,
and expiring, so publishers may not have control over the expiration time or
propagation rate of their document.

To achieve anonymity in publishing, Eternity Usenet employs cypherpunks
type I and type II (mixmaster) remailers as gateways from email to newsgroups.
Publishers PGP-sign documents which they wish to publish into the system:
these documents are formatted in html, and readers make http search or query requests to
`Eternity Servers' which map these requests into NNTP commands either
to a remote news server or a local news spool. With the initial implementation,
the default list of newsgroups to read consists only of 
{\tt alt.anonymous.messages}. The Eternity Server effectively provides an interface
to a virtual web filesystem which posters populate via Usenet posts.

Back treats Usenet as an append-only file system. His system provides support
for replacing files (virtual addresses) because newer posts signed with the same
PGP key are assumed to be from the same publisher. Addresses are claimed on a
first-come first-served basis, and PGP signatures provide linkability between an
initial file at a given address and a revision of that file. However, he does not appear
to provide an explanation of conflict resolution that might arise from two addresses
being claimed at the same time -- since Usenet posts may arrive out of order, it
would seem that there might be some subtle attacks against file coherency if two
different Eternity Servers have a different notion of who owns a file.

Also, while the system is not directly `censorable' as we usually consider it,
the term `eternity' is misleading.
Usenet posts expire based on age and size. Back does not provide an analysis of how
long a given document will survive in the network. The task of making a feasible
distributed store of Eternity documents is left as a future work.

Four public-access Eternity Servers are listed at the end of the article; none of
these servers is still available. This indicates that active work on Eternity Usenet
is not ingoing.

%so anonymity might not be as strongly preserved on either side of the
%transaction. Additionally, people can post data to the service, and then
%prove that a specific server was hosting that data (since all of them are).
%This is crucial for laws where mere possession of the data is illegal.

%\subsection{India Project}

%While the India Project{\footnote {\tt http://www.eecs.harvard.edu/\~{}india/}} has no
%real anonymity measures and is designed with efficiency (high bandwidth reads) in mind,
%it includes an implementation of Rabin's information dispersal algorithm; this might be
%able to be lifted entirely and reused for our project.

% \subsection{Internet Memory Project}

% It, uhm, exists. We should look it up. I bet it doesn't address anonymity.


\subsection{Napster}

The Napster service\cite{napster} is a company based around
connecting people who are offering MP3 files to people who want to download them.
While they provide no real anonymity and disclaim all legal liability, a very important
thing to note about the Napster service is that it is highly successful. Thousands of
people use Napster daily to exchange music; if there were greater security (and comparable
ease of use), we suspect that many thousands more would participate. The existence of
Napster presents a
very clear argument that the Internet community wants a service like this, at least
for music.

%The response to the Metallica suit\cite{metallica} --- in particular,
%the continued use of Napster for trading of illegal MP3 files ---
%indicates that the Internet community is willing to continue use of
%this service even knowing that it could have negative legal
%consequences.

\subsection{Gnutella}

Gnutella\cite{gnutella} is a peer-to-peer Napster clone.  It was
developed by a subsidiary company of AOL, and once it went public it
was immediately shut down by AOL (presumably since AOL has interests
in not disrupting the music and movie industries).
Development is proceeding via a widely scattered group of open-source contributors.
Gnutella depends on the ``Small Worlds'' model to maintain a connected
network; see Subsection \ref{subsec:smallworlds} for a more detailed description
and analysis of this idea.

According to the new developers' web site:

\begin{quotation}
Gnutella puts a stop to all those shenanigans. When you send a query
to the GnutellaNet, there is not much in it that can link that query
to you. I'm not saying it's totally impossible to figure out who's
searching for what, but it's pretty unlikely, and each time your query
is passed, the possibility of discovering who originated that query is
reduced exponentially. More on that in the next section.

In short, there is no safer way to search without being watched.

A big however, however. To speed things up, downloads are not
anonymous. Well, we have to make compromises. But again, nobody's
keeping logs, and nobody's trying to profile you.
\end{quotation}

There is a strong contradiction between their bold statement about
perfection and their warnings that users concerned about maintaining
anonymity really should avoid using the system. Behind the media hype,
it is clear there are a number of aspects of their protocol
\cite{gnutellaprotocol}
that help to reveal identities of users.

The header of a Gnutella packet contains a number of fields. Two of
these fields are the `TTL' (time to live: the number of additional hops
after which the packet should be dropped) and `Hops taken' (the number
of hops this packet has made since its creation). The TTL is started at
some default value based on the expected size of the network, and the
Hops value is effectively an inverse of the TTL during the travel of
the packet. Because the Hops value is 1 when the packet is initially
sent, it is clear when a given server is generating a query (assuming
it is playing by the protocol, which for the vast majority of users is a
reasonable assumption). Even if there were no Hops value, the fact that
the TTL itself has a default value for most client programs is sufficient
to make a server originating a request distinguishable from another
server in the system, in the mathematical sense presented in Chapter 2.

Further, the Gnutella network is not so well distributed as they might
lead users to believe. While the protocol is designed for a user to
set up connections with his `friends', there is no infrastructure in
place for easily building such a set of friends. Instead, the Gnutella
site offers a `default' set of friends with which users can start. Most
users will never change this file if the service is functional. This
means that the actual network is built not as a flat network but rather
as a hierarchical system, as shown in their pictures of the Gnutella
network topology \cite{gnutellamaps}.  There are a small number of
central nodes which would be ideal targets for government agencies or
other organizations collecting information about users.

%If the TTL and Hops fields aren't enough, every message in the system is
%stamped with a Globally Unique Identifier (GUID). This identifier is used
%to recognize messages which have already been seen, so they are not passed
%on redundantly. This prevents the Gnutella network from an exponential
%explosion of messages, since the default TTL is typically as high as 25.

If the TTL and Hops fields aren't enough to reveal identities, it turns out that
only the queries have any
semblance of anonymity protection. The actual downloads are done by
point-to-point connections, meaning that the IP addresses of server
and reader are both revealed to each other. This is done for reasons
of efficiency.

Sites such as the Gnutella Wall of Shame\cite{gnushame}, which attempts
to entrap child pornographers using the Gnutella service, show that the
direct file-transfer portion of the Gnutella service has been demonstrated
to not adequately protect the anonymity of servers or readers.

Gnutella is not designed to be an anonymous communications or publication
network. Gnutella is a network designed to provide {\em availability} for
data from one user to the next, and it does a relatively good job of this.

%http://gnutella.wego.com/go/wego.pages.page?groupId=116705&view=page&folderId=118398&pageId=118400
%http://capnbry.dyndns.org/gnutella/protocol.html
%http://www.rixsoft.com/Knowbuddy/gnutellafaq.html

\subsection{Freenet}

The Freenet project \cite{freenet} is
one of the most popular related works. Like Gnutella, Freenet proposes
an interconnected network of nodes, each acting as both client and
server. When a user wishes to request a document, she hashes the name of
that document (where she gets this name is outside the scope of Freenet)
and then queries her own server about the location. If her server does not
have it, it passes the query on to a nearby server which is `more likely'
to have it. Freenet clusters documents with similar hashes nearby each
other, and uses a complex routing protocol to route queries `downhill'
until they arrive at the desired document.

Freenet is similar to Gnutella in that its main purpose is to provide
highly {\em available} data to its users.  There are also a number of
differences between Freenet and Gnutella. First of all, Freenet caches
data near requestors. Specifically, when a request is answered over a
given path, all servers along that path cache that document. This means
that future queries for that document will be answered very quickly,
assuming the copies of that document haven't expired from the nearby
caches. This introduces another important feature: persistence of
data. Because nodes cache documents as they are requested, the documents
do not disappear from the system when the server offering those documents
disappears.  This is a key improvement over Napster and Gnutella.

Freenet bases document lifetime on the popularity of the document:
frequently requested files get duplicated around the system, whereas
infrequently requested files live in only a few places or die out
completely. While this is a valid choice for a system that emphasizes
availability, it precludes certain uses of the system.  For instance,
I can see circumstances where a file has a 12 month lifetime but only
becomes popular in the last few months of its lifetime.  Examples
include photos of JFK Jr.\ saluting his father, or a (timestamped) Idaho
phone book that has those ten extra names that the FBI might one day be
accused of `erasing'. Indeed, this is already happening -- 
\cite{yugo} describes a case where Yugoslav phone books are being collected ``to
document residency for the close to one million people forced to evacuate
Kosovo.''

Freenet takes some steps to increase anonymity. Their goals include both
sender and reader anonymity, as well as plausible deniability for servers
-- the notion that a server does not know the contents of documents it
is storing. They provide this last, which we clarify as isolated-server
document-anonymity (as opposed to connected-server), by referencing
and storing files as $H(name)$ and having users encrypt the documents
themselves with $name$ before inserting them. This means that anybody who
knows the original $name$ string can decrypt the document, but the server
storing the document is unable to invert $H(name)$ to determine $name$.

However, they have the same flaw with publisher- and reader-anonymity
that Gnutella does, due to the presence of the TTL and Depth (comparable
to Hops) fields in the Freenet message headers. Because the document
requests are also sent through the network (rather than peer-to-peer
as they are in Gnutella), there's room for a little bit more anonymity
than Gnutella provides. However, nodes nearby a reader still have a
greater than even chance of being able to identify that reader. Further,
statistical attacks similar to those described in the Crowds \cite{crowds}
paper might work to pinpoint the location of a given reader or publisher;
however, the caching does provide some protection against this since
the network topology for a given document changes dramatically after
each request. This needs to be analyzed further.

Another attack to consider is that simply doing a request from a strict
political environment will cause a copy of the data to migrate into that
jurisdiction, giving law enforcement more reason to shut those servers
down. On the other hand, it may well be that such environments will close
down servers without requiring `sufficient cause', so the political and
legal arguments are meaningless.

It might even be that a less anonymous system is more likely to be
accepted in more parts of the world. This will have to be explored simply
by trying it. However, because Freenet provides weaker anonymity than
Free Haven, it is currently unsuitable for use by those who are truly
concerned about maintaining their privacy, such as political dissidents
or other whistleblowers. (See Chapter 4 for more details about these issues.)


\subsection{Graduated Mirroring}

The `Graduated Mirroring' proposal \cite{gradmirroring} was introduced
by Ron Rivest in response to our initial proposals about a buddy system
and other accountability measures that greatly increase the complexity
of the system. In short, this idea involves a group of servers, each
controlled by a person called the `manager', which all sign up to receive
documents published to the service. Each published document arrives at
each server; the manager manually peruses the document and decides how
important or valuable he believes it to be. Based on this evaluation,
he chooses how many shares of the document to store. For each share,
he basically chooses a random share number and generates that share
via some information dispersal algorithm -- this share generation is
implemented by evaluating a polynomial which describes the overall
document at a number of random values.

Server managers can modify their support for a given document. If they
want to support the document less, they simply throw out some of their
shares. If they want to support the document more, they retrieve the
document and generate some new shares.

When a document no longer has enough support for readers to be able to
reconstruct it, that document has effectively expired; server managers
still holding information about that document may then decide to throw
away whatever they have about the document -- this effectively replaces
the notion of a publisher-chosen expiration date with a much simpler
notion of server-popularity. This notion of popularity is similar to the
notion that Freenet uses, but the duration of a document is based on
how much the {\em servers} like the document rather than on how much the
readers like the document.

While this system has some excellent features, including simplicity first
and foremost, we have a number of issues with it. The first issue is that
it is not what we want to build: the fact that only popular data would get
mirrored is counter to our design goal of content-neutrality. We believe
that it does not capture the essence of what we want from a data haven.

Indeed, it also goes against our basic assumptions about computing: we
have a lot of hardware, and very few people. This trend will get more
pronounced as time goes on.  Having a person hand-sort and consider each
item really cuts down on the number of people who would be willing to
host a server.  Overall, we believe that paving the way for an automated
robust data haven based on privacy of publisher and data is going to
have more of an effect in the long run.

%\subsection{Anon}

%http://www.inf.tu-dresden.de/\~{}hf2/anon/

\subsection{Intermemory} 

The Intermemory Project \cite{GY98} \cite{CEGGSY98} is an initiative at NEC
Research aimed at producing an archival system which makes use of 
spare space on the Internet. The goal is high availability and high
persistence of information. Intermemory uses information dispersal
to mitigate the consequences of server failure, and spends much time
addressing systems issues such as synchronization of information 
between many different servers.

Servers join Intermemory and donate a certain amount of space 
temporarily in return for the right to store some small fraction of
that space in the system forever. The ``economic'' viability of the
Intermemory design depends on the assumption that tomorrow's storage
will be cheap and plentiful enough to meet the obligations incurred
today. 

At present, Intermemory exists as a prototype implementation within NEC.
The public papers on Intermemory do not even address security, much
less anonymity, and it is not clear how the system reacts in the
presence of malicious adversaries. We therefore do not formally compare
Intermemory to the other systems listed here. Instead, we mention it
as an example of a publication and archival system which is designed 
without the severe constraints necessary to ensure anonymity. 

\subsection{Publius}

Publius attacks the problem of anonymous publishing from a different angle.
Rather than trying to come up with a routing protocol like Gnutella and
Freenet, Publius simply employs some one-way anonymous channel to transfer
documents from publishers to servers. The Publius protocol is designed to
maintain availability of documents on these servers.

In this system, a publisher generates a key $K$ for her document, and encrypts the document
with this key. She performs Shamir's secret-sharing algorithm to build a set
of $n$ key shares, any $k$ of which is sufficient to reconstruct $K$. From
there, she chooses some $n$ of the Publius servers and anonymously delivers
the encrypted message plus one share to each of these $n$ servers.

In this way, the document is replicated over each server, but the key is split
over the $n$ servers. Document reading is implemented by running a local web proxy
on the reader's machine; the $n$ addresses chosen as servers are concatenated into
a URL which is presumably published or otherwise remembered. (Publius provides no
description of how a directory service
might be built, or any other mechanism for remembering URLs after documents are
inserted.)

The URL used to retrieve the document specifies all $n$ servers at which the document
was stored. This means that to retrieve a given document, the local proxy fetches
each document independently, reconstructs the original key $K$, and then decrypts
the document.

The Publius system provides strong publisher-anonymity, because a one-way
channel is sufficient for communications to the servers. In addition,
because a cryptographically strong secret-sharing protocol is used and
each server only receives one share, Publius provides both computational
and also information-theoretic isolated-server document-anonymity:
a given server is not able to determine anything about a document that
it is helping to store.

On the other hand, there are a number of limitations beyond those the
authors of the paper enumerate. For instance, the entire scheme is based
on a static, system-wide list of available servers. Since these servers
are permanent, there is no support for adding new servers or purging dead
ones. Perhaps more importantly, however, there is no support for recognizing
misbehaving servers and removing from them the list of available servers.

Another point is that readers cannot determine if a share is corrupt
simply by examining it: the reader must request all of the shares and attempt
to reconstruct in order to determine the integrity of a share. Providing a
mechanism for self-evident share integrity checking might provide significant
robustness to the system.

Publius is by far the strongest related work in terms of our notions of
anonymity. The paper is very well-written and goes into considerable detail
on various attacks and counters to those attacks.

\subsection{An analysis of anonymity}

Many of these related works offer their own variant of `anonymity' for some
of the agents in the system. In this section, we analyze this anonymity that
each work provides in the context of the definitions and formalizations
proposed in Chapter 2.

% \begin{center}
\begin{table}[htc]
\begin{tabular}{|c|ccccc|}
\hline
Project     & Publisher & Reader & Server & Document & Query \\
\hline
Gnutella    &           &        &        &          &       \\
\hline
FreeNet     &           &        &        &  \cm     &       \\
\hline
Eternity Usenet& \cm    &        &        &          &       \\
\hline
Publius     &    \cm    &        &        &  \cm     &       \\
\hline
Free Haven  &    \cm    &  \cm   &  \cm   &  \cm     &       \\
\hline
\end{tabular}
\caption{Overview: Computational Anonymity}
\label{table:comp-anon-pub}
\end{table}
% \end{center}

This first table provides an overview of the protections 
for each of the broad categories of anonymity against computationally-limited
adversaries.
Informally, for polynomially-bounded adversaries who have passive access
to some of the edges between agents, a \cm on this table indicates that the
adversary has less than a polynomial + $\frac12$ chance of correctly
guessing the identity of one of the individuals involved in any given
transaction.

Gnutella fails to provide publisher-anonymity, reader-anonymity, or
server-anonymity because of the peer-to-peer connections for actual
file transfer. Because Gnutella servers start out knowing the intended
contents of the document they are offering, they also fail to provide
document-anonymity.

Freenet achieves document-anonymity because servers are not unable to
reverse the hash of the document name to determine the key with which
to decrypt the document. However server-anonymity is not provided because
given a document, it is very easy to locate a server that is carrying
that document -- querying any server at all will result in that server
carrying the document! Because of the TTL and Hops fields for both reading
and publishing, Freenet also fails to achieve publisher- or reader-anonymity.

Eternity Usenet provides publisher anonymity via the use of one-way anonymous
remailers. Reader anonymity is not protected, and it is clear that a Usenet
service that offers Eternity files is carrying the Eternity feed. Because
each downstream host gets its only entire copy of the feed, there is no
document-anonymity in Eternity Usenet.

Publius achieves document-anonymity because
the key is split between the $n$ servers, and without sufficient shares
of the key a server is unable to decrypt the document that is stores.
Because documents are published to Publius through a one-way anonymous
remailer, it provides publisher-anonymity. However, it provides no support
for protecting readers, and the servers containing a given file are clearly
marked in the URL used for retrieving that file.

Free Haven achieves publisher-anonymity via an anonymous remailer channel.
Similarly, reader-anonymity is provided because the responses are directed
through a mixnet to a temporary address that the reader provides. Server
anonymity is maintained because document requests are performed via broadcast,
and the results arrive out of a one-way channel.
Free Haven achieves document-anonymity because the document itself is
split; assuming a wide enough dispersal of documents, a given server will
never see enough shares of a document to be able to reconstruct it. Certainly
a server that is not actively trying to rebuild a document (for instance, by
doing a request for the other shares) will not have the document available,
since it is broken into shares before publication.

\begin{table}[htc]
%\begin{center}
\begin{tabular}{|c|ccc|ccc|ccc|cc|cc|}
\hline
Project     & \multicolumn{3}{c}{Publisher} & \multicolumn{3}{c}{Reader} \vline & \multicolumn{3}{c}{Server}
        \vline & \multicolumn{2}{c}{Document} \vline & \multicolumn{2}{c}{Query} \vline \\
            & C    & I-T & P-F & C    & I-T & P-F & C    & I-T & P-F & C    & I-T & C    & I-T \\
\hline
Gnutella    &      &     &     &      &     &     &      &     &     &      &     &      &     \\
\hline
FreeNet     &      &     &     &      &     &     &      &     &     & \cm  &     &      &     \\
\hline
Eternity Usenet & \cm & \cm & \cm &&&& &&& && & \\
\hline
Publius     & \cm  & \cm & \cm &      &     &     &      &     &     & \cm  & \cm &      &     \\
\hline
Free Haven  & \cm  & \cm & \cm & \cm  &     & \cm & \cm  &     &     & \cm  &     &      &     \\
\hline
FH + ideal mix &\cm&\cm&\cm & \cm  & \cm & \cm & \cm  & \cm &\cm  & \cm  &     &      &     \\
\hline
\end{tabular}
\caption{Anonymity Properties of Publishing Systems}
\label{table:real-anon-pub}
\end{table}
%\end{center}

This second table provides a more detailed view of the different publishing
systems which we examined. There are a number of interesting points that
become clearer once we increase the resolution of the table.

First of all,
we see that the secret sharing algorithm which Publius uses provides a stronger
form of document-anonymity, since an isolated server really can learn nothing at all
about the contents of a document that it is helping to store. Secondly, we
see that those systems which provide publisher anonymity tend to provide a
very strong form of it -- namely, computational, information-theoretic, and
also perfect-forward publisher anonymity. This is because the publishers can
make use of one-way communications channels, effectively removing any linkability;
removing the need for a reply pseudonym on the mixnet means that there is
`nothing to crack'. Thirdly, we note that Free Haven achieves perfect-forward
reader anonymity: this is because readers make up a new pseudonym for every
document request, so there is no linkability.

The most important line by far in this table is the last line, though. The
fact that it implies that Free Haven is part of the reason it achieves
this much anonymity is slightly misleading: it really is the ideal mix
which is providing the first nine aspects of anonymity. Assuming a robust
ideal mix network like that described in Chapter 2, there would be no
linkability between transactions, and mere computational ability on the
part of the adversary would be insufficient to identify the parties in a
transaction.

This would mean that we could in fact relegate most of the anonymity to the
communications channel itself, and provide a simple back-end file system or
equivalent service to transfer documents between agents. Thus the design of
the back-end system could be based primarily on addressing other issues such
as availability of documents, protections against flooding and denial of
service attacks, and accountability in the face of this enforced anonymity.

Unfortunately, the design and deployment of such an ideal mix network is 
a very challenging task, and one for our future works chapter. Until then,
we focus on providing increased anonymity with a combination of more
conventional mix networks and a publication system tailored for the mixnet.