\section{Motivation/Intro}

While micropayments provide an excellent mechanism for anonymous exchange, there are a number of situations where a more long-term pseudonymous or even public relationship is more appropriate. For instance, in the case of transactions where one party promises a service over a long period of time (such as storing a document for three years), a simple one-time payment generally makes one party in the transaction vulnerable to being cheated. A whistleblower or political dissident who publishes a document may not wish to monitor the availability of this document and make a number of incremental micropayments over the course of several years, since this requires periodic network access and since continued micropayments might compromise anonymity. [other options: third-party escrow monitoring services, or document sponsors] In addition, some systems might want to base decisions on the observed behavior of entities -- how well they actually perform -- rather than simply how many resources they can provide.

In general, the difference between reputation-based trust systems and micropayment-based trust systems is that in reputation-based systems, parties base their decisions in part on information provided by third parties. Decisions on whether to transact are based on information from third parties; decisions on whether to transact *honestly* are based (in part) on fear of news of misdealings reaching third parties.

There are a number of domains in which such a reputation system would be useful. These range from decentralized pseudonymous publishing systems such as Free Haven, where servers choose which servers to transact with based on observed past performance, to centralized B2B reputation filtering and tracking services such as those provided by Reputation Technologies, where buyers give feedback on the quality of vendors and this feedback is aggregated to help future buyers choose more appropriate vendors. In both cases, there are very large, well-funded organizations that might have an incentive to exploit the system to gain stronger reputations. For instance, an adversary of Free Haven might want to uncover the identity of individuals, destroy documents, or simply shut down the system. In the case of Reputation Technologies, a dishonest vendor might want to forge good feedback, or bribe others into providing it, to raise his resulting reputation. In addition to wanting high reputations themselves, vendors might like their competitors' reputations to appear low. Exchanges -- online marketplaces which try to bring vendors together to make transactions more convenient -- would like their vendors' reputations to appear higher than those of vendors that do business on other exchanges. Vendors with low reputations -- or those with an interest in people being kept in the dark -- would like reputations to appear unusably random. Dishonest users might like the reputations of the vendors that they use to be inaccurate, so that their competitors will have inaccurate information.

Throughout the rest of this section, we focus on the specific case of developing a reputation system for the online large-business exchange market. While a few aspects of this domain are specific to this particular situation, most of the issues that we face are common to other reputation domains as well. In this domain, buyers give ratings -- that is, numbers which reflect their satisfaction with a given transaction -- for a variety of different dimensions for each vendor.
For instance, a given vendor might have good performance in terms of response time or customer service, but the vendor's geographic location might be inconvenient. Buyers provide feedback on a number of these rating dimensions at once, to provide a more comprehensive view of the entity. The job of the reputation system is to aggregate these ratings into one or more published scores which are meaningful and useful to participants in the system.

\section{Requirements of a scoring system}

Some requirements of a good scoring system might include:
\begin{itemize}
\item Accurate: it reflects the quality and honesty of a given entity.
\item Long-term performance indicator: it reflects the certainty (likelihood of accuracy) of a given score, and can distinguish between a new entity (unknown quality) and an entity with bad long-term performance.
\item Weighted towards current behavior: it recognizes and reflects recent trends in entity performance (e.g., poor performance at the beginning but marked improvement since, or vice versa).
\item Computable/efficient: it would be convenient if we could calculate the score quickly; calculations that can be performed incrementally are important.
\item Robust against attacks: it should resist attempts by any entity or entities to influence scores other than by being more honest or having higher quality.
\item Amenable to statistical evaluation: it is easy to find outliers, etc.
\item Private: no one other than the rater should be able to learn how a given rater rated an entity.
\item Smooth: adding any single rating or small number of ratings doesn't jar the score `much'.
\item Understandable: it is easy to explain to people who use these scores -- not only so they know how it works, but so they can evaluate for themselves what each score implies.
\item Verifiable: a score under dispute can be supported with data.
\end{itemize}
Note that a number of these requirements seem to be contradictory. We will explore the benefits of and tradeoffs among them over the course of the rest of this [section].

\section{Attacks and Adversaries}

Several questions motivate the evaluation of these systems. First, what information needs to be protected? Second, who are the adversaries? The capabilities of potential adversaries and the extent to which they can damage or influence the system dictate how much energy should be spent on security. For instance, in the case of Free Haven, if political dissidents actually begin using the system to publish their reports and information, government intelligence organizations might be sufficiently motivated to spend millions of dollars to track the documents to their source. Similarly, if corporations who are planning a fifty-million-dollar transaction base their decisions on the reputation score that Reputation Technologies Inc. provides, it could be worth many millions of dollars to influence the system so that a particular company is chosen.

Perhaps the simplest attack that can be made against a scoring system is what is called *shilling*. While shilling is generally thought of in terms of submitting fake bids in an auction, it can be considered in the broader context of submitting fake or misleading ratings. In particular, an entity might submit positive ratings for one of his friends (positive shilling), or might submit negative ratings for his competition (negative shilling). Either of these ideas leads to more subtle attacks, such as negatively rating a friend or positively rating a competitor to try to trick others into believing that competitors are trying to cheat.
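To make the effect of even the crudest shilling concrete, here is a minimal sketch (in Python; the ratings, the 0--10 scale, and the attacker's behavior are illustrative assumptions of ours, not data from any real system) showing how a handful of fabricated ratings moves a naive unweighted average -- the kind of score with no defense against multiple submissions from the same party.

\begin{verbatim}
# Minimal sketch: how positive or negative shilling moves a naive
# average score. All numbers here are illustrative assumptions.

def naive_score(ratings):
    """Unweighted mean of all submitted ratings."""
    return sum(ratings) / len(ratings)

# Ten honest ratings of a mediocre vendor on a 0-10 scale.
honest = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5]
print(naive_score(honest))               # 5.1

# Positive shilling: the vendor's friends submit five perfect ratings.
print(naive_score(honest + [10] * 5))    # about 6.7

# Negative shilling against a competitor works the same way in reverse.
print(naive_score(honest + [0] * 5))     # 3.4
\end{verbatim}

Because the score is a plain average, an attacker who can submit enough ratings moves it by a predictable amount no matter how many honest ratings already exist, which is what makes the attacks described below so easy to mount.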
Shilling is a very straightforward attack, but many systems are nonetheless vulnerable to it. A very simple example is the AOL Instant Messenger system, where any user can click to claim that a given user is abusing the system. Since there is no support for detecting multiple comments from the same person, a series of repeated negative votes will exceed the threshold required to kick the user off the system for bad behavior, effectively denying him service. Even in a more sophisticated system which detects multiple comments by the same person, an attacker could mount the same attack by assuming a multitude of identities.

These vulnerabilities due to oversimplification of the scoring system are not limited to `toy' systems like instant messaging. Indeed, eBay suffers from a very similar problem. On eBay, the reputation score for an individual is a linear combination of good and bad ratings, one for each transaction. Thus a vendor who has performed dozens of transactions and cheats only one out of every four customers will have a steadily rising reputation, whereas a vendor who is completely honest but has only done ten transactions will be displayed as less reputable. In particular, a vendor could make a good profit (and build a strong reputation!) by being honest for several small transactions and then being dishonest for a single large transaction. Weighting ratings by the size of the transaction helps with this problem, but does not solve it. In this case, large transactions would have a large impact on the reputation score of a vendor, and small transactions would have only a small impact. Since small transactions don't have much weight, vendors have no real incentive to be honest for them -- making the reputation services useless for small buyers. Breaking reputation into many different dimensions, each representing the behavior of the vendor on a given *category* of transaction (based on cost, volume, region, etc.), can solve many of these problems; see the section below on algorithms for more details and analysis of this idea.

\section{Aspects of a scoring system}
\subsection{Domain}

The particular scoring system or algorithm used in a given domain should be based on the amount of information available, the extent to which information must be kept private, and the amount of accuracy required from the scoring system. In some situations, such as verifying voting age, a fine-grained reputation measurement is not necessary -- simply indicating whether an entity seems `sufficient' or `insufficient' is good enough. The fact that precision may not be required is very fortunate: in many domains, it is very difficult to collect enough information to provide a comprehensive view of each entity's behavior.

One reason it might be difficult to collect information about entities is a very low transaction volume, as we see today in online large-business markets. But there is a deeper issue than just whether there are transactions or whether these transactions are trackable. More generally: does there exist some sort of proof (a receipt or other evidence) that the rater and ratee have actually interacted? Being able to prove the existence of transactions reduces problems on a wide variety of fronts. For instance, it makes it more difficult to forge large numbers of entities or transactions. This protection means that the AOL Instant Messenger abuse described above is tougher to perform.
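As a sketch of how proof-of-transaction might be enforced in practice, the following Python fragment accepts a rating only when it arrives with a valid, previously unused receipt issued at transaction time. The use of an HMAC keyed by the reputation service, and all of the names here, are our own illustrative assumptions; a real deployment might instead use the blinded receipts discussed below, which add buyer privacy.

\begin{verbatim}
# Illustrative sketch: accept ratings only with a valid, unused
# transaction receipt. Keys, receipt format, and names are assumptions.
import hmac, hashlib

SERVICE_KEY = b"key held by the reputation service"
used_receipts = set()
accepted_ratings = []

def issue_receipt(transaction_id):
    """Handed to the buyer when the transaction is recorded."""
    return hmac.new(SERVICE_KEY, transaction_id.encode(),
                    hashlib.sha256).hexdigest()

def accept_rating(transaction_id, receipt, rating):
    expected = issue_receipt(transaction_id)
    if not hmac.compare_digest(expected, receipt):
        return False          # no proof this transaction ever happened
    if receipt in used_receipts:
        return False          # at most one rating per transaction
    used_receipts.add(receipt)
    accepted_ratings.append((transaction_id, rating))
    return True

r = issue_receipt("txn-42")
print(accept_rating("txn-42", r, 8))          # True
print(accept_rating("txn-42", r, 10))         # False: receipt already used
print(accept_rating("txn-99", "forged", 10))  # False: no valid receipt
\end{verbatim}

A rater without a receipt simply cannot rate, so flooding attacks of the AOL Instant Messenger variety now cost at least one real (or convincingly faked) transaction per fake rating.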
Similarly, eBay users today might directly purchase a high reputation by giving eBay its cut on a dozen false transactions which they claim to have performed with their friends; with transaction verification, they would be required to go the extra step of actually shipping goods back and forth.

In addition, there are a number of situations in which such a reputation system would simply not be useful. For instance, if there were thousands of buyers and only one or a few vendors, being able to track the performance and reliability of those few vendors would not help buyers pick a good vendor. On the other hand, tracking performance might provide feedback to the vendor itself on areas in which it might need improvement. This in turn could result in better performance down the road, but only if the vendor acted on this feedback. Similarly, in fields where a given buyer generally doesn't perform transactions frequently, the benefits of a reputation system are more subtle. A buyer might benefit from a real estate reputation system, since she expects to rent from different people over time. Even for a domain where she expects to do only one transaction ever (such as laser eye surgery), she would probably contribute to a reputation site -- first out of altruism, and second in order to give the surgeon an incentive to do well. This is the tragedy of the commons in reverse: when the cost of contributing to a system is low enough, people will contribute to it for reasons not *immediately* beneficial to themselves.

In some domains, it is in most raters' perceived advantage that everyone agree with the rater. This is how chain letters, Amway, and other pyramid-style schemes get their shills: by establishing a system where customers are motivated to recruit other customers. Similarly, if a vendor offers to discount past purchases once enough future customers buy the same product, it will be hard to get honest ratings for that vendor. This applies to all kinds of investments; once you own an investment, it is not in your interest to rate it negatively so long as it holds any value at all.

\subsection{Collecting ratings}

One of the first problems encountered in developing systems like these is the question of how to collect ratings. The answer is highly dependent on the domain, of course, but there are a number of aspects that are common across many domains. The first option is simply to observe as much activity as possible, and draw conclusions based on this activity. This can be a very effective technique for automated reputation systems that have a lot of data available to them. If you can observe the transaction flow and notice that a particular vendor has very few repeat customers, then it is likely that he deserves a low reputation. On the other hand, a lack of repeat customers may simply indicate a market where any given buyer transacts only infrequently. Similarly, finding a vendor with many repeat customers might indicate superior quality, or it might just indicate a market where one or a few vendors hold a monopoly over that product. Knowledge of the domain in question is crucial to knowing how to correctly interpret the data.

In many circumstances, it will be difficult or impossible to observe the transaction flow, or unreasonable to expect parties to take the initiative in providing feedback. In these cases, a reasonable option is to solicit feedback from the parties involved in each transaction.
This can be done either by publicizing your existence and your interest in such feedback and providing incentives to respond, or by actively going to each party after an observed transaction and requesting comments. Tying feedback to transactions is a very powerful way of reducing vulnerabilities in the system. It makes it more difficult for people to spam positive feedback, since each item of feedback has to be associated with a particular transaction, and presumably only the latest (or first -- see below under selling votes) piece of feedback on a given transaction would actually count. On the surface, it looks like this requires an exchange or other third-party transaction moderator, to make it difficult to simply fabricate a series of several thousand transactions and exploit the same vulnerability. However, vendors could provide `blinded' receipts for transactions: the vendor would not be able to identify which buyer was providing the ratings, and feedback from a given buyer would be ignored unless accompanied by such a receipt. Thus buyers could not provide feedback about a vendor without that vendor's `permission'. There are a number of new problems introduced by this idea, such as how to respond if vendors fail to provide a receipt, but it seems to address many of the difficult issues around shilling in a decentralized environment.

A final issue to consider when collecting ratings is how representative the ratings will be -- that is, how evenly distributed the raters are across the set of people who have been doing transactions. If the only people who have an incentive to provide ratings are those who are particularly unhappy with their transaction, or if only people with sufficient technical background can navigate the rating submission process, then the resulting scores may be skewed to the point of being unusable. One solution is to try to come up with incentives for a more evenly distributed set of responses; another approach is to simply collect so much data that the issue is no longer relevant. (Note that systematic errors or biases in the ratings will generally defeat this second approach.)

\subsection{Bootstrapping}

One of the tough questions for a reputation-based trust system is how to get started: if users make choices based on the scores that are available to them, but the system has not yet collected enough data to provide useful scores, what incentive do buyers have to use the system? More importantly, what incentive do they have to contribute ratings to the system? The situation for Free Haven is slightly easier, because some participants will be generous and willing to try out new nodes just to test their stability and robustness. In effect, they will be performing a public service by risking some of their reputation and resources on evaluating unknown nodes. Businesses, particularly businesses just getting started in their field and trying to make a name for themselves, won't necessarily be as eager to spend any of their already limited transaction volume on trying out unknown suppliers.

The way to present initial scores for entities depends on the domain. In some non-commercial domains, it might be perfectly fine to present a series of entities and declare no knowledge or preference; in others, it might be more reasonable to list only those entities for which a relatively certain score is known.
Reputation Technologies needs to provide some initial value to its users; this can be done by asking vendors to provide references, and then asking those references to fill out a survey describing overall performance and happiness with that vendor. While this bootstrapping information may not be as useful as actual transaction-based feedback (and is correspondingly more suspect), it is certainly a good starting point for a new system.

The issue of bootstrapping as a distinct process is much more pronounced in a centralized system than in a decentralized one. This is because in a decentralized system, each user develops his own picture of the universe: he builds his trust of each entity based on past direct performance and on referrals from other trusted parties. Thus every new user effectively joins the system `at the beginning', and building a profile of other entities is an ongoing process throughout the entire lifetime of the system. In a centralized environment, on the other hand, ratings are accumulated across many different transactions and over long periods of time. New users trust the centralized repository to provide information about times and transactions from before the user joined the system.

In a newly developed system, or for a new entity in the system, the choice of the default reputation score is critical. If it is easy to create a new identity (that is, pseudonym), and new users start out with an average reputation, then users who develop a bad reputation are encouraged to simply drop their old identity and start over with a new one. One way to deal with this problem is to start all new users with the lowest possible reputation score; thus even users with a bad track record have an incentive to keep their current identity. This may not be necessary in situations where it is preferable to have a number of transactions on record, even if they are poor transactions [example?]. Another approach is to make it difficult to create a new identity -- for instance, by requiring some proof of identity for registration, or by charging a monetary fee. Tying the user to his real-world identity would be an effective technique.

\subsection{Personalized Reputation}

The user interface -- that is, the way of presenting scores and asking questions -- is a crucial element of the system. Scores cannot simply be static universal values representing the overall quality of an individual. Since a score is an attempt to predict future performance, each score must be a prediction for a particular context. That is, the user interface must allow participants to query the system for the likelihood of a successful transaction in their particular situation. This flexibility on the client side provides a much more powerful and useful system; regardless of the algorithm used to aggregate the ratings into scores, being able to query for a specific enough scenario is crucial.

Another aspect of the user interface is displaying a *confidence value* associated with each score. The mechanism for generating this confidence value depends on the domain and on the scoring algorithm, but in general it should reflect the expected accuracy of this particular score. For instance, it might reflect the number of ratings which were used to generate the score; it might reflect the standard deviation of the set of ratings; or it might reflect the level of agreement between several different scoring algorithms which were all run against the ratings set.
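As one concrete illustration of such a confidence value, the sketch below (our own construction, not a Reputation Technologies algorithm) reports a weighted mean score together with a confidence that grows with the number of ratings and shrinks with their spread; the particular formula is an assumption chosen only to show the shape of the idea.

\begin{verbatim}
# Sketch: a score plus a crude confidence value. The formula (confidence
# from sample size and standard deviation) is an illustrative assumption.
from math import sqrt

def score_with_confidence(ratings, weights=None, scale=10.0):
    if weights is None:
        weights = [1.0] * len(ratings)
    mean = sum(r * w for r, w in zip(ratings, weights)) / sum(weights)
    if len(ratings) > 1:
        var = sum((r - mean) ** 2 for r in ratings) / (len(ratings) - 1)
    else:
        var = scale ** 2          # a single rating tells us very little
    # Confidence in [0, 1]: more ratings and less disagreement -> higher.
    confidence = ((1.0 - min(sqrt(var) / scale, 1.0))
                  * (1.0 - 1.0 / sqrt(len(ratings) + 1)))
    return mean, confidence

print(score_with_confidence([8, 9, 8, 7, 9, 8]))  # high score, fair confidence
print(score_with_confidence([2, 10, 1, 9]))       # middling score, low confidence
print(score_with_confidence([9]))                 # one rating: no confidence
\end{verbatim}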
Not only does this confidence value allow users to get a better feel for how firm a given score is, but it can also allow a more customized search. That is, a user might request that only scores with a certain minimum confidence value be displayed, which would weed out new users as well as users with unusual (widely varying) transaction patterns.

In some domains, providing qualitative statements along with a more quantitative score seems much more appealing. Simply providing a number may not feel as genuine or useful to users -- indeed, allowing qualitative statements can provide more flexibility in the system, as users providing feedback might discuss topics and dimensions which are difficult for survey authors to anticipate. Rather than providing a fixed and comprehensive set of dimensions for which users provide numbers as feedback, the system can merely collect statements and let other users read those statements to add color and depth to the score. On the other hand, it is very difficult to integrate these statements into more numerical scores, particularly if they cover unanticipated dimensions. Also, as the number of statements increases, it becomes less useful to simply print all of them. However, choosing which statements to print not only requires manual intervention and choice, but might also lead to legal liability. Another problem with providing verbal statements as part of the score is the issue of using the scoring system in many different countries: statements may need to be translated, but numbers are universal.

\subsection{Scoring Algorithms}

As we have seen in the previous sections, there are many different aspects to scoring systems. While we believe that providing query flexibility is perhaps the most crucial aspect of the system, another important aspect is the actual algorithm used to aggregate ratings into scores. Such an algorithm needs to satisfy most of the requirements that we laid out in section {foo}. Broadly speaking, the scoring algorithm should provide accurate scores, while preventing dishonest users from affecting the system and also preventing privacy leaks (as detailed in the next section).

Preventing dishonest users from affecting the system can be done in several ways. One simple way is to run statistical tests independent of the actual aggregation algorithm, to attempt to detect outliers or other suspicious behavior such as a clique of conspiring users. Once this suspicious behavior has been identified, system operators can go back, manually examine the system, and try to prune the bad ratings. While this appears to be a time-intensive approach that could not possibly scale in a deployed system, reputation-based systems such as eBay have used exactly this technique to clean up once dishonest users have been noticed [eBay Feedback Removal Policy. http://pages.ebay.com/help/community/fbremove.html].

A more technically sound approach is to weight the ratings based on the *credibility* of the rater. That is, certain people contribute more to the score of a given entity based on their past predictive ability. Google makes use of this idea in its Internet search engine algorithm. Its algorithm counts the number of references to a given page -- the more pages that reference that page, the more popular it is. In addition, pages that are referenced from popular pages are themselves given more weight. This simple credibility metric produces much more accurate responses for web searches.
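To show the shape of credibility weighting in code, here is a minimal sketch in which each rating is weighted by its rater's credibility, and credibility is adjusted according to how well the rater's past ratings predicted later outcomes. The 0--1 credibility range, the starting value for unknown raters, and the update rule are all illustrative assumptions rather than a prescribed algorithm.

\begin{verbatim}
# Sketch: weight ratings by rater credibility; adjust credibility by
# how well past ratings predicted outcomes. All parameters are assumptions.

credibility = {}                 # rater id -> weight in (0, 1]

def get_cred(rater):
    return credibility.get(rater, 0.5)   # unknown raters start in the middle

def weighted_score(ratings):
    """ratings: list of (rater_id, value) pairs."""
    total = sum(get_cred(r) for r, _ in ratings)
    return sum(get_cred(r) * v for r, v in ratings) / total if total else None

def update_credibility(rater, predicted, observed, scale=10.0):
    """Grow credibility for accurate predictions, shrink it for poor ones."""
    error = abs(predicted - observed) / scale        # 0 = perfect prediction
    credibility[rater] = min(1.0, max(0.05, get_cred(rater) * (1.2 - error)))

ratings = [("alice", 9), ("bob", 3), ("carol", 8)]
print(weighted_score(ratings))                     # everyone weighted equally
update_credibility("bob", predicted=3, observed=9) # bob's rating predicted poorly
print(weighted_score(ratings))                     # bob now counts for less
\end{verbatim}

The same skeleton extends naturally to the *local credibility* idea discussed next, by making the credibility lookup depend on how similar the rater's situation is to the querier's.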
By introducing the notion of *local credibility*, rather than a simple global credibility for each entity, the system can provide a great deal of flexibility and thus stronger predictive value. Local credibility means that a rating is weighted more strongly if the situation in which that rating was given is similar to the current situation. For instance, a small farmer in Idaho looking into the reputation of chicken vendors cares more about the opinion of a small farmer in Arkansas than he does about the opinion of the Great Chinese Farming Association. Thus the algorithm would generate a score which more accurately reflects the quality of the vendor according to other similar buyers. Similarly, if Google knew more about the person doing the web search, it could provide an even more accurate answer [Press Release for Open Profiling Standard. http://home.netscape.com/newsref/pr/newsrelease411.html] [note that firefly.net returns "no more data is available" -- the company was bought by Microsoft and apparently incorporated into Microsoft Passport. also crushed its plans to incorporate a plugin into Navigator. -dm]

One of the problems with incorporating credibility into the scoring algorithm is that in some domains an individual's ability to perform the protocol honestly is quite separate from his ability to predict the performance of others. In the Free Haven system, this corresponds to a server which is willing to store documents and supply them to readers, but keeps no logs about transactions or trades (so it has no idea which other servers are behaving honestly). In the case of Reputation Technologies, one vendor might be excellent at providing products on time and with high quality, leading to a high reputation score, but that same vendor might have only average skill at assessing other vendors. Similarly, a consulting firm might specialize in predicting the performance of vendors, but not actually sell any products of its own. One way to solve this problem is to keep separate scores for performance and for credibility. This increases the complexity of keeping track of entities and their reputations, but could provide tremendous increases in accuracy and flexibility for scoring systems.

Weighting by credibility is not the only way to improve the accuracy and robustness of the scoring algorithm. Another approach is to make use of the idea that previous transactions similar to the current transaction should carry more weight in predicting an entity's ability to complete the transaction successfully. That is, a vendor's ability to sell high-quality Granny Smith apples should have some bearing on his ability to sell high-quality Red Delicious apples. Of course this could backfire if the vendor in question specializes only in Granny Smith apples and doesn't even sell Red Delicious apples, but in general weighting by the so-called *category* of the transaction (and thus the vendor's reputation in related categories) is a very powerful idea. Separating reputations into categories can act as a defense against some of the subtle shilling attacks described above, such as when a vendor develops a good reputation at selling yo-yos and runs a side business of fraudulently selling used cars. But the category idea doesn't solve everything. In particular, it raises a very difficult question: how do we pick categories? How do we know which categories are related to which other categories, and how related they are?
Can this be automated somehow, or do the correlation coefficients have to be estimated manually? In the case of Free Haven, where there is really only one commodity -- a document -- and servers either behave or they don't, it might be feasible to develop a set of categories manually and allow each server to manually configure the numbers that specify how related each category is to the others. For instance, one category might be files less than 100K which expire within a month. A strongly related category would be files between 100K and 200K which expire within a month -- perhaps we would say that this category is 0.9-related to the first. A mostly unrelated category would be files more than 500 megabytes which expire in 24 months; we might declare that this category is 0.05-related to the first two. With some experience, an algorithm might be developed which can tweak the correlation coefficients on the fly, based on how effective the current values have been at predicting the results of future transactions. Similarly, we might be able to reduce the discrete categories to a single continuous function which converts the `distance' between file size and expiration date into a correlation coefficient.

Reputation Technologies is not so lucky. Within a given exchange, buyers and sellers might barter thousands of different types of goods, each with different qualities and prices; the correlation between any pair of categories might be entirely unclear. To make matters worse, each vendor might have only a few transactions on record, leaving data too sparse for any meaningful comparison.

While we have presented some techniques to provide more accuracy and flexibility in using ratings, we still haven't discussed actual algorithms that can be used to determine scores. The simplest such algorithm involves treating reputations as probabilities -- a reputation is effectively an estimate of the likelihood that a future transaction in that category will be honest. In this case, scores are simply computed as the weighted sum of the ratings. More complex systems can be built out of neural networks or data clustering techniques, to try to come up with ways of applying non-linear fitting and optimizing systems to the field of reputation. Note that as the complexity of the scoring algorithm increases, it becomes more and more difficult for actual users of these systems to understand the implications of a given score or to understand what flaws might be present in the system.

A final note on scoring algorithms concerns the *adversarial approach* to scoring systems. In many statistical or academic approaches, the goal is simply to combine the ratings into as accurate a score as possible; no regard is given to whether participants in the system can conspire to provide ratings which break the particular algorithm used. A concrete example might help to illustrate the gravity of this point. One of the often-referenced pitfalls of applying neural networks comes from the US military, which wanted to teach its computers to identify tanks on the battlefield. They took a series of pictures that included tanks, and a series of pictures that did not. But it turned out that the pictures with the tanks were taken during the night, and the pictures without the tanks were taken during the day. This caused the high-tech neural network to learn not how to identify a tank, but how to distinguish day from night.
The lesson for AI developers is to remember that there may be a number of factors that differ between sets of samples, and the neural network might not learn quite what you want it to learn. But consider the situation from our perspective: what if the Russian military were in charge of providing the tank pictures? Is there a system that can be set up to resist bad data samples? Many would consider learning how to identify a tank under those circumstances impossible. What if the Russians could provide only half of the pictures? Only a tenth? Clearly this is a much more complicated problem. When developing scoring systems, we need to keep in mind that simply applying evaluation techniques intended for a `clean' environment may introduce serious vulnerabilities.

\subsection{Privacy and Information Leaks}

Yet another issue to consider when designing a good scoring system is whether the system will be vulnerable to attacks which attempt to learn about the tendencies or patterns of entities in the system. In a business-oriented domain, knowledge about transaction frequency, transaction volume, or even the existence of a particular transaction might be worth a lot of time and money to competitors. Using a simple and accurate scoring algorithm seems to imply that it should be easy to understand the implication of a vendor's score changing from 8.0 to 9.0 over the course of a day -- perhaps one or more ratings arrived regarding large transactions, and those ratings were very positive. The objectives of providing timeliness and accuracy in the scoring algorithm and of maintaining the privacy of transaction data seem to be at odds. Fortunately, there are a number of ways to help alleviate the leakage problems without affecting accuracy too significantly. We describe some of the more straightforward of these techniques below.

The problem of hiding transaction data for individual transactions is very similar to the problem of hiding source and destination data for messages going through mix networks [D. Chaum, "Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms", Communications of the ACM, vol. 24, no. 2, February 1981]. More specifically, the notion of having a new rating influence the published score by a certain amount is very similar to the notion of tracking a message across a middleman node in a mix network. In both cases, privacy becomes significantly easier as transaction volume increases. In both cases, adversaries observe external aspects of the system (in the case of the scoring system, the change in the score; in the case of the mix network, the messages on the links to and from the mix node) to try to determine the details of some particular message or group of messages (or any message at all).

One common attack against the privacy of a scoring system is a *timing attack*. For instance, the adversary might observe transactions and changes in the scores, and then try to determine the rating values that certain individuals submitted. Alternatively, the adversary might observe changes in the scores and attempt to discover information about the timing or size of transactions. The analog of this attack in the mix environment is an adversary watching the timing of messages going to and from nodes, and trying to determine which incoming message corresponds to which outgoing message. There are a number of solutions that can be very useful.
First of all, introducing extra latency between the time ratings are submitted and the time the new score is published can make timing correlations more difficult. (On the other hand, this might reduce the quality of the system, since scores are not updated immediately.) Another good solution is to queue ratings and process them in bulk; this prevents the adversary from being able to determine which of the ratings in that bulk update had which effect on the score. A variant of this approach is the *pooling* approach, in which some number of ratings is kept in a pool. When a new rating arrives, it is added to the pool, and a rating from the pool is chosen at random and aggregated into the score. Obviously, in both cases a higher transaction volume makes it easier to provide timely score updates.

On the other hand, an active adversary can respond to bulk or pooled updates with what is known as an *identification flooding attack*: he submits ratings with known effect, and watches for changes in the score not due to those ratings. This approach works because he can `flush' the few still-anonymous ratings by submitting enough known ratings to fill the queue. Note that this attack only works if the adversary can produce a significant fraction of the ratings during a given time period. [present good example of an intersection attack [perhaps think of one first]]

But all this concern over privacy may not be relevant at all. In some domains, such as Free Haven, the entire goal of the reputation system is to provide as much information about each pseudonymous server as possible. For instance, being able to figure out how Alice performed on Bob's transaction is always considered to be a good thing. In addition, even if privacy is a concern, the requirement of providing accurate, timely scores may be so much more important that no steps should be taken to increase user privacy.

\section{Decentralizing the scoring system}

While many of the issues in this section have been presented in the context of a centralized reputation scoring and filtering service, most of them are applicable to either centralized or decentralized systems. In a decentralized system such as Free Haven, each server runs the entire reputation-gathering system independently. This requires each node to make do with only the information which it has gathered first-hand, and generally requires a broadcast mechanism in order for all nodes to keep their information databases synchronized. Another approach is to decentralize the scoring system itself, spreading it among the entire set of machines participating in the system. Below we present two ways of decentralizing a scoring system. The first uses redundancy along with user flexibility to reduce the risk from cheating or compromised servers. The second is a more traditional approach to decentralizing a system, but it also brings along the more traditional problems of decentralization, such as high bandwidth requirements and difficult cryptographic problems.

\subsection{Multiple trusted parties}

There is a set $S$ of scorers around the world, each independently run and operated. When a transaction happens, the vendor constructs a set of tickets $T$ for some subset of $S$ that he chooses. Each ticket is a receipt allowing the buyer to rate the vendor at a particular scorer. The receipts are blinded so that the vendor is not able to link a ticket with any given customer. At this point, the buyer can decide to which scorer or scorers he wishes to submit his ratings.
Since each scorer could potentially use its own algorithm and have its own prices or publishing habits, each might offer its own set of tradeoffs among accuracy, privacy, and security. This technique allows the vendor to veto some of the scorers first, after which the rater chooses from among the remaining scorers; thus ratings will only be submitted to mutually agreeable scorers. We could extend this scenario to allow both parties in the transaction to provide tickets to each other, creating a more symmetric rating process. This approach introduces complications: both parties in the transaction need to coordinate and agree on which tickets will be provided before the transaction is completed, and there needs to be some mechanism to enforce the exchange, or publicize the failure, if one side of the transaction fails to provide the promised receipts.

The beauty of decentralizing the scoring system in this manner is that every individual in the system can choose which parts of the system they want to interact with. Participants in transactions can list scorers which they trust to score them fairly, raters can choose scorers which they trust not to leak rating information, and users looking for scores on various entities can choose scorers which they trust to provide accurate scores. Of course, this decentralization process introduces the issue of meta-reputation: how do we determine the reputations of the reputation servers? This sort of reputation issue is not new. Some Mixmaster nodes are more reliable than others [Electronic Frontiers Georgia Remailer Uptime List. http://anon.efga.org/], and users and operators keep uptime and performance lists of various nodes as a public service. We expect that reputation scoring services would similarly gain (external) reputations based on their reliability or speed.

\subsection{True decentralization}

[figure: four columns. The first column has one R. The second column has a vertical line of C's. The third column has a vertical line of S's. The fourth column has a couple of "client"s. The R in the first column has arrows pointing to every C. Each S in the third column has bidirectional arrows pointing to every C (ellipse arrows as necessary). The fourth column has some clients doing queries onto the S's.]

In this scenario, both sides of the transaction obtain blinded receipts as above. Apart from these *Raters*, the system also consists of a set of *Collectors* and a set of *Scorers*. After the transaction, each rater splits his rating into shares using Shamir's secret sharing algorithm or some other k-of-n system, and submits one share to each Collector. This means that the Collectors together could combine the shares to determine his rating, but separately they can learn no information. It is the job of the Scorers to provide useful information to clients: when a client does a reputation query for a specific category (situation), the Scorer does the equivalent of an *encrypted database query* over the set of Collectors [Tal Malkin, Private Information Retrieval and Oblivious Transfer, PhD Thesis, MIT, 1999].

There are a number of technical challenges that need to be solved in order to make this scheme work. First of all, the Collectors need some mechanism for authenticating a rating without reading it. Similarly, they need some way of authorizing a rater to put his shares onto the system without being able to know the author of a given rating.
Without this protection, malicious raters could simply flood the system with data until it overflowed. Once these problems are solved, we need to come up with a computationally feasible and bandwidth-feasible way of communicating between the Scorers and the Collectors, and a set of rules that allows the Scorers to get the information they need to answer a given query without letting them get `too much' information.

With this decentralization come some subtle questions. Can Scorers `accidentally forget' to include a specific rating when they're computing a score? Put another way, is there some way of allowing Scorers to prove that they included a certain rating in the calculation of the score, without publishing the actual ratings that were used? This question is similar to that of allowing mix nodes to prove that they forwarded a given message without yielding information that might help an adversary determine the source or destination of that message [Masayuki Abe, "Universally Verifiable MIX-network with Verification Work Independent of the Number of MIX Servers", EUROCRYPT '98, Springer-Verlag LNCS].
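To make the rater-to-Collector step concrete, here is a small sketch of splitting a single numeric rating into k-of-n Shamir shares, one per Collector; the prime modulus and the parameters are illustrative assumptions, and the authentication, authorization, and private-query machinery discussed above is omitted entirely.

\begin{verbatim}
# Sketch: split one rating into k-of-n Shamir shares, one per Collector.
# The prime modulus and parameters are illustrative assumptions.
import random

PRIME = 2_147_483_647      # comfortably larger than any rating value

def split(secret, k, n):
    """Return n shares; any k of them reconstruct the secret."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

rating = 7                              # the rater's rating for this transaction
shares = split(rating, k=3, n=5)        # one share submitted to each Collector
print(reconstruct(shares[:3]))          # any three Collectors recover 7
print(reconstruct(shares[2:]))          # ... as does any other set of three
\end{verbatim}

Fewer than k Collectors learn nothing about the rating, which is the property the Collector/Scorer split relies on; everything hard about the scheme -- authenticating shares, private queries, and proofs of inclusion -- lies outside this fragment.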