\section{Motivation/Intro}

While micropayments provide an excellent mechanism for anonymous exchange, there are a number of situations where a more long-term pseudonymous or even public relationship is more appropriate. For instance, in the case of transactions where one party promises a service over a long period of time (such as storing a document for three years), a simple one-time payment generally makes one party in the transaction vulnerable to being cheated. A whistleblower or political dissident who publishes a document may not wish to monitor the availability of this document and make a number of incremental micropayments over the course of several years, since this requires periodic network access and since continued micropayments might compromise anonymity. [other options: third-party escrow monitoring services, or document sponsors] In addition, some systems might want to base decisions on the observed behavior of entities -- how well they actually perform -- rather than simply how many resources they can provide.

In general, the difference between reputation-based trust systems and micropayment-based trust systems is that in reputation-based systems, parties base their decisions in part on information provided by third parties. Decisions on whether to transact are based on information from third parties; decisions on whether to transact *honestly* are based (in part) on fear of news of misdealings reaching third parties.

There are a number of domains in which such a reputation system would be useful. These range from decentralized pseudonymous publishing systems such as Free Haven, where servers choose which servers to transact with based on observed past performance, to centralized B2B reputation filtering and tracking services such as those provided by Reputation Technologies, where buyers give feedback on the quality of vendors and this feedback is aggregated to help future buyers choose more appropriate vendors. In both cases, there are very large, well-funded organizations that might have an incentive to exploit the system to gain stronger reputations. For instance, an adversary of Free Haven might want to uncover the identity of individuals, destroy documents, or simply shut down the system. In the case of Reputation Technologies, a dishonest vendor might want to forge good feedback, or bribe others into providing it, to raise his resulting reputation. In addition to wanting high reputations themselves, vendors might like their competitors' reputations to appear low. Exchanges -- online marketplaces which try to bring vendors together to make transactions more convenient -- would like their vendors' reputations to appear higher than those of vendors that do business on other exchanges. Vendors with low reputations -- or those with an interest in people being kept in the dark -- would like reputations to appear unusably random. Dishonest users might like the reputations of the vendors that they use to be inaccurate, so that their competitors will have inaccurate information.

Throughout the rest of this section, we focus on the specific case of developing a reputation system for the online large-business exchange market. While a few aspects of this domain are specific to this particular situation, most of the issues that we face are common to other reputation domains as well. In this domain, buyers give ratings -- that is, numbers which reflect their satisfaction with a given transaction -- for a variety of different dimensions for each vendor.
For instance, a given vendor might have good performance in terms of response time or customer service, but the vendor's geographic location might be inconvenient. Buyers provide feedback on a number of these rating dimensions at once, to provide a more comprehensive view of the entity. The job of the reputation system is to aggregate these ratings into one or more published scores which are meaningful and useful to participants in the system.

\section{Requirements of a scoring system}

Some requirements of a good scoring system might include:
\begin{itemize}
\item Accurate: it reflects the quality and honesty of a given entity.
\item Long-term performance indicator: it reflects the certainty (likelihood of accuracy) of a given score, and can distinguish between a new entity (unknown quality) and an entity with bad long-term performance.
\item Weighted towards current behavior: it recognizes and reflects recent trends in entity performance (e.g., poor performance at the beginning but marked improvement since, or vice versa).
\item Computable/efficient: it would be convenient if we could calculate the score quickly; calculations that can be performed incrementally are important.
\item Robust against attacks: it should resist attempts by any entity or entities to influence scores other than by being more honest or having higher quality.
\item Amenable to statistical evaluation: it is easy to find outliers, etc.
\item Private: no one other than the rater should be able to learn how a given rater rated an entity.
\item Smooth: adding any single rating or small number of ratings doesn't jar the score `much'.
\item Understandable: it is easy to explain to people who use these scores -- not only so they know how it works, but so they can evaluate for themselves what each score implies.
\item Verifiable: a score under dispute can be supported with data.
\end{itemize}
Note that a number of these requirements seem to be contradictory. We will explore the benefits of and tradeoffs among them over the course of the rest of this [section].

\section{Attacks and Adversaries}

Several questions motivate the evaluation of these systems. First, what information needs to be protected? Second, who are the adversaries? The capabilities of potential adversaries and the extent to which they can damage or influence the system dictate how much energy should be spent on security. For instance, in the case of Free Haven, if political dissidents actually begin using the system to publish their reports and information, government intelligence organizations might be sufficiently motivated to spend millions of dollars to track the documents to their source. Similarly, if corporations who are planning a fifty-million-dollar transaction base their decisions on the reputation score that Reputation Technologies Inc. provides, it could be worth many millions of dollars to influence the system so that a particular company is chosen.

Perhaps the simplest attack that can be made against a scoring system is what is called *shilling*. While shilling is generally thought of in terms of submitting fake bids in an auction, it can be considered in the broader context of submitting fake or misleading ratings. In particular, an entity might submit positive ratings for one of his friends (positive shilling), or might submit negative ratings for his competition (negative shilling). Either of these ideas leads to more subtle attacks, such as negatively rating a friend or positively rating a competitor to try to trick others into believing that competitors are trying to cheat.
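To make the effect of even the crudest shilling concrete, here is a minimal sketch (in Python; the ratings, the 0--10 scale, and the attacker's behavior are illustrative assumptions of ours, not data from any real system) showing how a handful of fabricated ratings moves a naive unweighted average -- the kind of score with no defense against multiple submissions from the same party.

\begin{verbatim}
# Minimal sketch: how positive or negative shilling moves a naive
# average score. All numbers here are illustrative assumptions.

def naive_score(ratings):
    """Unweighted mean of all submitted ratings."""
    return sum(ratings) / len(ratings)

# Ten honest ratings of a mediocre vendor on a 0-10 scale.
honest = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5]
print(naive_score(honest))               # 5.1

# Positive shilling: the vendor's friends submit five perfect ratings.
print(naive_score(honest + [10] * 5))    # about 6.7

# Negative shilling against a competitor works the same way in reverse.
print(naive_score(honest + [0] * 5))     # 3.4
\end{verbatim}

Because the score is a plain average, an attacker who can submit enough ratings moves it by a predictable amount no matter how many honest ratings already exist, which is what makes the attacks described below so easy to mount.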
Shilling is a very straightforward attack, but many systems are nonetheless vulnerable to it. A very simple example is the AOL Instant Messenger system, where any user can click to claim that a given user is abusing the system. Since there is no support for detecting multiple comments from the same person, a series of repeated negative votes will exceed the threshold required to kick the user off the system for bad behavior, effectively denying him service. Even in a more sophisticated system which detects multiple comments by the same person, an attacker could mount the same attack by assuming a multitude of identities.

These vulnerabilities due to oversimplification of the scoring system are not limited to `toy' systems like instant messaging. Indeed, eBay suffers from a very similar problem. On eBay, the reputation score for an individual is a linear combination of good and bad ratings, one for each transaction. Thus a vendor who has performed dozens of transactions and cheats only one out of every four customers will have a steadily rising reputation, whereas a vendor who is completely honest but has only done ten transactions will be displayed as less reputable. In particular, a vendor could make a good profit (and build a strong reputation!) by being honest for several small transactions and then being dishonest for a single large transaction. Weighting ratings by the size of the transaction helps with this problem, but does not solve it. In this case, large transactions would have a large impact on the reputation score of a vendor, and small transactions would have only a small impact. Since small transactions don't have much weight, vendors have no real incentive to be honest for them -- making the reputation services useless for small buyers. Breaking reputation into many different dimensions, each representing the behavior of the vendor on a given *category* of transaction (based on cost, volume, region, etc.), can solve many of these problems; see the section below on algorithms for more details and analysis of this idea.

\section{Aspects of a scoring system}
\subsection{Domain}

The particular scoring system or algorithm used in a given domain should be based on the amount of information available, the extent to which information must be kept private, and the amount of accuracy required from the scoring system. In some situations, such as verifying voting age, a fine-grained reputation measurement is not necessary -- simply indicating whether an entity seems `sufficient' or `insufficient' is good enough. The fact that precision may not be required is very fortunate: in many domains, it is very difficult to collect enough information to provide a comprehensive view of each entity's behavior.

One reason it might be difficult to collect information about entities is a very low transaction volume, as we see today in online large-business markets. But there is a deeper issue than just whether there are transactions or whether these transactions are trackable. More generally: does there exist some sort of proof (a receipt or other evidence) that the rater and ratee have actually interacted? Being able to prove the existence of transactions reduces problems on a wide variety of fronts. For instance, it makes it more difficult to forge large numbers of entities or transactions. This protection means that the AOL Instant Messenger abuse described above is tougher to perform.
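As a sketch of how proof-of-transaction might be enforced in practice, the following Python fragment accepts a rating only when it arrives with a valid, previously unused receipt issued at transaction time. The use of an HMAC keyed by the reputation service, and all of the names here, are our own illustrative assumptions; a real deployment might instead use the blinded receipts discussed below, which add buyer privacy.

\begin{verbatim}
# Illustrative sketch: accept ratings only with a valid, unused
# transaction receipt. Keys, receipt format, and names are assumptions.
import hmac, hashlib

SERVICE_KEY = b"key held by the reputation service"
used_receipts = set()
accepted_ratings = []

def issue_receipt(transaction_id):
    """Handed to the buyer when the transaction is recorded."""
    return hmac.new(SERVICE_KEY, transaction_id.encode(),
                    hashlib.sha256).hexdigest()

def accept_rating(transaction_id, receipt, rating):
    expected = issue_receipt(transaction_id)
    if not hmac.compare_digest(expected, receipt):
        return False          # no proof this transaction ever happened
    if receipt in used_receipts:
        return False          # at most one rating per transaction
    used_receipts.add(receipt)
    accepted_ratings.append((transaction_id, rating))
    return True

r = issue_receipt("txn-42")
print(accept_rating("txn-42", r, 8))          # True
print(accept_rating("txn-42", r, 10))         # False: receipt already used
print(accept_rating("txn-99", "forged", 10))  # False: no valid receipt
\end{verbatim}

A rater without a receipt simply cannot rate, so flooding attacks of the AOL Instant Messenger variety now cost at least one real (or convincingly faked) transaction per fake rating.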
Similarly, eBay users today might directly purchase a high reputation by giving eBay its cut on a dozen false transactions which they claim to have performed with their friends; with transaction verification, they would be required to go the extra step of actually shipping goods back and forth.

In addition, there are a number of situations in which such a reputation system would simply not be useful. For instance, if there were thousands of buyers and only one or a few vendors, being able to track the performance and reliability of those few vendors would not help buyers pick a good vendor. On the other hand, tracking performance might provide feedback to the vendor itself on areas in which it might need improvement. This in turn could result in better performance down the road, but only if the vendor acted on this feedback. Similarly, in fields where a given buyer generally doesn't perform transactions frequently, the benefits of a reputation system are more subtle. A buyer might benefit from a real estate reputation system, since she expects to rent from different people over time. Even for a domain where she expects to do only one transaction ever (such as laser eye surgery), she would probably contribute to a reputation site -- first out of altruism, and second in order to give the surgeon an incentive to do well. This is the tragedy of the commons in reverse: when the cost of contributing to a system is low enough, people will contribute to it for reasons not *immediately* beneficial to themselves.

In some domains, it is in most raters' perceived advantage that everyone agree with the rater. This is how chain letters, Amway, and other pyramid-style schemes get their shills: by establishing a system where customers are motivated to recruit other customers. Similarly, if a vendor offers to discount past purchases once enough future customers buy the same product, it will be hard to get honest ratings for that vendor. This applies to all kinds of investments; once you own an investment, it is not in your interest to rate it negatively so long as it holds any value at all.

\subsection{Collecting ratings}

One of the first problems encountered in developing systems like these is the question of how to collect ratings. The answer is highly dependent on the domain, of course, but there are a number of aspects that are common across many domains. The first option is simply to observe as much activity as possible, and draw conclusions based on this activity. This can be a very effective technique for automated reputation systems that have a lot of data available to them. If you can observe the transaction flow and notice that a particular vendor has very few repeat customers, then it is likely that he deserves a low reputation. On the other hand, a lack of repeat customers may simply indicate a market where any given buyer transacts only infrequently. Similarly, finding a vendor with many repeat customers might indicate superior quality, or it might just indicate a market where one or a few vendors hold a monopoly over that product. Knowledge of the domain in question is crucial to knowing how to correctly interpret the data.

In many circumstances, it will be difficult or impossible to observe the transaction flow, or unreasonable to expect parties to take the initiative in providing feedback. In these cases, a reasonable option is to solicit feedback from the parties involved in each transaction.
This can be done either by publicizing your existence and your interest in such feedback and providing incentives to respond, or by actively going to each party after an observed transaction and requesting comments. Tying feedback to transactions is a very powerful way of reducing vulnerabilities in the system. It makes it more difficult for people to spam positive feedback, since each item of feedback has to be associated with a particular transaction, and presumably only the latest (or first -- see below under selling votes) piece of feedback on a given transaction would actually count. On the surface, it looks like this requires an exchange or other third-party transaction moderator, to make it difficult to simply fabricate a series of several thousand transactions and exploit the same vulnerability. However, vendors could provide `blinded' receipts for transactions: the vendor would not be able to identify which buyer was providing the ratings, and feedback from a given buyer would be ignored unless accompanied by such a receipt. Thus buyers could not provide feedback about a vendor without that vendor's `permission'. There are a number of new problems introduced by this idea, such as how to respond if vendors fail to provide a receipt, but it seems to address many of the difficult issues around shilling in a decentralized environment.

A final issue to consider when collecting ratings is how representative the ratings will be -- that is, how evenly distributed the raters are across the set of people who have been doing transactions. If the only people who have an incentive to provide ratings are those who are particularly unhappy with their transaction, or if only people with sufficient technical background can navigate the rating submission process, then the resulting scores may be skewed to the point of being unusable. One solution is to try to come up with incentives for a more evenly distributed set of responses; another approach is to simply collect so much data that the issue is no longer relevant. (Note that systematic errors or biases in the ratings will generally defeat this second approach.)

\subsection{Bootstrapping}

One of the tough questions for a reputation-based trust system is how to get started: if users make choices based on the scores that are available to them, but the system has not yet collected enough data to provide useful scores, what incentive do buyers have to use the system? More importantly, what incentive do they have to contribute ratings to the system? The situation for Free Haven is slightly easier, because some participants will be generous and willing to try out new nodes just to test their stability and robustness. In effect, they will be performing a public service by risking some of their reputation and resources on evaluating unknown nodes. Businesses, particularly businesses just getting started in their field and trying to make a name for themselves, won't necessarily be as eager to spend any of their already limited transaction volume on trying out unknown suppliers.

The way to present initial scores for entities depends on the domain. In some non-commercial domains, it might be perfectly fine to present a series of entities and declare no knowledge or preference; in others, it might be more reasonable to list only those entities for which a relatively certain score is known.
Reputation Technologies needs to provide some initial value to its users; this can be done by asking vendors to provide references, and then asking those references to fill out a survey describing overall performance and happiness with that vendor. While this bootstrapping information may not be as useful as actual transaction-based feedback (and is correspondingly more suspect), it is certainly a good starting point for a new system.

The issue of bootstrapping as a distinct process is much more pronounced in a centralized system than in a decentralized one. This is because in a decentralized system, each user develops his own picture of the universe: he builds his trust of each entity based on past direct performance and on referrals from other trusted parties. Thus every new user effectively joins the system `at the beginning', and building a profile of other entities is an ongoing process throughout the entire lifetime of the system. In a centralized environment, on the other hand, ratings are accumulated across many different transactions and over long periods of time. New users trust the centralized repository to provide information about times and transactions from before the user joined the system.

In a newly developed system, or for a new entity in the system, the choice of the default reputation score is critical. If it is easy to create a new identity (that is, pseudonym), and new users start out with an average reputation, then users who develop a bad reputation are encouraged to simply drop their old identity and start over with a new one. One way to deal with this problem is to start all new users with the lowest possible reputation score; thus even users with a bad track record have an incentive to keep their current identity. This may not be necessary in situations where it is preferable to have a number of transactions on record, even if they are poor transactions [example?]. Another approach is to make it difficult to create a new identity -- for instance, by requiring some proof of identity for registration, or by charging a monetary fee. Tying the user to his real-world identity would be an effective technique.

\subsection{Personalized Reputation}

The user interface -- that is, the way of presenting scores and asking questions -- is a crucial element of the system. Scores cannot simply be static universal values representing the overall quality of an individual. Since a score is an attempt to predict future performance, each score must be a prediction for a particular context. That is, the user interface must allow participants to query the system for the likelihood of a successful transaction in their particular situation. This flexibility on the client side provides a much more powerful and useful system; regardless of the algorithm used to aggregate the ratings into scores, being able to query for a specific enough scenario is crucial.

Another aspect of the user interface is displaying a *confidence value* associated with each score. The mechanism for generating this confidence value depends on the domain and on the scoring algorithm, but in general it should reflect the expected accuracy of this particular score. For instance, it might reflect the number of ratings which were used to generate the score; it might reflect the standard deviation of the set of ratings; or it might reflect the level of agreement between several different scoring algorithms which were all run against the ratings set.
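As one concrete illustration of such a confidence value, the sketch below (our own construction, not a Reputation Technologies algorithm) reports a weighted mean score together with a confidence that grows with the number of ratings and shrinks with their spread; the particular formula is an assumption chosen only to show the shape of the idea.

\begin{verbatim}
# Sketch: a score plus a crude confidence value. The formula (confidence
# from sample size and standard deviation) is an illustrative assumption.
from math import sqrt

def score_with_confidence(ratings, weights=None, scale=10.0):
    if weights is None:
        weights = [1.0] * len(ratings)
    mean = sum(r * w for r, w in zip(ratings, weights)) / sum(weights)
    if len(ratings) > 1:
        var = sum((r - mean) ** 2 for r in ratings) / (len(ratings) - 1)
    else:
        var = scale ** 2          # a single rating tells us very little
    # Confidence in [0, 1]: more ratings and less disagreement -> higher.
    confidence = ((1.0 - min(sqrt(var) / scale, 1.0))
                  * (1.0 - 1.0 / sqrt(len(ratings) + 1)))
    return mean, confidence

print(score_with_confidence([8, 9, 8, 7, 9, 8]))  # high score, fair confidence
print(score_with_confidence([2, 10, 1, 9]))       # middling score, low confidence
print(score_with_confidence([9]))                 # one rating: no confidence
\end{verbatim}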
Not only does this confidence value allow users to get a better feel for how firm a given score is, but it can also allow a more customized search. That is, a user might request that only scores with a certain minimum confidence value be displayed, which would weed out new users as well as users with unusual (widely varying) transaction patterns.

In some domains, providing qualitative statements along with a more quantitative score seems much more appealing. Simply providing a number may not feel as genuine or useful to users -- indeed, allowing qualitative statements can provide more flexibility in the system, as users providing feedback might discuss topics and dimensions which are difficult for survey authors to anticipate. Rather than providing a fixed and comprehensive set of dimensions for which users provide numbers as feedback, the system can merely collect statements and let other users read those statements to add color and depth to the score. On the other hand, it is very difficult to integrate these statements into more numerical scores, particularly if they cover unanticipated dimensions. Also, as the number of statements increases, it becomes less useful to simply print all of them. However, choosing which statements to print not only requires manual intervention and choice, but might also lead to legal liability. Another problem with providing verbal statements as part of the score is the issue of using the scoring system in many different countries: statements may need to be translated, but numbers are universal.

\subsection{Scoring Algorithms}

As we have seen in the previous sections, there are many different aspects to scoring systems. While we believe that providing query flexibility is perhaps the most crucial aspect of the system, another important aspect is the actual algorithm used to aggregate ratings into scores. Such an algorithm needs to satisfy most of the requirements that we laid out in section {foo}. Broadly speaking, the scoring algorithm should provide accurate scores, while preventing dishonest users from affecting the system and also preventing privacy leaks (as detailed in the next section).

Preventing dishonest users from affecting the system can be done in several ways. One simple way is to run statistical tests independent of the actual aggregation algorithm, to attempt to detect outliers or other suspicious behavior such as a clique of conspiring users. Once this suspicious behavior has been identified, system operators can go back, manually examine the system, and try to prune the bad ratings. While this appears to be a time-intensive approach that could not possibly scale in a deployed system, reputation-based systems such as eBay have used exactly this technique to clean up once dishonest users have been noticed [eBay Feedback Removal Policy. http://pages.ebay.com/help/community/fbremove.html].

A more technically sound approach is to weight the ratings based on the *credibility* of the rater. That is, certain people contribute more to the score of a given entity based on their past predictive ability. Google makes use of this idea in its Internet search engine algorithm. Its algorithm counts the number of references to a given page -- the more pages that reference that page, the more popular it is. In addition, pages that are referenced from popular pages are themselves given more weight. This simple credibility metric produces much more accurate responses for web searches.
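To show the shape of credibility weighting in code, here is a minimal sketch in which each rating is weighted by its rater's credibility, and credibility is adjusted according to how well the rater's past ratings predicted later outcomes. The 0--1 credibility range, the starting value for unknown raters, and the update rule are all illustrative assumptions rather than a prescribed algorithm.

\begin{verbatim}
# Sketch: weight ratings by rater credibility; adjust credibility by
# how well past ratings predicted outcomes. All parameters are assumptions.

credibility = {}                 # rater id -> weight in (0, 1]

def get_cred(rater):
    return credibility.get(rater, 0.5)   # unknown raters start in the middle

def weighted_score(ratings):
    """ratings: list of (rater_id, value) pairs."""
    total = sum(get_cred(r) for r, _ in ratings)
    return sum(get_cred(r) * v for r, v in ratings) / total if total else None

def update_credibility(rater, predicted, observed, scale=10.0):
    """Grow credibility for accurate predictions, shrink it for poor ones."""
    error = abs(predicted - observed) / scale        # 0 = perfect prediction
    credibility[rater] = min(1.0, max(0.05, get_cred(rater) * (1.2 - error)))

ratings = [("alice", 9), ("bob", 3), ("carol", 8)]
print(weighted_score(ratings))                     # everyone weighted equally
update_credibility("bob", predicted=3, observed=9) # bob's rating predicted poorly
print(weighted_score(ratings))                     # bob now counts for less
\end{verbatim}

The same skeleton extends naturally to the *local credibility* idea discussed next, by making the credibility lookup depend on how similar the rater's situation is to the querier's.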
By introducing the notion of *local credibility*, rather than a simple global credibility for each entity, the system can provide a great deal of flexibility and thus stronger predictive value. Local credibility means that a rating is weighted more strongly if the situation in which that rating was given is similar to the current situation. For instance, a small farmer in Idaho looking into the reputation of chicken vendors cares more about the opinion of a small farmer in Arkansas than he does about the opinion of the Great Chinese Farming Association. Thus the algorithm would generate a score which more accurately reflects the quality of the vendor according to other similar buyers. Similarly, if Google knew more about the person doing the web search, it could provide an even more accurate answer [Press Release for Open Profiling Standard. http://home.netscape.com/newsref/pr/newsrelease411.html] [note that firefly.net returns "no more data is available" -- the company was bought by Microsoft and apparently incorporated into Microsoft Passport. also crushed its plans to incorporate a plugin into Navigator. -dm]

One of the problems with incorporating credibility into the scoring algorithm is that in some domains an individual's ability to perform the protocol honestly is quite separate from his ability to predict the performance of others. In the Free Haven system, this corresponds to a server which is willing to store documents and supply them to readers, but keeps no logs about transactions or trades (so it has no idea which other servers are behaving honestly). In the case of Reputation Technologies, one vendor might be excellent at providing products on time and with high quality, leading to a high reputation score, but that same vendor might have only average skill at assessing other vendors. Similarly, a consulting firm might specialize in predicting the performance of vendors, but not actually sell any products of its own. One way to solve this problem is to keep separate scores for performance and for credibility. This increases the complexity of keeping track of entities and their reputations, but could provide tremendous increases in accuracy and flexibility for scoring systems.

Weighting by credibility is not the only way to improve the accuracy and robustness of the scoring algorithm. Another approach is to make use of the idea that previous transactions similar to the current transaction should carry more weight in predicting an entity's ability to complete the transaction successfully. That is, a vendor's ability to sell high-quality Granny Smith apples should have some bearing on his ability to sell high-quality Red Delicious apples. Of course this could backfire if the vendor in question specializes only in Granny Smith apples and doesn't even sell Red Delicious apples, but in general weighting by the so-called *category* of the transaction (and thus the vendor's reputation in related categories) is a very powerful idea. Separating reputations into categories can act as a defense against some of the subtle shilling attacks described above, such as when a vendor develops a good reputation at selling yo-yos and runs a side business of fraudulently selling used cars. But the category idea doesn't solve everything. In particular, it raises a very difficult question: how do we pick categories? How do we know which categories are related to which other categories, and how related they are?
Can this be automated somehow, or do the correlation coefficients have to be estimated manually? In the case of Free Haven, where there is really only one commodity -- a document -- and servers either behave or they don't, it might be feasible to develop a set of categories manually and allow each server to manually configure the numbers that specify how related each category is to the others. For instance, one category might be files less than 100K which expire within a month. A strongly related category would be files between 100K and 200K which expire within a month -- perhaps we would say that this category is 0.9-related to the first. A mostly unrelated category would be files more than 500 megabytes which expire in 24 months; we might declare that this category is 0.05-related to the first two. With some experience, an algorithm might be developed which can tweak the correlation coefficients on the fly, based on how effective the current values have been at predicting the results of future transactions. Similarly, we might be able to reduce the discrete categories to a single continuous function which converts the `distance' between file size and expiration date into a correlation coefficient.

Reputation Technologies is not so lucky. Within a given exchange, buyers and sellers might barter thousands of different types of goods, each with different qualities and prices; the correlation between any pair of categories might be entirely unclear. To make matters worse, each vendor might have only a few transactions on record, leaving data too sparse for any meaningful comparison.

While we have presented some techniques to provide more accuracy and flexibility in using ratings, we still haven't discussed actual algorithms that can be used to determine scores. The simplest such algorithm involves treating reputations as probabilities -- a reputation is effectively an estimate of the likelihood that a future transaction in that category will be honest. In this case, scores are simply computed as the weighted sum of the ratings. More complex systems can be built out of neural networks or data clustering techniques, to try to come up with ways of applying non-linear fitting and optimizing systems to the field of reputation. Note that as the complexity of the scoring algorithm increases, it becomes more and more difficult for actual users of these systems to understand the implications of a given score or to understand what flaws might be present in the system.

A final note on scoring algorithms concerns the *adversarial approach* to scoring systems. In many statistical or academic approaches, the goal is simply to combine the ratings into as accurate a score as possible; no regard is given to whether participants in the system can conspire to provide ratings which break the particular algorithm used. A concrete example might help to illustrate the gravity of this point. One of the often-referenced pitfalls of applying neural networks comes from the US military, which wanted to teach its computers to identify tanks on the battlefield. They took a series of pictures that included tanks, and a series of pictures that did not. But it turned out that the pictures with the tanks were taken during the night, and the pictures without the tanks were taken during the day. This caused the high-tech neural network to learn not how to identify a tank, but how to distinguish day from night.
The lesson for AI developers is to remember that there may be a number of factors that differ between sets of samples, and the neural network might not learn quite what you want it to learn. But consider the situation from our perspective: what if the Russian military were in charge of providing the tank pictures? Is there a system that can be set up to resist bad data samples? Many would consider learning how to identify a tank under those circumstances impossible. What if the Russians could provide only half of the pictures? Only a tenth? Clearly this is a much more complicated problem. When developing scoring systems, we need to keep in mind that simply applying evaluation techniques intended for a `clean' environment may introduce serious vulnerabilities.

\subsection{Privacy and Information Leaks}

Yet another issue to consider when designing a good scoring system is whether the system will be vulnerable to attacks which attempt to learn about the tendencies or patterns of entities in the system. In a business-oriented domain, knowledge about transaction frequency, transaction volume, or even the existence of a particular transaction might be worth a lot of time and money to competitors. Using a simple and accurate scoring algorithm seems to imply that it should be easy to understand the implication of a vendor's score changing from 8.0 to 9.0 over the course of a day -- perhaps one or more ratings arrived regarding large transactions, and those ratings were very positive. The objectives of providing timeliness and accuracy in the scoring algorithm and of maintaining the privacy of transaction data seem to be at odds. Fortunately, there are a number of ways to help alleviate the leakage problems without affecting accuracy too significantly. We describe some of the more straightforward of these techniques below.

The problem of hiding transaction data for individual transactions is very similar to the problem of hiding source and destination data for messages going through mix networks [D. Chaum, "Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms", Communications of the ACM, vol. 24, no. 2, February 1981]. More specifically, the notion of having a new rating influence the published score by a certain amount is very similar to the notion of tracking a message across a middleman node in a mix network. In both cases, privacy becomes significantly easier as transaction volume increases. In both cases, adversaries observe external aspects of the system (in the case of the scoring system, the change in the score; in the case of the mix network, the messages on the links to and from the mix node) to try to determine the details of some particular message or group of messages (or any message at all).

One common attack against the privacy of a scoring system is a *timing attack*. For instance, the adversary might observe transactions and changes in the scores, and then try to determine the rating values that certain individuals submitted. Alternatively, the adversary might observe changes in the scores and attempt to discover information about the timing or size of transactions. The analog of this attack in the mix environment is an adversary watching the timing of messages going to and from nodes, and trying to determine which incoming message corresponds to which outgoing message. There are a number of solutions that can be very useful.
First of all, introducing extra latency between the time ratings are submitted and the time the new score is published can make timing correlations more difficult. (On the other hand, this might reduce the quality of the system, since scores are not updated immediately.) Another good solution is to queue ratings and process them in bulk; this prevents the adversary from being able to determine which of the ratings in that bulk update had which effect on the score. A variant of this approach is the *pooling* approach, in which some number of ratings is kept in a pool. When a new rating arrives, it is added to the pool, and a rating from the pool is chosen at random and aggregated into the score. Obviously, in both cases a higher transaction volume makes it easier to provide timely score updates.

On the other hand, an active adversary can respond to bulk or pooled updates with what is known as an *identification flooding attack*: he submits ratings with known effect, and watches for changes in the score not due to those ratings. This approach works because he can `flush' the few still-anonymous ratings by submitting enough known ratings to fill the queue. Note that this attack only works if the adversary can produce a significant fraction of the ratings during a given time period. [present good example of an intersection attack [perhaps think of one first]]

But all this concern over privacy may not be relevant at all. In some domains, such as Free Haven, the entire goal of the reputation system is to provide as much information about each pseudonymous server as possible. For instance, being able to figure out how Alice performed on Bob's transaction is always considered to be a good thing. In addition, even if privacy is a concern, the requirement of providing accurate, timely scores may be so much more important that no steps should be taken to increase user privacy.

\section{Decentralizing the scoring system}

While many of the issues in this section have been presented in the context of a centralized reputation scoring and filtering service, most of them are applicable to either centralized or decentralized systems. In a decentralized system such as Free Haven, each server runs the entire reputation-gathering system independently. This requires each node to make do with only the information which it has gathered first-hand, and generally requires a broadcast mechanism in order for all nodes to keep their information databases synchronized. Another approach is to decentralize the scoring system itself, spreading it among the entire set of machines participating in the system. Below we present two ways of decentralizing a scoring system. The first uses redundancy along with user flexibility to reduce the risk from cheating or compromised servers. The second is a more traditional approach to decentralizing a system, but it also brings along the more traditional problems of decentralization, such as high bandwidth requirements and difficult cryptographic problems.

\subsection{Multiple trusted parties}

There is a set $S$ of scorers around the world, each independently run and operated. When a transaction happens, the vendor constructs a set of tickets $T$ for some subset of $S$ that he chooses. Each ticket is a receipt allowing the buyer to rate the vendor at a particular scorer. The receipts are blinded so that the vendor is not able to link a ticket with any given customer. At this point, the buyer can decide to which scorer or scorers he wishes to submit his ratings.
Since each scorer could potentially use its own algorithm and have its own prices or publishing habits, each might offer its own set of tradeoffs among accuracy, privacy, and security. This technique allows the vendor to veto some of the scorers first, after which the rater chooses from among the remaining scorers; thus ratings will only be submitted to mutually agreeable scorers. We could extend this scenario to allow both parties in the transaction to provide tickets to each other, creating a more symmetric rating process. This approach introduces complications: both parties in the transaction need to coordinate and agree on which tickets will be provided before the transaction is completed, and there needs to be some mechanism to enforce the exchange, or publicize the failure, if one side of the transaction fails to provide the promised receipts.

The beauty of decentralizing the scoring system in this manner is that every individual in the system can choose which parts of the system they want to interact with. Participants in transactions can list scorers which they trust to score them fairly, raters can choose scorers which they trust not to leak rating information, and users looking for scores on various entities can choose scorers which they trust to provide accurate scores. Of course, this decentralization process introduces the issue of meta-reputation: how do we determine the reputations of the reputation servers? This sort of reputation issue is not new. Some Mixmaster nodes are more reliable than others [Electronic Frontiers Georgia Remailer Uptime List. http://anon.efga.org/], and users and operators keep uptime and performance lists of various nodes as a public service. We expect that reputation scoring services would similarly gain (external) reputations based on their reliability or speed.

\subsection{True decentralization}

[figure: four columns. The first column has one R. The second column has a vertical line of C's. The third column has a vertical line of S's. The fourth column has a couple of "client"s. The R in the first column has arrows pointing to every C. Each S in the third column has bidirectional arrows pointing to every C (ellipse arrows as necessary). The fourth column has some clients doing queries onto the S's.]

In this scenario, both sides of the transaction obtain blinded receipts as above. Apart from these *Raters*, the system also consists of a set of *Collectors* and a set of *Scorers*. After the transaction, each rater splits his rating into shares using Shamir's secret sharing algorithm or some other k-of-n system, and submits one share to each Collector. This means that the Collectors together could combine the shares to determine his rating, but separately they can learn no information. It is the job of the Scorers to provide useful information to clients: when a client does a reputation query for a specific category (situation), the Scorer does the equivalent of an *encrypted database query* over the set of Collectors [Tal Malkin, Private Information Retrieval and Oblivious Transfer, PhD Thesis, MIT, 1999].

There are a number of technical challenges that need to be solved in order to make this scheme work. First of all, the Collectors need some mechanism for authenticating a rating without reading it. Similarly, they need some way of authorizing a rater to put his shares onto the system without being able to know the author of a given rating.
Without this protection, malicious raters could simply flood the system with data until it overflowed. Once these problems are solved, we need to come up with a computationally feasible and bandwidth-feasible way of communicating between the Scorers and the Collectors, and a set of rules that allows the Scorers to get the information they need to answer a given query without letting them get `too much' information.

With this decentralization come some subtle questions. Can Scorers `accidentally forget' to include a specific rating when they're computing a score? Put another way, is there some way of allowing Scorers to prove that they included a certain rating in the calculation of the score, without publishing the actual ratings that were used? This question is similar to that of allowing mix nodes to prove that they forwarded a given message without yielding information that might help an adversary determine the source or destination of that message [Masayuki Abe, "Universally Verifiable MIX-network with Verification Work Independent of the Number of MIX Servers", EUROCRYPT '98, Springer-Verlag LNCS].
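To make the rater-to-Collector step concrete, here is a small sketch of splitting a single numeric rating into k-of-n Shamir shares, one per Collector; the prime modulus and the parameters are illustrative assumptions, and the authentication, authorization, and private-query machinery discussed above is omitted entirely.

\begin{verbatim}
# Sketch: split one rating into k-of-n Shamir shares, one per Collector.
# The prime modulus and parameters are illustrative assumptions.
import random

PRIME = 2_147_483_647      # comfortably larger than any rating value

def split(secret, k, n):
    """Return n shares; any k of them reconstruct the secret."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

rating = 7                              # the rater's rating for this transaction
shares = split(rating, k=3, n=5)        # one share submitted to each Collector
print(reconstruct(shares[:3]))          # any three Collectors recover 7
print(reconstruct(shares[2:]))          # ... as does any other set of three
\end{verbatim}

Fewer than k Collectors learn nothing about the rating, which is the property the Collector/Scorer split relies on; everything hard about the scheme -- authenticating shares, private queries, and proofs of inclusion -- lies outside this fragment.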