Privacy and XML, Part I
April 17, 2002
The widespread uptake of e-commerce has been stalled as much by the inability of businesses to guarantee that they will honor their customers' privacy preferences for the personal data entrusted to them as by any other single factor. Of those who are connected but do not purchase online -- which is over half of all Internet users -- over half say their reluctance is due to fear that their personal information will be stolen or misused. In a sense, XML, through the smart data transfer it enables, contributes to the problem. However, a number of XML-based efforts are emerging that offer solutions to some of the major technology issues for privacy.
Privacy, in the context of this article, may be understood as the ability of individuals to control the collection, use, and dissemination of personal information that is held by others. Privacy is a major issue these days. Corporations are appointing Chief Privacy Officers (CPOs), and governments around the world are creating legislation which forces companies to satisfy requirements on how they collect, secure, and use customer data.
But businesses have always collected data about their customers. In a trivial sense, without certain information (e.g., a shipping address), a company simply could not do business with its customers. More interesting than this trivial case, however, is how, with the appropriate information, a given service or product can be optimized or tailored to specific customers: the more a company knows, the greater the opportunities for real personalization, potentially benefitting both company and customer.
If businesses collecting information on customers is not new, why then is privacy receiving such attention lately? Reasons include the following:
- Most people's unfamiliarity with the technologies that make up the Web. Users are asked to make decisions and state their preferences on issues in which they have no expertise. For instance, browser cookies, if implemented in a responsible manner, can allow businesses to maintain a relationship with a customer between browser visits, sparing the customer from entering contact information multiple times. However, the vast majority of users perceive cookies as giving the Web business an unacceptable access point into their computers and their lives.
- The interconnectedness of networks enables faster and easier information flow -- both authorized and unauthorized. With more and more customer data being moved online, the opportunities for illicit access increase. Ten years ago, a company might have maintained information on its customers on an off-line mainframe. Now it is likely that the database will be connected to a Web server -- both to customize and simplify the customer's browsing and shopping experience, and to allow customers to self-manage their data. The price to be paid for these advantages is, of course, that a channel now exists for unauthorized access to that data. As an example, a hacker recently penetrated the computer network of a hospital in Seattle and was able to extract files containing information on more than 5000 patients.
In the past, public records were most likely kept on paper or magnetic tape in physical filing cabinets at offices of various levels of government throughout a country. Thus, even though the information may have been freely available (in a legal sense), the realities of the storage medium prevented access to it on a large scale. Internet technology enables the easy distribution of the information and consequently raises people's sensitivity to this access.
- The emergence of mobile technologies. Smart phone use in the United States and Europe is predicted to grow dramatically in the next few years and the technology enables scenarios unimaginable in the past. For instance, it is possible to determine the user's exact location through the signals emitted by mobile devices. While this sort of ability could be of great benefit in an emergency -- cutting ambulance response times or facilitating the location of missing children -- it is also feared in some quarters that this tracking ability, through its linkage of identity and location, could be misused.
- Federated identity. Both Microsoft .NET My Services and the Liberty Alliance are designing architectures for federated authentication and identity -- in which an individual can create an identity at one Web site and use that identity to access the services at another. Federation, with the attendant sharing of user information between Web sites, amplifies concerns about the misuse of that information.
If access to information is part of the problem, it would seem that XML, with its logically identified and structured information objects, would only add fuel to the fire. Imagine how much easier a hacker's 'job' would be if she knew that all banks kept the credit card numbers of their customers in XML documents whose schema specified a <creditCardNumber> element. No longer would hackers need to scan multiple tables in multiple databases; rather, they would simply let loose a "bot" that read every file it came across, looking for the appropriate tags and, once they were found, retrieving the number as well as the card owner and expiry date (these also conveniently captured in their own elements).
Fortunately, XML has more to offer the privacy issue than simply a mechanism to allow hackers to automate their efforts and spend more time dreaming up increasingly elaborate hacks. This two-part article will discuss the existing and potential applications of XML to privacy.
The remainder of this article provides overviews of some of the concepts and issues central to understanding privacy. The second installment, to be published next week, will highlight some of the XML-based initiatives underway to enhance Internet privacy.
Personally Identifiable Information
Personally Identifiable Information (PII) is information that is unique to an individual and, as such, can serve as a locator for that individual, or at least as a way to distinguish that individual from many (or all) others. Examples of personally identifiable information are a social security number, a telephone number, a home address, and possibly an email address. Data such as age, gender, and salary do not uniquely identify the bearer, and so are typically not considered PII. Although such anonymous data is therefore somewhat less of a concern than PII, it becomes very relevant to privacy if it can be linked to PII. As an example, in 1999 DoubleClick planned to combine the anonymous browser click-stream data it collected with the database of PII it acquired through the purchase of Abacus Direct. DoubleClick dropped these plans under pressure from privacy groups and the media.
Opt-in versus Opt-out
Opt-in and opt-out refer to the models by which businesses get approval from consumers for the sharing of their information; they differ in the assumptions they make about the value of the data and what the appropriate default rule for sharing should be.
The "opt-in" model assumes that consumer information has high value, and as such, consumers should be given an explicit choice for approval as each opportunity to share their data arises. With opt-in, the default is not to share consumer information, consistent with the assumption of a high value for the information. If a consumer is willing to share her information, she must affirmatively "opt in".
The "opt-out" model places less value on the consumer information; it assumes that information is insensitive and can be shared unless a consumer explicitly requests otherwise. The default operation in this model is to share information. If a consumer is not willing to share his information, he must affirmatively "opt out".
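The difference between the two models comes down to what happens when the consumer has expressed no preference. The following is a minimal sketch of that default rule; the names `ConsentModel` and `may_share` are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Optional


class ConsentModel(Enum):
    """The two approval models described above (names are illustrative)."""
    OPT_IN = "opt-in"    # default: do not share
    OPT_OUT = "opt-out"  # default: share


def may_share(model: ConsentModel, explicit_choice: Optional[bool]) -> bool:
    """Decide whether a consumer's information may be shared.

    explicit_choice is None when the consumer has stated no preference,
    True when the consumer has agreed to sharing, and False when the
    consumer has refused it.
    """
    if explicit_choice is not None:
        # An explicit statement from the consumer always wins.
        return explicit_choice
    # No statement: fall back to the model's default rule.
    return model is ConsentModel.OPT_OUT
```

With no stated preference, `may_share(ConsentModel.OPT_IN, None)` evaluates to `False` and `may_share(ConsentModel.OPT_OUT, None)` evaluates to `True`; an explicit choice overrides either default.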
Another crucial consideration for companies is transparency or consumer accessibility to data collected. The trend, motivated both by legislation and by the desire to maintain a friendly and trusting relationship with consumers, is toward allowing consumers on-line access to their own data. Significantly opening up on-line access to data inevitably raises security issues, since it may increase the risk of unauthorized access by third parties to an individual's personal information.
If adequate steps are in place to authenticate a person's request to view their information, such as a user name and password or other techniques, this mechanism can benefit both sides. The consumer is reassured as to the nature of the information maintained about them and the openness of their relationship with the business; furthermore, the company minimizes its costs by placing some of the onus of keeping information up-to-date on the consumer.
Exposure and Disclosure
The concepts of "exposure" and "disclosure" are distinct, but both are related to privacy. Exposure has to do with identity: am I willing to reveal who I am to one or more other entities within the context of this transaction? Disclosure has to do with other information about me: am I willing to reveal this personal or sensitive information to other entities for some particular purpose? It is sometimes argued that these two concepts collapse into one if "identity" is simply considered to be one type of personal information that may be disclosed. However, in many environments it is useful to keep the concepts separate because an authentication step (which may expose identity) occurs prior to the remainder of any transaction or set of transactions that discloses additional information.
Exposure may be further categorized into techniques providing anonymity, pseudonymity, or veronymity.
- Anonymity ("no name") refers to the use of no name whatsoever or to the use of a name that was never used before a given transaction and will never be used again. The defining property of anonymity is that no linkage is possible between this transaction and the actual, real-life entity performing the transaction, and no linkage is possible between two different transactions (i.e., it cannot be known that they were both performed by the same actual, real-life entity).
- Pseudonymity ("false name") refers to the use of a particular name for multiple transactions, but that name is different from the identity of the actual, real-life entity performing the transactions. The defining property of pseudonymity is that no explicit linkage is given between this transaction and the actual, real-life entity performing the transaction, but a linkage is possible between different transactions. In this way, a server can know that the same entity is visiting again (and personalize accordingly), but it does not know who this entity actually is. (Care must always be taken with pseudonymity, however, because multiple transactions all known to be performed by the same entity can sometimes allow an observer to derive clues about the actual identity, thereby weakening the property of pseudonymity.)
- Finally, veronymity ("true name") refers to the use of the actual, real-life identity of the entity performing the transaction within the transaction context. Linkage both from a given transaction to the actual identity, and between two transactions performed by the same identity, is obviously possible.
Therefore, identity information may be not exposed at all (in anonymous transactions), may be partially exposed (in pseudonymous transactions), or may be fully exposed (in veronymous transactions).
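The three levels of exposure can be contrasted by looking at the name an entity presents in each transaction. The sketch below is an illustration only: the `Exposure` enum, the `presented_name` function, and the use of a random token as a one-time name are assumptions made for the example, and it ignores other linkage channels (such as network addresses) that a real system would have to consider.

```python
import uuid
from enum import Enum


class Exposure(Enum):
    ANONYMOUS = "anonymous"
    PSEUDONYMOUS = "pseudonymous"
    VERONYMOUS = "veronymous"


def presented_name(mode: Exposure, real_name: str, pseudonym: str) -> str:
    """Return the name an entity presents for a single transaction."""
    if mode is Exposure.ANONYMOUS:
        # A fresh one-time name: never used before, never used again.
        return uuid.uuid4().hex
    if mode is Exposure.PSEUDONYMOUS:
        # The same false name for every transaction.
        return pseudonym
    # Veronymous: the actual, real-life identity.
    return real_name


def linkable(name_a: str, name_b: str) -> bool:
    """Two transactions can be linked if they present the same name."""
    return name_a == name_b
```

Two anonymous transactions present different names and so cannot be linked to each other; two pseudonymous transactions can be linked to each other but not, by name alone, to the real identity; veronymous transactions can be linked both ways.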
There are three categories of use to which collected information may be put.
- Approved Intended uses. These are uses for which the company has notified the customer and received approval. An example might be collecting and storing a customer's shipping information to streamline future purchasing.
- Non-Approved Intended uses. These are uses for which the company has either not notified the customer or has notified the customer but has not received approval. The intent is on the company's side, not the consumer's (i.e., the company intends to use the data in a particular way but the consumer has not (yet) given explicit approval for this use). An example would be selling a customer's purchasing history to another company.
- Unintended uses. These are uses that neither the company nor the customer anticipated or approved. An example would be a hacker gaining access to a back-end database of credit card numbers and posting them to the Web.
The goal of most privacy legislation and technology is to protect consumers by allowing them access to a company's list of "non-approved intended" uses so that informed choices can be made. Implicit, as well, is a recognition that protection against "unintended uses" must be provided.
Security often encompasses such concepts as confidentiality, authorization, authentication, and non-repudiation; each of these is relevant in some way to privacy.
Confidentiality refers to keeping sensitive information secret and protected from inappropriate viewing. Privacy requires that the confidentiality of user information is protected both in transit and in storage.
Authorization refers to the process of determining what an individual or business entity is allowed to do; for instance, a user may allow one company only to view their online calendar while another is authorized to write to it.
Non-repudiation refers to mechanisms that prevent individuals and business entities from denying an action of theirs. Such functionality is relevant to privacy because it would prevent a business from denying that it made a claim for how user data would be used if the business was later found to have broken this policy.
Authentication refers to proving that individuals or businesses are indeed who they claim to be.
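The calendar example given for authorization can be sketched as a simple lookup of per-requester permissions. This is a minimal illustration, not a description of any real service: the requester names, the `GRANTS` table, and the `is_authorized` function are all hypothetical, and the sketch assumes the requester has already been authenticated.

```python
from typing import Dict, Set

# Hypothetical grants made by one user: requester -> actions permitted
# on that user's online calendar.
GRANTS: Dict[str, Set[str]] = {
    "travel-agency.example": {"view"},
    "assistant.example": {"view", "write"},
}


def is_authorized(grants: Dict[str, Set[str]], requester: str, action: str) -> bool:
    """Return True only if the user has granted this action to this requester.

    Unknown requesters have no permissions at all (deny by default).
    """
    return action in grants.get(requester, set())
```

Here `travel-agency.example` may view the calendar but not write to it, `assistant.example` may do both, and any requester absent from the table is refused.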
Currently, a Web user will likely maintain separate collections of their personal information with multiple businesses, with resulting duplication and administrative burden. For example, they will likely have provided their shipping address to every Web site from which they ever made a purchase. Privacy will become even more of an issue in the future as these existing islands of customer information are connected to each other to create a virtual whole (as in Microsoft's .NET My Services initiatives and the evolving Liberty Alliance).
The power of such aggregation is obvious, from the perhaps mundane scenario of auto form-filling to new and exciting scenarios of applications providing a holistic experience for a user (e.g., an online grocery service that is able to access a filtered view of the user's agenda to determine when is the best time to deliver their order). This sort of concentration of data, either physical or virtual, has obvious implications for privacy. If nothing else, it would seem to present an incredibly attractive target for hackers wishing to concentrate their efforts where there is the greatest potential for reward.
Privacy of user information in this information sharing model requires:
- Protected data storage
- Authentication and authorization of requesting applications
- Confidentiality of transmitted data
Another privacy aspect of the model described above, quite separate from the issue of controlling access to the user's personal information stored in the information repository, is that the authentication service, through its central role in the authentication process, will have access to a vast store of click-stream data: the record of sites a user visits. Such data could enable powerfully targeted marketing. For instance, if a user were seen to visit the Web sites of high-end furniture and antique stores, then a displayed banner ad for cigars would presumably enjoy greater success with this user than with a member of the public chosen at random. Privacy experts have expressed concerns about a single corporation (Microsoft or any other) playing such a central role in e-commerce transactions. Microsoft has promised that it will neither use Passport data in this way itself nor sell it to others. An organization wishing to participate in a Liberty community will necessarily make the same commitment.
.NET My Services will make the user's information available through a published XML API; Microsoft is calling this the "XML Message Interfaces" (XMI). XMI will simplify for application developers both the retrieval of this information and its integration into their applications (browser-based and non-browser-based). The initial .NET My Services roll-out will include core services like .NET Profile (nicknames, picture, etc.) and .NET Calendar (time and task management), each of which will have an appropriately defined XML Schema. The following shows an example of the stored XML.
<c:contact xmlns:c="http://schemas.microsoft.com/hs/2002/10/myContacts"
           xmlns:p="http://schemas.microsoft.com/hs/2002/10/myProfile">
  <c:firstName xml:lang="en-us">Bill G.</c:firstName>
  <c:lastName xml:lang="en-us">Ates</c:lastName>
  <c:emailAddress>
    <p:address>firstname.lastname@example.org</p:address>
  </c:emailAddress>
</c:contact>
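To illustrate how an application might consume such a document, the following sketch reads a well-formed copy of the contact example with Python's standard `xml.etree.ElementTree` module. It is not the .NET My Services API: the `read_contact` function and the dictionary keys are assumptions made for the example, and only the element names and namespace URIs come from the sample above.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the contact example.
CONTACT_XML = """<c:contact xmlns:c="http://schemas.microsoft.com/hs/2002/10/myContacts"
           xmlns:p="http://schemas.microsoft.com/hs/2002/10/myProfile">
  <c:firstName xml:lang="en-us">Bill G.</c:firstName>
  <c:lastName xml:lang="en-us">Ates</c:lastName>
  <c:emailAddress>
    <p:address>firstname.lastname@example.org</p:address>
  </c:emailAddress>
</c:contact>"""

# Prefix-to-URI map used when searching; the prefixes here are local to
# this code and need not match those used in the document.
NS = {
    "c": "http://schemas.microsoft.com/hs/2002/10/myContacts",
    "p": "http://schemas.microsoft.com/hs/2002/10/myProfile",
}


def read_contact(xml_text: str) -> dict:
    """Extract the name and email address from a contact document."""
    root = ET.fromstring(xml_text)
    return {
        "first_name": root.findtext("c:firstName", namespaces=NS),
        "last_name": root.findtext("c:lastName", namespaces=NS),
        "email": root.findtext("c:emailAddress/p:address", namespaces=NS),
    }
```

Because every field sits in its own namespaced element, a few path lookups are enough to retrieve it, which is the same property that makes structured data convenient for legitimate applications and attackers alike.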
Passport, .NET My Services, the Liberty Alliance, and similar architectures built around Web protocols and services have heightened public awareness of privacy issues with regard to the Internet. The next article will highlight some XML-based efforts to enhance Internet privacy.