How does identity data relate to transactional and other kinds of data?

Is it non-overlapping? The same, or a subset? Is there overlap; if so, where, and under which circumstances?

These questions are at the heart of the thought process that needs to go into designing identity technologies for the era of pervasive identity. For example, if the answer were "non-overlapping", then we could merrily go ahead and design identity systems on the green field (at least with respect to non-identity systems), not worrying about what is in, say, the customer database. If the answer were "the same, or a subset", then we’d better not design any new identity systems, but instead devise methods by which existing systems (such as transactional systems and other systems that were not built for identity per se) can be recruited to also meet the requirements of identity.

I’ve found that when I ask that question, which I occasionally do, the answers I get from people in the community often differ dramatically depending on how much of an "enterprise directory" background the person has. (If you think about that for a minute, it makes some sense.) To exaggerate, and not being entirely fair here, that opinion sometimes seems to be "what’s in the directory is the identity information, and all other systems aren’t directories, so they don’t have identity information", with the conclusion of "little or no overlap". If that sounds like a self-fulfilling prophecy to you, it certainly does to me.

In my experience, however, this traditional view is increasingly inconsistent with what is happening, in terms of user behavior, in terms of the shift of power and control from centralized organizations with a firm wall around them to individuals, and in terms of the new technologies and services that are springing up in response.

For example, the other day, I needed to get in touch with somebody whose phone number I had lost, or never had in the first place. I found it by Googling his name, finding his blog, and doing a whois lookup on the domain name of his blog. It could have been through reverse phone number lookup by address at Google. Or by going through LinkedIn. If we had happened to work for the same company with a well-maintained directory, I could have gotten the number from there; but we didn’t.

Of course, this use case is not an enterprise use case. But that is the whole point about pervasive, independent identity! It isn’t tied to any one organization or central repository of identity information. It is the non-enterprise use cases, the "open internet" use cases of identity technology, that need to be addressed today, because increasingly the people we interact with and relate to are outside the confines of our own organization; certainly outside of 9-5, which isn’t what it used to be, either. There is also the convenience factor: Google is a lot closer to the fingertips of a lot of people than the enterprise directory application.

Note also that what is one person’s identity data is somebody else’s transaction data. We certainly don’t run MyLID.net (our hosted LID, OpenID, Yadis identity service) on top of a directory, and I’m pretty sure that is also true for many social web applications. Netflix’s social network functionality has a lot of identity-related data in it, but they probably (conjecture on my part, I don’t know) store it in the same database that all their other information is in: maintaining the relationships between people and their purchases would be rather difficult if one introduced the usual impedance mismatch between a directory and a database; the benefits would be rather marginal, and Netflix does use transaction data as a form of identity data in any case …

The conclusion: a separation between these different kinds of data, and their allocation to different kinds of information systems with strict boundaries between them, might have made sense in the past, and within a tightly structured IT environment (and even then, show me the enterprise application that does not have at least a bit of identity information in it). Today, on the open web, with social software being one of the primary areas of innovation, this separation is increasingly anachronistic if it is performed for the purpose of "separating identity data from transaction data". (There are other good reasons to separate data, such as differing performance profiles. But conceptually, we should be thinking about one tightly cross-referenced set of information, even if we decide that data item A should rather sit in system B than C because it’s faster, or cheaper, or …).

What we need, in the end, is an approach that treats the entire web and enterprise IT infrastructure, warts and all, as one giant, distributed, decentralized meta-directory (or meta-database, or …) that has parts optimized for different requirements, but that can be accessed uniformly, so that application development "native to the web" is possible. Identity data elements are a subset of all of that information, and tightly related to other data elements, both identity and not. And that way, we don’t even need to draw an artificial line around whether information item X (say, somebody’s presence or transaction record on eBay) is or isn’t a piece of identity information.