EZID: A Tool for Creation and Long-Term Management of Identifiers

March 30, 2011 Author: Joan Starr

Tags: DataCite

Recently our EZID (http://n2t.net/ezid) service was reviewed on four points:

1. EZID represents a single point of failure and, as such, is vulnerable.
2. URLs are “a globally deployed identifier scheme” that have endured and provide the functionality cultural heritage institutions need.
3. These same institutions are better off managing their own namespaces than working with other organizations.
4. We, CDL, should not attempt to achieve sustainability for our services by charging for them.

We’re happy to address each one of these points in turn.

Point 1.

“I have some serious concerns about a group of cultural institutions relying on a single service like EZID for managing their identifier namespaces. It seems too much like a single point of failure. I wonder, are there any plans to make the software available, and to allow multiple EZID servers to operate as peers?”

While every web-based service that invites access via a single URL suffers from a single-point-of-failure vulnerability, whether it be gmail.com, handle.net, dx.doi.org, purl.org, loc.gov, or n2t.net, three important considerations are (a) service replication, (b) delegation of responsibility, and (c) consequences of downtime.

Taking these in reverse order, let us first observe that users experience EZID via two major components: an identifier manager, accessible at n2t.net/ezid for entering and updating metadata, and a set of identifier resolvers, primarily n2t.net but also other external resolvers such as dx.doi.org. It may help to use EZID as shorthand to refer to the manager and N2T to refer to the primary resolver. The consequence of manager downtime is that a change to identifier metadata will be delayed until service is restored. It’s not usually serious in the short term if, for example, a typo or a broken link remains uncorrected for several hours. The EZID manager is independent of the N2T resolver (and other external resolvers) in the sense that one service can be down without affecting the other.

Long-term availability of the EZID manager interface is shored up by its essential simplicity (it’s just a metadata entry system, after all), but it also has a fully documented API and a pure open-source technology base (Apache server, Django, NOID, BerkeleyDB). In these challenging budgetary times, no organization is immune to questions about its long-term capacity to deliver service. In the worst case, if we could no longer support the service, we are confident that we could find a home for it on one of the campuses of the University of California (UC) or with a partner in one of several consortia to which we belong (DataCite, DataONE, etc.). As subsidies disappear, mechanisms to recover operational costs are beginning to appear, even in libraries whose budgets some believed would always be protected.

The consequence of N2T resolver downtime is potentially much more serious, in both the short term and the long term, as every link relying on n2t.net as a starting point will stop working unless a functioning service replica is available. Resolution only starts with a top-level resolver, of course, and is not complete unless the target service is correctly recorded at the resolver and is itself available. Delegation of this critical delivery responsibility to autonomous peer object servers and secondary resolvers is a key strategy common to all resolution systems that we know of, including DNS (URL), Handle, and N2T. In a sense, EZID is delegating resolution through its use of multiple external resolvers.

Delegation for identifiers that last longer than organizations, however, is not as simple as for DNS resolution, which need only be concerned with delegating current host information for current organizations. Resolution as designed for the URN and Handle/DOI schemes uses a flawed model in which the set of objects named by one organization, say 80 years ago, is required to reside together today in one successor archiving organization; if ever such a set is split up, secondary ad hoc resolvers falling outside the model must be set up and operated on behalf of any other organizations holding objects from the original set. N2T overcomes this with a combination of broad class-based redirection for entire identifier schemes (e.g., PMID), subsets of schemes (e.g., URN), and efficient individual object redirection. In support of our claim that this approach will scale for the foreseeable future, we argue that long-term identifiers will be a vanishingly small percentage of URLs on the web, given that persistence is widely acknowledged to be a matter of long-term service commitment and that few organizations are suited to it. Unfortunately, this still comes as a surprise and a disappointment to those who cherish the early-1990s dream of a technology that would magically make “persistent identifiers”.
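To make the combination of individual and class-based redirection concrete, here is a minimal sketch in Python. It is not N2T's actual code or data: the two tables, the example.org targets, and the `resolve` function are all hypothetical, chosen only to illustrate the lookup order (a specific object first, then the most specific matching scheme rule).

```python
# Hypothetical redirect tables; the real N2T rules are not described in this post.
OBJECT_TARGETS = {
    # Individual object redirection: one identifier, one target.
    "ark:/12345/fk1234": "https://archive.example.org/objects/fk1234",
}

SCHEME_TARGETS = {
    # Class-based redirection: one rule covers a whole scheme or a subset of one.
    "pmid": "https://pubmed.example.org/{id}",
    "urn": "https://urn.example.org/{id}",
    "urn:isbn": "https://books.example.org/isbn/{id}",
}


def resolve(identifier):
    """Return a redirect target for the identifier, or None if no rule applies."""
    # 1. An individually registered object takes priority.
    if identifier in OBJECT_TARGETS:
        return OBJECT_TARGETS[identifier]

    # 2. Otherwise use the longest (most specific) matching scheme prefix.
    lowered = identifier.lower()
    for prefix in sorted(SCHEME_TARGETS, key=len, reverse=True):
        if lowered.startswith(prefix + ":"):
            local_id = identifier[len(prefix) + 1:]
            return SCHEME_TARGETS[prefix].format(id=local_id)

    return None
```

In this sketch, a set of objects that has been split among several holders needs no secondary resolver: each relocated object simply gets its own entry in the object table, while everything else continues to follow the scheme-wide rule.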

What no resolution system has been able to do is to delegate top-level resolution (the starting step) to autonomous peers, as this would prevent a coherent unified view of the resolution namespace. Not surprisingly, replication (mirroring among non-autonomous peers) of top-level resolvers is the key high-availability strategy common to DNS (URL), Handle, and N2T resolution.

An instructive outlier is the ARK identifier scheme, which formally embeds a globally unique identifier in a URL that permits peer providers to establish their own top-level resolution starting points. The NOID software, which we have made available open-source since 2004, is currently used by a number of cultural heritage organizations to provide their own autonomous identifier resolver and management services for ARKs as well as other identifiers. Nothing prevents others from using this open-source technology, just as EZID and N2T do.
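A small sketch may help show why an embedded identifier makes the hostname replaceable. It assumes the usual ARK form, `ark:/NAAN/Name`, appearing somewhere in the URL; the regular expression is a simplification (it ignores qualifiers and inflections beyond the name), and the function names and example hostnames are ours, invented for illustration.

```python
import re

# Simplified pattern: "ark:/", a numeric name assigning authority number (NAAN),
# then a name running up to whitespace, "?" or "#".
ARK_PATTERN = re.compile(r"ark:/(\d+)/([^\s?#]+)")


def extract_ark(url):
    """Return the hostname-independent ARK embedded in a URL, or None."""
    match = ARK_PATTERN.search(url)
    if match is None:
        return None
    naan, name = match.groups()
    return f"ark:/{naan}/{name}"


def rehost(url, resolver_base):
    """Rebuild an actionable URL for the same ARK at a different resolver."""
    ark = extract_ark(url)
    if ark is None:
        return None
    return resolver_base.rstrip("/") + "/" + ark
```

For example, `rehost("http://old.example.org/ark:/12345/fk1234", "http://n2t.net/")` yields a URL for the same ARK starting from n2t.net, which is the sense in which any peer provider can serve as a resolution starting point.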

Our users request more than just ARKs, and even for ARKs and URLs there is demand for single-starting-point DNS resolution (cf. purl.org) to protect more vulnerable organizations from hostname instability, so we offer just that from N2T. Given the pressure this puts on high availability, we take N2T resolver replication very seriously, as described in presentations (e.g., Low-Risk Persistent Identification: the “Entity” (N2T) Resolver (iPRES 2006); Supporting Persistent Citation (webcast, Dec 2006)) and on the n2t.info website, and as piloted with partners in New York, Germany, and Australia. This is a key motivation behind current replication discussions with the San Diego Supercomputer Center and fellow DataCite partner, Purdue University. The details are yet to be worked out, and we’ll share them as soon as they are available. As before, we feel confident that our role as a particular replication node could be replaced by a UC campus or by one of our consortial partners. Researchers and organizations that work with us can take advantage of our partnerships, which allow us to advocate on behalf of our patrons for enhanced services at the network level.

Point 2.

“…cultural heritage institutions should make every effort to work with the grain of the Web, and taking URLs seriously is a big part of that. I’d like to see more guidance for cultural heritage institutions on effective use of URLs, what Tim Berners-Lee has called Cool URIs, …”

With regard to URLs, here too, we are in complete agreement. While we need to respond to demand for and legacy use of other identifiers, we believe that URLs and certain classes of URLs (e.g., ARKs) provide the best return on investment. Supporting the creation and maintenance of “Cool URIs” is precisely what we intend when we offer URL support within EZID. There is no reason why URLs should not enjoy the benefits of attached metadata, redirection targets, commitment statements, etc. Nor does EZID’s support for URLs prevent others from deploying services to support URIs.

Point 3.

“Instead of investing time & energy getting your institution to use a service like EZID, I think most cultural heritage institutions would be better off thinking about how they manage their URL namespaces, and making resource metadata available at those URLs.”

This comment expresses one of the assumptions behind the original decision to release the NOID identifier creation, management, and resolution open-source software in 2004. What we’ve learned since then is that this software enabled a few of the more technically adept organizations, but that many preferred not to host identifier services themselves. Besides the technical challenges, there were also administrative challenges related to long-term record-keeping, sub-namespace delegation, database recovery, etc. While a few cultural heritage organizations may be suited to the challenges, in the current economic climate, recommending that individual institutions go it alone seems dubious advice, and perhaps a formula for a thousand single points of failure. As Tyler Walters and Katherine Skinner say in the Executive Summary to the recent ARL report, New Roles for New Times: Digital Curation for Preservation,

“We assert that the strongest future for research libraries is one in which multi-institutional collaborations achieve evolvable cyberinfrastructures and services for digital curation. The alternative, a “go it alone” strategy, will only lead to dangerous isolation for practitioners, yielding idiosyncratic, expensive, and ultimately unsustainable infrastructures.”

EZID and N2T provide a necessary service not only to organizations that lack other means to manage their identifiers, but also to individuals. This service is also essential for organizations and federations (e.g., the NSF DataONE network) with a wide range of legacy identifier types that would otherwise require a wide range of identifier infrastructure components.

Finally, we agree strongly with the idea of organizations “making resource metadata available at those URLs.” We do not see EZID as a master copy of metadata, but as a secondary or cached copy. The utility of this copy is to provide uniform and fast lookup of citation metadata drawn from a wide variety of sources that have no API or non-uniform APIs.

Point 4.

And so let’s face this question of cost-effectiveness and cost in general. Much has been written in the last several years about ROI and sustainability at libraries in general and digital libraries in particular. Funders ask for these plans, and, increasingly, our home institutions ask for them as well. At California Digital Library, we are facing this question head on, and we are moving forward. This is new for our community, and sometimes it is uncomfortable. Let’s be honest: it doesn’t look like a pretty picture. It looks like a spreadsheet with dollars and cents on it.

Our approach is to be open and clear and share the information we have so that we can all learn how to do this together. We believe in this new tool, EZID (http://n2t.net/ezid). We think we are making a valuable contribution to data management, and we will try hard to recover our costs. We want to be straightforward about that.

Please take a look at this tool. You can try it out without a login by going to the Help tab, where you’ll see that you can make test DOIs and ARKs. Let us know if you have any questions.