Web Archives: Information on Recent Technical Difficulties

From: Tracy Seneca – Web Archiving Service Manager, California Digital Library

UPDATE: Wednesday, February 21, 2012

The Web Archiving Service (http://was.cdlib.org/) developers have completed their work to improve the performance of the public WAS archives.  Developers are now resuming work on long-term improvements to search speed and capability.

If you have any questions or if you encounter any issues, please contact washelp@ucop.edu.


UPDATE: Friday, February 17, 2012

The Web Archiving Service (http://was.cdlib.org/) developers have been working to improve the response time for search and display of the public archives.  Those of you who encountered difficulty using the archives during recent weeks should see improved response time.  Additional work is underway to further improve archive performance.

The California Digital Library will be closed on Monday, February 20th in observance of Presidents’ Day, but we will be monitoring the service and our help desk during that time.

Again, we apologize for the recent problems with archive performance.  If you have any questions or if you encounter any issues, please contact washelp@ucop.edu.


Those of you who use the public web archives created with CDL’s Web Archiving Service (WAS, http://was.cdlib.org) will have noticed significant delays when searching and browsing. CDL’s WAS support team is working to improve the performance of the archives in the short term, and will soon migrate to a new indexing system to improve it dramatically.

We have been communicating with the archivists who build the web archives about our work on this issue, but we also want to communicate with anyone in the general public who may be searching those archives.

The issue:
Web archives involve working with data at a very large scale. When first released in 2009, the Web Archiving Service offered approximately 4 terabytes of content across all of the public archives. Since then, the archives have grown to 20.8 terabytes, and in recent weeks that growth has begun to affect their performance. The current delays are the result of the difficulty that our old indexing system (based on Nutch) has with this larger scale of data.

For the short term: we are taking immediate steps to improve performance ahead of the more substantial upgrade to the service.

For the long term: we are nearing completion of a new indexing system (based on Solr) that will:

  • address the greater scale of our current archives,
  • enable future scaling of the archives, and
  • provide greater search control.

Important:

  • This issue does not in any way affect the content itself; it only impacts the speed of searching the archives.
  • This issue does not affect the curatorial features of the service; archivists and librarians should encounter no difficulty in archiving new materials.

Next steps:

The WAS developers are evaluating a range of solutions to improve the performance of the archives in the immediate term. We expect this work to continue into the week of February 20th. These improvements may require us to temporarily disable certain features, such as the display of highlighted search terms in search results. We will attempt to keep all features available, but our top priority is for the archives to return results in a timely manner. We will keep our Web Archiving Service partners informed of our progress.

We sincerely apologize for this temporary problem with the performance of the public archives.  

Please contact washelp@ucop.edu if you have any questions.