Next versions of wget will support WARC

Alard has implemented an option to produce WARC output in wget:

I (Alard) have been working on a way to have Wget write its results to a WARC (Web ARChive file format) file, just like Heritrix and other archiving tools. With the WARC format, it’s possible to save both the request and the response headers. It also provides a clean way to store redirects and 404 responses.

Gijs van Tulder has proposed merging these changes into the main wget code base:

I’d like to propose a new feature that allows Wget to make WARC files. Perhaps you're already familiar with it, but in short: WARC is a file format for web archives. In a single WARC file, you can store every file of the website, plus the HTTP request and response headers and other metadata. This makes it a very useful format for web archivists: you keep everything together, in the most detailed and original form. The WARC format (an ISO standard, ISO 28500) has been developed by the International Internet Preservation Consortium, which includes the Internet Archive and many national libraries. It is supposed to become *the* standard file format for web archives. For example, it is used in the Internet Archive's Wayback Machine and its Heritrix crawler. There are several projects building tools to work with WARC files.

The proposal has been accepted by Giuseppe Scrivano, the current wget maintainer:

Sure we do!

And the integration seems to be progressing as far as we can tell from the bug-wget mailing list.

Wget is handy and widespread, and this will make it much easier to create quality archives for pretty much everyone.
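For the curious, producing such an archive with a WARC-enabled wget could look roughly like the command below (the option names are those from Alard’s patches and might still change before the final release):

    # Fetch a page with all the resources needed to display it, rewrite the
    # links so the local copy is browsable, and record the original HTTP
    # exchanges in example.warc.gz.
    wget --page-requisites --convert-links --warc-file=example \
         http://example.com/some-page.html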

Wget is also what the Owark WordPress plugin currently uses, so this is obviously good news for the project.

That doesn’t really impact the plan, discussed in a recent blog post, to create a web-service-based option for Owark, but it will strengthen the “local archiving” option in the long term (it will take a few years before a new version of wget is widespread) and provide an interesting alternative for producing archives for the web service.

What’s next for Owark

The current version of the WordPress plugin relies on the Broken Link Checker to find new links and check their status, and on wget to create its archives, but all these operations are done on the server that hosts the blog.

This is working fine for the few blogs I administer, but this architecture has a couple of drawbacks:

  • It’s not always easy to get a recent version of wget installed on the server (in fact, the most common installation issues with Owark are related to wget).
  • Visitors have to trust the website that hosts the archives, i.e. trust that the archived pages are faithful copies of the original pages.

You can imagine many motivations for forging archives (to replace or add ads, to back up a statement you’ve made on your site, …) and I think that this issue of trust could become really nasty if the idea of creating private archives got some traction.

Thinking about these issues, I took a step back and made a list of the functions involved in an Owark node:

  • Link management (maintains a list of outbound links and their status)
  • Archiving (creates an archive for each outbound link)
  • Archive hosting
  • Link replacement (replacing links to broken pages with links to local archives)
  • Blacklist management

When you think about it, the only one of these functions that absolutely must be kept on the web server to ensure long-term preservation through the LOCKSS principle (“Lots Of Copies Keep Stuff Safe”) is archive hosting.

The archiving itself could be performed elsewhere on the net, and there are several advantages to doing so:

  • Among these five basic functions, archiving is the most technically challenging one. To archive a web page, you need to retrieve not only the page itself but also all the resources needed to display it, and then update the page and any resources containing links so that they point to the archived versions rather than the originals. To deal with this complexity, it makes sense to rely on dedicated web services rather than implementing it on each server.
  • If the archiving were done by well-known “archive makers” (Owark could be one, but also archive.org or national archive organizations), these archivers could sign their archives to certify that they are faithful to the original resources.

Coincidentally, this would solve the two drawbacks I mentioned above… And this is definitely the direction I’d like to take the project:

  • I have started to implement a web service to create web archives.
  • The next versions of the WordPress plugin will be able to choose between this web service and wget to create their archives.

To design this web service, we need to define an API, and to define this API, we need to choose a format for exchanging archives.

If I had to propose such a format from scratch, I would probably go for a jar archive containing all the resources (both original and converted), the HTTP headers, and a manifest file with the relevant metadata.

However, that would look like reinventing the wheel: archivists have been working on the subject for years and have defined WARC, an ISO format for storing web archives.

WARC has the benefit of being pretty simple: it’s basically a single file aggregating HTTP requests and responses (including their payloads), with additional headers for metadata.
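As an illustration, a single response record in a WARC file looks roughly like this (all values are made up):

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://example.com/index.html
    WARC-Date: 2011-12-05T10:15:30Z
    WARC-Record-ID: <urn:uuid:3c9f3b1e-0c2a-4d58-9f2e-8e1a2b3c4d5e>
    Content-Type: application/http; msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    Content-Type: text/html

    <html>... the original page ...</html>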

Of course, using an ISO standard is useful for interoperability, and it makes a lot of sense to use that format to store the HTTP exchanges involved in creating an archive. These raw exchanges are useful to keep for legal reasons, in case you need to show what led to an archive. They are also useful if, for any reason, you need to reapply the conversion that is required to visualize the archive without using remote resources.

In addition to HTTP exchanges, WARC also includes “conversion” records to store the results of such conversions.
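A conversion record reuses the same structure and points back to the record it was derived from, for instance (again, purely illustrative):

    WARC/1.0
    WARC-Type: conversion
    WARC-Target-URI: http://example.com/index.html
    WARC-Date: 2011-12-05T10:16:00Z
    WARC-Record-ID: <urn:uuid:9a7d4c2f-1b3e-4f60-8a2d-5c6e7f8a9b0c>
    WARC-Refers-To: <urn:uuid:3c9f3b1e-0c2a-4d58-9f2e-8e1a2b3c4d5e>
    Content-Type: text/html
    Content-Length: 987

    <html>... the page with its links rewritten to the archived resources ...</html>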

However, WARC files cannot be displayed directly in a browser, and the job required of a web server to send conversion records from a WARC archive to a browser as “normal” resources is non-trivial (more complex than serving static files, in any case).
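To give an idea of the work involved, here is a minimal (and naive) Python sketch that scans an uncompressed WARC file for the conversion record of a given URL; a real server would need an index, gzip support, proper header parsing and streaming, which is exactly why this is more work than serving static files:

    # Naive WARC lookup: walk the records of an uncompressed WARC file and
    # return the payload of the "conversion" record for the requested URL.
    # Record framing per ISO 28500: header lines, a blank line, Content-Length
    # bytes of content, then two CRLFs.
    def find_conversion_record(warc_path, target_uri):
        with open(warc_path, 'rb') as f:
            while True:
                version = f.readline()
                if not version:
                    return None                    # end of file, nothing found
                if not version.startswith(b'WARC/'):
                    continue                       # skip blank lines between records
                headers = {}
                while True:
                    line = f.readline().rstrip(b'\r\n')
                    if not line:
                        break                      # blank line ends the header block
                    name, _, value = line.partition(b':')
                    headers[name.strip().lower()] = value.strip()
                content = f.read(int(headers[b'content-length']))
                f.read(4)                          # consume the trailing CRLF CRLF
                if (headers.get(b'warc-type') == b'conversion'
                        and headers.get(b'warc-target-uri') == target_uri.encode()):
                    return content

    # e.g. payload = find_conversion_record('archive.warc', 'http://example.com/index.html')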

What I currently have in mind is thus a mixed approach: a jar archive containing the following (a possible layout is sketched after this list):

  • A WARC archive for the original resources.
  • A directory containing all the resources needed to render the page: converted versions, or the originals for those that don’t have to be converted. Original resources that do not need to be converted would thus be included twice: in the WARC archive and in the “ready to display” directory.
  • A manifest file with metadata.
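To make this more concrete, the layout of such an archive could look like this (all names are purely illustrative):

    example-archive.jar
      archive.warc          -- original requests/responses and conversion records
      content/
        index.html          -- links rewritten to point to the local copies
        style.css
        images/logo.png
      META-INF/
        MANIFEST.MF         -- metadata: original URL, archival date, archiver, signatures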

To help solve the trust issue, this information should be signed.

The jar specification defines a mechanism to create signed jar files. Unfortunately, the signature is applied to the complete content of the jar, which doesn’t help much in our case, where what is presented to visitors won’t be the jar itself but the resources from the “ready to display” directory.

I think we should instead sign the WARC archive and the content of the directory separately.
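Here is a rough sketch of what the archiver side could do: compute one digest for the WARC file and one for the “ready to display” directory (walking the files in a canonical order), then sign both digests with the archiver’s private key. All names are hypothetical, the RSA signature uses the Python cryptography library, and a real implementation would need careful canonicalisation and key management:

    import hashlib
    import os

    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding


    def digest_file(path):
        # SHA-256 digest of a single file, read in chunks.
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(8192), b''):
                h.update(chunk)
        return h.hexdigest()


    def digest_directory(root):
        # Hash every file in a fixed (sorted) order, together with its relative
        # path, so that renaming or reordering files changes the digest.
        h = hashlib.sha256()
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                h.update(os.path.relpath(path, root).encode('utf-8'))
                h.update(bytes.fromhex(digest_file(path)))
        return h.hexdigest()


    def sign(digest_hex, private_key_pem):
        # RSA signature over the digest, to be stored in the manifest.
        key = serialization.load_pem_private_key(private_key_pem, password=None)
        return key.sign(digest_hex.encode('ascii'), padding.PKCS1v15(), hashes.SHA256())

    # warc_signature    = sign(digest_file('archive.warc'), pem)
    # content_signature = sign(digest_directory('content'), pem)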

With these separate signatures, we would have all the information needed to let a visitor browsing an archive check that it was produced by a specific archiver and has not been altered in any way.

The signature, which would be included in the manifest, could be sent to the browser in a specific HTTP header.

To authenticate an archive, one would have to:

  • read the signature from the HTTP header
  • download the web page and all the related resources
  • fetch the public key, which could be stored on the archiver’s web site or in a DNS record
  • check the signature (a rough sketch of these steps follows).
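Under the same assumptions as the signing sketch above (the header name, digest scheme and key location are all made up for the example), the verification could look roughly like this; a complete implementation would also fetch and hash every related resource listed in the manifest, exactly as the archiver did when signing:

    import hashlib
    import urllib.request

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding


    def verify_archive(archive_url, key_url):
        # Fetch the archived page and the (hypothetical) signature header.
        with urllib.request.urlopen(archive_url) as resp:
            page = resp.read()
            signature_hex = resp.headers.get('X-Archive-Signature')

        # Fetch the archiver's public key (here, from its web site).
        with urllib.request.urlopen(key_url) as resp:
            public_key = serialization.load_pem_public_key(resp.read())

        # Recompute a digest of what we actually received; for the check to
        # succeed this must follow exactly the same scheme the archiver used
        # when signing (simplified here to a digest of the page alone).
        digest = hashlib.sha256(page).hexdigest()

        try:
            public_key.verify(bytes.fromhex(signature_hex),
                              digest.encode('ascii'),
                              padding.PKCS1v15(),
                              hashes.SHA256())
            return True
        except InvalidSignature:
            return False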

In practice, it would be foolish to rely on JavaScript served by the web site that hosts the archive (the whole point of this mechanism is that this web site can’t be trusted); the more appropriate way to perform this kind of verification is probably to rely on browser plugins that could be downloaded from well-known public plugin repositories or from the archivers’ web sites.

Does all that make sense? Am I reinventing the wheel?

Thanks for your comments!