The current version of the WordPress plugin relies on the Broken Link Checker to find new links and check their status, and on wget to create its archives, but all these operations are done on the server that hosts the blog.
This is working fine for the few blogs I administer but this architecture has a couple of drawbacks:
- It’s not always easy to have a recent version of wget installed on the server (actually, the most common installation issues with Owark are related to wget).
- Visitors have to trust the website that hosts the archives that the archived pages are actual copies of the original pages.
You can imagine many motivations to forge archives (to replace or add ads, to bolster a statement you’ve made on your site, …) and I think that this issue of trust could become really nasty if the idea of creating private archives got some traction.
Thinking about these issues, I took a step back and made a list of the functions involved in an Owark node:
- Link management (maintains a list of outbound links and their status)
- Archiving (creates an archive for each outbound link)
- Archives hosting
- Link replacement (replaces broken links with links to local archives)
- Black list management
When you think about it, the only one of these functions that must absolutely be kept on the web server to ensure long-term preservation through the LOCKSS principle is the archives hosting.
The archiving itself could be performed elsewhere on the net and there are several advantages to doing so:
- Among these five basic functions, archiving is the most technically challenging one. To archive a web page, you need to retrieve not only the page but also all the resources needed to display that page, and then update both the page and all the resources that may contain links so that they use archived versions of the resources rather than the original ones. To deal with this complexity, it makes sense to rely on dedicated web services rather than implementing it on each server.
- If the archiving were done by well-known “archive makers” (Owark could be one, but also archive.org or national archive organizations), these archivers could sign their archives to certify that they conform to the original resources.
Coincidentally, this would solve the two drawbacks I have mentioned in my introduction… And this is definitely the direction I’d like to give to the project:
- I have started to implement a web service to create web archives.
- The next versions of the WordPress plugin will have the choice between this web service and wget to create their archives.
To design this web service, we need to define an API, and to define this API, we need to choose a format for exchanging archives.
If I had to propose such a format from scratch, I would probably go for a jar archive containing all the resources (both original and converted) and the HTTP headers and a manifest file including relevant metadata.
However, that would look like reinventing the wheel: archivists have been working on the subject for decades and have defined WARC, an ISO format to store web archives.
WARC has the benefit of being pretty simple: it’s basically a single file aggregating HTTP requests and responses (including their payloads) with additional headers for metadata.
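To make this more concrete, here is a minimal sketch in Python of how a single WARC `response` record could be serialized. It assumes the WARC 1.0 layout (a version line, named header fields, a blank line, the record block, then two CRLFs); the URL and payload are made up for illustration, and a real archiver would probably rely on a dedicated WARC library rather than on hand-written code like this.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Serialize one WARC 'response' record wrapping a raw HTTP response."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # Header fields are separated by CRLF and closed by an empty line.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The record block is followed by two CRLFs.
    return head + http_response + b"\r\n\r\n"


# Hypothetical example: the URL and the payload are illustrative only.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
record = warc_response_record("http://example.com/", raw)
```

A WARC file is then little more than a concatenation of such records, which is what makes the format easy to produce and to append to.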
Of course, using an ISO standard is useful for interoperability and it makes a lot of sense to use that format to store the HTTP exchanges involved in creating an archive. Such archives are useful to keep for legal reasons in case you need to show what has led to an archive. They are also useful in case you need to reapply, for any reason, the conversion that is needed to display the archive without using remote resources.
In addition to HTTP exchanges, WARC also includes “conversion” records to store the result of this conversion.
However, WARC files cannot be displayed directly in a browser, and the work required for a web server to send conversion records from a WARC archive to a browser as “normal” resources is nontrivial (more complex than serving static files, in any case).
What I have currently in mind is thus a mixed approach consisting of a jar archive containing:
- A WARC archive for the original resources.
- A directory containing all the resources needed to render the page, converted or original for those that don’t have to be converted. Original resources that do not need to be converted would thus be included twice: in the WARC archive and in the “ready to display” directory.
- A manifest file with metadata
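The layout above could be sketched as follows, using Python's `zipfile` module since a jar is a zip file with a `META-INF/MANIFEST.MF` entry. The entry names (`archive.warc`, `display/`) and the `Archived-URL` manifest attribute are placeholders I'm using for illustration, not a settled format.

```python
import io
import zipfile


def build_archive(warc: bytes, display_files: dict, archived_url: str) -> bytes:
    """Pack a WARC file, ready-to-display resources and a manifest into a jar."""
    manifest = (
        "Manifest-Version: 1.0\r\n"
        f"Archived-URL: {archived_url}\r\n"  # hypothetical metadata attribute
        "\r\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as jar:
        # By convention the manifest is the first entry of a jar.
        jar.writestr("META-INF/MANIFEST.MF", manifest)
        jar.writestr("archive.warc", warc)
        for path, content in sorted(display_files.items()):
            jar.writestr(f"display/{path}", content)
    return buffer.getvalue()


# Hypothetical content, for illustration only.
jar_bytes = build_archive(
    warc=b"(WARC records would go here)",
    display_files={"index.html": b"<html>hello</html>", "style.css": b"body{}"},
    archived_url="http://example.com/",
)
names = zipfile.ZipFile(io.BytesIO(jar_bytes)).namelist()
```

Note that this sketch ignores the 72-byte line limit that the jar specification imposes on manifest lines, so long URLs would need to be wrapped.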
To help solve the trust issue, this information should be signed.
The jar specification defines a mechanism to create signed jar files. Unfortunately the signature is applied to the complete content of the jar and that doesn’t help much in our case where what is presented to the visitors won’t be the jar itself but the resources from the “ready to display” directory.
I think we should rather sign the WARC archive and the content of the directory separately.
Doing so, we should have all the information needed to allow a visitor browsing an archive to check that the archive was made by a specific archiver and has not been altered in any way.
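As a sketch of what signing separately could mean, the code below computes one digest for the WARC file and another one for the whole “ready to display” directory; these two values are what an archiver would then sign with its private key. The canonical form used here (sorted paths, each paired with the SHA-256 of its content) is an assumption of mine, and the signing step itself is left out because it needs a public-key library that is not part of Python's standard library.

```python
import hashlib


def warc_digest(warc: bytes) -> str:
    """Digest of the WARC file taken as a whole."""
    return hashlib.sha256(warc).hexdigest()


def directory_digest(files: dict) -> str:
    """Digest a set of files independently of the order they are listed in."""
    overall = hashlib.sha256()
    for path in sorted(files):
        file_hash = hashlib.sha256(files[path]).hexdigest()
        # One line per file: "<sha256>  <path>\n"
        overall.update(f"{file_hash}  {path}\n".encode("utf-8"))
    return overall.hexdigest()


# Hypothetical directory content, mapping relative paths to bytes.
files = {"index.html": b"<html>hello</html>", "style.css": b"body{}"}
digest = directory_digest(files)
tampered = directory_digest({**files, "index.html": b"<html>forged</html>"})
```

Any change to a displayed resource changes the directory digest, so a signature over that digest would no longer match.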
The signature, which would be included in the manifest, could be sent to the browser in a specific HTTP header.
To authenticate an archive, one would have to:
- read the signature from the HTTP header
- download the web page and all the related resources
- fetch the public key that could be stored on the archiver’s web site or in a DNS record
- check the signature.
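The steps above can be sketched as follows. Everything specific here is hypothetical: the `X-Archive-Signature` header name, the way the content is digested, and the `fetch`, `list_resources` and `verify` callables, which stand in for an HTTP client, an HTML parser and a public-key signature check.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# fetch(url) -> (headers, body); a stand-in for a real HTTP client
Fetch = Callable[[str], Tuple[Dict[str, str], bytes]]


def authenticate(
    page_url: str,
    key_url: str,
    fetch: Fetch,
    list_resources: Callable[[bytes], List[str]],
    verify: Callable[[bytes, str, bytes], bool],
) -> bool:
    # 1. Read the signature from the (hypothetical) HTTP header.
    headers, page = fetch(page_url)
    signature = headers.get("X-Archive-Signature")
    if signature is None:
        return False
    # 2. Download the related resources and digest everything
    #    in a fixed order (assumption: sorted by URL).
    contents = {page_url: page}
    for url in list_resources(page):
        contents[url] = fetch(url)[1]
    overall = hashlib.sha256()
    for url in sorted(contents):
        overall.update(hashlib.sha256(contents[url]).digest())
    # 3. Fetch the archiver's public key.
    public_key = fetch(key_url)[1]
    # 4. Check the signature against the digest.
    return verify(public_key, signature, overall.digest())
```

Passing the network and cryptographic parts in as parameters keeps the sketch independent of any particular library; a browser extension or a command-line tool could plug in its own implementations.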
Does all that make sense? Am I reinventing the wheel?
Thanks for your comments!