Upcoming versions of wget will support WARC

Alard has implemented an option to produce WARC output in wget:

I (Alard) have been working on a way to have Wget write its results to a WARC (Web ARChive file format) file, just like Heritrix and other archiving tools. With the WARC format, it’s possible to save both the request and the response headers. It also provides a clean way to store redirects and 404 responses.

Gijs van Tulder has proposed to merge these changes into the wget main code line:

I’d like to propose a new feature that allows Wget to make WARC files. Perhaps you're already familiar with it, but in short: WARC is a file format for web archives. In a single WARC file, you can store every file of the website, plus the HTTP request and response headers and other metadata. This makes it a very useful format for web archivists: you keep everything together, in the most detailed and original form. The WARC format (an ISO standard, ISO 28500) has been developed by the International Internet Preservation Consortium, which includes the Internet Archive and many national libraries. It is supposed to become *the* standard file format for web archives. For example, it is used in the Internet Archive's Wayback Machine and its Heritrix crawler. There are several projects building tools to work with WARC files.
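To make the idea of keeping request and response headers together more concrete, here is a minimal Python sketch of what two WARC/1.0 records for a single 404 exchange could look like. It is only an illustration of the record layout (a version line, named header fields, a blank line, the raw HTTP message, then two CRLFs) and is not how Wget itself produces WARC output. The helper name `build_warc_record`, the URL and the HTTP messages are made up for the example.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, http_message, msgtype):
    """Serialize one WARC/1.0 record (hypothetical helper, for illustration only)."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", f"application/http; msgtype={msgtype}"),
        # Content-Length counts the bytes of the block that follows the headers.
        ("Content-Length", str(len(http_message))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record ends with two CRLFs after its block.
    return head.encode("utf-8") + http_message + b"\r\n\r\n"


# Example exchange (assumed data): a request that gets a 404 response.
request = b"GET /missing HTTP/1.1\r\nHost: example.org\r\n\r\n"
response = b"HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n"

archive = (
    build_warc_record("request", "http://example.org/missing", request, "request")
    + build_warc_record("response", "http://example.org/missing", response, "response")
)

print(archive.decode("utf-8"))
```

A real archive would normally also carry a `warcinfo` record and link the request to its response, which this sketch leaves out for brevity.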

The proposal has been accepted by Giuseppe Scrivano, the current wget maintainer:

Sure we do!

Judging from the bug-wget mailing list, the integration seems to be progressing.

Wget is handy and widespread, so this will make it much easier for pretty much everyone to create quality archives.

Wget is also what the Owark WordPress plugin currently uses, so that’s obviously good news for the project.

This doesn’t really change the plan, discussed in a recent blog post, to create a web service based option for Owark. It will, however, strengthen the “local archiving option” in the long term (it will take a few years before a new version of wget is widespread) and provide an interesting alternative for producing archives for the web service.
