Data rescue for World Digital Preservation Day 2025

Today, Thursday 6 November 2025 (if I actually manage to finish and publish this today), is World Digital Preservation Day, so I thought I'd try to get a blog post out about some work I've been doing to rescue at-risk data. I've briefly mentioned this in my post about Library of Congress Subject Headings, but not in much detail.
The project is Safeguarding Research & Culture and I got involved back in March or April when Henrik reached out on social media looking for someone with library & metadata experience to contribute. I said that I wasn't a Real Librarian but I'd love to help if I could, and now here we are.
The concept is simple: download public datasets that are at risk of being lost, and replicate them as widely as possible to make them hard to destroy, though obviously there's a lot of complexity buried in that statement. When the Trump administration first took power, a lot of people around the world were worried about this issue and wanted to help, so while there are a number of institutions and better-resourced groups doing similar things, we aim to complement them by mobilising grassroots volunteers.
Downloading data isn't always straightforward. It may be necessary to crawl an entire website, or query a poorly-documented API, or work within the constraints of rate-limiting so as not to overload an under-resourced server. That takes knowledge and skill, so part of the work is guiding and mentoring new contributors and fostering a community that can share what they learn and proactively find and try out new tools.
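To make the rate-limiting point concrete, here's a minimal sketch (in Python) of the kind of polite downloader a new contributor might start from. It isn't our actual tooling: the delay, file naming and user-agent contact details are illustrative assumptions.

```python
# Illustrative only: a minimal polite-downloader sketch, not SRC's actual tooling.
import time
from pathlib import Path

import requests

USER_AGENT = "data-rescue-example/0.1 (contact: [email protected])"  # hypothetical UA string
DELAY_SECONDS = 2  # stay well under any rate limit; tune per server


def fetch_all(urls, out_dir="rescued"):
    """Download each URL in turn, backing off politely between requests."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT

    for url in urls:
        response = session.get(url, timeout=30)
        if response.status_code == 429:
            # The server asked us to slow down: respect Retry-After if present.
            wait = int(response.headers.get("Retry-After", "60"))
            time.sleep(wait)
            response = session.get(url, timeout=30)
        response.raise_for_status()

        filename = url.rstrip("/").rsplit("/", 1)[-1] or "index.html"
        (Path(out_dir) / filename).write_bytes(response.content)

        time.sleep(DELAY_SECONDS)  # don't hammer an under-resourced server
```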
We also need people to be able to find and access the data, and volunteers to be able to contribute their storage to the network. We distribute data via the venerable BitTorrent protocol, which is very good at defeating censorship and getting data out to as many peers as possible as quickly as possible. To make those torrents discoverable, our dev team led by the incredible Jonny have built a catalogue of dataset torrents, playfully named SciOp. That's built on well-established linked data standards like DCAT, the Data Catalogue Vocabulary, so the metadata is standardised and interoperable, and there's a public API and a developing command-line client to make it even easier to process and upload datasets. There are even RSS and RDF feeds of datasets by tag, size, threat status or number of seeds (copies) in the network that you can plug into your favourite BitTorrent client to automatically start downloading newly published datasets. There are also exciting plans in the works to make it federated via ActivityPub, to give us a network of catalogues instead of just a single one.
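For a flavour of what DCAT metadata looks like, here's a rough sketch using the rdflib library to describe a dataset and its torrent distribution. The identifiers and values are invented for the example and aren't taken from SciOp's actual schema.

```python
# A rough sketch of the kind of DCAT metadata a catalogue entry might carry.
# The URIs and values here are invented; SciOp's actual records may differ.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF, XSD

g = Graph()
dataset = URIRef("https://example.org/dataset/example-dataset")          # hypothetical ID
dist = URIRef("https://example.org/dataset/example-dataset/torrent")     # hypothetical ID

# The dataset itself: title, description, keywords
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Example at-risk dataset (mirror)")))
g.add((dataset, DCTERMS.description, Literal("Snapshot of an at-risk public dataset")))
g.add((dataset, DCAT.keyword, Literal("climate")))
g.add((dataset, DCAT.distribution, dist))

# One distribution of that dataset: the torrent
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.mediaType, Literal("application/x-bittorrent")))
g.add((dist, DCAT.downloadURL, URIRef("https://example.org/torrents/example-dataset.torrent")))
g.add((dist, DCAT.byteSize, Literal(123456789, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```

Because it's all standard DCAT, the same record can be consumed by any linked-data tooling, not just SciOp itself.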
We're accidentally finding ourselves needing to push the state of the art in BitTorrent client implementations. If you're familiar with the history of BitTorrent as a favoured tool for _ahem_ less-than-legal media sharing, it probably won't surprise you that most current BitTorrent clients are optimised for working with single audio-visual streams of about 1 to 2½ hours in length. Our scientific & cultural data is much more diverse than that, and the most popular clients can struggle for various reasons. In many cases there are BEPs (BitTorrent Enhancement Proposals) to extend the protocol to improve things, but these are optional features that most clients don't implement. The collection of BEPs that make up "BitTorrent v2" is a good example: most clients don't support v2 well, so most people don't bother making v2-compatible torrents, but that means there's no demand to implement v2 in the clients. We are planning to make a scientific-grade BitTorrent client as a test-bed for these and other new ideas.
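To give a sense of why v2 matters for data like ours, here's a toy sketch of one difference defined in BEP 52: v1 hashes fixed-size pieces of the whole concatenated content with SHA-1, while v2 hashes each file separately with SHA-256 over 16 KiB blocks that feed a per-file merkle tree. This is just the leaf-hash step, nowhere near a full torrent-creation implementation.

```python
# Toy illustration of one v1-vs-v2 difference (per BEP 52);
# not a full torrent-creation implementation.
import hashlib

PIECE_SIZE = 256 * 1024   # v1 piece size (example value; chosen per torrent)
BLOCK_SIZE = 16 * 1024    # v2 merkle leaf block size, fixed by BEP 52


def v1_piece_hashes(content: bytes) -> list[bytes]:
    """v1: SHA-1 over fixed-size pieces of the concatenated content stream."""
    return [
        hashlib.sha1(content[i:i + PIECE_SIZE]).digest()
        for i in range(0, len(content), PIECE_SIZE)
    ]


def v2_leaf_hashes(file_data: bytes) -> list[bytes]:
    """v2: SHA-256 leaf hashes per file, later combined into a merkle tree."""
    return [
        hashlib.sha256(file_data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(file_data), BLOCK_SIZE)
    ]
```

Hashing per file rather than across the whole stream is one of the things that makes v2 attractive for large, heterogeneous datasets, since individual files can be identified and verified on their own.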
I myself am running one of a small number of "super" nodes in the swarm, with much more storage available than the average laptop or desktop, and often much better bandwidth too. That's good, because some of our datasets run to multiple terabytes, and to ensure new nodes can get started quickly we need some always-on nodes with most of the data available to others. Since BitTorrent is truly peer-to-peer, it doesn't matter how many people have a copy of a given dataset: if none of them are online, no one else can access it.
This is all very technically interesting, but communications, community, governance, policy, documentation and funding are also vitally important, and for us these are all works in progress. We need volunteers to help with all of this, but especially those less-technical aspects. If you're interested in helping, please drop us a line at [email protected], or join our community forum and introduce yourself and your interests.
If you want to contribute but don't feel you have the time or skills, well, to start with we're more than happy to show you the ropes and help you get started, but as an alternative, I'm running one of those "super" nodes and you can contribute to my storage costs via GoFundMe: even a few quid helps. I currently have 3× 6 TB hard drives with no space to mount them, so I'm in need of a drive cage to hold them and plug them into my server.
Special shout-out also to our sibling project, the Data Rescue Project, who are doing amazing work on this and often send us requests for websites or complex datasets for our community to save.
I've barely scratched the surface here, but I _really_ want to actually get this post out for WDPD so I'm going to stop here and hopefully continue soon!