All you need to know about Web Archiving at The National Archives
by Clare Brown on February 24, 2021
Notes from a webinar: "Web archiving services at the National Archives" (Wednesday 3 February 2021) with Tom Storrar, Web Archiving Service Owner at The National Archives. The chair was Fiona Laing (currently Chair of SCOOP – Standing Committee on Official Publications).
When I read that Tom, Web Archiving Service Owner at the UK's National Archives (TNA), was presenting a webinar, I immediately signed up and was excited to join CILIP GIG colleagues online. We weren’t disappointed. Tom and his colleagues (seven full-time and one part-time) have the important role of officially preserving the UK government’s online material.
Technology has often run ahead of government, which has left researchers stranded. The issue of missing or inaccessible online government information has caused problems in the past, and was raised in the UK parliament, for instance in 2006 and 2009.
What’s the story of the UK Government Web Archive?
The National Archives were on the case well before 2006: in 2003 they were already preserving a small selection of UK central government websites. By 2008 the scope had been expanded, and 2012 saw the addition of social media. In 2017 they switched their service provider to MirrorWeb Ltd, which gives the archiving team a technical edge, for instance by enabling improvements to the site’s search functionality.
The archive comprises more than 40,000 crawls/snapshots of over 5,000 websites and over 500 social media accounts. At approximately 160 TB and 6 billion resources, it is an important tool for contextualising records for past, present and future research. After all, in the future someone might write a PhD exploring the influence of government advertising on the sales of beef and lamb!
What is web archiving, and why is it important?
Tom agrees with the Wikipedia definition of web archiving, which says,
Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ web crawlers for automated capture due to the massive size and amount of information on the Web.
One of the first experiences that made me question my librarian abilities occurred in May 1997, just after the general election. The change in government meant that the embryonic government websites on which we were starting to rely vanished overnight. I instantly regretted my lack of foresight and wished we’d printed out all that online material.
As a consequence of this, our law library created paper files of government press releases, guidance notes, manuals, reports, white papers etc., which had to be updated daily - the role of the junior team member. When I see those government website iterations from the late 1990s, it brings back many filing memories!
What do the National Archives capture?
We have come a long way since those classic static sites, and the advent of government broadcasting on social media - Twitter and Flickr - means that even more information needs to be captured and archived. Web archiving operates within certain technical constraints. To be archived, content must be:
- Publicly available
- Reachable by robots/crawlers
If the web archiving team is informed about web resources that don’t meet the above criteria, they can intervene and capture them using state-of-the-art tools such as Conifer. The captured pages then undergo quality assurance checks before they are published. Occasionally rogue code causes readability issues, but this is inevitable when you are dealing with this amount of data and such a wide variety of web sources.
Currently more than 800 distinct websites and social media accounts are regularly archived: those of central government, departments, and other public bodies, hub sites (e.g. GOV.UK, NHS), public inquiries, and some inquests. They take as much as possible from the target website:
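As a rough illustration of the "reachable by robots/crawlers" criterion, the sketch below checks whether a site's robots.txt rules would allow a generic crawler to fetch a given URL. It uses Python's standard `urllib.robotparser`; the function name `is_crawlable`, the sample rules and the example.org URLs are assumptions made for illustration, and nothing here describes the tooling TNA or its service provider actually uses.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # Parse the rules from a string so the check needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt rules for an illustrative site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in (
        "https://example.org/publications/report.pdf",
        "https://example.org/private/draft.html",
    ):
        print(url, is_crawlable(EXAMPLE_ROBOTS, url))
```

A check like this only covers robots.txt rules. Content behind logins, search forms or heavy scripting can still be out of a crawler's reach even when robots.txt permits it.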
- Publications, datasets, documentation
- Images
- Video, animations
Their approach is to take “deep” and complete captures of every website they archive, with an emphasis on quality, completeness, and fidelity. Obviously there are circumstances when pages need to be taken down: when there are errors, non-governmental material, or something subject to data protection. Otherwise everything is conserved.
Special event-based archiving projects
It would be impossible - and unnecessary - to capture everything every day, so they have a schedule for regular material. However, some captures are triggered when a website is about to be retired, refreshed, or redesigned. Events of national importance such as Brexit (EEWA) have become web archive projects in their own right, and have required daily web archiving.
Tom outlined their three-pronged approach:
- They increased the frequency of captures of key sites and resources
- They supplemented the frequency with (weekly and fortnightly) keyword-generated broader crawls across the government web estate
- They captured daily snapshots of complex, interactive (forms etc), or very fast-changing content using web crawlers and/or Conifer
The search function is vital because the archive is huge and these projects are important. You have several options: a specific social media search, the general search, a fascinating A-Z, a URL search, and the Discovery catalogue-style search. Discovery holds more than 32 million descriptions of records held by TNA and more than 2,500 archives across the country.
How people are coming together to help the National Archives
Web archiving is the responsibility of every government department and it can only be done with the assistance and cooperation of other people. TNA ask departments to:
- Make sure that their content is “crawlable” - technical guidance is available
- Provide XML sitemap(s), especially for content behind inaccessible functionality
- Ensure that the website’s copyright and reuse statement is clear
- Review the takedown policy, and check that existing archived content is within the rules
- Consider archiving timescales and remember that it isn’t an instantaneous process
- And check the capture before retiring, pruning, taking down or deleting a website!
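The request for XML sitemaps in the list above can be sketched in a few lines. The example below builds a minimal sitemap following the public sitemaps.org 0.9 schema, listing URLs that a crawler might not discover by following links alone. The helper name `build_sitemap` and the example.org URLs are hypothetical; a real site would normally generate this from its content management system.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML sitemap (as UTF-8 bytes) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical pages that sit behind a search form.
    pages = [
        "https://example.org/reports/2020-annual-report",
        "https://example.org/search/results?id=1",
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

Listing such pages explicitly gives a crawler a direct route to content it could not otherwise reach through a form or a script-driven menu.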
The official status of TNA makes inter-departmental relationship building easier, and to raise their profile they hold events and host webinars. Should people need advice on tech, copyright, or new (or old!) websites, they are invited to get in touch with the National Archives team.
I’ve enjoyed browsing the government archives and it’s nice to remember the best civil servants and revisit happier times. Read my full report on the GIG site!