Harvesting the government web space
At a time when more and more government publications are online, U of T librarians have stepped in to start archiving government websites.
U of T’s collection is considered among the most extensive and accessible collections of online captures of government websites in the country. The university’s efforts are critical because they preserve information – and in turn keep governments accountable – in an era when documents are no longer available in print and are always changing on the Internet.
Until recently, federal agencies would send Robarts Library print copies of government publications to preserve them and make them available to the public, thereby encouraging active citizenship. But that program ended in 2014, and no new initiatives have followed for the archiving of government websites.
“Government documents, government information, things like annual reports, statistics, are material we help researchers find on websites,” said Nicholas Worby, who is in charge of web archiving at U of T Libraries. “The control for their preservation and curation is out of our hands, but we have a huge stake in making sure we have access to this stuff. This is an effort to rescue those documents and increasingly it’s becoming more of a means of extending our job into a born digital world.”
Those and other issues will be part of a discussion this week when Robarts Library hosts an event bringing together researchers, archivists and librarians from around the world to begin navigating uncharted waters. They’ll be developing open source tools and methodologies for working with web archives.
Currently, U of T Libraries is archiving material from the federal government’s website, and has begun collecting content from provincial websites. They are also working with the City of Toronto Archives to capture parts of the Toronto municipal portal.
U of T’s collection includes captures of about 200 Canadian federal government websites from the end of Library and Archives Canada’s web archiving program in 2007, as well as archives of 60 sites from the Ontario government web domain and seven sites for the City of Toronto.
The effort began three years ago when U of T Government Publications and Reference Librarian Sam-chin Li and librarians from other universities discovered that the Harper government was shutting down the Aboriginal Canada Portal site within a week. U of T librarians rushed to figure out how they’d capture the online information. They consulted fellow university librarians. They learned how to use web-harvesting software and then worked into the night to crawl part of the site.
That was followed by another shock a few months later – a leaked document showing that more federal government websites could be terminated or have at least 60 percent of their content removed.
It was a wake-up call to U of T librarians. They needed to begin archiving the website content themselves.
“It was going to be a digital dark age of Canadian government information, of what we were going to know about our government,” Li said. “U of T filled that gap.”
Today, Library and Archives Canada says it is capturing content on federal government websites, but for the past two and a half years the sites have not been made publicly available.
“We keep asking them to send information about what they have captured so we can fill the gaps, but it’s still a big unknown to us,” Li said. “We can’t stop doing our job because we don’t know what they have done. We have students coming and asking for information. We can’t say you have to wait for Library and Archives to share the information.”
Worby, who is now the Government Information and Statistician Librarian, was a grad student in the Faculty of Information when U of T began the rush to harvest the sites. He says the government records are crucial for researchers. However, not every page on a government website is getting captured on a daily basis. Most of the time, the university does broad crawls semi-annually and captures media release pages every evening.
“It terrifies me, but I know we’re never going to archive everything,” Worby said. “It’s impossible to capture everything with web archiving. Having at least some fragmentary pieces of historical memory is still better than not having it.”
The U of T collection also includes campaign and party websites for the recent federal and Toronto mayoral elections. U of T Scarborough Principal Bruce Kidd offered advice to the librarians about archiving Toronto 2015 Pan Am/Parapan Am Games sites, so as to collect documents from host cities detailing the planning and experience of the games throughout the GTA.
“It is really important that documentary records of major games be kept,” Kidd says. “In the growing field of international scholarship of major games, mega events and international sport, the Olympics is fairly well covered and documented. The Commonwealth Games less so. But the Pan American Games is really a big dark hole. I thought that one of the legacy contributions of Toronto 2015 would be to leave a good documentary archive on the games. This will benefit researchers in a variety of fields for years to come.”