We have been collecting material from the internet since July 2005. The collection is done automatically using a so-called “harvester”, which is software developed to collect internet material.
What do we collect?
We collect only publicly available material from the internet. Private content with restricted access, such as internal family websites or private enterprise networks, is not public and is therefore not collected by us.
We use different strategies for collecting:
- Cross-sectional data collection, which observes all Danish domains at a specific point in time, up to four times a year.
- Selective data collection from the following types of websites: all Danish news outlets (from 12 times daily to once a week), political parties, organizations and unions, government departments and agencies, selected social media profiles, and YouTube videos (e.g. weekly).
- Event collection of two or three events annually (e.g. a general election or the coronavirus pandemic)
- Special collections (e.g. chosen by researchers)
The intention behind using different strategies is that, when combined, they provide the best possible coverage of what is being published on the Danish part of the internet.
The online archive contains both data and metadata (descriptive data), and both are made available for research projects on the Cultural Heritage Cluster.
The amount of web material housed in the archive continues to grow. At the beginning of 2020, the archive contained approx. 640 TB of data, measured in binary units (1 TB = 1024 GB). This is an increase of more than 100 TB in one year.
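To make the unit remark concrete, here is a small sketch that converts the figures above into decimal terabytes. It assumes that "1024 numbers" means binary units (1 TB = 1024^4 bytes); the function and constant names are made up for this illustration.

```python
# Assumption: the archive figures use binary units, i.e. 1 TB = 1024**4 bytes.
BYTES_PER_BINARY_TB = 1024 ** 4
BYTES_PER_DECIMAL_TB = 1000 ** 4


def binary_tb_to_decimal_tb(binary_tb):
    """Convert a size in binary terabytes to decimal terabytes."""
    return binary_tb * BYTES_PER_BINARY_TB / BYTES_PER_DECIMAL_TB


archive_size = binary_tb_to_decimal_tb(640)   # archive size at the start of 2020
yearly_growth = binary_tb_to_decimal_tb(100)  # lower bound on one year's growth

print(f"{archive_size:.1f} TB in total, growing by more than {yearly_growth:.1f} TB")
```

Under that assumption, 640 TB in binary units corresponds to roughly 704 TB in decimal units, and the yearly increase to roughly 110 TB.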
How does the harvester work?
The online archive is not a replica of the living internet. Using a harvester (Heritrix 3), the material is collected mechanically and stored in a particular format (the WARC format). The content of the online archive can be displayed using a wayback machine, similar to the one used by the Internet Archive.
The harvester is fed with URLs (links), and we decide how the harvester follows them and how it collects what it encounters. If we wish to harvest a front page or a single article, the harvester only needs to collect the elements (images, stylesheets etc.) required to render that webpage. If we wish to harvest a theme section, the harvester must, in addition to collecting the theme front page, go one or two levels further down the site in order to obtain the articles that are linked to from the front page. When performing cross-sectional harvesting, the harvester is told to collect the entirety of the website.
We also have to tell the harvester how often each pre-defined link should be collected.
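The three cases described above (single page, theme section, entire website) can be pictured as one crawl with a different depth limit. The following is a minimal sketch of that idea over a made-up in-memory site. It is not how Heritrix is configured; the `SITE` data, the `harvest` function and the `max_hops` parameter are assumptions for illustration only.

```python
from collections import deque

# Hypothetical toy site: each page lists the pages it links to and the
# embedded elements (images, stylesheets) needed to render it.
SITE = {
    "/": {"links": ["/theme"], "embeds": ["/style.css"]},
    "/theme": {"links": ["/article-1", "/article-2"], "embeds": ["/theme.png"]},
    "/article-1": {"links": ["/archive"], "embeds": ["/photo-1.jpg"]},
    "/article-2": {"links": [], "embeds": []},
    "/archive": {"links": [], "embeds": []},
}


def harvest(seed, max_hops):
    """Collect the seed, its embedded elements, and every page reachable
    within max_hops links. max_hops=None means no depth limit."""
    collected = set()
    queue = deque([(seed, 0)])
    while queue:
        url, hops = queue.popleft()
        if url in collected or url not in SITE:
            continue
        collected.add(url)
        page = SITE[url]
        # Embedded elements are always collected so the page can be rendered.
        collected.update(page["embeds"])
        # Links are only followed while the depth limit allows it.
        if max_hops is None or hops < max_hops:
            for link in page["links"]:
                queue.append((link, hops + 1))
    return collected


single_page = harvest("/theme", max_hops=0)    # one page plus its elements
theme_section = harvest("/theme", max_hops=1)  # front page plus linked articles
whole_site = harvest("/", max_hops=None)       # cross-sectional style harvest
```

In this toy model, `single_page` holds only `/theme` and its image, `theme_section` adds the two articles and their elements but not `/archive`, and `whole_site` holds everything reachable from the root.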
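As a rough illustration of harvest frequency, the sketch below pairs each link with an interval and checks which links are due. The URLs, intervals and the `due_seeds` helper are hypothetical and do not describe the archive's actual scheduling setup.

```python
from datetime import datetime, timedelta

# Hypothetical seeds with harvest intervals, loosely modelled on the
# frequencies mentioned for selective collection.
SCHEDULE = {
    "https://news.example/": timedelta(hours=2),   # 12 times daily
    "https://party.example/": timedelta(weeks=1),  # once a week
}


def due_seeds(last_harvested, now):
    """Return the seeds whose interval has elapsed since their last harvest."""
    return sorted(
        seed
        for seed, interval in SCHEDULE.items()
        if now - last_harvested[seed] >= interval
    )


now = datetime(2020, 1, 8, 12, 0)
last = {
    "https://news.example/": datetime(2020, 1, 8, 9, 0),   # 3 hours ago
    "https://party.example/": datetime(2020, 1, 6, 12, 0),  # 2 days ago
}
due = due_seeds(last, now)
```

Here only the news seed is due, because three hours have passed against a two-hour interval, while the weekly seed was harvested two days ago.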
Unfortunately, our collection software and display tool have technical limitations. As a result, we have problems:
- harvesting audio and moving picture streams (video)
- displaying some http-based webpages (especially from social media sites like Reddit, Twitter, YouTube and Facebook)