About collecting internet material
We are responsible for collecting and preserving relevant material from the Danish part of the Internet - that is, content produced by Danes, in Danish or for a Danish audience.
If you are the owner of a Danish website, you have probably had visits from the web crawlers that we use to collect Internet material. Netarkivet is Denmark's national web archive, and we collect the entire Danish part of the Internet between two and four times a year and preserve it all in our web archive.
Our web crawlers leave a signature with a URL to this page when collecting material. That way we let you know that we have stopped by and that we are not hackers.
If you experience problems, have questions about our web crawlers or suggestions for improvements, you are very welcome to contact us. The material in the web archive may only be used for research purposes because it contains sensitive personal information. Read more under research access.
Collection of Internet material
Websites come in all sizes, but most are very small or not in use at all. To collect efficiently, we start with a small harvest with a maximum limit of, for example, 10 MB per domain. Our statistics show that more than 75% of all Danish websites are smaller than this limit.
We use the results of the first harvest to find out which domains are active and to group the Danish domains by size. We then follow up with larger harvests of the domains that reached the limit in the previous harvest, which means that a small number of files on larger websites are harvested several times.
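The grouping step described above can be sketched roughly as follows. This is a minimal illustration and not Netarkivet's actual harvest planning: the function name `plan_follow_up`, the category names and the example domains are all hypothetical, and the 10 MB limit is only the example figure mentioned in the text.

```python
# Example byte limit for the first, small harvest (the text gives 10 MB as an example).
FIRST_LIMIT_BYTES = 10 * 1024 * 1024


def plan_follow_up(harvested_bytes, limit=FIRST_LIMIT_BYTES):
    """Split domains into inactive, complete, and needing a larger harvest."""
    plan = {"inactive": [], "complete": [], "follow_up": []}
    for domain, size in sorted(harvested_bytes.items()):
        if size == 0:
            # Nothing came back, so the domain is treated as not in use.
            plan["inactive"].append(domain)
        elif size < limit:
            # The whole site fit under the limit; no larger harvest is needed.
            plan["complete"].append(domain)
        else:
            # The domain hit the limit, so it is queued for a larger harvest.
            plan["follow_up"].append(domain)
    return plan


# Hypothetical results of a first harvest: bytes collected per domain.
first_harvest = {
    "small.example": 2_000_000,
    "unused.example": 0,
    "large.example": FIRST_LIMIT_BYTES,
}
plan = plan_follow_up(first_harvest)
```

Because the larger harvest starts over on the domains in `follow_up`, the files already fetched under the first limit are fetched again, which is the repeated harvesting the text mentions.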
When we collect material for our web archive, we use the open-source harvester Heritrix, developed by the Internet Archive in collaboration with web archives and libraries from around the world.
Below you can read more about our work with collecting internet material and find answers to some of the most frequently asked questions in connection with the work.
On many websites, robots.txt steers search engine web crawlers away from material that is absolutely necessary to recreate the correct experience of a website as it looked at a given time.
Our experience shows that if we respect robots.txt when collecting, we miss large amounts of vital data, for example from newspapers' websites but also from tens of thousands of private websites, which we consider significant contributions to Danish cultural heritage. Following the same principle, we can also override HTML meta tags.
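To illustrate how much a robots.txt file can hide from an archive, the sketch below uses Python's standard `urllib.robotparser` to list the URLs that a crawler obeying robots.txt would skip. The robots.txt content and the URLs are hypothetical; they are chosen to show the common case where stylesheets and images, which are needed to recreate a page's appearance, are disallowed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind described above: it keeps crawlers
# away from stylesheets and images.
ROBOTS_TXT = """\
User-agent: *
Disallow: /css/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs that together make up one page.
urls = [
    "https://minhjemmeside.dk/index.html",
    "https://minhjemmeside.dk/css/style.css",
    "https://minhjemmeside.dk/images/logo.png",
]

# URLs that a crawler obeying robots.txt would leave out of the archive.
skipped = [url for url in urls if not parser.can_fetch("kb.dk_bot", url)]
```

In this example only `index.html` would be collected, so the archived page would be stored without its layout or images.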
You will find our user agent in the following snippet:
<map name="http-headers">
<string name="user-agent">Mozilla/5.0 (compatible; kb.dk_bot; heritrix/3.4.0 +https://www.kb.dk/netarkivindsamling) Firefox/57</string>
<string name="from">email@example.com</string>
</map>
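If you want to check whether our crawler has visited your site, one option is to search your access log for the `kb.dk_bot` token from the user-agent string above. The following is a minimal sketch that assumes your web server writes the user agent as the last quoted field of each log line (as in the widely used "combined" log format); the log lines, IP addresses and function name are hypothetical.

```python
import re

# Token taken from the user-agent string quoted above.
CRAWLER_TOKEN = "kb.dk_bot"

# Assumption: the user agent is the last double-quoted field on each log line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def crawler_hits(lines):
    """Return the log lines whose user-agent field contains the crawler token."""
    hits = []
    for line in lines:
        match = UA_FIELD.search(line)
        if match and CRAWLER_TOKEN in match.group(1):
            hits.append(line)
    return hits


# Hypothetical access-log lines.
log_lines = [
    '192.0.2.10 - - [01/Jan/2024:10:00:00 +0100] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; kb.dk_bot; heritrix/3.4.0 '
    '+https://www.kb.dk/netarkivindsamling) Firefox/57"',
    '192.0.2.20 - - [01/Jan/2024:10:00:05 +0100] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"',
]
hits = crawler_hits(log_lines)
```

Matching on the token rather than the full string keeps the check working if the Heritrix version number in the user agent changes.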
You are very welcome to contact us at firstname.lastname@example.org if our web crawlers create problems for your website.
It will help us a lot if you write the following in the email:
- A list of the affected domains (for example minhjemmeside.dk)
- Domain IP addresses (for example 18.104.22.168, 22.214.171.124 etc.)
- Domain 'alias', that is identical websites with different domain names
- Examples of the problem (for example screenshots, logs and the like)
We strive to ensure that our web crawler does not overload your server. We therefore use a minimum delay of 0.3 seconds between requests to the same host (maximum one second). If you find that our crawler is straining your website, please contact us so we can remedy the problem.
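A delay bounded in this way can be sketched as below. This is an illustration under stated assumptions, not our harvester's actual configuration: the 0.3 second and one second bounds come from the text, while the idea of scaling the delay with the duration of the previous fetch, the multiplier `DELAY_FACTOR` and the function name are hypothetical.

```python
MIN_DELAY_S = 0.3   # lower bound stated in the text
MAX_DELAY_S = 1.0   # upper bound stated in the text
DELAY_FACTOR = 2.0  # hypothetical multiplier, not taken from the text


def politeness_delay(last_fetch_seconds):
    """Return the wait time in seconds before the next request to the same host."""
    # Slow servers get a longer pause, but the result always stays
    # between the minimum and maximum delay.
    proposed = DELAY_FACTOR * last_fetch_seconds
    return min(MAX_DELAY_S, max(MIN_DELAY_S, proposed))
```

For example, a response that took 0.05 seconds would still be followed by the 0.3 second minimum pause, while a response that took 3 seconds would be followed by the one second maximum.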
If your website contains many identical copies of the same material (mirroring) and it is a problem for you that we download all copies, please contact us about that as well.
We do not want to obtain users' passwords for webmail, banking or the like. We would like to register Netarkivet as a "user" so that we can read, for example, news sites that require login, possibly for a fee. We want to collect what all users can access. It is easiest for us if you give our web crawler access via IP validation. We will contact you about this possibility.
The Danish Legal Deposit Act grants us the option to access password protected content without payment.
Material that is intended only for a closed group, and which therefore not everyone can in principle access (for example closed family websites or companies' intranets), is not considered published material and therefore does not fall under the provision.
Technically, one must remember to distinguish between POST and GET requests in the HTTP protocol. See for example www.w3.org.
Our web crawler finds links using regular expressions, among other methods, but we only ever send GET requests. If the web server at the other end also responds to GET on URLs that were really intended only for POST, that is a bug in the scripts that receive the requests.
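On the server side, the distinction can be enforced by checking the request method before running any script. The sketch below shows the idea with a hypothetical routing table; the paths and the function name are assumptions made for illustration, and a real site would apply the same check inside its own web framework or server configuration.

```python
# Hypothetical routing table: the HTTP methods each path is meant to accept.
ALLOWED_METHODS = {
    "/search": {"GET"},
    "/guestbook/add": {"POST"},
}


def status_for(method, path):
    """Return the HTTP status code a well-behaved server would send."""
    allowed = ALLOWED_METHODS.get(path)
    if allowed is None:
        return 404  # Not Found: the path does not exist
    if method not in allowed:
        return 405  # Method Not Allowed: e.g. GET on a POST-only URL
    return 200  # OK: the request may be processed
```

With a check like this, a GET request from a crawler to `/guestbook/add` is answered with 405 and never triggers the action that was meant for POST.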
It is the Danish Legal Deposit Act that allows us to collect material that is subject to copyright. With regard to the relevance of the pages, the principle in the collection is that, as far as possible, it is the researchers of the future who decide the relevance, rather than various stakeholders at the time of collection.
Legal Deposit helps us document our society for posterity. None of the material we collect will be deleted because it is too old.
If we become aware that we are being prevented from collecting a website, we'll contact the owner of the website and try to find a solution that meets both the website owner's needs and our legal obligation to collect and preserve Danish cultural heritage on the Internet.
If we cannot agree, we can ultimately go to court - see Danish Legal Deposit Act, § 21.
To avoid blocking our web crawler, please enter our IP addresses in the blocking mechanism as permitted IP addresses.
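If your blocking mechanism is scripted, an allowlist check can look roughly like the sketch below, which uses Python's standard `ipaddress` module. The network shown is a placeholder from the range reserved for documentation (192.0.2.0/24) and is not one of our addresses; the function name is also hypothetical. Replace the placeholder with the crawler addresses published on this page.

```python
import ipaddress

# Placeholder network from the documentation range; replace it with the
# crawler addresses published on this page.
CRAWLER_NETWORKS = [ipaddress.ip_network("192.0.2.0/28")]


def is_allowlisted(client_ip):
    """Return True if the client address belongs to an allowlisted network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in CRAWLER_NETWORKS)
```

A request from an allowlisted address would then skip the rate limiting or blocking rules that apply to other clients.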
Our web crawler is currently coming from the following IP addresses: