500,000 newspaper pages from the absolute monarchy recreated as digital text data
Read about how researchers have created machine learning models that recognize both layout and text on newspaper pages with high precision and segment the recognized text.
Photo: ENO – Enevældens Nyheder Online: https://hislab.quarto.pub/eno/
The project Enevældens Nyheder Online (ENO), or News from the Absolute Monarchy Online, aims to recreate Denmark-Norway's newspaper corpus from the period of absolute monarchy as digital text data.
We are interested in what the material can tell us about themes such as the labor market, crime, and consumption. Royal Danish Library's enormous newspaper collection is an underutilised resource for social and cultural history research.
More about the project
| Project name | Enevældens Nyheder Online – ENO (News from the Absolute Monarchy Online) |
|---|---|
| Researchers | |
| Related material | |
| Service from Royal Danish Library | We used the material as it is made available through LOAR (https://loar.kb.dk/collections/3933596a-95ca-4927-b55c-3ba948ea6603) and the Mediestream API. |
| Material from Royal Danish Library | About 500,000 newspaper pages in image form, with associated metadata on date, edition, and page numbering. Most of the images come from the digitization of the newspaper collection's microfilm, but we have also used new photographs of individual series that were not part of the original newspaper digitization. |
| Contact at Royal Danish Library | Ask the library |
The researcher explains further
Working from the image files provided and with expert advice from Royal Danish Library, we have created machine learning models that recognize both layout and text on newspaper pages with high precision and segment the recognized text. Among other things, we have used the new version of the text to train a historical language model, DA-BERT_Old_News, which makes it possible to calculate semantic relationships between the more than five million newspaper texts.
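
As a rough illustration of what calculating semantic relationships can mean in practice, the sketch below ranks texts by cosine similarity between embedding vectors. This is a common approach and not necessarily the project's own method. It assumes each newspaper text has already been turned into a vector by a language model such as DA-BERT_Old_News; the text labels, the three-dimensional vectors, and the function names are all hypothetical and chosen only to keep the example small.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=2):
    """Rank every other text by its similarity to the query text."""
    query = embeddings[query_id]
    scores = [
        (text_id, cosine_similarity(query, vector))
        for text_id, vector in embeddings.items()
        if text_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical toy embeddings; real model vectors have hundreds of dimensions.
toy_embeddings = {
    "runaway_servant_notice": [0.9, 0.1, 0.0],
    "job_advert": [0.8, 0.3, 0.1],
    "auction_of_goods": [0.1, 0.2, 0.9],
}

ranked = most_similar("runaway_servant_notice", toy_embeddings)
for text_id, score in ranked:
    print(f"{text_id}: {score:.3f}")
```

With these made-up vectors, the job advert ranks as closer to the servant notice than the auction notice does. With real embeddings, the same ranking step could be used to find texts on related themes such as the labor market, crime, or consumption.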