Matthias Boenig

For over twenty years I have been involved in the digitization of library collections, scientific publications and research data. Across a wide range of projects, my work has included the development and improvement of electronic publishing and the use of XML for digital editions and research data. I have also advanced OCR technology and supported its application in various areas of the humanities and in the archiving of historical documents. A few projects are highlighted below.

OCR-D (2015-2024): In this project I worked on the standardization and cataloging of training material for automatic text and structure recognition. The main goal of the project is the complete full-text transformation of historical prints from the 16th to the 18th century, a significant contribution to the preservation and accessibility of cultural assets.

German Text Archive (DTA) (2010-2017): Today, the DTA offers an extensive collection of German-language texts that serves as the basis for a reference corpus of New High German. It comprises around 1500 titles and is characterized by a balanced selection of texts and the use of first editions for digitization. In the DTA project, I worked on the procedural and technological implementation of full-text digitization.

AEDIT (2012-2015): AEDIT is a prototype archive, edition and distribution platform for early modern works. The repository is intended to catalogue, disseminate and provide long-term access to data holdings from digitization and edition projects. In this project, I digitized a corpus of 335 funeral sermons together with the Research Centre for Personal Writings at the Philipps University of Marburg (working group of the Academy of Sciences and Literature, Mainz). As part of the project, the DTA Base Format (DTABf) was updated for this type of text.

ProPrint (2000-2003): ProPrint is a prototype print-on-demand service. Technically, it builds on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). ProPrint links document and publication servers and lets users order selected publications as print-on-demand copies.

Dissonline (1997-2000): This project focused on the online publication of dissertations and habilitation theses. It provided a platform for making scholarly work digitally accessible, which considerably facilitates the dissemination of and access to research results.

selected publications

Auf dem Trainingsplatz der OCR, die OCR-D-GT-Guidelines

Matthias Boenig, Lena Hinrichsen, and Konstantin Baierer

2024

OCR-D für die Massendigitalisierung: Projektstand und Ausblick

Lena Hinrichsen, Konstantin Baierer, Clemens Neudecker, and 2 more authors

2023

Dokument, Transkription, Forschungsdatum

Konstantin Baierer, Matthias Boenig, Elisabeth Engl, and 5 more authors

2022

Das DTABf in der Edition: zusammenfassender Evaluationsbericht

Bernhard Fisseni, Simon Sendler, Daniela Schulz, and 3 more authors

2021

Volltexte – die Zukunft alter Drucke: Bericht zum Abschlussworkshop des OCR-D-Projekts

Elisabeth Engl, Konstantin Baierer, Matthias Boenig, and 2 more authors

o-bib. Das offene Bibliotheksjournal/Herausgeber VDB, 2020

OCR-D: An end-to-end open source OCR framework for historical printed documents

Clemens Neudecker, Konstantin Baierer, Maria Federbusch, and 4 more authors

2019

Ground Truth: Grundwahrheit oder Ad-Hoc-Lösung? Wo stehen die Digital Humanities?

Matthias Boenig, Maria Federbusch, Elisa Herrmann, and 2 more authors

2018

Über den Mehrwert der Vernetzung von OCR-Verfahren zur Erfassung von Texten des 17. Jahrhunderts.

Matthias Boenig, Kay-Michael Würzner, Arne Binder, and 1 more author

2016

Zeitliche Verlaufskurven in den DTA- und DWDS-Korpora: Wörter und Wortverbindungen über 400 Jahre (1600-2000).

Alexander Geyken, Matthias Boenig, Susanne Haaf, and 4 more authors

2015

Standardized Information on historical Proper Names in Digital Full Text Transcriptions. Crowdsourcing ref= s for <placeName> and <persName> tags in the corpora of the German Text Archive/Deutsches Textarchiv

Christian Thomas, Matthias Boenig, Alexander Geyken, and 5 more authors

2015

Mehr als ‘schmutzige OCR’: die Aufwertungen von historischen Volltextdigitalisaten zu Forschungsdaten

Matthias Boenig and Alexander Geyken

2015

Historical newspapers & journals for the DTA

Susanne Haaf and Matthias Schulz

Proceedings of the LREC Workshop on Language Resources and Technologies for Processing and Linking Historical Documents and Archives—Deploying Linked Open Data in Cultural Heritage (LRT4HDA), 2014