Clean German Wikipedia Text Corpus released#

Today I published a new Wikipedia-based German text corpus, intended for use in NLP and machine-learning tasks.


Wikipedia#

The corpus is based on a Wikipedia database dump, which was extracted with WikiExtractor. A script is then provided to split the extracted texts into sentences using SoMaJo. Each line of the text corpus contains exactly one sentence, and consecutive Wikipedia articles are separated by a blank line.
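The following is a minimal sketch of how SoMaJo can be used to produce that format (one sentence per line, blank line after each article). It is not the repository's actual preprocessing script; the sample article text, the simple whitespace detokenization, and the output file name `corpus.txt` are illustrative assumptions.

```python
from somajo import SoMaJo

# SoMaJo sentence splitter / tokenizer for German ("de_CMC" model)
tokenizer = SoMaJo("de_CMC", split_camel_case=True)

def split_into_sentences(paragraphs):
    """Yield one sentence string per sentence detected by SoMaJo."""
    for sentence in tokenizer.tokenize_text(paragraphs):
        # Each sentence is a list of Token objects; joining token texts with
        # spaces is a simplified detokenization, sufficient for a sketch.
        yield " ".join(token.text for token in sentence)

# Hypothetical example article (list of paragraphs, as expected by SoMaJo)
article = [
    "Der Aletschgletscher ist der größte Gletscher der Alpen. "
    "Er liegt im Kanton Wallis."
]

with open("corpus.txt", "w", encoding="utf-8") as out:
    for sentence in split_into_sentences(article):
        out.write(sentence + "\n")
    out.write("\n")  # blank line marks the end of an article
```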

We tested SoMaJo extensively for sentence splitting, and it produces better results than other, much more popular classic NLP tools such as spaCy.

Both the preprocessing code and the corpus itself can be downloaded from GitHub: GermanT5/wikipedia2corpus
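Given the format described above, the corpus can be read back article by article with a few lines of Python. This is only a usage sketch; the file name `dewiki-corpus.txt` is a placeholder, not necessarily the name used in the repository.

```python
def read_articles(path):
    """Yield each article as a list of sentences.

    Assumes one sentence per line and a blank line between articles.
    """
    article = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                article.append(line)
            elif article:
                yield article
                article = []
    if article:  # last article may not be followed by a blank line
        yield article

# Hypothetical file name; see the repository for the actual corpus files.
for sentences in read_articles("dewiki-corpus.txt"):
    print(len(sentences), "sentences in this article")
```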