Clean German Wikipedia Text Corpus released#

Today I published a new Wikipedia-based German text corpus, intended for use in NLP and machine-learning tasks.


Wikipedia#

The corpus is based on a Wikipedia database dump, which was extracted with WikiExtractor. A script is then provided to split the extracted texts into sentences using SoMaJo. Each line of the text corpus contains exactly one sentence, and consecutive Wikipedia articles are separated by a blank line.
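The following is a minimal sketch of how SoMaJo can be used to produce that format (one sentence per line, blank line after each article). It is not the repository's actual preprocessing script; the sample article text, the simple whitespace detokenization, and the output file name `corpus.txt` are illustrative assumptions.

```python
from somajo import SoMaJo

# SoMaJo sentence splitter / tokenizer for German ("de_CMC" model)
tokenizer = SoMaJo("de_CMC", split_camel_case=True)

def split_into_sentences(paragraphs):
    """Yield one sentence string per sentence detected by SoMaJo."""
    for sentence in tokenizer.tokenize_text(paragraphs):
        # Each sentence is a list of Token objects; joining token texts with
        # spaces is a simplified detokenization, sufficient for a sketch.
        yield " ".join(token.text for token in sentence)

# Hypothetical example article (list of paragraphs, as expected by SoMaJo)
article = [
    "Der Aletschgletscher ist der größte Gletscher der Alpen. "
    "Er liegt im Kanton Wallis."
]

with open("corpus.txt", "w", encoding="utf-8") as out:
    for sentence in split_into_sentences(article):
        out.write(sentence + "\n")
    out.write("\n")  # blank line marks the end of an article
```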

We tested SoMaJo extensively for sentence splitting, and it produces better results than other, much more popular classic NLP tools such as spaCy.

Both the preprocessing code and the corpus itself can be downloaded from GitHub: GermanT5/wikipedia2corpus
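Given the format described above, the corpus can be read back article by article with a few lines of Python. This is only a usage sketch; the file name `dewiki-corpus.txt` is a placeholder, not necessarily the name used in the repository.

```python
def read_articles(path):
    """Yield each article as a list of sentences.

    Assumes one sentence per line and a blank line between articles.
    """
    article = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                article.append(line)
            elif article:
                yield article
                article = []
    if article:  # last article may not be followed by a blank line
        yield article

# Hypothetical file name; see the repository for the actual corpus files.
for sentences in read_articles("dewiki-corpus.txt"):
    print(len(sentences), "sentences in this article")
```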