All Posts
The selection of topic-specific texts from Wikipedia
- 09 August 2024
For some machine learning tasks, you need texts on specific topics, for example as the basis for a DPR or RAG training data set. But which texts can be used, especially for languages other than English? And how can they be filtered by topic?
Pandas Data Format and Compression
- 02 July 2024
This is a systematic comparison of the most important pandas data formats (CSV, Parquet with PyArrow backend, and Feather) across different compression methods and compression levels. The comparison is based on the compression ratio and the time it takes to save and load the data. Factors such as RAM usage are not considered.
The importance of chat templates
- 11 April 2024
A long time ago, when GPT-3.5 (without turbo) was current, LLMs were simply trained to complete texts. When GPT-3.5-turbo was released in the wake of ChatGPT, there was a small but essential change: the LLM could now not only complete texts but also “read and understand” multi-turn conversations. At the same time, it became possible to use system prompts.
Pandas apply
- 18 November 2023
I often use Pandas to process NLP data. In many cases, I want to create a new column from the information in an existing column, for example the number of characters or tokens.
Options for Date Encoding
- 12 October 2022
Some data, such as strings, must be encoded to be used in machine learning models. Here we explore the different options for encoding date fields.
Python Installation and Package Management with conda and pip
- 23 July 2022
This article is about installing Python and managing packages. It is subjective and reflects my own opinion and experience. It is structured as a series of recommendations.
Anomalies in the MLSUM Dataset
- 23 February 2022
While evaluating the ml6team/mt5-small-german-finetune-mlsum summarization model, my colleague Michal Harakal and I noticed that in many cases the model simply reproduces the first sentence of the input text instead of generating an independent summary of the whole text.
Clean German Wikipedia Text Corpus released
- 22 February 2022
Today I published a new Wikipedia-based German text corpus. It is intended for NLP machine learning tasks.
LightGBM with Optuna: Demo released
- 20 February 2022
This week I published a project that shows how to combine LightGBM and Optuna efficiently to train good models. It is intended to be reusable as a template for new projects.
German colossal, cleaned Common Crawl corpus (GC4) released
- 10 April 2021
Philipp Reißel (ambeRoad) and I published the largest German text corpus within the German NLP Group: the German colossal, cleaned Common Crawl corpus.
Training and Evaluation of our German Electra Language Model Talk
- 01 December 2020
Together with Philipp Reissel from ambeRoad, I gave a talk about the training and evaluation of our open-source German Electra NLP language model.