About Me#

This is an overview of my open source models, datasets, projects and contributions.

Models#

german-nlp-group/electra-base-german-uncased

German Electra NLP model, joint work with Philipp Reißel (ambeRoad)

Talk about this model:
BEYOND BERT – Challenges and Potentials in the Training of German Language Models

German T5 models in 3 different sizes

These models are trained on our GC4 corpus.

Joint work with Stefan Schweter (schweter.ml) and Philipp Schmid (Hugging Face).

T-Systems-onsite/cross-en-de-roberta-sentence-transformer

This model computes sentence (text) embeddings for English and German text. These embeddings can then be compared using cosine similarity to find sentences with a similar meaning.
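The comparison step can be illustrated with a minimal sketch. It assumes the embeddings are already available as plain lists of floats; the three toy vectors below are hypothetical and only stand in for real model output, which has many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption): an English sentence, its German
# counterpart, and an unrelated sentence.
emb_en = [0.9, 0.1, 0.3]
emb_de = [0.8, 0.2, 0.4]
emb_other = [-0.5, 0.9, -0.1]

sim_close = cosine_similarity(emb_en, emb_de)
sim_far = cosine_similarity(emb_en, emb_other)

# A higher value means the two sentences are closer in meaning.
print(sim_close > sim_far)
```

Because cosine similarity depends only on the direction of the vectors, not their length, it works the same way whether or not the embeddings are normalized.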

T-Systems-onsite/mt5-small-sum-de-en-v2

A bilingual summarization model for English and German. It is based on the multilingual T5 model google/mt5-small.

German ELMo Model

This is a German ELMo deep contextualized word representation. It is trained on a special German Wikipedia Text Corpus.

Datasets#

Ger-RAG-eval

This dataset is intended for evaluating the German RAG (retrieval-augmented generation) capabilities of LLMs. It is based on the test set of the deutsche-telekom/wikipedia-22-12-de-dpr dataset (also see wikipedia-22-12-de-dpr on GitHub) and consists of 4 subsets or tasks.

Ger-RAG-eval is also implemented in LightEval.

wikipedia-22-12-de-dpr

This is a German dataset for DPR model training. DPR (Dense Passage Retrieval) is one of the most important components of RAG applications. German document retrieval models can be trained on this dataset.

The unique feature of this dataset is that it contains training data not only for questions but also for imperative questions. An imperative question is phrased as a command or an instruction. Since German has both a formal and an informal form of address, the imperative questions are included in both forms.

The German colossal, cleaned Common Crawl corpus (GC4 corpus)

This is a German text corpus based on Common Crawl. The corpus is 454 GB packed and more than 1 TB unpacked. It has been cleaned and preprocessed and can be used for various NLP tasks. The dataset is joint work with Philipp Reißel (ambeRoad).

STSb Multi MT

Machine-translated multilingual versions and the English original of the STSbenchmark dataset. The translation was done with deepl.com.
This dataset is available on GitHub and as a Hugging Face Dataset.

German Backtranslated Paraphrase Dataset

This is a dataset of more than 21 million German paraphrases. These are text pairs that have the same meaning but are expressed in different words. The dataset can be used, for example, to train semantic text embeddings with SentenceTransformers and the MultipleNegativesRankingLoss.
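To show why paraphrase pairs alone are enough for this kind of training, here is a minimal pure-Python sketch of the idea behind a multiple-negatives ranking loss. It is not the SentenceTransformers implementation: the function names, the `scale` value, and the toy vectors are assumptions made for illustration. Within a batch, each text is scored against every paraphrase candidate, its own paraphrase is treated as the correct answer, and all other candidates act as negatives.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over a batch of (anchor, positive) embedding pairs.

    positives[i] is the paraphrase of anchors[i]; every other positives[j]
    in the batch is used as a negative. `scale` is an illustrative assumption.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of picking the true paraphrase.
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)


# Hypothetical toy embeddings (assumption): two anchors and their paraphrases.
anchors = [[1.0, 0.0], [0.0, 1.0]]
paraphrases = [[0.9, 0.1], [0.1, 0.9]]

loss_aligned = multiple_negatives_ranking_loss(anchors, paraphrases)
loss_shuffled = multiple_negatives_ranking_loss(anchors, paraphrases[::-1])

# The loss is low when each anchor is closest to its own paraphrase.
print(loss_aligned < loss_shuffled)
```

No explicit negative examples are needed, because the other pairs in the batch supply them. This is why a large paraphrase collection is a natural fit for this loss.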

Wikipedia 2 Corpus

Tools to extract and clean Wikipedia texts and transform them into a text corpus for self-supervised NLP model training. Also includes a prepared corpus for English and German.

NLU Evaluation Data - German and English + Similarity

This repository contains two datasets:

A labeled multi-domain (21 domains) German and English dataset with 25K user utterances for human-robot interaction. It is also available as a Hugging Face dataset: deutsche-telekom/NLU-Evaluation-Data-en-de
A dataset with 1,127 German sentence pairs with a similarity score. The sentences originate from the first dataset.

deutsche-telekom/NLU-few-shot-benchmark-en-de

This is a few-shot training dataset from the domain of human-robot interaction. It contains texts in German and English with 64 different utterances (classes). Each utterance (class) has exactly 20 samples in the training set, which gives a total of 1,280 training samples.

The dataset is intended to benchmark the intent classifiers of chatbots in English and especially in German. It builds on our deutsche-telekom/NLU-Evaluation-Data-en-de dataset.

Projects#

Machine Learning Tool Box 2 (Documentation Page)

A box of different machine learning tools. It contains tools for:
data loading, fastText, files, Markdown, OpenAI, Optuna, plot, SoMaJo, text cleaning, Transformers

XLSR – Cross-Lingual Sentence Representations

Models and training code for cross-lingual sentence representations like T-Systems-onsite/cross-en-de-roberta-sentence-transformer

LightGBM Tools

This Python package implements tools for LightGBM. In the current version, lightgbm-tools focuses on binary classification metrics.

ML-Cloud-Tools

Tools for machine learning in cloud environments. At the moment, it contains only a tool for handling Amazon S3 easily.

Census-Income with LightGBM and Optuna

This project fits LightGBM models on the census income data. It is not intended to achieve top results, but rather serves as a demo of the interaction between LightGBM, Optuna and HPOflow. The use of HPOflow is optional and can be removed if desired. We also calculate the feature importances with SHAP (SHapley Additive exPlanations).

S.M.A.R.T. Prometheus Metrics Exporter

smart-prom-next is a Prometheus metric exporter for S.M.A.R.T. values of hard disks.

MLflow Image

The MLflow Docker image.
MLflow does not provide an official Docker image. This project fills that gap.

Lazy-Imports

Python tool to support lazy imports

Style-Doc

This is Black for Python docstrings and reStructuredText (rst). It can be used to format docstrings (Google docstring format) in Python files or reStructuredText.

conda-forge/hyperopt-feedstock

conda-forge release of Hyperopt

Pull Requests#

Hugging Face / Transformers

add classifier_dropout to classification heads: #12794
add option for subword regularization in sentencepiece tokenizer: #11149, #11417
add strip_accents to basic BertTokenizer: #6280
refactor slow sentencepiece tokenizers and add tests: #11716, #11737
more fixes and improvements

Optuna

add MLflow integration callback: #1028
trial level suggest for same variable with different parameters give warning: #908
more fixes and improvements

Sentence Transformers

add callback so we can do pruning and check for nan values: #327
add option to pass params to tokenizer: #342
always store best_score: #439
fix for OOM problems on GPU with large datasets: #525
more fixes and improvements

SetFit - Efficient Few-shot Learning with Sentence Transformers

add option to normalize embeddings #177
add option to set samples_per_label #196
add warmup_proportion param - make warmup_steps configurable #140
add option to use amp / FP16 #134
add num_epochs to train_step calculation #139
add more loss function options #159
more fixes and improvements

Other Fixes and Improvements

google-research/electra: add toggle to turn off strip_accents #88
opensearch-project/opensearch-py: add Sphinx to generate Code Documentation #112 - also see API Reference
deepset-ai/FARM: various fixes and improvements
hyperopt/hyperopt: add progressbar with tqdm #455
mlflow/mlflow: add possibility to use client cert. with tracking API #2843

Archived Projects#

Machine Learning Tool Box

This is the machine learning tool box, a collection of useful machine learning tools intended for reuse and extension. The toolbox contains the following modules: hyperopt, keras, lightgbm, shap, metrics, plot, tools

The main functionality is now available in MLTB2.

HPOflow

Tools for Optuna, MLflow and the integration of both

Transformer-Tools

Tools for Hugging Face / Transformers

PyCharm Community Edition IDE for Python with bundled JRE

An Arch Linux package (AUR)