Posted in 2024

The selection of topic-specific texts from Wikipedia

Some machine learning tasks require texts on specific topics, for example as a basis for a DPR or RAG training data set. But which texts can be used, especially for non-English languages? And how can they be filtered by topic?

Read more ...


Pandas Data Format and Compression

This is a systematic comparison of the most important pandas data formats (CSV, Parquet with the PyArrow backend, and Feather) across different compression methods and compression levels. The comparison is based on the compression ratio and the time it takes to save and load the data. Factors such as RAM usage are not considered.

Read more ...


The importance of chat templates

A long time ago, when GPT-3.5 (without turbo) was current, LLMs were simply trained to complete texts. When GPT-3.5-turbo was released along with ChatGPT, there was a small but essential change: the LLM could now not only complete texts but also “read and understand” multi-turn conversations. At the same time, the option of using system prompts was introduced.

Read more ...