NLP Datasets#
Links#
XGLUE: https://microsoft.github.io/XGLUE/
License: non-commercial research purposes only
XNLI: facebookresearch/XNLI
XNLI 1.0
Multilingual
2,490 dev and 5,010 test examples per language
XNLI-15way: 10,000 multilingual parallel sentences without labels
License: non-commercial
Stanford Natural Language Inference (SNLI) Corpus: https://nlp.stanford.edu/projects/snli/
English language
3 classes: neutral, contradiction, entailment
The Multi-Genre NLI Corpus: https://cims.nyu.edu/~sbowman/multinli/
English language
3 classes: neutral, contradiction, entailment
PAWS-X: google-research-datasets/paws
Languages: de, en, es, fr, ja, ko, zh
49,401 train, 2,000 dev, 2,000 test; sizes differ slightly between languages
Columns: sentence1, sentence2, label
Warning: the dev and test sets overlap!
Label: binary (paraphrase or not a paraphrase)
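Because the dev and test splits overlap, it can help to measure the overlap before reporting results on both. The following is a minimal sketch, not an official tool. It assumes each split is already loaded as a list of dicts with the `sentence1` and `sentence2` columns listed above, and it treats two rows as overlapping when both sentences match after lowercasing and whitespace normalization. That matching rule is an assumption, and the helper names `normalize` and `find_overlap` as well as the toy rows are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not hide duplicates."""
    return " ".join(text.lower().split())


def pair_key(row):
    """Build a comparable key from the two sentence columns of a row."""
    return (normalize(row["sentence1"]), normalize(row["sentence2"]))


def find_overlap(dev_rows, test_rows):
    """Return the test rows whose sentence pair also appears in the dev rows."""
    dev_keys = {pair_key(row) for row in dev_rows}
    return [row for row in test_rows if pair_key(row) in dev_keys]


# Toy rows for illustration only; they are not taken from PAWS-X.
dev = [
    {"sentence1": "The cat sat on the mat.", "sentence2": "On the mat sat the cat.", "label": 1},
    {"sentence1": "He flew from Paris to Rome.", "sentence2": "He flew from Rome to Paris.", "label": 0},
]
test = [
    {"sentence1": "the cat sat  on the mat.", "sentence2": "On the mat sat the cat.", "label": 1},
    {"sentence1": "She likes tea.", "sentence2": "She enjoys tea.", "label": 1},
]

overlap = find_overlap(dev, test)
print(f"{len(overlap)} of {len(test)} test rows also appear in dev")
```

Rows found this way could be dropped from one of the splits before evaluation. A stricter or looser matching rule (for example, exact string match, or matching on a single sentence) may be more appropriate depending on how the overlap arises.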
STS benchmark
Original dataset: https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
STSb Multi MT: PhilipMay/stsb-multi-mt
CORD19STS
German Sentiment#
Text Corpus#
Multilingual Text Corpus#
Wikipedia (multiple languages): https://huggingface.co/datasets/wikimedia/wikipedia
LLM Datasets#
Function calling: https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2