NLP Datasets

Links

XGLUE: https://microsoft.github.io/XGLUE/
- License: non-commercial research purposes only
XNLI: facebookresearch/XNLI
- XNLI 1.0
  - multilingual
  - 2,490 dev and 5,010 test examples per language
- XNLI-15way: 10,000 multilingual parallel sentences without labels
- License: non-commercial
Stanford Natural Language Inference (SNLI) Corpus: https://nlp.stanford.edu/projects/snli/
- English language
- 3 classes: neutral, contradiction, entailment
The Multi-Genre NLI Corpus: https://cims.nyu.edu/~sbowman/multinli/
- English language
- 3 classes: neutral, contradiction, entailment
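
SNLI and MultiNLI share the same three classes, so one label mapping can serve both. The following is a minimal sketch, assuming each example is a dict with a string `label` field; the integer ids, the `encode_labels` helper, and the toy records are illustrative choices, not part of either corpus.

```python
# Class names are taken from the notes above; the id order is an arbitrary assumption.
NLI_LABELS = ("entailment", "neutral", "contradiction")
LABEL_TO_ID = {name: idx for idx, name in enumerate(NLI_LABELS)}


def encode_labels(examples):
    """Attach an integer id to each example and skip labels outside the three classes."""
    encoded = []
    for example in examples:
        label = example["label"]
        if label not in LABEL_TO_ID:
            continue  # e.g. examples without a usable gold label
        encoded.append({**example, "label_id": LABEL_TO_ID[label]})
    return encoded


# Hypothetical toy records, not taken from SNLI or MultiNLI.
toy_examples = [
    {"premise": "A man plays a guitar.", "hypothesis": "A person makes music.", "label": "entailment"},
    {"premise": "A man plays a guitar.", "hypothesis": "The man is asleep.", "label": "contradiction"},
    {"premise": "A man plays a guitar.", "hypothesis": "The man is famous.", "label": "neutral"},
    {"premise": "A man plays a guitar.", "hypothesis": "It is raining.", "label": "-"},
]

encoded_examples = encode_labels(toy_examples)
```

Dropping unknown labels instead of raising keeps the sketch tolerant of records that carry no usable class; stricter pipelines may prefer to fail loudly.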
PAWS: google-research-datasets/paws
PAWS-X: google-research-datasets/paws
- Languages: de, en, es, fr, ja, ko, zh
- 49,401 train, 2,000 dev, 2,000 test; sizes can differ slightly between languages
- Fields: sentence1, sentence2, label
- dev and test sets overlap!
- Label: binary (paraphrase or not paraphrase)
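
Because the dev and test sets overlap, it can help to measure the overlap before reporting test scores. The sketch below assumes that "overlap" means identical sentence pairs (after whitespace and case normalization) and that each example is a dict with the `sentence1`, `sentence2`, and `label` fields listed above. The helper names and the toy data are hypothetical.

```python
def pair_key(example):
    """Normalize a sentence pair so whitespace and case differences do not hide duplicates."""
    first = " ".join(example["sentence1"].split()).lower()
    second = " ".join(example["sentence2"].split()).lower()
    return (first, second)


def find_overlap(dev, test):
    """Return the test examples whose normalized sentence pair also occurs in dev."""
    dev_keys = {pair_key(example) for example in dev}
    return [example for example in test if pair_key(example) in dev_keys]


def drop_overlap(dev, test):
    """Return the test examples whose normalized sentence pair does not occur in dev."""
    dev_keys = {pair_key(example) for example in dev}
    return [example for example in test if pair_key(example) not in dev_keys]


# Hypothetical toy splits, not taken from PAWS-X.
dev_split = [
    {"sentence1": "The cat sat on the mat.", "sentence2": "On the mat sat the cat.", "label": 1},
    {"sentence1": "He flew from Paris to Rome.", "sentence2": "He flew from Rome to Paris.", "label": 0},
]
test_split = [
    {"sentence1": "the cat sat  on the mat.", "sentence2": "On the mat sat the cat.", "label": 1},
    {"sentence1": "She reads a book.", "sentence2": "A book is read by her.", "label": 1},
]

overlapping = find_overlap(dev_split, test_split)
clean_test = drop_overlap(dev_split, test_split)
```

Whether to drop the overlapping examples or only report their count depends on how comparable the results need to be with published numbers.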
STS benchmark
- Original dataset: https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
- STSb Multi MT: PhilipMay/stsb-multi-mt
- Paper: https://arxiv.org/abs/1708.00055
CORD19STS
- Paper: https://arxiv.org/abs/2007.02461
- Dataset: https://gitlab.vista.isi.edu/xiaoguo/cord_19