German Electra Training#
Steps for training#
Step 1 - generate Vocab#
import os
from pathlib import Path

from tokenizers import BertWordPieceTokenizer

save_dir = "./<vocab_save_dir>"
paths = [str(x) for x in Path("<text_corpus_dir>").glob("*.txt")]
print('text corpus files:', paths)

vocab_size = 32_767  # 2^15 - 1
min_frequency = 2

os.makedirs(save_dir, exist_ok=True)

# the first 5 entries are the standard BERT special tokens,
# the rest of the reserved range is filled with [unusedN] placeholders
special_tokens = [
    "[PAD]",
    "[UNK]",
    "[CLS]",
    "[SEP]",
    "[MASK]"]
for i in range(767 - 5):
    special_tokens.append('[unused{}]'.format(i))

# do not strip accents so German umlauts are preserved in the vocab
tokenizer = BertWordPieceTokenizer(strip_accents=False)
tokenizer.train(files=paths,
                vocab_size=vocab_size,
                min_frequency=min_frequency,
                special_tokens=special_tokens,
                )

tokenizer.save_model(save_dir)
tokenizer.save(save_dir + "/tokenizer.json")
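As a quick sanity check you can reload the saved tokenizer.json and verify that umlauts survive tokenization, i.e. that strip_accents is really disabled. This is a minimal sketch; the path and the sample sentence are placeholders and not part of the original script:
from tokenizers import Tokenizer

# reload the tokenizer saved above and tokenize a German sample sentence
tok = Tokenizer.from_file("./<vocab_save_dir>/tokenizer.json")
encoding = tok.encode("Die Bäume am Fluss sind über hundert Jahre alt.")
print(encoding.tokens)  # the tokens should still contain "ä" and "ü"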
Step 2 - get modified code to disable strip_accents#
Use the modified fork PhilipMay/electra (branch no_strip_accents) to disable strip_accents. Also see this PR: google-research/electra#88
git clone -b no_strip_accents https://github.com/PhilipMay/electra.git
cd electra
Step 3 - create TF datasets#
python3 build_pretraining_dataset.py --corpus-dir ~/data/nlp/corpus/ready/ --vocab-file ~/dev/git/german-transformer-training/src/vocab_no_strip_accents/vocab.txt --output-dir ./tf_data --max-seq-length 512 --num-processes 8 --do-lower-case --no-strip-accents
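To verify that Step 3 actually produced data, you can count the serialized examples in the output directory. This is a hedged TF 2.x sketch; the shard file name pattern and the ./tf_data path are assumptions about what build_pretraining_dataset.py writes to --output-dir:
import tensorflow as tf

# count the TFRecord shards and serialized examples written in Step 3
files = tf.io.gfile.glob("./tf_data/pretrain_data.tfrecord*")
dataset = tf.data.TFRecordDataset(files)
num_examples = sum(1 for _ in dataset)
print("{} shards, {} examples".format(len(files), num_examples))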
Step 4 - training#
This needs to be done on a machine with a strong GPU or, even better, a TPU.
python3 run_pretraining.py --data-dir gs://<dir> --model-name 02_Electra_Checkpoints_32k_766k_Combined --hparams '{"pretrain_tfrecords": "gs://<dir>/*" , "model_size": "base", "vocab_file": "gs://<dir>/german_electra_uncased_no_strip_accents_vocab.txt", "num_train_steps": 766000, "max_seq_length": 512, "learning_rate": 2e-4, "embedding_size" : 768, "generator_hidden_size": 0.33333, "vocab_size": 32767, "keep_checkpoint_max": 0, "save_checkpoints_steps": 5000, "train_batch_size": 256, "use_tpu": true, "num_tpu_cores": 8, "tpu_name": "electrav5"}'
Config:
debug False
disallow_correct False
disc_weight 50.0
do_eval False
do_lower_case True
do_train True
electra_objective True
embedding_size 768
eval_batch_size 128
gcp_project None
gen_weight 1.0
generator_hidden_size 0.33333
generator_layers 1.0
iterations_per_loop 200
keep_checkpoint_max 0
learning_rate 0.0002
lr_decay_power 1.0
mask_prob 0.15
max_predictions_per_seq 79
max_seq_length 512
model_dir gs://<dir>
model_hparam_overrides {}
model_name 02_Electra_Checkpoints_32k_766k_Combined
model_size base
num_eval_steps 100
num_tpu_cores 8
num_train_steps 766000
num_warmup_steps 10000
pretrain_tfrecords gs://<dir>/*
results_pkl gs://<dir>/results/unsup_results.pkl
results_txt gs://<dir>/results/unsup_results.txt
save_checkpoints_steps 5000
temperature 1.0
tpu_job_name None
tpu_name electrav5
tpu_zone None
train_batch_size 256
uniform_generator False
untied_generator True
untied_generator_embeddings False
use_tpu True
vocab_file gs://<dir>/german_electra_uncased_no_strip_accents_vocab.txt
vocab_size 32767
weight_decay_rate 0.01
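Instead of passing the long inline JSON string, the hyperparameters can be kept in a file under version control; run_pretraining.py also accepts a path ending in .json for --hparams. A minimal sketch that writes the values used above (the gs://<dir> bucket paths stay placeholders):
import json

# the same hyperparameters as in the --hparams string above
hparams = {
    "pretrain_tfrecords": "gs://<dir>/*",
    "model_size": "base",
    "vocab_file": "gs://<dir>/german_electra_uncased_no_strip_accents_vocab.txt",
    "num_train_steps": 766000,
    "max_seq_length": 512,
    "learning_rate": 2e-4,
    "embedding_size": 768,
    "generator_hidden_size": 0.33333,
    "vocab_size": 32767,
    "keep_checkpoint_max": 0,
    "save_checkpoints_steps": 5000,
    "train_batch_size": 256,
    "use_tpu": True,
    "num_tpu_cores": 8,
    "tpu_name": "electrav5",
}
with open("hparams.json", "w") as f:
    json.dump(hparams, f, indent=2)
The training call then becomes python3 run_pretraining.py --data-dir gs://<dir> --model-name 02_Electra_Checkpoints_32k_766k_Combined --hparams hparams.json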
Convert Model to PyTorch#
python /home/phmay/miniconda3/envs/farm-git/lib/python3.7/site-packages/transformers/convert_electra_original_tf_checkpoint_to_pytorch.py --tf_checkpoint_path ./model.ckpt-40000 --config_file ./config.json --pytorch_dump_path ./pytorch_model.bin --discriminator_or_generator='discriminator'
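To verify the converted checkpoint, it can be loaded with transformers. This is a minimal sketch assuming a recent transformers version and that config.json, pytorch_model.bin and vocab.txt sit in the current directory; the sample sentence is only an illustration:
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

# tokenizer configured like the pretraining vocab: lower-cased, accents kept
tokenizer = ElectraTokenizerFast("./vocab.txt", do_lower_case=True, strip_accents=False)
model = ElectraForPreTraining.from_pretrained("./")
model.eval()

inputs = tokenizer("Die Bäume am Fluss sind über hundert Jahre alt.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # per-token replaced-token-detection scores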
More info on conversion: