Releasing Hindi ELECTRA model

This is a first attempt at a Hindi language model trained with Google Research's ELECTRA. I don't modify ELECTRA itself until we get into finetuning, and only then because the train and test file paths are hardcoded.


Additional background:

It's available on HuggingFace: - sample usage:

I was greatly influenced by:

Please ask questions in comments below or @mapmeld on Twitter



The corpus is two files:

Bonus notes:

  • Adding English wiki text or parallel corpus could help with cross-lingual tasks and training


Bonus notes:

  • Created with HuggingFace Tokenizers; the vocabulary could be longer or shorter, so review ELECTRA's vocab_size param
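
The vocab step above can be sketched with the HuggingFace tokenizers library. This is a minimal sketch, not the exact command used for this model: the corpus filename and vocab_size here are placeholder assumptions.

```python
# Sketch of training a WordPiece vocab with HuggingFace Tokenizers.
# "hindi_corpus.txt" and vocab_size=30000 are placeholders, not the
# actual values used for this model.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(files=["hindi_corpus.txt"], vocab_size=30000)
tokenizer.save_model(".")  # writes vocab.txt for ELECTRA's data-dir
```

Keeping lowercasing and accent-stripping off matters for Devanagari text, where "accents" are meaningful vowel signs.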

Pretrain TF Records splits the corpus into training documents

Set the ELECTRA model size and whether to split the corpus by newlines. This process can take hours on its own.

Bonus notes:

  • I am not sure what the corpus newline split actually means (what is the alternative?), or which option produces better training docs for this corpus
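
One plausible reading of the split option, sketched as a toy function (`split_docs` is illustrative, not ELECTRA's actual code): with the split on, blank lines mark document boundaries; with it off, the whole corpus is treated as one continuous document.

```python
def split_docs(corpus_text, blanks_separate_docs=True):
    # Toy sketch of the assumed semantics: blank lines either do or
    # do not mark document boundaries in the pretraining corpus.
    if blanks_separate_docs:
        return [d.strip() for d in corpus_text.split("\n\n") if d.strip()]
    return [corpus_text.strip()]

text = "sentence one\nsentence two\n\nnew doc here\n"
assert split_docs(text) == ["sentence one\nsentence two", "new doc here"]
assert split_docs(text, blanks_separate_docs=False) == [text.strip()]
```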


Structure your files as shown below, with the data-dir named "trainer" here

- vocab.txt
- pretrain_tfrecords
-- (all .tfrecord... files)
- models
-- modelname
--- checkpoint
--- graph.pbtxt
--- model.*
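
A quick way to scaffold that layout from Python (the directory names come from the tree above; "modelname" is a placeholder, and the checkpoint/model files are produced by ELECTRA itself):

```python
import os

def scaffold_data_dir(root="trainer", model_name="modelname"):
    # Build the empty ELECTRA data-dir layout shown above.
    # The .tfrecord, checkpoint, and model.* files are generated later
    # by ELECTRA's own scripts.
    os.makedirs(os.path.join(root, "pretrain_tfrecords"), exist_ok=True)
    os.makedirs(os.path.join(root, "models", model_name), exist_ok=True)
    open(os.path.join(root, "vocab.txt"), "a").close()  # vocab at data-dir root

scaffold_data_dir()
```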

The Colab notebook gives examples of GPU vs. TPU setup

Baby Model:

Baby2 Model (more training)

Using the model with transformers

It's available on HuggingFace: - sample usage:
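
A minimal sketch of loading an ELECTRA checkpoint with transformers; the model id below is a placeholder, so substitute the actual HuggingFace id from the link above.

```python
# Sketch of loading the discriminator with transformers.
# MODEL_ID is a placeholder, not the real model id.
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "model-id-goes-here"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

inputs = tokenizer("नमस्ते दुनिया", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```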


Sample Colab comparing to SimpleTransformers / MultilingualBERT

Each task (such as XNLI, BBC, Hindi Movie Reviews) is a hardcoded class.

Where to place your training and test/dev data in the file system (for data-dir = trainer)

- finetuning_data
-- xnli
--- train.tsv
--- dev.tsv
- models
-- model_name
--- finetuning_tfrecords
--- finetuning_models

^^ If things go wrong or you redesign your data, delete finetuning_tfrecords and finetuning_models
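
A small helper for that cleanup, with paths taken from the tree above ("model_name" is a placeholder):

```python
import os
import shutil

def reset_finetuning(data_dir="trainer", model_name="model_name"):
    # Delete cached finetuning artifacts so ELECTRA regenerates them
    # from the current data on the next run.
    for sub in ("finetuning_tfrecords", "finetuning_models"):
        shutil.rmtree(os.path.join(data_dir, "models", model_name, sub),
                      ignore_errors=True)

reset_finetuning()
```

Run this whenever you change the task data; the tfrecords are rebuilt on the next finetuning run.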

In finetune/

elif task_name == "bbc":
    return classification_tasks.BBC(config, tokenizer)

In finetune/classification/

class BBC(ClassificationTask):
  def __init__(self, config: configure_finetuning.FinetuningConfig, tokenizer):
    super(BBC, self).__init__(config, "bbc", tokenizer,
                               ['southasia', 'international', 'learningenglish', 'institutional', 'india', 'news', 'pakistan', 'multimedia', 'social', 'china', 'entertainment', 'science', 'business', 'sport'])

  def get_examples(self, split):
    return self._create_examples(read_tsv(
        os.path.join(self.config.raw_data_dir(self.name), split + ".csv"),
        max_lines=100 if self.config.debug else None), split)

  def _create_examples(self, lines, split):
    return self._load_glue(lines, split, 1, None, 0, skip_first_line=True)
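
As I read ELECTRA's code, the positional arguments to _load_glue above are text_a_loc=1, text_b_loc=None, label_loc=0: column 1 holds the text, column 0 the label, and there is no second sentence for this single-sentence task. A toy illustration of that column convention (not ELECTRA's actual implementation):

```python
def parse_row(row, text_a_loc=1, text_b_loc=None, label_loc=0):
    # Toy mirror of _load_glue's column convention: pick the text and
    # label columns by index; text_b is None for single-sentence tasks.
    return {
        "text_a": row[text_a_loc],
        "text_b": row[text_b_loc] if text_b_loc is not None else None,
        "label": row[label_loc],
    }

row = ["sport", "भारत ने मैच जीता"]  # label in column 0, text in column 1
ex = parse_row(row)
assert ex["label"] == "sport" and ex["text_b"] is None
```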