🚀 Train Your Own Language Model with Transformers 🤖
🔍 Finding a Dataset
First, embark on an adventure to find a corpus of text in Esperanto. No worries, we'll make it fun!
- 🔍 Explore the Esperanto portion of the OSCAR corpus from INRIA.
- 📚 Concatenate with the Esperanto sub-corpus of the Leipzig Corpora Collection.
Voilà! Merge the two dumps into ./eo_data/ (a quick sketch follows) and you've got yourself a dataset to train your model!
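Assuming you've already downloaded the two dumps as plain text (the input filenames below are placeholders for wherever you saved them), here's a minimal sketch for merging them into a single one-sentence-per-line file under ./eo_data/:
# Merge the downloaded corpora into one plain-text file under ./eo_data/
# (the input filenames are placeholders)
from pathlib import Path

raw_files = ["oscar_eo.txt", "leipzig_eo.txt"]  # hypothetical download names
out_dir = Path("./eo_data")
out_dir.mkdir(exist_ok=True)

with open(out_dir / "eo_all.txt", "w", encoding="utf-8") as combined:
    for name in raw_files:
        with open(name, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # keep one non-empty sentence/segment per line
                    combined.write(line + "\n")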
# Find a dataset
# Import necessary libraries
from pathlib import Path
# Define paths to text files
paths = [str(x) for x in Path("./eo_data/").glob("**/*.txt")]
# Display paths
print(paths)
🧠 Training a Tokenizer
Let's whip up a byte-level Byte-Pair Encoding (BPE) tokenizer, the same kind GPT-2 and RoBERTa use, like a wizard casting spells!
⚡️ Train it with ByteLevelBPETokenizer from the tokenizers library, customize the vocabulary size and special tokens, and save it to disk.
🔥 Wow, that was fast! Now you have a powerful tokenizer ready to go!
# Train a tokenizer
# Import necessary libraries
from tokenizers import ByteLevelBPETokenizer
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training: 52,000-token vocabulary and RoBERTa-style special tokens
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
# Save vocab.json and merges.txt to disk in a dedicated directory
Path("esperberto").mkdir(exist_ok=True)
tokenizer.save_model("esperberto")
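A quick sanity check, using the files we just saved: load the tokenizer back, add RoBERTa-style post-processing so every sequence gets wrapped in <s> ... </s>, and encode a sample sentence.
# Reload the trained tokenizer and try it on a sample sentence
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "esperberto/vocab.json",
    "esperberto/merges.txt",
)
# Wrap every encoded sequence in <s> ... </s>, as RoBERTa expects
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

encoded = tokenizer.encode("Mi estas Julien.")
print(encoded.tokens)  # subword tokens, wrapped in <s> ... </s>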
🎓 Train a Language Model from Scratch
Time to unleash the magic of training a language model!
🚂 Choo-choo! Use the run_language_modeling.py script from transformers and experiment with the hyperparameters, or wire up the Trainer API yourself (a sketch follows the code below).
🔥🔥🔥 Let the training begin! Watch your model learn and grow!
# Train a language model from scratch
# Import necessary libraries
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizer
# Load the tokenizer we just trained; cap sequences at 512 tokens to match the model's positions
tokenizer = RobertaTokenizer.from_pretrained("esperberto", model_max_length=512)
# Configure model
config = RobertaConfig(
vocab_size=52_000,
max_position_embeddings=514,
num_attention_heads=12,
num_hidden_layers=6,
type_vocab_size=1,
)
# Initialize model
model = RobertaForMaskedLM(config=config)
# Print model architecture
print(model)
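The snippet above only builds an untrained model. If you'd rather drive training from Python than call run_language_modeling.py, here is a rough sketch with the Trainer API; the combined-corpus filename, output directory, and hyperparameter values are illustrative rather than prescriptive. Saving to ./models/EsperBERTo-small keeps things consistent with the fill-mask example further down.
# Train the masked language model with the Trainer API (illustrative settings)
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# One sentence per line, truncated to 512 tokens
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./eo_data/eo_all.txt",  # hypothetical combined corpus file
    block_size=512,
)

# Dynamic masking: 15% of tokens are masked on the fly for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./models/EsperBERTo-small",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()

# Save model and tokenizer where the fill-mask example below expects them
trainer.save_model("./models/EsperBERTo-small")
tokenizer.save_pretrained("./models/EsperBERTo-small")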
🔍 Checking Your LM
Is your language model actually learning something cool?
🔍 Peek into the FillMaskPipeline to see what your LM can do!
😃 Have fun with simple and complex prompts to see the magic unfold!
# Check the LM
# Import necessary libraries
from transformers import pipeline
# Create FillMaskPipeline
fill_mask = pipeline(
"fill-mask",
model="./models/EsperBERTo-small",
tokenizer="./models/EsperBERTo-small"
)
# Test the LM: the pipeline fills in the <mask> token ("La suno <mask>." = "The sun <mask>.")
result = fill_mask("La suno <mask>.")
print(result)
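Each candidate returned by the pipeline is a dict whose keys (in recent transformers releases) include sequence, score, token, and token_str, so a quick way to skim the top guesses:
# Print each predicted fill with its probability score
for pred in result:
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")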
🎯 Fine-tuning Your LM
Time to fine-tune your LM on a downstream task! Let's level up!
⚙️ Fine-tune your model for part-of-speech tagging with RobertaForTokenClassification. Easy peasy! (A dataset-prep sketch follows just below.)
🔥 Your model is evolving! Watch the training loss converge!
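The fine-tuning code below assumes that train_dataset and eval_dataset already exist. Here is a minimal, hypothetical sketch of how they could be built for POS tagging with a fast tokenizer; the PosTaggingDataset class, the label_names tag set, and the toy sentences are placeholders for a real annotated Esperanto corpus (e.g. one in CoNLL format).
# Build toy token-classification datasets (placeholder tag set and sentences)
import torch
from torch.utils.data import Dataset
from transformers import RobertaTokenizerFast

label_names = ["ADJ", "ADV", "DET", "NOUN", "PRON", "VERB"]  # hypothetical tag set
label2id = {tag: i for i, tag in enumerate(label_names)}

# RoBERTa's fast tokenizer needs add_prefix_space=True for pre-tokenized input
tokenizer = RobertaTokenizerFast.from_pretrained(
    "./models/EsperBERTo-small", add_prefix_space=True
)

class PosTaggingDataset(Dataset):
    """Pre-tokenized sentences paired with one POS tag per word."""

    def __init__(self, sentences, tags):
        self.encodings = tokenizer(
            sentences,
            is_split_into_words=True,
            truncation=True,
            padding="max_length",
            max_length=128,
        )
        # Align word-level tags to subword tokens; special and continuation
        # subwords get -100 so the loss ignores them.
        self.labels = []
        for i, word_tags in enumerate(tags):
            word_ids = self.encodings.word_ids(batch_index=i)
            previous, aligned = None, []
            for word_id in word_ids:
                if word_id is None or word_id == previous:
                    aligned.append(-100)
                else:
                    aligned.append(label2id[word_tags[word_id]])
                previous = word_id
            self.labels.append(aligned)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy examples only -- swap in your real annotated Esperanto data
train_sentences = [["La", "suno", "brilas"]]
train_tags = [["DET", "NOUN", "VERB"]]
train_dataset = PosTaggingDataset(train_sentences, train_tags)
eval_dataset = PosTaggingDataset(train_sentences, train_tags)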
# Fine-tune the LM
# Import necessary libraries
from transformers import RobertaForTokenClassification, Trainer, TrainingArguments
# Define fine-tuning arguments
training_args = TrainingArguments(
output_dir="./models/EsperBERTo-finetuned",
num_train_epochs=3,
per_device_train_batch_size=8,
per_device_eval_batch_size=16,
warmup_steps=500,
weight_decay=0.01,
logging_dir="./logs",
)
# Initialize model for fine-tuning: pretrained LM weights plus a fresh token-classification head
# (num_labels must match your POS tag set; label_names comes from the sketch above)
model = RobertaForTokenClassification.from_pretrained(
    "./models/EsperBERTo-small", num_labels=len(label_names)
)
# Initialize trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
# Start fine-tuning
trainer.train()
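Once training finishes, a quick evaluation on the held-out split reports the evaluation loss (pass a compute_metrics function to Trainer if you also want accuracy or F1):
# Evaluate the fine-tuned model on eval_dataset
metrics = trainer.evaluate()
print(metrics)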
🌟 Share Your Model
🙂 Congratulations! You've created a masterpiece! It's time to share it with the world!
📦 Upload your model using the CLI (or the Python helpers sketched below) and write a cool README.md, a.k.a. your model card!
🎉 TADA! Your model has a page on huggingface.co/models for everyone to enjoy!
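If you'd rather stay in Python than use the CLI, the push_to_hub helpers in transformers do the same job; the repo name below is just an example, and you need to be logged in to the Hub first (e.g. via huggingface-cli login).
# Upload the fine-tuned model and tokenizer to the Hugging Face Hub
# (illustrative repo name; requires a prior Hub login)
model.push_to_hub("EsperBERTo-small")
tokenizer.push_to_hub("EsperBERTo-small")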