Fine-tuning Gemma for Arabic

In this tutorial, we will explore how to fine-tune the Gemma model for Arabic-language tasks. Our concrete use case involves customizing the model so it can read and respond to Arabic email requests in a customer support setting. By the end of this guide, you will be able to adapt Gemma to handle specific Arabic language tasks, potentially improving conversational AI applications like virtual assistants or automated email triage systems.
About the Referenced Tools and Resources
- Google Colab: A free cloud-based notebook environment with GPUs/TPUs that makes it easy to prototype and test machine learning models.
- Kaggle: A platform hosting datasets, coding notebooks, and competitions. It provides convenient access to numerous datasets for training and testing machine learning models.
- Keras/Keras-NLP: A high-level deep learning API written in Python, with easy-to-use libraries like keras_nlp that streamline the fine-tuning of language models.
- Gemma: A family of pre-trained language models by Google, designed for various language understanding and generation tasks. Here, we're using "gemma2_instruct_2b_en" as the base model and adapting it to Arabic.
- LoRA (Low-Rank Adapters): A parameter-efficient fine-tuning technique that adds small "adapter" matrices to a model, enabling faster and more resource-efficient training without modifying the entire model's weights (a quick numerical sketch follows this list).
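To make that parameter-efficiency claim concrete, here is a minimal arithmetic sketch in Python, using illustrative layer sizes rather than Gemma's actual weight shapes:
# Illustrative sizes only; Gemma's real projection shapes differ.
d_in, d_out, rank = 2048, 2048, 4

full_update_params = d_in * d_out           # updating the full weight matrix
lora_update_params = rank * (d_in + d_out)  # low-rank factors A (d_in x rank) and B (rank x d_out)

print(full_update_params)   # 4194304
print(lora_update_params)   # 16384 -- roughly 0.4% of the full update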
Responding to Arabic Email Requests
Imagine you have a customer support system receiving a large volume of emails in Arabic. Many of these emails ask similar questions regarding product details, return policies, or account information. By fine-tuning Gemma on a dataset of example email queries and responses, you can create a model that quickly generates helpful, context-aware replies in Arabic.
Example of Our arabic_email_dataset
Here’s a small, illustrative example of what a portion of our dataset could look like. This dataset contains a series of entries, each with a "prompt" (the customer's email request) and a "response" (the ideal support reply).
[
  {
    "prompt": "مرحباً، قمت بشراء حاسوب محمول منكم أمس وأود الاستفسار عن موعد تسليمه.",
    "response": "مرحباً بك! شكراً لتواصلك معنا. طلبك في مرحلة الشحن حالياً ومن المتوقع أن يصلك خلال يومين إلى ثلاثة أيام. سنرسل لك رسالة تأكيد برقم تتبع الطلب قريباً."
  },
  {
    "prompt": "أريد إرجاع منتج اشتريته من موقعكم لأن اللون غير مطابق لما طلبته.",
    "response": "شكراً لتواصلك معنا. نحن نتفهم عدم رضاك عن المنتج. يُرجى زيارة صفحة الإرجاع في حسابك، واتباع الخطوات الموضحة. بعد وصول المنتج إلينا سنقوم بمعالجة طلب الإرجاع أو الاستبدال فوراً."
  },
  {
    "prompt": "قمت بتغيير عنوان الشحن الخاص بي، هل يمكنكم تأكيد تحديث الطلب وفقاً لذلك؟",
    "response": "مرحباً! تم تحديث عنوان الشحن بنجاح. سيتم توصيل طلبك إلى العنوان الجديد وفق الجدول الزمني المعتاد. شكراً لتعاملك معنا!"
  }
]
In practice, you would have a much larger dataset, possibly hosted on Kaggle, that includes hundreds or thousands of similar entries. Each entry provides a prompt and a desired response, giving the model plenty of examples to learn how to generate appropriate replies.
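For example, once the Kaggle credentials from the Setup section below are configured, one way to pull such a dataset into Colab is the Kaggle CLI. The dataset slug here is a placeholder, not a real dataset:
# Hypothetical dataset slug -- replace with your own Kaggle dataset.
!kaggle datasets download -d your-username/arabic-email-dataset --unzip -p .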
Setup
Select the Colab Runtime
- Open a new Colab notebook at https://colab.research.google.com.
- In the upper-right corner, select ▾ (Additional connection options).
- Choose Change runtime type.
- Under Hardware accelerator, select L4 or A100 GPU to ensure fast training.
Gemma Setup on Kaggle
Follow the instructions at Gemma setup to access Gemma on Kaggle. This involves:
- Creating a Kaggle account.
- Generating a Kaggle API key (used together with your username to authenticate).
- Installing and authenticating Kaggle from within your Colab environment.
Set Environment Variables
Set environment variables for KAGGLE_USERNAME and KAGGLE_KEY to access Kaggle datasets. Also, mount Google Drive if you plan to store data and checkpoints there.
import os
from google.colab import userdata, drive
os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")
drive.mount("/content/drive")
Install Dependencies
Ensure all required dependencies are installed:
!pip install -q -U keras-nlp datasets
!pip install -q -U keras
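Keras 3 is multi-backend (JAX, TensorFlow, or PyTorch). If you want to choose the backend explicitly, set the KERAS_BACKEND environment variable before the first import of keras; JAX is shown here as one possible choice:
import os

# Must run before "import keras"; "tensorflow" and "torch" are also valid values.
os.environ["KERAS_BACKEND"] = "jax"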
Load the Gemma Model
Load the Gemma causal language model. We start from an English-instruct model (gemma2_instruct_2b_en) and will adapt it for Arabic tasks.
import keras_nlp
import keras
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma2_instruct_2b_en")
gemma_lm.summary()
Prepare the Arabic Dataset
Replace the following code with the actual dataset you have on Kaggle. In this example, we assume you uploaded your arabic_email_dataset.json to a Kaggle dataset.
from datasets import load_dataset
# Replace 'your_arabic_email_dataset' with a real dataset name or a path to your JSON file.
ds = load_dataset("json", data_files="your_arabic_email_dataset.json", split="train")
ds = ds.shuffle(seed=42)
Each entry in ds should look like this:
print(ds[0])
# Example output:
# {
# "prompt": "مرحباً، قمت بشراء حاسوب محمول منكم أمس وأود الاستفسار عن موعد تسليمه.",
# "response": "مرحباً بك! شكراً لتواصلك معنا..."
# }
Before fine-tuning, format each prompt/response pair into a single training string. GemmaCausalLM includes a preprocessor that handles tokenization internally, and its multilingual vocabulary covers Arabic text.
# Combine each prompt and its response into one instruction-style training string.
template = "Instruction:\n{prompt}\n\nResponse:\n{response}"
train_data = [
    template.format(prompt=example["prompt"], response=example["response"])
    for example in ds
]

# Limit the tokenized sequence length to keep memory usage manageable.
gemma_lm.preprocessor.sequence_length = 256
LoRA Fine-tuning
Enable LoRA to efficiently fine-tune only a subset of parameters.
gemma_lm.backbone.enable_lora(rank=4)
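Printing the summary again is a useful sanity check: with LoRA enabled, only the adapter weights remain trainable, so the trainable parameter count drops to a small fraction of the model's total parameters.
# Trainable params should now be only the LoRA adapter weights.
gemma_lm.summary()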
Compile the model:
optimizer = keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=0.01)
gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
)
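Optionally, you can keep weight decay from being applied to bias and layer-normalization scale variables, a common tweak when fine-tuning with AdamW; run this before calling fit:
# Exclude bias and scale (layer norm) variables from weight decay.
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])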
Train the model:
history = gemma_lm.fit(train_data, epochs=20, batch_size=2)
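Because Google Drive was mounted during setup, you can also keep a copy of the fine-tuned weights there; the path below is only an example location:
# Save the fine-tuned weights (including the LoRA adapters) to Drive.
gemma_lm.save_weights("/content/drive/MyDrive/gemma_arabic_email.weights.h5")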
Testing the Fine-Tuned Model
After training, test the model by providing a new Arabic email prompt, formatted with the same template used during training:
test_prompt = "أود الاستفسار عن سياسة الاستبدال لديكم."
prompt = "Instruction:\n{}\n\nResponse:\n".format(test_prompt)

# generate() accepts a raw string; the preprocessor handles tokenization and decoding.
# The returned string contains the prompt followed by the generated reply.
output = gemma_lm.generate(prompt, max_length=64)
print("User Email:", test_prompt)
print("Gemma Response:", output)
If fine-tuning was successful, the model should produce a coherent and helpful Arabic response related to the exchange policy.
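Since generate() also accepts a list of prompts, a simple triage-style flow can answer several incoming emails in one batched call; the two example emails below are taken from the dataset sample shown earlier:
incoming_emails = [
    "مرحباً، قمت بشراء حاسوب محمول منكم أمس وأود الاستفسار عن موعد تسليمه.",
    "أريد إرجاع منتج اشتريته من موقعكم لأن اللون غير مطابق لما طلبته.",
]
prompts = ["Instruction:\n{}\n\nResponse:\n".format(email) for email in incoming_emails]

# Batched generation returns one reply string per prompt.
replies = gemma_lm.generate(prompts, max_length=64)
for email, reply in zip(incoming_emails, replies):
    print(email, "->", reply)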
Conclusion
By following these steps and using LoRA fine-tuning, you can adapt the Gemma model to Arabic language tasks. Our use case—responding to Arabic customer emails—demonstrates how you can create more contextually relevant and specialized language models. This approach can be extended to other domains, such as voice assistants, chatbots, or document summarizers operating in Arabic.
For additional details, consult the Gemma documentation and the Gemma Cookbook.