Integrating ALLaM-7B-Instruct-preview with Ollama

Ollama provides a convenient way to run large language models locally. While many models are available directly via ollama pull, you can also import custom models, like ALLaM-AI/ALLaM-7B-Instruct-preview, by creating a Modelfile.

Understanding Model Formats: Safetensors vs. GGUF

Before diving into the import process, it's helpful to understand the different model formats involved:

  • Safetensors / PyTorch Format:

    • What it is: These formats (.safetensors, .bin, .pth) are standard for distributing models used in training and within frameworks like Hugging Face Transformers. They typically store high-precision model weights (e.g., 16-bit or 32-bit floating point). The official ALLaM-AI/ALLaM-7B-Instruct-preview model uses Safetensors.
    • Use Case: Primarily for model training, fine-tuning, and running inference using Python libraries (transformers, torch) often requiring powerful hardware (especially GPUs).
  • GGUF (GPT-Generated Unified Format):

    • What it is: A binary format specifically designed by the llama.cpp project for efficient inference (running the model) on a wider range of hardware, including CPUs and Apple Silicon (Metal).
    • Key Feature - Quantization: GGUF files usually contain quantized weights. Quantization reduces numeric precision (e.g., to 4-bit or 5-bit integers), significantly shrinking file size and memory usage and making large models accessible on consumer hardware (see the back-of-the-envelope math after this list).
    • Self-Contained: Bundles model weights, architecture details, and tokenizer information into a single file.
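
To make the savings concrete, here is a rough back-of-the-envelope calculation for a 7B-parameter model such as ALLaM (real GGUF files are slightly larger, since they also bundle the tokenizer and metadata, and Q4_K_M uses roughly 4.5 bits per weight on average):

# Approximate weight storage for a 7B-parameter model:
# FP16:   7e9 params × 16 bits ÷ 8  ≈ 14.0 GB
# Q4_K_M: 7e9 params × ~4.5 bits ÷ 8 ≈ 3.9 GB  (roughly 3.6× smaller)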

Why Ollama Prefers GGUF

Ollama leverages the llama.cpp library internally. GGUF is the native format for llama.cpp, offering several advantages for Ollama's goal of easy local LLM execution:

  1. Efficiency: Quantized GGUF models run faster and use less RAM/VRAM.
  2. Accessibility: Enables running large models on standard laptops and desktops.
  3. Simplicity: Users interact with a single file format managed seamlessly by Ollama.

The Challenge with ALLaM and Ollama

Ollama works best with models in the GGUF format. The official ALLaM-AI/ALLaM-7B-Instruct-preview repository on Hugging Face primarily provides weights in Safetensors/PyTorch format.

Key Point: Directly importing Safetensors into Ollama using a Modelfile is often not straightforward and might require manual conversion to GGUF first. Tools like llama.cpp can perform this conversion, but it's an advanced process.

This guide outlines the steps using a Modelfile, assuming you have either:

  a) found a pre-converted GGUF version of the model (community contributions sometimes provide these), or
  b) converted the original model weights to GGUF format yourself.

Importing ALLaM into Ollama (Requires GGUF)

1. Prerequisites: Ollama and the GGUF Model File

  • Ollama Installed: Ensure you have Ollama running on your system. Visit ollama.com for installation instructions.
  • GGUF File: As Ollama works best with GGUF, you need the ALLaM model in this format.
    • Option A (Recommended): Search the Hugging Face community (e.g., search for "ALLaM GGUF") or other model repositories for a pre-converted GGUF version of ALLaM-7B-Instruct-preview. Download a suitable quantization level (e.g., Q4_K_M).
    • Option B (Advanced): Download the original Safetensors weights from the official repository and convert them to GGUF yourself using llama.cpp's conversion tooling (the convert_hf_to_gguf.py script, formerly convert.py, plus the llama-quantize tool). This is a technical process requiring familiarity with Python and with building llama.cpp. Shell sketches for both options follow this list.
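
The sketches below illustrate both options. In Option A, the repository and file names are placeholders, since available community conversions vary; in Option B, the commands assume a recent llama.cpp checkout, where the conversion script and quantizer carry the names shown.

# Option A (sketch): download a community GGUF conversion.
# NOTE: 'someuser/ALLaM-7B-Instruct-preview-GGUF' is a placeholder repo name.
huggingface-cli download someuser/ALLaM-7B-Instruct-preview-GGUF \
  allam-7b-instruct-preview.Q4_K_M.gguf --local-dir .

# Option B (sketch): convert the official Safetensors weights yourself.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
python convert_hf_to_gguf.py /path/to/ALLaM-7B-Instruct-preview \
  --outfile allam-7b-instruct-preview-f16.gguf
# Build the quantizer, then shrink the f16 file to Q4_K_M.
cmake -B build && cmake --build build --config Release
./build/bin/llama-quantize allam-7b-instruct-preview-f16.gguf \
  allam-7b-instruct-preview.Q4_K_M.gguf Q4_K_M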

2. Create the Modelfile

Create a file named Modelfile (no extension) in a directory of your choice. This file tells Ollama how to configure and run the model.

# Modelfile for ALLaM-7B-Instruct-preview (GGUF)
 
# IMPORTANT: Replace './path/to/your/allam-7b-instruct-preview.gguf'
# with the actual path to your downloaded or converted GGUF file.
FROM ./path/to/your/allam-7b-instruct-preview.gguf
 
# Define the chat template based on the model's expected format.
# This mirrors the structure used by Hugging Face's apply_chat_template
# for models expecting user/assistant turns. Check model card if specific tokens are needed.
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
 
# Set model parameters (refer to model card for specifics if available)
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.6
PARAMETER top_k 50
PARAMETER top_p 0.95
# Context size from the model card
PARAMETER num_ctx 4096
 
# Optional: Set a default system prompt (can be overridden during chat)
SYSTEM """You are ALLaM, a helpful bilingual English and Arabic AI assistant developed by SDAIA."""
 
# Optional: License information (refer to the model's actual license)
LICENSE """
Apache License 2.0
(Verify the exact license from the Hugging Face repository)
"""

Explanation:

  • FROM: Specifies the path to your GGUF model file. Crucially, update this path.
  • TEMPLATE: Defines how prompts and responses are formatted for the model. This example uses a common chat format with <|im_start|> and <|im_end|> tokens, often used by instruct models. You might need to adjust it based on the specific GGUF conversion or the chat template in the model card (a rendered example follows this list).
  • PARAMETER: Sets default generation parameters like stopping tokens, temperature, context window size (num_ctx), etc.
  • SYSTEM: Sets a default personality or instruction for the model.
  • LICENSE: Include the model's license information.
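
For intuition, here is roughly the literal text the TEMPLATE above produces for a single user turn with the default SYSTEM prompt, i.e., what the underlying model actually receives:

<|im_start|>system
You are ALLaM, a helpful bilingual English and Arabic AI assistant developed by SDAIA.<|im_end|>
<|im_start|>user
Explain the concept of Large Language Models.<|im_end|>
<|im_start|>assistant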

3. Create the Ollama Model

Open your terminal, navigate to the directory containing your Modelfile and the GGUF file, and run the ollama create command:

ollama create allam-instruct-preview -f Modelfile

  • allam-instruct-preview: This is the name you'll use to refer to the model in Ollama. You can choose a different name.
  • -f Modelfile: Specifies the Modelfile to use.

Ollama will process the Modelfile and import the GGUF model into its library. This might take some time depending on the model size.
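
You can verify the import with Ollama's built-in commands:

# List local models; allam-instruct-preview should appear.
ollama list
# Print the Modelfile Ollama stored for the new model.
ollama show allam-instruct-preview --modelfile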

4. Run the Model

Once the creation process is complete, you can run the model using:

ollama run allam-instruct-preview

You can now chat with the model directly in your terminal. Try prompts in English or Arabic:

>>> كيف أجهز كوب شاهي؟  (Arabic: "How do I prepare a cup of tea?")
 
>>> Explain the concept of Large Language Models.
 
>>> /bye (to exit)
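
Beyond the interactive terminal, the model is also available through Ollama's local REST API (port 11434 by default), which is useful for scripting. A minimal curl sketch:

# Single non-streaming chat request against the local Ollama server.
curl http://localhost:11434/api/chat -d '{
  "model": "allam-instruct-preview",
  "messages": [
    { "role": "user", "content": "Explain the concept of Large Language Models." }
  ],
  "stream": false
}'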

Publishing Your Custom Model to Ollama Registry

Once you have successfully created a local Ollama model from a GGUF file (e.g., allam-instruct-preview), you can share it on the ollama.com registry for others to use.

1. Create an Ollama Account

If you haven't already, sign up for an account at ollama.com/signup. Your username will be part of the model's public name (e.g., <your_username>/allam-instruct-preview).

2. Add Your Public Key

  • Find your local Ollama public key (a command to print it follows this list). The location is typically:
    • macOS: ~/.ollama/id_ed25519.pub
    • Linux: /usr/share/ollama/.ollama/id_ed25519.pub or ~/.ollama/id_ed25519.pub
    • Windows: C:\Users\<username>\.ollama\id_ed25519.pub
  • Go to your account settings on ollama.com/settings/keys.
  • Click "Add Ollama Public Key" and paste the entire content of your .pub file.

3. Tag Your Local Model

Before pushing, you need to tag your local model with your Ollama username. Use the ollama cp (copy) command:

# ollama cp <local_model_name> <your_username>/<new_model_name>
ollama cp allam-instruct-preview your_ollama_username/allam-instruct-preview

Replace your_ollama_username with your actual Ollama username.

4. Push the Model

Now, push the tagged model to the registry:

ollama push your_ollama_username/allam-instruct-preview

This will upload your model. Once complete, others can pull and run it using:

ollama pull your_ollama_username/allam-instruct-preview
ollama run your_ollama_username/allam-instruct-preview

Important: You can only publish models that you have successfully created locally, which for non-standard models like ALLaM typically requires starting with a GGUF file. Ensure you have the rights to redistribute the model weights according to the original model's license (ALLaM uses Apache 2.0, which generally permits redistribution).

Conclusion

Integrating custom models like ALLaM-7B-Instruct-preview into Ollama requires a Modelfile and, most importantly, the model weights in the GGUF format. While the official repository doesn't provide a GGUF version, finding a community conversion or converting it yourself allows you to leverage this powerful Arabic/English model within your local Ollama environment. Remember to adjust the FROM path and potentially the TEMPLATE in the Modelfile based on your specific GGUF file.

