Unlocking the Future of AI with GPT-4o: A Comprehensive Introduction to Multimodal Capabilities
As technology evolves, the intersection of artificial intelligence (AI) and web development keeps opening up transformative capabilities. One of the most promising developments in this space is GPT-4o, an advanced model that accepts combinations of text, audio, image, and video inputs and can generate multimodal outputs. This guide walks you through the key features, capabilities, and potential applications of GPT-4o, preparing you to harness this technology for your own projects.
Background
Before GPT-4o, tools like ChatGPT handled different modalities with separate models for text, vision, and audio stitched together in a pipeline. GPT-4o instead folds these capabilities into a single unified model that processes text, visual, and auditory inputs coherently. This seamless integration improves the model's ability to understand and generate multimodal content, paving the way for more dynamic and interactive applications.
Current API Capabilities
At the time of writing, the GPT-4o API accepts text and image inputs and returns text outputs, just like GPT-4 Turbo. Additional modalities, including audio, are expected to be introduced soon.
Getting Started
Install the OpenAI SDK for Python
To start using GPT-4o, first install the OpenAI SDK for Python:
%pip install --upgrade openai --quiet
Configure the OpenAI Client
Set up the OpenAI client with an API key. If you don't have one yet, follow these steps to acquire it:
- Create a new project.
- Generate an API key in your project.
- Set your API key as an environment variable (a quick check is shown below).
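As a quick sanity check (assuming you exported the key in your shell as OPENAI_API_KEY), confirm that Python can actually see it before configuring the client:
import os
# Fail fast if the environment variable is missing
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set in this environment"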
Once you have the API key, configure the OpenAI client:
from openai import OpenAI
import os
# Set the API key and model name
MODEL="gpt-4o"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", ""))
Submit a Test Request
Let's begin with a simple text input:
completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Help me with my math homework!"
        },
        {
            "role": "user",
            "content": "Hello! Could you solve 2+2?"
        }
    ]
)
print("Assistant: " + completion.choices[0].message.content)
Output:
Assistant: Of course! 2 + 2 = 4. If you have any other questions, feel free to ask!
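Beyond the generated text, the response object also carries usage metadata, which is handy for keeping an eye on cost. A small illustrative check using the completion object from above:
# Inspect how many tokens the request consumed
usage = completion.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")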
Image Processing
GPT-4o can process images and act intelligently on what it sees. You can supply images in two formats: Base64-encoded data or a URL.
Base64 Image Processing
First, preview the image and then encode it:
from IPython.display import Image, display, Audio, Markdown
import base64
IMAGE_PATH = "data/triangle.png"
# Preview image for context
display(Image(IMAGE_PATH))
# Open the image file and encode as base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
base64_image = encode_image(IMAGE_PATH)
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's the area of the triangle?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
            ]
        }
    ],
    temperature=0.0
)
print(response.choices[0].message.content)
URL Image Processing
You can also send an image using a URL:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's the area of the triangle?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"}
                }
            ]
        }
    ],
    temperature=0.0
)
print(response.choices[0].message.content)
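Whichever format you use, the image part also accepts an optional "detail" setting ("low", "high", or "auto") that trades visual fidelity against token cost; the video example below uses "low" for its sampled frames. As a sketch, a low-detail Base64 image part looks like this:
{
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{base64_image}", "detail": "low"}
}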
Video Processing
While the current API does not directly support video inputs, you can sample frames from a video and provide them as images. Here's how you can process a video and use both visual and audio data:
Setup for Video Processing
First, install the necessary packages:
%pip install opencv-python --quiet
%pip install moviepy --quiet
Process the Video into Frames and Audio
import cv2
from moviepy.editor import VideoFileClip
import time
import base64
VIDEO_PATH = "data/keynote_recap.mp4"
def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame = 0

    # Loop through the video and extract frames at the specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from the video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)
# Display the frames for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)
Audio(audio_path)
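Longer videos can yield a large number of frames, and every frame you send adds to the prompt size. A rough sketch for capping the frame count before the summarization step (the 50-frame limit is an arbitrary example, not an API requirement):
MAX_FRAMES = 50  # arbitrary cap; tune for your cost and latency budget
# Downsample evenly across the video if we collected too many frames
if len(base64Frames) > MAX_FRAMES:
    step = len(base64Frames) / MAX_FRAMES
    base64Frames = [base64Frames[int(i * step)] for i in range(MAX_FRAMES)]
print(f"Using {len(base64Frames)} frames for the summary")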
Generating Summaries with Visual and Audio Data
Generate a comprehensive summary by combining the sampled frames with a transcription of the audio:
# Transcribe the audio
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open(audio_path, "rb"),
)
# Generate a summary using both visual and audio data
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": """You are generating a video summary.
            Create a summary of the provided video and its transcript. Respond in Markdown."""
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "These are the frames from the video."},
                *map(
                    lambda x: {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{x}", "detail": "low"}
                    },
                    base64Frames
                ),
                {"type": "text", "text": f"The audio transcription is: {transcription.text}"}
            ]
        }
    ],
    temperature=0,
)
print(response.choices[0].message.content)
Conclusion
Integrating multiple input modalities (text, image, and soon audio) significantly enhances the model's performance on a diverse range of tasks. This multimodal approach allows for a comprehensive understanding and interaction, closely mirroring how humans perceive and process information.
Currently, GPT-4o in the API supports text and image inputs, with audio capabilities expected to be added soon. Integrating these advanced capabilities into your web development projects can unlock new dimensions of AI-driven interactions and functionalities.
Discuss Your Project with Us
We're here to help with your web development needs. Schedule a call to discuss your project and how we can assist you.
Let's find the best solutions for your needs.