Unlocking the Future of AI with GPT-4o: A Comprehensive Introduction to Multimodal Capabilities
As technology evolves, the intersection of artificial intelligence (AI) and web development keeps opening up transformative capabilities. One of the most promising developments in this space is GPT-4o, an advanced model that accepts combinations of text, audio, image, and video inputs and can generate multimodal outputs. This guide walks you through the key features, capabilities, and potential applications of GPT-4o, preparing you to harness this technology for your own projects.
Background
Before GPT-4o, tools like ChatGPT handled different modalities with separate models for text, vision, and audio stitched together in a pipeline. GPT-4o instead folds these capabilities into a single unified model that processes text, visual, and auditory inputs coherently. This seamless integration improves the model's ability to understand and generate multimodal content, paving the way for more dynamic and interactive applications.
Current API Capabilities
At the time of writing, the GPT-4o API accepts text and image inputs and returns text outputs, just like GPT-4 Turbo. Additional modalities, including audio, are expected to be introduced soon.
Getting Started
Install the OpenAI SDK for Python
To start using GPT-4o, first install the OpenAI SDK for Python:
%pip install --upgrade openai --quiet
Configure the OpenAI Client
Set up the OpenAI client with an API key. If you don't have one yet, follow these steps to acquire it:
- Create a new project.
- Generate an API key in your project.
- Set your API key as an environment variable (a quick check is shown below).
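As a quick sanity check (assuming you exported the key in your shell as OPENAI_API_KEY), confirm that Python can actually see it before configuring the client:
import os
# Fail fast if the environment variable is missing
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set in this environment"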
Once you have the API key, configure the OpenAI client:
from openai import OpenAI
import os
# Set the API key and model name
MODEL="gpt-4o"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", ""))
Submit a Test Request
Let's begin with a simple text input:
completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Help me with my math homework!"
        },
        {
            "role": "user",
            "content": "Hello! Could you solve 2+2?"
        }
    ]
)
print("Assistant: " + completion.choices[0].message.content)
Output:
Assistant: Of course! 2 + 2 = 4. If you have any other questions, feel free to ask!
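Beyond the generated text, the response object also carries usage metadata, which is handy for keeping an eye on cost. A small illustrative check using the completion object from above:
# Inspect how many tokens the request consumed
usage = completion.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")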
Image Processing
GPT-4o can process images and act intelligently on what it sees. You can supply images in two formats: Base64-encoded data or a URL.
Base64 Image Processing
First, preview the image and then encode it:
from IPython.display import Image, display, Audio, Markdown
import base64
IMAGE_PATH = "data/triangle.png"
# Preview image for context
display(Image(IMAGE_PATH))
# Open the image file and encode as base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
base64_image = encode_image(IMAGE_PATH)
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's the area of the triangle?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
            ]
        }
    ],
    temperature=0.0
)
print(response.choices[0].message.content)
URL Image Processing
You can also send an image using a URL:
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's the area of the triangle?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"}
                }
            ]
        }
    ],
    temperature=0.0
)
print(response.choices[0].message.content)
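Whichever format you use, the image part also accepts an optional "detail" setting ("low", "high", or "auto") that trades visual fidelity against token cost; the video example below uses "low" for its sampled frames. As a sketch, a low-detail Base64 image part looks like this:
{
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{base64_image}", "detail": "low"}
}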
Video Processing
While the current API does not directly support video inputs, you can sample frames from a video and provide them as images. Here's how you can process a video and use both visual and audio data:
Setup for Video Processing
First, install the necessary packages:
%pip install opencv-python --quiet
%pip install moviepy --quiet
Process the Video into Frames and Audio
import cv2
from moviepy.editor import VideoFileClip
import time
import base64
VIDEO_PATH = "data/keynote_recap.mp4"
def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame = 0

    # Loop through the video and extract frames at the specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from the video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)
# Display the frames for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)
Audio(audio_path)
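Longer videos can yield a large number of frames, and every frame you send adds to the prompt size. A rough sketch for capping the frame count before the summarization step (the 50-frame limit is an arbitrary example, not an API requirement):
MAX_FRAMES = 50  # arbitrary cap; tune for your cost and latency budget
# Downsample evenly across the video if we collected too many frames
if len(base64Frames) > MAX_FRAMES:
    step = len(base64Frames) / MAX_FRAMES
    base64Frames = [base64Frames[int(i * step)] for i in range(MAX_FRAMES)]
print(f"Using {len(base64Frames)} frames for the summary")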
Generating Summaries with Visual and Audio Data
Generate a comprehensive summary by combining the sampled frames with a transcription of the audio:
# Transcribe the audio
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open(audio_path, "rb"),
)
# Generate a summary using both visual and audio data
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": """You are generating a video summary.
            Create a summary of the provided video and its transcript. Respond in Markdown."""
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "These are the frames from the video."},
                *map(
                    lambda x: {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{x}", "detail": "low"}
                    },
                    base64Frames
                ),
                {"type": "text", "text": f"The audio transcription is: {transcription.text}"}
            ]
        }
    ],
    temperature=0,
)
print(response.choices[0].message.content)
Conclusion
Integrating multiple input modalities (text, image, and soon audio) significantly enhances the model's performance on a diverse range of tasks. This multimodal approach allows for a comprehensive understanding and interaction, closely mirroring how humans perceive and process information.
Currently, GPT-4o in the API supports text and image inputs, with audio capabilities expected to be added soon. Integrating these advanced capabilities into your web development projects can unlock new dimensions of AI-driven interactions and functionalities.
Discuss Your Project with Us
We're here to help with your web development needs. Schedule a call to discuss your project and how we can assist you.
Let's find the best solutions for your needs.