Creating a Podcast from a PDF using Vercel AI SDK and LangChain

By Anis Marrouchi & AI Bot

Creating a podcast from a PDF document is an innovative way to repurpose content and reach a wider audience. This guide will walk you through the process of setting up a system to convert PDF text into an engaging podcast using the Vercel AI SDK, LangChain's PDFLoader, ElevenLabs, and Next.js.

Prerequisites

Before you begin, ensure you have the following:

  • Node.js and npm installed on your machine.
  • A Vercel account.
  • An OpenAI API key.
  • An ElevenLabs API key (if you plan to use ElevenLabs for text-to-speech).
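Before scaffolding the project, it is worth confirming the local toolchain (a quick check; the App Router code below assumes Node.js 18 or later):

```shell
node --version   # expect v18 or later
npm --version
```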

Setting Up the Project

1. Initialize a Next.js App

Start by creating a new Next.js application:

npx create-next-app@latest pdf-to-podcast --typescript
cd pdf-to-podcast

2. Install Required Packages

Install the necessary dependencies, including the Vercel AI SDK, LangChain's PDFLoader, and other required packages:

npm install ai @ai-sdk/openai @langchain/community pdf-parse zod elevenlabs
  • ai: Vercel AI SDK for AI integrations.
  • @ai-sdk/openai: OpenAI provider for the AI SDK.
  • @langchain/community: LangChain community integrations, including the PDFLoader used to parse PDFs.
  • pdf-parse: PDF parsing library that PDFLoader relies on.
  • zod: Schema validation library.
  • elevenlabs: SDK for the ElevenLabs text-to-speech service.

Building the API Route

Create an API route to handle the PDF to audio conversion.

1. Create the Route

In your Next.js app, create a new file at /app/api/generate-podcast/route.ts.

import { NextResponse } from 'next/server';
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf';
import fs from 'fs';
import { generateObject } from 'ai';
import { createOpenAI, openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { ElevenLabsClient } from 'elevenlabs';
 
const ELEVEN_LABS_API_KEY = process.env.ELEVEN_LABS_API_KEY;
const elevenLabsClient = ELEVEN_LABS_API_KEY ? new ElevenLabsClient({ apiKey: ELEVEN_LABS_API_KEY }) : null;
 
export async function POST(req: Request) {
  try {
    const formData = await req.formData();
    const apiKey = formData.get('api_key') as string;
    const pdfFile = formData.get('pdf') as File;
 
    if (!pdfFile || !apiKey) {
      return NextResponse.json({ error: 'PDF file and OpenAI API key are required' }, { status: 400 });
    }
 
    const pdfBuffer = Buffer.from(await pdfFile.arrayBuffer());
 
    try {
      const pdfText = await extractTextFromPDF(pdfBuffer);
      const dialogue = await generateDialogue(pdfText, apiKey);
      const audioBuffer = await streamDialogueToAudio(dialogue);
 
      return new NextResponse(audioBuffer, {
        headers: {
          'Content-Type': 'audio/mpeg',
          'Content-Disposition': 'inline',
          'Content-Length': audioBuffer.length.toString(),
        },
      });
    } catch (parseError) {
      console.error('Error during PDF parsing:', parseError);
      return NextResponse.json({ error: 'Failed to parse PDF content' }, { status: 422 });
    }
  } catch (error) {
    console.error('General error in route handler:', error);
    return NextResponse.json({ error: 'Failed to process request' }, { status: 500 });
  }
}

2. Helper Functions

Implement helper functions to extract text from the PDF, generate dialogue, and convert it into audio.

async function extractTextFromPDF(fileBuffer: Buffer): Promise<string> {
  // Use a unique path so concurrent requests don't overwrite each other's file.
  const tempFilePath = `/tmp/upload-${Date.now()}-${Math.random().toString(36).slice(2)}.pdf`;
  fs.writeFileSync(tempFilePath, fileBuffer);
  try {
    const loader = new PDFLoader(tempFilePath);
    const documents = await loader.load();
    return documents.map((doc) => doc.pageContent).join('\n');
  } finally {
    fs.unlinkSync(tempFilePath); // Clean up the temporary file in all cases.
  }
}
 
async function generateDialogue(text: string, apiKey: string) {
  const dialogueSchema = z.object({
    conversation: z.array(
      z.object({
        speaker: z.string().describe('Name or role of the speaker (e.g., Host, Guest 1, Guest 2).'),
        message: z.string().describe('The text spoken by the speaker in the dialogue.'),
      })
    ),
  });
 
  const systemMessage = `
You are creating a structured dialogue for a podcast conversation.
For each turn, give the speaker's role (e.g., Host, Guest) and their message.
Make the conversation natural and engaging, and cover the key points from the text.
`;
 
  // The per-request key from the form configures a dedicated provider instance;
  // the default openai export only reads OPENAI_API_KEY from the environment.
  const provider = createOpenAI({ apiKey });

  const { object: dialogueObject } = await generateObject({
    model: provider('gpt-4'),
    system: systemMessage,
    prompt: `text: ${text}`,
    schema: dialogueSchema,
  });
 
  return dialogueObject;
}
 
async function streamDialogueToAudio(dialogue: {
  conversation: { speaker: string; message: string }[];
}): Promise<Buffer> {
  if (!elevenLabsClient) {
    throw new Error('ElevenLabs API client is not initialized');
  }

  const audioBuffers: Buffer[] = [];
  const voices = ['Rachel', 'Domi']; // Example voices from ElevenLabs

  for (const [index, entry] of dialogue.conversation.entries()) {
    const { message } = entry;
    // Alternate voices so consecutive turns sound like different speakers.
    const currentVoice = voices[index % voices.length];

    const audioStream = await elevenLabsClient.generate({
      voice: currentVoice,
      text: message,
    });

    const chunks: Buffer[] = [];
    for await (const chunk of audioStream) {
      chunks.push(Buffer.from(chunk));
    }

    audioBuffers.push(Buffer.concat(chunks));
  }

  return Buffer.concat(audioBuffers);
}

Note: Make sure to replace 'gpt-4' with the appropriate OpenAI model you have access to, and adjust the ElevenLabs voices based on availability.
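The two mechanical pieces of streamDialogueToAudio — rotating voices across turns and concatenating per-turn buffers — can be exercised in isolation. Here is a minimal sketch with the ElevenLabs call stubbed out; Turn, assignVoices, and assembleAudio are illustrative names, not part of the route code:

```typescript
type Turn = { speaker: string; message: string };

const voices = ['Rachel', 'Domi']; // example voices, as in the route above

// Assign a voice to each dialogue turn by rotating through the voice list.
function assignVoices(conversation: Turn[]): { voice: string; message: string }[] {
  return conversation.map((turn, index) => ({
    voice: voices[index % voices.length],
    message: turn.message,
  }));
}

// Per-turn audio buffers are joined into one payload, mirroring
// Buffer.concat(audioBuffers) at the end of streamDialogueToAudio.
function assembleAudio(perTurnBuffers: Buffer[]): Buffer {
  return Buffer.concat(perTurnBuffers);
}

const turns: Turn[] = [
  { speaker: 'Host', message: 'Welcome to the show.' },
  { speaker: 'Guest', message: 'Glad to be here.' },
  { speaker: 'Host', message: 'Let us dive in.' },
];

const assigned = assignVoices(turns);
// With two voices, the third turn cycles back to the first voice.
```

Odd-numbered turns get the second voice, so a two-person dialogue reads naturally; add more entries to voices for panel-style shows.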

Frontend Implementation

Create a user interface to upload PDFs and generate audio.

1. Create the Page Component

In your Next.js app, create a new page component at /app/page.tsx.

'use client';
 
import { useState } from 'react';
 
export default function PDFToAudioPage() {
  const [pdfFile, setPdfFile] = useState<File | null>(null);
  const [apiKey, setApiKey] = useState('');
  const [audioUrl, setAudioUrl] = useState<string | null>(null);
 
  const handleSubmit = async (event: React.FormEvent) => {
    event.preventDefault();
    if (!pdfFile || !apiKey) return alert('Both PDF and API key are required');
 
    const formData = new FormData();
    formData.append('pdf', pdfFile);
    formData.append('api_key', apiKey);
 
    const response = await fetch('/api/generate-podcast', {
      method: 'POST',
      body: formData,
    });
 
    if (response.ok) {
      const audioBlob = await response.blob();
      const audioUrl = URL.createObjectURL(audioBlob);
      setAudioUrl(audioUrl);
    } else {
      const errorData = await response.json();
      alert(`Audio generation failed: ${errorData.error}`);
    }
  };
 
  return (
    <div className="flex flex-col items-center justify-center min-h-screen bg-gray-100">
      <h1 className="text-2xl font-bold text-gray-800 mb-6">PDF to Audio Podcast</h1>
      <form
        onSubmit={handleSubmit}
        className="bg-white shadow-md rounded-lg p-6 w-96 space-y-4"
      >
        <div>
          <label
            htmlFor="file"
            className="block text-sm font-medium text-gray-700 mb-2"
          >
            Upload PDF
          </label>
          <input
            type="file"
            id="file"
            accept="application/pdf"
            className="block w-full text-sm text-gray-500 file:mr-4 file:py-2 file:px-4 file:rounded-full file:border-0 file:text-sm file:font-semibold file:bg-indigo-50 file:text-indigo-700 hover:file:bg-indigo-100"
            onChange={(e) => setPdfFile(e.target.files?.[0] || null)}
          />
        </div>
        <div>
          <label
            htmlFor="apiKey"
            className="block text-sm font-medium text-gray-700 mb-2"
          >
            OpenAI API Key
          </label>
          <input
            type="password"
            id="apiKey"
            placeholder="OpenAI API Key"
            value={apiKey}
            onChange={(e) => setApiKey(e.target.value)}
            className="block w-full px-4 py-2 border border-gray-300 rounded-lg shadow-sm focus:ring-indigo-500 focus:border-indigo-500 sm:text-sm"
          />
        </div>
        <button
          type="submit"
          className="w-full py-2 px-4 bg-indigo-600 text-white font-medium text-sm rounded-lg shadow hover:bg-indigo-700 focus:outline-none focus:ring-2 focus:ring-indigo-500 focus:ring-offset-2"
        >
          Generate Audio
        </button>
      </form>
      {audioUrl && (
        <audio
          controls
          src={audioUrl}
          className="mt-6 w-full max-w-md rounded-lg shadow"
        />
      )}
    </div>
  );
}

Conclusion

By following this guide, you can successfully create a podcast from a PDF document using the Vercel AI SDK, LangChain's PDFLoader, ElevenLabs, and Next.js. This setup allows you to transform written content into engaging audio formats, expanding your content's reach and accessibility.


Important Notes:

  • API Keys: Set your ElevenLabs key in your environment variables; the OpenAI key is supplied per request through the form.
  • Dependencies: The ai package from Vercel simplifies AI integrations. LangChain's PDFLoader helps in parsing PDF documents efficiently.
  • Text-to-Speech: ElevenLabs provides high-quality text-to-speech services, enhancing the listening experience.
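Assuming the ELEVEN_LABS_API_KEY variable name read by the route above, a local environment file would look like this (the value is a placeholder; the OpenAI key is entered in the form at request time):

```bash
# .env.local — keep out of source control
ELEVEN_LABS_API_KEY=your-elevenlabs-key
```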

Troubleshooting

  • PDF Parsing Errors: Ensure the PDF file is not corrupted and is properly uploaded.
  • API Errors: Double-check your API keys and ensure they have the necessary permissions.
  • Audio Playback Issues: Confirm that the audio content is correctly generated and that your browser supports the audio format.
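One way to catch the audio-playback failure mode early on the client is to inspect the response's Content-Type before building the object URL. This is a sketch; it assumes the audio/mpeg header set by the route above, and isAudioResponse is an illustrative helper, not part of the page component:

```typescript
// The route returns audio/mpeg on success and JSON on failure, so the
// Content-Type header distinguishes a real audio payload from an error body.
function isAudioResponse(contentType: string | null): boolean {
  return contentType !== null && contentType.startsWith('audio/');
}

// Usage inside handleSubmit (sketch):
//   if (isAudioResponse(response.headers.get('Content-Type'))) { /* blob -> <audio> */ }
//   else { /* response.json() -> show error */ }
```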

Feel free to reach out if you have any questions or need further assistance with the implementation.

