Build an AI Voice Agent That Joins Google Meet Calls (ElevenLabs + Selenium + PulseAudio)

By Noqta Team

What You'll Build

Imagine an AI agent that can join your Google Meet calls, listen to participants, and respond in a natural human voice — all running autonomously on a cloud server. That's exactly what we're building in this tutorial.

By the end, you'll have:

  • A Selenium-based bot that joins Google Meet as a guest participant
  • PulseAudio virtual audio routing that pipes Meet audio to an AI and back
  • An ElevenLabs Conversational AI bridge that listens, thinks, and speaks
  • A cloud deployment on Vast.ai with a one-command setup script (GPU optional)

Here's how the audio flows:

Meet participants speak
    → Chrome captures audio → meet_capture sink
        → ElevenLabs ConvAI (STT → LLM → TTS)
            → atlas_out sink → atlas_mic virtual source
                → Chrome mic input → Meet hears the AI respond

This is the same setup we tested internally at Noqta — and it works.


Prerequisites

Before diving in, make sure you have:

  1. An ElevenLabs account with Conversational AI access
  2. A Vast.ai account (or any Linux server with root access)
  3. Basic Python and Linux command-line knowledge
  4. A Google Meet link to test with

Architecture Overview

The system has three main components working together:

Component    | Role                                                             | Tool
Meet Joiner  | Opens Chrome, navigates to Meet, joins as guest                  | Selenium + Xvfb
Audio Router | Creates virtual audio devices, routes audio between Meet and AI | PulseAudio
AI Bridge    | Captures Meet audio, sends to ElevenLabs, plays response back    | ElevenLabs ConvAI SDK

All three run on the same machine. The key insight is using PulseAudio null sinks and virtual sources to create a bidirectional audio bridge between Chrome and the ElevenLabs API — no physical microphone or speaker needed.


Step 1: Provision a Cloud Server

You need a Linux server with a desktop environment (for Chrome). We used Vast.ai because it's cheap, fast to spin up, and gives you root access.

On Vast.ai

  1. Sign up at Vast.ai
  2. Search for a template with Ubuntu 22.04 and at least 4 GB RAM
  3. You don't strictly need a GPU for this project — CPU instances work fine
  4. SSH into your instance once it's running

System Dependencies

Once connected, install everything:

#!/bin/bash
set -e
 
echo "=== Voice Agent Setup ==="
 
# Core system packages
apt-get update -q
apt-get install -y -q \
    wget curl unzip \
    xvfb pulseaudio \
    libsndfile1 libportaudio2 ffmpeg portaudio19-dev \
    fonts-liberation libappindicator3-1 libasound2 \
    libatk-bridge2.0-0 libatk1.0-0 libcups2 libdbus-1-3 \
    libgdk-pixbuf2.0-0 libnspr4 libnss3 libx11-xcb1 \
    libxcomposite1 libxdamage1 libxrandr2 xdg-utils
 
# Install Google Chrome
wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb || apt-get -f install -y
rm google-chrome-stable_current_amd64.deb
 
# Install ChromeDriver (must match the full Chrome build version)
CHROME_VERSION=$(google-chrome --version | grep -oP '\d+\.\d+\.\d+\.\d+')
DRIVER_URL="https://storage.googleapis.com/chrome-for-testing-public/${CHROME_VERSION}/linux64/chromedriver-linux64.zip"
wget -q "$DRIVER_URL" -O /tmp/chromedriver.zip
unzip -o /tmp/chromedriver.zip -d /tmp/
mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/
chmod +x /usr/local/bin/chromedriver
 
# Python dependencies
pip install -q \
    selenium \
    websockets \
    elevenlabs
 
echo "=== Setup complete ==="

Save this as setup.sh and run it:

chmod +x setup.sh && ./setup.sh

Step 2: Set Up the Virtual Display and Audio

Since there's no physical monitor or sound card on a cloud server, we use Xvfb (virtual framebuffer) for the display and PulseAudio for virtual audio devices.

Start Xvfb and PulseAudio

# Start virtual display
Xvfb :99 -screen 0 1280x720x24 &
export DISPLAY=:99
 
# Start PulseAudio in system mode
pulseaudio --start --exit-idle-time=-1

Create Virtual Audio Devices

This is the critical part. We need two virtual audio devices:

# 1. meet_capture — Chrome's audio output goes here
#    We'll read from meet_capture.monitor to hear what Meet participants say
pactl load-module module-null-sink \
    sink_name=meet_capture \
    sink_properties=device.description=MeetCapture
 
# 2. atlas_out — AI responses are played here
#    atlas_mic reads from atlas_out.monitor and acts as Chrome's mic input
pactl load-module module-null-sink \
    sink_name=atlas_out \
    sink_properties=device.description=AtlasOutput
 
# 3. atlas_mic — virtual microphone source fed by atlas_out
pactl load-module module-virtual-source \
    source_name=atlas_mic \
    master=atlas_out.monitor \
    source_properties=device.description=AtlasMic
 
# Set defaults so Chrome picks them up
pactl set-default-source atlas_mic    # Chrome's mic = AI output
pactl set-default-sink meet_capture   # Chrome's speakers = our capture point

Verify the Setup

# Check sinks (should see meet_capture and atlas_out)
pactl list short sinks
 
# Check sources (should see atlas_mic and meet_capture.monitor)
pactl list short sources

You should see output like:

1  meet_capture   module-null-sink.c   s16le 1ch 44100Hz   IDLE
2  atlas_out      module-null-sink.c   s16le 1ch 44100Hz   IDLE
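
You can also run this check from Python, e.g. at bot startup before launching Chrome. This is a sketch using the sink/source names created above; names_from_pactl and check_devices are our own helpers, not part of any SDK:

```python
import subprocess

REQUIRED_SINKS = {"meet_capture", "atlas_out"}
REQUIRED_SOURCES = {"atlas_mic"}

def names_from_pactl(output: str) -> set:
    """Extract device names (column 2) from `pactl list short ...` output."""
    names = set()
    for line in output.strip().splitlines():
        parts = line.split()
        if len(parts) >= 2:
            names.add(parts[1])
    return names

def check_devices() -> bool:
    """Return True if all required virtual devices exist."""
    sinks = subprocess.run(["pactl", "list", "short", "sinks"],
                           capture_output=True, text=True).stdout
    sources = subprocess.run(["pactl", "list", "short", "sources"],
                             capture_output=True, text=True).stdout
    missing = (REQUIRED_SINKS - names_from_pactl(sinks)) | \
              (REQUIRED_SOURCES - names_from_pactl(sources))
    if missing:
        print(f"Missing devices: {missing}")
        return False
    return True
```

Calling check_devices() before starting Chrome gives you a fast failure instead of a silent call with no audio.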

Step 3: Build the Meet Joiner Bot

The bot uses Selenium to open Chrome, navigate to Google Meet, enter a name, and click "Ask to join":

#!/usr/bin/env python3
"""meet_bot.py — Joins a Google Meet call as a guest via Selenium."""
 
import os
import sys
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
 
MEET_URL = sys.argv[1] if len(sys.argv) > 1 else "https://meet.google.com/abc-defg-hij"
DISPLAY = os.environ.get("DISPLAY", ":99")
BOT_NAME = "Atlas"
 
 
def make_driver():
    opts = Options()
    opts.binary_location = "/usr/bin/google-chrome"
    opts.add_argument("--no-sandbox")
    opts.add_argument("--disable-dev-shm-usage")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--use-fake-ui-for-media-stream")   # auto-allow mic/cam
    # Important: do NOT use --use-fake-device-for-media-stream
    # We want Chrome to use real PulseAudio devices
    opts.add_argument("--disable-blink-features=AutomationControlled")
    opts.add_argument("--window-size=1280,720")
    opts.add_argument("--autoplay-policy=no-user-gesture-required")
    opts.add_experimental_option("excludeSwitches", ["enable-automation"])
 
    svc = Service("/usr/local/bin/chromedriver")
    driver = webdriver.Chrome(service=svc, options=opts)
    driver.execute_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    return driver
 
 
def find_any(driver, xpaths, timeout=10):
    """Try multiple XPath selectors, return first visible match."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        for xpath in xpaths:
            try:
                el = driver.find_element(By.XPATH, xpath)
                if el.is_displayed():
                    return el
            except NoSuchElementException:
                pass
        time.sleep(0.5)
    return None
 
 
def join_meet(driver, url):
    print(f"Opening {url}")
    driver.get(url)
    time.sleep(4)
 
    # Step 1: Choose guest mode (no Google account)
    guest = find_any(driver, [
        "//span[contains(text(),'Use without an account')]",
        "//button[contains(text(),'Use without')]",
        "//span[contains(text(),'Continue without')]",
    ], timeout=8)
    if guest:
        guest.click()
        print("  Guest mode selected")
        time.sleep(3)
 
    # Step 2: Enter bot name
    name_el = find_any(driver, [
        "//input[contains(@aria-label,'name') or contains(@aria-label,'Name')]",
        "//input[@placeholder='Your name']",
        "//input[@type='text']",
    ], timeout=8)
    if name_el:
        name_el.clear()
        name_el.send_keys(BOT_NAME)
        print(f"  Entered name: {BOT_NAME}")
        time.sleep(1)
 
    # Step 3: Click join
    join_btn = find_any(driver, [
        "//span[contains(text(),'Ask to join')]",
        "//span[contains(text(),'Join now')]",
        "//button[contains(text(),'Ask to join')]",
        "//button[contains(text(),'Join now')]",
    ], timeout=15)
    if join_btn:
        join_btn.click()
        print("  Join request sent — waiting for admission...")
    else:
        print("  ERROR: Join button not found")
        driver.save_screenshot("/tmp/meet_debug.png")
        return False
 
    # Step 4: Wait for admission (up to 5 minutes)
    for i in range(60):
        mute_btns = driver.find_elements(By.XPATH,
            "//button[contains(@aria-label,'microphone') or contains(@aria-label,'mic')]"
        )
        if mute_btns:
            print("  IN THE CALL!")
            return True
        time.sleep(5)
 
    print("  Timed out waiting for admission")
    return False
 
 
if __name__ == "__main__":
    driver = make_driver()
    try:
        if join_meet(driver, MEET_URL):
            print("\nBot is in the call. Press Ctrl+C to leave.")
            while True:
                time.sleep(10)
    except KeyboardInterrupt:
        print("\nLeaving call...")
    finally:
        driver.quit()

Key Chrome Flags Explained

Flag                                          | Why
--use-fake-ui-for-media-stream                | Auto-allows mic/camera permission popups
--no-sandbox                                  | Required for running as root on cloud servers
--disable-blink-features=AutomationControlled | Prevents Meet from detecting Selenium
No --use-fake-device-for-media-stream         | Ensures Chrome uses PulseAudio (real audio devices)

Important: After Chrome starts, you may need to manually move its audio stream to the meet_capture sink. We'll automate this with a helper in Step 5.


Step 4: Build the ElevenLabs ConvAI Bridge

This is the core component. It captures audio from meet_capture.monitor (what participants say), sends it to ElevenLabs Conversational AI, and plays the AI's response into atlas_out (which feeds Chrome's mic).

#!/usr/bin/env python3
"""
meet_elevenlabs_bridge.py — ElevenLabs ConvAI <-> Google Meet bridge via PulseAudio.
 
Audio flow:
  Meet audio out -> meet_capture.monitor -> ElevenLabs Agent
  ElevenLabs Agent -> atlas_out -> atlas_out.monitor -> atlas_mic -> Meet mic
"""
 
import argparse
import os
import queue
import signal
import subprocess
import threading
import time
from typing import Callable

from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import (
    AudioInterface,
    Conversation,
)

# ── Config ─────────────────────────────────────────────────────────
API_KEY = os.environ.get("ELEVENLABS_API_KEY", "")  # avoid hardcoding keys
 
INPUT_SOURCE  = "meet_capture.monitor"   # what Meet plays
OUTPUT_SINK   = "atlas_out"              # feeds into atlas_mic -> Meet mic
 
SAMPLE_RATE   = 16000
CHANNELS      = 1
FORMAT        = "s16le"          # signed 16-bit little-endian PCM
CHUNK_SAMPLES = 4000             # 250ms chunks (recommended by SDK)
CHUNK_BYTES   = CHUNK_SAMPLES * CHANNELS * 2
 
 
# ── Custom PulseAudio AudioInterface ──────────────────────────────
class PulseAudioInterface(AudioInterface):
    """Routes audio through PulseAudio using parec/pacat subprocesses."""
 
    def start(self, input_callback: Callable[[bytes], None]):
        self.input_callback = input_callback
        self.output_queue: queue.Queue[bytes] = queue.Queue()
        self.should_stop = threading.Event()
 
        # Capture from Meet's audio output
        self._rec_proc = subprocess.Popen(
            [
                "parec",
                f"--device={INPUT_SOURCE}",
                f"--format={FORMAT}",
                f"--rate={SAMPLE_RATE}",
                f"--channels={CHANNELS}",
                "--latency-msec=50",
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
 
        # Play AI responses into atlas_out -> Meet's mic
        self._play_proc = subprocess.Popen(
            [
                "pacat",
                "--playback",
                f"--device={OUTPUT_SINK}",
                f"--format={FORMAT}",
                f"--rate={SAMPLE_RATE}",
                f"--channels={CHANNELS}",
                "--latency-msec=50",
            ],
            stdin=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
 
        self._input_thread = threading.Thread(
            target=self._read_input, daemon=True
        )
        self._output_thread = threading.Thread(
            target=self._write_output, daemon=True
        )
        self._input_thread.start()
        self._output_thread.start()
 
        print(f"[audio] capturing from {INPUT_SOURCE}")
        print(f"[audio] playing to {OUTPUT_SINK}")
 
    def stop(self):
        self.should_stop.set()
        if self._rec_proc:
            self._rec_proc.terminate()
        if self._play_proc:
            try:
                self._play_proc.stdin.close()
            except Exception:
                pass
            self._play_proc.terminate()
 
    def output(self, audio: bytes):
        self.output_queue.put(audio)
 
    def interrupt(self):
        # Drain output queue when user interrupts the AI
        try:
            while True:
                self.output_queue.get_nowait()
        except queue.Empty:
            pass
 
    def _read_input(self):
        while not self.should_stop.is_set():
            chunk = self._rec_proc.stdout.read(CHUNK_BYTES)
            if not chunk:
                break
            self.input_callback(chunk)
 
    def _write_output(self):
        while not self.should_stop.is_set():
            try:
                audio = self.output_queue.get(timeout=0.25)
                self._play_proc.stdin.write(audio)
                self._play_proc.stdin.flush()
            except queue.Empty:
                pass
            except BrokenPipeError:
                break
 
 
# ── Main ──────────────────────────────────────────────────────────
def run_bridge(client: ElevenLabs, agent_id: str):
    print(f"\n[bridge] Starting ElevenLabs ConvAI bridge")
    print(f"[bridge] Agent: {agent_id}")
    print(f"[bridge] Press Ctrl+C to stop\n")
 
    quit_event = threading.Event()
    signal.signal(signal.SIGTERM, lambda s, f: quit_event.set())
    signal.signal(signal.SIGINT, lambda s, f: quit_event.set())
 
    while not quit_event.is_set():
        print("[bridge] Starting new session...")
        try:
            conversation = Conversation(
                client=client,
                agent_id=agent_id,
                requires_auth=False,
                audio_interface=PulseAudioInterface(),
                callback_agent_response=lambda t: print(f"[agent] {t}"),
                callback_user_transcript=lambda t: print(f"[user]  {t}"),
                callback_latency_measurement=lambda ms: print(
                    f"[latency] {ms}ms"
                ),
            )
            conversation.start_session()
            conversation.wait_for_session_end()
        except Exception as e:
            print(f"[bridge] Session error: {e}")
 
        if not quit_event.is_set():
            print("[bridge] Session ended, restarting in 2s...")
            time.sleep(2)
 
    print("[bridge] Done.")
 
 
def main():
    parser = argparse.ArgumentParser(
        description="ElevenLabs ConvAI <-> Google Meet bridge"
    )
    parser.add_argument("--agent-id", required=True, help="ElevenLabs agent ID")
    parser.add_argument(
        "--api-key",
        default=API_KEY,
        help="ElevenLabs API key",
    )
    args = parser.parse_args()
 
    client = ElevenLabs(api_key=args.api_key)
    run_bridge(client, args.agent_id)
 
 
if __name__ == "__main__":
    main()

How the PulseAudioInterface Works

The ElevenLabs SDK expects an AudioInterface with four methods:

Method          | What it does
start(callback) | Spawns parec (capture) and pacat (playback) subprocesses
output(audio)   | Queues AI-generated audio bytes for playback
interrupt()     | Drains the queue when the user starts speaking (barge-in)
stop()          | Terminates the audio subprocesses

Using parec and pacat directly (instead of PyAudio or sounddevice) is the most reliable approach on headless Linux servers — no ALSA/JACK conflicts, no device enumeration issues.
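
Before starting the bridge, it's worth confirming that audio is actually flowing out of Meet. A quick diagnostic is to grab a second of PCM with parec and check its level. This is a sketch; rms_of_pcm and capture_level are our own helpers, and the device name matches the sink created in Step 2:

```python
import math
import struct
import subprocess

def rms_of_pcm(raw: bytes) -> float:
    """Root-mean-square level of s16le mono PCM bytes."""
    if len(raw) < 2:
        return 0.0
    raw = raw[:len(raw) - len(raw) % 2]  # drop any trailing half-sample
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def capture_level(device="meet_capture.monitor", seconds=1) -> float:
    """Record `seconds` of audio with parec and return its RMS level."""
    n_bytes = 16000 * 2 * seconds  # 16 kHz, 16-bit mono
    proc = subprocess.Popen(
        ["parec", f"--device={device}", "--format=s16le",
         "--rate=16000", "--channels=1"],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
    )
    raw = proc.stdout.read(n_bytes)
    proc.terminate()
    return rms_of_pcm(raw)
```

An RMS near zero while participants are speaking means Chrome's stream hasn't been moved to meet_capture yet (see Step 5).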


Step 5: Wire Everything Together

Now let's run all three components. Open three terminal sessions (or use tmux):

Terminal 1: Start the Display and Audio

# Start Xvfb
Xvfb :99 -screen 0 1280x720x24 &
export DISPLAY=:99
 
# Start PulseAudio
pulseaudio --start --exit-idle-time=-1
 
# Create virtual audio devices
pactl load-module module-null-sink sink_name=meet_capture \
    sink_properties=device.description=MeetCapture
pactl load-module module-null-sink sink_name=atlas_out \
    sink_properties=device.description=AtlasOutput
pactl load-module module-virtual-source source_name=atlas_mic \
    master=atlas_out.monitor \
    source_properties=device.description=AtlasMic
pactl set-default-source atlas_mic
pactl set-default-sink meet_capture

Terminal 2: Join the Meet

export DISPLAY=:99
python3 meet_bot.py "https://meet.google.com/your-meeting-code"

Wait for the bot to request to join. Accept the bot from another participant's Meet window.

Terminal 3: Start the AI Bridge

python3 meet_elevenlabs_bridge.py --agent-id YOUR_AGENT_ID --api-key YOUR_API_KEY

Move Chrome's Audio (Important!)

After Chrome joins the call, move its audio output to the capture sink:

# List Chrome's audio streams
pactl list short sink-inputs
 
# Move each stream to meet_capture (replace INDEX with actual number)
pactl move-sink-input INDEX meet_capture

You can automate this with a helper function:

import subprocess
import time

def move_chrome_audio():
    """Move all Chrome audio streams to the meet_capture sink."""
    time.sleep(6)  # wait for Chrome to start playing audio
    result = subprocess.run(
        ["pactl", "list", "short", "sink-inputs"],
        capture_output=True, text=True,
    )
    for line in result.stdout.strip().splitlines():
        parts = line.split()
        if parts:
            subprocess.run(
                ["pactl", "move-sink-input", parts[0], "meet_capture"],
                capture_output=True,
            )
            print(f"Moved audio stream {parts[0]} to meet_capture")

Step 6: Create Your ElevenLabs Agent

Before running the bridge, you need a Conversational AI agent on ElevenLabs:

  1. Go to ElevenLabs Dashboard > Conversational AI
  2. Click Create Agent
  3. Configure your agent:
    • Name: Your bot's name (e.g., "Atlas")
    • Voice: Pick any voice from the library
    • System prompt: Define the agent's personality and knowledge
    • Language: Set to your preferred language
  4. Copy the Agent ID from the agent settings page

Example System Prompt

You are Atlas, a helpful AI assistant participating in a Google Meet call.
You listen to what participants say and respond naturally.
Keep responses concise — this is a live conversation, not a text chat.
If you're unsure about something, ask for clarification.

Step 7: Deploy as a Single Script

For production use, combine everything into one script:

#!/usr/bin/env python3
"""voice_meet_bot.py — Complete Google Meet AI voice agent."""
 
import os
import subprocess
import sys
import threading
import time
 
# ... (combine meet_bot.py + meet_elevenlabs_bridge.py)
# See the full combined script in the project repo
 
def main():
    # 1. Setup audio devices
    setup_audio()
 
    # 2. Start Chrome and join Meet
    driver = make_driver()
    join_thread = threading.Thread(target=join_meet, args=(driver, MEET_URL))
    join_thread.start()
 
    # 3. Move Chrome audio after it joins
    time.sleep(10)
    move_chrome_audio()
 
    # 4. Start ElevenLabs bridge
    client = ElevenLabs(api_key=API_KEY)
    run_bridge(client, AGENT_ID)
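
The setup_audio() step above isn't shown in the snippet; it can be sketched as a thin wrapper over the same pactl commands from Step 2. The command list mirrors Terminal 1 exactly; error handling is deliberately minimal:

```python
import subprocess

# Same pactl invocations as the Step 2 / Terminal 1 shell commands
PACTL_COMMANDS = [
    ["pactl", "load-module", "module-null-sink",
     "sink_name=meet_capture",
     "sink_properties=device.description=MeetCapture"],
    ["pactl", "load-module", "module-null-sink",
     "sink_name=atlas_out",
     "sink_properties=device.description=AtlasOutput"],
    ["pactl", "load-module", "module-virtual-source",
     "source_name=atlas_mic",
     "master=atlas_out.monitor",
     "source_properties=device.description=AtlasMic"],
    ["pactl", "set-default-source", "atlas_mic"],
    ["pactl", "set-default-sink", "meet_capture"],
]

def setup_audio():
    """Create the virtual devices from Step 2. Errors (e.g. a module
    that is already loaded) are captured and ignored, so re-running
    the script is safe."""
    for cmd in PACTL_COMMANDS:
        subprocess.run(cmd, capture_output=True)
```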

Troubleshooting

"No audio from Meet participants"

  • Chrome's audio may not be routed to meet_capture. Run:
    pactl list short sink-inputs
    If you see Chrome's stream on a different sink, move it:
    pactl move-sink-input <INDEX> meet_capture

"AI responds but Meet participants can't hear it"

  • Check that atlas_mic is Chrome's input source:
    pactl list short source-outputs
    Move Chrome's source input if needed:
    pactl move-source-output <INDEX> atlas_mic

"Chrome fails to start"

  • Make sure Xvfb is running: export DISPLAY=:99
  • Check ChromeDriver version matches Chrome: google-chrome --version
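
Under the Chrome-for-Testing scheme, ChromeDriver needs to match Chrome's major version. You can check this programmatically; major and versions_match below are hypothetical helpers, not part of Selenium:

```python
import subprocess

def major(version_output: str) -> str:
    """Extract the major version number from a `--version` string."""
    for token in version_output.split():
        if token[0].isdigit():
            return token.split(".")[0]
    return ""

def versions_match() -> bool:
    """Return True if Chrome and ChromeDriver share a major version."""
    chrome = subprocess.run(["google-chrome", "--version"],
                            capture_output=True, text=True).stdout
    driver = subprocess.run(["chromedriver", "--version"],
                            capture_output=True, text=True).stdout
    return major(chrome) != "" and major(chrome) == major(driver)
```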

"ElevenLabs session keeps restarting"

  • Check your API key is valid
  • Ensure there's actual audio coming in (silence may cause session timeouts)
  • Try increasing CHUNK_SAMPLES to 8000 (500ms chunks)

"Meet detects bot as automated"

  • The --disable-blink-features=AutomationControlled flag helps
  • The webdriver property override in make_driver() also helps
  • Avoid joining too many calls in rapid succession

Cost Breakdown

Service     | Cost                                              | Notes
ElevenLabs  | Free tier: 10 min/mo; Pro: ~$5/hr of conversation | Conversational AI pricing
Vast.ai     | ~$0.10-0.30/hr                                    | CPU instance is enough
Google Meet | Free                                              | Works with guest access

For testing and development, a few hours of runtime per day keeps the whole stack under $1 (ElevenLabs free-tier minutes plus a cheap CPU instance).


What's Next

Once you have the basic setup working, here are some ideas to extend it:

  • Add knowledge bases to your ElevenLabs agent for domain-specific conversations
  • Record transcripts using the callback functions for automated meeting notes
  • Multi-language support by configuring the agent's language settings
  • Custom tools — ElevenLabs agents support function calling, so your bot can check databases, call APIs, or trigger actions mid-conversation
  • Multiple bots in the same call — each with different roles (note-taker, translator, domain expert)

Wrapping Up

Building an AI voice agent for Google Meet is surprisingly achievable with the right audio routing setup. The combination of Selenium for browser automation, PulseAudio for virtual audio devices, and ElevenLabs for conversational AI creates a robust pipeline that works reliably on headless cloud servers.

The hardest part isn't the AI — it's the audio plumbing. Once you understand the meet_capture -> ElevenLabs -> atlas_out -> atlas_mic flow, the rest is straightforward.

Spin up a Vast.ai instance, follow the steps, and have your AI joining calls in under an hour. Let us know what you build with it!


Built and tested by the Noqta engineering team. Questions? Reach out at noqta.tn.

