Imagine an AI agent that can join your Google Meet calls, listen to participants, and respond in a natural human voice — all running autonomously on a cloud server. That's exactly what we're building in this tutorial.
By the end, you'll have:
- A Selenium-based bot that joins Google Meet as a guest participant
- PulseAudio virtual audio routing that pipes Meet audio to an AI and back
- An ElevenLabs Conversational AI bridge that listens, thinks, and speaks
- A cloud GPU deployment on Vast.ai with a one-command setup script
This is the same setup we tested internally at Noqta — and it works.
Prerequisites
Before diving in, make sure you have:
- An ElevenLabs account with Conversational AI access
- A Vast.ai account (or any Linux server with root access)
- Basic Python and Linux command-line knowledge
- A Google Meet link to test with
Architecture Overview
The system has three main components working together:
| Component | Role | Tool |
|---|---|---|
| Meet Joiner | Opens Chrome, navigates to Meet, joins as guest | Selenium + Xvfb |
| Audio Router | Creates virtual audio devices, routes audio between Meet and AI | PulseAudio |
| AI Bridge | Captures Meet audio, sends to ElevenLabs, plays response back | ElevenLabs ConvAI SDK |
All three run on the same machine. The key insight is using PulseAudio null sinks and virtual sources to create a bidirectional audio bridge between Chrome and the ElevenLabs API — no physical microphone or speaker needed.
Step 1: Provision a Cloud Server
You need a Linux server with a desktop environment (for Chrome). We used Vast.ai because it's cheap, fast to spin up, and gives you root access.
Since there's no physical monitor or sound card on a cloud server, we use Xvfb (virtual framebuffer) for the display and PulseAudio for virtual audio devices.
Start Xvfb and PulseAudio
```bash
# Start a virtual display
Xvfb :99 -screen 0 1280x720x24 &
export DISPLAY=:99

# Start PulseAudio as a per-user daemon that never exits on idle
pulseaudio --start --exit-idle-time=-1
```
Create Virtual Audio Devices
This is the critical part. We need two virtual audio devices:
```bash
# 1. meet_capture — Chrome's audio output goes here.
#    We'll read from meet_capture.monitor to hear what Meet participants say.
pactl load-module module-null-sink \
    sink_name=meet_capture \
    sink_properties=device.description=MeetCapture

# 2. atlas_out — AI responses are played here.
#    atlas_mic reads from atlas_out.monitor and acts as Chrome's mic input.
pactl load-module module-null-sink \
    sink_name=atlas_out \
    sink_properties=device.description=AtlasOutput

# 3. atlas_mic — virtual microphone source fed by atlas_out
pactl load-module module-virtual-source \
    source_name=atlas_mic \
    master=atlas_out.monitor \
    source_properties=device.description=AtlasMic

# Set defaults so Chrome picks them up
pactl set-default-source atlas_mic    # Chrome's mic = AI output
pactl set-default-sink meet_capture   # Chrome's speakers = our capture point
```
Verify the Setup
```bash
# Check sinks (should see meet_capture and atlas_out)
pactl list short sinks

# Check sources (should see atlas_mic and meet_capture.monitor)
pactl list short sources
```
Important: After Chrome starts, you may need to manually move its audio stream to the meet_capture sink. We'll automate this in the bridge.
Step 4: Build the ElevenLabs ConvAI Bridge
This is the core component. It captures audio from meet_capture.monitor (what participants say), sends it to ElevenLabs Conversational AI, and plays the AI's response into atlas_out (which feeds Chrome's mic).
```python
#!/usr/bin/env python3
"""meet_elevenlabs_bridge.py — ElevenLabs ConvAI <-> Google Meet bridge via PulseAudio.

Audio flow:
    Meet audio out -> meet_capture.monitor -> ElevenLabs Agent
    ElevenLabs Agent -> atlas_out -> atlas_out.monitor -> atlas_mic -> Meet mic
"""
import argparse
import queue
import signal
import subprocess
import threading
import time
from typing import Callable

from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import (
    AudioInterface,
    Conversation,
)

# ── Config ─────────────────────────────────────────────────────────
API_KEY = "your_elevenlabs_api_key_here"
INPUT_SOURCE = "meet_capture.monitor"  # what Meet plays
OUTPUT_SINK = "atlas_out"              # feeds into atlas_mic -> Meet mic
SAMPLE_RATE = 16000
CHANNELS = 1
FORMAT = "s16le"        # signed 16-bit little-endian PCM
CHUNK_SAMPLES = 4000    # 250 ms chunks (recommended by the SDK)
CHUNK_BYTES = CHUNK_SAMPLES * CHANNELS * 2


# ── Custom PulseAudio AudioInterface ──────────────────────────────
class PulseAudioInterface(AudioInterface):
    """Routes audio through PulseAudio using parec/pacat subprocesses."""

    def start(self, input_callback: Callable[[bytes], None]):
        self.input_callback = input_callback
        self.output_queue: queue.Queue[bytes] = queue.Queue()
        self.should_stop = threading.Event()

        # Capture from Meet's audio output
        self._rec_proc = subprocess.Popen(
            [
                "parec",
                f"--device={INPUT_SOURCE}",
                f"--format={FORMAT}",
                f"--rate={SAMPLE_RATE}",
                f"--channels={CHANNELS}",
                "--latency-msec=50",
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )

        # Play AI responses into atlas_out -> Meet's mic
        self._play_proc = subprocess.Popen(
            [
                "pacat",
                "--playback",
                f"--device={OUTPUT_SINK}",
                f"--format={FORMAT}",
                f"--rate={SAMPLE_RATE}",
                f"--channels={CHANNELS}",
                "--latency-msec=50",
            ],
            stdin=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )

        self._input_thread = threading.Thread(target=self._read_input, daemon=True)
        self._output_thread = threading.Thread(target=self._write_output, daemon=True)
        self._input_thread.start()
        self._output_thread.start()
        print(f"[audio] capturing from {INPUT_SOURCE}")
        print(f"[audio] playing to {OUTPUT_SINK}")

    def stop(self):
        self.should_stop.set()
        if self._rec_proc:
            self._rec_proc.terminate()
        if self._play_proc:
            try:
                self._play_proc.stdin.close()
            except Exception:
                pass
            self._play_proc.terminate()

    def output(self, audio: bytes):
        self.output_queue.put(audio)

    def interrupt(self):
        # Drain the output queue when the user interrupts the AI (barge-in)
        try:
            while True:
                self.output_queue.get_nowait()
        except queue.Empty:
            pass

    def _read_input(self):
        while not self.should_stop.is_set():
            chunk = self._rec_proc.stdout.read(CHUNK_BYTES)
            if not chunk:
                break
            self.input_callback(chunk)

    def _write_output(self):
        while not self.should_stop.is_set():
            try:
                audio = self.output_queue.get(timeout=0.25)
                self._play_proc.stdin.write(audio)
                self._play_proc.stdin.flush()
            except queue.Empty:
                pass
            except BrokenPipeError:
                break


# ── Main ──────────────────────────────────────────────────────────
def run_bridge(client: ElevenLabs, agent_id: str):
    print("\n[bridge] Starting ElevenLabs ConvAI bridge")
    print(f"[bridge] Agent: {agent_id}")
    print("[bridge] Press Ctrl+C to stop\n")

    quit_event = threading.Event()
    signal.signal(signal.SIGTERM, lambda s, f: quit_event.set())
    signal.signal(signal.SIGINT, lambda s, f: quit_event.set())

    while not quit_event.is_set():
        print("[bridge] Starting new session...")
        try:
            conversation = Conversation(
                client=client,
                agent_id=agent_id,
                requires_auth=False,
                audio_interface=PulseAudioInterface(),
                callback_agent_response=lambda t: print(f"[agent] {t}"),
                callback_user_transcript=lambda t: print(f"[user] {t}"),
                callback_latency_measurement=lambda ms: print(f"[latency] {ms}ms"),
            )
            conversation.start_session()
            conversation.wait_for_session_end()
        except Exception as e:
            print(f"[bridge] Session error: {e}")
        if not quit_event.is_set():
            print("[bridge] Session ended, restarting in 2s...")
            time.sleep(2)
    print("[bridge] Done.")


def main():
    parser = argparse.ArgumentParser(
        description="ElevenLabs ConvAI <-> Google Meet bridge"
    )
    parser.add_argument("--agent-id", required=True, help="ElevenLabs agent ID")
    parser.add_argument("--api-key", default=API_KEY, help="ElevenLabs API key")
    args = parser.parse_args()

    client = ElevenLabs(api_key=args.api_key)
    run_bridge(client, args.agent_id)


if __name__ == "__main__":
    main()
```
How the PulseAudioInterface Works
The ElevenLabs SDK expects an AudioInterface with four methods:
| Method | What it does |
|---|---|
| `start(callback)` | Spawns `parec` (capture) and `pacat` (playback) subprocesses |
| `output(audio)` | Queues AI-generated audio bytes for playback |
| `interrupt()` | Drains the queue when the user starts speaking (barge-in) |
| `stop()` | Terminates the audio subprocesses |
Using parec and pacat directly (instead of PyAudio or sounddevice) is the most reliable approach on headless Linux servers — no ALSA/JACK conflicts, no device enumeration issues.
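Stripped of the SDK and subprocess plumbing, the barge-in logic reduces to a drainable queue. The class below is a standalone sketch of that pattern (our own illustration, not part of the ElevenLabs SDK), which you can exercise without PulseAudio or an API key:

```python
import queue


class OutputBuffer:
    """Standalone sketch of the bridge's playback queue with barge-in drain.

    Mirrors the output()/interrupt() pair of the PulseAudioInterface,
    minus the pacat subprocess, so the logic can be tested in isolation.
    """

    def __init__(self) -> None:
        self._q: "queue.Queue[bytes]" = queue.Queue()

    def output(self, audio: bytes) -> None:
        # AI-generated audio chunks are queued for the playback thread.
        self._q.put(audio)

    def interrupt(self) -> None:
        # Barge-in: discard everything not yet played so the agent
        # stops talking as soon as the user starts speaking.
        try:
            while True:
                self._q.get_nowait()
        except queue.Empty:
            pass

    def pending(self) -> int:
        # Number of chunks still waiting to be played.
        return self._q.qsize()


buf = OutputBuffer()
buf.output(b"\x00" * 320)
buf.output(b"\x00" * 320)
print(buf.pending())  # 2
buf.interrupt()
print(buf.pending())  # 0
```

The same drain-until-empty loop is safe to call even when nothing is queued, which matters because the SDK may invoke `interrupt()` at any point during a turn.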
Step 5: Wire Everything Together
Now let's run all three components: the audio setup, the Meet joiner, and the ElevenLabs bridge. Open a separate terminal session for each (or use tmux).
After Chrome joins the call, move its audio output to the capture sink:
```bash
# List Chrome's audio streams
pactl list short sink-inputs

# Move each stream to meet_capture (replace INDEX with the actual number)
pactl move-sink-input INDEX meet_capture
```
You can automate this with a helper function:
```python
import subprocess
import time


def move_chrome_audio():
    """Move all Chrome audio streams to the meet_capture sink."""
    time.sleep(6)  # wait for Chrome to start playing audio
    result = subprocess.run(
        ["pactl", "list", "short", "sink-inputs"],
        capture_output=True,
        text=True,
    )
    for line in result.stdout.strip().splitlines():
        parts = line.split()
        if parts:
            subprocess.run(
                ["pactl", "move-sink-input", parts[0], "meet_capture"],
                capture_output=True,
            )
            print(f"Moved audio stream {parts[0]} to meet_capture")
```
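The same stream-moving trick applies to Chrome's microphone stream (`pactl list short source-outputs`, moved with `pactl move-source-output`). Extracting the stream index is just whitespace splitting on pactl's tab-separated short output; here is a small helper (a hypothetical name of our own) that isolates that parsing so it can be unit-tested without PulseAudio installed:

```python
def parse_stream_indices(pactl_short_output: str) -> list[str]:
    """Extract stream indices from `pactl list short sink-inputs`
    (or `source-outputs`) output.

    Each line of the short format starts with the numeric stream index,
    followed by tab-separated fields (module, client, driver, format).
    """
    indices = []
    for line in pactl_short_output.strip().splitlines():
        parts = line.split()
        # Guard against blank lines or warnings mixed into the output.
        if parts and parts[0].isdigit():
            indices.append(parts[0])
    return indices


# Two example lines in pactl's short format (values are illustrative)
sample = (
    "42\t1\t87\tprotocol-native.c\ts16le 2ch 48000Hz\n"
    "57\t1\t87\tprotocol-native.c\ts16le 1ch 16000Hz"
)
print(parse_stream_indices(sample))  # ['42', '57']
```

With this in place, the move loop becomes `for idx in parse_stream_indices(result.stdout): ...`, and you can reuse it for both sink-inputs and source-outputs.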
Step 6: Create Your ElevenLabs Agent
Before running the bridge, you need a Conversational AI agent on ElevenLabs:
1. Go to ElevenLabs Dashboard > Conversational AI
2. Click Create Agent
3. Configure your agent:
   - Name: Your bot's name (e.g., "Atlas")
   - Voice: Pick any voice from the library
   - System prompt: Define the agent's personality and knowledge
   - Language: Set to your preferred language
4. Copy the Agent ID from the agent settings page
Example System Prompt
```text
You are Atlas, a helpful AI assistant participating in a Google Meet call.
You listen to what participants say and respond naturally.
Keep responses concise — this is a live conversation, not a text chat.
If you're unsure about something, ask for clarification.
```
Step 7: Deploy as a Single Script
For production use, combine everything into one script:
```python
#!/usr/bin/env python3
"""voice_meet_bot.py — Complete Google Meet AI voice agent."""
import os
import subprocess
import sys
import threading
import time

# ... (combine meet_bot.py + meet_elevenlabs_bridge.py)
# See the full combined script in the project repo


def main():
    # 1. Set up audio devices
    setup_audio()

    # 2. Start Chrome and join Meet
    driver = make_driver()
    join_thread = threading.Thread(target=join_meet, args=(driver, MEET_URL))
    join_thread.start()

    # 3. Move Chrome audio after it joins
    time.sleep(10)
    move_chrome_audio()

    # 4. Start the ElevenLabs bridge
    client = ElevenLabs(api_key=API_KEY)
    run_bridge(client, AGENT_ID)
```
Troubleshooting
"No audio from Meet participants"
Chrome's audio may not be routed to meet_capture. Run:

```bash
pactl list short sink-inputs
```

If you see Chrome's stream on a different sink, move it:

```bash
pactl move-sink-input <INDEX> meet_capture
```
"AI responds but Meet participants can't hear it"
Check that atlas_mic is Chrome's input source:

```bash
pactl list short source-outputs
```

Move Chrome's source output if needed:

```bash
pactl move-source-output <INDEX> atlas_mic
```
"Chrome fails to start"
- Make sure Xvfb is running and `DISPLAY` points at it: `export DISPLAY=:99`
- Check that your ChromeDriver version matches Chrome: compare `google-chrome --version` with `chromedriver --version`
"ElevenLabs session keeps restarting"
- Check that your API key is valid
- Ensure there's actual audio coming in (silence may cause session timeouts)
- Try increasing `CHUNK_SAMPLES` to 8000 (500 ms chunks)
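When tuning `CHUNK_SAMPLES`, it helps to sanity-check the arithmetic: with 16 kHz mono s16le audio, sample counts map to bytes and milliseconds as below. This quick worked check mirrors the constants in the bridge config:

```python
SAMPLE_RATE = 16000   # Hz, matches the bridge config
CHANNELS = 1
BYTES_PER_SAMPLE = 2  # s16le = 2 bytes per sample


def chunk_bytes(samples: int) -> int:
    # Bytes read per parec chunk for a given sample count.
    return samples * CHANNELS * BYTES_PER_SAMPLE


def chunk_ms(samples: int) -> float:
    # Duration of one chunk in milliseconds.
    return samples / SAMPLE_RATE * 1000


print(chunk_bytes(4000), chunk_ms(4000))  # 8000 250.0
print(chunk_bytes(8000), chunk_ms(8000))  # 16000 500.0
```

Larger chunks mean fewer WebSocket messages but more added latency per turn, so 250–500 ms is the practical range here.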
"Meet detects bot as automated"
- The `--disable-blink-features=AutomationControlled` flag helps
- The `webdriver` property override in `make_driver()` also helps
For testing and development, you can run the full stack for under $1/day.
What's Next
Once you have the basic setup working, here are some ideas to extend it:
- Add knowledge bases to your ElevenLabs agent for domain-specific conversations
- Record transcripts using the callback functions for automated meeting notes
- Multi-language support by configuring the agent's language settings
- Custom tools — ElevenLabs agents support function calling, so your bot can check databases, call APIs, or trigger actions mid-conversation
- Multiple bots in the same call — each with different roles (note-taker, translator, domain expert)
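Transcript recording, for instance, needs nothing more than swapping the print lambdas in `run_bridge()` for callbacks that append to a file. A minimal sketch (the timestamped log format is our own choice, not an SDK convention):

```python
from datetime import datetime, timezone


def make_transcript_logger(path: str, speaker: str):
    """Return a callback that appends '[HH:MM:SS] speaker: text' lines.

    Pass the result as callback_user_transcript / callback_agent_response
    when constructing the Conversation.
    """
    def log(text: str) -> None:
        stamp = datetime.now(timezone.utc).strftime("%H:%M:%S")
        with open(path, "a", encoding="utf-8") as f:
            f.write(f"[{stamp}] {speaker}: {text}\n")
    return log


# Usage inside run_bridge(), replacing the print lambdas:
#   callback_user_transcript=make_transcript_logger("meeting.log", "user"),
#   callback_agent_response=make_transcript_logger("meeting.log", "agent"),
```

Because the SDK invokes these callbacks as text arrives, the log file doubles as a live view of the meeting (`tail -f meeting.log`).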
Wrapping Up
Building an AI voice agent for Google Meet is surprisingly achievable with the right audio routing setup. The combination of Selenium for browser automation, PulseAudio for virtual audio devices, and ElevenLabs for conversational AI creates a robust pipeline that works reliably on headless cloud servers.
The hardest part isn't the AI — it's the audio plumbing. Once you understand the meet_capture -> ElevenLabs -> atlas_out -> atlas_mic flow, the rest is straightforward.
Spin up a Vast.ai instance, follow the steps, and have your AI joining calls in under an hour. Let us know what you build with it!
Built and tested by the Noqta engineering team. Questions? Reach out at noqta.tn.