For two years the strongest coding models were closed, expensive, and impossible to run on your own hardware. GLM-5.2, released by Z.ai (formerly Zhipu) on June 13, 2026, breaks that pattern. It is an open-weights Mixture-of-Experts model built for agentic software engineering rather than chat, it ships with a genuinely usable 1-million-token context window, and its hosted API is exposed through two compatibility layers at once — one that mimics OpenAI's /chat/completions and one that mimics Anthropic's Messages API.
That dual personality is the interesting part for developers. A single model can power your TypeScript backend through the OpenAI SDK, act as a drop-in replacement for Claude inside Claude Code, and, because the weights are MIT-licensed and published on Hugging Face, run entirely inside your own data center with vLLM. For teams in the MENA region bound by Tunisia's INPDP rules or Saudi Arabia's PDPL, that last option means frontier-class coding assistance without a single token leaving your infrastructure.
This tutorial walks through all of it: calling GLM-5.2 from TypeScript, streaming, tool calling, swapping it into Claude Code and Cline, and finally self-hosting the open weights.
Prerequisites
Before starting, make sure you have:
- Node.js 20+ and a package manager (we use pnpm, but npm works too)
- Basic familiarity with TypeScript and async/await
- A Z.ai account with an API key (a Coding Plan subscription or pay-as-you-go credits); sign up at z.ai
- Optional, for the self-hosting section: a Linux box with an NVIDIA GPU (or access to one) and Docker
- Optional: Claude Code and/or the Cline VS Code extension installed
What You'll Build
By the end you will have:
- A small TypeScript script that calls GLM-5.2 through the OpenAI-compatible endpoint
- A streaming chat function and a tool-calling (function-calling) agent loop
- Claude Code reconfigured to use GLM-5.2 as its model backend
- Cline pointed at the same model inside VS Code
- A self-hosted GLM endpoint running on vLLM that exposes the same OpenAI-compatible API as the hosted service
The trick that makes all five possible is understanding Z.ai's two endpoints, so let's start there.
Step 1: Understand the Two Endpoints
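Z.ai exposes GLM models through two API dialects, OpenAI-style and Anthropic-style, with a separate Anthropic-dialect base URL for Coding Plan subscribers. Picking the right one for each tool is the single most important decision in this tutorial.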
| Endpoint | Base URL | Dialect | Use it for |
|---|---|---|---|
| OpenAI-compatible | https://api.z.ai/api/paas/v4/ | OpenAI /chat/completions | Your own code, Cline, Cursor, most SDKs |
| Anthropic-compatible | https://api.z.ai/api/anthropic | Anthropic Messages API | Claude Code, Anthropic SDK |
| Coding-plan (Anthropic) | https://api.z.ai/api/coding/paas/v4 | Anthropic Messages API | Claude Code on a Coding Plan subscription |
Z.ai is one of a small number of providers besides Anthropic itself that offer a native Anthropic-compatible endpoint. That is what makes GLM-5.2 a true drop-in for Claude Code: no proxy, no adapter, just a different base URL.
The model id is glm-5.2 everywhere, with one exception we will cover later: inside Claude Code you append a [1m] suffix (glm-5.2[1m]) to unlock the full 1-million-token window.
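To make the table concrete, here is a minimal sketch that maps each tool to the dialect, base URL, and model id it should use. The `Tool` names and the `endpointFor` helper are hypothetical and exist only for this illustration; the Coding Plan row is left out to keep the mapping short.

```typescript
// Sketch only: `Tool` and `endpointFor` are illustrative names, not part of any SDK.
type Dialect = "openai" | "anthropic";

interface Endpoint {
  dialect: Dialect;
  baseURL: string;
  model: string;
}

type Tool = "own-code" | "cline" | "cursor" | "claude-code" | "anthropic-sdk";

// Base URLs copied from the table above.
const OPENAI_BASE = "https://api.z.ai/api/paas/v4/";
const ANTHROPIC_BASE = "https://api.z.ai/api/anthropic";

function endpointFor(tool: Tool): Endpoint {
  switch (tool) {
    case "claude-code":
      // Claude Code is the one place where the [1m] suffix is appended.
      return { dialect: "anthropic", baseURL: ANTHROPIC_BASE, model: "glm-5.2[1m]" };
    case "anthropic-sdk":
      return { dialect: "anthropic", baseURL: ANTHROPIC_BASE, model: "glm-5.2" };
    default:
      // Your own code, Cline, Cursor, and most SDKs speak the OpenAI dialect.
      return { dialect: "openai", baseURL: OPENAI_BASE, model: "glm-5.2" };
  }
}
```

Keeping this decision in one function means a wrong dialect shows up as a single wrong table entry rather than as a confusing runtime error in one of several tools.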
Step 2: Project Setup
Create a fresh project and install the OpenAI SDK. Even though we are talking to GLM, the OpenAI-compatible endpoint means the official openai package is the cleanest way in.
mkdir glm-quickstart && cd glm-quickstart
pnpm init
pnpm add openai
pnpm add -D typescript tsx @types/node
npx tsc --init

Store your key in an environment variable rather than hardcoding it, a rule that holds even for throwaway scripts.

# .env (never commit this)
ZAI_API_KEY=your-zai-api-key-here

Add a .gitignore line for .env, then create src/client.ts that builds a configured client once and exports it:

// src/client.ts
import OpenAI from "openai";

if (!process.env.ZAI_API_KEY) {
  throw new Error("Missing ZAI_API_KEY environment variable");
}

export const glm = new OpenAI({
  apiKey: process.env.ZAI_API_KEY,
  // The OpenAI-compatible base URL (note the trailing slash)
  baseURL: "https://api.z.ai/api/paas/v4/",
});

export const GLM_MODEL = "glm-5.2";

Because the client is just the OpenAI SDK pointed at a different host, every method you already know (chat.completions.create, streaming, tool calls) works unchanged.
Step 3: Your First Completion
Create src/hello.ts:
// src/hello.ts
import "dotenv/config";
import { glm, GLM_MODEL } from "./client";

async function main() {
  const completion = await glm.chat.completions.create({
    model: GLM_MODEL,
    messages: [
      { role: "system", content: "You are a precise senior TypeScript engineer." },
      { role: "user", content: "Write a one-line function that debounces a callback." },
    ],
  });
  console.log(completion.choices[0].message.content);
}

main().catch((err) => {
  console.error("GLM request failed:", err);
  process.exit(1);
});

Install dotenv and run it:

pnpm add dotenv
pnpm tsx src/hello.ts

If your key is valid you will see a debounce implementation printed to the terminal. Notice that nothing in this code references Z.ai except the base URL inside client.ts; that is the whole point of the OpenAI-compatible layer.
Step 4: Streaming Responses
For anything interactive, you want tokens as they arrive instead of waiting for the full response. Set stream: true and iterate the async stream:
// src/stream.ts
import "dotenv/config";
import { glm, GLM_MODEL } from "./client";

async function streamChat(prompt: string) {
  const stream = await glm.chat.completions.create({
    model: GLM_MODEL,
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
  process.stdout.write("\n");
}

streamChat("Explain the actor model in three sentences.");

The chunk shape is identical to OpenAI's, so any UI streaming code you already wrote for ChatGPT-style apps works without changes.
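Interactive code often needs the full text as well as the live deltas, for example to store the reply once streaming ends. The following is a minimal sketch of such a helper. `ChunkLike` is a hypothetical structural type that covers only the fields read here, so the function accepts the SDK stream from the example above as well as a fake stream in tests.

```typescript
// Hypothetical structural type: only the fields this helper actually reads.
interface ChunkLike {
  choices: Array<{ delta?: { content?: string | null } }>;
}

// Consume a stream of chunks, forward each text delta to `onDelta`,
// and return the concatenated text once the stream ends.
async function collectStream(
  stream: AsyncIterable<ChunkLike>,
  onDelta: (text: string) => void = () => {},
): Promise<string> {
  let full = "";
  for await (const chunk of stream) {
    // Some chunks carry no text (role-only or final chunks), so skip those.
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) {
      full += delta;
      onDelta(delta);
    }
  }
  return full;
}
```

In the streaming script above, this could be called as `collectStream(stream, (d) => process.stdout.write(d))` to print progressively and still end up with the whole reply in one string.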
Step 5: Control Reasoning Effort
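GLM-5.2 ships with two selectable thinking-effort levels. Higher effort spends more tokens reasoning before answering, which is worth it for hard refactors and architecture questions but wasteful for simple lookups. You control it with the reasoning request field. The OpenAI types do not know about it, so the example adds the field directly to the request body and suppresses the resulting type error with @ts-expect-error; the Node SDK sends the extra field to the API unchanged (extra_body is the equivalent mechanism in the Python SDK):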
// src/reasoning.ts
import "dotenv/config";
import { glm, GLM_MODEL } from "./client";

async function deepThink(prompt: string) {
  const completion = await glm.chat.completions.create({
    model: GLM_MODEL,
    messages: [{ role: "user", content: prompt }],
    // Z.ai-specific field passed straight through to the API
    // @ts-expect-error reasoning is a Z.ai extension, not in OpenAI types
    reasoning: { effort: "high" },
  });
  return completion.choices[0].message.content;
}

deepThink("Design a retry strategy for a payment webhook that must never double-charge.")
  .then(console.log);

Use high effort sparingly: on long-horizon agent loops it adds up. For routine completions, omit the parameter entirely and let the model default to its faster mode.
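One way to follow that advice is to decide about effort in a single place. The sketch below builds a request body and adds the `reasoning` field only when the caller asks for deep thinking, otherwise leaving it out so the model uses its default mode. `buildRequest` and the `deep` flag are hypothetical names for this illustration, and only the `"high"` value shown in the example above is used.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface GlmRequestBody {
  model: string;
  messages: ChatMessage[];
  // Z.ai extension taken from the example above; not part of the OpenAI types.
  reasoning?: { effort: "high" };
}

// Build a request body; `deep` is a hypothetical flag the caller sets for hard tasks.
function buildRequest(prompt: string, deep: boolean): GlmRequestBody {
  const body: GlmRequestBody = {
    model: "glm-5.2",
    messages: [{ role: "user", content: prompt }],
  };
  // Omit the field entirely for routine requests instead of sending a low value.
  if (deep) body.reasoning = { effort: "high" };
  return body;
}
```

Routing every call through one builder makes it easy to audit which code paths pay for high effort.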
Step 6: Tool Calling for Agents
Tool-calling reliability is what makes GLM-5.2 suited to "agentic software engineering" rather than chat. The schema is the standard OpenAI tools array, so an agent loop looks familiar. Here is a minimal loop that gives the model one tool and executes it:
// src/agent.ts
import "dotenv/config";
import { glm, GLM_MODEL } from "./client";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_repo_file_count",
      description: "Return how many files live in a given directory of the repo.",
      parameters: {
        type: "object",
        properties: {
          path: { type: "string", description: "Directory path, e.g. src/" },
        },
        required: ["path"],
      },
    },
  },
];

// A fake implementation. In real life you would read the filesystem.
function runTool(name: string, args: Record<string, unknown>): string {
  if (name === "get_repo_file_count") {
    return JSON.stringify({ path: args.path, count: 42 });
  }
  return JSON.stringify({ error: "unknown tool" });
}

async function agent(userPrompt: string) {
  const messages: ChatCompletionMessageParam[] = [
    { role: "user", content: userPrompt },
  ];

  // First turn: the model decides whether to call a tool.
  const first = await glm.chat.completions.create({
    model: GLM_MODEL,
    messages,
    tools,
  });
  const choice = first.choices[0].message;
  messages.push(choice);

  // Execute any tool calls and feed results back.
  for (const call of choice.tool_calls ?? []) {
    const args = JSON.parse(call.function.arguments);
    const result = runTool(call.function.name, args);
    messages.push({ role: "tool", tool_call_id: call.id, content: result });
  }

  // Second turn: the model answers using the tool output.
  const second = await glm.chat.completions.create({
    model: GLM_MODEL,
    messages,
    tools,
  });
  return second.choices[0].message.content;
}

agent("How many files are in the src directory?").then(console.log);

In production you would loop the tool-execution step until the model stops requesting tools, add a max_steps guard, and log every tool call. But the core pattern (ask, execute, feed back, ask again) is exactly the loop that powers coding agents.
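The production variant described above can be sketched as follows. To keep the block self-contained, the model call is injected as `callModel`, a hypothetical stand-in for `glm.chat.completions.create(...)` that returns `choices[0].message`; the local types mirror only the parts of the OpenAI message shape used here, and the default of 8 steps is an arbitrary assumption.

```typescript
// Local structural types mirroring the OpenAI chat shapes used in agent.ts.
interface ToolCall {
  id: string;
  function: { name: string; arguments: string };
}

type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string | null; tool_calls?: ToolCall[] }
  | { role: "tool"; tool_call_id: string; content: string };

type AssistantMessage = Extract<Message, { role: "assistant" }>;

// Hypothetical injection points: the model call and the tool executor.
type CallModel = (messages: Message[]) => Promise<AssistantMessage>;
type RunTool = (name: string, args: Record<string, unknown>) => string;

async function runAgent(
  userPrompt: string,
  callModel: CallModel,
  runTool: RunTool,
  maxSteps = 8,
  log: (line: string) => void = () => {},
): Promise<string | null> {
  const messages: Message[] = [{ role: "user", content: userPrompt }];

  for (let step = 0; step < maxSteps; step++) {
    const reply = await callModel(messages);
    messages.push(reply);

    // No tool requests means the model has produced its final answer.
    const calls = reply.tool_calls ?? [];
    if (calls.length === 0) return reply.content;

    for (const call of calls) {
      log(`step ${step}: ${call.function.name}(${call.function.arguments})`);
      let result: string;
      try {
        const args = JSON.parse(call.function.arguments) as Record<string, unknown>;
        result = runTool(call.function.name, args);
      } catch (err) {
        // Report malformed arguments or tool failures to the model instead of crashing.
        result = JSON.stringify({ error: String(err) });
      }
      messages.push({ role: "tool", tool_call_id: call.id, content: result });
    }
  }
  // Guard: stop a model that never stops requesting tools.
  throw new Error(`Agent exceeded ${maxSteps} steps without a final answer`);
}
```

Injecting the model call also makes the loop testable without network access, since a scripted fake can play the model's role.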
Step 7: Use GLM-5.2 as the Claude Code Backend
This is where the Anthropic-compatible endpoint earns its keep. Claude Code reads a handful of environment variables to decide which API and which models to call. Point them at Z.ai and Claude Code never knows the difference.
Add these to your shell profile, or to the env block of ~/.claude/settings.json:
export ANTHROPIC_BASE_URL="https://api.z.ai/api/coding/paas/v4"
export ANTHROPIC_API_KEY="your-glm-coding-plan-key"
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7"
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=1000000
export API_TIMEOUT_MS=3000000

What each line does:

- ANTHROPIC_BASE_URL redirects every Claude Code request to Z.ai's coding endpoint.
- ANTHROPIC_DEFAULT_SONNET_MODEL / ANTHROPIC_DEFAULT_OPUS_MODEL map Claude's tiers onto GLM-5.2. The [1m] suffix is a Claude Code convention that requests the 1-million-token context variant.
- ANTHROPIC_DEFAULT_HAIKU_MODEL points the cheap background model (used for summaries and titles) at the smaller, faster glm-4.7.
- CLAUDE_CODE_AUTO_COMPACT_WINDOW tells Claude Code not to compact the conversation until it nears one million tokens, so you actually use the window you are paying for.
- API_TIMEOUT_MS raises the request timeout so long agentic runs do not get cut off.
Keep your real Anthropic key in a separate shell profile if you still use the genuine Claude API. These environment variables are global to Claude Code, so setting them switches every session over to GLM until you unset them.
Open a new terminal so the variables load, run claude, and ask it to do something. The responses now come from GLM-5.2 — at roughly one-sixth the per-token cost of a frontier closed model.
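Because a missing or stale variable silently sends Claude Code somewhere else, a quick sanity check before launching can save time. The sketch below inspects an environment object for the variables exported above. `checkClaudeCodeEnv` is a hypothetical helper, and the rule that the base URL should start with `https://api.z.ai/` is an assumption drawn from the URLs in this tutorial.

```typescript
type Env = Record<string, string | undefined>;

// Variables from the export block above that Claude Code needs in order to reach GLM.
const REQUIRED_VARS = [
  "ANTHROPIC_BASE_URL",
  "ANTHROPIC_API_KEY",
  "ANTHROPIC_DEFAULT_SONNET_MODEL",
  "ANTHROPIC_DEFAULT_OPUS_MODEL",
  "ANTHROPIC_DEFAULT_HAIKU_MODEL",
] as const;

// Return a list of human-readable problems; an empty list means the setup looks sane.
function checkClaudeCodeEnv(env: Env): string[] {
  const problems: string[] = [];
  for (const name of REQUIRED_VARS) {
    if (!env[name]) problems.push(`${name} is not set`);
  }
  // Assumption: every Z.ai base URL in this tutorial lives under api.z.ai.
  const base = env.ANTHROPIC_BASE_URL;
  if (base && !base.startsWith("https://api.z.ai/")) {
    problems.push(`ANTHROPIC_BASE_URL does not point at Z.ai: ${base}`);
  }
  return problems;
}
```

Calling `checkClaudeCodeEnv(process.env)` in the terminal where you plan to run `claude` shows whether that shell actually picked up the exports.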
Step 8: Use GLM-5.2 in Cline
Cline (the VS Code agent) speaks the OpenAI dialect, so here you use the OpenAI-compatible endpoint instead. In Cline's settings:
- API Provider: OpenAI Compatible
- Base URL: https://api.z.ai/api/paas/v4/ (keep the trailing slash)
- API Key: your Z.ai key
- Model ID: glm-5.2 (no [1m] suffix here; that is a Claude Code-only convention)
- Context Window: 1000000
That last setting matters more than it looks. Cline uses the configured context window to decide when to truncate conversation history. Leave it at a small default and you throw away most of GLM-5.2's million-token window before the model ever sees it. The same fields and values work for Cursor's custom-model configuration.
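To see why the setting matters, consider a simplified model of history truncation. This is not Cline's actual algorithm; it is a sketch that keeps the newest turns that fit in the configured window and drops the rest, using a rough guess of four characters per token. Both `fitToWindow` and `estimateTokens` are hypothetical names.

```typescript
interface Turn {
  role: string;
  content: string;
}

// Very rough heuristic (assumption): about 4 characters per token. Real tokenizers differ.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the newest turns that fit within `contextWindow` tokens, dropping the oldest first.
function fitToWindow(history: Turn[], contextWindow: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  // Walk backwards from the most recent turn.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].content);
    if (used + cost > contextWindow) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}
```

Under this model, a window of a few thousand tokens discards all but the last few turns of a long session, while a window of 1000000 keeps everything the model is able to read.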
Step 9: Self-Host the Open Weights with vLLM
The hosted API is convenient, but the headline feature for regulated teams is that GLM-5.2's weights are MIT-licensed and published on Hugging Face and ModelScope. You can serve them yourself with vLLM, which exposes an OpenAI-compatible server — meaning the exact client.ts from Step 2 works against your own hardware by changing one URL.
A minimal server launch (on a multi-GPU box, since this is a large MoE model):
# Install vLLM in a fresh environment
pip install "vllm>=0.8.0"
# Serve GLM-5.2 with an OpenAI-compatible API on port 8000
vllm serve zai-org/GLM-5.2 \
--served-model-name glm-5.2 \
--tensor-parallel-size 8 \
--max-model-len 200000 \
--host 0.0.0.0 \
--port 8000

Then point your TypeScript client at the local server, so that no request leaves your network:
// src/client-selfhosted.ts
import OpenAI from "openai";

export const glm = new OpenAI({
  apiKey: "not-needed-locally", // vLLM ignores this by default
  baseURL: "http://localhost:8000/v1",
});

export const GLM_MODEL = "glm-5.2";

For MENA teams under Tunisia's INPDP or Saudi Arabia's PDPL, self-hosting is the cleanest compliance story: open weights plus a local vLLM endpoint means proprietary source code and customer data never touch a third-party API. Start with a smaller context length (--max-model-len) and raise it as your GPU memory allows.
If a full multi-GPU deployment is out of reach, start with the hosted API for development and reserve the self-hosted path for the workloads that genuinely require on-premise data handling.
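Splitting workloads between the hosted API and a self-hosted server is easier when the choice lives in configuration rather than in two client files. The following sketch picks the target from an environment object. `GLM_SELF_HOSTED_URL` is a hypothetical variable name invented for this example, and `resolveClientConfig` is likewise illustrative.

```typescript
type Env = Record<string, string | undefined>;

interface ClientConfig {
  apiKey: string;
  baseURL: string;
  model: string;
}

// GLM_SELF_HOSTED_URL is a hypothetical variable name chosen for this sketch.
function resolveClientConfig(env: Env): ClientConfig {
  const local = env.GLM_SELF_HOSTED_URL;
  if (local) {
    // Self-hosted vLLM: the key is a placeholder, as in client-selfhosted.ts.
    return { apiKey: "not-needed-locally", baseURL: local, model: "glm-5.2" };
  }
  if (!env.ZAI_API_KEY) {
    throw new Error("Missing ZAI_API_KEY environment variable");
  }
  // Hosted API: same base URL as client.ts in Step 2.
  return { apiKey: env.ZAI_API_KEY, baseURL: "https://api.z.ai/api/paas/v4/", model: "glm-5.2" };
}
```

A client file could then destructure the result, for example `const { model, ...options } = resolveClientConfig(process.env)`, and pass `options` to `new OpenAI(...)`, so development uses the hosted API and regulated workloads switch targets by setting one variable.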
Step 10: Make the Most of the 1M Context
A million-token window changes how you prompt. Instead of carefully trimming context, you can paste an entire module, its tests, and the relevant docs in one request. Two practical guidelines:
- Long context is not free. Even with GLM-5.2's sparse-attention optimizations, more tokens means higher latency and cost. Send the whole repo when the task genuinely needs it; send one file when it does not.
- The [1m] suffix is Claude Code only. In your own TypeScript code and in Cline, the window is controlled by configuration (the context-window setting), not by the model id. Do not append [1m] to glm-5.2 in API calls; it is not a valid model name there.
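If model ids are shared between a Claude Code configuration and your own scripts, a small guard can prevent the suffix from leaking into direct API calls. This is a minimal sketch; `stripContextSuffix` is a hypothetical helper name.

```typescript
// The Claude Code-only suffix described above.
const CONTEXT_SUFFIX = "[1m]";

// Return the plain model id for use with the OpenAI-compatible endpoint or vLLM.
function stripContextSuffix(modelId: string): string {
  return modelId.endsWith(CONTEXT_SUFFIX)
    ? modelId.slice(0, -CONTEXT_SUFFIX.length)
    : modelId;
}
```

Applying it right before each request means a value copied from a Claude Code setting still resolves to a model name the API accepts.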
Testing Your Integration
Verify each layer independently so you know where a failure lives:
- OpenAI layer: pnpm tsx src/hello.ts returns a completion → your key and base URL are correct.
- Streaming: pnpm tsx src/stream.ts prints text progressively → streaming works.
- Tools: pnpm tsx src/agent.ts returns an answer that references the tool output (the count of 42) → tool calling round-trips.
- Claude Code: in a fresh terminal, run claude and check that responses arrive; if Claude Code reports an auth error, your ANTHROPIC_API_KEY or base URL is wrong.
- Self-hosted: curl http://localhost:8000/v1/models lists glm-5.2 → your vLLM server is up.
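The self-hosted check can also be scripted, which is handy in CI or a startup probe. The sketch below asks a server for its model list and reports whether a given id is served. It assumes the OpenAI-style response shape (`{ data: [{ id }] }`) for the models route, and the fetch function is injected so the block stays self-contained; `isModelServed` and `FetchLike` are hypothetical names.

```typescript
interface ModelList {
  data: Array<{ id: string }>;
}

// Minimal subset of the fetch API used here; the global `fetch` satisfies it.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

// Return true when the server at `baseURL` lists `modelId` under its models route.
async function isModelServed(
  baseURL: string,
  modelId: string,
  fetchFn: FetchLike,
): Promise<boolean> {
  // Tolerate a trailing slash on the base URL.
  const res = await fetchFn(`${baseURL.replace(/\/+$/, "")}/models`);
  if (!res.ok) return false;
  const body = (await res.json()) as Partial<ModelList>;
  return Array.isArray(body.data) && body.data.some((m) => m?.id === modelId);
}
```

On Node.js 20, `isModelServed("http://localhost:8000/v1", "glm-5.2", fetch)` performs the same check as the curl command above.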
Troubleshooting
401 Unauthorized: Your API key is missing, malformed, or for the wrong plan. The Coding Plan key and the standard API key are billed differently — make sure you are using the one that matches your endpoint.
Claude Code still calls the real Anthropic API: Environment variables only apply to terminals opened after you set them. Open a new terminal, or confirm with echo $ANTHROPIC_BASE_URL.
Responses get truncated early in Cline: The context-window setting is too low. Raise it to 1000000 so Cline stops discarding history prematurely.
Invalid model name on a direct API call: You appended [1m] outside Claude Code. Use plain glm-5.2 in the OpenAI-compatible endpoint and in vLLM.
vLLM out-of-memory at launch: Lower --max-model-len, increase --tensor-parallel-size if you have more GPUs, or quantize the weights. A large MoE model needs substantial VRAM even when only a fraction of parameters are active per token.
Next Steps
- Wrap the agent loop from Step 6 in a proper framework — see our guides on building an AI agent with the Vercel AI SDK and type-safe agents with the OpenAI Agents SDK, both of which accept any OpenAI-compatible endpoint.
- Add observability so you can see token usage and latency — our Langfuse LLM observability tutorial plugs into any OpenAI-compatible client.
- Compare GLM-5.2 against other open-weights frontier models with our MiniMax M3 developer guide.
Conclusion
GLM-5.2's real innovation is not just that it is an open-weights model that competes with closed frontier systems — it is how portable it is. One model, exposed through both the OpenAI and Anthropic dialects, drops into your TypeScript code, your Claude Code sessions, and your Cline workspace with nothing more than a base-URL change. And because the weights are open, the same code can run against a vLLM server inside your own data center the day a compliance requirement makes that necessary.
For MENA teams especially, that combination — frontier coding quality, one-sixth the cost, and a credible path to full on-premise data sovereignty — is what makes GLM-5.2 worth wiring into your stack today. Start with the hosted API to ship fast, and keep the self-hosting recipe in your back pocket for the workloads that demand it.