Apache SkyWalking – AI

Blog: Monitoring LLM Applications with SkyWalking 10.4: Insights into Performance and Cost

Sun, 05 Apr 2026 00:00:00 +0000

With the deep penetration of Generative AI (GenAI) into enterprise workflows, developers face a challenging paradox: while powerful LLM capabilities are easily integrated via Spring AI or OpenAI SDKs, the actual performance and reliability of these calls remain largely invisible.

1. The “Black Box” of Cost and Performance: Is the Expensive Model Worth It?

Facing high LLM bills, organizations often only see a total sum paid to a provider, but cannot calculate the “ROI” within the application.

Blind Upgrades: You might switch to a premium flagship model for a better experience. But in your specific business scenario, does paying several times more per token actually yield lower latency or a faster TTFT (Time to First Token)?
Lack of Real-World Benchmarks: Official benchmarks mean little without your real-world business requests. You need to know which model achieves the perfect balance between “Token/Cost Consumption” and “Response Speed” under your actual prompt lengths and concurrency levels.

2. The Vanishing “Golden Timeout”

Many teams set timeouts for LLM calls arbitrarily (e.g., 30s or 60s).

Too Short: During peak periods or long-text generation, requests are frequently interrupted, causing business failure rates to soar.
Too Long: If a provider hangs, requests pile up in memory, blocking execution threads and potentially leading to the collapse of the entire Java application or microservice cluster.

Only by knowing the actual P95/P99 latency can you set rational timeout policies based on data rather than intuition.
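
To make the idea of a data-driven timeout concrete, here is a minimal sketch that derives a timeout from observed latencies. The nearest-rank percentile, the sample values, and the 1.5x headroom factor are all assumptions chosen for illustration; they are not SkyWalking defaults.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in 1..100) of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def suggest_timeout_ms(latencies_ms, p=99, headroom=1.5):
    """Timeout = chosen percentile times a safety factor (both are assumptions)."""
    return percentile(latencies_ms, p) * headroom

# Hypothetical latencies (ms) observed for one model over a time window.
observed = [800, 950, 1200, 1500, 2100, 2400, 3000, 4200, 6500, 12000]
print("P95:", percentile(observed, 95))
print("suggested timeout:", suggest_timeout_ms(observed, p=95))
```

In practice the percentile would come from the latency metrics of the model in question rather than from a hand-written list.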

3. The Overlooked Experience Killer: TTFT

In GenAI scenarios, a user’s perception of speed depends less on the total duration of the conversation and more on “when the first word appears.”

A streaming response with a 10s total duration but a 500ms TTFT feels instantaneous.
A non-streaming response with a 5s total duration but a 4s TTFT feels “frozen.”

If your observability system only tracks total latency, you miss the core UX metric that explains why users complain about “AI slowness.”
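
The difference between TTFT and total duration can be sketched on the client side as follows. This is an illustrative measurement helper, not how the SkyWalking agent is implemented; `measure_stream` and `fake_llm_stream` are hypothetical names, and the injectable `clock` parameter exists only to make the function testable.

```python
import time

def measure_stream(chunks, clock=time.monotonic):
    """Consume a stream of text chunks; return (ttft_seconds, total_seconds, text)."""
    start = clock()
    first_at = None
    parts = []
    for chunk in chunks:
        # The first non-empty chunk marks the time to first token.
        if first_at is None and chunk:
            first_at = clock()
        parts.append(chunk)
    end = clock()
    ttft = None if first_at is None else first_at - start
    return ttft, end - start, "".join(parts)

def fake_llm_stream():
    """Stand-in for a streaming LLM response (hypothetical)."""
    for token in ["Sky", "Walking", "!"]:
        time.sleep(0.01)
        yield token

ttft, total, text = measure_stream(fake_llm_stream())
print(f"ttft={ttft:.3f}s total={total:.3f}s text={text}")
```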

SkyWalking 10.4: A “Digital Dashboard” From the Application Perspective

The Virtual GenAI capability introduced in Apache SkyWalking 10.4 fills this “observability vacuum.” It avoids reliance on external gateways by using application-side probes (like the Java Agent) to collect the most authentic data from the client’s perspective.

Precise Latency Distribution: Multi-dimensional metrics (P50, P90, P99) help visualize LLM fluctuations to inform dynamic timeout strategies.
Core UX Metric — TTFT Monitoring: Native support for first-token latency in streaming calls.
Multi-dimensional Model Profiling: Aligns token usage, estimated cost, and performance across Providers and Models, helping you choose the most cost-effective solution for your specific needs.

Virtual GenAI Observability

Virtual GenAI represents Generative AI service nodes detected by probe plugins. All performance metrics are based on the GenAI Client Perspective.

For instance, the Spring AI plugin in the Java Agent detects the response latency of a Chat Completion request. SkyWalking then visualizes these in the dashboard:

Traffic & Success Rate (CPM & SLA)
Latency & TTFT
Token Usage (Input/Output)
Estimated Cost


How It Works

When the SkyWalking Java Agent or OTLP probes intercept calls to mainstream AI frameworks (e.g., Spring AI, OpenAI SDK), they report Trace data to the SkyWalking OAP. The OAP aggregates and computes this data to generate performance metrics for both Providers and Models, which are then rendered in the built-in Virtual-GenAI dashboards.
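
As a rough illustration of this aggregation step, the sketch below groups call records by provider and model and derives a few metrics. It is not the OAP implementation; the record fields (`provider`, `model`, `success`, `latency_ms`, `input_tokens`, `output_tokens`) and the output keys are hypothetical names used only for this example.

```python
from collections import defaultdict

def aggregate(calls, window_minutes):
    """Group call records by (provider, model) and derive simple metrics."""
    groups = defaultdict(list)
    for call in calls:
        groups[(call["provider"], call["model"])].append(call)

    result = {}
    for key, items in groups.items():
        ok = sum(1 for c in items if c["success"])
        result[key] = {
            "cpm": len(items) / window_minutes,          # calls per minute
            "success_rate": ok / len(items),             # fraction of successful calls
            "avg_latency_ms": sum(c["latency_ms"] for c in items) / len(items),
            "input_tokens_sum": sum(c["input_tokens"] for c in items),
            "output_tokens_sum": sum(c["output_tokens"] for c in items),
        }
    return result

# Hypothetical records extracted from two spans in a 2-minute window.
sample_calls = [
    {"provider": "openai", "model": "gpt-4o", "success": True,
     "latency_ms": 1000, "input_tokens": 100, "output_tokens": 50},
    {"provider": "openai", "model": "gpt-4o", "success": False,
     "latency_ms": 3000, "input_tokens": 200, "output_tokens": 0},
]
print(aggregate(sample_calls, window_minutes=2))
```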

Installation & Configuration

Requirements

SkyWalking Java Agent: >= 9.7
SkyWalking OAP: >= 10.4

Semantic Conventions & Compatibility

SkyWalking Virtual GenAI follows OpenTelemetry GenAI Semantic Conventions. OAP identifies GenAI-related Spans based on:

SkyWalking Java Agent

Spans must be of type Exit, have the SpanLayer attribute set to GENAI, and contain the gen_ai.response.model tag.

OTLP / Zipkin Probes

Spans must contain the gen_ai.response.model tag.
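
The two identification rules above can be expressed as a small predicate. This is a sketch under assumptions: the span is modelled as a plain dictionary, and the keys `type`, `layer`, and `tags` are hypothetical field names chosen for the example, not the OAP's internal data model.

```python
def is_genai_span(span, source):
    """Return True if a span would qualify as a GenAI span under the rules above.

    source is "skywalking" for the native agent, or "otlp" / "zipkin".
    """
    tags = span.get("tags", {})
    has_model = bool(tags.get("gen_ai.response.model"))
    if source == "skywalking":
        # Native agent spans need all three conditions.
        return span.get("type") == "Exit" and span.get("layer") == "GENAI" and has_model
    if source in ("otlp", "zipkin"):
        # OTLP / Zipkin spans only need the response model tag.
        return has_model
    raise ValueError(f"unknown source: {source}")

native = {"type": "Exit", "layer": "GENAI", "tags": {"gen_ai.response.model": "gpt-4o"}}
print(is_genai_span(native, "skywalking"))
```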

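For details, refer to the E2E configurations for data reported by the SkyWalking Java Agent, by probes in OTLP format, and by probes in Zipkin format.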

GenAI Estimated Cost Configuration

Overview

SkyWalking provides a built-in GenAI Billing Configuration File.

This file defines how SkyWalking maps model names from Trace data to their corresponding providers and estimates the token cost for each LLM call. The estimated cost is displayed in the SkyWalking UI alongside trace and metric data, helping users intuitively understand the financial impact of their GenAI usage.

Important: The pricing in this file is intended for cost estimation only and must not be treated as actual billing or invoice amounts. Users are advised to regularly verify the latest rates on the providers’ official pricing pages.

Configuration Structure

Top-level Fields

Field	Type	Description
`last-updated`	`date`	The last update date of the pricing data. All prices are based on public billing standards announced by providers prior to this date.
`providers`	`list`	List of GenAI provider definitions. Each entry contains matching rules and specific model pricing information.

Provider Definition

Each entry under providers defines a GenAI provider:

providers:
- provider: <provider-name>
  prefix-match:
    - <prefix-1>
    - <prefix-2>
  models:
    - name: <model-name>
      aliases: [<alias-1>, <alias-2>]
      input-estimated-cost-per-m: <cost>
      output-estimated-cost-per-m: <cost>

Field	Type	Required	Description
`provider`	`string`	Yes	The provider identifier (e.g., `openai`, `anthropic`, `gemini`). It is displayed as the Virtual GenAI service name in SkyWalking.
`prefix-match`	`list[string]`	Yes	A list of prefixes used to match model names to this provider. If a model name in the Trace data starts with any of these prefixes, it will be mapped to this provider.
`models`	`list[model]`	No	A list of model definitions containing pricing information. If omitted, the system can still identify the provider but will not perform cost estimation.

Model Definition

Each entry under models defines the pricing for a specific model:

Field	Type	Required	Description
`name`	`string`	Yes	The standard model name used for matching.
`aliases`	`list[string]`	No	Alternative names that should resolve to the same billing entry. This is useful when providers use different naming conventions (see the “Model Aliases” section).
`input-estimated-cost-per-m`	`float`	No	Estimated cost per 1,000,000 (one million) input (Prompt) tokens. The default unit is USD.
`output-estimated-cost-per-m`	`float`	No	Estimated cost per 1,000,000 (one million) output (Completion) tokens. The default unit is USD.
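
Because both price fields are expressed per one million tokens, the estimated cost of a single call follows from simple arithmetic. The sketch below shows that calculation; the function name and the token counts are illustrative, and the prices are example values rather than verified rates.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_cost_per_m, output_cost_per_m):
    """Estimated cost of one call, with prices given per 1,000,000 tokens."""
    return (input_tokens * input_cost_per_m + output_tokens * output_cost_per_m) / 1_000_000

# Example: 1,200 prompt tokens and 350 completion tokens
# at 2.5 USD per million input tokens and 10.0 USD per million output tokens.
cost = estimate_cost_usd(1200, 350, input_cost_per_m=2.5, output_cost_per_m=10.0)
print(f"estimated cost: {cost:.4f} USD")
```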

Model Matching Mechanism

Provider-Level Prefix Matching

When SkyWalking receives a Trace containing a GenAI call, it determines the Provider based on the following priority order:

gen_ai.provider.name tag: This tag is retrieved first. It follows the latest OpenTelemetry GenAI semantic conventions.
gen_ai.system tag: If the above tag is missing, the system falls back to this legacy tag. Note: This tag is only parsed when processing OTLP or Zipkin format data, primarily for compatibility with older versions of libraries like the Python auto-instrumentation.
Prefix Matching: If neither of the above tags exists, SkyWalking reads the prefix-match rules defined in gen-ai-config.yml and attempts to identify the provider by matching the Model Name.

- provider: openai
  prefix-match:
    - gpt

Any model name starting with gpt (such as gpt-4o, gpt-4.1-mini, or gpt-5-nano) will be mapped to the openai provider. A single provider can have multiple prefixes:

- provider: tencent
  prefix-match:
    - hunyuan
    - Tencent
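
The priority order described above can be sketched as a single function. This is an illustration, not the OAP code: the `allow_legacy_tag` flag stands in for "the data arrived as OTLP or Zipkin", and the case-sensitive match and first-match-wins behaviour are assumptions.

```python
def resolve_provider(tags, model_name, prefix_rules, allow_legacy_tag=False):
    """Resolve the provider name following the three-step priority order."""
    # 1. Prefer the current semantic-convention tag.
    provider = tags.get("gen_ai.provider.name")
    if provider:
        return provider
    # 2. Fall back to the legacy tag (OTLP / Zipkin data only).
    if allow_legacy_tag and tags.get("gen_ai.system"):
        return tags["gen_ai.system"]
    # 3. Finally, match the model name against configured prefixes.
    for name, prefixes in prefix_rules.items():
        if any(model_name.startswith(prefix) for prefix in prefixes):
            return name
    return None

rules = {"openai": ["gpt"], "tencent": ["hunyuan", "Tencent"]}
print(resolve_provider({}, "gpt-4o-mini", rules))
```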

Model-level Longest-Prefix Matching

Once the provider is determined, SkyWalking uses a Trie-based longest-prefix matching algorithm to find the best billing entry. This is crucial because model names returned in provider API responses often include version numbers or timestamps, differing from the base model name in the config.

Example OpenAI config:

models:
- name: gpt-4o
  input-estimated-cost-per-m: 2.5
  output-estimated-cost-per-m: 10.0
- name: gpt-4o-mini
  input-estimated-cost-per-m: 0.15
  output-estimated-cost-per-m: 0.6

Matching behavior:

Model Name in Trace	Matched Configuration Entry	Reason
`gpt-4o`	`gpt-4o`	Exact match
`gpt-4o-2024-08-06`	`gpt-4o`	Longest prefix is `gpt-4o`
`gpt-4o-mini`	`gpt-4o-mini`	Exact match (Longer prefix `gpt-4o-mini` takes priority over `gpt-4o`)
`gpt-4o-mini-2024-07-18`	`gpt-4o-mini`	Longest prefix is `gpt-4o-mini`

This mechanism ensures versioned API model names map to the correct pricing tier without requiring exact full names in the configuration file.
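
For readers who want to see the matching behaviour in code, here is a minimal character-level trie with longest-prefix lookup. It illustrates the technique only and makes no claim about how the OAP structures its trie; `PrefixTrie` and the `"$entry"` marker key are names invented for this sketch.

```python
class PrefixTrie:
    """Character trie that returns the entry with the longest matching prefix."""

    _ENTRY = "$entry"  # multi-character key, so it cannot collide with a single character

    def __init__(self):
        self.root = {}

    def insert(self, name, entry):
        node = self.root
        for ch in name:
            node = node.setdefault(ch, {})
        node[self._ENTRY] = entry

    def longest_prefix(self, model_name):
        node, best = self.root, None
        for ch in model_name:
            node = node.get(ch)
            if node is None:
                break
            if self._ENTRY in node:
                best = node[self._ENTRY]  # remember the deepest entry seen so far
        return best

trie = PrefixTrie()
trie.insert("gpt-4o", "gpt-4o")
trie.insert("gpt-4o-mini", "gpt-4o-mini")
for traced in ["gpt-4o", "gpt-4o-2024-08-06", "gpt-4o-mini", "gpt-4o-mini-2024-07-18"]:
    print(traced, "->", trie.longest_prefix(traced))
```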

Model Aliases

Some providers use different naming conventions across API responses and documentation. For example, Anthropic’s model might appear as claude-4-sonnet or claude-sonnet-4. The aliases field supports both formats under a single billing entry:

- name: claude-4-sonnet
  aliases: [claude-sonnet-4]
  input-estimated-cost-per-m: 3.0
  output-estimated-cost-per-m: 15.0

Under this configuration, claude-4-sonnet and claude-sonnet-4 (as well as any versioned variants, such as claude-sonnet-4-20250514) will resolve to the same billing entry.
Note: Aliases also participate in longest prefix matching. Therefore, claude-sonnet-4-20250514 will match the alias claude-sonnet-4, which in turn resolves to the pricing information for claude-4-sonnet.
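
Alias resolution can be sketched by registering every alias as an additional key that points at the same billing entry. For brevity this example uses a linear scan over the keys instead of a trie; the result is the same longest-prefix match, only the lookup strategy is simplified. The helper names are hypothetical.

```python
def build_lookup(models):
    """Map each model name and each alias to its billing entry."""
    lookup = {}
    for model in models:
        for key in [model["name"], *model.get("aliases", [])]:
            lookup[key] = model
    return lookup

def match_model(lookup, model_name):
    """Return the entry whose name or alias is the longest prefix of model_name."""
    candidates = [key for key in lookup if model_name.startswith(key)]
    return lookup[max(candidates, key=len)] if candidates else None

models = [{
    "name": "claude-4-sonnet",
    "aliases": ["claude-sonnet-4"],
    "input-estimated-cost-per-m": 3.0,
    "output-estimated-cost-per-m": 15.0,
}]
lookup = build_lookup(models)
entry = match_model(lookup, "claude-sonnet-4-20250514")
print(entry["name"], entry["input-estimated-cost-per-m"])
```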

Custom Configuration

Adding a New Provider

To add a provider that is not included in the default configuration:

providers:
# ... Existing providers ...

- provider: ollama
  prefix-match:
    - mymodel
  models:
    - name: mymodel-large
      input-estimated-cost-per-m: 1.0
      output-estimated-cost-per-m: 5.0
    - name: mymodel-small
      input-estimated-cost-per-m: 0.1
      output-estimated-cost-per-m: 0.5

For OTLP/Zipkin data, a dedicated estimated cost tag has been added, so you can view the cost of each GenAI call directly in the UI.

Main Metrics

1. Provider Level

Metric ID	Description	Meaning
`gen_ai_provider_cpm`	Calls Per Minute	Requests per minute (Throughput)
`gen_ai_provider_sla`	Success Rate	Request success rate
`gen_ai_provider_resp_time`	Avg Response Time	Average response time
`gen_ai_provider_latency_percentile`	Latency Percentiles	Response time percentiles (P50, P75, P90, P95, P99)
`gen_ai_provider_input_tokens_sum/avg`	Input Token Usage	Total and average input token usage
`gen_ai_provider_output_tokens_sum/avg`	Output Token Usage	Total and average output token usage
`gen_ai_provider_total_estimated_cost/avg`	Estimated Cost	Total estimated cost and average cost per call

2. Model Level

Metric ID	Description	Meaning
`gen_ai_model_call_cpm`	Calls Per Minute	Requests per minute for this specific model
`gen_ai_model_sla`	Success Rate	Model-specific request success rate
`gen_ai_model_latency_avg/percentile`	Latency	Average and percentiles of model response duration
`gen_ai_model_ttft_avg/percentile`	TTFT	Time to First Token (Streaming only)
`gen_ai_model_input_tokens_sum/avg`	Input Token Usage	Detailed input token consumption for the model
`gen_ai_model_output_tokens_sum/avg`	Output Token Usage	Detailed output token consumption for the model
`gen_ai_model_total_estimated_cost/avg`	Estimated Cost	Estimated total cost and average cost for the model

Recommended Usage Scenarios

Performance Evaluation: Use Latency and Time to First Token (TTFT) metrics to analyze model inference efficiency and the end-user interaction experience.
Token Monitoring: Real-time monitoring of Input and Output token consumption to analyze resource utilization across different business scenarios.
Cost Alerting: Set alert thresholds based on Estimated Cost or token consumption to promptly detect abnormal calls and prevent budget overruns.

Zh: Monitoring LLM Applications with SkyWalking 10.4: Insights into LLM Performance and Cost

Sun, 05 Apr 2026 00:00:00 +0000

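The Problem: When Applications Start “Devouring” LLMs, Monitoring Is Left with Blind Spots

With the deep penetration of Generative AI (GenAI) into enterprise workflows, developers face an awkward situation: powerful LLM capabilities are quickly integrated into applications via Spring AI or OpenAI SDKs, yet almost nothing is known about how these calls actually perform.

The “Black Box” of Cost and Performance: Is the Expensive Model Really More Cost-Effective?
Facing high LLM bills, we often only know that the money went to some Provider, but cannot work out the “return on investment” inside the application.
Blind Upgrades: In pursuit of a better experience, you might switch the business default to a more expensive flagship model. But in your specific business scenario, does paying several times more per token actually yield lower latency and a faster TTFT (Time to First Token) on real requests?
Lack of Real-World Benchmarks: Official benchmarks mean little without your real business requests. You need to know which model from the same Provider achieves the best balance between “Token/Cost Consumption” and “Response Speed” under your actual prompt lengths and concurrency pressure. Without application-side data, you cannot judge which model is the best fit for your current business.
The Vanishing “Golden Timeout”
Many teams pick timeouts for LLM calls arbitrarily in code (e.g., 30s or 60s).
Too Short: During long-text generation or model peak periods, requests are frequently cut off, causing business failure rates to soar.
Too Long: If a downstream provider fails (hangs), large numbers of requests pile up in application memory, blocking execution threads and eventually bringing down the entire Java application or even the microservice cluster. Only by truly knowing the expected overall call latency (P99/P95 Latency) can you set the most reasonable timeout policy for each model based on data rather than intuition.
The Overlooked Experience Killer: TTFT
In GenAI scenarios, a user’s perception of “fast” does not depend entirely on the total duration of the conversation, but on “when the first line of text appears.” A streaming response with a 10s total duration but only a 500ms TTFT feels instantaneous. A non-streaming response with a 5s total duration but a 4s TTFT feels “frozen.” If your observability system only sees total duration, you miss the core UX metric and cannot explain why users report that “AI is slow” even when total duration looks acceptable.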

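SkyWalking 10.4: A “Digital Dashboard” from the Application Perspective
The Virtual GenAI capability introduced in Apache SkyWalking 10.4 is designed to fill this “observability vacuum” on the application side. It does not rely on any external gateway; instead, application-side probes (such as the Java Agent) collect the most authentic data from the client’s perspective.

Precise Latency Distribution (Latency Percentiles): Multi-dimensional metrics such as P50, P90, and P99 outline the real fluctuation curve of LLM calls and provide a data-backed basis for setting “dynamic timeouts.”
Core UX Metric, TTFT Monitoring: Native support for first-token latency statistics in streaming calls. By comparing TTFT across Providers or models, you can optimize your prompt strategy or switch to a faster model to keep the user experience consistently responsive.
Multi-dimensional Model Profiling: Across both the Provider and Model dimensions, token usage, estimated cost, and performance metrics are closely aligned. Instead of the provider’s network-wide “ideal average,” you see how your own application actually performs when calling a specific model, so you can pick the most cost-effective option in a complex model ecosystem.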

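Virtual GenAI Observability

Virtual GenAI represents Generative AI service nodes detected by probe plugins. All performance metrics of GenAI operations are based on the GenAI client perspective.

For instance, the Spring AI plugin in the Java Agent detects the response latency of a Chat Completion request. SkyWalking then shows the following in the dashboard:

Traffic & Success Rate (CPM & SLA)
Latency & TTFT
Token Usage (Input/Output)
Estimated Cost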


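How It Works

When the SkyWalking Java Agent or OTLP probes intercept calls to mainstream AI frameworks (e.g., Spring AI, OpenAI SDK), they report Trace data to the SkyWalking OAP. The OAP automatically aggregates and computes this data, generating performance metrics across two dimensions, Provider and Model, and renders them directly in the built-in Virtual-GenAI dashboards.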

Installation & Configuration

Requirements

Version Requirements

SkyWalking Java Agent: >= 9.7
SkyWalking OAP: >= 10.4

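Semantic Conventions & Compatibility

SkyWalking Virtual GenAI follows the OpenTelemetry GenAI Semantic Conventions. OAP identifies GenAI-related Spans based on the following criteria:

SkyWalking Java Agent

Reported Spans must be of type Exit, have the SpanLayer attribute set to GENAI, and contain the gen_ai.response.model tag.

Probes Emitting OTLP / Zipkin Data

Reported Spans must contain the gen_ai.response.model tag.

For details, refer to the E2E configurations:
Data reported by the SkyWalking Java Agent
Data reported by probes in OTLP format
Data reported by probes in Zipkin format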

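GenAI Estimated Cost Configuration

Overview

SkyWalking provides a built-in GenAI billing configuration file.

This configuration defines how SkyWalking maps model names from Trace data to their corresponding providers and estimates the token cost for each LLM call. The estimated cost is displayed in the SkyWalking UI alongside trace and metric data, helping users intuitively understand the estimated financial impact of their GenAI usage.

Important: The pricing in this file is intended for cost estimation only and must not be treated as actual billing or invoice amounts. Users are advised to regularly verify the latest rates on the providers’ official pricing pages.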

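Configuration Structure

Top-level Fields

Field	Type	Description
`last-updated`	`date`	The last update date of the pricing data. All prices are based on the public billing standards published on each provider's official site before this date.
`providers`	`list`	List of GenAI provider definitions. Each provider entry contains matching rules and specific model pricing information.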

Provider Definition

Each entry under providers defines a GenAI provider:

providers:
- provider: <provider-name>
  prefix-match:
    - <prefix-1>
    - <prefix-2>
  models:
    - name: <model-name>
      aliases: [<alias-1>, <alias-2>]
      input-estimated-cost-per-m: <cost>
      output-estimated-cost-per-m: <cost>

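Field	Type	Required	Description
`provider`	`string`	Yes	The provider identifier (e.g., `openai`, `anthropic`, `gemini`). It is displayed as the Virtual GenAI service name in SkyWalking.
`prefix-match`	`list[string]`	Yes	A list of prefixes used to match model names to this provider. If a model name in the Trace data starts with any of these prefixes, it is mapped to this provider.
`models`	`list[model]`	No	A list of model definitions containing pricing information. If omitted, the system can still identify the provider but will not perform cost estimation.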

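Model Definition

Each entry under models defines the pricing for a specific model:

Field	Type	Required	Description
`name`	`string`	Yes	The standard model name used for matching.
`aliases`	`list[string]`	No	Alternative names that should resolve to the same billing entry. This is useful when providers use different naming conventions (see the “Model Aliases” section).
`input-estimated-cost-per-m`	`float`	No	Estimated cost per 1,000,000 (one million) input (Prompt) tokens. The default unit is USD.
`output-estimated-cost-per-m`	`float`	No	Estimated cost per 1,000,000 (one million) output (Completion) tokens. The default unit is USD.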

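Model Matching Mechanism

Provider-Level Prefix Matching

When SkyWalking receives a Trace containing a GenAI call, it determines the Provider in the following priority order:

gen_ai.provider.name tag: This tag is retrieved first. It follows the latest OpenTelemetry semantic conventions.
gen_ai.system tag: If the above tag is missing, the system falls back to this legacy tag. Note: This tag is only parsed when processing OTLP or Zipkin data, primarily for compatibility with older libraries such as the Python auto-instrumentation.
Prefix Matching: If neither of the above tags exists, SkyWalking reads the prefix-match rules defined in gen-ai-config.yml and attempts to identify the provider by matching the Model Name.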

- provider: openai
  prefix-match:
    - gpt

Any model name starting with gpt (such as gpt-4o, gpt-4.1-mini, or gpt-5-nano) will be mapped to the openai provider. A single provider can have multiple prefixes:

- provider: tencent
  prefix-match:
    - hunyuan
    - Tencent

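Model-Level Longest-Prefix Matching

Once the provider is determined, SkyWalking uses a Trie-based longest-prefix matching algorithm to find the best model billing entry. This is crucial because the model names that LLM providers return in API responses often include version numbers or timestamps, differing from the base model name in the config. Example: suppose the OpenAI configuration entries are as follows: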

models:
- name: gpt-4o
  input-estimated-cost-per-m: 2.5
  output-estimated-cost-per-m: 10.0
- name: gpt-4o-mini
  input-estimated-cost-per-m: 0.15
  output-estimated-cost-per-m: 0.6

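The matching behavior is shown in the table below:

Model Name in Trace	Matched Configuration Entry	Reason
`gpt-4o`	`gpt-4o`	Exact match
`gpt-4o-2024-08-06`	`gpt-4o`	Longest prefix is `gpt-4o`
`gpt-4o-mini`	`gpt-4o-mini`	Exact match (the longer prefix takes priority over `gpt-4o`)
`gpt-4o-mini-2024-07-18`	`gpt-4o-mini`	Longest prefix is `gpt-4o-mini`

This mechanism ensures that versioned model names returned by the API map to the correct pricing tier without maintaining exact full names in the configuration file.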

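Model Aliases

Some providers use different naming conventions in API responses and official documentation. For example, an Anthropic model might appear in a Trace as claude-4-sonnet or claude-sonnet-4. The aliases field lets a single billing entry support both forms: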

- name: claude-4-sonnet
  aliases: [claude-sonnet-4]
  input-estimated-cost-per-m: 3.0
  output-estimated-cost-per-m: 15.0

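Under this configuration, claude-4-sonnet and claude-sonnet-4 (as well as any versioned variants, such as claude-sonnet-4-20250514) resolve to the same billing entry.
Note: Aliases also participate in longest-prefix matching. Therefore, claude-sonnet-4-20250514 matches the alias claude-sonnet-4, which in turn resolves to the pricing information for claude-4-sonnet.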

Custom Configuration

Adding a New Provider

To add a provider that is not included in the default configuration:

providers:
# ... Existing providers ...

- provider: ollama
  prefix-match:
    - mymodel
  models:
    - name: mymodel-large
      input-estimated-cost-per-m: 1.0
      output-estimated-cost-per-m: 5.0
    - name: mymodel-small
      input-estimated-cost-per-m: 0.1
      output-estimated-cost-per-m: 0.5

For OTLP/Zipkin data, a dedicated estimated cost tag has been added, so the cost of each GenAI call is visible in the UI.

Main Metrics

1. Provider Level

Metric ID	Description	Meaning
`gen_ai_provider_cpm`	Calls Per Minute	Requests per minute (Throughput)
`gen_ai_provider_sla`	Success Rate	Request success rate
`gen_ai_provider_resp_time`	Avg Response Time	Average response time
`gen_ai_provider_latency_percentile`	Latency Percentiles	Response time percentiles (P50, P75, P90, P95, P99)
`gen_ai_provider_input_tokens_sum/avg`	Input Token Usage	Total and average input token usage
`gen_ai_provider_output_tokens_sum/avg`	Output Token Usage	Total and average output token usage
`gen_ai_provider_total_estimated_cost/avg`	Estimated Cost	Total estimated cost and average cost per call

2. Model Level

Metric ID	Description	Meaning
`gen_ai_model_call_cpm`	Calls Per Minute	Requests per minute for this specific model
`gen_ai_model_sla`	Success Rate	Model request success rate
`gen_ai_model_latency_avg/percentile`	Latency	Average and percentiles of model response duration
`gen_ai_model_ttft_avg/percentile`	TTFT	Time to First Token (Streaming only)
`gen_ai_model_input_tokens_sum/avg`	Input Token Usage	Detailed input token consumption for the model
`gen_ai_model_output_tokens_sum/avg`	Output Token Usage	Detailed output token consumption for the model
`gen_ai_model_total_estimated_cost/avg`	Estimated Cost	Estimated total cost and average cost per call for the model

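Recommended Usage Scenarios

Performance Evaluation: Use the Latency and Time to First Token (TTFT) metrics to analyze model inference efficiency and the end-user interaction experience.
Token Monitoring: Monitor Input and Output token consumption in real time to analyze resource usage across different business scenarios.
Cost Alerting: Configure alert thresholds based on Estimated Cost or token consumption to promptly detect abnormal calls and prevent budget overruns.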

Blog: Monitoring Envoy AI Gateway with Apache SkyWalking

Thu, 02 Apr 2026 00:00:00 +0000

LLM traffic is becoming a first-class citizen in production infrastructure. Teams are calling OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Gemini — often multiple providers at once. But most organizations have no unified visibility into this traffic:

Token costs spiral without knowing which teams, models, or providers drive the spend. A single misconfigured prompt template can burn through thousands of dollars before anyone notices.
Provider outages cause cascading failures. When OpenAI has a bad hour, your application goes down with it — and you have no failover visibility to understand what happened or switch providers automatically.
No unified metrics across heterogeneous LLM calls. Latency, Time to First Token (TTFT), Time Per Output Token (TPOT), token usage, error rates — each provider reports these differently, if at all. There is no single dashboard to compare them.

This is the same observability gap that microservices faced a decade ago. The solution then was service meshes and API gateways with built-in telemetry. For AI workloads, the answer is an AI gateway.

Why an AI Gateway

Envoy AI Gateway is an open-source AI gateway built on top of Envoy Proxy and Envoy Gateway. It is not a standalone SaaS product or a Python proxy — it is infrastructure-grade software built on the same Envoy that already handles traffic for a large portion of cloud-native deployments.

Key capabilities:

Multi-provider routing — supports 16+ AI providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Gemini, Mistral, Cohere, DeepSeek, and more) behind a unified API.
Token-based rate limiting — rate limit by token consumption, not just request count.
Provider fallback — automatic failover when a provider is down or slow.
Model virtualization — abstract model names so applications are decoupled from specific providers.
Two-tier architecture — a reference architecture with a centralized entry gateway (Tier 1) for auth and global routing, and per-cluster gateways (Tier 2) for inference optimization.
CNCF ecosystem native — runs on Kubernetes, composes with existing Envoy filters, WASM plugins, and standard Kubernetes Gateway API resources.

Because Envoy AI Gateway natively emits GenAI metrics and access logs via OTLP following OpenTelemetry GenAI Semantic Conventions, it plugs directly into any OpenTelemetry-compatible backend.

Starting from SkyWalking 10.4.0, the OAP server natively receives and analyzes Envoy AI Gateway’s OTLP metrics and access logs — no OpenTelemetry Collector needed in between.

Data Flow

The AI Gateway pushes telemetry directly to SkyWalking via OTLP gRPC:

Application sends LLM API requests through the Envoy AI Gateway.
Envoy AI Gateway routes requests to AI providers (or local models like Ollama) and records GenAI metrics (token usage, latency, TTFT, TPOT) and access logs.
The gateway pushes metrics and logs via OTLP gRPC directly to SkyWalking OAP on port 11800.
SkyWalking OAP parses metrics with MAL rules and access logs with LAL rules, then stores everything in BanyanDB.

No OpenTelemetry Collector is needed. SkyWalking OAP’s built-in OTLP receiver handles everything.
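
Of the metrics listed in the data flow, TPOT is the least self-explanatory. One common definition divides the time spent after the first token by the number of remaining output tokens; the sketch below uses that definition as an assumption, and the gateway's exact formula may differ.

```python
def time_per_output_token(total_s, ttft_s, output_tokens):
    """Average seconds per output token after the first one (assumed definition)."""
    if output_tokens <= 1:
        return None  # not meaningful with zero or one output token
    return (total_s - ttft_s) / (output_tokens - 1)

# A 10 s streaming response with a 0.5 s TTFT and 96 output tokens.
tpot = time_per_output_token(total_s=10.0, ttft_s=0.5, output_tokens=96)
print(f"TPOT: {tpot:.3f} s/token")
```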

Try It Locally

This demo uses Ollama as a local LLM backend so you can try everything without an API key. The Envoy AI Gateway CLI (aigw) provides a standalone mode that runs outside Kubernetes — perfect for local testing.

Prerequisites

Docker and Docker Compose
Ollama installed on your host

Step 1: Start Ollama

Start Ollama on all interfaces so Docker containers can reach it:

OLLAMA_HOST=0.0.0.0 ollama serve

Pull a small model for testing:

ollama pull llama3.2:1b

Step 2: Start the Stack

Create a docker-compose.yaml:

services:
  banyandb:
    image: apache/skywalking-banyandb:0.10.0
    container_name: banyandb
    ports:
      - "17912:17912"
    command: standalone --stream-root-path /tmp/stream-data --measure-root-path /tmp/measure-data
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:17913/api/healthz || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 10

  oap:
    image: apache/skywalking-oap-server:10.4.0
    container_name: oap
    depends_on:
      banyandb:
        condition: service_healthy
    ports:
      - "11800:11800"
      - "12800:12800"
    environment:
      SW_STORAGE: banyandb
      SW_STORAGE_BANYANDB_TARGETS: banyandb:17912
    healthcheck:
      test: ["CMD-SHELL", "bash -c 'echo > /dev/tcp/localhost/12800' || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 30
      start_period: 60s

  ui:
    image: apache/skywalking-ui:10.4.0
    container_name: ui
    depends_on:
      oap:
        condition: service_healthy
    ports:
      - "8080:8080"
    environment:
      SW_OAP_ADDRESS: http://oap:12800

  aigw:
    image: envoyproxy/ai-gateway-cli:latest
    container_name: aigw
    depends_on:
      oap:
        condition: service_healthy
    environment:
      - OPENAI_BASE_URL=http://host.docker.internal:11434/v1
      - OPENAI_API_KEY=unused
      - OTEL_SERVICE_NAME=my-ai-gateway
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://oap:11800
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
      - OTEL_METRICS_EXPORTER=otlp
      - OTEL_LOGS_EXPORTER=otlp
      - OTEL_METRIC_EXPORT_INTERVAL=5000
      - OTEL_RESOURCE_ATTRIBUTES=job_name=envoy-ai-gateway,service.instance.id=aigw-1,service.layer=ENVOY_AI_GATEWAY
    ports:
      - "1975:1975"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    command: ["run"]

Start everything:

docker compose up -d

Wait for all services to become healthy (BanyanDB starts first, then OAP, then UI and AI Gateway):

docker compose ps

The key OTLP configuration on the aigw service:

Env Var	Value	Purpose
`OTEL_SERVICE_NAME`	`my-ai-gateway`	Service name in SkyWalking
`OTEL_EXPORTER_OTLP_ENDPOINT`	`http://oap:11800`	SkyWalking OAP gRPC endpoint
`OTEL_EXPORTER_OTLP_PROTOCOL`	`grpc`	OTLP transport
`OTEL_METRICS_EXPORTER`	`otlp`	Enable metrics push
`OTEL_LOGS_EXPORTER`	`otlp`	Enable access log push

The OTEL_RESOURCE_ATTRIBUTES must include:

job_name=envoy-ai-gateway — routing tag for MAL/LAL rules
service.instance.id=<id> — instance identity
service.layer=ENVOY_AI_GATEWAY — routes logs to AI Gateway LAL rules

The MAL and LAL rules are enabled by default in SkyWalking OAP. No OAP-side configuration is needed.
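
Since a missing resource attribute silently prevents the rules from matching, it can help to check the value before starting the gateway. The following is a small, hypothetical pre-flight check written for this post; it is not part of the aigw CLI or SkyWalking.

```python
REQUIRED_VALUES = {
    "job_name": "envoy-ai-gateway",
    "service.layer": "ENVOY_AI_GATEWAY",
}

def parse_resource_attributes(raw):
    """Parse a comma-separated key=value string into a dict."""
    attrs = {}
    for pair in raw.split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def find_problems(raw):
    """Return a list of human-readable problems; empty means the value looks fine."""
    attrs = parse_resource_attributes(raw)
    problems = [f"{key} must be {value}"
                for key, value in REQUIRED_VALUES.items() if attrs.get(key) != value]
    if not attrs.get("service.instance.id"):
        problems.append("service.instance.id must be set")
    return problems

good = "job_name=envoy-ai-gateway,service.instance.id=aigw-1,service.layer=ENVOY_AI_GATEWAY"
print(find_problems(good))
```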

Step 3: Run the Demo App

Create a simple Python application that sends requests through the AI Gateway (app.py). It mixes normal requests, streaming requests (for TTFT/TPOT metrics), and error requests (non-existent model → HTTP 404, always captured by the LAL sampling policy):

import time, random, requests

GATEWAY = "http://localhost:1975"
HEADERS = {"Authorization": "Bearer unused", "Content-Type": "application/json"}

questions = [
    "What is Apache SkyWalking? Answer in one sentence.",
    "What is Envoy Proxy used for? Answer in one sentence.",
    "What are the benefits of an AI gateway? Answer in two sentences.",
    "Explain observability in three sentences.",
]

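def chat(model, question, stream=False):
    """Send one chat completion request through the gateway."""
    resp = requests.post(
        f"{GATEWAY}/v1/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": question}], "stream": stream},
        headers=HEADERS, timeout=60, stream=stream,
    )
    if stream:
        # Drain the stream so the gateway can record TTFT and TPOT.
        chunks = [line.decode() for line in resp.iter_lines() if line]
        return resp.status_code, f"[streamed {len(chunks)} chunks]"
    try:
        return resp.status_code, resp.json()
    except ValueError:
        # Error responses are not guaranteed to carry a JSON body.
        return resp.status_code, {}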

while True:
    r = random.random()
    if r < 0.2:
        # Error request: non-existent model triggers 404
        status, body = chat("non-existent-model", "hello")
        print(f"[error] model=non-existent-model status={status}")
    elif r < 0.5:
        # Streaming request — generates TTFT and TPOT metrics
        q = random.choice(questions)
        status, info = chat("llama3.2:1b", q, stream=True)
        print(f"[stream] status={status} {info}")
    else:
        # Normal non-streaming request
        q = random.choice(questions)
        status, body = chat("llama3.2:1b", q)
        answer = body.get("choices", [{}])[0].get("message", {}).get("content", "")[:80]
        tokens = body.get("usage", {})
        print(f"[ok] status={status} tokens={tokens} answer={answer}...")
    time.sleep(random.randint(20, 30))

Run it:

pip install requests
python app.py

The application talks to the AI Gateway on port 1975, which routes to Ollama. Each request generates GenAI metrics (token usage, latency, TTFT, TPOT) and access logs that the gateway pushes to SkyWalking via OTLP.

The error requests (non-existent model → HTTP 404) are always captured by the access log sampling policy, so you will see them in the SkyWalking log view.
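
The sampling behaviour described in this post (normal successful responses are dropped, errors and high-token requests are kept) can be pictured as a simple predicate. This is only an illustration: the real policy lives in the OAP's LAL rules, and the 10,000-token threshold below is a made-up value, not the actual setting.

```python
def should_keep_log(status_code, total_tokens, token_threshold=10_000):
    """Decide whether an access log entry survives sampling (illustrative only)."""
    if status_code >= 400:
        return True  # errors such as HTTP 404 are always kept
    return total_tokens >= token_threshold  # keep unusually large requests

print(should_keep_log(404, 0))       # error request
print(should_keep_log(200, 120))     # normal successful request
print(should_keep_log(200, 25_000))  # high-token request
```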

Step 4: View in SkyWalking UI

Open http://localhost:8080 and select the GenAI > Envoy AI Gateway menu.

The service list shows my-ai-gateway with CPM, latency, and token rates at a glance:

Click into the service to see the full dashboard — Request CPM, Latency (average + percentiles), Input/Output Token Rates, TTFT, and TPOT:

The Providers tab breaks down metrics by AI provider:

The Models tab shows per-model metrics including TTFT and TPOT (streaming only). Note the unknown model entries — these are the error requests with non-existent models:

The Log tab shows access logs. The sampling policy drops normal successful responses but always captures errors (HTTP 404) and high-token requests:

Cleanup

docker compose down

Deploying on Kubernetes

For production deployments, Envoy AI Gateway runs as a full Kubernetes controller with Envoy Gateway as the control plane. See the Envoy AI Gateway getting started guide for Kubernetes installation.

The OTLP configuration is the same — set the OTEL_* environment variables on the AI Gateway’s external processor to point at SkyWalking OAP’s gRPC port (11800). See the SkyWalking Envoy AI Gateway Monitoring documentation for details.

GenAI Observability Without an AI Gateway

Not every deployment uses an AI gateway. If your applications call LLM providers directly, SkyWalking 10.4.0 also provides GenAI observability through the Virtual GenAI layer.

This works with any SkyWalking-instrumented, OpenTelemetry-instrumented, or Zipkin-instrumented application. When traces carry gen_ai.* tags (following OpenTelemetry GenAI Semantic Conventions), SkyWalking derives per-provider and per-model metrics from the client side: latency, token usage, success rate, and estimated cost.

For Java applications, the SkyWalking Java Agent (9.7+) includes a Spring AI plugin that automatically instruments calls to 13+ providers (OpenAI, Anthropic, AWS Bedrock, Google GenAI, DeepSeek, Mistral, etc.) with the correct gen_ai.* span tags — no code changes needed.

This is a different use case from the Envoy AI Gateway monitoring covered above:

Envoy AI Gateway layer: infrastructure-level observability — what the gateway sees across all traffic. Best for platform teams managing centralized AI routing.
Virtual GenAI layer: application-level observability — what each instrumented app sees for its own LLM calls. Best for teams without a centralized gateway, or for per-application cost tracking.

References

Envoy AI Gateway — project site and documentation
Envoy AI Gateway CLI — standalone mode for local development
SkyWalking Envoy AI Gateway Monitoring — OAP setup doc
SkyWalking Virtual GenAI — client-side GenAI observability
OpenTelemetry GenAI Semantic Conventions — the metric/attribute standard both projects follow

Zh: Monitoring Envoy AI Gateway with Apache SkyWalking

Thu, 02 Apr 2026 00:00:00 +0000

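The Problem: LLM Traffic Lacks Unified Observability

LLM traffic is becoming a part of production infrastructure that cannot be ignored. Teams are calling OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, and Google Gemini, often more than one provider at a time. But most organizations lack unified visibility into this traffic:

Token costs spiral out of control without anyone knowing which team, model, or provider is burning the money. A single misconfigured prompt template can burn through thousands of dollars before anyone notices.
Provider outages cause cascading failures. During the hour OpenAI has problems, your application goes down with it, and you have neither failover visibility nor the ability to switch providers automatically.
No unified metrics. Latency, Time to First Token (TTFT), Time Per Output Token (TPOT), token usage, error rates: each provider reports these differently, and some do not report them at all. There is no single dashboard to compare them.

This is the same observability predicament that microservices faced a decade ago. The solution then was service meshes and API gateways with built-in telemetry. For AI workloads, the answer is an AI gateway.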

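Why an AI Gateway

Envoy AI Gateway is an open-source AI gateway built on top of Envoy Proxy and Envoy Gateway. Underneath is the same Envoy that is already widely deployed in the cloud-native world, so it inherits infrastructure-grade stability and performance.

Key capabilities:

Multi-provider routing: supports 16+ AI providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Gemini, Mistral, Cohere, DeepSeek, and more) behind a unified API.
Token-based rate limiting: rate limit by token consumption, not just request count.
Provider fallback: automatic failover when a provider is down or slow.
Model virtualization: abstract model names so applications are decoupled from specific providers.
Two-tier architecture: a reference architecture with a centralized entry gateway (Tier 1) for auth and global routing, and per-cluster gateways (Tier 2) for inference optimization.
CNCF ecosystem native: runs on Kubernetes and is compatible with existing Envoy filters, WASM plugins, and standard Kubernetes Gateway API resources.

Envoy AI Gateway natively emits GenAI metrics and access logs via OTLP, following the OpenTelemetry GenAI Semantic Conventions, so it plugs directly into any OpenTelemetry-compatible backend.

Starting from SkyWalking 10.4.0, the OAP natively receives and analyzes Envoy AI Gateway’s OTLP metrics and access logs, with no OpenTelemetry Collector needed in between.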

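Data Flow

The AI Gateway pushes telemetry directly to SkyWalking via OTLP gRPC:

The application sends LLM API requests through the Envoy AI Gateway.
Envoy AI Gateway routes requests to AI providers (or local models such as Ollama) and records GenAI metrics (token usage, latency, TTFT, TPOT) and access logs.
The gateway pushes metrics and logs via OTLP gRPC directly to SkyWalking OAP on port 11800.
SkyWalking OAP parses metrics with MAL rules and access logs with LAL rules, then stores everything in BanyanDB.

No OpenTelemetry Collector is needed. SkyWalking OAP’s built-in OTLP receiver handles all of the data directly.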

本地体验

这个 Demo 使用 Ollama 作为本地 LLM 后端，不需要任何 API Key 就能跑起来。Envoy AI Gateway CLI（aigw）提供独立运行模式，不依赖 Kubernetes，非常适合本地测试。

前置条件

Docker 和 Docker Compose
主机上已安装 Ollama

第一步：启动 Ollama

让 Ollama 监听所有网络接口，以便 Docker 容器能访问到：

OLLAMA_HOST=0.0.0.0 ollama serve

Pull a small model for testing:

ollama pull llama3.2:1b

Step 2: Start the Stack

Create docker-compose.yaml:

services:
  banyandb:
    image: apache/skywalking-banyandb:0.10.0
    container_name: banyandb
    ports:
      - "17912:17912"
    command: standalone --stream-root-path /tmp/stream-data --measure-root-path /tmp/measure-data
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:17913/api/healthz || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 10

  oap:
    image: apache/skywalking-oap-server:10.4.0
    container_name: oap
    depends_on:
      banyandb:
        condition: service_healthy
    ports:
      - "11800:11800"
      - "12800:12800"
    environment:
      SW_STORAGE: banyandb
      SW_STORAGE_BANYANDB_TARGETS: banyandb:17912
    healthcheck:
      test: ["CMD-SHELL", "bash -c 'echo > /dev/tcp/localhost/12800' || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 30
      start_period: 60s

  ui:
    image: apache/skywalking-ui:10.4.0
    container_name: ui
    depends_on:
      oap:
        condition: service_healthy
    ports:
      - "8080:8080"
    environment:
      SW_OAP_ADDRESS: http://oap:12800

  aigw:
    image: envoyproxy/ai-gateway-cli:latest
    container_name: aigw
    depends_on:
      oap:
        condition: service_healthy
    environment:
      - OPENAI_BASE_URL=http://host.docker.internal:11434/v1
      - OPENAI_API_KEY=unused
      - OTEL_SERVICE_NAME=my-ai-gateway
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://oap:11800
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc
      - OTEL_METRICS_EXPORTER=otlp
      - OTEL_LOGS_EXPORTER=otlp
      - OTEL_METRIC_EXPORT_INTERVAL=5000
      - OTEL_RESOURCE_ATTRIBUTES=job_name=envoy-ai-gateway,service.instance.id=aigw-1,service.layer=ENVOY_AI_GATEWAY
    ports:
      - "1975:1975"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    command: ["run"]

Start all services:

docker compose up -d

Wait for all services to become healthy (BanyanDB starts first, then OAP, and finally the UI and the AI Gateway):

docker compose ps

Key OTLP settings for the aigw service:

Environment variable	Value	Purpose
`OTEL_SERVICE_NAME`	`my-ai-gateway`	Service name in SkyWalking
`OTEL_EXPORTER_OTLP_ENDPOINT`	`http://oap:11800`	SkyWalking OAP gRPC endpoint
`OTEL_EXPORTER_OTLP_PROTOCOL`	`grpc`	OTLP transport protocol
`OTEL_METRICS_EXPORTER`	`otlp`	Enables metrics push
`OTEL_LOGS_EXPORTER`	`otlp`	Enables access log push

OTEL_RESOURCE_ATTRIBUTES must include:

job_name=envoy-ai-gateway: the routing label for the MAL/LAL rules
service.instance.id=<id>: the instance identifier
service.layer=ENVOY_AI_GATEWAY: routes logs to the AI Gateway LAL rules

The MAL and LAL rules are enabled by default in SkyWalking OAP; no extra configuration is needed.
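
Because a missing resource attribute means the telemetry is never matched by the rules, it can help to check the value before starting the gateway. The following is a minimal sketch of such a check; `parse_resource_attributes` and `missing_attributes` are hypothetical helpers written for this post, not part of SkyWalking or Envoy AI Gateway.

```python
# Values the text above says must be present in OTEL_RESOURCE_ATTRIBUTES.
REQUIRED = {
    "job_name": "envoy-ai-gateway",
    "service.layer": "ENVOY_AI_GATEWAY",
}


def parse_resource_attributes(raw):
    """Parse a 'k1=v1,k2=v2' string into a dict, skipping malformed pairs."""
    attrs = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def missing_attributes(raw):
    """Return a list of human-readable problems; an empty list means OK."""
    attrs = parse_resource_attributes(raw)
    problems = []
    for key, expected in REQUIRED.items():
        if attrs.get(key) != expected:
            problems.append(f"{key} should be {expected}")
    if not attrs.get("service.instance.id"):
        problems.append("service.instance.id is missing")
    return problems


value = "job_name=envoy-ai-gateway,service.instance.id=aigw-1,service.layer=ENVOY_AI_GATEWAY"
print(missing_attributes(value))
```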

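Step 3: Run the Demo Application

Create a simple Python application (app.py) that sends requests through the AI Gateway. It mixes normal requests, streaming requests (to produce TTFT/TPOT metrics), and error requests (a non-existent model → HTTP 404, which the LAL sampling policy always captures):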

import time, random, requests

GATEWAY = "http://localhost:1975"
HEADERS = {"Authorization": "Bearer unused", "Content-Type": "application/json"}

questions = [
    "What is Apache SkyWalking? Answer in one sentence.",
    "What is Envoy Proxy used for? Answer in one sentence.",
    "What are the benefits of an AI gateway? Answer in two sentences.",
    "Explain observability in three sentences.",
]

def chat(model, question, stream=False):
    resp = requests.post(
        f"{GATEWAY}/v1/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": question}], "stream": stream},
        headers=HEADERS, timeout=60, stream=stream,
    )
    if stream:
        chunks = []
        for line in resp.iter_lines():
            if line:
                chunks.append(line.decode())
        return resp.status_code, f"[streamed {len(chunks)} chunks]"
    return resp.status_code, resp.json()

while True:
    r = random.random()
    if r < 0.2:
        # Error request: non-existent model triggers 404
        status, body = chat("non-existent-model", "hello")
        print(f"[error] model=non-existent-model status={status}")
    elif r < 0.5:
        # Streaming request — generates TTFT and TPOT metrics
        q = random.choice(questions)
        status, info = chat("llama3.2:1b", q, stream=True)
        print(f"[stream] status={status} {info}")
    else:
        # Normal non-streaming request
        q = random.choice(questions)
        status, body = chat("llama3.2:1b", q)
        answer = body.get("choices", [{}])[0].get("message", {}).get("content", "")[:80]
        tokens = body.get("usage", {})
        print(f"[ok] status={status} tokens={tokens} answer={answer}...")
    time.sleep(random.randint(20, 30))

Run it:

pip install requests
python app.py

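The application talks to the AI Gateway on port 1975, and the AI Gateway routes to Ollama. Every request produces GenAI metrics (token usage, latency, TTFT, TPOT) and access logs, which the gateway pushes to SkyWalking over OTLP.

Error requests (a non-existent model → HTTP 404) are always captured by the access log sampling policy, so they are guaranteed to appear in SkyWalking’s log view.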
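
To make TTFT and TPOT concrete, here is a minimal client-side sketch that derives both from chunk arrival times. It assumes each streamed chunk carries one output token unless a token count is supplied, and that TPOT is averaged over the tokens after the first. The gateway’s exact definitions may differ, and `streaming_latency` is a hypothetical helper, not part of any library.

```python
def streaming_latency(request_start, chunk_times, output_tokens=None):
    """Return (ttft, tpot) in the same time unit as the inputs.

    request_start: timestamp taken just before the request was sent.
    chunk_times:   arrival timestamps of the streamed chunks, in order.
    output_tokens: reported output token count, if known (assumption:
                   one token per chunk when it is not provided).
    """
    if not chunk_times:
        return None, None
    ttft = chunk_times[0] - request_start
    n = output_tokens if output_tokens is not None else len(chunk_times)
    if n < 2:
        # A single token has a TTFT but no inter-token interval.
        return ttft, None
    tpot = (chunk_times[-1] - chunk_times[0]) / (n - 1)
    return ttft, tpot


# Five chunks arriving 250 ms apart, the first one 500 ms after the request.
print(streaming_latency(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
```

In app.py, the inputs could be collected by calling time.monotonic() just before requests.post and once per non-empty line inside the iter_lines loop.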

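Step 4: View in the SkyWalking UI

Open http://localhost:8080 and select the GenAI > Envoy AI Gateway menu.

The service list shows my-ai-gateway with CPM, latency, and token rates at a glance:

Click into the service to see the full dashboard: request CPM, latency (average and percentiles), input/output token rates, TTFT, and TPOT:

The Providers tab breaks the metrics down by AI provider:

The Models tab shows per-model metrics, including TTFT and TPOT (streaming requests only). Note the unknown model entry, which corresponds to the error requests that used a non-existent model:

The Log tab shows access logs. The sampling policy drops normal successful responses but always keeps errors (HTTP 404) and requests with high token consumption: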
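
The sampling behavior described above can be pictured as a simple predicate. The sketch below is an illustration only: the real policy lives in SkyWalking’s LAL rules, the `token_threshold` value is a made-up number, and treating every status of 400 or above as an error is an assumption.

```python
def keep_access_log(status_code, total_tokens, token_threshold=10_000):
    """Decide whether an access log entry survives sampling (illustrative).

    token_threshold is a hypothetical cut-off, not SkyWalking's default.
    """
    if status_code >= 400:
        # Assumption: any 4xx/5xx response counts as an error and is kept.
        return True
    # Successful responses are kept only when they were expensive.
    return total_tokens >= token_threshold


for status, tokens in [(404, 0), (200, 50), (200, 20_000)]:
    print(status, tokens, keep_access_log(status, tokens))
```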

Cleanup

docker compose down

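Kubernetes Production Deployment

In production, Envoy AI Gateway runs as a full Kubernetes controller, with Envoy Gateway as the control plane. See the Envoy AI Gateway getting started guide.

The OTLP configuration is the same: set the OTEL_* environment variables on the AI Gateway’s External Processor and point them at the SkyWalking OAP gRPC port (11800). See the SkyWalking Envoy AI Gateway Monitoring doc.

GenAI Observability Without an AI Gateway

Not every deployment needs an AI gateway. If your application calls LLM providers directly, SkyWalking 10.4.0 also offers GenAI observability through the Virtual GenAI layer.

Any application instrumented with a SkyWalking, OpenTelemetry, or Zipkin agent can use this feature. As long as its traces carry gen_ai.* tags (following the OpenTelemetry GenAI Semantic Conventions), SkyWalking can derive per-provider and per-model metrics from the client’s perspective: latency, token usage, success rate, and estimated cost.

For Java applications, the SkyWalking Java Agent (9.7+) ships with a Spring AI plugin that automatically adds the correct gen_ai.* span tags to calls to 13+ providers (OpenAI, Anthropic, AWS Bedrock, Google GenAI, DeepSeek, Mistral, and others), with no code changes.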
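
As a rough picture of how an estimated cost can be derived from gen_ai.* span tags, here is a minimal sketch. It is not SkyWalking’s implementation. The tag names follow the OpenTelemetry GenAI Semantic Conventions, while the `PRICES` table, the model name `model-a`, and the function `estimate_cost` are hypothetical and the prices are not real.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens: (input, output). Not real pricing.
PRICES = {"model-a": (1.0, 2.0)}


def estimate_cost(spans, prices=PRICES):
    """Sum an estimated cost per model from a list of span tag dicts."""
    totals = defaultdict(float)
    for tags in spans:
        model = tags.get("gen_ai.request.model", "unknown")
        # Models without a configured price contribute zero cost.
        in_price, out_price = prices.get(model, (0.0, 0.0))
        in_tok = int(tags.get("gen_ai.usage.input_tokens", 0))
        out_tok = int(tags.get("gen_ai.usage.output_tokens", 0))
        totals[model] += (in_tok * in_price + out_tok * out_price) / 1_000_000
    return dict(totals)


spans = [
    {"gen_ai.request.model": "model-a",
     "gen_ai.usage.input_tokens": 1_000_000,
     "gen_ai.usage.output_tokens": 500_000},
    {"gen_ai.request.model": "model-a",
     "gen_ai.usage.output_tokens": 500_000},
]
print(estimate_cost(spans))
```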

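This is a different use case from the Envoy AI Gateway monitoring described above:

Envoy AI Gateway layer: infrastructure-level observability, meaning what the gateway sees across all traffic. Best for platform teams managing centralized AI routing.
Virtual GenAI layer: application-level observability, meaning what each instrumented app sees for its own LLM calls. Best for teams without a centralized gateway, or for per-application cost tracking.

References

Envoy AI Gateway: project site and documentation
Envoy AI Gateway CLI: standalone mode for local development
SkyWalking Envoy AI Gateway Monitoring: OAP setup doc
SkyWalking Virtual GenAI: client-side GenAI observability
OpenTelemetry GenAI Semantic Conventions: the metric/attribute standard both projects follow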

Zh: How AI Coding Is Reshaping the Way Software Architects Work

Sun, 15 Mar 2026 00:00:00 +0000

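Using SkyWalking GraalVM Distro as an example, a look at how AI Coding turned a batch of exploratory PoCs into a repeatable migration pipeline.

The biggest lesson this project taught me is not how much code AI can write. It is that AI Coding changes the cost of trial and error in architecture design. When an idea can quickly become a PoC, be run and validated, and be thrown away and redone if it fails, architects have a better chance of getting close to the design they really want, rather than stopping early at a compromise that “the team can build right now.”

This shift matters especially in mature open-source systems. Apache SkyWalking OAP has long been a powerful, production-proven observability backend, but it has every problem a large Java platform is expected to have: runtime bytecode generation, reflection-heavy initialization, classpath scanning, SPI-based module wiring, and dynamic DSL execution. These mechanisms make extension easy, but every one of them is an obstacle to GraalVM Native Image.

SkyWalking GraalVM Distro exists because we treated this challenge as an architecture design problem rather than a one-off porting project. The goal was not only to make OAP run as a native binary, but to turn the GraalVM migration itself into a repeatable automated pipeline that can keep up with upstream evolution.

For the full technical design, benchmark data, and getting-started instructions, read the companion post: SkyWalking GraalVM Distro: Design and Benchmarks.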

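From a Stalled Idea to a Runnable System

This effort actually began many years ago. Shortly after this repository was created, yswdqz spent several months exploring migration approaches. Only by doing the work did it become clear that the project was far more complex than the individual limitations listed in the GraalVM documentation, and the work was eventually shelved for many years as a result.

That stall matters. What was missing was not ideas. Mature maintainers are rarely short of ideas; what is truly scarce is the time, people, and energy to build them. Even when the architect can already see several promising routes, limited development resources force an earlier trade-off: pick the option that is cheapest to implement rather than the one that is cleaner, more reusable, and more resilient to future change.

This situation is very common, not special. In open-source communities, much of the work depends on volunteers or limited corporate sponsorship. In commercial products, the constraints take a different form but are the same in essence: roadmap commitments, team size, and delivery pressure keep engineering resources tight. In both environments, many good ideas are abandoned not because they are wrong, but because validating them properly and implementing them completely costs too much.

There is another equally important constraint: the architect is usually also a very senior engineer, not someone who can work full time on implementation details. Personal coding energy is limited, time is highly fragmented, and the architect must keep explaining design intent to other senior engineers before any code exists. Traditionally, that explanation happens through diagrams, documents, and conversation. It is slow, lossy, and full of uncertainty. We have all played the “telephone game”: even a simple message is easily misunderstood, and by the time the misunderstanding surfaces, a lot of time has already gone by.

By late 2025, AI Coding finally made “trying several routes at once” realistic. We no longer had to accept a compromise early because implementation capacity was scarce. Instead, we could switch back and forth between designs, validate them in code, eliminate weak options quickly, and keep iterating until the architecture itself was solid, practical, and efficient enough.

That design freedom is critical. The GraalVM documentation explains individual limitations clearly, but a mature OSS platform runs into a whole set of interlocking, systemic problems. Patching a single dynamic mechanism is nowhere near enough. To make native image truly viable, we had to convert entire classes of runtime behavior into build-time artifacts and automatically generated metadata.

Early on this path there was also a very concrete mountain. At that time, upstream SkyWalking still relied heavily on Groovy to handle LAL, MAL, and Hierarchy scripts. In theory, this was just one more example of “runtime dynamic behavior is not supported.” In practice, Groovy was the biggest obstacle on the whole path. It meant not only script execution, but an entire dynamic model that is extremely convenient on the JVM and extremely unfriendly to native image.

To get over it, we redesigned OAP’s core engines around an AOT-first model. Early experiments had to face Groovy-era runtime behavior head on and try different script-compilation approaches to work around it. The final solution went further: align with the upstream compiler pipeline, move dynamic generation forward to build time, and introduce automation so that the migration path stays controllable as upstream keeps evolving. Concretely, the generation of OAL, MAL, LAL, and Hierarchy became the output of build-time precompilers instead of remaining startup-time dynamic behavior.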

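How AI Coding Rewrites Architecture Iteration

The key to this shift is not merely that “code gets written faster.” What AI really changes is the speed of iterating between idea, prototype, validation, and redesign. Around a single problem, we could quickly build several runnable PoCs, rapidly eliminate the directions that did not hold up, and gradually settle the abstractions worth keeping into a coherent migration system.

This does not diminish the value of human architecture; it amplifies it. Which behaviors should move to build time, where configurability should remain, where to introduce same-FQCN replacements, how to keep upstream sync controllable, and which abstractions are worth preserving at any cost: these judgments can still only be made by people. The difference is that AI’s speed finally gives us the chance to actually build those better designs, instead of retreating too early to a simpler and worse compromise.

This is where the way software architects work truly changes. In the past, the architect often already knew where the cleaner direction lay, but limited engineering capacity forced that vision back to a cheaper compromise. Now the architect has, in a sense, become “someone who can build quickly” again: sketching the idea directly in code, turning high-level abstractions into interfaces, and proving the design with an implementation that really runs.

This changes communication as well as implementation. In open source we often say: talk is cheap, show me the code. In the age of AI Coding, “showing the code” has become much easier. Design no longer depends so heavily on a slow, top-down translation process from idea to document, to explanation, to implementation. Code can appear earlier and run earlier.

Other senior engineers benefit too. They no longer have to reconstruct the whole design from diagrams, meetings, or long explanations alone. They can review the abstractions directly, read real code, run it, challenge it, and refine it together on a concrete implementation. This makes architectural collaboration faster and clearer, with far fewer communication errors.

This is also why I feel much of today’s AI discussion is somewhat off track. Many projects are genuinely interesting and fun, and trying them out is perfectly fine, but advanced engineering work does not automatically get better because “an agent was attached to the codebase.” What matters is not which demo looks the flashiest, but which engineering capabilities are really amplified, and whether the discipline of software development itself is preserved.

For architects and senior engineers, the capabilities that really matter here include:

Fast comparative prototyping: rather than arguing for an idea with slides and documents alone, build several approaches as runnable code and compare them.
Large-scale code comprehension: reading quickly across a large number of modules while keeping a global view of the whole system.
Systematic refactoring: methodically converting paths that rely on reflection and runtime dynamic behavior into designs that fit AOT constraints.
Building automation: when a migration step has to be redone on every upstream sync, doing it by hand is time-consuming and laborious to begin with, and it only gets more exhausting over time. AI makes it feasible to invest in generators, inventories, consistency checks, and drift detection, turning repeated manual labor into repeatable automated processes.
Review at breadth: checking edge cases, compatibility constraints, and whether the approach holds up under repeated execution across a very large code surface.

These capabilities all show in the final design. Same-FQCN replacements establish a clear, controlled boundary for GraalVM-specific behavior. Reflection metadata no longer depends on a hand-maintained list of guesses but is generated directly from build outputs. The various inventory mechanisms and drift detection turn the previously vague “upstream sync risk” into an explicit engineering workflow.

For junior engineers, I think the lesson is just as important. AI does not make fundamentals such as architecture design, system constraints, interface design, testing, and maintainability unimportant. On the contrary, these skills only become more important, because they determine whether “accelerated implementation” ultimately produces a system that can evolve sustainably, or just produces more code faster. The real leverage comes from engineering judgment, not novelty.

Claude Code and Gemini AI acted as engineering accelerators throughout the process. In the GraalVM Distro project specifically, they helped us do several things:

Turn migration ideas directly into runnable code: instead of debating which direction might work, build multiple real prototypes, run them, compare them, and eliminate the directions that do not hold up.
Refactor reflection-heavy and dynamic code paths: systematically replace runtime-hostile patterns with AOT-friendly implementations.
Make upstream sync truly sustainable: every time the distro pulls changes from upstream SkyWalking, metadata scanning, configuration regeneration, and recompilation must all happen again. AI helped us turn these steps into a pipeline, so that each sync is a controlled and largely automated process rather than manual rework that grows longer every time.
Review logic and edge cases at breadth: especially where feature parity matters more than raw implementation speed.

The final output is not just one big rewrite but a repeatable system: precompilers, manifest-driven loading, reflection configuration generation, replacement boundaries, and drift detection mechanisms that make upstream migration reviewable and automatable.

For the broader background behind this development approach, read this post: Practicing Agentic Vibe Coding in a Mature, Large Open-Source Project: Software Engineering and Engineering Cybernetics Live On. This post is the next step in that story: not only enhancing features in a mature codebase, but reviving work that had stalled and turning it into a runnable system.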

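What Has Really Changed

The most important result of this project is not a benchmark table. The benchmark data certainly belong to the distro itself, and they matter because they prove the system is real and runnable. But for this post, the deeper change is at the level of methodology: AI Coding changed how we explore, validate, and refine architectural options.

In the past, architecture was more of a document-centered activity trailed by a long and expensive implementation process. Now we can move much faster between idea, prototype, comparison, and redesign. That gives us a real chance to pursue solutions at a higher level of abstraction, keep cleaner boundaries, and build the automation that keeps the migration process sustainably maintainable.

The technical evidence for this work is SkyWalking GraalVM Distro itself: not only a runnable system, but a migration pipeline made up of precompilers, automatically generated reflection metadata, controlled replacement boundaries, and drift checks. The benchmark data matter because they prove the system holds up in practice, but from an architectural point of view the real result is that the migration is no longer a one-off port and has become a repeatable piece of systems engineering. For the full test methodology, raw data, and technical design, read the companion post: SkyWalking GraalVM Distro: Design and Benchmarks.

The project repository is at apache/skywalking-graalvm-distro. We welcome community members to test this new distribution, file issues, and help it move step by step toward production readiness.

For me, the deeper lesson goes beyond this distribution. AI Coding does not make architecture unimportant; it makes architecture more worth pursuing seriously. When implementation speed rises far enough, we finally have the chance to validate more ideas in real code, keep the abstractions that are truly good, and actually build the systems that used to be compromised halfway because the investment was too large.

For senior engineers, the bottleneck is shifting from raw implementation speed toward taste, system judgment, and the ability to define stable boundaries. For junior engineers, the right path is not to chase every exciting-looking AI workflow, but to strengthen the fundamentals so that acceleration truly compounds: understanding requirements, reading unfamiliar systems, questioning assumptions, and recognizing the parts that must stay correct while the system changes rapidly. AI Coding lowered the cost of validating good designs, but it did not lower the bar for engineering judgment itself.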

Blog: How AI Changed the Economics of Architecture

Fri, 13 Mar 2026 00:00:00 +0000

SkyWalking GraalVM Distro: A case study in turning runnable PoCs into a repeatable migration pipeline.

The most important lesson from this project is not that AI can generate a large amount of code. It is that AI changes the economics of architecture. When runnable PoCs become cheap to build, compare, discard, and rebuild, architects can push further toward the design they actually want instead of stopping early at a compromise they can afford to implement.

That shift matters a lot in mature open source systems. Apache SkyWalking OAP has long been a powerful and production-proven observability backend, but it also carries all the realities of a large Java platform: runtime bytecode generation, reflection-heavy initialization, classpath scanning, SPI-based module wiring, and dynamic DSL execution. These mechanisms are friendly to extensibility but hostile to GraalVM native image.

SkyWalking GraalVM Distro is the result of treating that challenge as an architecture design problem instead of a one-off porting exercise. The goal was not only to make OAP run as a native binary, but to turn GraalVM migration itself into a repeatable automation pipeline that can stay aligned with upstream evolution.

For the full technical design, benchmark data, and getting-started guide, see the companion post: SkyWalking GraalVM Distro: Design and Benchmarks.

From Paused Idea to Runnable System

This journey actually began years ago. Shortly after this repository was created, yswdqz spent several months exploring the transition. The project proved much harder in practice than the individual GraalVM limitations sounded on paper, and the work eventually paused for years.

That pause is important. The missing ingredient was not ideas. Mature maintainers usually have more ideas than time. The real constraint was implementation economics. Even when the architect can see several promising directions, limited developer resources force an earlier trade-off: choose the path that is cheapest to implement, not necessarily the path that is cleanest, most reusable, or most future-proof.

This is a very common reality, not an exceptional one. In open source communities, much of the work depends on volunteers or limited company sponsorship. In commercial products, the pressure is different but the constraint is still real: roadmap commitments, staffing limits, and delivery deadlines keep engineering resources tight. In both worlds, good ideas are often abandoned not because they are wrong, but because they are too expensive to validate and implement thoroughly.

There is another constraint that matters just as much: the architect is usually also a very senior engineer, not a full-time implementation machine. That means limited personal coding energy, fragmented time, and a constant need to explain ideas to other senior engineers before the code exists. Traditionally, that explanation happens through diagrams, documents, and conversations. It is slow, lossy, and unpredictable. We all know some version of the Telephone Game: even simple words are easy to misunderstand, and by the time the misunderstanding becomes visible, a lot of time has already passed.

What changed in late 2025 was that AI engineering made multiple runnable ideas affordable. Instead of picking an early compromise because implementation capacity was scarce, we could switch repeatedly between designs, validate them with code, discard weak directions quickly, and keep iterating until the architecture became solid, practical, and efficient enough to hold.

That design freedom was critical. GraalVM documentation gives clear guidance on isolated limitations, but a mature OSS platform hits them as a connected system. Fixing only one dynamic mechanism is not enough. To make native image practical, we had to turn whole categories of runtime behavior into build-time artifacts and automated metadata generation.

There was also a very concrete mountain in front of us in the early history of this distro. In the first several commits of the repository, upstream SkyWalking still relied heavily on Groovy for LAL, MAL, and Hierarchy scripts. In theory, that was just one more unsupported runtime-heavy component. In practice, Groovy was the biggest obstacle in the whole path. It represented not only script execution, but a whole dynamic model that was deeply convenient on the JVM side and deeply unfriendly to native image.

To bridge the gap, we re-architected the core engines of OAP around an AOT-first model. Earlier experiments had to confront Groovy-era runtime behavior directly and explore alternative script-compilation approaches to get around it. The finalized direction went further: align with the upstream compiler pipeline, move dynamic generation to build time, and add automation so the migration stays controllable as upstream keeps moving. Concretely, that meant turning OAL, MAL, LAL, and Hierarchy generation into build-time precompiler outputs instead of leaving them as startup-time dynamic behavior.

AI Speed Changed the Design Loop

The scale of this transformation was not only about coding faster. AI changed the loop between idea, prototype, validation, and redesign. We could build runnable PoCs for different approaches, throw away weak ones quickly, and preserve the promising abstractions until they formed a coherent migration system.

That does not reduce the role of human architecture. It raises the value of it. Human judgment was still required to decide what should become build-time, what should stay configurable, where to introduce same-FQCN replacements, how to keep upstream sync controllable, and which abstractions were worth preserving. But AI speed made it realistic to pursue those better designs instead of settling for a simpler compromise too early.

This is the real change in the economics of architecture. In the past, an architect might already know the cleaner direction, but limited engineering capacity often forced that vision back toward a cheaper compromise. Now the architect can return much closer to being a fast developer again: building code, shaping high-abstraction interfaces, and using design patterns to prove the vision directly in the real world.

That changes communication as much as implementation. In open source, we often say, talk is cheap, show me the code. With AI engineering, showing the code becomes much more straightforward. The design no longer depends so heavily on a slow top-down translation from idea to documents to interpretation to implementation. The code can appear earlier, and it can run earlier.

Other senior engineers benefit from this too. They do not need to reconstruct the whole design only from diagrams, meetings, or long explanations. They can review the actual abstraction, see the behavior in code, run it, challenge it, and refine it from something concrete. That makes architectural collaboration faster, clearer, and less lossy.

This is also where I think the current AI discussion is often noisy. Many projects are fun, surprising, and worth exploring, but advanced engineering work is not improved merely by attaching an agent to a codebase. The important question is not which demo looks most magical. The important question is which engineering capabilities are actually being accelerated without losing the discipline of software development itself.

For architects and senior engineers, the capabilities that mattered most here were:

Fast comparative prototyping: Building several runnable approaches in code instead of defending one idea with slides and documents.
Large-scale code comprehension: Reading across many modules quickly enough to keep the whole system in view.
Systematic refactoring: Converting reflection-heavy or runtime-dynamic paths into designs that fit AOT constraints.
Automation construction: When a migration step must be repeated every upstream sync, doing it manually once is already expensive. Doing it manually again next time is even more expensive. AI made it practical to invest in generators, inventories, consistency checks, and drift detectors that turn repeated manual work into repeatable automation.
Review at breadth: Checking edge cases, compatibility boundaries, and repeatability across a large surface area.

Those capabilities were visible in the resulting design. Same-FQCN replacements created a controlled boundary for GraalVM-specific behavior. Reflection metadata was generated from build outputs instead of maintained as a hand-written guess list. Inventories and drift detectors turned upstream sync from a vague maintenance risk into an explicit engineering workflow.
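
To illustrate what an inventory plus drift detector can look like in principle, here is a minimal sketch. It is not the distro’s actual tooling: the inventory format (path mapped to a SHA-256 fingerprint recorded at the last sync), the file paths, and the function names are all hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable fingerprint of a file's content."""
    return hashlib.sha256(content).hexdigest()


def detect_drift(inventory, upstream_files):
    """Compare the recorded inventory with the current upstream files.

    inventory:      {path: fingerprint recorded at the last sync}
    upstream_files: {path: current file content as bytes}
    """
    current = {path: fingerprint(data) for path, data in upstream_files.items()}
    return {
        "changed": sorted(p for p in inventory if p in current and current[p] != inventory[p]),
        "removed": sorted(p for p in inventory if p not in current),
        "added": sorted(p for p in current if p not in inventory),
    }


# Hypothetical paths, used only to show the three kinds of drift.
inventory = {"A.java": fingerprint(b"v1"), "B.java": fingerprint(b"v1")}
upstream = {"A.java": b"v2", "C.java": b"v1"}
print(detect_drift(inventory, upstream))
```

A non-empty report is what turns a vague sync risk into a concrete review list: each changed, removed, or added path is something a replacement class or generated artifact may need to follow.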

For junior engineers, I think the lesson is equally important. AI does not remove the need to learn architecture, invariants, interfaces, testing, or maintenance. It makes those skills more valuable, because they determine whether accelerated implementation produces a durable system or just more code faster. The leverage comes from engineering judgment, not from novelty.

Claude Code and Gemini AI acted as engineering accelerators throughout this process. In the GraalVM Distro specifically, they helped us:

Explore migration strategies as running code: Instead of debating which approach might work, we built and compared multiple real prototypes, discarded the weak ones, and kept what held up.
Refactor reflection-heavy and dynamic code paths: Replace runtime-hostile patterns with AOT-friendly alternatives across the codebase.
Make upstream sync sustainable: Every time the distro pulls from upstream SkyWalking, metadata scanning, config regeneration, and recompilation must happen again. AI helped build the pipeline so that each sync is a controlled, largely automated process rather than a fresh manual effort that grows longer each time.
Review logic and edge cases at scale: Especially in places where feature parity mattered more than raw implementation speed.

The result was not just a large rewrite. It was a repeatable system: precompilers, manifest-driven loading, reflection-config generation, replacement boundaries, and drift detectors that make upstream migration reviewable and automatable.

For the broader methodology behind this style of development, see Agentic Vibe Coding in a Mature OSS Project. This post is the next step in that story: not only enhancing an active mature codebase, but reviving a paused effort and making it actually runnable.

What Actually Changed

The most important outcome of this project is not a benchmark table. The benchmark results belong to the distro itself, and they matter because they prove the system is real. But for this post, the deeper result is methodological: AI engineering changed how architecture could be explored, validated, and refined.

Instead of treating architecture as a mostly document-driven activity followed by a long and expensive implementation phase, we were able to move much faster between idea, prototype, comparison, and redesign. That made it realistic to pursue higher-abstraction solutions, preserve cleaner boundaries, and build the automation needed to keep the migration maintainable over time.

The technical evidence for that work is the SkyWalking GraalVM Distro itself: not only a runnable system, but a migration pipeline expressed as precompilers, generated reflection metadata, controlled replacement boundaries, and drift checks. The benchmark data matter because they prove the system works in practice, but the architectural result is that the migration became a repeatable system rather than a one-time port. For detailed benchmark methodology, per-pod data, and the full technical design, see SkyWalking GraalVM Distro: Design and Benchmarks.

The project is hosted at apache/skywalking-graalvm-distro. We invite the community to test it, report issues, and help move it toward production readiness.

For me, the deeper takeaway is broader than this distro. AI engineering does not make architecture less important. It makes architecture more worth pursuing. When implementation speed rises enough, we can afford to test more ideas in code, keep the good abstractions, and build systems that would previously have been judged too expensive to finish well.

For senior engineers, that means the bottleneck shifts away from raw typing speed and toward taste, system judgment, and the ability to define stable boundaries. For junior engineers, it means the path forward is not to chase every exciting AI workflow, but to become stronger at the fundamentals that let acceleration compound: understanding requirements, reading unfamiliar systems, questioning assumptions, and recognizing what must remain correct as everything around it changes. AI changed the economics of architecture because it lowered the cost of validating better designs without lowering the bar for engineering judgment.

Blog: Agentic Vibe Coding in a Mature OSS Project: What Worked, What Didn't

Sun, 08 Mar 2026 00:00:00 +0000

Most “vibe coding” stories start with a greenfield project. This one doesn’t.

Apache SkyWalking is a 9-year-old observability platform with hundreds of production deployments, a complex DSL stack, and an external API surface that users have built dashboards, alerting rules, and automation scripts against. When I decided to replace the core scripting engine — purging the Groovy runtime from four DSL compilers — the constraint wasn’t “can AI write the code?” It was: “can AI write the code without breaking anything for existing users?”

The answer turned out to be yes — ~77,000 lines changed across 10 major PRs in about 5 weeks — but only because the AI was tightly guided by a human who understood the project’s architecture, its compatibility contracts, and its users. This post is about the methodology: what worked, what didn’t, and what mature open-source maintainers should know before handing their codebase to AI agents.

The Project in Brief

The task was to replace SkyWalking’s Groovy-based scripting engines (MAL, LAL, Hierarchy) with a unified ANTLR4 + Javassist bytecode compilation pipeline, matching the architecture already proven by the OAL compiler. The internal tech stack was completely overhauled; the external interface had to remain identical.

Beyond the compiler rewrites, the scope included a new queue infrastructure (threads dropped from 36 to 15), virtual thread support for JDK 25+, and E2E test modernization. By conventional estimates, this was 5-8 months of senior engineer work.

For the full technical details on the compiler architecture, see the Groovy elimination discussion.

What is Agentic Vibe Coding?

“Vibe coding” — a term coined by Andrej Karpathy — describes a style of programming where you describe intent and let AI write the code. It’s powerful for prototyping, but on its own, it’s risky for production systems.

Agentic vibe coding takes this further: instead of a single AI autocomplete, you orchestrate multiple AI agents — each with different strengths — under your architectural direction, with automated tests as the safety net. In my workflow:

Claude Code (plan mode): Primary coding agent. Plan mode lets me review the approach before any code is generated. This is critical for architectural decisions — I steer the design, Claude handles the implementation.
Gemini: Code review, concurrency analysis, and verification reports. Gemini reviewed every major PR for thread-safety, feature parity, and edge cases.
Codex: Autonomous task execution for well-defined, bounded work items.

The key insight: AI writes the code, but the architect owns the design. Without deep domain knowledge of SkyWalking’s internals, no AI could have planned these changes. Without AI, I couldn’t have executed them in 5 weeks.

How TDD Made AI Coding Safe

The reason I could move this fast without breaking things comes down to one principle: never let AI code without a test harness.

My workflow for each major change:

Plan mode first: Describe the goal to Claude, review the plan, iterate on architecture before any code is written.
Write the test contract: Define what “correct” means — for the compiler rewrites, this meant cross-version comparison tests that run every expression through both the old and new engines, asserting identical results across 1,290+ expressions.
Let AI implement: With the test contract in place, Claude can write thousands of lines of implementation code. If it’s wrong, the tests catch it immediately.
E2E as the final gate: Every PR must pass the full E2E test suite — Docker-based integration tests that boot the entire server with real storage backends.
AI code review: Gemini reviewed each PR for concurrency issues, thread-safety, and feature parity — catching things that unit tests alone wouldn’t find.

This is the opposite of “hope it works” vibe coding. The AI writes fast, the tests verify fast, and I steer the architecture. The feedback loop is tight enough that I can iterate on complex compiler code in minutes instead of days.
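
The cross-version comparison idea can be sketched in a few lines. This is an illustration of the technique only, not SkyWalking’s test framework (which is not shown in this post): `old_engine` and `new_engine` are toy stand-ins for the legacy and rewritten compilers, and the new one is deliberately wrong for multiplication so the harness has something to catch.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def old_engine(expr, sample):
    """Toy stand-in for the legacy engine: evaluates 'a + b' style expressions."""
    left, op, right = expr.split()
    return OPS[op](sample[left], sample[right])


def new_engine(expr, sample):
    """Toy stand-in for the rewritten engine, with a deliberate bug for '*'."""
    left, op, right = expr.split()
    if op == "*":
        return sample[left] + sample[right]
    return OPS[op](sample[left], sample[right])


def compare_engines(expressions, old, new, sample):
    """Run every expression through both engines and collect mismatches."""
    mismatches = []
    for expr in expressions:
        expected = old(expr, sample)
        actual = new(expr, sample)
        if expected != actual:
            mismatches.append((expr, expected, actual))
    return mismatches


print(compare_engines(["a + b", "a * b"], old_engine, new_engine, {"a": 2, "b": 3}))
```

The expected values come from the old engine rather than from hand-written assertions, so the coding agent cannot make the suite pass by editing expected values.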

Lessons Learned

AI is a force multiplier, not a replacement. Before any AI agent wrote a single line, a human had to define the replacement solution: what gets replaced, how it gets replaced, and — critically — where the boundaries are. Which APIs could break? The internal compilation pipeline was fair game for a complete overhaul. Which APIs must stay aligned? Every external-facing DSL syntax, every YAML configuration key, every metrics name and tag structure had to remain byte-for-byte identical — because hundreds of deployed dashboards, alerting rules, and user scripts depend on them. Drawing these boundaries required deep knowledge of the codebase and its users. AI executed the plan at extraordinary speed, but the plan itself — the scope, the invariants, the compatibility contract — had to come from a human who understood the blast radius of every change.

Plan mode is non-negotiable for architectural work. Letting AI jump straight to code on a compiler rewrite would be a disaster. Plan mode’s strength is that it collects code context — scanning imports, tracing call chains, mapping class hierarchies — and uses that context to help me fill in implementation details I’d otherwise have to look up manually. But it can’t tell you the design principles. That direction had to come from me, stated clearly upfront, so the AI’s planning stayed on the right track instead of optimizing toward a locally reasonable but architecturally wrong solution.

Know when to hit ESC. Claude has a clear tendency to dive deep into writing solution code once it starts, and it won’t stop on its own when it encounters something that conflicts with the original plan’s concept. Instead of pausing to flag the conflict, it will push forward, improvising around the obstacle in ways that silently violate the design intent. I had to learn to watch for this: when Claude’s output started drifting from the plan, I’d cancel the task manually (ESC), identify where the plan and reality diverged, adjust the plan, and restart. This interrupt-replan cycle was a regular part of the workflow, not an exception. The architect has to stay in the loop, not just at planning time but during execution, because AI agents don’t yet know when to stop and ask.

Spec-driven testing is necessary but not sufficient — the logic workflow matters more. It’s tempting to think that if you define the input/output spec clearly enough, AI can fill in the implementation and tests will catch any mistakes. I tried this. It doesn’t work for anything non-trivial. During the expression compiler rewrite, Claude would sometimes change code in unreasonable ways just to make the spec tests pass — the inputs went in, the expected outputs came out, and everything looked green. But the internal logic was wrong: inconsistent with the design patterns the rest of the codebase relied on, impossible to extend, or solving the specific test case through a hack rather than a general mechanism. A spec only checks what the code produces; it says nothing about how the code produces it. For a mature project, the “how” matters enormously — the solution needs to be consistent with the existing architecture, widely adoptable by contributors, and maintainable long-term. That’s why I needed cross-version testing and human review of the implementation path, not just the results.

Testing at two levels kept the rewrite honest. Cross-version testing was part of my design plan from the start — I architected the dual-path comparison framework so that every production DSL expression runs through both the old and new engines, asserting identical results across 1,290+ expressions. This gave me confidence no human review could match, and it was a deliberate planning decision: I knew AI-generated compiler code needed a mechanical proof of behavioral equivalence, not just eyeball review. On top of that, E2E tests served as the project’s existing infrastructure safety net — Docker-based integration tests that boot the entire server with real storage backends. Unit tests and cross-version tests verify logic in isolation; E2E tests verify the system actually works end-to-end. For infrastructure-level changes like queue replacement and thread model changes, E2E is the only gate that truly matters. Together, the two layers — designed-for-this-rewrite cross-version tests and pre-existing E2E infrastructure — caught different classes of bugs and made shipping with confidence possible.

Multiple AIs have different strengths. Claude excels at large-scale code generation with plan mode. Gemini is exceptional at logic review — it can mentally trace code branches with given input data, simulating execution without actually running the code. This is significant for reviewing AI-generated code: Gemini would walk through a generated compiler method step by step, flagging where a null check was missing or where a branch would produce wrong output for a specific edge case. Codex proved most valuable as a test reviewer and honesty checker. AI-generated code has a subtle failure mode: the coding agent can make wrong assumptions and then write tests that pass by setting expected values to match the wrong behavior — effectively bypassing the test safety net. Codex caught cases where Claude had set unreasonable expected values that happened to make tests green, masking logic errors that would have surfaced in production. Using all three as checks on each other was far more effective than relying on any single one.
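
The failure mode described here, a test whose expected value was copied from the wrong behavior, is easy to show in miniature. The sketch below is a hypothetical Python example: the `buggy_percentile` helper and its bug are invented for illustration and are not taken from SkyWalking. One check mirrors the implementation's own output and stays green, while a check against an independently computed value exposes the error.

```python
import math


def buggy_percentile(values, p):
    """Nearest-rank percentile with a deliberate bug: the input is never sorted."""
    rank = max(1, math.ceil(len(values) * p / 100))
    return values[rank - 1]


def reference_percentile(values, p):
    """Independent oracle: same nearest-rank rule, applied to sorted data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


SAMPLES = [120, 15, 300, 42, 80]

# Test as a coding agent might write it: the expected value (300) was copied
# from what the buggy code returns, so the test passes and hides the bug.
masked_test_passes = buggy_percentile(SAMPLES, 50) == 300

# Review-style check: the expected value is derived without the code under test.
independent_test_passes = (
    buggy_percentile(SAMPLES, 50) == reference_percentile(SAMPLES, 50)
)

print("masked test passes:", masked_test_passes)
print("independent test passes:", independent_test_passes)
```

A reviewer auditing tests is effectively asking, for each expected value, whether it could have been produced without running the code under test.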

The Mythical Man-Month still applies — and so does the Mythical Token-Month. Brooks taught us that a task requiring 12 person-months does not mean 12 people can finish it in one month. The same law applies to AI: you cannot simply throw more tokens, more agents, or more parallel sessions at a problem and expect it to converge faster. Communication costs, coordination overhead, requirements analysis, and conceptual integrity — these software engineering fundamentals do not disappear just because your workforce is artificial. Worse, when the direction is wrong — when there’s a conceptual error in the design or an unreasonable architectural choice — AI will not recognize it. It will charge down the wrong path at extraordinary speed, burning tokens furiously while trapped in a vortex of self-justification: patching code to make failing tests pass, adjusting expected values to match wrong behavior, adding workarounds on top of workarounds — each iteration making the codebase look more “complete” while drifting further from correctness. AI vibe coding cannot break out of this spiral on its own. Only a human who understands the domain can recognize “this is fundamentally wrong, stop,” discard the work, and redirect. Speed without direction is just expensive chaos.

The Bigger Picture

The agentic vibe coding approach worked because it combined AI’s speed with human architectural judgment and automated test discipline. It’s not magic — it’s engineering, accelerated.

Brooks also gave us “No Silver Bullet,” and its core distinction matters more than ever: software complexity comes in two kinds. Essential complexity comes from the problem itself — the domain semantics, the behavioral contracts, the concurrency invariants. No tool can eliminate this; it must be understood, modeled, and reasoned about by someone who knows the domain. Accidental complexity comes from the tools and implementation — boilerplate code, manual refactoring across hundreds of files, the mechanical work of translating a design into compilable source. This is exactly where AI excels. What made this project work was recognizing which complexity was which: I owned the essential complexity (architecture, API boundaries, correctness invariants), and AI demolished the accidental complexity (generating 77K lines of implementation, scaffolding test harnesses, rewriting repetitive patterns across dozens of config files). Confuse the two — let AI make essential decisions, or waste human time on accidental work — and you get the worst of both worlds.

Qian Xuesen (Tsien Hsue-shen)’s Engineering Cybernetics offers another lens that proved surprisingly relevant. His core framework of feedback, control, and optimization describes how to keep complex systems running toward their target. AI vibe coding at full speed is like a hypersonic missile: extraordinarily fast, but without a guidance system it just creates a bigger crater in the wrong place. The feedback loop in my workflow was the test harness: cross-version tests and E2E tests providing continuous signal on whether the system was still on course. Control was the human architect deciding when to intervene: reviewing plans before execution, hitting ESC when the direction drifted, choosing which AI to trust for which task. Optimization was iterative: each interrupt-replan cycle refined the approach, each Gemini review tightened the logic, each Codex audit caught assumptions the coding agent had smuggled past the tests. Without all three (feedback to detect deviation, control to correct course, optimization to converge), the speed of AI coding would be not an advantage but a liability. The faster the missile, the more precise the guidance must be.

For more details or to share your own experience with agentic coding on production systems, feel free to reach me on GitHub.

Zh: Practicing Agentic Vibe Coding on a Large, Mature Open-Source Project: Software Engineering and Engineering Cybernetics Still Apply

Sun, 08 Mar 2026 00:00:00 +0000

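Most “vibe coding” stories start with a brand-new project and describe how a prototype or a runnable project was built quickly. This one does not.

Apache SkyWalking is a 9-year-old Apache top-level project, deployed on thousands of production clusters, with a complex DSL compiler stack inside and externally exposed APIs that carry user-built dashboards, alerting rules, and automation scripts. When I decided to replace the core scripting engine, removing the Groovy runtime entirely from four DSL compilers, the question was not “Can AI write the code?” but rather “Perhaps only AI can carry out a consistent iteration at this scale,” together with “Can AI write complete, efficient code without breaking the system?”

The answer is yes: roughly 77,000 lines of code changed across 10 major PRs in about 5 weeks. But that held only because the AI always worked under the guidance of someone who deeply understands the project’s architecture, compatibility requirements, and user scenarios. This post shares my hands-on experience over the past few months, and what maintainers of mature open-source projects should know before handing their codebase to AI agents.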

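Project Overview

The task was to replace SkyWalking’s Groovy-based scripting engines (MAL, LAL, Hierarchy) with a unified ANTLR4 + Javassist bytecode compilation pipeline, aligned with the architecture already proven by the OAL compiler. The internal technology stack was rebuilt from the ground up, but the external interfaces had to stay exactly the same.

Beyond the compiler rewrite, the scope also included a new thread management strategy (thread count reduced from 36 to 15), JDK 25+ virtual thread support, and modernization of the end-to-end tests. By traditional estimates, this is 5-8 months of work for a senior engineer (using myself as the reference).

For the full technical details of the compiler architecture, see the Groovy removal discussion.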

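What Is Agentic Vibe Coding?

“Vibe coding,” a term coined by Andrej Karpathy, describes a programming style in which you express intent and let AI write the code. This style of AI programming has mostly been used for prototyping, where it is powerful and fast, but relying on it alone for production systems is risky.

Agentic vibe coding goes a step further: instead of a single AI autocomplete, you orchestrate multiple AI agents, each with its own strengths, under your architectural direction, with automated tests as the safety net. My workflow looked like this:

Claude Code (plan mode): the primary coding agent. Plan mode lets me review the approach before any code is generated. This is critical for architectural decisions: I control the design direction, and Claude handles the implementation.
Gemini: code review, concurrency analysis, and verification reports. Every major PR was reviewed by Gemini for thread safety, feature parity, and edge cases.
Codex: autonomous task execution for well-defined, clearly bounded work items.

Key insight: AI writes the code, but the architect owns the design. Without deep domain knowledge of SkyWalking’s internals, no AI could have planned these changes. Without AI, I could not have completed the execution in 5 weeks.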

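How TDD Makes AI Coding Safe

What let me move at this speed without breaking things comes down to one principle: never let AI write code without tests protecting it.

The workflow for every major change:

Start in plan mode: describe the goal to Claude, review the plan, and iterate at the architecture level before any code is written.
Write the test contract: define what “correct” means. For the compiler rewrite, that meant cross-version comparison tests that run every expression through both the old and new engines and assert identical results across 1,290+ expressions.
Let AI implement: with the test contract in place, Claude can write thousands of lines of implementation code. If it gets something wrong, the tests catch it immediately.
E2E tests as the final gate: every PR must pass the full end-to-end test suite, which consists of Docker-based integration tests that boot the entire server and connect to real storage backends.
AI code review: Gemini reviews every PR for concurrency issues, thread safety, and feature parity, catching problems that unit tests cannot find.

This is the exact opposite of “write it and pray it runs” vibe coding. AI writes fast, tests verify fast, and I steer the architecture. The feedback loop was tight enough that I could iterate on complex compiler code in minutes rather than days.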

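Lessons Learned

AI is a force multiplier, not a replacement. Before any AI agent writes the first line of code, a human has to define the replacement plan: what to replace, how to replace it, and, crucially, where the boundaries are. Which APIs may change in breaking ways? The internal compilation pipeline could be rebuilt completely. Which APIs must stay aligned? Every external DSL syntax, every YAML configuration key, and every metric name and tag structure had to remain byte-for-byte identical, because hundreds of deployed dashboards, alerting rules, and user scripts depend on them. Drawing these boundaries requires deep knowledge of the codebase and its users. AI executed the plan at astonishing speed, but the plan itself (scope, invariants, compatibility contracts) had to come from someone who understands the blast radius of every change.

For architecture-level work, plan mode is non-negotiable. Letting AI jump straight into writing code on a compiler rewrite is a recipe for disaster. The value of plan mode is that it gathers code context (scanning imports, tracing call chains, mapping class hierarchies) and uses that context to fill in implementation details I would otherwise have had to look up by hand. But it cannot tell you the design principles. I have to state the direction explicitly up front, so that the AI’s planning follows the right track instead of optimizing toward a solution that is locally reasonable but architecturally wrong.

Know when to hit ESC. Claude has a clear tendency: once it starts writing solution code, it dives in headfirst, and when it hits something that conceptually conflicts with the original plan, it does not stop on its own. Rather than pausing to flag the conflict, it pushes forward, improvises around the obstacle, and silently violates the design intent. I had to learn to watch for this signal: when Claude’s output started to drift from the plan, I manually cancelled the task (ESC), found where the plan and reality diverged, adjusted the plan, and started again. This interrupt-replan cycle was a normal part of the workflow, not the exception. The architect must stay in the loop at all times, during execution as well as planning, because AI agents do not yet know when to stop and ask.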

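Spec-driven approaches are better suited to testing than to development. They are a necessary but not sufficient condition, and the logic workflow matters more. It is tempting to think that if you define the input/output spec clearly enough, AI can fill in the implementation and tests will catch any mistakes. I tried this. It does not work for any complex production scenario. During the expression compiler rewrite, Claude would sometimes change code in unreasonable ways just to make the spec tests pass: the inputs went in, the expected outputs came out, and everything looked normal. But the internal logic was wrong: inconsistent with the design patterns the rest of the codebase relied on, impossible to extend, or solving a specific test case through a hack (reflection, static comparison of field names, and other unacceptable engineering practices) rather than a general mechanism. A spec only checks what the code produces; it knows nothing about how the code produces it. For a mature project, the “how” matters enormously: the solution needs to be consistent with the existing architecture, widely adoptable by contributors, and maintainable and extensible over the long term. That is why I needed cross-version testing plus human review of the implementation path, not just review of the results.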

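Two levels of testing gave the rewritten code stronger verification. Cross-version testing was part of my design plan from the start: I architected the dual-path comparison framework so that every production DSL expression runs through both the old and new engines, asserting identical results across 1,290+ expressions. This gave me confidence that no human review could match, and it was a deliberate planning decision: I knew AI-generated compiler code needed a mechanical proof of behavioral equivalence, not just eyeball review. On top of that, E2E tests served as the project’s existing infrastructure safety net: Docker/K8s-based integration tests that boot the entire server and connect to real storage backends. Unit tests and cross-version tests verify logic in isolation; E2E tests verify that the system actually works end-to-end. For infrastructure-level changes such as queue replacement and thread model changes, E2E is the only gate that truly matters. The two layers, cross-version tests designed specifically for this rewrite and the pre-existing E2E infrastructure, caught different classes of bugs and made shipping with confidence possible.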

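Multiple AIs have different strengths. Claude excels at large-scale code generation with plan mode. Gemini is excellent at logic review: given input data, it can mentally trace code branches, simulating execution without actually running the code. This matters a great deal when reviewing AI-generated code: Gemini would walk through a compiler-generated method step by step, flagging where a null check was missing or which branch would produce wrong output for a specific edge case. Codex proved most valuable as a test reviewer and honesty checker. AI-generated code has a subtle failure mode: the coding agent can make a wrong assumption and then write tests whose expected values match the wrong behavior, effectively bypassing the test safety net. Codex caught cases where Claude had set unreasonable expected values to turn tests green, masking logic errors that would have surfaced in production. Using all three as checks on each other was far more effective than relying on any single one.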

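The Mythical Man-Month still applies, and so does its token-based counterpart, the “Mythical AI-Month.” Brooks taught us that a task requiring 12 person-months does not mean 12 people can finish it in one month. The same law applies to AI: you cannot simply throw more tokens, more agents, or more parallel sessions at a problem and expect it to converge faster. Communication costs, coordination overhead, requirements analysis, and conceptual integrity are software engineering fundamentals that do not disappear just because your workforce is artificial. Worse, when the direction is wrong, whether because of a conceptual error in the design or an unreasonable architectural choice, AI will not recognize it. It will charge in the wrong direction at astonishing speed, burning tokens furiously while trapped in a vortex of self-justification: patching code to make failing tests pass, adjusting expected values to match wrong behavior, stacking workarounds on top of workarounds. Each iteration makes the codebase look more “complete” while it actually drifts further from correctness. AI vibe coding cannot break out of this spiral on its own. Only a human who understands the domain can recognize “this is fundamentally wrong, stop,” discard the work, and redirect. Speed without direction is just expensive chaos.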

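The Bigger Picture

Agentic vibe coding worked because it combined AI’s speed with human architectural judgment and automated test discipline. It is not magic; it is engineering, accelerated.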

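Brooks also gave us “No Silver Bullet,” and its core distinction matters more today than ever: software complexity comes in two kinds. Essential complexity comes from the problem itself: the domain semantics, the behavioral contracts, the concurrency invariants. No tool can eliminate it; it must be understood, modeled, and reasoned about by someone who knows the domain. Accidental complexity comes from the tools and the implementation: boilerplate code, manual refactoring across hundreds of files, the mechanical work of translating a design into compilable source. This is exactly where AI excels. What made this project succeed was recognizing which complexity was which: I owned the essential complexity (architecture, API boundaries, correctness invariants), and AI eliminated the accidental complexity (generating 77,000 lines of implementation code, scaffolding test harnesses, rewriting repetitive patterns across dozens of config files). Confuse the two, by letting AI make essential decisions or by wasting human time on accidental work, and you get the worst of both worlds.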

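Qian Xuesen’s Engineering Cybernetics offers another lens, one that turned out to be surprisingly relevant in practice. His core framework of feedback, control, and optimization describes how to keep a complex system running toward its target. AI vibe coding at full speed is like a hypersonic missile: astonishingly fast, but without a guidance system it only blasts a bigger crater in the wrong place. The feedback loop in my workflow was the test system: cross-version tests and E2E tests providing a continuous signal on whether the system was still on course. Control was the human architect deciding when to intervene: reviewing plans before execution, hitting ESC when the direction drifted, choosing which AI handles which task. Optimization was iterative: each interrupt-replan cycle refined the approach, each Gemini review tightened the logic, and each Codex audit caught assumptions the coding agent had slipped past the tests. Take away any one of the three (feedback to detect deviation, control to correct course, optimization to converge) and the speed of AI coding becomes a liability rather than an advantage. The faster the missile, the more precise the guidance must be.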

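AI vibe coding and its successive iterations are rapidly reaching every developer and are being widely absorbed into open-source and commercial software. We are all witnessing this new development model, and the convergence of AI vibe coding with software engineering theory. If you would like to discuss more AI + OSS topics with me, feel free to reach me on GitHub.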


Blog: Monitoring LLM Applications with SkyWalking 10.4: Insights into Performance and Cost

The Problem: As Applications “Consume” LLMs, Monitoring Leaves a Blind Spot

1. The “Black Box” of Cost and Performance: Is the Expensive Model Worth It?

2. The Vanishing “Golden Timeout”

3. The Overlooked Experience Killer: TTFT

Virtual GenAI Observability

How It Works

Installation & Configuration

Requirements

Semantic Conventions & Compatibility

SkyWalking Java Agent

OTLP / Zipkin Probes

GenAI Estimated Cost Configuration

Overview

Configuration Structure

Top-level Fields

Provider Definition

Model Definition

Model Matching Mechanism

Provider-Level Prefix Matching

Model-level Longest-Prefix Matching

Model Aliases

Custom Configuration

Adding a New Provider

Main Metrics

1. Provider Level

2. Model Level

Recommended Usage Scenarios

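Zh: Monitoring LLM Applications with SkyWalking 10.4: Insights into LLM Performance and Cost

The Problem: As Applications Start to “Devour” LLMs, Monitoring Leaves a Blind Spot

Virtual GenAI Observability

How It Works

Installation & Configuration

Requirements

Version Requirements

Semantic Conventions & Compatibility

SkyWalking Java Agent

Probes That Emit OTLP / Zipkin Format Data

GenAI Estimated Cost Configuration

Overview

Configuration Structure

Top-level Fields

Provider Definition

Model Definition

Model Matching Mechanism

Provider-Level Prefix Matching

Model-Level Longest-Prefix Matching

Model Aliases

Custom Configuration

Main Metrics

1. Provider Level

2. Model Level

Recommended Usage Scenarios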

Blog: Monitoring Envoy AI Gateway with Apache SkyWalking

The Problem: Flying Blind with LLM Traffic

Why an AI Gateway

Data Flow

Try It Locally

Prerequisites

Step 1: Start Ollama

Step 2: Start the Stack

Step 3: Run the Demo App

Step 4: View in SkyWalking UI

Cleanup

Deploying on Kubernetes

GenAI Observability Without an AI Gateway

References

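Zh: Monitoring Envoy AI Gateway with Apache SkyWalking

The Problem: LLM Traffic Lacks Unified Observability

Why an AI Gateway

Data Flow

Try It Locally

Prerequisites

Step 1: Start Ollama

Step 2: Start the Stack

Step 3: Run the Demo App

Step 4: View in SkyWalking UI

Cleanup

Production Deployment on Kubernetes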