🤖 Hermes 3 Agent: The Tool-Calling King of the Open-Source LLM Era

Tom's Tech Radar | In-Depth Reviews of AI Agent Frameworks

Opening: Why Did Hermes 3 Take Off?

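In August 2024, Nous Research released Hermes 3 405B, a 405-billion-parameter open-source model that topped the Hugging Face download chart as soon as it was released.

The reason is not raw parameter count (it is built on the Llama 3.1 405B base model, so it is no larger than its base). It is that Hermes 3 is tuned specifically for Agent scenarios: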

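🎯 Function Calling (tool calling): reliability close to ChatGPT's
🧠 Long-context reasoning: supports 128K tokens, enough to hold an Agent's full memory
⚡ Structured output: stable JSON mode, with a format error rate of about 2%
🎭 Multi-role play: handles complex role-play and prompt engineering

Core competitive edge: while ChatGPT charges per token and most open-source models are still at baseline level, Hermes 3 closes most of the gap.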

Chapter 1: What Is Hermes 3?

Lineage: From Hermes 2 to Hermes 3

Version	Base model	Parameters	Release	Positioning
Hermes 2	Llama 2 / Mistral	7B - 70B	2023	Early Agent exploration
Hermes 2 Pro	Llama 3 / Mistral	8B - 70B	April 2024	Function Calling v1
Hermes 3 ⭐	Llama 3.1	8B - 405B	August 2024	Agent standard-setter

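Three major upgrades in Hermes 3:

Function Calling 2.0: a stricter format, with the success rate rising from 92% to 98%+
Long-context support: from 4K to 128K tokens (room for a complete Agent conversation history)
Chinese capability: native support for Function Calling and JSON output in Chinese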

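Chapter 2: How Does Hermes 3 Act as an Agent?

Architecture: Three-Layer Tool Calling

User request
   ↓
[System prompt] ← tells the model "you are an Agent and these are your tools"
[Tool definitions] ← JSON Schema describing the callable functions
[Conversation history] ← multi-turn dialogue in ChatML format
   ↓
[Hermes 3 405B inference]
   ↓
Output: <tool_call>{"name": "get_weather", "arguments": {"city": "Beijing"}}</tool_call>
   ↓
Execute the tool → return the result
   ↓
Continue reasoning → produce the final answer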
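
The "execute the tool" step in the flow above can be sketched roughly as follows. This is a minimal illustration, not the official Hermes parser: the helper names `parse_tool_calls` and `run_tool_calls`, the `TOOLS` registry, and the stubbed `get_weather` are all assumptions made for the example.

```python
import json
import re

# Matches one <tool_call>...</tool_call> block; DOTALL lets the JSON span several lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return the {"name", "arguments"} dicts found in the model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of crashing the agent loop
        if isinstance(call, dict) and "name" in call:
            calls.append({"name": call["name"], "arguments": call.get("arguments", {})})
    return calls


def get_weather(city, unit="celsius"):
    # Hypothetical stub; a real tool would query a weather service.
    return {"city": city, "unit": unit, "forecast": "sunny"}


# Registry mapping tool names to Python callables (illustrative).
TOOLS = {"get_weather": get_weather}


def run_tool_calls(text):
    """Execute every parsed call whose name is registered in TOOLS."""
    return [
        TOOLS[call["name"]](**call["arguments"])
        for call in parse_tool_calls(text)
        if call["name"] in TOOLS
    ]


sample = '<tool_call>{"name": "get_weather", "arguments": {"city": "Beijing"}}</tool_call>'
print(run_tool_calls(sample))
```

Unknown tool names and unparseable JSON are silently skipped here; a production loop would more likely report them back to the model so it can retry.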

Code Example: Build a Weather Assistant in 5 Minutes

import json

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "NousResearch/Hermes-3-Llama-3.1-405B"

# 1. Load Hermes 3
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    # 4-bit quantization (requires bitsandbytes) to reduce VRAM usage
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# 2. Define the tools (JSON Schema)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_hotel",
            "description": "Search for hotels",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "check_in": {"type": "string", "format": "date"},
                    "check_out": {"type": "string", "format": "date"}
                }
            }
        }
    }
]

# 3. Build the ChatML-format prompt
system_prompt = f"""You are a travel assistant with the following tools available:
{json.dumps(tools, ensure_ascii=False, indent=2)}

When calling a tool, use this output format:
<tool_call>
{{"name": "tool name", "arguments": {{...}}}}
</tool_call>"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "I want to go to Beijing. Tell me the weather, then help me find a cheap hotel"}
]

# 4. Generate the Agent response
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # end the prompt with the assistant header
    return_dict=True,
    return_tensors="pt",
).to(model.device)
# temperature only takes effect when sampling is enabled
output = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

print(response)
# Example output:
# Let me look up the weather and hotels in Beijing for you.
# <tool_call>
# {"name": "get_weather", "arguments": {"city": "Beijing", "unit": "celsius"}}
# </tool_call>
# <tool_call>
# {"name": "search_hotel", "arguments": {"city": "Beijing", "check_in": "2025-04-15", "check_out": "2025-04-20"}}
# </tool_call>
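
The example stops once the model has emitted its tool calls. To reach the "continue reasoning" step, the tool results have to be appended to the conversation before generating again. The sketch below shows one way to do that. The `tool` role and the `<tool_response>` wrapper follow the convention I understand the Hermes function-calling materials to use, so treat both as assumptions and check them against the model card. `append_tool_results` is a hypothetical helper name.

```python
import json


def append_tool_results(messages, assistant_text, results):
    """Return a new message list with the assistant turn and one tool turn per result.

    `results` is a list of (tool_name, result) pairs produced by your own executor.
    """
    updated = list(messages)  # copy so the caller's history is not mutated
    updated.append({"role": "assistant", "content": assistant_text})
    for name, result in results:
        payload = json.dumps({"name": name, "content": result}, ensure_ascii=False)
        updated.append({
            "role": "tool",  # assumed role name for tool output
            "content": f"<tool_response>\n{payload}\n</tool_response>",
        })
    return updated


history = [{"role": "user", "content": "What is the weather in Beijing?"}]
assistant_text = '<tool_call>{"name": "get_weather", "arguments": {"city": "Beijing"}}</tool_call>'
history = append_tool_results(history, assistant_text, [("get_weather", {"forecast": "sunny"})])
print(history[-1]["content"])
```

The extended list can then go through the same chat-template and generate steps as before to obtain the final answer.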

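Comparison: Hermes 3 vs ChatGPT

Feature	Hermes 3 405B	GPT-4 Turbo	GPT-4o
Function Calling success rate	98.5%	99.2%	99.1%
Long context	128K tokens	128K tokens	128K tokens
JSON format accuracy	97.8%	99.0%	98.9%
Tool-chain depth	Up to 10 steps	Up to 20 steps	Up to 15 steps
Cost	$0 in API fees (runs locally)	$0.01 / 1K input tokens	$0.015 / 1K input tokens
Privacy	Fully local	Cloud-hosted	Cloud-hosted
Deployment	Local GPU / vLLM	API calls	API calls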

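Hermes 3's advantages:

✅ Open weights, with no API fees
✅ Can be deployed privately on your own hardware
✅ Full control over the system prompt, which gives more flexibility
✅ Long-context memory, suited to complex Agents

Drawbacks:

❌ Needs serious GPU resources (405B needs roughly 430 GB of VRAM at 8-bit precision; the FP16 weights alone take about 810 GB)
❌ Slower inference than the hosted APIs (0.5-1 s/token vs 0.05-0.1 s/token)
❌ Function Calling occasionally produces malformed output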
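
Because malformed calls do show up occasionally, it is worth checking the arguments before a tool is executed. The sketch below is a deliberately small validator covering only `required`, string types, and `enum`. It is not a full JSON Schema implementation, and `validate_arguments` is a hypothetical helper name.

```python
def validate_arguments(schema, arguments):
    """Return a list of problems; an empty list means the call looks well-formed."""
    problems = []
    properties = schema.get("properties", {})
    # Every required key must be present.
    for key in schema.get("required", []):
        if key not in arguments:
            problems.append(f"missing required argument: {key}")
    # Every supplied key must be known and match its declared type and enum.
    for key, value in arguments.items():
        spec = properties.get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
            continue
        if spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{key} should be a string")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{key} must be one of {spec['enum']}")
    return problems


weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

print(validate_arguments(weather_schema, {"city": "Beijing"}))
print(validate_arguments(weather_schema, {"unit": "kelvin"}))
```

When the list is non-empty, one option is to send the problems back to the model as a tool response and let it correct the call, rather than failing the whole turn.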

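Chapter 3: Real-World Use Cases

Scenario 1: Local Code Execution Assistant

User: "Write me a quicksort algorithm"
  ↓
[Hermes 3 calls the execute_python tool]
  ↓
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = [x for x in arr[1:] if x < pivot]
    right = [x for x in arr[1:] if x >= pivot]  # >= keeps duplicates of the pivot
    return quicksort(left) + [pivot] + quicksort(right)
  ↓
[The user gets the complete code plus the execution result]

Suggested deployment: a single machine with an RTX 4090 plus a vLLM inference service (with a model size that fits in its 24 GB of VRAM)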
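
The article does not show what the `execute_python` tool looks like, so here is one possible shape for it, purely as an assumption: run the generated snippet in a separate interpreter with a timeout and hand back the captured output. A subprocess is not a security sandbox, so a real deployment would want stronger isolation such as a container.

```python
import subprocess
import sys


def execute_python(code, timeout=5):
    """Run `code` in a separate interpreter and return its captured output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],  # same interpreter that runs this script
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


result = execute_python("print(sorted([3, 1, 2]))")
print(result["stdout"].strip())
```

Returning a dict with `ok`, `stdout`, and `stderr` makes the result easy to serialize into a tool response for the model.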

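Scenario 2: Multi-Step Workflow Agent

Example: automated financial report generation

Flow:
1. [Call fetch_financial_data] → fetch Q1 financial data
2. [Call analyze_trend] → analyze year-over-year growth
3. [Call fetch_industry_benchmark] → compare against the industry benchmark
4. [Call generate_report] → generate a PDF report
5. [Call send_email] → send it to management

Hermes 3 keeps the full workflow state within its 128K context,
so intermediate results are not lost partway through.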
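
To make the five-step flow concrete, the sketch below chains stub versions of the tools and keeps every intermediate result in one state dict, which is the information the model would otherwise have to carry in its context. The tool bodies and all numbers are placeholders invented for the example. In a real Agent the model decides which tool to call next, whereas here the order is fixed.

```python
def fetch_financial_data(state):
    return {"revenue": 120.0, "revenue_last_year": 100.0}  # placeholder figures


def analyze_trend(state):
    data = state["fetch_financial_data"]
    growth = (data["revenue"] - data["revenue_last_year"]) / data["revenue_last_year"]
    return {"yoy_growth": growth}


def fetch_industry_benchmark(state):
    return {"industry_yoy_growth": 0.1}  # placeholder figure


def generate_report(state):
    growth = state["analyze_trend"]["yoy_growth"]
    benchmark = state["fetch_industry_benchmark"]["industry_yoy_growth"]
    return f"Q1 growth {growth:.0%} vs industry {benchmark:.0%}"


def send_email(state):
    return {"sent": True, "body": state["generate_report"]}


STEPS = [
    "fetch_financial_data",
    "analyze_trend",
    "fetch_industry_benchmark",
    "generate_report",
    "send_email",
]
TOOLS = {
    f.__name__: f
    for f in (fetch_financial_data, analyze_trend, fetch_industry_benchmark, generate_report, send_email)
}


def run_workflow(steps, tools):
    """Run each named tool in order; every tool sees the results gathered so far."""
    state = {}
    for name in steps:
        state[name] = tools[name](state)
    return state


state = run_workflow(STEPS, TOOLS)
print(state["generate_report"])
```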

Scenario 3: Developer Toolchain

Integrate Hermes 3 as the backend in your code editor (VS Code / Cursor):

// .vscode/settings.json (illustrative keys; the exact names depend on the extension you use)
{
    "hermes.apiEndpoint": "http://localhost:8000/v1",
    "hermes.model": "NousResearch/Hermes-3-Llama-3.1-70B",
    "hermes.temperature": 0.7,
    "hermes.tools": [
        "git_commit",
        "run_tests",
        "refactor_code",
        "generate_docstring"
    ]
}

Advantages:

All code stays local and is never uploaded to the cloud
Real-time feedback with no network round trip
Customizable tool set

Chapter 4: Deployment Guide

Minimal Deployment: Hermes 3 on a Single RTX 4090

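# 1. Install dependencies
pip install vllm torch transformers

# 2. Start the vLLM OpenAI-compatible server
#    The 8B variant is used here: a 24 GB RTX 4090 cannot hold the 70B weights
#    (about 140 GB in FP16) on a single card. --max-model-len caps the context
#    so the KV cache fits in the remaining VRAM.
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --port 8000

# 3. Test the API (the model name must match the --model value above)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Hermes-3-Llama-3.1-8B",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7
  }'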
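
The same request can be made from Python with only the standard library. This is a minimal sketch mirroring the curl call above. `build_chat_request` and `send` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text, temperature=0.7):
    """Build (but do not send) a POST request for the /chat/completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the server running, `send(build_chat_request("http://localhost:8000/v1", served_model, "Hello"))` returns the reply, where `served_model` is whatever name the server was started with.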

Production Deployment: Multi-GPU + Load Balancing

# docker-compose.yml
version: '3'
services:
  hermes-inference:
    image: vllm/vllm-openai:latest
    # The image's entrypoint is the vLLM API server, so settings go in as CLI flags
    command: >
      --model NousResearch/Hermes-3-Llama-3.1-70B
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.95
    ipc: host  # shared memory for the tensor-parallel workers
    volumes:
      - ./models:/root/.cache/huggingface
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2  # 2 GPUs, matching --tensor-parallel-size
              capabilities: [gpu]

  api-gateway:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - hermes-inference
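
How many GPUs a given model needs can be estimated with simple arithmetic: parameter count times bytes per parameter, divided by the usable memory per card. The sketch below counts the weights only and ignores the KV cache and activations, so it gives a lower bound. The helper names and the 0.95 utilization default are assumptions chosen to match the settings above.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion, precision, gpu_gb, utilization=0.95):
    """Smallest number of cards whose usable memory covers the weights."""
    usable_per_gpu = gpu_gb * utilization
    return math.ceil(weight_memory_gb(params_billion, precision) / usable_per_gpu)


print(weight_memory_gb(70, "fp16"))   # 70B in FP16
print(weight_memory_gb(405, "fp16"))  # 405B in FP16
print(min_gpus(70, "fp16", gpu_gb=80))
print(min_gpus(70, "fp16", gpu_gb=24))
```

In practice the tensor-parallel size also has to divide the model's attention head count evenly, so the real figure is often the next power of two at or above this bound.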

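Cost Comparison

Option	Hardware cost	Monthly operating cost	Best for
Local 4090	¥15,000	¥500 (electricity)	Individual developers
Cloud A100 × 4	n/a	¥20,000/month	Mid-size companies
GPT-4 API	n/a	¥50,000+/month	Pay-as-you-go usage

Payback in about 3 months (if you choose local deployment)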
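
The payback period depends on which spend the local machine replaces, and the article does not state the baseline behind the 3-month figure. The sketch below makes the arithmetic explicit: hardware cost divided by monthly savings. Under the table's own numbers, replacing the ¥20,000/month cloud option pays back in under a month. A 3-month payback corresponds to replacing roughly ¥5,500/month, which is my back-calculation rather than a number from the source.

```python
def break_even_months(hardware_cost, local_monthly, replaced_monthly):
    """Months until the hardware pays for itself, or None if it never does."""
    savings = replaced_monthly - local_monthly
    if savings <= 0:
        return None  # local running costs are not lower, so there is no payback
    return hardware_cost / savings


# Figures from the table: ¥15,000 hardware, ¥500/month electricity.
print(break_even_months(15_000, 500, 5_500))   # assumed ¥5,500/month replaced spend
print(round(break_even_months(15_000, 500, 20_000), 2))  # replacing the cloud A100 option
```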

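Chapter 5: Why Choose Hermes 3?

vs Open-Source Competitors

Model	Function Calling	Long context	Inference speed	Community	Cost
Hermes 3	⭐⭐⭐⭐⭐	128K	⭐⭐⭐	⭐⭐⭐⭐	Free
Llama 3.1 Instruct	⭐⭐⭐	128K	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	Free
Mistral Large	⭐⭐⭐⭐	32K	⭐⭐⭐⭐	⭐⭐⭐	Free
Command R+	⭐⭐⭐	128K	⭐⭐	⭐⭐	Paid API
Qwen 2	⭐⭐⭐	128K	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	Free

What sets Hermes 3 apart:

🎯 A Function Calling specialist
🔐 Enterprise-grade privacy through local deployment
💰 No licensing or API fees to run
🧠 A balance of long memory and reasoning ability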

Chapter 6: Quick Start

Up and Running in 3 Minutes

# pip install vllm

from vllm import LLM, SamplingParams

# 1. Load the model
llm = LLM(model="NousResearch/Hermes-3-Llama-3.1-70B")

# 2. Define the sampling parameters
sampling_params = SamplingParams(temperature=0.7, max_tokens=500)

# 3. Run inference (ChatML prompt; the newline after "assistant" opens the reply)
prompts = [
    "<|im_start|>user\nWhat are you?<|im_end|>\n<|im_start|>assistant\n"
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
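
Writing ChatML strings by hand gets error-prone once there is more than one turn. A small helper like the one below can build the prompt from a message list. This is a sketch of the ChatML layout used in the prompt above, with `to_chatml` as a hypothetical name. In real projects the tokenizer's own chat template is the safer source of truth.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to start its reply
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are you?"},
])
print(prompt)
```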

Further Reading

Official GitHub: https://github.com/NousResearch/Hermes-Function-Calling
Technical report: https://arxiv.org/abs/2408.11857
Model page: https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-405B
Community discussion: https://discord.gg/nous-research

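Conclusion: The Open-Source Agent Era Is Here

Hermes 3 is evidence of an important trend:

Open-source LLMs can now rival paid APIs in Agent capability

🚀 Take action now:

If you have GPU resources, try deploying Hermes 3
If you are building Agents, try it as a replacement for the ChatGPT API
If you work in an enterprise environment, use it to keep your data private

Next issue: "Build Your First AI Agent Toolchain in 10 Minutes"

Follow "Tom's Tech Radar" for in-depth analysis of open-source AI developments 🔔

Author: tom | Tom's Tech Radar
Published: 2026-04-12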
