Ensuring the safety of large language models (LLMs) in vertical domains (Education, Finance, Management) is critical. While current alignment efforts primarily target explicit risks such as bias and violence, they often fail to address deeper, domain-specific implicit risks. We introduce a benchmark of 9,000 queries that categorizes risks into three tiers, Green (Guide), Yellow (Reflect), and Red (Deny), together with MENTOR, a framework that combines a Rule Evolution Cycle (REC) with Activation Steering (RV) to detect and mitigate these subtle risks.
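The three response tiers named above (Green/Guide, Yellow/Reflect, Red/Deny) can be read as a per-query handling policy. The sketch below shows one way such a schema might look in code; the field names, enum values, and the `expected_behavior` helper are illustrative assumptions, not the dataset's actual format.

```python
# Minimal sketch of the three-tier response policy described above.
# Schema and wording are assumptions for illustration only.
from dataclasses import dataclass
from enum import Enum

class RiskTier(Enum):
    GREEN = "guide"      # answer, but steer toward safe, constructive practice
    YELLOW = "reflect"   # surface the implicit risk before assisting
    RED = "deny"         # refuse and explain why

@dataclass
class BenchmarkQuery:
    text: str
    domain: str          # "Education" | "Finance" | "Management"
    tier: RiskTier

def expected_behavior(q: BenchmarkQuery) -> str:
    """Return the response policy an evaluator would check against (hypothetical)."""
    return {
        RiskTier.GREEN: "Provide guidance with safety framing.",
        RiskTier.YELLOW: "Prompt reflection on the implicit risk before assisting.",
        RiskTier.RED: "Refuse the request and state the violated rule.",
    }[q.tier]
```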
A domain-specific risk evaluation benchmark covering 9,000 queries.
Figure 1: The "Litmus Strip" framework (partial examples). The area below the dashed line illustrates specific implicit risks hidden deep within vertical domains such as Education, Finance, and Management, much as a litmus test reveals hidden chemical components.
Metacognition-Driven Self-Evolution for Implicit Risk Mitigation
Figure 2: The MENTOR Architecture.
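The Activation Steering (RV) component operates at inference time on the model's hidden activations. The paper's exact procedure is not reproduced here; the following is a minimal sketch of generic activation steering with a contrastively derived direction, assuming a Llama-style decoder and hand-picked settings (`MODEL`, `LAYER`, `ALPHA`, and the contrast prompts are all placeholders, not MENTOR's actual configuration).

```python
# Minimal sketch of inference-time activation steering (the general technique
# behind an "RV"-style component). All names below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed; any decoder-only model works
LAYER = 16                                   # assumed injection layer
ALPHA = 4.0                                  # assumed steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def mean_hidden(prompts: list[str]) -> torch.Tensor:
    """Mean last-token hidden state at index LAYER (index 0 is the embeddings)."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Contrast prompts that should be refused vs. safely guided (toy examples only).
refuse_prompts = ["Explain how to hide losses from auditors."]
guide_prompts  = ["Explain how auditors detect hidden losses."]
steer = mean_hidden(refuse_prompts) - mean_hidden(guide_prompts)
steer = steer / steer.norm()

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; hidden states are the first element.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("How can a teacher quietly favor certain students?", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so steering stays scoped to this call
```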
Main Benchmark Performance
Overall jailbreak rate (Original, +Rules, +Meta1, +Meta2), jailbreak rate by domain under the Original setting (Edu, Mgt, Fin), and Immunity Score.

| Model | Original | +Rules | +Meta1 | +Meta2 | Edu | Mgt | Fin | Immunity Score |
|---|---|---|---|---|---|---|---|---|
| GPT-5-2025-08-07* | 0.313 | 0.098 | 0.041 | 0.026 | 0.363 | 0.189 | 0.370 | 0.855 |
| Doubao-Seed-1.6 | 0.628 | 0.055 | 0.021 | 0.011 | 0.576 | 0.616 | 0.692 | 0.680 |
| Llama-4-Maverick | 0.752 | 0.227 | 0.131 | 0.088 | 0.696 | 0.716 | 0.844 | 0.581 |
| GPT-4o | 0.834 | 0.135 | 0.061 | 0.038 | 0.804 | 0.872 | 0.826 | 0.543 |
| Qwen3-235B | 0.437 | 0.061 | 0.030 | 0.019 | 0.492 | 0.300 | 0.518 | 0.771 |
| DeepSeek-R1-0528 | 0.625 | 0.070 | 0.035 | 0.021 | 0.522 | 0.672 | 0.682 | 0.659 |
| o3-high-2025-04 | 0.473 | 0.067 | 0.020 | 0.011 | 0.608 | 0.328 | 0.482 | 0.749 |
| Llama-3.1-8B | 0.661 | 0.268 | 0.172 | 0.131 | 0.658 | 0.724 | 0.600 | 0.617 |
| Mistral-large | 0.874 | 0.148 | 0.073 | 0.059 | 0.790 | 0.912 | 0.920 | 0.496 |
| Claude Sonnet 4 | 0.208 | 0.029 | 0.009 | 0.003 | 0.280 | 0.174 | 0.170 | 0.906 |
| Grok-4 | 0.631 | 0.073 | 0.029 | 0.016 | 0.810 | 0.596 | 0.486 | 0.659 |
| Gemini-2.5-Pro | 0.442 | 0.018 | 0.003 | 0.002 | 0.425 | 0.400 | 0.502 | 0.761 |
| Kimi-K2-Instruct | 0.331 | 0.015 | 0.005 | 0.003 | 0.426 | 0.220 | 0.346 | 0.831 |
* Note regarding GPT-5-2025-08-07: due to platform safety mechanisms and request interception, this model was evaluated on 1,302 of the 1,500 queries.
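For reference, the jailbreak rate in the table is most naturally read as the fraction of benchmark queries whose responses a safety judge marks unsafe. The helper below is a hedged sketch of that aggregation only; the actual judging pipeline and the Immunity Score formula are not specified here, so nothing beyond simple counting is assumed.

```python
# Sketch of aggregating judge labels into overall and per-domain jailbreak rates.
# The judging step itself and the Immunity Score are out of scope / unspecified.
from collections import defaultdict

def jailbreak_rate(records):
    """records: iterable of dicts with keys 'domain' and 'unsafe' (bool judge label)."""
    total, unsafe = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["domain"]] += 1
        unsafe[r["domain"]] += int(r["unsafe"])
    per_domain = {d: unsafe[d] / total[d] for d in total}
    overall = sum(unsafe.values()) / sum(total.values())
    return overall, per_domain

# Toy records for illustration only; not benchmark data.
demo = [
    {"domain": "Education", "unsafe": True},
    {"domain": "Finance", "unsafe": False},
    {"domain": "Management", "unsafe": False},
]
print(jailbreak_rate(demo))
```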