Benchmark

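Chinese Instruction-Following Benchmark Service for Large Language Models

星尘数据 (Stardust) offers a Benchmark service for large language models that focuses on comprehensive evaluation and optimization. The service includes, but is not limited to, model performance evaluation, data strategy services, custom datasets, and iteration cost control.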
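Features of the LLM Benchmark datasets
Closely aligned with model objectives
Full coverage of professional scenarios
High data-quality standards
Co-iteration of evaluation and algorithms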
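Custom Benchmark datasets
A service centered on data strategy design: quality is controlled from the planning stage onward and carried through the entire production process.
1. Data production plan
Strictly specify data content according to the model's objectives to ensure data diversity and accuracy.
2. Data incentive mechanism
Set up incentives for the data team and match tasks to specialist teams so that data scenarios are covered comprehensively.
3. Data strategy training
Work with the data team to refine the data production process and deepen their understanding of the data.
4. Data statistics report
Compile data distribution statistics and a summary of bad cases to improve the data production plan and reach the highest data-quality standard.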
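CIF-Bench
The world's first evaluation service that examines the instruction-following ability of large language models
Evaluation system
20 basic dimensions
150 task categories
15,000 evaluation data items
4 metric frameworks
Automated evaluation leaderboard
An evaluation of 28 selected LLMs reveals clear performance gaps and exposes the limits of LLM generalization in Chinese-language task settings.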
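| Model Name | Overall | Chinese Culture | Classification | Code | Commonsense | Creative NLG | Evaluation | Grammar | Linguistic | Motion Detection | NER | NLI | QA | Reasoning | Role Playing | Sentiment | Structured Data | Style Transfer | Summarization | Toxic | Translation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baichuan2-13B-Chat | 0.529 | 0.52 | 0.674 | 0.333 | 0.641 | 0.497 | 0.686 | 0.542 | 0.528 | 0.578 | 0.563 | 0.632 | 0.569 | 0.515 | 0.752 | 0.624 | 0.459 | 0.462 | 0.332 | 0.441 | 0.273 |
| Qwen-72B-Chat | 0.519 | 0.486 | 0.63 | 0.296 | 0.634 | 0.508 | 0.634 | 0.458 | 0.52 | 0.494 | 0.55 | 0.626 | 0.565 | 0.528 | 0.762 | 0.613 | 0.496 | 0.459 | 0.282 | 0.608 | 0.271 |
| Yi-34B-Chat | 0.512 | 0.483 | 0.606 | 0.347 | 0.623 | 0.497 | 0.598 | 0.48 | 0.49 | 0.575 | 0.525 | 0.619 | 0.554 | 0.494 | 0.757 | 0.58 | 0.472 | 0.439 | 0.346 | 0.514 | 0.259 |
| Qwen-14B-Chat | 0.5 | 0.481 | 0.582 | 0.307 | 0.614 | 0.494 | 0.645 | 0.428 | 0.475 | 0.496 | 0.513 | 0.616 | 0.548 | 0.507 | 0.764 | 0.583 | 0.469 | 0.453 | 0.283 | 0.575 | 0.262 |
| Deepseek-Llm-67B-Chat | 0.471 | 0.467 | 0.571 | 0.259 | 0.577 | 0.486 | 0.549 | 0.442 | 0.476 | 0.475 | 0.509 | 0.566 | 0.496 | 0.439 | 0.711 | 0.546 | 0.409 | 0.436 | 0.262 | 0.57 | 0.235 |
| Baichuan-13B-Chat | 0.45 | 0.408 | 0.491 | 0.286 | 0.552 | 0.439 | 0.67 | 0.417 | 0.422 | 0.482 | 0.486 | 0.565 | 0.505 | 0.377 | 0.704 | 0.552 | 0.387 | 0.402 | 0.35 | 0.431 | 0.304 |
| Chatglm3-6B | 0.436 | 0.381 | 0.439 | 0.33 | 0.541 | 0.452 | 0.577 | 0.31 | 0.358 | 0.436 | 0.453 | 0.544 | 0.503 | 0.414 | 0.762 | 0.56 | 0.446 | 0.402 | 0.321 | 0.391 | 0.27 |
| Yi-6B-Chat | 0.417 | 0.402 | 0.454 | 0.313 | 0.523 | 0.425 | 0.506 | 0.383 | 0.383 | 0.487 | 0.396 | 0.523 | 0.457 | 0.369 | 0.754 | 0.482 | 0.401 | 0.38 | 0.31 | 0.455 | 0.227 |
| Baichuan2-7B-Chat | 0.412 | 0.437 | 0.647 | 0.16 | 0.52 | 0.402 | 0.58 | 0.511 | 0.444 | 0.455 | 0.407 | 0.489 | 0.395 | 0.406 | 0.67 | 0.517 | 0.342 | 0.298 | 0.101 | 0.463 | 0.138 |
| Chatglm2-6B | 0.352 | 0.278 | 0.469 | 0.346 | 0.403 | 0.424 | 0.535 | 0.274 | 0.397 | 0.406 | 0.24 | 0.397 | 0.352 | 0.326 | 0.714 | 0.438 | 0.298 | 0.313 | 0.32 | 0.461 | 0.19 |
| Chatglm-6B-Sft | 0.349 | 0.265 | 0.454 | 0.365 | 0.385 | 0.462 | 0.554 | 0.296 | 0.379 | 0.427 | 0.232 | 0.38 | 0.321 | 0.292 | 0.718 | 0.415 | 0.296 | 0.333 | 0.351 | 0.441 | 0.19 |
| Chinese-Llama2-Linly-13B | 0.344 | 0.25 | 0.462 | 0.311 | 0.399 | 0.429 | 0.557 | 0.273 | 0.358 | 0.385 | 0.268 | 0.39 | 0.33 | 0.313 | 0.653 | 0.433 | 0.279 | 0.332 | 0.292 | 0.457 | 0.181 |
| Gpt-3.5-Turbo-Sft | 0.343 | 0.269 | 0.427 | 0.298 | 0.389 | 0.395 | 0.575 | 0.325 | 0.365 | 0.389 | 0.226 | 0.382 | 0.394 | 0.345 | 0.71 | 0.433 | 0.324 | 0.266 | 0.29 | 0.397 | 0.225 |
| Chinese-Alpaca-2-13B | 0.341 | 0.242 | 0.421 | 0.356 | 0.382 | 0.442 | 0.602 | 0.256 | 0.363 | 0.43 | 0.21 | 0.376 | 0.334 | 0.317 | 0.714 | 0.459 | 0.299 | 0.316 | 0.308 | 0.452 | 0.2 |
| Chinese-Alpaca-13B | 0.334 | 0.25 | 0.399 | 0.348 | 0.364 | 0.435 | 0.616 | 0.275 | 0.349 | 0.421 | 0.223 | 0.37 | 0.309 | 0.319 | 0.724 | 0.426 | 0.285 | 0.307 | 0.298 | 0.445 | 0.181 |
| Chinese-Alpaca-7B | 0.334 | 0.216 | 0.412 | 0.378 | 0.381 | 0.425 | 0.576 | 0.265 | 0.359 | 0.393 | 0.243 | 0.383 | 0.326 | 0.295 | 0.71 | 0.409 | 0.301 | 0.327 | 0.325 | 0.405 | 0.186 |
| Chinese-Llama2-Linly-7B | 0.333 | 0.218 | 0.451 | 0.33 | 0.396 | 0.427 | 0.583 | 0.248 | 0.35 | 0.41 | 0.231 | 0.367 | 0.345 | 0.276 | 0.698 | 0.433 | 0.259 | 0.315 | 0.31 | 0.469 | 0.168 |
| Tigerbot-13B-Chat | 0.331 | 0.205 | 0.397 | 0.309 | 0.385 | 0.42 | 0.614 | 0.31 | 0.379 | 0.341 | 0.276 | 0.363 | 0.329 | 0.301 | 0.694 | 0.419 | 0.28 | 0.31 | 0.283 | 0.393 | 0.186 |
| Telechat-7B | 0.329 | 0.267 | 0.338 | 0.321 | 0.42 | 0.404 | 0.42 | 0.272 | 0.265 | 0.327 | 0.32 | 0.388 | 0.355 | 0.244 | 0.672 | 0.344 | 0.334 | 0.335 | 0.299 | 0.364 | 0.184 |
| Ziya-Llama-13B | 0.329 | 0.196 | 0.402 | 0.324 | 0.341 | 0.428 | 0.616 | 0.312 | 0.349 | 0.4 | 0.228 | 0.351 | 0.279 | 0.313 | 0.721 | 0.468 | 0.311 | 0.291 | 0.278 | 0.431 | 0.175 |
| Chinese-Alpaca-33B | 0.326 | 0.234 | 0.37 | 0.372 | 0.364 | 0.429 | 0.614 | 0.246 | 0.318 | 0.377 | 0.221 | 0.368 | 0.3 | 0.314 | 0.713 | 0.428 | 0.288 | 0.303 | 0.295 | 0.401 | 0.199 |
| Tigerbot-7B-Chat | 0.325 | 0.218 | 0.395 | 0.306 | 0.37 | 0.413 | 0.631 | 0.294 | 0.37 | 0.368 | 0.215 | 0.355 | 0.313 | 0.292 | 0.713 | 0.415 | 0.283 | 0.315 | 0.29 | 0.389 | 0.171 |
| Chinese-Alpaca-2-7B | 0.323 | 0.215 | 0.374 | 0.335 | 0.366 | 0.415 | 0.546 | 0.257 | 0.326 | 0.395 | 0.215 | 0.375 | 0.318 | 0.289 | 0.698 | 0.417 | 0.285 | 0.303 | 0.312 | 0.439 | 0.193 |
| Aquilachat-7B | 0.309 | 0.162 | 0.234 | 0.291 | 0.32 | 0.437 | 0.344 | 0.135 | 0.266 | 0.309 | 0.287 | 0.337 | 0.342 | 0.236 | 0.609 | 0.255 | 0.249 | 0.4 | 0.527 | 0.43 | 0.306 |
| Moss-Moon-003-Sft | 0.302 | 0.214 | 0.405 | 0.274 | 0.347 | 0.38 | 0.448 | 0.305 | 0.341 | 0.378 | 0.232 | 0.317 | 0.321 | 0.267 | 0.694 | 0.375 | 0.251 | 0.259 | 0.288 | 0.424 | 0.152 |
| Qwen-7B-Chat | 0.301 | 0.211 | 0.41 | 0.289 | 0.349 | 0.391 | 0.531 | 0.219 | 0.387 | 0.404 | 0.208 | 0.325 | 0.297 | 0.278 | 0.681 | 0.419 | 0.266 | 0.251 | 0.248 | 0.371 | 0.157 |
| Belle-13B-Sft | 0.264 | 0.198 | 0.307 | 0.285 | 0.316 | 0.349 | 0.409 | 0.237 | 0.305 | 0.222 | 0.177 | 0.317 | 0.284 | 0.242 | 0.631 | 0.299 | 0.244 | 0.222 | 0.234 | 0.296 | 0.133 |
| Cpm-Bee-10B | 0.244 | 0.234 | 0.377 | 0.024 | 0.278 | 0.311 | 0.255 | 0.302 | 0.278 | 0.327 | 0.148 | 0.286 | 0.224 | 0.147 | 0.603 | 0.277 | 0.117 | 0.263 | 0.22 | 0.352 | 0.125 |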
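The Overall ranking can hide differences in individual categories. The minimal sketch below copies three rows and three columns from the leaderboard above and re-ranks the models by a chosen category. The dictionary layout and the `rank_by` helper are illustrative assumptions, not part of any CIF-Bench tooling.

```python
# Excerpt of the leaderboard above: three models, three of the score columns.
LEADERBOARD = {
    "Baichuan2-13B-Chat": {"Overall": 0.529, "Code": 0.333, "Role Playing": 0.752},
    "Qwen-72B-Chat": {"Overall": 0.519, "Code": 0.296, "Role Playing": 0.762},
    "Yi-34B-Chat": {"Overall": 0.512, "Code": 0.347, "Role Playing": 0.757},
}


def rank_by(category, board=LEADERBOARD):
    """Return model names sorted from highest to lowest score in one category."""
    return sorted(board, key=lambda name: board[name][category], reverse=True)


overall_order = rank_by("Overall")
code_order = rank_by("Code")
print(overall_order)
print(code_order)
```

In this excerpt Baichuan2-13B-Chat leads on Overall, while Yi-34B-Chat leads on Code.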
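CIF-Bench evaluation services
You can supply the API endpoint or model code to be evaluated, and 星尘 (Stardust) offers a range of evaluation services.
CIF-Bench Public dataset download
CIF-Bench contains 45,000 data instances in total. To reduce bias in the evaluation process, half of the dataset is released publicly for academic and commercial use, while the other half is kept private so that the evaluation stays fair and forward-looking.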
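The page does not say how the public and private halves were chosen. As an illustration only, the sketch below shows one common way to produce a reproducible half-and-half split: a seeded shuffle followed by a cut at the midpoint. The instance IDs, the seed, and the `split_public_private` function are all hypothetical.

```python
import random


def split_public_private(instance_ids, seed=0):
    """Shuffle the IDs with a fixed seed and cut the list in half."""
    ids = sorted(instance_ids)        # sort first so input order does not matter
    random.Random(seed).shuffle(ids)  # seeded shuffle keeps the split reproducible
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical instance IDs; the real CIF-Bench ID scheme is not given here.
all_ids = [f"instance-{n:05d}" for n in range(45000)]
public_ids, private_ids = split_public_private(all_ids)
print(len(public_ids), len(private_ids))
```

Sorting before the shuffle means the same set of IDs always yields the same two halves, regardless of the order in which they were loaded.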
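Automated evaluation service
You provide an evaluation API or the relevant model code, and 星尘 (Stardust) supplies the necessary test data and runs model inference and evaluation quickly, giving you immediate feedback on performance.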
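The flow described above (the user exposes a model, test data is fed through it, and a score comes back) can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `model_fn` stands in for the user's API, the case fields `instruction`, `input`, and `reference` are hypothetical, and exact match is a placeholder metric, since the page does not name the four metric frameworks CIF-Bench uses.

```python
from typing import Callable, Dict, List


def evaluate(model_fn: Callable[[str], str], cases: List[Dict[str, str]]) -> float:
    """Run the model on every case and return the exact-match accuracy."""
    if not cases:
        raise ValueError("no evaluation cases supplied")
    correct = 0
    for case in cases:
        # Build the prompt from the instruction and its input text.
        prompt = f"{case['instruction']}\n{case['input']}"
        prediction = model_fn(prompt).strip()
        if prediction == case["reference"].strip():
            correct += 1
    return correct / len(cases)


def echo_model(prompt: str) -> str:
    """Toy stand-in for a real model endpoint: returns the last prompt line."""
    return prompt.splitlines()[-1]


cases = [
    {"instruction": "Repeat the input.", "input": "你好", "reference": "你好"},
    {"instruction": "Repeat the input.", "input": "星尘", "reference": "星辰"},
]
score = evaluate(echo_model, cases)
print(score)
```

In practice `model_fn` would wrap an HTTP call to the submitted endpoint, and the scoring step would be replaced by whichever metric the evaluation actually specifies.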
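Human evaluation service
Working from the evaluation API, model, and code you submit, 星尘 (Stardust)'s expert team reviews the test results in detail and delivers a comprehensive evaluation report.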

Learn more

Enter your company email address to receive more detailed introductory materials and personalized purchasing advice.

Contact us

© 2024 北京星尘纪元智能科技有限公司. All rights reserved.