Benchmark

LLM instruction-following evaluation service

The service provides comprehensive evaluation and customized solutions for LLMs, offering model performance assessment, data strategy consultation, tailored dataset creation, and cost-effective iteration management.
Schedule a Call
LLM Benchmark Dataset Features
Model Targeting
Professional Scenario Coverage
High Data Quality Standards
Evaluation-Algorithm Collaborative Iteration
Benchmark with Custom Datasets
A data-strategy-centered service dedicated to ensuring top quality throughout the planning and production phases.
1. Data Production Plan
Strictly standardize data content based on model objectives to ensure data diversity and accuracy.
2. Data Incentive Mechanism
Establish an incentive mechanism for data teams, match tasks with professional teams, and cover comprehensive data scenarios.
3. Data Strategy Training
Collaborate with data teams, refine data production processes, and strengthen data understanding.
4. Data Statistics Report
Compile data distribution statistics and Bad Case summaries, optimize data production plans, and achieve the highest data quality standards.
CIF-Bench
The world's first LLM instruction-following evaluation service.
Evaluation System
20 Dimensions
150 Tasks
15,000 Evaluation Data Points
4-Metric Framework
Automated Evaluation Ranking
The evaluation of 28 selected LLMs revealed significant performance gaps, highlighting the limitations of current LLMs' generalization capabilities on Chinese-language tasks.
Model Name | Overall | Chinese Culture | Classification | Code | Commonsense | Creative NLG | Evaluation | Grammar | Linguistic | Motion Detection | NER | NLI | QA | Reasoning | Role Playing | Sentiment | Structured Data | Style Transfer | Summarization | Toxic | Translation
Baichuan2-13B-Chat | 0.529 | 0.52 | 0.674 | 0.333 | 0.641 | 0.497 | 0.686 | 0.542 | 0.528 | 0.578 | 0.563 | 0.632 | 0.569 | 0.515 | 0.752 | 0.624 | 0.459 | 0.462 | 0.332 | 0.441 | 0.273
Qwen-72B-Chat | 0.519 | 0.486 | 0.63 | 0.296 | 0.634 | 0.508 | 0.634 | 0.458 | 0.52 | 0.494 | 0.55 | 0.626 | 0.565 | 0.528 | 0.762 | 0.613 | 0.496 | 0.459 | 0.282 | 0.608 | 0.271
Yi-34B-Chat | 0.512 | 0.483 | 0.606 | 0.347 | 0.623 | 0.497 | 0.598 | 0.48 | 0.49 | 0.575 | 0.525 | 0.619 | 0.554 | 0.494 | 0.757 | 0.58 | 0.472 | 0.439 | 0.346 | 0.514 | 0.259
Qwen-14B-Chat | 0.5 | 0.481 | 0.582 | 0.307 | 0.614 | 0.494 | 0.645 | 0.428 | 0.475 | 0.496 | 0.513 | 0.616 | 0.548 | 0.507 | 0.764 | 0.583 | 0.469 | 0.453 | 0.283 | 0.575 | 0.262
Deepseek-Llm-67B-Chat | 0.471 | 0.467 | 0.571 | 0.259 | 0.577 | 0.486 | 0.549 | 0.442 | 0.476 | 0.475 | 0.509 | 0.566 | 0.496 | 0.439 | 0.711 | 0.546 | 0.409 | 0.436 | 0.262 | 0.57 | 0.235
Baichuan-13B-Chat | 0.45 | 0.408 | 0.491 | 0.286 | 0.552 | 0.439 | 0.67 | 0.417 | 0.422 | 0.482 | 0.486 | 0.565 | 0.505 | 0.377 | 0.704 | 0.552 | 0.387 | 0.402 | 0.35 | 0.431 | 0.304
Chatglm3-6B | 0.436 | 0.381 | 0.439 | 0.33 | 0.541 | 0.452 | 0.577 | 0.31 | 0.358 | 0.436 | 0.453 | 0.544 | 0.503 | 0.414 | 0.762 | 0.56 | 0.446 | 0.402 | 0.321 | 0.391 | 0.27
Yi-6B-Chat | 0.417 | 0.402 | 0.454 | 0.313 | 0.523 | 0.425 | 0.506 | 0.383 | 0.383 | 0.487 | 0.396 | 0.523 | 0.457 | 0.369 | 0.754 | 0.482 | 0.401 | 0.38 | 0.31 | 0.455 | 0.227
Baichuan2-7B-Chat | 0.412 | 0.437 | 0.647 | 0.16 | 0.52 | 0.402 | 0.58 | 0.511 | 0.444 | 0.455 | 0.407 | 0.489 | 0.395 | 0.406 | 0.67 | 0.517 | 0.342 | 0.298 | 0.101 | 0.463 | 0.138
Chatglm2-6B | 0.352 | 0.278 | 0.469 | 0.346 | 0.403 | 0.424 | 0.535 | 0.274 | 0.397 | 0.406 | 0.24 | 0.397 | 0.352 | 0.326 | 0.714 | 0.438 | 0.298 | 0.313 | 0.32 | 0.461 | 0.19
Chatglm-6B-Sft | 0.349 | 0.265 | 0.454 | 0.365 | 0.385 | 0.462 | 0.554 | 0.296 | 0.379 | 0.427 | 0.232 | 0.38 | 0.321 | 0.292 | 0.718 | 0.415 | 0.296 | 0.333 | 0.351 | 0.441 | 0.19
Chinese-Llama2-Linly-13B | 0.344 | 0.25 | 0.462 | 0.311 | 0.399 | 0.429 | 0.557 | 0.273 | 0.358 | 0.385 | 0.268 | 0.39 | 0.33 | 0.313 | 0.653 | 0.433 | 0.279 | 0.332 | 0.292 | 0.457 | 0.181
Gpt-3.5-Turbo-Sft | 0.343 | 0.269 | 0.427 | 0.298 | 0.389 | 0.395 | 0.575 | 0.325 | 0.365 | 0.389 | 0.226 | 0.382 | 0.394 | 0.345 | 0.71 | 0.433 | 0.324 | 0.266 | 0.29 | 0.397 | 0.225
Chinese-Alpaca-2-13B | 0.341 | 0.242 | 0.421 | 0.356 | 0.382 | 0.442 | 0.602 | 0.256 | 0.363 | 0.43 | 0.21 | 0.376 | 0.334 | 0.317 | 0.714 | 0.459 | 0.299 | 0.316 | 0.308 | 0.452 | 0.2
Chinese-Alpaca-13B | 0.334 | 0.25 | 0.399 | 0.348 | 0.364 | 0.435 | 0.616 | 0.275 | 0.349 | 0.421 | 0.223 | 0.37 | 0.309 | 0.319 | 0.724 | 0.426 | 0.285 | 0.307 | 0.298 | 0.445 | 0.181
Chinese-Alpaca-7B | 0.334 | 0.216 | 0.412 | 0.378 | 0.381 | 0.425 | 0.576 | 0.265 | 0.359 | 0.393 | 0.243 | 0.383 | 0.326 | 0.295 | 0.71 | 0.409 | 0.301 | 0.327 | 0.325 | 0.405 | 0.186
Chinese-Llama2-Linly-7B | 0.333 | 0.218 | 0.451 | 0.33 | 0.396 | 0.427 | 0.583 | 0.248 | 0.35 | 0.41 | 0.231 | 0.367 | 0.345 | 0.276 | 0.698 | 0.433 | 0.259 | 0.315 | 0.31 | 0.469 | 0.168
Tigerbot-13B-Chat | 0.331 | 0.205 | 0.397 | 0.309 | 0.385 | 0.42 | 0.614 | 0.31 | 0.379 | 0.341 | 0.276 | 0.363 | 0.329 | 0.301 | 0.694 | 0.419 | 0.28 | 0.31 | 0.283 | 0.393 | 0.186
Telechat-7B | 0.329 | 0.267 | 0.338 | 0.321 | 0.42 | 0.404 | 0.42 | 0.272 | 0.265 | 0.327 | 0.32 | 0.388 | 0.355 | 0.244 | 0.672 | 0.344 | 0.334 | 0.335 | 0.299 | 0.364 | 0.184
Ziya-Llama-13B | 0.329 | 0.196 | 0.402 | 0.324 | 0.341 | 0.428 | 0.616 | 0.312 | 0.349 | 0.4 | 0.228 | 0.351 | 0.279 | 0.313 | 0.721 | 0.468 | 0.311 | 0.291 | 0.278 | 0.431 | 0.175
Chinese-Alpaca-33B | 0.326 | 0.234 | 0.37 | 0.372 | 0.364 | 0.429 | 0.614 | 0.246 | 0.318 | 0.377 | 0.221 | 0.368 | 0.3 | 0.314 | 0.713 | 0.428 | 0.288 | 0.303 | 0.295 | 0.401 | 0.199
Tigerbot-7B-Chat | 0.325 | 0.218 | 0.395 | 0.306 | 0.37 | 0.413 | 0.631 | 0.294 | 0.37 | 0.368 | 0.215 | 0.355 | 0.313 | 0.292 | 0.713 | 0.415 | 0.283 | 0.315 | 0.29 | 0.389 | 0.171
Chinese-Alpaca-2-7B | 0.323 | 0.215 | 0.374 | 0.335 | 0.366 | 0.415 | 0.546 | 0.257 | 0.326 | 0.395 | 0.215 | 0.375 | 0.318 | 0.289 | 0.698 | 0.417 | 0.285 | 0.303 | 0.312 | 0.439 | 0.193
Aquilachat-7B | 0.309 | 0.162 | 0.234 | 0.291 | 0.32 | 0.437 | 0.344 | 0.135 | 0.266 | 0.309 | 0.287 | 0.337 | 0.342 | 0.236 | 0.609 | 0.255 | 0.249 | 0.4 | 0.527 | 0.43 | 0.306
Moss-Moon-003-Sft | 0.302 | 0.214 | 0.405 | 0.274 | 0.347 | 0.38 | 0.448 | 0.305 | 0.341 | 0.378 | 0.232 | 0.317 | 0.321 | 0.267 | 0.694 | 0.375 | 0.251 | 0.259 | 0.288 | 0.424 | 0.152
Qwen-7B-Chat | 0.301 | 0.211 | 0.41 | 0.289 | 0.349 | 0.391 | 0.531 | 0.219 | 0.387 | 0.404 | 0.208 | 0.325 | 0.297 | 0.278 | 0.681 | 0.419 | 0.266 | 0.251 | 0.248 | 0.371 | 0.157
Belle-13B-Sft | 0.264 | 0.198 | 0.307 | 0.285 | 0.316 | 0.349 | 0.409 | 0.237 | 0.305 | 0.222 | 0.177 | 0.317 | 0.284 | 0.242 | 0.631 | 0.299 | 0.244 | 0.222 | 0.234 | 0.296 | 0.133
Cpm-Bee-10B | 0.244 | 0.234 | 0.377 | 0.024 | 0.278 | 0.311 | 0.255 | 0.302 | 0.278 | 0.327 | 0.148 | 0.286 | 0.224 | 0.147 | 0.603 | 0.277 | 0.117 | 0.263 | 0.22 | 0.352 | 0.125
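For readers who want to work with the leaderboard programmatically, the sketch below shows one way to rank models by overall score and surface each model's weakest dimension. The scores in the excerpt are copied from the table above; the data structure, function names, and printed output format are illustrative assumptions, not part of any CIF-Bench tooling.

```python
# Illustrative only: rank a small excerpt of the leaderboard above and report
# each model's weakest dimension. Scores are copied from the table; the helper
# names below are hypothetical and not part of any official CIF-Bench package.
LEADERBOARD = {
    # model name: (overall score, {dimension: score, ...}) -- excerpt only
    "Baichuan2-13B-Chat": (0.529, {"Chinese Culture": 0.52, "Classification": 0.674,
                                   "Code": 0.333, "Commonsense": 0.641}),
    "Qwen-72B-Chat":      (0.519, {"Chinese Culture": 0.486, "Classification": 0.63,
                                   "Code": 0.296, "Commonsense": 0.634}),
    "Yi-34B-Chat":        (0.512, {"Chinese Culture": 0.483, "Classification": 0.606,
                                   "Code": 0.347, "Commonsense": 0.623}),
}

def rank_models(board):
    """Return model names sorted by overall score, best first."""
    return sorted(board, key=lambda name: board[name][0], reverse=True)

def weakest_dimension(board, name):
    """Return the (dimension, score) pair with the lowest score for a model."""
    scores = board[name][1]
    dim = min(scores, key=scores.get)
    return dim, scores[dim]

if __name__ == "__main__":
    for name in rank_models(LEADERBOARD):
        overall = LEADERBOARD[name][0]
        dim, score = weakest_dimension(LEADERBOARD, name)
        print(f"{name}: overall {overall:.3f}, weakest dimension {dim} ({score:.3f})")
```

In the excerpt above, Code is the weakest of the four sampled dimensions for all three models, which mirrors the broader pattern of uneven per-dimension performance shown in the full table.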
CIF-Bench Evaluation Service
We provide highly customizable evaluation services that integrate seamlessly with your API or model code.
Unlimited Access to CIF-Bench Public Dataset
Comprising 45,000 data instances, the Chinese Instruction-Following Benchmark (CIF-Bench) enables the development of more adaptable, culturally aware, and linguistically diverse language models.
Book a Demo
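As an illustration of how a public instruction-following dataset like this is typically consumed, the sketch below reads a JSONL file in which each line holds an instruction, an optional input, and a reference output. The file name and field names are assumptions made for the example; consult the actual CIF-Bench release for its schema.

```python
import json
from pathlib import Path

# Minimal sketch of loading an instruction-following dataset stored as JSONL.
# The path "cif_bench_public.jsonl" and the fields "instruction", "input",
# and "output" are assumptions for illustration; the real CIF-Bench release
# may use a different layout.
def load_instances(path):
    """Yield one dict per JSONL line, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def build_prompt(instance):
    """Join the instruction and optional input into a single model prompt."""
    prompt = instance["instruction"]
    if instance.get("input"):
        prompt += "\n\n" + instance["input"]
    return prompt

if __name__ == "__main__":
    instances = list(load_instances("cif_bench_public.jsonl"))
    print(f"Loaded {len(instances)} instances")
    print(build_prompt(instances[0]))
```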
Automated Evaluation Service
By providing an API or model code, users can quickly and easily receive performance feedback on their models through inference evaluation on our comprehensive dataset.
Book a Demo
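To make the "provide an API, get feedback" flow concrete, here is a hedged sketch of the client side: a user-supplied generate function is run over evaluation examples and the responses are scored. The function names, the sample prompts, and the exact-match metric are placeholders for illustration; the actual service defines its own interfaces and its own metrics framework.

```python
# Hedged sketch of an automated-evaluation loop: run a user-supplied model
# callable over benchmark examples and score the responses. "my_model_api",
# the exact-match metric, and the sample data are illustrative placeholders,
# not the service's real interface.
from typing import Callable, Iterable

def my_model_api(prompt: str) -> str:
    """Stand-in for the user's model endpoint; replace with a real API call."""
    return "placeholder response"

def evaluate(model: Callable[[str], str], examples: Iterable[dict]) -> float:
    """Return the fraction of examples whose response matches the reference."""
    total, correct = 0, 0
    for ex in examples:
        response = model(ex["prompt"]).strip()
        correct += int(response == ex["reference"].strip())
        total += 1
    return correct / max(total, 1)

if __name__ == "__main__":
    sample = [
        {"prompt": "把“你好”翻译成英文。", "reference": "Hello"},
        {"prompt": "1 + 1 等于几？", "reference": "2"},
    ]
    score = evaluate(my_model_api, sample)
    print(f"Exact-match accuracy: {score:.2f}")
```

Exact match is used here only to keep the sketch short; an instruction-following benchmark would normally combine several metrics, in line with the multi-metric framework described above.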
Manual Evaluation Service
Leveraging user-submitted interfaces, models, and code, our expert team conducts detailed tests to deliver a comprehensive report for you.
Book a Demo

Explore More

Fill out the form to schedule a personalized demo with our team. Experience firsthand how our innovative solutions can meet your needs and drive success.

Book a Demo

Copyright © 2024 StardustAI Inc. All rights reserved.