Kaggle launches Community Benchmarks to compare AI models on real tasks
New feature lets developers build auditable, task-based benchmarks to evaluate model performance across systems.
Kaggle said it is launching Community Benchmarks, a new feature that allows developers to create task-based evaluations and combine them into benchmarks to compare artificial intelligence model outputs across different systems.
The machine learning platform said the feature is designed to provide a transparent way to validate specific use cases and to bridge the gap between experimental code and production-ready applications.
Rather than relying on abstract scores, Community Benchmarks focuses on concrete tasks that reflect how models are used in practice.
Tasks can be designed to test capabilities such as multi-step reasoning, code generation, tool use and image recognition, Kaggle said. Once tasks are assembled into a benchmark, the system runs them across selected models and produces leaderboards that rank performance.
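Kaggle's announcement does not spell out the SDK interface, so the snippet below is only a conceptual Python sketch of what a "task" in this sense amounts to: a concrete prompt paired with an explicit scoring rule, rather than an abstract aggregate score. The `Task` structure, field names and checker functions here are hypothetical illustrations, not the actual kaggle-benchmarks API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical structure for illustration only; not the kaggle-benchmarks API.
@dataclass
class Task:
    name: str
    prompt: str
    score: Callable[[str], float]  # maps a model's raw text output to a 0-1 score

# Concrete, auditable tasks: the exact prompt and scoring rule are recorded.
code_task = Task(
    name="fizzbuzz-generation",
    prompt="Write a Python function fizzbuzz(n) returning the FizzBuzz value for n.",
    score=lambda output: float("def fizzbuzz" in output),  # crude keyword check
)

reasoning_task = Task(
    name="two-step-arithmetic",
    prompt="A train leaves at 09:10 and travels 150 km at 60 km/h. When does it arrive?",
    score=lambda output: float("11:40" in output),
)
```

Because the prompt and the scoring rule are both stored alongside the model's output, anyone can rerun or inspect an individual task rather than trusting a single headline number.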
Kaggle added that benchmarks capture exact outputs and model interactions, allowing results to be audited and independently verified. This approach is intended to increase trust in benchmark outcomes and make comparisons between models more reproducible.
Users will have free access, within usage quotas, to state-of-the-art models from research labs including Google, Anthropic and DeepSeek, the company said. The platform supports multi-modal inputs, code execution, tool use and multi-turn conversations.
Community Benchmarks is powered by the new kaggle-benchmarks software development kit. Kaggle said it is also providing supporting resources, including a Benchmarks Cookbook, example tasks and step-by-step guides on creating a first task and assembling a benchmark.
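The cookbook and guides are the authoritative reference for the real SDK calls. As a rough, hypothetical continuation of the sketch above, assembling tasks into a benchmark and producing a leaderboard could look like the following; the `run_benchmark` helper and the model callables are invented for illustration and are not part of kaggle-benchmarks.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

def run_benchmark(
    tasks: List[Task],
    models: Dict[str, Callable[[str], str]],  # model name -> prompt-to-output function
) -> List[Tuple[str, float, Dict[str, str]]]:
    """Run every task against every model and rank models by mean task score.

    Hypothetical helper for illustration; in the real product, the
    kaggle-benchmarks SDK and Kaggle's hosted models handle execution,
    output capture and leaderboard generation.
    """
    leaderboard = []
    for name, generate in models.items():
        # Keep the exact outputs so results can be audited and re-verified later.
        outputs = {task.name: generate(task.prompt) for task in tasks}
        scores = [task.score(outputs[task.name]) for task in tasks]
        leaderboard.append((name, mean(scores), outputs))
    # Highest mean score first, as on a leaderboard.
    return sorted(leaderboard, key=lambda row: row[1], reverse=True)
```

The ranking metric here (mean task score) is a stand-in; in practice each benchmark's own tasks define what is measured.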
The company invited developers and researchers to begin building with Community Benchmarks immediately, positioning the feature as a community-driven way to evaluate AI systems against real-world requirements rather than generic tests.
The Recap
- Kaggle launched Community Benchmarks for evaluating AI model use cases.
- Free access to models from several labs is available within quota limits.
- Users can create tasks and group them into benchmarks today.