San Francisco-based AI startup Arthur has unveiled its latest offering, Arthur Bench, an open-source tool designed to evaluate the performance of large language models (LLMs) such as OpenAI’s GPT-3.5 Turbo and Meta’s LLaMA 2. The tool lets companies assess different language models against their specific use cases and provides metrics for comparing models on accuracy, readability, hedging, and other criteria.
Hedging is a particularly important issue for enterprises that rely on LLMs day to day. It refers to instances where an LLM pads its answer with extraneous language about its terms of service or programming constraints, language that is usually irrelevant to what the user actually asked. Arthur Bench aims to surface these subtle behavioral differences, which may matter more in some applications than in others.
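To make the idea concrete, hedging can be treated as a measurable property of a response. The sketch below is illustrative only, not Arthur Bench's actual API: it scores a response by how many known boilerplate phrases it contains, and the marker list is an assumption chosen for the example.

```python
# Illustrative sketch (not Arthur Bench's real scorer): flag "hedging"
# boilerplate, i.e. extraneous language about the model's constraints.

# Assumed marker list for this example; a real scorer would be richer.
HEDGING_MARKERS = [
    "as an ai language model",
    "i cannot provide",
    "my programming does not allow",
]


def hedging_score(response: str) -> float:
    """Return the fraction of hedging markers present in the response.

    0.0 means no boilerplate detected; higher values mean more hedging.
    """
    text = response.lower()
    hits = sum(marker in text for marker in HEDGING_MARKERS)
    return hits / len(HEDGING_MARKERS)
```

A direct factual answer scores 0.0, while a reply opening with "As an AI language model, I cannot provide…" scores well above it, which is the kind of behavioral difference the article describes.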
Arthur Bench ships with starter criteria for comparing LLM performance, and enterprises can add their own criteria to suit their needs. For example, a company can take its users’ last 100 questions, run them against all candidate models, and have Arthur Bench highlight the areas where answers differed significantly, enabling manual review.
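That workflow can be sketched in a few lines. This is a hypothetical illustration, not Arthur Bench's actual interface: the model callables stand in for whatever client each provider requires, and simple string similarity stands in for the tool's richer metrics.

```python
# Hypothetical sketch of the workflow above: run the same prompts through
# several candidate models and surface prompts whose answers diverge,
# so a human can review them. Names and thresholds are assumptions.

from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


def divergent_prompts(
    prompts: list[str],
    models: dict[str, Callable[[str], str]],
    threshold: float = 0.5,
) -> list[str]:
    """Return prompts where any pair of model answers is less similar
    than `threshold` (0 = completely different, 1 = identical)."""
    flagged = []
    for prompt in prompts:
        answers = [ask(prompt) for ask in models.values()]
        if any(
            SequenceMatcher(None, a, b).ratio() < threshold
            for a, b in combinations(answers, 2)
        ):
            flagged.append(prompt)
    return flagged
```

Prompts where every model agrees pass silently; a prompt where one model refuses while another answers gets flagged for manual review, mirroring the article's 100-question example.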
The primary objective of Arthur Bench is to help businesses make informed decisions when adopting AI. The tool streamlines benchmarking and translates academic measures into tangible real-world business benefits. By combining statistical measures and scores with input from other LLMs acting as judges, Arthur grades the responses of candidate LLMs side by side.
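One simple way to combine such signals is a weighted average, sketched below. The weights and the two score sources are assumptions for illustration; the article does not specify Arthur Bench's actual aggregation scheme.

```python
# Illustrative only: blend a statistical metric (e.g. a similarity score)
# with an LLM-judge score into one grade per model, then rank models
# side by side. Weights are an assumption, not Arthur's actual scheme.

def combined_grade(statistical: float, llm_judge: float, w_stat: float = 0.5) -> float:
    """Weighted average of two scores, each assumed to lie in [0, 1]."""
    return w_stat * statistical + (1.0 - w_stat) * llm_judge


# Hypothetical per-model scores for a side-by-side comparison.
scores = {
    "model_a": combined_grade(0.9, 0.8),
    "model_b": combined_grade(0.7, 0.95),
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With equal weights, model_a's stronger statistical score outweighs model_b's higher judge score; shifting `w_stat` lets a team decide which signal matters more for their use case.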
Arthur Bench has already attracted interest across industry sectors. Financial services firms are leveraging the tool to generate investment theses and accelerate analysis. Vehicle manufacturers are tapping into Arthur Bench to create LLMs capable of answering customer queries accurately and promptly while reducing hallucinations. In addition, enterprise media and publishing platform Axios HQ is using the tool to build an internal framework for evaluating LLMs and communicating their performance to its product team.
Arthur is making Bench available as an open-source tool, enabling anyone to use and contribute to it for free. While the startup believes an open-source approach is the way to build the best products, it still sees opportunities to monetize through team dashboards. Additionally, Arthur has announced a hackathon with Amazon Web Services (AWS) and Cohere to encourage developers to build new metrics for Arthur Bench. Arthur Bench’s alignment with AWS’s Bedrock environment, which helps users select and deploy various LLMs, is expected to further extend the tool’s capabilities and reach.
This marks a continuation of Arthur’s efforts in the AI space, following the launch of Arthur Shield earlier this year. Arthur Shield focuses on monitoring large language models for hallucinations and other potential issues. With the introduction of Arthur Bench, the startup aims to provide enterprises with the necessary tools to leverage language models effectively and make informed decisions regarding AI adoption.