San Francisco-based AI startup Arthur has unveiled its latest offering, Arthur Bench, an open-source tool designed to evaluate the performance of large language models (LLMs) such as OpenAI’s GPT-3.5 Turbo and Meta’s LLaMA 2. The tool lets companies assess different language models against their specific use cases and provides metrics for comparing models on accuracy, readability, hedging, and other criteria.
Hedging is a particularly important issue for enterprises that rely on LLMs day to day. It refers to instances where an LLM pads its answer with extraneous language about its terms of service or programming constraints, language that is usually irrelevant to what the user actually asked. Arthur Bench aims to surface these subtle behavioral differences, which may matter more in some applications than in others.
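To make the idea concrete, hedging can be treated as a measurable property of a response. The sketch below is illustrative only, not Arthur Bench's actual API: it scores a response by how many known boilerplate phrases it contains, and the marker list is an assumption chosen for the example.

```python
# Illustrative sketch (not Arthur Bench's real scorer): flag "hedging"
# boilerplate, i.e. extraneous language about the model's constraints.

# Assumed marker list for this example; a real scorer would be richer.
HEDGING_MARKERS = [
    "as an ai language model",
    "i cannot provide",
    "my programming does not allow",
]


def hedging_score(response: str) -> float:
    """Return the fraction of hedging markers present in the response.

    0.0 means no boilerplate detected; higher values mean more hedging.
    """
    text = response.lower()
    hits = sum(marker in text for marker in HEDGING_MARKERS)
    return hits / len(HEDGING_MARKERS)
```

A direct factual answer scores 0.0, while a reply opening with "As an AI language model, I cannot provide…" scores well above it, which is the kind of behavioral difference the article describes.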
Arthur Bench ships with starter criteria for comparing LLM performance, and enterprises can add their own criteria to suit their needs. For example, a company can take its users’ last 100 questions, run them against all candidate models, and have Arthur Bench highlight the areas where answers differed significantly, enabling manual review.
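That workflow can be sketched in a few lines. This is a hypothetical illustration, not Arthur Bench's actual interface: the model callables stand in for whatever client each provider requires, and simple string similarity stands in for the tool's richer metrics.

```python
# Hypothetical sketch of the workflow above: run the same prompts through
# several candidate models and surface prompts whose answers diverge,
# so a human can review them. Names and thresholds are assumptions.

from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


def divergent_prompts(
    prompts: list[str],
    models: dict[str, Callable[[str], str]],
    threshold: float = 0.5,
) -> list[str]:
    """Return prompts where any pair of model answers is less similar
    than `threshold` (0 = completely different, 1 = identical)."""
    flagged = []
    for prompt in prompts:
        answers = [ask(prompt) for ask in models.values()]
        if any(
            SequenceMatcher(None, a, b).ratio() < threshold
            for a, b in combinations(answers, 2)
        ):
            flagged.append(prompt)
    return flagged
```

Prompts where every model agrees pass silently; a prompt where one model refuses while another answers gets flagged for manual review, mirroring the article's 100-question example.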
The primary objective of Arthur Bench is to help businesses make informed decisions when adopting AI. The tool streamlines benchmarking and translates academic measures into tangible real-world business benefits. By combining statistical measures and scores with input from other LLMs acting as judges, Arthur grades the responses of candidate LLMs side by side.
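One simple way to combine such signals is a weighted average, sketched below. The weights and the two score sources are assumptions for illustration; the article does not specify Arthur Bench's actual aggregation scheme.

```python
# Illustrative only: blend a statistical metric (e.g. a similarity score)
# with an LLM-judge score into one grade per model, then rank models
# side by side. Weights are an assumption, not Arthur's actual scheme.

def combined_grade(statistical: float, llm_judge: float, w_stat: float = 0.5) -> float:
    """Weighted average of two scores, each assumed to lie in [0, 1]."""
    return w_stat * statistical + (1.0 - w_stat) * llm_judge


# Hypothetical per-model scores for a side-by-side comparison.
scores = {
    "model_a": combined_grade(0.9, 0.8),
    "model_b": combined_grade(0.7, 0.95),
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With equal weights, model_a's stronger statistical score outweighs model_b's higher judge score; shifting `w_stat` lets a team decide which signal matters more for their use case.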
Arthur Bench has already attracted interest across industry sectors. Financial services firms are leveraging the tool to generate investment theses and accelerate analysis. Vehicle manufacturers are tapping into Arthur Bench to create LLMs capable of answering customer queries accurately and promptly while reducing hallucinations. In addition, enterprise media and publishing platform Axios HQ is using the tool to build an internal framework for evaluating LLMs and communicating their performance to its product team.
Arthur is making Bench available as an open-source tool, enabling anyone to use and contribute to it for free. While the startup believes an open-source approach is the way to build the best products, it still sees opportunities to monetize through team dashboards. Additionally, Arthur has announced a hackathon with Amazon Web Services (AWS) and Cohere to encourage developers to build new metrics for Arthur Bench. Arthur Bench’s alignment with AWS’s Bedrock environment, which helps users select and deploy various LLMs, is expected to further extend the tool’s capabilities and reach.
This marks a continuation of Arthur’s efforts in the AI space, following the launch of Arthur Shield earlier this year. Arthur Shield focuses on monitoring large language models for hallucinations and other potential issues. With the introduction of Arthur Bench, the startup aims to provide enterprises with the necessary tools to leverage language models effectively and make informed decisions regarding AI adoption.