Gretel releases world’s largest open source text-to-SQL dataset, empowering businesses to unlock AI’s potential

Gretel, a trailblazer in the synthetic data industry, has taken a major step toward democratizing access to high-quality AI training data. The company announced on Thursday the release of what it describes as the world’s largest open-source Text-to-SQL dataset, a move poised to accelerate AI model training and unlock new possibilities for businesses across the globe.

The dataset, comprising over 100,000 synthetic Text-to-SQL samples spanning 100 verticals, is now available on Hugging Face under the Apache 2.0 license. The release aims to give developers the tools they need to build AI models that translate natural language questions into SQL queries, effectively bridging the gap between business users and complex data sources.

“Access to quality training data is one of the biggest obstacles to building with generative AI,” emphasized Yev Meyer, Chief Scientist at Gretel, in an interview with VentureBeat. “High-quality synthetic data can fill this gap. One of the most notable recent shifts in the world of Large Language Models (LLMs) and AI is the renewed focus on data quality.”

Addressing the data quality challenge

Gretel’s groundbreaking dataset was generated using Gretel Navigator, a sophisticated compound AI system currently in public preview. “Our open source Text-to-SQL dataset was generated by Gretel Navigator, our compound AI system that integrates agent-based execution, multiple proprietary models, including a custom tabular Large Language Model, and privacy-enhancing technologies to generate high quality synthetic data from scratch, on demand,” explained Meyer.

The implications of this release are far-reaching, as businesses across industries grapple with the challenges of accessing and leveraging the wealth of data buried in complex databases, data warehouses, and data lakes. Gretel’s dataset not only provides a solution to this problem but also includes an explanation field that offers plain-English descriptions of the SQL code, making it easier for end-users to understand and extract value from the output.
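To make the idea of a Text-to-SQL sample concrete, here is a minimal sketch of what one record with an explanation field might look like, and how its SQL could be checked by running it against an in-memory SQLite database. The field names (`prompt`, `context`, `sql`, `explanation`), the table, and the values are hypothetical and chosen for illustration; they are not copied from Gretel’s dataset.

```python
import sqlite3

# Hypothetical record: field names and contents are illustrative assumptions,
# not taken from the published dataset.
sample = {
    "prompt": "What is the total revenue per region?",
    "context": (
        "CREATE TABLE sales (region TEXT, revenue REAL);"
        "INSERT INTO sales VALUES ('East', 100.0), ('West', 250.0), ('East', 50.0);"
    ),
    "sql": (
        "SELECT region, SUM(revenue) AS total "
        "FROM sales GROUP BY region ORDER BY region;"
    ),
    "explanation": "Groups the sales rows by region and sums revenue for each group.",
}


def run_sample(record):
    """Build the schema from the record's context, then execute its SQL query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(record["context"])  # create and populate the table
        return conn.execute(record["sql"]).fetchall()
    finally:
        conn.close()


rows = run_sample(sample)
print(sample["explanation"])
print(rows)
```

Executing the query this way only confirms that the SQL is valid and returns rows for the given schema; it does not by itself show that the query answers the natural language question.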

Rigorous quality validation and broad industry applications

Gretel’s commitment to data quality is evident in its validation processes. “Every dataset we generate is assessed for quality. Quality benchmarking is central to what we do,” said Meyer. The company’s Text-to-SQL dataset outperformed the b-mc2/sql-create-context dataset on compliance with SQL standards, correctness, and adherence to instructions when evaluated using an independent service and the LLM-as-a-judge technique.

Gretel’s synthetic Text-to-SQL dataset outperforms the b-mc2/sql-create-context dataset across various grading criteria, including compliance with SQL standards (+54.6%), SQL correctness (+34.5%), and adherence to instructions (+8.5%), as evaluated by an independent LLM-as-a-judge technique. Credit: Gretel

The potential applications of Gretel’s dataset are vast, spanning industries from finance and healthcare to government. Financial analysts can now ask questions about a company’s performance and receive instant answers sourced from databases, while healthcare providers can streamline the analysis of clinical trial data from multiple experiments. Government leaders can also leverage the dataset to provide citizens with easy access to public records databases, such as licenses, property ownership, and permits.

Balancing data privacy and accessibility

As enterprises increasingly recognize the importance of data-centric AI, Gretel’s ability to generate massive amounts of high-quality synthetic data positions it as a key player in the industry. “Gretel solutions are built with enterprise scale in mind so that customers can satisfy their data needs when creating data from scratch or editing and augmenting existing data,” Meyer told VentureBeat.

Gretel also emphasizes privacy, employing techniques like differential privacy to ensure that sensitive information remains protected while still enabling models to learn from the data. This commitment to balancing accuracy and privacy sets Gretel apart in an industry where data security is of utmost importance.
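For readers unfamiliar with differential privacy, the sketch below shows one textbook form of it: the Laplace mechanism applied to a counting query. It is a generic illustration, not a description of how Gretel implements privacy; the function names, the `ages` data, and the choice of epsilon are all assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy the predicate.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1 / epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only to make the example reproducible
ages = [23, 35, 41, 58, 62, 29, 47]  # hypothetical sensitive values
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of people aged 40+: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy, which is the trade-off between accuracy and privacy that the article refers to.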

The release of Gretel’s Text-to-SQL dataset marks a significant milestone in the company’s mission to accelerate the adoption of data-centric AI and empower businesses to unlock the full potential of their data. With its focus on quality, privacy, and accessibility, Gretel is well-positioned to lead the charge in the synthetic data revolution.

As the AI landscape continues to evolve at a breakneck pace, Gretel’s groundbreaking contribution to the open-source community serves as a testament to its commitment to driving innovation and democratizing access to high-quality training data. The ripple effects of this release are likely to be felt across industries, as businesses harness the power of AI to gain a competitive edge and drive growth in an increasingly data-driven world.