Self-service, real-time reporting has long been a dream for many businesses. Freed from distributed reports or fixed dashboards, the goal is to empower the organization to get more from one of its greatest, underused assets: its data.
Data visualization and business intelligence tools at every tier have tried to enable it, with varying degrees of success. Most requests still go through IT, and the data experts typically remain overburdened.
With the spreading implementation of AI, this dream is renewed. GenAI’s capacity to write code and translate between systems has seen immediate application in this arena, with the aim of bridging a technical gap: querying data with natural language.
Have a question? Imagine being able to ask the data directly, across databases, SaaS systems, cloud apps. Ask for refinement, focus, drill-down. Comparisons, new chart types, or even predictions. Get the facts, or even some analysis, that you need now, in real time, without having to lodge requests to IT.
If GenAI can create images, music, and video, and write and implement simple applications to meet your needs in minutes, why can’t it empower our access to data?
In today’s PTP Report, we’ll look at the progress towards this goal, the tools being used to deliver it, and where it is succeeding and still falling short.
Just ask your data your questions, directly: this is the self-service dream, renewed
Natural Language Data Querying: How, Why, and the Challenging Bits
With AI systems currently making huge gains in software engineering, this looks like a simple enough parallel.
If a stakeholder has sufficient privileges, let them talk to the AI system. It will listen, translating natural language prompts into SQL statements mapped to the right schema.
Run the query, format, and return the data. Even analyze it next and put observations back into natural language.
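In code, the happy path is short. Here is a minimal sketch in Python, assuming SQLite and an invented sales schema; call_llm() is a placeholder for whatever model API you use:

```python
import sqlite3

# Invented schema for illustration; real enterprise schemas are far messier.
SCHEMA = """CREATE TABLE sales (id INTEGER, region TEXT,
            product_line TEXT, quarter TEXT, revenue REAL);"""

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your model of choice, return its reply."""
    raise NotImplementedError

def ask(conn: sqlite3.Connection, question: str) -> list:
    # 1. Translate the natural language question into SQL, grounded in the schema.
    sql = call_llm(
        f"Given this schema:\n{SCHEMA}\n"
        f"Write one SQLite SELECT statement answering: {question}"
    )
    # 2. Run the query; the returned rows can then be formatted or summarized.
    return conn.execute(sql).fetchall()

# ask(sqlite3.connect("sales.db"), "Q1 revenue by product line, by region?")
```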
Of course, this isn’t all as simple as it sounds. Enterprise data is almost invariably complicated and unique, with structured and unstructured sources, spanning databases, often using unique naming conventions and varied means of connection.
To make this seemingly straightforward goal a reality, there are several hurdles to overcome first.
Going from Natural Language to SQL
Let’s say we’re starting with a user question, such as, for sales: “Can I see the Q1 numbers for a given product line, in a given region?”
Or maybe you’re sharing a data point in a meeting when, in further discussion, you want to slice it in a way that hasn’t been considered, or explore a parallel question.
For marketing, maybe you want to get the site traffic and behavior of a key demographic, in response to a campaign. Or maybe it’s internal, looking for a breakdown of headcount and turnover in a certain area.
From these kinds of text prompts (or spoken asks, with multi-modal AI inputs), the system then needs to parse out the question for intent.
Once it has this, it must bridge the gap to the schema—where is the data stored, what are the table and column names, necessary joins and filtering, for example.
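Because enterprise schemas can run to hundreds of tables, many systems first narrow the prompt to the few that matter. Here is a toy version of that schema-linking step, with invented table names; real systems often use embeddings rather than simple keyword overlap:

```python
# Map of table names to columns; invented for illustration.
TABLES = {
    "sales": ["region", "product_line", "quarter", "revenue"],
    "employees": ["name", "department", "hire_date"],
    "web_traffic": ["page", "visits", "campaign", "demographic"],
}

def relevant_tables(question: str, top_k: int = 2) -> list:
    words = set(question.lower().replace(",", " ").split())
    def score(table: str) -> int:
        # Count loose matches between question words and table/column names.
        terms = {table} | set(TABLES[table])
        return sum(w in t or t in w for t in terms for w in words)
    return sorted(TABLES, key=score, reverse=True)[:top_k]

print(relevant_tables("Q1 revenue by product line, by region"))  # ['sales', ...]
```

Only the winning tables’ definitions then go into the prompt, which keeps context small and reduces the odds of hallucinated columns.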
From intent and schema, it can write SQL easily enough (or hand to a SQL-writing LLM), but there’s also performance to consider. With giant data sets, requests must be optimized and sensible.
Can it handle ambiguity (“top sales” meaning what, exactly?), user mistakes, and runaway result volumes? It must assess context to get at what’s really needed.
Systems today use a variety of means to get this done, from the biggest and best LLMs to combinations of smaller models, RAG pipelines, agents, and customized LLMs.
But this process can be unforgiving, meaning systems also need tolerance for error and the ability to evaluate their own queries before firing them off.
Potentially then, after the fact, they should also verify that the results match the request.
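What these pre- and post-execution checks can look like in practice, sketched in Python against SQLite (the checks and thresholds are illustrative assumptions, not production values):

```python
import sqlite3

def validate_before_run(conn: sqlite3.Connection, sql: str) -> None:
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed.")
    # Dry run: EXPLAIN QUERY PLAN parses and plans the query without executing
    # it, surfacing syntax errors and missing tables or columns up front.
    conn.execute(f"EXPLAIN QUERY PLAN {sql}")

def verify_after_run(rows: list, question: str) -> list:
    # Post-hoc sanity checks: did anything come back, and is the volume
    # plausible for the request?
    if not rows:
        raise ValueError(f"Empty result for {question!r}; re-check the query.")
    if len(rows) > 10_000:  # illustrative cap
        raise ValueError("Suspiciously large result; the query may be too broad.")
    return rows
```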
GenAI for Data Analysis: Asking Your Questions in Real Time
Few organizations believe they’re getting the most from their data, and there are numerous bottlenecks and systematic issues standing in the way.
The traditional model involved requesting reports (at a labor and time cost), waiting for a response, then hoping it met your need. If the outputs weren’t what was wanted (whether because of communication issues or a lack of sufficient understanding), the entire process iterated.
Self-service BI systems (with expense and overhead) sought to cut through that, and AI solutions now hope to lower this bar much more and extend the options much further.
The goal is a cultural change more than a technological one, where it’s not costly, or dangerous, to explore your questions.
Text-to-SQL AI Tools
There are a lot of text-based data analysis tools on the market, both commercial and open source. While virtually all of them convert natural language into SQL at their core, many also include reporting and extensive visualization abilities, varying connectors for data sources and database types, capabilities for managing and mapping data, API integration, fine-tuning and improvement options, and even synthetic data generation.
As with all AI solutions, there’s a wide range in the technical expertise expected of end users, and in the complexity and cost of setup.
Ayush Gupta, a former Senior ML Research Engineer at Apple now building his own startup, Genloop, shared his team’s experience with a variety of popular approaches as they sought to push accuracy past a perceived 85% cap, lower latency and costs, and maintain control over their data.
Benchmarking the Best GenAI Data Tools in 2025
Yale University’s initial Spider benchmark was one popular means of evaluating these tools, at least in terms of their success at going from natural language to SQL on unseen databases. It measured not just syntax but also semantic correctness and generalizability.
SQLCoder, Codex, and T5-SQL are examples of models evaluated there, and Spider 2.0 now uses enterprise-level, real-world workflow problems for evaluation along multiple dimensions. At the time of this writing, the leading performers include the ByteDance research team (using ByteBrain-Agent with multiple LLMs), agents built on OpenAI’s o3 and o1, and Anthropic’s Claude 3.7 Sonnet.
Another popular benchmark at this time is BIRD-bench, where human data engineers still dominate (92.96%) over the next closest mark, AskData + GPT-4o (77.14%).
Bumps in the Road: Concerns with AI SQL Query Generators
With the widespread uniqueness and complexity of data storage even across individual companies, there are still plenty of issues being worked out.
The most obvious is security. Forms that take input and run data operations have always carried the risk of SQL injection, where hackers sneak code into inputs to make backend databases act in undesired ways.
And when moving from natural language to SQL, this concern is back in all new ways. Given a sufficient understanding of how a system works, attackers can prey on the AI itself, crafting prompts that make it generate malicious or overly taxing queries to sabotage data or tie up resources.
Access controls and data privacy are always concerns to be resolved with self-service data systems (AI or otherwise), and if a solution involves third-party servers, or works in ways that are insufficiently understood, this risk can also be intensified.
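A minimal defensive layer along these lines, again assuming SQLite: open the database read-only and abort runaway queries with a watchdog. The limits shown are illustrative:

```python
import sqlite3

def open_readonly(path: str) -> sqlite3.Connection:
    # mode=ro makes writes fail at the database level, whatever the SQL says.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)

def run_untrusted(conn: sqlite3.Connection, sql: str, max_ticks: int = 5_000):
    # Python's sqlite3 already rejects multiple statements per execute() call.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only read queries are allowed.")
    ticks = {"n": 0}
    def watchdog():
        # Called every 10,000 SQLite VM instructions; nonzero aborts the query.
        ticks["n"] += 1
        return 1 if ticks["n"] > max_ticks else 0
    conn.set_progress_handler(watchdog, 10_000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)
```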
Accuracy is of course another concern. Even the best-trained models hallucinate, and correctly understanding schema is a significant challenge for even the best LLMs available today.
(In addition to cost, this is one reason Genloop’s Ayush Gupta, cited above, recommends customized, contextual LLMs with fine-tuned, open-weight models over leaning on the biggest large language models for SQL: they are easier to tune to your specific structures.)
Natural language requests will always introduce ambiguity, and systems must be smart enough to ask follow-up questions. Google’s Per Jacobsson recommends a dedicated disambiguation step, in which the system re-prompts the user for greater specificity.
He also recommends utilizing a semantic layer over raw data, and validating prompts.
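A toy semantic layer can be as simple as a dictionary of vetted metric definitions, where undefined or ambiguous terms trigger a clarifying question instead of a guess (the definitions here are invented):

```python
# Business terms mapped to approved SQL fragments; None marks known-ambiguous terms.
SEMANTIC_LAYER = {
    "revenue": "SUM(sales.revenue)",
    "top sales": None,  # by revenue? by units? the system must ask
    "headcount": "COUNT(DISTINCT employees.id)",
}

def resolve(term: str) -> str:
    if term not in SEMANTIC_LAYER:
        raise KeyError(f"Unknown metric {term!r}; ask the user to define it.")
    fragment = SEMANTIC_LAYER[term]
    if fragment is None:
        # Disambiguation step: re-prompt the user rather than guessing.
        raise ValueError(f"{term!r} is ambiguous: by revenue, units, or margin?")
    return fragment
```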
We wrote in a prior PTP Report on multi-agent design patterns. Using multi-agent systems in analytics can help provide many of these benefits, including pre-validating queries.
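Compressed to its essence, that pattern is a generator/reviewer loop: one model call drafts the SQL, a second critiques it against the schema, and only approved queries move on. A sketch, reusing the call_llm() placeholder from the earlier example:

```python
def generate_validated_sql(question: str, schema: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        # Generator agent drafts SQL, folding in any reviewer feedback.
        sql = call_llm(
            f"Schema:\n{schema}\nQuestion: {question}\n"
            f"Reviewer feedback so far: {feedback or 'none'}\nWrite SQLite SQL."
        )
        # Reviewer agent approves or explains what is wrong.
        verdict = call_llm(
            f"Schema:\n{schema}\nQuestion: {question}\nSQL:\n{sql}\n"
            "Reply APPROVE if the SQL answers the question using only real "
            "tables and columns; otherwise explain the problem."
        )
        if verdict.strip().upper().startswith("APPROVE"):
            return sql
        feedback = verdict
    raise RuntimeError("Reviewer never approved a query; escalate to a human.")
```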
Fine-tuning any system is essential to success, ideally using specific schema and query logs with performance history.
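One hedged sketch of what that can mean in practice: mining query logs for exemplary question/SQL pairs and writing them out in the JSONL shape most fine-tuning APIs expect. The log field names here are assumptions about your own format:

```python
import json

def logs_to_jsonl(query_log: list, out_path: str) -> None:
    with open(out_path, "w") as f:
        for entry in query_log:
            # Keep only queries that succeeded and ran fast: exemplary pairs.
            if entry["error"] or entry["latency_ms"] > 2_000:
                continue
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": entry["question"]},
                    {"role": "assistant", "content": entry["sql"]},
                ]
            }) + "\n")
```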
And while popular benchmarks are a good starting place, they often originate in academia and may not represent real enterprise workloads and schemas. Companies need their own benchmarks, with ongoing evaluations, to ensure systems are functioning properly.
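An internal benchmark need not be elaborate. Here is a bare-bones harness that runs the generated SQL and a trusted “golden” query for each question and compares result sets, which catches semantically wrong queries that still parse; the golden set itself is something your own data team curates:

```python
import sqlite3

def run_benchmark(conn: sqlite3.Connection, golden_set: list, generate_sql) -> float:
    passed = 0
    for case in golden_set:  # each case: {"question": ..., "golden_sql": ...}
        try:
            got = conn.execute(generate_sql(case["question"])).fetchall()
            want = conn.execute(case["golden_sql"]).fetchall()
            passed += sorted(map(str, got)) == sorted(map(str, want))
        except sqlite3.Error:
            pass  # generation or execution failure counts as a miss
    return passed / len(golden_set)
```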
The Impact of NLQ on Data Scientists
So how do data professionals feel about all of this?
Query accuracy and execution safety are both frequently cited concerns, and for good reason: even the most finely tuned system is going to generate wrong, or at least inefficient, code in some cases.
But at the same time, these systems are nowhere near eliminating the need for data science experts. Instead, at their best, they are practical self-service tools that alleviate the backlog and empower business professionals to do more on their own.
With too many hours now lost to continuously retrieving data, AI text-to-SQL solutions will lighten the load, as well as make it easier to draft queries for refinement.
There are also AI tools for data scientists, like MIT’s GenSQL. This adds complex statistics, enabling probabilistic programming layered right on top of SQL.
More than querying data, it can be used to predict outcomes, test hypotheses, fix data issues, and even generate synthetic data, allowing data and predictive models to be queried together.
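GenSQL has its own probabilistic query language, so the following is only a conceptual Python stand-in for that idea of querying data and a predictive model together: pull history with SQL, fit a simple model, then “query” the model for a prediction (table and column names are invented):

```python
import sqlite3
from statistics import linear_regression  # Python 3.10+

def predict_next_quarter(conn: sqlite3.Connection, product_line: str) -> float:
    rows = conn.execute(
        "SELECT quarter_index, revenue FROM sales_history "
        "WHERE product_line = ? ORDER BY quarter_index",
        (product_line,),
    ).fetchall()
    xs, ys = zip(*rows)
    slope, intercept = linear_regression(xs, ys)
    return slope * (max(xs) + 1) + intercept  # extrapolate one quarter out
```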
Let PTP Help You Get the Most from Your Data
Good data governance is essential for any of these natural language to SQL systems—and not just as a best practice. It’s the required foundation for accuracy, security, and compliance.
AI systems in this area are becoming a powerful enabler and facilitator, helping companies get more from their data, democratizing access, and encouraging exploration.
But they also still require data professionals in the loop, to ensure quality, compliance, and integrity.
Consider PTP for your own data expertise needs. Whether you need scientists, engineers, or analysts, you can trust our extensive vetting process and 27+ years of experience in the field.
Conclusion
The dream of being able to ask your data whatever you need to know, in real time, may not be a reality yet, but it’s closer than ever before.
No longer a novelty hook, or BI system sales pitch, text-to-SQL AI solutions are improving fast and expanding with the inclusion of agents and multi-agent systems.
And while data professionals aren’t in danger of being replaced, they can hope to spend a lot less time delivering rote reports and visualizations, and more time asking, and answering, important questions for the organization as a whole.
References
MIT Researchers Unveil GenSQL AI Tool for Simplified Data Analysis, Quantum Zeitgeist
GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables, arXiv:2406.15652 [cs.PL]
Big Models, Bad Math: The GenAI Problem In Finance, Forbes
Text to SQL: The Ultimate Guide for 2025, Genloop
Getting AI to write good SQL: Text-to-SQL techniques explained, Google Cloud Blog
SPS-SQL: Enhancing Text-to-SQL generation on small-scale LLMs with pre-synthesized queries, ScienceDirect
FAQs
Are natural language to SQL tools accurate enough for business use today?
Yes, especially with the improvements seen in 2025, though not for all use cases, and some errors will persist. With offerings like GPT-4 plus DIN-SQL setting records on Spider (85.3% on the test set), and the open-source SQLCoder reaching near-human levels, fine-tuned, enterprise-specific systems can be expected to do better still, and regularly perform on par with humans.
Still, governance is a must, with human review, especially on edge cases or complicated asks.
How will these text-to-SQL tools impact the career stability of data professionals?
As with many positions, the early talk was all doom and gloom, until the realities and limits of AI implementations came into focus. These tools are far better at taking on routine, repetitive tasks than at removing the need for trained data experts.
Data teams themselves can use these tools to sketch and prototype queries, onboard new staff, build self-service BI, or dig deeper into analysis more easily.
Can natural language to SQL tools really handle complex, multi-table queries?
Yes, though, again, it depends on how well the system in question understands the schema. Tools like SQLCoder and fine-tuned enterprise-specific systems are capable of dealing with requests that require multiple joins, nested subqueries, and aggregates, though such sophistication requires human oversight, and ideally internal checks, such as through multi-agent systems.
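For a sense of scale, here is the kind of SQL a single conversational ask can imply; the question and schema are invented, and every join, filter, and aggregate below must be inferred from the one-line request:

```python
QUESTION = "Average order value by region for repeat customers in Q1"

GENERATED_SQL = """
SELECT r.name AS region, AVG(o.total) AS avg_order_value
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN regions   r ON r.id = c.region_id
WHERE o.ordered_at BETWEEN '2025-01-01' AND '2025-03-31'
  AND c.id IN (SELECT customer_id FROM orders
               GROUP BY customer_id HAVING COUNT(*) > 1)
GROUP BY r.name;
"""
```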