How C3 AI designed a generative AI agent solution for efficient question-answering of structured data

By Jack Lin, Lead Data Scientist, C3 AI


 

Making informed business decisions often relies on extracting insights from structured data: highly organized information stored in databases. While this data holds immense value, analyzing it can be time-consuming and complex. Typically, this involves crafting intricate database queries, which even skilled analysts may spend hours or days refining, ultimately delaying decision making.

Recent advances in generative AI are revolutionizing how we interact with structured data. Large language models (LLMs) can now translate natural language queries into database queries (text-to-SQL for relational databases), retrieving and processing data faster than ever before. Once the queries are executed, LLMs can process the resulting tables to provide clear, actionable answers to the original questions. Yet, creating a production-ready, scalable generative AI application for question answering over structured data presents unique challenges.

The C3 AI Structured DB Agent, part of C3 Generative AI, solves these challenges by simplifying database access and delivering advanced analytics. With this tool, organizations can unlock the full potential of their structured data and make decisions faster and more effectively.

Overcoming the Challenges of Question-Answering with Structured Data

Building a reliable AI-powered solution for structured data requires overcoming the following hurdles:

  1. Ambiguity in Human Queries
    Human queries often lack specificity, so translating them into specific database queries is difficult. For example, “How many events happened last week?” could mean the last seven days or the previous Monday to Sunday. Correctly interpreting these queries usually depends on context and intent.
  2. Data Mismatch
    Query terms may not align with database entries. For example, “How many events happened in the US?” might translate to “WHERE location = ‘US’” in the SQL query. However, if the database stores the location as “U.S.,” the query with “US” would return no data. Such issues become even more challenging when the column contains many unique values. Addressing such inconsistencies is critical for accurate results.
  3. Complex Database Schemas
    Large datasets often contain numerous tables and columns. This complexity can exceed the context window limit of LLMs, making accurate query generation and response difficult.
  4. Multiple Data Sources
    Insights often require data from various sources or databases. Integrating these sources and capturing their relationships adds another layer of complexity.
  5. Varying Query Languages
    Different databases may use unique query languages or SQL dialects, requiring custom handling for each system. Even after fine-tuning on dialect-specific data, the execution accuracy for PostgreSQL and BigQuery can be suboptimal.
  6. Mathematical and Statistical Limitations
    LLMs often struggle with complex mathematical problem solving, calculations, and statistical queries.
  7. Sequential Queries
    Some user queries involve multi-step processes, such as retrieving data in one step and using it for further analysis.
  8. Clarity of Results
    Raw tables and text results can be hard to interpret. Effective visualizations and simplified summaries are often more useful.
  9. Security Risks
    Prompt attacks can jeopardize the application and underlying databases, posing significant security concerns, including potential exposure of sensitive information.
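To make the ambiguity in the first challenge concrete, the short sketch below computes both readings of "last week" for the same date. The function names are hypothetical, and treating the "last seven days" as the seven days ending yesterday is an assumption made for illustration.

```python
from datetime import date, timedelta

def trailing_seven_days(today: date) -> tuple[date, date]:
    """Interpretation A (assumed): the seven days ending yesterday."""
    return today - timedelta(days=7), today - timedelta(days=1)

def previous_calendar_week(today: date) -> tuple[date, date]:
    """Interpretation B: the most recent full Monday-to-Sunday week."""
    this_monday = today - timedelta(days=today.weekday())
    return this_monday - timedelta(days=7), this_monday - timedelta(days=1)

today = date(2024, 5, 15)  # a Wednesday, chosen for illustration
print(trailing_seven_days(today))     # 2024-05-08 through 2024-05-14
print(previous_calendar_week(today))  # 2024-05-06 through 2024-05-12
```

The two date ranges overlap but are not the same, so the same question can yield different counts depending on which reading is chosen.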

How the C3 AI Structured DB Agent Solves These Challenges

Figure: How the C3 AI Structured DB Agent Works

The C3 AI Structured DB Agent is a powerful multi-hop system designed to navigate these challenges and deliver precise, actionable answers. Here’s how it works:

Query Translation

The agent converts natural language queries into database queries using LLMs, leveraging context and domain-specific knowledge to improve accuracy. Given a user query, the agent retrieves the most relevant few-shot examples from long-term memory and the most relevant data tables and columns from the C3 AI Data Model. This information is then sent to the LLM for synthesizing the database query.
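As a rough illustration of this step, the sketch below assembles a text-to-SQL prompt from already-retrieved few-shot examples and schema descriptions. The function `build_sql_prompt`, the prompt wording, and the `events` table are all assumptions for illustration; the agent's actual prompt format is not described here.

```python
def build_sql_prompt(question, examples, schema_lines):
    """Assemble a text-to-SQL prompt from retrieved context (hypothetical layout)."""
    parts = ["You translate questions into SQL.", "", "Relevant tables and columns:"]
    parts += [f"- {line}" for line in schema_lines]
    parts += ["", "Examples:"]
    for ex_question, ex_sql in examples:
        parts += [f"Q: {ex_question}", f"SQL: {ex_sql}"]
    # End with the user's question so the model completes the SQL.
    parts += ["", f"Q: {question}", "SQL:"]
    return "\n".join(parts)

prompt = build_sql_prompt(
    "How many events happened in the US?",
    examples=[("How many events happened in Canada?",
               "SELECT COUNT(*) FROM events WHERE location = 'Canada';")],
    schema_lines=["events(id, location, occurred_at)"],
)
print(prompt)
```

Keeping only the retrieved tables and examples in the prompt is what lets the same template work for schemas far larger than the context window.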

Fuzzy Matching for Accuracy

If query terms don’t perfectly match database entries, the agent applies fuzzy matching to align them. For example, after fuzzy matching, the query “How many events happened in the US?” would have the filter string “U.S.” matching the data value in the database. The database query is then executed, and the table data is sent to the LLM.
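One simple way to approximate this alignment is string similarity from Python's standard library. The sketch below uses `difflib.get_close_matches`; the helper `align_filter_value`, the cutoff of 0.6, and the sample values are assumptions, and the agent's own matching method may differ.

```python
import difflib

def align_filter_value(term, column_values, cutoff=0.6):
    """Return the stored value closest to `term`, or None if nothing is close enough."""
    matches = difflib.get_close_matches(term, column_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None

locations = ["U.S.", "Canada", "Mexico"]  # distinct values in the (hypothetical) column
matched = align_filter_value("US", locations)

# Pass the matched value as a bound parameter rather than formatting it into the SQL text.
sql, params = "SELECT COUNT(*) FROM events WHERE location = ?", (matched,)
print(sql, params)
```

For columns with many unique values, comparing against every value becomes costly, so a production system would likely index the values rather than scan them on each query.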

Handling Complex Queries

The agent processes multi-step queries and performs necessary calculations, integrating data from multiple sources when required. Depending on the user queries, the agent may generate and execute Python code to pre-process the database query and post-process the table.
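A minimal sketch of such a two-step flow, assuming an in-memory SQLite database and a hypothetical `food_waste` table with made-up values: the query retrieves the data, and Python then computes a three-month moving average that would be awkward to ask an LLM to calculate directly.

```python
import sqlite3

def moving_average(values, window):
    """Trailing moving average; positions without a full window yield None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE food_waste (month TEXT, tons REAL)")
conn.executemany(
    "INSERT INTO food_waste VALUES (?, ?)",
    [("2024-01", 10.0), ("2024-02", 14.0), ("2024-03", 12.0), ("2024-04", 16.0)],
)

# Step 1: retrieve the data with a database query.
rows = conn.execute("SELECT month, tons FROM food_waste ORDER BY month").fetchall()

# Step 2: post-process the returned table in Python.
averages = moving_average([tons for _, tons in rows], window=3)
for (month, tons), avg in zip(rows, averages):
    print(month, tons, avg)
```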

Error Self-Correction

If errors occur during execution, the agent self-corrects using feedback and retries until the task is successfully completed.

Clear, Insightful Outputs

The results are presented as text summaries, tables, and visualizations, providing easy-to-interpret insights.

Powered by the C3 AI Platform: Built on a Unified and Standardized Data Framework

The agent is powered by the C3 AI Data Model, which standardizes relationships between data elements, and the C3 AI Unified Data Lake, which consolidates fragmented data sources into a single, central view. These features eliminate the need for multiple query languages and ensure efficient integration across diverse data environments. Without the C3 AI Platform and model-driven architecture, these features would not be possible.

Ensuring Reliability, Scalability, and Security

The C3 AI Structured DB Agent is built to handle complex datasets while ensuring secure and reliable operations.

Reflection & Self-Correction

The agent uses a reflection mechanism to address program errors and refine outputs. It takes feedback from validation scripts, other LLMs, and even humans, and adjusts its approach as needed. This type of mechanism ensures robust performance, though there’s a tradeoff between accuracy and speed. We can configure the system to limit the number of reflections or total time spent before stopping, ensuring a balance between precision and efficiency.
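The loop below sketches this pattern with both limits in place. It is an illustration under stated assumptions, not the agent's implementation: `run_with_reflection`, its parameters, and `stub_generator` (a stand-in for an LLM call) are hypothetical, and the only feedback source shown is the database error message.

```python
import sqlite3
import time

def run_with_reflection(conn, generate_sql, question, max_reflections=3, time_budget_s=10.0):
    """Execute generated SQL, feeding errors back to the generator until success or a limit."""
    deadline = time.monotonic() + time_budget_s
    feedback = None
    for attempt in range(max_reflections + 1):
        if time.monotonic() > deadline:
            break  # time budget exhausted
        sql = generate_sql(question, feedback)
        try:
            return conn.execute(sql).fetchall(), attempt
        except sqlite3.Error as exc:
            # The error message becomes the feedback for the next attempt.
            feedback = f"Query {sql!r} failed: {exc}"
    raise RuntimeError(f"No valid query within limits. Last feedback: {feedback}")

def stub_generator(question, feedback):
    """Hypothetical stand-in for an LLM: wrong column first, corrected after feedback."""
    if feedback is None:
        return "SELECT COUNT(*) FROM events WHERE place = 'U.S.'"
    return "SELECT COUNT(*) FROM events WHERE location = 'U.S.'"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, location TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "U.S."), (2, "U.S."), (3, "Canada")])
rows, reflections_used = run_with_reflection(
    conn, stub_generator, "How many events happened in the US?"
)
print(rows, reflections_used)
```

Lowering `max_reflections` or `time_budget_s` trades some accuracy for faster responses, which is the balance described above.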

Retrieval-Augmented Prompting & Long-Term Memory

To scale across databases with numerous tables and columns, the agent uses a retrieval-augmented generation (RAG) approach. Documentation of the data model is stored in a vector store, allowing the agent to retrieve only the most relevant tables and columns for each query. This avoids hitting the context window limit and reduces noise, improving accuracy.
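The retrieval step can be sketched as follows. To keep the example self-contained, a toy bag-of-words vector and cosine similarity stand in for a learned embedding model and a real vector store; the table descriptions and all function names are assumptions for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Return the names of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])), reverse=True)
    return ranked[:k]

schema_docs = {
    "events": "events table: one row per event with location and occurred_at timestamp",
    "facilities": "facilities table: facility name, region, and monthly food waste goal",
    "products": "products table: product name, category, and unit price",
}
top = retrieve("How many events happened at each location?", schema_docs, k=1)
print(top)
```

Only the documentation for the returned tables would then be placed in the prompt, which keeps prompt size roughly constant as the schema grows.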

To address ambiguity in queries, domain-specific information and few-shot examples are also stored in the system’s long-term memory. When a query is processed, the agent retrieves the most relevant examples to construct prompts tailored to the task. These examples can include small Python scripts for specific operations, atomic tasks, or feedback-driven refinements.

Built-in Guardrails

To ensure security and reliability, the agent implements robust guardrails. Inputs (including user queries) and outputs are carefully examined to prevent prompt attacks, filter harmful or inappropriate responses, and protect sensitive data such as personally identifiable information (PII). These guardrails make the agent enterprise-ready, capable of operating in production environments.

Figure: Robust guardrails are implemented to ensure security
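Two simple checks of this kind are sketched below: one rejects generated SQL that is not a single read-only statement, and one masks email addresses in output text. Both are deliberately crude illustrations with hypothetical names and patterns. Production guardrails would need a proper SQL parser, database-level permissions, and broader PII detection.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
FORBIDDEN_SQL = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant)\b", re.IGNORECASE
)

def is_read_only_sql(sql):
    """Accept a single SELECT statement with no data-modifying keywords."""
    body = sql.strip().rstrip(";")
    if ";" in body:  # more than one statement
        return False
    return body.lower().startswith("select") and not FORBIDDEN_SQL.search(body)

def redact_emails(text):
    """Mask email addresses before text is shown to the user."""
    return EMAIL_RE.sub("[REDACTED EMAIL]", text)

print(is_read_only_sql("SELECT COUNT(*) FROM events"))
print(is_read_only_sql("SELECT 1; DROP TABLE events"))
print(redact_emails("Contact jane.doe@example.com for details."))
```

The keyword check is intentionally conservative: it can reject a harmless query whose string literal happens to contain a word like "update", which is usually the safer failure mode.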

Proven Performance

The C3 AI Structured DB Agent has been benchmarked using the Defog text-to-SQL dataset and demonstrated superior performance compared to general-purpose LLMs, even without fine-tuning. This highlights its optimized design for structured data applications.

Figure: Benchmarking C3 AI’s Performance with Defog Text-to-SQL
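For readers unfamiliar with how text-to-SQL systems are typically scored, the sketch below shows one common metric, execution accuracy: a predicted query counts as correct when it returns the same rows as a reference query. This is a generic illustration on toy data, not the Defog evaluation harness or the procedure used for the benchmark above.

```python
import sqlite3

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, reference) SQL pairs whose result rows match, ignoring row order."""
    correct = 0
    for predicted_sql, reference_sql in pairs:
        try:
            predicted = sorted(conn.execute(predicted_sql).fetchall(), key=repr)
        except sqlite3.Error:
            continue  # a query that fails to run counts as incorrect
        reference = sorted(conn.execute(reference_sql).fetchall(), key=repr)
        if predicted == reference:
            correct += 1
    return correct / len(pairs)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, location TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "U.S."), (2, "U.S."), (3, "Canada")])

reference = "SELECT COUNT(*) FROM events WHERE location = 'U.S.'"
pairs = [
    ("SELECT COUNT(id) FROM events WHERE location = 'U.S.'", reference),  # same result
    ("SELECT COUNT(*) FROM events WHERE location = 'US'", reference),     # data mismatch
]
score = execution_accuracy(conn, pairs)
print(score)
```

The second pair runs without error yet returns the wrong count, which is why the data mismatch problem described earlier shows up directly in this metric.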

Generative AI in Production: Real-World Results at a Multinational Food Company

One multinational food company adopted C3 Generative AI to simplify its complex data analysis processes. C3 Generative AI, built with the C3 AI Structured DB Agent, was able to quickly aggregate and analyze key metrics across facilities and products, producing analytical insights, including monthly trends, moving averages, metric correlations, and facility outliers. C3 Generative AI was able to answer questions and prompts like:

  • Which KPIs for this meat product processed in the Redwood City facility saw the smallest change over a three-month period?
  • Show me the facilities with the highest deviation from monthly goal for total food waste.

Key Benefits:

  1. Speed and Efficiency: User queries are answered in seconds with rich multimodal data, including text summaries, tables, and visualizations. This significantly accelerates insight extraction compared to traditional data analyst reports, which can take hours or days.
  2. Accessibility: Makes database data easily accessible to non-technical users by understanding complex database schemas and translating natural language queries into database queries and code.

Intelligent, Efficient Decision Making with AI-Powered Data Analysis

The C3 AI Structured DB Agent represents a significant advancement in the field of question-answering with structured data. By leveraging state-of-the-art techniques, it addresses many of the inherent challenges associated with querying and extracting insights from complex databases. The agent’s ability to handle single-hop/multi-hop and mathematical/statistical queries ensures that users receive accurate and multimodal answers to their queries efficiently.

With robust security, seamless scalability, and the ability to deliver real-time insights, the agent — and C3 Generative AI — empowers organizations to make smarter, faster decisions.

Learn how C3 Generative AI produced a 90% time savings while providing 90%+ response accuracy for a large multinational manufacturing group.

 


About the Author(s)

Jack Lin is a Lead Data Scientist on the Generative AI Data Science team at C3 AI, where he leads and develops advanced applications for question-answering across both structured and unstructured data sources. His current work focuses on leveraging Plan & Execute Agents, Structured Database Agents, and Retrieval-Augmented Generation (RAG) to build performant, robust, and scalable generative AI solutions. Jack is also an active voice in the generative AI field, sharing insights through articles on Towards Data Science on Medium (https://medium.com/@jacklingenai). He received his Ph.D. in Quantitative Computational Biology from Baylor College of Medicine.