Lincoln Bicalho · AI/ML · 26 min read
Measuring and Improving Vanna AI Accuracy: A Data-Driven Approach
📋 Prerequisites:
- Python 3.8 or later with pandas and dataclasses
- Vanna AI installed and configured (pip install vanna)
- Access to test database environment
- Basic understanding of SQL and accuracy metrics
- Familiarity with Text-to-SQL concepts
- 150-200 test cases prepared (see Building Test Suites)
Overview
Accuracy measurement in Text-to-SQL systems presents unique challenges beyond traditional machine learning metrics. You need to evaluate not just syntactic correctness but also semantic equivalence and business value alignment. This guide provides a comprehensive framework for measuring, tracking, and continuously improving Vanna AI accuracy in production environments.
What you’ll learn:
- How to build comprehensive accuracy testing frameworks
- Multiple accuracy metrics beyond simple pass/fail
- Confidence score calibration and reliability thresholds
- A/B testing strategies for training approaches
- Continuous improvement systems with automated feedback loops
- ROI analysis for training investment decisions
Why accuracy measurement matters: In government and enterprise deployments, stakeholders require demonstrable, measurable improvements to justify AI investment. Generic “AI works well” claims don’t satisfy compliance audits or budget reviews. You need systematic measurement, tracking, and continuous improvement processes.
⚠️ Important: This guide focuses on measurement methodology, not algorithm optimization. You’ll measure Vanna AI’s accuracy to guide training decisions, not modify Vanna’s internal algorithms.
Understanding Vanna AI Accuracy Challenges
Text-to-SQL accuracy differs fundamentally from traditional ML classification metrics. Your system can generate syntactically perfect SQL that returns completely wrong results due to business logic misunderstandings or schema misinterpretation.
The Accuracy Gap
In production deployments, you’ll observe a significant gap between different accuracy measures:
| Metric Type | Typical Range | What It Measures |
|---|---|---|
| Semantic Similarity | 85-90% | Text similarity between queries |
| Syntactic Correctness | 75-85% | SQL parses without errors |
| Business-Correct Results | 65-75% | Returns the right answer |
| User Satisfaction | 60-70% | User accepts without correction |
Initial deployment (Month 1):
- Simple queries: 70-80% accuracy
- Moderate queries (2-3 joins): 55-65% accuracy
- Complex queries (multiple joins, subqueries): 35-45% accuracy
- Overall blended: 60-70% accuracy
After 3 months of systematic optimization:
- Simple queries: 85-90% accuracy
- Moderate queries: 70-80% accuracy
- Complex queries: 55-65% accuracy
- Overall blended: 75-85% accuracy
Mature implementation (6+ months):
- Simple queries: 90-95% accuracy
- Moderate queries: 80-85% accuracy
- Complex queries: 65-75% accuracy
- Overall blended: 80-90% accuracy
ℹ️ Note: These ranges reflect patterns observed across multiple government database implementations. Your specific accuracy will vary based on schema complexity, training data quality, and query diversity.
Why Traditional Metrics Fail
Challenge 1: Query Ambiguity
Natural language inherently contains ambiguity. “Recent sales” could mean today, this week, or this month depending on business context. Semantic similarity scores can’t detect this misinterpretation.
Challenge 2: Schema Complexity
Government databases often contain 200+ tables with similar column names. Vanna might correctly identify the need for a JOIN but select the wrong foreign key relationship.
Challenge 3: Business Logic Requirements
Queries often require implicit business rules—active records only, specific date ranges, security filters—that aren’t obvious from the natural language question.
Challenge 4: Context Dependencies
Follow-up questions rely on previous conversation context that can drift over multiple exchanges.
💡 Key Insight: Measuring accuracy requires testing both syntactic correctness (does the SQL execute?) and semantic business value (does it answer the right question?).
Key Concepts
Concept 1: Multi-Dimensional Accuracy Measurement
You need multiple accuracy metrics to understand true system performance:
Syntax Accuracy - SQL parsing and structural correctness
- Measures: Does the generated SQL execute without errors?
- Value: Indicates basic SQL generation capability
- Limitation: Syntactically correct SQL can return wrong results
Semantic Accuracy - Logical equivalence of results
- Measures: Do generated and expected results match?
- Value: Indicates functional correctness
- Limitation: Doesn’t account for schema or column selection
Result Accuracy - Business value correctness
- Measures: Does the result answer the user’s actual question?
- Value: Reflects real-world usefulness
- Limitation: Requires human validation for edge cases
Schema Accuracy - Column and table selection correctness
- Measures: Are the right columns returned in the right order?
- Value: Indicates schema understanding
- Limitation: Correct schema can still have wrong filters
When to use each metric:
| Metric | Best For | Example Use Case |
|---|---|---|
| Syntax | Initial debugging | “Why is my SQL failing to execute?” |
| Semantic | Functional testing | “Are results logically equivalent?” |
| Result | Production validation | “Is this the answer users need?” |
| Schema | Training effectiveness | “Does it understand our schema?” |
Trade-offs:
- Higher accuracy requires more training data and validation effort
- Perfect accuracy isn’t achievable—focus on predictable, reliable performance
- Confidence scores become more valuable than raw accuracy numbers
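If you want a single headline number that reflects business priorities rather than a plain average, the four dimensions can be combined with weights. Below is a minimal sketch; the weights are illustrative, and the tester later in this guide simply averages the four scores.
# Sketch: weighted combination of the four accuracy dimensions.
# The weights are illustrative; tune them to your business priorities.
DIMENSION_WEIGHTS = {
    'syntax': 0.15,    # does the SQL execute at all?
    'semantic': 0.30,  # are the results logically equivalent?
    'result': 0.40,    # does it answer the business question?
    'schema': 0.15,    # right columns, right order?
}

def weighted_overall_accuracy(scores: dict, weights: dict = DIMENSION_WEIGHTS) -> float:
    """Combine per-dimension scores (0.0-1.0) into one weighted number."""
    return sum(scores[dim] * w for dim, w in weights.items()) / sum(weights.values())

# Example: perfect syntax, weaker business correctness -> ~0.77 overall
weighted_overall_accuracy({'syntax': 1.0, 'semantic': 0.8, 'result': 0.6, 'schema': 0.9})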
Concept 2: Confidence Calibration
Vanna AI provides confidence scores, but these don’t always correlate with actual accuracy. You need to calibrate confidence scores against observed accuracy to establish reliability thresholds.
The Calibration Problem: Initial deployments show weak correlation (0.3-0.5) between confidence and accuracy. This means high-confidence predictions can be wrong, and low-confidence predictions can be correct.
After Proper Training: With systematic training data curation, correlation improves to 0.7-0.8, making confidence scores useful for production routing decisions.
Practical Application: You can route high-confidence queries to direct execution and low-confidence queries to human review, optimizing both accuracy and operational efficiency.
Building a Comprehensive Query Accuracy Test Suite {#building-test-suites}
The foundation of accuracy measurement is systematic testing. You create test cases representing common query patterns, edge cases, and business-specific scenarios.
Basic Test Framework
Step 1: Define Test Case Structure
from dataclasses import dataclass
from typing import Dict, List
import pandas as pd
import json
from datetime import datetime
@dataclass
class QueryTestCase:
"""Represents a single accuracy test case"""
id: str
natural_language: str
expected_sql: str
expected_result_schema: Dict
business_context: str
difficulty_level: str # 'basic', 'intermediate', 'advanced'
categories: List[str] # ['joins', 'aggregation', 'filtering']
# WHY: Structured test cases enable systematic accuracy measurement
# HOW: Each test case captures expected behavior across multiple dimensions
Step 2: Create VannaAccuracyTester Class
class VannaAccuracyTester:
"""
Comprehensive accuracy testing framework for Vanna AI systems.
This class provides systematic measurement across multiple accuracy
dimensions: syntax, semantics, results, and schema correctness.
"""
def __init__(self, vanna_instance, test_database_connection):
# WHY: Initialize with Vanna instance and test DB to enable
# side-by-side comparison of expected vs. generated queries
self.vanna = vanna_instance
self.db_connection = test_database_connection
self.test_cases = []
self.results = []
def load_test_cases(self, test_cases_file: str):
"""Load test cases from JSON file for reproducible testing"""
# WHY: External JSON enables version control and sharing test suites
with open(test_cases_file, 'r') as f:
data = json.load(f)
self.test_cases = [QueryTestCase(**case) for case in data['test_cases']]
# HOW: Deserialize JSON into structured test case objects
💡 Tip: Maintain 150-200 test cases covering common patterns, edge cases, and business scenarios. Government implementations typically require broader test coverage due to schema complexity.
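The loader above expects a JSON file with a top-level test_cases array whose entries mirror the QueryTestCase fields. Here is a minimal illustrative example of that file's contents, expressed as the Python structure you would serialize (all values are placeholders):
# Illustrative structure for test_cases.json; all values are placeholders.
example_suite = {
    "test_cases": [
        {
            "id": "TC-001",
            "natural_language": "How many employees were hired in 2023?",
            "expected_sql": "SELECT COUNT(*) FROM employees WHERE hire_date >= '2023-01-01' AND hire_date < '2024-01-01'",
            "expected_result_schema": {"hire_count": "integer"},
            "business_context": "HR headcount reporting",
            "difficulty_level": "basic",
            "categories": ["filtering", "aggregation"]
        }
    ]
}

# Write it once, then point load_test_cases() at the file:
# with open("test_cases.json", "w") as f:
#     json.dump(example_suite, f, indent=2)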
Progressive Testing Implementation
Basic: Single Test Execution
def execute_accuracy_test(self, test_case: QueryTestCase) -> Dict:
"""Execute single test case and measure accuracy"""
start_time = datetime.now()
try:
# WHY: Generate SQL from natural language using Vanna
generated_sql = self.vanna.generate_sql(test_case.natural_language)
generation_time = (datetime.now() - start_time).total_seconds()
# WHY: Execute both expected and generated queries for comparison
expected_results = pd.read_sql(test_case.expected_sql, self.db_connection)
generated_results = pd.read_sql(generated_sql, self.db_connection)
# WHY: Calculate comprehensive accuracy metrics across dimensions
accuracy_metrics = self._calculate_accuracy_metrics(
test_case, generated_sql, expected_results, generated_results
)
return {
'test_id': test_case.id,
'generated_sql': generated_sql,
'execution_successful': True,
'generation_time_seconds': generation_time,
**accuracy_metrics
}
except Exception as e:
# WHY: Capture failures with context for debugging
return {
'test_id': test_case.id,
'generated_sql': generated_sql if 'generated_sql' in locals() else None,
'execution_successful': False,
'error': str(e),
'generation_time_seconds': (datetime.now() - start_time).total_seconds(),
'syntax_accuracy': 0.0,
'semantic_accuracy': 0.0,
'result_accuracy': 0.0
}
⚠️ Warning: Always execute generated SQL in isolated test environments, never production databases. Malformed queries can cause performance issues or unintended data modifications.
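One way to enforce that isolation is to connect with a low-privilege account and force read-only, time-limited sessions. The sketch below assumes a PostgreSQL test database accessed through SQLAlchemy/psycopg2; the DSN, role name, and timeout are placeholders.
# Sketch: a guarded test connection. The role, DSN, and timeout are illustrative.
from sqlalchemy import create_engine

test_engine = create_engine(
    "postgresql+psycopg2://vanna_test_reader:***@test-db/analytics",  # read-only role
    connect_args={
        # libpq startup options: reject writes and kill queries after 30 seconds
        "options": "-c default_transaction_read_only=on -c statement_timeout=30000"
    },
)

# Pass test_engine as test_database_connection when constructing VannaAccuracyTester.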
Intermediate: Accuracy Metrics Calculation
def _calculate_accuracy_metrics(self, test_case, generated_sql,
expected_df, generated_df) -> Dict:
"""Calculate comprehensive accuracy metrics across multiple dimensions"""
# WHY: Multi-dimensional metrics reveal different failure modes
# Syntax failures differ from semantic failures differ from schema issues
# Syntax accuracy - SQL parsing and structure
syntax_score = self._calculate_syntax_accuracy(
test_case.expected_sql, generated_sql
)
# Semantic accuracy - logical equivalence
semantic_score = self._calculate_semantic_accuracy(
expected_df, generated_df
)
# Result accuracy - business value correctness
result_score = self._calculate_result_accuracy(
expected_df, generated_df
)
# Schema accuracy - column selection correctness
schema_score = self._calculate_schema_accuracy(
test_case.expected_result_schema,
generated_df.columns.tolist()
)
# HOW: Combine metrics into weighted overall score
return {
'syntax_accuracy': syntax_score,
'semantic_accuracy': semantic_score,
'result_accuracy': result_score,
'schema_accuracy': schema_score,
'overall_accuracy': (syntax_score + semantic_score +
result_score + schema_score) / 4
}
Advanced: Complete Test Suite Execution
def run_full_test_suite(self) -> pd.DataFrame:
"""Execute complete test suite and return comprehensive results"""
results = []
# WHY: Batch execution enables statistical analysis of accuracy patterns
for test_case in self.test_cases:
result = self.execute_accuracy_test(test_case)
result['difficulty'] = test_case.difficulty_level
result['categories'] = ','.join(test_case.categories)
results.append(result)
# HOW: Convert to DataFrame for analysis and visualization
self.results = pd.DataFrame(results)
return self.results
def generate_accuracy_report(self) -> Dict:
"""Generate comprehensive accuracy analysis with breakdowns"""
if self.results.empty:
return {}
return {
'overall_metrics': {
'total_tests': len(self.results),
'execution_success_rate': self.results['execution_successful'].mean(),
'average_syntax_accuracy': self.results['syntax_accuracy'].mean(),
'average_semantic_accuracy': self.results['semantic_accuracy'].mean(),
'average_result_accuracy': self.results['result_accuracy'].mean(),
'average_overall_accuracy': self.results['overall_accuracy'].mean()
},
# WHY: Difficulty breakdown reveals performance patterns
'by_difficulty': self.results.groupby('difficulty').agg({
'overall_accuracy': ['mean', 'std', 'count']
}).round(3).to_dict(),
# WHY: Distribution shows concentration of high/low accuracy
'performance_distribution': {
'high_accuracy_queries': (self.results['overall_accuracy'] >= 0.8).sum(),
'medium_accuracy_queries': ((self.results['overall_accuracy'] >= 0.6) &
(self.results['overall_accuracy'] < 0.8)).sum(),
'low_accuracy_queries': (self.results['overall_accuracy'] < 0.6).sum()
}
}
Individual Accuracy Metric Implementations
Syntax Accuracy: SQL Token Comparison
def _calculate_syntax_accuracy(self, expected_sql: str, generated_sql: str) -> float:
"""Measure SQL syntax and structure similarity through token analysis"""
# WHY: Token-based comparison reveals structural understanding
# HOW: Parse SQL into tokens and calculate overlap percentage
expected_tokens = self._parse_sql_tokens(expected_sql)
generated_tokens = self._parse_sql_tokens(generated_sql)
# Calculate Jaccard similarity: intersection / union
common_tokens = set(expected_tokens) & set(generated_tokens)
total_tokens = set(expected_tokens) | set(generated_tokens)
return len(common_tokens) / len(total_tokens) if total_tokens else 0.0
def _parse_sql_tokens(self, sql: str) -> List[str]:
"""Parse SQL into tokens for comparison"""
# WHY: Tokenization enables structural comparison beyond string matching
# Normalize SQL: uppercase keywords, split on whitespace and punctuation
sql_normalized = sql.upper()
tokens = []
# Simple tokenization (production systems should use SQL parser)
import re
tokens = re.findall(r'\b\w+\b', sql_normalized)
return tokens
Semantic Accuracy: Result Set Comparison
def _calculate_semantic_accuracy(self, expected_df: pd.DataFrame,
generated_df: pd.DataFrame) -> float:
"""Measure semantic equivalence of query results"""
# WHY: Results must match even if SQL structure differs
# Different SQL can produce identical results (semantic equivalence)
if expected_df.shape != generated_df.shape:
# Shape mismatch = wrong results
return 0.0
try:
# WHY: Sort both DataFrames for order-independent comparison
expected_sorted = expected_df.sort_values(
expected_df.columns.tolist()
).reset_index(drop=True)
generated_sorted = generated_df.sort_values(
generated_df.columns.tolist()
).reset_index(drop=True)
# HOW: Calculate row-wise accuracy percentage
matching_rows = 0
for i in range(len(expected_sorted)):
if expected_sorted.iloc[i].equals(generated_sorted.iloc[i]):
matching_rows += 1
return matching_rows / len(expected_sorted)
except:
return 0.0
Result Accuracy: Business Logic Validation
def _calculate_result_accuracy(self, expected_df: pd.DataFrame,
generated_df: pd.DataFrame) -> float:
"""Measure business value correctness beyond semantic equivalence"""
# WHY: Business rules may allow variations in semantically equivalent results
# Example: ORDER BY differences may be acceptable for some queries
if len(expected_df) != len(generated_df):
return 0.0
# HOW: Compare with tolerance for business-acceptable variations
# This implementation uses strict equality; customize for your business rules
return 1.0 if expected_df.equals(generated_df) else 0.0
def _calculate_schema_accuracy(self, expected_schema: Dict,
generated_columns: List[str]) -> float:
"""Measure column selection correctness"""
# WHY: Correct columns in correct order indicates schema understanding
expected_columns = list(expected_schema.keys())
# Calculate column overlap
common_columns = set(expected_columns) & set(generated_columns)
total_columns = set(expected_columns) | set(generated_columns)
column_accuracy = len(common_columns) / len(total_columns) if total_columns else 0.0
# Bonus for correct ordering (if columns match exactly)
order_bonus = 0.0
if expected_columns == generated_columns:
order_bonus = 0.2
return min(1.0, column_accuracy + order_bonus)
💡 Best Practice: Customize accuracy calculations based on your business requirements. Some use cases prioritize column accuracy over result ordering, others require exact matches.
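As one example of such a customization (a sketch, not part of the tester above), you could relax row-order matching unless the question itself implies ordering; the keyword list is illustrative.
# Sketch: order-insensitive result comparison unless the question implies ordering.
ORDER_SENSITIVE_HINTS = ('top ', 'first ', 'highest', 'lowest', 'latest', 'rank')

def result_accuracy_with_rules(question: str, expected_df: pd.DataFrame,
                               generated_df: pd.DataFrame) -> float:
    """Strict match when order matters; order-insensitive comparison otherwise."""
    order_matters = any(hint in question.lower() for hint in ORDER_SENSITIVE_HINTS)
    if order_matters:
        return 1.0 if expected_df.equals(generated_df) else 0.0
    cols = expected_df.columns.tolist()
    if set(cols) != set(generated_df.columns):
        return 0.0
    exp = expected_df.sort_values(cols).reset_index(drop=True)
    gen = generated_df[cols].sort_values(cols).reset_index(drop=True)
    return 1.0 if exp.equals(gen) else 0.0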
Measuring Confidence vs Actual Accuracy
Vanna AI provides confidence scores for generated SQL, but initial deployments show weak correlation between confidence and actual accuracy. You need to calibrate confidence scores to establish reliability thresholds for production routing decisions.
Confidence Calibration Framework
class ConfidenceCalibrator:
"""
Calibrate Vanna confidence scores against actual accuracy.
This enables production routing: high-confidence queries go to direct
execution, low-confidence queries go to human review.
"""
def __init__(self, vanna_instance):
# WHY: Store Vanna instance to extract confidence metadata
self.vanna = vanna_instance
self.calibration_data = []
def collect_confidence_accuracy_pairs(self, test_results: pd.DataFrame):
"""Collect confidence scores and actual accuracy for calibration"""
# WHY: Build dataset correlating confidence predictions with outcomes
for _, result in test_results.iterrows():
if result['execution_successful']:
# Extract confidence score from Vanna metadata
confidence = self._extract_confidence_score(result['generated_sql'])
actual_accuracy = result['overall_accuracy']
self.calibration_data.append({
'confidence_score': confidence,
'actual_accuracy': actual_accuracy,
'test_id': result['test_id']
})
def _extract_confidence_score(self, generated_sql: str) -> float:
"""Extract confidence score from Vanna's response metadata"""
# WHY: Vanna provides confidence in metadata (implementation-dependent)
# HOW: Access last query's confidence score
try:
# Implementation depends on Vanna version
return self.vanna.get_last_confidence_score()
except:
return 0.5 # Default neutral confidence if unavailable
ℹ️ Note: Confidence score extraction depends on your Vanna version and configuration. Consult Vanna documentation for the appropriate metadata access method.
Analyzing Calibration Quality
def analyze_confidence_calibration(self) -> Dict:
"""Analyze how well confidence scores predict actual accuracy"""
if not self.calibration_data:
return {}
df = pd.DataFrame(self.calibration_data)
# WHY: Bin confidence scores to analyze accuracy within ranges
# HOW: Create 5 bins from very low to very high confidence
df['confidence_bin'] = pd.cut(
df['confidence_score'],
bins=[0, 0.3, 0.5, 0.7, 0.9, 1.0],
labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
)
# Calculate average accuracy per confidence bin
calibration_analysis = df.groupby('confidence_bin').agg({
'actual_accuracy': ['mean', 'std', 'count'],
'confidence_score': 'mean'
}).round(3)
# WHY: Correlation coefficient reveals calibration quality
# 0.3-0.5 = poor, 0.7-0.8 = good, >0.8 = excellent
correlation = df['confidence_score'].corr(df['actual_accuracy'])
return {
'calibration_by_confidence': calibration_analysis.to_dict(),
'correlation': correlation,
'reliability_threshold': self._find_reliability_threshold(df)
}
def _find_reliability_threshold(self, df: pd.DataFrame) -> float:
"""Find confidence threshold where accuracy consistently exceeds target"""
# WHY: Establish threshold for production routing decisions
# Queries above threshold go to direct execution
# Queries below threshold go to human review
target_accuracy = 0.8
for threshold in [0.5, 0.6, 0.7, 0.8, 0.9]:
high_confidence_queries = df[df['confidence_score'] >= threshold]
# Require minimum sample size for statistical validity
if len(high_confidence_queries) > 10:
avg_accuracy = high_confidence_queries['actual_accuracy'].mean()
if avg_accuracy >= target_accuracy:
return threshold
return 0.9 # Conservative default if no threshold meets criteria
Confidence Calibration Metrics
| Confidence Range | Initial Accuracy | After Training | Sample Size |
|---|---|---|---|
| Very Low (0-0.3) | 40-50% | 45-55% | Need 20+ samples |
| Low (0.3-0.5) | 50-60% | 60-70% | Need 20+ samples |
| Medium (0.5-0.7) | 60-70% | 70-80% | Need 20+ samples |
| High (0.7-0.9) | 70-80% | 80-90% | Need 20+ samples |
| Very High (0.9-1.0) | 75-85% | 85-95% | Need 20+ samples |
💡 Production Strategy: Route queries with confidence ≥ 0.7 to direct execution, confidence 0.5-0.7 to automated validation, and confidence < 0.5 to human review.
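A minimal sketch of that routing policy follows, using the thresholds above. generate_sql comes from Vanna as shown earlier; get_confidence, execute_query, validate_automatically, and queue_for_review are hypothetical helpers that wrap your own confidence lookup, execution, validation, and review workflows.
# Sketch: confidence-based routing. Helper functions are placeholders.
def route_query(question: str, vanna_instance,
                direct_threshold: float = 0.7,
                review_threshold: float = 0.5):
    sql = vanna_instance.generate_sql(question)
    confidence = get_confidence(vanna_instance, sql)  # e.g. wrap ConfidenceCalibrator._extract_confidence_score
    if confidence >= direct_threshold:
        return execute_query(sql)            # trusted: run directly
    if confidence >= review_threshold:
        return validate_automatically(sql)   # middle band: automated checks before execution
    return queue_for_review(question, sql)   # low confidence: human in the loop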
A/B Testing Different Training Approaches
Systematic improvement requires controlled experimentation. You create training variants with different strategies, test each against your test suite, and compare results to identify optimal approaches.
Training Experiment Framework
class VannaTrainingExperiment:
"""
A/B testing framework for training approach comparison.
Test different training data selection strategies to identify
the most effective approach for your specific use case.
"""
def __init__(self, base_vanna_instance, test_database):
self.base_vanna = base_vanna_instance
self.test_db = test_database
self.experiments = {}
self.results = {}
def create_training_variant(self, variant_name: str, training_config: Dict):
"""Create a new training variant for A/B testing"""
# WHY: Independent variants enable controlled comparison
variant_vanna = vanna.create_instance(training_config)
self.experiments[variant_name] = {
'instance': variant_vanna,
'config': training_config,
'training_data': []
}
Common Training Approaches to Test
def add_training_approaches(self):
"""Define different training approaches for systematic comparison"""
# APPROACH A: High-volume general training
# ✅ Pros: Broad coverage, handles diverse queries
# ❌ Cons: Dilutes domain-specific patterns, longer training time
# Best for: General-purpose deployments with diverse query types
self.create_training_variant('high_volume_general', {
'training_data_size': 1000,
'data_selection': 'random_diverse',
'fine_tuning_epochs': 3,
'learning_rate': 0.001
})
# APPROACH B: Curated domain-specific training
# ✅ Pros: Higher accuracy on domain queries, faster training
# ❌ Cons: May struggle with out-of-domain queries
# Best for: Specialized applications with well-defined query patterns
self.create_training_variant('curated_domain', {
'training_data_size': 300,
'data_selection': 'domain_specific',
'fine_tuning_epochs': 5,
'learning_rate': 0.0005
})
# APPROACH C: Iterative feedback-based training
# ✅ Pros: Adapts to user needs, improves over time
# ❌ Cons: Requires user feedback infrastructure
# Best for: Production systems with active user feedback
self.create_training_variant('iterative_feedback', {
'training_data_size': 500,
'data_selection': 'feedback_weighted',
'fine_tuning_epochs': 4,
'learning_rate': 0.0008,
'use_human_feedback': True
})
📋 Comparison Matrix:
| Approach | Data Size | Training Time | Initial Accuracy | Best Use Case |
|---|---|---|---|---|
| High-Volume | 1000+ | 3-5 hours | 65-75% | Diverse query patterns |
| Curated Domain | 300-500 | 1-2 hours | 70-80% | Specialized applications |
| Feedback-Based | 500+ growing | 2-3 hours initial | 68-78% | Production with feedback |
Executing Training Experiments
def execute_training_experiment(self, training_data: List[Dict],
test_cases: List[QueryTestCase]):
"""Execute A/B test across all training variants"""
results = {}
for variant_name, variant_config in self.experiments.items():
print(f"Training variant: {variant_name}")
# WHY: Prepare data according to variant strategy
filtered_training_data = self._prepare_training_data(
training_data, variant_config['config']
)
# WHY: Train the variant with strategy-specific data
variant_instance = variant_config['instance']
self._train_variant(
variant_instance,
filtered_training_data,
variant_config['config']
)
# WHY: Test all variants against same test suite for fair comparison
tester = VannaAccuracyTester(variant_instance, self.test_db)
tester.test_cases = test_cases
test_results = tester.run_full_test_suite()
# Collect comprehensive results
results[variant_name] = {
'config': variant_config['config'],
'training_data_count': len(filtered_training_data),
'test_results': test_results,
'summary_metrics': tester.generate_accuracy_report()
}
self.results = results
return results
Training Data Selection Strategies
def _prepare_training_data(self, training_data: List[Dict],
config: Dict) -> List[Dict]:
"""Filter and prepare training data based on variant configuration"""
data_selection = config.get('data_selection', 'random_diverse')
target_size = config.get('training_data_size', 500)
if data_selection == 'random_diverse':
# WHY: Random sampling with diversity constraints
# HOW: Ensure representation across query types, complexity levels
return self._sample_diverse_training_data(training_data, target_size)
elif data_selection == 'domain_specific':
# WHY: Focus on domain-specific high-quality examples
# HOW: Filter by business domain tags, quality scores
return self._select_domain_specific_data(training_data, target_size)
elif data_selection == 'feedback_weighted':
# WHY: Weight by human feedback scores
# HOW: Prioritize examples with high user ratings
return self._select_feedback_weighted_data(training_data, target_size)
return training_data[:target_size]
def _sample_diverse_training_data(self, training_data: List[Dict],
target_size: int) -> List[Dict]:
"""Sample training data ensuring diversity across categories"""
# WHY: Diverse training prevents overfitting to specific patterns
# Implementation: stratified sampling across difficulty and category
# (Simplified - production code should stratify by difficulty and category,
# e.g. with pandas groupby().sample() or sklearn's StratifiedShuffleSplit)
import random
return random.sample(training_data, min(target_size, len(training_data)))
Analyzing Experiment Results
def analyze_experiment_results(self) -> Dict:
"""Analyze A/B test results and identify best approach"""
if not self.results:
return {}
comparison = {}
# WHY: Compare variants across key metrics
for variant_name, variant_results in self.results.items():
metrics = variant_results['summary_metrics']['overall_metrics']
comparison[variant_name] = {
'overall_accuracy': metrics['average_overall_accuracy'],
'execution_success_rate': metrics['execution_success_rate'],
'training_data_size': variant_results['training_data_count'],
# WHY: Cost efficiency = accuracy per training example
'cost_efficiency': metrics['average_overall_accuracy'] /
variant_results['training_data_count']
}
# Find best performing variant
best_variant = max(
comparison.keys(),
key=lambda x: comparison[x]['overall_accuracy']
)
return {
'variant_comparison': comparison,
'best_variant': best_variant,
'improvement_vs_baseline': self._calculate_improvement_vs_baseline(comparison),
'recommendations': self._generate_training_recommendations(comparison)
}
💡 Observed Pattern: Across government implementations, curated domain-specific training with 300 examples typically outperforms high-volume general training with 1000 examples, while requiring 60% less training time.
Creating Feedback Loops for Continuous Improvement
Production Vanna AI systems require ongoing maintenance and improvement. You build automated feedback collection, quality monitoring, and incremental training to maintain accuracy as data and usage evolve.
Continuous Improvement System
class ContinuousImprovementSystem:
"""
Automated feedback collection and model improvement system.
Collects user feedback, identifies improvement opportunities,
and triggers automated retraining when accuracy degrades.
"""
def __init__(self, vanna_instance, test_database_connection, accuracy_threshold=0.8):
self.vanna = vanna_instance
self.test_db = test_database_connection  # used later by automated_quality_monitoring
self.accuracy_threshold = accuracy_threshold
self.feedback_queue = []
self.improvement_metrics = {}
def collect_user_feedback(self, query: str, generated_sql: str,
user_rating: int, corrected_sql: str = None):
"""Collect user feedback for continuous improvement"""
# WHY: User corrections are highest-quality training examples
# HOW: Queue feedback for batch processing
feedback_entry = {
'timestamp': datetime.now(),
'natural_language_query': query,
'generated_sql': generated_sql,
'user_rating': user_rating, # 1-5 scale
'corrected_sql': corrected_sql,
'feedback_type': 'correction' if corrected_sql else 'rating'
}
self.feedback_queue.append(feedback_entry)
# WHY: Trigger improvement when sufficient feedback accumulated
if len(self.feedback_queue) >= 50:
self.process_feedback_batch()
⚠️ Important: User feedback collection requires appropriate privacy controls and data retention policies, especially for government systems. Implement anonymization and consent mechanisms.
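A minimal sketch of one such control, assuming feedback entries carry a user identifier: hash it with a deployment-specific salt before the entry is stored. The salt source and field name are illustrative.
# Sketch: pseudonymize the user identifier before a feedback entry is persisted.
import hashlib
import os

FEEDBACK_SALT = os.environ.get("FEEDBACK_SALT", "change-me")  # load from a secrets manager in practice

def anonymize_user_id(user_id: str) -> str:
    """One-way hash so feedback can be grouped per user without storing identity."""
    return hashlib.sha256(f"{FEEDBACK_SALT}:{user_id}".encode()).hexdigest()[:16]

# feedback_entry['user_id'] = anonymize_user_id(raw_user_id)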
Processing Feedback Batches
def process_feedback_batch(self):
"""Process accumulated feedback and improve model"""
if len(self.feedback_queue) < 10:
return
feedback_df = pd.DataFrame(self.feedback_queue)
# WHY: Identify patterns in low-rated queries
low_rated_queries = feedback_df[feedback_df['user_rating'] <= 2]
corrected_queries = feedback_df[feedback_df['corrected_sql'].notna()]
# WHY: User corrections are gold standard training examples
new_training_examples = []
for _, correction in corrected_queries.iterrows():
new_training_examples.append({
'question': correction['natural_language_query'],
'sql': correction['corrected_sql'],
'source': 'user_correction',
'confidence': 1.0 # High confidence in human corrections
})
# Retrain with validated examples
if new_training_examples:
self.incremental_training(new_training_examples)
# Clear processed feedback
self.feedback_queue = []
def incremental_training(self, new_examples: List[Dict]):
"""Perform incremental training with new examples"""
# WHY: Incremental training adds examples without full retraining
# HOW: Vanna supports adding training examples to existing model
print(f"Performing incremental training with {len(new_examples)} new examples")
for example in new_examples:
self.vanna.train(
question=example['question'],
sql=example['sql']
)
# Track improvement metrics
self.improvement_metrics[datetime.now()] = {
'new_examples_added': len(new_examples),
'total_training_examples': self.vanna.get_training_data_count(),
'improvement_trigger': 'user_feedback'
}
Automated Quality Monitoring
def automated_quality_monitoring(self, test_cases: List[QueryTestCase]):
"""Monitor quality and trigger retraining when accuracy drops"""
# WHY: Accuracy degrades over time as data and usage patterns evolve
# HOW: Run test suite periodically, trigger improvement if below threshold
tester = VannaAccuracyTester(self.vanna, self.test_db)
tester.test_cases = test_cases
current_results = tester.run_full_test_suite()
current_accuracy = current_results['overall_accuracy'].mean()
# Check if accuracy has dropped below threshold
if current_accuracy < self.accuracy_threshold:
print(f"Accuracy dropped to {current_accuracy:.3f}, triggering improvement cycle")
self.trigger_improvement_cycle(current_results)
return current_accuracy
def trigger_improvement_cycle(self, poor_performance_results: pd.DataFrame):
"""Trigger comprehensive improvement when performance drops"""
# WHY: Targeted improvement addresses specific failure patterns
# Identify worst-performing query types
poor_queries = poor_performance_results[
poor_performance_results['overall_accuracy'] < 0.5
]
# Generate additional training examples for problem areas
improvement_examples = self.generate_targeted_training_examples(poor_queries)
# Perform focused retraining
self.incremental_training(improvement_examples)
# Schedule follow-up testing
self.schedule_accuracy_retest()
When to Retrain vs Fine-tune
One of the most critical decisions in maintaining Vanna AI accuracy is choosing between full retraining and incremental fine-tuning. You use different approaches based on accuracy degradation severity, schema changes, and business logic updates.
Decision Framework
Retrain Completely When:
- Accuracy drops below 60% on core query types
- Schema changes affect more than 30% of tables
- Business logic requirements fundamentally change
- Initial training data quality was poor
- Correlation between confidence and accuracy deteriorates
Fine-tune When:
- Accuracy is 65-80% and needs incremental improvement
- New query patterns emerge but core functionality works
- User feedback provides specific corrections
- Adding new data sources to existing schema
- Confidence calibration needs adjustment
Training Decision Engine
class TrainingDecisionEngine:
"""
Automated decision system for retrain vs. fine-tune choices.
Analyzes current accuracy, schema changes, and accuracy trends
to recommend optimal training approach.
"""
def __init__(self, accuracy_history: List[float], schema_change_rate: float):
self.accuracy_history = accuracy_history
self.schema_change_rate = schema_change_rate
self.decision_criteria = {
'accuracy_threshold_retrain': 0.60,
'accuracy_threshold_finetune': 0.65,
'schema_change_threshold': 0.30,
'accuracy_decline_threshold': 0.15
}
def recommend_training_approach(self, current_accuracy: float,
recent_schema_changes: int,
total_tables: int) -> Dict:
"""Recommend whether to retrain or fine-tune based on current state"""
schema_change_percentage = recent_schema_changes / total_tables
accuracy_decline = max(self.accuracy_history) - current_accuracy if self.accuracy_history else 0
recommendation = {
'approach': 'maintain',
'confidence': 0.5,
'reasoning': [],
'estimated_improvement': 0.0,
'estimated_effort_hours': 0
}
# WHY: Full retraining for critical accuracy drops
if current_accuracy < self.decision_criteria['accuracy_threshold_retrain']:
recommendation.update({
'approach': 'full_retrain',
'confidence': 0.9,
'reasoning': [f'Accuracy {current_accuracy:.3f} below critical threshold'],
'estimated_improvement': 0.20,
'estimated_effort_hours': 40
})
# WHY: Full retraining for major schema changes
elif schema_change_percentage > self.decision_criteria['schema_change_threshold']:
recommendation.update({
'approach': 'full_retrain',
'confidence': 0.8,
'reasoning': [f'Major schema changes: {schema_change_percentage:.1%}'],
'estimated_improvement': 0.15,
'estimated_effort_hours': 35
})
# WHY: Fine-tuning for moderate accuracy issues
elif (current_accuracy < self.decision_criteria['accuracy_threshold_finetune'] or
accuracy_decline > self.decision_criteria['accuracy_decline_threshold']):
recommendation.update({
'approach': 'fine_tune',
'confidence': 0.7,
'reasoning': [
f'Accuracy {current_accuracy:.3f} suggests incremental improvement needed',
f'Recent decline: {accuracy_decline:.3f}'
],
'estimated_improvement': 0.10,
'estimated_effort_hours': 15
})
return recommendation
Tracking Query Performance Metrics
Comprehensive performance tracking extends beyond accuracy to include operational metrics affecting user experience: execution time, result counts, query complexity, and user satisfaction.
Performance Tracking System
class QueryPerformanceTracker:
"""
Production query performance monitoring system.
Tracks execution time, result quality, complexity, and user
satisfaction to identify optimization opportunities.
"""
def __init__(self):
self.performance_log = []
self.metrics_cache = {}
def log_query_performance(self, query: str, sql: str, execution_time: float,
result_count: int, accuracy_score: float,
user_satisfaction: int):
"""Log comprehensive performance metrics for each query"""
performance_entry = {
'timestamp': datetime.now(),
'query_hash': hash(query),
'natural_language_query': query,
'generated_sql': sql,
'execution_time_seconds': execution_time,
'result_count': result_count,
'accuracy_score': accuracy_score,
'user_satisfaction': user_satisfaction,
'query_complexity': self._calculate_query_complexity(sql),
'token_count': len(query.split()),
'sql_length': len(sql)
}
self.performance_log.append(performance_entry)
def _calculate_query_complexity(self, sql: str) -> str:
"""Categorize query complexity based on SQL features"""
# WHY: Complexity affects accuracy and execution time
# HOW: Count SQL complexity indicators
sql_lower = sql.lower()
complexity_indicators = {
'joins': sql_lower.count('join'),
'subqueries': sql_lower.count('select') - 1,
'aggregations': sum([
sql_lower.count(func)
for func in ['sum(', 'count(', 'avg(', 'max(', 'min(']
]),
'conditions': sql_lower.count('where') + sql_lower.count('having')
}
total_complexity = sum(complexity_indicators.values())
if total_complexity <= 2:
return 'simple'
elif total_complexity <= 5:
return 'moderate'
else:
return 'complex'
Performance Dashboard Generation
def generate_performance_dashboard(self) -> Dict:
"""Generate comprehensive performance analytics"""
if not self.performance_log:
return {}
df = pd.DataFrame(self.performance_log)
# WHY: Time-based trends reveal degradation or improvement
df['date'] = df['timestamp'].dt.date
daily_metrics = df.groupby('date').agg({
'accuracy_score': 'mean',
'execution_time_seconds': 'mean',
'user_satisfaction': 'mean',
'query_hash': 'nunique'
}).rename(columns={'query_hash': 'unique_queries'})
# WHY: Complexity analysis reveals performance patterns
complexity_metrics = df.groupby('query_complexity').agg({
'accuracy_score': ['mean', 'std'],
'execution_time_seconds': ['mean', 'std'],
'user_satisfaction': 'mean'
})
# Performance distribution
performance_distribution = {
'high_accuracy_queries': (df['accuracy_score'] >= 0.8).mean(),
'fast_queries': (df['execution_time_seconds'] <= 2.0).mean(),
'satisfied_users': (df['user_satisfaction'] >= 4).mean()
}
return {
'daily_trends': daily_metrics.to_dict(),
'complexity_analysis': complexity_metrics.to_dict(),
'performance_distribution': performance_distribution,
'overall_metrics': {
'average_accuracy': df['accuracy_score'].mean(),
'average_execution_time': df['execution_time_seconds'].mean(),
'average_satisfaction': df['user_satisfaction'].mean(),
'total_queries_processed': len(df)
}
}
Cost-Benefit Analysis of Training Investment
Training investment decisions require clear ROI analysis. You calculate training costs (engineering time, compute resources, data preparation) against expected accuracy improvements and business value.
Training ROI Calculator
class TrainingROICalculator:
"""
ROI analysis for training investment decisions.
Calculates training costs, expected benefits, payback period,
and ROI percentage to justify training investments to stakeholders.
"""
def __init__(self, current_accuracy: float, baseline_metrics: Dict):
self.current_accuracy = current_accuracy
self.baseline_metrics = baseline_metrics
self.cost_factors = {
'engineer_hourly_rate': 75, # Senior engineer rate
'compute_cost_per_hour': 12, # GPU training costs
'data_scientist_hourly_rate': 85,
'opportunity_cost_multiplier': 1.5
}
def calculate_training_investment_cost(self, training_approach: str) -> Dict:
"""Calculate total cost of training investment"""
training_costs = {
'fine_tune': {
'engineer_hours': 15,
'data_scientist_hours': 8,
'compute_hours': 4,
'data_preparation_hours': 12
},
'full_retrain': {
'engineer_hours': 40,
'data_scientist_hours': 25,
'compute_hours': 15,
'data_preparation_hours': 30
},
'continuous_improvement_setup': {
'engineer_hours': 60,
'data_scientist_hours': 20,
'compute_hours': 8,
'data_preparation_hours': 25
}
}
costs = training_costs.get(training_approach, training_costs['fine_tune'])
# WHY: Calculate direct costs across all resource types
total_cost = (
costs['engineer_hours'] * self.cost_factors['engineer_hourly_rate'] +
costs['data_scientist_hours'] * self.cost_factors['data_scientist_hourly_rate'] +
costs['compute_hours'] * self.cost_factors['compute_cost_per_hour'] +
costs['data_preparation_hours'] * self.cost_factors['engineer_hourly_rate']
)
# WHY: Opportunity cost accounts for other work not done
total_cost *= self.cost_factors['opportunity_cost_multiplier']
return {
'direct_cost': total_cost / self.cost_factors['opportunity_cost_multiplier'],
'opportunity_cost': total_cost - (total_cost / self.cost_factors['opportunity_cost_multiplier']),
'total_investment': total_cost,
'cost_breakdown': costs
}
Estimating Improvement Value
def estimate_accuracy_improvement_value(self, target_accuracy: float,
monthly_queries: int) -> Dict:
"""Calculate business value of accuracy improvement"""
accuracy_improvement = target_accuracy - self.current_accuracy
# WHY: Value calculations based on business impact
value_per_accurate_query = 2.50 # Estimated value per correctly answered query
cost_per_incorrect_query = 15.00 # Cost of manual correction/confusion
current_monthly_value = monthly_queries * self.current_accuracy * value_per_accurate_query
target_monthly_value = monthly_queries * target_accuracy * value_per_accurate_query
current_monthly_cost = monthly_queries * (1 - self.current_accuracy) * cost_per_incorrect_query
target_monthly_cost = monthly_queries * (1 - target_accuracy) * cost_per_incorrect_query
monthly_benefit = (target_monthly_value - current_monthly_value) + (current_monthly_cost - target_monthly_cost)
annual_benefit = monthly_benefit * 12
return {
'accuracy_improvement': accuracy_improvement,
'monthly_benefit': monthly_benefit,
'annual_benefit': annual_benefit,
'value_metrics': {
'additional_accurate_queries_monthly': monthly_queries * accuracy_improvement,
'reduced_error_cost_monthly': (current_monthly_cost - target_monthly_cost),
'increased_value_monthly': (target_monthly_value - current_monthly_value)
}
}
Complete ROI Analysis
def calculate_roi(self, training_approach: str, target_accuracy: float,
monthly_queries: int, time_horizon_months: int = 12) -> Dict:
"""Calculate complete ROI analysis for training investment"""
investment_cost = self.calculate_training_investment_cost(training_approach)
improvement_value = self.estimate_accuracy_improvement_value(
target_accuracy, monthly_queries
)
total_benefit = improvement_value['monthly_benefit'] * time_horizon_months
total_cost = investment_cost['total_investment']
roi_percentage = ((total_benefit - total_cost) / total_cost) * 100 if total_cost > 0 else 0
payback_months = total_cost / improvement_value['monthly_benefit'] if improvement_value['monthly_benefit'] > 0 else float('inf')
return {
'investment_summary': {
'total_cost': total_cost,
'total_benefit': total_benefit,
'net_benefit': total_benefit - total_cost,
'roi_percentage': roi_percentage,
'payback_period_months': payback_months
},
'recommendation': self._generate_roi_recommendation(roi_percentage, payback_months),
'sensitivity_analysis': self._perform_sensitivity_analysis(
training_approach, target_accuracy, monthly_queries
)
}
def _generate_roi_recommendation(self, roi_percentage: float,
payback_months: float) -> str:
"""Generate investment recommendation based on ROI metrics"""
if roi_percentage >= 200 and payback_months <= 6:
return "Strong recommendation: High ROI with quick payback"
elif roi_percentage >= 100 and payback_months <= 12:
return "Recommended: Positive ROI within reasonable timeframe"
elif roi_percentage >= 50 and payback_months <= 18:
return "Consider: Moderate ROI, evaluate against other priorities"
else:
return "Not recommended: Low ROI or extended payback period"Troubleshooting Common Accuracy Issues
When accuracy falls below expectations, you need systematic approaches to diagnose and resolve problems. Common issues fall into categories: data quality, schema understanding, query complexity, and configuration problems.
Issue 1: Low Syntax Accuracy
Symptoms:
- Generated SQL fails to execute
- SQL syntax errors in logs
- Execution success rate below 80%
Cause: Training data contains malformed SQL or inconsistent syntax patterns. Vanna learns incorrect SQL structure from low-quality examples.
Solution:
# Validate all training SQL before adding to training set
def validate_training_sql(sql: str, db_connection) -> bool:
"""Validate SQL executes without errors"""
try:
pd.read_sql(f"EXPLAIN {sql}", db_connection)
return True
except:
return False
# Filter training data to only include valid SQL
validated_training_data = [
example for example in training_data
if validate_training_sql(example['sql'], db_connection)
]
Issue 2: Confidence-Accuracy Mismatch
Symptoms:
- High confidence queries returning wrong results
- Low confidence queries working correctly
- Correlation coefficient below 0.5
Cause: Insufficient calibration data or training data quality issues causing confidence miscalibration.
Solution: Collect more calibration data across confidence ranges and retrain with higher-quality examples:
# Increase calibration dataset size
calibrator = ConfidenceCalibrator(vanna_instance)
calibrator.collect_confidence_accuracy_pairs(test_results)
# Focus training on high-confidence failures
high_conf_failures = test_results[
(test_results['confidence_score'] > 0.7) &
(test_results['overall_accuracy'] < 0.6)
]
# Add corrected versions to training data
for _, failure in high_conf_failures.iterrows():
vanna.train(
question=failure['natural_language'],
sql=failure['expected_sql']
)
Issue 3: Poor Performance on Complex Queries
Symptoms:
- Simple queries work well (80%+ accuracy)
- Complex queries fail consistently (below 50% accuracy)
- Accuracy degrades with JOIN complexity
Cause: Insufficient training examples for complex query patterns. Simple queries dominate training data.
Solution: Balance training data across complexity levels:
# Analyze training data complexity distribution
complexity_counts = {}
for example in training_data:
complexity = calculate_query_complexity(example['sql'])
complexity_counts[complexity] = complexity_counts.get(complexity, 0) + 1
# Target: 40% simple, 40% moderate, 20% complex
# Add more complex examples if underrepresented
if complexity_counts.get('complex', 0) < len(training_data) * 0.2:
# Generate or collect additional complex query examples
additional_complex = collect_complex_query_examples(count=50)
training_data.extend(additional_complex)
Issue 4: Schema Accuracy Problems
Symptoms:
- Wrong columns selected
- Missing required columns
- Joins on incorrect foreign keys
Cause: Vanna doesn’t understand schema relationships or column semantics.
Solution: Enhance schema documentation and add relationship examples:
# Provide detailed schema documentation
vanna.train(ddl="""
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
department_id INT, -- References departments.department_id
name VARCHAR(100),
hire_date DATE,
salary DECIMAL(10,2)
);
CREATE TABLE departments (
department_id INT PRIMARY KEY,
department_name VARCHAR(100),
budget DECIMAL(12,2)
);
""")
# Add examples demonstrating correct relationships
vanna.train(
question="Show employees with their department names",
sql="""
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id
"""
)
FAQ: Accuracy Measurement
❓ How many test cases do I need for reliable accuracy measurement?
You need 150-200 test cases minimum for production systems. Include:
- 40-50% simple queries (single table, basic filters)
- 40-50% moderate queries (2-3 table joins, aggregations)
- 10-20% complex queries (multiple joins, subqueries, advanced logic)
Ensure coverage across all major query patterns, schema areas, and business use cases.
❓ What’s a realistic accuracy target for initial deployment?
Initial deployments typically achieve 60-70% blended accuracy. After 3 months of optimization, expect 75-85%. Mature implementations (6+ months) reach 80-90%.
Don’t aim for 100% accuracy—focus on predictable, reliable performance with clear confidence indicators.
❓ How often should I retrain vs. fine-tune?
- Fine-tune: Every 2-4 weeks with user feedback (10-20 new examples)
- Full retrain: Quarterly or when accuracy drops below 60%
- Continuous improvement: Automated incremental training with validated user corrections
Monitor accuracy weekly. Trigger retraining when accuracy drops 15% from peak.
❓ Do I need separate test suites for each environment?
Yes, if schemas differ between environments. Maintain:
- Development suite: Comprehensive coverage, all edge cases
- Staging suite: Production-like data, realistic query distribution
- Production monitoring: Automated testing on anonymized queries
Development and staging suites can overlap significantly. Production monitoring focuses on regression detection.
❓ How do I measure accuracy for queries with no right answer?
For exploratory or analytical queries without predetermined answers:
- Validate SQL syntax correctness
- Check schema appropriateness (correct tables/columns)
- Verify business logic constraints (date ranges, filters)
- Use expert review for semantic correctness
Create “acceptable answer” ranges rather than exact matches.
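For numeric answers, one practical form of an acceptable range is a relative tolerance around the expert-approved value. Here is a minimal sketch; the 2% tolerance is illustrative.
# Sketch: accept a generated numeric answer within a relative tolerance
# of the expert-approved value. The 2% default is illustrative.
def within_acceptable_range(expected_value: float, generated_value: float,
                            rel_tolerance: float = 0.02) -> bool:
    if expected_value == 0:
        return abs(generated_value) <= rel_tolerance
    return abs(generated_value - expected_value) / abs(expected_value) <= rel_tolerance

# Example: a revenue total within 2% of the approved figure counts as correct
within_acceptable_range(1_250_000, 1_262_400)  # True (~1% off)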
Testing Checklist
Before deploying accuracy improvements to production:
Test Suite Coverage
- 150+ test cases covering all query patterns
- Balance across complexity levels (simple/moderate/complex)
- Coverage of all major schema areas
- Edge cases and error conditions included
Baseline Metrics Established
- Current accuracy measured across all test cases
- Confidence calibration data collected (50+ examples per bin)
- Performance baselines documented (execution time, success rate)
- User satisfaction baseline recorded
Training Data Validated
- All training SQL validated for syntax correctness
- Training examples cover identified accuracy gaps
- Quality assessment completed (remove low-quality examples)
- Complexity distribution balanced
Improvement Approach Selected
- Fine-tune vs. full retrain decision made using decision engine
- Training approach tested via A/B experiment
- ROI analysis completed and approved
- Resources allocated (engineering time, compute)
Post-Training Validation
- Full test suite executed on improved model
- Accuracy improvements meet targets
- No regression on previously working queries
- Confidence calibration rechecked
Production Readiness
- Continuous improvement system configured
- Automated monitoring alerts set up
- Feedback collection mechanisms in place
- Rollback plan documented
Next Steps
Implement Your Testing Framework:
- Set up test database environment with representative schema
- Create 50 initial test cases across query types
- Implement VannaAccuracyTester class
- Run baseline accuracy measurement
Establish Continuous Monitoring:
- Deploy QueryPerformanceTracker in production
- Configure automated weekly test suite execution
- Set up accuracy degradation alerts (threshold: 15% drop)
- Implement user feedback collection
Optimize Training Strategy:
- Conduct A/B test of training approaches
- Select optimal approach based on ROI analysis
- Build automated feedback loop for continuous improvement
- Schedule quarterly full retraining reviews
Further Reading:
- Vanna AI Documentation - Official documentation and API reference
- Text-to-SQL Best Practices - Additional implementation guides
- Training Vanna AI: Avoiding Common Mistakes - Related article on training strategies
Need Help?
Implementing systematic accuracy measurement for enterprise Vanna AI deployments requires experience with both the technical framework and organizational change management. If you’re facing challenges with:
- Establishing baseline accuracy metrics
- Designing comprehensive test suites for complex schemas
- Justifying training investment to stakeholders
- Building continuous improvement systems
I offer architecture reviews and implementation guidance for enterprise Text-to-SQL systems. Having deployed these frameworks across 10+ government databases, I can help you navigate the technical and organizational challenges from initial measurement through production optimization.
Schedule a consultation to discuss your specific requirements.
💡 Remember: Accuracy measurement isn’t about achieving perfection—it’s about building predictable, reliable systems with clear understanding of capabilities and limitations. Focus on systematic measurement, continuous improvement, and transparent communication with stakeholders about realistic expectations.