Security & Privacy

Data Security FAQ

Last updated: October 2025 • Version 2.0 (Vanna v2.0.0 Agent Framework)

Overview

This FAQ addresses data security and privacy questions for developers and technical teams implementing Vanna AI. Vanna operates as a modular agent framework that can be deployed in multiple configurations, from fully self-hosted open-source installations to cloud-managed premium services.

Architecture & Deployment Models

What is the architecture of Vanna v2.0?

Vanna v2.0 is a modular agent framework built on clean abstractions:

  • Core Agent: Orchestrates LLM interactions with tool execution loops, conversation management, and streaming support
  • Tool System: Extensible tool registry with group-based access control
  • Storage Layer: Abstract interfaces for conversations, audit logs, and observability data
  • User Management: User resolution with group-based permissions (RBAC)
  • LLM Services: Pluggable integrations for Anthropic Claude, OpenAI GPT, and other providers

The framework provides 6 extensibility points: lifecycle hooks, middlewares, error recovery, context enrichers, conversation filters, and observability providers.
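To make the abstraction boundaries concrete, here is a minimal sketch of what these interfaces might look like as Python Protocols; the names and method signatures are illustrative, not the exact v2.0 API.

from typing import Any, Dict, List, Protocol

class LlmService(Protocol):
    """Pluggable LLM integration (Anthropic Claude, OpenAI GPT, custom)."""
    def complete(self, messages: List[Dict[str, Any]]) -> str: ...

class ConversationStore(Protocol):
    """Abstract conversation persistence (in-memory, database, or cloud)."""
    def save(self, conversation_id: str, messages: List[Dict[str, Any]]) -> None: ...
    def load(self, conversation_id: str) -> List[Dict[str, Any]]: ...

class AuditLogger(Protocol):
    """Abstract audit sink (local file, SIEM, or premium backend)."""
    def log_event(self, event_type: str, payload: Dict[str, Any]) -> None: ...

Swapping implementations behind abstractions like these is what distinguishes the deployment models below: the same agent code can point at local, hybrid, or cloud-backed components.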

What deployment models does Vanna support?

| Model | Description | Data Location | Use Case |
|---|---|---|---|
| Self-Hosted | Open-source Python package on your infrastructure | All data stays local | Maximum control, sensitive data, air-gapped environments |
| Cloud Premium | Fully managed Vanna premium services | Vanna cloud infrastructure | Rapid deployment, managed observability |
| Hybrid | Python runs locally, premium services for telemetry | Conversations local, telemetry in cloud | Balance between control and managed services |

What are the premium backend services?

Vanna's premium backend (written in Go) provides managed services:

  • Observability: Metrics and distributed tracing for monitoring agent performance
  • Audit Logging: Centralized audit event storage with query capabilities
  • Tool Registry: Shared tool schemas and templates across teams
  • Agent Memory: Semantic search over historical tool usage patterns
  • Conversation Management: Cloud-based conversation persistence
  • Analytics: Dashboard aggregations and usage statistics

Current Status: The premium backend is in development/demo stage with in-memory storage. Production deployment requires additional hardening (see "Production Considerations" section below).

Data Handling & Privacy

What data does the open-source Python package handle?

When running self-hosted, the Python package handles:

  1. Database Connections: Connection strings and credentials (stored locally, never transmitted)
  2. Training Data: DDL statements, documentation, SQL examples, and question-SQL pairs
  3. Conversation History: User messages and AI responses
  4. Tool Execution Data: Tool invocations, parameters, and results
  5. Audit Logs: Security events, access checks, and tool usage

Important: In self-hosted mode with local storage (e.g., MemoryConversationStore), all data stays on your infrastructure. No data is transmitted to Vanna servers or third-party services unless you explicitly configure premium integrations.
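To illustrate why this holds, an in-memory conversation store keeps everything inside the local Python process. The sketch below mimics the intent of MemoryConversationStore; the real class in the Vanna package may expose a different interface.

from typing import Any, Dict, List

class MemoryConversationStore:
    """Conversation data lives only in this process; nothing is transmitted."""
    def __init__(self) -> None:
        self._conversations: Dict[str, List[Dict[str, Any]]] = {}

    def append(self, conversation_id: str, message: Dict[str, Any]) -> None:
        self._conversations.setdefault(conversation_id, []).append(message)

    def get(self, conversation_id: str) -> List[Dict[str, Any]]:
        return list(self._conversations.get(conversation_id, []))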

What data is sent to Vanna's premium services?

When using premium backend services (opt-in), the following data may be transmitted via HTTPS:

| Data Type | Sent to Premium? | Purpose |
|---|---|---|
| Database credentials | Never | Always stored locally only |
| Training data (DDL, docs, SQL) | Yes (opt-in) | Enable semantic search and retrieval augmentation |
| Conversation messages | Yes (opt-in) | Persist conversations across sessions |
| Tool execution metadata | Yes (opt-in) | Centralized audit logging and analytics |
| Observability metrics/traces | Yes (opt-in) | Performance monitoring and debugging |

Authentication: All premium API requests use Authorization: Bearer {api_key} headers and X-Organization-ID for multi-tenancy isolation.
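The sketch below shows what such a request looks like at the HTTP level; the endpoint URL, payload shape, and environment variable names are placeholders, while the two header names come from the description above.

import os
import requests

response = requests.post(
    "https://premium.vanna.example/v1/audit-events",  # placeholder URL
    headers={
        "Authorization": f"Bearer {os.environ['VANNA_API_KEY']}",  # API key auth
        "X-Organization-ID": os.environ["VANNA_ORG_ID"],           # tenant isolation
    },
    json={"event_type": "tool_invocation", "tool": "run_sql"},     # example payload
    timeout=10,
)
response.raise_for_status()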

What data is sent to third-party LLMs (Anthropic, OpenAI)?

Vanna integrates with external LLM providers to generate responses. The following data is transmitted to the LLM provider you configure:

Always Sent:

  • User questions/messages
  • System prompts (including tool schemas)
  • Tool execution results (to provide context for follow-up responses)
  • Conversation history (for context)

Conditionally Sent:

  • Training data: DDL statements, documentation snippets, and example SQL from your retrieval augmentation layer

Never Sent:

  • Database credentials
  • Raw database connection strings

Transmission Security: All LLM API requests are sent via HTTPS. Vanna does not store this data in transit; it flows directly from your Python environment to the LLM provider and is then subject to that provider's retention policies (see the Anthropic and OpenAI privacy policies).
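For illustration, a request to the provider roughly contains the pieces below. The structure follows common chat-completions conventions; the exact format Vanna assembles may differ.

conversation_history = [
    {"role": "assistant", "content": "Previous answer about the orders table..."},
]
retrieved_context = [
    "CREATE TABLE orders (id INT, customer_id INT, total NUMERIC);",  # DDL from retrieval
    "Documentation: the 'total' column is in USD.",
]
llm_request = {
    "system": "You are a SQL assistant. Tools: run_sql(sql: str)",    # system prompt + tool schemas
    "messages": conversation_history + [
        {"role": "user", "content": "What were total sales last month?"},
    ],
    "context": retrieved_context,  # conditionally sent training data
}
# Note what is absent: no connection strings and no credentials.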

How are database credentials managed?

Database credentials are:

  1. Stored Locally: Credentials remain in your Python environment (environment variables, config files, or passed programmatically)
  2. Never Transmitted: Credentials are never sent to Vanna servers, premium backend, or LLM providers
  3. Not Logged: Audit logging automatically sanitizes sensitive parameters (password, secret, token, api_key, credential, etc.) before recording events

Best Practice: Use environment variables or secret management services (AWS Secrets Manager, HashiCorp Vault) rather than hardcoding credentials.
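A minimal sketch of that practice, assuming a Postgres connection and the hypothetical environment variable names shown; check your Vanna version for the exact connection helper before adapting it.

import os

db_config = {
    "host": os.environ["DB_HOST"],
    "dbname": os.environ["DB_NAME"],
    "user": os.environ["DB_USER"],
    "password": os.environ["DB_PASSWORD"],  # never hardcode this
}
# vn.connect_to_postgres(**db_config)  # credentials stay inside your process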

Security Controls

What user isolation and access control mechanisms exist?

Vanna v2.0 implements group-based access control (RBAC):

User Model:

from typing import List

class User:
    id: str                          # Unique user identifier
    username: str
    email: str
    group_memberships: List[str]     # e.g., ["admin", "analyst", "viewer"]

Access Control:

  1. Tool Access: Each tool specifies access_groups. Users can only invoke tools where their group memberships intersect with the tool's allowed groups (see the sketch after this list).
  2. UI Feature Access: Sensitive UI features (e.g., viewing tool arguments, error details) can be restricted by group.
  3. Conversation Isolation: All conversation storage operations validate that conversation.user.id == requesting_user.id.
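A minimal sketch of the group-intersection check, using illustrative field and function names rather than the framework's actual API:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Tool:
    name: str
    access_groups: List[str] = field(default_factory=list)

def can_invoke(user_groups: List[str], tool: Tool) -> bool:
    """A user may invoke a tool only if their groups intersect the tool's groups."""
    return bool(set(user_groups) & set(tool.access_groups))

run_sql = Tool(name="run_sql", access_groups=["admin", "analyst"])
print(can_invoke(["analyst"], run_sql))  # True
print(can_invoke(["viewer"], run_sql))   # False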

How does audit logging work?

Vanna provides comprehensive audit logging with automatic parameter sanitization:

Events Logged:

  • Tool Access Checks: User attempts to access tools (granted/denied)
  • Tool Invocations: Tool name, sanitized parameters, execution timestamp
  • Tool Results: Success/failure status, execution time, error messages
  • UI Feature Access: Which users accessed restricted UI features
  • AI Responses: Response metadata (length, hash, model used)

Parameter Sanitization:

The audit system automatically redacts sensitive patterns:

  • password, secret, token, api_key
  • credential, auth, private_key, access_key
  • Values replaced with [REDACTED]
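The sketch below captures the spirit of this redaction; the real implementation inside Vanna may use different patterns or hooks.

from typing import Any, Dict

SENSITIVE_PATTERNS = (
    "password", "secret", "token", "api_key",
    "credential", "auth", "private_key", "access_key",
)

def sanitize(params: Dict[str, Any]) -> Dict[str, Any]:
    """Replace values of sensitive-looking keys with a redaction marker."""
    return {
        key: "[REDACTED]" if any(p in key.lower() for p in SENSITIVE_PATTERNS) else value
        for key, value in params.items()
    }

print(sanitize({"sql": "SELECT 1", "db_password": "hunter2"}))
# {'sql': 'SELECT 1', 'db_password': '[REDACTED]'}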

What observability and monitoring capabilities exist?

The framework includes built-in observability:

Metrics:

  • Tool execution latency
  • LLM request duration
  • Error rates by tool
  • Conversation length statistics

Distributed Tracing:

  • Request-level tracing
  • Tool execution traces
  • LLM interaction traces
  • Custom span attributes

Providers: Local (in-memory), Premium (cloud-based), or Custom (implement ObservabilityProvider for Datadog, Prometheus, etc.)
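As a sketch of the Custom option, the provider below forwards metrics and spans to standard logging; the method names are assumptions, so consult the actual ObservabilityProvider interface before implementing one for Datadog or Prometheus.

import logging
from typing import Dict, Optional

logger = logging.getLogger("vanna.observability")

class LoggingObservabilityProvider:
    """Illustrative provider that writes metrics and spans to standard logging."""

    def record_metric(self, name: str, value: float,
                      labels: Optional[Dict[str, str]] = None) -> None:
        logger.info("metric %s=%s labels=%s", name, value, labels or {})

    def record_span(self, name: str, duration_ms: float,
                    attributes: Optional[Dict[str, str]] = None) -> None:
        logger.info("span %s took %.1fms attrs=%s", name, duration_ms, attributes or {})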

What extensibility points exist for custom security controls?

Vanna v2.0 provides multiple integration points for custom security:

  1. Custom User Resolver: Implement authentication (OAuth, JWT, SAML)
  2. Custom Audit Logger: Route audit events to your SIEM (Splunk, Datadog, etc.)
  3. Custom Conversation Store: Implement encrypted storage
  4. Lifecycle Hooks: Inject custom validation and security checks
  5. Middlewares: Intercept requests and responses for rate limiting and similar controls (see the sketch below)
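A minimal rate-limiting sketch of the kind a middleware could apply; the before_request hook signature is an assumption and should be adapted to the framework's actual middleware interface.

import time
from collections import defaultdict, deque

class RateLimitMiddleware:
    """Reject requests once a user exceeds max_requests within a rolling window."""

    def __init__(self, max_requests: int = 30, window_seconds: float = 60.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def before_request(self, user_id: str) -> None:
        now = time.monotonic()
        hits = self._hits[user_id]
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()  # drop hits that fell outside the window
        if len(hits) >= self.max_requests:
            raise RuntimeError(f"Rate limit exceeded for user {user_id}")
        hits.append(now)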

Production Deployment Considerations

What are the current limitations of the premium backend?

The premium backend is currently in development/demo stage with the following limitations:

| Limitation | Impact | Production Requirement |
|---|---|---|
| In-memory storage | Data lost on restart | Persistent database (PostgreSQL, MongoDB) |
| No encryption at rest | Unencrypted data | Database encryption |
| Wide-open CORS | CSRF risks | Restrict to known domains |
| No rate limiting | DoS vulnerability | Redis-backed rate limiting |
| No auth middleware | Open endpoints | JWT/OAuth authentication |

Recommendation: For production deployments, use self-hosted mode until premium services complete security hardening, or implement additional security layers (API gateway, VPN, etc.).

What security hardening is recommended for production?

Self-Hosted Deployments:

  • TLS/HTTPS for all API endpoints
  • Encryption at rest for conversation storage
  • Deploy behind firewall/VPN
  • Use secret management services (AWS Secrets Manager, Vault)
  • Enable comprehensive audit logging
  • Regular security log reviews

Premium/Hybrid Deployments (additional):

  • API key rotation
  • IP whitelisting
  • Rate limiting
  • Data retention policies aligned with GDPR/CCPA

How can I ensure GDPR/CCPA compliance?

Self-Hosted Deployments: You have full control and responsibility for compliance.

  • Data Minimization: Only collect necessary data
  • Right to Access: Implement endpoints to export user data
  • Right to Erasure: Implement deletion via conversation store APIs
  • Data Retention: Configure automatic cleanup of old conversations (see the sketch after this list)
  • Privacy Policy: Clearly disclose what data is sent to LLM providers
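A sketch of such a retention cleanup job; the store methods (list_conversations, delete) and the updated_at field are placeholders for whatever your conversation store actually exposes.

from datetime import datetime, timedelta, timezone

def purge_old_conversations(store, max_age_days: int = 90) -> int:
    """Delete conversations older than the retention window and return the count."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    deleted = 0
    for conversation in store.list_conversations():
        if conversation.updated_at < cutoff:
            store.delete(conversation.id)
            deleted += 1
    return deleted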

Premium Services Roadmap: Vanna will provide GDPR-compliant data retention, deletion APIs, and data processing agreements for production premium services.

Quick Reference

| Question | Self-Hosted | Premium Services |
|---|---|---|
| Where are database credentials stored? | Locally only | Never sent to premium |
| Where are conversations stored? | Local storage (your control) | Vanna cloud (opt-in) |
| Is data encrypted in transit? | HTTPS (your config) | HTTPS to Vanna APIs |
| Is data encrypted at rest? | Your implementation | Roadmap (not current) |
| Can I delete my data? | Yes (via API) | Yes (via API; UI on roadmap) |
| Is it production-ready? | Yes (with hardening) | No (development stage) |

Getting Help

Security Issues

Please do not open public GitHub issues for security vulnerabilities.

  • Report via GitHub Security Advisories
  • Email: security@vanna.ai
  • Response time: 48 hours

Ready to get started?

Deploy Vanna with confidence. Choose self-hosted for maximum control or try our managed services.