SQL Injection in `remove_training_data` via BigQuery Backend Allows Mass Deletion of Training Data

Description:

Summary

The BigQuery vector store backend (bigquery_vector.py) constructs a SQL DELETE statement using Python f-string interpolation with user-supplied input, without any parameterization or sanitization. An unauthenticated attacker (under the default NoAuth configuration) can send a crafted id value via POST /api/v0/remove_training_data to inject arbitrary SQL, resulting in mass deletion of all training data or other unauthorized database operations against the BigQuery dataset.

This finding is based on static code audit of Vanna v2.0.2 (latest, commit 365d061).

Details

Vulnerability type: Classic SQL Injection via string interpolation (CWE-89).

The remove_training_data method in BigQuery_VectorStore directly interpolates the id parameter into a raw SQL DELETE statement using an f-string. The id originates from the HTTP request body and passes through the Flask API layer without any validation, sanitization, or parameterization.

Vulnerable Sink

# src/vanna/legacy/google/bigquery_vector.py, line 273-282
def remove_training_data(self, id: str, **kwargs) -> bool:
    query = f"DELETE FROM `{self.table_id}` WHERE id = '{id}'"  # SQL INJECTION

    try:
        self.conn.query(query).result()
        return True

    except Exception as e:
        print(f"Failed to remove training data: {e}")
        return False

The id parameter is wrapped in single quotes inside the f-string. An attacker sending ' OR '1'='1 as the id value would produce:

DELETE FROM `project.dataset.training_data` WHERE id = '' OR '1'='1'

The OR '1'='1' condition is always true, so this deletes every row in the training data table.
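The string construction can be reproduced without a BigQuery connection. The sketch below copies the f-string from the vulnerable method into a standalone helper and prints the resulting statement for a benign id and for the payload. The helper name `build_delete_query` and the sample id `abc-123` are hypothetical, and the table id is the placeholder used above.

```python
def build_delete_query(table_id: str, id: str) -> str:
    """Mirror the f-string from the vulnerable method (illustration only)."""
    return f"DELETE FROM `{table_id}` WHERE id = '{id}'"


TABLE_ID = "project.dataset.training_data"  # placeholder table id

benign = build_delete_query(TABLE_ID, "abc-123")      # hypothetical id
injected = build_delete_query(TABLE_ID, "' OR '1'='1")

print(benign)
print(injected)
```

The second statement no longer compares `id` against a single value: the closing quote supplied by the attacker ends the string literal early and the rest is parsed as SQL.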

Entry Point (HTTP API)

# src/vanna/legacy/flask/__init__.py, line 783-815
@self.flask_app.route("/api/v0/remove_training_data", methods=["POST"])
@self.requires_auth
def remove_training_data(user: any):
    id = flask.request.json.get("id")  # user-controlled, no validation

    if id is None:
        return jsonify({"type": "error", "error": "No id provided"})

    if vn.remove_training_data(id=id):  # passes directly to vulnerable sink
        return jsonify({"success": True})

The id value is extracted from the JSON request body with zero validation — no type check, no format check (e.g., UUID regex), no escaping.
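As an illustration of the kind of format check that is absent, the following sketch accepts only strings in canonical hyphenated UUID form. It assumes that training data ids are plain UUIDs, which this audit has not confirmed for the BigQuery backend; if ids carry a suffix or use another format, the check would have to match that format instead. The function name `is_canonical_uuid` is hypothetical.

```python
import uuid


def is_canonical_uuid(value: object) -> bool:
    """Return True only for strings in canonical hyphenated UUID form."""
    # Type check first: JSON bodies can carry numbers, lists, or objects.
    if not isinstance(value, str):
        return False
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    # uuid.UUID also accepts braces, "urn:uuid:" prefixes, and unhyphenated
    # hex, so compare against the canonical rendering to reject those forms.
    return str(parsed) == value.lower()


print(is_canonical_uuid("12345678-1234-5678-1234-567812345678"))
print(is_canonical_uuid("' OR '1'='1"))
```

Input validation of this kind is defense in depth only. It narrows what reaches the sink but does not replace binding the value as a query parameter.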

Default NoAuth

# src/vanna/legacy/flask/auth.py, line 36-41
class NoAuth(AuthInterface):
    def get_user(self, flask_request) -> any:
        return {}
    def is_logged_in(self, user: any) -> bool:
        return True  # every request is "authenticated"

# src/vanna/legacy/flask/__init__.py, line 145-149
class VannaFlaskAPI:
    def __init__(self, vn: VannaBase, ..., auth: AuthInterface = NoAuth(), ...):

The default authentication backend unconditionally returns True, meaning any network-reachable attacker can hit the vulnerable endpoint without credentials.
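For contrast, a backend that actually rejects requests might look like the sketch below. It mirrors only the two methods quoted above and deliberately does not subclass Vanna's AuthInterface, so it is a shape illustration rather than a drop-in replacement. The class name `SharedTokenAuth` and the bearer-token scheme are assumptions made for this example.

```python
import hmac


class SharedTokenAuth:
    """Hypothetical stand-in mirroring the two NoAuth methods quoted above.

    It does not subclass Vanna's AuthInterface; a real replacement would
    need to implement whatever else that interface requires.
    """

    def __init__(self, expected_token: str):
        self._expected = expected_token

    def get_user(self, flask_request) -> dict:
        # Read "Authorization: Bearer <token>" and keep only the token part.
        header = flask_request.headers.get("Authorization", "")
        scheme, _, token = header.partition(" ")
        if scheme.lower() != "bearer":
            return {}
        return {"token": token}

    def is_logged_in(self, user: dict) -> bool:
        # Reject missing tokens, then compare in constant time.
        token = user.get("token", "")
        if not token:
            return False
        return hmac.compare_digest(token.encode(), self._expected.encode())
```

Unlike NoAuth, `is_logged_in` here returns False for a request without a matching token, so the route decorator would have something to enforce.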

Static Audit: Data Flow Trace (Source → Sink)

The following traces the user-controlled input from HTTP entry point to SQL execution, proving reachability through 5 hops:

[Source] HTTP POST /api/v0/remove_training_data  Body: {"id": "<PAYLOAD>"}
   │
   ▼ Hop 1: Flask route registration
   flask/__init__.py:783   @self.flask_app.route("/api/v0/remove_training_data", methods=["POST"])
   │
   ▼ Hop 2: User input extracted from request body — no sanitization
   flask/__init__.py:805   id = flask.request.json.get("id")
   │                       # Only check: "if id is None" (line 807) — no format validation
   │
   ▼ Hop 3: Passed directly to the VannaBase polymorphic method
   flask/__init__.py:810   vn.remove_training_data(id=id)
   │
   │  The `vn` variable is the VannaBase instance passed to VannaFlaskAPI.__init__() at line 147.
   │  It is stored as self.vn (line 176) and captured in the route closure.
   │  When vn is a BigQuery_VectorStore instance (or subclass), Python MRO resolves to:
   │
   ▼ Hop 4: BigQuery_VectorStore.remove_training_data — f-string SQL construction
   bigquery_vector.py:274  query = f"DELETE FROM `{self.table_id}` WHERE id = '{id}'"
   │
   │  The {id} is interpolated directly into the SQL string with no escaping.
   │  A payload like: ' OR '1'='1
   │  Produces:       DELETE FROM `...` WHERE id = '' OR '1'='1'
   │
   ▼ Hop 5: Injected SQL executed against BigQuery
   bigquery_vector.py:277  self.conn.query(query).result()
   │
[Sink] BigQuery executes the injected SQL → all rows deleted
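The trace above can be exercised end to end with a local stand-in. In the sketch below an in-memory SQLite table replaces BigQuery and a plain function replaces the Flask route; both substitutions are assumptions made so the example runs without external services, and the names `StandInStore` and `handle_request` are hypothetical. Only the hop structure and the f-string shape are taken from the audited code.

```python
import sqlite3


class StandInStore:
    """Stand-in for the vector store; SQLite replaces BigQuery here."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE training_data (id TEXT, content TEXT)")
        self.conn.executemany(
            "INSERT INTO training_data VALUES (?, ?)",
            [("a1", "SELECT 1"), ("b2", "SELECT 2"), ("c3", "SELECT 3")],
        )

    def remove_training_data(self, id: str) -> bool:
        # Hop 4: same f-string shape as the audited sink.
        query = f"DELETE FROM training_data WHERE id = '{id}'"
        # Hop 5: the interpolated text is executed as SQL.
        self.conn.execute(query)
        return True

    def count(self) -> int:
        row = self.conn.execute("SELECT COUNT(*) FROM training_data").fetchone()
        return row[0]


def handle_request(store: StandInStore, body: dict) -> dict:
    # Hop 2: the id is read from the request body; only None is rejected.
    id = body.get("id")
    if id is None:
        return {"type": "error", "error": "No id provided"}
    # Hop 3: the value is forwarded unchanged.
    return {"success": store.remove_training_data(id=id)}


store = StandInStore()
rows_before = store.count()
response = handle_request(store, {"id": "' OR '1'='1"})
rows_after = store.count()
print(rows_before, rows_after)
```

With three seeded rows, the single request leaves the table empty while the handler still reports success, which matches the sink behaviour described in the trace.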

Polymorphism proof (Hop 3 → Hop 4)

VannaBase.remove_training_data is an abstract method:

# src/vanna/legacy/base/base.py, line 500-516
class VannaBase(ABC):
    @abstractmethod
    def remove_training_data(self, id: str, **kwargs) -> bool:
        pass

BigQuery_VectorStore inherits from VannaBase and provides the concrete implementation:

# src/vanna/legacy/google/bigquery_vector.py, line 13
class BigQuery_VectorStore(VannaBase):

The standard Vanna deployment pattern uses multiple inheritance:

class MyVanna(BigQuery_VectorStore, OpenAI_Chat):
    ...

vn = MyVanna(config={...})
app = VannaFlaskApp(vn)   # vn.remove_training_data → BigQuery_VectorStore.remove_training_data
app.run()

Since no intermediate class overrides remove_training_data, Python's MRO guarantees vn.remove_training_data() resolves to BigQuery_VectorStore.remove_training_data — the vulnerable implementation.
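The resolution order can be checked with stand-in classes that have the same inheritance shape. The class names below are hypothetical placeholders, not Vanna's classes. The sketch only demonstrates how Python picks the first class in the MRO that defines the method.

```python
from abc import ABC, abstractmethod


class VannaBaseLike(ABC):
    @abstractmethod
    def remove_training_data(self, id: str, **kwargs) -> bool: ...


class VectorStoreLike(VannaBaseLike):
    def remove_training_data(self, id: str, **kwargs) -> bool:
        return True  # stands in for the BigQuery implementation


class ChatLike(VannaBaseLike):
    pass  # does not define remove_training_data


class MyVannaLike(VectorStoreLike, ChatLike):
    pass


# Walk the MRO and report the first class whose own namespace has the method.
mro_names = [cls.__name__ for cls in MyVannaLike.__mro__]
resolved_on = next(
    cls.__name__
    for cls in MyVannaLike.__mro__
    if "remove_training_data" in vars(cls)
)
print(mro_names)
print(resolved_on)
```

Because the vector store class appears before the chat class in the base list and is the only concrete definition, attribute lookup on the combined class stops there.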

Frontend confirms this is a production code path

The bundled Svelte frontend (flask/assets.py) calls this endpoint when a user clicks delete on the Training Data page:

// Deobfuscated from flask/assets.py
function Un(E) {
    gt.set(null);
    Pe("remove_training_data", "POST", {id: E}).then(e => {
        Pe("get_training_data", "GET", []).then(pt)
    })
}

This confirms POST /api/v0/remove_training_data is a designed, production code path — not a theoretical or dead endpoint.

Cross-comparison with safe backends

| Backend | File | remove_training_data pattern | Vulnerable? |
|---|---|---|---|
| BigQuery | bigquery_vector.py:274 | f"DELETE ... WHERE id = '{id}'" | YES (f-string) |
| pgvector | pgvector.py:227 | execute(delete_statement, {"id": id}) | No (parameterized) |
| Oracle | oracle_vector.py:309 | cursor.execute(..., [id]) | No (parameterized) |
| ChromaDB | chromadb_vector.py:167 | collection.delete(ids=[id]) | No (API-based) |
| OpenSearch | opensearch_vector.py:340 | client.delete(index=..., id=id) | No (API-based) |

All other backends use parameterized queries or safe APIs. Only BigQuery uses raw string interpolation.
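A BigQuery equivalent of the parameterized pattern could look like the sketch below. It is an illustration under stated assumptions, not the project's patch: the function names are hypothetical, `client` is assumed to be a `google.cloud.bigquery.Client`, and the table id is assumed to come from deployment configuration rather than from the request. It relies on the `QueryJobConfig` and `ScalarQueryParameter` classes from the google-cloud-bigquery library and on BigQuery's `@name` placeholder syntax.

```python
def build_parameterized_delete(table_id: str) -> str:
    # table_id is assumed to come from configuration, not from the request;
    # the id is left as a named placeholder for BigQuery to bind.
    return f"DELETE FROM `{table_id}` WHERE id = @id"


def remove_training_data_parameterized(client, table_id: str, id: str) -> bool:
    """Sketch of a parameterized variant; not the project's actual patch."""
    # Imported lazily so the helper above stays usable without the library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("id", "STRING", id)]
    )
    try:
        query = build_parameterized_delete(table_id)
        client.query(query, job_config=job_config).result()
        return True
    except Exception as e:
        print(f"Failed to remove training data: {e}")
        return False
```

With this shape the request value never becomes part of the SQL text, so a payload such as `' OR '1'='1` is compared against the id column as an ordinary string.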

Impact

This is a SQL Injection vulnerability. An attacker with network access to the Vanna Flask API can:

  1. Delete all training data — a single request wipes the entire training dataset, causing denial of service for the AI assistant. The organization loses all curated SQL examples, DDL schemas, and documentation that were used to train the model.

  2. Delete selective data — using payloads like ' OR training_data_type='sql, an attacker can surgically remove specific categories of training data to degrade the AI's output quality without triggering obvious alarms.

  3. Potential data exfiltration — depending on BigQuery's error handling and query capabilities, error-based or blind SQL injection techniques could be used to extract data from the training dataset or other tables in the same BigQuery dataset.

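  4. Cross-table impact: the BigQuery client connection (self.conn) has access to the entire dataset, so subqueries in the injected WHERE clause could read tables other than training_data. If the query job accepts multi-statement scripts, stacked statements could also modify those tables.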

All Vanna deployments using the BigQuery vector store backend with the default NoAuth configuration are affected.

Affected products

  • Ecosystem: pip
  • Package name: vanna
  • Affected versions: <= 2.0.2 (all versions containing BigQuery_VectorStore)
  • Patched versions:

Severity

  • Severity: High
  • Vector string: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:H/A:H
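The base score implied by the vector string can be recomputed from the CVSS v3.1 base-score equations. The sketch below is a minimal calculator for base metrics only, written for this example rather than taken from any official tool; the weights and the round-up rule follow the published specification as understood here. For this vector it yields 9.4, which falls in the 9.0 to 10.0 band that the specification labels Critical, whereas the Severity field above is the advisory's own rating.

```python
import math

# Metric weights from the CVSS v3.1 specification (base metrics only).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    # Drop the "CVSS:3.1" prefix and map each metric to its value letter.
    parts = dict(item.split(":") for item in vector.split("/")[1:])
    changed = parts["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[parts["PR"]]
    iss = 1 - math.prod(1 - WEIGHTS[m][parts[m]] for m in ("C", "I", "A"))
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    exploitability = (
        8.22 * WEIGHTS["AV"][parts["AV"]] * WEIGHTS["AC"][parts["AC"]]
        * pr * WEIGHTS["UI"][parts["UI"]]
    )
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total if changed else total, 10))


score = base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:H/A:H")
print(score)
```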

Weaknesses

  • CWE: CWE-89: Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection')

Occurrences

| Permalink | Description |
|---|---|
| https://github.com/vanna-ai/vanna/blob/365d0617c1a4567ffee1b19b40c27feb4206bfcf/src/vanna/legacy/google/bigquery_vector.py#L273-L282 | The vulnerable remove_training_data method: constructs SQL DELETE via f-string interpolation with unsanitized user input id. |
| https://github.com/vanna-ai/vanna/blob/365d0617c1a4567ffee1b19b40c27feb4206bfcf/src/vanna/legacy/flask/__init__.py#L805 | id = flask.request.json.get("id"): user-controlled input extracted from the HTTP request body with no validation. |
| https://github.com/vanna-ai/vanna/blob/365d0617c1a4567ffee1b19b40c27feb4206bfcf/src/vanna/legacy/flask/__init__.py#L810 | vn.remove_training_data(id=id): tainted id passed directly to the polymorphic method without sanitization. |
| https://github.com/vanna-ai/vanna/blob/365d0617c1a4567ffee1b19b40c27feb4206bfcf/src/vanna/legacy/flask/auth.py#L36-L41 | NoAuth class: default auth backend where is_logged_in() unconditionally returns True, allowing unauthenticated access. |
| https://github.com/vanna-ai/vanna/blob/365d0617c1a4567ffee1b19b40c27feb4206bfcf/src/vanna/legacy/flask/__init__.py#L147 | VannaFlaskAPI.__init__ accepts vn: VannaBase. BigQuery_VectorStore is a concrete subclass, and its remove_training_data is resolved via Python MRO. |
| https://github.com/vanna-ai/vanna/blob/365d0617c1a4567ffee1b19b40c27feb4206bfcf/src/vanna/legacy/base/base.py#L500-L516 | VannaBase.remove_training_data: declared as @abstractmethod, confirming that subclass implementations (like BigQuery) are the actual execution targets. |