Skip to content

Instantly share code, notes, and snippets.

@nielsberglund
Created January 14, 2026 06:11
Show Gist options
  • Select an option

  • Save nielsberglund/b26fd60e18abb4a8e7ca0157a7fc69e0 to your computer and use it in GitHub Desktop.

Select an option

Save nielsberglund/b26fd60e18abb4a8e7ca0157a7fc69e0 to your computer and use it in GitHub Desktop.

Import MCP Server - Tool Specification

Date: January 14, 2026 Status: Implementation Planning Related: See import-analysis-and-recommendations.md for detailed data analysis


Overview

This document specifies the 7 MCP tools to build for the Custom Import MCP server. Each tool is designed to handle specific import workflows based on real CSV data analysis from Data & AI Community Day Durban events.

Total Scope: ~1,250 lines of Python code using FastMCP framework


7 MCP Tools to Build

Core Infrastructure (Build First)

1. validate_import_data

Purpose: Fast CSV validation WITHOUT database access

Why first: Every import workflow needs validation before processing

Parameters:

{
    "file_path": str (required),
    "import_type": enum (required) [
        "quicket_registrations",
        "quicket_checkins",
        "walkup",
        "sessionize",
        "ratings"
    ]
}

Returns:

{
    "valid": bool,
    "file_info": {
        "format": "csv" | "xlsx",
        "rows": int,
        "columns": list[str]
    },
    "validation_results": {
        "required_columns_present": bool,
        "missing_columns": list[str],
        "extra_columns": list[str],
        "data_quality_score": float,
        "duplicate_rows": int,
        "missing_data_percentage": float
    },
    "errors": list[{row, field, error}],
    "warnings": list[{row, field, warning}]
}

Implementation Notes:

  • CSV parsing + format-specific rules
  • No database access (fast feedback ~50ms)
  • Validates column presence, data types, required fields
  • Detects duplicates within file

Complexity: ~150 lines


2. preview_import

Purpose: Dry-run showing what WOULD happen (connects to database in read-only mode)

Why second: User confidence before committing changes

Parameters:

{
    "file_path": str (required),
    "import_type": enum (required),
    "event_id": int (required for most types),
    "duplicate_strategy": enum (optional, default="skip") ["skip", "update", "fail"],
    "preview_limit": int (optional, default=10)
}

Returns:

{
    "summary": {
        "total_rows": int,
        "would_create": int,
        "would_update": int,
        "would_skip": int,
        "would_fail": int
    },
    "sample_actions": [
        {
            "row": int,
            "action": "create" | "update" | "skip" | "fail",
            "record_type": "contact" | "ticket" | "session" | "rating",
            "identifier": str,
            "data_preview": {...},
            "reason": str
        }
    ],
    "duplicates_found": [{identifier, action, reason}],
    "errors": [{row, field, error}],
    "warnings": [{row, field, warning}]
}

Implementation Notes:

  • Reads database to check for existing records
  • Simulates import logic without writes
  • Shows sample actions (first N rows)
  • Slower than validate (~500ms+) but provides detailed impact analysis

Complexity: ~200 lines


Import Operations

Phase 1 Priority (Build These Next)

3. import_quicket_registrations

Purpose: Import ticket purchases from Quicket CSV (37 columns)

Why critical: Most complex import, handles core registration workflow

Parameters:

{
    "file_path": str (required),
    "event_id": int (required),
    "duplicate_strategy": enum (optional, default="skip") ["skip", "update", "fail"]
}

Logic:

  1. Parse CSV with pandas (handle UTF-8 BOM)
  2. For each row:
    • Parse contact data (email, first_name, surname, phone, company, employment_status)
    • Check if contact exists by composite key (email, first_name, surname)
      • If exists: use existing contact_id
      • If not: create new contact
    • Parse Areas of Interest (comma-separated with HTML entities)
      • Decode HTML entities: "&""&"
      • Split by comma: "AI Coding, Data Engineering & Stream Processing"
      • For each topic: find or create in interest_topics
      • Create junction records in contact_interests
    • Check if ticket exists by (ticket_number, event_id)
      • If exists: apply duplicate_strategy
      • If not: create ticket with all fields:
        • purchaser_email, order_number, marketing_source
        • ticket_type, is_complimentary
    • Ensure contact_role exists (attendee, event_id)

Returns:

{
    "success": bool,
    "summary": {
        "total_rows": int,
        "contacts_created": int,
        "contacts_updated": int,
        "tickets_created": int,
        "tickets_updated": int,
        "tickets_skipped": int,
        "interests_processed": int
    },
    "errors": list[{row, error}],
    "warnings": list[{row, warning}]
}

Key Challenges:

  • Composite key deduplication: (email, first_name, surname) - email alone NOT unique
  • HTML entity decoding in interests field
  • Purchaser vs attendee distinction
  • Marketing source attribution

Complexity: ~250 lines (most complex parser)


4. import_sessionize_data

Purpose: Import speakers + sessions from Sessionize CSV (25 columns)

Why in Phase 1: Must come BEFORE import_speaker_ratings due to FK dependency (speaker_ratings.session_idspeaker_sessions.id)

Parameters:

{
    "file_path": str (required),
    "event_id": int (required),
    "duplicate_strategy": enum (optional, default="skip") ["skip", "update", "fail"]
}

Logic:

  1. Parse CSV with pandas (handle multi-line bio fields)
  2. Group rows by Session Id
    • Multi-speaker structure: One row per session-speaker pair
    • First speaker row: Has complete session data
    • Subsequent speaker rows: Have minimal session info
  3. For each session group:
    • Create session record from first row:
      • Store in speaker_sessions: title, description, track, level, room, scheduled_at, duration_minutes, session_format
      • Store: sessionize_session_id for re-import tracking
    • For each speaker row in group:
      • Find or create contact by composite key (email, first_name, surname)
      • Create or update speaker enrichment in speakers table:
        • Store: sessionize_speaker_id, bio, tagline, shirt_size, profile_picture_url
      • Link speaker to session via speaker_sessions.speaker_id
        • Store: speaker_informed_at, speaker_confirmed_at
    • Ensure contact_role exists (speaker, event_id)

Returns:

{
    "success": bool,
    "summary": {
        "total_rows": int,
        "sessions_created": int,
        "sessions_updated": int,
        "speakers_created": int,
        "speakers_updated": int,
        "speakers_linked": int
    },
    "multi_speaker_sessions": [
        {
            "session_id": int,
            "title": str,
            "speaker_count": int,
            "speakers": list[str]
        }
    ],
    "errors": list[{row, error}],
    "warnings": list[{row, warning}]
}

Key Challenges:

  • Multi-speaker grouping logic (group by Session Id)
  • Session data in first row only vs speaker data in all rows
  • Speaker enrichment vs basic contact data
  • Real example: "Mirror-Code" session has 2 speakers (Shraddha + Pratish Neerputh)

Complexity: ~200 lines (multi-speaker grouping complexity)


5. import_speaker_ratings

Purpose: Import session evaluations with fuzzy matching

Why after Sessionize: References speaker_sessions.id via FK constraint - sessions must exist first

Parameters:

{
    "file_path": str (required),
    "event_id": int (required),
    "fuzzy_match_threshold": float (optional, default=0.8)
}

Logic:

  1. Parse CSV/Excel (support both formats)
  2. Load all sessions for this event from database
  3. For each rating row:
    • Parse session identifier field:
      • Strip number prefix: "01. ", "02. ", etc.
      • Strip speaker suffix: " - Speaker Name(s)"
      • Result: Clean session title for matching
    • Fuzzy match against loaded sessions (Levenshtein distance)
      • Use fuzzywuzzy.token_sort_ratio (handles word order differences)
      • Threshold default: 80% similarity
    • If match score >= threshold:
      • Get all speakers for matched session
      • For EACH speaker:
        • Create speaker_ratings record with SAME values:
          • topic_rating (informativeness: 1-7)
          • presentation_rating (delivery: 1-7)
          • time_investment_rating (1-7)
          • improvement_suggestions (text)
    • If no match: log warning with suggestions

Returns:

{
    "success": bool,
    "summary": {
        "total_rows": int,
        "ratings_created": int,
        "sessions_matched": int,
        "sessions_unmatched": int,
        "multi_speaker_sessions": int
    },
    "unmatched_sessions": [
        {
            "row": int,
            "session_identifier": str,
            "best_match": str,
            "match_score": float,
            "suggestion": str
        }
    ],
    "errors": list[{row, error}],
    "warnings": list[{row, warning}]
}

Key Challenges:

  • Fuzzy string matching with 80%+ threshold
  • Multi-speaker sessions: Create N ratings from 1 evaluation
  • Rating scale is 1-7 (not 1-5 as originally assumed)
  • Session identifier format variations
  • FK dependency: Requires speaker_sessions records to exist

Complexity: ~200 lines (fuzzy matching + multi-speaker logic)


Phase 2 Priority (Build After Phase 1 Works)

6. import_quicket_checkins 📊

Purpose: Update attendance from post-event Quicket export

Parameters:

{
    "file_path": str (required),
    "event_id": int (required),
    "update_contact_info": bool (optional, default=True)
}

Logic:

  1. Parse CSV (SAME format as import_quicket_registrations)
  2. For each row:
    • Find ticket by (ticket_number, event_id)
    • If NOT found:
      • Create contact + ticket (late registration case)
    • If found:
      • UPDATE: checked_in = TRUE, checked_in_at = timestamp
      • If update_contact_info=True:
        • Update contact fields (phone, company, employment_status)
        • Update interests (if changed since registration)

Returns:

{
    "success": bool,
    "summary": {
        "total_rows": int,
        "tickets_updated": int,
        "tickets_created": int,
        "contact_info_updated": int
    },
    "errors": list[{row, error}],
    "warnings": list[{row, warning}]
}

Key Challenges:

  • UPDATE operation vs CREATE (most tickets already exist)
  • Late registration handling
  • Contact info change detection

Complexity: ~150 lines (simpler - mostly updates)


7. import_walkup_registrations 📊

Purpose: Microsoft Forms walk-ins (10 columns)

Parameters:

{
    "file_path": str (required),
    "event_id": int (required),
    "duplicate_strategy": enum (optional, default="skip") ["skip", "update", "fail"],
    "phone_cleanup": bool (optional, default=True)
}

Logic:

  1. Parse CSV (simple 10-column format)
  2. For each row:
    • Clean phone number (strip spaces, validate 10 digits if phone_cleanup=True)
    • Find or create contact by composite key (email, first_name, surname)
    • Check if contact already has ticket for this event
      • If yes: apply duplicate_strategy
      • If no: create ticket
    • Set ticket fields:
      • ticket_number = -1 (walk-in sentinel value)
      • checked_in = TRUE (they're physically present)
      • checked_in_at = completion_time
      • purchase_date = completion_time

Returns:

{
    "success": bool,
    "summary": {
        "total_rows": int,
        "contacts_created": int,
        "tickets_created": int,
        "tickets_skipped": int,
        "phone_numbers_cleaned": int
    },
    "phone_validation_warnings": list[{row, phone, issue}],
    "errors": list[{row, error}],
    "warnings": list[{row, warning}]
}

Key Challenges:

  • Phone number format variations (spaces, missing leading zeros)
  • Data quality issues (duplicates, date format ambiguity)
  • Simpler than Quicket (no interests, no purchaser tracking)

Complexity: ~100 lines (simplest import)


Recommended Build Order

Phase 1: Foundation + Core Workflow

1. validate_import_data           → Validate any CSV format
2. preview_import                  → Preview any import operation
3. import_quicket_registrations    → Core registration workflow
4. import_sessionize_data          → Create speaker_sessions (FK dependency!)
5. import_speaker_ratings          → References speaker_sessions.id

Rationale:

  • Tools 1-2 are infrastructure used by all imports
  • Tool 3 handles the most common/complex use case (event registrations)
  • Tool 4 MUST come before Tool 5 due to FK constraint: speaker_ratings.session_id references speaker_sessions.id
  • Tool 5 provides high-value analytics (speaker/session feedback)
  • After Phase 1: Can run complete workflow:
    1. Validate registration CSV
    2. Preview registration import
    3. Import registrations (attendees)
    4. Import sessions/speakers (creates speaker_sessions records)
    5. Import ratings (links to speaker_sessions) ✅

Estimated Lines: ~950 lines


Phase 2: Enhanced Operations

6. import_quicket_checkins         → Post-event attendance updates
7. import_walkup_registrations     → On-site registration workflow

Rationale:

  • Tool 6 builds on Tool 3 logic (same CSV format, UPDATE operations)
  • Tool 7 is simplest import (good for quick win, no dependencies)
  • After Phase 2: Complete import coverage for all data sources

Estimated Lines: ~250 lines


Database Foreign Key Dependencies

Critical FK constraints to respect:

-- speaker_ratings MUST reference existing speaker_sessions
ALTER TABLE speaker_ratings
    ADD CONSTRAINT speaker_ratings_session_id_fkey
    FOREIGN KEY (session_id) REFERENCES speaker_sessions(id);

-- tickets reference contacts (handled by all import tools)
ALTER TABLE tickets
    ADD CONSTRAINT tickets_contact_id_fkey
    FOREIGN KEY (contact_id) REFERENCES contacts(id);

-- speaker_sessions reference speakers (handled by sessionize import)
ALTER TABLE speaker_sessions
    ADD CONSTRAINT speaker_sessions_speaker_id_fkey
    FOREIGN KEY (speaker_id) REFERENCES speakers(id);

Import order implications:

  1. import_quicket_registrations → Creates contacts, tickets
  2. import_sessionize_data → Creates contacts, speakers, speaker_sessions
  3. import_speaker_ratings → Creates speaker_ratings referencing speaker_sessions.id

This is why sessionize MUST come before ratings!


Design Principles

Why 7 Separate Tools vs Fewer Unified Tools?

Separate because:

  • Different CSV formats (37 columns vs 10 columns vs 25 columns)
  • Different operations (CREATE vs UPDATE vs LINK)
  • Different workflows (pre-event vs post-event)
  • Different complexity levels (simple walk-ins vs complex multi-speaker sessions)
  • Follows MCP principle: "Atomic tools (one thing well), composable"

Why Validate + Preview as Separate Tools?

Separate because:

  • validate_import_data = fast, no DB access (~50ms)
  • preview_import = slower, reads DB (~500ms+)
  • Composable: Can validate multiple files, then preview one
  • User can validate quickly, then decide which file to import

Why Not Use csv-mcp-server for Validation?

Decision: Direct implementation instead

Reasons:

  • FastMCP can't easily call other MCP servers (architectural limitation)
  • csv-mcp-server provides 40+ generic operations
  • We need format-specific validation (Quicket 37 columns, Sessionize multi-speaker structure)
  • Direct implementation = fewer dependencies, more control

Technical Stack

Dependencies:

[project]
dependencies = [
    "fastmcp>=0.3.0",                    # MCP server framework
    "asyncpg>=0.29.0",                   # Async PostgreSQL driver
    "pandas>=2.0.0",                     # CSV parsing
    "fuzzywuzzy[speedup]>=0.18.0",       # Fuzzy string matching
    "python-Levenshtein>=0.21.0",        # Fast string similarity
]

Database Access:

  • Read ${NEON_DATABASE_URL} environment variable (same as Postgres MCP Pro)
  • Use asyncpg connection pool for performance
  • All operations in transactions (rollback on error)

Error Handling:

  • Validate before import (fail fast)
  • Preview before commit (user confirmation)
  • Atomic transactions (all or nothing)
  • Detailed error reporting (row number, field, issue)

Project Structure

mcp-servers/import-mcp/
├── pyproject.toml                    # Dependencies + FastMCP config
├── README.md                         # Usage guide with examples
├── src/import_mcp/
│   ├── server.py                     # FastMCP entry point (~100 lines)
│   ├── tools/
│   │   ├── validation.py             # validate_import_data, preview_import (~350 lines)
│   │   ├── quicket.py                # import_quicket_registrations/checkins (~400 lines)
│   │   ├── walkins.py                # import_walkup_registrations (~100 lines)
│   │   ├── sessionize.py             # import_sessionize_data (~200 lines)
│   │   └── ratings.py                # import_speaker_ratings (~200 lines)
│   └── utils/
│       ├── db.py                     # Database connection pool (~50 lines)
│       ├── parsers.py                # Interests, phone cleanup, HTML decode (~100 lines)
│       └── matching.py               # Fuzzy matching algorithms (~100 lines)
└── tests/
    ├── test_parsers.py               # Unit tests for parsing logic
    ├── test_integration.py           # Integration tests with real CSVs
    └── fixtures/                     # Symlink to docs/import-files/*.csv

Total: ~1,600 lines including tests


Success Metrics

Phase 1 Complete When:

  • ✅ Can validate Quicket CSV (37 columns)
  • ✅ Can preview import showing create/update counts
  • ✅ Can import registrations with interests parsing
  • ✅ Can import sessions/speakers (multi-speaker grouping)
  • ✅ Can import ratings with 80%+ fuzzy matching (FK constraint satisfied)
  • ✅ End-to-end workflow tested: reg.csvsessionize.csvevals.csv

Phase 2 Complete When:

  • ✅ Can update attendance from check-in CSV
  • ✅ Can import walk-ins with phone cleanup
  • ✅ All 7 tools tested with real CSV files
  • ✅ Import MCP server added to .mcp.json configuration

Next Steps

  1. Setup Project Structure

    • Create mcp-servers/import-mcp/ directory
    • Initialize pyproject.toml with FastMCP + dependencies
    • Create module structure (server.py, tools/, utils/)
  2. Build Phase 1 Tools (in order)

    • validate_import_data - Foundation
    • preview_import - User confidence
    • import_quicket_registrations - Core workflow
    • import_sessionize_data - Create speaker_sessions (FK dependency!)
    • import_speaker_ratings - Analytics value (depends on sessionize)
  3. Test with Real Data

    • Use docs/import-files/reg.csv (50 rows)
    • Use docs/import-files/sessionize.csv (25+ rows)
    • Use docs/import-files/evals.csv (199 rows)
    • Verify FK constraints satisfied (sessions created before ratings)
  4. Build Phase 2 Tools

    • import_quicket_checkins
    • import_walkup_registrations
  5. Integration & Documentation

    • Add to .mcp.json MCP server configuration
    • Update README.md with usage examples
    • Update CLAUDE.md Phase 1 status → Complete

References

  • Detailed Data Analysis: docs/planning/import-analysis-and-recommendations.md (32,000+ words)
  • Database Schema: database/DATABASE.md + database/schema.sql
  • Real CSV Files: docs/import-files/ (5 files: reg, checkin, walkins, sessionize, evals)
  • FastMCP Documentation: https://github.com/jlowin/fastmcp
  • MCP Protocol: https://modelcontextprotocol.io

Last Updated: January 14, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment