Complete Upload Workflow Trace Analysis for AskMarkAI - Text Content, File Attachments, and URL Processing

🔍 Complete Upload Workflow Trace Analysis - AskMarkAI

Analysis Date: $(date)
System: AskMarkAI Document Upload & Processing Pipeline
Scope: Text Content, File Attachments, and URL Processing workflows

Executive Summary

This trace analysis shows how AskMarkAI handles three distinct document upload types through a unified, event-driven architecture. All upload paths converge at a single document processor triggered by S3 events, so every document goes through the same processing and vector embedding generation.

CALL FLOW DIAGRAM - Text Content Upload

[handleCreateDocument] (file: frontend/src/app/dashboard/businesses/[businessId]/page.tsx:89)
↓
[useDocuments::createDocument] (file: frontend/src/hooks/use-documents.ts:60)
↓
[ApiClient::createDocumentWithUpload] (file: frontend/src/lib/api-client.ts:335)
  ↓
  [ApiClient::createDocument] (file: frontend/src/lib/api-client.ts:297) ? if not file upload
  ↓
  [documentManagementHandler] (file: backend/src/document-management.ts:10)
  ↓
  [DocumentService::createDocument] (file: backend/src/services/document-service.ts:70)
    ↓
    [S3UploadService::uploadTextContent] (file: backend/src/services/s3-upload-service.ts:146) ? if text/html content
    ↓
    [S3 PUT Event Trigger] (file: infrastructure/lib/constructs/processing-construct.ts:449)
    ↓
    [documentProcessorHandler] (file: lambda/src/document-processor/index.ts:11)
      ↓
      [DocumentProcessor::processDocument] (file: shared/services/document-processor.ts:156)
        ↓
        [DocumentProcessor::updateDocumentStatus] (file: shared/services/document-processor.ts:503)

CALL FLOW DIAGRAM - File Attachment Upload

[handleCreateDocument] (file: frontend/src/app/dashboard/businesses/[businessId]/page.tsx:89)
↓
[useDocuments::createDocument] (file: frontend/src/hooks/use-documents.ts:60)
↓
[ApiClient::createDocumentWithUpload] (file: frontend/src/lib/api-client.ts:335)
  ↓
  [ApiClient::createDocument] (file: frontend/src/lib/api-client.ts:342) → Step 1: Create DynamoDB record
  ↓
  [ApiClient::getPresignedUploadUrl] (file: frontend/src/lib/api-client.ts:357) → Step 2: Get presigned URL
    ↓
    [s3UploadHandler] (file: backend/src/s3-upload.ts:64)
    ↓
    [S3UploadService::generatePresignedUploadUrl] (file: backend/src/services/s3-upload-service.ts:32)
  ↓
  [Frontend Direct S3 Upload] (file: frontend/src/lib/api-client.ts:372) → Step 3: Upload to S3
    ↓ (async, fired by the Step 3 upload)
    [S3 PUT Event Trigger] (file: infrastructure/lib/constructs/processing-construct.ts:449)
    ↓
    [documentProcessorHandler] (file: lambda/src/document-processor/index.ts:11)
  ↓
  [ApiClient::updateDocument] (file: frontend/src/lib/api-client.ts:391) → Step 4: Update metadata
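
The four client-side steps in the file attachment flow can be sketched as below. This is a minimal illustration under stated assumptions, not the project's `ApiClient`: the `UploadApi` interface, its method signatures, and the `uploadFileDocument` name are hypothetical and exist only for this example.

```typescript
// Hypothetical interface covering the four operations the trace attributes
// to the API client. The real method signatures are not shown in the trace.
interface UploadApi {
  createDocument(businessId: string, title: string): Promise<{ documentId: string }>;
  getPresignedUploadUrl(
    businessId: string,
    documentId: string,
    filename: string,
  ): Promise<{ uploadUrl: string; s3Key: string }>;
  putToS3(uploadUrl: string, body: Uint8Array): Promise<void>;
  updateDocument(businessId: string, documentId: string, patch: { s3Key: string }): Promise<void>;
}

async function uploadFileDocument(
  api: UploadApi,
  businessId: string,
  file: { name: string; bytes: Uint8Array },
): Promise<{ documentId: string; s3Key: string }> {
  // Step 1: create the metadata record so the document has an ID.
  const { documentId } = await api.createDocument(businessId, file.name);
  // Step 2: ask the backend for a presigned URL scoped to this document.
  const { uploadUrl, s3Key } = await api.getPresignedUploadUrl(businessId, documentId, file.name);
  // Step 3: upload straight to S3. This PUT is what fires the S3 event.
  await api.putToS3(uploadUrl, file.bytes);
  // Step 4: record the final S3 key on the metadata record.
  await api.updateDocument(businessId, documentId, { s3Key });
  return { documentId, s3Key };
}
```

Injecting the API as an interface keeps the ordering testable without network access. Document processing may already be running by the time Step 4 completes, because the S3 event fires at Step 3.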

CALL FLOW DIAGRAM - URL Processing

[handleCreateDocument] (file: frontend/src/app/dashboard/businesses/[businessId]/page.tsx:89)
↓
[useDocuments::processHtmlUrl] (file: frontend/src/hooks/use-documents.ts:96)
↓
[ApiClient::processHtmlUrl] (file: frontend/src/lib/api-client.ts:418)
↓
[htmlProcessingHandler] (file: backend/src/html-processing.ts:321)
  ↓
  [sendHtmlProcessingMessage] (file: backend/src/html-processing.ts:49)
  ↓
  [SQS Queue Trigger] (file: infrastructure/lib/constructs/processing-construct.ts:436)
  ↓
  [htmlProcessingWorkerHandler] (file: backend/src/html-processing-worker.ts:24)
    ↓
    [processHtmlContent] (file: backend/src/html-processing.ts:236)
      ↓
      [extractHtmlContent] (file: backend/src/html-processing.ts:150)
      ↓
      [extractContentWithGemini] (file: backend/src/html-processing.ts:28)
      ↓
      [S3UploadService::uploadTextContent] (file: backend/src/services/s3-upload-service.ts:146)
      ↓
      [S3 PUT Event Trigger] → Same processing pipeline as above

BRANCHING & SIDE EFFECT TABLE

Location | Condition | Branches | Uncertain
api-client.ts:336 | if request.file exists | createDocumentWithUpload(), createDocument() | No
document-service.ts:94 | if documentType is 'text' or 'html', and content is present | uploadTextContent(), skip upload | No
document-management.ts:66 | if documentType is 'text' or 'html', and content is missing | return error, continue | No
html-processing.ts:354 | if urlValidation.isValid | process URL, return error | No
document-processor/index.ts:38 | if eventName.startsWith('s3:ObjectCreated') | creation processing, deletion processing | No
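
The first three conditions in the table can be read as one routing decision. The sketch below is an assumption-laden illustration: `CreateDocumentRequest`, `UploadRoute`, and `routeCreateRequest` are hypothetical names, and only the `'text'` and `'html'` document types come from the trace.

```typescript
// Hypothetical request shape. 'text' and 'html' appear in the trace;
// any other documentType value is an assumption.
interface CreateDocumentRequest {
  documentType: string;
  content?: string;
  file?: { name: string };
}

type UploadRoute =
  | 'presigned-file-upload'   // request.file exists
  | 'text-content-upload'     // text/html with content
  | 'reject-missing-content'  // text/html without content
  | 'metadata-only';          // anything else: create the record, skip the upload

function routeCreateRequest(req: CreateDocumentRequest): UploadRoute {
  if (req.file) return 'presigned-file-upload';
  const isTextual = req.documentType === 'text' || req.documentType === 'html';
  if (isTextual) {
    // An empty string counts as missing content in this sketch.
    return req.content ? 'text-content-upload' : 'reject-missing-content';
  }
  return 'metadata-only';
}
```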

SIDE EFFECTS

- [database] DynamoDB document record creation with pending status (document-service.ts:115)
- [database] DynamoDB document status updates to completed/failed (document-processor.ts:544)
- [storage] S3 file uploads with business/document path structure (s3-upload-service.ts:164)
- [storage] S3 presigned URL generation for secure uploads (s3-upload-service.ts:77)
- [network] FireCrawl API calls for HTML content extraction (html-processing.ts:158)
- [network] Gemini API calls for intelligent content processing (html-processing.ts:28)
- [network] OpenAI API calls for embedding generation (document-processor.ts:96)
- [vector-db] Pinecone vector storage with business namespace isolation (document-processor.ts:403)
- [queue] SQS message publishing for async HTML processing (html-processing.ts:77)
- [state] Document processing status transitions in DynamoDB (document-processor.ts:298)

USAGE POINTS

1. page.tsx:89 - Main user interaction point for all document uploads in dashboard
2. use-documents.ts:60 - React hook abstraction for document operations with error handling
3. api-client.ts:335 - Central API client routing different upload strategies based on content type
4. document-management.ts:10 - API Gateway Lambda entry point for document CRUD operations
5. s3-upload.ts:64 - Presigned URL generation Lambda for secure file uploads
6. html-processing.ts:321 - HTML URL processing Lambda with async SQS queuing
7. document-processor/index.ts:11 - S3 event-triggered Lambda for unified document processing
8. document-processor.ts:156 - Shared service for all document processing and embedding generation

ENTRY POINTS

- handleCreateDocument (context: User-initiated document upload from React dashboard)
- s3UploadHandler (context: API Gateway route for presigned URL generation)
- htmlProcessingHandler (context: API Gateway route for URL processing requests)
- documentProcessorHandler (context: S3 event-triggered processing for all uploaded files)
- htmlProcessingWorkerHandler (context: SQS-triggered async HTML content processing)

🎯 KEY ARCHITECTURAL INSIGHTS

Convergence Pattern

All three upload types (text, files, URLs) converge at the documentProcessorLambda triggered by S3 PUT events, ensuring consistent processing regardless of upload method.

Event-Driven Architecture

The system uses S3 events as the primary trigger mechanism, so document processing runs asynchronously and is decoupled from the request/response cycle.

Security Model

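File uploads use presigned URLs for direct frontend-to-S3 transfer. The browser never holds AWS credentials, and the upload avoids Lambda payload limits and reduces bandwidth costs because file bytes are not proxied through the backend.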

Processing Pipeline

  1. Text Content: Immediate S3 upload → S3 event → processing
  2. File Attachments: 4-step presigned URL process → S3 event → processing
  3. URLs: SQS queuing → async extraction → S3 upload → S3 event → processing

Error Handling

Each processing stage updates document status in DynamoDB, providing clear visibility into processing state and failure points.
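
One way to keep status updates consistent is to validate each transition before writing it. The sketch below is illustrative only: the trace mentions the `pending`, `completed`, and `failed` statuses, while the intermediate `processing` status and the `nextStatus` helper are assumptions.

```typescript
// 'pending', 'completed' and 'failed' come from the trace.
// 'processing' is an assumed intermediate state.
type DocumentStatus = 'pending' | 'processing' | 'completed' | 'failed';

const allowedTransitions: Record<DocumentStatus, readonly DocumentStatus[]> = {
  pending: ['processing', 'failed'],
  processing: ['completed', 'failed'],
  completed: [], // terminal
  failed: [],    // terminal
};

// Returns the requested status if the transition is allowed, otherwise throws.
function nextStatus(current: DocumentStatus, requested: DocumentStatus): DocumentStatus {
  if (allowedTransitions[current].indexOf(requested) === -1) {
    throw new Error(`Illegal status transition: ${current} -> ${requested}`);
  }
  return requested;
}
```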

Scalability

SQS queuing for HTML processing enables handling of high-volume URL processing without blocking the request path.

🚀 POST-REQUEST INVOCATION PATTERNS

When web requests finish, here's what gets invoked:

  1. Text Content:

    • Request completes after S3 upload
    • Async: documentProcessorLambda via S3 event
  2. File Attachments:

    • Request completes after metadata update
    • Async: documentProcessorLambda via S3 event (already triggered during step 3)
  3. URLs:

    • Request completes immediately after SQS queuing
    • Async: htmlProcessingWorkerLambda → content extraction → S3 upload → documentProcessorLambda

🔧 SUPPORTED FILE TYPES & S3 EVENT TRIGGERS

The system processes files with the following extensions via S3 events:

  • .pdf, .doc, .docx (documents)
  • .txt, .md, .json, .csv (text files)
  • .html (web content)

S3 events configured:

  • OBJECT_CREATED_PUT → triggers document processing
  • OBJECT_REMOVED_DELETE → triggers vector cleanup
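
A handler receiving both event kinds has to split records into creation and deletion groups and ignore unsupported extensions. The following sketch shows one way to do that. `S3RecordLike` and `classifyRecords` are hypothetical names, case-insensitive extension matching is an assumption, and the code accepts event names with or without the `s3:` prefix because the trace does not show how the real handler normalizes them.

```typescript
// Extensions listed in the trace.
const SUPPORTED_EXTENSIONS = ['.pdf', '.doc', '.docx', '.txt', '.md', '.json', '.csv', '.html'];

// Simplified, hypothetical view of an S3 event record.
interface S3RecordLike {
  eventName: string;
  key: string;
}

function isSupportedKey(key: string): boolean {
  const lower = key.toLowerCase(); // assumption: matching ignores case
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

function classifyRecords(records: S3RecordLike[]): {
  creationRecords: S3RecordLike[];
  deletionRecords: S3RecordLike[];
  skipped: S3RecordLike[];
} {
  const creationRecords: S3RecordLike[] = [];
  const deletionRecords: S3RecordLike[] = [];
  const skipped: S3RecordLike[] = [];
  for (const record of records) {
    // Tolerate both "s3:ObjectCreated:Put" and "ObjectCreated:Put".
    const name = record.eventName.replace(/^s3:/, '');
    if (!isSupportedKey(record.key)) skipped.push(record);
    else if (name.startsWith('ObjectCreated')) creationRecords.push(record);
    else if (name.startsWith('ObjectRemoved')) deletionRecords.push(record);
    else skipped.push(record);
  }
  return { creationRecords, deletionRecords, skipped };
}
```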

🏗️ TECHNOLOGY STACK

  • Frontend: React, Next.js, TypeScript
  • Backend: Node.js, AWS Lambda, TypeScript
  • Storage: AWS S3 (documents), DynamoDB (metadata), Pinecone (vectors)
  • Processing: OpenAI (embeddings), Gemini 2.5 Flash (content extraction)
  • Queue: AWS SQS for async HTML processing
  • Infrastructure: AWS CDK

📋 CONCLUSION

The trace reveals a well-architected system with:

  • ✅ Clear separation of concerns
  • ✅ Proper async processing
  • ✅ Unified convergence at document processor level
  • ✅ Event-driven scalability
  • ✅ Comprehensive error handling
  • ✅ Security best practices with presigned URLs

The architecture demonstrates excellent design patterns for handling multiple content types through a single, consistent processing pipeline.

Deletion Workflow Analysis for AskMarkAI Document System

Overview

This analysis examines the deletion workflow for three content types in the AskMarkAI document system:

  1. Text Content - Direct text input
  2. File Attachments - PDF/DOC files uploaded via presigned URLs
  3. URL Processing - HTML content extracted from web URLs

Complete Deletion Flow

1. Frontend Initiation

Location: /frontend/src/app/dashboard/businesses/[businessId]/page.tsx:533

const confirmDeleteDocument = async () => {
  setDeletingDocumentId(deleteConfirmation.documentId);
  try {
    await deleteDocument(deleteConfirmation.documentId);
    // ... (excerpt ends here; the rest of the handler is not shown)

2. React Hook Layer

Location: /frontend/src/hooks/use-documents.ts:174

const deleteDocument = async (documentId: string): Promise<boolean> => {
  try {
    const document = documents.find(d => d.documentId === documentId);
    const response = await apiClient.deleteDocument(businessId, documentId);
    // ... (excerpt ends here; the rest of the hook is not shown)

3. API Client Layer

Location: /frontend/src/lib/api-client.ts:226

async deleteDocument(businessId: string, documentId: string): Promise<ApiResponse<{ message: string }>> {
  return this.makeRequest<{ message: string }>(`/businesses/${businessId}/documents/${documentId}`, {
    method: 'DELETE',
  });
}

4. Backend API Handler

Location: /backend/src/document-management.ts:100

case 'DELETE':
  if (documentId) {
    await documentService.deleteDocument(businessId, documentId, userId, username);
    return successResponse({ message: 'Document deleted successfully' });
    // ... (excerpt ends here; the rest of the handler is not shown)

5. Document Service - Critical Three-Layer Cleanup

Location: /backend/src/services/document-service.ts:241

The DocumentService deletes a document in three steps: permission validation, S3 file deletion, and DynamoDB record deletion.

Step 1: Permission Validation

const hasPermission = await this.businessService.hasBusinessPermission(userId, businessId, 'editor', username);
if (!hasPermission) {
  throw new Error('Insufficient permissions to delete documents in this business');
}

Step 2: S3 File Deletion

// Delete from S3 if s3Key exists
// This will trigger S3 deletion events which should clean up vectors from Pinecone
if (document.s3Key) {
  console.log(`Deleting S3 object: ${document.s3Key}`);
  try {
    await this.s3Service.deleteFile(document.s3Key);
    console.log(`Successfully deleted S3 object: ${document.s3Key}`);
  } catch (error) {
    console.error(`Failed to delete S3 object: ${document.s3Key}`, error);
    // Continue with DynamoDB deletion even if S3 deletion fails
    // This prevents orphaned DynamoDB records
  }
}

Step 3: DynamoDB Record Deletion

const deleteCommand = new DeleteCommand({
  TableName: this.documentsTableName,
  Key: { businessId, documentId },
});
await this.dynamoClient.send(deleteCommand);
console.log(`Successfully deleted document ${documentId} from DynamoDB`);

6. S3 Event-Triggered Vector Cleanup

Location: /lambda/src/document-processor/index.ts:66

S3 deletion events automatically trigger the DocumentProcessorLambda:

// Process deletion events (vector cleanup)
if (deletionRecords.length > 0) {
  console.log(`🗑️ Processing ${deletionRecords.length} document deletion events`);
  
  for (const record of deletionRecords) {
    try {
      await processor.deleteDocumentVectors(record.key);
      totalProcessed++;
    } catch (error) {
      console.error(`❌ Failed to delete vectors for ${record.key}:`, error);
      totalFailed++;
      failedFiles.push(record.key);
    }
  }
}

7. Pinecone Vector Deletion

Location: /shared/services/document-processor.ts:1107

The DocumentProcessor cleans up vectors in three stages: S3 key parsing, vector query with batch deletion, and deletion verification.

S3 Key Parsing

// Parse the S3 key to get businessId and documentId
const { businessId, documentId } = parseS3KeyForIds(s3Key);

Vector Query and Batch Deletion

// First, query to find all vectors for this document
const queryResponse = await indexNamespace.query({
  vector: new Array(1536).fill(0), // Dummy vector for metadata filtering
  topK: 10000, // Large number to get all chunks
  includeMetadata: true,
  filter: {
    source: s3Key // Filter by exact S3 source path
  }
});

const vectorIds = queryResponse.matches?.map(match => match.id) || [];

// Delete vectors in batches (Pinecone has limits on batch size)
const batchSize = 100;
const batches = createBatches(vectorIds, batchSize);
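
The excerpt calls `createBatches` without showing its body. A plausible shape for such a helper is sketched here. This is an assumption about its behavior, not the project's actual implementation.

```typescript
// Hypothetical implementation: split a list into consecutive chunks
// of at most batchSize items each.
function createBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError(`batchSize must be a positive integer, got ${batchSize}`);
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}
```

With 250 vector IDs and a batch size of 100, this yields three batches of 100, 100, and 50 IDs.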

Deletion Verification

// Verify deletion by querying for remaining vectors
const verificationQuery = await indexNamespace.query({
  vector: new Array(1536).fill(0),
  topK: 1,
  includeMetadata: true,
  filter: {
    source: s3Key
  }
});

if (verificationQuery.matches && verificationQuery.matches.length > 0) {
  throw new Error(`Deletion verification failed: ${verificationQuery.matches.length} vectors still exist for ${s3Key}`);
}

Content Type Specific Analysis

Text Content Documents

  • S3 Key Format: businesses/{businessId}/documents/{documentId}/{filename}.txt
  • Storage: Content stored in S3, metadata in DynamoDB
  • Vector Cleanup: Uses S3 source path filtering for precise vector identification
  • Risk Level: LOW - Standard format, reliable cleanup

File Attachments (PDF/DOC)

  • S3 Key Format: businesses/{businessId}/documents/{documentId}/{originalFilename}
  • Storage: File stored in S3, metadata in DynamoDB
  • Vector Cleanup: Uses S3 source path filtering
  • Risk Level: LOW - Standard format, reliable cleanup

URL-Processed Documents (HTML)

  • S3 Key Format: businesses/{businessId}/documents/{documentId}/{sanitizedTitle}.txt
  • Processing: FireCrawl extracts content → S3 storage → Vector embedding
  • Vector Cleanup: Uses S3 source path filtering
  • Risk Level: ⚠️ MEDIUM - Complex processing pipeline, multiple async steps
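
All three content types share the same key layout, so a single key builder could cover them. The helper below is a hypothetical sketch: `buildDocumentS3Key` does not appear in the trace, and rejecting empty parts or parts containing `/` is an assumption chosen so that the resulting key always splits into exactly five path segments.

```typescript
// Hypothetical helper producing keys in the documented layout:
// businesses/{businessId}/documents/{documentId}/{filename}
function buildDocumentS3Key(businessId: string, documentId: string, filename: string): string {
  const parts = [businessId, documentId, filename];
  for (const part of parts) {
    // Assumption: no part may be empty or contain a path separator.
    if (part.length === 0 || part.indexOf('/') !== -1) {
      throw new Error(`Invalid key component: "${part}"`);
    }
  }
  return `businesses/${businessId}/documents/${documentId}/${filename}`;
}
```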

S3 Key Parsing Logic

Location: /shared/utils/aws-utils.ts:70

export function parseS3KeyForIds(key: string): { businessId: string; documentId: string } {
  const pathParts = key.split('/');
  
  // Handle the current format: businesses/{businessId}/documents/{documentId}/{filename}
  if (pathParts.length >= 5 && pathParts[0] === 'businesses' && pathParts[2] === 'documents') {
    const businessId = pathParts[1];
    const documentId = pathParts[3];
    if (!businessId || !documentId) {
      throw new Error(`Invalid S3 key format: ${key}. Expected: businesses/{businessId}/documents/{documentId}/{filename}`);
    }
    return { businessId, documentId };
  }
  // Handle legacy HTML format: html-{uuid}
  else if (key.startsWith('html-') && key.length === 41) {
    const documentId = key.substring(5); // Remove 'html-' prefix
    return { businessId: 'legacy', documentId };
  }
  else {
    throw new Error(`Invalid S3 key format: ${key}. Expected: businesses/{businessId}/documents/{documentId}/{filename} or legacy html-{uuid}`);
  }
}

Identified Issues and Concerns

1. ⚠️ S3 Deletion Failure Tolerance

Issue: If S3 deletion fails, the system continues with DynamoDB deletion to prevent orphaned records.
Risk: This can create an inconsistent state where the DynamoDB record is deleted but the S3 file remains. Because no S3 deletion event fires in that case, the document's Pinecone vectors remain as well.
Impact: Storage costs, potential confusion if the file is manually discovered later.

2. ⚠️ Vector Cleanup Dependency on S3 Events

Issue: Pinecone vector cleanup depends entirely on S3 deletion events being triggered.
Risk: If the S3 deletion event is not fired or Lambda processing fails, vectors remain orphaned.
Impact: Storage costs in Pinecone, irrelevant search results.

3. ⚠️ Async Processing Race Conditions (HTML Documents)

Issue: HTML documents involve complex async processing with SQS queues.
Risk: If a user deletes a document while HTML processing is still running, the worker might fail or create an inconsistent state.
Impact: Failed processing attempts, error logs, potential vector creation after deletion.
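
One possible mitigation for this race is for the worker to re-check that the document record still exists immediately before it uploads extracted content. The sketch below uses hypothetical names (`DocumentLookup`, `storeIfDocumentStillExists`) and is not taken from the codebase. It narrows the window but does not close it, since a deletion can still land between the check and the upload.

```typescript
// Hypothetical lookup over the document metadata store.
interface DocumentLookup {
  exists(businessId: string, documentId: string): Promise<boolean>;
}

type StoreOutcome = 'stored' | 'skipped-deleted';

async function storeIfDocumentStillExists(
  lookup: DocumentLookup,
  businessId: string,
  documentId: string,
  upload: () => Promise<void>,
): Promise<StoreOutcome> {
  // Re-check right before the side effect: the user may have deleted
  // the document while extraction was running.
  if (!(await lookup.exists(businessId, documentId))) {
    return 'skipped-deleted';
  }
  await upload();
  return 'stored';
}
```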

4. ✅ Legacy HTML Key Support

Observation: The system maintains backward compatibility with legacy html-{uuid} keys.
Status: This is beneficial, because it ensures old documents can still be cleaned up properly.

5. ⚠️ Pinecone Query Limits

Issue: Vector cleanup queries are limited to 10,000 results (topK: 10000).
Risk: For documents with more than 10,000 chunks, some vectors might not be found and deleted.
Impact: Orphaned vectors for very large documents.

6. ⚠️ Error Handling in Batch Operations

Issue: If a single batch deletion fails in Pinecone cleanup, the entire operation throws an error.
Risk: Partial vector cleanup, leaving some vectors orphaned.
Impact: Inconsistent cleanup state.

Recommendations

High Priority

  1. Improve S3 Deletion Error Handling: Consider implementing retry logic or manual cleanup jobs for failed S3 deletions.
  2. Add Vector Cleanup Monitoring: Implement monitoring to detect orphaned vectors and alert on cleanup failures.
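
The retry logic suggested for failed S3 deletions could look like the sketch below. It is a generic exponential-backoff wrapper, not code from the project. `retryWithBackoff`, its parameters, and the default values are assumptions, and the `sleep` function is injectable so the wrapper can be exercised without real delays.

```typescript
// Generic retry wrapper with exponential backoff (hypothetical helper).
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 100,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Delays grow as baseDelayMs, 2x, 4x, ...
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

A caller might wrap the S3 delete as `retryWithBackoff(() => s3Service.deleteFile(key))` and, if the final attempt still fails, record the key for a later cleanup job rather than silently continuing.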

Medium Priority

  1. Handle Large Document Deletion: Implement pagination or multiple queries for documents with >10,000 vectors.
  2. Improve Batch Error Handling: Continue with remaining batches even if one fails, then report partial success.
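
Both medium-priority items can be combined in one loop: query repeatedly until no undeleted vectors remain, and keep going when a batch fails. The sketch below runs against a hypothetical `VectorIndexLike` interface rather than the Pinecone client, so the method names are assumptions. It also assumes deleted vectors stop appearing in later queries, which an eventually consistent index may not guarantee immediately.

```typescript
// Hypothetical, minimal view of a vector index.
interface VectorIndexLike {
  // Returns up to `limit` IDs of vectors whose source matches.
  queryIds(source: string, limit: number): Promise<string[]>;
  deleteMany(ids: string[]): Promise<void>;
}

async function deleteAllVectorsForSource(
  index: VectorIndexLike,
  source: string,
  pageSize: number = 10000,
  batchSize: number = 100,
): Promise<{ deleted: number; failedIds: string[] }> {
  const failed = new Set<string>();
  let deleted = 0;
  for (;;) {
    const page = await index.queryIds(source, pageSize);
    // Skip IDs that already failed so the loop always terminates.
    const pending = page.filter((id) => !failed.has(id));
    if (pending.length === 0) break;
    for (let start = 0; start < pending.length; start += batchSize) {
      const batch = pending.slice(start, start + batchSize);
      try {
        await index.deleteMany(batch);
        deleted += batch.length;
      } catch {
        // Record the failure and continue with the remaining batches.
        batch.forEach((id) => failed.add(id));
      }
    }
  }
  return { deleted, failedIds: Array.from(failed) };
}
```

Returning the failed IDs lets the caller report partial success and retry or alert on the remainder instead of aborting the whole cleanup.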

Low Priority

  1. Add Deletion Audit Trail: Consider logging detailed deletion steps for debugging and compliance.

Conclusion

The deletion workflow is generally robust, with a good three-layer cleanup architecture. The main risks lie in error handling and in edge cases involving very large documents or processing failures. The system shows good architectural patterns, with event-driven cleanup and proper separation of concerns.

Overall Risk Assessment: ACCEPTABLE with recommended improvements for production reliability.
