Analysis Date: $(date)
System: AskMarkAI Document Upload & Processing Pipeline
Scope: Text Content, File Attachments, and URL Processing workflows
This comprehensive trace analysis reveals how AskMarkAI handles three distinct document upload types through a unified, event-driven architecture. All upload paths converge at a single document processor triggered by S3 events, ensuring consistent processing and vector embedding generation.
[handleCreateDocument] (file: frontend/src/app/dashboard/businesses/[businessId]/page.tsx:89)
↓
[useDocuments::createDocument] (file: frontend/src/hooks/use-documents.ts:60)
↓
[ApiClient::createDocumentWithUpload] (file: frontend/src/lib/api-client.ts:335)
↓
[ApiClient::createDocument] (file: frontend/src/lib/api-client.ts:297) → if not file upload
↓
[documentManagementHandler] (file: backend/src/document-management.ts:10)
↓
[DocumentService::createDocument] (file: backend/src/services/document-service.ts:70)
↓
[S3UploadService::uploadTextContent] (file: backend/src/services/s3-upload-service.ts:146) → if text/html content
↓
[S3 PUT Event Trigger] (file: infrastructure/lib/constructs/processing-construct.ts:449)
↓
[documentProcessorHandler] (file: lambda/src/document-processor/index.ts:11)
↓
[DocumentProcessor::processDocument] (file: shared/services/document-processor.ts:156)
↓
[DocumentProcessor::updateDocumentStatus] (file: shared/services/document-processor.ts:503)
[handleCreateDocument] (file: frontend/src/app/dashboard/businesses/[businessId]/page.tsx:89)
↓
[useDocuments::createDocument] (file: frontend/src/hooks/use-documents.ts:60)
↓
[ApiClient::createDocumentWithUpload] (file: frontend/src/lib/api-client.ts:335)
↓
[ApiClient::createDocument] (file: frontend/src/lib/api-client.ts:342) → Step 1: Create DynamoDB record
↓
[ApiClient::getPresignedUploadUrl] (file: frontend/src/lib/api-client.ts:357) → Step 2: Get presigned URL
↓
[s3UploadHandler] (file: backend/src/s3-upload.ts:64)
↓
[S3UploadService::generatePresignedUploadUrl] (file: backend/src/services/s3-upload-service.ts:32)
↓
[Frontend Direct S3 Upload] (file: frontend/src/lib/api-client.ts:372) → Step 3: Upload to S3
↓
[S3 PUT Event Trigger] (file: infrastructure/lib/constructs/processing-construct.ts:449)
↓
[ApiClient::updateDocument] (file: frontend/src/lib/api-client.ts:391) → Step 4: Update metadata
↓
[documentProcessorHandler] (file: lambda/src/document-processor/index.ts:11)
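
The four-step file path traced above can be sketched as a single orchestration function. This is a minimal sketch, not the actual `ApiClient` code: the `UploadDeps` interface, the `LocalFile` type, the function name `uploadFileDocument`, and every parameter and field name are assumptions introduced for illustration, since the trace only gives function names and line numbers.

```typescript
// Hypothetical dependency shapes; the real ApiClient signatures are not shown in the trace.
interface UploadDeps {
  createDocument(input: { name: string }): Promise<{ documentId: string }>;
  getPresignedUploadUrl(
    documentId: string,
    fileName: string,
  ): Promise<{ uploadUrl: string; s3Key: string }>;
  putToS3(uploadUrl: string, body: Uint8Array): Promise<void>;
  updateDocument(
    documentId: string,
    patch: { s3Key: string; fileSize: number },
  ): Promise<void>;
}

// Hypothetical stand-in for a browser File object.
interface LocalFile {
  name: string;
  bytes: Uint8Array;
}

async function uploadFileDocument(deps: UploadDeps, file: LocalFile): Promise<string> {
  // Step 1: create the DynamoDB record first so the document has an ID.
  const { documentId } = await deps.createDocument({ name: file.name });

  // Step 2: ask the backend for a presigned URL scoped to this document.
  const { uploadUrl, s3Key } = await deps.getPresignedUploadUrl(documentId, file.name);

  // Step 3: send the bytes straight to S3; no Lambda sits in the data path.
  await deps.putToS3(uploadUrl, file.bytes);

  // Step 4: record where the object landed and how large it is.
  await deps.updateDocument(documentId, { s3Key, fileSize: file.bytes.length });

  return documentId;
}
```

Passing the four operations in as dependencies keeps the ordering testable without a network. Note that the S3 event fires once step 3 completes, so in a design like this the processor may start before the step 4 metadata update has been written.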
[handleCreateDocument] (file: frontend/src/app/dashboard/businesses/[businessId]/page.tsx:89)
↓
[useDocuments::processHtmlUrl] (file: frontend/src/hooks/use-documents.ts:96)
↓
[ApiClient::processHtmlUrl] (file: frontend/src/lib/api-client.ts:418)
↓
[htmlProcessingHandler] (file: backend/src/html-processing.ts:321)
↓
[sendHtmlProcessingMessage] (file: backend/src/html-processing.ts:49)
↓
[SQS Queue Trigger] (file: infrastructure/lib/constructs/processing-construct.ts:436)
↓
[htmlProcessingWorkerHandler] (file: backend/src/html-processing-worker.ts:24)
↓
[processHtmlContent] (file: backend/src/html-processing.ts:236)
↓
[extractHtmlContent] (file: backend/src/html-processing.ts:150)
↓
[extractContentWithGemini] (file: backend/src/html-processing.ts:28)
↓
[S3UploadService::uploadTextContent] (file: backend/src/services/s3-upload-service.ts:146)
↓
[S3 PUT Event Trigger] → Same processing pipeline as above
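
The URL path above splits into a fast request handler that only enqueues and a worker that does the slow extraction. The sketch below illustrates that split under stated assumptions: the message shape, the `WorkerDeps` interface, the regex-based `isValidHttpUrl` check, and the 202/400 status codes are all hypothetical, and the queue is modeled as a plain callback rather than a real SQS client.

```typescript
// Hypothetical message shape; the real SQS payload is not shown in the trace.
interface HtmlProcessingMessage {
  documentId: string;
  businessId: string;
  url: string;
}

// Hypothetical worker dependencies standing in for extraction and S3 upload.
interface WorkerDeps {
  extractContent(url: string): Promise<string>;
  uploadTextContent(businessId: string, documentId: string, content: string): Promise<void>;
}

// Simplified stand-in for the URL validation step named in the trace.
function isValidHttpUrl(url: string): boolean {
  return /^https?:\/\/[^\s\/]+/i.test(url);
}

// Request path: validate, enqueue, and return without waiting for extraction.
async function handleProcessUrl(
  enqueue: (message: HtmlProcessingMessage) => Promise<void>,
  message: HtmlProcessingMessage,
): Promise<{ statusCode: number }> {
  if (!isValidHttpUrl(message.url)) {
    return { statusCode: 400 }; // assumed error status
  }
  await enqueue(message);
  return { statusCode: 202 }; // assumed "accepted for processing" status
}

// Worker path: runs later, once per queued message.
async function handleQueuedMessage(
  deps: WorkerDeps,
  message: HtmlProcessingMessage,
): Promise<void> {
  const content = await deps.extractContent(message.url);
  // Writing the extracted text to S3 is what triggers the shared processor.
  await deps.uploadTextContent(message.businessId, message.documentId, content);
}
```

The worker never calls the document processor directly; it only writes to S3, which is why the URL path ends in the same pipeline as the other two.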
| Location | Condition | Branches | Uncertain |
|---|---|---|---|
| api-client.ts:336 | if request.file exists | createDocumentWithUpload(), createDocument() | No |
| document-service.ts:94 | if (documentType === 'text' \|\| documentType === 'html') && content | uploadTextContent(), skip upload | No |
| document-management.ts:66 | if (documentType === 'text' \|\| documentType === 'html') && !content | return error, continue | No |
| html-processing.ts:354 | if urlValidation.isValid | process URL, return error | No |
| document-processor/index.ts:38 | if eventName.startsWith('s3:ObjectCreated') | creation processing, deletion processing | No |
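
The first three decision points in the table can be written as small predicates. This is a sketch under assumptions: the `CreateDocumentRequest` shape, the function names, and the error message are hypothetical. One detail worth spelling out is operator precedence: in JavaScript `&&` binds tighter than `||`, so a "text or html, and content present" condition needs explicit parentheses or a helper, as below.

```typescript
// Hypothetical request shape; only documentType, content, and file are implied by the trace.
interface CreateDocumentRequest {
  documentType: string;
  content?: string;
  file?: unknown;
}

// Document types whose body travels inline in the request.
function isInlineContentType(documentType: string): boolean {
  return documentType === "text" || documentType === "html";
}

// api-client.ts: route to the presigned-upload flow only when a file is attached.
function chooseClientPath(req: CreateDocumentRequest): "presigned-upload" | "direct-create" {
  return req.file ? "presigned-upload" : "direct-create";
}

// document-management.ts: reject inline types that arrive without content.
function validateCreateRequest(req: CreateDocumentRequest): string | null {
  if (isInlineContentType(req.documentType) && !req.content) {
    return "content is required for text and html documents"; // hypothetical message
  }
  return null;
}

// document-service.ts: upload to S3 only for inline types that carry content.
function shouldUploadTextContent(req: CreateDocumentRequest): boolean {
  return isInlineContentType(req.documentType) && Boolean(req.content);
}
```

For example, `shouldUploadTextContent({ documentType: "pdf", content: "x" })` is false here, whereas an unparenthesized `documentType === 'text' || 'html' && content` would evaluate to a truthy value for the same input.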
Side Effects:
- [database] DynamoDB document record creation with pending status (document-service.ts:115)
- [database] DynamoDB document status updates to completed/failed (document-processor.ts:544)
- [storage] S3 file uploads with business/document path structure (s3-upload-service.ts:164)
- [storage] S3 presigned URL generation for secure uploads (s3-upload-service.ts:77)
- [network] FireCrawl API calls for HTML content extraction (html-processing.ts:158)
- [network] Gemini API calls for intelligent content processing (html-processing.ts:28)
- [network] OpenAI API calls for embedding generation (document-processor.ts:96)
- [vector-db] Pinecone vector storage with business namespace isolation (document-processor.ts:403)
- [queue] SQS message publishing for async HTML processing (html-processing.ts:77)
- [state] Document processing status transitions in DynamoDB (document-processor.ts:298)
Usage Points:
1. page.tsx:89 - Main user interaction point for all document uploads in dashboard
2. use-documents.ts:60 - React hook abstraction for document operations with error handling
3. api-client.ts:335 - Central API client routing different upload strategies based on content type
4. document-management.ts:10 - API Gateway Lambda entry point for document CRUD operations
5. s3-upload.ts:64 - Presigned URL generation Lambda for secure file uploads
6. html-processing.ts:321 - HTML URL processing Lambda with async SQS queuing
7. document-processor/index.ts:11 - S3 event-triggered Lambda for unified document processing
8. document-processor.ts:156 - Shared service for all document processing and embedding generation
Entry Points:
- handleCreateDocument (context: User-initiated document upload from React dashboard)
- s3UploadHandler (context: API Gateway route for presigned URL generation)
- htmlProcessingHandler (context: API Gateway route for URL processing requests)
- documentProcessorHandler (context: S3 event-triggered processing for all uploaded files)
- htmlProcessingWorkerHandler (context: SQS-triggered async HTML content processing)
All three upload types (text, files, URLs) converge at the documentProcessorLambda triggered by S3 PUT events, ensuring consistent processing regardless of upload method.
The system uses S3 events as the primary trigger mechanism, so document processing runs asynchronously and is decoupled from the request/response cycle.
File uploads use presigned URLs for direct frontend-to-S3 transfer, avoiding Lambda payload limits and reducing bandwidth costs.
- Text Content: Immediate S3 upload → S3 event → processing
- File Attachments: 4-step presigned URL process → S3 event → processing
- URLs: SQS queuing → async extraction → S3 upload → S3 event → processing
Each processing stage updates document status in DynamoDB, providing clear visibility into processing state and failure points.
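
Status tracking of this kind is often modeled as a small state machine. The sketch below is illustrative only: the trace confirms a `pending` status at creation and `completed`/`failed` outcomes, while the intermediate `processing` state, the transition table, and the `nextStatus` helper are assumptions.

```typescript
// "pending", "completed", and "failed" appear in the trace; "processing" is an assumed intermediate state.
type DocumentStatus = "pending" | "processing" | "completed" | "failed";

// Assumed transition table: terminal states allow no further moves.
const ALLOWED_TRANSITIONS: Record<DocumentStatus, readonly DocumentStatus[]> = {
  pending: ["processing", "failed"],
  processing: ["completed", "failed"],
  completed: [],
  failed: [],
};

// Returns the next status, or throws if the move is not allowed.
function nextStatus(current: DocumentStatus, next: DocumentStatus): DocumentStatus {
  if (ALLOWED_TRANSITIONS[current].indexOf(next) === -1) {
    throw new Error("illegal status transition: " + current + " -> " + next);
  }
  return next;
}
```

Guarding transitions this way makes a late or duplicate event (for instance, a retried Lambda invocation) fail loudly instead of silently moving a completed document back to an earlier state. A real system might instead allow `failed` to return to `pending` for reprocessing.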
SQS queuing for HTML processing enables handling of high-volume URL processing without blocking the request path.
When web requests finish, here's what gets invoked:
- Text Content:
  - Request completes after S3 upload
  - Async: documentProcessorLambda via S3 event
- File Attachments:
  - Request completes after metadata update
  - Async: documentProcessorLambda via S3 event (already triggered during step 3)
- URLs:
  - Request completes immediately after SQS queuing
  - Async: htmlProcessingWorkerLambda → content extraction → S3 upload → documentProcessorLambda
The system processes files with the following extensions via S3 events:
- .pdf, .doc, .docx (documents)
- .txt, .md, .json, .csv (text files)
- .html (web content)
S3 events configured:
- OBJECT_CREATED_PUT → triggers document processing
- OBJECT_REMOVED_DELETE → triggers vector cleanup
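
The extension filter and the two event types suggest a dispatch step along the following lines. This is a sketch, not the actual handler: `S3RecordLike`, `classifyRecord`, and the `"process" | "cleanup" | "skip"` result are hypothetical names, and applying the extension filter to deletions as well as creations is an assumption. As far as the S3 notification format goes, records normally carry `eventName` without the `s3:` prefix (for example `ObjectCreated:Put`) and URL-encode the object key with `+` for spaces, so the sketch normalizes both.

```typescript
// Extensions listed in the analysis.
const SUPPORTED_EXTENSIONS: readonly string[] = [
  ".pdf", ".doc", ".docx", ".txt", ".md", ".json", ".csv", ".html",
];

// Minimal, hypothetical subset of an S3 event record.
interface S3RecordLike {
  eventName: string;
  s3: { object: { key: string } };
}

type RecordAction = "process" | "cleanup" | "skip";

function classifyRecord(record: S3RecordLike): RecordAction {
  // Object keys in S3 events are URL-encoded, with spaces sent as "+".
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

  // Compare extensions case-insensitively; keys without a dot have no extension.
  const dot = key.lastIndexOf(".");
  const extension = dot === -1 ? "" : key.slice(dot).toLowerCase();
  if (SUPPORTED_EXTENSIONS.indexOf(extension) === -1) {
    return "skip";
  }

  // Tolerate both "s3:ObjectCreated:Put" and "ObjectCreated:Put".
  const eventName = record.eventName.replace(/^s3:/, "");
  if (eventName.indexOf("ObjectCreated") === 0) {
    return "process"; // embed and index the new document
  }
  if (eventName.indexOf("ObjectRemoved") === 0) {
    return "cleanup"; // remove the document's vectors
  }
  return "skip";
}
```

For instance, a record with `eventName: "ObjectRemoved:Delete"` and key `biz-1/doc-1/notes.md` is classified as `"cleanup"`, while the same event on `logo.png` is skipped.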
- Frontend: React, Next.js, TypeScript
- Backend: Node.js, AWS Lambda, TypeScript
- Storage: AWS S3 (documents), DynamoDB (metadata), Pinecone (vectors)
- Processing: OpenAI (embeddings), Gemini 2.5 Flash (content extraction)
- Queue: AWS SQS for async HTML processing
- Infrastructure: AWS CDK
The trace reveals a well-architected system with:
- ✅ Clear separation of concerns
- ✅ Proper async processing
- ✅ Unified convergence at document processor level
- ✅ Event-driven scalability
- ✅ Comprehensive error handling
- ✅ Security best practices with presigned URLs
In short, the architecture routes three content types (text, files, and URLs) through a single, consistent processing pipeline.