docs/api/sessions/cloudwatch-logging-best-practices.md

CloudWatch Logging Best Practices for Wallcrawler Sessions

This document outlines logging strategies for session visibility now that terminated sessions are cleaned up after 15 minutes.

Overview

With session data stored in DynamoDB and automatically removed via TTL (expiresAt), CloudWatch Logs become the primary source for debugging, auditing, and monitoring session lifecycles beyond the session timeout window.

Structured Logging Format

Use JSON structured logs for easy querying in CloudWatch Insights:

type SessionLogEntry struct {
    Timestamp   string                 `json:"timestamp"`
    SessionID   string                 `json:"session_id"`
    ProjectID   string                 `json:"project_id"`
    EventType   string                 `json:"event_type"`
    Status      string                 `json:"status"`
    Duration    int64                  `json:"duration_ms,omitempty"`
    Error       string                 `json:"error,omitempty"`
    Metadata    map[string]interface{} `json:"metadata,omitempty"`
}

// Example usage
func LogSessionEvent(event SessionLogEntry) {
    jsonBytes, _ := json.Marshal(event)
    log.Println(string(jsonBytes))
}

Key Events to Log

1. Session Lifecycle Events

// Session Created
LogSessionEvent(SessionLogEntry{
    Timestamp: time.Now().Format(time.RFC3339),
    SessionID: sessionID,
    ProjectID: projectID,
    EventType: "SESSION_CREATED",
    Status:    "CREATING",
    Metadata: map[string]interface{}{
        "timeout":      timeout,
        "user_id":      userID,
        "api_key_id":   apiKeyID,
    },
})

// Session Ready
LogSessionEvent(SessionLogEntry{
    SessionID: sessionID,
    EventType: "SESSION_READY",
    Status:    "READY",
    Duration:  provisioningTime.Milliseconds(),
    Metadata: map[string]interface{}{
        "public_ip":   publicIP,
        "task_arn":    taskARN,
        "container_id": containerID,
    },
})

// Session Terminated
LogSessionEvent(SessionLogEntry{
    SessionID: sessionID,
    EventType: "SESSION_TERMINATED",
    Status:    "STOPPED",
    Duration:  sessionDuration.Milliseconds(),
    Metadata: map[string]interface{}{
        "reason":       "timeout|manual|error",
        "proxy_bytes":  proxyBytes,
        "cpu_usage":    avgCPUUsage,
        "memory_usage": memoryUsage,
    },
})

2. Operation Events

// Browser Operations
LogSessionEvent(SessionLogEntry{
    SessionID: sessionID,
    EventType: "BROWSER_OPERATION",
    Metadata: map[string]interface{}{
        "operation": "navigate|extract|screenshot|act",
        "url":       targetURL,
        "selector":  selector,
        "success":   true,
    },
})

// Errors
LogSessionEvent(SessionLogEntry{
    SessionID: sessionID,
    EventType: "SESSION_ERROR",
    Error:     err.Error(),
    Metadata: map[string]interface{}{
        "operation":   operation,
        "retry_count": retryCount,
        "fatal":       isFatal,
    },
})

3. Resource Usage

// Periodic resource metrics
LogSessionEvent(SessionLogEntry{
    SessionID: sessionID,
    EventType: "RESOURCE_METRICS",
    Metadata: map[string]interface{}{
        "cpu_percent":    cpuUsage,
        "memory_mb":      memoryMB,
        "network_bytes":  networkBytes,
        "active_targets": chromeTargets,
    },
})

CloudWatch Insights Queries

Find Failed Sessions

fields @timestamp, session_id, error, metadata.operation
| filter event_type = "SESSION_ERROR"
| filter project_id = "proj_123"
| sort @timestamp desc
| limit 100

Session Duration Analysis

stats avg(duration_ms), max(duration_ms), min(duration_ms)
| filter event_type = "SESSION_TERMINATED"
| filter @timestamp > ago(1h)

Resource Usage by Project

stats sum(metadata.proxy_bytes) as total_bytes,
      avg(metadata.cpu_usage) as avg_cpu,
      count(*) as session_count by project_id
| filter event_type = "SESSION_TERMINATED"
| filter @timestamp > ago(24h)

Debug Specific Session

fields @timestamp, event_type, status, error, metadata
| filter session_id = "sess_abc123"
| sort @timestamp asc

Log Retention Strategy

  1. CloudWatch Log Groups:

    • /aws/lambda/sessions-create - 30 days (high-volume, synchronous entrypoint)
    • /aws/lambda/sessions-list & /sessions-retrieve - 14 days
    • /aws/lambda/sessions-update & /sessions-debug - 14 days
    • /aws/lambda/sessions-stream-processor - 14 days
    • /aws/lambda/ecs-task-processor - 14 days
    • /aws/lambda/authorizer - 14 days
    • /aws/ecs/wallcrawler-controller - 30 days
  2. Archive to S3:

    • Export terminated session logs to S3 after 30 days
    • Use S3 lifecycle policies for long-term retention
    • Enable S3 Intelligent-Tiering for cost optimization

Implementation Example

// utils/logging.go
package utils

import (
    "encoding/json"
    "log"
    "os"
)

var (
    // Use environment variable to enable/disable structured logging
    structuredLogging = os.Getenv("STRUCTURED_LOGGING") == "true"
)

func LogSessionCreated(sessionID, projectID string, metadata map[string]interface{}) {
    if structuredLogging {
        LogSessionEvent(SessionLogEntry{
            Timestamp: time.Now().Format(time.RFC3339),
            SessionID: sessionID,
            ProjectID: projectID,
            EventType: "SESSION_CREATED",
            Status:    "CREATING",
            Metadata:  metadata,
        })
    } else {
        log.Printf("Session created: %s for project %s", sessionID, projectID)
    }
}

func LogSessionError(sessionID string, err error, metadata map[string]interface{}) {
    if structuredLogging {
        LogSessionEvent(SessionLogEntry{
            Timestamp: time.Now().Format(time.RFC3339),
            SessionID: sessionID,
            EventType: "SESSION_ERROR",
            Error:     err.Error(),
            Metadata:  metadata,
        })
    } else {
        log.Printf("Session %s error: %v", sessionID, err)
    }
}

Monitoring & Alerting

CloudWatch Alarms

  1. High Error Rate:

    MetricName: SessionErrors
    Statistic: Sum
    Period: 300
    Threshold: 10
    
  2. Long Running Sessions:

    MetricName: SessionDuration
    Statistic: Maximum
    Period: 300
    Threshold: 600000 (10 minutes)
    

Custom Metrics

// Publish custom metrics
func PublishSessionMetrics(sessionID string, duration time.Duration) {
    cwClient := cloudwatch.NewFromConfig(cfg)
    
    _, err := cwClient.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("Wallcrawler/Sessions"),
        MetricData: []types.MetricDatum{
            {
                MetricName: aws.String("SessionDuration"),
                Value:      aws.Float64(duration.Seconds()),
                Unit:       types.StandardUnitSeconds,
                Dimensions: []types.Dimension{
                    {
                        Name:  aws.String("ProjectId"),
                        Value: aws.String(projectID),
                    },
                },
            },
        },
    })
}

Benefits

  1. Permanent Audit Trail: Session history preserved beyond the DynamoDB TTL window
  2. Advanced Analytics: CloudWatch Insights for complex queries
  3. Cost Tracking: Detailed usage metrics per project
  4. Debugging: Full session lifecycle visibility
  5. Compliance: Long-term retention for audit requirements

Best Practices

  1. Log Early & Often: Capture events as they happen
  2. Include Context: Always include sessionID and projectID
  3. Use Structured Logs: JSON format for easy parsing
  4. Batch Writes: Use CloudWatch Logs PutLogEvents for efficiency
  5. Set Alarms: Proactive monitoring for issues
  6. Regular Reviews: Analyze logs for optimization opportunities