# ADR-026: Structured Logging and Observability Strategy

## Status

Accepted (Partially Implemented)

## Implementation Progress

### Completed

- ✅ OpenTelemetry integration for logs, metrics, and traces (Extensions.cs)
- ✅ OTLP exporter configuration via environment variables
- ✅ Azure Monitor integration support
- ✅ Serilog file logging in Development mode with daily rolling
- ✅ Health check filtering from traces
- ✅ Component name enrichment for log files
- ✅ Polly/Resilience logging noise reduction
- ✅ Aspire Dashboard for local development observability
### In Progress
- ⚠️ Standardized logging patterns across all components
- ⚠️ High-performance LoggerMessage source generators
### Remaining
- ⏳ Consistent use of log scopes for operations
- ⏳ Sensitive data redaction with Microsoft.Extensions.Compliance
- ⏳ Custom metrics instrumentation per component
- ⏳ Production log aggregation configuration
- ⏳ Alerting and dashboard setup
## Key Implementation Files

- `strata/service-defaults/src/Acsis.Dynaplex.Strata.ServiceDefaults/Extensions.cs` - Component-specific logging patterns (to be standardized)
## Context
Observability is critical for understanding, debugging, and operating distributed microservices. The Dynaplex architecture requires comprehensive telemetry that correlates across service boundaries while maintaining high performance and protecting sensitive data.
### Requirements
- Performance: Logging must not impact service latency or throughput
- Correlation: Trace requests across service boundaries automatically
- Structure: Logs must be machine-parseable for querying and alerting
- Security: Sensitive data must never appear in logs
- Consistency: All components must follow the same patterns
- Flexibility: Support multiple backends (OTLP, Azure Monitor, local files)
### Existing Foundation
- .NET Aspire provides OpenTelemetry configuration via ServiceDefaults
- OpenTelemetry is the industry-standard, vendor-neutral observability framework
- .NET 10 has first-class support for LoggerMessage source generators
## Decision
We adopt a three-pillar observability strategy built on OpenTelemetry:
- Logs: Structured logging via ILogger with OpenTelemetry export
- Traces: Distributed tracing via System.Diagnostics.Activity
- Metrics: Custom and automatic metrics via System.Diagnostics.Metrics
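Concretely, the three pillars map onto native .NET APIs that OpenTelemetry exports directly. A minimal sketch of all three side by side (the component and instrument names here are illustrative, not from the codebase):

```csharp
// ExampleTelemetry.cs - illustrative only; shows the three pillar APIs together
public class ExampleTelemetry
{
    // Traces: one ActivitySource per component
    private static readonly ActivitySource Source = new("Acsis.Dynaplex.Example");

    // Metrics: one Meter per component, instruments created from it
    private static readonly Meter Meter = new("Acsis.Dynaplex.Example");
    private static readonly Counter<long> Processed =
        Meter.CreateCounter<long>("example.operations.processed");

    public void Process(ILogger logger)
    {
        // Trace span; its TraceId flows into the log entry below via Activity.Current
        using var activity = Source.StartActivity("Process");

        // Structured log, correlated automatically with the active span
        logger.LogInformation("Processing operation {OperationName}", "Process");

        // Metric increment
        Processed.Add(1);
    }
}
```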
### Core Principles
- OpenTelemetry-first: Use native .NET logging/tracing APIs that feed into OpenTelemetry
- High-performance by default: LoggerMessage source generators for all logging
- Automatic correlation: Activity.TraceId propagates through all calls
- Redact, don't omit: Use data classification to redact sensitive info
- Semantic conventions: Follow OpenTelemetry semantic conventions
## Consequences

### Positive
- Performance: Source-generated logging eliminates boxing and allocation
- Debugging: Trace any request across all services via TraceId
- Searchability: Query structured logs by any property
- Flexibility: Switch observability backends without code changes
- Compliance: Audit trails with full context, sensitive data redacted
### Negative
- Storage costs: Structured logs are larger than plain text
- Learning curve: Team must learn structured logging patterns
- Migration effort: Existing logging must be updated
### Neutral
- Tool selection: Azure Monitor for production, Aspire Dashboard for dev
## Implementation

### 1. High-Performance Logging with Source Generators
Every component should define its log messages using the LoggerMessageAttribute for maximum performance. This eliminates boxing, reduces allocations, and provides compile-time validation.
```csharp
// CatalogLogMessages.cs - Define all log messages for the component
namespace Acsis.Dynaplex.Engines.Catalog;

/// <summary>
/// High-performance log messages for the Catalog component.
/// Uses source generators to eliminate boxing and reduce allocations.
/// </summary>
public static partial class CatalogLogMessages
{
    // ========================================
    // Item Operations (EventId 1000-1099)
    // ========================================

    [LoggerMessage(
        EventId = 1000,
        Level = LogLevel.Information,
        Message = "Creating item {ItemName} of type {ItemTypeId} at location {LocationId}")]
    public static partial void CreatingItem(
        this ILogger logger,
        string itemName,
        Guid itemTypeId,
        Guid locationId);

    [LoggerMessage(
        EventId = 1001,
        Level = LogLevel.Information,
        Message = "Created item {ItemId} with passport {PassportId}")]
    public static partial void CreatedItem(
        this ILogger logger,
        Guid itemId,
        Guid passportId);

    [LoggerMessage(
        EventId = 1002,
        Level = LogLevel.Warning,
        Message = "Item {ItemId} not found")]
    public static partial void ItemNotFound(
        this ILogger logger,
        Guid itemId);

    [LoggerMessage(
        EventId = 1003,
        Level = LogLevel.Error,
        Message = "Failed to create item {ItemName}")]
    public static partial void FailedToCreateItem(
        this ILogger logger,
        string itemName,
        Exception exception);

    // ========================================
    // Move Operations (EventId 1100-1199)
    // ========================================

    [LoggerMessage(
        EventId = 1100,
        Level = LogLevel.Information,
        Message = "Moving {ItemCount} items from {SourceLocationId} to {DestinationLocationId}")]
    public static partial void MovingItems(
        this ILogger logger,
        int itemCount,
        Guid sourceLocationId,
        Guid destinationLocationId);

    [LoggerMessage(
        EventId = 1101,
        Level = LogLevel.Information,
        Message = "Moved item {ItemId} to location {LocationId}")]
    public static partial void MovedItem(
        this ILogger logger,
        Guid itemId,
        Guid locationId);

    // ========================================
    // Batch Operations (EventId 1200-1299)
    // ========================================

    [LoggerMessage(
        EventId = 1200,
        Level = LogLevel.Information,
        Message = "Starting batch import of {ItemCount} items")]
    public static partial void StartingBatchImport(
        this ILogger logger,
        int itemCount);

    [LoggerMessage(
        EventId = 1201,
        Level = LogLevel.Information,
        Message = "Completed batch import: {SuccessCount} succeeded, {FailureCount} failed")]
    public static partial void CompletedBatchImport(
        this ILogger logger,
        int successCount,
        int failureCount);

    [LoggerMessage(
        EventId = 1202,
        Level = LogLevel.Warning,
        Message = "Batch import item {Index} failed: {Reason}")]
    public static partial void BatchImportItemFailed(
        this ILogger logger,
        int index,
        string reason);
}
```
### 2. Using Log Messages in Services

```csharp
public class ItemsService(
    CatalogDb db,
    ILogger<ItemsService> logger,
    ICurrentTenantService tenantService)
{
    public async Task<Item> CreateItemAsync(CreateItemRequest request, CancellationToken ct)
    {
        // Log entry with structured properties
        logger.CreatingItem(request.Name, request.ItemTypeId, request.LocationId);

        try
        {
            var item = new Item
            {
                Id = Guid.NewGuid(),
                Name = request.Name,
                ItemTypeId = request.ItemTypeId,
                LocationId = request.LocationId,
                TenantId = tenantService.TenantId
            };

            db.Items.Add(item);
            await db.SaveChangesAsync(ct);

            // Log success with result data
            logger.CreatedItem(item.Id, item.PassportId);
            return item;
        }
        catch (Exception ex)
        {
            // Log failure - the trailing Exception parameter is captured
            // automatically by the LoggerMessage source generator
            logger.FailedToCreateItem(request.Name, ex);
            throw;
        }
    }
}
```
### 3. Log Scopes for Complex Operations

Use log scopes to add context that applies to multiple log entries:

```csharp
public class BatchProcessor(ILogger<BatchProcessor> logger)
{
    public async Task ProcessBatchAsync(Batch batch, CancellationToken ct)
    {
        // All logs within this scope will include BatchId, BatchSize, and BatchType
        using (logger.BeginScope(new Dictionary<string, object>
        {
            ["BatchId"] = batch.Id,
            ["BatchSize"] = batch.Items.Count,
            ["BatchType"] = batch.Type.ToString()
        }))
        {
            logger.StartingBatchImport(batch.Items.Count);

            var successCount = 0;
            var failureCount = 0;

            for (var i = 0; i < batch.Items.Count; i++)
            {
                var item = batch.Items[i];

                // Inner scope for per-item context
                using (logger.BeginScope("Processing item {ItemIndex} of {TotalItems}", i + 1, batch.Items.Count))
                {
                    try
                    {
                        await ProcessItemAsync(item, ct);
                        successCount++;
                    }
                    catch (Exception ex)
                    {
                        failureCount++;
                        logger.BatchImportItemFailed(i, ex.Message);
                    }
                }
            }

            logger.CompletedBatchImport(successCount, failureCount);
        }
    }
}
```
### 4. Automatic Correlation with OpenTelemetry

OpenTelemetry automatically propagates TraceId and SpanId across HTTP calls. The ServiceDefaults configuration handles this:

```csharp
// In Extensions.cs - already implemented
builder.Logging.AddOpenTelemetry(logging =>
{
    logging.IncludeFormattedMessage = true;
    logging.IncludeScopes = true; // Essential for scope propagation
});
```

When viewing logs, you can filter by TraceId to see all logs from a single request across all services:

```text
TraceId: 4bf92f3577b34da6a3ce929d0e0e4736
```
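With an Azure Monitor backend, for example, a Kusto query along these lines returns every log row for one trace (`traces` and `operation_Id` are the Application Insights schema names; other backends use different field names for the trace identifier):

```kusto
traces
| where operation_Id == "4bf92f3577b34da6a3ce929d0e0e4736"
| order by timestamp asc
```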
### 5. Tenant and User Context Enrichment

Add tenant and user context to all logs automatically via middleware:

```csharp
// TelemetryEnrichmentMiddleware.cs
public class TelemetryEnrichmentMiddleware(RequestDelegate next)
{
    public async Task InvokeAsync(HttpContext context)
    {
        var activity = Activity.Current;

        // Add tenant context
        if (context.User.FindFirst("tenant_id")?.Value is { } tenantId)
        {
            activity?.SetTag("tenant.id", tenantId);
            activity?.SetBaggage("tenant.id", tenantId);
        }

        // Add user context (never log sensitive user data)
        if (context.User.FindFirst("sub")?.Value is { } userId)
        {
            activity?.SetTag("user.id", userId);
        }

        // Add request context
        activity?.SetTag("http.request.id", context.TraceIdentifier);

        await next(context);
    }
}
```
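The middleware also needs to be registered in the request pipeline, after authentication so the `tenant_id` and `sub` claims are populated. A sketch of the wiring in Program.cs:

```csharp
app.UseAuthentication();
app.UseAuthorization();

// Register after auth so the user's claims are available to the enricher
app.UseMiddleware<TelemetryEnrichmentMiddleware>();
```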
### 6. Sensitive Data Redaction

Use Microsoft.Extensions.Compliance for automatic redaction of sensitive data:

```csharp
// Define data classifications and matching parameter attributes.
// DataClassification takes a taxonomy name plus a value; the taxonomy
// name "Dynaplex" here is illustrative.
public static class DataClassifications
{
    public static DataClassification Pii { get; } = new("Dynaplex", nameof(Pii));
    public static DataClassification Sensitive { get; } = new("Dynaplex", nameof(Sensitive));
}

public sealed class PiiAttribute() : DataClassificationAttribute(DataClassifications.Pii) { }
public sealed class SensitiveAttribute() : DataClassificationAttribute(DataClassifications.Sensitive) { }

// Apply to log parameters
[LoggerMessage(
    EventId = 2000,
    Level = LogLevel.Information,
    Message = "User {UserId} logged in from {IpAddress}")]
public static partial void UserLoggedIn(
    this ILogger logger,
    string userId,
    [Sensitive] string ipAddress); // IP address will be redacted

// Configure redaction in DI
services.AddRedaction(builder =>
{
    builder.SetRedactor<ErasingRedactor>(DataClassifications.Pii);
    builder.SetRedactor<StarRedactor>(DataClassifications.Sensitive); // StarRedactor: our custom Redactor implementation
});

// Enable redaction in logging
builder.Logging.EnableRedaction();
```
### 7. Component-Specific Metrics

Each component should define custom metrics for business operations:

```csharp
// CatalogMetrics.cs
public class CatalogMetrics
{
    private readonly Counter<long> _itemsCreated;
    private readonly Counter<long> _itemsMoved;
    private readonly Histogram<double> _batchProcessingDuration;

    public CatalogMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("Acsis.Dynaplex.Engines.Catalog");

        _itemsCreated = meter.CreateCounter<long>(
            "catalog.items.created",
            unit: "{item}",
            description: "Number of items created");

        _itemsMoved = meter.CreateCounter<long>(
            "catalog.items.moved",
            unit: "{item}",
            description: "Number of items moved");

        _batchProcessingDuration = meter.CreateHistogram<double>(
            "catalog.batch.duration",
            unit: "s",
            description: "Duration of batch processing operations");
    }

    public void RecordItemCreated(string itemType) =>
        _itemsCreated.Add(1, new KeyValuePair<string, object?>("item.type", itemType));

    public void RecordItemMoved(Guid fromLocation, Guid toLocation) =>
        _itemsMoved.Add(1,
            new KeyValuePair<string, object?>("from.location", fromLocation.ToString()),
            new KeyValuePair<string, object?>("to.location", toLocation.ToString()));

    public void RecordBatchDuration(double seconds, string batchType) =>
        _batchProcessingDuration.Record(seconds,
            new KeyValuePair<string, object?>("batch.type", batchType));
}
```
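To be usable, the metrics class must be registered once in DI and injected wherever operations are recorded. A sketch, assuming `IMeterFactory` is already registered (it is when metrics are enabled via ServiceDefaults):

```csharp
// Registration: singleton so each instrument is created exactly once
services.AddSingleton<CatalogMetrics>();

// Usage: inject alongside the logger and record at the same points you log
public class MoveService(CatalogMetrics metrics, ILogger<MoveService> logger)
{
    public void OnItemMoved(Guid itemId, Guid from, Guid to)
    {
        logger.MovedItem(itemId, to);
        metrics.RecordItemMoved(from, to);
    }
}
```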
### 8. Configuration

#### Development (appsettings.Development.json)

```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Debug",
      "Microsoft": "Information",
      "Microsoft.AspNetCore": "Warning",
      "Microsoft.EntityFrameworkCore": "Information",
      "Microsoft.EntityFrameworkCore.Database.Command": "Information"
    },
    "Console": {
      "IncludeScopes": true,
      "TimestampFormat": "HH:mm:ss.fff "
    }
  }
}
```
#### Production

Log levels via appsettings.json:

```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft": "Warning",
      "Microsoft.AspNetCore": "Warning",
      "Microsoft.EntityFrameworkCore": "Warning"
    }
  }
}
```

Exporter settings via environment variables (the OTLP exporter reads these directly):

```text
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-otel-collector:4317
APPLICATIONINSIGHTS_CONNECTION_STRING=InstrumentationKey=...
```
### 9. Event ID Allocation Strategy
Each component should use a distinct range of Event IDs:
| Component | EventId Range | Example |
|---|---|---|
| Catalog | 1000-1999 | 1000 = ItemCreated |
| Spatial | 2000-2999 | 2000 = LocationCreated |
| Transport | 3000-3999 | 3000 = ShipmentCreated |
| Workflow | 4000-4999 | 4000 = WorkflowStarted |
| Identity | 5000-5999 | 5000 = UserAuthenticated |
| IoT | 6000-6999 | 6000 = TagRead |
| Events | 7000-7999 | 7000 = EventPublished |
| Prism | 8000-8999 | 8000 = PassportCreated |
| Core-Data | 9000-9999 | 9000 = ReferenceDataLoaded |
### 10. Log Level Guidelines
| Level | When to Use | Examples |
|---|---|---|
| Trace | Detailed diagnostic info, development only | Method entry/exit, variable values |
| Debug | Development and troubleshooting | Cache hits/misses, query details |
| Information | Normal operations, business events | Item created, user logged in |
| Warning | Unusual but handled situations | Retry occurred, deprecated API used |
| Error | Operation failures, exceptions | Database error, external API failed |
| Critical | System failures requiring immediate attention | Service unavailable, data corruption |
## Migration Plan
- Phase 1: Create LogMessages classes for each component ⏳
- Phase 2: Migrate existing logging to use source generators ⏳
- Phase 3: Add log scopes to complex operations ⏳
- Phase 4: Implement sensitive data redaction ⏳
- Phase 5: Add component-specific metrics ⏳
- Phase 6: Configure production log aggregation ⏳
- Phase 7: Create dashboards and alerts ⏳
## Related ADRs
- ADR-019: OpenTelemetry Integration (traces and metrics)
- ADR-020: Container-Aware Configuration (environment-based settings)
- ADR-027: Event-Driven Architecture (event logging)