ADR-019: OpenTelemetry Integration

Status

Accepted

Context

In a distributed microservices architecture, observability is crucial for understanding system behavior, debugging issues, and monitoring performance. We need comprehensive telemetry including distributed tracing, metrics, and structured logging across all Dynaplex services.

OpenTelemetry has emerged as the industry standard for observability, providing:

  • Vendor-neutral telemetry collection
  • Standardized instrumentation
  • Support for traces, metrics, and logs
  • Wide ecosystem support
  • Integration with major observability platforms

Decision

We will adopt OpenTelemetry as the standard observability framework across all Dynaplex services, providing distributed tracing, metrics collection, and structured logging.

Implementation approach:

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation())
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddRuntimeInstrumentation())
    .UseOtlpExporter();

Consequences

Positive

  • Distributed Tracing: Full request flow visibility across services
  • Performance Insights: Identify bottlenecks and slow operations
  • Error Tracking: Correlate errors across service boundaries
  • Vendor Neutral: Can switch observability backends without code changes
  • Auto-Instrumentation: Automatic telemetry for common libraries
  • Standards-Based: Following industry best practices

Negative

  • Performance Overhead: Small latency increase from instrumentation
  • Data Volume: Significant telemetry data generation
  • Cost: Storage and processing costs for telemetry data
  • Complexity: Additional configuration and management
  • Learning Curve: Team needs to understand observability concepts

Neutral

  • Infrastructure Requirements: Requires telemetry backend (Jaeger, etc.)
  • Sampling Strategies: Must balance detail vs. data volume
  • Privacy Considerations: Sensitive data (credentials, personal data, raw SQL text) must be sanitized before it is recorded in spans, baggage, or logs

Implementation Notes

Complete Service Configuration

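// Program.cs
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);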

// Service identification
var serviceName = "catalog-service";
var serviceVersion = "1.0.0";

// Configure OpenTelemetry
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: serviceName,
            serviceVersion: serviceVersion,
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing =>
    {
        tracing
            // Core instrumentation
            .AddAspNetCoreInstrumentation(options =>
            {
                options.RecordException = true;
                options.Filter = (httpContext) =>
                {
                    // Skip health checks
                    return !httpContext.Request.Path.StartsWithSegments("/health");
                };
            })
            
            // HTTP client instrumentation
            .AddHttpClientInstrumentation(options =>
            {
                options.RecordException = true;
                options.FilterHttpRequestMessage = (httpRequestMessage) =>
                {
                    // Skip telemetry endpoints
                    return httpRequestMessage.RequestUri?.Host.Contains("telemetry") != true;
                };
            })
            
            // Database instrumentation
            .AddSqlClientInstrumentation(options =>
            {
                options.RecordException = true;
                // Captures SQL text on spans; rely on parameterized queries so values are not recorded
                options.SetDbStatementForText = true;
                options.SetDbStatementForStoredProcedure = true;
            })
            
            // Custom instrumentation
            .AddSource("Dynaplex.Catalog")
            
            // Sampling
            .SetSampler(new TraceIdRatioBasedSampler(0.1)) // 10% sampling
            
            // Export to OTLP
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri(builder.Configuration["Telemetry:Endpoint"]
                    ?? throw new InvalidOperationException("Telemetry:Endpoint is not configured"));
            });
    })
    .WithMetrics(metrics =>
    {
        metrics
            // Core metrics
            .AddAspNetCoreInstrumentation()
            .AddRuntimeInstrumentation()
            .AddProcessInstrumentation()
            
            // Custom metrics
            .AddMeter("Dynaplex.Catalog")
            
            // Export to OTLP
            .AddOtlpExporter(options =>
            {
                options.Endpoint = new Uri(builder.Configuration["Telemetry:Endpoint"]
                    ?? throw new InvalidOperationException("Telemetry:Endpoint is not configured"));
            });
    });

// Configure logging to OpenTelemetry
builder.Logging.AddOpenTelemetry(options =>
{
    options.SetResourceBuilder(ResourceBuilder.CreateDefault()
        .AddService(serviceName));
    options.AddOtlpExporter();
});

Custom Instrumentation

using System.Diagnostics;
using System.Diagnostics.Metrics;
using OpenTelemetry.Trace; // RecordException extension method

public class AssetService
{
    // Names must match AddSource/AddMeter("Dynaplex.Catalog") in Program.cs
    private static readonly ActivitySource ActivitySource = 
        new ActivitySource("Dynaplex.Catalog");
    private static readonly Meter Meter = 
        new Meter("Dynaplex.Catalog", "1.0.0");
    private static readonly Counter<int> AssetCreatedCounter = 
        Meter.CreateCounter<int>("assets.created", "Assets", "Number of assets created");
    
    // _db is the service's injected DbContext (declaration omitted here)
    
    public async Task<Asset> CreateAssetAsync(CreateAssetRequest request)
    {
        // Start custom span (null when nothing is listening to this source)
        using var activity = ActivitySource.StartActivity("CreateAsset");
        activity?.SetTag("asset.type", request.TypeId);
        activity?.SetTag("asset.location", request.LocationId);
        
        try
        {
            var asset = new Asset { /* ... */ };
            
            await _db.SaveChangesAsync();
            
            // Record metric only after the asset has been persisted
            AssetCreatedCounter.Add(1, new KeyValuePair<string, object?>("type", request.TypeId));
            
            activity?.SetTag("asset.id", asset.Id);
            activity?.SetStatus(ActivityStatusCode.Ok);
            
            return asset;
        }
        catch (Exception ex)
        {
            // Attach the exception to the span and mark it as failed
            activity?.RecordException(ex);
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
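
When tags are set as in the service above, sensitive values can be redacted before they reach a span. The following is a minimal sketch of that idea using only `System.Diagnostics`. The class name `TelemetrySanitizer`, the list of sensitive keys, and the `[REDACTED]` placeholder are illustrative assumptions, not existing Dynaplex or OpenTelemetry APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper: the name and the key list are assumptions for illustration.
public static class TelemetrySanitizer
{
    // Attribute keys whose values should not leave the process in clear text.
    private static readonly HashSet<string> SensitiveKeys =
        new(StringComparer.OrdinalIgnoreCase)
        {
            "user.email",
            "user.name",
            "http.request.header.authorization"
        };

    // Returns the value unchanged unless the key is on the sensitive list.
    public static object? Sanitize(string key, object? value) =>
        SensitiveKeys.Contains(key) ? "[REDACTED]" : value;

    // Null-safe alternative to activity?.SetTag(key, value) that sanitizes first.
    public static Activity? SetSanitizedTag(this Activity? activity, string key, object? value) =>
        activity?.SetTag(key, Sanitize(key, value));
}
```

A call site would then use `activity.SetSanitizedTag("user.email", email)` in place of `activity?.SetTag(...)`. A key-based list is only one option; redacting in a span processor or in the collector may be preferable when many services are involved.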

Baggage and Context Propagation

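using System.Diagnostics;
using OpenTelemetry;

public static class TelemetryContext
{
    public static void SetUser(string userId, string username)
    {
        // Baggage travels in request headers to every downstream service,
        // so only put values here that are safe to share.
        Baggage.SetBaggage("user.id", userId);
        Baggage.SetBaggage("user.name", username);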
        
        Activity.Current?.SetTag("user.id", userId);
    }
    
    public static void SetTenant(string tenantId)
    {
        Baggage.SetBaggage("tenant.id", tenantId);
        Activity.Current?.SetTag("tenant.id", tenantId);
    }
}

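// Register after app.UseAuthentication() so that context.User is populated
app.Use(async (context, next) =>
{
    if (context.User.Identity?.IsAuthenticated == true)
    {
        var userId = context.User.FindFirst("sub")?.Value;
        var username = context.User.Identity.Name;
        
        // Either value can be missing, depending on the token's claims
        if (userId is not null && username is not null)
        {
            TelemetryContext.SetUser(userId, username);
        }
    }
    
    await next();
});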
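
On the receiving side, a downstream service can read the propagated values back. The sketch below assumes the OpenTelemetry .NET `Baggage` API and the baggage keys used above; the class name `BaggageEnricher` is hypothetical. It copies selected baggage entries onto the current span, since baggage entries are not recorded as span attributes by default.

```csharp
using System.Diagnostics;
using OpenTelemetry;

// Hypothetical helper for downstream services; not part of the OpenTelemetry SDK.
public static class BaggageEnricher
{
    // Copies the named baggage entries onto the current span so they become
    // searchable in the tracing backend.
    public static void CopyBaggageToCurrentSpan(params string[] keys)
    {
        var activity = Activity.Current;
        if (activity is null) return; // no active span, nothing to enrich

        foreach (var key in keys)
        {
            var value = Baggage.GetBaggage(key);
            if (!string.IsNullOrEmpty(value))
            {
                activity.SetTag(key, value);
            }
        }
    }
}
```

A service might call `BaggageEnricher.CopyBaggageToCurrentSpan("tenant.id", "user.id")` early in its request pipeline. Listing keys explicitly, rather than copying all baggage, keeps unexpected upstream values out of the span.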

Sampling Configuration

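using System.Linq;
using OpenTelemetry.Trace;

// Custom head sampler. ShouldSample runs when a span starts, so it only sees
// attributes supplied at creation time. Rules based on the outcome (errors,
// slow requests) generally need tail sampling, for example in the
// OpenTelemetry Collector.
public class AdaptiveSampler : Sampler
{
    // Reuse one fallback sampler instead of allocating one per call
    private readonly Sampler _fallback = new TraceIdRatioBasedSampler(0.1);

    public override SamplingResult ShouldSample(in SamplingParameters parameters)
    {
        // Always sample spans that start with an error attribute
        if (parameters.Tags?.Any(t => t.Key == "error" && t.Value is true) == true)
            return new SamplingResult(SamplingDecision.RecordAndSample);
        
        // Always sample spans that start with a duration attribute above 1000 ms
        if (parameters.Tags?.Any(t => t.Key == "http.duration" && t.Value is int ms && ms > 1000) == true)
            return new SamplingResult(SamplingDecision.RecordAndSample);
        
        // Default 10% sampling
        return _fallback.ShouldSample(parameters);
    }
}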

Metrics Examples

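using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Register as a singleton so the Meter and its instruments are created once.
// The meter must also be enabled with .AddMeter("Dynaplex.Metrics"),
// otherwise these instruments are not exported.
public class MetricsService
{
    private readonly Histogram<double> _requestDuration;
    private readonly Counter<int> _requestCount;
    private readonly UpDownCounter<int> _activeRequests;
    
    public MetricsService()
    {
        var meter = new Meter("Dynaplex.Metrics");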
        
        _requestDuration = meter.CreateHistogram<double>(
            "http.request.duration",
            unit: "ms",
            description: "HTTP request duration");
            
        _requestCount = meter.CreateCounter<int>(
            "http.request.count",
            description: "Total HTTP requests");
            
        _activeRequests = meter.CreateUpDownCounter<int>(
            "http.request.active",
            description: "Currently active HTTP requests");
    }
    
    public async Task<T> TrackRequest<T>(string endpoint, Func<Task<T>> action)
    {
        // Explicit tag type avoids overload ambiguity with target-typed new
        var endpointTag = new KeyValuePair<string, object?>("endpoint", endpoint);
        
        _activeRequests.Add(1);
        var stopwatch = Stopwatch.StartNew();
        
        try
        {
            var result = await action();
            _requestCount.Add(1, endpointTag, new KeyValuePair<string, object?>("status", "success"));
            return result;
        }
        catch
        {
            _requestCount.Add(1, endpointTag, new KeyValuePair<string, object?>("status", "error"));
            throw;
        }
        finally
        {
            _activeRequests.Add(-1);
            // Elapsed.TotalMilliseconds keeps sub-millisecond precision
            _requestDuration.Record(stopwatch.Elapsed.TotalMilliseconds, endpointTag);
        }
    }
}

Configuration Examples

{
  "Telemetry": {
    "Endpoint": "http://otel-collector:4317",
    "ServiceName": "catalog-service",
    "ServiceVersion": "1.0.0",
    "Sampling": {
      "Type": "TraceIdRatio",
      "Ratio": 0.1
    },
    "Exporters": {
      "Console": false,
      "Otlp": true,
      "Jaeger": false
    }
  }
}
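
The `Sampling` section of this configuration is not read by the earlier code samples, which hard-code their ratios. One possible way to wire it in is sketched below. The class name `SamplerFactory` is hypothetical, the `AlwaysOn` and `AlwaysOff` type values are assumptions beyond the `TraceIdRatio` value shown above, and wrapping the result in `ParentBasedSampler` is a design choice of this sketch rather than something this ADR prescribes.

```csharp
using System;
using Microsoft.Extensions.Configuration;
using OpenTelemetry.Trace;

// Hypothetical factory; the configuration keys match the JSON above.
public static class SamplerFactory
{
    public static Sampler FromConfiguration(IConfiguration configuration)
    {
        var section = configuration.GetSection("Telemetry:Sampling");
        var type = section["Type"] ?? "TraceIdRatio";

        // Fall back to 10% and clamp to the valid [0, 1] range.
        var ratio = Math.Clamp(section.GetValue<double>("Ratio", 0.1), 0.0, 1.0);

        Sampler root = type switch
        {
            "AlwaysOn" => new AlwaysOnSampler(),       // assumed value
            "AlwaysOff" => new AlwaysOffSampler(),     // assumed value
            "TraceIdRatio" => new TraceIdRatioBasedSampler(ratio),
            _ => throw new InvalidOperationException($"Unknown sampler type '{type}'")
        };

        // Follow the caller's sampling decision when there is one, so a trace
        // is either kept or dropped as a whole across services.
        return new ParentBasedSampler(root);
    }
}
```

In `Program.cs` this could be used as `tracing.SetSampler(SamplerFactory.FromConfiguration(builder.Configuration))`, which lets each environment adjust the ratio without a code change.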

Development vs Production

if (builder.Environment.IsDevelopment())
{
    // Development: Export to console for debugging
    builder.Services.AddOpenTelemetry()
        .WithTracing(tracing => tracing
            .SetSampler(new AlwaysOnSampler()) // Sample everything
            .AddConsoleExporter());
}
else
{
    // Production: Export to OTLP with sampling
    builder.Services.AddOpenTelemetry()
        .WithTracing(tracing => tracing
            .SetSampler(new TraceIdRatioBasedSampler(0.01)) // 1% sampling
            .AddOtlpExporter());
}

Best Practices

  1. Use semantic conventions for attribute names
  2. Sample appropriately to control data volume
  3. Sanitize sensitive data before recording
  4. Create meaningful spans for important operations
  5. Use baggage for cross-service context
  6. Monitor telemetry overhead and adjust sampling
  7. Correlate logs with traces using trace IDs

Related ADRs

  • ADR-007: Migration to .NET Aspire Microservices (distributed architecture)
  • ADR-020: Container-Aware Configuration (telemetry in containers)
  • ADR-021: Microsoft Kiota for API Client Generation (propagates trace context)