Transform all documentation from a modular monolith to a true microservices
architecture in which core services are independently deployable.
Key Changes:
- Core Kernel: Infrastructure only (no business logic)
- Core Services: Auth, Identity, Authz, Audit as separate microservices
  - Each service has its own entry point (cmd/{service}/)
  - Each service has its own gRPC server and database schema
  - Services register with Consul for service discovery
- API Gateway: Moved from Epic 8 to Epic 1 as core infrastructure
  - Single entry point for all external traffic
  - Handles routing, JWT validation, rate limiting, CORS
- Service Discovery: Consul as primary mechanism (ADR-0033)
- Database Pattern: Per-service connections with schema isolation
Documentation Updates:
- Updated all 9 architecture documents
- Updated 4 ADRs and created 2 new ADRs (API Gateway, Service Discovery)
- Rewrote Epic 1: Core Kernel & Infrastructure (infrastructure only)
- Rewrote Epic 2: Core Services (Auth, Identity, Authz, Audit as services)
- Updated Epics 3-8 stories for the service architecture
- Updated plan.md, playbook.md, requirements.md, index.md
- Updated all epic READMEs and story files
New ADRs:
- ADR-0032: API Gateway Strategy
- ADR-0033: Service Discovery Implementation (Consul)
New Stories:
- Epic 1.7: Service Client Interfaces
- Epic 1.8: API Gateway Implementation
Epic 6: Observability & Production Readiness
Overview
Enhance observability with full OpenTelemetry integration across all services, add comprehensive error reporting (Sentry), create Grafana dashboards for service monitoring, and improve logging with request correlation across services. This epic also adds rate limiting (primarily at the API Gateway), hardens security, and optimizes performance for the microservices architecture.
Note: Observability spans all services - distributed tracing, service-level metrics, and cross-service log correlation.
Stories
6.1 Enhanced Observability
- Story: 6.1 - Enhanced Observability
- Goal: Enhance observability with full OpenTelemetry integration across all services, comprehensive Prometheus metrics per service, and improved logging with trace correlation.
- Deliverables: Complete OpenTelemetry integration, expanded metrics per service, enhanced logging with trace IDs
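One way to get the trace correlation described in this story is to pull the trace ID out of the incoming W3C `traceparent` header and attach it to every log line. The sketch below uses only the Go standard library and assumes the `version-traceid-parentid-flags` header layout. The function names, the `key=value` log layout, and the `req-123` request ID are illustrative assumptions, and a real service would more likely read the ID from the OpenTelemetry SDK's span context.

```go
package main

import (
	"fmt"
	"strings"
)

// traceIDFromTraceparent extracts the trace ID from a W3C Trace Context
// "traceparent" header value (version-traceid-parentid-flags).
// It returns false if the value is malformed or the trace ID is all zeros.
func traceIDFromTraceparent(header string) (string, bool) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) < 4 {
		return "", false
	}
	traceID := parts[1]
	if len(traceID) != 32 || !isLowerHex(traceID) || traceID == strings.Repeat("0", 32) {
		return "", false
	}
	return traceID, true
}

// isLowerHex reports whether s contains only lowercase hex digits.
func isLowerHex(s string) bool {
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// logLine formats a hypothetical key=value log line that carries both the
// request ID and the trace ID, so logs from different services can be joined.
func logLine(service, requestID, traceID, msg string) string {
	return fmt.Sprintf("service=%s request_id=%s trace_id=%s msg=%q",
		service, requestID, traceID, msg)
}

func main() {
	tp := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	if id, ok := traceIDFromTraceparent(tp); ok {
		fmt.Println(logLine("auth", "req-123", id, "login succeeded"))
	}
}
```

Because every service logs the same `trace_id`, a log query on that field returns the full cross-service path of one request, matching the trace shown in Jaeger.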
6.2 Error Reporting (Sentry)
- Story: 6.2 - Error Reporting
- Goal: Add comprehensive error reporting with Sentry integration.
- Deliverables: Sentry integration, error context enhancement
6.3 Grafana Dashboards
- Story: 6.3 - Grafana Dashboards
- Goal: Create comprehensive Grafana dashboards for monitoring all services.
- Deliverables: Grafana dashboard JSON files per service, service-level dashboards, cross-service dashboards, documentation
6.4 Rate Limiting
- Story: 6.4 - Rate Limiting
- Goal: Implement rate limiting primarily at the API Gateway level, with per-service rate limiting support.
- Deliverables: Rate limiting middleware for the API Gateway, per-service rate limiting support, Redis-backed rate limiting
Note: Rate limiting is primarily implemented in the API Gateway (Epic 1, Story 1.8). This story adds per-service rate limiting capabilities.
6.5 Security Hardening
- Story: 6.5 - Security Hardening
- Goal: Harden all services with security headers, input validation, and request limits.
- Deliverables: Security headers, input validation, request limits
6.6 Performance Optimization
- Story: 6.6 - Performance Optimization
- Goal: Optimize platform performance through connection pooling, query optimization, compression, and caching.
- Deliverables: Connection pooling, query optimization, compression, caching
Deliverables Checklist
- Full OpenTelemetry integration
- Sentry error reporting
- Enhanced logging with correlation
- Comprehensive Prometheus metrics
- Grafana dashboards
- Rate limiting
- Security hardening
- Performance optimizations
Acceptance Criteria
- Distributed traces span all services and are visible in Jaeger
- Errors are reported to Sentry with service context
- Logs include request IDs and trace IDs for correlation across services
- Metrics are exposed per service and scraped by Prometheus
- Rate limiting prevents abuse (primarily at API Gateway)
- Security headers are present on all services
- Performance meets the SLA (p95 latency < 100 ms for auth endpoints)
- Service-level dashboards available in Grafana