feat: reword phase to epic, update mkdocs

2025-11-05 09:28:33 +01:00
parent 65a428534c
commit ace9678f6c
64 changed files with 214 additions and 208 deletions
--- a/docs/content/stories/epic6/6.1-enhanced-observability.md
+++ b/docs/content/stories/epic6/6.1-enhanced-observability.md
@@ -0,0 +1,72 @@
+# Story 6.1: Enhanced Observability
+
+## Metadata
+- **Story ID**: 6.1
+- **Title**: Enhanced Observability
+- **Epic**: 6 - Observability & Production Readiness
+- **Status**: Pending
+- **Priority**: High
+- **Estimated Time**: 6-8 hours
+- **Dependencies**: 1.6, 5.1, 5.2
+
+## Goal
+Enhance observability with full OpenTelemetry integration, expanded Prometheus metrics, and improved logging with request correlation.
+
+## Description
+This story enhances the observability system by completing OpenTelemetry integration with all infrastructure components, expanding Prometheus metrics, and improving logging with better correlation and structured fields.
+
+## Deliverables
+
+### 1. Complete OpenTelemetry Integration
+- Export traces to Jaeger/OTLP collector
+- Add database instrumentation (Ent interceptor)
+- Add Kafka instrumentation
+- Add Redis instrumentation
+- Create custom spans:
+  - Module initialization spans
+  - Background job spans
+  - Event publishing spans
+- Trace context propagation:
+  - Include trace ID in logs
+  - Propagate across HTTP calls
+  - Include in error reports
+
+### 2. Prometheus Metrics Expansion
+- Add more metrics:
+  - Database connection pool stats
+  - Cache hit/miss ratio
+  - Event bus publish/consume rates
+  - Background job execution times
+  - Module-specific metrics (via module interface)
+- Create metric labels:
+  - `module` label for module metrics
+  - `tenant_id` label (if multi-tenant)
+  - `status` label for error rates
+
+### 3. Enhanced Logging
+- Add structured fields:
+  - `user_id` from context
+  - `tenant_id` from context
+  - `module` name for module logs
+  - `trace_id` from OpenTelemetry
+- Create log aggregation config:
+  - JSON format for production
+  - Human-readable for development
+  - Support for Loki/CloudWatch/ELK
+
+## Acceptance Criteria
+- [ ] Traces are exported and visible in Jaeger
+- [ ] All infrastructure components are instrumented
+- [ ] Trace IDs are included in logs
+- [ ] Metrics are expanded with new dimensions
+- [ ] Logs include all correlation fields
+- [ ] Logs are emitted as JSON in production and can be ingested by Loki, CloudWatch, or ELK
+
+## Files to Create/Modify
+- `internal/observability/tracer.go` - Enhanced tracing
+- `internal/infra/database/client.go` - Add tracing
+- `internal/infra/cache/redis_cache.go` - Add tracing
+- `internal/infra/bus/kafka_bus.go` - Add tracing
+- `internal/metrics/metrics.go` - Expanded metrics
+- `internal/logger/zap_logger.go` - Enhanced logging
+
--- a/docs/content/stories/epic6/6.2-error-reporting.md
+++ b/docs/content/stories/epic6/6.2-error-reporting.md
@@ -0,0 +1,53 @@
+# Story 6.2: Error Reporting (Sentry)
+
+## Metadata
+- **Story ID**: 6.2
+- **Title**: Error Reporting (Sentry)
+- **Epic**: 6 - Observability & Production Readiness
+- **Status**: Pending
+- **Priority**: High
+- **Estimated Time**: 4-5 hours
+- **Dependencies**: 1.4
+
+## Goal
+Add comprehensive error reporting with Sentry integration that captures errors with full context.
+
+## Description
+This story integrates Sentry for error reporting, sending all errors from the error bus to Sentry with complete context including trace IDs, user information, and module context.
+
+## Deliverables
+
+### 1. Sentry Integration
+- Install and configure Sentry SDK
+- Integrate with error bus:
+  - Send errors to Sentry
+  - Include trace ID in Sentry events
+  - Add user context (user ID, email)
+  - Add module context (module name)
+- Sentry middleware:
+  - Capture panics
+  - Capture HTTP errors (4xx, 5xx)
+- Configure Sentry DSN via config
+
+### 2. Error Context Enhancement
+- Enrich errors with:
+  - Request context
+  - User information
+  - Module information
+  - Stack traces
+  - Environment information
+
+## Acceptance Criteria
+- [ ] Errors are reported to Sentry with context
+- [ ] Panics are captured and reported
+- [ ] HTTP errors are captured
+- [ ] Trace IDs are included in Sentry events
+- [ ] User context is included
+- [ ] Sentry DSN is configurable
+
+## Files to Create/Modify
+- `internal/errorbus/sentry_bus.go` - Sentry integration
+- `internal/server/middleware.go` - Sentry middleware
+- `internal/di/providers.go` - Add Sentry provider
+- `config/default.yaml` - Add Sentry config
+
--- a/docs/content/stories/epic6/6.3-grafana-dashboards.md
+++ b/docs/content/stories/epic6/6.3-grafana-dashboards.md
@@ -0,0 +1,46 @@
+# Story 6.3: Grafana Dashboards
+
+## Metadata
+- **Story ID**: 6.3
+- **Title**: Grafana Dashboards
+- **Epic**: 6 - Observability & Production Readiness
+- **Status**: Pending
+- **Priority**: Medium
+- **Estimated Time**: 4-5 hours
+- **Dependencies**: 1.3, 6.1
+
+## Goal
+Create comprehensive Grafana dashboards for monitoring platform health, performance, and errors.
+
+## Description
+This story creates Grafana dashboard JSON files that visualize platform metrics, health, and performance data from Prometheus.
+
+## Deliverables
+
+### 1. Grafana Dashboards (`ops/grafana/dashboards/`)
+- `platform-overview.json` - Overall health dashboard
+- `http-metrics.json` - HTTP request metrics
+- `database-metrics.json` - Database performance
+- `module-metrics.json` - Per-module metrics
+- `error-rates.json` - Error tracking
+- Dashboard setup documentation
+
+### 2. Documentation
+- Document dashboard setup in `docs/operations.md`
+- Dashboard import instructions
+- Metric explanation
+
+## Acceptance Criteria
+- [ ] All dashboards are created
+- [ ] Dashboards display correct metrics
+- [ ] Dashboard setup is documented
+- [ ] Dashboards can be imported into Grafana
+
+## Files to Create/Modify
+- `ops/grafana/dashboards/platform-overview.json`
+- `ops/grafana/dashboards/http-metrics.json`
+- `ops/grafana/dashboards/database-metrics.json`
+- `ops/grafana/dashboards/module-metrics.json`
+- `ops/grafana/dashboards/error-rates.json`
+- `docs/operations.md` - Dashboard documentation
+
--- a/docs/content/stories/epic6/6.4-rate-limiting.md
+++ b/docs/content/stories/epic6/6.4-rate-limiting.md
@@ -0,0 +1,53 @@
+# Story 6.4: Rate Limiting
+
+## Metadata
+- **Story ID**: 6.4
+- **Title**: Rate Limiting
+- **Epic**: 6 - Observability & Production Readiness
+- **Status**: Pending
+- **Priority**: High
+- **Estimated Time**: 4-5 hours
+- **Dependencies**: 1.5, 5.1
+
+## Goal
+Implement rate limiting to prevent API abuse and ensure fair resource usage.
+
+## Description
+This story implements rate limiting middleware that limits requests per user and per IP address, with configurable limits per endpoint.
+
+## Deliverables
+
+### 1. Rate Limiting Middleware
+- Per-user rate limiting
+- Per-IP rate limiting
+- Configurable limits per endpoint
+- Rate limit storage (Redis)
+- Return `X-RateLimit-*` headers
+
+### 2. Configuration
+- Rate limit config in `config/default.yaml`:
+  ```yaml
+  rate_limiting:
+    enabled: true
+    per_user: 100/minute
+    per_ip: 1000/minute
+  ```
+
+### 3. Integration
+- Integrate with HTTP server
+- Add to middleware stack
+- Return `429 Too Many Requests` when a limit is exceeded
+
+## Acceptance Criteria
+- [ ] Requests over the configured limit are rejected
+- [ ] Per-user limits work correctly
+- [ ] Per-IP limits work correctly
+- [ ] Rate limit headers are returned
+- [ ] Limits are configurable per endpoint
+- [ ] Rate limits are stored in Redis
+
+## Files to Create/Modify
+- `internal/server/middleware.go` - Rate limiting middleware
+- `internal/infra/ratelimit/limiter.go` - Rate limiter implementation
+- `config/default.yaml` - Add rate limit config
+
--- a/docs/content/stories/epic6/6.5-security-hardening.md
+++ b/docs/content/stories/epic6/6.5-security-hardening.md
@@ -0,0 +1,54 @@
+# Story 6.5: Security Hardening
+
+## Metadata
+- **Story ID**: 6.5
+- **Title**: Security Hardening
+- **Epic**: 6 - Observability & Production Readiness
+- **Status**: Pending
+- **Priority**: High
+- **Estimated Time**: 5-6 hours
+- **Dependencies**: 1.5
+
+## Goal
+Add comprehensive security hardening including security headers, input validation, and request size limits.
+
+## Description
+This story implements security best practices including security headers, input validation, request size limits, and SQL injection protection.
+
+## Deliverables
+
+### 1. Security Headers Middleware
+- `X-Content-Type-Options: nosniff`
+- `X-Frame-Options: DENY`
+- `X-XSS-Protection: 0` (the legacy browser XSS filter is deprecated; rely on `Content-Security-Policy` instead)
+- `Strict-Transport-Security` (if HTTPS)
+- `Content-Security-Policy`
+
+### 2. Request Size Limits
+- Max body size (10MB default)
+- Max header size
+- Configurable limits
+
+### 3. Input Validation
+- Use `github.com/go-playground/validator/v10`
+- Validate all request bodies
+- Sanitize user inputs
+- Validation error responses
+
+### 4. SQL Injection Protection
+- Use parameterized queries (Ent already does this)
+- Add linter rule to prevent raw SQL
+- Security scanning
+
+## Acceptance Criteria
+- [ ] Security headers are present
+- [ ] Request size limits are enforced
+- [ ] Invalid request bodies are rejected with validation error responses
+- [ ] SQL injection protection is in place
+- [ ] Security headers are configurable
+
+## Files to Create/Modify
+- `internal/server/middleware.go` - Security headers middleware
+- `internal/server/validation.go` - Input validation
+- `config/default.yaml` - Add security config
+
--- a/docs/content/stories/epic6/6.6-performance-optimization.md
+++ b/docs/content/stories/epic6/6.6-performance-optimization.md
@@ -0,0 +1,53 @@
+# Story 6.6: Performance Optimization
+
+## Metadata
+- **Story ID**: 6.6
+- **Title**: Performance Optimization
+- **Epic**: 6 - Observability & Production Readiness
+- **Status**: Pending
+- **Priority**: Medium
+- **Estimated Time**: 6-8 hours
+- **Dependencies**: 1.2, 5.1
+
+## Goal
+Optimize platform performance through database connection pooling, query optimization, response compression, and caching strategies.
+
+## Description
+This story implements performance optimizations including database connection pooling, query optimization, response compression, and strategic caching.
+
+## Deliverables
+
+### 1. Database Connection Pooling
+- Configure max connections
+- Configure idle timeout
+- Monitor pool stats
+- Connection health checks
+
+### 2. Query Optimization
+- Add indexes for common queries
+- Use database query logging (development)
+- Add slow query detection
+- Query performance monitoring
+
+### 3. Response Compression
+- Gzip middleware for large responses
+- Configurable compression levels
+- Content type filtering
+
+### 4. Caching Strategy
+- Cache frequently accessed data (user permissions, roles)
+- Cache invalidation strategies
+- Cache warming
+
+## Acceptance Criteria
+- [ ] Database connection pool limits and idle timeout are configured, and pool stats are monitored
+- [ ] Query performance is improved
+- [ ] Response compression works
+- [ ] Caching strategy is effective
+- [ ] Performance meets SLA (< 100ms p95 for auth endpoints)
+
+## Files to Create/Modify
+- `internal/infra/database/client.go` - Connection pooling
+- `internal/server/middleware.go` - Compression middleware
+- `internal/perm/in_memory_resolver.go` - Add caching
+
--- a/docs/content/stories/epic6/README.md
+++ b/docs/content/stories/epic6/README.md
@@ -0,0 +1,55 @@
+# Epic 6: Observability & Production Readiness
+
+## Overview
+Enhance observability with full OpenTelemetry integration, add comprehensive error reporting (Sentry), create Grafana dashboards, improve logging with request correlation, add rate limiting and security hardening, and optimize performance.
+
+## Stories
+
+### 6.1 Enhanced Observability
+- [Story: 6.1 - Enhanced Observability](./6.1-enhanced-observability.md)
+- **Goal:** Enhance observability with full OpenTelemetry integration, comprehensive Prometheus metrics, and improved logging.
+- **Deliverables:** Complete OpenTelemetry integration, expanded metrics, enhanced logging
+
+### 6.2 Error Reporting (Sentry)
+- [Story: 6.2 - Error Reporting](./6.2-error-reporting.md)
+- **Goal:** Add comprehensive error reporting with Sentry integration.
+- **Deliverables:** Sentry integration, error context enhancement
+
+### 6.3 Grafana Dashboards
+- [Story: 6.3 - Grafana Dashboards](./6.3-grafana-dashboards.md)
+- **Goal:** Create comprehensive Grafana dashboards for monitoring.
+- **Deliverables:** Grafana dashboard JSON files, documentation
+
+### 6.4 Rate Limiting
+- [Story: 6.4 - Rate Limiting](./6.4-rate-limiting.md)
+- **Goal:** Implement rate limiting to prevent API abuse.
+- **Deliverables:** Rate limiting middleware, configuration
+
+### 6.5 Security Hardening
+- [Story: 6.5 - Security Hardening](./6.5-security-hardening.md)
+- **Goal:** Add comprehensive security hardening.
+- **Deliverables:** Security headers, input validation, request limits
+
+### 6.6 Performance Optimization
+- [Story: 6.6 - Performance Optimization](./6.6-performance-optimization.md)
+- **Goal:** Optimize platform performance.
+- **Deliverables:** Connection pooling, query optimization, compression, caching
+
+## Deliverables Checklist
+- [ ] Full OpenTelemetry integration
+- [ ] Sentry error reporting
+- [ ] Enhanced logging with correlation
+- [ ] Comprehensive Prometheus metrics
+- [ ] Grafana dashboards
+- [ ] Rate limiting
+- [ ] Security hardening
+- [ ] Performance optimizations
+
+## Acceptance Criteria
+- Traces are exported and visible in Jaeger
+- Errors are reported to Sentry with context
+- Logs include request IDs and trace IDs
+- Metrics are exposed and scraped by Prometheus
+- Rate limiting prevents abuse
+- Security headers are present
+- Performance meets SLA (< 100ms p95 for auth endpoints)