feat: reword phase to epic, update mkdocs

2025-11-05 09:28:33 +01:00
parent 65a428534c
commit ace9678f6c
64 changed files with 214 additions and 208 deletions

@@ -0,0 +1,72 @@
# Story 6.1: Enhanced Observability
## Metadata
- **Story ID**: 6.1
- **Title**: Enhanced Observability
- **Epic**: 6 - Observability & Production Readiness
- **Status**: Pending
- **Priority**: High
- **Estimated Time**: 6-8 hours
- **Dependencies**: 1.6, 5.2, 5.1
## Goal
Enhance observability with full OpenTelemetry integration, comprehensive Prometheus metrics expansion, and improved logging with request correlation.
## Description
This story enhances the observability system by completing OpenTelemetry integration with all infrastructure components, expanding Prometheus metrics, and improving logging with better correlation and structured fields.
## Deliverables
### 1. Complete OpenTelemetry Integration
- Export traces to Jaeger/OTLP collector
- Add database instrumentation (Ent interceptor)
- Add Kafka instrumentation
- Add Redis instrumentation
- Create custom spans:
  - Module initialization spans
  - Background job spans
  - Event publishing spans
- Trace context propagation:
  - Include trace ID in logs
  - Propagate across HTTP calls
  - Include in error reports
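
The trace context propagation items above can be sketched as follows. This is a minimal, standard-library-only illustration, assuming incoming requests carry a W3C `traceparent` header (`version-traceid-parentid-flags`). The helper names `traceIDFromHeader`, `withTraceID`, and `traceIDFrom` are hypothetical; a real implementation would more likely rely on the OpenTelemetry SDK's propagators.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// ctxKey is an unexported type so the context key cannot collide with other packages.
type ctxKey struct{}

// traceIDFromHeader extracts the 32-hex-character trace ID from a W3C
// traceparent value. It returns "" when the value does not have four
// dash-separated parts of the expected lengths.
func traceIDFromHeader(traceparent string) string {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return ""
	}
	return parts[1]
}

// withTraceID stores the trace ID in the context (hypothetical helper).
func withTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// traceIDFrom reads the trace ID back, returning "" if none was stored.
func traceIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

func main() {
	header := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	ctx := withTraceID(context.Background(), traceIDFromHeader(header))
	// Any log line or error report built from ctx can now carry the trace ID.
	fmt.Printf("trace_id=%s msg=%q\n", traceIDFrom(ctx), "handling request")
}
```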
### 2. Prometheus Metrics Expansion
- Add more metrics:
  - Database connection pool stats
  - Cache hit/miss ratio
  - Event bus publish/consume rates
  - Background job execution times
  - Module-specific metrics (via module interface)
- Create metric labels:
  - `module` label for module metrics
  - `tenant_id` label (if multi-tenant)
  - `status` label for error rates
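
As one illustration of the metrics listed above, the cache hit/miss ratio can be derived from two counters. The sketch below uses only `sync/atomic` (Go 1.19+); the `cacheStats` type is hypothetical, and the story itself would presumably expose the two counts as Prometheus counters rather than compute the ratio in process.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cacheStats counts cache hits and misses. It is a stand-in for two
// Prometheus counters (assumption, not the project's actual metrics type).
type cacheStats struct {
	hits, misses atomic.Uint64
}

// record increments the hit or miss counter.
func (s *cacheStats) record(hit bool) {
	if hit {
		s.hits.Add(1)
		return
	}
	s.misses.Add(1)
}

// hitRatio returns hits / (hits + misses), or 0 when nothing was recorded.
func (s *cacheStats) hitRatio() float64 {
	h, m := s.hits.Load(), s.misses.Load()
	if h+m == 0 {
		return 0
	}
	return float64(h) / float64(h+m)
}

func main() {
	var s cacheStats
	for _, hit := range []bool{true, true, true, false} {
		s.record(hit)
	}
	fmt.Printf("cache hit ratio: %.2f\n", s.hitRatio())
}
```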
### 3. Enhanced Logging
- Add structured fields:
  - `user_id` from context
  - `tenant_id` from context
  - `module` name for module logs
  - `trace_id` from OpenTelemetry
- Create log aggregation config:
  - JSON format for production
  - Human-readable for development
  - Support for Loki/CloudWatch/ELK
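
A rough sketch of the structured fields and the production/development split described above. It uses the standard library's `log/slog` (Go 1.21+) as a stand-in for the project's zap logger; the `correlation` type and the helpers `withCorrelation`, `newLogger`, and `logWith` are hypothetical names introduced for illustration only.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"io"
	"log/slog"
	"os"
)

// correlation holds the fields the story wants on every log line.
type correlation struct {
	UserID, TenantID, Module, TraceID string
}

type corrKey struct{}

// withCorrelation attaches the correlation fields to the context.
func withCorrelation(ctx context.Context, c correlation) context.Context {
	return context.WithValue(ctx, corrKey{}, c)
}

// newLogger returns JSON output for production and text output for development.
func newLogger(w io.Writer, production bool) *slog.Logger {
	if production {
		return slog.New(slog.NewJSONHandler(w, nil))
	}
	return slog.New(slog.NewTextHandler(w, nil))
}

// logWith returns a logger that adds the correlation fields found in ctx.
func logWith(ctx context.Context, l *slog.Logger) *slog.Logger {
	c, ok := ctx.Value(corrKey{}).(correlation)
	if !ok {
		return l
	}
	return l.With("user_id", c.UserID, "tenant_id", c.TenantID,
		"module", c.Module, "trace_id", c.TraceID)
}

// fieldOf logs one JSON line and returns the named string field from it.
// It exists only to demonstrate that the fields end up in the output.
func fieldOf(c correlation, field string) string {
	var buf bytes.Buffer
	ctx := withCorrelation(context.Background(), c)
	logWith(ctx, newLogger(&buf, true)).Info("request handled")
	var rec map[string]any
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		return ""
	}
	s, _ := rec[field].(string)
	return s
}

func main() {
	c := correlation{UserID: "u-42", TenantID: "t-7", Module: "auth",
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736"}
	ctx := withCorrelation(context.Background(), c)
	logWith(ctx, newLogger(os.Stdout, false)).Info("request handled") // development: text
	logWith(ctx, newLogger(os.Stdout, true)).Info("request handled")  // production: JSON
}
```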
## Acceptance Criteria
- [ ] Traces are exported and visible in Jaeger
- [ ] All infrastructure components are instrumented
- [ ] Trace IDs are included in logs
- [ ] Metrics are expanded with new dimensions
- [ ] Logs include all correlation fields
- [ ] Log aggregation works correctly
## Files to Create/Modify
- `internal/observability/tracer.go` - Enhanced tracing
- `internal/infra/database/client.go` - Add tracing
- `internal/infra/cache/redis_cache.go` - Add tracing
- `internal/infra/bus/kafka_bus.go` - Add tracing
- `internal/metrics/metrics.go` - Expanded metrics
- `internal/logger/zap_logger.go` - Enhanced logging

@@ -0,0 +1,53 @@
# Story 6.2: Error Reporting (Sentry)
## Metadata
- **Story ID**: 6.2
- **Title**: Error Reporting (Sentry)
- **Epic**: 6 - Observability & Production Readiness
- **Status**: Pending
- **Priority**: High
- **Estimated Time**: 4-5 hours
- **Dependencies**: 1.4
## Goal
Add comprehensive error reporting with Sentry integration that captures errors with full context.
## Description
This story integrates Sentry for error reporting, sending all errors from the error bus to Sentry with complete context including trace IDs, user information, and module context.
## Deliverables
### 1. Sentry Integration
- Install and configure Sentry SDK
- Integrate with error bus:
  - Send errors to Sentry
  - Include trace ID in Sentry events
  - Add user context (user ID, email)
  - Add module context (module name)
- Sentry middleware:
  - Capture panics
  - Capture HTTP errors (4xx, 5xx)
- Configure Sentry DSN via config
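
The panic-capturing part of the middleware could look roughly like the sketch below. To stay self-contained it does not call the Sentry SDK; instead it reports through a hypothetical `Reporter` interface, with an in-memory `memoryReporter` standing in for the Sentry-backed error bus. All names here are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Reporter is a hypothetical abstraction over the error-reporting backend.
// A Sentry-backed implementation would sit behind it.
type Reporter interface {
	Report(err error, tags map[string]string)
}

// memoryReporter records events in memory so the sketch runs without Sentry.
type memoryReporter struct{ events []string }

func (m *memoryReporter) Report(err error, tags map[string]string) {
	m.events = append(m.events, fmt.Sprintf("%v module=%s", err, tags["module"]))
}

// recoverAndReport converts a panic in next into a reported error and a 500 response.
func recoverAndReport(r Reporter, module string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		defer func() {
			if v := recover(); v != nil {
				r.Report(fmt.Errorf("panic: %v", v), map[string]string{"module": module})
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, req)
	})
}

// simulate runs a panicking handler through the middleware and returns the
// response status together with the captured events.
func simulate() (int, []string) {
	rep := &memoryReporter{}
	h := recoverAndReport(rep, "billing", http.HandlerFunc(
		func(http.ResponseWriter, *http.Request) { panic("boom") }))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code, rep.events
}

func main() {
	code, events := simulate()
	fmt.Println(code, events)
}
```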
### 2. Error Context Enhancement
- Enrich errors with:
  - Request context
  - User information
  - Module information
  - Stack traces
  - Environment information
## Acceptance Criteria
- [ ] Errors are reported to Sentry with context
- [ ] Panics are captured and reported
- [ ] HTTP errors are captured
- [ ] Trace IDs are included in Sentry events
- [ ] User context is included
- [ ] Sentry DSN is configurable
## Files to Create/Modify
- `internal/errorbus/sentry_bus.go` - Sentry integration
- `internal/server/middleware.go` - Sentry middleware
- `internal/di/providers.go` - Add Sentry provider
- `config/default.yaml` - Add Sentry config

@@ -0,0 +1,46 @@
# Story 6.3: Grafana Dashboards
## Metadata
- **Story ID**: 6.3
- **Title**: Grafana Dashboards
- **Epic**: 6 - Observability & Production Readiness
- **Status**: Pending
- **Priority**: Medium
- **Estimated Time**: 4-5 hours
- **Dependencies**: 1.3, 6.1
## Goal
Create comprehensive Grafana dashboards for monitoring platform health, performance, and errors.
## Description
This story creates Grafana dashboard JSON files that visualize platform metrics, health, and performance data from Prometheus.
## Deliverables
### 1. Grafana Dashboards (`ops/grafana/dashboards/`)
- `platform-overview.json` - Overall health dashboard
- `http-metrics.json` - HTTP request metrics
- `database-metrics.json` - Database performance
- `module-metrics.json` - Per-module metrics
- `error-rates.json` - Error tracking
- Dashboard setup documentation
### 2. Documentation
- Document dashboard setup in `docs/operations.md`
- Dashboard import instructions
- Metric explanation
## Acceptance Criteria
- [ ] All dashboards are created
- [ ] Dashboards display correct metrics
- [ ] Dashboard setup is documented
- [ ] Dashboards can be imported into Grafana
## Files to Create/Modify
- `ops/grafana/dashboards/platform-overview.json`
- `ops/grafana/dashboards/http-metrics.json`
- `ops/grafana/dashboards/database-metrics.json`
- `ops/grafana/dashboards/module-metrics.json`
- `ops/grafana/dashboards/error-rates.json`
- `docs/operations.md` - Dashboard documentation

@@ -0,0 +1,53 @@
# Story 6.4: Rate Limiting
## Metadata
- **Story ID**: 6.4
- **Title**: Rate Limiting
- **Epic**: 6 - Observability & Production Readiness
- **Status**: Pending
- **Priority**: High
- **Estimated Time**: 4-5 hours
- **Dependencies**: 1.5, 5.1
## Goal
Implement rate limiting to prevent API abuse and ensure fair resource usage.
## Description
This story implements rate limiting middleware that limits requests per user and per IP address, with configurable limits per endpoint.
## Deliverables
### 1. Rate Limiting Middleware
- Per-user rate limiting
- Per-IP rate limiting
- Configurable limits per endpoint
- Rate limit storage (Redis)
- Return `X-RateLimit-*` headers
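
To make the middleware deliverables above concrete, here is a minimal sketch of a fixed-window limiter that sets `X-RateLimit-Limit` and `X-RateLimit-Remaining`. It keeps counters in memory, whereas the story calls for Redis storage, and the fixed-window algorithm is an assumption since the story does not name one. The key function decides whether limiting is per user or per IP; all type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
	"time"
)

// window tracks how many requests a key has made in the current period.
type window struct {
	count int
	reset time.Time
}

// limiter is an in-memory fixed-window limiter (stand-in for Redis storage).
type limiter struct {
	mu      sync.Mutex
	limit   int
	period  time.Duration
	windows map[string]*window
}

func newLimiter(limit int, period time.Duration) *limiter {
	return &limiter{limit: limit, period: period, windows: map[string]*window{}}
}

// allow counts one request for key and reports whether it fits the limit,
// along with the number of requests remaining in the window.
func (l *limiter) allow(key string) (bool, int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	w, ok := l.windows[key]
	if !ok || !now.Before(w.reset) {
		w = &window{reset: now.Add(l.period)}
		l.windows[key] = w
	}
	if w.count >= l.limit {
		return false, 0
	}
	w.count++
	return true, l.limit - w.count
}

// middleware applies the limiter using keyFn (user ID, client IP, ...).
func (l *limiter) middleware(keyFn func(*http.Request) string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ok, remaining := l.allow(keyFn(r))
		w.Header().Set("X-RateLimit-Limit", strconv.Itoa(l.limit))
		w.Header().Set("X-RateLimit-Remaining", strconv.Itoa(remaining))
		if !ok {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statuses sends n requests from one client and returns the response codes.
func statuses(limit, n int) []int {
	// RemoteAddr includes the port; real per-IP limiting would strip it.
	byAddr := func(r *http.Request) string { return r.RemoteAddr }
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := newLimiter(limit, time.Minute).middleware(byAddr, ok)
	codes := make([]int, 0, n)
	for i := 0; i < n; i++ {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
		codes = append(codes, rec.Code)
	}
	return codes
}

func main() {
	fmt.Println(statuses(2, 3))
}
```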
### 2. Configuration
- Rate limit config in `config/default.yaml`:
  ```yaml
  rate_limiting:
    enabled: true
    per_user: 100/minute
    per_ip: 1000/minute
  ```
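
Values such as `100/minute` need to be turned into a count and a period before the limiter can use them. The sketch below shows one possible parser; the `rate` type and `parseRate` function are hypothetical, and the `second` and `hour` units are assumptions, since the config above only shows `minute`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// rate is a parsed "<count>/<unit>" limit.
type rate struct {
	limit  int
	period time.Duration
}

// units maps unit names to durations. Only "minute" appears in the story's
// config; the other entries are assumptions.
var units = map[string]time.Duration{
	"second": time.Second,
	"minute": time.Minute,
	"hour":   time.Hour,
}

// parseRate parses strings such as "100/minute".
func parseRate(s string) (rate, error) {
	count, unit, ok := strings.Cut(s, "/")
	if !ok {
		return rate{}, fmt.Errorf("rate %q: want <count>/<unit>", s)
	}
	n, err := strconv.Atoi(strings.TrimSpace(count))
	if err != nil || n <= 0 {
		return rate{}, fmt.Errorf("rate %q: count must be a positive integer", s)
	}
	period, ok := units[strings.TrimSpace(unit)]
	if !ok {
		return rate{}, fmt.Errorf("rate %q: unknown unit %q", s, unit)
	}
	return rate{limit: n, period: period}, nil
}

func main() {
	for _, s := range []string{"100/minute", "1000/minute", "fast"} {
		r, err := parseRate(s)
		fmt.Println(r.limit, r.period, err)
	}
}
```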
### 3. Integration
- Integrate with HTTP server
- Add to middleware stack
- Error responses for rate limit exceeded
## Acceptance Criteria
- [ ] Rate limiting prevents abuse
- [ ] Per-user limits work correctly
- [ ] Per-IP limits work correctly
- [ ] Rate limit headers are returned
- [ ] Configuration is flexible
- [ ] Rate limits are stored in Redis
## Files to Create/Modify
- `internal/server/middleware.go` - Rate limiting middleware
- `internal/infra/ratelimit/limiter.go` - Rate limiter implementation
- `config/default.yaml` - Add rate limit config

@@ -0,0 +1,54 @@
# Story 6.5: Security Hardening
## Metadata
- **Story ID**: 6.5
- **Title**: Security Hardening
- **Epic**: 6 - Observability & Production Readiness
- **Status**: Pending
- **Priority**: High
- **Estimated Time**: 5-6 hours
- **Dependencies**: 1.5
## Goal
Add comprehensive security hardening including security headers, input validation, and request size limits.
## Description
This story implements security best practices including security headers, input validation, request size limits, and SQL injection protection.
## Deliverables
### 1. Security Headers Middleware
- `X-Content-Type-Options: nosniff`
- `X-Frame-Options: DENY`
- `X-XSS-Protection: 1; mode=block` (legacy header; current major browsers ignore it and rely on `Content-Security-Policy`)
- `Strict-Transport-Security` (if HTTPS)
- `Content-Security-Policy`
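
A minimal sketch of such a middleware with `net/http` follows. The header names come from the list above; the concrete `Content-Security-Policy` and `Strict-Transport-Security` values are placeholder assumptions, as are the function names `securityHeaders` and `headerAfter`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// securityHeaders sets the headers listed in the story on every response.
// The CSP and HSTS values are illustrative placeholders, not recommendations.
func securityHeaders(https bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		h.Set("X-Content-Type-Options", "nosniff")
		h.Set("X-Frame-Options", "DENY")
		h.Set("X-XSS-Protection", "1; mode=block")
		h.Set("Content-Security-Policy", "default-src 'self'")
		if https {
			// Only meaningful when the site is served over HTTPS.
			h.Set("Strict-Transport-Security", "max-age=31536000; includeSubDomains")
		}
		next.ServeHTTP(w, r)
	})
}

// headerAfter sends one request through the middleware and returns a response header.
func headerAfter(https bool, name string) string {
	h := securityHeaders(https, http.HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) { w.WriteHeader(http.StatusOK) }))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Header().Get(name)
}

func main() {
	fmt.Println(headerAfter(true, "X-Frame-Options"))
	fmt.Println(headerAfter(true, "Strict-Transport-Security"))
}
```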
### 2. Request Size Limits
- Max body size (10MB default)
- Max header size
- Configurable limits
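
The standard library already covers both limits: `http.MaxBytesReader` caps the body and `http.Server.MaxHeaderBytes` caps the headers. The sketch below (Go 1.19+ for `http.MaxBytesError`) shows one way to wire them; `limitBody`, `drain`, `newServer`, and `statusFor` are hypothetical names, and the sizes used are examples rather than the story's defaults.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// limitBody rejects request bodies larger than max bytes once they are read.
func limitBody(max int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, max)
		next.ServeHTTP(w, r)
	})
}

// drain reads the whole body and answers 413 when the limit was exceeded.
func drain(w http.ResponseWriter, r *http.Request) {
	_, err := io.ReadAll(r.Body)
	var tooLarge *http.MaxBytesError
	switch {
	case errors.As(err, &tooLarge):
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
	case err != nil:
		http.Error(w, "bad request", http.StatusBadRequest)
	default:
		w.WriteHeader(http.StatusNoContent)
	}
}

// newServer shows where the header limit is configured.
func newServer(addr string, maxBody int64, maxHeader int) *http.Server {
	return &http.Server{
		Addr:           addr,
		Handler:        limitBody(maxBody, http.HandlerFunc(drain)),
		MaxHeaderBytes: maxHeader,
	}
}

// statusFor posts a body of bodyLen bytes against a limit of max bytes.
func statusFor(max int64, bodyLen int) int {
	rec := httptest.NewRecorder()
	body := strings.NewReader(strings.Repeat("x", bodyLen))
	req := httptest.NewRequest(http.MethodPost, "/", body)
	limitBody(max, http.HandlerFunc(drain)).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	srv := newServer(":8080", 10<<20, 1<<20) // 10 MiB body, 1 MiB headers
	fmt.Println(srv.MaxHeaderBytes, statusFor(8, 4), statusFor(8, 16))
}
```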
### 3. Input Validation
- Use `github.com/go-playground/validator`
- Validate all request bodies
- Sanitize user inputs
- Validation error responses
### 4. SQL Injection Protection
- Use parameterized queries (Ent already does this)
- Add linter rule to prevent raw SQL
- Security scanning
## Acceptance Criteria
- [ ] Security headers are present
- [ ] Request size limits are enforced
- [ ] Input validation works
- [ ] SQL injection protection is in place
- [ ] Security headers are configurable
## Files to Create/Modify
- `internal/server/middleware.go` - Security headers middleware
- `internal/server/validation.go` - Input validation
- `config/default.yaml` - Add security config

@@ -0,0 +1,53 @@
# Story 6.6: Performance Optimization
## Metadata
- **Story ID**: 6.6
- **Title**: Performance Optimization
- **Epic**: 6 - Observability & Production Readiness
- **Status**: Pending
- **Priority**: Medium
- **Estimated Time**: 6-8 hours
- **Dependencies**: 1.2, 5.1
## Goal
Optimize platform performance through database connection pooling, query optimization, response compression, and caching strategies.
## Description
This story implements performance optimizations including database connection pooling, query optimization, response compression, and strategic caching.
## Deliverables
### 1. Database Connection Pooling
- Configure max connections
- Configure idle timeout
- Monitor pool stats
- Connection health checks
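
The pooling items above map onto `database/sql` settings, assuming the Ent client is backed by a `*sql.DB`. In the sketch below, `poolConfig`, `applyPool`, and `healthy` are hypothetical names, and `stubConnector` is a stand-in driver that always fails to connect, included only so the example runs without a database.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"time"
)

// poolConfig groups the tunables named in the story (hypothetical type).
type poolConfig struct {
	MaxOpen     int
	MaxIdle     int
	IdleTimeout time.Duration
}

// applyPool configures max connections and idle timeout on the pool.
func applyPool(db *sql.DB, c poolConfig) {
	db.SetMaxOpenConns(c.MaxOpen)
	db.SetMaxIdleConns(c.MaxIdle)
	db.SetConnMaxIdleTime(c.IdleTimeout)
}

// healthy performs a connection health check with a short timeout.
func healthy(ctx context.Context, db *sql.DB) bool {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return db.PingContext(ctx) == nil
}

// stubConnector stands in for a real driver; it never yields a connection.
type stubConnector struct{}

func (stubConnector) Connect(context.Context) (driver.Conn, error) {
	return nil, errors.New("stub: no database available")
}

func (stubConnector) Driver() driver.Driver { return nil }

// maxOpenAfter applies c to a stub-backed pool and reads the limit back
// from the pool stats, the same stats a metrics exporter would monitor.
func maxOpenAfter(c poolConfig) int {
	db := sql.OpenDB(stubConnector{})
	defer db.Close()
	applyPool(db, c)
	return db.Stats().MaxOpenConnections
}

func main() {
	cfg := poolConfig{MaxOpen: 25, MaxIdle: 5, IdleTimeout: 5 * time.Minute}
	fmt.Println("max open:", maxOpenAfter(cfg))

	db := sql.OpenDB(stubConnector{})
	defer db.Close()
	fmt.Println("healthy:", healthy(context.Background(), db))
}
```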
### 2. Query Optimization
- Add indexes for common queries
- Use database query logging (development)
- Add slow query detection
- Query performance monitoring
### 3. Response Compression
- Gzip middleware for large responses
- Configurable compression levels
- Content type filtering
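
A bare-bones version of the compression middleware, using `compress/gzip` with a configurable level, might look like this. It compresses whenever the client sends `Accept-Encoding: gzip` and deliberately omits the size threshold and content type filtering listed above; `compress`, `gzipWriter`, and `roundTrip` are hypothetical names.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// gzipWriter routes the handler's writes through a gzip stream.
type gzipWriter struct {
	http.ResponseWriter
	gz *gzip.Writer
}

func (g gzipWriter) Write(b []byte) (int, error) { return g.gz.Write(b) }

// compress gzips responses for clients that accept it.
func compress(level int, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next.ServeHTTP(w, r)
			return
		}
		gz, err := gzip.NewWriterLevel(w, level)
		if err != nil { // invalid level: fall back to an uncompressed response
			next.ServeHTTP(w, r)
			return
		}
		defer gz.Close() // flushes the remaining compressed bytes
		w.Header().Set("Content-Encoding", "gzip")
		w.Header().Add("Vary", "Accept-Encoding")
		next.ServeHTTP(gzipWriter{ResponseWriter: w, gz: gz}, r)
	})
}

// roundTrip serves body through the middleware and returns the response's
// Content-Encoding together with the decoded payload.
func roundTrip(body string, acceptGzip bool) (string, string) {
	h := compress(gzip.BestSpeed, http.HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) { io.WriteString(w, body) }))
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if acceptGzip {
		req.Header.Set("Accept-Encoding", "gzip")
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	enc := rec.Header().Get("Content-Encoding")
	if enc != "gzip" {
		return enc, rec.Body.String()
	}
	zr, err := gzip.NewReader(rec.Body)
	if err != nil {
		return enc, ""
	}
	out, _ := io.ReadAll(zr)
	return enc, string(out)
}

func main() {
	enc, out := roundTrip("hello, compressed world", true)
	fmt.Println(enc, out)
}
```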
### 4. Caching Strategy
- Cache frequently accessed data (user permissions, roles)
- Cache invalidation strategies
- Cache warming
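
One simple shape for caching permissions is a TTL cache with explicit invalidation, sketched below. The `permCache` type, its loader callback, and the `users:read` permission string are all hypothetical; the story does not prescribe a TTL-based design, and cache warming is not shown.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cached is one cache entry with its expiry time.
type cached struct {
	perms   []string
	expires time.Time
}

// permCache caches per-user permissions for a fixed TTL (hypothetical design).
type permCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	load  func(userID string) []string // e.g. a database lookup
	items map[string]cached
}

func newPermCache(ttl time.Duration, load func(string) []string) *permCache {
	return &permCache{ttl: ttl, load: load, items: map[string]cached{}}
}

// get returns cached permissions, calling the loader on a miss or after expiry.
// The lock is held during the load, which keeps the sketch simple but
// serializes concurrent misses.
func (c *permCache) get(userID string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.items[userID]; ok && time.Now().Before(e.expires) {
		return e.perms
	}
	perms := c.load(userID)
	c.items[userID] = cached{perms: perms, expires: time.Now().Add(c.ttl)}
	return perms
}

// invalidate drops a user's entry, for example after a role change.
func (c *permCache) invalidate(userID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, userID)
}

// demoLoads returns how many loader calls happened before and after an
// invalidation, given two reads followed by an invalidation and a third read.
func demoLoads() (beforeInvalidate, afterInvalidate int) {
	loads := 0
	c := newPermCache(time.Minute, func(string) []string {
		loads++
		return []string{"users:read"}
	})
	c.get("u-42")
	c.get("u-42") // served from cache
	beforeInvalidate = loads
	c.invalidate("u-42")
	c.get("u-42") // reloaded
	return beforeInvalidate, loads
}

func main() {
	before, after := demoLoads()
	fmt.Println("loads before invalidation:", before, "after:", after)
}
```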
## Acceptance Criteria
- [ ] Database connection pooling is optimized
- [ ] Query performance is improved
- [ ] Response compression works
- [ ] Caching strategy is effective
- [ ] Performance meets SLA (< 100ms p95 for auth endpoints)
## Files to Create/Modify
- `internal/infra/database/client.go` - Connection pooling
- `internal/server/middleware.go` - Compression middleware
- `internal/perm/in_memory_resolver.go` - Add caching

@@ -0,0 +1,55 @@
# Epic 6: Observability & Production Readiness
## Overview
Enhance observability with full OpenTelemetry integration, add comprehensive error reporting (Sentry), create Grafana dashboards, improve logging with request correlation, add rate limiting and security hardening, and optimize performance.
## Stories
### 6.1 Enhanced Observability
- [Story: 6.1 - Enhanced Observability](./6.1-enhanced-observability.md)
- **Goal:** Enhance observability with full OpenTelemetry integration, comprehensive Prometheus metrics, and improved logging.
- **Deliverables:** Complete OpenTelemetry integration, expanded metrics, enhanced logging
### 6.2 Error Reporting (Sentry)
- [Story: 6.2 - Error Reporting](./6.2-error-reporting.md)
- **Goal:** Add comprehensive error reporting with Sentry integration.
- **Deliverables:** Sentry integration, error context enhancement
### 6.3 Grafana Dashboards
- [Story: 6.3 - Grafana Dashboards](./6.3-grafana-dashboards.md)
- **Goal:** Create comprehensive Grafana dashboards for monitoring.
- **Deliverables:** Grafana dashboard JSON files, documentation
### 6.4 Rate Limiting
- [Story: 6.4 - Rate Limiting](./6.4-rate-limiting.md)
- **Goal:** Implement rate limiting to prevent API abuse.
- **Deliverables:** Rate limiting middleware, configuration
### 6.5 Security Hardening
- [Story: 6.5 - Security Hardening](./6.5-security-hardening.md)
- **Goal:** Add comprehensive security hardening.
- **Deliverables:** Security headers, input validation, request limits
### 6.6 Performance Optimization
- [Story: 6.6 - Performance Optimization](./6.6-performance-optimization.md)
- **Goal:** Optimize platform performance.
- **Deliverables:** Connection pooling, query optimization, compression, caching
## Deliverables Checklist
- [ ] Full OpenTelemetry integration
- [ ] Sentry error reporting
- [ ] Enhanced logging with correlation
- [ ] Comprehensive Prometheus metrics
- [ ] Grafana dashboards
- [ ] Rate limiting
- [ ] Security hardening
- [ ] Performance optimizations
## Acceptance Criteria
- Traces are exported and visible in Jaeger
- Errors are reported to Sentry with context
- Logs include request IDs and trace IDs
- Metrics are exposed and scraped by Prometheus
- Rate limiting prevents abuse
- Security headers are present
- Performance meets SLA (< 100ms p95 for auth endpoints)