Phase 4: Deploy
The Deployment phase delivers the tested application to users and ensures it remains healthy, observable, and recoverable in production. Modern deployment practices emphasize automation, reversibility, and incremental rollout — minimizing risk while maximizing the speed at which value reaches users. Deployment is not the end of the lifecycle but the beginning of the application's operational life.
Infrastructure
Production infrastructure must be provisioned, configured, and managed with the same rigor applied to application code.
Infrastructure as Code
Infrastructure as Code (IaC) treats servers, networks, databases, and all supporting resources as declarative configuration that is versioned, reviewed, and tested alongside application code. This eliminates manual configuration drift, makes environments reproducible, and enables disaster recovery by rebuilding infrastructure from source.
Terraform is the most widely adopted multi-cloud IaC tool, using HashiCorp Configuration Language (HCL) to define resources across AWS, Azure, GCP, and dozens of other providers. Pulumi offers an alternative that uses general-purpose programming languages (TypeScript, Python, Go) instead of a domain-specific language. Cloud-native tools like AWS CloudFormation, Azure Bicep, and Google Cloud Deployment Manager are tightly integrated with their respective platforms.
IaC repositories should follow the same practices as application code: pull request reviews, automated validation (terraform plan, policy-as-code with tools like Open Policy Agent or Checkov), and modular organization that separates concerns like networking, compute, storage, and monitoring.
Environment Strategy
A well-defined environment strategy provides isolation between stages of the delivery pipeline. A development environment is used for daily work, often running locally or in an ephemeral cloud workspace. A staging environment mirrors production as closely as possible — same infrastructure configuration, same data volumes (anonymized), same integrations — and serves as the final validation gate before release. The production environment serves real users and is subject to the strictest access controls and change management processes. Some teams also maintain a pre-production or canary environment for gradual rollout testing.
Environment parity is critical. Differences between staging and production are the most common source of "works in staging, fails in production" surprises. Docker containers and Kubernetes manifests help enforce parity by packaging the application and its runtime dependencies identically across environments.
Cloud Architecture Patterns
Modern applications leverage cloud services to achieve scalability, resilience, and operational efficiency. Common patterns include auto-scaling groups that adjust compute capacity based on demand metrics; load balancers that distribute traffic across healthy instances and provide SSL termination; managed databases that handle replication, backups, and failover automatically; content delivery networks (CDNs) that cache static assets at edge locations for faster global delivery; and serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) that execute code on demand without provisioning servers.
The choice between self-managed infrastructure, managed services, and serverless depends on the team's operational capacity, cost constraints, and the application's performance and customization requirements.
Continuous Delivery and Continuous Deployment
Continuous Delivery (CD) ensures that every change passing the automated test suite is ready to deploy at any time. Continuous Deployment extends this by automatically releasing every passing change to production. Both approaches require a mature, trusted CI/CD pipeline.
Pipeline Architecture
A complete CI/CD pipeline for deployment typically includes: a build stage that compiles code, runs tests, and produces a deployable artifact (Docker image, binary, or package); an artifact registry (Docker Hub, GitHub Container Registry, AWS ECR, Artifactory) that stores versioned artifacts immutably; a deployment stage that pulls the artifact and deploys it to the target environment; and a verification stage that runs post-deployment checks (smoke tests, health checks, synthetic monitors) before the deployment is considered complete.
Pipeline Tools
Popular CI/CD platforms include GitHub Actions (deeply integrated with GitHub repositories, YAML-based workflow definitions); GitLab CI/CD (built into GitLab with a unified interface for code, CI, and deployment); Jenkins (highly extensible and self-hosted, with a vast plugin ecosystem); CircleCI (cloud-native with strong parallelism and caching); and ArgoCD and Flux (GitOps-style continuous deployment for Kubernetes, where the desired state is declared in Git and reconciled automatically).
Artifact Immutability
A fundamental principle of reliable deployment is that the artifact deployed to production is the exact same artifact that was tested in staging. Rebuilding from source for each environment introduces the risk of non-deterministic builds. Instead, the pipeline should build once, tag the artifact with a version or commit SHA, push it to a registry, and promote the same artifact through environments.
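The build-once-promote-everywhere idea can be sketched in a few lines. This is a minimal illustration, not a real registry client: the in-memory `registry` dict, the `myapp` image name, and the helper functions are all hypothetical, standing in for a container registry and its tagging API.

```python
import hashlib

# Hypothetical in-memory "registry": maps tag -> artifact digest.
registry: dict[str, str] = {}

def build_artifact(source: bytes, commit_sha: str) -> str:
    """Build once: the digest uniquely identifies this exact artifact."""
    digest = hashlib.sha256(source).hexdigest()
    registry[f"myapp:{commit_sha}"] = digest
    return digest

def promote(commit_sha: str, environment: str) -> str:
    """Promotion re-tags the existing artifact; nothing is rebuilt."""
    digest = registry[f"myapp:{commit_sha}"]
    registry[f"myapp:{environment}"] = digest
    return digest

# The artifact tested in staging is byte-for-byte what ships to production.
digest = build_artifact(b"compiled-app-v1", commit_sha="a1b2c3d")
assert promote("a1b2c3d", "staging") == digest
assert promote("a1b2c3d", "production") == digest
```

The key property is that `staging` and `production` are merely aliases for the same immutable digest; no environment ever triggers a rebuild.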
Deployment Strategies
Different strategies manage the risk of releasing new code, each with distinct trade-offs between complexity, speed, and safety.
Rolling Deployment
A rolling deployment gradually replaces instances running the old version with instances running the new version. At any point during the rollout, both versions may be serving traffic simultaneously. This strategy provides zero-downtime deployment and is straightforward to implement on most platforms. The primary risk is that if the new and old versions are incompatible (e.g., after a database schema change), users may experience inconsistent behavior during the transition. Rolling deployments therefore require backward-compatible changes.
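The mixed-version window is easiest to see in a simulation. This sketch replaces a hypothetical five-instance fleet in batches; instance names, versions, and the batch size are illustrative, and real platforms would also drain connections and health-check each batch.

```python
def rolling_deploy(instances: list[str], new_version: str, batch_size: int = 2):
    """Replace instances in batches; old and new versions overlap mid-rollout."""
    fleet = {name: "v1" for name in instances}
    for i in range(0, len(instances), batch_size):
        for name in instances[i:i + batch_size]:
            fleet[name] = new_version  # in reality: drain, replace, health-check
        # Both versions may be serving traffic at this point, which is
        # exactly why rolling deployments require backward compatibility.
        yield dict(fleet)

steps = list(rolling_deploy(["a", "b", "c", "d", "e"], "v2"))
assert set(steps[0].values()) == {"v1", "v2"}   # mixed versions mid-rollout
assert set(steps[-1].values()) == {"v2"}        # fully rolled out
```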
Blue-Green Deployment
Blue-green deployment maintains two identical production environments. One (blue) serves live traffic while the other (green) is idle. The new version is deployed to the idle environment, thoroughly tested, and then traffic is switched from blue to green in a single operation (typically by updating a load balancer or DNS record). If issues are discovered, traffic can be switched back immediately. The trade-off is cost — maintaining two full environments doubles infrastructure expenses during the transition, though the idle environment can be scaled down between deployments.
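The mechanics reduce to a single pointer flip, which this hypothetical sketch models: two environments, a `live` pointer standing in for the load balancer or DNS record, and rollback as simply switching back.

```python
class BlueGreen:
    """Two identical environments; a router pointer decides which is live."""

    def __init__(self):
        self.envs = {"blue": "v1", "green": "v1"}
        self.live = "blue"

    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        self.envs[self.idle()] = version   # deploy and test off-traffic

    def switch(self) -> None:
        self.live = self.idle()            # single atomic cutover

bg = BlueGreen()
bg.deploy("v2")
assert bg.envs[bg.live] == "v1"   # users still on the old version
bg.switch()
assert bg.envs[bg.live] == "v2"   # cutover complete
bg.switch()                        # rollback is just switching back
assert bg.envs[bg.live] == "v1"
```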
Canary Deployment
Canary deployment releases the new version to a small subset of users (e.g., 1–5%) while the majority continue using the current version. The team monitors error rates, latency, and business metrics for the canary group. If everything looks healthy, traffic is gradually shifted (10%, 25%, 50%, 100%). If problems emerge, the canary is pulled back with minimal user impact. Canary deployments provide the strongest risk mitigation for changes with uncertain impact and are standard practice at large-scale organizations.
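The progression described above can be sketched as a gated loop. The step percentages and the `is_healthy` callback are illustrative assumptions; in practice the health signal would come from comparing error rates, latency, and business metrics for the canary group.

```python
CANARY_STEPS = [1, 5, 10, 25, 50, 100]  # percent of traffic on the new version

def run_canary(is_healthy) -> tuple[str, int]:
    """Advance through traffic steps; abort on the first unhealthy reading.

    is_healthy: callable taking the current traffic percentage and
    returning True if metrics for the canary group look acceptable.
    """
    for percent in CANARY_STEPS:
        if not is_healthy(percent):
            return ("rolled_back", 0)   # route all traffic back to stable
    return ("promoted", 100)

# A healthy canary progresses to full rollout.
assert run_canary(lambda pct: True) == ("promoted", 100)
# A problem surfacing at 10% halts the rollout with minimal user impact.
assert run_canary(lambda pct: pct < 10) == ("rolled_back", 0)
```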
Feature Flags
Feature flags allow new code to be deployed to production but remain invisible to users until explicitly activated. This decouples deployment (a technical event) from release (a business event). Flags can be toggled instantly without a deployment, targeted to specific user segments for gradual rollout, and used as a kill switch if a released feature causes problems. Combined with canary deployment, feature flags provide fine-grained control over what users see and when.
Rollback
Every deployment strategy must include a clear rollback plan. For container-based deployments, rollback typically means redeploying the previous container image. For blue-green, it means switching traffic back. For canary, it means routing all traffic to the stable version. Rollback procedures should be documented, tested regularly, and executable in minutes — not hours. Automated rollback triggered by health check failures or error rate thresholds provides the fastest recovery.
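An automated rollback trigger of the kind mentioned above is, at its core, a threshold comparison. This sketch is a simplified assumption: the 1% tolerance, the baseline-rate input, and the function name are hypothetical, and real systems would also require a minimum sample size and a sustained breach before acting.

```python
def should_roll_back(errors: int, requests: int,
                     baseline_rate: float, tolerance: float = 0.01) -> bool:
    """Trigger rollback when the post-deploy error rate exceeds the
    pre-deploy baseline by more than `tolerance` (an assumed threshold)."""
    if requests == 0:
        return False  # no traffic yet, no signal to act on
    return (errors / requests) - baseline_rate > tolerance

# 5% errors against a 0.2% baseline clearly breaches the threshold.
assert should_roll_back(errors=50, requests=1000, baseline_rate=0.002)
# 0.3% against 0.2% is within tolerance: no rollback.
assert not should_roll_back(errors=3, requests=1000, baseline_rate=0.002)
```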
Database Migrations
Database schema changes are among the riskiest operations in deployment because they affect persistent state and are difficult to reverse.
Migration Tools and Workflow
Migration tools maintain a versioned sequence of schema changes that can be applied and (in some cases) rolled back. Flyway (Java ecosystem), Alembic (Python/SQLAlchemy), Liquibase (multi-platform), and Knex.js (Node.js) are widely used. Each migration is a numbered script that transforms the schema from one state to the next. The tool tracks which migrations have been applied and runs only new ones.
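The track-and-apply loop these tools share can be demonstrated with an in-memory SQLite database. The `MIGRATIONS` list and `schema_version` table below are hypothetical simplifications; real tools like Flyway or Alembic add checksums, locking, and rollback scripts on top of the same core idea.

```python
import sqlite3

# Hypothetical migrations: an ordered sequence of (version, SQL) pairs.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(conn: sqlite3.Connection) -> None:
    """Apply only migrations newer than the recorded schema version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (v INTEGER)")
    current = conn.execute("SELECT MAX(v) FROM schema_version").fetchone()[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version (v) VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied migrations are skipped
cols = [c[1] for c in conn.execute("PRAGMA table_info(users)")]
assert cols == ["id", "email"]
```

Because applied versions are recorded, running `migrate` on every deployment is safe: only new scripts execute.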
Safe Migration Practices
Backward-compatible migrations are essential for zero-downtime deployments. If the old and new application versions run simultaneously during a rolling or canary deployment, both must be able to work with the current database schema. The expand-and-contract pattern achieves this by first expanding the schema (adding new columns, tables, or indexes) while the old version continues to function, deploying the new application version that uses the expanded schema, and then contracting the schema (removing deprecated columns or tables) in a subsequent migration after the old version is fully retired.
Large data migrations (backfilling columns, transforming data formats) should be run as background jobs rather than blocking migrations, to avoid locking tables and causing downtime. Migrations should always be tested against production-scale data volumes in staging, as a migration that runs in seconds on a small dataset may take hours on millions of rows.
Monitoring and Observability
Once the application is live, the team must be able to understand its behavior, detect problems, and diagnose root causes — often under time pressure.
The Three Pillars of Observability
Logs are structured records of discrete events. Effective logging uses structured formats (JSON) with consistent fields (timestamp, severity, request ID, user ID, message), avoids logging sensitive data (passwords, tokens, personal information), and uses correlation IDs to trace a single request across multiple services. Centralized log aggregation with tools like the ELK stack (Elasticsearch, Logstash, Kibana), Grafana Loki, or Datadog Logs makes searching and alerting practical.
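A minimal structured-logging helper might look like the following. The field names and the `log_event` function are illustrative assumptions; real services typically use a logging library configured with a JSON formatter rather than hand-rolled dicts.

```python
import json
import time
import uuid

def log_event(severity: str, message: str, request_id: str, **fields) -> dict:
    """Emit one structured JSON log line with consistent field names."""
    record = {
        "timestamp": time.time(),
        "severity": severity,
        "request_id": request_id,  # correlation ID: traces one request
                                   # across multiple services
        "message": message,
        **fields,
    }
    print(json.dumps(record))      # ship to stdout for the log aggregator
    return record

rid = str(uuid.uuid4())
rec = log_event("INFO", "order created", rid, order_id=42)
assert rec["request_id"] == rid
assert "password" not in rec       # sensitive data never enters the record
```

Every service that handles the same request logs the same `request_id`, which is what makes a centralized search for one user journey possible.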
Metrics are numerical measurements collected over time. Key application metrics include request rate, error rate, and request duration (the RED method); for infrastructure, utilization, saturation, and errors (the USE method). Prometheus is the most widely adopted open-source metrics system, paired with Grafana for visualization. Cloud-native alternatives include AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring.
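Computing RED metrics from raw request records is straightforward, as this sketch shows. The record shape (`status`, `duration_ms`) and the nearest-rank percentile are simplifying assumptions; production systems use histogram buckets rather than sorting every sample.

```python
def red_metrics(requests: list[dict], window_seconds: float) -> dict:
    """Compute Rate, Errors, Duration over a window of request records.

    Each record is assumed to look like {"status": 200, "duration_ms": 12.5}.
    """
    n = len(requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    durations = sorted(r["duration_ms"] for r in requests)
    # Nearest-rank p95; real systems use histogram buckets instead.
    p95 = durations[min(n - 1, int(n * 0.95))] if durations else 0.0
    return {
        "rate_rps": n / window_seconds,       # Rate
        "error_rate": errors / n if n else 0.0,  # Errors
        "p95_ms": p95,                         # Duration
    }

sample = [{"status": 200, "duration_ms": float(d)} for d in range(1, 100)]
sample.append({"status": 503, "duration_ms": 250.0})
m = red_metrics(sample, window_seconds=10.0)
assert m["rate_rps"] == 10.0 and m["error_rate"] == 0.01
```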
Traces follow a single request as it propagates through multiple services, recording timing and metadata at each hop. Distributed tracing is essential for diagnosing latency and failures in microservices architectures. OpenTelemetry is the emerging standard for instrumentation, with backends like Jaeger, Zipkin, and commercial platforms (Datadog, Honeycomb, New Relic) providing visualization and analysis.
Dashboards
Dashboards provide at-a-glance visibility into system health. A good production dashboard shows request volume and error rates; response time percentiles (p50, p95, p99); infrastructure utilization (CPU, memory, disk, network); database connection pool usage and query latency; queue depths and consumer lag for asynchronous workloads; and business metrics (signups, orders, active sessions) that confirm the application is functioning correctly from a user perspective.
Dashboards should be organized by audience: an engineering dashboard focused on technical metrics, and a business dashboard focused on outcomes.
Alerting
Alerts notify the team when something requires attention. Effective alerting is based on symptoms (high error rate, slow response times) rather than causes (CPU usage), because symptoms directly reflect user impact. Alerts should be actionable — every alert should have a clear next step or runbook link. Alert fatigue (too many noisy alerts) is a serious problem; teams should periodically review and prune alerts, routing low-severity items to dashboards rather than pagers.
Alerting tools include PagerDuty, Opsgenie, and Grafana Alerting for on-call management, and integration with communication platforms like Slack or Microsoft Teams for lower-severity notifications.
Incident Response
Despite thorough testing, incidents will occur in production. A prepared team handles them efficiently and learns from them systematically.
Incident Lifecycle
Detection occurs through monitoring alerts, user reports, or automated anomaly detection.
Triage determines the severity and scope of the impact: is it affecting all users or a subset? Is data at risk? Is there a workaround?
Mitigation focuses on restoring service as quickly as possible, often through rollback, a feature flag toggle, traffic rerouting, or scaling adjustments. The goal is to stop the bleeding before diagnosing the root cause.
Communication keeps stakeholders informed throughout the incident. A status page (such as Atlassian Statuspage or a custom solution) provides transparent updates to users. Internal communication follows a defined channel (a dedicated Slack channel or war room) with clear roles.
Resolution addresses the root cause once the immediate impact is mitigated. This may involve a hotfix, a configuration change, or a vendor escalation.
On-Call and Escalation
An on-call rotation ensures someone is always available to respond to production alerts. On-call engineers should have the access, tooling, and documentation needed to diagnose and mitigate common issues independently. Escalation paths should be clearly defined for situations that exceed the on-call engineer's expertise or authority.
Post-Incident Review
After every significant incident, the team conducts a blameless post-mortem. The review documents a timeline of events from detection to resolution, root cause analysis (often using the "five whys" technique), contributing factors (monitoring gaps, missing tests, process failures), action items with owners and deadlines to prevent recurrence, and an assessment of what went well during the response. Post-mortems are shared openly within the organization to spread learning. The action items are tracked in the issue tracker alongside regular work to ensure follow-through.
Release Management
For teams that do not practice continuous deployment, release management coordinates the process of delivering changes to production on a defined schedule.
Release Cadence
The release cadence should balance the desire for frequent delivery (smaller, lower-risk changes) with the overhead of the release process itself. Common cadences include weekly releases, which provide a predictable rhythm and limit the size of each release; bi-weekly or sprint-aligned releases, which map naturally to agile iteration cycles; and on-demand releases, where mature CI/CD pipelines allow deployment whenever a change is ready. Regardless of cadence, every release should be small enough that its impact is predictable and rollback is straightforward.
Release Notes and Changelogs
Release notes communicate what changed to users, stakeholders, and support teams. They should be written in clear, non-technical language (for external-facing notes), organized by category (new features, improvements, bug fixes, deprecations), and linked to relevant documentation or migration guides. Automated changelog generation from Conventional Commits or PR labels reduces the effort required and ensures completeness.
Versioning
Semantic Versioning (SemVer) provides a clear contract for how version numbers communicate the nature of changes. The MAJOR version increments for incompatible API changes, the MINOR version increments for backward-compatible new functionality, and the PATCH version increments for backward-compatible bug fixes. For internal applications, calendar versioning (CalVer) — using the date as the version (e.g., 2026.02.1) — can be simpler and more intuitive.
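The SemVer contract translates directly into a small bump function, sketched here with an assumed three-part `MAJOR.MINOR.PATCH` string (pre-release and build-metadata suffixes are ignored for brevity).

```python
def bump(version: str, change: str) -> str:
    """Bump a SemVer string: 'major' for incompatible API changes,
    'minor' for backward-compatible features, 'patch' for bug fixes."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"   # breaking change resets minor and patch
    if change == "minor":
        return f"{major}.{minor + 1}.0"  # new feature resets patch
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

assert bump("1.4.2", "patch") == "1.4.3"
assert bump("1.4.2", "minor") == "1.5.0"
assert bump("1.4.2", "major") == "2.0.0"
```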
Post-Deployment Validation
Deployment is not complete when the new version is running — it is complete when the team has verified it is running correctly.
Smoke Tests
A small, fast suite of automated tests runs immediately after deployment to verify that the application starts, core endpoints respond, authentication works, and critical user flows complete. Smoke test failure should trigger automatic rollback or alert the on-call team for immediate investigation.
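A smoke-test harness is essentially a fail-fast check runner, sketched below. The check names and the `run_smoke_tests` function are hypothetical; in practice each callable would issue a real HTTP request against the freshly deployed environment.

```python
def run_smoke_tests(checks: dict) -> tuple[bool, list[str]]:
    """Run every post-deploy check; any failure means roll back or page on-call.

    Each check is a zero-argument callable returning True on success.
    """
    failures = []
    for name, check in checks.items():
        try:
            if not check():
                failures.append(name)
        except Exception:
            failures.append(name)   # a crash counts as a failed check
    return (not failures, failures)

# Hypothetical checks; real ones would hit live endpoints.
ok, failed = run_smoke_tests({
    "app_starts": lambda: True,
    "health_endpoint": lambda: True,
    "login_flow": lambda: False,   # simulated failure
})
assert not ok and failed == ["login_flow"]
```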
Synthetic Monitoring
Synthetic monitoring runs scripted user journeys against the production application at regular intervals (e.g., every minute) from multiple geographic locations. Unlike real user monitoring, synthetics provide consistent baselines and detect issues even during low-traffic periods. Tools like Datadog Synthetics, Checkly, and Pingdom execute browser-based or API-based checks and alert when assertions fail or latency exceeds thresholds.
Real User Monitoring
Real user monitoring (RUM) captures performance and error data from actual user sessions. It provides insight into real-world conditions — diverse devices, network speeds, and geographic locations — that synthetic tests cannot fully replicate. RUM data reveals which pages are slowest for real users, JavaScript errors affecting specific browsers, the impact of third-party scripts on load times, and geographic performance disparities that may indicate CDN or routing issues.
Gradual Rollout Verification
For canary or percentage-based deployments, the team monitors comparative metrics between the canary group and the stable group. Key comparisons include error rate, latency percentiles, conversion rates, and any business-specific KPIs. Statistical significance should be considered — small differences in low-traffic canaries may be noise rather than signal.
Security in Production
Production security extends beyond the application code to encompass the entire operational environment.
Secrets Management
Secrets (API keys, database credentials, encryption keys, tokens) must never be stored in source code, environment files committed to repositories, or container images. Dedicated secrets management tools like HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, or Google Secret Manager provide encrypted storage, access control, automatic rotation, and audit logging. Applications retrieve secrets at runtime through environment variables injected by the orchestrator or direct API calls to the secrets manager.
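On the application side, runtime retrieval via injected environment variables often reduces to a fail-fast lookup like this sketch (the `require_secret` helper and variable names are illustrative).

```python
import os

def require_secret(name: str) -> str:
    """Fetch a secret injected at runtime (e.g., by the orchestrator from a
    secrets manager); fail fast at startup if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value

# Simulate the orchestrator injecting the secret at deploy time.
os.environ["DB_PASSWORD"] = "injected-at-deploy-time"
assert require_secret("DB_PASSWORD") == "injected-at-deploy-time"

try:
    require_secret("MISSING_KEY")
except RuntimeError as exc:
    assert "MISSING_KEY" in str(exc)
```

Failing at startup rather than at first use surfaces misconfiguration during deployment verification instead of during live traffic.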
Network Security
Production networks should follow the principle of least privilege. Private subnets isolate databases and internal services from direct internet access. Security groups and network policies restrict traffic to only the necessary ports and protocols between specific services. Web Application Firewalls (WAFs) protect against common attack patterns like SQL injection and XSS at the network edge. TLS encryption should be enforced for all traffic — both external (user to application) and internal (service to service).
Access Control
Production access should be tightly restricted and audited. Role-based access control (RBAC) limits who can deploy, view logs, access databases, or modify infrastructure. Just-in-time access grants temporary elevated permissions for specific tasks (e.g., debugging a production issue) and revokes them automatically afterward. All production access should be logged and periodically reviewed.
Operational Runbooks
Runbooks document the procedures for handling common operational tasks and incidents. A well-maintained runbook library reduces reliance on tribal knowledge and enables any on-call engineer to respond effectively.
What Runbooks Should Cover
Essential runbook topics include deployment procedures (step-by-step instructions for deploying, verifying, and rolling back); common incident responses (database connection exhaustion, memory leaks, third-party service outages); scaling procedures (how to add capacity manually if auto-scaling is insufficient); data recovery (how to restore from backups, replay events, or reconcile data); and certificate and secret rotation (procedures for rotating TLS certificates, API keys, and database credentials before expiration).
Runbook Maintenance
Runbooks are only useful if they are accurate. They should be reviewed and updated after every incident that reveals gaps, tested periodically by having someone unfamiliar with the system follow the instructions, and stored alongside the code or infrastructure they describe, making them easy to find and update.
Continuous Improvement
Deployment marks the beginning of a feedback loop that drives the next cycle of development.
Collecting Feedback
Post-launch, the team actively gathers signal from production metrics and dashboards; user feedback through in-app surveys, support tickets, and direct outreach; error and crash reports; and usage analytics that reveal which features are adopted, ignored, or abandoned. This data feeds directly back into the Research phase, informing the next round of prioritization and planning.
Iteration
The application is never "done." Each deployment provides new information that refines the team's understanding of user needs, system behavior, and technical constraints. Features are improved, performance is optimized, technical debt is addressed, and new capabilities are planned — continuing the Research, Develop, Test, Deploy cycle as an ongoing, iterative process.
"Production is the ultimate test environment. The goal is not to avoid surprises, but to detect them quickly, respond effectively, and learn continuously."
Key Deliverables
By the end of the Deploy phase (and on an ongoing basis), the team should have produced: infrastructure defined as code and version-controlled; a CI/CD pipeline automating build, test, and deployment; a documented deployment strategy with rollback procedures; monitoring dashboards covering application, infrastructure, and business metrics; an alerting configuration based on symptoms with clear escalation paths; an incident response process with blameless post-mortem practices; operational runbooks for common tasks and failure scenarios; and a feedback loop connecting production data to the product backlog.
These deliverables ensure that the application not only reaches users reliably but continues to improve with every iteration.