
Phase 3: Test

The Testing phase validates that the application behaves correctly, performs reliably, and meets the standards defined during Research. Testing is not a single activity but a layered strategy where different types of tests catch different categories of defects. A well-designed testing strategy provides confidence that the software is ready for users while keeping feedback cycles fast enough to sustain development velocity.


The Testing Pyramid

The testing pyramid is a conceptual model that guides how teams allocate their testing effort across different levels of abstraction.

Structure

At the base of the pyramid are unit tests — fast, numerous, and focused on individual functions or components in isolation. The middle layer contains integration tests, which verify that modules, services, or layers work together correctly. At the top are end-to-end (E2E) tests, which simulate real user workflows across the full application stack. As you move up the pyramid, tests become slower, more expensive to maintain, and more brittle, but they also exercise more realistic scenarios.

The key insight of the pyramid is that the majority of testing effort should be concentrated at the base. A codebase with thousands of fast unit tests, hundreds of targeted integration tests, and a focused suite of E2E tests covering critical paths will catch more bugs faster than one that relies primarily on slow, comprehensive E2E tests.

Anti-Patterns

The "ice cream cone" is an inverted pyramid where most tests are manual or E2E, with few unit tests. This pattern results in slow feedback, flaky test suites, and teams that are afraid to refactor. The "hourglass" has many unit tests and many E2E tests but few integration tests, leaving a blind spot where modules interact. Both anti-patterns should be actively corrected by investing in the missing layer.


Unit Testing

Unit tests are the foundation of a reliable test suite. They verify that individual units of code — typically functions, methods, or classes — produce the correct output for a given input.

Characteristics of Good Unit Tests

A good unit test is fast, executing in milliseconds so that the entire suite can run in seconds. It is isolated, depending on no external services, databases, or file systems. It is deterministic, producing the same result every time regardless of execution order or environment. It is focused, testing one behavior or scenario per test case. And it is readable, serving as documentation of expected behavior.

Test Structure

The Arrange-Act-Assert (AAA) pattern provides a clear structure for every test. The Arrange step sets up the preconditions and inputs. The Act step invokes the function or method under test. The Assert step verifies that the result matches expectations. Each test should have a descriptive name that communicates the scenario and expected outcome, such as test_calculate_discount_returns_zero_when_cart_is_empty rather than test_discount.
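A minimal sketch of the AAA pattern using Python's standard unittest module. The calculate_discount function here is hypothetical, invented only to give the tests something to exercise:

```python
import unittest

def calculate_discount(cart_items, rate=0.1):
    """Hypothetical unit under test: discount for a cart of {price, qty} items."""
    subtotal = sum(item["price"] * item["qty"] for item in cart_items)
    return round(subtotal * rate, 2)

class TestCalculateDiscount(unittest.TestCase):
    def test_calculate_discount_returns_zero_when_cart_is_empty(self):
        # Arrange: an empty cart
        cart = []
        # Act: invoke the unit under test
        result = calculate_discount(cart)
        # Assert: no items means no discount
        self.assertEqual(result, 0)

    def test_calculate_discount_applies_rate_to_subtotal(self):
        # Arrange: one item worth 100.00
        cart = [{"price": 100.0, "qty": 1}]
        # Act
        result = calculate_discount(cart, rate=0.1)
        # Assert: 10% of 100.00
        self.assertEqual(result, 10.0)
```

Each test name states the scenario and the expected outcome, so a failure report reads as a sentence describing the broken behavior.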

Mocking and Stubbing

When a unit under test depends on external collaborators (a database client, an HTTP service, a file system), those dependencies should be replaced with test doubles. A stub provides canned responses to calls made during the test. A mock additionally verifies that specific interactions occurred (e.g., "this method was called once with these arguments"). A fake is a lightweight implementation that behaves like the real dependency but operates entirely in memory (e.g., an in-memory database).
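The stub/mock distinction can be shown with Python's unittest.mock. The register_user function and the mailer interface below are hypothetical, used only to illustrate the two roles a test double plays:

```python
from unittest.mock import Mock

def register_user(email, mailer):
    """Hypothetical unit under test: registers a user and sends a welcome email."""
    if "@" not in email:
        raise ValueError("invalid email")
    mailer.send(to=email, subject="Welcome!")
    return {"email": email, "status": "registered"}

# As a stub: the Mock returns canned responses to any call made during the test.
mailer = Mock()
mailer.send.return_value = None

user = register_user("ada@example.com", mailer)
assert user["status"] == "registered"

# As a mock: additionally verify that the specific interaction occurred.
mailer.send.assert_called_once_with(to="ada@example.com", subject="Welcome!")
```

The same Mock object serves both purposes; the difference is whether the test asserts on the return value (stubbing) or on the recorded interaction (mocking).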

Over-mocking is a common pitfall. If a test requires extensive mocking to set up, it may indicate that the code under test has too many responsibilities or is too tightly coupled. Refactoring to reduce dependencies often makes both the code and its tests simpler.

Code Coverage

Code coverage measures the percentage of code lines, branches, or paths exercised by tests. It is a useful indicator but not a goal in itself. 100% coverage does not guarantee correctness — a test that executes every line but asserts nothing provides no value. Conversely, uncovered code is guaranteed to be untested. A reasonable target for most projects is 70–90% line coverage, with particular attention to branch coverage in complex conditional logic. Coverage reports generated by tools like Istanbul (JavaScript), coverage.py (Python), or JaCoCo (Java) help identify blind spots.


Integration Testing

Integration tests verify that components work together as expected when connected through real interfaces.

What Integration Tests Cover

Typical integration test scenarios include API endpoint tests that send HTTP requests and verify responses, status codes, and headers; database interaction tests that run queries against a real (or containerized) database and verify correct reads, writes, and constraint enforcement; service-to-service communication tests that confirm message serialization, routing, and error handling between microservices; and third-party integration tests that validate behavior against external APIs using sandbox environments or recorded responses.

Test Databases and Containers

Integration tests that involve databases should run against a real database engine (not an in-memory substitute that behaves differently). Docker containers spun up by the CI pipeline or locally via Docker Compose provide disposable, consistent database instances. Each test run should start with a clean schema, apply migrations, seed necessary data, execute tests, and tear down. This ensures tests do not depend on leftover state from previous runs.
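The lifecycle above (clean schema, migrate, seed, test, tear down) can be sketched as a reusable fixture. This sketch uses Python's built-in sqlite3 purely to keep the example self-contained; a real suite would point the same lifecycle at a containerized instance of the production engine:

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager

# Illustrative schema and seed data; real suites would apply the project's migrations.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL)",
]
SEED = [("ada@example.com",), ("grace@example.com",)]

@contextmanager
def fresh_database():
    """Clean schema -> apply migrations -> seed -> yield -> tear down."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    try:
        for statement in MIGRATIONS:
            conn.execute(statement)
        conn.executemany("INSERT INTO users (email) VALUES (?)", SEED)
        conn.commit()
        yield conn
    finally:
        conn.close()
        os.unlink(path)  # teardown: no leftover state for the next run

with fresh_database() as db:
    assert db.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 2
    # The real engine enforces the UNIQUE constraint, which a naive
    # in-memory substitute might not.
    try:
        db.execute("INSERT INTO users (email) VALUES (?)", ("ada@example.com",))
        raise AssertionError("expected a constraint violation")
    except sqlite3.IntegrityError:
        pass
```

Because every run builds and destroys its own database, test order and leftover rows can never influence results.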

Contract Testing

In a microservices architecture, contract testing verifies that the API provided by one service (the provider) matches the expectations of another service (the consumer). Tools like Pact enable consumer-driven contract testing, where the consumer defines its expectations in a contract file, and the provider verifies that it satisfies those expectations. Contract tests catch breaking changes at the API boundary without requiring a full end-to-end environment.
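The consumer-driven flow can be sketched in plain Python. The contract structure and verify_provider helper below are illustrative inventions, not Pact's actual file format or API, but they show the mechanic: the consumer records what it depends on, and the provider replays it:

```python
# Consumer side: record the request the consumer will make and the response
# shape it depends on. (Illustrative structure, not Pact's real contract format.)
contract = {
    "request": {"method": "GET", "path": "/users/42"},
    "response": {
        "status": 200,
        "body_fields": {"id": int, "email": str},
    },
}

def verify_provider(contract, handler):
    """Provider side: replay the contract's request and check the response shape."""
    status, body = handler(contract["request"]["method"],
                           contract["request"]["path"])
    expected = contract["response"]
    assert status == expected["status"], f"status {status} != {expected['status']}"
    for field, field_type in expected["body_fields"].items():
        assert field in body, f"missing field: {field}"
        assert isinstance(body[field], field_type), f"wrong type for {field}"
    return True

# A stand-in for the provider service's request handler.
def provider_handler(method, path):
    return 200, {"id": 42, "email": "ada@example.com"}

assert verify_provider(contract, provider_handler)
```

If the provider renames or retypes a field the consumer relies on, verification fails in the provider's own pipeline, with no shared environment required.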


End-to-End Testing

End-to-end tests exercise the application as a user would, interacting with the UI, triggering backend logic, and verifying outcomes across the entire stack.

Tooling

Modern E2E testing frameworks include Cypress, which runs directly in the browser and provides excellent developer experience with time-travel debugging and automatic waiting; Playwright, which supports Chromium, Firefox, and WebKit with a single API and excels at cross-browser testing; and Selenium, the longest-standing framework with broad language and browser support, though often requiring more boilerplate.

Writing Effective E2E Tests

E2E tests should focus on critical user journeys — the paths that, if broken, would represent a significant business impact. Examples include user registration and login, core workflow completion (e.g., placing an order, submitting a form, generating a report), payment processing, and permission and access control boundaries.

Each test should be independent, setting up its own data and not relying on state created by other tests. Page Object Models (POMs) encapsulate page-specific selectors and actions into reusable classes, reducing duplication and making tests resilient to UI changes. Tests should use stable selectors — dedicated data-testid attributes rather than CSS classes or XPath expressions that are tightly coupled to visual design.
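A Page Object Model in miniature. The selectors and the StubDriver below are hypothetical; in a real suite the driver would be a Selenium or Playwright handle, but the pattern is identical: page knowledge lives in one class, and tests speak in user actions:

```python
class LoginPage:
    """Page Object: encapsulates the login page's selectors and actions.
    Selectors use stable data-testid attributes rather than CSS classes."""

    EMAIL_INPUT = '[data-testid="login-email"]'
    PASSWORD_INPUT = '[data-testid="login-password"]'
    SUBMIT_BUTTON = '[data-testid="login-submit"]'

    def __init__(self, driver):
        self.driver = driver

    def login(self, email, password):
        self.driver.fill(self.EMAIL_INPUT, email)
        self.driver.fill(self.PASSWORD_INPUT, password)
        self.driver.click(self.SUBMIT_BUTTON)

class StubDriver:
    """Records interactions; stands in for a real browser driver in this sketch."""
    def __init__(self):
        self.actions = []
    def fill(self, selector, value):
        self.actions.append(("fill", selector, value))
    def click(self, selector):
        self.actions.append(("click", selector))

driver = StubDriver()
LoginPage(driver).login("ada@example.com", "s3cret")
assert driver.actions[-1] == ("click", '[data-testid="login-submit"]')
```

When the login form's markup changes, only LoginPage is updated; every test that logs in keeps working unmodified.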

Managing Flakiness

Flaky tests — tests that pass and fail intermittently without code changes — are one of the biggest threats to test suite credibility. Common causes include race conditions and timing issues (solved by using explicit waits for specific conditions rather than arbitrary sleep statements), shared mutable state between tests (solved by isolating test data), and environmental instability (solved by running E2E tests in consistent, containerized environments). Flaky tests should be flagged, quarantined, and fixed promptly. A team that ignores flakiness will eventually stop trusting and running the suite altogether.
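The "explicit wait instead of arbitrary sleep" fix can be sketched as a small polling helper. The helper and the simulated async job are illustrative; E2E frameworks ship equivalents (Cypress's automatic retries, Playwright's auto-waiting), but the principle is the same:

```python
import threading
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns truthy or the timeout elapses.
    Waits on a specific state instead of sleeping an arbitrary duration."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Example: wait for a (simulated) background job to report completion.
job = {"done": False}
threading.Timer(0.2, lambda: job.update(done=True)).start()
assert wait_for(lambda: job["done"], timeout=2.0) is True
```

A fixed sleep of, say, one second is both too long on fast machines and too short under CI load; polling for the actual condition is fast when the system is fast and patient when it is slow.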


Performance Testing

Performance testing ensures the application meets speed, scalability, and stability requirements under realistic conditions.

Types of Performance Tests

Load testing measures system behavior under expected concurrent user volumes. The goal is to verify that response times, throughput, and error rates remain within acceptable bounds during normal operations. Stress testing pushes the system beyond its expected capacity to identify the breaking point and observe how it degrades. A well-designed system degrades gracefully (e.g., queuing requests, returning informative errors) rather than crashing without warning. Spike testing simulates sudden, sharp increases in traffic — for example, a flash sale or a viral social media post — to verify that auto-scaling mechanisms respond quickly enough. Soak testing (endurance testing) runs a sustained load over an extended period (hours or days) to detect slow memory leaks, connection pool exhaustion, disk space accumulation, and other issues that only manifest over time.

Tooling and Methodology

Performance testing tools include k6 (a developer-friendly tool that scripts tests in JavaScript and integrates well with CI pipelines), Locust (a Python-based tool that defines user behavior as code), JMeter (a mature Java-based tool with extensive protocol support), and Gatling (a Scala-based tool with excellent reporting).

Effective performance testing requires a production-like environment with representative data volumes, a clear definition of acceptable thresholds (e.g., p95 response time under 300ms at 1,000 concurrent users), baseline measurements established before changes so that regressions can be detected, and realistic user behavior simulation including think times, session lengths, and diverse request patterns.

Interpreting Results

Key metrics to analyze include response time percentiles (p50, p95, p99 — averages hide outliers), throughput (requests per second), error rate, and resource utilization (CPU, memory, network, disk I/O). Results should be compared against the non-functional requirements defined during Research. Performance regressions should be treated with the same urgency as functional bugs.
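Why percentiles rather than averages can be shown with a few lines of Python. The latency samples below are fabricated to make the point; percentile uses the nearest-rank method:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Simulated response times in milliseconds: mostly fast, with a slow tail.
latencies = [40] * 90 + [120] * 8 + [900] * 2

mean = sum(latencies) / len(latencies)   # 63.6 ms — looks healthy
p50 = percentile(latencies, 50)          # 40 ms
p95 = percentile(latencies, 95)          # 120 ms
p99 = percentile(latencies, 99)          # 900 ms — the tail the mean hides
```

Two percent of users here wait 900 ms, yet the average stays comfortably under 65 ms; this is why thresholds are stated as p95/p99 targets rather than mean response times.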


Security Testing

Security testing identifies vulnerabilities that could be exploited to compromise data, availability, or user trust.

Static Application Security Testing (SAST)

SAST tools analyze source code or compiled binaries without executing the application. They detect common vulnerability patterns such as SQL injection, cross-site scripting (XSS), insecure cryptographic usage, hardcoded secrets, and path traversal. Tools like Semgrep, CodeQL, SonarQube, and Bandit integrate into the CI pipeline and flag issues in pull requests before code is merged.

Dynamic Application Security Testing (DAST)

DAST tools interact with a running instance of the application, sending malicious inputs and observing responses. They simulate attacker behavior to discover vulnerabilities that only manifest at runtime, such as improper error handling, missing security headers, authentication bypass, and server misconfiguration. OWASP ZAP and Burp Suite are widely used DAST tools.

Dependency Vulnerability Scanning

Modern applications depend on hundreds of third-party packages, and new vulnerabilities are disclosed daily. Tools like Dependabot, Snyk, Trivy, and npm audit continuously monitor dependencies against vulnerability databases (CVE, NVD, GitHub Advisory Database) and alert the team when a patch is available. Critical vulnerabilities in direct dependencies should be treated as high-priority issues.

Penetration Testing

Penetration testing (pen testing) engages skilled security professionals to attempt to breach the application using the same techniques a real attacker would. Pen tests go beyond automated scanning by chaining vulnerabilities, testing business logic flaws, and simulating social engineering. For high-risk applications (financial services, healthcare, government), periodic pen testing is often a compliance requirement.

OWASP Top Ten

The OWASP Top Ten is a widely recognized list of the most critical web application security risks. Testing against these categories — including broken access control, cryptographic failures, injection, insecure design, security misconfiguration, vulnerable and outdated components, identification and authentication failures, software and data integrity failures, security logging and monitoring failures, and server-side request forgery — provides a structured baseline for security validation.


Accessibility Testing

Accessibility ensures the application is usable by people with diverse abilities, including those who use screen readers, keyboard navigation, voice control, or alternative input devices.

Standards and Guidelines

The Web Content Accessibility Guidelines (WCAG) define three conformance levels: A (minimum), AA (standard target for most applications), and AAA (highest). WCAG is organized around four principles — Perceivable, Operable, Understandable, and Robust (POUR). Legal frameworks such as the Americans with Disabilities Act (ADA), Section 508, and the European Accessibility Act increasingly require digital products to meet WCAG AA compliance.

Automated Accessibility Testing

Tools like axe-core, Lighthouse, Pa11y, and WAVE scan pages for common violations such as missing alt text on images, insufficient color contrast, missing form labels, incorrect heading hierarchy, and missing ARIA attributes. These tools can be integrated into the CI pipeline to catch regressions automatically.

Manual Accessibility Testing

Automated tools catch approximately 30–50% of accessibility issues. Manual testing is essential to evaluate keyboard navigation (can every interactive element be reached and operated without a mouse?), screen reader experience (does the content make sense when read aloud in linear order?), focus management (is focus moved logically after modal dialogs, page transitions, and dynamic content updates?), and cognitive load (are instructions clear, error messages helpful, and workflows intuitive?). Testing with actual assistive technology users provides the most authentic feedback.


User Acceptance Testing

User acceptance testing (UAT) is the final validation gate before release, confirming that the application meets business requirements and user expectations in practice.

Planning UAT

UAT should be planned early, with test scenarios derived directly from the acceptance criteria defined during Research. Test cases should cover the full range of user roles and workflows, including edge cases and error conditions. A UAT environment that mirrors production as closely as possible — including realistic data, integrations, and performance characteristics — ensures that results are meaningful.

Conducting UAT

UAT participants are typically business stakeholders, subject matter experts, or representative end users — not the development team. Participants execute predefined test scripts and also perform exploratory testing, attempting to use the application the way they would in their daily work. Feedback is captured in a structured format, categorized by severity, and triaged alongside the development team.

Acceptance Criteria and Sign-Off

Each user story or feature has acceptance criteria that define "done" from the business perspective. UAT sign-off means that stakeholders have verified these criteria are met and approve the feature for release. Any blocking issues must be resolved before proceeding to deployment. Non-blocking issues are documented and scheduled for future iterations.


Test Automation Strategy

A sustainable testing strategy balances coverage, speed, and maintenance cost through thoughtful automation.

What to Automate

Regression tests (verifying that existing functionality still works after changes) are the highest-value automation candidates because they run repeatedly and catch unintended side effects. Smoke tests (a small suite verifying that the application's most critical functions work) should run on every deployment. Data-driven tests (the same logic tested with many input variations) are far more efficient when automated than when executed manually.

What Not to Automate

Exploratory testing, where a skilled tester uses creativity and intuition to find unexpected issues, is inherently manual. Usability testing, which evaluates subjective qualities like clarity and ease of use, requires human judgment. One-off tests for issues that are unlikely to recur may not justify the maintenance cost of automation.

Test Data Management

Tests need data, and managing that data is often the most underestimated aspect of test automation. Strategies include factory functions or builders that generate valid data programmatically, database seeding scripts that create a consistent baseline, anonymized production data subsets for realistic performance and integration testing, and test data cleanup to prevent state leakage between test runs.
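The factory-function strategy can be sketched in a few lines. The field names are illustrative; the point is that every test gets valid, unique data by default and overrides only what the scenario cares about:

```python
import itertools

_seq = itertools.count(1)

def make_user(**overrides):
    """Factory: returns a valid, unique user record; any field can be
    overridden per test. (Field names are illustrative.)"""
    n = next(_seq)
    user = {
        "id": n,
        "email": f"user{n}@example.com",
        "role": "member",
        "active": True,
    }
    user.update(overrides)
    return user

# A test that cares only about the role overrides just that field.
admin = make_user(role="admin")
assert admin["role"] == "admin" and admin["active"] is True

# Unique defaults prevent collisions (and state leakage) between tests.
assert make_user()["email"] != make_user()["email"]
```

Libraries such as factory_boy (Python) formalize this pattern, but even a hand-rolled factory eliminates the copy-pasted fixture dictionaries that make test suites brittle.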

Continuous Testing in CI/CD

Automated tests should be integrated into the CI/CD pipeline at appropriate stages. On every commit or pull request, linting, static analysis, and unit tests should run (providing feedback in under five minutes). On merge to main, integration tests and a smoke E2E suite should run. Before deployment to staging, the full E2E suite and performance benchmarks should execute. After deployment to production, smoke tests and synthetic monitoring should verify the deployment is healthy.
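The first two stages above might look like the following in a CI configuration. This is a hedged GitHub Actions sketch; the job names and `make` targets are placeholders standing in for a project's actual commands, not a drop-in workflow:

```yaml
# Illustrative staged pipeline: fast checks on every PR, deeper checks on merge.
on:
  pull_request:          # every PR: lint, static analysis, unit tests (< 5 min)
  push:
    branches: [main]     # merge to main: integration tests + smoke E2E

jobs:
  fast-checks:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint static-analysis unit-tests    # placeholder targets

  merge-checks:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make integration-tests smoke-e2e        # placeholder targets
```

The later stages (full E2E before staging, post-deploy smoke tests and synthetic monitoring) typically live in a separate deployment workflow triggered by the release process rather than by commits.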


Key Deliverables

By the end of the Testing phase, the team should have produced a comprehensive unit test suite with strong coverage of business logic, integration tests validating component interactions and data flows, an E2E test suite covering critical user journeys, performance test results benchmarked against non-functional requirements, security scan reports with all critical and high vulnerabilities resolved, accessibility audit results demonstrating WCAG conformance, UAT sign-off from business stakeholders, and a test automation suite integrated into the CI/CD pipeline.

These deliverables provide the evidence and confidence needed to proceed to deployment, knowing that the application has been thoroughly validated from every critical angle.
