DevOps Interview Questions and Answers (Real-World & Scenario Based)

This page is designed for real-world DevOps, Platform Engineering, Kubernetes, and SRE interviews. Each answer goes beyond definitions to explain why the concept matters in production, how it is implemented in real systems, and the trade-offs engineers must evaluate.

Section 1: DevOps Fundamentals (20 Questions)

What is DevOps?
DevOps is a cultural and engineering approach that removes silos between development and operations teams. It promotes collaboration, automation, and shared ownership across the entire software lifecycle.

In production systems, DevOps enables faster releases, safer deployments, and quicker recovery from failures by standardizing processes and tooling.
Why did DevOps emerge?
Traditionally, development and operations teams worked independently, which caused slow releases and frequent production issues.

DevOps emerged to shorten feedback loops, automate deployments, and improve system stability while increasing delivery speed.
What are the core principles of DevOps?
Key principles include continuous integration, continuous delivery, automation, infrastructure as code, monitoring, and fast feedback.

These principles ensure changes are small, testable, observable, and reversible.
How does DevOps improve reliability?
DevOps reduces manual intervention by automating builds, deployments, and infrastructure.

Automation reduces human error and ensures consistent behavior across environments.
What is Infrastructure as Code (IaC)?
IaC is the practice of managing infrastructure using code instead of manual configuration.

It allows version control, repeatability, and safe rollback of infrastructure changes.
What is configuration drift?
Configuration drift occurs when systems are changed manually and no longer match their defined configuration.

This leads to unpredictable behavior and difficult troubleshooting in production.
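
Drift detection boils down to comparing the declared configuration with what is actually running. The following is a minimal Python sketch of that comparison; the setting names and values are made up for illustration, and real tools (for example a plan or diff step in an IaC tool) do considerably more.

```python
def find_drift(desired: dict, actual: dict) -> dict:
    """Return {key: (desired_value, actual_value)} for every setting that differs."""
    missing = object()  # sentinel so a key absent on one side still counts as drift
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, missing)
        have = actual.get(key, missing)
        if want != have:
            drift[key] = (
                None if want is missing else want,
                None if have is missing else have,
            )
    return drift


# Hypothetical example: someone changed the server by hand after it was provisioned.
desired = {"max_connections": 100, "tls": True}
actual = {"max_connections": 500, "tls": True, "debug": True}
print(find_drift(desired, actual))
# {'debug': (None, True), 'max_connections': (100, 500)}
```

An empty result means the running system still matches its definition.
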
Why is Git important in DevOps?
Git acts as a single source of truth for application and infrastructure code.

It enables collaboration, auditability, and controlled rollbacks.
What is Continuous Integration (CI)?
CI automatically builds and tests code whenever changes are pushed.

It prevents integration issues and catches bugs early.
What is Continuous Delivery (CD)?
CD ensures software can be deployed to production at any time.

It focuses on reliability and repeatability rather than speed alone.
What is Continuous Deployment?
Continuous Deployment automatically deploys every successful change to production.

It requires strong testing, monitoring, and rollback strategies.
What is shift-left testing?
Shift-left testing means testing earlier in the development lifecycle.

It reduces late-stage failures and lowers the cost of fixing bugs.
Why is automation critical in DevOps?
Automation eliminates repetitive manual tasks and reduces errors.

It allows teams to scale systems without scaling human effort.
What is immutable infrastructure?
Immutable infrastructure means servers or containers are replaced with a new version rather than modified in place.

This simplifies rollback and prevents configuration drift.
What is observability?
Observability is the ability to understand a system's internal behavior from its outputs: metrics, logs, and traces.

It is essential for debugging distributed systems.
What is a feedback loop?
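A feedback loop is the path by which information about a change (test results, metrics, alerts, user reports) returns to the team that made it, providing insight into system health and deployment impact.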

Fast feedback enables rapid improvements and recovery.
What is DevSecOps?
DevSecOps integrates security into DevOps practices.

Security checks are automated and treated as code.
Why are small releases preferred?
Small releases reduce blast radius and simplify troubleshooting.

They make rollbacks faster and safer.
What is MTTR?
Mean Time To Recovery measures how quickly a system recovers from failure.

Lower MTTR indicates operational maturity.
What is change failure rate?
Change failure rate is the percentage of deployments that cause an incident or require remediation such as a rollback or hotfix.

Lower rates indicate safer delivery practices.
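
Both MTTR and change failure rate are simple ratios once incidents and deployments are recorded. A minimal Python sketch, assuming hypothetical incident timestamps and deployment counts:

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - started) over a list of (started, resolved) pairs."""
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def change_failure_rate(total_deployments, failed_deployments):
    """Fraction of deployments that caused an incident."""
    return failed_deployments / total_deployments


# Hypothetical data: two incidents lasting 30 and 90 minutes.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 15, 30)),
]
mttr = mean_time_to_recovery(incidents)
cfr = change_failure_rate(total_deployments=40, failed_deployments=3)
print(mttr)           # 1:00:00
print(f"{cfr:.1%}")   # 7.5%
```
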
Why is documentation important?
Documentation preserves operational knowledge.

It reduces dependency on individuals and improves onboarding.

Section 2: Docker (9 Questions + Code)

What is Docker?
Docker packages applications and dependencies into containers.

It ensures consistent behavior across environments.
Why use containers instead of VMs?
Containers are lightweight and start faster than VMs.

They share the host OS kernel and consume fewer resources.
What is a Docker image?
An image is an immutable blueprint for containers.

It contains code, dependencies, and runtime configuration.
What is a Docker container?
A container is a running instance of an image.

It has isolated process and filesystem space.
Explain multi-stage Docker builds.
Multi-stage builds separate build and runtime environments.

They reduce image size and improve security.

Example production Dockerfile


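# Build stage: install dependencies into an isolated prefix
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Runtime stage: copy only the installed packages (libraries and console scripts) and the app code
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
CMD ["python", "app.py"]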

What is ENTRYPOINT vs CMD?
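ENTRYPOINT defines the executable that always runs when the container starts.

CMD provides default arguments to ENTRYPOINT, or the default command when no ENTRYPOINT is set. Arguments passed to docker run override CMD.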
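
The way the two instructions combine can be modeled in a few lines. This Python sketch covers only the exec (JSON array) form and is a simplified model, not Docker's implementation; the `--port` flag is a hypothetical application argument.

```python
def effective_command(entrypoint, cmd, run_args=None):
    """Exec-form rule: arguments given to `docker run` replace CMD; ENTRYPOINT is kept."""
    args = run_args if run_args else cmd
    return list(entrypoint) + list(args)


entrypoint = ["python", "app.py"]
cmd = ["--port", "5000"]  # hypothetical default arguments

print(effective_command(entrypoint, cmd))
# ['python', 'app.py', '--port', '5000']
print(effective_command(entrypoint, cmd, run_args=["--port", "8080"]))
# ['python', 'app.py', '--port', '8080']
print(effective_command([], ["python", "app.py"]))
# ['python', 'app.py']
```

The last call shows the case with no ENTRYPOINT, where CMD supplies the whole command.
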
How do you persist data?
Containers are ephemeral by nature.

Volumes or bind mounts are used for persistence.
What is Docker networking?
Docker networking enables container communication.

Bridge, host, and overlay are the most common network drivers.
What would you do if a container keeps restarting?
Check logs, exit codes, health checks, and environment variables.

Startup failures are the most common cause.
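
Exit codes are often the quickest clue. The sketch below maps common codes to a likely meaning, following the shell convention that codes above 128 mean "terminated by signal (code - 128)". The hint texts are illustrative assumptions, not an exhaustive diagnosis.

```python
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Translate a container exit code into a first troubleshooting hint."""
    if code == 0:
        return "exited cleanly; check whether the main process is meant to keep running"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found; check ENTRYPOINT/CMD and PATH"
    if code > 128:
        sig = code - 128
        name = SIGNAL_NAMES.get(sig, f"signal {sig}")
        hint = " (often an out-of-memory kill)" if sig == 9 else ""
        return f"terminated by {name}{hint}"
    return "application error; read the container logs"


print(explain_exit_code(137))  # terminated by SIGKILL (often an out-of-memory kill)
print(explain_exit_code(127))  # command not found; check ENTRYPOINT/CMD and PATH
```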

Section 3: CI/CD (20 Questions + YAML)


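name: CI Pipeline
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Pin the interpreter so builds do not change when the runner image is updated
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest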

CI/CD pipelines automate building, testing, and deploying applications. In production, they are critical for reliability, speed, and safe rollbacks.

What is CI/CD and why is it critical in production?
CI/CD automates the process of integrating code, running tests, and deploying applications.

In production systems, CI/CD reduces manual errors, ensures consistent releases, and allows teams to deploy frequently with confidence.
What is the difference between Continuous Delivery and Continuous Deployment?
Continuous Delivery keeps code always deployable, but the release to production is triggered by a manual approval.

Continuous Deployment automatically releases every successful change to production, which requires strong testing and monitoring.
CI pipeline suddenly starts failing without code changes. Why?
This often happens due to dependency updates, external service outages, or changes in the build environment.

Resolution involves checking pipeline logs, pinning dependency versions, and verifying external integrations.
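
Pinning can be checked automatically. This is a minimal Python sketch that flags requirement lines without an exact `==` version; it deliberately ignores edge cases such as URLs, hashes, and environment markers, and the package lines in the example are illustrative.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r / -e
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


sample = """\
requests==2.31.0
flask>=2.0      # a range, so a new release can change the build
pytest
"""
print(unpinned_requirements(sample))  # ['flask>=2.0', 'pytest']
```

A check like this could run as an early pipeline step and fail the build when the list is not empty.
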
How do you design a reliable CI pipeline?
A reliable pipeline includes build, unit tests, integration tests, and security checks.

Each stage should fail fast and provide clear feedback to developers.
What causes flaky tests in CI pipelines?
Flaky tests fail intermittently due to timing issues, shared state, or external dependencies.

They should be isolated, stabilized, or removed to maintain pipeline trust.
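
One way to find flaky tests is to rerun the suite several times on the same commit and look for tests with mixed results. A minimal Python sketch, assuming each run is recorded as a dictionary of hypothetical test names to pass/fail:

```python
def find_flaky_tests(runs):
    """runs: list of {test_name: passed} dicts from repeated runs of the same commit."""
    outcomes = {}
    for run in runs:
        for name, passed in run.items():
            outcomes.setdefault(name, set()).add(passed)
    # A test that both passed and failed on identical code is flaky.
    return sorted(name for name, seen in outcomes.items() if len(seen) == 2)


runs = [
    {"test_login": True, "test_cart": True, "test_pay": False},
    {"test_login": True, "test_cart": False, "test_pay": False},
]
print(find_flaky_tests(runs))  # ['test_cart']
```

Here `test_pay` fails consistently, so it is a genuine failure rather than a flaky test.
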
Pipeline execution is very slow. How do you optimize it?
Slow pipelines reduce developer productivity and delay releases.

Optimization includes caching dependencies, parallelizing jobs, and running only impacted tests.
How do you manage secrets securely in CI/CD?
Secrets should never be hard-coded or stored in repositories.

They must be injected via secret managers or encrypted environment variables.
What happens if secrets are exposed in pipeline logs?
Exposed secrets are a security incident and must be rotated immediately.

Pipelines should mask secrets and avoid echoing sensitive variables.
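
Most CI systems mask registered secrets for you. The idea can be sketched in Python as a simple substitution over log text; the token value below is made up.

```python
def mask_secrets(text: str, secrets: list[str]) -> str:
    """Replace every known secret value in a log line with '***'."""
    # Mask longer secrets first so a secret containing another is fully hidden.
    for secret in sorted(secrets, key=len, reverse=True):
        if secret:  # an empty string would otherwise match everywhere
            text = text.replace(secret, "***")
    return text


print(mask_secrets("deploying with token=abc123", ["abc123"]))
# deploying with token=***
```

Masking is a safety net only: a secret that is transformed before being printed (for example base64-encoded) will not match, so rotation after exposure is still required.
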
How do you handle database migrations in CI/CD?
Migrations should be backward-compatible and reversible.

They are often executed as a separate pipeline stage before application rollout.
What is a build artifact?
A build artifact is a packaged output such as a binary, image, or archive.

Artifacts should be versioned and stored for traceability and rollback.
Deployment succeeds but application fails. What do you do?
First, stop further rollouts and assess user impact.

Roll back to the last known good version and analyze logs and metrics.
What is blue-green deployment?
Blue-green deployment runs two identical environments side by side: one serves live traffic while the other hosts the new version.

Traffic is switched to the new version only after validation.
What is canary deployment?
Canary deployment releases changes to a small subset of users.

It reduces blast radius and allows early issue detection.
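
One common way to pick the "small subset" is to hash a stable user identifier into a bucket, so the same user always sees the same version. A minimal Python sketch of that idea follows; in practice the split is often done by a load balancer or service mesh using traffic weights instead, and the user IDs here are hypothetical.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary based on a hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


# Roughly 5% of users land on the canary, and each user's assignment is stable.
share = sum(routes_to_canary(f"user-{i}", 5) for i in range(10_000)) / 10_000
print(f"canary share: {share:.1%}")
```

Raising `canary_percent` step by step widens the rollout without moving users who are already on the canary back to the old version.
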
How do you implement rollback in CI/CD?
Rollback uses previously stored artifacts or container images.

Automated rollback is triggered when health checks fail.
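
The rollback logic can be sketched as a small control loop. In this Python sketch, `deploy` and `is_healthy` are hypothetical callables standing in for whatever the pipeline uses to release a version and probe its health.

```python
def deploy_with_rollback(new_version, previous_version, deploy, is_healthy, checks=3):
    """Deploy new_version; if any health check fails, redeploy previous_version."""
    deploy(new_version)
    for _ in range(checks):
        if not is_healthy():
            deploy(previous_version)  # redeploy the last known good artifact
            return previous_version
    return new_version


# Usage with stand-in functions: the health check fails, so v1 is restored.
history = []
active = deploy_with_rollback("v2", "v1", deploy=history.append, is_healthy=lambda: False)
print(active, history)  # v1 ['v2', 'v1']
```

A real pipeline would usually wait between checks and compare error rates or latency against a baseline rather than rely on a single boolean.
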
What is pipeline-as-code?
Pipeline-as-code defines CI/CD pipelines using version-controlled files.

This ensures consistency, reviewability, and reproducibility.
Why should CI pipelines fail fast?
Failing fast prevents wasting resources on broken builds.

It provides faster feedback to developers.
How do you test CI/CD pipelines safely?
Use separate test environments and dry-run modes.

Never test unverified pipelines directly in production.
What causes deployments to work in staging but fail in production?
Environment drift, missing configs, or traffic differences are common causes.

Production parity and IaC help prevent this.
How do you monitor CI/CD pipelines?
Monitor build success rate, duration, and failure trends.

Alerts should trigger on abnormal failure spikes.
How do CI/CD pipelines support DevOps culture?
They enable collaboration, transparency, and shared responsibility.

Everyone can see and improve the delivery process.

Section 4: Kubernetes (20 Questions + YAML)


apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
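        # Pin a specific version tag or digest in production; ":latest" makes rollbacks and audits unreliable.
        image: myapp:latest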
        ports:
        - containerPort: 5000

Kubernetes provides automated deployment, scaling, self-healing, and high availability for containerized workloads in production.

What is Kubernetes and why is it used in production?
Kubernetes is a container orchestration platform that manages the lifecycle of containerized applications.

It handles scheduling, scaling, self-healing, and service discovery, which are critical for running distributed systems reliably at scale.
What is a Pod and why is it the smallest unit?
A Pod is the smallest deployable unit in Kubernetes and can contain one or more containers.

Containers in a Pod share networking and storage, allowing tightly coupled processes to run together.
What happens when a Pod crashes?
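When a container in a Pod crashes, the kubelet restarts it according to the Pod's restartPolicy, with increasing back-off delays between attempts.

Controllers such as Deployments and ReplicaSets replace Pods that are deleted or lost, so the desired number of replicas keeps running. This is the self-healing behavior.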
What is CrashLoopBackOff?
CrashLoopBackOff occurs when a container repeatedly crashes after startup.

It is usually caused by application errors, wrong entrypoints, missing environment variables, or failing probes.
How do you debug a Pod in CrashLoopBackOff?
Use kubectl logs to inspect application errors and kubectl describe pod to review events.

Check startup commands, configuration, secrets, and resource limits.
What is the difference between Deployment and StatefulSet?
Deployments manage stateless applications with interchangeable Pods.

StatefulSets provide stable identities, persistent storage, and ordered deployment for stateful workloads.
What are liveness and readiness probes?
Liveness probes determine when a container should be restarted.

Readiness probes determine when a Pod can receive traffic.
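
On the application side, probes are usually just two HTTP endpoints with different meanings. The Python sketch below uses only the standard library; the paths `/healthz` and `/ready` are conventional but hypothetical names, and the server binds to 127.0.0.1 for local experimentation (inside a Pod it would need to listen on an address the kubelet can reach).

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ready = threading.Event()  # set once startup work (cache warm-up, DB connection) is done


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status = 200  # liveness: the process is up and able to answer
        elif self.path == "/ready":
            status = 200 if ready.is_set() else 503  # readiness: safe to receive traffic?
        else:
            status = 404
        self.send_response(status)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_probe_server(port=0):
    """Start the probe server in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Until the application calls `ready.set()`, the readiness endpoint answers 503 while liveness answers 200, which is the state in which a Pod should stay running but receive no traffic.
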
What production issue occurs if readiness probes are misconfigured?
Traffic may be sent to Pods before the application is ready.

This causes intermittent failures during deployments or scaling events.
What is a Service in Kubernetes?
A Service provides a stable network endpoint for accessing Pods.

It decouples clients from the dynamic nature of Pod IP addresses.
Difference between ClusterIP, NodePort, and LoadBalancer?
ClusterIP is reachable only inside the cluster, NodePort exposes the Service on a static port on every node, and LoadBalancer provisions an external load balancer through the cloud provider.

The choice depends on traffic pattern and environment.
What is an Ingress and why is it used?
Ingress manages HTTP/HTTPS routing to Services.

It provides TLS termination, path-based routing, and centralized traffic management.
Why do you get 502/504 errors from Ingress?
These errors occur when backend Pods are unhealthy or unreachable.

Misconfigured timeouts, services, or network policies are common causes.
What is a ConfigMap and how is it used?
ConfigMaps store non-sensitive configuration data.

They allow configuration changes without rebuilding container images.
What is a Secret and how is it different from ConfigMap?
Secrets store sensitive data such as passwords and tokens.

They are only base64-encoded, not encrypted, so protection comes from RBAC access controls and from enabling encryption at rest for etcd.
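
Because base64 is a reversible encoding, anyone who can read a Secret object can recover the value without a key. A minimal Python sketch, using a made-up password:

```python
import base64

# What ends up in the `data` field of a Secret manifest.
stored = base64.b64encode(b"s3cr3t-password").decode("ascii")

# Reversing it needs no key at all.
recovered = base64.b64decode(stored).decode("utf-8")

print(stored)
print(recovered)  # s3cr3t-password
```
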
Why don’t updated Secrets reflect immediately?
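Secrets injected as environment variables are read only when the container starts. Secrets mounted as volumes are refreshed by the kubelet after a delay (but not when mounted with subPath), and the application still has to re-read the file.

Pods must be restarted, or the application must support a reload mechanism.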
What is Horizontal Pod Autoscaler (HPA)?
HPA automatically scales Pods based on metrics like CPU or memory usage.

It helps maintain performance during traffic spikes.
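
The core scaling rule documented for the HPA is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), with no change when the ratio is within a tolerance of 1 (0.1 by default). A minimal Python sketch of that rule; it omits the min/max replica bounds, stabilization windows, and handling of not-yet-ready Pods that the real controller applies.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Simplified HPA calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: do not scale
    return math.ceil(current_replicas * ratio)


print(desired_replicas(3, current_metric=90, target_metric=60))  # 5 (scale up)
print(desired_replicas(4, current_metric=63, target_metric=60))  # 4 (within tolerance)
print(desired_replicas(4, current_metric=30, target_metric=60))  # 2 (scale down)
```
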
Why might HPA not scale Pods?
Metrics server may be missing or misconfigured.

Missing CPU or memory requests on the containers (utilization targets are calculated against requests) or unrealistic thresholds also prevent scaling.
What happens when a Kubernetes node fails?
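Kubernetes marks the node as NotReady and, after the eviction timeout, recreates controller-managed Pods on healthy nodes.

Bare Pods without a controller are not recreated, so production workloads should run under a Deployment, StatefulSet, or similar controller.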
What is etcd and why is it critical?
etcd is a distributed key-value store holding cluster state.

Loss of etcd data can result in total cluster failure.
How do you ensure zero-downtime deployments?
Use rolling updates, readiness probes, sufficient replicas, and proper resource limits.

This ensures traffic is always served by healthy Pods.

Section 5: SRE & DevOps Troubleshooting Labs (20 Questions)

What Would You Do If… (Real Production Scenarios)

Production latency suddenly spikes across all services
Error / Symptom: Users report slow response times, dashboards show increased latency.

Root Cause Analysis: Could be due to a recent deployment, downstream dependency slowdown, database contention, or network saturation.

Resolution: Check latency metrics, recent deployments, and dependency health. Roll back recent changes if needed, scale affected services, and analyze traces to isolate bottlenecks.
Pods are running but users receive HTTP 500 errors
Error / Symptom: Kubernetes shows pods as healthy, but application returns 500 errors.

Root Cause Analysis: Application-level exceptions, database connection failures, or misconfigured environment variables.

Resolution: Inspect application logs, verify database connectivity, check secrets/config maps, and reproduce the error locally if possible.
Pods stuck in CrashLoopBackOff state
Error / Symptom: Pods restart continuously after deployment.

Root Cause Analysis: Application crashes on startup, incorrect entrypoint, missing environment variables, or failing health probes.

Resolution: Use kubectl logs and kubectl describe, fix startup issues, validate configuration, and redeploy with corrected settings.
CI pipeline suddenly starts failing
Error / Symptom: Builds fail despite no recent code changes.

Root Cause Analysis: Dependency updates, expired credentials, external service outages, or infrastructure changes.

Resolution: Review pipeline logs, lock dependency versions, rotate credentials if needed, and retry builds after validation.
Deployment succeeds but application is unreachable
Error / Symptom: Pods are running, but service is not accessible externally.

Root Cause Analysis: Incorrect service selector, missing ingress rules, firewall issues, or wrong ports.

Resolution: Validate service selectors, ingress configuration, network policies, and ensure ports match container configuration.
Database CPU usage spikes suddenly
Error / Symptom: Slow queries and increased response time.

Root Cause Analysis: Inefficient queries, missing indexes, traffic surge, or application bugs.

Resolution: Identify slow queries, add indexes, scale database resources, and apply rate limiting if needed.
Memory usage keeps increasing until pods are killed
Error / Symptom: Pods restart due to OOMKilled events.

Root Cause Analysis: Memory leaks, unbounded caches, or memory limits that are missing or set too low for the workload.

Resolution: Profile memory usage, fix leaks, set proper resource limits, and restart pods gradually.
Alerts are firing continuously (alert fatigue)
Error / Symptom: Engineers receive too many alerts, many non-actionable.

Root Cause Analysis: Poor alert thresholds, or alerting on low-level causes that do not affect users instead of on user-facing symptoms.

Resolution: Tune thresholds, remove noisy alerts, alert only on user-impacting conditions, and document runbooks.
Node becomes NotReady in Kubernetes
Error / Symptom: Pods are evicted and node is marked NotReady.

Root Cause Analysis: Disk pressure, network failure, or kubelet crash.

Resolution: Check node logs, free disk space, restart kubelet if needed, and replace node if unstable.
Ingress returns 502 / 504 errors
Error / Symptom: Gateway timeout or bad gateway errors.

Root Cause Analysis: Backend pods not responding, timeout misconfiguration, or network issues.

Resolution: Check backend health, adjust ingress timeouts, and verify service connectivity.
Secrets updated but application still uses old values
Error / Symptom: App does not reflect updated credentials.

Root Cause Analysis: Secrets injected as environment variables are read only at container start, and applications often cache mounted secret files instead of re-reading them.

Resolution: Restart pods to reload secrets, or implement secret reload mechanisms.
Rolling deployment causes partial downtime
Error / Symptom: Users see intermittent failures during deploy.

Root Cause Analysis: Incorrect readiness probes or insufficient replicas.

Resolution: Fix readiness probes, increase replica count, and use rolling update strategies properly.
High error rate after a new release
Error / Symptom: Error metrics spike immediately post-deploy.

Root Cause Analysis: Buggy release, incompatible dependency, or configuration mismatch.

Resolution: Roll back quickly, analyze logs, fix the issue, and redeploy safely.
Autoscaling not triggering under high load
Error / Symptom: Traffic increases but pods do not scale.

Root Cause Analysis: Missing metrics server or incorrect HPA configuration.

Resolution: Validate metrics availability, fix HPA thresholds, and test scaling behavior.
Service-to-service communication fails intermittently
Error / Symptom: Random timeouts between services.

Root Cause Analysis: Network policies, DNS issues, or overloaded downstream services.

Resolution: Validate network policies, check DNS resolution, and apply retries with backoff.
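
Retries with backoff can be sketched in Python as follows. This is a minimal version using exponential backoff with full jitter; the exception types treated as transient and the delay values are assumptions to adjust per service.

```python
import random
import time


def call_with_retries(func, attempts=4, base_delay=0.1, max_delay=2.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func(), retrying transient errors with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retry storms
```

Only idempotent calls should be retried this way, and the total retry time should stay below the caller's own timeout so failures do not pile up across services.
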
Disk usage reaches 100%
Error / Symptom: Applications fail to write data or logs.

Root Cause Analysis: Log accumulation, temp files, or missing log rotation.

Resolution: Clean disk, enable log rotation, and monitor disk usage continuously.
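
For application logs, rotation can be enabled in the logger itself. A minimal Python sketch using the standard library's `RotatingFileHandler`; the logger name, size limit, and backup count are illustrative defaults.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, name="app", max_bytes=1_000_000, backups=3):
    """Return a logger whose file is rotated once it approaches max_bytes."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With these settings the log can occupy at most about `max_bytes * (backups + 1)` on disk, because the oldest backup is deleted on each rotation. Containerized applications more often log to stdout and leave rotation to the container runtime or node-level tooling.
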
Monitoring dashboards show missing data
Error / Symptom: Metrics suddenly disappear.

Root Cause Analysis: Exporter failure, network issues, or misconfigured scraping.

Resolution: Check exporter health, validate configs, and restart monitoring components.
Deployment pipeline exposes secrets accidentally
Error / Symptom: Secrets appear in logs or build output.

Root Cause Analysis: Improper logging or echoing environment variables.

Resolution: Rotate secrets immediately, fix pipeline scripts, and restrict log verbosity.
On-call engineer overwhelmed during incident
Error / Symptom: Slow response and confusion during outages.

Root Cause Analysis: Lack of runbooks and unclear ownership.

Resolution: Create clear runbooks, improve documentation, and rotate on-call responsibilities.
Recurring incident with same root cause
Error / Symptom: Same failure keeps happening repeatedly.

Root Cause Analysis: Temporary fixes without addressing underlying issue.

Resolution: Conduct blameless postmortems, implement permanent fixes, and add preventive monitoring.