# Internal PKI for Microservices — mTLS, Certificate Automation, and Trust Distribution

> **Intro:** Once a platform has many services, containers, and clusters, “copy a self-signed certificate into each service” stops being a real security strategy. The operational problem becomes certificate **issuance, trust distribution, rotation, and revocation** at scale. This page gives a practical path for building an internal PKI for service-to-service TLS and mTLS without turning the KB into a full PKI textbook.
>
> **What this page includes**
>
> * when an internal PKI is worth the operational cost;
> * the most practical open-source and commercial options;
> * a recommended CA hierarchy for microservices;
> * step-by-step implementation guidance for Kubernetes-heavy and mixed environments;
> * starter snippets for cert-manager, trust-manager, and Vault PKI.

## What problem this solves

You need an internal PKI when services must:

* encrypt service-to-service traffic;
* mutually authenticate peers, not just encrypt a channel;
* rotate certificates automatically without manual redeploy work;
* revoke or replace compromised credentials quickly;
* keep trust roots and leaf certificates under central control.

Typical drivers:

* many east-west calls between services;
* zero-trust or mTLS programs;
* multi-cluster Kubernetes platforms;
* regulated environments where certificate ownership and rotation evidence matter;
* service mesh or workload-identity programs.

## Do not start with leaf certificates — start with operating model

Before choosing tools, decide:

1. **identity model** — DNS names, service names, workload identity, or SPIFFE IDs;
2. **topology** — one cluster, many clusters, mixed VM + Kubernetes, or hybrid cloud;
3. **certificate lifetime** — long-lived leaf certs make operations easier but security worse;
4. **trust distribution method** — how every service gets the CA bundle;
5. **enforcement point** — application code, sidecar proxy, ingress / gateway, or service mesh.

If these decisions are vague, the PKI will become fragile quickly.

## Recommended hierarchy for an internal PKI

For most organizations, the practical model is:

* **offline root CA**;
* **online intermediate CA(s)** for issuance;
* **short-lived leaf certificates** for workloads;
* separate **trust bundle distribution** for roots / intermediates.

```mermaid
flowchart TD
    ROOT[Offline Root CA\nrarely used]
    INT1[Online Intermediate CA\ncluster / environment / region]
    INT2[Optional second Intermediate\nfor rotation or another environment]
    TRUST[Trust bundle distribution\nConfigMap / secret / image / system trust]
    SVC1[Service A]
    SVC2[Service B]
    SVC3[Service C]

    ROOT --> INT1
    ROOT --> INT2
    INT1 --> SVC1
    INT1 --> SVC2
    INT1 --> SVC3
    ROOT --> TRUST
    INT1 --> TRUST
    TRUST --> SVC1
    TRUST --> SVC2
    TRUST --> SVC3
```

### Why this hierarchy works better than “one self-signed cert per service”

Because it separates:

* **trust anchor lifetime** from **workload certificate lifetime**;
* **emergency replacement** from **normal renewal**;
* CA key protection from day-to-day service deployment.

## Practical options — open source and commercial

### Option 1 — cert-manager + trust-manager

Best fit when:

* workloads are mainly on Kubernetes;
* teams already manage secrets and ingress in cluster-native ways;
* you want Kubernetes-native renewal and trust-bundle distribution.

Why teams choose it:

* clean Kubernetes API model;
* automatic renewal with `Certificate` resources;
* easy bootstrap path from self-signed root to CA issuer;
* trust-manager distributes CA bundles across namespaces and workloads.

Trade-offs:

* strongest in Kubernetes, weaker as a general-purpose PKI for mixed estates;
* you still need to think through root/intermediate protection and disaster recovery;
* service-to-service identity semantics remain your responsibility unless combined with a mesh or SPIFFE-based model.

### Option 2 — Smallstep `step-ca`

Best fit when:

* you want an internal CA beyond Kubernetes only;
* you want ACME-friendly automation for servers, gateways, and services;
* you want a lightweight private CA with relatively low operational complexity.

Why teams choose it:

* purpose-built for automated private X.509 and SSH issuance;
* good support for short-lived certificates;
* useful when services or edge proxies can enroll via ACME or other supported provisioners;
* good stepping stone from “manual certs” to automated certificate lifecycle.

Trade-offs:

* still a CA you must operate and protect;
* trust distribution remains a platform task;
* less Kubernetes-native than cert-manager for in-cluster object workflows.

### Option 3 — HashiCorp Vault PKI

Best fit when:

* you already run Vault for secrets or strong authn/authz workflows;
* you want dynamic issuance, short-lived certs, and policy-driven roles;
* you need more enterprise-grade control over issuance, revocation, and multi-issuer rotation.

Why teams choose it:

* dynamic X.509 issuance through the PKI engine;
* short TTLs work well for service certificates;
* good fit when services already authenticate to Vault;
* can centralize PKI and secret workflows in one control plane.

Trade-offs:

* heavier to operate than a narrow CA-only solution;
* application or platform enrollment patterns must be designed carefully;
* Kubernetes integration is good, but not as “native object first” as cert-manager.

### Option 4 — SPIRE / SPIFFE

Best fit when:

* the real requirement is **workload identity**, not just certificates;
* the environment is dynamic and heterogeneous;
* you want workload-attested identities and automated mTLS without manually reasoning about each private key and CSR flow.

Why teams choose it:

* identities are issued to workloads based on attestation;
* short-lived SVIDs fit service-to-service auth very well;
* good choice for platform teams building zero-trust service identity.

Trade-offs:

* more architectural than “just run a CA”;
* stronger fit for platform engineering than for quick certificate file distribution;
* application teams must understand SPIFFE / Workload API or rely on a service mesh or proxy integration.

### Commercial examples worth knowing

| Product                              | Where it fits                                                                                                  |
| ------------------------------------ | -------------------------------------------------------------------------------------------------------------- |
| **Smallstep Certificate Manager**    | managed / hosted version of the Smallstep model for teams that want less CA operations overhead                |
| **Venafi Control Plane**             | enterprise machine identity management, policy, lifecycle, discovery, and governance across many environments  |
| **DigiCert Trust Lifecycle Manager** | CA-agnostic certificate inventory, lifecycle, workflow, and private PKI / trust management at enterprise scale |
| **HCP Vault / Vault Enterprise**     | good when Vault is already strategic and you want PKI plus broader secrets / identity workflows                |

## What to choose in practice

### If you are mostly on Kubernetes

Start with **cert-manager + trust-manager**.

### If you need a general internal CA across VMs, containers, and gateways

Start with **step-ca** or **Vault PKI** depending on whether you need a focused CA or a broader secrets platform.

### If you need first-class workload identity for dynamic service fleets

Evaluate **SPIRE / SPIFFE**, often together with a mesh or proxy layer.

## Service mesh note

If you already run a mesh such as **Istio** or **Linkerd**, the easiest way to encrypt east-west traffic is often to let the mesh manage workload certificates and mTLS.

That does **not** remove the PKI problem. It shifts it to:

* who signs workload certificates;
* how the trust anchor rotates;
* whether the mesh uses self-signed roots or plugs into your own CA.

A common mistake is to assume “we enabled the mesh, therefore PKI is solved forever.” It is not.

## Step-by-step implementation model

### Step 1 — define service identity format

Decide what identities look like.

Typical choices:

* DNS SANs like `service-a.namespace.svc.cluster.local`;
* external/internal FQDNs for gateway or VM services;
* SPIFFE IDs like `spiffe://company.internal/ns/payments/sa/api`.

Do this before building automation; otherwise inconsistent identities get baked into every certificate.
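A minimal sketch of keeping identity strings consistent by generating and parsing them in one place. The helper names are illustrative, and the `/ns/<namespace>/sa/<service-account>` path layout simply mirrors the example above; it is a naming convention, not something SPIFFE mandates.

```python
import re
from urllib.parse import urlsplit

# RFC 1123 label: lowercase alphanumerics and '-', 1-63 chars, no leading/trailing '-'.
_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")


def k8s_dns_san(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name to use as a DNS SAN."""
    for label in (service, namespace):
        if not _LABEL.match(label):
            raise ValueError(f"invalid DNS label: {label!r}")
    return f"{service}.{namespace}.svc.{cluster_domain}"


def spiffe_id(trust_domain: str, namespace: str, service_account: str) -> str:
    """Build a SPIFFE ID using the (assumed) /ns/<ns>/sa/<sa> path convention."""
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"


def parse_spiffe_id(value: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, path); reject malformed input."""
    parts = urlsplit(value)
    if parts.scheme != "spiffe" or not parts.netloc or parts.query or parts.fragment:
        raise ValueError(f"not a SPIFFE ID: {value!r}")
    return parts.netloc, parts.path
```

Routing every issuance request through helpers like these makes it harder for two teams to invent slightly different SAN formats for the same service.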

### Step 2 — create an offline root and an online intermediate

Baseline guidance:

* root key offline or otherwise strongly protected;
* intermediate used for routine issuance;
* do not let applications or normal deployment automation talk to the root.

This lets you:

* rotate intermediates without replacing the entire trust model;
* issue short-lived workload certs at scale;
* reduce blast radius if the online issuance tier is compromised.

### Step 3 — automate enrollment, do not hand-copy certificates

Use one of these patterns:

* Kubernetes `Certificate` objects via cert-manager;
* ACME enrollment against step-ca or another CA;
* Vault PKI roles and API / agent-based retrieval;
* SPIRE agent workload attestation and SVID issuance.

Manual copy-and-paste of PEM files does not scale and makes rotation brittle.

### Step 4 — distribute trust separately from leaf certificates

Every workload needs the CA bundle that validates peers.

Common patterns:

* a ConfigMap / Secret mounted to workloads;
* OS trust store update in VM images;
* trust-manager bundles in Kubernetes;
* mesh / sidecar distribution.

Do **not** hide the trust bundle inside one application image and forget it. Trust updates must be operable.
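As an illustration of treating the trust bundle as its own artifact, the sketch below merges several PEM bundles into one and drops duplicate certificates. It only handles PEM framing and base64; it does not parse or validate X.509, and the function names are hypothetical.

```python
import base64
import hashlib
import re

_CERT_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----\s*(.+?)\s*-----END CERTIFICATE-----", re.DOTALL
)


def split_bundle(pem_text: str) -> list[bytes]:
    """Return the decoded bytes of every certificate block in a PEM bundle."""
    return [
        base64.b64decode("".join(body.split()))
        for body in _CERT_BLOCK.findall(pem_text)
    ]


def merge_bundles(*bundles: str) -> str:
    """Concatenate PEM bundles, keeping order and dropping duplicate certificates."""
    seen: set[str] = set()
    blocks: list[str] = []
    for bundle in bundles:
        for der in split_bundle(bundle):
            digest = hashlib.sha256(der).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            b64 = base64.b64encode(der).decode("ascii")
            lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
            blocks.append(
                "-----BEGIN CERTIFICATE-----\n"
                + "\n".join(lines)
                + "\n-----END CERTIFICATE-----\n"
            )
    return "".join(blocks)
```

The merged output can then be published through whichever channel the platform uses (ConfigMap, VM image build, or mesh configuration).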

### Step 5 — prefer short-lived leaf certificates

For service identities, short-lived certificates are usually better than long-lived ones.

Why:

* less dependence on revocation;
* lower value if a private key leaks;
* easier to reason about automatic renewal than about annual emergency replacements.

Practical bias:

* leaf certs short-lived and auto-renewed;
* intermediates medium-lived with planned rotation;
* root long-lived but rarely touched.
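A small sketch of the renewal arithmetic behind "short-lived and auto-renewed". The two-thirds default is an assumption chosen for illustration; use whatever threshold your issuer or agent actually applies.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, not_after: datetime, fraction: float = 2 / 3) -> datetime:
    """Return the point in the validity window at which renewal should start."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    if not_after <= not_before:
        raise ValueError("not_after must be later than not_before")
    return not_before + (not_after - not_before) * fraction


def needs_renewal(now: datetime, not_before: datetime, not_after: datetime,
                  fraction: float = 2 / 3) -> bool:
    """True once `now` has passed the renewal point for this certificate."""
    return now >= renewal_time(not_before, not_after, fraction)
```

Renewing well before expiry leaves room for retries if the issuer is briefly unavailable, which matters more as lifetimes get shorter.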

### Step 6 — plan revocation and replacement

Even if you prefer short TTLs, you still need a plan for:

* compromised node or pod credentials;
* stolen private keys for issued leaf certificates;
* intermediate replacement;
* trust bundle overlap during rotation.

If you do not know how to revoke, replace, and redistribute trust under stress, the PKI is not operationally ready.
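For the trust-bundle overlap item, one way to reason about timing: the old CA has to stay in every bundle until the last leaf it signed has expired. The sketch below encodes that rule; the function names and the five-minute skew margin are assumptions.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_removal(
    issuance_switched_at: datetime,
    max_leaf_ttl: timedelta,
    clock_skew: timedelta = timedelta(minutes=5),
) -> datetime:
    """Earliest time the old CA can be dropped from trust bundles.

    Any leaf signed by the old CA was issued no later than `issuance_switched_at`
    and lives at most `max_leaf_ttl`, so after that point (plus a skew margin)
    no unexpired leaf should still chain to the old CA.
    """
    return issuance_switched_at + max_leaf_ttl + clock_skew


def can_remove_old_ca(now: datetime, issuance_switched_at: datetime,
                      max_leaf_ttl: timedelta) -> bool:
    """True once every leaf issued by the old CA must have expired."""
    return now >= earliest_safe_removal(issuance_switched_at, max_leaf_ttl)
```

This is also a practical argument for short leaf TTLs: the overlap window during an intermediate or root rotation is bounded by the longest leaf lifetime.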

### Step 7 — enforce TLS and mTLS at the right layer

Choices:

* in application runtime;
* in reverse proxy or sidecar;
* in service mesh;
* at ingress / gateway only.

For many microservice environments, the cleanest model is:

* **mTLS for east-west service traffic**;
* **separate ingress TLS** for north-south traffic;
* application authorization still done above transport identity.

TLS secures the channel and proves peer identity. It does not replace authorization.
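To make the split between transport identity and authorization concrete, here is a hedged sketch that takes the dictionary returned by Python's `ssl.SSLSocket.getpeercert()` and checks the caller's SANs against an allow-list. The policy table, operation names, and identities are hypothetical.

```python
# Hypothetical policy: which caller identities may invoke which operation.
ALLOWED_CALLERS = {
    "payments:refund": {"spiffe://company.internal/ns/orders/sa/api"},
    "payments:read": {
        "spiffe://company.internal/ns/orders/sa/api",
        "reporting.analytics.svc.cluster.local",
    },
}


def peer_identities(peer_cert: dict) -> set[str]:
    """Collect URI and DNS SAN values from a getpeercert()-style dict."""
    return {
        value
        for kind, value in peer_cert.get("subjectAltName", ())
        if kind in ("URI", "DNS")
    }


def is_authorized(peer_cert: dict, operation: str, policy: dict = ALLOWED_CALLERS) -> bool:
    """Allow the call only if a verified peer identity is listed for the operation."""
    allowed = policy.get(operation, set())
    return bool(peer_identities(peer_cert) & allowed)
```

The TLS handshake has already verified the certificate chain by the time this runs; the check here only decides whether that verified identity may perform the requested operation.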

## Kubernetes-first practical path

### A. Bootstrap a root with a self-signed issuer

Use a self-signed issuer only to create the initial root.

See: [cert-manager root / CA bootstrap starter](https://github.com/D3One/Product-Security-Gitbook/blob/main/snippets/k8s/cert-manager-bootstrap-root-ca-and-leaf.yaml)

This starter shows:

* a bootstrap self-signed `ClusterIssuer`;
* a root CA `Certificate`;
* a CA-backed issuer for normal leaf issuance;
* an example leaf certificate for an internal service.
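For teams that template manifests from code, the leaf `Certificate` can be sketched as a Python dictionary and emitted as JSON, which `kubectl apply -f` accepts. Resource names, the issuer name, and the 24h/8h lifetimes are assumptions; the linked starter remains the reference for the full bootstrap.

```python
import json


def leaf_certificate(name: str, namespace: str, issuer: str, dns_names: list[str],
                     duration: str = "24h", renew_before: str = "8h") -> dict:
    """Build a cert-manager Certificate manifest for a short-lived workload cert."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "secretName": f"{name}-tls",      # Secret that will hold tls.crt / tls.key
            "duration": duration,             # total certificate lifetime
            "renewBefore": renew_before,      # renew this long before expiry
            "dnsNames": list(dns_names),
            "usages": ["server auth", "client auth"],  # both sides of mTLS
            "privateKey": {"algorithm": "ECDSA", "size": 256, "rotationPolicy": "Always"},
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer", "group": "cert-manager.io"},
        },
    }


if __name__ == "__main__":
    manifest = leaf_certificate(
        "service-a", "payments", "internal-ca-issuer",   # hypothetical names
        ["service-a.payments.svc.cluster.local"],
    )
    print(json.dumps(manifest, indent=2))
```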

### B. Distribute trust with trust-manager

See: [trust-manager private CA bundle starter](https://github.com/D3One/Product-Security-Gitbook/blob/main/snippets/k8s/trust-manager-private-ca-bundle.yaml)

This is the practical missing piece many teams forget. Issuing leaf certs is only half of the problem; services also need the right trust bundle.

### C. Mount certificates to workloads or terminate via sidecars / ingress

Patterns:

* mount `tls.crt`, `tls.key`, and CA bundle into the pod;
* configure the service runtime to require and verify client certificates for mTLS;
* or let a sidecar / mesh terminate and present workload identity.
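If the application terminates mTLS itself, the runtime side can look roughly like this in Python's standard `ssl` module. The mount paths are assumptions; adjust them to wherever the Secret and trust bundle are actually mounted.

```python
import ssl


def strict_server_context() -> ssl.SSLContext:
    """TLS server context that rejects clients without a verifiable certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # require and verify a client certificate
    return ctx


def load_mounted_material(ctx: ssl.SSLContext,
                          cert_dir: str = "/etc/tls",                   # assumed mount path
                          ca_file: str = "/etc/trust/ca-bundle.pem"    # assumed mount path
                          ) -> ssl.SSLContext:
    """Load the service's own certificate and the CA bundle used to verify peers."""
    ctx.load_cert_chain(f"{cert_dir}/tls.crt", f"{cert_dir}/tls.key")
    ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

Keeping the leaf material and the CA bundle on separate paths reflects the earlier point that trust is distributed independently of leaf certificates.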

## Vault PKI practical path

See: [Vault PKI bootstrap and issuance starter](https://github.com/D3One/Product-Security-Gitbook/blob/main/snippets/vault/vault-pki-offline-root-and-intermediate-bootstrap.sh)

This starter demonstrates the flow, not a full HA production deployment:

* generate root CA material;
* create an intermediate CSR;
* sign it with the root;
* configure a role for service issuance;
* issue a workload certificate with short TTL.

This is a strong fit when applications can authenticate to Vault or when platform automation can fetch and rotate certificates centrally.
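For the last step of that flow, a minimal sketch of requesting a certificate from Vault's PKI `issue` endpoint over HTTP with only the standard library. The mount name `pki_int`, the role name, and the Vault address are assumptions; how the caller obtains its token is out of scope here.

```python
import json
import urllib.request


def build_issue_request(vault_addr: str, token: str, role: str, common_name: str,
                        ttl: str = "24h", mount: str = "pki_int") -> urllib.request.Request:
    """Prepare a POST to <mount>/issue/<role>; nothing is sent until urlopen()."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/issue/{role}"
    body = json.dumps({"common_name": common_name, "ttl": ttl}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


def issue_certificate(request: urllib.request.Request) -> dict:
    """Send the request and return the PEM material from the response."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        data = json.load(resp)["data"]
    return {key: data[key] for key in ("certificate", "private_key", "issuing_ca")}
```

In practice many teams let Vault Agent or a platform controller perform this call and write the files, so application code never handles the token directly.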

## Smallstep / step-ca practical path

See: [step-ca containerized starter](https://github.com/D3One/Product-Security-Gitbook/blob/main/snippets/identity/step-ca-containerized-starter.yaml)

Use this when you want:

* a lighter-weight private CA than a full secrets platform;
* ACME-driven issuance for internal gateways, proxies, and services;
* a cleaner path from “hand-managed certs” to automated private PKI.

## Example runtime configuration ideas

### Service runtime pattern

Every service that terminates mTLS needs:

* a server certificate and private key;
* a trust bundle for peer validation;
* hostname / identity verification rules;
* safe reload or restart strategy when certificates renew.
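The reload item is the one most often skipped. A simple approach, sketched here under the assumption that certificates arrive as files on disk, is to fingerprint the mounted files and rebuild the TLS context when the fingerprint changes. The class name is hypothetical.

```python
import hashlib
from pathlib import Path


class CertWatcher:
    """Detect changes to certificate files by hashing their contents."""

    def __init__(self, *paths: str) -> None:
        self._paths = [Path(p) for p in paths]
        self._seen = self._fingerprint()

    def _fingerprint(self) -> str:
        digest = hashlib.sha256()
        for path in self._paths:
            data = path.read_bytes()
            # Length prefix keeps file boundaries unambiguous in the combined hash.
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
        return digest.hexdigest()

    def changed(self) -> bool:
        """Return True once per change; the caller should then reload its TLS context."""
        current = self._fingerprint()
        if current == self._seen:
            return False
        self._seen = current
        return True
```

Hashing contents rather than comparing timestamps avoids depending on how the platform swaps files in a mounted volume. A periodic call to `changed()` from a background task is usually enough for certificates that renew every few hours.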

### Gateway / proxy pattern

For many teams, it is easier to terminate and verify mTLS in:

* Envoy;
* NGINX;
* HAProxy;
* service mesh sidecars.

This reduces the amount of application code that directly handles certificate files and trust stores.

## Common mistakes

* using one long-lived self-signed cert everywhere;
* keeping the same root and same intermediate forever;
* distributing leaf certs but forgetting trust-bundle automation;
* storing private keys in images or source control;
* relying on revocation only, with very long certificate lifetimes;
* enabling mTLS transport but keeping authorization weak or implicit;
* assuming the mesh’s default self-signed setup is production-ready forever.

## Fast decision checklist

Choose **cert-manager + trust-manager** when:

* you are mostly on Kubernetes;
* you want Kubernetes-native certificates and trust bundles.

Choose **step-ca** when:

* you want a general private CA with relatively low complexity;
* ACME-based automation is attractive.

Choose **Vault PKI** when:

* Vault is already in place, or policy-driven issuance matters more than Kubernetes-native UX.

Choose **SPIRE** when:

* workload identity and attestation are the real requirement.

## Read next

* [Service-to-Service Auth, Webhooks, and Event-Driven Security](/architecture-api-crypto-and-identity/index-1/service-to-service-auth-webhooks-and-event-driven-security.md)
* [Zero-Trust Egress and Private Connectivity Patterns](/architecture-api-crypto-and-identity/index-1/zero-trust-egress-and-private-connectivity-patterns.md)
* [Workload Federation and Non-Human Identities](/architecture-api-crypto-and-identity/index-2/workload-federation-and-non-human-identities.md)
* [Vault Installation, HA, and Automation Pack](/cloud-kubernetes-and-infrastructure-security/index/vault-installation-ha-and-automation-pack.md)
* [Container and Kubernetes Security](/cloud-kubernetes-and-infrastructure-security/index-1.md)

***

*Author attribution: Ivan Piskunov, 2026 - Educational and defensive-engineering use.*

