azure, docker, powershell

I’ve been doing some work lately creating and migrating Azure Container Registry instances, so I thought I’d share a few helpful scripts. Obvious disclaimers - YMMV, works on my machine, I’m not responsible if you delete something you shouldn’t have, etc.

New-AzureContainerRegistry.ps1

I need to create container registries that have customer managed key support enabled. Unfortunately, there are a lot of steps to this and there are some things that aren’t obvious, like:

  • You need to use the “Premium” SKU for this to work.
  • The Key Vault and the thing being encrypted using customer managed keys (e.g., the container registry) need to be in the same subscription and geographic region. They only say this in the docs about VM disk encryption but it seems to be applicable to all CMK usage.

Normally I’d think about doing this with something like Terraform but as of this writing, Terraform doesn’t have support for ACR + CMK so… script it is.
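
For reference, the general shape of what that script has to do looks something like this. It’s a rough sketch (not the actual script) using the az CLI, the resource names are placeholders, and the exact parameters may drift between az versions:

# Sketch only - placeholder names, no error handling.
$rg = "my-resource-group"
$location = "westus2"
$acrName = "myregistry"
$kvName = "my-acr-keyvault"
$identityName = "my-acr-cmk-identity"

# Everything (Key Vault, identity, registry) lives in the same subscription and region.
az group create --name $rg --location $location

# User-assigned identity the registry will use to reach the Key Vault.
az identity create --resource-group $rg --name $identityName
$identityId = az identity show --resource-group $rg --name $identityName --query id --output tsv
$principalId = az identity show --resource-group $rg --name $identityName --query principalId --output tsv

# Key Vault and key. Purge protection is required for CMK scenarios.
az keyvault create --resource-group $rg --name $kvName --location $location --enable-purge-protection true
az keyvault set-policy --name $kvName --object-id $principalId --key-permissions get unwrapKey wrapKey
az keyvault key create --vault-name $kvName --name acr-cmk-key
$keyId = az keyvault key show --vault-name $kvName --name acr-cmk-key --query key.kid --output tsv

# Premium SKU is required for customer managed keys.
az acr create --resource-group $rg --name $acrName --sku Premium `
  --identity $identityId --key-encryption-key $keyId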

Delete-AzureContainerImages.ps1

This is more a “pruning” operation than deleting, but “prune” isn’t an approved PowerShell verb and I do love me some PowerShell.

In a CI/CD environment, generally you want to keep:

  • The current successfully deployed image.
  • The previous successfully deployed image.
  • The image you want to deploy next (canary style).

…and, actually, that’s about it. CI/CD is fail-forward, so there’s not really a roll-back-three-versions case. You’d roll back the code and build a new container.

Point being, there’s not really a retention policy that handles this in ACR right now. While this script also doesn’t totally handle it the way I’d like, what it can do is keep the most recent X tags of an image and prune all the old ones. I also added a way to regex match a container repository by name so you can be more precise about targeting what you want to prune.
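
A rough sketch of the approach (not the actual script) - the registry name and regex are placeholders, and it leans entirely on the az CLI:

# Sketch: keep the newest $keepCount tags per matching repository, delete the rest.
$registry = "myregistry"
$repositoryFilter = "^myteam/"   # regex used to match repository names
$keepCount = 3

$repositories = az acr repository list --name $registry --output tsv
foreach ($repository in $repositories) {
    if ($repository -notmatch $repositoryFilter) { continue }

    # Newest tags first, so anything after the first $keepCount is prunable.
    $tags = az acr repository show-tags --name $registry --repository $repository --orderby time_desc --output tsv
    $pruneTags = $tags | Select-Object -Skip $keepCount
    foreach ($tag in $pruneTags) {
        az acr repository delete --name $registry --image "$($repository):$($tag)" --yes
    }
}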

Copy-AzureContainerImages.ps1

This is sort of a bulk copy operation for ACR. For reasons I won’t get into, I needed to copy all the images off an ACR, delete/re-create the ACR, and copy them all back. While the az CLI supports importing one image/tag at a time, there’s not really a bulk copy. There’s a ‘transfer artifacts’ mechanism but it’s sort of complex to set up and the az CLI is already here, so…

This script gets all the repositories and all the tags from each repository and does az acr import on all of them. It’s not fast, but it gets the job done.
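
The core of it is roughly this - again a sketch with placeholder registry names, and depending on permissions between the registries you may also need the --registry, --username, and --password options on az acr import:

# Sketch: copy every repository:tag from one registry to another.
$sourceRegistry = "sourceregistry"
$targetRegistry = "targetregistry"

$repositories = az acr repository list --name $sourceRegistry --output tsv
foreach ($repository in $repositories) {
    $tags = az acr repository show-tags --name $sourceRegistry --repository $repository --output tsv
    foreach ($tag in $tags) {
        az acr import --name $targetRegistry `
            --source "$($sourceRegistry).azurecr.io/$($repository):$($tag)" `
            --image "$($repository):$($tag)"
    }
}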

kubernetes

Here’s what I want:

  • Istio 1.6.4 in Kubernetes acting as the ingress.
  • oauth2-proxy wrapped around one application, not the whole cluster.
  • OpenID Connect support for Azure AD - both interactive OIDC and support for client_credentials OAuth flow.
  • Istio token validation in front of the app.
  • No replacing the Istio sidecar. I want things running as stock as possible so I’m not too far off the beaten path when it’s upgrade time.

I’ve set this up in the past without too much challenge using nginx ingress but I don’t want Istio bypassed here. Unfortunately, setting up oauth2-proxy with an Istio (Envoy) ingress is a lot more complex than sticking a couple of annotations in there.

Luckily, I found this blog article by Justin Gauthier, who’d done a lot of the leg-work to figure things out. The differences between that blog article and what I want done are:

  • That article uses an older version of Istio so some of the object definitions don’t apply to my Istio 1.6.4 setup.
  • That article wraps everything in the cluster (via the Istio ingress) with oauth2-proxy and I only want one service wrapped.

With all that in mind, let’s get going.

Prerequisites

There are some things you need to set up before you can get this going.

DNS Entries

Pick a subdomain on which you’ll have the service and the oauth2-proxy. For our purposes, let’s pick cluster.example.com as the subdomain. You want a single subdomain so you can share cookies and so it’s easier to set up DNS and certificates.

We’ll put the app and oauth2-proxy under that.

  • The application/service being secured will be at myapp.cluster.example.com.
  • The oauth2-proxy will be at oauth.cluster.example.com.

In your DNS system you need to assign the wildcard DNS *.cluster.example.com to the IP address that your Istio ingress is using. If someone visits https://myapp.cluster.example.com they should be able to get to your service in the cluster via the Istio ingress gateway.
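
For what it’s worth, my DNS lives in Azure DNS, so that wiring looks roughly like this - a sketch with placeholder zone and resource group names:

# Grab the public IP the Istio ingress gateway is using.
$ingressIp = kubectl get service istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Point the wildcard at it. Assumes the example.com zone is hosted in Azure DNS.
az network dns record-set a add-record `
  --resource-group my-dns-resource-group `
  --zone-name example.com `
  --record-set-name '*.cluster' `
  --ipv4-address $ingressIp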

Azure AD Application

For an application to allow OpenID Connect / OAuth through Azure AD, you need to register the application with Azure AD. The application should be for the service you’re securing.

In that application you need to:

  • On the “Overview” tab, make a note of…
    • The “Application (client) ID” - you’ll need it later. For this example, let’s say it’s APPLICATION-ID-GUID.
    • The “Directory (tenant) ID” - you’ll need it later. For this example, let’s say it’s TENANT-ID-GUID.
  • On the “Authentication” tab:
    • Under “Web / Redirect URIs,” set the redirect URI to /oauth2/callback relative to your app, like https://myapp.cluster.example.com/oauth2/callback.
    • Under “Implicit grant,” check the box to allow access tokens to be issued.
  • On the “Expose an API” tab, create a scope. It doesn’t really matter what it’s called, but if no scopes are present then client_credentials won’t work. I called mine user_impersonation but you could call yours fluffy and it wouldn’t matter. The scope URI will end up looking like api://APPLICATION-ID-GUID/user_impersonation where that GUID is the ID for your application.
  • On the “API permissions” tab:
    • Grant permission to that user_impersonation scope you just created.
    • Grant permission to Microsoft.Graph - User.Read so oauth2-proxy can validate credentials.
    • Click the “Grant admin consent” button at the top or client_credentials won’t work. There’s no way to grant consent in the middle of that flow.
  • On the “Certificates & secrets” page, under “Client secrets,” create a client secret and take note of it. You’ll need it later. For this example, we’ll say the client secret is myapp-client-secret but yours is going to be a long string of random characters.

Finally, somewhat related - take note of the email domain associated with your users in Azure Active Directory. For our example, we’ll say everyone has an @example.com email address. We’ll use that when configuring oauth2-proxy for who can log in.

cert-manager

Set up cert-manager in the cluster. I found the DNS01 solver worked best for me with Istio in the mix because it was easy to get Azure DNS hooked up.

The example here assumes that you have it set up so you can drop a Certificate into a Kubernetes namespace and cert-manager will take over, request a certificate, and populate the appropriate Kubernetes secret that can be used by the Istio ingress gateway for TLS.

Setting up cert-manager isn’t hard, but there’s already a lot of documentation on it so I’m not going to repeat all of it.
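
For reference, the ClusterIssuer this example assumes looks roughly like this - a sketch using the Azure DNS DNS01 solver with placeholder GUIDs and names throughout, so follow the cert-manager docs for the real setup:

apiVersion: cert-manager.io/v1beta1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    email: certs@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-production-account-key
    solvers:
    - dns01:
        azureDNS:
          # Service principal with rights to manage records in the DNS zone.
          clientID: DNS-SP-CLIENT-ID-GUID
          clientSecretSecretRef:
            name: azuredns-credentials
            key: client-secret
          subscriptionID: SUBSCRIPTION-ID-GUID
          tenantID: TENANT-ID-GUID
          resourceGroupName: my-dns-resource-group
          hostedZoneName: example.com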

If you can’t use cert-manager in your environment then you’ll have to adjust for that when you see the steps where the TLS bits are getting set up later.

The Setup

OK, you have the prerequisites set up, let’s get to it.

Istio Service Entry

If you have traffic going through an egress in Istio, you will need to set up a ServiceEntry to allow access to the various Azure AD endpoints from oauth2-proxy. I have all outbound traffic requiring egress so this was something I had to do.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: azure-istio-egress
  namespace: istio-system
spec:
  hosts:
  - '*.microsoft.com'
  - '*.microsoftonline.com'
  - '*.windows.net'
  location: MESH_EXTERNAL
  ports:
  - name: https
    number: 443
    protocol: HTTPS
  resolution: NONE

I use a lot of other Azure services, so I have some pretty permissive outbound allowances. You can try to reduce this to just the minimum of what you need by doing a little trial and error. I know I ran into:

  • graph.windows.net - Azure AD Graph API
  • login.windows.net - Common JWKS endpoint
  • sts.windows.net - Token issuer, also used for token validation
  • *.microsoftonline.com, *.microsoft.com - Some UI redirection happens to allow OIDC login here with a Microsoft account

I’ll admit after I got through a bunch of different minor things, I just started whitelisting egress allowances. It wasn’t that important for me to be exact for this.

I did deploy this to the istio-system namespace. It doesn’t seem to matter where a ServiceEntry gets deployed - once it’s out there, it works for any service in the cluster - but I put all of mine in istio-system so they’re easier to track.

TLS Certificate

OpenID Connect via Azure AD requires a TLS connection for your app. cert-manager takes care of converting a Certificate object to a Kubernetes Secret for us.

It’s important to note that we’re going to use the standard istio-ingressgateway to handle our inbound traffic, and that’s in the istio-system namespace. You can’t read Kubernetes secrets across namespaces, so the Certificate needs to be deployed to the istio-system namespace.

This is one of the places where you’ll see why it’s good to have picked a common subdomain for the oauth2-proxy and the app - a single wildcard certificate covers both.

apiVersion: cert-manager.io/v1beta1
kind: Certificate
metadata:
  name: tls-myapp-production
  namespace: istio-system
spec:
  commonName: '*.cluster.example.com'
  dnsNames:
  - '*.cluster.example.com'
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-production
  secretName: tls-myapp-production

Application Namespace

Create your application namespace and enable Istio sidecar injection. This is where your app/service, oauth2-proxy, and Redis will go.

kubectl create namespace myapp
kubectl label namespace myapp istio-injection=enabled

Redis

You need to enable Redis as a session store for oauth2-proxy if you want the Istio token validation in place. I gather this isn’t required if you don’t want Istio doing any token validation, but I did, so here we go.

I used the Helm chart v10.5.7 for Redis. There are… a lot of ways you can set up Redis. I set up the demo version here in a very simple, non-clustered manner. Depending on how you set up Redis, you may need to adjust your oauth2-proxy configuration.

Here’s the values.yaml I used for deploying Redis:

cluster:
  enabled: false
usePassword: true
password: "my-redis-password"
master:
  persistence:
    enabled: false
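
With that saved as redis-values.yaml (the file name is arbitrary), the deployment itself is the usual Helm routine - this assumes the bitnami chart repo:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis --version 10.5.7 --namespace myapp --values redis-values.yaml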

The Application

When you deploy your application, you’ll need to set up:

  • The Kubernetes Deployment and Service
  • The Istio VirtualService and Gateway

The Deployment doesn’t have anything special; it just exposes a port that can be routed to by a Service. Here’s a simple Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myapp
  labels:
    app.kubernetes.io/name: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp
    spec:
      containers:
      - image: "docker.io/path/to/myapp:sometag"
        imagePullPolicy: IfNotPresent
        name: myapp
        ports:
        - containerPort: 80
          name: http
          protocol: TCP

We have a Kubernetes Service for that Deployment:

apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: myapp
  labels:
    app.kubernetes.io/name: myapp
spec:
  ports:
  # Exposes container port 80 on service port 8000.
  # This is pretty arbitrary, but you need to know
  # the Service port for the VirtualService later.
  - name: http
    port: 8000
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: myapp

The Istio VirtualService is another layer on top of the Service that helps in traffic control. Here’s where we start tying the ingress gateway to the Service.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  labels:
    app.kubernetes.io/name: myapp
  name: myapp
  namespace: myapp
spec:
  gateways:
  # Name of the Gateway we're going to deploy in a minute.
  - myapp
  hosts:
  # The full host name of the app.
  - myapp.cluster.example.com
  http:
  - route:
    - destination:
        # This is the Kubernetes Service info we just deployed.
        host: myapp
        port:
          number: 8000

Finally, we have an Istio Gateway that ties the ingress to our VirtualService.

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  labels:
    app.kubernetes.io/name: myapp
  name: myapp
  namespace: myapp
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    # Same host as the one in the VirtualService, the full
    # name for the service.
    - myapp.cluster.example.com
    port:
      # The name here must be unique across all of the ports named
      # in the Istio ingress. It doesn't matter what it is as long
      # as it's unique. I like using a modified version of the
      # host name.
      name: https-myapp-cluster-example-com
      number: 443
      protocol: HTTPS
    tls:
      # This is the name of the secret that cert-manager placed
      # in the istio-system namespace. It should match the
      # secretName in the Certificate.
      credentialName: tls-myapp-production
      mode: SIMPLE

At this point, if you have everything set up right, you should be able to hit https://myapp.cluster.example.com and get to it anonymously. There’s no oauth2-proxy in place yet, but the ingress is wired up for TLS with that wildcard certificate cert-manager issued, and the DNS is pointing at it, too.

If you can’t get to the service, one of the things isn’t lining up:

  • You forgot to enable Istio sidecar injection on the app namespace or did it after you deployed. Restart the deployments to get the sidecars added.
  • DNS hasn’t propagated.
  • The secret with the TLS certificate isn’t in the istio-system namespace - it must be in istio-system for the ingress to find it.
  • The Gateway isn’t lining up - credentialName is wrong, host name is wrong, port name isn’t unique.
  • The VirtualService isn’t lining up - host name is wrong, Gateway name doesn’t match, Service name or port is wrong.
  • The Service isn’t lining up - the selector doesn’t select any pods, the destination port on the pods is wrong.

If it feels like you’re Odysseus trying to shoot an arrow through 12 axes, yeah, it’s a lot like that. This isn’t even all the axes.

oauth2-proxy

For this I used the Helm chart v3.2.2 for oauth2-proxy. I created the cookie secret for it like this:

docker run -ti --rm python:3-alpine python -c 'import secrets,base64; print(base64.b64encode(secrets.token_bytes(16)));'

You’re also going to need the client ID from your Azure AD application as well as the client secret. You should have grabbed those during the prerequisites earlier.

The values:

config:
  # The client ID of your AAD application.
  clientID: "APPLICATION-ID-GUID"
  # The client secret you generated for the AAD application.
  clientSecret: "myapp-client-secret"
  # The cookie secret you just generated with the Python container.
  cookieSecret: "the-big-base64-thing-you-made"
  # Here's where the interesting stuff happens:
  configFile: |-
    auth_logging = true
    azure_tenant = "TENANT-ID-GUID"
    cookie_httponly = true
    cookie_refresh = "1h"
    cookie_secure = true
    email_domains = "example.com"
    oidc_issuer_url = "https://sts.windows.net/TENANT-ID-GUID/"
    pass_access_token = true
    pass_authorization_header = true
    provider = "azure"
    redis_connection_url = "redis://redis-master.myapp.svc.cluster.local:6379"
    redis_password = "my-redis-password"
    request_logging = true
    session_store_type = "redis"
    set_authorization_header = true
    silence_ping_logging = true
    skip_provider_button = true
    skip_auth_strip_headers = false
    skip_jwt_bearer_tokens = true
    standard_logging = true
    upstreams = [ "static://" ]

Important things to note in the configuration file here:

  • The client ID, client secret, and Azure tenant ID information are all from that Azure AD application you registered as a prerequisite.
  • The logging settings, like silence_ping_logging or auth_logging, are totally up to you. These don’t matter to the functionality but make it easier to troubleshoot.
  • The redis_connection_url is going to depend on how you deployed Redis. You want to connect to the Kubernetes Service that points to the master, at least in this demo setup. There are a lot of Redis config options for oauth2-proxy that you can tweak. Also, storing passwords in config like this isn’t secure so, like, do something better. But it’s also a lot more to explain how to set up and mount secrets and all that here, so just pretend we did the right thing.
  • The pass_access_token, pass_authorization_header, set_authorization_header, and skip_jwt_bearer_tokens values are super key here. The first three must be set that way for OIDC or OAuth to work; the last one must be set for client_credentials to work.
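
With those values saved as oauth2-proxy-values.yaml (again, the file name is arbitrary), installation is the usual Helm routine. The chart location here is an assumption on my part - adjust the repo and chart reference to wherever you get the oauth2-proxy chart:

helm repo add oauth2-proxy https://oauth2-proxy.github.io/manifests
helm install oauth2-proxy oauth2-proxy/oauth2-proxy --version 3.2.2 --namespace myapp --values oauth2-proxy-values.yaml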

Note on client_credentials: If you want to use client_credentials with your app, you need to set up an authenticated emails file in oauth2-proxy. In that emails file, you need to include the service principal ID for the application that’s authenticating. Azure AD issues a token for applications with that service principal ID as the subject, and there’s no email.

The service principal ID can be retrieved if you have your application ID:

az ad sp show --id APPLICATION-ID-GUID --query objectId --out tsv

You’ll also need your app to request a scope when you submit a client_credentials request - use api://APPLICATION-ID-GUID/.default as the scope. (That .default scope won’t exist unless you have some scope defined, which is why you defined one earlier.)
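
In case it helps, here’s a rough sketch of what the caller’s side of a client_credentials call then looks like. The caller’s client ID and secret below are hypothetical placeholders for whatever application is calling your service - they’re separate from the app registration above:

# Get a token from Azure AD using client_credentials against the v2.0 endpoint.
$tokenResponse = Invoke-RestMethod -Method Post `
  -Uri "https://login.microsoftonline.com/TENANT-ID-GUID/oauth2/v2.0/token" `
  -Body @{
    grant_type    = "client_credentials"
    client_id     = "CALLER-APPLICATION-ID-GUID"
    client_secret = "caller-client-secret"
    scope         = "api://APPLICATION-ID-GUID/.default"
  }

# Call the app through the ingress with the bearer token.
Invoke-RestMethod -Uri "https://myapp.cluster.example.com/" `
  -Headers @{ Authorization = "Bearer $($tokenResponse.access_token)" }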

Getting back to it… Once oauth2-proxy is set up, you need to add the Istio wrappers on it.

First, let’s add that VirtualService:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  labels:
    app.kubernetes.io/name: oauth2-proxy
  name: oauth2-proxy
  namespace: myapp
spec:
  gateways:
  # We'll deploy this gateway in a moment.
  - oauth2-proxy
  hosts:
  # Full host name of the oauth2-proxy.
  - oauth.cluster.example.com
  http:
  - route:
    - destination:
        # This should line up with the Service that the
        # oauth2-proxy Helm chart deployed.
        host: oauth2-proxy
        port:
          number: 80

Now the Gateway:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  labels:
    app.kubernetes.io/name: oauth2-proxy
  name: oauth2-proxy
  namespace: myapp
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    # Same host as the one in the VirtualService, the full
    # name for oauth2-proxy.
    - oauth.cluster.example.com
    port:
      # Again, this must be unique across all ports named in
      # the Istio ingress.
      name: https-oauth-cluster-example-com
      number: 443
      protocol: HTTPS
    tls:
      # Same secret as the application - it's a wildcard cert!
      credentialName: tls-myapp-production
      mode: SIMPLE

OK, now you should be able to get something if you hit https://oauth.cluster.example.com. You’re not passing through it for authentication yet, so you’ll likely see an error along the lines of “The reply URL specified in the request does not match the reply URLs configured for the application.” The point is, it shouldn’t be some arbitrary 500 or 404 - oauth2-proxy should kick in.

Istio Token Validation - RequestAuthentication

We want Istio to do some token validation in front of our application, so we can deploy a RequestAuthentication object.

apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  labels:
    app.kubernetes.io/name: myapp
  name: myapp
  namespace: myapp
spec:
  jwtRules:
  - issuer: https://sts.windows.net/TENANT-ID-GUID/
    jwksUri: https://login.windows.net/common/discovery/keys
  selector:
    matchLabels:
      # Match labels should not select the oauth2-proxy, just
      # the application being secured.
      app.kubernetes.io/name: myapp

The Magic - Envoy Filter for Authentication

The real magic is this last step, an Istio EnvoyFilter to pass authentication requests for your app through oauth2-proxy. This is the biggest takeaway I got from Justin’s blog article and it’s really the key to the whole thing.

Envoy filter format is in flux. The object defined here is really dependent on the version of Envoy that Istio is using. This was a huge pain. I ended up finding the docs for the Envoy ExtAuthz filter and feeling my way through the exercise, but you should be aware these things do change.

Here’s the Envoy filter:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  labels:
    app.kubernetes.io/name: myapp
  name: myapp
  namespace: istio-system
spec:
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.http_connection_manager
            subFilter:
              # In Istio 1.6.4 this is the first filter. The examples showing insertion
              # after some other authorization filter or not showing where to insert
              # the filter at all didn't work for me. Istio just failed to insert the
              # filter (silently) and moved on.
              name: istio.metadata_exchange
          # The filter should catch traffic to the service/application.
          sni: myapp.cluster.example.com
    patch:
      operation: INSERT_AFTER
      value:
        name: envoy.filters.http.ext_authz
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
          http_service:
            authorizationRequest:
              allowedHeaders:
                patterns:
                - exact: accept
                - exact: authorization
                - exact: cookie
                - exact: from
                - exact: proxy-authorization
                - exact: user-agent
                - exact: x-forwarded-access-token
                - exact: x-forwarded-email
                - exact: x-forwarded-for
                - exact: x-forwarded-host
                - exact: x-forwarded-proto
                - exact: x-forwarded-user
                - prefix: x-auth-request
                - prefix: x-forwarded
            authorizationResponse:
              allowedClientHeaders:
                patterns:
                - exact: authorization
                - exact: location
                - exact: proxy-authenticate
                - exact: set-cookie
                - exact: www-authenticate
                - prefix: x-auth-request
                - prefix: x-forwarded
              allowedUpstreamHeaders:
                patterns:
                - exact: authorization
                - exact: location
                - exact: proxy-authenticate
                - exact: set-cookie
                - exact: www-authenticate
                - prefix: x-auth-request
                - prefix: x-forwarded
            server_uri:
              # URIs here should be to the oauth2-proxy service inside your
              # cluster, in the namespace where it was deployed. The port
              # in that 'cluster' line should also match up.
              cluster: outbound|80||oauth2-proxy.myapp.svc.cluster.local
              timeout: 1.5s
              uri: http://oauth2-proxy.myapp.svc.cluster.local

That’s it, you should be good to go!

Note I didn’t really mess around with trying to lock the headers down too much. This is the set I found from the blog article by Justin Gauthier and every time I tried to tweak too much, something would stop working in subtle ways.

Try It Out

With all of this in place, you should be able to hit https://myapp.cluster.example.com and the Envoy filter will redirect you through oauth2-proxy to Azure Active Directory. Signing in should get you redirected back to your application, this time authenticated.

Troubleshooting

There are a lot of great tips about troubleshooting and diving into Envoy on the Istio site. This forum post is also pretty good.

Here are a couple of spot tips that I found to be of particular interest.

Finding the Envoy Version

As noted in the EnvoyFilter section, filter formats change based on the version of Envoy that Istio is using. You can find out what version of Envoy you’re running in your Istio cluster by using:

$podname = kubectl get pod -l app=prometheus -n istio-system -o jsonpath='{$.items[0].metadata.name}'
kubectl exec -it $podname -c istio-proxy -n istio-system -- pilot-agent request GET server_info

You’ll get a lot of JSON with info about the Envoy sidecar, but the important bit is:

{
 "version": "80ad06b26b3f97606143871e16268eb036ca7dcd/1.14.3-dev/Clean/RELEASE/BoringSSL"
}

In this case, it’s 1.14.3.

Look at What Envoy is Doing

It’s hard to figure out where the Envoy configuration gets hooked up. The istioctl proxy-status command can help you.

istioctl proxy-status will yield a list like this:

NAME                                                         CDS        LDS        EDS        RDS          PILOT                       VERSION
myapp-768b999cb5-v649q.myapp                                 SYNCED     SYNCED     SYNCED     SYNCED       istiod-5cf5bd4577-frngc     1.6.4
istio-egressgateway-85b568659f-x7cwb.istio-system            SYNCED     SYNCED     SYNCED     NOT SENT     istiod-5cf5bd4577-frngc     1.6.4
istio-ingressgateway-85c67886c6-stdsf.istio-system           SYNCED     SYNCED     SYNCED     SYNCED       istiod-5cf5bd4577-frngc     1.6.4
oauth2-proxy-5655cc447d-5ftbq.myapp                          SYNCED     SYNCED     SYNCED     SYNCED       istiod-5cf5bd4577-frngc     1.6.4
redis-5f7c5b99db-tp5l7.myapp                                 SYNCED     SYNCED     SYNCED     SYNCED       istiod-5cf5bd4577-frngc     1.6.4

Once you’ve deployed, you’ll see the myapp proxy in that list as well as the Istio ingress gateway. You can dump the listener config for either one with something like:

istioctl proxy-config listeners myapp-768b999cb5-v649q.myapp -o json

Sub in the pod name as needed. It will generate a huge raft of JSON, so you might need to dump it to a file so you can scroll around in it and find what you want.

  • The application pod’s listeners will show you info about the sidecar attached to the app.
  • The ingress gateway’s listeners will show you info about ingress traffic (including your Envoy filter).
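
For example, a quick way to spot-check that the ext_authz filter actually landed in the ingress gateway config (a sketch - sub in your own ingress pod name):

# Dump the ingress gateway listener config and search for the ext_authz filter by name.
istioctl proxy-config listeners istio-ingressgateway-85c67886c6-stdsf.istio-system -o json > ingress-listeners.json
Select-String -Path ingress-listeners.json -Pattern 'ext_authz'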

When All Else Fails, Restart the Ingress

When all else fails, restart the ingress pod. kubectl rollout restart deploy/istio-ingressgateway -n istio-system can get you pretty far. When it seems like everything should be working but you’re getting errors like “network connection reset” and it doesn’t make sense… just try kicking the ingress pods. Sometimes the configuration needs to be freshly rebuilt and deployed and that’s how you do it.

I don’t know why this happens, but if you’ve deployed and undeployed some Envoy filters a couple of times… sometimes something just stops working. Restarting the ingress is the only way I’ve found to fix it… but it works!

Other Options

oauth2-proxy isn’t the only way to get this done.

I did see this authservice plugin, which appears to be an Envoy extension to provide oauth2-proxy services right in Envoy itself. Unfortunately, it doesn’t support the latest Istio versions; it requires you to manually replace the Istio sidecar with its custom version; and it doesn’t seem to support client_credentials, which is a primary use case for me.

There’s an OAuth2 filter for Envoy currently in active development (alpha) but I didn’t see that it supported OIDC. I could be wrong there. I’d love to see someone get this working inside Istio.

For older Istio there was an App Identity and Access Adapter but Mixer adapters/plugins have been deprecated in favor of WASM extensions for Envoy.

Are there others? Let me know in the comments!

maker

For Christmas last year Jenn got me a SainSmart 3018 CNC router and I’ve really been getting into it. It’s a steep learning curve, a bit more than 3D printing, but my 3D printing knowledge has helped a lot in knowing what sort of things I should look for.

I’ve been getting into it enough that I wanted to upgrade the spindle in it, and my parents got me a Makita RT0701C trim router. This is a fairly common upgrade path - replacing the stock spindle with a Makita or DeWalt router - and you’ll see it in larger setups like the Shapeoko XXL.

IMPORTANT UPDATE: After posting this article I found that, while I was successful in getting the router mounted and generally working in 2D carving (like making letters on a sign), when doing 3D carving I lost Z height a lot. After a lot of trial and error I determined that the SainSmart 3018 PRO does not have strong enough stepper motors to drive the weight of the Makita router. Later models like the 3018PROVer do have the strength, which is why you see so many folks successful with this. For me, I ended up reverting my 3018 back to the stock spindle and upgrading to a Sienci LongMill 30 x 30 which is where I’ve got my Makita router now.

There are two challenges to overcome when you upgrade the spindle to a router like this.

First, you have to figure out how to mount the router to the CNC frame. I solved this by creating a 3D printed combination holder and dust shoe, which you can get on Thingiverse.

Second, you have to change how you turn the power on and off when carving. The stock spindle is powered right off the control board. When you send your gcode to the router, one of the codes turns the spindle on, which sends power to the spindle and it gets moving. A larger trim router like this is plugged in separately and instead of the control board turning it on and off, it’s generally accepted that you have to turn the router on manually with its power switch before you start cutting. Since nothing will be attached to the actual control board power, the “turn on the spindle” command will be effectively ignored.

I… don’t like that. I’m fine if I have to adjust the speed manually on the router, but I would really like the control board to turn the router on and off as needed for the cut. Lots of people, myself included, solve this using a relay. The rest of this post shows you how to wire it up.

DISCLAIMER: You’re going to be working with electricity. Be safe. Make good connections. Don’t get your fingers in there. I’m not responsible for you burning your house down by making bad wire splices or injuring yourself from touching live electrical stuff. Respect the electricity. This isn’t much more difficult than wiring up a new lightswitch at home, but… just be careful.

Parts you’ll need:

  • One extension cord. It doesn’t have to be very long, you’re going to cut it to get the two end plugs. (Amazon)
  • One solid state relay. It should allow an input voltage of 12V DC and an output voltage matching your router (mine is a 120V AC router). I bought a relay that allows 3 - 32V DC input and 24 - 380V AC output so it’ll “just work.” (Amazon)
  • Your original spindle power cable. You’re going to cut it because you want the wires and the plastic connector that attaches to the control board. You could also make a new one, but I don’t anticipate plugging my old spindle in again.
  • Extra wire in case you want the connection between the control board and the relay to be longer.
  • A battery with some leads. The battery should be enough to trigger the relay. I chose a 9V battery which falls in that 3 - 32V DC range. You’ll use this for testing the relay wiring.
  • Something to plug in to test the relay wiring. I used a light bulb.
  • Solder and soldering iron.
  • Electrical tape.
  • Wire cutters.

Relay circuit parts

First thing we’re going to do is just make sure the relay is working. This is also helpful to understand how the control board will be turning the router on and off; and it gets your test set up.

Attach one wire to the positive input terminal of the relay and another wire to the negative terminal. Connect your battery to the wires - positive to positive, negative to negative. You should see the light go on to indicate the relay has been triggered. (If you’re using a mechanical relay, you should hear a click.) When the control board “starts the spindle” it’s going to send 12V in and trigger the relay just like the battery is doing now.

Triggering the relay

Disconnect the battery. We’re done with this part of the testing.

Cut the extension cord so you can get some wires connected to the plugs. I cut about 12 inches from each end of the cord. That left me with:

  • A male plug with about 12 inches of cord
  • A female plug with about 12 inches of cord
  • A long strand that came from the middle of the cord

You can leave more cord connected to the plugs if you want. Just make sure you leave enough that you can make a good splice and have some slack to plug in. We don’t need that strand from the middle of the cord. You can save it and do something else with it or you can throw it away.

The extension cord will have an outer insulation/wrap and three wires inside it. Each wire also has insulation around it. Likely they’ll be color coded - green is ground, black is “hot” or “active,” and white is neutral. The black and white wires are what effectively makes the circuit powering your router, so we’re going to insert the relay in the middle of one of those to act like a switch. I chose to put the relay in the middle of the white wire.

If you don’t know how to make a good wire splice I would recommend watching this quick YouTube video on how to do a linesman’s splice. You’re working with some real electricity here and a bad splice can cause all sorts of problems like burning your house down.

Splice the two green wires together so the ground is continuous. My router is a two-prong non-grounded plug so it doesn’t use ground, but having this finished is valuable for later, I think. Wrap that splice in electrical tape to make sure it’s insulated from the other wires.

Now splice the two black wires together so the “hot” path is continuous. Again, wrap that in electrical tape so it’s nice and insulated.

Finally, attach one white wire to each of the “output” terminals on the relay. Make sure there’s a good connection and that they’re screwed down nice and tight.

You should end up with something that looks like this:

Wires spliced

Test time! Now it’s time to make sure your wire splices are good, that things are wired up correctly, and so on. This is also where you’ll want to be extra careful because if you didn’t wire stuff up right, it could be bad news.

Plug in your test load (like I used a light bulb) to the female plug. Then plug in the male plug to an electrical outlet (ideally with a surge protector and/or GFCI circuit breaker for your protection). At this point, even plugged in, the test load (light) should be off. Finally… connect the battery to the input terminals of the relay just like we did in the earlier test. The relay should activate and the test load should turn on! If you remove the battery, it should turn back off.

Test your splices

Disconnect the battery, unplug the male plug from the wall, and disconnect the wires you were using to test with the battery. The last step is to get the power connector from the control board to the relay working.

If you’re going to make a brand new cable that runs from the control board to the relay, now’s the time. I didn’t do that and I’m not walking through that process.

If you are reusing the original spindle power cable like I did… Snip the metal clips off the ends of the red and black wires that used to power your old spindle. Strip a small amount of the ends of the wire and connect red to positive, black to negative on the relay. It’ll end up looking like this:

Connect the control power cable

That’s it. That’s the whole circuit. Plug this into the wall, plug your router into it, connect the control cable to the CNC control board, and then flip the router’s power switch on. If you use a gcode sender to send M3, that will turn the spindle on - you should see the light on the relay turn on and the router itself should start. If you send M5, that will turn the router back off.

I recommend putting this in a box or covering it. You don’t want the connections on the relay to get accidentally shorted. I made a quick 3D printed box for mine; you can do something similar or figure something else out. It all depends on the size of the relay and cord you bought, so it’s not one-size-fits-all. If you want to buy a box, search for “project boxes.”

All done, here’s what my setup looks like now:

The finished setup

The black box in the middle mounted to the wall contains the relay. It plugs into the power strip along the left. The red and black cables go to the control board. My Makita router plugs into the relay. (I have the cord routed up and hanging so it’s out of the way.)

I hope that helps folks get back some of their control with the upgraded router!

Note: You might be wondering how you can now automate speed control of the spindle, not just on/off. That’s not as straightforward, and there are tons of forums involving rewiring routers with variable electronic speed controls (VESC) and all sorts of other cool-but-non-trivial things. I didn’t solve this problem since setting the speed dial before the cut isn’t a huge deal, and I generally don’t change speeds a lot.

csharp, javascript

I saw a Twitter thread the other day that got me thinking:

After 8 years with #golang, I don’t understand how I enjoyed programming before. I no longer have no tolerance for excessive boilerplate, verbosity, slow builds, overabstracted APIs, lack of first-class concurrency, bulky performance tools, …

— Jaana Dogan (@rakyll) August 12, 2020

One of the responses hit home for me:

My problem with #golang is that I love it, I got stuff done fast, didn’t need to use it for a few weeks and forgot everything. I have this learning curve spin up very time I use it. I don’t have that spin up time with C#.

— Herb Stahl (@JustAHerb) August 12, 2020

And a little disclaimer…

Warning: I’m going off on a bit of a tear here, and I color a little outside the lines of the argument. I’m having a tough time trying to convey a lot of frustration I’ve had recently with Go and Spring Boot / Java, and reading about folks loving the removal of boilerplate as a feature is touching on a pain point.

My daily work is C# and TypeScript. However, I sometimes also work in Terraform and the related SDKs, which are all in Go. So… I do have to use Go. Sometimes, but not often. When I do, I, too, find that I need to relearn nearly from the ground up. I also work, very occasionally, with Java, mostly apps based on Spring. Same thing there - I see the stuff, I sorta figure out what I need to, but if I come back to it a month later, I have no idea what’s going on.

I’m trying to figure out why that is. Like, if I have to get into a Python program and figure something out or add to it, I don’t have that feeling of being instantly just “lost.” I totally feel that with Go.

I think this part hints at it: “I … have no tolerance for excessive boilerplate.” I’m curious what, exactly, that means. I can guess, having messed around with Go - the convention-over-configuration for project structure, for example.

This is a lot of what I see Spring Boot in Java trying to address. Removing boilerplate. Convention over configuration. Auto-wireup. Just make it happen - code less, get more done.

But… if you’re removing boilerplate, you have to supplement that with easy to locate, comprehensive documentation that explains how to get things done.

The command go test runs your unit tests. Cool. What are the parameters you can pass to that? Go search on go test parameters. I’ll wait. You know what the first several results are? The test package. Doesn’t tell you anything. OK, cool, here’s a hint - just run go and you can start tracing down the help stack. Eventually you’ll get to go help testflag. Hmmm. Seems like a lot of references to GOMAXPROCS in there. What does that mean? Back to searching…

Here’s another one - I wanted to turn up the log level on a Spring Boot app that was in a Docker container. This seems like it should be easy. There is no reasonable set of search terms that will get you to the point where you just see a clear explanation that setting -Dlogging.level.root=TRACE or setting LOGGING_LEVEL_ROOT in the environment is the thing you want. I’ll save you the trouble, it’s not there. It’s just layers upon layers of abstractions in an effort to remove boilerplate, but you have to know what’s being abstracted in order to understand how to work adequately with the abstraction.
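
For the record, that ends up being as simple as setting the environment variable on the container - something like this, with a made-up image name:

# LOGGING_LEVEL_ROOT maps to the logging.level.root property via Spring's relaxed binding.
docker run -e LOGGING_LEVEL_ROOT=TRACE my-registry/my-spring-boot-app:latest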

This seems to be a sort of conflicting goal in the tweet - the person doesn’t have a “tolerance for excessive boilerplate” but also doesn’t want “overabstracted APIs.” By definition, I think, removing the boilerplate necessarily implies there’s some abstraction bridging that gap.

Look at the whole C# 9 feature of “top-level programs” - that is, a program that doesn’t require a Main() method. Seems like a nice removal of boilerplate, right? Except, read that bit about args - “args is available as a ‘magic’ parameter.”

Magic.

I’m not sure writing a class and a method declaration as an explicit entry point for my program was really the stumbling block for getting things done. And what’s the debugging story? When you have to figure out where the “magic parameter” comes from… how do you do that?

I think that’s the crux of my whole issue here. I’m not a fan of just throwing away keystrokes (though you’d be hard pressed to realize that from this rant), but there’s gotta be a balance between “fewer keystrokes,” “ease of use,” and “maintainability.” If I need to spend a year learning everything that sits under Spring Boot so I can understand how to change the log levels in an app, that’s not maintainable. It might save keystrokes, it might be easy to use ‘if you know,’ but… if you don’t?

I wonder if this contributes to the polarization of people working with languages. It becomes harder and harder to be that polyglot programmer because the stack you have to know for each individual language just grows. Eventually it’s not worth trying to span all the languages because you aren’t getting anything done. So you get the C# fans, or the Go fans, or the Python fans, or the JavaScript fans, and they all love their individual languages and ecosystem, but only because they spend all day every day in there. They know the right search terms to plug in, they know the stack and why the boilerplate was removed. When they switch to something else (as it is with me), it’s a “somebody moved my cheese” situation and the tolerance for pain is far lower than the desire to just get back to being productive.