kubernetes comments edit

I have a situation that is possibly kind of niche, but it was a real challenge to figure out so I thought I’d share the solution in case it helps you.

I have a Kubernetes cluster with Istio installed. My Istio ingress gateway is connected to an Apigee API management front-end via mTLS. Requests come in to Apigee then get routed to a secured public IP address where only Apigee is authorized to connect.

Unfortunately, this results in all requests coming in with the same Host header:

  1. Client requests api.services.com/v1/resource/operation.
  2. Apigee gets that request and routes to 1.2.3.4/v1/resource/operation via the Istio ingress gateway and mTLS.
  3. An Istio VirtualService answers to hosts: "*" (any host header at all) and matches entirely on URL path - if it’s /v1/resource/operation it routes to mysvc.myns.svc.cluster.local/resource/operation.

This is how the ingress tutorial on the Istio site works, too. No hostname-per-service.

However, there are a couple of wrenches in the works, as expected:

  • There are some API endpoints on the service that aren’t exposed through Apigee. They’re internal-only operations that allow for service-to-service communications in the cluster but aren’t for outside callers.
  • I want to do canary deployments and route traffic slowly from an existing version of the service to a new, canary version. I need both the external and internal traffic routed this way to get accurate results.

The combination of these things is a problem. I can’t assume that the match-on-path-regex setting will work for internal traffic - I need any internal service to route properly based on host name. However, you also can’t match on host: "*" for internal traffic that doesn’t come through an ingress. That means I would need two different VirtualService instances - one for internal traffic, one for external.

But if I have two different VirtualService objects to manage, it means I need to keep them in sync over the canary, which kind of sucks. I’d like to set the traffic balancing in one spot and have it work for both internal and external traffic.

I asked how to do this on the Istio discussion forum and thought for a while that a VirtualService delegate would be the answer - have one VirtualService with the load balancing information, a second service for internal traffic (delegating to the load balancing service), and a third service for external traffic (delegating to the load balancing service). It’s more complex, but I’d get the ability to control traffic in one spot.

Unfortunately (the word “unfortunately” shows up a lot here, doesn’t it?), you can’t use delegates on a VirtualService that doesn’t also connect to a gateway. That is, if it’s internal/mesh traffic, you don’t get the delegate support. This issue in the Istio repo touches on that.

Here’s where I landed.

First, I updated Apigee so it takes care of two things for me:

  1. It adds a Service-Host header with the internal host name of the target service, like Service-Host: mysvc.myns.svc.cluster.local. It more tightly couples the Apigee part of things to the service internal structure, but it frees me up from having to route entirely by regex in the cluster. (You’ll see why in a second.) I did try to set the Host header directly, but Apigee overwrites this when it issues the request on the back end.
  2. It does all the path manipulation before issuing the request. If the internal service wants /v1/resource/operation to be /resource/operation, that path update happens in Apigee so the inbound request will have the right path to start.

I did the Service-Host header with an “AssignMessage” policy.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<AssignMessage async="false" continueOnError="false" enabled="true" name="Add-Service-Host-Header">
    <DisplayName>Add Service Host Header</DisplayName>
    <Set>
        <Headers>
            <Header name="Service-Host">mysvc.myns.svc.cluster.local</Header>
        </Headers>
    </Set>
    <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
    <AssignTo createNew="false" transport="http" type="request"/>
</AssignMessage>

Next, I added an Envoy filter to the Istio ingress gateway so it knows to look for the Service-Host header and update the Host header accordingly. Again, I used Service-Host because I couldn’t get Apigee to properly set Host directly. If you can figure that out and get the Host header coming in correctly the first time, you can skip the Envoy filter.

The filter needs to run first thing in the pipeline, before Istio tries to route traffic. I found that pinning it just before the istio.metadata_exchange stage got the job done.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: propagate-host-header-from-apigee
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
      app: istio-ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: "envoy.http_connection_manager"
            subFilter:
              # istio.metadata_exchange is the first filter in the connection
              # manager, at least in Istio 1.6.14.
              name: "istio.metadata_exchange"
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inline_code: |
              function envoy_on_request(request_handle)
                local service_host = request_handle:headers():get("service-host")
                if service_host ~= nil then
                  request_handle:headers():replace("host", service_host)
                end
              end

Finally, the VirtualService that handles the traffic routing needs to be tied both to the ingress and to the mesh gateway. The hosts setting can just be the internal service name, though, since that’s what the ingress will use now.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mysvc
  namespace: myns
spec:
  gateways:
    - istio-system/apigee-mtls
    - mesh
  hosts:
    - mysvc
  http:
    - route:
        - destination:
            host: mysvc-stable
          weight: 50
        - destination:
            host: mysvc-baseline
          weight: 25
        - destination:
            host: mysvc-canary
          weight: 25

Once all these things are complete, both internal and external traffic will be routed by the single VirtualService. Now I can control canary load balancing in a single location and be sure that I’m getting correct overall test results and statistics with as few moving pieces as possible.

Disclaimer: There may be reasons you don’t want to treat external traffic the same as internal, like if you have different DestinationRule settings for traffic management inside vs. outside, or if you need to pass things through different authentication filters or whatever. Everything I’m working with is super locked down so I treat internal and external traffic with the same high levels of distrust and ensure that both types of traffic are scrutinized equally. YMMV.

mac, network comments edit

I have a Mac and my user account is attached to a Windows domain. The benefit of this is actually pretty minimal in that I can change my domain password and it propagates to the local Mac user account, but that’s about it. It seems to cause more trouble than it’s worth.

I recently had an issue where something got out of sync and I couldn’t log into my Mac using my domain account. This is sort of a bunch of tips and things I did to recover that.

First, have a separate local admin account. Make it a super complex password and never use it for anything else. This is sort of your escape hatch to try to recover your regular user account. Even if you want to have a local admin account so your regular user account can stay a user and no admin… have a dedicated “escape hatch” admin account that’s separate from the “I use this sometimes for sudo purposes” admin account. I have this, and if I hadn’t, that’d have been the end of it.

It’s good to remember for a domain-joined account there are three security tokens that all need to be kept in sync: Your domain user password, your local machine OS password, and your disk encryption token. When you reboot the computer, the first password you’ll be asked for should unlock the disk encryption. Usually the token for disk encryption is tied nicely to the machine account password so you enter the one password and it both unlocks the disk and logs you in. The problem I was running into was those got out of sync. For a domain-joined account, the domain password usually is also tied to these things.

Next, keep your disk encryption recovery code handy. Store it in a password manager or something. If things get out of sync, you can use the recovery code to unlock the disk and then your OS password to log in.

For me, I was able to log in as my separate local admin account but my machine password wasn’t working unless I was connected to the domain. Only way to connect to the domain was over a VPN. That meant I needed to enable fast user switching so I could connect to the VPN under the separate local admin and then switch - without logging out - to my domain account.

Once I got to my own account I could use the Users & groups app to change my domain password and have the domain and machine accounts re-synchronized. ALWAYS ALWAYS ALWAYS USE USERS & GROUPS TO CHANGE YOUR DOMAIN ACCOUNT PASSWORD. I have not found a way otherwise to ensure everything is in sync. Don’t change it from some other workstation, don’t change it from Azure Active Directory. This is the road to ruin. Stay with Users & Groups.

The last step was that my disk encryption token wasn’t in sync - OS and domain connection was good, but I couldn’t log in after a reboot. I found the answer in a Reddit thread:

su local_admin
sysadminctl -secureTokenStatus domain_account_username
sysadminctl -secureTokenOff domain_account_username \
  -password domain_account_password \
  interactive
sysadminctl -secureTokenOn domain_account_username \
  -password domain_account_password \
  interactive

Basically, as the standalone local admin, turn off and back on again the connection to the drive encryption. This refreshes the token and gets it back in sync.

Reboot, and you should be able to log in with your domain account again.

To test it out, you may want to try changing your password from Users & Groups to see that the sync works. If you get a “password complexity” error, it could be the sign of an issue… or it could be the sign that your domain has a “you can’t change the password more than once every X days” sort of policy and since you changed it earlier you are changing it again too soon. YMMV.

And, again, always change your password from Users & Groups.

kubernetes comments edit

I have a Kubernetes 1.19.11 cluster deployed along with Istio 1.6.14. I have a central instance of Prometheus for scraping metrics, and based on the documentation, I have a manually-injected sidecar so Prometheus can make use of the Istio certificates for mTLS during scraping. Under Prometheus v2.20.1 this worked great. However, I was trying to update some of the infrastructure components to take advantage of new features and Prometheus after v2.21.0 just would not scrape.

These are my adventures in trying to debug this issue. Some of it is to remind me of what I did. Some of it is to save you some trouble if you run into the issue. Some of it is to help you see what I did so you can apply some of the techniques yourself.

TL;DR: The problem is that Prometheus v2.21.0 disabled HTTP/2 and that needs to be re-enabled for things to work. There should be a Prometheus release soon that allows you to re-enable HTTP/2 with environment variables.

I created a repro repository with a minimal amount of setup to show how things work. It can get you from a bare Kubernetes cluster up to Istio 1.6.14 and Prometheus using the same values I am. You’ll have to supply your own microservice/app to demonstrate scraping, but the prometheus-example-app may be a start.

I deploy Prometheus using the Helm chart. As part of that, I have an Istio sidecar manually injected just like they do in the official 1.6 Istio release manifests. By doing this, the sidecar will download and share the certificates but it won’t proxy any of the Prometheus traffic.

I then have a Prometheus scrape configuration that uses the certificates mounted in the container. If it finds a pod that has the Istio sidecar annotations (indicating it’s got the sidecar injected), it’ll use the certificates for authentication and communication.

- job_name: "kubernetes-pods-istio-secure"
  scheme: https
  tls_config:
    ca_file: /etc/istio-certs/root-cert.pem
    cert_file: /etc/istio-certs/cert-chain.pem
    key_file: /etc/istio-certs/key.pem
    insecure_skip_verify: true

If I deploy Prometheus v2.20.1, I see that my services are being scraped by the kubernetes-pods-istio-secure job, they’re using HTTPS, and everything is good to go. Under v2.20.1, I see the error connection reset by peer. I tried asking about this in the Prometheus newsgroup to no avail, so… I dove in.

My first step was to update the Helm chart extraArgs to turn on Prometheus debug logging.

extraArgs:
  log.level: debug

I was hoping to see more information about what was happening. Unfortunately, I got basically the same thing.

level=debug ts=2021-07-06T20:58:32.984Z caller=scrape.go:1236 component="scrape manager" scrape_pool=kubernetes-pods-istio-secure target=https://10.244.3.10:9102/metrics msg="Scrape failed" err="Get \"https://10.244.3.10:9102/metrics\": read tcp 10.244.4.89:36666->10.244.3.10:9102: read: connection reset by peer"

This got me thinking one of two things may have happened in v2.21.0:

  • Something changed in Prometheus; OR
  • Something changed in the OS configuration of the Prometheus container

I had recently fought with a dotnet CLI problem where certain TLS cipher suites were disabled by default and some OS configuration settings on our build agents affected what was seen as allowed vs. not allowed. This was stuck in my mind so I couldn’t immediately rule out the container OS configuration.

To validate the OS issue I was going to try using curl and/or openssl to connect to the microservice and see what the cipher suites were. Did I need an Istio upgrade? Was there some configuration setting I was missing? Unfortunately, it turns out the Prometheus Docker image is based on a custom busybox image where there are no package managers or tools. I mean, this is actually a very good thing from a security perspective but it’s a pain for debugging.

What I ended up doing was getting a recent Ubuntu image and connecting using that, just to see. I figured if there was anything obvious going on that I could take the extra steps of creating a custom Prometheus image with curl and openssl to investigate further. I mounted a manual sidecar just like I did for Prometheus so I could get to the certificates without proxying traffic, then I ran some commands:

curl https://10.244.3.10:9102/metrics \
  --cacert /etc/istio-certs/root-cert.pem \
  --cert /etc/istio-certs/cert-chain.pem \
  --key /etc/istio-certs/key.pem \
  --insecure

openssl s_client \
  -connect 10.244.3.10:9102 \
  -cert /etc/istio-certs/cert-chain.pem  \
  -key /etc/istio-certs/key.pem \
  -CAfile /etc/istio-certs/root-cert.pem \
  -alpn "istio"

Here’s some example output from curl to show what I was seeing:

root@sleep-5f98748557-s4wh5:/# curl https://10.244.3.10:9102/metrics --cacert /etc/istio-certs/root-cert.pem --cert /etc/istio-certs/cert-chain.pem --key /etc/istio-certs/key.pem --insecure -v
*   Trying 10.244.3.10:9102...
* TCP_NODELAY set
* Connected to 10.244.3.10 (10.244.3.10) port 9102 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/istio-certs/root-cert.pem
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: [NONE]
*  start date: Jul  7 20:21:33 2021 GMT
*  expire date: Jul  8 20:21:33 2021 GMT
*  issuer: O=cluster.local
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x564d80d81e10)
> GET /metrics HTTP/2
> Host: 10.244.3.10:9102
> user-agent: curl/7.68.0
> accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 2147483647)!
< HTTP/2 200

A few things in particular:

  1. I found the --alpn "istio" thing for openssl while looking through Istio issues to see if there were any pointers there. It’s always good to read through issues lists to get ideas and see if other folks are running into the same problems.
  2. Both openssl and curl were able to connect to the microservice using the certificates from Istio.
  3. The cipher suite shown in the openssl output was one that was considered “recommended.” I forgot to capture that output for the blog article, sorry about that.

At this point I went to the release notes for Prometheus v2.21.0 to see what had changed. I noticed two things that I thought may affect my situation:

  1. This release is built with Go 1.15, which deprecates X.509 CommonName in TLS certificates validation.
  2. [CHANGE] Disable HTTP/2 because of concerns with the Go HTTP/2 client. #7588 #7701

I did see in that curl output that it was using HTTP/2 but… is it required? Unclear. However, looking at the Go docs about the X.509 CommonName thing, that’s easy enough to test. I just needed to add an environment variable to the Helm chart for Prometheus:

env:
  - name: GODEBUG
    value: x509ignoreCN=0

After redeploying… it didn’t fix anything. That wasn’t the problem. That left the HTTP/2 thing. However, what I found was it’s hardcoded off, not disabled through some configuration mechanism so there isn’t a way to just turn it back on to test. The only way to test it is to do a fully custom build.

The Prometheus build for a Docker image is really complicated. They have this custom build tool promu that runs the build in a custom build container and all this is baked into layers of make and yarn and such. As it turns out, not all of it happens in the container, either, because if you try to build on a Mac you’ll get an error like this:

... [truncated huge list of downloads] ...
go: downloading github.com/PuerkitoBio/urlesc v0.0.0-20170810143723-de5bf2ad4578
go: downloading github.com/Azure/go-autorest/autorest/validation v0.3.1
go: downloading github.com/Azure/go-autorest/autorest/to v0.4.0
go build github.com/aws/aws-sdk-go/service/ec2: /usr/local/go/pkg/tool/linux_amd64/compile: signal: killed
!! command failed: build -o .build/linux-amd64/prometheus -ldflags -X github.com/prometheus/common/version.Version=2.28.1 -X github.com/prometheus/common/version.Revision=b0944590a1c9a6b35dc5a696869f75f422b107a1 -X github.com/prometheus/common/version.Branch=HEAD -X github.com/prometheus/common/version.BuildUser=root@76a91e410d00 -X github.com/prometheus/common/version.BuildDate=20210709-14:47:03  -extldflags '-static' -a -tags netgo,builtinassets github.com/prometheus/prometheus/cmd/prometheus: exit status 1
make: *** [Makefile.common:227: common-build] Error 1
!! The base builder docker image exited unexpectedly: exit status 2

You can only build on Linux even though it’s happening in a container. At least right now. Maybe that’ll change in the future. Anyway, this meant I needed to create a Linux VM and set up an environment there that could build Prometheus… or figure out how to force a build system to do it, say by creating a fake PR to the Prometheus project. I went the Linux VM route.

I changed the two lines where the HTTP/2 was disabled, I pushed that to a temporary Docker Hub location, and I got it deployed in my cluster.

Success! Once HTTP/2 was re-enabled, Prometheus was able to scrape my Istio pods again.

I worked through this all with the Prometheus team and they were able to replicate the issue using my repro repo. They are now working through how to re-enable HTTP/2 using environment variables or configuration.

All of this took close to a week to get through.

It’s easy to read these blog articles and think the writer just blasted through all this and it was all super easy, that I already knew the steps I was going to take and flew through it. I didn’t. There was a lot of reading issues. There was a lot of trying things and then retrying those same things because I forgot what I’d just tried, or maybe I discovered I forgot to change a configuration value. I totally deleted and re-created my test Kubernetes cluster like five times because I also tried updating Istio and… well, you can’t really “roll back Istio.” It got messy. Not to mention, debugging things at the protocol level is a spectacular combination of “not interesting” and “not my strong suit.”

My point is, don’t give up. Pushing through these things and reading and banging your head on it is how you get the experience so that next time you will have been through it.

azure, kubernetes, linux, spinnaker comments edit

Kayenta is the subcomponent of Spinnaker that handles automated canary analysis during a deployment. It reads from your metric sources and compares the stats from an existing deployed service against a new version of the service to see if there are anomalies or problems, indicating the rollout should be aborted if the new service fails to meet specified tolerances.

I’m a huge fan of Spinnaker, but sometimes you already have a full CI/CD system in place and you really don’t want to replace all of that with Spinnaker. You really just want the canary part of Spinnaker. Luckily, you can totally use Kayenta as a standalone service. They even have some light documentation on it!

In my specific case, I also want to use Azure Storage as the place where I store the data for Kayenta - canary configuration, that sort of thing. It’s totally possible to do that, but, at least at the time of this writing, the hal config canary Halyard command does not have Azure listed and the docs don’t cover it.

So there are a couple of things that come together here, and maybe all of it’s interesting to you or maybe only one piece. In any case, here’s what we’re going to build:

Standalone Kayenta diagram

  • A Kubernetes ingress to allow access to Kayenta from your CI/CD pipeline.
  • A deployment of the Kayenta microservice.
  • Kayenta configured to use an Azure Storage Account to hold its configuration and such.

Things I’m not going to cover:

  • How exactly your CI/CD canary stage needs to work.
  • How long a canary stage should last.
  • How exactly you should configure Kayenta (other than the Azure part).
  • Which statistics you should monitor for your services to determine if they “pass” or “fail.”
  • Securing the Kayenta ingress so only authenticated/authorized access is allowed.

This stuff is hard and it gets pretty deep pretty quickly. I can’t cover it all in one go. I don’t honestly have answers to all of it anyway, since a lot of it depends on how your build pipeline is set up, how your app is set up, and what your app does. There’s no “one-size-fits-all.”

Let’s do it.

Deployment

First, provision an Azure Storage account. Make sure you enable HTTP access because right now Kayenta requires HTTP and not HTTPS.

You also need to provision a container in the Azure Storage account to hold the Kayenta contents.

# I love me some PowerShell, so examples/scripts will be PowerShell.
# Swap in your preferred names as needed.
$ResourceGroup = "myresourcegroup"
$StorageAccountName = "kayentastorage"
$StorageContainerName = "kayenta"
$Location = "westus2"

# Create the storage account with HTTP enabled.
az storage account create `
  --name $StorageAccountName `
  --resoure-group $ResourceGroup `
  --location $Location `
  --https-only false `
  --sku Standard_GRS

# Get the storage key so you can create a container.
$StorageKey = az storage account keys list `
  --account-name $StorageAccountName `
  --query '[0].value' `
  -o tsv

# Create the container that will hold Kayenta stuff.
az storage container create `
  --name $StorageContainerName `
  --account-name $StorageAccountName `
  --account-key $StorageKey

Let’s make a namespace in Kubernetes for Kayenta so we can put everything we’re deploying in there.

# We'll use the namespace a lot, so a variable
# for that in our scripting will help.
$Namespace = "kayenta"
kubectl create namespace $Namespace

Kayenta needs Redis. We can use the Helm chart to deploy a simple Redis instance. Redis must not be in clustered mode, and there’s no option for providing credentials.

helm repo add bitnami https://charts.bitnami.com/bitnami

# The name of the deployment will dictate the name of the
# Redis master service that gets deployed. In this example,
# 'kayenta-redis' as the deployment name will create a
# 'kayenta-redis-master' service. We'll need that later for
# Kayenta configuration.
helm install kayenta-redis bitnami/redis `
  -n $Namespace `
  --set cluster.enabled=false `
  --set usePassword=false `
  --set master.persistence.enabled=false

Now let’s get Kayenta configured. This is a full, commented version of a Kayenta configuration file. There’s also a little doc on Kayenta configuration that might help. What we’re going to do here is put the kayenta.yml configuration into a Kubernetes ConfigMap so it can be used in our service.

Here’s a ConfigMap YAML file based on the fully commented version, but with the extra stuff taken out. This is also where you’ll configure the location of Prometheus (or whatever) where Kayenta will read stats. For this example, I’m using Prometheus with some basic placeholder config.

apiVersion: v1
kind: ConfigMap
metadata:
  name: kayenta
  namespace: kayenta
data:
  kayenta.yml: |-
    server:
      port: 8090

    # This should match the name of the master service from when
    # you deployed the Redis Helm chart earlier.
    redis:
      connection: redis://kayenta-redis-master:6379

    kayenta:
      atlas:
        enabled: false

      google:
        enabled: false

    # This is the big one! Here's where you configure your Azure Storage
    # account and container details.
      azure:
        enabled: true
        accounts:
          - name: canary-storage
            storageAccountName: kayentastorage
            # azure.storageKey is provided via environment AZURE_STORAGEKEY
            # so it can be stored in a secret. You'll see that in a bit.
            # Don't check in credentials!
            accountAccessKey: ${azure.storageKey}
            container: kayenta
            rootFolder: kayenta
            endpointSuffix: core.windows.net
            supportedTypes:
              - OBJECT_STORE
              - CONFIGURATION_STORE

      aws:
        enabled: false

      datadog:
        enabled: false

      graphite:
        enabled: false

      newrelic:
        enabled: false

    # Configure your Prometheus here. Or if you're using something else, disable
    # Prometheus and configure your own metrics store. The important part is you
    # MUST have a metrics store configured!
      prometheus:
        enabled: true
        accounts:
        - name: canary-prometheus
          endpoint:
            baseUrl: http://prometheus:9090
          supportedTypes:
            - METRICS_STORE

      signalfx:
        enabled: true

      wavefront:
        enabled: false

      gcs:
        enabled: false

      blobs:
        enabled: true

      s3:
        enabled: false

      stackdriver:
        enabled: false

      memory:
        enabled: false

      configbin:
        enabled: false

      remoteJudge:
        enabled: false

    # Enable the SCAPE endpoint that has the same user experience that the Canary StageExecution in Deck/Orca has.
    # By default this is disabled - in standalone we enable it!
      standaloneCanaryAnalysis:
        enabled: true

      metrics:
        retry:
          series: SERVER_ERROR
          statuses: REQUEST_TIMEOUT, TOO_MANY_REQUESTS
          attempts: 10
          backoffPeriodMultiplierMs: 1000

      serialization:
        writeDatesAsTimestamps: false
        writeDurationsAsTimestamps: false

    management.endpoints.web.exposure.include: '*'
    management.endpoint.health.show-details: always

    keiko:
      queue:
        redis:
          queueName: kayenta.keiko.queue
          deadLetterQueueName: kayenta.keiko.queue.deadLetters

    spectator:
      applicationName: ${spring.application.name}
      webEndpoint:
        enabled: true

    swagger:
      enabled: true
      title: Kayenta API
      description:
      contact:
      patterns:
        - /admin.*
        - /canary.*
        - /canaryConfig.*
        - /canaryJudgeResult.*
        - /credentials.*
        - /fetch.*
        - /health
        - /judges.*
        - /metadata.*
        - /metricSetList.*
        - /metricSetPairList.*
        - /metricServices.*
        - /pipeline.*
        - /standalone.*

Save that and deploy it to the cluster.

kubectl apply -f kayenta-configmap.yml

You’ll notice in the config we just put down that we did not include the Azure Storage acccount key. Assuming we want to commit that YAML to a source control system at some point, we definitely don’t want credentials in there. Instead, let’s use a Kubernetes secret for the Azure Storage account key.

# Remember earlier we got the storage account key for creating
# the container? We're going to use that again.
kubectl create secret generic azure-storage `
  -n $Namespace `
  --from-literal=storage-key="$StorageKey"

It’s deployment time! Let’s get a Kayenta container into the cluster! Obviously you can tweak all the tolerances and affinities and node selectors and all that to your heart’s content. I’m keeping the example simple.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kayenta
  namespace: kayenta
  labels:
    app.kubernetes.io/name: kayenta
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kayenta
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kayenta
    spec:
      containers:
        - name: kayenta
          # Find the list of tags here: https://console.cloud.google.com/gcr/images/spinnaker-marketplace/GLOBAL/kayenta?gcrImageListsize=30
          # This is just the tag I've been using for a while. I use one of the images NOT tagged
          # with Spinnaker because the Spinnaker releases are far slower.
          image: "gcr.io/spinnaker-marketplace/kayenta:0.17.0-20200803200017"
          env:
            # If you need to troubleshoot, you can set the logging level by adding
            # -Dlogging.level.root=TRACE
            # Without the log at DEBUG level, very little logging comes out at all and
            # it's really hard to see if something goes wrong. If you don't want that
            # much logging, go ahead and remove the log level option here.
            - name: JAVA_OPTS
              value: "-XX:+UnlockExperimentalVMOptions -Dlogging.level.root=DEBUG"
            # We can store secrets outside config and provide them via the environment.
            # Insert them into the config file using ${dot.delimited} versions of the
            # variables, like ${azure.storageKey} which we saw in the ConfigMap.
            - name: AZURE_STORAGEKEY
              valueFrom:
                secretKeyRef:
                  name: azure-storage
                  key: storage-key
          ports:
            - name: http
              containerPort: 8090
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: http
          readinessProbe:
            httpGet:
              path: /health
              port: http
          volumeMounts:
            - name: config-volume
              mountPath: /opt/kayenta/config
      volumes:
        - name: config-volume
          configMap:
            name: kayenta

And let’s save and apply.

kubectl apply -f kayenta-deployment.yml

If you have everything wired up right, the Kayenta instance should start. But we want to see something happen, right? Without kubectl port-forward?

Let’s put a LoadBalancer service in here so we can access it. I’m going to show the simplest Kubernetes LoadBalancer here, but in your situation you might have, say, an nginx ingress in play or something else. You’ll have to adjust as needed.

apiVersion: v1
kind: Service
metadata:
  name: kayenta
  namespace: kayenta
  labels:
    app.kubernetes.io/name: kayenta
spec:
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: kayenta
  type: LoadBalancer

Let’s see it do something. You should be able to get the public IP address for that LoadBalancer service by doing:

kubectl get service/kayenta -n $Namespace

You’ll see something like this:

NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)    AGE
kayenta      LoadBalancer   10.3.245.137   104.198.205.71   80/TCP     54s

Take note of that external IP and you can visit the Swagger docs in a browser: http://104.198.205.71/swagger-ui.html

If it’s all wired up, you should get some Swagger docs!

The first operation you should try is under credentials-controller - GET /credentials. This will tell you what metrics and object stores Kayenta thinks it’s talking to. The result should look something like this:

[
  {
    "name": "canary-prometheus",
    "supportedTypes": [
      "METRICS_STORE"
    ],
    "endpoint": {
      "baseUrl": "http://prometheus"
    },
    "type": "prometheus",
    "locations": [],
    "recommendedLocations": []
  },
  {
    "name": "canary-storage",
    "supportedTypes": [
      "OBJECT_STORE",
      "CONFIGURATION_STORE"
    ],
    "rootFolder": "kayenta",
    "type": "azure",
    "locations": [],
    "recommendedLocations": []
  }
]

If you are missing the canary-storage account pointing to azure - that means Kayenta can’t access the storage account or it’s otherwise misconfigured. I found the biggest gotcha here was that it’s HTTP-only and that’s not the default for a storage account if you create it through the Azure portal. You have to turn that on.

Troubleshooting

What do you do if you can’t figure out why Kayenta isn’t connecting to stuff?

Up in the Kubernetes deployment, you’ll see the logging is set up at the DEBUG level. The logging is pretty good at this level. You can use kubectl logs to get the logs from the Kayenta pods or, better, use stern for that Those logs are going to be your secret. You’ll see errors that pretty clearly indicate whether there’s a DNS problem or a bad password or something similar.

If you still aren’t getting enough info, turn the log level up to TRACE. It can get noisy, but you’ll only need it for troubleshooting.

Next Steps

There’s a lot you can do from here.

Canary configuration: Actually configuring a canary is hard. For me, it took deploying a full Spinnaker instance and doing some canary stuff to figure it out. There’s a bit more doc on it now, but it’s definitely tricky. Here’s a pretty basic configuration where we just look for errors by ASP.NET microservice controller. No, I can not help or support you in configuring a canary. I’ll give you this example with no warranties, expressed or implied.

{
  "canaryConfig": {
    "applications": [
      "app"
    ],
    "classifier": {
      "groupWeights": {
        "StatusCodes": 100
      },
      "scoreThresholds": {
        "marginal": 75,
        "pass": 75
      }
    },
    "configVersion": "1",
    "description": "App Canary Configuration",
    "judge": {
      "judgeConfigurations": {
      },
      "name": "NetflixACAJudge-v1.0"
    },
    "metrics": [
      {
        "analysisConfigurations": {
          "canary": {
            "direction": "increase",
            "nanStrategy": "replace"
          }
        },
        "groups": [
          "StatusCodes"
        ],
        "name": "Errors By Controller",
        "query": {
          "customInlineTemplate": "PromQL:sum(increase(http_requests_received_total{app='my-app',azure_pipelines_version='${location}',code=~'5\\\\d\\\\d|4\\\\d\\\\d'}[120m])) by (action)",
          "scopeName": "default",
          "serviceType": "prometheus",
          "type": "prometheus"
        },
        "scopeName": "default"
      }
    ],
    "name": "app-config",
    "templates": {
    }
  },
  "executionRequest": {
    "scopes": {
      "default": {
        "controlScope": {
          "end": "2020-11-20T23:01:09.3NZ",
          "location": "baseline",
          "scope": "control",
          "start": "2020-11-20T21:01:09.3NZ",
          "step": 2
        },
        "experimentScope": {
          "end": "2020-11-20T23:01:09.3NZ",
          "location": "canary",
          "scope": "experiment",
          "start": "2020-11-20T21:01:09.3NZ",
          "step": 2
        }
      }
    },
    "siteLocal": {
    },
    "thresholds": {
      "marginal": 75,
      "pass": 95
    }
  }
}

Integrate with your CI/CD pipeline: Your deployment is going to need to know how to track the currently deployed vs. new/canary deployment. Statistics are going to need to be tracked that way, too. (That’s the same as if you were using Spinnaker.) I’ve been using the KubernetesManifest@0 task in Azure DevOps, setting trafficSplitMethod: smi and making use of the canary control there. A shell script polls Kayenta to see how the analysis is going.

How you do this for your template is very subjective. Pipelines at this level are really complex. I’d recommend working with Postman or some other HTTP debugging tool to get things working before trying to automate it.

Secure it!: You probably don’t want public anonymous access to the Kayenta API. I locked mine down with oauth2-proxy and Istio but you could do it with nginx ingress and oauth2-proxy or some other mechanism.

Put a UI on it!: As you can see, configuring Kayenta canaries without a UI is actually pretty hard. Nike has a UI for standalone Kayenta called “Referee”. At the time of this writing there’s no Docker container for it so it’s not as easy to deploy as you might like. However, there is a Dockerfile gist that might be helpful. I have not personally got this working, but it’s on my list of things to do.

Huge props to my buddy Chris who figured a lot of this out, especially the canary configuration and Azure DevOps integration pieces.

halloween, maker, costumes comments edit

Due to the COVID-19 pandemic, we didn’t end up doing our usual hand-out-candy-and-count-kids thing. However, we did make costumes. How could we not? Something had to stay normal.

My daughter Phoenix, who is now nine, is obsessed with Hamilton. I think she listens to it at least once daily. Given that, she insisted that we do Hamilton costumes. I was to be A. Ham, Jenn as Eliza, and Phoenix as their daughter also named Eliza.

I was able to put Phoe’s costume together in two or three days. We used a pretty standard McCall’s pattern with decent instructions and not much complexity.

For mine… I had to do some pretty custom work. I started with these patterns:

It took me a couple of months to get things right. They didn’t really have 6’2” fat guys in the Revolutionary War so there was a lot of adjustment, especially to the coat, to get things to fit. I made muslin versions of everything probably twice, maybe three times for the coat to get the fit right.

I had a really challenging time figuring out how the shoulders on the coat went together. The instructions on the pattern are fairly vague and not what I’m used to with more commercial patterns. This tutorial article makes a similar coat and helped a lot in figuring out how things worked. It’s worth checking out.

Modifications I had to make:

  • Coat:
    • Arms lengthened.
    • Arm holes bigger around.
    • Arms bigger around.
    • Body lengthened.
    • Lapels trimmed to be square at the top (more like the stage production).
    • Didn’t put on the shoulder buttons or the back ribbons the pattern called for.
    • Set buttonholes 2” long and 2.5” apart (roughly - they don’t specify this in the pattern but other research turned these measurements up and it worked out).
    • Six buttonholes around each cuff instead of three (more like the stage production).
  • Pants:
    • Cut off slightly below the knee.
    • Taper to fit tightly around the bottom below the knee.
    • Add a “flap” on each side at the bottom of each leg to look like there are buttons holding it together.
    • Finish the bottom of each leg with a band.
  • Vest: Lengthen the body.

I didn’t have to modify the shirt. The shirt is already intentionally big and baggy because that’s how shirts were back then, so there was a lot of play.

The pants were more like… I didn’t have a decent pattern that actually looked like Revolutionary War pants so I took some decent costume pants and just modded them up. They didn’t have button fly pants back then and my pants have that, but I also wasn’t interested in drop-front pants or whatever other pants I’d have ended up with. I do need to get around in these things.

I didn’t keep a cost tally this time and it’s probably good I didn’t. There are well over 50 buttons on this thing and buttons are not cheap. I bought good wool for the whole thing at like $25/yard (average) and there are a good six-to-eight yards in here. I went through a whole bolt of 60” muslin betwee my costume and the rest of our costumes. I can’t possibly have come out under $300.

But they turned out great!

Here’s my costume close up on a dress form:

Front view of Hamilton

Three-quarters view of Hamilton

And the costume in action:

Travis as Alexander Hamilton

Here’s the whole family! I think they turned out nicely.

The whole Hamilton family!

Work! Work!

We're looking for a mind at work!