Managing Workflow

Learn how to manage and maintain your Drycc Workflow deployment.

1 - Tuning Component Settings

Helm Charts are a set of Kubernetes manifests that reflect best practices for deploying an application or service on Kubernetes.

After you add the Drycc Chart Repository, you can customize the chart using helm inspect values drycc/workflow > values.yaml before using helm install to complete the installation.

There are a few ways to customize the respective component:

  • If the value is exposed in the values.yaml file derived above, modify that component's section to tune the setting. The modified value(s) then take effect at chart installation or release upgrade time via either of the two respective commands:

     $ helm install drycc oci://registry.drycc.cc/charts/workflow \
         --namespace drycc \
         -f values.yaml
     $ helm upgrade drycc oci://registry.drycc.cc/charts/workflow \
         --namespace drycc \
         -f values.yaml
    
  • If the value hasn’t yet been exposed in the values.yaml file, one may edit the component deployment with the tuned setting. Here we edit the drycc-controller deployment:

     $ kubectl --namespace drycc edit deployment drycc-controller
    

    Add/edit the setting via the appropriate environment variable and value under the env section and save. The updated deployment will recreate the component pod with the new/modified setting.

  • Lastly, one may also fetch and edit the chart as served by version control/the chart repository itself:

     $ helm fetch oci://registry.drycc.cc/charts/workflow --untar
     $ $EDITOR workflow/charts/controller/templates/controller-deployment.yaml
    

    Then run helm install drycc ./workflow --namespace drycc to apply the changes, or helm upgrade drycc ./workflow if the cluster is already running.

Setting Resource Limits

You can set resource limits for Workflow components by modifying the values.yaml file fetched earlier. This file has a section for each Workflow component. To set limits for any Workflow component, add a resources stanza in that component's section and set the values appropriately.

Below is an example of how the builder section of values.yaml might look with CPU and memory limits set:

builder:
  imageOrg: "drycc"
  imagePullPolicy: "Always"
  imageTag: "canary"
  resources:
    limits:
      cpu: 1000m
      memory: 2048Mi
    requests:
      cpu: 500m
      memory: 1024Mi

Customizing the Builder

The following environment variables are tunable for the Builder component:

Setting Description
DEBUG Enable debug log output (default: false)
BUILDER_POD_NODE_SELECTOR A node selector setting for builder jobs. Because a build may consume a lot of node resources, you may want builder jobs to run only on specific nodes so that they don’t affect critical nodes. For example: pool:testing,disk:magnetic
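
If you need to pin builder jobs to a dedicated node pool, the setting can be applied directly to the running component. A minimal sketch, assuming the Builder runs as a deployment named drycc-builder in the drycc namespace:

$ kubectl --namespace drycc set env deployment/drycc-builder \
    BUILDER_POD_NODE_SELECTOR="pool:testing,disk:magnetic"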

Customizing the Controller

The following environment variables are tunable for the Controller component:

Setting Description
REGISTRATION_MODE set registration to “enabled”, “disabled”, or “admin_only” (default: “admin_only”)
GUNICORN_WORKERS number of gunicorn workers spawned to process requests (default: CPU cores * 4 + 1)
RESERVED_NAMES a comma-separated list of names which applications cannot reserve for routing (default: “drycc, drycc-builder”)
DRYCC_DEPLOY_HOOK_URLS a comma-separated list of URLs to send deploy hooks to.
DRYCC_DEPLOY_HOOK_SECRET_KEY a private key used to compute the HMAC signature for deploy hooks.
DRYCC_DEPLOY_REJECT_IF_PROCFILE_MISSING rejects a deploy if the previous build had a Procfile but the current deploy is missing it. A 409 is thrown in the API. Prevents accidental process types removal. (default: “false”, allowed values: “true”, “false”)
DRYCC_DEPLOY_PROCFILE_MISSING_REMOVE when turned on (the default), any process type missing from the Procfile compared to the previous deploy is removed. When set to false, an empty Procfile is allowed through without removing missing process types; note that new images, configs and so on will still be updated on all process types. (default: “true”, allowed values: “true”, “false”)
DRYCC_DEFAULT_CONFIG_TAGS set tags for all applications by default, for example: ‘{“role”: “worker”}’. (default: ‘’)
KUBERNETES_NAMESPACE_DEFAULT_QUOTA_SPEC set a resource quota on each application namespace by providing a ResourceQuota spec, for example: {"spec":{"hard":{"pods":"10"}}}, which restricts the app owner to at most 10 pods (default: “”, meaning no quota is applied to the namespace)
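
As with the Builder, these settings can be applied by editing the deployment or, as a quick sketch, with kubectl set env (this assumes the Controller deployment is named drycc-controller, as shown earlier):

$ kubectl --namespace drycc set env deployment/drycc-controller \
    REGISTRATION_MODE="admin_only" \
    KUBERNETES_NAMESPACE_DEFAULT_QUOTA_SPEC='{"spec":{"hard":{"pods":"10"}}}'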

LDAP authentication settings

Configuration options for LDAP authentication are detailed here.

The following environment variables are available for enabling LDAP authentication of user accounts in the Passport component:

Setting Description
LDAP_ENDPOINT The URI of the LDAP server. If not specified, LDAP authentication is not enabled (default: “”, example: ldap://hostname).
LDAP_BIND_DN The distinguished name to use when binding to the LDAP server (default: “”)
LDAP_BIND_PASSWORD The password to use with LDAP_BIND_DN (default: “”)
LDAP_USER_BASEDN The distinguished name of the search base for user names (default: “”)
LDAP_USER_FILTER The name of the login field in the users search base (default: “username”)
LDAP_GROUP_BASEDN The distinguished name of the search base for user’s groups names (default: “”)
LDAP_GROUP_FILTER The filter for user’s groups (default: “”, example: objectClass=person)
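
A sketch of enabling LDAP lookups; the deployment name drycc-passport and all of the LDAP values below are placeholders for your own directory:

$ kubectl --namespace drycc set env deployment/drycc-passport \
    LDAP_ENDPOINT="ldap://ldap.example.com" \
    LDAP_BIND_DN="cn=admin,dc=example,dc=com" \
    LDAP_BIND_PASSWORD="********" \
    LDAP_USER_BASEDN="ou=people,dc=example,dc=com" \
    LDAP_USER_FILTER="uid" \
    LDAP_GROUP_BASEDN="ou=groups,dc=example,dc=com" \
    LDAP_GROUP_FILTER="objectClass=person"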

Global and per application settings

Setting Description
DRYCC_DEPLOY_BATCHES the number of pods to bring up and take down sequentially during a scale (default: number of available nodes)
DRYCC_DEPLOY_TIMEOUT deploy timeout in seconds per deploy batch (default: 120)
IMAGE_PULL_POLICY the Kubernetes image pull policy for application images (default: “IfNotPresent”) (allowed values: “Always”, “IfNotPresent”)
KUBERNETES_DEPLOYMENTS_REVISION_HISTORY_LIMIT how many revisions Kubernetes keeps around for a given Deployment (default: all revisions)
KUBERNETES_POD_TERMINATION_GRACE_PERIOD_SECONDS how many seconds Kubernetes waits for a pod to finish work after a SIGTERM before sending SIGKILL (default: 30)

See the Deploying Apps guide for more detailed information on those.
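
These variables set the cluster-wide defaults on the Controller; per-application behaviour is covered in the Deploying Apps guide. A sketch of changing the global defaults, again assuming the drycc-controller deployment:

$ kubectl --namespace drycc set env deployment/drycc-controller \
    DRYCC_DEPLOY_BATCHES=2 \
    DRYCC_DEPLOY_TIMEOUT=240 \
    IMAGE_PULL_POLICY="IfNotPresent"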

Customizing the Database

The following environment variables are tunable for the Database component:

Setting Description
BACKUP_FREQUENCY how often the database should perform a base backup (default: “12h”)
BACKUPS_TO_RETAIN number of base backups the backing store should retain (default: 5)
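
For example, to take a base backup every six hours and keep the last ten, the variables can be set on the Database workload. The workload name and kind below are assumptions; check kubectl --namespace drycc get all for the exact resource in your install:

$ kubectl --namespace drycc set env statefulset/drycc-database \
    BACKUP_FREQUENCY="6h" \
    BACKUPS_TO_RETAIN=10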

Customizing Fluentbit

The following values can be changed in the values.yaml file or by using the --values flag with the Helm CLI.

Key Description
config.service The service section defines the global properties of the service.
config.inputs An input section defines a source (related to an input plugin).
config.filters A filter section defines a filter (related to a filter plugin)
config.outputs The outputs section specifies a destination that certain records should follow after a Tag match.

For more information about the various variables that can be set, please see the Fluent Bit documentation.
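
As a sketch, an extra output can be supplied through a small values file; whether these keys nest under a fluentbit: block in the umbrella chart, and whether they take classic-mode text blocks as shown, are assumptions to verify against helm inspect values:

$ cat > fluentbit-values.yaml <<'EOF'
fluentbit:
  config:
    # note: overriding config.outputs replaces the chart's default outputs
    outputs: |
      [OUTPUT]
          Name   stdout
          Match  *
EOF
$ helm upgrade drycc oci://registry.drycc.cc/charts/workflow \
    --namespace drycc \
    -f values.yaml \
    -f fluentbit-values.yaml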

Customizing the Monitor

Grafana

We have exposed some of the more useful configuration values directly in the chart. This allows them to be set using either the values.yaml file or by using the --set flag with the Helm CLI. You can see these options below:

Setting Default Value Description
user “admin” The first user created in the database (this user has admin privileges)
password “admin” Password for the first user.
allow_sign_up “true” Allows users to sign up for an account.

For a list of other options you can set by using environment variables please see the configuration file in GitHub.
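
A sketch of overriding these values at install or upgrade time with --set; the grafana.* key paths follow the pattern used elsewhere in this chart (for example grafana.enabled) but should be confirmed against helm inspect values:

$ helm upgrade drycc oci://registry.drycc.cc/charts/workflow \
    --namespace drycc \
    --set grafana.user="ops" \
    --set grafana.password="a-strong-password" \
    --set grafana.allow_sign_up="false"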

VictoriaMetrics

You can find a list of values that can be set using environment variables here.

Customizing the Registry

The Registry component can be tuned by following the distribution config doc.

2 - Configure DNS

The Drycc Workflow controller and all applications deployed via Workflow are intended (by default) to be accessible as subdomains of the Workflow cluster’s domain.

For example, assuming example.com were a cluster’s domain:

  • The controller should be accessible at drycc.example.com
  • Applications should be accessible (by default) at <application name>.example.com

Given that this is the case, the primary objective in configuring DNS is to direct traffic for all subdomains of a cluster’s domain to the cluster node(s) hosting the platform’s router component, which can direct traffic within the cluster to the correct endpoints.

With a Load Balancer

Generally, it is recommended that a load balancer be used to direct inbound traffic to one or more routers. In such a case, configuring DNS is as simple as defining a wildcard record in DNS that points to the load balancer.

For example, assuming a domain of example.com:

  • An A record enumerating each of your load balancer(s) IPs (i.e. DNS round-robining)
  • A CNAME record referencing an existing fully-qualified domain name for the load balancer
    • Per AWS’ own documentation, this is the recommended strategy when using AWS Elastic Load Balancers, as ELB IPs may change over time.

DNS for any applications using a “custom domain” (a fully-qualified domain name that is not a subdomain of the cluster’s own domain) can be configured by creating a CNAME record that references the wildcard record described above.
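
Once the wildcard record (or CNAME) is in place, you can check that arbitrary subdomains resolve to the load balancer. A quick sketch with dig, using the example.com domain from above:

$ dig +short drycc.example.com
$ dig +short some-app.example.com

Both names should return the load balancer’s IP(s) or, for a CNAME, its fully-qualified domain name.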

Although it depends on your distribution of Kubernetes and your underlying infrastructure, in many cases, the IP(s) or existing fully-qualified domain name of a load balancer can be determined directly using the kubectl tool:

$ kubectl --namespace=istio-ingress describe service | grep "LoadBalancer"
LoadBalancer Ingress:	a493e4e58ea0511e5bb390686bc85da3-1558404688.us-west-2.elb.amazonaws.com

The LoadBalancer Ingress field typically describes an existing domain name or public IP(s). Note that if Kubernetes is able to automatically provision a load balancer for you, it does so asynchronously. If the command shown above is issued very soon after Workflow installation, the load balancer may not exist yet.

Without a Load Balancer

On some platforms (Minikube, for instance), a load balancer is not an easy or practical thing to provision. In these cases, one can directly identify the public IP of a Kubernetes node that is hosting a router pod and use that information to configure the local /etc/hosts file.

Because wildcard entries do not work in a local /etc/hosts file, using this strategy may result in frequent editing of that file to add fully-qualified subdomains of a cluster for each application added to that cluster. Because of this, a more viable option may be to utilize the xip.io service.

In general, for any IP, a.b.c.d, the fully-qualified domain name any-subdomain.a.b.c.d.xip.io will resolve to the IP a.b.c.d. This can be enormously useful.

To begin, find the node(s) hosting router instances using kubectl:

$ kubectl --namespace=istio-ingress describe pod | grep Node:
Node:       ip-10-0-0-199.us-west-2.compute.internal/10.0.0.199
Node:       ip-10-0-0-198.us-west-2.compute.internal/10.0.0.198

The command will display information for every router pod. For each, a node name and IP are displayed in the Node field. If the IPs appearing in these fields are public, any of these may be used to configure your local /etc/hosts file or may be used with xip.io. If the IPs shown are not public, further investigation may be needed.

You can list the IP addresses of a node using kubectl:

$ kubectl describe node ip-10-0-0-199.us-west-2.compute.internal
# ...
Addresses:	10.0.0.199,10.0.0.199,54.218.85.175
# ...

Here, the Addresses field lists all the node’s IPs. If any of them are public, again, they may be used to configure your local /etc/hosts file or may be used with xip.io.
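
For instance, with the public address 54.218.85.175 shown above, either approach looks like the following sketch; the application hostnames are only illustrations:

$ echo "54.218.85.175 drycc.example.com myapp.example.com" | sudo tee -a /etc/hosts
# or, with xip.io, no hosts file edit is needed:
$ curl http://myapp.54.218.85.175.xip.io/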

Tutorial: Configuring DNS with Google Cloud DNS

In this section, we’ll describe how to configure Google Cloud DNS for routing your domain name to your Drycc cluster.

We’ll assume the following in this section:

  • Your Ingress service has a load balancer in front of it.
    • The load balancer need not be cloud based; it just needs to provide a stable IP address or a stable domain name.
  • You have the mystuff.com domain name registered with a registrar.
    • Replace your domain name with mystuff.com in the instructions to follow.
  • Your registrar lets you alter the nameservers for your domain name (most registrars do).

Here are the steps for configuring cloud DNS to route to your Drycc cluster:

  1. Get the load balancer IP or domain name.
     • If you are on Google Container Engine, you can run kubectl get svc -n istio-ingress and look for the LoadBalancer Ingress column to get the IP address.
  2. Create a new Cloud DNS Zone (on the console: Networking => Cloud DNS, then click on Create Zone).
  3. Name your zone, and set the DNS name to mystuff.com. (note the . at the end).
  4. Click on the Create button.
  5. Click on the Add Record Set button on the resulting page.
  6. If your load balancer provides a stable IP address, enter the following fields in the resulting form:
     1. DNS Name: *
     2. Resource Record Type: A
     3. TTL: the DNS TTL of your choosing. If you’re testing or you anticipate that you’ll tear down and rebuild many Drycc clusters over time, we recommend a low TTL.
     4. IPv4 Address: the IP that you got in the very first step.
     5. Click the Create button.
  7. If your load balancer provides the stable domain name lbdomain.com, enter the following fields in the resulting form:
     1. DNS Name: *
     2. Resource Record Type: CNAME
     3. TTL: the DNS TTL of your choosing. If you’re testing or you anticipate that you’ll tear down and rebuild many Drycc clusters over time, we recommend a low TTL.
     4. Canonical name: lbdomain.com. (note the . at the end).
     5. Click on the Create button.
  8. In your domain registrar, set the nameservers for your mystuff.com domain to the ones under the data column in the NS record on the same page. They’ll often be something like the below (note the trailing . characters).
ns-cloud-b1.googledomains.com.
ns-cloud-b2.googledomains.com.
ns-cloud-b3.googledomains.com.
ns-cloud-b4.googledomains.com.

Note: If you ever have to re-create your Drycc cluster, simply go back to step 6.4 or 7.4 (depending on your load balancer) and change the IP address or domain name to the new value. You may have to wait for the TTL you set to expire.
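
The same records can also be created from the gcloud CLI instead of the console. This is only a sketch: the zone name mystuff-zone and the IP 203.0.113.10 are placeholders, and the record-sets create command requires a reasonably recent gcloud release:

$ gcloud dns managed-zones create mystuff-zone \
    --dns-name="mystuff.com." \
    --description="Drycc Workflow zone"
$ gcloud dns record-sets create "*.mystuff.com." \
    --zone=mystuff-zone \
    --type=A \
    --ttl=300 \
    --rrdatas="203.0.113.10"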

Testing

To test that traffic reaches its intended destination, a request can be sent to the Drycc controller like so (do not forget the trailing slash!):

curl http://drycc.example.com/v2/

Or:

curl http://drycc.54.218.85.175.xip.io/v2/

Since such requests require authentication, a response such as the following should be considered an indicator of success:

{"detail":"Authentication credentials were not provided."}

3 - Deploy Hooks

Deploy hooks allow an external service to receive a notification whenever a new version of your app is pushed to Workflow.

It’s useful for keeping the development team informed about deploys, and it can also be used to integrate different systems.

After one or more hooks are set up, hook output and errors appear in your drycc grafana app logs:

2011-03-15T15:07:29-07:00 drycc[api]: Deploy hook sent to http://drycc.rocks

Deploy hooks are a generic HTTP hook. An administrator can create and configure multiple deploy hooks by tuning the controller settings via the Helm chart.
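
A sketch of wiring up a hook target and signing secret directly on the Controller deployment (drycc-controller, per the tuning section); the URL and key below are the illustrative values used later in this page:

$ kubectl --namespace drycc set env deployment/drycc-controller \
    DRYCC_DEPLOY_HOOK_URLS="http://drycc.rocks" \
    DRYCC_DEPLOY_HOOK_SECRET_KEY="my_secret_key"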

HTTP POST Hook

The HTTP deploy hook performs an HTTP POST to a URL. The parameters included in the request are the same as the variables available in the hook message: app, release, release_summary, sha and user, as shown in this example payload:

app=secure-woodland&release=v4&release_summary=gabrtv%20deployed%2035b3726&sha=35b3726&user=gabrtv

Optionally, if a deploy hook secret key is added to the controller through tuning the controller settings, a new Authorization header will be present in the POST request. The value of this header is computed as the HMAC hex digest of the request URL, using the secret as the key.

In order to authenticate that this request came from Workflow, use the secret key, the full URL and the HMAC-SHA1 hashing algorithm to compute the signature. In Python, that would look something like this:

import hashlib
import hmac

# keys and messages must be bytes in Python 3
hmac.new(b"my_secret_key",
         b"http://drycc.rocks?app=secure-woodland&release=v4&release_summary=gabrtv%20deployed%2035b3726&sha=35b3726&user=gabrtv",
         digestmod=hashlib.sha1).hexdigest()

If the value of the computed HMAC hex digest and the value in the Authorization header are identical, then the request came from Workflow.

4 - Platform Logging

Logs are a stream of time-stamped events aggregated from the output streams of all your app’s running processes. Retrieve, filter, or use syslog drains.

We’re working with Quickwit to bring you an application log cluster and search interface.

Architecture Diagram

┌───────────┐                   ┌───────────┐                     
│ Container │                   │  Grafana  │
└───────────┘                   └───────────┘
      │                               ^
     log                              |                
      │                               |                
      ˅                               │                
┌───────────┐                   ┌───────────┐     
│ Fluentbit │─────otel/grpc────>│  Quickwit │     
└───────────┘                   └───────────┘     
                                                                          

Default Configuration

Fluent Bit is based on a pluggable architecture where different plugins play a major role in the data pipeline, with more than 70 built-in plugins available. Please refer to the chart's values.yaml for specific configurations.

5 - Platform Monitoring

Add platform monitoring to your apps to spot issues in advance and respond to incidents quickly.

Description

We now include a monitoring stack for introspection on a running Kubernetes cluster. The stack includes four components: Grafana, VictoriaMetrics, node-exporter, and kube-state-metrics.

Architecture Diagram

┌────────────────┐                                                        
│ HOST           │                                                        
│  node-exporter │◀──┐                          ┌──────────────────┐         
└────────────────┘   │                          │kube-state-metrics│         
                     │                          └──────────────────┘         
┌────────────────┐   │                                    ▲                    
│ HOST           │   │    ┌─────────────────┐             │                    
│  node-exporter │◀──┼────│ victoriametrics │─────────────┘                    
└────────────────┘   │    └─────────────────┘                                  
                     │             ▲                                         
┌───────────────┐    │             │                                         
│ HOST          │    │             ▼                                         
│  node-exporter│◀───┘       ┌──────────┐                                    
└───────────────┘            │ Grafana  │                                    
                             └──────────┘                                    

Grafana

Grafana allows users to create custom dashboards that visualize the data captured by the running VictoriaMetrics component. By default Grafana is exposed using a service annotation through the router at the following URL: http://grafana.mydomain.com. The default login is admin/admin. If you are interested in changing these values please see Tuning Component Settings.

Grafana will preload several dashboards to help operators get started with monitoring Kubernetes and Drycc Workflow. These dashboards are meant as starting points and don’t include every item that might be desirable to monitor in a production installation.

Drycc Workflow monitoring by default does not write data to the host filesystem or to long-term storage. If the Grafana instance fails, modified dashboards are lost.

Production Configuration

A production install of Grafana should have the following configuration values changed if possible:

  • Change the default username and password from admin/admin. The value for the password is passed in plain text so it is best to set this value on the command line instead of checking it into version control.
  • Enable persistence
  • Use a supported external database such as mysql or postgres. You can find more information here

On Cluster Persistence

Enabling persistence will allow your custom configuration to persist across pod restarts. This means that the default SQLite database (which stores things like sessions and user data) will not disappear if you upgrade the Workflow installation.

If you wish to have persistence for Grafana you can set enabled to true in the values.yaml file before running helm install.

 grafana:
   # Configure the following ONLY if you want persistence for on-cluster grafana
   # GCP PDs and EBS volumes are supported only
   persistence:
     enabled: true # Set to true to enable persistence
     size: 5Gi # PVC size

Off Cluster Grafana

If you wish to provide your own Grafana instance, you can set grafana.enabled to false in the values.yaml file before running helm install.

VictoriaMetrics

VictoriaMetrics is a fast and scalable open source time series database and monitoring solution that lets users build a monitoring platform with minimal operational burden and without scalability issues. It is fully compatible with the Prometheus format.

On Cluster Persistence

You can enable or disable node-exporter and kube-state-metrics by setting them to true or false in the values.yaml file.

If you wish to have persistence for VictoriaMetrics, set enabled to true in the values.yaml file before running helm install:
victoriametrics:
  vmstorage:
    replicas: 3
    extraArgs:
    - --retentionPeriod=30d
    temporary:
      enabled: true
      size: 5Gi
      storageClass: "toplvm-ssd"
    persistence:
      enabled: true
      size: 10Gi
      storageClass: "toplvm-hdd"
  node-exporter:
    enabled: true
  kube-state-metrics:
    enabled: true

Off Cluster VictoriaMetrics

To use an off-cluster VictoriaMetrics (or another Prometheus-compatible endpoint), please provide the following values in the values.yaml file before running helm install.

  • victoriametrics.enabled=false
  • grafana.prometheusUrl="http://my.prometheus.url:9090"
  • controller.prometheusUrl="http://my.prometheus.url:9090"
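
Equivalently, these can be passed with --set at upgrade time; the Prometheus URL is a placeholder for your own endpoint:

$ helm upgrade drycc oci://registry.drycc.cc/charts/workflow \
    --namespace drycc \
    --set victoriametrics.enabled=false \
    --set grafana.prometheusUrl="http://my.prometheus.url:9090" \
    --set controller.prometheusUrl="http://my.prometheus.url:9090"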

6 - Production Deployments

When preparing a Workflow deployment for production workloads, there are some additional recommendations.

Running Workflow without Drycc Storage

In production, persistent storage can be achieved by running an external object store. For users on AWS, GCE/GKE, or Azure, the convenience of Amazon S3, Google GCS, or Microsoft Azure Storage makes running a Storage-less Workflow cluster quite reasonable. For users who have restrictions on using external object storage, Swift object storage can be an option.

Running a Workflow cluster without Storage provides several advantages:

  • Removes state from worker nodes
  • Reduces resource usage
  • Reduces complexity and operational burden of managing Workflow

See Configuring Object Storage for details on removing this operational complexity.

Review Security Considerations

There are some additional security-related considerations when running Workflow in production. See Security Considerations for details.

Registration is Admin-Only

By default, registration with the Workflow controller is in “admin_only” mode. The first user to run a drycc register command becomes the initial “admin” user, and registrations after that are disallowed unless requested by an admin.

Please see the following documentation to learn about changing registration mode:

Disable Grafana Signups

It is also recommended to disable signups for the Grafana dashboards.

Please see the following documentation to learn about disabling Grafana signups:

7 - Upgrading Workflow

Drycc Workflow releases may be upgraded in-place with minimal downtime.

This upgrade process requires:

Upgrade Process

Step 1: Apply the Workflow upgrade

Helm will remove all components from the previous release. Traffic to applications deployed through Workflow will continue to flow during the upgrade. No service interruptions should occur.

If Workflow is not configured to use off-cluster Postgres, the Workflow API will experience a brief period of downtime while the database recovers from backup.

First, find the name of the release helm gave to your deployment with helm ls, then run

$ helm upgrade <release-name> oci://registry.drycc.cc/charts/workflow

Note: If using off-cluster object storage on gcs and/or off-cluster registry using gcr and intending to upgrade from a pre-v2.10.0 chart to v2.10.0 or greater, the key_json values will now need to be pre-base64-encoded. Therefore, assuming the rest of the custom/off-cluster values are defined in the existing values.yaml used for previous installs, the following may be run:

$ B64_KEY_JSON="$(cat ~/path/to/key.json | base64 -w 0)"
$ helm upgrade <release_name> drycc/workflow -f values.yaml --set gcs.key_json="${B64_KEY_JSON}",registry-token-refresher.gcr.key_json="${B64_KEY_JSON}"

Alternatively, simply replace the appropriate values in values.yaml and do without the --set parameter. Make sure to wrap the key_json value in single quotes, as double quotes will give a parser error when upgrading.

Step 2: Verify Upgrade

Verify that all components have started and passed their readiness checks:

$ kubectl --namespace=drycc get pods
NAME                                     READY     STATUS    RESTARTS   AGE
drycc-builder-2448122224-3cibz            1/1       Running   0          5m
drycc-controller-1410285775-ipc34         1/1       Running   3          5m
drycc-controller-celery-694f75749b-cmxxn  3/3       Running   0          5m
drycc-database-e7c5z                      1/1       Running   0          5m
drycc-fluentbit-45h7j                     1/1       Running   0          5m
drycc-fluentbit-4z7lw                     1/1       Running   0          5m
drycc-fluentbit-k2wsw                     1/1       Running   0          5m
drycc-fluentbit-skdw4                     1/1       Running   0          5m
drycc-valkey-8nazu                        1/1       Running   0          5m
drycc-grafana-tm266                       1/1       Running   0          5m
drycc-registry-1814324048-yomz5           1/1       Running   0          5m
drycc-registry-proxy-4m3o4                1/1       Running   0          5m
drycc-registry-proxy-no3r1                1/1       Running   0          5m
drycc-registry-proxy-ou8is                1/1       Running   0          5m
drycc-registry-proxy-zyajl                1/1       Running   0          5m

Step 3: Upgrade the Drycc Client

Users of Drycc Workflow should now upgrade their drycc client to avoid warnings such as WARNING: Client and server API versions do not match. Please consider upgrading.

curl -sfL https://www.drycc.cc/install-cli.sh | bash - && sudo mv drycc $(which drycc)