How to get started with Azure Chaos Studio

Since Netflix made it famous in 2011, Chaos Engineering has been adopted so widely that even the big cloud hyperscalers now offer managed solutions for implementing it in your organization. This blog post covers the managed solution from Azure and shows you how to get started with Azure Chaos Studio.

Learnings

In this blog post, you will learn:

  • The basics of Chaos Engineering
  • How to enable an Azure resource for Chaos Studio
  • How to create and run your first chaos experiment on Azure

What is Azure Chaos Studio?

Microsoft launched Azure Chaos Studio in public preview in November 2021 as a fully managed service that is deeply integrated into Azure. You manage Chaos Studio through Azure Resource Manager and configure its resources the same way you deploy the rest of your Azure infrastructure. It also integrates out of the box with Azure Policy, Azure Application Insights, and Azure Active Directory (RBAC).

Azure Chaos Studio offers an uncomplicated way to adopt the chaos engineering methodology through controlled, safe fault injection into your Azure resources. The core resources of Chaos Studio are Targets and Experiments. A Target enables an Azure resource for Chaos Studio; an Experiment describes the faults to run against it. Chaos Studio helps you orchestrate complex experiments: you can run faults in sequence or in parallel, with time delays, or across regions. At the time of writing, the fault library contains over 30 faults.

Chaos Studio is still in public preview, so you can currently try it free of charge. After April 3, 2023, however, Azure Chaos Studio will be billed pay-as-you-go based on experiment execution.

Before we dive into Azure Chaos Studio, let us first introduce chaos engineering in general.

What is Chaos Engineering and why should you implement it?

As described above, Azure Chaos Studio helps you implement the chaos engineering methodology. But what actually is chaos engineering?

The Principles of Chaos Engineering website introduces it as follows:

Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production.

principlesofchaos.org

Experimenting means injecting faults that cause system components to fail, but in a controlled and observed way. Consider a web application that relies on a database. To connect to this database, the application uses a secret from an Azure Key Vault. If our application lost access to that secret, how would it react? Before we start the experiment, we form a hypothesis: losing access to the secret will not affect the application. For the experiment, we inject a fault into the Key Vault that activates its firewall and blocks all access for 10 minutes. The web application soon can no longer reach the Key Vault to fetch the secret. Meanwhile, we monitor the application and observe its behavior: once the firewall is activated on the Key Vault, the application crashes. The hypothesis is disproven, so our next step is to document the observation and act accordingly.
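The "monitor and observe" part of such an experiment can be as simple as polling a health endpoint while the fault is active. The following is a minimal sketch, not part of the original setup: it assumes the web application exposes an HTTP health endpoint that returns 200 when healthy, and the function names probe_once and observe are purely illustrative.

```shell
#!/usr/bin/env bash
# Sketch: poll a health endpoint during an experiment and count failed probes.
# Assumption: the application has an HTTP endpoint that answers 200 when healthy.

# probe_once URL -> prints the HTTP status code (curl prints 000 if the request failed)
probe_once() {
  curl --silent --output /dev/null --max-time 5 --write-out '%{http_code}' "$1" || true
}

# observe URL COUNT INTERVAL_SECONDS
# Prints one line per probe and a summary; returns non-zero if any probe was not 200,
# i.e. if the hypothesis "the application is unaffected" did not hold.
observe() {
  local url=$1 count=$2 interval=$3 failures=0 code i
  for ((i = 1; i <= count; i++)); do
    code=$(probe_once "$url")
    printf '%s probe=%d status=%s\n' "$(date -u +%H:%M:%S)" "$i" "$code"
    [[ $code == 200 ]] || failures=$((failures + 1))
    sleep "$interval"
  done
  printf 'failures=%d of %d\n' "$failures" "$count"
  (( failures == 0 ))
}
```

For a 10-minute fault, something like `observe "https://<your-app>/health" 60 10` would cover roughly the whole window; the URL is a placeholder for your own application.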

Many other experiments like this one can reveal weaknesses in a system. We can take dependencies offline (APIs, VMs, containers/pods, etc.), restrict access to resources (as in the example above), stress CPUs or memory, and more.

The goals of running chaos engineering experiments are to increase awareness of system weaknesses, to grow confidence in your solution’s ability to handle disturbances, and to build experience with those scenarios.

But should we really be experimenting on our production environment?

Yes, the chaos engineering methodology strongly prefers running experiments in your production environment. Because environments behave differently depending on load, usage, and traffic, it is difficult to simulate a service’s real behavior outside production. Besides, everyone should expect services to fail, and no one can control when they will. We can only control how well prepared we are and how much confidence we have in the resiliency of our services, and we build both by conducting controlled experiments that surface the issues likely to arise.

How to get started with Azure Chaos Studio

So far, we have learned about Chaos Engineering and Azure Chaos Studio. This section guides you through setting up Chaos Studio and running your first experiment.

We will stick with the fault scenario that disables Key Vault access. We will use a resource group called rg-chaos-demo for the whole demo and Bicep to deploy the needed resources.
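If the resource group does not exist yet, it can be created up front with the Azure CLI. The sketch below wraps the two commands in small helper functions (the function names are illustrative). The location westeurope matches the Bicep snippets in this post. Registering the Microsoft.Chaos resource provider is an assumption on our side: it is only needed if your subscription has not registered it before.

```shell
#!/usr/bin/env bash
# Sketch: prepare the subscription and resource group for the demo.

# create_demo_group [NAME] [LOCATION]
# Creates the resource group used throughout this post (defaults match the article).
create_demo_group() {
  local name=${1:-rg-chaos-demo} location=${2:-westeurope}
  az group create --name "$name" --location "$location" --output none
}

# register_chaos_provider
# Assumption: only required once per subscription if Microsoft.Chaos is not registered yet.
register_chaos_provider() {
  az provider register --namespace Microsoft.Chaos
}
```

After `az login`, calling `create_demo_group` without arguments creates rg-chaos-demo in westeurope.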

Creating the needed resources

First, we need a Key Vault. The following Bicep snippet creates the Azure Key Vault used for our chaos experiment and one secret called secret-chaos-demo:

resource akv 'Microsoft.KeyVault/vaults@2022-07-01' = {
  name: 'akv-chaos-demo'
  location: 'westeurope'
  properties: {
    sku: {
      family: 'A'
      name: 'standard'
    }
    enabledForDeployment: false
    enabledForDiskEncryption: false
    enabledForTemplateDeployment: false
    enableRbacAuthorization: true
    enableSoftDelete: true
    publicNetworkAccess: 'Enabled'
    softDeleteRetentionInDays: 90
    tenantId: tenant().tenantId
  }
}

resource superSecret 'Microsoft.KeyVault/vaults/secrets@2022-07-01' = {
  name: 'secret-chaos-demo'
  parent: akv
  properties: {
    attributes: {
      enabled: true
    }
    contentType: 'secret'
    value: 'mysupersecretsecret'
  }
}

As soon as the Key Vault is created, it needs to be onboarded as a target resource in Chaos Studio. Targets can be service-direct (Chaos Studio communicates directly with the Azure resource) or agent-based (an agent must be installed on the resource). In addition to the target, we need to add capabilities. Together, targets and capabilities control which resources are enabled for fault injection and which faults can run against them. This is an additional security layer that ensures experiments only run on the resources we want. Please note that the name of the target resource must match the Target name/type defined in this table for supported resources. The same table also has a row describing the recommended role assignment for our target. We will cover this later.

resource akvChaosStudioTarget 'Microsoft.Chaos/targets@2022-10-01-preview' = {
  name: 'Microsoft-KeyVault'
  location: 'westeurope'
  scope: akv
  properties: {}
}

Now that the Target has been enabled, we can create experiments against it. The Chaos Studio fault library has around 30 faults available at the time of writing. We search for the Key Vault Deny Access fault and note down the Fault Type and URN, as we need them for our experiment. Azure Chaos Studio allows us to structure experiments in several steps, as we saw in the introduction. For our purpose, a single step is enough. Inside the actions array of the step, we set type to the Fault Type and name to the URN that we noted down. For the duration of the fault, we choose 10 minutes (PT10M in ISO 8601 notation):

resource akvAccessDenied 'Microsoft.Chaos/experiments@2022-10-01-preview' = {
  name: '${akv.name}AccessDenied10m'
  location: 'westeurope'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    selectors: [
      {
        type: 'List'
        id: 'Selector1'
        targets: [
          {
            id: akvChaosStudioTarget.id
            type: 'ChaosTarget'
          }
        ]
      }
    ]
    steps: [
      {
        name: 'akvAccessDenied10m'
        branches: [
          {
            name: 'akvAccessDenied10m'
            actions: [
              {
                name: 'urn:csci:microsoft:keyVault:denyAccess/1.0'
                type: 'continuous'
                duration: 'PT10M'
                selectorId: 'Selector1'
                parameters: []
              }
            ]
          }
        ]
      }
    ]
  }
}

Azure Chaos Studio’s permission model gives every experiment its own system-assigned managed identity (a service principal in Azure Active Directory). We need to assign this identity the role described in the fault provider table mentioned above. In our case, we assign the “Key Vault Contributor” role to the experiment’s system-assigned managed identity on the Key Vault:

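resource roleAssignment 'Microsoft.Authorization/roleAssignments@2020-04-01-preview' = {
  name: guid(subscription().id, resourceGroup().id, akvAccessDenied.id)
  // Assign the role on the Key Vault itself rather than on the whole resource group
  scope: akv
  properties: {
    // Built-in role "Key Vault Contributor"; role definitions are subscription-level resources
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'f25e0fa2-a7c8-4377-a976-54943a77a395')
    principalId: akvAccessDenied.identity.principalId
    principalType: 'ServicePrincipal'
  }
}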

Gluing it all together

Now we are ready to deploy all resources and run our first experiment with Azure Chaos Studio. To do so, we create the file main.bicep and insert all the code examples above. Once that is done, we execute the Bicep code with the Azure CLI to deploy all resources into the resource group:

az deployment group create \
  --name FirstAzureChaosExperiment \
  --resource-group rg-chaos-demo \
  --template-file main.bicep

If all resources are deployed successfully, we can navigate to the Azure Portal.

IMPORTANT: Before we can run our first experiment, we need to activate the capability DenyAccess-1.0 on the Key Vault Target. As of now, adding a capability via Bicep is not supported (see this GitHub Issue). Until the Bicep team fixes the issue, we can do this manually via the Azure Portal.

In the Portal, search for Chaos Studio (preview) in the search bar. Select the Targets view and click the Manage actions link next to the resource akv-chaos-demo. The Manage actions pop-up will appear. Here we enable Key Vault Deny Access so that only this fault can run on the chosen Target (the pop-up only offers capabilities that match the Target type).

Get started with Azure Chaos Studio - manage action capabilities
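If you prefer to script this portal step, the capability can presumably also be created through the Azure Resource Manager REST API using `az rest`. Treat the following as a sketch under assumptions: the API version defaults to the one used in the Bicep snippets of this post, the request body shape is our assumption, and the helper names capability_uri and enable_capability are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: enable a Chaos Studio capability on the Key Vault target via the ARM REST API.

# capability_uri KEY_VAULT_RESOURCE_ID CAPABILITY [API_VERSION]
# Builds the ARM URI of the capability, which is a child resource of the target.
capability_uri() {
  local kv_id=$1 capability=$2 api=${3:-2022-10-01-preview}
  printf 'https://management.azure.com%s/providers/Microsoft.Chaos/targets/Microsoft-KeyVault/capabilities/%s?api-version=%s\n' \
    "$kv_id" "$capability" "$api"
}

# enable_capability KEY_VAULT_RESOURCE_ID CAPABILITY
# Assumption: an empty properties object is sufficient as the request body.
enable_capability() {
  az rest --method put --uri "$(capability_uri "$1" "$2")" --body '{"properties":{}}'
}
```

The Key Vault resource ID can be looked up with `az keyvault show --name akv-chaos-demo --query id --output tsv` and then passed along, for example `enable_capability "$KV_ID" DenyAccess-1.0`.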

Time to experiment a bit!

In the Azure Portal, navigate to the designated Key Vault, in our case the previously created akv-chaos-demo. Verify that we have access to the secret we created in the Key Vault. This works, at least until we execute our experiment.

Get started with Azure Chaos Studio - Azure Key Vault secret

In the Portal, search again for Chaos Studio (preview) in the search bar. Select Experiments and, in this view, select the experiment we created. Now we can run the experiment by clicking Start.

Get started with Azure Chaos Studio – Azure Key Vault secret
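Starting the experiment does not have to happen in the portal. As a sketch, the same action can be triggered with `az rest` against the experiment's start endpoint. We assume the API version from the Bicep snippets above and the experiment name that results from our template (akv-chaos-demoAccessDenied10m); the helper function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: start a Chaos Studio experiment and query its status via the ARM REST API.

# experiment_uri SUBSCRIPTION_ID RESOURCE_GROUP EXPERIMENT_NAME ACTION [API_VERSION]
experiment_uri() {
  local sub=$1 rg=$2 name=$3 action=$4 api=${5:-2022-10-01-preview}
  printf 'https://management.azure.com/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Chaos/experiments/%s/%s?api-version=%s\n' \
    "$sub" "$rg" "$name" "$action" "$api"
}

# start_experiment SUBSCRIPTION_ID RESOURCE_GROUP EXPERIMENT_NAME
start_experiment() {
  az rest --method post --uri "$(experiment_uri "$1" "$2" "$3" start)"
}

# experiment_statuses SUBSCRIPTION_ID RESOURCE_GROUP EXPERIMENT_NAME
# Lists the status entries of past and current runs of the experiment.
experiment_statuses() {
  az rest --method get --uri "$(experiment_uri "$1" "$2" "$3" statuses)"
}
```

With the subscription ID from `az account show --query id --output tsv`, a run could then be started with `start_experiment "$SUB_ID" rg-chaos-demo akv-chaos-demoAccessDenied10m`.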

When the experiment status changes to Running, we can navigate back to our Key Vault. If we now select Networking in the left pane, we should see that the firewall is activated, and we can no longer access the secret because the firewall forbids it.

Get started with Azure Chaos Studio - experiment running
Get started with Azure Chaos Studio - Azure Key Vault firewall
Get started with Azure Chaos Studio - Azure Key vault secret forbidden by firewall
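The same check can be done from the command line, before and during the experiment. The sketch below uses `az keyvault secret show` with the vault and secret names from our Bicep template. It assumes the signed-in user has a data-plane role that allows reading secrets (the vault uses RBAC authorization), otherwise the check fails even without the fault. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: check from the CLI whether the demo secret can currently be read.

# secret_accessible VAULT_NAME SECRET_NAME -> exit code 0 if the secret can be read
secret_accessible() {
  az keyvault secret show --vault-name "$1" --name "$2" --query id --output tsv >/dev/null 2>&1
}

# report_secret_access VAULT_NAME SECRET_NAME -> prints a one-line verdict
report_secret_access() {
  if secret_accessible "$1" "$2"; then
    echo "secret $2 in $1 is readable"
  else
    echo "secret $2 in $1 is NOT readable"
  fi
}
```

Running `report_secret_access akv-chaos-demo secret-chaos-demo` before the experiment and again while it is running should show the verdict flip once the firewall is active.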

Congratulations! If you followed along, you have successfully executed your first chaos experiment. Hopefully, there are more to come!

Conclusion

Chaos engineering is an important practice for modern service architectures. It should be adopted to increase the resilience of services and minimize disruptions for customers. With Azure Chaos Studio, no one has an excuse not to do chaos engineering anymore. The managed service from Azure offers a surprisingly easy and quick way to implement the methodology with existing tooling. To automate the scheduling of experiments, you can leverage Logic Apps as described here until this feature is (hopefully) built into Azure Chaos Studio.

Chaos Engineering with Azure Chaos Studio part 2 – Meetup Announcement

See the accompanying GitHub Repository for a full code example and join us on April 27th, when the second part of Chaos Engineering with Azure Chaos Studio takes place. There, we will run more complex experiments on the Azure Kubernetes Service. You can register for our next virtual meetup via this link.

In case you missed the first part of Chaos Engineering with Azure Chaos Studio, here is the link to the recording on our YouTube channel.