Tracing a production incident back to git commits #

In this 5-minute tutorial you'll learn how Kosli can trace a production incident in cyber-dojo back to git commits.

Something has gone wrong and https://cyber-dojo.org is displaying a 500 error!

[Image: Prod cyber-dojo is down with a 500]

It was working an hour ago. What has happened in the last hour?

Getting ready #

You need to:

  • Install Kosli CLI.
  • Get a Kosli API token.
  • Set the KOSLI_ORG environment variable to cyber-dojo (the Kosli cyber-dojo organization is public so any authenticated user can read its data) and KOSLI_API_TOKEN to your token:
    export KOSLI_ORG=cyber-dojo
    export KOSLI_API_TOKEN=<your-api-token>
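
Before running any of the commands below, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch in POSIX shell; `check_kosli_env` is a hypothetical helper name used for illustration and is not part of the Kosli CLI.

```shell
# check_kosli_env: hypothetical helper, not part of the Kosli CLI.
# Returns 0 only if KOSLI_ORG and KOSLI_API_TOKEN are both set and non-empty.
check_kosli_env() {
    rc=0
    if [ -z "${KOSLI_ORG:-}" ]; then
        echo "KOSLI_ORG is not set" >&2
        rc=1
    fi
    if [ -z "${KOSLI_API_TOKEN:-}" ]; then
        echo "KOSLI_API_TOKEN is not set" >&2
        rc=1
    fi
    return "$rc"
}
```

You could then guard a command with it, for example `check_kosli_env && kosli log env aws-prod`.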
    

Start with the environment #

https://cyber-dojo.org is running in an AWS environment that reports to Kosli as aws-prod.
Get a log of this environment's changes:

kosli log env aws-prod

At the time this tutorial was written, this command displayed the first page of 177 snapshots. You will see the first page of considerably more, because aws-prod has moved on since the incident: it was resolved with new commits, which created new deployments. To limit the output, set the interval for the command:

kosli log env aws-prod --interval 176..177

The output should be:

SNAPSHOT  EVENT                                                                          FLOW      DEPLOYMENTS
#177      Artifact: 274425519734.dkr.ecr.eu-central-1.amazonaws.com/creator:31dee35      creator   #87 
          Fingerprint: 5d1c926530213dadd5c9fcbf59c8822da56e32a04b0f9c774d7cdde3cf6ba66d             
          Description: 1 instance stopped running (from 1 to 0).                               
          Reported at: Tue, 06 Sep 2022 16:53:28 CEST                                          
                                                                                               
#176      Artifact: 274425519734.dkr.ecr.eu-central-1.amazonaws.com/creator:b7a5908      creator   #89 
          Fingerprint: 860ad172ace5aee03e6a1e3492a88b3315ecac2a899d4f159f43ca7314290d5a             
          Description: 1 instance started running (from 0 to 1).                               
          Reported at: Tue, 06 Sep 2022 16:52:28 CEST

These two snapshots belong to the same blue-green deployment. Artifact creator:b7a5908 starts running in snapshot #176, and one minute later the artifact it replaces, creator:31dee35, stops running in snapshot #177.
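
When an interval covers many snapshots, a one-line-per-snapshot summary is easier to scan than the full table. The sketch below assumes the table layout shown above (a line starting with the snapshot number followed by `Artifact:`, then an indented `Description:` line). That layout is meant for people to read and may change, so treat this as illustrative only. `summarize_env_log` is a hypothetical helper name.

```shell
# summarize_env_log: hypothetical helper that reads `kosli log env` table
# output on stdin and prints "<snapshot> <artifact> <description>" per snapshot.
summarize_env_log() {
    awk '
        # Snapshot header line: remember the snapshot number and artifact name.
        /^#[0-9]+/ { snap = $1; art = $3 }
        # Description line: strip the label and padding, then print the summary.
        /Description:/ {
            sub(/^.*Description:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            print snap, art, $0
        }
    '
}
```

For example, `kosli log env aws-prod --interval 176..177 | summarize_env_log | grep "started running"` would keep only the snapshots in which an artifact started.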

Dig into the artifact #

You are interested in snapshot #176, which shows the newly running artifact, creator:b7a5908, whose fingerprint starts with 860ad17.

Let's learn more about this artifact:

kosli get artifact creator@860ad17

The output should be:

Name:        cyberdojo/creator:b7a5908
Flow:        creator
Fingerprint: 860ad172ace5aee03e6a1e3492a88b3315ecac2a899d4f159f43ca7314290d5a
Created on:  Tue, 06 Sep 2022 16:48:07 CEST • 21 hours ago
Git commit:  b7a590836cf140e17da3f01eadd5eca17d9efc65
Commit URL:  https://github.com/cyber-dojo/creator/commit/b7a590836cf140e17da3f01eadd5eca17d9efc65
Build URL:   https://github.com/cyber-dojo/creator/actions/runs/3001102984
State:       COMPLIANT
History:  
    Artifact created                               Tue, 06 Sep 2022 16:48:07 CEST
    Deployment #88 to aws-beta environment         Tue, 06 Sep 2022 16:49:59 CEST
    Deployment #89 to aws-prod environment         Tue, 06 Sep 2022 16:51:12 CEST
    Started running in aws-beta#196 environment    Tue, 06 Sep 2022 16:51:42 CEST
    Started running in aws-prod#176 environment    Tue, 06 Sep 2022 16:52:28 CEST
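
If you want to use the commit in a script rather than copy it by hand, you can pull a single field out of this output. This is a small sketch that assumes the `Key:  value` layout shown above. `artifact_field` is a hypothetical helper name, not a Kosli command.

```shell
# artifact_field: hypothetical helper that reads `kosli get artifact` output
# on stdin and prints the value of the first line starting with "<label>:".
# $1 is the label exactly as printed, e.g. "Git commit" or "Commit URL".
artifact_field() {
    awk -v key="$1" '
        index($0, key ":") == 1 {
            value = substr($0, length(key) + 2)   # text after "<label>:"
            sub(/^[ \t]+/, "", value)             # trim padding after the colon
            sub(/[ \t]+$/, "", value)             # trim trailing whitespace
            print value
            exit
        }
    '
}
```

For example, `kosli get artifact creator@860ad17 | artifact_field "Commit URL"` would print only the commit URL.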

Follow to the commit #

You can follow the commit URL.
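
If you prefer the command line to the browser, the same commit can be inspected with git. The sketch below assumes a GitHub commit URL of the form `https://github.com/<owner>/<repo>/commit/<sha>`, as in the output above. `split_commit_url` and the directory name `creator-src` are hypothetical names chosen for illustration.

```shell
# split_commit_url: hypothetical helper that prints "<repo-url> <sha>" for a
# commit URL of the form https://github.com/<owner>/<repo>/commit/<sha>.
split_commit_url() {
    url=$1
    case "$url" in
        */commit/*) ;;
        *) echo "not a commit URL: $url" >&2; return 1 ;;
    esac
    repo=${url%/commit/*}     # everything before "/commit/"
    sha=${url##*/commit/}     # everything after "/commit/"
    echo "$repo $sha"
}

# Usage (needs git and network access, so it is left commented out):
#   set -- $(split_commit_url "<commit-url>")
#   git clone "$1" creator-src
#   git -C creator-src show "$2"
```

`git show` prints the commit message and the diff, which is the same change the commit URL displays in the browser.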

[Image: cyber-dojo GitHub diff]

The incident was caused by a simple typo in the app.rb file!

Perhaps someone accidentally typed the "s" while trying to save the file? Either way, this is clearly the problem, because the function is named respond_to, without the "s".

You were able to trace the problem back to a specific commit without any access to cyber-dojo's aws-prod environment.