GET Health Check

Design discussion for new Health Check implementation.

Objectives

The goal of this design is to implement a new health check for Mojaloop switch services that provides a greater level of detail.

It features:

  • Clear HTTP Statuses (no need to inspect the response to know there are no issues)

  • Backwards compatibility with existing health checks - No longer a requirement. See this discussion.

  • Information about the version of the API, and how long it has been running for

  • Information about sub-service (Kafka, logging sidecar and MySQL) connections

Request Format

/health

Uses the newly implemented health check. As discussed here, since implementing the health check adds no connection overhead (e.g. pinging a database), there is no need to complicate things with separate simple and detailed versions.

Response Codes:

  • 200 - Success. The API is up and running, and is successfully connected to necessary services.

  • 502 - Bad Gateway. The API is up and running, but it cannot connect to a necessary service (e.g. Kafka).

  • 503 - Service Unavailable. This response is not implemented in this design, but will be the default if the API is not up and running.
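The mapping from connection state to response code can be sketched as follows. This is a minimal illustration in Python rather than the services' actual implementation, and the helper name `response_code_for` is hypothetical. 503 does not appear because, as noted above, it is not produced by this design.

```python
from http import HTTPStatus


def response_code_for(dependencies_connected: bool) -> HTTPStatus:
    """Pick the response code for a health check served by a running API.

    Hypothetical helper: the argument says whether every necessary
    service (e.g. Kafka) is currently connected.
    """
    if dependencies_connected:
        return HTTPStatus.OK           # 200: API up, dependencies connected
    return HTTPStatus.BAD_GATEWAY      # 502: API up, a dependency is unreachable
```

Because the status code alone carries the result, a caller only needs to compare it against 200.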

Response Format

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| status | statusEnum | The status of the service. Options are OK and DOWN. See statusEnum below. | "OK" |
| uptime | number | How long (in seconds) the service has been alive for. | 123456 |
| started | string (ISO formatted date-time) | When the service was started (UTC) | "2019-05-31T05:09:25.409Z" |
| versionNumber | string (semver) | The current version of the service. | "5.2.5" |
| services | Array<serviceHealth> | A list of services this service depends on, and their connection status | see below |

serviceHealth

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| name | subServiceEnum | The sub-service name. See subServiceEnum below. | "broker" |
| status | statusEnum | The status of the sub-service. Options are OK and DOWN. See statusEnum below. | "OK" |

subServiceEnum

The subServiceEnum enum describes the name of a sub-service.

Options:

  • datastore -> The database for this service (typically a MySQL Database).

  • broker -> The message broker for this service (typically Kafka).

  • sidecar -> The logging sidecar sub-service this service attaches to.

  • cache -> The caching sub-service this service attaches to.
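To show how the fields in the response format fit together, here is a rough Python sketch that assembles a response body. The function name `build_health_body` and its parameters are assumptions made for illustration; only the field names and value formats come from the tables above.

```python
from datetime import datetime, timezone


def build_health_body(status, started, now, version, services):
    """Assemble the health response body (hypothetical helper).

    `services` is a list of (name, status) pairs, where each name is one of
    the subServiceEnum values and each status is "OK" or "DOWN".
    """
    return {
        "status": status,
        # uptime is the number of seconds since the service started
        "uptime": (now - started).total_seconds(),
        # ISO 8601 date-time in UTC, e.g. "2019-05-31T05:09:25.409Z"
        "started": started.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "versionNumber": version,
        "services": [{"name": n, "status": s} for n, s in services],
    }


started = datetime(2019, 5, 31, 5, 9, 25, 409000, tzinfo=timezone.utc)
now = datetime(2019, 5, 31, 5, 10, 25, 409000, tzinfo=timezone.utc)
body = build_health_body("OK", started, now, "5.2.5", [("broker", "OK")])
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the sketch deterministic and easy to check.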

statusEnum

The statusEnum enum represents the status of the system or a sub-service.

It has two options:

  • OK -> The service or sub-service is healthy.

  • DOWN -> The service or sub-service is unhealthy.

When a service is OK, the API is considered healthy, and all sub-services are also considered healthy.

If any sub-service is DOWN, then the entire health check will fail, and the API will be considered DOWN.
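The aggregation rule above amounts to "OK only if every sub-service is OK". A minimal Python sketch, with a hypothetical function name:

```python
from typing import Iterable


def overall_status(sub_service_statuses: Iterable[str]) -> str:
    """Return "OK" only when every sub-service reports "OK", otherwise "DOWN"."""
    return "OK" if all(s == "OK" for s in sub_service_statuses) else "DOWN"
```

In this sketch, a service with no sub-services yields "OK", because there is nothing that can be DOWN. That edge case is an assumption; the design does not address it.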

Defining Sub-Service health

Simply pinging a sub-service is not enough to know whether it is healthy; we want to go one step further. The criteria for health differ for each sub-service.

datastore

For datastore, a status of OK means:

  • An existing connection to the database

  • The database is not empty (contains more than 1 table)

broker

For broker, a status of OK means:

  • An existing connection to the Kafka broker

  • The necessary topics exist. This will change depending on which service the health check is running for.

For example, for the central-ledger service to be considered healthy, the following topics need to be found:

topic-admin-transfer
topic-transfer-prepare
topic-transfer-position
topic-transfer-fulfil
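The broker criteria can be expressed as a connection check followed by a set difference between the required and existing topics. The sketch below is illustrative only: the function names are hypothetical, and how the list of existing topics is obtained from Kafka is left out and assumed to be provided by the caller.

```python
from typing import Iterable, Set

# Topics required by central-ledger, taken from the list above.
CENTRAL_LEDGER_TOPICS = {
    "topic-admin-transfer",
    "topic-transfer-prepare",
    "topic-transfer-position",
    "topic-transfer-fulfil",
}


def missing_topics(required: Iterable[str], existing: Iterable[str]) -> Set[str]:
    """Return the required topics that were not found on the broker."""
    return set(required) - set(existing)


def broker_status(connected: bool, required: Iterable[str], existing: Iterable[str]) -> str:
    """Return "OK" when connected and every required topic exists, else "DOWN"."""
    if not connected:
        return "DOWN"
    return "DOWN" if missing_topics(required, existing) else "OK"
```

Returning the missing topics separately makes it possible to log which topic caused a DOWN status, although the response format itself only carries the status.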

sidecar

For sidecar, a status of OK means:

  • An existing connection to the sidecar

cache

For cache, a status of OK means:

  • An existing connection to the cache

Swagger Definition

Note: These will be added to the existing swagger definitions for the following services:

  • ml-api-adapter

  • central-ledger

  • central-settlement

  • central-event-processor

  • email-notifier

{
  // . . .
  "/health": {
    "get": {
      "operationId": "getHealth",
      "tags": [
        "health"
      ],
      "responses": {
        "default": {
          "schema": {
              "$ref": "#/definitions/health"
          },
          "description": "Successful"
        }
      }
    }
  },
  // . . .
  "definitions": {
    "health": {
      "type": "object",
      "properties": {
        "status": {
          "type": "string",
          "enum": [
            "OK",
            "DOWN"
          ]
        },
        "uptime": {
          "description": "How long (in seconds) the service has been alive for.",
          "type": "number"
        },
        "started": {
          "description": "When the service was started (UTC)",
          "type": "string",
          "format": "date-time"
        },
        "versionNumber": {
          "description": "The current version of the service.",
          "type": "string",
          "example": "5.2.3"
        },
        "services": {
          "description": "A list of services this service depends on, and their connection status",
          "type": "array",
          "items": {
            "$ref": "#/definitions/serviceHealth"
          }
        }
      }
    },
    "serviceHealth": {
      "type": "object",
      "properties": {
        "name": {
          "description": "The sub-service name.",
          "type": "string",
          "enum": [
            "datastore",
            "broker",
            "sidecar",
            "cache"
          ]
        },
        "status": {
          "description": "The connection status with the service.",
          "type": "string",
          "enum": [
            "OK",
            "DOWN"
          ]
        }
      }
    }
  }
}

Example Requests and Responses:

Successful Legacy Health Check:

GET /health HTTP/1.1
Content-Type: application/json

200 SUCCESS
{
  "status": "OK"
}

Successful New Health Check:

GET /health?detailed=true HTTP/1.1
Content-Type: application/json

200 SUCCESS
{
  "status": "OK",
  "uptime": 0,
  "started": "2019-05-31T05:09:25.409Z",
  "versionNumber": "5.2.3",
  "services": [
    {
      "name": "broker",
      "status": "OK"
    }
  ]
}

Failed Health Check, but API is up:

GET /health?detailed=true HTTP/1.1
Content-Type: application/json

502 BAD GATEWAY
{
  "status": "DOWN",
  "uptime": 0,
  "started": "2019-05-31T05:09:25.409Z",
  "versionNumber": "5.2.3",
  "services": [
    {
      "name": "broker",
      "status": "DOWN"
    }
  ]
}

Failed Health Check:

GET /health?detailed=true HTTP/1.1
Content-Type: application/json

503 SERVICE UNAVAILABLE

Sequence Diagram

Sequence design diagram for the GET Health
