Status Monitoring and Healthchecks Requirements

Project Title: Status Monitoring and Healthchecks
Target Release
Epic
Document Status: DRAFT
Document Owner

Document Sign-Off
Subject Matter Expert(s)
Technical Expert(s)

Background & Business Value

We want to ensure that the API Proxy endpoints are up and running. A healthcheck and monitoring tool can help by sending notifications when a proxy becomes unavailable (upcheck) or starts returning error responses (healthcheck).

Goals

  • Establish a monitoring system for the API Proxy Endpoints
    • Ensure it can notify the Team
  • Establish Standards for types of Monitoring
    • Upcheck (required)
      • Simply checks if the API Proxy is up and listening. It does not check anything further than the API Gateway.
    • Healthcheck (opt-in)
      • Checks if the backend API service is healthy and responding. The check flows through the proxy to the backend application server.
  • Establish procedures for setting up uptime and healthcheck monitoring
    • Document the features

Assumptions

Out of Scope

  • Being responsible for notifying the API service owners
  • Being responsible for the uptime of backend API services

Requirements

Ticket(s) | Title | User Story | Priority | Notes

Title: /upcheck
User Story: As an API Gateway Admin, I want to know if one of our API Proxies is no longer available (is no longer deployed).
Priority: MUST HAVE
Notes:
  • This is required on all API Proxies.
  • Path: /upcheck
  • This will be provided by the API Gateway system.
  • This is a reserved endpoint and can't be defined in the API swagger doc.
  • The response will be a simple 200 OK.
  • No security is required on this endpoint.
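
The upcheck rules above can be sketched in Python as follows. This is only an illustration, not the monitoring tool's actual behavior: the helper names (`upcheck_url`, `is_up`, `probe`), the timeout value, and the example base URL are assumptions introduced here. The sketch treats only a 200 response as "up" and sends no credentials, as the notes above require.

```python
import urllib.error
import urllib.request
from typing import Optional

UPCHECK_PATH = "/upcheck"  # reserved path from the requirement above


def upcheck_url(proxy_base_url: str) -> str:
    """Build the upcheck URL for a proxy base URL (hypothetical helper)."""
    return proxy_base_url.rstrip("/") + UPCHECK_PATH


def is_up(status_code: Optional[int]) -> bool:
    """Treat only 200 OK as 'up'; None means no response was received."""
    return status_code == 200


def probe(url: str, timeout_seconds: float = 5.0) -> Optional[int]:
    """Send an unauthenticated GET; return the HTTP status, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The gateway answered, but with an error status (for example 404 or 503).
        return error.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar network errors.
        return None
```

A caller would combine the pieces as `is_up(probe(upcheck_url(base_url)))`, where `base_url` is the proxy's base URL on the gateway.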

Title: /healthcheck
User Story: As an API Gateway Admin, I want to give the API service developers a standardized way to monitor the health of their applications through the API Gateway. (That is, test that a call going through the API Gateway makes it all the way to the backend service, and verify that the backend service responds correctly.)
Priority: MUST HAVE
Notes:
  • This is an opt-in addition for API Proxies.
  • Path: /healthcheck
  • This is a reserved endpoint and can't be defined in the API swagger doc.
  • The response will be a simple 200 OK for success.
    • All other responses will be considered failures and will require notification to be sent out.
  • If the API developer has a particular response they expect to see, they can provide that at the time of configuration.
  • A common Healthcheck API Key will be used to ensure that it is the Uptime Robot healthcheck system calling the endpoint. (Open question; see Questions.)
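
As a rough sketch of the healthcheck rules above, the Python below builds a healthcheck request carrying a shared API key and decides whether a response should trigger a notification. The header name `x-api-key` and the helper names are assumptions; the requirements do not say how the key is transmitted, and whether a key is used at all is still an open question.

```python
import urllib.request

HEALTHCHECK_PATH = "/healthcheck"  # reserved path from the requirement above
API_KEY_HEADER = "x-api-key"       # assumption: the header name is not specified


def build_healthcheck_request(proxy_base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the proxy's healthcheck endpoint (hypothetical helper)."""
    url = proxy_base_url.rstrip("/") + HEALTHCHECK_PATH
    return urllib.request.Request(url, headers={API_KEY_HEADER: api_key}, method="GET")


def needs_notification(status_code: int) -> bool:
    """Any response other than 200 OK counts as a failure that requires notification."""
    return status_code != 200
```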

Title: Healthcheck should be Opt-In
User Story: As an API Developer, I don't want to be forced to provide a /healthcheck endpoint. I do want the ability to provide one in the future.
Priority: MUST HAVE
Notes:
  • As part of the API Creation Request process, we should ask whether they would like a Healthcheck to be set up for them. If they do, they should specify:
    • Whether the normal 200 OK response is enough to determine that the service is healthy (so that no notification is sent).
    • Whether there is a specific message body they expect to see returned, in order to determine if the service is healthy or not.
      • Uptime Robot checks for keywords, not exact bodies, so only a few keywords to search for will actually be used.
    • Whether the notifications should be sent to the functional account that is associated with the application. (Open question)
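
The opt-in choices above could be captured in a small configuration record, sketched below in Python. The `HealthcheckConfig` fields and the `is_healthy` helper are hypothetical names for illustration. The sketch assumes that every configured keyword must appear in the response body and that matching is case-sensitive; neither detail is stated in the requirements.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HealthcheckConfig:
    """Choices collected during the API Creation Request (hypothetical structure)."""
    enabled: bool = False                               # healthcheck is opt-in
    keywords: List[str] = field(default_factory=list)   # empty: 200 OK alone is enough
    notify_address: Optional[str] = None                # e.g. the functional account


def is_healthy(config: HealthcheckConfig, status_code: int, body: str) -> bool:
    """Healthy means 200 OK and, if keywords are configured, all of them in the body."""
    if status_code != 200:
        return False
    # Assumption: all keywords must be present, compared case-sensitively.
    return all(keyword in body for keyword in config.keywords)
```

With `keywords=["status", "UP"]`, a 200 response whose body is `{"status": "UP"}` counts as healthy, while a 200 response whose body is `{"status": "DOWN"}` does not.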

User Interaction, Design & Architecture

Creating a new monitor

Examples and References

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question | Outcome | Decision Date
For the /healthcheck endpoints (the ones that flow through the API Gateway to the backend), should we secure them with an API key? Should it be a single API Key that we use on UptimeRobot for all healthchecks? Would this mean that the shared flow accepting healthcheck requests would check only against that API Key, so that other legitimate keys for the overall API Proxy would not work?
When a /healthcheck reports itself as down, should we standardize that the notification will only be sent to the functional/shared account address? Or do we want to be more flexible and let the API developer specify other addresses to send to?