Metrics

Documentation Objective

This document describes how to access the metrics collected by Hadean.

Prerequisites

To gather metrics, make sure you have a simulation that is currently running. If you do not, please refer to the section on starting your simulation. You will also need your cluster key to access the metrics of remote simulations.


Accessing metrics through the aether CLI

While your simulation is running, our infrastructure collects metrics from all the machines it runs on and makes them available through our HTTP Gateway. The easiest way to talk to the Gateway is through the aether CLI.

We collect metrics on a per-machine basis, and since your simulation runs on multiple machines, the Gateway exposes multiple endpoints from which to scrape metrics.

To list all the endpoints, we will use the aether metric --list command.

$ aether metric --list
//169.2.2.1/exporter/
//169.2.2.1/node_exporter/

To get the actual metrics, we will use the aether metric --get <endpoint> command, which prints all the metrics of a particular machine (in Prometheus format) to stdout.

$ aether metric --get //169.2.2.1/exporter/
# HELP metrics_server_dev_requests_served Number of requests served by this metrics server
# TYPE metrics_server_dev_requests_served counter
metrics_server_dev_requests_served 1
# HELP metrics_server_dev_uptime_ms Uptime of the metrics server
# TYPE metrics_server_dev_uptime_ms gauge
metrics_server_dev_uptime_ms 27090
# HELP data_sent_to_muxers_kib The egress bandwidth (in KiB) sent by a worker to muxers.
# TYPE data_sent_to_muxers_kib counter
data_sent_to_muxers_kib{workerid="7",hadean_pid="35.177.221.250.18023.0"} 6.058594
data_sent_to_muxers_kib{workerid="5",hadean_pid="35.177.221.250.18019.0"} 5.753906
data_sent_to_muxers_kib{workerid="4",hadean_pid="35.177.221.250.18017.0"} 7.277344
data_sent_to_muxers_kib{workerid="6",hadean_pid="35.177.221.250.18021.0"} 4.535156
# TYPE ticks_elapsed counter
ticks_elapsed{tick="tick",hadean_pid="35.177.221.250.18001.0"} 54.000000
# TYPE time_elapsed gauge
time_elapsed{time="time",hadean_pid="35.177.221.250.18001.0"} 1602167525.000000
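
If you are only interested in a particular metric, you can filter the output with standard command-line tools, for example:

$ aether metric --get //169.2.2.1/exporter/ | grep data_sent_to_muxers_kib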

Note: for remote runs, you can simply add the --cluster <key> argument to your aether metric invocations.

Behind the scenes, the CLI's --list and --get commands query our Hadean Gateway REST API. For example, --list queries the /managers/metrics/ endpoint of our API and prints the response to stdout. The --get command, on the other hand, makes a request to /<ip>/exporter/metrics/ to scrape the metrics of one of our services and prints the output to stdout. The Gateway can only be accessed from within your cluster, but the CLI creates a tunnel for you and makes these requests on your behalf.
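
In other words, the two commands map onto Gateway routes roughly as follows (illustrative only; the Gateway address is private to your cluster, and the requests travel through the tunnel the CLI sets up):

aether metric --list                         ->  GET /managers/metrics/
aether metric --get //169.2.2.1/exporter/    ->  GET /169.2.2.1/exporter/metrics/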

Setting up a metric server proxy

Another way to access the Gateway is through the aether metric --serve command. This makes the Gateway queryable through curl, your browser and Prometheus on a local address.

$ aether metric --serve --port 9989 --ip 127.0.0.1

This will start a proxy to our Gateway on the specified port:

Serving metrics at 127.0.0.1:9989

The benefit of using --serve over --get/--list is that you can build more complex tooling on top of our Gateway API.

Using a browser to access metrics

First we will navigate to http://localhost:9989/managers/metrics, which should return JSON like:

{ "endpoints": ["/169.2.2.1/exporter/", "/169.2.2.1/node_exporter/"] }

We then navigate to http://localhost:9989/169.2.2.1/exporter/metrics/ in order to get the metrics of a particular machine.

Using curl to access metrics

The process is very similar to the one above, but in this case the output is printed to stdout.

$ curl http://localhost:9989/managers/metrics
{ "endpoints": ["/169.2.2.1/exporter/", "/169.2.2.1/node_exporter/"] }
$ curl http://localhost:9989/169.2.2.1/exporter/metrics
# HELP metrics_server_dev_requests_served Number of requests served by this metrics server
# TYPE metrics_server_dev_requests_served counter
metrics_server_dev_requests_served 1
# HELP metrics_server_dev_uptime_ms Uptime of the metrics server
# TYPE metrics_server_dev_uptime_ms gauge
metrics_server_dev_uptime_ms 27090

or, for the same result in PowerShell:

Invoke-WebRequest http://localhost:9989/managers/metrics
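
Putting the two requests together, here is a minimal shell sketch that scrapes every endpoint in one pass (it assumes the proxy from aether metric --serve is still running on localhost:9989 and that jq is installed):

#!/bin/sh
# List the current endpoints through the proxy, then scrape each one in turn.
for endpoint in $(curl -s http://localhost:9989/managers/metrics | jq -r '.endpoints[]'); do
    echo "=== ${endpoint} ==="
    # Appending "metrics" to an endpoint path fetches its Prometheus-format output.
    curl -s "http://localhost:9989${endpoint}metrics"
done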

Using Prometheus to access metrics

An even more user-friendly way to scrape metrics is to let Prometheus do it for you! For this we expose the aether metric --prometheus-template command, which generates a Prometheus configuration defining all the jobs that will scrape the endpoints for metrics.

$ aether metric --prometheus-template --serve-port 9989 .\

Note: The --serve-port and --serve-host options can be used to point at an aether metric --serve command running on a different machine.

The above command will create a file .\prometheus.yml, which can be supplied to Prometheus via prometheus --config.file=.\prometheus.yml. To see your metrics, navigate to http://localhost:9090.
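
Conceptually, the generated file contains one scrape job per endpoint, pointed at the --serve proxy. A purely illustrative sketch for the endpoints listed earlier (the actual file produced by --prometheus-template may differ) might look like:

# prometheus.yml - illustrative sketch only; the generated file may differ
scrape_configs:
  - job_name: 'hadean_exporter_169.2.2.1'
    metrics_path: '/169.2.2.1/exporter/metrics'
    static_configs:
      - targets: ['localhost:9989']
  - job_name: 'hadean_node_exporter_169.2.2.1'
    metrics_path: '/169.2.2.1/node_exporter/metrics'
    static_configs:
      - targets: ['localhost:9989']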

Dynamic scaling

Your simulation can scale up or down depending on the current load, which means that new machines might be created or existing machines might be brought down. As a result, the endpoints returned by --list (including the REST endpoint /managers/metrics) can become outdated, so you need to refresh the list of endpoints regularly to keep your view of the system up to date. For example, suppose you are running a simulation and want to see the current set of metrics. You execute aether metric --list and aether metric --get <endpoint> to get them, but then your simulation scales up (or down) due to a change in load. To see the new set of endpoints you must --list them again, since you only know about the endpoints from before the load changed.

Note: Even if you know where HadeanOS will spawn your next machine, you cannot assume that /<new-ip-address>/exporter/metrics/ is queryable, unless /managers/metrics was queried beforehand.
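
A minimal shell sketch of such a refresh loop (assuming aether metric --list prints one endpoint per line, as shown earlier):

while true; do
    # Refresh the list of endpoints, then scrape each one.
    for endpoint in $(aether metric --list); do
        aether metric --get "${endpoint}"
    done
    sleep 60    # wait a minute before refreshing again
done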

A more dynamic Prometheus config

So far we have shown how to create a static template file, but what if your simulation scales dynamically while Prometheus is scraping? For this we need a script which generates a new configuration file and asks Prometheus to reload it.

# generate the config
aether metric --prometheus-template --serve-port 9989 .\
# load it with Prometheus, while also enabling the `lifecycle` option
Start-Process prometheus -ArgumentList '--config.file=.\prometheus.yml','--web.enable-lifecycle'
for ( ; ; )
{
    # sleep for a few minutes
    Start-Sleep -Seconds 300
    # regenerate the config
    aether metric --prometheus-template --serve-port 9989 .\
    # tell Prometheus to reload the config file
    Invoke-WebRequest http://localhost:9090/-/reload -Method Post
}
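
If you are working from a Linux or macOS shell instead, an equivalent loop might look like this (a sketch under the same assumptions, writing the configuration to the current directory):

# generate the config
aether metric --prometheus-template --serve-port 9989 .
# load it with Prometheus, while also enabling the lifecycle endpoint
prometheus --config.file=./prometheus.yml --web.enable-lifecycle &
while true; do
    sleep 300                                                 # sleep for a few minutes
    aether metric --prometheus-template --serve-port 9989 .   # regenerate the config
    curl -X POST http://localhost:9090/-/reload               # tell Prometheus to reload
done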

Remote runs

For remote runs, you can simply add the --cluster <key> argument to your aether metric invocations.

Common errors

If you can't get your metrics, it is most likely because the simulation is not currently running.

If you get a 404 Not Found response while executing aether metric --get //endpoint/, make sure that:

  • --list is listing the endpoint.
  • The endpoint starts with two forward slashes (//).
  • The endpoint ends with a forward slash (/).

If you are still getting 404 errors, get in contact with us, since there might be a problem with the cluster.

The Hadean Gateway API can also return JSON objects with an error key set to the error which occurred. This usually means something has gone wrong on our side. Example: querying /managers/metrics can return { "error": "Failed to do X." }. We respect HTTP status codes: if something has gone wrong on our side we return a 5xx status code, and if the request is successful we return a 200 status code.
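
When scripting against the Gateway (for example through the --serve proxy), you can therefore branch on the status code before using the body; a minimal shell sketch:

# Fetch the endpoint list, keeping the body and the status code separate.
status=$(curl -s -o /tmp/endpoints.json -w '%{http_code}' http://localhost:9989/managers/metrics)
if [ "$status" -eq 200 ]; then
    cat /tmp/endpoints.json
else
    # A non-200 response usually carries an { "error": "..." } body.
    echo "Gateway returned HTTP $status:" >&2
    cat /tmp/endpoints.json >&2
fi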