SlapOS (introduction) is a general-purpose overlay operating system for distributed POSIX infrastructures. It is based on a Master and Slave design where the Master assigns services to Slave nodes. Slave nodes in turn process the list of services using buildout and send connection and consumption information as well as their monitoring status back to the Master. The monitoring status of each service is based on Promises, which will be explained in detail in this document.
This section will briefly introduce Promises and how they are used in SlapOS to monitor whether an instance is accessible or not.
import socket
from slapos.grid.promise import interface
from slapos.grid.promise.generic import GenericPromise
from zope.interface import implementer

@implementer(interface.IPromise)
class RunPromise(GenericPromise):
    def __init__(self, config):
        super(RunPromise, self).__init__(config)

    def sense(self):
        """
        Simply test if we can connect to specified host:port.
        """
        hostname = self.getConfig('hostname')
        port = int(self.getConfig('port'))
        addr = (hostname, port)
        try:
            socket.create_connection(addr).close()
        except (socket.herror, socket.gaierror) as e:
            self.logger.error("ERROR hostname/port (%s) is not correct: %s", addr, e)
        except (socket.error, socket.timeout) as e:
            self.logger.error("ERROR while connecting to %s: %s", addr, e)
        else:
            self.logger.info("port connection OK (%s)", addr)

    def anomaly(self):
        """
        There is an anomaly if last 3 senses were bad.
        """
        return self._anomaly(result_count=3, failure_amount=3)
Port Listening Promise
A Promise is a Python script that does some arbitrary work and then returns a promise result saying whether the promise succeeded or failed. A promise script can define configuration parameters which will be used to check the state. Promises are generated during instantiation in $instance_home/etc/plugin and Slapgrid then runs each of them according to its configuration to know whether an instance is working or not.
The simplest example of a promise is "check_if_port_listening", which tries to open a socket to an ip/port. If it works and no other promise is failing, the instance will be green. If the socket can't be created, slapgrid will raise PromiseError and report it to the SlapOS Master, and the instance will become red.
The promise system should be used in all SlapOS software and stacks to define as precisely as possible whether an instance is working or not.
We want to promote a simple, easy and standardised way of writing promise scripts that will verify the state of the system. These scripts can be launched by slapgrid and are configurable for each Software Release. Every promise has three parts:
The promise sensor collects the value of some monitoring aspects such as "if server is supposed to be started, get the response of an http request, else return 'server stopped' and in case of timeout return empty string".
The promise test is Green if the result of the promise sensor of the previous example is not empty, else Red. This ensures that a server that is started actually responds to http requests. There is no margin of tolerance for promise tests.
The promise anomaly detector is Green if at least one of the last three promise sensor values was not empty, else it is Red. This ensures that we call bang only if the server is really stopped, not if an Internet glitch happened.
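The difference between the promise test and the anomaly detector can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: sensor values are strings, an empty string means the server did not answer, and the function names promise_test and promise_anomaly are hypothetical and not part of slapos.

```python
def promise_test(sensor_values):
    """Green only if the most recent sensor value is not empty (no tolerance)."""
    return "Green" if sensor_values and sensor_values[-1] != "" else "Red"


def promise_anomaly(sensor_values, window=3):
    """Green if at least one of the last `window` sensor values is not empty."""
    recent = sensor_values[-window:]
    return "Green" if any(value != "" for value in recent) else "Red"


# A single glitch: the test fails at once, the anomaly detector tolerates it.
history = ["200 OK", "200 OK", ""]
print(promise_test(history), promise_anomaly(history))  # Red Green

# Three empty values in a row: both are Red, so calling bang would be justified.
history = ["200 OK", "", "", ""]
print(promise_test(history), promise_anomaly(history))  # Red Red
```

The window of three values is what gives the anomaly detector its tolerance, while the test looks at the latest value only.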
Note: Promises are what Buildout launches at the end. They return True or False. True means that one aspect of the partition is OK.
The following section will show how to add a Promise to a software release. This can either be an existing Promise from the SlapOS repository recipe folder or a new Promise written from scratch.
The recipe slapos.cookbook:promise.plugin can be used to generate promise scripts.
Using any of the existing promises requires adding a new section to the software release profile (and don't forget to add it to the parts list, too). For example:
[promise-check-site]
recipe = slapos.cookbook:promise.plugin
eggs =
  slapos.toolbox
output = ${directory:plugins}/promise-check-mysite-status.py
# module is the promise file name (without .py) in slapos.toolbox
module = check_site_state
config-site-url = ${publish:site-url}
config-connection-timeout = 20
config-foo = bar
This will generate a script which tests whether ${publish:site-url} is available, with a timeout of 20 seconds after which the promise fails. Passing config-foo = bar gives an example of how parameters are passed to the promise.
from slapos.grid.promise import interface
from slapos.grid.promise.generic import GenericPromise, TestResult, AnomalyResult
from zope.interface import implementer

@implementer(interface.IPromise)
class RunPromise(GenericPromise):
    def __init__(self, config):
        super(RunPromise, self).__init__(config)
        # run the promise every 2 minutes
        self.setPeriodicity(minute=2)

    def anomaly(self):
        """
        Called to detect if there is an anomaly.
        Return AnomalyResult or TestResult object
        # When AnomalyResult has failure bang is called if another promise didn't bang
        """
        # Example
        promise_result_list = self.getLastPromiseResultList(result_count=3, only_failure=True)
        if len(promise_result_list) > 2:
            return AnomalyResult(problem=True, message=promise_result_list[0][0]['message'])
        return AnomalyResult(problem=False, message="")
        # It's possible to use Generic helper methods
        # return self._anomaly(result_count=3, failure_amount=3)

    def sense(self):
        """
        Run the promise code and store the result in promise log file
        raise error, log error message, ... for failure
        """
        # DO SOMETHING...
        failed = True
        raised = False
        if failed:
            self.logger.error("ERROR while checking instance http server")
        else:
            self.logger.info("http server is OK")
        if raised:
            raise ValueError("Server URL is not correct")

    def test(self):
        """
        Test promise and say if problem is detected or not
        Return TestResult object
        """
        # Example
        promise_result_list = self.getLastPromiseResultList(result_count=1)[0]
        problem = False
        message = ""
        for result in promise_result_list:
            if result['status'] == 'ERROR' and not problem:
                problem = True
            message += "\n%s" % result['message']
        return TestResult(problem=problem, message=message)
        # It's possible to use Generic helper methods
        # return self._test(result_count=1, failure_amount=1)
This script is an example of a Promise in Python. Writing a Promise consists of defining a class called RunPromise which inherits from the GenericPromise class:
class RunPromise(GenericPromise):
and defining the methods anomaly(), sense() and test() inside this class. Python promises should be placed into the folder etc/plugin of the computer partition.
sense() runs the promise code with the given parameters, collects data for the promise whenever it makes sense and appends it to a log file.
test() reads the promise log and returns a TestResult object describing the actual promise state. The test method is called when Buildout processes a partition. A partition is marked as correctly processed if there are no Buildout failures and the test() of every promise passes.
anomaly() returns an AnomalyResult object which describes the promise state. The anomaly method is called by SlapGrid when the partition is correctly processed, to check that the partition has no anomaly. If AnomalyResult.hasFailed() is True, bang is called unless another promise of the same instance has already called bang.
...
    @abstractmethod
    def sense(self):
        """Run the promise code and log the result"""

    def anomaly(self):
        """Called to detect if there is an anomaly which require to bang."""
        return self._anomaly()

    def test(self):
        """Test promise and say if problem is detected or not"""
        return self._test()

    def run(self, check_anomaly=False, can_bang=True):
        """
        Method called to run the Promise
        @param check_anomaly: Say if anomaly method should be called
        @param can_bang: Set to True if bang can be called, this parameter should
          be set to False if bang is already called by another promise.
        """
        ...
The GenericPromise class contains the base implementation of a Promise and provides a method run() which reads the option 'check_anomaly' to enforce a call of anomaly() instead of test(). By default, running a promise script will call sense() to produce a result and test() to check the results. The option check_anomaly is used by buildout for the periodic promise check, when the partition is already well deployed.
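The call order described above can be sketched with a small stand-in class. MiniPromise is hypothetical and only mimics which methods run() calls; the real GenericPromise.run() also handles logging, periodicity and bang, which are left out here.

```python
class MiniPromise(object):
    """Hypothetical stand-in that only mimics the call order of run()."""

    def __init__(self):
        self.calls = []

    def sense(self):
        self.calls.append("sense")

    def test(self):
        self.calls.append("test")
        return "TestResult"

    def anomaly(self):
        self.calls.append("anomaly")
        return "AnomalyResult"

    def run(self, check_anomaly=False):
        # sense() always runs first to produce a fresh result
        self.sense()
        # then exactly one of anomaly() or test() evaluates the results
        return self.anomaly() if check_anomaly else self.test()


deploying = MiniPromise()
deploying.run()
print(deploying.calls)  # ['sense', 'test']

periodic = MiniPromise()
periodic.run(check_anomaly=True)
print(periodic.calls)  # ['sense', 'anomaly']
```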
In the future, GenericPromise will be improved to provide more methods that can be used in sense() to store promise graph data. This graph data will be used on the monitor interface to plot a chart of the promise result progression.
...
self.getConfig(key, default=None)
self.getLastPromiseResultList(latest_minute=0, result_count=COUNT, only_failure=False)
self._test(result_count=COUNT, failure_amount=XX, latest_minute=0)
self._anomaly(result_count=COUNT, failure_amount=XX, latest_minute=0)
...
Promises inherit the following methods from GenericPromise:
self.getTitle() - returns Promise title, eg. my_promise
self.getName() - returns Promise (file) name, eg. my_promise.py
self.getPromiseFile() - returns Promise file path
self.getPeriodicity() - returns current Promise periodicity
self.getLogFile() - returns path to log file
self.getLogFolder() - returns path to monitoring logs folder
self.getPartitionFolder() - returns base partition folder
self.getConfig(key, default=None) - returns configuration sent to Promise class
self.getLastPromiseResultList(latest_minute=0, result_count=COUNT, only_failure=False) - reads the promise log result group from the latest promise execution specified by COUNT. Set latest_minute to specify the maximum promise execution time to search. If only_failure is True, will only get failure messages.
self._test(result_count=COUNT, failure_amount=XX, latest_minute=0) - returns TestResult from latest Promise result
self._anomaly(result_count=COUNT, failure_amount=XX, latest_minute=0) - returns AnomalyResult from latest Promise result
The following inherited method can be called in the promise's __init__() after the line "GenericPromise.__init__(self, config)":
self.setPeriodicity(minute=XX) - changes the default periodicity to check promise anomaly
Note: if Anomaly and Test are disabled, the promise will raise because a promise must check something.
In your promise code, you will be able to call self.getConfig("site-url"), self.getConfig("connection-timeout") and self.getConfig("foo"). The returned value of self.getConfig(KEY) is None if the config parameter KEY is not set.
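As an illustration of how the config- options of the profile section relate to getConfig(), here is a minimal sketch. The helper extract_promise_config and the class ConfigHolder are hypothetical stand-ins written for this example; they are not the recipe's or GenericPromise's actual code. The sketch also assumes that values arrive as strings, which is why the port listening promise converts its port with int().

```python
def extract_promise_config(section_options, prefix="config-"):
    """Keep only the options starting with `prefix`, with the prefix removed."""
    return {
        key[len(prefix):]: value
        for key, value in section_options.items()
        if key.startswith(prefix)
    }


class ConfigHolder(object):
    """Hypothetical stand-in for the getConfig() behaviour described above."""

    def __init__(self, config):
        self._config = dict(config)

    def getConfig(self, key, default=None):
        return self._config.get(key, default)


options = {
    "module": "check_site_state",
    "config-site-url": "http://example.invalid/",
    "config-connection-timeout": "20",
    "config-foo": "bar",
}
promise = ConfigHolder(extract_promise_config(options))
print(promise.getConfig("foo"))  # bar
print(promise.getConfig("missing"))  # None
print(int(promise.getConfig("connection-timeout", 10)))  # 20
```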
from slapos.promise.plugin.check_site_state import RunPromise
Promise code must be committed to the slapos.toolbox repository. Please put your promise into the folder slapos/promise/plugin, so you can import them in a file in etc/plugin folder of your instance.
For debugging, the promise script added by monitor can be used to test promise execution without using slapgrid. The script will be exposed in the bin/ directory of the software release.
You can run a promise, using:
SR_DIRECTORY/bin/monitor.runpromise --config etc/monitor.conf --console --dry-run [ARG, ...]
Note that legacy promises are promises placed in PARTITION_DIRECTORY/etc/promise. They can be bash or other executable scripts. The promise launcher will use a special wrapper to call them as a subprocess, and the success or failure state will be based on the process return code (0 = success, non-zero = failure).
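The return code convention for legacy promises can be sketched as follows. This is not the wrapper used by the promise launcher, only a minimal illustration with the standard subprocess module; the function name run_legacy_promise and the 20 second timeout are assumptions made for the example.

```python
import subprocess
import sys


def run_legacy_promise(command, timeout=20):
    """Run an executable and return True on exit code 0, False otherwise."""
    try:
        completed = subprocess.run(command, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumption of this sketch: a promise that never finishes counts as failed.
        return False
    return completed.returncode == 0


# Two tiny Python one-liners stand in for legacy promise scripts.
ok = run_legacy_promise([sys.executable, "-c", "raise SystemExit(0)"])
ko = run_legacy_promise([sys.executable, "-c", "raise SystemExit(3)"])
print(ok, ko)  # True False
```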
To set the frequency of buildout runs, the software release should write a file named periodicity into the software release folder, containing the time period in seconds. For example, to process the partition every 12 hours, the file /opt/slapgrid/SR_MD5SUM/periodicity should contain 43200 (= 12h).
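The conversion from hours to the value stored in the periodicity file is simple arithmetic (12 x 60 x 60 = 43200). A minimal sketch, in which the function name write_periodicity is hypothetical and a temporary directory stands in for the software release folder:

```python
import os
import tempfile


def write_periodicity(software_release_dir, hours):
    """Write the period, in seconds, to <software_release_dir>/periodicity."""
    seconds = int(hours * 60 * 60)
    path = os.path.join(software_release_dir, "periodicity")
    with open(path, "w") as periodicity_file:
        periodicity_file.write(str(seconds))
    return seconds


# A temporary directory stands in for /opt/slapgrid/SR_MD5SUM in this sketch.
with tempfile.TemporaryDirectory() as sr_dir:
    print(write_periodicity(sr_dir, 12))  # 43200
```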
This section covers monitoring of partitions along with goals of running Promises correctly as well as things to avoid.
In normal conditions:
Running buildout on all partitions after a bang is supposed to converge to a stable state with all promises passing.
Slapgrid is configured to run promises at some interval of time which can be configured differently for each promise sensor (see before). SlapOS knows nothing about the results of running promise sensors. The only thing the Master knows is that a bang was issued due to anomaly detection.
The goal of monitoring is to provide a good quality of service by knowing about problems before customers tell us. This is done by ensuring that servers are alive and partitions are fulfilling all promises.
Servers should contact the master periodically to notify that they are alive. The master will show the state of each server according to a colour. A server is Green if it contacted the master within the last 5 minutes. If it contacted the master within the last hour, the server is Orange, else it's Red. From a monitoring point of view, the server contacts the master whenever Slapgrid connects to the SlapOS master, no matter what for.
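The server colour rule can be written down as a small function. This is a sketch of the rule as stated above, not the Master's code; the name server_colour is hypothetical and treating the 5 minute and 1 hour limits as inclusive is an assumption.

```python
from datetime import timedelta


def server_colour(time_since_last_contact):
    """Map the time since the last contact with the master to a colour."""
    if time_since_last_contact <= timedelta(minutes=5):
        return "Green"
    if time_since_last_contact <= timedelta(hours=1):
        return "Orange"
    return "Red"


print(server_colour(timedelta(minutes=2)))   # Green
print(server_colour(timedelta(minutes=30)))  # Orange
print(server_colour(timedelta(hours=3)))     # Red
```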
The master shows the state of each requested partition according to a colour. A partition is Green if all of the following hold: the latest result sent by Slapgrid for that partition is OK (meaning that all promises succeeded and there were no other failures), that message was sent less than one day ago and less than the buildout run frequency defined by the software release ago, and no bang was triggered after that. Else the partition is Red.
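The partition rule combines three conditions, which can be sketched as below. The function partition_colour and its parameter names are hypothetical; the sketch assumes that "less than one day and less than the software release frequency" means the stricter of the two limits applies.

```python
from datetime import datetime, timedelta


def partition_colour(result_ok, result_time, now,
                     software_release_period=None, last_bang_time=None):
    """Green only if the last result is OK, recent enough and not followed by a bang."""
    max_age = timedelta(days=1)
    if software_release_period is not None:
        # Both limits must hold, so the smaller one decides.
        max_age = min(max_age, software_release_period)
    fresh = now - result_time < max_age
    banged_since = last_bang_time is not None and last_bang_time > result_time
    return "Green" if result_ok and fresh and not banged_since else "Red"


now = datetime(2024, 1, 2, 12, 0)
two_hours_ago = now - timedelta(hours=2)
print(partition_colour(True, two_hours_ago, now))  # Green
print(partition_colour(True, two_hours_ago, now,
                       software_release_period=timedelta(hours=1)))  # Red
print(partition_colour(True, two_hours_ago, now,
                       last_bang_time=now - timedelta(hours=1)))  # Red
```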
Note 1: Buildout on a partition in SlapOS will be executed at least once per computer configurable frequency (usually one day) and at least once per software release configurable frequency (seldom configured).
Note 2: the computer configurable frequency of Buildout run must be stored on the Computer in SlapOS master at registration time and updated, else it is impossible to check promise fulfillment.
There are four monitoring crimes that every developer should keep in mind:
This section introduces the "Watchdog", a process that is monitoring other processes and can call "bang" to the Master.
Watchdog is a simple SlapOS Node feature that watches any process managed by supervisord. All process scripts in the PARTITION_DIRECTORY/etc/service directory are watched. They are automatically configured in supervisord with an added on-watch suffix on their process_name. Whenever one of them exits, watchdog will trigger an alert (bang) that is sent to the Master. Bang will force SlapGrid to reprocess all instances of the service. This also forces a recheck of all promises and posts the result to the master, letting the master decide whether the partition state is Green or Red.
Bang should be called as much as needed in a day by a partition. There should not be a limitation on the number of calls, else it's not possible to adapt dynamically. A Master protection against recurring bang calls should be considered, using a kind of quota per day which might depend on price or be defined in the software release. If the bang quota of the day is reached, the master will reject all future calls until the next day.
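The daily quota idea could look like the following sketch. Nothing here is an existing SlapOS Master feature: the class DailyBangQuota and its accept() method are hypothetical and only illustrate rejecting bang calls once the quota of the day is reached, then accepting them again the next day.

```python
from datetime import date


class DailyBangQuota(object):
    """Hypothetical per-partition counter that resets when the day changes."""

    def __init__(self, max_bangs_per_day):
        self.max_bangs_per_day = max_bangs_per_day
        self._day = None
        self._count = 0

    def accept(self, today):
        """Return True if a bang received on `today` is within the quota."""
        if today != self._day:
            # First bang of a new day: start counting from zero again.
            self._day = today
            self._count = 0
        if self._count >= self.max_bangs_per_day:
            return False
        self._count += 1
        return True


quota = DailyBangQuota(max_bangs_per_day=2)
monday = date(2024, 1, 1)
tuesday = date(2024, 1, 2)
print([quota.accept(monday) for _ in range(3)])  # [True, True, False]
print(quota.accept(tuesday))  # True
```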
As a bang will trigger a run of Buildout, Buildout, in theory, is run all the time repeatedly. This is why it is supposed to have 0 execution time (theoretical model). But since that would take 100% of CPU, we have to call it less often. So, we find ways to call it less often:
Buildout is actually called by SlapGrid. SlapGrid itself is called every Y (in theory, Y = 0, but in reality 1 minute). So, SlapGrid is called:
Currently bang has to go through the master. It is possible in future to consider a short cut that does not go through the master. But it is probably simpler and cleaner to run SlapProxy locally if one needs full autonomy.