    SlapOS Design Document - Understanding SlapOS Promises

    FINAL - A design document introducing Promises and how they are used in SlapOS.
    • Last Update: 2018-06-29
    • Version: 001
    • Language: en

    Understanding SlapOS Promises

    SlapOS (introduction) is a general purpose overlay operating system for distributed POSIX infrastructures. It is based on a Master and Slave design where the Master assigns services to Slave nodes. Slave nodes in turn process the list of services using buildout and send connection and consumption information as well as their monitoring status back to the Master. The monitoring status of each service is based on Promises, which are explained in detail in this document.

    Table of Contents

    • What is a Promise?
    • Adding a Promise to a Software Release
    • Monitoring Promises
    • Watchdog

    What is a Promise?

    This section will briefly introduce Promises and how they are used in SlapOS to monitor whether an instance is accessible or not.

    Example Promise

    from slapos.recipe.librecipe import GenericBaseRecipe
    from zc.buildout.easy_install import _safe_arg, script_header
    import sys
    
    template = script_header + r"""
    # BEWARE: This file is operated by slapgrid
    # BEWARE: It will be overwritten automatically
    import socket
    import sys
    
    addr = "%(hostname)s", %(port)s
    
    try:
      socket.create_connection(addr).close()
    except (socket.error, socket.timeout):
      sys.stderr.write("%%s on %%s isn't listening\n" %% addr)
      sys.exit(127)
    """
    
    class Recipe(GenericBaseRecipe):
      """
      Check listening port promise
      """
    
      def install(self):
        promise = self.createExecutable(self.options['path'], template % {
          'python': _safe_arg(sys.executable),
          'dash_S': '', # BBB buildout 1.x
          'hostname': self.options['hostname'],
          'port': self.options['port'],
          })
    
        return [promise]

    Port Listening Promise

    A Promise is an executable that does some arbitrary work and then exits with exit code 0 ("it works") or a non-zero exit code ("it doesn't work"). Promises are generated during instantiation in $instance_home/etc/promise and are then run from this directory by SlapGrid to determine whether an instance is working or not.

    The simplest example of a promise is "check_if_port_listening", which tries to open a socket to an IP address and port. If it succeeds, it exits with exit code 0 and SlapGrid knows that the instance is working. If the socket can't be created, it exits with a non-zero exit code and SlapGrid reports the failure to the SlapOS Master.
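
    The exit code convention described above can be sketched as follows. This is only an illustration of how a runner might interpret a promise executable, not SlapGrid's actual code; the helper name run_promise and the 20 second timeout are assumptions.

```python
import subprocess
import sys


def run_promise(argv, timeout=20):
    """Run one promise executable and return (ok, message).

    Hypothetical helper: exit code 0 means the promise passed, any
    other exit code (or a timeout) means it failed.
    """
    try:
        completed = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "promise timed out after %s seconds" % timeout
    # Promises report failures on stderr, like the port check above.
    message = completed.stderr.decode("utf-8", "replace").strip()
    return completed.returncode == 0, message


# Stand-ins for real promise scripts: one passing, one failing.
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
failing = [sys.executable, "-c",
           "import sys; sys.stderr.write('not listening'); sys.exit(127)"]
passing_result = run_promise(passing)
failing_result = run_promise(failing)
```

    Only the distinction between zero and non-zero matters here; the specific value 127 used by the port check carries no extra meaning in this sketch.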

    The promise system should be used in all SlapOS software releases and stacks to define as precisely as possible whether an instance is working or not.

    Promise Parts

    • Promise sensor
    • Promise test
    • Promise anomaly detector

    We want to promote a simple, easy and standardised way of writing promise scripts that will verify the state of the system. These scripts can be launched by cron and are configurable for each Software Release. Every promise has three parts:

    The promise sensor collects the value of some monitoring aspects such as "if server is supposed to be started, get the response of an http request, else return 'server stopped' and in case of timeout return empty string".

    The promise test is Green if the result of the promise sensor of the previous example is not empty, else Red. This ensures that a server that is started actually responds to http requests. There is no margin of tolerance for promise tests.

    The promise anomaly detector is Green if at least one of the last three promise sensor values was not empty, else it is Red. This ensures that we call bang only if the server is really stopped, not if an Internet glitch happened.
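
    The difference between the promise test and the anomaly detector in the example above can be sketched in a few lines. The function names and the list-of-strings history (oldest value first) are assumptions made for illustration only.

```python
def promise_test(sensor_values):
    """Green if the most recent sensor value is not empty, else Red."""
    if sensor_values and sensor_values[-1] != "":
        return "Green"
    return "Red"


def anomaly_detector(sensor_values, window=3):
    """Green if any of the last `window` sensor values is not empty."""
    recent = sensor_values[-window:]
    if any(value != "" for value in recent):
        return "Green"
    return "Red"


# One good answer followed by two timeouts: the test fails immediately,
# but the anomaly detector still tolerates the glitch.
history = ["200 OK", "", ""]
test_state = promise_test(history)
anomaly_state = anomaly_detector(history)
```

    With a third consecutive timeout appended to the history, the anomaly detector would also turn Red, which is the point at which bang would be called.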

    Note: Promises are what Buildout launches at the end. They return True or False. True means that one aspect of the partition is OK. Cron does not launch the Promises directly, but anomaly detectors. Very often, anomaly detector and Promises are the same executable with the same result, but not always. Therefore, the two concepts are different. What they have in common is that they often sense the same thing. But detecting an anomaly is not the same as detecting that a promise is initially met.

    Adding a Promise to a Software Release

    The following section will show how to add a Promise to a software release. This can either be an existing Promise from the SlapOS repository recipe folder or a new Promise written from scratch.

    Adding Existing Promises to a Software Release

    The recipe slapos.cookbook:promise.plugin can be used to generate promise scripts.

    Using any of the existing promises requires adding a new section to the software release profile (and don't forget to add it to the parts list, too). For example:

    [promise-check-site]
    recipe = slapos.cookbook:promise.plugin
    eggs =
      slapos.toolbox
    output = ${directory:plugins}/promise-check-mysite-status.py
    content = 
      from slapos.promise.plugin.check_site_state import RunPromise
    config-site-url = ${publish:site-url}
    config-connection-timeout = 20
    config-foo = bar
    mode = 600

    This will generate a script that tests whether ${publish:site-url} is available. The check times out after 20 seconds, which causes the promise to fail. Passing config-foo = bar is an example of how parameters are passed to the promise.
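
    As a rough illustration of how the config-* options could end up as promise parameters, the sketch below strips the config- prefix from a section's options. This is an assumption about the mapping, inferred from the option names in the profile, and not the recipe's actual implementation; the function name and the example URL are hypothetical.

```python
def promise_config_from_options(options, prefix="config-"):
    """Collect `config-*` buildout options into a promise configuration dict."""
    return {
        key[len(prefix):]: value
        for key, value in options.items()
        if key.startswith(prefix)
    }


# Options as buildout would hand them to a recipe (all values are strings).
options = {
    "recipe": "slapos.cookbook:promise.plugin",
    "config-site-url": "https://example.invalid/",  # placeholder URL
    "config-connection-timeout": "20",
    "config-foo": "bar",
    "mode": "600",
}
config = promise_config_from_options(options)
```

    Note that buildout option values are strings, so a promise that needs a numeric timeout would have to convert the value itself.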

    Add New Promise to a Software Release

    from zope.interface import implementer
    from slapos.grid.promise import interface
    from slapos.grid.promise.generic import GenericPromise, TestResult, AnomalyResult
    
    # The implementer decorator works on both Python 2 and Python 3,
    # unlike a class-body implements() call.
    @implementer(interface.IPromise)
    class RunPromise(GenericPromise):
    
    
      def __init__(self, config):
        GenericPromise.__init__(self, config)
        # run the promise every 2 minutes
        self.setPeriodicity(minute=2)
    
      def anomaly(self):
        """
          Called to detect if there is an anomaly.
          Return AnomalyResult or TestResult object
          # When AnomalyResult has failure bang is called if another promise didn't bang
        """
    
        # Example
        promise_result_list = self.getLastPromiseResultList(result_count=3, only_failure=True)
        if len(promise_result_list) > 2:
          return AnomalyResult(problem=True, message=promise_result_list[0][0]['message'])
        return AnomalyResult(problem=False, message="")
    
        # It's possible to use Generic helper methods
        # return self._anomaly(result_count=3, failure_amount=3)
    
      def sense(self):
        """
          Run the promise code and store the result
            raise error, log error message, ... for failure
        """
    
        # DO SOMETHING...
        failed = True
        raised = False
        if failed:
          self.logger.error("ERROR while checking instance http server")
        else:
          self.logger.info("http server is OK")
        if raised:
          raise ValueError("Server URL is not correct")
    
      def test(self):
        """
          Test promise and say if problem is detected or not
          Return TestResult object
        """
    
        # Example
        promise_result_list = self.getLastPromiseResultList(result_count=1)[0]
        problem = False
        message = ""
        for result in promise_result_list:
          if result['status'] == 'ERROR' and not problem:
            problem = True
          message += "\n%s" % result['message']
    
        return TestResult(problem=problem, message=message)
    
        # It's possible to use Generic helper methods
        # return self._test(result_count=1, failure_amount=1)

    This script is an example of a Promise in Python. Writing a Promise consists of defining a class called RunPromise:

    class RunPromise(GenericPromise):

    which inherits from the GenericPromise class and defines the methods anomaly(), sense() and test().

    Python promises should be placed into the folder etc/plugin of the computer partition.

    sense() runs the promise with the given frequency, collects data for the promise whenever it makes sense and appends it to a log file.

    test() returns a TestResult object describing the actual promise state. The test() method is called when Buildout processes a partition. A partition is marked as correctly processed if there are no Buildout failures and the test() method of every promise passes.

    anomaly() returns the AnomalyResult object which describes the promise state. The anomaly() method is called by SlapGrid when the partition is correctly processed, to check that the partition has no anomaly. If AnomalyResult.hasFailed() is True, bang is called unless another promise of the same instance has already called it.

    GenericPromise

    ...
      @abstractmethod
      def sense(self):
        """Run the promise code and log the result"""
    
      def anomaly(self):
        """Called to detect if there is an anomaly which require to bang."""
        return self._anomaly()
    
      def test(self):
        """Test promise and say if problem is detected or not"""
        return self._test()
    
      def run(self, check_anomaly=False, can_bang=True):
        """
          Method called to run the Promise
          @param check_anomaly: Say if anomaly method should be called
          @param can_bang: Set to True if bang can be called, this parameter should
            be set to False if bang is already called by another promise.
        """
        ...

    The GenericPromise class contains the base implementation of a Promise and provides a method run(), which reads the option 'check_anomaly' to force a call to anomaly() instead of test(). By default, running a promise script calls sense() and test(). The check_anomaly option is used by Buildout for periodic promise checks, when the partition is already correctly deployed.
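
    The control flow of run() described above can be modelled with a small stand-in class. ToyPromise is purely hypothetical and deliberately simplified (boolean results instead of TestResult and AnomalyResult objects); it is not the slapos implementation.

```python
class ToyPromise(object):
    """Simplified stand-in for GenericPromise, for illustration only."""

    def __init__(self):
        self.calls = []  # records which methods were invoked, in order

    def sense(self):
        self.calls.append("sense")

    def test(self):
        self.calls.append("test")
        return False  # False: no problem detected

    def anomaly(self):
        self.calls.append("anomaly")
        return True  # True: an anomaly was detected

    def bang(self):
        self.calls.append("bang")

    def run(self, check_anomaly=False, can_bang=True):
        # sense() always runs first and stores its result.
        self.sense()
        if check_anomaly:
            problem = self.anomaly()
            # Bang only if no other promise has already done so.
            if problem and can_bang:
                self.bang()
        else:
            problem = self.test()
        return problem


default_run = ToyPromise()
default_run.run()
periodic_run = ToyPromise()
periodic_run.run(check_anomaly=True)
```

    Passing can_bang=False models the case where another promise of the same instance has already called bang.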

    In the future, GenericPromise will be improved to provide more methods that can be used in sense() to store promise graph data. This graph data will be used in the monitor interface to plot a chart of promise result progression.

    Methods Available in Promise Class

    ...
    self.getConfig(key, default=None)
    self.getLastPromiseResultList(latest_minute=0, result_count=COUNT, only_failure=False)
    self._test(result_count=COUNT, failure_amount=XX, latest_minute=0)
    self._anomaly(result_count=COUNT, failure_amount=XX, latest_minute=0)
    ...
    

    Promises inherit the following methods from GenericPromise:

    • self.getTitle() - returns Promise title, eg. my_promise
    • self.getName() - returns Promise (file) name, eg. my_promise.py
    • self.getPromiseFile() - returns Promise file path
    • self.getPeriodicity() - returns current Promise periodicity
    • self.setPeriodicity(minute=XX) - sets Promise periodicity in minutes; call it in __init__()
    • self.getLogFile() - returns path to the log file
    • self.getLogFolder() - returns path to monitoring logs folder
    • self.getPartitionFolder() - returns base partition folder
    • self.getConfig(key, default=None) - returns configuration sent to Promise class
      Default configuration keys available are: partition-id, computer-id, partition-key, partition-cert and master-url.
    • self.getLastPromiseResultList(latest_minute=0, result_count=COUNT, only_failure=False) - reads the promise log and returns the result groups of the latest COUNT promise executions. Set latest_minute to limit how many minutes back to search. If only_failure is True, only failure messages are returned.
    • self._test(result_count=COUNT, failure_amount=XX, latest_minute=0) - returns TestResult from latest Promise result
    • self._anomaly(result_count=COUNT, failure_amount=XX, latest_minute=0) - returns AnomalyResult from latest Promise result

    In your promise code, you will be able to call self.getConfig("site-url"), self.getConfig("connection-timeout") and self.getConfig("foo"). The returned value of self.getConfig(KEY) is None if the config parameter KEY is not set.

    Developing Python Promises

    from slapos.promise.plugin.my_promise_check_site import RunPromise

    Promise code must be committed to the slapos.toolbox repository. Please put your promise into the folder slapos/promise/plugin, so that you can import it from a file in the etc/plugin folder.

    For debugging, the promise script provided by monitor can be used to test promise execution without using slapgrid. The script is exposed in the bin/ directory of the software release.

    You can run a promise, using:

    SR_DIRECTORY/bin/monitor.runpromise --config etc/monitor.conf --console --dry-run [ARG, ...]

    Note that legacy promises are promises placed in PARTITION_DIRECTORY/etc/promise; they can be bash scripts or any other executables. The promise launcher uses a special wrapper to call them as a subprocess, and the success or failure state is based on the process return code (0 = success, non-zero = failure).

    To set the frequency of Buildout runs, the software release should write a file named periodicity into the software release folder, containing the time period in seconds. For example, to process the partition every 12 hours, the file /opt/slapgrid/SR_MD5SUM/periodicity should contain 43200 (12 hours).
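
    The arithmetic and the file handling can be sketched as below. The helper names are hypothetical, a temporary directory stands in for /opt/slapgrid/SR_MD5SUM, and the exact file layout (a single integer followed by a newline) is an assumption.

```python
import os
import tempfile


def write_periodicity(software_release_dir, seconds):
    """Write the Buildout run period, in seconds, to the `periodicity` file."""
    path = os.path.join(software_release_dir, "periodicity")
    with open(path, "w") as periodicity_file:
        periodicity_file.write("%d\n" % seconds)
    return path


def read_periodicity(software_release_dir, default=None):
    """Return the period in seconds, or `default` if the file is unusable."""
    path = os.path.join(software_release_dir, "periodicity")
    try:
        with open(path) as periodicity_file:
            return int(periodicity_file.read().strip())
    except (IOError, OSError, ValueError):
        return default


twelve_hours = 12 * 60 * 60  # 12 h * 60 min * 60 s = 43200 seconds
demo_dir = tempfile.mkdtemp()  # stands in for /opt/slapgrid/SR_MD5SUM
write_periodicity(demo_dir, twelve_hours)
stored_period = read_periodicity(demo_dir)
```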

    Monitoring Promises

    This section covers monitoring of partitions along with goals of running Promises correctly as well as things to avoid.

    Controlling Partition Status

    • Periodic Instantiation
    • Periodic Promise sensors
    • Bang

    In normal conditions:

    • Instantiation runs periodically (at least once in an interval of computer configurable frequency which is usually 24 hours), running promises and posting to master, hence showing signs of life.
    • Slapgrid periodically runs a set of promise sensors and, when an anomaly is detected in a promise sensor value, calls bang on the partition.
    • Upon call of bang, a run of partition instantiation is scheduled by SlapOS Master on all partitions that belong to the same software instance tree.

    Running buildout on all partitions after a bang is supposed to converge to a stable state with all promises passing.

    Slapgrid runs promises at an interval that can be configured differently for each promise sensor (see above). The SlapOS Master knows nothing about the results of running promise sensors. The only thing the Master knows is that a bang was issued due to anomaly detection.
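
    A per-promise interval can be pictured as a simple "is it due yet" check on each run. The sketch below is an illustration under assumptions: the function name, the promise names and the use of plain minute counters are all hypothetical.

```python
def due_promises(periodicity_minutes, last_run_minute, now_minute):
    """Return the names of promise sensors whose period has elapsed.

    periodicity_minutes maps promise name to its period in minutes;
    last_run_minute maps promise name to the minute it last ran.
    """
    due = []
    for name, period in sorted(periodicity_minutes.items()):
        last_run = last_run_minute.get(name)
        # A promise that never ran is always due.
        if last_run is None or now_minute - last_run >= period:
            due.append(name)
    return due


periods = {"check-port": 2, "check-site": 10}
last_runs = {"check-port": 8, "check-site": 5}
due_now = due_promises(periods, last_runs, now_minute=10)
```

    At minute 10 only check-port is due; check-site, which last ran at minute 5 with a 10 minute period, becomes due at minute 15.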

    Monitoring Goals

    • Servers are alive
    • Partitions are fulfilling all promises

    The goal of monitoring is to provide a good quality of service by detecting problems before a customer reports them. This is done by ensuring that servers are alive and partitions are fulfilling all promises.

    Alive servers

    Servers should contact the master periodically to notify it that they are alive. The master shows the state of each server using a colour. A server is Green if it contacted the master within the last 5 minutes. If it contacted the master within the last hour, the server is Orange; otherwise it is Red. From a monitoring point of view, the server contacts the master whenever Slapgrid connects to the SlapOS Master, no matter what for.
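
    The colour rule for servers translates directly into a small function. This is a sketch of the rule as stated, not the Master's code; the function name and the treatment of the exact boundaries (inclusive) are assumptions.

```python
from datetime import datetime, timedelta


def server_colour(last_contact, now):
    """Map the time since the last contact with the master to a colour."""
    elapsed = now - last_contact
    if elapsed <= timedelta(minutes=5):
        return "Green"
    if elapsed <= timedelta(hours=1):
        return "Orange"
    return "Red"


now = datetime(2018, 6, 29, 12, 0)
recent = server_colour(now - timedelta(minutes=2), now)
stale = server_colour(now - timedelta(minutes=30), now)
silent = server_colour(now - timedelta(hours=2), now)
```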


    Fulfilled promises

    The master shows the state of each requested partition using a colour. A partition is Green if all of the following hold: the latest result sent by Slapgrid for that partition is OK (meaning that all promises succeeded and there were no other failures), that message was sent less than one day ago and less than one buildout run period (as defined by the software release) ago, and no bang was triggered after it. Otherwise the partition is Red.

    Note 1: Buildout on a partition in SlapOS will be executed at least once per computer configurable frequency (usually one day) and at least once per software release configurable frequency (seldom configured).

    Note 2: the computer configurable frequency of Buildout run must be stored on the Computer in SlapOS master at registration time and updated, else it is impossible to check promise fulfillment.
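
    The conditions for a Green partition can be combined as in the sketch below. The function and parameter names are hypothetical, and using the computer configurable frequency (defaulting to one day) as the age limit is an interpretation of the rule and the notes above.

```python
from datetime import datetime, timedelta


def partition_colour(last_result_ok, last_result_time, now,
                     computer_period=timedelta(days=1),
                     software_release_period=None,
                     last_bang_time=None):
    """Return "Green" only if every condition for a healthy partition holds."""
    if not last_result_ok:
        return "Red"
    # The result must be younger than both configured periods.
    max_age = computer_period
    if software_release_period is not None:
        max_age = min(max_age, software_release_period)
    if now - last_result_time >= max_age:
        return "Red"
    # A bang issued after the last result invalidates it.
    if last_bang_time is not None and last_bang_time > last_result_time:
        return "Red"
    return "Green"


now = datetime(2018, 6, 29, 12, 0)
one_hour_ago = now - timedelta(hours=1)
healthy = partition_colour(True, one_hour_ago, now)
banged = partition_colour(True, one_hour_ago, now,
                          last_bang_time=now - timedelta(minutes=30))
```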

    Monitoring Crimes

    • Buildout runs all the time without ever going to sleep
    • Run all promises every minute
    • Always failing promises
    • Buildout taking too long to process a computer partition

    There are four monitoring crimes that every developer should keep in mind:

    • Buildout runs all the time without ever going to sleep
      If Buildout runs all the time, too many resources are consumed, which can overload the server. One should take care that all promises of the Software Release can be satisfied.
    • Run all promises every minute
      It is not necessary to run all promises in monitor every minute. Instead, the frequency should be configurable and set for each promise.
    • Always failing promises
      If a promise never reaches the stage where it passes, the SR is badly implemented and should be reviewed.
    • Buildout taking too long to process a computer partition
      Buildout should process a computer partition in a short time, otherwise it prevents responsive provisioning of other partitions. The time to process a computer partition should be less than one minute.

    Watchdog

    This section introduces the "Watchdog", a process that is monitoring other processes and can call "bang" to the Master.

    Watchdog Explained

    Watchdog is a simple SlapOS Node feature that watches any process managed by supervisord. All process scripts in the PARTITION_DIRECTORY/etc/service directory are watched. They are automatically configured in supervisord with an on-watch suffix added to their process_name. Whenever one of them exits, watchdog triggers an alert (bang) that is sent to the Master. Bang forces SlapGrid to reprocess all instances of the service. This also forces a recheck of all promises and posts the result to the master, letting the master decide whether the partition state is Green or Red.
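
    The decision the watchdog has to make can be sketched against the text format of supervisord event notifications (space-separated key:value tokens in a header line and a payload line). This is not the SlapOS watchdog code: the function names are hypothetical, the exact suffix spelling "-on-watch" is an assumption, and the sample process and group names are made up.

```python
WATCH_SUFFIX = "-on-watch"  # assumed spelling of the suffix
BANG_EVENT = "PROCESS_STATE_EXITED"  # supervisord event for an exited process


def parse_tokens(line):
    """Parse a supervisord 'key:value key:value' line into a dict."""
    return dict(
        token.split(":", 1) for token in line.split() if ":" in token
    )


def should_bang(header_line, payload_line):
    """True if the event reports the exit of a watched process."""
    header = parse_tokens(header_line)
    payload = parse_tokens(payload_line)
    return (header.get("eventname") == BANG_EVENT
            and payload.get("processname", "").endswith(WATCH_SUFFIX))


header = ("ver:3.0 server:supervisor serial:21 pool:watchdog "
          "poolserial:10 eventname:PROCESS_STATE_EXITED len:84")
watched = ("processname:apache-on-watch groupname:slappart0 "
           "from_state:RUNNING expected:0 pid:2766")
unwatched = ("processname:cron groupname:slappart0 "
             "from_state:RUNNING expected:0 pid:2767")
bang_for_watched = should_bang(header, watched)
bang_for_unwatched = should_bang(header, unwatched)
```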

    Bang

    • Called explicitly (eg. by a Promise or a Service)
    • Called implicitly when a process watched by Watchdog changes to an unexpected state

    A partition should be able to call bang as often as needed in a day. There should be no hard limit on the number of calls, otherwise it is not possible to adapt dynamically. A Master protection against recurring bang calls should nevertheless be considered, using a kind of daily quota that might depend on price or be defined in the software release. If the bang quota of the day is reached, the master rejects all further calls until the next day.
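
    Since the quota is only a proposal here, the following is a hypothetical sketch of what such a protection could look like on the Master side: a per-partition counter that resets when the day changes. The class name and its interface are assumptions.

```python
from datetime import date


class BangQuota(object):
    """Hypothetical daily quota for bang calls of one partition."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self._day = None
        self._count = 0

    def accept(self, today):
        """Return True if a bang received on `today` is accepted."""
        if today != self._day:
            # First call of a new day: reset the counter.
            self._day = today
            self._count = 0
        if self._count >= self.daily_limit:
            return False
        self._count += 1
        return True


quota = BangQuota(daily_limit=2)
day_one = date(2018, 6, 29)
answers = [quota.accept(day_one) for _ in range(3)]
next_day_answer = quota.accept(date(2018, 6, 30))
```

    With a limit of two, the third bang of the day is rejected and the first bang of the following day is accepted again.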

    As a bang will trigger a run of Buildout, Buildout, in theory, is run all the time repeatedly. This is why it is supposed to have 0 execution time (theoretical model). But since that would take 100% of CPU, we have to call it less often. So, we find ways to call it less often:

    • every X (this can be configured at the profile level)
    • if promises are not all satisfied
    • if requested services are not available
    • as the result of bang

    Buildout is actually called by SlapGrid. SlapGrid itself is called every Y (in theory, Y = 0, but in reality 1 minute). So, SlapGrid is called:

    • at least every minute
    • right after a SlapGrid call if something happened in the previous call (eg. request of new service, failing Promise) with an increasing delay to reduce CPU load
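
    The "increasing delay" between consecutive SlapGrid runs could, for instance, follow a capped doubling scheme. The text does not say how the delay grows, so the doubling, the one second starting value and the function name below are all assumptions; only the one minute upper bound comes from the list above.

```python
def delay_before_next_run(consecutive_eventful_runs, base_delay=1, max_delay=60):
    """Seconds to wait before the next SlapGrid run.

    consecutive_eventful_runs counts how many runs in a row reported
    that something happened (new request, failing promise, ...).
    """
    if consecutive_eventful_runs <= 0:
        # Nothing happened: fall back to the regular one minute cycle.
        return max_delay
    # Double the delay after each eventful run, never exceeding the cycle.
    return min(base_delay * 2 ** (consecutive_eventful_runs - 1), max_delay)


delays = [delay_before_next_run(n) for n in range(0, 8)]
```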

    Currently bang has to go through the master. In the future, a shortcut that does not go through the master could be considered. But it is probably simpler and cleaner to run SlapProxy locally if one needs full autonomy.

    Thank You

    • Nexedi SA
    • 147 Rue du Ballon
    • 59110 La Madeleine
    • France