==================
 Monitoring Guide
==================

.. contents::
    :local:

Introduction
============

There are several tools available to monitor and inspect Celery clusters.
This document describes some of these, as well as
features related to monitoring, like events and broadcast commands.


Monitoring and Inspecting Workers
=================================

celeryctl
---------

* Listing active nodes in the cluster
    ::

        $ celeryctl status

* Show the result of a task
    ::

        $ celeryctl result -t tasks.add 4e196aa4-0141-4601-8138-7aa33db0f577

    Note that you can omit the name of the task as long as the
    task doesn't use a custom result backend.

* Listing all tasks that are currently being executed
    ::

        $ celeryctl inspect active

* Listing scheduled ETA tasks
    ::

        $ celeryctl inspect scheduled

    These are tasks reserved by the worker because they have the
    ``eta`` or ``countdown`` argument set.

* Listing reserved tasks
    ::

        $ celeryctl inspect reserved

    This will list all tasks that have been prefetched by the worker
    and are currently waiting to be executed (does not include tasks
    with an eta).

* Listing the history of revoked tasks
    ::

        $ celeryctl inspect revoked

* Show registered tasks
    ::

        $ celeryctl inspect registered_tasks

* Showing statistics
    ::

        $ celeryctl inspect stats

* Diagnosing the worker pools
    ::

        $ celeryctl inspect diagnose

    This will verify that the worker's pool processes are available
    to do work. Note that this will not work if the worker is busy.

* Enabling/disabling events
    ::

        $ celeryctl inspect enable_events
        $ celeryctl inspect disable_events


By default the inspect commands operate on all workers.
You can specify a single worker, or a list of workers, by using the
``--destination`` argument::

    $ celeryctl inspect -d w1,w2 reserved


:Note: All ``inspect`` commands support the ``--timeout`` argument,
       which is the number of seconds to wait for responses.
       You may have to increase this timeout if you're getting empty responses
       due to latency.
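
When these commands are driven from a script, it can help to assemble the
argument list in one place. The following is a minimal sketch and not part of
Celery: the helper name ``build_inspect_command`` is hypothetical, and it only
builds the ``celeryctl inspect`` invocation described above from the ``-d``
and ``--timeout`` options.

```python
def build_inspect_command(method, destination=None, timeout=None):
    """Return the argv list for a ``celeryctl inspect`` call.

    Hypothetical helper: it only assembles the command line and
    does not contact any worker itself.
    """
    argv = ["celeryctl", "inspect"]
    if destination:
        # Several workers are passed as one comma-separated value.
        argv += ["-d", ",".join(destination)]
    if timeout is not None:
        # Number of seconds to wait for replies from the workers.
        argv += ["--timeout", str(timeout)]
    argv.append(method)
    return argv
```

The returned list could then be passed to ``subprocess.call`` on a machine
where ``celeryctl`` is installed.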

Django Admin
------------

TODO

celeryev
--------

TODO

celerymon
---------

TODO

Monitoring and Inspecting RabbitMQ
==================================

To manage a Celery cluster it is important to know how
RabbitMQ can be monitored.

RabbitMQ ships with the `rabbitmqctl(1)`_ command.
With it you can list queues, exchanges, bindings,
queue lengths and the memory usage of each queue, as well
as manage users, virtual hosts and their permissions.

:Note: The default virtual host (``"/"``) is used in these
       examples. If you use a custom virtual host you have to add
       the ``-p`` argument to the command, e.g.:
       ``rabbitmqctl list_queues -p my_vhost ....``


.. _`rabbitmqctl(1)`: http://www.rabbitmq.com/man/rabbitmqctl.1.man.html

Inspecting queues
-----------------

Finding the number of tasks in a queue::


    $ rabbitmqctl list_queues name messages messages_ready \
                              messages_unacknowledged


Here ``messages_ready`` is the number of messages ready
for delivery (sent but not received), and ``messages_unacknowledged``
is the number of messages that have been received by a worker but
not acknowledged yet (meaning they are in progress, or have been reserved).
``messages`` is the sum of ready and unacknowledged messages.


Finding the number of workers currently consuming from a queue::

    $ rabbitmqctl list_queues name consumers

Finding the amount of memory allocated to a queue::

    $ rabbitmqctl list_queues name memory

:Tip: Adding the ``-q`` option to `rabbitmqctl(1)`_ makes the output
      easier to parse.
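
The output of ``rabbitmqctl list_queues -q`` can be turned into records with a
few lines of Python. This is a sketch under stated assumptions: it expects one
queue per line with tab-separated values in the same order as the columns
requested on the command line, the helper name ``parse_list_queues`` is
hypothetical, and the sample text is made up for illustration.

```python
def parse_list_queues(output, columns):
    """Parse ``rabbitmqctl list_queues -q`` output into dicts.

    Assumes tab-separated values, one queue per line, in the same
    order as the column names given on the command line.
    """
    queues = []
    for line in output.splitlines():
        if not line.strip():
            continue  # Skip blank lines.
        row = dict(zip(columns, line.split("\t")))
        # Every column requested here except the queue name is a count.
        for key in columns:
            if key != "name":
                row[key] = int(row[key])
        queues.append(row)
    return queues


COLUMNS = ["name", "messages", "messages_ready", "messages_unacknowledged"]
# Made-up sample output, for illustration only.
SAMPLE = "celery\t12\t8\t4\nvideo\t0\t0\t0\n"

queues = parse_list_queues(SAMPLE, COLUMNS)
```

In real use, ``SAMPLE`` would be replaced by the captured standard output of
the ``rabbitmqctl`` command.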

Munin
=====

This is a list of known Munin plugins that can be useful when
maintaining a Celery cluster.

* rabbitmq-munin: Munin-plugins for RabbitMQ.

    http://github.com/ask/rabbitmq-munin

* celery_tasks: Monitors the number of times each task type has
  been executed (requires ``celerymon``).

    http://exchange.munin-monitoring.org/plugins/celery_tasks-2/details

* celery_task_states: Monitors the number of tasks in each state
  (requires ``celerymon``).

    http://exchange.munin-monitoring.org/plugins/celery_tasks/details

Events
======

The worker has the ability to send a message whenever some event
happens. These events are then captured by tools like ``celerymon`` and 
``celeryev`` to monitor the cluster.

Snapshots
---------

Even a single worker can produce a huge number of events, so storing
the full history of these events on disk may be impractical.

A sequence of events describes the cluster state in that time period.
By taking periodic snapshots of this state we can capture all interesting
information, but only periodically write it to disk.
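
The idea can be illustrated without Celery at all. In the sketch below, state
is updated in memory for every event, while the (possibly expensive) snapshot
callback runs at most once per ``freq`` seconds. The class name
``SnapshotTaker`` is hypothetical, and driving the interval from event
timestamps is a simplification chosen for this example, not a description of
how ``celeryev`` schedules snapshots.

```python
class SnapshotTaker(object):
    """Keep state in memory and hand it to ``on_snapshot`` periodically."""

    def __init__(self, on_snapshot, freq=2.0):
        self.on_snapshot = on_snapshot
        self.freq = freq
        self.event_count = 0
        self.last_snapshot = None

    def event(self, timestamp):
        """Record one event, taking a snapshot if ``freq`` has elapsed."""
        if self.last_snapshot is None:
            self.last_snapshot = timestamp
        self.event_count += 1
        if timestamp - self.last_snapshot >= self.freq:
            self.on_snapshot(self.event_count)
            self.last_snapshot = timestamp


snapshots = []
taker = SnapshotTaker(snapshots.append, freq=2.0)
for ts in [0.0, 0.5, 1.0, 2.0, 2.5, 4.5]:
    taker.event(ts)
```

Six events arrive here, but the callback is invoked only twice, which is the
saving that snapshots are meant to provide.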

To take snapshots you need a Camera class. With this you can define
what should happen every time the state is captured: you can
write it to a database, send it by e-mail, or do something else entirely.

``celeryev`` is then used to take snapshots with the camera.
For example, if you want to capture state every 2 seconds using the
camera ``myapp.Camera``, you run ``celeryev`` with the following arguments::

    $ celeryev -c myapp.Camera --frequency=2.0

Custom Camera
~~~~~~~~~~~~~

Here is an example camera that simply dumps the snapshot to the screen:

.. code-block:: python

    from pprint import pformat

    from celery.events.snapshot import Polaroid

    class DumpCam(Polaroid):

        def shutter(self, state):
            if not state.event_count:
                # No new events since last snapshot.
                return
            print("Workers: %s" % (pformat(state.workers, indent=4), ))
            print("Tasks: %s" % (pformat(state.tasks, indent=4), ))
            print("Total: %s events, %s tasks" % (
                state.event_count, state.task_count))

Now you can use this cam with ``celeryev`` by specifying
it with the ``-c`` option::

    $ celeryev -c myapp.DumpCam --frequency=2.0

Or you can use it programmatically like this::

    from celery.events import EventReceiver
    from celery.messaging import establish_connection
    from celery.events.state import State
    from myapp import DumpCam

    def main():
        state = State()
        with establish_connection() as connection:
            recv = EventReceiver(connection, handlers={"*": state.event})
            with DumpCam(state, freq=1.0):
                recv.capture(limit=None, timeout=None)

    if __name__ == "__main__":
        main()

Event Reference
---------------

This list contains the events sent by the worker, and their arguments.

Task Events
~~~~~~~~~~~

* ``task-received(uuid, name, args, kwargs, retries, eta, hostname,
  timestamp)``

    Sent when the worker receives a task.

* ``task-started(uuid, hostname, timestamp)``

    Sent just before the worker executes the task.

* ``task-succeeded(uuid, result, runtime, hostname, timestamp)``

    Sent if the task executed successfully.
    Runtime is the time it took to execute the task using the pool
    (starting from when the task is sent to the pool, and ending when the
    pool result handler's callback is called).

* ``task-failed(uuid, exception, traceback, hostname, timestamp)``

    Sent if the execution of the task failed.

* ``task-revoked(uuid)``

    Sent if the task has been revoked. (Note that this is likely
    to be sent by more than one worker.)

* ``task-retried(uuid, exception, traceback, hostname, delay, timestamp)``

    Sent if the task failed, but will be retried in the future.
    (**NOT IMPLEMENTED**)
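
A monitor typically folds these events into a current state for each task.
The sketch below is a simplified illustration and not how
``celery.events.state`` is implemented: the state names and the
``apply_event`` helper are assumptions, and events are modelled as plain
dictionaries with a ``type`` key plus the fields listed above.

```python
# Assumed mapping from event type to a display state.
EVENT_TO_STATE = {
    "task-received": "RECEIVED",
    "task-started": "STARTED",
    "task-succeeded": "SUCCESS",
    "task-failed": "FAILURE",
    "task-revoked": "REVOKED",
    "task-retried": "RETRY",
}


def apply_event(tasks, event):
    """Update ``tasks`` (a dict keyed by uuid) with one task event."""
    state = EVENT_TO_STATE.get(event["type"])
    if state is None:
        return tasks  # Not a task event; ignore it.
    info = tasks.setdefault(event["uuid"], {})
    # Keep every field the event carried, e.g. name or runtime.
    info.update((k, v) for k, v in event.items() if k != "type")
    info["state"] = state
    return tasks


tasks = {}
for ev in [
    {"type": "task-received", "uuid": "a1", "name": "tasks.add"},
    {"type": "task-started", "uuid": "a1"},
    {"type": "task-succeeded", "uuid": "a1", "result": "4", "runtime": 0.02},
]:
    apply_event(tasks, ev)
```

This version assumes events arrive in order. A real monitor would also need
to handle events that arrive late or out of sequence.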

Worker Events
~~~~~~~~~~~~~

* ``worker-online(hostname, timestamp)``

    The worker has connected to the broker and is online.

* ``worker-heartbeat(hostname, timestamp)``

    Sent every minute. If the worker has not sent a heartbeat in 2 minutes,
    it is considered to be offline.

* ``worker-offline(hostname, timestamp)``

    The worker has disconnected from the broker.
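
The heartbeat rule above can be sketched as a small liveness tracker. This is
an illustration only: the ``WorkerTracker`` class is hypothetical, events are
modelled as plain dictionaries with a ``type`` key, and the 120-second window
is taken from the two-minute rule described for ``worker-heartbeat``.

```python
HEARTBEAT_EXPIRE = 120.0  # Seconds without a heartbeat before a worker is offline.


class WorkerTracker(object):
    """Track which workers are alive based on worker events."""

    def __init__(self):
        self.last_seen = {}  # hostname -> timestamp of the last sign of life

    def on_event(self, event):
        hostname = event["hostname"]
        if event["type"] == "worker-offline":
            self.last_seen.pop(hostname, None)
        elif event["type"] in ("worker-online", "worker-heartbeat"):
            self.last_seen[hostname] = event["timestamp"]

    def alive(self, now):
        """Return the sorted hostnames seen within the expiry window."""
        return sorted(host for host, seen in self.last_seen.items()
                      if now - seen < HEARTBEAT_EXPIRE)


tracker = WorkerTracker()
tracker.on_event({"type": "worker-online", "hostname": "w1", "timestamp": 0.0})
tracker.on_event({"type": "worker-heartbeat", "hostname": "w2", "timestamp": 100.0})
```

Calling ``tracker.alive(now)`` with the current time then lists the workers
that have been heard from recently enough to count as online.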