Started writing the Monitoring User Guide

Ask Solem 14 years ago
parent
commit
52cc5164e1
1 changed files with 233 additions and 20 deletions
+ 233 - 20
docs/userguide/monitoring.rst

@@ -5,21 +5,211 @@
 .. contents::
     :local:
 
+Introduction
+============
+
+There are several tools available to monitor and inspect Celery clusters.
+This document describes some of these, as well as
+features related to monitoring, like events and broadcast commands.
+
+
+Monitoring and Inspecting Workers
+=================================
+
+celeryctl
+---------
+
+* Listing active nodes in the cluster
+    ::
+
+        $ celeryctl status
+
+* Show the result of a task
+    ::
+
+        $ celeryctl result -t tasks.add 4e196aa4-0141-4601-8138-7aa33db0f577
+
+    Note that you can omit the name of the task as long as the
+    task doesn't use a custom result backend.
+
+* Listing all tasks that are currently being executed
+    ::
+
+        $ celeryctl inspect active
+
+* Listing scheduled ETA tasks
+    ::
+
+        $ celeryctl inspect scheduled
+
+    These are tasks reserved by the worker because they have the
+    ``eta`` or ``countdown`` argument set.
+
+* Listing reserved tasks
+    ::
+
+        $ celeryctl inspect reserved
+
+    This will list all tasks that have been prefetched by the worker
+    and are currently waiting to be executed (does not include tasks
+    with an eta).
+
+* Listing the history of revoked tasks
+    ::
+
+        $ celeryctl inspect revoked
+
+* Show registered tasks
+    ::
+
+        $ celeryctl inspect registered_tasks
+
+* Showing statistics
+    ::
+
+        $ celeryctl inspect stats
+
+* Diagnosing the worker pools
+    ::
+
+        $ celeryctl inspect diagnose
+
+    This will verify that the worker's pool processes are available
+    to do work. Note that this will not work if the worker is busy.
+
+* Enabling/disabling events
+    ::
+
+        $ celeryctl inspect enable_events
+        $ celeryctl inspect disable_events
+
+
+By default the inspect commands operate on all workers.
+You can specify a single worker, or a list of workers, by using the
+``--destination`` argument::
+
+    $ celeryctl inspect -d w1,w2 reserved
+
+
+:Note: All ``inspect`` commands support the ``--timeout`` argument,
+       which is the number of seconds to wait for responses.
+       You may have to increase this timeout if you're getting empty responses
+       due to latency.
+
+Django Admin
+------------
+
+TODO
+
+celeryev
+--------
+
+TODO
+
+celerymon
+---------
+
+TODO
+
+Monitoring and inspecting RabbitMQ
+==================================
+
+To manage a Celery cluster it is important to know how
+RabbitMQ can be monitored.
+
+RabbitMQ ships with the `rabbitmqctl(1)`_ command.
+With this you can list queues, exchanges, bindings,
+queue lengths, the memory usage of each queue, as well
+as manage users, virtual hosts and their permissions.
+
+:Note: The default virtual host (``"/"``) is used in these
+       examples. If you use a custom virtual host you have to add
+       the ``-p`` argument to the command, e.g.:
+       ``rabbitmqctl list_queues -p my_vhost ....``
+
+
+.. _`rabbitmqctl(1)`: http://www.rabbitmq.com/man/rabbitmqctl.1.man.html
+
+Inspecting queues
+-----------------
+
+Finding the number of tasks in a queue::
+
+
+    $ rabbitmqctl list_queues name messages messages_ready \
+                              messages_unacknowledged
+
+
+Here ``messages_ready`` is the number of messages ready
+for delivery (sent but not received), and ``messages_unacknowledged``
+is the number of messages that have been received by a worker but
+not acknowledged yet (meaning they are in progress, or have been reserved).
+``messages`` is the sum of ready and unacknowledged messages.
+
+
+Finding the number of workers currently consuming from a queue::
+
+    $ rabbitmqctl list_queues name consumers
+
+Finding the amount of memory allocated to a queue::
+
+    $ rabbitmqctl list_queues name memory
+
+:Tip: Adding the ``-q`` option to `rabbitmqctl(1)`_ makes the output
+      easier to parse.
+
+Munin
+=====
+
+This is a list of known Munin plugins that can be useful when
+maintaining a Celery cluster.
+
+* rabbitmq-munin: Munin-plugins for RabbitMQ.
+
+    http://github.com/ask/rabbitmq-munin
+
+* celery_tasks: Monitors the number of times each task type has
+  been executed (requires ``celerymon``).
+
+    http://exchange.munin-monitoring.org/plugins/celery_tasks-2/details
+
+* celery_task_states: Monitors the number of tasks in each state
+  (requires ``celerymon``).
+
+    http://exchange.munin-monitoring.org/plugins/celery_tasks/details
+
 Events
 ======
 
-Describe events
-
+The worker has the ability to send a message whenever some event
+happens. These events are then captured by tools like ``celerymon`` and 
+``celeryev`` to monitor the cluster.
 
 Snapshots
 ---------
 
-Describe snapshots
+Even a single worker can produce a huge number of events, so storing
+the history of these events on disk may be hard.
+
+A sequence of events describes the cluster state in that time period.
+By taking periodic snapshots of this state we can capture all interesting
+information, but only periodically write it to disk.
+
+To take snapshots you need a Camera class. With this you can define
+what should happen every time the state is captured. You can
+write it to a database, send it by e-mail, or do something else entirely.
+
+``celeryev`` is then used to take snapshots with the camera.
+For example, if you want to capture state every 2 seconds using the
+camera ``myapp.Camera``, you run ``celeryev`` with the following arguments::
 
+    $ celeryev -c myapp.Camera --frequency=2.0
 
 Custom Camera
 ~~~~~~~~~~~~~
 
+Here is an example camera that simply dumps the snapshot to the screen:
+
 .. code-block:: python
 
     from pprint import pformat
@@ -59,33 +249,56 @@ Or you can use it programatically like this::
     if __name__ == "__main__":
         main()
 
+Event Reference
+---------------
 
+This list contains the events sent by the worker, and their arguments.
 
+Task Events
+~~~~~~~~~~~
 
-Tools
-=====
+* ``task-received(uuid, name, args, kwargs, retries, eta, hostname,
+  timestamp)``
 
-celerymon
-=========
+    Sent when the worker receives a task.
 
-Describe celerymon
+* ``task-started(uuid, hostname, timestamp)``
 
-celeryev
-========
+    Sent just before the worker executes the task.
 
-Describe celeryev
+* ``task-succeeded(uuid, result, runtime, hostname, timestamp)``
 
-RabbitMQ
-========
+    Sent if the task executed successfully.
+    Runtime is the time it took to execute the task using the pool
+    (starting when the task is sent to the pool, and ending when the
+    pool result handler's callback is called).
 
-Describe rabbitmq tools. rabbitmqctl, Alice, etc...
+* ``task-failed(uuid, exception, traceback, hostname, timestamp)``
 
-Django Admin
-============
+    Sent if the execution of the task failed.
 
-Describe the snapshot camera django-celery ships with.
+* ``task-revoked(uuid)``
 
-Munin
-=====
+    Sent if the task has been revoked (note that this is likely
+    to be sent by more than one worker).
+
+* ``task-retried(uuid, exception, traceback, hostname, delay, timestamp)``
+
+    Sent if the task failed, but will be retried in the future.
+    (**NOT IMPLEMENTED**)
+
+Worker Events
+~~~~~~~~~~~~~
+
+* ``worker-online(hostname, timestamp)``
+
+    The worker has connected to the broker and is online.
+
+* ``worker-heartbeat(hostname, timestamp)``
+
+    Sent every minute. If the worker has not sent a heartbeat in 2 minutes,
+    it is considered to be offline.
+
+* ``worker-offline(hostname, timestamp)``
 
-Maintain a list of related munin plugins
+    The worker has disconnected from the broker.
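
The heartbeat rule in the worker events above (a heartbeat every minute, offline after 2 minutes of silence) could be sketched as a small liveness tracker. This is only an illustration under stated assumptions, not Celery's own implementation: the class ``WorkerLiveness``, the constant ``HEARTBEAT_EXPIRES``, and the shape of the event dictionary (a ``type`` key next to ``hostname`` and ``timestamp``) are hypothetical names chosen for the example.

```python
import time

# Assumption taken from the event reference: a worker that has been
# silent for 2 minutes is considered offline.
HEARTBEAT_EXPIRES = 120.0  # seconds


class WorkerLiveness:
    """Hypothetical helper that remembers the last sign of life per worker."""

    def __init__(self, expires=HEARTBEAT_EXPIRES):
        self.expires = expires
        self.last_seen = {}  # hostname -> timestamp of the latest event

    def on_event(self, event):
        """Consume one worker event.

        The dict layout (``type``, ``hostname``, ``timestamp``) is an
        assumption made for this sketch.
        """
        kind = event["type"]
        hostname = event["hostname"]
        if kind in ("worker-online", "worker-heartbeat"):
            self.last_seen[hostname] = event["timestamp"]
        elif kind == "worker-offline":
            # An explicit offline event removes the worker immediately.
            self.last_seen.pop(hostname, None)

    def is_alive(self, hostname, now=None):
        """Return True if the worker was heard from within the expiry window."""
        now = time.time() if now is None else now
        seen = self.last_seen.get(hostname)
        return seen is not None and (now - seen) < self.expires
```

A monitor would call ``on_event`` for every worker event it receives and ``is_alive`` whenever it needs to decide whether a node should still be listed as online.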