- ==================
- Monitoring Guide
- ==================
- .. contents::
- :local:
- Introduction
- ============
- There are several tools available to monitor and inspect Celery clusters.
- This document describes some of these, as well as
- features related to monitoring, like events and broadcast commands.
- Monitoring and Inspecting Workers
- =================================
- celeryctl
- ---------
- * Listing active nodes in the cluster
- ::
- $ celeryctl status
- * Show the result of a task
- ::
- $ celeryctl result -t tasks.add 4e196aa4-0141-4601-8138-7aa33db0f577
- Note that you can omit the name of the task as long as the
- task doesn't use a custom result backend.
- * Listing all tasks that are currently being executed
- ::
- $ celeryctl inspect active
- * Listing scheduled ETA tasks
- ::
- $ celeryctl inspect scheduled
- These are tasks reserved by the worker because they have the
- ``eta`` or ``countdown`` argument set.
- * Listing reserved tasks
- ::
- $ celeryctl inspect reserved
- This will list all tasks that have been prefetched by the worker
- and are currently waiting to be executed (does not include tasks
- with an eta).
- * Listing the history of revoked tasks
- ::
- $ celeryctl inspect revoked
- * Show registered tasks
- ::
- $ celeryctl inspect registered_tasks
- * Showing statistics
- ::
- $ celeryctl inspect stats
- * Diagnosing the worker pools
- ::
- $ celeryctl inspect diagnose
- This will verify that the worker's pool processes are available
- to do work. Note that this will not work if the worker is busy.
- * Enabling/disabling events
- ::
- $ celeryctl inspect enable_events
- $ celeryctl inspect disable_events
- By default the inspect commands operate on all workers.
- You can specify a single worker, or a list of workers, by using the
- ``--destination`` argument::
- $ celeryctl inspect -d w1,w2 reserved
- :Note: All ``inspect`` commands support the ``--timeout`` argument,
- which is the number of seconds to wait for responses.
- You may have to increase this timeout if you're getting empty responses
- due to latency.
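When the inspect commands are driven from a script, the options described above can be assembled in one place. The following is a minimal sketch and not part of Celery: the helper names ``build_inspect_command`` and ``run_inspect`` are hypothetical, and passing the timeout as ``--timeout=N`` is an assumption about the option syntax.

```python
import subprocess


def build_inspect_command(command, destinations=None, timeout=None):
    """Build the argument list for a ``celeryctl inspect`` call.

    Hypothetical helper: it only assembles the options described above.
    """
    argv = ["celeryctl", "inspect"]
    if destinations:
        # -d takes a comma-separated list of worker names, e.g. w1,w2
        argv += ["-d", ",".join(destinations)]
    if timeout is not None:
        # Assumed spelling of the option: --timeout=<seconds>
        argv.append("--timeout=%s" % timeout)
    argv.append(command)
    return argv


def run_inspect(command, destinations=None, timeout=None):
    """Run the command and return its raw output as text."""
    argv = build_inspect_command(command, destinations, timeout)
    return subprocess.check_output(argv).decode("utf-8", "replace")
```

For example, ``build_inspect_command("reserved", ["w1", "w2"])`` produces the same arguments as the ``celeryctl inspect -d w1,w2 reserved`` command shown above.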
- Django Admin
- ------------
- TODO
- celeryev
- --------
- TODO
- celerymon
- ---------
- TODO
- Monitoring and inspecting RabbitMQ
- ==================================
- To manage a Celery cluster it is important to know how
- RabbitMQ can be monitored.
- RabbitMQ ships with the `rabbitmqctl(1)`_ command.
- With this you can list queues, exchanges, bindings,
- queue lengths, and the memory usage of each queue, as well
- as manage users, virtual hosts and their permissions.
- :Note: The default virtual host (``"/"``) is used in these
- examples. If you use a custom virtual host you have to add
- the ``-p`` argument to the command, e.g.:
- ``rabbitmqctl list_queues -p my_vhost ....``
- .. _`rabbitmqctl(1)`: http://www.rabbitmq.com/man/rabbitmqctl.1.man.html
- Inspecting queues
- -----------------
- Finding the number of tasks in a queue::
- $ rabbitmqctl list_queues name messages messages_ready \
- messages_unacknowledged
- Here ``messages_ready`` is the number of messages ready
- for delivery (sent but not received), and ``messages_unacknowledged``
- is the number of messages that have been received by a worker but
- not acknowledged yet (meaning they are in progress, or have been reserved).
- ``messages`` is the sum of ready and unacknowledged messages.
- Finding the number of workers currently consuming from a queue::
- $ rabbitmqctl list_queues name consumers
- Finding the amount of memory allocated to a queue::
- $ rabbitmqctl list_queues name memory
- :Tip: Adding the ``-q`` option to `rabbitmqctl(1)`_ makes the output
- easier to parse.
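To use these numbers in a script, the output of ``rabbitmqctl -q list_queues`` can be split into columns. The sketch below assumes the output is one queue per line with tab-separated fields in the order the columns were requested; the function name ``parse_list_queues`` is hypothetical, and the exact output format may differ between RabbitMQ versions.

```python
def parse_list_queues(output, columns):
    """Parse ``rabbitmqctl -q list_queues`` output into a list of dicts.

    Assumes one queue per line with tab-separated fields, in the same
    order as ``columns``.
    """
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            # Not a data row (for example a banner line), so skip it.
            continue
        row = {}
        for key, value in zip(columns, fields):
            # Keep the queue name as text; convert the counters to int.
            if key != "name" and value.isdigit():
                row[key] = int(value)
            else:
                row[key] = value
        rows.append(row)
    return rows


# Made-up sample output, for illustration only.
sample = "celery\t12\t10\t2\nimages\t0\t0\t0\n"
columns = ["name", "messages", "messages_ready", "messages_unacknowledged"]
queues = parse_list_queues(sample, columns)
```

Each entry in ``queues`` is then a dictionary keyed by column name, so the backlog of a queue can be read as ``queues[0]["messages_ready"]``.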
- Munin
- =====
- This is a list of known Munin plugins that can be useful when
- maintaining a Celery cluster.
- * rabbitmq-munin: Munin-plugins for RabbitMQ.
- http://github.com/ask/rabbitmq-munin
- * celery_tasks: Monitors the number of times each task type has
- been executed (requires ``celerymon``).
- http://exchange.munin-monitoring.org/plugins/celery_tasks-2/details
- * celery_task_states: Monitors the number of tasks in each state
- (requires ``celerymon``).
- http://exchange.munin-monitoring.org/plugins/celery_tasks/details
- Events
- ======
- The worker has the ability to send a message whenever some event
- happens. These events are then captured by tools like ``celerymon`` and
- ``celeryev`` to monitor the cluster.
- Snapshots
- ---------
- Even a single worker can produce a huge amount of events, so storing
- the full history of these events on disk may be impractical.
- A sequence of events describes the cluster state in that time period.
- By taking periodic snapshots of this state we can capture all interesting
- information, but only periodically write it to disk.
- To take snapshots you need a Camera class. With this you can define
- what should happen every time the state is captured: you can
- write it to a database, send it by e-mail, or do something else entirely.
- ``celeryev`` is then used to take snapshots with the camera.
- For example, if you want to capture state every 2 seconds using the
- camera ``myapp.Camera``, you run ``celeryev`` with the following arguments::
- $ celeryev -c myapp.Camera --frequency=2.0
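The idea of folding events into in-memory state and only periodically emitting a snapshot can be sketched without Celery. This is not how ``celeryev`` is implemented: the class ``SnapshotRecorder`` and its attributes are hypothetical, events are assumed to be dictionaries with ``type`` and ``timestamp`` keys, and the snapshot interval is driven by event timestamps rather than a timer to keep the example self-contained.

```python
from collections import Counter


class SnapshotRecorder:
    """Fold events into state; emit a snapshot at most every ``frequency`` seconds."""

    def __init__(self, frequency, on_snapshot):
        self.frequency = frequency
        self.on_snapshot = on_snapshot   # callback receiving a snapshot dict
        self.counts = Counter()          # cumulative count per event type
        self.event_count = 0             # events since the last snapshot
        self._last_snapshot = None

    def event(self, event):
        # Update the in-memory state; nothing is written out here.
        self.counts[event["type"]] += 1
        self.event_count += 1
        now = event["timestamp"]
        if self._last_snapshot is None:
            self._last_snapshot = now
        elif now - self._last_snapshot >= self.frequency:
            # Only now is the accumulated state handed to the callback.
            self.on_snapshot({"counts": dict(self.counts),
                              "event_count": self.event_count})
            self.event_count = 0
            self._last_snapshot = now


snapshots = []
recorder = SnapshotRecorder(frequency=2.0, on_snapshot=snapshots.append)
for ts in (0.0, 0.5, 1.0, 2.1, 2.5):
    recorder.event({"type": "task-received", "timestamp": ts})
```

Five events go in, but the callback runs only once, which is the trade-off snapshots make: per-event detail is reduced to periodic summaries.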
- Custom Camera
- ~~~~~~~~~~~~~
- Here is an example camera that simply dumps the snapshot to the screen:
- .. code-block:: python
- from pprint import pformat
- from celery.events.snapshot import Polaroid
- class DumpCam(Polaroid):
- def shutter(self, state):
- if not state.event_count:
- # No new events since last snapshot.
- return
- print("Workers: %s" % (pformat(state.workers, indent=4), ))
- print("Tasks: %s" % (pformat(state.tasks, indent=4), ))
- print("Total: %s events, %s tasks" % (
- state.event_count, state.task_count))
- Now you can use this cam with ``celeryev`` by specifying
- it with the ``-c`` option::
- $ celeryev -c myapp.DumpCam --frequency=2.0
- Or you can use it programmatically like this::
- from celery.events import EventReceiver
- from celery.messaging import establish_connection
- from celery.events.state import State
- from myapp import DumpCam
- def main():
- state = State()
- with establish_connection() as connection:
- recv = EventReceiver(connection, handlers={"*": state.event})
- with DumpCam(state, freq=1.0):
- recv.capture(limit=None, timeout=None)
- if __name__ == "__main__":
- main()
- Event Reference
- ---------------
- This list contains the events sent by the worker, and their arguments.
- Task Events
- ~~~~~~~~~~~
- * ``task-received(uuid, name, args, kwargs, retries, eta, hostname,
- timestamp)``
- Sent when the worker receives a task.
- * ``task-started(uuid, hostname, timestamp)``
- Sent just before the worker executes the task.
- * ``task-succeeded(uuid, result, runtime, hostname, timestamp)``
- Sent if the task executed successfully.
- Runtime is the time it took to execute the task using the pool
- (starting from when the task is sent to the pool, and ending when the
- pool result handler's callback is called).
- * ``task-failed(uuid, exception, traceback, hostname, timestamp)``
- Sent if the execution of the task failed.
- * ``task-revoked(uuid)``
- Sent if the task has been revoked. (Note that this is likely
- to be sent by more than one worker.)
- * ``task-retried(uuid, exception, traceback, hostname, delay, timestamp)``
- Sent if the task failed, but will be retried in the future.
- (**NOT IMPLEMENTED**)
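As an illustration of how a consumer might use the task events listed above, the sketch below reduces a stream of events to the latest known state of each task. It assumes each event is a dictionary with a ``type`` key plus the arguments listed above; the state labels and the function names are chosen for this example and are not taken from the event stream itself.

```python
from collections import Counter

# Illustrative mapping from event type to a state label (labels are an assumption).
EVENT_TO_STATE = {
    "task-received": "RECEIVED",
    "task-started": "STARTED",
    "task-succeeded": "SUCCESS",
    "task-failed": "FAILURE",
    "task-revoked": "REVOKED",
    "task-retried": "RETRY",
}


def latest_task_states(events):
    """Return {uuid: state} using the last task event seen for each task."""
    states = {}
    for event in events:
        state = EVENT_TO_STATE.get(event["type"])
        if state is not None:   # ignore non-task events such as heartbeats
            states[event["uuid"]] = state
    return states


def mean_runtime(events):
    """Average the ``runtime`` field over all task-succeeded events."""
    runtimes = [e["runtime"] for e in events if e["type"] == "task-succeeded"]
    return sum(runtimes) / len(runtimes) if runtimes else None


# Made-up events, for illustration only.
events = [
    {"type": "task-received", "uuid": "a", "name": "tasks.add"},
    {"type": "task-started", "uuid": "a"},
    {"type": "task-succeeded", "uuid": "a", "runtime": 0.5},
    {"type": "task-received", "uuid": "b", "name": "tasks.add"},
    {"type": "task-failed", "uuid": "b", "exception": "ValueError()"},
    {"type": "worker-heartbeat", "hostname": "w1"},
]
states = latest_task_states(events)
summary = Counter(states.values())
```

Here ``summary`` counts how many tasks ended up in each state, which is the kind of figure a monitor would typically display.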
- Worker Events
- ~~~~~~~~~~~~~
- * ``worker-online(hostname, timestamp)``
- The worker has connected to the broker and is online.
- * ``worker-heartbeat(hostname, timestamp)``
- Sent every minute. If the worker has not sent a heartbeat in 2 minutes,
- it is considered to be offline.
- * ``worker-offline(hostname, timestamp)``
- The worker has disconnected from the broker.
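The heartbeat rule above can be expressed as a small liveness check. This is a sketch under stated assumptions, not Celery's own bookkeeping: the class ``WorkerLiveness`` is hypothetical, events are assumed to be dictionaries with ``type``, ``hostname`` and ``timestamp`` keys, and timestamps are assumed to be seconds.

```python
HEARTBEAT_EXPIRE = 120.0  # seconds: the 2-minute window described above


class WorkerLiveness:
    """Track the last time each worker was heard from (hypothetical helper)."""

    def __init__(self, expire=HEARTBEAT_EXPIRE):
        self.expire = expire
        self.last_seen = {}   # hostname -> timestamp of last online/heartbeat

    def event(self, event):
        kind = event["type"]
        if kind in ("worker-online", "worker-heartbeat"):
            self.last_seen[event["hostname"]] = event["timestamp"]
        elif kind == "worker-offline":
            # The worker said goodbye, so forget it immediately.
            self.last_seen.pop(event["hostname"], None)

    def is_alive(self, hostname, now):
        """True if the worker was heard from less than ``expire`` seconds ago."""
        seen = self.last_seen.get(hostname)
        return seen is not None and now - seen < self.expire


liveness = WorkerLiveness()
liveness.event({"type": "worker-online", "hostname": "w1", "timestamp": 0.0})
liveness.event({"type": "worker-heartbeat", "hostname": "w1", "timestamp": 60.0})
```

With these two events, ``liveness.is_alive("w1", now=100.0)`` is true, while the same check at ``now=180.0`` is false because two minutes have passed since the last heartbeat.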