==================
 Monitoring Guide
==================

.. contents::
    :local:

Introduction
============

There are several tools available to monitor and inspect Celery clusters.

This document describes some of these, as well as
features related to monitoring, like events and broadcast commands.

Monitoring and Inspecting Workers
=================================

celeryctl
---------

* Listing active nodes in the cluster
    ::

        $ celeryctl status

* Show the result of a task
    ::

        $ celeryctl result -t tasks.add 4e196aa4-0141-4601-8138-7aa33db0f577

    Note that you can omit the name of the task as long as the
    task doesn't use a custom result backend.

* Listing all tasks that are currently being executed
    ::

        $ celeryctl inspect active

* Listing scheduled ETA tasks
    ::

        $ celeryctl inspect scheduled

    These are tasks reserved by the worker because they have the
    ``eta`` or ``countdown`` argument set.

* Listing reserved tasks
    ::

        $ celeryctl inspect reserved

    This will list all tasks that have been prefetched by the worker
    and are currently waiting to be executed (does not include tasks
    with an eta).

* Listing the history of revoked tasks
    ::

        $ celeryctl inspect revoked

* Show registered tasks
    ::

        $ celeryctl inspect registered_tasks

* Showing statistics
    ::

        $ celeryctl inspect stats

* Diagnosing the worker pools
    ::

        $ celeryctl inspect diagnose

    This will verify that the worker's pool processes are available
    to do work. Note that this will not work if the worker is busy.

* Enabling/disabling events
    ::

        $ celeryctl inspect enable_events
        $ celeryctl inspect disable_events

By default the inspect commands operate on all workers.
You can specify a single worker, or a list of workers, by using the
``--destination`` argument::

    $ celeryctl inspect -d w1,w2 reserved

:Note: All ``inspect`` commands support the ``--timeout`` argument,
       which is the number of seconds to wait for responses.
       You may have to increase this timeout if you're getting empty
       responses due to latency.
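
When the inspect commands are driven from a script, it can help to build
the argument list in one place. The following is a minimal sketch, not part
of Celery itself: the helper names ``build_inspect_command`` and
``run_inspect`` are hypothetical, and passing the timeout as
``--timeout <seconds>`` is an assumption about how the option value is
written on the command line.

.. code-block:: python

    import subprocess


    def build_inspect_command(command, destinations=None, timeout=None):
        """Return the argv list for a ``celeryctl inspect`` invocation."""
        argv = ["celeryctl", "inspect"]
        if destinations:
            # -d takes a comma-separated list of worker names.
            argv += ["-d", ",".join(destinations)]
        if timeout is not None:
            # Number of seconds to wait for worker responses (assumed syntax).
            argv += ["--timeout", str(timeout)]
        argv.append(command)
        return argv


    def run_inspect(command, **options):
        """Run the command and return its raw standard output as text."""
        argv = build_inspect_command(command, **options)
        return subprocess.check_output(argv).decode()

For example, ``build_inspect_command("reserved", destinations=["w1", "w2"])``
produces the same command line as the ``-d w1,w2`` example above.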

Django Admin
------------

TODO

celeryev
--------

TODO

celerymon
---------

TODO

Monitoring and inspecting RabbitMQ
==================================

To manage a Celery cluster it is important to know how
RabbitMQ can be monitored.

RabbitMQ ships with the `rabbitmqctl(1)`_ command.
With it you can list queues, exchanges, bindings,
queue lengths, and the memory usage of each queue, as well
as manage users, virtual hosts and their permissions.

:Note: The default virtual host (``"/"``) is used in these
       examples. If you use a custom virtual host you have to add
       the ``-p`` argument to the command, e.g.:
       ``rabbitmqctl list_queues -p my_vhost ....``

.. _`rabbitmqctl(1)`: http://www.rabbitmq.com/man/rabbitmqctl.1.man.html

Inspecting queues
-----------------

Finding the number of tasks in a queue::

    $ rabbitmqctl list_queues name messages messages_ready \
                              messages_unacknowledged

Here ``messages_ready`` is the number of messages ready
for delivery (sent but not received), and ``messages_unacknowledged``
is the number of messages that have been received by a worker but
not acknowledged yet (meaning they are in progress, or have been reserved).
``messages`` is the sum of ready and unacknowledged messages.

Finding the number of workers currently consuming from a queue::

    $ rabbitmqctl list_queues name consumers

Finding the amount of memory allocated to a queue::

    $ rabbitmqctl list_queues name memory

:Tip: Adding the ``-q`` option to `rabbitmqctl(1)`_ makes the output
      easier to parse.
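
As an illustration of parsing that output, here is a minimal sketch that
turns the text printed by ``rabbitmqctl -q list_queues`` into dictionaries.
It assumes one queue per line with tab-separated columns in the order they
were requested. The function name ``parse_list_queues`` and the sample
output are hypothetical and were not captured from a real broker.

.. code-block:: python

    def parse_list_queues(output, columns):
        """Parse ``rabbitmqctl -q list_queues <columns>`` output into dicts."""
        queues = []
        for line in output.splitlines():
            if not line.strip():
                continue
            fields = line.split("\t")
            if len(fields) != len(columns):
                # Skip banner or malformed lines rather than guessing.
                continue
            row = dict(zip(columns, fields))
            for key in columns:
                if key != "name":
                    # Every column except the queue name is a counter.
                    row[key] = int(row[key])
            queues.append(row)
        return queues


    COLUMNS = ["name", "messages", "messages_ready", "messages_unacknowledged"]

    # Hypothetical sample output, for illustration only.
    SAMPLE = "celery\t12\t10\t2\nemails\t0\t0\t0\n"

    for queue in parse_list_queues(SAMPLE, COLUMNS):
        # messages is the sum of ready and unacknowledged messages.
        assert queue["messages"] == (
            queue["messages_ready"] + queue["messages_unacknowledged"])
        print("%(name)s: %(messages)d messages" % queue)

The assertion restates the relationship between the three counters
described above, so a mismatch would point at a parsing problem.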

Munin
=====

This is a list of known Munin plugins that can be useful when
maintaining a Celery cluster.

* rabbitmq-munin: Munin-plugins for RabbitMQ.

    http://github.com/ask/rabbitmq-munin

* celery_tasks: Monitors the number of times each task type has
  been executed (requires ``celerymon``).

    http://exchange.munin-monitoring.org/plugins/celery_tasks-2/details

* celery_task_states: Monitors the number of tasks in each state
  (requires ``celerymon``).

    http://exchange.munin-monitoring.org/plugins/celery_tasks/details

Events
======

The worker has the ability to send a message whenever some event
happens. These events are then captured by tools like ``celerymon`` and
``celeryev`` to monitor the cluster.

Snapshots
---------

Even a single worker can produce a huge number of events, so storing
the full history of these events on disk may be impractical.

A sequence of events describes the cluster state in that time period.
By taking periodic snapshots of this state we can keep all the interesting
information, but only write it to disk periodically.

To take snapshots you need a Camera class. With it you can define
what should happen every time the state is captured: you can
write it to a database, send it by e-mail, or do something else entirely.

``celeryev`` is then used to take snapshots with the camera.
For example, if you want to capture state every 2 seconds using the
camera ``myapp.Camera``, you run ``celeryev`` with the following arguments::

    $ celeryev -c myapp.Camera --frequency=2.0

Custom Camera
~~~~~~~~~~~~~

Here is an example camera that simply dumps the snapshot to the screen:

.. code-block:: python

    from pprint import pformat

    from celery.events.snapshot import Polaroid

    class DumpCam(Polaroid):

        def shutter(self, state):
            if not state.event_count:
                # No new events since last snapshot.
                return
            print("Workers: %s" % (pformat(state.workers, indent=4), ))
            print("Tasks: %s" % (pformat(state.tasks, indent=4), ))
            print("Total: %s events, %s tasks" % (
                state.event_count, state.task_count))

Now you can use this cam with ``celeryev`` by specifying
it with the ``-c`` option::

    $ celeryev -c myapp.DumpCam --frequency=2.0

Or you can use it programmatically like this::

    from celery.events import EventReceiver
    from celery.messaging import establish_connection
    from celery.events.state import State
    from myapp import DumpCam

    def main():
        state = State()
        with establish_connection() as connection:
            recv = EventReceiver(connection, handlers={"*": state.event})
            with DumpCam(state, freq=1.0):
                recv.capture(limit=None, timeout=None)

    if __name__ == "__main__":
        main()

Event Reference
---------------

This list contains the events sent by the worker, and their arguments.

Task Events
~~~~~~~~~~~

* ``task-received(uuid, name, args, kwargs, retries, eta, hostname,
  timestamp)``

    Sent when the worker receives a task.

* ``task-started(uuid, hostname, timestamp)``

    Sent just before the worker executes the task.

* ``task-succeeded(uuid, result, runtime, hostname, timestamp)``

    Sent if the task executed successfully.

    Runtime is the time it took to execute the task using the pool
    (starting when the task is sent to the pool, and ending when the
    pool result handler's callback is called).

* ``task-failed(uuid, exception, traceback, hostname, timestamp)``

    Sent if the execution of the task failed.

* ``task-revoked(uuid)``

    Sent if the task has been revoked (note that this is likely
    to be sent by more than one worker).

* ``task-retried(uuid, exception, traceback, hostname, delay, timestamp)``

    Sent if the task failed, but will be retried in the future.
    (**NOT IMPLEMENTED**)
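
To show how a monitor might consume the task events listed above, here is a
minimal sketch that folds a stream of events into the latest known state of
each task. It is not how ``celeryev`` or ``celerymon`` is implemented. The
event dictionaries with a ``type`` key, the state labels, and the name
``fold_task_events`` are all assumptions made for illustration.

.. code-block:: python

    # Assumed mapping from event type to a state label (illustrative only).
    EVENT_TO_STATE = {
        "task-received": "RECEIVED",
        "task-started": "STARTED",
        "task-succeeded": "SUCCESS",
        "task-failed": "FAILURE",
        "task-revoked": "REVOKED",
        "task-retried": "RETRY",
    }


    def fold_task_events(events):
        """Return ``{uuid: latest_state}`` for events given in arrival order."""
        states = {}
        for event in events:
            state = EVENT_TO_STATE.get(event["type"])
            if state is not None:
                # Later events for the same task overwrite earlier ones.
                states[event["uuid"]] = state
        return states


    # Hypothetical events, reduced to the fields this sketch needs.
    events = [
        {"type": "task-received", "uuid": "a"},
        {"type": "task-started", "uuid": "a"},
        {"type": "task-received", "uuid": "b"},
        {"type": "task-succeeded", "uuid": "a"},
    ]
    states = fold_task_events(events)

With the sample events, task ``a`` ends up in the ``SUCCESS`` state and task
``b`` is still ``RECEIVED``. Events of any other type are ignored.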

Worker Events
~~~~~~~~~~~~~

* ``worker-online(hostname, timestamp)``

    The worker has connected to the broker and is online.

* ``worker-heartbeat(hostname, timestamp)``

    Sent every minute. If the worker has not sent a heartbeat in 2 minutes,
    it is considered to be offline.

* ``worker-offline(hostname, timestamp)``

    The worker has disconnected from the broker.
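
The heartbeat rule above can be sketched as a small check over the latest
heartbeat timestamp seen for each worker. This is only an illustration of
the 2-minute window. The function name ``offline_workers`` and the shape of
the ``last_heartbeat`` mapping are hypothetical, and timestamps are assumed
to be seconds since the epoch.

.. code-block:: python

    HEARTBEAT_EXPIRE = 120.0  # seconds; the 2-minute window


    def offline_workers(last_heartbeat, now):
        """Return sorted hostnames whose latest heartbeat is too old."""
        return sorted(
            hostname
            for hostname, timestamp in last_heartbeat.items()
            if now - timestamp > HEARTBEAT_EXPIRE
        )


    # Hypothetical timestamps: w1 was last seen 130 seconds ago, w2 30.
    last_heartbeat = {"w1": 1000.0, "w2": 1100.0}
    stale = offline_workers(last_heartbeat, now=1130.0)

Here only ``w1`` is reported, since its last heartbeat falls outside the
window. A monitor would also drop a worker as soon as it sees a
``worker-offline`` event, which this sketch does not model.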