
Userguide: Added some sections on Granularity, Data-locality and state

Ask Solem, 15 years ago
commit 7bd2d1f623
1 changed file with 86 additions and 3 deletions

docs/userguide/tasks.rst  +86 -3

@@ -24,12 +24,11 @@ a task, ``celery.decorators.task``:
     from celery.decorators import task
     from django.contrib.auth import User
 
-    @task()
+    @task
     def create_user(username, password):
         User.objects.create(username=username, password=password)
 
-Note the parens after ``@task()`` the task decorator takes any execution
-options the ``Task`` class does:
+The task decorator takes the same execution options the ``Task`` class does:
 
 .. code-block:: python
 
@@ -296,5 +295,89 @@ This means that your workers must optimally be updated with the same software
 as the client, this is a drawback, but the alternative is a technical
 challenge that has yet to be solved.
 
+Performance and Strategies
+==========================
+
+Granularity
+-----------
+
+Task granularity is how finely your work is split up into tasks, and thus
+the degree of parallelization you can achieve. It's better to have many small
+tasks than just a few long-running ones.
+
+With smaller tasks you can process more tasks in parallel, and the tasks
+won't run long enough to block the worker from processing other waiting tasks.
+
+But there's a limit: sending messages takes processing power too. If
+your tasks are so short that the overhead of passing them around outweighs
+just executing them inline, you should reconsider your strategy. There is no
+universal answer here.
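+
+As a rough sketch of this idea (the ``import_feed`` task, the
+``import_all_feeds`` helper and the ``Feed`` model are hypothetical names used
+only for illustration), splitting one long-running import job into one task
+per feed could look like this:
+
+.. code-block:: python
+
+    from celery.decorators import task
+
+    @task
+    def import_feed(feed_url):
+        # Each task handles a single feed, so it finishes quickly and doesn't
+        # block the worker from picking up other waiting tasks.
+        Feed.objects.import_feed(feed_url)
+
+    def import_all_feeds(feed_urls):
+        # Send one message per feed instead of running one long task.
+        for feed_url in feed_urls:
+            import_feed.delay(feed_url)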
+
+Data locality
+-------------
+
+The worker processing the task should ideally be as close to the data as
+possible. The best case is having a copy in memory; the worst is a full
+transfer from another continent.
+
+If the data is far away, you could try to run another worker at that location,
+or if that's not possible, cache often-used data, or preload data you know
+is going to be used.
+
+The easiest way to share data between workers is to use a distributed caching
+system, like `memcached`_.
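+
+As a minimal sketch of the caching idea, here is a task using Django's cache
+framework (which can be configured to use memcached); the ``ReportData`` model
+and ``generate_report`` function are made-up names for illustration:
+
+.. code-block:: python
+
+    from celery.decorators import task
+    from django.core.cache import cache
+
+    @task
+    def render_report(report_id):
+        # Try the cache first, and only hit the database on a miss.
+        data = cache.get("report-data-%s" % report_id)
+        if data is None:
+            data = ReportData.objects.get(pk=report_id)
+            cache.set("report-data-%s" % report_id, data, 60 * 15)
+        return generate_report(data)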
+
+.. _`memcached`: http://memcached.org/
+
+.. seealso::
+
+    The paper `Distributed Computing Economics`_ by Jim Gray is a good
+    introduction to the topic of data locality.
+
+.. _`Distributed Computing Economics`:
+    http://research.microsoft.com/pubs/70001/tr-2003-24.pdf
+
+State
+-----
+
+Since celery is a distributed system, you can't know in which process, or
+even on what machine, the task will run. You can't even know if the task will
+run in a timely manner, so be wary of the state you pass on to tasks.
+
+One gotcha is Django model objects: they shouldn't be passed as arguments
+to tasks. It's almost always better to re-fetch the object from the database
+in the task instead, as passing the object itself invites race conditions.
+
+Imagine the following scenario where you have an article, and a task
+that automatically expands some abbreviations in it.
+
+.. code-block:: python
+
+    class Article(models.Model):
+        title = models.CharField(max_length=200)
+        body = models.TextField()
+
+    @task
+    def expand_abbreviations(article):
+        article.body = article.body.replace("MyCorp", "My Corporation")
+        article.save()
+
+First, an author creates an article and saves it. The author then clicks on a
+button that initiates the abbreviation task:
+
+    >>> article = Article.objects.get(id=102)
+    >>> expand_abbreviations.delay(article)
+
+Now, the queue is very busy, so the task won't be run for another 2 minutes.
+In the meantime another author makes some changes to the article. When the
+task is finally run, the body of the article is reverted to the old version,
+because the task had the old body in its argument.
+
+Fixing the race condition is easy: just pass the article id instead, and
+re-fetch the article in the task body:
+
+.. code-block:: python
+
+    @task
+    def expand_abbreviations(article_id):
+        article = Article.objects.get(id=article_id)
+        article.body = article.body.replace("MyCorp", "My Corporation")
+        article.save()
 
+    >>> expand_abbreviations.delay(article_id)
 
+There might even be performance benefits to this approach, as sending large
+messages may be expensive.