
Counters with PostgreSQL and Django

A guide to implementing counters in PostgreSQL, paying particular attention to common pitfalls, with practical code examples in SQL and the Django ORM.
2019-12-27
Software Engineering
PostgreSQL
Python
Django

Almost all web apps will need counters. Let’s take a look at how to create counters that are accurate and have good performance.

Perhaps you want to count the number of related objects per object, such as the total number of comments per article. Or perhaps you want to count the total number of things in a system, such as the total number of users. In either case, there are certain things that must be taken into consideration to avoid degrading performance and to ensure correctness.

Foreword: Transaction isolation

In order to understand how to make atomic operations in SQL, it is necessary to first understand isolation levels. The isolation level affects how the database will create and keep locks on the data, which are used to handle conflict scenarios.

At the very least, read the Wikipedia page on Isolation; the PostgreSQL Transaction Isolation docs are also really good.

This article is written for PostgreSQL using the default isolation level Read Committed.

In PostgreSQL, MS SQL Server, and Oracle, the default isolation level is Read Committed.
In MySQL/MariaDB the default isolation level is Repeatable Read.
Changing the isolation level itself is not hard, but keep in mind that changing it for an existing application can change how the application behaves, and application code may need to be refactored. This is something you need to be aware of and choose correctly from the start.
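
For completeness, here is a sketch of how the isolation level can be set explicitly in Django settings for PostgreSQL with psycopg2. Read Committed is already the default, so this only makes the choice visible; the database name is a placeholder.

import psycopg2.extensions

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "blog",  # placeholder
        "OPTIONS": {
            # Read Committed is the PostgreSQL default; setting it
            # explicitly makes the assumption visible in the code base.
            "isolation_level": psycopg2.extensions.ISOLATION_LEVEL_READ_COMMITTED,
        },
    }
}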

Example models

As an example, let’s say that there is an app called blog with these models.

class User(models.Model):
    username = models.CharField(max_length=100)


class Article(models.Model):
    content = models.CharField(max_length=10000)


class Comment(models.Model):
    creator = models.ForeignKey(
        "blog.User", related_name="comments", on_delete=models.CASCADE
    )
    article = models.ForeignKey(
        "blog.Article", related_name="comments", on_delete=models.CASCADE
    )

    message = models.CharField(max_length=1000)
    publish_status = models.CharField(
        max_length=100,
        choices=[("public", "Public"), ("private", "Private")],
    )

Stage 1: Subqueries

Let’s say that there is a requirement to show the number of public comments for each article in a list of articles.

A common mistake

You might start out by doing something like this.

articles = Article.objects.all()
for article in articles:
    print(article.comments.filter(publish_status="public").count())

This is going to produce one count query per article, the classic N+1 query problem. If you are still writing queries like this, you absolutely have to learn how to use subqueries.

A bit better

Let’s use annotate instead; this will dramatically improve performance, because all the counting happens in a single query.

Article.objects.annotate(
    total_public_comments=models.Count(
        "comments", filter=models.Q(comments__publish_status="public")
    )
)

This is probably the easiest way to start, and it is fine up to a point. But it may not scale far enough: the count is executed every single time the query runs, even if the value has not changed, and as the number of comments grows, the count takes longer.

What if you have a lot of articles and comments, and there is a requirement to find the top ten articles with the most comments?
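
With the annotation approach, a sketch of that query might look like the following. The database has to count the comments of every article before it can sort, every single time the query runs.

top_articles = (
    Article.objects.annotate(
        total_public_comments=models.Count(
            "comments", filter=models.Q(comments__publish_status="public")
        )
    )
    .order_by("-total_public_comments")[:10]
)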

Stage 2: Dedicated counters

Let’s use dedicated counters to keep track of the values. This has several benefits.

  • Avoid counting all the objects repeatedly; only increment or decrement the existing value.
  • Counters are immediately available everywhere for displaying, ordering, filtering.

Let’s add the following fields to the models.

class User(models.Model):
    ...
    total_public_comments = models.BigIntegerField(default=0)


class Article(models.Model):
    ...
    total_public_comments = models.BigIntegerField(default=0)
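
With the counter stored as a regular column, the top ten query from earlier becomes a plain read that can be ordered directly (and backed by an index if needed).

# The counter is an ordinary column now, so ordering is cheap.
top_articles = Article.objects.order_by("-total_public_comments")[:10]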

But now it is important to really think about how to keep this counter up to date.

A common mistake

I have seen code like this on several occasions, which is a mistake.

article = Article.objects.get(...)
article.total_public_comments += 1
article.save()

This is unfortunately not good at all. The increment happens in Python, not in the database, which means it is not an atomic operation. There is a window of time between reading the old value from the database and writing the new value back, during which other processes may be running the same function, leading to one process overwriting the value written by the other and producing an incorrect count.
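
To make the race concrete, here is the interleaving that loses an update, sketched as two processes working on the same article.

# Process A                         Process B
# ---------                         ---------
# get(pk=1)   -> count is 10
#                                   get(pk=1)   -> count is 10
# 10 + 1 = 11 (computed in Python)
# save()      -> writes 11
#                                   10 + 1 = 11 (computed in Python)
#                                   save()      -> writes 11, not 12;
#                                                  one increment is lost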

A bit better

To ensure the counter value stays correct, the increment must be performed as an atomic operation in the database, in the style of Fetch-and-add (a close relative of Compare-and-swap).

One might sprinkle statements like this all over the place.

article = Article.objects.get(...)
article.total_public_comments = models.F("total_public_comments") + 1
article.save(update_fields=["total_public_comments"])

Which will perform SQL like this.

UPDATE "blog_article"
   SET "total_public_comments" = "blog_article"."total_public_comments" + 1
 WHERE "blog_article"."id" = ...

This is definitely better, as it does an atomic update at the database level, which ensures the value is actually correct.

But it is brittle, because you have to remember to add it everywhere. For example, it would be very nice to not have to modify the Django admin just to keep the counters up to date.

And it has the potential to deadlock if multiple processes are trying to update the same counters. It is always possible to avoid deadlocks by carefully crafting the code, for example by always updating rows in a consistent order. But it would be nice to have a solution that does not even have the potential for deadlocks. (Or dreadlocks, as I like to say, jokingly.)

If some code has the potential to deadlock, it can go unnoticed for a long time, because it is unlikely to happen when you have low amounts of traffic. But as traffic increases, so does the likelihood that the deadlock starts happening, and it can blow up in your face. So let's build a solution that does not even have the potential to deadlock in the first place.
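
To make this concrete, here is a sketch of one way the atomic updates from above can deadlock. The helpers publish_comment and unpublish_comment are hypothetical; the point is that they update the same two rows in opposite order.

from django.db import models, transaction

from blog.models import Article, User


def publish_comment(comment):
    # Hypothetical helper: bumps two counters in one transaction,
    # article first, then user.
    with transaction.atomic():
        Article.objects.filter(pk=comment.article_id).update(
            total_public_comments=models.F("total_public_comments") + 1
        )  # holds a row lock on the article until commit
        User.objects.filter(pk=comment.creator_id).update(
            total_public_comments=models.F("total_public_comments") + 1
        )  # holds a row lock on the user until commit


def unpublish_comment(comment):
    # Hypothetical helper: same two rows, opposite order, user first.
    with transaction.atomic():
        User.objects.filter(pk=comment.creator_id).update(
            total_public_comments=models.F("total_public_comments") - 1
        )
        Article.objects.filter(pk=comment.article_id).update(
            total_public_comments=models.F("total_public_comments") - 1
        )

# If these run concurrently for the same comment's article and user, each
# transaction can grab its first row lock and then wait forever for the
# other's: a deadlock, which PostgreSQL resolves by aborting one of them.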

Stage 3: Queue updates

As long as multiple processes are trying to update the same counter value, there is a potential for deadlocks. The solution to this problem is to have a queue of updates, which is applied periodically. This has several benefits.

  • Eliminates deadlocks.
    To update a counter, an INSERT operation is used instead of an UPDATE.
  • Better performance.
    An INSERT is faster than an UPDATE or an INSERT ... ON CONFLICT ... DO UPDATE (a.k.a. upsert).

Let’s create a new app called counting, and add this model.

class Task(models.Model):
    id = models.AutoField(primary_key=True, editable=False, verbose_name="ID")

    table_name = models.CharField(max_length=200, null=True, blank=True)
    row_id = models.IntegerField(null=True, blank=True)
    name = models.CharField(max_length=200)
    value = models.IntegerField()

Since row_id is an IntegerField, this assumes that all models are standardized to use integers for their id field.
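
If you ever need to queue a counter update from application code (the trigger in Stage 4 will cover the common paths), a small helper keeps the convention in one place. This helper is hypothetical, not part of the original design.

from counting.models import Task


def queue_counter_update(instance, name, value):
    # Hypothetical helper: queues a counter delta for any model instance
    # by writing one row to the task queue.
    Task.objects.create(
        table_name=instance._meta.db_table,
        row_id=instance.pk,
        name=name,
        value=value,
    )


# Usage: queue_counter_update(article, "total_public_comments", +1)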

Process counter tasks

Let’s add a management command in the counting app that processes the queued tasks and updates the counters. This command will need to be scheduled to run repeatedly, for example via a cronjob.

from django.core.management.base import BaseCommand
from django import db


class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        with db.connection.cursor() as cursor:
            cursor.callproc("counting_process_tasks")

It would be entirely possible to implement this function in Python, safely and correctly. But if it is implemented as a database function, performance will be much better.
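
For the record, a safe Python version might look roughly like this sketch. It uses select_for_update(skip_locked=True) so that concurrent runs do not process the same batch; the function name is my own.

from django.db import connection, transaction
from django.db.models import Sum

from counting.models import Task


def process_tasks(task_limit=1000):
    with transaction.atomic():
        # Lock a batch of tasks; skip_locked makes concurrent runs
        # skip each other's batches instead of blocking.
        ids = list(
            Task.objects.select_for_update(skip_locked=True)
            .order_by("id")
            .values_list("id", flat=True)[:task_limit]
        )
        batch = Task.objects.filter(id__in=ids)
        # Materialize the sums before deleting; querysets are lazy.
        sums = list(
            batch.values("table_name", "row_id", "name").annotate(total=Sum("value"))
        )
        batch.delete()
        with connection.cursor() as cursor:
            for row in sums:
                if row["total"] == 0:
                    continue
                # table_name and name come from our own trigger code,
                # never from user input, so identifier interpolation is
                # acceptable here.
                cursor.execute(
                    'UPDATE "{0}" SET "{1}" = "{1}" + %s WHERE id = %s'.format(
                        row["table_name"], row["name"]
                    ),
                    [row["total"], row["row_id"]],
                )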

Django does not have built-in support for creating functions. Custom SQL can be added by manually adding a RunSQL operation to a migration.
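
A minimal sketch of such a migration; the file name and the dependency are placeholders.

# counting/migrations/0002_process_tasks_function.py (hypothetical name)
from django.db import migrations

SQL = """
-- paste the CREATE OR REPLACE FUNCTION statement shown below here
"""


class Migration(migrations.Migration):
    dependencies = [("counting", "0001_initial")]

    operations = [
        migrations.RunSQL(
            sql=SQL,
            reverse_sql="DROP FUNCTION IF EXISTS counting_process_tasks(INT);",
        ),
    ]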

Let’s add the following SQL in the counting app.

CREATE OR REPLACE FUNCTION counting_process_tasks(task_limit INT DEFAULT 1000)
RETURNS void
AS $$
DECLARE
  counter_sum RECORD;
BEGIN
  FOR counter_sum IN
    WITH
    task_ids AS (
      SELECT
        MIN(id) AS min_id,
        MIN(id) + task_limit AS max_id
      FROM counting_task
    ),
    deleted_task AS (
      DELETE FROM counting_task
      WHERE id BETWEEN
        (SELECT min_id FROM task_ids) AND
        (SELECT max_id FROM task_ids)
      RETURNING *
    ),
    counter_sums AS (
      SELECT
        table_name,
        row_id,
        name,
        SUM(value) AS sum
      FROM deleted_task
      GROUP BY table_name, row_id, name
      HAVING SUM(value) <> 0
    )
    SELECT
      table_name,
      row_id,
      name,
      sum
    FROM counter_sums
  LOOP
    EXECUTE format(
      'UPDATE %I SET %I = %I + %s WHERE id = %L',
      counter_sum.table_name,
      counter_sum.name,
      counter_sum.name,
      counter_sum.sum,
      counter_sum.row_id
    );
  END LOOP;
END;
$$ LANGUAGE plpgsql;

This function deletes a batch of tasks from the queue, calculates the sum for every unique counter (table name, row id, counter name), and executes a dynamic SQL statement to update the corresponding row in the target table for each sum. Note that the sums are calculated from the rows actually returned by the DELETE, so two concurrent runs can never apply the same task twice.
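
Assuming the management command module is named process_counting_tasks (the name is an assumption; it is not specified above), it can be scheduled via cron with manage.py, or invoked from Python.

from django.core.management import call_command

# Equivalent to `python manage.py process_counting_tasks` from a cronjob.
call_command("process_counting_tasks")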

Stage 4: Update via trigger

Instead of relying on Python/Django to update counters every time, let’s use a database trigger. This has several benefits.

  • Less code and less mistakes.
    Triggers are executed automatically by the database regardless of what part of the app is performing an operation.
  • Better reliability and correctness in case of failures.
    It is guaranteed to execute in the same transaction as the operation that triggered it.
  • Better performance.
    There are fewer roundtrips to the database.

Let’s add the following SQL in the blog app.

CREATE OR REPLACE FUNCTION blog_comment_update_counters()
RETURNS trigger
AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        IF NEW.publish_status = 'public' THEN
            INSERT INTO counting_task (table_name, row_id, name, value) VALUES ('blog_article', NEW.article_id, 'total_public_comments', +1);
            INSERT INTO counting_task (table_name, row_id, name, value) VALUES ('blog_user', NEW.creator_id, 'total_public_comments', +1);
        END IF;

        RETURN NEW;

    ELSIF TG_OP = 'UPDATE' THEN
        IF OLD.publish_status = 'private' AND NEW.publish_status = 'public' THEN
            INSERT INTO counting_task (table_name, row_id, name, value) VALUES ('blog_article', NEW.article_id, 'total_public_comments', +1);
            INSERT INTO counting_task (table_name, row_id, name, value) VALUES ('blog_user', NEW.creator_id, 'total_public_comments', +1);

        ELSIF OLD.publish_status = 'public' AND NEW.publish_status = 'private' THEN
            INSERT INTO counting_task (table_name, row_id, name, value) VALUES ('blog_article', NEW.article_id, 'total_public_comments', -1);
            INSERT INTO counting_task (table_name, row_id, name, value) VALUES ('blog_user', NEW.creator_id, 'total_public_comments', -1);
        END IF;

        RETURN NEW;

    ELSIF TG_OP = 'DELETE' THEN
        IF OLD.publish_status = 'public' THEN
            INSERT INTO counting_task (table_name, row_id, name, value) VALUES ('blog_article', OLD.article_id, 'total_public_comments', -1);
            INSERT INTO counting_task (table_name, row_id, name, value) VALUES ('blog_user', OLD.creator_id, 'total_public_comments', -1);
        END IF;

        RETURN OLD;
    END IF;
END;
$$
LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS blog_comment_update_counters_t ON blog_comment;

CREATE TRIGGER blog_comment_update_counters_t
AFTER INSERT OR UPDATE OR DELETE ON blog_comment
FOR EACH ROW EXECUTE PROCEDURE blog_comment_update_counters();

This function queues updates for two counters, for every article and every user.

  • Total number of public comments published on every article.
  • Total number of public comments created by every user.

This function is just an example; it does not handle every possible case. For example, it does not handle moving an existing comment to a different user or article, and it only counts public comments, not private ones. You will need to define which operations are supported and which counters are needed in your application, and handle those accordingly.
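
As a rough end-to-end check of the whole pipeline (model names from this article; the command name is assumed as above):

from django.core.management import call_command

from blog.models import Article, Comment, User

user = User.objects.create(username="alice")
article = Article.objects.create(content="Hello, world")

# The trigger fires on INSERT and queues two counting_task rows.
Comment.objects.create(
    creator=user, article=article, message="First!", publish_status="public"
)

call_command("process_counting_tasks")  # drains the queue, applies the sums

article.refresh_from_db()
user.refresh_from_db()
assert article.total_public_comments == 1
assert user.total_public_comments == 1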

References

This article is heavily based on an article by depesz, adapted for Django. It makes an interesting additional point about performance: the counter queue table could be made unlogged for a substantial speed-up, though this would also decrease reliability during failures.

Also check out this interesting article by Laurenz Albe. It discusses why count(*) is slow, and suggests a method for getting estimated counts for tables.

Yet another article, by Pierre Jambet, discusses the deadlock situations that can occur.