Almost all web apps will need counters. Let’s take a look at how to create counters that are accurate and have good performance.
Perhaps you want to count the number of related objects per object, such as the total number of comments per article. Or perhaps you want to count the total number of things in a system, such as the total number of users. In either case, there are certain things to take into consideration, both to avoid degrading performance and to ensure correctness.
Foreword: Transaction isolation
In order to understand how to make atomic operations in SQL, it is necessary to first understand isolation levels. The isolation level affects how the database will create and keep locks on the data, which are used to handle conflict scenarios.
At a minimum, read the Wikipedia page on Isolation; the Postgres Transaction Isolation docs are also really good.
This article is written for PostgreSQL using the default isolation level Read Committed.
In MySQL/MariaDB the default isolation level is Repeatable Read.
Example models
As an example, let’s say that there is an app called blog with these models.
class User(models.Model):
    username = models.CharField()


class Article(models.Model):
    content = models.CharField()


class Comment(models.Model):
    creator = models.ForeignKey(
        "blog.User", on_delete=models.CASCADE, related_name="comments"
    )
    article = models.ForeignKey(
        "blog.Article", on_delete=models.CASCADE, related_name="comments"
    )
    message = models.CharField()
    publish_status = models.CharField(
        max_length=100,
        choices=[("public", "Public"), ("private", "Private")],
    )
Stage 1: Subqueries
Let’s say that there is a requirement to show the number of public comments for each article in a list of articles.
A common mistake
You might start out by doing something like this.
articles = Article.objects.all()
for article in articles:
    print(article.comments.filter(publish_status="public").count())
This is going to produce a lot of queries: one count per article, the classic N+1 problem. If you are still writing queries like this, you absolutely have to learn how to use subqueries.
A bit better
Let’s use annotate and a subquery instead; this will dramatically improve performance.
Article.objects.annotate(
    total_public_comments=models.Count(
        "comments", filter=models.Q(comments__publish_status="public")
    )
)
This is probably the easiest way to start, and it is fine, up to a point. But it may not scale far enough — the count will be executed every single time, even if the value has not changed, and as the number of comments grows the count will take longer.
What if you have a lot of articles and comments, and there is a requirement to find the top ten articles with the most comments?
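With the annotation approach, that top-ten query would look something like the sketch below, and the full count runs on every request:

# Recounts every comment of every article on each call, just to sort.
top_articles = Article.objects.annotate(
    total_public_comments=models.Count(
        "comments", filter=models.Q(comments__publish_status="public")
    )
).order_by("-total_public_comments")[:10]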
Stage 2: Dedicated counters
Let’s use dedicated counters to keep track of the values. This has several benefits.
- Avoid counting all the objects repeatedly; only increment/decrement the existing value.
- Counters are immediately available everywhere for displaying, ordering, and filtering.
Let’s add the following fields to the models.
class User(models.Model):
    ...
    total_public_comments = models.BigIntegerField(default=0)


class Article(models.Model):
    ...
    total_public_comments = models.BigIntegerField(default=0)
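With the dedicated field in place, the top-ten question from before becomes an ordinary ORDER BY on an existing column:

# No counting at query time; the value is read straight from the column.
top_articles = Article.objects.order_by("-total_public_comments")[:10]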
But now it is important to really think about how to keep this counter up to date.
A common mistake
I have seen code like this on several occasions, which is a mistake.
article = Article.objects.get(...)
article.total_public_comments += 1
article.save()
This is unfortunately not good at all. The incrementation happens in Python, not in the database, which means that it is not an atomic operation. There is a window between reading the old value from the database and writing the new value to the database, during which other processes may be running the same function. This leads to one process overwriting the value of the other, resulting in an inconsistent value.
A bit better
To ensure the counter value is correct, it is necessary to do an atomic operation known as Compare-and-swap or Fetch-and-add.
One might sprinkle statements like this all over the place.
article = Article.objects.get(...)
article.total_public_comments = models.F("total_public_comments") + 1
article.save(update_fields=["total_public_comments"])
This will perform SQL like this.
UPDATE "blog_article"
SET "total_public_comments" = "blog_article"."total_public_comments" + 1
WHERE "blog_article"."id" = ...
This is definitely better, as it does an atomic update at the database level, which ensures the value is actually correct.
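Equivalently, the initial get() can be skipped by doing the increment through the queryset, which issues the same atomic UPDATE in a single round trip:

# article_id: the id of the row to increment.
Article.objects.filter(id=article_id).update(
    total_public_comments=models.F("total_public_comments") + 1
)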
But it is brittle, because you have to remember to add it everywhere. For example, it would be very nice to not have to modify the Django admin just to keep the counters up to date.
And it has the potential to deadlock if multiple processes are trying to update the same counter. It is always possible to avoid deadlocks by carefully crafting the code in a certain way. But it would be nice to have a solution that does not even have the potential for deadlocks. (Or dreadlocks, as I like to jokingly say.)
Stage 3: Queue updates
As long as multiple processes are trying to update the same counter value, there is a potential for deadlocks. The solution to this problem is to have a queue of updates, which is applied periodically. This has several benefits.
- Eliminates deadlocks. To update a counter, an INSERT operation is used instead of an UPDATE.
- Better performance. An INSERT is faster than an UPDATE or an INSERT ... ON CONFLICT ... DO UPDATE (aka upsert).
Let’s create a new app called counting, and add this model.
class Task(models.Model):
    id = models.AutoField(primary_key=True, editable=False, verbose_name="ID")
    table_name = models.CharField(max_length=200, null=True, blank=True)
    row_id = models.IntegerField(null=True, blank=True)
    name = models.CharField(max_length=200)
    value = models.IntegerField()
Since row_id is an IntegerField, this assumes that all models are standardized to use integers for their id field.
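Queueing a counter update from Python is then a plain INSERT through the ORM. A minimal sketch, using a made-up article id:

from counting.models import Task

# Enqueue an increment; no row in blog_article is read, locked, or updated here.
Task.objects.create(
    table_name="blog_article",
    row_id=42,  # hypothetical article id
    name="total_public_comments",
    value=1,
)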
Process counter tasks
Let’s add a management command in the counting app that processes queue tasks and updates counters. This management command will need to be scheduled to run repeatedly, for example via a cronjob.
from django import db
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        with db.connection.cursor() as cursor:
            cursor.callproc("counting_process_tasks")
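For testing, the same command can also be invoked from Python; the command name is derived from the module filename, assumed here to be process_tasks.py:

from django.core.management import call_command

# Runs counting/management/commands/process_tasks.py (filename assumed above).
call_command("process_tasks")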
It would be entirely possible to implement this function in Python, safely and correctly. But if it is implemented as a database function, performance will be much better.
Django does not have built-in support for creating functions. Custom SQL can be added by manually adding a RunSQL operation to a migration.
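A minimal sketch of such a migration, assuming the function’s SQL is kept in a string named PROCESS_TASKS_SQL and that the app already has an initial migration:

from django.db import migrations

# Hypothetical module holding the SQL string shown below.
from counting.sql import PROCESS_TASKS_SQL


class Migration(migrations.Migration):
    dependencies = [("counting", "0001_initial")]

    operations = [
        migrations.RunSQL(
            sql=PROCESS_TASKS_SQL,
            reverse_sql="DROP FUNCTION IF EXISTS counting_process_tasks(INT)",
        ),
    ]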
Let’s add the following SQL in the counting app.
CREATE OR REPLACE FUNCTION counting_process_tasks(task_limit INT DEFAULT 1000)
RETURNS void
AS $$
DECLARE
    counter_sum RECORD;
BEGIN
    FOR counter_sum IN
        WITH
        task_ids AS (
            SELECT
                MIN(id) AS min_id,
                MIN(id) + task_limit AS max_id
            FROM counting_task
        ),
        deleted_task AS (
            DELETE FROM counting_task
            WHERE id BETWEEN
                (SELECT min_id FROM task_ids) AND
                (SELECT max_id FROM task_ids)
            RETURNING *
        ),
        counter_sums AS (
            SELECT
                table_name,
                row_id,
                name,
                SUM(value) AS sum
            FROM deleted_task
            GROUP BY table_name, row_id, name
            HAVING SUM(value) <> 0
        )
        SELECT
            table_name,
            row_id,
            name,
            sum
        FROM counter_sums
    LOOP
        EXECUTE format(
            'UPDATE %I SET %I = %I + %s WHERE id = %L',
            counter_sum.table_name,
            counter_sum.name,
            counter_sum.name,
            counter_sum.sum,
            counter_sum.row_id
        );
    END LOOP;
END;
$$ LANGUAGE plpgsql;
This function will delete a number of tasks from the queue, calculate the sum for every unique counter name, and execute a dynamic SQL statement to update the row in the target table for each sum.
Stage 4: Update via trigger
Instead of relying on Python/Django to update counters every time, let’s use a database trigger. This has several benefits.
- Less code and fewer mistakes. Triggers are executed automatically by the database, regardless of which part of the app is performing an operation.
- Better reliability and correctness in case of failures. A trigger is guaranteed to execute in the same transaction as the operation that triggered it.
- Better performance. There are fewer roundtrips to the database.
Let’s add the following SQL in the blog app.
CREATE OR REPLACE FUNCTION blog_comment_update_counters()
RETURNS trigger
AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        IF NEW.publish_status = 'public' THEN
            INSERT INTO counting_task (table_name, row_id, name, value)
            VALUES ('blog_article', NEW.article_id, 'total_public_comments', +1);
            INSERT INTO counting_task (table_name, row_id, name, value)
            VALUES ('blog_user', NEW.creator_id, 'total_public_comments', +1);
        END IF;
        RETURN NEW;
    ELSIF TG_OP = 'UPDATE' THEN
        IF OLD.publish_status = 'private' AND NEW.publish_status = 'public' THEN
            INSERT INTO counting_task (table_name, row_id, name, value)
            VALUES ('blog_article', NEW.article_id, 'total_public_comments', +1);
            INSERT INTO counting_task (table_name, row_id, name, value)
            VALUES ('blog_user', NEW.creator_id, 'total_public_comments', +1);
        ELSIF OLD.publish_status = 'public' AND NEW.publish_status = 'private' THEN
            INSERT INTO counting_task (table_name, row_id, name, value)
            VALUES ('blog_article', NEW.article_id, 'total_public_comments', -1);
            INSERT INTO counting_task (table_name, row_id, name, value)
            VALUES ('blog_user', NEW.creator_id, 'total_public_comments', -1);
        END IF;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        IF OLD.publish_status = 'public' THEN
            INSERT INTO counting_task (table_name, row_id, name, value)
            VALUES ('blog_article', OLD.article_id, 'total_public_comments', -1);
            INSERT INTO counting_task (table_name, row_id, name, value)
            VALUES ('blog_user', OLD.creator_id, 'total_public_comments', -1);
        END IF;
        RETURN OLD;
    END IF;
END;
$$
LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS blog_comment_update_counters_t ON blog_comment;

CREATE TRIGGER blog_comment_update_counters_t
AFTER INSERT OR UPDATE OR DELETE ON blog_comment
FOR EACH ROW EXECUTE PROCEDURE blog_comment_update_counters();
This function maintains two counters.
- Total number of public comments published on each article.
- Total number of public comments created by each user.
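As a rough end-to-end sketch under the assumptions above (the process_tasks command name, and the models from earlier): publishing a comment through the ORM fires the trigger, which enqueues tasks, and the counters catch up the next time the command runs.

from django.core.management import call_command

from blog.models import Article, Comment, User
from counting.models import Task

user = User.objects.create(username="alice")
article = Article.objects.create(content="Hello, world")

Comment.objects.create(
    creator=user,
    article=article,
    message="Nice article!",
    publish_status="public",
)

# The trigger queued one task per counter (article and user); no Python code did this.
assert Task.objects.filter(name="total_public_comments").count() == 2

call_command("process_tasks")  # normally run on a schedule
article.refresh_from_db()
assert article.total_public_comments == 1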
References
This article is heavily based on an article by depesz, adapted for Django. It makes an interesting additional point about performance: the counter queue table could be made unlogged for a substantial speedup, but this would also decrease reliability during failures.
Also check out this interesting article by Laurenz Albe. It discusses why count(*) is slow, and suggests a method for getting estimated counts for tables.
Yet another article, by Pierre Jambet, discusses the deadlock situations that can occur.