Notes from Conferences: Data-Oriented Design
Data-Oriented Design
Talk: https://www.youtube.com/watch?v=gfPNeQR1aSc
Software’s Only Job is to Transform Data
What Do Users Care About?
- getting their output data -> they don't care what you do in the middle
Data Characteristics:
- format
- volume
- latency
- throughput
- statistical distribution
Context Is Everything
Example: check if a number exists in a given set
```python
numbers: set[int]

def is_match(number: int) -> bool:
    return number in numbers
```
But what if there's just a single number in that set? In that case the `in` operator is wasteful.
```python
def is_match(number: int) -> bool:
    return number == 42
```
What if there are a billion numbers? Building a set in that case would exhaust the memory.
```python
from django.db import connection

def is_match(number: int) -> bool:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT 1 FROM numbers WHERE n = %s",
            (number,),
        )
        return cursor.fetchone() is not None
```
What if we're matching many numbers against a few? Many functions aren't called just once, they're called repeatedly, so checking one number at a time might not be efficient.
```python
numbers: set[int]

def matches(sought: set[int]) -> set[int]:
    return numbers.intersection(sought)
```
What if the set is all of the numbers between 1 and 1 million? In that case we could replace the lookup with a bit of maths, as sketched below.
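A minimal sketch of that arithmetic check (bounds taken from the example above):

```python
def is_match(number: int) -> bool:
    # Membership in a contiguous range is just a comparison.
    return 1 <= number <= 1_000_000
```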
What if false positives are acceptable? In that case a Bloom filter would be a better data structure.
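To make that concrete, a minimal Bloom filter sketch (sizing and hash scheme are illustrative, not from the talk):

```python
import hashlib


class BloomFilter:
    """Probabilistic set: never false negatives, occasionally false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: int):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: int) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = BloomFilter()
bloom.add(42)
print(42 in bloom)  # True
print(7 in bloom)   # almost certainly False
```

The filter occupies a fixed 128 KiB no matter how many numbers go in: exactness is traded for a fixed, cache-friendly memory footprint.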
…
-> the data that we use as the input, and the context around it, influence which function we write
-> the implementation depends on the data's characteristics, and that is data-oriented design
Software’s only job is to transform data using specific hardware
General-purpose CPU
- Django is mostly run on general-purpose CPU
- there's been a growing gap between processor and memory performance since 1980
- processor can do a lot of stuff fast, but it can’t get stuff from memory fast
- the solution to this issue is a CPU cache between the CPU and memory: data is transferred into the cache, from which it can be fetched quickly
- modern CPUs have 3 levels of cache (sometimes even 4)
- each level of cache is roughly 3x faster than the one below it
- fetch latency grows with each level: roughly ~4 cycles for L1, ~12 for L2, ~40 for L3, and 100+ for main memory (typical ballpark figures; exact numbers vary by CPU)
Implications of This Design
- use smaller representations
- lay out data in access order
- Python built-in types are big
```python
In [1]: 64 // 8
Out[1]: 8

In [2]: import sys

In [3]: sys.getsizeof(9001)
Out[3]: 28
```
- 8 bytes for the int value + 20 bytes of overhead (the PyObject header: refcount, type pointer, etc.)

```python
In [4]: numbers = list(range(1000))

In [5]: len(numbers) * 8
Out[5]: 8000

In [6]: (
   ...:     sys.getsizeof(numbers)
   ...:     + sum(sys.getsizeof(n) for n in numbers)
   ...: )
Out[6]: 36052
```
- C data types are much slimmer
```python
In [7]: import array

In [8]: numbers2 = array.array('Q', range(1000))

In [9]: sys.getsizeof(numbers2)
Out[9]: 8320
```
- normally use NumPy or Pandas for arrays
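A quick sketch of both implications with NumPy (my example, assuming NumPy is installed): packed arrays have no per-element object overhead, and iterating in memory order beats striding across it:

```python
import time

import numpy as np

# Smaller representations: 1000 unsigned 64-bit ints take exactly 8000 bytes.
numbers3 = np.arange(1000, dtype=np.uint64)
print(numbers3.nbytes)  # 8000

# Access order: rows are contiguous in memory, columns are strided.
grid = np.zeros((5_000, 5_000), dtype=np.uint8)

start = time.perf_counter()
for row in grid:        # walks memory in order: cache-friendly
    row.sum()
rows_time = time.perf_counter() - start

start = time.perf_counter()
for col in grid.T:      # jumps 5000 bytes per element: cache-hostile
    col.sum()
cols_time = time.perf_counter() - start

print(f"rows: {rows_time:.3f}s  columns: {cols_time:.3f}s")
```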
Data-Oriented Design for the Web
- typical website: user <-> browser <-> something
- user <-> browser loop is already super optimized and quick
- but what to put in the "something" part?
- one option is to put Django & Postgres there
- another possible architecture:
- how fast does the data transfer need to be?
- <100ms - great
- <1s - pretty good
- <3s - at 3s lose ~50% of visitors
- >10s - users likely to retry or give up
How to Improve the Flow Between Browser and Django?
- write minimal, performant HTML - if you get your CSS and JavaScript tags into the first kilobyte, the browser will start requesting those files early on; don't bloat the HTML with loads of classes, as some CSS frameworks like to add
- introduce HTTP caching - tell browsers what they can store forever, so for example images are downloaded only once (see the settings sketch after this list)
- get on the latest version of HTTP - 3, or at least 2
- response compression (GZipMiddleware) - compresses responses; gzip is a compression algorithm that works well for text formats like HTML
- HTML minification (django-minify-html) - saves around 1-3% of the response size
- use a CDN (Content Delivery Network) - saves a lot of network time, as the servers are closer to users
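A minimal sketch of the Django side of the caching and compression points (my addition; assumes Django 4.2+ for the STORAGES setting, and the view is hypothetical):

```python
# settings.py
MIDDLEWARE = [
    "django.middleware.gzip.GZipMiddleware",  # compress text responses
    # ... the rest of your middleware ...
]

# Hashed static filenames let browsers cache those files "forever":
STORAGES = {
    "default": {"BACKEND": "django.core.files.storage.FileSystemStorage"},
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.ManifestStaticFilesStorage",
    },
}
```

```python
# views.py - long-lived caching for a response that never changes
from django.views.decorators.cache import cache_control


@cache_control(max_age=60 * 60 * 24 * 365, immutable=True)
def avatar(request, user_id):
    ...
```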
Resources
- MDN: Web Performance
- web.dev/learn
- WebPageTest.org
- web.dev/measure
How to Improve the Flow Between Django and the Database?
- avoid N+1 queries
```python
books = Book.objects.order_by("title")
for book in books:
    print(book.title, "by", book.author.name)
```
- this iterates the books, then for each of the N books sends an extra query to fetch book.author
- instead you can use `select_related`:

```python
books = (
    Book.objects.order_by("title")
    .select_related("author")
)
for book in books:
    print(book.title, "by", book.author.name)
```
- `select_related` sends one query, which fetches the books with their authors joined in
- the drawbacks of `select_related`:
  - the result comes back as a single table of data
  - some values might be repeated (the same author on many rows)
- another solution is to use `prefetch_related`:

```python
books = (
    Book.objects.order_by("title")
    .prefetch_related("author")
)
for book in books:
    print(book.title, "by", book.author.name)
```
- the first query fetches the books and the second query fetches related authors for the fetched books
- so if there are 5 authors who have written 1000 books, it pulls back the 5 authors only once, with no repetition
- depending on the data volumes, it can be faster than `select_related`
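One way to see the two-query behaviour is in a test, sketched here with the model names from the examples above (the app layout is hypothetical):

```python
from django.test import TestCase

from example.models import Author, Book  # hypothetical app layout


class PrefetchTests(TestCase):
    def test_prefetch_related_needs_only_two_queries(self):
        author = Author.objects.create(name="Ursula")
        Book.objects.create(title="A", author=author)
        Book.objects.create(title="B", author=author)

        # One query for the books, one for all their authors:
        with self.assertNumQueries(2):
            books = Book.objects.order_by("title").prefetch_related("author")
            for book in books:
                book.author.name
```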
- you can also use `django-auto-prefetch` (see the sketch after this list), which works in the following way:
  - fetch books
  - on first access of `book.author`, fetch the related authors for all the books
- it solves the N+1 problem for foreign keys and one-to-one fields (it doesn't help with many-to-many)
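A sketch of opting models in, going from my reading of the django-auto-prefetch README (double-check against the package docs):

```python
import auto_prefetch
from django.db import models


class Author(auto_prefetch.Model):
    name = models.TextField()


class Book(auto_prefetch.Model):
    title = models.TextField()
    author = auto_prefetch.ForeignKey(Author, on_delete=models.CASCADE)
```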
- split models
- imagine the following model:
```python
class User(AbstractUser):
    avatar = models.ImageField(...)
    ...
```
- you're given the following task: store each user's ACME access token and refresh token
- one solution:
```python
class User(AbstractUser):
    avatar = models.ImageField(...)
    ...
    acme_access_token = models.TextField()
    acme_access_expires = models.DateTimeField()
    acme_refresh_token = models.TextField()
```
- the problem with this solution is that it’s going to slow down every place where users are queried
- every time you update a row, Postgres creates a whole new copy of that row
- better solution:
```python
class User(AbstractUser):
    avatar = models.ImageField(...)
    ...

class UserAcmeToken(models.Model):
    user = models.OneToOneField(
        User, primary_key=True, on_delete=models.CASCADE
    )
    acme_access_token = models.TextField()
    acme_access_expires = models.DateTimeField()
    acme_refresh_token = models.TextField()
```
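Usage then costs one extra query, only on the code paths that talk to ACME (a sketch; `some_user` and `some_pk` are placeholders):

```python
# Fetch the token row only where the integration needs it:
token = UserAcmeToken.objects.get(user=some_user)

# Or join it in explicitly when you know you'll need both:
user = User.objects.select_related("useracmetoken").get(pk=some_pk)
token = user.useracmetoken
```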
- multiple counts in one pass
- example:
```python
published_count = (
    author.book_set.filter(published=True).count()
)
unpublished_count = (
    author.book_set.filter(published=False).count()
)
```
- we can use another way to ask for both counts at the same time:
```python
from django.db.models import Count, Q

counts = author.book_set.aggregate(
    published=Count("pk", filter=Q(published=True)),
    unpublished=Count("pk", filter=Q(published=False)),
)
```
- this way Postgres goes over the data only once, and the result comes back as a single dict, e.g. {"published": 12, "unpublished": 3}
Resources
- Docs: Database access optimization
- The Temple of Django Database Performance
- Post: “Django and the N+1 Queries Problem”
- django-debug-toolbar or Kolo
Resources for Data-Oriented Design
- Mike Acton - Data-Oriented Design and C++
- Andrew Kelley - Practical DOD
- Andreas Fredriksson - Context is Everything