Notes from Conferences: Data-Oriented Design

Data-Oriented Design

Talk: https://www.youtube.com/watch?v=gfPNeQR1aSc


Software’s Only Job is to Transform Data

image


What Do Users Care About?

  • getting their output data -> they don’t care about what do you do in the middle


Data Characteristics:

  • format
  • volume
  • latency
  • throughput
  • statistical distribution


Context Is Everything

Example: check if a number exists in a given set

numbers: set[int]

def is_match(number: int) -> bool:
    return number in numbers 


But what is there’s just a single number in that set? In that case in operator is wasteful.

def is_match(number: int) -> bool:
    return number == 42


What is a billion numbers? Building a set in this case would exhaust the memory.

def is_match(number: int) -> bool:
    with connection.cursor():
        cursor.execute(
            "SELECT 1 FROM numbers WHERE n = %s",
            (number, ),
        )
        return (cursor.fetchone() is not None) 


What is matching many numbers against few? Many functions aren’t called just once, they’re called repeatedly, which might not be efficient.

numbers = set[int]

def matches(sought: set[int]) -> set[int]:
    return numbers.intersection(sought)


What is set is all off numbers between 1 and 1 million? In that case we could add more maths.

What if false positives are acceptable? In that case the bloom filter would be a better data structure.

-> the data that we use as the input will influence which function do we write and also the context

-> implementation depends on data characteristics and that is data-oriented design


Software’s only job is to transform data using specific hardware

General-purpose CPU

  • Django is mostly run on general-purpose CPU image
  • there’s a gap betweek processor and memory performance since 1980 image
  • processor can do a lot of stuff fast, but it can’t get stuff from memory fast
  • the solution to this issue is CPU cache between CPU and memory, so the data is transferred into cache, so it can be fetched quickly image
  • modern CPU’s have 3 levels of cache (sometimes even 4)
  • each layer of cache is 3x faster than the previous one image
  • how many cycles takes fetch from cache: image


Implications of This Design

  • use smaller representations
  • lay out data in access order
  • Python built-in types are big
    In [1]: 64 // 8
    Out [1]: 8
      
    In [2]: import sys
      
    In [3]: sys.getsizeof(9001)
    Out [3]: 28
    
    • 8 bytes for an int + 20 bytes for another stuff (PyObject id, etc.) ```python In [4]: numbers = list(range(1000))

    In [5]: len(numbers) * 8 Out[5]: 8000

    In [6]: ( …: sys.getsizeof(numbers) …: + sum(sys.getsizeof(n) for n in numbers) …: ) Out[6]: 36052 ```

  • C data types are much slimmer
    In [7]: import array
      
    In [8]: numbers2 = array.array('Q', range(1000))
      
    In [9]: sys.getsizeof(numbers2)
    Out[9]: 8320
    
  • normally use NumPy or Pandas for arrays


Data-Oriented Design for the Web

  • typical website: image
  • user <-> browser loop is already super optimized and quick image
  • but what to put in the something part? image
  • one option is to put there Django & Postgres image
  • another possible architecture: image
  • how fast does the data transfer need to be?
    • <100ms - great
    • <1s - pretty good
    • <3s - at 3s lose ~50% of visitors
    • 10s - users likely to retry or give up


How to Improve the Flow Between Browser and Django?

image

  • write minimal, performant HTML - if you get your CSS and JavaScript tags in the first kilobyte the browser will stop requesting those early on, don’t load HTML with loads of classes as some CSS frameworks to like to add etc.
  • introduce HTTP caching - tell browsers what to store forever, so for example images are loaded once only
  • get on the latest version of HTTP - 3, or at least 2
  • response compression (GZipMiddleware) - compresses requests, it’s a compression algorithm for text formats like HTML
  • HTML minification (django-minify-html) - saves around 1-3% of the request size
  • use a CDN (Content Delivery Network) - saving a lot of network time, as servers as closer to users


Resources

  • MDN: Web Performance
  • web.dev/learn
  • WebPageTest.org
  • web.dev/measure


How to Improve the flow Between Django and Database?

image image

  • avoid N+1 queries
    books = Book.objects.order_by("title")
    for book in books:
          print(book.title, "by", book.author.name)
    
    • iterate books + for each of N books fetch book.autor
    • instead you can use select_related
          books = (
              Book.objects.order_by("title")
              .select_related("author")
          )
          for book in books:
              print(book.title, "by", book.author.name)
      
      • select_related sends one query, where fetches books with author joined in
      • the drawback of select_related image
      • comes as single table of data
      • some values might be repetitive
    • another solution is to use prefetch_related
          books = (
              book.objects.order_by("title")
              .prefetch_related("author")
          )
          for book in books:
              print(book.title, "by", book.author.name)
      
      • the first query fetches the books and the second query fetches related authors for the fetched books
      • so if there’s 5 authors, which have written 1000 of books it will pull back the 5 authors once, no repetition
      • depends n the data volumes, it can be faster than select_related
    • you can also use django-auto-prefetch which works in the following way:
      • fetch books
      • on first access of book.author fetch related authors for all books
      • solves N+1 problem for foreign keys and one-to-one fields (it doesn’t help with many-to-many)
  • split models
    • imagine the following model:
        class User(AbstractUser):
            avatar = models.ImageField(...)
            ...
      
    • you’re given the following task: store user’s ACME access token and refresh token
    • one solution:
        class User(AbstractUser):
            avatar = models.ImageField(...)
            ...
            acme_access_token = models.TextField()
            acme_access_expires = models.DateTimeField()
            acme_refresh_token = models.TextField()
      
    • the problem with this solution is that it’s going to slow down every place where users are queried
    • every time you update a row, Postgres creates a whole new copy of that data every time
    • better solution:
        class User(AbstractUser):
            avatar = models.ImageField(...)
            ...
                  
        class UserAcmeToken(models.Model):
            user = models.OneToOneField(User, primary_key=True)
            acme_access_token = models.TextField()
            acme_access_expires = models.DateTimeField()
      
  • multiple counts in one pass
    • example:
        published_count = (
            author.book_set.filter(published=True).count()
        )
        unpublished_count = (
            author.book_set.filter(published=False).count()
        )
      
    • we can use another way to ask for both parts at the same time:
        counts = (
            author.book_set.aggregate(
                published=Count('pk', filter=Q(published=True)),
                unpublished=Count('pk', filter=Q(published=False)),
            )
        )
      
    • in this way Postgres goes only ones over data


Resources

  • Docs: Database access optimization
  • The Temple of Django Database Performance
  • Post: “Django and the N+1 Queries Problem”
  • django-debug-toolbar or Kolo


Resources for Data-Oriented Design

  • Mike Acton - Data-Oriented Design and C++
  • Andrew Kelley - Practical DOD
  • Andreas Fredriksson - Context is Everything
Mia Bajić's Picture

About Mia Bajić

I’m a Prague-based software engineer passionate about knowledge sharing & community building. I’m the main organizer of Prague Python Pizza & Prague Python meetups, and a co-organizer of EuroPython & PyCon CZ.