Notes from Conferences: Data-Oriented Design
Data-Oriented Design
Talk: https://www.youtube.com/watch?v=gfPNeQR1aSc
Software’s Only Job is to Transform Data
What Do Users Care About?
- getting their output data -> they don't care what you do in the middle
Data Characteristics:
- format
- volume
- latency
- throughput
- statistical distribution
Context Is Everything
Example: check if a number exists in a given set
```python
numbers: set[int]

def is_match(number: int) -> bool:
    return number in numbers
```
But what if there's just a single number in that set? In that case the `in` operator is wasteful.
```python
def is_match(number: int) -> bool:
    return number == 42
```
What if there are a billion numbers? Building a set in that case would exhaust the memory.
```python
from django.db import connection

def is_match(number: int) -> bool:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT 1 FROM numbers WHERE n = %s",
            (number,),
        )
        return cursor.fetchone() is not None
```
What if we're matching many numbers against a few? Many functions aren't called just once, they're called repeatedly, so checking one number at a time might not be efficient.
```python
numbers: set[int]

def matches(sought: set[int]) -> set[int]:
    return numbers.intersection(sought)
```
What if the set is all of the numbers between 1 and 1 million? In that case we could replace the lookup with a bit of maths, as sketched below.
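A minimal sketch of that arithmetic check (bounds taken from the example above):

```python
def is_match(number: int) -> bool:
    # Membership in a contiguous range is just a comparison.
    return 1 <= number <= 1_000_000
```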
What if false positives are acceptable? In that case a Bloom filter would be a better data structure.
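To make that concrete, a minimal Bloom filter sketch (sizing and hash scheme are illustrative, not from the talk):

```python
import hashlib


class BloomFilter:
    """Probabilistic set: never false negatives, occasionally false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: int):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: int) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = BloomFilter()
bloom.add(42)
print(42 in bloom)  # True
print(7 in bloom)   # almost certainly False
```

The filter occupies a fixed 128 KiB no matter how many numbers go in: exactness is traded for a fixed, cache-friendly memory footprint.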
…
-> the data that we use as the input, and the context around it, influence which function we write
-> the implementation depends on the data's characteristics, and that is data-oriented design
Software’s only job is to transform data using specific hardware
General-purpose CPU
- Django is mostly run on general-purpose CPU
- there's been a growing gap between processor and memory performance since 1980
- processor can do a lot of stuff fast, but it can’t get stuff from memory fast
- the solution to this issue is a CPU cache between the CPU and memory: data is transferred into the cache, from which it can be fetched quickly
- modern CPUs have 3 levels of cache (sometimes even 4)
- each level of cache is roughly 3x faster than the one below it
- fetch latency grows with each level: roughly ~4 cycles for L1, ~12 for L2, ~40 for L3, and 100+ for main memory (typical ballpark figures; exact numbers vary by CPU)
Implications of This Design
- use smaller representations
- lay out data in access order
- Python built-in types are big
```python
In [1]: 64 // 8
Out[1]: 8

In [2]: import sys

In [3]: sys.getsizeof(9001)
Out[3]: 28
```
- 8 bytes for the int value + 20 bytes of overhead (the PyObject header: refcount, type pointer, etc.)

```python
In [4]: numbers = list(range(1000))

In [5]: len(numbers) * 8
Out[5]: 8000

In [6]: (
   ...:     sys.getsizeof(numbers)
   ...:     + sum(sys.getsizeof(n) for n in numbers)
   ...: )
Out[6]: 36052
```
- C data types are much slimmer
```python
In [7]: import array

In [8]: numbers2 = array.array('Q', range(1000))

In [9]: sys.getsizeof(numbers2)
Out[9]: 8320
```
- normally use NumPy or Pandas for arrays
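A quick sketch of both implications with NumPy (my example, assuming NumPy is installed): packed arrays have no per-element object overhead, and iterating in memory order beats striding across it:

```python
import time

import numpy as np

# Smaller representations: 1000 unsigned 64-bit ints take exactly 8000 bytes.
numbers3 = np.arange(1000, dtype=np.uint64)
print(numbers3.nbytes)  # 8000

# Access order: rows are contiguous in memory, columns are strided.
grid = np.zeros((5_000, 5_000), dtype=np.uint8)

start = time.perf_counter()
for row in grid:        # walks memory in order: cache-friendly
    row.sum()
rows_time = time.perf_counter() - start

start = time.perf_counter()
for col in grid.T:      # jumps 5000 bytes per element: cache-hostile
    col.sum()
cols_time = time.perf_counter() - start

print(f"rows: {rows_time:.3f}s  columns: {cols_time:.3f}s")
```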
Data-Oriented Design for the Web
- typical website: user <-> browser <-> something
- user <-> browser loop is already super optimized and quick
- but what to put in the "something" part?
- one option is to put Django & Postgres there
- another possible architecture:
- how fast does the data transfer need to be?
- <100ms - great
- <1s - pretty good
- <3s - at 3s lose ~50% of visitors
- >10s - users likely to retry or give up
How to Improve the Flow Between Browser and Django?
- write minimal, performant HTML - if you get your CSS and JavaScript tags into the first kilobyte, the browser will start requesting those files early on; don't bloat the HTML with loads of classes, as some CSS frameworks like to add
- introduce HTTP caching - tell browsers what they can store forever, so for example images are downloaded only once (see the settings sketch after this list)
- get on the latest version of HTTP - 3, or at least 2
- response compression (GZipMiddleware) - compresses responses; gzip is a compression algorithm that works well for text formats like HTML
- HTML minification (django-minify-html) - saves around 1-3% of the response size
- use a CDN (Content Delivery Network) - saves a lot of network time, as the servers are closer to users
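A minimal sketch of the Django side of the caching and compression points (my addition; assumes Django 4.2+ for the STORAGES setting, and the view is hypothetical):

```python
# settings.py
MIDDLEWARE = [
    "django.middleware.gzip.GZipMiddleware",  # compress text responses
    # ... the rest of your middleware ...
]

# Hashed static filenames let browsers cache those files "forever":
STORAGES = {
    "default": {"BACKEND": "django.core.files.storage.FileSystemStorage"},
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.ManifestStaticFilesStorage",
    },
}
```

```python
# views.py - long-lived caching for a response that never changes
from django.views.decorators.cache import cache_control


@cache_control(max_age=60 * 60 * 24 * 365, immutable=True)
def avatar(request, user_id):
    ...
```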
Resources
- MDN: Web Performance
- web.dev/learn
- WebPageTest.org
- web.dev/measure
How to Improve the Flow Between Django and the Database?
- avoid N+1 queries
```python
books = Book.objects.order_by("title")
for book in books:
    print(book.title, "by", book.author.name)
```
- this iterates the books, then for each of the N books sends an extra query to fetch book.author
- instead you can use `select_related`:

```python
books = (
    Book.objects.order_by("title")
    .select_related("author")
)
for book in books:
    print(book.title, "by", book.author.name)
```
- `select_related` sends one query, which fetches the books with their authors joined in
- the drawbacks of `select_related`:
  - the result comes back as a single table of data
  - some values might be repeated (the same author on many rows)
- another solution is to use `prefetch_related`:

```python
books = (
    Book.objects.order_by("title")
    .prefetch_related("author")
)
for book in books:
    print(book.title, "by", book.author.name)
```
- the first query fetches the books and the second query fetches related authors for the fetched books
- so if there are 5 authors who have written 1000 books, it pulls back the 5 authors only once, with no repetition
- depending on the data volumes, it can be faster than `select_related`
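One way to see the two-query behaviour is in a test, sketched here with the model names from the examples above (the app layout is hypothetical):

```python
from django.test import TestCase

from example.models import Author, Book  # hypothetical app layout


class PrefetchTests(TestCase):
    def test_prefetch_related_needs_only_two_queries(self):
        author = Author.objects.create(name="Ursula")
        Book.objects.create(title="A", author=author)
        Book.objects.create(title="B", author=author)

        # One query for the books, one for all their authors:
        with self.assertNumQueries(2):
            books = Book.objects.order_by("title").prefetch_related("author")
            for book in books:
                book.author.name
```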
- you can also use `django-auto-prefetch` (see the sketch after this list), which works in the following way:
  - fetch books
  - on first access of `book.author`, fetch the related authors for all the books
- it solves the N+1 problem for foreign keys and one-to-one fields (it doesn't help with many-to-many)
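A sketch of opting models in, going from my reading of the django-auto-prefetch README (double-check against the package docs):

```python
import auto_prefetch
from django.db import models


class Author(auto_prefetch.Model):
    name = models.TextField()


class Book(auto_prefetch.Model):
    title = models.TextField()
    author = auto_prefetch.ForeignKey(Author, on_delete=models.CASCADE)
```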
- split models
- imagine the following model:
```python
class User(AbstractUser):
    avatar = models.ImageField(...)
    ...
```
- you're given the following task: store each user's ACME access token and refresh token
- one solution:
```python
class User(AbstractUser):
    avatar = models.ImageField(...)
    ...
    acme_access_token = models.TextField()
    acme_access_expires = models.DateTimeField()
    acme_refresh_token = models.TextField()
```
- the problem with this solution is that it’s going to slow down every place where users are queried
- every time you update a row, Postgres creates a whole new copy of that row
- better solution:
```python
class User(AbstractUser):
    avatar = models.ImageField(...)
    ...

class UserAcmeToken(models.Model):
    user = models.OneToOneField(
        User, primary_key=True, on_delete=models.CASCADE
    )
    acme_access_token = models.TextField()
    acme_access_expires = models.DateTimeField()
    acme_refresh_token = models.TextField()
```
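Usage then costs one extra query, only on the code paths that talk to ACME (a sketch; `some_user` and `some_pk` are placeholders):

```python
# Fetch the token row only where the integration needs it:
token = UserAcmeToken.objects.get(user=some_user)

# Or join it in explicitly when you know you'll need both:
user = User.objects.select_related("useracmetoken").get(pk=some_pk)
token = user.useracmetoken
```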
- multiple counts in one pass
- example:
```python
published_count = (
    author.book_set.filter(published=True).count()
)
unpublished_count = (
    author.book_set.filter(published=False).count()
)
```
- we can use another way to ask for both counts at the same time:
```python
from django.db.models import Count, Q

counts = author.book_set.aggregate(
    published=Count("pk", filter=Q(published=True)),
    unpublished=Count("pk", filter=Q(published=False)),
)
```
- this way Postgres goes over the data only once, and the result comes back as a single dict, e.g. {"published": 12, "unpublished": 3}
Resources
- Docs: Database access optimization
- The Temple of Django Database Performance
- Post: “Django and the N+1 Queries Problem”
- django-debug-toolbar or Kolo
Resources for Data-Oriented Design
- Mike Acton - Data-Oriented Design and C++
- Andrew Kelley - Practical DOD
- Andreas Fredriksson - Context is Everything