Hisham-Pak/dj-gsoc-2024-proposal-idcb.md

## dj-gsoc-2024-proposal-idcb.md

      
    Raw
  

              dj-gsoc-2024-proposal-idcb.md
            
          
    "Improve Database Cache Backend" By Hisham Mahmood, Django, 2024

Table of contents


Abstract
Implementation Plan

Using ORM instead of SQL queries
Compression zlib
Probabilistic Culling
Making Use of Upsert
Make set_many Method Use Single Query
Atomic incr/decr


Timeline
References
About me

1. Abstract

This project aims to boost performance, efficiency and reliability of Django’s database cache backend. The proposed improvements are:

Using ORM instead of SQL queries: Simplifies the process.
Implementation of zlib Compression: Optimizes storage efficiency by reducing the size of cached data.
Introduction of Atomic Increment/Decrement Operations: Ensures data integrity during simultaneous updates.
Adoption of Probabilistic Culling: Effectively manages cache size and the eviction of older data.
Transition from Text to Blob for Value Storage: Enhances space utilization and performance by storing cache values as blobs instead of text.
Make set_many Method Use Single Query: Reduces database load and improves speed by making set_many method execute single or fewer queries.
Utilization of Upsert Operations: Efficiently handles insert or update operations.

2. Implementation Plan

Using ORM instead of SQL queries

To use orm we can create model at runtime:
...
from django.apps import apps

class BaseDatabaseCache(BaseCache):
    def __init__(self, table, params):
        super().__init__(params)
        self._table = table
        self.cache_model_class = self._get_cache_entry_model(self._table)
    
    def _get_cache_entry_model(self, table):
        try:
            model = apps.get_registered_model("django_cache", "cacheentry")
            if table == model._meta.db_table:
                return model
            del apps.all_models["django_cache"]["cacheentry"]
        except (LookupError, KeyError):
            pass

        class CacheEntry(models.Model):
            cache_key = models.CharField(max_length=255, unique=True, primary_key=True)
            value = models.TextField()
            expires = models.DateTimeField(db_index=True)

            class Meta:
                app_label = "django_cache"
                db_table = table

        return CacheEntry
...
Now for example, for get_many instead of executing sql we can:
def get_many(self, keys, version=None):
    ...
    db = router.db_for_read(self.cache_model_class)
    connection = connections[db]
    rows = self.cache_model_class.objects.using(db).filter(cache_key__in=list(key_map)).values_list('cache_key', 'value', 'expires')
    ...
Another example, for has_key method:
def has_key(self, key, version=None):
    key = self.make_and_validate_key(key, version=version)
    db = router.db_for_read(self.cache_model_class)
    connection = connections[db]
    now = tz_now().replace(microsecond=0)

    return self.cache_model_class.objects.using(db).filter(Q(cache_key=key) & Q(expires__gt=now)).exists()
All tests pass for these two method changes.
Compression zlib

Using zlib's compress/decompress methods improves performances considerably, especially as data starts getting larger and it is also true to some extent for small data as well.
b64encoded = base64.b64encode(zlib.compress(pickled, level=6)).decode("latin1")
value = pickle.loads(zlib.decompress(base64.b64decode(value.encode())))
Benchmark:


Cache Alias
set benchmark
set_many benchmark
get benchmark
get_many benchmark
delete benchmark
incr benchmark


DatabaseCache
3.520
507.169
0.214
4.701
0.687
5.362


w/ zlib
2.442
363.081
0.138
2.092
0.311
2.283


Probabilistic Culling

Currently, a separate cull query is run on every set query changing this behaviour improves performance at the cost of culling expired values little later.
_cull_probability = 0.01
if random() <= _cull_probability:
    self._cull(db, cursor, now, num)
Making Use of Upsert

Will use update_or_create and/or bulk_create with update_conflicts=True
Use Blob Instead of Text for Values

Replace TextField with BinaryField and remove b64 encoding/decodings. (backwards compatibility needed)
Make set_many Method Use Single Query

Looks like set_many needs to be overriden and adapted accordingly
from django.utils.timezone import timedelta

def set_many(self, data, timeout=DEFAULT_TIMEOUT, version=None, chunk_size=100):
    with transaction.atomic():
        objs = []
        timeout = self.get_backend_timeout(timeout)
        db = router.db_for_write(self.cache_model_class)
        connection = connections[db]
        for key, value in data.items():
            key = self.make_and_validate_key(key, version=version)
            pickled = pickle.dumps(value, self.pickle_protocol)
            b64encoded = base64.b64encode(pickled).decode("latin1")
            if timeout is None:
                exp = datetime.max
                if settings.USE_TZ:
                    exp = exp.replace(tzinfo=timezone.utc)
            else:
                tz = timezone.utc if settings.USE_TZ else None
                exp = datetime.fromtimestamp(timeout, tz=tz)
            exp = exp.replace(microsecond=0)
            obj = self.cache_model_class(cache_key=key, value=b64encoded, expires=exp)
            objs.append(obj)

            if len(objs) == chunk_size:
                self.cache_model_class.objects.bulk_create(objs, update_conflicts=True, update_fields=("value", "expires",), unique_fields=("cache_key",))
                objs = []

        if objs:
            self.cache_model_class.objects.bulk_create(objs, update_conflicts=True, update_fields=("value", "expires",), unique_fields=("cache_key",))
    return []
All tests pass, caveats:

bulk_create's update_conflicts=True not supported on oracle (see docs)
memory vs time trade off

Benchmark:


Cache Alias
set benchmark
set_many benchmark
get benchmark
get_many benchmark
delete benchmark
incr benchmark


DatabaseCache
3.520
507.169
0.214
4.701
0.687
5.362


w/ set_many
9.008
75.977
0.274
7.274
0.741
5.250


Atomic incr/decr

The idea is to override incr, decr methods from BaseCache in DatabaseCache and do it atomically
3. Timeline

Community Bonding (May 1 - May 26)


Understand the codebase in more depth.
Interact with the community and my mentors.
Refine the project plan and timeline with the help of mentors.

Week 1-2 (May 27 - June 9)


Task: Transition from SQL queries to ORM.
Deliverable: Database operations in the cache backend use Django ORM.

Week 3-4 (June 10 - June 23)


Task: Implement zlib compression.
Deliverable: Cached data is compressed using zlib before being stored.

Week 5-6 (June 24 - July 7)


Task: Implement atomic increment/decrement operations.
Deliverable: Atomic increment/decrement operations are supported in the cache backend.

Week 7 (July 8 - July 14)


Task: Implement probabilistic culling.
Deliverable: Probabilistic culling is used to manage cache size but not on every set operation.

Week 8 (July 15 - July 21)


Task: Transition from text to blob for value storage.
Deliverable: Cache values are stored as blobs instead of text.

Week 9-10 (July 22 - August 4)


Task: Make set_many method use single query.
Deliverable: set_many operation uses single or fewer query.

Week 11-12 (August 5 - August 18)


Task: Implement upsert operations.
Deliverable: Upsert operations are supported in the cache backend.

Final Week (August 19 - August 26)


Buffer time to finish any pending work, write tests, improve documentation, and clean up the code.
Submit the code for final evaluation.

If time permits or after gsoc will look into hooking the database cache tables into migrations.
4. References


https://adamj.eu/tech/2015/05/17/building-a-better-databasecache-for-django-on-mysql/
https://django-mysql.readthedocs.io/en/latest/cache.html
https://github.com/adamchainz/django-mysql/blob/80de65fa4eb1801f7ff721f40a419ca63d2f338d/src/django_mysql/cache.py
https://github.com/adamchainz/django-mysql-benchmark-cache
https://code.djangoproject.com/wiki/DynamicModels

5. About me

My name is Hisham Mahmood and I am a first year computer science student at National University of Sciences and Technology, Islamabad. I started to use python in 2019 and since then I have been using django.
Contributions to Django

Tickets


#31405 : LoginRequiredAuthenticationMiddleware force all views to require authentication by default.
#33517 : Extracting seconds also returns fractional seconds on PostgreSQL and Oracle.
#34044 : Admin app search filter appears in nav sidebar but not on admin homepage.
#34910 : Color contrast admin plus icon.
#35099 : Combining QuerySets with "|" or "&" mutates right-hand side.
#35173 : DisallowedModelAdminLookup raised when filtering with __isnull on a ForeignKey not listed in list_filters.

Pull Requests (Open & Merged)


PR-17844 : Fixed #35173 -- Reallowed filtering with __isnull on foreign keys not listed in list_filters.
PR-17829 : Fixed #35099 -- Prevented mutating queryset when combining with & and | operators.
PR-17792 : Refs #31405 -- Added LoginRequiredAuthenticationMiddleware force all views to require authentication by default.
PR-17769 : Refs #33517 -- Prevented __second lookup from returning fractional seconds on Oracle.
PR-17744 : Fixed #34910 -- Improved color contrast for add/change icons in admin.
PR-17709 : #34044 -- Add the admin app filter on index pages in addition to the nav sidebar

Contact


Primary Email: hishammahmood41@gmail.com
Secondary Email: mail@thehisham.com
LinkedIn: Hisham Mahmood
Cache Alias	set benchmark	set_many benchmark	get benchmark	get_many benchmark	delete benchmark	incr benchmark
DatabaseCache	3.520	507.169	0.214	4.701	0.687	5.362
w/ zlib	2.442	363.081	0.138	2.092	0.311	2.283