- Django-cachalot http://django-cachalot.readthedocs.io/en/latest/
- Johnny Cache https://github.com/jmoiron/johnny-cache (deprecated)
- Cache Machine https://cache-machine.readthedocs.io/en/latest/
__init__.py:
import johnny.cache
johnny.cache.enable()
- http://www.craigkerstiens.com/2012/11/30/sharding-your-database/
- https://www.percona.com/blog/2009/08/06/why-you-dont-want-to-shard/
- scaling pinterest presentation http://lanyrd.com/2013/qconsf/scrdgq/
- scaling instagram http://lanyrd.com/2012/airbnb-mike-krieger/srrzg/
- http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
- https://github.com/disqus/sharding-example
{% cache MIDDLE_TTL "post_list" request.GET.page %}
  {% include "inc/post/header.html" %}
  <div class="post-list">
    {% for post in post_list %}
      {% cache LONG_TTL "post_teaser_" post.id post.last_modified %}
        {% include "inc/post/teaser.html" %}
      {% endcache %}
    {% endfor %}
  </div>
{% endcache %}
from django import template
from django.core.cache import cache

class CacheNode(template.Node):
    bust_param = 'flush-the-cache'

    def __init__(self, nodelist, expire_time, cache_key):
        self.nodelist = nodelist
        self.expire_time = expire_time
        self.cache_key = cache_key

    def needs_cache_busting(self, request):
        return self.bust_param in request.GET

    def render(self, context):
        # requires the request context processor
        request = context['request']
        value = cache.get(self.cache_key)
        if self.needs_cache_busting(request) or value is None:
            value = self.nodelist.render(context)
            cache.set(self.cache_key, value, self.expire_time)
        return value
from random import randint

def jitter(num, variance=0.2):
    # vary the TTL by +/- variance so cached items don't all expire at once
    min_num = int(num * (1 - variance))
    max_num = int(num * (1 + variance))
    return randint(min_num, max_num)
- https://en.wikipedia.org/wiki/Thundering_herd_problem
- https://gist.github.com/ipmb/cb0c667ee4a7acd6c4f8
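One common stampede mitigation is to store a soft expiry next to the cached value: the first caller to see a stale entry recomputes it while concurrent callers keep serving the old copy. A stdlib-only sketch of the idea (the dict stands in for memcached/Redis; names and TTLs are illustrative):

```python
import time

_cache = {}  # stand-in for memcached/Redis: key -> (value, soft_expiry)

def get_or_compute(key, compute, ttl=300, grace=30):
    entry = _cache.get(key)
    now = time.time()
    if entry is not None:
        value, soft_expiry = entry
        if now < soft_expiry:
            return value
        # Stale: push the soft expiry forward so concurrent callers
        # keep serving the old value while this caller recomputes.
        _cache[key] = (value, now + grace)
    value = compute()
    _cache[key] = (value, now + ttl)
    return value
```

In a real deployment the tuple would live in the shared cache backend and the grace window would be tuned to the cost of `compute()`.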
- https://django-transaction-hooks.readthedocs.io
- http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
- http://lanyrd.com/2012/djangocon-us/sxbyb/
- http://shop.oreilly.com/product/9780596529307.do
- http://yslow.org/
- https://developers.google.com/speed/pagespeed/
- Chaos Monkey released into the wild
- shared_buffers 25% of RAM up to 8GB
- work_mem (2xRAM) / max_connections
- maintenance_work_mem RAM / 16
- effective_cache_size RAM / 2
- max_connections less than 400 http://lanyrd.com/2012/djangocon-europe/srpqz/
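On a hypothetical dedicated 16 GB database server, the rules of thumb above might translate to a postgresql.conf fragment like this (illustrative starting points, not a tuned configuration):

```ini
# assuming a dedicated 16 GB server
max_connections = 200                  # keep well under 400
shared_buffers = 4GB                   # 25% of RAM, capped at 8GB
work_mem = 160MB                       # (2 x RAM) / max_connections
maintenance_work_mem = 1GB             # RAM / 16
effective_cache_size = 8GB             # RAM / 2
```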
- https://tools.percona.com/wizard
- innodb-buffer-pool-size 80% of RAM
- processes: start with 2x the number of CPU cores and work up. If the server also runs other services like memcached or Varnish, start with (number of cores + 1)
- threads: if your app is thread-safe. Use the stats option and uwsgitop to determine the optimal number of processes and threads for your workload
- thunder-lock: helps balance the load better among all processes/threads. http://uwsgi-docs.readthedocs.io/en/latest/articles/SerializingAccept.html
- harakiri: the max number of seconds a worker can take to process a single request before it is killed off. Prevents all the workers from getting tied up with long-running requests
- max-requests: applications can leak memory over time; this tells uWSGI to respawn a worker after X requests. Set it to a sufficiently high number
- post-buffering: the max size of an HTTP request body in bytes (usually a file upload) that will be held in memory. Larger requests are saved to a temporary file on disk
- stats: publish statistics about the uWSGI process, e.g. 127.0.0.1:1717 or /tmp/stat.sock. pip install uwsgitop
- auto-procname: a nicer human-readable process name
- procname-prefix-spaced: add a prefix to the process names
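Pulled together, the options above might look like this in a uwsgi.ini (the numbers are starting points to tune with uwsgitop, not recommendations):

```ini
[uwsgi]
; 2x cores on a hypothetical 4-core app server
processes = 8
threads = 2              ; only if the app is thread-safe
thunder-lock = true
harakiri = 30            ; kill a worker stuck on one request for >30s
max-requests = 5000      ; respawn workers to contain slow memory leaks
post-buffering = 4096    ; bodies larger than 4KB spill to a temp file
stats = 127.0.0.1:1717   ; poll with uwsgitop
auto-procname = true
procname-prefix-spaced = myapp
```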
- http://docs.gunicorn.org/en/latest/install.html#async-workers
- https://glyph.twistedmatrix.com/2014/02/unyielding.html
- https://lincolnloop.com/blog/architecting-realtime-applications/
- pylibmc
- django-redis
- https://en.wikipedia.org/wiki/Cache_stampede
- django-newcache https://github.com/joshourisman/django-newcache
- https://github.com/lincolnloop/django-ft-cache
CONN_MAX_AGE: 300 is a good value to start with if you're unsure
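In Django settings this is a single key on the database config; a sketch with a placeholder engine/name:

```python
# settings.py (illustrative)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "myapp",
        # reuse database connections for up to 5 minutes
        # instead of reconnecting on every request
        "CONN_MAX_AGE": 300,
    }
}
```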
Logging is a common source of file-permission issues. Output to STDERR and either have uWSGI log it to a file or pick up the output with your process manager (upstart, systemd, supervisor) http://uwsgi-docs.readthedocs.io/en/latest/Logging.html
- django-secure
- https://www.owasp.org
Varnish caches responses based on the URL and the contents of the headers listed in the Vary header. A typical Django response may vary on Accept-Encoding and Cookie. For anonymous requests, the cookies rarely matter. Improve your hit rate greatly by stripping them out so all anonymous requests look the same.
You can define a GET parameter that passes through the cache. Pick the same param as in the custom template cache tag.
https://varnish-cache.org/docs/4.0/users-guide/increasing-your-hitrate.html On sites where users are logged in and page content varies for every user, split up pages so the expensive parts don't vary per user. In some cases the only difference is the user name displayed on the screen. For these sorts of pages you can use a two-phase rendering process: Django renders an anonymized version of the page for Varnish to cache, then AJAX makes an additional request to fill in the personalized bits. The other option is Edge Side Includes (ESI), letting Varnish assemble the page for you. https://varnish-cache.org/docs/4.0/users-guide/esi.html
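A minimal VCL 4.0 sketch of the cookie-stripping idea (the /accounts/ path is an assumption; adjust it to wherever logged-in traffic actually lives):

```vcl
vcl 4.0;

sub vcl_recv {
    # leave logged-in areas alone
    if (req.url ~ "^/accounts/") {
        return (pass);
    }
    # anonymous requests: drop cookies so they all hash identically
    unset req.http.Cookie;
}

sub vcl_backend_response {
    # don't let stray Set-Cookie headers block caching of anonymous pages
    if (bereq.url !~ "^/accounts/") {
        unset beresp.http.Set-Cookie;
    }
}
```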
https://github.com/bitly/oauth2_proxy
https://github.com/wal-e/wal-e
- What is the slowest part of my system? A time breakdown per request: Python, SQL, cache, etc.
- What is the average response time for a request hitting Django?
- Which views are the slowest and consume the most time?
- Which database queries are the slowest and consume the most time?
- How are all these numbers changing over time?
- https://github.com/django-statsd/django-statsd
- https://github.com/etsy/statsd
- https://hekad.readthedocs.io/en/latest/man/plugin.html#statsd-input kibana + elasticsearch + heka
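Under the hood, statsd metrics are plain-text UDP datagrams, so a timing metric can be emitted with nothing but the stdlib (metric name and host are illustrative; django-statsd handles this for you):

```python
import socket
import time

def format_timer(metric, ms):
    # statsd wire format for a timing metric: "<name>:<value>|ms"
    return f"{metric}:{int(ms)}|ms".encode()

def send_timer(metric, ms, host="127.0.0.1", port=8125):
    # fire-and-forget UDP: a down statsd server never slows a request
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(format_timer(metric, ms), (host, port))
    sock.close()

start = time.time()
# ... the view or query being measured ...
send_timer("myapp.view.render", (time.time() - start) * 1000)
```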
- https://httpd.apache.org/docs/2.4/programs/ab.html
- https://github.com/JoeDog/siege
- http://jmeter.apache.org/
- Use load balancers to split traffic between old and new. Make sure you enable session affinity (sticky sessions) so users won't bounce between new and old
- dark launch http://farmdev.com/thoughts/85/dark-launching-or-dark-testing-new-software-features/
- Invisibly proxy live traffic to the new infrastructure using something like Gor https://leonsbox.com/improving-testing-by-using-real-traffic-from-production-8bfbddd009ad
- partial deployment with feature switch http://blog.disqus.com/post/789540337/partial-deployment-with-feature-switches (404) https://www.youtube.com/watch?v=WMRjj06R6jg https://featureflags.io/2016/04/15/feature-toggle-resources/
- warm the cache: a script that crawls the most popular URLs
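A cache-warming crawler can be a few lines of stdlib Python. The paths and base host are placeholders; in practice pull the list from analytics or access logs:

```python
from urllib.request import urlopen

# illustrative: source these from analytics or access logs
POPULAR_PATHS = ["/", "/blog/", "/about/"]

def warm_cache(base_url, paths, fetch=urlopen):
    """Hit each URL so Varnish/memcached are hot before real traffic arrives."""
    warmed = []
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            fetch(url)
            warmed.append(url)
        except OSError:
            pass  # a failed warm-up hit is not fatal
    return warmed
```

`fetch` is injectable mainly for testability; a cron job or post-deploy hook is a natural place to run this.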
- htop
- lsof: list open files for a process
- ltrace, strace: trace library and system calls
- Is the load average safe? It should not exceed the number of CPU cores
- Are any processes constantly using a full CPU core? Split the work across more workers to take advantage of multiple cores
- Is the server swapping (swp)? Add more RAM or reduce the number of running processes
- Are any Python processes using excessive memory (> 300 MB RES)? Profile them
- Varnish, cache, and database processes should use a lot of memory. If they aren't, check their configuration
- varnishstat
- varnishhist
- varnishtop
- varnishlog
- Is your hit rate acceptable?
- Are URLs you expect to be cached actually getting served from cache?
- Are URLs that should not be cached bypassing the cache?
- What are the top URLs bypassing the cache? Can they be cached?
- Are common 404s or permanent redirects caught by Varnish instead of Django?
- pip install uwsgitop
- Is the average response time acceptable (< 1s)?
- Are all workers busy all the time? If there is still CPU and RAM to spare (htop), add workers or threads
- http://docs.celeryproject.org/en/latest/userguide/monitoring.html#commands
- http://docs.celeryproject.org/en/latest/userguide/monitoring.html#events
- http://docs.celeryproject.org/en/latest/userguide/monitoring.html#flower-real-time-celery-web-monitor
- Are all tasks completing successfully?
- Is the queue growing faster than the workers can process tasks? If the server has free resources, add Celery workers; if not, add another server to process tasks
- How is your hit rate? It should be > 90%. If not, it could be due to a high eviction rate or a poor caching strategy
- Are connections and usage well balanced across the servers? If not, you'll want to investigate a more efficient hashing algorithm or modify the function that generates the cache keys
- Is the time spent per operation averaging less than 2ms? If not, you may be maxing out the hardware (swapping, network congestion, etc.)
- pg_top (e <query_id> to explain it in-place)
- https://www.postgresql.org/docs/9.4/static/pgstatstatements.html
- psql -P border=2 -P format=wrapped -P linestyle=unicode
- https://github.com/dalibo/pgbadger
- https://www.percona.com/doc/percona-toolkit/LATEST/pt-query-digest.html
- mytop (e <query_id> to explain it in-place)
- Disks are often the bottleneck: check iowait time, shown in top as X%wa in the CPU row
- Is the number of connections well under the maximum you've configured? If not, bump up the max or investigate whether that many connections are actually needed
- Watch out for "idle in transaction" connections. They should go away quickly; if they hang around, one of the applications accessing your database might be leaking connections
- Are queries running for more than a second? They could be waiting on a lock or require some optimization
- Check for query patterns that show up frequently. Could they be cached or optimized?