- Django-cachalot http://django-cachalot.readthedocs.io/en/latest/
- Johnny Cache https://github.com/jmoiron/johnny-cache (deprecated)
- Cache Machine https://cache-machine.readthedocs.io/en/latest/
__init__.py:
import johnny.cache
johnny.cache.enable()
- http://www.craigkerstiens.com/2012/11/30/sharding-your-database/
- https://www.percona.com/blog/2009/08/06/why-you-dont-want-to-shard/
- scaling pinterest presentation http://lanyrd.com/2013/qconsf/scrdgq/
- scaling instagram http://lanyrd.com/2012/airbnb-mike-krieger/srrzg/
- http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
- https://github.com/disqus/sharding-example
{% cache MIDDLE_TTL "post_list" request.GET.page %}
  {% include "inc/post/header.html" %}
  <div class="post-list">
    {% for post in post_list %}
      {% cache LONG_TTL "post_teaser_" post.id post.last_modified %}
        {% include "inc/post/teaser.html" %}
      {% endcache %}
    {% endfor %}
  </div>
{% endcache %}
from django import template
from django.core.cache import cache

class CacheNode(template.Node):
    bust_param = 'flush-the-cache'

    def __init__(self, nodelist, expire_time, cache_key):
        self.nodelist = nodelist
        self.expire_time = expire_time
        self.cache_key = cache_key

    def needs_cache_busting(self, request):
        return self.bust_param in request.GET

    def render(self, context):
        # requires the request context processor
        request = context['request']
        value = cache.get(self.cache_key)
        if self.needs_cache_busting(request) or value is None:
            value = self.nodelist.render(context)
            cache.set(self.cache_key, value, self.expire_time)
        return value
from random import randint

def jitter(num, variance=0.2):
    # vary the TTL by +/- variance so cached items don't all expire at once
    min_num = int(num * (1 - variance))
    max_num = int(num * (1 + variance))
    return randint(min_num, max_num)
- https://en.wikipedia.org/wiki/Thundering_herd_problem
- https://gist.github.com/ipmb/cb0c667ee4a7acd6c4f8
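One common stampede mitigation is to store a soft expiry next to the cached value: the first caller to see a stale entry recomputes it while concurrent callers keep serving the old copy. A stdlib-only sketch of the idea (the dict stands in for memcached/Redis; names and TTLs are illustrative):

```python
import time

_cache = {}  # stand-in for memcached/Redis: key -> (value, soft_expiry)

def get_or_compute(key, compute, ttl=300, grace=30):
    entry = _cache.get(key)
    now = time.time()
    if entry is not None:
        value, soft_expiry = entry
        if now < soft_expiry:
            return value
        # Stale: push the soft expiry forward so concurrent callers
        # keep serving the old value while this caller recomputes.
        _cache[key] = (value, now + grace)
    value = compute()
    _cache[key] = (value, now + ttl)
    return value
```

In a real deployment the tuple would live in the shared cache backend and the grace window would be tuned to the cost of `compute()`.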
- https://django-transaction-hooks.readthedocs.io
- http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
- http://lanyrd.com/2012/djangocon-us/sxbyb/
- http://shop.oreilly.com/product/9780596529307.do
- http://yslow.org/
- https://developers.google.com/speed/pagespeed/
- Chaos Monkey released into the wild
- shared_buffers 25% of RAM up to 8GB
- work_mem (2xRAM) / max_connections
- maintenance_work_mem RAM / 16
- effective_cache_size RAM / 2
- max_connections less than 400 http://lanyrd.com/2012/djangocon-europe/srpqz/
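On a hypothetical dedicated 16 GB database server, the rules of thumb above might translate to a postgresql.conf fragment like this (illustrative starting points, not a tuned configuration):

```ini
# assuming a dedicated 16 GB server
max_connections = 200                  # keep well under 400
shared_buffers = 4GB                   # 25% of RAM, capped at 8GB
work_mem = 160MB                       # (2 x RAM) / max_connections
maintenance_work_mem = 1GB             # RAM / 16
effective_cache_size = 8GB             # RAM / 2
```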
- https://tools.percona.com/wizard
- innodb-buffer-pool-size 80% of RAM
- processes: start with 2x the number of CPU cores and work up. If the server also runs other services like memcached or Varnish, start with (number of cores + 1)
- threads: if your app is thread-safe. Use the stats option and uwsgitop to determine the optimal number of processes and threads for your workload
- thunder-lock: helps balance the load better among all processes/threads. http://uwsgi-docs.readthedocs.io/en/latest/articles/SerializingAccept.html
- harakiri: the max number of seconds a worker can take to process a single request before it is killed off. Prevents all the workers from getting tied up with long-running requests
- max-requests: applications can leak memory over time; this tells uWSGI to respawn a worker after X requests. Set it to a sufficiently high number
- post-buffering: the max size of an HTTP request body in bytes (usually a file upload) that will be held in memory. Larger requests are saved to a temporary file on disk
- stats: publish statistics about the uWSGI process, e.g. 127.0.0.1:1717 or /tmp/stat.sock. pip install uwsgitop
- auto-procname: a nicer human-readable process name
- procname-prefix-spaced: add a prefix to the process names
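Pulled together, the options above might look like this in a uwsgi.ini (the numbers are starting points to tune with uwsgitop, not recommendations):

```ini
[uwsgi]
; 2x cores on a hypothetical 4-core app server
processes = 8
threads = 2              ; only if the app is thread-safe
thunder-lock = true
harakiri = 30            ; kill a worker stuck on one request for >30s
max-requests = 5000      ; respawn workers to contain slow memory leaks
post-buffering = 4096    ; bodies larger than 4KB spill to a temp file
stats = 127.0.0.1:1717   ; poll with uwsgitop
auto-procname = true
procname-prefix-spaced = myapp
```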
- http://docs.gunicorn.org/en/latest/install.html#async-workers
- https://glyph.twistedmatrix.com/2014/02/unyielding.html
- https://lincolnloop.com/blog/architecting-realtime-applications/
- pylibmc
- django-redis
- https://en.wikipedia.org/wiki/Cache_stampede
- django-newcache https://github.com/joshourisman/django-newcache
- https://github.com/lincolnloop/django-ft-cache
CONN_MAX_AGE: 300 is a good value to start with if you're unsure
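In Django settings this is a single key on the database config; a sketch with a placeholder engine/name:

```python
# settings.py (illustrative)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "myapp",
        # reuse database connections for up to 5 minutes
        # instead of reconnecting on every request
        "CONN_MAX_AGE": 300,
    }
}
```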
Logging is a common source of file-permission issues. Output to STDERR and either have uWSGI log it to a file or pick up the output with your process manager (upstart, systemd, supervisor) http://uwsgi-docs.readthedocs.io/en/latest/Logging.html
- django-secure
- https://www.owasp.org
Varnish caches responses based on the URL and the contents of the headers listed in the Vary header. A typical Django response may vary on Accept-Encoding and Cookie. For anonymous requests, the cookies rarely matter. Improve your hit rate greatly by stripping them out so all anonymous requests look the same.
You can define a GET parameter that passes through the cache. Pick the same param as in the custom template cache tag.
https://varnish-cache.org/docs/4.0/users-guide/increasing-your-hitrate.html On sites where users are logged in and page content varies for every user, split up pages so the expensive parts don't vary per user. In some cases the only difference is the user name displayed on the screen. For these sorts of pages you can use a two-phase rendering process: Django renders an anonymized version of the page for Varnish to cache, then AJAX makes an additional request to fill in the personalized bits. The other option is Edge Side Includes (ESI), letting Varnish assemble the page for you. https://varnish-cache.org/docs/4.0/users-guide/esi.html
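A minimal VCL 4.0 sketch of the cookie-stripping idea (the /accounts/ path is an assumption; adjust it to wherever logged-in traffic actually lives):

```vcl
vcl 4.0;

sub vcl_recv {
    # leave logged-in areas alone
    if (req.url ~ "^/accounts/") {
        return (pass);
    }
    # anonymous requests: drop cookies so they all hash identically
    unset req.http.Cookie;
}

sub vcl_backend_response {
    # don't let stray Set-Cookie headers block caching of anonymous pages
    if (bereq.url !~ "^/accounts/") {
        unset beresp.http.Set-Cookie;
    }
}
```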
https://github.com/bitly/oauth2_proxy
https://github.com/wal-e/wal-e
- What is the slowest part of my system? A time breakdown per request: Python, SQL, cache, etc.
- What is the average response time for a request hitting Django?
- Which views are the slowest and consume the most time?
- Which database queries are the slowest and consume the most time?
- How are all these numbers changing over time?
- https://github.com/django-statsd/django-statsd
- https://github.com/etsy/statsd
- https://hekad.readthedocs.io/en/latest/man/plugin.html#statsd-input kibana + elasticsearch + heka
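Under the hood, statsd metrics are plain-text UDP datagrams, so a timing metric can be emitted with nothing but the stdlib (metric name and host are illustrative; django-statsd handles this for you):

```python
import socket
import time

def format_timer(metric, ms):
    # statsd wire format for a timing metric: "<name>:<value>|ms"
    return f"{metric}:{int(ms)}|ms".encode()

def send_timer(metric, ms, host="127.0.0.1", port=8125):
    # fire-and-forget UDP: a down statsd server never slows a request
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(format_timer(metric, ms), (host, port))
    sock.close()

start = time.time()
# ... the view or query being measured ...
send_timer("myapp.view.render", (time.time() - start) * 1000)
```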
- https://httpd.apache.org/docs/2.4/programs/ab.html
- https://github.com/JoeDog/siege
- http://jmeter.apache.org/
- Use load balancers to split traffic between old and new. Make sure you enable session affinity (sticky sessions) so users won't bounce between new and old
- dark launch http://farmdev.com/thoughts/85/dark-launching-or-dark-testing-new-software-features/
- Invisibly proxy live traffic to the new infrastructure using something like Gor https://leonsbox.com/improving-testing-by-using-real-traffic-from-production-8bfbddd009ad
- partial deployment with feature switch http://blog.disqus.com/post/789540337/partial-deployment-with-feature-switches (404) https://www.youtube.com/watch?v=WMRjj06R6jg https://featureflags.io/2016/04/15/feature-toggle-resources/
- warm the cache: a script that crawls the most popular URLs
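A cache-warming crawler can be a few lines of stdlib Python. The paths and base host are placeholders; in practice pull the list from analytics or access logs:

```python
from urllib.request import urlopen

# illustrative: source these from analytics or access logs
POPULAR_PATHS = ["/", "/blog/", "/about/"]

def warm_cache(base_url, paths, fetch=urlopen):
    """Hit each URL so Varnish/memcached are hot before real traffic arrives."""
    warmed = []
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            fetch(url)
            warmed.append(url)
        except OSError:
            pass  # a failed warm-up hit is not fatal
    return warmed
```

`fetch` is injectable mainly for testability; a cron job or post-deploy hook is a natural place to run this.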
- htop
- lsof: list open files for a process
- ltrace, strace: trace library and system calls
- Is the load average safe? It should not exceed the number of CPU cores
- Are any processes constantly using a full CPU core? Split the work across more workers to take advantage of multiple cores
- Is the server swapping (swp)? Add more RAM or reduce the number of running processes
- Are any Python processes using excessive memory (> 300 MB RES)? Profile them
- Varnish, cache, and database processes should use a lot of memory. If they aren't, check their configuration
- varnishstat
- varnishhist
- varnishtop
- varnishlog
- Is your hit rate acceptable?
- Are URLs you expect to be cached actually getting served from cache?
- Are URLs that should not be cached bypassing the cache?
- What are the top URLs bypassing the cache? Can they be cached?
- Are common 404s or permanent redirects caught by Varnish instead of Django?
- pip install uwsgitop
- Is the average response time acceptable (< 1s)?
- Are all workers busy all the time? If there is still CPU and RAM to spare (htop), add workers or threads
- http://docs.celeryproject.org/en/latest/userguide/monitoring.html#commands
- http://docs.celeryproject.org/en/latest/userguide/monitoring.html#events
- http://docs.celeryproject.org/en/latest/userguide/monitoring.html#flower-real-time-celery-web-monitor
- Are all tasks completing successfully?
- Is the queue growing faster than the workers can process tasks? If the server has free resources, add Celery workers; if not, add another server to process tasks
- How is your hit rate? It should be > 90%. If not, it could be due to a high eviction rate or a poor caching strategy
- Are connections and usage well balanced across the servers? If not, you'll want to investigate a more efficient hashing algorithm or modify the function that generates the cache keys
- Is the time spent per operation averaging less than 2ms? If not, you may be maxing out the hardware (swapping, network congestion, etc.)
- pg_top (e <query_id> to explain it in-place)
- https://www.postgresql.org/docs/9.4/static/pgstatstatements.html
- psql -P border=2 -P format=wrapped -P linestyle=unicode
- https://github.com/dalibo/pgbadger
- https://www.percona.com/doc/percona-toolkit/LATEST/pt-query-digest.html
- mytop (e <query_id> to explain it in-place)
- Disks are often the bottleneck: check iowait time, shown in top as X%wa in the CPU row
- Is the number of connections well under the maximum you've configured? If not, bump up the max or investigate whether that many connections are actually needed
- Watch out for "idle in transaction" connections. They should go away quickly; if they hang around, one of the applications accessing your database might be leaking connections
- Are queries running for more than a second? They could be waiting on a lock or require some optimization
- Check for query patterns that show up frequently. Could they be cached or optimized?