Skip to content

Instantly share code, notes, and snippets.

@zzzeek
Last active May 18, 2016 03:21
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save zzzeek/c69138fd0d0b3e553a1f to your computer and use it in GitHub Desktop.
Save zzzeek/c69138fd0d0b3e553a1f to your computer and use it in GitHub Desktop.
"""This demonstration script illustrates that the default settings
for Openstack applications like Nova regarding WSGI workers and greenlets
are completely mis-matched with the default database settings, leading
to severe performance and stability penalties.
Since Openstack applications generally do not use real web servers and instead
implement their own servers in Python using eventlet combined with
multiprocessing, there is no utility for intelligently distributing WSGI
requests across workers; under high concurrency, web requests are distributed
to worker processes more or less randomly, such that particular workers may be
tasked with many requests while others are relatively idle at the same time.
To make things worse, the default number of greenlets in a worker is 1000,
which is vastly larger than the number of database connections allowed in a
process at one time, which defaults to *15*. Dozens or hundreds of requests
can be sent to a single worker process which will all queue up serially,
waiting for access to the very small number of DB connections and often
just timing out, which throws a connection pool error.
The script here provides a self-contained runner that's based on Nova's
current WSGI server, and illustrates a simple WSGI application within
that server. By simulating requests that might take a little long, between
zero and five seconds randomly, it is easily demonstratable that most requests
spend inordinate amounts of time waiting for a saturated connection pool
to be available, along with plenty of these waits leading to timeouts
which here are illustrated as 500 errors. While we reduce the pool timeout
from 30 seconds to 10 to illustrate this more clearly, note that there's no
reason that *any* request should have to wait *at all* for a connection
with proper configuration.
By simply reducing the number of greenlets to the same as the number of
connections (ten by default), the script can serve requests with no errors,
an average connection wait time that is an order of magnitude less, and
a faster requests / second overall.
The conclusions of this test are:
1. Ideally, all Openstack applications should be using real web servers, not
the current homegrown solution, so that request / worker / greenlet / thread
distribution is handled by the many robust solutions already available for
this problem.
2. With the current system, the number of greenlets and the number of database
connections max should be configured to more or less *match*; this means
that with DB settings at their default (SQLAlchemy defaults to 5 pooled
connections + 10 overflow for a total of 15), the number of greenlets would
also be 15. More realistically, the size of pool overflow can be increased
from 10 to about 100 or so, and the number of greenlets per process can
then also be about 100.
To Run
======
Install Nova, PyMySQL, and a MySQL server; also install a load testing
program, such as Apache ab.
Then run this script with the Python interpreter::
python demo_os_workers.py --dburl mysql+pymysql://scott:tiger@localhost/mydb
it will then wait for connections on port 8080.
While the script runs, hit it with the ab test and a lot of
concurrency::
$ ab -c 200 -n 2000 http://127.0.0.1:8080
The output will show lots of messages like::
INFO:loadtest:Status for pid 32625: avg wait time for connection 1.4358 sec;
worst wait time 3.9267 sec; connection failures 5;
num requests over the limit: 29; max concurrency seen: 25
and::
ERROR:loadtest:error in pid 32630: QueuePool limit of size 5 overflow 5
reached, connection timed out, timeout 5
The INFO message shows that we're waiting well over a second average for
connections and getting failures too. The ERROR message shows a failure
where we waited so long for a connection it timed out after five seconds.
Then do the same process again, but this time running the script with
--greenlets=10::
python demo_os_workers.py --dburl mysql+pymysql://scott:tiger@localhost/mydb --greenlets=10
Run the ab test, even bump the concurrency
higher. You should see zero failures/errors and only messages like::
INFO:loadtest:Status for pid 460: avg wait time for connection 0.0140 sec;
worst wait time 0.0540 sec; connection failures 0;
num requests over the limit: 0; max concurrency seen: 11
At the end of the ab test the req/sec should be faster too, since workers
are now waiting in the order of 14ms for connections rather than 1.4sec.
"""
# initiate eventlet monkeypatching. This is required
# as we are using an eventlet greenlet pool, all socket / threading
# code needs to run under that context.
# source: nova/cmd/__init__.py
import eventlet
eventlet.monkey_patch(os=False)
# logging, so we can see things.
import logging
log = logging.getLogger("loadtest")
log.setLevel(logging.INFO)
logging.basicConfig()
# oslo_service is used in nova/service.py
from oslo_service import service
from oslo_config import cfg
# use nova's wsgi Server directly
from nova.wsgi import Server
CONF = cfg.CONF
def create_application(url, pool_size, max_overflow, pool_timeout):
"""the actual "app".
The "app" itself just checks out a database connection, pretends
to work with it on a very slow operation, for a random time between zero
and five seconds, then returns the connection to the pool.
We're using a raw SQLAlchemy engine/pool in the same way we'd get
one from oslo_db.sqlalchemy.engines.create_engine(), but just without
all the config overhead and plugins.
"""
from sqlalchemy import create_engine
# note that time.sleep() is one of many functions
# that eventlet.monkey_patch patches.
import time
import os
import random
eng = create_engine(url,
pool_size=pool_size, max_overflow=max_overflow,
pool_timeout=pool_timeout)
max_conn_per_process = pool_size + max_overflow
stats = {
"concurrent_count": 0,
"num_reqs_w_excess_concurrency": 0,
"max_concurrent_count": 0,
"failures": 0,
"time_waited_for_db": [],
"num_reqs": 0
}
def simple_app(environ, start_response):
stats['concurrent_count'] += 1
stats["num_reqs"] += 1
stats['max_concurrent_count'] = max(
stats['max_concurrent_count'], stats['concurrent_count'])
if stats['concurrent_count'] > max_conn_per_process * 1.1:
stats['num_reqs_w_excess_concurrency'] += 1
if stats["num_reqs"] % 10 == 0:
log.info(
"Status for pid %d: "
"avg wait time for connection %.4f sec; "
"worst wait time %.4f sec; "
"connection failures %d; "
"num requests over the limit: %d; "
"max concurrency seen: %s",
os.getpid(),
sum(stats["time_waited_for_db"]) /
len(stats["time_waited_for_db"]),
max(stats["time_waited_for_db"]),
stats["failures"],
stats["num_reqs_w_excess_concurrency"],
stats["max_concurrent_count"]
)
response_headers = [('Content-type', 'text/plain')]
try:
now = time.time()
conn = eng.connect()
stats['time_waited_for_db'].append(time.time() - now)
try:
with conn.begin():
time.sleep(random.random() * 5)
finally:
conn.close()
except Exception as e:
stats["failures"] += 1
log.error("error in pid %d: %s", os.getpid(), e)
start_response("500 server error", response_headers)
return ['Exception: %s\n' % e]
else:
start_response('200 OK', response_headers)
return ['Hello world!\n']
finally:
stats['concurrent_count'] -= 1
return simple_app
class Service(service.Service):
"""the "Service" - this is a simpified version of the nova Service
we see in nova/service.py. The basic idea is the same, e.g. has a
nova.wsgi.Server() as the thing we are running.
"""
def __init__(
self, url, num_pooled_db_connections,
num_extra_db_connections, num_greenlets,
db_pool_timeout):
self.server = Server(
"a server", create_application(
url,
pool_size=num_pooled_db_connections,
max_overflow=num_extra_db_connections,
pool_timeout=db_pool_timeout
), port=8080, pool_size=num_greenlets)
def start(self):
self.server.start()
def stop(self):
self.server.stop()
def wait(self):
self.server.wait()
# some CONF fanfare that oslo.config wants
CONF([],
project='test',
version="0.0.1",
default_config_files=[])
def run_server(
url,
num_pooled_db_connections,
num_extra_db_connections, num_greenlets,
num_workers, db_pool_timeout):
server = Service(
url,
num_pooled_db_connections=num_pooled_db_connections,
num_extra_db_connections=num_extra_db_connections,
num_greenlets=num_greenlets,
db_pool_timeout=db_pool_timeout
)
# if the combination of service / greenlet pool worked correctly,
# as many as 100 concurrent connections would be evenly distributed among
# the 10 workers, and we should never see the connection pool from one
# worker timed out.
_launcher = service.launch(CONF, server, workers=num_workers)
max_db_connections = num_pooled_db_connections + num_extra_db_connections
will_show_bug = max_db_connections < num_greenlets
if will_show_bug:
print(
"Running server with %d max DB connections per worker, "
"%d greenlets per worker and %d workers. \nThis configuration "
"WILL show the bug "
"when request testing with at least %d concurrent "
"requests, more likely at least with %d concurrent "
"requests and more as that number grows. Run with "
"--greenlets=%d to make the bug disappear."
"\n\nNow run a command like:\n\n"
"ab -c %d -n 3000 http://127.0.0.1:8080/" % (
max_db_connections, num_greenlets, num_workers,
max_db_connections, max_db_connections * num_workers,
max_db_connections, max_db_connections * num_workers * .9
))
else:
print(
"Running server with %d max DB connections per worker, "
"%d greenlets per worker and %d workers. This configuration "
"WILL NOT show the bug. \n\nHit the server with whatever "
"concurrency you like, e.g.\n\n "
"ab -c 200 -n 3000 http://127.0.0.1:8080/" % (
max_db_connections, num_greenlets, num_workers,
))
_launcher.wait()
def main():
import argparse
parser = argparse.ArgumentParser(
description="illustrate WSGI service distribution")
parser.add_argument(
"--workers", type=int, default=10,
help="Number of worker processes")
parser.add_argument(
"--greenlets", type=int, default=1000,
help="number of greenlets per worker process")
parser.add_argument(
"--pooled_connections", type=int, default=5,
help="number of DB connections pooled per worker process")
parser.add_argument(
"--overflow_connections", type=int, default=5,
help="number of extra DB connections per worker process")
parser.add_argument(
"--connection_timeout", type=int, default=5,
help="number of seconds to wait before erroring out for no "
"DB connection")
parser.add_argument(
"--dburl", type=str,
default="mysql+pymysql://scott:tiger@localhost/test",
help="Database URL"
)
args = parser.parse_args()
run_server(
url=args.dburl,
num_pooled_db_connections=args.pooled_connections,
num_extra_db_connections=args.overflow_connections,
num_greenlets=args.greenlets,
num_workers=args.workers,
db_pool_timeout=args.connection_timeout
)
if __name__ == '__main__':
main()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment