"""This demonstration script illustrates that the default settings
for OpenStack applications like Nova regarding WSGI workers and greenlets
are completely mismatched with the default database settings, leading
to severe performance and stability penalties.

Since OpenStack applications generally do not use real web servers and instead
implement their own servers in Python using eventlet combined with
multiprocessing, there is no utility for intelligently distributing WSGI
requests across workers; under high concurrency, web requests are distributed
to worker processes more or less randomly, such that particular workers may be
tasked with many requests while others are relatively idle at the same time.

To make things worse, the default number of greenlets in a worker is 1000,
which is vastly larger than the number of database connections allowed in a
process at one time, which defaults to *15*.  Dozens or hundreds of requests
can be sent to a single worker process, where they all queue up serially,
waiting for access to the very small number of DB connections and often
just timing out, which raises a connection pool error.

The script here provides a self-contained runner that's based on Nova's
current WSGI server, and illustrates a simple WSGI application within
that server.  By simulating requests that each take a little while, between
zero and five seconds at random, it is easily demonstrable that most requests
spend inordinate amounts of time waiting for a saturated connection pool
to become available, with plenty of these waits leading to timeouts,
which here are illustrated as 500 errors.  While we reduce the pool timeout
from 30 seconds to 5 to illustrate this more clearly, note that there's no
reason that *any* request should have to wait *at all* for a connection
with proper configuration.

By simply reducing the number of greenlets to the same as the number of
connections (ten by default), the script can serve requests with no errors,
an average connection wait time that is orders of magnitude lower, and
a faster requests / second rate overall.

The conclusions of this test are:

1. Ideally, all OpenStack applications should be using real web servers, not
   the current homegrown solution, so that request / worker / greenlet / thread
   distribution is handled by the many robust solutions already available for
   this problem.

2. With the current system, the number of greenlets and the maximum number of
   database connections should be configured to more or less *match*; this
   means that with DB settings at their default (SQLAlchemy defaults to 5
   pooled connections + 10 overflow for a total of 15), the number of
   greenlets would also be 15.  More realistically, the size of pool overflow
   can be increased from 10 to about 100 or so, and the number of greenlets
   per process can then also be about 100.
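The sizing rule in point 2 can be sanity-checked directly (a minimal sketch;
the 5-pooled / 10-overflow figures are SQLAlchemy's engine defaults as noted
above, while this script's own option defaults are 5 and 5):

```python
# SQLAlchemy engine defaults: pool_size=5, max_overflow=10
pool_size = 5
max_overflow = 10

# the hard ceiling on DB connections a single worker process can hold
max_db_connections = pool_size + max_overflow  # 15

# rule of thumb: greenlets per worker should not exceed that ceiling;
# the eventlet-based servers instead default to 1000 greenlets, which
# is exactly the mismatch this script demonstrates
default_greenlets = 1000
assert default_greenlets > max_db_connections
```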
To Run
======

Install Nova, PyMySQL, and a MySQL server; also install a load testing
program, such as Apache ab.

Then run this script with the Python interpreter::

    python demo_os_workers.py --dburl mysql+pymysql://scott:tiger@localhost/mydb

It will then wait for connections on port 8080.

While the script runs, hit it with the ab test and a lot of
concurrency::

    $ ab -c 200 -n 2000 http://127.0.0.1:8080/

The output will show lots of messages like::

    INFO:loadtest:Status for pid 32625: avg wait time for connection 1.4358 sec;
    worst wait time 3.9267 sec; connection failures 5;
    num requests over the limit: 29; max concurrency seen: 25

and::

    ERROR:loadtest:error in pid 32630: QueuePool limit of size 5 overflow 5
    reached, connection timed out, timeout 5

The INFO message shows that we're waiting well over a second on average for
connections and getting failures too.  The ERROR message shows a failure
where we waited so long for a connection that it timed out after five seconds.

Then do the same process again, but this time run the script with
--greenlets=10::

    python demo_os_workers.py --dburl mysql+pymysql://scott:tiger@localhost/mydb --greenlets=10

Run the ab test again, even bumping the concurrency
higher.  You should see zero failures/errors and only messages like::

    INFO:loadtest:Status for pid 460: avg wait time for connection 0.0140 sec;
    worst wait time 0.0540 sec; connection failures 0;
    num requests over the limit: 0; max concurrency seen: 11

At the end of the ab test the req/sec should be faster too, since workers
are now waiting on the order of 14 ms for connections rather than 1.4 sec.

"""
# initiate eventlet monkeypatching.  This is required;
# as we are using an eventlet greenlet pool, all socket / threading
# code needs to run under that context.
# source: nova/cmd/__init__.py
import eventlet
eventlet.monkey_patch(os=False)

# logging, so we can see things.
import logging
log = logging.getLogger("loadtest")
log.setLevel(logging.INFO)
logging.basicConfig()

# oslo_service is used in nova/service.py
from oslo_service import service
from oslo_config import cfg

# use nova's wsgi Server directly
from nova.wsgi import Server

CONF = cfg.CONF
def create_application(url, pool_size, max_overflow, pool_timeout):
    """The actual "app".

    The "app" itself just checks out a database connection, pretends
    to work with it on a very slow operation, for a random time between zero
    and five seconds, then returns the connection to the pool.

    We're using a raw SQLAlchemy engine/pool in the same way we'd get
    one from oslo_db.sqlalchemy.engines.create_engine(), but just without
    all the config overhead and plugins.

    """
    from sqlalchemy import create_engine

    # note that time.sleep() is one of many functions
    # that eventlet.monkey_patch() patches.
    import time
    import os
    import random

    eng = create_engine(
        url,
        pool_size=pool_size, max_overflow=max_overflow,
        pool_timeout=pool_timeout)

    max_conn_per_process = pool_size + max_overflow

    stats = {
        "concurrent_count": 0,
        "num_reqs_w_excess_concurrency": 0,
        "max_concurrent_count": 0,
        "failures": 0,
        "time_waited_for_db": [],
        "num_reqs": 0
    }

    def simple_app(environ, start_response):
        stats['concurrent_count'] += 1
        stats["num_reqs"] += 1
        stats['max_concurrent_count'] = max(
            stats['max_concurrent_count'], stats['concurrent_count'])
        if stats['concurrent_count'] > max_conn_per_process * 1.1:
            stats['num_reqs_w_excess_concurrency'] += 1

        # guard against logging before any connection wait has been
        # recorded, which would otherwise divide by zero
        if stats["num_reqs"] % 10 == 0 and stats["time_waited_for_db"]:
            log.info(
                "Status for pid %d: "
                "avg wait time for connection %.4f sec; "
                "worst wait time %.4f sec; "
                "connection failures %d; "
                "num requests over the limit: %d; "
                "max concurrency seen: %s",
                os.getpid(),
                sum(stats["time_waited_for_db"]) /
                len(stats["time_waited_for_db"]),
                max(stats["time_waited_for_db"]),
                stats["failures"],
                stats["num_reqs_w_excess_concurrency"],
                stats["max_concurrent_count"]
            )

        response_headers = [('Content-type', 'text/plain')]
        try:
            now = time.time()
            conn = eng.connect()
            stats['time_waited_for_db'].append(time.time() - now)
            try:
                with conn.begin():
                    time.sleep(random.random() * 5)
            finally:
                conn.close()
        except Exception as e:
            stats["failures"] += 1
            log.error("error in pid %d: %s", os.getpid(), e)
            start_response("500 server error", response_headers)
            return ['Exception: %s\n' % e]
        else:
            start_response('200 OK', response_headers)
            return ['Hello world!\n']
        finally:
            stats['concurrent_count'] -= 1

    return simple_app
class Service(service.Service):
    """The "Service" - this is a simplified version of the nova Service
    we see in nova/service.py.  The basic idea is the same, e.g. it has a
    nova.wsgi.Server() as the thing we are running.

    """

    def __init__(
            self, url, num_pooled_db_connections,
            num_extra_db_connections, num_greenlets,
            db_pool_timeout):
        self.server = Server(
            "a server", create_application(
                url,
                pool_size=num_pooled_db_connections,
                max_overflow=num_extra_db_connections,
                pool_timeout=db_pool_timeout
            ), port=8080, pool_size=num_greenlets)

    def start(self):
        self.server.start()

    def stop(self):
        self.server.stop()

    def wait(self):
        self.server.wait()


# some CONF fanfare that oslo.config wants
CONF([],
     project='test',
     version="0.0.1",
     default_config_files=[])
def run_server(
        url,
        num_pooled_db_connections,
        num_extra_db_connections, num_greenlets,
        num_workers, db_pool_timeout):

    server = Service(
        url,
        num_pooled_db_connections=num_pooled_db_connections,
        num_extra_db_connections=num_extra_db_connections,
        num_greenlets=num_greenlets,
        db_pool_timeout=db_pool_timeout
    )

    # if the combination of service / greenlet pool worked correctly,
    # as many as 100 concurrent connections would be evenly distributed among
    # the 10 workers, and we should never see the connection pool of any one
    # worker time out.
    _launcher = service.launch(CONF, server, workers=num_workers)

    max_db_connections = num_pooled_db_connections + num_extra_db_connections
    will_show_bug = max_db_connections < num_greenlets

    if will_show_bug:
        print(
            "Running server with %d max DB connections per worker, "
            "%d greenlets per worker and %d workers. \nThis configuration "
            "WILL show the bug "
            "when request testing with at least %d concurrent "
            "requests, more likely with at least %d concurrent "
            "requests and more as that number grows. Run with "
            "--greenlets=%d to make the bug disappear."
            "\n\nNow run a command like:\n\n"
            "ab -c %d -n 3000 http://127.0.0.1:8080/" % (
                max_db_connections, num_greenlets, num_workers,
                max_db_connections, max_db_connections * num_workers,
                max_db_connections, max_db_connections * num_workers * .9
            ))
    else:
        print(
            "Running server with %d max DB connections per worker, "
            "%d greenlets per worker and %d workers. This configuration "
            "WILL NOT show the bug. \n\nHit the server with whatever "
            "concurrency you like, e.g.\n\n "
            "ab -c 200 -n 3000 http://127.0.0.1:8080/" % (
                max_db_connections, num_greenlets, num_workers,
            ))

    _launcher.wait()
def main():
    import argparse

    parser = argparse.ArgumentParser(
        description="illustrate WSGI service distribution")
    parser.add_argument(
        "--workers", type=int, default=10,
        help="number of worker processes")
    parser.add_argument(
        "--greenlets", type=int, default=1000,
        help="number of greenlets per worker process")
    parser.add_argument(
        "--pooled_connections", type=int, default=5,
        help="number of DB connections pooled per worker process")
    parser.add_argument(
        "--overflow_connections", type=int, default=5,
        help="number of extra DB connections per worker process")
    parser.add_argument(
        "--connection_timeout", type=int, default=5,
        help="number of seconds to wait before erroring out for no "
        "DB connection")
    parser.add_argument(
        "--dburl", type=str,
        default="mysql+pymysql://scott:tiger@localhost/test",
        help="database URL"
    )
    args = parser.parse_args()

    run_server(
        url=args.dburl,
        num_pooled_db_connections=args.pooled_connections,
        num_extra_db_connections=args.overflow_connections,
        num_greenlets=args.greenlets,
        num_workers=args.workers,
        db_pool_timeout=args.connection_timeout
    )


if __name__ == '__main__':
    main()