"""This demonstration script illustrates that the default settings
for OpenStack applications like Nova regarding WSGI workers and greenlets
are completely mismatched with the default database settings, leading
to severe performance and stability penalties.

Since OpenStack applications generally do not use real web servers and instead
implement their own servers in Python using eventlet combined with
multiprocessing, there is no utility for intelligently distributing WSGI
requests across workers; under high concurrency, web requests are distributed
to worker processes more or less randomly, such that particular workers may be
tasked with many requests while others are relatively idle at the same time.

To make things worse, the default number of greenlets in a worker is 1000,
which is vastly larger than the number of database connections allowed in a
process at one time, which defaults to *15*.  Dozens or hundreds of requests
can be sent to a single worker process, where they all queue up serially,
waiting for access to the very small number of DB connections and often
just timing out, which raises a connection pool error.

The script here provides a self-contained runner that's based on Nova's
current WSGI server, and illustrates a simple WSGI application within
that server.  By simulating requests that each take a little while, between
zero and five seconds at random, it is easily demonstrable that most requests
spend inordinate amounts of time waiting for a saturated connection pool
to become available, with plenty of these waits leading to timeouts,
which here are illustrated as 500 errors.  While we reduce the pool timeout
from 30 seconds to 5 to illustrate this more clearly, note that there's no
reason that *any* request should have to wait *at all* for a connection
with proper configuration.

By simply reducing the number of greenlets to the same as the number of
connections (ten by default), the script can serve requests with no errors,
an average connection wait time that is orders of magnitude lower, and
a faster requests / second rate overall.

The conclusions of this test are:

1. Ideally, all OpenStack applications should be using real web servers, not
   the current homegrown solution, so that request / worker / greenlet / thread
   distribution is handled by the many robust solutions already available for
   this problem.

2. With the current system, the number of greenlets and the maximum number of
   database connections should be configured to more or less *match*; this
   means that with DB settings at their default (SQLAlchemy defaults to 5
   pooled connections + 10 overflow for a total of 15), the number of
   greenlets would also be 15.  More realistically, the size of pool overflow
   can be increased from 10 to about 100 or so, and the number of greenlets
   per process can then also be about 100.
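The sizing rule in point 2 can be sanity-checked directly (a minimal sketch;
the 5-pooled / 10-overflow figures are SQLAlchemy's engine defaults as noted
above, while this script's own option defaults are 5 and 5):

```python
# SQLAlchemy engine defaults: pool_size=5, max_overflow=10
pool_size = 5
max_overflow = 10

# the hard ceiling on DB connections a single worker process can hold
max_db_connections = pool_size + max_overflow  # 15

# rule of thumb: greenlets per worker should not exceed that ceiling;
# the eventlet-based servers instead default to 1000 greenlets, which
# is exactly the mismatch this script demonstrates
default_greenlets = 1000
assert default_greenlets > max_db_connections
```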
To Run
======

Install Nova, PyMySQL, and a MySQL server; also install a load testing
program, such as Apache ab.

Then run this script with the Python interpreter::

    python demo_os_workers.py --dburl mysql+pymysql://scott:tiger@localhost/mydb

It will then wait for connections on port 8080.

While the script runs, hit it with the ab test and a lot of
concurrency::

    $ ab -c 200 -n 2000 http://127.0.0.1:8080/

The output will show lots of messages like::

    INFO:loadtest:Status for pid 32625: avg wait time for connection 1.4358 sec;
    worst wait time 3.9267 sec; connection failures 5;
    num requests over the limit: 29; max concurrency seen: 25

and::

    ERROR:loadtest:error in pid 32630: QueuePool limit of size 5 overflow 5
    reached, connection timed out, timeout 5

The INFO message shows that we're waiting well over a second on average for
connections and getting failures too.  The ERROR message shows a failure
where we waited so long for a connection that it timed out after five seconds.

Then do the same process again, but this time run the script with
--greenlets=10::

    python demo_os_workers.py --dburl mysql+pymysql://scott:tiger@localhost/mydb --greenlets=10

Run the ab test again, even bumping the concurrency
higher.  You should see zero failures/errors and only messages like::

    INFO:loadtest:Status for pid 460: avg wait time for connection 0.0140 sec;
    worst wait time 0.0540 sec; connection failures 0;
    num requests over the limit: 0; max concurrency seen: 11

At the end of the ab test the req/sec should be faster too, since workers
are now waiting on the order of 14 ms for connections rather than 1.4 sec.

"""
# initiate eventlet monkeypatching.  This is required;
# as we are using an eventlet greenlet pool, all socket / threading
# code needs to run under that context.
# source: nova/cmd/__init__.py
import eventlet
eventlet.monkey_patch(os=False)

# logging, so we can see things.
import logging
log = logging.getLogger("loadtest")
log.setLevel(logging.INFO)
logging.basicConfig()

# oslo_service is used in nova/service.py
from oslo_service import service
from oslo_config import cfg

# use nova's wsgi Server directly
from nova.wsgi import Server

CONF = cfg.CONF
def create_application(url, pool_size, max_overflow, pool_timeout):
    """The actual "app".

    The "app" itself just checks out a database connection, pretends
    to work with it on a very slow operation, for a random time between zero
    and five seconds, then returns the connection to the pool.

    We're using a raw SQLAlchemy engine/pool in the same way we'd get
    one from oslo_db.sqlalchemy.engines.create_engine(), but just without
    all the config overhead and plugins.

    """
    from sqlalchemy import create_engine

    # note that time.sleep() is one of many functions
    # that eventlet.monkey_patch() patches.
    import time
    import os
    import random

    eng = create_engine(
        url,
        pool_size=pool_size, max_overflow=max_overflow,
        pool_timeout=pool_timeout)

    max_conn_per_process = pool_size + max_overflow

    stats = {
        "concurrent_count": 0,
        "num_reqs_w_excess_concurrency": 0,
        "max_concurrent_count": 0,
        "failures": 0,
        "time_waited_for_db": [],
        "num_reqs": 0
    }

    def simple_app(environ, start_response):
        stats['concurrent_count'] += 1
        stats["num_reqs"] += 1
        stats['max_concurrent_count'] = max(
            stats['max_concurrent_count'], stats['concurrent_count'])
        if stats['concurrent_count'] > max_conn_per_process * 1.1:
            stats['num_reqs_w_excess_concurrency'] += 1

        # guard against logging before any connection wait has been
        # recorded, which would otherwise divide by zero
        if stats["num_reqs"] % 10 == 0 and stats["time_waited_for_db"]:
            log.info(
                "Status for pid %d: "
                "avg wait time for connection %.4f sec; "
                "worst wait time %.4f sec; "
                "connection failures %d; "
                "num requests over the limit: %d; "
                "max concurrency seen: %s",
                os.getpid(),
                sum(stats["time_waited_for_db"]) /
                len(stats["time_waited_for_db"]),
                max(stats["time_waited_for_db"]),
                stats["failures"],
                stats["num_reqs_w_excess_concurrency"],
                stats["max_concurrent_count"]
            )

        response_headers = [('Content-type', 'text/plain')]
        try:
            now = time.time()
            conn = eng.connect()
            stats['time_waited_for_db'].append(time.time() - now)
            try:
                with conn.begin():
                    time.sleep(random.random() * 5)
            finally:
                conn.close()
        except Exception as e:
            stats["failures"] += 1
            log.error("error in pid %d: %s", os.getpid(), e)
            start_response("500 server error", response_headers)
            return ['Exception: %s\n' % e]
        else:
            start_response('200 OK', response_headers)
            return ['Hello world!\n']
        finally:
            stats['concurrent_count'] -= 1

    return simple_app
class Service(service.Service):
    """The "Service" - this is a simplified version of the nova Service
    we see in nova/service.py.  The basic idea is the same, e.g. it has a
    nova.wsgi.Server() as the thing we are running.

    """

    def __init__(
            self, url, num_pooled_db_connections,
            num_extra_db_connections, num_greenlets,
            db_pool_timeout):
        self.server = Server(
            "a server", create_application(
                url,
                pool_size=num_pooled_db_connections,
                max_overflow=num_extra_db_connections,
                pool_timeout=db_pool_timeout
            ), port=8080, pool_size=num_greenlets)

    def start(self):
        self.server.start()

    def stop(self):
        self.server.stop()

    def wait(self):
        self.server.wait()


# some CONF fanfare that oslo.config wants
CONF([],
     project='test',
     version="0.0.1",
     default_config_files=[])
def run_server(
        url,
        num_pooled_db_connections,
        num_extra_db_connections, num_greenlets,
        num_workers, db_pool_timeout):

    server = Service(
        url,
        num_pooled_db_connections=num_pooled_db_connections,
        num_extra_db_connections=num_extra_db_connections,
        num_greenlets=num_greenlets,
        db_pool_timeout=db_pool_timeout
    )

    # if the combination of service / greenlet pool worked correctly,
    # as many as 100 concurrent connections would be evenly distributed among
    # the 10 workers, and we should never see the connection pool of any one
    # worker time out.
    _launcher = service.launch(CONF, server, workers=num_workers)

    max_db_connections = num_pooled_db_connections + num_extra_db_connections
    will_show_bug = max_db_connections < num_greenlets

    if will_show_bug:
        print(
            "Running server with %d max DB connections per worker, "
            "%d greenlets per worker and %d workers. \nThis configuration "
            "WILL show the bug "
            "when request testing with at least %d concurrent "
            "requests, more likely with at least %d concurrent "
            "requests and more as that number grows. Run with "
            "--greenlets=%d to make the bug disappear."
            "\n\nNow run a command like:\n\n"
            "ab -c %d -n 3000 http://127.0.0.1:8080/" % (
                max_db_connections, num_greenlets, num_workers,
                max_db_connections, max_db_connections * num_workers,
                max_db_connections, max_db_connections * num_workers * .9
            ))
    else:
        print(
            "Running server with %d max DB connections per worker, "
            "%d greenlets per worker and %d workers. This configuration "
            "WILL NOT show the bug. \n\nHit the server with whatever "
            "concurrency you like, e.g.\n\n "
            "ab -c 200 -n 3000 http://127.0.0.1:8080/" % (
                max_db_connections, num_greenlets, num_workers,
            ))

    _launcher.wait()
def main():
    import argparse

    parser = argparse.ArgumentParser(
        description="illustrate WSGI service distribution")
    parser.add_argument(
        "--workers", type=int, default=10,
        help="number of worker processes")
    parser.add_argument(
        "--greenlets", type=int, default=1000,
        help="number of greenlets per worker process")
    parser.add_argument(
        "--pooled_connections", type=int, default=5,
        help="number of DB connections pooled per worker process")
    parser.add_argument(
        "--overflow_connections", type=int, default=5,
        help="number of extra DB connections per worker process")
    parser.add_argument(
        "--connection_timeout", type=int, default=5,
        help="number of seconds to wait before erroring out for no "
        "DB connection")
    parser.add_argument(
        "--dburl", type=str,
        default="mysql+pymysql://scott:tiger@localhost/test",
        help="database URL"
    )
    args = parser.parse_args()

    run_server(
        url=args.dburl,
        num_pooled_db_connections=args.pooled_connections,
        num_extra_db_connections=args.overflow_connections,
        num_greenlets=args.greenlets,
        num_workers=args.workers,
        db_pool_timeout=args.connection_timeout
    )


if __name__ == '__main__':
    main()