abhijeetviswa/gsoc-2020-proposal.rst Secret

## gsoc-2020-proposal.rst

      
    Adding Support for Windows and Oracle in the Parallel Test Runner

Project Proposal for Django for Google Summer of Code 2020 by Abhijeet Viswakumar.

Abstract

Current Implementation of the ParallelTestSuite and its Drawbacks

The Django project uses unit testing to ensure that changes do not break existing functionality. There are close to 13,000 tests in Django. This number is bound to increase as time goes on. The same is true for big projects made using Django. Running the tests quickly and accurately is critical.
Django currently uses the Python multiprocessing module to run test suites in parallel. The parallel runner relies on the fork start method so that the initialized state is carried over to the child processes. Unfortunately, this means that the parallel runner does not work on platforms where fork is unavailable or is not the default start method, namely Windows (which has no fork) and macOS (where spawn has been the default since Python 3.8).
Furthermore, the Oracle backend doesn't have support for the current parallel runner at all.
Goals

The primary goals are:

Rewrite the parallel test runner to add support for the spawn method as well, so that tests can be run in parallel on Windows and macOS.
Add support to the Oracle backend to run tests in parallel.

Benefits

Supporting the spawn method and Oracle in the parallel test runner will reduce the time spent waiting for tests to finish and free up more time for development.
Implementation

This section is split into two parts, detailing the implementation for the spawn method and for Oracle.
Implementing the spawn Method

Overview

Before the test runner starts, a setup step runs. This involves initializing the app registry, loading settings, and so on. In the present implementation of the parallel test runner, this initialization needs to happen only once, since child processes inherit their parent's state when using the fork method.
With the spawn method, however, child processes do not inherit any state, so each one must be initialized (if necessary) before it runs any tests.
The pull request, #12547, is a rough POC showing how this can be achieved.
In the POC, two arguments were added to DiscoverRunner.run_tests(): a callback and the argument list for the callback. These two arguments are passed down to the ParallelTestSuite and eventually to _init_worker(), which uses the callback to set up the test environment. Any other necessary information (like the database names used for the tests) is also passed down to the child process.
My implementation aims for the parallel runner to continue using the fork method on Unix, falling back to spawn on operating systems that don't support fork.
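One way to express that fallback, as a minimal sketch with a hypothetical helper name (`choose_start_method`), is to ask the multiprocessing module which start methods the platform offers:

```python
import multiprocessing


def choose_start_method(preferred="fork", fallback="spawn"):
    """Return `preferred` if this platform offers it, otherwise `fallback`."""
    if preferred in multiprocessing.get_all_start_methods():
        return preferred
    return fallback


# Resolve the method once, then create pools from the matching context
# instead of changing the global default start method.
start_method = choose_start_method()
context = multiprocessing.get_context(start_method)
```

Using a context object keeps the choice local to the test runner, so it does not interfere with other code that relies on the interpreter's default start method.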
However, a problem with the spawn method arises when using in-memory SQLite databases. The present implementation of the parallel runner assumes that fork is used, so each child process automatically gets a clone of the in-memory SQLite databases. With spawn, no such clone exists.
Two workarounds that I discussed in my POC summary are:

Disallow using in-memory databases together with spawn.
Create an SQL dump of the in-memory database using the Connection.iterdump() method provided by the sqlite3 module.
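The dump-based workaround could look roughly like this. It is a sketch, not the POC's code: `dump_database` and `restore_database` are hypothetical helpers, and the `book` table is made up for the example. The parent serializes the in-memory database to a SQL script, and a child replays that script into its own in-memory database.

```python
import sqlite3


def dump_database(connection):
    """Serialize a SQLite database to a single SQL script."""
    return "\n".join(connection.iterdump())


def restore_database(sql_script):
    """Create a fresh in-memory database and replay the script into it."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(sql_script)
    return connection


# Parent side: build an in-memory database and dump it.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)")
source.execute("INSERT INTO book (title) VALUES (?)", ("Example",))
source.commit()
script = dump_database(source)

# Child side: the script is a plain string, so it can be pickled and sent
# to a spawned process, which rebuilds the database locally.
clone = restore_database(script)
rows = clone.execute("SELECT title FROM book").fetchall()
```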

Other possible workarounds (as was discussed on the django-developers mailing list) are:

Using the VACUUM INTO SQLite statement.
Using shared memory to pass the SQLite dumps to child processes, removing the need for disk I/O.
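As a rough illustration of the shared memory idea (assuming Python 3.8+, where `multiprocessing.shared_memory` is available; `publish_dump` and `read_dump` are hypothetical helper names), the parent could copy the dump into a named block and hand only the block's name and length to the children:

```python
from multiprocessing import shared_memory


def publish_dump(sql_script):
    """Copy a SQL dump into a new shared memory block; return (block, length)."""
    payload = sql_script.encode("utf-8")
    # The block may be rounded up to a page size, so the exact length is
    # returned alongside it. A size of zero is not allowed.
    block = shared_memory.SharedMemory(create=True, size=max(len(payload), 1))
    block.buf[: len(payload)] = payload
    return block, len(payload)


def read_dump(block_name, length):
    """Attach to an existing block by name and decode the dump it holds."""
    block = shared_memory.SharedMemory(name=block_name)
    try:
        return bytes(block.buf[:length]).decode("utf-8")
    finally:
        block.close()


# Parent publishes; a child would call read_dump() with the name and length.
block, length = publish_dump("CREATE TABLE book (id INTEGER PRIMARY KEY);")
try:
    restored = read_dump(block.name, length)
finally:
    block.close()
    block.unlink()  # The creator is responsible for freeing the block.
```

Only the short block name and an integer need to be pickled for each child, which is the appeal of this approach when the dump is large.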

All of these are plausible implementation strategies, and I believe the final product should use the least complicated and most performant one.
Advantages

The proposed implementation is as simple as possible. The number of changes needed to use the parallel runner in a Django project or within Django's runtests module is minimal. Furthermore, backwards compatibility is maintained as much as possible.
Adding Support for Oracle

Overview

The original PR that added support for running tests in parallel decided to defer supporting Oracle, due to the complexity of doing so, so as not to miss the Django 1.9 deadline.
Oracle, unlike other RDBMSs, does not support multiple schemas per user. This means that test users need to be created for each parallel runner. Another problem is figuring out a simple way to clone and import schemas.
The current implementation already creates a single test user to run tests, reserving the user that is provided in the database configuration for setup, administrative tasks, and cleanup. I plan to expand upon the present implementation and use this "administrative" user to create additional test users and schemas. The details of this are discussed in the Schedule (Week 2, Second Milestone).
I will tackle the problem of cloning and importing schemas using the Oracle Data Pump. It provides utilities as well as a PL/SQL API to dump a schema to a file. Utilizing the Data Pump requires a lot of configuration (such as granting test users certain roles, and creating and granting access to directories), which can be done using the administrative user.
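To make the PL/SQL route more concrete, here is a hedged sketch that only builds the text of an anonymous block calling Oracle's DBMS_DATAPUMP package. The helper name `build_export_block` and the schema, directory, and file names are assumptions for illustration. The parameter names reflect my reading of the package and should be checked against Oracle's documentation before use; the block would be executed on the administrative connection.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_$#]*$")
_DUMP_FILE = re.compile(r"^[A-Za-z0-9_]+\.dmp$")


def _check_identifier(name):
    """Reject anything that is not a plain, unquoted Oracle identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name.upper()


def build_export_block(schema, directory, dump_file):
    """Return PL/SQL text that exports one schema into a Data Pump dump file."""
    schema = _check_identifier(schema)
    directory = _check_identifier(directory)
    if not _DUMP_FILE.match(dump_file):
        raise ValueError(f"unsafe dump file name: {dump_file!r}")
    # Values are interpolated only after validation, since they end up
    # inside string literals of the generated block.
    return f"""
DECLARE
  job NUMBER;
  state VARCHAR2(30);
BEGIN
  job := DBMS_DATAPUMP.OPEN(operation => 'EXPORT', job_mode => 'SCHEMA');
  DBMS_DATAPUMP.ADD_FILE(
    handle => job, filename => '{dump_file}', directory => '{directory}');
  DBMS_DATAPUMP.METADATA_FILTER(
    handle => job, name => 'SCHEMA_EXPR', value => 'IN (''{schema}'')');
  DBMS_DATAPUMP.START_JOB(handle => job);
  DBMS_DATAPUMP.WAIT_FOR_JOB(handle => job, job_state => state);
END;"""
```

Because the dump is written by the database server into an Oracle directory object, this route needs no client-side utilities, which is what makes it attractive for remote databases.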
Further, creating a schema dump and then importing it back in might take considerable time, depending upon the machine on which the database is running. However, these disadvantages are potentially outweighed if there are a large number of independent tests that can be run in parallel.
Advantages

The primary advantage would be the parallel running of tests. Furthermore, usage of the PL/SQL API removes the requirement for client-side utilities while performing tests.
Schedule and Milestones

Here are the rough goals I have in mind:

Discussing and coming to a consensus on how to deal with in-memory SQLite databases.
Discussing how best to create Oracle schema dumps (especially for remote databases).
Discussing writing tests for the test runner and ParallelTestSuite.

The schedule detailed below assumes that my university's academic calendar is followed. However, with the Covid-19 pandemic, I suspect that my semester may end later than scheduled, even though online classes are going on at the moment.
My university exams are scheduled for the first two weeks of May. The community bonding period also starts around the same time. Since the period lasts 4 weeks, I don't expect my exams to be a huge detriment to my bonding with the community or with my mentor.
Towards the end of July and the beginning of August, I will be busy with academic registration and hence may not be able to work on the project full time.
Other than the above, I do not have any commitments during my summer.
Community Bonding

(May 5 - May 31 -- 4 weeks)
Community bonding is necessary for a community-driven project like Django. I hope to engage with the community on my project as well as on other patches.
First Milestone - ParallelTestSuite

(June 1 - June 29 -- 4 weeks)
Weeks 1 and 2: Initializing and running the child processes

I will spend the first two weeks rewriting the ParallelTestSuite and the DiscoverRunner to support both fork and spawn. I will use the POC as a basis for my implementation. In the POC, the DiscoverRunner expects the arguments and callbacks required for setting up the child processes. I would like to look into this further and see whether there are alternatives to callbacks that also minimize the number of breaking changes.
By the end of the second week, I hope to have a working test runner and suite that can use spawn as well as fork.
Week 3: Supporting in-memory SQLite databases

As mentioned earlier, the spawn method will break in-memory SQLite databases. During this week, I'll work on adding support for in-memory SQLite databases when using spawn. A number of methods to implement this have been discussed in the Overview section. The method chosen will depend upon its simplicity and performance.
Week 4: Finishing touches on the parallel test runner and documentation

During the 4th week, I plan to finish up the parallel test runner and document the changes made, including any API changes that may have been required (e.g., in DiscoverRunner).
By the Phase 1 Evaluation deadline, I plan to have finished adding support for the spawn method in DiscoverRunner and ParallelTestSuite.
Second Milestone - Oracle DB

(July 3 - July 27 -- 3.5 weeks)
Week 1: Creating schema dump

During the first week, I'll mostly be working on creating a schema dump. As stated earlier, the impdp/expdp utilities or the PL/SQL API can be used.
I expect this to take a week and to cause me the most trouble, since I haven't worked with Oracle in Python or otherwise, other than the small amount of testing I did as part of writing this proposal.
Week 2: Creating Test Users

As already mentioned in the Overview section, a different test user has to be created and assigned for each worker process. During this week, I'll create methods that set up these test users: creating them, setting up and assigning their tablespaces, and granting them any required permissions.
I aim to do this through a single API call that prepares all the test users in one go. The reason for a separate method, rather than doing this inside clone_test_db() after creating a test user, is that the connection parameters are switched to those of the test user. With each call to clone_test_db(), we would therefore have to switch the connection back to the administrative user. I believe it would be easier to create all test users in one go, using an API call. Note that, by API, I mean the BaseDatabaseCreation class.
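A minimal sketch of such a single call, under stated assumptions: the helper names, the `<base_user>_<n>` naming scheme, the shared tablespaces, and the exact set of privileges are all hypothetical. The sketch only generates the statements that the administrative connection would execute.

```python
def worker_user_names(base_user, workers):
    """Derive one user name per worker, e.g. test_django_1, test_django_2."""
    return [f"{base_user}_{index}" for index in range(1, workers + 1)]


def create_user_statements(user, password, tablespace, temp_tablespace):
    """Statements that create one test user and grant it basic privileges."""
    return [
        f'CREATE USER {user} IDENTIFIED BY "{password}" '
        f"DEFAULT TABLESPACE {tablespace} "
        f"TEMPORARY TABLESPACE {temp_tablespace} "
        f"QUOTA UNLIMITED ON {tablespace}",
        "GRANT CREATE SESSION, CREATE TABLE, CREATE SEQUENCE, "
        f"CREATE PROCEDURE, CREATE TRIGGER TO {user}",
    ]


def prepare_all_test_users(base_user, password, workers, tablespace, temp_tablespace):
    """Collect the statements for every worker so they can run in one pass."""
    statements = []
    for user in worker_user_names(base_user, workers):
        statements.extend(
            create_user_statements(user, password, tablespace, temp_tablespace)
        )
    return statements
```

Running all of these while the connection still belongs to the administrative user avoids switching credentials back and forth for every clone. Real code would also need to quote or validate the identifiers rather than interpolate them directly.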
Week 3: Import and remap dumped schema

The clone_test_db() method will handle loading the generated schema dump (implemented during Week 1) and remapping it to the test user.
Once again, I expect this to take a week.
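The import-and-remap step might be driven by a PL/SQL block like the one generated below. This is a sketch under assumptions: `build_import_block` is a hypothetical helper, the schema, directory, and file names are made up, and the DBMS_DATAPUMP parameter names reflect my reading of the package and should be verified against Oracle's documentation.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_$#]*$")
_DUMP_FILE = re.compile(r"^[A-Za-z0-9_]+\.dmp$")


def _check_identifier(name):
    """Reject anything that is not a plain, unquoted Oracle identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name.upper()


def build_import_block(source_schema, target_schema, directory, dump_file):
    """Return PL/SQL text that imports a dump and remaps it to another schema."""
    source_schema = _check_identifier(source_schema)
    target_schema = _check_identifier(target_schema)
    directory = _check_identifier(directory)
    if not _DUMP_FILE.match(dump_file):
        raise ValueError(f"unsafe dump file name: {dump_file!r}")
    return f"""
DECLARE
  job NUMBER;
  state VARCHAR2(30);
BEGIN
  job := DBMS_DATAPUMP.OPEN(operation => 'IMPORT', job_mode => 'SCHEMA');
  DBMS_DATAPUMP.ADD_FILE(
    handle => job, filename => '{dump_file}', directory => '{directory}');
  DBMS_DATAPUMP.METADATA_REMAP(
    handle => job, name => 'REMAP_SCHEMA',
    old_value => '{source_schema}', value => '{target_schema}');
  DBMS_DATAPUMP.START_JOB(handle => job);
  DBMS_DATAPUMP.WAIT_FOR_JOB(handle => job, job_state => state);
END;"""
```

One block would be generated per worker, each remapping the same dump to a different test user's schema.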
Week 4: Setting up child processes

During the 4th week, I'll be working on setting up the child processes to connect to their respective cloned schemas. I suspect this will also require making further changes to the ParallelTestSuite beyond those already made while working on the spawn method during Phase 1.
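As an illustration of how a child might find its own schema, the sketch below hands out worker numbers from a shared counter and derives per-worker connection settings from them. The helper names and the `<user>_<worker_id>` naming scheme are assumptions for the example, not a description of existing Django behavior.

```python
import copy
import ctypes
import multiprocessing


def claim_worker_id(counter):
    """Atomically take the next worker number from a shared counter."""
    with counter.get_lock():
        counter.value += 1
        return counter.value


def settings_for_worker(base_settings, worker_id):
    """Copy the database settings and point them at this worker's test user."""
    settings = copy.deepcopy(base_settings)
    settings["USER"] = f"{base_settings['USER']}_{worker_id}"
    return settings


# The counter would be created in the parent and passed to every child
# through the pool initializer; here both claims happen in one process.
counter = multiprocessing.Value(ctypes.c_int, 0)
base = {"ENGINE": "django.db.backends.oracle", "NAME": "xe", "USER": "test_django"}
first = settings_for_worker(base, claim_worker_id(counter))
second = settings_for_worker(base, claim_worker_id(counter))
```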
The second phase is only 3.5 weeks long according to the GSoC timeline. By the second evaluation, I expect to have a working (albeit incomplete) DiscoverRunner and ParallelTestSuite that can set up additional test users and clone the default test schema for the other worker processes. All that will remain is cleaning up the test database.
Third Milestone - Finishing up Oracle, code cleanup and documentation

(August 1 - August 24 -- 3 weeks)
Week 1: Cleaning up the test schemas

Assuming that the ParallelTestSuite is working as expected on Oracle, I'll work on the cleanup tasks associated with the test runner and database, i.e., dropping test users, schemas, tablespaces, etc.
Weeks 2 and 3: Code cleanup and documentation

These two weeks will be dedicated solely to cleaning up my code and adding documentation. They will also act as a buffer in case I miss one of my milestones and fall behind schedule.
Rest of the Summer

If I hit all my milestones ahead of target, including code cleanup, documentation, and merging the changes, then I would like to dedicate the rest of my time to working on existing tickets on Trac, specifically in the database layer. A few of them are #7623, #11541, and #29771. In case I'm not able to work on these during the GSoC period, I fully intend to work on them after GSoC.
About Me

My name is Abhijeet Viswakumar. I am an undergraduate engineering sophomore, majoring in Electronics and Instrumentation Engineering at BITS Pilani, Hyderabad Campus. I live in India (UTC+05:30). I began coding at the age of 12 in Visual Basic 6, after having stumbled on a small online community of game developers.
I started working with Python and Django as part of a college project called SmartCampus. It is a mobile-first application, used as a campus exclusive payments platform. After launching in mid-September 2019, the platform has facilitated payments close to ₹15,220,000 (~$205,000).
I started contributing to Django in November 2019, after reporting #30953. I later reported and fixed #31246. I've also worked on #29129, #28290, and #31126.
My email is abhijeetviswa (at) gmail.com. My nick on the #django and #django-dev IRC channels as well as on the Django forum is abhijeetviswa.