@trel
Forked from d-w-moore/PREP_Genquery_Iterator.md
Last active August 4, 2018 02:38
Work in progress, Blog Entry for Python RE GenQuery Iterator

News / August 2018

GenQuery Iterator for the iRODS Python Rule Engine Plugin

Administrators and users of iRODS have, for a long time, found its rule engine and associated rule language a valuable tool for customizing their systems' data policy, workflows, and other data management needs. Whether you are configuring a specific Policy Enforcement Point (PEP) with a series of actions to respond to iRODS system events, or writing an application more specific to your iRODS Zone, the process of authoring rules sometimes demands a knowledge of certain techniques and idioms.

A frequently-used technique involves the retrieval of information about data and other objects registered to the iRODS Catalog (in a relational database). More specifically, this means querying and extracting individual rows from that database via the iRODS server that manages the local iRODS Zone. This near-SQL feature of iRODS is known as the General Query (or GenQuery), and in the native iRODS rule language the corresponding idiom is the Language Integrated General Query (or LIGQ).

When authoring in the iRODS rule language, anyone familiar with SQL can use the LIGQ to query the iRODS Catalog of all objects and metadata that have been registered to it.

The following snippet utilizes the LIGQ:

 *host = '';
 foreach (*h in SELECT RESC_LOC WHERE DATA_RESC_NAME = '*resc_name' ) {
      *host = *h.RESC_LOC;
 }

The above example uses the Language Integrated General Query syntax to find the hostname of the iRODS server hosting the storage resource named *resc_name. The hostname yielded by the query is placed in the output variable *host.

Python Rules

Of course, not everyone has the time to learn a domain-specific language, which is why a plugin interface was designed to allow rules to be written in a variety of languages. In fact, it is already practical to write rules in Python, as long as your local administrator has installed and configured the Python Rule Engine Plugin (PREP).

Python is a particularly good fit for this task: it is full-featured, easy to learn, and popular enough that it is typically named among the top five programming languages used in science and industry.

In addition, Python's object orientation and "iterable" abstraction are of great interest here. Its generator functions and user-definable, class-based iterators can offer a streamlined and straightforward interface to things such as database queries.
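To make the idea concrete, here is a minimal plain-Python sketch of a generator that hides a paged data source behind ordinary iteration. It does not touch iRODS at all: `fetch_page`, `iter_rows`, `TABLE`, and the `/tempZone/home/alice` paths are hypothetical names invented for this illustration.

```python
def fetch_page(table, offset, page_size):
    """Hypothetical stand-in for a paged database call: returns one page of rows."""
    return table[offset:offset + page_size]

def iter_rows(table, page_size=3):
    """Generator that fetches page by page but yields one row at a time."""
    offset = 0
    while True:
        page = fetch_page(table, offset, page_size)
        if not page:          # an empty page means the source is exhausted
            break
        for row in page:
            yield row
        offset += len(page)

# Illustrative data only: (collection, data object name) pairs.
TABLE = [("/tempZone/home/alice", "file{}.dat".format(i)) for i in range(7)]

# The caller sees a flat stream of rows and never deals with paging.
paths = ["{}/{}".format(coll, name) for coll, name in iter_rows(TABLE)]
```

The caller's loop stays the same whether the source returns three rows per page or three hundred, which is the property that makes iterators attractive for query results.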

Iterator Magic

Without the iterator abstraction, we might have written the following code to scan, and place into a Python list structure, all data objects in our local zone that are owned by the user alice:

import irods_types  # provided by the Python Rule Engine Plugin

def findMyObjects(rule_args, callback, rei):

  My_Results_List = []

  # Build the query: columns to select, then the condition.
  ret_val = callback.msiMakeGenQuery( "COLL_NAME, DATA_NAME" , "DATA_OWNER_NAME = 'alice'",
              irods_types.GenQueryInp())
  genQueryInp = ret_val['arguments'][2]

  # Run the query and fetch the first batch of rows.
  ret_val = callback.msiExecGenQuery(genQueryInp, irods_types.GenQueryOut())
  genQueryOut = ret_val['arguments'][1]

  continue_index = 1

  # Keep fetching batches until the server reports there are no more rows.
  while continue_index > 0:
    for j in range(genQueryOut.rowCnt):
      # Column 0 is COLL_NAME and column 1 is DATA_NAME.
      entry = '{}/{}'.format(genQueryOut.sqlResult[0].row(j), genQueryOut.sqlResult[1].row(j))
      My_Results_List.append(entry)
    continue_index = genQueryOut.continueInx
    if continue_index > 0:
      ret_val = callback.msiGetMoreRows(genQueryInp, genQueryOut, 0)
      genQueryOut = ret_val['arguments'][1]

This works, but it is neither compact nor especially readable. Some clever refactoring might improve it, but the better choice would be to create an iterator that streamlines row-by-row access to the query results.

We have done exactly that, and the result will be included in the next PREP release as the genquery.py module. Like the session_vars.py module, it only needs to be imported from within core.py or from Python rules submitted to irule.

With use of the new iterator, the above code becomes merely:

from genquery import ( row_iterator, paged_iterator, AS_DICT, AS_LIST )
# ...
def findMyObjects(rule_args, callback, rei):
  My_Results_List = []
  rows = row_iterator(
                  ["COLL_NAME","DATA_NAME"],   # requested columns
                  "DATA_OWNER_NAME = 'alice'", # condition for query
                  AS_DICT,                     # retrieve as key/value structure
                  callback)
  for row in rows:
    My_Results_List.append("{COLL_NAME}/{DATA_NAME}".format(**row))
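The two row formats differ only in how each row is handed back to the caller. The sketch below is a plain-Python illustration of that difference, not the genquery.py source: `shape_row` and the sample values are hypothetical, and it assumes that a dictionary row is keyed by the requested column names, as the `format(**row)` call above suggests.

```python
COLUMNS = ["COLL_NAME", "DATA_NAME"]
VALUES = ["/tempZone/home/alice", "notes.txt"]   # illustrative values only

def shape_row(columns, values, as_dict):
    """Hypothetical helper: present one result row as a dict or as a list."""
    if as_dict:
        return dict(zip(columns, values))   # column name -> value
    return list(values)                     # values in requested column order

row_dict = shape_row(COLUMNS, VALUES, as_dict=True)
row_list = shape_row(COLUMNS, VALUES, as_dict=False)

# A dict row is addressed by column name, a list row by position.
path_from_dict = "{COLL_NAME}/{DATA_NAME}".format(**row_dict)
path_from_list = "{0}/{1}".format(*row_list)
```

Dictionary rows keep the formatting code readable when many columns are requested, while list rows are handy when the column order is already obvious from the call.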

To page through a large number of results from the catalog, paged_iterator returns the rows in lists of up to 256 (MAX_SQL_ROWS) rows each.

The following example uses it to log the paths of all data objects in the local zone that exceed a certain size limit:

MY_SIZE_LIMIT = (10 * 1024**3 - 1) # record everything ten gibibytes or larger
for page in paged_iterator(["COLL_NAME","DATA_NAME","DATA_SIZE"],
                           "DATA_SIZE > '{}'".format(MY_SIZE_LIMIT),
                           AS_LIST, callback):
  for row in page:
    callback.writeLine("serverLog",
       ("Collection_Name: {0}\t"
        "Data_Object: {1}\t"
        "Data_Size: {2}").format(*row))
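The page-at-a-time pattern can also be pictured without an iRODS server. The following is a minimal sketch, assuming nothing about how genquery.py is implemented: `paged` is a hypothetical generator that groups any stream of rows into lists of at most `page_size` rows, and the 600 sample rows are made up for the illustration.

```python
def paged(rows, page_size):
    """Hypothetical generator: group an iterable of rows into lists of page_size."""
    page = []
    for row in rows:
        page.append(row)
        if len(page) == page_size:
            yield page
            page = []
    if page:                 # final, possibly shorter, page
        yield page

# Illustrative rows only: (collection, data object name, size in bytes).
rows = [("/tempZone/home/alice", "obj{}".format(i), str(i)) for i in range(600)]

pages = list(paged(rows, 256))
page_lengths = [len(p) for p in pages]
```

With 600 rows and a page size of 256, the stream splits into two full pages and one shorter final page, so code that consumes pages should not assume every page is full.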

Coming Soon

The genquery.py module will be included by default in the next release of the Python Rule Engine Plugin, but existing deployments of iRODS will need to add the import statement to the top of their core.py.
