News / August 2018
Administrators and users of iRODS have, for a long time, found its rule engine and associated rule language a valuable tool for customizing their systems' data policy, workflows, and other data management needs. Whether you are configuring a specific Policy Enforcement Point (PEP) with a series of actions to respond to iRODS system events, or writing an application more specific to your iRODS Zone, the process of authoring rules sometimes demands a knowledge of certain techniques and idioms.
A frequently used technique is retrieving information about data and other objects registered in the iRODS Catalog (a relational database). More specifically, this means querying and extracting individual rows from that database via the iRODS server that manages the local iRODS Zone. This near-SQL feature of iRODS is known as the General Query (or GenQuery), and in the native iRODS rule language the corresponding idiom is the Language Integrated General Query (or LIGQ).
When authoring in the iRODS rule language, anyone familiar with SQL can use the LIGQ to query the iRODS Catalog of all objects and metadata that have been registered to it.
The following snippet utilizes the LIGQ:
*host = '';
foreach (*h in SELECT RESC_LOC WHERE DATA_RESC_NAME = '*resc_name' ) {
    *host = *h.RESC_LOC;
}
The above example uses the Language Integrated General Query syntax to find the hostname of the iRODS server hosting the storage resource named *resc_name. The hostname yielded by the query is placed in the output variable *host.
Python Rules
Of course, not everyone has the time to learn a domain-specific language, which is why a plugin interface has been designed to allow rules to be written in a variety of languages. In fact, it is already practical to write rules in the Python scripting language, as long as your local administrator has installed and configured the Python Rule Engine Plugin (PREP).
Python is a particularly good fit for this task since it is full-featured, easy to learn, and lately very popular: it is routinely named among the top five programming languages used in science and industry.
In addition, Python's object orientation and "iterable" abstraction are of great interest here. Its generator functions and user-definable, class-based iterators can offer a streamlined and straightforward interface to things such as database queries.
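As a rough illustration of that idea (and not the plugin's actual implementation), the sketch below wraps a paged data source in a generator function. The fetch_page function, the PAGE_SIZE value, and the in-memory FAKE_ROWS table are all hypothetical stand-ins for a catalog query that returns a batch of rows plus a continuation marker.

```python
# Hypothetical stand-in for a catalog: a list of (collection, name) rows.
FAKE_ROWS = [("/tempZone/home/alice", "file{}.dat".format(i)) for i in range(7)]
PAGE_SIZE = 3  # assumed batch size, for this sketch only


def fetch_page(offset):
    """Return (rows, next_offset); next_offset is 0 when no rows remain."""
    rows = FAKE_ROWS[offset:offset + PAGE_SIZE]
    next_offset = offset + PAGE_SIZE
    if next_offset >= len(FAKE_ROWS):
        next_offset = 0
    return rows, next_offset


def iter_rows():
    """Yield one row at a time, hiding the page-fetching loop from the caller."""
    offset = 0
    while True:
        rows, offset = fetch_page(offset)
        for row in rows:
            yield row
        if offset == 0:
            break


# The caller sees a flat sequence of rows and never handles paging itself.
paths = ["{}/{}".format(coll, name) for coll, name in iter_rows()]
```

The point of the design is that the continuation bookkeeping lives in one place, so every consumer can use a plain for loop or comprehension.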
Iterator Magic
Without the iterator abstraction, we might have written the following code to scan, and place into a Python list structure, all data objects in our local zone that are owned by the user alice:
import irods_types

def findMyObjects(rule_args, callback, rei):
    My_Results_List = []
    # Build the query input: selected columns, then the condition
    ret_val = callback.msiMakeGenQuery("COLL_NAME, DATA_NAME", "DATA_OWNER_NAME = 'alice'",
                                       irods_types.GenQueryInp())
    genQueryInp = ret_val['arguments'][2]
    # Run the query and fetch the first batch of rows
    ret_val = callback.msiExecGenQuery(genQueryInp, irods_types.GenQueryOut())
    genQueryOut = ret_val['arguments'][1]
    continue_index = 1
    while continue_index > 0:
        for j in range(genQueryOut.rowCnt):
            entry = '{}/{}'.format(genQueryOut.sqlResult[0].row(j), genQueryOut.sqlResult[1].row(j))
            My_Results_List.append(entry)
        # A positive continuation index means more rows are waiting
        continue_index = genQueryOut.continueInx
        if continue_index > 0:
            ret_val = callback.msiGetMoreRows(genQueryInp, genQueryOut, 0)
            genQueryOut = ret_val['arguments'][1]
This does work, but it is neither compact nor especially readable. It might be improved with some clever refactoring, but the better choice would be to create an iterator that streamlines row-by-row access to the query results.
We have done exactly that, and the effort will be included in the next PREP release in the form of the genquery.py module. Like the session_vars.py module, it needs only to be imported from within core.py or from Python rules submitted to irule.
With use of the new iterator, the above code becomes merely:
from genquery import ( row_iterator, paged_iterator, AS_DICT, AS_LIST )
# ...
def findMyObjects(rule_args, callback, rei):
    My_Results_List = []
    rows = row_iterator(
        ["COLL_NAME", "DATA_NAME"],     # requested columns
        "DATA_OWNER_NAME = 'alice'",    # condition for query
        AS_DICT,                        # retrieve as key/value structure
        callback)
    for row in rows:
        My_Results_List.append("{COLL_NAME}/{DATA_NAME}".format(**row))
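The format call above depends on each row arriving as a dictionary keyed by column name. The sketch below shows the two row shapes side by side using hand-written sample values (invented for illustration; no catalog is queried). It assumes, as the examples in this post suggest, that AS_DICT yields a mapping keyed by column name and AS_LIST yields values in the order the columns were requested.

```python
columns = ["COLL_NAME", "DATA_NAME"]

# Invented sample values standing in for one row of query results.
row_as_list = ["/tempZone/home/alice", "notes.txt"]
row_as_dict = dict(zip(columns, row_as_list))

# Dictionary rows can be unpacked by column name...
path_from_dict = "{COLL_NAME}/{DATA_NAME}".format(**row_as_dict)
# ...while list rows are unpacked by position.
path_from_list = "{0}/{1}".format(*row_as_list)
```

Named fields tend to survive a reordering of the requested columns, whereas positional fields are shorter to write when the column list is fixed.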
To page through a large number of results from the catalog, we can use paged_iterator to retrieve lists of up to 256 (MAX_SQL_ROWS) rows each.
The following example takes this approach, and could be used to log the paths of all data objects in the local zone that exceed a certain size limit:
# Intended to run inside a rule function, where `callback` is in scope.
MY_SIZE_LIMIT = (10 * 1024**3 - 1)  # record everything ten gibibytes or larger
for page in paged_iterator(["COLL_NAME", "DATA_NAME", "DATA_SIZE"],
                           "DATA_SIZE > '{}'".format(MY_SIZE_LIMIT),
                           AS_LIST, callback):
    for row in page:
        callback.writeLine("serverLog", "{!s}".format(
            ("Collection_Name: {0}\t" +
             "Data_Object: {1}\t" +
             "Data_Size: {2}").format(*row)
        ))
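To make the paging behavior concrete without a running iRODS server, here is a minimal sketch that groups any iterable of rows into lists of at most 256 rows. The pages_of helper is hypothetical and only illustrates the batching idea; it is not a description of how paged_iterator is implemented.

```python
from itertools import islice

PAGE_SIZE = 256  # the MAX_SQL_ROWS figure quoted in the text above


def pages_of(rows, size=PAGE_SIZE):
    """Group any iterable of rows into lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        page = list(islice(iterator, size))
        if not page:
            return
        yield page


# 600 dummy rows split into two full pages and one partial page.
page_lengths = [len(page) for page in pages_of(range(600))]
```

Working a page at a time keeps memory use bounded and gives a natural place for per-batch work, such as writing one log entry per page.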
Coming Soon
The genquery.py module will be included by default in the next release of the Python Rule Engine Plugin, but existing deployments of iRODS will need to add the import statement to the top of their core.py.