omegaml/README.md

## README.md

      
Using shared modules in omega-ml

Shared modules allow data scientists to implement complex functionality
as virtual functions, classes, installable scripts, or packaged apps.
A motivating example

Consider a use case where we want to leverage SQLAlchemy's ORM models
so that we can use them in multiple virtualobj functions.
# models.py
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "user_account"
    id = Column(Integer, primary_key=True)
    name = Column(String(30))
    fullname = Column(String)

    def __repr__(self):
        return f"User(id={self.id!r}, name={self.name!r}, fullname={self.fullname!r})"

    def as_dict(self):
        # plain dict of column values, so that results can be serialized
        return {c.name: getattr(self, c.name) for c in self.__table__.columns}

Define the SQL dataset as follows:
om.datasets.put('mssql+pyodbc://user:password/db', 'sqldb')

If the database does not exist yet, you can create the tables as follows:
connection = om.datasets.get('sqldb', raw=True)
Base.metadata.create_all(connection.engine)

Now we can use the User model by creating a session from this dataset:
from sqlalchemy import select
from sqlalchemy.orm import Session
connection = om.datasets.get('sqldb', raw=True)
Base.metadata.create_all(connection.engine)
# add a user
with Session(connection.engine) as session:
    user = User(name='Jane', fullname='Walker')
    session.add(user)
    session.commit()
# query users
with Session(connection.engine) as session:
    result = session.execute(select(User))
    users = result.scalars().all()

Unfortunately, when we have several virtualobj functions, we would have to
repeat the Model definition in every function. This is because virtualobj
functions require all code to be in their local scope, due to the way
Python's pickle serialization works.
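
To see why, it helps to look at what the standard library's pickle stores for a function. The sketch below uses json.dumps purely as a stand-in: the payload holds only the module and function name, not the code itself. omega-ml's own serializer may differ in detail, so treat this as an illustration of plain pickle's by-reference behaviour rather than a description of omega-ml internals.

```python
import json
import pickle

# pickle stores a module-level function as a reference (module + name),
# not as a copy of its code
payload = pickle.dumps(json.dumps)

# the payload is small and contains just the two names
print(len(payload), payload)

# unpickling re-imports the module and looks the name up again,
# which can only work where that module is importable
restored = pickle.loads(payload)
print(restored is json.dumps)
```

Anything a function refers to outside its own body must therefore be importable on the worker, or be defined inside the function.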
How to modularize code

Let's look at several practical and powerful options for sharing common code:

Option 1: ORM Models as a VirtualObjectHandler dataset
Option 2: Use a common VirtualObjectHandler with virtualobj subclasses
Option 3: Package code as a script
Option 4: Implement a Flask or Dash App
Option 5: Install third-party packages

Below we show example code for each option. At the end of the article we
evaluate the pros and cons of each option and look at the best ways to
organize your code.
Option 1: ORM Models as a VirtualObjectHandler dataset

One way to modularize common code is to implement a VirtualObjectHandler
subclass:
# mymodule.py
from omegaml.backends.virtualobj import VirtualObjectHandler
__name__ = '__code__' if __name__ != '__main__' else __name__

class Models(VirtualObjectHandler):
    def models(self, *args, **kwargs):
        from sqlalchemy import Column
        from sqlalchemy import Integer
        from sqlalchemy import String
        from sqlalchemy.orm import declarative_base
        
        Base = declarative_base()

        class User(Base):
            __tablename__ = "user_account"
            id = Column(Integer, primary_key=True)
            name = Column(String(30))
            fullname = Column(String)

            def __repr__(self):
                return f"User(id={self.id!r}, name={self.name!r}, fullname={self.fullname!r})"

            def as_dict(self):
                return {c.name: getattr(self, c.name) for c in self.__table__.columns}

        self.User = User
        return self

    def __call__(self, *args, **kwargs):
        # enable e.g. om.datasets.get('orm/models').User
        return self.models()

We can store this as a dataset or a script:
# always use replace=True (to avoid calling the class)
om.datasets.put(Models, 'orm/models', replace=True)

Now in our virtualobj function we can use the models as before:
from omegaml.backends.virtualobj import virtualobj

@virtualobj
def myfunc(*args, **kwargs):
    import omegaml as om
    from sqlalchemy.orm import Session
    from sqlalchemy import select
    # load the ORM models and db connection
    models = om.datasets.get('orm/models')
    connection = om.datasets.get('sqldb', raw=True)
    # use the ORM models and db connection
    with Session(connection.engine) as session:
        result = session.execute(select(models.User))
        objs = result.all()
    # convert to dict so result can be serialized
    # -- note the custom as_dict() method on the User model
    return [o[0].as_dict() for o in objs]

Option 2: Use a common VirtualObjectHandler

Sometimes we may have virtualobj functions that are the same except for some
specific detail. As an example, consider the module loading code from Option 3 below.
In this situation it is convenient to implement a common base class, as a custom
VirtualObjectHandler, and make each virtualobj function a subclass.
The following SharedBaseModel implements the module loading logic:
from omegaml.backends.virtualobj import virtualobj, VirtualObjectHandler
# this line is explained in the section "Avoid module not found errors" at the end
__name__ = '__code__' if __name__ != '__main__' else __name__

class SharedBaseModel(VirtualObjectHandler):
    def __call__(self, *args, update=False, **kwargs):
        self.load_modules(update=update)
        return super().__call__(*args, **kwargs)

    def load_modules(self, update=False):
        import omegaml as om
        import shutil
        packages = ['shared']
        path = '/tmp/local/packages'
        if update:
            shutil.rmtree(path, ignore_errors=True)
        for pkg in packages:
            mod = om.scripts.get(pkg, install=True, keep=True, localpath=path)
            setattr(self, pkg, mod)

    def predict(self, **kwargs):
        raise NotImplementedError

Now we transform our virtualobj function to a subclass of the SharedBaseModel:
class MyModel(SharedBaseModel):
    def predict(self, **kwargs):
        import omegaml as om
        from sqlalchemy.orm import Session
        from sqlalchemy import select
        # get the models module from the shared package
        models = self.shared.models
        # work with models as before
        connection = om.datasets.get('sqldb', raw=True)
        with Session(connection.engine) as session:
            result = session.execute(select(models.User))
            objs = result.all()
        return [o[0].as_dict() for o in objs]

Deploying this is essentially as before, except now we store the MyModel
class instead of the function itself.
om.models.put(MyModel, 'myfunc', replace=True)
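
The structure above is a template-method pattern: the base class owns the shared step and each subclass supplies only predict(). The following plain-Python sketch shows that shape without omega-ml. BaseHandler, UpperModel, LowerModel and load_resources are hypothetical names used for illustration only, not part of omega-ml's API.

```python
class BaseHandler:
    """Hypothetical stand-in for a shared base class (not omega-ml's API)."""

    def __call__(self, *args, update=False, **kwargs):
        # shared step: runs before every subclass-specific predict()
        self.resources = self.load_resources(update=update)
        return self.predict(*args, **kwargs)

    def load_resources(self, update=False):
        # placeholder for loading shared packages
        return {"greeting": "hello", "reloaded": update}

    def predict(self, *args, **kwargs):
        raise NotImplementedError


class UpperModel(BaseHandler):
    def predict(self, name):
        return f"{self.resources['greeting']} {name}".upper()


class LowerModel(BaseHandler):
    def predict(self, name):
        return f"{self.resources['greeting']} {name}".lower()


print(UpperModel()("Jane"))  # HELLO JANE
print(LowerModel()("Jane"))  # hello jane
```

Each subclass stays small because the loading logic lives in one place. A change to load_resources() reaches every model the next time it is stored.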

Option 3: Package code as a script

For larger, more complex code it is considered a best practice to modularize
the code and distribute it as an installable "pip" package.
Package Structure
Create a shared module as a normal python package, e.g.
project
+ shared
  + __init__.py
  + models.py
  + common.py 
setup.py

In __init__.py, be sure to import any dependent modules, so they can be
accessed as attributes.
# __init__.py
import shared.models as models
import shared.common as common
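
The reason for these imports is that Python binds a submodule as an attribute of its package only once the submodule has been imported. The sketch below demonstrates this with a throwaway package built in a temporary directory. The package name demo_shared_pkg is made up for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# build a throwaway package on disk (the name demo_shared_pkg is made up)
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_shared_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")  # empty: imports nothing
(pkg_dir / "models.py").write_text("USER_TABLE = 'user_account'\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()
pkg = importlib.import_module("demo_shared_pkg")

# with an empty __init__.py the submodule is not an attribute yet
before = hasattr(pkg, "models")

# importing the submodule (what __init__.py would do) binds it on the package
importlib.import_module("demo_shared_pkg.models")
after = hasattr(pkg, "models")

print(before, after)  # False True
```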

Deployment
We can deploy this code by running
$ om scripts put ./project/shared shared

Once deployed we can load the module in our virtualobj function:
from omegaml.backends.virtualobj import virtualobj

@virtualobj
def myfunc(*args, update=False, **kwargs):
    import omegaml as om
    from sqlalchemy.orm import Session
    from sqlalchemy import select

    # load the shared package
    # -- this is equivalent to Python's import statement
    # -- optionally, we reinstall before using this
    if update:
        from shutil import rmtree
        rmtree('/tmp/local/packages', ignore_errors=True)
    shared = om.scripts.get('shared', keep=True, install=True, localpath='/tmp/local/packages')

    # now we can use the models, and any other shared code, as before
    models = shared.models
    connection = om.datasets.get('sqldb', raw=True)
    with Session(connection.engine) as session:
        result = session.execute(select(models.User))
        objs = result.all()
    return [o[0].as_dict() for o in objs]

Note that when the shared module is updated, we need to force a reload by all runtime workers:
# run this every time you have updated the package
om.runtime.model('myfunc').predict([], update=True)
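
The update flag implements an "install once, reuse, refresh on demand" cache. The sketch below shows that logic in isolation with the standard library only. get_package and its installs list are hypothetical helpers, and creating a directory stands in for the real install step. Unlike the function above, it removes only the one package's directory rather than the whole cache path.

```python
import shutil
import tempfile
from pathlib import Path

installs = []  # records every time the slow "install" step runs


def get_package(name, cache_dir, update=False):
    """Return the install location of a package, installing only if needed."""
    target = Path(cache_dir) / name
    if update:
        # force a fresh install by dropping the cached copy
        shutil.rmtree(target, ignore_errors=True)
    if not target.exists():
        # stand-in for the slow download-and-install step
        target.mkdir(parents=True)
        (target / "__init__.py").write_text("")
        installs.append(name)
    return target


cache = tempfile.mkdtemp()
get_package("shared", cache)               # first call installs
get_package("shared", cache)               # second call reuses the cache
get_package("shared", cache, update=True)  # update=True reinstalls
print(len(installs))  # 2
```

This is why the first request is slow and later requests are fast. It is also why each worker needs one call with update=True after a new version has been deployed.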

Sometimes it may be necessary to restart all runtime workers in order to install an update:
# allow a few minutes for the runtime to restart
$ om runtime celery control shutdown

# check periodically to see if the runtime is up by running again:
$ om runtime ping
{'message': 'ping return message', 'time': '2023-06-10T14:44:05.715022', 'args': (), 'kwargs': {}, 'worker': 'celery@worker-system-worker-omdemo'}

# --OR--
$ om runtime status
{'celery@eowyn': [], 'celery@worker-system-worker-omdemo': []}

Option 4: Implement a Flask or Dash App

To implement a UI or dashboard application it may be most useful to leverage
a web application framework like Flask or Plotly Dash. In practice
this is similar to Option 3, packaging code as a script; however, we now call
this an "app".
The difference between an "app" and a "script" is that your code not only
contains a model's logic that is served via a REST API, but it also provides
the user interface as a web application. In the backend of the application we
can still leverage options 1 - 3 as we see fit.
Package Structure
Create an app module as a normal python package, e.g.
project
+ helloworld
  + static
    . base.css
  + __init__.py
  + app.py
  + routes.py
  + models.py
  + common.py 
setup.py

Deployment
We can directly deploy this app to the omega-ml runtime:
$ om scripts put ./project/helloworld apps/helloworld
$ om runtime restart app helloworld

Option 5: Install third-party packages

Note this is for temporary installations and testing only. For
permanent installation, use your organization's deployment process
for the container base images that run the omega-ml platform.
We can install third-party Python packages using the following command:
$ om runtime env install <package name>

This requires that the omegaml runtime can access the PyPI Package index. If
this is not the case, for example in an internal network that blocks outside
URLs, we can still install packages by uploading them to the runtime worker
and running a small job to install the packages. To do this follow these
steps:


Download the packages as wheel files, and create a tar file
$ mkdir ./packages && cd packages
$ pip download <name>
$ tar -czf mypackages.tgz *.whl

Be sure to download these packages using the same Python version and
runtime platform as the omegaml runtime worker. Unless the packages match
the runtime's Python version and platform, the installation process will fail.


Save the tarfile to om.datasets
$ om datasets put mypackages.tgz packages/mypackages.tgz


Create the following notebook and run it on the runtime
$ om shell
[] code = """
   import omegaml as om 
   om.datasets.get('packages/mypackages.tgz', local='/tmp/packages/mypackages.tgz')
   !tar -C /tmp/packages -xf /tmp/packages/mypackages.tgz 
   !ls /tmp/packages
   %pip install --no-index --find-links /tmp/packages -U pandas
   """
   om.jobs.create(code, 'install-packages')
   om.runtime.job('install-packages').get()
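
Because the wheel files must match the runtime worker's Python version and platform, it can help to print these values on both the download machine and the worker and compare them before downloading. The following is a minimal sketch using only the standard library. The selection of fields is an assumption about what is worth comparing, not an omega-ml requirement.

```python
import platform
import sys
import sysconfig

# values to compare between the download machine and the runtime worker
info = {
    "python": f"{sys.version_info.major}.{sys.version_info.minor}",
    "implementation": platform.python_implementation(),
    "platform": sysconfig.get_platform(),
    "machine": platform.machine(),
}
for key, value in info.items():
    print(f"{key}: {value}")
```

If any value differs, download the wheels in an environment that matches the worker.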


Inside an application deployed using apphub, you can use the same process,
by running the install-packages notebook on startup of the application:
# app.py
def create_app(...):
  import omegaml as om 
  om.jobs.run('install-packages')

# other modules
import <package-name>


How do I know which option fits my use case?

All options are valid. Choose by considering your objectives and the size and complexity
of your code:


| Objective | (1) Virtual dataset or script | (2) Subclasses (common base class) | (3) Packaged script | (4) Packaged app |
|---|---|---|---|---|
| Small number of shared objects or functions | X | (X) | | |
| Same logic with a few specifics for each function | (X) | X | | |
| Complex or large amount of shared code | | | X | |
| Combined logic and user interface | | | | X |


A key distinction between virtual objects (functions/VirtualObjectHandler subclasses) and packaged scripts/apps is that
the latter take longer to package, deploy and install. While virtual objects are fast and easy to deploy, scripts and
apps provide a workflow that is more amenable to a traditional software engineering process. Apps
can provide a runtime performance advantage because all the scripts are loaded at startup time,
while virtual objects and packaged scripts are loaded for each execution.

The following table lists the trade-offs for each option.


| Trade-off | (1) Virtual dataset or script | (2) Subclasses (common base class) | (3) Packaged script | (4) Packaged app |
|---|---|---|---|---|
| When loaded | each request | each request | each request | app startup |
| Time to load during request processing | 10-100ms *) | 10-100ms *) | > 10 seconds (first-time); < 50ms (subsequent) | already loaded |
| Accessible from REST API | yes | yes | yes | no (app provides its own REST API) |
| Scalable by adding more runtime instances | yes | yes | yes | yes |


How to best organize and deploy my code?

omega-ml is designed with simplicity and fast deployment in mind. That is,
when you develop a model or a script, it can be instantly deployed and used
by the runtime without delay. This results in a fast feedback development
model, where you can develop and run your code as a backend, very much the
same way as you do in a Jupyter Notebook on your local workstation or laptop.
While this is great for an exploratory and iterative style of working, when we
deploy a production application, we want a stable and repeatable process. omega-ml
provides the same seamless experience in this case, adding a repeatable process
in the form of deployable artifacts.
There are two phases to writing and deploying code with omega-ml:


Develop your code and run it in the omega-ml runtime
omega-ml does not dictate a particular way of organizing your code. You may
use Jupyter Notebooks, modularize your code in separate .py files, or
leverage packaged scripts. The best way depends on the complexity of
your application and your working style. When your work is of an exploratory
nature, Jupyter Notebooks are a great fit. If you want to modularize your code
and create a maintainable code base, applying software engineering best
practices like modularity, versioning and CI/CD, using packaged scripts is the
best fit.


Export artifacts and deploy to the target environment
Once you have tested your code in your omega-ml development environment,
you want to export and deploy it to your production system. This
works by exporting all artifacts and then importing them again in
the target environment. This is achieved by running the om runtime export
and om runtime import commands, respectively.


Since these two phases can each consist of multiple steps, we can combine all
steps in a deployfile.yaml. This serves both as the repeatable process as well
as the documentation of all of our deployments.
A sample deployfile looks like this:
# deployfile.yaml
# -- get a complete example by running: om runtime deploy example
datasets:
  - name: mydata
    local: data/mydata.csv
models:
  - name: mymodel
    local: package.mymodel
scripts:
  - name: apps/helloworld
    local: ./helloworld

We can run this deployfile by running om runtime deploy. Get more details
by running om runtime deploy example and om help runtime.
Avoid "module not found" errors

If the runtime raises a "module not found" error, it means that your virtual
object contains a reference to the module that defines the class or function. This is
a result of the way that Python serializes objects. We can avoid this error
by adding the following line of code to all modules declaring a base class,
a @virtualobj function or a VirtualObjectHandler class.
# first line of code to add to all your modules that define
# a base class, a @virtualobj function or a VirtualObjectHandler class
__name__ = '__code__' if __name__ != '__main__' else __name__

For example our mymodule.py code will look like this:
# mymodule.py
from omegaml.backends.virtualobj import VirtualObjectHandler
__name__ = '__code__' if __name__ != '__main__' else __name__

class BaseModel(VirtualObjectHandler):
    ...

class Model(BaseModel):
    ...
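
The override has an effect because a class records the value of `__name__` that is in scope when the class is defined, and serializers use that recorded module name to locate the class later. The sketch below shows only this recording step in plain Python, with exec standing in for loading the module under two different names. How omega-ml treats the `__code__` name internally is not covered here.

```python
source = "class Model:\n    pass\n"

# define the same class in two namespaces that differ only in __name__
as_imported = {"__name__": "mymodule"}
as_code = {"__name__": "__code__"}
exec(source, as_imported)
exec(source, as_code)

# the class remembers the __name__ that was in effect at definition time
print(as_imported["Model"].__module__)  # mymodule
print(as_code["Model"].__module__)      # __code__
```

With the override in place, the class no longer points at mymodule, a module the runtime worker may not be able to import.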