
@tiry
Last active December 28, 2016 15:44

Nuxeo Queue Importer

Goal

nuxeo-importer-core contains several code samples that can be adapted to run imports leveraging:

  • thread-pooling
  • batching (import several documents inside a given transaction)
  • event processing filtering (enable bulk mode or skip some events)

This is the most efficient solution for running very fast imports.

However, the default implementation comes with some limitations and constraints:

  • extending the importer is done in Java
    • this can be an issue for non Java developers
  • multi-threading policy can be complex
    • multi-threading policy depends on the source layout and dependencies between entries
  • if the import fails in the middle, it must be restarted from the beginning

The work on the queue-based importer and Kafka aims at addressing these limitations.

Principles and architecture

Decoupling Read from Write

We want the importer infrastructure to promote a clear separation between the 2 sides of the import process:

  • Reader / Producer: the one reading the input data (from files, a DB ...)
  • Writer / Consumer: the one writing the data into the Nuxeo Repository

By decoupling the Reader and the Writer, we gain several things:

  • we can make the Writer/Consumer part very generic
    • and have a highly optimized importer engine
  • we can run the producer and the consumer separately
    • this means we can more easily re-run the import without being forced to re-run all the pre-processing
  • developers "working on the import process" mainly have to work on the Reader/Producer part
    • since this part is mostly decoupled from Nuxeo, they do not have to be Nuxeo developers

Queues

In order to achieve this decoupling, the idea is to add a queue between the 2 parts of the importer:

Source data => Producer => Queue(s) => Consumer => Import Data in Nuxeo

This is a new implementation of the importer: nuxeo-importer-queues. It clearly splits the importer flow into 2 sub-parts and makes the queue system externalizable.

  • Import part 1
    • read the data from the source
    • build an import message (can include some transformation)
    • en-queue the message
  • Import part 2
    • read the message from the queue
    • create a document inside the repository based on the message
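The two parts above can be sketched with a simple in-memory queue. This is a minimal illustration of the flow only; every name here (message fields, the fake repository) is an assumption, not the actual nuxeo-importer-queues API:

```python
import json
import queue

q = queue.Queue()

# Import part 1: read the source entries, build an import message
# (here a small JSON payload) and en-queue it.
def produce(source_entries):
    for entry in source_entries:
        message = json.dumps({"name": entry["name"], "type": "File"})
        q.put(message)
    q.put(None)  # sentinel: no more messages

# Import part 2: read messages from the queue and "create" documents.
def consume(repository):
    while True:
        message = q.get()
        if message is None:
            break
        doc = json.loads(message)
        repository.append(doc)  # stands in for a repository write

repository = []
produce([{"name": "doc-1"}, {"name": "doc-2"}])
consume(repository)
print(len(repository))  # → 2
```

Because the two functions only share the queue, either side can be replaced or re-run independently, which is exactly the point of the decoupling.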

The queue in the middle also allows us to completely decouple the threading model between the 2 parts:

  • part 1 can be single-threaded if this is simpler (since it is usually not the bottleneck)
  • part 2 is by default multi-threaded and batched to increase performance
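This asymmetric threading model can be sketched as follows: a single-threaded producer fills the queue, and a pool of consumer threads drains it in batches (each batch standing in for one transaction). The batch size, pool size, and all names are illustrative assumptions:

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

BATCH_SIZE = 10   # documents committed per "transaction" (assumption)
N_CONSUMERS = 4   # consumer threads (assumption)

q = queue.Queue()
imported = []
lock = Lock()

def commit(batch):
    # Stands in for "create the documents inside one transaction".
    with lock:
        imported.extend(batch)

def consumer():
    batch = []
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            break
        batch.append(message)
        if len(batch) >= BATCH_SIZE:
            commit(batch)
            batch = []
    if batch:
        commit(batch)  # flush the last partial batch

# Part 1: mono-threaded producer.
for i in range(100):
    q.put({"name": "doc-%03d" % i})

# Part 2: multi-threaded, batched consumers.
with ThreadPoolExecutor(max_workers=N_CONSUMERS) as pool:
    for _ in range(N_CONSUMERS):
        pool.submit(consumer)

print(len(imported))  # → 100
```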

The queuing system can be provided by different backends; nuxeo-importer-queues currently supports 2 of them:

  • Chronicle Queue
    • in-JVM, but easy to set up
  • Apache Kafka
    • distributed MOM (message-oriented middleware)

Kafka may be a little more complex to deploy, but in exchange for the additional setup effort it provides some additional benefits:

  • you can scale the queue between several servers
    • this means that you can run import on different Nuxeo Server nodes
  • you are not limited by available memory
    • Kafka persists its queues on disk
  • you can write the client part in any language supported by Kafka
    • Java, JavaScript, Python, .Net, Ruby ...

Import message

Beyond the Java API, the real contract you need to implement on the producer side is the message format.

XXX
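The exact format is left open above. As a purely hypothetical illustration (this is NOT the actual nuxeo-importer-queues message format, just a sketch of what a self-describing JSON import message could look like):

```python
import json

# Hypothetical message shape -- every field name here is an assumption.
message = {
    "name": "invoice-2016-001",
    "type": "File",
    "parentPath": "/default-domain/workspaces/import",
    "properties": {
        "dc:title": "Invoice 2016-001",
        "dc:source": "legacy-dms",
    },
}

encoded = json.dumps(message).encode("utf-8")  # what would be en-queued
decoded = json.loads(encoded.decode("utf-8"))  # what the consumer reads
print(decoded["type"])  # → File
```

A text-based format like JSON is what makes producers in any language possible, since the consumer only depends on the message shape, not on the producer's runtime.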

Importing Binaries

XXX describe principles.

Using the importer

Setting up the importer infrastructure

  • add the marketplace package
  • set up Kafka
  • choose a client
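The checklist above could look like the following on a single machine. This is a sketch under assumptions: the package name, Kafka version, and topic name are all placeholders, not confirmed values.

```shell
# 1. Install the importer marketplace package (package name is an assumption):
#    nuxeoctl mp-install nuxeo-importer-queues

# 2. Download and start Kafka (a 0.10.x-era layout is assumed):
tar -xzf kafka_2.11-0.10.1.0.tgz && cd kafka_2.11-0.10.1.0
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
bin/kafka-server-start.sh -daemon config/server.properties

# 3. Create a topic for the import messages (topic name is an assumption):
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 4 --topic nuxeo-import
```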

Example importer client in JavaScript

XXX

Example importer client in Python

XXX
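This section is still a placeholder; as a hedged sketch, a Python producer could look like the following. It assumes the kafka-python library, a broker on localhost:9092, a topic named nuxeo-import, and a hypothetical message shape (none of these are confirmed by this document):

```python
import json

def build_message(name, parent_path="/default-domain/workspaces/import"):
    # Build one import message; the field names are assumptions.
    return json.dumps({
        "name": name,
        "type": "File",
        "parentPath": parent_path,
    }).encode("utf-8")

def run_producer(entries, topic="nuxeo-import", servers="localhost:9092"):
    # Requires the kafka-python package and a running Kafka broker.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=servers)
    for name in entries:
        producer.send(topic, build_message(name))
    producer.flush()

msg = build_message("doc-001")
print(json.loads(msg.decode("utf-8"))["name"])  # → doc-001
```

Against a live broker you would call e.g. `run_producer(["doc-001", "doc-002"])`; note that the producer never touches Nuxeo itself, only the queue.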

Configuring the server side importer

  • threads
  • document factory