@nandilugio
Last active January 6, 2023 10:00
Sync proposal
## Sync framework ############################################################

# To be added to models that can be part of a synced group of entities.
# AR implementation. Others could be used if the API is preserved.
module Syncable
  extend ActiveSupport::Concern

  included do
    attr_accessor :sync_group_id

    # Only touch the group when one of the declared synced attrs changed
    after_update do
      SyncGroup.find(sync_group_id).touch! if (changed & self.class.synced_attrs.map(&:to_s)).any?
    end
  end

  class_methods do
    def synced_with(syncer_class)
      @_syncer = syncer_class
    end

    # Setter when called with args, getter otherwise
    def synced_attrs(*attrs)
      @_synced_attrs = attrs if attrs.any?
      @_synced_attrs
    end
  end
end
# Registers the synced group of local instances mapping to a NS entity instance.
# AR implementation. Others could be used if the API is preserved.
class SyncGroup < ActiveRecord::Base
  # Columns: syncer (string), touched_at (datetime), synced_at (datetime)

  def touch!
    update!(touched_at: Time.now)
  end

  def mark_synced!
    update!(synced_at: Time.now)
  end
end
# Base for integrations
class Syncer
  # "From now on, sync these instances as a X on the remote"
  def sync(*attrs)
    thing = find(*attrs)
    group = group_dependencies(thing)
    ns_thing = translate(thing)
    create!(ns_thing, group)
  end

  # Sync all instances that have been updated since last sync
  def update_all_pending!
    group_ids = SyncGroup
      .where(syncer: self.class.name)
      .where("synced_at IS NULL OR touched_at > synced_at") # NULL covers groups whose initial create failed before mark_synced!
      .pluck(:id)
    group_ids.each do |group_id|
      update_group!(group_id)
    end
  end

  private

  def group_dependencies(thing)
    group = new_group
    sync_group_members(thing).each do |syncable_member|
      syncable_member.sync_group_id = group.id
    end
    group
  end

  def new_group
    SyncGroup.create!(syncer: self.class.name)
  end

  def update_group!(sync_group_id)
    group = SyncGroup.find(sync_group_id)
    thing = find_by_sync_group_id(sync_group_id)
    ns_thing = translate(thing)
    update!(ns_thing, group)
  end

  # The following operations (CRUD) can be refactored to be done asynchronously:
  # - Sidekiq: doesn't ensure order. Pass thing.id and translate to ns_thing in
  #   the worker to ensure the last version is synced? Probably not enough if we
  #   need to keep the order of events.
  # - Implement a queue: low throughput, postgres is enough. A sidekiq worker
  #   (unique? self-enqueuing?) can consume it to avoid a single point of failure.
  def create!(ns_thing, group)
    create_remote(ns_thing)
    group.mark_synced!
  end

  def update!(ns_thing, group)
    update_remote(ns_thing)
    group.mark_synced!
  end
end
## Local model #################################################################

class Thing
  has_one :thing_dep

  include Syncable
  synced_with NetSuiteSomethingSyncer # See below (*)
  synced_attrs :x, :y
end

class ThingDep
  belongs_to :thing

  include Syncable
  synced_with NetSuiteSomethingSyncer # See below (*)
  synced_attrs :x, :y
end

# (*) Thing and its deps map to a Something entity in NS
## Netsuite integration ########################################################

# The following encapsulate NetSuite specific logic:
# - Network communication
# - Finding info on the local model
# - Translation of that info to NetSuite's corresponding model

# Only needed if CRUD operations are done the same way for all NS objects
class NetSuiteSyncer < Syncer
  def create_remote(ns_thing)
    ns_thing.add
  end

  def update_remote(ns_thing)
    ns_thing.update
  end
end

# Thing and its deps map to a Something entity in NS
class NetSuiteSomethingSyncer < NetSuiteSyncer
  def find(thing)
    thing # Already an instance here; could eager-load deps via Thing.includes(:thing_dep)
  end

  def find_by_sync_group_id(id)
    Thing.find_by(sync_group_id: id)
  end

  def translate(thing)
    # Could use a NetSuiteSomethingTranslator
    NetSuiteGem::Entity::Something.new(x: thing.x, y: thing.y + 1)
  end

  def sync_group_members(thing)
    [thing, thing.thing_dep] # Syncables
  end
end
## Business logic (controllers, etc.) ##########################################

class ThingCreationService
  def self.create(*attrs)
    thing = Thing.new(*attrs)
    thing.thing_dep = ThingDep.new(*attrs)
    thing.save!
    mail_user(thing)
    NetSuiteSomethingSyncer.new.sync(thing)
    thing
  end
end

# NetSuiteSomethingSyncer#sync registers the instances for syncing and does the NS create.
# We could easily modify the proposal to defer the NS create.
thing = ThingCreationService.create(...)

# Marks the group as dirty
thing.thing_dep.update_attributes!(x: 345)

# This could be a recurrent worker, a self-enqueuing one, we could enqueue
# on-update, etc.
class NetSuiteSyncWorker
  def perform
    # Performs NS update
    NetSuiteSomethingSyncer.new.update_all_pending!
  end
end

laurafg commented Feb 14, 2017

Some things:

  • There's an after_update callback in Syncable module which triggers a real sync to Netsuite. So, if Thing updates and ThingDep updates too, 2 syncs happen to the same group? We were trying to avoid that, so that callback should just change SyncGroup#updated_at and later on sync_all! will sync all pending groups, no?

  • You will be syncing after_update no matter which attributes changed. We should minimize that, either having a list of attributes or checking a "digest" of the SyncGroup params (only if that digest is different from the last one sent do we need to sync again)

  • What happens if a "Thing" also belongs to another SyncGroup?

  • We need to do syncing in workers for sure (using unique digest https://github.com/camaloon/camaloon/pull/4307/files#diff-670e08306bf73f046b72cf5a0c0b7cdeR6) to make sure just one sync happens at a time

  • About a queue of commands or similar implementation to handle partial syncs instead of the final state of the record, I'd like to think through real use cases first. The only use case I can think of right now is: when an order is in the sent state, we need to sync the order but also create an ItemFulfillment in Netsuite. But we can rethink this and create an ItemFulfillment for each ActiveRecord ProductionOrder and that would be an easier sync

  • Be also aware that sometimes Syncer#sync will be a "create_remote/update_remote" but sometimes also an "initialize_remote+create_remote"


rogercampos commented Feb 14, 2017

Hey, let me declare some statements I think we all agree are true, just to set a fundamental base on top of which we can construct everything else. We must agree on this first!

1- The general problem can be described as the problem of managing derived state. We have some state managed via the camaloon application (ecommerce, backoffice, etc.), which ends up in a database (the main postgres db, even though it could also include additional databases in the future, or S3, or whatever). And we want to replicate part of this state into a third party service, probably using a transformation function for at least part of that state (an ETL).

2- What we want to accomplish is to have a more-or-less up-to-date replication of the original state into the third party service, always. This is feature independent; we want that always to happen, controlled via the ETL functions (mappings). With this I mean that this whole syncing thing is based purely on data, so we should implement a solution that works on data. This means setting triggers (or observers, etc.) directly on the data whose changes should trigger a new sync, instead of "triggering" the syncs from controllers or other feature-specific parts of the application.

3- We want to base our syncing around "models", given that the state we want to use as source data is already modeled that way and the third party service also treats data as models. Generally speaking, we can map "User" to "User" or "Product" to "SalesProduct", mapping a model in our application to a model in the third party service, even though the names of those models can vary. Is this true? Or do we already have an example where one of our models maps to multiple models in the third party service, or multiple models in our application map to the same model in the third party service? For now I assume this is true.

4- We want to have independence between the code that decides when a sync is needed and the code that actually performs a sync. The first problem is about knowing when a sync is required (observing specific fields in the database, e.g.). The second problem is about performing the syncing. This separation solves the problems of managing API rate limits in third party APIs, triggering multiple unnecessary syncs, and some more.

5- Given that we base the syncing around models (point 3 before), the act of syncing a model will always have the steps of 1) given an instance of a model in our application, 2) I create the representation of that model that I want to send to the third party (ETL), 3) I send that representation to the third party using an API. A possible optimization here would be to avoid syncing data that is already exactly the same data the third party already has; for that we could store an MD5 of the generated representation, compare it with the previously stored value of that MD5, and avoid the sync if they're equal (a sketch follows point 6).

6- Given that we base the syncing around models (point 3 before), the data we need to store internally in order to manage the syncing functionality can be stored directly in the same storage used for the original model. For example, we need to know when a sync is required; for that we could store a boolean ("sync required? true/false"), but we could also store the MD5 mentioned before. In the example of syncing User, we could store those 2 attributes in the "users" table, even though maybe the ETL function for a User grabs information from other tables.
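
A minimal sketch of how 5 and 6 could look, reusing the netsuite_representation and sync_user_to_netsuite helpers from the pseudo-code example at the end of this comment (the netsuite_sync_digest column is a made-up name):

require "digest"

def netsuite_digest(representation)
  Digest::MD5.hexdigest(representation.to_json)
end

def sync_user_if_changed!(user)
  representation = netsuite_representation(user)
  digest = netsuite_digest(representation)
  return if digest == user.netsuite_sync_digest # remote already has this exact data

  sync_user_to_netsuite(user.netsuite_id, representation)
  user.update_column(:netsuite_sync_digest, digest)
end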

Now, other notes about your proposal:

  • "A design goal was not to have this map living in our database in any way" I don't follow! can you elaborate xD. Maybe provide an example in pseudo-code like you did but using real examples (User instead of Thing).

  • I think we shouldn't use rails' updated_at column, that would generate unnecessary syncs! We only want to trigger a sync when the representation in the third party service changes; for that we need the ETL to operate or manage the derived state explicitly (e.g.: we know the source attribute data used by the mapping; when some of those attributes change, a sync is probably needed).

  • The "queue" solution you mention is replaced by the idea of separating the code in charge of knowing when a change is needed and the code that performs syncing, no?

A simple example for mapping User in pseudo code based on the ideas I think we should follow. The complexity in the following example is, of course, setting the correct observers for each case. We need something for that (it's obfuscated, but https://github.com/camaloon/camaloon/commit/1ec7e216a92cf4d5d69adc05861434fb39ff4422 maybe can do that with some changes).

# Entity User in our app maps to User in netsuite

# Code that knows how to sync a user to netsuite

def netsuite_representation(user)
  {
    name: user.name,
    email: user.email,
    orders_count: user.orders.accountable.count,
    segment: user.customer.segment
  }
end

def sync_user_to_netsuite(id, user_representation)
  # Use API to push the given representation to netsuite
  NetsuiteApi.new.update_user!(id, user_representation)
end

def update_user(user)
  data = netsuite_representation(user)
  netsuite_id = user.netsuite_id || raise("First you need to create the user in netsuite!")
  sync_user_to_netsuite(netsuite_id, data)
end


# Code that knows when a sync of a user is required

def schedule_for_sync(user)
  # Persist the flag; update_column skips callbacks, avoiding sync loops
  user.update_column(:sync_required_to_netsuite, true)
end

# AR callbacks
User.on_create + on_destroy -> create_user(user) + destroy_user(user) # Not implemented

# AR callbacks
User.on_update(:name, :email, :customer_id) -> schedule_for_sync(user)

Customer.on_update(:segment) -> schedule_for_sync(customer.user)

Order.on_update(:state, :user_id) + Order.on_destroy + Order.on_create -> schedule_for_sync(order.user)



# Code that knows when to perform a sync of users

loop do
  User.where(sync_required_to_netsuite: true).find_each do |user|
    update_user(user)
    user.update_column(:sync_required_to_netsuite, false)
  end
  sleep 5
end


nandilugio commented Feb 14, 2017

Aha! Ladies first, @Laura:

There's an after_update callback in Syncable module which triggers a real sync to Netsuite. So, if Thing updates and ThingDep updates too, 2 syncs happen to the same group? We were trying to avoid that, so that callback should just change SyncGroup#updated_at and later on sync_all! will sync all pending groups, no?

Totally true. That was the original idea. The code now reflects that.

You will be syncing after_update no matter which attributes changed. We should minimize that, either having a list of attributes or checking a "digest" of the SyncGroup params (only if that digest is different from the last one sent do we need to sync again)

Also true, forgot! Updated the code :)

What happens if a "Thing" also belongs to another SyncGroup?

Good point. If we happen to need this, a multi-group version of the Syncable module should be easy to do. 👍

We need to do syncing in workers for sure (using unique digest https://github.com/camaloon/camaloon/pull/4307/files#diff-670e08306bf73f046b72cf5a0c0b7cdeR6) to make sure just one sync happens at a time

Yup. That's discussed in the "About asynchronous implementation and queues" section of the 1st comment.
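
To make it concrete, here's a rough sketch of that worker assuming Sidekiq (the queue name and self-enqueue interval are placeholders; strict one-sync-at-a-time would come from a uniqueness/locking gem like in your linked PR):

class NetSuiteSyncWorker
  include Sidekiq::Worker
  sidekiq_options queue: :netsuite, retry: 5

  def perform
    # Idempotent: only groups with touched_at > synced_at get pushed
    NetSuiteSomethingSyncer.new.update_all_pending!
    # Self-enqueuing variant: keep polling for pending groups
    self.class.perform_in(30.seconds)
  end
end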

About a queue of commands or similar implementation to handle partial syncs instead of the final state of the record, I'd like to think through real use cases first. The only use case I can think of right now is: when an order is in the sent state, we need to sync the order but also create an ItemFulfillment in Netsuite. But we can rethink this and create an ItemFulfillment for each ActiveRecord ProductionOrder and that would be an easier sync

Yes, that's why I've left that out of the proposal for now. If the sync operation is only one-directional (local -> remote) and no state/event sequence is to be maintained, the proposed solution above should suffice. I've enclosed the update in a "worker" for clarity above.

Be also aware that sometimes Syncer#sync will be a "create_remote/update_remote" but sometimes also an "initialize_remote+create_remote"

Not sure what you mean here. Are you referring to the initialization of the netsuite entity instance so it retrieves the defaults from the remote end? If so, I guess that's a problem we could attack in the "translation" function.

It can perform that initialization before creating the instance. We could even try to cache that call for defaults or even manually (and perhaps dangerously) configure those same defaults locally, but assuming syncing is going to be done asynchronously, we can simply fail and retry on that request.

Anyway, that's a problem particular to NetSuite and should not affect the structure of the proposal here.


Now @Roger:

1- The general problem can be described as the problem of managing derived state. We have some state managed via the camaloon application (ecommerce, backoffice, etc.), which ends up in a database (the main postgres db, even though it could also include additional databases in the future, or S3, or whatever). And we want to replicate part of this state into a third party service, probably using a transformation function for at least part of that state (an ETL).

also

"A design goal was not to have this map living in our database in any way" I don't follow! can you elaborate xD. Maybe provide an example in pseudo-code like you did but using real examples (User instead of Thing).

Yes, this is an important point. That transformation function / ETL corresponds to the #translate method above. That's what I meant by "not to have this map living in our database in any way". We need the data to be what it is (our model), and there's no need to persist the result of the transformation since it can always be obtained. Unless of course we want to maintain the event sequence, in which case my proposal is to persist that transformation output in a command added to a queue.
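
For illustration only, a very rough sketch of that command queue, assuming a sync_commands table and that the NS gem entities can round-trip through a hash (all names hypothetical):

class SyncCommand < ActiveRecord::Base
  # Columns: syncer (string), operation (string), payload (jsonb), performed_at (datetime)
end

# Producer: persist the translated output per event instead of pushing it directly
def enqueue_command(syncer_class, operation, ns_thing)
  SyncCommand.create!(syncer: syncer_class.name, operation: operation, payload: ns_thing.to_h)
end

# Consumer: a single worker drains the queue FIFO, preserving the event order
def drain_commands!(syncer_class)
  SyncCommand.where(syncer: syncer_class.name, performed_at: nil).order(:id).find_each do |cmd|
    ns_thing = NetSuiteGem::Entity::Something.new(cmd.payload.symbolize_keys) # rehydrate
    syncer_class.new.public_send("#{cmd.operation}_remote", ns_thing) # create_remote / update_remote
    cmd.update!(performed_at: Time.now)
  end
end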

2- What we want to accomplish is to have a more-or-less up-to-date replication of the original state into the third party service, always. This is feature independent; we want that always to happen, controlled via the ETL functions (mappings). With this I mean that this whole syncing thing is based purely on data, so we should implement a solution that works on data. This means setting triggers (or observers, etc.) directly on the data whose changes should trigger a new sync, instead of "triggering" the syncs from controllers or other feature-specific parts of the application.

Agree, yes +1

3- We want to base our syncing around "models", given that the state we want to use as source data is already modeled that way and the third party service also treats data as models. Generally speaking, we can map "User" to "User" or "Product" to "SalesProduct", mapping a model in our application to a model in the third party service, even though the names of those models can vary. Is this true? Or do we already have an example where one of our models maps to multiple models in the third party service, or multiple models in our application map to the same model in the third party service? For now I assume this is true.

Well, @Laura please confirm, but if I'm not mistaken SalesOrder maps to an Order and things like its BillingInfo, ShippingInfo, etc. If any of those deps change, the SalesOrder should get the update. That's why we need a SyncGroup (or SyncMap? ;p). LineItems would probably fall in that same category.

Not sure if we already have an example of the inverse (one local to many remotes) but I guess that could be easily managed by the implementation of the CRUD methods for that particular syncer.

Also, we have entities that do not map in conceptual meaning. A Customer is not defined exactly the same in all scopes. Marketing wants a customer entity that groups all possible orders and contacts a particular client organization has. That's probably what we want in SalesForce and the DW (well, we want everything on the DW ;p). But NetSuite needs all "Fiscal Identities" (dejavú!) billed by us. @Laura is refactoring our model to have both matches, but I wonder if that couldn't also be abstracted as an ETL function.

4- We want to have independence between the code that decides when a sync is needed and the code that actually performs a sync. The first problem is about knowing when a sync is required (observing specific fields in the database, e.g.). The second problem is about performing the syncing. This separation solves the problems of managing API rate limits in third party APIs, triggering multiple unnecessary syncs, and some more.

True. Do you feel this proposal fulfills those needs?

5- Given that we base the syncing around models (point 3 before), the act of syncing a model will always have the steps of 1) given an instance of a model in our application, 2) I create the representation of that model that I want to send to the third party (ETL), 3) I send that representation to the third party using an API. A possible optimization here would be to avoid syncing data that is already exactly the same data the third party already has; for that we could store an MD5 of the generated representation, compare it with the previously stored value of that MD5, and avoid the sync if they're equal.

If we'd like to have an idempotent sync process and be able to resync from any state, calculating hashes for every instance can be very expensive, and that cost will increase linearly with the contents of the DB. Relying on the timestamps of the group and the command queue we get those benefits without that cost.

6- Given that we base the syncing around models (point 3 before), the data we need to store internally in order to manage the syncing functionality can be stored directly in the same storage used for the original model. For example, we need to know when a sync is required; for that we could store a boolean ("sync required? true/false"), but we could also store the MD5 mentioned before. In the example of syncing User, we could store those 2 attributes in the "users" table, even though maybe the ETL function for a User grabs information from other tables.

(see above replies)

  • "A design goal was not to have this map living in our database in any way" I don't follow! can you elaborate xD. Maybe provide an example in pseudo-code like you did but using real examples (User instead of Thing).

(above)

  • I think we shouldn't use rails' updated_at column, that would generate unnecessary syncs! We only want to trigger a sync when the representation in the third party service changes; for that we need the ETL to operate or manage the derived state explicitly (e.g.: we know the source attribute data used by the mapping; when some of those attributes change, a sync is probably needed).

True true!! renamed the var to touched_at. Some implementation around that was missing. Thx @Laura!

  • The "queue" solution you mention is replaced by the idea of separating the code in charge of knowing when a change is needed and the code that performs syncing, no?

"Replaced", or "implemented by"? ;p

A simple example for mapping User in pseudo code based on the ideas I think we should follow. The complexity in the following example is, of course, setting the correct observers for each case. We need something for that (it's obfuscated, but https://github.com/camaloon/camaloon/commit/1ec7e216a92cf4d5d69adc05861434fb39ff4422 maybe can do that with some changes).

Yes. I think we're on the same page. The details @Laura pointed out were probably misleading enough for you to feel we were not, but please re-read the updated code above with this in mind and comment again if you still feel this proposal is not aligned with your ideas.

NOTE: Also trimmed some corners on the 1st comment. Search for "EDIT: " there :)


rogercampos commented Feb 15, 2017

I was definitely missing background to fully grasp your proposal xD. Glad we concur on all this, my example in pseudo-code tried to clarify all the concepts. Some things:

  • Regarding my point 3) before, even if we have entities in netsuite like SalesOrder that map to multiple models in our app (because it includes BillingInfo etc.), can't we see this as composition? Meaning, in this example, we have one "primary" entity in our app that maps to SalesOrder, which is Order. And in the mapping function of that entity there are multiple fields that we gather from other models. Same as in my pseudo-code example of User, here we would have:
def netsuite_representation(order)
  {
    ref: order.ref,
    billing_country: order.billing_info.country,
    # etc...
  }
end

Does this make the sync group complexity unnecessary? Or are there more use cases?

Also, we have entities that do not map in conceptual meaning

Maybe then the solution is to change our app as well, introducing FiscalIdentity again! Then both models (ours and netsuite's) will match more closely and these extra complexities in syncing will disappear.

  • About the syncing strategy

True. Do you feel this proposal fulfills those needs?

Yes, no problem with that. I see you use 2 timestamps instead of one boolean to control the syncing status; implementation detail, it's the same. But use a meaningful name like "last_netsuite_synced_at" instead of "synced_at" xD.

  • About 5). Yes, it's true we don't need MD5 hashes if we use specific observers only on the adequate fields. However, purely speaking, you're assuming that a change in the input of the ETL function will result in a different output, which may not always be true. But in practice I would definitely not implement any hashing strategy to avoid extra syncing if we use specific observers.

  • I will elaborate a bit on my old work of "dr_manhattan" to see if we can use it here.


nandilugio commented Feb 15, 2017

Regarding my point 3) before, even if we have entities in netsuite like SalesOrder that map to multiple models in our app (because it includes BillingInfo etc.), can't we see this as composition? ...

Does this make the sync group complexity unnecessary? Or are there more use cases?

Yes, the transformation or #translate function will do exactly that, but in order to get the touched_at timestamp updated when that BillingInfo dependency changes, we need to keep track of those dependencies.

In fact there is something I don't 100% like there, but I see no magic we can do to solve it cleanly: the developer has the responsibility to be explicit about the dependencies that can change the output of the transformation function, so there are 2 things that have to be manually kept in sync while developing: the dependencies declared in #sync_group_members and the ones used in #translate.

I've also just noticed that the inclusion of Syncable already offers the info #sync_group_members returns, so specifying that by hand is probably redundant. Maybe some magic there is not that difficult to implement (a possible shape is sketched below), but I need to think about this a bit. Anyway, these are implementation details only. We'll solve these if we agree on developing this.
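
Something shaped like this, maybe (totally hypothetical API, just to illustrate the single declaration):

class NetSuiteSomethingSyncer < NetSuiteSyncer
  # One declaration drives both the group membership and the observers
  syncs Thing, with_deps: [:thing_dep]

  def translate(thing)
    NetSuiteGem::Entity::Something.new(x: thing.x, y: thing.y + 1)
  end
end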

Maybe then the solution is to change our app as well, introducing FiscalIdentity again! Then both models (ours and netsuite's) will match more closely and these extra complexities in syncing will disappear.

Yes, in fact @Laura is bringing it back from the dead 👻 It's named FiscalCustomer now :p. See #4403.

But use a meaningful name like "last_netsuite_synced_at" instead of "synced_at" xD.

Nah man, this solution transcends NetSuite ;p heheh. In fact, it could be used for SalesForce or whatever.

I will elaborate a bit on my old work of "dr_manhattan" to see if we can use it here.

Looking forward to seeing it!
