@monishdeb
Last active November 21, 2019 03:35
Suppose we want to migrate contacts from a remote application, along with associated information like notes, addresses, contributions, and activities. There are two techniques for migrating the data:

  1. Job processing - complete a single process in each iteration. Here a complete process means migrating a single contact with all its associated data in one pass. A single process looks like (A -> B -> C ... -> M)n, where there are M subtasks and n contacts to migrate.

  2. Pipeline processing - complete each subtask across all records before moving to the next. That is, to migrate all contacts and their associated data, first migrate all the contacts, then, using an external reference (like civicrm_contact.external_identifier, which stores the remote application's contact ID), migrate all the associated addresses, then all the contributions, and so on. In other words it is (A)n -> (B)n -> (C)n ... -> (M)n, where there are M subtasks and n entities of each type to migrate.
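The two techniques can be sketched as loops. This is only a hedged sketch: migrateContact(), migrateAddress() and the other helpers are hypothetical names standing in for the actual subtasks.

```php
<?php
// Technique 1: job processing - one contact with all its data per iteration.
foreach ($remoteContacts as $remote) {
  $contactId = migrateContact($remote);      // subtask A
  migrateAddresses($remote, $contactId);     // subtask B
  migrateContributions($remote, $contactId); // subtask C ... M, then next contact
}

// Technique 2: pipeline processing - one subtask across all records, then the next.
foreach ($remoteContacts as $remote) {
  migrateContact($remote);                   // (A)n
}
foreach ($remoteAddresses as $address) {
  migrateAddress($address);                  // (B)n, matched via external reference
}
foreach ($remoteContributions as $contribution) {
  migrateContribution($contribution);        // (C)n ... (M)n
}
```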

I prefer the 2nd technique because it makes it easier to track down intermittent errors or missing data during migration. Suppose you go with this technique: you first migrate all contacts without any issue, and next you migrate all their associated contributions, but during that step you hit a glitch in the code or the process halts due to a timeout. In that case it is easy to track down the error, fix it, and rerun the migration script for contributions only. With the first technique you would need to run the whole loop again, and it gets worse when failures occur at different levels, say the address import fails for contact A and the contribution import fails for contact B.

Now that we have seen why #2 is better than #1, if we go with the 2nd approach it is very important to follow some strict rules to ensure a smooth migration, which are as follows:

  1. Always store the external reference of the entity, in either a core field or a custom field. For example, for a contact use civicrm_contact.external_identifier, and for a contribution use a custom field. This is essential for moving from subtask A to B: after migrating all the contacts, we can migrate their addresses and other associated data on the basis of the external identifier. For example, we built a similar script for a RaiserEdge migration, with a separate job called createContact() that migrates all the contacts and, during the process, stores the external ID (the CONSTITUENT_ID) in a custom field. On the basis of this external identifier we can import the other associated data; for instance, to migrate all the address records we fetch the respective contact ID from the stored CONSTITUENT_ID. Whether you also store the external reference of the address is up to you, depending on whether there is any future use for it.
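Rule 1 can be sketched with the CiviCRM APIv3 like this. This is a hedged sketch: the $row field names follow the RaiserEdge example above but are assumptions, and in a real script the contact job and the address job would run as separate scheduled jobs.

```php
<?php
// Subtask A: create the contact and record the remote CONSTITUENT_ID
// in the core external_identifier field.
civicrm_api3('Contact', 'create', [
  'contact_type'        => 'Individual',
  'first_name'          => $row['FIRST_NAME'],
  'last_name'           => $row['LAST_NAME'],
  'external_identifier' => $row['CONSTITUENT_ID'], // external reference
]);

// Subtask B (a later job): resolve the external reference back to a
// contact ID, then migrate the associated address.
$contactId = civicrm_api3('Contact', 'getvalue', [
  'external_identifier' => $row['CONSTITUENT_ID'],
  'return'              => 'id',
]);
civicrm_api3('Address', 'create', [
  'contact_id'       => $contactId,
  'location_type_id' => 'Home',
  'street_address'   => $row['STREET'],
]);
```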

  2. You MUST have an error handler for each scheduled job, otherwise it will be tough to track and re-import the missing data. In the earlier example, the createContact() function has an error handler which stores the error messages along with the external identifier in a separate table. The basic shape of this table would be civicrm_raiseredge_error_data(id, entity_table, entity_id, error_message).
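A minimal schema for that error table might look like the following; the column types are assumptions, only the column names come from the text above.

```sql
CREATE TABLE civicrm_raiseredge_error_data (
  id            INT UNSIGNED NOT NULL AUTO_INCREMENT,
  entity_table  VARCHAR(64)  NOT NULL,  -- e.g. 'civicrm_contact'
  entity_id     VARCHAR(64)  NOT NULL,  -- external reference, e.g. CONSTITUENT_ID
  error_message TEXT,
  PRIMARY KEY (id)
) ENGINE=InnoDB;
```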

  3. You need either a separate job to migrate the missing data, or you can use the same job with some kind of flag to indicate that this run should only look at the missing records whose external references are stored in the error table mentioned above.
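One way to sketch rule 3 with a single job and a retry flag. This is a hedged sketch: the retry_only parameter and the query shape are assumptions about how such a job could be wired, not the actual RaiserEdge implementation.

```php
<?php
function migrateContributions(array $params) {
  if (!empty($params['retry_only'])) {
    // Re-import only the records that previously failed.
    $dao = CRM_Core_DAO::executeQuery(
      "SELECT entity_id FROM civicrm_raiseredge_error_data
        WHERE entity_table = 'civicrm_contribution'"
    );
    $failedIds = [];
    while ($dao->fetch()) {
      $failedIds[] = $dao->entity_id;
    }
    // ... fetch only these records from the remote application
  }
  else {
    // ... fetch the full set of remote contribution records
  }
  // ... import loop with its own error handler, as in rule 2
}
```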

  4. Provide a batch-size parameter in the migration job. For example, if you want to migrate 1000 contacts in each iteration, your job should have a parameter that accepts this value from the developer.
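A hedged sketch of rule 4: a job that accepts a batch_size parameter and pages through the remote contacts in chunks. fetchRemoteContacts() is a hypothetical helper for the remote application's API, not a real function.

```php
<?php
function migrateContactsJob(array $params) {
  $batchSize = (int) ($params['batch_size'] ?? 1000);
  $offset = 0;
  while (TRUE) {
    // Hypothetical helper: pull one page of records from the remote app.
    $rows = fetchRemoteContacts($offset, $batchSize);
    if (empty($rows)) {
      break; // no more records to migrate
    }
    foreach ($rows as $row) {
      // ... Contact.create with external_identifier, plus error handling
    }
    $offset += $batchSize;
  }
}
```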

Lastly, it is up to you whether to merge a few subtasks into one, which you can do with the help of API chaining. Suppose you want to migrate all the contacts and their associated addresses in one job, i.e. go from (A)n -> (B)n -> ... to (A -> B)n -> ...; you can do so via API chaining as:

civicrm_api3('Contact', 'create', [
  // ... other contact parameters
  'api.Address.create' => [
    'contact_id' => '$value.id',
    // ... other address parameters
  ],
]);