ollie314/identityResolver.fs

## identityResolver.fs
module IdentificationProcessing

open System
open Domain
open ContractDomain

type DataSubjectIdentityResolver =
            string          // the name of the type
                -> obj      // the object instance to look into
                -> string   // the owner identity

// This is the tricky part of the identification process: we may know the concrete type of the
// object to look into, but we may also have no information about it. In that case, we use the internal
// scoring processor, which is able to score how well two data subjects match.
// The default case of the resolver should use the DataSubjectIdentifier, but the implementation must be adapted
// so that the definitions fit actual needs.
let resolveOwner : DataSubjectIdentityResolver =
    fun s i ->
        match s with
        | "ContractEvent" ->
            // TODO: use a generic approach since we don't necessarily have the dependencies in scope during processing
            let c : ContractEvent = downcast i
            c.PartnerId
        // to be continued...
        | _ -> "-"

(*
    The idea is to reduce complexity by first filtering the eligible data subjects.
    First, we look for data subjects in the index by running a query built from
    all the fields present in the source vector. Then, the result set is used
    to perform a proximity evaluation for each of its items.

    Let's imagine the following object:
    const o = {email: "john@mydom.com", firstname: "john", lastname: "doe", phone: "+41795477978"}

    The system will transform it into the following Lucene query:
    q = "email:john@mydom.com OR firstname:john OR lastname:doe OR phone:+41795477978"

    It then runs the search and evaluates each result using both a distance on each provided
    field and a score that depends on whether the field is present in the result.

    A result may contain only some of the provided fields. For instance, it may contain only the email,
    the phone and the firstname.
    A table gives the weight of each field in the comparison.
    Note: in our case the weights are relative, so it does not matter whether their sum
    can be read as a probability.

    For instance, we may have this table:
    table = {
        email: 0.6
        firstname: 0.05
        lastname: 0.05
        phone: 0.3
    }
    In that case, a matching email provides a confidence of 60% that both items match.

    Suppose the result only contains the email field and the phone field.
    We set the probability gate to 0.8, which means that a full match (email and phone, 0.6 + 0.3 = 0.9)
    will pass and every other comparison will fail, which is a bit aggressive in terms of filtering.
    To soften that, we add the distance comparison, which weights the result: the distance is
    transformed into a match rate, which is multiplied by the field's weight to obtain the final result.

    Let's express that mathematically.

    Let
        - fw: the field's weight,
        - d: the match rate derived from the distance between the field in the result and the field to check for,
        - p: the proximity between two vectors,
        - n: the number of fields in the vector.
    We have

          n
         ----
     p =  \     (fwi * di)
          /
         ----
        i = 1

    where
        - fwi is the field weight at position i in the vector
        - and di is the match rate of the item at position i in the vector

    As stated earlier, the threshold is 0.8 (which should actually be set by the business). So if the
    matching probability is over this threshold, we have a potential match.

    Rules for result processing should be defined by the business. For instance, we can use the following definition:

        - If no match is found, the identity is created (and indexed; we have to take care of the
            potential latency of the index versus the velocity of the event stream).
        - If exactly one match is found, the event is associated with that identity.
        - If more than one match is found, an alert is raised so that a manager can select the identity,
            and the system associates the event with all potential identities (the association carries the flag 'potential').
            When the manager selects the correct identity, the events are reviewed: associations with the other
            identities are dropped and the 'potential' flag is removed.

    To address the velocity of the stream versus the latency of the indexing
    process, we can keep TTL-based local records (with a TTL at least equal to the average maximum duration
    of the indexing process). All local records are added to the search result. Since they all expire,
    only a relevant and acceptable number of records is kept in the processing window.
*)
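
// A minimal sketch of the proximity formula documented above, kept in its own module so it
// stays independent of the production types. The names ProximitySketch, weights, proximity,
// fullMatch and nearMatch are hypothetical; the match rates are hand-picked inputs, not measured values.
module ProximitySketch =

    // relative weights taken from the example table in the documentation
    let weights = [ "email", 0.6; "firstname", 0.05; "lastname", 0.05; "phone", 0.3 ]

    // p = sum over the fields of (weight * match rate); a field absent from the
    // candidate contributes nothing
    let proximity (rates: Map<string, float>) : float =
        weights
        |> List.sumBy (fun (field, w) ->
            match Map.tryFind field rates with
            | Some r -> w * r
            | None -> 0.0)

    // exact email and phone, no name fields: 0.6 + 0.3 = 0.9, above the 0.8 threshold
    let fullMatch = proximity (Map.ofList [ "email", 1.0; "phone", 1.0 ])

    // exact email, phone off by one character out of twelve (rate 11/12):
    // 0.6 + 0.3 * 11/12 = 0.875, still above the threshold
    let nearMatch = proximity (Map.ofList [ "email", 1.0; "phone", 11.0 / 12.0 ])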

// Defines a partial representation of a data subject.
// See the F# implementation of the event hub stack for information
// about the real representation of a data subject (which is more abstract).
type DataSubjectInfo = {
    id: string                      // unique id of the data subject in our system
    Email: string                   // email of the data subject
    Firstname: string               // firstname of the data subject
    Lastname: string                // lastname of the data subject
    Phone: string                   // phone of the data subject
    // to be completed...
}

// defines a resolver that maps the name of a field to its weight
and FieldWeightResolver = string -> float

// this defines a data source to look into in order to load data subject
and DataSubjectSearchProvider =
    string                          // the query to launch for searching
        -> DataSubjectInfo list     // the list of data subject found

// Defines a service able to calculate the distance between two strings
and DistanceCalculator = (string * string) -> int

// Returns the match rate associated with the distance between two strings
and DistanceScorer = DistanceCalculator -> (string * string) -> float

// this type defines a matcher, which scores how closely two data subjects match
and DataSubjectIdentityMatcher =
    DistanceScorer                      // dependency: reference to the scorer to use
        -> DistanceCalculator           // dependency: reference to the service able to calculate the distance
        -> FieldWeightResolver          // dependency: reference to the service able to weight a field
        -> DataSubjectInfo              // data subject to check for
        -> DataSubjectInfo              // data subject to check against
        -> float                        // identity matching rate

// defines the filtering process to apply to the search results
and DataSubjectIdentityFilter =
    DataSubjectIdentityMatcher          // dependency: reference to the service able to match two identities
        -> DistanceScorer               // dependency: reference to the scorer to use
        -> DistanceCalculator           // dependency: reference to the service able to calculate the distance
        -> FieldWeightResolver          // dependency: reference to the service able to weight a field
        -> float                        // configuration: the threshold over which a match is accepted
        -> DataSubjectInfo              // the data subject to check the identity for
        -> DataSubjectInfo list         // the list of data subjects to check against
        -> DataSubjectInfo list         // the list of matching identities

// Defines the data subject identifier service
and DataSubjectIdentifier =
    DataSubjectSearchProvider           // dependency: the service able to look for data subjects
        -> DataSubjectIdentityFilter    // dependency: the filter to apply to the search result
        -> DataSubjectIdentityMatcher   // dependency: the service able to match identities
        -> DistanceScorer               // dependency: reference to the scorer to use
        -> DistanceCalculator           // dependency: reference to the service able to calculate the distance
        -> FieldWeightResolver          // dependency: reference to the service able to weight a field
        -> float                        // configuration: the threshold over which a match is accepted
        -> DataSubjectInfo              // the data subject to fetch identity for
        -> DataSubjectInfo list         // the list of matching identities

(* ==================== IMPLEMENTATION ======================= *)
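
// A minimal sketch of the TTL-based local records mentioned in the documentation: identities
// created locally are kept for a limited time and merged into the search result until the index
// has caught up. TtlSketch, Entry, alive and withLocal are hypothetical names; the clock is passed
// in explicitly so the functions stay pure.
module TtlSketch =
    open System

    // a locally created record together with its creation time
    type Entry<'a> = { Value: 'a; CreatedAt: DateTime }

    // keeps only the entries younger than the ttl at the instant 'now'
    let alive (ttl: TimeSpan) (now: DateTime) (entries: Entry<'a> list) : Entry<'a> list =
        entries |> List.filter (fun e -> now - e.CreatedAt < ttl)

    // prepends the still-alive local records to the index search result
    let withLocal (ttl: TimeSpan) (now: DateTime) (entries: Entry<'a> list) (searchResult: 'a list) : 'a list =
        (alive ttl now entries |> List.map (fun e -> e.Value)) @ searchResult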

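// Dummy implementation based on the rule presented in the documentation.
// Names are compared case-insensitively because the field names obtained by reflection
// on DataSubjectInfo are capitalized ("Email", "Phone", ...).
let resolveFieldWeight: FieldWeightResolver =
    fun s ->
        match s.ToLowerInvariant() with
            | "email" -> 0.6
            | "firstname" -> 0.05
            | "lastname" -> 0.05
            | "phone" -> 0.3
            | _ -> 0.0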

// simple recursive implementation of the Levenshtein distance
// (worst-case exponential in the input length, so only suitable for short fields)
let levenshteinDistance: DistanceCalculator =
    fun (s1,s2) ->
        let s1' = s1.ToCharArray()
        let s2' = s2.ToCharArray()

        let rec dist l1 l2 = match (l1,l2) with
            | (l1, 0) -> l1
            | (0, l2) -> l2
            | (l1, l2) ->
                if s1'.[l1-1] = s2'.[l2-1] then dist (l1-1)(l2-1)
                else
                    let d1 = dist (l1-1) l2
                    let d2 = dist l1 (l2-1)
                    let d3 = dist (l1-1)(l2-1)
                    1 + Math.Min(d1, Math.Min(d2,d3))
        dist s1.Length s2.Length
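
// The recursive version above recomputes the same sub-problems many times. A minimal sketch of
// the usual two-row dynamic programming variant follows; the names LevenshteinSketch and distance
// are hypothetical, and the function has the same shape as DistanceCalculator so it could be
// passed wherever a calculator is expected.
module LevenshteinSketch =

    let distance (s1: string, s2: string) : int =
        // prev.[j] holds the distance between the first (i-1) chars of s1 and the first j chars of s2
        let mutable prev : int[] = Array.init (s2.Length + 1) id
        for i in 1 .. s1.Length do
            let curr : int[] = Array.zeroCreate (s2.Length + 1)
            curr.[0] <- i
            for j in 1 .. s2.Length do
                let cost = if s1.[i-1] = s2.[j-1] then 0 else 1
                // deletion, insertion, substitution (or match when cost = 0)
                curr.[j] <- min (min (prev.[j] + 1) (curr.[j-1] + 1)) (prev.[j-1] + cost)
            prev <- curr
        prev.[s2.Length]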

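// naive implementation of a scorer: 1.0 for identical strings, decreasing to 0.0 as the distance grows.
// The distance is normalized by the longer string so that, for the Levenshtein distance, the rate stays
// in [0, 1]; when both strings are empty the field contributes 0.0 instead of dividing by zero.
let naiveDistanceScorer: DistanceScorer =
    fun distance (a,b) ->
        let longest = Math.Max(a.Length, b.Length)
        if longest = 0 then 0.0
        else 1.0 - (double (distance (a,b))) / (double longest)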

// resolve a property to a (name, value) pair; a null value is mapped to the empty string
let resolve (i: Reflection.PropertyInfo) (ds: DataSubjectInfo): (string * string) =
    match i.GetValue(ds) with
    | null -> (i.Name, "")
    | v -> (i.Name, v.ToString())

// Transform a dataSubjectInfo to a list of properties (reflection)
let toPropertyList (ds: DataSubjectInfo) = (Array.toList(ds.GetType().GetProperties()))

// transform a data subject info into a list of string * string
let tuplize (ds: DataSubjectInfo) =
    let rec f (l: Reflection.PropertyInfo list) (acc: (string*string) list)  =
        match l with
            | [] -> acc
            | head :: tail -> f tail ((resolve head ds) :: acc)

    f (ds |> toPropertyList) []

// Transform the fields of a data subject into a Lucene query
let makeLuceneQuery (ds: DataSubjectInfo) : string =
    ds
    |> toPropertyList
    |> List.map (fun p -> resolve p ds)
    // drop empty fields: a clause such as "Email:" carries no search term
    |> List.filter (fun (_, v) -> not (String.IsNullOrWhiteSpace v))
    // ["Email:..."; "Phone:..."; ...]
    |> List.map (fun (k, v) -> sprintf "%s:%s" k v)
    // Email:... OR Phone:... [...]
    |> String.concat " OR "
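
// Values such as "+41795477978" contain characters that the classic Lucene query parser treats
// as operators ('+' here). A minimal sketch of backslash-escaping a value before it is interpolated
// into the query; LuceneEscapeSketch, escapeTerm and clause are hypothetical names, and the character
// list assumes the classic query parser syntax.
module LuceneEscapeSketch =

    // single characters with a special meaning in the classic query syntax
    // ("&&" and "||" are covered by escaping each '&' and '|')
    let private special =
        Set.ofList [ '+'; '-'; '&'; '|'; '!'; '('; ')'; '{'; '}'; '['; ']'; '^'; '"'; '~'; '*'; '?'; ':'; '\\'; '/' ]

    // prefixes every special character with a backslash
    let escapeTerm (value: string) : string =
        value
        |> String.collect (fun c -> if special.Contains c then sprintf "\\%c" c else string c)

    // builds a single "field:value" clause with an escaped value
    let clause (field: string) (value: string) : string =
        sprintf "%s:%s" field (escapeTerm value)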

let dataSubjectIdentityMatcher: DataSubjectIdentityMatcher =
    fun score dist resolveWeight refDs ds ->

        let s = score dist

        // both lists come from the same record type, so the fields line up position by position
        List.zip (tuplize refDs) (tuplize ds)
        // proximity = sum of (field weight * match rate)
        |> List.sumBy (fun ((k, v), (_, v')) -> resolveWeight k * s (v, v'))

let dataSubjectIdentityFilter: DataSubjectIdentityFilter =
    fun matches score dist resolveWeight t refDs ds ->
        let s = matches score dist resolveWeight
        // keep only the candidates whose matching rate is strictly over the threshold
        ds |> List.filter (fun d -> (s refDs d) > t)

let dataSubjectIdentifier: DataSubjectIdentifier =
    fun search filter matches score dist resolveWeight threshold ds ->
        // use the injected weight resolver rather than the module-level default
        let f = filter matches score dist resolveWeight threshold
        let l = ds |> makeLuceneQuery |> search
        f ds l
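
// The result-processing rules described in the documentation (create on no match, associate on
// a single match, raise an alert on several) can be sketched as a discriminated union. The names
// ResolutionSketch, Resolution and decide are hypothetical, and the final rules remain a
// business decision.
module ResolutionSketch =

    type Resolution<'a> =
        | CreateIdentity                 // no match: a new identity must be created and indexed
        | Associate of 'a                // exactly one match: attach the event to it
        | AlertManager of 'a list        // several matches: flag them as 'potential' and alert a manager

    // maps the list returned by the identifier to the action to take
    let decide (matches: 'a list) : Resolution<'a> =
        match matches with
        | [] -> CreateIdentity
        | [ single ] -> Associate single
        | many -> AlertManager many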