@ollie314
Created September 6, 2020 22:08
Simple sample to resolve identity
module IdentificationProcessing
open System
open Domain
open ContractDomain
type DataSubjectIdentityResolver =
    string          // the name of the type
        -> obj      // the object instance to look into
        -> string   // the owner identity
// Here is the tricky part of the identification process: we may have information about the instance of the
// object to look into, but we may also have no information about it. In that case, we use the internal
// scoring processor, which is able to score how well the data matches.
// The default case of the resolver should use the DataSubjectIdentifier, but the implementation must be adapted
// so that the definitions fit actual needs.
let resolveOwner : DataSubjectIdentityResolver =
    fun s i ->
        match s with
        | "ContractEvent" ->
            // TODO: use a generic approach since we don't necessarily have dependencies in the scope of the processing
            let c : ContractEvent = downcast i
            c.PartnerId
        // to be continued...
        | _ -> "-"
(*
The idea is to reduce complexity by first filtering the eligible data subjects.
First, we look for data subjects in the index by running a query built from
all the fields present in the source vector. Then we use the result to perform
a proximity evaluation for each item in the result set.
Let's imagine the following object:
    const o = {email: "john@mydom.com", firstname: "john", lastname: "doe", phone: "+41795477978"}
The system will transform it into the following Lucene query:
    q = "email:john@mydom.com OR firstname:john OR lastname:doe OR phone:+41795477978"
Then it will run the search and, based on the result, perform an evaluation using both
a distance on each provided field and a score according to the presence of the field in the result.
A result may contain only some of the provided fields. For instance, it may contain only the email,
the phone and the firstname.
A table gives the weight of each field in the comparison.
Note: in our case, the weights are relative, so it does not matter whether their sum represents
a probability.
For instance, we may have this table:
    table = {
        email: 0.6
        firstname: 0.05
        lastname: 0.05
        phone: 0.3
    }
In this case, a matching email provides a confidence of 60% that both items match.
In our example, only the email field and the phone field are present.
We set the probability gate to 0.8, which means that a full match (email and phone) will pass and
every other comparison will fail, which is a bit aggressive in terms of filtering. To address that,
we add the distance comparison, which weights the result: the distance is transformed into a match
rate, which is multiplied by the field's weight to obtain the final result.
Let's express that mathematically. Let
    - fw: the field's weight,
    - d: the match rate derived from the distance between the field in the result and the field to check for,
    - p: the proximity between two vectors.
We have
    p = \sum_{i=0}^{n} fw_i \cdot d_i
where
    - fw_i is the field weight at position i in the vector,
    - d_i is the match rate of the item at position i in the vector.
As stated earlier, the threshold is 0.8 (which should actually be set by the business). So if the
matching probability is over this threshold, we have a potential match.
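As a worked example with the weight table above, assume (hypothetical figures) a candidate whose email
and firstname match exactly (rate 1.0), whose phone has a match rate of 0.9 and whose lastname is absent
(rate 0.0):
    p = 0.6 * 1.0 + 0.05 * 1.0 + 0.05 * 0.0 + 0.3 * 0.9
      = 0.6 + 0.05 + 0.0 + 0.27
      = 0.92
Since 0.92 > 0.8, the candidate is kept as a potential match. For the same candidate with a
non-matching email (rate 0.0), p drops to 0.32 and the candidate is rejected.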
Rules for result processing should be defined by the business. For instance, we can use the following definition:
    - If no match is found, the identity is created (and indexed - we have to take care of the
      potential latency of the index versus the velocity of the event stream).
    - If only one match is found, we associate the event with the identity.
    - If more than one match is found, we generate an alert so that a manager can select the identity,
      and the system associates the event with all potential identities (the association carries the flag 'potential').
      When the manager selects the correct identity, the events are reviewed: associations with the other
      identities are dropped and the 'potential' flag is removed.
To address the concern about the velocity of the stream versus the latency of the indexing
process, we can keep TTL-based local records (with a TTL at least equal to the average maximum duration
of the indexing process). All local records are added to the search result. Since every record has a TTL,
we only keep a relevant and acceptable number of records in the processing window.
*)
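// The result-processing rules described above can be sketched as a small decision
// function. The names IdentityDecision and decideIdentity are hypothetical, and the
// rules themselves remain a business decision.
type IdentityDecision<'a> =
    | CreateIdentity            // no match: create (and index) a new identity
    | AssociateWith of 'a       // exactly one match: associate the event with it
    | AlertManager of 'a list   // several matches: flag them as potential and alert a manager

let decideIdentity (matches: 'a list) : IdentityDecision<'a> =
    match matches with
    | [] -> CreateIdentity
    | [ single ] -> AssociateWith single
    | several -> AlertManager several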
// Defines a partial representation of a data subject.
// See the F# implementation of the event hub stack for information
// about the real representation of a data subject (which is more abstract).
type DataSubjectInfo = {
    id: string          // unique id of the data subject in our system
    Email: string       // email of the data subject
    Firstname: string   // firstname of the data subject
    Lastname: string    // lastname of the data subject
    Phone: string       // phone of the data subject
    // to be completed...
}
// defines a resolver that maps the name of a field to its weight
and FieldWeightResolver = string -> float
// defines a data source to query in order to load data subjects
and DataSubjectSearchProvider =
    string                          // the query to run
        -> DataSubjectInfo list     // the list of data subjects found
// defines a service able to calculate the distance between two strings
and DistanceCalculator = (string * string) -> int
// returns the match rate associated with the distance between two strings
and DistanceScorer = DistanceCalculator -> (string * string) -> float
// defines the service that scores how well two data subjects match
and DataSubjectIdentityMatcher =
    DistanceScorer                  // dependency: reference to the scorer to use
        -> DistanceCalculator       // dependency: reference to the service able to calculate the distance
        -> FieldWeightResolver      // dependency: reference to the service able to weight a field
        -> DataSubjectInfo          // data subject to check for
        -> DataSubjectInfo          // data subject to compare against
        -> float                    // identity matching rate
// defines the filtering process to apply to the search results
and DataSubjectIdentityFilter =
    DataSubjectIdentityMatcher      // dependency: reference to the service able to match two identities
        -> DistanceScorer           // dependency: reference to the scorer to use
        -> DistanceCalculator       // dependency: reference to the service able to calculate the distance
        -> FieldWeightResolver      // dependency: reference to the service able to weight a field
        -> float                    // configuration: the threshold over which a match is accepted
        -> DataSubjectInfo          // the data subject to check the identity for
        -> DataSubjectInfo list     // the list of data subjects to compare against
        -> DataSubjectInfo list     // the list of matching identities
// defines the data subject identifier service
and DataSubjectIdentifier =
    DataSubjectSearchProvider       // dependency: the service able to look for data subjects
        -> DataSubjectIdentityFilter    // dependency: the filter to apply to the search results
        -> DataSubjectIdentityMatcher   // dependency: the service able to match two identities
        -> DistanceScorer           // dependency: reference to the scorer to use
        -> DistanceCalculator       // dependency: reference to the service able to calculate the distance
        -> FieldWeightResolver      // dependency: reference to the service able to weight a field
        -> float                    // configuration: the threshold over which a match is accepted
        -> DataSubjectInfo          // the data subject to fetch the identity for
        -> DataSubjectInfo list     // the list of matching identities
(* ==================== IMPLEMENTATION ======================= *)
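// Dummy implementation based on the rule presented in the documentation.
// The name is lower-cased first because the record properties read through
// reflection are capitalized ("Email", "Firstname", ...).
let resolveFieldWeight: FieldWeightResolver =
    fun s ->
        match s.ToLowerInvariant() with
        | "email" -> 0.6
        | "firstname" -> 0.05
        | "lastname" -> 0.05
        | "phone" -> 0.3
        | _ -> 0.0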
// Simple recursive implementation of the Levenshtein distance calculation.
// Its running time is exponential, so it is only suitable for short fields.
let levenshteinDistance: DistanceCalculator =
    fun (s1, s2) ->
        let s1' = s1.ToCharArray()
        let s2' = s2.ToCharArray()
        // distance between the first l1 characters of s1 and the first l2 characters of s2
        let rec dist l1 l2 =
            match (l1, l2) with
            | (l1, 0) -> l1
            | (0, l2) -> l2
            | (l1, l2) ->
                if s1'.[l1-1] = s2'.[l2-1] then dist (l1-1) (l2-1)
                else
                    let d1 = dist (l1-1) l2         // deletion
                    let d2 = dist l1 (l2-1)         // insertion
                    let d3 = dist (l1-1) (l2-1)     // substitution
                    1 + Math.Min(d1, Math.Min(d2, d3))
        dist s1.Length s2.Length
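// The recursive version above recomputes the same sub-problems many times. A minimal
// sketch of the classic two-row dynamic-programming variant follows; the name
// levenshteinDistanceDp is an assumption and not part of the original design.
let levenshteinDistanceDp (s1: string, s2: string) : int =
    // previous.[j] = distance between the first (i - 1) chars of s1 and the first j chars of s2
    let mutable previous : int[] = Array.init (s2.Length + 1) id
    for i in 1 .. s1.Length do
        let current : int[] = Array.zeroCreate (s2.Length + 1)
        current.[0] <- i
        for j in 1 .. s2.Length do
            let cost = if s1.[i - 1] = s2.[j - 1] then 0 else 1
            current.[j] <-
                min (min (previous.[j] + 1) (current.[j - 1] + 1)) (previous.[j - 1] + cost)
        previous <- current
    previous.[s2.Length]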
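// Naive implementation of a scorer: 1.0 for identical strings, 0.0 for entirely different ones.
// The distance is normalized by the longer of the two strings so that the rate stays in [0, 1]
// and an empty field does not cause a division by zero.
let naiveDistanceScorer: DistanceScorer =
    fun distance (a, b) ->
        let longest = max a.Length b.Length
        if longest = 0 then 1.0 // two empty strings are identical
        else 1.0 - float (distance (a, b)) / float longest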
// resolve a property to a (name, value) pair; a null value is rendered as ""
let resolve (i: Reflection.PropertyInfo) (ds: DataSubjectInfo): (string * string) =
    (i.Name, string (i.GetValue(ds)))
// transform a DataSubjectInfo into the list of its properties (reflection)
let toPropertyList (ds: DataSubjectInfo) = ds.GetType().GetProperties() |> Array.toList
// transform a DataSubjectInfo into a list of (name, value) pairs
// (the accumulator reverses the property order, consistently for every data subject)
let tuplize (ds: DataSubjectInfo) =
    let rec f (l: Reflection.PropertyInfo list) (acc: (string*string) list) =
        match l with
        | [] -> acc
        | head :: tail -> f tail ((resolve head ds) :: acc)
    f (ds |> toPropertyList) []
// transform the fields of a data subject into a Lucene query
let makeLuceneQuery (ds: DataSubjectInfo) : string =
    let rec f (l: Reflection.PropertyInfo list) (acc: string list) =
        match l with
        | [] -> acc
        // TODO: drop empty fields here
        | head :: tail -> f tail ((resolve head ds |> fun (a,b) -> (sprintf "%s:%s" a b)) :: acc)
    // ["email:..."; "phone:..."; ...]
    let parts = f (ds |> toPropertyList) []
    // email:... OR phone:... [...]
    parts |> String.concat " OR "
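// The TODO above could be handled along these lines: a sketch that drops empty values
// before building the query. It works on (name, value) pairs so that it does not depend
// on reflection. The name makeLuceneQueryFromPairs is hypothetical, and escaping of
// Lucene special characters is deliberately left out.
let makeLuceneQueryFromPairs (fields: (string * string) list) : string =
    fields
    |> List.filter (fun (_, value) -> not (System.String.IsNullOrWhiteSpace value))
    |> List.map (fun (name, value) -> sprintf "%s:%s" name value)
    |> String.concat " OR "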
let dataSubjectIdentityMatcher: DataSubjectIdentityMatcher =
    fun score dist resolveWeight refDs ds ->
        let s = score dist
        let refDs' = refDs |> tuplize
        let ds' = ds |> tuplize
        // sum, over every field, the match rate multiplied by the field's weight
        let rec f i acc =
            match i with
            | 0 -> acc
            | _ ->
                let (k,v) = refDs'.[i-1]
                let (_,v') = ds'.[i-1]
                let w = resolveWeight k
                let sc = s (v,v')
                f (i-1) (acc + sc * w)
        f (refDs'.Length) 0.0
let dataSubjectIdentityFilter: DataSubjectIdentityFilter =
    fun matches score dist resolveWeight t refDs ds ->
        let s = matches score dist resolveWeight
        // keep only the candidates whose matching rate is over the threshold
        ds |> List.filter (fun d -> (s refDs d) > t)
let dataSubjectIdentifier: DataSubjectIdentifier =
    fun search filter matches score dist resolveWeight threshold ds ->
        // use the injected weight resolver rather than the module-level resolveFieldWeight
        let f = filter matches score dist resolveWeight threshold
        let l = ds |> makeLuceneQuery |> search
        f ds l
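// A minimal sketch of the TTL-based local records mentioned in the design notes: values
// seen recently are kept in memory and merged into the search results until the index has
// had time to catch up. The names LocalRecord, pruneExpired, remember and withLocalRecords
// are hypothetical.
type LocalRecord<'a> = { Value: 'a; ExpiresAt: System.DateTime }

// drop every record whose TTL has elapsed at the instant now
let pruneExpired (now: System.DateTime) (records: LocalRecord<'a> list) =
    records |> List.filter (fun r -> r.ExpiresAt > now)

// remember a freshly created value for the duration ttl
let remember (now: System.DateTime) (ttl: System.TimeSpan) (value: 'a) (records: LocalRecord<'a> list) =
    { Value = value; ExpiresAt = now + ttl } :: pruneExpired now records

// append the still-valid local values to the results returned by the index
let withLocalRecords (now: System.DateTime) (records: LocalRecord<'a> list) (found: 'a list) =
    found @ (pruneExpired now records |> List.map (fun r -> r.Value))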