
Last active March 24, 2020 20:00

TL;DR: I'm not sure whether this is just a data problem or whether I'm tackling the problem wrong.


What I'm trying to do is map a list of entries with variable properties onto a well-known list of entries with fixed properties.

In other words, I have a variable textual entry such as {"ABC (test)" "wrongdata"} which needs to be mapped to a fixed entry {"ABC" "test"}.

Problem statement

There are a variety of data providers that output entries with different combinations of fields and values, and several of those entries may need to be mapped to the same final entry.

The input data format contains about 4 fields, plus a few optional ones. The output format contains the same number of fields, plus an order of magnitude more fields that further identify each entry.

Data format

The input is a list in which each entry is one of the following variants:

  • "Name, Edition"
  • "Name (Variant)" (no Edition)
  • "Name, WrongEdition" (WrongEdition is wrong, it should be ignored)
  • "Name (Variant), Edition" (Edition could be wrong or it could be a hint for Variant)
  • optional additional values, not described here, that could be used to further process each entry
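
As a rough illustration of the variants above, here is a minimal sketch that splits one raw entry into its parts. The names `ParsedEntry`, `ENTRY_RE` and `parse_entry` are made up for this example, and it assumes that names contain neither commas nor parentheses. It cannot tell a correct Edition from a wrong one, so the edition is kept only as a hint.

```python
import re
from typing import NamedTuple, Optional


class ParsedEntry(NamedTuple):
    name: str
    variant: Optional[str]
    edition_hint: Optional[str]  # may be wrong; treat it as a hint only


# "Name", then an optional " (Variant)", then an optional ", Edition".
# Assumption: the name itself contains no comma and no parenthesis.
ENTRY_RE = re.compile(
    r"^\s*(?P<name>[^,(]+?)\s*"
    r"(?:\((?P<variant>[^)]*)\))?\s*"
    r"(?:,\s*(?P<edition>.+?))?\s*$"
)


def parse_entry(raw: str) -> ParsedEntry:
    """Split one raw input string into name, variant and edition hint."""
    m = ENTRY_RE.match(raw)
    if m is None:
        raise ValueError(f"unrecognised entry: {raw!r}")
    # Empty captures (e.g. "Name ()") are normalised to None.
    return ParsedEntry(m["name"], m["variant"] or None, m["edition"] or None)
```

With this sketch, `parse_entry("ABC (test)")` yields the name "ABC" and the variant "test", with no edition hint.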

Output format is

  • a map (JSON file in the form of {"Edition": {"Name", "property1", "property2"...}})
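
The exact layout of that JSON file is not spelled out here, so the following is only a sketch under an assumed shape: each edition maps to a list of entries, and each entry is an object with a "name" plus its properties. The sample data, `SAMPLE` and `iter_entries` are all hypothetical.

```python
import json

# Hypothetical sample: edition names, entry names and properties are made up,
# and the edition -> list-of-entries layout is an assumption.
SAMPLE = """
{
  "EditionA": [{"name": "ABC", "variant": "test", "property1": "x"}],
  "EditionB": [{"name": "ABC", "variant": null, "property1": "y"},
               {"name": "XYZ", "variant": null, "property1": "z"}]
}
"""

editions = json.loads(SAMPLE)


def iter_entries(editions):
    """Yield (edition, entry) for every entry in the map."""
    for edition, entries in editions.items():
        for entry in entries:
            yield edition, entry
```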

Current Approach

I created a sort of parser that, for each possible combination of inputs, tries different permutations of the output until it finds the one that most closely matches the input.

However, there are several drawbacks

  1. it creates a lot of false duplicates
  2. it takes an enormous amount of time to find and write each rule
  3. every time you want to add a new input provider you need to repeat the whole process

At its core, what the parser is doing is merely: for ed in Editions; for name in ed.Names; if name == MYNAME, then FOUND().
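
The loop above can be written as a runnable sketch. This is not the actual parser: it assumes the output map has the shape edition -> list of entries with a "name" key, and `find_matches` is a made-up name. It returns every match instead of stopping at the first one, which makes the false duplicates visible, and it uses the edition only as a hint that is ignored when it matches nothing.

```python
def find_matches(editions, wanted_name, edition_hint=None):
    """Return every (edition, entry) pair whose name equals wanted_name.

    If edition_hint names an edition that contains a match, only the matches
    from that edition are returned. A hint that matches nothing is ignored.
    """
    matches = [
        (edition, entry)
        for edition, entries in editions.items()
        for entry in entries
        if entry["name"] == wanted_name
    ]
    hinted = [m for m in matches if m[0] == edition_hint]
    return hinted or matches


# Hypothetical data, for illustration only.
example_editions = {
    "EditionA": [{"name": "ABC"}],
    "EditionB": [{"name": "ABC"}, {"name": "XYZ"}],
}
```

Here `find_matches(example_editions, "ABC")` returns two candidates, one per edition, while passing `edition_hint="EditionB"` narrows the result to a single candidate.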

Current limitations

If this were a one-time parser it would be done and dusted, but there are hundreds of parsers that each behave differently. So I thought of creating a generic one instead of many ad-hoc ones, but it becomes more of an eldritch monstrosity with every rule I add.

Example unified parser: ttps://

Right now, if a new edition (data entry format) comes out, I need to add a new rule to each data provider, whereas I wanted to make something that would let me add a single rule to a single parser.
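
One way the "single rule in a single parser" goal might look is sketched below. It is an assumption, not the existing code: each provider gets a tiny adapter that only reshapes its fields, and all edition knowledge lives in one shared table. The provider names, the field names ("title", "set", "label") and the alias table are all hypothetical.

```python
from typing import Callable, Dict, NamedTuple, Optional


class Normalized(NamedTuple):
    name: str
    edition_hint: Optional[str]


# Per-provider adapters: the only provider-specific code.
# Field names are hypothetical.
def from_provider_a(row: dict) -> Normalized:
    return Normalized(row["title"].strip(), row.get("set"))


def from_provider_b(row: dict) -> Normalized:
    name, _, edition = row["label"].partition(",")
    return Normalized(name.strip(), edition.strip() or None)


ADAPTERS: Dict[str, Callable[[dict], Normalized]] = {
    "provider_a": from_provider_a,
    "provider_b": from_provider_b,
}

# Shared rules: one table for all providers. Supporting a new edition
# means adding one line here rather than one rule per provider.
EDITION_ALIASES = {
    "ed. a": "EditionA",
    "edition a": "EditionA",
}


def normalize(provider: str, row: dict) -> Normalized:
    """Reshape a provider row, then apply the shared edition rules."""
    entry = ADAPTERS[provider](row)
    hint = entry.edition_hint
    # Unknown (possibly wrong) editions are dropped rather than trusted.
    canonical = EDITION_ALIASES.get(hint.lower()) if hint else None
    return entry._replace(edition_hint=canonical)
```

The trade-off in this layout is that adapters stay small and free of matching logic, so adding a provider does not touch the shared rules and adding an edition does not touch the adapters.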
