- Retrieve all party identifiers
- For each party: parse its XML and map each known name to the party's identifier
- Build an Aho-Corasick tree from the party names
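
As a rough illustration of the first two steps, the sketch below builds the name-to-identifier lookup map from per-party XML. The XML layout (a `party` element with an `id` attribute and `name` children), the identifier `nl.p.vvd`, and the function names are assumptions made for this example. The actual schema is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical party document; the real schema may differ.
SAMPLE_PARTY_XML = """<party id="nl.p.vvd">
  <name>Volkspartij voor Vrijheid en Democratie</name>
  <name>VVD</name>
</party>"""


def names_from_party_xml(xml_text):
    """Map every known name of one party (lowercased) to the party identifier."""
    root = ET.fromstring(xml_text)
    party_id = root.get("id")
    return {
        name.text.strip().lower(): party_id
        for name in root.iter("name")
        if name.text and name.text.strip()
    }


def build_lookup(party_documents):
    """Merge the per-party name maps into a single lookup map."""
    lookup = {}
    for document in party_documents:
        lookup.update(names_from_party_xml(document))
    return lookup


LOOKUP = build_lookup([SAMPLE_PARTY_XML])
```

Keys are stored in lowercase so that the later case-insensitive search can look up a matched name directly.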
The tree is searched against the proceedings text (case-insensitive, leftmost-longest match), yielding the names that were matched. The lookup map then gives the party identifier that corresponds to each matched name. A name is linked unless it has already been recognized as a member's name. A matched name of a (single-member) party is also left unlinked when it is part of a longer name, such as that of a motion or committee.
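
The search and lookup described above could look roughly like the following pure-Python sketch. It is a small illustrative Aho-Corasick automaton, not the project's actual code. The identifiers (`nl.p.vvd`, `nl.p.volkspartij`) and the function names are hypothetical. It assumes that lowercasing does not change the length of the text, and it leaves out the checks for member names and longer enclosing names.

```python
from collections import deque

# Hypothetical lookup map: lowercased name -> party identifier.
LOOKUP = {
    "volkspartij voor vrijheid en democratie": "nl.p.vvd",
    "vvd": "nl.p.vvd",
    "volkspartij": "nl.p.volkspartij",
}


def build_automaton(names):
    """Build the goto, failure, and output tables for the given names."""
    goto, fail, out = [{}], [0], [[]]
    for name in names:
        lowered = name.lower()
        node = 0
        for ch in lowered:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
            node = goto[node][ch]
        out[node].append(len(lowered))
    # Breadth-first pass: set failure links and inherit their outputs.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_party_mentions(text, lookup):
    """Return (matched text, party identifier) pairs, leftmost-longest."""
    goto, fail, out = build_automaton(lookup)
    lowered = text.lower()  # assumption: same length as the original text
    hits, node = [], 0
    for i, ch in enumerate(lowered):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        hits.extend((i + 1 - length, i + 1) for length in out[node])
    # Earliest start first; for equal starts, the longest match first.
    hits.sort(key=lambda hit: (hit[0], -hit[1]))
    mentions, pos = [], 0
    for start, end in hits:
        if start >= pos:  # skip anything overlapping an accepted match
            mentions.append((text[start:end], lookup[lowered[start:end]]))
            pos = end
    return mentions


SENTENCE = "De Volkspartij voor Vrijheid en Democratie (VVD) stemde voor."
MENTIONS = find_party_mentions(SENTENCE, LOOKUP)
```

In this example the shorter name "Volkspartij" also matches at the same position, but the leftmost-longest rule keeps only the full name, so the mention resolves to the VVD identifier rather than to the Volkspartij one.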
There is also a grey area where a longer name starts with a party name. Dutch examples are (1) VVD-fractie (the VVD parliamentary group) and (2) VVD-partijprogramma (the VVD party programme). Case one is arguably synonymous with the party itself when it occurs in parliamentary proceedings. Case two, however, usually refers to an artifact, or is used more loosely to denote a certain consensus within the party. It does not refer to the party as a whole.
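
One way to handle this grey area is to inspect the hyphenated suffix that follows a match and compare it with a small allow-list of suffixes that still denote the party. The sketch below is only an illustration: the allow-list, the labels, and the function name are assumptions, and which suffixes belong on the list is an editorial decision.

```python
import re

# Assumption: suffixes after which the compound still denotes the party itself.
PARTY_SYNONYM_SUFFIXES = {"fractie"}

# A hyphen followed directly by letters, e.g. "-fractie".
SUFFIX_PATTERN = re.compile(r"-([^\W\d_]+)")


def classify_match(text, end):
    """Classify a party-name match that ends at index `end` in `text`."""
    suffix = SUFFIX_PATTERN.match(text, end)
    if suffix is None:
        return "party"  # plain party name, no compound
    if suffix.group(1).lower() in PARTY_SYNONYM_SUFFIXES:
        return "party-synonym"  # e.g. VVD-fractie: link as the party
    return "other"  # e.g. VVD-partijprogramma: do not link


EXAMPLE = "De VVD-fractie verwijst naar het VVD-partijprogramma van de VVD."
LABELS = [classify_match(EXAMPLE, m.end()) for m in re.finditer("VVD", EXAMPLE)]
```

Here the three occurrences of VVD are labelled `party-synonym`, `other`, and `party` respectively.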
Finally, there is an issue with party names that also occur as common words. Because the search is case-insensitive, we risk annotating ordinary uses of these words. A simple solution is never to annotate these names when they appear in lowercase, although several names and acronyms remain ambiguous.
Dutch examples of such party names are:
- Nieuw Nederland
- LEF (Lijst 17)
- Volkspartij
- EB
- Mens
- WO (common usage also as acronym)
- Vrije Boeren
- C.D. (common as acronym and initials)
- BP (also British Petroleum)
- Ab (common male first name)
- VAR (also Verklaring arbeidsrelatie)
- Jong
- TON
- LSP (also Landelijk Schakelpunt)
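
The lowercase rule can be sketched as a small filter applied to each match before it is linked. The set below contains only a few of the names listed above, and the function name is hypothetical. Capitalized ordinary uses (for example "Mens" at the start of a sentence) and acronyms such as BP still pass this check, which is the remaining ambiguity noted earlier.

```python
# A few of the ambiguous party names from the list above, lowercased.
AMBIGUOUS_NAMES = {
    "nieuw nederland",
    "volkspartij",
    "mens",
    "vrije boeren",
    "ab",
    "jong",
    "ton",
}


def should_annotate(matched_text, ambiguous=AMBIGUOUS_NAMES):
    """Return False for an ambiguous name that occurs entirely in lowercase."""
    if matched_text.lower() not in ambiguous:
        return True
    return not matched_text.islower()
```

For example, `should_annotate("mens")` is `False`, while `should_annotate("Mens")` and `should_annotate("TON")` are `True`.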
Currently, the focus is solely on political parties. The names and acronyms of other political organisations can be found in the appendices of government documents, e.g. http://www.rijksbegroting.nl/2013/voorbereiding/begroting,kst173857_26.html