swapnilraj/Lyricist - Fix title parser for - separator.md

## Lyricist - Fix title parser for - separator.md

      
    Raw
  

              Lyricist - Fix title parser for - separator.md
            
          
    Title extraction fix

This fix is addressing the issue
#11, to explain the
issue in detail; the current extraction is done using regular expressions and
the current expression used is /(.*)(\||\/\/\/|-|:)(.*)(\s-\sYouTube)$/. The
issue with this expression is the eager nature of the regex engine where it
tries to capture the biggest expression possible.
This document is also an attempt at documenting the rather big regular
expression.
There are four groups for artist name, a separator for the artist name and the
song title, song title and the YouTube suffix respectively.
(.*)              Matches any character, ideally the artist name
(\||\/\/\/|-|:)   Matches one of the separators, which can be one |, ///, - or :
(.*)              Matches any character, ideally the song title
(\s-\sYouTube)$   Matches the YouTube suffix, put by youtube when it puts
                  the title of the song in the title of the document

The idea here is that the separator can be used to split the artist name and
the song title, we also see some more text like official video or featuring
artists which we want to parse out with other regular expressions.

Now that we know what the expression is meant to do, we see how the eager nature
of the engine is causing the issue, the first part of the expression (.*)
captures artist name the separator and the - in song title is matched by the
group responsible for catching the separators.
       Matched the expression
       meant to capture the separators
            +
            |
            v
Kasbian - Re-wired
^          ^ ^   ^
|          | |   |
+----------+ +---+
Matched by   Matched
expression   by the
.* meant     expression (.*)
to capture   mwant to capture
the artist   the song title
name


Solutions

Negated separators for the artist name

In layman terms, this approach expects anything before the first occurrence of
any separator is the artist name, where the artist name cannot have any of the
separators.
This solution makes the assumption that the artist name will not have a dash or
any of the separators. The regex under this solution would be
([^-]*)(\||\/\/\/|-|:)(.*)(\s-\sYouTube)$ where the first group matches
any characters other than - which is the equivalent of
saying that the artist name does not contain any separators.
Pros


No assumptions made about the separator.

Cons


As it turns out there are many artists with a dash and other separators
in their name.

Examples:
A-Star
a-ha


Spacing around the separators

The only issue of this nature we have observed in the wild is the example in
issue (#11), where the song title has a - on further investigation there to
be a substantial amount of examples with artists who have a - in their name.
Listed in the cons of the previous approach are a few examples of artist with
the - separators in their name and possibly more song titles with a -. To
avoid this issue we make an assumption about the separator itself, The
assumption being made here is that will be some kind of spacing around the
-. The separator will change to (\||\/\/\/|\s+-\s+|:). The only addition
is the variable amount of space around the - separator.
We are not making any changes to the spacing around the other separators since
we have not observed any artist name or song title containing the other separators
but this could change in the future and we might have to write a Deterministic
Finite Automaton.
Pros


No assumptions made about the artist name or song title.

Cons


If the title of a music video contains - as a separator without any spaces
around it then this solution wouldn't match the title at all.

A small consolation is that we only need to enforce the space either to
the left or right of -, thus the resulting expression would be
(\||\/\/\/|\s+-|:) or (\||\/\/\/|-\s+|:).


Conclusion

From the suggested fixes I propose to implement the second solution that
assumes that there will be some spacing around the - separator. To
make assumptions about the separator seems like a better trade-off compared to
making assumptions about either the artist name or the song title.
This issue was spotted in the wild by me, apart from the fix suggested this
also speaks to the bigger issues that need to be addressed, the projects needs
an automated way of reporting failures, possibly with the help of Google
Analytics and/or ways for the users to report failures in their experience,
it would be highly desirable when users make a report that the extension is
able to gather some data about the occurred failures, by maintaining some
sort of internal log