Skip to content

Instantly share code, notes, and snippets.

@swapnilraj
Last active December 31, 2018 15:55
Show Gist options
  • Save swapnilraj/7d594835271d609c559706b550e91d2d to your computer and use it in GitHub Desktop.
Save swapnilraj/7d594835271d609c559706b550e91d2d to your computer and use it in GitHub Desktop.
Lyricist = Proposal for fixing issue 11

Title extraction fix

This fix is addressing the issue #11, to explain the issue in detail; the current extraction is done using regular expressions and the current expression used is /(.*)(\||\/\/\/|-|:)(.*)(\s-\sYouTube)$/. The issue with this expression is the eager nature of the regex engine where it tries to capture the biggest expression possible.

This document is also an attempt at documenting the rather big regular expression. There are four groups for artist name, a separator for the artist name and the song title, song title and the YouTube suffix respectively.

(.*)              Matches any character, ideally the artist name
(\||\/\/\/|-|:)   Matches one of the separators, which can be one |, ///, - or :
(.*)              Matches any character, ideally the song title
(\s-\sYouTube)$   Matches the YouTube suffix, put by youtube when it puts
                  the title of the song in the title of the document

The idea here is that the separator can be used to split the artist name and
the song title, we also see some more text like official video or featuring
artists which we want to parse out with other regular expressions.

Now that we know what the expression is meant to do, we see how the eager nature of the engine is causing the issue, the first part of the expression (.*) captures artist name the separator and the - in song title is matched by the group responsible for catching the separators.

       Matched the expression
       meant to capture the separators
            +
            |
            v
Kasbian - Re-wired
^          ^ ^   ^
|          | |   |
+----------+ +---+
Matched by   Matched
expression   by the
.* meant     expression (.*)
to capture   mwant to capture
the artist   the song title
name

Solutions

Negated separators for the artist name

In layman terms, this approach expects anything before the first occurrence of any separator is the artist name, where the artist name cannot have any of the separators.

This solution makes the assumption that the artist name will not have a dash or any of the separators. The regex under this solution would be ([^-]*)(\||\/\/\/|-|:)(.*)(\s-\sYouTube)$ where the first group matches any characters other than - which is the equivalent of saying that the artist name does not contain any separators.

Pros

  • No assumptions made about the separator.

Cons

  • As it turns out there are many artists with a dash and other separators in their name.
    • Examples:
    • A-Star
    • a-ha

Spacing around the separators

The only issue of this nature we have observed in the wild is the example in issue (#11), where the song title has a - on further investigation there to be a substantial amount of examples with artists who have a - in their name. Listed in the cons of the previous approach are a few examples of artist with the - separators in their name and possibly more song titles with a -. To avoid this issue we make an assumption about the separator itself, The assumption being made here is that will be some kind of spacing around the -. The separator will change to (\||\/\/\/|\s+-\s+|:). The only addition is the variable amount of space around the - separator.

We are not making any changes to the spacing around the other separators since we have not observed any artist name or song title containing the other separators but this could change in the future and we might have to write a Deterministic Finite Automaton.

Pros

  • No assumptions made about the artist name or song title.

Cons

  • If the title of a music video contains - as a separator without any spaces around it then this solution wouldn't match the title at all.
    • A small consolation is that we only need to enforce the space either to the left or right of -, thus the resulting expression would be (\||\/\/\/|\s+-|:) or (\||\/\/\/|-\s+|:).

Conclusion

From the suggested fixes I propose to implement the second solution that assumes that there will be some spacing around the - separator. To make assumptions about the separator seems like a better trade-off compared to making assumptions about either the artist name or the song title.

This issue was spotted in the wild by me, apart from the fix suggested this also speaks to the bigger issues that need to be addressed, the projects needs an automated way of reporting failures, possibly with the help of Google Analytics and/or ways for the users to report failures in their experience, it would be highly desirable when users make a report that the extension is able to gather some data about the occurred failures, by maintaining some sort of internal log

@JorikSchellekens
Copy link

The regex for the first solution is actually making the assumption that that there are no '-'s in the song title specifically. I don't think this is wise. We would also have to negate all the separators for which we already check for. This would make the regex the much more cumbersome ([^\|\/-:]*)(\||\/\/\/|-|:)(.*)(\s-\sYouTube)$ [Where we simply assume the title may not have / instead of the string ///]

I agree with you, by drawing on my limited experience of watching youtube music videos, that the second solution would be more reliable.

We should open an issue about Google Analytics as an enhancement.

Could you fix the grammar in the paragraph starting on line 89?

@swapnilraj
Copy link
Author

Yeah in the fist version of the document (which was local not a gist), for the case where the artist name was a negation of the separator I had all the separator but it did not make sense since we have not seen any examples where we would benefit from that, and there could possibly an artist with the other separators in their, which again brings us to the same conclusion that making assumptions about the artist name is not nice.

I have fixed the grammatical issue, thanks for the tip!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment