"Answer looks like spam" stackoverflow: regex-or-substring-operation-to-strip-out-a-url-from-a-keyword-onwards

Beforehand

I'm going ultra basic here, as regex tends to be quite daunting when you're not familiar with it. My apologies if you already know most of this.

NOTE: a great tool for working with regexes is regex101. It contains a test suite, regex analysis and basic documentation of the regex special chars.

Circumstance

Do you know the exact keyword beforehand, and is the URL the full string (no leading or trailing text)? If it's always the same keyword (e.g. filter), you can use String.prototype.match and a couple of capture groups to split it neatly:

Basic Regex

A basic regex could look like /^(.*)(\/filter\/.*)$/ where:

  • ^ -> an anchor for the start of the string (so the match MUST start at index 0)
  • (.*) -> the first capturing group (anything before /filter/)
    • . -> match any non-linebreak char
    • * -> repeat 0 or more times
  • (\/filter\/.*) -> second capturing group (matches everything from /filter/ until the end of the string; swap filter for your own keyword)
    • \/filter\/ -> you must escape forward slashes (/) inside a regexp literal; an unescaped slash ends the literal early, which usually results in a syntax error.
    • .* -> like above (just matches anything)
  • $ -> anchor for the end of string (so the whole match MUST include the entire string)
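To see the breakdown above in action, here is a small sketch that runs the pattern against two inputs. The variable names (`pattern`, `sampleUrl`, `otherUrl`) and the second URL are made up for illustration.

```javascript
// The pattern from the breakdown above, unchanged.
const pattern = /^(.*)(\/filter\/.*)$/

// `sampleUrl` contains the keyword, `otherUrl` (a made-up input) does not.
const sampleUrl = 'http://example.com/category/subcat/filter/size/1/'
const otherUrl = 'http://example.com/category/subcat/'

console.log(pattern.test(sampleUrl)) // true
console.log(pattern.test(otherUrl))  // false

// Without a match, String.prototype.match returns null instead of an array.
console.log(otherUrl.match(pattern)) // null
```

The `null` result matters: any code that indexes into the match has to deal with it first.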

Basic matcher function

A basic function could look like this:

// NOTE: the regexp is declared outside the function so it gets reused
// on every call, which is a bit faster if you have to do this very often
const matcher = /^(.*)(\/filter\/.*)$/

/**
 * @param {string} url the url to process
 * @returns {{ main: string, stripped: string } | null} null if the url doesn't contain the keyword
 */
function splitURL (url) {
  const match = url.match(matcher)
  // `match` returns null when the regexp doesn't match at all
  if (match === null) return null
  return { main: match[1], stripped: match[2] }
}

Explanation

String.prototype.match(regexp: RegExp) can be a bit confusing if you're not used to it, but it's not that complicated. Using the example url and regex:

('http://example.com/category/subcat/filter/size/1/').match(/^(.*)(\/filter\/.*)$/)

Will return a RegExpMatchArray like this:

[
  'http://example.com/category/subcat/filter/size/1/', // <-- index 0, the full match (in this example it's the entire string)
  'http://example.com/category/subcat', // <- index 1, the first capture group (`(.*)`)
  '/filter/size/1/', // <- index 2, the second group (`(\/filter\/.*)`)
  index: 0, // <-- key `index`, the starting index of the match (in this case 0, the start of the string)
  input: 'http://example.com/category/subcat/filter/size/1/', // <- key `input`, the string on which `String.prototype.match` was called
  groups: undefined // <- key `groups`, an object that stores the named capture groups and their value. (here undefined since we didn't have any named groups)
]

The way your average console.log displays it is a little odd, so to clarify:

  • we have a normal Array with 3 items:
[
  'http://example.com/category/subcat/filter/size/1/',
  'http://example.com/category/subcat',
  '/filter/size/1/'
]

with 3 additional properties added to it:

  • index: 0
  • input: 'http://example.com/category/subcat/filter/size/1/'
  • groups: undefined

so as a regular object it would be displayed as:

{
  length: 3,
  0: 'http://example.com/category/subcat/filter/size/1/',
  1: 'http://example.com/category/subcat',
  2: '/filter/size/1/',
  index: 0,
  input: 'http://example.com/category/subcat/filter/size/1/',
  groups: undefined
}

In the above function we just pick index 1 and 2 from the Match Array and return them in an object.
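As a rough illustration of the "array with extra properties" idea, the sketch below reads the same match both ways. The variable names are chosen for this example only.

```javascript
const matcher = /^(.*)(\/filter\/.*)$/
const match = 'http://example.com/category/subcat/filter/size/1/'.match(matcher)

// Array destructuring: skip index 0 (the full match), keep the two groups.
const [, main, stripped] = match

console.log(main)     // http://example.com/category/subcat
console.log(stripped) // /filter/size/1/

// It really is an Array, the extra keys are just properties on it.
console.log(Array.isArray(match)) // true
console.log(match.length)         // 3
console.log(match.index)          // 0
```

Destructuring is only a shorter way of writing `match[1]` and `match[2]`; which one reads better is a matter of taste.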

Named Capture Groups

We could use named capture groups too: /^(?<main>.*)(?<stripped>\/filter\/.*)$/. This way we can just do:

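const matcher = /^(?<main>.*)(?<stripped>\/filter\/.*)$/
function splitURL (url) {
  const match = url.match(matcher)
  // no match -> null, otherwise hand back the named groups as they are
  return match === null ? null : match.groups
}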

Using that on the example url gives basically the same array, but now with a populated groups property, which we can return directly:

[
  'http://example.com/category/subcat/filter/size/1/',
  'http://example.com/category/subcat', // <- named groups are still indexed the same way they were without a name
  '/filter/size/1/',
  index: 0,
  input: 'http://example.com/category/subcat/filter/size/1/',
  groups: [Object: null prototype] { // <- we can return this object and save the picking we did before
    main: 'http://example.com/category/subcat',
    stripped: '/filter/size/1/'
  }
]
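The `[Object: null prototype]` tag in the output above has a practical consequence when you consume the groups object. The sketch below shows plain destructuring plus two possible ways of getting at `hasOwnProperty`; the variable names are illustrative only.

```javascript
const matcher = /^(?<main>.*)(?<stripped>\/filter\/.*)$/
const { groups } = 'http://example.com/category/subcat/filter/size/1/'.match(matcher)

// Object destructuring works as usual.
const { main, stripped } = groups
console.log(main)     // http://example.com/category/subcat
console.log(stripped) // /filter/size/1/

// No prototype means no inherited methods.
console.log(Object.getPrototypeOf(groups)) // null
console.log(typeof groups.hasOwnProperty)  // undefined

// Option 1: borrow the method from Object.prototype.
console.log(Object.prototype.hasOwnProperty.call(groups, 'main')) // true

// Option 2: copy the entries into an ordinary object.
const plain = { ...groups }
console.log(plain.hasOwnProperty('stripped')) // true
```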

A few Notes

  • the groups object has a prototype of null, so it doesn't have any of the methods a normal object would (e.g. toString or hasOwnProperty). Trying to call one of those throws a TypeError along the lines of groups.toString is not a function

  • if the keyword isn't static, but you know it by the time you get the url, you can build the regex with the RegExp constructor and a template literal, e.g.

const matcher = new RegExp(`^(?<main>.*)(?<stripped>\/${yourKeywordVariable}\/.*)$`)
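One thing to watch with a dynamic keyword: if it can contain regex special chars (a dot, a plus, parentheses), they will be interpreted as regex syntax. A minimal sketch of one way to guard against that, assuming two hypothetical helpers `escapeRegExp` and `makeMatcher` that are not part of the original answer:

```javascript
// Hypothetical helper: backslash-escapes regex metacharacters so the
// keyword is matched literally.
function escapeRegExp (text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical factory: builds the matcher for a given keyword.
// Inside a string passed to the RegExp constructor, `/` needs no escaping.
function makeMatcher (keyword) {
  return new RegExp(`^(?<main>.*)(?<stripped>/${escapeRegExp(keyword)}/.*)$`)
}

const dottedMatcher = makeMatcher('v1.0')

console.log('http://example.com/api/v1.0/users'.match(dottedMatcher).groups.stripped) // /v1.0/users

// Without escaping, the dot would match any char and this would be true.
console.log(dottedMatcher.test('http://example.com/api/v1x0/users')) // false
```

If your keywords are always plain words, the escaping step is unnecessary.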
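  • the example regex here works for "best case" scenarios. It gives a surprising result when, for example, the url contains the keyword twice, e.g. 'http://example.com/category/subcat/filter/size/1/filter/'. Because .* is greedy, the first group grabs as much as it can and the split happens at the last /filter/. In this case the above functions would return: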
{
  main: 'http://example.com/category/subcat/filter/size/1',
  stripped: '/filter/'
}

This could be fixed with assertions in the regex like lookahead/lookbehind, but the exact form will depend on the use case. It's usually not worth writing a catch-all regex unless it's actually needed.
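If the goal is simply to split at the first occurrence of the keyword, a lazy quantifier (`.*?`) may already be enough. Note that this is a different technique from the lookahead/lookbehind approach mentioned above, and whether "first occurrence" is the right rule is an assumption about your use case.

```javascript
// Greedy: the first group takes as much as possible (splits at the LAST keyword).
const greedy = /^(?<main>.*)(?<stripped>\/filter\/.*)$/
// Lazy: the first group takes as little as possible (splits at the FIRST keyword).
const lazy = /^(?<main>.*?)(?<stripped>\/filter\/.*)$/

const doubled = 'http://example.com/category/subcat/filter/size/1/filter/'

console.log(doubled.match(greedy).groups.main)   // http://example.com/category/subcat/filter/size/1
console.log(doubled.match(lazy).groups.main)     // http://example.com/category/subcat
console.log(doubled.match(lazy).groups.stripped) // /filter/size/1/filter/
```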
