@fasikaWalle
Last active April 19, 2021 02:36
A search pattern for URL match.

Matching a URL regular expression

A URL pattern is an ordered set of characters that the Google Search System uses to match real URLs found by the search engine. You can define URL patterns that include matching URLs in your index and URL patterns that exclude matching URLs from it. This guide shows how to build such a pattern as a regular expression.

Summary

This tutorial walks through a regular expression that matches URLs and explains each of its components and what it does. A forward slash at the beginning and at the end of the expression marks where it starts and ends. Overall, this regular expression is a string that describes a search pattern for URLs.

Matching URL regex: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

 Examples that match the expression:
 https://www.google.com/
 www.gmail.com/
 http://www.yahoo.com/
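
As a quick check, the sketch below runs the pattern against the three sample URLs. The two extra inputs (`ftp://example.com` and `not a url`) are hypothetical additions of mine, included only to show what a non-match looks like.

```javascript
// The URL pattern from this guide, written as a JavaScript regex literal.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

const samples = [
  "https://www.google.com/", // sample from the guide
  "www.gmail.com/",          // sample from the guide
  "http://www.yahoo.com/",   // sample from the guide
  "ftp://example.com",       // hypothetical: ":" is not allowed outside the http(s) prefix
  "not a url",               // hypothetical: no dot-separated host
];

// Pair every input with the result of testing it against the pattern.
const results = samples.map((s) => [s, urlPattern.test(s)]);

for (const [s, ok] of results) {
  console.log(ok ? "match   " : "no match", s);
}
```

`RegExp.prototype.test` only reports whether the string matches; use `exec` or `String.prototype.match` when the captured parts are needed.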

Regex Components

Anchors

Anchors assert that the engine's current position in the string corresponds to a predetermined spot, such as the start or end of the string. They do not match any characters themselves. Anchors are useful for several reasons. First, they let you require, for example, that digits are matched only at the end of a line. Second, when you tell the engine that a complex pattern must be found at a specific spot, it can reject non-matching input quickly.

In the URL-matching expression there are two anchors:

  • The caret (^) - Indicates that the match must start at the beginning of the searched text. In /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ the caret at the start of the expression means the URL is compared from its first character.
e.g: ``/^https/`` - This expression matches only when the string starts with https, for example "https" or "https://www.google.com/".
  • The dollar ($) - Indicates that whatever precedes it must sit at the end of the searched text, so it matches the end of the string.
e.g: /\.com$/ matches: fasika@gmail.com

Using the caret at the beginning of the expression and the dollar sign at the end means the pattern must match the entire string, not just a part of it.

e.g: /http$/ - Here the dollar sign anchors the match to the end of the string. In the string "this is a protocol which is http", it matches the final http.
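
The anchor behaviour can be tried out with a small sketch like the one below. The variable names and the strings that do not appear in the examples above are my own, chosen only for illustration.

```javascript
// ^ pins the match to the start of the string, $ pins it to the end.
const startsWithHttps = /^https/;
const endsWithHttp = /http$/;
const exactlyHttp = /^http$/; // both anchors: the whole string must be "http"

const anchorChecks = {
  startOk: startsWithHttps.test("https://www.google.com/"),       // starts with https
  startFail: startsWithHttps.test("see https://www.google.com/"), // https is not at the start
  endOk: endsWithHttp.test("this is a protocol which is http"),   // ends with http
  endFail: endsWithHttp.test("http is a protocol"),               // http is not at the end
  wholeOk: exactlyHttp.test("http"),
  wholeFail: exactlyHttp.test("https"),                           // extra "s" before the end
};

console.log(anchorChecks);
```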

Quantifiers

Quantifiers determine how many instances of a character, group, or character class must be present in the input for a match to be found. In other words, they tell the engine about repetition. The available quantifiers are ?, +, *, {n}, {n,}, and {n,m}. In the URL-matching expression there are:

* (0 or more)        - Whatever comes before * can occur zero or more times.
+ (1 or more)        - Whatever comes before + can occur one or more times.
? (optional, 0 or 1) - Whatever comes before ? is optional: it can occur once or not at all.
{2,6}                - Whatever comes before {2,6} must occur at least 2 and at most 6 times.
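
The counted forms {n}, {n,}, and {n,m} can be illustrated with a minimal sketch. Only `[a-z\.]{2,6}` comes from the URL expression; the digit patterns and all sample strings are assumptions made for the demonstration.

```javascript
// {2,6}: between 2 and 6 characters from the set (the TLD part of the URL regex).
const tld = /^[a-z\.]{2,6}$/;
// {3}: exactly three digits. {2,}: two or more digits. (hypothetical patterns)
const exactlyThree = /^\d{3}$/;
const atLeastTwo = /^\d{2,}$/;

const countChecks = [
  tld.test("c"),            // false: only 1 character
  tld.test("com"),          // true: 3 characters
  tld.test("museum"),       // true: 6 characters
  tld.test("website"),      // false: 7 characters
  exactlyThree.test("123"), // true
  exactlyThree.test("12"),  // false
  atLeastTwo.test("1"),     // false
  atLeastTwo.test("12345"), // true
];

console.log(countChecks);
```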

The URL-matching expression /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ contains seven quantifiers: three ?, one +, one {2,6}, and two *. The sections below explain what each of them does.

? quantifiers and their use

In this URL expression the first ? makes the "s" character optional, because the protocol can also be written without it (http). The second ? makes the whole protocol part optional, because some URLs do not write the protocol explicitly (e.g. www.google.com). The third ? makes the trailing "/" optional, so the URL can be written without a forward slash at the end (e.g. www.gmail.com).
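
The effect of the three ? quantifiers can be seen in a short sketch. `protocolOf` is a hypothetical helper written for this example; in JavaScript, an optional group that did not take part in the match is reported as `undefined`.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical helper: returns the captured protocol, undefined when the URL
// has no protocol, or null when the URL does not match at all.
function protocolOf(url) {
  const m = urlPattern.exec(url);
  return m ? m[1] : null;
}

console.log(protocolOf("https://www.google.com")); // "s" present
console.log(protocolOf("http://www.google.com"));  // "s" left out (first ?)
console.log(protocolOf("www.google.com"));         // protocol left out (second ?)

// Third ?: the trailing slash may be present or absent.
const withSlash = urlPattern.test("www.gmail.com/");
const withoutSlash = urlPattern.test("www.gmail.com");
console.log(withSlash, withoutSlash);
```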

The + quantifier and its use

The + quantifier comes right after [\da-z\.-]. It tells us that the characters inside the square brackets can be used one or more times in the URL string, but at least one of them must be present.

 e.g: (google..) In this example `.` is used twice, which the `+` quantifier allows because any character in the set may appear one or more times.
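
A small sketch of the "one or more" rule follows. The anchors ^ and $ are added here only so that the whole sample string is tested against the set; sample strings other than `google..` are made up.

```javascript
// One or more characters from the set: digits, lowercase letters, "." and "-".
const hostChars = /^[\da-z\.-]+$/;

const plusChecks = [
  hostChars.test("google"),    // true
  hostChars.test("google.."),  // true: "." may repeat
  hostChars.test("my-site01"), // true: hyphen and digits are in the set
  hostChars.test(""),          // false: + needs at least one character
  hostChars.test("Google"),    // false: uppercase "G" is not in the set
];

console.log(plusChecks);
```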

The * quantifier and its use

The expression contains two * quantifiers. The first * comes right after [\/\w \.-] and means that the characters inside the square brackets ("/", word characters, space, "." and "-") can be used zero or more times.

e.g: (google/google/-) In this case google appears twice, the forward slash twice, and the hyphen once. The `.` is left out entirely, which is allowed because `*` does not require every character in the set to appear, or any character at all. The second `*`, which closes ([\/\w \.-]*)*, lets the whole group repeat zero or more times.
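
The "zero or more" rule for the path characters can be sketched as follows. For simplicity this tests only the inner `[\/\w \.-]*` with added anchors, not the nested group; sample strings other than `google/google/-` are assumptions.

```javascript
// Zero or more characters from the set: "/", word characters, space, "." and "-".
const pathChars = /^[\/\w \.-]*$/;

const starChecks = [
  pathChars.test("google/google/-"), // true: "." is simply not used
  pathChars.test(""),                // true: * also accepts nothing at all
  pathChars.test("/a b.c-d"),        // true: space, dot and hyphen are in the set
  pathChars.test("/search?q=1"),     // false: "?" and "=" are not in the set
];

console.log(starChecks);
```

The last case shows a limitation of the pattern: URLs with query strings are rejected.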

Literal-Characters

A literal character, also known as a matching character, matches exactly that character in the text.

https  - Matches the characters https.
:      - Matches a colon.
\/\/   - Matches two forward slashes (//). Each backslash is an escape character that lets the / be used literally inside the /.../ delimiters.
\.     - Matches a literal dot (.). Without the backslash, . would match any character.
-      - Matches a hyphen character.
\/     - Matches a single forward slash (/), escaped with a backslash.
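
Why the escapes matter can be shown with a minimal sketch. The patterns `a.c` and `a\.c` are made-up examples, not part of the URL expression.

```javascript
// An unescaped dot matches any single character; an escaped dot matches only ".".
const anyChar = /^a.c$/;
const literalDot = /^a\.c$/;

const dotChecks = [
  anyChar.test("a.c"),    // true
  anyChar.test("abc"),    // true: "." stands for any character
  literalDot.test("a.c"), // true
  literalDot.test("abc"), // false: only a real dot is accepted
];

// In a regex literal, "/" must be escaped because "/" also delimits the literal.
const slashesLiteral = /^https:\/\/$/;
// When the pattern is built from a string, the same slashes need no escaping.
const slashesFromString = new RegExp("^https://$");

console.log(dotChecks, slashesLiteral.test("https://"), slashesFromString.test("https://"));
```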

Character Classes

With a "character class", also known as a "character set", you can tell the regex engine to match any one of several characters. Simply put, the characters you want to match are enclosed in square brackets.

The expression above uses the following classes:
\d    - Matches a single digit from 0 to 9, the same as [0-9].
a-z   - Inside square brackets, matches any single lowercase letter from a to z.
\w    - Matches a single word character: a letter, a digit, or an underscore, the same as [A-Za-z0-9_].
[]    - Square brackets in a regular expression define a character set.
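
The classes can be tried one character at a time, as in this sketch. All sample inputs are my own.

```javascript
// Each pattern is anchored so that it must match exactly one character.
const digit = /^\d$/;
const lower = /^[a-z]$/;
const wordChar = /^\w$/;

const classChecks = [
  digit.test("7"),     // true
  digit.test("x"),     // false
  lower.test("q"),     // true
  lower.test("Q"),     // false: uppercase is outside a-z
  wordChar.test("_"),  // true: underscore counts as a word character
  wordChar.test("-"),  // false: hyphen does not
  wordChar.test("ab"), // false: \w matches one character, not a whole word
];

// Collect every word character in a string; the hyphen is skipped.
const wordChars = "a-b_c".match(/\w/g);

console.log(classChecks, wordChars);
```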

Grouping and Capturing

Capturing groups are a way to treat a group of characters as if they were one unit, and to remember the text they matched. They're made by putting the characters to be grouped in parentheses.

(https?:\/\/)  matches: 'https://', 'http://'
([\da-z\.-]+)  matches: '012google.-', 'gmail-'
([a-z\.]{2,6}) matches: 'googl.', 'go.'
([\/\w \.-]*)  matches: '/gmail.', '/net-'
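
What each group captures from a full URL can be inspected with `exec`, as in the sketch below. The sample URL and the names `protocol`, `host`, `suffix`, and `path` are my own labels for groups 1 to 4.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample URL with a path.
const sampleUrl = "https://www.google.com/search";

// exec returns an array: index 0 is the whole match, 1..4 are the groups.
const match = urlPattern.exec(sampleUrl);
const [whole, protocol, host, suffix, path] = match;

console.log({ whole, protocol, host, suffix, path });
// protocol: "https://", host: "www.google", suffix: "com", path: "/search"
```

Note that the second group is greedy but has to give back ".com" so that the literal `\.` and the third group can still match, which is why `host` ends at "www.google".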

Bracket Expressions

A bracket expression is a set of characters enclosed by '[' and ']'. It matches any single character in the list. If a caret (^) is the first character inside the brackets, it matches any single character that isn't in the list.

In the examples below, every character of each sample string is in the set:

[\da-z\.-]  matches each character of: '02gmail.', 'yahoo01-'
[a-z\.]     matches each character of: 'gmail.', 'yahoo.'
[\/\w \.-]  matches each character of: '/gmail-', '/google.'
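
The negated form is handy for finding the first character that falls outside a set. The sketch below uses it that way; `firstOutsider` is a hypothetical helper, and 'Yahoo01-' is a made-up variant of the sample above.

```javascript
// [^...] matches any single character that is NOT in the list.
const outsideHostSet = /[^\da-z\.-]/;

// Hypothetical helper: returns the first character outside the set, or null.
function firstOutsider(text) {
  const m = text.match(outsideHostSet);
  return m ? m[0] : null;
}

console.log(firstOutsider("yahoo01-")); // null: every character is in the set
console.log(firstOutsider("Yahoo01-")); // "Y": uppercase is not in the set
console.log(firstOutsider("02gmail.")); // null
```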

Greedy and Lazy Match

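A greedy quantifier tries to repeat the sub-pattern as many times as possible before it explores shorter matches by backtracking. In most cases, a greedy pattern matches the longest string possible. All quantifiers are greedy by default, so the +, *, ?, and {2,6} in the URL expression are all greedy. Adding a ? after a quantifier (for example +? or *?) makes it lazy, so it repeats as few times as possible instead.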
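
The difference between the two modes is easiest to see side by side. In this sketch the patterns `.+\.` and `a{2,6}` are made-up examples, not taken from the URL expression.

```javascript
const text = "www.google.com";

// Greedy: .+ grabs as much as it can, then backtracks to the LAST dot.
const greedy = text.match(/.+\./)[0];
// Lazy: .+? grabs as little as it can, so it stops at the FIRST dot.
const lazy = text.match(/.+?\./)[0];

// The same applies to counted quantifiers.
const greedyCount = "aaaa".match(/a{2,6}/)[0]; // takes all four
const lazyCount = "aaaa".match(/a{2,6}?/)[0];  // takes the minimum of two

console.log(greedy, lazy, greedyCount, lazyCount);
// www.google. www. aaaa aa
```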

Author

Name: Fasika Walle, a JavaScript developer who is passionate about sharing trends and ideas that help web developers work more efficiently. GitHub: https://github.com/fasikaWalle
