nilshah98/sed_regex.md

## sed_regex.md

      
Issue

The exported data contains dates in two formats, zero-padded (07/12/2021) and unpadded (7/20/2021), which makes it a little trickier to do analysis on the data.
Sample data -
www.google.com,07/12/2021,123
www.wikipedia.org,7/20/2021,567

Solution

Using sed and a regex, detect the unpadded dates and zero-pad them so that all dates are uniform.
Regex

We cannot detect the unpadded dates using [1-9]\/ alone, as this matches both 07/12/2021 (the 7/ after the 0, and the 2/ in 12/) and 7/20/2021. Hence, we add a leading , (comma) for more specificity, since the date field always follows a comma.

Note that in \/ the \ (backslash) escapes the next character, ie. / (forward slash), so that it is matched literally and is not read by sed as the delimiter.

Hence, the regex that detects only 7/20/2021, and not dates such as 07/12/2021, is - ,[1-9]\/ . We have added , at the start.
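To make the difference between the two patterns concrete, here is a small sketch that runs both against the sample rows. It uses sed's own `/pattern/p` address form with `-n` to print only matching lines; the variable names `sample`, `loose` and `strict` are made up for illustration.

```shell
# The two sample rows from the Issue section.
sample='www.google.com,07/12/2021,123
www.wikipedia.org,7/20/2021,567'

# Without the comma, both rows match ("7/" and "2/" in row 1, "7/" in row 2).
loose=$(printf '%s\n' "$sample" | sed -n '/[1-9]\//p')

# With the leading comma, only the row with the unpadded month matches.
strict=$(printf '%s\n' "$sample" | sed -n '/,[1-9]\//p')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The `strict` result should contain only the www.wikipedia.org row.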
Sed

Now that we can detect the string we need to change using regex, we need a way to edit it. sed is a stream editor: it processes its input line by line in a single pass, which makes it efficient. Also, sed can be used in a pipeline to filter your text further.

Sed also supports regex, addresses (selecting lines by position or pattern, ex - the header only), substitution, among other features.
We will use regex with substitution to come up with our solution.

Sed with regex follows the following format - sed '<start>/<pattern>/<substitution>/<final>'. Ex - sed '10s/[a-z]/1/g'

<start> : An optional address (which line or lines to target) followed by the command. s is the substitute command; an address without s only selects lines for some other command. In the above case, we only perform substitution for line 10, because of 10s at the start
<pattern> : This is the regex pattern to search for
<substitution> : This is what we want to replace the match with
<final> : These are optional flags for s, example g to replace every match on the line instead of only the first, p to print the line if a substitution was made, etc ... (d, delete, is a separate command and not a flag of s)

In this case, sed '10s/[a-z]/1/g' will replace every lowercase letter on line 10 with 1.
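As a quick sketch of the address plus `g` flag, the same command shape can be tried on a three-line input, targeting line 2 instead of line 10. The input text and the names `input` and `result` are invented for this example.

```shell
# Three hypothetical input lines; only line 2 is addressed.
input='abc 123
def 456
ghi 789'

# 2s/[a-z]/1/g: on line 2 only, replace every lowercase letter with 1.
result=$(printf '%s\n' "$input" | sed '2s/[a-z]/1/g')

printf '%s\n' "$result"
```

Lines 1 and 3 pass through unchanged, and line 2 becomes `111 456`. Dropping the `g` would replace only the first letter on that line.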

In the substitution, one can use & to refer to the whole matched text. Ex - cat domains.csv | sed 's/,[1-9]\//0&/g' .

This will add 0 at the start of the matching phrase. In this case we matched the ,7/ part of www.wikipedia.org,7/20/2021,567 and that will be replaced with 0,7/, so the string becomes - www.wikipedia.org0,7/20/2021,567. But that's not what we want, as the 0 lands before the comma instead of after it.
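The misplaced 0 is easy to reproduce on a single row. This sketch uses only the `g` flag so that each line is printed once; `row` and `with_amp` are illustrative names.

```shell
# The unpadded sample row.
row='www.wikipedia.org,7/20/2021,567'

# & stands for the whole match (",7/"), so the 0 is placed before the comma.
with_amp=$(printf '%s\n' "$row" | sed 's/,[1-9]\//0&/g')

printf '%s\n' "$with_amp"
```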
Luckily, sed also lets us split our pattern into sections and refer back to a specific part of the pattern. Hence, we can create separate references to each section of our pattern. We do this by enclosing that section in \( ... \) .

We can then refer to these sections in the substitution part by using \{num} to refer to that section. To refer to the first section, we use \1 .
In this case, we can create two references, one to the initial comma and a second to the actual number with the forward slash. To do that - ,[1-9]\/ becomes - \(,\)\([1-9]\/\) . Hence, we created two sections to refer to over here.

In the substitution, we need \10\2 . Sed reads this as \1, a literal 0, and then \2 (back-references only go from \1 to \9). This will add the first matched section, insert 0 and then add the second matched section.

Hence, our final sed script becomes - 's/\(,\)\([1-9]\/\)/\10\2/g'
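Before touching a real file, the script can be checked against the two sample rows on standard input. This is a minimal sketch; `sample` and `fixed` are names chosen for the example.

```shell
# The two sample rows from the Issue section.
sample='www.google.com,07/12/2021,123
www.wikipedia.org,7/20/2021,567'

# \1 is the comma, \2 is the digit plus slash; a literal 0 goes between them.
fixed=$(printf '%s\n' "$sample" | sed 's/\(,\)\([1-9]\/\)/\10\2/g')

printf '%s\n' "$fixed"
```

The first row is left as it is and the second becomes `www.wikipedia.org,07/20/2021,567`. Note that only the month is padded: a single-digit day, as in 7/4/2021, follows a slash and not a comma, so it would need its own pattern.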
Final Solution

sed -i 's/\(,\)\([1-9]\/\)/\10\2/g' domains.csv

Here -i makes sed edit domains.csv in place, and the substitution refers to each section of our matched pattern and adds 0 between them.
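Since an in-place edit overwrites the file, it can help to try it on a throwaway copy first. The sketch below assumes a sed that accepts a backup suffix attached to `-i` (the `-i.bak` form, which to my knowledge works with both GNU and BSD sed) and a `mktemp` that supports `-d`; the temporary directory and the variable names are made up for illustration.

```shell
# Work in a throwaway directory so no real file is touched.
workdir=$(mktemp -d)
printf '%s\n' \
  'www.google.com,07/12/2021,123' \
  'www.wikipedia.org,7/20/2021,567' > "$workdir/domains.csv"

# -i.bak edits in place and keeps the original as domains.csv.bak.
sed -i.bak 's/\(,\)\([1-9]\/\)/\10\2/g' "$workdir/domains.csv"

after=$(cat "$workdir/domains.csv")
before=$(cat "$workdir/domains.csv.bak")

printf 'after:\n%s\n\nbackup:\n%s\n' "$after" "$before"
```

Once the output looks right, the `.bak` file can be deleted, or the plain `-i` form can be used if a backup is not needed.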
References


SED Man Page
SED Tutorialspoint
SED Linuxize