Exported data contains dates in two formats, i.e. 07/12/2021 and 7/20/2021, which makes it a little trickier to do analysis on the data.
Sample data -
www.google.com,07/12/2021,123
www.wikipedia.org,7/20/2021,567
Using sed and regex, detect those dates and make them uniform.
We cannot detect the date using [1-9]\/ alone, as this will match both 07/12/2021 (the 7/ and the 2/) and 7/20/2021 (the 7/). Hence, we need to add a leading , (comma) to give more specificity.
Note that in \/ the \ (backslash) is used to escape the next character, i.e. / (forward slash), so that it is matched literally instead of being read as the delimiter that separates the parts of a sed expression.
Hence, the regex we can use to detect only 7/20/2021, and not other dates such as 07/12/2021, is ,[1-9]\/ with the , added at the start.
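As a quick check of the detection step, the sketch below runs the two sample rows through sed and prints only the lines the pattern matches. It assumes a POSIX-compatible sed; the variable names sample and matched are only there for illustration.

```shell
# The two sample rows, held in a variable for illustration.
sample='www.google.com,07/12/2021,123
www.wikipedia.org,7/20/2021,567'

# -n suppresses sed's default output; /pattern/p prints only the matching lines.
matched=$(printf '%s\n' "$sample" | sed -n '/,[1-9]\//p')
printf '%s\n' "$matched"
```

Only the wikipedia row should come back, since its date is the one that starts with a single digit right after the comma.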
Now that we have detected the string we need to change using regex, we need a way to edit it. sed is a stream editor: it processes text in a single pass, which makes it efficient. Its output can also be piped into other commands to filter your text further.
sed also supports regex, addresses (selecting lines by their location, ex - the header only), and substitution, among other features.
We will use regex with substitution to come up with our solution.
A sed substitution with regex follows this format - sed '<start>/<pattern>/<substitution>/<final>'. Ex - sed '10s/[a-z]/1/g'
<start>: Denotes which line to target. Adding s means a substitution will be performed; without it, the pattern is only used for matching. In the example above, the 10s at the start restricts the substitution to line 10.
<pattern>: This is the regex pattern to search for.
<substitution>: This is what we want to replace the match with.
<final>: Any additional flags, for example g to replace every match on the line, or p to print the line after the substitution. (Deleting lines is done with the separate d command, ex - sed '/<pattern>/d'.)
In this case, sed '10s/[a-z]/1/g' will replace every lowercase letter with 1 on line 10.
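To see the pieces of that format in action, here is a minimal sketch that applies the same kind of substitution to the sample data. Since the sample only has two rows, it targets line 2 instead of line 10; the variable names are illustrative.

```shell
sample='www.google.com,07/12/2021,123
www.wikipedia.org,7/20/2021,567'

# 2s limits the substitution to line 2; g replaces every match on that line.
masked=$(printf '%s\n' "$sample" | sed '2s/[a-z]/1/g')
printf '%s\n' "$masked"
```

Line 1 passes through unchanged, while every lowercase letter on line 2 is replaced with 1.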
While substituting, one can use & to refer to the matched text. Ex - cat domains.csv | sed 's/,[1-9]\//0&/gp'. (Because of the p flag, each changed line is printed twice unless sed is run with -n.)
This will add 0 at the start of the matched text. In this case we matched the ,7/ part of www.wikipedia.org,7/20/2021,567 and that will be replaced with 0,7/, so the string becomes - www.wikipedia.org0,7/20/2021,567. But that's not what we want.
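The effect of & can be reproduced on a single row, as in the sketch below. It adds -n so that the p flag prints the changed line once rather than twice; the variable names row and with_amp are illustrative.

```shell
row='www.wikipedia.org,7/20/2021,567'

# & stands for the whole match (,7/), so the 0 lands in front of the comma.
with_amp=$(printf '%s\n' "$row" | sed -n 's/,[1-9]\//0&/gp')
printf '%s\n' "$with_amp"
```

The 0 ends up before the comma, which is why a way to address the comma and the digit separately is needed.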
Luckily, sed also lets us split the pattern into groups so that we can refer back to a specific part of it. Hence, we can create separate references to each section of our pattern. We do this by enclosing that section in \( ... \).
We can then refer to these sections in the substitution part by using \{num}. To refer to the first section, we use \1.
In this case, we can create two references, one to the initial comma and a second to the actual number with the forward slash. To do that, ,[1-9]\/ becomes \(,\)\([1-9]\/\). Hence, we created two sections to refer to over here.
In the substitution, we need \10\2. This will add the first matched section, insert 0 and then add the second matched section. (sed only supports references \1 through \9, so \10 is read as \1 followed by a literal 0.)
Hence, our final sed script becomes - 's/\(,\)\([1-9]\/\)/\10\2/g'
sed -i 's/\(,\)\([1-9]\/\)/\10\2/g' domains.csv
Here we refer to each section of our matched pattern and add 0 between those.
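The whole flow can be tried end to end without touching a real file. The sketch below is one way to do it, assuming a POSIX-compatible sed and mktemp; it writes the result to a new file instead of using -i, because GNU sed accepts -i on its own while BSD/macOS sed expects a backup suffix argument (ex - sed -i ''). The scratch directory and the name domains_fixed.csv are illustrative.

```shell
# Work in a scratch directory so no real file is modified.
workdir=$(mktemp -d)
printf '%s\n' 'www.google.com,07/12/2021,123' \
              'www.wikipedia.org,7/20/2021,567' > "$workdir/domains.csv"

# \1 is the comma, \2 is the digit plus slash; a 0 is inserted between them.
sed 's/\(,\)\([1-9]\/\)/\10\2/g' "$workdir/domains.csv" > "$workdir/domains_fixed.csv"

fixed=$(cat "$workdir/domains_fixed.csv")
printf '%s\n' "$fixed"

# Clean up the scratch directory.
rm -r "$workdir"
```

Both rows should now carry a two-digit month. Note that this pattern only pads a single-digit month that follows a comma; a single-digit day (ex - 7/5/2021) would need its own substitution.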