
@aMcCode
Last active August 15, 2022 06:32
Using Regex to Find All Text in Single Quotes

Matching All Text in Single Quotes

In my daily work, I am required to parse text generated by various systems. One such system generates a data entry audit trail with output that looks like the text below. When I need to compare the old value entered to the new value entered for a given datapoint, I search the audit trail for the latest entry prior to the current entry based on an audit date. Once I have that specific audit entry, I search for the specific data entered. Since I can count on the validated system that produces the audit trail to output the audit entry in a specific way, i.e. "User entered 'some value'", I can use regex to extract the value. I also need to find where the user has entered an empty value.

The purpose of this tutorial is to detail how the regex (regular expression) pattern for this specific search, ((?<=\').*(?=\'))|(User entered empty\.), works.

Sample Text

User entered '01 Jan 2019'
User entered 'Grade 2: Moderate (GRADE 2)'
User entered 'Not Recovered/Not Resolved (NOT RECOVERED/NOT RESOLVED)'
User entered 'Recovered/Resolved with Sequelae (RECOVERED/RESOLVED WITH SEQUELAE)' reason for change: Data Entry Error
User opened query 'It is indicated that outcome is Recovered, Recovered with Sequelae, or Fatal, but ongoing is checked. Please review and correct.' (Site from System).
User entered '1'
User entered 'Yes (Y)' reason for change: Data Entry Error
User entered 'No (N)'
User entered empty.
User entered 'Participant was advised to discontinue use of antacids and to see primary care physician.' reason for change: Data Entry Error

Summary

The regex pattern, ((?<=\').*(?=\'))|(User entered empty\.), is short and relatively simple, but it uses several common components of regex search patterns. Again, this specific pattern can be used to find any text that falls within single quotes or where the user has entered an empty value.
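
As a rough illustration of the summary above, the sketch below runs the pattern over three of the sample lines using Python's `re` module. The helper name `extract_value` and the choice of Python are assumptions made for this example; the source does not say which language or tool the pattern is used with.

```python
import re

# The pattern from this tutorial, written as a raw string.
PATTERN = re.compile(r"((?<=\').*(?=\'))|(User entered empty\.)")

# Three lines copied from the sample text.
SAMPLE_LINES = [
    "User entered '01 Jan 2019'",
    "User entered 'Yes (Y)' reason for change: Data Entry Error",
    "User entered empty.",
]


def extract_value(line):
    """Return the matched text for one audit line, or None if nothing matches.

    Hypothetical helper, named only for this example.
    """
    match = PATTERN.search(line)
    return match.group(0) if match else None


extracted = [extract_value(line) for line in SAMPLE_LINES]
for line, value in zip(SAMPLE_LINES, extracted):
    print(f"{line!r} -> {value!r}")
# extracted == ['01 Jan 2019', 'Yes (Y)', 'User entered empty.']
```

Note that the trailing "reason for change" text is left out of the second result because the match ends just before the last single quote on the line.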

Regex Components

Anchors

Anchors are used to indicate the start and end of a string. There is no need to use anchors, such as ^ and $, for this pattern since the validated system removes hard returns within user entries before saving them to the audit trail.

Quantifiers

The only quantifier used in this pattern is *. This character matches 0 or more of the preceding token, which in our case is the . character. We use . to match any character except line breaks.
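
The two behaviours described above, "0 or more" and "any character except line breaks", can be checked with a small Python sketch. Both input strings are made up for this example and do not appear in the sample text.

```python
import re

# Only the quoted-text half of the pattern, to isolate the quantifier.
QUOTED = re.compile(r"(?<=\').*(?=\')")

# `*` allows zero repetitions, so an empty pair of quotes still matches
# (hypothetical input, not taken from the audit trail).
empty_quotes = QUOTED.search("User entered ''")

# `.` stops at a line break, so a value split across two lines is not
# matched (also a hypothetical input).
split_value = QUOTED.search("User entered 'first line\nsecond line'")

print(repr(empty_quotes.group(0)))  # ''
print(split_value)                  # None
```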

OR Operator

The OR operator, |, is used so that we can capture the one user entry that does not contain single quotes. If we want to know when the user entered an empty value, we have to search for the literal text "User entered empty.".
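
A minimal Python sketch of what the alternation adds: the quoted-text half on its own finds nothing on the "empty" line, while the full pattern does. The variable names are chosen for this example only.

```python
import re

QUOTED_ONLY = re.compile(r"(?<=\').*(?=\')")
FULL = re.compile(r"((?<=\').*(?=\'))|(User entered empty\.)")

line = "User entered empty."

without_or = QUOTED_ONLY.search(line)  # no single quotes, so no match
with_or = FULL.search(line)            # second alternative matches

print(without_or)        # None
print(with_or.group(0))  # User entered empty.
```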

Grouping and Capturing

Capturing groups are contained within parentheses. They group multiple tokens together to create a capture group for extracting a substring from longer text. We have two separate capturing groups in this regex. The first is ((?<=\').*(?=\')) and it captures any text between the first and last single quote on a line. The second group is (User entered empty\.). It captures text matching exactly what is in the parentheses; the Dot character is escaped so that it matches only a literal period rather than any character at the end of the group.
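
Because the two alternatives sit in separate capturing groups, code that uses the pattern can tell which one matched. The sketch below assumes Python's `re` module; the function `classify` and the line "User logged in" are hypothetical and used only for illustration.

```python
import re

FULL = re.compile(r"((?<=\').*(?=\'))|(User entered empty\.)")


def classify(line):
    """Report which capturing group matched (hypothetical helper)."""
    match = FULL.search(line)
    if match is None:
        return ("no match", None)
    # Group 1 is the quoted value; compare against None so that an
    # empty string between quotes still counts as a value.
    if match.group(1) is not None:
        return ("value", match.group(1))
    # Otherwise group 2, the literal "User entered empty." text, matched.
    return ("empty", match.group(2))


print(classify("User entered '1'"))        # ('value', '1')
print(classify("User entered empty."))     # ('empty', 'User entered empty.')
print(classify("User logged in"))          # ('no match', None)
# The escaped dot only matches a literal period:
print(classify("User entered empty!"))     # ('no match', None)
```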

Greedy and Lazy Match

Because we are using .*, the first group is a greedy match: it takes as many characters as possible, so on a line with more than two single quotes it runs from the first quote to the last. The . matches any character and * repeats it as many times as possible. If we were to change our pattern to a lazy match by adding ? after *, we would get the 3 separate matches listed below when searching the text 'The patient indicated they had 'multiple headaches' during this time', because it contains multiple sets of single quotes.

The patient indicated they had
multiple headaches
during this time
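
The difference can be reproduced with a short Python sketch that runs the greedy and lazy forms of the quoted-text half against the example sentence. `re.findall` is used here as one convenient way to list every match; other engines may report matches differently.

```python
import re

text = "'The patient indicated they had 'multiple headaches' during this time'"

# Greedy: one match, running from the first quote to the last.
greedy = re.findall(r"(?<=\').*(?=\')", text)

# Lazy: each match stops at the next single quote.
lazy = re.findall(r"(?<=\').*?(?=\')", text)

print(greedy)
# ["The patient indicated they had 'multiple headaches' during this time"]
print(lazy)
# ['The patient indicated they had ', 'multiple headaches', ' during this time']
```

The lazy results keep the spaces next to the inner quotes, so they may need trimming before being compared.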

Look-ahead and Look-behind

(?<= is a positive look-behind. It checks that the text immediately before the current position matches the enclosed expression, here a single quote, without including that text in the result. If we matched the quote directly instead, the first quotation mark would be included in every result. Similarly, we include (?=, a positive look-ahead, to check for the last quotation mark and exclude it from the result. It is important to note that not all browsers or regex engines support these constructs, look-behind in particular, so support should be researched prior to implementing a regex search in this fashion.
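
To see what the look-arounds contribute, the Python sketch below compares the pattern with a variant that matches the quotes directly. The second pattern, `\'.*\'`, is not part of the original regex and is shown only for contrast.

```python
import re

line = "User entered 'No (N)'"

# With look-behind and look-ahead: the quotes are required but not returned.
with_lookarounds = re.search(r"(?<=\').*(?=\')", line).group(0)

# Comparison variant without look-arounds: the quotes become part of the match.
without_lookarounds = re.search(r"\'.*\'", line).group(0)

print(with_lookarounds)     # No (N)
print(without_lookarounds)  # 'No (N)'
```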

Author

I am Alicia McNeil, a software developer aspiring to become a web developer. Please see my work on GitHub.
