@BigAN
Created December 19, 2014 10:19
return bigram python
You could do this with a positive lookahead:
>>> import re
>>> s = "My name is really nice. This is so awesome."
>>> m = re.findall(r'(?=(\b\w+\b \S+))', s)
>>> m
['My name', 'name is', 'is really', 'really nice.', 'This is', 'is so', 'so awesome.']
Pattern Explanation:
(?=...) A lookahead is a zero-width assertion, like the start and end of a line or a word boundary. It does not consume any characters in the string; it only asserts whether a match is possible at the current position.
() Capturing group. It captures the characters matched by the pattern inside the parentheses.
\b Word boundary. It matches between a word character and a non-word character.
\w+ Matches one or more word characters.
\S+ Matches one or more non-whitespace characters. The literal space before it in the pattern matches the single space between the two words.
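To see the pieces side by side, here is a minimal sketch of the same pattern written with re.VERBOSE so each token carries its own comment. The names BIGRAM and bigrams are only illustrative, and the literal space is written as [ ] because bare whitespace is ignored in verbose mode.

```python
import re

# The same pattern as in the answer, laid out with re.VERBOSE.
# BIGRAM is an illustrative name, not something from the original answer.
BIGRAM = re.compile(r"""
    (?=            # lookahead: assert here without consuming characters
      (            # group 1: the text that findall returns
        \b\w+\b    # one whole word
        [ ]        # a literal space (bare whitespace is ignored in verbose mode)
        \S+        # the following run of non-whitespace characters
      )
    )
""", re.VERBOSE)

s = "My name is really nice. This is so awesome."
bigrams = BIGRAM.findall(s)
print(bigrams)
```

This should print the same list as the one-line pattern. Note that 'nice. This' is not in it, because the word "nice" is followed by a period rather than a space.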
re.findall returns the contents of the capturing groups when the pattern contains any; if there are no capturing groups, it returns the full matches. In our case it returns the characters captured by group 1. Because a lookahead consumes nothing, the search can match again at the very next position, so to match overlapping substrings you need to put the pattern inside a lookahead.
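A small comparison may make the difference concrete. This is only a sketch using the same sample string; the variable names consumed, overlapping and ungrouped are illustrative.

```python
import re

s = "My name is really nice. This is so awesome."

# No lookahead: each match consumes its text, so the scan resumes after it
# and every second pair is skipped.
consumed = re.findall(r'\b\w+\b \S+', s)

# Lookahead with a capturing group: nothing is consumed, so pairs can overlap,
# and findall returns what group 1 captured.
overlapping = re.findall(r'(?=(\b\w+\b \S+))', s)

# Lookahead without a group: the matches themselves are empty strings.
ungrouped = re.findall(r'(?=\b\w+\b \S+)', s)

print(consumed)     # ['My name', 'is really', 'This is', 'so awesome.']
print(overlapping)  # ['My name', 'name is', 'is really', 'really nice.', 'This is', 'is so', 'so awesome.']
print(ungrouped)    # ['', '', '', '', '', '', '']
```

The third call shows why the capturing group is needed: the lookahead succeeds at the same seven positions, but with no group there is nothing for findall to return except the zero-length matches.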