Created April 9, 2023 13:07
Extract the character positions of given spans within a given text. Used as a precursor to converting named entities identified by other processes into a spaCy NER format.
from typing import List, Tuple


def extract_span_start_end_positions(
    text: str, spans: List[str]
) -> List[Tuple[str, int, int]]:
    """
    Extract positions of indicated spans from indicated text.

    Adapted from : https://www.programcreek.com/python/?CodeExample=convert+to+spans

    Args:
        text: The string to be searched
        spans: The spans of interest within the string, listed in the order
            they appear in the text. Can be single or multiple contiguous
            words.

    Returns:
        A list of (span, start, end) tuples mapping each span to its
        character indices in the text, so that text[start:end] == span.
        Spans not found at or after the current search position are skipped.
    """
    cur_idx = 0
    spans_w_positions = []
    for span in spans:
        # Search only from the end of the previous match, so repeated spans
        # resolve to successive occurrences.
        start = text.find(span, cur_idx)
        if start == -1:
            # str.find returns -1 when the span is absent; skip it rather
            # than record a bogus position and corrupt later searches.
            continue
        end = start + len(span)
        spans_w_positions.append((span, start, end))
        cur_idx = end
    return spans_w_positions
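
The description mentions converting entities found by other processes into a spaCy NER format. A minimal sketch of that follow-on step is shown here, assuming the target is the `(text, {"entities": [(start, end, label)]})` tuple layout commonly used for spaCy training data. The helper name `to_entity_offsets`, the labels, and the sample sentence are hypothetical and not part of the gist; the span search is repeated inline so that the block runs on its own.

```python
from typing import Dict, List, Tuple

# Hypothetical helper, not part of the gist: pairs each located span with a
# label and emits (start, end, label) offsets.
def to_entity_offsets(
    text: str, labeled_spans: List[Tuple[str, str]]
) -> Tuple[str, Dict[str, List[Tuple[int, int, str]]]]:
    """labeled_spans holds (span, label) pairs in order of appearance."""
    entities = []
    cur_idx = 0
    for span, label in labeled_spans:
        start = text.find(span, cur_idx)
        if start == -1:
            continue  # span not present after the previous match; skip it
        end = start + len(span)
        entities.append((start, end, label))
        cur_idx = end
    return text, {"entities": entities}


# Assumed sample input, for illustration only.
text = "Apple hired Tim Cook in Cupertino."
example = to_entity_offsets(
    text, [("Apple", "ORG"), ("Tim Cook", "PERSON"), ("Cupertino", "GPE")]
)
print(example)
```

Because the search always resumes from the end of the previous match, the spans must be supplied in the order they occur in the text; an out-of-order span is skipped in this sketch.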