@izuna385
Last active June 23, 2019 09:24
import spacy  # version 2.1.4
import scispacy

nlp = spacy.load('en_core_sci_md')


def prevent_sentence_boundaries(doc):
    """Pipeline component: forbid sentence starts at tokens that cannot begin a sentence."""
    for token in doc:
        if not can_be_sentence_start(token, doc):
            # False forbids a boundary here; other tokens are left for the parser to decide.
            token.is_sent_start = False
    return doc


def can_be_sentence_start(token, doc):
    """Return True if `token` is allowed to start a sentence."""
    if token.i == 0:
        return True
    # We're not checking for is_title here to ignore arbitrary titlecased
    # tokens within sentences
    # elif token.is_title:
    #     return True
    prev = token.nbor(-1)
    # If the previous token and this one appear joined in the original text
    # (e.g. '.' + 'I2012T' inside 'p.I2012T'), this token cannot start a sentence.
    # The original version also tested the following token, but every path through
    # that branch produced a falsy result, so the outcome is the same.
    if prev.text + token.text in doc.text:
        return False
    # Otherwise a sentence may start only after punctuation or whitespace.
    return prev.is_punct or prev.is_space


# original text
text = "Leucine-rich repeat kinase 2 (LRRK2) mutations are the most common genetic cause of Parkinson's disease (PD). However, only few cases carrying LRRK2 mutations have been reported in Taiwanese PD patients. We used targeted next generation sequencing (NGS), covering 24 candidate genes involved in neurodegenerative disorders, to analyze 40 probands with familial PD, and 10 patients with mixed neurodegenerative disorders. Sanger sequencing of the identified mutation in the first set of the study was performed in additional 270 PD patients, including 139 familial PD and 131 early-onset PD (onset age less than 50 years old), and 300 age/gender matched control subjects. We found a missense variant, p.I2012T, in the LRRK2 gene in one sporadic patient having early-onset frontotemporal dementia with parkinsonism and dystonia. Sanger sequencing this substitution in additional 270 PD patients in the second set of the study revealed two additional variant carriers: one having autosomal-dominant familial PD, and one with sporadic PD. The p.I2012T substitution was absent in 300 normal control subjects. Analyzing family members of the proband with p.I2012T revealed co-segregation of the variant and parkinsonism. Clinical presentations, levodopa responses, and (Tc99m)TRODAT - SPECT imaging findings of this index family were similar to idiopathic PD. Our results revealed clinical heterogeneity of the LRRK2 p.I2012T substitution, and demonstrated the use of targeted NGS for genetic diagnosis in multiplex families with PD or mixed neurodegenerative disorders."
doc = nlp(text)
for span in doc.sents:
    print(span.text)

# 'p.I2012T' must not contain a sentence boundary: it is a single token in PubTator format
# and no space follows the '.'. The default pipeline nevertheless splits after 'p.':
'''
Leucine-rich repeat kinase 2 (LRRK2) mutations are the most common genetic cause of Parkinson's disease (PD).
However, only few cases carrying LRRK2 mutations have been reported in Taiwanese PD patients.
We used targeted next generation sequencing (NGS), covering 24 candidate genes involved in neurodegenerative disorders, to analyze 40 probands with familial PD, and 10 patients with mixed neurodegenerative disorders.
Sanger sequencing of the identified mutation in the first set of the study was performed in additional 270 PD patients, including 139 familial PD and 131 early-onset PD (onset age less than 50 years old), and 300 age/gender matched control subjects.
We found a missense variant, p.
I2012T, in the LRRK2 gene in one sporadic patient having early-onset frontotemporal dementia with parkinsonism and dystonia.
Sanger sequencing this substitution in additional 270 PD patients in the second set of the study revealed two additional variant carriers: one having autosomal-dominant familial PD, and one with sporadic PD.
The p.
I2012T substitution was absent in 300 normal control subjects.
Analyzing family members of the proband with p.
I2012T revealed co-segregation of the variant and parkinsonism.
Clinical presentations, levodopa responses, and (Tc99m)TRODAT - SPECT imaging findings of this index family were similar to idiopathic PD.
Our results revealed clinical heterogeneity of the LRRK2 p.
I2012T substitution, and demonstrated the use of targeted NGS for genetic diagnosis in multiplex families with PD or mixed neurodegenerative disorders.
'''
nlp.add_pipe(prevent_sentence_boundaries, before='parser')
doc2 = nlp(text)
print('add rules')
# Tokens that appear joined in the original text (say, ['p', '.', 'I2012T'] -> 'p.I2012T')
# are flagged with is_sent_start = False, so no sentence boundary is placed inside them.
for span in doc2.sents:
    print(span.text)
'''
Leucine-rich repeat kinase 2 (LRRK2) mutations are the most common genetic cause of Parkinson's disease (PD).
However, only few cases carrying LRRK2 mutations have been reported in Taiwanese PD patients.
We used targeted next generation sequencing (NGS), covering 24 candidate genes involved in neurodegenerative disorders, to analyze 40 probands with familial PD, and 10 patients with mixed neurodegenerative disorders.
Sanger sequencing of the identified mutation in the first set of the study was performed in additional 270 PD patients, including 139 familial PD and 131 early-onset PD (onset age less than 50 years old), and 300 age/gender matched control subjects.
We found a missense variant, p.I2012T, in the LRRK2 gene in one sporadic patient having early-onset frontotemporal dementia with parkinsonism and dystonia.
Sanger sequencing this substitution in additional 270 PD patients in the second set of the study revealed two additional variant carriers: one having autosomal-dominant familial PD, and one with sporadic PD.
The p.I2012T substitution was absent in 300 normal control subjects.
Analyzing family members of the proband with p.I2012T revealed co-segregation of the variant and parkinsonism.
Clinical presentations, levodopa responses, and (Tc99m)TRODAT - SPECT imaging findings of this index family were similar to idiopathic PD.
Our results revealed clinical heterogeneity of the LRRK2 p.I2012T substitution, and demonstrated the use of targeted NGS for genetic diagnosis in multiplex families with PD or mixed neurodegenerative disorders.
'''
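
The substring test in can_be_sentence_start asks whether two neighbouring tokens appear joined anywhere in the text, so it can also fire when the same character sequence occurs elsewhere. A more local variant checks only the character immediately before each token. The sketch below shows that idea in plain Python so it runs without spaCy or the scispacy model. The toy tokenizer and the names tokenize_with_offsets and may_start_sentence are assumptions made for illustration; they are not part of the gist or of spaCy.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer (an assumption, not spaCy's): word runs and single
    punctuation marks, each paired with its start offset in `text`."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def may_start_sentence(tokens, i, text):
    """A token may start a sentence only if it is the first token
    or the character right before it is whitespace."""
    if i == 0:
        return True
    _, start = tokens[i]
    return text[start - 1].isspace()


text = "The p.I2012T substitution was absent. It co-segregated."
tokens = tokenize_with_offsets(text)

# Tokens attached to their left neighbour, which must not open a sentence.
blocked = [tok for i, (tok, _) in enumerate(tokens)
           if not may_start_sentence(tokens, i, text)]
print(blocked)
```

Here 'I2012T' is blocked because it directly follows '.', while 'It' remains a candidate because a space precedes it. In a spaCy pipeline the same check could probably be written with the previous token's `whitespace_` attribute instead of character offsets, though that variant has not been run against this gist.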