@rjurney
Created December 2, 2019 20:42
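The snippet below walks a DataFrame of question bodies, runs spaCy NER over each one, and emits a candidate record per entity with a window of surrounding tokens. It assumes `nlp` and `df` already exist; a minimal setup sketch follows, in which the model name and input path are illustrative rather than taken from the gist:

import pandas as pd
import spacy

# Any pipeline with an NER component works; the model name is an assumption
nlp = spacy.load('en_core_web_sm')
# The input must have a '_Body' column holding the post text; this path is hypothetical
df = pd.read_csv('questions.csv')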
window = 5  # tokens of context to keep on each side of an entity

candidates = []
for index, row in df.iterrows():  # `nlp` and `df` come from the setup above
    doc = nlp(row['_Body'])
    for ent in doc.ents:
        rec = {}
        rec['body'] = doc.text
        rec['entity'] = ent  # the Span itself; serializes to its text in the CSV
        rec['entity_text'] = ent.text
        rec['entity_start'] = ent.start
        rec['entity_end'] = ent.end
        # Left context: the extra -1 keeps window + 1 tokens, matching the sample output below
        left_token_start = max(0, ent.start - 1 - window)
        rec['left_tokens_text'] = [x.text for x in doc[left_token_start:ent.start]]
        # Right context: up to `window` tokens; clamp to len(doc), not len(doc) - 1,
        # so the document's final token is not silently dropped
        right_token_end = min(ent.end + window, len(doc))
        rec['right_tokens_text'] = [x.text for x in doc[ent.end:right_token_end]]
        rec['ent_type'] = ent.label_
        rec['wikidata_id'] = ent.kb_id  # 0 unless an entity linker has run; kb_id_ is the string form
        rec['original_index'] = index
        rec['label'] = 0  # placeholder for downstream annotation
        candidates.append(rec)

df_out = pd.DataFrame(candidates)
df_out.to_csv('../../data/text_extractions.one_file.df_out.csv')
df_out.head()
Sample output from df_out.head():

index body entity_text left_tokens_text right_tokens_text entity_start ent_type wikidata_id entity_end original_index
0 BerkeleyDB Concurrency What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support? How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention? I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency. My application is pretty simple, I'll be doing gets and puts of records that are about 1KB each. No cursors, no deleting. C++ ['optimal', 'level', 'of', 'concurrency', 'that', 'the'] ['implementation', 'of', 'BerkeleyDB', 'can', 'reasonably'] 12 LANGUAGE 0 13 0
1 Python equivalent of Jstack? Is there a python equivalent of jstack? I've got a hung process and I really want to see what it's up to because I have yet to reproduce the defect in development. Jstack ['Python', 'equivalent', 'of'] ['?', 'Is', 'there', 'a', 'python'] 3 PERSON 0 4 1
2 UTF-8 In Python logging, how? I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example: This explodes with UnicodeDecodeError on the logging.info() call. At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this: When it should be doing this: Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation. Python ['encoded', 'string', 'to', 'a', 'file', 'using'] ["'s", 'logging', 'package', '.', ' '] 20 ORG 0 21 2
3 UTF-8 In Python logging, how? I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example: This explodes with UnicodeDecodeError on the logging.info() call. At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this: When it should be doing this: Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation. Python ['\n', 'At', 'a', 'lower', 'level', ','] ["'s", 'logging', 'package', 'is', 'using'] 49 ORG 0 50 2
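The wikidata_id column is 0 in every row because ent.kb_id is only populated when an entity-linking component has run; ent.kb_id_ would give the string form. To sanity-check the context-window logic, here is a quick sketch over a made-up sentence (the detected entities and labels will vary with the model):

# Toy check of the window slicing above; the sentence is invented for illustration
doc = nlp('Is there a Python equivalent of jstack for a hung process?')
for ent in doc.ents:
    left = [t.text for t in doc[max(0, ent.start - 1 - window):ent.start]]
    right = [t.text for t in doc[ent.end:min(ent.end + window, len(doc))]]
    print(ent.text, ent.label_, left, right)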