Created
December 2, 2019 20:42
rjurney/57409f42ba0f47985952007afd7c771f
import pandas as pd
import spacy

# Assumptions not shown in the gist: `nlp` is a pretrained English
# pipeline (the model name below is a guess) and `df` already holds the
# documents in a '_Body' text column.
nlp = spacy.load('en_core_web_sm')

window = 5
candidates = []
for index, row in df.iterrows():
    doc = nlp(row['_Body'])
    for ent in doc.ents:
        rec = {}
        rec['body'] = doc.text
        rec['entity'] = ent  # raw Span object; stringified on CSV export
        rec['entity_text'] = ent.text
        rec['entity_start'] = ent.start
        rec['entity_end'] = ent.end
        # Up to window + 1 tokens of left context (the extra -1 widens
        # the window by one token, matching the output below)
        left_token_start = max(0, ent.start - 1 - window)
        rec['left_tokens_text'] = [x.text for x in doc[left_token_start:ent.start]]
        # Up to window tokens of right context; Doc slices clamp at the
        # end of the document, so no min() against len(doc) is needed
        rec['right_tokens_text'] = [x.text for x in doc[ent.end:ent.end + window]]
        rec['ent_type'] = ent.label_
        rec['wikidata_id'] = ent.kb_id  # 0 unless an entity linker set a KB id
        rec['original_index'] = index
        rec['label'] = 0  # placeholder label for later annotation
        candidates.append(rec)

df_out = pd.DataFrame(candidates)
df_out = df_out.sort_index()  # reindex() with no arguments was a no-op
df_out.to_csv('../../data/text_extractions.one_file.df_out.csv')
df_out.head()
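The window slicing can be sanity-checked without loading a spaCy model, since Python slices clamp at the sequence bounds. This sketch replays the "Jstack" row of the output below on a hand-tokenized copy of that question (the token list and span indices are transcribed from the table, not computed by spaCy):

```python
window = 5
tokens = ['Python', 'equivalent', 'of', 'Jstack', '?', 'Is', 'there',
          'a', 'python', 'equivalent', 'of', 'jstack', '?']
ent_start, ent_end = 3, 4  # the "Jstack" span, as in row 1 of the output

left_start = max(0, ent_start - 1 - window)  # up to window + 1 tokens back
left = tokens[left_start:ent_start]
right = tokens[ent_end:ent_end + window]     # up to window tokens forward

print(left)   # ['Python', 'equivalent', 'of']
print(right)  # ['?', 'Is', 'there', 'a', 'python']
```

Here the left window is shorter than window + 1 only because the entity sits near the start of the document; `max(0, ...)` handles that edge.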
index | body | entity_text | left_tokens_text | right_tokens_text | entity_start | ent_type | wikidata_id | entity_end | original_index
---|---|---|---|---|---|---|---|---|---
0 | BerkeleyDB Concurrency What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support? How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention? I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency. My application is pretty simple, I'll be doing gets and puts of records that are about 1KB each. No cursors, no deleting. | C++ | ['optimal', 'level', 'of', 'concurrency', 'that', 'the'] | ['implementation', 'of', 'BerkeleyDB', 'can', 'reasonably'] | 12 | LANGUAGE | 0 | 13 | 0
1 | Python equivalent of Jstack? Is there a python equivalent of jstack? I've got a hung process and I really want to see what it's up to because I have yet to reproduce the defect in development. | Jstack | ['Python', 'equivalent', 'of'] | ['?', 'Is', 'there', 'a', 'python'] | 3 | PERSON | 0 | 4 | 1
2 | UTF-8 In Python logging, how? I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example: This explodes with UnicodeDecodeError on the logging.info() call. At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this: When it should be doing this: Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation. | Python | ['encoded', 'string', 'to', 'a', 'file', 'using'] | ["'s", 'logging', 'package', '.', ' '] | 20 | ORG | 0 | 21 | 2
3 | UTF-8 In Python logging, how? I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example: This explodes with UnicodeDecodeError on the logging.info() call. At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this: When it should be doing this: Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation. | Python | ['\n', 'At', 'a', 'lower', 'level', ','] | ["'s", 'logging', 'package', 'is', 'using'] | 49 | ORG | 0 | 50 | 2
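One caveat when reusing the CSV written above: the left/right token columns hold Python lists, and pandas' `to_csv` serializes them as their string repr, so they have to be parsed on the way back in. A minimal standard-library sketch:

```python
import ast

# A list-valued cell lands in the CSV as str(list), e.g.
cell = str(['Python', 'equivalent', 'of'])

# Recover the actual list when reading the CSV back
tokens = ast.literal_eval(cell)
print(tokens)  # ['Python', 'equivalent', 'of']
```

`ast.literal_eval` is safe for this because it only evaluates Python literals, never arbitrary expressions.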