Created
December 2, 2019 20:42
rjurney/57409f42ba0f47985952007afd7c771f
import pandas as pd
import spacy

# Assumptions not shown in the gist: `nlp` is a pretrained English
# pipeline (the model name below is a guess) and `df` already holds the
# documents in a '_Body' text column.
nlp = spacy.load('en_core_web_sm')

window = 5
candidates = []
for index, row in df.iterrows():
    doc = nlp(row['_Body'])
    for ent in doc.ents:
        rec = {}
        rec['body'] = doc.text
        rec['entity'] = ent  # raw Span object; stringified on CSV export
        rec['entity_text'] = ent.text
        rec['entity_start'] = ent.start
        rec['entity_end'] = ent.end
        # Up to window + 1 tokens of left context (the extra -1 widens
        # the window by one token, matching the output below)
        left_token_start = max(0, ent.start - 1 - window)
        rec['left_tokens_text'] = [x.text for x in doc[left_token_start:ent.start]]
        # Up to window tokens of right context; Doc slices clamp at the
        # end of the document, so no min() against len(doc) is needed
        rec['right_tokens_text'] = [x.text for x in doc[ent.end:ent.end + window]]
        rec['ent_type'] = ent.label_
        rec['wikidata_id'] = ent.kb_id  # 0 unless an entity linker set a KB id
        rec['original_index'] = index
        rec['label'] = 0  # placeholder label for later annotation
        candidates.append(rec)

df_out = pd.DataFrame(candidates)
df_out = df_out.sort_index()  # reindex() with no arguments was a no-op
df_out.to_csv('../../data/text_extractions.one_file.df_out.csv')
df_out.head()
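The window slicing can be sanity-checked without loading a spaCy model, since Python slices clamp at the sequence bounds. This sketch replays the "Jstack" row of the output below on a hand-tokenized copy of that question (the token list and span indices are transcribed from the table, not computed by spaCy):

```python
window = 5
tokens = ['Python', 'equivalent', 'of', 'Jstack', '?', 'Is', 'there',
          'a', 'python', 'equivalent', 'of', 'jstack', '?']
ent_start, ent_end = 3, 4  # the "Jstack" span, as in row 1 of the output

left_start = max(0, ent_start - 1 - window)  # up to window + 1 tokens back
left = tokens[left_start:ent_start]
right = tokens[ent_end:ent_end + window]     # up to window tokens forward

print(left)   # ['Python', 'equivalent', 'of']
print(right)  # ['?', 'Is', 'there', 'a', 'python']
```

Here the left window is shorter than window + 1 only because the entity sits near the start of the document; `max(0, ...)` handles that edge.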
index | body | entity_text | left_tokens_text | right_tokens_text | entity_start | ent_type | wikidata_id | entity_end | original_index
---|---|---|---|---|---|---|---|---|---
0 | BerkeleyDB Concurrency What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support? How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention? I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency. My application is pretty simple, I'll be doing gets and puts of records that are about 1KB each. No cursors, no deleting. | C++ | ['optimal', 'level', 'of', 'concurrency', 'that', 'the'] | ['implementation', 'of', 'BerkeleyDB', 'can', 'reasonably'] | 12 | LANGUAGE | 0 | 13 | 0
1 | Python equivalent of Jstack? Is there a python equivalent of jstack? I've got a hung process and I really want to see what it's up to because I have yet to reproduce the defect in development. | Jstack | ['Python', 'equivalent', 'of'] | ['?', 'Is', 'there', 'a', 'python'] | 3 | PERSON | 0 | 4 | 1
2 | UTF-8 In Python logging, how? I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example: This explodes with UnicodeDecodeError on the logging.info() call. At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this: When it should be doing this: Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation. | Python | ['encoded', 'string', 'to', 'a', 'file', 'using'] | ["'s", 'logging', 'package', '.', ' '] | 20 | ORG | 0 | 21 | 2
3 | UTF-8 In Python logging, how? I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example: This explodes with UnicodeDecodeError on the logging.info() call. At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this: When it should be doing this: Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation. | Python | ['\n', 'At', 'a', 'lower', 'level', ','] | ["'s", 'logging', 'package', 'is', 'using'] | 49 | ORG | 0 | 50 | 2
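One caveat when reusing the CSV written above: the left/right token columns hold Python lists, and pandas' `to_csv` serializes them as their string repr, so they have to be parsed on the way back in. A minimal standard-library sketch:

```python
import ast

# A list-valued cell lands in the CSV as str(list), e.g.
cell = str(['Python', 'equivalent', 'of'])

# Recover the actual list when reading the CSV back
tokens = ast.literal_eval(cell)
print(tokens)  # ['Python', 'equivalent', 'of']
```

`ast.literal_eval` is safe for this because it only evaluates Python literals, never arbitrary expressions.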