Chinese Example Sentence Scripts

Scripts for preparing a full-text-search SQLite database from an English-Chinese parallel corpus.

These scripts were written to prepare the FTS database for use in Otacon507's Chinese Example Sentences plugin: https://github.com/otacon507/zh-examples

The database importer will hopefully be rewritten in Python and integrated into the above repo so end-users can easily create their own databases. I originally used Dr. Xian Qian's 2MParallelCorpus (https://github.com/qxred/2MParallelCorpus). Please note that although this data is of generally high quality, it does contain some non-English/non-Chinese sentences (est. 3,000+ using langid), ~900+ HTML fragments, and an unknown number of nonsense sentence fragments (though probably not too many).

Regarding the nonsense sentence fragments, we partially avoid this issue in the plugin by sorting results by the distance of their sentence length from the average length of the result set. This is accomplished in two queries:

# Pseudocode (Ruby-flavored, using the sqlite3 gem):
str = '你好'
avg_len = db.get_first_value("SELECT AVG(LENGTH(zh))
                              FROM example_sentences WHERE zh MATCH ?", str)
results = db.execute("SELECT *
                      FROM example_sentences
                      WHERE zh MATCH ?
                      ORDER BY ABS(? - LENGTH(zh))", [str, avg_len])

In the plugin these results are then further processed into HTML lists and returned to the user's flashcard.
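
For illustration only (the plugin's actual formatting code lives in the zh-examples repo and may differ), a rough Ruby sketch of turning the `results` rows above into an HTML list might look like this:

require 'cgi'

# Hypothetical helper: wrap each [en, zh] result row in <li> tags for display on a card.
def format_results(results)
  items = results.map do |en, zh|
    "<li>#{CGI.escapeHTML(zh)}<br/><span class=\"en\">#{CGI.escapeHTML(en)}</span></li>"
  end
  "<ul>#{items.join}</ul>"
end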

Why go through all this trouble? Why not use an API?

When Otacon507 first approached me with the idea, I wrote a simple prototype for the plugin using data scraped from dict.cn: https://gist.github.com/Takemikazuchi/9907875

There are several issues with this approach: (1) dependency on an external data source with no guarantee of continued existence; (2) slow response times (an entire HTML page has to be downloaded just to get a couple of sentences); and (3) many characters/words did not have any sentences at all.

At the end of the day, relying on HTML scraping is too brittle to distribute in a plugin for end-users. After some extensive research and learning of the bilingual corpora used in linguistics research, we shifted focus to acquiring high-quality corpora with permissive licenses to distribute with the plugin. Although this comes with the space penalty of downloading large corpus files, we believe the speed, and the flexibility to process the data ourselves, are well worth it. We considered using the Glosbe API, but unfortunately most of their Chinese data comes from sources like the UN/EU corpora or extremely technical documents not suitable for educational use.

There are some issues with the corpora approach as well, though: many of the high-quality ones are behind paywalls or restricted to academic use, and conversely, finding high-quality free corpora with conversational sentences is extremely difficult.

Steps for generating the database

Step 1:

Prepare a CSV file with the following two-column format: english,chinese
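
For example, a hypothetical file might begin like this (the header row is added by the conversion script below):

english,chinese
How are you?,你好吗？
I like reading books.,我喜欢看书。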

Step 2:

Create an FTS4 virtual table in sqlite3 (http://www.sqlite.org/fts3.html)
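
If you would rather do this from Ruby than the sqlite3 shell, a minimal sketch (assuming the sqlite3 gem and a hypothetical database filename; the table definition is the one shown below) would be:

require 'sqlite3'

# Create (or open) the database file and define the FTS4 virtual table used by the importer.
db = SQLite3::Database.new('examples.db') # hypothetical filename
db.execute("CREATE VIRTUAL TABLE example_sentences USING fts4(en TEXT, zh TEXT);")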

Step 3:

Import the data from the CSV file into the virtual table. The import is done in batches of 500, is multithreaded, and is wrapped in a single transaction due to the size of the data. (Multithreading uses Ruby's green threads but still benefits from context-switching during wait-heavy operations.)

require 'csv'

# Convert the raw parallel corpus (blank-line-separated English/Chinese sentence pairs)
# into a two-column CSV.
# Call like so: `ruby <this script> path-to-corpus-file path-to-output-csv`
# Data can be downloaded here:
# http://www.hlt.utdallas.edu/~qx/2MParallel.tar.gz
CSV.open(ARGV[1], 'w+') do |csv|
  csv << ["english", "chinese"]
  IO.readlines(ARGV[0], "\n\n").each do |row|
    en, zh = row.split("\n")
    csv << [en, zh]
  end
end
-- This will create a fully indexed full text search table. Note that
-- it will also index our English sentences, and generate a lot of
-- unnecessary data. As this data was prepared for a bare-minimum working
-- prototype, I did not investigate ways to prevent this from happening,
-- but let me know in the comments if you know of a good way around that
-- issue.
CREATE VIRTUAL TABLE example_sentences USING fts4(en TEXT, zh TEXT);
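
One possible way around the issue mentioned in the comments above (untested here, and assuming SQLite 3.8.0 or newer): FTS4's notindexed= option stores a column without adding it to the full-text index, so the English column could be kept but left unindexed:

-- Stores `en` but only builds the full-text index over `zh` (requires SQLite >= 3.8.0).
CREATE VIRTUAL TABLE example_sentences USING fts4(en TEXT, zh TEXT, notindexed=en);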
require 'sqlite3'
require 'thread'
require 'csv'
require 'progress_bar'

# Takes a CSV file of bilingual sentence pairs in the following format: english,chinese
# Batch-imports it into a sqlite3 table with the following columns: en,zh
# See `2_full_text_search_table` for the table definition.
# Call like so: `ruby 1_import_sentences_to_sqlite.rb path-to-database path-to-csv-file`
# This script will eventually be converted to Python and integrated with the zh-examples plugin:
# https://github.com/otacon507/zh-examples
# Note: takes about 1:30 to import 260MB (2,196,905 sentences in the Xian Qian dataset) on an i7 laptop.
# This is a single-transaction operation. If you run out of memory running this script, the
# transaction will be rolled back.
# The default batch size is 500; any larger number will exceed sqlite's default compound
# SELECT limit (500 terms).
db = SQLite3::Database.open ARGV[0]
FILENAME = ARGV[1]
file = CSV.open(FILENAME)
file.shift # skip the english,chinese header row
count = %x{wc -l #{FILENAME}}.split.first.to_i
bar = ProgressBar.new(count)
queue = Queue.new
BATCH_SIZE = 500

# Produce the queries in a separate thread, since we can read from the file and build our
# queries faster than the queries can be processed:
query_generator = Thread.new do
  file.each_slice(BATCH_SIZE) do |batch|
    # Escape single-quotes:
    batch.each { |x| x.map! { |y| y.to_s.gsub("'", "''") } }
    first_row = batch.pop
    query = <<-SQL
      INSERT INTO 'example_sentences'
      SELECT '#{first_row[0]}' AS 'en', '#{first_row[1]}' AS 'zh'
      #{
        batch.map do |row|
          "UNION SELECT '#{row[0]}', '#{row[1]}'"
        end.join(' ')
      };
    SQL
    queue << query
  end
  # Signal the consumer that no more queries are coming:
  queue << nil
end

# Start a transaction in the main thread and keep processing queries
# off the queue until the query_generator pushes the end-of-work sentinel:
db.transaction
while (query = queue.pop)
  db.execute(query)
  bar.increment!(BATCH_SIZE)
end
puts "finished processing, committing changes..."
db.commit
query_generator.join
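
Once the import finishes, a quick sanity check (a hypothetical snippet, not one of the gist's scripts, assuming the same database filename as above) can confirm the table is populated and searchable:

require 'sqlite3'

db = SQLite3::Database.open('examples.db') # assumed database path
puts db.get_first_value("SELECT COUNT(*) FROM example_sentences")
# Full-text search against the zh column:
db.execute("SELECT en, zh FROM example_sentences WHERE zh MATCH ? LIMIT 5", '你好').each do |en, zh|
  puts "#{zh} | #{en}"
end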