Cross-lingual product category dataset creation script.
"""Creates the product category dataset from the Cross-Lingual
Sentiment dataset [1]. The output can be used directly with the
CLSCL reference implementation in NUT [2].

Usage:

    ./ {en|de|fr|jp} {train|test|unlabeled} output_dir num_docs

For example, use the following line to create the French unlabeled
document set:

    ./ fr unlabeled fr/product_category 20000

The product category dataset was used in:

    P. Prettenhofer and B. Stein, Cross-lingual adaptation using structural
    correspondence learning, ACM TIST (to appear), 2011.
"""
import sys
from os import path, mkdir
from itertools import islice
cats = ["books", "dvd", "music"]
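def pipenlabel(src, target, label, n=50000):
    """Pipe max `n` lines from `src` to `target` and label each line
    with `label`.
    """
    for line in islice(src, n):
        # Keep everything up to and including the last ":" and replace the
        # remainder (the old label and the line's newline) with `label`.
        target.write(line[:line.rindex(":") + 1] + label + "\n")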
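def main(argv):
    # `doc_type` is one of train, test or unlabeled (see usage above).
    lang, doc_type, out_dir, n = argv
    n = int(n)
    out_dir = path.normpath(out_dir)
    if not path.exists(out_dir):
        mkdir(out_dir)
    # Concatenate up to `n` documents per category into one output file.
    with open(path.join(out_dir, "%s.processed" % doc_type), "w") as fout:
        for cat in cats:
            with open(path.join(lang, cat, "%s.processed" % doc_type)) as fin:
                pipenlabel(fin, fout, cat, n=n)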
if __name__ == "__main__":
    main(sys.argv[1:])
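
The relabeling step in `pipenlabel` can be tried in isolation. The snippet below is a minimal sketch, not part of the script: the helper name `relabel` and the sample line are hypothetical, and the `token:count ... #label#:positive` layout is an assumption about what a line in a `.processed` file looks like.

```python
def relabel(line, label):
    """Replace everything after the last ':' in `line` with `label`."""
    # Same slice as in pipenlabel: keep the prefix up to and including
    # the last colon, then append the new label.
    return line[:line.rindex(":") + 1] + label


# Hypothetical sample line; the token:count layout is an assumption.
sample = "good:2 book:1 #label#:positive\n"
print(repr(relabel(sample, "books")))
```

Note that the slice discards the line's trailing newline together with the old label, so whatever writes the result has to terminate the line itself.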