@shotarok
Created May 3, 2016 12:10
Convert TSV files in the "Kaggle Display Advertising Challenge Dataset" into Vowpal Wabbit format files
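#!/usr/bin/env python
# coding: utf-8
import sys


def main():
    # Each input line: label, 13 numeric fields, 26 categorical fields (tab-separated).
    for index, line in enumerate(sys.stdin):
        # rstrip() also drops empty trailing fields; those would be skipped below anyway.
        elems = line.rstrip().split("\t")
        label, nums, categories = elems[0], elems[1:14], elems[14:40]
        # Map the 0/1 click label to the -1/1 labels used for binary classification.
        vw_label = "1" if label == "1" else "-1"
        # Skip missing (empty) values; numeric features are written as name:value.
        num_fstr = " ".join(["I{}:{}".format(i, v) for i, v in enumerate(nums) if len(v) > 0])
        cat_fstr = " ".join(["{}".format(v) for v in categories if len(v) > 0])
        # Output: label, 'tag (the line index), then the namespaces i and c.
        print("{} '{} |i {} |c {}".format(vw_label, index, num_fstr, cat_fstr))


if __name__ == "__main__":
    main()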
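
To check the output format without the full dataset, the per-line logic can be pulled into a function and run on a made-up row. This is only a sketch: `criteo_line_to_vw` is a hypothetical helper name that is not part of the script above, and the sample values are synthetic, not taken from the Kaggle files.

```python
def criteo_line_to_vw(line, index):
    """Hypothetical helper: convert one tab-separated row into a VW example line."""
    elems = line.rstrip().split("\t")
    label, nums, categories = elems[0], elems[1:14], elems[14:40]
    vw_label = "1" if label == "1" else "-1"
    # Empty strings mark missing values and are skipped.
    num_fstr = " ".join("I{}:{}".format(i, v) for i, v in enumerate(nums) if v)
    cat_fstr = " ".join(v for v in categories if v)
    return "{} '{} |i {} |c {}".format(vw_label, index, num_fstr, cat_fstr)


# Synthetic row: label, 13 numeric fields, 26 categorical fields.
nums = ["5"] + [""] * 12
cats = ["68fd1e64", "", "80e26c9b"] + [""] * 23
sample = "\t".join(["1"] + nums + cats) + "\n"

vw_line = criteo_line_to_vw(sample, 0)
print(vw_line)  # 1 '0 |i I0:5 |c 68fd1e64 80e26c9b
```

Note that the numeric feature names start at I0 because `enumerate` counts from zero, and that categorical values carry no column prefix, so the same value appearing in two different columns maps to the same feature.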