Last active December 25, 2015 09:39
Transforms a topic-model input file from DCA format (space-separated) to LDA-C format (colon-separated term:count pairs)
'''
Transforms a topic-model input file
from DCA format (space-separated)
to LDA-C format (colon-separated term:count pairs)
@author renaud@apache.org
'''
import sys

dca_file = sys.argv[1]
out_file = sys.argv[2]  # e.g. "{}.lda-c".format(dca_file)
print("writing to: " + out_file)

# "w" rather than "a", so that rerunning does not append duplicate documents
with open(dca_file) as f, open(out_file, "w") as out:
    # skip the first 2 lines, these are the counts
    for line in f.readlines()[2:]:
        l = line.rstrip().split(' ')
        # the first int is written unchanged
        out.write(l[0])
        # all ints but the first, paired up as term:count
        for x in range(1, len(l) - 1, 2):
            out.write(" {}:{}".format(l[x], l[x + 1]))
        out.write("\n")
123
1234
27 0 1 1 2 2 2 3 1 4 1 5 1 6 1 7 1 8 1 9 1 10 1 11 1 12 1 13 1 14 1 15 2 16 1 17 1 18 2 19 1 20 2 21 1 22 1 23 1 24 1 25 1 26 1
137 2 1 4 2 5 1 13 7 21 1 25 1 27 1 28 1 29 1 30 1 31 1 32 1 33 1 34 6 35 6 36 1 37 1 38 1 39 1 40 1 41 1 42 1 43 1 44 1 45 2 46 2 47 1 48 1 49 3 50 2 51 1 52 2 53 2 54 2 55 2 56 1 57 2 58 1 59 1 60 3 61 3 62 1 63 4 64 2 65 2 66 1 67 1 68 1 69 1 70 1 71 1 72 2 73 2 74 2 75 3 76 3 77 1 78 1 79 1 80 2 81 2 82 1 83 1 84 1 85 1 86 1 87 1 88 1 89 1 90 1 91 1 92 1 93 1 94 1 95 1 96 1 97 1 98 1 99 1 100 1 101 2 102 1 103 2 104 1 105 1 106 1 107 2 108 1 109 1 110 1 111 1 112 1 113 1 114 1 115 1 116 1 117 2 118 1 119 1 120 1 121 1 122 1 123 1 124 1 125 1 126 1 127 1 128 1 129 1 130 1 131 1 132 1 133 1 134 1 135 1 136 1 137 1 138 1 139 1 140 1 141 1 142 1 143 1 144 1 145 1 146 1 147 1 148 1 149 1 150 1 151 1 152 1 153 1 154 1 155 1 156 1 157 1
37 9 1 23 1 25 1 35 1 69 1 72 1 191 1 220 1 221 3 222 2 223 1 224 4 225 1 226 1 227 1 228 1 229 2 230 1 231 1 232 1 233 1 234 1 235 1 236 1 237 1 238 1 239 1 240 1 241 1 242 1 243 1 244 1 245 1 246 1 247 1 248 1 249 1
98 2 1 11 1 24 2 25 3 27 1 35 1 51 1 54 4 59 2 65 3 106 2 112 1 128 1 134 2 145 1 147 2 149 1 182 1 192 1 237 1 259 1 262 1 264 2 289 1 312 1 339 1 342 1 343 1 344 2 345 2 346 1 347 3 348 9 349 9 350 4 351 2 352 7 353 2 354 1 355 1 356 1 357 1 358 2 359 2 360 3 361 4 362 2 363 2 364 1 365 1 366 5 367 1 368 1 369 1 370 6 371 5 372 1 373 2 374 1 375 1 376 1 377 1 378 1 379 1 380 1 381 1 382 1 383 1 384 1 385 1 386 1 387 1 388 1 389 1 390 1 391 2 392 1 393 1 394 1 395 1 396 1 397 1 398 1 399 1 400 1 401 1 402 3 403 1 404 1 405 1 406 1 407 1 408 1 409 1 410 1 411 1 412 1 413 1
28 33 1 90 1 156 1 190 1 229 1 410 1 414 1 415 1 416 2 417 1 418 1 419 1 420 2 421 1 422 1 423 1 424 1 425 1 426 1 427 1 428 1 429 1 430 1 431 1 432 1 433 1 434 1 435 1
23 27 1 33 1 90 1 173 1 229 1 361 1 415 1 417 1 419 1 420 1 421 1 423 1 436 1 437 1 438 1 439 1 440 1 441 1 442 1 443 1 444 1 445 1 446 1
37 27 1 166 1 238 1 286 1 300 1 403 1 427 1 447 1 448 2 449 1 450 2 451 2 452 2 453 3 454 1 455 1 456 1 457 1 458 1 459 1 460 1 461 1 462 1 463 1 464 2 465 1 466 1 467 1 468 1 469 1 470 1 471 1 472 1 473 1 474 1 475 1 476 1
62 11 1 24 1 65 3 134 1 145 1 200 1 231 1 232 1 259 1 260 2 312 1 392 1 403 1 413 1 422 1 430 1 447 2 450 1 451 2 459 1 468 1 477 2 478 1 479 4 480 2 481 1 482 1 483 1 484 1 485 2 486 1 487 1 488 1 489 1 490 1 491 1 492 1 493 1 494 1 495 1 496 1 497 2 498 1 499 3 500 4 501 1 502 1 503 1 504 1 505 1 506 1 507 1 508 1 509 1 510 1 511 1 512 1 513 1 514 1 515 1 516 1 517 1 |
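
To make the mapping between the two formats concrete, here is a minimal sketch of the per-line conversion as a standalone function. The name `dca_line_to_ldac` is hypothetical and not part of the original script. The sanity check assumes, based only on the sample data above, that the leading integer on each document line is the number of term/count pairs that follow.

```python
def dca_line_to_ldac(line):
    """Convert one DCA document line to an LDA-C line (hypothetical helper)."""
    fields = line.split()
    n_terms = int(fields[0])
    rest = fields[1:]
    # Assumption inferred from the sample data: the first integer is the
    # number of (term, count) pairs on the line.
    if len(rest) != 2 * n_terms:
        raise ValueError(
            "expected {} term/count pairs, got {} values".format(n_terms, len(rest))
        )
    terms = rest[0::2]   # term ids sit at even offsets
    counts = rest[1::2]  # their counts sit at odd offsets
    pairs = ["{}:{}".format(t, c) for t, c in zip(terms, counts)]
    return " ".join([fields[0]] + pairs)


print(dca_line_to_ldac("3 0 1 1 2 2 2"))  # prints: 3 0:1 1:2 2:2
```

Raising on a mismatched count is a design choice of this sketch. The original script writes whatever pairs it finds without checking them.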