@vdavez
Last active August 29, 2015 14:27
parsing PDF

MCID == Marked Content ID (a div, essentially)

BDC == Begin marked content with a property dictionary (here the dictionary is <</MCID 2 >>). The marked content ends at the matching EMC

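10.9428 0 0 10.9615 171.1056 618.8265 Tm == The six numbers are the operands a b c d e f of Tm, listed one by one below

10.9428 == a: horizontal scale. The Tf operators in this stream select fonts at size 1, so this is effectively the font size

0 0 == b and c: rotation/skew terms, zero for upright text

10.9615 == d: vertical scale

171.1056 == e: x coordinate where the text starts (Xmin)

618.8265 == f: y coordinate where the text starts

Tm == Set text matrix. It takes the six numbers before it and sets the scale and position of the text that follows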
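
As a rough sketch of how the six numbers before Tm line up, the snippet below pulls them out of a stream and labels them. The function name parse_tm and the dictionary keys are made up for illustration; only the operand order (a b c d e f) comes from the PDF text matrix convention.

```python
import re

def parse_tm(stream):
    """Return the operands of the first Tm operator as a dict, or None."""
    number = r"-?\d*\.?\d+"
    m = re.search(r"((?:%s\s+){6})Tm" % number, stream)
    if m is None:
        return None
    a, b, c, d, e, f = (float(x) for x in m.group(1).split())
    # Hypothetical key names, chosen only for readability
    return {"x_scale": a, "b": b, "c": c, "y_scale": d, "x": e, "y": f}

tm = parse_tm("<</MCID 2 >>BDC 10.9428 0 0 10.9615 171.1056 618.8265 Tm [(P)18(etiti)]TJ")
print(tm)
```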

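[(P)18(etiti)-27...]TJ == TEXT! The text is inside the parentheses. The numbers between the parentheses are spacing adjustments in thousandths of a text unit: a positive number pulls the next piece closer (kerning), a negative number pushes it further away.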
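
A minimal sketch of turning one TJ array back into readable text. It assumes, as in the sample stream here, that word spaces are already inside the strings, so the spacing numbers can simply be dropped; other PDFs may encode word gaps as large negative numbers instead. The name tj_to_text is hypothetical, and the pattern does not handle unescaped nested parentheses.

```python
import re

# A PDF literal string: "(" ... ")" where a backslash escapes the next character
STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")

def tj_to_text(array_body):
    """Concatenate the string pieces of a TJ array, ignoring the spacing numbers."""
    return "".join(STRING.findall(array_body))

line = "[(P)18(etiti)-27(oners )-86(nonetheless )-86(contend )-86(that )]TJ"
print(repr(tj_to_text(line)))  # 'Petitioners nonetheless contend that '
```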

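Td, TD, and T* == New line. Td takes two numbers (tx ty) and moves relative to the start of the current line; when ty is 0, as in 24.449 0 Td, it is only a horizontal move on the same line. TD does the same and also sets the line spacing (leading) to -ty. T* takes no numbers and moves down one line using the current leading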
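
The following sketch splits a text stream into lines using these operators, under the assumption that a Td or TD with a vertical offset of 0 stays on the same line. The name split_lines is hypothetical, and the regular expressions are naive: they would also fire on operator-like text inside a string.

```python
import re

STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")
# Either "tx ty Td" / "tx ty TD" (capturing ty) or a bare "T*"
MOVE = re.compile(r"-?\d*\.?\d+\s+(-?\d*\.?\d+)\s+T[dD]\b|T\*")

def split_lines(stream):
    """Split at T*, and at Td/TD moves whose vertical offset is non-zero."""
    lines, start = [], 0
    for m in MOVE.finditer(stream):
        ty = m.group(1)
        if ty is not None and float(ty) == 0:
            continue  # horizontal move only: still the same line
        lines.append("".join(STRING.findall(stream[start:m.start()])))
        start = m.end()
    lines.append("".join(STRING.findall(stream[start:])))
    return lines

sample = ("[(a )]TJ -1 -1.182 Td [(b )]TJ 0 -1.182 TD [(c )]TJ T* "
          "[(d )]TJ /T1_2 1 Tf 24.449 0 Td [(e)]TJ")
print(split_lines(sample))  # ['a ', 'b ', 'c ', 'd e']
```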

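\\222 == right single quote / apostrophe (these are octal character codes; what they print as depends on the font's encoding)

\\223 == left double quote

\\224 == right double quote

\\255 == end-of-line hyphen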
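
A small sketch for turning these octal escapes into real characters. It assumes the byte values follow Windows-1252, which matches the quotes listed above but is only a guess for any given font; the real mapping lives in the font's encoding. The name decode_escapes is hypothetical, and the pattern accepts one or two backslashes because the sample in these notes shows them doubled.

```python
import re

def decode_escapes(text):
    """Replace three-digit octal escapes with characters, assuming Windows-1252 bytes."""
    def repl(m):
        byte = bytes([int(m.group(1), 8)])
        # Bytes with no Windows-1252 meaning become U+FFFD instead of raising
        return byte.decode("cp1252", errors="replace")
    return re.sub(r"\\{1,2}([0-3][0-7]{2})", repl, text)

print(decode_escapes(r"respondents\\222 clai"))
print(decode_escapes(r"\\223might create a false impression\\224"))
```

Under this assumption \\255 decodes to U+00AD (soft hyphen), which a later clean-up step could strip when rejoining words split across lines.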

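/T1_2 1 Tf == A new text style: switch to font T1_2 at size 1 (the real size comes from the scale set by Tm). The 24.449 0 that follows it in the sample is not part of Tf; it belongs to the next operator, Td, and means a horizontal offset of 24.449 and a vertical offset of 0. (Note: the actual definition of the font is included in some /Font resource...)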
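
To make the reading of Tf concrete, here is a short sketch that lists every font switch in a stream as (name, size) pairs. The name font_switches is hypothetical; the pattern only assumes the /Name size Tf shape seen in the sample.

```python
import re

# "/Name size Tf": a font resource name, a size, then the Tf operator
FONT_SWITCH = re.compile(r"/(\S+)\s+(-?\d*\.?\d+)\s+Tf\b")

def font_switches(stream):
    """Return (font name, size) for each Tf operator, in stream order."""
    return [(name, float(size)) for name, size in FONT_SWITCH.findall(stream)]

sample = "/T1_2 1 Tf 24.449 0 Td [(Cipo)-36(l)]TJ /T1_3 1 Tf ( )Tj /T1_1 1 Tf 0 Tc"
print(font_switches(sample))  # [('T1_2', 1.0), ('T1_3', 1.0), ('T1_1', 1.0)]
```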

\\037 == A different (?) hyphen. In the sample it ends the italic "Cipol" where "Cipollone" breaks across two lines

Tc == Character spacing (the number before it is extra space added after every character; 0 Tc turns it off again)

<</MCID 2 >>BDC 10.9428 0 0 10.9615 171.1056 618.8265 Tm [(P)18(etiti)-27(oners )-86(nonetheless )-86(contend )-86(that )-86(respondents\\222 )-86(clai)-18(m )-86(is )]TJ -1 -1.182 Td [(l)-18(ike )-114(the )-114(pre-empted )-114(war)-27(ni)-18(ng )-104(neutra)-27(l)-18(i)-18(zati)-27(on )-114(clai)-18(m )-114(because )-114(it )]TJ 0 -1.182 TD [(is )-14(based )-23(on )-14(st)-18(atements )-23(that )-14(\\223might )-23(create )-14(a )-23(fa)-27(lse )-14(i)-18(mpressi)-27(on\\224 )]TJ T* [(rather )-250(than )-241(st)-18(atements )-250(that )-250(are )-241(\\223)-36(i)-18(nherently )-250(fa)-27(lse)27(.)-54(\\224 )-750(Br)-27(ief )]TJ T* [(for )-141(P)18(etiti)-27(oners )-132(39. )-750(But )-141(the )-141(extent )-132(of )-141(the )-132(fa)-27(lsehood )-141(a)-27(l)-18(leged )]TJ T* [(does )-4(not )-4(a)-27(lter )-4(the )-4(nature )-14(of )-4(the )-4(clai)-18(m. )-750(N)36(oth)-27(i)-18(ng )-4(i)-18(n )-4(the )-4(Label\\255)]TJ T* [(i)-18(ng )-50(A)55(c)-18(t\\222)55(s )-41(text )-50(or )-41(pur)-18(pose )-50(or )-41(i)-18(n )-50(the )-41(plura)-27(l)-18(ity )-50(opi)-18(ni)-27(on )-50(i)-18(n )]TJ /T1_2 1 Tf 24.449 0 Td [(Cipo)-36(l\\037)]TJ 0.0273 Tc -24.449 -1.182 Td [(lon)27(e)]TJ /T1_3 1 Tf ( )Tj /T1_1 1 Tf 0 Tc 2.37 0 Td [(suggests )-186(that )-177(whether )-177(a )-186(clai)-18(m )-177(is )-177(pre-empted )-186(tur)-27(ns )-177(i)-18(n )]TJ -2.37 -1.182 Td [(any )-214(way )-214(on )-214(the )-214(disti)-18(nc)-18(ti)-27(on )-214(between )-214(misleadi)-18(ng )-214(and )-214(i)-18(nher\\255)]TJ T* [(ently )-14(fa)-27(lse )-14(st)-18(atements. 
)-750(P)18(etiti)-27(oners\\222 )-4(misunderst)-18(andi)-18(ng )-14(is )-14(the )]TJ T* [(same )-50(one )-50(that )-50(led )-50(the )-50(Cour)-27(t )-50(of )-41(Appea)-27(ls )-50(for )-50(the )-50(Fi)-18(f)-45(th )-50(Circuit, )]TJ T* [(when )-95(confronted )-95(w)-27(ith )-95(a )-95(\\223)-54(l)-18(ight\\224 )-95(descr)-27(iptors )-86(clai)-18(m, )-95(to )-95(reach )-95(a )]TJ T* [(resu)-27(lt )-223(at )-223(odds )-232(w)-27(ith )-223(the )-223(Cour)-27(t )-223(of )-223(Appea)-27(ls\\222 )-232(decisi)-27(on )-223(i)-18(n )-223(th)-27(is )]TJ T* [(case)27(. )-750(See )]TJ /T1_2 1 Tf 5.049 0 Td [(Bro)-27(w)-27(n,)]TJ /T1_3 1 Tf ( )Tj /T1_1 1 Tf 3.631 0 Td [(479 )-77(F)109(. )-86(3d, )-77(at )-77(391)27(\\226)-54(393. )-750(Cer)-27(t)-18(ai)-18(nly)91(, )-77(the )-77(ex\\255)]TJ -8.68 -1.182 Td [(tent )-50(of )-59(the )-50(fa)-27(lsehood )-59(a)-27(l)-18(leged )-50(may )-59(bear )-50(on )-59(whether )-50(a )-59(plai)-18(nti)-18(ff )]TJ T* [(can )-232(prove )-241(her )-232(fraud )-232(clai)-18(m, )-241(but )-232(the )-232(mer)-27(its )-241(of )-232(respondents\\222 )]TJ T* [(clai)-18(m )-86(are )-86(not )-86(before )-86(us. )]TJ EMC /P

vdavez commented Aug 26, 2015

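import pdb
import re

# A PDF number such as 10.9428, 0 or -1.182
NUMBER = r"-?[\d.]+"
# Six numbers followed by Tm, then everything up to the next Tm block (or end of file)
match_pattern = re.compile(
  r"((?:%s\s+){6}Tm)(.*?)(?=(?:%s\s+){6}Tm|\Z)" % (NUMBER, NUMBER), re.DOTALL)
# Text shown by Tj/TJ sits between parentheses
text_pattern = re.compile(r"\((.*?)\)", re.DOTALL)

with open('lib/test_pdf.dat', 'r') as f:
  d = f.read()

# Collect the text fragments of each block, keyed by the first Tm operand (the size)
blocks = {}
for m in match_pattern.finditer(d):
  size = re.match(NUMBER, m.group(1)).group(0)
  text = text_pattern.findall(m.group(2))
  blocks.setdefault(size, []).extend(text)

# Drop into the debugger to inspect blocks
pdb.set_trace()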
