@vdavez
Last active August 29, 2015 14:27
parsing PDF

MCID == Marked Content ID (a div, essentially)

BDC == Begin a marked-content sequence with a property list (here the <</MCID 2 >> dictionary); EMC ends the sequence

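10.9428 == Horizontal scale (first Tm operand); the Tf operators in this stream use size 1, so this is effectively the font size

0 0 == Second and third Tm operands (rotation/skew terms); zero for upright text

10.9615 == Vertical scale (fourth Tm operand)

171.1056 == X coordinate where the line starts (fifth Tm operand)

618.8265 == Y coordinate of the baseline (sixth Tm operand)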

Tm == Set Text Matrix (takes the six numbers before it and sets the scale and position of the text that follows)
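
As a rough illustration of the six Tm operands, here is a minimal sketch that pulls them out of a content-stream fragment. The names `TextMatrix` and `parse_tm` are made up for this example, and the number pattern assumes plain decimal operands as in the sample.

```python
import re
from typing import NamedTuple, Optional


class TextMatrix(NamedTuple):
    a: float  # horizontal scale
    b: float  # rotation/skew term
    c: float  # rotation/skew term
    d: float  # vertical scale
    e: float  # x position
    f: float  # y position


NUM = r"-?\d*\.?\d+"
# Six whitespace-separated numbers immediately followed by the Tm operator.
TM_RE = re.compile(r"((?:%s\s+){6})Tm" % NUM)


def parse_tm(stream: str) -> Optional[TextMatrix]:
    """Return the operands of the first Tm operator, or None if there is none."""
    m = TM_RE.search(stream)
    if m is None:
        return None
    return TextMatrix(*(float(x) for x in m.group(1).split()))


sample = "<</MCID 2 >>BDC 10.9428 0 0 10.9615 171.1056 618.8265 Tm [(P)18(etiti)]TJ"
tm = parse_tm(sample)
print(tm.a, tm.d, tm.e, tm.f)
```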

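[(P)18(etiti)-27...]TJ == TEXT! The text is inside the parentheses. The numbers between the strings are position adjustments in thousandths of a text-space unit: positive values pull the next glyph closer (kerning), negative values push it away (the large negative ones act as extra word spacing).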
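
A small sketch of splitting a TJ array into its strings and adjustment numbers. `parse_tj` and `tj_text` are hypothetical helper names, and the pattern assumes the strings contain no escaped or nested parentheses, which holds for the sample but not for PDF strings in general.

```python
import re

# One element of a TJ array: either a (string) or a number.
# Assumption: strings contain no escaped or nested parentheses.
ELEMENT_RE = re.compile(r"\(([^)]*)\)|(-?\d*\.?\d+)")


def parse_tj(array_body: str):
    """Return the array elements in order: str for text, float for adjustments."""
    items = []
    for text, number in ELEMENT_RE.findall(array_body):
        if number:
            items.append(float(number))
        else:
            items.append(text)
    return items


def tj_text(array_body: str) -> str:
    """Concatenate only the string elements, ignoring the adjustments."""
    return "".join(x for x in parse_tj(array_body) if isinstance(x, str))


body = "(P)18(etiti)-27(oners )-86(nonetheless )-86(contend )"
print(parse_tj(body)[:4])
print(tj_text(body))
```

In this sample the spaces are already inside the strings, so dropping the numbers loses nothing; other PDFs may encode word gaps only as large negative adjustments.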

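Td, TD, and T* == New line. "tx ty Td" moves to the start of the next line, offset by (tx, ty) from the start of the current one; TD does the same and also sets the leading to -ty; T* moves down by the current leading.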
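
To make the difference between the three operators concrete, this sketch tracks where each new line starts, in unscaled text-space units relative to the position set by Tm. `line_offsets` is a made-up name, and the regex assumes the operators never appear inside a text string.

```python
import re

NUM = r"-?\d*\.?\d+"
# Matches "tx ty Td", "tx ty TD", or a bare "T*".
MOVE_RE = re.compile(r"(%s)\s+(%s)\s+(T[dD])\b|(T\*)" % (NUM, NUM))


def line_offsets(stream: str):
    """Return the (x, y) offset of each new line, in order of appearance."""
    x, y, leading = 0.0, 0.0, 0.0
    offsets = []
    for tx, ty, op, star in MOVE_RE.findall(stream):
        if star:
            # T* moves down by the current leading.
            y -= leading
        else:
            x += float(tx)
            y += float(ty)
            if op == "TD":
                # TD also records -ty as the leading for later T* operators.
                leading = -float(ty)
        offsets.append((x, y))
    return offsets


stream = "[(a)]TJ -1 -1.182 Td [(b)]TJ 0 -1.182 TD [(c)]TJ T* [(d)]TJ"
for offset in line_offsets(stream):
    print(offset)
```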

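\\222 == right single quote / apostrophe (these are octal character codes; the meanings here assume a WinAnsi-style font encoding)

\\223 == left double quote

\\224 == right double quote

\\255 == end-of-line hyphen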
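
A minimal sketch of decoding these octal escapes, assuming the font's encoding matches Windows-1252 (`cp1252`) for these codes. `decode_octal` is a hypothetical helper; codes that a font remaps itself, such as \\037 here, would need a per-font table instead.

```python
import re

# One or more backslashes followed by up to three octal digits. The dump in
# this gist shows doubled backslashes; a raw content stream would normally
# have just one, so both forms are accepted.
OCTAL_RE = re.compile(r"\\+([0-7]{1,3})")


def decode_octal(s: str, encoding: str = "cp1252") -> str:
    """Replace octal escapes with the character they map to in `encoding`."""
    def repl(m):
        code = int(m.group(1), 8) & 0xFF  # keep the value within one byte
        return bytes([code]).decode(encoding, errors="replace")
    return OCTAL_RE.sub(repl, s)


print(decode_octal(r"respondents\\222 clai"))
print(decode_octal(r"\\223might create a false impression\\224"))
```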

/T1_2 1 Tf == A new text style: /T1_2 names the font and 1 is the size (the visible size comes from the Tm scale). The 24.449 0 that follows are the operands of the next Td, not part of Tf. (Note: the actual definition of the font is included in some /Font resource...)

\\037 == A different (?) hyphen (not a standard printable code, so its meaning presumably comes from the font's own encoding)

Tc == Character spacing (extra space added after each glyph; 0 Tc resets it)

<</MCID 2 >>BDC 10.9428 0 0 10.9615 171.1056 618.8265 Tm [(P)18(etiti)-27(oners )-86(nonetheless )-86(contend )-86(that )-86(respondents\\222 )-86(clai)-18(m )-86(is )]TJ -1 -1.182 Td [(l)-18(ike )-114(the )-114(pre-empted )-114(war)-27(ni)-18(ng )-104(neutra)-27(l)-18(i)-18(zati)-27(on )-114(clai)-18(m )-114(because )-114(it )]TJ 0 -1.182 TD [(is )-14(based )-23(on )-14(st)-18(atements )-23(that )-14(\\223might )-23(create )-14(a )-23(fa)-27(lse )-14(i)-18(mpressi)-27(on\\224 )]TJ T* [(rather )-250(than )-241(st)-18(atements )-250(that )-250(are )-241(\\223)-36(i)-18(nherently )-250(fa)-27(lse)27(.)-54(\\224 )-750(Br)-27(ief )]TJ T* [(for )-141(P)18(etiti)-27(oners )-132(39. )-750(But )-141(the )-141(extent )-132(of )-141(the )-132(fa)-27(lsehood )-141(a)-27(l)-18(leged )]TJ T* [(does )-4(not )-4(a)-27(lter )-4(the )-4(nature )-14(of )-4(the )-4(clai)-18(m. )-750(N)36(oth)-27(i)-18(ng )-4(i)-18(n )-4(the )-4(Label\\255)]TJ T* [(i)-18(ng )-50(A)55(c)-18(t\\222)55(s )-41(text )-50(or )-41(pur)-18(pose )-50(or )-41(i)-18(n )-50(the )-41(plura)-27(l)-18(ity )-50(opi)-18(ni)-27(on )-50(i)-18(n )]TJ /T1_2 1 Tf 24.449 0 Td [(Cipo)-36(l\\037)]TJ 0.0273 Tc -24.449 -1.182 Td [(lon)27(e)]TJ /T1_3 1 Tf ( )Tj /T1_1 1 Tf 0 Tc 2.37 0 Td [(suggests )-186(that )-177(whether )-177(a )-186(clai)-18(m )-177(is )-177(pre-empted )-186(tur)-27(ns )-177(i)-18(n )]TJ -2.37 -1.182 Td [(any )-214(way )-214(on )-214(the )-214(disti)-18(nc)-18(ti)-27(on )-214(between )-214(misleadi)-18(ng )-214(and )-214(i)-18(nher\\255)]TJ T* [(ently )-14(fa)-27(lse )-14(st)-18(atements. 
)-750(P)18(etiti)-27(oners\\222 )-4(misunderst)-18(andi)-18(ng )-14(is )-14(the )]TJ T* [(same )-50(one )-50(that )-50(led )-50(the )-50(Cour)-27(t )-50(of )-41(Appea)-27(ls )-50(for )-50(the )-50(Fi)-18(f)-45(th )-50(Circuit, )]TJ T* [(when )-95(confronted )-95(w)-27(ith )-95(a )-95(\\223)-54(l)-18(ight\\224 )-95(descr)-27(iptors )-86(clai)-18(m, )-95(to )-95(reach )-95(a )]TJ T* [(resu)-27(lt )-223(at )-223(odds )-232(w)-27(ith )-223(the )-223(Cour)-27(t )-223(of )-223(Appea)-27(ls\\222 )-232(decisi)-27(on )-223(i)-18(n )-223(th)-27(is )]TJ T* [(case)27(. )-750(See )]TJ /T1_2 1 Tf 5.049 0 Td [(Bro)-27(w)-27(n,)]TJ /T1_3 1 Tf ( )Tj /T1_1 1 Tf 3.631 0 Td [(479 )-77(F)109(. )-86(3d, )-77(at )-77(391)27(\\226)-54(393. )-750(Cer)-27(t)-18(ai)-18(nly)91(, )-77(the )-77(ex\\255)]TJ -8.68 -1.182 Td [(tent )-50(of )-59(the )-50(fa)-27(lsehood )-59(a)-27(l)-18(leged )-50(may )-59(bear )-50(on )-59(whether )-50(a )-59(plai)-18(nti)-18(ff )]TJ T* [(can )-232(prove )-241(her )-232(fraud )-232(clai)-18(m, )-241(but )-232(the )-232(mer)-27(its )-241(of )-232(respondents\\222 )]TJ T* [(clai)-18(m )-86(are )-86(not )-86(before )-86(us. )]TJ EMC /P

vdavez commented Aug 10, 2015

((\()([\w\s\d',."-]+)(\)))(-?\d+)

OR EVEN BETTER!

(?<=\()\w+(?=\))
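
A quick check of the lookbehind/lookahead idea on a fragment of the stream above. Note that `\w+` skips any string containing a space or punctuation, so this sketch also tries a lazy `.*?` variant; both assume the text has no escaped parentheses.

```python
import re

sample = "[(P)18(etiti)-27(oners )-86(nonetheless )-86(contend )]TJ"

# Only strings made purely of word characters are matched.
word_only = re.findall(r"(?<=\()\w+(?=\))", sample)

# Lazy "anything up to the next closing parenthesis" keeps spaces too.
any_chars = re.findall(r"(?<=\().*?(?=\))", sample)

print(word_only)
print("".join(any_chars))
```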


vdavez commented Aug 10, 2015

Extracted text using only regular expressions...

Petitioners nonetheless contend that respondents' claim is like the pre-empted warning neutralization claim because it is based on statements that "might create a false impression" rather than statements that are "inherently false." Brief for Petitioners 39. But the extent of the falsehood alleged does not alter the nature of the claim. Nothing in the Label-ing Act's text or purpose or in the plurality opinion in Cipol-lone suggests that whether a claim is pre-empted turns in any way on the distinction between misleading and inher-ently false statements. Petitioners' misunderstanding is the same one that led the Court of Appeals for the Fifth Circuit, when confronted with a "light" descriptors claim, to reach a result at odds with the Court of Appeals' decision in this case. See Brown, 479 F. 3d, at 391--393. Certainly, the ex-tent of the falsehood alleged may bear on whether a plaintiff can prove her fraud claim, but the merits of respondents' claim are not before us.
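
The extracted text still contains split words such as "Label-ing" and "Cipol-lone", which come from the \\255 and \\037 line-break hyphens. One possible cleanup, sketched here with a hypothetical `join_strings` helper, is to drop those escapes before joining the strings. This assumes the two codes only ever mark a word split across lines, so a genuinely hyphenated compound that happens to break at a line end would lose its hyphen.

```python
import re

# Line-break hyphens as they appear in the dump (one or more backslashes,
# then the octal code). Assumption: these codes only mark words split
# across lines in this document.
LINE_BREAK_HYPHEN_RE = re.compile(r"\\+(?:255|037)")


def join_strings(strings):
    """Join extracted TJ strings, removing line-break hyphen escapes."""
    return "".join(LINE_BREAK_HYPHEN_RE.sub("", s) for s in strings)


strings = ["the ", r"Label\\255", "i", "ng ", "A", "c", "t"]
print(join_strings(strings))
```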


vdavez commented Aug 26, 2015

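import re
from collections import defaultdict

# Text between "(" and ")". Escaped or nested parentheses are not handled.
text_pattern = re.compile(r"(?<=\().*?(?=\))", re.DOTALL)
# Six numeric operands followed by Tm, then everything up to the next Tm
# (or the end of the data, so the last block is not dropped).
number = r"-?[\d.]+\s+"
match_pattern = re.compile(
    r"((?:%s){6}Tm)(.*?)(?=(?:%s){6}Tm|\Z)" % (number, number), re.DOTALL)

with open('lib/test_pdf.dat', 'r') as f:
    d = f.read()

# Group the extracted strings by the first Tm operand, which acts as the
# font size in this file.
blocks = defaultdict(list)
for tm, body in match_pattern.findall(d):
    size = re.match(r'-?[\d.]+', tm).group(0)
    blocks[size] += text_pattern.findall(body)

# Print each size with its text (replaces the pdb.set_trace() inspection).
for size, text in blocks.items():
    print(size, ''.join(text))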
