@eumesy
Last active December 22, 2016 02:16

Workflow

  1. data: train, dev, test

    • one token per line, whitespace-separated columns: NE label, token (w), POS tag (pos), chunk tag (chk); a blank line ends a sentence
    O       -DOCSTART-      -X-     O
    
    B-ORG   EU      NNP     B-NP
    O       rejects VBZ     B-VP
    B-MISC  German  JJ      B-NP
    O       call    NN      I-NP
    O       to      TO      B-VP
    O       boycott VB      I-VP
    B-MISC  British JJ      B-NP
    O       lamb    NN      I-NP
    ...
    
    • ↓ feature extraction: $ ./feature.py < train > train.f, ...
  2. feature file: train.f, dev.f, test.f

    O       w[0]=-DOCSTART- pos[0]=-X-      chk[0]=O        p4[0]=-DOC      p5[0]=-DOCS ...
     
    O       w[0]=CRICKET    w[1]=-  w[0]|w[1]=CRICKET|-     w[1]|w[2]=-|LEICESTERSHIRE      pos[0]=NNP pos[1]=__COLON__        pos[0]|pos[1]=NNP|__COLON__     chk[0]=B-NP     p4[0]=CRIC      p5[0]=CRICK ...
    O       w[-1]=CRICKET   w[0]=-  w[1]=LEICESTERSHIRE     w[0]|w[1]=-|LEICESTERSHIRE      w[1]|w[2]=LEICESTERSHIRE|TAKE   pos[-1]=NNP     pos[0]=__COLON__        pos[1]=NNP      pos[0]|pos[1]=__COLON__|NNP     chk[0]=O        p4[0]=False     p5[0]=False ...
    B-ORG   w[-1]=- w[0]=LEICESTERSHIRE     w[1]=TAKE       w[0]|w[1]=LEICESTERSHIRE|TAKE   w[1]|w[2]=TAKE|OVER     pos[-1]=__COLON__       pos[0]=NNP      pos[1]=NNP      pos[0]|pos[1]=NNP|NNP ...
    ...
    
    • ↓ training: $ crfsuite learn -a ap -p max_iterations=20 -m ner.model train.f
  3. model: ner.model

    • → ★ dump (check weights): $ crfsuite dump ner.model > ner.dump

      ...
      TRANSITIONS = {
        (1) O --> O: 29.006869
        (1) O --> B-ORG: 22.085804
        (1) O --> B-MISC: 22.575587
        (1) O --> B-PER: 32.707861
      ...
      STATE_FEATURES = {
        (0) w[0]=-DOCSTART- --> O: 8.808922
        (0) pos[0]=-X- --> O: 8.808922
        (0) chk[0]=O --> O: 0.113915
        (0) chk[0]=O --> B-ORG: -7.506798
      
    • ↓ test (tagging, prediction): $ crfsuite tag -r -m ner.model < dev.f > dev.eval

      • -i: also print the marginal probability of each predicted label
  4. (gold, predicted) label pairs: dev.eval, test.eval

    O       O
     
    O       O
    O       O
    B-ORG   B-PER
    O       O
    O       O
    ...
    
    • → evaluation: $ conlleval.py < dev.eval

      processed 51578 tokens with 5943 phrases; found: 5906 phrases; correct: ...
      accuracy: ...
      
    • → error analysis: $ crfsuite tag -m ner.model -r < test.f | merge.py test > test.error

    • → ★ error analysis (sentences including false instances only): $ crfsuite tag -m ner.model -r < test.f | merge.py test | false_instance.py > test.error
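
The false-instance filter in the last step can be sketched roughly as below. The real false_instance.py is not shown here, so this is only a guess at its behavior: it assumes each row starts with the gold label and the predicted label (the layout `crfsuite tag -r` prints), that any further columns added by merge.py come after them, and that a blank line ends a sentence. The function names are hypothetical.

```python
def read_blocks(lines):
    """Group lines into sentences; a blank line ends the current sentence."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def has_error(block):
    """True when any row's gold label (column 0) differs from its prediction (column 1)."""
    for row in block:
        cols = row.split()
        # Assumption: rows with fewer than two columns carry no label pair.
        if len(cols) >= 2 and cols[0] != cols[1]:
            return True
    return False


def false_instances(lines):
    """Yield only the sentences that contain at least one mislabeled token."""
    for block in read_blocks(lines):
        if has_error(block):
            yield block


sample = ["O\tO\n", "\n", "O\tO\n", "B-ORG\tB-PER\n", "\n", "O\tO\n"]
for sentence in false_instances(sample):
    print("\n".join(sentence))
    print()
```

Keeping whole sentences rather than single rows preserves the context around each error, which is usually what you want when paging through test.error.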

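The feature-extraction step (step 1 to step 2) can be sketched in the same way. This is not the actual feature.py: it reproduces only part of the features visible in the sample feature file (unigrams in a ±1 window, the `[0]|[1]` bigram, `chk[0]`, and 4- and 5-character prefixes), and it simply skips prefixes for short tokens where the sample shows `False`. The column order (NE, token, POS, chunk) and the `__COLON__` escape are taken from the samples above; all function names are hypothetical.

```python
FIELDS = {"w": 1, "pos": 2, "chk": 3}  # column index of each attribute in a row


def escape(value):
    """CRFsuite treats ':' as the feature/weight separator, so replace it."""
    return value.replace(":", "__COLON__")


def attr(sent, name, i):
    """Return attribute `name` of token i, or None when i is outside the sentence."""
    if 0 <= i < len(sent):
        return sent[i][FIELDS[name]]
    return None


def token_features(sent, i):
    """Build the feature strings for token i of a sentence of (ne, w, pos, chk) rows."""
    feats = []
    for name in ("w", "pos"):
        # Unigrams in a window of one token on each side.
        for off in (-1, 0, 1):
            value = attr(sent, name, i + off)
            if value is not None:
                feats.append(f"{name}[{off}]={escape(value)}")
        # Bigram of the current and the next token.
        left, right = attr(sent, name, i), attr(sent, name, i + 1)
        if left is not None and right is not None:
            feats.append(f"{name}[0]|{name}[1]={escape(left)}|{escape(right)}")
    feats.append(f"chk[0]={escape(sent[i][3])}")
    # Prefix features, emitted only when the token is long enough (an assumption).
    word = sent[i][1]
    for n in (4, 5):
        if len(word) >= n:
            feats.append(f"p{n}[0]={escape(word[:n])}")
    return feats


def feature_lines(sent):
    """One tab-separated line per token: the NE label first, then its features."""
    return ["\t".join([row[0]] + token_features(sent, i)) for i, row in enumerate(sent)]


sentence = [
    ("O", "CRICKET", "NNP", "B-NP"),
    ("O", "-", ":", "O"),
    ("B-ORG", "LEICESTERSHIRE", "NNP", "B-NP"),
]
print("\n".join(feature_lines(sentence)))
```

CRFsuite does not depend on the order of features within a line, so the ordering here only loosely follows the sample.
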
Usage of less

  • ★: useful commands

Open file

  • $ less FILE
  • $ some commands | less (piping)

Find

  • /: search forward for a pattern ★
  • ?: search backward for a pattern
  • n: next (repeat previous search) ★
  • N: previous (repeat previous search, but in the reverse direction)

Move

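  • line ★
    • j, Ctrl+n, Enter: forward one line
    • k, Ctrl+p: backward one line
  • window ★
    • f, Ctrl+v, Space: forward one window
    • b, Alt+v: backward one window
  • half window
    • Ctrl+d: forward half a window
    • Ctrl+u: backward half a window
  • file ★
    • g: go to the first line
    • G (Shift+g): go to the last line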

Misc.

  • ma: mark the current position with the letter 'a'
  • 'a: go to the marked position 'a'
  • q, ZZ: exit ★
  • v: open the current file in the editor named by $VISUAL or $EDITOR