@dustalov
Last active August 16, 2016 02:12
Link Grammar for Russian (Parser of the Parser)
# encoding: utf-8

require 'strscan'

# Processor of Link Grammar for Russian output.
#
class LinkParser::Lexer
  # This exception is raised when the link grammar is invalid and the
  # Lexer is unable to understand the output.
  #
  class InvalidLinkGrammar < RuntimeError
    attr_reader :input

    # @private
    def initialize(input)
      super('Invalid link grammar')
      @input = input
    end
  end

  # Abstract syntax tree of the parser output.
  #
  AST = Struct.new(:value)

  # A structure that represents a link in Link Grammar.
  # Includes type and position definitions along with the word and its
  # morphosyntactic descriptors.
  #
  Link = Struct.new(:type, :subtype, :id, :word, :msd)

  # A structure that represents a word in Link Grammar. Includes
  # morphosyntactic descriptors.
  #
  Word = Struct.new(:word, :msd)

  attr_reader :input, :lexer
  private :input, :lexer

  # Create a new {Lexer} instance to process the given parser output.
  #
  # @param input [String] output of the parser.
  #
  def initialize(input)
    @input = input
  end

  # Perform parsing of the parser output. This wording is silly, but
  # I really can't implement a good Link Parser right now.
  #
  # @return [Array, Word, Link] the value of the AST of the given
  #   parser output.
  #
  def parse
    @lexer = StringScanner.new(input)
    value = parse_value.value

    # Checked after a successful parse rather than in an `ensure`
    # block, so that an InvalidLinkGrammar error is not masked.
    lexer.eos? or
      raise('Unexpected data: "%s"' % lexer.rest)

    value
  end

  protected

  # Parse any syntactic construction supported by this parser.
  #
  # @return [AST] the AST of the given parser output.
  #
  def parse_value
    trim_space!
    parse_list or
      parse_string or
      parse_link or
      raise InvalidLinkGrammar, input
  ensure
    trim_space!
  end

  # List parser.
  #
  # @return [AST] the AST of the given parser output.
  #
  def parse_list
    return false unless lexer.scan(/\(\s*/)

    list = []
    more_values = false

    # parse_value raises when it reaches the closing parenthesis;
    # the error is swallowed here and ends the loop.
    while (contents = parse_value rescue nil)
      list << contents.value
      more_values = lexer.scan(/\s+/)
    end

    raise 'Missing value' if more_values
    lexer.scan(/\s*\)\s*/) or raise 'Unclosed list'

    AST.new(list)
  end

  # String parser.
  #
  # @return [AST] the AST of the given parser output.
  #
  def parse_string
    return false unless lexer.scan(/"/)

    string = lexer.scan(/[^"]+/)
    lexer.scan(/"/) or raise 'Unterminated string'

    AST.new(Word.new(*classify_word(string)))
  end

  # Link parser.
  #
  # @return [AST] the AST of the given parser output.
  #
  def parse_link
    return false unless (token = lexer.scan(/[\wА-Яа-яЁё!:\-\.\,\?]+/))

    complex_type, id, string = token.split(/:/)
    type, subtype = complex_type.match(/([A-Z]+)(.*)/)[1..2]

    AST.new(Link.new(type, subtype, id.to_i, *classify_word(string)))
  end

  # Skip whitespace characters because we are not interested in them.
  #
  def trim_space!
    lexer.scan(/\s+/)
    self
  end

  # Word classification method that identifies LEFT-WALL, RIGHT-WALL,
  # punctuation and regular word tokens.
  #
  # @param word [String] the word to classify.
  #
  # @return [Array<[String, Symbol], [String, NilClass]>]
  #   classification data.
  #
  def classify_word(word)
    case word
    when 'LEFT-WALL' then [ :left_wall ]
    when 'RIGHT-WALL' then [ :right_wall ]
    when '.' then [ '.' ]
    else
      if unknown_word = word.match(/^\[(.+)\]$/)
        [ unknown_word[1] ]
      else
        word.split('.', 2).map { |s| !s.empty? ? s : nil }
      end
    end
  end
end
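
The token handling in parse_link and classify_word can be tried in isolation. What follows is a minimal standalone sketch, not part of the gist: the helper names `split_link_token` and `classify`, the `LinkToken` struct, and the sample token `Wd:2:мыла.vnpdpfs` are assumptions, with the token form `TYPEsubtype:id:word.msd` inferred from the regular expressions in the class.

```ruby
# encoding: utf-8

# Hypothetical stand-in for LinkParser::Lexer::Link.
LinkToken = Struct.new(:type, :subtype, :id, :word, :msd)

# Mirrors classify_word: returns a [word, msd] pair.
def classify(word)
  case word
  when 'LEFT-WALL'    then [:left_wall, nil]
  when 'RIGHT-WALL'   then [:right_wall, nil]
  when '.'            then ['.', nil]
  when /\A\[(.+)\]\z/ then [$1, nil] # unknown word in square brackets
  else
    # "word.msd" splits at the first dot; empty parts become nil.
    word, msd = word.split('.', 2).map { |s| s.empty? ? nil : s }
    [word, msd]
  end
end

# Mirrors parse_link for a single token of the assumed form
# "TYPEsubtype:id:word.msd".
def split_link_token(token)
  complex_type, id, string = token.split(':', 3)
  match = complex_type.match(/\A([A-Z]+)(.*)\z/) or
    raise ArgumentError, "no link type in #{token.inspect}"
  LinkToken.new(match[1], match[2], id.to_i, *classify(string))
end

link = split_link_token('Wd:2:мыла.vnpdpfs')
wall = split_link_token('RW:5:RIGHT-WALL')
```

Here `link` carries type `"W"`, subtype `"d"`, id `2`, word `"мыла"` and msd `"vnpdpfs"`, while `wall` carries type `"RW"`, an empty subtype and the symbol `:right_wall`.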
@dustalov
Author

Usage of this class:

pp LinkParser.analyze('Мама мыла раму.')

This produces a Hash similar to the following:

{#<struct LinkParser::Lexer::Word word=:left_wall, msd=nil>=>
  [#<struct LinkParser::Lexer::Link
    type="X",
    subtype="p",
    id=4,
    word=".",
    msd=nil>,
   #<struct LinkParser::Lexer::Link
    type="W",
    subtype="d",
    id=2,
    word="мыла",
    msd="vnpdpfs">],
 #<struct LinkParser::Lexer::Word word="мама", msd="nlfsi">=>
  [#<struct LinkParser::Lexer::Link
    type="S",
    subtype="f3",
    id=2,
    word="мыла",
    msd="vnpdpfs">],
 #<struct LinkParser::Lexer::Word word="мыла", msd="vnpdpfs">=>
  [#<struct LinkParser::Lexer::Link
    type="W",
    subtype="d",
    id=0,
    word=:left_wall,
    msd=nil>,
   #<struct LinkParser::Lexer::Link
    type="S",
    subtype="f3",
    id=1,
    word="мама",
    msd="nlfsi">,
   #<struct LinkParser::Lexer::Link
    type="MV",
    subtype="v",
    id=3,
    word="раму",
    msd="ndfsv">],
 #<struct LinkParser::Lexer::Word word="раму", msd="ndfsv">=>
  [#<struct LinkParser::Lexer::Link
    type="MV",
    subtype="v",
    id=2,
    word="мыла",
    msd="vnpdpfs">],
 #<struct LinkParser::Lexer::Word word=".", msd=nil>=>
  [#<struct LinkParser::Lexer::Link
    type="X",
    subtype="p",
    id=0,
    word=:left_wall,
    msd=nil>,
   #<struct LinkParser::Lexer::Link
    type="RW",
    subtype="",
    id=5,
    word=:right_wall,
    msd=nil>],
 #<struct LinkParser::Lexer::Word word=:right_wall, msd=nil>=>
  [#<struct LinkParser::Lexer::Link
    type="RW",
    subtype="",
    id=4,
    word=".",
    msd=nil>]}
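
Since the Hash above maps each Word to the Links attached to it, relations can be looked up by link type. Here is a minimal sketch under stated assumptions: the structs are local stand-ins for LinkParser::Lexer::Word and LinkParser::Lexer::Link, the data is a hand-copied fragment of the output above, and `linked_words` is a hypothetical helper that is not part of the gist. Reading S as the subject-verb link follows the usual Link Grammar convention and is an assumption about the Russian dictionary.

```ruby
# encoding: utf-8

# Local stand-ins for LinkParser::Lexer::Word and LinkParser::Lexer::Link.
Word = Struct.new(:word, :msd)
Link = Struct.new(:type, :subtype, :id, :word, :msd)

# A fragment of the Hash shown above, rebuilt by hand.
links = {
  Word.new('мама', 'nlfsi') => [
    Link.new('S', 'f3', 2, 'мыла', 'vnpdpfs')
  ],
  Word.new('мыла', 'vnpdpfs') => [
    Link.new('W', 'd', 0, :left_wall, nil),
    Link.new('S', 'f3', 1, 'мама', 'nlfsi'),
    Link.new('MV', 'v', 3, 'раму', 'ndfsv')
  ],
  Word.new('раму', 'ndfsv') => [
    Link.new('MV', 'v', 2, 'мыла', 'vnpdpfs')
  ]
}

# Hypothetical helper: the words connected to `word` by links of `type`.
def linked_words(links, word, type)
  entry = links.find { |key, _value| key.word == word }
  word_links = entry ? entry.last : []
  word_links.select { |link| link.type == type }.map(&:word)
end

subjects  = linked_words(links, 'мыла', 'S')  # => ["мама"]
modifiers = linked_words(links, 'мыла', 'MV') # => ["раму"]
```

Because Struct instances compare and hash by value, a freshly built `Word.new('мама', 'nlfsi')` can also be used directly as a key into such a Hash.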

@radixvinni

radixvinni commented Nov 22, 2012

As I understand it, this is a syntactic analyzer. With Russian, one question has always come up.

We can say "Мама мыла раму" ("Mom washed the window frame") and "Раму мыла мама" (the same words in reverse order). From the endings alone it is impossible to tell who washed whom.

@dustalov
Author

dustalov commented Dec 1, 2012

@radixvinni, I only just saw your comment.

This gist is not a syntactic analyzer; it is a parser for the output of the syntactic analyzer web service at http://sz.ru/parser. Many kinds of ambiguity, including semantic ambiguity, are resolved using text corpora or semantic dictionaries (thesauri). Such problems are not solved head-on.

@dustalov
Author

dustalov commented Dec 1, 2012

However, this analyzer does not deal with such things and just does what it knows how to do :)

@mirth

mirth commented Jan 30, 2013

It looks like a typo has crept into link_parser.rb:62.

@dustalov
Author

I'm so inattentive! Thanks, fixed.