# The goal of this problem is to extract headers from a block of text,
# and arrange them hierarchically.
#
# See the specs for more detail on the output
HEADER_HTML = /\<(h[\d])\>([^<>]+)\<\/h[\d]\>/
HEADER_LEVEL_AND_CONTENT = /h(?<level>[\d])(?<content>.*)/
def header_hierarchy(html)
  html.scan(HEADER_HTML).map(&:join).map do |node|
    node.gsub(HEADER_LEVEL_AND_CONTENT) do
      " " * (($~[:level].to_i - 1) * 2) + "[h" + $~[:level] + "] " + $~[:content]
    end
  end
end
describe '#header_hierarchy' do
  context 'EASY' do
    it 'can extract a single header' do
      expect(header_hierarchy("<h1>Foo</h1>")).to eq(['[h1] Foo'])
    end
    it 'can extract one nested level of header' do
      expect(
        header_hierarchy("<h1>Foo</h1><h2>Bar</h2>")
      ).to eq([
        '[h1] Foo',
        '  [h2] Bar'
      ])
    end
  end
  context 'MEDIUM' do
    it 'can extract multiple levels of nested headers' do
      expect(
        header_hierarchy("<h1>Foo</h1><h2>Bar</h2><h3>Baz</h3><h4>Bam</h4>")
      ).to eq([
        '[h1] Foo',
        '  [h2] Bar',
        '    [h3] Baz',
        '      [h4] Bam'
      ])
    end
  end
  context 'HARD' do
    it 'can extract multiple nested headers in multiple branches' do
      expect(
        header_hierarchy("<h1>Foo</h1><h2>Bar</h2><h3>Baz</h3><h2>Bam</h2><h3>Ba</h3>")
      ).to eq([
        '[h1] Foo',
        '  [h2] Bar',
        '    [h3] Baz',
        '  [h2] Bam',
        '    [h3] Ba'
      ])
    end
  end
end
@JoshCheek

Hi, I modified it slightly so that it also works on the jQuery site. I removed the "[H0]" requirement because it seemed like an implementation detail of the original solution. There are still limitations, and the regex approach ultimately won't be able to handle them all, but I like that it gets really far with very little overhead (e.g. Nokogiri is a big dependency, and the regex approach will work in many of the contexts where this function may be used).

# The goal of this problem is to extract headers from a block of text,
# and arrange them hierarchically.
#
# See the specs for more detail on the output

HEADER_HTML = /\<(h[\d])\b.*?\>(.*?)\<\/h[\d]\>/
STRIP_TAGS  = /<.*?>/
HEADER_LEVEL_AND_CONTENT = /h(?<level>[\d])(?<content>.*)/

def header_hierarchy(html)
  html.scan(HEADER_HTML).map(&:join).map do |node|
    node.gsub(STRIP_TAGS, '').gsub(HEADER_LEVEL_AND_CONTENT) do
      " " * (($~[:level].to_i - 1) * 2) + "[h" + $~[:level] + "] " + $~[:content]
    end
  end
end

describe '#header_hierarchy' do
  context 'EASY' do
    it 'can extract a single header' do
      expect(header_hierarchy("<h1>Foo</h1>")).to eq(['[h1] Foo'])
    end

    it 'can extract one nested level of header' do
      expect(
        header_hierarchy("<h1>Foo</h1><h2>Bar</h2>")
      ).to eq([
        '[h1] Foo',
        '  [h2] Bar'
      ])
    end
  end

  context 'MEDIUM' do
    it 'can extract multiple levels of nested headers' do
      expect(
        header_hierarchy("<h1>Foo</h1><h2>Bar</h2><h3>Baz</h3><h4>Bam</h4>")
      ).to eq([
        '[h1] Foo',
        '  [h2] Bar',
        '    [h3] Baz',
        '      [h4] Bam'
      ])
    end
  end

  context 'HARD' do
    it 'can extract multiple nested headers in multiple branches' do
      expect(
        header_hierarchy("<h1>Foo</h1><h2>Bar</h2><h3>Baz</h3><h2>Bam</h2><h3>Ba</h3>")
      ).to eq([
        '[h1] Foo',
        '  [h2] Bar',
        '    [h3] Baz',
        '  [h2] Bam',
        '    [h3] Ba'
      ])
    end
  end

  describe 'True parsing' do
    require 'net/http'
    let :html do
      # cache the html to reduce the cost of this test
      file_path = File.expand_path 'jquery.html', __dir__
      File.exist? file_path or
        File.write file_path, Net::HTTP.get(URI("https://jquery.com/"))
      File.read file_path
    end

    it 'can parse an entire document' do
      expect(header_hierarchy html).to eq [
        "  [h2] jQuery",
        "    [h3] Lightweight Footprint",
        "    [h3] CSS3 Compliant",
        "    [h3] Cross-Browser",
        "  [h2] What is jQuery?",
        "  [h2] Other Related Projects",
        "    [h3] Resources",
        "  [h2] A Brief Look",
        "    [h3] DOM Traversal and Manipulation",
        "    [h3] Event Handling",
        "    [h3] Ajax",
        "    [h3] Books"
      ]
    end
  end
end
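
To make the limitations of the regex approach mentioned above concrete, here is a minimal sketch of two inputs the pattern mishandles. `HEADER_RE` is just a copy of `HEADER_HTML` from the code above, repeated so the snippet runs on its own. `STRICT_HEADER_RE` is a hypothetical variant that is not part of the original solution; it is one possible tightening, not a full fix.

```ruby
# Copy of HEADER_HTML from the code above, renamed so this sketch is self-contained.
HEADER_RE = /\<(h[\d])\b.*?\>(.*?)\<\/h[\d]\>/

# 1. `.` does not match a newline by default, so a header split across lines is skipped.
multiline = "<h1>Foo\nBar</h1>"
p multiline.scan(HEADER_RE)         # => []

# 2. The closing tag is not tied to the opening tag, so mismatched tags are accepted.
mismatched = "<h1>Foo</h2>"
p mismatched.scan(HEADER_RE)        # => [["h1", "Foo"]]

# Hypothetical stricter variant (an assumption, not from the original):
# /m lets `.` span newlines, and \1 requires the closing tag to repeat the opening tag name.
STRICT_HEADER_RE = /<(h\d)\b[^>]*>(.*?)<\/\1>/m
p multiline.scan(STRICT_HEADER_RE)  # => [["h1", "Foo\nBar"]]
p mismatched.scan(STRICT_HEADER_RE) # => []
```

Even the stricter pattern still treats HTML as flat text, so headers inside comments or scripts would be picked up as well. Past that point a real parser is probably the better tool.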
