@agnaldo4j
Last active October 15, 2016 11:20
Crawler example with mochiweb_html and mochiweb_xpath.

Elixir simple crawler example

This example uses these projects:

  1. mochiweb
  2. mochiweb_xpath

The HtmlPageReader fetches an HTML page and parses its content with mochiweb_html.parse.

defmodule Usecase.HtmlPageReader do

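  # Fetches the page at `url` (a charlist) and returns the parsed HTML tree.
  def read(url) do
    url
    |> :httpc.request()
    |> read_result()
    |> read_body()
    |> parse_body()
  end

  # Unwraps a successful :httpc response; an error tuple matches no clause and raises.
  defp read_result({:ok, result}) do
    result
  end

  defp read_body({_status, _header, body}) do
    body
  end

  defp parse_body(body) do
    :mochiweb_html.parse(body)
  end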
end
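
The private functions above only match a successful response, so a failed request raises a FunctionClauseError. As a minimal sketch of the tuple shape that :httpc.request/1 returns, the snippet below destructures hand-built responses and returns tagged results instead. The module name ResponseShapeSketch and the function body_of/1 are hypothetical and not part of this gist, and the sample tuples are invented so that no network access is needed.

```elixir
defmodule ResponseShapeSketch do
  # Hypothetical helper, not part of the gist: returns {:ok, body} for a
  # 200 response and a tagged error for anything else.
  def body_of({:ok, {{_http_version, 200, _reason}, _headers, body}}) do
    {:ok, body}
  end

  def body_of({:ok, {{_http_version, status, _reason}, _headers, _body}}) do
    {:error, {:http_status, status}}
  end

  def body_of({:error, reason}) do
    {:error, reason}
  end
end

# Hand-built tuples in the shape {:ok, {status_line, headers, body}}.
ok_response = {:ok, {{~c"HTTP/1.1", 200, ~c"OK"}, [], ~c"<html></html>"}}
not_found = {:ok, {{~c"HTTP/1.1", 404, ~c"Not Found"}, [], ~c""}}

IO.inspect(ResponseShapeSketch.body_of(ok_response))
IO.inspect(ResponseShapeSketch.body_of(not_found))
IO.inspect(ResponseShapeSketch.body_of({:error, :timeout}))
```

Whether to raise or to return a tagged tuple is a design choice; the gist keeps the happy path only, which is reasonable for a short example.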

The ErlangNewsReader traverses the parsed tree to find news items and transforms each one into a map with title and text keys.

defmodule Usecase.ErlangNewsReader do

  # Returns a list of %{title: ..., text: ...} maps, one per news item.
  def find_news_from(tree) do
    '//div/h3[contains(text(), "NEWS")]/..'
    |> execute_xpath(tree)
    |> first_element()
    |> list_of_news()
    |> Enum.map(&parse_new/1)
  end

  defp list_of_news(news_container) do
    execute_xpath('/div/div/div', news_container)
  end

  # Extracts the title and the summary text of a single news node.
  defp parse_new(node) do
    title = execute_xpath('div/p/a/text()', node) |> first_element()
    text = execute_xpath('div/div/text()', node) |> first_element()
    %{title: title, text: text}
  end

  defp execute_xpath(xpath, node) do
    :mochiweb_xpath.execute(xpath, node)
  end

  # Returns the first match, or an empty string when nothing matched.
  defp first_element([head | _tail]) do
    head
  end

  defp first_element([]) do
    ""
  end
end
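
The two first_element/1 clauses decide what happens when a query matches nothing: an empty list becomes an empty string. That is presumably why a news item without a summary ends up with an empty text value. The sketch below copies those two clauses into a public function so they can be called directly. The module name FirstElementSketch is hypothetical, and the input lists are hand-written stand-ins for XPath results, not real crawler output.

```elixir
defmodule FirstElementSketch do
  # Same two clauses as the private first_element/1 in Usecase.ErlangNewsReader.
  def first_element([head | _tail]), do: head
  def first_element([]), do: ""
end

# A non-empty result yields its first entry; an empty result yields "".
with_match = FirstElementSketch.first_element(["Erlang/OTP 19.0 has been released", "ignored"])
without_match = FirstElementSketch.first_element([])

IO.inspect(with_match)
IO.inspect(without_match)
```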

The ErlangPageReader is a facade for the other modules and specifies which web page is parsed. In this case we'll parse the news content of http://www.erlang.org. 😜

defmodule Usecase.ErlangPageReader do
  def read_news do
    'http://www.erlang.org'
    |> Usecase.HtmlPageReader.read()
    |> Usecase.ErlangNewsReader.find_news_from()
  end
end

❗ To run this code, you need to start the inets application first.

:inets.start
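
As an alternative sketch, assuming you prefer Elixir's Application module over calling :inets directly, the following starts inets together with any applications it depends on and then checks that it is running. The variable names are illustrative only.

```elixir
# Start :inets and anything it depends on. The second element lists the
# applications that were started by this call (empty if already running).
{:ok, _started} = Application.ensure_all_started(:inets)

# Collect the names of all running applications and check for :inets.
running = for {app, _description, _version} <- Application.started_applications(), do: app
inets_running? = :inets in running

IO.inspect(inets_running?)
```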

At the time I wrote this example, the result was:

[
 %{
  text: "\n             Erlang/OTP 19.0 is a new major release with new features, quite a few (characteristics) improvements, as well as a few incompatibilities.\n          ",
  title: "Erlang/OTP 19.0 has been released"
 },
 %{
  text: "", 
  title: "Notes from OTP Technical Board"
 },
 %{
  text: "\n            This is the release candidate before the final OTP 19.0 product release in June 2016.\n          ",
  title: "Erlang/OTP 19.0-rc1 is available for testing"
 }
]