@cljoly
Last active February 18, 2016 15:23
#!/usr/bin/env ocamlscript
Ocaml.packs := [ "str" ; "lambdasoup" ; "markup" ]
--
(*
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
For more information, please refer to <http://unlicense.org/>
*)
open Soup;;
(* Matches internal links to oclaunch.eu.org that end in .html;
   group 1 captures the URL without that suffix *)
let website_specific =
  "\\(\\(https?://\\)?\\(www\\.\\)?oclaunch\\.eu\\.org/.*\\)\\.html$";;
(* Rewrite example.com/page.html to example.com/page *)
let remove_internal_html link =
  R.attribute "href" link
  |> (fun href_url ->
    (* If the pattern does not match, replace_first returns the URL as-is *)
    let url_without_html =
      Str.(replace_first
        (regexp website_specific)
        "\\1"
        href_url)
    in
    set_attribute "href" url_without_html link)
;;
let () =
  let (soup:soup node) =
    Markup.(channel stdin |> parse_xml |> signals)
    |> from_signals
  in
  (* The HTML header is removed during processing; to keep it, it has to be
     printed at this point *)
  soup $$ "a[href]"
  |> iter remove_internal_html;
  let result = soup |> signals in
  Markup.(write_xml result |> to_channel stdout)
;;
@aantron

aantron commented Feb 18, 2016

https://gist.github.com/leowzukw/c29e1a3dbee5c08f17c8#file-remove_trailing_html-ml-L34: I think you want something like

"\\(^.*\\)\\.html$"

or, even simpler,

"\\(\\.html\\)$"

and replace the group with "".
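
As a rough illustration of the second suggestion, here is a minimal sketch using only the Str library. The helper name `strip_trailing_html` and the sample URLs are made up for the example and are not part of the gist. Since the whole match is replaced with the empty string, the sketch leaves the capture group out. Note that this pattern is not restricted to one site: it strips a trailing ".html" from any URL.

```ocaml
(* Requires the Str library that ships with OCaml. *)

(* Matches a literal ".html" only at the very end of the string. *)
let trailing_html = Str.regexp "\\.html$"

(* Hypothetical helper: drop a trailing ".html".
   Strings that do not match are returned unchanged. *)
let strip_trailing_html url = Str.replace_first trailing_html "" url

let () =
  List.iter
    (fun url -> Printf.printf "%s -> %s\n" url (strip_trailing_html url))
    [ "http://www.oclaunch.eu.org/page.html";
      "page.html";
      "http://www.oclaunch.eu.org/xhtml";
      "http://www.oclaunch.eu.org/page.html#top" ]
```

Because the dot is escaped and the pattern is anchored with `$`, a URL ending in "/xhtml" or one with a fragment after ".html" is left alone in this sketch.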
