@szemate
Last active September 15, 2022 12:34
HTML converter exercise

HTML to Plain Text Converter

Create a script that takes the name of an HTML file as a command-line argument and prints its visible content as plain text to standard output (the terminal).

  • Only the body contents should be printed.
  • All HTML tags and indentation should be removed.
  • Contents of block elements (<p>, <h#>, <div>) should be separated by empty lines.
  • Line breaks (<br />) should be respected.
  • Only <br /> elements should be treated as line breaks; newline characters in the source should not.
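
One possible approach to the rules above, sketched in Python with the standard library's html.parser module. This is a starting point under stated assumptions, not a reference solution: the names TextExtractor, html_to_text and main are made up for illustration, the file is assumed to be UTF-8 encoded, block elements are assumed not to be nested in ways that matter, and elements such as <script> or <style> inside the body get no special treatment.

```python
import sys
from html.parser import HTMLParser

# Elements whose contents are separated from their neighbours by an empty line.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects the text inside <body> as blocks, each block a list of lines."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.blocks = []   # finished blocks
        self.lines = [""]  # lines of the block currently being read

    def _flush(self):
        # Collapse every run of whitespace (indentation and newlines included)
        # into a single space, then keep the block only if it has any text.
        cleaned = [" ".join(line.split()) for line in self.lines]
        if any(cleaned):
            self.blocks.append(cleaned)
        self.lines = [""]

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif not self.in_body:
            return
        elif tag == "br":
            self.lines.append("")  # <br /> is the only source of line breaks
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "body":
            self._flush()
            self.in_body = False
        elif self.in_body and tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.in_body:
            self.lines[-1] += data


def html_to_text(html):
    """Return the visible body text of an HTML string as plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep trailing text if </body> is missing
    return "\n\n".join("\n".join(block) for block in parser.blocks)


def main():
    """Command-line entry point: expects the HTML file name as the only argument."""
    with open(sys.argv[1], encoding="utf-8") as f:
        print(html_to_text(f.read()))
```

Adding a call to main() at the end of the file would let the script be run with the HTML file name as its argument. Inline tags such as <b> are simply ignored, so their text joins the surrounding line without an extra space.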

Example input (index.html):

<!DOCTYPE html>
<html>
<head>
  <title>Lorem ipsum</title>
</head>
<body>
  <h1>Lorem Ipsum</h1>
  <h2>Dolor Sit Amet</h2>
  <p>
    <b>C</b>onsectetur adipiscing elit, sed do eiusmod
    tempor incididunt ut labore et dolore magna aliqua.<br />
    Ut enim ad minim veniam, quis nostrud exercitation ullamco
    laboris nisi ut aliquip ex ea commodo consequat.
  </p>
  <p>
    <b>D</b>uis aute irure dolor in reprehenderit in voluptate
    velit esse cillum dolore eu fugiat nulla pariatur.
  </p>
</body>
</html>

Example output:

Lorem Ipsum

Dolor Sit Amet

Consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.