<!DOCTYPE html
	PUBLIC "-//W3C//DTD HTML 4.01//EN"
	"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv='Content-Type' content='text/html;charset=UTF-8'>
<title>Assignment 2. Shell scripting</title>
</head>

<body>

<h1>Assignment 2. Shell scripting</h1>

<h2>Laboratory: Spell-checking Hawaiian</h2>

<p>Keep a log in the file <samp>lab2.log</samp> of what you do in the
lab so that you can reproduce the results later. This should not
merely be a transcript of what you typed: it should be more like a
true lab notebook, in which you briefly note down what you did and
what happened.</p>

<p>For this laboratory we assume you're in the standard C or <a
href='http://en.wikipedia.org/wiki/POSIX'>POSIX</a> <a
href='http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07'>locale</a>. The
shell command <a
href='http://pubs.opengroup.org/onlinepubs/9699919799/utilities/locale.html'><samp>locale</samp></a>
should output <samp>LC_CTYPE="C"</samp>
or <samp>LC_CTYPE="POSIX"</samp>. If it doesn't, use the following
shell command:</p>

<pre><samp><a href='http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#export'>export</a> LC_ALL='C'
</samp></pre>

<p>and make sure <samp>locale</samp> outputs the right thing afterwards.</p>

<p>We also assume the file <samp>words</samp> contains a sorted list of
English words. Create such a file by sorting the contents of the file
<samp>/usr/share/dict/words</samp> on the SEASnet GNU/Linux hosts, and putting
the result into a file named <samp>words</samp> in your working
directory. To do that, you can use
the <samp><a href='http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html'>sort</a></samp>
command.</p>
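<p>A minimal sketch of that sorting step. The stand-in file
<samp>dict</samp> and the temporary directory are assumptions made so that
the example runs anywhere; on the SEASnet hosts the input would be
<samp>/usr/share/dict/words</samp>. In the C locale <samp>sort</samp>
compares bytes, so upper-case letters sort before lower-case ones.</p>

```shell
export LC_ALL=C

# Hypothetical stand-in for /usr/share/dict/words.
tmpdir=$(mktemp -d)
printf '%s\n' apple Banana cherry Apple > "$tmpdir/dict"

# Sort in the C locale and save the result as "words".
sort "$tmpdir/dict" > "$tmpdir/words"
sorted=$(cat "$tmpdir/words")
printf '%s\n' "$sorted"

# sort -c exits with status 0 only if its input is already sorted.
sort -c "$tmpdir/words" && check=ok

rm -rf "$tmpdir"
```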

<p>Then, take a text file containing the HTML in this
assignment's web page, and run the following commands with that
text file being standard input. Describe generally what each command
outputs (in particular, how its output differs from that of the
previous command), and why.</p>

<pre><samp>tr -c 'A-Za-z' '[\n*]'
tr -cs 'A-Za-z' '[\n*]'
tr -cs 'A-Za-z' '[\n*]' | sort
tr -cs 'A-Za-z' '[\n*]' | sort -u
tr -cs 'A-Za-z' '[\n*]' | sort -u | comm - words
tr -cs 'A-Za-z' '[\n*]' | sort -u | comm -23 - words # ENGLISHCHECKER
</samp></pre>
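<p>To get a feel for the last command before running it on the real page,
here is a sketch that applies it to a one-line input. The sample sentence,
the three-word dictionary, and the function name
<samp>english_checker</samp> are all made up for illustration.</p>

```shell
export LC_ALL=C

# Hypothetical sample input and tiny sorted dictionary.
tmpdir=$(mktemp -d)
printf '%s\n' 'The cat, the CAT & a zzyzx!' > "$tmpdir/input"
printf '%s\n' The a cat | sort > "$tmpdir/words"

# Same pipeline as ENGLISHCHECKER, with the dictionary file passed as $1.
english_checker() {
  tr -cs 'A-Za-z' '[\n*]' | sort -u | comm -23 - "$1"
}

result=$(english_checker "$tmpdir/words" < "$tmpdir/input")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

<p>The comparison is case-sensitive, so <samp>CAT</samp> and
<samp>the</samp> are reported even though <samp>cat</samp> and
<samp>The</samp> are in the sample dictionary.</p>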

<p>Let's take the last command (marked ENGLISHCHECKER)
as a crude implementation of an
English spelling checker. Suppose we want to change ENGLISHCHECKER to be a
spelling checker for
<a href='http://en.wikipedia.org/wiki/Hawaiian_language'>Hawaiian</a>,
a language whose traditional orthography
has only the following
letters (or their capitalized equivalents):</p>

<pre><samp>p k ' m n w l h a e i o u
</samp></pre>

<p>In this lab for convenience we use ASCII apostrophe (') to
represent the Hawaiian &#8216;okina (&#8216;); it has no capitalized
equivalent.</p>

<p>Create in the file <samp>hwords</samp> a simple Hawaiian
dictionary containing a copy of all the Hawaiian words in
the tables in
"<a href='https://www.mauimapp.com/moolelo/hwnwdshw.htm'>Hawaiian to English</a>", an introductory list of words.
Use <a href='http://www.gnu.org/software/wget/'>Wget</a> to obtain
your copy of that web page.
Extract these words systematically from the tables in "Hawaiian to English".
Remove all instances of "<samp>?</samp>", "<samp>&lt;u&gt;</samp>" and
"<samp>&lt;/u&gt;</samp>".
For each remaining line of the form
"<samp><var>A</var>&lt;td<var>X</var>&gt;<var>W</var>&lt;/td&gt;<var>Z</var></samp>",
where <var>A</var> and <var>Z</var> are zero or more spaces,
<var>X</var> contains no "<samp>&gt;</samp>" characters
and <var>W</var> consists entirely of Hawaiian characters or spaces,
assume that <var>W</var> contains zero or more nonempty Hawaiian words
and extract those words; each word is a maximal sequence of one or
more Hawaiian characters.
Treat upper case letters as if they were lower case, and treat
<samp>`</samp> (ASCII grave accent) as if it were <samp>'</samp>
(ASCII apostrophe, which we use to represent &#8216;okina).
For example, the entry "<samp>H&lt;u&gt;a&lt;/u&gt;`ule lau</samp>"
contains the two words
"<samp>ha'ule</samp>" and "<samp>lau</samp>".
Sort the resulting list of words, removing any duplicates.
Do not attempt to repair any remaining problems by hand; just use the
systematic rules mentioned above. Automate the systematic rules
into a shell script <samp>buildwords</samp>, which you should copy into your
log; it should read the HTML from standard input and write a sorted list of
unique words to standard output.  For example, we should be able to run this
script with a command like this:</p>

<pre><samp>cat foo.html bar.html | ./buildwords | less
</samp></pre>

<p>If the shell script has bugs and
doesn't do all the work, your log should record in detail each bug
it has.</p>
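<p>One possible reading of the systematic rules, written as a shell function
so that it can be tried on a few sample lines. This is a sketch under
assumptions, not a reference solution: the order of the steps, the choice to
lower-case after extraction, and the sample table rows are all illustrative.
It does nothing about carriage returns or other quirks the real page may
have.</p>

```shell
export LC_ALL=C

buildwords() {
  # 1. Remove every "?", "<u>" and "</u>".
  sed -e 's/?//g' -e 's/<u>//g' -e 's/<\/u>//g' |
    # 2. Treat grave accent as apostrophe.
    tr '`' "'" |
    # 3. Keep only lines of the form A<tdX>W</td>Z and print W.
    sed -n "s/^ *<td[^>]*>\([PKMNWLHAEIOUpkmnwlhaeiou' ]*\)<\/td> *\$/\1/p" |
    # 4. Treat upper case as lower case.
    tr 'A-Z' 'a-z' |
    # 5. One word per line, dropping empty lines.
    tr -s ' ' '\n' |
    sed '/^$/d' |
    # 6. Sort and remove duplicates.
    sort -u
}

# Hypothetical sample rows; the third is rejected (t and r are not Hawaiian).
bwords=$(printf '%s\n' \
  '<td>H<u>a</u>`ule lau</td>' \
  '  <td align="left">Aloha</td>  ' \
  '<td>Hello there</td>' \
  '<td>lau</td>' | buildwords)
printf '%s\n' "$bwords"
```

<p>A standalone <samp>buildwords</samp> script would contain the pipeline
itself rather than a function and a sample, so that it reads standard
input directly.</p>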

<p>From the ENGLISHCHECKER command, derive a shell command
HAWAIIANCHECKER that checks the
spelling of Hawaiian rather than English, under the assumption
that <samp>hwords</samp> is a Hawaiian dictionary and that every
maximal nonempty sequence of ASCII letters or apostrophe is intended
to be a Hawaiian word and needs its spelling checked. Input that
is upper case should be lower-cased before it is checked against the
dictionary, since the dictionary is in all lower case.</p>
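<p>A sketch of how such a command might be derived, assuming the only
changes needed are lower-casing the input and adding the apostrophe to the
set of word characters. The function name <samp>hawaiian_checker</samp>, the
sample sentence, and the three-word dictionary are hypothetical.</p>

```shell
export LC_ALL=C

# $1 is the sorted Hawaiian dictionary; text to check arrives on stdin.
hawaiian_checker() {
  tr 'A-Z' 'a-z' |
    tr -cs "A-Za-z'" '[\n*]' |
    sort -u |
    comm -23 - "$1"
}

# Hypothetical dictionary and input.
tmpdir=$(mktemp -d)
printf '%s\n' aloha "ha'ule" lau > "$tmpdir/hwords"
hresult=$(printf '%s\n' "Aloha! Ha'ule LAU wiki" |
  hawaiian_checker "$tmpdir/hwords")
printf '%s\n' "$hresult"

rm -rf "$tmpdir"
```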

<p>Check your work by running your Hawaiian spelling checker on
<a href='assign2.html'>this very web page</a> (which you should also
fetch with Wget), and on the Hawaiian dictionary <samp>hwords</samp>
itself. Count the number of distinct misspelled words on this very web
page, using both ENGLISHCHECKER and HAWAIIANCHECKER. Count the number
of distinct words on this very web page that ENGLISHCHECKER reports as
misspelled but HAWAIIANCHECKER does not, and give two examples of
these words.  Similarly, count the number of distinct words (and give
two examples) that HAWAIIANCHECKER reports as misspelled on this very
web page but ENGLISHCHECKER does not.</p>
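<p>The counts can be obtained by comparing the two checkers' outputs with
<samp>comm</samp>. The sketch below uses made-up output files to show the
idea; the file names and their contents are assumptions, and how to treat
differences in case between the two lists is left open.</p>

```shell
export LC_ALL=C

# Hypothetical checker outputs, each already sorted and free of duplicates.
tmpdir=$(mktemp -d)
printf '%s\n' alpha halau kane wiki > "$tmpdir/english.out"
printf '%s\n' alpha lab the wiki > "$tmpdir/hawaiian.out"

# Reported by the English checker only, then by the Hawaiian checker only.
# The $(( )) wrapper strips any padding that wc may add.
only_english=$(($(comm -23 "$tmpdir/english.out" "$tmpdir/hawaiian.out" | wc -l)))
only_hawaiian=$(($(comm -13 "$tmpdir/english.out" "$tmpdir/hawaiian.out" | wc -l)))
echo "English only: $only_english, Hawaiian only: $only_hawaiian"

rm -rf "$tmpdir"
```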

<h2>Homework: Find poorly-named files</h2>

<p><em>Warning: it will be difficult to do this homework without
  attending the lab session for hints.</em></p>

<p>You're working on a project that has lots of files and
  is intended to be portable to a wide variety of systems,
  some of which have fairly restrictive rules for file names.
  Your project has established the following portability guidelines:</p>

<ol>
  <li>A file name component (i.e., the part of file names
    other than "<samp>/</samp>") must contain only ASCII
    letters, "<samp>.</samp>", "<samp>-</samp>", and "<samp>_</samp>".</li>
  <li>A file name component cannot start with "<samp>-</samp>".</li>
  <li>Except for "<samp>.</samp>" and "<samp>..</samp>",
    a file name component cannot start with "<samp>.</samp>".</li>
  <li>A file name component must not contain more than 14 characters.</li>
  <li>No two file names in your project can differ only in case.
    For example, if your project has a file name with component
    "<samp>St._Andrews</samp>" it cannot also have a file
    name with component "<samp>st._anDrEWS</samp>" in the same directory.</li>
</ol>
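<p>The guidelines above can be sketched as two small shell functions: one
that tests a single component against guidelines 1 to 4, and one that finds
names in a single directory that differ only in case. The function names
are hypothetical, and this is only one way to express the rules.</p>

```shell
export LC_ALL=C

# Exit status 0 if the component in $1 violates guidelines 1-4, else 1.
violates_guidelines() {
  case $1 in
    . | ..) return 1 ;;           # explicitly permitted
    *[!A-Za-z._-]*) return 0 ;;   # guideline 1: disallowed character
    -* | .*) return 0 ;;          # guidelines 2 and 3: bad first character
  esac
  [ "${#1}" -gt 14 ]              # guideline 4: more than 14 characters
}

# Read one directory's entry names on stdin (one per line) and print
# those that collide with another entry when case is ignored.
case_collisions() {
  cc_names=$(cat)
  cc_dups=$(printf '%s\n' "$cc_names" | tr 'A-Z' 'a-z' | sort | uniq -d)
  [ -n "$cc_dups" ] || return 0
  printf '%s\n' "$cc_names" | while IFS= read -r cc_name; do
    cc_lower=$(printf '%s\n' "$cc_name" | tr 'A-Z' 'a-z')
    if printf '%s\n' "$cc_dups" | grep -Fxq -e "$cc_lower"; then
      printf '%s\n' "$cc_name"
    fi
  done
  return 0
}

collide=$(printf '%s\n' St._Andrews st._anDrEWS other | case_collisions)
printf '%s\n' "$collide"
```

<p>Note that guideline 1 allows letters only, so a name such as
<samp>file1</samp> counts as a violation.</p>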

<p>Your boss has given you the job of looking for projects that do not
  follow these guidelines. Write a shell script <samp>poornames</samp>
  that accepts a project's directory name <var>D</var> as an operand
  and looks for violations of one or more of the guidelines in the last
  file name component of <var>D</var>, or in files under <var>D</var>;
  the files might be immediate children of <var>D</var>, or might be
  under some subdirectory (recursively).</p>

<p>If given no operands, <samp>poornames</samp> should act as
  if <var>D</var> is "<samp>.</samp>", the current working directory.
  If given two or more operands, or a single operand that begins
  with "<samp>-</samp>" or is not the name of a directory,
  <samp>poornames</samp> should diagnose the problem on stderr and exit
  with a failure status.
  If <samp>poornames</samp> encounters an error when traversing a directory
  (e.g., lack of permissions to read a subdirectory) it should diagnose
  the problem on stderr, but it need not exit with a failure status.</p>
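<p>A sketch of the operand checks described above. The function name
<samp>parse_operands</samp> and the wording of the diagnostics are
assumptions; a real script would call it as
<samp>parse_operands "$@" || exit 1</samp>.</p>

```shell
# Set D from the operands, or diagnose the problem on stderr and return 1.
parse_operands() {
  if [ "$#" -gt 1 ]; then
    printf '%s\n' "poornames: too many operands" >&2
    return 1
  fi
  D=${1-.}    # no operand: use the current working directory
  case $D in
    -*)
      printf '%s\n' "poornames: operand begins with '-': $D" >&2
      return 1 ;;
  esac
  if [ ! -d "$D" ]; then
    printf '%s\n' "poornames: not a directory: $D" >&2
    return 1
  fi
}
```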

<p>The <samp>poornames</samp> script should output a
  line containing each full file name if and only if the file name's
  last component does not follow the guidelines.
  For example, the command "<samp>poornames /usr/bin</samp>"
  should output a line "<samp>/usr/bin/[</samp>" because
  there is a directory entry with that name, and
  its last component "<samp>[</samp>" contains a character
  that falls outside the guidelines.
  The <samp>poornames</samp> script should not output duplicate lines.
  The order of output lines does not matter.</p>

<p>The <samp>poornames</samp> script should not follow symbolic links
  when recursively looking for poorly-named files. However, it should
  check the symbolic links' names, just as it checks the names of
  all other files.</p>

<p>The <samp>poornames</samp> script should work with file names whose
components contain special characters like spaces, "*", and leading
"<samp>-</samp>" (except that <var>D</var> itself cannot begin with
"<samp>-</samp>"). However, you need not worry about file names
containing newlines.
</p>
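<p>The sketch below shows how <samp>find</samp> copes with awkward names
and symbolic links. It builds a throwaway directory tree (an assumption for
the demonstration, created with <samp>mktemp</samp>, which is not a POSIX
utility) and checks only for disallowed characters and a leading
"<samp>-</samp>". By default <samp>find</samp> does not follow symbolic
links, so the file behind <samp>link</samp> is never visited.</p>

```shell
export LC_ALL=C

# Hypothetical test tree with awkward names and a symlink leading outside it.
demo=$(mktemp -d)
mkdir "$demo/tree" "$demo/outside"
touch "$demo/tree/good" "$demo/tree/has space" "$demo/tree/-dash" \
  "$demo/tree/star*" "$demo/outside/bad+name"
ln -s "$demo/outside" "$demo/tree/link"

# -name matches the last component only; quoting keeps the shell from
# expanding the patterns before find sees them.
found=$(cd "$demo" &&
  find tree \( -name '*[!A-Za-z._-]*' -o -name '-*' \) -print | sort)
printf '%s\n' "$found"

rm -rf "$demo"
```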

<p>Your script should be runnable as an ordinary user, and should be
portable to any system that
supports <a href='http://pubs.opengroup.org/onlinepubs/9699919799/utilities/toc.html'>standard
POSIX shell and utilities</a>; please see
its <a href='http://pubs.opengroup.org/onlinepubs/9699919799/idx/utilities.html'>list
of utilities</a> for the commands that your script may use. (Hint: see
the <samp>find</samp> and <samp>sort</samp>
utilities.)</p>

<p>When testing your script, it is a good idea to do the testing in a
subdirectory devoted just to testing.  This will reduce the likelihood
of messing up your home directory or main development directory if
your script goes haywire.</p>

<p>Once your script passes your initial tests, try it out on
  three test cases <samp>/usr/bin</samp>, <samp>/usr/lib</samp>,
  and <samp>/usr/share/man</samp> on the SEASnet GNU/Linux
  host <samp>lnxsrv07</samp>. Save the outputs of these runs
  into the six files <samp>bin.1</samp>, <samp>bin.2</samp>,
  <samp>lib.1</samp>, <samp>lib.2</samp>, <samp>man.1</samp>,
  and <samp>man.2</samp>, where the <samp>.1</samp> files
  record standard output and the <samp>.2</samp> files
  record standard error.</p>
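<p>Saving the two output streams separately is a matter of redirection. In
this sketch <samp>fake_poornames</samp> is a hypothetical stand-in that
writes one line to each stream; the real runs would invoke
<samp>poornames</samp> instead and write to the current directory.</p>

```shell
tmpdir=$(mktemp -d)

# Hypothetical stand-in for the real script: one line on each stream.
fake_poornames() {
  printf '%s\n' "$1/["
  printf '%s\n' "poornames: $1/private: Permission denied" >&2
}

# ">" captures standard output and "2>" captures standard error.
fake_poornames /usr/bin > "$tmpdir/bin.1" 2> "$tmpdir/bin.2"

stdout_saved=$(cat "$tmpdir/bin.1")
stderr_saved=$(cat "$tmpdir/bin.2")
rm -rf "$tmpdir"
```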

<h2>Submit</h2>

<p>Submit the following files.</p>
<ul>

<li>The script <samp>buildwords</samp> as described in the lab.</li>

<li>The file <samp>lab2.log</samp> as described in the lab.</li>

<li>The <samp>poornames</samp> script described in the homework.</li>

<li>The six test output files described in the homework.</li>
</ul>

<p><em>Warning: You should edit your files with Emacs or with other
    editors that end lines with plain LF (i.e., newline
    or <samp>'\n'</samp>).</em> Do not use Notepad or similar tools
    that may convert line endings
    to <a href='https://en.wikipedia.org/wiki/CRLF'>CRLF</a> form.</p>

<p>All files should be ASCII text files, with no
carriage returns, and with no more than 80 columns per line (except possibly
for the test output files). The shell
command:</p>

<pre><samp>awk '/\r/ || 80 &lt; length' buildwords lab2.log poornames
</samp></pre>

<p>should output nothing.</p>
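<p>If the check does print something, a variant that reports the file name
and line number can make the offending lines easier to find. This is a
sketch using three made-up test files; the action block is an addition to
the command above, not part of it.</p>

```shell
export LC_ALL=C

# Hypothetical test files: one clean, one with CRLF, one with an 81-column line.
tmpdir=$(mktemp -d)
printf 'short line\n' > "$tmpdir/ok"
printf 'dos line\r\n' > "$tmpdir/crlf"
awk 'BEGIN { while (i++ < 81) printf "x"; print "" }' > "$tmpdir/long"

# Same condition as the submission check, but print where each problem is.
offenders=$(cd "$tmpdir" &&
  awk '/\r/ || length($0) > 80 { print FILENAME ":" FNR }' ok crlf long)
printf '%s\n' "$offenders"

rm -rf "$tmpdir"
```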


<hr>
<address>
 &copy; 2005, 2007, 2008, 2010, 2013, 2019 Paul Eggert.
 See <a href='../copyright.html'>copying rules</a>.<br>

 $Id: assign2.html,v 1.32 2019/03/27 22:57:58 eggert Exp $

</address>

</body>
</html>