
@bnjns
Last active May 10, 2022 07:14

Text Statistics Tool

You are given the following text:

lorem ipsum dolor sit amet consectetur lorem ipsum et mihi quoniam et adipiscing elit.sed quoniam et advesperascit et mihi ad villam revertendum est nunc quidem hactenus ex rebus enim timiditas non ex vocabulis nascitur.nummus in croesi divitiis obscuratur pars est tamen divitiarum.nam quibus rebus efficiuntur voluptates eae non sunt in potestate sapientis.hoc mihi cum tuo fratre convenit.qui ita affectus beatum esse numquam probabis duo reges constructio interrete.de hominibus dici non necesse est.eam si varietatem diceres intellegerem ut etiam non dicente te intellego parvi enim primo ortu sic iacent tamquam omnino sine animo sint.ea possunt paria non esse.quamquam tu hanc copiosiorem etiam soles dicere.de quibus cupio scire quid sentias.universa enim illorum ratione cum tota vestra confligendum puto.ut nemo dubitet eorum omnia officia quo spectare quid sequi quid fugere debeant nunc vero a primo quidem mirabiliter occulta natura est nec perspici nec cognosci potest.videmusne ut pueri ne verberibus quidem a contemplandis rebus perquirendisque deterreantur sunt enim prima elementa naturae quibus auctis virtutis quasi germen efficitur.nam ut sint illa vendibiliora haec uberiora certe sunt.cur deinde metrodori liberos commendas.mihi inquam qui te id ipsum rogavi nam adhuc meo fortasse vitio quid ego quaeram non perspicis.quibus ego vehementer assentior.cur iustitia laudatur mihi enim satis est ipsis non satis.quid est enim aliud esse versutum nobis heracleotes ille dionysius flagitiose descivisse videtur a stoicis propter oculorum dolorem.diodorus eius auditor adiungit ad honestatem vacuitatem doloris.nos quidem virtutes sic natae sumus ut tibi serviremus aliud negotii nihil habemus.

Here are a few facts and definitions about the text above:

  • Everything is lowercase.
  • There are only letters, full stops (.), and single whitespace characters.
  • A word is defined as a sequence of letters delimited by either a whitespace character or a full stop (.).
  • A full stop character is not considered a word. A full stop is never preceded or followed by whitespace.
  • Any two words are separated either by a single whitespace character (dolor sit), or by a full stop with no spaces (elit.sed).
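The definitions above translate fairly directly into a tokenizer. The following is a minimal sketch in Python, not a reference solution; the function name `tokenize` is a hypothetical choice, and it assumes the input has already been cleaned as described (lowercase letters, single spaces, and full stops only).

```python
import re


def tokenize(text: str) -> list[str]:
    """Split cleaned text into words (hypothetical helper).

    Assumes the input contains only lowercase letters, single spaces,
    and full stops, as stated in the definitions above.
    """
    # A space and a full stop are both word delimiters.
    pieces = re.split(r"[ .]", text)
    # A trailing full stop leaves an empty string behind; drop it.
    return [piece for piece in pieces if piece]


print(tokenize("adipiscing elit.sed quoniam"))
```

Filtering out empty strings means a full stop at the very end of the text does not produce a spurious empty word.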

Code a system that will allow us to determine the following statistics:

  • How many words are there in the text (including duplicates)? (260)
  • Which six words occur the most in the text? (non, est, enim, ut, quid, mihi)
  • What percentage of the words only occur once? (82% rounded down)
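One possible way to compute the three statistics is sketched below with `collections.Counter`. The function name `summarize` and the result keys are hypothetical. The brief does not say whether the percentage is taken over all words or over distinct words; this sketch assumes distinct words, which should be checked against the expected 82%.

```python
from collections import Counter


def summarize(words: list[str], top_n: int = 6) -> dict[str, object]:
    """Compute the three statistics from a list of words (hypothetical helper)."""
    counts = Counter(words)
    # Number of distinct words that appear exactly once.
    once = sum(1 for count in counts.values() if count == 1)
    return {
        "total_words": len(words),
        # most_common() lists equal counts in first-seen order.
        "most_common": [word for word, _ in counts.most_common(top_n)],
        # ASSUMPTION: the denominator is the number of distinct words.
        # Integer division rounds the percentage down.
        "percent_once": once * 100 // len(counts) if counts else 0,
    }


sample = "et mihi et ad et mihi villam".split()
print(summarize(sample, top_n=2))
```

Because ties are ordered by first occurrence, the order of equally frequent words in the result may differ from the order listed in the expected answer.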

Tips:

  • You are free to copy and paste the text above as a string in your preferred language; there is no need to read from a file in your code or build something like a REST endpoint
  • You can assume that all input text has been cleaned to the above definition; there is no need to worry about parsing for now
  • Concentrate on solving for the above text; do not worry about the general case (although you are welcome to discuss how you would handle the general case as you go along)
  • Strive to write clear and maintainable code
  • Treat this as though you are building an actual system that we would like to deploy and extend later on, considering things like testability, rather than simply answering each statistic in turn
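To illustrate the extensibility and testability tips, here is one hedged sketch of a design in which each statistic is a small, independently testable function registered under a name. The names `STATISTICS`, `statistic`, and `report` are hypothetical and not part of the exercise. A registry is only one of several reasonable designs.

```python
from collections import Counter
from typing import Callable

# Hypothetical registry mapping a statistic's name to its function.
Statistic = Callable[[Counter], object]
STATISTICS: dict[str, Statistic] = {}


def statistic(name: str) -> Callable[[Statistic], Statistic]:
    """Decorator that registers a statistic under the given name."""
    def register(fn: Statistic) -> Statistic:
        STATISTICS[name] = fn
        return fn
    return register


@statistic("total_words")
def total_words(counts: Counter) -> int:
    # Sum of all occurrences, duplicates included.
    return sum(counts.values())


@statistic("distinct_words")
def distinct_words(counts: Counter) -> int:
    return len(counts)


def report(words: list[str]) -> dict[str, object]:
    """Count the words once, then run every registered statistic."""
    counts = Counter(words)
    return {name: fn(counts) for name, fn in STATISTICS.items()}


print(report(["et", "mihi", "et"]))
```

Adding a new statistic then means writing one more decorated function, and each function can be unit tested with a hand-built `Counter` without touching the tokenizer.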