@psychemedia
Last active December 16, 2019 12:09
First fumblings of a sketch around profiling a Jupyter notebook as a text, looking at text readability metrics, code complexity metrics, etc
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Notebook Profiles\n",
"\n",
"This exploratory coding notebook explores several techniques to support the static profiling of Jupyter notebooks as texts, reporting on various metrics, including:\n",
"\n",
"- notebook size (markdown and code line counts);\n",
"- readability scores;\n",
"- reading time estimates;\n",
"- code complexity and maintability.\n",
"\n",
"The motivating context was a tool for generating summary reports on the estimated workload associated with 100 or so notebooks over 25 or so directories (1 directory / 4 notebooks per week) for a third year undergraduate equivalent Open University course on data management and analysis.\n",
"\n",
"Previous notebook recipes include generating simple reports that pull out headings from notebooks to act as notebook summaries (eg [`Get Contents`](https://github.com/innovationOUtside/TM351_forum_examples/blob/master/Get%20Contents.ipynb)). Such recipes may provide a useful component in a notebook quality report if the report is also intended to provide a summary / overview of notebooks. (It might be most useful to offer heading summaries as an option in a notebook profiling report?)\n",
"\n",
"Tools supporting the profiling of one or more notebooks across one or more directories and the generation of simple statistics over them are also provided.\n",
"\n",
"The profiler is also capabale of running simple health checks over a notebook, for example reporting on:\n",
"\n",
"- whether code cells have been executed, and if so, whether code cell execution in complete and in linear order;\n",
"- packages / modules loaded in to the notebook;\n",
"- unused code items in a notebook (for example, modules loaded but not used).\n",
"\n",
"Currently, code profiling is only applied to code that appears in code cells, not code that is quoted or described in markdown cells. \n",
"\n",
"There is a potential for making IPython magics for some of the reporting functions (for example, `radon` or `wily` reports) to provide live feedback / reporting during the creation of content in a notebook.\n",
"\n",
"### Notebooks\n",
"\n",
"In the first instance, reports are generated for code cell inputs and markdown cells; code outputs and raw cells are not considered. Code appearing in markdown cells is identified as code-like but not analysed in terms of code complexity etc.\n",
"\n",
"For each markdown cell, we can generate a wide range of simple text document statistics. Several packages exist to support such analyses (for example, [`textstat`](https://github.com/shivam5992/textstat), [`readability`](https://github.com/andreasvc/readability/)) but the focus in this notebook will be on metrics derived using the [`spacy`](https://spacy.io/) underpinned [`textacy`](https://github.com/chartbeat-labs/textacy) package for things like [readability](https://chartbeat-labs.github.io/textacy/api_reference/misc.html?highlight=readability#text-statistics) metrics. Several simple custom metrics are also suggested.\n",
"\n",
"For code in code cells, the [`radon`](https://radon.readthedocs.io) package is used to generate code metrics, with additional packages providing further simple metrics.\n",
"\n",
"A test notebook is provided (`Notebook_profile_test.ipynb`) against which we can test various elements of this notebook.\n",
"\n",
"### Potential Future Work\n",
"\n",
"In terms of analysing cell outputs (not covered as yet), reports could be generated on the sorts of asset that appear to be displayed in each cell output, whether code warnings or errors are raised, etc. There is also potential for running in association with something like [`nbval`](https://github.com/computationalmodelling/nbval) to test that notebooks test correctly against previously run cell outputs.\n",
"\n",
"We might also explore the extent to which interactive notebook profiling tools, such as magics or notebook extensions, be used to support the authoring of new instructional notebooks.\n",
"\n",
"We might also ask to what extent might interactive notebook profiling tools be used to support learners working through instructional material and reflecting on their work? Code health metrics, such as [cell execution success](https://nbgallery.github.io/health_paper.html) used by *nbgallery* may provide clues regarding which code activity cells students struggled to get working, for example. By looking at statistics across students (for example, in assessment notebooks with cell execution success log monitoring enabled) we may be able to identify \"healthy\" or \"unhealthy\" activities; for example, a healthy activity is one in which students can get their code to run with one or two tries, an unhealthy activity is one where they make repeated attempts at trying to get the code to work as they desire. \n",
"\n",
"The notebook profiler should also be runnable against notebooks created using Jupytext from markdown rendered from OU-XML. It would probably make *more* sense to build a custom OU-XML profiler, eg one that could perhaps draw on a summary XML doc generated from OU-XML source docs using XSLT. I'll try to bear in mind creating reporting functions that might be useable in this wider sense. (OU-XML will also have thngs like a/v components, and may have explicit time guidance on expected time spent on particular activities.)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Settings\n",
"\n",
"The following parameters are used notebook wide in the generation of reports."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"READING_RATE = 100 # words per minute\n",
"# What is a sensible reading rate for undergraduate level academic teaching material?\n",
"# 250 wpm gives a rate of 15,000 wph\n",
"# 10,000 wph corresponds to about 170 words per minute\n",
"# OU guidance: 35 wpm for challenging texts, 70 wpm for medium texts, 120 wpm for easy texts\n",
"\n",
"CODE_READING_RATE = 35 # tokens per minute -- UNUSED\n",
"\n",
"CODE_LINE_READING_TIME = 1 # time in seconds to read a code line\n",
"\n",
"LINE_WIDTH = 160 #character width of a line of markdown text; used to calculate \"screen lines\"\n",
"\n",
"CODE_CELL_REVIEW_TIME = 5 # nominal time in seconds to run each code cell / review each code cell output\n",
"\n",
"CELL_SKIP_TIME = 1 # nomimal time in seconds to move from one cell to the next"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Open Notebook\n",
"\n",
"Open and read a notebook, such as the associated test notebook:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"TEST_NOTEBOOK = 'Notebook_profile_test.ipynb'\n",
"\n",
"import nbformat\n",
"with open(TEST_NOTEBOOK,'r') as f:\n",
" nb = nbformat.reads(f.read(), as_version=4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyse Markdown Cells\n",
"\n",
"Iterate through markdown cells and generate cell by cell reports.\n",
"\n",
"We can start off by generating some simple counts for a single notebook.\n",
"\n",
"Let's preview the contents of a single cell:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'cell_type': 'markdown',\n",
" 'metadata': {},\n",
" 'source': '# Test Notebook for Notebook Profiler\\n\\nThis notebook provides a test case for the notebook profiler.\\n\\nIt includes a range of markdown and code cells intended to test various features of the profiler.\\n\\nNote that this notebook does not necessarily run...'}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nb.cells[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can look at just the markdown component associated with a markdown cell - this will be the basis for our markdown text analysis."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"txt = nb.cells[0]['source']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Estimates of reading time are often based on word count estimates. The Medium website use a reading time estimator that also factors in the presence of images in a text as well as wordcount / sentence length. The [`readtime`](https://github.com/alanhamlett/readtime) package uses the Medium reading time estimation algorithm to give a reading time estimate.\n",
"\n",
"\n",
"?? TO DO - more on the reading time equation; also need something like maybe: +10s for every code cell to run it and look at output? Different reading time per line of code?\n",
"\n",
"*It might be worth looking at forking this reading time estimator and try to factor in reading time elements that reflect the presence of code? Or maybe use a slower reading rate for code? Or factor in code complexity? The presence of links might also affect reading time.*"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Reading time in seconds: 25.0; in minutes: 1.'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#https://github.com/alanhamlett/readtime\n",
"#%pip install readtime\n",
"\n",
"import readtime\n",
"import math\n",
"\n",
"rt = readtime.of_markdown(txt, wpm=READING_RATE).delta.total_seconds()\n",
"\n",
"#Round up on the conversion of estimated reading time in seonds, to minutes...\n",
"f'Reading time in seconds: {rt}; in minutes: {math.ceil(rt/60)}.'"
]
},
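{
"cell_type": "markdown",
"metadata": {},
"source": [
"The word count basis of such an estimate can be sketched without any third party package. The following is a minimal sketch, not the `readtime` / Medium algorithm: the function name `simple_reading_time` is hypothetical, words are counted by splitting on whitespace, and no allowance is made for images, links or code.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def simple_reading_time(text, wpm=100):\n",
"    '''Estimate reading time in seconds from a whitespace word count (assumed heuristic).'''\n",
"    n_words = len(text.split())\n",
"    # words / (words per minute) gives minutes; multiply by 60 for seconds\n",
"    return math.ceil(n_words * 60 / wpm)\n",
"\n",
"# 40 words at 100 words per minute\n",
"print(simple_reading_time('word ' * 40))  # prints 24\n",
"```\n",
"\n",
"Because it ignores everything except the word count, this sketch will not necessarily agree with the figure reported by `readtime`."
]
},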
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `spacy` natural language processing package provides a wide ranging of basic tools for parsing texts."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"#%pip install spacy\n",
"import spacy\n",
"\n",
"#Check we have the small English model at least\n",
"SPACY_LANG_MODEL = 'en_core_web_sm'\n",
"\n",
"try:\n",
" import en_core_web_sm\n",
"except:\n",
" import spacy.cli\n",
" spacy.cli.download(SPACY_LANG_MODEL)\n",
"\n",
"#Load a model that a text is parsed against\n",
"nlp = spacy.load(SPACY_LANG_MODEL)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To call on `spacy`, we need to create tokenised document representation of the text (conveniently, the original text version is also stored as part of the object)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"run_control": {
"marked": false
}
},
"outputs": [],
"source": [
"doc = nlp(txt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `textacy` package builds on `spacy` to provide a range of higher level tools and statistics, from simple statistics such as word and sentence counts to more complex readability scores using a variety of [readability measures](https://readable.com/blog/the-flesch-reading-ease-and-flesch-kincaid-grade-level/).\n",
"\n",
"One way of using readability measures would be to set reading rates dynamically for each markdown cell based on calculated readability scores."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"({'n_sents': 4,\n",
" 'n_words': 40,\n",
" 'n_chars': 203,\n",
" 'n_syllables': 63,\n",
" 'n_unique_words': 27,\n",
" 'n_long_words': 15,\n",
" 'n_monosyllable_words': 25,\n",
" 'n_polysyllable_words': 6},\n",
" {'flesch_kincaid_grade_level': 6.895,\n",
" 'flesch_reading_ease': 63.440000000000026,\n",
" 'smog_index': 10.125756701596842,\n",
" 'gunning_fog_index': 10.0,\n",
" 'coleman_liau_index': 11.080711825000005,\n",
" 'automated_readability_index': 7.47325,\n",
" 'lix': 47.5,\n",
" 'gulpease_index': 68.25,\n",
" 'wiener_sachtextformel': 6.5195})"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#%pip install textacy\n",
"from textacy import TextStats\n",
"\n",
"ts = TextStats(doc)\n",
"ts.basic_counts, ts.readability_stats"
]
},
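{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting the reading rate dynamically from a readability score might look something like the following. This is a minimal sketch under stated assumptions: the function name `reading_rate_from_ease` is hypothetical, the rates (35, 70 and 120 words per minute) are taken from the OU guidance quoted in the settings cell, and the Flesch Reading Ease thresholds (60 and 30) are illustrative guesses rather than published guidance.\n",
"\n",
"```python\n",
"def reading_rate_from_ease(flesch_reading_ease):\n",
"    '''Map a Flesch Reading Ease score to a reading rate in words per minute.'''\n",
"    # Thresholds are illustrative assumptions, not OU guidance\n",
"    if flesch_reading_ease >= 60:\n",
"        return 120  # easy text\n",
"    if flesch_reading_ease >= 30:\n",
"        return 70  # medium text\n",
"    return 35  # challenging text\n",
"\n",
"# Score reported for the first cell of the test notebook\n",
"print(reading_rate_from_ease(63.44))  # prints 120\n",
"```\n",
"\n",
"The returned rate could then be passed as the `wpm` argument when estimating the reading time of each markdown cell."
]
},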
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `textacy` package can also pull out notable features in a text, such as key terms or acronyms, both of which may be useful as part of a notebook summary."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('notebook profiler', 0.08196495093971548),\n",
" ('test case', 0.06744856661263204),\n",
" ('Test Notebook', 0.06479107591582292),\n",
" ('code cell', 0.05486312750180375),\n",
" ('markdown', 0.024974748258550644),\n",
" ('feature', 0.023809657889882128),\n",
" ('range', 0.022746625242650347)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Extract keyterms\n",
"import textacy.ke\n",
"textacy.ke.textrank(doc, normalize=\"lemma\", topn=10)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from textacy.extract import acronyms_and_definitions\n",
"acronyms_and_definitions(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As well as using measures provided by off-the-shelf packages, it's also useful to define some simple metrics of our own that don't appear in other packages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start with, let's try to estimate the notebook length as it appears on screen by calculating how many \"screen lines\" a markdown cell is likely to take up. This can be calculated by splitting long lines of text over multiple lines based on a screen line width parameter."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import textwrap\n",
"\n",
"def _count_screen_lines(txt, width=LINE_WIDTH):\n",
" \"\"\"Count the number of screen lines that a markdown cell takes up.\"\"\"\n",
" ll = txt.split('\\n\\n')\n",
" _ll = []\n",
" for l in ll:\n",
" #Model screen flow: split a line if it is more than `width` characters long\n",
" _ll=_ll+textwrap.wrap(l, width)\n",
" n_screen_lines = len(_ll)\n",
" return n_screen_lines"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"screen_txt='As well as \"text\", markdown cells may contain cell blocks. The following is a basic report generator for summarising key statistical properties of code blocks. (We will see later an alternative way of calculating such metrics for well form Python code at least.)'\n",
"_count_screen_lines(screen_txt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `textacy` package does not appear to provide average sentence length statistics (although sentence length metrics may play a role in calculating readability scores? So maybe there are usable functions somewhere in there?) but we can straightforwardly define our own."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import statistics\n",
"\n",
"def sentence_lengths(doc):\n",
" \"\"\"Generate elementary sentence length statistics.\"\"\"\n",
" s_mean = None\n",
" s_median = None\n",
" s_stdev = None\n",
" s_lengths = []\n",
" for sentence in doc.sents:\n",
" #Punctuation elements are tokens in their own right; remove these from sentence length counts\n",
" s_lengths.append(len( [tok.text for tok in sentence if tok.pos_ != \"PUNCT\"]))\n",
" \n",
" if s_lengths:\n",
" #If we have at least one measure, we can generate some simple statistics\n",
" s_mean = statistics.mean(s_lengths)\n",
" s_median = statistics.median(s_lengths)\n",
" s_stdev = statistics.stdev(s_lengths) if len(s_lengths) > 1 else 0\n",
" \n",
" return s_lengths, s_mean, s_median, s_stdev"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sentence statistics are generated from a `spacy` `doc` object and returned as separate statistics."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[7, 11, 18, 8] 11 9.5 4.96655480858378\n"
]
}
],
"source": [
"s_lengths, s_mean, s_median, s_stdev = sentence_lengths(doc)\n",
"print(s_lengths, s_mean, s_median, s_stdev)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As well as \"text\", markdown cells may contain cell blocks. The following is a basic report generator for summarising key statistical propererties of code blocks. (We will see later an alternative way of calculating such metrics for well form Python code at least.)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def _code_block_summarise(lines,\n",
" n_blank_code_lines = 0,\n",
" n_single_line_comment_code_lines = 0,\n",
" n_code_lines = 0):\n",
" \n",
" lines = lines.splitlines() if isinstance(lines, str) else lines\n",
" \n",
" #if lines[0].startwsith('%%'): \n",
" ##block magic - we could detect which?\n",
" #This would let us report on standard block magic such as %%bash\n",
" #as well as custom magic such as %%sql\n",
" for l in lines:\n",
" if not l.strip():\n",
" n_blank_code_lines = n_blank_code_lines + 1\n",
" elif l.strip().startswith(('#')): #Also pattern match \"\"\".+\"\"\" and '''.+'''\n",
" n_single_line_comment_code_lines = n_single_line_comment_code_lines + 1\n",
" #How should we detect block comments?\n",
" #elif l.strip().startswith(('!')):\n",
" ## IPyhton shell command\n",
" #elif l.startswith('%load_ext'):\n",
" ##Import some magic - we could detect which?\n",
" else:\n",
" n_code_lines = n_code_lines + 1\n",
" return n_blank_code_lines, n_single_line_comment_code_lines, n_code_lines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the code block summary in a more general report on \"features\" within a markdown cell (sentence statistics are handled elsewhere):"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def _report_md_features(txt):\n",
" \"\"\"Report on features in markdown documents.\n",
" For example, number of headings or paragraphs, or code block analysis.\"\"\"\n",
" n_headers = 0\n",
" n_paras = 0\n",
" n_total_code_lines = 0\n",
" n_code_lines = 0\n",
" n_blank_code_lines = 0\n",
" n_single_line_comment_code_lines = 0\n",
"\n",
" in_code_block = False\n",
" \n",
" n_screen_lines = _count_screen_lines(txt)\n",
" \n",
" #Markdown processor ignores whitespace at start and end of a markdown cell\n",
" txt = txt.strip()\n",
" \n",
" n_code_blocks = 0\n",
" \n",
" #We will see how to improve the handling of code blocks in markdown cells later\n",
" for l in txt.split('\\n'):\n",
" if l.strip().startswith('```'):\n",
" in_code_block = not in_code_block\n",
" if in_code_block:\n",
" n_code_blocks = n_code_blocks + 1\n",
" elif in_code_block:\n",
" n_total_code_lines = n_total_code_lines + 1\n",
" n_blank_code_lines, n_single_line_comment_code_lines, \\\n",
" n_code_lines = _code_block_summarise(l,\n",
" n_blank_code_lines,\n",
" n_single_line_comment_code_lines,\n",
" n_code_lines)\n",
" elif l.startswith('#'):\n",
" #Markdown heading\n",
" n_headers = n_headers + 1\n",
" elif not l.strip():\n",
" #A paragraph is identified by an double end of line (\\n\\n) outside a code block\n",
" #So if we have an empty line that signifies a paragraph break?\n",
" n_paras = n_paras + 1\n",
" \n",
" n_code = (n_total_code_lines, n_code_lines, \\\n",
" n_blank_code_lines, n_single_line_comment_code_lines)\n",
" \n",
" return n_headers, n_paras, n_screen_lines, n_code_blocks, n_code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So for example, the features we can report on might include the number of headings paragraphs, screen lines, or code block features."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1, 3, 4, 0, (0, 0, 0, 0))"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n_headers, n_paras, n_screen_lines, n_code_blocks, n_code = _report_md_features(txt)\n",
"n_headers, n_paras, n_screen_lines, n_code_blocks, n_code"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0, 0, 0, 0)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(n_total_code_lines, n_code_lines, n_blank_code_lines, n_single_line_comment_code_lines) = n_code\n",
"n_total_code_lines, n_code_lines, n_blank_code_lines, n_single_line_comment_code_lines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Code Blocks in Markdown Cells\n",
"A question arises when we have code blocks appearing in markdown cells. How should these be treated? Should we report the code toward markdown counts, or should we separately treat the code, discounting it from markdown word counts but reporting it as \"code in markdown\"?\n",
"\n",
"Another approach might be to include and codes of block appearing in markdown cells as part of the markdown word count, but provide an additional report identifying how many lines of code appeared as part of the markdown."
]
},
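{
"cell_type": "markdown",
"metadata": {},
"source": [
"The option of treating the code separately, discounting it from the markdown word count, can be sketched with a simple line scanner. This is a minimal sketch, assuming that code blocks are delimited by triple backtick fences; the names `FENCE` and `split_markdown_and_code` are hypothetical, and indented code blocks and inline code spans are not handled.\n",
"\n",
"```python\n",
"FENCE = '`' * 3  # the triple backtick fence marker\n",
"\n",
"def split_markdown_and_code(md):\n",
"    '''Return (prose_lines, code_lines) for markdown text with fenced code blocks.'''\n",
"    prose, code = [], []\n",
"    in_code_block = False\n",
"    for line in md.split('\\n'):\n",
"        if line.strip().startswith(FENCE):\n",
"            # A fence line toggles the state and is itself discarded\n",
"            in_code_block = not in_code_block\n",
"        elif in_code_block:\n",
"            code.append(line)\n",
"        else:\n",
"            prose.append(line)\n",
"    return prose, code\n",
"\n",
"md = '\\n'.join(['Some text.', FENCE + 'python', 'x = 1', FENCE, 'More text.'])\n",
"prose, code = split_markdown_and_code(md)\n",
"n_words = len(' '.join(prose).split())\n",
"print(n_words, len(code))  # prints 4 1\n",
"```\n",
"\n",
"The prose lines could then feed the word count and readability metrics, with the code lines reported separately as code in markdown."
]
},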
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `excode` package provides an easy way of grabbing code blocks from markdown text, so we might be able to use that to mprove the handling of code blocks inside markdown cells.\n",
"\n",
"Lets grab a simple text case of some markdown containing some code blocks:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This cell contains two code blocks.\n",
"\n",
"Here's one:\n",
"\n",
"```python\n",
"import pandas\n",
"\n",
"#Create a dataframe\n",
"df = pd.DataFrame()\n",
"```\n",
"\n",
"and here's another:\n",
"\n",
"```python\n",
"import pandas\n",
"\n",
"#Create a dataframe\n",
"df = pd.DataFrame()\n",
"```\n",
"\n",
"So that's two...\n"
]
}
],
"source": [
"mc = nb.cells[2]['source']\n",
"print(mc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if we can extract those code blocks..."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['import pandas\\n\\n#Create a dataframe\\ndf = pd.DataFrame()\\n',\n",
" 'import pandas\\n\\n#Create a dataframe\\ndf = pd.DataFrame()\\n']"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#%pip install excode\n",
"import excode\n",
"import io\n",
"\n",
"#excode seems to expect a file buffer...\n",
"excode.extract(io.StringIO(mc))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now report on the structure of code blocks in markdown cells more directly:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def code_block_report(c):\n",
" \"\"\"Generate simple code report when passed a list of code lines\n",
" or a string containing multiple `\\n` separated code lines.\"\"\"\n",
" \n",
" n_total_code_lines = 0\n",
" n_code_lines = 0\n",
" n_blank_code_lines = 0\n",
" n_single_line_comment_code_lines = 0\n",
" \n",
" #We won't count leading or lagging empty lines as code lines...\n",
" lines = c.strip().splitlines() if isinstance(c, str) else c\n",
" \n",
" #If first or last line is empty, strip it\n",
" if len(lines) > 1:\n",
" lines = lines[1:] if not lines[0].strip() else lines\n",
" lines = lines[:-1] if not lines[-1].strip() else lines\n",
" \n",
" #print(lines)\n",
" \n",
" n_total_code_lines = len(lines)\n",
" \n",
" n_blank_code_lines, n_single_line_comment_code_lines, \\\n",
" n_code_lines = _code_block_summarise(lines,\n",
" n_blank_code_lines,\n",
" n_single_line_comment_code_lines,\n",
" n_code_lines)\n",
" \n",
" return (n_total_code_lines, n_blank_code_lines,\\\n",
" n_single_line_comment_code_lines, n_code_lines)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running the above function should generate some simple code statistics:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This cell contains two code blocks.\n",
"\n",
"Here's one:\n",
"\n",
"```python\n",
"import pandas\n",
"\n",
"#Create a dataframe\n",
"df = pd.DataFrame()\n",
"```\n",
"\n",
"and here's another:\n",
"\n",
"```python\n",
"import pandas\n",
"\n",
"#Create a dataframe\n",
"df = pd.DataFrame()\n",
"```\n",
"\n",
"So that's two...\n",
"4 1 1 2\n",
"4 1 1 2\n"
]
}
],
"source": [
"print(mc)\n",
"for c in excode.extract(io.StringIO(mc)):\n",
" (n_total_code_lines, n_blank_code_lines, \\\n",
" n_single_line_comment_code_lines, n_code_lines) = code_block_report(c)\n",
" \n",
" print(n_total_code_lines, n_blank_code_lines, \\\n",
" n_single_line_comment_code_lines, n_code_lines )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could also use the `radon` code analyser (which does count empty lines as code lines unless we explictly strip them).\n",
"\n",
"However, it should be noted that the `radon` code analysis relies on well formed Python code that can be loaded as into the Python AST parser. This means that code that doesn't parse as valid Python, either because it contains an error or because the code is not actually Python code (for example, in course materials we make use of SQL block magic to allow us to write SQL code in a code cell).\n",
"\n",
"The `radon` parser will also report an error if it comes across IPython line or cell magic code, or `!` prefixed shell commands.\n",
"\n",
"We will see later how we can start to cleanse a code string of IPython `!` and `%` prefixed directives when we consider parsing code cells. "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Module(loc=4, lloc=2, sloc=2, comments=1, multi=0, blank=1, single_comments=1)\n",
"Module(loc=4, lloc=2, sloc=2, comments=1, multi=0, blank=1, single_comments=1)\n"
]
},
{
"data": {
"text/plain": [
"(4, 2, 2, 1, 0, 1, 1)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#%pip install radon\n",
"from radon.raw import analyze\n",
"for c in excode.extract(io.StringIO(mc)):\n",
" r = analyze(c.strip())\n",
" print(r)\n",
"r.loc, r.lloc, r.sloc, r.comments, r.multi, r.blank, r.single_comments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can bundle up the `radon` analyzer to make it a little easier to call for our purposes:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"def r_analyze(c):\n",
" \"\"\"Analyse a code string using radon.analyze.\"\"\"\n",
" r = analyze(c.strip())\n",
" n_total_code_lines = r.loc\n",
" n_blank_code_lines = r.blank\n",
" n_single_line_comment_code_lines = r.comments\n",
" n_code_lines = r.sloc\n",
" return (n_total_code_lines, n_blank_code_lines, \\\n",
" n_single_line_comment_code_lines, n_code_lines)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then siple call `r_analyze()` function with a code string:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4 1 1 2\n",
"4 1 1 2\n"
]
}
],
"source": [
"for c in excode.extract(io.StringIO(mc)):\n",
" (n_total_code_lines, n_blank_code_lines, \\\n",
" n_single_line_comment_code_lines, n_code_lines) = r_analyze(c)\n",
" \n",
" print(n_total_code_lines, n_blank_code_lines, \\\n",
" n_single_line_comment_code_lines, n_code_lines)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Code Reading (and Execution) Time\n",
"\n",
"It would be useful if we had a heuristic for code reading time.\n",
"\n",
"One approach would be to tokenise the code and estimate reading time from a simple \"tokens per minute\" reading rate, or use a reading rate appropriate for \"difficult\" text. Another approach might be to try to make use of code complexity scores and code length.\n",
"\n",
"A pragmatic way may just be to estimate based on lines of code, with a nominal reading time allocated to each line of code."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"def code_reading_time(n_code_lines, n_single_line_comment_code_lines, line_time=CODE_LINE_READING_TIME):\n",
" \"\"\"Crude reading time estimate for a code block.\"\"\"\n",
" code_reading_time = line_time * (n_code_lines + n_single_line_comment_code_lines)\n",
" return code_reading_time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The way we currently process code in markdown cells, it will be timed at the standard reading rate. It may be appropriate to add a simple modifier that also adds a \"code reading overhead\" to the reading time based on the amount of code in a markdown cell.\n",
"\n",
"For code in code cells, rather than code blocks in markdown cells, might also be worth exploring *code execution time*, that is, an overhead associated with running each code cell. A crude way of calculating this would be to levy a fixed amount of time to account for running the code cell and inspecting the result. A more considered approach would look to cell profiling / execution time logs and code cell outputs in a run notebook."
]
},
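{
"cell_type": "markdown",
"metadata": {},
"source": [
"The crude fixed overhead approach might be sketched as follows. This is a minimal sketch, assuming the nominal values from the settings cell (repeated here so that the example is self-contained); the function name `code_cell_time` is hypothetical.\n",
"\n",
"```python\n",
"CODE_LINE_READING_TIME = 1  # seconds to read a code line\n",
"CODE_CELL_REVIEW_TIME = 5  # seconds to run a code cell and review its output\n",
"CELL_SKIP_TIME = 1  # seconds to move from one cell to the next\n",
"\n",
"def code_cell_time(n_code_lines, n_comment_lines):\n",
"    '''Crude time estimate, in seconds, for reading and running one code cell.'''\n",
"    reading = CODE_LINE_READING_TIME * (n_code_lines + n_comment_lines)\n",
"    # Fixed overheads are levied once per cell, however long the cell is\n",
"    return reading + CODE_CELL_REVIEW_TIME + CELL_SKIP_TIME\n",
"\n",
"# A cell with two code lines and one comment line\n",
"print(code_cell_time(2, 1))  # prints 9\n",
"```\n",
"\n",
"A fixed overhead takes no account of how long the code actually takes to run, which is where execution time logs would come in."
]
},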
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom Report Aggregator\n",
"\n",
"For convenience, we can bundle up the custom metrics we have created into a function that returns a single report object."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def process_extras(doc):\n",
"    \"\"\"Generate a dict containing additional metrics.\"\"\"\n",
"    \n",
"    n_headers, n_paras, n_screen_lines, n_code_blocks, n_code = _report_md_features(doc.text)\n",
"    s_lengths, s_mean, s_median, s_stdev = sentence_lengths(doc)\n",
"    (n_total_code_lines, n_code_lines, n_blank_code_lines, n_single_line_comment_code_lines) = n_code\n",
"    \n",
"    _reading_time = readtime.of_markdown(doc.text, wpm=READING_RATE).delta.total_seconds()\n",
"    #Add reading time overhead for code\n",
"    line_of_code_overhead = 1 #time in seconds to add to reading of each code line\n",
"    _reading_time = _reading_time + code_reading_time(n_code_lines, n_single_line_comment_code_lines,\n",
"                                                      line_of_code_overhead)\n",
"    \n",
"    extras = {'n_headers':n_headers,\n",
"              'n_paras':n_paras,\n",
"              'n_screen_lines':n_screen_lines,\n",
"              's_lengths':s_lengths,\n",
"              's_mean':s_mean,\n",
"              's_median':s_median,\n",
"              's_stdev':s_stdev,\n",
"              'n_code_blocks':n_code_blocks,\n",
"              'n_total_code_lines':n_total_code_lines,\n",
"              'n_code_lines':n_code_lines,\n",
"              'n_blank_code_lines':n_blank_code_lines,\n",
"              'n_single_line_comment_code_lines':n_single_line_comment_code_lines,\n",
"              'reading_time_s':_reading_time,\n",
"              'reading_time_mins': math.ceil(_reading_time/60),\n",
"              'mean_sentence_length': s_mean,\n",
"              'median_sentence_length': s_median,\n",
"              'stdev_sentence_length': s_stdev,\n",
"              #The following are both listy, so we need to handle them when we move to a dataframe\n",
"              # TO DO - parameterise the number of key terms\n",
"              'keyterms':textacy.ke.textrank(doc, normalize=\"lemma\", topn=10),\n",
"              'acronyms':acronyms_and_definitions(doc)\n",
"             }\n",
"    return extras"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running the `process_extras()` function on a `doc` object returns the extra metrics as keyed items in a single `dict`:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"({'n_sents': 4,\n",
" 'n_words': 40,\n",
" 'n_chars': 203,\n",
" 'n_syllables': 63,\n",
" 'n_unique_words': 27,\n",
" 'n_long_words': 15,\n",
" 'n_monosyllable_words': 25,\n",
" 'n_polysyllable_words': 6},\n",
" {'flesch_kincaid_grade_level': 6.895,\n",
" 'flesch_reading_ease': 63.440000000000026,\n",
" 'smog_index': 10.125756701596842,\n",
" 'gunning_fog_index': 10.0,\n",
" 'coleman_liau_index': 11.080711825000005,\n",
" 'automated_readability_index': 7.47325,\n",
" 'lix': 47.5,\n",
" 'gulpease_index': 68.25,\n",
" 'wiener_sachtextformel': 6.5195},\n",
" {'n_headers': 1,\n",
" 'n_paras': 3,\n",
" 'n_screen_lines': 4,\n",
" 's_lengths': [7, 11, 18, 8],\n",
" 's_mean': 11,\n",
" 's_median': 9.5,\n",
" 's_stdev': 4.96655480858378,\n",
" 'n_code_blocks': 0,\n",
" 'n_total_code_lines': 0,\n",
" 'n_code_lines': 0,\n",
" 'n_blank_code_lines': 0,\n",
" 'n_single_line_comment_code_lines': 0,\n",
" 'reading_time_s': 25.0,\n",
" 'reading_time_mins': 1,\n",
" 'mean_sentence_length': 11,\n",
" 'median_sentence_length': 9.5,\n",
" 'stdev_sentence_length': 4.96655480858378,\n",
" 'keyterms': [('notebook profiler', 0.08196495093971548),\n",
" ('test case', 0.06744856661263204),\n",
" ('Test Notebook', 0.06479107591582292),\n",
" ('code cell', 0.05486312750180375),\n",
" ('markdown', 0.024974748258550644),\n",
" ('feature', 0.023809657889882128),\n",
" ('range', 0.022746625242650347)],\n",
" 'acronyms': {}})"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ts.basic_counts, ts.readability_stats, process_extras(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate a Whole Notebook Markdown Report\n",
"\n",
"The whole notebook report can come in various flavours:\n",
" \n",
"- top level summary statistics that merge all the markdown content into a single cell and then analyse that;\n",
"- aggregated cell level statistics that summarise the statistics calculated for each markdown cell separately;\n",
"- individual cell level statistics that report the statistics for each cell separately.\n",
"\n",
"Whilst the individual cell level statistics presented in a textual fashion may be overkill, it may be useful to generate visual displays of a notebook that graphically summarise its structure."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Top-Level Summary\n",
"\n",
"Let's start with a report that munges all the markdown text together and reports on that..."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"def process_notebook_full_md(nb):\n",
"    \"\"\"Given a notebook, return all the markdown cell content as a single parsed doc,\n",
"    and all the code cell content as a single string.\"\"\"\n",
"    txt = []\n",
"    code = []\n",
"    for cell in nb.cells:\n",
"        if cell['cell_type']=='markdown':\n",
"            txt.append(cell['source'])\n",
"        elif cell['cell_type']=='code':\n",
"            code.append(cell['source'])\n",
"\n",
"    doc = nlp('\\n\\n'.join(txt))\n",
"    code = '\\n\\n'.join(code)\n",
"    \n",
"    return doc, code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `process_notebook_full_md()` function takes a notebook object and returns two items: a `doc` object parsed from all the notebook's markdown cell content, and a string containing all its code cell content."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('# Test Notebook for Notebook Profiler\\n\\nThis notebook provides a test case for the notebook profiler.\\n\\nIt includes a range of markdown and code cells intended to test various features of the profiler.\\n\\nNote that this notebook does not necessarily run...\\n\\n## Markdown Cells With Cod',\n",
" '# This is a code cell\\nimport pandas\\n\\n#Create a dataframe\\ndf = pd.DataFrame()\\n\\n# This is a code cell with a magic...\\n\\n%matplotlib inline\\nimport time\\n\\ndef fn():\\n \"\"\"How is the docstring handled?\"\"\"\\n pass\\n\\n%load_ext sql\\n\\n%%sql\\nSELECT * FROM TABLE;')"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"full_doc, full_code = process_notebook_full_md(nb)\n",
"full_doc.text[:280], full_code[:250]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make things a bit more tabular in our reporting:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"def process_notebook_md_doc(doc):\n",
"    \"\"\"Return a single row dataframe of summary statistics for a parsed doc.\"\"\"\n",
"    ts = TextStats(doc)\n",
"    return pd.DataFrame([{'text':doc.text,\n",
"                          **ts.basic_counts, **ts.readability_stats, **process_extras(doc)}])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running the `process_notebook_md_doc()` function on a `doc` object returns a single row dataframe containing summary statistics calculated over the full markdown content of the notebook."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>n_sents</th>\n",
" <th>n_words</th>\n",
" <th>n_chars</th>\n",
" <th>n_syllables</th>\n",
" <th>n_unique_words</th>\n",
" <th>n_long_words</th>\n",
" <th>n_monosyllable_words</th>\n",
" <th>n_polysyllable_words</th>\n",
" <th>flesch_kincaid_grade_level</th>\n",
" <th>...</th>\n",
" <th>n_code_lines</th>\n",
" <th>n_blank_code_lines</th>\n",
" <th>n_single_line_comment_code_lines</th>\n",
" <th>reading_time_s</th>\n",
" <th>reading_time_mins</th>\n",
" <th>mean_sentence_length</th>\n",
" <th>median_sentence_length</th>\n",
" <th>stdev_sentence_length</th>\n",
" <th>keyterms</th>\n",
" <th>acronyms</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td># Test Notebook for Notebook Profiler\\n\\nThis ...</td>\n",
" <td>15</td>\n",
" <td>119</td>\n",
" <td>499</td>\n",
" <td>159</td>\n",
" <td>49</td>\n",
" <td>26</td>\n",
" <td>89</td>\n",
" <td>8</td>\n",
" <td>3.270387</td>\n",
" <td>...</td>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>69.0</td>\n",
" <td>2</td>\n",
" <td>8.733333</td>\n",
" <td>8</td>\n",
" <td>5.417784</td>\n",
" <td>[(single code block, 0.05399890062211835), (co...</td>\n",
" <td>{}</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 37 columns</p>\n",
"</div>"
],
"text/plain": [
" text n_sents n_words \\\n",
"0 # Test Notebook for Notebook Profiler\\n\\nThis ... 15 119 \n",
"\n",
" n_chars n_syllables n_unique_words n_long_words n_monosyllable_words \\\n",
"0 499 159 49 26 89 \n",
"\n",
" n_polysyllable_words flesch_kincaid_grade_level ... n_code_lines \\\n",
"0 8 3.270387 ... 6 \n",
"\n",
" n_blank_code_lines n_single_line_comment_code_lines reading_time_s \\\n",
"0 0 3 69.0 \n",
"\n",
" reading_time_mins mean_sentence_length median_sentence_length \\\n",
"0 2 8.733333 8 \n",
"\n",
" stdev_sentence_length keyterms \\\n",
"0 5.417784 [(single code block, 0.05399890062211835), (co... \n",
"\n",
" acronyms \n",
"0 {} \n",
"\n",
"[1 rows x 37 columns]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"process_notebook_md_doc(full_doc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Summarised Cell Level Reporting\n",
"\n",
"For the summarised cell level reporting, we generate measures on a per cell basis and then calculate summary statistics over those."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"def process_notebook_md(nb, fn=''):\n",
"    \"\"\"Process all the markdown cells in a notebook.\"\"\"\n",
"    cell_reports = pd.DataFrame()\n",
"    \n",
"    for i, cell in enumerate(nb.cells):\n",
"        if cell['cell_type']=='markdown':\n",
"            _metrics = process_notebook_md_doc( nlp( cell['source'] ))\n",
"            _metrics['cell_count'] = i\n",
"            _metrics['cell_type'] = 'md'\n",
"            cell_reports = cell_reports.append(_metrics, sort=False)\n",
"    \n",
"    cell_reports['filename'] = fn\n",
"    cell_reports.reset_index(drop=True, inplace=True)\n",
"    return cell_reports"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Processing a single notebook returns a dataframe with one row per markdown cell with each metric reported in its own column."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>n_sents</th>\n",
" <th>n_words</th>\n",
" <th>n_chars</th>\n",
" <th>n_syllables</th>\n",
" <th>n_unique_words</th>\n",
" <th>n_long_words</th>\n",
" <th>n_monosyllable_words</th>\n",
" <th>n_polysyllable_words</th>\n",
" <th>flesch_kincaid_grade_level</th>\n",
" <th>...</th>\n",
" <th>reading_time_s</th>\n",
" <th>reading_time_mins</th>\n",
" <th>mean_sentence_length</th>\n",
" <th>median_sentence_length</th>\n",
" <th>stdev_sentence_length</th>\n",
" <th>keyterms</th>\n",
" <th>acronyms</th>\n",
" <th>cell_count</th>\n",
" <th>cell_type</th>\n",
" <th>filename</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td># Test Notebook for Notebook Profiler\\n\\nThis ...</td>\n",
" <td>4</td>\n",
" <td>40</td>\n",
" <td>203</td>\n",
" <td>63</td>\n",
" <td>27</td>\n",
" <td>15</td>\n",
" <td>25</td>\n",
" <td>6</td>\n",
" <td>6.895000</td>\n",
" <td>...</td>\n",
" <td>25.0</td>\n",
" <td>1</td>\n",
" <td>11.00</td>\n",
" <td>9.5</td>\n",
" <td>4.966555</td>\n",
" <td>[(notebook profiler, 0.08196495093971548), (te...</td>\n",
" <td>{}</td>\n",
" <td>0</td>\n",
" <td>md</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>## Markdown Cells With Code Blocks\\n\\nThis cel...</td>\n",
" <td>4</td>\n",
" <td>30</td>\n",
" <td>123</td>\n",
" <td>38</td>\n",
" <td>22</td>\n",
" <td>5</td>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>2.281667</td>\n",
" <td>...</td>\n",
" <td>18.0</td>\n",
" <td>1</td>\n",
" <td>8.25</td>\n",
" <td>5.5</td>\n",
" <td>8.261356</td>\n",
" <td>[(single code block, 0.09825762538579677), (Ma...</td>\n",
" <td>{}</td>\n",
" <td>1</td>\n",
" <td>md</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>This cell contains two code blocks.\\n\\nHere's ...</td>\n",
" <td>8</td>\n",
" <td>49</td>\n",
" <td>173</td>\n",
" <td>58</td>\n",
" <td>23</td>\n",
" <td>6</td>\n",
" <td>41</td>\n",
" <td>1</td>\n",
" <td>0.766097</td>\n",
" <td>...</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>6.50</td>\n",
" <td>6.0</td>\n",
" <td>4.105745</td>\n",
" <td>[(code block, 0.052250174985765105), (import p...</td>\n",
" <td>{}</td>\n",
" <td>2</td>\n",
" <td>md</td>\n",
" <td></td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 40 columns</p>\n",
"</div>"
],
"text/plain": [
" text n_sents n_words \\\n",
"0 # Test Notebook for Notebook Profiler\\n\\nThis ... 4 40 \n",
"1 ## Markdown Cells With Code Blocks\\n\\nThis cel... 4 30 \n",
"2 This cell contains two code blocks.\\n\\nHere's ... 8 49 \n",
"\n",
" n_chars n_syllables n_unique_words n_long_words n_monosyllable_words \\\n",
"0 203 63 27 15 25 \n",
"1 123 38 22 5 23 \n",
"2 173 58 23 6 41 \n",
"\n",
" n_polysyllable_words flesch_kincaid_grade_level ... reading_time_s \\\n",
"0 6 6.895000 ... 25.0 \n",
"1 1 2.281667 ... 18.0 \n",
"2 1 0.766097 ... 28.0 \n",
"\n",
" reading_time_mins mean_sentence_length median_sentence_length \\\n",
"0 1 11.00 9.5 \n",
"1 1 8.25 5.5 \n",
"2 1 6.50 6.0 \n",
"\n",
" stdev_sentence_length keyterms \\\n",
"0 4.966555 [(notebook profiler, 0.08196495093971548), (te... \n",
"1 8.261356 [(single code block, 0.09825762538579677), (Ma... \n",
"2 4.105745 [(code block, 0.052250174985765105), (import p... \n",
"\n",
" acronyms cell_count cell_type filename \n",
"0 {} 0 md \n",
"1 {} 1 md \n",
"2 {} 2 md \n",
"\n",
"[3 rows x 40 columns]"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"total_report = process_notebook_md(nb)\n",
"total_report.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is trivial to create summary statistics directly from the *per* cell report table by aggregating over rows associated with the same notebook; in this case, we can find the total reading time as a simple sum.\n",
"\n",
"However, more generally we may wish to apply the aggregation over a set of grouped results (for example, in a dataframe containing metrics from multiple notebooks, we would want to group by each notebook and then perform the aggregation on the measures associated with each notebook)."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"total_report['reading_time_mins'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also create a function to profile a notebook from a file:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"def process_notebook_file(fn):\n",
"    \"\"\"Grab cell level statistics across a whole notebook.\"\"\"\n",
"    \n",
"    with open(fn,'r') as f:\n",
"        try:\n",
"            nb = nbformat.reads(f.read(), as_version=4)\n",
"            cell_reports = process_notebook_md(nb, fn=fn)\n",
"        except Exception:\n",
"            print(f'FAILED to process {fn}')\n",
"            cell_reports = pd.DataFrame()\n",
"    \n",
"    cell_reports.reset_index(drop=True, inplace=True)\n",
"    return cell_reports"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `process_notebook_file()` function returns a dataframe containing row level reports for each markdown cell in a specified notebook:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>n_sents</th>\n",
" <th>n_words</th>\n",
" <th>n_chars</th>\n",
" <th>n_syllables</th>\n",
" <th>n_unique_words</th>\n",
" <th>n_long_words</th>\n",
" <th>n_monosyllable_words</th>\n",
" <th>n_polysyllable_words</th>\n",
" <th>flesch_kincaid_grade_level</th>\n",
" <th>...</th>\n",
" <th>reading_time_s</th>\n",
" <th>reading_time_mins</th>\n",
" <th>mean_sentence_length</th>\n",
" <th>median_sentence_length</th>\n",
" <th>stdev_sentence_length</th>\n",
" <th>keyterms</th>\n",
" <th>acronyms</th>\n",
" <th>cell_count</th>\n",
" <th>cell_type</th>\n",
" <th>filename</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td># Test Notebook for Notebook Profiler\\n\\nThis ...</td>\n",
" <td>4</td>\n",
" <td>40</td>\n",
" <td>203</td>\n",
" <td>63</td>\n",
" <td>27</td>\n",
" <td>15</td>\n",
" <td>25</td>\n",
" <td>6</td>\n",
" <td>6.895000</td>\n",
" <td>...</td>\n",
" <td>25.0</td>\n",
" <td>1</td>\n",
" <td>11.00</td>\n",
" <td>9.5</td>\n",
" <td>4.966555</td>\n",
" <td>[(notebook profiler, 0.08196495093971548), (te...</td>\n",
" <td>{}</td>\n",
" <td>0</td>\n",
" <td>md</td>\n",
" <td>Notebook_profile_test.ipynb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>## Markdown Cells With Code Blocks\\n\\nThis cel...</td>\n",
" <td>4</td>\n",
" <td>30</td>\n",
" <td>123</td>\n",
" <td>38</td>\n",
" <td>22</td>\n",
" <td>5</td>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>2.281667</td>\n",
" <td>...</td>\n",
" <td>18.0</td>\n",
" <td>1</td>\n",
" <td>8.25</td>\n",
" <td>5.5</td>\n",
" <td>8.261356</td>\n",
" <td>[(single code block, 0.09825762538579677), (Ma...</td>\n",
" <td>{}</td>\n",
" <td>1</td>\n",
" <td>md</td>\n",
" <td>Notebook_profile_test.ipynb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>This cell contains two code blocks.\\n\\nHere's ...</td>\n",
" <td>8</td>\n",
" <td>49</td>\n",
" <td>173</td>\n",
" <td>58</td>\n",
" <td>23</td>\n",
" <td>6</td>\n",
" <td>41</td>\n",
" <td>1</td>\n",
" <td>0.766097</td>\n",
" <td>...</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>6.50</td>\n",
" <td>6.0</td>\n",
" <td>4.105745</td>\n",
" <td>[(code block, 0.052250174985765105), (import p...</td>\n",
" <td>{}</td>\n",
" <td>2</td>\n",
" <td>md</td>\n",
" <td>Notebook_profile_test.ipynb</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 40 columns</p>\n",
"</div>"
],
"text/plain": [
" text n_sents n_words \\\n",
"0 # Test Notebook for Notebook Profiler\\n\\nThis ... 4 40 \n",
"1 ## Markdown Cells With Code Blocks\\n\\nThis cel... 4 30 \n",
"2 This cell contains two code blocks.\\n\\nHere's ... 8 49 \n",
"\n",
" n_chars n_syllables n_unique_words n_long_words n_monosyllable_words \\\n",
"0 203 63 27 15 25 \n",
"1 123 38 22 5 23 \n",
"2 173 58 23 6 41 \n",
"\n",
" n_polysyllable_words flesch_kincaid_grade_level ... reading_time_s \\\n",
"0 6 6.895000 ... 25.0 \n",
"1 1 2.281667 ... 18.0 \n",
"2 1 0.766097 ... 28.0 \n",
"\n",
" reading_time_mins mean_sentence_length median_sentence_length \\\n",
"0 1 11.00 9.5 \n",
"1 1 8.25 5.5 \n",
"2 1 6.50 6.0 \n",
"\n",
" stdev_sentence_length keyterms \\\n",
"0 4.966555 [(notebook profiler, 0.08196495093971548), (te... \n",
"1 8.261356 [(single code block, 0.09825762538579677), (Ma... \n",
"2 4.105745 [(code block, 0.052250174985765105), (import p... \n",
"\n",
" acronyms cell_count cell_type filename \n",
"0 {} 0 md Notebook_profile_test.ipynb \n",
"1 {} 1 md Notebook_profile_test.ipynb \n",
"2 {} 2 md Notebook_profile_test.ipynb \n",
"\n",
"[3 rows x 40 columns]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"process_notebook_file(TEST_NOTEBOOK)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analysing Multiple Notebooks in the Same Directory\n",
"\n",
"As well as analysing notebooks at the notebook level, we may also want to generate individual and aggregated reports for all the notebooks contained in a single directory.\n",
"\n",
"Aggregated reports might include the total estimated time to work through all the notebooks in the directory, for example.\n",
"\n",
"It might be useful to have one entry point and a switch that selects between the notebook summary reports and the full cell level report? Or maybe we should report two dataframes always - aggregated notebook level and individual cell level?"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"def _nb_dir_file_profiler(path, _f, report=False):\n",
"    \"\"\"Get the profile for a single file on a specified path.\"\"\"\n",
"    f = os.path.join(path, _f)\n",
"    if f.endswith('.ipynb'):\n",
"        if report:\n",
"            print(f'Profiling {f}')\n",
"        return process_notebook_file(f)\n",
"    return pd.DataFrame()\n",
"\n",
"def nb_dir_profiler(path):\n",
"    \"\"\"Profile all the notebooks in a specific directory.\"\"\"\n",
"    nb_dir_report = pd.DataFrame()\n",
"    for _f in os.listdir(path):\n",
"        nb_dir_report = nb_dir_report.append( _nb_dir_file_profiler(path, _f), sort=False )\n",
"    #nb_dir_report['path'] = path\n",
"    return nb_dir_report"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"#nb_dir_profiler('.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analysing Notebooks Across Multiple Directories\n",
"\n",
"As well as analysing all the notebooks contained within a single directory, we may want to automate the production of reports at the directory level across multiple directories."
]
},
{
"cell_type": "code",
"execution_count": 185,
"metadata": {},
"outputs": [],
"source": [
"def nb_multidir_profiler(path, exclude = 'default'):\n",
"    \"\"\"Profile all the notebooks in a specific directory and in any child directories.\"\"\"\n",
"    \n",
"    if exclude == 'default':\n",
"        exclude_paths = ['.ipynb_checkpoints', '.git', '.ipynb', '__MACOSX']\n",
"    else:\n",
"        #If we set exclude, we need to pass it as a list\n",
"        exclude_paths = exclude\n",
"    nb_multidir_report = pd.DataFrame()\n",
"    for _path, dirs, files in os.walk(path):\n",
"        #Start walking...\n",
"        #If we're in a directory that is not excluded...\n",
"        if not set(exclude_paths).intersection(set(_path.split('/'))):\n",
"            #Profile that directory...\n",
"            nb_dir_report = pd.DataFrame()\n",
"            for _f in files:\n",
"                nb_dir_report = nb_dir_report.append( _nb_dir_file_profiler(_path, _f), sort=False )\n",
"            if not nb_dir_report.empty:\n",
"                nb_dir_report['path'] = _path\n",
"                nb_multidir_report = nb_multidir_report.append(nb_dir_report, sort=False)\n",
"    \n",
"    nb_multidir_report = nb_multidir_report.sort_values(by=['path', 'filename'])\n",
"    \n",
"    nb_multidir_report.reset_index(drop=True, inplace=True)\n",
"    \n",
"    return nb_multidir_report"
]
},
{
"cell_type": "code",
"execution_count": 186,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>n_sents</th>\n",
" <th>n_words</th>\n",
" <th>n_chars</th>\n",
" <th>n_syllables</th>\n",
" <th>n_unique_words</th>\n",
" <th>n_long_words</th>\n",
" <th>n_monosyllable_words</th>\n",
" <th>n_polysyllable_words</th>\n",
" <th>flesch_kincaid_grade_level</th>\n",
" <th>...</th>\n",
" <th>reading_time_mins</th>\n",
" <th>mean_sentence_length</th>\n",
" <th>median_sentence_length</th>\n",
" <th>stdev_sentence_length</th>\n",
" <th>keyterms</th>\n",
" <th>acronyms</th>\n",
" <th>cell_count</th>\n",
" <th>cell_type</th>\n",
" <th>filename</th>\n",
" <th>path</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td># The *pandas* library: Series and DataFrames</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>35</td>\n",
" <td>9</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>4.450000</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>7.000000</td>\n",
" <td>7.0</td>\n",
" <td>0.000000</td>\n",
" <td>[(Series, 0.12192605097566381), (library, 0.11...</td>\n",
" <td>{}</td>\n",
" <td>0</td>\n",
" <td>md</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Python is a general-purpose scripting language...</td>\n",
" <td>10</td>\n",
" <td>123</td>\n",
" <td>570</td>\n",
" <td>188</td>\n",
" <td>79</td>\n",
" <td>30</td>\n",
" <td>75</td>\n",
" <td>13</td>\n",
" <td>7.242772</td>\n",
" <td>...</td>\n",
" <td>2</td>\n",
" <td>12.500000</td>\n",
" <td>12.0</td>\n",
" <td>10.058164</td>\n",
" <td>[(level datum structure, 0.050604038106577987)...</td>\n",
" <td>{}</td>\n",
" <td>1</td>\n",
" <td>md</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Note there are several libraries that we shall...</td>\n",
" <td>3</td>\n",
" <td>64</td>\n",
" <td>338</td>\n",
" <td>92</td>\n",
" <td>49</td>\n",
" <td>17</td>\n",
" <td>45</td>\n",
" <td>5</td>\n",
" <td>9.692500</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>22.333333</td>\n",
" <td>26.0</td>\n",
" <td>10.016653</td>\n",
" <td>[(standard Python code base, 0.058213526433021...</td>\n",
" <td>{}</td>\n",
" <td>3</td>\n",
" <td>md</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>## Python recap: lists and dicts</td>\n",
" <td>1</td>\n",
" <td>5</td>\n",
" <td>24</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0.520000</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>7.000000</td>\n",
" <td>7.0</td>\n",
" <td>0.000000</td>\n",
" <td>[(Python recap, 0.2923854294015616), (list, 0....</td>\n",
" <td>{}</td>\n",
" <td>4</td>\n",
" <td>md</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Python lists are flexible, mutable, data struc...</td>\n",
" <td>1</td>\n",
" <td>18</td>\n",
" <td>89</td>\n",
" <td>28</td>\n",
" <td>18</td>\n",
" <td>6</td>\n",
" <td>11</td>\n",
" <td>3</td>\n",
" <td>9.785556</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>18.000000</td>\n",
" <td>18.0</td>\n",
" <td>0.000000</td>\n",
" <td>[(python list, 0.12775495473120263), (data str...</td>\n",
" <td>{}</td>\n",
" <td>5</td>\n",
" <td>md</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" <td>../Documents/GitHub/tm351-undercertainty/noteb...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 41 columns</p>\n",
"</div>"
],
"text/plain": [
" text n_sents n_words \\\n",
"0 # The *pandas* library: Series and DataFrames 1 6 \n",
"1 Python is a general-purpose scripting language... 10 123 \n",
"2 Note there are several libraries that we shall... 3 64 \n",
"3 ## Python recap: lists and dicts 1 5 \n",
"4 Python lists are flexible, mutable, data struc... 1 18 \n",
"\n",
" n_chars n_syllables n_unique_words n_long_words n_monosyllable_words \\\n",
"0 35 9 6 2 3 \n",
"1 570 188 79 30 75 \n",
"2 338 92 49 17 45 \n",
"3 24 6 5 0 4 \n",
"4 89 28 18 6 11 \n",
"\n",
" n_polysyllable_words flesch_kincaid_grade_level ... reading_time_mins \\\n",
"0 0 4.450000 ... 1 \n",
"1 13 7.242772 ... 2 \n",
"2 5 9.692500 ... 1 \n",
"3 0 0.520000 ... 1 \n",
"4 3 9.785556 ... 1 \n",
"\n",
" mean_sentence_length median_sentence_length stdev_sentence_length \\\n",
"0 7.000000 7.0 0.000000 \n",
"1 12.500000 12.0 10.058164 \n",
"2 22.333333 26.0 10.016653 \n",
"3 7.000000 7.0 0.000000 \n",
"4 18.000000 18.0 0.000000 \n",
"\n",
" keyterms acronyms cell_count \\\n",
"0 [(Series, 0.12192605097566381), (library, 0.11... {} 0 \n",
"1 [(level datum structure, 0.050604038106577987)... {} 1 \n",
"2 [(standard Python code base, 0.058213526433021... {} 3 \n",
"3 [(Python recap, 0.2923854294015616), (list, 0.... {} 4 \n",
"4 [(python list, 0.12775495473120263), (data str... {} 5 \n",
"\n",
" cell_type filename \\\n",
"0 md ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"1 md ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"2 md ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"3 md ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"4 md ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"\n",
" path \n",
"0 ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"1 ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"2 ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"3 ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"4 ../Documents/GitHub/tm351-undercertainty/noteb... \n",
"\n",
"[5 rows x 41 columns]"
]
},
"execution_count": 186,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"TEST_DIR = '../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks'\n",
"\n",
"ddf = nb_multidir_profiler(TEST_DIR)\n",
"ddf.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the grouped report, we note that the summed reading time in minutes is likely to significantly overestimate the reading time requirement, representing as it does the sum of times in minutes, each rounded up from seconds on a per cell basis. The lower bound given by the summed reading time in *seconds* more closely relates to the markdown word count.\n",
"\n",
"However, the larger estimate perhaps does also factor in context switching time going from one cell to another. Whilst this may be invisible to the reader if a markdown cell follows a markdown cell, it may be more evident when going from a markdown cell to a code cell. On the other hand, if a markdown cell follows another because there is a change from one subsection to another, there may be a pause for reflection as part of that context switch that *is* captured by the rounding."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>n_total_code_lines</th>\n",
" <th>n_words</th>\n",
" <th>reading_time_mins</th>\n",
" <th>reading_time_s</th>\n",
" </tr>\n",
" <tr>\n",
" <th>path</th>\n",
" <th>filename</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"6\" valign=\"top\">../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks</th>\n",
" <th>../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.1 Pandas Dataframes.ipynb</th>\n",
" <td>0</td>\n",
" <td>1763</td>\n",
" <td>61</td>\n",
" <td>1077.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2 Data file formats.ipynb</th>\n",
" <td>0</td>\n",
" <td>171</td>\n",
" <td>5</td>\n",
" <td>107.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2.0 Data file formats - file encodings.ipynb</th>\n",
" <td>0</td>\n",
" <td>706</td>\n",
" <td>24</td>\n",
" <td>430.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2.1 Data file formats - CSV.ipynb</th>\n",
" <td>0</td>\n",
" <td>1665</td>\n",
" <td>41</td>\n",
" <td>987.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2.2 Data file formats - JSON.ipynb</th>\n",
" <td>0</td>\n",
" <td>443</td>\n",
" <td>17</td>\n",
" <td>270.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2.3 Data file formats - other.ipynb</th>\n",
" <td>0</td>\n",
" <td>825</td>\n",
" <td>21</td>\n",
" <td>499.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" n_total_code_lines \\\n",
"path filename \n",
"../Documents/GitHub/tm351-undercertainty/notebo... ../Documents/GitHub/tm351-undercertainty/notebo... 0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 0 \n",
"\n",
" n_words \\\n",
"path filename \n",
"../Documents/GitHub/tm351-undercertainty/notebo... ../Documents/GitHub/tm351-undercertainty/notebo... 1763 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 171 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 706 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 1665 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 443 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 825 \n",
"\n",
" reading_time_mins \\\n",
"path filename \n",
"../Documents/GitHub/tm351-undercertainty/notebo... ../Documents/GitHub/tm351-undercertainty/notebo... 61 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 5 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 24 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 41 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 17 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 21 \n",
"\n",
" reading_time_s \n",
"path filename \n",
"../Documents/GitHub/tm351-undercertainty/notebo... ../Documents/GitHub/tm351-undercertainty/notebo... 1077.0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 107.0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 430.0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 987.0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 270.0 \n",
" ../Documents/GitHub/tm351-undercertainty/notebo... 499.0 "
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.groupby(['path','filename'])[['n_total_code_lines','n_words',\n",
" 'reading_time_mins', 'reading_time_s' ]].sum()"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks': {'n_words': 5573,\n",
" 'reading_time_mins': 169,\n",
" 'reading_time_s': 3370.0}}"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf_dict = ddf.groupby(['path'])[['n_words', 'reading_time_mins', 'reading_time_s' ]].sum().to_dict(orient='index')\n",
"ddf_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reporting Templates\n",
"\n",
"It's all very well having the data in a dataframe, but it could be more useful to be able to generate some written reports. So what might an example report look like?\n",
"\n",
"How about something like:\n",
"\n",
"> In directory X there were N notebooks. The total markdown wordcount for notebooks in the directory was NN. The total number of lines of code across the notebooks was NN. The total estimated reading time across the notebooks was NN.\n",
">\n",
"> At the notebook level:\n",
"> - notebook A: markdown wordcount NN, lines of code NN, estimated reading time NN;\n",
"\n",
"It might also be useful to provide simple rule (cf. linter rules) that raise warnings about notebooks that go against best practice. For example, notebooks with word counts / code line counts or reading or completion times that exceed recommended limits."
]
},
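{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what such a rule might look like, the following checks a feedstock style `dict` against some limits. The `check_limits()` function, the `LIMITS` values and the example data are all hypothetical, chosen purely for illustration rather than taken from any recommended practice:\n",
"\n",
"```python\n",
"# Hypothetical limits, for illustration only\n",
"LIMITS = {'n_words': 5000, 'reading_time_mins': 120}\n",
"\n",
"def check_limits(report_dict, limits=LIMITS):\n",
"    # Return a list of warning strings, one for each limit that is exceeded\n",
"    warnings = []\n",
"    for path, metrics in report_dict.items():\n",
"        for key, limit in limits.items():\n",
"            value = metrics.get(key)\n",
"            if value is not None and value > limit:\n",
"                warnings.append(f'{path}: {key} is {value} (limit {limit})')\n",
"    return warnings\n",
"\n",
"example = {'Part 02 Notebooks': {'n_words': 5573, 'reading_time_mins': 169}}\n",
"for warning in check_limits(example):\n",
"    print(warning)\n",
"```\n",
"\n",
"Returning the warnings as a list, rather than printing them directly, should make it easier to include them in a written report."
]
},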
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with a simple template:"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [],
"source": [
"report_template_simple_md = '''\n",
"In directory `{path}` there were {nb_count} notebooks.\n",
"The total markdown wordcount for the notebooks in the directory was {n_words} words,\n",
"with an estimated total reading time of {reading_time_mins} minutes.\n",
"'''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can feed this from a `dict` containing fields required by the report template:"
]
},
{
"cell_type": "code",
"execution_count": 159,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks': {'n_words': 5573,\n",
" 'reading_time_mins': 169,\n",
" 'reading_time_s': 3370.0,\n",
" 'nb_count': 6,\n",
" 'path': '../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks'}}"
]
},
"execution_count": 159,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#%pip install deepmerge\n",
"from deepmerge import always_merger\n",
"\n",
"report_dict = always_merger.merge(ddf_dict, notebook_counts_by_dir )\n",
"for k in report_dict:\n",
"    report_dict[k]['path'] = k\n",
"report_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Feeding the `dict` to the template generates the report:"
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\nIn directory `../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks` there were 6 notebooks.\\nThe total markdown wordcount for the notebooks in the directory was 5573 words,\\nwith an estimated total reading time 169 minutes.\\n'"
]
},
"execution_count": 155,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"report_template_simple_md.format(**report_dict[TEST_DIR])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a function to make it easier to generate the feedstock `dict`:"
]
},
{
"cell_type": "code",
"execution_count": 190,
"metadata": {},
"outputs": [],
"source": [
"def notebook_report_feedstock_md_test(ddf):\n",
"    \"\"\"Create a feedstock dict for report generation. Keyed by directory path.\"\"\"\n",
"    ddf_dict = ddf.groupby(['path'])[['n_words', 'reading_time_mins', 'reading_time_s' ]].sum().to_dict(orient='index')\n",
"    \n",
"    notebook_counts_by_dir = ddf.groupby(['path'])['filename'].nunique().to_dict()\n",
"    notebook_counts_by_dir = {k:{'nb_count':notebook_counts_by_dir[k]} for k in notebook_counts_by_dir}\n",
"    \n",
"    report_dict = always_merger.merge(ddf_dict, notebook_counts_by_dir )\n",
"    \n",
"    for k in report_dict:\n",
"        report_dict[k]['path'] = k\n",
"    \n",
"    return report_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use the `notebook_report_feedstock_md_test()` function to generate the feedstock `dict` directly from the report dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks': {'n_words': 5573,\n",
" 'reading_time_mins': 169,\n",
" 'reading_time_s': 3370.0,\n",
" 'nb_count': 6}}"
]
},
"execution_count": 162,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"notebook_report_feedstock_md_test(ddf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Additional Reporting Levels\n",
"For additional reports, we could start to look for particular grammatical constructions in the markdown text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When it comes to looking for particular grammatical constructions in the text, the `textacy` package allows us to define patterns of interest in various ways. Are there any particular constructions that we may want to look out for in an instructional text?"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[provides, includes, intended, test, Note]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import textacy\n",
"\n",
"#But how do you define the pattern to extract the largest phrase over a sequence of tokens?\n",
"verb_phrase = r'(<VERB>?<ADV>*<VERB>+)' #extract.pos_regex_matches DEPRECATED\n",
"\n",
"verb_phrase2 = [{\"POS\": \"VERB\", \"OP\":\"?\"}, {\"POS\": \"ADV\", \"OP\": \"*\"},\n",
" {\"POS\": \"VERB\", \"OP\":\"+\"}] #extract.matches\n",
"\n",
"verb_phrase3 = r'POS:VERB:? POS:ADV:* POS:VERB:+' #extract.matches\n",
"\n",
"[vp for vp in textacy.extract.matches(doc, verb_phrase3)][:5]"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'SYM': 1,\n",
" 'PROPN': 4,\n",
" 'ADP': 5,\n",
" 'SPACE': 3,\n",
" 'DET': 6,\n",
" 'NOUN': 12,\n",
" 'VERB': 7,\n",
" 'PUNCT': 3,\n",
" 'PRON': 1,\n",
" 'CCONJ': 1,\n",
" 'PART': 1,\n",
" 'ADJ': 1,\n",
" 'ADV': 2}"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"dict(Counter(([token.pos_ for token in doc])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code Cell Analysis\n",
"\n",
"As well as reporting on markdown cells, we can also generate reports on code cells. (We could also use similar techniques to report on code blocks found in markdown cells.)\n",
"\n",
"Possible code cell reports include reporting on:\n",
"\n",
"- packages imported into a notebook;\n",
"- number of lines of code / code comments;\n",
"- code complexity.\n",
"\n",
"We could also run static analysis tests over *all* the code loaded into a notebook, for example using things like [`importchecker`](https://github.com/zopefoundation/importchecker) to check that imports are actually used.\n",
"\n",
"It is also possible to check whether the code cells in a notebook: a) have been run; b) have been run in order. If we extend the analysis to code cell outputs, we could also report on whether cells had been run without warning or error and what sort of output they produced.\n",
"\n",
"Tools such as [`pyflakes`](https://github.com/PyCQA/pyflakes) can also be used to run a wider range of static tests over a codebase, as can other code linters. See also [*Thinking About Things That Might Be Autogradeable or Useful for Automated Marking Support*](https://blog.ouseful.info/2019/12/10/thinking-about-things-that-might-be-autogradeable/) for examples of tests that may be used in autograding, some of which might also be useful for notebook code profiling.\n",
"\n",
"It might also be worth trying to collate potentially useful guidelines / heuristics / rules of thumb for creating notebooks that could also provide the basis of quality-minded linting checks.\n",
"\n",
"For example:\n",
"\n",
"- a markdown cell should always appear before a code cell to set the context for what the code cell is expected to achieve;\n",
"- a markdown cell commenting on the output of the immediately preceding code cell may be appropriate in some cases;\n",
"- one function should be defined per code cell. A markdown cell immediately following a code cell that defines a function might include a line of text that could also serve as the function docstring, describing what the function does and prefacing a code cell that demonstrates the behaviour of the function."
]
},
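{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first of the rules above could be sketched as a simple check over a list of cells. This is a rough illustration only: the `code_cells_without_context()` function name is hypothetical, and the sketch assumes each cell is a `dict` with a `cell_type` key, as is the case for the items in `nb.cells`:\n",
"\n",
"```python\n",
"def code_cells_without_context(cells):\n",
"    # Return the index of each code cell that is not\n",
"    # immediately preceded by a markdown cell\n",
"    flagged = []\n",
"    previous_type = None\n",
"    for i, cell in enumerate(cells):\n",
"        if cell['cell_type'] == 'code' and previous_type != 'markdown':\n",
"            flagged.append(i)\n",
"        previous_type = cell['cell_type']\n",
"    return flagged\n",
"\n",
"cells = [{'cell_type': 'markdown'}, {'cell_type': 'code'}, {'cell_type': 'code'}]\n",
"print(code_cells_without_context(cells))\n",
"```\n",
"\n",
"A check like this would probably be better treated as an advisory warning than as an error, given that consecutive code cells may be quite reasonable in some notebooks."
]
},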
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generating code reports over a single notebook\n",
"\n",
"Let's start to put together some metrics we can run against code cells, either at an individual level or from code aggregated from across all the code cells in a notebook."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['abjad', 'numpy', 'pandas', 'IPython.dsiplay']"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c='''#print\\nimport pandas\\n\\nprint('a')\\nimport abjad\\nimport numpy as np\\nfrom IPython.dsiplay import HTML, JSON'''\n",
"\n",
"#https://github.com/andrewp-as-is/list-imports.py #list imports\n",
"#%pip install list-imports\n",
"import list_imports\n",
"list_imports.parse(c)\n",
"#Would also need to capture magics?\n",
"\n",
"# TO DO - NOT CURRENTLY REPORTED"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some utilities may not make sense in the reporting when applied at a cell level. For example, it's quite likely that a package imported into a cell may not be used in that cell, which `pyflakes` would report unfavourably on:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"dummy:1: 'pandas as pd' imported but unused\\n\""
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#%pip install pyflakes\n",
"#pyflakes seems to print the report, so we'd need to find a way to capture it\n",
"from pyflakes.api import check\n",
"from pyflakes.reporter import Reporter\n",
"\n",
"import io\n",
"\n",
"output_w = io.StringIO()\n",
"output_e = io.StringIO()\n",
"\n",
"check('''import pandas as pd''', 'dummy', Reporter(output_w, output_e))\n",
"output_w.getvalue()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another form of analysis that only makes sense at the notebook level is the code cell execution analysis:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[None, None, 1, None, None] False True\n"
]
}
],
"source": [
"# Check execution across notebook - TO DO - NOT CURRENTLY REPORTED\n",
"cell_execution_order = []\n",
"num_code_cells = 0\n",
"for cell in nb.cells:\n",
"    if cell['cell_type']=='code':\n",
"        cell_execution_order.append(cell['execution_count'])\n",
"        num_code_cells = num_code_cells + 1\n",
"\n",
"\n",
"_executed_cells = [i for i in cell_execution_order if i is not None and isinstance(i,int) ]\n",
"in_order_execution = _executed_cells == sorted(_executed_cells)\n",
"\n",
"all_cells_executed = len(_executed_cells)==num_code_cells\n",
"print(cell_execution_order, all_cells_executed, in_order_execution,)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parsing IPython Code\n",
"\n",
"One thing to bear in mind is that code cells may contain cell block magic that switches the code from the assumed default Python to a potentially different language. For this reason, when we meet cells that employ cell block magic we might want to fall back from the `radon` metrics, which rely on loading the code into a Python AST parser, or explore whether an IPython parser could be used instead."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to cleanse IPython directives such as shell commands (`!` prefix) or magics (`%` prefix) from a code string so that we can present it to `radon`."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"def sanitise_IPython_code(c):\n",
"    \"\"\"Cleanse an IPython code string so we can parse it with radon.\"\"\"\n",
"    #Comment out magic and shell commands\n",
"    c = '\\n'.join([f'#{_r}' if _r.lstrip().startswith(('%','!')) else _r for _r in c.splitlines()])\n",
"    \n",
"    return c"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `sanitise_IPython_code()` function partially sanitises an IPython code string so that it can be passed to, and parsed by, `radon`. Note that where magic or shell statements are used on the right hand side of an assignment statement, this will still cause an error."
]
},
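{
"cell_type": "markdown",
"metadata": {},
"source": [
"The limitation regarding assignment statements can be illustrated with a small sketch. Note that this uses `ast.parse()` from the Python standard library as a stand-in for a Python parser, which is an assumption made for illustration; `radon` itself may respond differently. The `parses_as_python()` helper is hypothetical, and the sanitiser is repeated so that the example is self-contained:\n",
"\n",
"```python\n",
"import ast\n",
"\n",
"def sanitise_IPython_code(c):\n",
"    # Same logic as the sanitiser above: comment out lines starting with % or !\n",
"    return '\\n'.join([f'#{_r}' if _r.lstrip().startswith(('%','!')) else _r for _r in c.splitlines()])\n",
"\n",
"def parses_as_python(c):\n",
"    # Report whether the standard library parser accepts the code string\n",
"    try:\n",
"        ast.parse(c)\n",
"        return True\n",
"    except SyntaxError:\n",
"        return False\n",
"\n",
"# A shell command on its own line is commented out, so the code parses\n",
"print(parses_as_python(sanitise_IPython_code('!ls\\nprint(1)')))\n",
"\n",
"# A shell command on the right hand side of an assignment is not caught\n",
"print(parses_as_python(sanitise_IPython_code('files = !ls\\nprint(files)')))\n",
"```\n",
"\n",
"One way of handling such lines might be to comment out any line that contains an assignment from a `!` or `%` expression, although that risks hiding names that are used later in the cell."
]
},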
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#%load_ext magic\n",
"import pandas\n",
"\n",
"#!ls\n",
"print(a)\n"
]
},
{
"data": {
"text/plain": [
"(5, 1, 2, 2)"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Use the `radon` analyzer\n",
"#%pip install radon\n",
"from radon.raw import analyze\n",
"\n",
"c = '''%load_ext magic\\nimport pandas\\n\\n!ls\\nprint(a)'''\n",
"c = sanitise_IPython_code(c)\n",
"\n",
"print(c)\n",
"n_total_code_lines, n_blank_code_lines, \\\n",
" n_single_line_comment_code_lines, n_code_lines = r_analyze(sanitise_IPython_code(c))\n",
"\n",
"n_total_code_lines, n_blank_code_lines, n_single_line_comment_code_lines, n_code_lines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To parse a code cell, we can try to use the `radon` analyser with a sanitised code string, or fall back to the simpler custom code analyser. It will also be convenient to return the results as a Python `dict` object."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"def robust_code_cell_analyse(c, parser='radon'):\n",
"    \"\"\"Use the `radon` code analyser if we can else fall back to the simple custom code analyser.\"\"\"\n",
"    \n",
"    def cleansed_radon(c):\n",
"        return r_analyze(sanitise_IPython_code(c))\n",
"    \n",
"    if c.startswith('%%'):\n",
"        #use local code analyser for cell block magic\n",
"        parser = 'local'\n",
"\n",
"    if parser == 'radon':\n",
"        try:\n",
"            _response = cleansed_radon(c)\n",
"        except Exception:\n",
"            #fallback to simple analyser\n",
"            _response = code_block_report(c)\n",
"    else:\n",
"        _response = code_block_report(c)\n",
"    \n",
"    (n_total_code_lines, n_blank_code_lines, \\\n",
"        n_single_line_comment_code_lines, n_code_lines) = _response\n",
"    \n",
"    _reading_time = code_reading_time(n_code_lines, n_single_line_comment_code_lines)\n",
"    \n",
"    response = {\n",
"        'n_total_code_lines': n_total_code_lines,\n",
"        'n_blank_code_lines': n_blank_code_lines,\n",
"        'n_single_line_comment_code_lines': n_single_line_comment_code_lines,\n",
"        'n_code_lines': n_code_lines,\n",
"        'n_screen_lines':n_total_code_lines,\n",
"        'reading_time_s':_reading_time,\n",
"        'reading_time_mins': math.ceil(_reading_time/60)\n",
"    }\n",
"    \n",
"    return response"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The robust analyser should cope with a variety of strings."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'n_total_code_lines': 4, 'n_blank_code_lines': 1, 'n_single_line_comment_code_lines': 2, 'n_code_lines': 1, 'n_screen_lines': 4, 'reading_time_s': 3, 'reading_time_mins': 1}\n",
"{'n_total_code_lines': 2, 'n_blank_code_lines': 0, 'n_single_line_comment_code_lines': 0, 'n_code_lines': 2, 'n_screen_lines': 2, 'reading_time_s': 2, 'reading_time_mins': 1}\n"
]
}
],
"source": [
"print(robust_code_cell_analyse('import pandas\\n\\n# comment\\n!ls'))\n",
"print(robust_code_cell_analyse('%%sql\\nSELECT * FROM TABLE'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now need to start pulling together a function that we can call to run the basic report and other code cell reports."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"def process_notebook_code_text(txt):\n",
"    \"\"\"Generate code cell report.\"\"\"\n",
"    basic_code_report = robust_code_cell_analyse(txt)\n",
"    return pd.DataFrame([{'text':txt,\n",
"                          **basic_code_report }])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function generates a single-row report dataframe from a code string:"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>n_total_code_lines</th>\n",
" <th>n_blank_code_lines</th>\n",
" <th>n_single_line_comment_code_lines</th>\n",
" <th>n_code_lines</th>\n",
" <th>n_screen_lines</th>\n",
" <th>reading_time_s</th>\n",
" <th>reading_time_mins</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>import pandas\\n\\n# comment\\n!ls</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text n_total_code_lines n_blank_code_lines \\\n",
"0 import pandas\\n\\n# comment\\n!ls 4 1 \n",
"\n",
" n_single_line_comment_code_lines n_code_lines n_screen_lines \\\n",
"0 2 1 4 \n",
"\n",
" reading_time_s reading_time_mins \n",
"0 3 1 "
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"process_notebook_code_text('import pandas\\n\\n# comment\\n!ls')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to process code cells as well as markdown cells in our notebook processor, we will need to build on the `process_notebook_md()` function to create a more general one. Note that the current approach will give us an inefficient dataframe, column wise, in that whilst each row represents the report from a code cell *or* a markdown cell, the columns cover reports from both code *and* markdown cells."
]
},
{
"cell_type": "code",
"execution_count": 223,
"metadata": {},
"outputs": [],
"source": [
"def process_notebook(nb, fn=''):\n",
"    \"\"\"Process all the markdown and code cells in a notebook.\"\"\"\n",
"    #Collect the per-cell report dataframes in a list and concatenate once,\n",
"    #since DataFrame.append() was removed in pandas 2.0\n",
"    cell_reports = []\n",
"    \n",
"    for i, cell in enumerate(nb.cells):\n",
"        if cell['cell_type']=='markdown':\n",
"            _metrics = process_notebook_md_doc( nlp( cell['source'] ))\n",
"            _cell_type = 'md'\n",
"        elif cell['cell_type']=='code':\n",
"            _metrics = process_notebook_code_text(cell['source'] )\n",
"            _cell_type = 'code'\n",
"        else:\n",
"            #Ignore other cell types, such as raw cells\n",
"            continue\n",
"        _metrics['cell_count'] = i\n",
"        _metrics['cell_type'] = _cell_type\n",
"        cell_reports.append(_metrics)\n",
"    \n",
"    cell_reports = pd.concat(cell_reports, sort=False) if cell_reports else pd.DataFrame()\n",
"    cell_reports['filename'] = fn\n",
"    cell_reports.reset_index(drop=True, inplace=True)\n",
"    return cell_reports"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should now be able to generate a report that includes statistics from code as well as markdown cells."
]
},
{
"cell_type": "code",
"execution_count": 224,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>n_sents</th>\n",
" <th>n_words</th>\n",
" <th>n_chars</th>\n",
" <th>n_syllables</th>\n",
" <th>n_unique_words</th>\n",
" <th>n_long_words</th>\n",
" <th>n_monosyllable_words</th>\n",
" <th>n_polysyllable_words</th>\n",
" <th>flesch_kincaid_grade_level</th>\n",
" <th>...</th>\n",
" <th>reading_time_s</th>\n",
" <th>reading_time_mins</th>\n",
" <th>mean_sentence_length</th>\n",
" <th>median_sentence_length</th>\n",
" <th>stdev_sentence_length</th>\n",
" <th>keyterms</th>\n",
" <th>acronyms</th>\n",
" <th>cell_count</th>\n",
" <th>cell_type</th>\n",
" <th>filename</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td># Test Notebook for Notebook Profiler\\n\\nThis ...</td>\n",
" <td>4.0</td>\n",
" <td>40.0</td>\n",
" <td>203.0</td>\n",
" <td>63.0</td>\n",
" <td>27.0</td>\n",
" <td>15.0</td>\n",
" <td>25.0</td>\n",
" <td>6.0</td>\n",
" <td>6.895000</td>\n",
" <td>...</td>\n",
" <td>25.0</td>\n",
" <td>1</td>\n",
" <td>11.00</td>\n",
" <td>9.5</td>\n",
" <td>4.966555</td>\n",
" <td>[(notebook profiler, 0.08196495093971548), (te...</td>\n",
" <td>{}</td>\n",
" <td>0</td>\n",
" <td>md</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>## Markdown Cells With Code Blocks\\n\\nThis cel...</td>\n",
" <td>4.0</td>\n",
" <td>30.0</td>\n",
" <td>123.0</td>\n",
" <td>38.0</td>\n",
" <td>22.0</td>\n",
" <td>5.0</td>\n",
" <td>23.0</td>\n",
" <td>1.0</td>\n",
" <td>2.281667</td>\n",
" <td>...</td>\n",
" <td>18.0</td>\n",
" <td>1</td>\n",
" <td>8.25</td>\n",
" <td>5.5</td>\n",
" <td>8.261356</td>\n",
" <td>[(single code block, 0.09825762538579677), (Ma...</td>\n",
" <td>{}</td>\n",
" <td>1</td>\n",
" <td>md</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>This cell contains two code blocks.\\n\\nHere's ...</td>\n",
" <td>8.0</td>\n",
" <td>49.0</td>\n",
" <td>173.0</td>\n",
" <td>58.0</td>\n",
" <td>23.0</td>\n",
" <td>6.0</td>\n",
" <td>41.0</td>\n",
" <td>1.0</td>\n",
" <td>0.766097</td>\n",
" <td>...</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>6.50</td>\n",
" <td>6.0</td>\n",
" <td>4.105745</td>\n",
" <td>[(code block, 0.052250174985765105), (import p...</td>\n",
" <td>{}</td>\n",
" <td>2</td>\n",
" <td>md</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td># This is a code cell\\nimport pandas\\n\\n#Creat...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>4.0</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>code</td>\n",
" <td></td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td># This is a code cell with a magic...\\n\\n%matp...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>5.0</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4</td>\n",
" <td>code</td>\n",
" <td></td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 40 columns</p>\n",
"</div>"
],
"text/plain": [
" text n_sents n_words \\\n",
"0 # Test Notebook for Notebook Profiler\\n\\nThis ... 4.0 40.0 \n",
"1 ## Markdown Cells With Code Blocks\\n\\nThis cel... 4.0 30.0 \n",
"2 This cell contains two code blocks.\\n\\nHere's ... 8.0 49.0 \n",
"3 # This is a code cell\\nimport pandas\\n\\n#Creat... NaN NaN \n",
"4 # This is a code cell with a magic...\\n\\n%matp... NaN NaN \n",
"\n",
" n_chars n_syllables n_unique_words n_long_words n_monosyllable_words \\\n",
"0 203.0 63.0 27.0 15.0 25.0 \n",
"1 123.0 38.0 22.0 5.0 23.0 \n",
"2 173.0 58.0 23.0 6.0 41.0 \n",
"3 NaN NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN NaN \n",
"\n",
" n_polysyllable_words flesch_kincaid_grade_level ... reading_time_s \\\n",
"0 6.0 6.895000 ... 25.0 \n",
"1 1.0 2.281667 ... 18.0 \n",
"2 1.0 0.766097 ... 28.0 \n",
"3 NaN NaN ... 4.0 \n",
"4 NaN NaN ... 5.0 \n",
"\n",
" reading_time_mins mean_sentence_length median_sentence_length \\\n",
"0 1 11.00 9.5 \n",
"1 1 8.25 5.5 \n",
"2 1 6.50 6.0 \n",
"3 1 NaN NaN \n",
"4 1 NaN NaN \n",
"\n",
" stdev_sentence_length keyterms \\\n",
"0 4.966555 [(notebook profiler, 0.08196495093971548), (te... \n",
"1 8.261356 [(single code block, 0.09825762538579677), (Ma... \n",
"2 4.105745 [(code block, 0.052250174985765105), (import p... \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" acronyms cell_count cell_type filename \n",
"0 {} 0 md \n",
"1 {} 1 md \n",
"2 {} 2 md \n",
"3 NaN 3 code \n",
"4 NaN 4 code \n",
"\n",
"[5 rows x 40 columns]"
]
},
"execution_count": 224,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"report = process_notebook(nb)\n",
"report.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's just check what columns we are potentially reporting on:"
]
},
{
"cell_type": "code",
"execution_count": 225,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['text', 'n_sents', 'n_words', 'n_chars', 'n_syllables',\n",
" 'n_unique_words', 'n_long_words', 'n_monosyllable_words',\n",
" 'n_polysyllable_words', 'flesch_kincaid_grade_level',\n",
" 'flesch_reading_ease', 'smog_index', 'gunning_fog_index',\n",
" 'coleman_liau_index', 'automated_readability_index', 'lix',\n",
" 'gulpease_index', 'wiener_sachtextformel', 'n_headers', 'n_paras',\n",
" 'n_screen_lines', 's_lengths', 's_mean', 's_median', 's_stdev',\n",
" 'n_code_blocks', 'n_total_code_lines', 'n_code_lines',\n",
" 'n_blank_code_lines', 'n_single_line_comment_code_lines',\n",
" 'reading_time_s', 'reading_time_mins', 'mean_sentence_length',\n",
" 'median_sentence_length', 'stdev_sentence_length', 'keyterms',\n",
" 'acronyms', 'cell_count', 'cell_type', 'filename'],\n",
" dtype='object')"
]
},
"execution_count": 225,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"report.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And let's see if our directory processor now also includes code cell statistics:"
]
},
{
"cell_type": "code",
"execution_count": 226,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"md 160\n",
"code 119\n",
"Name: cell_type, dtype: int64"
]
},
"execution_count": 226,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf2 = nb_multidir_profiler('../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks')\n",
"ddf2['cell_type'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also check to see how the code cells are reported:"
]
},
{
"cell_type": "code",
"execution_count": 229,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"n_code_blocks 0.0\n",
"n_total_code_lines 390.0\n",
"n_code_lines 228.0\n",
"n_blank_code_lines 25.0\n",
"n_single_line_comment_code_lines 137.0\n",
"dtype: float64"
]
},
"execution_count": 229,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"code_cols = [c for c in ddf2.columns if 'code' in c]\n",
"ddf2[ddf2['cell_type']=='code'][code_cols].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generating Reports Across Multiple Directories\n",
"\n",
"We are now in a position to start generating rich reports for notebooks across several directories.\n",
"\n",
"Let's grab data for notebooks across an example set of directories:"
]
},
{
"cell_type": "code",
"execution_count": 231,
"metadata": {},
"outputs": [],
"source": [
"ddf3 = nb_multidir_profiler('../Documents/GitHub/tm351-undercertainty/notebooks/tm351/')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And just quickly test we can generate a report that summarises the notebooks in each directory:"
]
},
{
"cell_type": "code",
"execution_count": 232,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"In directory `../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 01 Notebooks` there were 5 notebooks.\n",
"The total markdown wordcount for the notebooks in the directory was 3033.0 words,\n",
"with an estimated total reading time of 143 minutes.\n",
"\n",
"\n",
"\n",
"In directory `../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks` there were 6 notebooks.\n",
"The total markdown wordcount for the notebooks in the directory was 5573.0 words,\n",
"with an estimated total reading time of 288 minut\n"
]
}
],
"source": [
"big_feedstock = notebook_report_feedstock_md_test(ddf3)\n",
"report_txt=''\n",
"for d in big_feedstock:\n",
"    if 'tm351/Part ' in d:\n",
"        report_txt = report_txt + '\\n\\n' + report_template_simple_md.format(**big_feedstock[d])\n",
" \n",
"print(report_txt[:500])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's update the report template and the report feedstock function.\n",
"\n",
"First, what shall we report on?"
]
},
{
"cell_type": "code",
"execution_count": 210,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['filename', 'text', 'n_sents', 'n_words', 'n_chars', 'n_syllables',\n",
" 'n_unique_words', 'n_long_words', 'n_monosyllable_words',\n",
" 'n_polysyllable_words', 'flesch_kincaid_grade_level',\n",
" 'flesch_reading_ease', 'smog_index', 'gunning_fog_index',\n",
" 'coleman_liau_index', 'automated_readability_index', 'lix',\n",
" 'gulpease_index', 'wiener_sachtextformel', 'n_headers', 'n_paras',\n",
" 'n_screen_lines', 's_lengths', 's_mean', 's_median', 's_stdev',\n",
" 'n_code_blocks', 'n_total_code_lines', 'n_code_lines',\n",
" 'n_blank_code_lines', 'n_single_line_comment_code_lines',\n",
" 'reading_time_s', 'reading_time_mins', 'mean_sentence_length',\n",
" 'median_sentence_length', 'stdev_sentence_length', 'keyterms',\n",
" 'acronyms', 'cell_count', 'cell_type', 'path'],\n",
" dtype='object')"
]
},
"execution_count": 210,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf3.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make a start on a complete report template..."
]
},
{
"cell_type": "code",
"execution_count": 304,
"metadata": {},
"outputs": [],
"source": [
"report_template_full = '''\n",
"In directory `{path}` there were {nb_count} notebooks.\n",
"\n",
"- total markdown wordcount {n_words} words across {n_md_cells} markdown cells\n",
"- total code line count of {n_total_code_lines} lines of code across {n_code_cells} code cells\n",
" - {n_code_lines} code lines, {n_single_line_comment_code_lines} comment lines and {n_blank_code_lines} blank lines\n",
"\n",
"Estimated total reading time of {reading_time_mins} minutes.\n",
"\n",
"'''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's add those extra requirements to the feedstock generator:"
]
},
{
"cell_type": "code",
"execution_count": 300,
"metadata": {},
"outputs": [],
"source": [
"def notebook_report_feedstock(ddf):\n",
"    \"\"\"Create a feedstock dict for report generation. Keyed by directory path.\"\"\"\n",
"    #Sum the numeric metrics over all the cells in each directory\n",
"    ddf_dict = ddf.groupby(['path'])[['n_words', 'reading_time_mins', 'reading_time_s',\n",
"                                      'n_code_lines', 'n_single_line_comment_code_lines',\n",
"                                      'n_total_code_lines','n_blank_code_lines']].sum().to_dict(orient='index')\n",
"\n",
"    #Count the number of distinct notebooks in each directory\n",
"    notebook_counts_by_dir = ddf.groupby(['path'])['filename'].nunique().to_dict()\n",
"    notebook_counts_by_dir = {k:{'nb_count':notebook_counts_by_dir[k]} for k in notebook_counts_by_dir}\n",
"\n",
"    report_dict = always_merger.merge(ddf_dict, notebook_counts_by_dir )\n",
"\n",
"    #Count the code and markdown cells in each directory\n",
"    code_cell_counts = ddf[ddf['cell_type']=='code'].groupby(['path']).size().to_dict()\n",
"    md_cell_counts = ddf[ddf['cell_type']=='md'].groupby(['path']).size().to_dict()\n",
"\n",
"    for k in report_dict:\n",
"        report_dict[k]['path'] = k\n",
"        #Report 'NA' if a directory has no cells of a given type\n",
"        report_dict[k]['n_code_cells'] = code_cell_counts.get(k, 'NA')\n",
"        report_dict[k]['n_md_cells'] = md_cell_counts.get(k, 'NA')\n",
"\n",
"    return report_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a wrapper function for generating the report text:"
]
},
{
"cell_type": "code",
"execution_count": 301,
"metadata": {},
"outputs": [],
"source": [
"def reporter(df, template, path_filter=''):\n",
"    \"\"\"Generate report text for each directory whose path contains path_filter.\"\"\"\n",
"    feedstock = notebook_report_feedstock(df)\n",
"    report_txt=''\n",
"    for d in feedstock:\n",
"        if path_filter in d:\n",
"            report_txt = report_txt + '\\n\\n' + template.format(**feedstock[d])\n",
"    return report_txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use the `reporter()` function to generate a report based on filtered paths from a report dataframe and a template:"
]
},
{
"cell_type": "code",
"execution_count": 302,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"In directory `../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks` there were 6 notebooks.\n",
"\n",
"- total markdown wordcount 5573.0 words across 160\n",
"- total code line count of 390 lines of code across 119 code cells\n",
" - 228 code lines, 137 comment lines and 25 blank lines\n",
"\n",
"Estimated total reading time of 288 minutes.\n",
"\n",
"\n"
]
}
],
"source": [
"print(reporter(ddf2, report_template_full, 'tm351/Part '))"
]
},
{
"cell_type": "code",
"execution_count": 305,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"\n",
"In directory `Part 01 Notebooks` there were 5 notebooks.\n",
"\n",
"- total markdown wordcount 3033.0 words across 65 markdown cells\n",
"- total code line count of 571 lines of code across 65 code cells\n",
" - 327 code lines, 160 comment lines and 84 blank lines\n",
"\n",
"Estimated total reading time of 143 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 02 Notebooks` there were 6 notebooks.\n",
"\n",
"- total markdown wordcount 5573.0 words across 160 markdown cells\n",
"- total code line count of 390 lines of code across 119 code cells\n",
" - 228 code lines, 137 comment lines and 25 blank lines\n",
"\n",
"Estimated total reading time of 288 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 03 Notebooks` there were 4 notebooks.\n",
"\n",
"- total markdown wordcount 11027.0 words across 230 markdown cells\n",
"- total code line count of 808 lines of code across 181 code cells\n",
" - 606 code lines, 131 comment lines and 72 blank lines\n",
"\n",
"Estimated total reading time of 444 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 04 Notebooks` there were 8 notebooks.\n",
"\n",
"- total markdown wordcount 11992.0 words across 232 markdown cells\n",
"- total code line count of 917 lines of code across 259 code cells\n",
" - 595 code lines, 260 comment lines and 64 blank lines\n",
"\n",
"Estimated total reading time of 518 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 05 Notebooks` there were 3 notebooks.\n",
"\n",
"- total markdown wordcount 8499.0 words across 105 markdown cells\n",
"- total code line count of 978 lines of code across 84 code cells\n",
" - 510 code lines, 322 comment lines and 147 blank lines\n",
"\n",
"Estimated total reading time of 231 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 07 Notebooks` there were 2 notebooks.\n",
"\n",
"- total markdown wordcount 6024.0 words across 106 markdown cells\n",
"- total code line count of 0 lines of code across NA code cells\n",
" - 0 code lines, 0 comment lines and 0 blank lines\n",
"\n",
"Estimated total reading time of 127 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 08 Notebooks` there were 3 notebooks.\n",
"\n",
"- total markdown wordcount 12612.0 words across 383 markdown cells\n",
"- total code line count of 770 lines of code across 155 code cells\n",
" - 563 code lines, 59 comment lines and 163 blank lines\n",
"\n",
"Estimated total reading time of 552 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 09 Notebooks` there were 3 notebooks.\n",
"\n",
"- total markdown wordcount 9856.0 words across 254 markdown cells\n",
"- total code line count of 502 lines of code across 110 code cells\n",
" - 359 code lines, 48 comment lines and 105 blank lines\n",
"\n",
"Estimated total reading time of 384 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 10 Notebooks` there were 5 notebooks.\n",
"\n",
"- total markdown wordcount 11511.0 words across 303 markdown cells\n",
"- total code line count of 802 lines of code across 170 code cells\n",
" - 616 code lines, 66 comment lines and 145 blank lines\n",
"\n",
"Estimated total reading time of 506 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 11 Notebooks` there were 6 notebooks.\n",
"\n",
"- total markdown wordcount 17442.0 words across 437 markdown cells\n",
"- total code line count of 1586 lines of code across 250 code cells\n",
" - 1357 code lines, 86 comment lines and 154 blank lines\n",
"\n",
"Estimated total reading time of 733 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 12 Notebooks` there were 2 notebooks.\n",
"\n",
"- total markdown wordcount 6570.0 words across 242 markdown cells\n",
"- total code line count of 657 lines of code across 160 code cells\n",
" - 570 code lines, 30 comment lines and 53 blank lines\n",
"\n",
"Estimated total reading time of 413 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 12 Notebooks/optional_part_12` there were 3 notebooks.\n",
"\n",
"- total markdown wordcount 846.0 words across 21 markdown cells\n",
"- total code line count of 51 lines of code across 14 code cells\n",
" - 37 code lines, 5 comment lines and 9 blank lines\n",
"\n",
"Estimated total reading time of 39 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 14 Notebooks` there were 8 notebooks.\n",
"\n",
"- total markdown wordcount 7077.0 words across 148 markdown cells\n",
"- total code line count of 825 lines of code across 197 code cells\n",
" - 641 code lines, 105 comment lines and 78 blank lines\n",
"\n",
"Estimated total reading time of 359 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 15 Notebooks` there were 10 notebooks.\n",
"\n",
"- total markdown wordcount 4434.0 words across 121 markdown cells\n",
"- total code line count of 1314 lines of code across 208 code cells\n",
" - 1077 code lines, 108 comment lines and 138 blank lines\n",
"\n",
"Estimated total reading time of 336 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 16 Notebooks` there were 6 notebooks.\n",
"\n",
"- total markdown wordcount 2214.0 words across 62 markdown cells\n",
"- total code line count of 527 lines of code across 123 code cells\n",
" - 454 code lines, 51 comment lines and 22 blank lines\n",
"\n",
"Estimated total reading time of 189 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 20 Notebooks` there were 2 notebooks.\n",
"\n",
"- total markdown wordcount 2219.0 words across 59 markdown cells\n",
"- total code line count of 208 lines of code across 24 code cells\n",
" - 124 code lines, 24 comment lines and 46 blank lines\n",
"\n",
"Estimated total reading time of 84 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 21 Notebooks` there were 3 notebooks.\n",
"\n",
"- total markdown wordcount 2200.0 words across 64 markdown cells\n",
"- total code line count of 426 lines of code across 45 code cells\n",
" - 273 code lines, 44 comment lines and 109 blank lines\n",
"\n",
"Estimated total reading time of 110 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 22 Notebooks` there were 4 notebooks.\n",
"\n",
"- total markdown wordcount 5431.0 words across 174 markdown cells\n",
"- total code line count of 528 lines of code across 100 code cells\n",
" - 355 code lines, 58 comment lines and 109 blank lines\n",
"\n",
"Estimated total reading time of 279 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 23 Notebooks` there were 3 notebooks.\n",
"\n",
"- total markdown wordcount 7645.0 words across 187 markdown cells\n",
"- total code line count of 576 lines of code across 109 code cells\n",
" - 384 code lines, 79 comment lines and 138 blank lines\n",
"\n",
"Estimated total reading time of 312 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 25 Notebooks` there were 3 notebooks.\n",
"\n",
"- total markdown wordcount 7447.0 words across 119 markdown cells\n",
"- total code line count of 890 lines of code across 64 code cells\n",
" - 563 code lines, 181 comment lines and 144 blank lines\n",
"\n",
"Estimated total reading time of 220 minutes.\n",
"\n",
"\n",
"\n",
"\n",
"In directory `Part 26 Notebooks` there were 3 notebooks.\n",
"\n",
"- total markdown wordcount 3993.0 words across 82 markdown cells\n",
"- total code line count of 828 lines of code across 45 code cells\n",
" - 535 code lines, 130 comment lines and 153 blank lines\n",
"\n",
"Estimated total reading time of 141 minutes.\n",
"\n",
"\n"
]
}
],
"source": [
"print(reporter(ddf3, report_template_full, 'tm351/Part ').replace('../Documents/GitHub/tm351-undercertainty/notebooks/tm351/',''))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualising Notebook Structure\n",
"\n",
"To provide a glanceable, macroscopic way of comparing the size and structure of multiple notebooks, we can generate a simple visualisation based on screen line counts and colour codes for different cell types or cell state.\n",
"\n",
"Reports that include a cell index and a simple line count (for example, the number of code lines, or of screen lines for markdown) can be rendered directly as linear visualisations showing the overall structure of a notebook.\n",
"\n",
"For example, colours might be used to distinguish:\n",
"\n",
"- markdown: header;\n",
"- markdown: paragraph;\n",
"- markdown: code block;\n",
"- markdown: blank line;\n",
"- code: code;\n",
"- code: comment;\n",
"- code: magic;\n",
"- code: blank line;\n",
"- other: other cells.\n",
"\n",
"Profiling within a cell requires access to the cell internals, or generating a cell profile as each cell is processed.\n",
"\n",
"However, it's easy enough to generate a view over the code and markdown cells.\n",
"\n",
"Let's start by exploring a simple representation:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAV0AAADnCAYAAAC9roUQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAADb0lEQVR4nO3coXUDMRQF0ShnK0gJoS7C1aQqNxMeHOpW5AJsogUj4Huh0ENzPtKYc34A0PjcPQDgnYguQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYDQsXsALBvj+Wu8OceGJbDMpQsQEl2AkOgChEQXICS6ACHRBQiJLkBIdAFCogsQEl2AkOgChEQXICS6ACHRBQiJLkBIdAFCogsQEl2AkOgChEQXICS6ACHRBQiJLkBIdAFCogsQEl2AkOgChEQXICS6ACHRBQiJLkBIdAFCx+4BsOz3b/cCOM2lCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxA6dg+AVT//309vt+uGIXCCSxcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKEjt0DYNXtcn/x+pXvgDNcugAh0QUIiS5ASHQBQqILEBJdgJDoAoREFyAkugAh0QUIiS5ASHQBQqILEBJdgJDoAoREFyAkugAh0QUIiS5ASHQBQqILEBJdgJDoAoREFyAkugAh0QUIiS5ASHQBQqILEBJdgJDoAoREFyAkugAh0QUIiS5ASHQBQqILEBJdgJDoAoREFyAkugAh0QUIiS5ASHQBQqILEBJdgJDoAoREFyAkugAh0QUIiS5ASHQBQqILEBJdgJDoAoREFyAkugAh0QUIiS5ASHQBQqILEBJdgJDoAoREFyAkugAh0QUIiS5ASHQBQqILEBJdgNCYc+7eAPA2XLoAIdEFCIkuQEh0AUKiCxASXYCQ6AKERBcgJLoAIdEFCIkuQEh0AUKiCxB6ANR4EEa6ZpCIAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"\n",
"fig, ax = plt.subplots()\n",
"ax.axis('off')\n",
"\n",
"#Simple representation of lines per cell and cell colour based on cell type\n",
"n_c = [(1,'r'),(2,'pink'), (1,'cornflowerblue'), (2,'pink')]\n",
"\n",
"x=0\n",
"y=0\n",
"\n",
"for _n_c in n_c:\n",
"    _y = y + _n_c[0]\n",
"    plt.plot([x,x], [y,_y], _n_c[1], linewidth=5)\n",
"    y = _y #may want to add a gap when moving from one cell to next\n",
"plt.gca().invert_yaxis()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get the list of cell size and colour tuples from a notebook's report data frame:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(4, 'cornflowerblue'),\n",
" (3, 'cornflowerblue'),\n",
" (8, 'cornflowerblue'),\n",
" (5, 'pink'),\n",
" (8, 'pink'),\n",
" (1, 'pink'),\n",
" (3, 'pink'),\n",
" (2, 'pink')]"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"VIS_COLOUR_MAP = {'md':'cornflowerblue','code':'pink'}\n",
"\n",
"def cell_attrib(cell, colour='cell_type', size='n_screen_lines'):\n",
"    _colour = VIS_COLOUR_MAP[ cell[colour] ]\n",
"    return (cell[size], _colour)\n",
"\n",
"report.apply(cell_attrib, axis=1).to_list()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a function to visualise a notebook based on its list of cell size and colour tuples; we'll also allow it to handle multiple lists:"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def nb_vis(cell_map, w=20, gap_boost=1, **kwargs):\n",
"    \"\"\"Visualise notebook gross cell structure.\"\"\"\n",
"\n",
"    def get_gap(cell_map):\n",
"        \"\"\"Automatically set the gap value based on overall length\"\"\"\n",
"\n",
"        def get_overall_length(cell_map):\n",
"            \"\"\"Get overall length of a notebook.\"\"\"\n",
"            overall_len = 0\n",
"            gap = 0\n",
"            for i ,(l,t) in enumerate(cell_map):\n",
"                #i is number of cells if that's useful too?\n",
"                overall_len = overall_len + l\n",
"            return overall_len\n",
"\n",
"        max_overall_len = 0\n",
"\n",
"        if isinstance(cell_map,dict):\n",
"            for k in cell_map:\n",
"                _overall_len = get_overall_length(cell_map[k])\n",
"                max_overall_len = _overall_len if _overall_len > max_overall_len else max_overall_len\n",
"        else:\n",
"            max_overall_len = get_overall_length(cell_map)\n",
"\n",
"        #Set the gap at 1% of the overall length\n",
"        return math.ceil(max_overall_len * 0.01)\n",
"\n",
"\n",
"    def plotter(cell_map, x, y, label='', header_gap = 0.2,\n",
"                linewidth = 5,\n",
"                orientation ='v', gap_colour = 'lightgrey'):\n",
"        \"\"\"Plot visualisation of gross cell structure for a single notebook.\"\"\"\n",
"\n",
"        if orientation =='v':\n",
"            plt.text(x, y, label)\n",
"            y = y + header_gap\n",
"        else:\n",
"            plt.text(y, x, label)\n",
"            x = x + header_gap\n",
"\n",
"        for _cell_map in cell_map:\n",
"            _y = y + gap if gap_colour else y\n",
"            __y = _y + _cell_map[0] + 1 #Make tiny cells slightly bigger\n",
"\n",
"            if orientation =='v':\n",
"                X = _X = __X = x\n",
"                Y = y\n",
"                _Y =_y\n",
"                __Y = __y\n",
"            else:\n",
"                X = y\n",
"                _X = _y\n",
"                __X = __y\n",
"                Y = _Y = __Y = x\n",
"\n",
"            #Add a coloured bar between cells\n",
"            if y > 0:\n",
"                if gap_colour:\n",
"                    plt.plot([X,_X],[Y,_Y], gap_colour, linewidth=linewidth)\n",
"\n",
"\n",
"            plt.plot([_X,__X], [_Y,__Y], _cell_map[1], linewidth=linewidth)\n",
"\n",
"            y = __y\n",
"\n",
"    x=0\n",
"    y=0\n",
"\n",
"    if isinstance(cell_map,list):\n",
"        gap = get_gap(cell_map) * gap_boost\n",
"        fig, ax = plt.subplots(figsize=(w, 1))\n",
"        plotter(cell_map, x, y, **kwargs)\n",
"    elif isinstance(cell_map,dict):\n",
"        gap = get_gap(cell_map) * gap_boost\n",
"        fig, ax = plt.subplots(figsize=(w,len(cell_map)))\n",
"        for k in cell_map:\n",
"            plotter(cell_map[k], x, y, k, **kwargs)\n",
"            x = x + 1\n",
"\n",
"    ax.axis('off')\n",
"    plt.gca().invert_yaxis()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now easily create a simple visualisation of the gross cell structure of the notebook:"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABGoAAABECAYAAADZXtNTAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAACRUlEQVR4nO3cIU7FQBRAUUoQeBxrQH4Nq0GzFgwLYg2sAffRuEFgECgyzVzScxbw8pppzc2k2xjjAgAAAID1LlcvAAAAAMA3oQYAAAAgQqgBAAAAiBBqAAAAACKEGgAAAIAIoQYAAAAgQqgBAAAAiBBqAAAAACKEGgAAAIAIoQYAAAAgQqgBAAAAiBBqAAAAACKEGgAAAIAIoQYAAAAgQqgBAAAAiBBqAAAAACKEGgAAAIAIoQYAAAAgQqgBAAAAiBBqAAAAACKEGgAAAIAIoQYAAAAgQqgBAAAAiBBqAAAAACKEGgAAAIAIoQYAAAAgQqgBAAAAiBBqAAAAACKEGgAAAIAIoQYAAAAgQqgBAAAAiLhavcDeHl8+xuod+N3T/fv0mc+vt9NnHpXzgb/Z49u5O39On0nX28319JneIY7gqN/OUZ/7v5h9Ps7mh4fTtnqFvbhRAwAAABAh1AAAAABECDUAAAAAEUINAAAAQMQ2hn/tAgAAABS4UQMAAAAQIdQAAAAARAg1AAAAABFCDQAAAECEUAMAAAAQIdQAAAAARAg1AAAAABFCDQAAAECEUAMAAAAQIdQAAAAARAg1AAAAABFCDQAAAECEUAMAAAAQIdQAAAAARAg1AAAAABFCDQAAAECEUAMAAAAQIdQAAAAARAg1AAAAABFCDQAAAECEUAMAAAAQIdQAAAAARAg1AAAAABFCDQAAAECEUAMAAAAQIdQAAAAARAg1AAAAABFCDQAAAECEUAMAAAAQIdQAAAAARHwBwKcfg5S0YWYAAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 1440x72 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"cell_mapping = report.apply(cell_attrib, axis=1).to_list()\n",
"nb_vis(cell_mapping, orientation='h')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also visualise multiple notebooks, labelling each with the notebook name and plotting them against the same length axis so that we can compare notebook sizes and structures directly."
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"nb_vis({'a':cell_mapping, 'b':cell_mapping[:3],\n",
" 'c':cell_mapping+cell_mapping, 'd':cell_mapping,}, orientation='h')"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def cell_attribs(cells, colour='cell_type', size='n_screen_lines'):\n",
"    return cells.apply(cell_attrib, axis=1, args=(colour,size)).to_list()\n",
"\n",
"zz = ddf.groupby(['filename'])[['cell_type', 'n_screen_lines']].apply(cell_attribs)\n",
"nb_vis(zz.to_dict(), orientation='h', gap_boost=1)\n",
"#[['n_total_code_lines','n_words','reading_time_mins', 'reading_time_s' ]].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also see how they look based on reading time."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x432 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"zz = ddf.groupby(['filename'])[['cell_type', 'reading_time_s']].apply(cell_attribs,'cell_type','reading_time_s')\n",
"nb_vis(zz.to_dict(), orientation='h', gap_boost=2)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualising Intra-Cell Structure\n",
"\n",
"Within a cell, we might want to distinguish, for example, paragraphs and code blocks in markdown cells; and comment lines, empty lines, code lines, magic lines / blocks and shell command lines in code cells.\n",
"\n",
"Supporting this level of detail may be tricky. A multi-column format is probably best, showing eg an approximate \"screen's worth\" of content in one column, then the next \"scroll\" down displayed in the next column along."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# BELOW HERE - NOTES AND TO DO"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100.0"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Maintainability index\n",
"from radon.metrics import mi_visit\n",
"\n",
"#If True, then count multiline strings as comment lines as well.\n",
"#This is not always safe because Python multiline strings are not always docstrings.\n",
"\n",
"multi = True\n",
"mi_visit(c,multi)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'\\nthe Halstead Volume\\nthe Cyclomatic Complexity\\nthe number of LLOC (Logical Lines of Code)\\nthe percent of lines of comment\\n'"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from radon.metrics import mi_parameters\n",
"mi_parameters(c, multi)\n",
"\n",
"\"\"\"\n",
"the Halstead Volume\n",
"the Cyclomatic Complexity\n",
"the number of LLOC (Logical Lines of Code)\n",
"the percent of lines of comment\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from radon.complexity import cc_visit\n",
"\n",
"#Doesn't like %% or % magic\n",
"cc_visit(c)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Halstead(total=HalsteadReport(h1=0, h2=0, N1=0, N2=0, vocabulary=0, length=0, calculated_length=0, volume=0, difficulty=0, effort=0, time=0.0, bugs=0.0), functions=[])"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from radon.metrics import h_visit\n",
"h_visit(c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checking Notebook Metrics Evolution Over Time\n",
"\n",
"The `wily` package uses `radon` to produce code quality reports across a git repository history and generate charts showing the evolution of metrics over the lifetime of a repository. This suggests various corollaries:\n",
"\n",
"- could we generate `wily` style measures over the recent history of a notebook code cell?\n",
"- could we generate `wily` style temporal measures over all the reports (markdown text, as well as code) generated from a notebook across several of its commits to a git repository?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Other Cell Analysis\n",
"\n",
"As a placeholder, should we also at least report on a count of cells that are not code or markdown cells?\n",
"\n",
"Also a count of empty cells?\n",
"\n",
"Is this moving towards some sort of notebook linter?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}