Case Study: Textual Data
Now it is time to crack on with our first case study 🙌.
Assume we have some data in textual format that we want to use as input to some ML model. To make it even more fun, instead of working with standard textual data, let's imagine we have some source code listings we want to process.
Our exemplar data will be two functions, written in our favourite programming language$^5$:
$^5$: Examples adapted from the Learning PyTorch with Examples tutorial.
>>> import torch
>>> # function 1
>>> def init_random_weights_tensor(D_in=1000, H=100, D_out=10):
...     """Initialise random weights for MLP as
...     PyTorch (Float) Tensors on CPU.
...
...     Parameters
...     ----------
...     D_in : int
...         Input dimension (default 1000)
...     H : int
...         Hidden dimension (default 100)
...     D_out : int
...         Output dimension (default 10)
...
...     Returns
...     -------
...     w1, w2: torch.Tensor
...         The two random weight tensors of size [D_in, H], and [H, D_out],
...         respectively.
...     """
...     dtype = torch.float
...     device = torch.device("cpu")
...     # Randomly initialize weights
...     w1 = torch.randn(D_in, H, device=device, dtype=dtype)
...     w2 = torch.randn(H, D_out, device=device, dtype=dtype)
...     return w1, w2
>>> import numpy as np
>>> # function 2
>>> def init_random_weights_array(D_in=1000, H=100, D_out=10):
...     """Initialise random weights for MLP as NumPy Array.
...
...     Parameters
...     ----------
...     D_in : int
...         Input dimension (default 1000)
...     H : int
...         Hidden dimension (default 100)
...     D_out : int
...         Output dimension (default 10)
...
...     Returns
...     -------
...     w1, w2: array_like
...         The two random weight matrices of size [D_in, H], and [H, D_out],
...         respectively.
...     """
...     # Randomly initialize weights
...     w1 = np.random.randn(D_in, H)
...     w2 = np.random.randn(H, D_out)
...     return w1, w2
At first glance, the code of these two functions looks quite similar. We want to extract and encode their textual information to turn it into features for an ML model.
In a more realistic scenario, we could repeat this process for all the functions in an entire source code repository, aiming at extracting meaningful insights on the software from the source code text.
This whole concept of Mining Software Repositories is indeed quite fun, and it is a research field in its own right: Machine Learning for Source Code.
1. Extract the Source Code Text
As you may expect, the first thing we need to do is to convert the source code into actual textual data for further processing. Once we have the text, we will work out a way to transform it into numerical features for ML.
To process those two function objects, we could leverage Python's amazing introspection features: the inspect module in the Standard Library allows us to examine live Python objects, and its getsourcelines function looks like exactly what we need to extract the source code text from the two functions!
>>> from inspect import getsourcelines
>>> numpy_fn_text, _ = getsourcelines(init_random_weights_array)
>>> torch_fn_text, _ = getsourcelines(init_random_weights_tensor)
>>> # let's have a quick look
>>> numpy_fn_text[:10]
['def init_random_weights_array(D_in=1000, H=100, D_out=10):\n',
 ' """Initialise random weights for MLP as NumPy Array.\n',
 ' \n',
 ' Parameters\n',
 ' ----------\n',
 ' D_in : int \n',
 ' Input dimension (default 1000)\n',
 ' H : int \n',
 ' Hidden dimension (default 100)\n',
 ' D_out : int \n']
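A side note on the return value: getsourcelines returns a pair, namely the list of source lines and the line number at which the object starts in its file, which is why the second element is discarded with _ above. Here is a minimal sketch of that, using textwrap.dedent from the Standard Library purely as a stand-in function (an assumption of this example, not part of the case study):

```python
from inspect import getsourcelines
from textwrap import dedent  # stand-in target: any pure-Python function would do

# getsourcelines returns (list_of_lines, starting_line_number)
lines, start_line = getsourcelines(dedent)

# Every element is one raw line of source, newline included
print(type(lines).__name__, type(start_line).__name__)
print(repr(lines[0]))
```

Each line keeps its original indentation and trailing newline, which is exactly the noise we will be filtering out next.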
2. Filtering textual data
As you can see, the original source code text carries a lot of information that we might want to get rid of, depending on our objective: newlines and tabs (\n, \t); non-alphanumeric symbols (e.g. ()); words in different cases (e.g. Array and array), just to mention the most obvious ones.
This is just a glimpse of what a standard NLP pipeline entails, and we will be using standard Python to build a simple text filtering function.
Our filtering function needs to apply the following operations, in sequence, to each code word:
- replace all punctuation characters with a blank (" ") in each word
- strip all leading and trailing spaces
- filter out any resulting empty string
- lowercase all terms.
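To make these steps concrete, here is a minimal sketch applied to a handful of words. The helper name clean_word and the sample words are hypothetical, and str.translate is just one possible way to blank out punctuation; it is not the implementation we will end up with.

```python
from string import punctuation

# Translation table mapping every punctuation character to a blank
PUNCT_TO_BLANK = str.maketrans(punctuation, " " * len(punctuation))


def clean_word(word):
    """Blank out punctuation, strip surrounding spaces, and lowercase one word."""
    return word.translate(PUNCT_TO_BLANK).strip().lower()


# Hypothetical sample words, similar to those found in our source lines
words = ["init_random_weights_array(D_in=1000,", "----------", "Array.\n"]

# Words made only of punctuation end up empty and are filtered out
cleaned = [w for w in map(clean_word, words) if w]
print(cleaned)
```

Note that a single input word may expand into several blank-separated terms once its punctuation is replaced.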
We will start from the source code lines, represented as a list of strings (as returned by getsourcelines), and we will return a single string corresponding to the filtered source code text. Each word will be separated by a blank space (" ").
There are many ways in which we could implement this function, and some of them are indeed very simple (and yet quite boring™️ 😅).
For example, we could join all the lines into a single string object, and implement our filter by using str.replace + str.lower. Absolutely nothing wrong with it!
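For reference, this simple (eager) approach could be sketched as follows. The function name eager_filter is hypothetical, and the final split/join used to normalise whitespace is an assumption of this sketch rather than a requirement stated above.

```python
from string import punctuation


def eager_filter(fn_lines):
    """Join all lines, blank out punctuation, lowercase, and normalise spaces."""
    # The whole text is materialised in memory as one string
    text = "".join(fn_lines)
    for char in punctuation:
        text = text.replace(char, " ")
    # split() with no arguments drops newlines, tabs, and repeated blanks
    return " ".join(text.lower().split())


# Hypothetical two-line function used only for illustration
sample = ["def foo(a, b):\n", "    return a + b\n"]
print(eager_filter(sample))
```

The entire source text lives in memory as a single string here, which is precisely the assumption questioned next.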
However, what if... (1) the data granularity changes (e.g. from functions to entire packages), thus increasing the required memory footprint?; (2) the data comes from an async I/O stream rather than residing as a whole in main memory?; (3) I need to interleave more specialised operations specific to source code text processing?...
In real-world cases, these are all very convincing reasons not to immediately fall for the simplest solution, with all the due caveats about what premature optimisation might imply.
For the task at hand, I could bring in so many reasons to try to convince you about the implementation I am about to present. The truth is: I wanted to make it fun, less obvious, and potentially useful to learn something new.
The filter function I have in mind has to be lazy, so that it can extensively leverage iterables$^6$ and the Python 3 lazy iteration protocol. Therefore, we will be composing our pipeline in a functional fashion, using a combination of map, filter, and itertools.chain.
The Further Reading Section contains more links and references on the subject, and how this relates to Data Science in the first place 🙃 .
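Before moving on, here is a minimal sketch of what lazy means in practice: composing map and chain.from_iterable only builds a recipe, and no work is done until a value is actually requested. The shout helper, the calls list, and the sample lines are hypothetical and exist only to make the evaluation order visible.

```python
from itertools import chain

calls = []  # records which words have actually been processed


def shout(word):
    """Uppercase a word, logging the call so we can observe when it runs."""
    calls.append(word)
    return word.upper()


lines = ["a b\n", "c\n"]

# Build the pipeline: split each line, flatten, then transform each word
words = chain.from_iterable(map(str.split, lines))
shouted = map(shout, words)

calls_before = list(calls)   # nothing has been computed yet
first = next(shouted)        # pulls exactly one word through the pipeline
calls_after_first = list(calls)
rest = list(shouted)         # consumes whatever is left

print(calls_before, first, calls_after_first, rest)
```

Only the first word is processed when the first item is requested, which is what keeps the memory footprint small regardless of how many lines flow through.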
Let's now implement our lazy_filter function.
$^6$: Please bear in mind that Iterable
>>> from itertools import chain
>>> from string import punctuation
>>> from typing import Sequence
>>> def lazy_filter(fn_lines: Sequence[str]) -> str:
...     # separate each word in each line
...     terms = map(lambda line: line.strip().split(), fn_lines)
...     # chain all the lines
...     terms = chain.from_iterable(terms)
...     # replace each punctuation character in each term with a blank
...     terms = map(lambda t: ''.join(map(lambda c: c if c not in punctuation else " ", t)), terms)
...     # transform to lowercase and strip surrounding spaces
...     terms = map(lambda t: t.lower().strip(), terms)
...     # filter out any resulting empty term (i.e. those made only of punctuation, e.g. "----")
...     terms = filter(lambda t: len(t.strip()) > 0, terms)
...     # join all the terms together, separated by a blank space
...     return ' '.join(terms)
>>> numpy_fn_text = lazy_filter(numpy_fn_text)
>>> torch_fn_text = lazy_filter(torch_fn_text)
>>> # let's now have a look at the result
>>> torch_fn_text
'def init random weights tensor d in 1000 h 100 d out 10 initialise random weights for mlp as pytorch float tensors on cpu parameters d in int input dimension default 1000 h int hidden dimension default 100 d out int output dimension default 10 returns w1 w2 torch tensor the two random weight tensors of size d in h and h d out respectively dtype torch float device torch device cpu randomly initialize weights w1 torch randn d in h device device dtype dtype w2 torch randn h d out device device dtype dtype return w1 w2'
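Since every stage of the pipeline is lazy, the same composition should also work when the lines come from a generator rather than a list, which is the streaming scenario mentioned earlier. The sketch below is a variation under that assumption: lazy_terms and line_stream are hypothetical names, and the function returns an iterator of terms instead of a joined string.

```python
from itertools import chain
from string import punctuation
from typing import Iterable, Iterator


def lazy_terms(lines: Iterable[str]) -> Iterator[str]:
    """Return an iterator of filtered terms from any iterable of lines."""
    words = chain.from_iterable(map(str.split, lines))
    blanked = map(
        lambda w: "".join(" " if c in punctuation else c for c in w), words
    )
    cleaned = map(lambda w: w.strip().lower(), blanked)
    # filter(None, ...) drops the empty strings left by punctuation-only words
    return filter(None, cleaned)


def line_stream():
    """Hypothetical stand-in for lines arriving from a file or a stream."""
    yield "def add(a, b):\n"
    yield "    # ----\n"
    yield "    return a + b\n"


terms = lazy_terms(line_stream())
first_two = [next(terms), next(terms)]  # only the first line has been read
remaining = " ".join(terms)             # consumes the rest of the stream

print(first_two, remaining)
```

Joining the terms is deferred to the caller here, so the decision about when to materialise the full text stays outside the pipeline.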
- Functional Programming and Data Science