HalosGhost/static_typing_class.rst

    Static Typing

Though dynamic languages have become far more common in recent years (e.g., Python, Lua, PHP, JS), many static languages are still in use today, and learning how to work with them is invaluable (particularly in fields such as systems programming, where static languages are far more common).
When people say "dynamic typing", they usually mean that the programmer does not need to annotate a variable declaration with an explicit type.
That is, in Python for example, you simply declare a variable that can hold any type of data.
Put another way, in dynamic languages, variables do not have a type even though values might.
In most static languages [1], we do not have this shortcut.
Instead, every variable declaration must be adorned with a data type.
That is, both variables and values have an explicit type.
In practice this so-called "type annotation" limits the kind of data that can be stored in a variable (e.g., variables annotated as being integers can only store integers, not strings).
The primary benefits of this practice are speed, predictability and type safety [2].
There are many reasons to prefer static typing over dynamic typing, but the most obvious are data representation and input type checking.


[1] Some statically typed languages (like Rust and Haskell) have a feature called "type inference" which allows programmers to get away with not explicitly declaring the type of a variable. However, type inference does not change the fact that the variable still has a specific type.


[2] Telling the computer exactly how it should treat a specific variable means it doesn't have to guess (simplified, but fairly accurate) which provides a fair speed boost (see the following section). In addition to the computer having some speed benefit, the programmer knowing the exact type of the variable means you can design your program to function in more predictable ways!


Data Representation

Though languages like Python hide this from the programmer, every different type of data is represented in a different way on the machine itself.
At the bare level, all data are actually just ones and zeros, so we use those ones and zeros in different ways to best represent different types of data.
When you declare a variable with a type annotation, you tell the computer exactly how the data representation will function; in dynamic languages, the machine needs to do some very interesting things to try to figure out the best way of doing it (and it has to be able to completely change the representation it uses should the type of the data it holds change).
For example, at the bit level, an unsigned integer (one that can only be positive) is stored exactly how you might imagine:[3]
00000000 is 0 and 11111111 is 255 (0xff in hexadecimal notation).
However, floating point numbers have a drastically different representation, defined by the IEEE 754 standard [4].
In IEEE 754 single precision, 32 bits are divided into three parts: 1 bit for the sign of the number, 8 bits for an exponent, and 23 bits for a "significand" or "mantissa".
In addition to the actual values that this allows you to represent, there are various special cases.
For example, the bit pattern 0x7f800000 (01111111 10000000 00000000 00000000) read as an integer is just a large number (2139095040), but the same bit pattern read as a float represents +∞. [5]
Having different data representations allows us to cram as much data as we can manage into a small space, but it also means that different data types operate very differently.
Static typing allows us to use those representations explicitly rather than requiring the computer to do some magic to figure out what is what.


[3] Note that signed integers are stored in a different fashion (two's complement), and that this class is fundamentally incapable of adequately exploring machine level data representation. I only use these examples to offer some simple insight as background to aid in understanding the topic this class is meant to cover. For those interested in delving deeper into some of the things I gloss over in this tutorial, please take a look at Art of Assembly Language; it is a great intro that will offer you a much more in-depth look at how the bare-metal functions.


[4] See Wikipedia's article, the official standard (paywall) and this article (for further reading).


[5] Note that, in this example, I'm assuming your system is "Little Endian" (if you have an x86 processor, that's a pretty good guess). Endianness is definitely an important concept in machine-level representation, but it does not have a huge impact on typing as far as this class is concerned. Wikipedia has a decent article about endianness for further reading.


Input Type Checking

Let's say we write a function to add two arguments together (example below in JavaScript):
function add_two_integers (arg1, arg2) {
    return arg1 + arg2;
}
add_two_integers() will take any two parameters (not just integers) and add them together (even a string and a number).
If you were to pass "hello" and 4 to this function (i.e., add_two_integers("hello", 4)), you would find that it returns "hello4".
This is because JavaScript interprets the + operator as concatenation when one of the operands is a string.
While this has its uses, if you wanted to write a function that only operates on a particular type of data, you would need to add explicit checks to the code to determine the types of the operands and throw an error if they are wrong.
The problem is that, with dynamic languages, it's not that there are no types, it's really more that there is only one type. [6]
This makes type-checking incredibly difficult, so restricting the type of data after the fact is very hard to do.
In static languages, however, the type-checking is done for you (example below in C):
int add_two_integers (int arg1, int arg2) {
    return arg1 + arg2;
}
You'll notice that the length of this function is almost identical to the function in JavaScript (only the first line is a few characters longer).
However, for those extra few characters, we gain a few wonderful things:

This function will only operate on integers; if you pass a value of an incompatible type (a string, for example), the compiler will issue a diagnostic.
That int before the function name declares that this function must always return an integer (so we can use it with other functions and know exactly what the output will be). [7]

Note: There is also an obvious drawback to statically typed languages like this (it is the natural trade-off of type checking): writing generic functions (those which can operate on multiple types) can be much more difficult. [8]


[6] See this for an elaboration on what I mean.


[7] Nullable types cause some trouble with this deterministic view of C programming, but that catch is beyond the scope of this class to cover.


[8] This is definitely not always the case (see Haskell), but in the interest of full disclosure, I felt it was worth mentioning.


Types of Data


Primitive Data Types

Most static languages include some form of a few very common data types:

boolean
integers (often with divisions of types for higher bit-count and signedness)
character (often used as a special case of integer)
floating point numbers (typically divided into single-precision and double-precision, but often with higher precision available)

For the most part, other types of data in static languages will be aggregations of the above data types.
Luckily, many (if not all) static languages allow you to convert between data types with fairly little pain.
Some of these conversions are always safe (e.g., converting a smaller int into a larger int), but others might cause data loss or other issues (e.g., converting a larger int into a smaller int)—more on this later.

Aggregate Data Types

As the name implies, aggregate data types allow you to combine primitives into larger, more complex types.
The most widely known aggregate data type is probably the array.
An array is simply a list of values; and in static languages, when you declare an array, each item in the array must be of the same type (example below in C):
int ex_array [] = { 0, 1, 2, 3, 4, 5 };
If we were to try to include a 1.2 member in this array, intelligent compilers would emit a warning about mismatched types (and the value would be truncated to 1).
The reason for this restriction is actually quite simple: arrays in C (and many static languages) are stored on the machine as contiguous memory (i.e., there is no actual separation between the members); as a result, the simplest way for the compiler to get the next member in the array is to jump forward the amount of memory that each member occupies.
This makes arrays very fast, but it means that we can only use one data type in an array.
In fact, in C, there is no string data type, rather we just have arrays of characters (with an extra convention).
In C, there are also structs and unions which allow you to store sets of variables of different types together, but they have their own caveats, and they are a little beyond what we're covering today.
Another very common aggregate data type is an "enum" (or "enumerated" type).
Enums allow the programmer to do two things: first, define a custom set of values that a variable can hold; second, declare a large number of named constants without much boilerplate (repeated code).
For example, let's assume for a minute that C does not have a boolean type. [9]
If we were to create our own boolean data type, we could do it very simply with an enum:
enum boolean {
    FALSE, TRUE
};
With the above example in our code, we can now define a variable like so: enum boolean ex_bool = FALSE;
Essentially, the way this works is that each member of an enum is assigned a numerical value starting at 0.
So, FALSE == 0 and TRUE == 1.
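However, variables declared with the type enum boolean are meant to hold only the values 0 (FALSE) or 1 (TRUE).
This means that if we write a function that takes an enum boolean as an argument, and a programmer attempts to pass some other kind of value, the compiler has enough information to object (C itself is lenient here because its enums are integers underneath, so you may get at most a warning; stricter languages such as C++ reject the call outright).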
This allows for more fine-tuned type-checking and safety.
In addition to arrays, enums, structs and unions, many object-oriented languages add classes; while classes are in many ways just an aggregate data type, they have a lot more going on under the hood, and deserve a whole class devoted to them.


[9] Actually, because the smallest addressable unit in modern computers is a byte (typically 8 bits), a "true" boolean (a 1-bit wide field) is not a first-class data type in C. Rather, we have a _Bool (with stdbool.h providing the name bool as a macro for simplicity). Additionally, using bit-fields and structs, it is possible to actually create 1-bit wide booleans, but that approach has its own caveats.