wilzbach/the-expressive-c17-coding-challenge-in-d.md

## the-expressive-c17-coding-challenge-in-d.md

      
    Raw
  

              the-expressive-c17-coding-challenge-in-d.md
            
          
  title
  layout
  date
  categories
  
  
  The Expressive C++17 Coding Challenge in D
  post
  2018-02-13 08:48:57 -0800
  
  
  d
  tech
  
  
You might have seen that I have been coding a lot in D lately and as a few weeks ago there was the
Expressive C++17 Coding Challenge
with its solution in  C++ and Rust now being public, I thought this is an excellent opportunity to show why I like D so much.
The requirements

Let me first recap the requirements of this challenge:

This command line tool should accept the following arguments:

the filename of a CSV file,
the name of the column to overwrite in that file,
the string that will be used as a replacement for that column,
the filename where the output will be written.


./program <input.csv> <colum-name> <replacement-string> <output.csv>

Example input

Given this simple CSV as input
{% highlight csv %}
name,surname,city,country
Adam,Jones,Manchester,UK
Joe,Doe,Cracow,Poland
Michael,Smith,Paris,France
Alex,McNeil,Gdynia,Poland
{% endhighlight %}
the program called with:
./program input.csv city London output.csv

should write:
{% highlight csv %}
name,surname,city,country
Adam,Jones,London,UK
Joe,Doe,London,Poland
Michael,Smith,London,France
Alex,McNeil,London,Poland
{% endhighlight %}
Sounds fairly trivial, right?
Please have a short look first at the "best" C++ and Rust solutions before you look at the D solution.
The solution

Okay, so here's one way to solve this in D.
If you are scared at the moment - don't worry. I will explain it line by line below.
{% highlight d %}
#!/usr/bin/env rdmd
import std.algorithm, std.exception, std.format, std.range, std.stdio;
void main(string[] args) {
enforce(args.length == 5, "Invalid args\n" ~
"./tool <input.csv>   <output.csv>");
auto inFile = args[1], columName = args[2], replacement = args[3], outFile = args[4];
auto lines = File(inFile).byLine.map!(a => a.splitter(","));
auto colIndex = lines.front.countUntil(columName);
enforce(colIndex >= 0, "Invalid column. Valid columns: %(%s, %)".format(lines.front));
auto os = File(outFile, "w");
os.writefln("%(%s, %)", lines.front);
foreach (line; lines.dropOne) {
auto r = line
.enumerate // iterate with an (index, value) tuple
.map!(a => a.index == colIndex ? replacement : a.value)
.joiner(",");
os.writeln(r);
}
}
{%endhighlight%}
So how does this compare to C++17 and Rust?


Language
LoC
Time


C++
125
15s


Rust
83
6s


D
19
12s


D (slightly tweaked)
25
6s


In the latter article I will also present a solution which only uses 12 lines,
but it uses the built-in std.csv module and I think D doesn't even need to cheat.
I used the following D script to generate a simple CSV file with 10 fields and 10m lines:
rdmd --eval='10.iota.map!(a=> "field".text(a)).join(",")
.repeat(10_000_000).joiner("\n").writeln' > input_big.csv

The resulting input_big.csv has a size of 668M.

Aren't you concered that Rust is faster in this benchmark?

Not at all. The challenge was to write expressive code.
When performance really matters D provides the same tools as C or C++ and D even supports native interoperability with C and most of C++.
In this example, however, I/O is the bottleneck and D provides a few convenience features like using locked file handles, s.t. accessing files is thread-safe by default, or supporting unicode input.
However, it's easy to opt out of such productivity features and use other tricks like memory mapped files.
For the interested readers I have attached a slightly optimized version at the end.
In addition, if you are interested in performance,
Jon Degenhardt (member of eBay's data science team),
has made an excellent performance benchmark
between eBay's tsv-utils and existing CSV/TSV processing tools written in C, Go, and Rust.
1) What is #!/usr/bin/env rdmd?

One of the favorite aspects of D is that it has a blazingly fast compiler. Period. I can compile the entire D front-end of the compiler (~200 kLoC) in less than two seconds or the entire standard library with lots and lots of compile-time function evaluation and templates and > 300 kLoC in 5 seconds from scratch without any cache or incremental compilation.
This means that the compiler is almost as fast as an interpreter and rdmd is the tool that allows handy usage as "pseudo-interpreted" language. You can invoke rdmd with any file and it will automatically figure out all required files based on your dependencies and pass them to the compiler.
It's very popular in the D community because for small scripts one doesn't even notice that the program is compiled to real machine code under the hood.
Also if the shebang header is added and the file is executable, D scripts can be used as if they would be script files:
./main.d input.csv city London output.csv

2) So you import a bunch of libraries. What do they do?

import std.algorithm, std.exception, std.format, std.range, std.stdio;

In short std.stdio is for input and output, std.range is about D's magic streams called "ranges" and std.algorithm abstracts on top of them and provides generic interfaces for a lot of sophisticated algorithms.
Moreover, std.exception offers methods for working with exceptions like enforce and finally format bundles methods for string formatting.
Don't worry - the functionality imported from these modules will be explained soon.
3) Your program has a main function. What's so special about it compared to C or C++?

{% highlight d %}
void main(string[] args) {
…
{% endhighlight %}
For starters, arrays in D have a length. Try:
{% highlight d %}
args[5].writeln;
{% endhighlight %}
Compared to C/C++ null-terminated strings and arrays, it won't segfault.
It would just throw a nice Error:
core.exception.RangeError@./main.d(10): Range violation
----------------
??:? _d_arrayboundsp [0x43b622]
prog.d:9 void main.foo(immutable(char)[][]) [0x43ac93]
prog.d:4 _Dmain [0x43ac67]


Oh so D performs automatic bounds-checking before accessing the memory.
Isn't that expensive?

It's almost negligible compared to the safety it buys, but D is a language for everyone, so the people who want to squeeze out the last cycles of their processor can do so by simply compiling with -boundscheck=off (for obvious reasons this isn't recommended).
In D, strings are arrays too and there's another nice property about D's arrays. They are only a view on the actual memory and you don't copy the array, but just the view of the memory (in D it's called a slice).
Consider this example:
{% highlight d %}
int[] arr = [1, 2, 3];
auto bArr = arr[1 .. $];
bArr[] += 2; // this is a vectorized operation
arr.writeln; // [1, 4, 5]
{% endhighlight %}
There many other things D has learned from C and C++.
Walter has recently written a great article on how D helps to vanquish forever these bugs that blasted your kingdom which I highly recommend if you have a C/C++ background.
4) What's up with this enforce?

{% highlight d %}
enforce(args.length == 5, "Invalid args.\n" ~
"./tool <input.csv>   <output.csv>");
{% endhighlight %}

I have never seen the ~ operator before!

It's the string concatenation (or more general array concatenation) operator.
How often how you encountered code like a + b and needed to know the types
of a and b to know whether it's a addition or concatenation?

Why don't you use an if statement and terminate the program explicitly?

{% highlight d %}
if (args.length < 5) {
writeln("Invalid args.");
writeln("./tool <input.csv>   <output.csv>");
return 1;
}
{% endhighlight %}
That's valid D too. D allows a lot of different programming styles, but this article
is intended to highlight a few specific D styles like enforce.
enforce is a function defined in std.exception and throws an exception
if its first argument has a falsy value.

Hmm, I looked at the documentation and saw this monster. I thought it simply throws an exception?

{% highlight d %}
auto enforce(E : Throwable = Exception, T)(T value, lazy string msg = null, string file = FILE, size_t line = LINE)
{% endhighlight %}
I don't have the time to fully dive into D's syntax, but auto
instructs the compiler infer the return type for you. This leads to the interesting
Voldemort return types as they can't be named by the user, but
that's a good topic for another article.
The next part looks a bit complicated (E : Throwable = Exception, T), but don't worry yet.
It means that E is a template parameter which needs to inherit from Throwable (the root of all exceptions),
and is by default Exception. T is the template type of value.

Wait. I just instantiated a template without specifying its template parameters?

Yes, the D compiler does all the hard work for.
The technical term is Implicit Function-Template Instantiation (IFTI).
Of course, we could have instructed enforce to throw a custom exception,
but more on template instantiation later.

Alright. So this function takes a generic value and a msg, but a lazy string msg?

lazy is a special keyword in D and tells the compiler to defer
the evaluation of an argument expression until is actually needed.

I don't understand. msg seems to be a string concatentation of two strings.
Isn't this done before the enforce is called?

"Invalid args.\n" ~ "./tool <input.csv> <colum-name> <replacement-string> <output.csv>"

No, lazy is lazy and the string concatenation doesn't happen at the caller site,
but can be requested explicitly by the callee.
It gets a bit clearer if we look at the second enforce:
{% highlight d %}
enforce(colIndex < 0, "Invalid column name. Valid are: %(%s, %)".format(lines.front));
{% endhighlight %}
format and all the expensive work of formatting the error message is never done on the default path,
but only if an exception actually gets thrown.
Ignore the %(%s, %) formatting string for a bit, it will be explained soon.

Ok, but how does that work?

In short: the compiler does a few smart lowering for you and creates an anonymous lambda.
For more details, see this advanced article about D's lazy.

But there's more magic here. What's __FILE__ and __LINE__?

{% highlight d %}
string file = FILE, size_t line = LINE
{% endhighlight %}
Remember that D is a compiled language and accessing the stack isn't as easy as
asking the interpreter nicely.
These two default arguments are automatically set by the compiler
with the file and line number of the caller.
This is important for logging or throwing exceptions like we have done here.
So an API author can simply say "Hey, I would like to know the line number of my caller."
and doesn't depend on the user hacking the replacements like its down in C/C++
with the preprocessor macros:
{% highlight cpp %}
#ifdef SPDLOG_DEBUG_ON
#define SPDLOG_DEBUG(logger, ...) logger->debug(VA_ARGS)  << " (" << FILE << " #" << LINE <<")";
#else
#define SPDLOG_DEBUG(logger, ...)
#endif
{% endhighlight %}
In fact, D doesn't even have a preprocessor.
5) auto and a statically typed language

{% highlight d %}
auto inFile = args[1], columName = args[2], replacement = args[3], outFile = args[4];
{% endhighlight %}

Hmm, but what's auto? I thought D has a static type system?

Yes D is statically typed, but the compiler is pretty smart, so we can let him do all the hard for for us. auto is a filler word for the compiler that means "whatever the type of the assignment, use this as type of this variable".
6) What the hell is UFCS?

{% highlight d %}
auto lines = File(inFile).byLine.map!(a => a.splitter(","));
{% endhighlight %}
One of the major features of D is the Unified Function Call Syntax (UFCS). In short, the compiler will look up a function in the current namespace if it's not found as a member function of a type,
but let's go through this step by step.

I looked at the documentation of File and it has a method byLine. So where's the magic?

Have another look at map, it's located in std.algorithm.

Okay, wait. How does this work?

The compiler internally rewrites the expression File.byLine.map to the following:
{% highlight d %}
map(File.byLine());
{% endhighlight %}
Missing parenthesis are allowed too - after all the compiler knows that the symbol is a function.

Okay, but what's up with this !(a => a.splitter(",")))?

! is similar to C++/Java's <> and allows to instantiate a template.
In this case it's a lambda function of a => a.splitter(",").
Notice that for splitter UFCS is used again and your brain might be more used to reading splitter(a, ",") for now.
7) Ranges

Okay to recap, we have taken the input of a file by line, splitted every line by commas ,.

Wouldn't this result in a lot of unnecessary allocation?

The short answer is: D uses "iterators on steroids" which are lazy and work is only done when explicitly requested. Usually range algorithms don't even require any heap allocation as everything is done on the stack.
For example, in the next line .front returns the the first line though which countUntil explicitly iterates:
{% highlight d %}
auto colIndex = lines.front.countUntil(columnName);
{% endhighlight %}
So lines.front looks something like:
{% highlight d %}
["name", "surname", "city", "country"]
{% endhighlight %}
countUntil will return the of the first match or -1 otherwise.
It's a bit similar to indexOf function known from e.g. JavaScript, but it accepts a template.
So we could have supplied a custom predicate function:
{% highlight d %}
lines.front.countUntil!(a => a.endsWith("ty"));
{% endhighlight %}
8) std.format: and compile-time checking of parameters

The next lines are:
{% highlight d %}
enforce(colIndex >= 0, "Invalid column name. Valid are: %(%s, %)".format(lines.front));
auto os = File(outFile, "w");
os.writefln("%(s, %)", lines.front);
{% endhighlight %}

I have never seen writefln("%(s, %)"). What happens here?

writefln is just a handy wrapper around D's format function.
format itself provides a lot of options for serialization, but it's very similar to printf, although it does provide a few goodies like the special syntax for arrays %(s, %).
This syntax opens an array formatting "scope" by %( and closes it with %).
Within this array "scope" the elements should be formatted with s (their string serialization)
and use ,  a delimiter between the element.
It's a shorthand syntax that often comes in handy, but if you don't like it there are many other ways to achieve the same result. For example, joiner:
{% highlight d %}
lines.front.joiner(",").writeln;
{% endhighlight %}

How would such an error message look like?

object.Exception@./main.d(9): Invalid column name. Valid are: "name", "surname", "city", "country"
----------------
??:? pure @safe void std.exception.bailOut!(Exception).bailOut(immutable(char)[], ulong, const(char[])) [0x7a34b57e]
??:? pure @safe bool std.exception.enforce!(Exception, bool).enforce(bool, lazy const(char)[], immutable(char)[], ulong) [0x7a34b4f8]
??:? _Dmain [0x7a34b17f]


Okay, but isn't printf bad and unsafe? I heard that languages like Python are moving away from C-like formatting.

A Python library can only realize that arguments and formatted string don't fit when it's called.
In D the compiler knows the types of the arguments and if you pass the format string at compile-time, guess what, the format can be checked compile-time.
Try to compile a format string that tries to format strings as numbers:
{% highlight d %}
writefln!"%d"("foo");
{% endhighlight %}
The compiler will complain:
/dlang/dmd/linux/bin64/../../src/phobos/std/stdio.d(3876): Error: static assert  "Incorrect format specifier for range: %d"
onlineapp.d(4):        instantiated from here: writefln!("%d", string)


Wow, that's really cool. How does this work?

D has another unique feature: compile-time function evaluation (CTFE) that allows to execute almost any function at compile-time.
All that happens is that writefln is instantiated at compile-time with the string as template argument and then it calls the same format function that would normally be called at run-time with the known string.
The coolest part about this is that there's no special casing in the compiler and everything is just a few lines of library code.
9) Let's parse the file

Now that we have found the index of the replacement column, have opened the output csv file and have already written the header to it, all that's left is to go over the input CSV file line by line and replace the specific CSV column with the replacement:
{% highlight d %}
foreach (line; lines.dropOne) // remove the header
{
auto r = line
.enumerate // iterate with an (index, value) tuple
// and lazily map a different value for the specific CSV column
.map!(a => a.index == colIndex ? replacement : a.value),
.joiner(","); // join the lines back to a CSV format
os.writeln(r);
}
{% endhighlight %}
One of the cool parts of D ranges is that they are so flexible.
You want to do everything in a functional way?
D has you covered:
{% highlight d %}
alias csvPipe = pipe!(enumerate,
map!(a => a.index == colIndex ? replacement : a.value),
partial!(reverseArgs!joiner, "_"),
);
lines.dropOne.map!csvPipe.each!(a => os.writeln(a));
{% endhighlight %}
There's another cool thing about D - std.parallelism.
Have you ever been annoyed that a loop takes too long, but didn't know a quick way
to parallelize your code?
Again, D has you covered with .parallel:
{% highlight d %}
foreach (line; lines.parallel)
// expensive operation in parallel
{% endhighlight %}

No way. I don't believe this can be so simple.

Just try it yourself.
The Garbage Collector (GC)

On the internet and especially on reddit and HackerNews there's a huge criticism of D's decision to do use a GC. Go, Java, Ruby, JavaScript etc. all use a GC, but I can't better phrase it than Adam D. Ruppe:

D is a pragmatic language aimed toward writing fast code, fast. Garbage collection has proved to be a smashing success in the industry, providing productivity and memory-safety to programmers of all skill levels. D's GC implementation follows in the footsteps of industry giants without compromising expert's ability to tweak even further.

So ask your question:

Okay, "ability to tweak even further" sounds a bit vague, what does this mean? I can tweak the memory usage?

Well, of course you can do that, but that's something most languages with a GC allow you to do.
D allows you to get the benefit of both worlds, profit from the convenience of the GC and use manual allocation methods for the hot paths in your program. This is great, because you can use the same language for prototyping and shipping your application.
A short and simplified summary of allocation patterns in D:

malloc and friends are available in D (everything from C is)
RAII is supported (e.g. File you saw earlier is reference-counted and automatically deallocates its buffer and close the file once all references are dead)
there's std.experimental.allocator for everyone with custom allocation needs
std.typecons provides a lot of library goodies like Unique, Scoped, RefCounted for @nogc allocation.

Mike Parker has recently started an extensive GC Series on the DBlog which I recommend to everyone who prefers performance over convenience.
Other goodies

std.csv


Hey, I saw that there's std.csv in D, why didn't you use it?

Simple - it felt like cheating:
{% highlight d %}
import std.algorithm, std.csv, std.functional, std.file, std.range;
void main(string[] args)
{
auto inputFile = args[1], columnName = args[2], replacement = args[3], outputFile = args[4];
auto records = inputFile.readText.csvReader!(string[string])(null);
outputFile.write(records.map!((r) {
r[columnName] = replacement;
return r;
}).pipe!(rows => records.header.join(",") ~ "\n" ~
rows.map!(r => records.header.map!(h => r[h]).join(",")).join("\n")
));
}
{%endhighlight%}
std.getopt

One of the reasons why this challenge used positional arguments and no flags is that argument parsing is pretty hard in C++. It's not in D.
std.getopt provides convenience for everything out of the box:
{% highlight d %}
import std.getopt;
int main(string[] args)
{
string input, output, selectedColumn, fill = "FOO";
auto opts = getopt(args,
"i|input", &input,
"o|output", &output,
"s|select", "Select a column to overwrite", &selectedColumn,
"f|fill", "Overwrite (default: FOO)", &fill,
);
if (opts.helpWanted || input.length == 0) {
defaultGetoptPrinter("./program", opts.options);
return 1;
}
return 0;
}
{%endhighlight%}
DMD, LDC and GDC

One of the things that newcomers are often getting confused by is that D has three compilers.
The short summary is:

DMD (DigitalMars D compiler) - latest greatest features + fast compilation (= ideal for development)
LDC (uses the LLVM backend) - battle-tested LLVM backend + sophisticated optimizers + cross-compilation (=ideal for production)
GDC (uses the GCC backend) - similar points as LDC

Benchmark and performance

Benchmarking a language compiler is a bit tricky as very often you end up benchmarking library functions.
In general, D code can be as fast as C++ and often is even faster - after all
the LDC and GDC compilers have the same backend as clang++ or g++ with all its optimization logic.
If you are interested to see how D programs perform against similar programs written in other languages, checkout Kostya's benchmarks.
There's also an excellent performance benchmark
from Jon Degenhardt (member of eBay's data science team) on how eBay's tsv-utils
compare against existing CSV/TSV processing tools written in C, Go, and Rust.
@safe

Even though D is a system programming language that allows you to mess with pointers, raw memory and even inline assembly, it provides a sane way to deal with the dirty details.
D has a @safe subset of the language in which the compiler will enforce that you don't do anything stupid thing and shoot yourself in the feet with e.g. accessing undefined memory.
Unittest

One strategic advantage of D is that unit-testing is so easy as it's built-in in the language and compiler. This is a valid D program:
{% highlight d %}
unittest {
assert(1 == 2);
}
{% endhighlight %}
And with -unittest the compiler can be instructed to emit unittest block to the object files or binary.
Here, rdmd is again a friendly tool and you can directly go ahead and test your line with you this:
rdmd -main -unittest test.d

No advanced tooling setup required. Of course, this also means that it's particulary easy to automatically verify all examples that are listed in the documentation, because there part of the testsuite. I even went one step further and made it possible to directly edit and run the examples on dlang.org.
Other cool D features

There are many other cool features that D offers that didn't make it in this article, but as a teaser for future articles:

Code generation within the language (cut down your boilerplate)
Strong and easy Compile-Time introspection (Meta-programming)
alias this for subtyping
-betterC (using D without a runtime)
mixin for easily generating code
A module system that doesn't suck
debug attribute to break out of pure code
Built-in documentation
Contracts and invariants
scope(exit) and scope(failure) for structuring creation with its destruction
Native interfacing with C (and most of C++)
with for loading symbols into the current name

For a full list, see the Overview of D and don't forget that the full language specification is readable in one evening.
Downsides


Okay, so you say D is so great, but why hasn't it taken off?

There's a lot more to a programming language than just the language and compiler. D has to fight with the problems all young languages have to deal with e.g. small ecosystem, few tutorials / sparse documentation and occasional rough edges.
Languages like Kotlin, Rust or Go have it a lot easier, because they have a big corporate sponsor
which gives these language a big boost.
Without such a boost, it's a chicken/egg problem: if nobody is learning D, it also means that no one can write tutorials or better documentation.
Also many people have learnt a few languages and use them in production. There's little incentive for them to redesign their entire stack.
However, things improved greatly over the last years and nowadays even companies like Netflix, eBay, or Remedy Games use D. A few examples:

the fastest parallel file system for High Performance Computing is written in D
if you drive by train in Europe, chances are good that you were guided by D (Funkwerk - the company that manages the transport passenger information system - develops their software in D)
if you don't use an Adblocker, chances are good that algorithms written in D bid in real-time for showing you advertisement (two of the leading companies in digital advertising (Sociomantic and Adroll) use D)

The organizations using D page lists more of these success stories.
Of course, D - like every other language - has its "ugly" parts, but there's always work in progress to fix these and compared to all other languages I have worked with,
the ugly parts are relatively tiny.
Where to go from here?


Okay that sounds great, but how do I install D on my system?

Use the install script:
curl https://dlang.org/install.sh | bash -s

or use your package manager.
And start hacking!
Acknowledgements

Thanks a lot to Timothee Cour, Juan Miguel Cejuela, Jon Degenhardt, Lio Lunesu, Mike Franklin, Simen Kjærås, Arredondo, Martin Tschierschke, Nicholas Wilson, Arun Chandrasekaran, jmh530, Dukc, and ketmar for their helpful feedback.
Attachements

It's possible to do three easy tweaks to make I/O faster in D:

disabling auto-decoding with byCodeUnit
non-thread-safe I/O with lockingTextWriter
use of std.mmfile

{% highlight d %}
#!/usr/bin/env rdmd
import std.algorithm, std.exception, std.format, std.mmfile,
std.range, std.stdio, std.utf;
void main(string[] args) {
enforce(args.length == 5, "Invalid args\n" ~
"./tool <input.csv>   <output.csv>");
auto inFile = args[1], columName = args[2],
replacement = args[3].byCodeUnit, outFile = args[4];
scope mmFile = new MmFile(args[1]);
auto lines = (cast(string) mmFile[0..mmFile.length])
.splitter('\n').map!(a => a.byCodeUnit.splitter(","));
auto colIndex = lines.front.countUntil!equal(columName);
enforce(colIndex >= 0, "Invalid column. Valid columns: %(%s, %)".format(lines.front));
auto os = File(outFile, "w");
os.writefln("%(%s, %)", lines.front);
auto w = os.lockingTextWriter;
foreach (line; lines.dropOne) {
auto r = line
.enumerate // iterate with an (index, value) tuple
.map!(a => choose(a.index == colIndex, replacement, a.value))
.joiner(",".byCodeUnit);
w.put(r);
w.put('\n');
}
}
{%endhighlight%}