yawaramin/doctool.md

## doctool.md

      
I propose here a new convention and design for a tool to auto-generate API documentation for source code in any language, with no dependencies or intricate build commands. Here are my assumptions:

- Source code is in plain text format.
- The language supports comments, either single- or multi-line (multi-line would be easier to use for documentation).
- There is a convention to interpret comments starting with a certain character sequence as documentation comments, e.g. for OCaml we would use `(**` to mark the start of documentation comments.

Given the above, we can create a tool that has the following UI:

```
doctool [--output-dir DIR] --start STARTSTR --end ENDSTR --lang LANGUAGE FILE...
```

E.g. usage:

```
$ doctool --output-dir doc --start '(**' --end '*)' --lang ocaml src/*.mli
```
Given the above, doctool will output documentation extracted from all the MLI files, in Markdown format, into the doc directory. The Markdown output will contain any Markdown-style markup (and HTML tags, of course) that was in the original documentation comments. It will also set up the output for correct syntax highlighting by simply passing along the --lang option argument to all code blocks. Note that doctool will not try to do any actual syntax highlighting.
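As a rough illustration of the command line above, here is a minimal Python sketch using the standard `argparse` module. The function name `build_parser`, the help strings, and the default output directory of `.` are my assumptions; the proposal only says that `--output-dir` is optional.

```python
import argparse


def build_parser():
    """Build a parser for the proposed doctool command line (sketch only)."""
    parser = argparse.ArgumentParser(
        prog="doctool",
        description="Extract doc comments from source files into Markdown.",
    )
    # Assumption: default to the current directory when --output-dir is omitted.
    parser.add_argument("--output-dir", default=".", metavar="DIR",
                        help="directory to write the Markdown files into")
    parser.add_argument("--start", required=True, metavar="STARTSTR",
                        help="string that opens a doc comment, e.g. '(**'")
    parser.add_argument("--end", required=True, metavar="ENDSTR",
                        help="string that closes a doc comment, e.g. '*)'")
    parser.add_argument("--lang", required=True, metavar="LANGUAGE",
                        help="language tag passed along to every code block")
    parser.add_argument("files", nargs="+", metavar="FILE",
                        help="source files to document")
    return parser


# The example invocation from above, after the shell has expanded src/*.mli.
args = build_parser().parse_args(
    ["--output-dir", "doc", "--start", "(**", "--end", "*)",
     "--lang", "ocaml", "src/networking.mli"]
)
print(args.output_dir, args.start, args.end, args.lang, args.files)
```

The file name `src/networking.mli` is only a stand-in for whatever the shell glob expands to.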
In fact, doctool won't try to do any actual lexing, verifying, typechecking, or (HTML) formatting itself. It will apply a very basic format to the Markdown output and rely on a simple convention to generate well-formed documentation. Hence, it won't need any other tooling: no compiler, lexer, parser, or intermediate files. It's a pure text-to-text transformation.
Here is the convention: Any doc comment that appears immediately before a block of code applies to that code block. The code block ends at either the next blank line, or the next doc comment, or the end of the file, whichever comes first. The doc comment will be output below the code block to which it applies, and the code block will be formatted as code. The first doc comment of the file will be taken to apply to the entire file. An empty doc comment will be ignored.
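To make the convention concrete, here is a minimal Python sketch of the extraction step, written as a pure text scan. The names `Item` and `extract` are hypothetical, and one detail is my own assumption: a doc comment that is not immediately followed by any code (other than the first one, which documents the file) is silently skipped.

```python
from dataclasses import dataclass
from itertools import takewhile


@dataclass
class Item:
    line: int  # 1-based line number where the code block starts
    code: str  # the code block, copied verbatim
    doc: str   # text of the doc comment that precedes it


def extract(source, start, end):
    """Split source text into a file-level doc and documented code blocks."""
    file_doc, items, pos = None, [], 0
    while True:
        open_at = source.find(start, pos)
        if open_at == -1:
            break
        close_at = source.find(end, open_at + len(start))
        if close_at == -1:
            break  # unterminated doc comment: stop rather than guess
        doc = source[open_at + len(start):close_at].strip()
        pos = close_at + len(end)
        if not doc:
            continue  # an empty doc comment only terminates the previous block
        if file_doc is None:
            file_doc = doc  # the first doc comment applies to the whole file
            continue
        # Candidate code: everything up to the next doc comment or end of file.
        next_open = source.find(start, pos)
        stop = len(source) if next_open == -1 else next_open
        lines = source[pos:stop].split("\n")
        line_no = source.count("\n", 0, pos) + 1
        if not lines[0].strip():
            # Nothing follows the comment on its own line; code starts below.
            lines, line_no = lines[1:], line_no + 1
        block = list(takewhile(str.strip, lines))  # stop at the first blank line
        if block:  # assumption: skip doc comments with no code right after them
            items.append(Item(line_no, "\n".join(block), doc))
    return file_doc, items


sample = """(** Networking utilities *)

(** Connection data *)
type connection =
(** Represent that we're not connected *)
| Disconnected
"""
file_doc, items = extract(sample, "(**", "*)")
for item in items:
    print(f"L{item.line}: {item.code!r} -> {item.doc}")
```

Note that the scan never looks at the code itself: it only searches for the start and end strings, which is why the same function works for any language.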
Here is a simple example. Say you have the following commented code:
```java
// Greeter.java
/** Greeting people. */

/** Provides a method to greet people. */
class Greeter {
  /** Returns a greeting given a name. */
  public static String greet(
    /** The name of the person to greet. Throws a null pointer exception if null. */
    String name) {/***/
    return "Hello, " + name + "!";
  }
}
```
Here is the proposed output:

Greeter.java

Greeting people.

Contents

- `class Greeter {` - Provides a method to greet people.
- `public static String greet(` - Returns a greeting given a name.
- `String name) {` - The name of the person to greet.

```java
class Greeter {
```

Provides a method to greet people.

```java
  public static String greet(
```

Returns a greeting given a name.

```java
    String name) {
```

The name of the person to greet. Throws a null pointer exception if null.

Here's an OCaml example:
```ocaml
(* networking.mli *)
(** Networking utilities *)

(** Connection data *)
type connection =
(** Capture information about the connection *)
| Connected of {
  (** The URI of the host we are connected to *)
  host : string;
  (** The port we are connected to. Default: 80 *)
  port : int option
}
(** Represent that we're not connected *)
| Disconnected
```
And the output (ignoring the header and TOC this time):


```ocaml
type connection =
```

Connection data

```ocaml
| Connected of {
```

Capture information about the connection

```ocaml
  host : string;
```

The URI of the host we are connected to

```ocaml
  port : int option
}
```

The port we are connected to. Default: 80

```ocaml
| Disconnected
```

Represent that we're not connected

A few things to note:

- We are going for a literate programming style of documentation: introduce some code, then explain it.
- We link to each documented item by giving it a unique ID. The ID is simply 'L' followed by the starting line number of the code block being documented. The code block is also a hyperlink to its exact anchor on the page.
- Code blocks will usually be short, but the documenter can control them easily. E.g., after the `String name` parameter above, we put an empty doc comment to stop it from showing the implementation. We could also have left a single blank line after that block.
- The generated Markdown is very simple. It relies entirely on a Markdown renderer, e.g. GitHub Pages or a wiki, to display the finished product.
- We can easily add a table of contents to the top of each generated page because we have a list of IDs to link to.
- I used a very recent OCaml syntax: records embedded in variant constructors. With doctool we don't need to keep up with changes to the language syntax.
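The notes about IDs and the table of contents can be sketched as a small rendering step in Python. Several details here are assumptions of mine rather than part of the proposal: the function names `render` and `first_sentence`, the heading levels, the use of an HTML `<a id="...">` tag as the anchor, and the rule that a contents entry shows only the first sentence of the doc comment.

```python
FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block


def first_sentence(doc):
    """Return the doc text up to and including its first full stop."""
    head, sep, _ = doc.partition(". ")
    return head + "." if sep else doc


def render(title, file_doc, items, lang):
    """Render (line, code, doc) triples as a Markdown page with a contents list."""
    out = [f"# {title}", "", file_doc, "", "## Contents", ""]
    for line, code, doc in items:
        label = code.splitlines()[0].strip()
        # The ID is 'L' followed by the starting line of the code block.
        out.append(f"- [`{label}`](#L{line}) - {first_sentence(doc)}")
    for line, code, doc in items:
        # Code first, then its explanation; --lang is passed along unchanged.
        out += ["", f'<a id="L{line}"></a>', "",
                f"{FENCE}{lang}", code, FENCE, "", doc]
    return "\n".join(out) + "\n"


items = [
    (5, "class Greeter {", "Provides a method to greet people."),
    (7, "  public static String greet(", "Returns a greeting given a name."),
    (9, "    String name) {",
     "The name of the person to greet. Throws a null pointer exception if null."),
]
page = render("Greeter.java", "Greeting people.", items, "java")
print(page)
```

The line numbers 5, 7 and 9 are where those code blocks start in the Java example above.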

I want to reiterate that this will be a very simple implementation, because it won't try to actually 'understand' the code--it will just do the minimum possible to generate nice-looking output. This has some consequences:

- We can't do inter-documentation links by symbol name, because we don't lex or parse any names at all. The documentation author can write the link markup themselves, though, because the link ID is generated by a very simple and predictable algorithm (as described above: just 'L' followed by the line number of the linked code block). Unfortunately, these links will break whenever the line numbers change, i.e. whenever the source files are edited. Doctool can probably help with this, with a little effort to track old link IDs and their corresponding new ones, and to auto-replace old IDs with new ones in the source code. Hence, I recommend checking the generated documentation into source control. This is a good idea in and of itself, because most repository viewer tools, like GitHub, auto-render Markdown documents, so anyone browsing your repo will immediately see nice documentation. It also lets people download your docs and render them themselves without needing doctool at all.
- We force documentation authors to adapt their line break style to conform to our requirement that each documented code block must have its own doc comment. Of course, if the author doesn't need an exact link to the item they are documenting, they can simply include its documentation in the previous doc comment.
- We force the user to adopt a new, non-standard tool.
- We rely entirely on Markdown renderers like GitHub etc. to display the actual documentation, so offline rendering will depend on tools like pandoc.
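The idea of tracking old link IDs and their new counterparts is not specified in any detail above, so the following Python sketch is only one possible approach. It assumes that the previous version of the source file is available, uses the standard `difflib` module to find lines that survived an edit unchanged, and rewrites `#L<n>` link targets accordingly. The names `line_map` and `relink` are hypothetical.

```python
import difflib
import re


def line_map(old_lines, new_lines):
    """Map 1-based old line numbers to new ones for lines that are unchanged."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    mapping = {}
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            mapping[a + k + 1] = b + k + 1
    return mapping


def relink(text, mapping):
    """Rewrite '#L<old>' link targets; IDs with no known new line are kept."""
    def swap(match):
        old = int(match.group(1))
        return f"#L{mapping.get(old, old)}"
    # A single pass, so a rewritten ID is never rewritten a second time.
    return re.sub(r"#L(\d+)\b", swap, text)


old = ["type t", "", "val f : t -> t"]
new = ["(* header *)", "type t", "", "val f : t -> t"]
mapping = line_map(old, new)
print(relink("See [f](#L3) and [t](#L1).", mapping))
```

A link to a code block whose first line was itself edited has no entry in the mapping and is left alone, so it would still need attention from the author.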

As you can see, the doctool convention is a simple one. More functionality can be added by post-processing its output for specialised markup such as @throws or @deprecated. But the core tool itself can be implemented in any language. It is language agnostic in both senses: it can document source code in any language, and it can itself be written in any language.
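As an illustration of such post-processing, here is a small Python sketch that rewrites `@throws` and `@deprecated` lines in a doc comment's text. The output markup (bold labels, inline code for the exception name) and the function name `expand_tags` are assumptions of mine; the proposal does not define what these tags should turn into.

```python
import re


def expand_tags(doc):
    """Turn '@throws' and '@deprecated' lines into emphasised Markdown."""
    # [ \t] rather than \s, so a match never swallows a neighbouring line break.
    doc = re.sub(r"^[ \t]*@throws[ \t]+(\S+)[ \t]*(.*)$",
                 r"**Throws** `\1`: \2", doc, flags=re.MULTILINE)
    doc = re.sub(r"^[ \t]*@deprecated[ \t]*(.*)$",
                 r"**Deprecated.** \1", doc, flags=re.MULTILINE)
    return doc


doc = (
    "Returns a greeting given a name.\n"
    "\n"
    "@throws NullPointerException if name is null\n"
    "@deprecated Use greetAll instead."
)
print(expand_tags(doc))
```

The names `NullPointerException` and `greetAll` are made up for the example. Because this step runs on doctool's output, the core tool stays unaware of any tag vocabulary.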