Table of contents

  • Overview
  • Rationale
  • Concepts
  • Backends are cooperative
  • Implementing backends and list of them worth interfacing to
  • Tooling
  • Summary
  • Changelog

Overview

This preliminary document describes a performant, declarative, stream-oriented language that gives Cuis Smalltalk access to the power of modern CPUs and GPUs without overly leaving its late-bound environment. The language focuses strongly on the readability of the written code, aiming to express many common algorithms in the most compact & meaningful manner possible. The "performant" part does not contradict the above; it is an orthogonal concern, solved by providing first-class tools to express a meaningful and a performant implementation of the same solution at the same time, and by letting the programmer be an (optional) mediator in the process of optimizing programs.

Rationale

Let me preface with my opinion that this is not the way to go to scale up current Smalltalk-based systems. This method takes a more Python-like approach: a late-bound, slow but interactive environment that is sped up by FFIng to native libraries (which end up not being as portable as the Cuis part of the system itself).

It completely ignores the need for a (best-effort) transparently distributed system (not just across OS threads but also across computers) of communicating obj... virtual computers.

Rather, it prepares the ground for writing highly interactive graphical tools & simulations that may one day be used to generate such a system.

Back on the main track, a list of reasons why:

  • The obsolete architecture of Smalltalk, written with particular goals in mind back in the 70s and 80s, making it difficult to use modern-day computing resources
  • A need for writing solution-oriented code that says what it does, rather than putting on the "optimization hat" from the get-go, so that code written with readability in mind still runs at an interactive rate.
  • Inspiration from problem-oriented/domain-specific languages (you can make your own, how about that?!) and the VPRI work
  • Access to powerful resources: OMeta, Cuis, free high-performance runtimes and libraries, hardware
  • An astonishment at how complicated and incomprehensible current software is
  • Contributing to Cuis Smalltalk and Computer Science in general
  • Pushing my skills to see what I'm able to do by actually doing it, being forced to write a bachelor thesis :P The title is set in stone (in theory), so no turning back for me, hah.

Concepts

All code examples are meant to show how the language generally looks; the syntax (not yet implemented) may be subject to change.

General

Stream oriented

The language is stream-oriented. All computation is about plumbing instances/objects through streams. Streams can be arbitrarily nested (e.g. you can process a stream of streams) and dynamically created.
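
As a rough Python analogy (generators standing in for streams; all names invented for illustration), a nested stream and a node consuming it might look like:

    # A stream of streams: the outer stream lazily yields inner streams.
    rows = ([x * y for x in range(3)] for y in range(3))

    # A downstream node that consumes the nested stream, reducing each inner one.
    row_sums = (sum(row) for row in rows)
    print(list(row_sums))  # [0, 3, 6]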

Recursion & looping are supported either via ordinary for loops or via the stream-back operator and referring to previous elements of a stream (e.g. with '). The stream-back operator is an original contribution by Dan Amelang in his Nile language, so I'm going to give him full credit for that (see the FONC archives; there was an email exchange about this).

This operator can pretty much be used to implement recursion and looping (in some ways even unifying them) in an intuitive way, by "pushing back" the computed result to the input stream rather than to the output one.

Possible use cases: ray marching/tracing, reduce/fold.
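
A minimal Python sketch of how a stream-back fold might behave (the helper names and the (output, stream-back) convention are invented; the actual operator semantics may differ):

    from collections import deque

    def run_with_streamback(node, inputs):
        # Drive a node whose step may push a result back onto its *input* stream.
        pending = deque(inputs)
        while pending:
            out, back = node(pending.popleft())
            if back is not None:
                pending.appendleft(back)  # re-enters the input stream
            if out is not None:
                yield out                 # goes to the output stream

    # Fold expressed via stream-back: re-queue the partial sum until input is exhausted.
    def sum_node(item):
        acc, rest = item
        if not rest:
            return acc, None                    # done: emit downstream
        return None, (acc + rest[0], rest[1:])  # push the partial result back

    print(list(run_with_streamback(sum_node, [(0, (1, 2, 3, 4))])))  # [10]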

Compute Node

The unit of computation is called a Compute Node. A CN unifies kernel and function execution semantics, i.e. a particular CN can act as either one depending on the context. CNs can be organized into pipelines - a sequence of nodes, where the output of one is fed into the input of the next.
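
To make the pipeline idea concrete, a rough Python analogy (generator functions standing in for CNs; all names invented):

    def scale(xs):
        for x in xs:
            yield x * 2.0

    def offset(xs):
        for x in xs:
            yield x + 1.0

    def pipeline(*nodes):
        # Compose nodes: the output stream of one feeds the input of the next.
        def run(stream):
            for node in nodes:
                stream = node(stream)
            return stream
        return run

    print(list(pipeline(scale, offset)([1.0, 2.0, 3.0])))  # [3.0, 5.0, 7.0]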

Concurrent streams

It's often the case that a single input is processed by different kernels, written to separate buffers, and finally combined into a result (e.g. https://en.wikipedia.org/wiki/Deferred_shading). The language provides a built-in way to express such a pattern via concurrent streams - streams are "plumbed" in lock-step and their items are zipped together. Pretty much a zip operation.

For example (originally uploaded as an image because GitHub refuses to preserve the MD formatting); a rough sketch of the same zip semantics follows.
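
A Python analogy of the lock-step combine (buffer contents invented for illustration):

    # Three buffers produced from the same input by different "kernels"...
    positions = [(0.0, 0.0), (1.0, 0.0)]
    normals   = [(0.0, 1.0), (0.0, 1.0)]
    albedo    = [(255, 0, 0), (0, 255, 0)]

    # ...consumed in lock-step and zipped together, as in the final combine
    # pass of deferred shading.
    for pos, n, color in zip(positions, normals, albedo):
        shaded = tuple(c * max(n[1], 0.0) for c in color)  # toy directional light
        print(pos, shaded)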

Late-bound

To provide seamless integration with Cuis, the language itself needs to be late-bound as well. It is then possible to use classes written on the Cuis side in the Lang and vice versa. This makes it possible to express any computation regardless of the object's class, as well as to provide different underlying implementation approaches (polymorphism) for buffer arrays, images, numbers, etc.

This also makes it possible to port some methods from Cuis, for better readability and possible performance improvements.

One may ask: how can one combine the rigidity of high-performance runtimes with the late-bound nature of Smalltalk? There are two approaches; both are going to be included in the language and both are interrelated:

  • Speculative execution & profiling
  • Programmer's custom heuristics inserted into optimization pipeline and CN specializations via protocols and predicate dispatch

Types

As mentioned previously, the language can - besides the types written in itself - use Cuis classes as first-class types. The language's own types are exposed in the Smalltalk browser as well.

I don't think there is any point in specifying that there are primitive types, struct types, etc., since they're already there.

However, due to the execution model of the main backends I'm considering (OpenCL, WebGPU), I think it is worth defining an "interface type" (not tied to any particular implementation) that generally describes an n-dimensional (where n is usually 1, 2 or 3) discrete or continuous space.

This allows us to express OpenCL ND work items (discrete) and the sampling of textures & fields (continuous, e.g. SDFs) in a natural and noise-free way.
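
A toy Python model of such an interface type (class and method names invented; the real thing would be a language-level protocol):

    from abc import ABC, abstractmethod

    class Space(ABC):
        # An n-dimensional discrete or continuous space, independent of implementation.
        @abstractmethod
        def sample(self, *coords): ...

    class WorkItems2D(Space):          # discrete: an OpenCL-style 2D ND-range
        def __init__(self, w, h): self.w, self.h = w, h
        def sample(self, x, y): return (x, y)  # integer work-item ids

    class CircleSDF(Space):            # continuous: a signed distance field
        def __init__(self, r): self.r = r
        def sample(self, x, y): return (x * x + y * y) ** 0.5 - self.r

    # Both are "2D spaces" to the language; only the sampling semantics differ.
    print(WorkItems2D(4, 4).sample(2, 3))   # (2, 3)
    print(CircleSDF(1.0).sample(0.5, 0.0))  # -0.5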

Optimization, protocols, predicate dispatch

Optimization is tough. Millions of dollars and hundreds of thousands of engineering hours have been poured into efficiently optimizing dynamic and static languages.

We could spend the rest of our lives devoted to designing a nice framework for optimizations and flexibly composing them, but I think in this case we have to stop being smart in the short term and appeal to biological-like processes.

Let me preface that I'm not an expert in this matter, nor probably ever going to reach even an intermediate level, which is why I'm taking the road below.

The approach I'm going to use relies on semi-randomness and the programmer's wisdom: essentially, a wide array of techniques (executed possibly in parallel) related to speculative optimization, guided by heuristics written by the programmer and inserted into the optimization pipeline.

Tagging along are the less "meta" techniques - Compute Node/Kernel specializations.

Another thing is that unless we write a custom CPU-based backend, the runtimes are going to do a lot of the heavy lifting for us - register allocation, code hoisting, vectorization, etc. So we have to focus on the aspects they can't do, like kernel fusion and resource management.

Heuristics & parallel solution searching

As mentioned above, the way to solve the problem is to launch, in parallel, thousands or even hundreds of thousands of instances of a single kernel, each with different heuristics applied in different configurations, then see which one is the fastest/uses the least memory, etc.

The testing can be performed either statically (at the programmer's request, via tools, precomputed for later) or dynamically by the JIT runtime.

A more advanced variant: the results are then fed back to the runtime and generalized into rules that are applied based on detected code patterns. Maybe employ some form of machine learning/symbolic solver?

Examples of safe heuristics:

  • Memory allocation: allocate aligned to 4/8 bytes
  • Use registers for most commonly used variables

Examples of speculative heuristics - pretty much what one can expect in modern JITs:

  • Inline code assuming variable is of type X

Examples of unsafe heuristics - they may completely change the semantics of the code, but may "accidentally" yield a correct result with better performance:

  • Swap two instructions
  • Code motion (random)
  • Randomly assign registers

In short: a lot of "guess and check".
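
A minimal Python sketch of the guess-and-check loop (the kernel variants and benchmark harness are invented for illustration; a real search would run on the target backend, in parallel):

    import timeit

    # Two semantically equal kernel variants with different "heuristics" applied.
    def sum_squares_naive(xs):
        total = 0
        for x in xs:
            total += x * x
        return total

    def sum_squares_builtin(xs):       # variant: lean on the runtime's fast paths
        return sum(x * x for x in xs)

    data = list(range(100_000))
    reference = sum_squares_naive(data)

    results = []
    for kernel in (sum_squares_naive, sum_squares_builtin):
        assert kernel(data) == reference  # discard variants that changed semantics
        cost = timeit.timeit(lambda: kernel(data), number=10)
        results.append((cost, kernel.__name__))

    print(min(results))  # keep the fastest correct variant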

Not sure how viable this approach is in general, but I'm going to find out!

Specialization

The language provides a way to overload kernels with different input parameter conditions, enabling multiple kernel specializations for semantically identical code. This is realized via protocols and predicate dispatch.

Protocols

A protocol (or an interface, or even a contract - an Eiffel term) is a named set of required selectors/methods that the target object has to understand. It allows not just specifying the required message names that must be understood, but also, via predicate dispatch, additional preconditions that the method's arguments have to satisfy.

Predicate dispatch

To provide even more flexibility, the language uses predicate dispatch - selecting/dispatching a method/node based not just on name or even type, but on arbitrary, programmer-specified conditions.

One instance where PD could be useful is controlling the allowed error range of operations. Not all optimizations are 1:1 - for example, floating point operations are not associative, and a particular specialization may (though algebraically correct) swap the order of operations, so the result is no longer bit-identical to the reference operation. PD could let the programmer specify how much discrepancy they tolerate across different implementations, e.g. via a global variable used in the predicate.
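
A toy Python model of predicate dispatch with such a tolerance knob (names and registry mechanics invented; in the language this would live in the runtime):

    EPSILON = 1e-5            # hypothetical global error-tolerance knob

    _table = []               # ordered: most specific specializations first

    def specialization(*predicates):
        # Register a variant guarded by arbitrary predicates.
        def register(fn):
            _table.append((predicates, fn))
            return fn
        return register

    def dispatch(*args):
        for predicates, fn in _table:          # ordered choice
            if all(p(*args) for p in predicates):
                return fn(*args)
        raise LookupError("no applicable specialization")

    @specialization(lambda a, b: isinstance(a, float) and isinstance(b, float),
                    lambda a, b: EPSILON <= 1e-5)
    def approx_add(a, b):
        return a + b          # stand-in for a reordered/vectorized variant

    @specialization(lambda a, b: True)
    def reference_add(a, b):
        return a + b          # reference implementation, always applicable

    print(dispatch(1.0, 2.0))  # picks approx_add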

Protocols and PD enable seamless integration between Cuis and the Language, since one can also define a Cuis class to be a protocol, e.g. Float. If it's used in the code, all arguments must implement Float's methods - either they are instances of Float (Small or Boxed), which provides a strong optimization hint, or they merely behave like a Float, keeping polymorphism in check.

Example

Let's take matrix multiplication as an example. The simplest way to implement and understand it is as the dot product of the i-th row of the LHS and the j-th column of the RHS. But for the optimizer it might not be easy - or may be outright impossible - to transform that into a more optimized form. This is where kernel specializations come into play.

First, let's define protocols (again formatting issues, sorry for that):

    //The float protocol - all arguments requiring this protocol must be either an instance of Float or be like it. 
    Float <: Cuis[`Float]
	
	//Or in-language definition - enumerate all selectors and their parameters protocol requirements
	Float:
		a + b :: Float, Float
		a - b :: Float, Float
		sin a :: Float
		...	
	
	//Or another interesting option: virtual protocols.
	//Take the Cuis's selectors then tweak from the perspective of the lang syntax to further define constraints.
	//Would require dedicated tooling
	
	//Named predicates: method-like
	//Another experimental feature - nameless keyword arguments - block like
	?MatrixDimensionCheck :a :b 
		[a columns size = b rows size]
	
	?MatrixSquare :m 
		[m rows size = m columns size]

	//The matrix protocol
	//A flexible and implementation-independent way of defining array-like types - just define whether they're like 1D, 2D etc. without fixing concrete types in place + specify convenient indexing accessors
	Matrix:
		layout: Array2D
		indices: m_{i0}{i1}

Now, "meaningful" and specialized implementations:

	 //A generic Compute Node - accepts matrix-like types whose indexable members are of any type. May signal DNU since there are no guards.
	 //Basic dimension check
     a * b :: Matrix, Matrix -> Matrix
	     ?MatrixDimensionCheck :a :b
	 //TODO: add nice, easy to understand mat multiply
 
	 //Specialized Compute Nodes - use predicate dispatching and protocols to specify conditions and types they're specialized for

	 //Specialized node for 4x4 double matrix multiplication
	 //The <> brackets may resemble generics from many popular languages. Unlike them, here it's simply syntactic sugar for checking whether every array item is of a particular type - fully late-bound/dynamic. 
	 //I decided to borrow the notation since it closely imitates the generic collections that are particularly common - List<T>, Dictionary<TKey, TValue> etc. - and I think it's the briefest and most familiar way to express the intention that all items are of the same kind.
	 //Check error tolerance via global key (`` is a shortcut for accessing global names instead via #at: or [])
	 //How is the approximation error determined? It's either relative to the Cuis backend or relative to the accurate solution for the given type precision. 
	 //The value of error is either computed empirically (run with different values along with edge case testing) or taken from documentation of a library.
	 a * b :: Matrix<Float>, Matrix<Float> -> Matrix<Float>
		 ?MatrixSquare :a
		 ?MatrixSquare :b
		 ?[``epsilon < 1E-5]
		 ?[``backends includes: 'plugin'] 
		 //use intrinsics/vector instructions - custom CPU backend
	 
	 
	 //Use BLAS library e.g. clBLAS
	 //Specify that the results are not just different from the reference implementation, but completely non-deterministic (e.g. due to scheduling)
	 a * b :: Matrix<Float> , Matrix<Float> -> Matrix<Float>
		 ?[``deterministic not]
		 ?[``backend = 'opencl']
	
	 m <- a rows size
	 n <- b columns size
	 k <- a columns size
	 //generate or dynamically extract method selectors e.g. from C header files
	 //the method is dynamically generated by the runtime
	 ``libraries clBLAS
		 order: 0
		 transA: false
		 transB: false
		 M: m
		 N: n
		 K: k
		 alpha: 1.0
		 A: a	//cl_mem
		 offA: 0
		 lda: k
		 B: b //cl_mem
		 offB: 0
		 ldb: n
		 beta: 0.0	//GEMM computes C = alpha*A*B + beta*C
		 C: c //cl_mem, output buffer
		 offC: 0
		 ldc: n
		 numCommandQueues: 1
		 commandQueues: 'Queue' //a special placeholder string indicating this is where the runtime should pass a CL command queue
		 numEventsInWaitList: 0
		 eventWaitList: 'EventWaitList' //same as above
		 events: 'Events' //same as above

Predicate dispatch uses ordered choice to test for the appropriate node; therefore, one should order them from the most specific to the most generic. This ensures all CNs are taken into account.

Summary

Protocols and predicate dispatch serve not just as a way of providing strong hints to the optimization machinery - guided by the programmer - but also (or even primarily) as a way to organize the code.

Protocols and PD enable a fine-grained specification of a method's requirements (essentially a contract) - if they were implemented in Cuis (which would be great to have), there would be no more ad-hoc aBoolean, anOrderedCollection, etc., and no more IF-ladder dispatching to sub-methods.

Backends are cooperative

Compute Nodes, or pipelines thereof, are run via backends. A backend is an environment that provides execution capabilities, along with its own rules about how code is run, guarantees about where it runs, accuracy, etc., as well as requiring backend-specific resources to be created.

I do not consider plain libraries (i.e. simple calls without a previously set-up context, etc.) worth calling backends. They are rather extensions to backends, available if the corresponding backend is also available.

Fully optimizing a whole pipeline might be impossible, so pipelines should be able to run in mixed mode - one stage uses the Cuis backend, another the GPU backend, etc. In other words, backends should be able to cooperate, by dynamically creating, exchanging and destroying backend-specific resources (another tough job for an optimizer).

Implementing backends and list of them worth interfacing to

In general, it seems like the vendors don't give a damn about developers and have decided to maintain a gazillion different variants of their compute platforms. The most widespread is SYCL-based (C++), but that's like using C++ as a scripting language.

OpenCL seems the most reasonable target as a tradeoff between dependencies, the effort needed to implement an interface, and how widespread it is. To debug the generated code, it seems one has to pull in really heavy tools (gdb etc.) from the relevant compute platforms and implement bindings to them, so I don't think that's going to happen anytime soon.

  • OpenCL - probably the most universal platform; most implementations support SPIR-V as the IR, which makes code generation easier.
    • + universal
    • + CPU & GPU support
    • ++ accepts a relatively simple (after a brief reading), fixed-width IR - SPIR-V
    • - feels like it's falling out of favor compared to vendor-specific APIs and SYCL
  • WebGPU Native - it seems to be the new API for both web and desktop, serving as a way to painlessly access graphics capabilities. The native version provides additional features not available (for now at least) in the draft spec. Link: https://github.com/gfx-rs/wgpu-native
    • + painless access to graphics rendering on the GPU using a modern API, without the drawbacks of being "very low-level" (memory allocation on the device, etc.)
    • + SPIR-V as the IR
    • + the upcoming API for GPU access on the web, so it's going to stay for a while
    • - dependency on a library
  • ISPC - an SPMD-paradigm compiler from Intel; allegedly generates high-quality code for x86, ARM and Intel GPUs. Supports multithreading (dependency on Intel TBB). Link: https://ispc.github.io/
    • + SPMD; the possibility of generating non-kernel, low-call-overhead code for the CPU
    • - generates object files - would require "surgery" to extract the assembly to run, or a compiler to produce a shared library
    • - - relatively big - around 120 MB in total, counting itself plus TBB as a required dependency.
  • Blend2D - a high-performance 2D vector graphics engine with multithreaded rendering. Link: https://blend2d.com/
    • + high performance; could be used for drawing a large number of objects in real time
    • - dependency on a library (though a small one, around 1.2 MB)
  • Custom Distributed - for the time being, a small backend for a simple and fun way to run distributed computations. NNG/MPI are potential candidates for coordinating messaging.
  • Custom CPU - it would be great to have a custom, minimal-size runtime for the CPU, tailored specifically to the language's execution semantics. However, one has to safely generate assembly code (some kind of universal, low-level IR would be needed) while retaining the barrier between Cuis and the native world - it shall never (gratuitously) crash. Then one has to write advanced optimizers (possibly even a JIT?), map the IR & vector instructions to platform-specific opcodes, etc. Of course, we could use help in the form of existing libraries like TBB for launching and scheduling multithreaded tasks - just don't go overboard and pull in LLVM :)

And then there are the native, platform-specific runtimes (CUDA & PTX, Vulkan, Metal, etc.), for which implementing backends would minimize dependencies but drastically increase the number of platforms one has to maintain.

And then - finally - there is the reference implementation written in Cuis, which serves as a way to write and run platform-independent code (the philosophy of Smalltalk-based systems), as the "single source of truth" for result correctness, and as the fallback platform when no native backend is available.

It's critically important to polish the Cuis one (readability-wise) as much as possible, since I consider it the "default" one - it represents the "meaning" of the code. The other platforms are really just there so you don't pull your hair out completely while waiting for results :).

Tooling

In-Cuis

Debugger

Step-by-step and tracing debugger - simultaneous view (like Nile viewer).

Runtime browser

A name encompassing tools for:

  • Status of all running pipelines and which backends they run on
  • Performed optimizations and info on why some optimizations failed - critically important in guiding the programmer as to where and how to change optimization stages and/or kernel specializations
  • Profiling info
  • Launching parallel searches and testing the different outcomes
  • Available runtimes
  • Other

Compute Node browser

A dedicated, separate, Smalltalk-based browser for writing Compute Nodes, with additional features making it easier to browse the code & integrate with Cuis.

External dependency browser

A helper for painlessly setting up all the necessary external dependencies - runtimes, profiling tools, etc.

AOT compilation browser

In order for the language to be useful to the outside world, I think it would be interesting to implement a compiler that statically resolves Compute Nodes and embeds them into a shared library (or generates a C source file) as a collection of ready-to-use modules. This way one can plug them into codebases that use C- or C++-like languages, or that can FFI into them.
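
A sketch of what consuming such an AOT-compiled module might look like from the outside, here via Python's ctypes (the library name, symbol and signature are entirely hypothetical):

    import ctypes

    # Hypothetical module produced by the AOT compilation browser.
    lib = ctypes.CDLL("./libcomputenodes.so")
    lib.matmul_f32.argtypes = [ctypes.POINTER(ctypes.c_float)] * 3 + [ctypes.c_int] * 3
    lib.matmul_f32.restype = None

    n = 2
    a = (ctypes.c_float * (n * n))(1, 2, 3, 4)
    b = (ctypes.c_float * (n * n))(5, 6, 7, 8)
    c = (ctypes.c_float * (n * n))()
    lib.matmul_f32(a, b, c, n, n, n)  # C = A * B
    print(list(c))  # [19.0, 22.0, 43.0, 50.0]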

Editor

External

Native libraries and runtimes

Not all runtimes are built into the OSes, so one has to set them up - hopefully without pulling entire platforms along with them.

Profiling tools

The programmer and the runtimes require feedback on how their optimizations affect execution performance on the native backends.

For data visualization, I think we can use their GUIs instead of writing our own. The gathered data, however, is going to be processed by us.

Considered ones: Nvidia NSight, Radeon profiler, Intel VTune, Intel Advisor.

Native debugging

Sooner or later one has to jump into the native world and see step-by-step where things have gone wrong instead of just guessing.

The amount of effort required to interface with the various gdb flavors might not be worth it, and simply dumping the assembly would be enough for the time being.

Validation tools

A SPIR-V validator to check whether the generated code adheres to all the validation rules.

Info gathering

Collecting information about runtime capabilities, libraries, etc. is either going to happen via external programs - clinfo, vulkaninfo, readelf, etc. - or via custom ones, depending on how difficult it is to parse & present the information in a meaningful way.
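
For instance, a minimal Python probe of one such external tool (assuming clinfo is installed and on $PATH):

    import shutil, subprocess

    # Ask clinfo for a short list of OpenCL platforms/devices, if available.
    if shutil.which("clinfo"):
        info = subprocess.run(["clinfo", "-l"], capture_output=True, text=True)
        print(info.stdout)
    else:
        print("clinfo not found - fall back to custom probing")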

External dependency resolving

Let me say that I absolutely despise setting up environment variables & messing with paths so that every tool knows where to find its input files, other tools, etc. It's just goddamn frustrating - why can't they find them automatically?! They're right there, just one directory above. Just don't be such a dumb, brain-dead command line tool and look there, geez.

I'm going to spare you the pain and give you the ability to semi-automatically resolve conflicts, download, and set up all external tools - be they optional or necessary - from the comfort of Cuis.

Devising a simple Smalltalk-based mini query & discovery DSL would help in this matter.

Example (steps, not code):

To set up the Blend2D library:
	Try to find it on the user's computer - use heuristics like env variables ($PATH, $LIBRARY_PATH etc. on Linux), look in */lib directories, etc. If everything else fails, let the user specify the path.
	If not found, try to download pre-built binaries from <here the user should put websites with already built .so or .dll libs>
	If not found, download and build from sources:
		Find or download CMake (again, using heuristics)
		Run shell commands - either a generic configuration for CMake or one specifically for building Blend2D. Configurations can be built by the community.
	Install in one of the Cuis subdirectories or a user-specified one.
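
A minimal Python sketch of the first discovery step (search paths and file patterns are illustrative heuristics only):

    import os
    from pathlib import Path

    def find_library(name, extra_dirs=()):
        # Heuristic discovery: env-derived paths first, then common lib directories.
        dirs = []
        for var in ("LIBRARY_PATH", "LD_LIBRARY_PATH", "PATH"):
            dirs += os.environ.get(var, "").split(os.pathsep)
        dirs += ["/usr/lib", "/usr/local/lib", *extra_dirs]
        for d in filter(None, dirs):
            p = Path(d)
            if not p.is_dir():
                continue
            for pattern in (f"lib{name}.so*", f"{name}.dll"):
                hits = sorted(p.glob(pattern))
                if hits:
                    return hits[0]
        return None  # next steps: download binaries, or build from source

    print(find_library("blend2d"))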

Summary

It's definitely going to be time-consuming to finish this project in at least some usable state before the thesis defense. I hope to continue working on it afterwards to finish it off.

A lot of the above is more of a wish list - things I either want to have as first-class citizens, or want to test on many different problems in slightly different ways than they're solved today.

I'm going to prioritize the user experience with the language & tooling, and put some effort into optimization, so as to be able at least to translate relatively high-level-looking source (recall that you can also write code using more primitive constructs like for loops, but that doesn't count) to native backends, even without backend-side optimization.

I'll also try to implement - in a very, very basic form - the distributed backend & AOT compilation for my thesis.

Well, time to work.

Thank you for reading!

Changelog

25.04.2024

  • Added table of contents
  • Added Types section
  • Added AOT compilation browser section
  • Fixed formatting
  • Stream oriented: Stream of processes -> Stream of streams

24.04.2024

  • Finished first version