@Stiivi Stiivi/cubes2.0-goals.md

Last active Aug 5, 2017
Cubes 2.0 Goals

Cubes 2.0

Hi there. After almost two years of little or no activity due to my life and career situation, I’m committing myself back to the Cubes project. It will take some time to ramp up, but we will eventually get there. I apologize for not meeting expectations lately and for letting the framework, mailing list and discussions go stale.

I got quite a lot of useful feedback and recommendations from users and people in the domain, and that revived my motivation to spend more of my spare time making Cubes a better and more modern OLAP toolkit.

Now, let’s move forward. Before any improvements or changes can be made, Cubes needs quite a lot of housekeeping, and the whole 2.0 release addresses that. Only when we have a consistent, well-defined interface, and when we have set goals and, equally importantly, non-goals, can we start growing Cubes again.

Links:

Objectives of the 2.0 Release

  • Maintainability
  • Type consistency
  • Correctness before feature richness
  • Transparency of the query generator and execution process
  • Extensions API clarification
  • Better decoupling of components
  • Preserved existing model compatibility if possible and sensible
  • Preserved existing HTTP interface compatibility
  • Multiple physical representations

Summary of 2.0 issues can be found here.

Maintainability and Contributions

I have been the major Cubes developer most of the time, and I admit I was not good at communicating ideas frequently, clearly or up front in time for discussion. I apologize for that. This resulted in the Cubes codebase being hard for other or new people to understand and maintain. Therefore, during the refactoring process, I will focus on making the codebase more understandable and more maintainable, and I will try to lower the barrier to contributing to the library wherever possible.

I already started to categorize issues based on size:

  • size-small – not many changes needed: either minor but repeated changes in multiple files or a small change within one file. Understanding of the broader context is not really necessary.
  • size-medium – might span files/modules and might require some refactoring; understanding a bit more than the changed piece of code is needed. Should be a change with quite well-defined boundaries.
  • size-large – deeper understanding of the library is needed, and the change is expected to affect a lot of modules or start a chain of dependent changes.

The small issues are a good way to get familiar with the code.

Other tags:

  • help-wanted – anyone with at least some knowledge of the library might be able to implement it, and I am willing to assist
  • easy – should be easy to implement, good for those who want to get familiar with the code

If you would like to contribute, but don’t know where to start, start with easy, size-small or help-wanted issues. Or you can also:

  • comment on existing issues with implementation proposals
  • challenge proposed implementations
  • propose project/module/class reorganization
  • write unit-tests
  • make sure that the documentation reflects code, add missing documentation or remove obsolete documentation

The best contribution is still a pull-request.

I will be available on the Cubes Gitter channel, mostly in the evenings or on weekends, PST (San Francisco) time zone.

Side-note: I understand that BI and Data Warehousing is quite a lucrative domain and people usually don’t spend their spare time contributing to open source. However, if your organisation or company uses Cubes, please let the community know. It helps keep the project going by keeping the people involved motivated.

Type Consistency and Correctness

Background: The Cubes code is approximately seven years old and has grown in complexity quite organically. Most of the time the focus was on feature richness and growth. This resulted in lots of compromises in the interface and in coupling out of convenience, such as mixing functional metadata with human-oriented metadata in the model, or wild argument types which can be anything from strings through tuples to complex dictionaries. The code became non-trivial to understand and navigate, which made it difficult to add new features and to maintain or debug existing ones. Some good advanced features, such as Periods-to-Date or semi-additive measure-dimension relationships, had to be removed as they had a negative impact on the rest of the code base.

Static type annotations and type checking are a very basic but powerful way to reveal potential inconsistencies and catch possible type mismatch errors. They also help us understand what is actually being passed around and see whether the design is optimal, has holes or can be improved.
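To illustrate the point, here is a minimal sketch of how loosely typed drilldown arguments could be normalized into one annotated type at the library boundary. `DrilldownItem`, `RawDrilldown` and `normalize_drilldown` are hypothetical names made up for this example; they are not part of the Cubes API, and the accepted input shapes are assumptions.

```python
from collections import abc
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class DrilldownItem:
    """One normalized drilldown entry (hypothetical type, not Cubes API)."""
    dimension: str
    level: Optional[str] = None


# The "wild" shapes a caller might pass in (assumed for illustration).
RawDrilldown = Union[str, Tuple[str, str], Mapping[str, str]]


def normalize_drilldown(items: Sequence[RawDrilldown]) -> Tuple[DrilldownItem, ...]:
    """Convert every accepted input shape into DrilldownItem objects."""
    result = []
    for item in items:
        if isinstance(item, str):
            # "date:month" carries a level, plain "date" does not
            dimension, _, level = item.partition(":")
            result.append(DrilldownItem(dimension, level or None))
        elif isinstance(item, tuple):
            dimension, level = item
            result.append(DrilldownItem(dimension, level))
        elif isinstance(item, abc.Mapping):
            result.append(DrilldownItem(item["dimension"], item.get("level")))
        else:
            raise TypeError(f"Unsupported drilldown item: {item!r}")
    return tuple(result)
```

Once the conversion happens in one place, everything behind it can be annotated with `Tuple[DrilldownItem, ...]`, and a static checker such as mypy can flag callers that still pass raw strings or dictionaries.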

The 2.0 release will be focusing mostly on type correctness. There are no radical feature changes planned, mostly refactoring of the existing ones.

Opened issues:

Query generator and execution transparency

Current state: The SQL query generator is a complex, tightly and weirdly coupled collection of interacting objects: Browser, StarSchema, QueryContext, Store etc. In the 1.x release the query generator was rewritten in an attempt to make it more straightforward; however, many wild or anonymous data types were introduced and it was still not properly decoupled from the rest of the library. One reason that was not mentioned before was the idea of having a reusable SQL denormalizer in the future that would reside outside of Cubes. That turned out not to be a great idea, because such a component would lack the rich metadata that Cubes already provides.

A secondary problem is the concept of the Browser itself. It looked like a great idea at the very beginning of the library, when it was the only object responsible for everything. It has since overgrown to a state where it is not clear which functionality should live in which object. Another problem with the Browser is that it tries to accept wild types of arguments (names, model objects, differently shaped drilldowns, etc.), whereas the backend query generator needs them as well-known, consistent data types. This was solved by having pairs such as aggregate() with backend-customizable arguments, prepare_aggregates() and provide_aggregate(), but no strict rules around those functions were proposed.

Cubes 2.0 should bring more transparency to the query generator and properly decouple the components involved in the process. Notable changes:

  • have a well-defined multi-dimensional query object that can be preserved, shared, published, etc.
  • have a ‘prepared query’ object that can be used for execution, inspection and storage of denormalized or aggregated queries generated by Cubes
  • provide a first-class interface for getting a compiled SQL query without execution, for further viewing, processing or executing by another system
  • have greater involvement of the Store for query preparation, execution and materialization

The above functionality will greatly extend the usability of the library as an embedded component that can generate multi-dimensional ROLAP or other queries and feed them to other systems.
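As a rough sketch of what a separate query object and prepared query might look like: the `Query` and `PreparedQuery` classes and the `prepare` function below are hypothetical and do not reflect the actual or planned Cubes interface. The sketch also assumes a single denormalized table named after the cube and trusted identifiers (no quoting), which a real generator could not rely on.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Query:
    """Immutable multi-dimensional query description (hypothetical)."""
    cube: str
    drilldown: Tuple[str, ...] = ()
    aggregates: Tuple[Tuple[str, str], ...] = ()  # (function, measure)
    cuts: Tuple[Tuple[str, str], ...] = ()        # (attribute, value)


@dataclass(frozen=True)
class PreparedQuery:
    """Compiled statement that can be inspected, stored or executed later."""
    sql: str
    parameters: Tuple[str, ...]


def prepare(query: Query) -> PreparedQuery:
    """Compile a Query to SQL text without executing anything."""
    columns = list(query.drilldown)
    columns += [
        f"{function.upper()}({measure}) AS {measure}_{function}"
        for function, measure in query.aggregates
    ]
    if not columns:
        # Nothing requested: fall back to a plain record count
        columns = ["COUNT(*) AS record_count"]
    sql = f"SELECT {', '.join(columns)} FROM {query.cube}"
    if query.cuts:
        # Cut values are passed as parameters, never inlined into the text
        sql += " WHERE " + " AND ".join(f"{attr} = ?" for attr, _ in query.cuts)
    if query.drilldown:
        sql += " GROUP BY " + ", ".join(query.drilldown)
    return PreparedQuery(sql, tuple(value for _, value in query.cuts))


example = prepare(Query(
    cube="sales",
    drilldown=("year",),
    aggregates=(("sum", "amount"),),
    cuts=(("region", "EU"),),
))
print(example.sql)
```

Because both objects are plain frozen dataclasses, they can be compared, logged or serialized, and the resulting `sql` and `parameters` can be handed to any DB-API connection that uses the `?` placeholder style, for example `sqlite3`.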

Proper separation of components will also make debugging and testing easier. Today preparing good unit tests for the browser is not trivial.

See also:

Extensions Interface Improvement

Cubes has been modular for a while. However, how the modules should be advertised and used was not well documented, nor was the interface given much attention from an outsider’s perspective. Most of the interfaces just followed internal Cubes needs.

One of the future objectives of Cubes is to serve as a multi-dimensional analytical dispatch system: a common interface for multi-dimensional interactions of data components for analytics. The extension interface must be well defined and must respect a wide variety of uses. This is not going to be delivered completely right from the beginning, but we need better governance of the interfaces.

What needs to be done:

  • Review the current extension/module discovery and loading system
  • Open the possibility of a custom extension/module loading system (some infrastructures have their own ways of deployment where traditional Python discovery might not work)
  • Consolidate the current extension types and provide a reasonable definition of the interfaces
  • Provide useful documentation
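One possible shape for a pluggable loading system is sketched below. This is only an illustration under assumptions: `ExtensionRegistry`, the `Loader` signature and the example extension names are hypothetical and are not the existing Cubes extension mechanism.

```python
from typing import Callable, Dict, Iterable, Tuple

# A loader is any callable yielding (extension_type, name, class) triples.
# Deployments with their own discovery rules can supply a custom loader.
Loader = Callable[[], Iterable[Tuple[str, str, type]]]


class ExtensionRegistry:
    """Keeps extensions keyed by (type, name); hypothetical, not Cubes API."""

    def __init__(self) -> None:
        self._extensions: Dict[Tuple[str, str], type] = {}

    def register(self, ext_type: str, name: str, cls: type) -> None:
        key = (ext_type, name)
        if key in self._extensions:
            raise ValueError(f"Extension {ext_type}/{name} is already registered")
        self._extensions[key] = cls

    def load_from(self, loader: Loader) -> None:
        """Register everything a loader discovers."""
        for ext_type, name, cls in loader():
            self.register(ext_type, name, cls)

    def get(self, ext_type: str, name: str) -> type:
        try:
            return self._extensions[(ext_type, name)]
        except KeyError:
            raise KeyError(f"Unknown extension {ext_type}/{name}") from None


class ExampleStore:
    """Stand-in for a store implementation."""


def static_loader() -> Iterable[Tuple[str, str, type]]:
    # A loader for environments where package-based discovery is unavailable
    return [("store", "example", ExampleStore)]


registry = ExtensionRegistry()
registry.load_from(static_loader)
```

Keeping discovery behind a callable means the default could be backed by standard Python packaging metadata while other infrastructures plug in their own loader, without the registry itself changing.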

See also:

Preserved existing model compatibility

The priority of this release is mostly housekeeping. We will try to preserve reading of the current models as much as possible; however, this cannot be 100% guaranteed where it would conflict with correctness. The same applies to the other side: we will try to preserve the HTTP interface with just minor changes, but only as long as this does not interfere with correctness.

Once type consistency of the core library is assured and query transparency is provided, we can change the model in more depth. The change is needed for multiple reasons. More details about model changes will come later.

Multiple Representations

Current state: each cube can have only one physical representation, and the representation can be either a snowflake or a denormalized table. The library can also generate aggregated tables, but can’t use them directly.

We need a way to specify multiple representations of a cube (snowflake, denormalized, aggregate …) and let the query engine use them based on some well-defined rules and configuration.

This will increase the usefulness of the library in multiple ways:

  • query performance (although still depends on the underlying system)
  • potential recurrent regeneration of aggregate tables for further processing by other systems
  • preparation of ad-hoc denormalizations for analytical purposes
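A minimal sketch of one conceivable selection rule follows: pick the cheapest representation that covers every attribute a query needs. The `Representation` class, its `cost` field and `choose_representation` are assumptions made for this example only; the actual rules and configuration are still to be designed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Representation:
    """One physical form of a cube (hypothetical model, not Cubes API)."""
    name: str
    kind: str                    # "snowflake", "denormalized" or "aggregate"
    attributes: FrozenSet[str]   # attributes this representation can serve
    cost: int                    # assumed relative query cost, lower is better


def choose_representation(
    representations: Iterable[Representation],
    required: Iterable[str],
) -> Representation:
    """Return the cheapest representation covering all required attributes."""
    needed = frozenset(required)
    candidates = [r for r in representations if needed <= r.attributes]
    if not candidates:
        raise LookupError(f"No representation covers {sorted(needed)}")
    return min(candidates, key=lambda r: r.cost)


representations = [
    Representation("sales_star", "snowflake",
                   frozenset({"year", "month", "region", "product"}), cost=100),
    Representation("sales_flat", "denormalized",
                   frozenset({"year", "month", "region", "product"}), cost=40),
    Representation("sales_by_year_region", "aggregate",
                   frozenset({"year", "region"}), cost=5),
]

coarse = choose_representation(representations, ["year", "region"])
detailed = choose_representation(representations, ["year", "product"])
```

In this example the coarse query is served from the aggregate table, while the query that needs `product` falls back to the denormalized table because the aggregate does not carry that attribute.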

See also:

Server as Extension

Current state: The Cubes server is bundled together with the core library and the slicer tool. The reason was mainly easier adoption of the toolkit, with as few packages to install as possible. This created an illusion that there is only one way of serving cubes and also didn’t promote development of other kinds of servers.

What needs to be done first is to make the server an extension type and allow server selection in the command-line tool. We would still provide a default server implementation within the library (based on Flask, called simply flask) that would have the reference interface.

This opens the possibility of having other kinds of servers for Cubes, not only HTTP/JSON, but also Protocol Buffers or Thrift.

Deprecated Backends

All existing store/browser backends except SQL are deprecated, unless someone wants to pick them up and maintain them for the 3.0 release.

Future

Cubes is heading towards a common interface for multi-dimensional interactions of data components for analytics. The purpose of the Cubes core library is to:

  • provide a multi-dimensional semantic layer
  • generate multi-dimensional queries with a focus on relational data stores
  • provide delegation mechanisms for computing/materializing/maintaining cube representations in a variety of data stores
  • serve as middleware between other data-producing or data-consuming systems

Non-goals:

  • have its own implementation of a data store for computed cubes
  • direct support of non-structured data stores – if one wants to address them, providing a relational projection on top of them will be needed

Cubes will remain technology agnostic, mostly because data technologies come and go.

Some changes are planned for the foreseeable future, after the huge house-cleaning:

  • Model improvements
    • Correct/improved nomenclature
    • Separation of descriptive metadata (labels, descriptions) from functional metadata (ones used to construct queries)
    • Improvement of handling of role-playing dimensions
    • Reusability of dimensions as levels of other dimensions
  • Advanced queries
    • Periods-to-Date (Thanks to Robin Thomas)
    • Non-additive measure-dimension combinations
  • Master Data Repository

Preliminary plans for future enhancements can be found here.

Links and Contact
