GA4GH Search - A path forward - 2019-05-08
(Original located at https://groups.google.com/a/ga4gh.org/forum/#!topic/ga4gh-discovery-search/8gjdZU0Atsc - Requires login)
I’ve been mulling over the events of the Hinxton conference, and also those from the past few months.
After discussions last week, I think I have come to realise that the current Search API specification, and the new proposal from Marc and Miro, are actually trying to solve two different problems. I appreciate it may appear on the surface that they are trying to solve the same problem, but I now believe they are not.
I feel the most sensible way forward is to form two different Search APIs, and I believe this is OK. Let me explain why.
- The current version of the Search API meets a number of driver projects’ requirements.
- We should fix the known outstanding issues required by the GA4GH SC for v1.0 resubmission.
- The current version of the Search API is 90% complete.
- It looks like the new proposal is trying to solve a different problem, and that’s OK.
- There’s no harm in having two standards when they solve distinctly different problems.
- Ewan noted during the Hinxton conference his preference to “keep it simple, release, and iterate to improve”.
- Changes, big or small, should be clearly backed by driver project requirements. I can’t see any driver project provenance for the new proposal put forward.
We must continue to complete the remaining work to re-submit the Search API specification.
We must also allow another Search API concept to be developed and validated without prohibiting the remaining Search API work.
The Search API as it stands today is the culmination of over a year of work, based on a year of prior work, and several years prior experience within the MME and the problems they faced.
During the last year, the Search team had engagement from driver projects, and demonstrated a 4 node / database federated search network built on the RC5 (release candidate 5) specification.
This included part of MME and part of EJP RD (Variant Matcher, and Café Variome and RD-Connect, respectively), in addition to the commercial partner Illumina.
A video of the presentation and link to slides are found on the GA4GH website under section 10 of the 6th plenary session: https://www.ga4gh.org/event/ga4gh-6th-plenary/
During that year, the team reached a multi-driver project general consensus as to how we wanted to handle query logic for the initial version of the Search API.
It was agreed that we would make it limited for the first version, in order to test the water carefully, establishing Search as something viable and useful.
This is documented at https://github.com/ga4gh-discovery/ga4gh-discovery-search/issues/9
This was not decided for the purpose of rushing a standard to its first version, but with the expectation that a simple first version would demonstrate the utility, allowing for iteration and development on a next version.
The year later proposal documented at https://github.com/ga4gh-discovery/ga4gh-discovery-search/issues/75 has been controversial.
This is not a suggestion to JUST change the query language; it also proposes a different approach to an API for discovering data. That is not to say I don’t think both ideas have value, but the new proposal is trying to solve a slightly different problem than the current Search API, and the nuances may not have been communicated clearly enough for this to be obvious.
As a note: in the Phenopackets session, Ewan appeared and, to paraphrase, said “If you’re wondering whether to keep it simple or not, keep it simple, release, and iterate to improve and add functionality”. I feel this likely applies to all GA4GH projects, including Search.
The Search API as it currently stands, is designed to allow for the federation of a query to all compliant nodes, regardless of the type of data they hold, only returning data that is potentially useful.
The search client, when making a federated request to servers, constructs the search query from multiple components and optional Boolean logic.
The search client may specify in their request that they are only interested in results where each returned record includes a specific type of data (for example phenotypes).
The servers receiving the query can either respond in error or with records containing as much or as little data as they are willing to include.
Where the server does not recognise a specific component version, or expects a newer one, it may use a backwards-compatible version if possible, and include this metadata in its response.
Additionally, where a server does not recognise a specific component, if the query logic allows, they may disregard that component, and still return potentially useful results.
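The graceful-degradation behaviour described above could be sketched roughly as follows. To be clear, the payload shape, component names, and logic values here are hypothetical illustrations of the idea, not the RC5 wire format:

```python
# Hypothetical sketch of component-based querying with graceful degradation.
# The payload shape and component names are illustrative, not the RC5 spec.

SUPPORTED_COMPONENTS = {"phenotype", "gene"}  # what this server understands

def degrade_query(query):
    """Drop unrecognised components when the Boolean logic allows it.

    With OR logic, an unknown component can be discarded and the rest of
    the query can still return potentially useful results; with AND logic,
    dropping a component would change the query's meaning, so we refuse.
    """
    known = [c for c in query["components"] if c["type"] in SUPPORTED_COMPONENTS]
    if len(known) == len(query["components"]):
        return query  # everything recognised; run as-is
    if query["logic"] == "OR" and known:
        return {"logic": "OR", "components": known}
    raise ValueError("cannot satisfy query without unsupported components")

query = {
    "logic": "OR",
    "components": [
        {"type": "phenotype", "id": "HP:0001250"},
        {"type": "imaging", "modality": "MRI"},  # unknown to this server
    ],
}

print(degrade_query(query))
# The unknown 'imaging' component is dropped; the phenotype component still runs.
```

The same sketch shows why the client does not need to know each server’s supported components up front: the server decides what it can honour.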
This approach was checked and backed by several driver projects, and I’m now also discussing with HCA, which feels like further validation of the requirements and functionality laid out.
Current Search API:
The release candidate for the Search API passed the GA4GH Product Review Committee, but fell down in the GA4GH Steering Committee session.
The feedback I received from specific individuals and feedback contained within the minutes from the meeting, showed that the specification itself was 90% of the way there.
There were criticisms regarding driver project engagement, documentation structure, and security.
I addressed each of these in my retrospective document put out as a result of the GA4GH SC meeting: https://github.com/ga4gh-discovery/ga4gh-discovery-search/blob/master/documents/v1.0.0-rc.5-retrospective.md
No concern was raised over the components framework and the request and response format, which includes the query logic. It’s minuted that Ewan defended that the framework met the current needs of at least one driver project. I don’t feel it makes sense to backtrack on the huge progress made by the team to reach this point, and then make a fundamental 11th hour change.
I agree we will need a more expressive query model to support more complex queries, but a basic initial level was agreed on for a first release by driver projects involved, as mentioned and evidenced earlier. (It is an AST which can be easily converted to SQL, as shown by the RC5 implementers.)
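To illustrate the point that a limited query AST converts straightforwardly to SQL, here is a minimal sketch; the node shapes and field names are hypothetical, not the RC5 component schema:

```python
# Hypothetical sketch: rendering a small Boolean query AST as a SQL WHERE
# condition. Node shapes and field names are invented for illustration.

def ast_to_sql(node):
    """Recursively render an AST node as a parameterised SQL condition."""
    if node["type"] == "comparison":
        return f"{node['field']} {node['op']} %s"  # value bound as a parameter
    # Boolean node: render children and join with the node's AND/OR operator
    parts = [ast_to_sql(child) for child in node["children"]]
    return "(" + f" {node['type']} ".join(parts) + ")"

ast = {
    "type": "OR",
    "children": [
        {"type": "comparison", "field": "phenotype_id", "op": "="},
        {
            "type": "AND",
            "children": [
                {"type": "comparison", "field": "gene_symbol", "op": "="},
                {"type": "comparison", "field": "zygosity", "op": "="},
            ],
        },
    ],
}

print(ast_to_sql(ast))
# (phenotype_id = %s OR (gene_symbol = %s AND zygosity = %s))
```

A real implementer would of course map this onto their ORM or query builder rather than building strings, but the translation itself is mechanical.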
The proposal put forward by Marc and Miro takes a different approach:
A client must make a request to a specific server they wish to query in order to obtain their supported fields.
The field information can then be used to construct an SQL query for execution on the server in order to return results.
The server may use underlying SQL distribution technology to query or join between different sources (this is not query federation).
(Please correct me if I’m wrong here, or if I’m missing something.)
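As I understand it, the flow would look roughly like the sketch below. The endpoint behaviour, field names, and catalog contents are entirely hypothetical, invented only to make the two-step shape concrete:

```python
# Hypothetical sketch of the two-step flow I understand the proposal to
# require: first ask each server which fields it supports, then build a
# SQL query against only those fields. All names here are invented.

def fetch_supported_fields(server):
    """Step 1: the client must ask the server for its fields up front.

    Stubbed with a static catalog; a real client would make an HTTP
    request to the server's field-listing endpoint.
    """
    catalog = {
        "https://node-a.example.org": ["subject_id", "phenotype_id"],
        "https://node-b.example.org": ["subject_id", "gene_symbol"],
    }
    return catalog[server]

def build_sql(fields, wanted):
    """Step 2: only fields the server advertises can appear in the query."""
    usable = [f for f in wanted if f in fields]
    if not usable:
        raise ValueError("server supports none of the requested fields")
    return f"SELECT {', '.join(usable)} FROM subjects"

server = "https://node-a.example.org"
sql = build_sql(fetch_supported_fields(server), ["subject_id", "gene_symbol"])
print(sql)  # node-a does not advertise gene_symbol, so the client must omit it
```

Note that in this shape the per-server field discovery happens before query construction, which is exactly why the same query cannot be broadcast unmodified to every node.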
What’s the provenance for this concept? Did it stem from a driver project requirement or some other means?
The key difference I see here is that with the current Search API you don’t need to know the components (and therefore fields) that each server supports up front, while the new proposal requires that you ask each server what fields it supports before you can query it. This makes a federated search query impossible.
If an SQL query is sent to a server which doesn’t recognise a table that forms part of a JOIN or possible condition, it will result in an error, whereas with the current Search API, this will gracefully degrade where possible.
There has also been discussion of using an SQL parser, but I’ve not been able to find a general-purpose PrestoSQL-to-AST parser beyond the official Java one. I’d be interested to see an implementation which uses such an SQL parser to transform the AST into an existing implementation’s ORM SQL AST, for use with an existing system’s database connection and security setup, which is a common requirement within MME.
I wasn’t sure what the response format was that you proposed. Would it be to return the rows from an SQL search as JSON?
Don’t get me wrong, I think Presto and your implementation are impressive, and can be massively useful in terms of enabling systems with multiple internal data sources to expose their data for discovery, but I don’t think that’s in a scalable federated fashion, and I think the barrier to entry would be too high for many involved. (I think the barrier can be lowered if you establish a limited set of SQL DSL keywords, but that’s a topic for another day.)
Value in two solutions:
Miro rightly pointed out in our discussions that having a standardised search method and format for singular resources, where federation isn’t required for such queries, still has value and utility!
This is why I’m proposing we consider creating two standards, “Federated Search” and “Power Search” (or some such).
Federated Search would allow a general query to be sent to many services, regardless of data types held, while Power Search would allow for targeted more powerful searching of specific databases or systems.
Given SchemaBlocks now only looks to take schemas from approved standards or those already in use by organisations, we will still need to collaborate on the schemas to be used by both.
It may very well be that as both solutions are developed and have releases, that we establish there is more in common than expected, and later major versions look for converging on utility, but I feel that’s likely to be a way off yet.
I can see the potential for later convergence of the two Search API specifications, if a limited set of SQL keywords and standardised AST structure can be agreed on.
I feel we do need to consider the pragmatic approach: the team that worked on the current version of the Search API is looking to move it forward to an initial version, allowing us to validate the concept and its utility, and to iterate on lessons learned from code in production.
If we continue to debate and discuss, looking for a perfectly harmonised solution, we take a non-pragmatic approach, trying to reach an unreachable level of perfection at the first try, and risk alienating those who have already put in the work, who may start to look to other projects they feel are more likely to deliver their requirements.
GA4GH looks to model some aspects of itself after the IETF, and their unofficial motto is “rough consensus and running code”.
I believe we should look to release and iterate based on the feedback from driver projects and organisations who have live code.