Google Summer of Code 2019: Project Report

Alert Redistribution System for Fink: an Apache Spark-based broker for Astronomy

Background

Fink is an Apache Spark-based broker infrastructure that enables various applications and services based on alerts issued by telescopes around the world. It uses Spark's Structured Streaming API to receive and process the voluminous data streams coming from telescopes. Among other references, Fink follows the guidelines of the LSST LDM-612 document in setting the objectives of its broker infrastructure. One such objective is to redistribute the alert packets to the community; the developments in this GSoC project were done to fulfill this objective.
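
To give an idea of the setup, here is a minimal sketch of how a Structured Streaming job can consume an alert stream from Kafka. It assumes the spark-sql-kafka package is on the classpath; the topic name and bootstrap servers are placeholders, not Fink's actual configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fink-like-stream").getOrCreate()

# Read raw alert packets from a Kafka topic as a streaming DataFrame.
# "localhost:9092" and "ztf-alerts" are placeholder values.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "ztf-alerts")
      .load())

# Each row carries the binary alert packet in the `value` column;
# downstream processing would deserialize and analyze it.
query = (df.writeStream
         .format("console")
         .option("truncate", False)
         .start())
query.awaitTermination()
```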

Project Description

The proposal focuses on the development of an Alert Distribution System for Fink, based on Apache Kafka and Spark's Structured Streaming. It consists of a secured Kafka Cluster to manage the clients and a Spark-based producer that detects new alerts of interest and sends them out to the community via Kafka streams.

Project Development

All the development carried out for the project was done via Pull Requests on two GitHub repositories, fink-broker and fink-client, as well as the repository that hosts Fink's documentation.

Hence, the milestones achieved during the project are easiest to follow via these PRs. Each PR is linked to an issue in its repository, which states the proposal clearly and prepares future work via concrete action items. Below is the work carried out during the GSoC project:

  • PR #182 | Setting up a basic Alert Distribution | [Status: Merged]

    • adding a basic alert distribution in which HBase is read in a Spark process and alerts are sent as Avro messages over Kafka (a sketch of this producer pattern follows this list)
    • also adding a shell script to manage the Kafka Cluster
    • enabling Fink's distribution service (fink start distribution) and a testing service (fink start distribution_test)
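
As an illustration of the producer side, the sketch below shows how a Spark DataFrame of alerts might be serialized to Avro and published to Kafka. It assumes Spark 3.x with the org.apache.spark:spark-avro and spark-sql-kafka packages available; the column names, topic, and servers are placeholders, not Fink's actual schema or deployment.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct
from pyspark.sql.avro.functions import to_avro

spark = SparkSession.builder.appName("alert-producer").getOrCreate()

# Placeholder alerts; in Fink these rows would come from the Science DataBase.
alerts = spark.createDataFrame(
    [("ZTF19abcdef", 19.2), ("ZTF19ghijkl", 18.7)],
    ["objectId", "magpsf"],
)

# Serialize each row into a single Avro-encoded `value` column and publish it.
(alerts
 .select(to_avro(struct("objectId", "magpsf")).alias("value"))
 .write
 .format("kafka")
 .option("kafka.bootstrap.servers", "localhost:9092")
 .option("topic", "fink-distribution")
 .save())
```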

  • A Problem:

    The work until this point scanned HBase (which acts as the Science DataBase in Fink) and produced alerts. However, the alerts need to be sent out as soon as they are received, in real time. While the other services in Fink are based on Spark Streaming, as of this writing HBase cannot yet be used natively as a Spark Streaming source.

    Solution: There are a number of ways to solve the problem of scanning HBase in real time.

    • Using the HBase-Kafka Connector: this uses a proxy peer that sends Kafka messages corresponding to replication events on an HBase RegionServer. This approach led to a few problems. First, each update of an HBase record produces a large number of Kafka messages (one per column qualifier of each Column Family). Second, there were errors decoding some data types from the resulting Avro messages. While the former could be solved with Spark SQL functionality to reassemble the original record, the latter could not be solved within the time allocated for this project.

    • Timestamp-based HBase scan: HBase provides functionality to scan records between two timestamps. This can be used to scan the Science DataBase over successive time intervals to obtain near real-time streaming of alerts.

Considering the time in hand, the second option (timestamp-based scans) was chosen, as it fulfills all the current requirements, i.e. streaming alerts every few seconds and thus providing a near real-time solution given the complexity of the pipeline. A sketch of the polling pattern is shown below.
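
The following is a minimal sketch of the successive-interval idea, not Fink's actual implementation: a loop repeatedly scans the database between the last-seen timestamp and now. The scan_science_db helper is hypothetical and stands in for a timestamp-bounded HBase scan (e.g. Scan.setTimeRange in the HBase Java API).

```python
import time

def scan_science_db(min_ts, max_ts):
    """Hypothetical helper: return all records written to the
    Science DataBase between min_ts and max_ts (milliseconds).
    A real implementation would issue an HBase scan bounded by
    these timestamps."""
    return []

poll_interval = 5.0  # seconds between successive scans
last_ts = int(time.time() * 1000)

while True:
    now_ts = int(time.time() * 1000)
    # Only records that arrived since the previous scan are returned,
    # giving near real-time streaming out of a batch-oriented store.
    for alert in scan_science_db(last_ts, now_ts):
        print(alert)  # stand-in for the Kafka producer call
    last_ts = now_ts
    time.sleep(poll_interval)
```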

  • PR #201 | Timestamp-based HBase scan and streaming | [Status: Merged]

    • adding real-time streaming achieved by timestamp-based HBase scans
    • allowing streaming of alerts between given timestamps
  • PR #206 | Filtering of Alerts before distribution | [Status: Merged]

    • To cut down the processing load on Fink's other processes, streaming only a subset of the alert data can help. This PR adds an XML-based filtering mechanism that provides an easy way (via an XML file) to select columns of interest and apply simple filtering rules.
    • While the above approach has limitations from a scientific perspective, it can be useful in a few use cases.
    • PR #216: Fink users are traditionally Python users. Hence, following the much more powerful and elegant pythonic way of adding generic processing and filters developed by my mentor (Julien Peloton) in PR #209, this PR adds a similar function for alert distribution (a sketch of such a filter follows this list).
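
As an illustration of the pythonic approach, the sketch below shows what a user-defined distribution filter could look like, assuming filters are expressed as pandas UDFs returning a boolean mask (Spark 3.x with pyarrow installed); the column name and threshold are placeholders.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BooleanType

@pandas_udf(BooleanType())
def bright_alerts(magpsf: pd.Series) -> pd.Series:
    """Keep only alerts brighter than magnitude 18.
    `magpsf` is a placeholder column name."""
    return magpsf < 18.0

# Applied to the alert DataFrame before distribution, e.g.:
# filtered = df.filter(bright_alerts(df["magpsf"]))
```
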
  • PR #231 | Authentication and Authorization on Kafka Cluster | [Status: Merged]

    • To enable control over the number of clients on the Kafka Cluster and to prevent unauthorized access to Fink's servers, it is important to add a security layer. This PR:
    • adds an authentication layer to the Kafka Cluster
    • adds authorization of clients on the Kafka Cluster (a client-side sketch follows this list)
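
On the client side, connecting to such a secured cluster means supplying credentials in the consumer configuration. The sketch below uses the confluent-kafka Python package; the SASL mechanism, topic, servers, and credentials are placeholders and do not reflect Fink's actual deployment.

```python
from confluent_kafka import Consumer

# Placeholder credentials; a real client would receive these
# from the Fink administrators.
conf = {
    "bootstrap.servers": "fink-server:9093",
    "group.id": "my-group",
    "security.protocol": "SASL_PLAINTEXT",
    "sasl.mechanism": "SCRAM-SHA-512",
    "sasl.username": "alice",
    "sasl.password": "alice-secret",
}

consumer = Consumer(conf)
consumer.subscribe(["fink-distribution"])  # placeholder topic

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print("received {} bytes".format(len(msg.value())))
consumer.close()
```
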
  • PR #249 | Integration with Slack | [Status: Merged]

    • adding functionality to send alert information via Slack channels (a minimal sketch follows)
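
For illustration, posting an alert summary to a Slack channel can be as simple as the sketch below, here using the slack_sdk package rather than whatever client the PR actually uses; the token, channel, and message are placeholders.

```python
import os
from slack_sdk import WebClient

# Placeholder: the token would come from the environment in practice.
client = WebClient(token=os.environ["SLACK_API_TOKEN"])

client.chat_postMessage(
    channel="#fink-alerts",                        # placeholder channel
    text="New alert: ZTF19abcdef (magpsf=19.2)",   # placeholder summary
)
```
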
  • Other PRs with some minor fixes: | [Status: Merged]

    • PR #255: fix an error when using wildcards
    • PR #268: fix a TypeError caused by an earlier version of pandas
    • PR #269: to allow complete testing, enable streaming of full alerts, filtering only the number of alerts

Fink Client

Fink Client is a light package to programmatically and remotely manipulate catalogs and alerts issued by the Fink broker. It is designed for both beginners and advanced users, and it exposes several services to easily collect and extract the Fink-processed data that researchers need for their work.

  • PR #11 | Add a Python-based Kafka Consumer | [Status: Merged] (a sketch of the consuming side is shown below)
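
Here is a minimal sketch of what such a consumer does, assuming the alerts arrive as Avro-encoded Kafka messages with the writer schema embedded (Avro object container format); the servers, topic, group id, and field name are placeholders.

```python
import io
from confluent_kafka import Consumer
from fastavro import reader

conf = {
    "bootstrap.servers": "fink-server:9093",  # placeholder
    "group.id": "my-group",
}
consumer = Consumer(conf)
consumer.subscribe(["fink-distribution"])  # placeholder topic

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    # Decode the Avro payload; `reader` expects the schema to be
    # embedded in the message (Avro object container format).
    for alert in reader(io.BytesIO(msg.value())):
        print(alert["objectId"])  # placeholder field
consumer.close()
```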

  • Other PRs with minor fixes: | [Status: Merged]

    • PR #17: fix coverage
    • PR #20: stream full alerts from fink-broker