GSOC 2020 Final Report

FITS Module

FITS, or Flexible Image Transport System, is an open standard defining a digital format for the storage, transmission, and retrieval of data. FITS was developed by the IAU in 1981 to transport different kinds of astronomical data in a unified manner. It is the most commonly used digital file format in astronomy and allows users to store various kinds of data such as images (for example, 2D images) or tables.

More information about the FITS standard can be found at:

FITS Standard

The Origin and Purpose of FITS

Boost.Astronomy

Boost.Astronomy is a C++ library proposed by Pranam Lashkari in 2018. It provides basic functionality, such as a coordinate system and a FITS module, for the analysis, storage, and manipulation of astronomical data by professional C++ developers, scientists, and physicists working in astronomy.

My Role

My role as a GSOC student was to complete the development of the FITS module (parser), provide an external API for easy management of FITS files (reading, writing, and manipulating data), and bring the FITS module to a review-ready state.

Objectives

  • Designing an extensible system for custom validation of FITS files
  • Improving the reading part of the FITS parser to support efficient reading of HDUs
  • Generalizing the communication between the various parts of the FITS module so that new HDUs can be added easily
  • Designing the writing part for all the basic FITS HDUs
  • Writing an external interface for easy and convenient access to the functionality of the FITS module

Features developed and enhanced during the GSOC period

I divided the development of the FITS module into three phases, Phase 1, Phase 2, and Phase 3, following the GSOC timeline.

Phase 1

Phase 1 consisted of the following developments:

  1. Completing Documentation for the FITS module
  2. Bug Fixes and Refactoring
  3. Writing tests for the FITS module
  4. Designing Reading Portion of FITS module

Completing Documentation for the FITS module

Studying the previous codebase of the FITS module revealed missing documentation in many places and inconsistencies in the Doxygen format.

So I decided to rewrite and add the missing documentation for every source header in the FITS module. Apart from that, we also shifted our documentation format to the Javadoc convention for better clarity.

I also implemented a small Python script using PyDriller to insert the copyright statements in an automated manner. The script ensured that the copyright notice of each file listed the name and contribution dates of every contributor who wrote or modified that file.

Pull Request: Added Copyright Information and License information for every header and source file

Bug Fixes and Refactoring in the FITS Module

The lack of tests in the previous codebase allowed some subtle bugs to creep in unnoticed.

So I fixed these bugs and verified the results against Astropy to ensure correctness.

More information about Astropy can be found here: Astropy

Also, I refactored the code to improve the quality and performance of the FITS module.

Pull Request: Bug Fixes And Refactoring in FITS Module

Writing Unit Tests for the entire FITS module

To make sure that everything written before worked correctly, I implemented tests for the FITS module.

The following test files were written in this phase:

  1. t_ascii_table
  2. t_binary_table
  3. t_card
  4. t_hdu
  5. t_primary_hdu

Pull Request: Added Unit Tests for FITS module along with some minor bug fixes

Note: This pull request was not merged in favor of a newer pull request that included the reading part as well as the unit tests.

Along with the new tests, several major bug fixes were made in the FITS module. I validated the changes against the results of Astropy to ensure correctness.

Phase 2

Phase 2 consisted of the following developments:

  1. Completing the Reading Portion of the FITS Module
  2. Developing the Card Policy
  3. Unified API for reading and writing FITS data
  4. Designing the external interface for reading FITS files

Completing the Reading Portion of the FITS Module

Upon benchmarking the existing code, we found that the performance of ASCII table and binary table reading was unacceptably slow.

After some research by me and my mentor Pranam Lashkari, we noticed that the performance loss was caused by runtime detection of the column data type; we also had to use dynamic_cast to retrieve the appropriate data. After some discussion, Pranam Sir agreed to make the column data type a template parameter so that parsing became efficient.

The binary table suffered from other problems as well: reading and parsing its data required a large amount of branching, since the type handling was done at runtime. Once the column data type became a template parameter, new ways of parsing the data opened up. After a discussion with the team, we decided to use the member function specialization trick for parsing. The parsing code for the different types was moved into specialized member function candidates, so although the amount of code remained the same, each candidate is instantiated only when it is actually used. This technique helped us remove code bloat and improved performance by a large factor; the binary table became the fastest-parsed HDU.
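To show the shape of this optimization, here is a minimal self-contained sketch (the names column, get, and parse are my own, not the library's API). The column type is a template parameter, so the right parser is chosen at compile time; tag-dispatch overloads stand in for the member function specializations described above, and only the candidate that is actually used gets instantiated, leaving no dynamic_cast or runtime type branching on the hot path.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative only: a table column whose element type is fixed at compile time.
template <typename T>
class column {
    std::vector<std::string> raw_fields_;  // raw field text read from the table
public:
    explicit column(std::vector<std::string> raw) : raw_fields_(std::move(raw)) {}

    // The parser is selected at compile time; no runtime type tag, no dynamic_cast,
    // and only the candidate matching T is ever instantiated.
    T get(std::size_t row) const { return parse(raw_fields_[row], static_cast<T*>(nullptr)); }

private:
    static std::int32_t parse(const std::string& s, std::int32_t*) { return std::stoi(s); }
    static double       parse(const std::string& s, double*)       { return std::stod(s); }
    static std::string  parse(const std::string& s, std::string*)  { return s; }
};

// Usage: column<double> magnitudes(raw); double m = magnitudes.get(0);
```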

Developing the Card Policy

"The development of a library requires that the code is always extensible. Doing this helps the users to use the library according to their own needs."

To support custom validation of the FITS header, we decided to make the validation part of a policy class. This made the library far more extensible, as users can now provide their own validation rules.
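A minimal sketch of the policy idea, using illustrative names (the actual policy interface is in the io headers linked at the end of this report): the card takes the validation policy as a template parameter and delegates keyword checks to it, so substituting a different policy changes the validation rules without touching the card itself.

```cpp
#include <cctype>
#include <string>

// Illustrative default policy enforcing a couple of FITS header rules.
struct default_validation_policy {
    static bool is_keyword_valid(const std::string& key) {
        if (key.size() > 8) return false;  // FITS keywords are at most 8 characters
        for (char c : key) {
            unsigned char u = static_cast<unsigned char>(c);
            if (!std::isupper(u) && !std::isdigit(u) && c != '-' && c != '_')
                return false;              // keywords use a restricted character set
        }
        return true;
    }
};

// The card delegates validation to the policy; users supply their own rules
// by instantiating the card with a different policy class.
template <typename ValidationPolicy = default_validation_policy>
class basic_card {
    std::string keyword_, value_;
    bool valid_;
public:
    basic_card(std::string keyword, std::string value)
        : keyword_(std::move(keyword)), value_(std::move(value)),
          valid_(ValidationPolicy::is_keyword_valid(keyword_)) {}
    bool valid() const { return valid_; }
};
```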

Unified API for reading and writing FITS data.

To study the dependencies between the various modules, I made a UML diagram depicting their relationships. Analyzing the UML diagram showed that the code relied heavily on fstream for parsing. To remove this dependency, we created a policy that provides a unified interface for different ways of reading and writing data, so that the modules are not bound to a particular read/write implementation.
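As a rough sketch of such a policy, assuming an fstream-backed default and illustrative member names: the parsers are written against the policy's small interface, so a memory-mapped or purely in-memory implementation could be dropped in without changing the parsing code.

```cpp
#include <fstream>
#include <string>

// Illustrative fstream-backed read/write policy; other implementations
// (e.g. memory-mapped files) could expose the same small interface.
class stream_io {
    std::fstream file_;
public:
    explicit stream_io(const std::string& path)
        : file_(path, std::ios::in | std::ios::out | std::ios::binary) {}

    void set_reading_pos(std::streampos pos) { file_.seekg(pos); }

    std::string read(std::size_t num_bytes) {
        std::string buffer(num_bytes, '\0');
        file_.read(&buffer[0], static_cast<std::streamsize>(num_bytes));
        return buffer;
    }

    void write(const std::string& data) {
        file_.write(data.data(), static_cast<std::streamsize>(data.size()));
    }
};

// The parsing code is written against the policy, not against fstream:
template <typename IoPolicy>
std::string read_header_block(IoPolicy& io) {
    return io.read(2880);  // FITS files are organized in 2880-byte blocks
}
```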

Designing the external interface for reading FITS files

I scrapped the existing FITS external interface and wrote a new interface that supports the following features:

  1. Access to HDUs by their name or their position in the file
  2. Lazy storage of HDUs to conserve space
  3. Use of control blocks to cache the FITS data for faster access

The dependency on the R/W interface was removed and replaced by directly passing the header and data (as a string buffer).
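To illustrate the lazy-storage and control-block ideas with a self-contained example (this is my own simplified sketch, not the library's actual classes): each HDU is reachable both by position and by name, but its data is read and cached only on first access.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <optional>
#include <string>

// Control block: owns the (lazily filled) cached data for one HDU.
struct hdu_control_block {
    std::function<std::string()> load;   // how to fetch the HDU from the file
    std::optional<std::string> cached;   // filled only on first access

    const std::string& data() {
        if (!cached) cached = load();    // lazy read keeps memory usage low
        return *cached;
    }
};

// Index allowing lookup by position in the file or by extension name;
// both keys refer to the same shared control block.
class hdu_index {
    std::map<std::size_t, std::shared_ptr<hdu_control_block>> by_position_;
    std::map<std::string, std::shared_ptr<hdu_control_block>> by_name_;
public:
    void register_hdu(std::size_t pos, const std::string& name,
                      std::function<std::string()> loader) {
        auto block = std::make_shared<hdu_control_block>();
        block->load = std::move(loader);
        by_position_[pos] = block;
        if (!name.empty()) by_name_[name] = block;
    }
    hdu_control_block& operator[](std::size_t pos)         { return *by_position_.at(pos); }
    hdu_control_block& operator[](const std::string& name) { return *by_name_.at(name); }
};
```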

Pull Request: Fits external interface(reading)

Phase 3

Phase 3 consisted of the following developments:

  1. Designing a new more flexible data structure for storing table data
  2. Support for converter policies
  3. Completing the writing part of the FITS module

Designing a new more flexible data structure for storing table data

In the old codebase, we stored the table data in a single string buffer. The problem was that accessing a column required calculating offsets, and writing, updating, or adding columns was much more difficult (maintaining caches and writing them back to the file was slow and error-prone).

Moreover, the old code parsed an entire column and then returned the data in a container. This caused unnecessary parsing, since the user does not necessarily use all the values of a column.

To solve the problems given above, we devised a new data structure called TL-FCS.

TL-FCS (Two-Level Fast Caching System)

TL-FCS consists of two caches: the Type-Unaware Cache (TUAC) and the Type-Aware Cache (TAC).

The Type-Unaware Cache is a 2D matrix of string elements that is populated directly from the HDU data in the file.

The Type-Aware Cache is a map from element indexes to parsed column values. It acts as the second-level cache and lives inside the column view.

Whenever the user asks for a column, the function returns a column view that is internally connected to the primary cache. When a column value is queried, the secondary cache is checked for an already-parsed value. If it is not present, the value is fetched from the primary cache (the 2D matrix), parsed into the appropriate type (the column view holds the type information), and an entry is added to the secondary cache.

The writing process involves updating the second-level cache, serializing the data, and then updating the primary cache.
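A simplified sketch of the two caches, using illustrative names (the real implementation is in the io directory linked at the end of this report): the column view holds the type information and the Type-Aware Cache, and falls back to the Type-Unaware string matrix only on a cache miss.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Primary cache: Type-Unaware Cache, a 2D matrix of raw string fields.
using type_unaware_cache = std::vector<std::vector<std::string>>;

// Column view: holds the type information and the Type-Aware Cache.
template <typename T>
class column_view {
    const type_unaware_cache* primary_;  // shared, populated from the HDU data
    std::size_t column_index_;
    std::map<std::size_t, T> parsed_;    // secondary cache: row index -> parsed value
public:
    column_view(const type_unaware_cache& primary, std::size_t col)
        : primary_(&primary), column_index_(col) {}

    const T& at(std::size_t row) {
        auto it = parsed_.find(row);
        if (it == parsed_.end()) {
            // Miss: fetch the raw field from the primary cache and parse it once.
            T value = parse((*primary_)[row][column_index_]);
            it = parsed_.emplace(row, std::move(value)).first;
        }
        return it->second;
    }
private:
    static T parse(const std::string& raw);  // specialized per column type
};

template <>
double column_view<double>::parse(const std::string& raw) { return std::stod(raw); }
```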

This design has also increased the number of features that can be added in the future.

Support for Converter Policies

The converters for the binary table and ASCII table were made policies so that users can implement their own serialization and deserialization functions.

By default, the ASCII converter is based on the Boost.Spirit framework, but because the converter is a policy class, users can substitute their own policy class for the default one.
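As an illustration (my own simplified names and signatures, not the library's), a custom converter policy only needs to provide the serialize/deserialize hooks that the table code calls:

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Illustrative custom converter policy using only the standard library
// in place of the default Boost.Spirit based converter.
struct stringstream_converter {
    template <typename T>
    static T deserialize(const std::string& field) {
        T value{};
        std::istringstream in(field);
        in >> value;                      // parse the ASCII field into T
        return value;
    }

    template <typename T>
    static std::string serialize(const T& value, std::size_t field_width) {
        std::ostringstream out;
        out.width(static_cast<std::streamsize>(field_width));
        out << value;                     // pad to the column's field width
        return out.str();
    }
};

// The table type would then be instantiated with this policy instead of the
// default one, e.g. ascii_table<stringstream_converter> (name illustrative).
```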

Completing the writing part of the FITS module

Thanks to the design of the TL-FCS data structure, writing HDU data became as easy as traversing a 2D matrix, so we implemented writers for all the HDUs. The writers can also accept custom writer classes.
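A minimal sketch of that traversal, reusing the illustrative names from the earlier sketches: the writer walks the primary cache row by row and hands the serialized buffer to whichever I/O policy is in use.

```cpp
#include <string>
#include <vector>

using table_cache = std::vector<std::vector<std::string>>;  // the primary 2D cache

// Illustrative: serialize the table data in a single pass over the 2D matrix.
template <typename IoPolicy>
void write_table_data(IoPolicy& io, const table_cache& cache) {
    std::string buffer;
    for (const auto& row : cache)
        for (const auto& field : row)
            buffer += field;              // fields are already serialized strings

    // FITS data units are padded to a multiple of 2880 bytes
    // (binary tables pad with zeros, ASCII tables with blanks).
    if (buffer.size() % 2880 != 0)
        buffer.append(2880 - buffer.size() % 2880, '\0');

    io.write(buffer);
}
```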

Pull Request: FITS Writing Part

Future work for the FITS module

  1. Support for custom addition, deletion, and creation of columns
  2. Porting the entire image portion to Boost.GIL
  3. Porting the entire library to C++20 modules for better compilation times for users
  4. Support for reading extra bytes after the main table

Acknowledgements

I thank my mentors Mr. Pranam Lashkari and Mr. Sarthak Singhal for their valuable guidance and support during the entire GSOC period. In particular, I would like to thank Mr. Pranam Lashkari for his immense support and guidance during each and every phase of the project.

Pranam sir never tried to enforce any specific constraints on how I should develop the project. He was, and is, always supportive of the decisions I have taken for the betterment of the FITS module. He also helped me improve my coding skills by a large factor. It is due to him that I adopted a test-driven development strategy for the FITS module. Apart from that, Pranam sir also helped me understand the true meaning of open source.

I would also like to thank Ali bro, my best colleague, for helping me make some key decisions associated with the project. Not only is he a great team member, he is excellent at mathematics and astronomy, which has inspired me a lot. During the initial phase he helped me learn a lot about Git (including resources for it) and even helped me find some bugs early in the project. We were, and are, really a great team, bro. Thank you! 😁

I also thank my parents for supporting me at every moment during the project.

Above all, I thank the Almighty for helping me become a part of this great team and this great project.

Directory Structure

GitHub Link: https://github.com/BoostGSoC20/astronomy/tree/develop/include/boost/astronomy/io
