wz1000/proposal.md

## proposal.md

      
Call for discussion

Extending GHC to provide first-class support for code analysis

Three Haskell Summer of Code 2018 proposals share an underlying theme: extracting information about Haskell source from GHC.
haskell-org/summer-of-haskell#41
haskell-org/summer-of-haskell#36
haskell-org/summer-of-haskell#46
GHC has a wealth of information about the source it compiles, but extracting this information is tricky, frustrating and messy.
Additionally, to get this information today we have to compile the code a second time. If GHC instead wrote the information we need
into a database (possibly within .hi files) as part of normal compilation, everything could be done in a single pass.
Information that GHC could output

Documentation
(Specialized) types of expressions and symbols at source locations
Source locations of everything defined in this module
A global database of cross reference data
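To make the list above concrete, here is a minimal sketch of what a per-module record of this information might look like, together with a "type at position" query. Every name in it (`Span`, `ModuleInfo`, `typeAt`, `exampleInfo`) is hypothetical and chosen for illustration; nothing here is an existing GHC type or a proposed on-disk format, and types are kept as pretty-printed strings purely to keep the sketch small.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing, Down (..))

-- | A source span as (line, column) pairs: start inclusive, end exclusive.
-- Hypothetical type, not GHC's SrcSpan.
data Span = Span { spanStart :: (Int, Int), spanEnd :: (Int, Int) }
  deriving (Eq, Show)

-- | Hypothetical per-module record covering the first three items above.
data ModuleInfo = ModuleInfo
  { miDocs  :: [(String, String)]  -- ^ name -> documentation
  , miTypes :: [(Span, String)]    -- ^ span -> pretty-printed (specialized) type
  , miDefs  :: [(String, Span)]    -- ^ name -> definition site in this module
  }

-- | Does the span contain the given (line, column) position?
contains :: Span -> (Int, Int) -> Bool
contains (Span s e) p = s <= p && p < e

-- | Type of the innermost expression enclosing a position. For nested spans,
-- the innermost one starts latest and, among those, ends earliest.
typeAt :: ModuleInfo -> (Int, Int) -> Maybe String
typeAt mi p =
  case filter ((`contains` p) . fst) (miTypes mi) of
    [] -> Nothing
    xs -> Just (snd (minimumBy (comparing key) xs))
  where
    key (Span s e, _) = (Down s, e)

-- | Made-up data for a one-line module: double x = x + x
exampleInfo :: ModuleInfo
exampleInfo = ModuleInfo
  { miDocs  = [("double", "Doubles its argument.")]
  , miTypes = [ (Span (3, 1) (3, 19), "Int -> Int")  -- the whole binding
              , (Span (3, 14) (3, 15), "Int")        -- the first use of x
              ]
  , miDefs  = [("double", Span (3, 1) (3, 7))]
  }
```

With a record like this, an editor query such as `typeAt exampleInfo (3, 14)` needs only a lookup in saved data rather than a second compilation. The global cross-reference database would have to live outside any single module's record.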

Clang also builds a similar index:

As well as units and records, covered above, the indexing data collected from the AST and stored within each record file is further organized into entries for the symbols, occurrences and relations libIndex already produces. Symbol entries include all the information about a symbol that doesn’t change based on where in a source file it appears: its USR, source language, name, kind and subkind (e.g. Constructor and CXXMoveConstructor), and a bit set encoding other useful properties, like whether the symbol is locally scoped, or templated. Occurrence entries, on the other hand, include the information unique to each occurrence of a symbol in a source file – its position in the file, its roles in that position (e.g. Definition, Reference, Call, Read, Write), and its relations to other symbols. Each Relation includes a reference to the related symbol entry along with the set of roles that describe the nature of the occurrence’s relationship to that symbol (e.g. RelationBaseOf, RelationCalledBy, or RelationChildOf)

https://docs.google.com/document/d/1cH2sTpgSnJZCkZtJl1aY-rzy4uGPcrI-6RrUpdATO2Q/
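The symbol / occurrence / relation split described in the quote could be mirrored in Haskell roughly as follows. This is only a sketch: the role and relation names are taken from the quote, but the types, field names and the sample data are assumptions made for illustration, not Clang's or GHC's actual representation.

```haskell
-- | Stable identifier for a symbol, playing the part of Clang's USR.
newtype SymbolId = SymbolId String
  deriving (Eq, Show)

-- | Roles an occurrence can have at its position (names from the quote).
data Role = Definition | Reference | Call | Read | Write
  deriving (Eq, Show)

-- | How an occurrence relates to another symbol (names from the quote).
data RelationKind = BaseOf | CalledBy | ChildOf
  deriving (Eq, Show)

-- | Everything about a symbol that does not depend on where it appears.
data Symbol = Symbol
  { symId      :: SymbolId
  , symName    :: String
  , symKind    :: String  -- e.g. "Function", "Constructor"
  , symIsLocal :: Bool
  }

data Relation = Relation
  { relTarget :: SymbolId
  , relKinds  :: [RelationKind]
  }

-- | Everything specific to one appearance of a symbol in a source file.
data Occurrence = Occurrence
  { occSymbol    :: SymbolId
  , occFile      :: FilePath
  , occPos       :: (Int, Int)  -- (line, column)
  , occRoles     :: [Role]
  , occRelations :: [Relation]
  }

-- | All occurrences of a symbol that carry the given role.
occurrencesWith :: Role -> SymbolId -> [Occurrence] -> [Occurrence]
occurrencesWith role sid =
  filter (\o -> occSymbol o == sid && role `elem` occRoles o)

-- | Symbols that call the given symbol, read off the CalledBy relations
-- attached to its Call occurrences.
callersOf :: SymbolId -> [Occurrence] -> [SymbolId]
callersOf sid occs =
  [ relTarget r
  | o <- occurrencesWith Call sid occs
  , r <- occRelations o
  , CalledBy `elem` relKinds r
  ]

-- | Made-up index: 'helper' is defined in Lib.hs and called from 'main'.
exampleIndex :: [Occurrence]
exampleIndex =
  [ Occurrence (SymbolId "Lib.helper") "Lib.hs" (4, 1) [Definition] []
  , Occurrence (SymbolId "Lib.helper") "Main.hs" (7, 10) [Reference, Call]
      [Relation (SymbolId "Main.main") [CalledBy]]
  ]
```

Queries such as "go to definition" and "find callers" then become plain filters over the stored occurrences, for example `occurrencesWith Definition (SymbolId "Lib.helper") exampleIndex` and `callersOf (SymbolId "Lib.helper") exampleIndex`.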
Additionally, we need more hooks in GHC so that we can extract parsed, renamed and typechecked modules.
Questions that need to be answered:

What data can we always collect and save in .hi files?
What will be the impact on GHC's compilation speed and on the disk usage of .hi files? (I believe GHC already needs to know all this information while compiling, so any slowdown should mainly come from writing it to disk.)
If some data is too much to fit in a .hi file or slows compilation, we could generate it only if built with a special flag.
Where do we store this data?