Software Heritage maintains the largest open archive of publicly available source code. It collects software projects from various hosting platforms and stores all of them in a single giant Merkle DAG. Software Heritage currently has an experimental tool, the Code Scanner, that checks which parts of a given code base are already stored in the archive. The main idea of this proposal is to enhance the Software Heritage Code Scanner so that it is usable in real production use cases: Software Heritage GSoC task
- (D5926) Refactoring of the swh-scanner model: since swh-model already provides on-disk caching of software artifacts, the source code is now stored directly in the swh-model Merkle data structure. This part also involved refactoring the output functions and creating a new data structure to store information about the Merkle nodes. Task(s) Involved: T3349, T2730, T2692
- (D5996) Abstraction of scan policies, so that new scan algorithms can be created easily. All the scan approaches present in the benchmark branch were moved to the master branch. Task(s) Involved: T3420
- (D6114) Store provenance information about software artifacts using the Software Heritage graph service.
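
To illustrate the scan policy abstraction mentioned above, here is a minimal, self-contained sketch. It does not use the real swh-scanner or swh-model APIs: `Node`, `ScanPolicy`, `QueryAll`, `DirectoryFirst`, `build_example` and the `is_known` callback are hypothetical names, and the hashing is a simplified stand-in for the real Software Heritage identifiers. It assumes that, because of the Merkle structure, a directory known to the archive implies that everything below it is known too.

```python
import hashlib
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Node:
    """Hypothetical Merkle node: a file (data) or a directory (children)."""
    name: str
    data: bytes = b""
    children: Dict[str, "Node"] = field(default_factory=dict)

    def digest(self) -> str:
        # Simplified hashing, NOT the real SWHID computation.
        h = hashlib.sha1()
        if self.children:
            h.update(b"dir\0")
            # A directory hash depends on the names and hashes of its entries.
            for name in sorted(self.children):
                h.update(name.encode() + b"\0")
                h.update(self.children[name].digest().encode())
        else:
            h.update(b"file\0" + self.data)
        return h.hexdigest()


class ScanPolicy(ABC):
    """Common interface: map every path to True if the archive knows it."""

    @abstractmethod
    def run(self, root: Node, is_known: Callable[[str], bool]) -> Dict[str, bool]:
        ...


class QueryAll(ScanPolicy):
    """Naive policy: ask the archive about every single node."""

    def run(self, root, is_known):
        result = {}
        stack = [(root.name, root)]
        while stack:
            path, node = stack.pop()
            result[path] = is_known(node.digest())
            stack.extend((f"{path}/{c.name}", c) for c in node.children.values())
        return result


class DirectoryFirst(ScanPolicy):
    """Top-down policy: once a directory is known, skip querying its contents."""

    def run(self, root, is_known):
        result = {}

        def visit(path, node, known_above):
            # Only query the archive if no ancestor was already known.
            known = known_above or is_known(node.digest())
            result[path] = known
            for child in node.children.values():
                visit(f"{path}/{child.name}", child, known)

        visit(root.name, root, False)
        return result


def build_example() -> Node:
    """Small illustrative tree: root/{a.txt, lib/{b.txt, c.txt}}."""
    lib = Node("lib", children={
        "b.txt": Node("b.txt", b"bbb"),
        "c.txt": Node("c.txt", b"ccc"),
    })
    return Node("root", children={"a.txt": Node("a.txt", b"local"), "lib": lib})
```

Both policies return the same path-to-status mapping; they differ only in how many times they call `is_known`, which in a real scanner would correspond to requests sent to the archive. Adding a new algorithm then amounts to writing another `ScanPolicy` subclass.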