garrettr/gist:3628225

## gistfile1.txt
# Plan for finishing Gapped Reads Mapping

Here is the remaining work that needs to be done to map gapped reads (RNASeq) with Statmap.

1. [x] Change the binary search in find_matching_mapped_locations to return as matches all locations within a range (REFERENCE_INSERT_LENGTH_MAX in config.h) including and beyond the base location's start.
Done on: 8/31/12
2. [x] Handle multiple matches in the candidate mapping building. Storing them is easy (we used a mapped_locations from the beginning with this in mind).
Done on: 9/3/12
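
For reference, the range search in step 1 can be sketched roughly as below. This is a minimal illustration, not the actual find_matching_mapped_locations code: the struct location type, the lower_bound and find_locations_in_range helpers, the value of REFERENCE_INSERT_LENGTH_MAX, and the inclusive upper bound are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for the constant defined in Statmap's config.h. */
#define REFERENCE_INSERT_LENGTH_MAX 1000

/* Hypothetical, simplified location record: only the start coordinate. */
struct location {
    long start;
};

/* Binary search for the index of the first element whose start is >= key.
 * Returns n if no such element exists. locs must be sorted by start. */
static size_t lower_bound(const struct location *locs, size_t n, long key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (locs[mid].start < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Find every location with
 *   base_start <= start <= base_start + REFERENCE_INSERT_LENGTH_MAX.
 * Stores the index of the first candidate in *first and returns the number
 * of matches, so the matches are locs[*first .. *first + count - 1]. */
static size_t find_locations_in_range(const struct location *locs, size_t n,
                                      long base_start, size_t *first)
{
    size_t begin = lower_bound(locs, n, base_start);
    size_t end = begin;
    while (end < n &&
           locs[end].start <= base_start + REFERENCE_INSERT_LENGTH_MAX)
        end++;
    *first = begin;
    return end - begin;
}
```

The binary search only locates the first match; the matches are contiguous in the sorted array, so a linear scan from there collects the rest.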

Once that is done, we will have a set of candidate mappings, as we had before. The next step is to rework the recheck. As preparation, the first step is:

3. **UPDATE**: The original goal was to rewrite recheck_locations for candidate mappings at this stage. Actually, it is not correct to recheck locations at this stage - we should check across the whole mapping once the candidate mappings have been joined.
    a)  Remove the RECHECK flag (finally) from candidate mappings and the recheck
        code from find_candidate_mappings.
        Done by: 9/4/12
    b)  Write (naive, for now) recheck code using likelihood ratio test for the
        sets of joined candidate mappings.
        Done by: 9/4/12
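
One possible shape for the naive recheck in 3b is sketched below. It assumes that "likelihood ratio test" means comparing each joined mapping against the best one for the read and dropping those that fall too far behind. The struct joined_mapping type, the recheck_joined_mappings function, and the max_log10_ratio threshold are hypothetical names for illustration only.

```c
#include <stddef.h>

/* Hypothetical record for one set of joined candidate mappings: the total
 * log10 likelihood across the joined mappings, plus a keep/discard flag. */
struct joined_mapping {
    double log10_lhd;
    int keep;
};

/* Naive likelihood ratio filter (an assumed interpretation, see above).
 * A joined mapping is kept when
 *   log10(best likelihood / its likelihood) <= max_log10_ratio.
 * Sets the keep flag on every entry and returns the number kept. */
static size_t recheck_joined_mappings(struct joined_mapping *m, size_t n,
                                      double max_log10_ratio)
{
    if (n == 0)
        return 0;

    /* First pass: find the best (largest) log likelihood. */
    double best = m[0].log10_lhd;
    for (size_t i = 1; i < n; i++) {
        if (m[i].log10_lhd > best)
            best = m[i].log10_lhd;
    }

    /* Second pass: flag everything within the threshold of the best. */
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        m[i].keep = (best - m[i].log10_lhd <= max_log10_ratio);
        if (m[i].keep)
            kept++;
    }
    return kept;
}
```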

Question: can we update the error data if we haven't done the recheck? Should
we do an initial recheck on any unique mappers we see?

5. can_be_used_to_update_error_data needs to be rewritten. This might be tricky - whenever there are multiple indexable subtemplates, we're going to have multiple candidate mappings here, and the existing test that mappings->length is 1 or 2 isn't going to work. Since we know the number of indexable subtemplates, the best thing to do might be to pass this into the update_error_data function, and make sure mappings->length == num_indexable_subtemplates || mappings->length == num_indexable_subtemplates*2. This does make the comparison of sequences for the diploid possibility a bit tricky though - I will need to revisit it, as a few minutes of thinking didn't turn up an obvious solution.

Writing a function like candidate_mappings_are_a_unique_mapper is the basic goal here.
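
A first cut at the length check might look like the sketch below. It covers only the length test described in step 5. The diploid sequence comparison is deliberately left out, since that part is still unresolved. The pared-down struct candidate_mappings and the function signature are assumptions, not the real Statmap definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down container: only the length matters here. */
struct candidate_mappings {
    size_t length;
};

/* Length-only check: a read looks like a unique mapper if it has exactly one
 * candidate mapping per indexable subtemplate, or exactly two per
 * subtemplate (the diploid case).
 * NOTE: this does not compare the sequences of the diploid pair. */
static bool candidate_mappings_are_a_unique_mapper(
    const struct candidate_mappings *mappings,
    size_t num_indexable_subtemplates)
{
    if (mappings == NULL || num_indexable_subtemplates == 0)
        return false;

    return mappings->length == num_indexable_subtemplates
        || mappings->length == 2 * num_indexable_subtemplates;
}
```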

6. update_error_data_record_from_candidate_mappings, likewise. This shouldn't be so hard - just do the update process for each candidate mapping in a loop.

At this point it's time to stop, breathe, and get the RNASeq unit test working. All of the marginal mapping for gapped reads is complete at this point. The test is a straightforward map-and-compare: map the reads, then compare the mappings back to the genome, this time using the CIGAR strings for the comparison to see how it handles the introns.
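
For the CIGAR comparison, a small helper along these lines could be useful in the unit test. It sums the lengths of the N operations (skipped reference regions, which is how introns show up in a CIGAR string) so the test can check the mapped gap against the simulated intron. The function name cigar_intron_length is hypothetical, and this is a sketch rather than existing Statmap code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Sum the lengths of all N (skipped region / intron) operations in a CIGAR
 * string such as "50M1000N50M". Returns -1 if the string is empty or
 * malformed (missing length, missing operation, or unknown operation). */
static long cigar_intron_length(const char *cigar)
{
    const char *p = cigar;
    long total = 0;

    if (*p == '\0')
        return -1;

    while (*p != '\0') {
        /* Every operation must be preceded by a decimal length. */
        if (!isdigit((unsigned char)*p))
            return -1;

        long len = 0;
        while (isdigit((unsigned char)*p)) {
            len = len * 10 + (*p - '0');
            p++;
        }

        /* The length must be followed by a valid operation character. */
        if (*p == '\0' || strchr("MIDNSHP=X", *p) == NULL)
            return -1;

        if (*p == 'N')
            total += len;
        p++;
    }
    return total;
}
```

For a read spanning one 1000 bp intron, cigar_intron_length("50M1000N50M") gives 1000, and an ungapped "100M" gives 0.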

And then there are two more steps, which I will require your assistance with.

7. Fragment length distribution estimation. Or do we need it earlier? This may be a chicken-and-egg problem. Also, this code has no comments.

8. iterative mapping. I think this one is for you, since I don't really understand what all is going on in there.