Release v0.17.4 - update docs and arguments

arendsee · Oct 12, 2016 · c29d266 · c29d266
2 parents 9d542d7 + 0b17ff0
commit c29d266
Show file tree

Hide file tree

Showing 12 changed files with 172 additions and 34 deletions.
diff --git a/README.md b/README.md
@@ -126,7 +126,7 @@ context.
 
 # Definitions
 
- * query genome - the reference genome of input intervals
+ * query genome - the genome referenced by the input intervals
 
  * target genome - the genome to which input intervals are mapped
 
@@ -156,7 +156,7 @@ Execution order for the `synder search` command
  2.  transform scores
  3.  merge doubly-overlapping blocks
  4.  determine contiguous sets
- 5.  find query-side overlapping and flanking blocks for each input sequence
+ 5.  find query-side, contextually-relevant blocks for each input sequence
  6.  map to overlapping contiguous sets
  7.  calculate score for input sequence relative to each contiguous set
  8.  calculate search interval relative to each contiguous set
@@ -251,29 +251,94 @@ solution for now.
 
 ## 4. Determine contiguous sets
 
-**OUT-OF-DATE**
-
 Build an interval adjacency matrix for the query context intervals and for the
 target context intervals. AND them together to get a block adjacency matrix.
 From this matrix, extract paths of adjacent blocks. Synder will merge any
 blocks that overlap on both genomes, this ensures there is a unique path
 through the adjacency matrix.
 
-## 5. Find overlapping/flanking blocks
+## 5. Find contextual blocks
+
+Mapping query intervals to target intervals would be trivial for a perfectly linear map, e.g.
+
+```
+                                      [-----------]
+=====      ============     ==========             ============
+  |              |               |                       |  
+  |              |               |                       |  
+=====      ============     ==========   <--->     ============
+
+Where <---> is the query interval and [----] the search interval
+```
+
+Synder reduces all blocks in the genome into into contiguous sets, which are
+non-overlapping sets of intervals where all blocks are adjacent.
+
+ * interval adjacency - two intervals are adjacent if they are on the same
+   contig and no other interval is fully contained between them
+
+ * block adjacency - two blocks are adjacent if 1) the intervals are adjacent on
+   both the query and target sides and 2) both have the same sign
+
+ * query context - all blocks that overlap or are adjacent to the query interval
+
+ * contiguous set - a set where block *i* is adjacent to block *i+1*
+
+ * search intervals - a set of intervals on the target genome where the
+   ortholog of the query interval is expected to be (the ortholog search space)
 
-**INCOMPLETE**
 
 ## 6. Reduce to overlapping sets
 
-**INCOMPLETE**
+Once all overlapping and flanking query-side blocks are identified.
 
 ## 7. Calculate scores relative to contiguous sets
 
-**INCOMPLETE**
+Especially with the high *k*, it is important to be able to rank search
+intervals. Queries that heavily overlap elements of contiguous sets are more
+reliable than ones that fall inbetween. Likewise, queries in dense contiguous
+sets, should rank higher than those in sparse ones (all else being equal).
+
+Input scores for syntenic blocks are additive (assuming the user entered the
+correct transformation).
+
+```
+s = 0
+for i in [a..b]
+    if i in any block in c AND i in q
+        s += d / L
+    else if i in any block in c
+        s += d * e^(-r * (dist(q, i)))
+where
+    c := the contiguous set
+    q := query interval
+    a := contig lower bound
+    b := contig upper bound
+    d := query syntenic score
+    L := query length
+    dist := function calculating distance
+```
+
+So all blocks in the contiguous set contribute to the total score. The scores
+of blocks that do not overlap the query decay exponentially with distance from
+the query bound.
+
+The score decay rate is controlled by the parameter *r*. A value of 0.001, the
+current default, indicates weight will fall to 0.5 by 1000 bases from the
+query. k=0 would give equal weight to all elements in the contiguous set, i.e.
+be more affected by context. A high value, e.g. r=100, would completely ignore
+context, basing score only on overlapping elements. 
+
 
 ## 8. Calculate search interval relative to contiguous set
 
-**INCOMPLETE**
+Exactly one search interval is created for each previously selected contiguous
+set.
+
+The chromosome length file tells synder how long each chromosome is. Then if
+some input query is closer to the end than anything in the synteny map, the
+search interval bound can be set to the end of the chromosome, rather than
+infinity.
 
  1. Find input interval, *i*, context in the query genome. This context consists of
     all query intervals *Sq* that either overlap or flank the input.
@@ -285,7 +350,71 @@ through the adjacency matrix.
     - **anchored** - if overlaps a member of *c*
     - **bound** - if is inbetween two members of *c*
     - **unbound** - if is more extreme than any member of the set, but is inbetween two contiguous sets
-    - **extreme** - No entr is found
+    - **extreme** - No entry is found
+
+The snapping rules are detailed below:
+
+```
+KEY:
+|x--  --y| - bounds of the contiguous block; start==x, stop==y
+a========b - a syntenic block with start == a and stop == b
+  <---q    - the query interval, with stop == q (start doesn't matter)
+a==b--c==d - query bounding blocks in the contiguous set
+[===]      - a non-bounding block in the same contiguous set
+ ...  F=== - nearest non-adjacent block ***ON THE TARGET***, F=start
+     ^     - search interval bound
+
+Possible snap positions (relative to query)
+  |x...[===]-----a=======b-----c=======d-----[===]...y|  ...  F===
+                 ^       ^     ^       ^     ^                ^
+
+We always snap to one bound of the relevant block. If the bound falls between
+the blocks (i1, j1) and (i2, j2), it would be reasonable to set the search
+interval to (j1 + 1, i2 - 1), however, this results in negative lenghts when
+the blocks are adjacent. So instead we set such intervals to to (j1, i2).
+
+=====================================================================
+Query bound is precedes both contiguous set bounds
+
+         |x-----y|
+      ...a=======b
+         ^
+  <---q
+
+-------------------------------------------------------------------------------
+Query bound falls within a contiguous set element
+
+  |x...a=======b...y|
+               ^
+       <--q
+
+-------------------------------------------------------------------------------
+Query bound falls between elements of the contiguous set
+
+  |x...a=======b-------c=======d...y|
+                       ^
+                 <--q
+
+-------------------------------------------------------------------------------
+Query is beyond the bounds of the contiguous set
+
+  Target side           A=======B      F===   1. map b-\>B   
+                                |------^      2. map B->F
+                                |             3. set F as SI bound
+  Query side     |x...--a=======b| ...
+                                <---q
+
+-------------------------------------------------------------------------------
+Query is beyond anything in the synteny map
+
+  Target side           A=======B      THE\_END   
+                                |------^      
+                                |             1. map b->B   
+  Query side     |x...--a=======b| ...        2. B is contig extremum
+                                <---q         3. set SI bound to contig length
+
+-------------------------------------------------------------------------------
+```
 
 
 # TODO list
@@ -317,6 +446,7 @@ through the adjacency matrix.
  - [x] directly parse synteny files, no database Bash script
  - [x] implement filter
  - [ ] write tests for filter
+ - [x] remove hard-coding of score decay parameter
  - [x] reimplement dump blocks
  - [x] add quiet mode to runtests.sh
  - [ ] add score to filter
@@ -331,6 +461,4 @@ through the adjacency matrix.
 
 I do not know a good way to do these, nor am I certain of their value.
 
- - [ ] implement target side scoring
- - [ ] implement score thresholding
  - [ ] implement contiguous set scoring
diff --git a/src/arguments.cpp b/src/arguments.cpp
diff --git a/src/arguments.h b/src/arguments.h
@@ -20,6 +20,7 @@ class Arguments
     FILE *qclfile  = nullptr;
     bool swap      = false;
     long k         = 0;
+    double r       = 0.001;
     char trans     = 'i';
 
     ~Arguments()

diff --git a/src/commands.cpp b/src/commands.cpp
@@ -135,6 +135,11 @@ option::Descriptor k = {
     "  -k, --interruption-fuzz \tNumber of interrupting intervals allowed before breaking contiguous set (default=0)"
 };
 
+option::Descriptor r = {
+    R, 0, "j", "--decay-rate", Arg::Numeric,
+    "  -j, --decay-rate \tScore decay rate (default=0.001)"
+};
+
 option::Descriptor transform = {
     TRANSFORM, 0, "x", "--transform", Arg::Required,
     "  -x, --transform \tTransform score (Synder requires additive scores):\n"
@@ -198,6 +203,7 @@ bool subcommand_filter(int argc, char* argv[])
         descriptors::tclfile,
         descriptors::qclfile,
         descriptors::k,
+        descriptors::r,
         descriptors::reverse,
         descriptors::help,
         {0,0,0,0,0,0}
@@ -290,6 +296,7 @@ bool subcommand_search(int argc, char* argv[])
         descriptors::tclfile,
         descriptors::qclfile,
         descriptors::k,
+        descriptors::r,
         descriptors::reverse,
         descriptors::base_offsets,
         descriptors::transform,

diff --git a/src/commands.h b/src/commands.h
@@ -26,6 +26,7 @@ enum optionIndex {
     REVERSE,
     BASE_OFFSETS,
     K,
+    R,
     TRANSFORM,
     UNKNOWN,
     VERSION,

diff --git a/src/contig.cpp b/src/contig.cpp
@@ -39,7 +39,7 @@ void Contig::map(Feature& t_feat)
     delete rc;
 }
 
-std::vector<SearchInterval> Contig::list_search_intervals(Feature& t_feat)
+std::vector<SearchInterval> Contig::list_search_intervals(Feature& t_feat, double r)
 {
     // TODO -- need to move this back up to Contig
 
@@ -66,7 +66,7 @@ std::vector<SearchInterval> Contig::list_search_intervals(Feature& t_feat)
     bool inbetween = rc->inbetween || rc->leftmost || rc->rightmost;
     std::vector<SearchInterval> si;
     for(auto &c : csets) {
-        si.push_back( SearchInterval(c->ends, &t_feat, inbetween) );
+        si.push_back( SearchInterval(c->ends, &t_feat, inbetween, r) );
     }
 
     delete rc;
@@ -75,9 +75,9 @@ std::vector<SearchInterval> Contig::list_search_intervals(Feature& t_feat)
     return si;
 }
 
-void Contig::find_search_intervals(Feature& t_feat)
+void Contig::find_search_intervals(Feature& t_feat, double r)
 {
-    std::vector<SearchInterval> si = list_search_intervals(t_feat);
+    std::vector<SearchInterval> si = list_search_intervals(t_feat, r);
     for(auto &s : si) {
         s.print();
     }

diff --git a/src/contig.h b/src/contig.h
@@ -33,10 +33,10 @@ class Contig {
     void set_length(long length);
 
     /** Print target regions from a given query */
-    void find_search_intervals(Feature& feat);
+    void find_search_intervals(Feature& feat, double r);
 
     /** Print target regions from a given query */
-    std::vector<SearchInterval> list_search_intervals(Feature& feat);
+    std::vector<SearchInterval> list_search_intervals(Feature& feat, double r);
 
     /** Write blocks overlapping intervals in intfile
      *

diff --git a/src/search_interval.cpp b/src/search_interval.cpp
@@ -3,13 +3,14 @@
 SearchInterval::SearchInterval(
     const std::array<Block*,2>& t_ends,
     Feature* t_feat,
-    bool t_inbetween
+    bool t_inbetween,
+    double r
 )
     : m_feat(t_feat),
       m_inbetween(t_inbetween),
       m_bnds(t_ends)
 {
-    m_score = calculate_score(m_bnds[0]);
+    m_score = calculate_score(m_bnds[0], r);
 
     reduce_side(LO);
     reduce_side(HI);
@@ -190,7 +191,7 @@ void SearchInterval::get_si_bound(const Direction d)
 //      |===|    |             |                 |===|
 //      |<------>| far         |       near |<-->|
 //          |<-->| near        |        far |<------>|
-double SearchInterval::flank_area(long near, long far, double k)
+double SearchInterval::flank_area(long near, long far, double r)
 {
 
     // If far <= 0, this means there is no interval to score in this direction
@@ -208,17 +209,14 @@ double SearchInterval::flank_area(long near, long far, double k)
         // the weight falls exponentially with distance from the query, e.g.
         // $$ \int_{near}^{far} exp(-kx) dx $$
         // which evaluates to the following:
-        return (1 / k) * ( exp(-1 * k * near) - exp(-1 * k * far) );
+        return (1 / r) * ( exp(-1 * r * near) - exp(-1 * r * far) );
     }
     return 0;
 }
 
-double SearchInterval::calculate_score(Block* b)
+double SearchInterval::calculate_score(Block* b, double r)
 {
 
-    // TODO ASSIGN THIS AS AN INPUT PARAMETER
-    double k = 0.001;
-
     double score = 0;
 
     if(b == nullptr)
@@ -243,15 +241,15 @@ double SearchInterval::calculate_score(Block* b)
         // near difference := i1 = a1 - b2
         // far difference  := i2 = a1 - b1
         // i1 and i2 may be negative, if query is not in the above position
-        double weighted_length = flank_area(a1 - b2, a1 - b1, k) +
+        double weighted_length = flank_area(a1 - b2, a1 - b1, r) +
 
                           //    a1        a2
                           //    |=========|    b1  b2     query interval
                           //              |....|---|      syntenic interval
                           //                   |===|      interval to score
                           // near difference := i1 = b1 - a2
                           // far difference  := i2 = b1 - a1
-                          flank_area(b1 - a1, b2 - a1, k) +
+                          flank_area(b1 - a1, b2 - a1, r) +
 
                           //         a1        a2
                           //         |=========|         query interval

diff --git a/src/search_interval.h b/src/search_interval.h
@@ -30,14 +30,15 @@ class SearchInterval : Interval<SearchInterval>
     void get_si_bound(const Direction d);
 
     void set_bound(Direction d);
-    double flank_area(long near, long far, double k);
-    double calculate_score(Block* blk);
+    double flank_area(long near, long far, double r);
+    double calculate_score(Block* blk, double r);
 
 public:
     SearchInterval(
         const std::array<Block*,2>& t_ends,
         Feature* t_feat,
-        bool t_inbetween
+        bool t_inbetween,
+        double t_r
     );
 
     ~SearchInterval();