Skip to content

Commit

Permalink
Release v0.17.4 - update docs and arguments
Browse files Browse the repository at this point in the history
  • Loading branch information
arendsee committed Oct 12, 2016
2 parents 9d542d7 + 0b17ff0 commit c29d266
Show file tree
Hide file tree
Showing 12 changed files with 172 additions and 34 deletions.
152 changes: 140 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -126,7 +126,7 @@ context.

# Definitions

* query genome - the reference genome of input intervals
* query genome - the genome referenced by the input intervals

* target genome - the genome to which input intervals are mapped

Expand Down Expand Up @@ -156,7 +156,7 @@ Execution order for the `synder search` command
2. transform scores
3. merge doubly-overlapping blocks
4. determine contiguous sets
5. find query-side overlapping and flanking blocks for each input sequence
5. find query-side, contextually-relevant blocks for each input sequence
6. map to overlapping contiguous sets
7. calculate score for input sequence relative to each contiguous set
8. calculate search interval relative to each contiguous set
Expand Down Expand Up @@ -251,29 +251,94 @@ solution for now.

## 4. Determine contiguous sets

**OUT-OF-DATE**

Build an interval adjacency matrix for the query context intervals and for the
target context intervals. AND them together to get a block adjacency matrix.
From this matrix, extract paths of adjacent blocks. Synder will merge any
blocks that overlap on both genomes, this ensures there is a unique path
through the adjacency matrix.

## 5. Find overlapping/flanking blocks
## 5. Find contextual blocks

Mapping query intervals to target intervals would be trivial for a perfectly linear map, e.g.

```
[-----------]
===== ============ ========== ============
| | | |
| | | |
===== ============ ========== <---> ============
Where <---> is the query interval and [----] the search interval
```

Synder reduces all blocks in the genome into into contiguous sets, which are
non-overlapping sets of intervals where all blocks are adjacent.

* interval adjacency - two intervals are adjacent if they are on the same
contig and no other interval is fully contained between them

* block adjacency - two blocks are adjacent if 1) the intervals are adjacent on
both the query and target sides and 2) both have the same sign

* query context - all blocks that overlap or are adjacent to the query interval

* contiguous set - a set where block *i* is adjacent to block *i+1*

* search intervals - a set of intervals on the target genome where the
ortholog of the query interval is expected to be (the ortholog search space)

**INCOMPLETE**

## 6. Reduce to overlapping sets

**INCOMPLETE**
Once all overlapping and flanking query-side blocks are identified.

## 7. Calculate scores relative to contiguous sets

**INCOMPLETE**
Especially with the high *k*, it is important to be able to rank search
intervals. Queries that heavily overlap elements of contiguous sets are more
reliable than ones that fall inbetween. Likewise, queries in dense contiguous
sets, should rank higher than those in sparse ones (all else being equal).

Input scores for syntenic blocks are additive (assuming the user entered the
correct transformation).

```
s = 0
for i in [a..b]
if i in any block in c AND i in q
s += d / L
else if i in any block in c
s += d * e^(-r * (dist(q, i)))
where
c := the contiguous set
q := query interval
a := contig lower bound
b := contig upper bound
d := query syntenic score
L := query length
dist := function calculating distance
```

So all blocks in the contiguous set contribute to the total score. The scores
of blocks that do not overlap the query decay exponentially with distance from
the query bound.

The score decay rate is controlled by the parameter *r*. A value of 0.001, the
current default, indicates weight will fall to 0.5 by 1000 bases from the
query. k=0 would give equal weight to all elements in the contiguous set, i.e.
be more affected by context. A high value, e.g. r=100, would completely ignore
context, basing score only on overlapping elements.


## 8. Calculate search interval relative to contiguous set

**INCOMPLETE**
Exactly one search interval is created for each previously selected contiguous
set.

The chromosome length file tells synder how long each chromosome is. Then if
some input query is closer to the end than anything in the synteny map, the
search interval bound can be set to the end of the chromosome, rather than
infinity.

1. Find input interval, *i*, context in the query genome. This context consists of
all query intervals *Sq* that either overlap or flank the input.
Expand All @@ -285,7 +350,71 @@ through the adjacency matrix.
- **anchored** - if overlaps a member of *c*
- **bound** - if is inbetween two members of *c*
- **unbound** - if is more extreme than any member of the set, but is inbetween two contiguous sets
- **extreme** - No entr is found
- **extreme** - No entry is found

The snapping rules are detailed below:

```
KEY:
|x-- --y| - bounds of the contiguous block; start==x, stop==y
a========b - a syntenic block with start == a and stop == b
<---q - the query interval, with stop == q (start doesn't matter)
a==b--c==d - query bounding blocks in the contiguous set
[===] - a non-bounding block in the same contiguous set
... F=== - nearest non-adjacent block ***ON THE TARGET***, F=start
^ - search interval bound
Possible snap positions (relative to query)
|x...[===]-----a=======b-----c=======d-----[===]...y| ... F===
^ ^ ^ ^ ^ ^
We always snap to one bound of the relevant block. If the bound falls between
the blocks (i1, j1) and (i2, j2), it would be reasonable to set the search
interval to (j1 + 1, i2 - 1), however, this results in negative lenghts when
the blocks are adjacent. So instead we set such intervals to to (j1, i2).
=====================================================================
Query bound is precedes both contiguous set bounds
|x-----y|
...a=======b
^
<---q
-------------------------------------------------------------------------------
Query bound falls within a contiguous set element
|x...a=======b...y|
^
<--q
-------------------------------------------------------------------------------
Query bound falls between elements of the contiguous set
|x...a=======b-------c=======d...y|
^
<--q
-------------------------------------------------------------------------------
Query is beyond the bounds of the contiguous set
Target side A=======B F=== 1. map b-\>B
|------^ 2. map B->F
| 3. set F as SI bound
Query side |x...--a=======b| ...
<---q
-------------------------------------------------------------------------------
Query is beyond anything in the synteny map
Target side A=======B THE\_END
|------^
| 1. map b->B
Query side |x...--a=======b| ... 2. B is contig extremum
<---q 3. set SI bound to contig length
-------------------------------------------------------------------------------
```


# TODO list
Expand Down Expand Up @@ -317,6 +446,7 @@ through the adjacency matrix.
- [x] directly parse synteny files, no database Bash script
- [x] implement filter
- [ ] write tests for filter
- [x] remove hard-coding of score decay parameter
- [x] reimplement dump blocks
- [x] add quiet mode to runtests.sh
- [ ] add score to filter
Expand All @@ -331,6 +461,4 @@ through the adjacency matrix.

I do not know a good way to do these, nor am I certain of their value.

- [ ] implement target side scoring
- [ ] implement score thresholding
- [ ] implement contiguous set scoring
Empty file removed src/arguments.cpp
Empty file.
1 change: 1 addition & 0 deletions src/arguments.h
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ class Arguments
FILE *qclfile = nullptr;
bool swap = false;
long k = 0;
double r = 0.001;
char trans = 'i';

~Arguments()
Expand Down
7 changes: 7 additions & 0 deletions src/commands.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -135,6 +135,11 @@ option::Descriptor k = {
" -k, --interruption-fuzz \tNumber of interrupting intervals allowed before breaking contiguous set (default=0)"
};

option::Descriptor r = {
R, 0, "j", "--decay-rate", Arg::Numeric,
" -j, --decay-rate \tScore decay rate (default=0.001)"
};

option::Descriptor transform = {
TRANSFORM, 0, "x", "--transform", Arg::Required,
" -x, --transform \tTransform score (Synder requires additive scores):\n"
Expand Down Expand Up @@ -198,6 +203,7 @@ bool subcommand_filter(int argc, char* argv[])
descriptors::tclfile,
descriptors::qclfile,
descriptors::k,
descriptors::r,
descriptors::reverse,
descriptors::help,
{0,0,0,0,0,0}
Expand Down Expand Up @@ -290,6 +296,7 @@ bool subcommand_search(int argc, char* argv[])
descriptors::tclfile,
descriptors::qclfile,
descriptors::k,
descriptors::r,
descriptors::reverse,
descriptors::base_offsets,
descriptors::transform,
Expand Down
1 change: 1 addition & 0 deletions src/commands.h
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ enum optionIndex {
REVERSE,
BASE_OFFSETS,
K,
R,
TRANSFORM,
UNKNOWN,
VERSION,
Expand Down
8 changes: 4 additions & 4 deletions src/contig.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ void Contig::map(Feature& t_feat)
delete rc;
}

std::vector<SearchInterval> Contig::list_search_intervals(Feature& t_feat)
std::vector<SearchInterval> Contig::list_search_intervals(Feature& t_feat, double r)
{
// TODO -- need to move this back up to Contig

Expand All @@ -66,7 +66,7 @@ std::vector<SearchInterval> Contig::list_search_intervals(Feature& t_feat)
bool inbetween = rc->inbetween || rc->leftmost || rc->rightmost;
std::vector<SearchInterval> si;
for(auto &c : csets) {
si.push_back( SearchInterval(c->ends, &t_feat, inbetween) );
si.push_back( SearchInterval(c->ends, &t_feat, inbetween, r) );
}

delete rc;
Expand All @@ -75,9 +75,9 @@ std::vector<SearchInterval> Contig::list_search_intervals(Feature& t_feat)
return si;
}

void Contig::find_search_intervals(Feature& t_feat)
void Contig::find_search_intervals(Feature& t_feat, double r)
{
std::vector<SearchInterval> si = list_search_intervals(t_feat);
std::vector<SearchInterval> si = list_search_intervals(t_feat, r);
for(auto &s : si) {
s.print();
}
Expand Down
4 changes: 2 additions & 2 deletions src/contig.h
Original file line number Diff line number Diff line change
Expand Up @@ -33,10 +33,10 @@ class Contig {
void set_length(long length);

/** Print target regions from a given query */
void find_search_intervals(Feature& feat);
void find_search_intervals(Feature& feat, double r);

/** Print target regions from a given query */
std::vector<SearchInterval> list_search_intervals(Feature& feat);
std::vector<SearchInterval> list_search_intervals(Feature& feat, double r);

/** Write blocks overlapping intervals in intfile
*
Expand Down
18 changes: 8 additions & 10 deletions src/search_interval.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3,13 +3,14 @@
SearchInterval::SearchInterval(
const std::array<Block*,2>& t_ends,
Feature* t_feat,
bool t_inbetween
bool t_inbetween,
double r
)
: m_feat(t_feat),
m_inbetween(t_inbetween),
m_bnds(t_ends)
{
m_score = calculate_score(m_bnds[0]);
m_score = calculate_score(m_bnds[0], r);

reduce_side(LO);
reduce_side(HI);
Expand Down Expand Up @@ -190,7 +191,7 @@ void SearchInterval::get_si_bound(const Direction d)
// |===| | | |===|
// |<------>| far | near |<-->|
// |<-->| near | far |<------>|
double SearchInterval::flank_area(long near, long far, double k)
double SearchInterval::flank_area(long near, long far, double r)
{

// If far <= 0, this means there is no interval to score in this direction
Expand All @@ -208,17 +209,14 @@ double SearchInterval::flank_area(long near, long far, double k)
// the weight falls exponentially with distance from the query, e.g.
// $$ \int_{near}^{far} exp(-kx) dx $$
// which evaluates to the following:
return (1 / k) * ( exp(-1 * k * near) - exp(-1 * k * far) );
return (1 / r) * ( exp(-1 * r * near) - exp(-1 * r * far) );
}
return 0;
}

double SearchInterval::calculate_score(Block* b)
double SearchInterval::calculate_score(Block* b, double r)
{

// TODO ASSIGN THIS AS AN INPUT PARAMETER
double k = 0.001;

double score = 0;

if(b == nullptr)
Expand All @@ -243,15 +241,15 @@ double SearchInterval::calculate_score(Block* b)
// near difference := i1 = a1 - b2
// far difference := i2 = a1 - b1
// i1 and i2 may be negative, if query is not in the above position
double weighted_length = flank_area(a1 - b2, a1 - b1, k) +
double weighted_length = flank_area(a1 - b2, a1 - b1, r) +

// a1 a2
// |=========| b1 b2 query interval
// |....|---| syntenic interval
// |===| interval to score
// near difference := i1 = b1 - a2
// far difference := i2 = b1 - a1
flank_area(b1 - a1, b2 - a1, k) +
flank_area(b1 - a1, b2 - a1, r) +

// a1 a2
// |=========| query interval
Expand Down
7 changes: 4 additions & 3 deletions src/search_interval.h
Original file line number Diff line number Diff line change
Expand Up @@ -30,14 +30,15 @@ class SearchInterval : Interval<SearchInterval>
void get_si_bound(const Direction d);

void set_bound(Direction d);
double flank_area(long near, long far, double k);
double calculate_score(Block* blk);
double flank_area(long near, long far, double r);
double calculate_score(Block* blk, double r);

public:
SearchInterval(
const std::array<Block*,2>& t_ends,
Feature* t_feat,
bool t_inbetween
bool t_inbetween,
double t_r
);

~SearchInterval();
Expand Down
Loading

0 comments on commit c29d266

Please sign in to comment.