initial migration from internal bitbucket

Novartis · Sep 27, 2018 · 1b03348
1 parent 96b8888
commit 1b03348

Showing 264 changed files with 43,475 additions and 2 deletions.
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,13 @@
+Copyright 2018 Novartis Institutes for Biomedical Research (NIBR)
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
diff --git a/Makefile b/Makefile
@@ -0,0 +1,16 @@
+install:
+	conda env create -f ./environment.yml
+	cd ui && npm install && cd node_modules/higlass && npm install && npm run build
+
+update:
+	conda env update
+	cd ui && rm -rf node_modules/higlass && npm install && cd node_modules/higlass && npm install && npm run build
+
+build:
+	cd ui && npm run build
+
+example-2_4kb:
+	cd examples && ./prepare-2_4kb.py && cd .. && ./start.py --debug --config=examples/config-dnase-seq-2_4kb.json
+
+example-12kb:
+	cd examples && ./prepare-12kb.py && cd .. && ./start.py --debug --config=examples/config-chip-seq-12kb.json
diff --git a/README.md b/README.md
@@ -1,2 +1,211 @@
-# peax
-Peax is a tool for interactive concept learning and exploration of epigenomic patterns based on unsupervised featurization with autoencoders
+# Peax: a pattern explorer for epigenomic data
+
+![Peax's UI](teaser.png)
+
+> Epigenomic data expresses a rich body of diverse patterns, but
+  extracting patterns genome wide is limited, as specialized algorithms
+  are required or the expressiveness is low. Peax is a tool
+  for interactive concept learning and exploration of epigenomic
+  patterns based on unsupervised featurization with autoencoders.
+  Genomic regions are manually labeled for actively learning feature
+  weights to build custom classifiers based on your notion of
+  interestingness.
+
+## Installation
+
+```
+git clone https://bitbucket.org/novartisnibr/sda-peax/ peax && cd peax
+make install
+```
+
+_Do not fear, `make install` is just a convenience function for setting up conda and installing npm packages._
+
+**Note:** If you're a macOS user you might need to [brew](https://brew.sh) install `libpng` and `openssl` for the [pybbi](https://github.com/nvictus/pybbi) package. Also see [here](https://github.com/nvictus/pybbi/issues/2).
+
+## Overview
+
+Peax consists of four main parts:
+
+1. A Python module for creating an autoencoder. [[/ae](ae)]
+2. A set of example notebooks demonstrating how to create autoencoders. [[/notebooks](notebooks)]
+3. A Flask-based server application for serving genomic and autoencoded data on the web. [[/server](server)]
+4. A React-based user interface for exploring, visualizing, and interactively labeling genomic regions. [[/ui](ui)]
+
+## Getting started
+
+### Quick start
+
+Peax comes with 2 example autoencoders for which we provide convenience scripts
+to get you started as quickly as possible.
+
+1. ChIP-seq encoder for 12Kb genomic windows at 100bp binning.
+
+   ```
+   make example-12kb
+   ```
+
+2. DNase-seq encoder for 2.4Kb genomic windows at 100bp binning.
+
+   ```
+   make example-2_4kb
+   ```
+
+The convenience scripts will download test ENCODE tracks and use the matching
+configuration to start the server.
+
+_Note: if you need to run Peax on a hostname or port other than `localhost:5000`, you need to [manually start Peax](#start-peax) using the respective configs, which you can find in the [Makefile](Makefile)._
+
+### Slow start
+
+#### Build an autoencoder
+
+_Note: right now only Keras-based autoencoders are supported._
+
+First you need to create an autoencoder for your datasets' content types. The
+design of the autoencoder is entirely up to you. To get started take a look at
+the [notebooks](/notebooks) that we provide.
+
+After creating your autoencoder save its model as an HDF5 file.
+
+#### Configure Peax with your data
+
+Next you need to configure Peax with your data. The main reason for doing so is to tell Peax which tracks you want to visualize in HiGlass and which of those tracks are encodable using an autoencoder.
+
+The fastest way to get started is to copy the example config:
+
+```
+cp config.json.sample config.json
+```
+
+The main parts to adjust are `aes` and `datasets`. AE stands for autoencoder, and
+`aes` is a list of autoencoder definitions, e.g., you could have multiple encoders
+for different data types. The required format for autoencoders is as follows:
+
+| Field        | Description                                                                                                                                   | Defaults | Dtype |
+|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------|----------|-------|
+| encoder      | Relative path to your saved encoder model (HDF5 file).                                                                                        |          | str   |
+| decoder      | Relative path to your saved decoder model (HDF5 file).                                                                                        |          | str   |
+| content_type | Unique string describing the content this autoencoder can handle. Data tracks with the same content type will be encoded by this autoencoder. |          | str   |
+| window_size  | Window size in base pairs used for training the autoencoder.                                                                                  |          | int   |
+| resolution   | Resolution or bin size of the window in base pairs.                                                                                           |          | int   |
+| latent_dim   | Number of latent dimensions of the encoded windows.                                                                                           |          | int   |
+| input_dim    | Number of input dimensions for Keras. For 1D data these are 3: samples, data length (which is `window_size` / `resolution`), channels.        | 3        | int   |
+| channels     | Number of channels of the input data. This is normally 1.                                                                                     | 1        | int   |
+
+**Example:**
+
+```javascript
+{
+  "encoder": "path/to/my-12kb-chip-seq-encoder.h5",
+  "decoder": "path/to/my-12kb-chip-seq-decoder.h5",
+  "content_type": "chip-seq-pval",
+  "window_size": 12000,
+  "resolution": 100,
+  "channels": 1,
+  "input_dim": 3,
+  "latent_dim": 12
+}
+```
+
+Datasets require the following format:
+
+| Field        | Description                                                                                                                                                       | Dtype |
+|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
+| filepath     | Relative path to your track data file (bigwig or bigbed).                                                                                                         | str   |
+| content_type | Unique string describing this dataset's content. If you want to search for patterns in this track you need to have an autoencoder with a matching content type.   | str   |
+| uuid         | A unique string identifying your track. (Optional)                                                                                                                | str   |
+| name         | A human-readable name to be shown in HiGlass. (Optional)                                                                                                         | str   |
+
+**Example:**
+
+```javascript
+{
+  "filepath": "data/chip-seq/my-fancy-gm12878-chip-seq-h3k27c-track.bigWig",
+  "content_type": "chip-seq-pval",
+  "uuid": "my-fancy-gm12878-chip-seq-h3k27c-track",
+  "name": "My Fancy GM12878 ChIP-Seq H3k27c Track"
+}
+```
+
+#### Start Peax
+
+Finally, start the Peax server to run the application:
+
+```
+./start.py --config=your-config.json
+```
+
+Start supports the following options:
+
+```
+usage: start.py [-h] [-e ENCODER] [-d DATASET] [-w WINDOWSIZE] [-r RESOLUTION]
+                [-s STEPSIZE] [-c CHROMS] [--config CONFIG] [--clear]
+                [--debug] [--host HOST] [--port PORT] [--verbose]
+
+Peak Explorer CLI
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -e ENCODER, --encoder ENCODER
+                        path to saved encoder
+  -d DATASET, --dataset DATASET
+                        path to saved dataset (bigwig)
+  -w WINDOWSIZE, --windowsize WINDOWSIZE
+                        path to saved dataset (bigwig)
+  -r RESOLUTION, --resolution RESOLUTION
+                        number of bp per bin
+  -s STEPSIZE, --stepsize STEPSIZE
+                        relative to window, e.g., `2` => `windowsize / 2 =
+                        stepsize in bp`
+  -c CHROMS, --chroms CHROMS
+                        comma-separated list of chromosomes to search over
+  --config CONFIG       use config file instead of args
+  --clear               clears the db on startup
+  --debug               debug flag
+  --host HOST           Customize the hostname
+  --port PORT           Customize the port
+  --verbose             verbose flag
+```
+
+The `hostname` defaults to `localhost` and the `port` of the backend server defaults to `5000`.
+
+
+## Development
+
+Handy commands to keep in mind:
+
+- `make install` installs the conda environment and npm packages and builds HiGlass
+- `make update` updates the conda environment and npm packages and rebuilds HiGlass
+- `make build` rebuilds the Peax UI
+- `./start.py` starts the Flask server application for serving data
+- [/ui]: `npm install` installs and updates all the needed packages for the frontend
+- [/ui]: `npm run build` creates the production build of the frontend
+- [/ui]: `npm start` starts a dev server with hot reloading for the frontend
+
+To start developing on the server and the UI in parallel, first start the backend server application using `./start.py` and then start the frontend server application from `./ui` using `npm start`. Both servers watch the source code, so whenever you change something they will reload.
+
+### Configuration
+
+There are 2 types of configuration files. The [backend server configuration](#configure-peax-with-your-data) defines which datasets to explore and is described in detail [above](#configure-peax-with-your-data).
+
+Additionally, the frontend application can be configured to talk to a different backend server and port if needed. Get started by copying the example configuration:
+
+```
+cd ui && cp config.json.sample config.json
+```
+
+By default the `server` is dynamically set to the hostname of the server running the frontend application. I.e., it is assumed that the backend server application is running on the same host as the frontend application. The `port` of the server defaults to `5000`.
+
+### Start the backend and frontend apps
+
+For development, the backend and frontend applications run as separate server
+applications.
+
+```
+# Backend server
+./start.py --config=config.json --debug
+
+# Frontend server
+cd ui && npm start
+```
+
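Review note on the README's configuration section: the two tables describe the `aes` and `datasets` entries, and a dataset can only be searched when an autoencoder with the same `content_type` exists. The following is a minimal sketch of that relationship, not Peax's own validation code. The helper names (`missing_fields`, `searchable_datasets`, `input_shape`) are hypothetical, the second dataset entry is made up for illustration, and treating every field without a listed default as required is an assumption.

```python
# Field names come from the README tables. Treating fields without a
# default as required is an assumption, not documented Peax behavior.
REQUIRED_AE_FIELDS = (
    "encoder", "decoder", "content_type",
    "window_size", "resolution", "latent_dim",
)
REQUIRED_DATASET_FIELDS = ("filepath", "content_type")


def missing_fields(config):
    """List the required fields that are absent from each config entry."""
    problems = []
    for key, required in (
        ("aes", REQUIRED_AE_FIELDS),
        ("datasets", REQUIRED_DATASET_FIELDS),
    ):
        for i, entry in enumerate(config.get(key, [])):
            for field in required:
                if field not in entry:
                    problems.append("{}[{}] is missing '{}'".format(key, i, field))
    return problems


def searchable_datasets(config):
    """Return the file paths of datasets with a matching autoencoder."""
    encodable = {ae["content_type"] for ae in config.get("aes", [])}
    return [
        ds["filepath"]
        for ds in config.get("datasets", [])
        if ds["content_type"] in encodable
    ]


def input_shape(ae, samples=1):
    """Shape of 1D input data: (samples, data length, channels)."""
    data_length = ae["window_size"] // ae["resolution"]
    return (samples, data_length, ae.get("channels", 1))


config = {
    "aes": [{
        "encoder": "path/to/my-12kb-chip-seq-encoder.h5",
        "decoder": "path/to/my-12kb-chip-seq-decoder.h5",
        "content_type": "chip-seq-pval",
        "window_size": 12000,
        "resolution": 100,
        "latent_dim": 12,
    }],
    "datasets": [
        {
            "filepath": "data/chip-seq/my-fancy-gm12878-chip-seq-h3k27c-track.bigWig",
            "content_type": "chip-seq-pval",
        },
        {
            # Hypothetical track without a matching autoencoder
            "filepath": "data/other/some-other-track.bigWig",
            "content_type": "some-other-type",
        },
    ],
}

print(missing_fields(config))
print(searchable_datasets(config))
print(input_shape(config["aes"][0]))
```

In this sketch the second dataset is not reported as searchable because no autoencoder declares its content type, which mirrors the README's description of `content_type`.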
diff --git a/ae/__init__.py b/ae/__init__.py
@@ -0,0 +1,12 @@
+"""
+Copyright 2018 Novartis Institutes for BioMedical Research Inc.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+"""
diff --git a/ae/bigwig.py b/ae/bigwig.py
@@ -0,0 +1,70 @@
+"""
+Copyright 2018 Novartis Institutes for BioMedical Research Inc.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+"""
+
+import bbi
+import math
+import numpy as np
+
+
+def chunk(bigwig, window_size, step_size, aggregation, chroms, verbose=False):
+    base_bins = math.ceil(window_size / aggregation)
+
+    chrom_values = []
+
+    for chrom in chroms:
+        if chrom not in bbi.chromsizes(bigwig):
+            # Chrom is absent, so indexing chromsizes would raise KeyError
+            print(
+                "Skipping chrom (not in bigWig file):",
+                chrom,
+            )
+            continue
+
+        chrom_size = bbi.chromsizes(bigwig)[chrom]
+
+        values = np.zeros(
+            (math.ceil((chrom_size - step_size) / step_size), base_bins)
+        )
+        starts = np.arange(0, chrom_size - step_size, step_size)
+        ends = np.append(
+            np.arange(window_size, chrom_size, step_size), chrom_size
+        )
+        bins = base_bins  # integer bin count, matching the width of `values`
+
+        # Extract all but the last window in one batch (faster than calling
+        # `fetch` in a loop)
+        values[:-1] = bbi.stackup(
+            bigwig,
+            [chrom] * (starts.size - 1),
+            starts[:-1],
+            ends[:-1],
+            bins=bins,
+            missing=0.0,
+        )
+        final_bins = math.ceil((ends[-1] - starts[-1]) / aggregation)
+        # Extract the last window separately because its size is likely to be
+        # different from the others
+        values[-1, :final_bins] = bbi.fetch(
+            bigwig, chrom, starts[-1], ends[-1], bins=final_bins, missing=0.0
+        )
+
+        if verbose:
+            print(
+                "Chrom: {}".format(chrom),
+                "# win: {}".format(values.shape[0]),
+                "Max:   {}".format(np.max(values)),
+            )
+
+        chrom_values.append(values)
+
+    return chrom_values
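Review note on `chunk`: the window arithmetic is easier to follow in isolation. The sketch below reproduces the `starts`, `ends`, and `final_bins` calculations with plain Python lists instead of NumPy and without reading a bigWig file. The function names `window_bounds` and `bins_per_window` are hypothetical. One observation from working through the arithmetic, stated here as an assumption rather than documented behavior: the `starts` and `ends` arrays appear to have equal length only when `window_size` is twice `step_size`, so the sketch raises an error otherwise.

```python
import math


def window_bounds(chrom_size, window_size, step_size):
    """Start and end coordinates of the sliding windows over a chromosome."""
    # Same ranges as np.arange(...) in `chunk`, for integer inputs
    starts = list(range(0, chrom_size - step_size, step_size))
    ends = list(range(window_size, chrom_size, step_size)) + [chrom_size]
    if len(starts) != len(ends):
        raise ValueError(
            "starts ({}) and ends ({}) differ in length".format(
                len(starts), len(ends)
            )
        )
    return starts, ends


def bins_per_window(starts, ends, aggregation):
    """Number of bins each window covers at the given bin size in bp."""
    return [
        math.ceil((end - start) / aggregation)
        for start, end in zip(starts, ends)
    ]


starts, ends = window_bounds(chrom_size=1100, window_size=400, step_size=200)
bins = bins_per_window(starts, ends, aggregation=100)

print(starts)
print(ends)
print(bins)
```

With these example values the last window is shorter than the others, which is why `chunk` fetches it separately and fills only the first `final_bins` columns of the final row.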