Importer

Eugenia Oshurko edited this page Jul 29, 2016 · 36 revisions

The importer converts data from a specific input format into a graph typed by a particular meta-model. The result of the import is an action graph (an instance of TypedDiGraph from the ReGraph Python library) and a collection of nuggets typed by the action graph.

Currently, the BioPAX importer is implemented. In the future, we would like to support import from other standard biological data formats and from biological data described in natural language (using NLP techniques).

BioPAX Importer

BioPAX utils

The biopax_utils.py file contains the utilities for accessing and managing BioPAX data with the help of the Paxtools Java library.

The BioPAXModel class is a wrapper around the Paxtools object representing the BioPAX data (Model in the Paxtools docs). When an instance of the BioPAXModel class is initialized, a Java virtual machine is started with the help of the Python library jpype, and all the necessary components of Paxtools are loaded into Python objects.

The BioPAXModel class encapsulates the methods for accessing BioPAX data and extracting the information we are interested in. It also includes the load method, which takes the path to a BioPAX file in OWL format and reads the data into the model.

List of the methods available:

  • load(self, filename) - Import a BioPAX model from the file.
  • is_protein_family(self, reference_id) - Check if the protein object is a family.
  • is_complex_family(self, complex_id) - Check if the complex object is a family.
  • is_fragment(self, protein_id) - Check if the protein object is a fragment (region).
  • is_flag(self, feature_id) - Check if the feature (PTM) is a state flag.
  • is_residue(self, feature_id) - Check if the feature (PTM) is a residue.
  • is_fragment_feature(self, feature_id) - Check if the feature (PTM) is a fragment (region) definition.
  • protein_reference_to_node(self, protein_reference_id) - Convert protein reference data to node extracting all the information of our interest: unified reference (for example UniProtID), name, cellular location.
  • region_to_node(self, region_feature_id) - Convert region data to node extracting the start position and the end position of the region.
  • residue_to_node(self, residue_id, protein_id=None) - Convert residue data to node extracting the location of the residue and trying to resolve the one-letter amino acid code.
  • flag_to_node(self, flag_id, protein_id=None) - Convert state flag data to node: create the state node and add as an attribute the name of the flag with the set of possible values 0 and 1. For example, the state node with the attribute 'ubiquitinylated lysine': [0, 1] is created because we do not know the location of the residue and therefore cannot create the residue node.
  • family_reference_to_node(self, family_reference_id) - Convert family reference data to node.
  • small_molecule_to_node(self, small_molecule_id) - Convert small molecule data to node (we extract unified reference, name and cellular location).
  • complex_to_node(self, complex_id) - Convert complex data to node.
  • modification_to_node(self, reaction_id, target) - Convert modification data to node extracting the unified reference to the database for this reaction, the direction of the reaction and the evidence.
  • get_modifications(self, physical_entities) - Get all residues and state flags of the set of protein physical entities.
  • residue_in_region(self, residue_id, region_id) - Test whether residue lies in the given region.
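
Two of the simpler helpers in this list can be approximated in plain Python, which may help when reading the method descriptions. The sketch below is not the BioPAXModel implementation: it works on plain integers and strings instead of Paxtools objects, and the function names residue_in_region_sketch and flag_to_node_sketch, as well as the node layout, are assumptions made for illustration.

```python
def residue_in_region_sketch(residue_loc, region_start, region_end):
    """Return True if a residue position lies inside a region (inclusive bounds).

    Assumption: positions are integers; an unknown position is passed as None.
    """
    if residue_loc is None or region_start is None or region_end is None:
        # Without a known position we cannot claim the residue is in the region.
        return False
    return region_start <= residue_loc <= region_end


def flag_to_node_sketch(flag_name):
    """Build a state node whose single attribute maps the flag name to [0, 1]."""
    # Hypothetical node layout: a dict with a type and an attribute dictionary.
    return {"type": "state", "attrs": {flag_name: [0, 1]}}


print(residue_in_region_sketch(45, 10, 120))
print(flag_to_node_sketch("ubiquitinylated lysine"))
```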

BioPAXImporter

An instance of the BioPAXImporter class is initialised with the meta-model (see metamodels.py) and the BioPAXModel object.

Example of usage:

from bioruler.library.importers import BioPAXImporter

a = BioPAXImporter()
action_graph, nuggets = a.import_model("<path_to_your_biopax_file>.owl")

Method import_model of the BioPAXImporter class performs the following sequence of actions and returns the action graph and the list of nuggets:

  1. Loads the BioPAXModel from the specified path.
  2. Collects the agents (returns a dictionary of the collected agent ids with the keys "proteins", "protein_families", "small_molecules", "small_molecule_families", "complexes", "complex_families")
  3. Collects the actions (MOD, BND, BRK etc)
  4. Generates the nodes in the action graph for each type of agents
  5. Generates the nodes in the action graph for each type of actions and returns the list of graphs - nuggets (mapping between the nodes in nuggets and the nodes in the action graph is implicit, meaning that the nodes possess the same ids).
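
The implicit typing mentioned in the last step can be illustrated without ReGraph. In this sketch graphs are reduced to sets of node ids, and implicit_typing is a hypothetical helper, not part of the importer; it only shows that a nugget node is typed by the action graph node carrying the same id.

```python
def implicit_typing(nugget_nodes, action_graph_nodes):
    """Map every nugget node to the action graph node with the same id."""
    missing = set(nugget_nodes) - set(action_graph_nodes)
    if missing:
        raise ValueError("nugget nodes without a type: %s" % sorted(missing))
    return {node: node for node in nugget_nodes}


# Hypothetical ids, chosen only for the example.
action_graph_nodes = {"Src", "Src_Y419", "Src_Y419_phospho", "mod_1", "FAK"}
nugget_nodes = {"Src", "Src_Y419", "Src_Y419_phospho", "mod_1"}

typing = implicit_typing(nugget_nodes, action_graph_nodes)
print(typing["mod_1"])
```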

Collection of agents

We have adopted the following strategy: first, we detect all the instances of interest (proteins, small molecules, complexes, etc.) and collect their ids into data structures; then we iterate through these collections of ids and generate nodes in our action graph and nuggets, using the data obtained from the BioPAX description of the object with each particular id. Here we describe the first step, namely the collection of the agents.

Proteins and protein families (BioPAXImporter.collect_proteins)

To construct the agent nodes corresponding to the proteins in the action graph, we take all the instances of the ProteinReference BioPAX class. With the methods available in Paxtools we can get all the physical entities that refer to any particular protein reference; these physical entities give us the list of all residues and state flags that the protein may possess.

While extracting the protein agents, we construct the following dictionary of ids of the objects of interest:

{
   <protein_reference_id>: {
            "regions": {
                 <region_id>: {
                     "residues": <set_of_residues_ids>,
                     "flags": <set_of_flags_ids>
                 },
                 ...
            },
            "residues": <set_of_residues_ids>,
            "flags": <set_of_flags_ids>
   },
   ...
}

So, each protein can possess a set of regions (named intervals of the protein sequence), residues (amino acids at the particular location, e.g., "aa": S, "loc": 45) and state flags (e.g., "active": [0,1]). At the same time each region can possess residues and state flags as well.
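
A minimal sketch of how such a dictionary could be filled, assuming residues arrive as (id, location) pairs and regions as (id, start, end) triples. The helper build_protein_entry and these input shapes are assumptions for illustration; the importer itself works with Paxtools objects. The rule used here, that a residue located inside a region is attached to that region and otherwise to the protein, is also an assumption.

```python
def build_protein_entry(regions, residues, flags):
    """Build one value of the protein dictionary.

    regions:  list of (region_id, start, end)
    residues: list of (residue_id, location)
    flags:    list of flag ids (attached to the protein itself in this sketch)
    """
    entry = {
        "regions": {
            region_id: {"residues": set(), "flags": set()}
            for region_id, _, _ in regions
        },
        "residues": set(),
        "flags": set(flags),
    }
    for residue_id, loc in residues:
        for region_id, start, end in regions:
            if start <= loc <= end:
                # Assumed rule: a residue inside a region belongs to that region.
                entry["regions"][region_id]["residues"].add(residue_id)
                break
        else:
            # No enclosing region: the residue belongs to the protein directly.
            entry["residues"].add(residue_id)
    return entry


proteins = {
    "protein_ref_1": build_protein_entry(
        regions=[("region_1", 1, 100)],
        residues=[("residue_S45", 45), ("residue_Y419", 419)],
        flags=["flag_active"],
    )
}
print(proteins["protein_ref_1"]["regions"]["region_1"]["residues"])
```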

From the data given in the instances of ProteinReference we can extract protein families as well. So, we construct the following dictionary:

{
    <family_id>: {
        "members": <set_of_family_member_ids>,
        "residues": <set_of_residue_ids>,
        "flags": <set_of_flag_ids>
    },
    ...
}

Small molecules and small molecule families (BioPAXImporter.collect_small_molecules)

Extraction of small molecules is much simpler than extraction of proteins, as small molecules do not possess any internal states. This method therefore extracts only the set of small molecule ids and their families (similarly to protein families).

Complexes and complex families (BioPAXImporter.collect_complexes)

From the BioPAX class Complex we extract the complex ids, the residues of any particular complex, and its state flags. As a result we obtain the following dictionary:

{
    <complex_id>: {
         "components": <set_of_component_ids>,
         "residues": <set_of_residue_ids>,
         "flags": <set_of_flag_ids>
    },
    ...
}

Complex families are extracted similarly to protein and small molecule families.

Collection of actions

Modifications (BioPAXImporter.collect_modifications)

BioPAX represents the following chemical events of interest: biochemical reactions (BiochemicalReaction) and complex assemblies (ComplexAssembly). These events are coupled with the objects inherited from the class Control which describe interactions where a particular entity regulates the chemical reaction or modifies another entity. Such control interactions include catalysis and modulation.

An instance of the BioPAX class Catalysis (and Modulation) consists of a controller and a controlled reaction, where the latter is given by its left-hand side and right-hand side. The left-hand side represents the input, and the right-hand side represents the output produced by the chemical reaction.

So far we extract three types of biochemical interactions:

  • If the left-hand side (LHS) and the right-hand side (RHS) of the controlled chemical reaction each consist of a single physical entity, and the entities on the LHS and the RHS refer to the same protein, we say that the controller modifies the state of this protein, and the modification is the difference between the state flags and residue states of the LHS and the RHS.

Example:

Controller: ER alpha/Gai/GDP/Gbeta gamma
	  ModificationFeature: [residue modification, active]
LHS:  Src
RHS:  Src
	 ModificationFeature: [O4'-phospho-L-tyrosine]@419

So we construct the nugget (and add the respective nodes and edges of the action graph) where the complex ER alpha/Gai/GDP/Gbeta gamma modifies the state phospho of the residue with "aa": Y, "loc": 419 of the protein Src.

  • If the LHS and the RHS consist of complexes whose components refer to the same entities but carry different modifications, we say that the controller modifies the states and the residues that differ between the LHS and the RHS.

Example:

Controller: PKA Family active
       ModificationFeature: [residue modification, active]
LHS: GMCSF/GMR alpha/CSF2RB/JAK2 (dimer)
RHS: GMCSF/GMR alpha/CSF2RB/JAK2 (dimer)
       ModificationFeature: [residue modification, active]

If we have a closer look at the components:

Controller: PKA Family active
       ModificationFeature: [residue modification, active]
LHS:
    GMCSF
    GMR alpha
    CSF2RB
    JAK2 (dimer)
RHS: 
    GMCSF
    GMR alpha
    CSF2RB
       ModificationFeature: [O-phospho-L-serine]@601
    JAK2 (dimer)

So we construct the nugget (and add the respective nodes and edges of the action graph) where protein family PKA Family active modifies state phospho of the residue with "aa": S, "loc": 601 of the protein CSF2RB.

  • If the LHS and the RHS consist of complexes with two components, where one is a small molecule and the other is a protein, a complex, or a family, and the states of the complexes on the LHS and the RHS differ, we say that the controller modifies the state of the second component according to the changes from the LHS to the RHS. (We call this a modification of phenomenological state.)

Example:

Controller: FAK
       ModificationFeature: [residue modification, active]
LHS: RhoA/GDP
       ModificationFeature: [residue modification, inactive]
RHS: RhoA/GTP
       ModificationFeature: [residue modification, active]

From such a biochemical reaction we construct a nugget (and add the respective nodes and edges of the action graph) where active protein FAK modifies state active of the protein RhoA.
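
The first two cases amount to comparing the features of each entity on both sides of the reaction. A rough sketch, assuming features are plain strings collected per entity name (the function feature_difference and this representation are illustrative assumptions, not the importer's data model):

```python
def feature_difference(lhs, rhs):
    """Return, per entity, the features present on the RHS but not on the LHS.

    lhs, rhs: dicts mapping an entity name to a set of feature strings.
    Entities that appear on only one side are ignored in this sketch.
    """
    changes = {}
    for entity in lhs.keys() & rhs.keys():
        gained = rhs[entity] - lhs[entity]
        if gained:
            changes[entity] = gained
    return changes


# Component-wise view of the second example: only CSF2RB gains a feature.
lhs = {"GMCSF": set(), "GMR alpha": set(), "CSF2RB": set(), "JAK2 (dimer)": set()}
rhs = {
    "GMCSF": set(),
    "GMR alpha": set(),
    "CSF2RB": {"[O-phospho-L-serine]@601"},
    "JAK2 (dimer)": set(),
}
print(feature_difference(lhs, rhs))
```

The sketch tracks only features gained on the RHS; features lost from the LHS to the RHS would need the symmetric comparison.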

As the result of this method the following dictionary is obtained:

{
   <reaction_id>: {
       "sources": {
            <controller_id>: <set_of_controller_states_and_residues>,
            ...
        },
       "targets": [(<modified_entity_id>, <modified_state_or_residue_id>), ...] 
   },
   ...
}
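
One way to read this structure is to flatten it into records, one per controller and modified state or residue. The sketch below uses made-up ids and a hypothetical helper iter_modifications; it is not part of BioPAXImporter.

```python
def iter_modifications(modifications):
    """Yield (controller_id, reaction_id, entity_id, state_or_residue_id) tuples."""
    for reaction_id, data in modifications.items():
        for controller_id in data["sources"]:
            for entity_id, state_or_residue_id in data["targets"]:
                yield controller_id, reaction_id, entity_id, state_or_residue_id


# Made-up ids following the dictionary layout above.
modifications = {
    "reaction_1": {
        "sources": {"complex_1": {"flag_active"}},
        "targets": [("protein_Src", "residue_419")],
    }
}

records = list(iter_modifications(modifications))
print(records)
```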

Extraction of other types of interactions is not implemented at the current stage.

Generation of agent nodes

Generation of agent nodes (and structural edges between agents) is performed separately for each type of agent. For this purpose the following methods of BioPAXImporter are implemented:

  • generate_proteins
  • generate_small_molecules
  • generate_complexes
  • generate_families

These methods take as input an appropriate collection of ids of the entities and their structural elements (for example, residues or state flags) and, with the help of the BioPAX utils implemented in the BioPAXModel class, generate the nodes and the edges of the action graph. All these methods also take a graph object as input and modify it in place.
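
The in-place convention can be sketched with a small stand-in for the graph object. SimpleGraph below is a hypothetical placeholder, not ReGraph's TypedDiGraph, and generate_proteins_sketch only mirrors the idea of walking the collected ids and attaching regions, residues and state flags to their owners. The edge direction (from component to owner) is an assumption.

```python
class SimpleGraph:
    """Hypothetical stand-in for the action graph: node types plus directed edges."""

    def __init__(self):
        self.nodes = {}     # node id -> node type
        self.edges = set()  # (source id, target id)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, source, target):
        self.edges.add((source, target))


def _attach(graph, owner_id, data):
    # Assumed convention: structural edges point from the component to its owner.
    for residue_id in data["residues"]:
        graph.add_node(residue_id, "residue")
        graph.add_edge(residue_id, owner_id)
    for flag_id in data["flags"]:
        graph.add_node(flag_id, "state")
        graph.add_edge(flag_id, owner_id)


def generate_proteins_sketch(proteins, graph):
    """Add agent, region, residue and state nodes to `graph` in place."""
    for protein_id, data in proteins.items():
        graph.add_node(protein_id, "agent")
        _attach(graph, protein_id, data)
        for region_id, region in data["regions"].items():
            graph.add_node(region_id, "region")
            graph.add_edge(region_id, protein_id)
            _attach(graph, region_id, region)


graph = SimpleGraph()
proteins = {
    "p1": {
        "regions": {"r1": {"residues": {"res1"}, "flags": set()}},
        "residues": set(),
        "flags": {"f1"},
    }
}
generate_proteins_sketch(proteins, graph)  # returns None; graph is modified in place
print(sorted(graph.nodes))
```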

Generation of action nodes + nugget generation

Generation of action nodes (and edges between agents and actions) is likewise performed separately for each type of action. For this purpose the following methods of BioPAXImporter are implemented:

  • generate_modifications
  • to be continued...

These methods generate the nodes and the edges from the previously extracted interaction ids and the ids of the components of each interaction. As with the generation of agents, they take a graph object as input and create new nodes and edges in place. In addition, these functions return the list of nuggets corresponding to the collection of interactions.

KappaImporter

Usage

Uncompile method

KappaImporter.uncompile(files, 
                        parser="bioruler/library/kappa_to_graph.byte", 
                        out="out", 
                        del_out=True, 
                        list_=True) 
  • files : list of files to parse
  • parser : path to the parser we want to use
  • out : path of output files needed for the import
  • del_out : if True, automatically delete created files before returning
  • list_ : if True, return a list of nuggets; otherwise return a single graph containing the nuggets

Example

from bioruler.library.importers import KappaImporter

action_graph, nuggets_list = KappaImporter.uncompile(["tests.ka"])