Reactome Pengine Examples

Introduction

In this notebook we first briefly explain the data model of the underlying Reactome data set. Secondly, we explain the data model that we have built into Reactome Pengine. We then explain where computations take place. Next we give some example queries that illustrate the capabilities of Reactome Pengine in the context of a SWISH notebook, including graphical rendering, R integration and Javascript applications. Also note that further functionality of the Pengine Reactome can be utilised by building client applications in the full desktop version of SWI-Prolog. The examples of functionality in this notebook are a fraction of the possible ways to use Reactome Pengine, intended to illustrate calls to the API.


Section A: Basic Usage


1. Underlying data model of Reactome

The underlying data model is based on a RDF triple graph. This is fully documented on the Reactome site: Reactome Data Model. In brief, the principle type of entity in Reactome is the reaction and reactions have inputs and output entities. Reactions are also optionally controlled. Each biological entity such as proteins, small molecules, and reactions are given an ID. Entities also include 'Complexes' and 'Protein sets'. A complex is a set of molecular entities that have combined together. A protein set is when a set of proteins can perform the same biological function. Both complexes and protein sets can also themselves be composed of complexes and protein sets. We can query this data directly with the rdf/3 predicate. To see how this is done click the blue play triangle in the cell below. You can then find further solutions, with 'next'.

:-use_module(library(pengines)). reactome_server('https://apps.nms.kcl.ac.uk/reactome-pengine'). rdf(X,Y,Z):- reactome_server(S), pengine_rpc(S,rdf(X,Y,Z),[]).
rdf(X,Y,Z).

The semantic web library of prolog has other access predicates to query this data, for example rdf_has/3. For more details see the documentation here.

2. Our Reactome Pengine data model

In addition to the direct method of querying the RDF file using the Prolog semantic web library, a number of higher level predicates built on top of this library are also provided. Often these higher level predicates give a more intuitive and compact view of the data. The full documentation for the API is here.

Our model states that reactions are nodes on a graph. There are edges between these nodes in two cases:

  1. When an output of a reaction is an input to another reaction. This edge type we name 'precedes'.
  2. An output of a reaction r1 is a control of another reaction r2. This results in a number of different edge types depending on how the output of r1 controls r2. The types are:
    • ACTIVATION
    • ACTIVATION-ALLOSTERIC
    • INHIBITION
    • INHIBITION-ALLOSTERIC
    • INHIBITION-COMPETITIVE
    • INHIBITION-NONCOMPETITIVE
We can view this graph for an individual pathway using ridPathway_links/2 shown below. We can use this predicate in all directions, i.e 1) enumerate pathways and link pairs (neither argument is instantiated), 2) find the links for a particular pathway (pathway argument is instantiated), 3) see which pathway has a set of links (links argument is instantiated), or 4) check whether a given pathway has a given set of links (both arguments are instantiated). Most predicates in the API are capable of multi-directional queries. Details are given in the Reactome Pengine API documentation.

ridPathway_links(P,Ls):- reactome_server(S), pengine_rpc(S,ridPathway_links(P,Ls)).

Generate pathways and a list of links in the pathway using:

ridPathway_links(P,L).

Generate pathways that have an activation link where the activating entity is 'Complex452' using:

ridPathway_links(P,_Ls), member(ridReaction_ridLink_type_ridReaction(R1, 'Complex452', 'ACTIVATION', R2),_Ls).

3. Converting between Reactome Identifiers and familiar names/identifiers

In order to convert between the Reactome identifiers and the common names of entities you can use the rid_name/2 predicate.

You may also want to convert Reactome protein identifiers to Uniprot identifiers, and for this you can use ridProtein_uniprotId/2.

Both of these predicates are full relations so a user can query in both directions and use the predicate to generate examples or to check that a pair are related.

The following two examples retrieve the name and Uniprot ID for the Reactome IDs 'Complex452' and 'Protein1', respectively.

reactome_server(S),pengine_rpc(S,rid_name('Complex452',Name),[]).
reactome_server(S),pengine_rpc(S,ridProtein_uniprotId('Protein1',UniprotId),[]).

4. Finding the Protein Complex that contains the Sonic the Hedgehog Protein

The Sonic the HedgeHog (SHH) protein is well known. If we want to query Reactome Pengine to find what Protein complex it takes part in we can use the following query.

reactome_server(S),pengine_rpc(S,(rid_name(RidSonic,"SHH"),ridProteinSet_component(RidProteinSet,RidSonic),ridComplex_component(RidComplex,RidProteinSet),rid_name(RidComplex,ComplexName)),[]).

5. Naming and reuse of queries

In Prolog a query is a call to a predicate (or as we have seen a conjunction of predicates). As with any Prolog program we can reuse a query and assign it a name by writing a new predicate definition containing the query. This is useful for complex queries as we can decompose the complexity into small parts.

A useful feature of Prolog is that if predicate definitions are restricted to the subset of pure Prolog, then it is possible to reason about the solutions to queries logically. This is useful for debugging and applying certain advanced machine learning algorithms for example Inductive Logic Programming.

In the following program we name the previous query as pathway_with_complex452_activation/1. This predicate is true if P is instantiated with a pathway that has an activation edge where the activation entity is complex452. We also define a predicate small_pathway/1 for pathways with less than 35 edges. We then define a new predicate, reusing pathway_with_complex452_activation/1 and small_pathway/1 to define small_pathway_with_complex452_activation/1. As this new predicate is made from the composition of pathway_with_complex452_activation/1 and small_pathway/1, we can logically reason that the new predicate will be true for the intersection of the answers to the original two predicates.

pathway_with_complex452_activation(P):- ridPathway_links(P,Ls), member(ridReaction_ridLink_type_ridReaction(_R1, 'Complex452', 'ACTIVATION', _R2),Ls). small_pathway(P):- ridPathway_links(P,L), length(L,S), S<35. small_pathway_with_complex452_activation(P):- small_pathway(P), pathway_with_complex452_activation(P).

Running these three queries allows you to see that the answers to the third query is the intersection of the answer to the first two queries as we expected.

pathway_with_complex452_activation(P).
small_pathway(P).
small_pathway_with_complex452_activation(P).

6. Where the execution of our programs takes place

As we are seeing, using Prolog allows us to specify programs and algorithms, not just queries (in contrast to SQL, Cypher, REST APIs and SPARQL).

When using SWISH our programs are either executed on Reactome Pengine or on the SWISH server (itself a pengine application). When using the desktop version of SWI Prolog, programs are either executed on Reactome Pengine or your local machine. The idea is that small programs can be brought to the large data, rather than needing to transfer large datasets to a users machine. This, alongside the ability to send programs to the pengine, makes for an extremely flexible logical API. Both SWISH and Reactome Pengine have a time limit on queries. So far we have made simple queries of Reactome Pengine, and executed our programs in SWISH. In order to reduce the data transfer, and to send a program to the Pengine Reactome we use the 3rd argument of pengine_rpc/3.

Here we write a program to find a path on the graph of reactions across the whole Reactome. The predicate path_program/1 returns the program we wish to run, as a list of clauses. The predicate path_from_to/3 retrieves the server address and program and sends this, along with the query, to Reactome Pengine. The identified path is then returned.

To perform the same query without using Reactome Pengine the entire database would need to be downloaded.

:-use_module(library(pengines)). path_program(Program):- Program=[ (:- meta_predicate path(2,?,?,?)), (:- meta_predicate path(2,?,?,?,+)), (graph_path_from_to(P_2,Path,From,To):-path(P_2,Path,From,To)), (path(R_2, [X0|Ys], X0,X):-path(R_2, Ys, X0,X, [X0])), (path(_R_2, [], X,X, _)), (path(R_2, [X1|Ys], X0,X, Xs) :- call(R_2, X0,X1),non_member(X1, Xs),path(R_2, Ys, X1,X, [X1|Xs])), (non_member(_E, [])), (non_member(E, [X|Xs]) :- dif(E,X),non_member(E, Xs)), ( e(R1,R2):-ridReaction_ridLink_type_ridReaction(R1,_,_,R2)) %this makes a two place edge term ]. %Send a program and a query to the pengine reactome and return the result. path_from_to(Path,From,To):- reactome_server(Server), path_program(Program), Query=graph_path_from_to(e,Path,From,To), pengine_rpc(Server,Query,[src_list(Program)]).
path_from_to(Path,From,To).

The following query performs a breadth first search to find the shortest paths (click next to see successive results):

length(Path,_),path_from_to(Path,From,To).

Section B: Advanced Usage


7. Definite Clause Grammars

In Prolog we can describe lists declaratively, so for instance, we can write a definite clause grammar (DCG) for paths. These have a slightly different syntax to standard Prolog predicates, but can also be sent to The Pengine Reactome. For simplicity of presentation the DCG in the example below runs on the SWISH server. This example finds paths that pass through a reaction and that satisfy a user defined rule. In this case we simply ask for a path that passes through a reaction that has CTDP1 (Protein11301) as an input. Additionally we add 'Path=[_,_|_].' to the query. This serves as a constraint to find paths with at least two steps.

ridReaction_input(Rid,I):- reactome_server(S), pengine_rpc(S,ridReaction_input(Rid,I),[]). okPath --> begin, needs,end. begin -->[]. begin -->[_],begin. needs -->{ridReaction_input(R,'Protein11301')},[R]. end --> []. end --> [_],end.
path_from_to(Path,From,To), phrase(okPath,Path), Path=[_,_|_].

8. Rendering the query results

Graphviz

So far, in this tutorial, we have been representing the reaction graph as lists and terms. While this is useful for computations, for conveying results to human users it is often better to use graphical representations. To this end, in SWISH we can use The Graphviz renderer. Additionally, to make the graph visualisation more meaningful, we map the Reactome IDs to the reaction names.

:- use_rendering(graphviz). graphviz_program(Program):- Program =[ (linktype_color(X,green):- member(X,['ACTIVATION', 'ACTIVATION-ALLOSTERIC'])), (linktype_color(X,red):- member(X,['INHIBITION', 'INHIBITION-ALLOSTERIC', 'INHIBITION-COMPETITIVE', 'INHIBITION-NONCOMPETITIVE'])), (linktype_color(precedes,black)), (link_graphvizedge(ridReaction_ridLink_type_ridReaction(R1, _, Type, R2),edge(R1Name->R2Name,[color=Color])):- linktype_color(Type,Color), rid_name(R1,R1Name), rid_name(R2,R2Name)) ]. pathway_graphviz(P,G):- reactome_server(S), graphviz_program(Program), pengine_rpc(S,(ridPathway_links(P,L),maplist(link_graphvizedge,L,GE)),[src_list(Program)]), G = digraph(GE).
pathway_graphviz(X,G).

Charting data with C3

We can chart data from Reactome using C3, a Javascript visualisation library. The following code makes a simple plot showing the number of reactions in a set of pathways.

:- use_rendering(c3). pathways(P):- P=['Pathway1','Pathway2','Pathway5','Pathway12']. pathway_pairsize(P,PName-L):- reactome_server(S), pengine_rpc(S,(ridPathway_reactions(P,R),rid_name(P,PName)),[]), length(R,L). chart(Chart):- pathways(P), maplist(pathway_pairsize,P,Pairs), Chart = c3{data:_{x:elem, rows:[elem-count|Pairs], type:bar}, axis:_{x:_{type:category}}}.
chart(C).

9. Scraping the web and integrating with Reactome Pengine

Scraping the traditional HTML web for data makes it possible to combine many existing remote data sites with Reactome Pengine. Web scraping inside SWISH is currently limited to using the predicate load_html/3, whereas if you build a client application with the full version of SWI Prolog then predicates such as http_open/3, are available. After retrieving the HTML using either of these methods, we can then utilise libraries sgml and xpath to manipulate the data.

In the example below we scrape gene expression data from the GEO website for sample GSM38051. We also demonstrate how results can be displayed with the SWISH table renderer.

:- use_module(library(sgml)). :- use_module(library(xpath)). elem_in(URL, Elem,X,Y) :- load_html(URL, DOM, []), xpath(DOM, //'*'(self), element(Elem,X,Y)). url('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?view=data&acc=GSM38051&id=4196&db=GeoDb_blob01'). splitter_row_cols(S,R,C):- split_string(R,S,'',C). url_datatable(U,Table):- url(U), elem_in(U,pre,_,[_One,_Two,_Three,_Four,_Five,Data|_Rest]), split_string(Data,'\n','',Rows), maplist(splitter_row_cols('\t'),Rows,Table). row_pair([X,Y,_,_],X-Y). assoc_key_val(Assoc,Key,Value):- get_assoc(Key, Assoc,Value). assoc_key_val(Assoc,Key,na):- \+get_assoc(Key,Assoc,_Value).
:- use_rendering(table).
length(List,10),append([_|List],_Rest,_All),url_datatable(_U,_All).

We can now integrate our webscraping and Reactome Pengine in a single query. Here we find the expression values for the probes for Protein11042.

reactome_server(_S),length(_List,12050),append([_|_List],_Rest,_All),url_datatable(_U,_All), maplist(row_pair,_List,_Pairs),list_to_assoc(_Pairs,_Assoc),pengine_rpc(_S,ridProtein_probelist('Protein11042',Probelist),[]), maplist(atom_string,Probelist,_ProbelistStrings),maplist(assoc_key_val(_Assoc),_ProbelistStrings,Valuelist).

10. R Integration

The R programming language is built into SWISH, which means that we can perform statistical analysis using familiar tools. For full details see here. In the example below we query the number of edges in a set of pathways and the number of reactions in the same set. We then use R to calculate the correlation, fit a line and plot these using R's qplot function.

:- <- library("ggplot2"). program(Program):- Program=[(pathways_30(P):- findnsols(30,Rid,rid_type_iri(Rid,'Pathway',_),P)), (ridPathway_reaction_(P,R):-ridPathway_links(P,L),member(E,L),E=..[_,R,_,_,_]), (ridPathway_reaction_(P,R):-ridPathway_links(P,L),member(E,L),E=..[_,_,_,_,R]), (ridPathway_reactions_(P,Rs):- setof(R,ridPathway_reaction_(P,R),Rs)), (pathway_numberedges_numberreactions(P,NE,NR):- ridPathway_links(P,Ls),length(Ls,NE), ridPathway_reactions_(P,Rs),length(Rs,NR) ), (xs_ys(Xs,Ys):-pathways_30(P),maplist(pathway_numberedges_numberreactions,P,Xs,Ys))]. get_data(Xs,Ys):- reactome_server(S), program(Program), pengine_rpc(S,xs_ys(Xs,Ys),[src_list(Program)]).

Run this query to see the correlation:

get_data(NumberOfEdges,NumberOfReactions), Correlation <- {|r(NumberOfEdges,NumberOfReactions)||cor(NumberOfEdges,NumberOfReactions)|}.

Run this query to plot the data with a fitted line:

get_data(NumberOfEdges,NumberOfReactions), <-qplot(NumberOfEdges,NumberOfReactions,geom=c("point","smooth"),xlab="Number of Edges", ylab="Number of Reactions").

11. Integrating with Javascript (and the D3 library)

SWISH allows the inclusion of Javascript code, which means we can use libraries such as d3. Therefore, we can make interactive charts and web applications with Reactome data inside a SWISH notebook. To see the Javascript code double click this text and scroll down to the 'script' tags.

We illustrate this functionality by combining Pengine Reactome and webscraping to build a simple application to show Hive Plots.

In Hive Plots the geometric placement of nodes on a graph has meaning, based on user defined rules. As we are using Prolog, we can build these rules easily. In the example below we visualise two features of reactions. These features are mapped to the geometric placement of nodes in the graph. The first feature is based on the network properties (the degree) of the reaction nodes. A reaction is assigned to one of three categories:

  1. Reactions that have a larger outdeggree than indegree
  2. Reactions that have a larger indegree than outdegree.
  3. Reactions that have an equal in and outdegree.

This first feature is illustrated by placing nodes on one of three axes: Vertical axis for category 1, 4 o'clock axis for category 2, and 8 o'clock axis for category 3.

For the second feature, we use the data of gene expression that we have scraped from the GEO website and perform the following steps:

  1. Query Reactome Pengine for the set of probes that code for the set of proteins that are inputs for each reaction in the graph.
  2. For each probe we retrieve the expression levels from the scraped GEO data.
  3. Calculate the sum of the expression levels of the probes of each reaction.
  4. Rescale the summations to be between 0 and 1.
This second feature is illustrated by the distance of the reaction node from the centre of the graph.

To see the visualisation select a pathway from the drop down menu and click show pathway. The visualisation enables comparison of the graph properties and expression levels of different pathways.

### The data access program for the Hive Plot
:-use_module(library(pengines)). ridPathway(X):- reactome_server(S), pengine_rpc(S,rid_type_iri(X,'Pathway',_),[]). item_index_itemindex(I,In,I-In). edgelist_hive(Edges,hive(Terms,NewEdges)):- setof(Node,E^(member(E,Edges),edge_node(E,Node)),Nodes), length(Nodes,L), L2 is L-1, numlist(0,L2,Indexes), maplist(item_index_itemindex,Nodes,Indexes,Pairs), findall(edge(X,Y),(member(ridReaction_ridLink_type_ridReaction(R1,_,_,R2),Edges), member(R1-X,Pairs), member(R2-Y,Pairs)), NewEdges), maplist(graph_node_group(NewEdges),Pairs,Groups), maplist(pair_probelist,Pairs,Probelists), url_datatable(_U,Table), maplist(datatable_probelists_valuelists(Table),Probelists,_Valueslists,Sums), max_list(Sums,Max), maplist(divide(Max),Sums,Scaled), maplist(pair_group_expressions_term,Pairs,Groups,Scaled,Terms). %get max of sum %maplist divide by max to get rescaled values. divide(X,Y,Z):- Z is Y/X. ridPathway_hive(P,Hive):- ridPathway_links(P,List), edgelist_hive(List,Hive). pair_group_expressions_term(X-Y,Z,Q,node(X,Y,Z,Q)). edge_node(ridReaction_ridLink_type_ridReaction(_, _, _, R2),R2). edge_node(ridReaction_ridLink_type_ridReaction(R1, _, _, _),R1). graph_node_indegree(G,N,D):- findall(X,member(edge(X,N),G),Xs), length(Xs,D). graph_node_outdegree(G,N,D):- findall(X,member(edge(N,X),G),Xs), length(Xs,D). graph_node_group(G,_-N,Group):- graph_node_indegree(G,N,I), graph_node_outdegree(G,N,O), in_out_group(I,O,Group). in_out_group(In,In,0). in_out_group(In,Out,1):- In>Out. in_out_group(In,Out,2):- In <Out. pair_probelist(RidReaction-_,ProbelistString):- reactome_server(S), pengine_rpc(S,ridReaction_inputProbelist(RidReaction,Probelist),[]), maplist(atom_string,Probelist,ProbelistString). datatable_probelists_valuelists(D,P,Values,Sum):- findall(Value,(member(X,P),member([X,Value,_,_],D)),Values), maplist(number_string,NumberValues,Values), sum_list(NumberValues,Sum).

To see the term that is used by the Javascript to make the hive plot run the following query.

ridPathway_hive('Pathway21',H).

Summary

Reactome Pengine is a web-logic API to query the human reactome. It provides both raw RDF data access and a set of built-in predicates to facilitate this. The pengine technology allows users to send entire programs to the API to augment the built in data and predicates. It can do this from a local client program, or as shown here, in a cloud hosted SWISH notebook. Programs hosted on SWISH notebooks allow for easy sharing and the ability to render query solutions graphically. In contrast, programs running in SWI-Prolog client applications have access to the full power of Prolog — including system calls and they also maintain privacy of local computations. Either of these options allow researchers to perform analysis of data that requires querying the human reactome and integrating with other data sources. In fact it is also possible to build a web application that interfaces with Reactome Pengine, for instance, a GUI web-based tool that makes predictions using Reactome data.