You can do a number of complex queries on the GO data. Here are some examples of just a few of them. For more information, read the schema documentation in GO Schema Documentation
Every example query comes with a sample piece of SQL. The SQL can be typed directly into the MySQL command line interpreter. For some queries we also provide an example of how to achieve the same thing programmatically using the go-db-perl API. These examples can be typed interactively in perl using the GOshell.pl script; or they can be used with a program that uses the go-db-perl API.
Some of the example queries have a DBStag Template. See the README in the stag-templates directory. You can use stag templates on your unix command line (if you follow the instructions in the README), or by selecting the appropriate template from the GO DBStag Demo.
Find the child terms of blood coagulation:
SELECT child.*
FROM term AS parent,
term2term,
term AS child
WHERE
parent.name = 'blood coagulation' AND
parent.id = term2term.term1_id AND
child.id = term2term.term2_id
You can use the stag-template go-child_term
selectall_xml.pl -d go /go-child_term -h
selectall_xml.pl -d go /go-child_term GO:0008045
selectall_xml.pl -d go /go-child_term term_name='blood coagulation'
The GO Graph is stored using the "term" (=node) and "term2term" (=arc) tables. Queries such as the above present a problem in that they are *recursive* - it is impossible to do them in a single query using the two tables above. Fortunately GO also includes a layer of *denormalization* which includes a table called "graph_path" (see GO Schema Documentation ), which stores all paths between a term and all its ancestors.
SELECT
rchild.*
FROM
term AS rchild, term AS ancestor, graph_path
WHERE
graph_path.term2_id = rchild.id and
graph_path.term1_id = ancestor.id and
ancestor.name = 'blood coagulation'
Or if you prefer using perl, fire up GoShell or write a perl script to do this:
$term = $apph->get_term({name=>"blood coagulation"});
$graph = $apph->get_graph_below($term->acc);
$terms = $graph->get_all_nodes;
foreach $t (@$terms) { printf "term:%s\n", $t->name }
see GO::Model::Graph for more on the graph object (note the code above will work even if the graph_path table is not populated)
You can use the stag-template go-descendent_term
selectall_xml.pl -d go /go-descendent_term -h
selectall_xml.pl -d go /go-descendent_term GO:0008045
selectall_xml.pl -d go /go-descendent_term term_name='blood coagulation'
fetching ancestors is similar to finding descendants. try this for the ancestors of DNA Helicase:
SELECT p.* FROM graph_path INNER JOIN term AS t ON (t.id = graph_path.term2_id) INNER JOIN term AS p ON (p.id = graph_path.term1_id) WHERE t.name = 'DNA Helicase activity';
this is the reflexive closure so it includes DNA Helicase itself - to get the non-reflexive closure (ie to filter out the central term) add the constraint graph_path.distance > 0
You can use the stag-template go-descendent_term - the same template can find either descendents or ancestors
selectall_xml.pl -d go /go-descendent_term descendent_term_acc=GO:0008045
selectall_xml.pl -d go /go-descendent_term descendent_term_name='DNA Helicase activity'
The previous query showed how to find all the ancestor terms for a term; what if you want to find the ancestor terms, and the relationships for those terms? This is useful for getting a sub-graph from the whole DAG.
SELECT p.*,r.* FROM graph_path INNER JOIN term AS t ON (t.id = graph_path.term2_id) INNER JOIN term AS p ON (p.id = graph_path.term1_id) INNER JOIN term2term AS r ON (r.term2_id = p.id) WHERE t.name = 'DNA Helicase activity';
This fetches all ancestor terms, and the relationship which have each ancestor term as a child
Note that this will not fetch the root term, because the root term is not the child of any other term. It will fetch relationships which have the root as parent. If you also want to fetch the root term, change the last join to a LEFT OUTER JOIN
See also: TRANSITIVE CORRELATIONS
Say we want to find the total number pf gene products that are BOTH GTP binding (GO:0005525) and immune response (GO:0006955)
SELECT
count(DISTINCT a1.gene_product_id)
FROM
term AS t1
INNER JOIN graph_path AS p1 ON (t1.id=p1.term1_id)
INNER JOIN association AS a1 ON (a1.term_id=p1.term2_id)
INNER JOIN term AS t2 ON (t2.id=p2.term1_id)
INNER JOIN graph_path AS p2 ON (a2.term_id=p2.term2_id)
INNER JOIN association AS a2 ON (a2.gene_product_id=a1.gene_product_id)
WHERE t1.acc = 'GO:0005525' AND t2.acc = 'GO:0006955'
This query is not generalisable to n terms; fortunately there is another way of doing things (this version doesn't use the graph):
SELECT gene_product.id, count(distinct term_id) AS c
FROM gene_product
INNER JOIN association ON association.gene_product_id=gene_product.id
INNER JOIN term ON association.term_id=term.id
WHERE term.name='immune response'
OR term.name='interleukin-4 receptor binding'
GROUP BY gene_product.id
HAVING c=2;
Thanks go to Claude Pasquier for coming up with this solution
Imagine we want to find the minimal common parent of GO:0008045 (motor axon guidance) and GO:0007474 (wing vein specification).
SELECT tp.*,
p1.distance+p2.distance AS total_distance,
p1.distance AS d1,
p2.distance AS d2
FROM term AS t1
INNER JOIN graph_path AS p1 ON (t1.id=p1.term2_id)
INNER JOIN term AS tp ON (p1.term1_id=tp.id)
INNER JOIN graph_path AS p2 ON (tp.id=p2.term1_id)
INNER JOIN term AS t2 ON (t2.id=p2.term2_id)
WHERE t1.acc = 'GO:0008045' AND t2.acc = 'GO:0007474'
ORDER BY total_distance
The total distance is defined as the sum of the distance between the first term and the common parent and the second term and the common parent. Note that there may be multiple paths between a child and a parent.
You can use the stag-template go-common_ancestor
selectall_xml.pl -d go /go-common_ancestor GO:0008045 GO:0008045
Imagine we want to find the distance between GO:0006629 (lipid metabolism) and GO:0007165 (signal transduction)
The graph_path table only stores the distance between nodes that have recursive parent/child relationships, not siblings (or cousins!). But we can still do this with a little cunning:
SELECT
min(graph_path1.distance + graph_path2.distance) AS dist
FROM
graph_path AS graph_path1, graph_path AS graph_path2,
term AS t1, term AS t2
WHERE
t1.acc = 'GO:0007165' and t2.acc = 'GO:0006629' and graph_path1.term2_id = t1.id
and graph_path2.term2_id = t2.id and graph_path1.term1_id = graph_path2.term1_id
Or if that is too much to type, just fire up GoShell and type:
print $apph->get_distance('GO:0007165', 'GO:0006629')
SELECT DISTINCT
gene_product.*,
dbxref.xref_key AS acc,
dbxref.xref_dbname AS speciesdb
FROM
gene_product, dbxref, association, graph_path, evidence, term
WHERE
evidence.association_id = association.id and
code!='IEA' and code!='ISS' and
gene_product.dbxref_id = dbxref.id and
association.gene_product_id = gene_product.id
and graph_path.term2_id = association.term_id
and graph_path.term1_id = term.id and
dbxref.xref_dbname='FB' and
term.name = 'transmembrane receptor activity'
GoAPI or GoShell:
$gpl = $apph->get_deep_products({term=>"transmembrane receptor", speciesdb=>"FB", evcodes=>["!IEA", "!ISS"]});
foreach $g (@$gpl) { printf "product:%s\n", $g->symbol }
OR query using the NCBI taxon ID - this will give different results; the former is FlyBase associations only; the latter is FlyBase Drosophila Melanogaster associations PLUS (possibly redundant) associations for fly proteins coming from other sources (eg GOA).
$gpl = $apph->get_deep_products({term=>"transmembrane receptor", taxid=>7227, evcodes=>["!IEA", "!ISS"]});
foreach $g (@$gpl) { printf "product:%s\n", $g->symbol }
See also the ::GO::Model::GeneProduct class which represents instances of GeneProducts
You can use the stag-template go-geneproduct-by-term-and-evidence; here are some example below
selectall_xml.pl -d go /go-geneproduct-by-term-and-evidence GO:0004888
selectall_xml.pl -d go /go-geneproduct-by-term-and-evidence term_name='transmembrane receptor activity'
selectall_xml.pl -d go /go-geneproduct-by-term-and-evidence term_acc=GO:0004888 product_dbname=FB
selectall_xml.pl -d go /go-geneproduct-by-term-and-evidence term_acc=GO:0004888 ncbi_taxa_id=7227
selectall_xml.pl -d go /go-geneproduct-by-term-and-evidence -where\
"term.acc='GO:0004888' AND species.ncbi_taxa_id=7227 AND evidence.code not in ('IEA','ISS')"
(note that there is currently no way of passing in multiple values for the same constraint variable with the DBStag command line interface - you have to override the whole sql where clause, as shown in the final example above)
This finds sequences for gene products associated to inner membrane (GO:0019866) or any of its children:
SELECT *
FROM gene_product
INNER JOIN dbxref ON (gene_product.dbxref_id = dbxref.id)
INNER JOIN association ON (gene_product.id = association.gene_product_id)
INNER JOIN graph_path ON (association.term_id = graph_path.term2_id)
INNER JOIN term ON (graph_path.term1_id = term.id)
INNER JOIN term AS associated_term ON (graph_path.term2_id = associated_term.id)
INNER JOIN gene_product_seq ON (gene_product.id = gene_product_seq.gene_product_id)
INNER JOIN seq ON (seq.id = gene_product_seq.seq_id)
WHERE term.acc = 'GO:0019866'
You can use the stag-template go-seq-by-term
selectall_xml.pl -d go /go-seq-by-term GO:0019866
In GO, gene products can be associated with multiple terms in multiple categories (funtion, process, cellular component). One would expect "protein biosynthesis" to be correlated with "ribosome" based on gene products in common.
We can explore these correlations with SQL queries like the following. ("protein biosynthesis" is GO:0006412)
SELECT t2.name, t2.acc, count(distinct a2.gene_product_id) AS gpc FROM term t1, term t2, association a1, association a2 WHERE a1.term_id = t1.id AND a2.term_id = t2.id AND a1.gene_product_id = a2.gene_product_id AND t1.acc = 'GO:0006412' AND t1.id != t2.id GROUP BY t2.name, t2.acc ORDER BY gpc DESC LIMIT 10
+-----------------------------------------------------+------------+------+ | name | acc | gpc | +-----------------------------------------------------+------------+------+ | structural constituent of ribosome | GO:0003735 | 1925 | | ribosome | GO:0005840 | 506 | | cytosolic large ribosomal subunit (sensu Eukaryota) | GO:0005842 | 312 | | cytosolic small ribosomal subunit (sensu Eukaryota) | GO:0005843 | 296 | | cytosolic large ribosomal subunit (sensu Bacteria) | GO:0009282 | 262 | | triplet codon-amino acid adaptor activity | GO:0030533 | 217 | | RNA binding | GO:0003723 | 214 | | cytosol | GO:0005829 | 205 | | mitochondrial large ribosomal subunit | GO:0005762 | 180 | | cytosolic small ribosomal subunit (sensu Bacteria) | GO:0009283 | 158 | +-----------------------------------------------------+------------+------+ 10 rows in set (3.77 sec)
(The final row shows the number of gene products that the two terms have in common)
We have "discovered" that translation is associated with the ribosome. Woohoo! Not a startling breakthrough in itself, but the ability to do this query may prove useful in a variety of contexts eg verifying changes to the ontology or verifying automatic assignments.
You can use the stag-template go-correlations or go-correlations-summary; the first finds cellular component correlations with protein biosynthesis; the second finds ALL correlations.
selectall_xml.pl -d go /go-correlations term_acc=GO:0006412 correlated_term_type='cellular_component'
selectall_xml.pl -d go /go-correlations > all_correlations.xml
Above we saw direct correlations between two terms. remember GO is a set of terms in a subsumption architecture, eg if you use the above query to ask for correlations between "nucleic acid binding" and other terms, you will miss out on correlations involving "DNA binding". Thankfully we can use the graph_path table to help us.
This example is for finding correlations between transcription factor (GO:0003700) and other terms:
SELECT t2.name, t2.acc, count(distinct a2.gene_product_id) AS gpc FROM term t1, term t2, graph_path p1, graph_path p2, association a1, association a2, evidence e1, evidence e2 WHERE t1.id = p1.term1_id AND t2.id = p2.term1_id AND a1.term_id = p1.term2_id AND a2.term_id = p2.term2_id AND e1.association_id = a1.id AND e2.association_id = a2.id AND a1.is_not = 0 AND a2.is_not = 0 AND e1.code != 'IEA' AND e2.code != 'IEA' AND a1.gene_product_id = a2.gene_product_id AND t1.acc = 'GO:0003700' AND a1.term_id != a2.term_id GROUP BY t2.name, t2.acc ORDER BY gpc DESC LIMIT 30
(This filters out IEAs which could skew the results)
You can use the stag-template go-transitive_correlations or go-transitive_correlations-summary; the first finds cellular component correlations with protein biosynthesis; the second finds ALL correlations [only do this if you have a massive mysql server, or you don't have the full association set loaded into your db!]
selectall_xml.pl -d go /go-transitive_correlations term_acc=GO:0006412 correlated_term_type='cellular_component'
selectall_xml.pl -d go /go-transitive_correlations > all_correlations.xml
Let's say we want to look at correlations between interpro domains and the GO term "transmembrane receptor". NOTE: This is independent of the interpro2go mapping (available from the term_dbxref table). For this you will need a go database with seqs, as we are looking for interpro accessions attached to sequence records correlated with terms. (these interpro accessions become associated with the sequence records via various protein motif searching programs run at the EBI).
SELECT
term.name, term.acc, dbxref.xref_key, count(dbxref.xref_key) AS c
FROM
term,
association,
graph_path,
seq_dbxref,
gene_product_seq AS gps,
dbxref
WHERE
term.name = 'transmembrane receptor' AND
dbxref.xref_dbname = 'interpro' AND
dbxref.id = seq_dbxref.dbxref_id AND
gps.seq_id = seq_dbxref.seq_id AND
gps.gene_product_id = association.gene_product_id AND
association.term_id = graph_path.term2_id AND
association.is_not = 0 AND
graph_path.term1_id = term.id
GROUP BY term.name, term.acc, dbxref.xref_key HAVING c > 1 ORDER BY c DESC
You can use the stag-template go-peptide_motif-by-term
selectall_xml.pl -d go /go-peptide_motif-by-term term_acc=GO:0006412 motif_dbname=interpro
Let's say we want to know the most common pairing of domains associated with GPCRs (GO:0004390); we can modify the above query like this:
SELECT
term.name, term.acc, x1.xref_key, x2.xref_key, count(x1.xref_key) AS c
FROM
term,
association,
graph_path,
seq_dbxref sx1,
seq_dbxref sx2,
gene_product_seq AS gps,
dbxref x1,
dbxref x2
WHERE
term.acc = 'GO:0004930' AND
x1.id < x2.id AND
x1.xref_dbname = 'interpro' AND
x2.xref_dbname = 'interpro' AND
x1.id = sx1.dbxref_id AND
x2.id = sx2.dbxref_id AND
gps.seq_id = sx1.seq_id AND
gps.seq_id = sx2.seq_id AND
gps.gene_product_id = association.gene_product_id AND
association.term_id = graph_path.term2_id AND
association.is_not = 0 AND
graph_path.term1_id = term.id
GROUP BY term.name, term.acc, x1.xref_key, x2.xref_key HAVING c > 1 ORDER BY c DESC
;
This gives something like:
+----------------------------+--------------+-----------+-----------+----+
| name | acc | xref_key | xref_key | c |
+----------------------------+--------------+-----------+-----------+----+
| G-protein coupled receptor | 'GO:0004930' | IPR003006 | IPR003597 | 34 |
| G-protein coupled receptor | 'GO:0004930' | IPR000832 | IPR001879 | 32 |
| G-protein coupled receptor | 'GO:0004930' | IPR001828 | IPR000337 | 19 |
| G-protein coupled receptor | 'GO:0004930' | IPR000276 | IPR001760 | 17 |
| G-protein coupled receptor | 'GO:0004930' | IPR003006 | IPR000353 | 14 |
| G-protein coupled receptor | 'GO:0004930' | IPR003597 | IPR000353 | 14 |
| G-protein coupled receptor | 'GO:0004930' | IPR003006 | IPR001039 | 13 |
| G-protein coupled receptor | 'GO:0004930' | IPR003597 | IPR001039 | 13 |
| G-protein coupled receptor | 'GO:0004930' | IPR002106 | IPR000276 | 12 |
| G-protein coupled receptor | 'GO:0004930' | IPR001611 | IPR000276 | 11 |
If you have a list of gene product identifiers that have been annotated with GO terms in the GO database, you can do a batch query for the GO terms associated with these:
SELECT term.*
FROM
dbxref
INNER JOIN
gene_product ON (dbxref.id = gene_product.dbxref_id)
INNER JOIN
association ON (association.gene_product_id=gene_product.id)
INNER JOIN
term ON (association.term_id = term.id)
WHERE
dbxref.xref_key IN
('Q8BRY9', 'Q8BRX4', 'Q8BRZ4', ...)
If you would like to fetch the gene product dbxref with every term, change the SELECT clause to term.*,gene_product.*,dbxref.*
To do the same query with the API:
my $apph = GO::AppHandle->connect(-dbname=>$dbname, -dbhost=>$mysqlhost);
my @accs = qw(Q8BRY9 Q8BRX4 Q8BRZ4);
my @pqlist = map { {acc=>$_} } @accs;
my $term_l =
$apph->get_terms({products=>[@pqlist]});
foreach my $term (@$term_l) {
printf "Term: %s %s\n", $term->acc, $term->name;
foreach my $assoc (@{$term->selected_association_list || []}) {
my $gp = $assoc->gene_product;
my $ev_l = $assoc->evidence_list || [];
printf " %s %s evidence:%s\n",
$gp->acc,
$gp->symbol,
join('; ', map {$_->code} @$ev_l);
}
}