About Gene-SCOUT
Gene-SCOUT identifies genes that are similar to a gene of interest by constructing a unique signature for each gene. The method exploits associations derived from 450,000 exomes sequenced in the UK Biobank, as well as metabolomic data from 120,000 samples. For a given gene, its signature comprises the collection of associations between variants of the gene and phenotypic traits measured in the UK Biobank.
Gene-SCOUT: identifying genes with similar continuous trait fingerprints from phenome-wide association analyses
Lawrence Middleton, Andrew R Harper, Abhishek Nag, Quanli Wang, Anna Reznichenko,
Dimitrios Vitsios✉, Slavé Petrovski✉
Nucleic Acids Research, Volume 50, Issue 8, 6 May 2022, Pages 4289–4301.
We represent each gene as a vector whose elements reflect its associations with quantitative traits. The method then evaluates the similarity of a given seed gene to other genes. The raw data correspond to a matrix of effect sizes, one for each gene and trait, with the corresponding significance level (\(p\)-value) shown in brackets:
| | Trait 1 | Trait 2 | ... | Trait m |
|---|---|---|---|---|
| Gene 1 | 0.1 (3.0e-1) | 1.2 (5.1e-5) | ... | 0.1 (-0.1e-4) |
| Gene 2 | -0.2 (5.2e-4) | 2.4 (4.2e-5) | ... | 1.3 (1.2e-4) |
| ... | | | | |
| Gene n | -1.9 (1.2e-3) | 0.9 (3.4e-1) | ... | 6.4 (2.4e-1) |
The method to calculate similarities:
- Only evaluates similarities between genes that share at least one significant trait with the seed gene
- Only evaluates similarities on features that are significant in the seed gene
- If a gene shares at least one significant feature with the seed gene, any of its non-significant features are imputed with zero
Because we impute zeros for traits that are not significant, the resulting vectors may be relatively sparse. To evaluate similarity we therefore use the cosine similarity, rescaled for two \(N\)-dimensional vectors \(x\) and \(y\) as \[ d(x,y) = \frac{1}{2}\left(1 + \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2 \sum_{i=1}^{N} y_i^2}}\right).\] The addition of \(1\) and scaling by \(\frac{1}{2}\) ensures the quantity varies in \([0,1]\), with \(1\) being the most similar. The cosine similarity has been successfully applied in natural language processing to cluster sparse data for document classification.
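The three rules and the rescaled cosine similarity above can be sketched in a few lines of Python. This is a minimal illustration, not the Gene-SCOUT implementation: the function names, the dictionary layout (trait name mapped to an effect size and \(p\)-value) and the threshold `alpha=1e-5` are all assumptions made for the example.

```python
import math


def scaled_cosine(x, y):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0.0:
        return None  # undefined when either vector is all zeros
    return 0.5 * (1.0 + dot / norm)


def gene_similarity(seed, other, alpha=1e-5):
    """Similarity of `other` to `seed`.

    Both arguments map trait name -> (effect size, p-value).
    `alpha` is a hypothetical significance threshold.
    Returns None when the genes share no significant trait.
    """
    # Rule 2: restrict to traits that are significant in the seed gene.
    traits = [t for t, (_, p) in seed.items() if p <= alpha]

    # Rule 1: require at least one shared significant trait.
    shared = [t for t in traits if t in other and other[t][1] <= alpha]
    if not shared:
        return None

    x = [seed[t][0] for t in traits]
    # Rule 3: impute zero where the other gene is not significant.
    y = [other[t][0] if t in shared else 0.0 for t in traits]
    return scaled_cosine(x, y)
```

With these definitions, a gene whose significant effects match the seed exactly scores 1, a gene with exactly opposite effects scores 0, and a gene with no shared significant trait is skipped.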
Gene-SCOUT provides similarities based on different feature sets. These include:
- All traits: This comprises two variants: firstly, the 1,446 continuous traits present in the UK Biobank and, secondly, those traits concatenated with 259 metabolomic features from 120,000 samples.
- Biomarkers only: In this feature set, 30 biomarkers are selected from the 1,446 continuous traits present in the feature set without metabolomic data.
- Metabolomic data only: This feature set comprises exclusively 259 metabolomic features taken from 120,000 samples in the UKB.
The following .txt files provide the comprehensive lists of features:
Enrichment analyses in this context aim to establish whether certain terms are over-represented in the list of closest genes reported by Gene-SCOUT. To associate a \(p\)-value with a given list of genes, we perform Fisher's exact test. Fisher's exact test models contingency tables whose entries are assumed to follow a hypergeometric distribution. Here, a contingency table is constructed for a specific binary trait and the list of genes that are close to the query gene, so the \(p\)-value can be read as a measure of overlap between these two lists. As a concrete example, consider APOB. The genes PCSK9, GIGYF1, NPC1L1, ZNF229, ANGPTL3, RRBP1, ACVR1, SLC4A1, APOC3 and PDE3B are all close to this gene. For a given binary trait, there is a second list of genes that are enriched for that trait. Denote these two lists \(list_{GS}\) and \(list_{trait}\) respectively. We construct the first row of the contingency table as
- The number of genes on both \(list_{GS}\) and \(list_{trait}\).
- The number of genes on \(list_{GS}\) and not \(list_{trait}\).
We construct the second row as
- The number of genes on \(list_{trait}\) and not \(list_{GS}\).
- The number of genes on neither \(list_{GS}\) nor \(list_{trait}\).
Given this contingency table we are able to calculate a \(p\)-value using the hypergeometric distribution.
A similar procedure is performed for Gene Ontology enrichment.
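As a rough sketch of the contingency-table construction and the hypergeometric \(p\)-value described above, the Python below uses only the standard library. Several details are assumptions rather than statements about Gene-SCOUT: the function names are hypothetical, a background set of genes must be supplied to count the "neither" cell, and the test is taken to be one-sided (over-representation).

```python
from math import comb


def contingency_table(list_gs, list_trait, background):
    """Return (a, b, c, d) for the 2x2 table.

    `background` is an assumed universe of genes; genes outside it
    are ignored.
    """
    universe = set(background)
    gs = set(list_gs) & universe
    trait = set(list_trait) & universe
    a = len(gs & trait)             # on both lists
    b = len(gs - trait)             # on list_GS only
    c = len(trait - gs)             # on list_trait only
    d = len(universe - gs - trait)  # on neither list
    return a, b, c, d


def fisher_enrichment_p(a, b, c, d):
    """One-sided p-value: probability of an overlap of at least `a`
    under the hypergeometric distribution with fixed margins."""
    n_total = a + b + c + d
    n_gs = a + b
    n_trait = a + c
    max_overlap = min(n_gs, n_trait)
    tail = sum(
        comb(n_trait, k) * comb(n_total - n_trait, n_gs - k)
        for k in range(a, max_overlap + 1)
    )
    return tail / comb(n_total, n_gs)


# Usage with placeholder gene names (not real data).
background = [f"G{i}" for i in range(20)]
list_gs = ["G0", "G1", "G2", "G3", "G4"]
list_trait = ["G0", "G1", "G2", "G10"]
table = contingency_table(list_gs, list_trait, background)
p_value = fisher_enrichment_p(*table)
```

A small \(p\)-value here indicates that the overlap between the two lists is larger than would be expected if the genes on \(list_{GS}\) were drawn at random from the background.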
About Trait finder
Trait finder aims to find genes that possess a certain phenotypic signature. At a given significance level, it is possible to consider all genes that are positively (or negatively) associated with a certain quantitative trait. The user enters the desired direction of these associations in the two text fields, with the aim of finding genes that match this pattern of directions.
As an illustration, imagine GENE1 is positively associated with Seated height, Sitting height and Standing height but negatively associated with Birth weight, all at a significance level of \(p\leq 10^{-5}\). If the user enters Standing height in the positively associated box and Birth weight in the negatively associated box, with the significance threshold set to \(10^{-5}\), GENE1 scores 2 because it matches on both Standing height and Birth weight. If the user also enters Age at death in either box, the gene still scores 2 because it is not significantly associated with Age at death. Trait finder performs this scoring for all genes and sorts the results so that the highest-scoring genes come first. If no genes match the specified pattern, the results chart will be empty.
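The scoring in the GENE1 illustration can be sketched as follows. This is an assumed reading of the description, not the Trait finder source: the function name, the data layout and the effect sizes are hypothetical, and dropping genes that score zero is a choice made here so that an unmatched pattern yields an empty result.

```python
def trait_finder_scores(associations, positive, negative, alpha):
    """Score each gene by how many requested directions it matches.

    associations: gene -> {trait: (effect size, p-value)}
    positive / negative: sets of trait names entered by the user
    alpha: significance threshold
    Returns (gene, score) pairs, highest score first.
    """
    scores = {}
    for gene, traits in associations.items():
        score = 0
        for trait, (effect, p) in traits.items():
            if p > alpha:
                continue  # non-significant associations never count
            if trait in positive and effect > 0:
                score += 1
            elif trait in negative and effect < 0:
                score += 1
        if score > 0:
            scores[gene] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Usage with made-up effect sizes that mirror the GENE1 illustration.
associations = {
    "GENE1": {
        "Seated height": (0.5, 1e-6),
        "Sitting height": (0.4, 1e-6),
        "Standing height": (0.6, 1e-6),
        "Birth weight": (-0.3, 1e-6),
        "Age at death": (0.1, 0.4),
    },
    "GENE2": {"Standing height": (0.2, 1e-7)},
}
ranking = trait_finder_scores(
    associations,
    positive={"Standing height", "Age at death"},
    negative={"Birth weight"},
    alpha=1e-5,
)
```

In this example GENE1 scores 2 and GENE2 scores 1; Age at death adds nothing to GENE1 because its association is not significant.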