What is abSENSE?

abSENSE is a method that calculates the probability that a homolog of a given gene would fail to be detected by a homology search (using BLAST or a similar method) in a given species, even if the homolog were present and evolving normally.

The result of this calculation informs how to interpret a homology search that fails to find homologs of a gene in some species. One explanation for such a result is that the gene is actually absent from the genome of that species: a biological and potentially interesting result (e.g. if due to a gene loss or the birth of a new gene).

A second explanation, often ignored, is that the homolog is present in the genome of that species, but that the homology search merely lacks the statistical power to detect it. Here, the apparent absence of the homolog is a technical/statistical limitation, and does not reflect underlying biology.

By calculating the probability that your homology search would fail to detect a homolog even if one were present and even if it were evolving normally (e.g. no rate accelerations on a specific branch, potentially suggestive of biologically interesting changes), abSENSE informs the interpretation of a negative homology search result. If abSENSE finds that there is a high probability of a homolog being undetected even if present, you may not be as inclined to invoke a biological explanation for the result: the null model of a failure of the homology search is sufficient to explain what you observe.
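To make the idea of a detection-failure probability concrete, here is a minimal sketch. It is not the abSENSE code. It assumes that you already have a predicted bitscore for the species in question, with a normally distributed uncertainty, and it converts an E-value cutoff into a bitscore cutoff using the standard BLAST relation E = m * n * 2^(-S). The function names and all of the numbers are made up for illustration; the paper describes how abSENSE actually obtains its predictions.

```python
import math
from statistics import NormalDist


def bitscore_threshold(evalue, query_length, database_size):
    """Bitscore S at which a hit has the given E-value, from E = m * n * 2**(-S)."""
    return math.log2(query_length * database_size / evalue)


def prob_undetected(predicted_mean, predicted_sd, threshold):
    """P(bitscore < threshold), assuming a normally distributed predicted bitscore."""
    return NormalDist(predicted_mean, predicted_sd).cdf(threshold)


# Hypothetical inputs: a 400-residue query, a database of 4,000,000 residues,
# an E-value cutoff of 0.001, and a predicted bitscore of 35 +/- 8.
threshold = bitscore_threshold(evalue=0.001, query_length=400, database_size=4_000_000)
p = prob_undetected(predicted_mean=35.0, predicted_sd=8.0, threshold=threshold)
print(f"threshold bitscore = {threshold:.1f}, P(undetected) = {p:.2f}")
```

When the predicted bitscore sits below the cutoff, as in these made-up numbers, most of the probability mass falls on the undetected side, and detection failure alone is enough to explain a negative search result.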

The method is explained in complete detail in the paper in which we introduce it:

Weisman CM, Murray AW, Eddy SR (2020) Many, but not all, lineage-specific genes can be explained by homology detection failure. PLoS Biol 18(11): e3000862. https://doi.org/10.1371/journal.pbio.3000862

There, it is applied to the specific case of lineage-specific genes, for which homologs appear absent in all species outside of a narrow lineage. The method itself is applicable to any case in which a homolog appears absent (e.g. a single species missing a homolog that one might interpret as a gene loss), and likewise, this code is applicable to all such cases.

When should I use this site vs. the downloadable code?

Small numbers of analyses: This website can analyze one gene at a time, outputs numerical results to the screen (not a file), and produces a visualization of the resulting analysis. If you want to analyze one or a few genes, for which you have the bandwidth and desire to look at their visualizations, this website will probably serve you well. By contrast, if you're looking to analyze hundreds or thousands of genes, you should use the downloadable command line code, which is faster, can be run on an arbitrary number of genes at once, and outputs results to a tab-delimited file.

No need for advanced options: The command line code offers advanced options that the standard use case of abSENSE does not need, and that the website does not include. These include adjusting the E-value threshold and database sizes for the pre-computed fungal and insect genes, and using bitscores from only a subset of species in the fitting procedure and subsequent analysis. You can see a complete list of these options in the README on the GitHub page.

FAQ

Why do some species appear in gray text in the abSENSE output? As indicated by the label 'Orthology ambiguous,' these are species in which a gene homologous to the query gene was detected at the chosen significance threshold, but which failed the Reciprocal Best Hit criterion and so may not be a strict ortholog (e.g. it could be a paralog). These species aren't missing homologs, so we don't treat them as such in our analysis. But since their homologs may not be orthologs, and paralogs can have different evolutionary rates and patterns, including them in the prediction procedure for other orthologs may throw off the results. We therefore don't use them to predict bitscores.

How do I calculate the evolutionary distances between my species? There's no single way that you have to do this; the goal is just to use any method that gives you reasonably accurate relative evolutionary distances. In the paper where we develop abSENSE, we show that you can do this reliably by aligning ~15 highly conserved genes present in single copy in each of your species and computing the pairwise distances with a simple method like Protdist in the PHYLIP package (https://evolution.genetics.washington.edu/phylip/doc/protdist.html). But you could try to get distances in any number of other ways: taking them from a published phylogeny, estimating them with other methods (fewer genes or different kinds of genes), or even, potentially, from fossil record estimates. If you use one of these other methods, it is important to check that your distances are reliable, because the quality of the results depends on them. You can do this with a positive control: check the results that abSENSE and those distances give for genes that ARE detected in your species of interest. If you ask abSENSE to analyze a gene that is in fact detected in a species, holding out the bitscore data from that species as if it weren't, accurate evolutionary distances should give a result that correctly recapitulates the fact that the gene is detected. This is what we do in the paper (Figures 4 and 5).
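The positive control described above can be sketched in a few lines. This is only an illustration, not the abSENSE fitting code. It assumes, for simplicity, that bitscore decays exponentially with evolutionary distance and fits that curve by ordinary least squares on log bitscores; see the paper for the model and fitting procedure abSENSE really uses. The species names, distances, bitscores, and function names are all hypothetical.

```python
import math


def fit_exponential(distances, bitscores):
    """Least-squares fit of ln(S) = ln(a) - b * d; returns (a, b)."""
    logs = [math.log(s) for s in bitscores]
    n = len(distances)
    mean_d = sum(distances) / n
    mean_l = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in zip(distances, logs))
    slope = sxy / sxx
    a = math.exp(mean_l - slope * mean_d)
    return a, -slope


def holdout_prediction(distances, bitscores, held_out):
    """Fit without one species, then predict that species' bitscore."""
    train = [sp for sp in distances if sp != held_out]
    a, b = fit_exponential(
        [distances[sp] for sp in train],
        [bitscores[sp] for sp in train],
    )
    return a * math.exp(-b * distances[held_out])


# Made-up distances (substitutions per site) and observed bitscores.
distances = {"sp_A": 0.0, "sp_B": 0.3, "sp_C": 0.6, "sp_D": 1.0}
bitscores = {"sp_A": 500.0, "sp_B": 370.0, "sp_C": 275.0, "sp_D": 184.0}

predicted = holdout_prediction(distances, bitscores, "sp_D")
print(f"observed = {bitscores['sp_D']:.0f}, predicted without sp_D = {predicted:.0f}")
```

If the distances are reliable, the held-out prediction lands near the observed bitscore and on the detected side of your threshold. A prediction far from the observed value suggests the distances need another look.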

Why do I get slightly different results when I run an analysis twice? / Why are the results here slightly different from the ones I see on GitHub or from the command line code? The exact calculation would run slowly on this site, so to keep it from taking annoyingly long, the site computes a stochastic approximation of the exact value. The approximation varies a little from run to run. The GitHub table and command line code are more accurate, but the results shouldn't differ by much.
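The sketch below shows why a stochastic approximation gives slightly different answers on each run. It is a generic illustration, not the code this site runs: it estimates the probability that a normally distributed bitscore falls below a threshold by random sampling and compares two sampled runs with the exact value. All names and numbers are hypothetical.

```python
import random
from statistics import NormalDist


def exact_prob_below(mean, sd, threshold):
    """Exact P(X < threshold) for a normal distribution."""
    return NormalDist(mean, sd).cdf(threshold)


def sampled_prob_below(mean, sd, threshold, n_samples, rng):
    """Monte Carlo estimate of the same probability from n_samples random draws."""
    hits = sum(rng.gauss(mean, sd) < threshold for _ in range(n_samples))
    return hits / n_samples


exact = exact_prob_below(35.0, 8.0, 40.0)
# Different seeds stand in for two separate runs of the same analysis.
run_1 = sampled_prob_below(35.0, 8.0, 40.0, 2000, random.Random(1))
run_2 = sampled_prob_below(35.0, 8.0, 40.0, 2000, random.Random(2))
print(f"exact = {exact:.3f}, run 1 = {run_1:.3f}, run 2 = {run_2:.3f}")
```

Each sampled estimate scatters around the exact value, and the scatter shrinks as the number of samples grows, at the cost of a longer run time.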

Who should I contact if I have questions? You should email the author, Cara Weisman, at weisman@g.harvard.edu.