Journal club: Papers to read from BMC Bioinfo, PNAS, Science • Into Oblivion

After several airport adventures caused by the French air traffic controllers’ strike on Tuesday, September 7, I’m back. Here are a bunch of papers that just came out:

BMC Bioinformatics

Filtering, FDR and power

Background

In high-dimensional data analysis such as differential gene expression analysis, people often use filtering methods like fold-change or variance filters in an attempt to reduce the multiple testing penalty and improve power. However, filtering may introduce a bias on the multiple testing correction. The precise amount of bias depends on many quantities, such as fraction of probes filtered out, filter statistic and test statistic used.

Results

We show that a biased multiple testing correction results if non-differentially expressed probes are not filtered out with equal probability from the entire range of p-values. We illustrate our results using both a simulation study and an experimental dataset, where the FDR is shown to be biased mostly by filters that are associated with the hypothesis being tested, such as the fold change. Filters that induce little bias on the FDR yield less additional power of detecting differentially expressed genes. Finally, we propose a statistical test that can be used in practice to determine whether any chosen filter introduces bias on the FDR estimate used, given a general experimental setup.
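To make the mechanism concrete, here is a minimal sketch of how a filter that preferentially removes large p-values shrinks the Benjamini-Hochberg adjusted values of the probes that remain. The p-values and the filter are made-up toy inputs, and the function names are hypothetical; this is not the statistical test proposed in the paper.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def filter_then_adjust(pvalues, keep):
    """Apply BH only to the probes that a (hypothetical) filter kept."""
    kept = [p for p, k in zip(pvalues, keep) if k]
    return bh_adjust(kept)


# Toy p-values, invented for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6, 0.9]
# Assumed filter: drops the two probes with the largest p-values,
# i.e. it does NOT remove probes uniformly over the p-value range.
keep = [True, True, True, True, False, False]

unfiltered = bh_adjust(pvals)
filtered = filter_then_adjust(pvals, keep)

print("calls at FDR 0.05 without filter:", sum(a < 0.05 for a in unfiltered))
print("calls at FDR 0.05 with filter:   ", sum(a < 0.05 for a in filtered))
```

With these toy numbers the filtered analysis calls more probes at the same nominal FDR, purely because the multiplicity count dropped. Whether that gain is legitimate depends on whether the removed probes were taken evenly from the null p-value distribution, which is the question the abstract raises.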

Conclusions

Filtering out of probes must be used with care as it may bias the multiple testing correction. Researchers can use our test for FDR bias to guide their choice of filter and amount of filtering in practice.

Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms

Background

Data generated using ‘omics’ technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of ‘omics’ data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques.

Results

The analysis of data from seven ‘omics’ studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper.

Conclusion

No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.

Consistency, comprehensiveness, and compatibility of pathway databases

Background

It is necessary to analyze microarray experiments together with biological information to make better biological inferences. We investigate the adequacy of current biological databases to address this need.

Description

Our results show a low level of consistency, comprehensiveness and compatibility among three popular pathway databases (KEGG, Ingenuity and Wikipathways). The level of consistency for genes in similar pathways across databases ranges from 0% to 88%. The corresponding level of consistency for interacting gene pairs is 0%-61%. These three original sources can be assumed to be reliable in the sense that the interacting gene pairs reported in them are correct because they are curated. However, the lack of concordance between these databases suggests each source has missed many genes and interacting gene pairs.
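The abstract does not say how consistency was scored, so the following is only a sketch under an assumption: it uses the Jaccard overlap (shared items divided by all items) between two versions of the same pathway. The gene names and pathway contents are placeholders, not data from KEGG, Ingenuity or Wikipathways.

```python
def overlap_percent(a, b):
    """Jaccard overlap of two collections, as a percentage."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


def normalise_pairs(pairs):
    """Treat (x, y) and (y, x) as the same undirected interaction."""
    return {frozenset(p) for p in pairs}


# Placeholder gene lists for one pathway as it might appear in two databases.
genes_db1 = ["GENE_A", "GENE_B", "GENE_C"]
genes_db2 = ["GENE_B", "GENE_C", "GENE_D"]

# Placeholder interacting gene pairs for the same pathway.
pairs_db1 = normalise_pairs([("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")])
pairs_db2 = normalise_pairs([("GENE_B", "GENE_A"), ("GENE_C", "GENE_D")])

gene_consistency = overlap_percent(genes_db1, genes_db2)
pair_consistency = overlap_percent(pairs_db1, pairs_db2)

print(f"gene consistency: {gene_consistency:.1f}%")
print(f"pair consistency: {pair_consistency:.1f}%")
```

Pair-level overlap is typically lower than gene-level overlap, since two databases can agree on the members of a pathway while disagreeing on which of them interact.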

Conclusions

Researchers will hence find it challenging to obtain consistent pathway information out of these diverse data sources. It is therefore critical to enable them to access these sources via a consistent, comprehensive and united pathway API. We accumulated sufficient data to create such an aggregated resource with the convenience of an API to access its information. This united resource can be accessed at www.pathwayapi.com.

PNAS

Statistical tests for whether a given set of independent, identically distributed draws comes from a specified probability density

We discuss several tests for determining whether a given set of independent and identically distributed (i.i.d.) draws does not come from a specified probability density function. The most commonly used are Kolmogorov–Smirnov tests, particularly Kuiper’s variant, which focus on discrepancies between the cumulative distribution function for the specified probability density and the empirical cumulative distribution function for the given set of i.i.d. draws. Unfortunately, variations in the probability density function often get smoothed over in the cumulative distribution function, making it difficult to detect discrepancies in regions where the probability density is small in comparison with its values in surrounding regions. We discuss tests without this deficiency, complementing the classical methods. The tests of the present paper are based on the plain fact that it is unlikely to draw a random number whose probability is small, provided that the draw is taken from the same distribution used in calculating the probability (thus, if we draw a random number whose probability is small, then we can be confident that we did not draw the number from the same distribution used in calculating the probability).
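The "plain fact" in the last sentence can be illustrated for a single draw from a discrete distribution. This is a simplified sketch, not the test statistic from the paper: it just sums the probability of every outcome that is no more likely than the observed one. The probability table is invented so that one outcome sits in a low-probability dip between likelier neighbours, the situation that the cumulative distribution function tends to smooth over.

```python
def low_probability_pvalue(pmf, observed):
    """Probability of drawing an outcome at most as likely as `observed`.

    `pmf` maps each outcome to its probability under the specified
    distribution; the values are assumed to sum to 1.
    """
    p_obs = pmf[observed]
    return sum(p for p in pmf.values() if p <= p_obs)


# Invented distribution with a dip in the middle (outcome 3).
pmf = {1: 0.30, 2: 0.19, 3: 0.02, 4: 0.19, 5: 0.30}

p_dip = low_probability_pvalue(pmf, 3)       # draw landed in the dip
p_shoulder = low_probability_pvalue(pmf, 2)  # draw landed next to it

print("p-value for outcome 3:", round(p_dip, 4))
print("p-value for outcome 2:", round(p_shoulder, 4))
```

Outcome 3 is the median of this distribution, so a test that only looks at the cumulative distribution function would see nothing unusual about it, whereas the probability-based view flags it as an unlikely draw.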

Brain size, life history, and metabolism at the marsupial/placental dichotomy

The evolution of mammalian brain size is directly linked with the evolution of the brain’s unique structure and performance. Both maternal life history investment traits and basal metabolic rate (BMR) correlate with relative brain size, but current hypotheses regarding the details of these relationships are based largely on placental mammals. Using encephalization quotients, partial correlation analyses, and bivariate regressions relating brain size to maternal investment times and BMR, we provide a direct quantitative comparison of brain size evolution in marsupials and placentals, whose reproduction and metabolism differ extensively. Our results show that the misconception that marsupials are systematically smaller-brained than placentals is driven by the inclusion of one large-brained placental clade, Primates. Marsupial and placental brain size partial correlations differ in that marsupials lack a partial correlation of BMR with brain size. This contradicts hypotheses stating that the maintenance of relatively larger brains requires higher BMRs. We suggest that a positive BMR–brain size correlation is a placental trait related to the intimate physiological contact between mother and offspring during gestation. Marsupials instead achieve brain sizes comparable to placentals through extended lactation. Comparison with avian brain evolution suggests that placental brain size should be constrained due to placentals’ relative precociality, as has been hypothesized for precocial bird hatchlings. We propose that placentals circumvent this constraint because of their focus on gestation, as opposed to the marsupial emphasis on lactation. Marsupials represent a less constrained condition, demonstrating that hypotheses regarding placental brain size evolution cannot be generalized to all mammals.

Science

Oscillating Gene Expression Determines Competence for Periodic Arabidopsis Root Branching

Plants and animals produce modular developmental units in a periodic fashion. In plants, lateral roots form as repeating units along the root primary axis; however, the developmental mechanism regulating this process is unknown. We found that cyclic expression pulses of a reporter gene mark the position of future lateral roots by establishing prebranch sites and that prebranch site production and root bending are periodic. Microarray and promoter-luciferase studies revealed two sets of genes oscillating in opposite phases at the root tip. Genetic studies show that some oscillating transcriptional regulators are required for periodicity in one or both developmental processes. This molecular mechanism has characteristics that resemble molecular clock–driven activities in animal species.

[Perspectives] Transcription: Targeting the Core of Transcription

An enzyme that senses metabolic stress phosphorylates a chromatin protein to control gene expression and adaptive responses.