Self-renewal, the ability of a stem cell to divide repeatedly while maintaining an undifferentiated state, is a defining characteristic of all stem cells. Here, we clarify the molecular foundations of mouse embryonic stem cell (mESC) self-renewal by applying a proven Bayesian network machine learning approach to integrate high-throughput data for protein function discovery. By focusing on a single stem-cell system, at a specific developmental stage, within the context of well-defined biological processes known to be active in that cell type, we can produce consensus predictive networks that reflect biological reality more closely than those made by prior efforts using more generalized, context-independent methods. In addition, we show how machine learning approaches may be misled if the tissue-specific role of mammalian proteins is not defined in the training set and circumscribed in the evidential data.
For this study, we assembled an extensive compendium of mESC data: ~2.2 million data points, collected from 60 studies, under 992 conditions. We then integrated these data into a consensus mESC functional relationship network focused on biological processes associated with embryonic stem cell self-renewal and cell fate determination. You can view this network using our dynamic network visualization tool, StemSight Scout. Computational evaluations, literature validation, and analyses of predicted functional linkages show our results are highly accurate and biologically relevant. Our mESC self-renewal network predicts many novel players involved in self-renewal and serves as the foundation for future pluripotent stem cell studies. We encourage you to use it to explore hypotheses about gene function in the context of stem cell self-renewal and to prioritize genes for experimental validation.
Stem cells can divide symmetrically to generate two identical daughter cells or asymmetrically to produce one stem cell and one restricted progenitor cell.
A major challenge of ongoing research is to determine whether core conserved pathways can be distilled from the cacophony of biological interactions that direct embryonic and somatic stem cell fate and self-renewal. More than a dozen signaling pathways are implicated in self-renewal, suggesting regulation by a complex interplay of external signaling cues, transcriptional control, and molecular activities. Despite this inherent complexity, most models of self-renewal oversimplify the intricate dynamics associated with maintaining a cell lineage throughout development and adulthood.
A Bayesian network is a machine learning tool for organizing pieces of knowledge and encoding statistical dependence relationships among these pieces of knowledge. Such graphical models, in which each circle represents a node and each directed edge represents a dependence relationship, provide a flexible framework for combining different types of observed data and prior knowledge.
A naïve Bayesian network (Bayes net) is a simplified version of a Bayesian network in which all child nodes are dependent on the parent and conditionally independent of each other given the parent. This type of graphical device may be used to combine prior knowledge with different types of evidential data to generate probabilistic models of biological functional relationship networks, which are typically rendered as dense, complex graphs that represent molecular elements as nodes and predicted functional linkages between nodes as undirected edges. In our Bayes net structure, the functional relationship between the pair of proteins i and j (FRij) is a hidden conditional variable on which all evidential dataset variables are dependent; each dataset variable represents the discretized, observed similarity score in dataset k for proteins i and j. The edge weight (eij) represents the probability that proteins i and j are functionally related given the evidence observed in different high-throughput datasets. Strong evidence of a functional relationship between a pair of proteins, as measured by edge weight, indicates that the proteins behave in a similar way given patterns observed in the data. The specific nature of that relationship can be deduced by evaluating the types of datasets that support the edge.
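The edge-weight calculation described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the function name, the prior, the likelihood tables, and the number of bins are all hypothetical values chosen for illustration, and skipping unmeasured datasets is an assumption about how missing evidence might be handled.

```python
def posterior_related(prior, likelihood_tables, observed_bins):
    """Return P(FR_ij = 1 | evidence) under the naive Bayes assumption.

    prior             -- assumed P(FR_ij = 1) before seeing any data
    likelihood_tables -- one (p_bin_given_related, p_bin_given_unrelated)
                         pair of lists per dataset, indexed by score bin
    observed_bins     -- observed bin index per dataset, or None when the
                         pair was not measured in that dataset
    """
    related = prior
    unrelated = 1.0 - prior
    for (p_rel, p_unrel), bin_index in zip(likelihood_tables, observed_bins):
        if bin_index is None:
            continue  # assumption: unmeasured datasets contribute no evidence
        # Conditional independence lets us multiply per-dataset likelihoods.
        related *= p_rel[bin_index]
        unrelated *= p_unrel[bin_index]
    # Normalize over the two states of the hidden variable FR_ij.
    return related / (related + unrelated)


# Illustrative numbers only (not taken from the study).
tables = [
    ([0.2, 0.3, 0.5], [0.6, 0.3, 0.1]),  # hypothetical dataset 1, three bins
    ([0.1, 0.9], [0.5, 0.5]),            # hypothetical dataset 2, two bins
]
edge_weight = posterior_related(0.1, tables, [2, None])
```

With no observed evidence, the function returns the prior unchanged. Each dataset in which the pair scores in a bin that is more common among related pairs pushes the edge weight upward.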
Our Bayes net method is designed to generate reliable and relevant predictive biological networks using high-throughput data limited to a specific cell type and a gold standard training set of positive and negative examples focused on biological processes known to be active in that cell type.
We trained a Bayesian classifier to make posterior predictions of functional relationships among 21,291 protein-coding genes using:
- A manually curated positive reference of 2,056 pair-wise gene relationships (with a prior of 1) among 354 genes associated with mESC self-renewal or annotated to signaling pathways involved in early embryonic development, based on information extracted from 98 journal articles. We combined this positive reference with 20,560 randomly generated negative pairs (with a prior of 0) to generate a training gold standard with a class distribution of 1:10.
- An mESC data compendium representing 60 independent research studies. This compendium comprised ~2.2 million data points, collected over 992 conditions using 6 different high-throughput experimental techniques, and encompassed > 6 billion gene-pair measurements.
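As a rough illustration of how a gold standard with a 1:10 class distribution could be assembled, the Python sketch below pairs a set of curated positives with randomly drawn negatives. The function name, the placeholder gene identifiers, and the choice to exclude known positives from the negative set are assumptions for this example; the source does not describe the sampling procedure in that detail.

```python
import random


def build_gold_standard(positive_pairs, all_genes, negatives_per_positive=10, seed=0):
    """Return a list of ((gene_a, gene_b), label) with label 1 or 0."""
    rng = random.Random(seed)
    # Store pairs as frozensets so (a, b) and (b, a) count as the same pair.
    positives = {frozenset(pair) for pair in positive_pairs}
    target = negatives_per_positive * len(positives)

    n = len(all_genes)
    if n * (n - 1) // 2 - len(positives) < target:
        raise ValueError("not enough distinct gene pairs to draw negatives")

    negatives = set()
    while len(negatives) < target:
        pair = frozenset(rng.sample(all_genes, 2))
        # Assumption: a random pair must not duplicate a curated positive.
        if pair not in positives:
            negatives.add(pair)

    gold = [(tuple(sorted(p)), 1) for p in positives]
    gold += [(tuple(sorted(p)), 0) for p in negatives]
    return gold


# Placeholder identifiers, not real gene symbols from the study.
genes = [f"gene{i}" for i in range(30)]
curated = [("gene0", "gene1"), ("gene1", "gene2")]
gold_standard = build_gold_standard(curated, genes)
```

Applied to 2,056 positive pairs, the same ratio yields the 20,560 negatives reported above.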
Performance metrics and cross validation confirmed this approach helps achieve optimal results for mammalian systems.
Computational assessment of network performance using standard machine learning metrics showed that precision was 90% at 10% recall and 60% at 25% recall, both before and after regularization and out-of-bag averaging to correct for overfitting to noise. The area under the Receiver Operating Characteristic curve (AUC) was initially 0.7479; after regularization and bagging, the final mESC network AUC was 0.7165. Top-ranked, high-confidence edges were supported by a diversity of high-throughput data, but predominantly by protein-DNA binding data. Functional annotation analyses showed the most strongly connected genes in the mESC network were highly enriched for stem-cell-related processes, including development, maintenance, and differentiation, as well as processes associated with transcriptional regulation and chromatin modification.
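For readers unfamiliar with these metrics, the following Python sketch shows one simple way to compute precision at a given recall level and the AUC from a ranked list of predicted edges. The function names and the toy scores are hypothetical, ties in the ranking are not given special treatment in the precision calculation, and this is not the evaluation code used for the network.

```python
def precision_at_recall(scores, labels, recall_level):
    """Precision at the first rank where recall reaches recall_level."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive example is required")
    # Rank predictions from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    true_pos = 0
    for n, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= recall_level:
            return true_pos / n
    return true_pos / len(ranked)


def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    # Count wins of positives over negatives; ties count as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: six predicted edges with gold-standard labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 1, 0, 0]
p_low = precision_at_recall(toy_scores, toy_labels, 0.3)
p_full = precision_at_recall(toy_scores, toy_labels, 1.0)
auc = roc_auc(toy_scores, toy_labels)
```

The pairwise AUC calculation is quadratic in the number of examples, so it suits small illustrations; a rank-based formulation would be preferable for a network-scale evaluation.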
This website is designed to provide you access to our underlying data, dynamic network visualization, and functional analyses of genes and proteins likely to be related in the context of self-renewal. This comprehensive online resource can be used as a reference for hypothesis creation and experimental design.
Future studies will examine shared and unique molecular characteristics of mouse stem cell and cell fate pathways in embryonic and adult stem cells as well as induced pluripotent and cancer stem-like cells. Comparative studies will contrast human and mouse stem cell fate pathways.