T cell precursors are produced in the bone marrow. They are transported, as pro-thymocytes, to the thymus, where they undergo maturation and selection. The regulation of T cell maturation in the thymus is termed ‘central tolerance’. During gestation, most T cells generated bear the gamma/delta T cell receptor (TcR) on their surface. In the adult, most T cells bear the alpha/beta TcR. The newly formed TcR then has to be tested for recognition of self-MHC/peptide. The T cells are tested at a stage of development known as double positive, meaning that they bear both CD4 and CD8 co-receptors on their surface. Cells with TcRs that recognize self-MHC/peptide with very low affinity die, a process known as death by neglect. Cells with TcRs with medium affinity for MHC receive survival signals and undergo a process known as positive selection. Finally, cells that receive a high-affinity signal via their TcR die by apoptosis, a process known as negative selection. Cells that interact with MHC class I become CD8-positive T cells, and those that interact with MHC class II become CD4-positive T cells, before migrating out into the peripheral lymphoid system (Wood P, 2006).
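The three outcomes of thymic selection described above can be summarized as a simple decision rule. The following is only an illustrative sketch: the 0 to 1 affinity scale, the two cut-off values and the function names are assumptions made for the example, not measured quantities.

```python
from enum import Enum


class Fate(Enum):
    DEATH_BY_NEGLECT = "death by neglect"
    POSITIVE_SELECTION = "positive selection"
    NEGATIVE_SELECTION = "negative selection"


# Hypothetical cut-offs on an arbitrary 0-1 affinity scale (illustration only).
LOW_CUTOFF = 0.2
HIGH_CUTOFF = 0.8


def thymocyte_fate(affinity, low=LOW_CUTOFF, high=HIGH_CUTOFF):
    """Map TcR affinity for self-MHC/peptide to one of the three outcomes."""
    if affinity < low:
        return Fate.DEATH_BY_NEGLECT      # very low affinity: no survival signal
    if affinity > high:
        return Fate.NEGATIVE_SELECTION    # high affinity: apoptosis
    return Fate.POSITIVE_SELECTION        # medium affinity: survival signal


def lineage(mhc_class):
    """Lineage choice of a positively selected, double-positive thymocyte."""
    if mhc_class == "I":
        return "CD8"
    if mhc_class == "II":
        return "CD4"
    raise ValueError("mhc_class must be 'I' or 'II'")


if __name__ == "__main__":
    for affinity in (0.05, 0.5, 0.95):
        print(affinity, thymocyte_fate(affinity).value)
```

In reality the boundaries between these outcomes are not sharp thresholds, so the sketch should be read as a mnemonic for the logic of selection rather than a quantitative model.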
Mature B cells, like T cells, also develop from pluripotent stem cells. However, unlike T lymphocytes, B cells mature in the bone marrow. There are four stages of B cell development: pro-B, pre-B, immature B, and mature B cells. During development, B cells acquire surface markers such as B220, CD19 and CD20, as well as antigen receptors. The stromal cells lining the bone marrow provide essential growth signals to developing B cells, including cytokines such as IL-7 and cell-to-cell contact via VLA-4/VCAM and Kit/SCF. During B cell development, gene segment rearrangements take place, just as TcR gene rearrangements take place in developing T cells. In B cells, however, it is the immunoglobulin (Ig) heavy chain gene locus, situated on chromosome 14 and made up of variable (V), diversity (D) and joining (J) segments, that rearranges. In haematopoietic stem cells, the Ig heavy chain genes are in germline configuration (Kurosaki T et al., 2009). As B cells develop into pro-B cells, a D-J recombination is the first gene rearrangement to take place. The intervening DNA is normally deleted from the chromosome as a circle. Gene rearrangements are mediated by the RAG proteins, the products of the recombination-activating genes. As the developing B cell proceeds from the pro-B to the pre-B cell stage, a V-DJ rearrangement takes place to form the VDJ coding block that encodes the variable domain of the antibody heavy chain. Gene rearrangement can take place on both copies of chromosome 14 in a developing B cell, but once a productive VDJ block has been assembled on one chromosome 14, rearrangement ceases on the other, ensuring that only one type of Ig is produced by any single B cell. This process is known as allelic exclusion. If a developing B cell fails to make a productive VDJ block, it will fail to produce an antibody heavy chain and will die in the bone marrow (Murphy K et al, 2008).
T and B cell activation:
T cell activation takes place in the draining lymph nodes (and the spleen) close to the site of infection. A T cell that recognizes antigen presented on MHC (major histocompatibility complex) molecules becomes activated and differentiates into an effector cell. Effector T cells then migrate to the site of infection and carry out their effector functions. T lymphocytes arrive through venules and cross the endothelium into the lymph node. There, antigen-presenting cells (APCs) such as dendritic cells and macrophages present antigen to the T cells. Initially, only a low-affinity interaction forms between the T cell and the APC. T cells that do not recognize the antigen then leave the lymph node through the lymphatic system. T cells that recognize the antigen with high affinity are retained, and proliferation and differentiation follow. Initial B cell activation, by contrast, takes place in the T cell zone of secondary lymphoid tissues (e.g. lymph nodes). Mostly IgM-producing plasma cells are generated at this stage. B cells, unlike T cells, are activated by interaction with an antigen-specific T cell, through linked recognition. The antigen-activated B cell then migrates to the B cell area of the lymph node to form organized germinal centres, where additional B cell differentiation processes take place. It is important to note that T cells recognize processed peptide presented on MHC, while B cells recognize the native antigen, such as a viral coat protein.
Two signals are thought to be required for T and B lymphocyte activation: first, the antigen stimulus, and second, a co-stimulatory signal. The absence of the second signal results in anergy or apoptosis. The CD28/B7 interaction provides the co-stimulatory signal for T cells, while the interaction between CD40 on the B cell and CD40 ligand on the activated T cell provides it for B cells. In the resting (G0) phase of the cell cycle, both T and B lymphocytes have a large nucleus, little cytoplasm and little evidence of organelles. When these cells enter the G1/S/G2 phases of the cell cycle, they increase in size and their chromatin decondenses. Cell division then occurs rapidly, generating effector T or B cells. Effector T cells include Th1, Th2 and regulatory T cells, as well as cytotoxic T cells and memory T cells. Effector B cells, on the other hand, include plasma cells and memory B cells.
T and B cell effector functions:
The B cell response to a T-dependent protein antigen results in the formation of germinal centres in the B cell areas of lymph nodes. Specialized processes take place there, including Ig class switching, somatic hypermutation and affinity maturation, and the generation of memory B cells and plasma cells. The B cells that emerge from germinal centres are somatically mutated and class-switched, and no longer produce only IgM. Memory B cells are long-lived, resting, recirculating cells. They are responsible for the immunological memory that underlies immunization, and they help to generate a rapid and vigorous immune response on a second encounter with the same antigen. Plasmablasts migrate to other sites such as the bone marrow and become plasma cells, which produce large amounts of secreted antibody and some of which can live for long periods. The effector functions of B cells are the actions of antibodies after contact with antigen. These antibody effector functions include neutralization, complement fixation (IgM, IgG1/2/3), opsonization and antibody-dependent cell-mediated cytotoxicity.
T cell effector functions differ significantly from those of B cells. Antigen-presenting cells present peptide on MHC molecules, which can interact with either CD4 or CD8 T cells. Helper T cells are defined by the cytokines they produce. Naïve CD4 T cells (Th0), on interaction with an APC, can differentiate into Th1 or Th2 cells, depending on the cytokine environment. Th1 cells co-ordinate inflammatory immune responses to intracellular pathogens, while Th2 cells help B cells to make the antibodies required for immune responses to extracellular pathogens; the latter is known as humoral immunity. Th1 and Th2 cells both act to promote the generation of more leukocytes. Besides Th0/Th1/Th2, other CD4 T cell subsets exist (Zhu J et al., 2010). Resting T cells can differentiate into activated helper T cells as well as activated cytotoxic T cells (CD8 T cells). Initially, CD8 T cells interact with potential target cells via low-affinity, non-specific interactions between adhesion molecules on the T cell (LFA-1 and CD2) and on the target cell (ICAM1, ICAM2). This interaction has no effect on the cytoskeleton of the T cell and remains transient unless specific peptide:MHC complexes are recognized. If the peptide:MHC I complex is present, the affinity of the adhesion molecule interaction increases, and the T cell receptor and associated molecules cluster at the point of contact with the target cell, forming the immunological synapse. This also signals for cytoskeletal rearrangements, organized by the microtubule-organizing centre, which focus the cytotoxic granules of the T cell at the point of contact with the target. Note that T cells, unlike B cells, do not produce antibodies against antigens. Instead, granules containing perforin and other enzymes, including granzymes, are released and induce the activation of the caspase pathways in the target cell, leading to apoptosis. CD8 T cells can also kill target cells via the Fas/FasL pathway, which also induces apoptosis (Peter EJ 2007).
In conclusion, adaptive immune responses occur when individual lymphocytes capable of responding to antigen proliferate and differentiate to become antigen-specific effector cells and memory cells. The process of lymphocyte cell cycle progression, proliferation and differentiation in response to antigen and other stimuli is known as lymphocyte activation. B cell activation is initiated by the ligation of the B cell receptor (BCR) with antigen and ultimately results in the production of protective antibodies against potentially pathogenic invaders. When naive or memory T cells encounter foreign antigen along with proper co-stimulation, they undergo rapid and extensive clonal expansion. In humans, this type of proliferation is fairly unique to cells of the adaptive immune system and requires a considerable expenditure of energy and cellular resources.
The Central Dogma of Molecular Biology
The molecule we know today as deoxyribonucleic acid (DNA) was first observed in 1869 by Swiss biologist Friedrich Miescher, who stumbled upon a substance that was resistant to protein digestion. At the time he referred to the molecule as ‘nuclein’ (Pray, 2008). Though Miescher remained in obscurity, Russian biochemist Phoebus Levene continued work with this substance and in 1919 identified the three major components of a nucleotide: phosphate, sugar, and base. He noted that the sugar component was ribose for RNA and deoxyribose for DNA, and he proposed that nucleic acids were made up of a chain of nucleotides (Levene, 1919). He was largely correct. In 1950 Erwin Chargaff, after reading a paper in which Oswald Avery identified DNA as the material of which genes, the units of heredity, are composed (Avery, 1944), set out to discover whether the deoxyribonucleic acid molecule differed among species. In contrast to Levene’s proposal that nucleotides are always repeated in the same order, Chargaff found that nucleotides appear in different orders in different organisms, although the molecules maintain certain characteristics. This led him to develop a set of rules (known as ‘Chargaff’s Rules’), which state that the total number of purines (adenine and guanine) and the total number of pyrimidines (cytosine and thymine) are almost always equal in an organism’s genetic material. In 1952 Rosalind Franklin and Maurice Wilkins used X-ray crystallography to capture the first image of the molecule’s shape, and in 1953 James Watson and Francis Crick finally proposed the three-dimensional model for DNA (Watson, 1953). The four main tenets of their discovery still hold true today: 1) DNA is a double-stranded helix, 2) the majority of these helices are right-handed, 3) the helices are anti-parallel, and 4) the DNA base pairs within the helix are joined by hydrogen bonding, and the bases can hydrogen bond with other molecules such as proteins.
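Chargaff's equality of purines and pyrimidines follows directly from complementary base pairing in the double helix, where adenine pairs with thymine and guanine with cytosine. The short sketch below illustrates this by counting bases across both strands of a duplex. The example sequence and the function names are made up for illustration.

```python
from collections import Counter

# Watson-Crick pairing partners.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the anti-parallel complementary strand."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def duplex_base_counts(strand):
    """Count each base over both strands of the double helix."""
    return Counter(strand) + Counter(reverse_complement(strand))


def purine_pyrimidine_totals(counts):
    """Return (purines, pyrimidines) = (A + G, C + T)."""
    return counts["A"] + counts["G"], counts["C"] + counts["T"]


if __name__ == "__main__":
    counts = duplex_base_counts("ATGCGGATTA")   # arbitrary example sequence
    print(dict(counts))
    print(purine_pyrimidine_totals(counts))     # (10, 10)
```

Whatever single strand is supplied, the duplex totals come out equal, because every purine on one strand is matched by a pyrimidine on the other.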
The Central Dogma of Molecular Biology, first proposed by Francis Crick (Crick, 1958), describes the directional processes of conversion from DNA to RNA and from RNA to protein. This gene expression process starts with DNA, a double-stranded molecule consisting of base-paired nucleotides, carrying the bases adenine (A), cytosine (C), guanine (G), and thymine (T), on a sugar-phosphate backbone. This genetic material serves as the information storage for life, a dictionary of sorts that provides all of the necessary tools for an organism to create the components of itself. During the process of transcription, the DNA molecule is used to make messenger RNA (mRNA), which carries a specific instance of the DNA instructions to the machinery that will make protein. Proteins are synthesized during translation using the mRNA molecule as a guide. Gene expression is a deterministic process during which each molecule is manufactured using the product of the previous step. The end result is a conversion from the genetic code into a functional unit that can be used to perform the work of the cell. This process must be controlled by an organism in order to make efficient use of resources, respond to environmental changes, and differentiate cells within the body. Gene regulation, as it is sometimes called, occurs at all stages along the way from DNA to protein.
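The two steps described above, transcription followed by translation, can be illustrated with a minimal sketch. It assumes that the input is the coding (sense) strand of the DNA, so that transcription amounts to replacing T with U, and it uses the standard genetic code. The function names and the example sequence are illustrative only, and real gene expression involves many steps (splicing, capping, regulation) that are omitted here.

```python
# Standard genetic code, built from the conventional UCAG ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def transcribe(coding_strand):
    """DNA coding strand -> mRNA (same sequence with U in place of T)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon ('*')."""
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":          # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTGGTAA")
    print(mrna)             # AUGGCCUGGUAA
    print(translate(mrna))  # MAW
```

Each function consumes the product of the previous step, which mirrors the directional flow from DNA to mRNA to protein.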
Regulation falls into four categories: 1) epigenetic (methylation of DNA or protein, acetylation), 2) transcriptional (involving proteins called transcription factors), 3) post-transcriptional (sequestration of RNA, alternative splicing of mRNA, microRNA (miRNA) and small interfering RNA (siRNA)), and 4) post-translational modification (phosphorylation, acetylation, methylation, ubiquitination, etc. of protein products). Epigenetic regulation involves a reversible, heritable change that does not alter the DNA sequence itself. DNA methylation occurs on the base cytosine. In proteins, arginine and lysine are the most commonly methylated amino acids. When proteins called histones carry certain methylated residues, they can repress or activate gene expression. Often this occurs at the transcriptional level, where repression prevents the cell from manufacturing messenger RNA (mRNA), the template for proteins. Proteins are often referred to as the workhorses of the cell and are responsible for everything from catalyzing chemical reactions to providing the building blocks for skeletal muscles. Some proteins, called transcription factors, help to up- or down-regulate gene expression levels. These proteins can act alone or in conjunction with other transcription factors, and they bind to DNA bases near gene coding regions.
This is a general schema for gene expression. DNA is a double-stranded molecule consisting of base-paired nucleotides (A, C, G, and T) on a sugar-phosphate backbone and is used as information storage. mRNA is made during transcription and carries a specific instance of the DNA instructions to the machinery that will make the protein. Proteins are synthesized during translation using the information in mRNA as a template. This is a deterministic process during which each molecule is manufactured using the product of the previous step. mRNA requires a 5′ cap and a 3′ poly(A) tail in order to be exported out of the nucleus. The cap is critical for recognition by the ribosome and for protection from enzymes called RNases, which break down the molecule. The poly(A) tail and the protein bound to it help protect mRNA from degradation by other enzymes called exonucleases.
What can be gained by studying gene regulation? In general, it allows us to understand how an organism evolves and develops, both on a local scale (Choe, 2006; Wilson, 2008) and on a more global network level. There are, however, more specific reasons to investigate this process closely. Failure in gene regulation has been shown to be a key factor in disease (Stranger, 2007). Additionally, learning how to interrupt gene regulation may lead to the development of drugs to fight bacteria and viruses (McCauley, 2008). A clearer understanding of this process in microorganisms may lead to possible solutions to the problem of antimicrobial resistance (Courvalin, 2005).
There are two major factors that motivate the studies herein. Firstly, the size and quality of biological data sets have increased dramatically in the last several years. This is due to high-throughput experimental techniques and technology, which have provided large amounts of interaction data, along with X-ray crystallography and nuclear magnetic resonance (NMR) experiments, which have given us solved three-dimensional protein structures. Secondly, machine learning has become an increasingly popular tool in bioinformatics research because it allows for more sound gene and protein annotation without relying solely on sequence similarity. If a collection of attributes that distinguish between two classes of proteins can be assembled, function can be predicted.
In this work we focus mainly on regulation at the transcriptional level and the components that play a commanding role in this operation. So-called nucleic acid-binding (NA-binding) proteins, which include transcription factors, are involved in this and many other cellular processes. Disruption or malfunction of transcriptional regulation may result in disease. We identify these proteins from representative data sets that include many categories of proteins. Additionally, in order to understand the underlying mechanisms, we predict the specific residues involved in nucleic acid binding using machine learning algorithms. Identification of these residues can provide practical assistance in the functional annotation of NA-binding proteins. These predictions can also be used to expedite mutagenesis experiments, guiding researchers to the correct binding residues in these proteins.
Toward the ultimate goal of attaining a deeper understanding of how nucleic acid-binding proteins facilitate the regulation of gene expression within the cell, the research described here focuses on three particular aspects of this problem. We begin by examining the nucleic acid-binding proteins themselves, both on the protein and residue levels. Next, we turn our attention toward protein binding sites on DNA molecules and a particular type of modification of DNA that can affect protein binding. We then take a global perspective and study human molecular networks in the context of disease, focusing on regulatory and protein-protein interaction networks. We examine the number of partnership interactions between transcription factors and how it scales with the number of target genes regulated. In several model organisms, we find that the distribution of the number of partners vs. the number of target genes appears to follow an exponential saturation curve. We also find that our generative transcriptional network model follows a similar distribution in this comparison. We show that cancer- and other disease-related genes preferentially occupy particular positions in conserved motifs and find that more ubiquitously expressed disease genes have more disease associations. We also predict disease genes in the protein-protein interaction network with 79% area under the ROC curve (AUC) using ADTree, which identifies important attributes for prediction such as degree and disease neighbor ratio. Finally, we create a co-occurrence matrix for 1854 diseases based on shared gene uniqueness and find both previously known and potentially undiscovered disease relationships.
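To make the two topological attributes named above concrete, the following sketch computes degree and disease neighbor ratio on a toy undirected network. The definition of disease neighbor ratio used here, the fraction of a protein's direct interaction partners that are known disease genes, is an assumption for illustration, as are the gene labels G1 to G5 and the function names.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degree(adjacency, node):
    """Number of direct interaction partners of a node."""
    return len(adjacency.get(node, ()))


def disease_neighbor_ratio(adjacency, disease_genes, node):
    """Assumed definition: share of a node's neighbors that are disease genes."""
    neighbors = adjacency.get(node, set())
    if not neighbors:
        return 0.0
    return len(neighbors & disease_genes) / len(neighbors)


if __name__ == "__main__":
    # Hypothetical toy network and disease gene set.
    edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3"), ("G4", "G5")]
    disease_genes = {"G2", "G3"}
    network = build_adjacency(edges)
    for gene in sorted(network):
        print(gene, degree(network, gene),
              round(disease_neighbor_ratio(network, disease_genes, gene), 2))
```

Features of this kind can be assembled into one row per protein and passed to any classifier; the ADTree classifier itself is not reproduced here.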
The goal for this project is to predict nucleic acid-binding on both the protein and residue levels using machine learning. Both sequence- and structure-based features are used to distinguish nucleic acid-binding proteins from non-binding proteins, and nucleic acid-binding residues from non-binding residues. A novel application of a costing algorithm is used for residue-level binding prediction in order to achieve high, balanced accuracy when working with imbalanced data sets.
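The reason balanced accuracy matters for imbalanced data can be seen in a small worked example. The sketch below is not the costing algorithm referred to above, whose details are not given here. It only shows, under assumed toy labels, how plain accuracy can look high when binding residues are rare, and how per-class weights inversely proportional to class frequency (one common cost-sensitive heuristic) would up-weight the rare class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = binding)."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return (true_pos / positives + true_neg / negatives) / 2


def inverse_frequency_weights(y_true):
    """Per-class misclassification costs inversely proportional to frequency."""
    n = len(y_true)
    positives = sum(1 for t in y_true if t == 1)
    negatives = n - positives
    return {1: n / (2 * positives), 0: n / (2 * negatives)}


if __name__ == "__main__":
    # Toy data: 10 binding residues among 100, and a predictor that always says 0.
    y_true = [1] * 10 + [0] * 90
    y_pred = [0] * 100
    print(accuracy(y_true, y_pred))            # 0.9
    print(balanced_accuracy(y_true, y_pred))   # 0.5
    print(inverse_frequency_weights(y_true))
```

A predictor that never predicts binding scores 90% accuracy on this toy set but only 50% balanced accuracy, which is why the balanced measure is the more informative target.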
During the past few decades, the amount of biological data available for analysis has grown exponentially. Along with this vast amount of information comes the challenge of making sense of it all. One subject of immediate concern to us as humans is health and disease. Why do we get sick, and how? Where do our bodies fail on a molecular level in order for this to happen? How are diseases related to each other, and do they have similar modes of action? These questions will require many researchers from multiple disciplines to answer, but where do we start? We take a bioinformatics approach and examine disease genes in a network context. In this chapter we analyze human disease and its relationship to two molecular networks. First, we find conserved motifs in the human transcription factor network and identify the location of disease- and cancer-related genes within these structures. We find that both cancer and disease genes occupy certain positions within these motifs more frequently than others. Next, we examine the human protein-protein interaction (PPI) network as it relates to disease. We find that we are able to predict disease genes with 79% AUC using ADTree with 10 topological features. Additionally, we find that a combination of several network characteristics, including degree centrality and disease neighbor ratio, helps distinguish between disease and non-disease genes. Furthermore, an alternating decision tree (ADTree) classifier allows us to see which combinations of strongly predictive attributes contribute most to protein-disease classification. Finally, we build a matrix of diseases based on shared genes. Instead of using the raw count of genes, we use a uniqueness score for each disease gene that relates to the number of diseases with which a gene is involved. We show several interesting examples of disease relationships for which there is some clinical evidence and some for which the information is lacking.
We believe this matrix will be useful in finding relationships between diseases with very different phenotypes, or for those disease connections which may not be obvious. It could also be helpful in identifying new potential drug targets through drug repositioning.
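The idea of weighting shared genes by uniqueness can be sketched as follows. The exact scoring used in this work is not stated in this section, so the sketch assumes a simple stand-in: a gene's uniqueness is the reciprocal of the number of diseases it is involved in, and the matrix entry for a pair of diseases is the sum of the uniqueness scores of the genes they share. The disease and gene labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def uniqueness_scores(disease_to_genes):
    """Assumed score: 1 / (number of diseases in which the gene appears)."""
    disease_count = defaultdict(int)
    for genes in disease_to_genes.values():
        for gene in genes:
            disease_count[gene] += 1
    return {gene: 1.0 / n for gene, n in disease_count.items()}


def cooccurrence_matrix(disease_to_genes):
    """Pairwise disease scores: summed uniqueness of the shared genes."""
    scores = uniqueness_scores(disease_to_genes)
    matrix = {}
    for a, b in combinations(sorted(disease_to_genes), 2):
        shared = disease_to_genes[a] & disease_to_genes[b]
        matrix[(a, b)] = sum(scores[gene] for gene in shared)
    return matrix


if __name__ == "__main__":
    # Hypothetical toy associations: g2 is promiscuous, g1 is more specific.
    toy = {
        "D1": {"g1", "g2"},
        "D2": {"g1", "g2", "g3"},
        "D3": {"g2"},
    }
    for pair, score in cooccurrence_matrix(toy).items():
        print(pair, round(score, 3))
```

Under this assumed scoring, a gene shared by only two diseases contributes more to their relationship than a gene involved in many diseases, so D1 and D2 score higher than either does with D3.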