| 7. Working with Microarray Data | |
7.1 Loading Microarray Data

Before using the Microarray data environment of BiologicalNetworks, make sure you have Microarray files to open. Example data files can be downloaded here: http://brak.sdsc.edu/pub/BiologicalNetworks/MicroarrayData.zip

Expression data are imported from the Microarray submenu of the BiologicalNetworks main window. See Figure 7.1.1.

Fig. 7.1.1 Load Microarray Data.

Choosing one of the available file types opens the corresponding Import Expression Data Wizard. See Figure 7.1.2. The Import Expression Data Wizards allow you to import the expression data into BiologicalNetworks.

Importing Stanford (tab delimited) Data
To import data in Stanford (tab delimited) TXT or MS Excel formats:

Importing Affymetrix Microarray Data
To import microarray data in Affymetrix format:

Importing Microarray Data in GPR Format
To import microarray data in GPR format:

Fig. 7.1.2 Load Stanford (tab delimited) data type file Wizard.
7.2 Color schemes and Visual styles settings

When an expression experiment is opened as a heat map, a colored box represents the expression level of each gene (protein). There are two default color schemes in the Expression Experiment Viewer, corresponding to the data formats supported by the software (Signal and Ratio). While importing expression data, choose the color scheme in the Apply Custom Color Scheme submenu of the Microarray menu. Later, in the opened experiment, you can change the color scheme in the Expression Viewer Toolbar.

Ratio data: the color intensity is proportional to the log ratio of the current sample to the base sample and is represented as a double gradient color map. Negative values are acceptable in this format. On the heat map, green represents negative log ratios and red represents positive log ratios. Greens of increasing intensity correspond to increasingly negative log ratios; reds of increasing intensity correspond to increasingly positive log ratios. In the Color Settings dialog window, you can set the color range for the min and max ratio values, the cutoff values, and the color range for missing data.

Signal data: the color intensity is proportional to the signal value and is represented as a single gradient color map. There are no negative values in this data representation. By default, the software uses green for low expression values, red for high expression values, and yellow for missing values.

Visual Styles for Gene Expression
Visual styles are used to make your gene expression map more intuitive and clear.

7.3 Expression Experiment Viewer

Loaded Microarray data appear in the Expression Experiment Viewer Panel at the bottom of the BiologicalNetworks main window. The Expression Experiment Viewer is designed to display a graphical representation of gene expression and proteomics experiment data, usually generated by microarray experiments. It provides the algorithms and workspace for examining the data from expression or proteomics experiments and for superimposing these data onto opened pathways and gene regulatory networks. See Figure 7.3.1.

Fig. 7.3.1 Microarray data file loaded.

The functions available from the Microarray submenu and the Microarray Experiment Manager menu bar allow the user to:
- Create new pathways as well as new groups from an expression experiment.
- Select a number of genes and create a group or a pathway from them.
- Display expression data on an existing pathway diagram, using different shades of green/red depending on the fold change of expression.

Numerous clustering, filtering, normalization, and search methods are available in BiologicalNetworks.

7.4 Expression viewer toolbar

The Expression Viewer toolbar provides a wide range of functions:

Fig. 7.4.1 Expression viewer toolbar

7.5 Filtering, Normalization and Data Transformation

Adjust Data

Different types of adjustments may be applied on top of one another in any sequence, and the same type of adjustment may be applied repeatedly to the matrix. Adjustments may not necessarily affect the main display or the values displayed when elements are clicked on the matrix displays, but they do influence the calculation of the expression matrix, which is the foundation of all analyses. Adjustments are also reflected when the entire matrix or individual clusters are saved as text files, although the original data files are not overwritten. Furthermore, with the exception of three options (“Set Lower Cutoffs”, “Set Percentage Cutoffs” and “Adjust Intensities of Zero”), all changes made to an expression matrix are irreversible for the current session.

Because of these features, a good way to use the options is to apply any required adjustments to the data set, save the entire adjusted matrix as a tab delimited text file (using the “Save Microarray Matrix” option under the “Microarray” menu), and then load this new file in a new session, during which no further data adjustments are made. This ensures consistency throughout the session.

Adjustment options are described below:

Data Transformations
The log2 transformation is straightforward: it takes the log2 of every element in the matrix.
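As a minimal sketch of the transformations in this section, assuming the expression matrix is held as a list of rows of numeric values: the function names and the handling of non-positive values are illustrative assumptions, not a description of how BiologicalNetworks implements these options.

```python
import math


def log2_transform(matrix):
    """Return a new matrix holding log2 of every element.

    Assumption: values that are missing (None) or not positive have no
    logarithm and are returned as None.
    """
    return [
        [math.log2(v) if v is not None and v > 0 else None for v in row]
        for row in matrix
    ]


def log10_to_log2(value):
    """Convert log10(x) to log2(x) using log2(x) = log10(x) / log10(2)."""
    return value / math.log10(2)


# Hypothetical ratios for two genes across two samples.
ratios = [[1.0, 8.0], [0.0, 0.5]]
print(log2_transform(ratios))  # [[0.0, 3.0], [None, -1.0]]

# log10(x) = 2 means x = 100, so this prints log2(100).
print(log10_to_log2(2.0))
```

Both functions return new values rather than modifying their input, which mirrors the statement that the original data files are not overwritten.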
Note that this adjustment should not usually be necessary. The program automatically computes the log2 ratio of the two intensities and uses it in the expression matrix. TDMS files also often contain pre-calculated log2 ratios.

A second transformation assumes that the current data are log10 transformed and converts them to log base 2; that is, it assumes that the input data are of the form log10(x) and outputs log2(x).

Data Filters (Data Quality and Variance Based Filters)

Select Use Lower Cutoffs to exclude from analysis any genes for which the expression values (in either the corresponding Cy3 or Cy5 columns) are lower than specified values. Select Set Lower Cutoffs to set these values. To enable this option, check the “Use Lower Cutoffs” checkbox just below the “Set Lower Cutoffs” menu option; uncheck it to disable the option. All subsequent analyses will include only those genes for which all Cy3 and Cy5 values are above the specified thresholds. This option is disabled by default.

Select Use Percentage Cutoffs to ignore the genes for which there are not enough valid (non-zero) expression values across all samples. This does not delete any data; it only excludes the genes from analysis. This option is sometimes useful in speeding up module calculations, since many zeros often slow them down. To determine which genes are excluded, select Set Percentage Cutoffs and enter a percentage value. To enable this option, check the “Use Percentage Cutoffs” checkbox just below the “Set Percentage Cutoffs” menu option; uncheck it to disable the option. Genes with less than the specified percentage of non-zero values are ignored. A value of 0.0% indicates that all genes will be used in the analysis. To require that every one of a gene’s expression values be valid for the gene to be included, set the value to 100. This option is disabled by default.

The variance filter allows the removal of genes with low variation of expression over the loaded samples.
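The percentage cutoff and the variance filter criteria described in this section can be sketched as follows. This is an illustration under stated assumptions, not the BiologicalNetworks implementation: the matrix is a hypothetical dictionary mapping gene names to expression values, all function names are invented for the example, and the sample standard deviation is computed over all values of a gene.

```python
from statistics import stdev


def passes_percentage_cutoff(values, min_percent):
    """True if at least min_percent of the values are valid (non-zero)."""
    valid = sum(1 for v in values if v != 0)
    return 100.0 * valid / len(values) >= min_percent


def rank_by_sd(matrix):
    """Gene names ordered from highest to lowest standard deviation."""
    return sorted(matrix, key=lambda gene: stdev(matrix[gene]), reverse=True)


def top_sd_percentage(matrix, percent):
    """Keep the given percentage of the SD-ranked gene list."""
    ranked = rank_by_sd(matrix)
    return ranked[: int(len(ranked) * percent / 100)]


def top_sd_count(matrix, count):
    """Keep the requested number of highest-SD genes."""
    return rank_by_sd(matrix)[:count]


def above_sd_cutoff(matrix, cutoff):
    """Keep every gene whose SD is greater than the cutoff value."""
    return [g for g in rank_by_sd(matrix) if stdev(matrix[g]) > cutoff]


# Hypothetical expression values for five genes across four samples.
matrix = {
    "geneA": [1.0, 5.0, 9.0, 13.0],  # highly variable
    "geneB": [2.0, 2.1, 2.0, 2.1],   # a 'flat' gene
    "geneC": [0.0, 0.0, 0.0, 4.0],   # only 25% valid values
    "geneD": [3.0, 4.0, 3.0, 4.0],
    "geneE": [1.0, 3.0, 5.0, 7.0],
}

# The percentage cutoff runs first, so variance is only assessed for
# genes that have enough valid data.
filtered = {
    g: v for g, v in matrix.items() if passes_percentage_cutoff(v, 50.0)
}
most_variable = top_sd_percentage(filtered, 50)
print(most_variable)  # ['geneA', 'geneE']
```

Here geneC is excluded by the 50% cutoff, and the variance filter then keeps the top half of the four remaining genes.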
This filter is mainly used to remove ‘flat genes’ that don’t vary much in expression over the conditions of the experiment. The variance filter has three possible criteria for specifying which genes to keep. The Enable Variance Filter check box turns the filter on and off. Be sure to check the History Node log to see the number of genes retained after using the filter. Note that the variance filter is applied after other filters, such as the Percentage Cutoffs filter. This convention ensures that the genes that are checked for variance also contain some minimum level of ‘good’ (not missing) data.

- The Percentage of Highest SD Genes option ranks the genes by standard deviation (SD) and keeps the specified percentage of this ranked list. For example, with 1000 genes and the percentage set to 20%, the result is a final list of the 200 most variable genes.
- The Number of Desired High SD Genes option also ranks the genes by SD and then selects the specified number of genes from the top of this SD-ordered list, so that the highest SD genes are selected.
- The SD Cutoff Value option uses an actual SD value, so that all genes with an SD greater than this value are selected.

7.6 Sorting and searching over expression data

The Sort feature permits the user to sort the data:

Search Utility: The Search feature permits the user to search the data for genes or samples that match a search term under given search criteria.

7.7 Clustering of Experimental Data

Each of the clustering algorithms available in BiologicalNetworks can be launched from the Microarray> Cluster Analysis> Clustering Algorithms menu of the Main Window. All clustering algorithms can be used to cluster genes or samples. Clustering analysis results appear in the Analysis subtree of the Project Properties navigation tree. The tabs within this subtree contain the results of the method's calculations.
Each algorithm run presents a dialog or form for entering the parameters specific to that algorithm. A full description of the algorithms available in BiologicalNetworks is given in Section 9.

7.8 Clustering analysis viewers

Viewers are the graphical displays used to present the results of the microarray analysis. The viewers appear as a subtree under the method’s Analysis Tree within the Project Properties navigation tree.

Fig. 7.8.1 Clustering analysis expression viewer.

Expression Images
This viewer is used in the main window of the Expression Viewer as well as in the clustering analysis viewers. Each colored rectangle represents the expression level of one gene in one experiment. Each column represents all the genes from a single experiment, and each row represents the expression of a gene across all experiments. The default color scheme used to represent expression level is red/green (red for overexpression, green for underexpression) and can be adjusted through the Microarray> Apply Custom Color Scheme and/or Color Settings menus. See Section 7.2. Double-clicking on any of the rectangles in this view opens a window containing a graph of that gene’s expression level across all samples.

Expression Graphs
The Expression Graphs Viewer displays graphs of the expression levels of each gene across the experimental conditions.

Fig. 7.8.2 Clustering analysis expression graphs.

Gene cluster Table Views
The Table View of clustering results shows annotations for the genes in the cluster.

Fig. 7.8.3 Clustering analysis table viewer.

7.9 GeneOntology terms overrepresentation analysis

BiologicalNetworks provides an implementation of the GeneOntology Fisher's overrepresentation test, a method that gives the researcher an initial biological interpretation of gene clusters based on the indices provided in the input data set and information linking those indices to biological “themes”.
These themes are generally GO terms, KEGG pathways, or any other descriptive terms related to biological role or biochemical pathway information. The result of the analysis is a group of biological themes that are represented in the cluster. A statistic reports the probability that the prevalence of a particular theme within the cluster is due to chance alone, given the prevalence of that theme in the population of genes under study (all “genes” loaded into BiologicalNetworks).

Fisher Exact Probability
Fisher's Exact Probability is used to assess whether a biological theme is over-represented in the cluster of interest relative to the representation of that theme in the total gene population. For example, suppose that one has a gene list of 50 genes from a population of 10,000 genes. Now suppose that 10 of the 50 genes are related to pathway "A" but only 13 genes in the total population are associated with pathway "A". This scenario yields a low probability that the observed number of hits (occurrences of pathway "A") within the small sample could be due to chance alone. This statistic is based on the hypergeometric distribution and has benefits over chi-square in that it is appropriate for finite populations.

Annotation parameters Panel

Population and Cluster Selection Option

Annotation key
This area contains a drop-down list of the available annotation types that can be used to identify genes.
Generally it's best to use an index or accession that uniquely identifies the spotted material.

An optional conversion file provides the mapping from your annotation key (above) to the index used to map to biological themes (GO terms, KEGG pathways, etc.). If your annotation key type is the one used in the linking file (below), this conversion (mapping) is not needed. These files, if needed, are typically stored in the Convert directory.

Gene Annotation / Gene Ontology Linking Files
This section allows one to specify one or more annotation files. These files contain gene indices paired with biological themes such as GO terms. They typically reside in the Class directory.

Results of GeneOntology Analysis
The primary result is reported in a table in which entries are ordered by the reported statistic.
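To make the reported statistic concrete, the pathway "A" example from the Fisher Exact Probability description (10 hits in a cluster of 50 genes, 13 hits in a population of 10,000) can be worked through with the hypergeometric distribution. This is a minimal sketch of a one-sided (upper-tail) probability; the function name is hypothetical, and whether BiologicalNetworks reports exactly this one-sided value is an assumption.

```python
from fractions import Fraction
from math import comb


def overrepresentation_p(population, population_hits, cluster, cluster_hits):
    """Probability of at least cluster_hits annotated genes in a cluster
    drawn at random, without replacement, from the population."""
    total = comb(population, cluster)
    upper = min(population_hits, cluster)
    tail = sum(
        comb(population_hits, k)
        * comb(population - population_hits, cluster - k)
        for k in range(cluster_hits, upper + 1)
    )
    # Exact integer arithmetic; convert to float only at the end.
    return float(Fraction(tail, total))


# Pathway "A": 10 of 50 cluster genes, 13 of 10,000 population genes.
p = overrepresentation_p(10_000, 13, 50, 10)
print(f"p = {p:.3g}")
```

With only 13 pathway "A" genes in 10,000, a random 50-gene list is expected to contain about 0.065 of them, so observing 10 gives a very small probability.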
The table can be sorted on any column. A right click in the table will launch a menu allowing several operations:
|