7. Working with Microarray Data

7.1 Loading Microarray Data


Before using the Microarray data environment of BiologicalNetworks, make sure you have microarray files to open.

Example data files can be downloaded here: http://brak.sdsc.edu/pub/BiologicalNetworks/MicroarrayData.zip

Expression data can be imported from the Microarray submenu of the BiologicalNetworks main window. See Figure 7.1.1.

Fig.7.1.1 Load Microarray Data.

Choosing one of the available file types opens the corresponding Import Expression Data Wizard.

See Figure 7.1.2


The Import Expression Wizards allow you to import expression data into BiologicalNetworks.
The data files can be in TXT or MS Excel format (for Stanford tab-delimited and Affymetrix data), .mev and .ann (for TIGR expression data), or .gpr (for GenePix expression data). To import your data, open the Microarray>Load Microarray Data menu and choose the format. Then specify the location of the source data file and follow the steps provided by the wizard. The imported data will be opened in the Import Expression Wizard.



Importing Stanford (tab delimited) Data.

To import data in Stanford (tab delimited) TXT or MS Excel formats:
  • Open the Microarray>Load Microarray Data menu and choose the Stanford Format option.
  • The Import Expression Wizard appears.
  • In the Expression Wizard window, specify the content of the first line of the data file and the columns of the data file that contain the Gene IDs by clicking the upper leftmost expression value.
  • Then press "Load" to load the data.
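
The layout that the wizard asks you to describe (a header line, a Gene ID column, and numeric expression columns) can be checked before import with a short script. The following is a minimal Python sketch, not part of BiologicalNetworks; the function name, the column positions, and the sample data are illustrative assumptions.

```python
import csv
import io

def read_tab_delimited(handle, id_column=0, first_value_column=1):
    """Parse a tab-delimited expression table into (sample_names, {gene_id: values})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)                      # first line: column titles
    samples = header[first_value_column:]
    data = {}
    for row in reader:
        if not row:
            continue                           # skip blank lines
        values = []
        for cell in row[first_value_column:]:
            # Assumption: an empty cell means a missing value
            values.append(float(cell) if cell.strip() else None)
        data[row[id_column]] = values
    return samples, data

# Hypothetical two-sample file with one missing value
example = "GeneID\tSample1\tSample2\nYAL001C\t0.52\t-1.10\nYAL002W\t\t0.33\n"
samples, data = read_tab_delimited(io.StringIO(example))
print(samples)
print(data)
```

If the Gene IDs are not in the first column of your file, pass a different id_column, which mirrors the column choice made in the wizard.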

Importing Affymetrix Microarray Data

To import microarray data in Affymetrix format:
  • Open the Microarray>Load Microarray Data menu and choose the Affymetrix Format option.
  • The Import Expression Wizard appears.
  • In the Choose Affymetrix Expression Files dialog box, select the file and press OK.


Importing Microarray Data in GPR Format

To import microarray data in GPR format:
  • Open the Microarray>Load Microarray Data menu and choose the Genepix Format (GPR) option.
  • The Import Expression Wizard appears.
  • Follow the steps provided by the wizard.


Fig.7.1.2 Load Stanford (tab delimited) data type file Wizard.


    7.2 Color schemes and Visual styles settings


    When an expression experiment is opened as a heat map, a colored box represents the expression level of each gene (protein). There are two default color schemes in the Expression Experiment Viewer that correspond to data formats supported by the software (Signal and Ratio). While importing expression data, you should choose the color scheme in the Apply Custom Color Scheme submenu of Microarray menu. Later, in the opened experiment, you can change the color schemes in the Expression Viewer Toolbar.

    Ratio data: the color intensity is proportional to the log ratio of the current sample to the base sample and is represented as a double-gradient color map. Negative values are acceptable in this format. On the heat map, green represents negative log ratios and red represents positive log ratios. Greens of increasing intensity correspond to increasingly negative log ratios, and reds of increasing intensity correspond to increasingly positive log ratios. In the Color Settings dialog window, you can set the color range for the min and max ratio values, the cutoff values, and the color range for missing data.

    Signal data: the color intensity is proportional to the signal value and is represented as a single-gradient color map. There are no negative values in this data representation. By default, the software uses green for low expression values, red for high expression values, and yellow for missing values.
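
The two schemes can be illustrated with a minimal Python sketch that maps a value to an RGB color. This is not the BiologicalNetworks implementation: the function names, the default ranges (max_abs, low, high), the linear scaling, and the gray color for missing ratio data are assumptions made for illustration only.

```python
def ratio_color(log_ratio, max_abs=3.0):
    """Double gradient: green for negative log ratios, red for positive ones."""
    if log_ratio is None:
        return (128, 128, 128)                 # assumed color for missing data
    # Clamp to the configured range, then scale the intensity to 0..255
    clamped = max(-max_abs, min(max_abs, log_ratio))
    intensity = round(255 * abs(clamped) / max_abs)
    return (intensity, 0, 0) if clamped > 0 else (0, intensity, 0)

def signal_color(signal, low=0.0, high=1000.0):
    """Single gradient: green for low signal values, red for high ones."""
    if signal is None:
        return (255, 255, 0)                   # yellow for missing values
    clamped = max(low, min(high, signal))
    t = (clamped - low) / (high - low)         # 0.0 at low, 1.0 at high
    return (round(255 * t), round(255 * (1 - t)), 0)

print(ratio_color(-1.0), ratio_color(2.0))
print(signal_color(250.0), signal_color(None))
```

Values outside the configured range are clamped, so they are drawn with the most intense color of their side of the gradient.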


    Visual Styles for Gene Expression

    Visual styles are used to make your gene expression map more intuitive and clear.

    -First, you can change the scale of the expression map by zooming in and out and by setting the element size.
    -Second, you can use the Brightness option to adjust the color intensity of the heat map for better viewing.
    -You can also change the color range for a particular gene expression map. Open the Microarray>Color Settings dialog or use the Microarray toolbar icon to adjust the Expression Viewer options. In this dialog box, you can also enter the cutoff values and set the color range for the min and max ratio values and for missing data.
    -In the heat map, a colored box represents the level of expression for each gene. The software supports two default color schemes (Ratio and Signal) for expression data. To change the color scheme for an opened experiment, open the Microarray>Apply Custom Color Schema dialog box (or use the Microarray toolbar) and select the radio button corresponding to the color scheme you want.


    7.3 Expression Experiment Viewer

    Loaded microarray data appear in the Expression Experiment Viewer Panel at the bottom of the BiologicalNetworks Main window.

    The Expression Experiment Viewer is designed to display a graphical representation of gene expression and proteomics experiment data, usually generated by microarray experiments. It provides the algorithms and workspace for examining the data from expression or proteomics experiments and for superimposing these data onto opened pathways and gene regulatory networks.

    See Figure 7.3.1

    Fig.7.3.1 Microarray data file loaded.


    The functions available from the Microarray submenu and the Microarray Experiment Manager menu bar allow the user to:

    - Create new pathways as well as new groups from an expression experiment.

    - Select a number of genes and create a group or a pathway from them.

    - Display expression data on an existing pathway diagram, using different shades of green/red depending on the fold change of expression.

    Numerous clustering, filtering, normalization, and search methods are available in BiologicalNetworks.

    7.4 Expression viewer toolbar

    The Expression Viewer toolbar provides a wide range of functions:

  • Import expression experiment option opens a gene expression file. The file must have a TXT, XLS, .mev, .ann, or .gpr extension.
  • Save expression experiment.
  • Zoom in, Zoom out and Choose element size in the microarray matrix view.
  • Color entities by expression. Select this option to color the entities of a pathway of interest by their expression values. Refer to Section 8.
  • Extract pathways, predict network from expression menu allows you to run correlation algorithms (for example, Pearson) to build a network from your raw data. Refer to Section 8.
  • Build pathways from selection option creates a pathway from the selected genes. Refer to Section 8.
  • Filter and sort expression data.
  • Search over expression data. The Search feature lets the user search the data for genes or samples that match a search term under the given search criteria.
  • Create group from selection option creates a group from the selected genes. Refer to Section 8.
  • Find pathways option can be applied to the opened Expression Experiment and returns a list of pathways that share at least one protein from the selection. Refer to Section 8.
  • Find groups option does the same as the Find Pathways option, but it will return the list of groups, which share at least one protein of the selection. Refer to Section 8.
  • Group selected genes together option allows you to select genes of interest from the expression map or table and put them together. Refer to Section 8.
  • Visual styles and color settings for gene expression map.


    Fig.7.4.1 Expression viewer toolbar.
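
The idea of building a network from expression data with a correlation measure such as Pearson can be illustrated with a minimal Python sketch. It is not the algorithm used by BiologicalNetworks: the function names, the rule of linking two genes when the absolute correlation reaches a threshold, and the threshold value of 0.9 are all assumptions.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0   # flat profile: correlation is undefined, treated as no link
    return cov / math.sqrt(var_x * var_y)

def correlation_network(expression, threshold=0.9):
    """Return (gene1, gene2, r) edges for gene pairs with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(expression, 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges

# Hypothetical expression profiles over four samples
profiles = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1], "d": [1, 3, 2, 3]}
for edge in correlation_network(profiles):
    print(edge)
```

Using the absolute value of the correlation links both co-expressed and anti-correlated genes; keeping only positive correlations would be an equally reasonable choice.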

    7.5 Filtering, Normalization and Data Transformation

    Adjust Data

    Different types of adjustments may be applied on top of one another in any sequence, and the same type of adjustment may be applied repeatedly to the matrix. Adjustments may not necessarily affect the main display or the values displayed when elements are clicked on the matrix displays, but will influence the calculation of the expression matrix, the foundation of all analyses. Adjustments will also be reflected when the entire matrix or individual clusters are saved as text files, although the original data files are not overwritten. Furthermore, with the exception of three options: “Set Lower Cutoffs”, “Set Percentage Cutoffs” and “Adjust Intensities of Zero”, all the changes made to an expression matrix are irreversible for the current session.

    Because of the above features, a good way to use these options might be to apply any required adjustments to the data set, save the entire adjusted matrix as a tab delimited formatted text file (using the “Save Microarray Matrix” option under the “Microarray” Menu), and then load this new file in a new session, during which no further data adjustments will be made. This will ensure consistency throughout the session.

    Adjustment options are described below:

    Data Transformations
  • Log2 Transformation
    This takes the log2 transform of every element in the matrix. Note that this adjustment should not usually be necessary: the program automatically computes the log2 ratio of the two intensities and uses it in the expression matrix. TDMS files also often contain pre-calculated log2 ratios.
  • Log10 to Log2
    This assumes that the current data are log10 transformed and converts them to log base 2; that is, it takes input of the form log10(x) and outputs log2(x).
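
Both transformations are simple element-wise operations. The following minimal Python sketch shows the arithmetic; the function names and the handling of missing or non-positive values are assumptions, not the behavior of BiologicalNetworks.

```python
import math

def log2_transform(value):
    """Log2 of a raw value; missing or non-positive values stay missing (assumption)."""
    if value is None or value <= 0:
        return None
    return math.log2(value)

def log10_to_log2(value):
    """Convert log10(x) to log2(x) using log2(x) = log10(x) / log10(2)."""
    return value / math.log10(2)

print(log2_transform(8.0))
print(log10_to_log2(2.0))   # input is log10(100), output is log2(100)
```

The conversion only rescales the values by a constant factor of about 3.32, so the ordering of genes by expression is unchanged.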

    Data Filters (Data Quality and Variance Based Filters)
  • Lower Cutoffs
    Select Use Lower Cutoffs to exclude from analysis any genes for which the expression values (in either the corresponding Cy3 or Cy5 columns) are lower than specified values. Select Set Lower Cutoffs to set these values. To enable this option, check the “Use Lower Cutoffs” checkbox just below the “Set Lower Cutoffs” menu option, and uncheck it to disable this option. All subsequent analyses will include only those genes for which all Cy3 and Cy5 values are above the specified thresholds. This option is disabled by default.
  • Percentage Cutoffs
    Select Use Percentage Cutoffs to ignore the genes for which there are not enough valid (non-zero) expression values across all samples. This will not delete any data, but will only exclude the genes from analysis. This option is sometimes useful for speeding up module calculation, since many zeros will often slow it down.
    To determine which genes will be excluded, select Set Percentage Cutoffs and enter a percentage value. To enable this option, check the “Use Percentage Cutoffs” checkbox just below the “Set Percentage Cutoffs” menu option, and uncheck it to disable this option. Genes with less than the specified percentage of non-zero values will be ignored. A value of 0.0% indicates that all genes will be used in the analysis. To require that every one of a gene’s expression values be valid for the gene to be included, set the value to 100. This option is disabled by default.
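
The two cutoff rules can be expressed as simple predicates on a gene's values. The sketch below is a minimal Python illustration under stated assumptions: the function names are hypothetical, a value exactly equal to a lower cutoff is kept, and missing values are represented as None or 0.

```python
def passes_lower_cutoffs(cy3_values, cy5_values, cy3_cutoff, cy5_cutoff):
    """Keep a gene only if no Cy3 or Cy5 value is lower than its cutoff."""
    # Assumption: a value exactly equal to the cutoff is not "lower" and is kept
    return (all(v >= cy3_cutoff for v in cy3_values)
            and all(v >= cy5_cutoff for v in cy5_values))

def passes_percentage_cutoff(values, min_percent):
    """Keep a gene if at least min_percent of its values are valid (non-zero)."""
    valid = sum(1 for v in values if v is not None and v != 0)
    return 100.0 * valid / len(values) >= min_percent

print(passes_lower_cutoffs([100, 200], [150, 300], 50, 50))
print(passes_percentage_cutoff([1.0, 0, 2.0, 0], 50))
```

With min_percent set to 0.0 every gene passes, and with 100 a single zero is enough to exclude a gene, which matches the two boundary cases described above.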
  • Variance Filter
    The variance filter removes genes with low variation of expression over the loaded samples. It is used to remove ‘flat genes’ that don’t vary much in expression over the conditions of the experiment. The variance filter has three possible criteria for specifying which genes to keep. The Enable Variance Filter check box turns the filter on and off. Be sure to check the History Node log to see the number of genes retained after using the filter. Note that the variance filter is applied after other filters, such as the Percentage Cutoff filter, have been imposed. This convention ensures that the genes that are checked for variance also contain some minimum level of ‘good’ (not missing) data.

    The Percentage of Highest SD Genes option ranks the genes by standard deviation (SD) and keeps the specified percentage from the top of this ranked list. For example, if there are 1000 genes and the percentage is set to 20%, the result is a final list of the 200 most variable genes.

    The Number of Desired High SD Genes option also ranks the genes by SD and then selects the specified number of genes from the top of this SD-ordered list, so that the highest SD genes are selected. The SD Cutoff Value option uses an actual SD value: all genes with an SD greater than this value are selected.
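
The three criteria can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the use of the sample standard deviation (statistics.stdev) rather than the population standard deviation is an assumption.

```python
import statistics

def rank_by_sd(expression):
    """Gene IDs sorted from highest to lowest standard deviation."""
    return sorted(expression, key=lambda g: statistics.stdev(expression[g]), reverse=True)

def top_percent(expression, percent):
    """Percentage of Highest SD Genes: keep the top percent of the ranked list."""
    ranked = rank_by_sd(expression)
    return ranked[:int(len(ranked) * percent / 100)]

def top_n(expression, n):
    """Number of Desired High SD Genes: keep the n most variable genes."""
    return rank_by_sd(expression)[:n]

def above_sd_cutoff(expression, cutoff):
    """SD Cutoff Value: keep genes whose SD is greater than the cutoff."""
    return [g for g in expression if statistics.stdev(expression[g]) > cutoff]

# Hypothetical profiles over three samples
genes = {
    "flat": [1.0, 1.0, 1.0],
    "mild": [1.0, 1.5, 2.0],
    "wild": [-2.0, 0.0, 2.0],
    "small": [0.9, 1.0, 1.1],
}
print(top_percent(genes, 50))
print(top_n(genes, 1))
print(above_sd_cutoff(genes, 0.3))
```

For the example in the text, a list of 1000 genes with percent set to 20 gives int(1000 * 20 / 100) = 200 retained genes.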

    7.6 Sorting and searching over expression data

    The Sort feature permits the user to sort the data:

  • By expression value
  • By chromosomal order
  • By Gene ID



    Search Utility

    -The Search feature lets the user search the data for genes or samples that match a search term under the given search criteria.
    -The Search initialization dialog lets you choose whether to find genes or samples. The search criteria include a search term, an option to make the search case sensitive, and an option to require an exact match or to accept the search term as a contiguous portion of a larger annotation term.
    -Search results are returned in a new window. The upper section is a table of the genes or samples that match the search criteria, and the lower section provides shortcut links to cluster viewers that contain the identified samples or genes.
    -Navigation shortcuts provide a means to open cluster viewers that contain the elements found in the search.
    -Elements in the table can be deselected using the checkboxes. Clicking on the Update Shortcuts button will produce a new search result window with just the previously selected entries and the associated viewer shortcuts. This allows one to prune unwanted elements out of the search result.
    -The Store Cluster button will store the selected items as a cluster and assign a user-selected color.
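
The search criteria described above (search term, case sensitivity, and exact versus partial match) can be sketched as follows. This is a minimal Python illustration with a hypothetical function name and hypothetical annotation data; it is not the search code of BiologicalNetworks.

```python
def search_annotations(annotations, term, case_sensitive=False, exact_match=False):
    """Return the IDs whose annotation matches the term under the given criteria."""
    needle = term if case_sensitive else term.lower()
    hits = []
    for element_id, text in annotations.items():
        haystack = text if case_sensitive else text.lower()
        # Exact match compares whole annotations; otherwise the term may be
        # any contiguous portion of a larger annotation
        matched = (haystack == needle) if exact_match else (needle in haystack)
        if matched:
            hits.append(element_id)
    return hits

# Hypothetical gene annotations
annotations = {"g1": "heat shock protein", "g2": "Heat Shock", "g3": "kinase"}
print(search_annotations(annotations, "heat shock"))
print(search_annotations(annotations, "heat shock", exact_match=True))
print(search_annotations(annotations, "Heat", case_sensitive=True))
```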

    7.7 Clustering of Experimental Data

    Each of the clustering algorithms available in BiologicalNetworks can be launched from the Microarray> Cluster Analysis> Clustering Algorithms menu of the Main Window. All clustering algorithms can be used to cluster genes or samples. Clustering analysis results appear in the Analysis subtree of the Project Properties navigation tree. The tabs within this subtree contain the results of the method's calculations. Each algorithm presents a dialog or form for entering the parameters specific to that algorithm.

    A full description of the algorithms available in BiologicalNetworks is given in Section 9.

    7.8 Clustering analysis viewers

    Viewers are the graphical displays used to present the results of the microarray analysis. The viewers will appear as a subtree under the method’s Analysis Tree within the Project Properties navigation tree.


    Fig.7.8.1 Clustering analysis expression viewer.


    Expression Images

    This viewer is used in the main window of the Expression Viewer as well as in the clustering analysis viewers. Every colored rectangle represents the expression of one gene in one experiment. Each column represents all the genes from a single experiment, and each row represents the expression of a gene across all experiments. The default color scheme used to represent expression level is red/green (red for overexpression, green for underexpression) and can be adjusted in the Microarray> Apply Custom Color Schema and/or Color Settings menu. See Section 7.2.

    Double clicking on any of the rectangles in this view will open a window containing a graph of this gene’s expression level across all samples.


    Expression Graphs

    The Expression Graphs Viewer displays graphs of the expression levels of each gene across the experimental conditions.
    The mean expression levels of genes in the cluster are shown as a centroid graph overlaid on top of the individual expression graphs.
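
The centroid is the per-condition mean of the expression profiles in a cluster. A minimal Python sketch follows; the function name and the sample profiles are illustrative assumptions, and all profiles are assumed to have the same length with no missing values.

```python
def centroid(profiles):
    """Mean expression per condition across all gene profiles in a cluster."""
    n = len(profiles)
    # zip(*profiles) groups the values of all genes condition by condition
    return [sum(column) / n for column in zip(*profiles)]

# Two hypothetical gene profiles over three conditions
cluster = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(centroid(cluster))
```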


    Fig.7.8.2 Clustering analysis expression graphs.

    Gene cluster Table Views

    The Table View of clustering results shows annotations for the genes in the cluster.
    -You can drag columns horizontally across the table to change their relative ordering.
    -You can sort the rows in ascending or descending order of the entries in the column by successive clicking on the header of that column.
    -You can sort the “Stored Color” column, bringing together elements that have been stored with the same cluster color.
    -You can sort the table in the original order of elements by CTRL-clicking on any column header.

    Right-clicking on the table view opens a Context Menu. The options available from the Context Menu are:

  • Store a subset of rows in the table in a cluster, to Groups/Clusters manager
  • Store entire table as a cluster, to Groups/Clusters manager
  • Make a search over the table.
  • Save currently viewed cluster to a file.
  • Delete all rows in the table or a subset of them
  • Delete a cluster stored from this viewer


    Fig.7.8.3 Clustering analysis table viewer.



    7.9 GeneOntology terms overrepresentation analysis

    BiologicalNetworks provides an implementation of the GeneOntology Fisher's overrepresentation test, a method that gives the researcher an initial biological interpretation of gene clusters based on the indices provided in the input data set and on information linking those indices to biological “themes”. These themes are generally GO terms, KEGG pathways, or any other descriptive term related to biological role or biochemical pathway information. The result of the analysis is a group of biological themes that are represented in the cluster. A statistic reports the probability that the prevalence of a particular theme within the cluster is due to chance alone, given the prevalence of that theme in the population of genes under study (all “genes” loaded into BiologicalNetworks).


    Fisher Exact Probability

    The Fisher's Exact Probability reports the probability that a biological theme is over-represented in the cluster of interest relative to the representation of that theme in the total gene population.
    For example, suppose that one has a gene list of 50 genes from a population of 10,000 genes. Now suppose that 10 of the 50 genes were related to pathway "A" but only 13 genes in the total population were associated with pathway "A". This scenario would yield a low probability that the observed number of hits (occurrences of pathway "A") within the small sample could be due to chance alone. This statistic is based on the hypergeometric distribution and has benefits over chi-square in that it is appropriate for finite populations.
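
The example above can be worked through numerically. The following minimal Python sketch computes the one-sided hypergeometric tail probability, that is, the probability of observing at least the given number of theme genes in the cluster by chance. The function and parameter names are illustrative assumptions, and the sketch does not claim to reproduce the exact statistic reported by BiologicalNetworks.

```python
from math import comb

def overrepresentation_p(population, theme_in_population, cluster, theme_in_cluster):
    """P(X >= theme_in_cluster) for X following a hypergeometric distribution."""
    max_hits = min(cluster, theme_in_population)
    total = comb(population, cluster)          # all ways to draw the cluster
    tail = sum(
        comb(theme_in_population, k)
        * comb(population - theme_in_population, cluster - k)
        for k in range(theme_in_cluster, max_hits + 1)
    )
    return tail / total

# Values from the example: 50 genes drawn from 10,000, with 10 of the
# 13 pathway "A" genes falling inside the list
p = overrepresentation_p(10000, 13, 50, 10)
print(p)
```

The binomial coefficients are computed with exact integer arithmetic and divided only once at the end, which avoids rounding problems with the very small probabilities that such an example produces.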

    Fig.7.9.1 Run GeneOntology annotation analysis.


    Annotation parameters Panel

    Population and Cluster Selection Option
    This option specifies a gene population or a gene cluster list. The default selection is to use a population file which is simply all of the genes loaded into BiologicalNetworks.
    The Annotation parameters Panel also displays gene clusters currently stored in BiologicalNetworks cluster repository. If no clusters have been saved then a blank browser page will be displayed on this panel and the Cluster Analysis mode option will be disabled. Selecting a row (or a group of rows using 'Alt' button) in the cluster table will display the cluster in the expression graph area of the browser. Cluster analysis will be executed on the selected clusters.


    Annotation key

    This area contains a drop-down list of the available annotation types that can be used to identify genes. Generally, it is best to use an index or accession that uniquely identifies the spotted material.

    Annotation Conversion File

    This optional file provides the mapping from your annotation key (above) to the index used to map to biological themes (GO terms, KEGG pathways, etc.). If your annotation key type is the one used in the linking file (below), then this conversion (mapping) is not needed. These files, if needed, are typically stored in the Convert directory.


    Gene Annotation / Gene Ontology Linking Files

    This section allows one to specify one or more annotation files. These files contain gene indices paired with biological themes such as GO terms. These files typically reside in the Class directory.



    Fig.7.9.2 Annotation analysis panel.


    Results of GeneOntology Analysis

    The primary result is reported in a table in which entries are ordered based on the reported statistic. The table can be sorted on any column. A right click in the table will launch a menu allowing several operations:

    Store Selection as Cluster: Stores the genes associated with a biological theme as a cluster that will be stored in the cluster manager.

    Open Viewer: Opens one of three possible viewers containing the genes within the biological theme. These viewers are also accessible from a node in the result tree which follows the table node in result navigation tree.

    Save Table: Stores the result table to a tab delimited file.



    Fig.7.9.3 Annotation analysis results.