Tutorial I
Female Mouse Liver Microarray Data
Network Construction and Module Analysis
Steve Horvath
Correspondence: shorvath@mednet.ucla.edu, http://www.ph.ucla.edu/biostat/people/horvath.htm
Content of this tutorial
1) Gene Coexpression Network Construction,
2) Module Definition Based on Average Linkage hierarchical clustering with the dynamic tree cut algorithm
3) Relating Modules To Physiological Traits (module significance analysis)
4) Comparing Weighted Network Results to Unweighted Network Results
5) Studying the Clustering Coefficicient
Abstract
We use microarray from an F2 mouse intercross to examine the largescale organization of gene coexpression networks in female mouse liver and annotate several gene modules in terms of 20 physiological traits. Finally we study the relationship between connectivity and a measures of gene significance based on the physiological traits.
The data and biological implications are described in the following references

Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ, Horvath S (2006)Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight. PloS Genetics. Volume 2  Issue 8  AUGUST 2006

Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S (2007) Weighted Gene Coexpression Network Analysis Strategies Applied to Mouse Weight, Mamm Genome 18(6):463472.
We provide the statistical code used for generating the weighted gene coexpression network results. Thus, the reader be able to reproduce all of our findings. This document also serves as a tutorial to weighted gene coexpression network analysis. Some familiarity with the R software is desirable but the document is fairly selfcontained. This document and data files can be found at the following webpage:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/MouseWeight/
More material on weighted network analysis can be found here
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
Method Description
The network construction is conceptually straightforward: nodes represent genes and nodes are
connected if the corresponding genes are significantly coexpressed across appropriately chosen tissue samples. Here we study networks that can be specified with the following adjacency matrix:
A=[a_{ij}] is symmetric with entries in [0,1]. By convention, the diagonal elements are assumed to be zero. For unweighted networks, the adjacency matrix contains binary information (connected=1, unconnected=0). In weighted networks the adjacency matrix contains encodes pairwise connection strengths.
Microarray data
RNA preparation and array hybridizations were performed at Rosetta Informatics. The custom inkjet microarrays used in this study (Agilent technologies, previously described [2, 24]) contain 2186 control probes and 23,574 noncontrol oligonucleotides extracted from mouse Unigene clusters and combined with RefSeq sequences and RIKEN fulllength clones. Mouse livers were homogenized and total RNA extracted using Trizol reagent (Invitrogen, CA) according to manufacturer’s protocol. Three µg of total RNA was reverse transcribed and labeled with either Cy3 or Cy5 fluorochrome. Purified Cy3 or Cy5 complementary RNA was hybridized to at least two microarray slides with fluor reversal for 24 hours in a hybridization chamber, washed, and scanned using a laser confocal scanner. Arrays were quantified on the basis of spot intensity relative to background, adjusted for experimental variation between arrays using average intensity over multiple channels, and fit to an error model to determine significance (type I error). Gene expression is reported as the ratio of the mean log_{10} intensity (mlratio) relative to the pool derived from 150 mice randomly selected from the F2 population.
Data Reduction:
In order to minimize noise in the gene expression data set, several data filtering steps were taken. First, preliminary evidence showed major differences in gene expression levels between sexes amongst the F2 mice used (Yang, manuscript in preparation), and therefore only female mice were used for network construction . The construction and comparison of the male network will be reported elsewhere. Only those mice with complete phenotype, genotype and array data were used. This gave a final experimental sample of 135 female mice used for network construction. Due to computational limitations the following filtering steps were applied to the genome wide expression data. First, the 8000 most varying genes (ml ratio – log_{10} of the ratio of experimental mouse gene expression to F2 pool), across all mice were identified. Next, amongst these 8000 genes, after preliminary network construction, the 3600 most connected genes were chosen as those to use in further steps (All genes excluded had a connectivity of 1 or less). These 3600 genes were then examined, and where appropriate, gene isoforms and genes containing duplicate probes were excluded by using only those with the highest expression among the redundant transcripts. This final filtering step yielded a count of 3421 genes for the experimental network construction.
The main reason for not using all 23000 genes is that genes with low variance across the mouse samples are likely to be less interesting in this analysis that relates gene expression profiles to SNP markers and highly varying physiological traits. Noise genes may compromise module detection and thus our integrated model. A computational reason for restricting the analysis to 8000 genes is that our R software code becomes extremely slow when dealing with matrices of dimension larger than 8000x8000.
For module detection, we limited our analysis to 3600 most connected genes since our module construction method and visualization tools cannot handle larger data sets at this point. By definition, module genes are highly connected with the genes of their module (i.e. module genes tend have relatively high connectivity). Thus, for the purpose of module detection, restricting the analysis to the most connected genes does not lead to major information loss for the key points of our application. However, there may be applications where genes with relatively low connectivity are biologically interesting so that gene filtering based on connectivity would lead to information loss. Finally, we eliminated multiple probes with similar expression pattern for the same gene since we are interested in studying gene networks as opposed to probeset networks. This resulted in 3421 genes in our final set which we used for module detection.
Network Construction:
We pioneered the use of a weighted coexpression network for mapping complex disease genes. In coexpression networks, network nodes correspond to genes and connection strengths are determined by the pairwise correlations between expression profiles. In contrast to unweighted networks, weighted networks use soft thresholding of the Pearson correlation matrix for determining the connection strengths between two genes. Soft thresholding of the Pearson correlation preserves the gene coexpression information and leads to weighted coexpression networks that are highly robust with respect to the construction method.
The network construction algorithm is described in detail elsewhere (Zhang and Horvath 2005). Briefly, a gene coexpression similarity measure (absolute value of Pearson’s product moment correlation) was used to relate every pairwise genegene relationship. An adjacency matrix was then constructed using a `soft’ power adjacency function a_{ij} = Power(s_{ij}, ) s_{ij}_{ }^{}_{ }where s_{ij}_{ }is the coexpression similarity, and a_{ij} represents the resulting adjacency that measures the connection strengths. The power is chosen using the scale free topology criterion proposed in Zhang and Horvath (2005). Briefly, the power was chosen such the resulting network exhibited approximate scale free topology and a high mean number of connections. The scale free topology criterion led us to choose a power of = 6 based on the preliminary network built from the 8000 most varying genes. However, since we are using a weighted network as opposed to an unweighted network, the biological findings are highly robust with respect to the choice of this power (Zhang and Horvath 2005) .
Topological Overlap Matrix and Gene Modules
The adjacency matrix was then used to define a network distance measure or more precisely a measure of node dissimilarity based on the topological overlap matrix. Specifically the topological overlap matrix is given by
where, denotes the number of nodes to which both i and j are connected, and u indexes the nodes of the network. The topological overlap matrix (TOM) is given by Ω=[ω_{ij}]. ω_{ij} is a number between 0 and 1 and is symmetric (i.e, ω_{ij}= ω_{ji}). The rationale for considering this similarity measure is that nodes that are part of highly integrated modules are expected to have high topological overlap with their neighbors.
Network Module Identification.
Gene "modules" are groups of nodes that have high topological overlap. Module identification was based on the topological overlap matrix Ω=[ω_{ij}] defined above. To use it in hierarchical clustering, it was turned into a dissimilarity measure by subtracting it from one (i.e, the topological overlap based dissimilarity measure is defined by ). Based on the dissimilarity matrix we can use hierarchical clustering to discriminate one module from another. We used a dynamic cuttree algorithm for automatically and precisely identifying modules in hierarchical clustering dendrogram (the "tree" method of cutreeDynamic, see Langfelder, Zhang Horvath 2008).
The algorithm takes into account an essential feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination and the process is iterated until the number of clusters becomes stable. No claim is made that our module construction method is optimal. A comparison of different module construction methods is beyond the scope of this paper.
Detection of hub genes:
To identify hub genes for the network, one may either consider the whole network connectivity (denoted by kTotal) or the intramodular connectivity (kWithin).
We find that intramodular connectivity is far more meaningful than whole network connectivity
Module Significance Analysis: Relating Gene Modules to Physiologic Traits,
In the BXH F2 cross, 20 physiological traits were measured for each animal. We used this information to explore the physiological relevance of each of the modules in the network. To do this, the Pearson’s product moment correlation (Pearson’s correlation) was computed between each gene within a module and each of the physiological traits. This measure is termed the ‘gene significance’ of a particular gene with that trait. The geometric mean was then calculated for the absolute value of all gene significance scores within each module, yielding the ‘module significance’ (MS) of that particular module with the trait .
In order to explore the characteristics of our connectivity measure, we plotted the connectivity parameter versus variance and mean gene expression for each gene within the most functionally significant (Blue and Brown) modules. We observed an inverse relationship between connectivity and gene expression variance. This is consistent with the idea that the network’s most highly connected ‘hubs’ are resilient to genetic background variations since they are vital for core biological functions.
Statistical References

Bin Zhang and Steve Horvath (2005) "A General Framework for Weighted Gene CoExpression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17
The following theoretical reference explores the meaning of coexpression network analysis
• Horvath S, Dong J (2008) Geometric Interpretation of Gene CoExpression Network Analysis. PloS Computational Biology. 4(8): e1000117. PMID: 18704157
The WGCNA R package is described in
• Langfelder P, Horvath S (2008) WGCNA: an R package for Weighted Correlation Network Analysis. BMC Bioinformatics. 2008 Dec 29;9(1):559. PMID: 19114008
For the generalized topological overlap matrix as applied to unweighted networks see
• Yip A, Horvath S (2007) Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 8:22

Module detection based on branch cutting is described in
Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics. Bioinformatics. 2008 Mar 1;24(5):71920. PMID: 18024473
# Absolutely no warranty on the code. Please contact SH with suggestions.
# CONTENTS
# This document contains function for carrying out the following tasks
# A) Assessing scale free topology and choosing the parameters of the adjacency function
# using the scale free topology criterion (Zhang and Horvath 05)
# B) Computing the topological overlap matrix
# C) Defining gene modules using clustering procedures
# D) Summing up modules by their first principal component (first eigengene)
# E) Relating a measure of gene significance to the modules
# F) Carrying out a within module analysis (computing intramodular connectivity)
# and relating intramodular connectivity to gene significance.
# G) Miscellaneous other functions, e.g. for computing the cluster coefficient.
# Downloading the R software
# 1) Go to http://www.Rproject.org, download R and install it on your computer
# After installing R, you need to install several additional R library packages:
# For example to install Hmisc, open R,
# go to menu "Packages/Install package(s) from CRAN",
# then choose Hmisc. R will automatically install the package.
# When asked "Delete downloaded files (y/N)? ", answer "y".
# Do the same for some of the other libraries mentioned below. But note that
# several libraries are already present in the software so there is no need to reinstall them.
# To get this tutorial and data files, go to the following webpage
# www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
# Unzip all the files into the same directory,
## The user should copy and paste the following script into the R session.
## Text after "#" is a comment and is automatically ignored by R.
# read in the R libraries
library(MASS) # standard, no need to install
library(class) # standard, no need to install
library(cluster)
library(sma) # install it for the function plot.mat
library(impute)# install it for imputing missing value
library(scatterplot3d)
# Download the WGCNA library as a .zip file from http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/
and choose "Install package(s) from local zip file" in the packages tab
library(WGCNA)
options(stringsAsFactors=F)
# Please adapt the file paths
setwd("C:/Documents and Settings/Steve Horvath/My Documents/ADAG/LinSong/NetworkScreening/MouseFemaleLiver")
# read in the custom network functions.
source("NetworkFunctions.txt")
# The following 3421 probe set were arrived at using the following steps
#1) reduce to the 8000 most varying, 2) 3600 most connected, 3) focus on unique genes
dat0=read.table("cnew_liver_bxh_f2female_8000mvgenes_p3600_UNIQUE_tommodules.xls",header=T)
names(dat0)
# this contains information on the genes
datSummary=dat0[,c(1:8,144:150)]
# the following data frame contains
# the gene expression data: columns are genes, rows are arrays (samples)
datExpr = t(dat0[,9:143])
no.samples = dim(datExpr)[[1]]
dim(datExpr)
datClinicalTraits=read.csv("BXH_ClinicalTraits_361mice_forNewBXH.csv",header=T)
#Now we order the mice so that trait file and expression file agree
restrictMice=is.element(datClinicalTraits$MiceID,dimnames(datExpr)[[1]])
table(restrictMice)
datClinicalTraits=datClinicalTraits[restrictMice,]
orderMiceTraits=order(datClinicalTraits$MiceID)
orderMiceExpr=order(dimnames(datExpr)[[1]])
datClinicalTraits =datClinicalTraits[orderMiceTraits,]
datExpr =datExpr[orderMiceExpr,]
#from the following table, we verify that all 135 mice are in order
table(datClinicalTraits$MiceID==dimnames(datExpr)[[1]])
rm(dat0);gc()
#SOFT THRESHOLDING For Weighted Network Construction
# To construct a weighted network (soft thresholding with the power adjacency matrix),
# we consider the following vector of potential thresholds.
# Now we investigate soft thesholding with the power adjacency function
powers1=c(seq(1,10,by=1),seq(12,18,by=2))
# To choose a cutoff value, we propose to use the Scalefree Topology Criterion (Zhang and
# Horvath 2005). Here the focus is on the linear regression model fitting index
# (denoted below by scale.law.R.2) that quantify the extent of how well a network
# satisfies a scalefree topology.
# The function PickSoftThreshold can help one to estimate the cutoff value
# when using hard thresholding with the step adjacency function.
# The first column lists the power beta
# The second column reports the resulting scale free topology fitting index R^2.
# The third column reports the slope of the fitting line.
# The fourth column reports the fitting index for the truncated exponential scale free model.
# Usually we ignore it.
# The remaining columns list the mean, median and maximum connectivity.
RpowerTable=pickSoftThreshold(datExpr, powerVector=powers1)[[2]]
Power scale.law.R.2 slope truncated.R.2 mean.k. median.k. max.k.
1 1 0.0848 0.392 0.484 706.00 722.000 1130.0
2 2 0.0244 0.624 0.857 240.00 238.000 532.0
3 3 0.2950 1.160 0.980 105.00 96.300 302.0
4 4 0.4760 1.570 0.975 53.10 44.500 188.0
5 5 0.6510 1.940 0.943 30.10 23.600 125.0
6 6 0.8810 1.610 0.982 18.50 13.500 86.3
7 7 0.9240 1.720 0.941 12.20 8.200 74.4
8 8 0.9110 1.760 0.900 8.47 5.120 67.5
9 9 0.8680 1.740 0.859 6.17 3.310 62.4
10 10 0.8350 1.690 0.846 4.67 2.240 58.2
11 12 0.8640 1.530 0.919 2.94 1.070 51.5
12 14 0.8740 1.410 0.948 2.03 0.527 46.2
13 16 0.9090 1.320 0.977 1.51 0.278 41.7
14 18 0.8910 1.250 0.961 1.17 0.150 37.9
cex1=0.7
par(mfrow=c(1,2))
plot(RpowerTable[,1], sign(RpowerTable[,3])*RpowerTable[,2],xlab="
Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n")
text(RpowerTable[,1], sign(RpowerTable[,3])*RpowerTable[,2], labels=powers1,cex=cex1,col="red")
# this line corresponds to using an R^2 cutoff of h
abline(h=0.8,col="red")
plot(RpowerTable[,1], RpowerTable[,5],xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n")
text(RpowerTable[,1], RpowerTable[,5], labels=powers1, cex=cex1,col="red")
To choose a cutoff value beta, we use the Scalefree Topology Criterion (Zhang and Horvath 2005). Here the focus is on the linear regression model
fitting index (denoted as scale.law.R.2) that quantify the extent of how well
a network satisfies a scalefree topology. We choose the soft thresholding parameter beta=6 for the correlation matrix since this is where the R^2 curve seems to saturates. From the above table, we find that the resulting slope looks OK (negative between 1 and 2). In the appendix, we investigate different choices of beta.
#Here the scale free topology criterion would lead us to pick a power of beta=6. In an appendix, we #study how the biological findings depend on the choice of the power.
beta1=6 # this is the the power adjacency function parameter in power(s,beta)
# By playing around with beta, you will find that the
# findings are highly robust with respect to beta1, which is an attractive property.
# The following computes the network connectivity (Connectivity)
Connectivity= softConnectivity(datExpr,power=beta1)
# Creating Scale Free Topology Plots (SUPPLEMENTARY FIGURE S1 in our article)
par(mfrow=c(1,1))
scaleFreePlot(Connectivity, truncated=T,main= paste("beta=",as.character(beta1)))
The Figure shows that the connectivity distribution p(k) is better modeled using an exponentially truncated power law p(k) ~ k^{}^{} exp(α k). In practice, we find that the two parameters α and provide too much flexibility in curve fitting. The truncated exponential model fitting index R^2 tends to be high irrespective of the adjacency function parameter. For this reason, we focus on the scalefree topology fitting index in our scalefree topology criterion. Exploring the use of the truncated exponential fitting index is beyond the scope of this article.
Module Detection
An important step in network analysis is module detetion.
To group genes with coherent expression profiles into modules, we use average linkage hierarchical clustering, which uses the topological overlap measure as dissimilarity.
The topological overlap of two nodes reflects their similarity in terms of the commonality of the nodes they connect to (see [Ravasz et al 2002, Yip and Horvath 2006]).
# Now define the power adjacency matrix
ADJ = adjacency(datExpr,power=beta1)
gc()
# The following code computes the topological overlap matrix based on the
# adjacency matrix.
# TIME: Takes several minutes
dissTOM=TOMdist(ADJ)
gc()
# Now we carry out hierarchical clustering with the TOM matrix. Branches of the
# resulting clustering tree will be used to define gene modules.
hierTOM = hclust(as.dist(dissTOM),method="average");
par(mfrow=c(1,1))
plot(hierTOM,labels=F)
# By our definition, modules correspond to branches of the tree.
# The question is what height cutoff should be used? This depends on the
# biology. Large height values lead to big modules, small values lead to small
# but very cohesive modules.
We used a dynamic cuttree algorithm for selection branches of the hierarchical clustering dendrogram (Langfelder Zhang Horvath 2008). The algorithm takes into account an essential feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination and the process is iterated until the number of clusters becomes stable.
# The following is the original code used in the paper by Ghazalpour et al
myheightcutoff =0.995
mydeepSplit = FALSE # fine structure within module
myminModuleSize = 20 # modules must have this minimum number of genes
#new way for identifying modules based on hierarchical clustering dendrogram
colorh1=cutreeDynamic(hierclust= hierTOM, deepSplit=mydeepSplit,maxTreeHeight =myheightcutoff,minModuleSize=myminModuleSize)
table(colorh1)
# Our code has slightly changed. If we could go back in time, we would use the following code
# for branch cutting. But please skip it....
colorhNEWstep1= dynamicTreeCut::cutreeDynamic(dendro=hierTOM, cutHeight =0.9965, minClusterSize = 20, method = "tree",deepSplit =F)
# to turn the branch lables (which are integers) into colors we use
colorhNEWstep2=labels2colors(colorhNEWstep1)
colorhNEWstep3=mergeCloseModules(datExpr, colors=colorh2, cutHeight = 0.15, MEs = NULL, impute = TRUE, useAbs = F)$colors
# Note that the resulting color is quite similar to the original one:
table(colorhNEWstep3,colorh1)
#This results in the following color assignment.
par(mfrow=c(2,1))
plot(hierTOM, main="Female Mouse Liver Network", labels=F, xlab="", sub="");
plotColorUnderTree(hierTOM,colors=data.frame(colorh1))
title("Colored by UNMERGED dynamic modules")
#Note that the colors correspond to portions of the branches.
#To determine whether some colors should be merged we a) represent each module by its #module eigengenes (defined as its first principal component) and b) clustering
#the principal components. If 2 module eigengenes (PCs) are highly correlated
#then the modules should be merged. A general rule may be that two modules are
#merged if the distance between the two is samller than 0.1 (i.e., correlation #is bigger than 0.9)
datME = moduleEigengenes(as.matrix(datExpr), colorh1)[[1]]
dissMEs = 1abs(cor(datME, use="p"))
dissMEs = ifelse(is.na(dissMEs), 0, dissMEs)
hierMEs = hclust(as.dist(dissMEs),method="a")
#display ME hierarchical dendrogram on screen
par(mfrow=c(1,1), mar=c(0, 3, 1, 1) + 0.1, cex=1)
plot(hierMEs, xlab="",ylab="",main="",sub="")
par(mfrow=c(1,1))
#This tree suggest to merge several colors
#To merge a minor cluster to a major cluster, we use
colorh1 = merge2Clusters(colorh1, mainclusterColor="lightcyan", minorclusterColor="grey60")
colorh1 = merge2Clusters(colorh1, mainclusterColor="blue", minorclusterColor="magenta")
colorh1 = merge2Clusters(colorh1, mainclusterColor="red", minorclusterColor="turquoise")
colorh1 = merge2Clusters(colorh1, mainclusterColor="red", minorclusterColor="pink")
colorh1 = merge2Clusters(colorh1, mainclusterColor="black", minorclusterColor="yellow")
colorh1 = merge2Clusters(colorh1, mainclusterColor="green", minorclusterColor="lightgreen")
colorh1 = merge2Clusters(colorh1, mainclusterColor="green", minorclusterColor="tan")
# After merging some colors we arrive at the following hierarchical plot
#### FIGURE 1A) in our manuscript
par(mfrow=c(2,1))
plot(hierTOM, main="Female Mouse Liver Network", labels=F, xlab="", sub="");
plotColorUnderTree(hierTOM,colorh1, title1="Colored by female liver modules")
# We also propose to use classical multidimensional scaling plots
# for visualizing the network. Here we chose 3 scaling dimensions
# This also takes about 10 minutes...
cmd1=cmdscale(as.dist(dissTOM),4)
par(mfrow=c(2,3))
plot(cmd1[,c(1,2)], col= as.character(colorh1) )
plot(cmd1[,c(1,3)], col= as.character(colorh1) )
plot(cmd1[,c(1,4)], col= as.character(colorh1) )
plot(cmd1[,c(2,3)], col= as.character(colorh1) )
plot(cmd1[,c(2,4)], col= as.character(colorh1) )
plot(cmd1[,c(3,4)], col= as.character(colorh1) )
### FIGURE 1 B in our article
par(mfrow=c(1,1))
scatterplot3d(cmd1[,1:3], color=as.character(colorh1), main="MDS plot",xlab="Scaling Dimension 1", ylab="Scaling Dimension 2", zlab="Scaling Dimension 3",cex.axis=1.5,angle=320)
TOM plot and MDS plots
To visualize the network, we used several plots. The topological overlap matrix plot represents the topological overlap matrix where rows and columns are sorted and colored according to the hierarchical clustering tree used in the module definition. A classical multidimensional scaling plot that uses the topological overlap matrix as input can also be used.
# An alternative view of this is the so called TOM plot that is generated by the
# function TOMplot
# Inputs: TOM distance measure, hierarchical (hclust) object, color
# Warning: for large gene sets, say more than 2000 genes
#this may take a while...
TOMplot(dissTOM , hierTOM, colorh1)
Definition of trait based gene significance
For a given physiological trait, we defined a measure of gene significance by forming the absolute value of the Spearman correlation between trait and gene expression values. For example, the body weight can be used to define a gene significance of the i^{th} gene expression GSweight(i) = cor(x(i), weight) where x(i) is the gene expression profile of the i^{th} gene.
A histogram of the clinical traits shows that several clinical traits appear to have outliers.
# To protect agains outliers, we replace the values of the physiological traits
# by their ranks.
rank1=function(x) rank(x, na.last="keep")
rankdatClinicalTraits=apply(datClinicalTraits[,5:26],2,rank1)
# This function computes the correlation between a gene expression
# and a physiological trait
if(exists("GSfunction")) rm(GSfunction)
GSfunction=function(x) {cor(x,rankdatClinicalTraits,use="p")}
# the following data frame has as columns the gene significance variables
# for different clinical traits
GeneSignificance =t(apply(datExpr,2,GSfunction))
dimnames(GeneSignificance)[[2]]=paste("GS",dimnames(rankdatClinicalTraits)[[2]],sep="" )
# Since we only care about absolute values of correlations between expression
# profiles and traits, we set
GeneSignificance=data.frame(abs(GeneSignificance))
names(GeneSignificance)
[1] "GSWeightG" "GSLengthCM" "GSAbFat" "GSOtherFat" "GSTotalFat"
[6] "GSX100xfat.weight" "GSTrigly" "GSTotalChol" "GSHDLChol" "GSUC"
[11] "GSFFA" "GSGlucose" "GSLDL.VLDL" "GSMCP.1.phys." "GSInsulin.ug.l."
[16] "GSGlucose.Insulin" "GSLeptin.pg.ml." "GSAdiponectin" "GSAorticLesions" "GSAneurysm"
[21] "GSAorticCal.M" "GSAorticCal.L"
# Here we define more conventional to annotate Figure 2 and Supplementary Figure S2
namesGS=c("Weight","Length","AbFat","OtherFat","TotalFat","Index", "Trigly","Chol","HDL","UC","FFA","Glucose","LDL+VLDL", "MCP1", "Insulin","GlucoseInsulin", "Leptin", "Adiponectin", "AorticLesions", "Aneurysm", "AorticCal.M", "AorticCal.L")
The mean gene significance for a particular module can be considered as a measure of module significance (MS) (see Materials and Methods for statistical test), which means that MS provides a measure for overall correlation between the trait and the module. This means that a module with high MS value for "body weight" is on average composed of genes highly correlated with body weight.
# Here we use the function verboseBoxplot to creates barplots
# that shows whether modules are enriched with significant genes.
# It also reports a Kruskal Wallis Pvalue.
# The gene significance can be a binary variable or a quantitative variable.
# It also plots the 95% confidence interval of the mean
par(mfrow=c(7,3), mar= c(1, 4, 3, 1) +0.1)
for (i in c(1:21) ) {
verboseBoxplot(GeneSignificance[,i],colorh1,col=levels(factor(colorh1)),main= namesGS[i],xlab="module",ylab="GS")
abline(h=.3,col="red") }
So, interesting combinations include
a) GSweight in the blue module and
b) Glucose.Insulin the the brown module.
Since our particular interest is in understanding body weight, we focus on the blue module.
The following code creates a barplot for the body weight based gene significance measure.
verboseBarplot(GeneSignificance[,1],colorh1,col=levels(factor(colorh1)),main="module significance",xlab="module",ylab="mean gene significance")
dim(GeneSignificance)
# Now we produce Figure 2 of the article.
whichmodule="blue"
# mean gene significance=module significance
meanGS=apply(abs(GeneSignificance)[colorh1==whichmodule,],2,mean)
# corresponding standard error
stderrGS= apply(abs(GeneSignificance)[colorh1==whichmodule,],2,stderr1)
# The following code produces a barplot with rotated axis labels
## Increase bottom margin to make room for rotated labels
par(mar = c(7, 4, 4, 2) + 0.1)
## Create plot and get bar midpoints in 'mp'
mp = barplot(as.vector(meanGS),col=whichmodule, ylab="Module Significance",cex.lab=1.5)
## Set up x axis with tick marks alone
axis(1, at = mp, labels = FALSE)
## text labels
labels = namesGS
## Plot x axis labels at mp, you may want to change the offser .005...
text(mp, par("usr")[3]  0.005, srt = 45, adj = 1, labels = labels, xpd = TRUE,cex=1.3)
## Plot x axis label at line 4
mtext(1, text = "Physiological Traits", line = 6,cex=1.5)
# This creates the error bars
err.bp(meanGS , stderrGS, two.side=T)
# To get a sense of how related the modules are one can summarize each module
# by its first eigengene (referred to as principal components).
# Next we cluster the eigengens. This is very similar to the code used above
# for identifying modules that should be merged.
dME2=moduleEigengenes(datExpr,colorh1)[[1]]
hclustdME2=hclust(as.dist( 1abs(t(cor(dME2, method="p")))), method="average" )
par(mfrow=c(1,1))
plot(hclustdME2, main="Clustering the Module Eigengenes")
# Now we create scatter plots of the samples (arrays) along the module eigengenes.
dME2=dME2[,hclustdME2$order]
pairs( dME2, upper.panel = panel.smooth, lower.panel = panel.cor , diag.panel=panel.hist ,main="Relation between modules")
Comment: each dot represents a mouse. Above the diagonal are scatterplots. Below are the corresponding absolute values of the correlation
#Now we study how connectivity is related to mean gene expression or variance of gene expression
#### This Supplementary Figure S3 in our article
par(mfrow=c(2,2))
whichmodule="blue"
# mean expression of the blue module genes
meanExprModule=apply( datExpr[,colorh1==whichmodule],2,mean1)
# variance of expression
varExprModule=apply( datExpr[,colorh1==whichmodule],2,var1)
ConnectivityModule= SoftConnectivity(datExpr[,colorh1==whichmodule], power=beta1)
verboseScatterplot(ConnectivityModule,varExprModule,xlab=paste("Connectivity (k.in)", whichmodule, " module"), ylab="Variance",col=whichmodule)
verboseScatterplot(ConnectivityModule,meanExprModule,xlab=paste("Connectivity (k.in)", whichmodule, " module"), ylab="Mean Expression",col=whichmodule)
meanExpr=apply( datExpr,2,mean1)
varExpr=apply( datExpr,2,var1)
verboseScatterplot(Connectivity,varExpr,xlab=paste("Whole Network Connectivity (k.all)"), ylab="Variance",col=colorh1)
verboseScatterplot(Connectivity,meanExpr,xlab=paste("Whole Network Connectivity (k.all)"), ylab="Mean Expression",col=colorh1)
In the coexpression network presented here, we find that the gene expression levels of hub genes are less variable (lower variance) than other, less connected, nodes across all mice. This is consistent with the idea that the network’s most highly connected hubs are resilient to large genetic background variations since they are vital for core biological functions.
# The following produces heatmap plots for each module.
# Here the rows are genes and the columns are samples.
# Well defined modules results in characteristic band structures since the corresponding genes are
# highly correlated.
ClusterSamples=hclust(dist(datExpr[,] ),method="average")
par(mfrow=c(3,1), mar=c(1, 2, 4, 1))
which.module="black"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ]) ),nrgcols=30,rlabels=T, clabels=T,rcols=which.module,
title=which.module )
which.module="blue"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ]) ),nrgcols=30,rlabels=T, clabels=T,rcols=which.module,
title=which.module )
which.module="brown"
plot.mat(t(scale(datExpr[ClusterSamples$order,][,colorh1==which.module ]) ),nrgcols=30,rlabels=T, clabels=T,rcols=which.module,
title=which.module )
# The function intramodularConnectivity computes the whole network connectivity kTotal,
# the within module connectivity (kWithin). kOut=kTotalkWithin and
# and kDiff=kInkOut=2*kINkTotal
ConnectivityMeasures=intramodularConnectivity(abs(cor(datExpr,use="p"))^beta1,colorh1)
names(ConnectivityMeasures)
[1] "kTotal" "kWithin" "kOut" "kDiff"
# The following plots show the gene significance vs intromodular connectivity
colorlevels=levels(factor(colorh1))
par(mfrow=c(4,3))
for (i in 1:12) {
whichmodule=colorlevels[i];restrict1=colorh1==whichmodule
verboseScatterplot(ConnectivityMeasures$kWithin[restrict1],GeneSignificance[restrict1,1],col=whichmodule,main=whichmodule,xlab="Intramodular k", ylab="Gene Signif")
}
APPENDIX: Constructing an unweighted networks and comparing it to the weighted nework.
Here we redo the network analysis using hard thresholding, i.e. dichotomizing the correlation matrix. We show that our main biological findings are highly robust with respect to the network construction method.
Use the scale free topology criterion for finding the hard threshold parameter tau.
thresholds1= c(seq(.1,.5, by=.1), seq(.55,.95, by=.05) )
TableHard=pickHardThreshold(datExpr, thresholds1)[[2]]
gc()
Cut p.value scale.law.R.2 slope. truncated.R.2 mean.k. median.k. max.k.
1 0.10 2.47e01 0.7750 2.460 0.8060 2290.000 2390 2880
2 0.20 1.96e02 0.1260 0.545 0.0153 1450.000 1540 2330
3 0.30 3.88e04 0.1110 0.124 0.0280 875.000 898 1820
4 0.40 1.40e06 0.2880 0.852 0.8180 478.000 454 1320
5 0.50 5.74e10 0.6120 1.340 0.9660 228.000 189 823
6 0.55 4.05e12 0.6690 1.520 0.9580 149.000 111 633
7 0.60 1.18e14 0.7310 1.680 0.9750 93.100 68 484
8 0.65 0.00e+00 0.7750 1.700 0.9880 55.100 35 344
9 0.70 0.00e+00 0.7250 1.910 0.9710 30.200 15 242
10 0.75 0.00e+00 0.6730 1.680 0.8400 15.600 5 147
11 0.80 0.00e+00 0.6530 1.500 0.6040 7.930 1 106
12 0.85 0.00e+00 0.0658 1.600 0.0294 4.290 0 100
13 0.90 0.00e+00 0.6990 0.987 0.8320 2.280 0 88
14 0.95 0.00e+00 0.9590 0.996 0.9620 0.558 0 42
To choose a cutoff value tau, we propose to use the Scalefree Topology Criterion (Zhang and Horvath 2005). Here the focus is on the linear regression model
fitting index (denoted as scale.law.R.2) that quantify the extent of how well
a network satisfies a scalefree topology. We choose the cut value (tau) of 0.7 for the correlation matrix since this is where the R^2 curve seems to saturates. From the above table, we find that the resulting slope looks OK (negative and between 1 and 2), and the mean number of connections looks good Below we investigate different choices of tau.
par(mfrow=c(1,1))
plot(thresholds1,
sign(TableHard[,4])*TableHard[,3], type="n",ylab="Scale Free Topology R^2",xlab="Hard Threshold tau", ylim=range(min(c( sign(TableHard[,4])*TableHard[,3]),na.rm=T),1) )
text(thresholds1, sign(TableHard[,4])*TableHard[,3], labels= thresholds1, col="black")
abline(h=.8)
tau1=.65 # this parameter is hard threshold parameter.
#Let’s define the adjacency matrix of an unweighted network
ADJHARD = I(abs(cor(datExpr[,],use="p"))>tau1)+0.0
gc()
# This is the unweighted connectivity
ConnectivityHard =as.vector(apply(ADJHARD,2,sum))
scaleFreePlot(ConnectivityHard,truncated=T,main=paste("tau=",as.character(tau1)))
# Let’s compare weighted to unweighted connectivity in a scatter plot
verboseScatterplot(ConnectivityHard, Connectivity,xlab="Unweighted Connectivity",ylab="Weighted Connectivity", col= as.character(colorh1))
# Comments:

the connectivity measures is highly preserved between weighted and unweighted networks
but there are marked differences for the brown module.
# The following code computes the topological overlap matrix based on the
# adjacency matrix.
dissTOMhard=TOMdist(ADJHARD)
gc()
# Now we carry out hierarchical clustering with the TOM matrix.
hierTOMhard = hclust(as.dist(dissTOMhard),method="average");
#Next, we study whether the `soft’ modules of the unweighted network described above can also be #found in the unweighted network
# The following shows the hierarchical tree based on the unweighted network but the
# genes are colored according to their membership in the weighted network
par(mfrow=c(2,1))
plot(hierTOMhard, main="Unweighted Network Module Tree ", labels=F, xlab="", sub="");
plotColorUnderTree(hierTOMhard, colors=colorh1)
title("Dynamic Colors, weighted network")
Comment: Overall the colors stay together. This is particularly true for the blue module, which is the main interest of our paper. This demonstrates that the module assignment is robust with respect to the network construction method.
# An alternative view of this is the so called TOM plot that is generated by the
# function TOMplot
# Inputs: TOM distance measure, hierarchical (hclust) object, color
# Here we use the unweighted module tree but color it by the weighted modules.
TOMplot(dissTOMhard , hierTOMhard, as.character(colorh1))
gc()
#Comment: module assignment is highly preserved.
ConnectivityMeasuresHARD=intramodularConnectivity(ADJHARD,colorh1)
Appendix: Computation of the cluster coefficient
Although, we don’t discuss the clustering coefficient in our main article, we briefly mention it here since it is an important network concept.
The cluster coefficient measures the cliquishness of a gene.
While we don’t use the clustering coefficient in our manuscript, we report it here for the sake of completeness.
# First, we start with the weighted network
cluster.coef= clusterCoef(ADJ)
gc()
# Now we plot cluster coefficient versus connectivity
# for all genes
par(mfrow=c(1,1),mar=c(2,2,2,1))
plot(Connectivity, cluster.coef,col=as.character(colorh1),xlab="Connectivity",ylab="Cluster Coefficient")
Overall, we find that the clustering coefficient in a weighted network is roughly constant for highly connected genes inside of a given module. Across modules the clustering coefficient varies a lot.
# Now we compute the CC for the unweighted network
diag(ADJHARD)=0
cluster.coefHARD= clusterCoef(ADJHARD)
ConnectivityHARD= apply(ADJHARD,2,sum)
par(mfrow=c(1,1))
plot(ConnectivityHARD,cluster.coefHARD,col=as.character(colorh1),xlab="Connectivity",ylab="Cluster Coefficient" )
# There is a marked difference between the weighted network and the unweighted network when it comes to the relationship between clustering coefficient and connectivity. This is further discussed in Zhang and Horvath 2005 and the following reference:
Horvath and Dong, Yip (2008) PloS Comp Biol
THE END
To cite the code and methods in this manual, please use
Zhang B, Horvath S (2005) A General Framework for Weighted Gene CoExpression Network Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17
Share with your friends: 