Cleaner
Cleaner is an R-system software package for the assembly of informative, transcript-specific probe-clusters for Affymetrix expression microarrays.
Download Application
Click here to download Cleaner.
Documentation and Support
This page explains how to install the cleaner package for R. If you have any questions, suggestions, or problems with installation, send an email to Mariano Alvarez.
Dependencies
R v2.8.0 or higher
BWA
affy package for R
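The three dependencies above can be checked from inside an R session before installing anything. The following is a minimal sketch, assuming the BWA executable is named "bwa" and is expected on the PATH; the variable names are illustrative only.

```r
# Check the three dependencies listed above from inside an R session.

# 1. R version: getRversion() returns a version object that can be compared
#    directly with a version string.
r_ok <- getRversion() >= "2.8.0"

# 2. BWA: Sys.which() returns "" when the executable is not on the PATH.
#    Assumption: the binary is called "bwa".
bwa_ok <- unname(nzchar(Sys.which("bwa")))

# 3. affy: system.file() returns "" when the package is not installed.
affy_ok <- nzchar(system.file(package = "affy"))

deps <- c(R = r_ok, BWA = bwa_ok, affy = affy_ok)
print(deps)
```

A FALSE entry points to the section below that covers obtaining the missing dependency.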
Obtaining R
You can download R from the Comprehensive R Archive Network (CRAN). Visit http://cran.r-project.org or a local mirror (for example, http://cran.us.r-project.org). Source code is available for Unix, and binaries are available for Windows, MacOS, and many versions of Linux.
Configuring R
Full write permissions are necessary to install Bioconductor packages. Either set up R environment variables to define the destination of personal package libraries (alternatives to the write-protected /usr/lib/R/usr/share/doc/R) or install R with write permissions for your account (see "Add-on packages" in the R Installation and Administration manual, available at http://www.r-project.org).
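One way to set up a personal library for the current session is sketched below. The directory is a placeholder chosen for illustration (a temporary folder); in practice any writable directory can be used, and the R_LIBS_USER environment variable (for example in ~/.Renviron) makes the choice persistent across sessions.

```r
# Create a personal package library and put it first on the library search path.
# The location below is a placeholder for illustration; any writable directory works.
lib_dir <- file.path(tempdir(), "R-personal-library")
dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)

# Prepend the directory: install.packages() installs into the first entry by default.
.libPaths(c(lib_dir, .libPaths()))
print(.libPaths()[1])
```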
Obtaining BWA
You can obtain BWA from SourceForge (http://bio-bwa.sourceforge.net/). BWA is used for mapping probe sequences to RefSeq (the probeMap function of the cleaner package). BWA is not needed for mapping U133P2 and U95av2 probes; the remappings are provided in separate files in the data folder of the package and can be passed directly to the cleaner function.
Obtaining affy
You can obtain the latest version of affy from Bioconductor (www.bioconductor.org).
Obtaining RefSeq Database Files
Using BWA to map probes to genes relies on the RefSeq database. The necessary files for human chips are 'human.rna.fna', 'gene2refseq', and 'refFlat.txt', which can be downloaded from: 'ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.fna.gz', 'ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz', and 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz'. For species other than human, the corresponding RefSeq files should be used instead of 'human.rna.fna' and 'refFlat.txt'.
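The three downloads can also be scripted from R. This is a minimal sketch: the URLs are the ones given above, while the helper name fetch_refseq and its arguments are hypothetical and not part of the cleaner package.

```r
# URLs of the three human files named above, keyed by local file name.
refseq_urls <- c(
  "human.rna.fna.gz" = "ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.fna.gz",
  "gene2refseq.gz"   = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz",
  "refFlat.txt.gz"   = "ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz"
)

# Hypothetical helper: download every file into dest_dir, skipping files
# that are already present. Nothing is downloaded until it is called.
fetch_refseq <- function(urls, dest_dir = ".") {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, names(urls))
  for (i in seq_along(urls)) {
    if (!file.exists(dest[i])) {
      # mode = "wb" keeps the gzip archives intact on Windows.
      download.file(urls[[i]], destfile = dest[i], mode = "wb")
    }
  }
  invisible(dest)
}
```

Calling fetch_refseq(refseq_urls, "refseq") would place the archives in a folder named "refseq" (a placeholder). The files arrive gzip-compressed and need to be decompressed, for example with gunzip, to obtain the file names listed above.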
affy Installation
The easiest way to get and install the affy package is to select the "BioC Software" repository with the "setRepositories()" function in R and then type "install.packages("affy")".
cleaner Installation
Unzip and untar the file cleaner.tar.gz in a given directory, for example ~/R/packages/. cleaner can then be installed from R by typing "install.packages("~/R/packages/cleaner", repos=NULL)".
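The two installation steps (affy, then cleaner) can be collected in one place, as in the sketch below. The helper name install_cleaner_stack is hypothetical, and selecting repositories by index assumes that CRAN and the Bioconductor software repository are entries 1 and 2 of R's default repository list.

```r
# Hypothetical helper wrapping the two installation steps.
# Nothing is installed until the function is called.
install_cleaner_stack <- function(tarball = "cleaner.tar.gz",
                                  pkg_dir = "~/R/packages") {
  # Select CRAN plus the Bioconductor software repository.
  # Assumption: these are entries 1 and 2 of the default repository list.
  setRepositories(ind = 1:2)
  install.packages("affy")

  # Unpack the archive, then install the unpacked source directory.
  pkg_dir <- path.expand(pkg_dir)
  dir.create(pkg_dir, recursive = TRUE, showWarnings = FALSE)
  untar(tarball, exdir = pkg_dir)
  install.packages(file.path(pkg_dir, "cleaner"), repos = NULL, type = "source")
}
```

Running install_cleaner_stack() from the directory that holds cleaner.tar.gz would reproduce the manual steps with the example path ~/R/packages/.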
Getting Started
Once you start R, you'll need to type "library(cleaner)" to load the package. The package includes four main functions. You can get the documentation for them by typing: "?probeMap" "?mapCleaner" "?cleaner" "?newCDF"
The full process of probe mapping and cleaning is implemented in the function "mapCleaner". mapCleaner is an extensive wrapper for the functions probeMap and cleaner. probeMap is a wrapper for BWA and is used to produce a remapping of probes to RefSeq transcripts. Example remapping output for Affy arrays U133P2 and U95av2 can be obtained at the links above. probeMap takes the RefSeq DNA database, the RefSeq flat file, and a RefSeq-to-gene mapping file (described above) as input to produce probe-to-gene mappings. probeMap output is passed to the main function (cleaner) to construct consistent probe clusters. For convenience, we suggest storing remapping files and reusing them. This eliminates the need to run mapCleaner and probeMap again. Finally, the function "newCDF" constructs CDF and probe annotation packages from cleaner probe clusters. The CDF and probe annotation packages can be used as blueprints for MAS5, RMA, and GCRMA normalization using the 'affy' and 'gcrma' packages from Bioconductor (http://www.bioconductor.org).
Example (using the given remapped probe file):
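# Set the working directory to the folder that contains the CEL files
setwd('working directory where the cel files are')
# Build probe clusters from the supplied probe-mapping file (replace the quoted placeholder with the real directory)
cleanClusters <- cleaner(file.path(.Library, "'directory where the probe-mapping files are located'/probeMappings-hgu133plus2"))
# Build the CDF and probe annotation packages from the probe clusters
newCDF(cleanClusters, chip="hgu133plus2", subfix="example")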
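Once the packages built by newCDF are installed, they can be used with affy for normalization. The sketch below shows one possible RMA workflow. The helper name normalize_with_cdf is hypothetical, and the value passed as cdf_name must be the actual name of the CDF package produced by newCDF, which is not specified here.

```r
# Hypothetical helper: read CEL files with a custom CDF and summarize with RMA.
# cdf_name must be the name of the CDF package built by newCDF; how that name
# is derived from the chip and subfix arguments is not covered here.
normalize_with_cdf <- function(cel_dir, cdf_name) {
  # ReadAffy() reads every CEL file in cel_dir; cdfname overrides the default CDF.
  abatch <- affy::ReadAffy(celfile.path = cel_dir, cdfname = cdf_name)
  # rma() performs background correction, quantile normalization and summarization.
  eset <- affy::rma(abatch)
  # Return the matrix of log2 expression values (rows are probe clusters).
  Biobase::exprs(eset)
}
```

The function would be called with the CEL file directory and the CDF name, for example normalize_with_cdf(getwd(), cdf_name) after setting cdf_name to the installed package's name. affy::mas5() can be used in place of affy::rma() for MAS5 summarization.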
Current version 1.03
Changes made to Cleaner from version 1.02
Cleaner no longer discards non-consistent probes. They are kept in the mapping results and aggregated in probe-clusters with names ending in 'b'. For example, all consistent probes mapping to TP53 are aggregated in consistent probe-clusters named 7157_1, 7157_2, etc., while all the non-consistent probes are aggregated in the probe-cluster 7157_b.
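The naming convention makes it easy to separate consistent from non-consistent probe-clusters downstream. The sketch below works on a made-up vector of cluster names; it assumes the part before the underscore is the Entrez gene ID (as in the TP53 example) and that the 672_* names, invented for illustration, follow the same pattern.

```r
# Example probe-cluster names following the convention described above.
# The 672_* entries are invented for illustration.
cluster_ids <- c("7157_1", "7157_2", "7157_b", "672_1", "672_b")

# Split each name into the gene ID (before "_") and the suffix (after it).
parts <- strsplit(cluster_ids, "_", fixed = TRUE)
clusters <- data.frame(
  cluster   = cluster_ids,
  entrez_id = vapply(parts, `[`, character(1), 1),
  suffix    = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Clusters with suffix "b" hold the non-consistent probes.
clusters$consistent <- clusters$suffix != "b"

consistent_only <- clusters$cluster[clusters$consistent]
print(consistent_only)
```

The same logical column can be used to subset the rows of an expression matrix whose row names are probe-cluster names.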
Changes made to Cleaner from version 1.01
The newCDF function can now build annotation packages for a given platform "A" from cleaner results obtained from a different platform "B". This is particularly useful, for example, for creating annotation packages for the HT-U133A Affymetrix platform from U133A experimental data. Apart from differences in the physical distribution of the features, U133A shares most probes with HT-U133A, with the exception of probes mapping to 18S and 28S ribosomal RNAs, which have been removed from the HT-U133A platform.
Changes made to Cleaner from version 1.00
Zoom-based mapping was replaced by BWA (http://bio-bwa.sourceforge.net/).
Probe mapping is now considered redundant only when the probe maps to two or more non-overlapping loci (entrezIDs). Overlapping loci can be the result of read-through transcripts. Genome positions for each locus are obtained from the refFlat.txt file from the UCSC genome server (ftp://hgdownload.cse.ucsc.edu).
Pseudogenes are no longer considered for mapping purposes. Pseudogene information is obtained from the gene_info database (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz).
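The redundancy and pseudogene rules above can be illustrated with a toy example. This is a sketch under stated assumptions, not the package's implementation: the locus table, its column names and coordinates are made up, pseudogenes are identified by the value "pseudo" in a type_of_gene column (as in the NCBI gene_info file), and for probes hitting more than two loci the probe is treated as redundant if any pair of loci fails to overlap.

```r
# Toy locus table; IDs, coordinates and gene types are made up for illustration.
loci <- data.frame(
  entrez_id    = c("101", "102", "103", "104"),
  chrom        = c("chr1", "chr1", "chr2", "chr1"),
  start        = c(1000, 1800, 500, 5000),
  end          = c(2000, 2600, 900, 6000),
  type_of_gene = c("protein-coding", "protein-coding", "protein-coding", "pseudo"),
  stringsAsFactors = FALSE
)

# Two loci overlap when they share a chromosome and their intervals intersect.
loci_overlap <- function(a, b) {
  a$chrom == b$chrom && a$start <= b$end && b$start <= a$end
}

# A probe is treated as redundant when, after dropping pseudogenes, at least
# one pair of its target loci does not overlap (an assumption for >2 loci).
is_redundant <- function(hit_ids, loci) {
  hits <- loci[loci$entrez_id %in% hit_ids & loci$type_of_gene != "pseudo", ]
  n <- nrow(hits)
  if (n < 2) return(FALSE)
  pairs <- combn(n, 2)  # each column holds one pair of row indices
  any(apply(pairs, 2, function(p) !loci_overlap(hits[p[1], ], hits[p[2], ])))
}

print(is_redundant(c("101", "102"), loci))  # overlapping loci on chr1
print(is_redundant(c("101", "103"), loci))  # loci on different chromosomes
print(is_redundant(c("101", "104"), loci))  # second hit is a pseudogene
```

In this toy table only the second probe is flagged: the first maps to two overlapping loci, and the third has a single locus left once the pseudogene is dropped.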
Probe-mapping files for popular Affymetrix platforms
Arabidopsis thaliana ath1121501 Affymetrix array
HT Human Genome U133A Array Plate
HT Human Genome U133B Array Plate
Previous versions
Version 1.02
Version 1.01
Cleaner version 1.01 for Linux
Cleaner version 1.01 for Windows
Version 1.00
Cleaner version 1.00 for Linux
Cleaner version 1.00 for Windows
Related publication
Alvarez MJ, Sumazin P, Rajbhandari P, Califano A. Correlating measurements across samples improves accuracy of large-scale expression profile experiments. Genome Biol. 2009;10(12):R143.