SPLASH
Introduction
SPLASH (Structural Pattern Localization Analysis by Sequential Histograms) is a deterministic pattern discovery algorithm that finds sparse amino acid or nucleic acid patterns matching identically or similarly in a set of protein or DNA sequences. Sparse patterns of any length, up to the size of the input sequence, can be discovered without significant loss in performance. SPLASH is extremely efficient and embarrassingly parallel by nature. Large databases, such as a complete genome, the full set of PROSITE families, or the non-redundant SWISS-PROT database, can be processed in a few hours on a typical workstation. Alternatively, a protein family or superfamily with low overall homology can be analyzed to discover common functional or structural signatures.
Application Download
You can use our geWorkbench framework to run SPLASH (using the Pattern Discovery Component). Alternatively you can use a standalone executable for your research. The supported platforms are:
Download SPLASH executable file for Windows (Cygwin)
Cygwin provides a Unix-like (POSIX) environment on Windows. It is required to run or compile SPLASH under Windows. You can download Cygwin here.
Download SPLASH executable file for Linux
Building From Source
Source code for SPLASH is available via this link
Untar the downloaded file by typing:
gunzip splash.tar.gz
tar -xf splash.tar
You should have two directories, contrib and SPLASH.
cd into contrib/gsoap-2.7 and follow the instructions for building gsoap laid out in the README file.
Essentially,
./configure --prefix=$HOME
make
make install exec_prefix=$HOME
cd into SPLASH/src and build SPLASH by typing:
make
Then copy the splash executable to the desired directory.
These instructions are also available in the README located in the SPLASH/src directory.
Running SPLASH
The file splash.property should be in the same directory as the splash executable. The splash.property file contains just one line of the form
soapport=PORT_NUMBER
where PORT_NUMBER is the port for splash to bind to, e.g. 8040.
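As a minimal sketch, the property file can be generated from the shell. The helper name write_splash_property and its arguments are illustrative and not part of SPLASH; only the soapport=PORT_NUMBER line format comes from the description above.

```shell
# Hypothetical helper (not part of SPLASH): write a one-line splash.property
# file into the directory that holds the splash executable.
# Usage: write_splash_property DIR PORT
write_splash_property() {
    dir=$1
    port=$2
    # Reject anything that is not a plain decimal number.
    case $port in
        ''|*[!0-9]*) echo "invalid port: $port" >&2; return 1 ;;
    esac
    # The file consists of exactly one line: soapport=PORT_NUMBER
    printf 'soapport=%s\n' "$port" > "$dir/splash.property"
}
```

For example, write_splash_property . 8040 creates ./splash.property containing the single line soapport=8040.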
By default, splash runs as a SOAP server (program type soapserver), which caWorkbench can connect to. To run SPLASH as a standalone program, type:
./splash -P standalone [other options] input
The help for splash is displayed by typing ./splash -h on the command line.
The similarity matrix file "BLOSUM50" should be placed in the share subdirectory of the current directory. Any other matrix file should be placed in this directory.
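The file layout described above can be checked before starting a run. The following is a sketch only: the function name check_splash_layout is hypothetical, and it assumes that splash is started from the directory containing the executable, so that "current directory" and "executable directory" coincide.

```shell
# Hypothetical pre-flight check (not part of SPLASH): verify that the files
# described above are where splash is expected to look for them.
# Usage: check_splash_layout [DIR]   (DIR holds the splash executable; default: .)
check_splash_layout() {
    dir=${1:-.}
    status=0
    # Assumption: the default matrix file is literally named BLOSUM50.
    for f in splash.property share/BLOSUM50; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return $status
}
```

The function returns a non-zero status and names each missing file on standard error, so it can guard a run, as in check_splash_layout . && ./splash -P standalone input.fa.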
Sample Data
Similarity Matrix
Documentation And Support
Usage: splash [OPTIONS]... [FILE]
(FILE default is input.fa)
The options are as follows:
-P program_type. Default: soapserver [soapserver|standalone]
-a algo_type. Default: regular [regular|exhaustive|hierarchical]
-q token_type. Default: dna [dna|protein]
-% support_as_percent_of_sequences. Default: 0.80 (Not compatible with j)
-b min_identity_tokens. Default: 2
-i reported patterns must match identically on each token. Default: not set
-j min_support. Default: 80% of sequences in FILE. (Not compatible with %)
-k min_tokens_in_window. Default: 3
-l min_tokens. Default: min_tokens_in_window
-w window. Default: 8
-c cluster size. Default: 10 (Hierarchical)
-d min pattern in cluster. Default: 10 (Hierarchical)
-C decrease_support. Default: 0.05 [0.0 - 1.0] (Exhaustive)
-D min_support. Default: 0.5 [0.0 - 1.0] (Exhaustive)
-m file_name. A similarity matrix file
-o output_type to be supported
-t thread_id number_of_threads. Default: 0 1
-T number_of_processors. Default: 1
-u count sequences. Default: set
-v verbose - print pattern detail. Default: not set
-x max_patterns. Default: 100,000
-z z_core. Set compute the Zscore. default: not set
-h display this help and exit
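To illustrate how these options combine, the sketch below assembles a standalone command line and prints it without running anything. The wrapper name splash_cmd is hypothetical. It also assumes that -j takes an absolute number of sequences while -% takes a fraction such as 0.80; the usage text only states that the two are incompatible, so check ./splash -h before relying on this.

```shell
# Hypothetical wrapper (not part of SPLASH): print a standalone command line
# built only from flags listed in the usage text. Nothing is executed.
# Usage: splash_cmd TOKEN_TYPE SUPPORT INPUT
#   SUPPORT containing a decimal point (e.g. 0.80) is passed to -%.
#   Any other SUPPORT (e.g. 12) is passed to -j (assumed to be a count).
splash_cmd() {
    token=$1
    support=$2
    input=$3
    case $token in
        dna|protein) ;;
        *) echo "token type must be dna or protein" >&2; return 1 ;;
    esac
    # -% and -j are mutually exclusive, so exactly one of them is chosen.
    case $support in
        *.*) flag='-%' ;;
        *)   flag='-j' ;;
    esac
    printf './splash -P standalone -q %s %s %s %s\n' \
        "$token" "$flag" "$support" "$input"
}
```

Calling splash_cmd protein 0.80 family.fa prints ./splash -P standalone -q protein -% 0.80 family.fa, which can be reviewed and then pasted into the shell.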
Contact us
The development team encourages comments and questions about SPLASH. You can email us at ac2248@cumc.columbia.edu.
Related publications
Califano A. SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics. 2000 Apr;16(4):341-57.
Stolovitsky G, Califano A. Statistical significance of patterns in biosequences. 1998 Oct 9.