SPLASH

Introduction

SPLASH (Structural Pattern Localization Analysis by Sequential Histograms) is a deterministic pattern discovery algorithm that finds sparse amino acid or nucleic acid patterns matching identically or similarly in a set of protein or DNA sequences. Sparse patterns of any length, up to the size of the input sequence, can be discovered without significant loss in performance. SPLASH is extremely efficient and embarrassingly parallel by nature. Large databases, such as a complete genome, the full set of PROSITE families, or the non-redundant SWISS-PROT database, can be processed in a few hours on a typical workstation. Alternatively, a protein family or superfamily with low overall homology can be analyzed to discover common functional or structural signatures.

Application Download

You can run SPLASH from within our geWorkbench framework (using the Pattern Discovery Component). Alternatively, you can use a standalone executable for your research. The supported platforms are:

Download SPLASH executable file for Windows (Cygwin)

Cygwin provides a Unix-like (POSIX) environment on Windows. It is required to run or compile SPLASH under Windows. You can download Cygwin here.

Download SPLASH executable file for Linux

Building From Source

Source code for SPLASH is available via this link.

Untar the downloaded file by typing:

gunzip splash.tar.gz

tar -xf splash.tar

You should have two directories, contrib and SPLASH.

cd into contrib/gsoap-2.7 and follow the instructions for building gsoap laid out in the README file.

Essentially,

./configure --prefix=$HOME

make

make install exec_prefix=$HOME

cd into SPLASH/src and build SPLASH by typing:

make

Then copy the splash executable to the desired directory.

These instructions are also available in the README located in the SPLASH/src directory.

Running SPLASH

The file splash.property should be in the same directory as the splash executable. The splash.property file contains just one line of the form

soapport=PORT_NUMBER

where PORT_NUMBER is the port for splash to bind to, e.g. 8040.
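For instance, the property file could be created from the shell as follows. This is a minimal sketch: 8040 is only the example port value given above, and SPLASH_DIR is a hypothetical variable name for the directory that holds the splash executable.

```shell
# Hypothetical variable: directory containing the splash executable.
# Defaults to the current directory if not set.
SPLASH_DIR="${SPLASH_DIR:-.}"

# Write the single required line; 8040 is just the example port.
printf 'soapport=%s\n' 8040 > "$SPLASH_DIR/splash.property"

# Print the file back for confirmation.
cat "$SPLASH_DIR/splash.property"
```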

By default, splash runs as a SOAP server (program type soapserver) that caWorkbench can connect to. To run SPLASH in standalone mode instead, type:

./splash -P standalone [other options] input

The help for splash is displayed by typing ./splash -h on the command line.
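As an illustration, a standalone run on one of the sample files might look like the sketch below. The flags are those listed in the usage text; the window and token values shown are simply the documented defaults, and treating histoall.fa as protein data is an assumption.

```shell
# Assumed input: one of the sample FASTA files, treated here as protein data.
INPUT="histoall.fa"

# Flags come from the usage text; -w 8 and -k 3 repeat the documented defaults.
set -- -P standalone -q protein -w 8 -k 3 "$INPUT"

if [ -x ./splash ]; then
    ./splash "$@"
else
    # Without the executable in place, only show the command that would be run.
    echo "would run: ./splash $*"
fi
```

Passing the defaults explicitly is not required; they are spelled out here only to show where the options go relative to the input file.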

The similarity matrix file "BLOSUM50" should be placed in the share subdirectory of the current directory. Any other matrix file should also be placed in that share subdirectory.
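One way to set this up from the shell is sketched below. It assumes the BLOSUM50 file has already been downloaded into the current directory, and that this is also the directory splash will be run from.

```shell
# Create the share subdirectory in the directory splash is run from.
mkdir -p share

# Copy the matrix in, if it has been downloaded to the current directory.
if [ -f BLOSUM50 ]; then
    cp BLOSUM50 share/
else
    echo "BLOSUM50 not found in $(pwd); download it first" >&2
fi

# List the matrices now available to splash.
ls share
```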

Sample Data

histoall.fa

HistoneH1_aah29046.fa

H1BLASTed.fa

Similarity Matrix

BLOSUM50

Documentation And Support

Usage: splash [OPTIONS]... [FILE]

(FILE default is input.fa)

The options are as follows:

-P program_type. Default: soapserver [soapserver|standalone]
-a algo_type. Default: regular [regular|exhaustive|hierarchical]
-q token_type. Default: dna [dna|protein]
-% support_as_percent_of_sequences. Default: 0.80 (Not compatible with j)
-b min_identity_tokens. Default: 2
-i reported patterns must match identically on each token. Default: not set
-j min_support. Default: 80% of sequences in FILE. (Not compatible with %)
-k min_tokens_in_window. Default: 3
-l min_tokens. Default: min_tokens_in_window
-w window. Default: 8
-c cluster size. Default: 10 (Hierarchical)
-d min pattern in cluster. Default: 10 (Hierarchical)
-C decrease_support. Default: 0.05 [0.0 - 1.0] (Exhaustive)
-D min_support. Default: 0.5 [0.0 - 1.0] (Exhaustive)
-m file_name. A similarity matrix file
-o output_type to be supported
-t thread_id number_of_threads. Default: 0 1
-T number_of_processors. Default: 1
-u count sequences. Default: set
-v verbose - print pattern detail. Default: not set
-x max_patterns. Default: 100,000
-z z_core. Set compute the Zscore. default: not set
-h display this help and exit

Contact us

The development team encourages comments and questions about SPLASH. You can email us at ac2248@cumc.columbia.edu.

Related publications

Califano A. SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics. 2000 Apr;16(4):341-57.

Stolovitzky G, Califano A. Statistical significance of patterns in biosequences. 1998 Oct 9.