dPABBs: A webserver for Designing of Peptides Against Bacterial Biofilms

Main dataset: We took 90 biofilm-active peptides (BAPs) from the BaAMPs database as the positive dataset and 88 quorum sensing peptides (QSPs) from the QSPpred server as the negative dataset. These datasets were used to calculate the percentage amino acid, dipeptide and tripeptide compositions.
Independent dataset: For the positive independent dataset, 10 peptides were randomly selected from the BAPs. The negative independent dataset contains 10 peptides manually curated from the published literature.
Training dataset: Consequently, 80 BAPs and 88 QSPs have been used for training and testing the Support Vector Machine (SVM) and WEKA based models.

The dataset preparation is detailed in the flow diagrams below:

(A) Flow diagram representing the preparation of the positive dataset, comprising peptides active against bacterial biofilms.

(B) Flow diagram representing the preparation of the negative dataset, comprising quorum sensing peptides.

List of abbreviations used:

Abbreviation	Expansion
EDF	Extracellular death factor
RIP	RNAIII inhibiting peptide
RBP	RAP binding peptide
QS	Quorum sensing

Prediction approach:

Whole amino acid composition (%) and selected-residue composition (%) were used as input features for developing the SVM and WEKA based models. The amino acid composition is the percentage of each amino acid in a peptide; it converts a peptide sequence into a vector of 20 dimensions. The dipeptide composition is the percentage of each pair of adjacent amino acids in a peptide; it converts a peptide sequence into a vector of 400 (20 × 20) dimensions and helps capture the properties of neighboring amino acids. The tripeptide composition is the percentage of each triplet of adjacent amino acids in a peptide; it converts a peptide sequence into a vector of 8000 (20 × 20 × 20) dimensions and likewise helps capture the properties of neighboring amino acids.
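The composition features described above can be sketched in a few lines of Python. This is a minimal illustration, not the server's own code: the function name `kmer_composition`, the example peptide, and the choice to divide by the number of overlapping windows are assumptions.

```python
from itertools import product

# The 20 standard residues, ordered alphabetically by one-letter code (assumed order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k=1):
    """Return the percentage composition of all 20**k k-mers as a dict.

    k=1 gives amino acid composition (20 values), k=2 dipeptide
    composition (400 values), k=3 tripeptide composition (8000 values).
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(sequence) - k + 1  # number of overlapping windows of length k
    if total <= 0:
        return {kmer: 0.0 for kmer in kmers}
    for i in range(total):
        window = sequence[i:i + k]
        if window in counts:  # windows containing non-standard residues are not counted
            counts[window] += 1
    return {kmer: 100.0 * n / total for kmer, n in counts.items()}


peptide = "KWKLFKKIGA"  # hypothetical example sequence, not taken from the dataset
aac = kmer_composition(peptide, k=1)  # 20-dimensional feature vector
dpc = kmer_composition(peptide, k=2)  # 400-dimensional feature vector
```

Fixing the k-mer order once, as `product` does here, keeps every peptide's vector aligned column by column, which a classifier needs.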

Binary profile: In this approach, fixed-length patterns of five N-terminal residues (NT5) or five C-terminal residues (CT5) were converted into binary form. Each residue of a pattern was represented by a vector of dimension 20 (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0; Cys by 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), so a five-residue pattern gives a vector of 100 dimensions.
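A minimal sketch of the binary profile encoding follows. The Ala and Cys vectors match the example above; the order of the remaining 18 residues (alphabetical by one-letter code) is an assumption, as is taking CT5 to mean the last five residues of the peptide. The helper names are hypothetical.

```python
# Assumed residue order: alphabetical by one-letter code, so Ala is position 0, Cys position 1.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_profile(pattern):
    """One-hot encode each residue of a pattern and concatenate the vectors."""
    vector = []
    for residue in pattern.upper():
        one_hot = [0] * len(AMINO_ACIDS)
        # index() raises ValueError for a non-standard residue
        one_hot[AMINO_ACIDS.index(residue)] = 1
        vector.extend(one_hot)
    return vector


def nt5(sequence):
    """First five residues from the N terminus."""
    return sequence[:5]


def ct5(sequence):
    """Last five residues of the sequence (assumed meaning of CT5)."""
    return sequence[-5:]


peptide = "KWKLFKKIGA"  # hypothetical example sequence
nt5_features = binary_profile(nt5(peptide))  # 5 residues x 20 = 100 values
```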

SVM: The SVM was implemented using the freely downloadable software package SVM^light, written by Joachims (Joachims 1999). The software lets the user define a number of parameters and select from several inbuilt kernel functions, including a radial basis function (RBF) and a polynomial kernel.
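SVM^light reads training examples as text lines of the form `<target> <index>:<value> ...`, with feature indices starting at 1 and in increasing order. The sketch below shows one way a feature vector could be written in that format; the function name and the four-decimal formatting are illustrative assumptions, not the server's code.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' with 1-based indices.

    label is +1 for a positive example and -1 for a negative one.
    Zero-valued features are omitted, which the sparse format allows.
    """
    pairs = [
        f"{index}:{value:.4f}"
        for index, value in enumerate(features, start=1)
        if value != 0
    ]
    return " ".join([f"{label:+d}"] + pairs)


# Hypothetical three-feature example
line = to_svmlight_line(1, [40.0, 0.0, 10.0])
```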

Weka: In this study, the Random Forest algorithm, as implemented in the Weka package, was used to build Weka based models for predicting the anti-biofilm activity of novel or query peptides.

Feature Selection Through Weka: Feature selection in Weka used the CfsSubsetEval attribute evaluator with the BestFirst search method (parameters: -D 1 -N 5), with the full training set as the attribute selection mode. It produced 8 selected residues, 20 selected dipeptides and 26 selected tripeptides. The percentage composition of the 8 selected residues was used to build a Weka based model (Sel_8_res). The 8 selected residues are: D, E, F, K, R, S, T, W.

Feature Selection Through two sample t-test: A two-sample t-test (assuming unequal variances; Microsoft Office Excel 2007) was applied to the percentage whole amino acid composition (Whole AAC) calculated for BAPs and QSPs. It yielded fourteen residues, which were used to build an SVM based model (Sel_14_res). The 14 selected residues are: D, E, F, I, K, L, M, N, P, R, S, T, V, W.
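The two-sample t-test assuming unequal variances is commonly known as Welch's t-test. The sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom for one residue's percentage composition in two groups. It is an illustration only: the original analysis was done in Excel, the function name and sample values are hypothetical, and the significance cutoff used to select the fourteen residues is not stated here, so none is applied.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for a two-sample t-test assuming unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n)
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical percentage compositions of one residue in two groups of peptides
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
```

A residue would be retained when the p-value derived from `t_stat` and `dof` falls below the chosen cutoff.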

Cross-validation technique: Using the aforementioned input features, a number of SVM and Weka based models were developed. The performance of these models was evaluated with a five-fold cross-validation technique. The whole dataset was divided into five sets; in each round, four sets were used for training and the remaining set for testing. This process was repeated five times so that each set was used once for testing.
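The five-fold procedure can be sketched as an index generator. This is a minimal illustration under assumptions: the random shuffle, the seed, and the function name are not from the original, and the text does not say whether the folds were stratified by class.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [
            idx
            for j, fold in enumerate(folds)
            if j != held_out
            for idx in fold
        ]
        yield train, test


# 80 BAPs + 88 QSPs = 168 training examples
splits = list(five_fold_splits(168))
```

Each example appears in exactly one test set, so the five test-set predictions together cover the whole dataset once.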

Formulae Used: The formulae used to evaluate the performance of SVM and Weka based models are as given below:

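Here TP (true positives) and TN (true negatives) are the correctly predicted positive and negative examples, respectively. FP (false positives) are negative examples wrongly predicted as positive, and FN (false negatives) are positive examples wrongly predicted as negative.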
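In terms of these four counts, the conventional definitions of the measures reported in the results table are sketched below. It is assumed, not confirmed on this page, that the server uses these standard forms, with sensitivity, specificity and accuracy expressed as percentages.

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN} \times 100 \\
\text{Specificity} &= \frac{TN}{TN + FP} \times 100 \\
\text{Accuracy}    &= \frac{TP + TN}{TP + FP + TN + FN} \times 100 \\
\text{MCC}         &= \frac{TP \cdot TN - FP \cdot FN}
                           {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```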

Models Used on Webserver: SVM and WEKA based models have been used on the webserver. The following results were obtained from the SVM and WEKA based models using amino acid composition and binary pattern profiles as input features:

Classifier	Input feature	Threshold	Sensitivity	Specificity	Accuracy	MCC
SVM	Whole amino acid composition (Whole AAC)	0	92.50	97.73	95.24	0.91
SVM	Selected 14 residues composition (D, E, F, I, K, L, M, N, P, R, S, T, V, W)	0	88.75	94.32	91.67	0.83
SVM	N terminal first 5 residues (NT5) binary pattern profile (BPP)	0	95.89	83.33	90.91	0.81
SVM	C terminal first 5 residues (CT5) binary pattern profile (BPP)	0	97.26	72.92	87.60	0.75
WEKA	Whole amino acid composition (Whole AAC)	0.5	93.75	96.59	95.24	0.90
WEKA	Selected 8 residues composition (D, E, F, K, R, S, T, W)	0.5	93.75	95.45	94.64	0.89
WEKA	N terminal first 5 residues (NT5) binary pattern profile (BPP)	0.5	98.63	83.33	92.56	0.85
WEKA	C terminal first 5 residues (CT5) binary pattern profile (BPP)	0.5	90.41	85.42	88.43	0.76
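As a worked check on how such a row arises, the sketch below computes the four measures from confusion-matrix counts using the standard definitions. The counts are not reported on this page; they are inferred, assuming the 80 BAPs and 88 QSPs of the training dataset, as the counts that reproduce the first SVM row (Whole AAC). The function name is hypothetical.

```python
from math import sqrt


def performance(tp, fn, tn, fp):
    """Return (sensitivity %, specificity %, accuracy %, MCC)."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero; 0.0 is returned by convention
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc


# Inferred counts (assumption): 74 of 80 positives and 86 of 88 negatives correct
sens, spec, acc, mcc = performance(tp=74, fn=6, tn=86, fp=2)
```

Rounded to two decimals, these values agree with the first row of the table: 92.50, 97.73, 95.24 and 0.91.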

Citation: If you are using this server, please cite:

Gupta, P. et al. dPABBs: A Novel in silico Approach for Predicting and Designing Anti-biofilm Peptides. Sci. Rep. 6, 21839; doi: 10.1038/srep21839 (2016).

Developed by: OSDD Unit, CSIR-HQ, New Delhi-110001