G’s scale [40]; the hydrophobic moment was given by Eisenberg’s equation [40]; and the average charge was calculated as the net charge at physiological pH normalized by the number of residues. The final ensemble of sequence descriptors was defined through a principal component analysis (PCA). The nine descriptors were measured for the positive data set, and then the PCA was applied; subsequently, the descriptors with redundant behavior or with little influence on the variance were removed. Then, a two-sided Wilcoxon-Mann-Whitney non-parametric test was applied to verify the differences between the sequence descriptors in the PS and NS sets, with a critical value of 0.05. The statistical analyses were performed with the R package for statistical computing (http://www.r-project.org).

Materials and Methods

Data Sets

The positive data set (PS) was constructed by selecting sequences with four or more cysteine residues from the Antimicrobial Peptides Database (APD) [35]. This set was manually curated, keeping only the sequences annotated with activity against at least one of bacteria, fungi or viruses. In addition, incomplete sequences were removed. PS was composed of 385 sequences ranging in size from 16 to 90 amino acid residues. The negative data set (NS) was composed of a subset of the Protein Data Bank (PDB), whereas in our previous work it was composed of random proteins predicted as transmembrane [20]. Initially, the protein sequences

Figure 1. Principal component analysis of sequence descriptors for cysteine-stabilized peptides. The components are indicated by arrows: the longer the arrow, the greater the component’s contribution to the set’s variance. (A) The disposition of the nine sequence descriptors in the peptide space; (B) the final ensemble of descriptors, after the hydrophobic moment, the index of β-sheet formation, the ratio between charged and hydrophobic residues, and the α-helix propensity were ruled out. doi:10.1371/journal.pone.0051444.g

CS-AMPPred: The Cysteine-Stabilized AMPs Predictor

Figure 2. Distribution of sequence descriptor values. The left box in each panel corresponds to the AMPs. All descriptors differ significantly from the non-antimicrobial data set at a critical value of 0.05. The observed p-values are as follows: charge (<2.2e-16), hydrophobicity (2.169e-06), flexibility (<2.2e-16), index of α-helix formation (<2.2e-16) and index of loop formation (2.908e-10). doi:10.1371/journal.pone.0051444.g

Support Vector Machine’s Training and Validation

Three SVM models were developed with SVM Light [41], using the linear, polynomial and radial kernels. The models were trained on the training set. An overview of each model’s accuracy was estimated through a 5-fold cross validation, taking into account only the training data set. Then, the models were challenged against the blind data set, where the following parameters were measured:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \times 100$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \times 100$$

Figure 3. ROC curves for the CS-AMPPred models against the blind data set (BS1). doi:10.1371/journal.pone.0051444.g

Table 1. Evaluation of CS-AMPPred models against the individual cysteine-stabilized AMP classes and against PDB sequences that were not used in the data sets.

α-defensins¹      93.33   97.78   97.78
β-defensins¹      96.83   95.24   96.83
CSαβ defensins¹   81.36   77.12   77.12
Cyclotides¹       70.34   81.36   83.05
Undefined¹        84.13   79.37   80.95
PDB#              80.65   82.55   81.

Model L.
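The three measures defined in the Support Vector Machine’s Training and Validation section can be computed directly from the counts of true and false predictions. The following Python sketch is illustrative only: the class and function names are hypothetical, the counts in the usage note are made up, and it is not the authors’ code (the original analyses used SVM Light and R).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Prediction outcome counts (hypothetical helper, not from the paper)."""
    tp: int  # AMPs predicted as AMPs
    tn: int  # non-AMPs predicted as non-AMPs
    fp: int  # non-AMPs predicted as AMPs
    fn: int  # AMPs predicted as non-AMPs


def count_outcomes(predicted, actual):
    """Tally outcomes from two equal-length sequences of booleans (True = AMP)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = tn = fp = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif not p and not a:
            tn += 1
        elif p and not a:
            fp += 1
        else:
            fn += 1
    return ConfusionCounts(tp=tp, tn=tn, fp=fp, fn=fn)


def sensitivity(c):
    """Percentage of true AMPs that were recovered: TP / (TP + FN) x 100."""
    return 100.0 * c.tp / (c.tp + c.fn)


def specificity(c):
    """Percentage of non-AMPs that were rejected: TN / (TN + FP) x 100."""
    return 100.0 * c.tn / (c.tn + c.fp)


def accuracy(c):
    """Percentage of all predictions that were correct."""
    return 100.0 * (c.tp + c.tn) / (c.tp + c.tn + c.fn + c.fp)
```

For example, with made-up counts of 45 true positives, 5 false negatives, 40 true negatives and 10 false positives, `sensitivity` gives 90.0, `specificity` gives 80.0 and `accuracy` gives 85.0.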
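The cysteine-count criterion used to build the positive data set and the average-charge descriptor (net charge normalized by the number of residues) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the toy sequences are invented, and the charge model (Lys and Arg as +1, Asp and Glu as -1, every other residue and both termini as neutral) is a simplification that the source does not specify.

```python
# Assumed charge model at physiological pH (a simplification, not from the paper):
# Lys/Arg = +1, Asp/Glu = -1, all other residues and the termini = 0.
RESIDUE_CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def has_min_cysteines(sequence, minimum=4):
    """Return True if the sequence contains at least `minimum` cysteine residues."""
    return sequence.upper().count("C") >= minimum


def select_candidates(sequences, minimum=4):
    """Keep only the sequences that satisfy the cysteine-count criterion."""
    return [s for s in sequences if has_min_cysteines(s, minimum)]


def average_charge(sequence):
    """Net charge under the assumed model divided by the number of residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    net = sum(RESIDUE_CHARGE.get(residue, 0) for residue in seq)
    return net / len(seq)
```

With the invented sequence `"ACKCRCDCGK"` (four cysteines, net charge +2 over 10 residues), `has_min_cysteines` returns `True` and `average_charge` returns 0.2, whereas `"AKCGRC"` has only two cysteines and is dropped by `select_candidates`. The manual curation steps described in the text (activity annotation, removal of incomplete sequences) are not modeled here.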