PyPredT6: A Python-based Prediction Tool for Identification of Type VI Effector Proteins

Rishika Sen, Losiana Nayak, Rajat K. De

rishika21sen@gmail.com, losiana_t@isical.ac.in, rajat@isical.ac.in

Bacteria deliver effector proteins into their hosts through dedicated machinery known as secretion systems (SS). So far, six secretion systems, T1SS to T6SS, have been identified in gram-negative bacteria. The T6 effector proteins of many gram-negative bacteria have not yet been discovered. We have developed PyPredT6, a Python-based tool that provides a convenient way to predict whether a protein is a T6 effector. It is trained on experimentally validated effector proteins and classifies by a consensus of classifiers: the predictions of an Artificial Neural Network (ANN), a Support Vector Machine (SVM), k-Nearest Neighbors (kNN), Naive Bayes (NB) and Random Forest (RF) are combined to annotate an unknown protein.
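The description above does not state how the five predictions are combined. As an illustration only, the sketch below assumes a simple majority vote over binary labels (1 = effector, 0 = non-effector). The function name `consensus_label` and the example votes are hypothetical and are not taken from the PyPredT6 source.

```python
from collections import Counter


def consensus_label(votes):
    """Return the label predicted by most classifiers.

    `votes` maps a classifier name to its predicted label
    (1 = effector, 0 = non-effector in this sketch).
    Returns None when the top two labels are tied.
    """
    if not votes:
        raise ValueError("at least one classifier vote is required")
    counts = Counter(votes.values())
    (top_label, top_count), *rest = counts.most_common()
    # A tie can occur with an even number of voters; report it explicitly.
    if rest and rest[0][1] == top_count:
        return None
    return top_label


# Hypothetical predictions for one unknown protein.
votes = {"ANN": 1, "SVM": 1, "kNN": 0, "NB": 1, "RF": 0}
print(consensus_label(votes))  # three of five classifiers vote 1
```

With five classifiers and two labels a tie cannot occur, so the `None` branch only matters if the set of classifiers is changed.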

Technical Details

PyPredT6 is a Python script that runs on Python 3.6 and above. It was written using WinPython on a 64-bit Windows 8.1 machine with 32 GB of RAM. PyPredT6 reads nucleotide and amino acid sequences from text files in FASTA format. The secondary structures of the proteins were extracted from their amino acid sequences using the online PaleAle 4.0 tool. The Python packages needed to run PyPredT6 are tkinter, time, random, pandas, numpy, csv, sklearn, keras, tensorflow, imblearn, collections and re.
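Before running the tool, it may help to confirm that the packages listed above can be imported. The following is a minimal sketch that uses only the standard library. The helper name `missing_packages` is hypothetical and is not part of PyPredT6.

```python
import importlib.util

# Import names as listed in the Technical Details section.
REQUIRED = ["tkinter", "time", "random", "pandas", "numpy", "csv",
            "sklearn", "keras", "tensorflow", "imblearn", "collections", "re"]


def missing_packages(names):
    """Return the top-level names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Note that some import names differ from their pip package names: sklearn is installed as scikit-learn and imblearn as imbalanced-learn.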

How to set up PyPredT6

Follow these steps to set up PyPredT6:
  1. Download PyPredT6 and its prerequisites from GitHub: PyPredT6 Package
  2. Click the tab "clone or download".
  3. In the folder you will have the following:
  4. Open WinPython-64bit-3.6.1.0Qt5 on Windows and open the application IDLEX (Python GUI).
  5. From the GUI, click "Open", which will open the file explorer.
  6. Find the file PyPredT6 (it should be in the same folder as WinPython) and click "Open" and then "Run". In the IDE, type "PyPredT6()" and press Enter. A file entry box will appear.
  7. Find the "Samples" folder in the PyPredT6 parent folder. Under the "Samples" folder, there are three subfolders, "sample1", "sample2" and "sample3". Each of these folders contains two files, "gene.txt" and "protein.txt". The "gene.txt" file contains the sample nucleotide sequences, while the "protein.txt" file contains the sample amino acid sequences, both in FASTA format. You can create files like these to predict effector proteins among your own sequences.
  8. In the field "Sample peptide file" (in FASTA format), give the path to the amino acid sequence file of the proteins to be predicted.
  9. In the field "Sample nucleotide file" (in FASTA format), give the path to the nucleotide sequence file corresponding to those protein sequences.
  10. In the field "Effector feature file" (in CSV format), give the path to the effector feature set file (featurefile_eff.csv) used for training.
  11. In the field "Non-effector feature file" (in CSV format), give the path to the non-effector feature set file (featurefile_noneff.csv) used for training.
  12. Once all fields are filled in, click "Predict" and wait for the results.
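When preparing your own input files for steps 7 to 9, a quick sanity check on the two FASTA files can catch formatting problems early. The sketch below is a minimal FASTA reader plus a check that both files hold the same number of records. The assumption that each protein record has exactly one matching nucleotide record is ours and is not stated above. The helper names `read_fasta` and `check_pair` are hypothetical.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_pair(protein_path, gene_path):
    """Return the record count if both files contain the same number."""
    proteins = read_fasta(protein_path)
    genes = read_fasta(gene_path)
    if len(proteins) != len(genes):
        raise ValueError(
            f"{len(proteins)} protein records but {len(genes)} gene records"
        )
    return len(proteins)
```

For example, `check_pair("Samples/sample1/protein.txt", "Samples/sample1/gene.txt")` would return the number of sequences in the first sample, assuming the folder layout described in step 7 and that the script is run from the PyPredT6 parent folder.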

Biological validation of prediction using PyPredT6

Files listing the putative effector proteins in Vibrio cholerae and Yersinia pestis, together with previously published articles that hint at their association with T6 secretion system-mediated pathogenesis, can be obtained here. The Excel files contain the protein names, UniProt entry IDs, UniProt entry names, gene names, Gene Ontology (GO) Biological Process, GO Cellular Component and GO Molecular Function evidence, and the respective DOIs.
  1. Vibrio cholerae
  2. Yersinia pestis