CQNR: Cluster Quality based Non-Reductional over-sampling technique and EPP3D: Effector Protein Predictor based on 3D structure of proteins

Rishika Sen, Somnath Tagore, Rajat K. De

CQNR - Class imbalance is common in classification problems and appears in many real-life domains, especially in biological datasets. It can be tackled either by reducing the number of samples in the majority class or by increasing the number of samples in the minority class. In this paper, we introduce the Cluster Quality based Non-Reductional over-sampling technique (CQNR). CQNR generates new samples in proportion to the distribution of samples of the minority classes, without eliminating any minority-class sample as noise.
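
As a rough illustration of the bookkeeping behind over-sampling (this is not the CQNR algorithm itself, whose clustering and sample-generation steps are not described in this README), the sketch below computes how many synthetic samples each minority class would need to match the majority class, and splits a budget of new samples across groups in proportion to their sizes. The function names and the largest-remainder rounding rule are assumptions made for this example only.

```python
from collections import Counter


def oversampling_deficits(labels):
    """Return, for each minority class, the number of samples it lacks
    compared with the largest class."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {cls: majority - n for cls, n in counts.items() if n < majority}


def allocate_proportionally(group_sizes, n_new):
    """Split n_new synthetic samples across groups in proportion to their
    sizes, using largest-remainder rounding (an illustrative choice)."""
    total = sum(group_sizes)
    if total == 0:
        raise ValueError("group sizes must not all be zero")
    quotas = [n_new * size / total for size in group_sizes]
    allocation = [int(q) for q in quotas]
    leftover = n_new - sum(allocation)
    # Hand the remaining samples to the groups with the largest fractional part.
    order = sorted(range(len(quotas)),
                   key=lambda i: quotas[i] - allocation[i],
                   reverse=True)
    for i in order[:leftover]:
        allocation[i] += 1
    return allocation


labels = ["non-effector"] * 90 + ["effector"] * 10
print(oversampling_deficits(labels))           # {'effector': 80}
print(allocate_proportionally([6, 3, 1], 80))  # [48, 24, 8]
```

The allocation always sums to the requested number of new samples, so the balanced dataset ends up with exactly the intended class sizes.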

EPP3D - Effector proteins of bacteria infect their hosts through dedicated machinery known as secretion systems. We have created a unique feature set of eight features, including convex hull layer count, surface atom composition, radius of gyration, packing density and compactness, derived from the 3D structure of experimentally verified effector proteins, for their classification. For the classification of pathogenic effector proteins, applying CQNR gave an improvement of approximately 5% over the other oversampling methods. Based on this feature set oversampled by CQNR, we have developed EPP3D, an effector protein predictor built as a consensus of classifiers using majority voting.
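
To make one of the structural features concrete, here is a minimal sketch of computing a radius of gyration from the ATOM records of a PDB file. It is an illustration under stated assumptions, not the code used in EPP3D.py: it treats all atoms as having equal mass, ignores HETATM records, and relies on the fixed-column PDB layout. The function names are hypothetical.

```python
import math


def read_atom_coordinates(pdb_lines):
    """Extract (x, y, z) tuples from ATOM records of a PDB file."""
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM"):
            # Columns 31-38, 39-46 and 47-54 hold x, y and z in angstroms.
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def radius_of_gyration(coords):
    """Unweighted radius of gyration: the root-mean-square distance of the
    atoms from their centroid."""
    if not coords:
        raise ValueError("no atom coordinates found")
    n = len(coords)
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    mean_sq = sum(
        sum((c[i] - centroid[i]) ** 2 for i in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)
```

For a file on disk, open it and pass the handle directly, for example `radius_of_gyration(read_atom_coordinates(handle))`. A mass-weighted variant would multiply each squared distance by the atomic mass and divide by the total mass.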

Technical Details

CQNR and EPP3D are both Python scripts that run on Python 3.6 and above. They were written using WinPython on a 64-bit Windows 8.1 machine with 32 GB of RAM. CQNR reads an unbalanced dataset in CSV (comma-separated) format. Both the unbalanced input and the balanced output feature sets have samples as rows and features as columns. The Python packages needed to run CQNR and EPP3D are tkinter, random, pandas, numpy, csv, sklearn, keras, tensorflow, collections, matplotlib, scipy, imblearn, decimal, warnings and re.
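
Before running either script, it can help to check which of the listed packages are importable in the current environment. The following is a small sketch using only the standard library; the function name is hypothetical. Note that sklearn and imblearn are import names; on PyPI the corresponding distributions are called scikit-learn and imbalanced-learn.

```python
import importlib.util

# Import names taken from the package list above.
REQUIRED = [
    "tkinter", "random", "pandas", "numpy", "csv", "sklearn", "keras",
    "tensorflow", "collections", "matplotlib", "scipy", "imblearn",
    "decimal", "warnings", "re",
]


def missing_packages(names):
    """Return the names that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing packages:", missing_packages(REQUIRED) or "none")
```

Several entries (random, csv, collections, decimal, warnings, re) are part of the Python standard library and need no installation.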

Downloads

Download CQNR, EPP3D and their prerequisites from this link: CQNR Package. Click the "clone or download" tab. Unzipping the package yields the following files and folders:
  1. cqnr_sample - contains the datasets, in CSV format, on which the performance of CQNR has been tested
  2. epp3d_sample_protein - contains the PDB files of the proteins (effectors and non-effectors) on which the performance of EPP3D has been tested
  3. CQNR.py - source code of CQNR in Python
  4. EPP3D.py - source code of EPP3D in Python
  5. all8features.csv - unbalanced eight-feature set of the effector and non-effector proteins
  6. training_epp3d.csv - balanced eight-feature set of the effector and non-effector proteins, on which EPP3D has to be trained

How to execute CQNR

Follow these steps to execute CQNR:
  1. Open WinPython-64bit-3.6.1.0Qt5 on Windows and open the application IDLEX (Python GUI).
  2. In the GUI, click "Open" to open the file explorer.
  3. Find the file CQNR.py (it should be in the same folder as WinPython), click "Open" and then "Run". In the IDE, type "CQNR()" and press Enter. A file entry box will appear.
  4. In the field "Imbalanced class data file location", enter the path of the imbalanced dataset in CSV format (sample datasets are in the folder cqnr_sample).
  5. In the field "Output location for the balanced data file", enter the path where the balanced dataset produced by the algorithm will be written.
  6. Fill the fields and click 'Oversample!'. Wait for the results.
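
One way to confirm that the run had the intended effect is to compare the class counts of the input and output files. The sketch below is a hedged example, not part of CQNR.py: it assumes the class label is stored in the last column and that the file has a header row, neither of which is stated in this README, so adjust `label_column` and `has_header` to match the actual files.

```python
import csv
from collections import Counter


def class_counts(csv_file, label_column=-1, has_header=True):
    """Count how many rows belong to each class in an open CSV file.

    label_column and has_header are assumptions about the file layout.
    """
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header row
    return Counter(row[label_column] for row in reader if row)
```

Open each file with `open(path, newline="")` and pass the handle to `class_counts`. For a balanced output file the counts of all classes should be equal or close to equal.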

How to execute EPP3D

Follow these steps to execute EPP3D:
  1. Open WinPython-64bit-3.6.1.0Qt5 on Windows and open the application IDLEX (Python GUI).
  2. In the GUI, click "Open" to open the file explorer.
  3. Find the file EPP3D.py, click "Open" and then "Run". In the IDE, type "EPP3D()" and press Enter. A file entry box will appear.
  4. In the field "Training dataset", enter the path of the training dataset (training_epp3d.csv), which is in CSV format.
  5. In the field "PDB file of protein to be predicted", enter the path of the PDB file of the unknown protein that needs to be classified (sample files are in the folder epp3d_sample_protein).
  6. Fill in the fields and click 'Predict!'. Wait for the results.
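
EPP3D is described above as a consensus of classifiers using majority voting. The sketch below shows the general idea of such a vote in a few lines. Which classifiers EPP3D combines, the label names, and how ties are resolved are not stated in this README, so the labels and the tie rule here (return None) are assumptions for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers, or None on a tie."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else None


# Hypothetical outputs of three classifiers for one protein.
print(majority_vote(["effector", "non-effector", "effector"]))  # effector
```

Using an odd number of classifiers for a two-class problem avoids ties altogether.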