Entry for: Bioinformatics Peer Prize III
Crop diseases are among the most important biological hazards, and they have challenged the sustainable development of agricultural production for many years. Every year, 42% of the global agricultural yield is destroyed by diseases. Bioinformatics techniques provide efficient methods to analyze and interpret raw biological data, which helps in studying the effect of a pathogen on a crop. Microarray gene expression data represents the expression levels of the genes of a cell (organism) maintained at a particular environmental condition. Hence significant gene prediction and pathogen-host interactions can be studied using gene expression data. Different machine learning techniques can be applied to extract the useful information represented by the candidate genes. The proposed approach consists of pre-processing of gene expression data, gene selection or feature extraction using a parallel approach, and classification. The feature selection methods Support Vector Machine with Recursive Feature Elimination (SVM-RFE), Minimum Redundancy Maximum Relevance (mRMR), Principal Component Analysis (PCA), Successive Feature Selection (SFS) and Independent Component Analysis (ICA) have been analysed for extracting candidate genes with biological significance for rice-related diseases. To deal with the computational complexity and the large volume of data, a combination of GPGPU computing and MapReduce programming on the Hadoop framework has been proposed. The experimental results show improved time efficiency in the feature extraction and classification processes.
The proposed work aims at extracting candidate genes and designing a multi-class classifier for predicting rice diseases. Further, the importance of genes in a specific disease is to be analysed. To handle the large volume of data, distributed and parallel processing techniques such as MapReduce and GPGPU computing can be used.
Selecting significant genes from the gene expression data involves the following steps:
a) Pre-processing of gene expression data extracted from public repositories, using Analysis of Variance (ANOVA).
b) Gene selection or feature extraction using the Support Vector Machine with Recursive Feature Elimination (SVM-RFE), Minimum Redundancy Maximum Relevance (mRMR), Principal Component Analysis (PCA), Successive Feature Selection (SFS) and Independent Component Analysis (ICA) approaches.
c) Classification of the data after feature selection by applying the Support Vector Machine algorithm with a polynomial kernel.
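Step (a) can be illustrated with a minimal sketch. It assumes that pre-processing ranks each gene by its one-way ANOVA F statistic across the disease classes and keeps the highest-scoring genes; the entry does not state the actual threshold or tooling used, so the function names, the `keep` parameter, the gene names and the expression values below are all hypothetical.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic for one gene across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand = mean(values)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_genes_by_anova(expr, labels, keep):
    """expr maps a gene name to its expression values, one per sample."""
    scores = {gene: anova_f(vals, labels) for gene, vals in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]

# Toy data (made up): 6 samples, 3 genes, 2 conditions.
labels = ["healthy", "healthy", "healthy", "blight", "blight", "blight"]
expr = {
    "gene_a": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],  # separates the classes strongly
    "gene_b": [2.0, 2.5, 1.5, 2.1, 2.4, 1.6],  # carries almost no class signal
    "gene_c": [0.5, 0.7, 0.6, 1.0, 1.2, 1.1],  # separates the classes moderately
}
selected = top_genes_by_anova(expr, labels, keep=2)
print(selected)
```

Because each gene is scored independently of the others, a filter of this kind is straightforward to distribute across workers, which is one reason it suits a pre-processing role ahead of the costlier selection methods in step (b).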
The parallel versions of SVM-RFE, mRMR, SFS, PCA and ICA running on CPU+GPU clusters improve the time efficiency of feature extraction. The training time of the SVM can also be reduced using Hadoop clusters.
Parallel versions of five feature extraction approaches, namely SVM-RFE, SFS, mRMR, PCA and ICA, have been implemented. All of the feature selection approaches except SVM-RFE show maximum accuracy with 20 features, so on average 20 candidate genes are sufficient for disease identification. SFS shows the highest accuracy; it checks all branches and follows a greedy approach. ICA is a statistical process focused on extracting independent components. It thus eliminates redundancy and improves relevance, so the accuracy of ICA is also good. ICA is followed by PCA, which extracts principal components using statistical measures. mRMR aims at eliminating redundant information while at the same time improving relevance, and hence it also performs well with 20 features. On average, the SFS, ICA, PCA, mRMR and SVM-RFE approaches give 66%, 56%, 50%, 31% and 18% better accuracy, respectively, than ANOVA, as only the relevant features are extracted.
Both the serial and parallel implementations of SFS take less time because blocks, rather than the entire data set, are considered and parallelized. Parallelizing PCA reduces the time complexity from cubic to quadratic. The mRMR approach is less time efficient, as its cost depends on both the sample size and the feature size. Good improvements in time efficiency are obtained by parallelizing PCA and SVM-RFE. For the MapReduce cluster, around 4 nodes are required for good performance, although this depends on the size of the data. On average, around 10 common genes are selected from among the top 200 genes.
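The trade-off that mRMR makes between relevance and redundancy can be shown with a small greedy sketch. It is only an illustration under stated assumptions: absolute Pearson correlation stands in for the mutual information that mRMR formulations normally use, the class label is encoded as 0/1, and the gene names and values are invented. It is not the parallel implementation described in this entry.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def mrmr(expr, target, k):
    """Greedily pick k genes, maximising relevance minus mean redundancy."""
    relevance = {g: abs(pearson(v, target)) for g, v in expr.items()}
    selected = []
    remaining = sorted(expr)  # sorted so that ties break deterministically

    def score(gene):
        if not selected:
            return relevance[gene]
        redundancy = sum(abs(pearson(expr[gene], expr[s])) for s in selected)
        return relevance[gene] - redundancy / len(selected)

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

target = [0, 0, 0, 1, 1, 1]  # hypothetical encoding: 0 = healthy, 1 = diseased
expr = {
    "gene_a": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],       # highly relevant
    "gene_a_copy": [2.0, 2.2, 1.8, 6.0, 6.2, 5.8],  # same signal as gene_a, so redundant
    "gene_d": [0.2, 0.9, 0.4, 1.0, 0.7, 1.5],       # less relevant, but adds new information
}
picked = mrmr(expr, target, k=2)
print(picked)  # the redundant copy is passed over in favour of gene_d
```

Each round recomputes the redundancy of every remaining gene against the genes already chosen, so the work grows with both the number of features and the number of samples.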
Bacterial leaf streak and bacterial blight disease pathogens have similar characteristics.
In SVM-RFE, among the top 500 genes selected, 53 genes were found to be common to bacterial leaf streak disease and bacterial blight disease, and among the top 100 genes, 7 genes were common to the two diseases. In mRMR, among the top 1538 genes selected for bacterial leaf streak disease and the 1771 genes selected for bacterial blight disease, 315 genes were found to be common to the two diseases, and 6 genes were common among the top 200 genes.
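The overlap counts above amount to intersecting the top-ranked genes of the two diseases. A minimal sketch follows, assuming each method produces one ranked gene list per disease; the function name and the gene identifiers are placeholders, not the genes reported in this work.

```python
def common_top_genes(ranking_a, ranking_b, top_k):
    """Genes found in the top_k of both rankings, kept in ranking_a order."""
    top_b = set(ranking_b[:top_k])
    return [gene for gene in ranking_a[:top_k] if gene in top_b]

# Hypothetical ranked lists, most significant gene first.
streak_rank = ["g7", "g2", "g9", "g5", "g1", "g4"]
blight_rank = ["g2", "g8", "g7", "g3", "g5", "g6"]

shared_top3 = common_top_genes(streak_rank, blight_rank, 3)
shared_top5 = common_top_genes(streak_rank, blight_rank, 5)
print(shared_top3, shared_top5)
```

Widening `top_k` can only add shared genes, never remove them, which matches the pattern of larger overlaps at larger cut-offs.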
This paper proposes the use of better feature extraction techniques, namely SVM-RFE, mRMR, SFS, PCA and ICA, to improve classification accuracy. By analyzing the results, an efficient gene selection method for finding differentially expressed genes, which helps to improve the performance of the classifier, can be chosen. The major challenges lie in obtaining a sufficient number of samples and in finding samples that are closely related to the exact biological conditions (disease and non-disease) using suitable feature extraction techniques. Feature reduction using ANOVA, followed by extraction of biologically significant features using SVM-RFE, mRMR, SFS, PCA and ICA, gives significant results. Classification accuracy is improved by between 10% and 66% when compared to ANOVA. The parallel versions of SVM-RFE, mRMR, SFS, PCA and ICA running on CPU+GPU clusters improve the time efficiency of feature extraction. The training time of the SVM can also be reduced using Hadoop clusters.
A limitation of SVM-RFE is that parallelization is achieved only in SVM training. As the process of feature elimination is recursive, it cannot be parallelized. MapReduce programming requires the data to be independent, as parallelism is achieved by splitting the data into blocks.
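The independence requirement can be made concrete with a single-process sketch of the map and reduce pattern. It assumes the quantity of interest is a per-gene mean that can be assembled from partial sums over sample blocks; the function names and data are hypothetical, and on a Hadoop cluster the map calls would run on separate nodes rather than in one Python process.

```python
from functools import reduce

def split_into_blocks(samples, block_size):
    """Split the sample list into consecutive, independent blocks."""
    return [samples[i:i + block_size] for i in range(0, len(samples), block_size)]

def map_block(block):
    """Map step: partial sample count and per-gene sums for one block."""
    n_genes = len(block[0])
    sums = [sum(sample[j] for sample in block) for j in range(n_genes)]
    return len(block), sums

def reduce_partials(left, right):
    """Reduce step: merge two partial results; the merge order does not matter."""
    n1, s1 = left
    n2, s2 = right
    return n1 + n2, [a + b for a, b in zip(s1, s2)]

# Toy data (made up): 5 samples with 2 genes each.
samples = [[1.0, 4.0], [2.0, 4.0], [3.0, 6.0], [4.0, 6.0], [5.0, 5.0]]
blocks = split_into_blocks(samples, block_size=2)

count, sums = reduce(reduce_partials, map(map_block, blocks))
means = [s / count for s in sums]
print(count, means)
```

No block needs the result of any other block, so the map calls can run in any order or at the same time. A recursive elimination loop lacks this property, because each round has to know which features survived the previous one.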
5. Future ideas/collaborators needed to advance research?
The work can be extended to study the applicability of the proposed approaches to analyzing gene connections for mental disorders such as schizophrenia, autism and bipolar disorder.
Construction of protein-protein interaction networks can verify the proteins common to the candidate genes and their interactions with the genes. This can validate the correctness of the proposed approaches. As Spark provides in-memory computation, it is suitable for machine learning applications. Hence the feature selection approaches can be tried on Spark.
As gene databases are heterogeneous, a common query format can be designed to query and upload information into these databases.
Dr. G. Sudha Sadasivam is a Professor in CSE at PSG College of Technology, Coimbatore, India. She has 22 years of teaching and research experience. Her areas of interest include parallel d...
Round: Peer Prize Round