International Journal of Bioinformatics and Biomedical Engineering
International Journal of Bioinformatics and Biomedical Engineering, Vol.1, No.1, Jul. 2015, Pub. Date: Jul. 9, 2015
Optimal Feature Subset Selection Using Similarity-Dissimilarity Index and Genetic Algorithms
Pages: 19-36
Authors
[01] Muhammad Arif, Department of Computer Science, College of Computer and Information Systems, Umm Al-Qura University, Makkah, Kingdom of Saudi Arabia.
Abstract
Optimal feature subset selection is an important pre-processing step for classification in many real-life problems where the feature space has a large number of dimensions and some features may be irrelevant or redundant. One example is gene expression profile data used to distinguish normal from cancerous samples. The contribution of this paper is fivefold. First, a similarity-dissimilarity index (MSDI) is proposed that estimates the class discrimination quality of a high-dimensional feature space without using any classifier. Second, a genetic algorithm framework is proposed that searches the n-dimensional feature space for the smallest subset of important features, using MSDI as the fitness function to evolve the population. Third, a similarity-dissimilarity plot is proposed to visualize high-dimensional data and to extract information about the class discrimination quality of the feature space. Fourth, MSDI can be used to predict the best classification accuracy achievable when an appropriate classifier is used. Fifth, another index, the average differential of similarity and dissimilarity distances above the similarity-dissimilarity line, is proposed; it indicates how far the instances or clusters of each class are from the other classes and how compact the classes are in the feature space. The effectiveness of the methods is demonstrated on a large set of benchmark cancer classification datasets, and the feature subset sizes and predicted classification accuracies are compared with published results.
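The classifier-free search described in the abstract can be illustrated with a minimal sketch. The exact definition of MSDI is not given here, so the fitness below is only a stand-in and an assumption: it counts the fraction of instances whose nearest other-class neighbour is farther away than their nearest same-class neighbour, and subtracts a small penalty for subset size. The function names `sd_fitness` and `ga_select`, the penalty weight, and the genetic operators (binary tournament, uniform crossover, bit-flip mutation, one elite) are hypothetical choices, not the paper's settings.

```python
import numpy as np

def sd_fitness(X, y, mask, penalty=0.01):
    """Stand-in fitness (NOT the paper's MSDI): fraction of instances whose
    nearest other-class neighbour is farther than the nearest same-class
    neighbour, minus a small penalty proportional to the subset size."""
    if not mask.any():
        return 0.0
    Xs = X[:, mask]
    # Pairwise Euclidean distances on the selected features only.
    diff = Xs[:, None, :] - Xs[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(D, np.inf)  # an instance is not its own neighbour
    same = y[:, None] == y[None, :]
    sim = np.where(same, D, np.inf).min(axis=1)   # nearest same-class distance
    dis = np.where(~same, D, np.inf).min(axis=1)  # nearest other-class distance
    return float(np.mean(dis > sim)) - penalty * mask.sum() / mask.size

def ga_select(X, y, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Evolve boolean feature masks; return (best_mask, best_fitness)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5
    for _ in range(generations):
        fit = np.array([sd_fitness(X, y, ind) for ind in pop])
        elite = pop[fit.argmax()].copy()
        # Binary tournament selection.
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((fit[a] >= fit[b])[:, None], pop[a], pop[b])
        # Uniform crossover with a shuffled partner.
        partners = parents[rng.permutation(pop_size)]
        cross = rng.random((pop_size, n)) < 0.5
        children = np.where(cross, parents, partners)
        # Bit-flip mutation, then re-insert the elite unchanged.
        children ^= rng.random((pop_size, n)) < p_mut
        children[0] = elite
        pop = children
    fit = np.array([sd_fitness(X, y, ind) for ind in pop])
    return pop[fit.argmax()], float(fit.max())
```

Calling `ga_select(X, y)` on a samples-by-features array `X` with class labels `y` returns a boolean mask of selected features and its fitness. The all-pairs distance matrix is practical only for small sample counts, which is typical of gene expression datasets but remains an assumption of this sketch.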
Keywords
Pattern Classification, Genetic Algorithm, Biomedical Datasets, Nearest Neighbor, Visualization