Platforms and Algorithms for Big Data Analytics

 

Abstract:

This is the era of Big Data. The total amount of digital data in the world is expected to double in less than two years. Big Data is driving radical changes in traditional data analysis platforms and algorithms. This tutorial consists of two parts: (i) big data platforms and their characteristics, and (ii) large-scale classification and clustering algorithms.

 

The first part will provide an in-depth analysis of the platforms available for studying and performing big data analytics. It will survey the hardware platforms available for big data analytics and assess the advantages and drawbacks of each based on metrics such as scalability, data I/O rate, fault tolerance, real-time processing, data size supported, and iterative task support. Using a star-ratings table, a rigorous qualitative comparison of the platforms is made on each of these six characteristics, which are critical for big data analytics algorithms. In addition to the hardware, the software frameworks used within each platform will be described in detail, along with their strengths and drawbacks. These characteristics can help the audience make an informed decision based on their computational needs.

 

The second part of the tutorial will cover big data classification and clustering algorithms. To provide more insight into the effectiveness of each platform in the context of big data analytics, implementation-level details of the widely used k-nearest neighbor and k-means clustering algorithms on various platforms will be described in the form of pseudocode. In addition, recent advances in large-scale linear classification and MapReduce-based classification algorithms will be discussed. In the context of clustering, some of the well-known one-pass clustering algorithms and other parallel and distributed clustering solutions will be briefly mentioned.
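
As a rough illustration of how k-means maps onto a MapReduce-style platform, the following Python sketch runs the map and reduce steps in a single process. It is not the tutorial's pseudocode. The function names (map_assign, reduce_update, kmeans), the list-of-partitions input, and the fixed iteration count are assumptions made for this example. Each iteration is one map-reduce round, which is why iterative task support matters when choosing a platform.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
# Per-centroid partial result: (coordinate-wise sum, number of points)
Partial = Dict[int, Tuple[List[float], int]]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(partition: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Map step (hypothetical name): assign each point in one data
    partition to its nearest centroid and keep only per-centroid sums
    and counts, so little data has to leave the partition."""
    partial: Partial = {}
    for point in partition:
        nearest = min(
            range(len(centroids)),
            key=lambda i: squared_distance(point, centroids[i]),
        )
        sums, count = partial.get(nearest, ([0.0] * len(point), 0))
        partial[nearest] = ([s + v for s, v in zip(sums, point)], count + 1)
    return partial


def reduce_update(partials: Sequence[Partial], centroids: Sequence[Point]) -> List[Point]:
    """Reduce step (hypothetical name): merge the partial sums from all
    partitions and recompute each centroid as the mean of its points."""
    totals: Partial = {}
    for partial in partials:
        for idx, (sums, count) in partial.items():
            old_sums, old_count = totals.get(idx, ([0.0] * len(sums), 0))
            totals[idx] = (
                [a + b for a, b in zip(old_sums, sums)],
                old_count + count,
            )
    updated: List[Point] = []
    for idx, centroid in enumerate(centroids):
        if idx in totals:
            sums, count = totals[idx]
            updated.append(tuple(s / count for s in sums))
        else:
            # A centroid with no assigned points is left where it was.
            updated.append(tuple(centroid))
    return updated


def kmeans(partitions: Sequence[Sequence[Point]],
           centroids: Sequence[Point],
           iterations: int = 10) -> List[Point]:
    """Run a fixed number of map-reduce rounds (no convergence test)."""
    current = [tuple(c) for c in centroids]
    for _ in range(iterations):
        partials = [map_assign(part, current) for part in partitions]
        current = reduce_update(partials, current)
    return current
```

On a real cluster, each call to map_assign would run on the node holding that partition, and only the small per-centroid sums and counts would be sent to the reducer. The single-process loop above stands in for that distribution.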

 

Tutorial Presented at the IEEE BigData 2015 Conference: PRESENTATION SLIDES

 

Source Codes and Installation Instructions

 

Comparison of Platforms:

 

Comparison Table

 

 

Relevant References:

1. Dilpreet Singh and Chandan K. Reddy, "A survey on platforms for big data analytics", Journal of Big Data, vol. 2, no. 8, pp. 1-20, October 2014. (The first part of the tutorial is primarily based on this survey paper.)

2. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. "Spark: cluster computing with working sets", In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pp. 10-10. 2010.

3. Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

4. John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips, "GPU computing", Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899, 2008.

5. Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun, "Map-reduce for machine learning on multicore", In NIPS, pp. 281-288, 2006.

6. Guo-Xun Yuan, C-H. Ho, and Chih-Jen Lin, "Recent advances of large-scale linear classification", Proceedings of the IEEE, vol. 100, no. 9, pp. 2584-2603, 2012.

7. Indranil Palit and Chandan K. Reddy, "Scalable and parallel boosting with MapReduce", IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 24, no. 10, pp. 1904-1916, October 2012.

8. Hanghang Tong and U. Kang, "Big Data Clustering", Book chapter in Data Clustering: Algorithms and Applications, Charu C. Aggarwal and Chandan K. Reddy (Eds.), Chapman & Hall/CRC Press, 2013.

Target Audience:

The target audience is researchers from both academia and industry, including graduate students, working in the fields of big data, data analytics, data mining, and machine learning. As a prerequisite, we expect the audience to be somewhat familiar with basic data mining concepts such as classification and clustering. Our immediate goal is to provide an overview of big data platforms and to educate the research community about platform characteristics and large-scale data mining algorithms. The ultimate goal is to bridge researchers and practitioners and to foster interdisciplinary work between the two groups. The tutorial should also appeal to researchers from the big data industry, as it covers many practical aspects of big data analytics. It is primarily targeted at researchers who are interested in analyzing large-scale data, who will become knowledgeable about the platforms and algorithms available for various kinds of analysis on such data.

Presenter BIO:

Chandan K. Reddy is an Associate Professor in the Department of Computer Science at Wayne State University. He received his Ph.D. from Cornell University and his M.S. from Michigan State University. He is the Director of the Data Mining and Knowledge Discovery Laboratory and a scientific member of the Karmanos Cancer Institute. His primary research interests are data mining and machine learning, with applications to healthcare informatics, social network analysis, and bioinformatics. His research is funded by the National Science Foundation, the National Institutes of Health, the Department of Transportation, and the Susan G. Komen for the Cure Foundation. He has published over 55 peer-reviewed articles in leading conferences and journals. He received the Best Application Paper Award at the ACM SIGKDD conference in 2010 and was a finalist in the INFORMS Franz Edelman Award Competition in 2011. He is a senior member of IEEE and a member of ACM.