Data Analytics at Scale
Course Composition and Objectives
- Broadly explain the challenges of data analytics at scale, the cyberinfrastructure (e.g., Hadoop, BDSA) and computational modeling approaches that address these challenges, and their applications to real-world problems
- Use, adapt, or develop a data analytics cyberinfrastructure to analyze heterogeneous interconnected data for one or more real-world problem domains using a high-level programming language (e.g., R, Java, Pig)
- Identify, formulate, and solve problems associated with data analytics at scale (e.g., data sparsity, very high dimensionality, causality analysis)
- Compare the strengths and weaknesses of alternative cyberinfrastructures and computational modeling approaches, and articulate the rationale for the choice, adaptation, and/or innovation made in designing and implementing a solution for data analytics at scale
- Instructor's Choice: Instructors may choose topics and learning objectives that meet the spirit of the course as defined here. Instructors may choose to devote more time to the learning objectives listed above or to add complementary objectives. Supplementary material and objectives should not overlap with the defined content of other courses in the curriculum.
Course Description
This course introduces principles, models, techniques, and cyberinfrastructures for storing, processing, retrieving, integrating, analyzing, mining, and linking large-scale heterogeneous information involving multiple types (including text and images) across multiple scales over temporal, spatial, and human dimensions. The course consists of four major modules. The first module introduces the cyberinfrastructure for data analytics at scale. Building on DS 210’s coverage of data models for data sciences, this module covers the cyberinfrastructures for data-intensive processing at scale and the associated distributed information storage systems. The second module introduces parallel programming and computing platforms that support computation-intensive data analytics at scale. The third module addresses techniques and tools for computational modeling from large-scale heterogeneous data, including text and images. Building on DS 310, this module also introduces machine learning methods designed for data mining at scale. The fourth module covers methods for addressing three important challenges and opportunities for data analytics at scale: data sparsity, causality analysis, and discovery informatics. The course includes a laboratory component that gives students hands-on experience in developing data analytics solutions using an existing cyberinfrastructure. The hands-on laboratory component of the course will also enable