Introduction to Data Science
Course Composition and Objectives
- Upon completion of the course, you should have gained first-hand experience with a mini data science project. More specifically, you will:
- Be able to design an exploratory data science project using tweets and assess its feasibility (using visualization tools).
- Be able to use the Twitter API to gather tweets of interest for the project
- Be able to use a tool (Weka) to analyze Twitter data
- Be able to use tools to visualize data and the models they generate
- Be able to generate a decision-tree predictive model for classifying tweets automatically using a tool (Weka)
- Be able to generate a probabilistic predictive model for classifying tweets automatically using a tool (Weka)
- Be able to evaluate and compare the performance of predictive models
- You will be able to understand and apply the following concepts related to exploratory data analysis:
- R – Representation
- I – Induction
- S – Search
- E – Evaluation
- You should also gain a conceptual understanding of some real-world applications, such as:
- The “Beer and Diaper” data mining story: the discovery of customer purchase patterns (discovery of frequent associations; exploratory data project; human behavior; conditional probability)
- Amazon product recommendations based on other customers’ reviews (similarity measures; collaborative filtering; recommendation systems)
- Google’s pre-processing of Web pages for its search engine
- Social media analytics
- Examples of data science applications in specific domains (e.g., health, social, security, life science).
- Instructor’s Choice: Instructors may choose topics and learning objectives that meet the spirit of the course as defined here. Instructors may choose to devote more time to the learning objectives listed above or to add complementary objectives. Supplementary material and objectives should not overlap with the defined content of other courses in the curriculum.
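To make the frequent-association and conditional-probability ideas behind the “Beer and Diaper” story listed above more concrete, here is a minimal sketch in Python. The baskets are invented for illustration and are not course data, and the helper names `support` and `confidence` are hypothetical choices rather than part of Weka or any course tool. The sketch estimates how often two items appear together and how likely one item is given the other.

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
]


def count_containing(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return count_containing(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from basket counts."""
    n_antecedent = count_containing(antecedent, baskets)
    if n_antecedent == 0:
        return 0.0  # the antecedent never occurs, so there is no evidence
    n_both = count_containing(set(antecedent) | set(consequent), baskets)
    return n_both / n_antecedent


if __name__ == "__main__":
    print(f"support(beer, diapers) = {support({'beer', 'diapers'}, baskets):.2f}")
    print(f"P(beer | diapers)      = {confidence({'diapers'}, {'beer'}, baskets):.2f}")
```

In this toy data, two of the five baskets contain both items, and two of the three baskets with diapers also contain beer. A rule such as "diapers implies beer" is usually considered interesting only when both its support and its confidence are reasonably high.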
Course Description
This course aims to achieve three goals:
- It will provide you with hands-on experience with a data science project, which will enable you to extract meaningful information (relevant to a question/hypothesis of interest to you) from a large Twitter dataset you gather.
- You will learn four key concepts regarding predictive modeling and exploratory data analysis: Representation, Induction, Simplification, and Evaluation (RISE). This understanding will provide you with a framework for relating theories (e.g., logic, probability) to practical methods that use these theories (e.g., decision-tree induction, Naive Bayes induction), and to their applications in data science, and in your data science project in particular.
- You will learn the broader landscape of Data Sciences:
- What global trends make Data Sciences important for our society?
- What are the “types” of data science projects, and how do they together form the journey of a data science initiative?
- What is the role of visual analytics in Data Sciences?
- What are the foundations of Data Sciences for innovating solutions for analyzing massive datasets?
- What should a Data Scientist know about data ethics?
- What is the role of domain-specific knowledge in Data Science projects?
While we may only be scratching the surface of these topics, they will be addressed and elaborated on in other courses throughout your Data Science education at Penn State. Together, I hope these goals help guide you as you start the exciting journey of becoming the “Next Generation Data Scientists”.