GRA 4153 Advanced Statistics and Alternative Data Types

Course code:

GRA 4153

Department:

Data Science and Analytics

Credits:

6

Course coordinator:

Adam Lee

Course name in Norwegian:

Advanced Statistics and Alternative Data Types

Product category:

Master

Portfolio:

MSc in Data Science for Business

Semester:

2023 Autumn

Active status:

Active

Level of study:

Master

Teaching language:

English

Course type:

One semester

Introduction

This course is in three parts:

1) The linear regression model in the cross-sectional context
2) The statistical analysis of time series data
3) Analysis of text data

In this course we will first review standard regression analysis from a statistical perspective. The course will then introduce time series and text data, along with the standard models used to analyse them. Both types of data pose challenges that do not arise with traditional cross-sectional data sets. For example, observations in a time series are typically dependent over time, so naïve application of statistical methods and machine learning techniques that assume independent observations is unlikely to work well. Text data is unstructured, meaning it cannot be handled directly by standard statistical and machine learning models. Accordingly, the course will cover how to apply standard pre-processing tools so that text can be used both in standard machine learning models and in models tailored for natural language processing.

Learning outcomes - Knowledge

By the end of the course, the student:

Is able to explain fundamental time series concepts such as stationarity, auto-correlation, unit roots, persistence, and out-of-sample forecasting.
Can utilise and explain basic time series and natural language processing models.
Is able to explain the motivation for different textual pre-processing steps and how they are conducted.

Learning outcomes - Skills

By the end of the course, the student:

Can apply basic time series modelling techniques (e.g. fitting an appropriate ARMA model) to time series data.
Can apply simple natural language processing models, e.g. to analyse the sentiment of a text.
Can choose among, and critically evaluate, different modelling options when working with either time series or textual data.

General Competence

By the end of the course, the student:

Will be able to apply basic time series models and think critically about statistical inference when working with this type of data.
Will be familiar with how to work with text as data and basic natural language processing.

Course content

The linear regression model (recap w/cross-sectional data)

Standard (distributional) assumptions
Bias/variance trade-off
Regularization
Prediction
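
As a rough illustration of how regularization trades bias for variance, the sketch below fits a one-predictor regression without an intercept, using the closed-form ridge estimate. The function name `ridge_slope`, the simulated data, and the penalty value are assumptions chosen for illustration; they are not taken from the course materials.

```python
import random


def ridge_slope(x, y, lam):
    """Closed-form ridge estimate for y = beta * x + noise (no intercept).

    Minimises sum((y - beta * x)^2) + lam * beta^2.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Simulated data (hypothetical): true slope 2.0, standard normal noise.
random.seed(0)
true_beta = 2.0
x = [random.gauss(0, 1) for _ in range(50)]
y = [true_beta * xi + random.gauss(0, 1) for xi in x]

ols = ridge_slope(x, y, lam=0.0)     # no penalty: ordinary least squares
ridge = ridge_slope(x, y, lam=10.0)  # positive penalty: shrunk estimate

print("OLS slope:  ", ols)
print("Ridge slope:", ridge)
```

With `lam=0.0` the estimate is ordinary least squares; any positive penalty shrinks it towards zero, which introduces bias but reduces the variance of the estimate across samples.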

Time series data

Stationarity and autocorrelation
Fundamental time series processes
- Random Walk
- ARMA models
Estimation and inference
Out-of-sample forecasting
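
The concepts listed above can be sketched with the Python standard library alone. The example simulates a stationary AR(1) process and a random walk, computes sample autocorrelations, estimates the AR(1) coefficient by least squares on a training window, and produces one-step-ahead out-of-sample forecasts. All function names, seeds, and sample sizes are hypothetical choices for illustration; in practice a dedicated time series library would normally be used.

```python
import random


def simulate_ar1(phi, n, seed):
    """Simulate y_t = phi * y_{t-1} + e_t with standard normal shocks."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0, 1))
    return y


def autocorr(y, lag):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    num = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n))
    return num / denom


def fit_ar1(y):
    """Least-squares slope of y_t on y_{t-1} (no intercept)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den


stationary = simulate_ar1(0.5, 500, seed=1)   # |phi| < 1: stationary
random_walk = simulate_ar1(1.0, 500, seed=1)  # phi = 1: unit root

print("lag-1 autocorrelation, AR(1):      ", autocorr(stationary, 1))
print("lag-1 autocorrelation, random walk:", autocorr(random_walk, 1))

# Out-of-sample evaluation: the split respects time order (no shuffling).
train, test = stationary[:400], stationary[400:]
phi_hat = fit_ar1(train)

# One-step-ahead forecasts: each uses the observed previous value.
forecasts = [phi_hat * prev for prev in stationary[399:499]]
mse = sum((f - a) ** 2 for f, a in zip(forecasts, test)) / len(test)

print("estimated phi:", phi_hat)
print("out-of-sample MSE:", mse)
```

The coefficient is estimated on the first 400 observations only, so the forecast errors on the remaining 100 are genuinely out of sample.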

Natural language processing

Tokenization, converting text to numbers
- An important step of using text as data is to convert raw text into numbers that can be used in statistical models or machine learning models. You will learn several ways to do this.
Textual pre-processing – how to extract signal rather than noise
- The student will learn how to use concepts like stop words, lemmatization, tf-idf weighting, and embeddings to extract meaningful content from text data.
Simple models for text data (and the limits of linear regression)
- Examples may include: classifying review scores (books, movies, restaurants), separating spam from regular mail, extracting topics, or deriving sentiment scores.
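
To make the steps above concrete, here is a minimal sketch of tokenization, stop-word removal, and tf-idf weighting using only the Python standard library. The stop-word list, the example documents, and the function names are hypothetical. The weighting uses the plain textbook form, term frequency times log(N / document frequency); library implementations typically use smoothed variants. Lemmatization and embeddings are not covered here.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "is", "was", "and", "this"}


def tokenize(text):
    """Lower-case the text, split it into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def tf_idf(docs):
    """Return (vocabulary, matrix) with one tf-idf row per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        matrix.append(
            [(counts[tok] / total) * math.log(n_docs / df[tok]) for tok in vocab]
        )
    return vocab, matrix


docs = [
    "The movie was great and the acting was great.",
    "This movie was terrible.",
    "A great book.",
]
vocab, matrix = tf_idf(docs)

print(vocab)
for row in matrix:
    print([round(value, 3) for value in row])
```

Each document becomes a numeric vector over the shared vocabulary, which can then be passed to a standard statistical or machine learning model. Words that appear in many documents receive lower weights than words concentrated in a few.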

Teaching and learning activities

The learning activities will consist of 2/3 synchronous lectures and 1/3 asynchronous learning activities. The asynchronous activities will include (i) reading provided notes and assigned materials to prepare for the lectures and (ii) solving theoretical and practical exercises. Students are expected to prepare for the lectures by reading the assigned materials, solving the assigned exercises, and participating actively in the discussion of the lecture topics.

Software: Python.

Software tools

Software defined under the section "Teaching and learning activities".

Additional information

Please note that while attendance is not compulsory in all courses, it is the student’s own responsibility to obtain any information provided in class.

All parts of the assessment must be passed in order to get a grade in the course.

The weighting for the exams for this course has been changed, starting academic year 2023/2024. It is not possible to retake the old version of the exam. Please note new exam codes in the Exam section of the course description.

Qualifications

All courses in the Masters programme will assume that students have fulfilled the admission requirements for the programme. In addition, courses in second, third and/or fourth semester can have specific prerequisites and will assume that students have followed normal study progression. For double degree and exchange students, please note that equivalent courses are accepted.

Disclaimer

Deviations in teaching and exams may occur if external conditions or unforeseen events call for this.

Required prerequisite knowledge

Programming in Python.

Knowledge of probability & linear algebra at the level of the course FORK 10CC Preparatory Course in Mathematics for Data Science.

Assessments

Exam category: Submission
Form of assessment: Written submission
Weight: 30
Grouping: Group/Individual (1 - 3)
Duration: 1 Month(s)
Comment: Project
Exam code: GRA 41533
Grading scale: ECTS
Resit: Examination when next scheduled course

Exam category: Submission
Form of assessment: Written submission
Weight: 70
Grouping: Individual
Duration: 1 Week(s)
Comment: Individual take home (final) exam
Exam code: GRA 41534
Grading scale: ECTS
Resit: Examination when next scheduled course

Type of Assessment:

Ordinary examination
All exams must be passed to get a grade in this course.

Total weight:

100

Sum workload:

A course of 1 ECTS credit corresponds to a workload of 26-30 hours. Therefore a course of 6 ECTS credits corresponds to a workload of at least 160 hours.

Link to reading list

Reading list
