GRA 4153 Advanced Statistics and Alternative Data Types

Course code: GRA 4153
Department: Data Science and Analytics
Credits: 6
Course coordinator: Adam Lee
Course name in Norwegian: Advanced Statistics and Alternative Data Types
Product category: Master
Portfolio: MSc in Data Science for Business
Semester: 2023 Autumn
Active status: Active
Level of study: Master
Teaching language: English
Course type: One semester

Introduction

This course is in three parts:

1) The linear regression model in the cross-sectional context
2) The statistical analysis of time series data
3) Analysis of text data

In this course we first review standard regression analysis from a statistical perspective. The course then introduces time series and text data, together with the standard models used to analyse such data. Both types of data come with their own challenges compared with traditional cross-sectional data sets. For example, time series observations are typically dependent over time, so naïve application of statistical methods and machine learning techniques that assume independent observations is unlikely to work well. Text data is unstructured, meaning it cannot be handled directly by standard statistical and machine learning models. Accordingly, the course covers how to apply standard pre-processing tools so that text can be used both in standard machine learning models and in models tailored for natural language processing.

Learning outcomes - Knowledge

By the end of the course, the student:

  • Is able to explain fundamental time series concepts such as stationarity, auto-correlation, unit roots, persistence, and out-of-sample forecasting.
  • Can utilise and explain basic time series and natural language processing models.
  • Is able to explain the motivation for different textual pre-processing steps and how they are conducted.
Learning outcomes - Skills

By the end of the course, the student:

  • Can apply basic time series modelling techniques (e.g. fitting an appropriate ARMA model) to time series data.
  • Can apply simple natural language processing models, for example to analyse the sentiment of a text.
  • Can choose among, and critically evaluate, different modelling options when working with either time series or textual data.
General Competence

By the end of the course, the student:

  • Will be able to apply basic time series models and think critically about statistical inference when working with this type of data.
  • Will be familiar with how to work with text as data and basic natural language processing.

Course content

The linear regression model (recap w/cross-sectional data)

  • Standard (distributional) assumptions
  • Bias/variance trade-off
  • Regularization
  • Prediction
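
As a rough illustration of the regularization and bias/variance topics listed above, the following minimal sketch computes a ridge estimate for a single predictor without an intercept. The function name `ridge_slope`, the simulated data, and the penalty values are assumptions made for this example only and are not taken from the course material.

```python
import random


def ridge_slope(x, y, penalty):
    """Closed-form ridge estimate for one predictor and no intercept.

    With penalty=0 this is the ordinary least squares slope.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + penalty)


# Simulated cross-sectional data with a true slope of 2 (illustrative only).
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(50)]
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

ols = ridge_slope(x, y, penalty=0.0)
shrunk = ridge_slope(x, y, penalty=25.0)

# The penalised estimate is pulled towards zero: more bias, less variance.
print(f"OLS slope: {ols:.3f}, ridge slope: {shrunk:.3f}")
```

A positive penalty always shrinks the estimate towards zero; how much shrinkage helps prediction depends on the data and is usually chosen by validation.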

Time series data

  • Stationarity and autocorrelation
  • Fundamental time series processes
    • Random walk
    • ARMA models
  • Estimation and inference
  • Out-of-sample forecasting
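
The time series topics above can be sketched with a simulated AR(1) process, the simplest member of the ARMA family. The example below is a minimal sketch under stated assumptions: the helper names (`simulate_ar1`, `autocorr`, `fit_ar1`), the coefficient 0.8, and the sample split are all hypothetical choices for illustration, and the benchmark forecast uses the fact that the simulated process has mean zero by construction.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Simulate y_t = phi * y_{t-1} + e_t with standard normal shocks."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y


def autocorr(y, lag=1):
    """Sample autocorrelation of y at the given lag."""
    mean = sum(y) / len(y)
    num = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, len(y)))
    den = sum((v - mean) ** 2 for v in y)
    return num / den


def fit_ar1(y):
    """Least squares slope of y_t on y_{t-1}, without an intercept."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den


series = simulate_ar1(phi=0.8, n=2000)
train, test = series[:1000], series[1000:]

# Estimate the coefficient on the training sample only.
phi_hat = fit_ar1(train)

# One-step-ahead out-of-sample forecasts: each uses the previous observation.
lagged = series[999:-1]
forecasts = [phi_hat * prev for prev in lagged]

mse_ar = sum((a - f) ** 2 for a, f in zip(test, forecasts)) / len(test)
# Benchmark: always forecast the unconditional mean (zero by construction).
mse_mean = sum(a ** 2 for a in test) / len(test)

print(f"lag-1 autocorrelation: {autocorr(series):.3f}")
print(f"estimated phi: {phi_hat:.3f}")
print(f"out-of-sample MSE, AR(1): {mse_ar:.3f}, mean forecast: {mse_mean:.3f}")
```

Splitting the sample by time, rather than at random, keeps the evaluation honest: every forecast uses only information that was available before the value being predicted.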

Natural language processing

  • Tokenization, converting text to numbers
    • An important step of using text as data is to convert raw text into numbers that can be used in statistical models or machine learning models. You will learn several ways to do this.
  • Textual pre-processing – how to extract signal rather than noise
    • The student will learn how to use concepts like stop words, lemmatization, tf-idf weighting, and embeddings to extract meaningful content from text data.
  • Simple models for text data (and the limits of linear regression)
    • Examples may include: classifying review scores (books, movies, restaurants), separating spam from regular mail, extracting topics, or deriving sentiment scores.
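
To make the text-to-numbers steps above concrete, here is a minimal sketch of tokenization, stop-word removal, and tf-idf weighting using only the Python standard library. The stop-word list, the toy reviews, and the helper names `tokenize` and `tf_idf` are assumptions made for this example. The weighting used is one simple variant (relative term frequency times log of inverse document frequency); common libraries apply smoothed versions that give different numbers.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "and", "was", "this"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


reviews = [
    "The movie was great and the acting was great",
    "The movie was boring",
    "This restaurant is great",
]
weights = tf_idf(reviews)

for review, w in zip(reviews, weights):
    print(review, "->", {term: round(value, 3) for term, value in w.items()})
```

In the second review, "boring" receives a higher weight than "movie" because it appears in fewer documents, which is the sense in which tf-idf emphasises signal over common words.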
Teaching and learning activities

The course combines synchronous lectures (2/3) with asynchronous learning activities (1/3). The asynchronous activities include (i) reading provided notes and assigned materials to prepare for the lectures and (ii) solving theoretical and practical exercises. Students are expected to prepare for the lectures by reading the assigned materials and solving the assigned exercises, and to participate actively in the discussion of the lecture topics.

Software: Python.

Software tools
The software is specified under the section "Teaching and learning activities".
Additional information

Please note that while attendance is not compulsory in all courses, it is the student’s own responsibility to obtain any information provided in class.

All parts of the assessment must be passed in order to get a grade in the course.

The weighting for the exams for this course has been changed, starting academic year 2023/2024. It is not possible to retake the old version of the exam. Please note new exam codes in the Exam section of the course description. 

Qualifications

All courses in the Master's programme assume that students have fulfilled the admission requirements for the programme. In addition, courses in the second, third and/or fourth semester can have specific prerequisites and assume that students have followed normal study progression. For double degree and exchange students, please note that equivalent courses are accepted.

Disclaimer

Deviations in teaching and exams may occur if external conditions or unforeseen events call for this.

Required prerequisite knowledge

Programming in Python.

Knowledge of probability and linear algebra at the level of the course FORK 10CC Preparatory Course in Mathematics for Data Science.

Assessments
Exam category: Submission
Form of assessment: Written submission
Weight: 30
Grouping: Group/Individual (1 - 3)
Duration: 1 month
Comment: Project.
Exam code: GRA 41533
Grading scale: ECTS
Resit: Examination when next scheduled course

Exam category: Submission
Form of assessment: Written submission
Weight: 70
Grouping: Individual
Duration: 1 week
Comment: Individual take home (final) exam
Exam code: GRA 41534
Grading scale: ECTS
Resit: Examination when next scheduled course

Type of Assessment: Ordinary examination
All exams must be passed to get a grade in this course.
Total weight: 100
Sum workload: 0

A course of 1 ECTS credit corresponds to a workload of 26-30 hours. Therefore a course of 6 ECTS credits corresponds to a workload of at least 160 hours.