GRA 4153 Advanced Statistics and Alternative Data Types
This course is in three parts:
1)  The linear regression model in the cross-sectional context
2)  The statistical analysis of time series data
3)  Analysis of text data
In this course we will first review standard regression analysis from a statistical perspective. The course will then introduce time series and text data and the standard models used to analyse such data. Both types of data come with their own challenges compared with traditional cross-sectional data sets. For example, time series observations are typically dependent over time, so naïve application of statistical methods and machine learning techniques to such data is unlikely to work well. Text data is unstructured, meaning it cannot be handled directly by standard statistical and machine learning models. Accordingly, the course will cover how to apply standard preprocessing tools so that text can be used both in standard machine learning models and in models tailored for natural language processing.
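The dependence point can be made concrete with a small simulation. The sketch below is an illustration only, not taken from the course material: it draws independent noise, accumulates it into a random walk, and compares the lag-1 sample autocorrelation of the two series. The function name `lag1_autocorrelation`, the seed, and the sample size are all assumptions chosen for the example.

```python
import random


def lag1_autocorrelation(series):
    """Sample autocorrelation between each observation and the previous one."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t - 1] - mean) for t in range(1, n)
    )
    return numerator / denominator


rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Independent draws: what many standard methods implicitly assume.
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Random walk: each value is the previous value plus a new shock.
walk = []
level = 0.0
for shock in noise:
    level += shock
    walk.append(level)

acf_noise = lag1_autocorrelation(noise)
acf_walk = lag1_autocorrelation(walk)
print(f"lag-1 autocorrelation, independent noise: {acf_noise:.3f}")
print(f"lag-1 autocorrelation, random walk:       {acf_walk:.3f}")
```

The noise series should show autocorrelation near zero, while the random walk should show autocorrelation close to one. Methods that treat the random walk's observations as independent would therefore misjudge how much information the sample contains.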
By the end of the course, the student:
 Is able to explain fundamental time series concepts such as stationarity, autocorrelation, unit roots, persistence, and out-of-sample forecasting.
 Can utilise and explain basic time series and natural language processing models.
 Is able to explain the motivation for different textual preprocessing steps and how they are conducted.
By the end of the course, the student:
 Can apply basic time series modelling techniques (e.g. fitting an appropriate ARMA model) to time series data.
 Can apply simple natural language processing models to e.g. analyse the sentiment of a text.
 Can choose among, and critically evaluate, different modelling options when working with either time series or textual data.
By the end of the course, the student:
 Will be able to apply basic time series models and think critically about statistical inference when working with this type of data.
 Will be familiar with how to work with text as data and with basic natural language processing.
The linear regression model (recap with cross-sectional data)
 Standard (distributional) assumptions
 Bias/variance trade-off
 Regularization
 Prediction
Time series data
 Stationarity and autocorrelation
 Fundamental time series processes
 Random Walk
 ARMA models
 Estimation and inference
 Out-of-sample forecasting
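To give a flavour of the estimation and forecasting topics listed above, the following is a minimal sketch, assuming a simulated AR(1) process (the simplest ARMA model). It estimates the model by ordinary least squares on a training window and compares one-step-ahead out-of-sample forecasts with a naive forecast that always predicts the training mean. The function names (`simulate_ar1`, `fit_ar1`), the true coefficient 0.8, and the 700/300 split are assumptions made for this illustration, not part of the course description.

```python
import random


def simulate_ar1(phi, n, rng):
    """Simulate y_t = phi * y_{t-1} + e_t with standard normal shocks."""
    series = []
    previous = 0.0
    for _ in range(n):
        previous = phi * previous + rng.gauss(0.0, 1.0)
        series.append(previous)
    return series


def fit_ar1(series):
    """OLS estimates (intercept, phi) for y_t = c + phi * y_{t-1} + e_t."""
    x = series[:-1]  # lagged values
    y = series[1:]   # current values
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    covariance = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    variance = sum((a - x_mean) ** 2 for a in x)
    phi_hat = covariance / variance
    return y_mean - phi_hat * x_mean, phi_hat


rng = random.Random(1)  # fixed seed for reproducibility
series = simulate_ar1(phi=0.8, n=1000, rng=rng)

# Split in time order: never shuffle time series before splitting.
train, test = series[:700], series[700:]
intercept, phi_hat = fit_ar1(train)
train_mean = sum(train) / len(train)

# One-step-ahead forecasts: each test value is predicted from the value
# just before it, with parameters fixed at their training estimates.
lagged = series[699:999]
ar_errors = [
    actual - (intercept + phi_hat * prev) for prev, actual in zip(lagged, test)
]
mean_errors = [actual - train_mean for actual in test]

mse_ar = sum(e ** 2 for e in ar_errors) / len(ar_errors)
mse_mean = sum(e ** 2 for e in mean_errors) / len(mean_errors)
print(f"estimated phi: {phi_hat:.3f}")
print(f"out-of-sample MSE, AR(1) forecast: {mse_ar:.3f}")
print(f"out-of-sample MSE, mean forecast:  {mse_mean:.3f}")
```

The key design choice is the chronological split: the model is evaluated only on observations that come after the data used to estimate it, which is what distinguishes out-of-sample forecasting from in-sample fit.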
Natural language processing
 Tokenization, converting text to numbers
 An important step of using text as data is to convert raw text into numbers that can be used in statistical models or machine learning models. You will learn several ways to do this.
 Textual preprocessing – how to extract signal rather than noise
 The student will learn how to use concepts like stop words, lemmatization, tf-idf weighting, and embeddings to extract meaningful content from text data.
 Simple models for text data (and the limits of linear regression)
 Examples may include classifying review scores (books, movies, restaurants), separating spam from ordinary mail, extracting topics, or deriving sentiment scores.
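As a rough illustration of the text-to-numbers step described in the natural language processing topics above, the sketch below tokenizes three made-up review sentences, removes a tiny hand-picked stop-word list, and computes tf-idf weights using only the Python standard library. The reviews, the stop-word list, and the function names (`tokenize`, `tfidf`) are all assumptions for this example. The weighting is the plain textbook form (term frequency times log inverse document frequency); library implementations often use smoothed or normalised variants, and lemmatization and embeddings are not shown here.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "is", "was", "and", "this", "it"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf(documents):
    """Return (vocabulary, matrix) with one row of tf-idf weights per document."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = []
        for word in vocab:
            tf = counts[word] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[word])
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix


reviews = [
    "The movie was great and the acting was great",
    "The movie was boring",
    "This restaurant is great",
]
vocab, matrix = tfidf(reviews)

print(vocab)
for row in matrix:
    print([round(weight, 3) for weight in row])
```

Each review becomes a numeric row that a standard model can consume. Words appearing in only one review (such as "boring") receive a higher weight than words shared across reviews (such as "movie"), which is the sense in which tf-idf emphasises signal over noise.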
The learning activities will combine lectures (synchronous, 2/3) and asynchronous learning activities (1/3). The asynchronous activities will include (i) reading provided notes and assigned materials to prepare for the lectures and (ii) solving theoretical and practical exercises. Students are expected to prepare for the lectures by reading the assigned materials, solving the assigned exercises, and participating actively in the discussion of the lecture topics.
Software: Python.
Please note that while attendance is not compulsory in all courses, it is the student’s own responsibility to obtain any information provided in class.
All parts of the assessment must be passed in order to get a grade in the course.
The weighting for the exams for this course has been changed, starting academic year 2023/2024. It is not possible to retake the old version of the exam. Please note new exam codes in the Exam section of the course description.
All courses in the Master's programme will assume that students have fulfilled the admission requirements for the programme. In addition, courses in the second, third and/or fourth semester can have specific prerequisites and will assume that students have followed normal study progression. For double degree and exchange students, please note that equivalent courses are accepted.
Disclaimer
Deviations in teaching and exams may occur if external conditions or unforeseen events call for this.
Programming in Python.
Knowledge of probability & linear algebra at the level of the course FORK 10CC Preparatory Course in Mathematics for Data Science.
Exam category | Weight | Invigilation | Duration | Grouping | Comment exam
Exam category: Submission; Form of assessment: Written submission; Exam code: GRA 41533; Grading scale: ECTS; Grading rules: Internal examiner; Resit: Examination when next scheduled course | 30 | No | 1 Month(s) | Group/Individual (1 - 3) | Project
Exam category: Submission; Form of assessment: Written submission; Exam code: GRA 41534; Grading scale: ECTS; Grading rules: Internal examiner; Resit: Examination when next scheduled course | 70 | No | 1 Week(s) | Individual | Individual take home (final) exam
All exams must be passed to get a grade in this course.
A course of 1 ECTS credit corresponds to a workload of 26-30 hours. Therefore, a course of 6 ECTS credits corresponds to a workload of at least 160 hours.