GRA 4153 Advanced Statistics and Alternative Data Types
This course is in three parts:
1) The linear regression model in the cross-sectional context
2) The statistical analysis of time series data
3) Analysis of text data
In this course we will first review standard regression analysis from a statistical perspective. The course will then introduce time series data and text data, together with standard models used to analyse them. Both types of data pose challenges that traditional cross-sectional data sets do not. For example, observations in a time series are typically dependent over time, so naïve application of statistical methods and machine learning techniques that assume independent observations is unlikely to work well. Text data is unstructured, meaning it cannot be handled directly by standard statistical and machine learning models. Accordingly, the course will cover how to apply standard pre-processing tools so that text can be used both in standard machine learning models and in models tailored for natural language processing.
By the end of the course, the student:
- Is able to explain fundamental time series concepts such as stationarity, auto-correlation, unit roots, persistence, and out-of-sample forecasting.
- Can utilise and explain basic time series and natural language processing models.
- Is able to explain the motivation for different textual pre-processing steps and how they are conducted.
By the end of the course, the student:
- Can apply basic time series modelling techniques (e.g. fitting an appropriate ARMA model) to time series data.
- Can apply simple natural language processing models to e.g. analyse the sentiment of a text.
- Can choose among, and critically evaluate, different modelling options when working with either time series or textual data.
By the end of the course, the student:
- Will be able to apply basic time series models and think critically about statistical inference when working with this type of data.
- Will be familiar with how to work with text as data and with basic natural language processing.
The linear regression model (recap w/cross-sectional data)
- Standard (distributional) assumptions
- Bias/variance trade-off
- Regularization
- Prediction
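As an illustration of the regularization topic above, the following is a minimal sketch of how a ridge penalty shrinks a regression slope. It assumes a single predictor, centred data, and an unpenalised intercept. The data values and the function name `ridge_slope` are made up for illustration and are not course material.

```python
def ridge_slope(x, y, lam):
    """Slope minimising sum((y_c - b * x_c)**2) + lam * b**2 on centred data."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centred cross-product and sum of squares.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    # lam = 0 gives the ordinary least squares slope; larger lam shrinks it towards 0.
    return s_xy / (s_xx + lam)


# Hypothetical toy data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda = {lam:6.1f}  slope = {ridge_slope(x, y, lam):.3f}")
```

A larger penalty pulls the slope towards zero, which introduces bias in exchange for lower variance. This is the bias/variance trade-off listed above.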
Time series data
- Stationarity and autocorrelation
- Fundamental time series processes
- Random Walk
- ARMA models
- Estimation and inference
- Out-of-sample forecasting
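To make the difference between stationary and unit-root processes concrete, here is a small simulation sketch using only the Python standard library. The helper names, the seed, the sample size, and the AR coefficient 0.5 are illustrative assumptions, not course material.

```python
import random


def sample_autocorr(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return cov / var


def simulate_ar1(phi, n, rng):
    """Simulate x_t = phi * x_{t-1} + e_t with standard normal shocks, x_0 = 0."""
    series, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        series.append(prev)
    return series


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 2000

white_noise = simulate_ar1(0.0, n, rng)  # no dependence over time
ar1 = simulate_ar1(0.5, n, rng)          # stationary, moderately persistent
random_walk = simulate_ar1(1.0, n, rng)  # unit root, not stationary

# First-differencing a random walk recovers its shocks.
differenced = [random_walk[t] - random_walk[t - 1] for t in range(1, n)]

for name, series in [
    ("white noise", white_noise),
    ("AR(1), phi = 0.5", ar1),
    ("random walk", random_walk),
    ("differenced random walk", differenced),
]:
    print(f"{name:25s} lag-1 autocorrelation = {sample_autocorr(series, 1):.3f}")
```

The random walk shows a lag-1 autocorrelation close to one, while its first difference behaves like white noise. This is why differencing is a common first step before fitting an ARMA model to a persistent series.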
Natural language processing
- Tokenization, converting text to numbers
- An important step of using text as data is to convert raw text into numbers that can be used in statistical models or machine learning models. You will learn several ways to do this.
- Textual pre-processing – how to extract signal rather than noise
- The student will learn how to use concepts like stop words, lemmatization, tf-idf weighting, and embeddings to extract meaningful content from text data.
- Simple models for text data (and the limits of linear regression)
- Examples may include: classifying review scores (books, movies, restaurants), separating spam from regular mail, extracting topics, or deriving sentiment scores.
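The pre-processing steps listed above (tokenization, stop word removal, tf-idf weighting) can be sketched in plain Python as follows. The stop word list, the example documents, and the helper names are hypothetical. The weighting shown is one common textbook variant (term frequency times log inverse document frequency), and library implementations often use smoothed versions that give different numbers.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "and", "was", "this"}


def tokenize(text):
    """Lower-case the text, split it into alphabetic tokens, drop stop words."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOP_WORDS]


def tf_idf(docs):
    """Return (vocabulary, matrix) with one row of tf-idf weights per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vocab = sorted(doc_freq)
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        row = [
            (counts[term] / total) * math.log(n_docs / doc_freq[term]) if total else 0.0
            for term in vocab
        ]
        matrix.append(row)
    return vocab, matrix


docs = [
    "The movie was great and the acting was great",
    "The movie was boring",
    "This restaurant is great",
]

vocab, matrix = tf_idf(docs)
print(vocab)
for row in matrix:
    print([round(weight, 3) for weight in row])
```

Each document becomes a numeric vector of equal length, so the result can be passed to a standard model such as a regression or a classifier. Words that appear in many documents receive a lower weight than words that are specific to a few.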
The learning activities will consist of 2/3 synchronous lectures and 1/3 asynchronous learning activities. The asynchronous activities will include (i) reading provided notes and assigned materials to prepare for the lectures and (ii) solving theoretical and practical exercises. Students are expected to prepare for the lectures by reading the assigned materials and solving the assigned exercises, and to participate actively in the discussion of the lecture topics.
Software: Python.
Please note that while attendance is not compulsory in all courses, it is the student’s own responsibility to obtain any information provided in class.
All parts of the assessment must be passed in order to get a grade in the course.
The weighting for the exams for this course has been changed, starting academic year 2023/2024. It is not possible to retake the old version of the exam. Please note new exam codes in the Exam section of the course description.
All courses in the Masters programme will assume that students have fulfilled the admission requirements for the programme. In addition, courses in second, third and/or fourth semester can have specific prerequisites and will assume that students have followed normal study progression. For double degree and exchange students, please note that equivalent courses are accepted.
Disclaimer
Deviations in teaching and exams may occur if external conditions or unforeseen events call for this.
Programming in Python.
Knowledge of probability & linear algebra at the level of the course FORK 10CC Preparatory Course in Mathematics for Data Science.
Assessments

| Exam category | Form of assessment | Weight | Grouping | Duration | Comment | Exam code | Grading scale | Resit |
|---|---|---|---|---|---|---|---|---|
| Submission | Written submission | 30 | Group/Individual (1 - 3) | 1 Month(s) | Project. | GRA 41533 | ECTS | Examination when next scheduled course |
| Submission | Written submission | 70 | Individual | 1 Week(s) | Individual take home (final) exam | GRA 41534 | ECTS | Examination when next scheduled course |
All exams must be passed to get a grade in this course.
A course of 1 ECTS credit corresponds to a workload of 26-30 hours. Therefore a course of 6 ECTS credits corresponds to a workload of at least 160 hours.