GRA 4153 Advanced Statistics and Alternative Data Types

Course code: 
GRA 4153
Department: 
Data Science and Analytics
Credits: 
6
Course coordinator: 
Adam Lee
Course name in Norwegian: 
Advanced Statistics and Alternative Data Types
Product category: 
Master
Portfolio: 
MSc in Data Science for Business
Semester: 
2022 Autumn
Active status: 
Active
Level of study: 
Master
Teaching language: 
English
Course type: 
One semester
Introduction

When learning about statistics, most students are exposed to cross-sectional data, i.e., numerical observations of many entities at a single point in time. However, most data also have a time dimension, and most of the data in the world is textual rather than numerical. As such, time series and text are important data types, frequently used in many different industries, ranging from finance and economics to HR and marketing.

In this course we will first cover important principles for standard regression analysis from both a statistical and a machine learning perspective, including distributional assumptions, bias-variance trade-offs, cross-validation and resampling. We will then introduce how to work with time series and text data in this setting. Both types of data come with their own challenges compared to traditional cross-sectional data sets. For example, text data is unstructured, meaning that it cannot be handled directly by standard statistical and machine learning models. Accordingly, you will learn how to apply standard pre-processing tools so that text can be used both in standard machine learning models and in models tailored for natural language processing. Similarly, data with time dependence comes with its own set of challenges, and naïve application of statistical methods and machine learning techniques to time-dependent data is unlikely to work well.

Learning outcomes - Knowledge

By the end of the course, the student:

  • Is able to explain fundamental time series concepts such as stationarity, autocorrelation, unit roots, persistence, and out-of-sample forecasting.
  • Is able to explain the motivation for different textual pre-processing steps and how they are conducted.
  • Can identify the appropriate steps needed before using either time series or textual data in actual statistical or machine learning models.
  • Can exemplify and explain basic time series and natural language processing models.
Learning outcomes - Skills

By the end of the course, the student:

  • Can apply basic time series processes, such as the Random Walk or ARMA model, for actual time series modelling.
  • Can apply simple natural language processing models to derive, e.g., what the text is about and its overall sentiment.
  • Can choose among different data transformations, and critically evaluate how they affect statistical inference at a later stage, when working with either time series or textual data.
General Competence

By the end of the course, the student:

  • Is able to apply basic time series models and think critically about statistical inference when working with this type of data.
  • Is familiar with how to work with text as data and with basic natural language processing.
Course content

The linear regression model (recap w/cross-sectional data)

  • Standard (distributional) assumptions
  • Bias/variance trade-off
  • Regularization
  • Cross validation and in-sample/out-of-sample predictions
  • Resampling
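
As an illustration of how regularization and cross-validation fit together, the following is a minimal sketch and not official course material. It assumes NumPy is available, uses simulated data, and the helper names `ridge_fit` and `kfold_mse` are hypothetical. It fits a ridge regression in closed form and compares penalty strengths by their average out-of-sample error across folds.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge estimate (X'X + alpha*I)^{-1} X'y, without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def kfold_mse(X, y, alpha, k=5, seed=0):
    """Average out-of-sample mean squared error over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train_idx], y[train_idx], alpha)
        residuals = y[test_idx] - X[test_idx] @ beta
        errors.append(np.mean(residuals ** 2))
    return float(np.mean(errors))

# Simulated data (an assumption for illustration): 10 predictors,
# of which only the first two affect the outcome.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 10))
true_beta = np.zeros(10)
true_beta[:2] = [2.0, -1.0]
y = X @ true_beta + rng.normal(size=60)

# Compare penalty strengths by cross-validated error; alpha=0 is plain OLS.
cv_results = {alpha: kfold_mse(X, y, alpha) for alpha in (0.0, 1.0, 10.0, 100.0)}
best_alpha = min(cv_results, key=cv_results.get)
for alpha, mse in cv_results.items():
    print(f"alpha={alpha:>6}: CV MSE = {mse:.3f}")
print("selected alpha:", best_alpha)
```

A larger `alpha` shrinks the coefficients towards zero, which lowers variance at the cost of bias; cross-validation is one way to choose where on that trade-off to sit.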

Time series data

  • Stationarity and autocorrelation
  • Unit roots and tests
  • Fundamental time series processes
    • Random Walk
    • Autoregressive model
  • Estimation and HAC corrections
  • Persistence and impulse response functions
  • Out-of-sample forecasting
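
To make the time series topics above concrete, here is a small sketch under stated assumptions: NumPy is available, the data are simulated, and the helper names `simulate_ar1`, `autocorr` and `fit_ar1` are hypothetical. It contrasts a stationary AR(1) process with a random walk (the unit-root case) and produces one-step-ahead out-of-sample forecasts.

```python
import numpy as np

def simulate_ar1(phi, n, seed=0):
    """Simulate y_t = phi * y_{t-1} + e_t with standard normal shocks and y_0 = 0."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + shocks[t]
    return y

def autocorr(y, lag=1):
    """Sample autocorrelation at the given lag."""
    d = y - y.mean()
    return float(np.sum(d[lag:] * d[:-lag]) / np.sum(d ** 2))

def fit_ar1(y):
    """OLS estimate of phi from regressing y_t on y_{t-1} (no intercept)."""
    return float(y[:-1] @ y[1:] / (y[:-1] @ y[:-1]))

stationary = simulate_ar1(phi=0.5, n=500)   # stationary AR(1)
random_walk = simulate_ar1(phi=1.0, n=500)  # phi = 1 gives a random walk

print("lag-1 autocorrelation, AR(1):      ", round(autocorr(stationary), 3))
print("lag-1 autocorrelation, random walk:", round(autocorr(random_walk), 3))

# Out-of-sample exercise: estimate phi on the first 400 observations only,
# then forecast each of the remaining 100 one step ahead.
train, test = stationary[:400], stationary[400:]
phi_hat = fit_ar1(train)
previous = np.concatenate([train[-1:], test[:-1]])  # y_{t-1} for each test point
forecast_mse = float(np.mean((test - phi_hat * previous) ** 2))
print("estimated phi:", round(phi_hat, 3), "| forecast MSE:", round(forecast_mse, 3))
```

Splitting by time rather than at random is the key design choice: shuffling observations before splitting would let information from the future leak into the estimation sample.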

Natural language processing

  • Tokenization, converting text to numbers
    • An important step of using text as data is to convert raw text into numbers that can be used in statistical models or machine learning models. You will learn several ways to do this.
  • Textual pre-processing – how to extract signal rather than noise
    • The student will learn how to use concepts like stop words, lemmatization, tf-idf weighting, and embeddings to extract meaningful content from text data.
  • Simple models for text data (and the limits of linear regression)
    • Examples may include: classifying review scores (books, movies, restaurants), separating spam from regular mail, extracting topics, or deriving sentiment scores.
  • Distributional assumptions when working with generative text data models
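
The first two steps above, turning text into numbers and down-weighting uninformative words, can be sketched with the standard library alone. This is a minimal illustration, not the course's own implementation: the stop-word list is a tiny made-up set, the helper names `tokenize` and `tfidf` are hypothetical, and the unsmoothed tf-idf formula used here is only one of several common variants.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "this", "it"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tfidf(documents):
    """Return the sorted vocabulary and one tf-idf vector per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            counts[w] / len(tokens) * math.log(n_docs / doc_freq[w]) if tokens else 0.0
            for w in vocabulary
        ])
    return vocabulary, vectors

reviews = [
    "The movie was great and the acting was great",
    "The movie was boring",
    "This restaurant is great",
]
vocabulary, vectors = tfidf(reviews)
print(vocabulary)
for row in vectors:
    print([round(value, 3) for value in row])
```

Each review becomes a numeric vector of the same length, so it can be passed to a standard regression or classification model. Words that appear in many documents receive a low weight, and a word that appears in every document receives a weight of zero under this particular formula.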
Teaching and learning activities

The learning activities will consist of 2/3 lectures and 1/3 case discussions and videos. Students are expected to prepare for the lectures by reading assigned materials and to participate actively in the discussion of the lecture topics.

Software: Python.

Software tools
Software defined under the section "Teaching and learning activities".
Additional information

Please note that while attendance is not compulsory in all courses, it is the student’s own responsibility to obtain any information provided in class.

All parts of the assessment must be passed in order to get a grade in the course.

Qualifications

All courses in the Master's programme will assume that students have fulfilled the admission requirements for the programme. In addition, courses in the second, third and/or fourth semester can have specific prerequisites and will assume that students have followed normal study progression. For double degree and exchange students, please note that equivalent courses are accepted.

Disclaimer

Deviations in teaching and exams may occur if external conditions or unforeseen events call for this.

Required prerequisite knowledge

Programming in Python.

Assessments
Exam category: 
Submission
Form of assessment: 
Written submission
Weight: 
25
Grouping: 
Group/Individual (1 - 3)
Duration: 
1 Month(s)
Comment: 
Project.
Exam code: 
GRA 41531
Grading scale: 
ECTS
Resit: 
Examination when next scheduled course
Exam category: 
Submission
Form of assessment: 
Written submission
Invigilation
Weight: 
75
Grouping: 
Individual
Support materials: 
  • All printed and handwritten support materials
  • BI-approved exam calculator
  • Simple calculator
  • Bilingual dictionary
Duration: 
5 Hour(s)
Comment: 
-
Exam code: 
GRA 41532
Grading scale: 
ECTS
Resit: 
Examination when next scheduled course
Type of Assessment: 
Ordinary examination
All exams must be passed to get a grade in this course.
Total weight: 
100
Sum workload: 
0

A course of 1 ECTS credit corresponds to a workload of 26-30 hours. Therefore a course of 6 ECTS credits corresponds to a workload of at least 160 hours.