GRA 4157 (Big) Data Curation, Pipelines and Management

Course code: 
GRA 4157
Department: 
Data Science and Analytics
Credits: 
6
Course coordinator: 
Magdalena Ivanovska
Course name in Norwegian: 
(Big) Data Curation, Pipelines and Management
Product category: 
Master
Portfolio: 
MSc in Data Science for Business
Semester: 
2023 Autumn
Active status: 
Active
Level of study: 
Master
Teaching language: 
English
Course type: 
One semester
Introduction

To gain consistent benefits from machine learning models in business, it is essential to move data science projects from experimentation to production by building automated machine learning pipelines. A standard machine learning pipeline consists of data preparation, model training, model evaluation and validation. The tasks of the automated pipeline range from collecting real-time streaming data to model and output management. 
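The stages named above (data preparation, model training, evaluation and validation) can be sketched with scikit-learn, one of the packages used in the course. This is only a minimal illustration, not course material: the synthetic data set, the choice of median imputation and scaling as preparation steps, and the logistic regression model are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real business data set (assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Data preparation and model training chained into one object.
pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),  # fill missing values
        ("scale", StandardScaler()),                   # standardise features
        ("model", LogisticRegression(max_iter=1000)),  # illustrative model
    ]
)

# Validation: 5-fold cross-validation on the training split.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Evaluation: fit on the training split, score on held-out data.
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)

print("Cross-validation accuracy per fold:", cv_scores)
print("Held-out accuracy:", test_accuracy)
```

Keeping the preparation steps inside the pipeline means they are re-fitted on each training fold, so no information from the validation fold leaks into the preprocessing.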

In this course, you will learn the life cycle of a data science project and the responsibilities of different roles in a data science team. You will also learn how to build an efficient end-to-end data science project and get hands-on experience programming in Python. Particular focus will be put on what is typically the most time-consuming part, namely data curation, cleaning, and management, including different database infrastructures and SQL-style queries.

Learning outcomes - Knowledge

By the end of the course, the student:

  • Can explain the responsibilities of different roles in a data science team.
  • Can identify and describe the important components of an efficient automated machine learning pipeline.
  • Can compare and contrast various data preparation/cleaning procedures, and knows how these might affect estimation output at a later stage.
  • Demonstrates a good understanding of different database infrastructures, such as relational and non-relational databases, and can adequately compare and contrast how they fit different situations.
  • Is able to understand a few basic machine learning algorithms.
Learning outcomes - Skills

By the end of the course, the student:

  • Can effectively use components of Python, in particular the packages NumPy, pandas, matplotlib, SciPy and scikit-learn, for data science applications.
  • Can build, deploy, and manage machine learning pipelines in an efficient manner, including cleaning, transforming, merging and reshaping of data.
  • Can apply a number of data curation techniques to handle, e.g., missing numbers and outliers in the data.
  • Is skilled in reading and writing data from various database infrastructures (SQL, MongoDB).
  • Is able to apply machine learning algorithms to real data sets to gain insight.
General Competence

By the end of the course, the student:

  • Understands the end-to-end process of building a data science project, and knows how to build efficient workflows.
  • Understands, through replicating real-world situations, the challenges faced by data scientists and how to solve them.
Course content

The course moves towards advanced Python programming, with a particular focus on data science project life cycles. This includes:

  • Understanding business problems.
  • Under data curation, collection, and preprocessing the course will cover:
    • Reading and writing data “manually” to files in Python.
    • Handling missing values, outliers, and other data anomalies.
    • Database infrastructures (including SQL and MongoDB), and exploring and curating data with pandas.
    • Writing analysis queries that answer business questions.
    • Creating training data using queries.
  • Feature engineering.
  • Model building and deployment, including machine learning. 
  • Roles and responsibilities in data science projects.
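As an illustration of the data curation topics listed above, the sketch below imputes a missing value, filters an outlier, and answers a simple business question with an SQL query. It is an assumption-laden example rather than course material: the sales table, its column names, the median imputation, and the 1.5 × IQR outlier rule are all hypothetical choices, and an in-memory SQLite database stands in for a real database server.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical sales records with one missing value and one obvious outlier.
raw = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4, 5, 6],
        "region": ["north", "south", "north", "south", "north", "south"],
        "amount": [120.0, 95.0, np.nan, 110.0, 105.0, 9000.0],
    }
)

# Missing values: fill with the column median (one of several options).
clean = raw.copy()
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Outliers: keep rows within 1.5 * IQR of the quartiles.
q1, q3 = clean["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = clean[clean["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Write the curated data to a relational database and query it with SQL.
conn = sqlite3.connect(":memory:")
clean.to_sql("sales", conn, index=False)
summary = pd.read_sql_query(
    """
    SELECT region, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
    FROM sales
    GROUP BY region
    ORDER BY region
    """,
    conn,
)
conn.close()

print(summary)
```

The same `read_sql_query` pattern can be used to build a training table directly from a query, so that the feature selection lives in SQL and the modelling lives in Python.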
Teaching and learning activities

The learning activities consist of two-thirds lectures and one-third group projects. The first part focuses on learning the skills of data curation and machine learning workflows. In the second part, students work in groups to complete a data science project, taking on different roles in the team (product manager, domain expert, data engineer, data scientist). The project tasks include accessing open datasets in databases, building machine learning workflows, and presenting results.

Software tools
Python, pandas, NumPy, matplotlib, scikit-learn. 

Additional information

Please note that while attendance is not compulsory in all courses, it is the student’s own responsibility to obtain any information provided in class.

All parts of the assessment must be passed in order to get a grade in the course.

Starting in the academic year 2023/2024, the weighting of the two assessment components has changed. It is not possible to retake the old version of the exam. Please note the new exam codes in the Exam section of the course description.

Qualifications

All courses in the Master's programme assume that students have fulfilled the admission requirements for the programme. In addition, courses in the second, third and/or fourth semester can have specific prerequisites and assume that students have followed normal study progression. For double degree and exchange students, please note that equivalent courses are accepted.

Disclaimer

Deviations in teaching and exams may occur if external conditions or unforeseen events call for this.

Required prerequisite knowledge

Courses in programming (preferably Python), relational databases, and machine learning, or similar courses.

Assessments
Exam category: 
Submission
Form of assessment: 
Written submission
Invigilation
Weight: 
40
Grouping: 
Individual
Support materials: 
  • Bilingual dictionary
Duration: 
2 Hour(s)
Exam code: 
GRA 41573
Grading scale: 
ECTS
Resit: 
Examination when next scheduled course
Exam category: 
Submission
Form of assessment: 
Handin - all file types
Weight: 
60
Grouping: 
Group/Individual (1 - 3)
Duration: 
1 Month(s)
Comment: 
The written submission (report) is based on 2 group presentations during the semester:
1. Data curation and machine learning workflow design.
2. The process of building a machine learning pipeline, deploying and managing models.

Students get feedback on their presentations from peers and teachers. Feedback can be used to write this report.
Exam code: 
GRA 41574
Grading scale: 
ECTS
Resit: 
Examination when next scheduled course
Type of Assessment: 
Ordinary examination
All exams must be passed to get a grade in this course.
Total weight: 
100
Student workload
Activity | Duration | Comment
Group work / Assignments | 100 Hour(s) |
Prepare for teaching | 12 Hour(s) |
Examination | 32 Hour(s) |
Teaching | 36 Hour(s) | There are two hours of synchronous and one hour of asynchronous teaching each week.
Sum workload: 
180

A course of 1 ECTS credit corresponds to a workload of 26-30 hours. Therefore, a course of 6 ECTS credits corresponds to a workload of at least 160 hours.