GRA 4157 (Big) Data Curation, Pipelines and Management

Course code:

GRA 4157

Department:

Data Science and Analytics

Credits:

6 ECTS

Course coordinator:

Magdalena Ivanovska

Course name in Norwegian:

(Big) Data Curation, Pipelines and Management

Product category:

Master

Portfolio:

MSc in Data Science for Business

Semester:

2023 Autumn

Active status:

Active

Level of study:

Master

Teaching language:

English

Course type:

One semester

Introduction

To gain consistent benefits from machine learning models in business, it is essential to move data science projects from experimentation to production by building automated machine learning pipelines. A standard machine learning pipeline consists of data preparation, model training, model evaluation and validation. The tasks of the automated pipeline range from collecting real-time streaming data to model and output management.

In this course, you will learn the life cycle of a data science project and the responsibilities of different roles in a data science team. You will also learn how to build an efficient end-to-end data science project and get hands-on experience programming in Python. Particular focus will be put on what is typically the most time-consuming part, namely data curation, cleaning, and management, including different database infrastructures and SQL-style queries.

Learning outcomes - Knowledge

By the end of the course, the student:

Can explain the responsibilities of different roles in a data science team.
Can identify and describe the important components of an efficient automated machine learning pipeline.
Can compare and contrast various data preparation/cleaning procedures, and knows how these might affect estimation output at a later stage.
Demonstrates a good understanding of different database infrastructures, such as relational and non-relational databases, and can adequately compare and contrast how they fit different situations.
Is able to understand a few basic machine learning algorithms.

Learning outcomes - Skills

By the end of the course, the student:

Can effectively use components of Python, in particular the packages NumPy, pandas, matplotlib, SciPy and scikit-learn, for data science applications.
Can build, deploy, and manage machine learning pipelines in an efficient manner, including cleaning, transforming, merging and reshaping of data.
Can apply a number of data curation techniques to handle, e.g., missing numbers and outliers in the data.
Is skilled in reading and writing data from various database infrastructures (SQL, MongoDB).
Is able to apply machine learning algorithms to real data sets to gain insight.

General Competence

By the end of the course, the student:

Understands the end-to-end process of building a data science project, and knows how to build efficient workflows.
By replicating real-world situations, students can understand what challenges they will face as data scientists and how to solve them.

Course content

The course moves towards advanced Python programming, with a particular focus on data science project life cycles. This includes:

Understanding business problems.
Data curation, collection, and preprocessing, where the course will cover:
- Reading and writing data “manually” to files in Python.
- Handling missing values, outliers, and other data anomalies.
- Exploring and curating data from database infrastructures (including SQL and MongoDB) with pandas.
- Writing analysis queries that answer business questions.
- Creating training data using queries.
Feature engineering.
Model building and deployment, including machine learning.
Roles and responsibilities in data science projects.
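
As an illustration of the data curation topics listed above, the following is a minimal sketch, not course material. It assumes a hypothetical `customers` table with made-up values, held in an in-memory SQLite database, and uses only the Python standard library rather than pandas. It selects training rows with a SQL query, flags outliers with the common 1.5 × IQR rule, and fills missing values with the column median; other imputation and outlier rules are equally valid choices.

```python
import sqlite3
import statistics

# Hypothetical toy data: (customer_id, age, monthly_spend); None marks a missing value.
ROWS = [
    (1, 34, 120.0),
    (2, None, 95.0),
    (3, 45, 110.0),
    (4, 29, 5000.0),  # suspiciously large spend
    (5, 52, None),
    (6, 41, 105.0),
    (7, 38, 130.0),
]


def load_training_rows(conn):
    """Use a SQL query to select the columns needed for training."""
    cur = conn.execute(
        "SELECT customer_id, age, monthly_spend FROM customers ORDER BY customer_id"
    )
    return [list(row) for row in cur.fetchall()]


def impute_median(rows, col):
    """Replace missing values (None) in one column with the column median."""
    observed = [r[col] for r in rows if r[col] is not None]
    median = statistics.median(observed)
    for r in rows:
        if r[col] is None:
            r[col] = median
    return median


def iqr_outlier_ids(rows, col, k=1.5):
    """Return ids of rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[col] for r in rows if r[col] is not None]
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r[0] for r in rows if r[col] is not None and not (low <= r[col] <= high)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, age INTEGER, monthly_spend REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", ROWS)

rows = load_training_rows(conn)
outliers = iqr_outlier_ids(rows, col=2)  # flag outliers before imputing
age_median = impute_median(rows, col=1)
spend_median = impute_median(rows, col=2)
conn.close()
```

With this toy data, customer 4 is the only row flagged as an outlier, and the two missing values are replaced by the medians of the observed ages and spends. Flagging outliers before imputing keeps the filled-in values from influencing the outlier thresholds.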

Teaching and learning activities

The learning activities will combine 2/3 lectures and 1/3 group projects. The first part will focus on learning the skills of data curation and machine learning workflow. In the second part, students need to work in groups to complete a data science project. They will play different roles in the team (product manager, domain expert, data engineer, data scientist). The tasks of the project include accessing open datasets in databases, building machine learning workflows, and presenting results.

Software tools
Python, pandas, NumPy, matplotlib, scikit-learn.

Software tools

Software defined under the section "Teaching and learning activities".

Additional information

Please note that while attendance is not compulsory in all courses, it is the student’s own responsibility to obtain any information provided in class.

All parts of the assessment must be passed in order to get a grade in the course.

Starting in the academic year 2023/2024, the weighting of the two assessment components has changed. It is not possible to retake the old version of the exam. Please note the new exam codes in the Exam section of the course description.

Qualifications

All courses in the Master's programme will assume that students have fulfilled the admission requirements for the programme. In addition, courses in the second, third and/or fourth semester can have specific prerequisites and will assume that students have followed normal study progression. For double degree and exchange students, please note that equivalent courses are accepted.

Disclaimer

Deviations in teaching and exams may occur if external conditions or unforeseen events call for this.

Required prerequisite knowledge

Courses in programming (preferably Python), relational databases, and machine learning, or similar courses.

Assessments

Exam category:

Submission

Form of assessment:

Written submission

Invigilation

Weight:

40

Grouping:

Individual

Support materials:

Bilingual dictionary

Duration:

2 Hour(s)

Exam code:

GRA 41573

Grading scale:

ECTS

Resit:

Examination when next scheduled course

Exam category:

Submission

Form of assessment:

Handin - all file types

Weight:

60

Grouping:

Group/Individual (1 - 3)

Duration:

1 Month(s)

Comment:

The written submission (report) is based on 2 group presentations during the semester:
1. Data curation and machine learning workflow design.
2. The process of building a machine learning pipeline, deploying and managing models.

Students get feedback on their presentations from peers and teachers. Feedback can be used to write this report.

Exam code:

GRA 41574

Grading scale:

ECTS

Resit:

Examination when next scheduled course

Type of Assessment:

Ordinary examination
All exams must be passed to get a grade in this course.

Total weight:

100

Student workload

Activity	Duration	Comment
Group work / Assignments	100 Hour(s)
Prepare for teaching	12 Hour(s)
Examination	32 Hour(s)
Teaching	36 Hour(s)	There are two hours of synchronous and one hour of asynchronous teaching each week.

Sum workload:

180

A course of 1 ECTS credit corresponds to a workload of 26-30 hours. Therefore a course of 6 ECTS credits corresponds to a workload of at least 160 hours.
