GRA 4157 (Big) Data Curation, Pipelines and Management
To gain consistent benefits from machine learning models in business, it is essential to move data science projects from experimentation to production by building automated machine learning pipelines. A standard machine learning pipeline consists of data preparation, model training, model evaluation and validation. The tasks of the automated pipeline range from collecting real-time streaming data to model and output management.
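As a rough illustration of the stages named above (data preparation, model training, and model evaluation), the following is a minimal sketch using scikit-learn. The synthetic data, the step names, and the choice of a median imputer and logistic regression are assumptions made for the example, not part of the course material.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 200 rows, 3 features,
# and a binary label that depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # inject a few missing values

# Hold out a test set for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Data preparation and model training chained into one object.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardise features
    ("model", LogisticRegression()),               # simple classifier
])
pipeline.fit(X_train, y_train)

# Model evaluation on data the pipeline has not seen.
accuracy = pipeline.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Chaining the steps in one `Pipeline` object means the same preparation is applied at training and prediction time, which is one reason pipelines are easier to move into production than ad hoc notebooks.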
In this course, you will learn the life cycle of a data science project and the responsibilities of the different roles in a data science team. You will also learn how to build an efficient end-to-end data science project and get hands-on experience programming in Python. Particular focus will be put on what is typically the most time-consuming part, namely data curation, cleaning, and management, including different database infrastructures and SQL-style queries.
By the end of the course, the student:
- Can explain the responsibilities of different roles in a data science team.
- Can identify and describe the important components of an efficient automated machine learning pipeline.
- Can compare and contrast various data preparation/cleaning procedures, and knows how these might affect estimation output at a later stage.
- Demonstrates a good understanding of different database infrastructures, such as relational and non-relational databases, and can adequately compare and contrast how they fit different situations.
- Is able to understand a few basic machine learning algorithms.
By the end of the course, the student:
- Can effectively use components of Python, in particular the packages NumPy, pandas, matplotlib, SciPy and scikit-learn, for data science applications.
- Can build, deploy, and manage machine learning pipelines in an efficient manner, including cleaning, transforming, merging and reshaping of data.
- Can apply a number of data curation techniques to handle, e.g., missing values and outliers in the data.
- Is skilled in reading and writing data from various database infrastructures (SQL, MongoDB).
- Is able to apply machine learning algorithms to real data sets for insight.
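To make the data curation skills in the list above more concrete, here is a minimal pandas sketch of handling a missing value and flagging an outlier. The toy `income` column, the median fill, and the 1.5 × IQR rule are assumptions chosen for illustration; the course may use other techniques.

```python
import pandas as pd

# Hypothetical toy data: one missing value and one obvious outlier.
df = pd.DataFrame({"income": [32.0, 35.0, None, 31.0, 34.0, 400.0]})

# Fill the missing value with the median, which is less sensitive
# to the outlier than the mean would be.
median_income = df["income"].median()
df["income_filled"] = df["income"].fillna(median_income)

# Flag outliers with the 1.5 * IQR rule (one common convention).
q1 = df["income_filled"].quantile(0.25)
q3 = df["income_filled"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["income_filled"].between(lower, upper)

# Keep only the rows that were not flagged.
cleaned = df.loc[~df["is_outlier"], "income_filled"]
print(df)
```

Whether to drop, cap, or keep a flagged value is a judgment call that depends on the business problem; the flag column keeps that decision explicit and reversible.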
By the end of the course, the student:
- Understands the end-to-end process of building a data science project, and knows how to build efficient workflows.
- Understands, by working through replicated real-world situations, which challenges data scientists face and how to solve them.
The course will move towards advanced Python programming with a particular focus on data science project life cycles. This includes:
- Understanding business problems.
- Data curation, collection, and preprocessing, where the course will cover:
  - Reading and writing data “manually” to files in Python.
  - Handling missing values, outliers, and other data anomalies.
  - Database infrastructures, and exploring and curating data with pandas (including SQL and MongoDB).
  - Making analysis queries that answer business questions.
  - Creating training data using queries.
  - Feature engineering.
- Model building and deployment, including machine learning.
- Roles and responsibilities in data science projects.
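The query-related topics in the list above (analysis queries that answer business questions, and creating training data using queries) can be sketched with Python's built-in `sqlite3` module. The `orders` table, its columns, and the rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL, returned INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.0, 0), (1, 80.0, 1), (2, 40.0, 0), (3, 300.0, 0), (3, 20.0, 0)],
)

# Analysis query: which customers spent the most in total?
top_customers = conn.execute(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM orders GROUP BY customer_id ORDER BY total DESC"
).fetchall()

# Training-data query: one row per customer, with two features
# (order count, average amount) and a label (any returned order).
training_rows = conn.execute(
    "SELECT customer_id, COUNT(*) AS n_orders, AVG(amount) AS avg_amount, "
    "MAX(returned) AS has_return "
    "FROM orders GROUP BY customer_id ORDER BY customer_id"
).fetchall()

conn.close()
print(top_customers)
print(training_rows)
```

The same pattern, aggregating raw records into one row per entity, applies to other relational databases; the result can then be loaded into a pandas DataFrame for feature engineering and modelling.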
The learning activities will combine 2/3 lectures and 1/3 group projects. The first part will focus on learning the skills of data curation and machine learning workflow design. In the second part, students work in groups to complete a data science project. They will play different roles in the team (product manager, domain expert, data engineer, data scientist). The tasks of the project include accessing open datasets in databases, building machine learning workflows, and presenting results.
Software tools
Python, pandas, NumPy, matplotlib, scikit-learn.
Please note that while attendance is not compulsory in all courses, it is the student’s own responsibility to obtain any information provided in class.
All parts of the assessment must be passed in order to get a grade in the course.
Starting in the academic year 2023/2024, the weighting of the two assessment components has changed. It is not possible to retake the old version of the exam. Please note the new exam codes in the Exam section of the course description.
All courses in the Master's programme will assume that students have fulfilled the admission requirements for the programme. In addition, courses in the second, third and/or fourth semester can have specific prerequisites and will assume that students have followed normal study progression. For double degree and exchange students, please note that equivalent courses are accepted.
Disclaimer
Deviations in teaching and exams may occur if external conditions or unforeseen events call for this.
Courses in programming (preferably Python), relational databases, and machine learning, or similar courses.
Assessments
Exam category | Form of assessment | Exam/hand-in semester | Weight | Grouping | Support materials | Duration | Exam code | Grading scale | Resit |
---|---|---|---|---|---|---|---|---|---|
School Exam | Written School Exam - digital | First Semester | 40 | Individual | | 2 Hour(s) | GRA 41573 | ECTS | Examination when next scheduled course |
Submission | Submission other than PDF | First Semester | 60 | Group/Individual (1 - 3) | | 1 Month(s) | GRA 41574 | ECTS | Examination when next scheduled course |
All exams must be passed to get a grade in this course.
Activity | Duration | Comment |
---|---|---|
Group work / Assignments | 100 Hour(s) | |
Prepare for teaching | 12 Hour(s) | |
Examination | 32 Hour(s) | |
Teaching | 36 Hour(s) | There are two hours of synchronous and one hour of asynchronous teaching each week. |
A course of 1 ECTS credit corresponds to a workload of 26-30 hours. Therefore a course of 6 ECTS credits corresponds to a workload of at least 160 hours.
1. Data curation and machine learning workflow design.
2. The process of building a machine learning pipeline, deploying and managing models.
Students get feedback on their presentations from peers and teachers. Feedback can be used to write this report.