GRA 4157 (Big) Data Curation, Pipelines and Management
To gain consistent benefits from machine learning models in business, it is essential to move data science projects from experimentation to production by building automated machine learning pipelines. A standard machine learning pipeline consists of data preparation, model training, model evaluation and validation. The tasks of the automated pipeline range from collecting real-time streaming data to model and output management.
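The standard stages named above (data preparation, model training, and evaluation) can be sketched with scikit-learn's `Pipeline`. This is a minimal illustration rather than course material: the synthetic data, the step names, and the choice of median imputation with logistic regression are all assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Blank out roughly 5% of the cells to mimic missing values in raw data.
X[rng.random(X.shape) < 0.05] = np.nan

# Hold out a test set so evaluation uses data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Data preparation and model training chained into one object.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardise features
    ("model", LogisticRegression()),               # fit a simple classifier
])
pipeline.fit(X_train, y_train)

# Model evaluation on the held-out data.
accuracy = pipeline.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Bundling the preparation steps with the model means the same transformations are applied at training and prediction time, which is one reason pipelines are easier to move into production than ad hoc scripts.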
In this course, you will learn the life cycle of a data science project and the responsibilities of the different roles in a data science team. You will also learn how to build an efficient end-to-end data science project and get hands-on experience with programming in Python. Particular focus will be put on what is typically the most time-consuming part, namely data curation, cleaning, and management, including different database infrastructures and SQL-style queries.
By the end of the course, the student:
- Can explain the responsibilities of different roles in a data science team.
- Can identify and describe the important components of an efficient automated machine learning pipeline.
- Can compare and contrast various data preparation/cleaning procedures, and knows how these might affect estimation output at a later stage.
- Demonstrates a good understanding of different database infrastructures, such as relational and non-relational databases, and can adequately compare and contrast how they fit different situations.
- Is able to understand a few basic machine learning algorithms.
By the end of the course, the student:
- Can effectively use components of Python, in particular the packages NumPy, pandas, matplotlib, SciPy and scikit-learn, for data science applications.
- Can build, deploy, and manage machine learning pipelines in an efficient manner, including cleaning, transforming, merging and reshaping of data.
- Can apply a number of data curation techniques to handle, e.g., missing values and outliers in the data.
- Is skilled in reading and writing data from various database infrastructures (SQL, MongoDB).
- Is able to apply machine learning algorithms to real data sets to gain insight.
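As an illustration of the data curation skills listed above, the following sketch handles missing values and an outlier with pandas. The column name, the toy values, median imputation, and the 1.5 × IQR rule are assumptions chosen for the example; other strategies may suit a given data set better.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two missing prices and one extreme value.
df = pd.DataFrame(
    {"price": [10.0, 12.0, np.nan, 11.0, 13.0, 500.0, 9.0, np.nan]}
)

# Missing values: fill with the median, which is robust to the extreme value.
median_price = df["price"].median()
df["price_filled"] = df["price"].fillna(median_price)

# Outliers: flag values outside 1.5 * IQR of the quartiles.
q1 = df["price_filled"].quantile(0.25)
q3 = df["price_filled"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["price_filled"].between(lower, upper)

# One possible treatment: clip flagged values to the IQR bounds.
df["price_clean"] = df["price_filled"].clip(lower=lower, upper=upper)

print(df)
```

Keeping the flag column alongside the cleaned column makes it possible to check later whether the treatment of outliers changes the model output.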
By the end of the course, the student:
- Understands the end-to-end process of building a data science project, and knows how to build efficient workflows.
- Understands, through replicating real-world situations, the challenges data scientists face and how to solve them.
The course moves towards advanced Python programming with a particular focus on data science project life cycles. This includes:
- Understanding business problems.
- Data curation, collection, and preprocessing, covering:
  - Reading and writing data “manually” to files in Python.
  - Handling missing values, outliers, and other data anomalies.
  - Database infrastructures, and exploring and curating data with pandas (including SQL and MongoDB).
  - Making analysis queries that answer business questions.
  - Creating training data using queries.
  - Feature engineering.
- Model building and deployment, including machine learning.
- Roles and responsibilities in data science projects.
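The query-related topics in the list above can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch: the `customers` and `orders` tables, their columns, and the sample rows are hypothetical and exist only to show an analysis query and a query that produces one training row per customer.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL
    );
    INSERT INTO customers VALUES (1, 'retail'), (2, 'corporate'), (3, 'retail');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 100.0);
    """
)

# Analysis query: total order amount per customer segment.
segment_totals = conn.execute(
    """
    SELECT c.segment, COALESCE(SUM(o.amount), 0.0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.segment
    ORDER BY c.segment
    """
).fetchall()

# Training data query: one row of features per customer.
# LEFT JOIN keeps customers without orders, with zero-valued features.
training_rows = conn.execute(
    """
    SELECT c.id, c.segment,
           COUNT(o.id) AS n_orders,
           COALESCE(SUM(o.amount), 0.0) AS total_amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.segment
    ORDER BY c.id
    """
).fetchall()
conn.close()

print(segment_totals)
print(training_rows)
```

The same queries could be passed to `pandas.read_sql_query` to obtain a DataFrame directly; plain tuples are used here to keep the sketch dependency-free.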
The learning activities will combine 2/3 lectures and 1/3 group projects. The first part will focus on learning the skills of data curation and machine learning workflow. In the second part, students need to work in groups to complete a data science project. They will play different roles in the team (product manager, domain expert, data engineer, data scientist). The tasks of the project include accessing open datasets in databases, building machine learning workflows, and presenting results.
Software tools
Python, pandas, NumPy, matplotlib, scikit-learn.
All courses in the Master's programme assume that students have fulfilled the admission requirements for the programme. In addition, courses in the second, third and/or fourth semester can have specific prerequisites and assume that students have followed normal study progression. For double degree and exchange students, please note that equivalent courses are accepted.
Disclaimer
Deviations in teaching and exams may occur if external conditions or unforeseen events call for this.
Courses in programming (preferably Python), relational databases, and machine learning, or similar courses.
Assessments

Exam code | Exam category | Form of assessment | Weight (%) | Grouping | Support materials | Duration | Grading scale | Resit |
---|---|---|---|---|---|---|---|---|
GRA 41571 | Submission | Written submission, invigilation | 30 | Individual | | 2 Hour(s) | ECTS | Examination when next scheduled course |
GRA 41572 | Submission | Written submission | 70 | Group/Individual (1 - 3) | | 1 Month(s) | ECTS | Examination when next scheduled course |

All exams must be passed to get a grade in this course.
Activity | Duration | Comment |
---|---|---|
Group work / Assignments | 100 Hour(s) | |
Prepare for teaching | 12 Hour(s) | |
Examination | 32 Hour(s) | |
Teaching | 36 Hour(s) | |
A course of 1 ECTS credit corresponds to a workload of 26-30 hours. Therefore a course of 6 ECTS credits corresponds to a workload of at least 160 hours.
1. Data curation and machine learning workflow design.
2. The process of building a machine learning pipeline, deploying and managing models.
Students get feedback on their presentations from peers and teachers. Feedback can be used to write this report.