Chapter 5 Prep Material

HDA builds on quantitative and computational fundamentals to apply big data analysis to environmental and molecular epidemiology. Working with health data not only builds expertise in an evergreen field, but also offers some of the biggest big data challenges on which to hone transferable skills.

Each student enters with a different level of expertise in these key areas, so each student needs a personalized approach to their preparation. HDA encourages each student to identify the areas they are least familiar with and put more effort into leveling up those skills.

Remember that you do not need to be an expert in any of these topics at the start of the program. Consistent, targeted effort (30-60 minutes a day) will make the program easier and complex topics less daunting.

5.1 Molecular Biology

Several of the courses in the program apply transferable techniques to omics data. Omics data refers to the comprehensive datasets generated in the “omics” fields of biology, which study the totality of molecules and processes within a biological system. These fields, such as genomics, transcriptomics, and proteomics, aim to analyze the complete sets of DNA, RNA, proteins, and other molecules. Omics data provides insights into the complex interactions within cells and organisms, contributing to a deeper understanding of biological systems.

Some of the former students and teaching staff on the course have put together a cheat sheet covering some of the concepts you will need to pick up throughout the course. You will review these in Molecular Epidemiology or in our optional Molecular Biology classes, but the cheat sheet is a great starting point.

5.2 Quantitative Skills

HDA focuses on the application and interpretation of advanced methodologies. You do not need to be a math expert to succeed in the course. However, understanding core quantitative concepts will make the advanced material more approachable.

For example, in Natural Language Processing (NLP), cosine similarity is a metric used to determine how similar two text documents are to each other, regardless of their length. It is calculated by representing the documents as vectors and dividing the dot product of the two vectors by the product of their magnitudes. This can be very confusing if you aren’t familiar with vectors in the first place.

As another example, stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient with an estimate calculated from a randomly selected subset of the data. This can be really confusing if you don’t know what a gradient or optimization is.
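As a rough illustration of the cosine similarity calculation described above, here is a minimal Python sketch. It assumes documents are represented as simple word-count vectors, which is only one of many possible representations. The function names and example texts are hypothetical and are not taken from any HDA course material.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their magnitudes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def word_count_vectors(doc_a, doc_b):
    """Turn two texts into word-count vectors over their combined vocabulary."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    vocabulary = sorted(set(counts_a) | set(counts_b))
    return ([counts_a[word] for word in vocabulary],
            [counts_b[word] for word in vocabulary])


# Hypothetical example texts (assumptions for illustration only).
short_doc = "gene expression data"
long_doc = "gene expression data " * 10  # same words, ten times longer
other_doc = "stock market prices"        # no words in common with short_doc

vec_short, vec_long = word_count_vectors(short_doc, long_doc)
same_topic = cosine_similarity(vec_short, vec_long)

vec_short_2, vec_other = word_count_vectors(short_doc, other_doc)
different_topic = cosine_similarity(vec_short_2, vec_other)

print(f"same words, different length: {same_topic:.3f}")
print(f"no words in common:           {different_topic:.3f}")
```

The two vectors in the first comparison point in the same direction, so their similarity is 1 (up to floating-point rounding) even though one document is ten times longer. The second pair shares no words, so the dot product, and therefore the similarity, is 0.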

Understanding a few core quantitative fundamentals will make it much easier to understand how and why certain methods should be used.

5.2.1 Math

What do self-driving cars, setting the quota in a fishery, planning a product launch campaign, and finding novel causes of disease all have in common? They can all be framed as constrained optimization problems. Many statistical methods rely on exact optimization to generate solutions. Different machine learning algorithms change the constraints, the optimizer, or both. The following external videos provide a general introduction to the concept of optimization. You will not need to perform any complex math in the course, but you will use methods that do. Understanding what they are doing on a conceptual level will make the methods more approachable.

Intro to Optimization

Least Squares Optimization

Convex vs. Non Convex Optimization

Visualizing Convex Optimization
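To make the idea of optimization more concrete, the sketch below fits a straight line to a handful of points by least squares in three ways: an exact closed-form solution, gradient descent using the full gradient, and stochastic gradient descent using a gradient estimated from one randomly chosen point per step. The data values, function names, learning rates, and step counts are all assumptions made up for illustration. This is a simplified sketch, not the way any particular library implements these methods.

```python
import random

# Toy data lying close to the line y = 2x + 1 (made-up values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
data = list(zip(xs, ys))


def exact_least_squares(points):
    """Closed-form slope and intercept that minimize the squared error."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def gradient(slope, intercept, points):
    """Gradient of the mean squared error over the given points."""
    n = len(points)
    d_slope = sum(2 * (slope * x + intercept - y) * x for x, y in points) / n
    d_intercept = sum(2 * (slope * x + intercept - y) for x, y in points) / n
    return d_slope, d_intercept


def gradient_descent(points, learning_rate=0.05, steps=5000):
    """Repeatedly step downhill using the gradient over all points."""
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        d_slope, d_intercept = gradient(slope, intercept, points)
        slope -= learning_rate * d_slope
        intercept -= learning_rate * d_intercept
    return slope, intercept


def stochastic_gradient_descent(points, learning_rate=0.01, steps=5000, seed=0):
    """Same update, but the gradient is estimated from one random point."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        sample = [rng.choice(points)]
        d_slope, d_intercept = gradient(slope, intercept, sample)
        slope -= learning_rate * d_slope
        intercept -= learning_rate * d_intercept
    return slope, intercept


exact_slope, exact_intercept = exact_least_squares(data)
gd_slope, gd_intercept = gradient_descent(data)
sgd_slope, sgd_intercept = stochastic_gradient_descent(data)

print(f"exact solution:              {exact_slope:.3f}, {exact_intercept:.3f}")
print(f"gradient descent:            {gd_slope:.3f}, {gd_intercept:.3f}")
print(f"stochastic gradient descent: {sgd_slope:.3f}, {sgd_intercept:.3f}")
```

For this data the exact solution is a slope of about 1.99 and an intercept of about 1.04. Gradient descent converges to the same values. Stochastic gradient descent with a constant learning rate ends up close to them but keeps fluctuating slightly, because each step uses a noisy estimate of the gradient.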

5.2.2 Linear Algebra

Most numeric optimizers and programming languages utilize linear algebra. The following external videos provide a good conceptual overview and visual representation of the key concepts you will see in the program.

Linear Algebra Overview

Note: The above video has an informal editing style and delivery. However, it is an excellent overview of a large number of complex topics in a digestible format.

Detailed Linear Algebra

Note: These videos are longer and go into greater detail. HDA recommends watching videos 1-6 and 9 in the above playlist.

5.2.3 Statistics

Statistics underpins most analytical fields, including epidemiology, machine learning, AI, and finance. It is important to be precise, to understand the limitations of each approach, and to understand how to interpret the outcomes. The following videos provide an excellent informal introduction to statistics. The statistics course in HDA will provide a more rigorous, formal and precise statistical foundation.

Statistics Overview

Detailed Statistics Overview

5.3 Programming

The best way to get better at programming is to practice. As an Imperial student, you will have free access to DataCamp. HDA recommends the following courses if you are interested in practicing your skills before the program: Introduction to R, Introduction to Python, Introduction to SQL, Foundations of Git, and Introduction to Shell. These courses follow on from this guide's introduction to the programming languages covered in HDA. To access the DataCamp courses, you need to use your Imperial email account.

5.3.1 Other Practice Platforms

There are other platforms that you can use to practice your programming skills for free. The most popular ones include Leetcode, HackerRank, CodeSignal, Codewars, and AlgoExpert. Their exercises are more focused on computer science skills than on data science. These skills are not needed to succeed in HDA and are not relevant to all next steps. If you decide to pursue a programming-heavy career option, it is a good idea to start practicing every day. HDA does not recommend starting with material that is not strictly relevant to the program, but advises those who are interested to start looking into these options in Term 2.

Kaggle is a data science competition platform and online community for data scientists and machine learning practitioners, owned by Google LLC. It has a variety of free courses and past competitions with solutions. While Kaggle is an excellent resource, HDA recommends establishing a base familiarity with the other material in this guide before exploring it.

5.4 Fun Readings

Not all learning needs to be technical and detailed. It can be useful to read interesting short articles or books to see how the technical skills you are learning could be applied.

5.4.1 Blogs

This is a collection of some of the most popular blogs in areas relevant to the program. You do not need to follow all of them, or any of them. HDA recommends browsing a few from each category to begin to understand where your interests may lie.

Data Science and ML
- https://www.kdnuggets.com/
- https://towardsdatascience.com/
- https://www.analyticsvidhya.com/blog/

AI
- https://bair.berkeley.edu/blog/
- https://news.mit.edu/topic/artificial-intelligence2

Programming
- https://news.ycombinator.com/
- https://blog.tensorflow.org/
- https://www.mygreatlearning.com/blog/

Biotech
- https://www.sciencedaily.com/news/plants_animals/biotechnology/
- https://www.nature.com/nbt/
- https://www.fiercebiotech.com/
- https://www.fiercepharma.com/

5.4.2 Books

While some people prefer to have physical books, you can also download PDF versions of most books for free from the internet in a safe, legal and ethical manner.

Where to Download PDFs

Here are a few staff favourites that introduce key concepts of the program. Note: None of the staff have any association with the authors or publishers of these books, and they do not endorse them beyond being entertaining reads.

What is a P Value Anyway?

The Signal and the Noise: Why So Many Predictions Fail

Junk DNA