📍 Interview Guide for Data Scientists
How to ace your interview
- Tell a story. Concentrate on delivering a concise and engaging story of your past experience. Read the room if you’re going on too long.
- Focus on challenges. When describing your past work, bring up the challenges you faced and decisions you made to resolve them. This is the stuff interviewers want to hear about.
- Not sure? Think — then talk it out. If you're faced with a difficult question, take a moment to think, and then share your thought process. Interviewers are far more interested in your ability to think critically than getting a correct solution right away.
- "I don't know" is a valid answer. If you don't really know how to answer a question, don't try to guess. Better to be honest — and move on to questions you feel more comfortable answering.
- Read as many Towards Data Science articles on Medium as you can, especially those explaining algorithms.
- Use exact terms rather than ambiguous or vague concepts. Knowing the language of a tool indicates that a candidate is familiar with it; if they do not understand the language of their tools, it's a safe bet they aren't familiar with them.
- Some people try to simplify ideas or use commonplace language, but it should be determined whether this is to help the interviewer or to cover up a lack of experience.
- Do not use algorithms you don't know in depth in the take-home test. Even though they may perform better, it might bite you later in the interview, especially if the interviewer has used them in the past.
- Understand fundamentals of probability theory and statistics. If you don’t understand these, it’s like building a house on sand.
- The fundamentals are exactly what helps a good data scientist learn how to modify an algorithm and bend it to their purposes/data.
Sample questions for practice
Practice answering these real questions from past mission interviews with data scientists:
- How do you determine the appropriate criteria for measuring success in a data science project?
- Data science is all about measurements; that’s why it’s a science. All data scientists should have a firm grasp on how they measure success on their projects.
- Oftentimes, data scientists will explain technical measurements used to evaluate models, etc. Ideally, the candidate can speak to the business use-case and how they aligned their technical performance metrics with the business KPIs.
- How do you make sure your models generalize to new data they haven't seen before?
- This response should include terms like:
- hold-out set or test set: data put aside when training models, and used only for evaluating final performance
- cross-validation: a technique where the data is split into several folds and models are repeatedly trained on some folds and evaluated on the remaining one, e.g. to determine which hyper-parameters work best (see the sketch below)
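A minimal scikit-learn sketch of both ideas, on synthetic data (the dataset and model choices are illustrative only, not from the original guide):

```python
# Hold-out test set + cross-validation (scikit-learn, synthetic data)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold-out set: keep 20% aside and never touch it until the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)

# Cross-validation on the training data only, e.g. for model/hyper-parameter selection
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())

# Final, unbiased estimate of generalization on the untouched hold-out set
model.fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))
```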
- Case study:
- A standard interview question is to walk through building a machine learning pipeline together with the interviewer, with questions about pre-processing, handling of big data, model choice, performance, feature engineering, etc.
- Model choice:
- My current model has a 3.5% error rate and a new one has a 3% error rate. Should I switch to the newer one in production?
- Answer: Do statistical testing to check whether the difference is significant (see the sketch after the hypothesis-testing questions below).
- What is hypothesis testing?
- Tests the validity of a claim (the null hypothesis) made about a population, using sample data. If the evidence is sufficiently inconsistent with the claim, we reject the null hypothesis in favor of the alternative hypothesis. Article
- What is a Z-score?
- Z-score is the number of standard deviations from the mean a data point is.
- What is a p-value?
- The p-value is the probability, assuming the null hypothesis is true, of observing results at least as extreme as the ones actually observed. If it is lower than a predefined significance level (alpha), we reject the null hypothesis and accept the alternative one. Article
- What is the difference between one- and two-tailed tests?
- A one-tailed test checks for a deviation in one direction only (e.g. "the new model is strictly better"), while a two-tailed test checks for a deviation in either direction (e.g. "the two models differ").
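As a hedged illustration of the model-comparison question above, here is one simple option: a two-proportion z-test on the 3.5% vs. 3% error rates. The test-set sizes are made up, and the test assumes the two error rates come from independently sampled test sets (for paired predictions on the same test set, McNemar's test would be more appropriate):

```python
# Two-proportion z-test on two models' error rates (SciPy; illustrative numbers)
from math import sqrt
from scipy.stats import norm

n1, errors1 = 10_000, 350   # current model: 3.5% error rate
n2, errors2 = 10_000, 300   # new model:     3.0% error rate

p1, p2 = errors1 / n1, errors2 / n2
p_pool = (errors1 + errors2) / (n1 + n2)            # pooled error rate under H0: p1 == p2
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))                       # two-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# Reject H0 at alpha = 0.05 only if p_value < 0.05; otherwise the observed
# difference could plausibly be due to sampling noise.
```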
- What is precision?
- Used mostly in information retrieval (or binary classification), precision is a metric that measures how precise my retrievals (or predictions) are: of the things I say are class A (i.e. True), how many are truly A? How precise am I in predicting A? Mathematically it is TP / (TP + FP).
- What is recall?
- Again used in information retrieval, recall is a metric that measures how good my algorithm is at retrieving (or predicting) the correct documents (or classes) out of all the correct ones. It is the percentage of A documents retrieved vs. all the A documents (retrieved or not). Mathematically it is TP / (TP + FN).
- Should I optimize for precision or recall?
- That depends on the problem at hand: in medical screening you usually favor recall (missing a positive is costly), while in spam filtering you may favor precision (flagging a legitimate email is costly).
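A minimal sketch computing both metrics with scikit-learn on made-up labels (the arrays below are purely illustrative):

```python
# Precision and recall from a confusion matrix (scikit-learn)
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # illustrative labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision = TP / (TP + FP) =", tp / (tp + fp))
print("recall    = TP / (TP + FN) =", tp / (tp + fn))

# Same numbers via the built-in metrics
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```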
- What is bayesian optimization?
- Bayesian optimization is a technique for optimizing a function that is expensive to evaluate: it fits a probabilistic surrogate model to past evaluations and uses it to choose the most promising point to evaluate next. Article
- Boundaries: Provide the decision boundaries of different algorithms
- Naive Bayes
- Logistic Regression
- Decision Trees
- What is a ROC curve?
- A plot of the true positive rate (TPR) against the false positive rate (FPR) as the classification threshold is varied (see the sketch after the Type I/II error definitions below).
- What is a Type I error?
- A false positive: rejecting a null hypothesis that is actually true. Its frequency is the false positive rate (alpha).
- What is a Type II error?
- A false negative: failing to reject a null hypothesis that is actually false. Its frequency is the false negative rate (beta).
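A minimal scikit-learn sketch on synthetic data tying these together: the ROC curve is traced out by the (FPR, TPR) pairs obtained as the decision threshold varies (dataset and model are illustrative only):

```python
# ROC curve (TPR vs. FPR across thresholds) with scikit-learn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))
```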
- Bayes Theorem
- Gives the posterior probability of an event by combining prior knowledge with the likelihood of the observed evidence.
P(A|B) = P(B|A) * P(A) / P(B)
- i.e.
P(Spam|Words) = P(Words|Spam) * P(Spam) / P(Words)
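A tiny worked example of the formula; all the probabilities below are made up for illustration:

```python
# Bayes' theorem with made-up spam-filter numbers
p_spam = 0.2                 # prior: P(Spam)
p_words_given_spam = 0.6     # likelihood: P(Words | Spam)
p_words_given_ham = 0.05     # P(Words | Not Spam)

# Total probability of seeing these words: P(Words)
p_words = p_words_given_spam * p_spam + p_words_given_ham * (1 - p_spam)

# Posterior: P(Spam | Words) = P(Words | Spam) * P(Spam) / P(Words)
p_spam_given_words = p_words_given_spam * p_spam / p_words
print(p_spam_given_words)    # 0.75: the words make spam much more likely than the prior
```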
- Bagging
- Train different models on different bootstrap samples of the data and aggregate their predictions (reduces variance)
- Boosting
- Train models sequentially, assigning different weights to each sample so that later models focus on the examples earlier models got wrong (reduces bias)
- What is Entropy?
- A measure of how impure or unpredictable the data is: H = -sum(p_i * log2(p_i)). It is zero for a perfectly pure set and maximal when the classes are evenly mixed.
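A minimal NumPy sketch of the entropy formula on made-up label lists:

```python
# Shannon entropy of a class distribution (NumPy)
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 0, 0]))   # 0.0 -> pure node, no uncertainty
print(entropy([0, 0, 1, 1]))   # 1.0 -> evenly mixed, maximum uncertainty for 2 classes
```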
- What is the curse of dimensionality?
- As the number of dimensions grows, the space the data live in becomes huge, so the data become sparse, distances lose meaning, and spurious relationships appear.
- You can mitigate it by:
- Feature selection
- L1 regularization
- PCA, other dimensionality reduction techniques
- Better feature engineering
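A minimal scikit-learn sketch of two of these mitigations, PCA and L1 regularization, on synthetic high-dimensional data (the dataset and hyperparameters are illustrative only):

```python
# Two ways to fight high dimensionality (scikit-learn, synthetic data)
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=200, n_informative=10, random_state=0)

# PCA: project onto the top components that explain most of the variance
X_reduced = PCA(n_components=10).fit_transform(X)
print(X_reduced.shape)              # (500, 10)

# L1 regularization: drives the weights of uninformative features to exactly zero
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero coefficients:", (clf.coef_ != 0).sum(), "out of", X.shape[1])
```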
- Why are ensembles better?
- Because the individual models make different errors; combining them averages those errors out and usually improves generalization.
- Definitions
- Bias: Error due to erroneous or overly simplistic assumptions (underfitting). In practice, roughly the error rate on the training set.
- Variance: Error due to too much complexity and sensitivity to small variations in the data (overfitting). In practice, roughly the gap between training and validation error.
- What can you do in the different high/low bias/variance scenarios? (A learning-curve sketch for diagnosing them follows this list.)
- High Bias:
- Increase model size (usually with regularization to mitigate high variance)
- Add more helpful features (which is another way of increasing again model size)
- Remove (worse) / reduce (better) regularization
- Adding more training data won't help: the model is too "small"
- High variance:
- Add training data (usually with a big model to handle them)
- Add/increase regularization
- Early stopping for NN
- Remove features (make the model simpler)
- Decrease model size (prefer regularization)
- Add more helpful features that are more useful to the problem at hand instead of the ones you have
- Can I have both low bias and low variance, or high bias and high variance?
- Yes: low bias with low variance is the ideal case, and high bias with high variance is also possible (a model that is both systematically wrong and unstable).
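One practical way to diagnose which regime you are in is to compare training and validation error as the training set grows. A minimal scikit-learn sketch on synthetic data (dataset and model are illustrative only):

```python
# Diagnosing bias vs. variance from train/validation scores (scikit-learn)
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y, cv=5,
    train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0],
)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Low training accuracy           -> high bias (underfitting)
    # Large train/validation gap      -> high variance (overfitting)
    print(f"n={int(n):4d}  train_acc={tr:.3f}  val_acc={va:.3f}  gap={tr - va:.3f}")
```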
- Decision Trees
- How is a decision tree built? (Common question)
- How do I prune a decision tree?
- What happens if I choose the split randomly?
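For the first two questions, a minimal scikit-learn sketch: grow a full tree, then prune it with cost-complexity pruning via the `ccp_alpha` parameter (the dataset and the alpha value are illustrative only):

```python
# Growing and pruning a decision tree with cost-complexity pruning (scikit-learn)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ccp_alpha > 0 removes branches whose complexity is not worth their impurity reduction
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("full:   depth", full_tree.get_depth(),   "test acc", full_tree.score(X_test, y_test))
print("pruned: depth", pruned_tree.get_depth(), "test acc", pruned_tree.score(X_test, y_test))
```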
- Support Vector Machines
- Which kernel to choose?
- What data do I need for each kernel?
- Random Forests
- Describe how random forest works, pros and cons, and where have you used them lately?
- Based on bagging. It is an ensemble method that builds many decision trees on bootstrap samples of the data, randomly selecting a subset of the attributes at each split, which makes it harder to overfit. In the end the trees' predictions are aggregated (majority vote or averaging). Params to usually tune are: the number of trees, and the number of features to be considered at each node.
- Pros:
- Works on a lot of datasets (Fernández-Delgado et al., JMLR, 2014)
- Few important parameters to tune
- Handles multiclass problems natively (unlike SVMs, which need one-vs-rest or one-vs-one schemes)
- Can handle a mixture of features and scales
- Cons:
- Slow for real-time prediction due to the large number of trees.
- When there are categorical variables with different numbers of levels, random forests are biased toward the variables with more levels.
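A minimal scikit-learn sketch with the two commonly tuned parameters mentioned above (dataset and values are illustrative only):

```python
# A random forest with its two most commonly tuned parameters (scikit-learn)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=300,      # number of trees to aggregate
    max_features="sqrt",   # number of features considered at each split
    random_state=0,
)
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```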
- Boosting Machines
- Describe how XGBoost works, pros and cons?
- It is an ensemble method that follows the boosting approach. It builds trees one at a time, with each tree correcting the errors made by the previous one. Params to usually tune are: number of trees, depth of trees, and learning rate.
- Pros:
- Helpful on highly imbalanced class problems (e.g. outlier detection); Kagglers swear by it :)
- Cons:
- Sensitive to overfitting if data is noisy
- Long(er) training times due to the sequential nature
- Harder to tune
- More of a black-box even though tree-based
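A hedged sketch of the scikit-learn-style XGBoost interface with the three parameters mentioned above; it assumes the separate `xgboost` package is installed, and the dataset and values are illustrative only:

```python
# Gradient-boosted trees via XGBoost's sklearn-style API
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,     # number of boosting rounds (trees built sequentially)
    max_depth=3,          # depth of each tree
    learning_rate=0.1,    # how much each new tree corrects the previous ones
)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```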
- LSTM
- Understanding LSTM
- Cell state: the cell state runs through the whole LSTM chain with only minor linear interactions, carrying long-term information
- Forget gate layer: Decides what information will be thrown away from the cell state through a sigmoid layer (1 keep this, 0 get rid of this)
- Input gate layer: Decides what new information will go into the cell state
- Output gate layer: Decides what cell state information will be output
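A minimal Keras sketch for sequence classification (TensorFlow is assumed; the sequence length, feature count, and layer sizes are illustrative). The gate machinery described above lives inside the `LSTM` layer:

```python
# A minimal LSTM model in Keras
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50, 8)),   # 50 time steps, 8 features per step
    tf.keras.layers.LSTM(32),               # forget/input/output gates handled internally
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```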
- Convolutional Neural Networks
- A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way
- A convolution is the process of applying a filter (“kernel”) to an image. Max pooling is the process of reducing the size of the image through downsampling.
- CNNs: Convolutional neural network. That is, a network which has at least one convolutional layer. A typical CNN also includes other types of layers, such as pooling layers and dense layers.
- Convolution: The process of applying a kernel (filter) to an image
- Kernel / filter: A matrix, smaller than the input, that is slid across the input to produce feature maps
- Padding: Adding pixels of some value, usually 0, around the input image
- Pooling: The process of reducing the size of an image through downsampling. There are several types of pooling layers; for example, average pooling converts many values into a single value by taking their average, but max-pooling is the most common. Pooling decreases the computational power required and extracts dominant features that are relatively invariant to small shifts and rotations.
- Max-pooling: A pooling process in which many values are converted into a single value by taking the maximum value from among them. Performs as a noise suppressant as well.
- Stride: the number of pixels to slide the kernel (filter) across the image.
- Downsampling: The act of reducing the size of an image
- Why use CNNs?
- To build invariance into your model: when a CNN learns to recognize an object in one place, it can recognize it in other places as well. A CNN successfully captures the spatial dependencies in an image (see the sketch below).
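A minimal Keras sketch (TensorFlow assumed; input shape and layer sizes are illustrative) showing convolution, max-pooling, padding, and a dense classification head:

```python
# A minimal CNN in Keras
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),                   # e.g. grayscale 28x28 images
    tf.keras.layers.Conv2D(32, kernel_size=3, padding="same",   # convolution: apply 3x3 kernels
                           activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),                  # max-pooling: downsample by 2
    tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),            # dense classification head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```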
- Types of regularization
- L1 (Lasso): penalizes the absolute value of the weights, pushing some of them to exactly zero (implicit feature selection)
- L2 (Ridge): penalizes the squared value of the weights, shrinking them toward zero
- Dropout
- Randomly sets outputs to 0 (disables them) during training. This prevents neurons from co-adapting and forces them to learn individually useful features.
- Early Stopping
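A minimal Keras sketch (TensorFlow assumed; layer sizes and hyperparameters are illustrative) combining an L2 weight penalty, dropout, and early stopping:

```python
# L2 weight penalty, dropout, and early stopping in Keras
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 (ridge) penalty
    tf.keras.layers.Dropout(0.5),        # randomly zero out 50% of activations during training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: stop training when the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                               restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```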
- How do you select hyperparameters?
- Manual Search
- Grid Search
- Random Search
- Bayesian Optimization
- Check this as well: Guideline to select the hyperparameters in Deep Learning
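A minimal scikit-learn sketch of grid search vs. random search over a random forest (the dataset and hyperparameter values are illustrative only); Bayesian optimization follows the same pattern but chooses the next candidate from a surrogate model of past results, e.g. via libraries such as Optuna or scikit-optimize:

```python
# Grid search vs. random search over random-forest hyperparameters (scikit-learn)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

# Grid search: evaluate every combination
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5).fit(X, y)
print("grid search best:", grid.best_params_, grid.best_score_)

# Random search: sample a fixed number of combinations
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=4, cv=5, random_state=0).fit(X, y)
print("random search best:", rand.best_params_, rand.best_score_)
```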
- What is batch normalization?
- It normalises the inputs of each layer (per mini-batch) so that they have approximately zero mean and unit standard deviation, followed by a learned scale and shift.
- The idea is to do the same in the hidden layers as you do in the input layer.
- Batch normalization:
- Makes the network learn faster (converge more quickly, higher learning rates)
- Helps with weight problems (weight initialization, saturated activation functions, some regularization capabilities)
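A minimal Keras sketch (TensorFlow assumed; layer sizes are illustrative) showing where a batch-normalization layer typically sits:

```python
# Inserting batch normalization between layers in Keras
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),   # normalize the layer's outputs per mini-batch
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```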
- How does K-means work?
- How does Expectation-Maximization work?
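A minimal scikit-learn sketch of both on synthetic blobs (dataset and cluster counts are illustrative only):

```python
# K-means and a Gaussian mixture fitted with EM (scikit-learn, synthetic data)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# K-means: alternate between assigning points to the nearest centroid and moving the centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-means centroids:\n", kmeans.cluster_centers_)

# Gaussian mixture: EM alternates between soft assignments (E-step) and
# re-estimating the Gaussians' parameters (M-step)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("GMM means:\n", gmm.means_)
```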
- What was the business problem, and how did you improve or solve it?
- Who were the stakeholders you spoke with?
- How did you measure the performance of your work? What were the results?
- Did you set up a hypothesis? (A great DS will be able to defend their conclusions well.)
- Was your data science research used in production eventually?
- This is a logical thought process for a project:
- Hear the problem
- Do independent research and data exploration
- Talk to the relevant stakeholders about it
- Look for data sources that exist and that you have access to
- A great DS will then add → simple plotting, aggregation of the data
- A not-so-great DS will go straight to → modeling
- Have you written models that were deployed into production? Where was it used: API, web app, mobile app, dashboard?
- What’s the difference between supervised and unsupervised learning?
Best questions to ask your interviewer
Always prepare questions for the end of the interview. The interviewer can learn a lot from what questions you choose to ask. Take a look at our best questions to ask your interviewer:
- What phase of development is the product in?
- What’s the biggest challenge with your product?
- What keeps you up at night?