Machine learning still occupies the peak of the Gartner hype cycle in 2017. Moreover, Glassdoor reports that data scientist is the best job in terms of job satisfaction, number of job openings and average salary. However, there is no dedicated bachelor's degree for this field; it is more like a diversity program, and candidates can have computer science, engineering, math or statistics backgrounds. I've collected some job interview questions asked for data scientist, machine learning engineer or artificial intelligence researcher roles. As a disclaimer, the responses are just my personal opinions. There is no single correct answer to the questions below; the approach is more important than the answer.
Suppose that you work for a financial institution. There might be thousands of branches through which your company connects with customers. How would you reward branches?
Rewarding branches based on profit alone might not be fair, because some branches have higher profits simply because they have more customers; this would reward the lucky ones. Instead, you could apply unsupervised learning and create clusters based on profitability, turnover, transaction volume, customer count or region, much like customer segmentation. Then you should evaluate each branch against the other members of its cluster. In this way, each branch competes against rivals of the same weight. Otherwise, it would be like putting a lightweight boxer in the ring with a heavyweight. In fact, there can be several champions for different weight classes.
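As a rough illustration, here is a minimal Python sketch of such a segmentation using scikit-learn's KMeans. The branch features and the cluster count are made up for the example; a real setting would also scale the features before clustering.

import numpy as np
from sklearn.cluster import KMeans

# hypothetical branch features: [profit, turnover, transaction volume, number of customers]
branches = np.array([
    [120, 500, 10000, 800],
    [90, 420, 9000, 750],
    [15, 60, 1200, 110],
    [20, 75, 1500, 130],
    [300, 900, 25000, 2000],
    [280, 850, 23000, 1900],
])

# group branches into "weight classes" and compare each branch within its own cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(branches)

for branch, label in zip(branches, labels):
    print(branch, "-> cluster", label)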
We have a billion transactions in our data set. A few of them are marked as fraud. In other words, the distribution of fraud and non-fraud transactions is highly imbalanced. What should we do before creating a fraud detection system?
This is a rare event detection problem. Classifiers expect balanced data during training to produce satisfactory results, but we cannot always expect balanced data. Firstly, you can feed a smaller number of randomly selected non-fraud instances to decrease their share. This is called under-sampling (or sub-sampling). However, it throws away potentially important data, so we would not often prefer it. Secondly, we can increase the number of fraud transactions by creating synthetic fraud data. For example, you can pick two random existing fraud instances, calculate the average of their transaction amounts, and assign that average to a new instance. This is called over-sampling, and it increases the number of fraud instances. This approach might be preferable to under-sampling for the fraud case, but it is still dangerous because it feeds non-existing data to the model. It is like having imaginary friends!
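As a small sketch of the over-sampling idea described above (not a full SMOTE implementation), the snippet below creates synthetic fraud amounts by averaging two randomly picked real fraud instances; the amounts are hypothetical.

import numpy as np

# hypothetical transaction amounts of the known fraud instances (minority class)
fraud_amounts = np.array([950.0, 1200.0, 780.0, 1500.0])

def oversample(amounts, n_synthetic):
    # create each synthetic instance by averaging two randomly picked real ones
    synthetic = []
    for _ in range(n_synthetic):
        i, j = np.random.choice(len(amounts), size=2, replace=False)
        synthetic.append((amounts[i] + amounts[j]) / 2)
    return np.array(synthetic)

print(oversample(fraud_amounts, 5))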
Let’s turn back to the fraud example. We had a billion lines of transaction data. Most lines are marked as non-fraud and only a few are marked as fraud. Additionally, there might be fraud transactions we did not mark as fraud, and feeding fraud transactions labeled as non-fraud would mislead the AI model. So, how should we design it?
We can ignore the fraud labels and treat the problem as anomaly detection. However, we should work on the transactions of each customer individually. Suppose that the transaction amounts of a customer (e.g. named Sefik) follow a normal distribution. The mean (µ) and standard deviation (σ) of the transaction amount will enlighten us. We already know that three standard deviations around the mean (µ ± 3σ) cover 99.7% of the area. We can apply this logic to the transactions of a customer. For example, if a customer spends 100$ on average and the standard deviation is 10$, then 99.7% of expenses must be between 70$ and 130$. You can mark any transaction of that customer as abnormal if it is greater than 130$. It might not be fraud, but it is still abnormal. In this way, we can form an opinion about unmarked transactions. By the way, you can widen the band for higher coverage; µ ± 4σ covers roughly 99.99%.
We considered the problem for the transaction amount only. We can increase the dimensions by adding extra information such as time and location.
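A minimal sketch of the 3-sigma rule on a single customer's transaction amounts might look like this; the history and the tested amounts are made-up numbers.

import numpy as np

# hypothetical historical transaction amounts of one customer (in dollars)
history = np.array([100, 95, 110, 102, 98, 105, 99, 101, 103, 97])

mu = history.mean()      # roughly 101
sigma = history.std()    # roughly 4.1

# a transaction is abnormal if it falls outside mu ± k*sigma (k = 3 by default)
def is_abnormal(amount, mu, sigma, k=3):
    return abs(amount - mu) > k * sigma

for amount in [104, 130, 60]:
    print(amount, "abnormal" if is_abnormal(amount, mu, sigma) else "normal")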
You need to develop a model for credit decisioning, and you need to explain the final decisions clearly because of legal requirements. How would you design this decisioning model?
Some machine learning models such as neural networks or support vector machines are opaque: their decisions cannot be read and understood by a human, because everything is handled in a black box. On the other hand, a decision tree algorithm produces transparent decisions that a human can read and understand clearly; in other words, you can follow the steps that lead to a decision. For example, consider a decision tree for accepting a job offer: the decision is to accept because the company offers free coffee, the commute does not last more than 1 hour and the salary is greater than 50K.
That’s why you should build a decision tree for credit decisioning. Herein, the most common decision tree algorithms for classification are ID3, C4.5 and CART; CART can also be adapted for regression problems.
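To show what a transparent model looks like in practice, here is a rough sketch with scikit-learn: a tiny decision tree trained on a made-up job-offer data set and printed as human-readable rules with export_text.

from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical job-offer data: [free coffee (0/1), commute in hours, salary in K]
X = [
    [1, 0.5, 60],
    [1, 1.5, 60],
    [0, 0.5, 60],
    [1, 0.5, 40],
    [0, 2.0, 45],
    [1, 0.8, 80],
]
y = [1, 0, 0, 0, 0, 1]  # 1 = accept the offer, 0 = decline

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# the learned rules can be read and followed step by step by a human
print(export_text(tree, feature_names=["free_coffee", "commute_hours", "salary_k"]))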
What does 100% prediction accuracy mean to a machine learning professional?
Either you solved an insignificant problem, such as counting how many legs a cow has, or you overfitted; most probably the latter. Even the most advanced AI models and intelligent life forms fail, so you should never expect 100% accuracy. Just as senior developers do not expect a new program to work bug-free on the first run (that only makes junior developers happy), machine learning practitioners should never expect 100%. If you still believe that you can solve a problem with 100% accuracy, then it is an automation problem: you can create a rule-based system and there is no need for AI.
Well, what about almost 100%? What if the model gets 99% prediction accuracy?
Remember the fraud detection data set. Suppose that there are 1M legal transactions and 100 fraud transactions. This means that 99.99% of the data set is legal whereas 0.01% is fraud. In this case, you can get 99.99% accuracy just by returning not-fraud by default. Is this a success? Of course not! What matters here is how many of the truly fraudulent instances you classify correctly. The confusion matrix and the ROC curve become important instead of overall accuracy. If the true positive and true negative rates are both close to 100%, that would be a good job.
Besides, if your problem concerns human health, then 99.99% accuracy means that you might cause the death of 1 person in every 10,000. So metrics can have very different meanings depending on the problem.
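The imbalance trap above can be demonstrated in a few lines; in this sketch the labels are synthetic and the "model" simply predicts not-fraud for everything.

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# hypothetical labels: 0 = legal, 1 = fraud (5 frauds among 1000 transactions)
y_true = np.array([0] * 995 + [1] * 5)
# a lazy model that always returns "not fraud"
y_pred = np.zeros(1000, dtype=int)

print("accuracy:", (y_true == y_pred).mean())   # 0.995, looks impressive
print(confusion_matrix(y_true, y_pred))         # but every fraud instance is missed
print(classification_report(y_true, y_pred, zero_division=0))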
Consider a weather forecast program. What kinds of machine learning problems does it include?
Funnily enough, it includes regression, classification and clustering. It predicts the temperature in Fahrenheit or Celsius degrees; this is regression because continuous outputs are produced. Moreover, it classifies the weather as partly sunny, raining or snowing; this is classification because there are a limited number of classes. Finally, it includes unsupervised learning: it clusters some cities/states based on their geographic location.
How would you handle over-fitting?
If you run a decision tree algorithm, it tends to over-fit on large-scale data sets. A basic remedy is to apply random forest: the data set is split into several sub data sets (often an odd number of them, so that majority votes do not tie), a different decision tree is built for each one, and the decisions of these trees determine the global decision by voting. Moreover, you can apply pruning to avoid over-fitting.
On the other hand, neural networks are based on updating weights over epochs. You should monitor the training and validation set error over the epochs: the training error will keep decreasing, but if the validation error starts to increase at some epoch, you should stop training there (early stopping). Moreover, if you created a really complex neural network model (many input features, hidden layers and nodes), you might re-design a less complex one.
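The early-stopping idea can be sketched with Keras as follows; the data set, network size and patience value are arbitrary assumptions, not a recommendation.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

# hypothetical data set: 20 input features, binary target
x = np.random.rand(1000, 20)
y = (x.sum(axis=1) > 10).astype(int)

model = Sequential([
    Dense(32, activation='relu', input_shape=(20,)),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# stop training as soon as the validation loss stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)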
Imagine a simple single-layer perceptron. Can you code it in any programming language?
This question might seem very easy, but it is a tricky one. Traditional developers tend to design this kind of system with for loops.
import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

# weighted sum computed with an explicit loop
total = 0
for i in range(inputs.shape[0]):
    total = total + inputs[i] * weights[i]

print(total)
However, machine learning practitioners should not apply this approach; they should use matrix multiplication instead, because the vectorized solution speeds up processing by almost 150 times.
import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

# weighted sum computed as a vectorized matrix multiplication
total = np.matmul(np.transpose(weights), inputs)

print(total)
Can you describe dimension reduction? What does it offer?
Your data set can have thousands of features. Feeding all of them produces a much more complex model: training lasts longer and the model tends to over-fit. Dropping some features reduces the complexity and speeds up training, but we might lose significant information. Autoencoders are a typical way to represent the data in fewer dimensions. In this way you zip the data (it is lossy compression), but you get a less complex model and faster training while losing far less information than you would by simply dropping features.
Besides, face recognition technology and art style transfer techniques are mainly based on dimension reduction and auto-encoders.
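A minimal autoencoder sketch in Keras, assuming a hypothetical 100-dimensional data set compressed to 16 dimensions, could look like this.

import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

# hypothetical data set with 100 features
x = np.random.rand(1000, 100)

# compress 100 features down to 16 and reconstruct them back
inputs = Input(shape=(100,))
encoded = Dense(16, activation='relu')(inputs)
decoded = Dense(100, activation='sigmoid')(encoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(x, x, epochs=10, batch_size=32, verbose=0)  # learn to reconstruct the input

# the encoder alone produces the reduced (lossy) representation
encoder = Model(inputs, encoded)
print(encoder.predict(x).shape)  # (1000, 16)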
So, I have collected some job interview questions asked of data scientists and machine learning practitioners and tried to respond. The responses reflect my personal opinions, and you might find some answers true or partially false. These questions are asked to test a candidate's approach to a solution; in other words, the approach is more important than the pure answer.
Support this blog if you like it!