Daily Practice | Data Scientist & Business Analyst & Leetcode Interview Questions 279

大数据应用 (Big Data Applications) · WeChat official account · Big Data · 2018-01-23 10:35

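Since June 15, 2017, 数据应用学院 (Data Application Lab) has been reviewing common interview questions in data science (DS) and business analytics (BA) together with you. Since October 4, 2017, we have also been sharing one Leetcode algorithm problem every day.

If you are actively looking for work in these fields, we hope you will follow our questions each day and think them through with us. We will post the answers the following day.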

Day 179

DS Interview Questions

While building a model, you typically use different Python packages like scikit-learn. If the organization says that you will have to develop the model yourself, how comfortable would you be in doing that?

BA Interview Questions

What is a constraint?

Leetcode Questions

Description: Given an array S of n integers, find three integers in S such that the sum is closest to a given number, target. Return the sum of the three integers. You may assume that each input would have exactly one solution.

Input: S = [-1, 2, 1, -4], target = 1

Output: 2 (the closest sum is -1 + 2 + 1 = 2)

Assumptions: each input would have exactly one solution.

Want to know the answers? Stay tuned for the next issue!

Day 178 Answers Revealed

DS Interview Questions

You are given a training data set with 1000 columns and 1 million rows. The data set is for a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do?

Processing high-dimensional data on a machine with limited memory is a strenuous task, and your interviewer will be fully aware of that. The following methods can help in such a situation:

Since we have limited RAM, we should close all other applications on the machine, including the web browser, so that as much memory as possible is available.

We can randomly sample the data set. This means creating a smaller data set with, say, 1000 variables and 300,000 rows, and running the computations on that.
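One way to draw such a sample without ever loading the full data set is reservoir sampling. This is a swapped-in technique, not something the answer above prescribes. The sketch below is a minimal illustration under that assumption; `reservoir_sample` and `fake_rows` are hypothetical names, and the sizes are scaled down from the 1 million / 300,000 rows in the text so that the example runs quickly.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Pick k rows uniformly at random from a stream, holding only k rows in memory."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def fake_rows(n: int) -> Iterator[int]:
    """Hypothetical stand-in for reading rows one at a time from disk."""
    for i in range(n):
        yield i


# Scaled-down version of "keep 300,000 of 1,000,000 rows".
subset = reservoir_sample(fake_rows(10_000), k=3_000)
print(len(subset))
```

Because rows are consumed one at a time, peak memory depends on the sample size rather than on the size of the original file.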

To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated ones. For numerical variables, we'll use correlation. For categorical variables, we'll use the chi-square test.

Also, we can use PCA and pick the components that explain the most variance in the data set. Building a linear model using Stochastic Gradient Descent is also helpful, because it updates the model from one sample or small batch at a time and never needs the whole data set in memory.

We can also apply our business understanding to estimate which predictors can impact the response variable. However, this is an intuitive approach, and failing to identify useful predictors might result in a significant loss of information.
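For the numerical variables, the correlation filter could look roughly like the sketch below. It covers only the numerical case (the chi-square test for categorical variables is not shown). The greedy keep-or-drop rule, the 0.9 threshold, the helper names `pearson` and `drop_correlated`, and the toy columns are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column is treated as uncorrelated here
    return cov / (sd_x * sd_y)


def drop_correlated(columns: Dict[str, List[float]], threshold: float = 0.9) -> List[str]:
    """Greedily keep a column only if it is not strongly correlated with one already kept."""
    kept: List[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: height in inches is just a rescaling of height in centimetres.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [35.0, 22.0, 58.0, 41.0, 29.0],
}
kept = drop_correlated(data)
print(kept)
```

With 1000 columns the pairwise comparison is the expensive part, so in practice it would typically be run on a row sample rather than on all 1 million rows.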
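To make the Stochastic Gradient Descent point concrete, here is a minimal sketch of a linear classifier (logistic regression) trained one row at a time. It does not cover PCA. The function names, learning rate, epoch count, and toy data are assumptions for illustration; in a real memory-constrained setting the rows would be streamed from disk rather than held in a list.

```python
import math
import random
from typing import List, Sequence, Tuple

Row = Tuple[List[float], int]  # (features, label in {0, 1})


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Predict class 1 when the linear score is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def sgd_logistic(rows: List[Row], n_features: int, lr: float = 0.1,
                 epochs: int = 20, seed: int = 0) -> Tuple[List[float], float]:
    """Fit logistic regression with plain SGD: one gradient step per row."""
    rng = random.Random(seed)
    rows = list(rows)  # toy data only; a real run would stream rows instead
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = sigmoid(score) - y  # gradient of the log loss w.r.t. the score
            for j in range(n_features):
                w[j] -= lr * err * x[j]
            b -= lr * err
    return w, b


# Toy data: two well-separated clusters in two dimensions.
gen = random.Random(42)
toy: List[Row] = (
    [([gen.uniform(-0.5, 0.5), gen.uniform(-0.5, 0.5)], 0) for _ in range(50)]
    + [([gen.uniform(1.5, 2.5), gen.uniform(1.5, 2.5)], 1) for _ in range(50)]
)
w, b = sgd_logistic(toy, n_features=2)
accuracy = sum(predict(w, b, x) == y for x, y in toy) / len(toy)
print(accuracy)
```

Each update touches a single row and a weight vector of length `n_features`, which is why this style of training fits in memory even when the data set does not.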

BA Interview Questions

What is an Index?

An index is a performance-tuning method that allows faster retrieval of records from a table. An index stores an entry for each value of the indexed column, so matching rows can be located without scanning the whole table.
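The effect can be seen with a small sketch using Python's built-in `sqlite3` module. The `customers` table, its columns, and the index name `idx_customers_email` are hypothetical; the exact wording of the query plan varies between SQLite versions, so the example only compares the plan before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (email, city) VALUES (?, ?)",
    [(f"user{i}@example.com", f"city{i % 50}") for i in range(10_000)],
)

query = "SELECT * FROM customers WHERE email = ?"
params = ("user1234@example.com",)


def plan(sql: str, args: tuple) -> str:
    """Return SQLite's query plan as one string (the detail text is the 4th column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = plan(query, params)  # without an index: a scan of the whole table
conn.execute("CREATE INDEX idx_customers_email ON customers (email)")
plan_after = plan(query, params)   # with the index: a search that names the index

print(plan_before)
print(plan_after)
```

The trade-off is that every index takes extra storage and must be updated on each insert, update, or delete, so indexes are usually added only on columns that are frequently searched or joined on.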

Leetcode Questions

Input: S = [-1, 2, 1, -4], target = 1

Output: 2 (the closest sum is -1 + 2 + 1 = 2)

Assumptions: each input would have exactly one solution.

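Solution: A variant of 3Sum. Keep track of the smallest difference from target found so far, compare each candidate sum against it, and update the answer whenever the new difference is smaller.
Note: when adding three ints, use a long to prevent overflow.
Code: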
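A minimal Python sketch of the approach described above, assuming the usual sort-plus-two-pointers formulation of 3Sum. The function name `three_sum_closest` is hypothetical. Python integers have arbitrary precision, so the overflow note applies when porting to languages with fixed-width ints such as Java or C++.

```python
from typing import List


def three_sum_closest(nums: List[int], target: int) -> int:
    """Return the sum of three elements of nums that is closest to target."""
    nums = sorted(nums)
    best = nums[0] + nums[1] + nums[2]  # assumes len(nums) >= 3
    for i in range(len(nums) - 2):
        lo, hi = i + 1, len(nums) - 1
        while lo < hi:
            total = nums[i] + nums[lo] + nums[hi]
            # Update the answer whenever the new difference is smaller.
            if abs(total - target) < abs(best - target):
                best = total
            if total == target:
                return total  # cannot get closer than an exact match
            if total < target:
                lo += 1   # need a larger sum
            else:
                hi -= 1   # need a smaller sum
    return best


print(three_sum_closest([-1, 2, 1, -4], 1))
```

Sorting costs O(n log n) and the two-pointer scan runs once per starting index, so the overall running time is O(n^2) with O(n) extra space for the sorted copy.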