# Get all latest (August) EMC E20-007 Actual Test 41-50

QUESTION 41

You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do?

 A. Identify additional measures to add to the analysis B. Remove one of the measures C. Decrease the number of clusters D. Increase the number of clusters

Correct Answer: C

QUESTION 42

Data visualization is used in the final presentation of an analytics project. For what else is this technique commonly used?

 A. Data exploration B. Descriptive statistics C. ETLT D. Model selection

Correct Answer: A

QUESTION 43

Refer to the exhibit. You are asked to write a report on how specific variables impact your client’s sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only.

After a preliminary analysis of the data, the following findings were made:

1. Multicollinearity is not an issue among the variables

2. Only three variables (A, B, and C) have a significant correlation with sales

You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit.

You cannot request additional data. What is one way that you could try to increase the R² of the model without artificially inflating it?

 A. Create clusters based on the data and use them as model inputs B. Force all 15 variables into the model as independent variables C. Create interaction variables based only on variables A, B, and C D. Break variables A, B, and C into their own univariate models

Correct Answer: A

QUESTION 44

Data visualization is used in the final presentation of an analytics project. For what else is this technique commonly used?

 A. Assessing data quality B. Descriptive statistics C. ETLT D. Model selection

Correct Answer: A

QUESTION 45

What is the mandatory clause that must be included when using window functions?

 A. OVER B. RANK C. PARTITION BY D. RANK BY

Correct Answer: A
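
As an illustration of why OVER is the required piece, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the bundled SQLite is version 3.25 or later, which is when window functions were added. The `sales` table, its columns, and the `rank_sales` helper are hypothetical names chosen for this example. `RANK()` only acts as a window function because it is followed by `OVER (...)`; the `PARTITION BY` inside the parentheses is optional.

```python
import sqlite3

def rank_sales(rows):
    """Rank hypothetical (region, amount) rows by amount within each region."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # OVER is what makes RANK() a window function; PARTITION BY is optional.
    query = """
        SELECT region, amount,
               RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
        FROM sales
        ORDER BY region, rnk
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

ranked = rank_sales([("east", 10), ("east", 30), ("west", 20)])
```

Each row keeps its own identity in the output and gains a rank computed over its partition, which is the behavior that separates a window function from a plain `GROUP BY` aggregate.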

QUESTION 46

You submit a MapReduce job to a Hadoop cluster and notice that although the job was successfully submitted, it is not completing. What should you do?

 A. Ensure that the TaskTracker is running B. Ensure that the JobTracker is running C. Ensure that the NameNode is running D. Ensure that a DataNode is running

Correct Answer: A

QUESTION 47

What is holdout data?

 A. a subset of the provided data set selected at random and used to validate the model B. a subset of the provided data set selected at random and used to initially construct the model C. a subset of the provided data set that is removed by the data scientist because it contains data errors D. a subset of the provided data set that is removed by the data scientist because it contains outliers

Correct Answer: A
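
A random holdout split can be sketched in a few lines of standard-library Python. The `holdout_split` function, the 20% fraction, and the fixed seed are assumptions made for this example, not values given in the question.

```python
import random

def holdout_split(records, holdout_fraction=0.2, seed=0):
    """Randomly set aside a fraction of the records for validation."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    # The first n_holdout shuffled records are held out; the rest train the model.
    return shuffled[n_holdout:], shuffled[:n_holdout]

train, holdout = holdout_split(range(100))
```

The model would be fitted on `train` only and then scored on `holdout`, so the validation uses records the model has not seen.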

QUESTION 48

You have been assigned to run a linear regression model for each of 5,000 distinct districts, and all the data is currently stored in a PostgreSQL database. Which tool/library would you use to produce these models with the least effort?

 A. MADlib B. Mahout C. R D. HBase

Correct Answer: A

QUESTION 49

How does Pig’s use of a schema differ from that of a traditional RDBMS?

 A. Pig’s schema is optional B. Pig’s schema requires that the data is physically present when the schema is defined C. Pig’s schema is required for ETL D. Pig’s schema supports a single data type

Correct Answer: A

QUESTION 50

Before you build an ARMA model, how can you tell if your time series is weakly stationary?

 A. There appears to be a constant variance around a constant mean. B. The mean of the series is close to 0. C. The series is normally distributed. D. There appears to be no apparent trend component.

Correct Answer: A
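
One rough way to check for a constant mean and constant variance is to cut the series into consecutive windows and compare the statistics of each window. The sketch below assumes a hypothetical helper `window_stats` and two made-up series. It is only an eyeball check on the first two moments, not a formal stationarity test, and it does not examine the autocovariance structure.

```python
import statistics

def window_stats(series, n_windows=4):
    """Return (mean, variance) for consecutive equal-sized chunks of the series."""
    size = len(series) // n_windows
    chunks = [series[i * size:(i + 1) * size] for i in range(n_windows)]
    return [(statistics.mean(c), statistics.pvariance(c)) for c in chunks]

# Hypothetical data: one series oscillating around a fixed level, one with a trend.
flat = [5 + (1 if i % 2 else -1) for i in range(40)]
trending = [0.5 * i + (1 if i % 2 else -1) for i in range(40)]

flat_stats = window_stats(flat)       # mean and variance stay the same per window
trend_stats = window_stats(trending)  # window means drift upward
```

If the window means or variances drift noticeably, as they do for `trending`, differencing or detrending would usually be considered before fitting an ARMA model.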

