## Assignment: Benchmark Model Building Project

Utilizing the information from previous topics build a model that will solve the identified problem. This model should be an individual approach to apply an analytical process to effectively build a model that best fits the business problem. Use one or more of the following software applications: IBM SPSS Modeler, SPSS Statistics, Excel, Tableau, Python, or R.

Create a draft outline describing your model and addressing the following:

- Demonstrate your application of the data analysis process by specifying which models you built and indicating why they best address the business problem.
- What variables did you include or leave out and why?
- Provide specific screenshots from the modeling software.
- Provide the raw software files that you used for this assignment.

Synthesize the information from your draft outline to complete, in 750-1,000 words, the relevant components in the Methodology Approach and Model Building section of the “Capstone Template.”

Submit the draft outline, raw data Excel files, screenshots, and the updated “Capstone Project Thesis Template.”

Prepare this assignment according to the guidelines found in the APA Style Guide, located in the Student Success Center. An abstract is not required.

This assignment uses a rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

You are required to submit this assignment to LopesWrite. Refer to the LopesWrite Technical Support articles for assistance.

*Benchmark Information*

*This benchmark assignment assesses the following programmatic competencies:*

*MS Business Analytics*

*1.5: Effectively apply data analytics processes.*


Credit Card Customers

MIS 690 Capstone Project

Submitted to Grand Canyon University

Graduate Faculty of the Colangelo College of Business

in Partial Fulfillment of the

Requirements for the Degree of

Master of Science

Business Analytics

Approved by:

Date

**Business Understanding**

**Background:** A manager at the bank is concerned that more and more customers are leaving the credit card program. The bank would like to predict which customers are likely to churn so that it can proactively offer those customers better value and reverse their decision.

**Business Problem:** Churn is a major problem for every financial organization, and it is hard to predict; it has existed for as long as organizations have. The company must satisfy its existing customers rather than focus only on acquiring new ones, because losing existing customers leads to a substantial loss. To overcome this problem, the company has to reward existing customers for their continued spending.

**Analytics Problem:** Analysis-based assumptions are made about the target groups, education levels, and income levels. Customer age and months on book are assumed to be highly related, since long-tenured customers are likely to continue with a credit card. Another assumption concerns the education variables: customers with higher levels of education are expected to churn less than customers with lower levels of education. For income levels, the assumption is that higher earners churn less than individuals with lower incomes.

When the variables are measured against one another and their impact on the target (attrited vs. existing customer) is assessed, the significance level provides a quantifiable criterion. Variables with higher levels of significance proceed into the model, while those with lower levels are discarded when drafting the model. In the training dataset, credit card utilization and income level play a key role in whether a customer continues with the card, so lower-income, high-utilization customers are of particular interest. The model is then applied to the test dataset, and offers, discounts, and interest rates are extended accordingly. Regular follow-ups with customers ensure that they understand the offering and that the attrition rate is reduced.

The cost of inaction is an increase in the attrition rate due to poor customer relations, unfavorable credit limits, and damage to the bank's reputation, all of which mean lower revenue. The loss of profits from attrited customers who might otherwise have continued using their credit cards means the company would have to consolidate its resources, for example through layoffs or closing branches.

There are various options depending on the data available; therefore, there is a need to confirm whether the data are related in any way.

- Provide an analysis plan that targets reducing the risk of attrition.

The area of focus is the set of variables that have a significant impact on the likelihood of a customer becoming attrited. Looking at specific areas such as education and income levels tests the long-held notion that people with higher education are more successful and should therefore be less prone to attrition.

Perform a linear correlation analysis to determine whether education, income, and family status are related to attrition risk. The objective is to establish whether there is a relationship between the identified variables.

Chi-square: Perform a chi-square test on the categorical data. The objective is to establish whether younger couples are likely to become attrited customers faster than older couples.

Perform a multiple regression analysis to determine the type of relationship the independent variables have with the dependent variable.
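The three planned analyses can be sketched in Python. This is a minimal illustration only: the miniature DataFrame, its column names, and the random values are hypothetical stand-ins for the real dataset, not actual results.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical miniature stand-in for the bank churn dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Customer_Age": rng.integers(18, 92, n),
    "Months_on_book": rng.integers(13, 56, n),
    "Attrited": rng.integers(0, 2, n),           # 1 = churned
    "Marital_Status": rng.choice(["Married", "Single"], n),
})

# 1. Linear correlation test: age vs. months on book.
r, p = stats.pearsonr(df["Customer_Age"], df["Months_on_book"])

# 2. Chi-square test of independence: marital status vs. attrition.
table = pd.crosstab(df["Marital_Status"], df["Attrited"])
chi2, p_chi, dof, _ = stats.chi2_contingency(table)

# 3. Multiple regression: fit attrition on age and tenure (least squares).
X = np.column_stack([np.ones(n), df["Customer_Age"], df["Months_on_book"]])
coef, *_ = np.linalg.lstsq(X, df["Attrited"].to_numpy(), rcond=None)

print(round(r, 3), round(chi2, 3), coef.shape)
```

In practice each test's p-value would decide whether the variable proceeds into the model, as described above.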

**Data Understanding**

This analysis focuses on the behavior of bank customers who are more likely to leave the bank (i.e., close their account). I want to identify the most striking customer behaviors through exploratory data analysis and later apply predictive analytics techniques to determine which customers are most likely to churn. **To analyze bank churners, there are various factors to examine:**

- Age: the customers' ages range from 18 to 92
- Months_on_book: the number of months the customer has stayed with the bank
- Avg_Utilization_Ratio: the average card utilization ratio (the portion of the credit limit in use)
- Income_Category: the customer's self-reported annual salary
- Attrition_Flag: whether the customer has churned

**Collect Initial Data:** The banking sector's credit card data is available at any time on the internal website. It can contain confidential client financial information; however, highly confidential fields can be masked. To identify bank churners, we can proceed with the factors listed above. This data can be obtained on a regular basis, such as monthly, quarterly, or annually.

**Identifying the Specific Data Types:** For statistical analysis, the analyst must summarize the data obtained for evaluation and present it to other people, so creating an analysis plan is crucial. When it comes to data, we deal with variables, which define the exact information about each category; each variable has values that represent the characteristics of an individual record. The information collected for this project includes 23 variables, such as Attrition Flag, Customer Age, Gender, and Education Level. These variables divide into nominal- and ordinal-level groups, which are the categorical variables; for example, Marital Status and Card Category are categorical.

The collected data also include ratio-level and interval-level measurements, which are the continuous variables; Customer Age, Income Category, Months on Book, Months Inactive (12 months), and so on are continuous. From these different variable types we can progress through different approaches to gather more information.

The data can be analyzed and visualized, and the variable level dictates the most suitable visual representation: nominal variables are best interpreted with pie or bar charts, while continuous variables are better interpreted with histograms. From the bank customers' data, I can analyze the causes of churn with visual representations, and by building a model I can pinpoint where churn is happening and predict how many customers are about to leave the company. For example, age has the highest impact on customer churn, probably because as people get older they need to take care of an entire family and switch to another bank; the bank can introduce preferential policies to engage those people. With this model, we can retain existing customers by offering more flexible options. This is how visualization and model prediction support the data analysis plan.
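The variable-typing step above can be sketched as a small helper that maps each column to a chart type. The column names here are hypothetical examples drawn from the variables listed earlier, not the full 23-variable dataset.

```python
import pandas as pd

# Hypothetical subset of the credit card dataset's variables.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing", "Attrited", "Existing"],
    "Marital_Status": ["Married", "Single", "Married"],
    "Customer_Age": [45, 38, 61],
    "Months_on_book": [36, 24, 48],
})

def chart_for(series: pd.Series) -> str:
    # Categorical (nominal/ordinal) -> bar/pie chart; continuous -> histogram.
    return "histogram" if pd.api.types.is_numeric_dtype(series) else "bar chart"

plan = {col: chart_for(df[col]) for col in df.columns}
print(plan)
```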

**Descriptive Statistics Analysis**

To verify that the data was reliable, I first conducted a series of descriptive statistics analyses using the descriptive statistics frequency report within SPSS Statistics:

The results primarily demonstrate that there are no missing values in any of the variables. Where available for continuous data, the mean, median, standard deviation, minimum, and maximum were calculated. These results were all checked to rule out the presence of any outliers, which would be apparent with unreasonable minimum or maximum values. No such values were identified. Frequency tables were constructed for categorical data, again to detect the presence of outliers or clearly mis-labeled data (such as an incorrect gender notation). No anomalies were identified here either.

No problems were found with the original dataset. Therefore, issue mitigation was not required before continuing with analysis in this dataset.
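The same reliability checks run in SPSS (missing-value audit, continuous summaries, categorical frequency tables) can be reproduced in pandas. The four-row sample below is a hypothetical illustration, not the project data.

```python
import pandas as pd

# Hypothetical rows mirroring the SPSS frequency-report checks.
df = pd.DataFrame({
    "Customer_Age": [45, 38, 61, 52],
    "Gender": ["F", "M", "F", "F"],
})

# Missing-value audit (the report above found none).
missing = df.isna().sum()

# Mean/median/min/max for continuous data, scanned for unreasonable outliers.
summary = df["Customer_Age"].agg(["mean", "median", "min", "max"])

# Frequency table for categorical data, to spot mislabeled codes.
freq = df["Gender"].value_counts()

print(missing.sum(), summary["min"], summary["max"], dict(freq))
```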

**Correlation Matrix**

Relationships between the data were initially explored using SPSS Statistics. First, a correlation matrix was built for each continuous variable:

The matrix reports the Pearson correlation and significance level for each pairwise comparison. By scanning the table and looking for significant relationships, the groundwork for further analysis can be set. Each statistically significant relationship can then be further explored by developing a scatterplot for the relationship.
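Such a matrix of pairwise Pearson correlations with two-tailed p-values can be built outside SPSS as well; a sketch, using hypothetical values in place of the real continuous columns:

```python
import pandas as pd
from scipy import stats

# Hypothetical continuous columns standing in for the SPSS matrix inputs.
df = pd.DataFrame({
    "Customer_Age": [45, 38, 61, 52, 29, 47],
    "Months_on_book": [36, 24, 49, 40, 18, 37],
    "Credit_Limit": [4000, 12000, 3500, 8000, 15000, 5000],
})

# Pearson r and two-tailed p-value for every pairwise comparison.
cols = list(df.columns)
results = {}
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r, p = stats.pearsonr(df[a], df[b])
        results[(a, b)] = (round(r, 3), round(p, 4))

print(results[("Customer_Age", "Months_on_book")])
```

Each significant pair found this way would then get its own scatterplot, as described above.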

**Customer Age vs Months on Book**

This relationship had a Pearson correlation of 0.789 with a corresponding two-tailed p-value of less than 0.0001. The strong positive correlation coefficient indicates that as age increases, the number of months on book tends to increase as well.

**Customer Age vs Total Amount Change from Q4 to Q1**

This relationship had a Pearson correlation of -0.062 with a corresponding two-tailed p-value of less than 0.001. The negative correlation coefficient indicates that as age increases, the total amount change from Q4 to Q1 tends to decrease.

**Customer Age vs Total Transaction Amount**

This relationship had a Pearson correlation of -0.046 with a corresponding two-tailed p-value of less than 0.001. The negative correlation coefficient indicates that as age increases, the total amount tends to decrease.

**Customer Age vs Total Transaction Count**

This relationship had a Pearson correlation of -0.067 with a corresponding two-tailed p-value of less than 0.001. The negative correlation coefficient indicates that as age increases, the total transaction count tends to decrease.

**Months on Book vs Total Amount Change from Q4 to Q1**

This relationship had a Pearson correlation of -0.049 with a corresponding two-tailed p-value of less than 0.001. The negative correlation coefficient indicates that as the number of months on book increases, the total amount change from Q4 to Q1 tends to decrease.

**Months on Book vs Total Transaction Amount**

This relationship had a Pearson correlation of -0.039 with a corresponding two-tailed p-value of less than 0.001. The negative correlation coefficient indicates that as the number of months on book increases, the total transaction amount tends to decrease.

**Months on Book vs Total Transaction Count**

This relationship had a Pearson correlation of -0.050 with a corresponding two-tailed p-value of less than 0.001. The negative correlation coefficient indicates that as the number of months on book increases, the total transaction count tends to decrease.

**Credit Limit vs Total Revolving Balance**

This relationship had a Pearson correlation of 0.042 with a corresponding two-tailed p-value of less than 0.001. The positive correlation coefficient indicates that as credit limit increases, total revolving balance also tends to increase.

**Credit Limit vs Average Open to Buy**

This relationship had a Pearson correlation of 0.996 with a corresponding two-tailed p-value of less than 0.001. The high positive correlation coefficient indicates that as credit limit increases, the average open to buy tends to increase strongly.

**Credit Limit vs Total Transaction Amount**

This relationship had a Pearson correlation of 0.172 with a corresponding two-tailed p-value of less than 0.001. The positive correlation coefficient indicates that as credit limit increases, total transaction amount also tends to increase.

As explored above, there is no missing data. However, some categorical values are marked as missing or unknown. This is not an issue for the analysis, because imputing the data could lead to incorrect conclusions; instead, any analysis using a categorical variable will include only the customers for whom that data is present.
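The filter-rather-than-impute policy just described is a one-liner in pandas. The rows and the "Unknown" label below are hypothetical examples of how such values might appear.

```python
import pandas as pd

# Hypothetical sample where some categorical values are marked "Unknown".
df = pd.DataFrame({
    "Income_Category": ["Less than $40K", "Unknown", "$40K - $60K", "Unknown"],
    "Attrition_Flag": ["Existing", "Attrited", "Existing", "Existing"],
})

# Rather than impute, restrict each categorical analysis to known rows.
known = df[df["Income_Category"] != "Unknown"]
print(len(df), len(known))
```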

**Summarize Data Samples**

In order to summarize the data, a visualization dashboard will be developed in Tableau. This dashboard will summarize all critical relationships in the data and provide a visual synopsis of the spread and distribution of key variables. The summary will largely be defined and driven by the previous analysis but will also look for novel insights into the data.

**Univariate Analysis: Categorical Data**

First, the current data will be visually described. With respect to gender, the distribution is roughly even, with slightly more female customers than male customers.

With respect to income category, the majority of customers make less than $40K a year, with a second peak between $40K and $60K a year.

With respect to marital status, most customers are married, closely followed by single. It is important to note that both income category and marital status have a significant amount of missing or unknown data, indicating that conclusions drawn from these variables should be treated with caution.

The vast majority of customers are in the Blue card category with only 20 customers out of over 10,000 in the Platinum category. It may be interesting to further consider the effect of card category on purchasing activity.

Finally for the categorical data, the majority of customers have a graduate-level education, followed by high school. This variable has a significantly high number of unknown values, however, limiting its applicability to further analysis.

**Univariate Analysis: Continuous Data**

The vast majority of customers have been on book for between 35 and 40 months with a relatively normal distribution of times on either end. The maximum time on book for any customer is 56 months.

Credit limit is significantly right skewed, with an interesting peak at the maximum credit limit value. It may be of value to further analyze this peak as it could potentially be a data acquisition or storage artifact and not reflect the true distribution of credit limits.

Customer age follows an almost perfect normal distribution with a peak at 43 years old.

The transaction amount data has four peaks, at approximately 1480, 4070, 7400, and 14430, demonstrating a multimodal distribution. Most of the data falls below 5180, however, indicating a broad right-skewed distribution.

Finally for the continuous data, the transaction count variable shows a bimodal distribution. One peak occurs at 36 while the other occurs at 73. Overall, this variable has a more normal distribution than the transaction amount variable.
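Bimodality like that described for the transaction count can be checked numerically with a histogram: a true bimodal variable shows a valley between its two peaks. The synthetic sample below only echoes the reported peak locations (36 and 73); it is not the actual data.

```python
import numpy as np

# Hypothetical transaction-count sample echoing the reported bimodal
# shape (peaks near 36 and 73); the values are simulated, not real.
rng = np.random.default_rng(1)
tx_count = np.concatenate([rng.normal(36, 5, 500), rng.normal(73, 5, 500)])

counts, edges = np.histogram(tx_count, bins=20)

def bin_count(value):
    # Count of observations in the histogram bin containing `value`.
    i = min(np.searchsorted(edges, value, side="right") - 1, len(counts) - 1)
    return counts[i]

# A valley between the two peaks indicates bimodality.
print(bin_count(36), bin_count(54), bin_count(73))
```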

**Data Relationships**

Next, we can consider the relationships between different variables. Dual box-and-whisker plots comparing credit limits by gender reveal that males have higher credit limits on average; in fact, 75% of female credit limits fall below the median credit limit for males. Interestingly, however, both genders have the same range of credit limits, with the female distribution significantly more right skewed than the male distribution.

A similar analysis for transaction amount and income category interestingly shows very little variation with the amount spent between the different income category groups. Each has a similar median with a similar spread of data.

Confounding occurs when related variables are not properly accounted for. To prevent this, all analyses were first cross-checked against the correlation matrix: if two variables were themselves interrelated, one would act as a confounding variable for the other, so such variables were not included together.
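The cross-check against the correlation matrix can be automated by dropping one variable from any near-duplicate pair, such as Credit Limit and Average Open to Buy (r = 0.996 above). The threshold of 0.9 and the simulated columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: Avg_Open_To_Buy is nearly identical to Credit_Limit.
rng = np.random.default_rng(2)
credit_limit = rng.uniform(1000, 34000, 100)
df = pd.DataFrame({
    "Credit_Limit": credit_limit,
    "Avg_Open_To_Buy": credit_limit - rng.uniform(0, 500, 100),
    "Customer_Age": rng.integers(18, 92, 100),
})

# Drop one variable from any pair whose |r| exceeds the threshold,
# so it cannot confound the other in later analyses.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
kept = df.drop(columns=to_drop)

print(to_drop, list(kept.columns))
```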

In this analysis, we are trying to predict customer attrition. Descriptive analytics on this dataset reveals that the majority of customers are existing customers, with an attrition rate of about 16%.
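Computing that attrition rate is a single aggregation over the flag column; a sketch using a hypothetical flag series constructed to match the roughly 16% figure cited above:

```python
import pandas as pd

# Hypothetical flag column; the report cites roughly 16% attrition.
flags = pd.Series(["Existing"] * 84 + ["Attrited"] * 16)

rate = (flags == "Attrited").mean()
print(f"{rate:.0%}")
```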

Next, the relationships between individual variables and attrition were explored. Comparing customer age and attrition reveals no difference in the age distribution between attrited and existing customers, indicating that age alone is not a predictive factor.

An analysis of gender and attrition also shows a similar lack of relationship between the two variables.

Similarly, credit limit and attrition do not share a significant relationship.

Together, these analyses indicate that a combination of at least two variables is required to properly predict customer attrition. There are no systematic or problematic anomalies in the data. One feature of note is the missing and unknown values in the categorical variables; it may be an option to impute this data if further analysis, including machine learning, is to be conducted, but it may be better to omit it to avoid drawing incorrect conclusions based on the method of imputation. The quality of the data has been explored through a range of statistical and visual methods, as described throughout this report. There are no requirements for data transformation or alteration at this time.
