Ranking Potential Customers Based on the GroupEnsemble Method

 

 

The ExceedTech Team

South China University Of Technology

 

 

1. Background Understanding

 

Both products have been on the market for many years, yet the overlap between the two customer bases is currently very small, and the reason for this remains unknown. Moreover, the provided dataset only tells us whether a customer opened a home loan with the company within 12 months of opening the credit card. The distribution of potential customers may therefore differ from the distribution of customers who open a home loan account in the first year, so no assertion can be made that the 0-label samples lie outside the group of potential customers. Hence it is appropriate to treat these samples as unlabeled. The 1-label samples, however, represent customers who have already opened a home loan account; they are treated as strong positive samples and should be ranked in the top positions.

 

The task is to produce a score of each customer's propensity to take up a home loan with the company. Since no restriction is placed on the metric of this propensity, it is natural to treat the problem as a ranking problem: a ranking is more intuitive than a score on an unknown scale. For example, the company may simply choose the top 100 samples as the most probable clients to target with home-loan advertising, according to its resources. Furthermore, because the dataset is imbalanced, measuring the model by the area under the ROC curve (AUC) is more appropriate than the widely used error rate, and a ranking method fits this metric well, since the AUC is determined once the ranking is known.
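To make the link between the ranking and the AUC concrete, here is a minimal sketch (our own illustration, not part of the submitted code) that computes the AUC directly from the ranking via the rank-sum formulation; ties are ignored for simplicity:

import numpy as np

def auc_from_scores(scores, labels):
    # Higher score = ranked closer to the top; labels are 1 for customers
    # who opened a home loan within 12 months, 0 otherwise.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # The rank-sum of the positives minus its minimum possible value counts
    # the (positive, negative) pairs that are ordered correctly.
    correct_pairs = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return correct_pairs / (n_pos * n_neg)

print(auc_from_scores([0.9, 0.1, 0.8, 0.4, 0.3], [1, 0, 1, 0, 0]))   # 1.0

In this toy example both positives outrank every negative, so the AUC is 1.0.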

 

Statistics of the dataset show that nearly 90% of the samples have missing values on some attributes. This makes numerical learners such as neural networks and SVMs hard to apply directly. We choose tree models instead, because such models can handle missing values.

 

 

2. Preprocessing

 

We found some problems in the original dataset:

 

·         Attributes B_DEF_PAID_L12M and B_DEF_UNPD_L12M lack representativeness. According to the statistics of B_DEF_PAID_L12M, 3 samples take the value 1, 403 are missing, and all other 40297 samples take 0. For B_DEF_UNPD_L12M, all samples take the same value, which carries no information from the point of view of entropy.

·         Although the statement in Update 9 says that for all bureau attributes (those beginning with 'B') the special values 98 and 99 may be treated as missing values, the two values have different meanings: 99 indicates that the customer never went to the bureau, whereas 98 indicates that the customer did. They should therefore be treated separately.

·         The attributes CHQ_ACCT_IND, SAVE_ACCT_IND, AMEX_CARD, DINERS_CARD, VISA_CARD, MASTER_CARD and RETAIL_CARDS (the account attributes) contain invalid values, which should be corrected into valid ones.

·         All missing values are simply left blank, which is incompatible with WEKA's data format.

 

The supplied dataset was therefore put through the following series of operations:

 

·         Remove B_DEF_PAID_L12M and B_DEF_UNPD_L12M from the dataset.

·         Add an attribute FlagNotBureau to each sample: for samples whose remaining bureau attributes all take the value 99, set the flag to 1; otherwise set it to 0. Then convert the values 98 and 99 into blanks, i.e. missing values.

·         Map all values of the account attributes to valid values: missing values become 'X'; any value other than 'N', 'Y' or 'X', such as '0', '1' or '2' whose meaning is unknown, becomes 'Z'.

·         Replace every blank in the dataset with the missing-value flag '?' so that the format is compatible with WEKA.
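A minimal sketch of these operations in Python/pandas (the column names come from the dataset description above; the conversion to WEKA's ARFF format itself is omitted, and the helper name preprocess is ours):

import numpy as np
import pandas as pd

ACCT_COLS = ["CHQ_ACCT_IND", "SAVE_ACCT_IND", "AMEX_CARD", "DINERS_CARD",
             "VISA_CARD", "MASTER_CARD", "RETAIL_CARDS"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Drop the two uninformative attributes.
    df = df.drop(columns=["B_DEF_PAID_L12M", "B_DEF_UNPD_L12M"])

    # 2. FlagNotBureau is 1 when every remaining bureau attribute equals 99;
    #    afterwards 98 and 99 are treated as missing.
    bureau_cols = [c for c in df.columns if c.startswith("B_")]
    df["FlagNotBureau"] = (df[bureau_cols] == 99).all(axis=1).astype(int)
    df[bureau_cols] = df[bureau_cols].replace([98, 99], np.nan)

    # 3. Clean the account attributes: blank -> 'X', anything outside
    #    {'N', 'Y', 'X'} (e.g. '0', '1', '2') -> 'Z'.
    for col in ACCT_COLS:
        df[col] = df[col].fillna("X")
        df.loc[~df[col].isin(["N", "Y", "X"]), col] = "Z"

    # 4. Mark every remaining blank with WEKA's missing-value flag '?'.
    return df.fillna("?")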

 

 

3. Model Description

 

To handle the difficulties above, we introduce a 3-level regression model. In level 1, ERTree (Expanding Regression Tree) is applied to expand the probability distribution from the first year to the whole customer base. In level 2, a meta-learning method, RankBoost [4], is used to optimize the AUC of the model so as to achieve the best ranking result; it also helps in the imbalanced case. Finally, in the third level, a simple grouping-and-bagging method is used to deal with the imbalance problem and to reduce the time complexity of the model.

 

Figure 1 Framework of the model

 

 

4. GroupEnsemble Model

 

GroupEnsemble is a meta-learning method (pseudo-code shown in Table 1) built on top of the sub-model RankBoost in level 2, which is in turn built on top of ERTree in level 1.

 

The original purpose of GroupEnsemble is to tackle seriously imbalanced problems and to reduce training time. Many learning algorithms are designed to optimize accuracy, which performs poorly in imbalanced situations: in this problem, classifying every sample as negative already achieves about 98% accuracy, which is meaningless. One approach to imbalanced problems is resampling. Here we use a grouping method that divides the negative samples into several equal-size sets and then combines the positive sample set with each of them. For instance, if 40 groups are applied to the 40700 samples, every new dataset contains 700 positives and 1000 negatives, so each dataset is much more balanced than before. With respect to training time, the split datasets can be used to train models in parallel; even on a single-CPU machine the training time is reduced, on the grounds that for a super-linearly growing training time T(n),

k · T(n/k) < T(n)

where k is the group number and n is the size of the dataset.
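As an illustration of the grouping scheme, here is a minimal sketch (the helper names are ours, not the submitted code; fit_base_learner stands in for the level-2 RankBoost model):

import numpy as np

def group_ensemble_fit(X, y, k, fit_base_learner):
    # Split the negatives into k near-equal groups; each group is combined
    # with the full positive set and a base learner is trained on it.
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    np.random.shuffle(neg_idx)
    models = []
    for neg_group in np.array_split(neg_idx, k):
        idx = np.concatenate([pos_idx, neg_group])
        models.append(fit_base_learner(X[idx], y[idx]))
    return models

def group_ensemble_score(models, X):
    # Final propensity score: average of the k base-learner scores.
    return np.mean([m.predict(X) for m in models], axis=0)

With k = 40 on the 40700-sample dataset this reproduces the 700-positive / 1000-negative groups described above.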

 

 

Table 1 GroupEnsemble Algorithm

 

5. RankBoost Algorithm

 

As mentioned in Section 1, the task of finding potential customers can be regarded as a ranking problem; a ranking is more helpful to the company's decision makers, who can use it to decide how many resources to spend on turning potential customers into real ones.

 

The original idea of the RankBoost algorithm is to ensemble weak learners so as to achieve the best ranking result, and the idea is implemented through reweighting and resampling. It is therefore well suited to cases that ERTree, the next-level learner, cannot handle on its own. Moreover, the reweighting raises the rank of weak positive samples, which may also be effective for imbalanced problems. The pseudo-code of RankBoost is shown in Table 2: RankBoost reweights the samples differently from AdaBoost, which is shown in [4] to be more effective and to generalize better for ranking.
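A simplified sketch of the bipartite RankBoost loop (our own illustration under the assumption of a bipartite setting where positives should outrank negatives; fit_weak_learner is a placeholder for the level-1 ERTree, and alpha is chosen by a coarse line search rather than the closed forms of [4]):

import numpy as np

def rankboost_bipartite(X, y, T, fit_weak_learner):
    pos, neg = (y == 1), (y == 0)
    v = np.ones(len(y))
    v[pos] /= pos.sum()          # separate weight distributions over positives
    v[neg] /= neg.sum()          # and negatives; their product weights the pairs
    alphas = np.linspace(0.05, 2.0, 40)
    ensemble = []
    for _ in range(T):
        h = fit_weak_learner(X, y, sample_weight=v)
        s = h.predict(X)
        # Pick alpha minimising the pair-weight normaliser
        # Z(alpha) = sum_neg v*exp(alpha*s) * sum_pos v*exp(-alpha*s).
        def Z(a):
            return (v[neg] * np.exp(a * s[neg])).sum() * \
                   (v[pos] * np.exp(-a * s[pos])).sum()
        alpha = min(alphas, key=Z)
        ensemble.append((alpha, h))
        # Misranked samples (high-scoring negatives, low-scoring positives)
        # gain weight for the next round.
        v[neg] *= np.exp(alpha * s[neg])
        v[pos] *= np.exp(-alpha * s[pos])
        v[pos] /= v[pos].sum()
        v[neg] /= v[neg].sum()
    return ensemble

def rankboost_score(ensemble, X):
    return sum(alpha * h.predict(X) for alpha, h in ensemble)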

 

 

 

Table 2 RankBoost Algorithm

 

 

 

6. Expanding Regression Tree

 

As analyzed above, the task at hand has a time-variant target distribution, which a pure decision tree algorithm such as C4.5 can hardly address. We therefore propose the ERTree (Expanding Regression Tree) algorithm, as shown in Table 3. ERTree is based on REPTree in WEKA. It gradually raises the scores of the 0-label samples while also updating the scores of the 1-label samples. As a result, the ranks of positive and negative samples are both updated: weak positives go down and weak negatives go up. This behaviour follows from REPTree taking the average value of the instances in a leaf as the prediction value of that leaf. To make the relabeling process more reasonable, we introduce a parameter, attenuation, in step 5, as follows:

x.label ← x.label + attenuation · (h(x) − x.label)

For instance, suppose attenuation equals 0.2 and the labels in a leaf are 1, 0, 0, 0, 0. The average value of that leaf is then 0.2, so h(x) = 0.2 for each x in the leaf. According to step 5, the original 1-label sample is relabeled to (0.2 − 1) × 0.2 + 1 = 0.84, and the original 0-label samples are relabeled to 0.04. According to step 8, attenuation is then updated to its square, which gradually makes the effect of each iteration smaller. Therefore, if a leaf contains many high-score samples, its low-score samples are raised quickly during the iterations.
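A minimal sketch of this relabeling step and the attenuation update (our own illustration; in ERTree the leaf means h(x) come from the fitted REPTree):

import numpy as np

def ertree_relabel(labels, leaf_means, attenuation):
    # Step 5: move each label toward its leaf mean h(x).
    labels = labels + attenuation * (leaf_means - labels)
    # Step 8: square the attenuation for the next iteration.
    return labels, attenuation ** 2

# The worked example from the text: one leaf with labels 1,0,0,0,0 and attenuation 0.2.
labels = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
leaf_means = np.full(5, labels.mean())           # h(x) = 0.2 for every x in the leaf
new_labels, new_att = ertree_relabel(labels, leaf_means, 0.2)
print(new_labels)   # [0.84 0.04 0.04 0.04 0.04]
print(new_att)      # 0.04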

 

Another great advantage of using REPTree as the base model is that it conveniently handles samples with missing values inside the tree. Moreover, a separate attribute selection procedure is no longer important, since the best attributes for partitioning are selected during tree building and pruning.

 

However, ERTree can only expand the scores of samples around the strong positive clusters, which does not help when positive samples are scattered inside clusters of strong negatives. This disadvantage is addressed by the RankBoost meta-algorithm at the upper level.

 

Table 3 Expanding Regression Tree Algorithm

 

 

7. Parameter Selection

7.1. Group Number

 

We try several different values of the group number in level 3, corresponding to different proportions of positive instances, and record the AUC value under 7-fold cross-validation.
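The evaluation loop can be sketched as follows (scikit-learn is used here for the folds and the AUC; group_ensemble_fit and group_ensemble_score are the hypothetical helpers sketched in Section 4):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, k_groups, fit_base_learner, n_folds=7):
    # Mean AUC over stratified folds for a given group number.
    aucs = []
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True)
    for train_idx, test_idx in folds.split(X, y):
        models = group_ensemble_fit(X[train_idx], y[train_idx],
                                    k_groups, fit_base_learner)
        scores = group_ensemble_score(models, X[test_idx])
        aucs.append(roc_auc_score(y[test_idx], scores))
    return np.mean(aucs)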

 

Table 4 Comparisons with different Group Numbers

Group Number (positive percentage)   Test Set   Attenuation   Base Learner Num   AUC
600+3440,  10 groups (14.8%)         100+5600   0.8           10                 0.685659
600+1720,  50 groups (25.8%)         100+5600   0.8           10                 0.699514
600+344,  100 groups (63.7%)         100+5600   0.8           10                 0.705255

 

The results in the table show that the AUC value increases as the proportion of positives grows. This indicates that the patterns of potential customers are so weak that a larger share of positive instances is needed to extract them.

 

Figure 2 AUC values with different Group Numbers


 

 

 

7.2. Base Learner Num

 

Base Learner Number is the parameter in level 2 that indicates the number of ERTrees in the RankBoost schema. Common sense suggests that the more weak learners a boosting schema contains, the better its performance; however, this was not confirmed in the experiment:

 

Table 5 Comparisons with different numbers of Base Learners

Group Number                  Testing Set   Attenuation   Base Learner Num   AUC
600+3440, 10 groups (14.8%)   100+5600      0.8           5                  0.681509
600+3440, 10 groups (14.8%)   100+5600      0.8           10                 0.685659
600+3440, 10 groups (14.8%)   100+5600      0.8           15                 0.671741
600+3440, 10 groups (14.8%)   100+5600      0.8           20                 0.676313

 

The AUC value varies with BaseLearnerNum, but it is not true that more base learners always lead to better results. We therefore select BaseLearnerNum = 10 for our final model, which balances training time and accuracy.

 

Figure 3 AUC values with different numbers of Base Learners


 

 

 

7.3. Attenuation

 

Attenuation is the parameter of ERTree learning. As mentioned in Section 6, its function is to reduce the effect of earlier iterations while keeping the ranks on a suitable scale. It can be predicted that as attenuation approaches 1, the dataset becomes more unstable during the iterations; conversely, as attenuation approaches 0, the dataset hardly changes and ERTree behaves just like REPTree.
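Since step 8 squares the attenuation after every iteration, the effective value shrinks rapidly; for the final setting of 0.8:

attenuation = 0.8
for t in range(5):
    print(round(attenuation, 4))   # 0.8, 0.64, 0.4096, 0.1678, 0.0281
    attenuation **= 2              # step 8: square after each iteration

so later iterations perturb the labels less and less.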

 

Table 6 Comparisons with different Attenuations

Group Number                                Testing Set   Attenuation   Base Learner Num   AUC
600+3440, 10 groups, divided by sequence    100+5600      0.2           10                 0.667402
600+3440, 10 groups, divided by sequence    100+5600      0.4           10                 0.679264
600+3440, 10 groups, divided by sequence    100+5600      0.6           10                 0.678211
600+3440, 10 groups, divided by sequence    100+5600      0.8           10                 0.685659
600+3440, 10 groups, divided by sequence    100+5600      1.0           10                 0.667402

 

 

Figure 4 AUC values with different Attenuations


 

 

 

7.4. Model Evaluation

 

In the final model, the group number is 100, BaseLearnerNum is 10 and attenuation is 0.8. Experiments comparing it with RankBoost-ERTree, RankBoost-J48 and AdaBoost-J48 were run using 7-fold cross-validation, with the following results:

 

Table 7 Comparisons with different Models

Model               AUC
GroupEnsemble       0.72656
RankBoost-ERTree    0.69814
RankBoost-J48       0.60356
AdaBoost-J48        0.53667

 

The AUC values of the four learners are not high, but GroupEnsemble performs noticeably better than the others.

 

7.5. Model Insight

 

Although the AUC value is not very high, the samples with the lowest ranks can be treated as strong negatives. We take the last 700 samples of the ranking as strong negatives and form a dataset by combining them with the 700 label-1 samples as strong positives. Some rules can then be obtained with the JRip rule learner, as shown in Figure 5.
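A sketch of how this dataset could be assembled from the final ranking (the column names score and TARGET are illustrative; the rules themselves were extracted with WEKA's JRip):

import pandas as pd

def build_insight_dataset(df: pd.DataFrame) -> pd.DataFrame:
    # Combine the 700 positives with the 700 lowest-ranked samples.
    strong_pos = df[df["TARGET"] == 1]
    strong_neg = df[df["TARGET"] == 0].nsmallest(700, "score")
    return pd.concat([strong_pos, strong_neg]).drop(columns=["score"])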

 

 

Figure 5 Results of JRip Rules on the newly combined dataset

 

 

Take the first and second rules as examples.

 

The first rule states that when the total number of months at the current residence is <= 40, the number of bureau enquiries for loans in the last 12 months is no more than 0, and the CCN occupation code is 0, the sample's target flag is probably 1. Such a rule makes sense: those who have not lived in the area for long may not own a home yet, and the reason they have not enquired about loans may be that they have already accumulated some cash towards buying a house. Both conditions make the customer a potential home-loan buyer.

 

The second rule states that if the number of bureau enquiries for mortgages in the last 12 months is more than 1, the sample's target flag is also probably 1. This rule makes sense as well: a customer who has enquired about mortgage terms is a natural candidate for a home loan.

 

8. Discussion

 

We have proposed a 3-level ranking model, the GroupEnsemble model. Although only an AUC of about 0.7 was obtained in 7-fold cross-validation, it is worth pointing out that potential customers may have varied features that are hard to locate precisely with only 700 positive samples. Our experiments also show that ERTree performs better on this problem than traditional tree methods such as J48, and that RankBoost is an effective way to raise the AUC during training: an AUC of 98% was obtained on the training sets. From the opposite perspective, our model captures the strong negative samples well, i.e. the customers who are not interested.

 

A much more efficient approach would be to carry out more detailed and pertinent surveys of the pool of users who were not offered home-loan services at first but bought one later on; more features of this group of people could be observed that way.

 

 

9. Download

Submitted Paper

 

Prediction Algorithms’ Code

 

Submitted Result

 

Summary of Result

10. References

[1] Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition. Morgan Kaufmann, San Francisco, CA (2005).

[2] Zhou, Z.-H., Liu, X.-Y.: Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering 18(1) (2006) 63–77.

[3] LeCun, Y. et al.: A Tutorial on Energy-Based Learning. In: Bakir, G. et al. (eds.): Predicting Structured Data. MIT Press (2006).

[4] Freund, Y., Iyer, R., Schapire, R. E., Singer, Y.: An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research 4 (2003).

[5] Cortes, C., Mohri, M.: AUC Optimization vs. Error Rate Minimization. In: Advances in Neural Information Processing Systems 16. MIT Press (2004).

[6] Ataman, K., Street, W. N., Zhang, Y.: Learning to Rank by Maximizing AUC with Linear Programming (2006).