Kaggle Competition--Travelers Fraud Detection

This post covers what our group did in the Kaggle competition for Travelers Fraud Detection and what I learned from the experience. I hope it gives you some ideas for similar projects.

All of the material is available on my GitHub. The programming emphasis is on data cleaning, XGBoost, LightGBM, model comparison, and classification. The detailed code can be found in these notebooks. Credit goes to the whole team: King Yiu Suen, Sam Piehl, Somyi Baek, Xun xian, and Yu Yang.

From this project, I learned several new things.

  1. When tuning the parameters of an XGBoost model, don’t try to find the optimal parameters with a single grid search over everything at once. Instead, tune the parameters one by one. I found this reference to be a useful guide.

  2. It is a good habit to clean the data thoroughly up front. In our project, we generated several cleaned datasets with different degrees of cleaning, which then required further cleaning inside the model notebooks. If you are organizing a similar project, my suggestion is to keep the datasets well organized from the start to avoid wasting time in the later stages.

  3. Cross-validation should be used throughout the project. It is a vital tool for detecting and avoiding overfitting.

  4. When working with the LightGBM model, we found that fewer features sometimes gave better results. This is a reminder that including as many features as possible is not necessarily a good thing. The worse performance might be due to correlation among the features, which requires further investigation. It may therefore be worth dropping some features when searching for the best feature combination.

  5. This raises a question: should Bayesian tuning be used in similar cases? In our tests, Bayesian tuning did not beat manual tuning for either XGBoost or LightGBM. Its effectiveness needs further consideration.

  6. Logistic regression and random forest didn’t perform well in this case, which is consistent with what many Kaggle competitions suggest. So, for a classification problem like this one, we might skip such methods and focus our time on XGBoost, LightGBM, CatBoost, and similar models.

  7. Several things still need to be studied: stacking and WOE (weight of evidence). Notes on these will be added in the future.
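
To make the first point more concrete, here is a minimal sketch of tuning one parameter at a time while holding the others at their current best values. It is not our notebook code. The `score` callback is an assumption: in a real project it would wrap something like a cross-validated metric for an XGBoost model, while `toy_score` below is a made-up stand-in so the example runs with only the standard library.

```python
def tune_one_at_a_time(score, start, candidates):
    """Tune each parameter in turn, keeping the others at their current best.

    score:      callable taking a dict of parameters, returning a number
                where higher is better (hypothetical; e.g. a CV metric).
    start:      dict of initial parameter values.
    candidates: dict mapping parameter name -> list of values to try.
    """
    best = dict(start)
    best_score = score(best)
    for name, values in candidates.items():
        for value in values:
            trial = {**best, name: value}
            trial_score = score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, best_score


def toy_score(params):
    # Made-up score surface that peaks at max_depth=5, learning_rate=0.1.
    # It only stands in for an expensive cross-validated model evaluation.
    return (1.0
            - 0.01 * (params["max_depth"] - 5) ** 2
            - (params["learning_rate"] - 0.1) ** 2)


start = {"max_depth": 3, "learning_rate": 0.3}
candidates = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}
best_params, best_score = tune_one_at_a_time(toy_score, start, candidates)
print(best_params, round(best_score, 4))
```

With these candidates the search evaluates the score 9 times (one baseline plus 4 values for each of the 2 parameters), where a full grid would need 16. The trade-off is that a one-at-a-time search can miss combinations that only work well together, so the order in which parameters are tuned matters.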
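
The cross-validation idea from the third point can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than what we ran: the `fit_threshold` and `predict_threshold` functions are a hypothetical toy model, the data is synthetic, and plain accuracy is used only for brevity, so swap in whatever metric your competition scores on.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, valid_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


def cv_accuracy(fit, predict, X, y, k=5, seed=0):
    """Return (mean accuracy, per-fold accuracies) over k folds."""
    scores = []
    for train, valid in k_fold_indices(len(X), k, seed):
        # Fit only on the training folds, score only on the held-out fold.
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in valid)
        scores.append(correct / len(valid))
    return sum(scores) / len(scores), scores


def fit_threshold(X_train, y_train):
    # Toy model: threshold halfway between the two class means.
    pos = [x for x, label in zip(X_train, y_train) if label == 1]
    neg = [x for x, label in zip(X_train, y_train) if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return int(x >= threshold)


# Synthetic one-feature data: label is 1 when the feature is at least 0.5.
X = [i / 20 for i in range(20)]
y = [int(x >= 0.5) for x in X]
mean_acc, fold_accs = cv_accuracy(fit_threshold, predict_threshold, X, y, k=5)
print(f"mean accuracy over {len(fold_accs)} folds: {mean_acc:.3f}")
```

Because every sample is held out exactly once, the averaged score reflects performance on data the model did not see during fitting, which is what makes it useful for spotting overfitting.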
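
For the fourth point, one simple way to try dropping features is greedy backward elimination: remove a feature whenever doing so improves the validation score, and stop when no removal helps. The sketch below assumes a `score` callback that evaluates a feature list (in practice, a cross-validated LightGBM score). The feature names and the `toy_score` gains are invented purely for illustration and do not come from the competition data.

```python
def backward_elimination(features, score):
    """Greedily drop features while dropping one improves the score.

    features: list of feature names.
    score:    callable taking a list of feature names, returning a number
              where higher is better (hypothetical; e.g. a CV metric).
    """
    current = list(features)
    best_score = score(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for feature in current:
            trial = [f for f in current if f != feature]
            trial_score = score(trial)
            if trial_score > best_score:
                # Accept the first removal that helps, then rescan.
                current, best_score = trial, trial_score
                improved = True
                break
    return current, best_score


# Invented contributions: two helpful features and two that hurt the score.
GAINS = {"age": 0.05, "claim_amount": 0.10, "noise_1": -0.02, "noise_2": -0.01}


def toy_score(feature_list):
    # Stand-in for an expensive model evaluation on the given features.
    return 0.7 + sum(GAINS[f] for f in feature_list)


kept, kept_score = backward_elimination(list(GAINS), toy_score)
print(kept, round(kept_score, 2))
```

A greedy search like this does not test every subset, so it may not find the best combination, but it needs far fewer model fits than an exhaustive search over all feature subsets.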

This post is licensed under CC BY 4.0 by the author.