UCL Data Science Society – Microsoft Hackathon Challenge Winners

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

UCLDataScience.PNG

 

Introduction

On November 30th, 2019, the UCL Data Science Society organized its first-ever Hackathon in association with Microsoft and American Express.

 

We headed there individually to embark on our first hackathon with the only expectation of learning.

 

DS1.PNG

Ds2.PNG

Ds6.PNG

As we arrived, we were randomly assigned to a group of 6 people from different academic backgrounds. It wasn’t long before we were told the objective of the event: to draw up a business proposal to increase American Express’s customer base growth by analysing the five datasets provided. These included all types of retail information that would aid us in finding ways to increase their profits.

 

Our Team

 

Zachary Bedja-Johnson.jfif

Zachary Bedja-Johnson

https://www.linkedin.com/in/zbedjajohnson/ 

I am currently in my second year at University College London studying MEng Mechanical Engineering. Aside from my degree, I have taught myself Python alongside a few other languages and technologies. Although this was my first experience of practical data science, I am eager to learn more about the wider uses within industry. Next academic year I will be studying at The University of Illinois at Urbana-Champaign hoping to further my knowledge in the fields of Engineering and Computer Science.

 

David Waisman.jfif
David Waisman

https://www.linkedin.com/in/davidwaisman/

I am a second-year Theoretical Physics student at UCL. Living in Argentina and the UK has shown me the impact properly applied technologies can have on communities, and fuelled a desire to develop areas of technology that are in their early stages, such as data science, and that have unlimited potential in their application. I am interested in explaining the inner workings of nature and in using analytical thinking and computing to develop new technologies and solutions to improve the future. Apart from developing my teamwork skills and machine learning modelling knowledge in hackathons, I like to improve my technical skills by analysing open source projects on GitHub. (And sometimes even contributing to them!)


Jaime Sabal Bermúdez.jfif

Jaime Sabal Bermúdez 

http://www.linkedin.com/in/jaimesabalbermudez 

As of 2019-2020, I am a second-year student at University College London doing an MSci in Physics. A little background about myself: I was born in Venezuela but quickly moved to Barcelona, Spain, due to the grave political situation, where I studied the International Baccalaureate. I am deeply passionate about data science and amazed by its practicality. I am also excited that next academic year I will be doing a year abroad at the University of Toronto in Canada, where I hope to broaden my skills and gain new experiences. Currently, I am seeking new ways to expand my knowledge in the field of data science in the context of the tech industry.

 

Hector Garcia Rodriguez.jfif

Hector Garcia Rodriguez

https://www.linkedin.com/in/hector-garcia-rodriguez 

Hi there, I’m Hector! Studying Theoretical Physics has provided me with more formal tools to do what I always loved: solving complex analytical problems. Following my other passion, building programs which are useful and have an impact, led me to take part in this Hackathon. As I am going to pursue further studies in Machine Learning, I believe this experience provided me with great insight into how to apply machine learning models and programming to real-life problems, a great complement to my formal education in Machine Learning, Mathematics (Linear Algebra...) and Programming.

 

Mihir Parekh.jfif

Mihir Parekh 

https://www.linkedin.com/in/mdparekh/

 

Hey! My name is Mihir and my personal mission is to drive social impact through data. I'm currently a postgraduate student at UCL specialising in Data Science and a director for UCL's Analytics for Social Impact Society. During my undergraduate study of Economics at the University of Cambridge, my passion for data-driven analysis grew through Econometrics courses. Since then, I have gained industry experience completing a data science internship at Facebook where I worked on the Audience Network product.

 

Zain Yousef.jfif

Zain Yousef 

http://www.linkedin.com/in/zainyousef

Hi, I’m Zain. I’m a second-year student from Cardiff, studying BSc Statistics at UCL. The Data Science Hackathon was a great experience for me; I really enjoyed learning about the credit card industry through analysing data and using new technologies, like Microsoft Azure, to do so. I have previous experience in data science through working as a full-time Data Analyst for NHS Wales over the last two summers and working part-time as a Consultant for MYPINPAD last year. I am currently looking for opportunities for this upcoming summer to further my knowledge in data science and statistics.

 

Aims

 

The objective of the hackathon was to help American Express continue to drive customer base growth and retention. We conducted our analysis using a three-pronged approach. Firstly, target the rising default rates in the industry: which customer segments are most likely to default, and how can American Express adjust their offerings as a result? Secondly, how can American Express curb the growth in subprime lending volumes, and what classifies as a ‘risky’ loan acceptance? Finally, how can American Express increase their product personalisation in order to compete more effectively with fintech companies such as Monzo and Revolut?

 

Rising default rates

Default occurs when a borrower cannot repay their debt. Companies like American Express have seen a rise in default rates over the last few years, which is damaging for both lenders and borrowers, as borrowers will not be able to obtain any finance after defaulting. Hence it is extremely important for American Express to be able to assess which potential clients are more likely to default. Our team trained various models with data from fictitious users who had and had not defaulted on payments. We then used our logistic regression model, which was the most accurate at over 80% accuracy, to assess how credit cards were being issued.
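To give a flavour of what a logistic regression default model does, here is a minimal sketch in plain Python. It is not the model we built at the hackathon: the applicants, the two features (age and number of derogatory reports) and the training settings are all made up for illustration.

```python
import math

# Made-up applicants: [age in decades, number of derogatory reports].
# Label 1 means the applicant defaulted, 0 means they did not.
X = [[2.1, 2], [2.4, 1], [2.8, 3], [2.2, 0], [3.0, 0],
     [3.5, 1], [4.2, 0], [4.8, 0], [5.5, 1], [6.0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Predicted probability of default for one applicant."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train(X, y, learning_rate=0.1, epochs=2000):
    """Fit the weights with plain batch gradient descent on the log-loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        # Errors are computed once per epoch, before any weight is updated.
        errors = [predict_proba(weights, bias, x) - t for x, t in zip(X, y)]
        for j in range(len(weights)):
            gradient = sum(e * x[j] for e, x in zip(errors, X)) / n
            weights[j] -= learning_rate * gradient
        bias -= learning_rate * sum(errors) / n
    return weights, bias


def accuracy(weights, bias, X, y, threshold=0.5):
    """Share of applicants whose predicted class matches the label."""
    predictions = [1 if predict_proba(weights, bias, x) >= threshold else 0
                   for x in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


weights, bias = train(X, y)
train_accuracy = accuracy(weights, bias, X, y)  # measured on the training data
young_with_reports = predict_proba(weights, bias, [2.2, 2])
older_no_reports = predict_proba(weights, bias, [5.5, 0])

print(f"training accuracy: {train_accuracy:.2f}")
print(f"P(default), 22 years old with 2 reports: {young_with_reports:.2f}")
print(f"P(default), 55 years old with 0 reports: {older_no_reports:.2f}")
```

In practice the accuracy should be measured on held-out data rather than on the training set, which is what the scoring step in a tool like Azure Machine Learning does for you.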

default1.jpg

We found that the rate at which cards were issued was higher for some groups which were more likely to default, with the main attribute driving this phenomenon being age. Although younger credit card applicants are more likely to default (according to our model), the rate of acceptance for “riskier” profiles in this age band was greater than for any other group; we assume this is because the issuer is keen to attract a younger audience. The opposite effect is observed for older age groups, which the issuer might not be as interested in having as clients.

 

Over-Under Rejections.png

 

Growth in subprime lending volumes

 

In order to analyse growth in subprime lending volumes, we decided to create two regression models. Our first model was created in order to discover the extent to which American Express considers each variable in our given dataset when accepting or rejecting a consumer for a credit card. From this we were able to generate the following formula to predict the probability of acceptance for a credit card with American Express.

ProbabilityofAcceptance.png

We found that the variables of most significance to American Express when considering a consumer for a credit card were:

  • Reports (number of major derogatory reports)
  • Age
  • Income (annual)
  • Share (ratio of monthly credit card expenditure to yearly income)
  • Expenditure (average monthly credit card expenditure)
     

    ExpenditureAverage.png

  • Major cards (number of major credit cards held)
  • Active (number of active credit accounts)

 

To check this model, we used a Cook’s distance plot to look for outliers with undue influence on the fit. The case with the greatest Cook’s distance had a value of approximately 2.00E-13. A commonly used threshold for Cook’s distance is 4/n, where n is the number of observations; for our dataset this gives a threshold of approximately 3.00E-03. Since every case fell well below this threshold, no outliers were detected in our model.
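For readers who have not met Cook’s distance before, the check can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it uses a straight-line fit with a single predictor and made-up numbers (with one deliberately corrupted point), not the hackathon dataset or our actual model.

```python
def cooks_distances(x, y):
    """Cook's distance of every point for a least-squares straight-line fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage of each point (diagonal of the hat matrix) for a simple fit.
    leverages = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

    return [(e * e / (p * mse)) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverages)]


# Made-up data: roughly y = 2x + 1, except the last point, which is far off.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 40.0]

distances = cooks_distances(x, y)
threshold = 4 / len(x)  # the 4/n rule of thumb
flagged = [i for i, d in enumerate(distances) if d > threshold]

print(f"threshold: {threshold:.3f}")
print(f"indices flagged as influential: {flagged}")
```

Here only the corrupted last point exceeds the 4/n threshold. In our analysis no case came close to it.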

 

Growing demand for product personalisation

 

We established a connection between how often people purchased products and how much they spent on new products. Customers who spent more often (the more reliable ones) also spent the most money, suggesting that, to target their desired client base, AmEx should offer higher credit limits. In addition, we did not find a link between net income and frequency of purchasing, which discredits any theory suggesting that clients who spend more frequently spend more because of higher purchasing ability.

Averagecreditamounts.png

 

How we used Microsoft Azure to accomplish our goals

In order to tackle the increasing issue of subprime lending, we set out to provide a method for Amex to predict whether a customer is creditworthy or not. We analysed a dataset from a German bank, detailing features from a customer’s credit history and how this affects their chance of applying for credit in the future.

 

Due to the nature of the data, we first used Microsoft’s Azure Machine Learning environment to cleanse the data and separate the features into categorical and numerical. Once this was completed, we decided to train two competing models to compare. Using Microsoft resources, we chose the Two-Class Logistic Regression and Two-Class Boosted Decision Tree algorithms. Azure cloud computing greatly reduced the time taken to train such models, which would have taken hours if done locally, allowing us to experiment with different strategies throughout the day. We were able to score the competing models against each other and produce a confusion matrix along with ROC and precision-recall curves.
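Azure produced the ROC curves for us, but the idea behind comparing two scored models can be sketched in plain Python. The labels and scores below are made up (label 1 stands for a risky customer, and a higher score means the model considers the customer riskier); they are not the outputs of our trained models.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up test set: 1 = risky customer, 0 = safe customer.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
# Hypothetical risk scores from two competing models.
scores_model_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2]
scores_model_b = [0.6, 0.4, 0.5, 0.7, 0.3, 0.65, 0.8, 0.1]

auc_a = area_under_curve(roc_points(labels, scores_model_a))
auc_b = area_under_curve(roc_points(labels, scores_model_b))

print(f"AUC model A: {auc_a:.4f}")
print(f"AUC model B: {auc_b:.4f}")
```

A larger area under the curve means the model ranks risky customers above safe ones more often, which is one way of scoring competing models against each other.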

 

ROCML.png

Weights.jpg

 

            Risky         Safe
Risky       Good          BAD
Safe        Acceptable    Good

 

 

Both models performed extremely well, with high accuracy. However, when predicting credit risk, a false negative can be extremely detrimental: a large default from a ‘perceived safe’ customer can wipe out the profits otherwise gained from multiple ‘actually safe’ customers. Analysing the figures Azure produced for our models allowed us to easily select the better-performing model.

 

The Two-Class Boosted Decision Tree produced the better model, giving high accuracy whilst producing only 3 false negatives from our testing set of 200. By increasing the threshold we were able to reduce this to zero; however, this also increased the number of false positives, hence reducing the number of customers that Amex would gain. Further refinement of the input data, alongside a larger, more diverse dataset, could help improve our results in production.
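The trade-off described above can be illustrated with a small sketch in plain Python. It assumes the threshold is the minimum predicted probability of being safe that a customer needs in order to be accepted; the customers and probabilities are made up and are not our test set of 200.

```python
def count_errors(is_risky, p_safe, threshold):
    """Return (risky customers accepted, safe customers rejected)."""
    risky_accepted = sum(1 for risky, p in zip(is_risky, p_safe)
                         if risky and p >= threshold)
    safe_rejected = sum(1 for risky, p in zip(is_risky, p_safe)
                        if not risky and p < threshold)
    return risky_accepted, safe_rejected


# Made-up customers: six who are actually safe, four who are actually risky.
is_risky = [False, False, False, False, False, False, True, True, True, True]
# Hypothetical model output: predicted probability that each customer is safe.
p_safe = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.75, 0.4, 0.3, 0.1]

for threshold in (0.5, 0.8):
    risky_accepted, safe_rejected = count_errors(is_risky, p_safe, threshold)
    print(f"threshold {threshold}: {risky_accepted} risky accepted, "
          f"{safe_rejected} safe rejected")
```

Raising the threshold from 0.5 to 0.8 removes the one risky customer who slipped through, at the cost of turning away two safe customers, which mirrors the trade-off we saw between false negatives and false positives.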

 

Overall our experience with Azure Machine Learning Studio was extremely enjoyable as it allowed us to rapidly prototype different strategies, edit metadata for our features and easily integrate virtual machines to train our models.

 

Reflection

We began this event without knowing what to expect or who we were going to be working with. In the end, our different academic backgrounds all proved to be valuable when analysing the given datasets. As the hours passed, our group dynamic steadily improved and our semi-individual analyses reached solid and well-rounded conclusions that ultimately shaped our proposal to American Express. We profited from every minute of the day, even learning how to use innovative tools such as Microsoft Azure Machine Learning servers, which made this experience incredibly rewarding.

 
