Why machine learning is your best defense against fraud

4 August 2017

How Dimebox uses machine learning to optimise its fraud solution: Credit cards are still by far the most popular payment method across the globe. Despite this, or maybe because of it, credit card payments are subject to a large number of fraud cases, costing markets all over the world billions of dollars.

Battling fraud with the help of machine learning

Over the years, companies have developed many intelligent tools in many different forms to prevent fraudsters from hijacking and abusing credit cards. On the other side of the fence, fraudsters are using more and more advanced techniques to bypass those tools. Winning this battle depends on your ability to anticipate, to keep learning and to adapt constantly. Traditionally, that required large amounts of manpower and resources.

This realisation led to the Dimebox fraud detection engine, which builds on state-of-the-art research in this field. The engine is a combination of machine learning models managed by a highly concurrent and fault-tolerant application, and it offers an automation layer over what is traditionally done offline by data scientists. This article gives both a general and a technical description of the Dimebox AI payments fraud detection engine.

Traditional approach to machine learning

Before getting started, let's agree on some general definitions:

  • Model: A piece of independent logic able to classify an input into categories. In essence, a model is a mathematical function which connects legions of parameters to a label.
  • Feature: A refined piece of information, mostly extracted, computed or extrapolated from raw data sources.
  • (Supervised) Training: Any attempt to increase the classification accuracy of a model by adjusting its internal parameters from the observation of already classified data points.
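To make these three definitions concrete, here is a minimal sketch in plain Python. It is not the Dimebox implementation: the logistic model, the two features (a scaled amount and a country-mismatch flag) and all data points are assumptions chosen purely for illustration.

```python
import math

def predict(weights, bias, features):
    """Model: a function that maps parameters and features to a fraud probability."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=500, lr=0.5):
    """Supervised training: adjust the parameters using already labelled points."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

# Hypothetical features per transaction:
# (amount scaled to [0, 1], 1.0 if card country differs from IP country)
samples = [(0.05, 0.0), (0.10, 0.0), (0.20, 0.0), (0.80, 1.0), (0.90, 1.0), (0.70, 1.0)]
labels = [0, 0, 0, 1, 1, 1]  # 1 = fraud, 0 = legitimate

weights, bias = train(samples, labels)
low_risk = predict(weights, bias, (0.08, 0.0))
high_risk = predict(weights, bias, (0.85, 1.0))
print(f"low-risk example: {low_risk:.3f}, high-risk example: {high_risk:.3f}")
```

The parameters here are just two weights and a bias; real models have far more, but the roles of model, feature and training stay the same.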

In traditional machine learning, models are trained on historical data coming from different sources. From a raw dataset, one usually derives a large number of well-chosen features, which give machine learning models a better chance of classifying correctly.

In fact, machine learning is often more about finding the right features than about collecting a large amount of data (there's no point in trying to classify animals as cats or dogs by counting their legs; looking at their weight and the length of their tail will give better results). Traditionally, this process is repeated with different models, different training sets and different features until an 'optimal' configuration is found.
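The cats-and-dogs point can be checked with a toy experiment. The sketch below assumes a very simple nearest-centroid classifier and a handful of made-up animals; it only illustrates that the same model does well or badly depending on which features it is given.

```python
import math

def centroid(rows):
    """Mean of each feature column."""
    return [sum(col) / len(col) for col in zip(*rows)]

def fit(samples, labels):
    """Nearest-centroid model: one mean feature vector per label."""
    grouped = {}
    for features, label in zip(samples, labels):
        grouped.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in grouped.items()}

def classify(model, features):
    """Return the label whose centroid is closest to the features."""
    return min(model, key=lambda label: math.dist(model[label], features))

def accuracy(train_rows, test_rows, pick):
    """Train on train_rows using the features chosen by pick, then score test_rows."""
    model = fit([pick(r) for r in train_rows], [r["label"] for r in train_rows])
    hits = sum(classify(model, pick(r)) == r["label"] for r in test_rows)
    return hits / len(test_rows)

def animal(label, legs, weight_kg, tail_cm):
    return {"label": label, "legs": legs, "weight_kg": weight_kg, "tail_cm": tail_cm}

# Made-up measurements, for illustration only.
train_rows = [
    animal("cat", 4, 4.0, 25), animal("cat", 4, 5.0, 28), animal("cat", 4, 3.5, 24),
    animal("dog", 4, 30.0, 35), animal("dog", 4, 20.0, 33), animal("dog", 4, 25.0, 40),
]
test_rows = [
    animal("cat", 4, 4.5, 26), animal("dog", 4, 28.0, 38),
    animal("cat", 4, 3.8, 23), animal("dog", 4, 18.0, 31),
]

legs_only = accuracy(train_rows, test_rows, lambda r: [r["legs"]])
weight_and_tail = accuracy(train_rows, test_rows, lambda r: [r["weight_kg"], r["tail_cm"]])
print(f"legs only: {legs_only:.2f}, weight and tail: {weight_and_tail:.2f}")
```

With leg count as the only feature, both centroids are identical and the classifier can do no better than guessing one label for everything; with weight and tail length, the two groups separate cleanly on this toy data.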

In our case, this approach has a few downsides:

  • Fraudsters do not sit still. Historical data only reflects fraudsters' past patterns, which are likely to have evolved over the years.
  • The approach is broad. Models aren't tailored to any specific business and may miss the specifics of a particular industry.
  • It requires a lot of maintenance. To keep the system reactive, it needs to be retrained and redeployed often.

Meet the Dimebox fraud engine

At Dimebox, the process described above has been completely automated and is done dynamically by the engine. Transactions are collected as they are processed and stored for historical analysis. Each transaction is then transformed into features, extracting every accessible piece of information from the transaction itself and from past transactions. The engine tries to understand the behaviour of each customer so that it can quickly detect any anomaly in that customer's purchase habits.
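As a rough idea of what turning a transaction into features can look like, here is a small sketch. The field names (`amount`, `country`, `timestamp`) and the chosen features are assumptions for illustration; the actual features used by the engine are not described in this article.

```python
from statistics import mean

def extract_features(transaction, history):
    """Turn one transaction plus the customer's earlier transactions into features."""
    past_amounts = [t["amount"] for t in history]
    typical = mean(past_amounts) if past_amounts else transaction["amount"]
    known_countries = {t["country"] for t in history}
    # Earlier transactions made within the hour before this one.
    last_hour = [
        t for t in history
        if 0 <= transaction["timestamp"] - t["timestamp"] <= 3600
    ]
    return {
        "amount": float(transaction["amount"]),
        # How unusual is this amount compared with the customer's habits?
        "amount_vs_typical": transaction["amount"] / typical if typical else 0.0,
        "new_country": 0.0 if transaction["country"] in known_countries else 1.0,
        "transactions_last_hour": float(len(last_hour)),
        "is_first_purchase": 0.0 if history else 1.0,
    }

# Hypothetical customer history (timestamps in seconds).
history = [
    {"amount": 20.0, "country": "NL", "timestamp": 1000},
    {"amount": 30.0, "country": "NL", "timestamp": 5000},
    {"amount": 25.0, "country": "NL", "timestamp": 9000},
]
transaction = {"amount": 250.0, "country": "BR", "timestamp": 9600}

features = extract_features(transaction, history)
print(features)
```

In this example the amount is ten times the customer's average and comes from a country the customer has not used before, which is exactly the kind of anomaly a model can pick up once it is expressed as a feature.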

Periodically, the engine groups past transactions together and creates a training set from them. Active models are retrained dynamically without interrupting the running system. Once ready, the freshly trained models replace the old ones, and the cycle repeats endlessly, getting more accurate after each iteration.
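The retrain-and-replace cycle can be sketched as follows. This is a simplified illustration, not the engine's actual architecture: the `Engine` class, the single-threshold "model" and the sample data are all hypothetical, and a lock-protected reference swap stands in for whatever mechanism the real system uses.

```python
import threading
from collections import deque
from statistics import mean

def train_model(batch):
    """Toy training step: flag amounts above the midpoint between the
    average legitimate amount and the average fraudulent amount."""
    legit = [amount for amount, is_fraud in batch if not is_fraud]
    fraud = [amount for amount, is_fraud in batch if is_fraud]
    threshold = (mean(legit) + mean(fraud)) / 2
    return lambda amount: 1.0 if amount >= threshold else 0.0

class Engine:
    def __init__(self, labelled_batch, window=1000):
        # The initial batch must contain both legitimate and fraudulent examples.
        self._history = deque(labelled_batch, maxlen=window)
        self._model = train_model(list(self._history))
        self._lock = threading.Lock()

    def score(self, amount):
        """Score with whichever model is currently active."""
        with self._lock:
            model = self._model
        return model(amount)

    def record(self, amount, is_fraud):
        """Store a processed, labelled transaction for later training."""
        with self._lock:
            self._history.append((amount, is_fraud))

    def retrain(self):
        """Train a fresh model on recent history, then swap it in."""
        with self._lock:
            batch = list(self._history)
        if len({is_fraud for _, is_fraud in batch}) < 2:
            return  # keep the current model if one class is missing
        fresh = train_model(batch)  # slow part runs outside the lock
        with self._lock:
            self._model = fresh

engine = Engine([(10, False), (20, False), (200, True), (300, True)])
before = engine.score(90)

# Fraudsters switch to smaller amounts; the engine records the new pattern.
for amount, is_fraud in [(80, True), (90, True), (85, True), (15, False)]:
    engine.record(amount, is_fraud)

worker = threading.Thread(target=engine.retrain)  # retraining in the background
worker.start()
worker.join()
after = engine.score(90)
print(before, after)
```

Scoring only holds the lock long enough to read the active model, so requests keep being served with the old model while the new one is being trained.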

This method has several advantages. As the engine evolves, it gets to know its customers and offers better results for recurring ones. The models also adapt quickly to new fraud patterns and are able to correct themselves automatically, without any human intervention. For better results, the engine uses different machine learning models, which can be configured to use a wide range of features. All those models are constantly evaluated and trained to maximise the savings for each merchant. Different engines connected to different sources may end up using completely different models and features: each engine is tailored to its specific merchants. The more transactions an engine processes, the smarter it becomes.
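One way to picture per-merchant model selection is shown below. Everything in it is an assumption made for the sake of the example: the savings formula (blocked fraud counts as money saved, a blocked legitimate payment costs a fraction of its amount), the two candidate models and the merchant data.

```python
def savings(model, transactions, false_positive_cost=0.1):
    """Hypothetical savings metric: amount of fraud blocked, minus a penalty
    for each legitimate transaction that was blocked by mistake."""
    total = 0.0
    for tx in transactions:
        if model(tx):  # the model flags this transaction
            if tx["fraud"]:
                total += tx["amount"]
            else:
                total -= false_positive_cost * tx["amount"]
    return total

# Two deliberately simple candidate models.
candidates = {
    "high_amount": lambda tx: tx["amount"] > 100,
    "foreign_card": lambda tx: tx["foreign"],
}

def best_model(transactions):
    """Pick the candidate that maximises savings on this merchant's history."""
    return max(candidates, key=lambda name: savings(candidates[name], transactions))

# Made-up transaction histories for two hypothetical merchants.
merchants = {
    "electronics_shop": [
        {"amount": 500, "foreign": False, "fraud": True},
        {"amount": 30, "foreign": True, "fraud": False},
        {"amount": 40, "foreign": False, "fraud": False},
        {"amount": 700, "foreign": False, "fraud": True},
    ],
    "travel_agency": [
        {"amount": 900, "foreign": False, "fraud": False},
        {"amount": 60, "foreign": True, "fraud": True},
        {"amount": 1200, "foreign": False, "fraud": False},
        {"amount": 80, "foreign": True, "fraud": True},
    ],
}

chosen = {name: best_model(history) for name, history in merchants.items()}
print(chosen)
```

On this toy data the two merchants end up with different models, which is the kind of tailoring described above.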

As the system operates, our team is able to focus on providing raw data of better quality and on finding new features from existing data. Those new elements can be seamlessly introduced into an existing engine which will start using them on the next iteration.

In the end, what our engine provides is a risk score for each transaction. Combined with a more traditional rule-based system, the technology helps to significantly reduce the number of fraudulent transactions and saves valuable time and resources, while teaching itself to combat the evolving fraud trends of the future.
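To illustrate how a risk score and a rule-based system can work together, here is a minimal sketch. The rules, the placeholder country code `XX`, the amount limit and the two thresholds are hypothetical values picked for the example, not Dimebox settings.

```python
# Hypothetical hard rules: each is a name plus a check on the transaction.
RULES = [
    ("blocked_country", lambda tx: tx["country"] in {"XX"}),
    ("amount_over_limit", lambda tx: tx["amount"] > 5000),
]

def decide(tx, risk_score, review_at=0.5, decline_at=0.9):
    """Combine rule hits with the model's risk score into a single decision."""
    triggered = [name for name, rule in RULES if rule(tx)]
    if triggered or risk_score >= decline_at:
        return "decline", triggered
    if risk_score >= review_at:
        return "review", triggered
    return "accept", triggered

print(decide({"country": "NL", "amount": 50}, risk_score=0.1))
print(decide({"country": "NL", "amount": 50}, risk_score=0.6))
print(decide({"country": "NL", "amount": 9000}, risk_score=0.1))
```

In this sketch the rules act as hard limits that apply regardless of the score, while the score handles the cases no rule anticipates.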