Detecting Risk in Real Time: Architecture, Machine Learning, and Operations
@dalexeenko | April 21, 2023
This is the second in a series of blog posts on building risk detection and mitigation systems.
#Building a strong architecture of the risk system
Once you know what kinds of fraud and risk you're dealing with, it's time to build a system to fight it. Let's break down how to do this in a way that's clear and effective.
When we started our fraud-fighting journeys at Airbnb and Stripe, we tried to fight fraud by building an extensive barrage of different risk defenses. We built hundreds of rules, heuristics, requests for information, and rate limiters. Many of our initial defenses were built in isolation, with heavy reliance on manual reviews. It worked great until it didn't. As we scaled, we ran into problems:
- Users got confused because we sometimes asked for the same information more than once.
- Some of our defenses actually interfered with each other.
- It was hard to understand how everything worked together, which let some fraud slip through. There was no easy way to govern, audit, and fine-tune the system.
- Backlogs. We relied heavily on human expertise, which meant we had more risky events to evaluate than qualified humans to evaluate them. As a result, either risk slipped through or good users had their accounts impacted for far longer than necessary (and that led to churn!).
To simplify our risk defenses and ensure the whole system scaled as the company grew, we came up with a risk architecture that would work across all triggers, rules, and scenarios:
On a high level, this architecture had two major parts: detection (machine learning models, heuristics, anomaly detection systems) and mitigation (actions or interventions including human reviews).
It all started with an event (deemed critical enough from a risk perspective). An event could have been a proactive user action (a user adding a new payment method) or a state machine transition not caused by any user actions (asynchronous update).
For every event that we were concerned about there was a risk check (or a set of risk checks). These could be (a) machine learning models, (b) rate limiting (checking if someone's doing things too quickly), (c) rules and heuristics our operational teams set up, (d) lists and databases of known bad actors, devices, and online fingerprints.
At the end of the day, every risk check had three potential outcomes:
- Green: low risk, no action needed.
- Yellow: medium risk, requiring an intervention, such as a user-facing check (asking for a verified phone number), a behind-the-scenes API call, or a manual human review.
- Red: high risk, which usually required reducing the user's capabilities to minimize further risk. This could be a spectrum: from something lightweight (e.g., slowing down the user's payout), to pausing certain capabilities (the user can't change their linked bank account), to suspending the account altogether.
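As a sketch, the three outcomes above could map to actions roughly like this. Every name and threshold here is illustrative, not the actual Airbnb or Stripe code:

```python
from enum import Enum

class RiskOutcome(Enum):
    GREEN = "green"    # low risk: no action
    YELLOW = "yellow"  # medium risk: intervene
    RED = "red"        # high risk: reduce capabilities

def classify(score: float) -> RiskOutcome:
    # Hypothetical thresholds; real systems tune these per risk check.
    if score < 0.3:
        return RiskOutcome.GREEN
    if score < 0.8:
        return RiskOutcome.YELLOW
    return RiskOutcome.RED

def handle(outcome: RiskOutcome) -> str:
    if outcome is RiskOutcome.GREEN:
        return "allow"
    if outcome is RiskOutcome.YELLOW:
        return "request_phone_verification"  # or an API call, or manual review
    return "pause_payouts"  # start with the lightest capability reduction

print(handle(classify(0.95)))  # -> pause_payouts
```

The point of keeping classification and handling separate is that new risk checks can plug into the same outcome taxonomy without touching the mitigation side.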
Every user who was presented with an intervention or whose capabilities were reduced had a way to appeal. This was an extremely important detail: inevitably our risk systems would have false positives, hence, it was imperative to provide a way for falsely identified accounts to appeal and potentially reverse the decision.
Having a clear system architecture like this helped us in a few ways. It became easier to understand how everything connected together. We could add new fraud checks without messing up the whole system. Our users had a much better experience. Ultimately, we were able to prevent more risk while making fewer mistakes.
#Fraud is a numbers game
When fighting fraud, it can be tempting to aim at getting it down to zero. Don't. Instead, think of it as an optimization function.
While it's practically impossible to stop all fraud, it's important not to let fraudsters cut corners. Make the bad actors sweat: force them to cycle through IP addresses, email addresses, phone numbers, devices, and identities. The more expensive you make it for them, the less incentive they have to use your product for their nefarious purposes. At some point, they may decide to find another platform that's more lucrative to exploit.

Force fraudsters to do extra work throughout your entire product lifecycle. Look at every single piece of information users provide in the signup form. Look at how quickly they perform actions on the website. Look for similarities in everything: from email addresses, to phone numbers, to browsing sessions, to BINs. Leverage locality-sensitive hashing to generate similarity hashes, then use Hamming distance and Jaccard similarity between those hashes to find things that look similar.

Remember, fraudsters usually go for the easiest targets. Your goal is to not be that target.
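To make the similarity idea concrete, here is a minimal sketch using character trigrams and Jaccard similarity. This is a crude stand-in for a real MinHash/LSH pipeline (which would hash the shingles so comparisons scale), and the email addresses are made up:

```python
def trigrams(s: str) -> set:
    """Character 3-grams: the 'shingles' a MinHash/LSH pipeline would hash."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: |intersection| / |union| of the trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Near-duplicate emails from the same fraud ring score far higher
# than unrelated addresses.
print(round(jaccard("john.smith01@mail.com", "john.smith02@mail.com"), 2))
print(round(jaccard("john.smith01@mail.com", "alice@example.org"), 2))
```

At scale you would replace the exact set comparison with locality-sensitive hashes so that near-duplicates land in the same buckets without pairwise comparison.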
Fighting fraud is an evergreen optimization function between three major components of the equation:
- Fraud losses. This includes direct and indirect financial fraud losses, as well as things like brand damage.
- Losses due to bad user experience. Every time you add a step to catch fraud (like a CAPTCHA), some good users might get distracted or stuck and leave.
- Operational expenditure. This includes people reviewing suspicious activity, buying data from other companies, or using external verification APIs.
As you scale, it becomes more and more tricky to balance these three things. We used to joke at Airbnb that the surest way to stop fraud is to shut down the website — but that's not a real solution! Here's how we approached it:
- Be careful not to overdo it. Since it can be hard to assess the impact of fraud and risk defenses on legitimate users, it's easy to overcorrect and decline too many accounts and transactions. Find a way to measure the impact on good users, especially the silent sufferers who don't complain or escalate.
- Try to put a dollar value on each of these three parts. This will help you compare them and make intentional trade-offs.
- Remember, some fraud might be less costly than losing good customers or spending too much on prevention and running your operational teams.
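Putting a dollar value on the three components can be as simple as summing them. A toy sketch, with every number hypothetical:

```python
def total_cost(fraud_loss, blocked_good_users, avg_user_ltv,
               reviews, cost_per_review):
    """Total dollar cost: fraud losses + lost good users + operational spend."""
    ux_loss = blocked_good_users * avg_user_ltv  # good users churned by friction
    opex = reviews * cost_per_review             # manual review spend
    return fraud_loss + ux_loss + opex

# Aggressive blocking: less fraud, but more good users lost and more reviews.
aggressive = total_cost(10_000, 200, 150, 500, 8)
# Lenient: more fraud gets through, far fewer false positives.
lenient = total_cost(35_000, 20, 150, 200, 8)
print(aggressive, lenient)  # -> 44000 39600
```

In this made-up scenario the lenient policy is actually cheaper overall, which is exactly the point: eating some fraud can beat over-blocking.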
There is an excellent Airbnb blog post Fighting Financial Fraud with Targeted Friction that provides an example of how we leveraged targeted user friction to optimize our overall loss function while battling chargebacks.
#Be clear about metrics
You can't improve what you don't measure. So it's important to establish clear metrics. Those can be financial losses (in dollars and in bps), measures of user experience degradation or user pain (incorrect account suspensions that get overturned), etc.
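For example, expressing losses in basis points of processed volume keeps the metric comparable as you grow. The figures below are illustrative:

```python
def loss_bps(fraud_losses: float, processed_volume: float) -> float:
    """Fraud losses expressed in basis points (1 bps = 0.01%) of volume."""
    return fraud_losses / processed_volume * 10_000

print(loss_bps(50_000, 100_000_000))  # $50k lost on $100M processed -> 5.0 bps
```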
#Invest early in machine learning
Fighting fraud can feel like looking for a needle in a haystack. It's usually tricky for two reasons. First, most activity on a platform is good, making it hard to spot suspicious behavior; classification gets very hard. Second, bad actors do their best to cover their tracks and hide their bad activity, so you have to look at hundreds or thousands of possible signals.
Let's break down how to tackle this challenge.
- Collect data. Gather as much information as you can. This includes things like IP addresses, account age, email addresses, action velocity, browser information, payment details, etc. Collect this when people sign up and throughout their time using your service.
- Classify. Figure out if something looks risky. This is where you use all that data you collected to make a judgment.
- Take action. If something looks suspicious, you might ask the user to verify their identity, have them solve a CAPTCHA, or limit what they can do on your platform (e.g., pause payouts). In extreme cases, suspend their account.
Fighting fraud is an ongoing process. It also requires the collaboration of many different teams: operations, data science, engineering, product, etc. Here's how it typically goes:
- It all starts with understanding. The operations team are the front-line workers: they spot fraud patterns day-to-day and understand what's really happening.
- Operations hands these insights to the data science team, which designs features that can be used to predict fraudulent behavior and trains machine learning models to score activity as it happens.
- The engineering team runs these models in production and is responsible for mitigation when something is deemed high-risk.
- In certain cases, the operations team reviews the decisions the system has made, and in doing so deepens its understanding of what's happening on the platform.
The speed of this evaluation loop is critical. Bad actors adapt quickly, often within hours or even minutes, and your system needs to keep up. So automate what you can: try to handle most fraud automatically, so your team can focus on the tricky cases and the new patterns. Leverage machine learning as much as you can. As you grow, you'll want more advanced modeling (convolutional and recurrent neural networks), but it's more than fine to start with logistic regression, random forests, and XGBoost.
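To show how simple a starting point can be, here is a from-scratch logistic regression on two toy features. The features, values, and thresholds are all made up for illustration; in practice you would reach for scikit-learn or XGBoost rather than hand-rolling gradient descent:

```python
import math

# Toy training data: [actions_per_minute, account_age_days] -> label (1 = fraud).
X = [[50, 1], [40, 2], [45, 0], [2, 300], [1, 500], [3, 200]]
y = [1, 1, 1, 0, 0, 0]

def sigmoid(z):
    z = max(min(z, 30.0), -30.0)  # clamp to avoid float overflow
    return 1.0 / (1.0 + math.exp(-z))

# Normalize features so gradient descent behaves.
means = [sum(col) / len(col) for col in zip(*X)]
stds = [max((sum((v - m) ** 2 for v in col) / len(col)) ** 0.5, 1e-9)
        for col, m in zip(zip(*X), means)]
Xn = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

# Stochastic gradient descent on the log loss.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(500):
    for row, label in zip(Xn, y):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)
        err = p - label
        w = [wi - lr * err * xi for wi, xi in zip(w, row)]
        b -= lr * err

def score(row):
    rn = [(v - m) / s for v, m, s in zip(row, means, stds)]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, rn)) + b)

print(score([48, 1]) > 0.5)   # fast actions on a brand-new account -> risky
print(score([2, 400]) > 0.5)  # slow actions on an old account -> not risky
```

Even this toy model separates the two behaviors; real systems differ mainly in feature count, data volume, and model class, not in the basic loop.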
It's important to get the training and inference architecture right — from the 30,000-foot view the one we built at Airbnb and Stripe looked like this:
It all started with either synchronous calls to the risk gateway service or asynchronous Kafka events. The first thing the system needed to do was collect features to make a prediction, so it would call the feature service (we had to fetch hundreds or thousands of features quickly and efficiently). The feature service would gather features from online and offline data storage or services, and when it returned the feature values it would also log them in the data warehouse. The feature values would then get sent to the scoring (online inference) service, which would produce a model score. If the score was high enough, the entity (account, transaction, message) could get challenged, suspended, or sent for a manual review. The answers from those manual reviews would get stored in the production database as labels, which would then be exported to the data warehouse. Now all the ground truth needed to train the model lived in the data warehouse. The training service would produce a model that we would then upload to the model service.
There are a lot of systems in play here, but one of the most challenging engineering problems is building a feature engineering service that can both extract hundreds or thousands of features and resolve inconsistencies between feature definitions in online databases and the data warehouse. At Airbnb and Stripe, we jointly built a highly optimized feature engineering framework called Chronon.
#Your secret weapon is manual reviews
While machine learning is important, how you manage your review process can be even more crucial. Anyone who has worked in machine learning may find this obvious, but it's worth reiterating: your system is only as good as your data. If your ground truth isn't accurate, you are severely limiting the precision and recall of your models. That's where human reviews come in. They can help:
- Measure how well you are doing (effectively QA the whole system)
- Get training data (high-quality labels) for machine learning models
- Work on the most tricky cases
- Take action on suspicious activity
Here are some tips for effective human reviews:
- Prioritize. As you scale you'll find yourself with dozens of manual review queues and tens of thousands of reviews to comb through. It's crucial to have a robust prioritization framework that assigns weight to manual reviews based on probabilities, monetary value, urgency, etc.
- Don't aim for zero backlog. It's okay to have some reviews left at the end of the day.
- Build systems and tooling to look at things in relation to others, not in isolation. This is where clustering systems (text and photo similarity, graph connectivity, systems that find duplicates) come in.
- Show enough information. Give reviewers all the data they need to make a decision.
- Be selective. Aim for about 1 review per 1000 transactions / entities.
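A prioritization framework like the one described above can start as a simple scoring function. This one, with made-up weights and cases, ranks reviews by expected loss with an urgency boost:

```python
def review_priority(p_fraud: float, amount_usd: float,
                    hours_to_deadline: float) -> float:
    """Hypothetical priority: expected dollar loss, boosted near the deadline."""
    urgency = 1.0 / max(hours_to_deadline, 1.0)
    return p_fraud * amount_usd * (1.0 + urgency)

queue = [
    ("case_a", review_priority(0.9, 50, 48)),    # likely fraud, small amount
    ("case_b", review_priority(0.4, 5000, 2)),   # uncertain, large, urgent
    ("case_c", review_priority(0.1, 200, 72)),   # probably fine
]
for case, score in sorted(queue, key=lambda item: -item[1]):
    print(case, round(score, 1))
```

Note how the large uncertain case outranks the near-certain small one: weighting by monetary value is what makes it safe to leave a backlog of low-stakes reviews at the end of the day.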
Visualization is important. You don't need fancy graphics. Simple visuals that help reviewers work faster are best. For example, PayPal used red backgrounds for cases with high fraud scores, helping reviewers spot them within a fraction of a second.
Agent efficiency and quality are paramount. Don't just throw more people at the problem. Instead, focus on making your system smarter and your reviews more efficient. Look at simple things like reviewer agreement: if you give the same case to two people, do they agree on whether it's fraud? If reviewer agreement is low, you either (a) don't have enough data (you need to collect more from the user or from third parties), or (b) have subpar training materials, workflows, and change management.
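One common way to measure reviewer agreement is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch on toy labels (the reviewer data is made up):

```python
def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance; 1.0 = perfect, ~0 = chance-level."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a_fraud = sum(labels_a) / n
    p_b_fraud = sum(labels_b) / n
    # Probability the two reviewers agree purely by chance.
    expected = p_a_fraud * p_b_fraud + (1 - p_a_fraud) * (1 - p_b_fraud)
    return (observed - expected) / (1 - expected)

# 1 = reviewer marked the case as fraud, 0 = not fraud.
reviewer_1 = [1, 1, 0, 0, 1, 0, 1, 0]
reviewer_2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # -> 0.5
```

A kappa this low on real data would be a signal to dig into training materials and the information surfaced in the review tool before hiring more reviewers.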
#Acquire data as needed
Sometimes, your own data isn't enough to spot fraud. That's when it's helpful to bring in information from other companies. Different companies offer different types of data. Here are some examples:
- Identity verification. Companies like Socure, Persona, Jumio, LexisNexis can help confirm if someone is who they say they are.
- Fraud. Sift, Alloy, and Sardine use advanced techniques to spot patterns that might indicate fraud.
- Background checks. Checkr can provide information about a person's history.
- Financial information. Plaid can give insights into financial accounts.
- Phone and email verification. Twilio, Telesign, Ekata, and Emailage can help verify contact information.
- Location data. MaxMind can provide information about where a user might be located.
In part three, we take a deep dive into the core concepts of an efficient risk system: rules, decisions, and policies.
Thanks to Dasha Cherepennikova, Eugene Shapiro, Simon Hachey, Steve Kirkham, and Tara Sandhu for reading drafts of this.