Detecting Risk in Real Time: The Technology
@dalexeenko | May 22, 2023
This is the third in a series of blog posts on building risk detection and mitigation systems.
Background
Trust is the fundamental currency of the sharing economy. Every time someone gets into a Lyft rather than a cab, for example, they are making an implicit declaration of trust: I, the passenger, trust you, the driver, to safely get me to my destination. Similarly, the driver is also affirming their own faith in the passenger to treat their car well and respect their possessions, space and time.
Airbnb is the same game, just at higher stakes. Travelers using Airbnb are going out on a limb, avoiding the standard experience of a hotel for something a little more uncertain. They are trusting that their host will provide the experience described by the Airbnb listing, with certain expectations of cleanliness, responsiveness, and safety. Hosts are literally welcoming strangers into their homes, and while they are insured up to one million U.S. dollars by our Host Guarantee, they still take on a significant amount of risk with every reservation. Without either party’s willingness to trust one another throughout the entire process, our online marketplace would come to a screeching halt.
Unfortunately, like most online marketplaces, Airbnb is sometimes targeted by fraudsters trying to undermine this trust and take advantage of our community. This can take a variety of forms: a bad actor trying to take over a host's account, or a scammer trying to use a stolen credit card to book a reservation. Our team - the Trust team - is dedicated to getting ahead of these issues. Our mission is to build the world's most trusting community and to protect it from exploitation and fraud.
Over the years, our team has built a number of defenses in pursuit of this mission. We essentially served as a platform team for different business units - Homes, Experiences, China - focusing on mitigating fraud in their businesses and enabling their product teams to focus on work more directly relevant to their goals. Like many other aspects of our code base, these defenses have grown incrementally to suit product needs as they arise. In spite of our growing defenses, every time a business unit adds new functionality to their systems, there is always a risk of it going unnoticed and hence unprotected by the Trust team's defenses. Even for product features we are aware of, engineers' solutions might diverge, and defenses become difficult to track, understand, and maintain. A number of different entry points into our systems have developed in this ad hoc fashion, significantly lowering our overall observability. Lacking a clear audit log, our Trust & Safety agents, i.e. the people reviewing potential fraud, performing investigations, and dealing with incidents directly, needed to spend minutes, if not hours, tying various pieces of information together for things as simple as understanding why certain transactions had been declined.
Based on the above, it was clear that we needed to standardize and take a more unified approach to defense. Teams building their own systems for defense would not scale in the long term. We needed a system that would allow, at low cost, the creation of new rules and machine-learning models, standardized observability and logging, first-class data warehouse support, and tooling. We wanted to help Airbnb’s business verticals, like Homes and Experiences, focus on their product and user experiences and not on fraud mitigation, so we decided to build a centralized system called Enforcement Framework, a set of tools to detect and mitigate fraud.
Concepts
Airbnb is a community made of millions of people who interact with our platform every day. In order to protect our community, we need to constantly evaluate activity occurring on the site, making decisions in real time on whether or not we consider the activity risky. If something suspicious is happening, we can block the action, show the user a security challenge, or even suspend their account. This decision-making needs to happen broadly across the site in a way that does not significantly affect latencies or drastically hamper the experience of good users. Additionally, the system needs to support dozens of developers writing rules, deploying machine learning models, and monitoring their endpoints in real time.
At a high level, our fraud detection architecture can be represented like this:
Our risk detection systems, which comprise rules, heuristics, and machine learning models, listen to and analyze everything that happens on the platform and either allow an action to proceed (Low risk), present a user challenge (Medium risk), or block the action altogether (High risk). Given our requirements, it made sense to treat the fundamental unit of input to our system as an event. An event could represent a user attempting a login, sending a message to a host, making a reservation, or any of the other numerous actions people can take on Airbnb. For example, an event representing an update to a listing might look like:
case class ListingEditEvent(
  eventId: EventId,
  timestamp: Timestamp,
  userId: UserId,
  listingId: ListingId,
  requestContext: RequestContext,
  editedFields: List[String],
  defenderSignals: Map[String, Any]
) extends Event
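The supporting types referenced above (Event, EventId, and so on) aren't shown in this post; a minimal sketch of what they might look like, with purely illustrative names and shapes, could be:

// Illustrative sketch only: assumed shapes, not our actual definitions.
final case class EventId(value: String) extends AnyVal
final case class Timestamp(epochMillis: Long) extends AnyVal
final case class UserId(value: Long) extends AnyVal
final case class ListingId(value: Long) extends AnyVal
final case class RequestContext(ipAddress: String, userAgent: String)

// Every input to the framework implements Event, so the engine can route,
// log, and replay all activity uniformly regardless of its concrete type.
trait Event {
  def eventId: EventId
  def timestamp: Timestamp
}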
When an event reaches the system, it is then evaluated by rules. These rules accept an event as input to make a determination and output a decision. A decision is a classification of an entity (user, reservation, listing, etc) with a specific type of risk that it represents. For a fake listing, this could look like:
case class FakeListing(
  listingOwner: UserId,
  listingId: ListingId,
  confidence: ConfidenceLevel
) extends Decision
This decision then gets interpreted by a policy, which translates it into actions to take against the user (a rough sketch of such a policy follows the list below). Adding a layer of indirection between rules and actions gives the system a few nice properties:
- We get a human readable interpretation of the risk that is posed, along with the action that is taken. This provides an important audit log for the system (e.g. we have the context to know we blocked a message because we classified it as spam).
- We achieve consistency in actions for a given risk type and confidence. We may want to directly block spam messages for now, but in the future, we might want to change our policy and show the user a captcha. Only one place in code would need to be updated, not every rule that wants to block.
- We can share decisions across both automated rules and human investigators. No matter if a human or a rule classifies a listing as fake, we can always perform the same set of mitigating actions.
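As a rough sketch of that indirection (the Action types, confidence cases, and Policy trait below are illustrative assumptions, not the framework's actual API), a policy for the FakeListing decision might look like:

// Hypothetical action and policy types, sketching the decision-to-action
// indirection described above.
sealed trait Action
case class SuspendListing(listingId: ListingId) extends Action
case class ShowChallenge(userId: UserId) extends Action
case object NoAction extends Action

trait Policy[D <: Decision] {
  def interpret(decision: D): List[Action]
}

// One place encodes what we do about fake listings; automated rules and human
// investigators both emit FakeListing decisions and share this behavior.
object FakeListingPolicy extends Policy[FakeListing] {
  override def interpret(decision: FakeListing): List[Action] =
    decision.confidence match {
      case HighConfidence()   => List(SuspendListing(decision.listingId))
      case MediumConfidence() => List(ShowChallenge(decision.listingOwner))
      case _                  => List(NoAction)
    }
}

Changing how we respond to fake listings (say, showing a challenge instead of suspending the listing) would then only require touching this one policy, which is exactly the consistency property described above.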
Putting it all together, the system looks like this:
This allows for a central system where developers can define events, rules, and policies in a standardized way across our different defense teams.
Optimizing the Rules
While these abstractions provide a basic model for how the system would work, the reality of implementing it is much more complicated. Rules are not simply boolean logic on the data contained in the input event - they almost always need more data from external services across the company to make their determination. Oftentimes, the same piece of data will be accessed by multiple rules, making caching important. Additionally, rules need to be performant, and developers should not need to worry about concurrency explicitly.
In order to solve this, we use a Haxl-like monad, which we call Fetch, to provide concurrent access to data points from our rules. From a rule author's perspective, a Fetch represents a value that will be available at some point in the future. Internally, however, Fetch constructs a dependency graph between requested data points and concurrently fetches them from their data sources. Fetch also provides caching of the accessed data to ensure that different rules issuing the same requests receive the same results. A sample rule fetching data about a user and a listing and then scoring an ML model might look something like:
class ListingMLModelRule extends Rule[ListingEditEvent] {
  override def evaluate(event: ListingEditEvent): Fetch[List[RuleResult]] = {
    for {
      (userInfo, listingData) <- join(
        userService.getInfo(event.userId),
        listingService.getListingData(event.listingId))
      mlModelScore <- mlModelService.scoreListing(userInfo, listingData)
    } yield {
      if (mlModelScore > 0.90)
        List(FakeListing(event.userId, event.listingId, HighConfidence()))
      else
        List.empty
    }
  }
}
This code fetches user and listing information in parallel, and then uses the results to score a machine-learning model. If the model's score is higher than a threshold, we classify the listing as a FakeListing.
In a more complicated scenario, we may have multiple rules making requests to various services. Our Fetch library creates a combined DAG of these requests in the background, sharing execution across all rules. The diagram below illustrates a sample scenario where we have two separate rules that make data fetches:
Notice that we only make one call to the User and Listing services, even though both rules independently request data from them.
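Since Fetch is an internal library, the sketch below is only an approximation of the kind of surface a Haxl-style abstraction exposes: monadic sequencing via map and flatMap, explicit parallelism via join, and deduplication of identical requests. The names and signatures are assumptions for illustration, not the real API.

// A heavily simplified, illustrative Fetch-like interface.
trait Fetch[+A] {
  def map[B](f: A => B): Fetch[B]
  def flatMap[B](f: A => Fetch[B]): Fetch[B]
}

object Fetch {
  // Lift an already-known value into the monad.
  def pure[A](a: A): Fetch[A] = ???

  // Wrap a remote request; requests with the same key within one evaluation
  // are fetched once, cached, and shared across all rules.
  def fromRequest[K, A](key: K)(run: K => A): Fetch[A] = ???

  // Combine two independent fetches so they can execute concurrently instead
  // of sequentially, letting the library build a wider dependency graph.
  // (Imported unqualified in the sample rule above.)
  def join[A, B](fa: Fetch[A], fb: Fetch[B]): Fetch[(A, B)] = ???
}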
In addition to adopting a monadic data-fetching paradigm, rules themselves follow a very functional style as well. We enforce that rules are side-effect free, and that any intended state change (incrementing counters, recording metrics, publishing a message to Kafka, etc.) is returned as an explicit value in the list of RuleResults from the evaluate method. In addition to making testing easier, this gives us nice properties like being able to preview evaluations and learn what they would do, and, in the future, even being able to backtest rules by mocking out side effects and fetches with offline data.
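To make that concrete, RuleResult can be thought of as a small hierarchy of plain values describing intended effects; the specific cases below are a sketch under that assumption rather than the actual internal definitions:

// A hypothetical RuleResult hierarchy: decisions and other intended effects are
// ordinary values, so evaluate() never touches the outside world itself. The
// engine (or a preview/backtest harness) decides whether to execute them.
trait RuleResult
trait Decision extends RuleResult                    // e.g. FakeListing above
case class IncrementCounter(name: String, by: Long = 1) extends RuleResult
case class RecordMetric(name: String, value: Double) extends RuleResult
case class PublishToKafka(topic: String, payload: Array[Byte]) extends RuleResult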
The fact that our logic uses common abstractions means that we are able to deliver the impact from these optimizations across the full breadth of events that occur on the Airbnb platform. All forms of risk evaluation, from payment risk to fake inventory detection, are able to take advantage of these improvements.
Impact
We’ve seen how the use of common abstractions gives us more leverage on our performance improvements. In the online setting, this allows us to minimize service requests even as we’ve scaled to over one hundred million events (and one billion data fetches) per day. What’s more, the use of these abstractions has delivered further impact by allowing us to standardize other aspects of our systems. Before, when different defense teams built and maintained their own systems, it could be difficult to determine where exactly an event had been evaluated, what the outcome of that evaluation was - or whether said evaluation had occurred at all. Now that we have a single, consolidated entry point to our evaluation processes, we can log not only every event but also every fetch, decision, and resulting action that occurs over the course of the evaluation. By standardizing this data, we have been able to drastically improve our observability into a variety of crucial questions: the performance of our evaluations, the efficacy of our challenges, and the particular information that led to a user being challenged or suspended.
One aspect of this improved observability is better developer tooling. We have built a tool where engineers can easily query not only what decisions were made about an event or user, but also the exact values of the data fetches that led to those decisions. Standardizing the data allowed us to build generalized tools in a way we were never able to before.
This standardization has also been absolutely essential for improving the way we interact with offline data. Previously, as with the implementation of the defenses themselves, data for offline analysis was logged in an ad hoc fashion with no common schema. Engineers and data scientists writing pipelines around evaluation data needed deep, implementation-level knowledge of the structure and content of the information being emitted. Even worse, this information would vary across event types and fraud vectors - switching domains in the offline world required relearning everything about what the data looked like in that new domain. Building anything more generalized was nearly impossible.
Now, events are published to Kafka at each point in the evaluation process using a set of shared schemas. These events are then loaded into our data warehouse where they are queryable via Hive and Presto within an hour of being emitted. Engineers and data scientists only have to learn the details of one particular set of schemas to work on pipelines or build generalized visualizations like the Sankey diagram seen below.
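As an illustration of what one of these shared schemas could contain (the field names here are hypothetical, not our real schema), an evaluation log record might look like:

// A hypothetical evaluation-log record, published to Kafka at each stage of an
// evaluation and later queryable from the data warehouse via Hive or Presto.
case class EvaluationLogRecord(
  eventId: String,                     // the event being evaluated
  eventType: String,                   // e.g. "ListingEditEvent"
  ruleName: String,                    // which rule produced this record
  fetchedData: Map[String, String],    // serialized data points the rule fetched
  decisions: List[String],             // decisions emitted, e.g. "FakeListing"
  actions: List[String],               // actions the policy translated them into
  evaluatedAtMillis: Long              // evaluation timestamp
)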
This is just the start for our team. We’re continuing to double down on the performance, reliability, offline data quality and developer friendliness of our system, and we’re excited to see the impact of these improvements realized across all of the teams (roughly sixty engineers) that contribute to the project.
Conclusion
As new product needs arise, software that grows incrementally can often grow in an unplanned fashion and become needlessly complex. This was the previous state of our risk evaluation systems, and it incurred costs in a variety of ways - increased maintenance, siloed knowledge, lack of online observability and inscrutable offline data. By aligning on abstractions that fit current and planned use cases, we built a new rules engine which could support our evaluation logic in a standardized and generalized way. This standardization allowed us to make improvements on a myriad of aspects - performance, observability, logging, data quality - and scale these improvements out to all of the engineers working on the new platform.
Still, this is just the beginning for our team. Having a solid foundation means we are in a significantly better position to make some of the big bets that, until now, we’ve only talked about. We are already beginning to introduce new features like back-testing, sequence modeling, state-of-the-art photo and text classifiers, clustering systems to examine text, photo, and video content, and real-time rules that will empower our Trust & Safety agents, the people most directly familiar with the issues facing us, to fight fraud trends head-on. These improvements to our reaction time and effectiveness will only become more important moving forward. As Airbnb continues to offer new products, such as the recently launched Adventures and Luxe, user behavior and the scope of potential risk on our platform will continue to grow alongside them. With a shared framework in place, improvements no longer need to be rebuilt piecemeal and can be easily shared across products and teams. New teams will be in significantly better positions to build and iterate on defenses quickly as new products and trends emerge. With the implementation of our enforcement platform, we hope to be prepared to protect our community, however it might grow and into whatever space Airbnb might be headed.
Thanks to Dasha Cherepennikova, Eugene Shapiro, Simon Hachey, Steve Kirkham, and Tara Sandhu for reading drafts of this.