Using Log Data And Machine Learning To Weed Out The Bad Guys

We have seen in the past how security threats often originate inside organizations. While high-profile data breaches by shady external actors make for a dramatic story that attracts the media and Hollywood, there is arguably even greater risk from internal IP theft. Indeed, losses due to IP theft are estimated at more than $300 billion each year. I recently came across a case study involving an unnamed $20 billion manufacturer. The company, which for obvious commercial-sensitivity reasons didn't want to be named, was recently the victim of internal IP theft and spent a year and more than a million dollars working with a traditional security vendor trying to identify the rogue players.

What is interesting is that the company applied some new approaches to identifying who was to blame. It combined the aggregation of log data (the metadata that is created any time IT systems are used) with high-level analytics and machine learning tools. Using log data from version management vendor Perforce and analytics from vendor Interset, the company got to work crunching the data it had. The two vendors have a joint solution, Helix IP Threat Detection, which aims to accurately detect threats and prioritize them by assigning each a relative risk score.

To give some context to this "needle in a haystack" problem: the company in question had over 20,000 developers, and since it needed log data from a meaningful chunk of time, it wanted to analyze 30 days' worth of logs. In all, it was looking to run analytics over millions of different log events.

The solution looks at usage patterns (how individuals within the organization act) and identifies regularities. These could relate to timing ("a particular user tends to access a particular type of data on weekdays") or to frequency ("a particular user tends to check out code once a day"). With these patterns identified, the solution looks for unusual events ("this user checked out code at an unusual time"). The data, which is anonymized for privacy reasons, can identify individuals or projects that have unusual actions associated with them.
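
To make the idea concrete, here is a minimal sketch (my own illustration, not the vendors' implementation) of building a per-user baseline from historical log events and flagging events that fall outside it; the field names and data are invented for the example.

```python
# Minimal sketch: learn which hours of the day each user normally checks out code,
# then flag events at hours never seen in that user's baseline window.
# (Illustrative only; field names, data and thresholds are invented.)
from collections import defaultdict

def build_baselines(events):
    """events: iterable of dicts like {"user": "alice", "hour": 14, "action": "checkout"}."""
    hours_seen = defaultdict(set)
    for event in events:
        hours_seen[event["user"]].add(event["hour"])
    return hours_seen

def unusual_events(events, baselines):
    """Return events whose hour of day was never observed for that user."""
    return [e for e in events if e["hour"] not in baselines.get(e["user"], set())]

history = [{"user": "alice", "hour": h, "action": "checkout"} for h in (9, 10, 14, 16)]
today = [{"user": "alice", "hour": 3, "action": "checkout"}]  # a 3 a.m. checkout
print(unusual_events(today, build_baselines(history)))        # flags the 3 a.m. event
```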

The solution applies mathematical models to the actions it observes in the system and computes a score for each behavior that represents how risky and anomalous that particular action was. This score indicates how important it is for the customer to pay attention to that behavior. The tool calls that score the behavior risk.

The behavior risk is scored using a set of models that take into account four components: the user, the activity (the behaviors carried out by the user, compared against normal baselines), the asset (a file or source code project) and the method (the types of behaviors built into the mathematical models that look for risky actions).

The activity component calculates how anomalous a behavior is, looking for the differences between the activity predicted from the user's normal baseline and the activity the user just completed. The analytics engine then does something quite cool: it not only compares current activity against the user's own historical baseline, it also uses clustering algorithms to create groups of behavioral peers and looks for differences between a particular user's activity and that of similar developers.
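
As a rough sketch of how that peer-group comparison might work (the clustering approach, features and data below are my assumptions, not Interset's published method), one could cluster developers by their typical activity profile and then measure how far a new observation sits from the user's peer group:

```python
# Illustrative sketch: cluster developers into behavioral peer groups by their
# typical activity profile, then score a new observation against the peer group.
# (Assumed approach for illustration; not the vendor's actual algorithm.)
import numpy as np
from sklearn.cluster import KMeans

users = ["alice", "bob", "carol", "dave"]
# Rows = users; columns = [avg checkouts per day, avg MB downloaded per day]
profiles = np.array([[5, 20], [6, 25], [40, 300], [38, 280]], dtype=float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)

def peer_anomaly(user, observation):
    """Distance (in standard deviations) of an observation from the user's peer group."""
    peers = profiles[labels == labels[users.index(user)]]
    mu, sigma = peers.mean(axis=0), peers.std(axis=0) + 1e-9
    return float(np.abs((observation - mu) / sigma).max())

# A sudden burst of checkouts and downloads, far beyond what alice's peers do:
print(peer_anomaly("alice", np.array([200.0, 5000.0])))
```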

The asset component takes into account the different levels of risk or importance associated with different source code projects. The more important an asset is, the greater the risk it carries.

The method component is a series of mathematical models that look at groups of activities that define a specific risk: for example, the amount of data checked in and out of a code repository. When that ratio becomes anomalous, the risk score increases. There are many types of methods, and they are specific to the data sources collected.
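
One way to picture such a method (the formula and thresholds below are an assumption for illustration, not the product's actual equation) is a score that grows as a user's checkout-to-checkin ratio drifts away from their historical baseline:

```python
# Illustrative "method": score the checkout-to-checkin volume ratio against a user's
# historical baseline ratio. The formula and thresholds are invented for this sketch.
def ratio_method_risk(checked_out_mb, checked_in_mb, baseline_ratio, tolerance=3.0):
    """Return a 0-1 risk score that grows as the out/in ratio exceeds the baseline."""
    ratio = checked_out_mb / max(checked_in_mb, 1e-9)
    excess = ratio / max(baseline_ratio, 1e-9)
    if excess <= 1.0:
        return 0.0
    return min(1.0, (excess - 1.0) / tolerance)

# A user who normally checks out about twice what they check in suddenly pulls 50x:
print(ratio_method_risk(checked_out_mb=5000, checked_in_mb=100, baseline_ratio=2.0))  # 1.0
```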

Each behavior to be monitored can be broken down into a combination of these four components. Using them and an equation that sits within the analytics engine, the solution creates a risk measurement for any particular scenario.
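
That equation is not disclosed, but a plausible shape for it (purely my assumption) is a weighted combination of the four components:

```python
# One plausible way to combine the four components into a behavior risk score.
# (Assumed for illustration; the actual equation inside the analytics engine is not public.)
def behavior_risk(activity_anomaly, asset_importance, method_risk, user_weight=1.0):
    """All inputs are on a 0-1 scale; returns a 0-100 behavior risk score."""
    raw = user_weight * activity_anomaly * (0.5 + 0.5 * asset_importance) * (0.5 + 0.5 * method_risk)
    return round(100 * min(raw, 1.0))

# Highly anomalous activity against a critical asset, with a risky method signature:
print(behavior_risk(activity_anomaly=0.9, asset_importance=1.0, method_risk=0.8))  # 81
```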

The engine then applies a second level of computation: it calculates "entity risk", a score applied to every user and every asset. A fictitious example:

Suppose John Sneakypants is accessing an unusual source code project. That one behavior will have a certain amount of risk associated with it, which we compute, and may be important, but may also be a false positive. Perhaps John just changed his job role, so it’s okay that he is accessing this network share.
But suppose John did this at a time of day when he was never active before. The combined actions push John's entity risk score higher. And suppose he just pulled code from an inactive source code project: the risk of that event is higher, pushing John to an even higher entity risk score. And suppose he is downloading more source code than expected: John's entity risk moves higher yet again. As John's anomalous actions continue, the risk continues to grow.
Intuitively, the more of these risky behaviors coincide on the same entity (John in this case), the more the risk grows. By tying together the user's actions over time and understanding the context of those actions, we start to realize that this is not a false positive, and John begins to look like a person of interest that we should investigate.
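
A toy sketch of that accumulation (the saturating rule here is my assumption, not the product's published formula) shows how several moderately risky behaviors compound into a high entity risk:

```python
# Toy sketch of entity risk: fold each behavior risk into a per-user score that
# saturates at 100, so repeated anomalies compound. (Assumed rule, for illustration.)
from collections import defaultdict

entity_risk = defaultdict(float)

def record_behavior(user, behavior_risk_score):
    """Fold a 0-100 behavior risk into the user's running entity risk."""
    entity_risk[user] += (100 - entity_risk[user]) * (behavior_risk_score / 100)
    return entity_risk[user]

for score in (30, 45, 60, 70):  # a string of increasingly anomalous actions
    print(round(record_behavior("john.sneakypants", score)))
# Prints 30, 62, 85, 95: each additional anomaly pushes the entity risk closer to 100.
```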

By using this process, the solution determines a numerical risk score for every action within the log data and in doing so gives a customer an indication of the riskiest actions to follow up on.

Going back to the case study, those 30 days of log data included actions from two engineers who had already stolen data. The company wanted to see if the solution could detect the attack. Over two weeks Interset crunched the data and not only identified those two engineers, but also uncovered 11 other cases of internal theft. The company had this to say about its experience:

There was a ton of data [in the Perforce logs] -- 9.1 billion events, executed by some 20,000+ developers. By running historic log data from the Perforce software version control platform through a behavioral analytics engine from security startup Interset, the 9.1 billion events were turned into useful and actionable data. In under two hours, the analytics engine surfaced two “red herrings” – engineers already known by the company to have stolen information. But that’s not all – it also showed 11 other unknown bad actors; eight located in China had been replicating as many as 500,000 files a day. The analytics engine had no notion of these bad actors before the data was entered, no policies to express these actions and no threshold based rules to set off alerts. All the threats were found in real-time using native machine learning and behavioral analytics.

This type of analytics shows the value of both machine learning and the capture of log data. By capturing and storing detailed logs of individuals' interactions with systems, and running those logs through analytics engines, a wealth of insight can be extracted.
