So you’ve set up a firewall to guard your network.
You can finally lean back and take a breather. Your data is safe… right?
At this point, you know that’s a pipe dream. It’s going to take constant data analysis to adapt to threats in real time and keep your network safe.
But there’s no way your data science or security teams can sort through the thousands of security alerts coming in each day.
So what should you do?
We went to Jeet Dutta, Director of Data Science at Adlumin, to get his thoughts. In this conversation, he shares how his team is using a data science layer and unsupervised learning to pick up on security threats and adapt to the constant stream of attacks that networks face every day.
The conversation below has been edited for length and content.
What role does Machine Learning play in detecting network security threats?
You could have a very well-guarded network but, as we know, attacks still happen. Sometimes these attacks can be insidious, with far-reaching consequences in terms of data theft or disruption of an organization’s daily operations.
At Adlumin, our thinking is that having security rules is not enough to safeguard an I.T. network. A firewall, for example, is going to generate thousands of alerts a day. It’s not feasible for security professionals to chase down that many leads.
That’s where data science comes in.
You need a data science layer to boost the signal-to-noise ratio. That will make the task far more manageable. Once an algorithm has sifted through the data and identified some potentially malicious activity, you have a target-rich environment for security professionals to work with and perform follow-up analysis on. Data science narrows down the data. It allows analysts to filter through the noise and focus their specialized knowledge and expertise on the most important areas.
Data science plays a role even if you don’t catch the attack in real time.
Post-breach, there are many lines of investigation you need to follow. You need to systematically track down emails and other contact that your systems or machines have had with the attacker. From there, you can draw a perimeter and identify how much damage was done. That’s a necessary first step before you can make a plan for guarding your network against future attacks.
What are the main challenges you face when applying Machine Learning to security threat detection?
These I.T. security threats are constantly evolving. Most types of threats don’t have a signature pattern. There’s no set of actions or sequence of events that will immediately signal a malicious attack. Your algorithm needs to be flexible in design rather than rule-based. For example, you could have an individual user logging onto a machine they’ve never accessed before. You could say that’s a red flag and write an algorithm to alert you to these “novel logins.” But what about other entities, like your network administrators, who frequently log in to machines that they haven’t accessed before?
The key challenge is to move beyond fixed rules and define what’s abnormal in a deeper way.
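To make the “novel logins” example concrete, here is a minimal sketch of such a rule in Python. The function name, user names, and machine names are all hypothetical, and this is only an illustration of the rule-based idea, not Adlumin’s implementation. It shows why the rule alone falls short: the administrator generates as many alerts as a genuinely suspicious login would.

```python
from collections import defaultdict

def find_novel_logins(history, new_events):
    """Flag (user, machine) pairs that never appear in the login history."""
    seen = defaultdict(set)
    for user, machine in history:
        seen[user].add(machine)

    alerts = []
    for user, machine in new_events:
        if machine not in seen[user]:
            alerts.append((user, machine))
            seen[user].add(machine)  # alert only once per new pair
    return alerts

# Hypothetical data: "alice" is a regular user, "admin_bob" is a network administrator.
history = [
    ("alice", "laptop-01"), ("alice", "laptop-01"),
    ("admin_bob", "srv-01"), ("admin_bob", "srv-02"), ("admin_bob", "srv-03"),
]
new_events = [
    ("alice", "laptop-01"), ("alice", "db-07"),
    ("admin_bob", "srv-04"), ("admin_bob", "srv-05"),
]

alerts = find_novel_logins(history, new_events)
for user, machine in alerts:
    print(f"novel login: {user} -> {machine}")
```

The rule flags alice’s one unusual login, but it also flags both of the administrator’s logins, which are routine for that role. A fixed rule cannot tell the two apart without knowing what is normal for each entity.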
Another challenge in our field is that we lack labeled data sets. We see a lot of abnormal things in our client data, but there’s no feedback loop. Historical databases of malicious attacks don’t really exist in this field. If you’re in a different industry, such as credit card fraud detection, there are a lot of databases that clearly identify instances of fraud. So you can train a model to learn from those instances by asking, “What were the signals of fraud occurring?”
In network security, we don’t have a lot of those historical databases. That means we have to rely primarily on a Machine Learning (ML) approach called unsupervised learning. With this approach, we’re trying to cluster the data in a way that the data point representing a malicious attack stands apart from the cluster where everything else exists. For example, the daily activity graph of a user whose credentials might have been stolen will look abnormal.
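The idea of a point standing apart from everything else can be sketched without any labels. The example below does not run a full clustering algorithm; as a simpler stand-in, it scores each day by its distance from the per-feature median, measured in units of median absolute deviation (MAD). The feature names, the data, and the threshold of 5.0 are all assumptions made for illustration.

```python
import math
from statistics import median

def robust_scores(rows):
    """Score each row by its distance from the per-feature median, in MAD units."""
    columns = list(zip(*rows))
    centers = [median(col) for col in columns]
    # Median absolute deviation per feature; fall back to 1.0 if it is zero.
    spreads = [
        median(abs(v - c) for v in col) or 1.0
        for col, c in zip(columns, centers)
    ]
    return [
        math.sqrt(sum(((v - c) / s) ** 2 for v, c, s in zip(row, centers, spreads)))
        for row in rows
    ]

# Hypothetical daily activity for one user:
# (total logins, distinct machines, off-hours logins)
days = {
    "mon": (12, 2, 0),
    "tue": (10, 2, 1),
    "wed": (11, 3, 0),
    "thu": (13, 2, 1),
    "fri": (9, 2, 0),
    "sat": (40, 15, 30),
}

THRESHOLD = 5.0  # assumed cutoff; in practice this would need tuning
scores = dict(zip(days, robust_scores(list(days.values()))))
flagged = [day for day, score in scores.items() if score > THRESHOLD]
print(flagged)
```

No day is labeled as an attack in advance. Saturday is flagged only because its activity sits far from where the rest of the week’s points gather.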
With this approach of unsupervised learning, you can get algorithms that are less wedded to history and more adaptable to a new attack pattern. The challenge is that it’s difficult to test how well these unsupervised models perform. You have to play it by ear.
One final challenge is that a lot of data science is about converting things that are not numbers into numbers. For example, you can give numerical representations to words or pictures. This type of numerical representation is a bit harder to achieve in I.T. security, but it provides a great deal of value and insight.
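As a rough illustration of turning non-numeric security data into numbers, the sketch below encodes a single log event as a numeric vector. It uses three common techniques: one-hot encoding for the event type, the hashing trick for host names, and a cyclic encoding for the hour of day. The event vocabulary, field names, and bucket count are assumptions, not a description of how Adlumin represents its data.

```python
import math
import zlib

EVENT_TYPES = ["login", "logout", "failed_login"]  # hypothetical vocabulary

def encode_event(event_type, hostname, hour, n_buckets=8):
    """Turn one log event into a fixed-length list of floats."""
    # One-hot encode the categorical event type.
    one_hot = [1.0 if event_type == t else 0.0 for t in EVENT_TYPES]

    # Hashing trick: map an open-ended set of host names onto a fixed number of slots.
    # crc32 is used because it gives the same result on every run.
    bucket = zlib.crc32(hostname.encode("utf-8")) % n_buckets
    hashed = [1.0 if i == bucket else 0.0 for i in range(n_buckets)]

    # Cyclic encoding so that 23:00 and 00:00 end up close together.
    angle = 2 * math.pi * hour / 24
    return one_hot + hashed + [math.sin(angle), math.cos(angle)]

vector = encode_event("failed_login", "srv-04", hour=3)
print(len(vector), vector)
```

Once events are vectors of the same length, they can be fed to distance-based or clustering methods. The trade-off of the hashing trick is that two different host names can land in the same slot.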
How does the Cloud help in all of this?
The Cloud plays a huge part. We wouldn’t be able to do what we do here at Adlumin without the Cloud. We process billions of data points every day and provide these security layers at a reasonable cost thanks to cloud computing.
We’ve built a serverless architecture that can scale up almost immediately. We’ve built data lakes in the Cloud that can store vast amounts of data. Storage capacity and computing power are both critical pillars of what we’re doing at Adlumin.