10 Pounds of Packets in a 5 Pound Bag
Richard Bejtlich has been talking a lot about the difference between Network Security Monitoring (NSM) and "alert-centric" technologies like Snort. His basic premise is that "real" NSM requires more than just IDP alerts and packet logs: it requires event notifications, full packet logs of the entire network, and flow data as well. He also quotes me as saying "Richard, I wrote Snort so you don't have to look at packets". This isn't quite right. I think what I actually said was "Did you think about how much data you're going to record if you do that on a high speed network? We wrote IDS so that we wouldn't have to record everything."
I get it, I understand what the NSM guys are saying and I really don't disagree with them at all. The problem I have is that if you try to deploy this concept in a large network environment with lots and lots of sensors, you've got some big problems to overcome. Let's look at the problems.
1) Flow aggregation - As I see from Richard's latest post on Cisco's MARS product, he wants the raw flow data, not just statistical NetFlow rollups. RNA does that already, as can Snort with the right options turned on. This works fine as long as the network environment is relatively small and you don't try to roll up all of the data for post-processing and analysis. If you do aggregate it to a central collector, then you've got the multiplication problem on the aggregation link(s): the more traffic the sensors see and the more sensors you have, the more data gets pumped up to the collector and pushed into the database you're using to manage all this information. If you're in an environment where you're aggregating more than a few million flows per hour, that's a ton of data to manage if you figure ~40 bytes per flow record (in binary format). That's 200MB of data just for flow data in an hour, almost 5GB per day, 150GB per month. Those records also have to get blasted into a database so they can be worked with, so you're going to be translating all that data into SQL insert statements and then pumping it into the local database on your aggregation machine or across the network to a database server (or cluster). That's a lot of processing, a lot of network bandwidth and a lot of disk, not to mention a lot of RAM to maintain the indices in memory for the database. It's not that this isn't doable, but now we're talking about offloading the work across multiple machines at a minimum, and that's going to increase your costs dramatically. Overall this isn't an intractable problem (the NetFlow analysis/NBA guys do it for a living) but it is a big one in any large enterprise; it takes a lot of work to scale the technology to handle it effectively.
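For concreteness, here's the arithmetic, using a nominal 5 million flows per hour as a stand-in for "a few million" and the ~40-byte binary record size:

```python
# Back-of-the-envelope flow storage sizing. The 5M flows/hour figure is a
# nominal stand-in for "a few million"; the 40-byte record is the binary
# flow record size assumed above.

BYTES_PER_FLOW = 40
FLOWS_PER_HOUR = 5_000_000

hourly = BYTES_PER_FLOW * FLOWS_PER_HOUR
daily = hourly * 24
monthly = daily * 30

print(f"per hour : {hourly / 1e6:.0f} MB")    # 200 MB
print(f"per day  : {daily / 1e9:.1f} GB")     # 4.8 GB
print(f"per month: {monthly / 1e9:.0f} GB")   # 144 GB
```

And that's just the raw records, before any database indexing overhead, which can easily double or triple the on-disk footprint.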
2) Traffic aggregation - If you thought the flow aggregation problem was fun, then start logging all the traffic on your network. Let's take a fairly well utilized modern enterprise network backbone running at a sustained 500Mbps: that's 62.5MBps of data to record on a single sensor, 225GB/hour of packet traffic, 5.4TB per day from a single sensor. All that data is going to need to be rolled up too, unless you're going to spool it into a local database and do distributed queries across the network for packet traces. At that kind of data density your NSM sensor is going to need a NAS device someplace nearby so the data can be stored; it's going to be really hard to do that on a 1U appliance just due to physical drive space limitations. Once you have all that data, you're going to need to be able to work with it, so it's got to be in a database or it has to be indexed on the filesystem in some logical fashion so that smaller chunks of data can be rapidly located, decoded and presented to the user on demand. There are companies that build products to do this; I can't really speak to their effectiveness. I can hook a high-speed collection process like daemonlogger up to a big disk and grab all this data, but once again, how much value are you really getting from recording all that data versus the logistical overhead of trying to maintain all that information in a usable fashion for extended periods of time? What's the time horizon of this data? Do I need to keep a week/month/year of this data live in a database for referential purposes? If there's going to be any expectation of success, the amount of data that's kept "live" is going to have to have some pragmatic constraints.
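One low-tech way to get that "indexed on the filesystem in some logical fashion" property is time-partitioned capture files: rotate on hour boundaries, encode the start time in the filename, and finding the chunk that covers a given packet becomes a binary search over rotation boundaries. A minimal sketch (the naming scheme and hourly rotation interval are my own illustration, not how daemonlogger or any particular product actually does it):

```python
# Sketch: hourly-rotated capture files named by start timestamp, plus a
# lookup that maps "show me traffic at time T" to the one file to decode.

import bisect
from datetime import datetime, timezone

def capture_filename(start: datetime) -> str:
    """Name of the hourly capture file that begins at 'start'."""
    return start.strftime("capture-%Y%m%d-%H00.pcap")

def locate(ts: datetime, start_times: list) -> datetime:
    """Given sorted rotation boundaries, find the file covering ts."""
    i = bisect.bisect_right(start_times, ts) - 1
    if i < 0:
        raise LookupError("timestamp precedes retained data")
    return start_times[i]

# Example: three retained hours of captures on one sensor
hours = [datetime(2006, 7, 14, h, tzinfo=timezone.utc) for h in (9, 10, 11)]
t = datetime(2006, 7, 14, 10, 37, tzinfo=timezone.utc)
print(capture_filename(locate(t, hours)))  # capture-20060714-1000.pcap
```

The same scheme makes the retention question mechanical: expiring a week/month horizon is just deleting the oldest files, with no database compaction involved.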
3) Alert aggregation - This is the data that IDP vendors spend their time working on getting to their users. We have pretty well established metrics as to what is acceptable in this realm in terms of sustainable event rates, data overload thresholds for analysts, data density and so on. This is the de facto standard in IDP because this is the thing that people are paying for: we've got to generate the events, and everyone wants to see them since that's what the technology is supposed to be doing. This is a lot of data to deal with too, and this is the raw information that analysts have to work with. We do a lot at Sourcefire to pare down the number of events analysts have to deal with via our Impact Assessment technology that's enabled by RNA, so it is possible to do this effectively in large environments, even with less than optimal tuning of the sensor infrastructure.
When I started Sourcefire one of the things that I decided to do to get people to want to pay for something that was free (i.e. Snort) was to try solving their data management problems. If you look at most of the IDS vendors before Sourcefire was founded, they would sell you IDS sensors and management front-ends but they wouldn't solve the biggest problem that most people would have once they deployed the technology, namely managing the information produced by the sensors. As we all know, IDS can generate immense amounts of data with just alerts and if you want to be able to work with that data it needs to go into a database that has been optimized for the data set. Prior to Sourcefire, you could buy $250k worth of sensors from vendor X and when you deployed the sensor grid you'd call vendor X and ask them how you're going to manage all those alerts. Their answer was typically "go call Oracle, they make a really nice database and we'll sell you professional services if you need help setting it up." This greatly increased the cost and complexity of deployment of the IDS solutions. When Sourcefire started I decided that this was an area where we could add real value, so we built what is now called Defense Center allowing customers to have a plug-n-play appliance that solved their data management problems and provided a path to deploy large infrastructures of our gear quickly. As you can see from our S-1 filing, this was probably a Good Idea.
A "real" NSM infrastructure is going to be built primarily around the idea of collecting, moving and storing data and then making it highly available in a variety of presentation formats for users. If you try to do this on a network that's generating lots of traffic across lots of sensors/segments, the likelihood of building a scalable solution that anyone is willing to pay for is vanishingly low. You're going to need hundreds of terabytes of disk, a dedicated out-of-band management network for moving data, huge database servers AND the management and sensing infrastructure to actually grab the data.
Now we want to scale it. I know from experience that there are large, distributed international enterprises out there that have remote offices sitting on the other side of 128kbps (and below) links. They get really irritated when you saturate that link to pump out a continuous stream of security data. These organizations also have core networks with 10Gbps links that can sustain 2+Gbps of internal traffic for hours. That's a couple terabytes per hour of traffic you want to log, give or take, just in the core. Then you have the rest of the enterprise with 100+ sensors deployed, seeing varying amounts of traffic but rarely dipping below 10Mbps each, so that's roughly another TB of data every hour you've got to collect and forward to a central aggregation point. Then we throw in the flow data (lots of small records to insert) and the event data (more small records to insert), and you've got a data aggregation nightmare. Concentrating this data to a central collector or a load-balanced set of collectors will saturate a gigabit line, so you're either going to have to figure out how to leave it local on the sensors and perform distributed queries against it, or you're going to have to deploy a bunch of additional network gear to absorb the load.
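A quick sanity check with deliberately conservative inputs shows why the collector link is the choke point (2 Gbps sustained in the core, plus 100 sensors held to the 10 Mbps floor):

```python
# Aggregate bandwidth toward a central collector, using the floor of the
# figures above: 2 Gbps sustained core traffic and exactly 100 edge
# sensors at exactly 10 Mbps each. Real fleets would only be bigger.

core_bps = 2e9           # 2+ Gbps sustained in the core
edge_bps = 100 * 10e6    # 100+ sensors, >= 10 Mbps each

total_bps = core_bps + edge_bps   # 3 Gbps aggregate, minimum
gig_links = total_bps / 1e9       # gigabit links needed just to carry it

print(f"{total_bps / 1e9:.0f} Gbps sustained toward the collector")
print(f"{gig_links:.0f}x a fully saturated gigabit link, before flow and event data")
```

Even at this floor you need at least three fully saturated gigabit links of sustained capacity just for the packet data, which is exactly why the alternatives come down to leaving the data local with distributed queries or buying a pile of extra network gear.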
The cost of deploying a solution like this will make today's IDP deployments look like rounding error and the amount of time required to sell this into an enterprise will make today's sales cycles look like selling fast food.
Then we've got training. I know the binary language of moisture vaporators, Rich knows the binary language of moisture vaporators, and lots of Sguil users know it too. The majority of people who deploy these technologies do not. Giving them a complete session log of an FTP transfer is within their conceptual grasp; giving them a fully decoded DCERPC session is probably not. Who is going to make use of this data effectively? My personal feeling is that more of the analysis needs to be automated, but that's another topic.
One of the comments made on one of Rich's posts said:

"It seems that a lot of these SIM and IDS/IPS systems are really now being sold to small and medium enterprises without any regard to the amount of additional staff time and expertise that will be required to maintain them. Consequently I find that the ones I've used aren't oriented towards making investigation of an incident easier but are there simply to send out more alerts under the premise that more alerts is surely better because we're detecting and stopping more attacks."

That's incorrect. They're being sold to extremely large enterprises (Fortune 100), and when they're sold in those environments there's an expectation that they will scale. There is more data we could get to the users of these systems, for sure, but "everything" is an unrealistic expectation given the realities of the large enterprises that these technologies are sold into.
Recording everything doesn't scale today but maybe someday it will. Like after the Singularity.