November 6th, 2012
by Stephen Brannon
An indicator of compromise (IOC) is one of the basic units of information sharing in tactical intelligence. Simple examples are IP addresses of known command-and-control servers and MD5 hashes of known malware. But a common problem is that IOCs are often stored and shared in a document such as a PDF or Word file. We need to extract IOCs from these documents in order to use them, but the process can be difficult, time consuming, and prone to human error. We recognize that this problem is likely to be with us for a while, so we developed a small tool to mostly automate the process of extracting IOCs from documents. This helps us get them into our systems (like CIF) more quickly and accurately so we can use them and analyze them.
IOCextractor is a small program that presents a text document to a person with likely IOCs identified and highlighted in context. It has a simple interface for a person to add or remove any IOC identifications based on human judgment and understanding of the document's context. Finally, the program exports the IOCs in a structured format that's easy for other tools to import and use. The underlying philosophy is to leverage a computer for tasks it's best at (matching patterns and copying strings accurately) and to use a human's time only for the tasks a human is best at (judging whether identified strings really are IOCs based on context).
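To make that division of labor concrete, here is a minimal sketch of the idea in Python. It is not IOCextractor's actual code: the pattern set, the function names `find_candidates` and `export_confirmed`, and the JSON output shape are all assumptions chosen for illustration. The computer proposes candidates with their positions so they can be shown in context, and a person's rejections are applied before export.

```python
import json
import re

# Hypothetical pattern set for illustration; a real tool would likely cover
# more IOC types (domains, URLs, SHA-1/SHA-256 hashes, and so on).
IOC_PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}


def find_candidates(text):
    """Return likely IOCs with type and character offsets, in document order.

    The offsets let an interface highlight each candidate in context so a
    person can judge whether it really is an IOC.
    """
    candidates = []
    for ioc_type, pattern in IOC_PATTERNS.items():
        for match in pattern.finditer(text):
            candidates.append({
                "type": ioc_type,
                "value": match.group(0),
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(candidates, key=lambda c: c["start"])


def export_confirmed(candidates, rejected_values=()):
    """Serialize the candidates the reviewer kept as JSON (an assumed format)."""
    kept = [
        {"type": c["type"], "value": c["value"]}
        for c in candidates
        if c["value"] not in rejected_values
    ]
    return json.dumps(kept, indent=2)


if __name__ == "__main__":
    sample = (
        "The C2 server at 192.0.2.10 dropped a file with MD5 "
        "d41d8cd98f00b204e9800998ecf8427e."
    )
    found = find_candidates(sample)
    # In a real workflow the rejections would come from the reviewer.
    print(export_confirmed(found, rejected_values=set()))
```

The patterns are deliberately loose: a version string that happens to look like an IP address would still be flagged, which is exactly the kind of case left to the human reviewer.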
The tool is useful to us because it makes the process of extracting IOCs from a wide range of documents much faster. In fact, it sped the process up enough that it changed some of our cost-benefit analysis. Before, the time required to extract useful information from some sources of shared intelligence was prohibitive because of their difficult formats. Now, the time required is short enough that those sources are worth parsing, so we're getting and using IOCs that we were effectively ignoring before. To improve efficiency even more, we're considering expanding the formats that IOCextractor can input and output. We originally designed IOCextractor only to read plain-text files, relying on other applications' mature capabilities to save their own formats as text. This has largely worked well, but we're looking at adding the ability to directly open a URL. We're also considering additional output formats like CybOX, STIX, or OpenIOC to further facilitate integration with CIF and other tools.
From a certain point of view, this problem is simply one of extracting structured data (IOCs) from semi-structured or unstructured data (documents). There's a mature field of study on the subject, including classic articles by Sreekumar Sukumaran and Ashish Sureka and by Bill Inmon, and of course we drew from the field for ideas. There's also a range of products that are very good at text mining, named entity extraction, and other related tasks. But those tools are generally too expensive to consider for this kind of small application, and they're far more complex to use and customize than IOCextractor, which is fewer than 250 lines of Python code. In this case, developing our own tool was the best way for us to go, and we decided to make it open source and available on GitHub. We hope other people find the tool useful too, and we look forward to reactions and suggestions!