Thursday, June 28, 2012

CIF-Lite: Customizing CIF to your schema

This post documents how to customize CIF to use your own data handling and storage methodology.  I call my particular customization "CIF-Lite" since it focuses on cutting down storage requirements and simplifying the database layout.  The code and installation instructions for my specific customization are published separately; this blog post also provides details for performing your own customization.

By the way, it is past time that CIF have a logo ;)

For those unfamiliar with the CIF project, CIF stands for Collective Intelligence Framework.  It is an excellent project spearheaded by the REN-ISAC (the .EDU information sharing and analysis center).  The project is fully open source and provides a flexible framework for automating the download, parsing, storage/archiving, analysis, and retrieval of information from data feeds.  For example, no doubt in your security environment you have scripts that grab data from the likes of ZeusTracker, MalwareDomainList, and whatever other data sources you're interested in, and then do things like apply the data to your blocklists.  CIF allows you to perform these actions without having to write any code: just write a config (.cfg) file and voila, your data feed is integrated into CIF.  There are other great features too, such as tracking analyst searches in CIF and defining data-sharing restrictions, but this post isn't about CIF features (you can read up on them on the CIF website); it is about customizing the back-end CIF data handling and storage.

The rest of this post assumes that you have a working install of CIF v0.01 (note: this entire post refers to this version of CIF; I have not yet checked out the Beta versions that are under development) and that you are interested in customizing the storage of the data that is automatically pulled down and parsed by the CIF cfg files you have built.

Without question CIF is a great project with a growing user base within the security community (for example, the integration of ELSA with CIF).  From using CIF I have found that the storage back-end may not be ideal for all environments, and I've heard similar feedback from a few other analysts using CIF.  By default, CIF does a number of things in its storage that can be a poor fit for some environments:
  • CIF will store all data records as new records even if the record already existed in the feed the last time it was pulled.  A number of feeds, like Alexa, are large, and storing the same records over and over again can become expensive and unnecessary for some.
  • CIF will store raw data about each record as an IODEF text field in the database - the IODEF format contains extraneous text and fields (read: bytes/storage) that again can become expensive and unnecessary for some.
  • CIF makes use of separate tables for various data "impacts" for indexing purposes (e.g., botnet domain, phishing domain, etc.).  These additional tables help speed up CIF query time, but they also add storage expense and potential complexity when querying across multiple tables in the database.
  • There are a few other CIF storage gotchas too, like storing the data feed source as a UUID without a lookup table.  To find out which source a particular UUID corresponds to, you need to calculate the UUID for every source and match them up yourself.
Note: none of the above bullet points is a knock against CIF's design or development.  These particular points were not ideal for me, and I was interested in better understanding and customizing the CIF back-end.
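
To illustrate the UUID gotcha from the last bullet, here is a minimal sketch (in Python rather than CIF's Perl, purely for illustration) of building the missing reverse lookup table yourself.  It assumes sources are hashed with a name-based UUID; the UUID version, the namespace, and the source names below are all assumptions, so substitute whatever your CIF install actually uses.

```python
import uuid


def source_uuid(name):
    """Name-based UUID for a feed source.

    ASSUMPTION: version 3 with the URL namespace is only a stand-in here;
    swap in the scheme your CIF instance really uses.
    """
    return str(uuid.uuid3(uuid.NAMESPACE_URL, name))


def build_source_lookup(names):
    """Map each computed UUID back to the source name that produced it."""
    return {source_uuid(name): name for name in names}


# Hypothetical source names, for illustration only
lookup = build_source_lookup(["zeustracker.abuse.ch", "malwaredomainlist.com"])
```

With a table like this, a UUID read from a stored record can be resolved with `lookup.get(stored_uuid)`, which returns None for a source you have not listed.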

Fortunately, everything within the CIF code base is modular.  The main script in your running instance, cif_crontool, is set to run regularly from your cron.  It loops through the feeds in your CIF .cfg files and runs each one through /opt/cif/bin/cif_feedparser.  That script parses (CIF::FeedParser->parse) and stores (CIF::FeedParser->process) your data feed using the CIF framework's CIF::FeedParser module, which uses its own logic for parsing and the CIF::Archive logic for storing.

To customize how data is stored in your instance, make a copy of /opt/cif/bin/cif_feedparser to be your custom data handler (e.g., /opt/cif/bin/cif_feedparser_custom).  At the bottom of the file you will see the call to CIF::FeedParser->process().  Add a parameter called "function" to that call and point it to your own custom function for handling the data records that CIF parses out of your data feeds.

Note: there is a bug in /opt/cif/lib/CIF/ that impacts the use of a custom handling function.  I have reported it to Wes / CIF development - but in the meantime I have a fix for it here.

Within your custom handling function (CIF::Lite::insert_records in my case), you receive the CIF records and configuration, and are then free to iterate over the records and normalize and store the data however you wish.

In my particular storage schema, I use first-seen/last-seen timestamps to show a timeframe for repeating records rather than creating a new record each time, and I use lookup tables for things like source and impact.
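
The first-seen/last-seen idea can be sketched as follows.  This is not the CIF-Lite code itself (which is Perl); it is a minimal Python/SQLite illustration, and the table names, column names, and record keys (address, source, impact) are assumptions made for the example.

```python
import sqlite3
import time


def open_db(path=":memory:"):
    """Create the hypothetical lookup and record tables if they do not exist."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS source (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS impact (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS record (
            address    TEXT NOT NULL,
            source_id  INTEGER NOT NULL REFERENCES source(id),
            impact_id  INTEGER NOT NULL REFERENCES impact(id),
            first_seen INTEGER NOT NULL,
            last_seen  INTEGER NOT NULL,
            PRIMARY KEY (address, source_id, impact_id)
        );
    """)
    return db


def lookup_id(db, table, name):
    """Return the id for name in a lookup table, inserting it on first use."""
    # table is always one of our own fixed table names, never user input
    db.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    row = db.execute(f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()
    return row[0]


def insert_records(db, records, now=None):
    """Store parsed records, bumping last_seen instead of duplicating rows."""
    now = int(time.time()) if now is None else now
    for rec in records:
        address = rec["address"].strip().lower()  # simple normalization
        source_id = lookup_id(db, "source", rec["source"])
        impact_id = lookup_id(db, "impact", rec["impact"])
        cur = db.execute(
            "UPDATE record SET last_seen = ? "
            "WHERE address = ? AND source_id = ? AND impact_id = ?",
            (now, address, source_id, impact_id),
        )
        if cur.rowcount == 0:  # not seen before: create the row once
            db.execute(
                "INSERT INTO record VALUES (?, ?, ?, ?, ?)",
                (address, source_id, impact_id, now, now),
            )
    db.commit()
```

Pulling the same feed twice then leaves a single row per (address, source, impact) whose last_seen moves forward, which is where the storage savings over storing every pull as new records come from.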

After you're happy with your back-end customization, you can modify how cif_crontool is called in your crontab so that it uses your custom feedparser script, via the "-C" option.

You can then query your custom database directly and/or write your own CIF client tools for extracting data out of the database based on your new schema.  For example, here is my client. Hope my experience helps out anyone interested in customizing the data handling/storage functionality of their CIF instance.
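
As a rough idea of what such a client query might look like, here is a minimal Python/SQLite sketch.  It is not my actual client; the schema, the sample rows, and the function names are all hypothetical and only mirror the first-seen/last-seen layout with source and impact lookup tables described above.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory database with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE source (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE impact (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE record (
            address TEXT, source_id INTEGER, impact_id INTEGER,
            first_seen INTEGER, last_seen INTEGER
        );
        INSERT INTO source VALUES (1, 'feed-a'), (2, 'feed-b');
        INSERT INTO impact VALUES (1, 'botnet domain'), (2, 'phishing domain');
        INSERT INTO record VALUES
            ('bad.example',   1, 1, 100, 500),
            ('old.example',   2, 1, 100, 150),
            ('phish.example', 2, 2, 300, 500);
    """)
    return db


def query_recent(db, impact, since):
    """Return records for one impact that were still being seen at or after since."""
    sql = """
        SELECT r.address, s.name, r.first_seen, r.last_seen
        FROM record r
        JOIN source s ON s.id = r.source_id
        JOIN impact i ON i.id = r.impact_id
        WHERE i.name = ? AND r.last_seen >= ?
        ORDER BY r.last_seen DESC, r.address
    """
    return db.execute(sql, (impact, since)).fetchall()


db = make_demo_db()
recent_botnet = query_recent(db, "botnet domain", since=400)
```

Because source and impact live in lookup tables, a single join answers "what is currently active for this impact" without querying across per-impact tables.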