Elasticsearch Cloudfront Template

[07.02.2017]

Cloudfront Template for Elasticsearch

Elasticsearch is a great way to collect and search logs for almost anything you want to put into it. The problem I ran into was that I couldn't find many resources online to help me get my Cloudfront logs from S3 into my Elasticsearch cluster.

Fortunately, Elastic provides an S3 input plugin for Logstash, which makes reading the files from the bucket easy. I just needed a way to get those logs into my cluster.
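
If the S3 input plugin isn't already bundled with your Logstash install, it can be added with the plugin manager, something like this from the Logstash home directory:


    bin/logstash-plugin install logstash-input-s3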

I found some guides online, and I pieced this together (cloudfront-original.conf):


    input {
      s3 {
        bucket => "cloudfront-logs-bucket"
        delete => false
        interval => 60 # seconds
        prefix => "cf-logs/"
        region => "us-east-1"
        type => "cloudfront"
        codec => "plain"
        sincedb_path => "/opt/logstash_input/s3/cloudfront/sincedb"
        backup_to_dir => "/opt/logstash_input/s3/cloudfront/backup"
      }
    }
    filter {
      if [type] == "cloudfront" {
        if ( ("#Version: 1.0" in [message]) or ("#Fields: date" in [message])) {
          drop {}
        }

        grok {
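          # CloudFront access logs are tab-separated; each \t below matches a literal tab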
          match => { "message" => "%{DATE_EU:date}t%{TIME:time}t%{WORD:x_edge_location}t(?:%{NUMBER:sc_bytes}|-)t%{IPORHOST:c_ip}t%{WORD:cs_method}t%{HOSTNAME:cs_host}t%{NOTSPACE:cs_uri_stem}t%{NUMBER:sc_status}t%{GREEDYDATA:referrer}t%{GREEDYDATA:User_Agent}t%{GREEDYDATA:cs_uri_stem}t%{GREEDYDATA:cookies}t%{WORD:x_edge_result_type}t%{NOTSPACE:x_edge_request_id}t%{HOSTNAME:x_host_header}t%{URIPROTO:cs_protocol}t%{INT:cs_bytes}t%{GREEDYDATA:time_taken}t%{GREEDYDATA:x_forwarded_for}t%{GREEDYDATA:ssl_protocol}t%{GREEDYDATA:ssl_cipher}t%{GREEDYDATA:x_edge_response_result_type}" }
        }

        mutate {
          add_field => [ "received_at", "%{@timestamp}" ]
          add_field => [ "listener_timestamp", "%{date} %{time}" ]
        }

        date {
          match => [ "listener_timestamp", "yy-MM-dd HH:mm:ss" ]
        }

        date {
          locale => "en"
          timezone => "UCT"
          match => [ "listener_timestamp", "yy-MM-dd HH:mm:ss" ]
          target => "@timestamp"
          add_field => { "debug" => "timestampMatched"}
        }
      }
    }

    output {
      elasticsearch {
        hosts => ["elastic.mydomain.com:9200"]
        index => "logs-cloudfront-%{+YYYY.MM.dd}"
      }
    }
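
Running Logstash against just this pipeline looks roughly like this (run from the Logstash home directory; paths vary by install):


    bin/logstash -f cloudfront-original.conf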

Amazingly enough, this worked very well. Unfortunately, the logs are quite large, so I needed to look at ways to make them smaller. Also, the backup_to_dir directive in my config was filling up the disk on my collector machine, so I commented it out and restarted the collector.
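
The change was just commenting out that one line in the s3 input:


    # backup_to_dir => "/opt/logstash_input/s3/cloudfront/backup"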

My Cloudfront logs averaged 700+ MB per day, so I kept looking for ways to reduce their size and make them easier to search.

To do this, I needed to create an index template. I also noticed that I could just throw away the original "message" field, since Logstash was grokking out the juicy bits anyway. With the new template it would be necessary to reindex the logs, so I looked into a way to uniquely identify each record. Logstash has a built-in filter plugin called fingerprint for exactly this.

So, I added this to the filter section:


    fingerprint {
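      # No "source" is set here, so the fingerprint filter hashes the "message" field by default;
      # it has to run before the mutate below that removes "message".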
      method => "SHA1"
      key => "KEY"
      target => "[@metadata][fingerprint]"
    }

I also added a mutate block to remove fields I no longer needed.


    mutate {
      remove_field => [ "message" ]
      remove_field => [ "cloudfront_fields" ]
    }

In order to use the custom fingerprint ID, I added this to the elasticsearch declaration:


    document_id => "%{[@metadata][fingerprint]}"
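
In context, the whole elasticsearch output ends up looking like this (same hosts and index pattern as in the original config):


    output {
      elasticsearch {
        hosts => ["elastic.mydomain.com:9200"]
        index => "logs-cloudfront-%{+YYYY.MM.dd}"
        document_id => "%{[@metadata][fingerprint]}"
      }
    }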

Now each record has a unique identifier, so if I happen to reingest the same logs a second time, they won't show up as duplicates: Elasticsearch just bumps the version of the existing document, which still takes processing time, but the event doesn't appear as a second hit in the index.
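
A quick way to sanity-check the deduplication is to ask Elasticsearch to return document versions in a search: after reingesting the same logs, the _version numbers go up but the hit count stays the same. Something like this from the Kibana console (the request ID value is just a placeholder):


    GET logs-cloudfront-*/_search
    {
      "version": true,
      "query": {
        "match": { "x_edge_request_id": "EXAMPLE_REQUEST_ID" }
      }
    }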

I created a template file (cloudfront-elastic-template.json) that defines a mapping for each field I was ingesting. I loaded the template through the Kibana developer console and started a re-index like this:


    POST _reindex
    {
      "source": {
        "index": "logs-cloudfront-2017.05.03"
      },
      "dest": {
        "index": "logs-cloudfront-reindex-2017.05.03"
      }
    }
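
For reference, cloudfront-elastic-template.json contains something along these lines. The field names come from the grok pattern above, but the types shown here are illustrative guesses, not the exact settings from my file:


    PUT _template/logs-cloudfront
    {
      "template": "logs-cloudfront-*",
      "mappings": {
        "_default_": {
          "properties": {
            "c_ip":            { "type": "ip" },
            "x_edge_location": { "type": "keyword" },
            "cs_method":       { "type": "keyword" },
            "cs_uri_stem":     { "type": "keyword" },
            "sc_status":       { "type": "keyword" },
            "sc_bytes":        { "type": "long" },
            "cs_bytes":        { "type": "long" },
            "time_taken":      { "type": "float" }
          }
        }
      }
    }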

The finished cloudfront.conf contains all of these changes.

After making these changes and reindexing, I estimate I saved about 35% of the disk space my logs were using. And because less data is being pushed into Elasticsearch, indexing is faster too.

aws elasticsearch cloudfront