Posts Tagged fluentd

Storing Apache Access and Error Logs to Amazon S3 with Fluentd

It’s been a while I did not write a new blog post.  I joined Amazon Web Services a year ago and our company policy is to not blog about our products and services, but rather refer to our official blogs.

However we can talk about other products and services and today I would like to share some experiments I made while preparing a training class, using Fluentd.  Fluentd is an open-source data collector designed for processing data streams.  Many use Fluentd to collect and aggregate log files.

Fluentd is able to parse many different input files.  It produces aggregated JSON files and stream them to your destination of choice : a central Fluentd aggregation server, a database, a file storage on prem or in the cloud.

Fluentd’s documentation is very complete and clear and the community extremely reactive and willing to help. For basic instructions to collect and aggregate Apache’s HTTP Server log files and to push them to Amazon S3, please read this excellent article. I will not repeat the basic configuration instructions here.

However I had to struggle with two configuration problems and I thought it was worth to share these and their solutions to benefit the wider Fluentd community.

Problem #1 : Fluentd has no standard parser for Apache’s HTTP Server error_log file

The in_tail plugin is able to parse a couple of file formats by default, including Apache’s HTTP Server access_log file, using a configuration like this :

  type tail
  format apache2
  path /var/log/httpd/access_log
  pos_file /var/log/td-agent/httpd.access_log.pos
  tag s3.apache.access

However, when it comes to error_log file, Apache’s HTTP Server uses a slightly different format and in_tail plugin will fail to parse it. Fortunately, in_tail allows to define your own regular expression for parsing message.  After a bit of experiment, I managed to craft a regular expression to parse the error_log file :

^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\] \[pid (?<pid>[^\]]*)\] \[client (?<client>[^\]]*)\] (?<message>.*)$

My configuration for the error_log parser is as this :

  type tail
  format /^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\] \[pid (?<pid>[^\]]*)\] \[client (?<client>[^\]]*)\] (?<message>.*)$/
  path /var/log/httpd/error_log
  pos_file /var/log/td-agent/httpd.error_log.pos
  tag s3.apache.error

Problem #2 : Fluentd’s Amazon S3 output plugin does not generate true JSON by default

I want to push all my aggregated log to Amazon S3 for storage and further analysis with Amazon Elastic Map Reduce.  Fluentd has an output plugin to push the stream to Amazon S3. Unfortunately, by default, this plugin uses the following data format :

date\ttag\t{... JSON ...}

This is not true JSON and makes it difficult to analyse the JSON stream with standard parsers (or Serde if you plan to use Apache’s Hive) .  To configure the plugin to use pure JSON as output, you just need to add the following lines (in bold below) :

<match s3.*.*>
  type s3

  s3_bucket bucket_name
  path logs/
  buffer_path /var/log/td-agent/s3

  time_slice_format %Y%m%d%H
  time_slice_wait 10m
  format_json true
  include_time_key true
  include_tag_key true

  buffer_chunk_limit 256m

The first line will tell the plugin to use pure JSON, the two other lines ensure you don’t loose the date and tag information by pushing them back to the JSON structure.

This configuration produce entries like this one, easy to parse and to analyse

{"host":"62.999.999.999","user":null,"method":"GET","path":"/dreambox-second-stage-bootloader-update/","code":200,"size":28377,"referer":null,"agent":"Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1","tag":"s3.apache.access","time":"2014-05-01T09:27:23Z"}


In addition to this Fluentd configuration, you want to be sure only Fluentd is able to write to your Amazon S3 log storage bucket.  When using Fluentd on an Amazon EC2 instance, you can use the following permissions attached to an IAM Role :

      "Effect": "Allow",
      "Action": [
        "s3:Get*", "s3:List*","s3:Put*", "s3:Post*"
      "Resource": [
        "arn:aws:s3:::bucket_name/logs/*", "arn:aws:s3:::bucket_name"


A big kudo and thanks to @tagomoris and @repeatedly for your help to configure this.

Enjoy !

, , ,

No Comments