source/posts/using_Apache_Nifi_Kafka_big_data_tools.mdwn

   1 [[!meta date="2018-09-11 21:09:06 +0800"]]
   2 [[!tag Apache Nifi Kafka bigdata streaming]]
   3
   4 Working in analytics these days, the concept of big data has been firmly established. Smart engineers have been developing cool technology to work with it for a while now. The [[Apache Software Foundation|https://apache.org]] has emerged as a hub for many of these - Ambari, Hadoop, Hive, Kafka, Nifi, Pig, Zookeeper - the list goes on.
   5
   6 While I'm mostly interested in improving business outcomes applying analytics, I'm also excited to work with some of these tools to make that easier.
   7
   8 Over the past few weeks, I have been exploring some tools, installing them on my laptop or a server and giving them a spin. Thanks to [[Confluent, the founders of Kafka|https://www.confluent.io]] it is super easy to try out Kafka, Zookeeper, KSQL and their REST API. They all come in a pre-compiled tarball which just works on Arch Linux. (After trying to compile some of these, this is no luxury - these apps are very interestingly built...) Once unpacked, all it takes to get started is:
   9
  10 [[!format sh """
  11 ./bin/confluent start
  12 """]]
  13
  14 I also spun up an instance of [[nifi|https://nifi.apache.org/download.html]], which I used to monitor a (json-ised) apache2 webserver log. Every new line added to that log goes as a message to Kafka.
  15
  16 [[Apache Nifi configuration|/pics/ApacheNifi.png]]
  17
  18 A processor monitoring a file (tailing) copies every new line over to another processor publishing it to a Kafka topic. The Tailfile monitor includes options for rolling filenames, and what delineates each message. I set it up to process a custom logfile from my webserver, which was defined to produce JSON messages instead of the somewhat cumbersome to process standard logfile output (defined in apache2.conf, enabled in the webserver conf):
  19
  20 [[!format sh """
  21 LogFormat "{ \"time\":\"%t\", \"remoteIP\":\"%a\", \"host\":\"%V\", \"request\":\"%U\", \"query\":\"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\", \"size\":\"%O\" }" leapache
  22 """]]
  23
  24 All the hard work is being done by Nifi. (Something like
  25
  26 [[!format sh """
  27 tail -F /var/log/apache2/access.log | kafka-console-producer.sh --broker-list localhost:9092 --topic accesslogapache
  28 """]]
  29
  30 would probably be close to the CLI equivalent on a single-node system like my test setup, with the -F option to ensure the log rotation doesn't break things. Not sure how the message demarcator would need to be configured.)
  31
  32 The above results in a Kafka message stream with every request hitting my webserver in real-time available for further analysis.
  33