#+date: 2018-09-11 21:09:06 +0800
#+filetags: Apache Nifi Kafka bigdata streaming
#+title: Using Apache NiFi and Kafka - big data tools

Working in analytics these days, it is hard to escape the concept of
big data: it is firmly established, and smart engineers have been
developing cool technology to work with it for a while now. The
[[Apache Software Foundation|https://apache.org]] has emerged as a hub
for many of these projects - Ambari, Hadoop, Hive, Kafka, NiFi, Pig,
Zookeeper - the list goes on.

While I'm mostly interested in improving business outcomes by applying
analytics, I'm also excited to work with some of these tools to make
that easier.

Over the past few weeks, I have been exploring some of these tools,
installing them on my laptop or a server and giving them a spin. Thanks
to [[Confluent, founded by the creators of Kafka|https://www.confluent.io]]
it is super easy to try out Kafka, Zookeeper, KSQL and their REST API.
They all come in a pre-compiled tarball which just works on Arch Linux.
(Having tried to compile some of these from source, that is a genuine
relief - these apps are very interestingly built...) Once unpacked, all
it takes to get started is:

#+BEGIN_SRC sh
./bin/confluent start
#+END_SRC

I also spun up an instance of
[[NiFi|https://nifi.apache.org/download.html]], which I used to monitor
a (JSON-ised) apache2 webserver log. Every new line added to that log
goes as a message to Kafka.

[[Apache NiFi configuration|/pics/ApacheNifi.png]]

A processor monitoring a file (tailing) copies every new line over to
another processor publishing it to a Kafka topic. The TailFile processor
includes options for rolling filenames, and for what demarcates each
message. I set it up to process a custom logfile from my webserver,
which was defined to produce JSON messages instead of the somewhat
cumbersome-to-process standard logfile output (the format is defined in
apache2.conf and enabled in the webserver conf):

#+BEGIN_SRC sh
LogFormat "{ \"time\":\"%t\", \"remoteIP\":\"%a\", \"host\":\"%V\", \
\"request\":\"%U\", \"query\":\"%q\", \"method\":\"%m\", \"status\":\"%>s\", \
\"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\", \"size\":\"%O\" }" leapache
#+END_SRC

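To give an idea of what this produces, here is a hand-written example of
the kind of line that format emits (all field values are made up), plus
a quick grep to pull out a single field without a JSON parser:

#+BEGIN_SRC sh
# A hand-constructed sample line in the shape the LogFormat above emits;
# all values are illustrative.
line='{ "time":"[11/Sep/2018:21:09:06 +0800]", "remoteIP":"203.0.113.7", "host":"example.com", "request":"/index.html", "query":"", "method":"GET", "status":"200", "userAgent":"curl/7.61.0", "referer":"-", "size":"532" }'

# Quick sanity check: extract one field with grep.
echo "$line" | grep -o '"method":"[A-Z]*"'
#+END_SRC
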
All the hard work is being done by NiFi. (Something like

#+BEGIN_SRC sh
tail -F /var/log/apache2/access.log | kafka-console-producer.sh --broker-list localhost:9092 --topic accesslogapache
#+END_SRC

would probably be close to the CLI equivalent on a single-node system
like my test setup, with the -F option ensuring that log rotation
doesn't break things. I am not sure how the message demarcator would
need to be configured.)

The above results in a Kafka message stream, with every request hitting
my webserver available in real time for further analysis.
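
As a sketch of the kind of downstream analysis this enables - counting
404s, say - the consumer output can be piped through standard tools.
Here a small shell function with two hand-written sample messages stands
in for kafka-console-consumer.sh reading the topic (the topic name
matches the example above; the messages are made up):

#+BEGIN_SRC sh
# Stand-in for: kafka-console-consumer.sh --bootstrap-server localhost:9092 \
#               --topic accesslogapache --from-beginning
# Real output would be one JSON object per webserver request.
consume() {
  cat <<'EOF'
{ "time":"[11/Sep/2018:21:09:06 +0800]", "status":"200", "request":"/index.html" }
{ "time":"[11/Sep/2018:21:09:07 +0800]", "status":"404", "request":"/missing" }
EOF
}

# Count the 404s in the stream.
consume | grep -c '"status":"404"'
#+END_SRC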