I feel a bit like I'm going back to my roots with this blog post: technical documentation on how I've set up a relatively niche product, with an opinionated stance on how it should be set up.
What is Apache Nifi?
Apache Nifi is an easy to use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. See http://nifi.apache.org/ for more info.
What's the context of the setup?
I want to wrap a control process around some of the disparate Lambda tasks that I have triggered by S3. I need a control function for reporting & monitoring, as well as something relatively simple that lets support workers identify problems and make adjustments upstream from the 'critical path.' I reviewed Luigi, Airflow & Nifi as options, and ended up picking Nifi because it was quick to get going with and flexible to use.
Docker (ECS) v. Barebones (EC2)
The first decision was on the deployment model for Nifi.
At this stage the bounding resource for the use case that I'm pushing through Nifi is likely to be memory, as some of the files passing through the process may reach up to 5GB in size; that said, I hope that the majority are going to be closer to the 100MB end of the scale. For initial discovery, I'd used a well-made Docker image, mkobit/nifi, to stand up and validate Nifi.
Unfortunately I found a gotcha with the Docker setup (having not been "down in the dirt" for a few months): either Docker has changed its default behaviour, or the persistence model has changed. Either way, without explicitly defining a volume for /opt/nifi/conf, I lost all my configuration upon container restart. The good news is that this allowed me to validate how simple Nifi was: within 15 minutes I'd been able to rebuild my model from scratch, having by then familiarised myself with a number of Nifi's standard processors.
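For anyone staying on the Docker route, here is a minimal sketch of defining that volume explicitly. The image name and the /opt/nifi/conf path come from above; the volume name nifi-conf and the container name nifi are placeholders I've made up, and I haven't checked which other directories in the image would also need persisting.

```shell
# Start the mkobit/nifi image with a named volume mounted over /opt/nifi/conf,
# so the configuration survives the container being restarted or re-created.
# "nifi-conf" (volume) and "nifi" (container) are placeholder names.
run_nifi_with_conf_volume() {
    docker run -d \
        --name nifi \
        -v nifi-conf:/opt/nifi/conf \
        mkobit/nifi
}

# Uncomment to actually start the container:
# run_nifi_with_conf_volume
```

A named volume is used here rather than a bind mount because Docker copies the image's existing contents of that path into an empty named volume on first use, so the default configuration files are not hidden by an empty directory.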
I therefore took the decision to build a barebones installation of Nifi on top of the latest Ubuntu LTS release. There's no official package, so I'll be relying on the releases directly from nifi.apache.org.
Initial installation
Once you've got your vanilla LTS release stood up, add the following to the end of /etc/security/limits.conf
* hard nofile 50000
* soft nofile 50000
* hard nproc 10000
* soft nproc 10000
The documentation mentions you may also need to edit /etc/security/limits.d/90-nproc.conf; however, that is not required on Ubuntu.
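To confirm the new limits have taken effect, something like the following can be run from a fresh login session. It is only a sketch: report_limits is a helper name I've made up, and the /proc/self/limits lookup assumes a Linux host.

```shell
# Print the file and process limits that the current session actually sees.
report_limits() {
    echo "open files (soft): $(ulimit -Sn)"
    echo "open files (hard): $(ulimit -Hn)"
    # /proc/self/limits is Linux-specific; it also lists the process limit.
    if [ -r /proc/self/limits ]; then
        grep 'Max processes' /proc/self/limits || true
    fi
}

report_limits
```

Run it as the user that will own the Nifi process: according to the limits.conf man page, wildcard (*) entries are not applied to root.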
You'll also need to allow additional TCP sockets, as that's how your flows will communicate with each other. Once we've got our flows set up and stable, these should be monitored in a production environment to make sure that capacity levels are set correctly.
In /etc/sysctl.conf add the following line:
net.ipv4.ip_local_port_range=10000 65000
I mentioned earlier that the process is likely to be memory bound. For performance reasons (and to avoid disk I/O becoming a bottleneck) we'll tell the kernel to avoid swapping wherever it can, so that we don't end up with weird behaviours. Note that this setting discourages swapping rather than removing the swapfile itself.
In /etc/sysctl.conf add the following line:
vm.swappiness = 0
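If you script this setup, it helps to make the edits repeatable. The sketch below uses a made-up helper, ensure_line, which appends a line only when that exact line is missing; the two settings are the ones from this post, and the real calls are commented out so nothing is changed by accident.

```shell
# Append a line to a config file only if that exact line is not already
# present, so re-running the setup does not create duplicates.
ensure_line() {
    file="$1"
    line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Intended usage as root:
# ensure_line /etc/sysctl.conf 'net.ipv4.ip_local_port_range=10000 65000'
# ensure_line /etc/sysctl.conf 'vm.swappiness = 0'
# sysctl -p    # reload /etc/sysctl.conf without rebooting
```

After sysctl -p, running sysctl vm.swappiness and sysctl net.ipv4.ip_local_port_range shows the values currently in effect.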
The final tweak is to look at the overhead of things like access time updates on files. Assuming that this instance is tailored specifically for Nifi (which it should be; 1 server per role and all that jazz), we can disable access time logging by editing /etc/fstab and adding noatime to the mount options of each of the volumes that will have high input/output for Nifi. Again, I'd probably recommend putting /opt/nifi in its own volume so the scope is clear (and you keep the atime logging for your generic system files & logs).
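As a rough illustration, an /etc/fstab entry for a dedicated Nifi volume might look like the commented line below, followed by a small check that the option is active once the volume is remounted. The device name /dev/xvdf and the ext4 filesystem are assumptions for the example, and has_noatime is a helper name I've made up.

```shell
# Example /etc/fstab entry (device and filesystem type are placeholders):
#   /dev/xvdf  /opt/nifi  ext4  defaults,noatime  0  2

# Succeed if the given mount point is mounted with the noatime option.
# The second argument (the mounts table) defaults to /proc/mounts and exists
# mainly so the function can be tried against a sample file.
has_noatime() {
    mountpoint="$1"
    table="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "noatime") found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$table"
}

# Usage:
# has_noatime /opt/nifi && echo "noatime is active on /opt/nifi"
```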
Package dependencies
Nifi requires a JRE, so run the following to get OpenJDK version 8 JRE:
apt-get install openjdk-8-jre
Installing Nifi
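To install Nifi in the base system, I'm going to home it in /opt/nifi/, so create that directory (mkdir -p /opt/nifi) and run the download and extract commands below from inside it.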
Download the latest binary from https://nifi.apache.org/download.html - which for this tutorial is 1.3.0.
wget http://www.mirrorservice.org/sites/ftp.apache.org/nifi/1.3.0/nifi-1.3.0-bin.tar.gz
Then extract the file:
tar zxvf nifi-1.3.0-bin.tar.gz
Once that's done, let's add our new nifi scripts to our PATH environment variable:
echo 'export PATH=$PATH:/opt/nifi/nifi-1.3.0/bin' >> /etc/bash.bashrc
You'll need to reload the shell; run /bin/bash to do so (or simply close the session and open a new one).
We can now install nifi as a service in Ubuntu by running nifi.sh install. This will make monitoring slightly easier, and allow us to start the service on boot. Unfortunately there's a bug in the service script which causes the following error when trying to enable it using systemctl:
systemctl enable nifi
insserv: There is a loop at service nifi if started.
Instead, you can start nifi without installing a service by running:
nifi.sh start
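If you do still want Nifi to start on boot, one possible workaround is a small native systemd unit that simply wraps nifi.sh. This is a sketch I haven't validated against this release: write_nifi_unit is a made-up helper, the unit contents are my assumption rather than anything shipped with Nifi, and the default path is the /opt/nifi/nifi-1.3.0 location used above.

```shell
# Write a minimal systemd unit for Nifi to the given path.
# Arguments: $1 = destination file, $2 = Nifi install dir (defaults to the
# /opt/nifi/nifi-1.3.0 location used in this post).
write_nifi_unit() {
    dest="$1"
    nifi_home="${2:-/opt/nifi/nifi-1.3.0}"
    cat > "$dest" <<EOF
[Unit]
Description=Apache Nifi
After=network.target

[Service]
Type=forking
ExecStart=$nifi_home/bin/nifi.sh start
ExecStop=$nifi_home/bin/nifi.sh stop

[Install]
WantedBy=multi-user.target
EOF
}

# Intended usage as root (commented out so nothing is written by accident):
# write_nifi_unit /etc/systemd/system/nifi.service
# systemctl daemon-reload
# systemctl enable nifi
# systemctl start nifi
```

Type=forking is chosen on the assumption that nifi.sh start launches Nifi in the background and then returns; if that turns out not to hold on your version, the unit will need adjusting.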
The next post will be on configuring nifi.properties and setting up a certificate store for client-side SSL authentication.