NiFi 1.0.0 Site-to-Site performance test

Key findings

  • Measuring the performance of a streaming application is difficult. GenerateFlowFile can be useful, but understanding NiFi backpressure and scheduling is important.
  • Push provides better load distribution than Pull.
  • Pull can provide the same level of throughput as Push, but latency is higher. Increasing the backpressure threshold is encouraged.
  • Fewer, larger flow-files provide better throughput than many smaller flow-files.
  • HTTP provides throughput identical to RAW Site-to-Site, but uses slightly more CPU resources.
  • Be careful with the Provenance repository max.storage.time; if it's too long for your use-case, the CPU will be busy rolling over provenance storage and other tasks can't be executed. Once the provenance storage accumulates too many journal files, its backpressure mechanism kicks in and holds a lock until old events are cleared (see the property sketch below).
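The nifi.properties entry in question is the provenance retention time; keeping it short makes rollover cheap. A sketch of what a safer setting might look like for this kind of test (the value is my assumption, not what was used here):

# Keep provenance retention short so rollover stays cheap (value is an assumption)
nifi.provenance.repository.max.storage.time=3 hours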

Environment

Master

  • EC2, m3.large
  • Ganglia gmetad
  • Apache HTTP server
  • Zookeeper

Nodes

  • EC2, m3.large

  • NiFi 1.0.0-SNAPSHOT

  • Java OpenJDK 1.8.0_101-b13

  • 4GB available. Set a soft limit of 2GB for NiFi data, since other data such as logs and indices also need to be persisted.

Data                    Limit   Config
Flow File Repository    0.5GB
Content Repository      1GB     Archiving disabled. e.g. 1KB * 1,000,000, or 1MB * 1,000
Provenance Repository   0.5GB
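These limits presumably map onto nifi.properties entries along the following lines; the property names are standard NiFi ones, but the exact values used in this test are my assumption:

# Content repository: archiving disabled, as noted in the table above
nifi.content.repository.archive.enabled=false
# Provenance repository capped at roughly 0.5GB
nifi.provenance.repository.max.storage.size=500 MB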

When 1MB * 1,000 flow-files are queued:

1007M  ./content_repository
 540K  ./provenance_repository
 2.6M  ./flowfile_repository
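The sizes above look like du output; a minimal way to collect them from the NiFi home directory (my assumption, the exact command isn't in the original notes):

du -sh ./content_repository ./provenance_repository ./flowfile_repository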

  • p.nifi
    • push-data-generator: GenerateFlowFile
    • relationship: backpressure thresholds of 1,000,000 objects and 1GB data size
    • RPG: to 'input'
  • q.nifi
    • Input Port: 'input'
    • relationship: backpressure thresholds of 1,000,000 objects and 1GB data size
    • push-data-terminator: UpdateAttribute

nifi.properties

$ diff nifi.properties nifi.properties.org |grep '<'
< nifi.remote.input.host=0.p.nifi.aws.mine
< nifi.remote.input.socket.port=8081
< nifi.web.http.host=0.p.nifi.aws.mine
< nifi.cluster.is.node=true
< nifi.cluster.node.address=0.p.nifi.aws.mine
< nifi.cluster.node.protocol.port=9091
< nifi.zookeeper.connect.string=0.master.aws.mine:2181
< nifi.zookeeper.root.node=/p.nifi.aws.mine
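The q.nifi cluster presumably mirrors these settings with its own hostnames; a sketch based on the naming scheme above (not taken from the actual q.nifi configuration):

nifi.remote.input.host=0.q.nifi.aws.mine
nifi.remote.input.socket.port=8081
nifi.web.http.host=0.q.nifi.aws.mine
nifi.cluster.is.node=true
nifi.cluster.node.address=0.q.nifi.aws.mine
nifi.cluster.node.protocol.port=9091
nifi.zookeeper.connect.string=0.master.aws.mine:2181
nifi.zookeeper.root.node=/q.nifi.aws.mine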

logback.xml

Default (unmodified).

Commands

Build

# Build the latest NiFi SNAPSHOT, based on 09840027a37c076f5df6239c669fc77315b761d9 with PR714 (cherry-pick 79521d8cd01c0675bd8bd4d6a9f9382e11ca9d6b)
git checkout master
git cherry-pick 79521d8cd01c0675bd8bd4d6a9f9382e11ca9d6b
nifi-clean-install
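nifi-clean-install is a local alias rather than a standard command; it presumably expands to an ordinary Maven clean build, something like this sketch (the exact flags are an assumption):

# Assumed expansion of the nifi-clean-install alias: a clean build skipping tests
alias nifi-clean-install='mvn clean install -DskipTests'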

How to Start a NiFi node

./request-spot-fleet master
./request-spot-fleet p.nifi
./request-spot-fleet q.nifi
./generate-hosts
# Add generated hosts
sudo vi /etc/hosts
./update-route53-records
# Update the hostname setting for the new node; this also starts gmond
./update-hostname 0.p.nifi
./execute-nifish 0.p.nifi restart
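The ./ scripts above are local helper scripts not included in this gist. execute-nifish presumably just runs bin/nifi.sh on the target node over SSH; a hypothetical sketch (host suffix and install path are assumptions):

# Hypothetical sketch of ./execute-nifish <node> <command>
ssh "$1.aws.mine" "~/nifi-1.0.0-SNAPSHOT/bin/nifi.sh $2"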

Provenance Repository rollover stacktrace

"Provenance Maintenance Thread-2" #41 prio=5 os_prio=0 tid=0x00007fc731dd2000 nid=0x1abb runnable [0x00007fc72f2f9000]
   java.lang.Thread.State: RUNNABLE
     at java.io.UnixFileSystem.getLength(Native Method)
     at java.io.File.length(File.java:974)
     at org.apache.nifi.provenance.IndexConfiguration.getSize(IndexConfiguration.java:333)
     at org.apache.nifi.provenance.IndexConfiguration.getIndexSize(IndexConfiguration.java:347)
     at org.apache.nifi.provenance.PersistentProvenanceRepository.getSize(PersistentProvenanceRepository.java:863)
     at org.apache.nifi.provenance.PersistentProvenanceRepository.rollover(PersistentProvenanceRepository.java:1371)
     at org.apache.nifi.provenance.PersistentProvenanceRepository.access$300(PersistentProvenanceRepository.java:116)
     at org.apache.nifi.provenance.PersistentProvenanceRepository$1.run(PersistentProvenanceRepository.java:258)
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

How to set up keystores and truststores?

  • p.nifi

    • keystore
    • truststore
      • q.nifi's cert
      • admin user's cert
  • q.nifi

    • keystore
    • truststore
      • p.nifi's cert
      • admin user's cert
# create private key and certificate
echo -n 'hostname:'; \
read k; \
openssl req -x509 -newkey rsa:2048 \
 -keyout $k.pem \
 -out $k.crt \
 -days 365 \
 -nodes \
 -subj "/CN=$k/C=US/L=$k"
# convert those into p12 key store file, in order to import those into java keytool keystore
echo -n 'hostname:'; \
read k; \
openssl pkcs12 -export \
 -in $k.crt \
 -inkey $k.pem \
 -out $k.p12 \
 -passout pass:"pfxpassword" \
 -name $k
# import p12 into java keystore
echo -n 'hostname:'; \
read k; \
keytool -importkeystore \
 -deststorepass keystorepass \
 -destkeypass keystorepass \
 -destkeystore $k-keystore.jks \
 -srckeystore $k.p12 \
 -srcstoretype PKCS12 \
 -srcstorepass pfxpassword -alias $k
# add nodes into truststore
i=0.p.nifi.aws.mine; \
k=0.q.nifi.aws.mine; \
keytool -importcert \
 -v -trustcacerts \
 -alias $i \
 -file $i.crt \
 -keystore $k-truststore.jks \
 -storepass truststorepass \
 -noprompt
# list keys
keytool -list -storepass truststorepass -keystore 0.q.nifi.aws.mine-truststore.jks
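The generated stores then have to be referenced from nifi.properties on each node. A sketch using NiFi's standard security properties; the file locations follow the commands above and the passwords are the example ones used there:

nifi.security.keystore=./conf/0.p.nifi.aws.mine-keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=keystorepass
nifi.security.keyPasswd=keystorepass
nifi.security.truststore=./conf/0.p.nifi.aws.mine-truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=truststorepass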

How to set up: Apache Web Server

Most of this setup was originally described in a StackOverflow question; I added a few commands to set up a forward proxy with authentication from scratch.

Install Apache Web Server:

sudo yum install httpd24

Create a file /etc/httpd/conf.d/proxy.conf:

ProxyRequests On
ProxyVia On

# Only 443 and 563 are supported by default, custom ports need to be added.
AllowCONNECT 443 563 8443

<Proxy "*">
  Order deny,allow
  Allow from all
  # AuthType Basic
  AuthType Digest
  # Specify auth realm
  AuthName "aws.mine"
  # AuthUserFile basic_password.file
  AuthUserFile digest_password.file
  AuthGroupFile group.file
  Require group usergroup
</Proxy>

Create group and password file:

# Create a group file:
vi /etc/httpd/group.file (Add following entry)
usergroup: nifi

# Create password files for basic and digest auth:
htpasswd -c /etc/httpd/basic_password.file nifi (I used 'nifi proxy password' as a password here)
htdigest -c /etc/httpd/digest_password.file aws.mine nifi (I used 'nifi proxy password' as a password here)

Restart Apache Web Server:

service httpd restart
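Before pointing NiFi at the proxy, a quick check with curl confirms that CONNECT with digest auth works. This is a sketch; the proxy port (80) and the target URL are assumptions based on this setup:

# CONNECT through the forward proxy with digest auth
curl -v --proxy-digest --proxy-user nifi:'nifi proxy password' \
 -x http://0.master.aws.mine:80 \
 https://0.q.nifi.aws.mine:8443/nifi-api/site-to-site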

How to set up: Apache Traffic Server

I used the following commands to install Apache Traffic Server, based on the Administration Guide:

curl -OL http://apache.claz.org/trafficserver/trafficserver-6.1.1.tar.bz2
tar xvf trafficserver-6.1.1.tar.bz2
cd trafficserver-6.1.1
./configure --help

sudo yum install gcc-c++
sudo yum install openssl-devel
sudo yum install tcl-devel
sudo yum install libxml2-devel
sudo yum install pcre-devel
sudo ./configure --prefix=/usr/local/ats --with-user=tserver

sudo useradd -M --shell /bin/false tserver
sudo usermod -L tserver
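The actual compile and install step isn't shown above; presumably it's the usual make sequence after configure:

# Build and install into the --prefix chosen above (assumed step, not in the original notes)
make
sudo make install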

DNS Server

In order to keep using the same environment while shutting EC2 instances down when they are not in use, I need to use host names instead of private IP addresses so that nodes can talk to each other. I originally tried to maintain hostnames in the hosts file, but it seems ATS doesn't use the hosts file; instead it supports HostDB. I ended up using a Route53 private hosted zone because it's easier to manage hostnames and IP addresses among the nodes in the environment.

I wrote a script to update Route53 records.
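The script itself isn't included here; updating a record in a private hosted zone comes down to an aws route53 change-resource-record-sets call, roughly like this sketch (the zone ID, record name, and IP are placeholders):

# Hypothetical single-record upsert; the real script presumably loops over all nodes
aws route53 change-resource-record-sets \
 --hosted-zone-id ZONEID \
 --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"0.p.nifi.aws.mine","Type":"A","TTL":60,"ResourceRecords":[{"Value":"10.0.0.10"}]}}]}'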

Authentication

While it's possible to configure an ACL by IP address out of the box [1], user authentication has to be added as a plug-in. There was a discussion on the TS mailing list [2] mentioning that TS doesn't support digest auth and that it should happen at the HTTP server. There was no complete auth plugin (only a sample) at that time, and I think that is still the case. There is an experimental plugin which redirects auth requests to the origin or an auth server [3].

[1] https://docs.trafficserver.apache.org/en/latest/admin-guide/security/index.en.html#controlling-access
[2] https://mail-archives.apache.org/mod_mbox/trafficserver-users/201305.mbox/%3CCAB1tU+cvLY_bJ6wz8YjfQF26NvV=uxse1QsXMv7jNMeS_Txi4A@mail.gmail.com%3E
[3] https://docs.trafficserver.apache.org/en/5.3.x/reference/plugins/authproxy.en.html

For the above reasons, I haven't tested user auth with ATS.

Be careful with Default ip_allow setting

The default ip_allow.config is configured something like below:

src_ip=127.0.0.1                                  action=ip_allow method=ALL
src_ip=::1                                        action=ip_allow method=ALL
# Deny PURGE, DELETE, and PUSH for all (this implies allow other methods for all)
src_ip=0.0.0.0-255.255.255.255                    action=ip_deny  method=PUSH|PURGE|DELETE
src_ip=::-ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff action=ip_deny  method=PUSH|PURGE|DELETE

The above setting allows every IP to GET and POST, but only localhost can perform PUSH, PURGE, and DELETE. This makes HTTP Site-to-Site fail in the middle of data transport, since HTTP Site-to-Site uses the DELETE method to finalize a transaction.

So, I've changed the setting as follows. Access to this proxy is protected by AWS Security Groups instead:

src_ip=::-ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff action=ip_allow  method=ALL
src_ip=0.0.0.0-255.255.255.255                    action=ip_allow  method=ALL

records.config

##############################################################################
# Specify server addresses and ports to bind for HTTP and HTTPS. Docs:
#    https://docs.trafficserver.apache.org/records.config#proxy-config-http-server-ports
##############################################################################
CONFIG proxy.config.http.server_ports STRING 8080
# Added this to use 8443
CONFIG proxy.config.http.connect_ports STRING 443 563 8443