Thursday, 28 March 2019

How To Configure TLS & Okta Authentication for Apache NiFi


NiFi cannot support any authentication mechanisms unless it is configured to utilize TLS. Note that following these instructions will also change the port that NiFi runs on to TCP 9443.
  1. Back up your existing file on each node in the NiFi cluster.
  2. Download the NiFi Toolkit for your release of NiFi @ and extract it on each node.
  3. Navigate to the toolkit’s bin directory (e.g. /opt/nifi-toolkit/bin/)
  4. Run to generate a truststore, keystore, and updated file with the following syntax:
./ standalone -f <path/to/current/> -n <server_fqdn>

  1. Copy the resultant keystore.jks, truststore.jks, and files to the NiFi instance’s conf/ directory.
  2. Restart NiFi with the following command:
bin/ restart


  1. Log in to Okta as an Application Administrator user
  2. Go to Applications -> Add Application
  3. Choose “Web” and click Next
  4. Specify a Name for the application
  5. For “Base URI” specify the URI of NiFi (e.g. https://<nifi_fqdn>:9443/)
  6. For “Login redirect URIs” specify https://<nifi_fqdn>:9443/nifi-api/access/oidc/callback
  7. Click “Done”
    1. Note: The UI may hang. If so, just go to the Applications page and your application should now appear
  8. Click on your application, and go to the “General” tab
  9. Copy the “Client ID” and “Client secret” values for later use.
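Both URIs entered in the steps above follow directly from the NiFi host name and the new TLS port. A minimal Python sketch of how they are composed, assuming a placeholder host name (nifi.example.com) and a helper name made up for illustration:

```python
def okta_uris(nifi_fqdn: str, port: int = 9443) -> dict:
    """Build the two URIs that the Okta application form asks for."""
    base = f"https://{nifi_fqdn}:{port}"
    return {
        "base_uri": f"{base}/",
        "login_redirect_uri": f"{base}/nifi-api/access/oidc/callback",
    }


# nifi.example.com is a placeholder, not a real host.
uris = okta_uris("nifi.example.com")
print(uris["base_uri"])
print(uris["login_redirect_uri"])
```

Generating the values this way avoids typos when the same application has to be registered for several NiFi nodes.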


  1. Identify the OIDC Discovery URL for your Okta instance
    1. This is a combination of your existing Okta Instance ID, and several static values. For example, if your Okta instance URL is, your Instance ID is

      Combine this with the following URL format to get the OIDC Discovery URL:

  2. Edit conf/
    1. Find the “# OpenId Connect SSO Properties #” section
    2. Set the value of “” to the value identified in Step 1.
    3. Set “” and “” to the values obtained in the “Create the NiFi Application in Okta” section.
    4. Save your changes and quit the editor.

      Example configuration:
# OpenId Connect SSO Properties # secs secs

  1. Edit conf/authorizers.xml
    1. Find the first <userGroupProvider> section, and update <property name="Initial User Identity 1"></property> to include the user ID of your desired administrator, e.g.
       <property name="Users File">./conf/users.xml</property>
       <property name="Legacy Authorized Users File"></property>

       <property name="Initial User Identity 1"></property>

    1. Find the <accessPolicyProvider> section, and update <property name="Initial Admin Identity"></property> to include the user ID of your desired administrator, e.g.
       <property name="User Group Provider">file-user-group-provider</property>
       <property name="Authorizations File">./conf/authorizations.xml</property>
       <property name="Initial Admin Identity"></property>
       <property name="Legacy Authorized Users File"></property>

       <property name="Node Identity 1"></property>

    1. Save your changes and quit the editor.
  1. Restart NiFi with bin/ restart.
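If the same edit has to be repeated on every node, the two identity properties can also be set with a script rather than by hand. The following is only a sketch: the XML is a trimmed-down stand-in for conf/authorizers.xml, the function name is invented, and admin@example.com is a hypothetical user ID. Note that it fills every property with a matching name, so check the result against a real file that has more than one provider section.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for conf/authorizers.xml; a real file has more elements.
SAMPLE = """<authorizers>
  <userGroupProvider>
    <identifier>file-user-group-provider</identifier>
    <property name="Users File">./conf/users.xml</property>
    <property name="Initial User Identity 1"></property>
  </userGroupProvider>
  <accessPolicyProvider>
    <identifier>file-access-policy-provider</identifier>
    <property name="User Group Provider">file-user-group-provider</property>
    <property name="Initial Admin Identity"></property>
  </accessPolicyProvider>
</authorizers>"""


def set_admin_identity(xml_text: str, identity: str) -> str:
    """Fill both identity properties with the same administrator ID."""
    root = ET.fromstring(xml_text)
    targets = {"Initial User Identity 1", "Initial Admin Identity"}
    for prop in root.iter("property"):
        if prop.get("name") in targets:
            prop.text = identity
    return ET.tostring(root, encoding="unicode")


# admin@example.com is a hypothetical user ID.
updated = set_admin_identity(SAMPLE, "admin@example.com")
print(updated)
```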

Tuesday, 24 July 2018

Draining the (Big Data) Swamp

Before we dive too far into this article, let's define a few key terms that will come up at several points:

Big Data - Technology relating to the storage, management, and utilization of "Big Data" (e.g. enormous amounts of data/petabyte scale).
Data Lake - A common term to refer to a storage platform for Big Data. In this case, let's assume Apache Hadoop (HDFS) or Amazon S3.
Data Warehouse - A large store of data sourced from a variety of sources within a company, then used to guide management or business decisions.
ETL - Extract, Transform, Load. The process of taking data from one source, making changes to it, and loading it into another location. The underpinning of Master Data Management, and basically all data movement.
Master Data Management (MDM) - The concept of merging important data into a single point of reference. For a lot of Big Data applications, this means tying data back to a single person or actor.

Now with that out of the way... why on earth is the name of this article talking about a Data Swamp? 

Coming from the RDBMS/data warehouse world, data normalization and MDM are critical parts of any project involving data. Generally a source system feeds in data that is "cleaned" (or normalized) to reduce duplication and inconsistencies, and that data is then loaded into a data warehouse which is used for reporting.

Let's say that you're handling data coming in from two ubiquitous Microsoft platforms - Active Directory and Exchange. Both of these have relatively consistent logging formats to track access, or at least to track behaviour, but where things get different quickly is who they say is accessing things.

With Active Directory logs (let's just say evtx format for the sake of argument), you'll see the account name as DOMAIN\User, or in some cases With Exchange, however, you're going to see data coming in under the user's email address or, if there's no authentication configured, whatever email address was specified as the sender... in some cases this is going to be however frequently it can also be or any other infinite number of ways that an organization chooses to reference email addresses.

To put this into a more real world example, at a company I used to work with my login for the domain was in the format of a country code and employee ID, e.g. CA123456, so AD would track me under that ID and only that ID. At that same company I had two email addresses -, and (short name and long name). See where this gets confusing!?

OK, now I understand that data is a mess... What's your point?

Now picture yourself on the receiving end of those logs and trying to piece this all together into a working scenario where you can say that I logged in at 8AM at my normal office place and sent out fifteen emails to thirty recipients... That's where MDM comes into play!

Despite the seemingly strange approach, there's usually a method to the madness in this sort of thing. Generally all of these items are referenced together in some place... generally an Active Directory user profile or some LDAP based equivalent. Easy to use, right!? Maybe...

Attributes in an LDAP profile can be public, meaning anyone who can do a lookup on the user can see them, or they can be private which means that you have to be authorized to see them. In a lot of cases the public attributes are only going to be group memberships, a real name, and the user name - this leaves a lot out to dry.

Let's assume, though, that we have privileged access and can now read all the attributes - we've got our source of truth! Fantastic! Let's ignore the logistics of trying to maintain up to date LDAP lookups on demand for tens of thousands of users, let alone hundreds of thousands, and move on... 

At this point a consulting team that gets paid lots of money is going to implement some sort of ETL pipeline that pulls data from the source system, does enrichment to tie this all to the "one true version" of a user, and then loads it into the Data Warehouse which will be used for reporting.
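To make that enrichment step concrete, here is a toy version of the pipeline in Python. Everything in it is an assumption made for illustration: the profile layout, the EXAMPLE domain, the email addresses, and the function names. A real implementation would pull the aliases from the directory and write to an actual warehouse rather than a list.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical directory profiles: one canonical ID plus every alias it appears under.
PROFILES = [
    {
        "canonical_id": "CA123456",
        "aliases": ["EXAMPLE\\CA123456", "jdoe@example.com", "jane.doe@example.com"],
    },
]


def build_alias_index(profiles: List[dict]) -> Dict[str, str]:
    """Map every known alias (case-insensitively) to its canonical ID."""
    index = {}
    for profile in profiles:
        canonical = profile["canonical_id"]
        index[canonical.lower()] = canonical
        for alias in profile["aliases"]:
            index[alias.lower()] = canonical
    return index


def enrich(events: Iterable[dict], index: Dict[str, str]) -> Iterator[dict]:
    """Transform step: attach the canonical user, or None when the actor is unknown."""
    for event in events:
        actor = event["actor"].strip().lower()
        yield {**event, "user": index.get(actor)}


# Extract: stand-ins for parsed Active Directory and Exchange log records.
raw_events = [
    {"source": "ad", "actor": "EXAMPLE\\CA123456", "action": "logon"},
    {"source": "exchange", "actor": "Jane.Doe@example.com", "action": "send"},
    {"source": "exchange", "actor": "spoofed@elsewhere.test", "action": "send"},
]

index = build_alias_index(PROFILES)
# Load: a list stands in for the warehouse table.
warehouse = list(enrich(raw_events, index))
for row in warehouse:
    print(row["source"], row["actor"], "->", row["user"])
```

The third event deliberately has no match. Keeping unresolved actors as None, instead of dropping the record, means the unauthenticated-sender case mentioned earlier stays visible in reporting.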

This all sounds great! We've identified how to enrich the data (ignoring the logistics of doing so), and now you've got a Data Warehouse you can report off of!

Absolutely, it is great, but here is where we get back to the title of this article! Everything I just described is extremely common in the relational database world (e.g. Oracle, MS SQL Server, etc...) for reporting with products like Cognos, Qlik, etc... but unfortunately it is a concept that seems to be routinely ignored in the Big Data world.

As noted earlier, a lot of people will take these logs from many places and store them in their Data Lake for future use/analysis/etc... While it's great that they get stored, what this means is that you've got likely petabytes worth of data that you're paying to store but are not using for anything and could not effectively use for anything that would not be extremely resource intensive to implement.

This is why Big Data implementations fail!

Big Data is expensive, and it's complex... it's the perfect storm of hard to do! With this said, all of the same concepts we talked about earlier still apply, they just have to be done differently! Running data through an ETL tool designed for an RDBMS that maybe operates on a total of 24 CPU cores and 128GB of RAM isn't going to cut it when you're dealing with enterprise levels of log files.

This is where I say it's expensive - not only do you need to have the expertise and knowledge of which tooling to use and how to use it, you need to be willing to pay to make it happen. In a lot of cases for an enterprise handling authentication logs, email logs, endpoint logs, firewall logs, etc... this is going to manifest itself as many dozens of CPU cores and hundreds of gigabytes of memory just to do the normalization. You can quickly see why there's hesitation to do this - who wants to spend upwards of half a million dollars on something that's just changing a user name in a log file!?

Unfortunately, without spending that money, those logs that are sitting in S3 (despite it being really cheap) are not really going to be useful if you want to use them down the road. This is where you wind up with the eponymous Data Swamp - a very murky lake full of data.

Many Data Lakes get set up as part of security initiatives so that there can be a postmortem on a breach, or even used proactively in the world of UEBA and Security Analytics, but without this MDM work (and in many cases even basic normalization) that company is going to wind up with the following challenges, to name a few:
  • Weeks or months of time spent identifying the characteristics and "shapes" of the data that is stored in the data lake.
  • The same cost that would have been incurred originally to now perform that MDM work (but under a time crunch).
  • Added costs and stress of trying to expedite an implementation of something to analyze the logs, or to have their security team(s) manually review them in a SIEM style tool like Splunk.
Really, it just circles back to what could have been done in the first place. Doing the work up front would have let that company put proactive tools in place and, beyond that, run immediate manual or rules-based analysis of the stored data without incurring all of the additional cost, stress, and time of doing it after the fact.
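The first challenge in that list, working out the "shapes" of what is already stored, can start with something as small as a field profile over a sample of records. A rough Python sketch, with made-up sample records and a made-up function name, that counts how often each field shows up and which value types it carries:

```python
from collections import Counter, defaultdict


def profile_shapes(records):
    """Count how often each field appears and which value types it carries."""
    field_counts = Counter()
    field_types = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            field_counts[key] += 1
            field_types[key][type(value).__name__] += 1
    return field_counts, field_types


# Hypothetical sample pulled from a data lake; note the inconsistent fields.
sample = [
    {"user": "CA123456", "ts": 1532419200, "src_ip": "10.0.0.5"},
    {"user": "jane.doe@example.com", "ts": "2018-07-24T08:00:00Z"},
    {"sender": "jdoe@example.com", "ts": 1532419260},
]

counts, types = profile_shapes(sample)
for field in sorted(counts):
    print(field, counts[field], dict(types[field]))
```

Even on three records this surfaces the kind of problem described above: the timestamp arrives in two formats and the actor is stored under two different field names.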

You're painting a pretty dire picture... Are all Big Data implementations like this?

NO! Thank goodness they are not, but the ones that fail all seem to have the same characteristics.
  • No willingness to spend (usually this turns into all of this data living in S3 because it's cheap).
  • Limited forethought into the end use case (e.g. SIEM, Security Analytics, etc...).
  • No dedicated team for the platform, and thus no one to enable the end use case.
The ones that are successful, however, tend to have done the following:
  • Brought in, or hired, the expertise to handle the data movement and transformation, and continue to engage with those teams if they are not permanent.
  • Implemented the Data Lake with a specific goal in mind, and identified the work required to make that happen.
  • Engaged with, or at least consulted, vendors that would enable the use case once the data was available. 

Let's wrap this up.

Is this technically easy? Absolutely not. It's not easy even in the RDBMS world where a few dozen gigabytes of sales data across 500 retail stores is considered a huge volume of data to report off of. What I'm saying at the end of this is that conceptually it's something that seems to get missed in a lot of Big Data implementations, and there are a lot of folks out there who have been blasting Big Data/Hadoop implementations for years despite the fact that a lot of those issues can be tied back to this at the end of the day.

If this struck a chord with you, and it's something you're having trouble with, keep an eye on this blog and the associated YouTube channel. We're going to be taking a look at a bunch of technology that enables this whole thing to work - Kafka, NiFi, Elasticsearch, Hadoop, etc... there's no shortage of platforms to let you do this work, and even choosing the "wrong" one generally won't hamstring you too much, but you have to make that leap to get started!

Sunday, 29 April 2018

Tutorial: Set up a Linux VM on Windows using Hyper-V

Today I'm going to walk you through setting up Hyper-V and getting a Linux virtual machine up and running!

I also referenced some potential future videos on Docker/containers; there's a great summary of what these are @ if you're not familiar. Past that, we'll definitely be taking a look at VirtualBox at a later time, and maybe WSL.

With this video I am not going to have a full text version, as instructions can be found elsewhere and we cover a few topics, so I'd encourage you to watch and use this as a reference. Some future videos that are more structured in their topic and/or approach will be accompanied by a full text format of the video!

Sunday, 22 April 2018

My Journey in the Tech Industry

In today's video, I chat about my journey in the tech industry, and try to give some insight on "non-traditional" ways of getting into it!

A hot topic these days in a lot of tech circles is how to get into tech as a career. There are lots of great channels out there with recent videos about this, including some of my favourites:
* CLM = Career Limiting Move

In most of these cases, these folks had pretty traditional paths into the industry in that they went to school for computer science/software engineering, whereas my experience has been a little bit different.

I've been into computers since as young as I can remember - at least 3-4 years old - and the first computer I ever had for myself was a Mac IIcx, from there a Performa 6400, and my last Mac (for a long time) was a Power Mac 7500/100 that was CPU swapped for a higher end 604e CPU and an ATI Rage 128 video card.

From there I moved on to Windows, and started dabbling into Linux. To date myself a little bit (not that bad), the Linux distros I started with were Red Hat, Mandrake, and Slackware. 

These days I'm really heavily focused on Hadoop and as a result spend most of my days working in Linux (via MacOS), mostly RHEL/CentOS. The rest of my screen time is split between MacOS (work), and Windows (fun & work).

On to jobs!

My first tech job was doing website updates for the school I was going to. Prior to that I'd done "web design", but that was mostly along the lines of Geocities/Angelfire level stuff... bgsound and blink were still valid HTML elements. This project was the first thing that really got me looking at the bigger picture of a career. I thought that I wanted to do web design/freelance design work from there.

Then, Christmas 2001, I got a copy of The Blue Nowhere by Jeffery Deaver. This book is a bit over the top, but it's a very entertaining read that I'd highly recommend if you're interested at all in cybersecurity and social engineering. This book got me started down the road I'm on now.

My first project after this was in 2003 with the launch of White Rabbit Lane with some friends. This was a collection of cybersecurity tools, white papers, and tutorials/articles that we had written or gathered (and credited) from other sources. This was a very minor project, but we of course were the "cool kids on the block"... at least in our circle, and I still run into people every so often that have at least heard of, or visited, this website despite the fact that I now live almost 3,000 km away from where it was founded. Crazy!

In 2005 I landed my first "proper" job as an intern doing IT and network security work at a startup where I'm located now. I spent the whole summer poring over firewall logs, setting up networks, and configuring IDS/IPS software like Snort.

While in school I ran my own business doing tech support just to learn stuff, and get the occasional free beer. This let me reach out to some other groups on campus, and landed me a job doing network analysis for the organization running the campus network at the school I went to. With this I got to do some basic pen testing, and a lot of traffic analysis to help resolve some issues they were having with DPI and packet shaping that was being done to try to combat piracy... which of course was replaced with people just using IRC and DC++.

Once I was out of school I joined a very large software company doing support for business intelligence/business analytics software. While support is usually frowned upon, this is a great way to get into the industry - this position allowed me to get a tremendous amount of experience working with a ton of operating systems and databases, not to mention learning the ins and outs of enterprise software and all of the "fun" that can come with projects to deploy it! I ended up staying there until about 2016 when I moved on to where I am now...

Now I work with a startup (< 100 people) that is heavily focused on security analytics. I won't go into any details around it, but you'll be able to find it easily enough! This job has really changed my view again on how I want my career to progress. Going from a company of 300,000+ to a company this size has really opened my eyes to how fun it can be to work at a small company because everyone is heavily invested in what they're doing and is able to get their hands on everything instead of being pigeon-holed.

At the start of this article I noted that I had a bit of a different path into this industry than may be considered traditional... While I did do some post-secondary education, it started out doing a history degree (which I quickly realized I hated) and I then moved on to doing a 2 year diploma program in Computer Systems where I was able to get a ton of hands on time with hardware, Linux/Windows in a server environment, and networking. Obviously I highly recommend this if you feel that computer science isn't necessarily your thing... it will definitely get you a foot in the door with many companies.

A lot of folks are sometimes afraid to get into a company by doing support since it can seem like it's a dead end, but the great thing with many companies is that you can use this position as a stepping stone into many other roles if you're a high performer. I know many people who started out in support, like I did, and have moved on to consulting, technical writing, project management, software design/development, etc... the possibilities are nearly endless!

Friday, 20 April 2018

Being Self-Reliant

In today's video, I discuss why asking for answers is NOT always the fastest way to solve a problem!

Let's talk about "self-reliance" as it relates to career and learning. The three aspects of this we're going to discuss are:

  • Research
  • Failure
  • Asking for Help


You're sitting at your desk, and you've got a piece of code that isn't working, or some other challenge... it may not even be technical! You're stuck and you've run out of ideas, so you need to turn somewhere else for answers.

The first instinct of many people in this situation is to simply ask someone else, but I virtually guarantee that others have already done that, and you can likely find your answer on Google! Obvious, right!?

The problem many people encounter when turning to Google (or any form of research like this) is that they will try to be too specific in what they are looking for, and a lot of the time it will result in little to nothing being found. What you can do, instead, is search for variations of your issue - for example, take parts of an error message or the error code so that you can get a broad understanding of what you could be looking at.

For tech purposes, Stack Overflow is a common resource that people will turn to directly, or which will come up in searches. Very frequently, you'll find an answer to your question on Stack Overflow, but it may not be the answer. By this I mean that the solution you find may work, but it may not be the most efficient way to address the problem. 

The goal of your research should not only be to find an answer, but to learn from the plethora of information you find so that you can determine the best approach forward.


Next up, we're going to talk about failure.

Isn't this what got us here to start!? Yes, but it's also the best way to learn!

Continuing with the example of some broken code - you've now done some research, found ten or twenty different ways to code around the problem, and now you need to implement those and find out:
  1. Do they work?
  2. Which is the best for your purposes?
  3. What can you change to make it work even better?
It may seem intuitive that you want to test your code before using it in the real world, but this meme exists for a reason:
[Image: "test in production" meme]

Work in your local (or cloud based) development environment, break things until you find what works best!
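Those three questions map nicely onto a tiny test harness: run every candidate fix against the same cases, then time them. The sketch below is only an illustration in Python. The problem (removing duplicates from a list while keeping the original order) and all three candidate functions are invented stand-ins for whatever your research turned up.

```python
import timeit


def dedupe_sorted_set(items):
    """Candidate 1: removes duplicates but returns them sorted, losing the original order."""
    return sorted(set(items))


def dedupe_ordered(items):
    """Candidate 2: keeps first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(items))


def dedupe_quadratic(items):
    """Candidate 3: keeps order, but rescans the result list for every item."""
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result


CANDIDATES = [dedupe_sorted_set, dedupe_ordered, dedupe_quadratic]
CASES = [
    ([3, 1, 3, 2, 1], [3, 1, 2]),
    ([], []),
    (["b", "a", "b"], ["b", "a"]),
]


def evaluate(candidates, cases):
    """Return {name: bool} saying whether each candidate passes every case."""
    return {
        fn.__name__: all(fn(list(given)) == expected for given, expected in cases)
        for fn in candidates
    }


results = evaluate(CANDIDATES, CASES)
data = list(range(500)) * 2
for fn in CANDIDATES:
    seconds = timeit.timeit(lambda: fn(data), number=20)
    print(f"{fn.__name__}: passes={results[fn.__name__]} time={seconds:.4f}s")
```

The first candidate "works" in the sense that it removes duplicates, but it fails the ordering cases, which is exactly the kind of thing you only discover by trying it. The timings will vary by machine, so treat them as a relative comparison only.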

Asking for Help

We've done research, and now we've tried, and failed, to solve the problem. On the plus side, we've learned a ton about the situation and some ways to address and avoid similar problems in the future, but are still stuck... it's time to ask for help!

Yes, you could have started out with this, but the reason it's the last step is that the goal here isn't only to solve a problem - it's to learn from solving it. The great part about having done your research, and having tried things, is that you've not only saved yourself time in the future but you're going to save time for the person you're going to be asking as well.

The first questions that you will get almost every time when asking for help are "What's the issue, and what have you tried?".

By having tried many things and researched before involving someone else and using their time, you can answer this with an educated statement around what has not worked, what may have gotten you part of the way toward an answer, and what you think might be the path forward.

Chances are that the person you're asking for help is going to follow, or has already followed in their past experience, the exact same steps. It's this experience, which you're now building yourself, that will let you work together to get to a quick and good solution.

I know this all seems like it's common sense, but you'd be surprised at how many people skip this process and just jump to asking other people - especially skipping the "Failure" step when some preliminary research doesn't turn up an exact answer. At the end of the day, yes, getting your solution was slower than just asking, but think of all of the time that you've now saved yourself in the future, as well as the time that would have been spent working through these same steps with someone else.