BigBear.ai

Blog

Introduction to KNIME

Kyle Eckenrode
March 6, 2019

Let’s say you’ve been tasked with pulling data from a variety of sources on a monthly basis, then manipulating and processing it to create insights. At first you might reach for a powerhouse like Informatica, but what if you don’t really need something that heavy and your budget is extremely limited? You need an ETL tool that gets you from point A to point B accurately and reasonably quickly, something that’s easy to learn but has the depth and flexibility for more advanced processing and customization. Enter KNIME.

Up front, KNIME presents itself as a tool designed for data analysts, emphasizing visual clarity and usability over raw power; where Informatica will perform multiple transformation types in a single Expression node, KNIME has a node for each step. This may be a turnoff for some coming from enterprise-level software, but the enhanced control KNIME offers is extremely valuable and can shave hours off your development and debugging. A major selling point of KNIME is that it lets you see how your data changes at every step of transformation. Because your processing steps are broken out into individual nodes, you can see exactly where and how things happen. Take, for example, this set of nodes that performs a simple row transformation:

Each node after the CSV Reader is in an executable state, represented by a yellow dot. The CSV Reader has executed successfully and has a green dot. We can open the result set and see what the data looks like:

Note that it pulled in the column headers and assigned data types without any additional processing. Though many steps are more manual, KNIME is very good at detecting what kind of data you’re working with: S is String, D is Double. You can alter data types with many of the other nodes in the repository, which is quite expansive:
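The type detection the CSV Reader performs can be sketched in plain Python. This is a minimal illustration, not KNIME’s actual implementation; the column names and sample values are hypothetical stand-ins for the data in the screenshot:

```python
import csv
import io

# Hypothetical CSV sample standing in for the file in the screenshot.
sample = """Measure,Value
alpha,512.0
beta,128.0
"""

def infer_type(value: str) -> str:
    """Guess a KNIME-style type code: 'D' for Double, 'S' for String."""
    try:
        float(value)
        return "D"
    except ValueError:
        return "S"

reader = csv.reader(io.StringIO(sample))
headers = next(reader)                      # headers come from the first row
first_row = next(reader)
types = [infer_type(v) for v in first_row]  # type per column from the data
print(dict(zip(headers, types)))            # {'Measure': 'S', 'Value': 'D'}
```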

By using the search bar you can even find nodes by keyword or by the type of work you want to do. If we want to work with column manipulation, we just search for “Column”:

And check out how many nodes it gives us to work with! This does add a learning curve in figuring out which node to use for a given process, but given time and a little Google-fu you’ll land on the best way to do things.

Back to our example, the next node is going to do a simple string manipulation.

Like Informatica, KNIME offers a series of functions we can use to manipulate strings. This one just converts the Measure column to uppercase. In the background KNIME uses Java to create these results, meaning that creating your own nodes is quite easy if you know Java. KNIME even comes with a plugin that lets you use your Eclipse IDE to create and test custom nodes, which we have made use of in integrating KNIME with Elastic and AWS.
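What the String Manipulation node does here can be sketched in plain Python: apply an uppercase function to every value in the Measure column. The table contents are hypothetical:

```python
# Hypothetical rows standing in for the table in the screenshot.
rows = [
    {"Measure": "alpha", "Value": 512.0},
    {"Measure": "beta", "Value": 128.0},
]

# Equivalent of applying upperCase($Measure$) in the node dialog.
for row in rows:
    row["Measure"] = row["Measure"].upper()

print([r["Measure"] for r in rows])  # ['ALPHA', 'BETA']
```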

After the String Manipulation completes, we can again check the results:

We then follow this up with a Row Filter that keeps only records with a Value over 500:

Fairly straightforward stuff here, but what if we wanted to use variables for our bounds? KNIME supports this with two kinds of variables: Workflow and Flow. Workflow variables are set on the Workflow itself, while a Flow variable is created and set during processing, very similar to what you would see in Informatica. We can see the status of our variables by checking the output tables of previous nodes and tabbing over to Flow Variables:

We’ve set a Workflow Variable called Lower_Bound with a value of 500.0. Now when we go into our Row Filter we can use that variable to set our lower bound in the Flow Variable tab of the Node Configuration dialog:
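The Row Filter with a variable bound can be sketched like this: keep only rows whose Value exceeds Lower_Bound. The variable name matches the workflow variable above; the data itself is hypothetical:

```python
# The Lower_Bound workflow variable, as set on the workflow.
lower_bound = 500.0

# Hypothetical rows reaching the Row Filter node.
rows = [
    {"Measure": "ALPHA", "Value": 512.0},
    {"Measure": "BETA", "Value": 128.0},
]

# Equivalent of configuring the lower bound from the Flow Variable tab.
filtered = [r for r in rows if r["Value"] > lower_bound]
print(filtered)  # [{'Measure': 'ALPHA', 'Value': 512.0}]
```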

Post-execution we get these results:

Now let’s say we want to manipulate that variable and add 500 to its value. We can use Variable nodes to alter the state of both our Workflow and Flow variables; these use a different input port type than a normal node. Note the red circle icons on this Math Formula node:

Red circles represent variable inputs and outputs and can be linked to and from other variable or normal nodes by their variable ports. These ports are normally hidden by the UI but can be dragged from if you click the upper-right corner of a node for output or the upper-left for input. You can also expose them by right-clicking and selecting “Show Flow Variable Ports”.

We can now connect a normal node to a variable node or pass variables between them. Here we’ll use this to connect to the Math variable node so that it can execute:

In the Math node we’ll add 500 to the Lower_Bound Workflow variable. This will not overwrite the value in our configuration; it only alters the value going forward in the workflow. We could use this to run another Row Filter or any other set of operations in parallel with the rest of the nodes, but we’ll continue on with the Transpose node:
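The distinction between the configured value and the value flowing downstream can be sketched as follows; the structure here is an illustration of the behavior described above, not KNIME’s internals:

```python
# The value as set in the workflow configuration.
workflow_config = {"Lower_Bound": 500.0}

# The Math Formula (Variable) node works on a copy flowing downstream;
# the configured value itself is untouched.
flow_vars = dict(workflow_config)
flow_vars["Lower_Bound"] += 500   # downstream nodes now see 1000.0

print(workflow_config["Lower_Bound"], flow_vars["Lower_Bound"])  # 500.0 1000.0
```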

The Chunk Size is the maximum number of columns the node will look to transpose. The default is 10, which is more than the number of columns we have going in, so we’ll keep it at 10. Post-execution we have this result:
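A transpose itself is simple to sketch in plain Python: rows become columns and columns become rows, much like the KNIME node (chunk size aside). The sample data is hypothetical:

```python
# Hypothetical two-column table reaching the Transpose node.
table = [
    ["ALPHA", 512.0],
    ["GAMMA", 768.0],
]

# zip(*table) pairs up the i-th element of every row, flipping the axes.
transposed = [list(col) for col in zip(*table)]
print(transposed)  # [['ALPHA', 'GAMMA'], [512.0, 768.0]]
```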

You can scroll left and right here; I’ve cut off the bottom to save space.
After examining the data, we notice that Country and Country Trigraph are always null, so let’s say we want to go back and remove these columns before transposing. We can add a Column Filter node right after our CSV Reader to accomplish this, and KNIME will handle resetting all of our downstream nodes as well as hooking up its inputs and outputs, as long as we drop it on the connection line:

The line is red as we hold the new node over it…

Choose OK…
All we need to do now is configure the node to filter out the columns:
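The effect of that Column Filter can be sketched like this: drop columns that are null in every row, like the always-empty Country and Country Trigraph columns above. The sample rows are hypothetical:

```python
# Hypothetical rows with two columns that are null in every record.
rows = [
    {"Measure": "alpha", "Value": 512.0, "Country": None, "Country Trigraph": None},
    {"Measure": "beta", "Value": 128.0, "Country": None, "Country Trigraph": None},
]

# Find columns that are None in every row, then rebuild rows without them.
empty = {c for c in rows[0] if all(r[c] is None for r in rows)}
rows = [{c: v for c, v in r.items() if c not in empty} for r in rows]

print(sorted(rows[0]))  # ['Measure', 'Value']
```

In the KNIME workflow the filtered columns are picked by hand in the node dialog; this sketch detects them automatically just to keep the example self-contained.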

After executing the rest of the pipeline, we end up with our final result:

If we wanted to consolidate this process, we can just make these steps into a Metanode:

We can enter this Metanode to see the processing steps and check the individual steps:

While this is all possible in bigger ETL software, it is certainly not as easy as KNIME makes it, and we can fully control how and when every step happens. Some jobs are suited for enterprise-level ETL, but when you need something lean and precise KNIME is a fantastic choice with minimal startup time. The installation process is quick and easy, and KNIME offers a great package of example and practice workflows. If you’re a Java developer there’s a lot to be gained from learning how to customize and create your own nodes, and there are many node packages created by other KNIME users for you to try out.

Posted in ETL, KNIME.