

Serverless Analysis of data in Amazon S3 using Amazon Athena through KNIME

Raja Chithambaram
July 31, 2019

This blog describes how to perform serverless analysis of data in Amazon S3 using Amazon Athena through KNIME. Let’s start with a quick introduction to Amazon Athena.

What is Athena?

Athena is a serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage. It supports ANSI SQL queries, including joins, JSON parsing, and window functions.

Figure 1 – Athena Architecture
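
For example, a standard SQL query such as the one below runs directly against files in S3 that have been mapped to an Athena table (the table and column names here are purely illustrative):

-- Aggregate over files stored in S3; access_logs is an illustrative external table
SELECT status, COUNT(*) AS request_count
FROM access_logs
GROUP BY status
ORDER BY request_count DESC;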

How to connect to Athena and execute SQL from KNIME?

KNIME can interface with Athena using the following nodes.

Figure 2 – KNIME Nodes

How to analyze data in S3 using Athena through KNIME?

A traditional approach is to download entire files from S3 into KNIME using a node such as the Parquet Reader. This approach works fine if the files are relatively manageable in size. But if you have large files in S3, downloading them consumes a lot of time and local memory, and processing is slow because of the data volume. Athena can help with these issues. Since Athena supports SQL processing on top of S3, all heavy-lift data processing such as joins, filters, and group-bys can be done in Athena (inside a Database Reader node), and only the processed result sets are downloaded into the KNIME workflow. This design pattern works well for large data analysis use cases. Here are some specific examples shown with and without Athena in KNIME.

Use Case 1: Querying partitioned data from S3:

This first example demonstrates how an Athena query can reduce network I/O compared with downloading whole files. The better approach is to use a Database Reader node that connects to Athena and runs SQL to retrieve only the required partition or subset of data using a WHERE clause.

Figure 3 – Querying S3
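
As a rough sketch, assuming an Athena table sales_data partitioned by year and month (table, column, and partition names are illustrative), the SQL placed inside the Database Reader node might look like this:

-- Pull back only one partition instead of downloading every file from S3
SELECT order_id, customer_id, amount
FROM sales_data
WHERE year = '2019'     -- partition columns: Athena scans only the matching S3 prefixes
  AND month = '07';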

Use Case 2: Joining and grouping S3 data:

Instead of downloading the data and joining and grouping it through KNIME nodes, it is far more efficient to do the joins and groupings in Athena. This typically results in better performance and lower memory consumption. Since Athena complies with ANSI SQL, all flavors of join, such as inner, left outer, and right outer, are possible.

Figure 4 – Joining & Grouping S3 Data
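
A minimal sketch of such a query, again with illustrative table and column names, pushes both the join and the aggregation to Athena so that only the summarized rows reach KNIME:

-- Join and aggregate inside Athena; only the grouped result set is downloaded
SELECT c.region,
       SUM(o.amount) AS total_amount
FROM orders o
INNER JOIN customers c
        ON o.customer_id = c.customer_id
GROUP BY c.region;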

Use Case 3: Processing, conversion and Transformation of S3 data:

Let’s say you want to process S3 data and perform operations such as data cleaning, conversion, de-duplication, and deriving new columns from existing columns. All of this data processing can be done in Athena without transferring the result set locally until other KNIME nodes need it.

In this scenario, each Database SQL Executor node performs an Athena CTAS (Create Table As Select) query. Use this CTAS design pattern to create a new table from the result of a SELECT statement on another table. Athena stores the data files created by a CTAS statement in a specified location in Amazon S3. Note that each CTAS results in new files being created in S3. Downstream KNIME database nodes can then interact with the newly created S3 data as an Athena table.

Figure 5 – Processing S3 Data
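
As an illustration of this pattern (the table names, columns, and S3 path are assumptions), a Database SQL Executor node could run a CTAS statement such as:

-- Clean and de-duplicate the raw data, derive a new column, and write the result
-- back to S3 as a new Athena table; nothing is downloaded into KNIME
CREATE TABLE clean_orders
WITH (
  format = 'PARQUET',
  external_location = 's3://bucket-name/clean-orders/'
)
AS
SELECT DISTINCT
       order_id,
       TRIM(customer_name)            AS customer_name,   -- basic cleaning
       CAST(amount AS DOUBLE)         AS amount,           -- type conversion
       CAST(amount AS DOUBLE) * 0.1   AS estimated_tax     -- derived column
FROM raw_orders
WHERE order_id IS NOT NULL;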

Use Case 4: Changing format of S3 data:

If you have S3 files in CSV format and want to convert them to Parquet, this can be achieved with an Athena CTAS query. For example, if CSV_TABLE is the external table pointing to CSV files stored in S3, then the following CTAS query will convert them into Parquet.

Figure 6 – Changing S3 Format

-- CTAS writes Parquet files to the external_location and registers PARQUET_TABLE
CREATE TABLE PARQUET_TABLE
WITH (
  format = 'PARQUET',
  external_location = 's3://bucket-name/key-name/'
)
AS
SELECT * FROM CSV_TABLE;
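
For completeness, a CSV_TABLE like the one referenced above could be defined over the source CSV files with a DDL statement along these lines (the S3 path and column definitions are assumptions):

-- External table mapping existing CSV files in S3 to columns Athena can query
CREATE EXTERNAL TABLE CSV_TABLE (
  order_id     string,
  customer_id  string,
  amount       double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://bucket-name/csv-data/';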

Use Case 5: Re-partitioning a dataset in S3:

Athena is priced by the amount of data scanned per query and uses partitioning to reduce the data that must be scanned when WHERE clauses are specified. Hence, partitioning data smartly helps control costs. To repartition S3 data, we need to create new S3 files for each new partition, which can be done with an Athena CTAS query. Using CTAS, we avoid downloading any data locally while creating the partitions we need.

Figure 7 – Re-partitioning S3

CREATE TABLE REPARTITION_TABLE
WITH (
  format = 'PARQUET',
  external_location = 's3://bucket-name/key-name/',
  partitioned_by = ARRAY['year', 'month']   -- partition columns must come last in the SELECT list
)
AS
SELECT * FROM PARTITION_TABLE;
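
Once the repartitioned table exists, downstream queries that filter on the partition columns scan only the matching S3 prefixes, which keeps per-query costs down. A minimal example:

-- Athena prunes to the year=2019/month=07 prefix and scans only that data
SELECT *
FROM REPARTITION_TABLE
WHERE year = '2019'
  AND month = '07';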
