
Blog

Data Refresh Techniques

Jim McHugh
December 26, 2019

The three keys to warehouse project success are refreshing the data in the warehouse quickly, efficiently, and consistently. Here are four ways to design the refresh process, some generally more successful than others.

Refresh the Dimensions, then the Facts

This approach attempts to ensure that the process completes with minimal negative impact should an error arise in the ETL workflow. We accomplish this by inserting/updating (upserting) the dimension data first, then moving on to the fact data. The reasoning is that modifying dimensions has no impact on the reports, which isolates the fact-loading workflow as the only place where a report-impacting error can occur. In most cases such an error will affect only a single fact table, provided incremental commits are used during the fact load. That said, there remains the potential for the customer to see an inconsistent view of the data while the workflow is still running. In short: quick and efficient, with the potential for inconsistency in data and reports.
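
As a rough sketch of the ordering, assume a simple star schema with a hypothetical DIM_CUSTOMER dimension, a FACT_SALES fact table, and staging tables feeding them; the refresh might look something like this:

  -- 1. Upsert the dimension first; existing reports are unaffected.
  MERGE INTO dim_customer d
  USING stg_customer s
    ON (d.customer_key = s.customer_key)
  WHEN MATCHED THEN
    UPDATE SET d.customer_name = s.customer_name, d.region = s.region
  WHEN NOT MATCHED THEN
    INSERT (customer_key, customer_name, region)
    VALUES (s.customer_key, s.customer_name, s.region);

  -- 2. Only then load the fact; this is the only step where a
  --    report-impacting error can occur.
  INSERT INTO fact_sales (customer_key, sale_date, amount)
  SELECT customer_key, sale_date, amount
  FROM stg_sales;

  COMMIT;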

Refresh the Partition

This approach is similar to the one above, except that the fact data is partitioned. Only the current (and sometimes the previous) partition is refreshed at any one time, by truncating the partition and running an insert-select. Because this approach limits the amount of data being transferred it should be fast, but there is still the possibility of inconsistency while the truncate/insert process is running.
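
As a minimal sketch, assume the same hypothetical FACT_SALES table is range-partitioned by month and the current partition is named P_2019_12; refreshing that slice might look like this:

  -- Empty only the current partition; the rest of the fact table is untouched.
  ALTER TABLE fact_sales TRUNCATE PARTITION p_2019_12;

  -- Reload just that time frame from staging.
  INSERT INTO fact_sales (customer_key, sale_date, amount)
  SELECT customer_key, sale_date, amount
  FROM stg_sales
  WHERE sale_date >= DATE '2019-12-01'
    AND sale_date <  DATE '2020-01-01';

  COMMIT;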

Rename the Tables

Renaming tables can be an effective way to minimize the risk of data inconsistency. In this process one creates a second copy of the tables (suffixed, for example, with an underscore) and makes all changes to those copies. Once the refresh is complete, a rename process is kicked off: the current table is changed to a temporary name, the underscore suffix is removed from the refreshed table so it takes over the name used by the reporting system, and finally the original table is renamed with the underscore suffix so it can be used by the refresh process during the next run. Because you’ll be using at least twice the disk space and increasing the time to refresh data, this consistency comes at a high server cost.
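
A sketch of the three-way rename, assuming the reporting table is FACT_SALES, the freshly refreshed copy is FACT_SALES_ (the underscore-suffixed copy described above), and FACT_SALES_TMP is a hypothetical temporary name used only during the swap:

  -- Step the live table aside under a temporary name.
  ALTER TABLE fact_sales RENAME TO fact_sales_tmp;

  -- Promote the freshly refreshed copy to the reporting name.
  ALTER TABLE fact_sales_ RENAME TO fact_sales;

  -- Park the old data under the suffixed name for the next refresh cycle.
  ALTER TABLE fact_sales_tmp RENAME TO fact_sales_;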

Partition Exchange

I try to ensure that my blogs are vendor-neutral, but occasionally one product offers something that no one else does. Partition exchange is an Oracle enterprise database offering and hits all three pillars (quick, efficient, and consistent) of an effective data refresh solution.

For a partition exchange process, one creates a partitioned fact table (partitioning is a requirement for any large data warehouse fact table anyway). I recommend partitioning the table by (date) range, but that is not required. Next, set up a swap table: it will not be partitioned, but it will be an exact structural replica of the partitioned table. The ETL workflow loads the non-partitioned table(s) with all of the information for that time frame. Once all of the fact tables are loaded, the exchange partition process is kicked off. This process essentially swaps the pointers between the partition and the data in the stand-alone table, and it takes a fraction of a second no matter the size of the data in the table and partition. There is an added bonus: if a query is run against these tables while the exchange is taking place, the query is held until the process completes, ensuring that the data is consistent. The final benefit is an easy fallback should there be an error in the data: to recover from a processing error, just run the exchange partition commands again and the data will be back to the way it was before the process started.
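
A minimal sketch of the exchange, assuming the same range-partitioned FACT_SALES table, its current partition P_2019_12, and a non-partitioned swap table FACT_SALES_SWAP built as an exact structural copy (all hypothetical names):

  -- Create the swap table as an empty structural replica, then load it via ETL.
  CREATE TABLE fact_sales_swap AS
  SELECT * FROM fact_sales WHERE 1 = 0;

  -- After fact_sales_swap is fully loaded, swap the data-segment pointers.
  -- This is a metadata operation, so it completes in a fraction of a second.
  ALTER TABLE fact_sales
    EXCHANGE PARTITION p_2019_12
    WITH TABLE fact_sales_swap;

  -- Fallback: running the same EXCHANGE PARTITION statement again swaps the
  -- pointers back, restoring the data to its pre-refresh state.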

This is my go-to solution when Oracle DB is the engine for the warehouse. When I use other tools, I will use some combination of the other processes based upon the functionality provided by the database and ELT software.

I hope you found this interesting, and I look forward to your feedback concerning your experiences loading large data warehouses.

Posted in Data Warehousing.