I have recently been brought on to a project involving a Qlik implementation for a client. Among the requirements for the project, the client needs to have the data refreshed every hour. As with many Qlik implementations, the Qlik segment is the final step in the ETL process with the data arriving to an Oracle DB first. By the time the ETL arrived to Qlik, only 30 minutes remained of the allotted hour. Here is the process for optimization that I followed.
Status on arrival: The number of tasks in the QMC were about 40. There were about 30 apps in the front end that needed reloading, most of which would reload in under a minute. However, there were several apps loading tables of over 250 million records with dozens of fields, and those apps would take well over an hour. The total time to reload everything took about 6.5 hours. Load time needed drastic improvement.
Below are some techniques for Extraction, Transformation & Loading improvement, I’ll touch on more Loading techniques in another post.
The main item I look for to improve extraction time is minimizing data transfer from the database. The team prior to my arrival had been mostly Oracle SMEs. They did an excellent job setting up the Oracle database for optimized extractions by setting up indexes for some incremental loads based on the date field & storing the files into monthly QVDs (format: TableName_MonthYear.qvd). This was a good starting point as the new monthly records on these tables typically did not exceed 2.5 million.
I was very happy to see there was already an initial QVD ETL layer in place. The advantage of extracting the Data into QVDs is that QVDs are much faster to load into Qlik than extracting directly from the DB. If a table is used in several apps, the extraction need only happen once & later loaded optimally in the apps. It also helps in assessing where the Qlik ETL process is getting choked up.
The first order of business was to analyze the QVD tasks & separate them into synchronous jobs. A typical Qlik implementations can easily run 4 synchronous jobs (check the QMC settings to set the number of synchronous tasks limit & evaluate the server size to ensure it can handle the load). There was only one QVF for extracting all of the data from Oracle, so I separated that QVF into 4 QVFs so that the extraction step can be multi-threaded. The caveat here is to ensure the dependent tasks have all the QVDs necessary for their loads.
Time to load data: 4 hours
As I was working on this, I contacted Rob Wunderlich to get permission to adapt a log analysis tool he had built for QlikView. I had attended the Qlik Master’s summit for four years (I highly recommend the course) and have used his analyzer to assist with QlikView script optimization for years. Since this was a Qlik Sense implementation & I could not install QlikView on the server for various reasons, I adapted the script for Sense.
With the log analyzer on the server, I was able to quickly identify which tables were taking up the most time to load (this can also be done by manually reading through the logs). I isolated some the additional tables that needed to be loaded incrementally from the database & moved on to transformation.
There are two ways a QVD loads into memory: optimized & unoptimized. The optimized method loads data from a QVD file into memory 10x faster than an unoptimized load. So the first thing I did was optimize the loads & limit the data being loaded into the system before starting slower transformation processes. As a rule of thumb, load the data directly into memory from the QVD with the only field changes being a new alias if desired, and use the exists( ) function to limit the data to a subset of the QVD based on values in a field.
An important item to note: when using “not Exists( )” in the “where” clause, if the field in memory has the same name as the field in the table that is being loaded, only the first record will be loaded from that field. This is because the value is appended to the field value list & now exists in the system. I prefer creating a table with a single field for exists( ) comparison, usually with an alias of ‘exists_fieldName’ to avoid this collision.
While it is common knowledge in the Qlik community that the Exists( ) function is the best way to load data, I found that the Exists( ) function works best while comparing to field of distinct values in memory. This came as a surprise to me & I set out to investigate by how much the load times differed. I did my analysis (see the code at the end of this blog) & found that the same table concatenating to another table will load 4-8x as fast when comparing to a field with a distinct value set. You can copy the code below, change the file connection variable & create the QVD to execute the sub and run the test on your environment.
Sidenote: The fastest way I found to generate a table with a field of distinct values is to loop through all values in a field. Code:
Load FieldValue('<Field>',iterNo()) as <Field>_exists AutoGenerate 1 While IterNo() <= FieldValueCount('<Field>'). It works great if there are only a few values while the table containing them has millions of records & can take a while to load a single field with a Load Distinct from the Resident table.
From here the process was simple: I loaded the QVD as an optimized load into memory, restricting values through a “Where Exists( )” clause and perform only transformations supported by an optimized load. Then I would load the table again using a Resident load to do the transformations that were unoptimized. This would speed up the process for the initial load, and later in the resident because there were less records to go through on that load (the unoptimized load from the QVD would go through the whole QVD in the unoptimized form, leading to more records & a longer load time).
There is another process worth mentioning that was extremely time & resource consuming. On one table of 94 million records (after the Exists filter had been applied) there was a URL field being created. This URL had a prefix, middle, and suffix part, and between each there was a field value inserted. Regardless of whether this is the best implementation, the load itself can be optimized. Adding a long string during a load process is resource-intensive. I have found that the best way of loading a scenario like this is to add the values as fields in the table by a Join statement and using the fields for concatenation. This method provides optimized performance for the load. In my case, my load would look like this:
Join (FactTable) Load * Inline [ URLPrefix,URLMid,URLSuffix Beginning,Middle,End ];
After applying all of these changes, the time for a full reload has been greatly diminished.
Time to Load Data: ≈ 1 hour
We are still over the 1 hour mark of the client’s requirements though and need to get it under 30 minutes. This is the best we can manage for a full reload, I’ll write about how our team went from a load time of 1 hour to roughly 15 minutes for reloads during the day by adding new records to a data model and the different methods of doing that.
Thank you for reading. I hope this high-level overview of my journey & research for Qlik optimization helps you in your development efforts.
//copy code into the Qlik script editor, save & refresh the page Set vPossibleUniqueKeys = 100000; Set vTableRows = 40000000; Set vRandomValues = Ceil(Rand()*$1); set vExistsValues = 60000; //change to desired file connection string Set vFileConnection = MyDocuments; ///$tab Sub Routine Sub IncrementalConcat(pFileName,pIncrementalField,PFunctionAddIncFieldToQVD,pScenario) Set pIncrementalSuccessful = 0; Let pControlStatement = Len('$(pQVDFieldNames)'); If pControlStatement = 0 then CheckFieldNames: NoConcatenate First 1 Load * From [lib://$(vFileConnection)/$(pFileName).qvd] (qvd); Let pQVDFieldNames = '[' & FieldName(1,'CheckFieldNames') & ']'; For i=2 to NoOfFields('CheckFieldNames') Let pQVDFieldNames = '$(pQVDFieldNames)|' & '[' & FieldName($(i),'CheckFieldNames') & ']'; Next i Let i =; Trace Fields:$(pQVDFieldNames); Drop table CheckFieldNames; End If If SubStringCount(pQVDFieldNames,pIncrementalField) = 1 then Switch pScenario Case 1 Trace Incremental Load ready, creating unique list of values; Let pNewField = PurgeChar('$(pIncrementalField)',''); //Exists clause works faster with distinct list of key values DataInModel: Load FieldValue('$(pNewField)',iterNo()) as Existant_Keys AutoGenerate 1 While IterNo() <= FieldValueCount('$(pNewField)') ; Let vStart = Now(); Let vStartnum = Num(Now()); Trace Start: $(vStart); Concatenate([$(pFileName)]) Load * From [lib://$(vFileConnection)/$(pFileName).qvd] (qvd) Where not exists(Existant_Keys,$(pIncrementalField)); Drop Tables DataInModel ; Let vEnd = Now(); Let vEndnum = Num(Now()); Let vTimeElapsed = Interval($(vEndnum)-$(vStartnum),'hh:mm:ss'); Trace End: $(vEnd); Trace Elapsed: $(vTimeElapsed); Set pIncrementalSuccessful = 1; Case 2 Trace Incremental Load ready, adding field to master table; Let pNewField = PurgeChar('$(pIncrementalField)',''); //Exists clause works faster with distinct list of key values Left Join ([$(pFileName)]) Load FieldValue('$(pNewField)',iterNo()) as Existant_Keys, FieldValue('$(pNewField)',iterNo()) as $(pNewField) AutoGenerate 1 While IterNo() <= FieldValueCount('$(pNewField)') ; Let vStart = Now(); Let vStartnum = Num(Now()); Trace Start: $(vStart); Concatenate([$(pFileName)]) //[$(pFileName)_temp]: Load * From [lib://$(vFileConnection)/$(pFileName).qvd] (qvd) Where not exists(Existant_Keys,$(pIncrementalField)); Let vEnd = Now(); Let vEndnum = Num(Now()); Let vTimeElapsed = Interval($(vEndnum)-$(vStartnum),'hh:mm:ss'); Trace End: $(vEnd); Trace Elapsed: $(vTimeElapsed); Set pIncrementalSuccessful = 1; end Switch End If If Not IsNull(pFunctionAddIncFieldToQVD) and pIncrementalSuccessful = 0 Then Trace Incremental Field $(pIncrementalField) not found, creating field; map_chr39: Mapping Load * Inline [ doubleQuote,Single ",' ]; Let pFunction = MapSubString('map_chr39','$(pFunctionAddIncFieldToQVD)'); [$(pFileName)_SaveIncField]: NoConcatenate Load *, $(pFunction) as $(pIncrementalField) From [lib://$(vFileConnection)/$(pFileName).qvd] (qvd); Store [$(pFileName)_SaveIncField] INTO [lib://QVD Repository/$(pFileName).qvd]; Drop Table [$(pFileName)_SaveIncField]; Let pQVDFieldNames = '$(pQVDFieldNames)|' & '[$(pIncrementalField)]'; ElseIf IsNull(pFunctionAddIncFieldToQVD) and pIncrementalSuccessful = 0 Then Call ThrowError('Inc field is not present in incremental QVD'); End If If pIncrementalSuccessful = 0 Then Call IncrementalConcat('$(pFileName)','$(pIncrementalField)') End If Let pQVDFieldNames=; Let pIncrementalSuccessful=; Let pFunction=; Let pNewField=; End Sub ///$tab Generate Test QVD (Run First) /* TestDataSet: Load Hash160($(vRandomValues($(vPossibleUniqueKeys)))) as TableKey, If($(vRandomValues(2)) = 1,'Fact','Test') as Type, $(vRandomValues(1000)) + 42000 as Date, $(vRandomValues(1000)) as number, $(vRandomValues(50)) as stateNum, Ceil(Rand()*1000,.001) as Amount AutoGenerate $(vTableRows); Store TestDataSet into [lib://$(vFileConnection)/TestDataSet.qvd](qvd); Drop Table TestDataSet; exit script //*/ ///$tab Run Test sub CreateTable TestDataSet: Load Hash160($(vRandomValues($(vExistsValues)))) as TableKey, If($(vRandomValues(2)) = 1,'True','Testing') as Type, $(vRandomValues(1000)) + 42000 as Date, $(vRandomValues(1000)) as number, $(vRandomValues(50)) as stateNum, Ceil(Rand()*1000,.001) as Amount Autogenerate $(vTableRows)*.1; End Sub //try with distinct value list call CreateTable Call IncrementalConcat('TestDataSet','TableKey',Null(),1) Drop Table TestDataSet; //try with temp field in main table (non-distinct) Call CreateTable Call IncrementalConcat('TestDataSet','TableKey',Null(),2) Drop Table TestDataSet; //try with temp field in main table with 2x the records (non-distinct) Call CreateTable Join (TestDataSet) Load * Inline [ DoubleRows One Two ]; Call IncrementalConcat('TestDataSet','TableKey',Null(),2) Drop Table TestDataSet; Exit Script ///$tab Optimized vs unoptimized loads //UnOptimized Let vStart = Now(); trace Start Optimized Load; Trace $(vStart); UnOptimized: Load * From [lib://$(vFileConnection)/TestDataSet.qvd] (qvd) where 1=1 ; Let vEnd=Now(); Trace $(vEnd); //Optimized Let vStart = Now(); trace Start Optimized Load; Trace $(vStart); Optimized: Load * From [lib://$(vFileConnection)/TestDataSet.qvd] (qvd) ; Let vEnd=Now(); Trace $(vEnd); //End Code