My Data Warehouse

Stockholm 2016-01-17
Recently I looked into some Data Warehouse systems. And I learned there is not anything on the market that compares with my Data Warehouse, I would not say my is better but it is very different from the others. Personally  I’m convinced my Data Warehouse concept is better, but that is what we all creators of Data Warehouses say I suppose.

Most Data Warehouse architectures are based on multidimensional cubes with a fact table in the middle. This model was conceived in the late seventies when the capabilities of the infrastructure were much weaker than today. Cubes have some intrinsic merits they give a logical structure to the data, have low redundancy and are economical with disk space. But there are disadvantages of cubes which outweighs the strengths. Cubes are hard to understand and decode. Very few non-specialists can extract data directly from cubes. Cubes are complex to maintain, and they are slow to access. I prefer simple two dimensional tables containing all data needed for a report. Tables are simple to understand and easily accessible from spreadsheets which are in every business user's toolbox. Data in my warehouse is organised in tables not in cubes.

Most data warehouses store data in relational databases which are accessed and manipulated via SQL statements. For some reason most Data Warehouse system goes to great lengths to conceal SQL from the Data Warehouse admins, instead they invent their own access language or extend a common programming language with their own access libraries. I created a workflow engine and scheduler which focus on executing jobs with a minimum of ‘macro code’, data manipulation is done directly by SQL. The entire ETL process is written in SQL, instead of an obscure language only insiders understand.

ETL processes of Data Warehouses often process very large amounts of data, parallel process workflows and split workflows into parallel ‘subflows’ is of paramount importance. My implementation of parallelism via piggyback iterators, is the simplest and most elegant parallel programing algorithm I’m aware of and it is a far cry from other traditional parallel programing idioms.

My Data Warehouse: Two dimensional tables, SQL and simple sound parallelism.


Balanced parallel execution

I have written some posts on the importance of parallel execution and how to program parallel  integration workflows with my Integration Tag Language. A workflow called a <schedule> often consists of a number of workflow steps called <job>.  Here is a ‘bare bone’ schedule with  2 jobs processing a file with 100 rows:
When this schedule executes these jobs execute one by one in sequence. If  the jobs are independent of each other it would be better to execute them in parallel, this is achieved by adding the parallel attribute to the jobs, like this:

If job 1 takes 1 minute and job 2 takes 10 minutes then you have not gained much by parallel process the two jobs, the jobs are unbalanced from a parallelization view. If we break down job two into 10 parallel sub processes the 2 jobs are balanced. You do this by introduce a job iterator and piggyback connect the myiterator to the jobiterator, like this:
Here the jobiterator cuts the file into 10 pieces and parallel execute them all, each chunk is given to a myiterator. In theory the two jobs should now execute in 1 minute and this is what I refer to as parallel balanced workflow. The wall clock time is reduced by 82%, with very little effort.
In real life not many workflows behaves as nicely as this one, you may have to adjust the chunks a bit to achieve about equal execution time.


Happy New Year

This was supposed to be a rather lengthy post, but zscaler was tired today, making me waiting for gateway.zscaler.net for to long, and now time is running out but...

Another year have past again. At work I didn’t accomplish at all what I had hoped for. But I have done some other things, like helping out with evaluating software and some integration problems. I did not do one line of JavaScript coding which I really had hoped to do, this is only partly due to lack of time I hadn’t any good projects for javaScript and I do not code just for the sake of coding. I didn’t buy a raspberry Pi same thing there I didn’t have a good project for a raspberry Pi. I didn’t try Perl6. There are a lot of things I didn’t do. But I tested the Amazon infrastructure learned a few thing about Azure cloud. I managed to create a travel expense report by myself, I never done that before so I’m happy for that.   
Two colleagues have left this year, that is never fun. A data warehouse specialist left and that was not only un-fun it was almost painful, I’m still emotionally attached to my old data warehouse and those who work there.
Actually the zscaler just went bananas, it blocked my internet access with this prompt
I would never type in a password in a frame like this. Need help contact your IT team, this is the 31 of december at 16:45 do not think anyone will answer.
Years end is the most IT-important day of the year, and you should make an extra effort to avoid anything that can interrupt or disrupt IT services this and next day of the year.


Extracting SAP projects with BAPI - 3

I am restructuring some long running Data Warehouse extraction workflows. In total these workflows takes some 4 hours today, which is ridiculous. One part of restructuring the workflows is to use modern Integration Tag Language idioms, so newcomers  can understand, the present archaic syntax is a bit tricky. I have rewritten the second part in much the same way as the first.
So far I cut down execution time from 4 hours to 30 minutes. This is achieved by parallelizing the jobs running the BAPIs. I have rewritten the last parts of the workflow much in the same way I rewrote the first part.
The result is good, but not good enough Still the runtime is dependent on the amount of objects defined in SAP, in a few years when projects have doubled so will the runtime. I strongly advocate full load over delta load, since full load is much simpler to set up and is self healing if something goes wrong. But here is a case when full load is not performant enough, 30 minutes and growing by the day. I will rewrite these workflows from full load into a hybrid delta load where I aim at a more stable run time below 10 minutes.    

One job in the last rewrite is of interest: SAP information BAPIs are structured with one BAPI giving a list of keys to all objects and then you have an army of BAPIs giving detailed information about individual objects. BAPI_NETWORK_GETINFO is a bit different it takes an array of network identities and respond with detail info of all objects in one go, here the € list operator comes to the rescue, it  takes a PHP array and reformats it into a RFC import array.

The BAPI_NETWORK_GETINFO is run once for all networks in sequence.

  1. The <forevery> job iterator creates the iterator from the NWDRVR mysql table.
  2. Then runs the BAPI_NETWORK_GETINFO BAPI for each row in the SQL result table one by one. (Addressed by @J_DIR/driver1)
  3. Lastly stores all results in corresponding MYSQL tables   

If the list of network objects is large enough you have to spit the array into chunks and execute them separate to overcome SAP session limits and performance problems. We have some 9000 projects and that is to many in our environment to execute in one go.
A small rewrite of the job will split the SQL result into 8 chunks and distribute them over separate workers and execute the in parallel:

Here BAPI_NETWORK_GETINFO is run in 8 parallel workers.

  1. The <forevery> iterator splits the SQL result into 8 chunks, each chunk is executed by a separate worker in parallel
  2. Each worker then runs the BAPI_NETWORK_GETINFO BAPI for each row in the SQL result table of the worker one by one. (Addressed by @R_DIR/driver1)
  3. Lastly each worker store all results in corresponding MYSQL tables

With this slight modification the run time for the job is cut by a factor of 8. This is really parallel programing made easy. Compare this with visual workflow programing so popular today, I think you will agree with me this is easier to set up.


What's in a return code?

The last week I have mused on ‘What’s in a return code and what it is good for’. It started with the innocent question:
‘How do I see which job in a workflow that bombed out’?
‘The first job with result equals zero’.
‘There is no zero result, there are only ones and nulls’.

My job scheduler return codes are boolean 1=success, 0=failure’. It is not entirely true, the return code can be NULL, which normally means not executed yet. I decided to take a look in the log:

The first job without a return code is trunc_dsaldo, up until trunc_dsaldo all jobs have executed successfully (result=1), it turned out trunc_dsaldo was successfully bypassed, the boolean return code does not really allow for a third ‘bypassed’ condition. The registration of a bypassed job is bypassed altogether so it is impossible to tell a bypassed job from a not executed job.  

I like boolean return codes.  Either a job executes successfully or not, it could not be more simple if it were not for the bypass condition. In this particular case it was the next job dsaldo_read who failed, due to an infrastructure fuckup the job failed and the connection to the database log table was lost, so it could not register a failure. A very unlikely situation, but nevertheless it happened.  

What is the a return code good for?
The most obvious reason the return code should tell the result of a job? In this case it does not do that well. You can argue the result of a bypassed job is unknown and should be left with a Null return code, but you can also say it was successfully bypassed and should qualify for a successful return code, but a bypassed job can be seen as a failure. Right now I lean towards giving bypassed a unique non zero return code but keeping the boolean type. This approach keeps the boolean simplicity but has a side effect it indicates the job was successfully bypassed. I still do not know if this is a good thing or not. I have to scrutinise some unnecessary complex code carefully before I make any changes. If I decide to change the code I will rewrite job related ‘return code’ code, since it has been subject for some patching during the years.
Another and maybe the most important function of a return code is testability, for successor jobs to test the outcome of a predecessor, that has already been taken care of, you can set up a job prereq testing the outcome of a predecessor job example:
<job name=’successor’...>
  <prereq type='job' predecessor='previousJob' result='success’ bypassed='ok'/>  
The successor job will run if the execution of previousJob was a success or previousJob was bypassed.
But the job return code has not got the attention it deserves, it’s a nice way to say there is some odd logic and bad code lurking in my job scheduler concerning return codes. Maybe return code should be a class. I’m not much of an OO fan, but return codes are important and maybe deserves a class of it’s own.


Extracting SAP projects with BAPI - 2

In the previous post I described how we cut the time of extracting projects data from SAP. In this post I will show the new improved workflow coded in Integration Tag Language where workflows are coded in schedules. The first part of the project's schedule looks like this:

This is just initialisation of constants and logic needed for the execution of the schedule. The second part:

Extracts the PROJ table from SAP and creates an iterator with all projects. The last of these jobs ‘generate_driver’ creates a PHP array called driver0 which is the iterator used to drive the BAPI jobs. The first set of BAPI jobs are the quick ones that runs on less than 10 minutes:

These jobs run in parallel, you only insert a parallel=’yes’ directive on the job statement. The iteratorn from the generate_driver job is declared by the <mydriver> tag and iterates the BAPI once for each project in the array0 iterator. The id of the project is transferred via the @PROJECT_DEFINITION variable, as you can see inside the <rfc> section. In the <sql> section the wanted tables are declared, default are all tables, BAPIs often creates a lot of ‘bogus’ tables so we explicitly state which tables we like to import to the Data Warehouse. The next job is a bit more complex it is the long running BAPI_PROJECTS_GETINFO job:

In part one we decided to distribute the workload over 9 workers, by doing so we need to truncate the tables upfront since we do not want our 9 workers to do 9 table truncations. First we create a dummy job as a container for our jobs, which we declare parallel=’yes’ so the job run in parallel with the preceding jobs. Inside the dummy job there is a table truncate job and subsequently the BAPI extraction job. Here the iterator array0 is defined with the <forevery> tag, the iterator is split up in 9 chunks which all will be executed in parallel. The rows in each chunk are transferred as before by the <mydriver> iterator which is given a chunk by the piggyback declaration. If you study this job carefully you will see there are some very complex processing going on, if you want a more detailed description I have written a series of posts on parallel execution. I am very happy about the piggyback coupling of the iterators, by joining the iterators a complex workflow is described both succinct and eloquent.
The 5th and last part of the schedule shows a job similar to the one just described, this time we only need to run the BAPI job in two workers:

If you take the time and study this ITL workflow you will find there are some advanced parallel processing in there reducing the run time from about one and an half hour to less than 10 minutes. But so far we have used brute force to decrease the run time, by applying some amount of cleverness we can reduce the time even further and make it more stable. I hope to do this another weekend and if I do I write a post about that too.

Extracting SAP projects with BAPI - 1

Some years ago I was asked to extract project information from SAP for reporting/BI purposes. I decided to base the extraction solely on BAPIs. I wanted to test the BAPIs thus avoiding writing ABAP code and/or tracing what SAP tables containing project info. It sounded like a good strategy no SAP development just clean use of SAP premade extraction routines.  It turned out to be quite a few BAPIs I had to deploy for complete extraction, first I started with the project list BAPI:
BAPI_PROJECTDEF_GETLIST to get all projects (if you are not familiar with BAPI extraction read this first). Then I just had to run all the other BAPI one by one for each project:

In the beginning it was fine running these BAPIs in sequence, very few projects only one company (VBUKR) using projects. Last time I looked it took about 30 minutes to run the routine, it was a long time but what the heck 30 minutes during night time it’s not a big deal. Last week I had a call from present maintainers of the Data Warehouse, “Your project schedule takes hours and hours each night. The code is a bit ‘odd’, can you explain how it works, so we can to do something about it”. To understand the ‘archaic’ code in the schedule first thing I had to do was to clean it up, replacing obsolete idioms with more modern code constructs others could understand. Then I split the original schedule into smaller more logical schedules, the first one consisting of:

took more than two hours to run.  A look into the projects data showed 16000+ projects belonging to more companies than I created the extraction for. Now we replaced the BAPI_PROJECTDEF_GETLIST with direct extraction of the SAP PROJ table selecting only the interesting company about 8000 projects and run the BAPIs in parallel  this brought down the execution time to about 1 hour 20 minutes. Analysing job statistics showed the three first BAPIs only took little more than 500 seconds each, BAPI_PROJECT_GETINFO 5000 seconds and finally BAPI_BUS2054_GETDATA about 1000 seconds. Distributing  BAPI_PROJECT_GETINFO on 9 workers and BAPI_BUS2054_GETDATA on 2 workers should make all BAPI execute in between 500 to 600 seconds. This is a balanced scheme and the execution time is acceptable, from over 2 hours to 10 minutes. In the next post I will show the new improved execution schedule.