Replicate Business Intelligence Data across the globe

Working in a multinational company you sometimes end up in places you never dreamed of. I'm writing this from a hotel room in Osaka, Japan. We have a factory here, and we are implementing SAP for purchasing, logistics and production planning. Accompanying SAP is our Business Intelligence system, the Data Warehouse (which is my responsibility). When we tested response times from the Osaka plant 'Fuji Air Tools' half a year ago, they were slow, and I was asked if we could do something to speed them up. I hope this is easy: we just install a local Data Warehouse replica in the Osaka plant. Any decent workstation PC will do, since we will just replicate some MySQL databases over to Osaka. My plan is to use a very simple rsync replication scheme. Each day, when the time is due, we simply take the 'Osaka database' offline and rsync it from the master in Stockholm.
Why not set up real-time replication? I have tried this a few times, but it is more complicated. Live replication must be monitored, and it will inevitably fail at some point in a way that requires manual intervention. My rsync replication is simple and self-healing if it should, God forbid, fail (just rerun the rsync).
I hear people say you cannot backup/copy/replicate a live database like that: you must at least flush the tables and block updates on the master during the rsync, otherwise you will have a corrupt replica. But I will rsync until there are no differences between the replica and the master. This is a BI system and updates are mostly controlled background jobs. My rsync scheme will work well.
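A minimal sketch of the scheme, assuming hypothetical host and path names (the real master is in Stockholm): rsync repeatedly until a pass transfers nothing, which is when the replica has caught up with a quiet master.

```python
import subprocess

# Hypothetical host and directory names -- adjust to your environment.
MASTER = "dwmaster.stockholm.example.com:/var/lib/mysql/dw/"
REPLICA = "/var/lib/mysql/dw/"

def build_rsync_cmd(src=MASTER, dst=REPLICA):
    # -a preserves permissions and timestamps, --delete keeps the replica
    # an exact mirror, --stats lets us see whether anything was transferred.
    return ["rsync", "-a", "--delete", "--stats", src, dst]

def rsync_until_stable(run=subprocess.run, max_passes=5):
    """Re-run rsync until a pass transfers no files, i.e. the replica
    is identical to the master and the master was quiet meanwhile."""
    for _ in range(max_passes):
        result = run(build_rsync_cmd(), capture_output=True, text=True)
        # Recent rsync versions print this line in their --stats output.
        if "Number of regular files transferred: 0" in result.stdout:
            return True
    return False
```

The `run` parameter is only there so the loop can be exercised without a real remote host; in production you would call `rsync_until_stable()` as-is, after taking the Osaka database offline.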

Doers and Thinkers.

You can divide Business Intelligence users into Doers and Thinkers. Doers log on early in the morning and use the BI system for their daily job. They run their reports; in a factory, for example, production planners and logisticians may analyze component shortages and adjust today's production accordingly. Thinkers, on the other hand, come in later and work with issues like 'What happens if we build a new plant in…'. And they tend not only to browse through a lot of data, they also pick up a lot of data.
I am of the opinion that BI systems should be on Local Area Networks, not Wide Area Networks. BI systems shuffle a lot of data; response times will always be better on a LAN, and the bandwidth can be put to better use than repeatedly sending BI analysis queries. This is what the Thinkers do: 'What happens if I change…', and then the Thinker resubmits a similar query once again. And Thinkers' result tables are often big.

Scared of airports, not of flying.

This is not at all what I intended to write. I intended to write about my adventures in airports, under the title 'scared of airports, not of flying'. Coming to an airport is like descending into hell; I probably have a pulse rate like a soldier in combat. For me everything goes wrong. This time I started by losing my boarding cards. It is not until I am on the airplane that I can calm down.
You know these self-check-in machines where you are supposed to type in your booking number, e-ticket number or whatever? I NEVER type the right number. I type the form number, the date or something else that these stupid machines do not recognize as me.
I do not like airports. It seems to be genetic.


User triggered SAP data loading into a MySQL database from MS Excel

A strategy for real time reporting and loading of Business Intelligence Layer.

From a button in an Excel sheet, a signal is sent to a Linux ETL system, which starts a process to fetch 'delta data' from an SAP system and update the Business Intelligence data storage, in this case a MySQL database.

In a previous post I have written about the importance of performant ETL processes, see When fast is not enough; you may also want to look at extracting data from SAP and schedule jobs with PHP. In this post I will take this a bit further. Here I show how you can do BI real time reporting with the help of a performant ETL and Visual Basic in MS Excel.
The example this post is based on is a Proof of Concept and there are still details not fully worked out. But testing so far has been successful.

The merge delta load method

A separate business intelligence (BI) system and real time reporting is easier said than done. Since the BI data is physically separated, there will always be latency: the time it takes to transfer data from the source system to the BI layer. The only way to achieve true real time reporting is to do the reporting in the source ERP system. But there are strategies to minimize latency and achieve pseudo online reporting, where the latency is acceptable for the users.

Normally you pull data from your source ERP systems at regular intervals, e.g. each night, week or month. To approach real time reporting you can increase the pull frequency; extracting data each hour gives better reconciliation than once a day. However, the strain on the computer systems and networks increases with the pull frequency, and this method still does not give online reporting.

The merge delta load method first fetches data from the BI system, then fetches the delta data from the source ERP system, and then merges the two data streams into the report. Although this method gives good real time reporting, it has several disadvantages. It adds complexity to the client, since the client must be able to contact the ERP system and merge the report with the ERP delta data. Each time the report is requested, the delta data has to be extracted from the ERP system, which means all reports suffer the response time delay of retrieving data from the source ERP system. And since the BI layer is not updated, the delta load grows over time, until the next BI layer update.
Figure 1a
Figure 1a shows this scenario. If we now add a user, as in Figure 1b, we double the delta extraction from the source ERP system, as each user has to fetch the data for his/her report(s).
Figure 1b
With a large number of users this method may extract the same data over and over again from the ERP system. If we add another source ERP system, as in Figure 1c, this solution becomes very messy. Now the BI client must be able to connect to different source systems and extract and merge all data streams with the base report. The merge delta load method is not very elegant and scales poorly.
Figure 1c
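The client-side merge the method requires can be sketched as follows. The data shapes and field names here are made up for illustration: rows keyed by a document id, where a delta row from the ERP system overrides or extends the base rows fetched from the BI layer.

```python
def merge_delta(base_rows, delta_rows, key="doc_id"):
    """Merge ERP delta rows into the rows fetched from the BI layer.

    A delta row with the same key replaces the (stale) BI row;
    a delta row with a new key is appended.
    """
    merged = {row[key]: row for row in base_rows}
    for row in delta_rows:
        merged[row[key]] = row  # the fresh ERP copy wins over the BI copy
    return list(merged.values())
```

Note that the client has to repeat this merge for every report request and for every source ERP system, which is exactly why the method scales poorly.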

Another approach – user triggered Delta Load

In this example I have taken another approach to real time reporting. The merge delta load method is complex and requires special client software to delta load and merge the different data streams. Another approach is to run the ETL process more frequently, but that potentially creates a lot of overhead by running lots of delta load ETL jobs without any purpose at all. Instead of firing off frequent loads, we let the user trigger the ETL. This solves the problem of running meaningless ETL data loads, since we only run the ETL on user request. This way all users benefit from the delta load; data is only extracted once. Instead of retrieving delta data from the source ERP systems, the client sends a signal to the ETL system to start a delta load process, and when the ETL process is finished it requests the report (see Figure 2a).
Figure 2a
This request triggered update will create a report that is almost real time, provided the ETL process is fast. Unfortunately this method requires event based VBA programming in the Excel client, which is currently above the skill level of the author. There is a simple yet elegant way to overcome this technical problem: give the user an update button and let the user trigger the update.
User triggering simplifies the real time reporting process, but the workflow can be depicted the same way as request triggering; the only difference is that data extraction from the source systems only occurs when requested by the user. When we add a user (Figure 2b) we start to see the benefits compared with the merge delta load (Figure 1b).
Figure 2b
There is no double delta loading, since each data extraction updates the common BI layer only once, no matter how many users are added. Adding source systems does not change the picture much, see Figure 2c; we still have a simple and clean real time reporting process. On the other hand, user triggering requires a performant ETL process, otherwise the response time will be slow and the report will be old and not real time. User triggering also requires strict synchronization, isolating individual ETL updates from each other. Below I will show how I have implemented user triggering for real time reporting.
Figure 2c
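The triggered flow in Figure 2a boils down to three steps: signal the ETL system, wait for it to finish, then run the report query against the now fresh BI layer. A sketch, where `signal_etl`, `etl_finished` and `run_report` are placeholders for the real transport (in this post: Plink/SSH and an ODBC query), not actual functions from the implementation:

```python
import time

def user_triggered_report(signal_etl, etl_finished, run_report,
                          poll_interval=1.0, timeout=60.0):
    """Trigger a delta load, wait for it to complete, then fetch the report."""
    signal_etl()                      # send 'run Delta Load'
    waited = 0.0
    while not etl_finished():         # poll the ETL status
        time.sleep(poll_interval)
        waited += poll_interval
        if waited >= timeout:
            break                     # give up waiting; report is slightly stale
    return run_report()               # one delta load served all users
```

In the prototype described below, the polling loop is replaced by the user clicking the status field, which keeps the Excel VBA side trivially simple.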

User triggered Delta Load – an example.

In this example we do real time reporting of COPA data from an SAP system. The reporting is done within an Excel sheet.

The sheet

Figure 3
This is what the user sees when the Excel sheet is invoked. When the user presses the trigger button, a signal is sent to the ETL to run the delta load process (see Figure 2a). The ETL status field shows the start, stop and outcome of the last ETL run [1]. The ETL status field is 'click' sensitive: click it and the status is updated. In this prototype the ETL process is controlled manually; press the 'Extract from SAP' button to fire off the ETL process, click the 'ETL status' field to update the status. This is complex real time reporting reduced to a simple, practical and scalable solution. A button and a field give the user the power of real time reporting. (There is a lot more to this Excel sheet, but that is outside the scope of this post.)

Under the hood

There is more to it than the button and the field; there is a lot going on under the hood. The status field is simple: when clicked it just retrieves the status from the ETL system via a simple ODBC SQL request. The 'extract' button is trickier: it must contact the ETL engine, which resides on a Linux server, and send the signal 'run Delta Load'. First we establish a Secure Shell connection to the Linux server using Plink from the Open Source PuTTY suite, then we just send the signal 'run Delta Load' and disconnect from Linux. The Excel VBA code that does the trick can be found in Appendix A.
The Delta Load  procedure is too large to discuss here. In Figure 4 you can see an abbreviated version outlining the job steps.
Figure 4
There is one line of special interest in the procedure: the <prereq type='singleton'> restricts the execution of the procedure; it must run alone. If a sibling process is already running, this instance dies immediately. Look at Figure 2b: if both users send the 'run ETL' signal simultaneously, one must yield, and this is what this <prereq> does. You can queue the requests up by adding a wait to the <prereq>. Since one ETL run will serve all users, it is not necessary to run more than one ETL at a time. This is a very important simplification; simultaneous updating of large amounts of data is not trivial, it requires complex logic and slows down the execution.
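The effect of the singleton prerequisite can be sketched with an ordinary lock file (the path below is made up): exactly one process can create the file, so a sibling ETL run that arrives while the lock exists dies immediately, just as described above.

```python
import os

LOCKFILE = "/tmp/copa_deltaload.lock"  # hypothetical lock file path

def acquire_singleton(lockfile=LOCKFILE):
    """Return a file descriptor if we are alone, None if a sibling runs."""
    try:
        # O_EXCL makes the creation atomic: exactly one process can win.
        return os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None

def release_singleton(fd, lockfile=LOCKFILE):
    """Release the lock when the ETL run is finished."""
    os.close(fd)
    os.remove(lockfile)
```

A queued variant (the 'wait' mentioned above) would simply retry `acquire_singleton` with a sleep instead of giving up.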

Response time

The Excel sheet (Figure 3) is a report generator that slices and dices the SAP COPA data. Normally the response time is around one second for generating a report, which is quite a feat since there is a lot of COPA data. This is achieved by 'in memory computing': we try to keep data in memory by various caching techniques. When the COPA data is not in memory there may be an initial delay of less than a minute [2]. Since the ETL load time is added to the report response time (if you need an updated report), it is essential to keep the ETL load time low. The ETL load statistics (see Appendix B) show that you can normally load two to three thousand rows in 15 seconds. Two background ETL load jobs a day will keep the maximum size of an additional user triggered ETL load below three thousand rows. This is probably a well balanced ETL load scheme, but in the end it is up to the user to decide; it is simple to change the 'background frequency'.

Appendix A – Excel VBA code that sends the 'run Delta Load' signal.
'At module level: Sleep comes from the Windows API (on 64-bit Office,
'use Private Declare PtrSafe Sub ... ByVal dwMilliseconds As LongPtr).
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Private Sub CommandButton2_Click()
    Dim resVar As Variant
    Dim commspgmP As String, commspgm As String
    Dim hostuser As String, cmdtxt As String
    commspgmP = "C:\Program Files\PuTTY\plink.exe"
    commspgm = """" & commspgmP & """" 'quote the path, it contains a space
    If Dir(commspgmP) = "" Then
        MsgBox "You must install " & commspgmP & " to update BI!" & _
               " http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html"
        Exit Sub
    End If
    hostuser = " -v -l viola -pw xxxxxx DWETLserver"
    cmdtxt = " /home/tooljn/dw/pgm/scriptS.php schedule=COPA_DeltaLoad02.xml logmode=w onscreen=no"
    resVar = Shell("cmd /c " & commspgm & hostuser & cmdtxt, vbMinimizedFocus)
    Sleep 2000 'sleep for 2 seconds to allow the BI update to commence before we...
    Call displayLastUpdate
End Sub
Comments to the code:
This is my first piece of VBA code and I trial-and-errored it until it worked; I do not know VBA well.
The line hostuser = " -v -l viola -pw xxxxxx DWETLserver" is not very elegant. You should not have hard coded credentials! However, in this case it is benign: the user viola cannot do anything else but run our ETL process COPA_DeltaLoad02.xml. Going into production, this will be replaced by certificates. The user 'viola' is in honor of my late and very beloved granny Viola. She was from the north of Sweden; her father was chief engineer of the Boden fortress and an inventor. The German Wehrmacht tried to enlist him, but he declined, which I am grateful for, since that made it possible for my grandfather (an orphan with no means, from the south of Sweden) to meet Viola. He served as a sergeant at the fortress. After they married and my mother was born, he had a motorcycle accident and was relocated to Stockholm, administrating arms purchases at the Army HQ just 100 meters from my present flat (I am looking at the building while writing this). I am grateful for the accident, because that made it possible for my father to meet my mother. At that time he was an accountant, a job now largely replaced by Excel sheets.

Appendix B – ETL Delta load times

The table in Appendix B shows statistics from real ETL runs. Here you can see that the example in this article took 5 minutes and 45 seconds to run, and the performance was ~310 rows per second. The load velocity depends on the number of new lines (delta data), the concurrent network load and the responsiveness of the source (SAP) system. The setup time [3] is low, normally well below one second.


To keep the average response time acceptable, we should probably schedule two background ETL runs a day, at 07.00 and noon, for the (SAP COPA) data in this example. These ETL runs will have a negligible impact on the infrastructure landscape and keep user triggered loads to a maximum of a few thousand rows.
Before you schedule the background jobs you should analyze the data update pattern, i.e. when updates occur, and schedule the runs on low activity hours, like 07.00 and noon.

[1]  This particular ETL run took some 6 minutes. See Appendix B for details.
[2] These figures are not measured; it is the author's biased opinion.
[3] This includes sending the 'run ETL' signal and starting up the ETL job in the BI layer. It also includes detecting whether there is anything to load at all; this is important, as you do not want to aggregate nonexistent data and reload already up to date BI cubes.


It is not fair to raise the fare email

When I read this amusing story about a 500 mile email problem, I came to think of another email story I witnessed about fifteen years ago. I was doing consultant work for a large company in Sweden. I have made many mistakes and caused some really bad disasters myself, but this one I only observed.

Stockholm Underground Transport (SL) had raised traffic fare prices, and a lady employed by the company was very upset about the raise. She wrote an email urging her colleagues to protest against the raise and addressed it to all colleagues in Stockholm, or so she thought. I received her mail at about 09.00 in the morning; after reading it I purged it and didn't think more of it. But soon I began to get responses to her email, some in favor of the lady and some against, and my network connection started to get sluggish. Then I realized the recipients of the mail were the entire company, about 40,000 employees; I do not recall the exact figure, but it was a lot of recipients. To make things worse, due to a merger with an almost equally big company, their mail systems had been connected the week before, so the mail reached almost double the number of recipients. Around noon the network was down; I only received a few 'response mails' an hour. The network was flooded with these mails. In the afternoon people started to become aware of the problem, and now mails started to appear from bosses in both companies telling the lady her use of the mail system was improper and she was to blame for the network outage; of course they replied to all, making things worse. Later in the afternoon mails from network admins started to appear, begging everyone to stop mailing. It took the admin guys two days to get back to normal again.


Is an IT professional less creative than an Artist?

If I only had one word to define me as a professional, creative is the word I would use. The best quality an IT professional can possess is an analytical mind. I am analytical, but it does not match my creativity.

Recently I have had reason to consider the twain traits, analyticity and creativity. Here I define creativity as the ability to create computerized models of the real world, or cyber models of reality, and analyticity as the ability to understand such cyber models and find faults in them. I was contemplating the best eight [1] guys I have met in the business. I was analyzing their strengths and weaknesses and found they all are analytical (no surprise), but to my surprise I considered them relatively weak in creativeness. One of these guys once asked me for advice; he had a problem with the general ledger in an accounting application. He gave me such a detailed explanation of the problem that I thought he was giving me a solution, but when I outlined the solution he called me a genius (this guy is a genius, with an IQ above 150). You can focus on a problem so much that you do not see the solution; this happens to me too, but more often to my more analytical friends.

I am starting to believe these two traits are XOR: you can be very creative or very analytical, but not both. Of these eight guys, the one I consider least analytical is the most creative of them (and vice versa).

Eight is too few to draw far-reaching conclusions from, but I wonder. I suppose I have assumed that being creative and analytical go hand in hand. To be successful in IT you must be analytical; to be a successful creative IT professional you may have to be a compromise, balancing analyticity and creativity. If I am right, a creative IT professional should be less creative than successful professionals in trades where analytical skills are less important, since the creative IT guy has an analytic mind to drag along.

[1] My selection criterion: 'I should have worked with them enough to know they are top quality IT professionals'. I count eight such guys. I will not name or describe them, since some of them would resent being on my list in public.


GOTO structures in program code

When I started to code programs, IBM mainframe memory had become large enough to accommodate almost any bread-and-butter program, which basically meant: read X files, match them and print a paper report, or read two files, match them and update one file with the contents of the other. This meant you could afford the luxury of structuring programs with other goals than saving memory. In COBOL, the first programming language I learned, there were two instructions particularly suitable for creating memory saving code: GOTO and ALTER.
The first of those two instructions, GOTO, was used to branch freely anywhere in the code. A liberal use of GOTO easily creates code that gives the impression of a full spaghetti bowl. You do not have to be a brain surgeon to understand that a spaghetti bowl of code can be hard to debug.
The latter of these two instructions, ALTER, was used to change GOTO statements on the fly. Yes, you could make a GOTO sectionA statement in a program go to sectionB by issuing an ALTER sectionA TO PROCEED TO sectionB. If you have spent some hours with a bowl full of spaghetti code, finally starting to make some sense of it, then you find some devilish ALTER statements (ALTER was always written in the last piece of code you looked at). Even a bicycle repairman understands that such code is a challenge to debug. The only help you had was a paper printout of the program and a dump, which didn't tell you a thing. (No, very few programmers not versed in assembler understand dumps.) You used pens, paper clips and fingers to bookmark the code, and then you saw the feared ALTER… In the old days programmers had a large active vocabulary of colorful words.
When I was a computer operator I developed a contempt for programmers; they were a miserable bunch that tried to make you prioritize their test jobs by everything from sweet talking to aggressive threats. Of course you placed their jobs at the bottom of the work queue. It was only later, when I became a COBOL programmer myself, that I understood why they behaved the way they did. Debugging GOTOs and ALTERs does things to your personality, bad things.
But when I came along things had started to change. I structured my first program with the help of the Jackson structured programming method. I used novel structured branching like DO WHILE…END. We abolished ALTER altogether and used GOTO sparsely. The code quality got much better and the productivity skyrocketed.
Today people do not seem to know why GOTO can be bad in code; people seem to think the instruction GOTO itself is bad. But rightly used, GOTO can be a blessing. It is the frequent use of unstructured GOTO that makes programs both error-prone and hard to debug. Programs are code models and abstractions of the real world, and the real world is not restricted to FOR EACH or DO WHILE structures; sometimes the free GOTO instruction is just the right instruction to use.
I end this post with an extract of the first GOTO I have written in thirteen years.


Yesterday I was thinking: why not solve the example above with recursion? That is more elegant. I have always had a preference for recursion; it reminds me of standing as a child in front of a mirror with another mirror in my hand, staring into eternity. Recursion is beautiful.
This is PHP code, and I added this GOTO as part of trying the instruction out when it was added to the language. Also here the GOTO does a better job than a recursive call. PHP lacks tail call optimization, which means you should always be cautious with recursive calls. In this case it does not matter, though, since we only go to RestartJob once; if the guard routine fails a second time we fall through and the job fails.

Integration between the Business Intelligence storage and Qlikview

In this post I first give some background, which is our experience with the Qlikview product, and then our strategy for integrating Qlikview with our Business Intelligence storage.


Our Business Intelligence data storage and viewers.

In our Business Intelligence application, the Data Warehouse (DW), we isolate the data from viewer applications. The data is stored in an SQL database (MySQL), and viewers are any applications that can connect to MySQL using standard interfaces like ODBC or JDBC. Today the primary viewer is Microsoft Excel. Qlikview is the other viewer in use today, and it is spreading like a prairie fire among our users. The reason is that users like Qlikview, and you can develop advanced analytical apps quickly with it.
Note! Although this post is about Qlikview integration, we do not exclude other viewers from our Data Warehouse. We invite users to use viewers of their liking, not ours.

Qlikview the good and the bad.

We use the product Qlikview as a viewer into our BI storage. Qlikview, like other apps, has its advantages and disadvantages. From an evaluation done some years ago by some Belgian colleagues, we could learn that Qlikview is strong on data visualization; it is a user friendly and flexible tool for analyzing data. But it lacks in enterprise integration, i.e. workflows and ETL procedures tend to be complex.
It is cumbersome to create, maintain and debug complex workflows in Qlikview. Nor is Qlikview a good data storage; it is a closed environment, and once Qlikview has slurped up the data it is inaccessible to other applications.

Qlikview parallel processing.

Qlikview also has a problem with parallel processing; as a rule of thumb you should not run more than 10 ETL processes in parallel [1]. Good parallel processing capabilities and good scalability are essential for large scale ETL. Qlikview consists of two servers, the Publisher, which is the ETL engine, and the Database server, which handles user requests. By splitting these servers onto separate physical computers you alleviate the problems with parallel processing, but you still should not run more than ten ETL processes in parallel [2]. Having separated the twain Qlikview server parts, loading data no longer affects user interaction, but the loading itself is still a problem. There is probably more you can do, but right now we are content with Qlikview's loading performance. (We try to avoid heavy ETL processing in Qlikview.)
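One way to respect such a limit is to cap concurrency on the submitting side, so that no matter how many loads queue up, at most ten run at once. A sketch under the assumption that the per-task loader is your own code (here just a placeholder function):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_LOADS = 10  # the rule-of-thumb limit discussed above

def run_loads(tasks, load, max_workers=MAX_PARALLEL_LOADS):
    """Run the load function over all tasks, never more than
    max_workers at a time, regardless of how many tasks queue up."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load, tasks))
```

The point is simply to let the submitter, not the overloaded Publisher, enforce the ceiling.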

Data views.

Data can be viewed from different angles. A production planner sees the world from a different perspective than a logistics analyst, who has another view than a purchaser. Not to mention the business controllers, who often like to see different perspectives in the same view; the need for financial report structures is endless, or so it seems.

What is in a warehouse storage bin?

The controller may want to know the value of the physical parts in the bin, while the production planner wants to know if there are enough allocatable parts to produce X products, while the logistics analyst wants all parts including those being produced in the next two weeks, while the purchaser wants to include parts on the road from the vendor. Stock figures are actually more complex, but this will do as an example of different data views.
Business rules. Data views are determined by business rules, and you want the data views to be consistent between viewers; in our case we have other viewers apart from Qlikview. Data views are not static, they change all the time, influenced by things like the real world, the type of production, management, the layout of the workshop, the business model, the analysts themselves etc. E.g. a change in management often implies a change in data views.
To create data views you often have to join data from different sources, calculate ratings like (1/price * 100 + 1/delivery time * 10 + quality * 0.25), or classify observations by rules like if vendor rating < 10 and vendors = 1 then 'critical' else if….
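The rating and the classification rule above, written out. The weights are the ones from the text; the category names beyond 'critical' are illustrative, since the original rule trails off in an 'else if':

```python
def vendor_rating(price, delivery_time, quality):
    # The weighting from the text: cheap, fast and good raise the rating.
    return 1 / price * 100 + 1 / delivery_time * 10 + quality * 0.25

def classify(rating, vendor_count):
    # if vendor rating < 10 and vendors = 1 then 'critical' else if ...
    if rating < 10 and vendor_count == 1:
        return "critical"
    elif rating < 10:
        return "watch"  # illustrative continuation of the 'else if'
    return "ok"
```

In our setup this kind of logic lives in the DW layer (as SQL), so every viewer classifies vendors the same way.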

Data transformations for viewers.

Viewers need formatting; e.g. in one application you want material number to be called product number. Aggregation is another transformation that affects viewers: how much detail is useful? Transposing, also known as cross tabulation or pivot tables, is a powerful transformation very common in BI. The data transformations for views go from the very simple, such as changing a default label, to the very complex; transposing a table is not easy [3].
Viewers also prefer data to be organized in specific ways; Qlikview often wants all, or most of, the data for an application in one big table.
(Viewers often have an internal format; Qlikview has a proprietary compressed format, tailored for performant Qlikview applications. This transformation is done internally when data is loaded into Qlikview.)
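A minimal cross tabulation of the kind discussed, in plain Python, just to show the shape of the transformation; real implementations, like the SQL transpose engine mentioned in the footnote, are considerably harder. The column names are made up:

```python
def pivot(rows, row_key, col_key, value_key, default=0):
    """Turn long-format rows into a dict-of-dicts cross tabulation."""
    cols = sorted({r[col_key] for r in rows})
    table = {}
    for r in rows:
        # Each distinct row_key value becomes one output line,
        # pre-filled with the default for every column.
        line = table.setdefault(r[row_key], {c: default for c in cols})
        line[r[col_key]] += r[value_key]
    return table
```

For example, long-format rows of (plant, month, qty) become one line per plant with one column per month.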

Oceans of space.

Our BI system, the DW, runs on servers designed for BI, which (among other things) means the database servers have many large and low-cost hard disks. I cannot stress enough the importance of economical disk space for BI storage.
We run Qlikview in a hosted environment where disk space is performant and expensive. This is good for our Qlikview apps, but oceans of space are out of the question for our Qlikview environment. So we try to store only the information in Qlikview that is necessary for the apps themselves. Using Qlikview as a general data store would be costly, due to the environment we run Qlikview in.

Accessibility, quality and reuse of data (and definitions).

First, the BI store must be available. If the BI store is down there is not much accessibility, quality or reuse.
There is a consensus among everyone that 100% data quality is essential (although all experienced BI professionals have 'extrapolated' at one time or another; show me one that has not, and I'll show you a liar). Data quality in BI is important: if users find that the data is invalid they stop using the system, or pester you with 'there are invalid figures in the BI system' whenever a discrepancy is found [4].
The reuse of BI data is grossly exaggerated (mostly by IT professionals). For rapid, lenient and agile application development it is often preferable to copy data to a new database. Copies must be replicated correctly, which means good workflows, monitoring and consolidating facilities to safeguard 100% quality.
The importance of BI data accessibility is often neglected or diminished (mostly by IT professionals). IT professionals often only want to use one BI application (the one they know). Most often this is a closed proprietary system, which hinders free use of the application data. Ideally a BI data store respects de facto database interface standards like ODBC and JDBC, and has good, extendable APIs.

Strategy for Qlikview integration.

Above I have tried to depict the background and the prerequisites for Qlikview integration in our BI ecosystem. Our integration strategy is based on the conditions that apply to us here and now. We try to use the best tool for the task. For Qlikview apps this means: use Qlikview for visualization and the Data Warehouse (DW) for ETL and workflows. With our strategy we try to accomplish a cost effective environment for Qlikview development and maintenance.
Do not use Qlikview as a 'data funnel'. It is much better to gather data from various sources in the DW, where we have better workflows, traceability etc. Consolidate the data in the DW and send it in one go to Qlikview. This way we also have a natural Qlikview backup.
Do not store more data in Qlikview than necessary. Qlikview is not a data store; it is better to store data in the DW.
Do not use Qlikview for defining data views .  It is much easier to create, maintain and reconcile the data views in the DW, where the data can be reused, shared and copied.
Try to do transformations in the DW. This mostly means simpler transformations, such as changing the format or default labels of data; it is a question of transferring similar code from Qlikview to the DW. By preparing the data sent to Qlikview we can minimize time spent on Qlikview development and make the Qlikview code cleaner. Do not overdo this; it is not the end of the world if transformations are done in Qlikview.
Do aggregations in Qlikview. Qlikview needs detail data and has excellent functions for aggregation and cross tabulation.
Try to organize Qlikview application data into a specific DW database. This way it is simple to isolate the Qlikview application and its developer (often a remote consultant) from other data, and still present all application data to him or her.
Do not try to reuse and share data between Qlikview applications. With separated data you create cleaner Qlikview applications. Sooner or later the shared/reused data and definitions will deviate anyway; it is not worth the effort to overdo the sharing. Sharing and reuse are better done in the BI layer.

These pictures are included just to give an idea of what transformation code may look like in Qlikview and SQL.

Example of simple Qlikview formatting that can partly be moved to the DW database layer:
Example of simple Qlikview formatting done in the DW database layer:

[1] I have heard Qlikview consultants say it is a problem with Windows, but I think it is more a Qlikview problem. Not being an expert, I believe Windows Server 2008 R2 is good at parallel processing.
[2] When overloaded, the Publisher seems to run into a stalemate: the computer runs at full speed but very little happens. I have not analyzed the situation, but it looks like a classic overload situation.
[3] If I get time I will write about a ‘poor man’s SQL transpose engine’ which I am building in Lua on top of MySQL. I have done 60% of the work, so we will see if I am able to finalize it. This is based on work done by Giuseppe Maxia.
[4] Every complaint about data quality should be fully investigated. Good BI administrators put a lot of effort into quality complaints. In about 98-99% of cases the problem is not in the BI system; in most cases it is the interpretation of figures that causes quality complaints.


My summer of code

Google Summer of Code is a great event, giving young people the opportunity to work with hopefully good mentors and gain an insight into the part of the IT world that is system design and programming. ‘Programmer’ is a word that has become obsolete, not used so much anymore. Who tells you ‘I’m a programmer’ these days? Web designer, systems designer, application developer, project leader, database modeler - yeah, but programmer - no.

Each summer vacation, if I do not have anything better to do, I try to study a subject in some detail, something I can then use in my daily work. Last summer I did some web development with CouchDB to learn NoSQL databases (no, I have no use for CouchDB in my work). Two years ago I had better things to do. Three years ago I created a Dekiwiki/MindTouch web site for use as a ‘virtual project workplace’, only to learn we had already ‘standardized’ on the Lotus Notes Quickr product.

This year I have decided to pursue something for sheer pleasure: I will learn a new programming language. Deep down in my heart I am a programmer; if I do not program anything for a month or two I get withdrawal symptoms. Unfortunately it is a long time since I learned a new programming language. The last one I learned was PHP, about 10 years ago, and I feel a bit limited. I realize I must have more tools in my toolbox, Javascript for certain. But there are others: C I used a little 15 years ago; C++ I tried, C# and Java too. I looked at Python, Ruby, Node.js and a few more. Nice languages, but they really didn’t click. I also tried Erlang, but I am not smart enough for Erlang, or maybe too old, or both.


I have been following the development of Perl 6 since 2002 or so, and I really like it. I have started to write Perl 6 a few times, but Perl 6 is not ready yet. Creating something genuinely new takes time, and there are some lovely constructs in there. I will definitely learn Perl 6; it is part of the future. Perl 6 is coming to us this year, of that I am certain. The guys behind Perl 6 are true heroes of endurance. And Carl Mäsak (one of the heroes) is my favorite blogger. Carl’s posts are often brilliant, full of wit, humor and wisdom about IT and beyond.

And the winner of my summer of code is - D

But for my summer of code I decided to go for D. The D language has a lot of nice features, and it comes with an inline assembler. It is not mainstream - I like that - but it has the potential to become mainstream - I like that even more.


In 2001 I needed a simple scripting language in the Linux environment; I wanted to build a simple job controller, and I chose PHP (over Perl). PHP was new and fresh and I had never heard of it, so I thought I should give it a try. I can still do whatever I need in PHP; it is a good scripting language, no matter what programmers who do not use PHP say (and they do say it). The animosity is interesting: some guys are really explicit when they express their dislike of PHP. And it is fun to read that it is impossible to do with PHP what you actually do with PHP.


Job scheduling with PHP - 4

In my previous post about job scheduling with PHP, I described two entities: the context and the schedule. The context is where all configurations and descriptions of source systems go. The schedule is the entity we schedule for execution. The entity where we describe the work we want done is called a job. Jobs are chained together in schedules and can be included in a schedule from a job library or declared directly in the schedule. Before we look at the job I need to explain return codes.

Return Codes

This is a bit complicated; if you are not one for details, go directly to the Summary.

Previously in my posts on job scheduling with PHP I have explained why I use Boolean return codes (with a few exceptions). Normally you declare a schedule with mustcomplete=’yes’, which means execution stops if any return code is FALSE. This is what you normally want: stop at the point of failure, correct, and rerun the schedule. But sometimes you need to clean up or automatically fix a problem by executing error-correcting jobs, or you want to build schedules with logic like ‘if month end, run job allocateNewMonth’. You do this with prereqs. A prereq is basically a Boolean gate that is either open or closed; a job prereq determines whether a job should execute, and if the prereq is FALSE the job is bypassed. This means that the result of a job is not strictly Boolean: it can successfully execute, fail or be bypassed. The ‘bypassed’ condition defaults to TRUE/success; you can change this with bypassed=’false’ in the schedule.

To allow for failures you turn off normal error checking by stating mustcomplete=’no’ in the schedule; then all error checking must be done explicitly in the schedule.
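To make the mustcomplete and prereq mechanics concrete, here is a minimal sketch of what a schedule declaration might look like. The mustcomplete attribute and the prereq/bypassed concepts come from the posts; the exact element names, job names and script names are my assumptions:

```xml
<!-- Hypothetical sketch only: the exact schedule syntax is assumed. -->
<schedule name="monthlyLoad" mustcomplete="yes">
  <!-- A FALSE return code from this job stops the schedule -->
  <job name="loadSales">
    <sql>load_sales.sql</sql>
  </job>
  <!-- Boolean gate: the job runs only if the prereq is TRUE,
       otherwise it is bypassed. A bypassed job counts as
       TRUE/success unless bypassed='false' is declared. -->
  <job name="allocateNewMonth">
    <prereq>isMonthEnd</prereq>
    <sql>allocate_new_month.sql</sql>
  </job>
</schedule>
```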

Summary: Job return codes are Boolean. If a FALSE return code is detected, execution of the schedule is interrupted. This default behavior can be changed.

The Job Iterator

How many times must a job execute to become a success? In most job control systems I have seen, the answer is ONCE only. In my job scheduler the answer is: zero or more times, though by default a job is executed ONCE. The job iterator determines how many times a job executes. The job iterator is a table, and the job is executed once for each row; the job iterator is also a placeholder for symbolic variables, a.k.a. @tags. Job iterators are immensely powerful, but for now it is enough that the job iterator determines how many times a job executes and can contain @tags.

The job iterator is declared by the xml tag <forevery> within a job.
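A sketch of what a job iterator declaration might look like. The <forevery> tag comes from the text; the row/column markup and names are my assumptions:

```xml
<!-- Hypothetical sketch: the job body runs once per iterator row,
     and each column is available as an @tag (here @REGION). -->
<job name="exportRegion">
  <forevery>
    <row><REGION>north</REGION></row>
    <row><REGION>south</REGION></row>
  </forevery>
  <!-- @REGION is replaced by the current row's value -->
  <sql>export_sales_@REGION.sql</sql>
</job>
```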

The Job

The job is declared by the xml <job> tag.

A job is a unit of work consisting of five optional execution elements:

1. init actions (operating system commands), executed prior to the job type action.
2. the job type action, which is an SQL script or a PHP script or function.
3. nested jobs.
4. exit actions (operating system commands), executed after the job type action.
5. guard action: a PHP script that executes in case of unsuccessful execution of 1, 2, 3 or 4.

Job prereqs can be used to fine-tune execution logic.
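Putting the five optional elements together, a fully loaded job might be declared something like this. The <job> tag is from the text; the element tag names and the scripts are my assumptions:

```xml
<!-- Hypothetical sketch: a job using all five optional elements. -->
<job name="refreshSales">
  <init>mkdir -p /tmp/sales</init>       <!-- 1. init action -->
  <sql>refresh_sales.sql</sql>           <!-- 2. job type action -->
  <job name="checkRowcount">             <!-- 3. nested job -->
    <php>check_rowcount.php</php>
  </job>
  <exit>rm -rf /tmp/sales</exit>         <!-- 4. exit action -->
  <guard>notify_failure.php</guard>      <!-- 5. guard action: runs on failure of 1-4 -->
</job>
```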

The example

Now this may seem awfully complicated, but it is not. You only add what you need, and that is almost always only an SQL script or a PHP script. Does this mean I always have to create SQL and PHP scripts? Yes and no. You write SQL scripts, but very seldom PHP scripts; those you need are already written, e.g.:

I want to create an HTML table report and mail it to Kalle Kula. The data is in the MySQL table mytable.


This schedule consists of 2 jobs.

The first job creates the report formatted as HTML. The second job mails the report with the help of the prewritten PHP script sendmail.

There are two tags in there: THETABLE points to the result HTML table produced by job 1, and the second tag, THECSS, points to a prewritten CSS template file.
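To give an idea of how the two jobs and the @tags could fit together, a rough sketch of the schedule. The sendmail script and the THETABLE/THECSS tags come from the text; the schedule structure, converter and addresses are my assumptions:

```xml
<!-- Hypothetical sketch of the two-job schedule described above. -->
<schedule name="mailKalle" mustcomplete="yes">
  <job name="createReport">
    <!-- formats the SQL result from mytable as an HTML table -->
    <sql converter="sqlconverter_html.php">SELECT * FROM mytable</sql>
  </job>
  <job name="mailReport">
    <!-- @THETABLE points to the HTML table from job 1,
         @THECSS to a prewritten CSS template file -->
    <php to="kalle.kula@example.com">sendmail.php</php>
  </job>
</schedule>
```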

Now suppose Kalle tells us: ‘Please send the data for sales area uppsala to my colleague Peggy Piggelin, and please send the reports as Excel sheets’. We have to make some changes to our schedule, and these changes can be done in several ways. I will show you one way to do it:


As you can see, we have added a dummy job with a <forevery> job iterator which consists of two rows with the columns NAME, EMAIL and sales area. In this new dummy job we execute the two original jobs, first for row one in the iterator and then for the second and last row. To change the output from an HTML table to an MS Excel sheet we only changed the SQL converter.
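A sketch of one way the modified schedule could look, with the dummy job and its <forevery> iterator. The recipient names and the <forevery> concept come from the text; the row markup, column names and converter name are my assumptions:

```xml
<!-- Hypothetical sketch: a dummy job iterating over the recipients;
     the two original jobs run once per iterator row. -->
<job name="perRecipient">
  <forevery>
    <row><NAME>Kalle Kula</NAME><EMAIL>kalle@example.com</EMAIL><SALESAREA>stockholm</SALESAREA></row>
    <row><NAME>Peggy Piggelin</NAME><EMAIL>peggy@example.com</EMAIL><SALESAREA>uppsala</SALESAREA></row>
  </forevery>
  <job name="createReport">
    <!-- the Excel converter replaces the HTML converter -->
    <sql converter="sqlconverter_excel.php">
      SELECT * FROM mytable WHERE salesarea = '@SALESAREA'
    </sql>
  </job>
  <job name="mailReport">
    <php>sendmail.php</php>
  </job>
</job>
```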

In real life you store the recipients in a database table and create the job iterator with an SQL query. The report SQL queries are probably not hard-coded in the job but stored in files in a suitable directory.

The post PHP, MySQL, Iterators, Jobs, Templates and Relation sets explains templates and shows alternative iterators.

The post PHP parallel job scheduling - 1 explains how you can execute jobs in parallel.

Other Examples can be found here .

PHP code

I will end my post about my job scheduler by displaying sqlconverter_GoogleDocs01.php.

This code uploads the SQL result table to Google Docs. I think this is very cool :)


Blogging, Gartner knowledge and Security

Real knowledge.

By reading about or studying a subject you can learn a lot, but to become proficient you also have to practice; in fact you have to practice, practice, practice. This is well known in human language learning; in IT this basic truth is sometimes forgotten. Gartner Group is an advisory firm specializing in writing subject matter articles and reports targeted at IT decision makers. These reports are often initiated and balanced. Let’s say you would like to introduce Business Intelligence (BI) to your company but do not know much about BI. Looking at what Gartner has to say about BI is a good start, and here a problem starts: many readers stop there and make their decisions on Gartner reports alone. You think you know the subject, but you only have a superficial overview of the subject and the market. I call this Gartner Knowledge. Above I wrote ‘often initiated and balanced’, but you can question how objective such reports are. You do not bite the hand that feeds you, do you? Gartner knowledge is something I have many times thought of writing about; someday maybe I will.
Anyway, when I started blogging it was mainly to get practical knowledge about blogging. I had studied blogging by reading blog posts, and then I wanted to know more, so I created this blog and wrote some posts.

The first thing I learned about was Spam Sites.

From Google I got statistics about blog hits, i.e. readers of my blog. My main purpose was to learn about blogging, but of course I was very curious about my readers and how many actually read my posts. Soon, very soon, I had many readers, but almost all came from odd-looking sites. I suspected web scanning programs (bots, crawlers, etc.) were the main readers of my blog. After some googling I found these came from spam sites. Spam site programs create blog hits to create an interest in the site (‘who is this reader of mine?’), luring you into going there and exposing you to their malware or whatever evil they try to contaminate your computer with. Even worse, many sites show ‘Followers of my blog:’ with links to frequent readers of the blog, luring your readers to the spam sites. You should not go to sites you do not trust, even if they are readers of yours, and you should not link to such sites. This was the first thing I learned when I started blogging.
When I learned about this I talked with some experienced PC support guys I know, to get first-hand information about real security issues with PCs today. As I suspected, old-fashioned virus infections are rare these days. Windows (7) is becoming more secure and virus protection is becoming better, making viruses less of a threat than they used to be. The major problem (for PC support) these days is cleansing PCs after their owners have clicked on ‘You have won a million $, click here to collect…’, leaving annoying spam malware in the computer that makes you go bonkers.

The second thing I learned.

It is easy to leak or give away information unintentionally when you demonstrate examples. I blog about IT, and when I show examples I have found it is very easy to expose sensitive info. You should take great care to show only the information necessary for the example you are writing about. I have probably made some blunders when I started, but I hope those are not sensitive or possible to exploit. This has been a good learning experience.

The third thing I learned so far.

It seems that by exposing yourself to the web you are drawn into social networks of various kinds. I am now a member of LinkedIn, with contacts to God knows who. I have started to look at Twitter and Facebook. This is something I will probably write another post about in the future.


Another thing I learned

When you publish a Google document as a post here, the images go anywhere; what you see here is far from what I composed in my Google document.