Some days ago a I wrote a post about the speedy database server of the Data Warehouse. A month ago I blogged about a problem we had with the same server, it was like someone had filled the server with glue figuratively speaking, the server was excruciatingly slow. I could not find the reason for the slowdown, but a reboot fixed the problem. Yesterday it happened again, I spent some hours trying to figure out what the problem was but I came up with nothing, to prevent a new disaster the server was rebooted, but this time it refused to come up again. Looking at the server one disk showed very high activity while the other disks in a raid6 group did nothing, very suspicious. But no alerts no red lamps no nothing, "I'm fine just a bit busy but I'm alright" the disk was saying. Anyway Linux refused to start up, claiming it did not have time enough to get hold of the file system. After some help of an external consultant we found out the high activity disk was faulty and we pulled it out of the raid and all went back to normal again. (The disk problem was recorded in a hardware log I didn't know of).
A disk raid is a complex thing and it is or rather should be fault tolerant. But what use do you have of a safe raid if it slows down so much nothing comes out of it. I have found disk raids and raid controllers unreliable, if something breaks in a server it's likely connected to raids. I prefer single disks and backups. When a disk breaks replace it, apply the backup and eventual redo logs, rerun jobs if needed and then you are back in business again. In (update) transactional applications this approach might not be possible, then fault tolerant disk raids plus hot standby backup system might be necessary. since the chance of loosing data is less with a fault tolerant raid setup. This is contradictory I know, but raids are complex and hard to 'debug' and I do not not trust them much.
If you follow the Data Warehouse on Twitter you can see some staggering high query figures the last couple of days. The first of April The Data Warehouse responded to 182.814.418 queries, and that is no April fools day prank. That’s impressive by any standard and far from the regular 9 million queries. I thought the figures were wrong, but it turned out to be correct. They BI crew is implementing a new ‘purchase delivery performance’ routine. In the SAP source system there is no relation between Purchase Order Lines and Deliveries. So the BI team has developed a SQL stored procedure that calculates delivery performance. This stored procedure is ‘query intensive’, and what you see in the recent tweets is testing and calculation of Purchase Order history. When we are in a steady delta calculating phase it will be considerable lower.
I tried to create a PHP ‘in memory’ script for this calculation, but I run out of 10GB memory trashing the Data Warehouse ETL server. Since we have sufficient memory in the database server, this more elegant stored procedure is by and large ‘in memory’ computing.
Last week while I tested a new Web Service I forgot to stress test the service.
We figures out that if it could cope with 50 concurrent call that would be more than it will ever be stressed in real life, so I run a test with 100 concurrent calls. This is how i did it:
The XML tag <forevery rows='100' parallel='yes'/> iterates the job 100 times in parallel.
It took 4 seconds to do that: start the job, create 100 threads, call the web service, receive a result, write it to disk and finally cleanup.
The execution of this ITL workflow is done in PHP including the parallel scheduling of the job 100 times. Much of the 4 seconds are probably PHP execution and server communication. The web service is more stress resistant than we will ever need.
At the company we are setting up a Master Data Management System, and intend to use web services to distribute (master) data from this MDM system. I like to test ideas and concepts before I deploy them. I discussed this with a colleague Petr Hutar, who runs the MDM system at Business Area MR. We decided currency rates would be a good entity to test the web service data distribution. Petr wrote and deployed the web service, and I sat up a client web service importing the rates into the Data Warehouse:
The ITL workflow for importing currency rates into the Data Warehouse. The idea is to start this schedule and let it sit and wait until the new period’s currency rates becomes available. When the rates are available the job getRates fetches the rates and the job loadRates inserts them into a Mysql table.
There are some tags of interest worth explaining:
<tag name='WSDL' value='http://IPaddress/IBS/IBS.ASMX?WSDL'/>
The WSDL tag points to the WSDL of the SOAP web service we use to extract the currency rates. As you can see it’s an ASP.NET developed web service which is called from the Data Warehouse Linux/PHP environment.
<prereqwait sleep='600' for='40' until='17:00:00'/>
<prereq type='pgm' name='SoapWsdlNewCurRates.php' parm='@WSDL'/>
The <prereqwait> will repeat checking the prereq(s) until they are satisfied. This prereqwait will check the prereq every 10 minutes (600 seconds) maximum 40 times or the clock strikes 17:00:00. whichever happens first.
The <prereq> calls SoapWsdlNewCurRates.php:
The job getRates calls getIBCurrRatesWebserv2.php:
Comments to the code. I had to write my own error and exception handler to catch error in the SOAP client. This should not be necessary, maybe I did something wrong.
As you see it’s dead simple to consume a SOAP webservice, just point to the WSDL and setup the parameters and call the method you are interested in. This web service test is a success and I will definitely explore web services as an integration tool. I’m already experimenting with Nodejs/HAPI creating REST APIs to the Data Warehouse. I wouldn’t be surprised if I start creating Web pages one of these days:)
If you follow the data warehouse tweets, you may have noticed there are often failed jobs in the job statistics. It is actually worse than what you see, jobs which never start due to failing prerequisites is not registered, e.g. a necessary source system job ends unsuccessfully. Still very seldom this is a problem since we only have a production environment, most tests are also run in the production environment, and the majority of failed jobs are tests. You should turn off logging when you test, but who remembers that? And who cares?
In an ERP system landscape it makes sense to have separate testing and production environments for very obvious reasons.You never want to put a transaction or program of any kind into ERP production without proper testing, since ERP transactions often updates the database and it is very hard to test and debug ERP transactions in an everchanging ERP environment. In a test environment you can set up your own scenarios and test the transactions without the risk of other updating your test data. There is also a security aspects, e.g. no sane owner of an application register financial transactions would grant ‘free’ access to developers into the production environment. Thus it makes sense to set up a permanent test environment for ERP systems which also is a good a playground for user training.
For BI or Data Warehouse applications the situation is different, most transactions are read-only (most updates are done in controlled and relatively static batches) and debugging is very hard without current production data. Since the data is stable testing can preferably be done in the the production environment. More important agility is key in BI. Developing directly in production (together with the user) is much faster when you can see the result in real time, than developing in a test environment. I question if you can develop together with users in a test environment at all. Anyway agile it is not. When you need to test updating programs you create a personal mart or take a snapshot, that is one of the key features of a Data Warehouse. And then you test in your mart or snapshot. You need plenty of disk-space for this (and easily accessible backups, for fast restore when you screw up). What use are good data mart facilities if you cannot use them?
If you develop in a separate environment, do not develop together with users, if you as a developer cannot create data marts and/or snapshots on the fly or do not have simple access to backups; you do not have agile Business Intelligence, no matter what your BI vendor says.
There is yet another aspect that can aggravate the agility of the BI, the development methodology. In ERP requirements and program specifications are good things, but in BI specifications other than very high level are evil.
If you develop according to the ERP specification model it is very likely you do not have Business Intelligence at all, it is more likely you are doing standard reporting.
‘When I want a new analysis of stock movements since 2006 I want it in hours not tomorrow. I do not necessarily care about 100% veracity, I care about velocity!’
A Data Warehouse user.
It’s not only the BI system, it is very much your BI organization that determines to what degree you are agile.
The observant reader may have seen a flagrant loophole in this post, ‘the security aspect of BI development‘. Shouldn’t the BI developer be kept out of the production environment?
I argue you cannot keep the BI developer away from the production data, since debugging and testing often is very hard and sometimes almost impossible without access to production data. There are situations where you cannot allow the developers access to the production data, but then you will not have a very agile BI system. You have to trust the BI developer if you want speedy agile BI systems, and accept the fact the BI developer knows more about the business than they maybe should.
Business Intelligence developers should sign an agreement of non disclosure.
Business Intelligence developers should sign an agreement of non disclosure.
Last week we had two terrible days in the Data Warehouse, for no obvious reasons at all the database server slow down to almost a standstill, like someone had thrown glue in the server. I was on a business trip and could not attend the problem from start.
First I suspected the network interfaces since we have a known problem with one of the interfaces but that was a red herring. Then I was looking at the memory unilization, I could not see anything alarming. (I find interpreting memory utilization in Linux a bit of a challenge.) I could not see any sign of excessive paging or swapping, I was looking at vmstat swap but si and so was most of the time zero. I noticed free memory was very low but I do not know what is useful and not harmful use of memory (as cache) in Linux, and I do not know what these figures looks like when the system is ok.
Anyway the server behaved very bad, MySQL actually crashed twice during the time. I stopped all unnecessary processes in the server, checked network cables but all in vain. Later I remember the remedy for all problems when you sit in a help desk. ‘Is the power cord attached?, ‘Is the network cable connected?’’ and lastly the classic ‘Have you rebooted your PC?’
I had already checked the power and network cables, but I had not rebooted the server. And sure enough after restart all was back to normal. But I still do not understand what the root cause of the problem was and that is very unsatisfactory. But as I use to say - Once is no time, twice is one time too many - next time, which God forbid, this happens I should be better prepared, or at least reboot quicker.
While this malady was ongoing I noticed MySQL queries was in state statistics or something similar, which I interpreted as MySQL where optimizing access paths. We only run dynamic queries since we very seldom run the same query more than once in the same context, but there are some programs that run the same query over and over again, maybe we could gain something by prepare the query and then execute the query. This is a reworked prepare query program:
I have for a long time been thinking about writing an http communication channel to the Data Warehouse, mainly to be able to write outbound web services. Today I actually did the first serious attempt. This is barely beyond ‘Hello World’, it is a small step for mankind, but it was a giant step for me. Here is the complete web server:
What is this web server good for?
This server can be contacted from e.g. VBA code in Excel sheets to trigger load jobs in the data warehouse. To be really useful I need to add file upload capabilities to the server but that has to wait to next weekend.Any comments about the code is most welcome. There is probably some ugly gronk in there, I have not really tested the code yet.