Relational Big Data

Do you want sufficiently fast access to data, or the fastest access to data?

In Big Data and Relational data stores I described the differences between Big Data and Relational data. The starting point for my post was the misconception that ‘Big Data is needed for efficient manipulation of large volumes of data’. I described Big Data as a document database, where a document is a number of values/attributes tied together by a key and not much more. The Big Data manager can do some clever compression and optimization of the physical layout to minimize both disk space and access time.
If you remove all but one value you are left with a key-value pair; if you store such pairs you have a key-value store. It is perfectly OK to store all your Big Data as key-value pairs; not very convenient, but you can do some heavy optimization both in terms of space and performance. If your data lends itself to a key-value store, it is very hard to beat Big Data.
If we look at relational data, it is very hard to compete with Big Data in terms of space utilization. This is not a big deal, since disk space is dirt cheap these days. But accessing disk takes time, so the larger the disk space, the longer it takes to read. There is a remedy for disk access time: move all data into RAM and replace regular hard disks with SSDs; this will significantly reduce access time. It is true that Big Data will benefit even more from fast disk access, since the data structure is simpler. Yet again the relational data manager has a trick up his sleeve: instead of browsing through all data, the clever database admin creates covering indexes. Theoretically, covering indexes could be accessed as fast as a key-value dataset; the relational database managers are not there yet, but they will get there eventually.
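To make the covering index idea concrete, here is a minimal sketch using SQLite (my own toy table; any relational manager works the same way): the index contains every column the query needs, so the lookup is served from the index alone and the table itself is never read, much like a key-value access.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "acme", 10.0), (2, "acme", 25.0), (3, "globex", 5.0)])

# The index covers both the search column and the selected column,
# so the query below never has to touch the orders table.
con.execute("CREATE INDEX ix_cover ON orders (customer, amount)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM orders WHERE customer = 'acme'"
).fetchall()
print(plan)  # the plan reports 'USING COVERING INDEX ix_cover'
```
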

Map and Reduce

Big Data systems parallelize data access by mapping the work onto an arbitrary number of workers and then reducing the results by merging them together. This parallel programming technique is called Map and Reduce, a design pattern I am well acquainted with.
I used Map and Reduce in 1991 when I created a search engine I called ‘Fast Search for Structured Data’; I put in ‘Structured’ to emphasize that the search engine was not primarily designed for free-text search. The data was stored in key-value bitmaps. We ran a test comparing my search engine with DB2 and with a Cobol-based serial-processing precursor to my map-and-reduce program. DB2 we had to stop after 23 hours with no result; the Cobol program took 20 minutes, and my program less than 5 seconds.
Properly tuned, Map and Reduce applied to key-value data can be incredibly fast on large data volumes. DB2 and other relational managers have come a long, long way since 1991; properly tuned, they can be incredibly fast on large data volumes too.
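The pattern itself is simple enough to sketch in a few lines of Python (a toy illustration, not my 1991 program): split the key-value data into chunks, map a worker over each chunk, and reduce the partial results by merging them.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_worker(chunk):
    # Map step: each worker scans its own slice of the key-value data
    # and produces a partial result (here, a count per key).
    return Counter(key for key, value in chunk)

def merge(a, b):
    # Reduce step: the partial results are merged pairwise.
    a.update(b)
    return a

def map_reduce(data, workers=4):
    chunks = [data[i::workers] for i in range(workers)]  # one slice per worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_worker, chunks))
    return reduce(merge, partials, Counter())

data = [("plankton", 1), ("krill", 1), ("plankton", 2)]
print(map_reduce(data))  # Counter({'plankton': 2, 'krill': 1})
```
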

Relational big data

There is nothing that stops you, or the relational database manager, from applying Big Data techniques to relational data. Here I show how I have applied Map and Reduce to relational data. You can also dramatically increase relational performance with good index management.
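A sketch of the idea in Python, with small SQLite databases standing in for the partitions of one large relational table (the table and data are made up): the same aggregate query is mapped over every partition in parallel, and the partial results are reduced into one.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def load_partition(rows):
    # Each partition is its own little database, a stand-in for a
    # table partition or shard of one large relational table.
    con = sqlite3.connect(":memory:", check_same_thread=False)
    con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return con

def map_query(con):
    # Map step: the same SQL aggregate runs against every partition.
    return dict(con.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"))

def reduce_results(partials):
    # Reduce step: merge the per-partition aggregates.
    total = {}
    for partial in partials:
        for region, amount in partial.items():
            total[region] = total.get(region, 0) + amount
    return total

partitions = [load_partition([("EU", 10.0), ("US", 5.0)]),
              load_partition([("EU", 2.5), ("AP", 1.0)])]
with ThreadPoolExecutor() as pool:
    result = reduce_results(pool.map(map_query, partitions))
print(result)  # {'EU': 12.5, 'US': 5.0, 'AP': 1.0}
```
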

When are my data volumes so large I need Big Data?

When the data volume becomes a challenge to your design, you have ‘big data’ volumes. If your data is contained in a file cabinet and all data access is done manually by you, big data volumes may be a few thousand pieces of data. Does the relational model have conceptual or inherent problems with large data volumes? No, it has not, but in some edge cases ‘Big Data’ concepts scale even better. The performance gains come at a cost: you lose some control of your data, and you lose a uniform method (SQL) for data access.
Both Big Data and Relational are borrowing/stealing from each other. There are Big Data stores with some SQL support and relational managers incorporating Big Data stores. What kind of data managers and data stores we will have in the future remains to be seen. On the global scene data production is super-inflating, and many more players want to capture more data. There is a demand for better data management. Yesterday most data captured for analysis came from ERP systems; today it is probably web trade, and tomorrow social media.
Personally, I do not believe we live in a post-relational age. I bet my 5 cents that SQL and the relational model will prevail, but they will evolve in many ways, and one of them may be Big Data. I also believe there is a place for Big Data: there will be more large-data-volume applications tomorrow where speed matters most, and there Big Data may play a niche role. I can also see that the People’s Republic of China has a great need for Big Data solutions for various reasons; this is probably not a small niche, though.


External SAP job monitor

The other day we had a problem with the nightly batch load of our Business Intelligence system, the Data Warehouse. Actually, we have problems all the time; we run some 20,000 jobs a month, so problems are expected, but the problem we had the other day was special. An ETL process refused to run due to missing input. WTF! These files have never failed to show up before, well, almost never. When tracing the error we found that the SAP job assembling the information for us had not run, so our dependent extraction routine had failed with ‘SAP information not refreshed, abending...’. Following a long chain of dependent events, we found out the guys in our Tierp factory wanted vacation this summer too :-) To keep our customers happy, a slight change in the shop calendar was made to increase production before the vacation period. These are all nice problems: keep customers happy, increase production, vacation. The nasty problem was that the changed shop calendar was flawed, and the cover time calculation bombed out, which ultimately led to the above ‘SAP information not refreshed, abending...’. We notified SAP operations and the problems were fixed in no time. Our waiting ETL process munched the late input files, and in the end a Qlikview process was triggered and some Qlikview applications were refreshed with the new information.

All these problems would have been sorted out and fixed by operations without our intervention, but we are the early birds; the BI users are the first to spot delays in the IT factory. IT operations are monitored, but there is no special monitoring of the early precursors of the BI processes, so I said to myself: why not do just that? So I knocked together a simple monitor for this purpose.

I keep a list of jobs I want to follow up, fetch all jobs run in SAP, and join those with my list to find out which jobs have not completed successfully, then mail them to me.

This was easy peasy. (If you have a Lotus Notes mailbox and do not know Lotus Notes well, stick to a simple text message!)

I did the monitor as an ITL schedule with two jobs. The first job, ‘getMonitoredJobs’, sifts out failing jobs, and the second, ‘send_mail’, sends a mail to those concerned if there are failing jobs. You use the BAPI BAPI_XBP_JOB_SELECT to extract job information from SAP; the only snag is that you have to run this BAPI in an XMI/XBP session.
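The sifting logic of ‘getMonitoredJobs’ can be sketched like this. Note that the job names, status codes and record layout below are made up for illustration; in the real job the list of jobs comes from BAPI_XBP_JOB_SELECT inside the XMI/XBP session, not from a literal.

```python
# Made-up watch list; the real one is the list of jobs I follow up.
WATCH_LIST = {"ZBW_COVER_TIME", "ZBW_SHOP_CALENDAR"}

def failing_jobs(sap_jobs, watch_list):
    # Join the jobs run in SAP with my watch list and keep those that
    # have not completed successfully ('F' = finished in SAP job status).
    return [job for job in sap_jobs
            if job["name"] in watch_list and job["status"] != "F"]

jobs = [{"name": "ZBW_COVER_TIME", "status": "A"},     # aborted
        {"name": "ZBW_SHOP_CALENDAR", "status": "F"},  # finished OK
        {"name": "SOME_OTHER_JOB", "status": "A"}]     # not on the watch list
failed = failing_jobs(jobs, WATCH_LIST)
print(failed)  # only ZBW_COVER_TIME is reported; 'send_mail' would mail this list
```
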

And the result, an example:


I wrote the Spanish myself, without the help of Google Translate. I did use Google Translate for a translation into English, though. ‘Hi, work(s) important for the data warehouse, SAP fiasco’: not bad at all. I suspect it can be expressed better, but what the hey, I do not know Spanish. And it sounds fantastic in Google Translate :))

Switch to English and have Google Translate speak it out loud, very amusing. You can be creative with Google Translate in ways I never thought of before.


MySQL 5.6 is out, so what is next?

I just read Simon Mudd’s excellent post MySQL 5.6 is out, so what is next?, in which Simon puts forth his wish list for MySQL 5.7. Well, if Simon can wish, so can I. I think Simon’s wish list is a good one, so I only want to add one wish: parallel select processing of table partitions. Since my use of MySQL is Business Intelligence applications, I have some large (partitioned) tables that are read many times, so what I really want is faster select processing on large partitioned tables. Another option that could help speed up selects on partitioned tables is global indexes on partitioned tables, i.e. the index itself is not partitioned. The MySQL 5.6 key cache support is something I will test out first thing when I have migrated to MySQL 5.6. What is holding me back from migrating is our somewhat odd MySQL setup. I have a MySQL replica in Japan, which I rsync with our master database. First I have to upgrade the Japanese replica and then upgrade the master. The problem is I do not know how to upgrade the Japanese MySQL server. It is a standard Ubuntu 12.04 LTS install, MySQL is the normal Ubuntu package, and how do you upgrade that from MySQL 5.5 to 5.6? I found this post, but I will not try that on a machine on the other side of the globe, on a Linux I have only installed once. I do not know the Ubuntu procedures, but I hope they issue a MySQL 5.6 package for 12.04 LTS. The master MySQL runs on a Mandriva Linux server where I install the MySQL RPM packages myself.


Big Data and Relational data stores

The concept of Big Data has had a big impact in the corporate world. At least IT is talking about Big Data, even top-level IT management. But few actually know what it is about; a common misconception is ‘we need Big Data to efficiently manage our very large, increasing data volumes’. I have come across this misconception not once or twice over the last six months, but many times. Where do these ideas come from? Probably from evangelists who have seen the Big Data light and, more important, from those who think they can make a buck or two by selling Big Data solutions. All of a sudden, all major soft- and hardware vendors have Big Data solutions for sale.
What is Big Data then? Before I go into that, a short recap of what we have today: the relational data model.
In the relational database model, data is organised in tables. Data is stored in tables, can be viewed in virtual tables, and the result of operations on data is returned in tables. Data in a table is strictly ordered in columns and rows. Tables are organized into databases. The SCHEMA database is a special catalog database where all databases, tables and columns must be predefined before they can be used. Data normalisation is a strict procedure where (unstructured) data is deconstructed into structured tables. All access to data is done via a common language, the Structured Query Language (SQL).
In the relational model there exists a high degree of order: data is described by strict metadata rules in the SCHEMA database, these rules cannot be violated, and the relational database manager guarantees the data integrity.
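A small illustration of the relational manager guarding that order (using SQLite, with a toy table of my own): the schema is declared up front, and data that violates it is rejected.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The schema must be defined before any data can be stored.
con.execute("""CREATE TABLE authors (
                   author_id INTEGER PRIMARY KEY,
                   name      TEXT NOT NULL)""")
con.execute("INSERT INTO authors VALUES (1, 'Bob')")
try:
    # This violates the NOT NULL rule; the manager refuses to store it.
    con.execute("INSERT INTO authors VALUES (2, NULL)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```
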
In Big Data, little of this exists: there is no table structure, no SCHEMA equivalent, no common data access language. In fact, Big Data started its life as NoSQL, schema-free databases for documents and unstructured data, and was only recently rebranded Big Data.
Now wait a minute: ‘Haven’t I heard this before?’ Yes, in the beginning there was the Lotus Notes database. The Lotus Notes database is the mother of Big Data; that is something very few Big Data guys talk about. In the beginning Big Data was basically defined as ‘not relational’, which is not a good marketing concept; I think that is why the more positive-sounding name Big Data was conceived.
So what is Big Data?

Big Data: a generalized definition

This is a very superficial generalisation, since there is no close relation between the different Big Data store models. Data is stored in the form of unstructured documents or key-value pairs, often in JSON notation, and the data is accessed by programs (often JavaScript). That’s basically it.
Anyone can picture a table, but what does an unstructured document look like? This is an attempt to depict a Big Data document:
As you see, it is text/data/whatever you like to call it, here with the headers Subject, Author, PostedDate, Tags & Body. You store the document by throwing it at the Big Data manager, and it is stored for later use, simple as that. You access the document by writing a program that checks that the document has the headers that define the type of document you are interested in, then selects the subject ‘I like Plankton’, and your program is given the document(s).
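A toy document store in this spirit, sketched in Python (the store and the helper names are my own invention; the headers are the ones from the document above, the values are made up):

```python
import json

store = {}  # the whole 'document database': JSON documents filed under a key

def put(key, document):
    # Store the document by throwing it at the store; nothing is checked.
    store[key] = json.dumps(document)

def find(predicate):
    # Access is programmatic: scan the documents and keep the ones
    # the program recognizes and wants.
    return [doc for doc in map(json.loads, store.values()) if predicate(doc)]

put("doc1", {"Subject": "I like Plankton", "Author": "Bob",
             "PostedDate": "03/01/2013", "Tags": ["plankton"], "Body": "..."})
hits = find(lambda d: "Subject" in d and d["Subject"] == "I like Plankton")
print(hits[0]["Author"])  # Bob
```
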
Now we take the same document and deconstruct it into relational tables:
From the document I constructed four tables (column names are removed for visibility, but they are the same as the document headers). What I hope is obvious from the pictures: there is more order and complexity in relational tables compared with Big Data documents. (I simplified the table structure quite a bit and left out a lot of definitions; in real life there is even more order and complexity.) The Tags are moved to a separate table, and the links between Tags and Mails are moved to a special TagMail table. The Authors have got a table of their own; this is to show that if you add more data about the author(s), that data goes into the Authors table and not into the Mails table. Finally, I could not resist the temptation to change PostedDate into a decent format.
Here I cannot store the mail just by throwing the document at the relational database manager; I have to map the mail onto the table structure and issue separate SQL insert requests against the tables in the right order. This is definitely more complex than just throwing the document as it is at the Big Data manager. I can then access the mail by joining these tables together with SQL. Whether this is simpler than creating a program of the Big Data manager’s choice is a matter of taste. I happen to think SQL is simpler and better, and it is one unifying language for all relational managers.
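The relational side can be sketched with SQLite (my own simplified DDL and values, following the four-table decomposition above): the inserts must go in the right order, parents before children, and reading the mail back means joining the tables together again.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Authors (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Mails   (mail_id INTEGER PRIMARY KEY, subject TEXT,
                      author_id INTEGER REFERENCES Authors,
                      posted_date TEXT, body TEXT);
CREATE TABLE Tags    (tag_id INTEGER PRIMARY KEY, tag TEXT);
CREATE TABLE TagMail (tag_id INTEGER, mail_id INTEGER);
""")
# Inserts must go in the right order: parents before children.
con.execute("INSERT INTO Authors VALUES (1, 'Bob')")
con.execute("INSERT INTO Mails VALUES (1, 'I like Plankton', 1, '2013-03-01', '...')")
con.execute("INSERT INTO Tags VALUES (1, 'plankton')")
con.execute("INSERT INTO TagMail VALUES (1, 1)")

# Reading the mail back means joining the tables together again.
row = con.execute("""
    SELECT m.subject, a.name, t.tag
    FROM Mails m
    JOIN Authors a ON a.author_id = m.author_id
    JOIN TagMail tm ON tm.mail_id = m.mail_id
    JOIN Tags t ON t.tag_id = tm.tag_id
    WHERE m.subject = 'I like Plankton'""").fetchone()
print(row)  # ('I like Plankton', 'Bob', 'plankton')
```
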
Relational data is ordered and documented to a very high degree, whereas Big Data is not; there you order and structure your data in the programs that access the data. It is easier to start with a ‘Big Data’ store than with a relational one: you don’t have to model your data structure and define your database schema before you start using it. But this is like peeing in your pants: warm and cosy at first, but then it’s just wet and cold (and you stink). You must be very careful with your Big Data, otherwise you will lose control over your data.
This was a brief, generalized overview of Big Data & relational data stores. One question that arises when reading this is: ‘What has this to do with large data volumes?’ Not much; that is another aspect of Big Data and the relational database, which I try to address here.

Lotus Notes data administration.

‘Where is my important application data, and what is all this crap?’ This is what I hear over and over again from LN administrators. ‘We must enforce strict rules for creating databases; only trusted developers should be able to create databases.’ This is a mantra the LN admins are chanting. They (and even more so top-level IT management) try to fight the lack of control over data with masochistic self-imposed rules, restricting the creation of LN data.
This is not something you find in the relational camp; there, order prevails.