2014-08-18

Extracting tweets with Twitter API

I have for some time now played around with the Twitter API, mainly tweeting operations status reports. That was some kind of warm-up activity. What I really wanted to do was extract tweets and build statistics from them. It turned out to be both hard and simple: it was easy to set up the search and extraction of tweets, but hard to specify the search arguments. I wanted to scan Twitter for tweets about the Company, but the Company consists of different brands and business areas, so there are many names to search for. I decided to start with the Company name only. Then I realised there are a lot of twitterers associated with the Company, and I wanted to catch all their tweets whether or not the tweets contain the Company name. To catch those tweets I created a list of those usernames and scanned the list for their tweets. The tweets I find I save in a MySQL table:

CREATE TABLE `tweets` (
`id` bigint(20) NOT NULL,
`created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`msg` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`lang` char(3) COLLATE utf8_unicode_ci NOT NULL,
`user_id` bigint(20) NOT NULL,
`user_name` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
`user_screen_name` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
`lattitude` decimal(18,12) NOT NULL,
`longitude` decimal(18,12) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci   


Then I just ran the extraction jobs and they worked fine from the start.
One problem is that the extraction API methods I use (‘search/tweets’ & ‘lists/statuses’) do not catch all tweets, but I do not think that matters much: I’m interested in trends more than absolute figures, and I intend to schedule the extraction jobs about four times a day. I reckon I catch the lion’s share of the selected tweets.
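The paging strategy the script below uses (repeatedly lowering max_id until a short page comes back) can be sketched in isolation. Here fetch_page() is a hypothetical stand-in for the real API call; it just filters a list of integer ids the way Twitter filters on max_id:

```php
<?php
// Hypothetical stand-in for the API call: returns up to $count ids
// that are <= $max_id, newest (highest) first.
function fetch_page(array $all_ids, int $max_id, int $count): array {
    $page = array_values(array_filter($all_ids, fn($id) => $id <= $max_id));
    rsort($page);
    return array_slice($page, 0, $count);
}

// Collect every id by walking backwards with max_id until a page
// comes back shorter than the requested count.
function fetch_all(array $all_ids, int $count): array {
    $result = [];
    $max_id = PHP_INT_MAX;
    do {
        $page = fetch_page($all_ids, $max_id, $count);
        $result = array_merge($result, $page);
        if (!$page) break;
        $max_id = min($page) - 1; // next page holds only older tweets
    } while (count($page) == $count);
    return $result;
}
```

With $count = 2 and ids 1..5 this fetches all five ids in three calls, stopping when the last page is short.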


Another problem: I use David Grudl’s Twitter for PHP. This excellent package lacks the ‘lists/statuses’ method, so I added this function to the Twitter class:
public function lists($query){
  // GET 'lists/statuses' via the package's generic request() method
  $res = $this->request('lists/statuses', 'GET', $query);
  return $res;
}

This is how the ITL Twitter list extraction workflow looks:
If you look at the job allTwots you see I supply the Twitter list_id to the script twitterListsStatus, and then I just load the result into MySQL with the subsequent job loadAllTwots.
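The loadAllTwots job is essentially a bulk load of the ‘;’-delimited file the script writes. A sketch of what such a load could look like (the file name is the script's default; REPLACE lets the job be rerun without tripping on the primary key):

```sql
LOAD DATA LOCAL INFILE 'ALLTWOTS.txt'
REPLACE INTO TABLE tweets
CHARACTER SET utf8
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
(id, created, msg, lang, user_id, user_name, user_screen_name, lattitude, longitude);
```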


The Twitter search workflow is very similar, and all this works like a charm. There are still some rough edges and I have not really checked the result yet, but I have good hope the result is good enough to be used.


This is a part of the resulting MySQL twitter table. In another post I will show how I use this table.

I end this post with the above mentioned twitterListsStatus.php script.
/*
 * Copyright (C) 2014 Lars Johansson
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 */
/**
* Twitter - dynamically included in {@link execScript()}
*
* I found the included twitter class on https://github.com/dg/twitter-php
* written by (c) David Grudl, 2008, 2013 (http://davidgrudl.com)
*
* This code is intended for scanning twitter statuses of a list. Example:
*
* <job name='getListsStatuses' type='script' pgm='twitterListsStatus.php' outfile='filename' list_id='nnn'>
*    <iconv input='UTF-8' output='UTF-8' internal='UTF-8'/>
*    <host></host>
*    <app></app>
* </job>
*
* The <host> (default 'twitter') and <app> (default 'DW status') xml tags
*    point to a named entry in custom/twitter.xml. See {@link getSysContactInfo()} for details.
*
* @author Lasse Johansson <lars.a.johansson@se.atlascopco.com>
* @package adap
* @subpackage utillity
* @uses getHostSysDsn()
* @return bool $_RESULT This is what the calling execScript checks at return
*/
$listid = trim($job['list_id']);
$log->enter('Note',"script=$scriptName list_id=$listid");
if (!$listid) {
 $log->logit('Error',"You must supply a Twitter list_id");
 return FALSE;
}
$hostInfo = 'twitter';
$xml = getCustomXml($context,$hostInfo);
$instlib = $xml->install;
if (is_null($instlib)){
 $log->logit('Error',"Tag <install> not found in $hostInfo");
 return FALSE;
}
require_once $instlib;
$outfile  = array_key_exists('outfile',$job) ? $job['outfile']          : 'ALLTWOTS.txt';
$hostInfo = array_key_exists('host',$job)    ? $job['host'][0]['value'] : 'twitter';
$tapp     = array_key_exists('app',$job)     ? $job['app'][0]['value']  : 'DW status';
$txml = getHostSysDsn($context, $hostInfo, "$tapp");
if (!$txml) return FALSE;

$twitter = new Twitter($txml['Consumerkey'], $txml['Consumersecret'], $txml['Accesstoken'], $txml['Accesstokensecret']);
$_RESULT = FALSE;
$resultcount = $lastcount = 100;
$searchResult = [];

$arg = ['list_id'=>$listid, 'count'=> "$resultcount"];
// TODO - implement since_id = lowest id yesterday!
$docnt = 0;
while ($resultcount == $lastcount){
 $docnt++;
 if ($docnt > 50){
   $log->logit('Note', "BREAKING do count at $docnt; loop?");
   break;
 }
 $log->dumpit('Note', "Twitter args", $arg);
 try {
   $tempResult = $twitter->lists($arg);
   $lastelem = end($tempResult);
   if ($lastelem){
     $minid = $lastid = $lastelem->id;
     foreach ($tempResult as $t){
       $minid = min($t->id, $minid);
     }
     $log->logit('Note',"MINID=$minid,lastid=$lastid");
     $searchResult = array_merge($searchResult, $tempResult);
     $arg['max_id'] = $minid - 1; // next lap fetches only older tweets
   }
 } catch (TwitterException $e) {
   $log->logit('Error', $e->getMessage());
   $tempResult = FALSE;
   $_RESULT = FALSE;
   break;
 }
 $lastcount = is_array($tempResult) ? count($tempResult) : FALSE;
 $log->logit('Note', "LAP $docnt lastcount $lastcount");
}
$flat = ''; // initialise before appending in the loop below
foreach ($searchResult as $sr){
 $r = [];
 $r['id'] = $sr->id;
 $r['created'] = strtotime($sr->created_at);
 $r['created'] = date('Y-m-d H:i:s',$r['created']); // 'h:m:s' would give 12-hour clock, month and seconds
 $r['text'] = strtr($sr->text, ";\n", "  ");
 $r['lang'] = $sr->lang;
 if ($sr->user){
   $r['usr_id'] = $sr->user->id;
   $r['usr_name'] = $sr->user->name;
   $r['usr_screen_name'] = $sr->user->screen_name;
 } else {
   $r['usr_id'] = '';
   $r['usr_name'] = '';
   $r['usr_screen_name'] = '';
 }
 if ($sr->coordinates){
   $co = $sr->coordinates->coordinates;
   $r['coordinate_lat'] = $co[0];
   $r['coordinate_lon'] = $co[1];
 } else{
   $r['coordinate_lat'] = '0';
   $r['coordinate_lon'] = '0';
 }
$flat .= implode(';',$r) . "\n";  
}
file_put_contents("$outfile", $flat);
$_RESULT = TRUE;
return $_RESULT;

2014-08-10

Standing on the shoulders of giant mistakes




In some posts I have pointed out others’ blunders. When I see a mistake in system-level design I try to remember it, so I do not have to repeat it. One such example is return codes.

Return codes are numeric values used to signal the outcome of a piece of code, e.g. a function. Traditionally, return code 0 (zero) means all went well, and the higher the return code, the less well the piece of code was able to perform. The problem with such return codes is that the more values they have, the more complex the interface becomes, since the receiver of the return code must be able to deal with all the values. As long as a function is called from one caller only this is not that bad, but when a function is called from several callers it starts to get messy, and when the return code scheme is changed it is common that not all callers are updated to handle the new or changed values, which in the end triggers erratic behavior.

I try to use only two return code values: 0 (zero) indicates failure, non-zero indicates success. Almost all modern programming languages support Boolean true/false values, where zero is decoded as false and non-zero as true, and I abide by these languages, breaking with the tradition of making zero signal success. This makes my programs simpler and less error prone. I have to translate return codes when calling or being called from external software; knowing I’m right makes that extra work light. There are situations where the true/false return code values are not enough, one example being the absence of a return code, a null value.
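In PHP the convention looks like this. The function name and file are examples only; the point is that failure is FALSE and success is the useful value itself:

```php
<?php
// Two-valued return code: FALSE = failure, anything else = success.
// The caller has one thing to test instead of a table of codes.
function load_config(string $file) {
    if (!is_readable($file)) {
        return false;              // failure
    }
    return parse_ini_file($file);  // success: the parsed config itself
}

$cfg = load_config('/no/such/file.ini');
if ($cfg === false) {
    // handle failure; no numeric code to decode
}
```

Note the ===: PHP's own strpos() follows the same FALSE-on-failure convention, and a legitimate result of 0 must not be mistaken for failure.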
In IBM’s Job Control Language (JCL), condition codes are return codes that signal the outcome of a job step. E.g. step 1 in a job ends with condition code 4; step 2 tests for a 0 (zero) condition code, which in this case means it will not execute. Now step 3 tests for a 0 (zero) condition code from step 2, which means step 3 expects step 2 to have run successfully. But in this example step 2 didn’t run, so JCL sets a default 0 (zero) condition code. In JCL, a successful execution and no execution at all are both flagged by a 0 (zero) condition code.
In my job control language ITL I tackle this problem by allowing you to test for bumped-over job(step)s, plus an option for how to interpret bumped-over jobs (success or failure). I consider my ITL language’s return code scheme better than that of any other job control system I have seen. This is not because I am better, but because I have had the chance to learn from others’ mistakes.
If you study my works you can learn a lot.

2014-08-05

Patching rfc_read_table

rfc_read_table is a very useful program someone wrote a long time ago, probably some SAP employee. rfc_read_table does what the name implies: it reads tables, and it is Remote Function Call enabled, which means you can call it from outside SAP.

I found this useful program in 2005, when the company replaced my MRP system with SAP. We (the company) did not have the resources to convert my Data Warehouse to SAP (which I advocated fiercely) at the same time. I realised SAP standard reporting could not be used: if SAP reporting can be said to be structured vertically, our marketing department is structured horizontally (actually it is structured both ways). IMHO SAP standard reporting is pretty lame. Any road up, the Data Warehouse is written in PHP, so I looked for a SAP PHP interface and found Eduard Koucky’s SAPRFC and Piers Harding’s SAPNWRFC. After some googling I found rfc_read_table, which is excellent for extracting data from SAP systems.

Unfortunately rfc_read_table has some glitches and limitations. In the beginning I made some changes to the program to overcome some of the limitations. For some reason SAP does not maintain rfc_read_table, so you have to patch it yourself. (SAP also discourages you from using rfc_read_table.) Years ago I attempted to write an ABAP program, rfc_execute_dynamic_sql, but I never finished it, probably due to lack of inspiration and to the fact that rfc_read_table covers most of our needs.

You can do some nifty things with rfc_read_table, and I happen to believe full table extraction is the way to go. Yes, I know there are some practical issues, but in a future where extracting even very large amounts of data is done in the blink of an eye, delta extracts will be a thing of the past. (Delta extraction/load should be the last resort, and pure table extraction is very natural when you deal with relational data.)
rfc_read_table is a very useful program, but it has some glitches, even bugs, and lo and behold, after all these years we came across one. Packed numerics allocate two bytes too little in the output buffer (or, to be correct, my version of rfc_read_table did); now it’s fixed.


I do not know if it applies to the original version too, but the program did not take into account the decimal delimiter and the trailing negative sign. ISO 11404 prescribes a leading negative sign, which most of us find quite natural (except financial people, who seem to prefer a trailing sign). I did the swapping of the trailing to a leading negative sign in PHP, a grave mistake on my side; it should have been dealt with directly in rfc_read_table.
That was rather lengthy about patching rfc_read_table.
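The trailing-to-leading sign swap is small either way. A sketch of what the PHP side of it looks like (the function name is mine; the patched rfc_read_table now does the equivalent in ABAP):

```php
<?php
// SAP packed numerics may arrive as '123.45-'; ISO 11404 style
// wants the sign in front: '-123.45'. Positive values pass through.
function lead_sign(string $num): string {
    $num = trim($num);
    if (substr($num, -1) === '-') {
        return '-' . substr($num, 0, -1);
    }
    return $num;
}
```

So lead_sign('123.45-') gives '-123.45', while lead_sign('42') stays '42'.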

2014-08-03

PHP 5.6 and a tweet with a graph, final cut.

The 1st of August I wrote this post:
In the previous post I hoped for a nicer monthly graph the 1st of August, produced by a daily batch job. I almost delivered! The graph is nicer, but the automatically scheduled job blew up; it looks like there is a problem with the Cron environment. If you do not follow the Data Warehouse on Twitter I invite you to do so at https://twitter.com/tooljn.


This morning I compared the logs from a successful run:


with the failed run:

In PHP version 5.5 I have not linked in the necessary software for pChart2, and that is still the PHP version called from the Cron environment. So the fix for this problem is to upgrade to PHP 5.6. As of 2014-07-31 we have PHP 5.6 RC3, and I plan for this version to go into production in the middle of the month. Follow the data warehouse on Twitter: if there is a tweet with a graph 2014-09-01, then we have upgraded to PHP 5.6. There are a hell of a lot of tests to run before a PHP version upgrade. I have run PHP 5.6 for some time now and have not had any problems; the only thing I have noticed so far is that PHP 5.6 is faster.

The development of the monthly graph application was done in the production environment (except for the PHP 5.6 version). Since the application runs once a month, the progress in calendar time has been protracted, but the working hours are not many. The current version of the graph application is the final version; it will run in PHP 5.6 and above.

2014-08-01

Tweet with a graph from the Data Warehouse, take 3

In the previous post I hoped for a nicer monthly graph the 1st of August, produced by a daily batch job. I almost delivered! The graph is nicer, but the automatically scheduled job blew up. It looks like there is a problem with the Cron environment. If you do not follow the Data Warehouse on Twitter I invite you to do so at https://twitter.com/tooljn.

Your pen is leaking!

I’m intending to write a series of posts on development methodology or something similar. Over the years I have come across a load of development methodologies. Most of them make sense one way or another, and many emanate from older ones; it’s kind of evolutionary, as it should be. Newer methods tend to be better than older ones, but at the same time there is nothing new under the sun: when I read what the true pioneers wrote more than half a century ago about programming, I’m baffled by their insight and understanding of programming and software development. But much is also bullshit, created by inflated egos trying to make a buck out of baloney.

In the mid 1980s I attended a software user conference. One of the attendees was Mr X; he didn’t know the software well, and it was clear to me he was there to sell himself and his services. Mr X was political, socializing with managers and software representatives only, and he managed to become ‘head’ of the user community for a short period of time. Anyway, at this conference Mr X held a presentation about his development method. The first slide had three defining moments in the history of computers:
  1. 194X The first computer
  2. 196Y IBM’s 360 architecture
  3. 198Z Mr X developing method
Most attendees stopped listening here. I lack a critical mind, so I listened with some enthusiasm, but I soon realised there was nothing new or exciting in Mr X’s presentation, and nothing remotely connected to the software we had gathered for, so I also stopped listening after a short while. Then a Norwegian software developer tried to interrupt the presentation, ‘Mr X, your...’. He was hushed into silence by Mr X; the Norwegian developer was all too insignificant for Mr X to take notice of. Then I saw what most of us had already seen: there was an ink spot growing just below the chest pocket on Mr X’s shirt, slowly but steadily. By the time the spot was clearly visible to everyone, we were all mesmerized by the growing dark blue spot on the white shirt. No one listened to Mr X, and the Norwegian guy tried once more, ‘Mr X, your...’, and once more he was hushed by Mr X: ‘Can you wait until after my presentation with questions, please!’ When the spot was larger than the palm of a hand, the Norwegian guy could not hold back anymore: ‘Mr X, I think your pen is leaking.’ That was the end of the presentation. At the dinner that evening Mr X had a glass too many and never showed up again (as I recall it).

Now that I’m writing about development methods, I am aware it might just be pretentious bull. You can judge for yourself when I publish (if I do). But I will read any comments with respect, no matter who you are.

2014-07-23

Looking for my posts

I recently read a post on how to search engine optimize blog posts. I decided to look for my own posts in Google to find out how well they were doing in Google’s search engine. To my pleasure I found my posts showed up, some even quite high, when I searched for them. But I was a bit saddened to find that extracting data from SAP was hard to find; a year ago it ranked higher, but I suppose many posts are written about this subject. (Still, I find my post informative, with practical hands-on examples.) By prefixing my search term with 12d (the name of my blog), all the posts I tried showed up, except some data warehouse; no matter how I tried, the post refused to show up in the search result. Other posts of mine showed up while explicitly searching for some data warehouse, but not the post itself. I have checked the post, and I cannot find a reason in it for not showing up in the search result. The quality is no worse than most of my other posts. No broken links. No explicit language, nothing controversial. It is a mystery to me why the some data warehouse post refuses to show up in the search result.
(Update! I just had to put double quotes around the argument to find my post. There are a lot of "some" + "data" + "warehouse" hits, which is probably why I could not find my post. However, the meaning of "some" is different: in the other posts "some" means "an unspecified number". I use "some" as in "some chicken", "some data warehouse".)


While searching for my own posts I checked how many have read them, and the now not-so-easy-to-find extracting data from sap was a clear winner with 736 pageviews, with replicating MySQL with Rsync as the runner-up at 652 pageviews. Not bad at all; even if many readers were not what I had in mind when I started blogging, it is very rewarding that you people out there read my posts. Actually, I’m mesmerized by the view stats, looking at them almost every day. What I think is a bit sad is that no more than 30 have read my posts on relational big data; I think they are good and worth some minutes of your time. But what I really think is a pity is that no more than 14 have read php parallel job scheduling 4. This post shows how to parallel process and rightsize business processes like the extraction of data from source systems; it is a complex and advanced post with a different viewpoint from most writings about parallel processing, which typically deal with speeding up the minute atomic parts of computer programs. The entire series on parallel workflow processing is worth reading, but I suppose very few are interested in this topic, especially when the parallelism is done in PHP, so 14 pageviews is maybe not that bad :-)