2014-08-18

Extracting tweets with Twitter API

I have for some time now played around with the Twitter API, mainly tweeting operations status reports. That was some kind of warming up activity. The thing I really wanted to do was extracting tweets and build statistics from it. It turned out to be both hard and simple. It was easy to set up the search and extraction of tweets, but hard to specify search arguments. I wanted to scan Twitter for tweets about the Company, but the Company consists of different brands and business areas there are many names to search for. I decided to start with the Company name only. Then I realised there are a lot of twitterers associated with the Company and I wanted to catch all their tweets no matter if the tweets contain the Company name or not. To catch those tweets I created a list with those usernames and scanned the list for their tweets. The tweets I found I save in a MySQL table:

CREATE TABLE `tweets` (
`id` bigint(20) NOT NULL,
`created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`msg` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`lang` char(3) COLLATE utf8_unicode_ci NOT NULL,
`user_id` bigint(20) NOT NULL,
`user_name` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
`user_screen_name` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
`lattitude` decimal(18,12) NOT NULL,
`longitude` decimal(18,12) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci   


Then I just run the extraction jobs and they worked fine from start.
A problem I have is the extraction API methods (‘search/tweets’ & ‘lists/statuses’) I use, they do not catch all tweets but I do not think that matters much, I’m  interested in trends more than absolute figures and I intend to schedule the extraction jobs about four times a days night. I reckon I catch the lion’s share of the selected tweets.


Another problem; I use David Grudl’s Twitter for PHP, this excellent package lacks the list/status method,  so I added this function:
public function lists($query){
$res = $this->request('lists/statuses', 'GET', $query));
return $res;
}

This is how the ITL  Twitter list extraction script looks:
If you look at the job allTwots you see I supply the Twitter list_id to the script twitterListsStatus and then I just load the result into mysql with the subsequent job loadAllTwots.


The Twitter search workflow is very similar and all this works like a charm, there are still some rough edges and I have not really checked the result yet, but I have good hope the result is good enough to be used.


This is a part of the resulting Mysql twitter table. In another post I show how I use this table.

I end this post with the above mentioned twitterListsStatus.php script.
/* Copyright (C) 2014 Lars Johansson This program product is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. */
/**
* Twitter - dynamically included in {@link execScript()}
*
* I found the included twitter class on https://github.com/dg/twitter-php
* written by (c) David Grudl, 2008, 2013 (http://davidgrudl.com)
*
* This code is intended for scanning twitter statuses of a list. Example:
*
* <job name='getListsStatuses' type='script' pgm='twitterListsStatus.php' outfile='filename' list_id='nnn'>
*    <iconv input='UTF-8' output='UTF-8' internal='UTF-8'/>
*    <host></host>
*    <app></app>
* </job>
*
* The <host> (default 'twitter') and <app> (default 'DW status') xml tag,
*    points to a named entry in custom/twitter.xml.See {@link getSysContactInfo()} for details.
*
* @author Lasse Johansson <lars.a.johansson@se.atlascopco.com>
* @package adap
* @subpackage utillity
* @uses getHostSysDsn()
* @return bool $_RESULT This is what the calling execScript checks at return
*/
$listid = trim($job['list_id']);
$log->enter('Note',"script=$scriptName list_id=$listid");
if (!$listid) {
 $log->logit('Error',"You must supply a Twitter list_id");
 return FALSE;
}
$hostInfo = 'twitter';
$xml = getCustomXml($context,$hostInfo);
$instlib = $xml->install;
if (is_null($instlib)){
 $log->logit('Error',"Tag <install> not found in $hostInfo");
 return FALSE;
}
require_once $instlib;
$outfile  = array_key_exists('outfile',$job) ? $job['outfile']          : 'ALLTWOTS.txt';
$hostInfo = array_key_exists('host',$job)    ? $job['host'][0]['value'] : 'twitter';
$tapp     = array_key_exists('app',$job)     ? $job['app'][0]['value']  : 'DW status';
$txml = getHostSysDsn($context, $hostInfo, "$tapp");
if (!$txml) return FALSE;

$twitter = new Twitter($txml['Consumerkey'], $txml['Consumersecret'], $txml['Accesstoken'], $txml['Accesstokensecret']);
$_RESULT = FALSE;
$resultcount = $lastcount = 100;
$searchResult = [];

$arg = ['list_id'=>$listid, 'count'=> "$resultcount"];
// TODO - inplement since_id = lowest id yesterday!
$docnt = 0;
while ($resultcount == $lastcount){
 $docnt++;
 if ($docnt > 50){
   $log->logit('Note', "BREAKING do count at $docnt; loop?");
   break;
 }
 $log->dumpit('Note', "Twitter args", $arg);
 try {
 $tempResult = $twitter->lists($arg);

 $lastelem = end($tempResult);
 if ($lastelem){
   $minid = $lastid = $lastelem->id;
   foreach ($tempResult as $t){
     $minid = min($t->id,$minid);
   }
   $log->logit('Note',"MINID=$minid,lastid=$lastid");
   $searchResult = array_merge($searchResult, $tempResult);
   $arg['max_id'] = $minid - 1;
 }
 } catch (TwitterException $e) {
 $log->logit('Error', $e->getMessage());
 $tempResult = FALSE;
 $_RESULT = FALSE;
 break;
 }
$lastcount = is_array($tempResult) ? count($tempResult) : FALSE;
$log->logit('Note', "LAP $docnt lastcount $lastcount");
}
foreach ($searchResult as $sr){
 $r = [];
 $r['id'] = $sr->id;
 $r['created'] = strtotime($sr->created_at);
 $r['created'] = date('Y-m-d h:m:s',$r['created']);
 $r['text'] = strtr($sr->text, ";\n", "  ");
 $r['lang'] = $sr->lang;
 if ($sr->user){
   $r['usr_id'] = $sr->user->id;
   $r['usr_name'] = $sr->user->name;
   $r['usr_screen_name'] = $sr->user->screen_name;
 } else {
   $r['usr_id'] = '';
   $r['usr_name'] = '';
   $r['usr_screen_name'] = '';
 }
 if ($sr->coordinates){
   $co = $sr->coordinates->coordinates;
   $r['coordinate_lat'] = $co[0];
   $r['coordinate_lon'] = $co[1];
 } else{
   $r['coordinate_lat'] = '0';
   $r['coordinate_lon'] = '0';
 }
$flat .= implode(';',$r) . "\n";  
}
file_put_contents("$outfile", $flat);
$_RESULT = TRUE;
return $_RESULT;

No comments:

Post a Comment