Tuesday, November 20, 2012

Scraping Twitter with ScraperWiki


While searching for a good scraping tool in Python, I came across many scrapers written in Python. Eventually I tried ScraperWiki, and it turned out to be quite interesting.

Everything can be done within the browser, and it is very simple to use. You can write Python scraper scripts directly in the browser, run and test the code there, and see the results on the same page. Scripts can also be written in other languages such as Ruby and PHP.
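
For instance, a minimal script on the platform can be just a few lines. The sketch below is my own illustration rather than an official example; the URL is only a placeholder, and lxml.html is one of the parsing libraries available there:

 import scraperwiki
 import lxml.html

 # Fetch a page (the URL is only an example) and parse the HTML
 html = scraperwiki.scrape('http://example.com/')
 root = lxml.html.fromstring(html)

 # Print the page title; output appears right below the editor
 print root.findtext('.//title')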

It also provides various other built-in utilities, such as scraping CSV and Excel files and storing the scraped data back to a database. Please visit https://scraperwiki.com/ to learn more.
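For example, writing to the built-in SQLite datastore takes a single call. This is a small sketch of my own, assuming a record keyed on an 'id' field:

 import scraperwiki

 # Save a record into the built-in SQLite datastore; 'id' is the
 # unique key, so re-running the script updates the row in place
 record = {'id': 1, 'name': 'example row'}
 scraperwiki.sqlite.save(unique_keys=['id'], data=record)

 # Read it back from the default table, 'swdata'
 print scraperwiki.sqlite.select('* from swdata')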

I decided to write a simple scraper to fetch search results from Twitter, and here is my piece of Python code. You can also take any of the publicly available scripts on the ScraperWiki site, modify them, and run them yourself.

 import scraperwiki
 import simplejson
 import urllib2

 # Get results from the Twitter search API. Change QUERY to your search term of choice.
 # Examples: 'newsnight', 'from:bbcnewsnight', 'to:bbcnewsnight'

 QUERY = 'bigdata'
 RESULTS_PER_PAGE = 100
 LANGUAGE = 'en'
 NUM_PAGES = 5

 for page in range(1, NUM_PAGES + 1):
     base_url = 'http://search.twitter.com/search.json?q=%s&rpp=%s&lang=%s&page=%s' \
         % (urllib2.quote(QUERY), RESULTS_PER_PAGE, LANGUAGE, page)
     try:
         # scraperwiki.scrape fetches the URL; the response is JSON
         results_json = simplejson.loads(scraperwiki.scrape(base_url))
         for result in results_json['results']:
             # Keep only the fields we need from each tweet
             data = {}
             data['id'] = result['id']
             data['text'] = result['text']
             data['from_user'] = result['from_user']
             print data['from_user'], data['text']
     except Exception:
         print 'Failed to scrape %s' % base_url
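
As written, the script only prints each tweet. To keep the results between runs, you could save each record to the datastore instead; a small sketch, replacing the print line above:

 # Persist each tweet; using the tweet id as the unique key
 # avoids duplicate rows across pages and re-runs
 scraperwiki.sqlite.save(unique_keys=['id'], data=data)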
       
