Moving Forward

Homepage of Andrew Robinson

Using Tweepy to access the Twitter Stream


I’ve dived head-first into Python lately as part of a new natural language processing project. Part of this project involves collecting tweets and inserting them into a database for later analysis. To accomplish this I set out to find a good API wrapper that was already written. I found Tweepy on GitHub and it seemed to do the trick. The problem is that most of the code was written before OAuth became a requirement, and while the library supported accessing the Twitter stream, there was no solid example. To that end I’ve examined the source and written an example that does just that.

Ensure you clone or download the code directly from the Git repo. The current stable release does not include support for OAuth in the stream module.

OAuth Authentication

The first step in using OAuth is to obtain your API keys, which are available on the Twitter App Developers site. You’ll need both a consumer key and an access token, along with their respective secrets, to communicate successfully. Once you have downloaded Tweepy and obtained the keys, you can start a new Python script and create an instance of the API class.

import tweepy

auth1 = tweepy.auth.OAuthHandler('CONSUMER KEY','CONSUMER SECRET')
auth1.set_access_token('ACCESS TOKEN','ACCESS TOKEN SECRET')
api = tweepy.API(auth1)

At this point you have a fully authenticated API module. To test things out you can post a tweet to your account.

api.update_status('This is a test!')

If all is good, you’ll see this tweet appear when you visit your homepage on Twitter.
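
The keys above are hard-coded for brevity. As a minimal sketch of one way to keep them out of the script, the helper below reads them from environment variables. The variable names and the load_credentials function are hypothetical and not part of Tweepy; adapt them to your own setup.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
REQUIRED_VARS = (
    'TWITTER_CONSUMER_KEY',
    'TWITTER_CONSUMER_SECRET',
    'TWITTER_ACCESS_TOKEN',
    'TWITTER_ACCESS_TOKEN_SECRET',
)


def load_credentials(environ=None):
    """Return the four OAuth values as a dict, or raise if any are missing."""
    if environ is None:
        environ = os.environ
    # Treat unset and empty variables alike: both are unusable as keys.
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError('Missing credentials: ' + ', '.join(missing))
    return dict((name, environ[name]) for name in REQUIRED_VARS)
```

The returned values could then be passed to OAuthHandler and set_access_token in place of the string literals used earlier.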

Creating a Stream Listener

The Twitter stream operates by holding open an HTTP connection and continuously sending JSON messages across it, each representing a recent tweet. Tweepy consumes this stream and passes each message to a callback class called StreamListener. To process tweets we have to subclass it. The following example implements the on_status method and simply inserts the tweet into a database. For simple data collection purposes this should be adequate. Additionally, we’ll display each tweet on the screen, using the TextWrapper class from the standard textwrap module, for debugging and observation.

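from textwrap import TextWrapper

import MySQLdb as mdb  # assumed: mdb refers to the MySQLdb module

class StreamListener(tweepy.StreamListener):
    # Wrap long tweets at 60 columns, indented for readability.
    status_wrapper = TextWrapper(width=60, initial_indent='    ', subsequent_indent='    ')
    # Placeholder credentials; replace them with your own.
    conn = mdb.connect('localhost', 'dbUser', 'dbPass', 'dbBase')

    def on_status(self, status):
        try:
            cursor = self.conn.cursor()
            # Query parameters are passed as a tuple, hence the trailing comma.
            cursor.execute('INSERT INTO tweets (text, date) VALUES (%s, NOW())', (status.text,))
            # Commit so the row persists on transactional tables.
            self.conn.commit()
            print(self.status_wrapper.fill(status.text))
            print('\n %s  %s  via %s\n' % (status.author.screen_name, status.created_at, status.source))
        except Exception:
            # Ignore any errors (for example unicode errors while printing
            # to the console) to avoid breaking the application.
            pass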
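
To make the callback mechanism more concrete, here is a rough sketch of how a line-delimited JSON stream could be dispatched to a listener. It is loosely modelled on the behaviour described above and is not Tweepy’s actual implementation. SketchListener, dispatch and sample_stream are hypothetical names, and sqlite3 is swapped in for the MySQL connection so the sketch runs without a database server.

```python
import json
import sqlite3


class SketchListener(object):
    """Hypothetical stand-in for a stream listener, backed by SQLite."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS tweets (text TEXT, date TEXT)')

    def on_status(self, status):
        # Parameters go in a tuple; sqlite3 uses ? as its placeholder.
        self.conn.execute(
            "INSERT INTO tweets (text, date) VALUES (?, datetime('now'))",
            (status['text'],))
        self.conn.commit()
        return True  # returning False stops the dispatch loop


def dispatch(lines, listener):
    """Feed each tweet found in an iterable of JSON lines to the listener."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are keep-alives, not data
        message = json.loads(line)
        if 'text' not in message:
            continue  # skip non-tweet messages such as delete notices
        if listener.on_status(message) is False:
            break


# Made-up sample data standing in for the live HTTP stream.
sample_stream = [
    '{"text": "hello world"}',
    '',
    '{"delete": {"status": {"id": 1}}}',
    '{"text": "goodnight"}',
]

conn = sqlite3.connect(':memory:')
dispatch(sample_stream, SketchListener(conn))
rows = conn.execute('SELECT text FROM tweets ORDER BY rowid').fetchall()
```

Only the two messages carrying a text field end up in the table; the keep-alive and the delete notice are skipped.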

Tying it all together

Finally we use the stream API and start capturing tweets. With all the framework in place, it’s a simple matter of setting a list of keywords to filter on and calling the filter method of the stream.

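listener = StreamListener()
streamer = tweepy.Stream(auth=auth1, listener=listener, timeout=3000000000)
# Keywords to filter on.
setTerms = ['hello', 'goodbye', 'goodnight', 'good morning']
# The first argument is the list of people to follow (none here),
# the second is the list of keywords.
streamer.filter(None, setTerms)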

Alternatively, to get a sample of all incoming tweets (something like 1% of the total), you can use the streamer.sample() method. With default API permission levels, the filter method accepts up to 400 keywords to filter on. Its first parameter accepts a list of user IDs to follow. Using both parameters together results in an OR of the two: a tweet is delivered if it comes from a followed user or matches a keyword.
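
The OR behaviour can be modelled locally. The sketch below is only an approximation for reasoning about what a filter will return, not Twitter’s matching logic. The matches function and the tweet dictionary layout are hypothetical, and it assumes a multi-word phrase matches when all of its words appear in the tweet.

```python
MAX_TRACK_TERMS = 400  # default-access keyword limit quoted above


def matches(tweet, follow=None, track=None):
    """Rough local model: True if the author is followed OR a phrase matches.

    tweet is a hypothetical dict with 'user_id' and 'text' keys.
    """
    follow = follow or []
    track = track or []
    if len(track) > MAX_TRACK_TERMS:
        raise ValueError('too many track terms: %d' % len(track))
    if tweet['user_id'] in follow:
        return True
    # Naive tokenisation: lower-case and split on whitespace only,
    # so punctuation stays attached to words.
    words = set(tweet['text'].lower().split())
    for phrase in track:
        terms = phrase.lower().split()
        # Assumption: every word of the phrase must appear, in any order.
        if terms and all(term in words for term in terms):
            return True
    return False
```

For example, a tweet reading "Good morning all" would match the term 'good morning' under this model, while "good evening" would not unless its author is in the follow list.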

After running your stream listener for a few days you’ll have more than enough data for some natural language processing with NLTK!

Written by Andrew Robinson

July 15th, 2011 at 1:14 pm

Posted in Uncategorized

9 Responses to 'Using Tweepy to access the Twitter Stream'


  1. Hi Andrew,

    Thanks for the example. I was looking for one, and it was weird that the Tweepy documentation doesn’t have one! Glad you published this piece of code!

    Btw, you have probably figured this out already, but StreamListener also has an on_data function which returns the raw stream results. Otherwise, you cannot get the additional parameters of the tweet.

    Is the “mdb” an access database ? Shame on you :)

    Thanks for the example code and the explanations again! I would like to hear your experiences/results with the NLTK/Twitter combination. I know the tweets aren’t a good example of pos-tagging, chunking. There was a really nice paper about Twitter NLP studies but no source code…

    Bahadir Cambel

    28 Dec 11 at 3:01 pm

  2. I have followed your instructions step by step, excluding the textwrap. But somehow I got this error message:

    “SSLError: The read operation timed out”

    Could you tell me why? I’ve searched for this SSL error, and I haven’t found any good answer…

    FYI, I’m using Windows 7, Python 2.7, and Tweepy 1.8, and an internet connection without a proxy or anything else…

    ismailsunni

    16 Feb 12 at 5:01 am

  3. I have a URL that works to return a JSON stream in a browser, but chokes with a 404 error when used in python code.
    Crazy talk.

    smitty

    9 Mar 12 at 10:37 pm

  4. Hi Andrew,

    Great stuff, best tutorial for me (and as a relative noob I’ve tried quite a few :-) )

    Thanks!

    Vincent

    26 Mar 12 at 5:07 pm

  5. Can you tell us noobs where to get TextWrapper ?? TNX!

    Robert Effinger

    26 Apr 12 at 7:03 pm

  6. For those with the same question as above ^^^^: just put this import line at the top of your file. It’s included by default in Python.

    from textwrap import TextWrapper

    Robert Effinger

    26 Apr 12 at 7:07 pm

  7. [...] read up on twitter librares for python and chose tweepy, a very simple one. Then I modified this tutorial by Andrew Robinson to count the number of statuses received for each emotion. After that, it was [...]

  8. Does anyone have a suggestion about how to insert all these items into a mongodb, using pymongo, instead of printing them to screen?

    PhilT

    17 Feb 13 at 11:26 pm

  9. Hi, thanks for this code and tutorial.

    When I run it I am getting the following error:
    conn = mdb.connect('localhost', 'dbUser', 'dbPass', 'dbBase')
    NameError: name 'mdb' is not defined

    Can you help with this, please? As a complete noob to Python I have no idea how to fix this.

    Thanks.

    Tom

    6 Apr 13 at 3:40 pm
