February 8, 2012 Analysis Crowd Sourcing MatPlotLib Python Twitter
The article below was written about a year ago. It was sitting on my hard drive, gathering virtual dust, so I figured I might as well post it here.
Sometimes I have a hunger for data analysis. Also, I am a software engineer/programmer, one of those people who tend to be lazy and want to automate tasks as much as possible. Combine the two, and this is what you get.
To still my hunger I decided to do some analysis. Anything would do, but of course, the bigger the better! (Or, the easier to automate, the better.) After some thinking I decided to go with Twitter. I already had some experience with writing Twitter bots (another article about that is coming up later), and Twitter is fairly easy to monitor and provides lots of posts/data.
Also, as shown in the idea/hypothesis section, it seems that sometimes people sleep badly collectively. As a wild guess, not counting the post-weekend syndrome, I thought that maybe the weather was an influence. Probably not, but that kind of data is easy to get.
And thus, a project was born!
Most people are already familiar with Twitter. To quote the information from the Twitter website:
Twitter is a real-time information network that connects you to the latest information about what you find interesting. At the heart of Twitter are small bursts of information called Tweets. Each Tweet is 140 characters in length.
People can talk to each other, but also just say something in general, without a target.
Information can be extracted from these Tweets in various ways. For example, there are several sentiment analysis websites, such as TwitterMood. These websites interpret huge numbers of Tweets. The outcome is then a number which represents the mood, possibly pinpointed to a location.
Sometimes when I am at work, I see several colleagues feeling sleepy. Yawning, looking tired, etc. Is it a coincidence that my colleagues are feeling sleepy at the same time? Or am I influenced by my own mood/sleepiness, projecting my own sleepiness on my colleagues? Or might there be an external source at play?
Given the previous observations, I was wondering if the weather influences how well a person sleeps. Given the information which can be easily gathered from Twitter, we should easily be able to define a hypothesis and test this hypothesis.
The hypothesis for this research is defined as follows: The weather influences how well people sleep.
A very broad hypothesis, I know, but I don’t want to exclude any weather-related issues. As will be shown in a later section, a couple of measurable weather ‘parameters’ are tested against the general mood of the sleepers.
The implementation is very basic, fortunately. It consists of matching Twitter data with the data from the KNMI (Royal Netherlands Meteorological Institute, in Dutch: Koninklijk Nederlands Meteorologisch Instituut).
Twitter provides the mood of people through Tweets. The Tweets are gathered by a small Python program, which uses the Tweepy library to interface with Twitter. To make sure that only posts about how a person has slept are used, we only use the Tweets returned by the Twitter search engine for the word ‘geslapen.’ The Dutch word ‘geslapen’ roughly means ‘have slept’ in English. This word was chosen specifically because it is a past participle. As a result, the Tweets will be about what has already happened, not what is going to happen. This makes the analysis of the Tweets simpler.
The reason why only Dutch Tweets are used for the analysis is that the Netherlands covers a relatively small area. This decision will be elaborated later on.
The Tweets are not stored; they are classified directly and only the result is stored. Classification is done in a fairly crude fashion. Two tables are kept: one table with positive words and one table with negative words. The words in the Tweet are matched against the words in the two tables. If there are more positive than negative words in the Tweet, then the Tweet is classified as a good sleep. If there are more negative than positive words in the Tweet, then the Tweet is classified as a bad sleep. The two word lists are given below. Translations of the words can be found using services such as Google Translate or Interglot. If the Tweet has no positive or negative words in it, then the result is stored as unclassified. This way of classifying certainly has its limits but works well enough for this little research.
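A formula which represents the algorithm is as follows, where classify(w) is +1 for a positive word, -1 for a negative word and 0 for any other word:
$$c = \sum_{w \in \text{Tweet}} \text{classify}(w), \qquad \text{result} = \begin{cases} \text{positive} & \text{if } c > 0 \\ \text{negative} & \text{if } c < 0 \\ \text{neutral/unclassified} & \text{otherwise} \end{cases}$$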
Note that when a Tweet contains a negative and a positive word, the Tweet is classified as neutral/unclassified.
- ‘kut’ (cursing word)
- ‘klote’ (cursing word)
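The classification described above can be sketched in a few lines of Python. This is only an illustration, not the original program: the function names are made up, and apart from ‘kut’ and ‘klote’ the words in the two sets are placeholders I assume for the example.

```python
import re

# Only 'kut' and 'klote' come from the article; every other word here is
# an assumed placeholder, not the original word table.
NEGATIVE_WORDS = {"kut", "klote", "slecht"}
POSITIVE_WORDS = {"goed", "lekker", "heerlijk"}


def classify_word(word):
    """Return +1 for a positive word, -1 for a negative word, 0 otherwise."""
    if word in POSITIVE_WORDS:
        return 1
    if word in NEGATIVE_WORDS:
        return -1
    return 0


def classify_tweet(text):
    """Sum the per-word scores and map the total to a label."""
    words = re.findall(r"\w+", text.lower())
    score = sum(classify_word(word) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unclassified"
```

A Tweet with one positive and one negative word sums to zero and therefore ends up as unclassified, exactly as noted above.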
The program ran for the whole month of December 2010. The results were gathered and stored hourly. Hourly seemed fine-grained enough for this research. It allows us to filter out day-sleepers, for example.
Using Matplotlib, figures can be easily generated from the gathered results. For example, the slept-good vs slept-bad results are shown in figure 1, below.
The KNMI provides daily weather data for each of its stations in the Netherlands. The data includes rainfall (mm), temperature (°C) and sunshine duration. It can be downloaded in CSV format, which is easy to process with a program. Using Numpy, the data can be imported quickly. Numpy was not really needed here, but I use it for other projects and was therefore more familiar with it than with the standard Python CSV reader, which allowed me to build the programs quickly.
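For readers without Numpy, reading such a file could look roughly like the sketch below, which uses the standard csv module instead. The file layout (comment lines starting with '#', one row per day, integer values), the column order and the sample numbers are all assumptions made for the example, not the actual KNMI download.

```python
import csv
import io
from datetime import datetime

# Made-up sample in an assumed layout: '#' comment lines, then one row per day.
SAMPLE = """\
# STN,YYYYMMDD,RH,SQ,TG
260,20101201,0,12,-45
260,20101202,3,0,-38
"""


def load_weather(stream, columns=("STN", "YYYYMMDD", "RH", "SQ", "TG")):
    """Return {date: {column: int}} for every non-comment, non-empty row."""
    rows = (line for line in stream if line.strip() and not line.startswith("#"))
    weather = {}
    for record in csv.DictReader(rows, fieldnames=columns, skipinitialspace=True):
        day = datetime.strptime(record["YYYYMMDD"].strip(), "%Y%m%d").date()
        # Keep only the measurement columns, converted to integers.
        weather[day] = {
            name: int(value)
            for name, value in record.items()
            if name not in ("STN", "YYYYMMDD")
        }
    return weather


sample_weather = load_weather(io.StringIO(SAMPLE))
```

For a real file, an open file object can be passed in place of the `io.StringIO` wrapper, with `columns` adjusted to whatever the download actually contains.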
Given that we only cover Dutch Tweets, and that the Netherlands covers a small area, I chose to use only one weather station. The weather station is located in De Bilt, which lies near the centre of the Netherlands. This simplified the analysis a lot, as the geo-location of a Tweet (if any is given at all) does not have to be considered. This may be considered a flaw, but my time is also limited!
Given both data sets, the classified Tweets and the KNMI data, we can now try to find a correlation between them.
So we’re all waiting for the results, right? Right! Here we go.
First of all, I do have to say that I do not think our sleep depends only on the weather. A whole lot more factors come into play. So the results probably do not match the hypothesis.
Also, I am trying to find correlations by hand. No algorithm or method is used to find a correlation.
Another note is that the weather data is for the current day. The Tweets, or rather the classifications, are from the current day as well, but represent how people slept the night before. As such, the weather data should be moved one day back. In the graphs, this is not done. I think I can see through this, so the graphs are not corrected.
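Had I corrected for this, the shift could have looked something like the sketch below: each day's sleep result is paired with the weather of the day before. The function name and the numbers are made up for illustration, and the original program does not contain this step.

```python
from datetime import date, timedelta


def align_with_previous_day(sleep_by_day, weather_by_day):
    """Pair each day's sleep score with the weather of the day before."""
    pairs = []
    for day in sorted(sleep_by_day):
        previous = day - timedelta(days=1)
        # Days without weather data for the previous day are skipped.
        if previous in weather_by_day:
            pairs.append((day, sleep_by_day[day], weather_by_day[previous]))
    return pairs


# Made-up numbers, purely for illustration.
sleep_scores = {date(2010, 12, 2): 0.6, date(2010, 12, 3): 0.4}
rain_by_day = {date(2010, 12, 1): 1.5, date(2010, 12, 2): 7.0}
pairs = align_with_previous_day(sleep_scores, rain_by_day)
```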
The different weather data I tested are (see http://www.knmi.nl/klimatologie/daggegevens/selectie.cgi):
- DR Duration of rainfall of that day
- FG Average day wind speed of that day
- FHVEC Average vector wind speed of that day
- NG Average cloudiness (max is 9) of that day
- PG Average air pressure of that day
- RH Total rainfall of that day
- SQ Total time sun was shining on that day
- TG Average temperature of that day
- UG Average relative air humidity of that day
To keep things short, I will give a short summary: I did not find any correlation. The figures can be found in the archive at the end of this article.
First of all, the DR figure. A quick look does not show any correlation to me. Then the FG figure. It has a slight bit of potential, but no. Next up is FHVEC. Again, I do not see any correlation. Sometimes it appears so, but then it is smashed to pieces by the next day. For example, take the two peaks on the right. (Keep in mind, the weather data should be shifted at least one day back.) Before the peaks, the weather data values are rising. So are the slept-good values. But then the last peak comes and smashes the correlation. The NG graph is a bit hard to read, as the orange bars nearly cover up the green/red bars. But, again, there does not seem to be a correlation. For the PG, RH, TG, and UG graphs, there is also no correlation as far as I can see. One can argue about a possible correlation in the SQ graph, but you need a lot of imagination to see one.
So the results do not support the hypothesis.
Directly from the last section we can conclude that we should have used a proper method to detect correlations. In the paper “Twitter mood predicts the stock market” a much better analysis is done. In that paper, the Granger causality analysis method is used.
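A much simpler first step than Granger causality would be a plain Pearson correlation coefficient between a weather series and a sleep series. The sketch below shows that simpler measure only. It is not what the paper does, and it was not part of my analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value near +1 or -1 would hint at a linear relation, and a value near 0 at none. With only one month of daily values, any outcome should be read with care.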
Also, the location of the Tweets is ignored and not matched to the weather data of a nearby station. Instead, all the data comes from a single weather station in the centre of the Netherlands. Perhaps there were minor but noticeable local differences in the weather that influenced people.
The determination of the mood of a Tweet is also fairly basic. As shown in Implementation, simply the positive and negative words are counted. The counts are then compared and the highest bidder wins. The implementation takes only a small number of words into account. It is likely that there are more words which can/should be used in the determination of the mood. There are several mood-determination methods. One is LIWC2007. It looks interesting, but not interesting enough to spend any money on it.
Many other topics can be researched, such as flu activity. From time to time there are flu epidemics. Twitter can be used to estimate how many people have the flu.
The most straightforward conclusion is that the weather is not a factor, or at least not the only factor, that influences how people sleep. No correlation was found in the graphs showing how people slept versus the weather data.
Another conclusion, albeit a very subjective one, is that analysis of the Twitterverse is fun! A lot of information can be gathered from it, and this is already being done. For example, brand and sentiment analysis can be very valuable to a company. Trends can also be spotted and combined with locations around the world, showing possible market demands.
A lot of things can still be analyzed. However, analysis has to be a bit more detailed, as seen from the given implementation. Possible improvements of the implementation have been given. Maybe this is for a later time, maybe not.
-  http://www.twitter.com/about
-  http://www.twittermood.org
-  http://www.knmi.nl/klimatologie/daggegevens/selectie.cgi
-  http://arxiv.org/abs/1010.3003
-  http://en.wikipedia.org/wiki/Granger_causality
-  http://www.python.org
-  https://github.com/joshthecoder/tweepy
-  http://matplotlib.sourceforge.net/
-  http://numpy.scipy.org/
The program, data and images used in this post can be found here: weather_sleep_tracker.tar.gz.