In this project, I gather, assess, wrangle, and clean a dataset assembled from the channel "We rate dogs". I use the Twitter API to query the text of the tweets and fill in columns missing from the base dataset, and I request a separate dataset with the associated image predictions. I then use Python tools to assess problems with the data. The result is a tidy, useful dataset of good-quality data.
The data: users submit pictures and a short text to the group "We rate dogs" to have their pet rated. Most of the time, dogs are classified into stages known to that community: doggo, puppo, pupper. Other comments generally use internal lingo, and the names of the dogs are usually given as well.
Steps in this project:
1 - Gathering the data: load existing dataset, use Twitter API to query tweets, use information extracted in JSON, request the images dataset from a website and save it as a local dataset.
2 - Assess data: assess data sources for "quality" and "tidiness". Issues include, but are not limited to: different sizes of merged datasets, mismatch of text for dog stage between base dataset entry and tweet due to problem capturing string, incorrectly captured dog names, tweets later deleted by users, among others.
3 - Cleaning data: establish and execute an action for each of the quality and tidiness problems found.
4 - Plots and comments: a first approach to analyzing the data with preliminary plots. I found that for this channel the top five most popular breeds are Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, and Pug, with the top one, Golden Retriever, appearing almost twice as often as fourth place and three times as often as fifth. The breeds whose photos were most retweeted were also, in order, Golden Retriever, Labrador Retriever, Pembroke, and Chihuahua, but fifth place goes to Samoyed (seventh in number of appearances). Pugs are popular but do not get retweeted as much.
import requests
import numpy as np
import pandas as pd
import json
import tweepy
import sys
# Load base dataset
tw = pd.read_csv("twitter-archive-enhanced.csv")
tw.head()
# Setting up API
consumer_key = 'my_consumer_key'
consumer_secret = 'my_consumer_secret'
access_token = 'my_access_token'
access_secret = 'my_access_secret'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True) # http://docs.tweepy.org/en/v3.5.0/api.html
# Initializing dataframe columns
init_v = [-1 for i in range(len(tw))]
tw['favorite_count'] = init_v
tw['retweet_count'] = init_v
errors = ['' for i in range(len(tw))]
tw['errors'] = errors
tw['json_tw'] = errors
# Querying Twitter and saving JSON
data = {}
data['tweets'] = []
for i in range(0, len(tw)):
    try:
        # These first two lines are a way of directly accessing the values
        # without need for JSON
        fav_c = api.get_status(tw.tweet_id[i]).favorite_count
        tw.at[i, 'favorite_count'] = fav_c
        rtw_c = api.get_status(tw.tweet_id[i]).retweet_count
        tw.at[i, 'retweet_count'] = rtw_c
        # Here I query the JSON strings for the sake of practicing with JSON;
        # there are other ways of getting the tweets' information
        json_tw = api.get_status(tw.tweet_id[i])._json
        data['tweets'].append(json_tw)
        # Saving JSON string to dataframe
        tw.at[i, 'json_tw'] = json_tw
    except Exception:
        e = sys.exc_info()[0]
        tw.at[i, 'errors'] = str(e)
# Visually inspecting dataset with new columns
tw.head(4)
# Initializing dataframe columns
init_v = [-1 for i in range(len(tw))]
tw['favorite_count_JSON'] = init_v
tw['retweet_count_JSON'] = init_v
# Read each line
for i in range(0, len(data['tweets'])):
    indice = tw.index[tw.tweet_id == data['tweets'][i]['id']].tolist()[0]
    tw.at[indice, 'retweet_count_JSON'] = data['tweets'][i]['retweet_count']
    tw.at[indice, 'favorite_count_JSON'] = data['tweets'][i]['favorite_count']
tw.head(4)
# Check for problems
tw[tw.favorite_count != tw.favorite_count_JSON][0:4]
with open('tweet_json.txt', 'w') as outfile:
    json.dump(data, outfile)
tw.to_csv("tw_gathered.csv", sep = ",")
# api.get_status(tw.tweet_id[1])._json['favorite_count']
# api.get_status(tw.tweet_id[1]).favorite_count
# js = api.get_status(tw.tweet_id[1])._json
# js['favorite_count']
images = requests.get("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv")
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
images = StringIO(images.text)
img = pd.read_csv(images, sep="\t")
# Visual inspection
img.head()
img.to_csv("image_predictions.tsv", sep = "\t")
-- Added on a later iteration, while cleaning:
print("Length of images file: " + str(len(img)))
print("Length of twitter file: " + str(len(tw)))
tw.info()
tw.describe()
img.info()
img.describe()
print('Total number of rows with error when querying API: ' +
str(len(tw[tw.errors == "<class 'tweepy.error.TweepError'>"])))
tw[tw.errors == "<class 'tweepy.error.TweepError'>"]
#no_stage = tw[(tw.doggo== "None") & (tw.pupper== "None") & (tw.puppo== "None")]
for i in [545, 1779, 1636]:
    print(tw.text[i])
    print(tw.doggo[i] + ' ' + tw.pupper[i] + ' ' + tw.puppo[i])
    print('\n')
i = 460
print(tw.text[i])
print('\n')
print(tw.doggo[i] + ' ' + tw.pupper[i] + ' ' + tw.puppo[i])
rand_indices = np.random.randint(0,tw.shape[0],15)
tw.name[rand_indices]
# tw.name
tw.columns
img.p1[0:10]
print("There are " + str(sum(img.p1_dog == False))+ " images not of a dog.")
img.p1_dog[0:10]
img.info()
print(tw.doggo.unique())
print(tw.pupper.unique())
print(tw.puppo.unique())
tw.timestamp[0:5]
img.head(5)
Quality
Tidiness
It is important to keep a backup of the original data and to make changes on a copy.
tw = pd.read_csv("tw_gathered.csv")
img = pd.read_csv("image_predictions.tsv", sep ="\t")
# Making copies of original datasets
tw_clean = tw.copy()
img_clean = img.copy()
Issue: the dog stage is spread across three columns (doggo, pupper, puppo), which is untidy.
Define action: melt the three stage columns into a single stage column, keeping track of entries with no stage.
# Code
# Cases in which stage has not been detected
tw_clean['no_stage'] = 'None'
tw_clean.loc[(tw_clean.puppo == 'None') &
(tw_clean.doggo == 'None') &
(tw_clean.pupper == 'None'), 'no_stage'] = '-'
# Make column stage that contains the contents of columns doggo, pupper, puppo.
cols = list(tw_clean.columns)
cols.remove('doggo')
cols.remove('pupper')
cols.remove('puppo')
cols.remove('no_stage')
tw_clean = pd.melt(tw_clean, id_vars = cols,
var_name = 'Dog_stage', value_name = 'Stage_value')
tw_clean = tw_clean[tw_clean.Stage_value != 'None']
len(tw_clean)
There are 13 entries in which the dog has been given two stages, as can be seen by checking for duplicated tweet_id. In the code below, the dataframe dupls contains these cases.
dupls = tw_clean[tw_clean.duplicated('tweet_id')]
print(len(dupls))
For such a small number, I checked them manually and found that most contain either two dogs or one dog with two stages.
The texts for all of these cases are displayed below.
dupls = dupls.reset_index(drop=True)
dupl_tweet_id = dupls.tweet_id.unique()
for i in range(0, len(dupls)):
    print(tw_clean[tw_clean.tweet_id == dupls.tweet_id[i]].reset_index(drop=True).text[0])
    print('\n')
So they get their own label (dropping the duplicates first).
tw_clean = tw_clean.drop_duplicates(subset='tweet_id')
tw_clean.loc[tw_clean.text.str.contains('pupper.*doggo|doggo.*pupper', case=False),
             'Dog_stage'] = 'pupper-doggo'
tw_clean.loc[tw_clean.text.str.contains('puppo.*doggo|doggo.*puppo', case=False),
             'Dog_stage'] = 'puppo-doggo'
tw_clean.loc[tw_clean.text.str.contains('pupper.*puppo|puppo.*pupper', case=False),
             'Dog_stage'] = 'pupper-puppo'
Importantly, there are no cases in which the three stages appear together.
len(tw_clean[tw_clean.text.str.contains('pupper.*doggo.*puppo', case = False)])
# Dropping column of original stage values, and resetting indices of tw_clean table
tw_clean = tw_clean.drop('Stage_value', axis=1)
tw_clean = tw_clean.reset_index(drop=True)
# Test
print(tw_clean.Dog_stage.unique())
print(tw_clean.columns)
Issue: the timestamp column mixes date and time in a single string.
Define action: split the timestamp into year, month, day, hour, minute, and second columns, then drop the original column.
tw_clean.timestamp[0]
# Day, month, year
date = tw_clean.timestamp.str.extract(r'(\d{4}-\d{2}-\d{2})', expand=True)
tw_clean[['year', 'month', 'day']] = date[0].str.split('-', expand=True)
# Hour, minute, second
time = tw_clean.timestamp.str.extract(r'(\d{2}:\d{2}:\d{2})', expand=True)
time
tw_clean[['hour', 'minute', 'second']] = time[0].str.split(':', expand=True)
# Dropping timestamp to avoid duplicate data
tw_clean = tw_clean.drop('timestamp', axis=1)
# Test
#tw_clean.info()
tw_clean.head(3)
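As a side note, the same split could be sketched with pandas' own datetime parsing instead of regex extraction. This is only an alternative sketch, and the example timestamp string is an assumption about the archive's format:

```python
import pandas as pd

# Minimal sketch (not the method used above): parse the timestamp column
# with pd.to_datetime and read components off the .dt accessor.
# The example string below is an assumed sample, not taken from the data.
ts = pd.Series(['2017-08-01 16:23:56 +0000'])
parsed = pd.to_datetime(ts)
print(parsed.dt.year[0], parsed.dt.month[0], parsed.dt.day[0])      # 2017 8 1
print(parsed.dt.hour[0], parsed.dt.minute[0], parsed.dt.second[0])  # 16 23 56
```

This avoids hand-written regexes and yields integer components directly, at the cost of a timezone-aware dtype.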
Issue: the image predictions table carries the second and third predictions and related columns, which are not needed here.
Define action: keep only the columns up to the first prediction and drop the rest.
# Code
img_clean = img.copy()
img_clean.drop(img_clean.columns[7:], axis=1, inplace=True)
# Test
img_clean.head(5)
Issue: some tweets could not be retrieved (for example, tweets later deleted by their users), leaving rows with an API error.
Define action: drop the rows whose errors column is not empty.
# Code
print(len(tw_clean))
print(sum(~tw_clean.errors.isnull()))
indices = tw_clean.index[~tw_clean.errors.isnull()].tolist()
tw_clean.drop(indices, inplace=True)
# Test
print(len(tw_clean))
print(sum(~tw_clean.errors.isnull()))
Issues: the original dataset captured dog stages by exact substring only, missing plurals and compound words, and some stage values were incorrectly captured.
Define action: re-extract the dog stage from the tweet text into a new column and compare it against the original one.
# Code
# Two values (either from two dogs, or one dog with two qualifications)
tw_clean.loc[tw_clean.text.str.contains('pupper.*doggo|doggo.*pupper', case=False),
             'Dog_stage_new'] = 'pupper-doggo'
tw_clean.loc[tw_clean.text.str.contains('puppo.*doggo|doggo.*puppo', case=False),
             'Dog_stage_new'] = 'puppo-doggo'
tw_clean.loc[tw_clean.text.str.contains('pupper.*puppo|puppo.*pupper', case=False),
             'Dog_stage_new'] = 'pupper-puppo'
# Single value
tw_clean.loc[~tw_clean.text.str.contains('doggo|puppo', case=False) &
             tw_clean.text.str.contains('pupper', case=False),
             'Dog_stage_new'] = 'pupper'
tw_clean.loc[~tw_clean.text.str.contains('doggo|pupper', case=False) &
             tw_clean.text.str.contains('puppo', case=False),
             'Dog_stage_new'] = 'puppo'
tw_clean.loc[~tw_clean.text.str.contains('pupper|puppo', case=False) &
             tw_clean.text.str.contains('doggo', case=False),
             'Dog_stage_new'] = 'doggo'
# Agreement on those tweets for which there was no dog stage assigned
# (between the original stage column and the newly extracted one)
new = tw_clean.Dog_stage_new.isnull()
old = tw_clean.Dog_stage == 'no_stage'
tw_clean.loc[new & old, 'Dog_stage_new'] = 'no_stage'
Interestingly, I found that longer words containing a dog stage within them were not recognized in the original dataset.
The overwhelming majority of these are plurals: puppers, doggos, puppos. However, there are also cases such as pupperdoop, puppergeddon, pupporazi, apuppologized, puppoccino, puppertunity, and pupposes, shown in the strings below.
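This happens because `str.contains` does plain substring matching, so any word containing the stage matches. A minimal sketch of the difference between substring and whole-word matching, using made-up example strings rather than real tweets:

```python
import pandas as pd

# Hypothetical example texts, not taken from the dataset
texts = pd.Series([
    "This is a pupper",         # exact stage word
    "So many puppers",          # plural still contains 'pupper'
    "Total puppergeddon here",  # compound word containing 'pupper'
])

# Default substring matching: all three match
print(texts.str.contains('pupper', case=False).tolist())
# [True, True, True]

# Whole-word matching with \b word boundaries: only the exact form matches
print(texts.str.contains(r'\bpupper\b', case=False).tolist())
# [True, False, False]
```

Here the substring behavior is actually what we want, since it is exactly what lets us recover the plurals and compounds the original dataset missed.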
# Texts of tweets containing dog stages missing in original dataset.
for i in tw_clean[~(tw_clean.Dog_stage == tw_clean.Dog_stage_new)].index:
    print(tw_clean.text[i])
    print('\n')
# Do away with original values and keep the more thorough new dog stage column
tw_clean = tw_clean.drop('Dog_stage', axis=1)
# Remove "new" from dog stage column
tw_clean = tw_clean.rename(columns = {'Dog_stage_new':'dog_stage'})
sample_size = 5
for k in tw_clean.dog_stage.unique():
    print('\n')
    print(k)
    print('\n')
    s = min(sample_size, len(tw_clean[tw_clean.dog_stage == k]))
    for i in tw_clean[tw_clean.dog_stage == k].sample(s).index:
        print(tw_clean.text[i])
        print("\n")
Issue: some dog names were incorrectly captured from the text (lowercase words such as "a", "the", or "very").
Define action: replace these false names with "None" and fix the remaining incorrect cases manually.
# Code
# Reset indices, changed after modifications above
tw_clean = tw_clean.reset_index(drop=True)
# Get all unique names that do not start with upper case and are not "None"
not_names = tw_clean[~tw_clean.name.str.istitle() & ~tw_clean.name.str.isupper()]
not_names.name.unique()
# Create a set of words that are not names and filter them out of the 'name' column.
false_names = ['just', 'one', 'his', 'a', 'mad', 'actually', 'all', 'the', 'such',
'quite', 'not', 'incredibly','an', 'very', 'my',
'getting', 'this', 'unacceptable', 'old', 'infuriating',
'by', 'officially', 'life', 'light', 'space']
tw_clean.loc[tw_clean.name.isin(false_names), 'name'] = 'None'
# Of these 2 full uppercase cases, one has to be fixed manually, "O'Malley" instead of "O"
tw_clean[tw_clean.name.str.isupper()]
# Fixing uppercase name
tw_clean.at[1008, 'name'] = "O'Malley"
# Test
print(sum(tw_clean.name.isin(false_names)))
tw_clean.name[1008]
Issue: whether the dog is doing a "blep" is potentially interesting information not captured in any column.
Define action: add a boolean blep column extracted from the tweet text.
# Code
# Initialize column blep
tw_clean['blep'] = False
# Assign values
tw_clean.loc[tw_clean.text.str.contains('blep', case=False), 'blep'] = True
There are only four cases of "bleps".
tw_clean[tw_clean.text.str.contains('blep', case = False)]
# Test
print('BLEP')
for i in tw_clean[tw_clean.text.str.contains('blep', case=False)].index:
    print(tw_clean.text[i])
    print('\n')
print('NO BLEP')
for i in tw_clean[~tw_clean.text.str.contains('blep', case=False)].sample(5).index:
    print(tw_clean.text[i])
    print('\n')
Issue: breed names in the p1 column mix lowercase and title case.
Design action: convert all p1 values to title case.
# Code
img_clean.p1 = img_clean.p1.str.title()
# Test
sum(~img_clean.p1.str.istitle())
Issue: some images are not of dogs (p1_dog is False).
Design action: drop the rows whose top prediction is not a dog.
# Code
indices = img_clean.index[img_clean.p1_dog == False].tolist()
img_clean.drop(indices, inplace=True)
# Test
img_clean.p1_dog.unique()
Issue: the tweet data and the image predictions live in two separate tables of different lengths.
Design action: merge the two tables on tweet_id, keeping all tweets.
tw_clean = tw_clean.merge(img_clean, how='left', on='tweet_id')
# https://stackoverflow.com/questions/33086881/merge-two-python-pandas-data-frames-of-different-length-but-keep-all-rows-in-out
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
# Test
tw_clean.sample(4)
tw_clean.to_csv("twitter_archive_master.csv", sep = ",")
A word cloud of the most frequently found words in the body of the tweets.
# Code from https://amueller.github.io/word_cloud/auto_examples/masked.html
# Slightly modified to be applied to my data
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
# All tweeted texts combined into a single text
texts = tw_clean.text.str.cat(sep=' ')
texts = texts.replace('https://t.co/','')
# Read the mask image
# (Taken from http://www.stencilry.org/stencils/animals/dog/dog+3.gif )
dog_mask = np.array(Image.open("dog_mask.png"))
stopwords = set(STOPWORDS)
stopwords.add("@dog_rates")
wc = WordCloud(background_color="white", max_words=2000, mask=dog_mask,
stopwords=stopwords)
# Generate word cloud
wc.generate(texts)
# Store to file
wc.to_file("dog_wordcloud.png")
# Show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
plt.show()
#plt.imshow(dog_mask, cmap=plt.cm.gray, interpolation='bilinear')
#plt.axis("off")
counts = tw_clean.p1.value_counts()
names = list(counts.axes[0])
names.reverse()
values = list(counts.values)
values.reverse()
fig = plt.figure(figsize=(20,50))
ax = fig.add_subplot(111)
yvals = range(len(names))
ax.barh(yvals, values, align='center', alpha=0.4)
ax.tick_params(axis='both', labelsize=18)
plt.yticks(yvals,names)
plt.title('Count of each breed from photos with dogs in them', fontsize = 24)
plt.tight_layout()
plt.savefig('breeds_counts.png', bbox_inches='tight')
plt.show()
breed_rt = tw_clean.groupby('p1').agg({'retweet_count': 'sum'})
breed_rt = breed_rt.retweet_count.sort_values(ascending=False)
names = list(breed_rt.axes[0])
names.reverse()
values = list(breed_rt.values)
values.reverse()
fig = plt.figure(figsize=(20,50))
ax = fig.add_subplot(111)
yvals = range(len(names))
ax.barh(yvals, values, align='center', alpha=0.4)
ax.tick_params(axis='both', labelsize=18)
plt.yticks(yvals,names)
plt.title('Retweets by breed from tweets with photos with dogs in them', fontsize = 24)
plt.tight_layout()
plt.savefig('breeds_retweets.png', bbox_inches='tight')
plt.show()
counts = tw_clean.dog_stage.value_counts()
names = list(counts.axes[0])
names.reverse()
values = list(counts.values)
values.reverse()
fig = plt.figure(figsize=(20,5))
ax = fig.add_subplot(111)
yvals = range(len(names))
ax.barh(yvals, values, align='center', alpha=0.4)
ax.tick_params(axis='both', labelsize=18)
plt.yticks(yvals,names)
plt.title('Count of each dog stage', fontsize = 24)
plt.tight_layout()
plt.savefig('stages.png', bbox_inches='tight')
plt.show()
# To add jitter to scatterplot
jitter = np.random.uniform(low = -.99, high = .99, size = len(tw_clean))
tw_jitter = tw_clean.copy()
tw_jitter.retweet_count = tw_jitter.retweet_count + jitter
tw_jitter.rating_denominator = tw_jitter.rating_denominator + jitter
fig = tw_jitter.plot.scatter('retweet_count', 'rating_denominator',alpha=0.1, s=3)
fig.axes.set_ylim(9,11)
plt.title('Ratings by retweet count', fontsize = 10)
plt.savefig('ratings_retweets.png', bbox_inches='tight')
plt.show()
The top five most popular breeds are Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, and Pug. The top one, Golden Retriever, appears almost twice as often as fourth place and three times as often as fifth.
Unsurprisingly, the breeds whose photos were most retweeted were also, in order, Golden Retriever, Labrador Retriever, Pembroke, and Chihuahua, but fifth place goes to Samoyed (seventh in number of appearances). Pugs are popular but do not get retweeted as much.
Overall, however, breeds that have high counts are also generally highly retweeted.
The most common stage mentioned is “pupper”, three times more frequent than the second most mentioned stage, “doggo”, which is in turn twice as frequent as the least common stage, “puppo”.
Double stages, or two dogs, are much less common. Consistent with the single stage popularity ranking, of the combinations, “pupper-doggo” is the most common. Interestingly, there are no “pupper-puppo” duos.
Somewhat unexpectedly, there is no relationship between rating and retweet count. This does make sense in the context of how the ratings work: they follow no consistent logic.
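To put a number on the "no relationship" claim, one could compute a correlation coefficient between the two columns. A minimal sketch with made-up values (in the project this would use the retweet counts and ratings from tw_clean):

```python
import numpy as np

# Illustrative values only, not the project's data: ratings hover around 10
# regardless of how widely a tweet is retweeted.
retweets = np.array([100, 5000, 250, 12000, 800])
ratings = np.array([10, 10, 11, 10, 9])

# Pearson correlation: values near 0 indicate no linear relationship
r = np.corrcoef(retweets, ratings)[0, 1]
print(round(r, 3))
```

A coefficient near zero would back up, quantitatively, what the jittered scatter plot shows visually.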