API querying and data wrangling in Python

In the present project, I gather, assess, wrangle, and clean a dataset assembled from data from the channel "We rate dogs". I use Twitter API to query the text of the tweets and gather missing columns from the base dataset, and request a dataset with the associated images. I use Python tools to assess any problems with the data. The result is a tidy and useful dataset of good quality data.

The data: Users submit pictures and a short text to the group "We rate dogs", and have the pet rated. Most of the time, dogs are classified into stages known to that specific community: dogo, puppo, pupper. Other comments generally using internal lingo, and the names of the dogs are usually given as well.

Steps in this project:

1 - Gathering the data: load existing dataset, use Twitter API to query tweets, use information extracted in JSON, request the images dataset from a website and save it as a local dataset.

2 - Assess data: assess data sources for "quality" and "tidiness". Issues include, but are not limited to: different sizes of merged datasets, mismatch of text for dog stage between base dataset entry and tweet due to problem capturing string, incorrectly captured dog names, tweets later deleted by users, among others.

3 - Cleaning data: establish and execute an action for each of the quality and tidiness problems found.

4 - Plots and comments: a first approach to analyzing the data with preliminary plots. I found that for this channel, the top five most popular breeds are Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, and Pug, with the top one, Golden Retriever, appearing almost double and three times more than fourth and fifth places.The breeds whose photos and tweet were most retweeted were also, in order, Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, but in fifth place is Samoyed (seventh in number of appearances). Pugs are popular but do not get as retweeted.

In [3]:
import requests
import numpy as np
import pandas as pd
import json
import tweepy
import sys

1 - Gathering the data

In [2]:
# Load base dataset
tw = pd.read_csv("twitter-archive-enhanced.csv")
tw.head()
Out[2]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None

Querying Twitter Data using Tweepy and saving JSON

In [3]:
# Setting up API

consumer_key = 'my_consumer_key'
consumer_secret = 'my_consumer_secret'
access_token =  'my_access_token'
access_secret = 'my_access_secret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True) # http://docs.tweepy.org/en/v3.5.0/api.html
In [4]:
# Initializing dataframe columns

init_v = [-1 for i in range(len(tw))]
tw['favorite_count'] = init_v
tw['retweet_count'] = init_v

errors = ['' for i in range(len(tw))]
tw['errors'] = errors
tw['json_tw'] = errors
In [5]:
# Querying Twitter and saving JSON

data = {}
data['tweets'] = []    

for i in range(0,len(tw)):
    try:
        # These first two lines are a way of directly accessing the values
        # without need for JSON
        fav_c = api.get_status(tw.tweet_id[i]).favorite_count
        tw.set_value(i,'favorite_count', fav_c)
        
        rtw_c = api.get_status(tw.tweet_id[i]).retweet_count
        tw.set_value(i,'retweet_count', rtw_c)
        
        # Here I'm querying the JSON strings for the sake of practicing using JSON, 
        # there are other possibilities of getting the tweets'information
        json_tw = js = api.get_status(tw.tweet_id[i])._json        
        data['tweets'].append(json_tw)

        # Saving JSON string to dataframe
        tw.set_value(i,'json_tw', json_tw)
 
    except:
        e = sys.exc_info()[0]
        tw.set_value(i, 'errors', str(e) )
In [10]:
# Visually inspecting dataset with new columns
tw.head(4)
Out[10]:
Unnamed: 0 Unnamed: 0.1 tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id ... floofer pupper puppo favorite_count retweet_count errors json_tw favorite_count_JSON retweet_count_JSON tweet_id_JSON
0 0 0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN ... None None None 39294 8773 NaN {'created_at': 'Tue Aug 01 16:23:56 +0000 2017... 39294 8773 -1
1 1 1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN ... None None None 33649 6426 NaN {'created_at': 'Tue Aug 01 00:17:27 +0000 2017... 33649 6426 -1
2 2 2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN ... None None None 25351 4268 NaN {'created_at': 'Mon Jul 31 00:18:03 +0000 2017... 25351 4268 -1
3 3 3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN ... None None None 42668 8858 NaN {'created_at': 'Sun Jul 30 15:58:51 +0000 2017... 42668 8858 -1

4 rows × 26 columns

Reading and writing Twitter JSON

In [164]:
# Initializing dataframe columns

init_v = [-1 for i in range(len(tw))]
tw['favorite_count_JSON'] = init_v
tw['retweet_count_JSON'] = init_v
In [165]:
# Read each line
for i in range(0,len(data['tweets'])):
    indice = tw.index[tw.tweet_id == data['tweets'][i]['id']].tolist()[0]
    
    tw.set_value(indice, 'retweet_count_JSON', data['tweets'][i]['retweet_count'])
    tw.set_value(indice, 'favorite_count_JSON', data['tweets'][i]['favorite_count'])
    
In [11]:
tw.head(4)
Out[11]:
Unnamed: 0 Unnamed: 0.1 tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id ... floofer pupper puppo favorite_count retweet_count errors json_tw favorite_count_JSON retweet_count_JSON tweet_id_JSON
0 0 0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN ... None None None 39294 8773 NaN {'created_at': 'Tue Aug 01 16:23:56 +0000 2017... 39294 8773 -1
1 1 1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN ... None None None 33649 6426 NaN {'created_at': 'Tue Aug 01 00:17:27 +0000 2017... 33649 6426 -1
2 2 2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN ... None None None 25351 4268 NaN {'created_at': 'Mon Jul 31 00:18:03 +0000 2017... 25351 4268 -1
3 3 3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN ... None None None 42668 8858 NaN {'created_at': 'Sun Jul 30 15:58:51 +0000 2017... 42668 8858 -1

4 rows × 26 columns

In [12]:
# Check for problems
tw[tw.favorite_count != tw.favorite_count_JSON][0:4]
Out[12]:
Unnamed: 0 Unnamed: 0.1 tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id ... floofer pupper puppo favorite_count retweet_count errors json_tw favorite_count_JSON retweet_count_JSON tweet_id_JSON
85 85 85 876120275196170240 NaN NaN 2017-06-17 16:52:05 +0000 <a href="http://twitter.com/download/iphone" r... Meet Venti, a seemingly caffeinated puppoccino... NaN NaN ... None None None 28271 4836 NaN {'created_at': 'Sat Jun 17 16:52:05 +0000 2017... 28273 4836 -1
106 106 106 871879754684805121 NaN NaN 2017-06-06 00:01:46 +0000 <a href="http://twitter.com/download/iphone" r... Say hello to Lassie. She's celebrating #PrideM... NaN NaN ... None None None 38789 11704 NaN {'created_at': 'Tue Jun 06 00:01:46 +0000 2017... 38788 11704 -1
172 172 172 858843525470990336 NaN NaN 2017-05-01 00:40:27 +0000 <a href="http://twitter.com/download/iphone" r... I have stumbled puppon a doggo painting party.... NaN NaN ... None None None 16164 3715 NaN {'created_at': 'Mon May 01 00:40:27 +0000 2017... 16165 3715 -1
175 175 175 857989990357356544 NaN NaN 2017-04-28 16:08:49 +0000 <a href="http://twitter.com/download/iphone" r... This is Rosie. She was just informed of the wa... NaN NaN ... None None None 16807 2770 NaN {'created_at': 'Fri Apr 28 16:08:49 +0000 2017... 16806 2770 -1

4 rows × 26 columns

Saving JSON file and modified tw file

In [172]:
with open('tweet_json.txt', 'w') as outfile:  
    json.dump(data, outfile)        
In [6]:
tw.to_csv("tw_gathered.csv", sep = ",")
Other possibilities for getting at info of tweets
In [ ]:
# api.get_status(tw.tweet_id[1])._json['favorite_count']

# api.get_status(tw.tweet_id[1]).favorite_count

# js = api.get_status(tw.tweet_id[1])._json
# js['favorite_count']

Loading images

In [20]:
images = requests.get("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv")

if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

images=StringIO(images.text)

img = pd.read_csv(images, sep="\t")
In [13]:
# Visual inspection
img.head()
Out[13]:
Unnamed: 0 tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

Saving images

In [5]:
img.to_csv("image_predictions.tsv", sep = "\t")

2 - Assessing Data

Quality

Both datasets

  • Length of images (img) and of twitter data (tw) are different.

Twitter dataset

  • There were eight tweets removed by their users.
  • Some dogs that were given a stage by the tweet, do not have a dog stage in the table.
  • One dog's stage does not coincide with the one given in the text of the tweet.
  • Some dog names were incorrectly captured: 'such', 'an', 'a', 'the', 'quite', among others.
  • Blep column missing.

-- Added on iteration while cleaning:

  • Incorrectly entered name: tw.name[775] as "O" when it should be "O'Malley"

Images dataset

  • Some dog breeds are capitalized while others aren't.
  • Many images do not belong to a dog.

Tidiness

Twitter dataset

  • Dog stage is divided into separate columns when it is a single variable: dog stage.
  • Time stamp column contains six variables: day, month, year, hour, minute, second. Perhaps add time zone.

Images dataset

  • Dog breed information is contained in several columns. We know this from the explanation of the design of the dataset.

Quality

In [26]:
print("Length of images file: " + str(len(img)))
print("Length of twitter file: " + str(len(tw)))
Length of images file: 2075
Length of twitter file: 2356
In [15]:
tw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 26 columns):
Unnamed: 0                    2356 non-null int64
Unnamed: 0.1                  2356 non-null int64
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
favorite_count                2356 non-null int64
retweet_count                 2356 non-null int64
errors                        8 non-null object
json_tw                       2348 non-null object
favorite_count_JSON           2356 non-null int64
retweet_count_JSON            2356 non-null int64
tweet_id_JSON                 2356 non-null int64
dtypes: float64(4), int64(10), object(12)
memory usage: 478.6+ KB
In [17]:
tw.describe()
Out[17]:
Unnamed: 0 tweet_id img_num p1_conf p2_conf p3_conf
count 2075.000000 2.075000e+03 2075.000000 2075.000000 2.075000e+03 2.075000e+03
mean 1037.000000 7.384514e+17 1.203855 0.594548 1.345886e-01 6.032417e-02
std 599.145224 6.785203e+16 0.561875 0.271174 1.006657e-01 5.090593e-02
min 0.000000 6.660209e+17 1.000000 0.044333 1.011300e-08 1.740170e-10
25% 518.500000 6.764835e+17 1.000000 0.364412 5.388625e-02 1.622240e-02
50% 1037.000000 7.119988e+17 1.000000 0.588230 1.181810e-01 4.944380e-02
75% 1555.500000 7.932034e+17 1.000000 0.843855 1.955655e-01 9.180755e-02
max 2074.000000 8.924206e+17 4.000000 1.000000 4.880140e-01 2.734190e-01
In [18]:
img.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 13 columns):
Unnamed: 0    2075 non-null int64
tweet_id      2075 non-null int64
jpg_url       2075 non-null object
img_num       2075 non-null int64
p1            2075 non-null object
p1_conf       2075 non-null float64
p1_dog        2075 non-null bool
p2            2075 non-null object
p2_conf       2075 non-null float64
p2_dog        2075 non-null bool
p3            2075 non-null object
p3_conf       2075 non-null float64
p3_dog        2075 non-null bool
dtypes: bool(3), float64(3), int64(3), object(4)
memory usage: 168.3+ KB
In [21]:
img.describe()
Out[21]:
Unnamed: 0 tweet_id img_num p1_conf p2_conf p3_conf
count 2075.000000 2.075000e+03 2075.000000 2075.000000 2.075000e+03 2.075000e+03
mean 1037.000000 7.384514e+17 1.203855 0.594548 1.345886e-01 6.032417e-02
std 599.145224 6.785203e+16 0.561875 0.271174 1.006657e-01 5.090593e-02
min 0.000000 6.660209e+17 1.000000 0.044333 1.011300e-08 1.740170e-10
25% 518.500000 6.764835e+17 1.000000 0.364412 5.388625e-02 1.622240e-02
50% 1037.000000 7.119988e+17 1.000000 0.588230 1.181810e-01 4.944380e-02
75% 1555.500000 7.932034e+17 1.000000 0.843855 1.955655e-01 9.180755e-02
max 2074.000000 8.924206e+17 4.000000 1.000000 4.880140e-01 2.734190e-01

Twitter dataset

  • There were eight tweets removed by their users.
In [55]:
print('Total number of rows with error when querying API: ' + 
      str(len(tw[tw.errors == "<class 'tweepy.error.TweepError'>"])))
tw[tw.errors == "<class 'tweepy.error.TweepError'>"]
Total number of rows with error when querying API: 8
Out[55]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls ... floofer pupper puppo favorite_count retweet_count errors json_tw favorite_count_JSON retweet_count_JSON tweet_id_JSON
19 888202515573088257 NaN NaN 2017-07-21 01:02:36 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Canela. She attempted s... 8.874740e+17 4.196984e+09 2017-07-19 00:47:34 +0000 https://twitter.com/dog_rates/status/887473957... ... None None None -1 -1 <class 'tweepy.error.TweepError'> -1 -1 -1
95 873697596434513921 NaN NaN 2017-06-11 00:25:14 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Walter. He won't start ... 8.688804e+17 4.196984e+09 2017-05-28 17:23:24 +0000 https://twitter.com/dog_rates/status/868880397... ... None None None -1 -1 <class 'tweepy.error.TweepError'> -1 -1 -1
118 869988702071779329 NaN NaN 2017-05-31 18:47:24 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: We only rate dogs. This is quit... 8.591970e+17 4.196984e+09 2017-05-02 00:04:57 +0000 https://twitter.com/dog_rates/status/859196978... ... None None None -1 -1 <class 'tweepy.error.TweepError'> -1 -1 -1
155 861769973181624320 NaN NaN 2017-05-09 02:29:07 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: "Good afternoon class today we'... 8.066291e+17 4.196984e+09 2016-12-07 22:38:52 +0000 https://twitter.com/dog_rates/status/806629075... ... None None None -1 -1 <class 'tweepy.error.TweepError'> -1 -1 -1
260 842892208864923648 NaN NaN 2017-03-18 00:15:37 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Stephan. He just wants ... 8.071068e+17 4.196984e+09 2016-12-09 06:17:20 +0000 https://twitter.com/dog_rates/status/807106840... ... None None None -1 -1 <class 'tweepy.error.TweepError'> -1 -1 -1
566 802247111496568832 NaN NaN 2016-11-25 20:26:31 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: Everybody drop what you're doin... 7.790561e+17 4.196984e+09 2016-09-22 20:33:42 +0000 https://twitter.com/dog_rates/status/779056095... ... None None None -1 -1 <class 'tweepy.error.TweepError'> -1 -1 -1
784 775096608509886464 NaN NaN 2016-09-11 22:20:06 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: After so many requests, this is... 7.403732e+17 4.196984e+09 2016-06-08 02:41:38 +0000 https://twitter.com/dog_rates/status/740373189... ... None None None -1 -1 <class 'tweepy.error.TweepError'> -1 -1 -1
1421 698195409219559425 NaN NaN 2016-02-12 17:22:12 +0000 <a href="http://twitter.com/download/iphone" r... Meet Beau &amp; Wilbur. Wilbur stole Beau's be... NaN NaN NaN https://twitter.com/dog_rates/status/698195409... ... None None None 18234 -1 <class 'tweepy.error.TweepError'> -1 -1 -1

8 rows × 24 columns

  • Some dogs that were given a stage by the tweet, do not have a dog stage in the table.
In [125]:
#no_stage = tw[(tw.doggo== "None") & (tw.pupper== "None") & (tw.puppo== "None")]
for i in [545 ,1779 ,1636]:
    print(tw.text[i])
    print(tw.doggo[i] + ' ' + tw.pupper[i] + ' ' + tw.puppo[i])
    print('\n')
This is Duke. He is not a fan of the pupporazzi. 12/10 https://t.co/SgpBVYIL18
None None None


IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
None None None


Gang of fearless hoofed puppers here. Straight savages. Elevated for extra terror. Front one has killed before 6/10s https://t.co/jkCb25OWfh
None None None


  • One dog's stage does not coincide with the one given in the text of the tweet.
In [95]:
i = 460
print(tw.text[i])
print('\n')
print(tw.doggo[i] + ' ' + tw.pupper[i] + ' ' + tw.puppo[i])
This is Dido. She's playing the lead role in "Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple." 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7


doggo pupper None
  • Some dog names were incorrectly captured: 'such', 'an', 'a', 'the', 'quite', among others.
In [73]:
rand_indices = np.random.randint(0,tw.shape[0],15)
tw.name[rand_indices]
# tw.name
Out[73]:
504        Bauer
403         Nala
535         Cali
117     Clifford
139        Sammy
1730       Bruce
32          None
144        Albus
660        Mabel
168         None
981         Finn
1457        just
56             a
2125           a
1686        None
Name: name, dtype: object
  • Blep column missing.
In [128]:
tw.columns
Out[128]:
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
       'favorite_count', 'retweet_count', 'errors', 'json_tw',
       'favorite_count_JSON', 'retweet_count_JSON', 'tweet_id_JSON'],
      dtype='object')

Images dataset

  • Some dog breeds are capitalized while others aren't.
In [76]:
img.p1[0:10]
Out[76]:
0    Welsh_springer_spaniel
1                   redbone
2           German_shepherd
3       Rhodesian_ridgeback
4        miniature_pinscher
5      Bernese_mountain_dog
6                box_turtle
7                      chow
8             shopping_cart
9          miniature_poodle
Name: p1, dtype: object
  • Many images do not belong to a dog.
In [77]:
print("There are " + str(sum(img.p1_dog == False))+ " images not of a dog.")
img.p1_dog[0:10]
There are 543 images not of a dog.
Out[77]:
0     True
1     True
2     True
3     True
4     True
5     True
6    False
7     True
8    False
9     True
Name: p1_dog, dtype: bool
  • Dog breed is not a categorical type.
In [136]:
img.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

Tidiness

In [177]:
print(tw.doggo.unique())
print(tw.pupper.unique())
print(tw.puppo.unique())
['None' 'doggo']
['None' 'pupper']
['None' 'puppo']
  • Time stamp column contains six variables: day, month, year, hour, minute, second. Perhaps add time zone.
In [78]:
tw.timestamp[0:5]
Out[78]:
0    2017-08-01 16:23:56 +0000
1    2017-08-01 00:17:27 +0000
2    2017-07-31 00:18:03 +0000
3    2017-07-30 15:58:51 +0000
4    2017-07-29 16:00:24 +0000
Name: timestamp, dtype: object

Images dataset

  • Dog breed information is contained in several columns. We know this from the explanation of the design of the dataset.
In [79]:
img.head(5)
Out[79]:
Unnamed: 0 tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

3 - Cleaning Data

Plan and steps

Twitter dataset

Quality

  • Find indices of the seven twits were removed by their users, delete the corresponding rows. sum(tw.favorite_count == -1)
  • Search the text for a word in the set of ['doggo', 'pupper', 'puppo'] and fill in the stage for the dogs that don't have a dog stage. If the dog has a stage, and it doesn't coincide, save the index and check the reason(s). Adjust accordingly.
  • Initialize blep column as boolean False. Search the text for 'blep', if it appears, enter True in the corresponding column.
  • Create a list of all English determiners, match the rows that have any of them as a name, and remove them from names. Check visually for any noncapitalized names left if list is small.

Tidiness

  • Make column stage that contains the contents of columns doggo, pupper, puppo.
  • Divide time stamp into day,month,year,hour.

Images dataset

Quality

  • Capitalize all dog breeds.
  • Eliminate rows for which dog == False.
  • Make all values of column breed into categorical.

Tidiness

  • Keep only highest ranked dog breed columns (p1) and remove the rest.

Both

  • Merge twitter dataset with images dataset based on tweet_id. Images dataset tweet_id values are a subset of twitter dataset tweet_id values.

Making copies

It is important to have a backup of the original data, and effect changes on a copy.

In [84]:
tw = pd.read_csv("tw_gathered.csv")
img = pd.read_csv("image_predictions.tsv", sep ="\t")

# Making copies of original datasets
tw_clean = tw.copy()
img_clean = img.copy()

Tidiness

Twitter dataset

Issue:

  • Dog stage is divided into separate columns when it is a single variable: dog stage.

Define action:

  • Make column stage that contains the contents of columns doggo, pupper, puppo. It also contains an empty value for when the dog stage has not been determined.
In [85]:
# Code
# Cases in which stage has not been detected
tw_clean['no_stage'] = 'None'
tw_clean.loc[(tw_clean.puppo == 'None') & 
               (tw_clean.doggo == 'None') & 
               (tw_clean.pupper == 'None'), 'no_stage'] = '-'
In [86]:
# Make column stage that contains the contents of columns doggo, pupper, puppo.
cols = list(tw_clean.columns)
cols.remove('doggo')
cols.remove('pupper')
cols.remove('puppo')
cols.remove('no_stage')

tw_clean = pd.melt(tw_clean, id_vars = cols,
                var_name = 'Dog_stage', value_name = 'Stage_value')

tw_clean = tw_clean[tw_clean.Stage_value != 'None']
len(tw_clean)
Out[86]:
2369

There are 13 entries for which the dog has been given 2 stages, as can be seen by checking duplicate tweet_id. In the code below, the dataframe dupls contains these cases.

In [87]:
dupls = tw_clean[tw_clean.duplicated('tweet_id')]
print(len(dupls))
13

For such a small number, I manually checked them and found that most contain two dogs or a dog with two stages.

The texts for all of these cases is displayed below.

In [88]:
dupls = dupls.reset_index(drop=True)
dupl_tweet_id = dupls.tweet_id.unique()

for i in range(0,len(dupls)):
    print(tw_clean[tw_clean.tweet_id == dupls.tweet_id[i]].reset_index(drop=True).text[0])
    print('\n')
This is Dido. She's playing the lead role in "Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple." 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7


Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho


Like doggo, like pupper version 2. Both 11/10 https://t.co/9IxWAXFqze


This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj


This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd


Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u


RT @dog_rates: Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda


RT @dog_rates: This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC


Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll


Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8


This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC


Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda


Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel


So, they get their own label. (Drop duplicates first)

In [89]:
tw_clean = tw_clean.drop_duplicates(subset='tweet_id')

tw_clean = tw_clean.set_value(tw_clean.text.str.contains('pupper.*doggo|doggo.*pupper', case = False), 
                   'Dog_stage', 'pupper-doggo')
tw_clean = tw_clean.set_value(tw_clean.text.str.contains('puppo.*doggo|doggo.*puppo', case = False),
                   'Dog_stage', 'puppo-doggo')
tw_clean = tw_clean.set_value(tw_clean.text.str.contains('pupper.*puppo|puppo.*pupper', case = False), 
                   'Dog_stage', 'pupper-puppo')

Importantly, there are no cases in which the three stages appear together.

In [90]:
len(tw_clean[tw_clean.text.str.contains('pupper.*doggo.*puppo', case = False)])
Out[90]:
0
In [91]:
# Dropping column of original stage values, and resetting indices of tw_clean table
tw_clean = tw_clean.drop('Stage_value', axis=1)
tw_clean = tw_clean.reset_index(drop=True)
In [92]:
    # Test
    print(tw_clean.Dog_stage.unique())
    print(tw_clean.columns)
['doggo' 'puppo-doggo' 'pupper-doggo' 'pupper' 'puppo' 'no_stage']
Index(['Unnamed: 0', 'Unnamed: 0.1', 'tweet_id', 'in_reply_to_status_id',
       'in_reply_to_user_id', 'timestamp', 'source', 'text',
       'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'floofer', 'favorite_count',
       'retweet_count', 'errors', 'json_tw', 'favorite_count_JSON',
       'retweet_count_JSON', 'tweet_id_JSON', 'Dog_stage'],
      dtype='object')

Issue:

  • Time stamp column contains six variables: day, month, year, hour, minute, second. Perhaps add time zone.

Define action:

  • Use regular expresions to divide timestamp values into day, month, year, hour, minutes, and assign them to their respective columns.
In [93]:
tw_clean.timestamp[0]
Out[93]:
'2017-07-26 15:59:51 +0000'
In [94]:
# Day, month, year
date = tw_clean.timestamp.str.extract('(\d{4}[-]\d{2}[-]\d{2})', expand=True)
tw_clean['year'], tw_clean['month'], tw_clean['day'] = date[0].str.split('-',2).str
In [95]:
# Hour, minute, second
time = tw_clean.timestamp.str.extract('(\d{2}[:]\d{2}[:]\d{2})', expand=True)
time
tw_clean['hour'], tw_clean['minute'], tw_clean['second'] = time[0].str.split(':',2).str
In [96]:
# Dropping timestamp to avoid duplicate data
tw_clean = tw_clean.drop('timestamp', axis=1)
In [101]:
    # Test
    #tw_clean.info()
    tw_clean.head(3)   
Out[101]:
Unnamed: 0 Unnamed: 0.1 tweet_id in_reply_to_status_id in_reply_to_user_id source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp ... favorite_count_JSON retweet_count_JSON tweet_id_JSON Dog_stage year month day hour minute second
0 9 9 890240255349198849 NaN NaN <a href="http://twitter.com/download/iphone" r... This is Cassie. She is a college pup. Studying... NaN NaN NaN ... 32333 7614 -1 doggo 2017 07 26 15 59 51
1 43 43 884162670584377345 NaN NaN <a href="http://twitter.com/download/iphone" r... Meet Yogi. He doesn't have any important dog m... NaN NaN NaN ... 20637 3078 -1 doggo 2017 07 09 21 29 42
2 99 99 872967104147763200 NaN NaN <a href="http://twitter.com/download/iphone" r... Here's a very large dog. He has a date later. ... NaN NaN NaN ... 27805 5597 -1 doggo 2017 06 09 00 02 31

3 rows × 29 columns

Images dataset

Issue:

  • Dog breed information is contained in several columns. We know this from the explanation of the design of the dataset.

Define action:

  • Keep only highest ranked dog breed columns (p1) and remove the rest.
In [102]:
# Code
img_clean = img.copy()
img_clean.drop(img_clean.columns[7:], axis=1, inplace=True)
In [103]:
    # Test
    img_clean.head(5)
Out[103]:
Unnamed: 0 tweet_id jpg_url img_num p1 p1_conf p1_dog
0 0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True
1 1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True
2 2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True
3 3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True
4 4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True

Quality

Twitter dataset

Issue

  • There were eight tweets removed by their users.

Define action

  • Find indices of the seven twits were removed by their users, delete the corresponding rows.
In [104]:
# Code
print(len(tw_clean))
print(sum(~tw_clean.errors.isnull()))

indices = tw_clean.index[~tw_clean.errors.isnull()].tolist()
tw_clean.drop(tw_clean.index[[indices]],inplace=True)
2356
8
In [105]:
# Test
print(len(tw_clean))
print(sum(~tw_clean.errors.isnull()))
2348
0

Issues:

  • Some dogs that were given a stage by the tweet, do not have a dog stage in the table.
  • One dog's stage does not coincide with the one given in the text of the tweet.

Define action:

  • Search the text of each tweet for a word in the set of ['doggo', 'pupper', 'puppo', 'doggo|pupper', 'doggo|puppo', 'puppo|pupper']. Enter result in a new stage column. Check for any mismatches between the new stage column and the existing one; adjust accordingly.
In [106]:
# Code
# Two values (either from two dogs, or one dog with two qualifications)
tw_clean = tw_clean.set_value(tw_clean.text.str.contains('pupper.*doggo|doggo.*pupper', case = False), 
                   'Dog_stage_new', 'pupper-doggo')
tw_clean = tw_clean.set_value(tw_clean.text.str.contains('puppo.*doggo|doggo.*puppo', case = False), 
                   'Dog_stage_new', 'puppo-doggo')
tw_clean = tw_clean.set_value(tw_clean.text.str.contains('pupper.*puppo|puppo.*pupper', case = False),
                   'Dog_stage_new', 'pupper-puppo')

# Single value
tw_clean = tw_clean.set_value(~tw_clean.text.str.contains('doggo|puppo', case = False) & 
                   tw_clean.text.str.contains('pupper', case = False), 
                   'Dog_stage_new', 'pupper')
tw_clean = tw_clean.set_value(~tw_clean.text.str.contains('doggo|pupper', case = False) &
                   tw_clean.text.str.contains('puppo', case = False), 
                'Dog_stage_new', 'puppo')
tw_clean = tw_clean.set_value(~tw_clean.text.str.contains('pupper|puppo', case = False) &
                   tw_clean.text.str.contains('doggo', case = False), 
                   'Dog_stage_new', 'doggo')
In [107]:
# Agreement on those tweets for wchich there was no dog stage assigned 
# (between original stage, and my discovered)
new = tw_clean.Dog_stage_new.isnull()
old = tw_clean.Dog_stage == 'no_stage'
tw_clean = tw_clean.set_value(new & old, 'Dog_stage_new', 'no_stage')

Interestingly, I found that words that contain the dog stage within them, but that are longer, were not recognized in the original dataset.

The overwhelming case for this is plurals: puppers, doggos, puppos. However, there are also cases such as :pupperdoop, puppergeddon, pupporazi,apuppologized, puppoccino, puppertunity, pupposes; shown in the strings below.

In [108]:
# Texts of tweets containing dog stages missing in original dataset.
for i in tw_clean[~(tw_clean.Dog_stage == tw_clean.Dog_stage_new)].index:
    print(tw_clean.text[i])
    print('\n')
This is Gary. He couldn't miss this puppertunity for a selfie. Flawless focusing skills. 13/10 would boop intensely https://t.co/7CSWCl8I6s


I can say with the pupmost confidence that the doggos who assisted with this search are heroic as h*ck. 14/10 for all https://t.co/8yoc1CNTsu


Meet Venti, a seemingly caffeinated puppoccino. She was just informed the weekend would include walks, pats and scritches. 13/10 much excite https://t.co/ejExJFq3ek


Say hello to Lassie. She's celebrating #PrideMonth by being a splendid mix of astute and adorable. Proudly supupporting her owner. 13/10 https://t.co/uK6PNyeh9w


This is Lili. She can't believe you betrayed her with bath time. Never looking you in the eye again. 12/10 would puppologize profusely https://t.co/9b9J46E86Z


Jerry just apuppologized to me. He said there was no ill-intent to the slippage. I overreacted I admit. Pupgraded to an 11/10 would pet


Here we have some incredible doggos for #K9VeteransDay. All brave as h*ck. Salute your dog in solidarity. 14/10 for all https://t.co/SVNMdFqKDL


@0_kelvin_0 &gt;10/10 is reserved for puppos sorry Kevin


This is Lucy. She has a portrait of herself on her ear. Excellent for identification pupposes. 13/10 innovative af https://t.co/uNmxbL2lns


RT @SchafeBacon2016: @dog_rates Slightly disturbed by the outright profanity, but confident doggos were involved. 11/10, would tailgate aga…


RT @dog_rates: Meet Jack. He's one of the rare doggos that doesn't mind baths. 11/10 click the link to see how you can help Jack!

https://…


Meet Jack. He's one of the rare doggos that doesn't mind baths. 11/10 click the link to see how you can help Jack!

https://t.co/r4W111FzAq https://t.co/fQpYuMKG3p


This is Lennon. He's a Boopershnoop Pupperdoop. Quite rare. Exceptionally pettable. 12/10 would definitely boop that shnoop https://t.co/fhgP6vSfhX


This is Duke. He is not a fan of the pupporazzi. 12/10 https://t.co/SgpBVYIL18


You need to watch these two doggos argue through a cat door. Both 11/10 https://t.co/qEP31epKEV


Here we are witnessing an isolated squad of bouncing doggos. Unbelievably rare for this time of year. 11/10 for all https://t.co/CCdlwiTwQf


Here are three doggos completely misjudging an airborne stick. Decent efforts tho. All 9/10 https://t.co/HCXQL4fGVZ


This is Dietrich. He hops at random. Other doggos don't understand him. It upsets him greatly. 8/10 would comfort https://t.co/U8cSRz8wzC


This is one of the most reckless puppers I've ever seen. How she got a license in the first place is beyond me. 6/10 https://t.co/z5bAdtn9kd


This is Arlen and Thumpelina. They are best pals. Cuddly af. 11/10 for both puppers https://t.co/VJgbgIzIHx


Everybody stop what you're doing and watch these puppers enjoy summer. Both 13/10 https://t.co/wvjqSCN6iC


Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv


Here are two lil cuddly puppers. Both 12/10 would snug like so much https://t.co/zO4eb7C4tG


Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1


Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12


WeRateDogs stickers are here and they're 12/10! Use code "puppers" at checkout 🐶🐾

Shop now: https://t.co/k5xsufRKYm https://t.co/ShXk46V13r


Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa


This golden is happy to refute the soft mouth egg test. Not a fan of sweeping generalizations. 11/10 #notallpuppers https://t.co/DgXYBDMM3E


Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3


Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55


Gang of fearless hoofed puppers here. Straight savages. Elevated for extra terror. Front one has killed before 6/10s https://t.co/jkCb25OWfh


Meet Sadie. She fell asleep on the beach and her friends buried her. 10/10 can't trust fellow puppers these days https://t.co/LoKVvc1xAW


This is Penny. Her tennis ball slowly rolled down her cone and into the pool. 8/10 bad things happen to good puppers https://t.co/YNWU7LeFgg


Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10 https://t.co/eNm2S6p9BD


Crazy unseen footage from Jurassic Park. 10/10 for both dinosaur puppers https://t.co/L8wt2IpwxO


IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq


Hope your Monday isn't too awful. Here's two baseball puppers. 11/10 for each https://t.co/dB0H9hdZai


Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw


Here's a handful of sleepy puppers. All look unaware of their surroundings. Lousy guard dogs. Still cute tho 11/10s https://t.co/lyXX3v5j4s


Happy Friday. Here's some golden puppers. 12/10 for all https://t.co/wNkqAED6lG


This is Rodman. He's getting destroyed by the surfs. Valiant effort though. 10/10 better than most puppers probably https://t.co/S8wCLemrNb


Herd of wild dogs here. Not sure what they're trying to do. No real goals in life. 3/10 find your purpose puppers https://t.co/t5ih0VrK02


This is Zoey. Her dreams of becoming a hippo ballerina don't look promising. 9/10 it'll be ok puppers https://t.co/kR1fqy4NKK


In [109]:
# Do away with original values and keep the more thorough new dog stage column
tw_clean = tw_clean.drop('Dog_stage', axis=1)
In [110]:
# Remove "new" from dog stage column 
tw_clean = tw_clean.rename(columns = {'Dog_stage_new':'dog_stage'})
In [112]:
sample_size = 5
for k in tw_clean.dog_stage.unique():
    print('\n')
    print(k)
    print('\n')
    s = min(sample_size, len(tw_clean[tw_clean.dog_stage == k]))
    for i in tw_clean[tw_clean.dog_stage == k].sample(s).index:
        print(tw_clean.text[i])
        print("\n")

doggo


Here's a sleepy doggo that requested some assistance. 12/10 would carry everywhere https://t.co/bvkkqOjNDV


Meet Gerald. He's a fairly exotic doggo. Floofy af. Inadequate knees tho. Self conscious about large forehead. 8/10 https://t.co/WmczvjCWJq


Meet Jack. He's one of the rare doggos that doesn't mind baths. 11/10 click the link to see how you can help Jack!

https://t.co/r4W111FzAq https://t.co/fQpYuMKG3p


Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv


Here we have some incredible doggos for #K9VeteransDay. All brave as h*ck. Salute your dog in solidarity. 14/10 for all https://t.co/SVNMdFqKDL




puppo-doggo


I have stumbled puppon a doggo painting party. They're looking to be the next Pupcasso or Puppollock. All 13/10 would put it on the fridge https://t.co/cUeDMlHJbq


Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel




pupper-doggo


Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda


Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8


This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj


This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd


RT @dog_rates: Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda




pupper


Here's a pupper in a onesie. Quite pupset about it. Currently plotting revenge. 12/10 would rescue https://t.co/xQfrbNK3HD


This is Zoey. Her dreams of becoming a hippo ballerina don't look promising. 9/10 it'll be ok puppers https://t.co/kR1fqy4NKK


This pupper just got his first kiss. 12/10 he's so happy https://t.co/2sHwD7HztL


RT @dog_rates: Meet Herschel. He's slightly bigger than ur average pupper. Looks lonely. Could probably ride 7/10 would totally pet https:/…


Pupper hath acquire enemy. 13/10 https://t.co/ns9qoElfsX




puppo


Say hello to Lassie. She's celebrating #PrideMonth by being a splendid mix of astute and adorable. Proudly supupporting her owner. 13/10 https://t.co/uK6PNyeh9w


Sorry for the lack of posts today. I came home from school and had to spend quality time with my puppo. Her name is Zoey and she's 13/10 https://t.co/BArWupFAn0


This is Cooper. He's just so damn happy. 10/10 what's your secret puppo? https://t.co/yToDwVXEpA


Say hello to Lily. She's pupset that her costume doesn't fit as well as last year. 12/10 poor puppo https://t.co/YSi6K1firY


This is Duke. He is not a fan of the pupporazzi. 12/10 https://t.co/SgpBVYIL18




no_stage


Not much to say here. I just think everyone needs to see this. 12/10 https://t.co/AGag0hFHpe


This is Duke. He permanently looks like he just tripped over something. 11/10 https://t.co/1sNtG7GgiO


This is Jesse. He really wants a belly rub. Will be as cute as possible to achieve that goal. 11/10 https://t.co/1BxxcdVNJ8


When she says she'll be ready in a minute but you've been waiting in the car for almost an hour. 10/10 https://t.co/EH0N3dFKUi


This is Timber. He misses Christmas. Specifically the presents part. 12/10 cheer pup Timber https://t.co/dVVavqpeF9


Issue:

  • Some dog names were incorrectly captured: 'such', 'an', 'a', 'the', 'quite', among others.

Define action:

  • Get all unique names that do not start with uppper case and are not "None". Create set that excludes those that could still be names. Use set of no-names to filter the name column.
In [113]:
# Code
# Reset indices, changed after modifications above
tw_clean = tw_clean.reset_index(drop=True)

# Get all unique names that do not start with uppper case and are not "None"
not_names = tw_clean[~tw_clean.name.str.istitle() & ~tw_clean.name.str.isupper()]
not_names.name.unique()
Out[113]:
array(['just', 'one', 'his', 'a', 'mad', 'actually', 'all', 'the', 'such',
       'quite', 'not', 'incredibly', 'BeBe', 'an', 'very', 'DonDon', 'my',
       'getting', 'this', 'unacceptable', 'old', 'infuriating', 'CeCe',
       'by', 'officially', 'life', 'light', 'space', 'DayZ'], dtype=object)
In [114]:
# Create set of words that are not names and use it filter 'name' columns off them.
false_names = ['just', 'one', 'his', 'a', 'mad', 'actually', 'all', 'the', 'such',
       'quite', 'not', 'incredibly','an', 'very', 'my',
       'getting', 'this', 'unacceptable', 'old', 'infuriating',
       'by', 'officially', 'life', 'light', 'space']

tw_clean = tw_clean.set_value(tw_clean.name.isin(false_names), 'name', 'None')
In [115]:
# Of these 2 full uppercase cases, one has to be fixed manually, "O'Malley" instead of "O"
tw_clean[tw_clean.name.str.isupper()]
Out[115]:
Unnamed: 0 Unnamed: 0.1 tweet_id in_reply_to_status_id in_reply_to_user_id source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp ... favorite_count_JSON retweet_count_JSON tweet_id_JSON year month day hour minute second dog_stage
1008 775 775 776201521193218049 NaN NaN <a href="http://twitter.com/download/iphone" r... This is O'Malley. That is how he sleeps. Doesn... NaN NaN NaN ... 10587 2871 -1 2016 09 14 23 30 38 no_stage
2033 2041 2041 671542985629241344 NaN NaN <a href="http://twitter.com/download/iphone" r... This is JD (stands for "just dog"). He's like ... NaN NaN NaN ... 1149 610 -1 2015 12 01 04 14 59 no_stage

2 rows × 29 columns

In [116]:
# Fixing uppercase name
tw_clean = tw_clean.set_value(1008, 'name', "O'Malley")
In [117]:
    # Test
    print(sum(tw_clean.name.isin(false_names)))
    tw_clean.name[1008]
0
Out[117]:
"O'Malley"

Issue:

  • Blep column missing.

Define action:

  • Initialize blep column as boolean False. Search the text for 'blep'. If it appears, enter True in the corresponding column.
In [118]:
# Code
# Initialize column blep
tw_clean['blep'] = False

# Assign values
tw_clean = tw_clean.set_value(tw_clean.text.str.contains('blep', case = False),'blep', True)

There are only four cases of "bleps".

In [119]:
tw_clean[tw_clean.text.str.contains('blep', case = False)]
Out[119]:
Unnamed: 0 Unnamed: 0.1 tweet_id in_reply_to_status_id in_reply_to_user_id source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp ... retweet_count_JSON tweet_id_JSON year month day hour minute second dog_stage blep
97 29 29 886366144734445568 NaN NaN <a href="http://twitter.com/download/iphone" r... This is Roscoe. Another pupper fallen victim t... NaN NaN NaN ... 3267 -1 2017 07 15 23 25 31 pupper True
424 61 61 880221127280381952 NaN NaN <a href="http://twitter.com/download/iphone" r... Meet Jesse. He's a Fetty Woof. His tongue ejec... NaN NaN NaN ... 4369 -1 2017 06 29 00 27 25 no_stage True
484 139 139 865359393868664832 NaN NaN <a href="http://twitter.com/download/iphone" r... This is Sammy. Her tongue ejects without warni... NaN NaN NaN ... 5287 -1 2017 05 19 00 12 11 no_stage True
802 523 523 809448704142938112 NaN NaN <a href="http://twitter.com/download/iphone" r... I call this one "A Blep by the Sea" 12/10 http... NaN NaN NaN ... 1678 -1 2016 12 15 17 23 04 no_stage True

4 rows × 30 columns

In [120]:
# Test
print('BLEP')  
for i in tw_clean[tw_clean.text.str.contains('blep', case = False)].index:
    print(tw_clean.text[i])
    print('\n')
    

print('NO BLEP')    
for i in tw_clean[~tw_clean.text.str.contains('blep', case = False)].sample(5).index:
    print(tw_clean.text[i])
    print('\n')   
BLEP
This is Roscoe. Another pupper fallen victim to spontaneous tongue ejections. Get the BlepiPen immediate. 12/10 deep breaths Roscoe https://t.co/RGE08MIJox


Meet Jesse. He's a Fetty Woof. His tongue ejects without warning. A true bleptomaniac. 12/10 would snug well https://t.co/fUod0tVmvK


This is Sammy. Her tongue ejects without warning sometimes. It's a serious condition. Needs a hefty dose from a BlepiPen. 13/10 https://t.co/g20EmqK7vc


I call this one "A Blep by the Sea" 12/10 https://t.co/EMdnCugNbo


NO BLEP
This is Vince. He's a Gregorian Flapjeck. White spot on legs almost looks like another dog (whoa). 9/10 rad as hell https://t.co/aczGAV2dK4


This left me speechless. 14/10 heckin heroic af https://t.co/3td8P3o0mB


Crazy unseen footage from Jurassic Park. 10/10 for both dinosaur puppers https://t.co/L8wt2IpwxO


This is Sadie and her 2 pups Shebang &amp; Ruffalo. Sadie says single parenting is challenging but rewarding. All 10/10 https://t.co/UzbhwXcLne


This is Steven. He has trouble relating to other dogs. Quite shy. Neck longer than average. Tropical probably. 11/10 would still pet https://t.co/2mJCDEJWdD


Images dataset

Issue:

  • Some dog breeds are capitalized while others aren't.

Design action:

  • Capitalize all dog breeds.
In [121]:
# Code
img_clean.p1 = img_clean.p1.str.title()
In [122]:
# Test
sum(~img_clean.p1.str.istitle())
Out[122]:
0

Issue:

  • Many images do not belong to a dog.

Design action:

  • Eliminate rows for which dog == False.
In [123]:
# Code
indices = img_clean.index[img_clean.p1_dog == False].tolist()
img_clean.drop(img_clean.index[[indices]],inplace=True)
In [124]:
# Test
img_clean.p1_dog.unique()
Out[124]:
array([ True])

Both twitter dataset and imgages dataset

Issue:

  • Length of images (img) and of twitter data (tw) are different.

Design action:

  • Merge twitter dataset with images dataset based on tweet_id. Images dataset tweet_id values are a subset of twitter dataset tweet_id values.
In [125]:
tw_clean = tw_clean.merge(img_clean,how='left', left_on='tweet_id', right_on='tweet_id')
# https://stackoverflow.com/questions/33086881/merge-two-python-pandas-data-frames-of-different-length-but-keep-all-rows-in-out
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
In [126]:
# Test
tw_clean.sample(4)
Out[126]:
Unnamed: 0_x Unnamed: 0.1 tweet_id in_reply_to_status_id in_reply_to_user_id source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp ... minute second dog_stage blep Unnamed: 0_y jpg_url img_num p1 p1_conf p1_dog
1464 1335 1335 705239209544720384 NaN NaN <a href="http://twitter.com/download/iphone" r... This is Jimothy. He lost his body during the t... NaN NaN NaN ... 51 44 no_stage False 955.0 https://pbs.twimg.com/media/CcmDUjFW8AAqAjc.jpg 1.0 Chihuahua 0.157950 True
181 1122 1122 730573383004487680 NaN NaN <a href="http://twitter.com/download/iphone" r... This is Rooney. He can't comprehend glass. 10/... NaN NaN NaN ... 40 42 pupper False 1146.0 https://pbs.twimg.com/media/CiOEnI6WgAAmq4E.jpg 2.0 American_Staffordshire_Terrier 0.810158 True
1395 1258 1258 710283270106132480 NaN NaN <a href="http://twitter.com/download/iphone" r... This is Gunner. He's a Figamus Newton. King of... NaN NaN NaN ... 55 02 no_stage False 1023.0 https://pbs.twimg.com/media/Cdtu3WRUkAAsRVx.jpg 2.0 Shih-Tzu 0.932401 True
146 772 772 776477788987613185 NaN NaN <a href="http://twitter.com/download/iphone" r... This is Huck. He's addicted to caffeine. Hope ... NaN NaN NaN ... 48 25 pupper False 1451.0 https://pbs.twimg.com/media/CsaaaaxWgAEfzM7.jpg 1.0 Labrador_Retriever 0.884839 True

4 rows × 36 columns

Saving cleaned dataset

In [449]:
tw_clean.to_csv("twitter_archive_master.csv", sep = ",")

4 - Plots and comments

Wordcloud

Wordcloud using the most frequently found words in the body of the tweets.

In [63]:
# Code from https://amueller.github.io/word_cloud/auto_examples/masked.html
# Slightly modified to be applied to my data
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

# All tweeted texts combined into a single text
texts = tw_clean.text.str.cat(sep=' ')
texts = texts.replace('https://t.co/','')

# Read the mask image
# (Taken from http://www.stencilry.org/stencils/animals/dog/dog+3.gif )
dog_mask = np.array(Image.open("dog_mask.png"))

stopwords = set(STOPWORDS)
stopwords.add("@dog_rates")

wc = WordCloud(background_color="white", max_words=2000, mask=dog_mask,
               stopwords=stopwords)

# Generate word cloud
wc.generate(texts)

# Store to file
wc.to_file("dog_wordcloud.png")

# Show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
plt.show()
#plt.imshow(dog_mask, cmap=plt.cm.gray, interpolation='bilinear')
#plt.axis("off")
<matplotlib.figure.Figure at 0x10a61c0f0>
In [457]:
counts = tw_clean.p1.value_counts()

names = list(counts.axes[0])
names.reverse()
values = list(counts.get_values())
values.reverse()

fig = plt.figure(figsize=(20,50))
ax = fig.add_subplot(111)
yvals = range(len(names))
ax.barh(yvals, values, align='center', alpha=0.4)
ax.tick_params(axis='both', labelsize=18)
plt.yticks(yvals,names)
plt.title('Count of each breed from photos with dogs in them', fontsize = 24)
plt.tight_layout()

plt.savefig('breeds_counts.png', bbox_inches='tight')
plt.show()
<matplotlib.figure.Figure at 0x11992a8d0>

Retweets by breed

In [458]:
breed_rt = tw_clean.groupby('p1').agg({'retweet_count': sum})
breed_rt = breed_rt.retweet_count.sort_values(ascending=False)

names = list(breed_rt.axes[0])
names.reverse()
values = list(breed_rt.get_values())
values.reverse()

fig = plt.figure(figsize=(20,50))
ax = fig.add_subplot(111)
yvals = range(len(names))
ax.barh(yvals, values, align='center', alpha=0.4)
ax.tick_params(axis='both', labelsize=18)
plt.yticks(yvals,names)
plt.title('Retweets by breed from tweets with photos with dogs in them', fontsize = 24)
plt.tight_layout()

plt.savefig('breeds_retweets.png', bbox_inches='tight')
plt.show()
In [459]:
counts = tw_clean.dog_stage.value_counts()

names = list(counts.axes[0])
names.reverse()
values = list(counts.get_values())
values.reverse()

fig = plt.figure(figsize=(20,5))
ax = fig.add_subplot(111)
yvals = range(len(names))
ax.barh(yvals, values, align='center', alpha=0.4)
ax.tick_params(axis='both', labelsize=18)
plt.yticks(yvals,names)
plt.title('Count of each dog stage', fontsize = 24)
plt.tight_layout()

plt.savefig('stages.png', bbox_inches='tight')
plt.show()

Retweet counts by rating

In [55]:
# To add jitter to scatterplot
jitter = np.random.uniform(low = -.99, high = .99, size = len(tw_clean))
tw_jitter = tw_clean.copy()
tw_jitter.retweet_count = tw_jitter.retweet_count + jitter
tw_jitter.rating_denominator = tw_jitter.rating_denominator + jitter
fig = tw_jitter.plot.scatter('retweet_count', 'rating_denominator',alpha=0.1, s=3)
fig.axes.set_ylim(9,11)
plt.title('Ratings by retweet count', fontsize = 10)

plt.savefig('ratings_retweets.png', bbox_inches='tight')
plt.show()

First insights

Breed popularity

The top five most popular breeds are Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, and Pug. The top one, Golden Retriever, appears almost double and three times more than fourth and fifth places.

Unsurprisingly, the breeds whose photos and tweet were most retweeted were also, in order, Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, but in fifth place is Samoyed (seventh in number of appearances). Pugs are popular but do not get as retweeted.

Overall, however, breeds that have high counts are also generally highly retweeted.

Stage popularity

The most common stage mentioned is “pupper”, three times more than the second most mentioned stage “doggo”, in turn twice as frequent as the least common stage “puppo”.

Double stages, or two dogs, are much less common. Consistent with the single stage popularity ranking, of the combinations, “pupper-doggo” is the most common. Interestingly, there are no “pupper-puppo” duos.

Rating and retweet counts

Generally unexpected, there is no relationship between rating and retweet count. This does make sense if put in the context of how the ratings work, without logic or consistency.