In this project, I gather, assess, wrangle, and clean a dataset assembled from the channel "We rate dogs". I use the Twitter API to query the text of the tweets and fill in columns missing from the base dataset, and I request a separate dataset with the associated image predictions. I then use Python tools to assess problems with the data. The result is a tidy, useful dataset of good-quality data.
The data: users submit pictures and a short text to the group "We rate dogs" to have their pet rated. Most of the time, dogs are classified into stages known to that community: doggo, puppo, pupper. Other comments generally use internal lingo, and the names of the dogs are usually given as well.
Steps in this project:
1 - Gathering the data: load existing dataset, use Twitter API to query tweets, use information extracted in JSON, request the images dataset from a website and save it as a local dataset.
2 - Assess data: assess data sources for "quality" and "tidiness". Issues include, but are not limited to: different sizes of merged datasets, mismatch of text for dog stage between base dataset entry and tweet due to problem capturing string, incorrectly captured dog names, tweets later deleted by users, among others.
3 - Cleaning data: establish and execute an action for each of the quality and tidiness problems found.
4 - Plots and comments: a first approach to analyzing the data with preliminary plots. I found that for this channel the top five most popular breeds are Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, and Pug, with the top one, Golden Retriever, appearing almost twice as often as fourth place and three times as often as fifth. The breeds whose photos were most retweeted were also, in order, Golden Retriever, Labrador Retriever, Pembroke, and Chihuahua, but fifth place goes to Samoyed (seventh in number of appearances). Pugs are popular but do not get retweeted as much.
import requests
import numpy as np
import pandas as pd
import json
import tweepy
import sys
# Load base dataset
tw = pd.read_csv("twitter-archive-enhanced.csv")
tw.head()
# Setting up API
consumer_key = 'my_consumer_key'
consumer_secret = 'my_consumer_secret'
access_token = 'my_access_token'
access_secret = 'my_access_secret'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True) # http://docs.tweepy.org/en/v3.5.0/api.html
# Initializing dataframe columns
init_v = [-1 for i in range(len(tw))]
tw['favorite_count'] = init_v
tw['retweet_count'] = init_v
errors = ['' for i in range(len(tw))]
tw['errors'] = errors
tw['json_tw'] = errors
# Querying Twitter and saving JSON
data = {}
data['tweets'] = []
for i in range(0, len(tw)):
    try:
        # These first two lines are a way of directly accessing the values
        # without need for JSON
        fav_c = api.get_status(tw.tweet_id[i]).favorite_count
        tw.at[i, 'favorite_count'] = fav_c
        rtw_c = api.get_status(tw.tweet_id[i]).retweet_count
        tw.at[i, 'retweet_count'] = rtw_c
        # Here I query the JSON strings for the sake of practicing with JSON;
        # there are other ways of getting the tweets' information
        json_tw = api.get_status(tw.tweet_id[i])._json
        data['tweets'].append(json_tw)
        # Saving JSON string to dataframe
        tw.at[i, 'json_tw'] = json_tw
    except Exception:
        e = sys.exc_info()[0]
        tw.at[i, 'errors'] = str(e)
# Visually inspecting dataset with new columns
tw.head(4)
# Initializing dataframe columns
init_v = [-1 for i in range(len(tw))]
tw['favorite_count_JSON'] = init_v
tw['retweet_count_JSON'] = init_v
# Read each line
for i in range(0, len(data['tweets'])):
    indice = tw.index[tw.tweet_id == data['tweets'][i]['id']].tolist()[0]
    tw.at[indice, 'retweet_count_JSON'] = data['tweets'][i]['retweet_count']
    tw.at[indice, 'favorite_count_JSON'] = data['tweets'][i]['favorite_count']
tw.head(4)
# Check for problems
tw[tw.favorite_count != tw.favorite_count_JSON][0:4]
with open('tweet_json.txt', 'w') as outfile:
    json.dump(data, outfile)
tw.to_csv("tw_gathered.csv", sep = ",")
# api.get_status(tw.tweet_id[1])._json['favorite_count']
# api.get_status(tw.tweet_id[1]).favorite_count
# js = api.get_status(tw.tweet_id[1])._json
# js['favorite_count']
images = requests.get("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv")
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
images = StringIO(images.text)
img = pd.read_csv(images, sep="\t")
# Visual inspection
img.head()
img.to_csv("image_predictions.tsv", sep = "\t")
-- Added on a later iteration, while cleaning:
print("Length of images file: " + str(len(img)))
print("Length of twitter file: " + str(len(tw)))
tw.info()
tw.describe()
img.info()
img.describe()
print('Total number of rows with error when querying API: ' +
str(len(tw[tw.errors == "<class 'tweepy.error.TweepError'>"])))
tw[tw.errors == "<class 'tweepy.error.TweepError'>"]
#no_stage = tw[(tw.doggo== "None") & (tw.pupper== "None") & (tw.puppo== "None")]
for i in [545, 1779, 1636]:
    print(tw.text[i])
    print(tw.doggo[i] + ' ' + tw.pupper[i] + ' ' + tw.puppo[i])
    print('\n')
i = 460
print(tw.text[i])
print('\n')
print(tw.doggo[i] + ' ' + tw.pupper[i] + ' ' + tw.puppo[i])
rand_indices = np.random.randint(0,tw.shape[0],15)
tw.name[rand_indices]
# tw.name
tw.columns
img.p1[0:10]
print("There are " + str(sum(img.p1_dog == False))+ " images not of a dog.")
img.p1_dog[0:10]
img.info()
print(tw.doggo.unique())
print(tw.pupper.unique())
print(tw.puppo.unique())
tw.timestamp[0:5]
img.head(5)
Quality
Tidiness
It is important to keep a backup of the original data and to make changes on a copy.
tw = pd.read_csv("tw_gathered.csv")
img = pd.read_csv("image_predictions.tsv", sep ="\t")
# Making copies of original datasets
tw_clean = tw.copy()
img_clean = img.copy()
Issue: the dog stage is spread across three columns (doggo, pupper, puppo), which is untidy.
Define action: melt the three stage columns into a single stage column, keeping track of entries with no stage.
# Code
# Cases in which stage has not been detected
tw_clean['no_stage'] = 'None'
tw_clean.loc[(tw_clean.puppo == 'None') &
(tw_clean.doggo == 'None') &
(tw_clean.pupper == 'None'), 'no_stage'] = '-'
# Make column stage that contains the contents of columns doggo, pupper, puppo.
cols = list(tw_clean.columns)
cols.remove('doggo')
cols.remove('pupper')
cols.remove('puppo')
cols.remove('no_stage')
tw_clean = pd.melt(tw_clean, id_vars = cols,
var_name = 'Dog_stage', value_name = 'Stage_value')
tw_clean = tw_clean[tw_clean.Stage_value != 'None']
len(tw_clean)
There are 13 entries in which the dog has been given two stages, as can be seen by checking for duplicated tweet_id. In the code below, the dataframe dupls contains these cases.
dupls = tw_clean[tw_clean.duplicated('tweet_id')]
print(len(dupls))
For such a small number, I checked them manually and found that most contain either two dogs or one dog with two stages.
The texts for all of these cases are displayed below.
dupls = dupls.reset_index(drop=True)
dupl_tweet_id = dupls.tweet_id.unique()
for i in range(0, len(dupls)):
    print(tw_clean[tw_clean.tweet_id == dupls.tweet_id[i]].reset_index(drop=True).text[0])
    print('\n')
So they get their own label (dropping the duplicates first).
tw_clean = tw_clean.drop_duplicates(subset='tweet_id')
tw_clean.loc[tw_clean.text.str.contains('pupper.*doggo|doggo.*pupper', case=False),
             'Dog_stage'] = 'pupper-doggo'
tw_clean.loc[tw_clean.text.str.contains('puppo.*doggo|doggo.*puppo', case=False),
             'Dog_stage'] = 'puppo-doggo'
tw_clean.loc[tw_clean.text.str.contains('pupper.*puppo|puppo.*pupper', case=False),
             'Dog_stage'] = 'pupper-puppo'
Importantly, there are no cases in which the three stages appear together.
len(tw_clean[tw_clean.text.str.contains('pupper.*doggo.*puppo', case = False)])
# Dropping column of original stage values, and resetting indices of tw_clean table
tw_clean = tw_clean.drop('Stage_value', axis=1)
tw_clean = tw_clean.reset_index(drop=True)
# Test
print(tw_clean.Dog_stage.unique())
print(tw_clean.columns)
Issue: the timestamp column mixes date and time in a single string.
Define action: split the timestamp into year, month, day, hour, minute, and second columns, then drop the original column.
tw_clean.timestamp[0]
# Day, month, year
date = tw_clean.timestamp.str.extract(r'(\d{4}-\d{2}-\d{2})', expand=True)
tw_clean[['year', 'month', 'day']] = date[0].str.split('-', expand=True)
# Hour, minute, second
time = tw_clean.timestamp.str.extract(r'(\d{2}:\d{2}:\d{2})', expand=True)
time
tw_clean[['hour', 'minute', 'second']] = time[0].str.split(':', expand=True)
# Dropping timestamp to avoid duplicate data
tw_clean = tw_clean.drop('timestamp', axis=1)
# Test
#tw_clean.info()
tw_clean.head(3)
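As a side note, the same split could be sketched with pandas' own datetime parsing instead of regex extraction. This is only an alternative sketch, and the example timestamp string is an assumption about the archive's format:

```python
import pandas as pd

# Minimal sketch (not the method used above): parse the timestamp column
# with pd.to_datetime and read components off the .dt accessor.
# The example string below is an assumed sample, not taken from the data.
ts = pd.Series(['2017-08-01 16:23:56 +0000'])
parsed = pd.to_datetime(ts)
print(parsed.dt.year[0], parsed.dt.month[0], parsed.dt.day[0])      # 2017 8 1
print(parsed.dt.hour[0], parsed.dt.minute[0], parsed.dt.second[0])  # 16 23 56
```

This avoids hand-written regexes and yields integer components directly, at the cost of a timezone-aware dtype.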
Issue: the image predictions table carries the second and third predictions and related columns, which are not needed here.
Define action: keep only the columns up to the first prediction and drop the rest.
# Code
img_clean = img.copy()
img_clean.drop(img_clean.columns[7:], axis=1, inplace=True)
# Test
img_clean.head(5)
Issue: some tweets could not be retrieved (for example, tweets later deleted by their users), leaving rows with an API error.
Define action: drop the rows whose errors column is not empty.
# Code
print(len(tw_clean))
print(sum(~tw_clean.errors.isnull()))
indices = tw_clean.index[~tw_clean.errors.isnull()].tolist()
tw_clean.drop(indices, inplace=True)
# Test
print(len(tw_clean))
print(sum(~tw_clean.errors.isnull()))
Issues: the original dataset captured dog stages by exact substring only, missing plurals and compound words, and some stage values were incorrectly captured.
Define action: re-extract the dog stage from the tweet text into a new column and compare it against the original one.
# Code
# Two values (either from two dogs, or one dog with two qualifications)
tw_clean.loc[tw_clean.text.str.contains('pupper.*doggo|doggo.*pupper', case=False),
             'Dog_stage_new'] = 'pupper-doggo'
tw_clean.loc[tw_clean.text.str.contains('puppo.*doggo|doggo.*puppo', case=False),
             'Dog_stage_new'] = 'puppo-doggo'
tw_clean.loc[tw_clean.text.str.contains('pupper.*puppo|puppo.*pupper', case=False),
             'Dog_stage_new'] = 'pupper-puppo'
# Single value
tw_clean.loc[~tw_clean.text.str.contains('doggo|puppo', case=False) &
             tw_clean.text.str.contains('pupper', case=False),
             'Dog_stage_new'] = 'pupper'
tw_clean.loc[~tw_clean.text.str.contains('doggo|pupper', case=False) &
             tw_clean.text.str.contains('puppo', case=False),
             'Dog_stage_new'] = 'puppo'
tw_clean.loc[~tw_clean.text.str.contains('pupper|puppo', case=False) &
             tw_clean.text.str.contains('doggo', case=False),
             'Dog_stage_new'] = 'doggo'
# Agreement on those tweets for which there was no dog stage assigned
# (between the original stage column and the newly extracted one)
new = tw_clean.Dog_stage_new.isnull()
old = tw_clean.Dog_stage == 'no_stage'
tw_clean.loc[new & old, 'Dog_stage_new'] = 'no_stage'
Interestingly, I found that longer words containing a dog stage within them were not recognized in the original dataset.
The overwhelming majority of these are plurals: puppers, doggos, puppos. However, there are also cases such as pupperdoop, puppergeddon, pupporazi, apuppologized, puppoccino, puppertunity, and pupposes, shown in the strings below.
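This happens because `str.contains` does plain substring matching, so any word containing the stage matches. A minimal sketch of the difference between substring and whole-word matching, using made-up example strings rather than real tweets:

```python
import pandas as pd

# Hypothetical example texts, not taken from the dataset
texts = pd.Series([
    "This is a pupper",         # exact stage word
    "So many puppers",          # plural still contains 'pupper'
    "Total puppergeddon here",  # compound word containing 'pupper'
])

# Default substring matching: all three match
print(texts.str.contains('pupper', case=False).tolist())
# [True, True, True]

# Whole-word matching with \b word boundaries: only the exact form matches
print(texts.str.contains(r'\bpupper\b', case=False).tolist())
# [True, False, False]
```

Here the substring behavior is actually what we want, since it is exactly what lets us recover the plurals and compounds the original dataset missed.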
# Texts of tweets containing dog stages missing in original dataset.
for i in tw_clean[~(tw_clean.Dog_stage == tw_clean.Dog_stage_new)].index:
    print(tw_clean.text[i])
    print('\n')
# Do away with original values and keep the more thorough new dog stage column
tw_clean = tw_clean.drop('Dog_stage', axis=1)
# Remove "new" from dog stage column
tw_clean = tw_clean.rename(columns = {'Dog_stage_new':'dog_stage'})
sample_size = 5
for k in tw_clean.dog_stage.unique():
    print('\n')
    print(k)
    print('\n')
    s = min(sample_size, len(tw_clean[tw_clean.dog_stage == k]))
    for i in tw_clean[tw_clean.dog_stage == k].sample(s).index:
        print(tw_clean.text[i])
        print("\n")
Issue: some dog names were incorrectly captured from the text (lowercase words such as "a", "the", or "very").
Define action: replace these false names with "None" and fix the remaining incorrect cases manually.
# Code
# Reset indices, changed after modifications above
tw_clean = tw_clean.reset_index(drop=True)
# Get all unique names that do not start with upper case and are not "None"
not_names = tw_clean[~tw_clean.name.str.istitle() & ~tw_clean.name.str.isupper()]
not_names.name.unique()
# Create a set of words that are not names and filter them out of the 'name' column.
false_names = ['just', 'one', 'his', 'a', 'mad', 'actually', 'all', 'the', 'such',
'quite', 'not', 'incredibly','an', 'very', 'my',
'getting', 'this', 'unacceptable', 'old', 'infuriating',
'by', 'officially', 'life', 'light', 'space']
tw_clean.loc[tw_clean.name.isin(false_names), 'name'] = 'None'
# Of these 2 full uppercase cases, one has to be fixed manually, "O'Malley" instead of "O"
tw_clean[tw_clean.name.str.isupper()]
# Fixing uppercase name
tw_clean.at[1008, 'name'] = "O'Malley"
# Test
print(sum(tw_clean.name.isin(false_names)))
tw_clean.name[1008]
Issue: whether the dog is doing a "blep" is potentially interesting information not captured in any column.
Define action: add a boolean blep column extracted from the tweet text.
# Code
# Initialize column blep
tw_clean['blep'] = False
# Assign values
tw_clean.loc[tw_clean.text.str.contains('blep', case=False), 'blep'] = True
There are only four cases of "bleps".
tw_clean[tw_clean.text.str.contains('blep', case = False)]
# Test
print('BLEP')
for i in tw_clean[tw_clean.text.str.contains('blep', case=False)].index:
    print(tw_clean.text[i])
    print('\n')
print('NO BLEP')
for i in tw_clean[~tw_clean.text.str.contains('blep', case=False)].sample(5).index:
    print(tw_clean.text[i])
    print('\n')
Issue: breed names in the p1 column mix lowercase and title case.
Design action: convert all p1 values to title case.
# Code
img_clean.p1 = img_clean.p1.str.title()
# Test
sum(~img_clean.p1.str.istitle())
Issue: some images are not of dogs (p1_dog is False).
Design action: drop the rows whose top prediction is not a dog.
# Code
indices = img_clean.index[img_clean.p1_dog == False].tolist()
img_clean.drop(indices, inplace=True)
# Test
img_clean.p1_dog.unique()
Issue: the tweet data and the image predictions live in two separate tables of different lengths.
Design action: merge the two tables on tweet_id, keeping all tweets.
tw_clean = tw_clean.merge(img_clean, how='left', on='tweet_id')
# https://stackoverflow.com/questions/33086881/merge-two-python-pandas-data-frames-of-different-length-but-keep-all-rows-in-out
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
# Test
tw_clean.sample(4)
tw_clean.to_csv("twitter_archive_master.csv", sep = ",")
A word cloud of the most frequently found words in the body of the tweets.
# Code from https://amueller.github.io/word_cloud/auto_examples/masked.html
# Slightly modified to be applied to my data
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
# All tweeted texts combined into a single text
texts = tw_clean.text.str.cat(sep=' ')
texts = texts.replace('https://t.co/','')
# Read the mask image
# (Taken from http://www.stencilry.org/stencils/animals/dog/dog+3.gif )
dog_mask = np.array(Image.open("dog_mask.png"))
stopwords = set(STOPWORDS)
stopwords.add("@dog_rates")
wc = WordCloud(background_color="white", max_words=2000, mask=dog_mask,
stopwords=stopwords)
# Generate word cloud
wc.generate(texts)
# Store to file
wc.to_file("dog_wordcloud.png")
# Show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
plt.show()
#plt.imshow(dog_mask, cmap=plt.cm.gray, interpolation='bilinear')
#plt.axis("off")
counts = tw_clean.p1.value_counts()
names = list(counts.axes[0])
names.reverse()
values = list(counts.values)
values.reverse()
fig = plt.figure(figsize=(20,50))
ax = fig.add_subplot(111)
yvals = range(len(names))
ax.barh(yvals, values, align='center', alpha=0.4)
ax.tick_params(axis='both', labelsize=18)
plt.yticks(yvals,names)
plt.title('Count of each breed from photos with dogs in them', fontsize = 24)
plt.tight_layout()
plt.savefig('breeds_counts.png', bbox_inches='tight')
plt.show()
breed_rt = tw_clean.groupby('p1').agg({'retweet_count': 'sum'})
breed_rt = breed_rt.retweet_count.sort_values(ascending=False)
names = list(breed_rt.axes[0])
names.reverse()
values = list(breed_rt.values)
values.reverse()
fig = plt.figure(figsize=(20,50))
ax = fig.add_subplot(111)
yvals = range(len(names))
ax.barh(yvals, values, align='center', alpha=0.4)
ax.tick_params(axis='both', labelsize=18)
plt.yticks(yvals,names)
plt.title('Retweets by breed from tweets with photos with dogs in them', fontsize = 24)
plt.tight_layout()
plt.savefig('breeds_retweets.png', bbox_inches='tight')
plt.show()
counts = tw_clean.dog_stage.value_counts()
names = list(counts.axes[0])
names.reverse()
values = list(counts.values)
values.reverse()
fig = plt.figure(figsize=(20,5))
ax = fig.add_subplot(111)
yvals = range(len(names))
ax.barh(yvals, values, align='center', alpha=0.4)
ax.tick_params(axis='both', labelsize=18)
plt.yticks(yvals,names)
plt.title('Count of each dog stage', fontsize = 24)
plt.tight_layout()
plt.savefig('stages.png', bbox_inches='tight')
plt.show()
# To add jitter to scatterplot
jitter = np.random.uniform(low = -.99, high = .99, size = len(tw_clean))
tw_jitter = tw_clean.copy()
tw_jitter.retweet_count = tw_jitter.retweet_count + jitter
tw_jitter.rating_denominator = tw_jitter.rating_denominator + jitter
fig = tw_jitter.plot.scatter('retweet_count', 'rating_denominator',alpha=0.1, s=3)
fig.axes.set_ylim(9,11)
plt.title('Ratings by retweet count', fontsize = 10)
plt.savefig('ratings_retweets.png', bbox_inches='tight')
plt.show()
The top five most popular breeds are Golden Retriever, Labrador Retriever, Pembroke, Chihuahua, and Pug. The top one, Golden Retriever, appears almost twice as often as fourth place and three times as often as fifth.
Unsurprisingly, the breeds whose photos were most retweeted were also, in order, Golden Retriever, Labrador Retriever, Pembroke, and Chihuahua, but fifth place goes to Samoyed (seventh in number of appearances). Pugs are popular but do not get retweeted as much.
Overall, however, breeds that have high counts are also generally highly retweeted.
The most common stage mentioned is “pupper”, three times more frequent than the second most mentioned stage, “doggo”, which is in turn twice as frequent as the least common stage, “puppo”.
Double stages, or two dogs, are much less common. Consistent with the single stage popularity ranking, of the combinations, “pupper-doggo” is the most common. Interestingly, there are no “pupper-puppo” duos.
Somewhat unexpectedly, there is no relationship between rating and retweet count. This does make sense in the context of how the ratings work: they follow no consistent logic.
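To put a number on the "no relationship" claim, one could compute a correlation coefficient between the two columns. A minimal sketch with made-up values (in the project this would use the retweet counts and ratings from tw_clean):

```python
import numpy as np

# Illustrative values only, not the project's data: ratings hover around 10
# regardless of how widely a tweet is retweeted.
retweets = np.array([100, 5000, 250, 12000, 800])
ratings = np.array([10, 10, 11, 10, 9])

# Pearson correlation: values near 0 indicate no linear relationship
r = np.corrcoef(retweets, ratings)[0, 1]
print(round(r, 3))
```

A coefficient near zero would back up, quantitatively, what the jittered scatter plot shows visually.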