A Wrangling Project I worked on
wrangle_act
August 27, 2022
1 Project: Wrangle and Analyze Data
In [1]: # Import libraries
import pandas as pd
import numpy as np
import requests
import tweepy
import os
import json
import time
import re
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
from IPython.display import Image
from functools import reduce
import seaborn as sns
import datetime
1.1 Data Gathering
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)
In [2]: df_archive = pd.read_csv('twitter-archive-enhanced.csv')
2. Use the Requests library to download the tweet image predictions (image_predictions.tsv)
In [3]: # Download the image predictions file using the link provided by Udacity
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictio
image_request = requests.get(url, allow_redirects=True)
open('image_predictions.tsv', 'wb').write(image_request.content)
Out[3]: 335079
In [4]: df_images = pd.read_csv('image_predictions.tsv', sep = '\t')
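3. Use the Tweepy library to query additional data via the Twitter API (tweet-json.txt). The tweet JSON is already saved in tweet-json.txt, so it is read here line by line.
In [5]: tweets = []
# Each line of tweet-json.txt holds one tweet as a JSON object
with open('tweet-json.txt', 'r') as file:
    for line in file:
        tweets.append(json.loads(line))
df_1 = pd.DataFrame(tweets)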
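Only three of the fields in each tweet are used later (id, favorite_count, retweet_count), so the line-by-line read could also keep just those. The following is a minimal sketch under that assumption; the in-memory `sample_file`, its two records and the name `df_counts_sketch` are made up for illustration and stand in for the real file.

```python
import io
import json

import pandas as pd

# Two made-up records standing in for lines of tweet-json.txt (hypothetical values).
sample_file = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 1, "lang": "en"}\n'
)

wanted = ["id", "favorite_count", "retweet_count"]
rows = []
for line in sample_file:
    line = line.strip()
    if not line:  # skip blank lines instead of failing inside json.loads
        continue
    record = json.loads(line)
    # Keep only the fields needed for the analysis
    rows.append({key: record[key] for key in wanted})

df_counts_sketch = pd.DataFrame(rows, columns=wanted)
print(df_counts_sketch)
```

With the real file, `sample_file` would be replaced by an opened file handle; the rest of the loop would stay the same.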
In [6]: df_1.info()
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 31 columns):
contributors                     0 non-null object
coordinates                      0 non-null object
created_at                       2354 non-null object
display_text_range               2354 non-null object
entities                         2354 non-null object
extended_entities                2073 non-null object
favorite_count                   2354 non-null int64
favorited                        2354 non-null bool
full_text                        2354 non-null object
geo                              0 non-null object
id                               2354 non-null int64
id_str                           2354 non-null object
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null object
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 non-null object
is_quote_status                  2354 non-null bool
lang                             2354 non-null object
place                            1 non-null object
possibly_sensitive               2211 non-null object
possibly_sensitive_appealable    2211 non-null object
quoted_status                    28 non-null object
quoted_status_id                 29 non-null float64
quoted_status_id_str             29 non-null object
retweet_count                    2354 non-null int64
retweeted                        2354 non-null bool
retweeted_status                 179 non-null object
source                           2354 non-null object
truncated                        2354 non-null bool
user                             2354 non-null object
dtypes: bool(4), float64(3), int64(3), object(21)
memory usage: 505.8+ KB
In [7]: df_counts = df_1[['id', 'favorite_count', 'retweet_count']]
1.2 Assessing Data
In [8]: df_archive.head(25)
Out[8]: [first 25 rows of df_archive; cell values not recoverable from the extracted text]
In [9]: df_archive.info()
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [10]: df_archive.describe()
Out[10]: [summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, rating_numerator and rating_denominator; values not recoverable from the extracted text]
In [11]: df_archive.duplicated().sum()
Out[11]: 0
In [12]: df_archive[df_archive.tweet_id.duplicated()]
Out[12]: Empty DataFrame
Columns: [tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp, source, text
Index: []
In [13]: df_archive.name.value_counts()
Out[13]: None
a
Charlie
Lucy
Oliver
Cooper
Penny
Lola
Tucker
Winston
Bo
the
Sadie
Toby
Buddy
Bailey
an
Daisy
Koda
Stanley
Leo
Scout
Milo
Jax
Bella
Rusty
Jack
-
Oscar
Dave
George
...
6
6
5
Rorie           1
Tuco            1
Daniel          1
Kuyu            1
Dallas          1
Josep           1
Tommy           1
Zuzu            1
Clybe           1
Lupe            1
Theo            1
Bobb            1
Bruno           1
Maxwell         1
Autumn          1
Ashleigh        1
Chevy           1
Arya            1
Hubertson       1
Kathmandu       1
old             1
Aqua            1
Tater           1
Ebby            1
Geoff           1
Leonidas        1
Halo            1
Florence        1
Kenzie          1
Cleopatricia    1
Name: name, Length: 957, dtype: int64
In [14]: df_archive.sample(5)
Out[14]: [sample of five rows of df_archive; cell values not recoverable from the extracted text]
In [15]: df_archive.source.value_counts()
Out[15]: [counts not recoverable from the extracted text]
Twitter for iPhone
Vine - Make a Scene
Twitter Web Client
TweetDeck
Name: source, dtype: int64
In [16]: df_archive.rating_numerator.value_counts()
Out[16]: [counts not recoverable from the extracted text]
Name: rating_numerator, dtype: int64
In [17]: df_archive.rating_denominator.value_counts()
Out[17]: [counts not recoverable from the extracted text]
Name: rating_denominator, dtype: int64
In [18]: df_images.head()
Out[18]: [tweet_id and the p1_conf, p2_conf and p3_conf values are not recoverable from the extracted text]
   jpg_url                                          img_num  p1                      p1_dog  p2                  p2_dog  p3                   p3_dog
0  https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg  1        Welsh_springer_spaniel  True    collie              True    Shetland_sheepdog    True
1  https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg  1        redbone                 True    miniature_pinscher  True    Rhodesian_ridgeback  True
2  https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg  1        German_shepherd         True    malinois            True    bloodhound           True
3  https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg  1        Rhodesian_ridgeback     True    redbone             True    miniature_pinscher   True
4  https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg  1        miniature_pinscher      True    Rottweiler          True    Doberman             True
In [19]: df_images.info()
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [20]: df_images.describe()
Out[20]: [summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for tweet_id, img_num, p1_conf, p2_conf and p3_conf; values not recoverable from the extracted text]
In [21]: df_images.sample(5)
Out[21]: [tweet_id and the p1_conf, p2_conf and p3_conf values are not recoverable from the extracted text]
      jpg_url                                          img_num  p1                  p1_dog  p2                        p2_dog  p3          p3_dog
505   https://pbs.twimg.com/media/CWE_x33UwAEE3no.jpg  1        Italian_greyhound   True    whippet                   True    Great_Dane  True
1834  https://pbs.twimg.com/media/C52V7PzWcAA_pVv.jpg  1        shopping_cart       False   shopping_basket           False   toy_poodle  True
1433  https://pbs.twimg.com/media/Crwxb5yWgAAX5P_.jpg  1        Norwegian_elkhound  True    Chesapeake_Bay_retriever  True    malamute    True
502   https://pbs.twimg.com/media/CWEs1b-WEAEhq82.jpg  1        golden_retriever    True    Welsh_springer_spaniel    True    beagle      True
1317  https://pbs.twimg.com/media/CnsIT0WWcAAul8V.jpg  1        web_site            False   printer                   False   carton      False
In [22]: df_images.duplicated().sum()
Out[22]:
In [23]: df_images[df_images.tweet_id.duplicated()]
Out[23]: Empty DataFrame
Columns: [tweet_id, jpg_url, img_num, p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_
Index: []
In [24]: df_images.p1.value_counts()
Out[24]: golden_retriever
Labrador_retriever
Pembroke
Chihuahua
pug
chow
Samoyed
toy_poodle
Pomeranian
cocker_spaniel
malamute
French_bulldog
miniature_pinscher
Chesapeake_Bay_retriever
seat_belt
German_shepherd
Siberian_husky
Staffordshire_bullterrier
web_site
Cardigan
Maltese_dog
Shetland_sheepdog
teddy
beagle
Eskimo_dog
Lakeland_terrier
Rottweiler
Shih-Tzu
-
kuvasz
Italian_greyhound
16
16
...
groenendael                 1
shopping_basket             1
coho                        1
silky_terrier               1
pot                         1
coffee_mug                  1
pillow                      1
African_hunting_dog         1
park_bench                  1
leopard                     1
bookshop                    1
hammer                      1
terrapin                    1
microwave                   1
panpipe                     1
sulphur-crested_cockatoo    1
book_jacket                 1
walking_stick               1
sandbar                     1
grey_fox                    1
military_uniform            1
hay                         1
ping-pong_ball              1
handkerchief                1
bearskin                    1
mortarboard                 1
orange                      1
boathouse                   1
leaf_beetle                 1
coral_reef                  1
Name: p1, Length: 378, dtype: int64
In [25]: df_images.p2.value_counts()
Out[25]: Labrador_retriever
golden_retriever
Cardigan
Chihuahua
Pomeranian
French_bulldog
Chesapeake_Bay_retriever
toy_poodle
cocker_spaniel
miniature_poodle
Siberian_husky
-
beagle
Eskimo_dog
Pembroke
collie
kuvasz
Italian_greyhound
Pekinese
American_Staffordshire_terrier
miniature_pinscher
Samoyed
malinois
chow
toy_terrier
Boston_bull
Norwegian_elkhound
Staffordshire_bullterrier
pug
Irish_terrier
Shih-Tzu
sarong
hamper
sulphur_butterfly
patio
toaster
hatchet
volcano
turnstile
chain_mail
saltshaker
home_theater
stove
drake
snowmobile
apron
ice_lolly
hyena
Bernese_mountain_dog
assault_rifle
screw
wombat
handkerchief
coral_fungus
racket
waffle_iron
tree_frog
coffee_mug
china_cabinet
-
..-
Gila_monster    1
coral_reef      1
Name: p2, Length: 405, dtype: int64
In [26]: df_images.p3.value_counts()
Out[26]: Labrador_retriever
Chihuahua
golden_retriever
Eskimo_dog
kelpie
kuvasz
Staffordshire_bullterrier
chow
cocker_spaniel
beagle
Pekinese
toy_poodle
Pomeranian
Great_Pyrenees
Pembroke
Chesapeake_Bay_retriever
French_bulldog
malamute
American_Staffordshire_terrier
pug
Cardigan
basenji
bull_mastiff
toy_terrier
Siberian_husky
Shetland_sheepdog
Boston_bull
boxer
Lakeland_terrier
doormat
-
.-
desktop_computer
screen
American_black_bear
mitten
tiger_cat
mongoose
Kerry_blue_terrier
wolf_spider
mosquito_net
mink
eel
Windsor_tie     1
partridge       1
stinkhorn       1
barrow          1
buckeye         1
pretzel         1
grand_piano     1
chimpanzee      1
croquet_ball    1
electric_fan    1
pool_table      1
loupe           1
green_lizard    1
bow             1
hatchet         1
quill           1
cardoon         1
broccoli        1
coral_reef      1
Name: p3, Length: 408, dtype: int64
In [27]: df_counts.head()
Out[27]: [first five rows of id, favorite_count and retweet_count; values not recoverable from the extracted text]
In [28]: df_counts.info()
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
id                2354 non-null int64
favorite_count    2354 non-null int64
retweet_count     2354 non-null int64
dtypes: int64(3)
memory usage: 55.2 KB
In [29]: df_counts.describe()
Out[29]: [summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for id, favorite_count and retweet_count; values not recoverable from the extracted text]
In [30]: df_counts.duplicated().sum()
Out[30]: 0
In [31]: df_counts[df_counts.id.duplicated()]
Out[31]: Empty DataFrame
Columns: [id, favorite_count, retweet_count]
Index: []
1.2.1 Quality issues
1. Keep only the original ratings (no retweets) that have images.
2. Some columns are not needed for the analysis.
3. Wrong datatypes in the tweet_id, source and timestamp columns.
4. Numerators with decimals are recorded incorrectly.
5. Some records have more than one dog stage.
6. The source column holds HTML-formatted strings rather than plain text.
7. Some values in the name column (e.g. a, an, actually) are not dog names.
8. Dog ratings are not standardized.
1.2.2 Tidiness issues
1. The last four columns (doggo, floofer, pupper, puppo) all relate to the same variable.
2. The Twitter API table (retweet_count, favorite_count) and the image predictions table should be merged into the Twitter archive table.
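Several of the issues listed above can also be checked programmatically rather than by eye. The sketch below runs three such checks on a small made-up table; the rows in `toy` and the variable names are hypothetical, and the regular expressions are one reasonable choice rather than the ones used in this notebook.

```python
import pandas as pd

# Toy rows (made up) that imitate the archive columns discussed above.
toy = pd.DataFrame({
    "text": [
        "This is Bella. 13.5/10",
        "Here is a pupper and a doggo 12/10",
        "Meet Jax 11/10",
    ],
    "name": ["Bella", "a", "Jax"],
    "doggo": ["None", "doggo", "None"],
    "pupper": ["None", "pupper", "None"],
})

# Names that start with a lowercase letter are likely ordinary words.
suspect_names = toy.loc[toy["name"].str.match(r"[a-z]"), "name"].tolist()

# Ratings whose numerator contains a decimal point.
has_decimal = toy["text"].str.contains(r"\d+\.\d+/\d+")

# Rows tagged with more than one dog stage.
stage_cols = ["doggo", "pupper"]
multi_stage = (toy[stage_cols] != "None").sum(axis=1) > 1

print(suspect_names)
print(has_decimal.tolist())
print(multi_stage.tolist())
```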
1.3 Cleaning Data
In [32]: # Make copies of original pieces of data
df_archive_clean = df_archive.copy()
df_images_clean = df_images.copy()
df_counts_clean = df_counts.copy()
1.3.1 Issue #1: The Twitter API table (retweet_count, favorite_count) and the image predictions table share the tweet identifier column with the Twitter archive
Define: Merge the Twitter API table, the image predictions table and the Twitter archive table into one DataFrame
Code
In [33]: #First we need to rename the id column in the twitter api table
df_counts_clean.rename(columns = {'id' : 'tweet_id', 'favorite_count' : 'favorite_count
In [34]: # Merge all three datasets on tweet_id
df_merge = [df_archive_clean, df_images_clean, df_counts_clean]
df_twitter_dogs = reduce(lambda left, right: pd.merge(left, right, on = 'tweet_id'), d
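The rename-then-merge pattern used here can be tried on a few made-up rows to see how the default inner join behaves. This is a sketch only: the three small frames and their values are hypothetical, and `how="inner"` is spelled out to make explicit what `pd.merge` does when `how` is omitted.

```python
from functools import reduce

import pandas as pd

# Toy tables (made-up values) that share a tweet identifier.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
images = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "chow"]})
counts = pd.DataFrame({"id": [1, 2, 3], "favorite_count": [10, 20, 30]})

# The API table calls the key "id", so rename it before merging.
counts = counts.rename(columns={"id": "tweet_id"})

frames = [archive, images, counts]
merged = reduce(
    lambda left, right: pd.merge(left, right, on="tweet_id", how="inner"),
    frames,
)
print(merged)
```

Because the join is inner, tweet 3 is dropped: it has no row in `images`. That is the same mechanism by which the merged table in this notebook ends up with fewer rows than the archive.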
In [35]: df_twitter_dogs.head()
Out[35]: [first five rows of the merged table; cell values not recoverable from the extracted text]
In [36]: df_twitter_dogs.info()
Int64Index: 2073 entries, 0 to 2072
Data columns (total 30 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2073 non-null object
source                        2073 non-null object
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    79 non-null object
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          2073 non-null object
doggo                         2073 non-null object
floofer                       2073 non-null object
pupper                        2073 non-null object
puppo                         2073 non-null object
jpg_url                       2073 non-null object
img_num                       2073 non-null int64
p1                            2073 non-null object
p1_conf                       2073 non-null float64
p1_dog                        2073 non-null bool
p2                            2073 non-null object
p2_conf                       2073 non-null float64
p2_dog                        2073 non-null bool
p3                            2073 non-null object
p3_conf                       2073 non-null float64
p3_dog                        2073 non-null bool
favorite_count                2073 non-null int64
retweet_count                 2073 non-null int64
dtypes: bool(3), float64(7), int64(6), object(14)
memory usage: 459.5+ KB
1.3.2 Issue #2: The last four columns (doggo, floofer, pupper, puppo) all relate to the same variable
Define: doggo, floofer, pupper and puppo columns in twitter_archive table should be merged into one column named "dog_stage"
Code
In [37]: # Extract the dog stage from the tweet text into a new column named "dog_type"
df_twitter_dogs['dog_type'] = df_twitter_dogs['text'].str.extract('(doggo|floofer|puppe
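`str.extract` keeps only the first stage word it finds, so a tweet that mentions two stages is reduced to one. The sketch below contrasts that with `str.findall`, which keeps every match; the three texts are made up, and `stages_joined` is a hypothetical name, not a column from this notebook.

```python
import pandas as pd

texts = pd.Series([
    "This is a very good pupper 12/10",
    "Here we have a doggo and her pupper 13/10",
    "No stage mentioned 11/10",
])

pattern = r"(doggo|floofer|pupper|puppo)"

# First match only, which is what str.extract returns.
first_stage = texts.str.extract(pattern, expand=False)

# Every match, de-duplicated in order and joined, so multi-stage tweets stay visible.
stages_joined = texts.str.findall(pattern).apply(
    lambda found: ", ".join(dict.fromkeys(found)) if found else "None"
)

print(first_stage.tolist())
print(stages_joined.tolist())
```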
Test
In [38]: df_twitter_dogs[['dog_type']].head()
Out[38]:
  dog_type
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
In [39]: df_twitter_dogs[['doggo', 'floofer', 'pupper', 'puppo']].head()
Out[39]:
  doggo floofer pupper puppo
0  None    None   None  None
1  None    None   None  None
2  None    None   None  None
3  None    None   None  None
4  None    None   None  None
In [40]: df_twitter_dogs[['doggo', 'floofer', 'pupper', 'puppo']].sample(25)
Out[40]: [sample of 25 rows of doggo, floofer, pupper and puppo; row layout not recoverable from the extracted text]
In [41]: df_twitter_dogs.info()
Int64Index: 2073 entries, 0 to 2072
Data columns (total 31 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2073 non-null object
source                        2073 non-null object
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    79 non-null object
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          2073 non-null object
doggo                         2073 non-null object
floofer                       2073 non-null object
pupper                        2073 non-null object
puppo                         2073 non-null object
jpg_url                       2073 non-null object
img_num                       2073 non-null int64
p1                            2073 non-null object
p1_conf                       2073 non-null float64
p1_dog                        2073 non-null bool
p2                            2073 non-null object
p2_conf                       2073 non-null float64
p2_dog                        2073 non-null bool
p3                            2073 non-null object
p3_conf                       2073 non-null float64
p3_dog                        2073 non-null bool
favorite_count                2073 non-null int64
retweet_count                 2073 non-null int64
dog_type                      337 non-null object
dtypes: bool(3), float64(7), int64(6), object(15)
memory usage: 475.7+ KB
In [42]: df_twitter_dogs[df_twitter_dogs.tweet_id ==-]
Out[42]: [single matching row; cell values not recoverable from the extracted text]
Int64Index: 1994 entries, 0 to 2072
Data columns (total 31 columns):
tweet_id                      1994 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     1994 non-null object
source                        1994 non-null object
text                          1994 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1994 non-null object
rating_numerator              1994 non-null int64
rating_denominator            1994 non-null int64
name                          1994 non-null object
doggo                         1994 non-null object
floofer                       1994 non-null object
pupper                        1994 non-null object
puppo                         1994 non-null object
jpg_url                       1994 non-null object
img_num                       1994 non-null int64
p1                            1994 non-null object
p1_conf                       1994 non-null float64
p1_dog                        1994 non-null bool
p2                            1994 non-null object
p2_conf                       1994 non-null float64
p2_dog                        1994 non-null bool
p3                            1994 non-null object
p3_conf                       1994 non-null float64
p3_dog                        1994 non-null bool
favorite_count                1994 non-null int64
retweet_count                 1994 non-null int64
dog_type                      326 non-null object
dtypes: bool(3), float64(7), int64(6), object(15)
memory usage: 457.6+ KB
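In the output above the row count is 1994 rather than 2073 and the three retweeted_status columns have no non-null values, which is what removing retweets would produce. One way such a filter could be written is sketched below on made-up rows; this is an assumption about the approach, not the notebook's own cell.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): tweet 2 is a retweet because retweeted_status_id is set.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 9.0, np.nan],
})

# Keep only rows that are not retweets; copy() avoids working on a view.
originals = toy[toy["retweeted_status_id"].isnull()].copy()
print(originals)
```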
1.3.4 Issue #4: There are some columns not needed for our analysis
Define: Drop all the columns not needed for our analysis
Code
In [46]: #drop unused columns
df_twitter_dogs = df_twitter_dogs.drop(['in_reply_to_status_id','in_reply_to_user_id','
Test
In [47]: df_twitter_dogs.info()
Int64Index: 1994 entries, 0 to 2072
Data columns (total 25 columns):
tweet_id              1994 non-null int64
timestamp             1994 non-null object
source                1994 non-null object
text                  1994 non-null object
rating_numerator      1994 non-null int64
rating_denominator    1994 non-null int64
name                  1994 non-null object
doggo                 1994 non-null object
floofer               1994 non-null object
pupper                1994 non-null object
puppo                 1994 non-null object
jpg_url               1994 non-null object
img_num               1994 non-null int64
p1                    1994 non-null object
p1_conf               1994 non-null float64
p1_dog                1994 non-null bool
p2                    1994 non-null object
p2_conf               1994 non-null float64
p2_dog                1994 non-null bool
p3                    1994 non-null object
p3_conf               1994 non-null float64
p3_dog                1994 non-null bool
favorite_count        1994 non-null int64
retweet_count         1994 non-null int64
dog_type              326 non-null object
dtypes: bool(3), float64(3), int64(6), object(13)
memory usage: 364.1+ KB
In [48]: df_twitter_dogs.head()
Out[48]: [first five rows after dropping the unused columns; cell values not recoverable from the extracted text]
Int64Index: 1994 entries, 0 to 2072
Data columns (total 25 columns):
tweet_id              1994 non-null object
timestamp             1994 non-null datetime64[ns]
source                1994 non-null category
text                  1994 non-null object
rating_numerator      1994 non-null int64
rating_denominator    1994 non-null int64
name                  1994 non-null object
doggo                 1994 non-null object
floofer               1994 non-null object
pupper                1994 non-null object
puppo                 1994 non-null object
jpg_url               1994 non-null object
img_num               1994 non-null int64
p1                    1994 non-null object
p1_conf               1994 non-null float64
p1_dog                1994 non-null bool
p2                    1994 non-null object
p2_conf               1994 non-null float64
p2_dog                1994 non-null bool
p3                    1994 non-null object
p3_conf               1994 non-null float64
p3_dog                1994 non-null bool
favorite_count        1994 non-null int64
retweet_count         1994 non-null int64
dog_type              326 non-null object
dtypes: bool(3), category(1), datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 350.6+ KB
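The output above shows tweet_id as object, timestamp as datetime64[ns] and source as category, matching the datatype issue listed under Quality issues. A conversion along these lines would produce those types; this is a sketch on made-up rows, and the `utc=True` plus `tz_localize(None)` step is an assumption added so that recent pandas versions also return a timezone-naive column.

```python
import pandas as pd

# Toy rows (made-up identifiers and dates) in the archive's timestamp format.
toy = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
    "source": ["Twitter for iPhone", "Twitter for iPhone"],
})

# Identifiers are labels, not quantities, so store them as strings.
toy["tweet_id"] = toy["tweet_id"].astype(str)

# Parse as UTC, then drop the timezone to get a naive datetime column.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], utc=True).dt.tz_localize(None)

# A handful of repeated labels fits the category dtype.
toy["source"] = toy["source"].astype("category")

print(toy.dtypes)
```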
In [51]: df_twitter_dogs.head()
Out[51]: [first five rows after the datatype changes; cell values not recoverable from the extracted text]
Int64Index: 1994 entries, 0 to 2072
Data columns (total 27 columns):
tweet_id              1994 non-null object
timestamp             1994 non-null datetime64[ns]
source                1994 non-null category
text                  1994 non-null object
rating_numerator      1994 non-null float64
rating_denominator    1994 non-null float64
name                  1994 non-null object
doggo                 1994 non-null object
floofer               1994 non-null object
pupper                1994 non-null object
puppo                 1994 non-null object
jpg_url               1994 non-null object
img_num               1994 non-null int64
p1                    1994 non-null object
p1_conf               1994 non-null float64
p1_dog                1994 non-null bool
p2                    1994 non-null object
p2_conf               1994 non-null float64
p2_dog                1994 non-null bool
p3                    1994 non-null object
p3_conf               1994 non-null float64
p3_dog                1994 non-null bool
favorite_count        1994 non-null int64
retweet_count         1994 non-null int64
dog_type              326 non-null object
all_stages            1994 non-null object
dog_stage             1994 non-null object
dtypes: bool(3), category(1), datetime64[ns](1), float64(5), int64(3), object(14)
memory usage: 381.8+ KB
1.3.8 Issue #8: The source column holds HTML-formatted strings rather than plain text
Define: Extract the plain-text source name from each HTML-formatted string
Code
In [59]: #extract values
df_twitter_dogs.source = df_twitter_dogs.source.str.extract('>([\w\W\s]*)<', expand=Tru
In [60]: df_twitter_dogs.source.value_counts()
Out[60]: Twitter for iPhone    1955
Twitter Web Client    28
TweetDeck             11
Name: source, dtype: int64
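The extraction above pulls the visible text out of an HTML anchor tag. A narrower pattern, `>([^<]*)<`, does the same job without a greedy match-anything class; the sketch below applies it to two illustrative strings written in the anchor-tag shape, which are assumptions for the example rather than rows copied from the dataset.

```python
import pandas as pd

# Illustrative anchor tags (assumed shape of the source column before cleaning).
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Capture everything between the first ">" and the next "<".
link_text = source.str.extract(r">([^<]*)<", expand=False)
print(link_text.tolist())
```

With `expand=False` and a single capture group the result is a Series, so it can be assigned straight back to a column.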
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
https://stackoverflow.com/questions/-/remove-unwanted-parts-from-strings-in-acolumn?noredirect=1
In [61]: # Define a function that strips the trailing link and apply it to the text column
def htmlink(x):
    http_position = x.find("http")
    # If there is no link, keep the text unchanged
    if http_position == -1:
        return x
    # Drop everything from the link onward, plus the whitespace before it
    return x[:http_position].rstrip()
df_twitter_dogs.text = df_twitter_dogs.text.apply(htmlink)
https://stackoverflow.com/questions/-/remove-unwanted-parts-from-strings-in-acolumn?noredirect=1
Test
In [62]: #confirm that all the hyperlinks have been removed
for row in df_twitter_dogs.text[:10]:
    print(row)
This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10
This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available fo
This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know w
This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us
This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and sh
Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breath
Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more thing
When you watch your owner call another dog a good boy but then they turn back to you and say you
This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettabl
This is Cassie. She is a college pup. Studying international doggo communication and stick theor
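The link removal shown in this section can also be done without a Python-level function, using a vectorized regular-expression replace. The following is a sketch under the same rule (drop everything from the first "http" to the end of the text); the three strings and their shortened URLs are made up.

```python
import pandas as pd

text = pd.Series([
    "This is Phineas. 13/10 https://t.co/abc123",
    "No link here 12/10",
    "https://t.co/onlylink",
])

# (?s) lets "." span newlines; \s* also removes the whitespace before the link.
cleaned = text.str.replace(r"(?s)\s*http.*$", "", regex=True)
print(cleaned.tolist())
```

A text that consists only of a link becomes an empty string, which may be worth checking for before further analysis.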
1.3.9 Issue #9: Some values in the name column (e.g. a, an, actually) are not dog names
Define: Replace every name that starts with a lowercase letter with 'None'
Code
In [63]: df_twitter_dogs.name.unique()
Out[63]: array(['Phineas', 'Tilly', 'Archie', 'Darla', 'Franklin', 'None', 'Jax',
'Zoey', 'Cassie', 'Koda', 'Bruno', 'Ted', 'Stuart', 'Oliver', 'Jim',
'Zeke', 'Ralphus', 'Gerald', 'Jeffrey', 'such', 'Canela', 'Maya',
'Mingus', 'Derek', 'Roscoe', 'Waffles', 'Jimbo', 'Maisey', 'Earl',
'Lola', 'Kevin', 'Yogi', 'Noah', 'Bella', 'Grizzwald', 'Rusty',
'Gus', 'Stanley', 'Alfy', 'Koko', 'Rey', 'Gary', 'a', 'Elliot',
'Louis', 'Jesse', 'Romeo', 'Bailey', 'Duddles', 'Jack', 'Steven',
'Beau', 'Snoopy', 'Shadow', 'Emmy', 'Aja', 'Penny', 'Dante',
'Nelly', 'Ginger', 'Benedict', 'Venti', 'Goose', 'Nugget', 'Cash',
'Jed', 'Sebastian', 'Sierra', 'Monkey', 'Harry', 'Kody', 'Lassie',
'Rover', 'Napolean', 'Boomer', 'Cody', 'Rumble', 'Clifford',
'Dewey', 'Scout', 'Gizmo', 'Walter', 'Cooper', 'Harold', 'Shikha',
'Lili', 'Jamesy', 'Coco', 'Sammy', 'Meatball', 'Paisley', 'Albus',
'Neptune', 'Belle', 'Quinn', 'Zooey', 'Dave', 'Jersey', 'Hobbes',
'Burt', 'Lorenzo', 'Carl', 'Jordy', 'Milky', 'Trooper', 'quite',
'Sophie', 'Wyatt', 'Rosie', 'Thor', 'Oscar', 'Callie', 'Cermet',
'Marlee', 'Arya', 'Einstein', 'Alice', 'Rumpole', 'Benny', 'Aspen',
'Jarod', 'Wiggles', 'General', 'Sailor', 'Iggy', 'Snoop', 'Kyle',
'Leo', 'Riley', 'Noosh', 'Odin', 'Jerry', 'Georgie', 'Rontu',
'Cannon', 'Furzey', 'Daisy', 'Tuck', 'Barney', 'Vixen', 'Jarvis',
'Mimosa', 'Pickles', 'Brady', 'Luna', 'Charlie', 'Margo', 'Sadie',
'Hank', 'Tycho', 'Indie', 'Winnie', 'George', 'Bentley', 'Max',
'Dawn', 'Maddie', 'Monty', 'Sojourner', 'Winston', 'Odie', 'Arlo',
'Vincent', 'Lucy', 'Clark', 'Mookie', 'Meera', 'Ava', 'Eli', 'Ash',
'Tucker', 'Tobi', 'Chester', 'Wilson', 'Sunshine', 'Lipton',
'Bronte', 'Poppy', 'Gidget', 'Rhino', 'Willow', 'Orion', 'Eevee',
'Smiley', 'Miguel', 'Emanuel', 'Kuyu', 'Dutch', 'Pete', 'Scooter',
'Reggie', 'Lilly', 'Samson', 'Mia', 'Astrid', 'Malcolm', 'Dexter',
'Alfie', 'Fiona', 'one', 'Mutt', 'Bear', 'Doobert', 'Beebop',
'Alexander', 'Sailer', 'Brutus', 'Kona', 'Boots', 'Ralphie', 'Loki',
'Cupid', 'Pawnd', 'Pilot', 'Ike', 'Mo', 'Toby', 'Sweet', 'Pablo',
'Nala', 'Crawford', 'Gabe', 'Jimison', 'Duchess', 'Harlso',
'Sundance', 'Luca', 'Flash', 'Sunny', 'Howie', 'Jazzy', 'Anna',
'Finn', 'Bo', 'Wafer', 'Tom', 'Florence', 'Autumn', 'Buddy', 'Dido',
'Eugene', 'Ken', 'Strudel', 'Tebow', 'Chloe', 'Timber', 'Binky',
'Moose', 'Dudley', 'Comet', 'Akumi', 'Titan', 'Olivia', 'Alf',
'Oshie', 'Chubbs', 'Sky', 'Atlas', 'Eleanor', 'Layla', 'Rocky',
'Baron', 'Tyr', 'Bauer', 'Swagger', 'Brandi', 'Mary', 'Moe', 'Halo',
'Augie', 'Craig', 'Sam', 'Hunter', 'Pavlov', 'Phil', 'Kyro',
'Wallace', 'Ito', 'Ollie', 'Stephan', 'Lennon', 'incredibly',
'Major', 'Duke', 'Sansa', 'Shooter', 'Django', 'Diogi', 'Sonny',
'Marley', 'Severus', 'Ronnie', 'Milo', 'Bones', 'Mauve', 'Chef',
'Doc', 'Peaches', 'Sobe', 'Longfellow', 'Mister', 'Iroh', 'Pancake',
'Snicku', 'Ruby', 'Brody', 'Mack', 'Nimbus', 'Laika', 'Maximus',
'Dobby', 'Moreton', 'Juno', 'Maude', 'Lily', 'Newt', 'Benji',
'Nida', 'Robin', 'Monster', 'BeBe', 'Remus', 'Levi', 'Mabel',
'Misty', 'Betty', 'Mosby', 'Maggie', 'Bruce', 'Happy', 'Brownie',
'Rizzy', 'Stella', 'Butter', 'Frank', 'Tonks', 'Lincoln', 'Rory',
'Logan', 'Dale', 'Rizzo', 'Mattie', 'Pinot', 'Dallas', 'Hero',
'Frankie', 'Stormy', 'Mairi', 'Loomis', 'Godi', 'Cali', 'Deacon',
'Timmy', 'Sampson', 'Chipson', 'Oakley', 'Dash', 'Hercules', 'Jay',
'Mya', 'Strider', 'Wesley', 'Solomon', 'Huck', 'O', 'Blue',
'Anakin', 'Finley', 'Sprinkles', 'Heinrich', 'Shakespeare',
'Chelsea', 'Bungalo', 'Chip', 'Grey', 'Roosevelt', 'Willem',
'Davey', 'Dakota', 'Fizz', 'Dixie', 'very', 'Al', 'Jackson',
'Carbon', 'Klein', 'DonDon', 'Kirby', 'Lou', 'Chevy', 'Tito',
'Philbert', 'Louie', 'Rupert', 'Rufus', 'Brudge', 'Shadoe', 'Angel',
'Brat', 'Tove', 'my', 'Gromit', 'Aubie', 'Kota', 'Leela', 'Glenn',
'Shelby', 'Sephie', 'Bonaparte', 'Albert', 'Wishes', 'Rose', 'Theo',
'Rocco', 'Fido', 'Emma', 'Spencer', 'Lilli', 'Boston', 'Brandonald',
'Corey', 'Leonard', 'Beckham', 'Devón', 'Gert', 'Watson', 'Keith',
'Dex', 'Ace', 'Tayzie', 'Grizzie', 'Gilbert', 'Meyer', 'Arnie',
'Zoe', 'Stewie', 'Calvin', 'Lilah', 'Spanky', 'Jameson', 'Piper',
'Atticus', 'Blu', 'Dietrich', 'not', 'Divine', 'Tripp', 'his',
'Cora', 'Huxley', 'Bookstore', 'Abby', 'Shiloh', 'an', 'Gustav',
'Arlen', 'Percy', 'Lenox', 'Sugar', 'Harvey', 'Blanket', 'Geno',
'Stark', 'Beya', 'Kilo', 'Kayla', 'Maxaroni', 'Bell', 'Doug',
'Edmund', 'Aqua', 'Theodore', 'just', 'Baloo', 'Chase', 'getting',
'Nollie', 'Rorie', 'Simba', 'Charles', 'Bayley', 'Axel', 'Storkson',
'Remy', 'Chadrick', 'Kellogg', 'Buckley', 'Livvie', 'Terry',
'Hermione', 'Ralpher', 'Aldrick', 'Larry', 'this', 'unacceptable',
'Rooney', 'Crystal', 'Ziva', 'Stefan', 'Pupcasso', 'Puff',
'Flurpson', 'Coleman', 'Enchilada', 'Raymond', 'all', 'Rueben',
'Cilantro', 'Karll', 'Sprout', 'Blitz', 'Bloop', 'Colby', 'Lillie',
'Fred', 'Ashleigh', 'Kreggory', 'Sarge', 'Luther', 'Reginald',
'Ivar', 'Jangle', 'Schnitzel', 'Panda', 'Berkeley', 'Ralphé',
'Charleson', 'Clyde', 'Harnold', 'Sid', 'Pippa', 'Otis', 'Carper',
'Bowie', 'Alexanderson', 'Suki', 'Barclay', 'Ebby', 'Flávio',
'Smokey', 'Link', 'Jennifur', 'Bluebert', 'Stephanus', 'Bubbles',
'Zeus', 'Bertson', 'Nico', 'Michelangelope', 'Siba', 'Calbert',
'Curtis', 'Travis', 'Thumas', 'Kanu', 'Lance', 'Opie', 'Stubert',
'Kane', 'Olive', 'Chuckles', 'Staniel', 'Sora', 'Beemo', 'Gunner',
'infuriating', 'Lacy', 'Tater', 'Olaf', 'Cecil', 'Vince', 'Karma',
'Billy', 'Walker', 'Rodney', 'Klevin', 'Malikai', 'Bobble', 'River',
'Jebberson', 'Remington', 'Farfle', 'Jiminus', 'Harper', 'Keurig',
'Clarkus', 'Finnegus', 'Cupcake', 'Kathmandu', 'Ellie', 'Katie',
'Kara', 'Adele', 'Zara', 'Ambrose', 'Jimothy', 'Bode', 'Terrenth',
'Reese', 'Chesterson', 'Lucia', 'Bisquick', 'Ralphson', 'Socks',
'Rambo', 'Fiji', 'Rilo', 'Bilbo', 'Coopson', 'Yoda', 'Millie',
'Chet', 'Crouton', 'Daniel', 'Kaia', 'Murphy', 'Dotsy', 'Eazy',
'Coops', 'Fillup', 'Miley', 'Charl', 'Reagan', 'CeCe', 'Cuddles',
'Claude', 'Jessiga', 'Carter', 'Ole', 'Blipson', 'Reptar',
'Trevith', 'Berb', 'Bob', 'Colin', 'Brian', 'Oliviér', 'Grady',
'Kobe', 'Freddery', 'Bodie', 'Dunkin', 'Wally', 'Tupawc', 'Amber',
'Herschel', 'Edgar', 'Kingsley', 'Brockly', 'Richie', 'Molly',
'Vinscent', 'Cedrick', 'Hazel', 'Lolo', 'Eriq', 'Phred', 'the',
'Maxwell', 'Geoff', 'Covach', 'Durg', 'Fynn', 'Ricky', 'Herald',
'Lucky', 'Trip', 'Clarence', 'Hamrick', 'Brad', 'Pubert', 'Frönq',
'Derby', 'Lizzie', 'Blakely', 'Opal', 'Marq', 'Kramer', 'Tyrone',
'Gordon', 'Baxter', 'Mona', 'Horace', 'Crimson', 'Birf', 'Hammond',
'Lorelei', 'Marty', 'Brooks', 'Petrick', 'Hubertson', 'Gerbald',
'Oreo', 'Bruiser', 'Perry', 'Bobby', 'Jeph', 'Obi', 'Tino', 'Kulet',
'Lupe', 'Tiger', 'Jiminy', 'Griffin', 'Banjo', 'Brandy', 'Lulu',
'Darrel', 'Taco', 'Joey', 'Patrick', 'Kreg', 'Todo', 'Tess',
'Ulysses', 'Toffee', 'Apollo', 'Carly', 'Asher', 'Glacier', 'Chuck',
'actually', 'Champ', 'Ozzie', 'Griswold', 'Cheesy', 'Moofasa',
'Hector', 'Goliath', 'Kawhi', 'Ozzy', 'by', 'Emmie', 'Penelope',
'Willie', 'Rinna', 'Mike', 'William', 'Dwight', 'Evy', 'Hurley',
'Rubio', 'officially', 'Chompsky', 'Linda', 'Tug', 'Tango', 'Grizz',
'Jerome', 'Crumpet', 'Jessifer', 'Ralph', 'Sandy', 'Humphrey',
'Tassy', 'Juckson', 'Chuq', 'Tyrus', 'Karl', 'Godzilla', 'Vinnie',
'Kenneth', 'Herm', 'Bert', 'Striker', 'Donny', 'Pepper', 'Bernie',
'Buddah', 'Lenny', 'Arnold', 'Zuzu', 'Mollie', 'Laela', 'Tedders',
'Superpup', 'Rufio', 'Jeb', 'Rodman', 'Jonah', 'Chesney', 'Kenny',
'Henry', 'Bobbay', 'Mitch', 'Kaiya', 'Acro', 'Aiden', 'Obie', 'Dot',
'Shnuggles', 'Kendall', 'Jeffri', 'Steve', 'Eve', 'Mac', 'Fletcher',
'Kenzie', 'Pumpkin', 'Schnozz', 'Gustaf', 'Cheryl', 'Ed',
'Leonidas', 'Norman', 'Caryl', 'Scott', 'Taz', 'Darby', 'Jackie',
'light', 'Jazz', 'Franq', 'Pippin', 'Rolf', 'Snickers', 'Ridley',
'Cal', 'Bradley', 'Bubba', 'Tuco', 'Patch', 'Mojo', 'Batdog',
'Dylan', 'space', 'Mark', 'JD', 'Alejandro', 'Scruffers', 'Pip',
'Julius', 'Tanner', 'Sparky', 'Anthony', 'Holly', 'Jett', 'Amy',
'Sage', 'Andy', 'Mason', 'Trigger', 'Antony', 'Creg', 'Traviss',
'Gin', 'Jeffrie', 'Danny', 'Ester', 'Pluto', 'Bloo', 'Edd', 'Paull',
'Willy', 'Herb', 'Damon', 'Peanut', 'Nigel', 'Butters', 'Sandra',
'Fabio', 'Randall', 'Liam', 'Tommy', 'Ben', 'Raphael', 'Julio',
'Andru', 'Kloey', 'Shawwn', 'Skye', 'Kollin', 'Ronduh', 'Billl',
'Saydee', 'Dug', 'Tessa', 'Sully', 'Kirk', 'Ralf', 'Clarq',
'Jaspers', 'Samsom', 'Terrance', 'Harrison', 'Chaz', 'Jeremy',
'Jaycob', 'Lambeau', 'Ruffles', 'Amélie', 'Bobb', 'Banditt',
'Kevon', 'Winifred', 'Hanz', 'Churlie', 'Zeek', 'Timofy', 'Maks',
'Jomathan', 'Kallie', 'Marvin', 'Spark', 'Gòrdón', 'Jo', 'DayZ',
'Jareld', 'Torque', 'Ron', 'Skittles', 'Cleopatricia', 'Erik',
'Stu', 'Tedrick', 'Shaggy', 'Filup', 'Kial', 'Naphaniel', 'Dook',
'Hall', 'Philippe', 'Biden', 'Fwed', 'Genevieve', 'Joshwa',
'Timison', 'Bradlay', 'Pipsy', 'Clybe', 'Keet', 'Carll', 'Jockson',
'Josep', 'Lugan', 'Christoper'], dtype=object)
In [64]: # Names that start with a lowercase letter are ordinary words, not dog names.
# Assigning through .loc avoids chained indexing and the SettingWithCopyWarning it triggers.
df_twitter_dogs.loc[df_twitter_dogs['name'].str.match('[a-z]+'), 'name'] = 'None'
Test
In [65]: df_twitter_dogs.name.value_counts()
Out[65]: None
Charlie
Oliver
Cooper
Lucy
Tucker
Penny
Winston
Sadie
Toby
Daisy
Lola
Jax
Koda
Bella
Bo
Stanley
Leo
Chester
Milo
Buddy
Louis
Oscar
Scout
Dave
Rusty
Bailey
-
Winnie
Alfie
Bear
...
4
4
4
Margo           1
Cal             1
Travis          1
Alfy            1
Sailer          1
Randall         1
Noosh           1
Skye            1
Emma            1
Sobe            1
Mary            1
Sonny           1
Emmy            1
Ron             1
Tino            1
Flash           1
Bobble          1
Jeffri          1
Leela           1
Strider         1
Rose            1
Butter          1
Fletcher        1
Marty           1
Torque          1
Kial            1
Ester           1
Tobi            1
Bobby           1
Cleopatricia    1
Name: name, Length: 914, dtype: int64
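The SettingWithCopyWarning raised by In [64] comes from chained indexing (`df['name'][mask] = ...`). A minimal sketch of the same replacement written as a single `.loc` assignment follows; the `toy` frame and its values are hypothetical stand-ins for `df_twitter_dogs`, not data from the archive.

```python
import pandas as pd

# Hypothetical toy data standing in for df_twitter_dogs
toy = pd.DataFrame({'name': ['Charlie', 'a', 'quite', 'Lucy', 'None']})

# str.match anchors at the start of the string, so this flags
# every name that begins with a lowercase letter
lowercase_mask = toy['name'].str.match(r'[a-z]+')

# One .loc call selects rows and column together, so pandas writes
# to the original frame instead of a temporary copy
toy.loc[lowercase_mask, 'name'] = 'None'

print(toy['name'].tolist())
```

Because the row mask and the column label go into one indexing call, there is no intermediate object for pandas to warn about.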
1.3.10 Issue #10: Dog ratings are not standardized
Define: Standardize dog ratings
Code
In [66]: df_twitter_dogs['rating_numerator'] = df_twitter_dogs['rating_numerator'].astype(float)
df_twitter_dogs['rating_denominator'] = df_twitter_dogs['rating_denominator'].astype(float)
In [67]: #Test
df_twitter_dogs.info()
Int64Index: 1994 entries, 0 to 2072
Data columns (total 27 columns):
tweet_id              1994 non-null object
timestamp             1994 non-null datetime64[ns]
source                1994 non-null object
text                  1994 non-null object
rating_numerator      1994 non-null float64
rating_denominator    1994 non-null float64
name                  1994 non-null object
doggo                 1994 non-null object
floofer               1994 non-null object
pupper                1994 non-null object
puppo                 1994 non-null object
jpg_url               1994 non-null object
img_num               1994 non-null int64
p1                    1994 non-null object
p1_conf               1994 non-null float64
p1_dog                1994 non-null bool
p2                    1994 non-null object
p2_conf               1994 non-null float64
p2_dog                1994 non-null bool
p3                    1994 non-null object
p3_conf               1994 non-null float64
p3_dog                1994 non-null bool
favorite_count        1994 non-null int64
retweet_count         1994 non-null int64
dog_type              326 non-null object
all_stages            1994 non-null object
dog_stage             1994 non-null object
dtypes: bool(3), datetime64[ns](1), float64(5), int64(3), object(15)
memory usage: 395.3+ KB
In [68]: # Gather the text, index, and rating of every tweet whose
# rating numerator contains a decimal point
ratings_decimals_text = []
ratings_decimals_index = []
ratings_decimals = []
for i, text in df_twitter_dogs['text'].items():
    # Capture the decimal numerator that sits directly before the slash
    match = re.search(r'(\d+\.\d+)/\d+', text)
    if match:
        ratings_decimals_text.append(text)
        ratings_decimals_index.append(i)
        ratings_decimals.append(match.group(1))
ratings_decimals_text
Out[68]: ['This is Bella. She hopes her smile made you smile. If not, she is also offering you h
"This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin
"This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random ju
'Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10']
In [69]: ratings_decimals_index
Out[69]: [40, 558, 614, 1451]
In [70]: #Convert the decimal ratings to float
df_twitter_dogs.loc[ratings_decimals_index[0],'rating_numerator'] = float(ratings_decim
df_twitter_dogs.loc[ratings_decimals_index[1],'rating_numerator'] = float(ratings_decim
df_twitter_dogs.loc[ratings_decimals_index[2],'rating_numerator'] = float(ratings_decim
df_twitter_dogs.loc[ratings_decimals_index[3],'rating_numerator'] = float(ratings_decim
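The four assignments in In [70] are cut off in this export. As a hedged sketch of the same idea, the loop below finds each decimal rating and writes it back in one pass, so it does not depend on how many matches there are. The `toy` frame, its index labels, and its texts are hypothetical; only the column names follow the notebook.

```python
import re
import pandas as pd

# Hypothetical toy data; index labels and texts are illustrative only
toy = pd.DataFrame(
    {'text': ['Meet Rex. 13/10', 'Average of 11.26/10', 'Solid 9.75/10 pup'],
     'rating_numerator': [13.0, 26.0, 75.0]},
    index=[3, 7, 9])

# A decimal number immediately followed by "/<denominator>"
decimal_rating = re.compile(r'(\d+\.\d+)/\d+')

for idx, text in toy['text'].items():
    match = decimal_rating.search(text)
    if match:
        # group(1) is the decimal numerator captured before the slash
        toy.loc[idx, 'rating_numerator'] = float(match.group(1))

print(toy['rating_numerator'].tolist())
```

Rows without a decimal rating are left untouched, which matches the intent of correcting only the mis-parsed numerators.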
In [71]: # Create a new column called rating, and calculate the value with new, standardized ratings
df_twitter_dogs['rating'] = df_twitter_dogs['rating_numerator'] / df_twitter_dogs['rating_denominator']
Test
In [72]: df_twitter_dogs.sample(10)
Out[72]:
text                                                rating_numerator  rating_denominator  source              \
This is Beau. That is Beau's balloon. He takes...   13.0              10.0                Twitter for iPhone
This is Remy. He has some long ass ears (proba...   10.0              10.0                Twitter for iPhone
This is Dave. He's a tropical pup. Short lil l...   5.0               10.0                Twitter for iPhone
"Yes hi could I get a number 4 with no pickles...   12.0              10.0                Twitter for iPhone
This is Oliver. He does toe touches in his sle...   13.0              10.0                Twitter for iPhone
This is Theodore. He just saw an adult wearing...   12.0              10.0                Twitter for iPhone
This is Axel. He's a professional leaf catcher...   12.0              10.0                Twitter for iPhone
Unique dog here. Oddly shaped tail. Long pink ...   4.0               10.0                Twitter for iPhone
Breathtaking pupper here. Should be on the cov...   12.0              10.0                Twitter for iPhone
This is Bear. He's a passionate believer of th...   12.0              10.0                Twitter for iPhone

name      doggo  floofer  pupper  ...  p2_dog  \
Beau      None   None     None    ...  True
Remy      None   None     None    ...  True
Dave      None   None     None    ...  False
None      None   None     None    ...  True
Oliver    None   None     None    ...  True
Theodore  None   None     None    ...  True
Axel      None   None     None    ...  True
None      None   None     None    ...  False
None      None   None     pupper  ...  True
Bear      None   None     None    ...  True

p3                              p3_dog  \
American_Staffordshire_terrier  True
Chesapeake_Bay_retriever        True
dugong                          False
Tibetan_mastiff                 True
golden_retriever                True
Pekinese                        True
malamute                        True
goldfish                        False
Eskimo_dog                      True
beagle                          True

retweet_count  dog_type  all_stages          dog_stage  rating
2812           NaN       NoneNoneNoneNone    None       1.3
2006           NaN       NoneNoneNoneNone    None       1.0
5174           NaN       NoneNoneNoneNone    None       0.5
1727           NaN       NoneNoneNoneNone    None       1.2
1113           NaN       NoneNoneNoneNone    None       1.3
3650           NaN       NoneNoneNoneNone    None       1.2
3828           NaN       NoneNoneNoneNone    None       1.2
340            NaN       NoneNoneNoneNone    None       0.4
1195           pupper    NoneNonepupperNone  Pupper     1.2
2982           NaN       NoneNoneNoneNone    None       1.2

[10 rows x 28 columns]
In [73]: df_twitter_dogs.loc[426]
Out[73]: tweet_id
timestamp
source                Twitter for iPhone
text                  Here's a pupper in a onesie. Quite pupset abou...
rating_numerator      12
rating_denominator    10
name                  None
doggo                 None
floofer               None
pupper                pupper
puppo                 None
jpg_url               https://pbs.twimg.com/media/Czky0v9VIAEXRkd.jpg
img_num               1
p1                    seat_belt
p1_conf
p1_dog                False
p2                    toy_poodle
p2_conf
p2_dog                True
p3                    golden_retriever
p3_conf
p3_dog                True
favorite_count        8784
retweet_count         2509
dog_type              pupper
all_stages            NoneNonepupperNone
dog_stage             Pupper
rating                1.2
Name: 426, dtype: object
In [74]: df_twitter_dogs.rating.describe()
Out[74]: count-
mean-
std-
min-%-%-%-
max-
Name: rating, dtype: float64
In [75]: df_twitter_dogs.rating.head()
Out[75]:-
Name: rating, dtype: float64
1.4
Storing Data
Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
In [76]: df_twitter_dogs.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)
1.5
Analyzing and Visualizing Data
In [77]: twitter_archive_master = pd.read_csv('twitter_archive_master.csv')
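A CSV file stores no dtype information, so columns cleaned to `object` or `datetime64[ns]` earlier are re-inferred on load. The sketch below shows the effect with an in-memory buffer in place of a file on disk; the `toy` frame and its values are hypothetical, and passing `dtype` and `parse_dates` is one possible way to restore the types, not something the notebook itself does.

```python
import io
import pandas as pd

# Hypothetical toy data: IDs stored as strings, timestamps as datetimes
toy = pd.DataFrame({
    'tweet_id': ['0012', '0345'],
    'timestamp': pd.to_datetime(['2017-01-01 10:00:00',
                                 '2017-01-02 11:30:00'])})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default load: pandas infers the types again from the text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Explicit load: keep IDs as strings and parse the timestamp column
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'tweet_id': str},
                    parse_dates=['timestamp'])

print(plain.dtypes)
print(typed.dtypes)
```

In the default load the IDs become integers, while the explicit load keeps them as text and restores the datetime column.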
1.5.1 Insights:
1. Is there a correlation between retweet counts and favorite counts over time?
2. The most used Twitter source
3. The most popular dog name
1.5.2 1. Is there a correlation between retweet counts and favorite counts over time?
1.5.3 Visualization
In [78]: # `height` replaces the `size` parameter, which seaborn renamed in version 0.9
sns.lmplot(x="retweet_count",
           y="favorite_count",
           data=twitter_archive_master,
           height=5,
           aspect=1.3,
           scatter_kws={'alpha': 1/5})
plt.title('Favorite Count vs. Retweet Count')
plt.xlabel('Retweet Count')
plt.ylabel('Favorite Count');
• This plot shows a positive correlation between favorite counts and retweet counts: tweets that are retweeted more also tend to be favorited more.
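The visual impression from the regression plot can be backed by a number. A minimal sketch, assuming Pearson's coefficient is an acceptable measure here: `Series.corr` computes it by default. The `toy` values are hypothetical and are not the notebook's data.

```python
import pandas as pd

# Hypothetical toy counts; the real columns live in twitter_archive_master
toy = pd.DataFrame({
    'retweet_count': [100, 250, 400, 900, 1500],
    'favorite_count': [350, 800, 1300, 2600, 5000]})

# Series.corr uses the Pearson coefficient unless told otherwise
r = toy['retweet_count'].corr(toy['favorite_count'])
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship; applying the same call to the two columns of `twitter_archive_master` would quantify the trend seen in the plot.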
1.5.4 2. The most used Twitter source
In [79]: source = twitter_archive_master['source'].value_counts()
source
Out[79]: Twitter for iPhone    1955
         Twitter Web Client      28
         TweetDeck               11
Name: source, dtype: int64
1.5.5 Visualization
In [80]: #plot
g_bar = source.plot.bar(color = 'blue', fontsize = 15)
#figure size(width, height)
g_bar.figure.set_size_inches(8, 8);
#Add labels
plt.title('Most used Twitter source', color = 'black', fontsize = '15')
plt.xlabel('Source', color = 'black', fontsize = '15')
plt.ylabel('Number of tweets', color = 'black', fontsize = '15');
• The most used Twitter source is Twitter for iPhone, which accounts for 1955 of the 1994 tweets.
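Expressing the counts as percentages makes the gap between sources easier to read. A short sketch using the three counts reported in Out[79]; the variable names `source` and `share` are only for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output in Out[79]
source = pd.Series({'Twitter for iPhone': 1955,
                    'Twitter Web Client': 28,
                    'TweetDeck': 11})

# Share of each source as a percentage of all tweets
share = (source / source.sum() * 100).round(1)
print(share)
```

On the full DataFrame, `twitter_archive_master['source'].value_counts(normalize=True)` would give the same proportions directly.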
1.5.6 3. The most popular dog name
In [81]: Dog_names = twitter_archive_master.name.value_counts()[1:10]
In [82]: Dog_names
Out[82]: Charlie    11
         Oliver     10
         Cooper     10
         Lucy       10
         Tucker      9
         Penny       9
         Winston     8
         Sadie       8
         Toby        7
Name: name, dtype: int64
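The slice `[1:10]` in In [81] assumes that the placeholder 'None' is the first entry of the counts. Depending on the pandas version, `read_csv` may instead parse the literal string 'None' as a missing value, in which case the slice would silently drop the top real name. A hedged sketch that filters the placeholder explicitly, on hypothetical toy names:

```python
import pandas as pd

# Hypothetical toy names; 'None' and missing values both mean "no name"
names = pd.Series(['None', 'Charlie', 'Lucy', 'Charlie', None,
                   'None', 'None', 'Lucy', 'Charlie', 'Bo'])

# Drop missing values and the 'None' placeholder before counting
named = names.dropna()
named = named[named != 'None']

# head(n) keeps the n most frequent names regardless of how the
# placeholder was parsed
top_names = named.value_counts().head(2)
print(top_names)
```

With this filter the result does not depend on where, or whether, the placeholder appears in the counts.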
1.5.7 Visualization
In [83]: #plot
g_bar = Dog_names.plot.bar(color = 'blue', fontsize = 15)
#figure size(width, height)
g_bar.figure.set_size_inches(8, 8);
#Add labels
plt.title('The Most popular Dog names', color = 'black', fontsize = '15')
plt.xlabel('Name', color = 'black', fontsize = '15')
plt.ylabel('Number of occurrence', color = 'black', fontsize = '15');
• The most popular dog name is Charlie, with 11 occurrences. Oliver, Cooper, and Lucy tie for second place with 10 occurrences each.
1.5.8 Sources
• Data Analysis Nanodegree/Data Wrangling/Lesson 3: Assessing Data/Concepts 4-18
• https://stackabuse.com/reading-and-writing-json-to-a-file-in-python
• https://stackoverflow.com/questions/-/measure-time-elapsed-inpython?answertab=oldest#tab-top
• https://stackoverflow.com/questions/-/twitter-api-get-tweets-with-specific-id