Chukwuemeka Michael Obioha | Freelancer A Wrangling Project I Worked On

A Wrangling Project I Worked On

wrangle_act

August 27, 2022

1 Project: Wrangle and Analyze Data

In [1]: # Import libraries
        import pandas as pd
        import numpy as np
        import requests
        import tweepy
        import os
        import json
        import time
        import re
        import matplotlib.pyplot as plt
        %matplotlib inline
        import warnings
        from IPython.display import Image
        from functools import reduce
        import seaborn as sns
        import datetime

1.1 Data Gathering

1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [2]: df_archive = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image predictions (image_predictions.tsv)

In [3]: # Download the image prediction file using the link provided by Udacity
        url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictio
        image_request = requests.get(url, allow_redirects=True)
        open('image_predictions.tsv', 'wb').write(image_request.content)

Out[3]: 335079

In [4]: df_images = pd.read_csv('image_predictions.tsv', sep='\t')

3. Use the Tweepy library to query additional data via the Twitter API (tweet-json.txt)

In [5]: # Read the tweet JSON file line by line; each line is one tweet
        tweets = []
        with open('tweet-json.txt', 'r') as file:
            for line in file:
                tweets.append(json.loads(line))
        df_1 = pd.DataFrame(tweets)

In [6]: df_1.info()

RangeIndex: 2354 entries, 0 to 2353
Data columns (total 31 columns):
contributors                     0 non-null object
coordinates                      0 non-null object
created_at                       2354 non-null object
display_text_range               2354 non-null object
entities                         2354 non-null object
extended_entities                2073 non-null object
favorite_count                   2354 non-null int64
favorited                        2354 non-null bool
full_text                        2354 non-null object
geo                              0 non-null object
id                               2354 non-null int64
id_str                           2354 non-null object
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null object
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 non-null object
is_quote_status                  2354 non-null bool
lang                             2354 non-null object
place                            1 non-null object
possibly_sensitive               2211 non-null object
possibly_sensitive_appealable    2211 non-null object
quoted_status                    28 non-null object
quoted_status_id                 29 non-null float64
quoted_status_id_str             29 non-null object
retweet_count                    2354 non-null int64
retweeted                        2354 non-null bool
retweeted_status                 179 non-null object
source                           2354 non-null object
truncated                        2354 non-null bool
user                             2354 non-null object
dtypes: bool(4), float64(3), int64(3), object(21)
memory usage: 505.8+ KB

In [7]: df_counts = df_1[['id', 'favorite_count', 'retweet_count']]

1.2 Assessing Data

In [8]: df_archive.head(25)

Out[8]: [Table output garbled in extraction. Recoverable columns: tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp. Both in_reply_to columns are NaN in every displayed row.]
In [9]: df_archive.info()

RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB

In [10]: df_archive.describe()

Out[10]: [Summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, rating_numerator and rating_denominator. Values lost in extraction.]

In [11]: df_archive.duplicated().sum()

Out[11]: 0

In [12]: df_archive[df_archive.tweet_id.duplicated()]

Out[12]: Empty DataFrame
         Columns: [tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp, source, text
         Index: []

In [13]: df_archive.name.value_counts()

Out[13]: [Most frequent names in descending order. Most counts lost in extraction.]
         None, a, Charlie, Lucy, Oliver, Cooper, Penny, Lola, Tucker, Winston,
         Bo, the, Sadie, Toby, Buddy, Bailey, an, Daisy, Koda, Stanley,
         Leo, Scout, Milo, Jax, Bella, Rusty, Jack, Oscar 6, Dave 6, George 5
         ...
         Rorie 1, Tuco 1, Daniel 1, Kuyu 1, Dallas 1, Josep 1, Tommy 1, Zuzu 1, Clybe 1, Lupe 1,
         Theo 1, Bobb 1, Bruno 1, Maxwell 1, Autumn 1, Ashleigh 1, Chevy 1, Arya 1, Hubertson 1, Kathmandu 1,
         old 1, Aqua 1, Tater 1, Ebby 1, Geoff 1, Leonidas 1, Halo 1, Florence 1, Kenzie 1, Cleopatricia 1
         Name: name, Length: 957, dtype: int64

In [14]: df_archive.sample(5)

Out[14]: [Table output garbled in extraction. Recoverable columns: tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp. Both in_reply_to columns are NaN in every displayed row.]

In [15]: df_archive.source.value_counts()

Out[15]: [Counts lost in extraction.]
         Twitter for iPhone
         Vine - Make a Scene
         Twitter Web Client
         TweetDeck
         Name: source, dtype: int64

In [16]: df_archive.rating_numerator.value_counts()

Out[16]: [Values lost in extraction.]
         Name: rating_numerator, dtype: int64

In [17]: df_archive.rating_denominator.value_counts()

Out[17]: [Values lost in extraction.]
         Name: rating_denominator, dtype: int64

In [18]: df_images.head()

Out[18]: [tweet_id and the three confidence columns lost in extraction.]

            jpg_url
         0  https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg
         1  https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg
         2  https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg
         3  https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg
         4  https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg

            img_num  p1                      p1_dog  p2                  p2_dog  p3                   p3_dog
         0  1        Welsh_springer_spaniel  True    collie              True    Shetland_sheepdog    True
         1  1        redbone                 True    miniature_pinscher  True    Rhodesian_ridgeback  True
         2  1        German_shepherd         True    malinois            True    bloodhound           True
         3  1        Rhodesian_ridgeback     True    redbone             True    miniature_pinscher   True
         4  1        miniature_pinscher      True    Rottweiler          True    Doberman             True

In [19]: df_images.info()

RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

In [20]: df_images.describe()

Out[20]: [Summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for tweet_id, img_num, p1_conf, p2_conf and p3_conf. Values lost in extraction.]

In [21]: df_images.sample(5)

Out[21]: [tweet_id and the three confidence columns lost in extraction.]

               jpg_url
         505   https://pbs.twimg.com/media/CWE_x33UwAEE3no.jpg
         1834  https://pbs.twimg.com/media/C52V7PzWcAA_pVv.jpg
         1433  https://pbs.twimg.com/media/Crwxb5yWgAAX5P_.jpg
         502   https://pbs.twimg.com/media/CWEs1b-WEAEhq82.jpg
         1317  https://pbs.twimg.com/media/CnsIT0WWcAAul8V.jpg

               img_num  p1                  p1_dog  p2                        p2_dog  p3          p3_dog
         505   1        Italian_greyhound   True    whippet                   True    Great_Dane  True
         1834  1        shopping_cart       False   shopping_basket           False   toy_poodle  True
         1433  1        Norwegian_elkhound  True    Chesapeake_Bay_retriever  True    malamute    True
         502   1        golden_retriever    True    Welsh_springer_spaniel    True    beagle      True
         1317  1        web_site            False   printer                   False   carton      False

In [22]: df_images.duplicated().sum()

Out[22]: 0

In [23]: df_images[df_images.tweet_id.duplicated()]

Out[23]: Empty DataFrame
         Columns: [tweet_id, jpg_url, img_num, p1, p1_conf, p1_dog, p2, p2_conf, p2_dog, p3, p3_
         Index: []

In [24]: df_images.p1.value_counts()

Out[24]: [Most frequent predictions in descending order. Most counts lost in extraction.]
         golden_retriever, Labrador_retriever, Pembroke, Chihuahua, pug, chow, Samoyed, toy_poodle, Pomeranian, cocker_spaniel,
         malamute, French_bulldog, miniature_pinscher, Chesapeake_Bay_retriever, seat_belt, German_shepherd, Siberian_husky,
         Staffordshire_bullterrier, web_site, Cardigan, Maltese_dog, Shetland_sheepdog, teddy, beagle, Eskimo_dog,
         Lakeland_terrier, Rottweiler, Shih-Tzu, kuvasz 16, Italian_greyhound 16
         ...
         groenendael 1, shopping_basket 1, coho 1, silky_terrier 1, pot 1, coffee_mug 1, pillow 1, African_hunting_dog 1,
         park_bench 1, leopard 1, bookshop 1, hammer 1, terrapin 1, microwave 1, panpipe 1, sulphur-crested_cockatoo 1,
         book_jacket 1, walking_stick 1, sandbar 1, grey_fox 1, military_uniform 1, hay 1, ping-pong_ball 1, handkerchief 1,
         bearskin 1, mortarboard 1, orange 1, boathouse 1, leaf_beetle 1, coral_reef 1
         Name: p1, Length: 378, dtype: int64

In [25]: df_images.p2.value_counts()

Out[25]: [Most frequent predictions in descending order. Most counts lost in extraction.]
         Labrador_retriever, golden_retriever, Cardigan, Chihuahua, Pomeranian, French_bulldog, Chesapeake_Bay_retriever,
         toy_poodle, cocker_spaniel, miniature_poodle, Siberian_husky, beagle, Eskimo_dog, Pembroke, collie, kuvasz,
         Italian_greyhound, Pekinese, American_Staffordshire_terrier, miniature_pinscher, Samoyed, malinois, chow,
         toy_terrier, Boston_bull, Norwegian_elkhound, Staffordshire_bullterrier, pug, Irish_terrier, Shih-Tzu
         ...
         sarong, hamper, sulphur_butterfly, patio, toaster, hatchet, volcano, turnstile, chain_mail, saltshaker,
         home_theater, stove, drake, snowmobile, apron, ice_lolly, hyena, Bernese_mountain_dog, assault_rifle, screw,
         wombat, handkerchief, coral_fungus, racket, waffle_iron, tree_frog, coffee_mug, china_cabinet,
         Gila_monster 1, coral_reef 1
         Name: p2, Length: 405, dtype: int64

In [26]: df_images.p3.value_counts()

Out[26]: [Most frequent predictions in descending order. Most counts lost in extraction.]
         Labrador_retriever, Chihuahua, golden_retriever, Eskimo_dog, kelpie, kuvasz, Staffordshire_bullterrier, chow,
         cocker_spaniel, beagle, Pekinese, toy_poodle, Pomeranian, Great_Pyrenees, Pembroke, Chesapeake_Bay_retriever,
         French_bulldog, malamute, American_Staffordshire_terrier, pug, Cardigan, basenji, bull_mastiff, toy_terrier,
         Siberian_husky, Shetland_sheepdog, Boston_bull, boxer, Lakeland_terrier, doormat
         ...
         desktop_computer, screen, American_black_bear, mitten, tiger_cat, mongoose, Kerry_blue_terrier, wolf_spider,
         mosquito_net, mink, eel, Windsor_tie 1, partridge 1, stinkhorn 1, barrow 1, buckeye 1, pretzel 1, grand_piano 1,
         chimpanzee 1, croquet_ball 1, electric_fan 1, pool_table 1, loupe 1, green_lizard 1, bow 1, hatchet 1, quill 1,
         cardoon 1, broccoli 1, coral_reef 1
         Name: p3, Length: 408, dtype: int64

In [27]: df_counts.head()

Out[27]: [Table output garbled in extraction. Columns: id, favorite_count, retweet_count.]

In [28]: df_counts.info()

RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
id                2354 non-null int64
favorite_count    2354 non-null int64
retweet_count     2354 non-null int64
dtypes: int64(3)
memory usage: 55.2 KB

In [29]: df_counts.describe()

Out[29]: [Summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for id, favorite_count and retweet_count. Values lost in extraction.]

In [30]: df_counts.duplicated().sum()

Out[30]: 0

In [31]: df_counts[df_counts.id.duplicated()]

Out[31]: Empty DataFrame
         Columns: [id, favorite_count, retweet_count]
         Index: []

1.2.1 Quality issues

1. The archive contains retweets and tweets without images; only original ratings that have images should be kept
2. Some columns are not needed for the analysis
3. Wrong data types in the columns tweet_id, source and timestamp
4. Rating numerators with decimals were extracted incorrectly
5. Some records have more than one dog stage
6. The source column holds HTML-formatted strings rather than plain strings
7. Some dog names (e.g. a, an, actually) are not real names
8. Dog ratings are not standardized

1.2.2 Tidiness issues

1. The last four columns (doggo, floofer, pupper, puppo) all relate to the same variable
2. The Twitter API table (retweet_count, favorite_count) and the image predictions table should be merged into the Twitter archive table
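Several of the quality issues listed above can also be confirmed programmatically rather than by eye. The following is a minimal sketch, not the notebook's own code: the `sample` frame and its three rows are made up for illustration, and only the column names come from the archive table.

```python
import pandas as pd

# Hypothetical stand-in for the archive table (made-up rows, real column names)
sample = pd.DataFrame({
    "text": [
        "This is Bella. 13.5/10 would pet",
        "Here is a doggo pupper duo. 12/10",
        "This is a dog. 11/10",
    ],
    "name": ["Bella", "None", "a"],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

# Quality issue 7: values starting with a lowercase letter are usually not names
bad_names = sample.loc[sample["name"].str.match(r"[a-z]"), "name"].tolist()

# Quality issue 4: tweets whose rating is written with a decimal numerator
has_decimal = sample["text"].str.contains(r"\d+\.\d+/\d+")

# Quality issue 5: rows flagged with more than one dog stage
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
stage_count = (sample[stage_cols] != "None").sum(axis=1)
multi_stage = stage_count[stage_count > 1].index.tolist()

print(bad_names, has_decimal.tolist(), multi_stage)
```

Running the same three checks against `df_archive` would give the row sets that the cleaning steps need to handle.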
1.3 Cleaning Data

In [32]: # Make copies of original pieces of data
         df_archive_clean = df_archive.copy()
         df_images_clean = df_images.copy()
         df_counts_clean = df_counts.copy()

1.3.1 Issue #1: The Twitter API table (retweet_count, favorite_count) and the image predictions table share the tweet ID column with the Twitter archive table

Define: Merge the Twitter API table, the image predictions table and the Twitter archive table into one DataFrame

Code

In [33]: # First we need to rename the id column in the Twitter API table
         df_counts_clean.rename(columns = {'id' : 'tweet_id', 'favorite_count' : 'favorite_count

In [34]: # Merge all three datasets
         df_merge = [df_archive_clean, df_images_clean, df_counts_clean]
         df_twitter_dogs = reduce(lambda left, right: pd.merge(left, right, on = 'tweet_id'), d

In [35]: df_twitter_dogs.head()

Out[35]: [Table output garbled in extraction. Recoverable columns: tweet_id, in_reply_to_status_id, in_reply_to_user_id.]

In [36]: df_twitter_dogs.info()

Int64Index: 2073 entries, 0 to 2072
Data columns (total 30 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2073 non-null object
source                        2073 non-null object
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    79 non-null object
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          2073 non-null object
doggo                         2073 non-null object
floofer                       2073 non-null object
pupper                        2073 non-null object
puppo                         2073 non-null object
jpg_url                       2073 non-null object
img_num                       2073 non-null int64
p1                            2073 non-null object
p1_conf                       2073 non-null float64
p1_dog                        2073 non-null bool
p2                            2073 non-null object
p2_conf                       2073 non-null float64
p2_dog                        2073 non-null bool
p3                            2073 non-null object
p3_conf                       2073 non-null float64
p3_dog                        2073 non-null bool
favorite_count                2073 non-null int64
retweet_count                 2073 non-null int64
dtypes: bool(3), float64(7), int64(6), object(14)
memory usage: 459.5+ KB

1.3.2 Issue #2: The last four columns (doggo, floofer, pupper, puppo) all relate to the same variable

Define: The doggo, floofer, pupper and puppo columns in the Twitter archive table should be merged into one column named "dog_stage"

Code

In [37]: # Extract the stage keyword from the tweet text into a new column named "dog_type"
         df_twitter_dogs['dog_type'] = df_twitter_dogs['text'].str.extract('(doggo|floofer|puppe

Test

In [38]: df_twitter_dogs[['dog_type']].head()

Out[38]:   dog_type
         0 NaN
         1 NaN
         2 NaN
         3 NaN
         4 NaN

In [39]: df_twitter_dogs[['doggo', 'floofer', 'pupper', 'puppo']].head()

Out[39]:   doggo floofer pupper puppo
         0 None  None    None   None
         1 None  None    None   None
         2 None  None    None   None
         3 None  None    None   None
         4 None  None    None   None

In [40]: df_twitter_dogs[['doggo', 'floofer', 'pupper', 'puppo']].sample(25)

Out[40]: [25 sampled rows with columns doggo, floofer, pupper, puppo. Most rows show None in all four columns and a few show pupper; the row layout was lost in extraction.]

In [41]: df_twitter_dogs.info()

Int64Index: 2073 entries, 0 to 2072
Data columns (total 31 columns):
tweet_id                      2073 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2073 non-null object
source                        2073 non-null object
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    79 non-null object
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          2073 non-null object
doggo                         2073 non-null object
floofer                       2073 non-null object
pupper                        2073 non-null object
puppo                         2073 non-null object
jpg_url                       2073 non-null object
img_num                       2073 non-null int64
p1                            2073 non-null object
p1_conf                       2073 non-null float64
p1_dog                        2073 non-null bool
p2                            2073 non-null object
p2_conf                       2073 non-null float64
p2_dog                        2073 non-null bool
p3                            2073 non-null object
p3_conf                       2073 non-null float64
p3_dog                        2073 non-null bool
favorite_count                2073 non-null int64
retweet_count                 2073 non-null int64
dog_type                      337 non-null object
dtypes: bool(3), float64(7), int64(6), object(15)
memory usage: 475.7+ KB

In [42]: df_twitter_dogs[df_twitter_dogs.tweet_id ==-]

Out[42]: [Single-row table output garbled in extraction. Recoverable columns: tweet_id, in_reply_to_status_id, in_reply_to_user_id, timestamp, source.]

[...]

Int64Index: 1994 entries, 0 to 2072
Data columns (total 31 columns):
tweet_id                      1994 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     1994 non-null object
source                        1994 non-null object
text                          1994 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1994 non-null object
rating_numerator              1994 non-null int64
rating_denominator            1994 non-null int64
name                          1994 non-null object
doggo                         1994 non-null object
floofer                       1994 non-null object
pupper                        1994 non-null object
puppo                         1994 non-null object
jpg_url                       1994 non-null object
img_num                       1994 non-null int64
p1                            1994 non-null object
p1_conf                       1994 non-null float64
p1_dog                        1994 non-null bool
p2                            1994 non-null object
p2_conf                       1994 non-null float64
p2_dog                        1994 non-null bool
p3                            1994 non-null object
p3_conf                       1994 non-null float64
p3_dog                        1994 non-null bool
favorite_count                1994 non-null int64
retweet_count                 1994 non-null int64
dog_type                      326 non-null object
dtypes: bool(3), float64(7), int64(6), object(15)
memory usage: 457.6+ KB

1.3.4 Issue #4: Some columns are not needed for the analysis

Define: Drop all the columns not needed for the analysis

Code

In [46]: # Drop unused columns
         df_twitter_dogs = df_twitter_dogs.drop(['in_reply_to_status_id','in_reply_to_user_id','

Test

In [47]: df_twitter_dogs.info()

Int64Index: 1994 entries, 0 to 2072
Data columns (total 25 columns):
tweet_id              1994 non-null int64
timestamp             1994 non-null object
source                1994 non-null object
text                  1994 non-null object
rating_numerator      1994 non-null int64
rating_denominator    1994 non-null int64
name                  1994 non-null object
doggo                 1994 non-null object
floofer               1994 non-null object
pupper                1994 non-null object
puppo                 1994 non-null object
jpg_url               1994 non-null object
img_num               1994 non-null int64
p1                    1994 non-null object
p1_conf               1994 non-null float64
p1_dog                1994 non-null bool
p2                    1994 non-null object
p2_conf               1994 non-null float64
p2_dog                1994 non-null bool
p3                    1994 non-null object
p3_conf               1994 non-null float64
p3_dog                1994 non-null bool
favorite_count        1994 non-null int64
retweet_count         1994 non-null int64
dog_type              326 non-null object
dtypes: bool(3), float64(3), int64(6), object(13)
memory usage: 364.1+ KB

In [48]: df_twitter_dogs.head()

Out[48]: [Table output garbled in extraction.]

[...]

Int64Index: 1994 entries, 0 to 2072
Data columns (total 25 columns):
tweet_id              1994 non-null object
timestamp             1994 non-null datetime64[ns]
source                1994 non-null category
text                  1994 non-null object
rating_numerator      1994 non-null int64
rating_denominator    1994 non-null int64
name                  1994 non-null object
doggo                 1994 non-null object
floofer               1994 non-null object
pupper                1994 non-null object
puppo                 1994 non-null object
jpg_url               1994 non-null object
img_num               1994 non-null int64
p1                    1994 non-null object
p1_conf               1994 non-null float64
p1_dog                1994 non-null bool
p2                    1994 non-null object
p2_conf               1994 non-null float64
p2_dog                1994 non-null bool
p3                    1994 non-null object
p3_conf               1994 non-null float64
p3_dog                1994 non-null bool
favorite_count        1994 non-null int64
retweet_count         1994 non-null int64
dog_type              326 non-null object
dtypes: bool(3), category(1), datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 350.6+ KB

In [51]: df_twitter_dogs.head()

Out[51]:
[Table output garbled in extraction. Recoverable columns: tweet_id, timestamp.]

[...]

Int64Index: 1994 entries, 0 to 2072
Data columns (total 27 columns):
tweet_id              1994 non-null object
timestamp             1994 non-null datetime64[ns]
source                1994 non-null category
text                  1994 non-null object
rating_numerator      1994 non-null float64
rating_denominator    1994 non-null float64
name                  1994 non-null object
doggo                 1994 non-null object
floofer               1994 non-null object
pupper                1994 non-null object
puppo                 1994 non-null object
jpg_url               1994 non-null object
img_num               1994 non-null int64
p1                    1994 non-null object
p1_conf               1994 non-null float64
p1_dog                1994 non-null bool
p2                    1994 non-null object
p2_conf               1994 non-null float64
p2_dog                1994 non-null bool
p3                    1994 non-null object
p3_conf               1994 non-null float64
p3_dog                1994 non-null bool
favorite_count        1994 non-null int64
retweet_count         1994 non-null int64
dog_type              326 non-null object
all_stages            1994 non-null object
dog_stage             1994 non-null object
dtypes: bool(3), category(1), datetime64[ns](1), float64(5), int64(3), object(14)
memory usage: 381.8+ KB

1.3.8 Issue #8: The source column holds HTML-formatted strings rather than plain strings

Define: Change all HTML-formatted strings to plain strings

Code

In [59]: # Extract the visible label between the opening and closing anchor tags
         df_twitter_dogs.source = df_twitter_dogs.source.str.extract('>([\w\W\s]*)<', expand=Tru

In [60]: df_twitter_dogs.source.value_counts()

Out[60]: Twitter for iPhone    1955
         Twitter Web Client      28
         TweetDeck               11
         Name: source, dtype: int64

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
https://stackoverflow.com/questions/-/remove-unwanted-parts-from-strings-in-acolumn?noredirect=1

In [61]: # Define a function that strips the trailing link and apply it to the df_twitter_dogs table
         def htmlink(x):
             http_position = x.find("http")
             # If there is no link, keep the text unchanged
             if http_position == -1:
                 return x
             # Cut from the link to the end and drop the whitespace before it
             # (also safe when the text starts with the link)
             return x[:http_position].rstrip()

         df_twitter_dogs.text = df_twitter_dogs.text.apply(htmlink)
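The two string clean-ups above (the anchor label in `source` and the trailing link in `text`) can also be expressed with the standard `re` module. This is a minimal sketch under stated assumptions, not the notebook's method: `strip_links` and `source_label` are hypothetical helper names, the anchor string is an illustrative value, and unlike `htmlink` the first helper removes only the URLs themselves instead of everything from the first "http" onward.

```python
import re


def strip_links(text):
    """Remove every http(s) URL, then collapse the leftover whitespace."""
    without_links = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", without_links).strip()


def source_label(html):
    """Return the visible label of an '<a ...>label</a>' string.

    Falls back to the input when no tag is found.
    """
    match = re.search(r">([^<]*)<", html)
    return match.group(1) if match else html


# Illustrative inputs (hypothetical values, not rows from the dataset)
tweet = "This is Darla. 13/10 happens to the best of us https://t.co/abc"
anchor = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

print(strip_links(tweet))
print(source_label(anchor))
```

Either helper can be passed to `Series.apply` in the same way as `htmlink`.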
https://stackoverflow.com/questions/-/remove-unwanted-parts-from-strings-in-acolumn?noredirect=1

Test

In [62]: # Confirm that all the hyperlinks have been removed
         for row in df_twitter_dogs.text[:10]:
             print(row)

This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10
This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available fo
This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know w
This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us
This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and sh
Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breath
Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more thing
When you watch your owner call another dog a good boy but then they turn back to you and say you
This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettabl
This is Cassie. She is a college pup. Studying international doggo communication and stick theor

1.3.9 Issue #9: Some dog names (e.g. a, an, actually) are not real names

Define: Replace every value in the name column that is not a real dog name with 'None'

Code

In [63]: df_twitter_dogs.name.unique()

Out[63]: array(['Phineas', 'Tilly', 'Archie', 'Darla', 'Franklin', 'None', 'Jax', 'Zoey', 'Cassie', 'Koda', 'Bruno', 'Ted', 'Stuart', 'Oliver', 'Jim', 'Zeke', 'Ralphus', 'Gerald', 'Jeffrey', 'such', 'Canela', 'Maya', 'Mingus', 'Derek', 'Roscoe', 'Waffles', 'Jimbo', 'Maisey', 'Earl', 'Lola', 'Kevin', 'Yogi', 'Noah', 'Bella', 'Grizzwald', 'Rusty', 'Gus', 'Stanley', 'Alfy', 'Koko', 'Rey', 'Gary', 'a', 'Elliot', 'Louis', 'Jesse', 'Romeo', 'Bailey', 'Duddles', 'Jack', 'Steven', 'Beau', 'Snoopy', 'Shadow', 'Emmy', 'Aja', 'Penny', 'Dante', 'Nelly', 'Ginger', 'Benedict', 'Venti', 'Goose', 'Nugget', 'Cash', 'Jed', 'Sebastian', 'Sierra', 'Monkey', 'Harry', 'Kody', 'Lassie', 'Rover', 'Napolean', 'Boomer', 'Cody', 'Rumble', 'Clifford', 'Dewey', 'Scout', 'Gizmo', 'Walter', 'Cooper', 'Harold', 'Shikha', 'Lili', 'Jamesy', 'Coco', 'Sammy', 'Meatball', 'Paisley', 'Albus', 'Neptune', 'Belle', 'Quinn', 'Zooey', 'Dave', 'Jersey', 'Hobbes', 'Burt', 'Lorenzo', 'Carl', 'Jordy', 'Milky', 'Trooper', 'quite', 'Sophie', 'Wyatt', 'Rosie', 'Thor', 'Oscar', 'Callie', 'Cermet', 'Marlee', 'Arya', 'Einstein', 'Alice', 'Rumpole', 'Benny', 'Aspen', 'Jarod', 'Wiggles', 'General', 'Sailor', 'Iggy', 'Snoop', 'Kyle', 'Leo', 'Riley', 'Noosh', 'Odin', 'Jerry', 'Georgie', 'Rontu', 'Cannon', 'Furzey', 'Daisy', 'Tuck', 'Barney', 'Vixen', 'Jarvis', 'Mimosa', 'Pickles', 'Brady', 'Luna', 'Charlie', 'Margo', 'Sadie', 'Hank', 'Tycho', 'Indie', 'Winnie', 'George', 'Bentley', 'Max', 'Dawn', 'Maddie', 'Monty', 'Sojourner', 'Winston', 'Odie', 'Arlo', 'Vincent', 'Lucy', 'Clark', 'Mookie', 'Meera', 'Ava', 'Eli', 'Ash', 'Tucker', 'Tobi', 'Chester', 'Wilson', 'Sunshine', 'Lipton', 'Bronte', 'Poppy', 'Gidget', 'Rhino', 'Willow', 'Orion', 'Eevee', 'Smiley', 'Miguel', 'Emanuel', 'Kuyu', 'Dutch', 'Pete',
'Scooter', 'Reggie', 'Lilly', 'Samson', 'Mia', 'Astrid', 'Malcolm', 'Dexter', 'Alfie', 'Fiona', 'one', 'Mutt', 'Bear', 'Doobert', 'Beebop', 'Alexander', 'Sailer', 'Brutus', 'Kona', 'Boots', 'Ralphie', 'Loki', 'Cupid', 'Pawnd', 'Pilot', 'Ike', 'Mo', 'Toby', 'Sweet', 'Pablo', 'Nala', 'Crawford', 'Gabe', 'Jimison', 'Duchess', 'Harlso', 'Sundance', 'Luca', 'Flash', 'Sunny', 'Howie', 'Jazzy', 'Anna', 'Finn', 'Bo', 'Wafer', 'Tom', 'Florence', 'Autumn', 'Buddy', 'Dido', 35 'Eugene', 'Ken', 'Strudel', 'Tebow', 'Chloe', 'Timber', 'Binky', 'Moose', 'Dudley', 'Comet', 'Akumi', 'Titan', 'Olivia', 'Alf', 'Oshie', 'Chubbs', 'Sky', 'Atlas', 'Eleanor', 'Layla', 'Rocky', 'Baron', 'Tyr', 'Bauer', 'Swagger', 'Brandi', 'Mary', 'Moe', 'Halo', 'Augie', 'Craig', 'Sam', 'Hunter', 'Pavlov', 'Phil', 'Kyro', 'Wallace', 'Ito', 'Ollie', 'Stephan', 'Lennon', 'incredibly', 'Major', 'Duke', 'Sansa', 'Shooter', 'Django', 'Diogi', 'Sonny', 'Marley', 'Severus', 'Ronnie', 'Milo', 'Bones', 'Mauve', 'Chef', 'Doc', 'Peaches', 'Sobe', 'Longfellow', 'Mister', 'Iroh', 'Pancake', 'Snicku', 'Ruby', 'Brody', 'Mack', 'Nimbus', 'Laika', 'Maximus', 'Dobby', 'Moreton', 'Juno', 'Maude', 'Lily', 'Newt', 'Benji', 'Nida', 'Robin', 'Monster', 'BeBe', 'Remus', 'Levi', 'Mabel', 'Misty', 'Betty', 'Mosby', 'Maggie', 'Bruce', 'Happy', 'Brownie', 'Rizzy', 'Stella', 'Butter', 'Frank', 'Tonks', 'Lincoln', 'Rory', 'Logan', 'Dale', 'Rizzo', 'Mattie', 'Pinot', 'Dallas', 'Hero', 'Frankie', 'Stormy', 'Mairi', 'Loomis', 'Godi', 'Cali', 'Deacon', 'Timmy', 'Sampson', 'Chipson', 'Oakley', 'Dash', 'Hercules', 'Jay', 'Mya', 'Strider', 'Wesley', 'Solomon', 'Huck', 'O', 'Blue', 'Anakin', 'Finley', 'Sprinkles', 'Heinrich', 'Shakespeare', 'Chelsea', 'Bungalo', 'Chip', 'Grey', 'Roosevelt', 'Willem', 'Davey', 'Dakota', 'Fizz', 'Dixie', 'very', 'Al', 'Jackson', 'Carbon', 'Klein', 'DonDon', 'Kirby', 'Lou', 'Chevy', 'Tito', 'Philbert', 'Louie', 'Rupert', 'Rufus', 'Brudge', 'Shadoe', 'Angel', 'Brat', 'Tove', 'my', 'Gromit', 'Aubie', 'Kota', 
'Leela', 'Glenn', 'Shelby', 'Sephie', 'Bonaparte', 'Albert', 'Wishes', 'Rose', 'Theo', 'Rocco', 'Fido', 'Emma', 'Spencer', 'Lilli', 'Boston', 'Brandonald', 'Corey', 'Leonard', 'Beckham', 'Devón', 'Gert', 'Watson', 'Keith', 'Dex', 'Ace', 'Tayzie', 'Grizzie', 'Gilbert', 'Meyer', 'Arnie', 'Zoe', 'Stewie', 'Calvin', 'Lilah', 'Spanky', 'Jameson', 'Piper', 'Atticus', 'Blu', 'Dietrich', 'not', 'Divine', 'Tripp', 'his', 'Cora', 'Huxley', 'Bookstore', 'Abby', 'Shiloh', 'an', 'Gustav', 'Arlen', 'Percy', 'Lenox', 'Sugar', 'Harvey', 'Blanket', 'Geno', 'Stark', 'Beya', 'Kilo', 'Kayla', 'Maxaroni', 'Bell', 'Doug', 'Edmund', 'Aqua', 'Theodore', 'just', 'Baloo', 'Chase', 'getting', 'Nollie', 'Rorie', 'Simba', 'Charles', 'Bayley', 'Axel', 'Storkson', 'Remy', 'Chadrick', 'Kellogg', 'Buckley', 'Livvie', 'Terry', 'Hermione', 'Ralpher', 'Aldrick', 'Larry', 'this', 'unacceptable', 'Rooney', 'Crystal', 'Ziva', 'Stefan', 'Pupcasso', 'Puff', 'Flurpson', 'Coleman', 'Enchilada', 'Raymond', 'all', 'Rueben', 'Cilantro', 'Karll', 'Sprout', 'Blitz', 'Bloop', 'Colby', 'Lillie', 'Fred', 'Ashleigh', 'Kreggory', 'Sarge', 'Luther', 'Reginald', 'Ivar', 'Jangle', 'Schnitzel', 'Panda', 'Berkeley', 'Ralphé', 'Charleson', 'Clyde', 'Harnold', 'Sid', 'Pippa', 'Otis', 'Carper', 'Bowie', 'Alexanderson', 'Suki', 'Barclay', 'Ebby', 'Flávio', 'Smokey', 'Link', 'Jennifur', 'Bluebert', 'Stephanus', 'Bubbles', 'Zeus', 'Bertson', 'Nico', 'Michelangelope', 'Siba', 'Calbert', 'Curtis', 'Travis', 'Thumas', 'Kanu', 'Lance', 'Opie', 'Stubert', 'Kane', 'Olive', 'Chuckles', 'Staniel', 'Sora', 'Beemo', 'Gunner', 36 'infuriating', 'Lacy', 'Tater', 'Olaf', 'Cecil', 'Vince', 'Karma', 'Billy', 'Walker', 'Rodney', 'Klevin', 'Malikai', 'Bobble', 'River', 'Jebberson', 'Remington', 'Farfle', 'Jiminus', 'Harper', 'Keurig', 'Clarkus', 'Finnegus', 'Cupcake', 'Kathmandu', 'Ellie', 'Katie', 'Kara', 'Adele', 'Zara', 'Ambrose', 'Jimothy', 'Bode', 'Terrenth', 'Reese', 'Chesterson', 'Lucia', 'Bisquick', 'Ralphson', 'Socks', 'Rambo', 'Fiji', 
'Rilo', 'Bilbo', 'Coopson', 'Yoda', 'Millie', 'Chet', 'Crouton', 'Daniel', 'Kaia', 'Murphy', 'Dotsy', 'Eazy', 'Coops', 'Fillup', 'Miley', 'Charl', 'Reagan', 'CeCe', 'Cuddles', 'Claude', 'Jessiga', 'Carter', 'Ole', 'Blipson', 'Reptar', 'Trevith', 'Berb', 'Bob', 'Colin', 'Brian', 'Oliviér', 'Grady', 'Kobe', 'Freddery', 'Bodie', 'Dunkin', 'Wally', 'Tupawc', 'Amber', 'Herschel', 'Edgar', 'Kingsley', 'Brockly', 'Richie', 'Molly', 'Vinscent', 'Cedrick', 'Hazel', 'Lolo', 'Eriq', 'Phred', 'the', 'Maxwell', 'Geoff', 'Covach', 'Durg', 'Fynn', 'Ricky', 'Herald', 'Lucky', 'Trip', 'Clarence', 'Hamrick', 'Brad', 'Pubert', 'Frönq', 'Derby', 'Lizzie', 'Blakely', 'Opal', 'Marq', 'Kramer', 'Tyrone', 'Gordon', 'Baxter', 'Mona', 'Horace', 'Crimson', 'Birf', 'Hammond', 'Lorelei', 'Marty', 'Brooks', 'Petrick', 'Hubertson', 'Gerbald', 'Oreo', 'Bruiser', 'Perry', 'Bobby', 'Jeph', 'Obi', 'Tino', 'Kulet', 'Lupe', 'Tiger', 'Jiminy', 'Griffin', 'Banjo', 'Brandy', 'Lulu', 'Darrel', 'Taco', 'Joey', 'Patrick', 'Kreg', 'Todo', 'Tess', 'Ulysses', 'Toffee', 'Apollo', 'Carly', 'Asher', 'Glacier', 'Chuck', 'actually', 'Champ', 'Ozzie', 'Griswold', 'Cheesy', 'Moofasa', 'Hector', 'Goliath', 'Kawhi', 'Ozzy', 'by', 'Emmie', 'Penelope', 'Willie', 'Rinna', 'Mike', 'William', 'Dwight', 'Evy', 'Hurley', 'Rubio', 'officially', 'Chompsky', 'Linda', 'Tug', 'Tango', 'Grizz', 'Jerome', 'Crumpet', 'Jessifer', 'Ralph', 'Sandy', 'Humphrey', 'Tassy', 'Juckson', 'Chuq', 'Tyrus', 'Karl', 'Godzilla', 'Vinnie', 'Kenneth', 'Herm', 'Bert', 'Striker', 'Donny', 'Pepper', 'Bernie', 'Buddah', 'Lenny', 'Arnold', 'Zuzu', 'Mollie', 'Laela', 'Tedders', 'Superpup', 'Rufio', 'Jeb', 'Rodman', 'Jonah', 'Chesney', 'Kenny', 'Henry', 'Bobbay', 'Mitch', 'Kaiya', 'Acro', 'Aiden', 'Obie', 'Dot', 'Shnuggles', 'Kendall', 'Jeffri', 'Steve', 'Eve', 'Mac', 'Fletcher', 'Kenzie', 'Pumpkin', 'Schnozz', 'Gustaf', 'Cheryl', 'Ed', 'Leonidas', 'Norman', 'Caryl', 'Scott', 'Taz', 'Darby', 'Jackie', 'light', 'Jazz', 'Franq', 'Pippin', 'Rolf', 'Snickers', 
'Ridley', 'Cal', 'Bradley', 'Bubba', 'Tuco', 'Patch', 'Mojo', 'Batdog', 'Dylan', 'space', 'Mark', 'JD', 'Alejandro', 'Scruffers', 'Pip', 'Julius', 'Tanner', 'Sparky', 'Anthony', 'Holly', 'Jett', 'Amy', 'Sage', 'Andy', 'Mason', 'Trigger', 'Antony', 'Creg', 'Traviss', 'Gin', 'Jeffrie', 'Danny', 'Ester', 'Pluto', 'Bloo', 'Edd', 'Paull', 'Willy', 'Herb', 'Damon', 'Peanut', 'Nigel', 'Butters', 'Sandra', 'Fabio', 'Randall', 'Liam', 'Tommy', 'Ben', 'Raphael', 'Julio', 'Andru', 'Kloey', 'Shawwn', 'Skye', 'Kollin', 'Ronduh', 'Billl', 'Saydee', 'Dug', 'Tessa', 'Sully', 'Kirk', 'Ralf', 'Clarq', 'Jaspers', 'Samsom', 'Terrance', 'Harrison', 'Chaz', 'Jeremy', 'Jaycob', 'Lambeau', 'Ruffles', 'Amélie', 'Bobb', 'Banditt', 37 'Kevon', 'Winifred', 'Hanz', 'Churlie', 'Zeek', 'Timofy', 'Maks', 'Jomathan', 'Kallie', 'Marvin', 'Spark', 'Gòrdón', 'Jo', 'DayZ', 'Jareld', 'Torque', 'Ron', 'Skittles', 'Cleopatricia', 'Erik', 'Stu', 'Tedrick', 'Shaggy', 'Filup', 'Kial', 'Naphaniel', 'Dook', 'Hall', 'Philippe', 'Biden', 'Fwed', 'Genevieve', 'Joshwa', 'Timison', 'Bradlay', 'Pipsy', 'Clybe', 'Keet', 'Carll', 'Jockson', 'Josep', 'Lugan', 'Christoper'], dtype=object) In [64]: df_twitter_dogs['name'][df_twitter_dogs['name'].str.match('[a-z]+')] = 'None' /opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html# """Entry point for launching an IPython kernel. Test In [65]: df_twitter_dogs.name.value_counts() Out[65]: None Charlie Oliver Cooper Lucy Tucker Penny Winston Sadie Toby Daisy Lola Jax Koda Bella Bo Stanley Leo Chester Milo Buddy Louis Oscar Scout Dave Rusty Bailey - Winnie Alfie Bear ... 
4 4 4
Margo 1
Cal 1
Travis 1
Alfy 1
Sailer 1
Randall 1
Noosh 1
Skye 1
Emma 1
Sobe 1
Mary 1
Sonny 1
Emmy 1
Ron 1
Tino 1
Flash 1
Bobble 1
Jeffri 1
Leela 1
Strider 1
Rose 1
Butter 1
Fletcher 1
Marty 1
Torque 1
Kial 1
Ester 1
Tobi 1
Bobby 1
Cleopatricia 1
Name: name, Length: 914, dtype: int64
1.3.10 Issue #10: Dog ratings are not standardized
Define: Convert the rating columns to float, correct the numerators that contain a decimal, and add a rating column (numerator divided by denominator).
Code
In [66]: df_twitter_dogs['rating_numerator'] = df_twitter_dogs['rating_numerator'].astype(float)
         df_twitter_dogs['rating_denominator'] = df_twitter_dogs['rating_denominator'].astype(float)
In [67]: #Test
         df_twitter_dogs.info()
Int64Index: 1994 entries, 0 to 2072
Data columns (total 27 columns):
tweet_id              1994 non-null object
timestamp             1994 non-null datetime64[ns]
source                1994 non-null object
text                  1994 non-null object
rating_numerator      1994 non-null float64
rating_denominator    1994 non-null float64
name                  1994 non-null object
doggo                 1994 non-null object
floofer               1994 non-null object
pupper                1994 non-null object
puppo                 1994 non-null object
jpg_url               1994 non-null object
img_num               1994 non-null int64
p1                    1994 non-null object
p1_conf               1994 non-null float64
p1_dog                1994 non-null bool
p2                    1994 non-null object
p2_conf               1994 non-null float64
p2_dog                1994 non-null bool
p3                    1994 non-null object
p3_conf               1994 non-null float64
p3_dog                1994 non-null bool
favorite_count        1994 non-null int64
retweet_count         1994 non-null int64
dog_type              326 non-null object
all_stages            1994 non-null object
dog_stage             1994 non-null object
dtypes: bool(3), datetime64[ns](1), float64(5), int64(3), object(15)
memory usage: 395.3+ KB
In [68]: # Create a loop to gather all text, indices, and ratings
         # for tweets that contain a decimal in the numerator of the rating
         ratings_decimals_text = []
         ratings_decimals_index = []
         ratings_decimals = []
         for i, text in df_twitter_dogs['text'].items():
             if bool(re.search(r'\d+\.\d+\/\d+', text)):
                 ratings_decimals_text.append(text)
                 ratings_decimals_index.append(i)
                 ratings_decimals.append(re.search(r'\d+\.\d+', text).group())
         ratings_decimals_text
Out[68]: ['This is Bella. She hopes her smile made you smile. If not, she is also offering you h
          "This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin
          "This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random ju
          'Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10']
In [69]: ratings_decimals_index
Out[69]: [40, 558, 614, 1451]
In [70]: #Convert the decimal ratings to float
         df_twitter_dogs.loc[ratings_decimals_index[0],'rating_numerator'] = float(ratings_decimals[0])
         df_twitter_dogs.loc[ratings_decimals_index[1],'rating_numerator'] = float(ratings_decimals[1])
         df_twitter_dogs.loc[ratings_decimals_index[2],'rating_numerator'] = float(ratings_decimals[2])
         df_twitter_dogs.loc[ratings_decimals_index[3],'rating_numerator'] = float(ratings_decimals[3])
In [71]: # Create a new column called rating, and calculate the value with new, standardized ratings
         df_twitter_dogs['rating'] = df_twitter_dogs['rating_numerator'] / df_twitter_dogs['rating_denominator']
Test
In [72]: df_twitter_dogs.sample(10)
Out[72]:
text                                                rating_numerator  rating_denominator  name      ...  p2_dog  \
This is Beau. That is Beau's balloon. He takes...   13.0              10.0                Beau      ...  True
This is Remy. He has some long ass ears (proba...   10.0              10.0                Remy      ...  True
This is Dave. He's a tropical pup. Short lil l...   5.0               10.0                Dave      ...  False
"Yes hi could I get a number 4 with no pickles...   12.0              10.0                None      ...  True
This is Oliver. He does toe touches in his sle...   13.0              10.0                Oliver    ...  True
This is Theodore. He just saw an adult wearing...   12.0              10.0                Theodore  ...  True
This is Axel. He's a professional leaf catcher...   12.0              10.0                Axel      ...  True
Unique dog here. Oddly shaped tail. Long pink ...   4.0               10.0                None      ...  False
Breathtaking pupper here. Should be on the cov...   12.0              10.0                None      ...  True
This is Bear. He's a passionate believer of th...   12.0              10.0                Bear      ...  True

p3                              p3_dog  retweet_count  dog_type  all_stages          dog_stage  rating
American_Staffordshire_terrier  True    2812           NaN       NoneNoneNoneNone    None       1.3
Chesapeake_Bay_retriever        True    2006           NaN       NoneNoneNoneNone    None       1.0
dugong                          False   5174           NaN       NoneNoneNoneNone    None       0.5
Tibetan_mastiff                 True    1727           NaN       NoneNoneNoneNone    None       1.2
golden_retriever                True    1113           NaN       NoneNoneNoneNone    None       1.3
Pekinese                        True    3650           NaN       NoneNoneNoneNone    None       1.2
malamute                        True    3828           NaN       NoneNoneNoneNone    None       1.2
goldfish                        False   340            NaN       NoneNoneNoneNone    None       0.4
Eskimo_dog                      True    1195           pupper    NoneNonepupperNone  Pupper     1.2
beagle                          True    2982           NaN       NoneNoneNoneNone    None       1.2
[10 rows x 28 columns]
In [73]: df_twitter_dogs.loc[426]
Out[73]:
tweet_id              -
timestamp             -:01:07
source                Twitter for iPhone
text                  Here's a pupper in a onesie. Quite pupset abou...
rating_numerator      12
rating_denominator    10
name                  None
doggo                 None
floofer               None
pupper                pupper
puppo                 None
jpg_url               https://pbs.twimg.com/media/Czky0v9VIAEXRkd.jpg
img_num               1
p1                    seat_belt
p1_conf               -
p1_dog                False
p2                    toy_poodle
p2_conf               -
p2_dog                True
p3                    golden_retriever
p3_conf               -
p3_dog                True
favorite_count        8784
retweet_count         2509
dog_type              pupper
all_stages            NoneNonepupperNone
dog_stage             Pupper
rating                1.2
Name: 426, dtype: object
In [74]: df_twitter_dogs.rating.describe()
Out[74]:
count    -
mean     -
std      -
min      -
25%      -
50%      -
75%      -
max      -
Name: rating, dtype: float64
In [75]: df_twitter_dogs.rating.head()
Out[75]: -
Name: rating, dtype: float64
1.4 Storing Data
Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
In [76]: df_twitter_dogs.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)
1.5 Analyzing and Visualizing Data
In [77]: twitter_archive_master = pd.read_csv('twitter_archive_master.csv')
1.5.1 Insights:
1. Is there a correlation between the retweet counts and favorite counts over time?
2. The most used Twitter source
3. The most popular dog name
1.5.2 1. Is there a correlation between the retweet counts and favorite counts over time?
1.5.3 Visualization
In [78]: sns.lmplot(x="retweet_count", y="favorite_count", data = twitter_archive_master, size = 5, aspect=1.3, scatter_kws={'alpha':1/5});
         plt.title('Favorite Count vs. Retweet Count');
         plt.xlabel('Retweet Count');
         plt.ylabel('Favorite Count');
• This plot shows that there is a positive correlation between favorite counts and retweet counts.
1.5.4 2. The most used Twitter source
In [79]: source = twitter_archive_master['source'].value_counts()
         source
Out[79]: Twitter for iPhone    1955
         Twitter Web Client      28
         TweetDeck               11
         Name: source, dtype: int64
1.5.5 Visualization
In [80]: #plot
         g_bar = source.plot.bar(color = 'blue', fontsize = 15)
         #figure size(width, height)
         g_bar.figure.set_size_inches(8, 8);
         #Add labels
         plt.title('Most used Twitter source', color = 'black', fontsize = '15')
         plt.xlabel('Source', color = 'black', fontsize = '15')
         plt.ylabel('Number of tweets', color = 'black', fontsize = '15');
• The most used Twitter source is Twitter for iPhone, which accounts for 1955 of the 1994 tweets.
1.5.6 3. The most popular dog name
In [81]: # Skip the first entry, which is the placeholder 'None'
         Dog_names = twitter_archive_master.name.value_counts()[1:10]
In [82]: Dog_names
Out[82]: Charlie    11
         Oliver     10
         Cooper     10
         Lucy       10
         Tucker      9
         Penny       9
         Winston     8
         Sadie       8
         Toby        7
         Name: name, dtype: int64
1.5.7 Visualization
In [83]: #plot
         g_bar = Dog_names.plot.bar(color = 'blue', fontsize = 15)
         #figure size(width, height)
         g_bar.figure.set_size_inches(8, 8);
         #Add labels
         plt.title('The Most popular Dog names', color = 'black', fontsize = '15')
         plt.xlabel('Name', color = 'black', fontsize = '15')
         plt.ylabel('Number of occurrence', color = 'black', fontsize = '15');
• The most popular dog name is Charlie, with 11 occurrences. Oliver, Cooper, and Lucy tie for second place with 10 occurrences each.
1.5.8 Sources
• Data Analysis Nanodegree/Data Wrangling/Lesson 3: Assessing Data/Concepts 4-18
• https://stackabuse.com/reading-and-writing-json-to-a-file-in-python
• https://stackoverflow.com/questions/-/measure-time-elapsed-in-python?answertab=oldest#tab-top
• https://stackoverflow.com/questions/-/twitter-api-get-tweets-with-specific-id
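A note on the cleaning code for Issue #10 and the name fix before it: the chained assignment in In [64] is what raises the SettingWithCopyWarning, and the loop in In [68] pulls the first decimal found in the text, which is not necessarily the one in front of the slash. The sketch below shows one possible alternative using a single .loc assignment and Series.str.extract with a capture group. It is only an illustration under assumptions: the small DataFrame df and its rows are made up here as a stand-in for df_twitter_dogs, and the variable names not_a_name, decimal_numerator, and has_decimal are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df_twitter_dogs; these rows are invented for illustration.
df = pd.DataFrame({
    "name": ["Bella", "a", "Logan", "the"],
    "text": [
        "This is Bella. 13.5/10",
        "Here is a pupper. 12/10",
        "This is Logan. 9.75/10",
        "Here we have an average of 11.26/10",
    ],
    "rating_numerator": [5.0, 12.0, 75.0, 26.0],
    "rating_denominator": [10.0, 10.0, 10.0, 10.0],
})

# Names that start with a lowercase letter are ordinary words, not dog names.
# One .loc call selects rows and column together, so the assignment targets
# the frame itself rather than an intermediate object.
not_a_name = df["name"].str.match(r"[a-z]+")
df.loc[not_a_name, "name"] = "None"

# Capture only the decimal number that sits directly in front of "/<denominator>".
decimal_numerator = df["text"].str.extract(r"(\d+\.\d+)/\d+", expand=False)
has_decimal = decimal_numerator.notna()
df.loc[has_decimal, "rating_numerator"] = decimal_numerator[has_decimal].astype(float)

# Same standardized rating column as in In [71].
df["rating"] = df["rating_numerator"] / df["rating_denominator"]
print(df[["name", "rating_numerator", "rating"]])
```

If df_twitter_dogs was itself created by filtering another DataFrame, pandas may still warn; calling .copy() when the filtered frame is created is the usual remedy in that case.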