Chukwuemeka Michael Obioha | Freelancer Portfolio Item #301972

wrangle_report
August 27, 2022

1 Reporting: Wrangle Report

The dataset that I wrangled (and analyzed and visualized) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people’s dogs with a humorous comment about the dog.

The WeRateDogs project goals included:
1. Wrangling the Twitter data through the following processes:
• Gathering data
• Assessing data
• Cleaning data
2. Storing, analyzing, and visualizing the wrangled data
3. Writing a report on the data wrangling efforts and the data analyses and visualizations

Gathering The Data

I gathered data from the following sources:
• The WeRateDogs Twitter archive. The twitter_archive_enhanced.csv file was provided to me by Udacity. WeRateDogs downloaded their Twitter archive and sent it to Udacity exclusively for Udacity students to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.
• The image_predictions.tsv file, which I programmatically downloaded from Udacity’s server using the requests library and the URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_imagepredictions/image-predictions.tsv
• The Twitter API, queried with Python’s Tweepy library, to gather each tweet’s retweet count and favorite count.

Assessing The Data

After gathering the data, I assessed it for quality and tidiness issues. My assessment identified the following issues:

Quality issues
1. The archive contains retweets and tweets without images; only the original ratings (no retweets) that have images should be kept.
2. Some columns are not needed for our analysis.
3. Datatype errors in the following columns: tweet_id, source, timestamp.
4. Rating numerators with decimals need to be corrected.
5. Some of the records have more than one dog stage.
6. The source column is an HTML-formatted string, not a plain string.
7. Errors in dog names: some values (e.g. a, an, actually) are not dog names.
8. Dog ratings are not standardized.

Tidiness issues
1. The last four columns (doggo, floofer, pupper, puppo) all relate to the same variable.
2. The Twitter API table (retweet_count, favorite_count, followers_count) and the image predictions table should be merged into the Twitter archive table.

Cleaning The Data

After assessing the data, I cleaned it. For each issue, I followed these steps:
- Define the solution to the problem
- Write the code for the solution
- Test the code

The cleaning tasks were:
1. Merge the Twitter API table, the image predictions table, and the Twitter archive table to form one DataFrame
2. Merge the doggo, floofer, pupper, and puppo columns in the twitter_archive table into one column named "dog_stage"
3. Delete retweets
4. Drop all the columns not needed for our analysis
5. Fix datatype errors in tweet_id, source, timestamp
6. Correct the numerators with decimals
7. Create a function to check for dog stages
8. Change all HTML-formatted strings to normal strings
9. Correct all errors in dog names
10. Standardize dog ratings
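
Several of the cleaning tasks above (merging the stage columns, correcting decimal numerators, converting the HTML-formatted source, and fixing dog names) can be sketched as small row-level helpers. This is only an illustrative sketch, not the code used in the project: the function names are hypothetical, and it assumes that a missing stage or name is stored as the string "None", that ratings appear in the tweet text as "numerator/denominator", and that real dog names start with a capital letter.

```python
import re

# Hypothetical helpers; the names and the "None" placeholder are assumptions.
STAGES = ("doggo", "floofer", "pupper", "puppo")

# A numerator that may carry a decimal part, then "/", then the denominator.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

# The text between an opening <a ...> tag and its closing </a>.
ANCHOR_RE = re.compile(r"<a[^>]*>(.*?)</a>")


def combine_stages(row):
    """Collapse the four stage columns into a single dog_stage value.

    A stage counts as present when its column holds the stage's own name.
    Records with several stages keep all of them, joined by ", ".
    """
    found = [stage for stage in STAGES if row.get(stage) == stage]
    return ", ".join(found) if found else None


def extract_rating(text):
    """Return (numerator, denominator) from the first rating in the text.

    The numerator is parsed as a float so that decimals are preserved.
    """
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def strip_source_html(source):
    """Return the readable text inside an <a> tag, or the input unchanged."""
    match = ANCHOR_RE.search(source)
    return match.group(1) if match else source


def clean_name(name):
    """Treat placeholders and all-lowercase words (a, an, actually) as missing."""
    if not name or name == "None" or name.islower():
        return None
    return name
```

With pandas, helpers like these could be applied per row or per column, for example through DataFrame.apply(combine_stages, axis=1) or Series.apply(strip_source_html).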