wrangle_report
August 27, 2022
1 Reporting: Wrangle Report
The dataset that I wrangled (and analyzed and visualized) is the tweet archive of Twitter user
@dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people’s
dogs with a humorous comment about the dog. The WeRateDogs project goals included:
1. Wrangling the Twitter data through the following processes:
• Gathering data
• Assessing data
• Cleaning data
2. Storing, analyzing, and visualizing the wrangled data
3. Writing a report on the data wrangling efforts and data analyses and visualizations
Gathering The Data
I gathered data from the following sources:
• The WeRateDogs Twitter archive. The twitter_archive_enhanced.csv file was provided to me
by Udacity. WeRateDogs downloaded their Twitter archive and sent it to Udacity exclusively
for Udacity students to use in this project. This archive contains basic tweet data (tweet ID,
timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.
• I programmatically downloaded the image_predictions.tsv file from Udacity's server using
the requests library and the URL:
https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_imagepredictions/image-predictions.tsv
• I used the Twitter API and Python's Tweepy library to gather each tweet's retweet count and
favorite count
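The two programmatic gathering steps above could look roughly like the following sketch. The URL is copied from this report as written. The function names and the one-JSON-object-per-line storage format for the API results are assumptions made for illustration, not necessarily what the project notebook used.

```python
import json

# URL as given in this report.
IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_imagepredictions/image-predictions.tsv"
)


def download_image_predictions(url=IMAGE_PREDICTIONS_URL,
                               dest="image_predictions.tsv"):
    """Download the image predictions file and save it to disk."""
    # Third-party library, imported here so the parsing helper below
    # can be used without it being installed.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(dest, "wb") as fh:
        fh.write(response.content)
    return dest


def parse_tweet_counts(json_lines):
    """Extract the counts from tweet JSON strings, one tweet per string.

    Assumption: each tweet returned by the API was stored as one JSON
    object per line, with the standard `id`, `retweet_count` and
    `favorite_count` fields.
    """
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": str(tweet["id"]),
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return rows
```

Storing tweet_id as a string at this stage avoids a later type conversion, since the IDs are identifiers rather than quantities.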
Assessing The Data
After gathering the data, I assessed it for quality and tidiness issues. The assessment
identified the following issues:
Quality issues
1. The archive includes retweets and tweets without images; only original ratings with images
should be kept
2. Some columns are not needed for the analysis
3. Incorrect data types in the following columns: tweet_id, source, timestamp
4. Rating numerators with decimals are not recorded correctly
5. Some of the records have more than one dog stage
6. The source column holds HTML-formatted strings, not plain strings
7. Some dog names are wrong (e.g. a, an, actually are not dog names)
8. Dog ratings are not standardized
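Several of these quality issues can be flagged with small checks. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that each dog stage column holds either the stage name or a placeholder such as "None", and that wrongly extracted names are the all-lowercase ones.

```python
import re

STAGES = ("doggo", "floofer", "pupper", "puppo")

# A rating whose numerator contains a decimal point, e.g. "13.5/10".
DECIMAL_RATING = re.compile(r"(\d+\.\d+)/(\d+)")


def is_suspicious_name(name):
    """Flag all-lowercase words such as 'a' or 'an', which are unlikely dog names."""
    return name.islower()


def count_stages(row):
    """Count how many dog stage columns are set in one record (a dict)."""
    return sum(1 for stage in STAGES if row.get(stage) == stage)


def has_decimal_rating(text):
    """Return True if the tweet text contains a rating with a decimal numerator."""
    return DECIMAL_RATING.search(text) is not None
```

A record for which `count_stages` returns more than 1 corresponds to issue 5, and a tweet for which `has_decimal_rating` is true is a candidate for issue 4.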
Tidiness issues
1. The last four columns (doggo, floofer, pupper, puppo) all relate to the same variable
2. The Twitter API table (retweet_count, favorite_count, followers_count) and the image
predictions table should be merged into the Twitter archive table.
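The second tidiness issue amounts to joining three tables on the tweet ID. A minimal sketch with pandas follows, using tiny made-up tables. The function name, the sample values, the jpg_url column, and the choice of an inner join (which keeps only tweets present in all three tables) are assumptions for illustration.

```python
import pandas as pd


def merge_tables(archive, api, images):
    """Inner-join the three tables on tweet_id."""
    merged = archive.merge(api, on="tweet_id", how="inner")
    return merged.merge(images, on="tweet_id", how="inner")


# Made-up sample data, only to show the shape of the result.
archive = pd.DataFrame({"tweet_id": ["1", "2", "3"],
                        "text": ["first", "second", "third"]})
api = pd.DataFrame({"tweet_id": ["1", "2"],
                    "retweet_count": [10, 20],
                    "favorite_count": [100, 200]})
images = pd.DataFrame({"tweet_id": ["1", "3"],
                       "jpg_url": ["one.jpg", "three.jpg"]})

master = merge_tables(archive, api, images)
```

With an inner join, tweets that lack API counts or an image prediction drop out of the result, which fits the goal of keeping only ratings that have images. A left join would keep every archive row and leave the missing values empty instead.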
Cleaning The Data
After assessing the data, I cleaned each issue using the following steps:
• Define the solution to the problem
• Write the code for the solution
• Test the code
I applied these steps to the following cleaning tasks:
1. Merge the Twitter API table, the image predictions table, and the Twitter archive table to
form one DataFrame
2. Merge the doggo, floofer, pupper, and puppo columns in the twitter_archive table into
one column named "dog_stage"
3. Delete retweets
4. Drop all the columns not needed for the analysis
5. Fix data type errors in tweet_id, source, timestamp
6. Correct the numerators with decimals
7. Create a function to check for dog stages
8. Change all HTML-formatted strings to plain strings
9. Correct all errors in dog names
10. Standardize dog ratings
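A few of the cleaning tasks above (2, 6, 8, and 10) can be sketched as small per-value helpers. This is not the notebook's actual code: the function names are hypothetical, the stage columns are assumed to hold either the stage name or a placeholder such as "None", and rescaling every rating to a denominator of 10 is only one possible reading of "standardize".

```python
import re

STAGES = ("doggo", "floofer", "pupper", "puppo")

# Numerator (optionally with a decimal part) followed by "/" and a denominator.
RATING = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")
# Any HTML tag, e.g. <a href="..."> or </a>.
TAG = re.compile(r"<[^>]+>")


def combine_stages(row):
    """Collapse the four stage columns of one record into a single value."""
    found = [stage for stage in STAGES if row.get(stage) == stage]
    return ", ".join(found) if found else None


def strip_html(source):
    """Remove HTML tags, leaving only the visible text."""
    return TAG.sub("", source).strip()


def extract_rating(text):
    """Return (numerator, denominator) from the first rating in the text."""
    match = RATING.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def standardize_rating(numerator, denominator):
    """Rescale a rating to a denominator of 10 (an assumed convention)."""
    return numerator * 10 / denominator
```

In a DataFrame, helpers like these would typically be applied column-wise (for example with `Series.map`) or row-wise (with `DataFrame.apply` and `axis=1`), followed by a test that the old problem values are gone.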