Mohamed Ihab Khalifa | Freelancer Business Data Analysis

Business Data Analysis

Business Data Analysis This project is centered on reviewing and analyzing businesses. It showcases a multiplicity of different data analysis methods and techniques learned whilst undertaking my course, 'Data Analysis with Python', offered online by the University of Pennsylvania. Further, it utilizes a variety of popular Python libraries for data analysis such as pandas and matplotlib. The project covers a range of core data analysis skills, including: 1) Reading and inspecting Excel files 2) Merging and updating data 3) Querying data 4) Grouping, summarizing, and statistically analyzing data 5) Data visualization 6) Writing and/or updating Excel files The data being analyzed are contained in the file 'yelp.xlsx', which consists of data pertaining to a variety of local businesses, collected and published by yelp.com, a popular website for publishing crowd-sourced reviews about businesses. The file is comprised of three sheets (with the following coloumns): yelp_data: 'name', 'category_0', 'category_1', 'take_out', 'review_count', 'stars', 'city_id', 'state_id' cities: 'id', 'city' states: 'id', 'state' In [1]: #Importing the Python modules to be used import pandas as pd import openpyxl import numpy as np import matplotlib.pyplot as plt import warnings warnings.simplefilter("ignore") Part One: Loading and Inspecting Excel Data 1. Loading and Reading the File In [2]: #Accessing the file, 'yelp.xlsx' xl = pd.ExcelFile('yelp.xlsx') #Loading each sheet onto a separate dataframe df_yelp = xl.parse('yelp_data') df_cities = xl.parse('cities') df_states = xl.parse('states') 2. General Inspections of the File In [3]: #Inspecting the shape (rows x coloumns) of first sheet shape = df_yelp.shape print('Number of coloumns:', shape[1]) print('Number of rows:', shape[0]) Number of coloumns: 8 Number of rows: 600 In [4]: #Inspecting the coloumn headers of the first sheet print('Coloumn headers in the sheet \'yelp_data\':') for column in list(df_yelp.columns): print(column) Coloumn headers in the sheet 'yelp_data': name category_0 category_1 take_out review_count stars city_id state_id In [5]: #Display the first 5 enteries of data in the first sheet print('The first 5 enteries off the \'yelp_data\' sheet:') df_yelp.head() The first 5 enteries off the 'yelp_data' sheet: name category_0 category_1 take_out review_count stars city_id state_id 0 China Sea Chinese Restaurant Restaurants Chinese True 11 2.5 1 1 1 Discount Tire Center Tires Automotive False 24 4.5 1 1 2 Frankfurters Restaurants Hot Dogs True 3 4.5 1 1 3 Fred Dietz Floral Shopping Flowers & Gifts False 6 4.0 1 1 4 Kuhn's Market Food Grocery False 8 3.5 1 1 Out[5]: Part Two: Merging and Updating Data 1. Merging Sheets In [6]: #Merging the three sheets, 'yelp_data', 'cities', and 'states', into one dataframe 'df' #Merging the two dataframes 'df_yelp' and 'df_cities' df = pd.merge(left=df_yelp, right=df_cities, how='inner', left_on='city_id', right_on='id') #Merging the dataframes 'df' and 'df_states' #specifying the dataframes to mer #specifying the type of data merg #specifying the common coloumns to m df = pd.merge(left=df, right=df_states, how='inner', left_on='state_id', right_on='id') #Displaying the first 5 rows after merging print('Dataframe after merging:') df.head() Dataframe after merging: name category_0 category_1 take_out review_count stars city_id state_id id_x city id_y state 0 China Sea Chinese Restaurant Restaurants Chinese True 11 2.5 1 1 1 Bellevue 1 PA 1 Discount Tire Center Tires Automotive False 24 4.5 1 1 1 Bellevue 1 PA 2 Frankfurters Restaurants Hot Dogs True 3 4.5 1 1 1 Bellevue 1 PA 3 Fred Dietz Floral Shopping Flowers & Gifts False 6 4.0 1 1 1 Bellevue 1 PA 4 Kuhn's Market Food Grocery False 8 3.5 1 1 1 Bellevue 1 PA Out[6]: 2. Updating the Dataframe In [7]: #Deleting duplicate coloumns df = df.drop(['id_x', 'id_y'],axis=1) #Renaming the coloumns, 'category_0' and 'category_1', to be more representative of thei df.rename(columns={'category_0': 'business type', 'category_1': 'service type'}, inplace=True) #specifying the old (left) and n #Displaying the coloumn titles again print('Coloumn titles after renaming:\n', list(df.columns)) print('') #Rearranging the coloumns order rearranged_coloumns = ['name', 'business type', 'service type', 'stars', 'review_count', df = df.reindex(columns=rearranged_coloumns) #To preview the dataframe after updates print('Updated dataframe:') df.head() Coloumn titles after renaming: ['name', 'business type', 'service type', 'take_out', 'review_count', 'stars', 'city_i d', 'state_id', 'city', 'state'] Updated dataframe: name business type service type stars review_count take_out city city_id state state_id 0 China Sea Chinese Restaurant Restaurants Chinese 2.5 11 True Bellevue 1 PA 1 1 Discount Tire Tires Automotive 4.5 24 False Bellevue 1 PA 1 Out[7]: Center 2 Frankfurters Restaurants Hot Dogs 4.5 3 True Bellevue 1 PA 1 3 Fred Dietz Floral Shopping Flowers & Gifts 4.0 6 False Bellevue 1 PA 1 4 Kuhn's Market Food Grocery 3.5 8 False Bellevue 1 PA 1 In [8]: #Displaying only the coloumns 'business type' and 'service type' print('The following table displays the type of business for each business listed and th df[['business type', 'service type']] The following table displays the type of business for each business listed and the servi ce they offer: business type service type 0 Restaurants Chinese 1 Tires Automotive 2 Restaurants Hot Dogs 3 Shopping Flowers & Gifts 4 Food Grocery ... ... ... 595 Local Services Recycling Center 596 Local Services Recycling Center 597 Restaurants Fast Food 598 Shopping Heating & Air Conditioning/HVAC 599 Automotive Towing Out[8]: 600 rows × 2 columns Part Three: Querying Data General queries 1. Report businesses located in Las Vegas only In [9]: #Filtering data for Las Vegas (LA) enteries only LA_filter = df['city'] == 'Las Vegas' df_LA = df[LA_filter] #reporting businesses located in LA only print('The following table displays businesses located in Las Vegas only:') df_LA The following table displays businesses located in Las Vegas only: Out[9]: 430 name business type service type stars review_count take_out city city_id state state_id Reliable Tax Services Financial Services Accountants 4.5 3 False Las Vegas 12 NV 2 431 Paradise Electro Stimulations Shopping Adult 4.5 4 False 432 The Studio Shopping Adult 5.0 7 False 433 Boomer's Bar Bars American (Traditional) 3.5 19 434 Omelet House Breakfast & Brunch American (Traditional) 4.5 ... ... ... ... 558 It's Paradise Boutique Shopping 559 Smog Busters Repair Center 560 Las Vegas 12 NV 2 Las Vegas 12 NV 2 False Las Vegas 12 NV 2 703 True Las Vegas 12 NV 2 ... ... ... ... ... ... ... Fashion 3.5 3 False Las Vegas 12 NV 2 Automotive Smog Check Stations 4.0 3 False Las Vegas 12 NV 2 A&M Enterprises Financial Services Tax Services 2.5 3 False Las Vegas 12 NV 2 561 Gray Line of Las Vegas Hotels & Travel Transportation 1.5 3 False Las Vegas 12 NV 2 562 AAA Hotels & Travel Travel Services 3.5 12 False Las Vegas 12 NV 2 133 rows × 10 columns 2. Report businesses classified as 'Restaurants' (in either categories, business/service type) In [10]: #Filtering data for 'Restaurants' in coloumns 'business type' and 'service type' rest_filter1 = df['business type'] == 'Restaurants' rest_filter2 = df['service type'] == 'Restaurants' df_restaurants = df[(rest_filter1 | rest_filter2)] #i.e., restaurants is 'either #reporting businesses classified as 'Restaurants' only print('The following table displays all the restaurants featured in the data set:') df_restaurants The following table displays all the restaurants featured in the data set: name business type service type stars review_count take_out city city_id state state_id 0 China Sea Chinese Restaurant Restaurants Chinese 2.5 11 True Bellevue 1 PA 1 2 Frankfurters Restaurants Hot Dogs 4.5 3 True Bellevue 1 PA 1 6 Luigi's Pizzeria Restaurants Pizza 4.0 18 True Bellevue 1 PA 1 8 R & B's Pizza Place Restaurants Pizza 4.0 17 True Bellevue 1 PA 1 9 Rusty Nail Restaurants American (Traditional) 3.5 32 True Bellevue 1 PA 1 ... ... ... ... ... ... ... ... ... ... ... 536 Little Caesars Pizza Restaurants Pizza 4.0 4 True Las Vegas 12 NV 2 Out[10]: 537 Pho So 1 Restaurants Vietnamese 3.5 763 True 538 Rice To-Go Restaurants Chinese 3.0 19 True 539 Wienerschnitzel Restaurants Fast Food 1.0 3 597 Jack In the Box Restaurants Fast Food 3.0 12 Las Vegas 12 NV 2 Las Vegas 12 NV 2 False Las Vegas 12 NV 2 True North Las Vegas 13 NV 2 71 rows × 10 columns 3. Report businesses classified as either 'Restaurants' or 'Bars' (in both categories) In [11]: #Filtering data for both 'Restaurants' and 'Bars' simultaneously rest_filter = df['business type'].isin(['Restaurants', 'Bars']) bars_filter = df['service type'].isin(['Restaurants', 'Bars']) df_rest_bars = df[(rest_filter | bars_filter)] #reporting businesses after filtering print('The following table displays businesses classified as either restaurants or bars: df_rest_bars The following table displays businesses classified as either restaurants or bars: name business type service type stars review_count take_out city city_id state state_id 0 China Sea Chinese Restaurant Restaurants Chinese 2.5 11 True Bellevue 1 PA 1 2 Frankfurters Restaurants Hot Dogs 4.5 3 True Bellevue 1 PA 1 6 Luigi's Pizzeria Restaurants Pizza 4.0 18 True Bellevue 1 PA 1 8 R & B's Pizza Place Restaurants Pizza 4.0 17 True Bellevue 1 PA 1 9 Rusty Nail Restaurants American (Traditional) 3.5 32 True Bellevue 1 PA 1 ... ... ... ... ... ... ... ... ... ... ... 537 Pho So 1 Restaurants Vietnamese 3.5 763 True Las Vegas 12 NV 2 538 Rice To-Go Restaurants Chinese 3.0 19 True Las Vegas 12 NV 2 539 Wienerschnitzel Restaurants Fast Food 1.0 3 False Las Vegas 12 NV 2 593 Sporting Chance Saloon Bars Nightlife 3.5 7 False North Las Vegas 13 NV 2 597 Jack In the Box Restaurants Fast Food 3.0 12 True North Las Vegas 13 NV 2 Out[11]: 98 rows × 10 columns 4. Report Restaurants and Bars located in Las Vegas only In [12]: #Adding a city filter to extract those in LA LA_filter = df['city'] == 'Las Vegas' LA_rest_bars = df[LA_filter & (rest_filter | bars_filter)] #i.e., the city filter #reporting bars and/or restaurans in LA print('The following table lists businesses in Las Vegas that are classified as restaura LA_rest_bars The following table lists businesses in Las Vegas that are classified as restaurants or bars: name business type service type stars review_count take_out city city_id state state_id 433 Boomer's Bar Bars American (Traditional) 3.5 19 False Las Vegas 12 NV 2 450 Hard Hat Lounge Dive Bars Bars 3.5 31 False Las Vegas 12 NV 2 451 Huntridge Tavern Dive Bars Bars 4.0 50 False Las Vegas 12 NV 2 452 Jack's Pub Pubs Bars 2.5 20 False Las Vegas 12 NV 2 453 Moon Doggies Dive Bars Bars 4.0 85 False Las Vegas 12 NV 2 515 Jake's Bar Bars Nightlife 4.5 12 False Las Vegas 12 NV 2 527 Capriotti's Sandwich Shop Restaurants Sandwiches 4.0 271 True Las Vegas 12 NV 2 528 Carl's Jr Restaurants Fast Food 1.5 5 True Las Vegas 12 NV 2 529 Chapala's Mexican Restaurant Restaurants Mexican 3.0 19 True Las Vegas 12 NV 2 530 China Star Restaurants Chinese 3.0 45 True Las Vegas 12 NV 2 531 Chinatown Cuisine Restaurants Mexican 4.5 4 True Las Vegas 12 NV 2 532 Emperor's Garden Restaurants Chinese 2.0 129 True Las Vegas 12 NV 2 533 Food Express Chinese Restaurant Restaurants Chinese 4.0 178 True Las Vegas 12 NV 2 534 Ichiza Restaurants Japanese 4.0 1305 True Las Vegas 12 NV 2 535 India Oven Restaurants Indian 3.5 130 True Las Vegas 12 NV 2 536 Little Caesars Pizza Restaurants Pizza 4.0 4 True Las Vegas 12 NV 2 537 Pho So 1 Restaurants Vietnamese 3.5 763 True 12 NV 2 Out[12]: Las Vegas 538 Rice To-Go Restaurants Chinese 3.0 19 True Las Vegas 12 NV 2 539 Wienerschnitzel Restaurants Fast Food 1.0 3 False Las Vegas 12 NV 2 More specific queries 5. How many Beauty and Spa centers are there in the city of Henderson? In [13]: #Filtering cities for Henderson Henderson_filter = df['city'] == 'Henderson' #filtering businesses for Beauty & Spa centers BnS_filter1 = df['business type'] == 'Beauty & Spas' BnS_filter2 = df['service type'] == 'Beauty & Spas' #creating a filtered dataframe df_BnS_Henderson = df[Henderson_filter & (BnS_filter1 | BnS_filter2)] #Reporting the number of beauty & spa centers print('The total number of Beauty and Spa centers in Henderson is:', len(df_BnS_Henderso The total number of Beauty and Spa centers in Henderson is: 14 6. What type of service does the business 'Dairy Queen' offer? In [14]: #Filtering business names for 'Dairy Queen' DQ_filter = df['name'] == 'Dairy Queen' df_DQ = df[DQ_filter] print('Dairy Queen offers:', df_DQ['service type'].values[0]) Dairy Queen offers: Ice Cream & Frozen Yogurt 7. How many 'dive bars' are there in Las Vegas? Can you recommend one in that city with at least a 4 star rating? In [15]: #Filtering data by city LA_filter = df['city'] == 'Las Vegas' #filtering data for dive bars in business and service categories divebars_filter1 = df['business type'] == 'Dive Bars' divebars_filter2 = df['service type'] == 'Dive Bars' #creating a filtered dataframe LA_divebars = df[LA_filter & (divebars_filter1 | divebars_filter2)] #reporting the number of dive bars print('The total number of dive bars in Las Vegas is:', len(LA_divebars)) print('') #Recommending a dive bar with at least a 4-star rating (randomly) #adding a rating filter rating_filter = LA_divebars['stars'] >= 4 LA_divebars_filtered = LA_divebars[rating_filter] #resetting the index of the dataframe LA_divebars LA_divebars_filtered.reset_index(drop=True, inplace=True) #getting a random index to make a random recommendation import random random_index = random.randint(0, len(LA_divebars_filtered)-1) #reporting a random LA dive bar with 4-star rating or above random_divebar = LA_divebars_filtered.iloc[random_index] print('Here\'s your recommendation of a highly rated dive bar in LA:', random_divebar['n The total number of dive bars in Las Vegas is: 3 Here's your recommendation of a highly rated dive bar in LA: Moon Doggies 8. Which city has the largest number of pizza shops? Can you recommend a random one with at least a 4-star rating and which offers take-outs? In [16]: #Filtering data for pizza restaurants pizza_filter1 = df['business type'].str.contains('Pizza') pizza_filter2 = df['service type'].str.contains('Pizza') #creating a dataframe with pizza businesses only df_pizza = df[(pizza_filter1 | pizza_filter2)] #Grouping and filtering data based on city cities = df_pizza.groupby('city').groups maxcount = None maxcity = None for city, val in cities.items(): if maxcount is None or len(val) > maxcount: maxcount = len(val) maxcity = city else: continue #returns a dictionary #to get the city with most pizza #Reporting the result print('The city with the most pizza shops is:', maxcity) print('') #Recommending a random pizza shop with a 4-star rating or above and which offers take-ou #adding filters for each of the requirements city_filter = df_pizza['city'] == maxcity rating_filter = df_pizza['stars'] >= 4 take_outs_filter = df_pizza['take_out'] == True #creating a dataframe with data meeting the specified requirements df_pizza_filtered = df_pizza[city_filter & rating_filter & take_outs_filter] #resetting the index of the dataframe df_pizza_filtered.reset_index(drop=True, inplace=True) #selecting one at random random_indx = random.randint(0, len(df_pizza_filtered)-1) random_pizza = df_pizza_filtered['name'].iloc[random_indx] #reporting the results print('Here is your recommendation:', random_pizza) The city with the most pizza shops is: Pittsburgh Here is your recommendation: Eddie's Pizza Haus Part Four: Grouping, Summarizing, and Statistically Analyzing Data 1. Report the mean and standard deviation of rating scores of businesses for each city separately In [17]: #grouping data by city and selecting statistics to perform city_stats = df.groupby('city').agg([np.mean, np.std])['stars'] #reporting the rating mean and standard deviations per city print('The following table displays the mean and standard deviation scores of businesses city_stats The following table displays the mean and standard deviation scores of businesses per ci ty: mean std Bellevue - - Braddock - - Carnegie - - Henderson - - Homestead - - Las Vegas - - Mc Kees Rocks - - Mount Lebanon - - Munhall - - North Las Vegas - - Pittsburgh - - West Homestead - - West Mifflin - - Out[17]: city 2. Report the mean star ratings and total number of reviews of restaurants for each city separately In [18]: #filtering dataframe for resturants only rest_filter1 = df['business type'] == 'Restaurants' rest_filter2 = df['service type'] == 'Restaurants' df_restaurants = df[(rest_filter1 | rest_filter2)] #grouping data by city and selecting statistics for each coloumn to analyze city_rest_stats = df_restaurants.groupby('city').agg({'stars': np.mean, 'review_count': #reporting the mean of star ratings and sum of reviews for restaurants per city print('The following table demonstrates the average rating score and total sum of review city_rest_stats The following table demonstrates the average rating score and total sum of reviews for r estaurants per city: stars Out[18]: city review_count Bellevue - 84 Carnegie - 104 Henderson - 1127 Homestead - 6 Las Vegas - 2875 Mc Kees Rocks - 22 Munhall - 19 North Las Vegas - 12 Pittsburgh - 1809 West Mifflin - 15 2.2. Based on the results, which city has the best restaurants? In [19]: #traversing through the dataframe to get the name of the city with the highest mean rati maxrating = None maxcity = None for rating in city_rest_stats['stars']: if maxrating is None or rating > maxrating: maxrating = rating rating_filt = city_rest_stats['stars'] == rating maxcity = city_rest_stats[rating_filt].index.values else: continue #Reporting city whose restaurants received the highest mean rating print('The city with the best restaurants:', maxcity[0]) The city with the best restaurants: West Mifflin 3. Create a pivot table showing the mean star rating and mean reviews of businesses, grouping the results by city In [20]: #creating the pivot table pivot_tab = pd.pivot_table(df, #specifying the dataframe to analyze index=['city'], #specifying the coloumn to group data by values=['stars', 'review_count'], #specifying the coloumns to an aggfunc=np.mean) #specifying the statistic to perform #reporting the results of the table print('The following pivot table displays the average star rating and review count for b pivot_tab The following pivot table displays the average star rating and review count for business es per city: review_count stars Bellevue - - Braddock - - Carnegie - - Out[20]: city Henderson - - Homestead - - Las Vegas - - Mc Kees Rocks - - Mount Lebanon - - Munhall - - North Las Vegas - - Pittsburgh - - West Homestead - - West Mifflin - - 4. Create a pivot table showing the mean star rating and total sum of reviews for businesses classified as 'Hotels & Travel', grouping the results by state and city In [21]: #filtering data for businesses classified as Hotels & Travel Hotels_filter1 = df['business type'] == 'Hotels & Travel' Hotels_filter2 = df['service type'] == 'Hotels & Travel' #creating dataframe with hotels & travel businesses df_Hotels = df[(Hotels_filter1 | Hotels_filter2)] #creating a pivot table to calculate their mean star ratings and sum of reviews pivot_Hotels = pd.pivot_table(df_Hotels, index=['state', 'city'], #grouping results by values=['stars', 'review_count'], #specifying the co aggfunc={'stars': np.mean, 'review_count': np.sum}) #reporting the pivot table print('The following pivot table displays the average star ratings and total sum of revi pivot_Hotels The following pivot table displays the average star ratings and total sum of reviews for hotels & travel businesses, based on their state and city locations: Out[21]: review_count stars state city NV Las Vegas 559 2.1 PA Carnegie 11 3.5 Homestead 16 4.0 Pittsburgh 190 3.5 West Mifflin 3 4.5 Part Five: Data Visualization 1. Create a histogram to compare the frequency distribution of rating scores of businesses in Pittsburgh vs. Las Vegas In [22]: #filtering data for cities Pittsburgh and Las Vegas Pitts_filter = df['city'] == 'Pittsburgh' Pitts_rating = df[Pitts_filter]['stars'] LA_filter = df['city'] == 'Las Vegas' LA_rating = df[LA_filter]['stars'] #Setting the figure size plt.figure(figsize=(10,6)) #plotting histogram to show frequency distribution of different businesses ratings in Pi plt.hist(Pitts_rating, label='Pittsburgh', #setting the label/description of the histogram color='#9cd095', #specifying the color of histogram bars alpha=0.8, #setting the degree of bars transparency linewidth=1, edgecolor='k', #setting the bar edges' width and color bins='auto' #divides the bins along the x-axis automatically ) #plotting histogram to show frequency distribution of different businesses ratings in LA plt.hist(LA_rating, label='Las Vegas', color='#95c6d0', alpha=0.6, linewidth=1, edgecolor='k', bins='auto') #Adding a title to the histogram plt.title('Distribution of Rating Scores for Las Vegas and Pittsburgh Businesses', fonts #Adding labels to the histogram axes plt.xlabel('Rating Score', fontsize=12) plt.ylabel('Frequency of Rating Score', fontsize=12) #Adding a legend to describe the histogram better plt.legend(title='Business location:', loc='best') #specifies the title and locatio #To display the histogram plt.tight_layout() plt.show() In [23]: #Alternatively, comparison could be presented better using a histogram with 'non-overlapp #plotting a non-overlapping histogram plt.figure(figsize=(10,6)) plt.hist([Pitts_rating, LA_rating], label=['Pittsburgh', 'Las Vegas'], color=['#9cd095', '#95c6d0'], alpha=0.8, linewidth=0.5, edgecolor='k', bins='auto') #Adding a title plt.title('Distribution of Rating Scores for Las Vegas and Pittsburgh Businesses', fonts #Adding labels to histogram axes plt.xlabel('Rating Score', fontsize=12) plt.ylabel('Frequency of Rating Score', fontsize=12) #Adding a legend to describe the histogram better plt.legend(title='Business location:', loc='best') #To display the histogram plt.tight_layout() plt.show() 2. Is the popularity of a given business (measured by reviews count) a good indication of the qualify of its services (measured by star rating)? Create a scatter plot to assess the relationship between popularity and service quality along 3 different classes of businesses: 'Health & Medical', 'Beauty & Spas', and 'Fashion' In [24]: #filtering data for the three types of businesses Health_filter1 = df['business type'] == 'Health & Medical' Health_filter2 = df['service type'] == 'Health & Medical' df_Health = df[Health_filter1 | Health_filter2] Beauty_filter1 = df['business type'] == 'Beauty & Spas' Beauty_filter2 = df['service type'] == 'Beauty & Spas' df_Beauty = df[Beauty_filter1 | Beauty_filter2] Fashion_filter1 = df['business type'] == 'Fashion' Fashion_filter2 = df['service type'] == 'Fashion' df_Fashion = df[Fashion_filter1 | Fashion_filter2] #Creating a scatterplot for each class of business to compare popularity to service qual #first, setting the figure size plt.figure(figsize=(10,6)) #plotting the data for health and medical industries plt.scatter(df_Health['review_count'], #specifying the data points to plot alo df_Health['stars'], #specifying the data points to plot alon label='Health & Medical', #labeling the data points marker='o', #setting the marker type (circle-shaped) s=80, #setting the marker size c='#90b0d0', #setting the marker color alpha=1) #setting the degree of transparency of mar #plotting the data for the beauty industries plt.scatter(df_Beauty['review_count'], df_Beauty['stars'], label='Beauty & Spas', marker='o', s=80, c='#93d090', alpha=1) #plotting the data for the fashion industries plt.scatter(df_Fashion['review_count'], df_Fashion['stars'], label='Fashion', marker='o', s=80, c='#f89d68', alpha=0.9) #Adding a title to the scatter plot plt.title('The Relationship Between Popularity and Quality of Service', fontsize=15) #Adding labels to the scatterplot axes plt.xlabel('Popularity (measured by reviews frequency)', fontsize=12) plt.ylabel('Service Quality (measured by star rating)', fontsize=12) #Adding a legend plt.legend(title='Business Type:', loc='best') #To display the scatterplot plt.tight_layout() plt.show() 3. Create a pivot table that shows the mean star ratings for Health & Medical businesses located in three different cities: 'Pittsburgh', 'Henderson', and 'Las Vegas' Plot a bar chart to compare the mean star ratings in the three cities In [25]: #filtering data for Health & Medical Health_filter1 = df['business type'] == 'Health & Medical' Health_filter2 = df['service type'] == 'Health & Medical' #filtering data for cities Pitts_filter = df['city'] == 'Pittsburgh' Henderson_filter = df['city'] == 'Henderson' LA_filter = df['city'] == 'Las Vegas' #creating a dataframe with the filtered data df_HealthByCity = df[(Health_filter1 | Health_filter2) & (Pitts_filter | Henderson_filte #Creating a pivot table to calculate mean star rating per city pivot_HealthByCity = pd.pivot_table(df_HealthByCity, index=['city'], values=['stars'], aggfunc=np.mean) #Plotting a bar chart to compare the means between cities pivot_HealthByCity.plot(figsize=(9,6), kind='bar', color='#44749d', linewidth=1, edgecolor='k', width=0.35, legend=False, ) #Adding title and labeling the axes plt.title('Average Star Rating for Health & Medical Businesses Per City', fontsize=15) plt.xlabel('Cities', fontsize=12) plt.ylabel('Average Star Rating', fontsize=12) #adjusting the tick label's rotation plt.xticks(rotation=25) #adjusting the y-axis scaler plt.ylim(0, 5) #To display the bar chart plt.show() Part Six: Writing and/or Updating Excel Files 1. Writing a New Excel File Say we want to filter data for businesses classified as either restaurants or bars, store them into a dataframe, and write a new Excel file with that filtered dataframe In [26]: #First, filtering data for 'Restaurants' or 'Bars' business_filter = df['business type'].isin(['Restaurants', 'Bars']) service_filter = df['service type'].isin(['Restaurants', 'Bars']) df_RnB = df[business_filter | service_filter] #Writing the dataframe, df_RnB, into the file 'yelp_filtered.xlsx' df_RnB.to_excel('yelp_filtered.xlsx', #specifying file path and/or name sheet_name='Restaurants and Bars', #specifying sheet name index=False) #removing the index coloumn #in order to preview the file created try: df_newfile = pd.read_excel('yelp_filtered.xlsx') except: print('Error: file does not exist or cannot be accessed.') #to preview the first 5 enteries in the new file df_newfile.head() name Out[26]: business type service type stars review_count take_out city city_id state state_id 0 China Sea Chinese Restaurant Restaurants Chinese 2.5 11 True Bellevue 1 PA 1 1 Frankfurters Restaurants Hot Dogs 4.5 3 True Bellevue 1 PA 1 2 Luigi's Pizzeria Restaurants Pizza 4.0 18 True Bellevue 1 PA 1 3 R & B's Pizza Place Restaurants Pizza 4.0 17 True Bellevue 1 PA 1 4 Rusty Nail Restaurants American (Traditional) 3.5 32 True Bellevue 1 PA 1 2. Writing New Data into an Existing Excel File This time I'll extract restaurants and bars specifically located in Las Vegas, store it into a new dataframe, df_LA_RnB, and write it into a new sheet in the 'yelp_filtered.xlsx' file In [27]: #filtering data for LA restaurants and bars LA_filter = df_RnB['city'] == 'Las Vegas' df_LA_RnB = df_RnB[LA_filter] #Writing dataframe df_LA_RnB into a new sheet with pd.ExcelWriter('yelp_filtered.xlsx', engine='openpyxl', mode='a') as writer: df_LA_RnB.to_excel(writer, sheet_name='LA Restaurants and Bars', index=False) #to preview the data in the new sheet try: xl = pd.ExcelFile('yelp_filtered.xlsx') df_newsheet = xl.parse('LA Restaurants and Bars') except: print('Error: sheet does not exist or cannot be accessed.') #to preview the first 5 enteries df_newsheet.head() name business type service type stars review_count take_out city city_id state state_id 0 Boomer's Bar Bars American (Traditional) 3.5 19 False Las Vegas 12 NV 2 1 Hard Hat Lounge Dive Bars Bars 3.5 31 False Las Vegas 12 NV 2 2 Huntridge Tavern Dive Bars Bars 4.0 50 False Las Vegas 12 NV 2 3 Jack's Pub Pubs Bars 2.5 20 False Las Vegas 12 NV 2 4 Moon Doggies Dive Bars Bars 4.0 85 False Las Vegas 12 NV 2 Out[27]: 3. Appending Data into Existing Sheets in an Existing Excel File Lastly, I'll filter the data further, say to extract businesses classified as 'Hotels & Travel' as well as those in LA only, and append the filtered data into the two sheets I created earlier in the 'yelp_filtered.xlsx' file In [28]: #Filtering data for 'Hotels & Travel' Hotels_filter1 = df['business type'] == 'Hotels & Travel' Hotels_filter2 = df['service type'] == 'Hotels & Travel' df_Hotels = df[Hotels_filter1 | Hotels_filter2] #Extracting those in LA LA_filter = df_Hotels['city'] == 'Las Vegas' df_LA_Hotels = df_Hotels[LA_filter] #Appending the two dataframes, df_Hotels and df_LA_Hotels, into the two existing sheets with pd.ExcelWriter('yelp_filtered.xlsx', engine='openpyxl', mode='a', if_sheet_exists=' #specifying the workbook with the existing data writer.book = openpyxl.load_workbook('yelp_filtered.xlsx') #appending the dataframe df_Hotels to the 'Restaurants and Bars' sheet df_Hotels.to_excel(writer, sheet_name='Restaurants and Bars', #specifying the sheet to ap index=False, #to remove unnecessary ind header=False, #to remove unnecessary col startrow=(len(df_RnB)+1)) #specifying the starting r #appending the dataframe df_LA_Hotels to the 'LA Restaurants and Bars' sheet df_LA_Hotels.to_excel(writer, sheet_name='LA Restaurants and Bars', index=False, header=False, startrow=(len(df_LA_RnB)+1)) #Renaming the sheets to match current data #first, loading the file into a workbook object workbook_obj = openpyxl.load_workbook('yelp_filtered.xlsx') #Getting the original sheet names sheet_1 = workbook_obj['Restaurants and Bars'] sheet_2 = workbook_obj['LA Restaurants and Bars'] #Changing the sheet names sheet_1.title = 'Leisure Businesses' sheet_2.title = 'LA Leisure Businesses' #saving the workbook with the new updates workbook_obj.save('yelp_filtered.xlsx') In [29]: #To check if the data were appended successfully (& the other updates) xl = pd.ExcelFile('yelp_filtered.xlsx') df_sheet1 = xl.parse('Leisure Businesses') df_sheet2 = xl.parse('LA Leisure Businesses') #displaying the last 5 enteries in the first sheet print('Last 5 enteries in sheet \'Leisure Businesses\':') df_sheet1.tail() Last 5 enteries in sheet 'Leisure Businesses': Out[29]: name business type service type stars review_count take_out city city_id state state_id 107 Palace Station Hotel & Casino Hotels & Travel Arts & Entertainment 2.5 526 True 108 Econo Lodge Hotels & Travel Event Planning & Services 2.0 13 False 109 El Mirador Hotels & Travel Event Planning & Services 1.0 5 110 Gray Line of Las Vegas Hotels & Travel Transportation 1.5 111 AAA Hotels & Travel Travel Services 3.5 Las Vegas 12 NV 2 Las Vegas 12 NV 2 False Las Vegas 12 NV 2 3 False Las Vegas 12 NV 2 12 False Las Vegas 12 NV 2 In [30]: #displaying the last 5 enteries in the second sheet print('Last 5 enteries in sheet \'LA Leisure Businesses\':') df_sheet2.tail() Last 5 enteries in sheet 'LA Leisure Businesses': name business type service type stars review_count take_out city city_id state state_id 19 Palace Station Hotel & Casino Hotels & Travel Arts & Entertainment 2.5 526 True Las Vegas 12 NV 2 20 Econo Lodge Hotels & Travel Event Planning & Services 2.0 13 False Las Vegas 12 NV 2 21 El Mirador Hotels & Travel Event Planning & Services 1.0 5 False Las Vegas 12 NV 2 22 Gray Line of Las Vegas Hotels & Travel Transportation 1.5 3 False Las Vegas 12 NV 2 23 AAA Hotels & Travel Travel Services 3.5 12 False Las Vegas 12 NV 2 Out[30]: