🐍 Profitable App Profiles
Introduction
For this project, we’ll pretend we’re working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.
We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.
Step One: Analyze data sets
Analyse Appstore data
Download the source csv files under the following links:
Applestore Dataset
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']
Number of rows: 7198
Number of columns: 16
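A listing like the one above could be produced with a small helper along these lines. This is a minimal sketch, not necessarily the code used for this project: the `load_rows` and `explore_data` names and their parameters are assumptions, and a two-app in-memory sample stands in for the real CSV file so the snippet runs on its own.

```python
import csv
import io

# Hypothetical in-memory stand-in for the real file; the actual dataset
# would be read from disk with open(<path>, encoding='utf8') instead.
SAMPLE_CSV = """id,track_name,price,prime_genre
284882215,Facebook,0.0,Social Networking
389801252,Instagram,0.0,Photo & Video
"""

def load_rows(file_obj):
    """Read a CSV file object into a list of rows (header included)."""
    return list(csv.reader(file_obj))

def explore_data(dataset, start, end, rows_and_columns=False):
    """Print dataset[start:end] and, optionally, the dataset's dimensions."""
    for row in dataset[start:end]:
        print(row)
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

applestore_sample = load_rows(io.StringIO(SAMPLE_CSV))
explore_data(applestore_sample, 0, 3, rows_and_columns=True)
```

Every value comes back as a string, which is why the prices and ratings in the listings appear in quotes.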
The following columns from the applestore dataset can be helpful for our analysis:
- price
- prime_genre
- user_rating
Playstore Dataset
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
Number of rows: 10841
Number of columns: 13
As noted in the discussions about the playstore dataset, one row has missing data. That's why we have to delete row 10473.
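One way to locate such a row before deleting it is to compare every row's length with the header's. The sketch below uses a made-up three-column sample rather than the real dataset, and `find_short_rows` is a hypothetical helper name.

```python
def find_short_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Hypothetical sample: the row at index 2 is missing its 'Category' value.
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],
    ['App C', 'GAME', '3.9'],
]

bad_indexes = find_short_rows(sample)
print(bad_indexes)

# Delete from the end so earlier indexes stay valid while rows are removed.
for i in reversed(bad_indexes):
    del sample[i]
print(len(sample))
```

Running the deletion only once matters: a second `del` on the same index would silently remove a valid row.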
The following columns from the playstore dataset can be helpful for our analysis:
- Rating
- Genres
- Price
- Category
Show duplicate entries
The Google Play dataset has many duplicate entries.
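For each duplicated app we keep the entry with the highest number of reviews, on the assumption that a higher review count means the data was collected more recently. This is the criterion for removing duplicates.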
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
Expected length: 9659
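The duplicate count and the expected length can be obtained by walking the dataset once and recording which names have already been seen. The following is a sketch under assumptions: `split_duplicates` is a hypothetical helper, the rows are shortened to four columns, and the 'Box' row carries made-up values.

```python
def split_duplicates(dataset, name_index=0):
    """Return (unique_names, duplicate_names) for the rows of a dataset."""
    unique_apps = set()
    duplicate_apps = []
    for row in dataset:
        name = row[name_index]
        if name in unique_apps:
            # Every occurrence after the first counts as a duplicate.
            duplicate_apps.append(name)
        else:
            unique_apps.add(name)
    return unique_apps, duplicate_apps

sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # illustrative values only
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

unique_apps, duplicate_apps = split_duplicates(sample_rows)
print('Number of duplicate apps:', len(duplicate_apps))
print('Expected length:', len(sample_rows) - len(duplicate_apps))
```

An app that appears three times contributes two entries to the duplicate list, which is why the same name can show up more than once among the examples.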
Step Two: Clean data sets
Remove duplicate entries
Helper functions
For every app name, find the highest number of reviews.
# Reviews max
reviews_max = {}  # e.g. 'Instagram': 66577446
for row in playstore_data[1:]:  # skip the header row
    name = row[0]
    try:
        n_reviews = float(row[3])
    except ValueError:
        # Malformed row: report it and skip it rather than reuse a stale count.
        print(row)
        continue
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

Create a new playstore dataset that keeps only the entry with the highest review count for every app, so that no app appears twice. The `already_added` list keeps track of the apps we have already added.

# Android clean
android_clean = []
already_added = []
for row in playstore_data[1:]:  # skip the header row
    name = row[0]
    try:
        n_reviews = float(row[3])
    except ValueError:
        # Malformed row: report it and skip it, as in the first pass.
        print(row)
        continue
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
Build the key-value storage
We need to clean the playstore dataset to remove all the duplicates. The basic idea is to build a dictionary that maps each app name to its review count. If a duplicate entry exists and its count is higher, the stored count is replaced.
9659
Clean the playstore data
For each app we keep the entry with the highest number of reviews and save the whole row in the android_clean list. We also check whether the app has already been added, because more than one entry can share the same review count.
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']
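As an aside, the two passes can be folded into one by storing the best row per app name in a dictionary, which also avoids scanning a list for every membership check. This is an alternative sketch, not the approach used in this project: `dedupe_by_max_reviews` is a hypothetical helper and the sample rows are shortened to four columns.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # name -> (review count, row)
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when several share the top count.
        if name not in best or n_reviews > best[name][0]:
            best[name] = (n_reviews, row)
    # Dicts preserve insertion order, so apps come out in first-seen order.
    return [row for _, row in best.values()]

sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

clean = dedupe_by_max_reviews(sample_rows)
print(clean)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas the two-pass version orders apps by the position of the retained row.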
Remove Non-English Apps
Helper functions
This function counts the non-ASCII characters in a string. If the string contains more than three non-ASCII characters, it returns False. Otherwise it returns True.
# Is english
def is_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True
We will loop through the android_clean dataset and keep only the apps whose names contain no more than three non-ASCII characters.
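# Android english
android_english = []
# Note: android_clean was built without a header row, so this [1:] slice
# also skips its first app; the row counts shown below reflect that.
for row in android_clean[1:]:
    name = row[0]
    if is_english(name):
        android_english.append(row)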
We will do the same for the iOS dataset. If the name of an app includes more than three non-ASCII characters, we will remove it from the dataset.
# IOS english
ios_english = []
for row in applestore_data[1:]:  # skip the header row
    name = row[1]
    if is_english(name):
        ios_english.append(row)
Removing Non-English Apps
One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text.
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
Number of rows: 9613
Number of columns: 13

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 6183
Number of columns: 16
Isolating the free Apps
Helper functions
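# Keep the Android apps whose price is exactly '0'.
android_free = []
for row in android_english:
    price = row[7]
    if price == '0':
        android_free.append(row)

# Keep the iOS apps whose price is exactly '0.0'.
ios_free = []
for row in ios_english:
    price = row[4]
    if price == '0.0':
        ios_free.append(row)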
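The string comparisons above depend on each store's exact price formatting ('0' versus '0.0'). A slightly more defensive sketch parses the price as a number first. The `is_free` and `keep_free` helpers are hypothetical, the sample rows are made up, and the assumption that paid prices may carry a leading '$' (as in '$4.99') is mine rather than something visible in the listings shown here.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' is zero."""
    return float(price.strip().lstrip('$')) == 0.0

def keep_free(rows, price_index):
    """Return the rows whose price column parses to zero."""
    return [row for row in rows if is_free(row[price_index])]

# Made-up rows, shortened to the columns needed for the example.
android_sample = [['App A', '0'], ['App B', '$4.99']]
ios_sample = [['1', 'App C', '0.0'], ['2', 'App D', '2.99']]

android_free_sample = keep_free(android_sample, price_index=1)
ios_free_sample = keep_free(ios_sample, price_index=2)
print(android_free_sample)
print(ios_free_sample)
```

With this approach the same helper serves both datasets, and only the price column index differs.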
Isolating the free Apps
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
Number of rows: 8863
Number of columns: 13

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']
['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']
Number of rows: 3222
Number of columns: 16
As you can see, the playstore list “android_english” has 750 more apps than “android_free” (9613 versus 8863).
For iOS, the difference between the lists “ios_english” and “ios_free” is 2961 (6183 versus 3222).
As a result of this analysis, I would say that a much larger share of the English-language apps is free on the Play Store than on the App Store: roughly 92% (8863 of 9613) compared with roughly 52% (3222 of 6183).
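Because the two stores have very different totals, the comparison is easier to read as a share of each list than as an absolute difference. The arithmetic can be checked with a short sketch that uses only the row counts reported above; the `share` helper and the variable names are hypothetical.

```python
def share(part, whole):
    """Return part / whole as a percentage."""
    return part / whole * 100

# Row counts taken from the outputs above.
android_english_n, android_free_n = 9613, 8863
ios_english_n, ios_free_n = 6183, 3222

android_free_share = share(android_free_n, android_english_n)
ios_free_share = share(ios_free_n, ios_english_n)

print('Paid Android apps removed:', android_english_n - android_free_n)
print('Paid iOS apps removed:', ios_english_n - ios_free_n)
print(f'Play Store: {android_free_share:.1f}% free')
print(f'App Store:  {ios_free_share:.1f}% free')
```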
Step Three: Profiling Apps
Most popular Apps by Genre
Games: 58.16263190564867
Entertainment: 7.883302296710118
Photo & Video: 4.9658597144630665
Education: 3.662321539416512
Social Networking: 3.2898820608317814
Shopping: 2.60707635009311
Utilities: 2.5139664804469275
Sports: 2.1415270018621975
Music: 2.0484171322160147
Health & Fitness: 2.0173805090006205
Productivity: 1.7380509000620732
Lifestyle: 1.5828677839851024
News: 1.3345747982619491
Travel: 1.2414649286157666
Finance: 1.1173184357541899
Weather: 0.8690254500310366
Food & Drink: 0.8069522036002483
Reference: 0.5586592178770949
Business: 0.5276225946617008
Book: 0.4345127250155183
Navigation: 0.186219739292365
Medical: 0.186219739292365
Catalogs: 0.12414649286157665
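A percentage table like the one above can be built by counting how often each value occurs in a column and dividing by the number of rows. The sketch below is one way to do it, not necessarily the code behind these figures: `freq_table` and `display_table` are assumed names, and the four sample rows are shortened to a name and a genre.

```python
def freq_table(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

def display_table(dataset, index):
    """Print the frequency table sorted from most to least common."""
    table = freq_table(dataset, index)
    ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
    for value, percentage in ranked:
        print(f'{value}: {percentage}')

sample = [
    ['Clash of Clans', 'Games'],
    ['Temple Run', 'Games'],
    ['Facebook', 'Social Networking'],
    ['Instagram', 'Photo & Video'],
]

genre_table = freq_table(sample, 1)
display_table(sample, 1)
```

Note that such a table measures how many apps exist in each genre, which is not the same as how many users those apps attract.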