Last week Google provided some initial numbers for its new social network, Google+. Better yet, Larry Page posted them on Google+ during the company’s quarterly call: so far, ten million users have signed up for the service. Personally, I really like the service, and I was excited enough to invite many fellow Facebook users to join.
Even before Google released the initial numbers, I was curious as to how many users had joined the site. I realized that many users already had a Google Profile before they moved to Google Plus. Would it be possible to know how many users had migrated to Google Plus? I decided to do some poking around for fun.
Digging up Google Profile users
Google publishes sitemap files for Google Profiles but the index file appears to be out of date – the last change being March 16. The URL for the sitemap index can be found at profiles.google.com/robots.txt:
As you can see, Google doesn’t prohibit spiders from crawling users’ profiles. If you dig further into the sitemap index file you’ll find the following:
The sitemap files are numbered and only go to 999. The sitemap specification allows at most 50,000 URLs per file, so a large website needs multiple files. This appeared odd to me, since 50,000 × 999 would only equal 49,950,000 profiles. I dug further into a few example files and found that the number of URLs per file was actually much lower – most had only 2,500 URLs. That would mean only 2,497,500 profiles – too low to be credible.
Since Google decided to name each sitemap file with an incremental number, I decided to dig in further and see how high the numbers went. The result was 7,102 sitemap files or, in theory, 17,755,000 profiles. Next up – how could I determine how many of them had migrated?
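For anyone who wants to check the arithmetic, here is a small sketch in Python. The constant and function names are my own; the figures are the ones quoted above.

```python
# Figures quoted in the post; the names are illustrative only.
SPEC_MAX_URLS_PER_FILE = 50_000  # ceiling allowed by the sitemap specification
OBSERVED_URLS_PER_FILE = 2_500   # what most of the sampled files actually held
FILES_IN_INDEX = 999             # highest file number in the sitemap index
FILES_FOUND = 7_102              # files found by counting upward by hand


def estimated_profiles(files: int, urls_per_file: int) -> int:
    """Upper-bound estimate: every file is assumed to be equally full."""
    return files * urls_per_file


spec_ceiling = estimated_profiles(FILES_IN_INDEX, SPEC_MAX_URLS_PER_FILE)
index_estimate = estimated_profiles(FILES_IN_INDEX, OBSERVED_URLS_PER_FILE)
full_estimate = estimated_profiles(FILES_FOUND, OBSERVED_URLS_PER_FILE)

print(f"Index at the spec ceiling:  {spec_ceiling:,}")
print(f"Index at the observed size: {index_estimate:,}")
print(f"All files found:            {full_estimate:,}")
```

These are ceilings rather than counts, since not every file has to hold exactly 2,500 URLs.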
Turning the tables: crawling Google
The first thing I did was write a quick app to download each sitemap file and store the individual profile URLs in a database. I did this as politely as possible, including waiting three seconds between requests. The result was 17,717,154 profile URLs. That’s a big number. It is perhaps not the complete list of profiles, but it’s the best I could come up with.
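The original app isn’t reproduced here, but a minimal sketch of the same idea in Python might look like the following. It assumes the files follow the standard sitemap XML layout (a urlset of url/loc entries). The fetch callable, the table name and the example.com URLs are all hypothetical stand-ins, and the demo at the bottom uses a canned document instead of touching the network.

```python
import sqlite3
import time
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_profile_urls(sitemap_xml: str) -> List[str]:
    """Return every <loc> value found in one sitemap document."""
    root = ET.fromstring(sitemap_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def crawl_sitemaps(
    fetch: Callable[[int], str],        # hypothetical: file number -> XML text
    file_numbers: Iterable[int],
    conn: sqlite3.Connection,
    delay_seconds: float = 3.0,         # the polite pause between requests
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Store the URLs from each sitemap file; return the total rows stored."""
    conn.execute("CREATE TABLE IF NOT EXISTS profile_urls (url TEXT PRIMARY KEY)")
    for number in file_numbers:
        urls = extract_profile_urls(fetch(number))
        conn.executemany(
            "INSERT OR IGNORE INTO profile_urls (url) VALUES (?)",
            [(url,) for url in urls],
        )
        conn.commit()
        sleep(delay_seconds)
    return conn.execute("SELECT COUNT(*) FROM profile_urls").fetchone()[0]


# Offline demo with a made-up sitemap document and no real waiting.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/profiles/1001</loc></url>
  <url><loc>https://example.com/profiles/1002</loc></url>
</urlset>"""

stored = crawl_sitemaps(
    fetch=lambda number: SAMPLE_SITEMAP,
    file_numbers=[0],
    conn=sqlite3.connect(":memory:"),
    sleep=lambda seconds: None,
)
print(stored)
```

Passing the fetch and sleep functions in as arguments keeps the politeness delay in one place and makes the loop easy to try out without hitting anyone’s servers.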
Next up – how to determine if the user had migrated? What I found was that if the user had migrated to Google Plus then the server would return a redirect to their new Google Plus URL. Also, I found that a certain number of URLs returned a 404 error – presumably for users who had deleted their profile since the sitemap files had last been updated.
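Put as code, the rule of thumb could be sketched like this in Python. The function name and the labels are mine, and treating any 3xx response that carries a Location header as "migrated" is an assumption based on the behaviour described above, not something Google documents.

```python
from typing import Optional


def classify_profile(status: int, location: Optional[str] = None) -> str:
    """Map an HTTP status (and optional Location header) to a rough label."""
    if 300 <= status < 400 and location:
        return "migrated"   # redirected to a new URL
    if status == 404:
        return "deleted"    # listed in the sitemap, but gone now
    if status == 200:
        return "profile"    # still served as a plain Google Profile
    return "unknown"        # anything else (errors, throttling, ...)


# The target URL below is a made-up placeholder.
print(classify_profile(302, "https://example.com/new-home"))
print(classify_profile(404))
print(classify_profile(200))
```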
I didn’t want to download all 17 million-plus profiles, for a couple of reasons. First, I didn’t want to pull in all that personal user information for no reason. People have enough trouble with privacy without worrying about some curious geek in Canada. Second, since I had decided up front not to hit the service very hard (with my three-second waits between requests), it would take forever to pull in all of the URLs.
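To put a rough number on "forever", here is a back-of-the-envelope calculation in Python. It counts only the three-second pauses and ignores the time each request itself takes, so the real figure would be higher.

```python
from datetime import timedelta

PROFILE_URLS = 17_717_154  # URLs collected from the sitemap files
SAMPLE_URLS = 10_000       # a sample-sized crawl, for comparison
DELAY_SECONDS = 3          # pause between requests

full_wait = timedelta(seconds=PROFILE_URLS * DELAY_SECONDS)
sample_wait = timedelta(seconds=SAMPLE_URLS * DELAY_SECONDS)

print(f"Full crawl:   about {full_wait.days} days of waiting")
print(f"Sample crawl: {sample_wait} of waiting")
```

That works out to well over a year and a half of pauses for the full list, against an evening’s worth for ten thousand URLs.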
The results are in
What I settled on was to simply issue an HTTP HEAD request for the profile URLs, and to do it only for a credible sample of users. A HEAD request doesn’t actually pull down the web page you are requesting, only the headers, which was good enough for my purpose. I estimated that I would need to pull in 10,000 URLs for a reasonable sample. In addition, I excluded the small number of deleted profiles that my spider found (fewer than 200). Here are the results:
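Here is a rough sketch of that approach in Python using only the standard library. The function names are my own, and drawing the sample uniformly at random is an assumption (the selection method isn’t described above). http.client is used because it does not follow redirects on its own, so the redirect status and its Location header stay visible.

```python
import http.client
import random
from typing import List, Optional, Sequence, Tuple
from urllib.parse import urlsplit


def pick_sample(urls: Sequence[str], size: int, seed: int = 0) -> List[str]:
    """Draw a reproducible uniform random sample without replacement."""
    return random.Random(seed).sample(list(urls), size)


def head_status(url: str, timeout: float = 10.0) -> Tuple[int, Optional[str]]:
    """Send one HEAD request; return (status code, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)   # headers only, no page body
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


# Offline demo of the sampling step, with made-up URLs.
all_urls = [f"https://example.com/profiles/{n}" for n in range(1_000)]
sample = pick_sample(all_urls, 10)
print(len(sample))
```

Calling head_status on each sampled URL, with a pause between calls, would give the status codes to tally. Nothing in the block above makes a network request by itself.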
Users who migrated to Google Plus: 1,047
Users who still have a Google Profile: 8,953
This shows that, so far, only 10.47% of Google Profile users in my sample have migrated to Google Plus.
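As a sanity check on that percentage, here is a short Python sketch that also estimates a margin of error. The margin uses the usual normal approximation for a proportion and assumes the 10,000 URLs were a simple random sample, which is my assumption rather than something measured.

```python
import math

migrated = 1_047        # redirected to Google Plus
still_profile = 8_953   # still a plain Google Profile

sample_size = migrated + still_profile
share = migrated / sample_size

# 95% margin of error under the normal approximation (assumes random sampling).
std_error = math.sqrt(share * (1 - share) / sample_size)
margin_95 = 1.96 * std_error

print(f"Migrated: {share:.2%} of {sample_size:,}")
print(f"Approximate 95% margin: +/- {margin_95:.2%}")
```

Under those assumptions the share is pinned down to within roughly 0.6 percentage points either way, so the sample size is not the main source of uncertainty here.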
What does it mean?
To me, it means that Google has a ready set of users whose profiles can easily be migrated to Google Plus. I’m sure they have a plan for that. But it also means that only a small share of those users have migrated so far. Given that the people who had profiles in the first place seem to be a relatively limited group (early adopters?), perhaps that’s worrying. But we won’t really know if this is important until open sign-ups become the norm with the service.
In other words: it was a fun exercise and I’ll leave it to the reader to decide how valuable this data is and what it could mean this early for Google’s new service.