
Google search: three bugs to fix with better data science

These big data problems probably affect many search engines. They also show that there is still room for new start-ups to invent superior search engines. These problems can be fixed with improved analytics and data science.

Here are the problems, and the solutions:

1. Outdated search results. Google does not do a good job of showing new or recently updated web pages. Of course, new does not mean better, and Google's algorithm favors old pages with a good ranking on purpose, maybe because ranking for new pages is less reliable or has less history (that's why we created statistical scores to rank web pages with no history). To work around this problem, the user can add 2013 to Google searches, and Google could do the same by default. For instance, compare search results for the query data science with those for data science 2013. Which do you like best? Better still, Google should let you choose between "recent" and "permanent" search results when you do a search.

The issue here is to correctly date web pages, a difficult problem since webmasters can use fake time stamps to fool Google. But since Google indexes most pages every couple of days, it's easy to create a Google time stamp and keep two dates for each (static) web page: the date when it was first indexed and the date when it was last modified. You also need to keep a 128-bit signature (in addition to related keywords) for each web page, to easily detect when it is modified. The problem is more difficult for web pages created on the fly.
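The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: the names page_signature, PageRecord and record_crawl are hypothetical, and BLAKE2b truncated to 16 bytes stands in for whatever 128-bit signature a search engine would actually use.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


def page_signature(content: str) -> str:
    """Return a 128-bit signature of the page content as 32 hex characters."""
    # Assumption: BLAKE2b with a 16-byte digest; any 128-bit hash would do here.
    return hashlib.blake2b(content.encode("utf-8"), digest_size=16).hexdigest()


@dataclass
class PageRecord:
    signature: str
    first_indexed: datetime   # set once, by the crawler's own clock
    last_modified: datetime   # updated only when the signature changes


def record_crawl(index: dict, url: str, content: str, crawl_time: datetime) -> PageRecord:
    """Update the index after one crawl of a (static) page.

    Both dates come from the crawler, never from time stamps supplied
    by the webmaster, so they cannot be faked by the page itself.
    """
    sig = page_signature(content)
    record = index.get(url)
    if record is None:
        # First time the page is seen: both dates start at the crawl time.
        record = PageRecord(sig, crawl_time, crawl_time)
        index[url] = record
    elif record.signature != sig:
        # Content changed since the previous crawl.
        record.signature = sig
        record.last_modified = crawl_time
    return record
```

Because the comparison is on an exact signature, any change to the page (even a rotating ad or a visitor counter) registers as a modification, which is one reason pages created on the fly are harder to date.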

2. Wrongly attributed articles. You write an article on your blog. It then gets picked up by another media outlet, say the New York Times. Google displays the New York Times version at the top, and sometimes does not display the original version at all, even if the search query is the exact title of the article. One might argue that the New York Times is more trustworthy than your little unknown blog, or that your blog has a poor page rank. But this has two implications:

  • It creates a poor user experience. To the Google user, the Internet appears much smaller than it really is.
  • Webmasters can use unfair strategies to defeat Google: using smart botnet technology with the right keyword mix to manufacture tons of organic Google queries, but very few clicks, and those only on their own links. This will lower the organic click-through rate (CTR) for competitors, but boost yours. It is expected that this "attack" would fix your content attribution problem (we will test it in our research lab; we successfully tested it in the past for paid search, but not yet for organic search).
  • Another, meaner version of the CTR attack consists of hitting the culprit website with lots of organic Google clicks (Google redirects) but no Google impressions. This could fool Google's algorithms into believing that the website in question is engaging in CTR fraud by manipulating click-in ratios, and thus get it dropped from Google. At the same time, it will create a terrible impression-to-conversion (click-out) ratio for advertisers showing up on the incriminated landing pages, hitting display ads particularly hard and causing advertisers to drop the publisher. In short, this strategy could automatically take care of copyright infringement using two different levers at once: click-in and click-out. It must be applied to one violator at a time (to avoid detection), until they are all gone. If properly executed, nobody will ever know who carried out the attack; maybe nobody will ever find out that an attack took place.

One easy way for Google to fix the problem is again to correctly identify the first version of an article, as described in the previous paragraph.
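As a rough sketch of that idea, and assuming each indexed copy already carries a content signature and a crawler-assigned first-indexed date, attribution reduces to keeping the earliest copy per signature. The function name attribute_originals and the tuple layout are hypothetical, and an exact signature only matches verbatim copies; lightly edited reprints would need some form of near-duplicate detection instead.

```python
from datetime import datetime


def attribute_originals(pages):
    """Map each content signature to the URL that was indexed first.

    pages: iterable of (url, signature, first_indexed) tuples, where
    first_indexed is the date assigned by the crawler (an assumption
    of this sketch, not a documented Google field).
    """
    earliest = {}
    for url, signature, first_indexed in pages:
        best = earliest.get(signature)
        # Keep the copy with the oldest crawler-assigned date.
        if best is None or first_indexed < best[1]:
            earliest[signature] = (url, first_indexed)
    return {sig: url for sig, (url, _) in earliest.items()}
```

With this rule, a small blog indexed on April 10 would be credited ahead of a large outlet that reprinted the same text on April 12, regardless of the page rank of either site.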

3. Favoring irrelevant web pages. Google generates a number of search result impressions per week for every website, and this number is extremely stable. It is probably based on the number of pages, keywords and popularity (page rank) of the website in question, as well as a bunch of other metrics (time to load, proportion of original content, niche vs. generic website, etc.). If every week Google shows exactly 10,000 impressions for your website, which page / keyword match should Google favor?

Answer: Google should favor pages with low bounce rate. In practice, it does the exact opposite.

  • Why? Maybe if a user does not find your website interesting, he performs more Google searches and the chance of him clicking on a paid Google ad increases. So bad (landing page) user experience financially rewards Google.
  • How to fix it? If most users spend very little time on a web page (and Google can easily measure time spent), that web page (or better, that web page / keyword combination) should be penalized in Google's index, so that it shows up less frequently. Many publishers (including us) also use Google Analytics, which provides Google with additional valuable information about bounce rate and user interest, at the page level.
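The fix proposed above can be illustrated with a small Python sketch. Everything specific in it is an assumption made for illustration: the function name dwell_penalties, the 10-second threshold, the minimum of 3 visits, and the floor of 0.1 are arbitrary choices, not values used by Google.

```python
from collections import defaultdict
from statistics import median


def dwell_penalties(visits, min_seconds=10.0, min_visits=3):
    """Compute a ranking weight for each page / keyword combination.

    visits: iterable of (page, keyword, seconds_on_page) tuples.
    Returns {(page, keyword): weight}, with weight between 0.1 and 1.0;
    1.0 means no penalty.
    """
    by_pair = defaultdict(list)
    for page, keyword, seconds in visits:
        by_pair[(page, keyword)].append(seconds)

    weights = {}
    for pair, times in by_pair.items():
        if len(times) < min_visits:
            weights[pair] = 1.0  # too few visits to judge, no penalty
            continue
        typical = median(times)  # median resists a few very long visits
        if typical >= min_seconds:
            weights[pair] = 1.0
        else:
            # Scale the weight down with the typical time spent, with a floor.
            weights[pair] = max(typical / min_seconds, 0.1)
    return weights
```

Scoring the page / keyword combination rather than the page alone means a page can keep its ranking for the queries where visitors stay, and lose impressions only for the queries where they leave right away.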

However, one might argue that if the bounce rate is high, maybe the user has found the answer to his question right away by visiting your landing page, and thus the user experience is actually great. In our case (regarding our websites) we disagree, as each page displays links to similar articles and typically results in subsequent page views. Indeed, our worst bounce rate is associated with Google organic searches. More problematic is the fact that the bounce rate from Google organic is getting worse (while it's getting better for all other traffic sources), as if Google's algorithm lacked machine learning capabilities, or were doing a poor job with new pages added daily. In the future, we will write longer articles broken down into 2 or 3 pages. Hopefully, this will improve our bounce rate from Google organic (and from other sources as well).
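For readers who want to run the same comparison on their own traffic, here is a minimal Python sketch. It assumes a bounce is a single-page visit and that each session is available as a (source, pages_viewed) pair; the function names and the input layout are hypothetical and do not correspond to any Google Analytics export format.

```python
from collections import defaultdict


def bounce_rate_by_source(sessions):
    """Return {source: fraction of sessions that viewed exactly one page}.

    sessions: iterable of (source, pages_viewed) tuples.
    """
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for source, pages_viewed in sessions:
        totals[source] += 1
        if pages_viewed == 1:  # single-page visit: entry page = exit page
            bounces[source] += 1
    return {source: bounces[source] / totals[source] for source in totals}


def rank_sources(sessions):
    """List (source, bounce_rate) pairs, worst bounce rate first."""
    rates = bounce_rate_by_source(sessions)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

Running rank_sources on sessions from successive months shows whether a given source, such as organic search, is drifting up or down relative to the others.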




Comment by Vincent Granville on May 15, 2013 at 9:58am

Another area of concern is SEO companies destroying your Google page rank using bad SEO practices on purpose, then contacting you "offering" to fix your poor rankings for a fee.

Comment by Vincent Granville on April 16, 2013 at 2:21pm

Oleg: A bounce rate of 50% is far above average, if you compute it over thousands of websites (via Google Analytics; the Alexa computation provides very different numbers, but I think they are not as accurate). Besides bounce rate, other metrics measure user interest, such as "time spent", "visit depth" (pages per visit), "number of actions by user" etc.

Of course, all these metrics have limitations: if you split a long page into two pages, suddenly all your metrics improve (except number of users and number of visits), but it's purely artificial. However, page splitting does really improve one thing: the chance that the visitor will click on a banner ad, especially if banner ads are rotating. So it does indirectly increase revenue.

Comment by Oleg Okun on April 16, 2013 at 2:05pm

Vincent, don't you think that the current definition of bounce rate is a bit illogical? For instance, someone visits the main page of a company, sees the 'jobs' page and clicks on it, thus moving to another page while still remaining on pages of the same organization. Is such behavior treated as a bounce or not?

Comment by Oleg Okun on April 16, 2013 at 2:00pm

Thank you, Vincent. But why do you consider a bounce rate of 55% very good? I think a really good bounce rate needs to be 25-30%, i.e., close to its actual minimum.

Comment by Vincent Granville on April 16, 2013 at 12:20am

@Oleg: A bounce is a user visiting a webpage and then leaving the website right away. In other words, it's a single-page visit to a website, with entry page = exit page.

By "worse", I mean (organic) bounce rate went from (say) 76% to 79% over several months, while the number of clicks-in remained very stable. I ranked our traffic sources by bounce rate, and clearly, Google organic has the worst among all large traffic sources; the only one that was worse was Google paid traffic. StumbleUpon and Reddit have an even worse bounce rate (above 90%), due to the way they work: they deliver traffic spikes very infrequently, to non-targeted users. LinkedIn, Google+ and our internal properties have very good bounce rates, some below 55%.

Comment by Oleg Okun on April 13, 2013 at 4:39am

Hello Vincent,

How would you count a bounce? As far as I understood from your blog, you take into account not only time spent on a web page but also whether a visitor clicked on one of the links on that page, too, right?

What did you mean by saying that the bounce rate from Google organic search is getting worse? I think Google returns information relevant to searches on the first few pages (of course, this depends a lot on the ability of a visitor to formulate the right query to Google). If a user is inexperienced or doesn't know precisely what to look for, his queries are vague, and it is not Google's fault if irrelevant pages are returned, thus leading to a high bounce rate.

© 2014 is a subsidiary and dedicated channel of Data Science Central LLC
