Priyank Garg is the director of product management for Yahoo! Search Technology (YST), the team responsible for the functionality of Yahoo!’s Web search engine including crawling, indexing, ranking, summarizing and spelling Web search functions along with products for webmasters, such as Site Explorer. During his three years at Yahoo!, Priyank has led many highly-visible product launches for Yahoo! Search and championed the cause of webmasters with Yahoo! Site Explorer and Yahoo! Sitemaps.
Prior to Yahoo!, Priyank worked at Ensim Corporation managing and evangelizing the company’s flagship hosted service automation platform, Unify. Earlier, Priyank worked as a Systems Consultant with Deloitte Consulting working alongside top-tier U.S. organizations to address their information systems needs.
Priyank earned a bachelor of technology in computer science from Indian Institute of Technology, Delhi, where he also led student extra-curriculars and governance. He also holds a master of science in computer science from Stanford University.
Eric Enge: Can you talk a little bit about the role that links play in Yahoo’s ranking algorithms?
Priyank Garg: Sure. There’s a lot of mythology in the industry sometimes around how links work. Links are a way for us to understand how Web sites and other people on the Web are recognizing other content that they have come across. The anchor text from that indicates context for the content that it’s linking to, and we have used this information in our algorithms for many, many years to better address search queries as they come into the search engine.
So links are important, but anchor text is just as important. What we look for are links that would be naturally useful to users in context, and that adds to their experience browsing on the Web. And links of that nature, which are organic, will survive when the user comes across them and interests him. Those are the kinds of links that we are trying to recognize, identify, and attribute to the target content.
Eric Enge: Right. So part of what you are pointing at there is that relevance matters a lot. So getting a link from the bottom of a WordPress template that you create and distribute is completely irrelevant.
Priyank Garg: Exactly, that’s the kind of thing that we are trying to do all the time. The irrelevant links at the bottom of a page, which will not be as valuable for a user, don’t add to the quality of the user experience, so we don’t account for those in our ranking. All of those links might still be useful for crawl discovery, but they won’t support the ranking. That’s what we are constantly looking at in algorithms. I can tell you one thing: over the last few years, as we have been building out our search engine and incorporating lots of data, the absolute percentage contribution of links and anchor text to our ranking algorithms has gone down somewhat.
New sources of data and new features that Yahoo! has built and developed have made our ranking algorithm better. Consequently, as a percentage contribution to our ranking algorithm, links have been going down over time. I believe that is somewhat attributable to people abusing links on the Web. As that happens, the net quality of links goes down, and the net contribution directly goes down too. However, we’re still working hard to make sure all the high-quality links are effective in providing us the information we need to search on queries.
Eric Enge: So from a mindset point of view it sounds like you are much more focused on the high-quality links because they are less noisy as a ranking signal?
Priyank Garg: Exactly, that’s right. If we take all the links together including the noise, the percentage contribution to our ranking system goes down, because we are discounting the noise more effectively over time.
Eric Enge: Right. I understand, but the links are still a very significant factor even now.
Priyank Garg: Yes. They continue to be a very significant factor.
I’m saying that people and site owners should think about the site in all aspects of the user experience, and not obsess about links as the only thing that drives traffic to them. Links are critical factors, and good organic links are earned through great content and great value, which will add to the site’s visibility on search engines. But site owners can do a lot of things in parallel that will also improve their visibility, both in search engines and beyond search.
Eric Enge: Yes, indeed. So do you picture that the role of links will continue to decrease as new ranking signals come into play?
Priyank Garg: We are not focused on doing that. This is a developmental process and you might have anything happen, right? It somewhat depends on how the Web evolves. For example, if tomorrow there is a whole turnaround and all spammy links are shut down, we might suddenly have the signal-to-noise quality of links go up so much that links might increase in importance. I don’t want to make predictions about what will happen in the future. All I can say is that we have seen that as we enhance our algorithms with a lot of other features, which we have been building, they have been contributing a lot of information to supplement the links. As a percentage contribution to the ranking function, links are relatively less than they were in the past.
Eric Enge: So it sounds like the process in which you are evaluating the noisiness of the signal is used to attenuate its impact on the results.
Priyank Garg: That’s right.
Eric Enge: That’s a very interesting concept. If link prices were going up at the same rate that gas prices are going up and people just stopped doing it, then the signal quality would improve and its importance would improve.
Priyank Garg: Yes. Our algorithms are evolving constantly; we are making changes to our systems multiple times a week. Some of the changes are minor. Most of them are so minor that we don’t even need to talk about it, but we are constantly evolving our system to keep up with the data on the Web that is also evolving. In search engines, the data is the key part of the output.
The data and the algorithm constantly evolve. The Web evolves, new tools come into play, new ways of interacting with users come into play. It is a constant evolution all the time, and we adapt our algorithms to what we are seeing on the Web to make sure that our end goal of user relevance is optimized.
Eric Enge: Sure. Ultimately, that’s what it is all about. So, what kind of non-link based signals do you use?
Priyank Garg: Well, we have lots of data sources that we are recognizing all the time. We build understandings of how a site lays out its contents; what’s the distribution of the quality of the content; what’s the spamminess of the content on the site; what is the spamminess of an individual page; what is the spamminess of a site in aggregate; what’s the emphasis on words on the page; what’s the context of anchor text of the page? There are so many factors in there, hundreds and hundreds of elements.
Eric Enge: Right. As for off page factors, one example of something is the level of adoption on social media sites, like del.icio.us for example.
Priyank Garg: All of those factors are a part of it. To give you a general answer, the elements of the locations that provide the most signals are the ones where users are taking active steps to recognize the value of content, whether it be through links they have created on their clean Web pages, or through social media sites like del.icio.us.
So every location where the incentive is aligned for user value is the place where it matters most. If there’s a good website with reviews for a product and it’s generated by users, which have no incentive except to help other users, then those links would be valued more.
That would come out through the algorithm because of the quality of that site itself. If that site is used by users and they value it, that will represent itself on the rest of the Web, and the quality of the site will propagate down to the sources that it links as well.
Eric Enge: An incredibly important thing to consider is the type of relationships you are trying to develop in your market. It becomes really important to focus on the truly authoritative type of sites in your space, and I can make the argument without even bringing search engines into it, but it sounds like it is a very smart way for an SEO to think about promoting his/her site.
Priyank Garg: That’s right.
Eric Enge: Very interesting. Now, as you have already alluded to, there are, unfortunately, a lot of people out there generating spam type tactics, ranging from old-fashioned things like hidden text to purchasing links and these kinds of things.
Priyank Garg: Yes.
Eric Enge: So, what are the kinds of things that Yahoo typically does to fight spam?
Priyank Garg: We use algorithmic and editorial means to fight spam. What we have found is algorithms are very effective at fighting spam at the large scale, and our human editors are very effective at recognizing new techniques and providing us that early signal, which we can use to scale-up the detection processes. This two-step approach helps us to be recognized as one of the best in the industry.
We show the least spam among the search engines because both of our techniques are in action. Our spam detection techniques run on every page, every time we crawl it. Those detection algorithms are fed directly into our ranking function, where the spam detection is actually pretty high in importance.
Eric Enge: Yes, I guess the editorial function, which isn’t quite scalable, probably gets directed by the algorithmic detection of things that just smell bad, and by where people are reporting problems.
Priyank Garg: Yes, it is. The Yahoo! specialists who are doing all the editorial efforts are great experts in this, and they are sometimes ahead of our algorithms in detecting these things. Sometimes, the algorithms point out suspicious things, which they look at. There is a knowledge that builds up over time about what looks suspicious, which only humans can detect in the beginning.
Then, we use that to go to the next level of quality in our spam detection. Both of those mechanisms of algorithmic detection – followed by editorial follow-up or editorial detection followed by algorithmic follow-up – are in action all the time. Ultimately, the way to scale up the response is to build algorithmic ways to detect things on every page and every crawl. So everything that our editors do is constantly being mirrored by our spam team as quickly as possible in the algorithms.
Eric Enge: Yes, I understand. Do you have a situation where editors have the ability to take manual action if they see something extreme?
Priyank Garg: Our editors are authorized to take action for various kinds of situations like DMCA or legal deletes, such as for markets like France, where there is a restriction of certain types of content, such as Nazi memorabilia, which other markets don’t have. Consequently, there are various tools that are available to them. They are not focused on saying, I need to find a million pages of spam and remove them in this month.
We can use it to have our algorithms learn, we can use it to address it directly, we can use it to reach out to the webmaster and warn them that it might not be meeting our guidelines, so we do what is right for the users as best we can.
Eric Enge: Right. And sometimes, you are going to spot something which looks like a mistake. That’s the kind of scenario where you might reach out to someone and ask if they know that they have hidden text that is kind of objectionable.
Priyank Garg: Yes, exactly. The point is that we don’t want to hurt people who may be doing things innocuously or starting to cross the line without being aware. And our clear intent is not to explicitly remove spam from our index. Our goal is to affect the ranking and reflect the relevance appropriately.
There is a query out there for which each page is relevant, and so the completion of our goal requires our algorithm to keep all the content we can, even the spammy ones. Of course, that’s something that becomes egregious on resources, and then sometimes, we have to make other choices. However, if there is a page that is generally okay but has some spamming techniques, someone might search for that URL, and as a search engine, we want to make sure we have the most comprehensive experience we can.
But if someone goes out there and creates a hundred million spam DNS hosts, that’s just a waste of resources and we may not choose to take that approach. In principle, our desire is to keep as much of the Web available to users on our search engine, to rank it appropriately.
Eric Enge: Right. So, if you saw a page that has some good content, but there are some spammy practices on it, it affects ranking as opposed to indexing?
Priyank Garg: When it crosses a certain line of resource consumption, we may change the approach. But that is the intent, yes.
Eric Enge: Of course, another way you can end up with pages that are not high quality is if you have large sites, which perhaps have enormous reams of really good content, but because of their size, you may end up with pages that are thin in unique content.
You might also have a database of some addresses, so when you look at it it’s relatively thin in content, but then again, someone might actually be searching for the address of a particular business that’s on that page.
Priyank Garg: Yes. Again, as we work we’ve tried to make sure that our algorithms are constantly trying to optimize the experience for the largest number of queries that we get from users. The net information content that’s available to users and ranks for queries is what we are looking at. If the page has little unique content, whether it is from a large site or small site, it may mean that it’s less useful for most of the queries.
Eric Enge: Right. Yes, indeed. So, what about just paid links in general? What’s your policy on that?
Priyank Garg: There’s no black and white policy that makes sense in our mind for paid links. The principle remains value to the users. If a paid link is not valuable to the users, we will not want to give it value. Our algorithms are being organized for detecting value to users. We feel most of the time that paid links are less valuable to users than organic links.
But that’s not black and white, it is always a continuum. Yahoo! continues to focus on the element of recognizing links that are valuable to users, building mechanisms in our algorithms that attenuate the signal and capture as much value from that link in context, rather than worrying about it being paid or unpaid. As I said before, paid links are found to be generally less useful to users. That’s how we try to capture that aspect of it.
Eric Enge: Right. So now I would like to talk a little bit about some of the common meta tags that have emerged in the past couple of years: NoIndex, NoFollow meta tags, and NoFollow attributes, and robots.txt. In particular, the context I’d like to talk about is limiting the flow of what I will call link juice, or stopping a page from being crawled, or stopping it from being indexed.
Let’s take them one at a time and talk about how you handle NoIndex?
Priyank Garg: NoIndex on the page essentially means that none of the content on that page will be searched or will be indexed in our search engines. If there is a page with a meta NoIndex on it, that page will not be recalled for any of the terms in its HTML.
Eric Enge: Right. Now let us say lots of people link to this page, which is NoIndexed and those are good relevant links, and then the NoIndexed page turns around and links to some other pages with good relevant content; is that NoIndexed page passing link value to those other pages?
Priyank Garg: We do index a page and we will show its URL in search results if it is very heavily linked to the Web, even if it has a NoIndex tag on it.
That is something that is a behavior that we follow. That’s been essentially applicable to situations where the page itself is high value, and it has many links that are very relevant to a particular query as indicated by anchor text.
Eric Enge: Right. I guess there is sort of a threshold in which the links indicate a high enough demand for that page’s content that it’s hard to not have it in the index.
Priyank Garg: Exactly. So in that particular case, we will have the URLs show up in the search results, but there will be no abstract. And the URL would show up only because of the anchor text; it will not show up because of any terms on that page.
Eric Enge: Right.
Priyank Garg: We do currently show pages which have a NoIndex if anchor text recommends that. We also discover links from a NoIndex page and pass the link weights and anchors to destination documents.
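Editor's note: the NoIndex behavior described above hinges on a directive in the page's robots meta tag. The following is a minimal Python sketch of how a crawler might read that directive, not Yahoo!'s implementation; the names `RobotsMetaParser` and `robots_directives` are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated; compare them case-insensitively.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="NOINDEX, follow"></head><body>Hi</body></html>'
directives = robots_directives(page)

# Per the interview: with noindex, the page's own terms are not used for recall,
# but its outbound links can still be discovered and pass weight.
use_page_terms_for_recall = "noindex" not in directives
```

Note that the sketch only reads the directive. Whether the URL still appears in results is, as described above, decided by inbound anchor text rather than by anything on the page.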
Eric Enge: So can we talk a little bit about NoFollow meta tags and NoFollow attributes?
Priyank Garg: Yes, so NoFollow meta tags mean that we will not use the links on a page as an attribution, but we may use them for discovery. The same thing applies for the NoFollow attribute on a link.
Eric Enge: Right. So the anchor text and the vote represented by the link for a given page are ignored if it’s NoFollowed or it’s on a page that has the NoFollow meta tag, but you will still look through the page and use it for discovery and potentially indexing if there are other reasons to index it.
Priyank Garg: Yes. Exactly.
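Editor's note: the distinction drawn here, between links that count as attribution and links that are used only for discovery, can be sketched in Python as follows. This is an illustration under stated assumptions, not Yahoo!'s code: `LinkClassifier` and `classify_links` are hypothetical names, and the sketch assumes the page-level signal is a robots meta tag containing `nofollow` and the link-level signal is `rel="nofollow"`.

```python
from html.parser import HTMLParser


class LinkClassifier(HTMLParser):
    """Splits a page's links into attribution links and discovery-only links."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False
        self._links = []  # (href, link_has_nofollow) in document order

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = (attr.get("content") or "").lower()
            if "nofollow" in [part.strip() for part in content.split(",")]:
                self.page_nofollow = True
        elif tag == "a" and attr.get("href"):
            rel_tokens = (attr.get("rel") or "").lower().split()
            self._links.append((attr["href"], "nofollow" in rel_tokens))

    def classify(self):
        # Classification happens after parsing so that a page-level nofollow
        # applies to every link, wherever the meta tag appears.
        attribution, discovery_only = [], []
        for href, link_nofollow in self._links:
            if self.page_nofollow or link_nofollow:
                discovery_only.append(href)
            else:
                attribution.append(href)
        return attribution, discovery_only


def classify_links(html):
    parser = LinkClassifier()
    parser.feed(html)
    return parser.classify()


attribution, discovery_only = classify_links(
    '<a href="/guide">Guide</a> <a rel="nofollow" href="/ad">Ad</a>'
)
```

In this sketch both lists remain available for crawl discovery; only the first would contribute anchor text and link weight.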
Eric Enge: Yes, that makes sense. And then lastly, robots.txt? Say somebody uses robots.txt to say don’t crawl a page, is it still possible for that page to get into the index?
Priyank Garg: Yes. If the robots.txt file says don’t crawl, we will not crawl; we will not even try to retrieve that page at all from our crawling. But if the anchor text to that URL, as discovered on the Web, indicates a strong preference for it to show up for certain queries, it might show up.
One example in the past was that the Library of Congress had a robots.txt denying crawling, but we still had that page show up because it was what users wanted for that query. So it will only show up when lots of anchor text on the Web suggests that this page is relevant to that particular query.
Eric Enge: Right, okay so that makes perfect sense. So if you can’t crawl the page because it said don’t crawl it, then it’s hard to show a search snippet, for example, right?
Priyank Garg: Yes, we won’t have a search snippet for that page. We won’t even be showing the title of the page; the title we show will be generated by other information sources.
Eric Enge: Right. So all that makes sense and what it ranks for is really driven by the source of the links?
Priyank Garg: Exactly.
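Editor's note: the crawl decision described above can be reproduced with Python's standard `urllib.robotparser` module. This is a minimal sketch; the robots.txt content, the `example.com` URLs, the `ExampleBot` user agent, and the `may_crawl` helper are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one directory for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_crawl(url, user_agent="ExampleBot"):
    """Return True if the parsed robots.txt allows this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


blocked = may_crawl("https://example.com/private/report.html")
allowed = may_crawl("https://example.com/public/report.html")
```

A blocked URL is never fetched, so the crawler has no page text to build a snippet or title from. That is consistent with the answer above: any listing for such a URL has to come from off-page sources such as anchor text.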
Eric Enge: That same phenomenon could be ascribed to other technologies which just aren’t in practice crawled, like Flash files for example.
Priyank Garg: Adobe Flash files are somewhat different because that’s not always what happens there. We do have an ability to crawl the HTML of the page, and they might give us an HTML title with the description of that page. There might be a version of the content that might be available for crawlers for Flash. So there is another thing playing into it, but if there is nothing on the page except a link to a Flash file, then the other off-page factors will be what drives the visibility of that URL in the search results.
Eric Enge: Right, and then perhaps a PDF file?
Priyank Garg: We are actually able to convert PDF files to HTML.
Eric Enge: So you can actually look at the text inside the PDF file, and process that, and use that for context and everything else?
Priyank Garg: That’s right. We can also do that with Word files, Excel files, and many other formats.
Eric Enge: That’s, of course, something that has evolved pretty significantly over the past few years, because not so long ago nobody was looking inside PDFs.
Priyank Garg: Yes, we have continued to evolve the quality of our tools for looking into PDF files, and there are efforts that have been going on over the last few years, so that has been an evolving area as well.
Eric Enge: Is there any reason for a publisher to fear that issuing content in the form of a PDF file wouldn’t rank as well or get exposed to as many terms as the same content rendered in HTML?
Priyank Garg: That’s a tough one to say, and the reason is that my sense is users link less to PDF content or non-HTML content, just because it’s somewhat slower to view.
Consequently, the effects playing into the visibility of this could be multivariate. I wouldn’t make a blanket statement about HTML being equivalent to PDF because user attribution and other factors do play out differently on the Web for different formats of content. So that is something that the publishers will need to think about.
Eric Enge: Right, I understand. So just to step back a second to the NoIndex, NoFollow and robots type stuff. The notion has been discussed in many circles on the Web of what people call link juice sculpting: using tools like the NoFollow attribute a little more explicitly to show which pages you think are important versus which ones you don’t. And so a classic example is, you have a website and you have your contact us, about us page, and legal disclaimer page linked to from every page of the site. What are your thoughts on that kind of sculpting?
Priyank Garg: It’s interesting that this discussion is described in that context. A NoFollow tag creates an alternative state of attribution, but if you think about it, it’s not very different from not linking to those pages. When you link to a page, you are saying something about it. When you don’t link, that’s also an implicit comment, either you didn’t know about the page, or you didn’t think it was useful.
So if you think about link juice sculpting, this targeting of link attribution existed even before the NoFollow tag, where you could link and you could not link to something. Now you have an intermediate stage such that:
- you can link without NoFollow
- you can link with NoFollow
- you can not link.
So, it’s not something that is entirely out of the blue, it’s just an intermediate stage that’s created, and it’s not anything terribly new. You should always make sure you link to content that’s useful to users and if you link to the right content, that will work best.
One of the things Yahoo! has done is look for template structures inside sites so that we can recognize the boilerplate pages and understand what they are doing. And as you can expect, a boilerplate page like a contact us or an about us is not going to be getting a lot of anchor text from the Web and outside of your site. So there is natural targeting of links to your useful content.
We are also performing detection of templates within your site and the feeling is that that information can help us better recognize valuable links to users. We do that algorithmically, but one of the things we did last year around this time is we launched the robots-NoContent tag, which is a tool that webmasters can use to identify parts of their site that are actually not unique content for that page or that may not be relevant for the indexing of the page.
If you have ads on a page, or if you have navigation that’s common to the whole site, you could take more control over our efforts to recognize templates by marking those sections with the robots-NoContent tag. That will be a clear indicator to us that as the webmaster who knows this content, you are telling us this part of the page is not the unique main content of this page and don’t recall this page for those terms.
That kind of mechanism is something that we provide as a control for site owners to be more explicit about what parts of the page are boilerplate. But NoFollow links are not very different from not putting the link there, and so I don’t see this to be very different in terms of the tools available to webmasters.
Eric Enge: Yes, indeed. So you have a NoContent that is interesting too because I am sure when people use that it just removes potential ambiguities in the interpretation of the page and allows them to focus the attention on all the things that are most important.
Priyank Garg: Exactly right.
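Editor's note: to make the robots-NoContent idea concrete, here is a minimal Python sketch that drops marked sections before collecting a page's text. It assumes the marker is applied as a `robots-nocontent` value in an element's `class` attribute; the names `MainContentExtractor` and `indexable_text` are hypothetical, and this is an illustration rather than Yahoo!'s implementation.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects text that is not inside an element marked robots-nocontent."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a marked section
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._skip_depth or "robots-nocontent" in classes:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html):
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    '<div class="nav robots-nocontent"><a href="/">Home</a><br>About</div>'
    '<p>Unique article text.</p>'
    '<div class="robots-nocontent">Sponsored ad</div>'
)
text = indexable_text(sample)
```

The sketch assumes well-formed nesting. Unbalanced markup inside a marked section would throw off the depth counter, so a production parser would need to be more forgiving.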
Eric Enge: Yes, so that’s a good tool. Have any of the other search engines moved toward supporting that?
Priyank Garg: We’ve actually brought this up in our conversations. You might recall earlier this month we all blogged about the support we have for robots exclusion protocols.
Eric Enge: Yes.
Priyank Garg: And we resolved a bunch of the small variations among us. So it has been brought up. I don’t believe any of the others are supporting it yet, but we will find out in time.
Eric Enge: What is it that Yahoo does when you discover duplicate content across two different sites and how does it deal with that in terms of the quality of the search experience?
Priyank Garg: Our goal is to surface good, unique content for users and provide the maximum amount of relevant information for every query the user makes. So, our efforts are constantly to detect duplicate content sources, recognize the parent source as much as possible, and attribute content as much as possible to the parent or the original author for duplicate content. Then we try and surface that for every query that we receive that it’s relevant for. Say site A has content which is duplicated on site B, and we recognize that A is the parent; then for a query related to that content we will likely surface A higher. But if a query says I want content from site B on those terms, we will obviously try to surface that.
Eric Enge: But it’s not always that easy to know who the parent is.
Priyank Garg: That’s true; it is not always easy to know the best page, but it’s part of our algorithmic efforts to detect that intent, and we continue to do that. There are lots of signals that can often work, and in most cases they do work when the duplication is not egregious or intentional. It is entirely a function of how the Web is operating. Usually, we do a reasonable job, but it’s not always possible.
Eric Enge: Right. And then, of course, is the extreme version where it’s a copyright violation and sometimes that escalates itself to you in the form of DMCA requests.
Priyank Garg: That’s right, we have a well-documented process for DMCA complaints. Those complaints, when they come in, are investigated directly by our support and editorial teams and can be acted upon in a very targeted manner. So if you or any site owner has any content that you believe has been plagiarized or taken without your consent and you file a DMCA complaint with us, we will investigate that and take down the content that is found to be in violation of the copyright rules.
Eric Enge: Right, although I think it’s probably fair to say that if you file a DMCA request that you best be the owner of the content.
Priyank Garg: That’s of course true. You better know what you are pointing out.
Eric Enge: Yes, indeed. Are there situations in which extreme amounts of duplicate content can be flagged?
Priyank Garg: The essential policy on duplicate content is not to treat it as negative; it’s essentially to treat it as an optimization on our systems. But there is a point where that no longer holds true. A common example could be a spammer, who has hundreds of millions of posts up on the same content.
That’s a clear example where you can say that it’s not really duplicate content, it is just an egregious practice that can affect the entire site. So, there is a point at which it does become a violation of our editorial guidelines.
Eric Enge: Yes, indeed. To finish I would like to see if you have some general advice that you would like to offer to publishers and SEOs in terms of what Yahoo views as best practices.
Priyank Garg: Yes. The basic principle remains the same as you said; be aware of users, and that’s what we will continue to gear ourselves toward. But be search engine smart so that we can discover your content. The robots NoContent and other tools that we have provided are means that give you control, and if you use them, they can work for you. We don’t expect everyone to have to use those controls, and we continue to build algorithms to do much of that work.
Yahoo Site Explorer continues to be a great avenue to learn about what we are doing. We have been doing some work learning from the last feature we launched, which was well-appreciated, the Dynamic URL Rewriting. That is a tool that we have seen in multiple examples as having really significantly increased the quality of the experience of site owners.
I talked about this feature again at SMX Advanced in June 2008, and while speaking on the panel, someone from Disneyworld was in the audience. Within five minutes, while I was still describing it, he went to Site Explorer and set it up, and he is already seeing the benefits of what he had implemented.
Eric Enge: Right. I really appreciate you taking the time to speak with me today.
Priyank Garg: Thank you Eric!