Monday, January 25, 2010
If you read news online, you've probably noticed that articles aren't static. They often change over time, to reflect things like typo fixes, shifts in emphasis, new information or corrections of previous mistakes. Sometimes they even switch URLs, or become unavailable after a certain period of time. For a human being who reads at most a few dozen articles a day, this is no big deal.
But if you happen to be, say, a news search engine that crawls hundreds of articles at thousands of sites every minute, this presents a unique set of challenges. How do you balance looking for new content against the need to update older content? How can you make sure the content is fresh, and that you don't link to dead pages or display headlines that the publisher has since changed?
To deal with these issues, Google News has implemented a recrawl feature that allows us to focus on getting the newest articles around while still ensuring that we're displaying the most up-to-date information. From the moment we discover a new article, we'll keep revisiting it to look for changes. Since we've noticed that most changes to articles occur just after they're published, we revisit articles most frequently in the first day after we've found them. In some cases, we'll even revisit articles we had trouble crawling the first time around. After the first day, we visit articles less often. Either way, we try hard to present users with the freshest news. (We bet whoever wrote "Dewey Defeats Truman" wishes they had recrawl!)
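To make the idea concrete, here is a minimal sketch in Python of an age-based revisit schedule: frequent visits during an article's first day, sparser visits afterward, and a quick retry when a crawl failed. It is an illustration only, not the actual Google News implementation. The function name `next_revisit_delay` and every interval below are hypothetical values chosen for the example.

```python
from datetime import timedelta

# All intervals are made-up for illustration; no real values are implied.
FIRST_DAY = timedelta(days=1)
FREQUENT_INTERVAL = timedelta(minutes=30)  # while the article is under a day old
SPARSE_INTERVAL = timedelta(hours=12)      # once the first day has passed
RETRY_INTERVAL = timedelta(minutes=10)     # after a crawl that failed


def next_revisit_delay(age: timedelta, last_crawl_failed: bool = False) -> timedelta:
    """Return how long to wait before revisiting an article.

    `age` is the time elapsed since the article was first discovered.
    """
    if last_crawl_failed:
        # Articles we had trouble crawling get another attempt soon.
        return RETRY_INTERVAL
    if age < FIRST_DAY:
        # Most changes happen shortly after publication, so check often.
        return FREQUENT_INTERVAL
    # Older articles change rarely, so check less often.
    return SPARSE_INTERVAL


if __name__ == "__main__":
    for hours in (1, 12, 36):
        age = timedelta(hours=hours)
        print(f"article age {hours:>2}h: revisit in {next_revisit_delay(age)}")
```

A crawler's scheduler could call a function like this each time it finishes fetching an article, then queue the next visit for that article at the current time plus the returned delay.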
For readers, this feature is intended to reduce the number of outdated headlines and dead links you might find. And for publishers, rest assured that we'll be back to find your latest stories and updates as soon as we can.