NITLE Blog Census

2,865,107 Weblogs Indexed
1,890,970 Estimated Active


Creative Commons License

Methodology

Definitions

For the purposes of this study, a blog is considered to be a regularly updated personal website with posts that appear in reverse chronological order. Community sites like Slashdot and MetaFilter are included in the crawl; however, sites that serve as the home page for a major blogging tool (MovableType.org, Blogger.com) are not included.

For the purposes of this study, an active weblog is a site that has more than 500 bytes of textual content (about a hundred words) and was updated within the last 90 days.
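The two activity criteria can be expressed as a single check. The following is a minimal Python sketch, not the census's own code (the crawler itself is written in Perl). The function and constant names are hypothetical, and it assumes the text has already been extracted from the HTML and that bytes are counted in UTF-8.

```python
from datetime import datetime, timedelta

# Thresholds taken from the definition above; the names are illustrative.
MIN_TEXT_BYTES = 500                 # "more than 500 bytes of textual content"
ACTIVE_WINDOW = timedelta(days=90)   # "updated within the last 90 days"


def is_active(text: str, last_updated: datetime, now: datetime) -> bool:
    """Return True if a site meets both activity criteria."""
    # Assumption: size is measured on the UTF-8 encoding of the extracted text.
    text_bytes = len(text.encode("utf-8"))
    recently_updated = (now - last_updated) <= ACTIVE_WINDOW
    return text_bytes > MIN_TEXT_BYTES and recently_updated
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check repeatable when a crawl is re-evaluated later.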

Finding weblogs

We find weblogs by crawling the web: starting at one site, we follow all of its outbound links to see whether any of the linked sites are weblogs. We also seed our crawl queue with lists of known weblog URLs. These known URLs come to us from a variety of sources:

Blog Determination

How do we know that a site is a weblog? Our policy is to err on the side of false negatives; that is, we'd rather miss some real weblogs than improperly include a non-blog site. All sites in the database are marked with a confidence level, depending on how sure we are of their status.

A site will be marked as a weblog if it meets any of the following criteria, in order of precedence:

Our blog identification code is open source and available on the Comprehensive Perl Archive Network (CPAN) as WWW::Blog::Identify.

Blogs are stored with a confidence value attached. The highest value is for blogs confirmed as such by a human user (i.e., myself). The next highest is for user-submitted blogs, then blogs from update sites, then blogs detected by the Perl module, and finally sites rejected by all methods.

If a site is marked as a blog because of one set of criteria and subsequently appears in a set of URLs with a higher confidence level, its status is upgraded. For example, if the crawler guesses that a site is a Manila weblog based on a GIF in the HTML, and that site then appears on the weblogs.com update list, its confidence level is bumped up accordingly. So the best way to make sure you are included in our crawl, short of filling out the online form, is to ping weblogs.com.
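The upgrade rule amounts to keeping the highest confidence level ever seen for a URL. Here is a minimal Python sketch of that idea, assuming an in-memory dictionary in place of the MySQL database. The enum names and numeric values are hypothetical; the text above gives only the ordering.

```python
from enum import IntEnum


class Confidence(IntEnum):
    """Ordering follows the text; the numeric values are illustrative only."""
    REJECTED = 0          # rejected by all methods
    MODULE_DETECTED = 1   # detected by the identification module
    UPDATE_SITE = 2       # seen on an update list such as weblogs.com
    USER_SUBMITTED = 3    # submitted through the online form
    HUMAN_CONFIRMED = 4   # confirmed by a human


def record_sighting(db: dict, url: str, level: Confidence) -> Confidence:
    """Store the sighting, upgrading the stored level but never lowering it."""
    current = db.get(url, Confidence.REJECTED)
    db[url] = max(current, level)
    return db[url]
```

With this rule, a site first detected by the module and later seen on an update list ends up at `UPDATE_SITE`, and a later module detection leaves that level unchanged.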

Language Identification

We do language identification using a program adapted from TextCat, which analyzes trigram (three-letter) patterns in site text. Blogs with fewer than 500 bytes of text (as determined by HTML::TreeBuilder) are ignored.
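To illustrate the general idea of trigram-based identification, here is a minimal Python sketch. It is not the census's adapted program. It assumes a rank-order ("out-of-place") comparison between trigram profiles in the style of TextCat-like classifiers, and it assumes bytes are counted in UTF-8. All function names, the profile size, and the penalty value are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

PROFILE_SIZE = 300   # assumed cap on trigrams kept per profile
MIN_TEXT_BYTES = 500  # texts shorter than this are ignored, as described above


def trigram_profile(text: str, top_n: int = PROFILE_SIZE) -> List[str]:
    """Return the text's character trigrams, most frequent first."""
    cleaned = " ".join(text.lower().split())  # normalize case and whitespace
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    # Break ties alphabetically so that ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_n]]


def out_of_place(doc: List[str], lang: List[str]) -> int:
    """Sum of rank differences; trigrams missing from `lang` get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang)}
    return sum(
        abs(i - rank[gram]) if gram in rank else PROFILE_SIZE
        for i, gram in enumerate(doc)
    )


def identify(text: str, profiles: Dict[str, List[str]]) -> Optional[str]:
    """Return the name of the closest language profile, or None if text is too short."""
    if len(text.encode("utf-8")) < MIN_TEXT_BYTES:
        return None
    doc = trigram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))
```

In practice each language profile would be built once from a large sample of known text with `trigram_profile`, and every crawled site would then be compared against all of them.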

Please note that we ignore any language metadata in the HTML markup itself, since many non-English templates claim to be English anyway.

Blogs written in more than one language are likely to get an unreliable determination. I'm thinking about how to fix this.

Software

Crawl data is stored in a MySQL database. The crawler is written in Perl. Everything runs on a Linux server.

Contact

Please address all questions and comments to Aaron Coburn.

Updated 06-13-2003