NITLE Census News

News and statistics from the NITLE Blog Census






Others:

BlogCount
Phil Wolff's statistics aggregator

Blogdex
Weblog diffusion index

Technorati
High coverage blog search engine

Waypath
Per-post blog search engine

LiveJournal Stats
Excellent statistics page for LiveJournal

Blogalization
A multinational perspective on blogging







Creative Commons License

11.17.03

Archive Status

Several people have written in to report broken links in the archive section. We've been having some growing pains, including finding server space to store the full-db archive. I've gone ahead and fixed the links to the URL+metadata list, at least, so you can get a full list of known blog URLs along with each blog's suspected language and authorship tool.

Expect the full-content archive to be functional by early December. Thanks to everyone for being patient - and not suspecting nefarious motives.

9:36 PM


08.14.03

Equal Numbers, Different Interests

A recent Jupiter Research article included the claim that "blogging is split evenly among the genders".

We were curious to see if this result would hold for Blog Census data. On August 5, we hand-checked a random sample of 776 out of a pool of 490,000 English-language weblogs. We looked for unambiguous evidence of the blogger's sex (such as photos or gendered pronouns in reported speech), and marked sex as unknown when such evidence was unavailable.

Our results for anglophone bloggers supported the Jupiter data:



39.8% of bloggers in the sample were men, and 36.3% were women. This result fell well within the margin of error of ±3.5% (indicated in the graph by red error bars).

When we looked at the sample blogs in more detail, however, an interesting pattern emerged:



Nearly half of the blogs in our sample (368, or 47%) fell within the category of 'personal diary' - a journal dedicated entirely to recording the events of the blogger's life. Within this group, women outnumbered men by about two to one (56% to 28%, with a margin of error of ±4.8%).

In other categories, women were greatly outnumbered:



Of the 6.2% of sites in the 'political' category - sites primarily devoted to politics, current events, foreign policy, and various ongoing wars - a bare 4% were written by women.

(Note that this result has a larger margin of error: ±14.5%.)

This quick look suggests that the overall even split between the sexes masks significant differences in what men and women choose to write about. In future studies, we'll be looking at blog categories in more detail, and seeing if these patterns of interest hold true in other languages.

Margins of error listed here were computed at the 95% confidence level.
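For anyone who wants to sanity-check margins like these, here is a minimal sketch of the textbook worst-case calculation for a sample proportion. It assumes a simple random sample, a normal approximation, p = 0.5, and z = 1.96 for 95% confidence. The function name is made up for illustration, and this is not necessarily the exact calculation behind every figure in this post.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    n: sample size
    p: assumed proportion (0.5 is the worst case, giving the widest interval)
    z: critical value (1.96 corresponds to roughly 95% confidence)
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Full hand-checked sample from the post: 776 weblogs.
    print("%.1f%%" % (100 * margin_of_error(776)))
```

With the full sample of 776 this comes out to roughly 3.5 percentage points. Smaller subsamples give wider margins, since the margin shrinks only with the square root of the sample size.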

6:43 PM


07.29.03

Measuring Weblog Churn Rate

There has been some debate lately about the proper definition of "active" weblog.

The Blog Census gets its data from a variety of sources, including popular update sites and its own crawl of found links (described in euthanizing detail on the Methodology page). I was curious to see how accurate the census was in picking out active sites, and what kind of meaningful threshold we should be using to determine 'active' vs. 'out-of-date' weblogs. So on July 28, we picked a random sample of 529 weblogs from the full pool of 675,000, and examined them by hand.

Here's how our sample broke down (margin of error is ±4.5%):

An explanation of the categories:

This data suggests that about one in three weblogs in the census database is abandoned, unused, or very much out of date.

Cameron Marlow points out that there is no clear definition for what should constitute an "active" weblog. The threshold of eight weeks is completely arbitrary, so I thought it would be interesting to see the distribution of most recent posts over time.

The figure below shows the percentage of all actual weblogs from our sample (that is, the "active" and "out-of-date" slices of our pie chart) plotted against the number of weeks we want to use for a definition of "active". For example, 77% of the blogs in the sample had been posted to within the past eight weeks.

As you can see, the plot tails off at about the 95% level. Several blogs in the sample were abandoned as early as 2001.
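The curve described above is a cumulative distribution: for each candidate threshold, the share of blogs whose most recent post falls within that many weeks. Here is a rough sketch of how such a curve could be computed. The function names and the sample numbers in it are invented for illustration and are not census data.

```python
def fraction_active(weeks_since_last_post, threshold_weeks):
    """Share of blogs whose latest post is at most threshold_weeks old."""
    if not weeks_since_last_post:
        raise ValueError("need at least one blog")
    active = sum(1 for w in weeks_since_last_post if w <= threshold_weeks)
    return active / len(weeks_since_last_post)


def activity_curve(weeks_since_last_post, max_weeks):
    """(threshold, fraction active) pairs for thresholds 1..max_weeks."""
    return [
        (t, fraction_active(weeks_since_last_post, t))
        for t in range(1, max_weeks + 1)
    ]


if __name__ == "__main__":
    # Hypothetical ages (in weeks) of the most recent post on six blogs.
    sample = [0, 1, 3, 8, 12, 40]
    for threshold, share in activity_curve(sample, 12):
        print(threshold, round(share, 2))
```

Because the curve never decreases, any choice of threshold is a trade-off: a longer window counts more blogs as active, and blogs abandoned long ago keep the curve from reaching 100% until the window is very wide.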

Thanks go out to the indefatigable Rachel Cotton for her help in preparing this data.

3:40 PM


07.14.03

Added a Creative Commons license (ugly button at left), and put up a new full database snapshot in the Download section.

8:27 AM


07.04.03

I have fixed some crawler throttling bugs. The NITLE blog bot should not be visiting any domain more often than once every ten seconds. If you spot a NITLE bot that's crawling faster, please let me know.
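For the curious, here is a minimal sketch of what a per-domain throttle of this kind can look like. It is not the actual NITLE crawler code: the class name is hypothetical, it treats the URL's hostname as the "domain", and it only handles a single-threaded crawler.

```python
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforce a minimum delay between visits to the same host."""

    def __init__(self, min_interval=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock          # injectable so the class can be tested
        self._sleep = sleep
        self._last_visit = {}        # hostname -> time of last visit

    @staticmethod
    def _host(url):
        return urlsplit(url).hostname or ""

    def wait_time(self, url):
        """Seconds still to wait before this URL's host may be visited."""
        last = self._last_visit.get(self._host(url))
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (self._clock() - last))

    def acquire(self, url):
        """Block until the host may be visited, then record the visit."""
        delay = self.wait_time(url)
        if delay > 0:
            self._sleep(delay)
        self._last_visit[self._host(url)] = self._clock()
        return delay
```

A crawler would call `acquire(url)` right before each fetch. Keying on the hostname means two subdomains of the same site are throttled separately, which may or may not be what a site owner expects.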

4:30 PM


06.30.03

The error on our RAID 5 archive filesystem was not recoverable. Think twice before buying a POPnetserver, boys! This means the earliest (June 11) database snapshot is no longer available. If anyone downloaded the whole thing, please email me and help me restore it to the archive.

2:25 PM


06.27.03

We've now got Diaryland updates feeding the queue as well, and I've added a map to the stats bar that displays GeoURL data that we find. On the unhappy side, a disk failure seems to have killed the archive. I'm awaiting word from tech support as to whether the data can be recovered.

2:47 PM


06.19.03

I have added regular monitoring of LiveJournal updates to the census - expect the numbers to grow quickly! I've also made a new data set available on the download page. There are about 5,000 weblogs with geographical metadata in our current collection, and now you can download those URLs as a flat file. It's tab-delimited, with URL and lat/long coordinates for each.

11:17 AM



06.12.03

I'm happy to announce the launch of the new NITLE Blog Census homepage. We've been crawling the web in search of weblogs since early May. To our knowledge, this is the most comprehensive collection of weblog stats currently available. You can find out all the details on our about and methodology pages.

Please note that all of our data is available for download - our first database snapshot was taken on June 11, and we'll be taking regular snapshots every two weeks or so, to create a permanent blog archive.

Our crawler obeys all robots.txt exclusion rules, so if you don't wish to be a part of the census, just tell our robot. You can also send me email directly with any questions, suggestions, or comments.
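As an illustration of how a robots.txt exclusion works, here is a small sketch using Python's standard `urllib.robotparser` module. The user-agent string `ExampleCensusBot` is a placeholder, not the name our robot actually sends, and the helper function is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: shut out one (hypothetical) bot entirely,
# and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleCensusBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ExampleCensusBot", "http://example.com/index.html"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "http://example.com/index.html"))
```

A site owner who wants out of a crawl only needs the first two lines of that file, with the real user-agent name substituted.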

12:00 AM


© 2003 National Institute for Technology and Liberal Education.

NITLE is a non-profit consortium of liberal arts colleges funded by the Andrew W. Mellon Foundation.