Infodump

The infodump is a collection of information about Metafilter, for the number crunchers. See also the Metafilter Corpus project for more language-focused data.

Some brief notes on the file formats follow.


Data format

The infodump data is in tab-separated value format, with one record per line.

Most data is encoded in UTF-8, but see the MetaTalk post Tags with weird characters.
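As a rough illustration of the format, here is a minimal Python sketch for reading one of these files. The helper name read_infodump and the skip_lines default are assumptions (one timestamp line plus one header line before the data), so check them against the file you actually downloaded. Quoting is disabled because the data is plain tab-separated text.

```python
import csv
import io


def read_infodump(stream, skip_lines=2):
    """Yield each data record of a tab-separated Infodump file as a list of strings.

    skip_lines=2 is an assumption: one timestamp line plus one header line.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for index, row in enumerate(reader):
        if index < skip_lines or not row:
            continue  # skip the preamble and any blank lines
        yield row


# Usage with an in-memory sample standing in for an unzipped file.
sample = "Wed Oct  1 06:39:16 2008\npostid\tuserid\n19\t1\n24\t15\n"
records = list(read_infodump(io.StringIO(sample)))
```

For a real file, something like open(path, encoding="utf-8", errors="replace", newline="") may be a reasonable choice, since not every byte is guaranteed to be valid UTF-8.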


Available data

Post stats

The first lines of postdata_mefi.txt.zip (the Mefi file), slightly reformatted. The date line at the top of the file is shown here but omitted from the examples that follow.

Wed Oct  1 06:39:16 2008
                                        category        favorites
postid  userid  datestamp                       comments        deleted reason
19      1       1999-07-14 15:03:04.930 0       116     35      0       [NULL]
24      15      1999-07-14 21:58:18.327 0       1       0       0       [NULL]
25      1       1999-07-15 09:37:51.770 0       6       0       0       [NULL]
26      16      1999-07-15 09:54:26.280 0       4       0       0       [NULL]
27      16      1999-07-15 09:57:54.160 0       1       0       0       [NULL]

The AskMe, MeTa, & Music headers are the same. The category field values differ for each of the four files: askme and meta have several topical categories each represented by values 1-n, music lists a value of 1 for "Music Talk" posts and a 0 for song posts, and mefi posts have no category information and so list a zero in every row. See below for a key to each category.

Deleted posts are included in these files, along with a deletion reason where one was provided; a value of 1 in the deleted column indicates a deleted post. Deleted comments are not counted toward the comment totals for each thread.

For Metatalk posts, thread-closure status is also captured in the deleted column; a value of 2 indicates the thread was closed, and a value of 3 indicates the post was both closed and deleted (presumably in that order).
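The values of the deleted column described above can be decoded with a small helper. This is only a sketch; the function name post_status and the shape of its return value are illustrative choices, not part of the Infodump.

```python
def post_status(deleted_value):
    """Decode the 'deleted' column: 0 = normal, 1 = deleted,
    2 = closed (Metatalk only), 3 = closed and deleted (Metatalk only)."""
    value = int(deleted_value)
    return {"deleted": value in (1, 3), "closed": value in (2, 3)}


# Example: a Metatalk thread that was closed and then deleted.
status = post_status("3")
```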

Category value key

Metatalk

Cat ID  Description             URL stub
1	bugs	                bugs
2	feature requests	feature-requests
3	etiquette/policy	policy
4	uptime	                uptime
5	MetaFilter-related	meta-meta
6	general weblog-related	weblogs
7	ticketstub project	ticketstub
8	MetaFilter gatherings	meetups
9	MetaFilter Music	music
10	Ask MetaFilter	        ask
11	MeFi Podcast	        mefi-podcast

The URL stub is the string used to access a by-category index of Metatalk. A URL of the form http://metatalk.metafilter.com/feature-requests would, for example, list only those posts filed as "feature requests".

  • Only six of these categories are still available as category selections by users making new Metatalk posts: 1, 2, 3, 4, 5, and 8.
  • Category 11, "MeFi Podcast", is usable only by admins to denote official podcast posts (which are also indexed on the Podcast subsite).
  • Category 10, "Ask Metafilter", was used early in the life of the AskMe subsite, before it was ported to its own database table. It was never a selectable option for Metatalk posts.
  • Categories 6 and 7 have both been removed as obsolete. The "general weblog-related" category was more relevant early in the site's history, when the crowd and much of the site content were more explicitly and narrowly blog-centric. The "ticketstub project" category was used for posts about a ticketstub-memories idea Matt had been working on at the time, which has since been set aside.


Ask Metafilter

Cat ID  Description                     URL stub
1	computers & internet	        computers-internet
2	technology	                technology
3	home & garden	                home-garden
4	work & money	                work-money
5	sports, hobbies, & recreation	sports-hobbies-recreation
6	society & culture	        society-culture
7	travel & transportation	        travel-transportation
8	science & nature	        science-nature
9	education	                education
10	health & fitness	        health
11	shopping	                shopping
12	food & drink	                food-drink
13	writing & language	        writing-language
14	human relations	                human-relations
15	media & arts	                media-arts
16	pets & animals	                pets-animals
17	religion & philosophy	        religion-philosophy
18	clothing, beauty, & fashion	clothing-beauty-fashion
19	law & government	        law-government
20	grab bag	                grab-bag

The URL stub values work just as with Metatalk above: http://ask.metafilter.com/home-garden will, for example, display an index of only recent "home & garden" questions.
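A category index URL can be assembled from a subsite host and a URL stub, following the two example URLs given in the text. The dictionary and function names below are hypothetical, and only the two hosts mentioned above are included.

```python
# Hosts taken from the example URLs above; the dictionary keys are arbitrary labels.
SUBSITE_HOSTS = {
    "metatalk": "metatalk.metafilter.com",
    "askme": "ask.metafilter.com",
}


def category_index_url(subsite, url_stub):
    """Build the by-category index URL for a Metatalk or AskMe category."""
    return "http://{}/{}".format(SUBSITE_HOSTS[subsite], url_stub)


example_url = category_index_url("askme", "home-garden")
```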

Music

1       song
2       talk post

When Music was first launched, all posts were songs uploaded by users. A new section, Music Talk, was added July 2nd, 2008. Posts in the talk section have category_id 2; songs have category_id 1.

Music posts are not sortable by category via any URL stub; songs are listed on the front page of music.metafilter.com while talk posts are listed at music.metafilter.com/home/talk . Users cannot select a category per se at post time, but are presented with an initial page asking whether they intend to post a song or a talk post.

Metafilter

Metafilter has no category information associated with posts. All values in the category column of the postdata_mefi.txt are 0.

Post titles

postid  title
21616	Lord of the Peeps! 
21617	boot bus
21618	Fireworks in England
21619	The Teddy bear turns 100

Note that titles were not initially part of the posting form for any subsite other than Music. Titles have since been added to some posts created before the introduction of the title field (in the case of AskMe, all or very nearly all of them), most likely by the backtagging crew or otherwise by an admin.

  • Post titles were added to Metafilter on November 12th, 2002.
  • Post titles were added to AskMe on February 17th, 2005.
  • Post titles were added to Metatalk on February 13th, 2007.
  • Post titles were present from launch day for Music.

Post length

postid	title	above	below	url	urldesc
19	12	136	0	24	12
25	0	223	0	0	0
26	16	62	0	70	16
...
108426	19	256	0	35	49
108427	40	368	130	0	0
108428	10	253	0	25	35

Length is a raw character count of each field, including white space and html.

Title is the thread title; above is the above-the-fold text area; below is the below-the-fold "more inside" area. url and urldesc have non-zero values only in the postlength_mefi file, and correspond to the dedicated link and linktext fields users can use when posting. The fields are included across all postlength files to keep the file format consistent across subsites.

Comment stats

        postid
comment-id      userid  datestamp               faves   best answer?
1       1       1       1999-06-13 17:48:00.000 0       0
2       24      1       1999-07-15 01:21:06.213 0       0
4       2       1       1999-07-15 01:58:52.340 0       0
5       26      1       1999-07-15 10:00:12.850 0       0
6       25      16      1999-07-15 10:04:48.563 0       0

The best answer column lists a value of 1 for askme answers that have been marked "best" by the asker, and a 0 for all other answers. The column lists zeroes for all rows in the non-askme comment files.

Deleted comments are not included in these files.

Comment length

        length
comment-id
1       40
2       209
4       96
5       92
6       132

Length is a raw character count of the comment, including whitespace and html.

Tag data

tag_id  link_id link_date               tag_name
1       38715   2005-01-18 01:06:16.560 testing
3       38715   2005-01-18 01:06:16.560 metafilter
4       38733   2005-01-18 15:23:29.233 silly
5       38733   2005-01-18 15:23:29.233 recursion
6       38733   2005-01-18 15:23:29.233 photo

Tags can be added to a post in three different cases:

  1. By the poster at post creation time.
  2. By the poster, a mutual contact of the poster, or an admin, at some point after post creation.
  3. By a member of the backtagging crew at some point after post creation, if the post received no tags at the time it was created.

Tags whose link_date values are equal to the datestamp of their corresponding (by link_id) post were added by the poster at post creation time (case 1). There is no simple way to distinguish, with the data available in the database, between tags added in (case 2) or (case 3), though it's very likely that tags added to posts created before the original introduction of tagging to the various subsites were added by the backtagging crew.
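The case 1 test described above amounts to comparing each tag's link_date with the datestamp of its post. The sketch below assumes both values are kept as the raw strings from the files and are formatted identically; the function name and the sample rows (including the second tag and the post datestamp) are invented for illustration.

```python
def split_tags_by_origin(tag_rows, post_dates):
    """Split tag names into (added at post creation, added later).

    tag_rows:   iterable of (tag_id, link_id, link_date, tag_name) string tuples
    post_dates: dict mapping a post id to that post's datestamp string
    """
    at_creation, later = [], []
    for tag_id, link_id, link_date, tag_name in tag_rows:
        if post_dates.get(link_id) == link_date:
            at_creation.append(tag_name)   # case 1
        else:
            later.append(tag_name)         # case 2 or 3, indistinguishable
    return at_creation, later


# Hypothetical sample data, not taken from the real files.
post_dates = {"38715": "2005-01-18 01:06:16.560"}
tag_rows = [
    ("1", "38715", "2005-01-18 01:06:16.560", "testing"),
    ("9", "38715", "2005-02-01 00:00:00.000", "addedlater"),
]
at_creation, later = split_tags_by_origin(tag_rows, post_dates)
```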

Tags are automatically removed from deleted posts at deletion time, though that may not always have been the case.

Tag creation date approximation

Tag creation time is approximate for all tags that fall into cases 2 and 3, above, and exact for case 1.

When a tag is created within the database, its link_date is set equal to the creation date of the post it is attached to, regardless of when the tag is added or by whom. To compensate for this, the Infodump uses a simple heuristic to provide a more correct approximation of the tag's creation time:

- Given that tag_ids are created in chronological order, and so 
  no tag record n+1 could be created earlier than tag record n

For each record,
1. Let LASTDATE be the approximate date of the prior record.
2. For each tag record, compare two dates: 
 - the link_date for this tag record stored in the database
 - LASTDATE.
3. Record the most recent of those two dates as the approximate 
   date of this tag record's creation (and ergo the new LASTDATE).

Therefore, for any tag whose link_date is equal to that of the corresponding post, the link_date is exact; for any other tag, the link_date is an approximation in the form of the earliest date at which that tag could possibly have been added. The margin for error on this estimate is exactly the difference between a tag's listed link_date and the next newer link_date present in the tagdata file.

A similar technique is used in calculating Contact creation dates; see below.
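The tag date heuristic is essentially a running maximum over records sorted by tag_id. The following is a sketch of that idea, not the Infodump's actual export code; the function name and the sample records are hypothetical, and the timestamp format string is inferred from the examples on this page.

```python
from datetime import datetime

# Timestamp layout inferred from the sample rows, e.g. "2005-01-18 01:06:16.560".
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def approximate_tag_dates(records):
    """Apply the running-maximum heuristic.

    records: iterable of (tag_id, stored_link_date) pairs, sorted by tag_id.
    Returns a list of (tag_id, approximate_datetime) pairs.
    """
    last_date = None
    result = []
    for tag_id, stored in records:
        approx = datetime.strptime(stored, STAMP_FORMAT)
        if last_date is not None and last_date > approx:
            approx = last_date  # cannot predate the previous tag record
        last_date = approx
        result.append((tag_id, approx))
    return result


# Hypothetical records: tag 2 sits on an old post, so its date is pulled forward.
sample_records = [
    (1, "2005-01-18 01:06:16.560"),
    (2, "2003-05-01 00:00:00.000"),
    (3, "2005-01-18 15:23:29.233"),
]
approximated = approximate_tag_dates(sample_records)
```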

Favorites stats

Some example lines, the first valid ones of each type:

faveid  faver   favee   type    target  parent  datestamp
1       1       23470   1       51485   0       2006-05-09 10:40:43.467
376     33779   17897   2       1304343 51504   2006-05-10 17:37:38.530
12      1       30348   3       37730   0       2006-05-10 09:00:50.297
343     1       36467   4       584780  37790   2006-05-10 16:59:02.670
3       1       19832   5       11837   0       2006-05-09 14:12:52.670
418     1490    20496   6       311412  11855   2006-05-10 18:07:51.670
276     1       14928   7       317     0       2006-05-10 14:22:31.060
18386   1       1983    8       9       0       2006-06-29 20:57:24.060
18662   29872   30452   9       61      0       2006-06-30 11:04:06.983
77718   1       22242   9       2557    618     2006-10-22 11:16:09.553
47272   4741    508     10      10      0       2006-08-25 21:51:45.807
346380  52871   35136   11      7072    0       2007-07-06 17:41:58.100
283418  191     191     12      3       1       2007-05-24 16:03:46.233
1052954 1       191     13      1       1511    2008-05-27 09:38:47.633

Types:

  • 1 & 2 - Metafilter post & comment
  • 3 & 4 - Ask Metafilter post & comment
  • 5 & 6 - MetaTalk post & comment
  • 7 - Projects post
  • 8 - Music post
  • 9 - Music comment - if the parent is 0, this is broken.
  • 10 - Jobs post
  • 11 & 12 - Travel post & comment
  • 13 - Projects comment

For post-type favorites, "target" is the link_id of the post being favorited.

For comment-type favorites, "target" is the comment_id of the comment being favorited, and "parent" is the link_id of the thread in which the comment resides.

Note that the case of a fave of type 9, a Music comment, with a parent of 0, is the result of a sporadic bug introduced at the launch of Music on June 29th, 2006 and present until October 21st, 2006. There are approximately 30 degenerate favorites of this sort in the database, and they may be either repaired or removed in the future.

Favorites that have been removed by the favoriting user are deleted from the database; accordingly, the faveid values present in this file are not strictly sequential.
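The type codes and the target/parent rules above can be folded into one lookup. This is a sketch under the assumption that the columns have already been split out of a row; the names FAVE_TYPES and describe_favorite are illustrative only.

```python
# Type codes as listed above: (subsite, kind of thing favorited).
FAVE_TYPES = {
    1: ("Metafilter", "post"), 2: ("Metafilter", "comment"),
    3: ("Ask Metafilter", "post"), 4: ("Ask Metafilter", "comment"),
    5: ("MetaTalk", "post"), 6: ("MetaTalk", "comment"),
    7: ("Projects", "post"), 8: ("Music", "post"), 9: ("Music", "comment"),
    10: ("Jobs", "post"), 11: ("Travel", "post"), 12: ("Travel", "comment"),
    13: ("Projects", "comment"),
}


def describe_favorite(fave_type, target, parent):
    """Return (subsite, kind, thread_id, broken) for one favorites row.

    For posts the thread is the target itself; for comments it is the parent.
    'broken' flags the known degenerate case: a Music comment with parent 0.
    """
    fave_type, target, parent = int(fave_type), int(target), int(parent)
    subsite, kind = FAVE_TYPES[fave_type]
    thread_id = target if kind == "post" else parent
    broken = fave_type == 9 and parent == 0
    return subsite, kind, thread_id, broken


# Two rows from the example lines above.
ok_row = describe_favorite("2", "1304343", "51504")
bad_row = describe_favorite("9", "61", "0")
```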

Contact data

        contactee
contacter       date
1       14155   2004-06-15 12:00:00.000
1       2238    2004-06-15 12:00:00.000
1       14275   2004-06-15 12:00:00.000
...
13099   7683    2004-06-17 16:31:51.040
15231   14752   2004-06-17 16:31:51.040
...
45087   7610    2007-10-31 12:23:15.683
16719   61      2007-10-31 13:28:38.670
48758   1       2007-10-31 13:47:16.843


Contacter is the id of the user creating the contact; contactee is the user added as a contact.

All dates on or after 2007-10-31 11:55:38.840 are exact; all dates prior to that are approximate.

Contact creation date approximation

Because date-of-creation information for contacts did not exist in the database originally, the date provided for all records before October 31st, 2007 is an approximation, best formally described as "the earliest date at which the contact could possibly have been created". The algorithm for determining the date is as follows:

- Given that the earliest date any contact could have been added 
  was June 15th, 2004 (the date the feature was launched), and
- Given that contacts are created in chronological order, and so 
  no contact record n+1 could be created earlier than contact record n

For each record,
1. Let LASTDATE be the approximate date of the prior record.
2. For each contact record, compare three dates: 
 - the date the contacter joined mefi
 - the contactee's join date
 - LASTDATE.
3. Record the most recent of those three dates as the approximate 
   date of this contact record's creation.

The approximation relies on the following assumptions: Alice cannot create a contact before she has joined the site; Bob cannot be the target of a contact before he has joined the site; and no contact can be added before the date a previous contact was added. The date each record was created therefore cannot be earlier than the most recent of those three checkpoints.

This functionally limits the accuracy of the approximate dates to however frequently brand-new users either create or are the target of contact records; LASTDATE will remain static for stretches between these events.
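The contact date algorithm can be sketched in Python as a running maximum seeded with the feature's launch date. This is an illustration rather than the code used to build the Infodump. The noon launch time is an assumption taken from the earliest dates in the sample rows, and the function name, user ids, and join dates below are made up.

```python
from datetime import datetime

# June 15th, 2004 is from the text; the 12:00:00 time is assumed from the sample rows.
FEATURE_LAUNCH = datetime(2004, 6, 15, 12, 0, 0)


def approximate_contact_dates(contacts, join_dates):
    """Return the earliest possible creation date for each contact record.

    contacts:   list of (contacter, contactee) pairs in creation order
    join_dates: dict mapping userid to that user's join datetime
    """
    last_date = FEATURE_LAUNCH
    result = []
    for contacter, contactee in contacts:
        approx = max(join_dates[contacter], join_dates[contactee], last_date)
        last_date = approx  # later records cannot predate this one
        result.append(approx)
    return result


# Hypothetical users: user 2 joined after the feature launched.
join_dates = {
    1: datetime(2000, 1, 27),
    2: datetime(2004, 6, 17),
    3: datetime(2001, 1, 1),
}
dates = approximate_contact_dates([(1, 3), (2, 1), (3, 1)], join_dates)
```

In this sample the third record inherits the second record's date, which shows how LASTDATE stays static until another brand-new user appears.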

Notes:

- Signups were closed at the time the feature was launched, which means that the appearance of contacts involving newer-than-LASTDATE users is very infrequent up until November of 2004 when signups reopened. Consequently, the approximations for the first several months of the feature's use are exceptionally poor.

- The approximate date will never be correct; it will always be too early an estimate, as the time between when any new user joins the site and when they either create their first contact or become a contact of someone else is clearly non-zero.

- While it is apparently impossible to know what degree of error is involved in any approximate date, it may be possible to at least approximate the upper bound of the degree of error by comparing any given approximate date to the next distinct date.

User data

userid  joindate                name
1       2000-01-27 20:16:57.367 mathowie
8       2000-01-27 20:16:57.367 OneBallJay
13      2000-01-27 20:16:57.367 jeffp
16      2000-01-27 20:16:57.367 jjg
17      2000-01-27 20:16:57.367 honkzilla

This list includes all accounts that have at some point been active, with the sole exception (to the best of cortex's knowledge) of early experimental accounts removed from the db by Matt near the original launch of the site.

Gaps in the userid sequence before November 18th, 2004 are the result of bugs/testing work on the site (cortex doesn't have as much detail about this as he'd like, yet!); gaps on and after that date (when paid $5 signups began) most likely correspond to accounts for which a potential new user began the signup process (thus reserving the username and hence a db row) but did not complete it.

This file includes accounts which were once active but have subsequently been closed by their owners or by an admin.

Signup dates weren't tracked until January 27, 2000. Users who registered before that get placeholders for signup date; the Infodump lists "2000-01-27 20:16:57.367", and profile pages list "sometime in 1999".


Related

Sites powered by the Infodump

Tools for working with the Infodump

Userid munging

Upon request, a user can have their id obscured in Infodump contents. In all instances where their userid would normally appear, a unique 7-digit fake id is listed instead, such that the user's activity/id connection is broken while analytical views of the data can still function normally.

Any analysis that makes assumptions about the meaningfulness of a usernumber should take this into account and sanity check for munged ids.
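One possible sanity check is to treat any 7-digit id as munged. This rests on an assumption not stated above: that genuine userids stay below seven digits, which holds for the sample rows on this page but should be verified against the current user file. The function name is hypothetical.

```python
def looks_munged(userid):
    """Return True if the id has exactly seven digits.

    Assumption: real userids are shorter than seven digits, so a 7-digit
    value is taken to be one of the fake replacement ids.
    """
    return 1_000_000 <= int(userid) <= 9_999_999


# Example: filter munged ids out before doing id-based analysis.
userids = ["1", "48758", "1234567"]
real_ids = [u for u in userids if not looks_munged(u)]
```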

File size

  • Excel 95 supports 16384 rows
  • Excel 97 to 2003 support 65536 rows
  • Excel 2007 supports 1048576 rows
  • OpenOffice 1.1 & 2 support 32000 rows
  • OpenOffice 3 supports 65536 rows
  • MS Access supports a total of 2 gigabytes, also limiting temporary storage


Rows per file, as of 2009-08-13

category   Mefi     AskMe    MeTa    Music
Post       83257    124895   17446   3803
Comment    2682110  1718384  623356  19849
Tag        409230   501886   22491   17511
Favorites  2412653
Contact    56625
User       43676
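To see which of the row-limited tools listed above can open a given file, compare its row count with each limit. The sketch below uses the limits and counts from this page; the names ROW_LIMITS and tools_that_fit are illustrative, and the counts ignore any extra date or header lines in the file.

```python
# Row limits as listed above (MS Access is limited by size, not rows, so it is omitted).
ROW_LIMITS = {
    "Excel 95": 16384,
    "Excel 97 to 2003": 65536,
    "Excel 2007": 1048576,
    "OpenOffice 1.1 & 2": 32000,
    "OpenOffice 3": 65536,
}


def tools_that_fit(row_count):
    """Return the tools whose row limit is at least row_count."""
    return [name for name, limit in ROW_LIMITS.items() if row_count <= limit]


# Row counts from the table above.
mefi_posts = tools_that_fit(83257)
mefi_comments = tools_that_fit(2682110)
meta_posts = tools_that_fit(17446)
```

By these numbers the larger comment and favorites files exceed every spreadsheet limit listed, so they would need a database or a script instead.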

Metafilter Corpus project

A younger sibling to the Infodump, the Metafilter Corpus project is focused more on actual language use on the site.