MetaFilter's site and server can always use upgrades of hardware, software, and bandwidth, as well as more stable funding for continued support of its small but high-skilled moderation and backend team! If you'd like to chip in, you can donate to Metafilter.

Infodump and Excel

From Mefi Wiki
Jump to: navigation, search

Opening infodump files in Excel

Get & unpack the file. I'm using MetaTalk post stats for this, because it's one of the smaller files.

Open the file. It probably won't show the file by default - instead of showing Files of type: All Excel Files, show Text Files instead.

Note that many of the Infodump files have enough rows to overwhelm Excel. See Infodump#File_size for file sizes & spreadsheet limits.

The Text Import Wizard should start:

  • Step 1: It's delimited, not Fixed width.
  • Step 2: Select only Tab as the delimiter.
  • Step 3: Finish.

The datestamp format is a bit off: instead of showing "2000-03-03 18:22:59.870", it shows up as "22:59.9" here. The full datestamp is still there, it's just not being displayed. You can change it to your liking - I like the format "3/14/01 1:30 PM".

Do Stuff

The first line is the datestamp for the database dump. This can get in the way.

The second line is a header, labelling each field type.

You could now "Select All," go to Data then Sort, tell Excel that there is a header row, & find

  • the threads with the most comments - in Metatalk, post 15931 currently has the most, 3649
  • the most favorites - post 15606 currently has the most, 270.

Filtering by deleted values of 1 shows only the deleted posts - sorting by userid then scanning shows that userid 191 has the most deleted posts in MetaTalk. (That would be admin pb, deleting test posts.)

Sorting popular tags

I'm using Excel 2007, which supports more rows. Tagdata_mefi has 366k rows, with multiple rows for posts with multiple tags. I removed the date line, then sorted the tag column alphabetically. Googling "count uniq rows excel" (I was looking for equivalents for the unix command uniq) led me to a way to count tags using a pivot table. I copied the pivot table report & used Paste Special to put the values into a new spreadsheet, which I then sorted by Count of tag_name, & started scanning.

There were 79k unique tags.

For comparison, using Debian Linux tools, here's the command to produce the equivalent list: "sort -f -k 5 tagdata_mefi.txt | uniq -c -i -f 4 | sort -rn | less" That leaves some unnecessary columns.

See Also

  • Infodump — Source of the data files about Metafilter.