
Corpus

From Mefi Wiki

The Metafilter Corpus project is a collection of language data about Metafilter. It is intended as an adjunct to the Metafilter Infodump: where the Infodump focuses on site activity numbers, the Corpus focuses on the actual language content of the site.

Some brief notes on the file format and the collection methodology follow.

Available data

Frequency Tables

Table format

Each table contains word-frequency information for comments posted to a specific subsite or set of subsites within a specific time period. The files have four header lines followed by a series of rows of tab-separated values, sorted in descending order of frequency.

The header lines are:

  1. Date-range (and in some cases subsite) information about the file
  2. Count of total words analyzed for the file, and count of distinct tokens in the file (i.e. number of rows after the header)
  3. Column headers for count (number of times the given word appeared in comments), parts per million (normalized value for comparison with other frequency tables), and the word in question
  4. Blank line for readability

For example, a table begins like this:

2010-01-01 to 2011-01-01
86883789 total words, 567139 unique words
count   PPM                     word

3676618 42316.5016433618        the
2469774 28426.1774080778        to
2258729 25997.1281869395        a
2075948 23893.3870621135        and
1842864 21210.67717247  of
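
The format above can be read with a few lines of code. The following is a minimal sketch in Python, not part of the Corpus project itself; the names `parse_frequency_table`, `ppm`, and `SAMPLE` are hypothetical. It assumes the second header line always follows the "N total words, M unique words" pattern shown in the example, and it relies on the observation that the PPM values in the example equal count divided by total words, times one million.

```python
import re

def parse_frequency_table(text):
    """Parse one frequency table into (description, total, unique, rows)."""
    lines = text.splitlines()
    description = lines[0].strip()
    # Header line 2, assumed pattern: "<N> total words, <M> unique words"
    match = re.match(r"\s*(\d+)\s*total words,\s*(\d+)\s*unique words", lines[1])
    if match is None:
        raise ValueError("unrecognized counts line: %r" % lines[1])
    total, unique = int(match.group(1)), int(match.group(2))
    rows = []
    # lines[2] holds the column headers and lines[3] is blank,
    # so the data rows start at index 4.
    for line in lines[4:]:
        # Tokens never contain whitespace, so a plain split() handles
        # tab-separated rows (and space-padded copies of them).
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        count, ppm_value, word = fields
        rows.append((int(count), float(ppm_value), word))
    return description, total, unique, rows

def ppm(count, total):
    """Parts per million: occurrences of a word per million words analyzed."""
    return count / total * 1_000_000

# A two-row excerpt of the example table, with real tab separators.
SAMPLE = """2010-01-01 to 2011-01-01
86883789 total words, 567139 unique words
count\tPPM\tword

3676618\t42316.5016433618\tthe
2469774\t28426.1774080778\tto
"""

description, total, unique, rows = parse_frequency_table(SAMPLE)
```

Because PPM is normalized by the total word count of each file, `ppm` values from two different tables (say, two different years) can be compared directly even when the tables cover very different amounts of text.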

Methodology

The frequency tables are generated by a Perl script running with local access to the Metafilter database. For any given table, the script queries for the comment text fitting the subsite and date-range constraints, and then takes each individual comment through a tokenizing process to break the text down into individual words for counting.

The current tokenizing process involves:

  • converting the source comment to all lower case
  • removing HTML tag content to reduce the comment to plain text
  • removing line breaks
  • removing HTML named entities, e.g. &amp;amp; (&) or &amp;gt; (>)
  • stripping most punctuation and replacing it with white space to prevent accidental concatenation of adjacent words
  • splitting the resulting cleaned-up comment into individual word tokens wherever whitespace is found
  • for each token, stripping any remaining character other than letters, numerals, hyphens, single-quotes (as apostrophes), and underscores
  • further stripping any leading or trailing hyphens, underscores, and single-quotes, leaving only intra-word occurrences of those characters

The resulting tokens are counted in a hash, sorted, and written out to the output file.
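
The tokenizing and counting steps can be sketched as follows. This is an illustrative Python approximation, not the actual Perl script, and all names in it are hypothetical. Several details are assumptions because the description does not specify them: tags, line breaks, and entities are replaced with a space rather than deleted outright, and the exact set of punctuation treated as "most punctuation" is a guess.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]*>")            # HTML tags, including their attributes
ENTITY_RE = re.compile(r"&[a-z][a-z0-9]*;")  # named entities (text is already lowercased)
# Assumption: the real punctuation list is not documented; this set is a guess.
PUNCT_RE = re.compile(r"[.,;:!?()\[\]{}\"/\\]")
DISALLOWED_RE = re.compile(r"[^a-z0-9'_-]")  # anything but letters, numerals, ' _ -
EDGE_CHARS = "-_'"

def tokenize(comment):
    """Reduce one comment to a list of cleaned-up word tokens."""
    text = comment.lower()
    text = TAG_RE.sub(" ", text)       # assumption: tags become a space
    text = text.replace("\r", " ").replace("\n", " ")
    text = ENTITY_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)     # whitespace prevents accidental concatenation
    tokens = []
    for raw in text.split():
        # Drop disallowed characters, then trim leading/trailing - _ '
        token = DISALLOWED_RE.sub("", raw).strip(EDGE_CHARS)
        if token:
            tokens.append(token)
    return tokens

def count_words(comments):
    """Count tokens across comments; a Counter plays the role of the hash."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    # most_common() returns (word, count) pairs in descending order of count
    return counts.most_common()
```

For instance, under these assumptions `tokenize("Mefite's año, the mefites'")` yields `["mefite's", "ao", "the", "mefites"]`: the intra-word apostrophe survives, while the non-ASCII letter and the trailing apostrophe are stripped.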

Some caveats:

  • The sanitizing steps in the tokenizing process (stripping punctuation, etc.) were designed to create a relatively clean word list for frequency calculation, but they lead to small discrepancies between the literal strings occurring in some comments and the data in the frequency table. Words that intentionally use non-alphabetic, non-numeric characters will have been transformed somewhat in the process; non-ASCII Unicode characters will be absent (so "año" becomes "ao"); possessive forms with a trailing apostrophe will be missing that apostrophe (so "mefite's" appears in the word list correctly but "mefites'" is counted as "mefites"); and so on.
  • Deleted comments are not excluded from the corpus, so a manual count for a token based on the results of a normal Metafilter site search may be lower than the count in these files.
  • No attempt is currently made to identify and exclude those portions of a given comment consisting of quotation of some earlier comment in a thread or quotation from some external source. Accordingly, some tokens are arguably over-represented in the counts because they are reiterated in one or more replies to a given source comment or other text.
  • Because HTML is simply stripped and ignored, text that appears inside tags, such as the content of title= attributes, is not captured in these files.