StackOverflow Download Data Schema

Recently the StackOverflow team released a download of their data.

Unfortunately, a schema was not included. Here it is in full:

Users

  • Id
  • Reputation
  • CreationDate
  • DisplayName
  • LastAccessDate
  • WebsiteUrl
  • Location
  • Age
  • AboutMe
  • Views
  • UpVotes
  • DownVotes

Badges

  • UserId
  • Name
  • Date

Comments

  • Id
  • PostId
  • Text
  • CreationDate
  • UserId
  • UserDisplayName

Posts

  • Id
  • PostTypeId
  • CreationDate
  • Score
  • ViewCount
  • Body
  • OwnerUserId
  • OwnerDisplayName
  • LastEditorUserId
  • LastEditDate
  • LastActivityDate
  • Title
  • Tags
  • AnswerCount
  • CommentCount
  • FavoriteCount
  • ClosedDate

Votes

  • Id
  • PostId
  • VoteTypeId
  • CreationDate
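
As a rough illustration, the column lists above can be turned into empty tables. The sketch below is one possible way to do that with Python's built-in sqlite3 module. It is not part of the original dump or of any official tooling. The names SCHEMA, create_table_sql and create_schema are made up for this example, and the use of SQLite is an assumption. The dump does not state column types, so none are declared here (SQLite permits untyped columns).

```python
import sqlite3

# Column lists copied from the schema above. Types are deliberately omitted
# because the dump does not document them (assumption: untyped SQLite columns).
SCHEMA = {
    "Users": ["Id", "Reputation", "CreationDate", "DisplayName",
              "LastAccessDate", "WebsiteUrl", "Location", "Age",
              "AboutMe", "Views", "UpVotes", "DownVotes"],
    "Badges": ["UserId", "Name", "Date"],
    "Comments": ["Id", "PostId", "Text", "CreationDate", "UserId",
                 "UserDisplayName"],
    "Posts": ["Id", "PostTypeId", "CreationDate", "Score", "ViewCount",
              "Body", "OwnerUserId", "OwnerDisplayName", "LastEditorUserId",
              "LastEditDate", "LastActivityDate", "Title", "Tags",
              "AnswerCount", "CommentCount", "FavoriteCount", "ClosedDate"],
    "Votes": ["Id", "PostId", "VoteTypeId", "CreationDate"],
}


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement with quoted, untyped columns."""
    column_list = ", ".join(f'"{name}"' for name in columns)
    return f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})'


def create_schema(conn):
    """Create one table per entry in SCHEMA on the given connection."""
    for table, columns in SCHEMA.items():
        conn.execute(create_table_sql(table, columns))
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    for table in SCHEMA:
        print(create_table_sql(table, SCHEMA[table]))
```

Keeping the column lists in one dictionary means that adding a column later only requires editing that dictionary.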

There have been 4 responses to “StackOverflow Download Data Schema”

  1. nobody_ says:

    You have a few things missing in your schema:

    Comments:
    – userdisplayname

    Posts:
    – lasteditordisplayname

    Votes:
    – userid

  2. slacy says:

    Thanks! I was downloading the dump to figure out the schema when I came across your comment.

    There are some interesting points here. It would be good to detail the possible values for Posts.PostTypeId, and to note that some of the fields appear to be calculated from the data itself (i.e. they are shortcuts that make implementing web frontends easier; Posts.AnswerCount and Posts.CommentCount come to mind here).

  3. nobody_ says:

    Yeah, I figured out the schema more or less through trial and error. I have a script that parses the XML file and generates the appropriate SQL INSERT queries, and kept modifying the schema until I didn’t get any column-not-found errors.

    With regard to posttypeid, it can be either 1, which is a question, or 2, which is an answer.

    Also, the 2009-06 dump includes 2 new columns for the posts table, parentid and acceptedanswerid, so you should probably update your post.

  4. TomPier says:

    great post as usual!
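
The approach described in comment 3, parsing the XML and generating SQL INSERT queries, might look roughly like the sketch below. This is not nobody_'s actual script. It assumes that every record in the dump is a `<row>` element whose attributes are the columns, which should be checked against the real files. The names rows_to_inserts, POST_TYPES and SAMPLE are hypothetical, and the sample XML is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# From comment 3: PostTypeId 1 is a question, 2 is an answer.
POST_TYPES = {1: "question", 2: "answer"}


def rows_to_inserts(xml_source, table):
    """Yield (sql, params) pairs, one per <row> element in xml_source.

    ASSUMPTION: each record is a <row> element with one attribute per
    column. Adjust the parsing if the real dump is laid out differently.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row" or not elem.attrib:
            continue
        columns = list(elem.attrib)
        column_list = ", ".join(f'"{name}"' for name in columns)
        placeholders = ", ".join("?" for _ in columns)
        sql = f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})'
        params = [elem.attrib[name] for name in columns]
        yield sql, params
        elem.clear()  # release the element so large files stay manageable


# Invented sample data, not taken from the real dump.
SAMPLE = (b'<posts>'
          b'<row Id="1" PostTypeId="1" Title="Example question" />'
          b'<row Id="2" PostTypeId="2" />'
          b'</posts>')

if __name__ == "__main__":
    for sql, params in rows_to_inserts(io.BytesIO(SAMPLE), "Posts"):
        print(sql, params)
```

Running the generated statements against a database would surface any attribute that has no matching column, which is the trial-and-error loop the comment describes. Passing the values as parameters rather than pasting them into the SQL text avoids quoting problems in fields such as Body.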
