StackOverflow Download Data Schema
Recently the StackOverflow team released a download of their data.
Unfortunately, a schema was not included. Here it is in full:
Users
- Id
- Reputation
- CreationDate
- DisplayName
- LastAccessDate
- WebsiteUrl
- Location
- Age
- AboutMe
- Views
- UpVotes
- DownVotes
Badges
- UserId
- Name
- Date
Comments
- Id
- PostId
- Text
- CreationDate
- UserId
- UserDisplayName
Posts
- Id
- PostTypeId
- CreationDate
- Score
- ViewCount
- Body
- OwnerUserId
- OwnerDisplayName
- LastEditorUserId
- LastEditDate
- LastActivityDate
- Title
- Tags
- AnswerCount
- CommentCount
- FavoriteCount
- ClosedDate
Votes
- Id
- PostId
- VoteTypeId
- CreationDate
This entry was posted on Saturday, June 6th, 2009 at 11:49 am and is filed under Resources.
You have a few things missing in your schema:
Comments:
– userdisplayname
Posts:
– lasteditordisplayname
Votes:
– userid
Thanks! I was downloading the dump to figure out the schema when I came across your comment.
There are some interesting points here. It would be good to detail the possible values for Posts.PostTypeId, and to note that some of the fields appear to be calculated from the data itself (i.e., they are shortcuts that make implementing web frontends easier; Posts.AnswerCount and Posts.CommentCount come to mind here).
Yeah, I figured out the schema more or less through trial and error. I have a script that parses the XML file and generates the appropriate SQL INSERT queries, and kept modifying the schema until I didn’t get any column-not-found errors.
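For anyone who wants to try the same approach, here is a minimal sketch of that kind of script. It is not the original script. It assumes each dump file is a flat list of `<row>` elements whose attributes are the column values, and the function name `rows_to_inserts` and the sample badge row are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def rows_to_inserts(xml_source, table):
    """Yield one (sql, values) pair per <row> element in a dump file.

    Assumption: every record is a <row .../> element and each attribute
    name is a column name. The table name is passed in by the caller.
    """
    # iterparse streams the file, so large dumps need not fit in memory.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        columns = list(elem.attrib)
        values = [elem.attrib[c] for c in columns]
        placeholders = ", ".join("?" for _ in columns)
        sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
        yield sql, values
        elem.clear()  # release the element once its values are copied out


# Hypothetical one-row sample in the assumed format.
sample = io.BytesIO(
    b'<badges><row UserId="3" Name="Teacher" Date="2008-09-15T08:55:03.923" /></badges>'
)
statements = list(rows_to_inserts(sample, "Badges"))
```

Because the column list is taken from each row's attributes, any attribute missing from your table definition surfaces as a column-not-found error at insert time, which is how the schema above was pieced together. The values are returned separately from the SQL text so they can be passed as query parameters instead of being spliced into the string.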
With regard to posttypeid, it can be either 1, which is a question, or 2, which is an answer.
Also, the 2009-06 dump includes two new columns for the posts table, parentid and acceptedanswerid, so you should probably update your post.
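To illustrate the earlier point about calculated fields, here is a rough sketch of how AnswerCount could be recomputed from the posts themselves. It assumes that an answer's parentid holds the Id of its question, and that rows have already been loaded as dictionaries with keys spelled `Id`, `PostTypeId`, and `ParentId`. The function name and the sample rows are hypothetical.

```python
from collections import Counter

# PostTypeId values as described in the comment above.
QUESTION = 1
ANSWER = 2


def answer_counts(posts):
    """Return {question Id: number of answers} for a list of post dicts.

    Assumption: an answer's ParentId is the Id of the question it answers.
    """
    per_parent = Counter(
        p["ParentId"] for p in posts if p["PostTypeId"] == ANSWER
    )
    return {
        p["Id"]: per_parent.get(p["Id"], 0)
        for p in posts
        if p["PostTypeId"] == QUESTION
    }


# Made-up rows: two questions, one of which has two answers.
posts = [
    {"Id": 1, "PostTypeId": QUESTION, "ParentId": None},
    {"Id": 2, "PostTypeId": ANSWER, "ParentId": 1},
    {"Id": 3, "PostTypeId": ANSWER, "ParentId": 1},
    {"Id": 4, "PostTypeId": QUESTION, "ParentId": None},
]
counts = answer_counts(posts)
```

Comparing these counts against the stored Posts.AnswerCount values would be one way to check whether that column really is derived from the data.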
great post as usual!