Skip to content

Instantly share code, notes, and snippets.

@gousiosg
Last active December 6, 2023 22:12
Show Gist options
  • Save gousiosg/7600626 to your computer and use it in GitHub Desktop.
Save gousiosg/7600626 to your computer and use it in GitHub Desktop.
Script to import the stackexchange dumps into MySQL
# Copyright (c) 2013 Georgios Gousios
# MIT-licensed
create database stackoverflow DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
use stackoverflow;
create table badges (
Id INT NOT NULL PRIMARY KEY,
UserId INT,
Name VARCHAR(50),
CreationDate DATETIME
);
CREATE TABLE comments (
Id INT NOT NULL PRIMARY KEY,
PostId INT NOT NULL,
Score INT NOT NULL DEFAULT 0,
Text TEXT,
CreationDate DATETIME,
UserId INT NOT NULL
);
CREATE TABLE post_history (
Id INT NOT NULL PRIMARY KEY,
PostHistoryTypeId SMALLINT NOT NULL,
PostId INT NOT NULL,
RevisionGUID VARCHAR(36),
CreationDate DATETIME,
UserId INT NOT NULL,
Text TEXT
);
CREATE TABLE posts (
Id INT NOT NULL PRIMARY KEY,
PostTypeId SMALLINT,
AcceptedAnswerId INT,
ParentId INT,
Score INT NULL,
ViewCount INT NULL,
Body text NULL,
OwnerUserId INT NOT NULL,
LastEditorUserId INT,
LastEditDate DATETIME,
LastActivityDate DATETIME,
Title varchar(256) NOT NULL,
Tags VARCHAR(256),
AnswerCount INT NOT NULL DEFAULT 0,
CommentCount INT NOT NULL DEFAULT 0,
FavoriteCount INT NOT NULL DEFAULT 0,
CreationDate DATETIME
);
CREATE TABLE users (
Id INT NOT NULL PRIMARY KEY,
Reputation INT NOT NULL,
CreationDate DATETIME,
DisplayName VARCHAR(50) NULL,
LastAccessDate DATETIME,
Views INT DEFAULT 0,
WebsiteUrl VARCHAR(256) NULL,
Location VARCHAR(256) NULL,
AboutMe TEXT NULL,
Age INT,
UpVotes INT,
DownVotes INT,
EmailHash VARCHAR(32)
);
CREATE TABLE votes (
Id INT NOT NULL PRIMARY KEY,
PostId INT NOT NULL,
VoteTypeId SMALLINT,
CreationDate DATETIME
);
load xml infile 'Badges.xml'
into table badges
rows identified by '<row>';
load xml infile 'Comments.xml'
into table comments
rows identified by '<row>';
load xml infile 'PostHistory.xml'
into table post_history
rows identified by '<row>';
load xml infile 'Posts.xml'
into table posts
rows identified by '<row>';
load xml infile 'Users.xml'
into table users
rows identified by '<row>';
load xml infile 'Votes.xml'
into table votes
rows identified by '<row>';
create index badges_idx_1 on badges(UserId);
create index comments_idx_1 on comments(PostId);
create index comments_idx_2 on comments(UserId);
create index post_history_idx_1 on post_history(PostId);
create index post_history_idx_2 on post_history(UserId);
create index posts_idx_1 on posts(AcceptedAnswerId);
create index posts_idx_2 on posts(ParentId);
create index posts_idx_3 on posts(OwnerUserId);
create index posts_idx_4 on posts(LastEditorUserId);
create index votes_idx_1 on votes(PostId);
#!/usr/bin/env ruby
# Copyright (c) 2013 Georgios Gousios
# MIT-licensed
require 'sequel'
require 'mysql2'
require 'cgi'
Sequel::Database.extension :pagination
DB = Sequel.connect("mysql2://#{ARGV[1]}:#{ARGV[2]}@#{ARGV[0]}/#{ARGV[3]}")
table = ARGV[4].to_sym
field = ARGV[5].to_sym
lines = 0
lines_with_escapes = 0
DB.from(table.to_sym).select(:Id, field).each_page(1000) do |page|
page.each do |row|
lines += 1
text = row[field]
next if text.nil?
text_unescaped = CGI.unescapeHTML(text)
if text_unescaped != text
lines_with_escapes += 1
DB[table].filter(:Id => row[:Id]).update(field => text_unescaped)
end
print "\r #{lines} lines, #{lines_with_escapes} with escapes"
end
end

MySQL script to import stackexchange data dumps (as retrieved from here).

The accompanying Ruby script will un-escape various XML-escaped characters that the MySQL import process fails to do.

@gousiosg
Copy link
Author

Make sure you use absolute paths to the XML files.

@megansquire
Copy link

Hi -- I forked and updated this gist (link to fork here: https://gist.github.com/megansquire/877e028504c92e94192d) to show the two new tables that are available in the Sept 2014 version (tags, and post_links) as well as the new URL on Archive.org. Just in case anyone wants a newer version of this script.

@enricorotundo
Copy link

enricorotundo commented Dec 14, 2016

Hi, see my fork to import the dataset version Sept. 12, 2016 from https://archive.org/details/stackexchange.

I added some columns and changed some NOT NULLs.

Update:
Previous fork link is broken, follow this new link for the import script. Thanks @RoelVdP.

@wzw1990
Copy link

wzw1990 commented Jan 23, 2017

@tundo91 good job!

@fkenjikamei
Copy link

Anyone use Mac OS? I had a problem:
#13 - Can't get stat of '/Users/kenji/Documents/stackoverflow2016/Badges.xml' (Errcode: 13 "Permission denied")

Copy link

ghost commented Apr 6, 2017

@fkenjikamei most likely you have not set the secure_file_priv variable in your mysql .my.cnf file.
Adding
secure_file_priv=/Users/kenji/Documents/stackoverflow2016/
to $HOME/.my.cnf should fix the permission issue for you.

@EeshitaBiswas
Copy link

Can you provide your sample Badges.xml

@RoelVdP
Copy link

RoelVdP commented Aug 8, 2019

The link by @enricorotundo gives a 404. Looks like his userID changed or something. Here is the correct/current one I think; https://gist.github.com/enricorotundo/1e074af39d90629252a7df3fc1066397

@RoelVdP
Copy link

RoelVdP commented Aug 8, 2019

@EeshitaBiswas see that last link, it has Badges too.

@RoelVdP
Copy link

RoelVdP commented Aug 8, 2019

Suggest using Body MEDIUMTEXT NULL instead of just TEXT. Only one more byte overhead, but much more space in case question or answer is longer (16 MiB instead of 64 KiB possible). LONGTEXT (4 GiB) is likely overdoing it :) even though only one more byte.

@RoelVdP
Copy link

RoelVdP commented Aug 8, 2019

Two comments if you want to use this;

  1. Make sure to add --secure_file_priv= to the mysqld startup options (or secure_file_priv= in my.cnf under [mysqld])
  2. Add this to the top of the SQL file; SET @@SESSION.SQL_MODE=''; this is especially important for late 5.7 and 8.0 versions as they have the STRICT_TRANS_TABLES SQL_MODE turned on by default (and this results in 'ERROR 1263 (22004): Column set to default value; NULL supplied to NOT NULL column 'id' at row 1'. For more info, see https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_strict_trans_tables

@snimmagadda1
Copy link

Hello from 2020 - one more comment about using nowadays... I would update the statements for the create tables to have CHARACTER SET = utf8; ... might save some trouble for some of these more exotic posts with emojis/symbols.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment