<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: David Roher</title>
    <description>The latest articles on Forem by David Roher (@droher).</description>
    <link>https://forem.com/droher</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F114388%2Fa6353231-0fef-4595-9a9e-871092e13803.jpeg</url>
      <title>Forem: David Roher</title>
      <link>https://forem.com/droher</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/droher"/>
    <language>en</language>
    <item>
      <title>Boxball: Prebuilt baseball stats databases using Docker</title>
      <dc:creator>David Roher</dc:creator>
      <pubDate>Fri, 14 Jun 2019 15:17:40 +0000</pubDate>
      <link>https://forem.com/droher/boxball-prebuilt-baseball-stats-databases-using-docker-154n</link>
      <guid>https://forem.com/droher/boxball-prebuilt-baseball-stats-databases-using-docker-154n</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/droher/boxball"&gt;Boxball&lt;/a&gt; creates prepopulated databases of the two most significant open source baseball datasets: &lt;a href="http://retrosheet.org"&gt;Retrosheet&lt;/a&gt; and the &lt;a href="https://github.com/chadwickbureau/baseballdatabank"&gt;Baseball Databank&lt;/a&gt;. Retrosheet contains information on every major-league pitch since 2000, every play since 1937, every box score since 1906, and every game since 1871. The Databank (based on the &lt;a href="http://www.seanlahman.com/baseball-archive/statistics/"&gt;Lahman Database&lt;/a&gt;) contains yearly summaries for every player and team in history. In addition to the data and databases themselves, Boxball relies on the following tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://docs.docker.com/engine/docker-overview/"&gt;Docker&lt;/a&gt; for repeatable builds and easy distribution&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.sqlalchemy.org/"&gt;SQLAlchemy&lt;/a&gt; for abstracting away DDL differences between databases&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/chadwickbureau/chadwick"&gt;Chadwick&lt;/a&gt; for translating Retrosheet's complex event files into a relational format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Follow the instructions below to install your distribution of choice. The full set of images is also available on Docker Hub.&lt;/p&gt;

&lt;p&gt;The Retrosheet schema is extensively documented in the code; see the source &lt;a href="https://github.com/droher/boxball/blob/master/transform/src/schemas/retrosheet.pyhttps://github.com/droher/boxball/blob/master/transform/src/schemas/retrosheet.py"&gt;here&lt;/a&gt; until I find a prettier solution.&lt;/p&gt;

&lt;p&gt;If you find the project useful, please consider donating to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The &lt;a href="https://aliforneycenter.donordrive.com/index.cfm?fuseaction=donate.general"&gt;Ali Forney Center&lt;/a&gt; for homeless LGBTQ youth&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://act.350.org/donate/build/"&gt;350.org&lt;/a&gt;, a grassroots international climate change organization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feel free to &lt;a href="//mailto:david@boxball.io"&gt;contact me&lt;/a&gt; with questions or comments! &lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://docs.docker.com/install/"&gt;Docker&lt;/a&gt; (v18.06, earlier versions may not work)&lt;/li&gt;
&lt;li&gt;  2-20GB Disk space (depends on distribution choice)&lt;/li&gt;
&lt;li&gt;  500MB-8GB RAM available to Docker (depends on distribution choice)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Distributions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Column-Oriented Databases
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Postgres cstore_fdw
&lt;/h4&gt;

&lt;p&gt;This distribution uses the &lt;a href="https://github.com/citusdata/cstore_fdw"&gt;cstore_fdw&lt;/a&gt; extension to turn PostgreSQL into a column-oriented database. This means that you get the rich featureset of Postgres, but with a huge improvement in speed and disk usage. To install and run the database server:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run --name postgres-cstore-fdw -d -p 5433:5432 -v ~/boxball/postgres-cstore-fdw:/var/lib/postgresql/data doublewick/boxball:postgres-cstore-fdw-0.0.2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Roughly an hour after the image is downloaded, the data will be fully loaded into the database, and you can connect to it on port &lt;code&gt;5433&lt;/code&gt; (either using the &lt;code&gt;psql&lt;/code&gt; command line tool or a database client of your choice). The data will be persisted on your machine in &lt;code&gt;~/boxball/postgres-cstore-fdw&lt;/code&gt; (~1.5GB), which means you can stop/remove the container without having to reload the data when you turn it back on.&lt;/p&gt;

&lt;h4&gt;
  
  
  Clickhouse
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://clickhouse.yandex/"&gt;Clickhouse&lt;/a&gt; is a database developed by Yandex with some very impressive performance benchmarks. It uses less disk space than Postgres cstore_fdw, but significantly more RAM (~5GB). I've yet to run any query performance comparisons. To install and run the database server:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run --name clickhouse -d -p 8123:8123 -v ~/boxball/clickhouse:/var/lib/clickhouse doublewick/boxball:clickhouse-0.0.2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;15-30 minutes after the image is downloaded, the data will be fully loaded into the database, and you can connect to it either by attaching the container and using the &lt;code&gt;clickhouse-client&lt;/code&gt; CLI or by using a local database client on port &lt;code&gt;8123&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Drill
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://drill.apache.org/"&gt;Drill&lt;/a&gt; is a framework that allows for SQL queries directly on files, without having to declare any schema. It is usually used on a computing cluster with massive datasets, but we use a single-node setup. To install and run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run --name drill -id -p 8047:8047 -p 31010:31010 -v ~/boxball/drill:/data doublewick/boxball:drill-0.0.2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Data will be immediately available to query after the image is downloaded. Use port &lt;code&gt;8047&lt;/code&gt; to access the Web UI (which includes a SQL runner) and port &lt;code&gt;31010&lt;/code&gt; to connect via a database client.&lt;br&gt;
You may also attach the container and query from the command line.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional (Row-oriented) Databases
&lt;/h3&gt;

&lt;p&gt;Note: these frameworks are likely to be prohibitively slow when querying play-by-play data, and they take up significantly more disk space than their columnar counterparts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Postgres
&lt;/h4&gt;

&lt;p&gt;Similar configuration to the cstore_fdw extended version above, but stored in the conventional way.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run --name postgres -d -p 5432:5432 -v ~/boxball/postgres:/var/lib/postgresql/data doublewick/boxball:postgres-0.0.2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Roughly 90 minutes after the image is downloaded, the data will be fully loaded into the database, and you can connect to it on port &lt;code&gt;5432&lt;/code&gt;&lt;br&gt;
(either using the &lt;code&gt;psql&lt;/code&gt; command line tool or a database client of your choice).&lt;/p&gt;

&lt;h4&gt;
  
  
  MySQL
&lt;/h4&gt;

&lt;p&gt;To install and run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run --name mysql -d -p 3306:3306 -v ~/boxball/mysql:/var/lib/mysql doublewick/boxball:mysql-0.0.2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Roughly two hours after the image is downloaded, the data will be fully loaded into the database, and you can connect to it on port &lt;code&gt;3306&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  SQLite (with web UI)
&lt;/h4&gt;

&lt;p&gt;To install and run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run --name sqlite -d -p 8080:8080 -v ~/boxball/sqlite:/db doublewick/boxball:sqlite-0.0.2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Roughly two minutes after the image is downloaded, the data will be fully loaded into the database. &lt;code&gt;localhost:8080&lt;/code&gt; will provide a &lt;a href="https://github.com/coleifer/sqlite-web"&gt;web UI&lt;/a&gt; where you can write queries and perform schema exploration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flat File Downloads
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Parquet
&lt;/h4&gt;

&lt;p&gt;Parquet is a columnar data format originally developed for the Hadoop ecosystem. It has solid support in Spark, Pandas, and many other frameworks.&lt;br&gt;
&lt;a href="https://1drv.ms/u/s!AtpEocFNRNBWg1eR5L-U7bupJqyt?e=RbxuMp"&gt;OneDrive&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  CSV
&lt;/h4&gt;

&lt;p&gt;The original CSVs from the extract step (folder stored as &lt;code&gt;.tar.gz&lt;/code&gt;).&lt;br&gt;
&lt;a href="https://1drv.ms/u/s!AtpEocFNRNBWhAb_gwNbBLPB1pDv?e=qyrU3L"&gt;OneDrive&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Acknowledgements
&lt;/h2&gt;

&lt;p&gt;Ted Turocy's &lt;a href="http://chadwick-bureau.com/"&gt;Chadwick Bureau&lt;/a&gt; developed the tools and repos that made this project possible. I am also grateful to &lt;a href="http://www.seanlahman.com/"&gt;Sean Lahman&lt;/a&gt; for creating his database, which I have been using for over 15 years. I was able to develop and host this project for free thanks to the generous open-source plans of CircleCI, Github, and Docker Hub.&lt;/p&gt;

&lt;p&gt;Retrosheet represents the collective effort of thousands of baseball fans over 150 years of scorekeeping and data entry. I hope Boxball facilitates more historical research to continue this tradition.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>docker</category>
      <category>sql</category>
      <category>python</category>
    </item>
  </channel>
</rss>
