DATASURFING ON THE WORLD WIDE WEB - Part 2

Robin H. Lock
Department of Mathematics, Computer Science, and Statistics
St. Lawrence University
Canton, NY 13617
rlock@stlawu.edu

Outline for a talk at the 2016 Joint Statistical Meetings


ABSTRACT: This is a continuation of a presentation from the last JSM in Chicago (1996). At that time we looked at web sources for students and instructors to obtain real data for use in projects and class examples. What’s changed in this regard over the past 20 years? Where are some places to go now to get easy access to useful data? What new challenges have emerged for obtaining data from the ever-expanding web?


CATEGORIES OF DATA SOURCES

  • Dataset Archives with Teaching Support
  • Pages of Data Links
  • Government Sources
  • R Packages
  • Data from Visualizations
  • More Data for Countries
  • Survey/Study Repositories
  • Fun and Games
  • Data Scraping

  • Dataset Archives with Teaching Support

  • Journal of Statistics Education Data Archive More than 100 datasets and documentation contributed by statistics teachers for classroom use. At least 80 of these datasets are tied to longer JSE articles discussing their use in statistics classes. Jenny Baglivo has made a quick summary of some of her favorites from this collection.
  • DASL - Data and Story Library A collection of datasets and related documentation (stories) which may be searched by data subject and/or statistical technique. Thanks to Paul Velleman and DataDesk for taking over hosting of this project.
  • ICPSR Data-Driven Learning Guides 50+ topics linked to political and social research survey data.
  • TSHS Resources Portal A new collection of resources started by the ASA's Section on Teaching Statistics in the Health Sciences. A limited number of datasets at this point, but they are just getting started and have good support for using the data in class.

  • Pages of Data Links

  • Winner's Miscellaneous Datasets Lots of links (data and documentation) maintained by Larry Winner at Univ. of Florida, organized by statistical technique.
  • Kuiper's Sources of Data Links to data (and other useful teaching resources) maintained by Shonda Kuiper at Grinnell College.
  • Awesome Public Datasets A very large list of links to public data organized by subject area (Sammy Chen). May take some digging to get to actual data.
  • Big Data: 33 Brilliant And Free Data Sources For 2016 Article by Bernard Marr in Forbes. An earlier list with 20 sources is at The Big Data Guru.

  • Government Sources

  • Data.gov "The home of the U.S. Government's open data." Searchable links to hundreds of thousands of datasets. Search for "College Scorecard" to get within a click of downloading a .csv file with information on almost a hundred variables for more than 7000 colleges and universities.
  • Canada Open Data Portal Similar site with searchable links for Canadian data.
  • OpenDataSoft List List of open data portals from around the world, organized by country.

  • R Packages

    Several R packages with good data for teaching (requires R to get the data) include ...

  • Mosaic A collection of data sets from the Mosaic package developed by Randall Pruim, Daniel Kaplan, and Nicholas Horton.
  • Lock5Data Datasets from the textbook "Statistics: Unlocking the Power of Data" by Lock^5 (Wiley); datasets also available at lock5stat.com
  • Stat2Data Datasets from the textbook "Stat2: Models for a World of Data" by Cannon, et al. (Freeman).

    and you don't always even need to use R...

  • Rdatasets A collection of data sets from various R packages (e.g. datasets, car, Ecdat, MASS, HistData, survival, ...) maintained by Vincent Arel-Bundock. Current list has 758 datasets from more than 30 R packages with links to the data as .csv files and documentation (without needing R). Find a link to the R script for doing this at the Rdatasets GitHub page.

  • Data from Visualizations

  • Gapminder Country Data Download the data on countries that drive the neat interactive displays at Hans Rosling's Gapminder World.

  • More Data for Countries

  • World Bank Open Data Search by individual countries, general categories, or specific indicators.
  • CIA Factbook Lots of country-level data, but trickier to get it in downloadable format. Look for "Country Comparisons". Variables there have a "Download Data" link, but countries are ordered by that particular variable.
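
    Since each Country Comparisons download is sorted by its own variable, combining two or more variables means matching rows by country name rather than by position. A minimal sketch of that merge in Python follows. The column labels "name" and "value" are assumptions about the file layout, the function names are hypothetical, and the countries and numbers are placeholders rather than Factbook values.

```python
import csv
import io


def read_ranked(text):
    """Read one country-comparison file into {country: value}.

    Assumes a header row with 'name' and 'value' columns; the real
    download may label its columns differently.
    """
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Strip thousands separators such as "2,000" before converting.
        table[row["name"].strip()] = float(row["value"].replace(",", ""))
    return table


def merge_by_country(**tables):
    """Combine several {country: value} tables into one row per country.

    Only countries present in every table are kept (an inner join).
    """
    common = set.intersection(*(set(t) for t in tables.values()))
    return [
        {"country": c, **{var: t[c] for var, t in tables.items()}}
        for c in sorted(common)
    ]


# Placeholder countries and numbers, each file in a different order.
AREA = 'name,value\nCountry B,"2,000"\nCountry A,500\nCountry C,100\n'
POP = "name,value\nCountry A,90\nCountry C,40\nCountry B,10\n"

merged = merge_by_country(area=read_ranked(AREA), population=read_ranked(POP))
for row in merged:
    print(row)
```

    Keeping only countries that appear in every file avoids silently pairing a value with the wrong country. Country names that are spelled differently across files would need to be reconciled by hand first.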

  • Survey/Study Repositories

  • ICPSR Inter-university Consortium for Political and Social Research.
  • Dryad Digital Repository Seeks to promote the availability of data underlying findings in the scientific literature for research and educational reuse. Houses data for lots of scientific journals.

  • Fun and Games

    SPORTS:
  • Baseball-reference.com Major League Baseball (MLB)
  • Basketball-reference.com National Basketball Association (NBA)
  • Pro-football-reference.com National Football League (NFL)
  • Hockey-reference.com National Hockey League (NHL)

    OR get all of the above, plus college basketball, college football, and Olympics at Sports-reference.com

    GAMES:
  • Shonda Kuiper's Stat2Lab Games Several games (e.g. tangrams, memorathon, shapesplosion, ...) that allow students to design experiments or sampling schemes, record data, and access stored data.

  • Data Scraping

    Several useful R packages:
  • rvest Try the tutorial by Justin Law and Jordan Rosenblum.
  • httr See the quick start guide.
  • Example: A shiny app to scrape IMDb ratings for all episodes of a chosen TV show created by Ivan Ramler and Tenzin Choeyang.
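
    The packages above do this work in R. For readers without R, the core idea of scraping (take a page's HTML and pull the cells out of a table) can be sketched with Python's standard library alone. The class and function names below are hypothetical, and the page fragment, including its episode ratings, is invented for illustration and not taken from IMDb.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table cells found in an HTML string as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented page fragment; the ratings are placeholders.
PAGE = """
<table>
  <tr><th>Episode</th><th>Rating</th></tr>
  <tr><td>1</td><td><b>8.1</b></td></tr>
  <tr><td>2</td><td>7.9</td></tr>
</table>
"""
table = scrape_table(PAGE)
print(table)
```

    For a live page, the HTML string would come from a download step (for example urllib.request.urlopen) instead of the PAGE constant. Real pages are often messier than this fragment, which is one reason dedicated tools such as rvest are worth learning, and a site's terms of use should be checked before scraping it.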