
Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts

Citation
Abstract: 

This dataset contains processed text from the bound and daily editions of the United States Congressional Record, as provided by HeinOnline. The bound edition covers the 43rd to 111th Congresses, and the daily edition covers the 97th to 114th. Each edition includes all text spoken on the floor of each chamber of Congress: the United States House of Representatives and the United States Senate. An automated script parses the text from each session to produce full-text speeches, metadata on speeches and their speakers, and counts of two-word phrases (bigrams) by speaker and party. Text is aggregated over sessions to flag bigrams that relate to congressional procedure or are extremely common or rare. The results of a manual audit of the script and statistics on our rate of matching speeches with members of Congress are included as well.
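The bigram counting described above can be illustrated with a minimal sketch. This is not the authors' parsing script: the tokenizer (lowercase alphabetic runs), the function names, and the sample speeches are all assumptions made for illustration, and no stemming, stopword removal, or procedural-phrase filtering is applied.

```python
import re
from collections import Counter, defaultdict


def bigrams(text):
    """Return the list of adjacent word pairs in text (hypothetical tokenizer)."""
    # Assumption: a token is any run of ASCII letters, lowercased.
    tokens = re.findall(r"[a-z]+", text.lower())
    return list(zip(tokens, tokens[1:]))


def count_bigrams_by_speaker(speeches):
    """Aggregate bigram counts per speaker from (speaker_id, text) pairs."""
    counts = defaultdict(Counter)
    for speaker, text in speeches:
        counts[speaker].update(bigrams(text))
    return counts


# Invented sample data, not drawn from the dataset.
speeches = [
    ("speaker_a", "I yield the floor."),
    ("speaker_b", "The tax cut helps the middle class."),
    ("speaker_a", "I yield back the balance of my time."),
]
counts = count_bigrams_by_speaker(speeches)
```

Summing the per-speaker counters over all speakers of one party would give party-level counts in the same way.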

Principal Investigator: 
Matthew Gentzkow
Jesse M. Shapiro
Matt Taddy
How to Cite this Dataset: 

Gentzkow, Matthew, Jesse M. Shapiro, and Matt Taddy. Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text

Contact Email: 
Description
Acknowledgements: 

We thank HeinOnline for providing scans of the Congressional Record and allowing the public release of this dataset. We acknowledge funding from the Initiative on Global Markets and the Stigler Center at Chicago Booth, the National Science Foundation, the Brown University Population Studies and Training Center, and the Stanford Institute for Economic Policy Research (SIEPR). This work was completed in part with resources provided by the University of Chicago Research Computing Center. We thank our many dedicated research assistants for their contributions to this project. The relevant funding agencies bear no responsibility for use of the data or for interpretations or inferences based upon such uses.

Methodology/Sampling
Universe: 
All speeches recorded in the House and Senate from the bound and daily editions of the US Congressional Record
Type of data collection: 
Automated parsing of OCR scans from printed volumes of the US Congressional Record
Time span: 
1873–2017 (43rd–114th US Congress)
Time of data collection: 
1873–2011 (bound), 1981–2017 (daily)
Geographic coverage: 
US states and territories with a delegate in Congress
Documentation
Data Download Link(s)
To view data file link(s), please agree to the following conditions: 
  • To Create: To produce works from the data.
  • To Adapt: To modify, transform and build upon the data.

As long as you:

  • Attribute: You must attribute any public use of the data, or works produced from the data, in the manner specified in the license. For any use or redistribution of the data, or works produced from it, you must make clear to others the license of the data and keep intact any notices on the original data.

Read the full ODC-BY 1.0 license text for the exact terms that apply.
