Submitted by David Michael R... on Wed, 10/25/2017 - 11:40

Citation

Abstract: This dataset contains processed text from the bound and daily editions of the United States Congressional Record, as provided by HeinOnline. The bound edition covers the 43rd to 111th Congresses, and the daily edition covers the 97th to 114th. Each edition includes all text spoken on the floor of each chamber of Congress: the United States House of Representatives and the United States Senate. An automated script parses the text from each session to produce full-text speeches, metadata on speeches and their speakers, and counts of two-word phrases (bigrams) by speaker and party. Text is aggregated over sessions to flag bigrams that relate to congressional procedure or are extremely common or rare. The results of a manual audit of the script and statistics on our rate of matching speeches with members of Congress are included as well.

Principal Investigator: Matthew Gentzkow, Jesse M. Shapiro, Matt Taddy

How to Cite this Dataset: Gentzkow, Matthew, Jesse M. Shapiro, and Matt Taddy. Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text

Contact Email: email@example.com

Description

Acknowledgements: We thank HeinOnline for providing scans of the Congressional Record and allowing the public release of this dataset. We acknowledge funding from the Initiative on Global Markets and the Stigler Center at Chicago Booth, the National Science Foundation, the Brown University Population Studies and Training Center, and the Stanford Institute for Economic Policy Research (SIEPR). This work was completed in part with resources provided by the University of Chicago Research Computing Center. We thank our many dedicated research assistants for their contributions to this project.
The relevant funding agencies bear no responsibility for use of the data or for interpretations or inferences based upon such uses.

Methodology/Sampling

Universe: All speeches recorded in the House and Senate from the bound and daily editions of the US Congressional Record
Type of data collection: Automated parsing of OCR scans from printed volumes of the US Congressional Record
Time span: 1873–2017 (43rd–114th US Congress)
Time of data collection: 1873–2011 (bound), 1981–2017 (daily)
Geographic coverage: US states and territories with a delegate in Congress

Documentation

Web site or document download link(s):
codebook_v3.pdf
Open Data Commons Attribution License (ODC-By) v1.0.pdf

Data Download Link(s)

To view data file link(s), please agree to the following conditions:

You are free:
To Share: To copy, distribute and use the data.
To Create: To produce works from the data.
To Adapt: To modify, transform and build upon the data.

As long as you:
Attribute: You must attribute any public use of the data, or works produced from the data, in the manner specified in the license. For any use or redistribution of the data, or works produced from it, you must make clear to others the license of the data and keep intact any notices on the original data.

Read the full ODC-BY 1.0 license text for the exact terms that apply.

Data file link(s):
audit.zip
hein-bound.zip
hein-daily.zip
speakermap_stats.zip
vocabulary.zip
phrase_clusters.zip
phrase_partisanship.zip

Notes

Data Notes:
February 11, 2019: We have released an additional dataset of phrases manually classified into substantive topics. The documentation has been updated and saved as codebook_v2.pdf.
February 20, 2019: We have released an additional dataset of the most partisan phrases from each session of Congress.

Bibliography

Bibliography: Gentzkow, Matthew, Jesse M. Shapiro, and Matthew Taddy. Measuring Group Differences in High Dimensional Choices: Method and Application to Congressional Speech. Econometrica, 87(4), pp. 1307-1340.