
Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts



This dataset contains processed text from the bound and daily editions of the United States Congressional Record, as provided by HeinOnline. The bound edition covers the 43rd to 111th Congresses, and the daily edition covers the 97th to 114th. Each edition includes all text spoken on the floor of each chamber of Congress: the United States House of Representatives and the United States Senate. An automated script parses the text from each session to produce full-text speeches, metadata on speeches and their speakers, and counts of two-word phrases (bigrams) by speaker and party. Text is aggregated over sessions to flag bigrams that relate to congressional procedure or are extremely common or rare. The results of a manual audit of the script and statistics on our rate of matching speeches with members of Congress are included as well.
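To make the phrase counts concrete, here is a minimal sketch of counting bigrams by speaker. It is not the authors' parsing script: the tokenization rule (lowercased runs of letters), the function names, and the toy speeches are all assumptions made for illustration, and the released counts may reflect additional processing described in the codebook.

```python
import re
from collections import Counter, defaultdict


def bigrams(text):
    """Return adjacent word pairs from text (lowercased, letters only).

    This tokenization rule is an assumption for the sketch, not the
    rule used to build the released dataset.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return list(zip(words, words[1:]))


def count_bigrams_by_speaker(speeches):
    """Tally bigrams per speaker.

    speeches: iterable of (speaker_id, text) pairs.
    Returns a dict mapping speaker_id to a Counter of bigram tuples.
    """
    counts = defaultdict(Counter)
    for speaker_id, text in speeches:
        counts[speaker_id].update(bigrams(text))
    return counts


# Hypothetical toy input; identifiers and text are made up.
speeches = [
    ("S1", "I yield the floor."),
    ("S2", "The tax bill cuts the tax rate."),
    ("S1", "I yield back the balance of my time."),
]
counts = count_bigrams_by_speaker(speeches)
print(counts["S1"].most_common(1))
```

Counts by party would follow the same pattern, keyed on a party label instead of a speaker identifier.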

Principal Investigators:

Matthew Gentzkow
Jesse M. Shapiro
Matt Taddy

How to Cite this Dataset: 

Gentzkow, Matthew, Jesse M. Shapiro, and Matt Taddy. Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16.

Contact Email: 




We thank HeinOnline for providing scans of the Congressional Record and allowing the public release of this dataset. We acknowledge funding from the Initiative on Global Markets and the Stigler Center at Chicago Booth, the National Science Foundation, the Brown University Population Studies and Training Center, and the Stanford Institute for Economic Policy Research (SIEPR). This work was completed in part with resources provided by the University of Chicago Research Computing Center. We thank our many dedicated research assistants for their contributions to this project. The relevant funding agencies bear no responsibility for use of the data or for interpretations or inferences based upon such uses.



All speeches recorded in the House and Senate from the bound and daily editions of the US Congressional Record

Type of data collection: 

Automated parsing of OCR scans from printed volumes of the US Congressional Record

Time span: 

1873–2017 (43rd–114th US Congress)

Time of data collection: 

1873–2011 (bound), 1981–2017 (daily)

Geographic coverage: 

US states and territories with a delegate in Congress



Data Use Agreement

You are free:

  • To Create: To produce works from the data.

  • To Adapt: To modify, transform and build upon the data.

As long as you:

  • Attribute: You must attribute any public use of the data, or works produced from the data, in the manner specified in the license. For any use or redistribution of the data, or works produced from it, you must make clear to others the license of the data and keep intact any notices on the original data.

Read the full ODC-BY 1.0 license text for the exact terms that apply.

Data Download Links

Data Notes

February 11, 2019: We have released an additional dataset of phrases manually classified into substantive topics. The documentation has been updated and saved as codebook_v2.pdf.

February 20, 2020: We have released an additional dataset of the most partisan phrases from each session of Congress.

April 14, 2020: We have released an additional dataset of the names of the political parties in each session of Congress.