Laughter is a common paralinguistic vocalization that has been shown to be used for controlling the flow of a conversation, nullifying previous statements, and managing conversations on delicate topics. Already there have been concerted efforts to develop methods for automatically detecting laughter in speech. Many of these studies use artificial datasets and report their model performance using the AUC metric. This paper replicates previous work on laughter detection on those artificial datasets and then extends them by validating the methods on a larger and more naturalistic dataset made up of 60 spontaneous conversations (120 speakers and roughly 12 hours of material in total) with the best performing model achieving an AUC of 90.39\5 +/- 1.10 (precision=13.99 +/- 4.09, recall=76.36 +/- 12.00, F1=23.06 +/- 4.99). The paper then goes on to discuss the shortcomings with the current standard comparison metric in the field of AUC and suggests alternatives which may aid in the comparison and understanding of method's effectiveness.