As the DARPA spoken language community moves toward building useful systems for interactive problem solving, we must develop new evaluation metrics that assess whether these systems actually help people solve problems. In this paper, we report on experiments with two new metrics: task completion and logfile evaluation (in which human evaluators judge query correctness). In one experiment, we used two variants of our data collection system (with a human transcriber) to compare an aggressive system using robust parsing to a more cautious "full-parse" system. In a second experiment, we compared a system using the human transcriber to a fully automated system using the speech recognizer. We found clear differences in task completion, time to task completion, and number of correct and incorrect answers. These experiments lead us to conclude that task completion and logfile evaluation are useful metrics for evaluating interactive systems.
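To make the reported measures concrete, the sketch below shows one way the aggregate statistics named above (task completion rate, time to completion, and counts of correct and incorrect answers) could be tallied from per-session evaluator judgments. The `Session` record, field names, and `summarize` function are hypothetical illustrations, not the logfile format or scoring software used in the paper.

```python
# Hypothetical illustration (not the paper's actual tooling): tallying
# task completion, time to completion, and correct/incorrect answer counts
# from human evaluators' per-query judgments recorded in session logs.
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

@dataclass
class Session:
    completed: bool        # did the subject finish the assigned task?
    seconds: float         # elapsed time for the session
    judgments: List[str]   # one evaluator label per query: "correct" / "incorrect"

def summarize(sessions: List[Session]) -> dict:
    completed = [s for s in sessions if s.completed]
    mean_time: Optional[float] = (
        mean(s.seconds for s in completed) if completed else None
    )
    return {
        "task_completion_rate": len(completed) / len(sessions),
        "mean_time_to_completion": mean_time,
        "correct_answers": sum(s.judgments.count("correct") for s in sessions),
        "incorrect_answers": sum(s.judgments.count("incorrect") for s in sessions),
    }

if __name__ == "__main__":
    # Toy data: two sessions, one completed and one abandoned.
    print(summarize([
        Session(True, 240.0, ["correct", "correct", "incorrect"]),
        Session(False, 600.0, ["incorrect", "correct"]),
    ]))
```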