New paper on cognitive evaluation for instruction generation agents. TL;DR: they need better theory-of-mind capabilities.