I work in Natural Language Processing, with a particular interest in commonsense reasoning. My PhD research focuses on ways to diagnose current systems' reasoning capabilities beyond what a single performance score can tell us.
I recently built a crowdsourced, semi-structured explanation dataset for a commonsense reasoning benchmark; it can be used to train explanation-generating models or to compare model explanations against existing knowledge bases. Available here!
I am now working on automatically evaluating generated explanations in the predict-and-explain (or explain-and-predict) paradigm: a benchmark for training evaluator models to score explanations according to explanation-specific criteria.