In machine learning, I believe there is one overarching premise: there is never enough data. Despite all the hype around big data, labeled data is especially scarce in NLP, and annotation quality is very hard to control. Under these circumstances, data augmentation is essential, and it matters greatly for a model's robustness and generalization.
Different NLP subfields each have their own task-specific data augmentation methods.
Task-independent data augmentation for NLP
However, in NLP, data augmentation is not widely used. In my mind, this is for two reasons:
There are a few research directions that would be interesting to pursue:
Data augmentation with style transfer: Investigate if style transfer can be used to modify various attributes of training examples for more robust learning.
Learn the augmentation: Similar to Dong et al. (2017), we could learn either to paraphrase or to generate transformations for a particular task.
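Before learning the augmentation, it helps to see what even a trivial, task-independent baseline looks like. The sketch below is my own minimal illustration (not from any particular paper): two simple token-level perturbations, random swap and random deletion, of the kind used as cheap NLP augmentation baselines. Function names and parameters here are hypothetical choices for the example.

```python
import random

def random_swap(tokens, n_swaps=1, seed=None):
    """Return a copy of `tokens` with `n_swaps` random position pairs swapped.

    The multiset of tokens is preserved; only word order changes.
    """
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n_swaps):
        i = rng.randrange(len(out))
        j = rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def random_delete(tokens, p=0.1, seed=None):
    """Drop each token independently with probability `p`, keeping at least one."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    # Never return an empty sentence: fall back to one random token.
    return kept if kept else [rng.choice(tokens)]

sentence = "the quick brown fox jumps over the lazy dog".split()
swapped = random_swap(sentence, n_swaps=2, seed=0)
shortened = random_delete(sentence, p=0.3, seed=1)
```

Such surface-level perturbations are exactly what the learned approaches above try to improve on: a random swap can easily change or destroy the meaning of a sentence, whereas a learned paraphraser should preserve it.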
Tutorial
(To be continued...)