Apr 27, 2016
I’m a PhD student in Computer Science at Sorbonne Université and a member of the ALMAnaCH research team at Inria.
We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.
Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot
We propose a new pipeline to filter, clean and classify Common Crawl by language, and we publish the final corpus under the name OSCAR.
Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary