This guide shows how to perform NER tagging for English and non-English languages with NLTK and the Stanford NER tagger (Python). You can also use it to improve the Stanford NER tagger.
A short introduction to Named Entity Recognition
First and foremost, a few explanations: Natural Language Processing (NLP) is a field of machine learning that seeks to understand human languages. It’s one of the most difficult challenges Artificial Intelligence has to face. NLP covers several problems, from speech recognition and language generation to information extraction.
NLP provides specific tools to help programmers extract pieces of information from a given corpus. Here is a short list of the most common tasks: tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition.
NLTK (Natural Language Toolkit) is a wonderful Python package that provides a set of natural language corpora and APIs to an impressive diversity of NLP algorithms. It’s easy to use, complete, and well documented. Of course, it’s free, open-source, and community-driven.
Let’s dive into Named Entity Recognition (NER). NER is about locating and classifying named entities in texts in order to recognize places, people, dates, values, and organizations. As an example:
Twenty miles east of Reno, Nev., where packs of wild mustangs roam free through the parched landscape, Tesla Gigafactory 1 sprawls near Interstate 80. […] The Gigafactory, whose construction began in June 2014, is not only outrageously large but also on its way to becoming the biggest manufacturing plant on earth. Now 30 percent complete, its square footage already equals about 35 Costco stores. […] (NY Times, November 2017)
This guide will show you how to implement NER tagging for non-English languages using NLTK. Enjoy reading!
A step-by-step guide to non-English NER with NLTK
At Sicara, I recently had to build algorithms to extract names and organizations from a French corpus. Since NLTK ships with the efficient Stanford Named Entity tagger, I thought it would do the work for me, out of the box.
But I was wrong: I forgot my corpus was in French, and the Stanford NER tagger is designed for English only.
The only way to get it done is to train your own NER model. Use cases:
- you are working with a non-English corpus (French, German, Dutch…);
- you want to improve the Stanford English model.
I hope this step-by-step guide will help you.
Step 1: Implementing NER with Stanford NER / NLTK
Let’s start!
Because the Stanford NER tagger is written in Java, you need a proper Java Virtual Machine installed on your computer.
To do so, install Java JRE 8 or higher. You can install the Java JDK (developer kit) instead, since it contains the JRE. Linux users will find all the needed information in this guide: How To Install Java with Apt-Get on Ubuntu 16.04. Other users, please have a look at the official Java documentation.
Once installed, make sure your $JAVA_HOME environment variable is set:
echo $JAVA_HOME
Mine is /usr/lib/jvm/java-8-oracle. That’s it for Java!
If you haven’t done it yet, create a virtual environment to work on:
mkvirtualenv .venv-ner --python=/usr/bin/python3
workon .venv-ner
Install NLTK:
pip install nltk
Get the Stanford NER tagger: download the stanford-ner-xxxx-xx-xx.zip file from the ‘Download’ section of the Stanford NLP website. Unzip it, then move the tagger jar (stanford-ner.jar, renamed here to ner-tagger.jar) and the gzipped English model english.all.3class.distsim.crf.ser.gz to your application folder:
cd /home/charles/Downloads/
unzip stanford-ner-2017-06-09.zip
mv stanford-ner-2017-06-09/stanford-ner.jar {yourAppFolder}/stanford-ner-tagger/ner-tagger.jar
mv stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz {yourAppFolder}/stanford-ner-tagger/ner-model-english.ser.gz
We now have two files in our stanford-ner-tagger folder:
- ner-tagger.jar: the NER tagger engine itself;
- ner-model-english.ser.gz: a NER model trained on an English corpus.
Copy the following ner_english.py script to perform English Named Entity Recognition:
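The script was embedded as a gist in the original post; here is a minimal sketch of what it can look like, using NLTK’s StanfordNERTagger wrapper (the paths follow the folder layout above, and the jar-existence check is only a convenience so the sketch is a no-op when the files are absent):

```python
# ner_english.py -- tag an English sentence with the Stanford NER model.
import os

# Paths follow the folder layout described above; adapt them to your setup.
MODEL = 'stanford-ner-tagger/ner-model-english.ser.gz'
JAR = 'stanford-ner-tagger/ner-tagger.jar'


def tag_text(text):
    """Tokenize `text` and run the Stanford NER tagger over the tokens."""
    # Imported here so the module loads even where NLTK is not installed.
    from nltk.tag.stanford import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    tagger = StanfordNERTagger(MODEL, JAR, encoding='utf-8')
    return tagger.tag(word_tokenize(text))


if __name__ == '__main__' and os.path.exists(JAR):
    article = ('Twenty miles east of Reno, Nev., where packs of wild '
               'mustangs roam free through the parched landscape, '
               'Tesla Gigafactory 1 sprawls near Interstate 80.')
    print(tag_text(article))
```
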
Run it:
python ner_english.py
Output should be:
[('Twenty', 'O'), ('miles', 'O'), ('east', 'O'), ('of', 'O'), ('Reno', 'ORGANIZATION'), (',', 'O'), ('Nev.', 'LOCATION'), (',', 'O'), ('where', 'O'), ('packs', 'O'), ('of', 'O'), ('wild', 'O'), ('mustangs', 'O'), ('roam', 'O'), ('free', 'O'), ('through', 'O'), ('the', 'O'), ('parched', 'O'), ('landscape', 'O'), (',', 'O'), ('Tesla', 'ORGANIZATION'), ('Gigafactory', 'ORGANIZATION'), ('1', 'ORGANIZATION'), ('sprawls', 'O'), ('near', 'O'), ('Interstate', 'LOCATION'), ('80', 'LOCATION'), ('.', 'O'), ('The', 'O'), ('Gigafactory', 'O'), (',', 'O'), ('whose', 'O'), ('construction', 'O'), ('began', 'O'), ('in', 'O'), ('June', 'DATE'), ('2014', 'DATE'), (',', 'O'), ('is', 'O'), ('not', 'O'), ('only', 'O'), ('outrageously', 'O'), ('large', 'O'), ('but', 'O'), ('also', 'O'), ('on', 'O'), ('its', 'O'), ('way', 'O'), ('to', 'O'), ('becoming', 'O'), ('the', 'O'), ('biggest', 'O'), ('manufacturing', 'O'), ('plant', 'O'), ('on', 'O'), ('earth', 'O'), ('.', 'O'), ('Now', 'O'), ('30', 'PERCENT'), ('percent', 'PERCENT'), ('complete', 'O'), (',', 'O'), ('its', 'O'), ('square', 'O'), ('footage', 'O'), ('already', 'O'), ('equals', 'O'), ('about', 'O'), ('35', 'O'), ('Costco', 'ORGANIZATION'), ('stores', 'O'), ('.', 'O')]
Not bad at all! However, it is not perfect:
- it does not detect all values, but these can easily be extracted using regular expressions;
- it does not detect all named entities; if you want to go further, you will have to train a more complete (or dataset-specific) model.
Step 2: Training our own (French) model
Now, you know how to run NER on an English corpus. What about other languages like French?
You need to train your own model. To do so, create a dummy-french-corpus.tsv file in {yourAppFolder}/stanford-ner-tagger/train with the following syntax:
En O
2017 DATE
, O
Une O
intelligence O
artificielle O
est O
en O
mesure O
de O
développer O
par O
elle-même O
Super PERSON
Mario PERSON
Bros PERSON
. O
Sans O
avoir O
eu O
accès O
au O
code O
du O
jeu O
, O
elle O
a O
récrée O
ce O
hit O
des O
consoles O
Nintendo ORGANIZATION
. O
Des O
chercheurs O
de O
l'Institut ORGANIZATION
de ORGANIZATION
Technologie ORGANIZATION
de O
Géorgie LOCATION
, O
aux O
Etats-Unis LOCATION
, O
viennent O
de O
la O
mettre O
à O
l'épreuve O
. O
Create a prop.txt file in the same folder too:
trainFile = train/dummy-french-corpus.tsv
serializeTo = dummy-ner-model-french.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
Train it:
cd stanford-ner-tagger/
java -cp ner-tagger.jar -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop train/prop.txt
This should output a dummy-ner-model-french.ser.gz file. Create a new ner_french.py script to use it:
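As with the English version, the script was originally embedded as a gist; a minimal sketch, pointing at the freshly serialized French model (paths are assumptions based on the layout above, and the existence check just makes the sketch a no-op when the files are absent):

```python
# ner_french.py -- tag a French sentence with the freshly trained model.
import os

MODEL = 'stanford-ner-tagger/dummy-ner-model-french.ser.gz'
JAR = 'stanford-ner-tagger/ner-tagger.jar'


def tag_text(text):
    """Tokenize `text` and run the Stanford NER tagger over the tokens."""
    # Imported here so the module loads even where NLTK is not installed.
    from nltk.tag.stanford import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    tagger = StanfordNERTagger(MODEL, JAR, encoding='utf-8')
    return tagger.tag(word_tokenize(text))


if __name__ == '__main__' and os.path.exists(JAR):
    article = ('En 2017, une intelligence artificielle est en mesure de '
               'développer par elle-même Super Mario Bros.')
    print(tag_text(article))
```
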
Run it:
python ner_french.py
The output seems to be right:
[('En', 'O'), ('2017', 'DATE'), (',', 'O'), ('une', 'O'), ('intelligence', 'O'), ('artificielle', 'O'), ('est', 'O'), ('en', 'O'), ('mesure', 'O'), ('de', 'O'), ('développer', 'O'), ('par', 'O'), ('elle-même', 'O'), ('Super', 'PERSON'), ('Mario', 'PERSON'), ('Bros.', 'O'), ('Sans', 'O'), ('avoir', 'O'), ('eu', 'O'), ('accès', 'O'), ('au', 'O'), ('code', 'O'), ('du', 'O'), ('jeu', 'O'), (',', 'O'), ('elle', 'O'), ('a', 'O'), ('récrée', 'O'), ('ce', 'O'), ('hit', 'O'), ('des', 'O'), ('consoles', 'O'), ('Nintendo', 'ORGANIZATION'), ('.', 'O'), ('Des', 'O'), ('chercheurs', 'O'), ('de', 'O'), ("l'Institut", 'ORGANIZATION'), ('de', 'ORGANIZATION'), ('Technologie', 'ORGANIZATION'), ('de', 'O'), ('Géorgie', 'LOCATION'), (',', 'O'), ('aux', 'O'), ('Etats-Unis', 'LOCATION'), (',', 'O'), ('viennent', 'O'), ('de', 'O'), ('la', 'O'), ('mettre', 'O'), ('à', 'O'), ("l'épreuve", 'O'), ('.', 'O')]
Congratulations, your model is trained! Of course, since the corpus we trained it on is ridiculously small, it won’t succeed on a different text.
As you can see, none of the named entities have been caught:
[('La', 'O'), ('première', 'O'), ('Falcon', 'O'), ('Heavy', 'O'), ('de', 'O'), ("l'entreprise", 'O'), ('SpaceX', 'O'), (',', 'O'), ('la', 'O'), ('plus', 'O'), ('puissante', 'O'), ('fusée', 'O'), ('américaine', 'O'), ('jamais', 'O'), ('lancée', 'O'), ('depuis', 'O'), ('plus', 'O'), ('de', 'O'), ('quarante', 'O'), ('ans', 'O'), (',', 'O'), ('devrait', 'O'), ('bien', 'O'), ('emporter', 'O'), ('le', 'O'), ('roadster', 'O'), ('de', 'O'), ("l'entrepreneur", 'O'), ('américain', 'O'), (',', 'O'), ('mais', 'O'), ('sur', 'O'), ('une', 'O'), ('orbite', 'O'), ('bien', 'O'), ('différente', 'O'), ('.', 'O'), ('Elon', 'O'), ('Musk', 'O'), ('a', 'O'), ('le', 'O'), ('sens', 'O'), ('du', 'O'), ('spectacle', 'O'), ('.', 'O')]
You will need a bigger dataset to train on.
Step 3: Performing NER on French article
Two solutions:
- you face a custom use case (you have specialized vocabulary or you need high accuracy), and you write your own corpus.tsv file by labeling a big corpus yourself;
- you want to perform regular NER, and you use an existing labeled corpus.
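If you go the first route, the training file format is easy to generate; here is a small sketch of a helper (the function name is an assumption, not from the original post) that writes the two-column layout used in Step 2:

```python
import os
import tempfile


def write_tsv(tokens_with_labels, path):
    """Write one 'token<TAB>label' pair per line, as CRFClassifier expects."""
    with open(path, 'w', encoding='utf-8') as f:
        for token, label in tokens_with_labels:
            f.write(f'{token}\t{label}\n')


# Example: three labeled tokens written to a temporary file.
tmp = os.path.join(tempfile.gettempdir(), 'dummy-french-corpus.tsv')
write_tsv([('En', 'O'), ('2017', 'DATE'), (',', 'O')], tmp)
print(open(tmp, encoding='utf-8').read())
```
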
I have found this nice dataset (FR, DE, NL) that you can use: https://github.com/EuropeanaNewspapers/ner-corpora
Download the enp_FR.bnf.bio file into your train folder. In your prop.txt file, set trainFile = train/enp_FR.bnf.bio and serializeTo = ner-model-french.ser.gz, then train your model again (this may take 10 minutes or more):
cd stanford-ner-tagger/
java -cp ner-tagger.jar -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop train/prop.txt
Run ner_french.py
again:
[('La', 'O'), ('première', 'O'), ('Falcon', 'I-PER'), ('Heavy', 'I-PER'), ('de', 'O'), ("l'entreprise", 'O'), ('SpaceX', 'O'), (',', 'O'), ('la', 'O'), ('plus', 'O'), ('puissante', 'O'), ('fusée', 'O'), ('des', 'O'), ('Etats-Unis', 'I-LOC'), ('jamais', 'O'), ('lancée', 'O'), ('depuis', 'O'), ('plus', 'O'), ('de', 'O'), ('quarante', 'O'), ('ans', 'O'), (',', 'O'), ('devrait', 'O'), ('bien', 'O'), ('emporter', 'O'), ('le', 'O'), ('roadster', 'O'), ('de', 'O'), ("l'entrepreneur", 'O'), ('américain', 'O'), (',', 'O'), ('mais', 'O'), ('sur', 'O'), ('une', 'O'), ('orbite', 'O'), ('bien', 'O'), ('différente', 'O'), ('.', 'O'), ('Elon', 'I-PER'), ('Musk', 'I-PER'), ('a', 'O'), ('le', 'O'), ('sens', 'O'), ('du', 'O'), ('spectacle', 'O'), ('.', 'O')]
Now it looks better, though still not perfect!
Note: the output shows ‘I-PER’ instead of ‘PERSON’. The tag names depend on how your training corpus is labeled.
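If you prefer the plain labels from Step 1, you can map the BIO-style tags after tagging; a small sketch (the mapping covers only the tags seen in this corpus and is an assumption on my part):

```python
# Map BIO-style labels back to the plain labels used by the English model.
BIO_TO_PLAIN = {
    'I-PER': 'PERSON', 'B-PER': 'PERSON',
    'I-LOC': 'LOCATION', 'B-LOC': 'LOCATION',
    'I-ORG': 'ORGANIZATION', 'B-ORG': 'ORGANIZATION',
}


def normalize(tagged):
    """Rewrite (word, tag) pairs, leaving unknown tags (e.g. 'O') untouched."""
    return [(word, BIO_TO_PLAIN.get(tag, tag)) for word, tag in tagged]


print(normalize([('Elon', 'I-PER'), ('Musk', 'I-PER'), ('a', 'O')]))
# -> [('Elon', 'PERSON'), ('Musk', 'PERSON'), ('a', 'O')]
```
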
Conclusions
After a few hours on the Internet looking for tools or packages that could handle French NER tagging, I had to resign myself: the only software I found was FreeLing, which seems great but is written in C++ and rather hard to install.
Neither NLTK, spaCy, nor SciPy handles French NER tagging out of the box. Fortunately, you can train models for new languages, but the respective documentations are really light on that point.
Useful Links
- FreeLing: an NLP tool written in C++ that works for many languages, including English, French, German, Spanish, Russian, Italian, and Norwegian;
- spaCy: a really good NLP Python package with nice documentation. Here is a link on adding a new language to spaCy;
- NLTK (Natural Language Toolkit): a wonderful Python package that provides a set of natural language corpora and APIs to an impressive diversity of NLP algorithms;
- Stanford NER tagger: the NER tagger used in this tutorial, open-sourced by Stanford engineers and usable with NLTK.
Thanks to Flavian Hautbois and Pierre-Henri Cumenge.
If you are looking for machine learning experts, don’t hesitate to contact us!