
Lunch Time Python¶

25.11.2022: spaCy¶


spaCy is an open-source natural language processing library written in Python and Cython.

spaCy focuses on production usage and is very fast and efficient. It also supports deep learning workflows through interfacing with TensorFlow or PyTorch, as well as the transformer model library Hugging Face.


Lunch Time Python, Scientific Software Center, Heidelberg University

0 What to do with spaCy¶

spaCy is very powerful for text annotation:

  • sentencize and tokenize
  • POS (part-of-speech) and lemma
  • NER (named entity recognition)
  • dependency parsing
  • text classification
  • morphological analysis
  • pattern matching
  • ...

spaCy can also learn new tasks through integration with your machine learning stack. It also supports multi-task learning with pretrained transformers like BERT (which is used in the Google search engine).

In [1]:
import spacy
from spacy import displacy

if "google.colab" in str(get_ipython()):
    spacy.cli.download("en_core_web_md")
nlp = spacy.load("en_core_web_md")
doc = nlp(
    "The Scientific Software Center offers lunch-time Python - an informal way to learn about new Python libraries."
)
displacy.render(doc, style="dep")
[displaCy dependency visualization: each token labeled with its POS tag (The DET, Scientific PROPN, Software PROPN, Center PROPN, offers VERB, ...) and connected by dependency arcs (det, compound, nsubj, dobj, amod, ...)]
In [2]:
displacy.render(doc, style="ent")
[displaCy entity visualization: "The Scientific Software Center" and both occurrences of "Python" highlighted as ORG]

1 Install spaCy¶

You can install spaCy using pip:

pip install spacy

It is also available via conda-forge:

conda install -c conda-forge spacy

After installing spaCy, you also need to download the language model. For a medium-sized English model, you would do this using

python -m spacy download en_core_web_md

The available models are listed on the spaCy website: https://spacy.io/usage/models

Install spaCy with CUDA support¶

pip install -U spacy[cuda]

You can also use the online quickstart tool on the spaCy website for installation instructions.

2 Let's try it out!¶

In [3]:
nlp = spacy.load("en_core_web_md")
nlp("This is lunch-time Python.")
Out[3]:
This is lunch-time Python.
In [4]:
doc = nlp("This is lunch-time Python.")
print(type(doc))
[i for i in doc]
<class 'spacy.tokens.doc.Doc'>
Out[4]:
[This, is, lunch, -, time, Python, .]
In [5]:
t = doc[0]
type(t)
Out[5]:
spacy.tokens.token.Token
In [6]:
t.ent_id_
Out[6]:
''
In [7]:
displacy.render(doc)
[displaCy dependency visualization: This PRON, is AUX, lunch NOUN, time NOUN, Python PROPN, with arcs nsubj, compound, compound, attr]
In [8]:
spacy.explain("AUX")
Out[8]:
'auxiliary'
In [9]:
for t in doc:
    print(t.text, t.pos_, t.dep_, t.lemma_)
This PRON nsubj this
is AUX ROOT be
lunch NOUN compound lunch
- PUNCT punct -
time NOUN compound time
Python PROPN attr Python
. PUNCT punct .

3 Pipelines¶

[Figure: the spaCy processing pipeline; source: spaCy 101]

The capabilities of the processing pipeline depend on the components, their models, and how they were trained.

In [10]:
nlp.pipe_names
Out[10]:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
In [11]:
nlp.tokenizer
Out[11]:
<spacy.tokenizer.Tokenizer at 0x7fe33bfe6d40>
In [12]:
text = "Python is a very popular - maybe even the most popular - programming language among scientific software developers. One of the reasons for this success story is the rich standard library and the rich ecosystem of available (scientific) libraries. To fully leverage this ecosystem, developers need to stay up to date and explore new libraries. Lunch Time Python aims at providing a communication platform between Pythonistas to learn about new libraries in an informal setting. Sessions take roughly 30 minutes, one library is presented per session and the code will be made available afterwards. Come by, enjoy your lunch with us and step up your Python game!"
In [13]:
print(text)
Python is a very popular - maybe even the most popular - programming language among scientific software developers. One of the reasons for this success story is the rich standard library and the rich ecosystem of available (scientific) libraries. To fully leverage this ecosystem, developers need to stay up to date and explore new libraries. Lunch Time Python aims at providing a communication platform between Pythonistas to learn about new libraries in an informal setting. Sessions take roughly 30 minutes, one library is presented per session and the code will be made available afterwards. Come by, enjoy your lunch with us and step up your Python game!
In [14]:
doc = nlp(text)
In [15]:
for i, sent in enumerate(doc.sents):
    print(i, sent)
0 Python is a very popular - maybe even the most popular - programming language among scientific software developers.
1 One of the reasons for this success story is the rich standard library and the rich ecosystem of available (scientific) libraries.
2 To fully leverage this ecosystem, developers need to stay up to date and explore new libraries.
3 Lunch Time
4 Python aims at providing a communication platform between Pythonistas to learn about new libraries in an informal setting.
5 Sessions take roughly 30 minutes, one library is presented per session and the code will be made available afterwards.
6 Come by, enjoy your lunch with us and step up your Python game!
In [16]:
for i, sent in enumerate(doc.sents):
    for j, token in enumerate(sent):
        print(i, j, token.text, token.pos_)
0 0 Python PROPN
0 1 is AUX
0 2 a DET
0 3 very ADV
0 4 popular ADJ
0 5 - PUNCT
0 6 maybe ADV
0 7 even ADV
0 8 the DET
0 9 most ADV
0 10 popular ADJ
0 11 - PUNCT
0 12 programming VERB
0 13 language NOUN
0 14 among ADP
0 15 scientific ADJ
0 16 software NOUN
0 17 developers NOUN
0 18 . PUNCT
1 0 One NUM
1 1 of ADP
1 2 the DET
1 3 reasons NOUN
1 4 for ADP
1 5 this DET
1 6 success NOUN
1 7 story NOUN
1 8 is AUX
1 9 the DET
1 10 rich ADJ
1 11 standard ADJ
1 12 library NOUN
1 13 and CCONJ
1 14 the DET
1 15 rich ADJ
1 16 ecosystem NOUN
1 17 of ADP
1 18 available ADJ
1 19 ( PUNCT
1 20 scientific ADJ
1 21 ) PUNCT
1 22 libraries NOUN
1 23 . PUNCT
2 0 To PART
2 1 fully ADV
2 2 leverage VERB
2 3 this DET
2 4 ecosystem NOUN
2 5 , PUNCT
2 6 developers NOUN
2 7 need VERB
2 8 to PART
2 9 stay VERB
2 10 up ADP
2 11 to ADP
2 12 date NOUN
2 13 and CCONJ
2 14 explore VERB
2 15 new ADJ
2 16 libraries NOUN
2 17 . PUNCT
3 0 Lunch NOUN
3 1 Time PROPN
4 0 Python PROPN
4 1 aims VERB
4 2 at ADP
4 3 providing VERB
4 4 a DET
4 5 communication NOUN
4 6 platform NOUN
4 7 between ADP
4 8 Pythonistas PROPN
4 9 to PART
4 10 learn VERB
4 11 about ADP
4 12 new ADJ
4 13 libraries NOUN
4 14 in ADP
4 15 an DET
4 16 informal ADJ
4 17 setting NOUN
4 18 . PUNCT
5 0 Sessions NOUN
5 1 take VERB
5 2 roughly ADV
5 3 30 NUM
5 4 minutes NOUN
5 5 , PUNCT
5 6 one NUM
5 7 library NOUN
5 8 is AUX
5 9 presented VERB
5 10 per ADP
5 11 session NOUN
5 12 and CCONJ
5 13 the DET
5 14 code NOUN
5 15 will AUX
5 16 be AUX
5 17 made VERB
5 18 available ADJ
5 19 afterwards ADV
5 20 . PUNCT
6 0 Come VERB
6 1 by ADV
6 2 , PUNCT
6 3 enjoy VERB
6 4 your PRON
6 5 lunch NOUN
6 6 with ADP
6 7 us PRON
6 8 and CCONJ
6 9 step VERB
6 10 up ADP
6 11 your PRON
6 12 Python PROPN
6 13 game NOUN
6 14 ! PUNCT

Adding custom components¶

You can add custom pipeline components, for example rule-based or phrase matchers, and register custom attributes on the Doc, Token and Span objects.
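
For example, a minimal custom component can be registered with the @Language.component decorator and inserted into the pipeline. This is a sketch; the component name and the printed message are purely illustrative:

from spacy.language import Language

@Language.component("doc_length_logger")
def doc_length_logger(doc):
    # A component receives the Doc, may modify it, and must return it
    print(f"Doc has {len(doc)} tokens")
    return doc

nlp.add_pipe("doc_length_logger", first=True)  # run directly after the tokenizer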

Processing batches of texts¶

You can process batches of texts using the nlp.pipe() method.

docs = list(nlp.pipe(LOTS_OF_TEXTS))
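
A sketch with a concrete (made-up) list of texts; the optional batch_size argument controls how many texts are buffered per batch:

texts = ["First document.", "Second document.", "Third document."]
for doc in nlp.pipe(texts, batch_size=50):
    print(doc[0].text, doc[0].pos_)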

Disabling pipeline components¶

To achieve higher efficiency, it is possible to disable pipeline components.

nlp.select_pipes(disable=["ner"])
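
select_pipes can also be used as a context manager, so that the disabled components are restored afterwards (a small sketch):

# parser and ner are disabled only inside this block
with nlp.select_pipes(disable=["parser", "ner"]):
    doc = nlp("Only the remaining components run here.")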

4 Rule-based matching¶

In [17]:
# Import the Matcher
from spacy.matcher import Matcher

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
python_pattern = [{"TEXT": "Python", "POS": "PROPN"}]
matcher.add("PYTHON_PATTERN", [python_pattern])

doc = nlp(text)

# Call the matcher on the doc
matches = matcher(doc)
In [18]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
Python
Python
Python
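
Token patterns can also combine lexical attributes with operators. A sketch (the pattern name is made up) that matches "lunch time" with an optional punctuation token in between, so it covers both the plain and the hyphenated spelling:

matcher = Matcher(nlp.vocab)
pattern = [
    {"LOWER": "lunch"},
    {"IS_PUNCT": True, "OP": "?"},  # optional punctuation token, e.g. "-"
    {"LOWER": "time"},
]
matcher.add("LUNCH_TIME", [pattern])
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "Lunch Time"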

5 Phrase matching¶

Phrase matching is more efficient than rule-based token matching. It can be used to find exact sequences of words, and it also gives you access to the matched tokens in context.

  • Rule-based matching: find patterns in the tokens (token-based matching)
  • Phrase matching: find an exact string; useful for names, and when there are several ways to tokenize the string
In [19]:
doc = nlp(
    "The Scientific Software Center supports researchers in developing scientific software."
)

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
# you can also pass in attributes, for example attr="LOWER" or attr="POS"

# Create pattern Doc objects and add them to the matcher
term = "Scientific Software Center"
pattern = nlp(term)
# or use pattern = nlp.make_doc(term) to only invoke tokenizer - more efficient!
matcher.add("SSC", [pattern])

# Call the matcher on the test document and print the result
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
Scientific Software Center

6 Word vectors and semantic similarity¶

spaCy can compare two objects and predict similarity:

In [20]:
text1 = "I like Python."
text2 = "I like snakes."


doc1 = nlp(text1)
doc2 = nlp(text2)
In [21]:
print(doc1.similarity(doc2))
0.9570476558016586
In [22]:
token1 = doc1[2]
token2 = doc2[2]
print(token1.text, token2.text)
Python snakes
In [23]:
print(token1.similarity(token2))
0.18009579181671143

The similarity score is generated from word vectors.

In [24]:
print(token1.vector)
[-1.2606    0.065898  6.0885   -0.22722   0.83154   0.41309   3.1979
 -0.046191 -1.2829   -1.3479    1.7709    3.668    -2.0622    2.7155
 -1.0578   -2.5758    2.4921    1.6091   -1.0377    3.0679   -1.4015
  3.7073    1.9131   -0.57248  -2.6436    0.63337  -0.29285  -3.4357
 -2.1266    1.7317   -5.3598    1.3803   -0.54765   0.35455   2.7631
 -1.977     0.44758  -1.4725    2.8591   -2.1695    2.3519   -1.3073
 -2.5832   -1.1488   -6.6438   -0.93801   0.56867   0.87114  -0.96782
 -5.2648    0.94436   2.2771    1.1189   -0.34377  -2.5144    2.9963
 -2.5062    2.1578   -0.67746  -1.0898    1.6241    3.6518   -3.1079
  4.7306   -0.66454   2.7364    0.13306  -3.4212    1.3897    2.3435
 -5.4255    1.9155   -1.7938   -0.3813    1.5523    0.10848  -2.3448
 -1.336     2.8275   -1.1881   -2.0658   -1.704    -0.72433   1.1114
 -0.59757  -5.9866    2.3778   -0.16238  -2.3423   -1.7955   -0.77142
  0.068012  0.68761   0.67404  -4.4701    2.4112   -0.2604   -1.0389
  2.1799   -1.8888    2.3248   -0.68885   0.90761   1.6504   -0.5866
 -0.95308  -2.2514    0.26756   0.090679  3.9386    3.1946    1.1651
 -2.867     1.3898   -0.50941  -0.89953  -6.4801    2.1745   -2.1203
  0.55437  -1.3614   -3.2856   -2.1754    0.48878  -2.4629    0.15834
 -4.2165   -4.2826    0.56998   0.082179  0.42306   1.7157    4.5706
 -0.57897   1.6457    0.32642  -0.50926   1.0044    0.11967  -1.2308
  2.1196   -1.0886    2.0302   -0.22822   2.1447    1.3428    1.7925
 -0.91104  -1.5624    0.59617  -0.34208   5.2826    0.37967  -3.9622
 -5.4539   -2.3045    1.7818    5.9382    0.95568  -2.4973    3.5077
  2.3859   -0.41935  -1.8645    0.80334  -0.40924  -1.3111    0.90649
 -1.2311    0.7847    1.3806   -0.37329  -5.5309    2.092     0.81443
  0.097034  2.9104    0.34064   0.075322 -0.46475   0.17099   2.6546
 -4.8524   -0.029789  0.64981   0.76909  -4.32     -6.5618   -0.37659
  0.15436   2.5368   -0.17104  -0.14987   2.1709   -0.60606   6.0411
  2.8818    2.8922    2.8558   -0.61347  -4.4471    2.6216   -5.6342
  2.1586    2.0838   -0.12496  -3.1686   -1.5929    4.5141   -0.060719
 -3.2781   -1.5175    0.48335   3.9961    1.6667    1.6139    1.2288
 -0.095046  0.52451   0.98974   2.4654    3.1082   -2.9114    2.9509
  1.9835   -0.075264  4.079    -0.43975   0.70653   1.8881   -0.13128
 -2.4122    0.37447  -0.086059  0.018365 -1.0378    1.9564   -0.089256
  4.3107   -1.6252   -1.5946   -2.5387   -0.54987   0.83453   5.3653
  1.2602   -1.3737   -5.2252   -0.61126  -2.4068    2.6474   -0.66264
 -3.2214   -1.4838   -0.34186   2.418     2.1285   -2.8315   -1.4845
  1.9585    1.3732   -0.83277   0.30195   0.050321 -0.14242  -2.96
  2.3108    4.2398   -4.639    -3.6083   -0.97992  -2.9713    2.2687
  0.02414   0.25454  -2.2333    1.933     1.6268   -3.3229    3.1813
  0.17175  -1.6586   -0.12658   1.3129    1.3892    1.5215    2.4376
  0.17856  -0.65205   0.72564  -0.92968  -3.0689    3.5688    1.8885
  3.7389    1.9741    0.69516  -2.4315   -3.1602    2.8082  ]
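
By default, the similarity score is the cosine similarity between these vectors. A minimal sketch to verify this with numpy:

import numpy as np

v1, v2 = token1.vector, token2.vector
# Cosine similarity: dot product of the normalized vectors
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
# should be close to token1.similarity(token2)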

Similarity can be used to predict similar texts to users, or to flag duplicate content.

But: Similarity always depends on the context.

In [25]:
text3 = "I hate snakes."
doc3 = nlp(text3)
print(doc2.similarity(doc3))
0.9609648190520086

These come out as similar because both statements express a sentiment about snakes, even though the sentiments are opposite.

7 Internal workings¶

spaCy stores all strings as hash values and creates a lookup table. This way, a word that occurs several times only needs to be stored once.

In [26]:
nlp.vocab.strings.add("python")
python_hash = nlp.vocab.strings["python"]
python_string = nlp.vocab.strings[python_hash]
print(python_hash, python_string)
17956708691072489762 python
  • Lexemes are entries in the vocabulary that contain context-independent information: the text, its hash, and lexical attributes.
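
A small sketch inspecting the lexeme for "python" and some of its context-independent attributes:

lexeme = nlp.vocab["python"]
# text, hash value, and a lexical attribute
print(lexeme.text, lexeme.orth, lexeme.is_alpha)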

8 Train your own model¶

[Figure: the training scheme; source: spaCy online course]

  • Training data: annotated text
  • Text: the input text that the model should label
  • Label: the label that the model should predict
  • Gradient: how to change the weights

The training data¶

  • Examples in context
  • Update existing model: a few hundred to a few thousand examples
  • Train a new category: a few thousand to a million examples
  • Created manually by human annotators
  • Use the matcher to semi-automate annotation

Also need evaluation data.

Create a training corpus¶

In [27]:
from spacy.tokens import Span

nlp = spacy.blank("en")

# Create a Doc with entity spans
doc1 = nlp("iPhone X is coming")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
# Create another doc without entity spans
doc2 = nlp("I need a new phone! Any tips?")

docs = [doc1, doc2]  # and so on...
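
To turn these Docs into a training corpus on disk, they are typically serialized with DocBin into the binary .spacy format (a sketch; the file name is illustrative):

from spacy.tokens import DocBin

doc_bin = DocBin(docs=docs)
doc_bin.to_disk("./train.spacy")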

Configuring the training¶

The training config file config.cfg contains the settings for the training, such as the pipeline configuration and the hyperparameters.

[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
...

Use the quickstart widget to initialize a config.
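
Alternatively, a starting config can be generated on the command line (assuming spaCy v3):

python -m spacy init config ./config.cfg --lang en --pipeline ner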

That's it! All you need is the training and evaluation data and the config.¶

python -m spacy train ./config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy

After you have completed the training, the model can be loaded and used with spacy.load().

You can also package and deploy your pipeline so others can use it.

A few notes on training¶

  • If you update existing models, previously predicted categories can be unlearned ("catastrophic forgetting")!
  • Labels need to be consistent and not too specific

9 spaCy transformers¶

You can load in transformer models using spacy-transformers:

pip install spacy-transformers

Remember that transformer models work with context, so if you have a list of terms with no context around them (say, titles of blog posts), a transformer model may not be the best choice.

[Figure: a transformer-based pipeline; source: spaCy documentation]

Transformer-based pipelines end in _trf:

python -m spacy download en_core_web_trf
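
After downloading, a transformer-based pipeline is loaded and used like any other model (a small sketch):

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("The Scientific Software Center is located in Heidelberg.")
print(doc.ents)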

10 Further information¶

spaCy demos¶

  • You can explore spaCy using online tools, for example the rule-based Matcher Explorer,
  • or the spaCy online course.

Example use cases¶

  • Detection of programming languages in Stack Overflow posts
  • Take a look at spaCy projects!