# Lunch Time Python

## 25.11.2022: spaCy
<img style="width: 600px;" src="https://upload.wikimedia.org/wikipedia/commons/8/88/SpaCy_logo.svg">

[spaCy](https://spacy.io/) is an open-source natural language processing library written in Python and Cython.

spaCy focuses on production usage and is very fast and efficient. It also supports deep learning workflows through interfacing with [TensorFlow](https://www.tensorflow.org/) or [PyTorch](https://pytorch.org/), as well as the transformer model library [Hugging Face](https://github.com/huggingface).


[Lunch Time Python](https://ssciwr.github.io/lunch-time-python/), [Scientific Software Center](https://ssc.iwr.uni-heidelberg.de), [Heidelberg University](https://www.uni-heidelberg.de/)

# 0 What to do with spaCy

spaCy is very powerful for text annotation:
- sentencize and tokenize
- POS (part-of-speech) and lemma
- NER (named entity recognition)
- dependency parsing
- text classification
- morphological analysis
- pattern matching
- ...

spaCy can also learn new tasks through integration with your machine learning stack. It also provides multi-task learning with pretrained transformers like [BERT](https://arxiv.org/abs/1810.04805).
(BERT is used in the Google search engine.)


In [None]:
import spacy
from spacy import displacy

if "google.colab" in str(get_ipython()):
    spacy.cli.download("en_core_web_md")
nlp = spacy.load("en_core_web_md")
doc = nlp(
    "The Scientific Software Center offers lunch-time Python - an informal way to learn about new Python libraries."
)
displacy.render(doc, style="dep")

In [None]:
displacy.render(doc, style="ent")

# 1 Install spaCy
You can install spaCy using `pip`:

`pip install spacy`

It is also available via `conda-forge`:

`conda install -c conda-forge spacy`

After installing spaCy, you also need to download a language model. For a medium-sized English model, you would do this using

`python -m spacy download en_core_web_md`

The available models are listed on the spaCy website: https://spacy.io/usage/models

## Install spaCy with CUDA support

`pip install -U spacy[cuda]`

You can also explore the [online tool](https://spacy.io/usage) for installation instructions.
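Installing the CUDA extra does not by itself move computation to the GPU: spaCy has to be told to use it before a pipeline is loaded. A minimal sketch, assuming you want to fall back to the CPU when no GPU is found (the variable name `gpu_activated` is only illustrative):

```python
import spacy

# prefer_gpu() returns True if a GPU was activated and False otherwise,
# so the same code also runs on CPU-only machines.
# Call it before loading a pipeline.
gpu_activated = spacy.prefer_gpu()
print("Running on GPU:", gpu_activated)

# Use spacy.require_gpu() instead if a missing GPU should raise an error.
```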

# 2 Let's try it out!

In [None]:
nlp = spacy.load("en_core_web_md")
nlp("This is lunch-time Python.")

In [None]:
doc = nlp("This is lunch-time Python.")
print(type(doc))
[i for i in doc]

In [None]:
t = doc[0]
type(t)

In [None]:
t.ent_id_

In [None]:
displacy.render(doc)

In [None]:
spacy.explain("AUX")

In [None]:
for t in doc:
    print(t.text, t.pos_, t.dep_, t.lemma_)

# 3 Pipelines


![pipeline](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

[source: spaCy 101]

The capabilities of the processing pipeline depend on the components, their models and how they were trained.

In [None]:
nlp.pipe_names

In [None]:
nlp.tokenizer

In [None]:
text = "Python is a very popular - maybe even the most popular - programming language among scientific software developers. One of the reasons for this success story is the rich standard library and the rich ecosystem of available (scientific) libraries. To fully leverage this ecosystem, developers need to stay up to date and explore new libraries. Lunch Time Python aims at providing a communication platform between Pythonistas to learn about new libraries in an informal setting. Sessions take roughly 30 minutes, one library is presented per session and the code will be made available afterwards. Come by, enjoy your lunch with us and step up your Python game!"

In [None]:
print(text)

In [None]:
doc = nlp(text)

In [None]:
for i, sent in enumerate(doc.sents):
    print(i, sent)

In [None]:
for i, sent in enumerate(doc.sents):
    for j, token in enumerate(sent):
        print(i, j, token.text, token.pos_)

## Adding custom components
You can add custom pipeline components, for example rule-based or phrase matchers, and add custom attributes to the `Doc`, `Token` and `Span` objects.
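As a rough illustration, a custom component is a function that receives a `Doc`, modifies it and returns it. The sketch below uses a blank English pipeline so that no model download is needed; the component name `token_counter` and the attribute `n_tokens` are made up for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register a custom attribute, accessible as doc._.n_tokens
# ("n_tokens" is a hypothetical name chosen for this example).
if not Doc.has_extension("n_tokens"):
    Doc.set_extension("n_tokens", default=None)


@Language.component("token_counter")
def token_counter(doc):
    # Store the number of tokens on the doc and pass the doc on
    doc._.n_tokens = len(doc)
    return doc


# Blank pipeline: tokenizer only, no trained components
nlp_custom = spacy.blank("en")
nlp_custom.add_pipe("token_counter", last=True)

doc_custom = nlp_custom("This is lunch-time Python.")
print(nlp_custom.pipe_names, doc_custom._.n_tokens)
```

The same pattern works with a loaded pipeline such as `en_core_web_md`; `add_pipe` also accepts `first`, `before` and `after` to control where the component runs.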

## Processing batches of texts
You can process batches of texts using the `nlp.pipe()` method, which is more efficient than calling `nlp` on each text separately.

`docs = list(nlp.pipe(LOTS_OF_TEXTS))`
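A small sketch of batch processing, assuming a blank English pipeline and three short example texts (`batch_texts` stands in for `LOTS_OF_TEXTS`). `nlp.pipe()` returns a generator that yields `Doc` objects in the order of the input texts.

```python
import spacy

nlp_batch = spacy.blank("en")

batch_texts = [
    "I like Python.",
    "I like snakes.",
    "This is lunch-time Python.",
]

# Texts are processed in batches; batch_size is optional
batch_docs = list(nlp_batch.pipe(batch_texts, batch_size=2))

for batch_doc in batch_docs:
    print(len(batch_doc), batch_doc.text)
```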

## Disabling pipeline components
To achieve higher efficiency, it is possible to disable pipeline components.

`nlp.select_pipes(disable=["ner"])`
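`select_pipes` can also be used as a context manager, so that the disabled components are restored when the block ends. The sketch below assumes a blank English pipeline with a `sentencizer` and an `entity_ruler` carrying a single illustrative pattern; with a trained pipeline you would disable components such as `"ner"` in the same way.

```python
import spacy

nlp_select = spacy.blank("en")
nlp_select.add_pipe("sentencizer")
ruler = nlp_select.add_pipe("entity_ruler")
# Illustrative pattern, not part of any shipped model
ruler.add_patterns([{"label": "ORG", "pattern": "Scientific Software Center"}])

sample = "The Scientific Software Center offers lunch-time Python."

# Inside the block the entity ruler is skipped
with nlp_select.select_pipes(disable=["entity_ruler"]):
    names_inside = list(nlp_select.pipe_names)
    doc_without = nlp_select(sample)

# After the block the full pipeline is active again
doc_with = nlp_select(sample)

print(names_inside, doc_without.ents)
print(nlp_select.pipe_names, doc_with.ents)
```

If a component is never needed, it can instead be left out when loading, for example `spacy.load("en_core_web_md", disable=["ner"])`.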

# 4 Rule-based matching

In [None]:
# Import the Matcher
from spacy.matcher import Matcher

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
python_pattern = [{"TEXT": "Python", "POS": "PROPN"}]
matcher.add("PYTHON_PATTERN", [python_pattern])

doc = nlp(text)

# Call the matcher on the doc
matches = matcher(doc)

In [None]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
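A pattern is a list of dictionaries, one per token, and can combine several token attributes and operators. The following sketch is one possible pattern for the different spellings of "Lunch Time Python". It only uses lexical attributes (`LOWER`, `IS_PUNCT`), so it runs on a blank pipeline; attributes such as `POS` would require a trained tagger. The pattern name and example sentence are made up.

```python
import spacy
from spacy.matcher import Matcher

nlp_rules = spacy.blank("en")
rule_matcher = Matcher(nlp_rules.vocab)

# "lunch", an optional punctuation token (e.g. a hyphen), "time", "python",
# all compared case-insensitively via LOWER
lunch_pattern = [
    {"LOWER": "lunch"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "time"},
    {"LOWER": "python"},
]
rule_matcher.add("LUNCH_TIME_PYTHON", [lunch_pattern])

rule_doc = nlp_rules("Lunch Time Python and lunch-time Python are the same thing.")
matched_texts = [rule_doc[start:end].text for _, start, end in rule_matcher(rule_doc)]
print(matched_texts)
```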

# 5 Phrase matching
The `PhraseMatcher` is more efficient than rule-based matching, can be used to find sequences of words, and also gives you access to the tokens in context.

- Rule-based matching: find patterns in the tokens (token-based matching)
- Phrase matching: find exact strings; useful for names and when there are several ways of tokenizing the string

In [None]:
doc = nlp(
    "The Scientific Software Center supports researchers in developing scientific software."
)

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
# you can also pass in attributes, for example attr="LOWER" or attr="POS"

# Create pattern Doc objects and add them to the matcher
term = "Scientific Software Center"
pattern = nlp(term)
# or use pattern = nlp.make_doc(term) to only invoke tokenizer - more efficient!
matcher.add("SSC", [pattern])

# Call the matcher on the test document and print the result
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
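The two options mentioned in the comments above can be combined. The sketch below assumes a case-insensitive search for a small, made-up term list: `attr="LOWER"` makes the matcher compare lowercased token texts, and `nlp.make_doc` creates the pattern docs with the tokenizer only.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_phrases = spacy.blank("en")
phrase_matcher = PhraseMatcher(nlp_phrases.vocab, attr="LOWER")

# Illustrative term list
terms = ["Scientific Software Center", "Heidelberg University"]
# make_doc only runs the tokenizer, which is all the PhraseMatcher needs here
patterns = [nlp_phrases.make_doc(term) for term in terms]
phrase_matcher.add("ORG_TERMS", patterns)

phrase_doc = nlp_phrases(
    "The scientific software center is part of heidelberg university."
)
found = []
for match_id, start, end in phrase_matcher(phrase_doc):
    # match_id is the hash of the pattern name
    label = nlp_phrases.vocab.strings[match_id]
    found.append((label, phrase_doc[start:end].text))
print(found)
```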

# 6 Word vectors and semantic similarity
spaCy can compare two objects and predict similarity:

In [None]:
text1 = "I like Python."
text2 = "I like snakes."


doc1 = nlp(text1)
doc2 = nlp(text2)

In [None]:
print(doc1.similarity(doc2))

In [None]:
token1 = doc1[2]
token2 = doc2[2]
print(token1.text, token2.text)

In [None]:
print(token1.similarity(token2))

The similarity score is generated from word vectors.

In [None]:
print(token1.vector)
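By default, the similarity score is the cosine similarity of the two vectors. A minimal sketch of that computation with NumPy, using toy vectors so that it runs without a model (the helper `cosine_similarity` is written for this example and is not a spaCy function):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype="float64")
    b = np.asarray(b, dtype="float64")
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        return 0.0
    return float(np.dot(a, b) / norm)


# Toy vectors: identical direction vs. orthogonal
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))

# With a pipeline that ships word vectors (e.g. en_core_web_md),
# cosine_similarity(token1.vector, token2.vector) should be close to
# token1.similarity(token2).
```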

Similarity can be used to recommend similar texts to users, or to flag duplicate content.

But: Similarity always depends on the context.

In [None]:
text3 = "I hate snakes."
doc3 = nlp(text3)
print(doc2.similarity(doc3))

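These come out as similar because both statements express a sentiment about snakes. By default, the `Doc` vector is the average of its token vectors, so a single differing word changes the score only slightly.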

# 7 Internal workings
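Internally, spaCy refers to strings by their hash values and keeps a lookup table (the `StringStore`, available as `nlp.vocab.strings`) that maps between hashes and strings. This way, a word that occurs several times only needs to be stored once.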

In [None]:
nlp.vocab.strings.add("python")
python_hash = nlp.vocab.strings["python"]
python_string = nlp.vocab.strings[python_hash]
print(python_hash, python_string)

- Lexemes are entries in the vocabulary and contain context-independent information (the text, hash, lexical attributes).
![data structure](https://course.spacy.io/vocab_stringstore.png)
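A lexeme can be looked up in the vocabulary by its string. The sketch below assumes a blank English pipeline and shows that the lexeme's `orth` value is the same hash that the string store uses:

```python
import spacy

nlp_vocab = spacy.blank("en")
vocab_doc = nlp_vocab("I like Python")

# Look up the context-independent entry for the word "Python"
lexeme = nlp_vocab.vocab["Python"]
print(lexeme.text, lexeme.orth, lexeme.is_alpha, lexeme.is_title)

# The hash resolves back to the string via the string store
python_orth = nlp_vocab.vocab.strings["Python"]
print(python_orth == lexeme.orth, nlp_vocab.vocab.strings[lexeme.orth])
```

Context-dependent information such as part-of-speech tags or dependencies is stored on the `Token`, not on the lexeme.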

# 8 Train your own model
![training_scheme](https://course.spacy.io/training.png)
[source: spaCy online course]

Training data: Annotated text  
Text: The input text that the model should label  
Label: The label that the model should predict  
Gradient: How to change the weights

## The training data
- Examples in context
- Update existing model: a few hundred to a few thousand examples
- Train a new category: a few thousand to a million examples
- Created manually by human annotators
- Use the matcher to semi-automate the annotation

You also need evaluation data.
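One way to semi-automate annotation is to let the `Matcher` propose entity spans, which a human then reviews. A sketch under that assumption, with a made-up pattern for the label `GADGET` on a blank English pipeline:

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp_annot = spacy.blank("en")
annot_matcher = Matcher(nlp_annot.vocab)
# Illustrative pattern: "iphone" followed by "X"
annot_matcher.add("GADGET", [[{"LOWER": "iphone"}, {"TEXT": "X"}]])

annot_doc = nlp_annot("iPhone X is coming")

# as_spans=True returns Span objects labelled with the pattern name
proposed = annot_matcher(annot_doc, as_spans=True)
# doc.ents must not overlap, so keep only non-overlapping spans
annot_doc.ents = filter_spans(proposed)

print([(ent.text, ent.label_) for ent in annot_doc.ents])
```

The proposed annotations should still be checked manually before they are used for training.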

## Create a training corpus

In [None]:
from spacy.tokens import Span

nlp = spacy.blank("en")

# Create a Doc with entity spans
doc1 = nlp("iPhone X is coming")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
# Create another doc without entity spans
doc2 = nlp("I need a new phone! Any tips?")

docs = [doc1, doc2]  # and so on...
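The `spacy train` command reads its corpus from binary `.spacy` files, which can be written with a `DocBin`. A minimal sketch that repeats the two docs from above and writes them to a temporary directory (in practice you would write to the path that you pass as `--paths.train`):

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp_corpus = spacy.blank("en")

corpus_doc1 = nlp_corpus("iPhone X is coming")
corpus_doc1.ents = [Span(corpus_doc1, 0, 2, label="GADGET")]
corpus_doc2 = nlp_corpus("I need a new phone! Any tips?")

# Collect the annotated docs and serialize them to disk
doc_bin = DocBin(docs=[corpus_doc1, corpus_doc2])
with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = Path(tmp_dir) / "train.spacy"
    doc_bin.to_disk(train_path)

    # Read the file back to check what was stored
    loaded_docs = list(DocBin().from_disk(train_path).get_docs(nlp_corpus.vocab))

print([(d.text, [(e.text, e.label_) for e in d.ents]) for d in loaded_docs])
```

The evaluation data (`dev.spacy`) is created in the same way from a separate set of docs.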

## Configuring the training
The training `config.cfg` contains the settings for the training, such as configuration of the pipeline and setting of hyperparameters.

```
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
...
```

Use the [quickstart-widget](https://spacy.io/usage/training#quickstart) to initialize a config.
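The configuration of an existing pipeline can also be inspected from Python through `nlp.config`, which helps to see how the sections shown above relate to the pipeline components. A sketch assuming a blank English pipeline with an untrained `ner` component:

```python
import spacy

nlp_cfg = spacy.blank("en")
nlp_cfg.add_pipe("ner")  # untrained component, only added to inspect its settings

config = nlp_cfg.config

# Same sections as in config.cfg
print(config["nlp"]["lang"], config["nlp"]["pipeline"])
print(config["components"]["ner"]["factory"])
print(config["components"]["ner"]["model"]["@architectures"])

# config.to_str() returns the full config in the config.cfg format
print(config.to_str()[:200])
```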

## That's it! All you need is the training and evaluation data and the config.
`python -m spacy train ./config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy`

After you have completed the training, the model can be loaded and used with `spacy.load()`.
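With the command above, the trained pipelines are written to subdirectories of `./output` (`model-best` and `model-last`), so loading the result would look like `spacy.load("./output/model-best")`. Since no trained model is available here, the sketch below stands in a blank pipeline that is saved with `to_disk` and loaded again from its directory:

```python
import tempfile
from pathlib import Path

import spacy

# Stand-in for a trained pipeline
nlp_saved = spacy.blank("en")
nlp_saved.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir) / "model-best"
    nlp_saved.to_disk(model_dir)
    # spacy.load accepts a path to a pipeline directory
    nlp_loaded = spacy.load(model_dir)

doc_loaded = nlp_loaded("This is lunch-time Python. It takes 30 minutes.")
print(nlp_loaded.pipe_names)
print([sent.text for sent in doc_loaded.sents])
```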

You can also package and deploy your pipeline so others can use it.

## A few notes on training
- If you update existing models, previously predicted categories can be unlearned ("catastrophic forgetting")!
- Labels need to be consistent and not too specific

# 9 spaCy transformers
You can load in transformer models using `spacy-transformers`:

`pip install spacy-transformers`

Remember that transformer models work with context, so if you have a list of terms with no context around them (say, titles of blog posts), a transformer model may not be the best choice.

![transformer_pipeline](https://spacy.io/pipeline_transformer-3464b402cf7b19c3dd1efe1c0b4336dd.svg)
[source: spaCy documentation]

The names of transformer-based pipelines end in `_trf`:

`python -m spacy download en_core_web_trf`

# 10 Further information

# spaCy demos
- You can explore spaCy using [online tools](https://explosion.ai/software), for example the [rule-based matcher explorer](https://demos.explosion.ai/matcher).
- You can also work through the [spaCy online course](https://course.spacy.io/en/).


# Example use cases
- [Detection of programming languages in Stack Overflow posts](https://github.com/koaning/spacy-youtube-material)
- Take a look at [spaCy projects](https://spacy.io/usage/projects)!
