Linguistic Features
Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text, and get back a Doc
object that comes with a variety of annotations.
Part-of-speech tagging Needs model
After tokenization, spaCy can parse and tag a given Doc
. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following "the" in English is most likely a noun.
Linguistic annotations are available as Token
attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _
to its name:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)
Text | Lemma | POS | Tag | Dep | Shape | alpha | stop |
---|---|---|---|---|---|---|---|
Apple | apple | PROPN | NNP | nsubj | Xxxxx | True | False |
is | be | AUX | VBZ | aux | xx | True | True |
looking | look | VERB | VBG | ROOT | xxxx | True | False |
at | at | ADP | IN | prep | xx | True | True |
buying | buy | VERB | VBG | pcomp | xxxx | True | False |
U.K. | u.k. | PROPN | NNP | compound | X.X. | False | False |
startup | startup | NOUN | NN | dobj | xxxx | True | False |
for | for | ADP | IN | prep | xxx | True | True |
$ | $ | SYM | $ | quantmod | $ | False | False |
1 | 1 | NUM | CD | compound | d | False | False |
billion | billion | NUM | CD | pobj | xxxx | True | False |
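For example, here's a minimal sketch of how the underscore-less attributes relate to their readable counterparts: they hold integer hash values that can be resolved back to text through the shared vocab (assuming the small English pipeline is installed).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
token = doc[0]
print(token.pos, token.pos_)         # hash value vs. readable string
print(nlp.vocab.strings[token.pos])  # resolve the hash back to 'PROPN'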
Using spaCy's built-in displaCy visualizer, here's what our example sentence and its dependencies look like:
Morphology
Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:
Context | Surface | Lemma | POS | Morphological Features |
---|---|---|---|---|
I was reading the paper | reading | read | VERB | VerbForm=Ger |
I don't watch the news, I read the paper | read | read | VERB | VerbForm=Fin , Mood=Ind , Tense=Pres |
I read the paper yesterday | read | read | VERB | VerbForm=Fin , Mood=Ind , Tense=Past |
Morphological features are stored in the MorphAnalysis
under Token.morph
, which allows you to access individual morphological features.
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was reading the paper.")
token = doc[0]  # 'I'
print(token.morph)  # 'Case=Nom|Number=Sing|Person=1|PronType=Prs'
print(token.morph.get("PronType"))  # ['Prs']
Statistical morphology v 3.0 Needs model
spaCy's statistical Morphologizer
component assigns the morphological features and coarse-grained part-of-speech tags as Token.morph
and Token.pos
.
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Wo bist du?")  # English: 'Where are you?'
print(doc[2].morph)  # 'Case=Nom|Number=Sing|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
Rule-based morphology
For languages with relatively simple morphological systems like English, spaCy can assign morphological features through a rule-based approach, which uses the token text and fine-grained part-of-speech tags to produce coarse-grained part-of-speech tags and morphological features.
- The part-of-speech tagger assigns each token a fine-grained part-of-speech tag. In the API, these tags are known as
Token.tag
. They express the part-of-speech (e.g. verb) and some amount of morphological information, e.g. that the verb is past tense (e.g. VBD
for a past tense verb in the Penn Treebank). - For words whose coarse-grained POS is not set by a prior process, a mapping table maps the fine-grained tags to a coarse-grained POS tag and morphological features.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Where are you?")
print(doc[2].morph)  # 'Case=Nom|Person=2|PronType=Prs'
print(doc[2].pos_)  # 'PRON'
Lemmatization v 3.0 Needs model
spaCy provides two pipeline components for lemmatization:
- The
Lemmatizer
component provides lookup and rule-based lemmatization methods in a configurable component. An individual language can extend the Lemmatizer
as part of its language data. - The
EditTreeLemmatizer
v 3.3 component provides a trainable lemmatizer.
import spacy

# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
The data for spaCy's lemmatizers is distributed in the package spacy-lookups-data
. The provided trained pipelines already include all the required tables, but if you are creating new pipelines, you'll probably want to install spacy-lookups-data
to provide the data when the lemmatizer is initialized.
Lookup lemmatizer
For pipelines without a tagger or morphologizer, a lookup lemmatizer can be added to the pipeline as long as a lookup table is provided, typically through spacy-lookups-data
. The lookup lemmatizer looks up the token surface form in the lookup table without reference to the token's part-of-speech or context.
# pip install -U spacy[lookups]
import spacy

nlp = spacy.blank("sv")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
Rule-based lemmatizer
When training pipelines that include a component that assigns part-of-speech tags (a morphologizer or a tagger with a POS mapping), a rule-based lemmatizer can be added using rule tables from spacy-lookups-data
:
# pip install -U spacy[lookups]
import spacy

nlp = spacy.blank("de")
# Morphologizer (note: model is not yet trained!)
nlp.add_pipe("morphologizer")
# Rule-based lemmatizer
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
The rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned coarse-grained part-of-speech and morphological information, without consulting the context of the token. The rule-based lemmatizer also accepts list-based exception files. For English, these are acquired from WordNet.
Trainable lemmatizer
The EditTreeLemmatizer
can learn form-to-lemma transformations from a training corpus that includes lemma annotations. This removes the need to write language-specific rules and can (in many cases) provide higher accuracies than lookup and rule-based lemmatizers.
import spacy

nlp = spacy.blank("de")
nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
Dependency Parsing Needs model
spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or "chunks". You can check whether a Doc
object has been parsed by calling doc.has_annotation("DEP")
, which checks whether the attribute Token.dep
has been set, and returns a boolean value. If the result is False
, the default sentence iterator will raise an exception.
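For example, a minimal check (assuming the small English pipeline is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
print(doc.has_annotation("DEP"))  # True, because the parser has run and set Token.dep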
Noun chunks
Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world's largest tech fund". To get the noun chunks in a document, simply iterate over Doc.noun_chunks
.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
Text | root.text | root.dep_ | root.head.text |
---|---|---|---|
Autonomous cars | cars | nsubj | shift |
insurance liability | liability | dobj | shift |
manufacturers | manufacturers | pobj | toward |
Navigating the parse tree
spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep
is a hash value. You can get the string value with .dep_
.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])
Text | Dep | Head text | Head POS | Children |
---|---|---|---|---|
Autonomous | amod | cars | NOUN | |
cars | nsubj | shift | VERB | Autonomous |
shift | ROOT | shift | VERB | cars, liability, toward |
insurance | compound | liability | NOUN | |
liability | dobj | shift | VERB | insurance |
toward | prep | shift | VERB | manufacturers |
manufacturers | pobj | toward | ADP | |
Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest – from below:
import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)
If you try to match from above, you'll have to iterate twice: once for the head, and then again through the children:
# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break
To iterate through the children, use the token.children
attribute, which provides a sequence of Token
objects.
Iterating around the local tree
A few more convenience attributes are provided for iterating around the local tree from the token. Token.lefts
and Token.rights
attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, Token.n_lefts
and Token.n_rights
that give the number of left and right children.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("bright red apples on the tree")
print([token.text for token in doc[2].lefts])  # ['bright', 'red']
print([token.text for token in doc[2].rights])  # ['on']
print(doc[2].n_lefts)  # 2
print(doc[2].n_rights)  # 1
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("schöne rote Äpfel auf dem Baum")
print([token.text for token in doc[2].lefts])  # ['schöne', 'rote']
print([token.text for token in doc[2].rights])  # ['auf']
You can get a whole phrase by its syntactic head using the Token.subtree
attribute. This returns an ordered sequence of tokens. You can walk up the tree with the Token.ancestors
attribute, and check dominance with Token.is_ancestor
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")

root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
            descendant.n_rights,
            [ancestor.text for ancestor in descendant.ancestors])
Text | Dep | n_lefts | n_rights | ancestors |
---|---|---|---|---|
Credit | nmod | 0 | 2 | holders, submit |
and | cc | 0 | 0 | holders, submit |
mortgage | compound | 0 | 0 | account, Credit, holders, submit |
account | conj | 1 | 0 | Credit, holders, submit |
holders | nsubj | 1 | 0 | submit |
Finally, the .left_edge
and .right_edge
attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span
object for a syntactic phrase. Note that .right_edge
gives a token within the subtree – so if you use it as the end-point of a range, don't forget to +1
!
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
Text | POS | Dep | Head text |
---|---|---|---|
Credit and mortgage account holders | NOUN | nsubj | submit |
must | VERB | aux | submit |
submit | VERB | ROOT | submit |
their | ADJ | poss | requests |
requests | NOUN | dobj | submit |
The dependency parse can be a useful tool for information extraction, especially when combined with other predictions like named entities. The following example extracts money and currency values, i.e. entities labeled as MONEY
, and then uses the dependency parse to find the noun phrase they are referring to – for example "Net income"
→ "$9.4 million"
.
import spacy

nlp = spacy.load("en_core_web_sm")
# Merge noun phrases and entities for easier analysis
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]
for doc in nlp.pipe(TEXTS):
    for token in doc:
        if token.ent_type_ == "MONEY":
            # We have an attribute and direct object, so check for subject
            if token.dep_ in ("attr", "dobj"):
                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subj:
                    print(subj[0], "-->", token)
            # We have a prepositional object with a preposition
            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
                print(token.head.head, "-->", token)
Visualizing dependencies
The best way to understand spaCy's dependency parser is interactively. To make this easier, spaCy comes with a visualization module. You can pass a Doc
or a list of Doc
objects to displaCy and run displacy.serve
to run the web server, or displacy.render
to generate the raw markup. If you want to know how to write rules that hook into some type of syntactic structure, just plug the sentence into the visualizer and see how spaCy annotates it.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, style="dep")
Disabling the parser
In the trained pipelines provided by spaCy, the parser is loaded and enabled by default as part of the standard processing pipeline. If you don't need any of the syntactic information, you should disable the parser. Disabling the parser will make spaCy load and run much faster. If you want to load the parser, but need to disable it for specific documents, you can also control its use on the nlp
object. For more details, see the usage guide on disabling pipeline components.
nlp = spacy.load("en_core_web_sm", disable=["parser"])
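For example, a minimal sketch of disabling the parser only temporarily, using the select_pipes context manager:

import spacy

nlp = spacy.load("en_core_web_sm")
# The parser is only disabled inside the block and re-enabled afterwards
with nlp.select_pipes(disable=["parser"]):
    doc = nlp("I won't be parsed")
    print(doc.has_annotation("DEP"))  # False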
Named Entity Recognition
spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.
Named Entity Recognition 101
A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.
Named entities are available as the ents
property of a Doc
:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Text | Start | End | Label | Description |
---|---|---|---|---|
Apple | 0 | 5 | ORG | Companies, agencies, institutions. |
U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states. |
$1 billion | 44 | 54 | MONEY | Monetary values, including unit. |
Using spaCy's built-in displaCy visualizer, here's what our example sentence and its named entities look like:
Accessing entity annotations and labels
The standard way to access entity annotations is the doc.ents
property, which produces a sequence of Span
objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label
and ent.label_
. The Span
object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.
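For example, a minimal sketch of working with an entity span (assuming the small English pipeline is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")
ent = doc.ents[0]
print(ent.text, ent.label, ent.label_)  # whole entity text, label as hash and as string
print(ent[0].text, ent[1].text)         # a Span can be indexed like a sequence of tokens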
You can also access token entity annotations using the token.ent_iob
and token.ent_type
attributes. token.ent_iob
indicates whether an entity starts, continues or ends on the tag. If no entity type is set on a token, it will return an empty string.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # ['San', 'B', 'GPE']
print(ent_francisco)  # ['Francisco', 'I', 'GPE']
Text | ent_iob | ent_iob_ | ent_type_ | Description |
---|---|---|---|---|
San | 3 | B | "GPE" | beginning of an entity |
Francisco | 1 | I | "GPE" | inside an entity |
considers | 2 | O | "" | outside an entity |
banning | 2 | O | "" | outside an entity |
sidewalk | 2 | O | "" | outside an entity |
delivery | 2 | O | "" | outside an entity |
robots | 2 | O | "" | outside an entity |
Setting entity annotations
To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can't write directly to the token.ent_iob
or token.ent_type
attributes, so the easiest way to set entities is to use the doc.set_ents
function and create the new entity as a Span
.
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# The model didn't recognize "fb" as an entity :(

# Create a span for the new entity
fb_ent = Span(doc, 0, 1, label="ORG")
orig_ents = list(doc.ents)

# Option 1: Modify the provided entity spans, leaving the rest unmodified
doc.set_ents([fb_ent], default="unmodified")

# Option 2: Assign a complete list of ents to doc.ents
doc.ents = orig_ents + [fb_ent]

ents = [(e.text, e.start, e.end, e.label_) for e in doc.ents]
print('After', ents)
# [('fb', 0, 1, 'ORG')] 🎉
Keep in mind that Span
is initialized with the start and end token indices, not the character offsets. To create a span from character offsets, use Doc.char_span
:
fb_ent = doc.char_span(0, 2, label="ORG")
Setting entity annotations from array
You can also assign entity annotations using the doc.from_array
method. To do this, you should include both the ENT_TYPE
and the ENT_IOB
attributes in the array you're importing from.
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents)  # []

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # [London]
Setting entity annotations in Cython
Finally, you can always write to the underlying struct if you compile a Cython function. This is easy to do, and allows you to write efficient native code.
# cython: infer_types=True
from spacy.typedefs cimport attr_t
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, attr_t ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3
    for i in range(start+1, end):
        doc.c[i].ent_iob = 2
Obviously, if you write directly to the array of TokenC*
structs, you'll have responsibility for ensuring that the data is left in a consistent state.
Built-in entity types
Visualizing named entities
The displaCy ENT visualizer lets you explore an entity recognition model's behavior interactively. If you're training a model, it's very useful to run the visualization yourself. To help you do that, spaCy comes with a visualization module. You can pass a Doc
or a list of Doc
objects to displaCy and run displacy.serve
to run the web server, or displacy.render
to generate the raw markup.
For more details and examples, see the usage guide on visualizing spaCy.
Named Entity example
import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")
Entity Linking
To ground the named entities into the "real world", spaCy provides functionality to perform entity linking, which resolves a textual entity to a unique identifier from a knowledge base (KB). You can create your own KnowledgeBase
and train a new EntityLinker
using that custom knowledge base.
Accessing entity identifiers Needs model
The annotated KB identifier is accessible as either a hash value or as a string, using the attributes ent.kb_id
and ent.kb_id_
of a Span
object, or the ent_kb_id
and ent_kb_id_
attributes of a Token
object.
import spacy

nlp = spacy.load("my_custom_el_pipeline")
doc = nlp("Ada Lovelace was born in London")

# Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents)  # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]

# Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
print(ent_ada_0)  # ['Ada', 'PERSON', 'Q7259']
print(ent_ada_1)  # ['Lovelace', 'PERSON', 'Q7259']
print(ent_london_5)  # ['London', 'GPE', 'Q84']
Tokenization
Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc
object. To construct a Doc
object, you need a Vocab
instance, a sequence of word
strings, and optionally a sequence of spaces
booleans, which allow you to maintain alignment of the tokens into the original string.
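For example, a minimal sketch of constructing a Doc by hand:

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# words and spaces together reconstruct the original string exactly
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])
print(doc.text)  # 'Hello world!'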
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. Each Doc
consists of individual tokens, and we can iterate over them:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion |
First, the raw text is split on whitespace characters, similar to text.split(' ')
. Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
-
Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
-
Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.
If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass, like English
or German
, that loads in lists of hard-coded data and exception rules.
spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition and ease of alignment into the original string.
After consuming a prefix or suffix, we consult the special cases again. We want the special cases to handle things like "don't" in English, and we want the same rule to work for "(don't)!". We do this by splitting off the open bracket, then the exclamation, then the closed bracket, and finally matching the special case. Here's an implementation of the algorithm in Python optimized for readability rather than performance:
def tokenizer_pseudo_code(
    text,
    special_cases,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match,
    url_match
):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
                continue
            while prefix_search(substring) or suffix_search(substring):
                if token_match(substring):
                    tokens.append(substring)
                    substring = ""
                    break
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ""
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            if token_match(substring):
                tokens.append(substring)
                substring = ""
            elif url_match(substring):
                tokens.append(substring)
                substring = ""
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    if offset == 0 and match.start() == 0:
                        continue
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            elif substring:
                tokens.append(substring)
                substring = ""
        tokens.extend(reversed(suffixes))
    for match in matcher(special_cases, text):
        tokens.replace(match, special_cases[match])
    return tokens
The algorithm can be summarized as follows:
- Iterate over space-separated substrings.
- Check whether we have an explicitly defined special case for this substring. If we do, use it.
- Look for a token match. If there is a match, stop processing and keep this token.
- Check whether we have an explicitly defined special case for this substring. If we do, use it.
- Otherwise, try to consume one prefix. If we consumed a prefix, go back to #3, so that the token match and special cases always get priority.
- If we didn't consume a prefix, try to consume a suffix and then go back to #3.
- If we can't consume a prefix or a suffix, look for a URL match.
- If there's no URL match, then look for a special case.
- Look for "infixes" – stuff like hyphens etc. and split the substring into tokens on all infixes.
- Once we can't consume any more of the string, handle it as a single token.
- Make a final pass over the text to check for special cases that include spaces or that were missed due to the incremental processing of affixes.
Global and language-specific tokenizer data is supplied via the language data in spacy/lang
. The tokenizer exceptions define special cases like "don't" in English, which needs to be split into two tokens: {ORTH: "do"}
and {ORTH: "n't", NORM: "not"}
. The prefixes, suffixes and infixes mostly define punctuation rules – for example, when to split off periods (at the end of a sentence), and when to leave tokens containing periods intact (abbreviations like "U.S.").
Tokenization rules that are specific to one language, but can be generalized across that language, should ideally live in the language data in spacy/lang
– we always appreciate pull requests! Anything that's specific to a domain or text type – like financial trading abbreviations or Bavarian youth slang – should be added as a special case rule to your tokenizer instance. If you're dealing with a lot of customizations, it might make sense to create an entirely custom subclass.
Adding special case tokenization rules
Most domains have at least some idiosyncrasies that require custom tokenization rules. This could be very specific expressions, or abbreviations only used in this specific field. Here's how to add a special case rule to an existing Tokenizer
instance:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("gimme that")  # phrase to tokenize
print([w.text for w in doc])  # ['gimme', 'that']

# Add special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

# Check new tokenization
print([w.text for w in nlp("gimme that")])  # ['gim', 'me', 'that']
The special case doesn't have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring. The special case rules also have precedence over the punctuation splitting.
assert "gimme" not in [w.text for w in nlp("gimme!")]
assert "gimme" not in [w.text for w in nlp('("...gimme...?")')]

nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
assert len(nlp("...gimme...?")) == 1
Debugging the tokenizer
A working implementation of the pseudo-code above is available for debugging as nlp.tokenizer.explain(text)
. It returns a list of tuples showing which tokenizer rule or pattern was matched for each token. The tokens produced are identical to nlp.tokenizer()
except for whitespace tokens:
from spacy.lang.en import English

nlp = English()
text = '''"Let's go!"'''
doc = nlp(text)
tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\t", t[0])
Customizing spaCy's Tokenizer class
Let's imagine you wanted to create a tokenizer for a new language or specific domain. There are six things you may need to define:
- A dictionary of special cases. This handles things like contractions, units of measurement, emoticons, certain abbreviations, etc.
- A function
prefix_search
, to handle preceding punctuation, such as open quotes, open brackets, etc. - A function
suffix_search
, to handle succeeding punctuation, such as commas, periods, close quotes, etc. - A function
infix_finditer
, to handle non-whitespace separators, such as hyphens etc. - An optional boolean function
token_match
matching strings that should never be split, overriding the infix rules. Useful for things like numbers. - An optional boolean function
url_match
, which is similar totoken_match
except that prefixes and suffixes are removed before applying the match.
You shouldn't usually need to create a Tokenizer
subclass. Standard usage is to use re.compile()
to build a regular expression object, and pass its .search()
and .finditer()
methods:
import re
import spacy
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, rules=special_cases,
                                prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                url_match=simple_url_re.match)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc])  # ['hello', '-', 'world.', ':)']
If you need to subclass the tokenizer instead, the relevant methods to specialize are find_prefix
, find_suffix
and find_infix
.
Modifying existing rule sets
In many situations, you don't necessarily need entirely custom rules. Sometimes you just want to add another character to the prefixes, suffixes or infixes. The default prefix, suffix and infix rules are available via the nlp
object's Defaults
and the Tokenizer
attributes such as Tokenizer.suffix_search
are writable, so you can overwrite them with compiled regular expression objects using modified default rules. spaCy ships with utility functions to help you compile the regular expressions – for example, compile_suffix_regex
:
suffixes = nlp.Defaults.suffixes + [r'''-+$''',]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
Similarly, you can remove a character from the default suffixes:
suffixes = list(nlp.Defaults.suffixes)
suffixes.remove("\\[")
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
The Tokenizer.suffix_search
attribute should be a function which takes a unicode string and returns a regex match object or None
. Usually we use the .search
attribute of a compiled regex object, but you can use another function that behaves the same way.
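For example, a minimal sketch of a hand-written replacement (the suffix pattern here is hypothetical and only illustrates the expected interface):

import re
import spacy

nlp = spacy.load("en_core_web_sm")
custom_suffix_re = re.compile(r'''[\.,!?]$''')

def custom_suffix_search(text):
    # Must behave like a compiled regex's .search: return a match object or None
    return custom_suffix_re.search(text)

nlp.tokenizer.suffix_search = custom_suffix_search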
The prefix, infix and suffix rule sets include not only individual characters but also detailed regular expressions that take the surrounding context into account. For example, there is a regular expression that treats a hyphen between letters as an infix. If you do not want the tokenizer to split on hyphens between letters, you can modify the existing infix definition from lang/punctuation.py
:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

# Default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law")
print([t.text for t in doc])  # ['mother', '-', 'in', '-', 'law']

# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("mother-in-law")
print([t.text for t in doc])  # ['mother-in-law']
For an overview of the default regular expressions, see lang/punctuation.py
and language-specific definitions such as lang/de/punctuation.py
for German.
Hooking a custom tokenizer into the pipeline
The tokenizer is the first component of the processing pipeline and the only one that can't be replaced by writing to nlp.pipeline
. This is because it has a different signature from all the other components: it takes a text and returns a Doc
, whereas all other components expect to already receive a tokenized Doc
.
To overwrite the existing tokenizer, you need to replace nlp.tokenizer
with a custom function that takes a text and returns a Doc
.
nlp = spacy.blank("en")
nlp.tokenizer = my_tokenizer
Argument | Type | Description |
---|---|---|
text | str | The raw text to tokenize. |
RETURNS | Doc | The tokenized document. |
Example 1: Basic whitespace tokenizer
Here's an example of the most basic whitespace tokenizer. It takes the shared vocab, so it can construct Doc
objects. When it's called on a text, it returns a Doc
object consisting of the text split on single space characters. We can then overwrite the nlp.tokenizer
attribute with an instance of our custom tokenizer.
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([token.text for token in doc])
Example 2: Third-party tokenizers (BERT word pieces)
You can use the same approach to plug in any other third-party tokenizers. Your custom callable just needs to return a Doc
object with the tokens produced by your tokenizer. In this example, the wrapper uses the BERT word piece tokenizer, provided by the tokenizers
library. The tokens available in the Doc
object returned by spaCy now match the exact word pieces produced by the tokenizer.
Custom BERT word piece tokenizer
from tokenizers import BertWordPieceTokenizer
from spacy.tokens import Doc
import spacy

class BertTokenizer:
    def __init__(self, vocab, vocab_file, lowercase=True):
        self.vocab = vocab
        self._tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=lowercase)

    def __call__(self, text):
        tokens = self._tokenizer.encode(text)
        words = []
        spaces = []
        for i, (text, (start, end)) in enumerate(zip(tokens.tokens, tokens.offsets)):
            words.append(text)
            if i < len(tokens.tokens) - 1:
                # If next start != current end we assume a space in between
                next_start, next_end = tokens.offsets[i + 1]
                spaces.append(next_start > end)
            else:
                spaces.append(True)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = BertTokenizer(nlp.vocab, "bert-base-uncased-vocab.txt")
doc = nlp("Justin Drew Bieber is a Canadian singer, songwriter, and actor.")
print(doc.text, [token.text for token in doc])
# [CLS]justin drew bi##eber is a canadian singer, songwriter, and actor.[SEP]
# ['[CLS]', 'justin', 'drew', 'bi', '##eber', 'is', 'a', 'canadian', 'singer',
#  ',', 'songwriter', ',', 'and', 'actor', '.', '[SEP]']
Training with custom tokenization v 3.0
spaCy's training config describes the settings, hyperparameters, pipeline and tokenizer used for constructing and training the pipeline. The [nlp.tokenizer]
block refers to a registered function that takes the nlp
object and returns a tokenizer. Here, we're registering a function called whitespace_tokenizer
in the @tokenizers
registry. To make sure spaCy knows how to construct your tokenizer during training, you can pass in your Python file by setting --code functions.py
when you run spacy train
.
functions.py
@spacy.registry.tokenizers("whitespace_tokenizer")
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)

    return create_tokenizer
Registered functions can also take arguments that are then passed in from the config. This allows you to quickly change and keep track of different settings. Here, the registered function called bert_word_piece_tokenizer
takes two arguments: the path to a vocabulary file and whether to lowercase the text. The Python type hints str
and bool
ensure that the received values have the right type.
functions.py
@spacy.registry.tokenizers("bert_word_piece_tokenizer")
def create_whitespace_tokenizer(vocab_file: str, lowercase: bool):
    def create_tokenizer(nlp):
        return BertTokenizer(nlp.vocab, vocab_file, lowercase)

    return create_tokenizer
To avoid hard-coding local paths into your config file, you can also set the vocab path on the CLI by using the --nlp.tokenizer.vocab_file
override when you run spacy train
. For more details on using registered functions, see the docs in training with custom code.
Using pre-tokenized text
spaCy generally assumes by default that your data is raw text. However, sometimes your data is partially annotated, e.g. with pre-existing tokenization, part-of-speech tags, etc. The most common situation is that you have pre-defined tokenization. If you have a list of strings, you can create a Doc
object directly. Optionally, you can also specify a list of boolean values, indicating whether each word is followed by a space.
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])
If provided, the spaces list must be the same length as the words list. The spaces list affects the doc.text
, span.text
, token.idx
, span.start_char
and span.end_char
attributes. If y'all don't provide a spaces
sequence, spaCy will presume that all words are followed by a space. Once you take a Physician
object, y'all tin can write to its attributes to set the part-of-voice communication tags, syntactic dependencies, named entities and other attributes.
Aligning tokenization
spaCy's tokenization is non-destructive and uses language-specific rules optimized for compatibility with treebank annotations. Other tools and resources can sometimes tokenize things differently – for example, "I'm"
→ ["I", "'", "m"]
instead of ["I", "'m"]
.
In situations like that, you often want to align the tokenization so that you can merge annotations from different sources together, or take vectors predicted by a pretrained BERT model and apply them to spaCy tokens. spaCy's Alignment
object allows the one-to-one mappings of token indices in both directions as well as taking into account indices where multiple tokens align to one single token.
from spacy.training import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(f"a -> b, lengths: {align.x2y.lengths}")  # array([1, 1, 1, 1, 1, 1, 1, 1])
print(f"a -> b, mapping: {align.x2y.data}")  # array([0, 1, 2, 3, 4, 4, 5, 6]) : two tokens both refer to "'s"
print(f"b -> a, lengths: {align.y2x.lengths}")  # array([1, 1, 1, 1, 2, 1, 1])   : the token "'s" refers to two tokens
print(f"b -> a, mappings: {align.y2x.data}")  # array([0, 1, 2, 3, 4, 5, 6, 7])
Here are some insights from the alignment information generated in the example above:
- The one-to-one mappings for the first four tokens are identical, which means they map to each other. This makes sense because they're also identical in the input: "i", "listened", "to" and "obama".
- The value of x2y.data[6] is 5, which means that other_tokens[6] ("podcasts") aligns to spacy_tokens[5] (also "podcasts").
- x2y.data[4] and x2y.data[5] are both 4, which means that both tokens 4 and 5 of other_tokens ("'" and "s") align to token 4 of spacy_tokens ("'s").
Merging and splitting
The Doc.retokenize
context manager lets you merge and split tokens. Modifications to the tokenization are stored and performed all at once when the context manager exits. To merge several tokens into one single token, pass a Span
to retokenizer.merge
. An optional dictionary of attrs
lets you set attributes that will be assigned to the merged token – for example, the lemma, part-of-speech tag or entity type. By default, the merged token will receive the same attributes as the merged span's root.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
print("After:", [token.text for token in doc])
If an attribute in the attrs
is a context-dependent token attribute, it will be applied to the underlying Token
. For example LEMMA
, POS
or DEP
only apply to a word in context, so they're token attributes. If an attribute is a context-independent lexical attribute, it will be applied to the underlying Lexeme
, the entry in the vocabulary. For example, LOWER
or IS_STOP
apply to all words of the same spelling, regardless of the context.
Splitting tokens
The retokenizer.split
method allows splitting one token into two or more tokens. This can be useful for cases where tokenization rules alone aren't sufficient. For example, you might want to split "its" into the tokens "it" and "is" – but not the possessive pronoun "its". You can write rule-based logic that can find only the correct "its" to split, but by that time, the Doc
will already be tokenized.
This process of splitting a token requires more settings, because you need to specify the text of the individual tokens, optional per-token attributes and how the tokens should be attached to the existing syntax tree. This can be done by supplying a list of heads
– either the token to attach the newly split token to, or a (token, subtoken)
tuple if the newly split token should be attached to another subtoken. In this case, "New" should be attached to "York" (the second split subtoken) and "York" should be attached to "in".
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in NewYork")
print("Before:", [token.text for token in doc])
displacy.render(doc)  # displacy.serve if you're not in a Jupyter environment

with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]
    attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
print("After:", [token.text for token in doc])
displacy.render(doc)  # displacy.serve if you're not in a Jupyter environment
Specifying the heads as a list of token
or (token, subtoken)
tuples allows attaching split subtokens to other subtokens, without having to keep track of the token indices after splitting.
Token | Head | Description |
---|---|---|
"New" | (doc[3], 1) | Attach this token to the second subtoken (index 1) that doc[3] will be split into, i.e. "York". |
"York" | doc[2] | Attach this token to doc[2] in the original Doc, i.e. "in". |
If you don't care about the heads (for example, if you're only running the tokenizer and not the parser), you can attach each subtoken to itself:
doc = nlp("I live in NewYorkCity")
with doc.retokenize() as retokenizer:
    heads = [(doc[3], 0), (doc[3], 1), (doc[3], 2)]
    retokenizer.split(doc[3], ["New", "York", "City"], heads=heads)
Overwriting custom extension attributes
If you've registered custom extension attributes, you can overwrite them during tokenization by providing a dictionary of attribute names mapped to new values as the "_"
key in the attrs
. For merging, you need to provide one dictionary of attributes for the resulting merged token. For splitting, you need to provide a list of dictionaries with custom attributes, one per split subtoken.
import spacy
from spacy.tokens import Token

# Register a custom token attribute, token._.is_musician
Token.set_extension("is_musician", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like David Bowie")
print("Before:", [(token.text, token._.is_musician) for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4], attrs={"_": {"is_musician": True}})
print("After:", [(token.text, token._.is_musician) for token in doc])
Sentence Segmentation
A Doc
object's sentences are available via the Doc.sents
property. To view a Doc
's sentences, you can iterate over the Doc.sents
, a generator that yields Span
objects. You can check whether a Doc
has sentence boundaries by calling Doc.has_annotation
with the attribute name "SENT_START"
.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)
spaCy provides four alternatives for sentence segmentation:
- Dependency parser: the statistical
DependencyParser
provides the most accurate sentence boundaries based on full dependency parses. - Statistical sentence segmenter: the statistical
SentenceRecognizer
is a simpler and faster alternative to the parser that only sets sentence boundaries. - Rule-based pipeline component: the rule-based
Sentencizer
sets sentence boundaries using a customizable list of sentence-final punctuation. - Custom function: your own custom function added to the processing pipeline can set sentence boundaries by writing to
Token.is_sent_start
.
Default: Using the dependency parse Needs model
Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually the most accurate approach, but it requires a trained pipeline that provides accurate predictions. If your texts are closer to general-purpose news or web text, this should work well out-of-the-box with spaCy's provided trained pipelines. For social media or conversational text that doesn't follow the same rules, your application may benefit from a custom trained or rule-based component.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
spaCy's dependency parser respects already set boundaries, so you can preprocess your Doc
using custom components before it's parsed. Depending on your text, this may also improve parse accuracy, since the parser is constrained to predict parses consistent with the sentence boundaries.
Statistical sentence segmenter v 3.0 Needs model
The SentenceRecognizer
is a simple statistical component that only provides sentence boundaries. Along with being faster and smaller than the parser, its primary advantage is that it's easier to train because it only requires annotated sentence boundaries rather than full dependency parses. spaCy's trained pipelines include both a parser and a trained sentence segmenter, which is disabled by default. If you only need sentence boundaries and no parser, you can use the exclude
or disable
argument on spacy.load
to load the pipeline without the parser and then enable the sentence recognizer explicitly with nlp.enable_pipe
.
import spacy

nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
Rule-based pipeline component
The Sentencizer
component is a pipeline component that splits sentences on punctuation like .
, !
or ?
. You can plug it into your pipeline if you only need sentence boundaries without dependency parses.
import spacy
from spacy.lang.en import English

nlp = English()  # just the language with no pipeline
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
Custom rule-based strategy
If you want to implement your own strategy that differs from the default rule-based approach of splitting on sentences, you can also create a custom pipeline component that takes a Doc
object and sets the Token.is_sent_start
attribute on each individual token. If set to False
, the token is explicitly marked as not the start of a sentence. If set to None
(default), it's treated as a missing value and can still be overwritten by the parser.
Here's an example of a component that implements a pre-processing rule for splitting on "..."
tokens. The component is added before the parser, which is then used to further segment the text. That's possible, because is_sent_start
is only set to True
for some of the tokens – all others still specify None
for unset sentence boundaries. This approach can be useful if you want to implement additional rules specific to your data, while still being able to take advantage of dependency-based sentence segmentation.
from spacy.language import Language
import spacy

text = "this is a sentence...hello...and another sentence."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Before:", [sent.text for sent in doc.sents])

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe("set_custom_boundaries", before="parser")
doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])
Mappings & Exceptions v 3.0
The AttributeRuler
manages rule-based mappings and exceptions for all token-level attributes. As the number of pipeline components has grown from spaCy v2 to v3, handling rules and exceptions in each component individually has become impractical, so the AttributeRuler
provides a single component with a unified pattern format for all token attribute mappings and exceptions.
The AttributeRuler
uses Matcher
patterns to identify tokens and then assigns them the provided attributes. If needed, the Matcher
patterns can include context around the target token. For example, the attribute ruler can:
- provide exceptions for any token attributes
- map fine-grained tags to coarse-grained tags for languages without statistical morphologizers (replacing the v2.x tag_map in the language data)
- map token surface form + fine-grained tags to morphological features (replacing the v2.x morph_rules in the language data)
- specify the tags for space tokens (replacing hard-coded behavior in the tagger)
The following example shows how the tag and POS NNP/PROPN can be specified for the phrase "The Who", overriding the tags provided by the statistical tagger and the POS tag map.
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"
doc1 = nlp(text)
print(doc1[2].tag_, doc1[2].pos_)  # DT DET
print(doc1[3].tag_, doc1[3].pos_)  # WP PRON

# Add attribute ruler with exception for "The Who" as NNP/PROPN NNP/PROPN
ruler = nlp.get_pipe("attribute_ruler")
# Pattern to match "The Who"
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
# The attributes to assign to the matched token
attrs = {"TAG": "NNP", "POS": "PROPN"}
# Add rules to the attribute ruler
ruler.add(patterns=patterns, attrs=attrs, index=0)  # "The" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=1)  # "Who" in "The Who"

doc2 = nlp(text)
print(doc2[2].tag_, doc2[2].pos_)  # NNP PROPN
print(doc2[3].tag_, doc2[3].pos_)  # NNP PROPN
# The second "Who" remains unmodified
print(doc2[5].tag_, doc2[5].pos_)  # WP PRON
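The same mechanism covers the morphological-feature mappings listed above. The following is a small sketch that is not part of the original example – the pattern and the feature value are made up for illustration and simply force a plural analysis onto the token "data":

import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.get_pipe("attribute_ruler")
# Illustrative exception: treat "data" as morphologically plural
patterns = [[{"LOWER": "data"}]]
attrs = {"MORPH": "Number=Plur"}
ruler.add(patterns=patterns, attrs=attrs)

doc = nlp("The data are stored on disk.")
print(doc[1].text, doc[1].morph)  # e.g. data Number=Plur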
Word vectors and semantic similarity
Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec and usually look like this:
banana.vector
array([ 2.02280000e-01, -7.66180009e-02,  3.70319992e-01,  3.28450017e-02,
       -4.19569999e-01,  7.20689967e-02, -3.74760002e-01,  5.74599989e-02,
       -1.24009997e-02,  5.29489994e-01, -5.23800015e-01, -1.97710007e-01,
       -3.41470003e-01,  5.33169985e-01, -2.53309999e-02,  1.73800007e-01,
        1.67720005e-01,  8.39839995e-01,  5.51070012e-02,  1.05470002e-01,
        3.78719985e-01,  2.42750004e-01,  1.47449998e-02,  5.59509993e-01,
        1.25210002e-01, -6.75960004e-01,  3.58420014e-01,
        # ... and so on ...
        3.66849989e-01,  2.52470002e-03, -6.40089989e-01, -2.97650009e-01,
        7.89430022e-01,  3.31680000e-01, -1.19659996e+00, -4.71559986e-02,
        5.31750023e-01], dtype=float32)
Pipeline packages that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalize vectors.
import spacy

nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
The words "dog", "cat" and "banana" are all pretty common in English, so they're part of the pipeline's vocabulary, and come with a vector. The word "afskfsd" on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of 0, which means it's practically nonexistent. If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger pipeline packages or loading in a full vector package, for example, en_core_web_lg, which includes 685k unique vectors.
spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest content to a user that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an already existing one.
Each Doc, Span, Token and Lexeme comes with a .similarity method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective – whether two words, spans or documents are similar really depends on how you're looking at it. spaCy's similarity implementation usually assumes a pretty general-purpose definition of similarity.
import spacy

nlp = spacy.load("en_core_web_md")  # make sure to use larger package!
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
What to expect from similarity results
Computing similarity scores can be helpful in many situations, but it's also important to maintain realistic expectations about what information it can provide. Words can be related to each other in many ways, so a single "similarity" score will always be a mix of different signals, and vectors trained on different data can produce very different results that may not be useful for your purpose. Here are some important considerations to keep in mind:
- There's no objective definition of similarity. Whether "I like burgers" and "I like pasta" is similar depends on your application. Both talk about food preferences, which makes them very similar – but if you're analyzing mentions of food, those sentences are pretty dissimilar, because they talk about very different foods.
- The similarity of Doc and Span objects defaults to the average of the token vectors. This means that the vector for "fast food" is the average of the vectors for "fast" and "food", which isn't necessarily representative of the phrase "fast food".
- Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with different wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings (see the sketch after this list).
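As a quick illustration of the order insensitivity described in the last point, here is a short sketch that is not part of the original document: both texts contain exactly the same tokens, so their averaged Doc vectors are identical and the similarity comes out at essentially 1.0, even though the meanings differ.

import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("dog bites man")
doc2 = nlp("man bites dog")
# Same tokens, same averaged vector – word order is ignored entirely
print(doc1.similarity(doc2))  # ~1.0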
Adding word vectors
Custom word vectors can be trained using a number of open-source libraries, such as Gensim, FastText, or Tomas Mikolov's original Word2vec implementation. Most word vector libraries output an easy-to-read text-based format, where each line consists of the word followed by its vector. For everyday use, we want to convert the vectors into a binary format that loads faster and takes up less space on disk. The easiest way to do this is the init vectors command-line utility. This will output a blank spaCy pipeline in the directory /tmp/la_vectors_wiki_lg, giving you access to some nice Latin vectors. You can then pass the directory path to spacy.load or use it in the [initialize] block of your config when you train a model.
wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.la.300.vec.gz
python -m spacy init vectors en cc.la.300.vec.gz /tmp/la_vectors_wiki_lg
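To make the last step concrete, here's a minimal sketch that is not part of the original document; it assumes the /tmp/la_vectors_wiki_lg directory created by the command above exists, loads the vectors-only pipeline and notes where the same path goes in a training config:

import spacy

# Load the vectors-only pipeline created by "spacy init vectors"
nlp_vectors = spacy.load("/tmp/la_vectors_wiki_lg")
print(nlp_vectors.vocab.vectors.shape)  # (number of vectors, vector width)

# In a training config, the same path is referenced in the [initialize] block:
#
# [initialize]
# vectors = "/tmp/la_vectors_wiki_lg"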
To help you strike a good balance between coverage and memory usage, spaCy's Vectors class lets you map multiple keys to the same row of the table. If you're using the spacy init vectors command to create a vocabulary, pruning the vectors will be taken care of automatically if you set the --prune flag. You can also do it manually in the following steps:
- Start with a word vectors package that covers a huge vocabulary. For example, the en_core_web_lg package provides 300-dimensional GloVe vectors for 685k terms of English.
- If your vocabulary has values set for the Lexeme.prob attribute, the lexemes will be sorted by descending probability to determine which vectors to prune. Otherwise, lexemes will be sorted by their order in the Vocab.
- Call Vocab.prune_vectors with the number of vectors you want to keep.
import spacy

nlp = spacy.load("en_core_web_lg")
n_vectors = 105000  # number of vectors to keep
removed_words = nlp.vocab.prune_vectors(n_vectors)

assert len(nlp.vocab.vectors) <= n_vectors  # unique vectors have been pruned
assert nlp.vocab.vectors.n_keys > n_vectors  # but not the total entries
Vocab.prune_vectors reduces the current vector table to a given number of unique entries, and returns a dictionary containing the removed words, mapped to (string, score) tuples, where string is the entry the removed word was mapped to and score the similarity score between the two words.
Removed words
{ "Shore" : ( "coast" , 0.732257 ) , "Precautionary" : ( "caution" , 0.490973 ) , "hopelessness" : ( "sadness" , 0.742366 ) , "Continous" : ( "continuous" , 0.732549 ) , "Disemboweled" : ( "corpse" , 0.499432 ) , "biostatistician" : ( "scientist" , 0.339724 ) , "somewheres" : ( "somewheres" , 0.402736 ) , "observing" : ( "observe" , 0.823096 ) , "Leaving" : ( "leaving" , 1.0 ) , }
In the example above, the vector for "Shore" was removed and remapped to the vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to the vector of "leaving", which is identical. If you're using the init vectors command, you can set the --prune option to easily reduce the size of the vectors as you add them to a spaCy pipeline:
python -m spacy init vectors en la.300d.vec.tgz /tmp/la_vectors_web_md --prune 10000
This will create a blank spaCy pipeline with vectors for the first 10,000 words in the vectors. All other words in the vectors are mapped to the closest vector among those retained.
Adding vectors individually
The vector attribute is a read-only numpy or cupy array (depending on whether you've configured spaCy to use GPU memory), with dtype float32. The array is read-only so that spaCy can avoid unnecessary copy operations where possible. You can modify the vectors via the Vocab or Vectors table. Using the Vocab.set_vector method is often the easiest approach if you have vectors in an arbitrary format, as you can read in the vectors with your own logic, and just set them with a simple loop. This method is likely to be slower than approaches that work with the whole vectors table at once, but it's a great approach for one-off conversions before you save out your nlp object to disk.
Adding vectors
import numpy
from spacy.vocab import Vocab

vector_data = {
    "dog": numpy.random.uniform(-1, 1, (300,)),
    "cat": numpy.random.uniform(-1, 1, (300,)),
    "orange": numpy.random.uniform(-1, 1, (300,)),
}
vocab = Vocab()
for word, vector in vector_data.items():
    vocab.set_vector(word, vector)
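For the "read in the vectors with your own logic" case described above, a rough sketch might look like the following. The file name custom_vectors.txt and its layout (one word per line, followed by whitespace-separated float values of a consistent width) are assumptions for illustration only:

import numpy
import spacy

nlp = spacy.blank("en")

# custom_vectors.txt is a hypothetical text file: one word per line,
# followed by its whitespace-separated float components
with open("custom_vectors.txt", encoding="utf8") as f:
    for line in f:
        word, *values = line.rstrip().split()
        nlp.vocab.set_vector(word, numpy.asarray(values, dtype="float32"))

# One-off conversion: save out the nlp object with the new vectors
nlp.to_disk("/tmp/nlp_with_custom_vectors")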
Language Data
Every language is different – and usually full of exceptions and special cases, especially among the most common words. Some of these exceptions are shared across languages, while others are entirely specific – usually so specific that they need to be hard-coded. The lang module contains all language-specific data, organized in simple Python files. This makes the data easy to update and extend.
The shared language data in the directory root includes rules that can be generalized across languages – for example, rules for basic punctuation, emoji, emoticons and single-letter abbreviations. The individual language data in a submodule contains rules that are only relevant to a particular language. It also takes care of putting together all components and creating the Language subclass – for example, English or German. The values are defined in the Language.Defaults.
Name | Description |
---|---|
Stop words stop_words.py | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return True for is_stop. |
Tokenizer exceptions tokenizer_exceptions.py | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
Punctuation rules punctuation.py | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
Character classes char_classes.py | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. |
Lexical attributes lex_attrs.py | Custom functions for setting lexical attributes on tokens, e.g. like_num, which includes language-specific words like "ten" or "hundred". |
Syntax iterators syntax_iterators.py | Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks. |
Lemmatizer lemmatizer.py spacy-lookups-data | Custom lemmatizer implementation and lemmatization tables. |
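As a small sketch (not from the original document) of how two of these data files surface in practice: the stop word list drives Token.is_stop, and the lexical attribute functions drive flags like Token.like_num.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I bought ten apples and one banana.")
for token in doc:
    # is_stop comes from stop_words.py, like_num from lex_attrs.py
    print(token.text, token.is_stop, token.like_num)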
Creating a custom language subclass
If you want to customize multiple components of the language data or add support for a custom language or domain-specific "dialect", you can also implement your own language subclass. The subclass should define two attributes: the lang (unique language code) and the Defaults defining the language data. For an overview of the available attributes that can be overwritten, see the Language.Defaults documentation.
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

nlp1 = English()
nlp2 = CustomEnglish()

print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
The @spacy.registry.languages decorator lets you register a custom language class and assign it a string name. This means that you can call spacy.blank with your custom language name, and even train pipelines with it and refer to it in your training config.
Registering a custom language
import spacy
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

# This now works! 🎉
nlp = spacy.blank("custom_en")
Source: https://spacy.io/usage/linguistic-features/