NLP & Topic Modelling

Jun.-Prof. Dr. Mark Hall

Winter semester 2018/19

NLP Pipeline

Tokenisation
POS-tagging
Dependency tagging
Named Entity Recognition

Tokenisation

Source text

'Dies ist ein Beispiel für einen Text.'

is tokenised into

['Dies', 'ist', 'ein', 'Beispiel', 'für', 'einen', 'Text', '.']
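
The split shown above can be approximated with a single regular expression. This is only a minimal sketch, assuming a rule of "runs of word characters, or single punctuation marks" rather than a trained tokeniser; the function name `simple_tokenise` is hypothetical.

```python
import re

def simple_tokenise(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits (umlauts included in Python 3);
    # [^\w\s] matches one character that is neither word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenise('Dies ist ein Beispiel für einen Text.'))
```

A rule this simple works for the example sentence but breaks on abbreviations and compound punctuation.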

Tokenisation

'Hier ist das Büro von Dr. Hall.'

must be tokenised into

['Hier', 'ist', 'das', 'Büro', 'von', 'Dr.', 'Hall', '.']

and not into

['Hier', 'ist', 'das', 'Büro', 'von', 'Dr', '.', 'Hall', '.']
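
One common way to keep 'Dr.' in one piece is an exception list of known abbreviations. The sketch below assumes a tiny, hypothetical list (`ABBREVIATIONS`) and a hypothetical function name; real tokenisers ship far larger exception tables.

```python
import string

# Hypothetical abbreviation list, just large enough for the example.
ABBREVIATIONS = {'Dr.', 'Prof.', 'Jun.-Prof.'}

def tokenise_with_abbreviations(text):
    """Whitespace-split, then peel off trailing punctuation
    unless the chunk is a known abbreviation."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep the abbreviation intact
            continue
        trailing = []
        while chunk and chunk[-1] in string.punctuation:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))  # restore original punctuation order
    return tokens

print(tokenise_with_abbreviations('Hier ist das Büro von Dr. Hall.'))
```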

Tokenisation

'Dies ist die Haus-aufgabe.'
['Dies', 'ist', 'die', 'Haus-aufgabe', '.']

'BAH!—Phyllis: You look very sheepish.—Corydon: I am thinking of ewe.—Ariel.'

Wrong

['BAH!—Phyllis', ':', 'You', 'look', 'very', 'sheepish.—Corydon', ':',
 'I', 'am', 'thinking', 'of', 'ewe.—Ariel', '.']

Correct

['BAH', '!', '—', 'Phyllis', ':', 'You', 'look', 'very', 'sheepish', '.', '—',
 'Corydon', ':', 'I', 'am', 'thinking', 'of', 'ewe', '.', '—', 'Ariel', '.']
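
Both cases on this slide (the hyphen inside a word stays, the em dash between sentences is split off) can be covered by one pattern. This is a sketch under the assumption that a hyphen between word characters always belongs to the word; the names `TOKEN_PATTERN` and `tokenise` are hypothetical.

```python
import re

# A word may contain internal hyphens; every other non-space symbol
# (including the em dash) becomes a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenise(text):
    """Return word tokens (hyphenated words kept whole) and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenise('Dies ist die Haus-aufgabe.'))
print(tokenise('BAH!—Phyllis: You look very sheepish.—Corydon: I am thinking of ewe.—Ariel.'))
```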

POS-Tagging

['Dies', 'ist', 'ein', 'Beispiel', 'für', 'einen', 'Text', '.']

is tagged as

['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT']

POS-Tagging

['Dies', 'ist', 'ein', 'Beispiel', 'für', 'einen', 'Text', '.']

should be tagged as

['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT']

is tagged as

['PRON', 'AUX', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT']
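
The difference between the two tag sequences can be located programmatically. The sketch below uses only the lists from this slide; the helper name `tag_mismatches` is hypothetical.

```python
tokens = ['Dies', 'ist', 'ein', 'Beispiel', 'für', 'einen', 'Text', '.']
expected = ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT']
predicted = ['PRON', 'AUX', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT']

def tag_mismatches(tokens, expected, predicted):
    """Return (index, token, expected_tag, predicted_tag) for every disagreement."""
    return [(i, tok, exp, pred)
            for i, (tok, exp, pred) in enumerate(zip(tokens, expected, predicted))
            if exp != pred]

mismatches = tag_mismatches(tokens, expected, predicted)
accuracy = 1 - len(mismatches) / len(tokens)
print(mismatches)  # [(1, 'ist', 'VERB', 'AUX')]
print(accuracy)    # 0.875
```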

Dependency Tagging

Dependency graph for the text "This is an example sentence."
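
A dependency graph can be stored as one (head index, relation) pair per token. The arcs below are hand-written for illustration and are an assumption, not the output of any model; the relation labels and helper names are likewise illustrative. The root is encoded as a token whose head is itself, a convention spaCy also uses.

```python
tokens = ['This', 'is', 'an', 'example', 'sentence', '.']
# Hand-written (head index, relation) pairs; illustrative, not model output.
arcs = [(1, 'nsubj'), (1, 'ROOT'), (4, 'det'), (4, 'compound'), (1, 'attr'), (1, 'punct')]

def root_index(arcs):
    """The root is the token whose head is itself."""
    return next(i for i, (head, _) in enumerate(arcs) if head == i)

def children(arcs, head_index):
    """Indices of all tokens attached directly to the given head."""
    return [i for i, (head, _) in enumerate(arcs)
            if head == head_index and i != head_index]

root = root_index(arcs)
print(tokens[root])                                # is
print([tokens[i] for i in children(arcs, root)])   # ['This', 'sentence', '.']
print([tokens[i] for i in children(arcs, 4)])      # ['an', 'example']
```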

Named Entity Recognition

['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle', '.']

[('Beatles Museum', 'ORG'), ('Halle', 'LOC')]
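
Entity recognisers often label each token first and merge the labels into spans afterwards. The sketch below assumes hand-assigned BIO tags (B = beginning, I = inside, O = outside) for the example sentence; the tags and the function name `bio_to_entities` are illustrative, not model output.

```python
tokens = ['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle', '.']
# Hand-assigned BIO tags (an assumption for illustration).
tags = ['O', 'B-ORG', 'I-ORG', 'O', 'O', 'B-LOC', 'O']

def bio_to_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (text, label) entity tuples."""
    entities = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            if current:  # close any entity still open
                entities.append((' '.join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith('I-') and current and tag[2:] == label:
            current.append(token)  # extend the open entity
        else:
            if current:
                entities.append((' '.join(current), label))
            current, label = [], None
    if current:
        entities.append((' '.join(current), label))
    return entities

print(bio_to_entities(tokens, tags))
```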

Word2Vec

Word + context → vector space

['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle']

[['Das', 'Beatles', 'Museum'], ['Beatles', 'Museum', 'ist'],
 ['Museum', 'ist', 'in'], ['ist', 'in', 'Halle']]
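
The context windows above are a sliding window of three tokens over the sentence. A minimal sketch follows, with the hypothetical helper name `context_windows`; Word2Vec implementations build their contexts internally, so this only reproduces the slide.

```python
def context_windows(tokens, size=3):
    """Return every run of `size` consecutive tokens."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

sentence = ['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle']
for window in context_windows(sentence):
    print(window)
```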

Topic Modelling

Dictionary generation

['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle']

['Beatles', 'Museum', 'Halle']

{'Beatles': 0, 'Museum': 1, 'Halle': 2}
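
The three steps on this slide (tokens, stop-word removal, id assignment) can be sketched in plain Python. The stop-word list `STOP_WORDS` is a hypothetical one, just large enough for the example, and ids are assigned in order of first appearance; a library such as gensim may assign them in a different order.

```python
# Hypothetical stop-word list for the example sentence.
STOP_WORDS = {'das', 'ist', 'in'}

def build_dictionary(documents):
    """Map each non-stop-word token to a consecutive integer id."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            if token.lower() in STOP_WORDS:
                continue
            if token not in token2id:
                token2id[token] = len(token2id)  # next free id
    return token2id

print(build_dictionary([['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle']]))
```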

Latent Semantic Indexing

Latent Dirichlet Allocation

Code examples

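import spacy

# Load the large English model (install it first with
# `python -m spacy download en_core_web_lg`)
nlp = spacy.load('en_core_web_lg')

tokens = nlp("The sun's light is a shadow compared to your beauty")

# Attributes of the first token
tokens[0].text == 'The'
tokens[0].lemma_ == 'the'
tokens[0].pos_ == 'DET'
# As on the original slide; newer spaCy releases may return True here
# because they match stop words case-insensitively
tokens[0].is_stop == False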

Code examples

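import spacy
from gensim.corpora import Dictionary

nlp = spacy.load('en_core_web_lg')

dictionary = Dictionary()

# Each line of ap.txt is treated as one document
with open('ap.txt') as input_file:
    for line in input_file:
        tokens = nlp(line)
        # Add the lemmas of all non-stop-word tokens to the dictionary
        dictionary.add_documents([[t.lemma_ for t in tokens if not t.is_stop]])

# Drop very rare and very frequent tokens, then reassign ids without gaps
dictionary.filter_extremes()
dictionary.compactify()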
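
To make the filtering step more concrete, the sketch below imitates the idea behind `filter_extremes` in plain Python: tokens that occur in too few documents, or in too large a share of them, are dropped. It is a rough approximation, not gensim's implementation; the defaults `no_below=5` and `no_above=0.5` mirror gensim's documented defaults, the `keep_n` limit is left out, and the function name and sample documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(documents, no_below=5, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    the fraction `no_above` of all documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    n_docs = len(documents)
    return {token for token, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}

# Made-up mini corpus for illustration.
docs = [['sun', 'light'], ['sun', 'shadow'], ['sun', 'beauty'], ['light', 'beauty']]
print(sorted(filter_by_document_frequency(docs, no_below=2, no_above=0.5)))
# ['beauty', 'light']
```

Here 'sun' is removed because it appears in three of four documents, and 'shadow' because it appears in only one.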

Code examples

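# Continues from the previous example: reuses `nlp` and `dictionary`
corpus = []

with open('ap.txt') as input_file:
    for line in input_file:
        tokens = nlp(line)
        # Convert each document into a bag of words: a list of (token id, count) pairs
        corpus.append(dictionary.doc2bow([t.lemma_ for t in tokens if not t.is_stop]))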
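
What a bag-of-words conversion produces can be shown without any library. The sketch below is a plain-Python stand-in for the idea behind `doc2bow`, assuming a small hand-written `token2id` mapping; tokens missing from the mapping are skipped. The function name `to_bag_of_words` and the sample tokens are hypothetical.

```python
from collections import Counter

def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {'Beatles': 0, 'Museum': 1, 'Halle': 2}
print(to_bag_of_words(['Beatles', 'Museum', 'Halle', 'Museum', 'Stadt'], token2id))
# [(0, 1), (1, 2), (2, 1)]
```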

Exercise