NLP & Topic Modelling

Jun.-Prof. Dr. Mark Hall

Winter semester 2018/19

NLP Pipeline

  1. Tokenisation
  2. POS-tagging
  3. Dependency tagging
  4. Named Entity Recognition

Tokenisation

Source text

'Dies ist ein Beispiel für einen Text.'

is tokenised into

['Dies', 'ist', 'ein', 'Beispiel', 'für', 'einen', 'Text', '.']

Tokenisation

'Hier ist das Büro von Dr. Hall.'

must be tokenised into

['Hier', 'ist', 'das', 'Büro', 'von', 'Dr.', 'Hall', '.']

and not into

['Hier', 'ist', 'das', 'Büro', 'von', 'Dr', '.', 'Hall', '.']

Tokenisation

'Dies ist die Haus-aufgabe.'
['Dies', 'ist', 'die', 'Haus-aufgabe', '.']

'BAH!—Phyllis: You look very sheepish.—Corydon: I am thinking of ewe.—Ariel.'

Wrong

['BAH!—Phyllis', ':', 'You', 'look', 'very', 'sheepish.—Corydon', ':',
 'I', 'am', 'thinking', 'of', 'ewe.—Ariel', '.']

Correct

['BAH', '!', '—', 'Phyllis', ':', 'You', 'look', 'very', 'sheepish', '.', '—',
 'Corydon', ':', 'I', 'am', 'thinking', 'of', 'ewe', '.', '—', 'Ariel', '.']
      

POS-Tagging

['Dies', 'ist', 'ein', 'Beispiel', 'für', 'einen', 'Text', '.']

is tagged as

['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT']

POS-Tagging

['Dies', 'ist', 'ein', 'Beispiel', 'für', 'einen', 'Text', '.']

should be tagged as

['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT']

but is tagged as

['PRON', 'AUX', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT']
      

Dependency Tagging

Dependency graph for the text 'This is an example sentence.'

Named Entity Recognition

['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle', '.']
      
[('Beatles Museum', 'ORG'), ('Halle', 'LOC')]
      

Word2Vec

Word + context → vector space

['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle']
      
[['Das', 'Beatles', 'Museum'], ['Beatles', 'Museum', 'ist'],
 ['Museum', 'ist', 'in'], ['ist', 'in', 'Halle']]
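The context windows above can be generated with a simple sliding window (a minimal sketch, window size 3):

```python
def context_windows(tokens, size=3):
    """Yield every run of `size` consecutive tokens."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

tokens = ['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle']
print(context_windows(tokens))
# → [['Das', 'Beatles', 'Museum'], ['Beatles', 'Museum', 'ist'],
#    ['Museum', 'ist', 'in'], ['ist', 'in', 'Halle']]
```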
      

Topic Modelling

Dictionary generation

['Das', 'Beatles', 'Museum', 'ist', 'in', 'Halle']

after stop-word removal

['Beatles', 'Museum', 'Halle']

mapped to ids

{'Beatles': 0, 'Museum': 1, 'Halle': 2}
      

Latent Semantic Indexing

Latent Dirichlet Allocation

Code Examples

import spacy

nlp = spacy.load('en_core_web_lg')

tokens = nlp("The sun's light is a shadow compared to your beauty")

tokens[0].text == 'The'
tokens[0].lemma_ == 'the'
tokens[0].pos_ == 'DET'
tokens[0].is_stop == False
      

Code Examples

import spacy
from gensim.corpora import Dictionary

nlp = spacy.load('en_core_web_lg')

dictionary = Dictionary()

with open('ap.txt') as input_file:
    for line in input_file:
        tokens = nlp(line)
        # One document per line: keep the lemmas of all non-stopword tokens
        dictionary.add_documents([[t.lemma_ for t in tokens if not t.is_stop]])

# Drop very rare and very frequent tokens, then remove gaps in the ids
dictionary.filter_extremes()
dictionary.compactify()

Code Examples

corpus = []

with open('ap.txt') as input_file:
    for line in input_file:
        tokens = nlp(line)
        # Bag-of-words vector (list of (token id, count) pairs) per document
        corpus.append(dictionary.doc2bow([t.lemma_ for t in tokens if not t.is_stop]))

Exercise