Natural Language Processing and Text Mining

This course covers the main topics in computational Natural Language Processing, with an emphasis on written text. We will study a broad range of techniques from computational linguistics, statistical language analysis, text mining, and machine learning, applied to problems such as word sense disambiguation, syntactic analysis, machine translation, text classification and clustering, sentiment analysis, and authorship attribution, among others.

Instructors

Manuel Montes-y-Gómez, PhD - INAOE - Mexico
Thamar Solorio, PhD - University of Houston - USA
Sergio Jiménez, PhD - Universidad Nacional de Colombia - Colombia
Fabio A. González, PhD - Universidad Nacional de Colombia - Colombia

Classroom

103 - 401

Course topics

1 Introduction to NLP (Sergio Jiménez)

Linguistics background, basic text processing, language models, collocations, textual and lexical similarity, word sense disambiguation

2 Parsing and translation (Thamar Solorio)

POS tagging & formal grammars of English, syntactic and statistical parsing, semantic role labeling, machine translation, code switching

3 Text mining (Manuel Montes-y-Gómez)

Text classification, text clustering, distributional semantics, distributed representations, authorship attribution, author profiling

4 Advanced machine learning models for NLP (Fabio González)

Neural networks, deep learning, word embeddings, recurrent neural networks

Evaluation and grading policy

4 credits in 64 hours.
During the first days of the course, you can decide whether to receive a grade, a certificate, both, or neither.

Grades

Course resources

References and resources

D. Jurafsky and J. H. Martin, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”, 3rd Ed.
Bird S., Klein E., Loper E., “Natural language processing with Python”, O’Reilly Media, Inc., 2009.
Feldman R., Sanger J., “The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data”, Cambridge University Press, 2006.
Srivastava A., Sahami M., “Text Mining: Classification, Clustering, and Applications”, Chapman and Hall, 2009.

Course schedule

Date	Topic	Material	Assignments
Jun 13	1. Introduction to NLP
Jun 14	1.1 NLP goals. 1.2 Introduction to lexical similarity Slides	LexSim V1 Words DataSet
Jun 15	1.3 Lexical similarity functions implementation	LexSim V2 Results lexical similarity
Jun 16	1.4 Workshop text similarity	LexSim V3 Text similarity DataSet Must read: Computational Linguistics and Deep Learning word2vec Google News Pre-trained Model	Assignment 1 Files
Jun 17	2. Parsing and translation 2.1 Introduction Slides 2.2 Pre-processing Slides Regular expressions Tokenization Word Normalization and Stemming Sentence Segmentation and Decision Trees		Must read: Computational Linguistics and Deep Learning
Jun 20	2.3 Language Models Slides N-grams Estimating N-gram Probabilities Evaluation and Perplexity Smoothing 2.4 Hidden Markov Models Example	New notebook tutorial
Jun 21	2.5 Word classes and part of speech tagging Slides	POS Tagging Exercise V1 Notebook (download) Language Models with KenLM Mac OS X Notebook (download)
Jun 22	2.6 Parsing Slides 2.7 Statistical parsing Slides	Example CKY Example Prob CKY	Assignment 2 Notebook (download)
Jun 23	3. Text mining 3.1 Introduction to text classification Slides BoW representation Classification methods Evaluation
Jun 24	3.2 Beyond the Bag-of-Words representation Slides word n-grams Linguistic features word senses as features concept-based representations
Jun 27	3.3 Non Conventional text classification techniques Slides Semi-supervised text classification One Class text classification Set-based text classification Cross-language text classification
Jun 28	3.4 Authorship Analysis Slides Methods for Authorship Attribution Methods for Author Profiling		Assignment 3 Poems
Jun 29	4. Advanced machine learning models for NLP Slides 4.1 Introduction Introduction to Machine Learning Introduction to Neural Networks Notebook 4.2 Word Embeddings Notebook	Introduction video	Assignment 4.1 (download)
Jun 30	4.2 Recurrent Neural Networks Notebook	Neural Networks video Softmax video	Assignment 4.2 (download) Bible model - biblia.txt Alternative model - reg1.txt
Jul 1	4.3 Applications Image Captioning Slides CNN for text Slides Word2Vec in SO Slides

Contact

Academic coordination

Eng. Fabio A. González

Email: fagonzalezo@unal.edu.co

Phone: 3165000 ext. 14077/14011

Teaching assistant

Lina F. Rosales

Email: lfrosalesc@unal.edu.co