Natural Language Processing and Text Mining

This course covers the main topics in computational Natural Language Processing, with an emphasis on written text. We will study a broad range of techniques from computational linguistics, statistical language analysis, text mining, and machine learning, applied to problems such as word sense disambiguation, syntactic analysis, machine translation, text classification and clustering, sentiment analysis, and authorship attribution, among others.

Cátedra Internacional de Ingeniería


  1. Manuel Montes-y-Gómez, PhD - INAOE - Mexico
  2. Thamar Solorio, PhD - University of Houston - USA
  3. Sergio Jiménez, PhD - Universidad Nacional de Colombia - Colombia
  4. Fabio A. González, PhD - Universidad Nacional de Colombia - Colombia


103 - 401

Course topics

1 Introduction to NLP (Sergio Jiménez)

Linguistics background, basic text processing, language models, collocations, textual and lexical similarity, word sense disambiguation.
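
As a rough illustration of the lexical similarity topic, the sketch below scores two words by their Levenshtein edit distance, normalized by the longer word's length. It is not the course's LexSim code; the function names `edit_distance` and `lexical_similarity` and the normalization choice are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with dynamic programming over two rows."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (ch_a != ch_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def lexical_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings.

    Dividing by the longer length is one common normalization (an assumption here).
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(round(lexical_similarity("kitten", "sitting"), 3))
```

Character-level measures like this capture spelling variation only; comparing meanings requires other resources, such as the word embeddings covered later in the course.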

2 Parsing and translation (Thamar Solorio)

POS tagging and formal grammars of English, syntactic and statistical parsing, semantic role labeling, machine translation, code switching.
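
To give a flavor of syntactic parsing, here is a minimal sketch of a CKY recognizer for a grammar in Chomsky normal form. The toy grammar and lexicon (`BINARY_RULES`, `LEXICON`) are invented for this example and are not taken from the course materials; the function only reports whether a sentence is derivable and does not build a parse tree.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("Det", "N"): {"NP"},
}
LEXICON = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "chased": {"V"},
}


def cky_recognize(tokens, start="S"):
    """Return True if the token list can be derived from the start symbol."""
    n = len(tokens)
    if n == 0:
        return False
    # chart[(i, j)] holds the nonterminals that span tokens[i:j].
    chart = defaultdict(set)
    for i, token in enumerate(tokens):
        chart[(i, i + 1)] = set(LEXICON.get(token, ()))
    # Fill in longer spans from shorter ones.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        chart[(i, j)] |= BINARY_RULES.get((left, right), set())
    return start in chart[(0, n)]


if __name__ == "__main__":
    print(cky_recognize("the dog chased the cat".split()))
    print(cky_recognize("dog the chased".split()))
```

A probabilistic variant would store the best probability for each nonterminal in each chart cell instead of a plain set.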

3 Text mining (Manuel Montes-y-Gómez)

Text classification, text clustering, distributional semantics, distributed representations, authorship attribution, author profiling
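
As a minimal sketch of text classification over a bag-of-words representation, the example below trains a multinomial Naive Bayes classifier with add-one smoothing. The four training sentences and their labels are invented toy data, and the whitespace tokenizer is a simplifying assumption; this is one standard baseline, not necessarily the method used in the course.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(docs):
    """docs is a list of (text, label) pairs; returns the counts the classifier needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Toy data, invented for this example.
    training = [
        ("great film loved it", "pos"),
        ("wonderful acting great story", "pos"),
        ("terrible film hated it", "neg"),
        ("boring story awful acting", "neg"),
    ]
    model = train_naive_bayes(training)
    print(classify("great story", model))
    print(classify("awful boring film", model))
```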

4 Advanced machine learning models for NLP (Fabio González)

Neural networks, deep learning, word embeddings, recurrent neural networks
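
Word embeddings are typically compared with cosine similarity. The sketch below shows that comparison and a nearest-neighbor lookup over a tiny hand-written embedding table. The three-dimensional vectors in `TOY_EMBEDDINGS` are made up for illustration and do not come from word2vec or any trained model.

```python
import math

# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
TOY_EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor(word, embeddings):
    """Return the other word whose vector is most similar to the given word's."""
    target = embeddings[word]
    candidates = [w for w in embeddings if w != word]
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


if __name__ == "__main__":
    print(nearest_neighbor("king", TOY_EMBEDDINGS))
```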

Evaluation and grading policy

4 credits in 64 hours.
During the first days of the course, you can decide whether to receive a grade, a certificate, both, or neither.


Course resources

References and resources

Course schedule

Date Topic Material Assignments
Jun 13 1. Introduction to NLP
Jun 14 1.1 NLP goals.
1.2 Introduction to lexical similarity Slides
LexSim V1
Words DataSet
Jun 15 1.3 Lexical similarity functions implementation LexSim V2
Results lexical similarity
Jun 16 1.4 Workshop text similarity LexSim V3
Text similarity DataSet
Must read: Computational Linguistics and Deep Learning
word2vec Google News Pre-trained Model
Assignment 1
Jun 17 2. Parsing and translation
2.1 Introduction Slides
2.2 Pre-processing Slides
  • Regular expressions
  • Tokenization
  • Word Normalization and Stemming
  • Sentence Segmentation and Decision Trees
Must read: Computational Linguistics and Deep Learning
Jun 20 2.3 Language Models Slides
  • N-grams
  • Estimating N-gram Probabilities
  • Evaluation and Perplexity
  • Smoothing

2.4 Hidden Markov Models Example
New notebook tutorial
Jun 21 2.5 Word classes and part of speech tagging Slides POS Tagging Exercise V1 Notebook (download)
Language Models with KenLM Mac OS X Notebook (download)
Jun 22 2.6 Parsing Slides
2.7 Statistical parsing Slides
Example CKY
Example Prob CKY
Assignment 2
Notebook (download)
Jun 23 3. Text mining
3.1 Introduction to text classification Slides
  • BoW representation
  • Classification methods
  • Evaluation
Jun 24 3.2 Beyond the Bag-of-Words representation Slides
  • word n-grams
  • Linguistic features
  • word senses as features
  • concept-based representations
Jun 27 3.3 Non-conventional text classification techniques Slides
  • Semi-supervised text classification
  • One Class text classification
  • Set-based text classification
  • Cross-language text classification
Jun 28 3.4 Authorship Analysis Slides
  • Methods for Authorship Attribution
  • Methods for Author Profiling
Assignment 3
Jun 29 4. Advanced machine learning models for NLP Slides
4.1 Introduction
  • Introduction to Machine Learning
  • Introduction to Neural Networks Notebook
4.2 Word Embeddings Notebook
Introduction video Assignment 4.1 (download)
Jun 30 4.2 Recurrent Neural Networks Notebook Neural Networks video
Softmax video
Assignment 4.2 (download)
Bible model - biblia.txt
Alternative model - reg1.txt
Jul 1 4.3 Applications


Academic coordination

Eng. Fabio A. González

Email:

Phone: 3165000 ext. 14077/14011


Lina F. Rosales

Email: