To exploit large volumes of text data such as web text, automatic processing by computers is essential. In natural language processing, computers recognize words in text data represented as sequences of characters, identify phrases, and estimate syntactic structures. This course is designed to give students the opportunity to learn the basic ideas and knowledge of the field, with an emphasis on methods based on machine learning. Applications such as machine translation, text summarization, and sentiment analysis, together with their mathematical models, are also covered. Mathematical approaches to the study of language are briefly explained as well.
By the end of this course, students will have acquired the following skills:
(i) read and understand research papers in the natural language processing field
(ii) use basic techniques of natural language processing such as part-of-speech tagging and syntactic parsing
(iii) derive the mathematical formulas of basic machine learning methods used in natural language processing
computational linguistics, natural language processing, machine learning, text mining
✔ Specialist skills | Intercultural skills | Communication skills | Critical thinking skills | Practical and/or problem-solving skills |
At the beginning of each class, assignments given in the previous class are reviewed, followed by a lecture.
Homework assignments include reading assignments, exercise problems, and programming assignments.
Class | Course schedule | Required learning |
---|---|---|
Class 1 | Part-of-speech tagging with HMM | Understand the probabilistic model of HMM-based POS tagging and its decoding with dynamic programming. |
Class 2 | Text classification with naive Bayes classifier | Learn the multinomial model and the multivariate Bernoulli model of naive Bayes classifiers, and learn the idea of generative models. |
Class 3 | Basic knowledge of optimization and parameter estimation | Learn the constrained optimization based on the method of Lagrange multipliers and its application to parameter estimation. |
Class 4 | Mathematical representation of documents and classification with support vector machines | Learn the bag-of-words representation of a document and its variants, as well as classification with support vector machines. |
Class 5 | Named-entity recognition and dependency parsing with sequential tagging | Understand how named-entity recognition and dependency parsing are implemented as sequential classification. |
Class 6 | Probabilistic model for sequential tagging | Understand the log-linear model and its variant for sequence data: conditional random fields. |
Class 7 | Text summarization | Learn the basics of text summarization and understand the importance of optimization problems in this task. |
Class 8 | Methods for text clustering | Learn k-means clustering, Gaussian mixture clustering, the EM algorithm, and probabilistic latent semantic analysis. |
Class 9 | Generative models of documents | Understand latent Dirichlet allocation and Gibbs sampling for its inference. |
Class 10 | Language resources and algorithm implementation | Obtain knowledge of various language resources and tools, and learn how to use them. |
Class 11 | Sophisticated methods for representing words, sentences, and documents | Learn the distributed representations of words, sentences and documents. |
Class 12 | Sentiment analysis of text | Learn various tasks and their methods for sentiment analysis of text. |
Class 13 | Machine translation | Learn about the IBM model, which is a statistical machine translation model, and understand the basic part of its algorithm. |
Class 14 | Basic knowledge for language study | Learn the computational methods that are used for language study and the research areas for which computational methods are useful. |
Class 15 | Mathematical methods for language study | Study concrete examples of language research carried out with computational approaches. |
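The dynamic-programming decoding of an HMM-based POS tagger (Class 1) can be sketched as follows. This is a minimal illustration with a toy tag set; all probabilities and words are made-up assumptions, not course material.

```python
# Toy HMM for POS tagging; every probability below is an illustrative assumption.
tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {
    "DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3,  "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.4,  "VERB": 0.1},
}
emit = {
    "DET":  {"the": 0.9},
    "NOUN": {"dog": 0.8, "barks": 0.2},
    "VERB": {"dog": 0.1, "barks": 0.9},
}

def viterbi(words):
    """Find the most probable tag sequence by dynamic programming (Viterbi)."""
    # best[s]: probability of the best tag path ending in tag s at the current word.
    best = {s: start[s] * emit[s].get(words[0], 0.0) for s in tags}
    back = []  # backpointers, one dict per word after the first
    for w in words[1:]:
        prev, best, ptr = best, {}, {}
        for s in tags:
            # Maximize over the previous tag r; keep the argmax as a backpointer.
            p, arg = max((prev[r] * trans[r][s], r) for r in tags)
            best[s] = p * emit[s].get(w, 0.0)
            ptr[s] = arg
        back.append(ptr)
    # Recover the best path by following backpointers from the best final tag.
    path = [max(best, key=best.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

For example, `viterbi(["the", "dog", "barks"])` returns `["DET", "NOUN", "VERB"]` under these toy parameters; real taggers use log probabilities to avoid underflow on long sentences.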
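The multinomial naive Bayes model for text classification (Class 2) can likewise be sketched in a few lines. The training set, labels, and smoothing choice (add-one) below are illustrative assumptions for the sketch.

```python
from collections import Counter
import math

# Tiny made-up training set: (label, text) pairs.
train = [
    ("pos", "good great film"),
    ("pos", "great acting good plot"),
    ("neg", "bad boring film"),
    ("neg", "bad plot boring acting"),
]

def fit(data):
    """Estimate class priors and per-class word counts (multinomial model)."""
    priors = Counter(label for label, _ in data)
    counts = {label: Counter() for label in priors}
    for label, text in data:
        counts[label].update(text.split())
    vocab = {w for c in counts.values() for w in c}
    return priors, counts, vocab

def predict(text, priors, counts, vocab):
    """Return argmax_c [log P(c) + sum_w log P(w|c)] with add-one smoothing."""
    n = sum(priors.values())
    best, best_score = None, -math.inf
    for c in priors:
        total = sum(counts[c].values())
        score = math.log(priors[c] / n)
        for w in text.split():
            score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```

Because the model is generative, the same estimated distributions P(c) and P(w|c) could also be used to sample synthetic documents, which is the conceptual point the class contrasts with discriminative classifiers such as SVMs.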
None.
None.
Students will be assessed on their knowledge and practical skills in natural language processing and in mathematical models for language.
Exercise problems 40%, term paper 60%.
None.