
COMP348: Document Processing and the Semantic Web


COMP348 explores the issues involved in building natural language processing (NLP) applications that operate on large bodies of real text, such as those found on the World Wide Web (WWW).

Because the Web is full of unstructured, largely text-based data, applications that do useful things with it have their own particular characteristics. In this unit we discuss some core applications for dealing with Web data, such as search engines and spam filtering. We also focus on machine learning, the dominant approach to building these applications (and applications in many other areas of computing), in which algorithms improve automatically by learning from data; neural networks and genetic algorithms are instances of this. The unit also explores some developments of the Web, such as emerging Semantic Web technologies, which support the exchange of XML metadata, and Web 2.0 technologies (e.g. social networking, folksonomies, wikis and blogs). Application areas covered include information retrieval, web search, document summarisation, machine translation and natural language parsing.

The unit focuses on the concepts and techniques required to process real natural language text. Students gain practical experience in using the Python programming language to develop language processing systems.