COMP348: Document Processing and the Semantic Web
Unit Outline: COMP348
Document Processing and the Semantic Web
Semester 1, 2012
Convenor: Mark Johnson
Faculty: Science
Department: Computing
Credit Points: 3
Co-badged with: N/A
Prerequisites: 40cp and COMP249(P)
Students should read this unit outline carefully at the start of semester. It contains important information about the unit. If anything in it is unclear, please consult one of the teaching staff in the unit.
About This Unit
COMP348 explores the issues involved in building natural language processing (NLP) applications that operate on large bodies of real text such as are found on the World Wide Web (WWW).
Because the Web is full of unstructured and largely text-based data, the applications we need in order to do useful things have their own particular characteristics. In this unit we discuss some core applications for dealing with data on the Web, such as search engines and spam filtering. We also focus on the dominant approach to building these applications (and to building applications in many other areas of computing), machine learning, where algorithms improve automatically through learning from data; neural networks and genetic algorithms are instances of this. Application areas covered include information retrieval, web search, document classification and summarisation, information extraction, machine translation and natural language parsing.
The unit focuses on the concepts and techniques required to process real natural language text. Students gain practical experience in using the Python programming language to develop language processing systems.
Teaching Staff
| Role | Name | Email | Room | Office hours |
|---|---|---|---|---|
| Convenor, Lecturer | Mark Johnson | Mark.Johnson at MQ.edu.au | E6A 316 | Tuesday 10am--12pm |
| Lecturer | Robert Dale | Robert.Dale at MQ.edu.au | E6A 328 | Monday 3pm--4pm, Friday 2pm--3pm |
| Tutor | Yasaman Motazedi | Yasaman.Motazedi at MQ.edu.au | E6A 348 | --- |
All emails related to COMP348 should be sent to Mark.Johnson at MQ.edu.au and must include your full name and your student ID number.
Classes
Each week you should attend two hours of lectures and a two-hour workshop class. For details of days, times and rooms consult the timetables webpage.
Note that the workshop classes commence in Week 2.
You should have selected a workshop session at enrolment. You should attend the workshop session you are enrolled in. If you do not have a class, or if you wish to change one, you should see the enrolment operators in the E7B courtyard during the first two weeks of the semester. Thereafter you should go to the Student Centre.
Attendance at the workshop is not compulsory, but submission of the assessed tasks is compulsory. If you don't attend a workshop class, it is essential that you ensure the tutor receives your weekly task before the workshop in which you are enrolled.
Required and Recommended Texts
There are two textbooks for this class:
- Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing --- Analyzing Text with Python and the Natural Language Toolkit. Online at http://www.nltk.org/book
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval. Online at http://nlp.stanford.edu/IR-book/
Technology Used and Required
The following technology is used in COMP348:
- Python 2.7.2: available from www.python.org.
- NLTK 2.0.1rc1, available from www.nltk.org/download.
- The NLTK data, see instructions at http://www.nltk.org/data.
- We may also use some of the following: PyYAML 3.09, Numpy 1.5.1, matplotlib 1.0.1. Instructions for downloading these are on www.nltk.org/download.
Unit Web Page
The web page for this unit can be found at http://www.comp.mq.edu.au/units/comp348. Note that most of the unit material is publicly available, but access to some material requires you to log in to iLearn.
The unit will make use of discussion boards hosted within iLearn. Please post questions there; they will be monitored by the staff on the unit.
Learning Outcomes
A student completing the unit should have:
- A basic understanding of the range of applications that require intelligent text processing.
- An understanding of a variety of rule-based approaches to intelligent text processing.
- An understanding of the main techniques involved in statistical and machine learning approaches to intelligent text processing.
- Practical ability in using Python for intelligent text processing.
- Practical ability in using machine learning methods for text processing.
Graduate Capabilities Developed
In addition to the discipline-based learning objectives, all academic programs at Macquarie seek to develop students' generic skills in a range of areas. One of the aims of this unit is that students develop their skills in the following areas:
Cognitive Capabilities
- Discipline-specific knowledge and skills: See the specific learning outcomes.
- Critical, analytical and integrative thinking: Each of the unit assignments includes a written report with sections for the evaluation of the system developed, discussion of results, and explanation of the best settings and the methodology used to determine them. The interactive workshop classes include group discussions of open-ended topics.
- Problem solving and research capability: There will be practical tasks that involve solving high-level problem specifications. In addition, the assignments are ideal environments for the development of individual problem-solving and research skills.
- Creative and innovative: Many of the applications introduced and explained in the unit are open-ended and do not have one "right" solution. Special tasks in the workshop classes, and especially in the assignments, will encourage creative and innovative thinking.
Interpersonal or Social Capabilities
- Effective communication: The written component of the assignments will enhance your writing skills. The social component of the activities in the workshop classes will enhance collaborative work and interpersonal communication skills.
Personal Capabilities
- Capable of professional and personal judgement and initiative: the open-ended nature of the assignments will encourage initiative and judgement.
- Commitment to continuous learning: This unit will present a range of unsolved problems and problems that are only partially solved, some of which can be attempted in an Honours, Masters or PhD project.
Teaching and Learning Strategy
COMP348 is taught via lectures, tutorials and workshop sessions in the laboratory. Lectures are used to introduce new material, give examples of the use of programming methods and techniques, and put them in a wider context. While lectures are largely one-to-many presentations, you are encouraged to ask questions of the lecturer to clarify anything you might not be sure of. Workshops are small group classes which combine tutorial work and practical exercises. You will be given problems to solve each week prior to the workshop; these problems will be assessed and discussed during the workshop. It is important that you keep up with these problems, as doing so will help you understand the material in the unit and prepare you for the work in assignments.
Each week you should:
- Attend lectures, take notes, ask questions.
- Attend your workshop, and seek feedback from your tutor on your work.
- Submit the weekly task.
- Read the assigned readings, add to your notes, and prepare questions for your lecturer or tutor.
- Prepare answers to the following week's tutorial questions.
- Work on any assignments that have been released.
Lecture notes may be made available each week, but these notes are intended as an outline of the lecture only, and are not a substitute for your own notes or the assigned readings.
Topic List
See the schedule page for the topics to be covered in each week, as well as links to slides and other material.
Relationship Between Assessment and Learning Outcomes
- A basic understanding of the range of applications that require intelligent text processing: The exam will cover these concepts.
- An understanding of a variety of rule-based approaches to intelligent text processing: The weekly tasks and the exam will cover these concepts.
- An understanding of the main techniques involved in statistical and machine learning approaches to intelligent text processing: The weekly tasks and the exam will cover these concepts. In addition, the assignments will use some statistical modelling and require quantitative evaluation.
- Practical ability in using Python for intelligent text processing: The assignments will focus on this.
- Practical ability in using machine learning methods for text processing: The weekly tasks and assignments will focus on this.
Assignments
There will be three assignments in this unit. We may adjust the release and the due dates depending on the speed at which we cover material in the unit. In general each assignment will be released immediately following a lecture, and will be available over the web.
| Task | Planned Release Date | Planned Due Date | Total Marks |
|---|---|---|---|
| Assignment 1: Designing a supervised classifier | Week 4 | Week 6 | 8% |
| Assignment 2: Principles of NLP applications | Week 5 | Week 7 | 8% |
| Assignment 3: Spelling correction as an example NLP application | Week 10 | Week 12 | 14% |
| Weekly Tasks | Weekly | Following week | 10% |
| Final Examination | | TBA | 60% |
All assignments should be submitted via iLearn by the time specified in the assignment description.
All work submitted should be readable and well-presented.
Each correct weekly task earns one mark. If the total marks you receive for the weekly tasks is greater than 10, it will count as 10 marks.
Late work will be accepted with a penalty of 20% of the maximum marks for the assignment per day submitted late. Hence, an assignment submitted five days late will not get any marks. If you cannot submit on time because of illness or other circumstances, please contact the lecturer before the due date.
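As a rough illustration of the late-penalty arithmetic above, the following Python sketch computes a penalised mark. The function name is hypothetical, and the sketch assumes that the penalty is subtracted from the mark awarded, that days late are counted in whole days, and that a mark never drops below zero; the assignment descriptions and teaching staff remain the authority on how the penalty is applied.

```python
def late_adjusted_mark(awarded, max_marks, days_late):
    """Return the mark left after the late penalty (hypothetical helper).

    Assumes 20% of the assignment's maximum marks is deducted per whole
    day late, and that the result is never negative.
    """
    penalty = 0.20 * max_marks * max(days_late, 0)
    return max(awarded - penalty, 0.0)
```

Under these assumptions an assignment worth 8 marks loses 1.6 marks per day, so a submission five days late scores zero whatever mark it would otherwise have earned.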
All assignments have a set of core tasks. These tasks must be assessed as correct in order to count as a satisfactory submission. You can resubmit the core tasks until they are correct, even if that means submitting after the deadline (in which case you will still incur the penalty for late submission).
Your final grade will depend on your performance in each part separately. Note that on occasion your raw mark for a unit (i.e., the total of your marks for each assessment item) may not be the same as the Standardised Numerical Grade (SNG) which you receive.
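To make the raw-mark arithmetic concrete, here is a minimal Python sketch that adds up the component marks using the weights from the assessment table. The names `WEIGHTS`, `weekly_task_marks` and `raw_total` are hypothetical, and the sketch assumes each component mark is already expressed out of its weight (for example, Assignment 1 out of 8). It computes the raw mark only; it does not model the SNG or the grade requirements.

```python
# Weights from the assessment table, as percentages of the unit total.
WEIGHTS = {
    "assignment1": 8,
    "assignment2": 8,
    "assignment3": 14,
    "weekly_tasks": 10,
    "exam": 60,
}


def weekly_task_marks(correct_tasks):
    """One mark per correct weekly task, capped at the 10 marks available."""
    return min(correct_tasks, WEIGHTS["weekly_tasks"])


def raw_total(marks):
    """Sum the component marks, checking each against its weight."""
    for name, mark in marks.items():
        if not 0 <= mark <= WEIGHTS[name]:
            raise ValueError("mark for %s is out of range" % name)
    return sum(marks.values())
```

For instance, a student with 6, 7 and 10 marks on the three assignments, 12 correct weekly tasks (capped at 10) and 40 marks in the exam would have a raw mark of 73 under these assumptions.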
Assessment Standards
The following table indicates the achievement required for each final grade, relative to each learning outcome. The standards of a level also include the standards of the lower levels. For example, the standards of the HD level include the standards of P, CR and D.
Where applicable, more specific versions of the requirements will be provided with the assessment task descriptions.
| L.O. | P | CR | D | HD |
|---|---|---|---|---|
| #1 | A basic understanding of the range of applications that require intelligent text processing, and the ability to describe several such applications. | Ability to describe the state of the art in performance of several text processing applications. | Ability to compare several text processing applications. | Ability to give well-informed descriptions of the types of applications that require intelligent text processing. |
| #2 | An understanding of a variety of rule-based approaches to intelligent text processing, and an ability to list several applications that use rules. | Ability to write simple rules that are applicable to a text processing application, e.g. a part-of-speech tagger. | Ability to write elaborate rules for a text processing application. | Ability to discuss the kinds of applications that benefit from rule-based approaches. |
| #3 | An understanding of the main techniques involved in statistical and machine learning approaches to intelligent text processing, and an ability to explain the basic concepts of statistical and machine learning approaches. | Ability to describe several machine learning approaches. | Ability to determine what machine learning approaches are appropriate for specific tasks. | Ability to provide insightful comparisons between machine learning approaches and the types of tasks that each approach is best suited for. |
| #4 | An ability to use Python for intelligent text processing, and to implement parts of an intelligent text processing application in Python. | Ability to implement and document a complete intelligent text processing application in Python. | Ability to implement, document, and evaluate an intelligent text processing application in Python. | Ability to provide a complete intelligent text processing application with detailed documentation and insightful evaluation. |
| #5 | Practical ability in using machine learning methods for text processing, demonstrated by the implementation of parts of an intelligent text processing system that uses machine learning. | Ability to implement and document a complete intelligent text processing system that uses machine learning. | Ability to implement, document, and evaluate (including significance tests) an intelligent text processing system that uses machine learning. | Ability to provide a complete intelligent text processing system that uses machine learning, with detailed documentation and insightful evaluation. |
Your final grade depends on your performance in each part of the assessment. For each task, you receive a mark that combines your standard of performance regarding each learning outcome assessed by that task; then the different component marks are added up to determine your total mark out of 100. Your grade then depends on this total mark and your overall standard of performance.
You will obtain a grade of Pass if you meet the learning outcomes of this unit at a basic level. In particular:
- you must perform satisfactorily in the examination; and
- you must submit all the core tasks of all the assignments, and these submissions must be assessed as correct.
You will obtain a grade of Credit if you demonstrate performance at the level of Pass, and in addition demonstrate performance at a level of CR or higher in the exam and in at least one other assessment criterion.
You will obtain a grade of Distinction if you meet all the requirements of a Pass grade and in addition demonstrate performance at a level of CR or higher in the exam and most of the other assessment criteria.
You will obtain a grade of High Distinction if you meet all the requirements of a Pass grade and in addition demonstrate performance at the level of HD in most of the assessment criteria.
Examinations
The University examination period for First Half Year 2012 starts on Tuesday, 12th June.
You are expected to present yourself for examination at the time and place designated in the University Examination Timetable. The timetable will be available in Draft form approximately eight weeks before the commencement of the examinations and in Final form approximately four weeks before the commencement of examinations.
You are advised that it is Macquarie University policy not to set early examinations for individuals or groups of students. All students are expected to ensure that they are available until the end of the teaching semester, that is, the final day of the official examination period.
Special Consideration
The only exception to sitting an examination at the designated time is documented illness or unavoidable disruption. In these circumstances you may wish to consider applying for Special Consideration. Information about unavoidable disruption and the Special Consideration process is available on the web.
If a Supplementary Examination is granted as a result of the Special Consideration process the examination will be scheduled after the conclusion of the official examination period. For details of the Special Consideration policy specific to the Department of Computing, see the Department's policy page.
To be eligible for special consideration you must show a genuine interest in the unit by participating in its activities. In particular:
- You must submit correctly all the core tasks of all assignments (this is a necessary condition for passing the unit); and
- You must provide correct submissions for most of the weekly tasks.
Academic Honesty and Plagiarism
Plagiarism involves using the work of another person and presenting it as one's own.
The Department, in line with University policy, treats all cases seriously. In particular, the Department keeps a record of all plagiarism cases. This record is referred to so that an appropriate penalty can be applied to each case. For concrete examples, see this page.
Student Support Services
Macquarie University provides a range of Academic Student Support Services. Details of these services can be accessed at http://www.student.mq.edu.au.
Staff-Student Liaison Committee
The Department has established a Staff-Student Liaison Committee at each level (100, 200, 300) to provide all students studying a Computing unit the opportunity to discuss related issues or problems with both students and staff.
For each meeting, an agenda is issued and minutes are taken. These are posted on the web at:
Details of the regular meeting dates will be posted on the unit home page. Anyone with an interest in Computing units may attend. This includes staff involved in the teaching and administration of the units, and all students currently taking a Computing unit at that level. There are formal Liaison Committee representatives for each unit who attend to present the views of the student body; all students are welcome and are encouraged to attend.
The meetings are usually held in the Department of Computing Meeting Room, E6A357.
To forward agenda items or get in touch with your representative, send an email to comp348liaison@ics.mq.edu.au.
If you have exhausted all other avenues, then you should consult the Director of Teaching (Dr Christophe Doche) or the Head of Department (Assoc. Prof. Bernard Mans). You are entitled to have your concerns raised, discussed and resolved.
Changes Made to Previous Offerings
We try to adapt this unit to new developments in the area of Language Technology, and in response to feedback from students in past years.
Compared with last year, this year we plan to give stronger emphasis to machine learning methods, and we will substantially reduce the material related to the Semantic Web.
Assessment is based on assessment standards.

