IFN647 Assignment 2: Weak Supervision Model (WSM)
Currently, a major challenge is to build effective communication between users and Web search systems. However, most Web search systems work with user queries rather than user information needs, because information needs are difficult to acquire automatically.
The first reason for this is that users may not know how to represent their topics of interest.
The second reason is that users may not wish to invest a great deal of effort to dig relevant pages out of the hundreds of thousands of candidates returned by a Web search system.
In this assignment, you are expected to design a system, “Weak Supervision Model (WSM)”, to provide a solution for this challenging issue.
The system is broken up into three parts:
Part I (Training Set Discovery),
Part II (Information Filtering (IF) Model) and
Part III (Evaluation).
In Part I, the major task is to present an approach that automatically discovers a training set for a specified topic (we will provide you with 50 topics), including both positive documents (e.g., labelled as "1") and negative documents (e.g., labelled as "0"). For this part you may need to use the topic title, description or narrative, a Pseudo-Relevance Feedback technique (or a clustering technique), and an IR model to find a training set D that includes both D+ (positive, likely relevant documents) and D− (negative, likely irrelevant documents) in a given unlabelled document set U.
In Part II, the task is to select more terms from D and discover weights for them, and then to use the selected terms and their weights to rank the documents in U. Part III is the evaluation: you are required to demonstrate that your solution is better than the query-based method (the "baseline model"), which uses only the topic titles to rank U.
For example, Topic 102, "Convicts, repeat offenders", is described as follows (an illustrative parsing sketch is given after the example):
<top>
<num> Number: R102
<title>Convicts, repeat offenders
<desc> Description:
Search for information pertaining to crimes committed by people who have been previously convicted and later released or paroled from prison.
<narr> Narrative:
Relevant documents are those which cite actual crimes committed by “repeat offenders” or ex-convicts. Documents which only generally discuss the topic or efforts to prevent its occurrence with no specific cases cited are irrelevant.
</top>
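For reference, the topics follow the standard TREC <top> layout shown above. The following is a minimal parsing sketch (an illustration only, not part of the required solution), assuming all topics sit in a single plain-text file whose path is passed in; the function name and file layout are assumptions.

import re

def parse_topics(path):
    """Parse TREC-style <top> blocks into dicts with num, title, desc and narr fields."""
    text = open(path, encoding="utf-8").read()
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, re.S):
        num = re.search(r"<num>\s*Number:\s*(\S+)", block).group(1)
        title = re.search(r"<title>\s*(.*)", block).group(1).strip()
        desc = re.search(r"<desc>\s*Description:\s*(.*?)(?=<narr>|$)", block, re.S).group(1).strip()
        narr = re.search(r"<narr>\s*Narrative:\s*(.*)", block, re.S).group(1).strip()
        topics.append({"num": num, "title": title, "desc": desc, "narr": narr})
    return topics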
Part I: Training Set Discovery
This part requires obtaining a complete training set D, which consists of a set of positive documents D+ and a set of negative documents D−. You should present an approach (or two approaches for a pair) that finds a complete training set D in U (a given unlabelled document set, e.g., the set of documents in the Training102 folder), which includes at least some likely relevant documents (the positive part) and some likely irrelevant documents (the negative part). The proposed approach should be based on the knowledge you have acquired in this unit. You may discuss your approach with your tutor before you implement it.
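One possible direction (an illustration only, not the required approach) is to rank U with an IR model using the topic title as the query, then take the top-ranked documents as the likely relevant part D+ and the bottom-ranked documents as the likely irrelevant part D−, in the spirit of pseudo-relevance feedback. The sketch below assumes a scoring function is supplied by the caller (for example, the BM25 model from the week 8 workshop); the function name and cut-off values are assumptions.

def discover_training_set(docs, score_fn, top_k=12, bottom_k=15):
    """docs: {doc_id: list_of_terms}; score_fn(doc_terms) -> topic-relevance score.
    Returns {doc_id: 1} for likely relevant (D+) and {doc_id: 0} for likely irrelevant (D-) documents.
    Assumes U contains more than top_k + bottom_k documents."""
    ranked = sorted(docs, key=lambda d: score_fn(docs[d]), reverse=True)
    labels = {d: 1 for d in ranked[:top_k]}             # top of the ranking -> D+
    labels.update({d: 0 for d in ranked[-bottom_k:]})   # bottom of the ranking -> D-
    return labels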
Q1) (6 marks) Write an algorithm (or two algorithms for a pair) in plain English to show your approach for discovering a complete training set for the 50 topics and the corresponding 50 datasets (Training101 to Training150). Your approach should be generic, meaning it is feasible for all (or most) topics.
For each topic, e.g., Topic 102, you should use the following input and generate the output.
Inputs: query Q = a topic (you may use the title only, e.g., 'Convicts repeat offenders', or all information including the <desc> and <narr>); and U = folder "Training102".
Output: D = D+ ∪ D−, where D+ ∩ D− = ∅ and D ⊆ U. The following is a possible output in D (not the answer) for Topic 102; an illustrative sketch of writing this format follows the listing:
R102 73038 1
R102 26061 1
R102 65414 1
R102 57914 1
R102 58476 1
R102 76635 1
R102 12769 1
R102 12767 1
R102 25096 1
R102 78836 1
R102 82227 1
R102 26611 1
R102 15200 0
R102 13320 0
R102 54745 0
R102 15082 0
R102 53523 0
R102 65306 0
R102 68419 0
R102 29920 0
R102 30456 0
R102 75563 0
R102 28657 0
R102 65394 0
R102 85372 0
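A minimal sketch of writing a discovered training set in the format shown above, assuming labels is a {doc_id: 1 or 0} mapping such as the one produced in Part I; the function name and argument layout are assumptions.

def save_training_set(topic_id, labels, out_path):
    """Write rows of '<topic_id> <doc_id> <label>', positives first as in the sample above."""
    with open(out_path, "w") as f:
        for doc_id, label in sorted(labels.items(), key=lambda kv: -kv[1]):
            f.write(f"{topic_id} {doc_id} {label}\n")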
Q2) (6 marks) Implement the algorithm (two algorithms for a pair) using Python. You also need to discuss the output to justify why the proposed algorithm is likely to generate high-quality training sets. You may use figures to support the justification.
Q3) (3 marks) BM25-based baseline model implementation (see the week 8 workshop). Please use the titles as queries to rank documents for each topic, and save the results into 50 files, e.g., BaselineResult1.dat, ..., BaselineResult50.dat, where each row includes the document number and the corresponding relevance degree or ranking (in descending order). The following is a possible result (not the answer) for Topic 102 (in BaselineResult2.dat); an illustrative BM25 sketch follows the listing:
73038 5.898798484774149
26061 4.273638903483098
65414 4.1414522450167475
57914 3.967136888209526
58476 3.708467957856744
76635 3.5867337114200843
12769 3.4341129093591456
12767 3.352170358051889
25096 2.7646308089876177
78836 2.6823617071618404
82227 2.6056189593652537
26611 2.3595327588643613
24515 2.2258395867976226
33172 2.218657303566887
33203 2.2027873338265396
29908 2.188504022701605
…
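A minimal BM25 sketch is given below for reference only; it uses the common k1 = 1.2, b = 0.75 defaults and a non-negative idf variant, so the exact formula (and any stopping/stemming) should follow the week 8 workshop version where they differ.

import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """docs: {doc_id: list_of_terms}. Returns [(doc_id, score)] sorted in descending order."""
    N = len(docs)
    avgdl = sum(len(terms) for terms in docs.values()) / N
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))                      # document frequency of each term
    scores = {}
    for doc_id, terms in docs.items():
        tf, dl, score = Counter(terms), len(terms), 0.0
        for q in query_terms:
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * dl / avgdl))
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

The sorted (doc_id, score) pairs can then be written straight into the corresponding BaselineResultN.dat file.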
Part II: Information Filtering Model
Q4) (5 marks) Design an information filtering model (your WSM) that includes both a training algorithm and a testing algorithm (for an individual), or two information filtering models (for a pair), in plain English, illustrating your idea for using the training set discovered in Part I to learn the model. Please note that the keywords (terms) you select from the discovered training set should be very important for each given topic.
You will use the following input and output for the training algorithm, which selects some useful features:
Input: D = D+ ∪ D−
Output: Features
For the testing algorithm, you will have the following input and output:
Input: U (e.g., folder "Training102")
Output: sorted U
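As an illustration only (not the required model), one simple training algorithm is a Rocchio-style weighting that keeps the terms whose average frequency in D+ most exceeds their average frequency in D−; the function name and the top_n cut-off are assumptions.

from collections import Counter

def train_features(pos_docs, neg_docs, top_n=20):
    """pos_docs, neg_docs: lists of term lists (D+ and D-). Returns {term: weight} for the top_n features."""
    def mean_tf(doc_list):
        counts = Counter()
        for terms in doc_list:
            counts.update(terms)
        return {t: counts[t] / max(len(doc_list), 1) for t in counts}
    pos, neg = mean_tf(pos_docs), mean_tf(neg_docs)
    weights = {t: pos[t] - neg.get(t, 0.0) for t in pos}   # reward D+ terms, penalise D- terms
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n])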
Q5) (5 marks) Implement your WSM (or two models for a pair) in Python. You need to find useful features (e.g., terms) and their weights for every topic using the proposed training algorithm (in Q4) and store them in a data structure or a file. For all documents in U, you also need to calculate a relevance score for each document using the proposed testing algorithm, sort the documents in U for each topic according to their relevance scores, and save the results into "result1.dat" to "result50.dat" for the 50 topics, where each row includes the document number and the corresponding relevance score or ranking (in descending order).
The following is a possible result (not the answer) for Topic 102 (in result2.dat); an illustrative scoring sketch follows the listing:
73038 5.898798484774149
26061 4.273638903483098
65414 4.1414522450167475
57914 3.967136888209526
58476 3.708467957856744
76635 3.5867337114200843
12769 3.4341129093591456
12767 3.352170358051889
25096 2.7646308089876177
78836 2.6823617071618404
…
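A minimal sketch of the testing side, assuming a {term: weight} feature dictionary from the training step; the relevance score here is a simple weighted term-frequency sum, shown only to illustrate producing the sorted result files.

def rank_with_features(features, docs, out_path):
    """docs: {doc_id: list_of_terms}. Writes '<doc_id> <score>' rows in descending score order."""
    scores = {doc_id: sum(features.get(t, 0.0) * terms.count(t) for t in set(terms))
              for doc_id, terms in docs.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    with open(out_path, "w") as f:
        for doc_id, score in ranked:
            f.write(f"{doc_id} {score}\n")
    return ranked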
Part III: Evaluation
Q6) (5 marks) Implement a Python program to calculate top-10 precision, recall and F1 (you may use extra measures, e.g., average precision) for both the baseline model and your WSM on all topics, using the provided relevance judgements for each topic, and save the results into "EvaluationResult.dat". Please note that you can use the evaluation results to update your WSM.
For each topic, e.g., Topic 102, you should use the following inputs for your WSM; the output includes all evaluation results for the 50 topics:
Input: “result2.dat” and “Training102.txt”
Output: EvaluationResult.dat
The following is a possible result (not the answer) in a CSV file; an illustrative metrics sketch follows the table:
Topic | Precision | Recall | F1 |
101 | 0.130435 | 0.428571 | 0.20 |
102 | 0.020100 | 0.029630 | 0.023952 |
103 | 0.046875 | 0.214286 | 0.076923 |
… |
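A minimal sketch of the per-topic measures, assuming ranked_doc_ids is the sorted document list for one topic (e.g., read from result2.dat) and relevant is the set of document numbers judged relevant in the provided judgements. Precision is computed here over the top k retrieved documents and recall against all judged-relevant documents, which is one common reading of "top-10"; adjust it to the definition used in the unit.

def evaluate_topic(ranked_doc_ids, relevant, k=10):
    """Top-k precision, recall and F1 for a single topic."""
    retrieved = ranked_doc_ids[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1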
Q7) (5 marks) You will get the 5 marks if you can prove that your WSM is significantly better than the baseline model (you can choose any measure used in Q6); otherwise, you will lose the 5 marks. Please use a "t-test" to help you answer this question; an illustrative sketch is given below.
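A paired t-test is appropriate here because the two models are evaluated on the same 50 topics. A minimal sketch, assuming SciPy is available and that wsm_scores and baseline_scores hold the same measure (e.g., top-10 precision) per topic, in the same topic order:

from scipy import stats

def compare_models(wsm_scores, baseline_scores, alpha=0.05):
    """Paired t-test over per-topic scores; one-sided check that the WSM improves on the baseline."""
    t_stat, p_value = stats.ttest_rel(wsm_scores, baseline_scores)
    significantly_better = (t_stat > 0) and (p_value / 2 < alpha)
    return t_stat, p_value, significantly_better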
Please Note
- Your programs should be well laid out, easy to read and well commented.
- All items submitted should be clearly labelled with your name and student number.
- Marks will be awarded for programs (correctness, programming style, elegance, commenting) and evaluation results, according to the marking guide.
- You will lose marks for missing or inaccurate statements of completeness, and for missing files or items.