IFN647 Assignment 2: Weak Supervision Model (WSM)
Currently, a major challenge is to build effective communication between users and Web search systems. However, most Web search systems work with user queries rather than user information needs, because information needs are difficult to acquire automatically.
The first reason for this is that users may not know how to represent their topics of interest.
The second reason is that users may not wish to invest a great deal of effort to dig relevant pages out of the hundreds of thousands of candidates returned by a Web search system.
In this assignment, you are expected to design a system, “Weak Supervision Model (WSM)”, to provide a solution for this challenging issue.
The system is broken up into three parts:
Part I (Training Set Discovery),
Part II (Information Filtering (IF) Model) and
Part III (Evaluation).
In Part I, the major task is to present an approach that automatically discovers a training set for a specified topic (we will provide you with 50 topics), including both positive documents (e.g., labelled as "1") and negative documents (e.g., labelled as "0"). For this part you may need to use the topic title, description or narrative, a Pseudo-Relevance Feedback technique (or a clustering technique), and an IR model to find a training set D that includes both D+ (positive, likely relevant documents) and D− (negative, likely irrelevant documents) in a given unlabelled document set U.
In Part II, the task is to select more terms from D and discover weights for them, and then to use the selected terms and their weights to rank the documents in U. Part III is the evaluation: you are required to demonstrate that your solution is better than the query-based method (the "baseline model"), which uses only the topic titles to rank U.
For example, Topic 102, "Convicts, repeat offenders", is described as follows (an illustrative parsing sketch is given after the example):
<top>
<num> Number: R102
<title>Convicts, repeat offenders
<desc> Description:
Search for information pertaining to crimes committed by people who have been previously convicted and later released or paroled from prison.
<narr> Narrative:
Relevant documents are those which cite actual crimes committed by “repeat offenders” or ex-convicts. Documents which only generally discuss the topic or efforts to prevent its occurrence with no specific cases cited are irrelevant.
</top>
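For reference, the topics follow the standard TREC <top> layout shown above. The following is a minimal parsing sketch (an illustration only, not part of the required solution), assuming all topics sit in a single plain-text file whose path is passed in; the function name and file layout are assumptions.

import re

def parse_topics(path):
    """Parse TREC-style <top> blocks into dicts with num, title, desc and narr fields."""
    text = open(path, encoding="utf-8").read()
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, re.S):
        num = re.search(r"<num>\s*Number:\s*(\S+)", block).group(1)
        title = re.search(r"<title>\s*(.*)", block).group(1).strip()
        desc = re.search(r"<desc>\s*Description:\s*(.*?)(?=<narr>|$)", block, re.S).group(1).strip()
        narr = re.search(r"<narr>\s*Narrative:\s*(.*)", block, re.S).group(1).strip()
        topics.append({"num": num, "title": title, "desc": desc, "narr": narr})
    return topics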
Part I: Training Set Discovery
This part requires obtaining a complete training set D, which consists of a set of positive documents D+ and a set of negative documents D−. You should present an approach (or two approaches for a pair) that finds a complete training set D in U (a given unlabelled document set, e.g., the set of documents in the Training102 folder), which includes at least some likely relevant documents (the positive part) and some likely irrelevant documents (the negative part). The proposed approach should be based on the knowledge you have acquired in this unit. You may discuss your approach with your tutor before you implement it.
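One possible direction (an illustration only, not the required approach) is to rank U with an IR model using the topic title as the query, then take the top-ranked documents as the likely relevant part D+ and the bottom-ranked documents as the likely irrelevant part D−, in the spirit of pseudo-relevance feedback. The sketch below assumes a scoring function is supplied by the caller (for example, the BM25 model from the week 8 workshop); the function name and cut-off values are assumptions.

def discover_training_set(docs, score_fn, top_k=12, bottom_k=15):
    """docs: {doc_id: list_of_terms}; score_fn(doc_terms) -> topic-relevance score.
    Returns {doc_id: 1} for likely relevant (D+) and {doc_id: 0} for likely irrelevant (D-) documents.
    Assumes U contains more than top_k + bottom_k documents."""
    ranked = sorted(docs, key=lambda d: score_fn(docs[d]), reverse=True)
    labels = {d: 1 for d in ranked[:top_k]}             # top of the ranking -> D+
    labels.update({d: 0 for d in ranked[-bottom_k:]})   # bottom of the ranking -> D-
    return labels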
Q1) (6 marks) Write an algorithm (or two algorithms for a pair) in plain English to show your approach for discovering a complete training set for the 50 topics and the corresponding 50 datasets (Training101 to Training150). Your approach should be generic, meaning it is feasible for all (or most) topics.
For each topic, e.g., Topic 102, you should use the following input and generate the output.
Inputs: query Q = a topic (you may use the title only, e.g., 'Convicts repeat offenders', or all information including the <desc> and <narr>); and U = folder "Training102".
Output: D = D+ ∪ D−, where D+ ∩ D− = ∅ and D ⊆ U. The following is a possible output in D (not the answer) for Topic 102; an illustrative sketch of writing this format follows the listing:
R102 73038 1
R102 26061 1
R102 65414 1
R102 57914 1
R102 58476 1
R102 76635 1
R102 12769 1
R102 12767 1
R102 25096 1
R102 78836 1
R102 82227 1
R102 26611 1
R102 15200 0
R102 13320 0
R102 54745 0
R102 15082 0
R102 53523 0
R102 65306 0
R102 68419 0
R102 29920 0
R102 30456 0
R102 75563 0
R102 28657 0
R102 65394 0
R102 85372 0
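A minimal sketch of writing a discovered training set in the format shown above, assuming labels is a {doc_id: 1 or 0} mapping such as the one produced in Part I; the function name and argument layout are assumptions.

def save_training_set(topic_id, labels, out_path):
    """Write rows of '<topic_id> <doc_id> <label>', positives first as in the sample above."""
    with open(out_path, "w") as f:
        for doc_id, label in sorted(labels.items(), key=lambda kv: -kv[1]):
            f.write(f"{topic_id} {doc_id} {label}\n")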
Q2) (6 marks) Implement the algorithm (two algorithms for a pair) using Python. You also need to discuss the output to justify why the proposed algorithm is likely to generate high-quality training sets. You may use figures to support the justification.
Q3) (3 marks) BM25-based baseline model implementation (see the week 8 workshop). Please use the titles as queries to rank documents for each topic, and save the results into 50 files, e.g., BaselineResult1.dat, ..., BaselineResult50.dat, where each row includes the document number and the corresponding relevance degree or ranking (in descending order). The following is a possible result (not the answer) for Topic 102 (in BaselineResult2.dat); an illustrative BM25 sketch follows the listing:
73038 5.898798484774149
26061 4.273638903483098
65414 4.1414522450167475
57914 3.967136888209526
58476 3.708467957856744
76635 3.5867337114200843
12769 3.4341129093591456
12767 3.352170358051889
25096 2.7646308089876177
78836 2.6823617071618404
82227 2.6056189593652537
26611 2.3595327588643613
24515 2.2258395867976226
33172 2.218657303566887
33203 2.2027873338265396
29908 2.188504022701605
…
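A minimal BM25 sketch is given below for reference only; it uses the common k1 = 1.2, b = 0.75 defaults and a non-negative idf variant, so the exact formula (and any stopping/stemming) should follow the week 8 workshop version where they differ.

import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """docs: {doc_id: list_of_terms}. Returns [(doc_id, score)] sorted in descending order."""
    N = len(docs)
    avgdl = sum(len(terms) for terms in docs.values()) / N
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))                      # document frequency of each term
    scores = {}
    for doc_id, terms in docs.items():
        tf, dl, score = Counter(terms), len(terms), 0.0
        for q in query_terms:
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * dl / avgdl))
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

The sorted (doc_id, score) pairs can then be written straight into the corresponding BaselineResultN.dat file.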
Part II: Information Filtering Model
Q4) (5 marks) Design an information filtering model (your WSM) that includes both a training algorithm and a testing algorithm (for an individual), or two information filtering models (for a pair), in plain English, illustrating your idea for using the training set discovered in Part I to learn the model. Please note that the keywords (terms) you select from the discovered training set should be very important for each given topic.
You will use the following input and output for the training algorithm, which selects some useful features:
Input: D = D+ ∪ D−
Output: Features
For the testing algorithm, you will have the following input and output:
Input: U (e.g., folder "Training102")
Output: sorted U
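As an illustration only (not the required model), one simple training algorithm is a Rocchio-style weighting that keeps the terms whose average frequency in D+ most exceeds their average frequency in D−; the function name and the top_n cut-off are assumptions.

from collections import Counter

def train_features(pos_docs, neg_docs, top_n=20):
    """pos_docs, neg_docs: lists of term lists (D+ and D-). Returns {term: weight} for the top_n features."""
    def mean_tf(doc_list):
        counts = Counter()
        for terms in doc_list:
            counts.update(terms)
        return {t: counts[t] / max(len(doc_list), 1) for t in counts}
    pos, neg = mean_tf(pos_docs), mean_tf(neg_docs)
    weights = {t: pos[t] - neg.get(t, 0.0) for t in pos}   # reward D+ terms, penalise D- terms
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n])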
Q5) (5 marks) Implement your WSM (or two models for a pair) in Python. You need to find useful features (e.g., terms) and their weights for every topic using the proposed training algorithm (in Q4) and store them in a data structure or a file. For all documents in U, you also need to calculate a relevance score for each document using the proposed testing algorithm, sort the documents in U for each topic according to their relevance scores, and save the results into "result1.dat" to "result50.dat" for the 50 topics, where each row includes the document number and the corresponding relevance score or ranking (in descending order).
The following is a possible result (not the answer) for Topic 102 (in result2.dat); an illustrative scoring sketch follows the listing:
73038 5.898798484774149
26061 4.273638903483098
65414 4.1414522450167475
57914 3.967136888209526
58476 3.708467957856744
76635 3.5867337114200843
12769 3.4341129093591456
12767 3.352170358051889
25096 2.7646308089876177
78836 2.6823617071618404
…
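A minimal sketch of the testing side, assuming a {term: weight} feature dictionary from the training step; the relevance score here is a simple weighted term-frequency sum, shown only to illustrate producing the sorted result files.

def rank_with_features(features, docs, out_path):
    """docs: {doc_id: list_of_terms}. Writes '<doc_id> <score>' rows in descending score order."""
    scores = {doc_id: sum(features.get(t, 0.0) * terms.count(t) for t in set(terms))
              for doc_id, terms in docs.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    with open(out_path, "w") as f:
        for doc_id, score in ranked:
            f.write(f"{doc_id} {score}\n")
    return ranked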
Part III: Evaluation
Q6) (5 marks) Implement a Python program to calculate top-10 precision, recall and F1 (you may use extra measures, e.g., average precision) for both the baseline model and your WSM on all topics, using the provided relevance judgements for each topic, and save the results into "EvaluationResult.dat". Please note that you can use the evaluation results to update your WSM.
For each topic, e.g., Topic 102, you should use the following inputs for your WSM; the output includes all evaluation results for the 50 topics:
Input: “result2.dat” and “Training102.txt”
Output: EvaluationResult.dat
The following is a possible result (not the answer) in a CSV file; an illustrative metrics sketch follows the table:
Topic | Precision | Recall | F1 |
101 | 0.130435 | 0.428571 | 0.20 |
102 | 0.020100 | 0.029630 | 0.023952 |
103 | 0.046875 | 0.214286 | 0.076923 |
… |
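A minimal sketch of the per-topic measures, assuming ranked_doc_ids is the sorted document list for one topic (e.g., read from result2.dat) and relevant is the set of document numbers judged relevant in the provided judgements. Precision is computed here over the top k retrieved documents and recall against all judged-relevant documents, which is one common reading of "top-10"; adjust it to the definition used in the unit.

def evaluate_topic(ranked_doc_ids, relevant, k=10):
    """Top-k precision, recall and F1 for a single topic."""
    retrieved = ranked_doc_ids[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1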
Q7) (5 marks) You will get the 5 marks if you can prove that your WSM is significantly better than the baseline model (you can choose any measure used in Q6); otherwise, you will lose the 5 marks. Please use a "t-test" to help you answer this question; an illustrative sketch is given below.
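A paired t-test is appropriate here because the two models are evaluated on the same 50 topics. A minimal sketch, assuming SciPy is available and that wsm_scores and baseline_scores hold the same measure (e.g., top-10 precision) per topic, in the same topic order:

from scipy import stats

def compare_models(wsm_scores, baseline_scores, alpha=0.05):
    """Paired t-test over per-topic scores; one-sided check that the WSM improves on the baseline."""
    t_stat, p_value = stats.ttest_rel(wsm_scores, baseline_scores)
    significantly_better = (t_stat > 0) and (p_value / 2 < alpha)
    return t_stat, p_value, significantly_better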
Please Note
- Your programs should be well laid out, easy to read and well commented.
- All items submitted should be clearly labelled with your name and student number.
- Marks will be awarded for programs (correctness, programming style, elegance, commenting) and evaluation results, according to the marking guide.
- You will lose marks for missing or inaccurate statements of completeness, and for missing files or items.