Towards Collaborative Question Answering: A Preliminary Study

Xiangkun Hu 1,*, Hang Yan 2,*, Qipeng Guo 1, Xipeng Qiu 2, Weinan Zhang 3, Zheng Zhang 1
1 AWS Shanghai AI Lab
2 School of Computer Science, Fudan University
3 Shanghai Jiao Tong University
{xiangkhu, gqipeng, zhaz}@amazon.com, {hyan19, xpqiu}@fudan.edu.cn, wnzhang@apex.sjtu.edu.cn
* Equal contribution.

Abstract

Knowledge and expertise in the real world can be disjointly owned. To solve a complex question, collaboration among experts is often called for. In this paper, we propose CollabQA, a novel QA task in which several expert agents, coordinated by a moderator, work together to answer questions that cannot be answered by any single agent alone. We build a synthetic dataset from a large knowledge graph that can be partitioned among experts. We define the process for forming a complex question from a ground-truth reasoning path, neural agent models that can learn to solve the task, and evaluation metrics to measure performance. We show that the problem is challenging without introducing a prior on the collaboration structure, unless the experts are perfect and uniform. Based on this experience, we elaborate the extensions needed to approach collaboration tasks in real-world settings.

1 Introduction

One of the fascinating aspects of human activity is collaboration: despite the limitations of our individual experience and knowledge, we can collaborate to solve a problem too challenging for any one person alone. In the context of this paper, we are interested in collaboration via rounds of questions and answers internal to a panel of experts responding to an external question. Forms of such activities have broadened into the realm of robots as well. For instance, customer service is automated with the backing of machine agents, each holding expert knowledge in a specific domain.

Figure 1 shows a hypothetical customer-service example, where an AI agent is serving a customer who is about to place an order for a mask. Even though the agent has access to the features (e.g., "N95") of the mask in its local database, it cannot answer the question "Can this mask protect me from COVID-19?". Instead of responding with "Sorry, I don't know." as most current QA systems do, it can reroute a new question, "Can the N95 mask prevent COVID-19?", to a human expert.

Figure 1: An example in a hypothetical customer-service scenario. The customer asks a question about a feature of the product in the order he/she is about to place. However, the database of the service agent does not contain the information. Instead of responding with something like "Sorry, I don't know", a better way is to get help from human experts or other QA agents.

We call this task CollabQA: a single agent (human or robot) cannot reason about and respond to a complex question, but collectively the agents can. In other words, knowledge is not shared across agents, but its union contains the required reasoning path, which necessitates collaboration. Solving the problem in its general form is hard.
In this paper, we take a few steps forward by proposing 1) a simplified version of the CollabQA task, in which one front-serving agent decomposes an external question into simple ones for the rest of the experts to answer; 2) a synthetic dataset built on a large knowledge graph that can be partitioned among experts; and 3) a set of baseline models and the associated evaluation metrics.
Despite this very simple form, we show that the problem can be very challenging. Our overall conclusion is that, even in such a simple setting, where 1) knowledge is clearly decomposed, 2) collaboration is passive, and 3) questions and answers are formed with simple templates and node prediction, training a good collaboration policy remains challenging unless we add a strong prior reflecting the collaboration structure and assume collaborators that are both perfect and uniform. We use these experiences to reflect on how to improve this task so that it gradually approaches collaboration tasks with a more real-world flavor.

The rest of the paper is organized as follows. Section 3 formally defines the CollabQA task setting and presents the toy dataset we synthesized for a preliminary study. Section 4 describes our proposed approach for the task. We show experimental results and the lessons learned in Section 5. Section 6 surveys work related to CollabQA and discusses the key differences. Section 7 discusses potential directions for future work.

2 Opening Remarks

This paper was initially submitted to EMNLP 2020 on June 3, 2020. The reviewers' primary concern was the lack of experiments on real data, and the paper was rejected. We thought we might later have time to polish it further, but our team's research direction shifted to other fields, so we did not have the chance to go deeper in this direction. Some of the settings and discussions may still be interesting to the community, so we decided to release this paper on arXiv. Since 2020, we have noticed more related papers; we list some of them in this section for the readers' reference and leave the rest of the paper almost unaltered from its initial version.

To make agents collaborate, we usually need to decompose a complex task into simpler ones so that different agents can tackle these simple tasks. Wolfson et al. (2020) defined several operators such that a complex question is decomposed into sub-queries, each containing only one operator. Based on this principle, Wolfson et al. (2020) annotated a large dataset, BREAK, which can serve as a good starting point for collaborative question answering (QA). He et al. (2017) proposed a dataset that requires two people, each with a distinct private list of friends, to find their mutual friends through conversation. CEREALBAR, proposed in (Suhr et al., 2019), is a collaborative game that requires an instructor and a follower to collaborate to gather three cards in a virtual environment. The instructor can use natural language to pass messages to the follower, but not vice versa, and has to learn to give better instructions to achieve better scores. Khot et al. (2021a) proposed to use natural language to make several existing QA models collaborate so that together they can solve a question that cannot be solved by any existing QA model alone. They further proposed a synthetic benchmark, COMMAQA, to facilitate research on collaborative QA (Khot et al., 2021b).
3 The CollabQA Task

3.1 Notations and Settings

The general setting of CollabQA simulates a group of panelists {P_i}_{i=0}^{n}, among which P_0 is special: it is the front-serving receptionist and the representative to the external world, and it also moderates the collaboration among {P_i}_{i=1}^{n}, whom we term the panelists. When P_0 receives an external question Q, it broadcasts an utterance q^{(1)} to the panelists and collects their responses {u_i^{(1)}}_{i=1}^{n}. This process continues iteratively, each round being a tuple (q^{(t)}, {u_i^{(t)}}_{i=1}^{n}), until a maximum of T turns is reached and/or P_0 is able to generate the final response, which may be "UNK", meaning "I don't know". The notations used in this paper are listed in Table 1.

Table 1: Notations of CollabQA.
  P_i        The i-th panelist.
  Q          The external complex question.
  q^{(t)}    Utterance by P_0 at the t-th dialog turn.
  u_i^{(t)}  Response of P_i at the t-th dialog turn.
  KG_i       Knowledge graph owned by P_i.
  τ(Q)       The reasoning path of Q.
  T          Number of dialog turns.

The panelists {P_i}_{i=1}^{n} own a list of knowledge graphs KG_1, KG_2, ..., KG_n, and their union KG = ∪_{i=1}^{n} KG_i is the total graph. Questions are usually complex in the sense that they cannot be answered by any single agent; however, they are always answerable from KG. In other words, τ(Q), the reasoning path of question Q, may cut across different graphs but is always contained within KG. As such, P_0 must generate multiple polls to the panelists to stitch together τ(Q). Our objective is to minimize the total number of turns while maximizing the success rate.
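To make the protocol concrete, here is a minimal sketch of the dialog loop described above. The Moderator and Panelist interfaces (reset, next_utterance, is_final, final_answer, answer) are hypothetical placeholders for illustration, not part of any released code.

```python
from dataclasses import dataclass
from typing import List

UNK = "UNK"

@dataclass
class DialogTurn:
    question: str         # utterance q^(t) broadcast by P_0
    responses: List[str]  # responses u_i^(t) from P_1..P_n

def collab_qa(moderator, panelists, external_q: str, max_turns: int = 4) -> str:
    """Run the CollabQA protocol: P_0 broadcasts sub-questions until it
    returns a final answer or the turn budget T is exhausted."""
    history: List[DialogTurn] = []
    moderator.reset(external_q)
    for _ in range(max_turns):
        q_t = moderator.next_utterance(history)            # ask a sub-question or decide to stop
        if moderator.is_final(q_t):
            return moderator.final_answer(history)
        responses = [p.answer(q_t) for p in panelists]     # each panelist answers or says "UNK"
        history.append(DialogTurn(q_t, responses))
    return UNK                                             # give up: "I don't know"
```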
3.2 A Toy Task

Inspired by the bAbI tasks (Weston et al., 2015), we construct a CollabQA dataset consisting of a series of QA pairs and three supporting knowledge graphs. We first construct KG_1, KG_2, and KG_3, consisting of fabricated person, company, and city entities and their relations. They store the knowledge of N_1 persons, N_2 companies, and N_3 cities, respectively. The details of the three knowledge graphs are listed in Appendix A. They are assigned to panelists P_1, P_2, and P_3, respectively, as their knowledge.

We then synthesize QA pairs from the knowledge graphs together with their reasoning paths. Each question requires cross-graph multi-hop reasoning. To illustrate the process, consider how a 2-hop question is created: starting from a node "Person#1" in KG_1, we follow a path of many-to-one or one-to-one relations, for example the "birthplace" relation, and obtain the triplet (Person#1, birthplace, City#4); then we start from the node "City#4" in KG_3 and find the triplet (City#4, largest_company, Company#4). We combine the two triplets into a reasoning path:

Person#1 --birthplace--> City#4 --largest_company--> Company#4    (1)

so the final answer is the entity Company#4 at the end of the path. The question Q asking about Company#4 along this reasoning path is "What is the largest company in the city where Person#1 was born?", which is generated from templates. A more complex example is shown in the upper part of Figure 2.

The reason we restrict the search to many-to-one or one-to-one relations is that it ensures the entities occurring on the path are unique, so that we can decompose the question into sub-questions, each with a unique answer. In general, to generate an n-hop question, we randomly pick an entity node and perform an n-hop depth-first search (DFS). Note that multiple edges may exist between a pair of entities (as when a person lives and dies in the same city).

We restrict the communication among the panelists to natural language; therefore, P_0 needs to learn how to ask questions. To alleviate the burden of text generation, we pre-define a set of sub-question templates. For τ(Q) in Equation 1, the sub-questions are "Which city was Person#1 born in?" for the first hop and "What is the largest company in City#4?" for the second hop. The bottom part of Figure 2 shows an ideal collaborative process. Table 2 lists the overall statistics of the dataset.

Table 2: Overall statistics of the CollabQA dataset.
  Train set size                      66,800
  Dev set size                         8,350
  Test set size                        8,350
  # of templates of Q                     49
  # of templates of simple questions      28

Figure 2: Illustration of the toy CollabQA task: an example QA pair and the ideal collaborative process. The external question Q is "When was the largest company in the city where Person#1 was born established?" (answer: 2010.2.8). In the ideal dialog, P_0 asks "Which city was Person#1 born in?" (P_1: City#4), then "What is the largest company in City#4?" (P_3: Company#4), then "When was Company#4 established?" (P_2: 2010.2.8), and finally returns the answer 2010.2.8; panelists that cannot answer a sub-question respond with "UNK".
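As an illustration, the 2-hop example above can be produced with a few lines of code. This is a minimal sketch using the entities and templates of the running example; the helper names (sample_path, SUB_TEMPLATES, Q_TEMPLATE) are ours and not from the released dataset code.

```python
# Toy fragments of KG_1 and KG_3 as (head, relation) -> tail lookups.
# Only many-to-one / one-to-one relations are used, so every hop has a unique answer.
KG = {
    ("Person#1", "birthplace"): "City#4",        # from KG_1
    ("City#4", "largest_company"): "Company#4",  # from KG_3
}

# Templates for the sub-questions and for the composed external question Q.
SUB_TEMPLATES = {
    "birthplace": "Which city was {e} born in ?",
    "largest_company": "What is the largest company in {e} ?",
}
Q_TEMPLATE = "What is the largest company in the city where {e} was born ?"

def sample_path(start, relations):
    """Follow the given functional relations from `start`, returning the reasoning path."""
    path, node = [], start
    for rel in relations:
        nxt = KG[(node, rel)]
        path.append((node, rel, nxt))
        node = nxt
    return path

path = sample_path("Person#1", ["birthplace", "largest_company"])
question = Q_TEMPLATE.format(e="Person#1")
answer = path[-1][2]                                                  # Company#4
sub_questions = [SUB_TEMPLATES[rel].format(e=head) for head, rel, _ in path]
print(question, "->", answer)
print(sub_questions)
```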
3.3 Links to Other QA Tasks

CollabQA can be regarded as a combination of several kinds of QA tasks: knowledge graph question answering (KGQA), multi-hop QA, and multi-turn dialogue.

KGQA. In CollabQA, each panelist is a KGQA system. KGQA assumes that each panelist can answer questions according to its own KG, though this may require one or several steps of reasoning.

Multi-hop QA. In multi-hop QA, the supporting facts of a question are scattered across different sources, and most models assume they can access all the sources. In CollabQA, by contrast, the supporting facts are separately owned by different panelists, and each supporting fact is accessible only to its owner. Therefore, the panelists need to communicate with each other to exchange information.

Multi-turn dialogue. Multi-turn dialogue usually occurs between a human and an agent, whereas CollabQA aims to develop multi-turn interactions among several agents (the panelists).

As such, CollabQA is more challenging than KGQA, multi-hop QA, and multi-turn dialogue.

4 Proposed Approach

In our setting, the panelists collaborate passively: they respond with what they know, or else with "UNK". Therefore, P_0 leads the collaboration process. Our general approach consists of two stages: 1) pre-train the panelists with supervised learning; 2) train the collaboration policy with reinforcement learning.

4.1 Panelists

Panelists share the same model architecture: a Graph Encoder that encodes the knowledge graph into a graph representation matrix H(KG), a Question Encoder that encodes the incoming question q^{(t)} into h(q^{(t)}), and a Node Selector that takes both and picks an entity as the answer.

Graph Encoder. Without ambiguity, we denote each knowledge graph owned by an expert agent as KG = (V, E). KG is a heterogeneous graph consisting of different types of entities V and their relations E. Each relation is a triplet (u, rel, v), where u, v ∈ V and rel ∈ R, with R the set of relation types.

We adopt a modified version of the Relational Graph Convolutional Network (R-GCN) (Schlichtkrull et al., 2018) as the graph encoder. R-GCN encodes the graph by aggregating neighbor and edge information into the nodes. Given a node v in KG, let h_v^{(l)} denote its representation at the l-th layer of the R-GCN; then

h_v^{(l+1)} = \delta\Big( \sum_{rel \in \mathcal{R}} \sum_{u \in \mathcal{N}_v^{rel}} \frac{1}{c_{v,rel}} W_{rel}^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)} \Big),    (2)

where δ is an activation function, N_v^{rel} denotes the neighbors of v that have relation rel with v, W_{rel}^{(l)} is the weight matrix of relation rel at the l-th layer, and 1/c_{v,rel} is a normalization factor. After L aggregations, the final node representations form H(KG).

However, R-GCN suffers from high GPU memory usage, which makes it hard to scale to large graphs. The reason is that computing the messages involves a dense tensor operation that produces a very large tensor, especially when the number of relations is large; computing the messages with a for-loop instead slows down the aggregation. To address this problem, we make modifications similar to (Vashishth et al., 2020), but simpler and sufficient for the CollabQA dataset: each relation is modelled by a trainable vector h_{rel} instead of a matrix W_{rel}, and the aggregation becomes

h_v^{(l+1)} = \delta\Big( \frac{1}{c_{v,rel}} \sum_{rel,\, u \in \mathcal{N}_v^{rel}} \mathrm{MLP}^{(l)}\big([h_v^{(l)}, h_{rel}^{(l)}, h_u^{(l)}]\big) \Big).    (3)
In our experiments, we observe significant GPU memory savings. Our implementation leverages the DGL package for its superior GPU performance (Wang et al., 2019).

Question Encoder. Similarly, we denote the question sent to an expert agent by q. The representation of a question q is computed by a BiLSTM (Hochreiter and Schmidhuber, 1997):

h(q) = \mathrm{BiLSTM}(q).    (4)

Node Selector. The node selector performs an attention operation with h(q) over H(KG) and returns attention scores α(KG) on H(KG) as the likelihood of selecting each node as the answer:

\alpha(KG) = H(KG)^{\top} W\, h(q),    (5)

and the answer is the value of the node with the highest attention score.
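The following is a minimal PyTorch-style sketch of the panelist architecture described above (relation vectors with MLP messages as in Eq. 3, a BiLSTM question encoder as in Eq. 4, and the dot-product node selector of Eq. 5). It is a schematic re-implementation for illustration only; the actual experiments use DGL, and details such as batching, multiple layers, and the normalization constants are simplified here.

```python
import torch
import torch.nn as nn

class Panelist(nn.Module):
    """Sketch of a panelist: graph encoder (Eq. 3), BiLSTM question encoder (Eq. 4),
    and dot-product node selector (Eq. 5)."""
    def __init__(self, n_nodes, n_rels, vocab_size, dim=40, hidden=80):
        super().__init__()
        self.node_emb = nn.Embedding(n_nodes, hidden)    # h_v^(0)
        self.rel_emb = nn.Embedding(n_rels, hidden)      # one trainable vector per relation
        self.msg_mlp = nn.Linear(3 * hidden, hidden)     # MLP over [h_v, h_rel, h_u]
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.q_enc = nn.LSTM(dim, hidden // 2, bidirectional=True, batch_first=True)
        self.W = nn.Linear(hidden, hidden, bias=False)   # attention weight in Eq. 5

    def encode_graph(self, triplets):
        """One aggregation layer over integer-indexed (u, rel, v) triplets."""
        h = self.node_emb.weight
        src = torch.tensor([u for u, _, _ in triplets])
        rel = torch.tensor([r for _, r, _ in triplets])
        dst = torch.tensor([v for _, _, v in triplets])
        msgs = self.msg_mlp(torch.cat([h[dst], self.rel_emb(rel), h[src]], dim=-1))
        agg = torch.zeros_like(h).index_add(0, dst, msgs)               # sum messages per node
        deg = torch.zeros(h.size(0)).index_add(0, dst, torch.ones(len(triplets)))
        return torch.relu(agg / deg.clamp(min=1).unsqueeze(1))          # H(KG)

    def forward(self, question_ids, triplets):
        H = self.encode_graph(triplets)                                 # (|V|, hidden)
        _, (h_n, _) = self.q_enc(self.word_emb(question_ids).unsqueeze(0))
        h_q = h_n.transpose(0, 1).reshape(-1)                           # h(q), Eq. 4
        scores = H @ self.W(h_q)                                        # alpha(KG), Eq. 5
        return scores.argmax()                                          # index of the answer node
```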
4.2 Moderator and Collaboration Policy

P_0 coordinates the collaboration according to a learned collaboration policy. At turn t, P_0 takes an action a^{(t)} according to its current state s^{(t)} = f_s(d^{(t)}), where d^{(t)} is the dialog history up to turn t:

d^{(t)} = [Q, q^{(1)}, \{u_i^{(1)}\}_{i=1}^{n}, \cdots, q^{(t-1)}, \{u_i^{(t-1)}\}_{i=1}^{n}].

The state encoder f_s(·) can be any neural model; here we use a BiLSTM.

The action space includes asking a new sub-question or returning the final answer. To alleviate the burden of text generation, P_0 generates a sub-question by selecting a template from a predefined set U. To enable P_0 to decide whether to finish the collaboration and return the answer to Q, we add a special template to U that stands for "finish the collaboration". We implement the collaboration policy π(a^{(t)} | s^{(t)}) with a simple multi-layer perceptron (MLP) that takes s^{(t)} as input and outputs a probability distribution over the list of templates. In the CollabQA dataset, at each dialog turn only one of the panelists' answers is not "UNK"; once a template is selected, we fill in its placeholder with this answer and update s^{(t)} to generate q^{(t+1)} or the final answer.

Reward. We use the number of correct answers as the baseline reward. For each question, obtaining a correct answer within T_max turns yields a reward r = +1; otherwise, the reward is r = -1. To alleviate reward sparsity, we assign the reward r to all actions in the trajectory. In addition, we add an entropy regularization term to encourage exploration (Haarnoja et al., 2018). We apply the policy gradient method to train P_0. The gradient of the policy is

\nabla_{\theta} J = \mathbb{E}_{\tau \sim \pi}\Big[ \sum_{t=1}^{T} \big( r\, \nabla_{\theta} \log \pi(a^{(t)} \mid s^{(t)}) + \nabla_{\theta} \max\big(0,\, C - H(\pi(\cdot \mid s^{(t)}))\big) \big) \Big],    (6)

where T is the number of dialog turns, C is a hyper-parameter, and θ denotes the parameters of the policy.

In our simple setting, we can introduce an inductive bias specifically tailored to improve learning. Since the experts do not share knowledge, exactly one response should differ from "UNK" in each turn; we therefore add an extra negative reward β (β < 0) whenever this is not the case. The reward r is then redefined as

r = \begin{cases} 1+\beta, & \text{if not exactly one answer}, \\ -1, & \text{if wrong answer}, \\ +1, & \text{if right answer}, \end{cases}    (7)

where β is a hyper-parameter. As this setting adds prior information, we call it the enhanced reward setting.

Figure 3: Training curves for P_0 with the baseline and enhanced rewards. The final answer accuracy (EMA) tends to converge, while the reasoning path accuracy (EMP) fluctuates. Adding the prior makes training faster and achieves better final accuracy, but cannot reduce the fluctuation of the reasoning path accuracy.
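To summarize the training of P_0, here is a schematic sketch of the template-selection policy, the shaped reward, and the entropy-regularized REINFORCE loss. The class and function names are ours, the reward cases follow our reading of Eq. 7, and the sign convention of the entropy term (a hinge that keeps entropy above the threshold C) is our interpretation of Eq. 6, not a verbatim reproduction of the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Moderator(nn.Module):
    """P_0: a BiLSTM state encoder over the flattened dialog history and an MLP policy
    over the sub-question templates plus one extra 'finish' action."""
    def __init__(self, vocab_size, n_templates, dim=40, hidden=40):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.state_enc = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.policy = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, n_templates + 1))

    def forward(self, history_ids):
        """history_ids: token ids of d^(t); returns log pi(. | s^(t))."""
        _, (h_n, _) = self.state_enc(self.emb(history_ids).unsqueeze(0))
        s_t = h_n.transpose(0, 1).reshape(-1)                      # state s^(t)
        return F.log_softmax(self.policy(s_t), dim=-1)

def shaped_reward(correct, exactly_one_informative, beta=-0.2):
    """Cf. Eq. 7 (enhanced setting, our reading): the structural penalty beta applies
    whenever not exactly one panelist gave a non-'UNK' answer."""
    if not exactly_one_informative:
        return 1.0 + beta
    return 1.0 if correct else -1.0

def reinforce_loss(log_prob_dists, actions, reward, C=0.1):
    """REINFORCE with the trajectory reward assigned to every action (cf. Eq. 6), plus
    a hinge penalty that keeps the average policy entropy above the threshold C."""
    chosen = torch.stack([dist[a] for dist, a in zip(log_prob_dists, actions)])
    entropy = torch.stack([-(dist.exp() * dist).sum() for dist in log_prob_dists]).mean()
    return -reward * chosen.sum() + torch.clamp(C - entropy, min=0.0)
```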
5 Experiments and Analysis

5.1 Experimental Setup

Pre-training of the panelists. We first train P_1, P_2, and P_3 with the sub-questions and answers that appear in the training set. The well-trained panelists are then fixed and serve as the environment while training P_0. Their performance is shown in Table 3.

Table 3: Performance of the pre-trained panelists when asked one-hop questions on their domain knowledge.
            P_1    P_2    P_3
  Accuracy  99.6   99.6   100

The hyper-parameters used in our experiments are listed in Table 4.

Table 4: The hyper-parameters used to learn the panelists and P_0.
  R-GCN layers              1
  R-GCN hidden size         80
  Embedding dim.            40
  Bi-LSTM hidden size       40
  Number of epochs          1000
  Batch size                500
  Optimizer                 Adam
  Learning rate             3e-3
  Entropy threshold C       0.1
  Prior penalty reward β    -0.2

Evaluation metrics. We evaluate the performance of P_0 with two metrics: 1) EMA: exact match of the final answer; 2) EMP: the reasoning path extracted from P_0's dialog exactly matches the ground-truth path.
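Both metrics are exact-match checks; a minimal sketch (the helper names are ours):

```python
def exact_match_answer(pred_answer: str, gold_answer: str) -> bool:
    """EMA: the final answer returned by P_0 exactly matches the gold answer."""
    return pred_answer == gold_answer

def exact_match_path(pred_path, gold_path) -> bool:
    """EMP: the sequence of (sub-question, filled entity) pairs extracted from P_0's
    dialog exactly matches the ground-truth reasoning path."""
    return list(pred_path) == list(gold_path)

def evaluate(examples):
    """Aggregate EMA / EMP over (pred_answer, gold_answer, pred_path, gold_path) tuples."""
    ema = sum(exact_match_answer(pa, ga) for pa, ga, _, _ in examples) / len(examples)
    emp = sum(exact_match_path(pp, gp) for _, _, pp, gp in examples) / len(examples)
    return ema, emp
```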
5.2 Results and Error Analysis

The main results are shown in Table 5.

Table 5: Main results of the experiments on the test set.
                     EMA    EMP
  Random             0.0    0.0
  Baseline Reward    68.8   45.3
  Enhanced Reward    80.1   52.6

The top row shows the performance of a random P_0 that picks one question to ask at random at each step. Its EMA is zero, which shows that the right answer is not easy to guess. The "Baseline Reward" row presents the results of using Equation 6 as the gradient to optimize the model; this model has one extra termination action at each step (and it should pose the termination action only at the fourth turn). The "Enhanced Reward" row adds the penalty of Equation 7 (i.e., the model knows the exact number of turns and that only one response is not "UNK").

Performance gap between the two rewards. The accuracy difference between the "Baseline Reward" and "Enhanced Reward" rows is caused by the extra action in the baseline setting: the model has the chance to terminate too early or fail to stop at the last turn. In our experiments, this improper termination accounts for nearly 9% of the total errors, whereas it is entirely avoided in the enhanced setting. The remaining 2.3% of the performance gap may be attributed to better training under the enhanced reward. Another noticeable fact is that the EMP drop from "Enhanced Reward" to "Baseline Reward" is not as large as the EMA drop; this is because the trajectories whose reasoning paths are wrong under the enhanced reward are also the ones prone to improper termination. The training curves for both rewards are presented in Figure 3; the confidence intervals are computed over 5 runs. The EMA curve is quite stable while the EMP curve fluctuates. Since the number of distinct samples in our dataset is not very large, the variance should be innately small; however, because of the data bias discussed next, the EMP fluctuates without hurting the EMA.

Fitting the data bias. Interestingly, a high answer accuracy (EMA) does not imply a high reasoning path accuracy (EMP); in both settings there is a large gap between the two. To understand the reason behind this gap, we compute the performance for each type of Q. We find that the model chooses wrong reasoning paths mostly on questions with the sub-questions "Which city does [PersonName] live in?" and "Which city was [PersonName] born in?". It turns out that in our dataset, nearly 99% of the time a person's "birthplace" and "live in" city are the same, and the model cannot distinguish the two during training: nearly all questions that should be decomposed into "Which city does [PersonName] live in?" are decomposed into "Which city was [PersonName] born in?" instead. One example is shown in Figure 4.

Figure 4: An example of the data bias in our data: 99% of persons have the same "live in" and "born in" city.

To further show how this overlap affects our results, we vary the overlap ratio, i.e., the probability that a person has the same "live in" city and "birthplace", in Figure 5.
As the overlap ratio increases, it becomes harder for P_0 to distinguish "live in" from "birthplace", but the difficulty does not grow linearly: the EMP drops sharply after some point. However, the drop in EMP has little negative effect on the EMA, since when the overlap ratio is high the model can still reach the right answer with a wrong reasoning path. In other words, the model settles on an approximate question decomposition that the final reward cannot distinguish.

Figure 5: The final answer accuracy and reasoning path accuracy with respect to the overlap ratio.
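The overlap ratio is a knob of the synthetic data generator. A minimal sketch of how such a knob could be applied when sampling a person's two city relations (the generator interface here is hypothetical, not the released dataset code):

```python
import random

def sample_person_cities(cities, overlap_ratio=0.99):
    """Sample a 'birthplace' and a 'live_in' city for one synthetic person.
    With probability `overlap_ratio` the two cities coincide, reproducing the
    bias that lets P_0 confuse the two relations without hurting EMA."""
    birthplace = random.choice(cities)
    if random.random() < overlap_ratio:
        live_in = birthplace
    else:
        live_in = random.choice([c for c in cities if c != birthplace])
    return {"birthplace": birthplace, "live_in": live_in}
```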
We note that similar data biases exist in the real world, and the model exploited this one through exploration on our dataset. Fixing it means that P_0 should inspect the semantic consistency between the sub-questions and the original question, instead of blindly selecting templates.

Beyond the data bias issue, there are numerous other error cases. For instance, an imperfect expert can pick a wrong answer that is structurally correct (e.g., answering the birth city when the question asks for the work location), which leads to a correct decomposition but a wrong final answer. Even though our panelists are nearly perfect, such small errors accumulate over the turns and noticeably affect P_0's performance.

The problem of group bias. The data bias described above points to another kind of bias: while learning to collaborate, P_0 may fit the bias of the panelists. This is intuitive, since the panelists are the environment and any bias therein will lead to bias in P_0. What is more interesting in a collaboration setting is that such bias is per group. To verify this, we train three groups of panelists with different initializations, which we call Panel(1), Panel(2), and Panel(3); each group has performance similar to Table 3. We then train three versions of P_0, each paired with a different group of panelists: P_0^(i) is trained with Panel(i). At test time, we pair each P_0 with each group of panelists. The resulting answer accuracies are listed in Table 6.

Table 6: Results of pairing each P_0 with different groups of panelists at test time. Each column shows how one version of P_0 performs with different panels; the diagonal entries are where P_0 is paired with the group it was trained on.
             P_0^(1)   P_0^(2)   P_0^(3)
  Panel(1)    81.7      82.5      76.4
  Panel(2)    80.0      84.6      74.2
  Panel(3)    80.8      83.6      81.9

The results show that there is always a performance drop when P_0 is paired with panelists it was not trained with.

6 Related Work

KGQA. In the simplified setting of CollabQA, each panelist is a simple KGQA system: the questions are either simple or need one step of reasoning to be reformulated into another simple question. KGQA has been widely studied. The most common approach is semantic parsing, where a semantic parser maps a natural language question to a formal query such as SPARQL, λ-DCS (Liang et al., 2011), or FunQL (Liang et al., 2011). Previous work on KGQA can be categorized into classification-based, ranking-based, and translation-based methods (Chakraborty et al., 2019). Our panelist model is most closely related to classification-based methods, which assume the target formal query has a fixed structure and predict its elements. For example, in the SimpleQuestions benchmark (Bordes et al., 2015), all questions are factoid questions that need one-step reasoning; SimpleQuestions has been approached by various neural models (He and Golub, 2016; Dai et al., 2016; Yin et al., 2016; Yu et al., 2017; Lukovnikov et al., 2017; Mohammed et al., 2018; Petrochuk and Zettlemoyer, 2018; Huang et al., 2019).
Another line of KGQA approaches leverages knowledge graph embeddings to make full use of the structural information of KGs (Huang et al., 2019).

Multi-hop QA. To answer a multi-hop question, multiple supporting facts are needed. WikiHop (Welbl et al., 2018) and HotpotQA (Yang et al., 2018) are recently proposed multi-hop QA datasets for text understanding. Different from multi-hop QA, the supporting facts in CollabQA are separately owned by different panelists, and each supporting fact is accessible only to its owner. Therefore, CollabQA is more challenging than multi-hop QA.
Multi-Agent Reinforcement Learning (MARL). In this paper, the panelists are passive and pre-trained, and we train only the collaboration policy in a single-agent RL setting. However, the general CollabQA should allow the panelists to discuss with each other; each panelist would then have its own policy and be able to update it. Under this general setting, CollabQA naturally falls into the realm of MARL (Buşoniu et al., 2010; Foerster et al., 2016), which is a more challenging task.

7 Discussion

The CollabQA task as it stands is very simple. Nevertheless, the experience gained is helpful for driving towards an improved setting closer to real-world scenarios. Put differently, if we were to design the task anew, what would be the most important extensions? We examine three dimensions: 1) the role and capability of the participants, 2) the collaboration structure, and 3) scaling to real-world problems.

(1) Role definition. In the current setting, the moderator P_0 has no knowledge of its own, and its capability is limited to breaking down a complex question. The panelists are domain experts whose knowledge does not overlap; they can only respond with facts, cannot proactively ask questions, and cannot reveal any reasoning path. These are much simplified assumptions that do not reflect reality. Relaxing these constraints is challenging in general; we list some of the issues below.

Consider the issue of common-sense knowledge. Although inconsistencies among individuals do exist, common sense is nevertheless the foundation from which collaboration among a collection of human experts can start. Often, common sense is required to meaningfully decompose a complex question, whether or not the panelists are involved. Take the question "Does Person#1 work in the same city as Person#2?" as an example. P_0 needs to realize that the companies and their locations are key to solving this question. These missing steps are not obvious from the question itself; they need to be inserted, and it takes common sense to deduce them, since "working city" is not a relation readily available in our KG.

A debate is interesting when there are gaps between experts, not because they have non-overlapping knowledge but more often because they have different opinions on the same facts. As such, we need to introduce overlapping knowledge imbued with different certainty (or reliability). This, in turn, requires P_0 to have the capacity to arbitrate among parallel responses from different panelists.

(2) Collaboration structure. The overall structure of a moderator working together with a group of experts is not uncommon. Even within this broad structure, there are other valid variations. For instance, instead of broadcasting, the moderator can pose a pointed question to one panelist or, more generally, to a subset of the panelists. It is also possible that the final response needs a vote when the moderator cannot resolve a disagreement.

The constraint that panelists can only passively state facts is problematic when a question is ambiguous. Consider the question "Where does [PersonName] work?". There are multiple legitimate responses (e.g., a company, a city, and/or a country). A panelist should therefore be able to ask a clarification question; drawing an exhaustive list from the KG is a possibility, but an unnatural one. As a further extension, clarification questions could be generated and responded to by any of the participants.
(3) Scaling to real-world scenarios. Despite its simplicity, our current setting is a meaningful step towards real CollabQA tasks. To get there, we believe a few more extensions are necessary.

Currently, we assume a complex question is the realization of a unique path. In general this is not true, even when the reasoning does take a multi-hop path: multiple edges can exist between a pair of entities. A lazy (or unlucky) P_0 may learn to choose only one of them if the only reward is getting the final answer right. This is exactly the problem we observed in our experiments, where "work in" and "live in" happen to overlap in their end nodes.

More generally, reasoning can take the shape of a graph (a path can be considered a degenerate graph). The earlier example ("Does Person#1 work in the same city as Person#2?") can only be solved by a two-level tree with a Boolean comparison at the root. Booking an airline ticket with both pricing and timing constraints, while the required information resides in different KGs, is similar. As a result, generating complex questions requires going beyond the perspective of a single reasoning path; a small sketch of such tree-shaped reasoning is given below.
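To make the tree-shaped reasoning concrete, here is a small sketch that evaluates the "same working city" question as a two-level tree with a Boolean comparison at the root. The lookup helpers and the tiny KG fragments are illustrative assumptions, not part of the dataset.

```python
def work_city(person, kg_persons, kg_companies):
    """Resolve a person's working city, a relation not stored directly in any single KG:
    person --work_in--> company (KG_1), company --locate_in--> city (KG_2)."""
    company = kg_persons[person]["work_in"]
    return kg_companies[company]["locate_in"]

def same_work_city(p1, p2, kg_persons, kg_companies):
    """Root of the two-level reasoning tree: a Boolean comparison of two resolved leaves."""
    return work_city(p1, kg_persons, kg_companies) == work_city(p2, kg_persons, kg_companies)

# Tiny example mirroring the KG fragments in the paper (values are illustrative).
kg_persons = {"Person#1": {"work_in": "Company#2"}, "Person#2": {"work_in": "Company#6"}}
kg_companies = {"Company#2": {"locate_in": "City#4"}, "Company#6": {"locate_in": "City#4"}}
print(same_work_city("Person#1", "Person#2", kg_persons, kg_companies))   # True
```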
In our current setting, P_0 selects a template and the panelists respond with an entity. As such, the action space of P_0 is constrained, and there is very little risk of communications getting "lost in translation." Ideally, such communication should use generated natural language; in other words, CollabQA needs natural language generation (NLG) as a component. However, doing so would be prohibitively expensive if we trained from scratch: with |V| valid words, the number of possible sentences of length L is |V|^L, and that is only for one turn. This would exponentially exacerbate the issue of sparse rewards, making training difficult. Nevertheless, we believe this is not a fundamental problem: in the context of CollabQA, learning what to ask is more important than how to ask. A more practical approach is to use transfer learning to endow the agents with NLG capability.

That said, there should be diversity in surface realization, even for semantically identical questions. This is not only a practical requirement but will also make the system more robust. It can easily be accomplished by adding noise to the templates, provided that the action space stays manageable.

8 Conclusion

The fact that knowledge is not shared gives rise to individual diversity and motivates collaboration. We believe that natural-language-based collaboration systems form a domain with practical implications and scientific value. The CollabQA task and dataset proposed in this paper are a small step in that direction.

References

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Lucian Buşoniu, Robert Babuška, and Bart De Schutter. 2010. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1, pages 183–221. Springer.

Nilesh Chakraborty, Denis Lukovnikov, Gaurav Maheshwari, Priyansh Trivedi, Jens Lehmann, and Asja Fischer. 2019. Introduction to neural network based approaches for question answering over knowledge graphs. arXiv preprint arXiv:1907.09361.

Zihang Dai, Lei Li, and Wei Xu. 2016. CFO: Conditional focused neural question answering with large-scale knowledge bases. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 800–810, Berlin, Germany. Association for Computational Linguistics.

Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870.

He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. 2017. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1766–1776, Vancouver, Canada. Association for Computational Linguistics.

Xiaodong He and David Golub. 2016. Character-level question answering with attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1598–1607, Austin, Texas. Association for Computational Linguistics.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge graph embedding based question answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, pages 105–113, New York, NY, USA. Association for Computing Machinery.

Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2021a. Text modular networks: Learning to decompose tasks in the language of existing models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, pages 1264–1279. Association for Computational Linguistics.

Tushar Khot, Kyle Richardson, Daniel Khashabi, and Ashish Sabharwal. 2021b. Learning to solve complex tasks by talking to agents. CoRR, abs/2110.08542.

Percy Liang, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 590–599, Portland, Oregon, USA. Association for Computational Linguistics.
Denis Lukovnikov, Asja Fischer, Jens Lehmann, and Sören Auer. 2017. Neural network-based question answering over knowledge graphs on word and character level. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 1211–1220, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.

Salman Mohammed, Peng Shi, and Jimmy Lin. 2018. Strong baselines for simple question answering over knowledge graphs with and without neural networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 291–296, New Orleans, Louisiana. Association for Computational Linguistics.

Michael Petrochuk and Luke Zettlemoyer. 2018. SimpleQuestions nearly solved: A new upperbound and baseline approach. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 554–558, Brussels, Belgium. Association for Computational Linguistics.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer.

Alane Suhr, Claudia Yan, Jacob Schluger, Stanley Yu, Hadi Khader, Marwa Mouallem, Iris Zhang, and Yoav Artzi. 2019. Executing instructions in situated collaborative interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pages 2119–2130, Hong Kong, China. Association for Computational Linguistics.

Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. 2020. Composition-based multi-relational graph convolutional networks. In International Conference on Learning Representations.

Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. 2019. Deep graph library: Towards efficient and scalable deep learning on graphs. CoRR, abs/1909.01315.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698.

Tomer Wolfson, Mor Geva, Ankit Gupta, Yoav Goldberg, Matt Gardner, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics, 8:183–198.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Wenpeng Yin, Mo Yu, Bing Xiang, Bowen Zhou, and Hinrich Schütze. 2016. Simple question answering by attentive convolutional neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1746–1756, Osaka, Japan. The COLING 2016 Organizing Committee.
Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved neural relation detection for knowledge base question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 571–581, Vancouver, Canada. Association for Computational Linguistics.
Appendices

A Details of the CollabQA Dataset

Structures of the three KGs. Figure 6 shows the structure and examples of the entities in our proposed knowledge graphs. G_1 contains a list of Person entities. The value of each property of an entity is randomly generated within a reasonable range; for example, a person's height is randomly sampled in the range [160 cm, 200 cm]. We add a series of constraints to make the KGs more realistic: a person without a job receives no annual income; a person cannot be a mayor and a company employee at the same time; the largest company of a city must be located in that city; and so on.

Figure 6: Structure and examples of entities in the three proposed knowledge graphs. (a) A Person entity in G_1, with relations such as gender, height, weight, birthday, birthplace, live_in, work_in, and annual_income. (b) A Company entity in G_2, with relations such as main_business, time_of_establishment, CEO, founder, chairman_of_the_board, number_of_employees, market_value, locate_in, and has_service_in. (c) A City entity in G_3, with relations such as largest_company, population, area, mayor, and the state it belongs to.

Statistics of the KGs. The detailed statistics of the three KGs are shown in Table 7.
Table 7: Statistics of the three knowledge graphs used in our experiment.

  G_1
    Number of entities: 7541
    Number of relations: 24000
    Node types: PersonName: 3000, CityName: 300, CompanyName: 1559, gender value: 2, height value: 21, weight value: 31, date value: 2597, annual income value: 31
    Relation types: height: 3000, weight: 3000, birthday: 3000, gender: 3000, birthplace: 3000, live in: 3000, work in: 3000, annual income: 3000

  G_2
    Number of entities: 7719
    Number of relations: 16000
    Node types: CompanyName: 2000, PersonName: 2600, CityName: 300, BusinessName: 20, date value: 1862, number value: 836, market value: 101
    Relation types: establish date: 2000, number of employees: 2000, ceo: 2000, founder: 2000, main business: 2000, locate in: 2000, has service in: 2000, chairman: 2000, market value: 2000

  G_3
    Number of entities: 1360
    Number of relations: 1500
    Node types: CityName: 300, PersonName: 285, CompanyName: 300, StateName: 5, area value: 211, number value: 259
    Relation types: area: 300, population: 300, mayor: 300, largest company: 300, contained by: 300
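For illustration, here is a minimal sketch of how a Person entity respecting the constraints above could be generated. The field names follow Figure 6, while the helper, the employment ratio, and the weight and income ranges are our assumptions rather than values taken from the released generator.

```python
import random

def make_person(pid, cities, companies):
    """Generate one synthetic Person entity following the constraints in Appendix A:
    properties are sampled from reasonable ranges, and a person without a job gets
    no workplace and no annual income."""
    person = {
        "name": f"person_{pid}",
        "gender": random.choice(["male", "female"]),
        "height": f"{random.randint(160, 200)}cm",                      # range stated in the paper
        "weight": f"{random.randint(45, 100)}kg",                       # assumed range
        "birthday": f"{random.randint(1950, 2000)}.{random.randint(1, 12)}.{random.randint(1, 28)}",
        "birthplace": random.choice(cities),
        "live_in": random.choice(cities),
    }
    if random.random() < 0.9:                                           # assumed employment ratio
        person["work_in"] = random.choice(companies)
        person["annual_income"] = f"${random.randrange(20_000, 200_000, 10_000):,}"
    return person

print(make_person(1, ["city_3", "city_12"], ["company_22"]))
```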