==================================
SIGIR'21 ABSTRACT
==================================
Abstract: We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query. Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions. Despite their effectiveness, current models mostly exploit dataset biases while ignoring the video content, thus leading to poor generalizability. We argue that the issue is caused by the hidden confounder in VMR, i.e., the temporal location of moments, which spuriously correlates the model input and prediction. How to design matching models that are robust against temporal location biases is crucial but, as far as we know, has not yet been studied for VMR. To fill this research gap, we propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction. Specifically, we develop a Deconfounded Cross-modal Matching (DCM) method to remove the confounding effects of moment location. It first disentangles the moment representation to infer the core feature of visual content, and then applies causal intervention on the disentangled multimodal input based on backdoor adjustment, which forces the model to fairly take each possible location of the target into consideration. Extensive experiments clearly show that our approach achieves significant improvements over state-of-the-art methods in terms of both accuracy and generalization.
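
The backdoor adjustment referred to above is the standard causal-intervention identity; as a reminder, this general form is not specific to DCM:

.. math::

   P(Y \mid \mathrm{do}(X)) \;=\; \sum_{z} P(Y \mid X, z)\, P(z)

where z denotes the confounder (here, the temporal location of the moment), so every candidate location is weighted by its prior probability rather than by its biased co-occurrence with the input.
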
Abstract: Recommender systems usually face popularity bias issues: from the data perspective, items exhibit an uneven (usually long-tail) distribution over interaction frequency; from the method perspective, collaborative filtering methods are prone to amplify the bias by over-recommending popular items. It is undoubtedly critical to consider popularity bias in recommender systems, and existing work mainly eliminates the bias effect with propensity-based unbiased learning or causal embeddings. However, we argue that not all biases in the data are bad, i.e., some items demonstrate higher popularity because of their better intrinsic quality. Blindly pursuing unbiased learning may remove the beneficial patterns in the data, degrading recommendation accuracy and user satisfaction. This work studies an unexplored problem in recommendation: how to leverage popularity bias to improve recommendation accuracy. The key lies in two aspects: how to remove the bad impact of popularity bias during training, and how to inject the desired popularity bias in the inference stage that generates top-K recommendations. This questions the causal mechanism of the recommendation generation process. Along this line, we find that item popularity plays the role of a confounder between the exposed items and the observed interactions, causing the bad effect of bias amplification. To achieve our goal, we propose a new training and inference paradigm for recommendation named Popularity-bias Deconfounding and Adjusting (PDA). It removes the confounding popularity bias in model training and adjusts the recommendation score with the desired popularity bias via causal intervention. We demonstrate the new paradigm on the latent factor model and perform extensive experiments on three real-world datasets from Kwai, Douban, and Tencent. Empirical studies validate that the deconfounded training helps discover users' real interests and that the inference adjustment with popularity bias can further improve recommendation accuracy. We release our code at https://github.com/zyang1580/PDA.
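
A minimal sketch of the inference-side adjustment described above, assuming the adjusted score is formed by scaling a popularity-free relevance score by a power of the predicted item popularity; the function names, the exponent ``gamma``, and the popularity estimate are illustrative rather than the exact PDA formulation:

.. code-block:: python

   import numpy as np

   def adjusted_scores(relevance, predicted_popularity, gamma=0.1):
       """Inject a desired amount of popularity bias at inference time.

       relevance:            popularity-deconfounded user-item scores, shape (n_items,)
       predicted_popularity: estimated future popularity of each item, shape (n_items,)
       gamma:                how much popularity bias to (re-)inject; 0 means none.
       """
       return relevance * np.power(predicted_popularity + 1e-8, gamma)

   relevance = np.array([0.9, 0.7, 0.8])
   popularity = np.array([0.02, 0.30, 0.10])   # e.g. normalized recent interaction counts
   top_k = np.argsort(-adjusted_scores(relevance, popularity))[:2]
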
Abstract: Recommender systems rely on user behavior data like ratings and clicks to build personalization models. However, the collected data is observational rather than experimental, causing various biases in the data that significantly affect the learned model. Most existing work on recommendation debiasing, such as inverse propensity scoring and imputation approaches, focuses on one or two specific biases and lacks the universal capacity to account for mixed or even unknown biases in the data. Towards this research gap, we first analyze the origin of biases from the perspective of risk discrepancy, which represents the difference between the expectation of the empirical risk and the true risk. Remarkably, we derive a general learning framework that summarizes most existing debiasing strategies by specifying some parameters of the general framework. This provides a valuable opportunity to develop a universal solution for debiasing, e.g., by learning the debiasing parameters from data. However, the training data lacks important signals of how the data is biased and what the unbiased data looks like. To move this idea forward, we propose AutoDebias, which leverages another (small) set of uniform data to optimize the debiasing parameters by solving a bi-level optimization problem with meta-learning. Through theoretical analyses, we derive the generalization bound for AutoDebias and prove its ability to acquire the appropriate debiasing strategy. Extensive experiments on two real datasets and a simulated dataset demonstrate the effectiveness of AutoDebias. The code is available at https://github.com/DongHande/AutoDebias.
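
A minimal sketch of the bi-level optimization with meta-learning described above, under simplifying assumptions: a toy linear model stands in for the recommender, per-sample weights stand in for the debiasing parameters, and a one-step look-ahead stands in for the inner optimization. None of the names or hyperparameters below come from AutoDebias itself.

.. code-block:: python

   import torch

   # Toy stand-ins: theta = recommender parameters, phi = debiasing parameters.
   theta = torch.randn(8, requires_grad=True)
   phi = torch.zeros(100, requires_grad=True)

   x_biased = torch.randn(100, 8);  y_biased = torch.randn(100)    # large biased log
   x_uniform = torch.randn(20, 8);  y_uniform = torch.randn(20)    # small uniform (unbiased) set

   opt_phi = torch.optim.Adam([phi], lr=1e-2)
   opt_theta = torch.optim.SGD([theta], lr=1e-1)

   for step in range(200):
       # Inner level: weighted empirical risk on the biased data, one virtual SGD step on theta.
       w = torch.sigmoid(phi)
       inner_loss = (w * (x_biased @ theta - y_biased) ** 2).mean()
       grad_theta = torch.autograd.grad(inner_loss, theta, create_graph=True)[0]
       theta_virtual = theta - 0.1 * grad_theta

       # Outer level: evaluate the look-ahead model on the uniform data and update phi.
       outer_loss = ((x_uniform @ theta_virtual - y_uniform) ** 2).mean()
       opt_phi.zero_grad()
       outer_loss.backward()
       opt_phi.step()

       # Real update of theta with the current (detached) debiasing weights.
       opt_theta.zero_grad()
       ((torch.sigmoid(phi).detach() * (x_biased @ theta - y_biased) ** 2).mean()).backward()
       opt_theta.step()
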
Abstract: Biases and de-biasing in recommender systems (RS) have become a research hotspot recently. This paper reveals an unexplored type of bias, i.e., sentiment bias. Through an empirical study, we find that many RS models provide more accurate recommendations on user/item groups having more positive feedback (i.e., positive users/items) than on user/item groups having more negative feedback (i.e., negative users/items). We show that sentiment bias is different from existing biases such as popularity bias: positive users/items do not have more user feedback (i.e., either more ratings or longer reviews). The existence of sentiment bias leads to low-quality recommendations to critical users and unfair recommendations for niche items. We discuss the factors that cause sentiment bias. Then, to fix the sources of sentiment bias, we propose a general de-biasing framework with three strategies manifesting in different regularizers that can be easily plugged into RS models without changing model architectures. Experiments on various RS models and benchmark datasets have verified the effectiveness of our de-biasing framework. To our best knowledge, sentiment bias and its de-biasing have not been studied before. We hope that this work can help strengthen the study of biases and de-biasing in RS.
Abstract: User feedback can be delayed in many streaming recommendation scenarios. For example, the feedback to a recommended coupon consists of immediate feedback on the click event and delayed feedback on the resultant conversion. Delayed feedback poses the challenge of training recommendation models on instances with incomplete labels. When applied to real products, the challenge becomes more severe, as streaming recommendation models need to be retrained very frequently and the training instances need to be collected over very short time scales. Existing approaches either simply ignore the unobserved feedback or heuristically adjust the feedback on a static instance set, resulting in biases in the training data and hurting the accuracy of the learned recommenders. In this paper, we propose a novel and theoretically sound counterfactual approach to adjusting the user feedback and learning the recommendation models, called CBDF (Counterfactual Bandit with Delayed Feedback). CBDF formulates streaming recommendation with delayed feedback as a sequential decision-making problem and models it with a batched bandit. To deal with the issue of delayed feedback, at each iteration (episode), a counterfactual importance sampling model is employed to re-weight the original feedback and generate modified rewards. Based on the modified rewards, a batched bandit is learned for conducting online recommendation at the next iteration. Theoretical analysis shows that the modified rewards are statistically unbiased, and that the learned bandit policy enjoys a sub-linear regret bound. Experimental results demonstrate that CBDF outperforms state-of-the-art baselines on a synthetic dataset, the Criteo dataset, and a dataset from Tencent's WeChat app.
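
A generic sketch of the underlying idea of re-weighting partially observed delayed feedback with inverse observation propensities; the delay model, function names, and numbers below are hypothetical illustrations, not CBDF's actual estimator:

.. code-block:: python

   import numpy as np

   def modified_rewards(clicks, conversions_observed, elapsed, p_observe_by):
       """Re-weight delayed conversion feedback by inverse observation propensity.

       clicks:                1 if the impression was clicked, else 0
       conversions_observed:  1 if a conversion was observed before the training cut-off
       elapsed:               time between the click and the cut-off
       p_observe_by:          probability that a true conversion is observed within
                              `elapsed` time (estimated from historical logs)
       """
       w = 1.0 / np.clip(p_observe_by(elapsed), 1e-3, 1.0)   # inverse observation propensity
       return clicks * conversions_observed * w              # unbiased in expectation

   # Usage with a hypothetical exponential delay model.
   rewards = modified_rewards(
       clicks=np.array([1, 1, 1]),
       conversions_observed=np.array([1, 0, 1]),
       elapsed=np.array([2.0, 0.5, 24.0]),
       p_observe_by=lambda t: 1.0 - np.exp(-t / 6.0),
   )
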
Abstract: Recently, exploiting a knowledge graph (KG) to enrich the semantic representation of a news article has been proven effective for news recommendation. These solutions focus on representation learning for news articles with additional information from the knowledge graph, and the user representations are mainly derived from these news representations afterwards. However, different users may hold different interests in the same news article. In other words, directly identifying the entities relevant to a user's interest and deriving the resultant user representation could enable better news recommendation and explanation. To this end, in this paper, we propose a novel knowledge-pruning-based recurrent graph convolutional network (named Kopra) for news recommendation. Instead of extracting relevant entities for a news article from the KG, Kopra is devised to identify the relevant entities from both a user's click history and the KG to derive the user representation. We first form an initial entity graph (namely, an interest graph) with seed entities extracted from news titles and abstracts. Then, a joint knowledge pruning and recurrent graph convolution (RGC) mechanism is introduced to augment each seed entity with relevant entities from the KG in a recurrent manner. That is, the entities in the neighborhood of each seed entity inside the KG that are irrelevant to the user's interest are pruned from the augmentation. With this recurrent pruning and graph convolution process, we can derive both the user's long-term and short-term representations based on her click history within long and short time periods, respectively. Finally, we introduce a max-pooling predictor over the long- and short-term user representations and the seed entities in the candidate news to calculate the ranking score for recommendation. Experimental results on two real-world datasets in two different languages suggest that the proposed Kopra obtains significantly better performance than a series of state-of-the-art alternatives. Moreover, the entity graph generated by Kopra greatly facilitates recommendation explanation.
Abstract: The most important task in personalized news recommendation is accurate matching between candidate news and user interest. Most existing news recommendation methods model candidate news from its textual content and user interest from clicked news independently. However, a news article may cover multiple aspects and entities, and a user usually has different kinds of interest. Independent modeling of candidate news and user interest may lead to inferior matching between news and users. In this paper, we propose a knowledge-aware interactive matching method for news recommendation. Our method interactively models candidate news and user interest to facilitate their accurate matching. We design a knowledge-aware news co-encoder to interactively learn representations for both clicked news and candidate news by capturing their relatedness in both semantics and entities with the help of knowledge graphs. We also design a user-news co-encoder to learn a candidate-news-aware user interest representation and a user-aware candidate news representation for better interest matching. Experiments on two real-world datasets validate that our method can effectively improve the performance of news recommendation.
Abstract: Neural graph-based Collaborative Filtering (CF) models learn user and item embeddings based on the user-item bipartite graph structure and have achieved state-of-the-art recommendation performance. In the ubiquitous implicit-feedback-based CF, users' unobserved behaviors are treated as unlinked edges in the user-item bipartite graph. As users' unobserved behaviors mix dislikes with unknown positive preferences, the fixed graph structure input misses potential positive preference links. In this paper, we study how to learn an enhanced graph structure for CF. We argue that node embedding learning and graph structure learning can mutually enhance each other in CF, as updated node embeddings are learned from the previous graph structure, and vice versa (i.e., the newly updated graph structure is optimized based on current node embedding results). Some previous works provided approaches to refine the graph structure. However, most of these graph learning models rely on node features for modeling, which are not available in CF. Besides, nearly all of their optimization goals compare the learned adaptive graph and the original graph from a local reconstruction perspective, and whether the global properties of the adaptive graph structure are modeled in the learning process remains unknown. To this end, in this paper, we propose an enhanced graph learning network, EGLN, for CF via mutual information maximization. The key idea of EGLN is twofold: first, we let the enhanced graph learning module and the node embedding module iteratively learn from each other without any feature input; second, we design a local-global consistency optimization function to capture the global properties in the enhanced graph learning process. Finally, extensive experimental results on three real-world datasets clearly show the effectiveness of our proposed model.
Abstract: Explainable recommendations provide the reasons why an item is recommended to a user, which often leads to increased user satisfaction and persuasiveness. An intuitive way to explain recommendations is by generating a synthetic personalized natural language review for a user-item pair. Although some approaches in the literature explain recommendations by generating reviews, the quality of the reviews is questionable. Besides, these methods usually take considerable time to train the underlying language model responsible for generating the text. In this work, we propose ReXPlug, an end-to-end framework with a plug-and-play way of explaining recommendations. ReXPlug predicts accurate ratings as well as exploits the Plug and Play Language Model to generate high-quality reviews. We train a simple sentiment classifier to control a pre-trained language model for generation, bypassing training the language model from scratch again. Such a simple and neat model is much easier to implement and train, and hence very efficient for generating reviews. We personalize the reviews by leveraging a special jointly-trained cross-attention network. Our detailed experiments show that ReXPlug outperforms many recent models across various datasets on rating prediction by utilizing textual reviews as a regularizer. Quantitative analysis shows that the reviews generated by ReXPlug are semantically close to the ground-truth reviews, while qualitative analysis demonstrates the high quality of the generated reviews, both from empirical and analytical viewpoints. Our implementation is available online.
Abstract: The key to personalized search is to build the user profile based on historical behaviour. To deal with users who lack historical data, group-based personalized models were proposed to incorporate the profiles of similar users when re-ranking the results. However, similar users are mostly found based on simple lexical or topical similarity in search behaviours. In this paper, we propose a neural-network-enhanced method to highlight similar users in semantic space. Furthermore, we argue that behaviour-based similar users are still insufficient to understand a new query when the user's historical activities are limited. To tackle this issue, we introduce the friend network into personalized search to determine the closeness between users in another way. Since friendship is often formed based on similar background or interests, plenty of personalized signals are naturally hidden in the friend network. Specifically, we propose a friend-network-enhanced personalized search model, which groups the user into multiple friend circles based on search behaviours and friend relations, respectively. These two types of friend circles are complementary for constructing a more comprehensive group profile to refine personalization. Experimental results show the significant improvement of our model over existing personalized search models.
Abstract: Visual search has become popular in recent years, allowing users to search by an image they take with their mobile device or upload from their photo library. One domain in which visual search is especially valuable is electronic commerce, where users seek items to purchase. In this work, we present an in-depth, comprehensive study of visual e-commerce search. We perform query log analysis of the mobile search application of one of the largest e-commerce platforms. We compare visual and textual search across a variety of characteristics, with a special focus on the retrieved results and user interaction with them. We also examine image query characteristics, refinement by attributes, and performance prediction for visual search queries. Our analysis points out a variety of differences between visual and textual e-commerce search. We discuss the implications of these differences for the design of future e-commerce search systems.
Abstract: A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing and query workflows. The neural IR community has made great advances in training effective dual-encoder dense retrieval (DR) models recently. A dense text retrieval model uses a single vector representation per query and passage to score a match, which enables low-latency first-stage retrieval with a nearest neighbor search. Increasingly, training approaches require enormous compute power, as they either conduct negative passage sampling from a continuously refreshed index or require very large batch sizes. Instead of relying on more compute capability, we introduce an efficient topic-aware query and balanced margin sampling technique, called TAS-Balanced. We cluster queries once before training and sample queries from a single cluster per batch. We train our lightweight 6-layer DR model with a novel dual-teacher supervision that combines pairwise and in-batch negative teachers. Our method is trainable on a single consumer-grade GPU in under 48 hours. We show that our TAS-Balanced training method achieves state-of-the-art low-latency (64 ms per query) results on two TREC Deep Learning Track query sets. Evaluated on NDCG@10, we outperform BM25 by 44%, a plainly trained DR model by 19%, docT5query by 11%, and the previous best DR model by 5%. Additionally, TAS-Balanced produces the first dense retriever that outperforms every other method on recall at any cutoff on TREC-DL, and allows more resource-intensive re-ranking models to operate on fewer passages to improve results further.
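
The topic-aware part of the sampling strategy can be sketched as follows; the clustering backend, embedding source, and cluster count are illustrative choices, and the balanced margin sampling of passage pairs is omitted:

.. code-block:: python

   import numpy as np
   from sklearn.cluster import KMeans

   rng = np.random.default_rng(0)
   query_vecs = rng.normal(size=(10_000, 64))   # placeholder query embeddings

   # 1) Cluster queries once before training (topic-aware grouping).
   kmeans = KMeans(n_clusters=200, n_init=10, random_state=0).fit(query_vecs)
   clusters = [np.where(kmeans.labels_ == c)[0] for c in range(200)]

   def sample_batch(batch_size=32):
       # 2) Each batch draws queries from a single cluster, so in-batch negatives
       #    come from related queries and carry more training signal.
       c = rng.integers(len(clusters))
       return rng.choice(clusters[c], size=min(batch_size, len(clusters[c])), replace=False)

   batch_query_ids = sample_batch()
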
Abstract: Product search has been a crucial entry point to serve people shopping online. Most existing personalized product models follow the paradigm of representing and matching user intents and items in the semantic space, where finer-grained matching is totally discarded and the ranking of an item cannot be explained further than just user/item level similarity. In addition, while some models in existing studies have created dynamic user representations based on search context, their representations for items are static across all search sessions. This makes every piece of information about the item always equally important in representing the item during matching with various user intents. Aware of the above limitations, we propose a review-based transformer model (RTM) for personalized product search, which encodes the sequence of query, user reviews, and item reviews with a transformer architecture. RTM conducts review-level matching between the user and item, where each review has a dynamic effect according to the context in the sequence. This makes it possible to identify useful reviews to explain the scoring. Experimental results show that RTM significantly outperforms state-of-the-art personalized product search baselines.
Abstract: Twitter is currently a popular online social media platform which allows users to share their user-generated content. This publicly generated user data is also crucial to healthcare technologies because the discovered patterns can benefit them in several ways. One of the applications is automatically discovering mental health problems, e.g., depression. Previous studies on automatically detecting depressed users on online social media have largely relied upon user behaviour and linguistic patterns, including the user's social interactions. The downside is that these models are trained on much irrelevant content, which might not be crucial for detecting a depressed user. Besides, this content has a negative impact on the overall efficiency and effectiveness of the model. To overcome the shortcomings of existing automatic depression detection methods, we propose a novel computational framework for automatic depression detection that first selects relevant content through a hybrid extractive and abstractive summarization strategy over the sequence of all user tweets, leading to more fine-grained and relevant content. The content then goes to our novel deep learning framework, a unified learning machinery that couples a Convolutional Neural Network (CNN) with attention-enhanced Gated Recurrent Units (GRUs), leading to better empirical performance than existing strong baselines.
Abstract: In this paper, we address the personalized node ranking (PNR) problem for signed networks, which aims to rank nodes in an order most relevant to a given seed node in a signed network. The recently-proposed PNR methods introduce the concept of the signed random surfer, denoted as SRSurfer, that performs the score propagation between nodes using the balance theory. However, in real settings of signed networks, edge relationships often do not strictly follow the rules of the balance theory. Therefore, SRSurfer-based PNR methods frequently perform incorrect score propagation to nodes, thereby degrading the accuracy of PNR. To address this limitation, we propose a novel random-walk based PNR approach with sign verification, named as OBOE (lOok Before yOu lEap). Specifically, OBOE carefully verifies the score propagation of SRSurfer by using the topological features of nodes. Then, OBOE corrects all incorrect score propagation cases by exploiting the statistics of a given network. The experiments on 3 real-world signed networks show that OBOE consistently and significantly outperforms 5 competing methods with improvement up to 13%, 95%, and 249% in top-k PNR, bottom-k PNR, and troll identification tasks, respectively. All OBOE codes and datasets are available at: http://github.com/wonchang24/OBOE.
Abstract: Nowadays, detecting fake news on social media platforms has become a top priority, since the widespread dissemination of fake news may mislead readers and have negative effects. To date, many algorithms have been proposed to facilitate the detection of fake news, from hand-crafted feature extraction methods to deep learning approaches. However, these methods may suffer from the following limitations: (1) they fail to utilize multi-modal context information and extract high-order complementary information for each news item to enhance fake news detection; (2) they largely ignore the full hierarchical semantics of textual content to assist in learning a better news representation. To overcome these limitations, this paper proposes a novel hierarchical multi-modal contextual attention network (HMCAN) for fake news detection by jointly modeling the multi-modal context information and the hierarchical semantics of text in a unified deep model. Specifically, we employ BERT and ResNet to learn better representations for text and images, respectively. Then, we feed the obtained representations of images and text into a multi-modal contextual attention network to fuse both inter-modality and intra-modality relationships. Finally, we design a hierarchical encoding network to capture the rich hierarchical semantics for fake news detection. Extensive experiments on three public real-world datasets demonstrate that our proposed HMCAN achieves state-of-the-art performance.
Abstract: This paper describes a novel diffusion model, DyDiff-VAE, for information diffusion prediction on social media. Given the initial content and a sequence of forwarding users, DyDiff-VAE aims to estimate the propagation likelihood for other potential users and predict the corresponding user rankings. Inferring user interests from diffusion data lays the foundation of diffusion prediction, because users often forward information in which they are interested or information from those who share similar interests. Their interests also evolve over time as a result of dynamic social influence from neighbors and time-sensitive information gained inside/outside the social media platform. Existing works fail to model users' intrinsic interests from the diffusion data and assume user interests remain static over time. DyDiff-VAE advances the state of the art in two directions: (i) we propose a dynamic encoder to infer the evolution of user interests from observed diffusion data; (ii) we propose a dual attentive decoder to estimate the propagation likelihood by integrating information from both the initial cascade content and the forwarding user sequence. Extensive experiments on four real-world datasets from Twitter and YouTube demonstrate the advantages of the proposed model; we show that it achieves 43.3% relative gains over the best baseline on average. Moreover, it has the lowest run-time compared with recurrent neural network based models.
Abstract: Knowledge tracing, which dynamically estimates students' learning states by predicting their performance on answering questions, is an essential task in online education. One typical solution for knowledge tracing is based on Recurrent Neural Networks (RNNs), which represent students' knowledge states with the hidden states of RNNs. This type of method normally assumes that students have the same cognition level and knowledge acquisition sensitivity on the same question. Thus, these methods (i) predict students' responses by referring to their knowledge states and question representations, and (ii) update the knowledge states according to the question representations and students' responses. No explicit cognition level or knowledge acquisition sensitivity is considered in the above two processes. However, in real-world scenarios, students have different understandings of a question and acquire different knowledge after finishing the same question. In this paper, we propose a novel model called Individual Estimation Knowledge Tracing (IEKT), which estimates a student's cognition of the question before response prediction and assesses their knowledge acquisition sensitivity on the question before updating the knowledge state. In the experiments, we compare IEKT with 11 knowledge tracing baselines on four benchmark datasets, and the results show IEKT achieves state-of-the-art performance.
Abstract: As a natural language generation task, it is challenging to generate informative and coherent review text. In order to enhance the informativeness of the generated text, existing solutions typically learn to copy entities or triples from knowledge graphs (KGs). However, they lack overall consideration to select and arrange the incorporated knowledge, which tends to cause text incoherence. To address the above issue, we focus on improving entity-centric coherence of the generated reviews by leveraging the semantic structure of KGs. In this paper, we propose a novel Coherence Enhanced Text Planning model (CETP) based on knowledge graphs (KGs) to improve both global and local coherence for review generation. The proposed model learns a two-level text plan for generating a document: (1) the document plan is modeled as a sequence of sentence plans in order, and (2) the sentence plan is modeled as an entity-based subgraph from KG. Local coherence can be naturally enforced by KG subgraphs through intra-sentence correlations between entities. For global coherence, we design a hierarchical self-attentive architecture with both subgraph- and node-level attention to enhance the correlations between subgraphs. To our knowledge, we are the first to utilize a KG-based text planning model to enhance text coherence for review generation. Extensive experiments on three datasets confirm the effectiveness of our model on improving the content coherence of generated texts.
Abstract: The recommender systems, which merely leverage user-item interactions for user preference prediction (such as the collaborative filtering-based ones), often face dramatic performance degradation when the interactions of users or items are insufficient. In recent years, various types of side information have been explored to alleviate this problem. Among them, knowledge graph (KG) has attracted extensive research interests as it can encode users/items and their associated attributes in the graph structure to preserve the relation information. In contrast, less attention has been paid to the item-item co-occurrence information (i.e., co-view), which contains rich item-item similarity information. It provides information from a perspective different from the user/item-attribute graph and is also valuable for the CF recommendation models. In this work, we make an effort to study the potential of integrating both types of side information (i.e., KG and item-item co-occurrence data) for recommendation. To achieve the goal, we propose a unified graph-based recommendation model (UGRec), which integrates the traditional directed relations in KG and the undirected item-item co-occurrence relations simultaneously. In particular, for a directed relation, we transform the head and tail entities into the corresponding relation space to model their relation; and for an undirected co-occurrence relation, we project head and tail entities into a unique hyperplane in the entity space to minimize their distance. In addition, a head-tail relation-aware attentive mechanism is designed for fine-grained relation modeling.
Abstract: The huge number of machine learning (ML) methods has resulted in significant information overload. Faced with an overwhelming number of ML methods, it is challenging to select appropriate ones for the given dataset and task. In general, the names of ML methods or datasets are rather condensed, thus lacking specific explanations, while the rich latent relationships between ML entities are not fully explored. In this paper, we propose a description-enhanced machine learning knowledge graph-based approach - DEKR - to help recommend appropriate ML methods for given ML datasets. The proposed knowledge graph (KG) not only includes the connections between entities but also contains the descriptions of the dataset and method entities. DEKR fuses the structural information with the description information of entities in the knowledge graph. It is a deep hybrid recommendation framework, which incorporates the knowledge graph-based and text-based methods, overcoming the limitations of previous knowledge graph-based recommendation systems that ignore the description information. There are two key components of DEKR: 1) a graph neural network aggregating information from multi-order neighbors with attention to enrich the seed (i.e. dataset or method) node's own representation, and 2) a deep collaborative filtering network based on the description text to obtain the linear and nonlinear interactions of description features. Through extensive experiments, we demonstrated the efficiency of DEKR, which outperforms the current state-of-the-art baselines by a large margin.
Abstract: Aiming at expanding few-shot relations' coverage in knowledge graphs (KGs), few-shot knowledge graph completion (FKGC) has recently gained increasing research interest. Some existing models employ a few-shot relation's multi-hop neighbor information to enhance its semantic representation. However, noisy neighbor information might be amplified when the neighborhood is excessively sparse and no neighbor is available to represent the few-shot relation. Moreover, modeling and inferring complex relations of one-to-many (1-N), many-to-one (N-1), and many-to-many (N-N) with previous knowledge graph completion approaches requires high model complexity and a large number of training instances. Thus, inferring complex relations in the few-shot scenario is difficult for FKGC models due to limited training instances. In this paper, we propose a global-local framework for few-shot relational learning to address the above issues. At the global stage, a novel gated and attentive neighbor aggregator is built for accurately integrating the semantics of a few-shot relation's neighborhood, which helps filter out noisy neighbors even if a KG contains extremely sparse neighborhoods. At the local stage, a meta-learning based TransH (MTransH) method is designed to model complex relations and train our model in a few-shot learning fashion. Extensive experiments show that our model outperforms the state-of-the-art FKGC approaches on the frequently used benchmark datasets NELL-One and Wiki-One. Compared with the strong baseline model MetaR, our model achieves 5-shot FKGC performance improvements of 8.0% on NELL-One and 2.8% on Wiki-One by the metric Hits@10.
Abstract: Sponsored search ads appear next to search results when people look for products and services on search engines. In recent years, they have become one of the most lucrative channels for marketing. As the fundamental basis of search ads, relevance modeling has attracted increasing attention due to the significant research challenges and tremendous practical value. Most existing approaches solely rely on the semantic information in the input query-ad pair, while the pure semantic information in the short ads data is not sufficient to fully identify user's search intents. Our motivation lies in incorporating the tremendous amount of unsupervised user behavior data from the historical search logs as the complementary graph to facilitate relevance modeling. In this paper, we extensively investigate how to naturally fuse the semantic textual information with the user behavior graph, and further propose three novel AdsGNN models to aggregate topological neighborhood from the perspectives of nodes, edges and tokens. Furthermore, two critical but rarely investigated problems, domain-specific pre-training and long-tail ads matching, are studied thoroughly. Empirically, we evaluate the AdsGNN models over the large industry dataset, and the experimental results of online/offline tests consistently demonstrate the superiority of our proposal.
Abstract: Financial markets are moved by events such as the issuance of administrative orders. Participants in financial markets (e.g., traders) thus pay constant attention to financial news relevant to the financial asset (e.g., oil) of interest. Due to the large scale of the news stream, it is time- and labor-intensive to manually identify influential events that can move the price of a financial asset, pushing financial participants to embrace automatic financial event ranking, which has received relatively little scrutiny to date. In this work, we formulate the financial event ranking task, which aims to score financial news (documents) according to their influence on a given asset (query). To solve this task, we propose a Hybrid News Ranking framework that, from the asset perspective, evaluates the influence of news articles by comparing their contents; and from the event perspective, assesses the influence over all query assets. Moreover, we resolve the dilemma between the essential requirement of sufficient labels for training the framework and the unaffordable cost of hiring domain experts to label the news. In particular, we design a cost-friendly system for news labeling that leverages the knowledge within published financial analyst reports. In this way, we construct three financial event ranking datasets. Extensive experiments on the datasets validate the effectiveness of the proposed framework and the rationality of solving financial event ranking through learning to rank.
Abstract: Image-recipe retrieval, which aims at retrieving the relevant recipe from a food image and vice versa, is now attracting widespread attention, since sharing food-related images and recipes on the Internet has become a popular trend. Existing methods have formulated this problem as a typical cross-modal retrieval task by learning the image-recipe similarity. Though these methods have made inspiring achievements for image-recipe retrieval, they may still be less effective at jointly incorporating three crucial points: (1) the association between ingredients and instructions, (2) fine-grained image information, and (3) the latent alignment between recipes and images. To this end, we propose a novel framework named Hybrid Fusion with Intra- and Cross-Modality Attention (HF-ICMA) to learn accurate image-recipe similarity. Our HF-ICMA model adopts an intra-recipe fusion module to focus on the interaction between ingredients and instructions within a recipe, and further enriches the expressions of the two separate embeddings. Meanwhile, an image-recipe fusion module is devised to explore the potential relationship between fine-grained image regions and ingredients from the recipe, which jointly forms the final image-recipe similarity from both local and global aspects. Extensive experiments on the large-scale benchmark dataset Recipe1M show that our model significantly outperforms state-of-the-art approaches on various image-recipe retrieval scenarios.
Abstract: Recent advances in the e-commerce fashion industry have led to an exploration of novel ways to enhance buyer experience via improved personalization. Predicting a proper size for an item to recommend is an important personalization challenge, and is being studied in this work. Earlier works in this field either focused on modeling explicit buyer fitment feedback or modeling of only a single aspect of the problem (e.g., specific category, brand, etc.). More recent works proposed richer models, either content-based or sequence-based, better accounting for content-based aspects of the problem or better modeling the buyer's online journey. However, both these approaches fail in certain scenarios: either when encountering unseen items (sequence-based models) or when encountering new users (content-based models). To address the aforementioned gaps, we propose PreSizE -- a novel deep learning framework which utilizes Transformers for accurate size prediction. PreSizE models the effect of both content-based attributes, such as brand and category, and the buyer's purchase history on her size preferences. Using an extensive set of experiments on a large-scale e-commerce dataset, we demonstrate that PreSizE is capable of achieving superior prediction performance compared to previous state-of-the-art baselines. By encoding item attributes, PreSizE better handles cold-start cases with unseen items, and cases where buyers have little past purchase data. As a proof of concept, we demonstrate that size predictions made by PreSizE can be effectively integrated into an existing production recommender system yielding very effective features and significantly improving recommendations.
Abstract: Search engines are perceived as a reliable source for general information needs. However, finding the answer to medical questions using search engines can be challenging for an ordinary user. Content can be biased and results may present different opinions. In addition, interpreting medically related content can be difficult for users with no medical background. All of these can lead users to incorrect conclusions regarding health related questions. In this work we address this problem from two perspectives. First, to gain insight on users' ability to correctly answer medical questions using search engines, we conduct a comprehensive user study. We show that for questions regarding medical treatment effectiveness, participants struggle to find the correct answer and are prone to overestimating treatment effectiveness. We analyze participants' demographic traits according to age and education level and show that this problem persists in all demographic groups. We then propose a semi-automatic machine learning approach to find the correct answer to queries on medical treatment effectiveness as it is viewed by the medical community. The model relies on the opinions presented in medical papers related to the queries, as well as features representing their impact. We show that, compared to human behaviour, our method is less prone to bias. We compare various configurations of our inference model and a baseline method that determines treatment effectiveness based solely on the opinion of medical papers. The results bolster our confidence that our approach can pave the way to developing automatic bias-free tools that can help mediate complex health related content to users.
Abstract: Post-click conversion, as a strong signal indicating user preference, is salutary for building recommender systems. However, accurately estimating the post-click conversion rate (CVR) is challenging due to selection bias, i.e., the observed click events usually happen on users' preferred items. Currently, most existing methods utilize counterfactual learning to debias recommender systems. Among them, the doubly robust (DR) estimator has achieved competitive performance by combining the error-imputation-based (EIB) estimator and the inverse propensity score (IPS) estimator in a doubly robust way. However, inaccurate error imputation may result in higher variance than the IPS estimator. Worse still, existing methods typically use simple model-agnostic methods to estimate the imputation error, which are not sufficient to approximate the dynamically changing model-correlated target (i.e., the gradient direction of the prediction model). To solve these problems, we first derive the bias and variance of the DR estimator. Based on this, we propose a more robust doubly robust (MRDR) estimator to further reduce its variance while retaining its double robustness. Moreover, we propose a novel double learning approach for the MRDR estimator, which can convert the error imputation into general CVR estimation. Besides, we empirically verify that the proposed learning scheme can further eliminate the high-variance problem of imputation learning. To evaluate its effectiveness, extensive experiments are conducted on a semi-synthetic dataset and two real-world datasets. The results demonstrate the superiority of the proposed approach over state-of-the-art methods. The code is available at https://github.com/guosyjlu/MRDR-DL.
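
For reference, these are the textbook forms of the IPS and DR estimators that the abstract builds on; the notation is generic rather than taken from the paper:

.. math::

   \mathcal{E}_{\mathrm{IPS}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i}\, e_{u,i}}{\hat{p}_{u,i}},
   \qquad
   \mathcal{E}_{\mathrm{DR}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ \hat{e}_{u,i} + \frac{o_{u,i}\,\bigl(e_{u,i} - \hat{e}_{u,i}\bigr)}{\hat{p}_{u,i}} \right]

where, for a user-item pair, o denotes the observation (click) indicator, e the prediction error, ê the imputed error, and p̂ the estimated propensity. The DR estimator stays unbiased if either the imputed errors or the propensities are accurate; MRDR targets the extra variance that inaccurate imputation introduces.
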
Abstract: Counterfactual Learning to Rank (CLTR) has become an attractive research topic due to its capability of training rankers with click logs. However, CLTR inherently suffers from a large amount of bias caused by confounders, variables that affect both the observation (examination) behavior and click behavior. Recent efforts to correct bias mostly focus on position bias, which assumes that each observation in a ranking list is isolated and only depends on the position. Though effective, this assumption ignores that users often engage with documents in an interactive manner. Ignoring the interactions between observations/clicks would incur a large interactional observation bias no matter how much data is collected. In this work, we leverage the embedding method to develop an Interactional Observation-Based Model (IOBM) to estimate the observation probability. We argue that while there exist complex observed and unobserved confounders for observation/click interactions, it is sufficient to use the embedding as a proxy confounder to uncover the relevant information for predicting the observation propensity. Moreover, the embedding offers an alternative to a fully specified generative model for observation and decouples the complex interaction structure of observations/clicks. In our IOBM, we first learn the individual observation embedding to capture position and click information. Then, we learn the interactional observation embedding to uncover their local interaction structure. To filter out irrelevant information and reduce contextual bias, we utilize query context information and propose intra-observation attention and inter-observation attention, respectively. We conduct extensive experiments on two LTR benchmark datasets, demonstrating that the proposed IOBM consistently achieves better performance than the baseline models in various click situations, and verifying its effectiveness in eliminating interactional observation bias.
Abstract: In web search on debated topics, algorithmic and cognitive biases strongly influence how users consume and process information. Recent research has shown that this can lead to a search engine manipulation effect (SEME): when search result rankings are biased towards a particular viewpoint, users tend to adopt this favored viewpoint. To better understand the mechanisms underlying SEME, we present a pre-registered, 5 x 3 factorial user study investigating whether order effects (i.e., users adopting the viewpoint pertaining to higher-ranked documents) can cause SEME. For five different debated topics, we evaluated attitude change after exposing participants with mild pre-existing attitudes to search results that were overall viewpoint-balanced but reflected one of three levels of algorithmic ranking bias. We found that attitude change did not differ across levels of ranking bias and did not vary based on individual user differences. Our results thus suggest that order effects may not be an underlying mechanism of SEME. Exploratory analyses lend support to the presence of exposure effects (i.e., users adopting the majority viewpoint among the results they examine) as a contributing factor to users' attitude change. We discuss how our findings can inform the design of user bias mitigation strategies.
Abstract: Societal biases resonate in the retrieved contents of information retrieval (IR) systems, resulting in the reinforcement of existing stereotypes. Approaching this issue requires established measures of fairness with respect to the representation of various social groups in retrieval results, as well as methods to mitigate such biases, particularly in light of the advances in deep ranking models. In this work, we first provide a novel framework to measure fairness in the retrieved text contents of ranking models. Introducing a ranker-agnostic measurement, the framework also enables disentangling the effect of the collection on fairness from that of the rankers. To mitigate these biases, we propose AdvBert, a ranking model obtained by adapting adversarial bias mitigation for IR, which jointly learns to predict relevance and remove protected attributes. We conduct experiments on two passage retrieval collections (MSMARCO Passage Re-ranking and TREC Deep Learning 2019 Passage Re-ranking), which we extend with fairness annotations of a selected subset of queries regarding gender attributes. Our results on the MSMARCO benchmark show that (1) all ranking models are less fair in comparison with ranker-agnostic baselines, and (2) the fairness of Bert rankers significantly improves when using the proposed AdvBert models. Lastly, we investigate the trade-off between fairness and utility, showing that we can maintain the significant improvements in fairness without any significant loss in utility.
Abstract: The goal of one-class collaborative filtering (OCCF) is to identify user-item pairs that are positively related but have not yet interacted, where only a small portion of positive user-item interactions (e.g., users' implicit feedback) are observed. For discriminative modeling between positive and negative interactions, most previous work relied on negative sampling to some extent, which refers to treating unobserved user-item pairs as negative, as actual negative ones are unknown. However, the negative sampling scheme has critical limitations because it may choose "positive but unobserved" pairs as negative. This paper proposes a novel OCCF framework, named BUIR, which does not require negative sampling. To make the representations of positively related users and items similar to each other while avoiding a collapsed solution, BUIR adopts two distinct encoder networks that learn from each other; the first encoder is trained to predict the output of the second encoder as its target, while the second encoder provides consistent targets by slowly approximating the first encoder. In addition, BUIR effectively alleviates the data sparsity issue of OCCF by applying stochastic data augmentation to encoder inputs. Based on the neighborhood information of users and items, BUIR randomly generates augmented views of each positive interaction each time it encodes, then further trains the model with this self-supervision. Our extensive experiments demonstrate that BUIR consistently and significantly outperforms all baseline methods by a large margin, especially for highly sparse datasets in which any assumptions about negative interactions are less valid.
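
A minimal sketch of the two-encoder mechanism described above: an online encoder with a predictor chases a slowly updated, momentum-averaged target encoder, so no negative pairs are needed. Plain embedding tables stand in for the actual encoder networks, the neighborhood-based augmentation is omitted, and the dimensions, momentum value, and loss form are illustrative.

.. code-block:: python

   import copy
   import torch
   import torch.nn.functional as F

   dim, n_users, n_items = 64, 1000, 2000
   online_user = torch.nn.Embedding(n_users, dim)
   online_item = torch.nn.Embedding(n_items, dim)
   predictor = torch.nn.Linear(dim, dim)               # maps online outputs into target space
   target_user = copy.deepcopy(online_user)
   target_item = copy.deepcopy(online_item)
   for p in list(target_user.parameters()) + list(target_item.parameters()):
       p.requires_grad_(False)                          # target side gets no gradients

   opt = torch.optim.Adam(
       list(online_user.parameters()) + list(online_item.parameters()) + list(predictor.parameters()),
       lr=1e-3,
   )

   def momentum_update(online, target, m=0.99):
       # Target encoder slowly approximates the online encoder (exponential moving average).
       for po, pt in zip(online.parameters(), target.parameters()):
           pt.data = m * pt.data + (1.0 - m) * po.data

   users = torch.randint(0, n_users, (256,))
   items = torch.randint(0, n_items, (256,))            # observed positive pairs
   for _ in range(10):
       # Each side predicts the (stop-gradient) target representation of its positive partner.
       u_on, i_on = online_user(users), online_item(items)
       u_tg, i_tg = target_user(users).detach(), target_item(items).detach()
       loss = (1 - F.cosine_similarity(predictor(u_on), i_tg)).mean() \
            + (1 - F.cosine_similarity(predictor(i_on), u_tg)).mean()
       opt.zero_grad()
       loss.backward()
       opt.step()
       momentum_update(online_user, target_user)
       momentum_update(online_item, target_item)
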
Abstract: Session-based Recommender Systems (SRSs) have been actively developed to recommend the next item of an anonymous short item sequence (i.e., session). Unlike sequence-aware recommender systems where the whole interaction sequence of each user can be used to model both the short-term interest and the general interest of the user, the absence of user-dependent information in SRSs makes it difficult to directly derive the user's general interest from data. Therefore, existing SRSs have focused on how to effectively model the information about short-term interest within the sessions, but they are insufficient to capture the general interest of users. To this end, we propose a novel framework to overcome the limitation of SRSs, named ProxySR, which imitates the missing information in SRSs (i.e., general interest of users) by modeling proxies of sessions. ProxySR selects a proxy for the input session in an unsupervised manner, and combines it with the encoded short-term interest of the session. As a proxy is jointly learned with the short-term interest and selected by multiple sessions, a proxy learns to play the role of the general interest of a user and ProxySR learns how to select a suitable proxy for an input session. Moreover, we propose another real-world situation of SRSs where a few users are logged-in and leave their identifiers in sessions, and a revision of ProxySR for the situation. Our experiments on real-world datasets show that ProxySR considerably outperforms the state-of-the-art competitors, and the proxies successfully imitate the general interest of the users without any user-dependent information.
Abstract: Factorization-based models have achieved great success in online advertising and recommender systems due to their capability of efficiently modeling combinational features. These models encode feature interactions by the vector product between feature embeddings. Despite the improvement in generalization, the memory consumption of these models grows significantly, because they usually take hundreds to thousands of large categorical features as input. Several existing works try to reduce the memory footprint by hashing, randomized embedding composition, and dimensionality search, but they suffer from either substantial performance degradation or limited memory compression. To this end, in this paper, we propose an extremely memory-efficient Factorization Machine (xLightFM), where each category embedding is composited with latent vectors selected from codebooks. Based on the characteristics of each categorical feature, we further propose to adapt the codebook size with neural architecture search techniques for compositing the embedding of each categorical feature. This further pushes the limits of memory compression while incurring negligible degradation, or even some improvement, in prediction performance. We extensively evaluate the proposed algorithm on two real-world datasets. The results demonstrate that xLightFM can outperform the state-of-the-art lightweight factorization-based methods in terms of both prediction quality and memory footprint, and achieves more than 18x and 27x memory compression compared to the vanilla FM on these two datasets, respectively.
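
A minimal sketch of the codebook-composition idea: each categorical value stores only a few small codeword indices instead of a full embedding vector. The random assignments, codebook sizes, and sum-composition below are illustrative stand-ins, since xLightFM learns the assignments end-to-end and searches the codebook sizes with neural architecture search.

.. code-block:: python

   import torch

   n_categories = 1_000_000        # large vocabulary of categorical feature values
   n_codebooks, codewords, dim = 4, 256, 16

   # Shared codebooks replace a full (n_categories x dim) embedding table.
   codebooks = torch.nn.Parameter(torch.randn(n_codebooks, codewords, dim) * 0.01)
   # Each category keeps one codeword index per codebook (random here, learned in practice).
   assignments = torch.randint(0, codewords, (n_categories, n_codebooks))

   def embed(category_ids):
       # Compose each embedding from one codeword per codebook (summed here).
       picked = codebooks[torch.arange(n_codebooks), assignments[category_ids]]  # (batch, n_codebooks, dim)
       return picked.sum(dim=1)

   vecs = embed(torch.tensor([3, 42, 999_999]))   # (3, dim), built from 4 * 256 stored vectors
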
Abstract: Sequential recommendation aims at predicting users' preferences based on their historical behaviors. However, this recommendation strategy may not perform well in practice due to the sparsity of the real-world data. In this paper, we propose a novel counterfactual data augmentation framework to mitigate the impact of the imperfect training data and empower sequential recommendation models. Our framework is composed of a sampler model and an anchor model. The sampler model aims to generate new user behavior sequences based on the observed ones, while the anchor model is leveraged to provide the final recommendation list, which is trained based on both observed and generated sequences. We design the sampler model to answer the key counterfactual question: "what would a user like to buy if her previously purchased items had been different?". Beyond heuristic intervention methods, we leverage two learning-based methods to implement the sampler model, and thus, improve the quality of the generated sequences when training the anchor model. Additionally, we analyze the influence of the generated sequences on the anchor model in theory and achieve a trade-off between the information and the noise introduced by the generated sequences. Experiments on nine real-world datasets demonstrate our framework's effectiveness and generality.
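
A minimal sketch of the heuristic-intervention variant of the sampler mentioned above: part of an observed sequence is replaced to form "what if the user had bought different items" sequences, which then augment the training data of the anchor model. The item IDs and the replacement rule are placeholders; the paper's learned samplers replace the random choice with a model.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)

   def counterfactual_sequences(seq, n_items, n_samples=3, replace_frac=0.3):
       """Generate counterfactual behavior sequences by intervening on a user's history."""
       out = []
       for _ in range(n_samples):
           new_seq = list(seq)
           k = max(1, int(replace_frac * len(seq)))
           for pos in rng.choice(len(seq), size=k, replace=False):
               new_seq[pos] = int(rng.integers(n_items))   # a learned sampler would pick this item
           out.append(new_seq)
       return out

   observed = [12, 7, 33, 90, 5]
   augmented = counterfactual_sequences(observed, n_items=1000)
   # The anchor recommender is then trained on the observed plus augmented sequences.
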
Abstract: Deep learning has brought great progress to sequential recommendation (SR) tasks. With advanced network architectures, sequential recommender models can be stacked with many hidden layers, e.g., up to 100 layers on real-world recommendation datasets. Training such a deep network is difficult because it can be computationally very expensive and takes much longer, especially in situations where there are tens of billions of user-item interactions. To deal with this challenge, we present StackRec, a simple yet very effective and efficient training framework for deep SR models based on iterative layer stacking. Specifically, we first offer an important insight that hidden layers/blocks in a well-trained deep SR model have very similar distributions. Enlightened by this, we propose a stacking operation on the pre-trained layers/blocks to transfer knowledge from a shallower model to a deeper model, and then perform iterative stacking so as to yield a much deeper but easier-to-train SR model. We validate the performance of StackRec by instantiating it with four state-of-the-art SR models in three practical scenarios with real-world datasets. Extensive experiments show that StackRec achieves not only comparable performance, but also substantial acceleration in training time, compared to SR models that are trained from scratch. Codes are available at https://github.com/wangjiachun0426/StackRec.
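
A minimal sketch of the layer-stacking warm start described above, assuming a generic residual block; StackRec's exact stacking schedule and SR backbones differ, so this only illustrates how a trained shallow stack seeds a deeper one.

.. code-block:: python

   import copy
   import torch

   class Block(torch.nn.Module):
       def __init__(self, dim=64):
           super().__init__()
           self.ff = torch.nn.Sequential(
               torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim)
           )
       def forward(self, x):
           return x + self.ff(x)      # residual block, as in typical deep SR backbones

   def stack(model_blocks):
       """Double the depth by copying the trained blocks on top of themselves."""
       copied = [copy.deepcopy(b) for b in model_blocks]
       return torch.nn.ModuleList(list(model_blocks) + copied)

   shallow = torch.nn.ModuleList([Block() for _ in range(4)])
   # ... train `shallow` to convergence ...
   deep = stack(shallow)              # 8 blocks, warm-started from the 4 trained ones
   # ... continue training `deep`; iterating this stacking yields much deeper models ...
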
Abstract: Learning user representations based on historical behaviors lies at the core of modern recommender systems. Recent advances in sequential recommenders have convincingly demonstrated high capability in extracting effective user representations from the given behavior sequences. Despite significant progress, we argue that solely modeling the observed behavior sequences may end up with a brittle and unstable system due to the noisy and sparse nature of logged user interactions. In this paper, we propose to learn accurate and robust user representations, which are required to be less sensitive to (attacks on) noisy behaviors and to trust the indispensable behaviors more, by modeling counterfactual data distributions. Specifically, given an observed behavior sequence, the proposed CauseRec framework identifies dispensable and indispensable concepts at both the fine-grained item level and the abstract interest level. CauseRec conditionally samples user concept sequences from the counterfactual data distributions by replacing dispensable and indispensable concepts within the original concept sequence. With user representations obtained from the synthesized user sequences, CauseRec performs contrastive user representation learning by contrasting the counterfactual with the observational. We conduct extensive experiments on real-world public recommendation benchmarks and justify the effectiveness of CauseRec with multi-aspect model analysis. The results demonstrate that the proposed CauseRec outperforms state-of-the-art sequential recommenders by learning accurate and robust user representations.
Abstract: Sequential recommendation aims to leverage users' historical behaviors to predict their next interaction. Existing works have not yet addressed two main challenges in sequential recommendation. First, user behaviors in their rich historical sequences are often implicit and noisy preference signals that cannot sufficiently reflect users' actual preferences. In addition, users' dynamic preferences often change rapidly over time, and hence it is difficult to capture user patterns in their historical sequences. In this work, we propose a graph neural network model called SURGE (short for SeqUential Recommendation with Graph neural nEtworks) to address these two issues. Specifically, SURGE integrates different types of preferences in long-term user behaviors into clusters in the graph by re-constructing loose item sequences into tight item-item interest graphs based on metric learning. This helps explicitly distinguish users' core interests by forming dense clusters in the interest graph. Then, we perform cluster-aware and query-aware graph convolutional propagation and graph pooling on the constructed graph. It dynamically fuses and extracts users' currently activated core interests from noisy user behavior sequences. We conduct extensive experiments on both public and proprietary industrial datasets. Experimental results demonstrate significant performance gains of our proposed method compared to state-of-the-art methods. Further studies on sequence length confirm that our method can model long behavioral sequences effectively and efficiently.
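A rough sketch of turning a loose item sequence into an item-item interest graph via embedding similarity. The per-row top-k sparsification rule and the function name are illustrative assumptions; the metric-learning formulation in SURGE differs in its details.

```python
import torch
import torch.nn.functional as F

def build_interest_graph(item_emb: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """item_emb: (L, d) embeddings of the L items in one user sequence.
    Returns a dense (L, L) weighted adjacency keeping only the strongest edges."""
    z = F.normalize(item_emb, dim=-1)
    sim = z @ z.t()                                  # (L, L) cosine similarities
    L = sim.size(0)
    k = max(1, int(keep_ratio * L))                  # edges kept per node
    row_threshold = sim.topk(k, dim=-1).values[:, -1:]  # k-th largest similarity per row
    return (sim >= row_threshold).float() * sim      # sparsify, keep edge weights
```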
Abstract: Sequential recommendation is the task of predicting the next items for users based on their interaction history. Modeling the dependence of the next action on the past actions accurately is crucial to this problem. Moreover, sequential recommendation often faces serious sparsity of item-to-item transitions in a user's action sequence, which limits the practical utility of such solutions. To tackle these challenges, we propose a Category-aware Collaborative Sequential Recommender. Our preliminary statistical tests demonstrate that the in-category item-to-item transitions are often much stronger indicators of the next items than the general item-to-item transitions observed in the original sequence. Our method makes use of item category in two ways. First, the recommender utilizes item category to organize a user's own actions to enhance dependency modeling based on her own past actions. It utilizes self-attention to capture in-category transition patterns, and determines which of the in-category transition patterns to consider based on the categories of recent actions. Second, the recommender utilizes the item category to retrieve users with similar in-category preferences to enhance collaborative learning across users, and thus conquer sparsity. It utilizes attention to incorporate in-category transition patterns from the retrieved users for the target user. Extensive experiments on two large datasets prove the effectiveness of our solution against an extensive list of state-of-the-art sequential recommendation models.
Abstract: Real-world events are quite often mentioned in texts. Estimating the occurrence time of event mentions has many applications in IR, QA, general document understanding and downstream NLP tasks. In this paper we propose an approach to temporal profiling of event mentions in text. Our method utilizes a news article archival collection for collecting temporal as well as textual information containing contemporary and retrospective event references. As we demonstrate in our experiments, the recent method that relies on secondary data sources like Wikipedia is insufficient to correctly estimate the event time, especially for minor or less well-known events that happened in the past. Our method then harnesses news article archives to effectively infer the occurrence time of past events, and is able to estimate the time at different temporal granularities (e.g., day, week, month, or year). As evidenced through extensive experiments, the proposed model outperforms the existing methods by a large margin at all granularities. We also demonstrate that our approach helps to answer arbitrary questions about past events, when incorporated into a QA framework operating over news article archives.
Abstract: Knowledge Graph (KG) reasoning that predicts missing facts for incomplete KGs has been widely explored. However, reasoning over Temporal KG (TKG) that predicts facts in the future is still far from resolved. The key to predicting future facts is to thoroughly understand the historical facts. A TKG is actually a sequence of KGs corresponding to different timestamps, where all concurrent facts in each KG exhibit structural dependencies and temporally adjacent facts carry informative sequential patterns. To capture these properties effectively and efficiently, we propose a novel Recurrent Evolution network based on Graph Convolution Network (GCN), called RE-GCN, which learns the evolutional representations of entities and relations at each timestamp by modeling the KG sequence recurrently. Specifically, for the evolution unit, a relation-aware GCN is leveraged to capture the structural dependencies within the KG at each timestamp. In order to capture the sequential patterns of all facts in parallel, the historical KG sequence is modeled auto-regressively by gated recurrent components. Moreover, the static properties of entities, such as entity types, are also incorporated via a static graph constraint component to obtain better entity representations. Fact prediction at future timestamps can then be realized based on the evolutional entity and relation representations. Extensive experiments demonstrate that the RE-GCN model obtains substantial performance and efficiency improvements for temporal reasoning tasks on six benchmark datasets. In particular, it achieves up to an 11.46% improvement in MRR for entity prediction with an up to 82-fold speedup compared to the state-of-the-art baseline.
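A minimal sketch of the recurrent-evolution idea, under simplifying assumptions: at each timestamp a (relation-agnostic) graph convolution aggregates co-occurring entities, and a GRU cell carries the entity states forward. The relation-aware GCN, relation evolution, and static graph constraint of RE-GCN are omitted here.

```python
import torch
import torch.nn as nn

class RecurrentEvolution(nn.Module):
    def __init__(self, num_entities: int, dim: int):
        super().__init__()
        self.ent = nn.Parameter(torch.randn(num_entities, dim) * 0.01)  # initial entity embeddings
        self.gcn = nn.Linear(dim, dim)   # one simplified propagation layer
        self.gru = nn.GRUCell(dim, dim)  # evolves entity states across timestamps

    def forward(self, adj_seq):
        """adj_seq: list of (N, N) normalized adjacency matrices, one KG snapshot per timestamp."""
        h = self.ent
        for adj in adj_seq:
            msg = torch.relu(self.gcn(adj @ h))  # structural dependencies within a snapshot
            h = self.gru(msg, h)                 # sequential pattern across snapshots
        return h                                 # evolutional entity representations for prediction
```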
Abstract: Timeline summarization aims at presenting long news stories in a compact manner. State-of-the-art approaches first select the most relevant dates from the original event timeline and then produce per-date news summaries. Date selection is driven by either per-date news content or date-level references. When coping with complex event data, characterized by inherent news flow redundancy, this pipeline may encounter relevant issues in both date selection and summarization due to a limited use of news content in date selection and no use of high-level temporal references (e.g., the past month). This paper proposes a paradigm shift in timeline summarization aimed at overcoming the above issues. It presents a new approach, namely Summarize Date First, which focuses on first generating date-level summaries and then selecting the most relevant dates on top of the summarized knowledge. In the latter stage, it performs date aggregations to consider high-level temporal references as well. The proposed pipeline also supports frequent incremental timeline updates more efficiently than previous approaches. We tested our unsupervised approach both on existing benchmark datasets and on a newly proposed benchmark dataset describing the COVID-19 news timeline. The achieved results were superior to state-of-the-art unsupervised methods and competitive against supervised ones.
Abstract: Reasoning in a temporal knowledge graph (TKG) is a critical task for information retrieval and semantic search. It is particularly challenging when the TKG is updated frequently. The model has to adapt to changes in the TKG for efficient training and inference while preserving its performance on historical knowledge. Recent work approaches TKG completion (TKGC) by augmenting the encoder-decoder framework with a time-aware encoding function. However, naively fine-tuning the model at every time step using these methods does not address the problems of 1) catastrophic forgetting, 2) the model's inability to identify the change of facts (e.g., the change of the political affiliation and end of a marriage), and 3) the lack of training efficiency. To address these challenges, we present the Time-aware Incremental Embedding (TIE) framework, which combines TKG representation learning, experience replay, and temporal regularization. We introduce a set of metrics that characterizes the intransigence of the model and propose a constraint that associates the deleted facts with negative labels. Experimental results on Wikidata12k and YAGO11k datasets demonstrate that the proposed TIE framework reduces training time by about ten times and improves on the proposed metrics compared to vanilla full-batch training. It comes without a significant loss in performance for any traditional measures. Extensive ablation studies reveal performance trade-offs among different evaluation metrics, which is essential for decision-making around real-world TKG applications.
Abstract: We introduce a modification of an established reinforcement learning method to facilitate the widespread use of temporal difference learning for IR: interpolated substate temporal difference (ISSTD) learning. While reinforcement learning methods have shown success in document ranking, these contributions have relied on relatively antiquated policy gradient methods like REINFORCE. These methods bring associated issues like high-variance gradient estimates and sample inefficiency, which present significant obstacles when training deep neural retrieval models. Within the reinforcement learning community, there exists a substantial body of work on alternative methods of training which revolve around temporal difference updates, such as Q-learning, Actor-Critic, or SARSA, that resolve some of the issues seen in REINFORCE. However, temporal difference methods require the full size of the state to be modeled internally within the ranking model, which is unrealistic for deep full-text retrieval or first-stage retrieval. We therefore propose ISSTD, operating on the substate, or individual documents in the case of matching models, and interpolating the temporal difference updates to the rest of the state. We provide theoretical guarantees on convergence, enabling drop-in use of ISSTD for any algorithm that relies on temporal difference updates. Furthermore, empirical results demonstrate the robustness of this approach for deep neural models, outperforming the current policy gradient approach for training deep neural retrieval models.
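A loose, hypothetical illustration of the temporal-difference flavour this builds on: a TD(0) update applied to per-document ("substate") value estimates, with a damped share of the update spread to similar documents. This is not the ISSTD algorithm itself; every name and the spreading rule are assumptions for illustration only.

```python
import numpy as np

def td_update(values, doc, next_doc, reward, sim, alpha=0.1, gamma=0.9, spread=0.1):
    """values: (D,) per-document value estimates; sim: (D, D) document-document similarity."""
    td_error = reward + gamma * values[next_doc] - values[doc]  # classic TD(0) error
    values[doc] += alpha * td_error                             # update the visited substate
    interp = alpha * spread * sim[doc] * td_error               # damped share for similar documents
    interp[doc] = 0.0                                           # the visited document is already updated
    values += interp
    return values

# e.g. values = td_update(np.zeros(1000), doc=3, next_doc=7, reward=1.0, sim=np.eye(1000))
```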
Abstract: Currently, the most popular method for open-domain Question Answering (QA) adopts the "Retriever and Reader" pipeline, where the retriever extracts a list of candidate documents from a large set of documents, a ranker then ranks the most relevant documents, and the reader extracts the answer from the candidates. Existing studies take a greedy strategy in the sense that they only use samples for ranking at the current hop, and ignore the global information across the whole set of documents. In this paper, we propose a purely rank-based framework, Thinking Path Re-Ranker (TPRR), which comprises a Thinking Path Ranker (TPR) for generating document sequences called "a path" and an External Path Reranker (EPR) for selecting the best path from candidate paths generated by TPR. Specifically, TPR leverages the scores of a dense model and conditional probabilities to score the full paths. Moreover, to further enhance the performance of the dense ranker in the iterative training, we propose a "thinking" negatives selection method in which the top-K candidates treated as negatives in the current hop are adjusted dynamically through supervised signals. After obtaining multiple supporting paths through TPR, the EPR component, which integrates several fine-grained training tasks for QA, is used to select the best path for answer extraction. We have tested our proposed solution on the multi-hop dataset HotpotQA with the full-wiki setting, and the results show that TPRR significantly outperforms the existing state-of-the-art models. Moreover, our method has held first place on the HotpotQA official leaderboard since Feb 1, 2021 under the Fullwiki setting. Code is available at https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/nlp/tprr.
Abstract: The rise of personal assistants has made conversational question answering (ConvQA) a very popular mechanism for user-system interaction. State-of-the-art methods for ConvQA over knowledge graphs (KGs) can only learn from crisp question-answer pairs found in popular benchmarks. In reality, however, such training data is hard to come by: users would rarely mark answers explicitly as correct or wrong. In this work, we take a step towards a more natural learning paradigm - from noisy and implicit feedback via question reformulations. A reformulation is likely to be triggered by an incorrect system response, whereas a new follow-up question could be a positive signal on the previous turn's answer. We present a reinforcement learning model, termed CONQUER, that can learn from a conversational stream of questions and reformulations. CONQUER models the answering process as multiple agents walking in parallel on the KG, where the walks are determined by actions sampled using a policy network. This policy network takes the question along with the conversational context as inputs and is trained via noisy rewards obtained from the reformulation likelihood. To evaluate CONQUER, we create and release ConvRef, a benchmark with about 11k natural conversations containing around 205k reformulations. Experiments show that CONQUER successfully learns from noisy reward signals, significantly improving over a state-of-the-art baseline.
Abstract: The quality variance in user-generated content is a major bottleneck to serving communities on online platforms. Current content ranking methods primarily evaluate textual and non-textual content features of each user post in isolation. In this paper, we demonstrate the utility of considering the implicit and explicit relational aspects across user content to assess its quality. First, we develop a modular platform-agnostic framework to represent the contrastive (or competing) and similarity-based relational aspects of user-generated content via independently induced content graphs. Second, we develop two complementary graph convolutional operators that enable feature contrast for competing content and feature smoothing/sharing for similar content. Depending on the edge semantics of each content graph, we embed its nodes via one of the above two mechanisms. We also show that our contrastive operator creates discriminative magnification across the embeddings of competing posts. Third, we show a surprising result: applying classical boosting techniques to combine final-layer embeddings across the content graphs significantly outperforms the typical stacking, fusion, or neighborhood embedding aggregation methods in graph convolutional architectures. We exhaustively validate our method via accepted answer prediction over fifty diverse Stack Exchange (https://stackexchange.com/) websites, with consistent relative gains of over 5% accuracy over state-of-the-art neural, multi-relational and textual baselines.
Abstract: Existing approaches for open-domain question answering (QA) are typically designed for questions that require either single-hop or multi-hop reasoning, which makes strong assumptions about the complexity of the questions to be answered. Also, multi-step document retrieval often incurs a higher number of relevant but non-supporting documents, which dampens the downstream noise-sensitive reader module for answer extraction. To address these challenges, we propose a unified QA framework to answer any-hop open-domain questions, which iteratively retrieves, reranks and filters documents, and adaptively determines when to stop the retrieval process. To improve the retrieval accuracy, we propose a graph-based reranking model that performs multi-document interaction as the core of our iterative reranking framework. Our method consistently achieves performance comparable to or better than the state-of-the-art on both single-hop and multi-hop open-domain QA datasets, including Natural Questions Open, SQuAD Open, and HotpotQA.
Abstract: When talking to dialog robots, users first have to activate the robot from standby mode with special wake words, such as "Hey Siri", which is apparently not user-friendly. The latest generation of dialog robots has been equipped with advanced sensors, like the camera, enabling multimodal activation. In this work, we work towards waking the robot without wake words. To accomplish this task, we present a Multimodal Activation Scheme (MAS), consisting of two key components: audio-visual consistency detection and semantic talking intention inference. The first is devised to measure the consistency between the audio and visual modalities in order to determine whether the heard speech comes from the detected user in front of the camera. Towards this end, two heterogeneous CNN-based networks are introduced to convolutionalize the fine-grained facial landmark features and the MFCC audio features, respectively. The second is to infer the semantic talking intention of the recorded speech, where the transcript of the speech is recognized and matrix factorization is utilized to uncover the latent human-robot talking topics. We ultimately devise different fusion strategies to unify these two components. To evaluate MAS, we construct a dataset containing 12,741 short videos recorded by 194 invited volunteers. Extensive experiments demonstrate the effectiveness of our scheme.
Abstract: Cognitive diagnosis (CD) is a fundamental issue in intelligent educational settings, which aims to discover the mastery levels of students on different knowledge concepts. In general, most previous works consider it as an inter-layer interaction modeling problem, e.g., student-exercise interactions in IRT or student-concept interactions in DINA, while the inner-layer structural relations, such as educational interdependencies among concepts, are still underexplored. Furthermore, there is a lack of comprehensive modeling for the student-exercise-concept hierarchical relations in CD systems. To this end, in this paper, we present a novel Relation map driven Cognitive Diagnosis (RCD) framework, uniformly modeling the interactive and structural relations via a multi-layer student-exercise-concept relation map. Specifically, we first represent students, exercises and concepts as individual nodes in a hierarchical layout, and construct three well-defined local relation maps to incorporate inter- and inner-layer relations, including a student-exercise interaction map, a concept-exercise correlation map and a concept dependency map. Then, we leverage a multi-level attention network to integrate node-level relation aggregation inside each local map and balance map-level aggregation across different maps. Finally, we design an extendable diagnosis function to predict students' performance and jointly train the networks. Extensive experimental results on real-world datasets clearly show the effectiveness and extendibility of our RCD in both diagnosis accuracy improvement and relation-aware representation learning.
Abstract: We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data for code retrieval and code summarization tasks. The pre-trained Corder model can be used in two ways: (1) it can produce vector representations of code which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in a fine-tuning process for tasks that might still require labeled data, such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set of semantic-preserving transformation operators to generate code snippets that are syntactically diverse but semantically equivalent. Through extensive experiments, we have shown that the code models pretrained by Corder substantially outperform the other baselines for code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.
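A hedged sketch of such a contrastive objective: two semantic-preserving transformations of the same snippet form a positive pair, and all other snippets in the batch act as negatives (an NT-Xent-style loss; not Corder's exact implementation, and all names are illustrative).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: (B, d) encoder outputs for two transformed views of the same B code snippets."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                        # (B, B) view-to-view similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)            # diagonal entries are the positive pairs
```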
Abstract: In the knowledge-grounded conversation (KGC) task, systems aim to produce more informative responses by leveraging external knowledge. KGC includes a vital part, knowledge selection, where conversational agents select the appropriate knowledge to be incorporated in the next response. Mixed initiative is an intrinsic feature of conversations where the user and the system can both take the initiative in suggesting new conversational directions. Knowledge selection can be driven by the user's initiative or by the system's initiative. For the former, the system usually selects knowledge according to the current user utterance that contains new topics or questions posed by the user; for the latter, the system usually selects knowledge according to the previously selected knowledge. No previous study has considered the mixed-initiative characteristics of knowledge selection to improve its performance. In this paper, we propose a mixed-initiative knowledge selection method (MIKe) for KGC, which explicitly distinguishes between user-initiative and system-initiative knowledge selection. Specifically, we introduce two knowledge selectors to model both of them separately, and design a novel initiative discriminator to discriminate the initiative type of knowledge selection at each conversational turn. A challenge for training MIKe is that we usually have no labels for indicating initiative. To tackle this challenge, we devise an initiative-aware self-supervised learning scheme that helps MIKe to learn to discriminate the initiative type via a self-supervised task. Experimental results on two datasets show that MIKe significantly outperforms state-of-the-art methods in terms of both automatic and human evaluations, indicating that it can select more appropriate knowledge and generate more informative and engaging responses.
Abstract: Conversational information seeking (CIS) is playing an increasingly important role in connecting people to information. Due to a lack of suitable resources, previous studies on CIS are limited to the study of conceptual frameworks, laboratory-based user studies, or a particular aspect of CIS (e.g., asking clarifying questions). In this work, we make three main contributions to facilitate research into CIS: (1) We formulate a pipeline for CIS with six sub-tasks: intent detection, keyphrase extraction, action prediction, query selection, passage selection, and response generation. (2) We release a benchmark dataset, called wizard of search engine (WISE), which allows for comprehensive and in-depth research on all aspects of CIS. (3) We design a neural architecture capable of training and evaluating both jointly and separately on the six sub-tasks, and devise a pre-train/fine-tune learning scheme that can reduce the requirements of WISE in scale by making full use of available data. We report useful characteristics of the CIS task based on statistics of the WISE dataset. We also show that our best performing model variant is able to achieve effective CIS. We release the dataset and code, as well as evaluation scripts, to facilitate future research by measuring further improvements in this important research direction.
Abstract: Medical dialogue generation aims to provide automatic and accurate responses to assist physicians to obtain diagnosis and treatment suggestions in an efficient manner. In medical dialogues, two key characteristics are relevant for response generation: patient states (such as symptoms, medication) and physician actions (such as diagnosis, treatments). In medical scenarios large-scale human annotations are usually not available, due to the high costs and privacy requirements. Hence, current approaches to medical dialogue generation typically do not explicitly account for patient states and physician actions, and focus on implicit representation instead. We propose an end-to-end variational reasoning approach to medical dialogue generation. To be able to deal with a limited amount of labeled data, we introduce both patient state and physician action as latent variables with categorical priors for explicit patient state tracking and physician policy learning, respectively. We propose a variational Bayesian generative approach to approximate posterior distributions over patient states and physician actions. We use an efficient stochastic gradient variational Bayes estimator to optimize the derived evidence lower bound, where a 2-stage collapsed inference method is proposed to reduce the bias during model training. A physician policy network composed of an action-classifier and two reasoning detectors is proposed for augmented reasoning ability. We conduct experiments on three datasets collected from medical platforms. Our experimental results show that the proposed method outperforms state-of-the-art baselines in terms of objective and subjective evaluation metrics. Our experiments also indicate that our proposed semi-supervised reasoning method achieves a comparable performance as state-of-the-art fully supervised learning baselines for physician policy learning.
Abstract: Personalized chatbots focus on endowing chatbots with a consistent personality to behave like real users, give more informative responses, and further act as personal assistants. Existing personalized approaches tried to incorporate several text descriptions as explicit user profiles. However, the acquisition of such explicit profiles is expensive and time-consuming, thus being impractical for large-scale real-world applications. Moreover, the restricted predefined profile neglects the language behavior of a real user and cannot be automatically updated together with the change of user interests. In this paper, we propose to learn implicit user profiles automatically from large-scale user dialogue history for building personalized chatbots. Specifically, leveraging the benefits of Transformer on language understanding, we train a personalized language model to construct a general user profile from the user's historical responses. To highlight the relevant historical responses to the input post, we further establish a key-value memory network of historical post-response pairs, and build a dynamic post-aware user profile. The dynamic profile mainly describes what and how the user has responded to similar posts in history. To explicitly utilize users' frequently used words, we design a personalized decoder to fuse two decoding strategies, including generating a word from the generic vocabulary and copying one word from the user's personalized vocabulary. Experiments on two real-world datasets show the significant improvement of our model compared with existing methods.
Abstract: Persona can function as prior knowledge for maintaining the consistency of dialogue systems. Most previous studies adopted the self persona in dialogue, i.e., the persona of the speaker whose response is to be selected from a set of candidates or directly generated, but few have noticed the role of the partner in dialogue. This paper makes an attempt to thoroughly explore the impact of utilizing personas that describe either self or partner speakers on the task of response selection in retrieval-based chatbots. Four persona fusion strategies are designed, which assume personas interact with contexts or responses in different ways. These strategies are implemented into three representative models for response selection, which are based on the Hierarchical Recurrent Encoder (HRE), Interactive Matching Network (IMN) and Bidirectional Encoder Representations from Transformers (BERT), respectively. Empirical studies on the Persona-Chat dataset show that the partner personas neglected in previous studies can improve the accuracy of response selection in the IMN- and BERT-based models. Besides, our BERT-based model implemented with the context-response-aware persona fusion strategy outperforms previous methods by margins larger than 2.7% on original personas and 4.6% on revised personas in terms of hits@1 (top-1 accuracy), achieving a new state-of-the-art performance on the Persona-Chat dataset.
Abstract: A one-hot encoding accompanied by a softmax loss has become the default configuration for dealing with the multiclass problem, and is also prevalent in deep learning (DL) based recommender systems (RS). The standard learning process of such methods is to fit the model outputs to a one-hot encoding of the ground truth, referred to as the hard target. However, it is known that these hard targets largely ignore the ambiguity of unobserved feedback in RS, and thus may lead to sub-optimal generalization performance. In this work, we propose SoftRec, a new RS optimization framework to enhance item recommendation. The core idea is to add additional supervisory signals - well-designed soft targets - for each instance so as to better guide the recommender learning. Meanwhile, we carefully investigate the impacts of specific soft target distributions by instantiating SoftRec with a series of strategies, including item-based, user-based, and model-based ones. To verify the effectiveness of SoftRec, we conduct extensive experiments on two public recommendation datasets using various deep recommendation architectures. The experimental results show that our methods achieve superior performance compared with the standard optimization approaches. Moreover, SoftRec also exhibits strong performance in cold-start scenarios where user-item interactions are sparser.
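An illustrative sketch (with assumed names and shapes) of training against soft targets: the hard one-hot label is blended with an auxiliary distribution over items, for instance one derived from item co-occurrence, and the model is fit with a KL objective. The specific item-, user-, and model-based strategies in the paper are not reproduced here.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, target_item, soft_dist, lam=0.2):
    """logits: (B, V) scores over items; target_item: (B,) ground-truth item ids;
    soft_dist: (B, V) auxiliary supervisory distribution; lam: blending weight."""
    hard = F.one_hot(target_item, num_classes=logits.size(-1)).float()
    target = (1.0 - lam) * hard + lam * soft_dist        # blended "soft" target
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, target, reduction="batchmean")
```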
Abstract: As users often express their preferences with binary behavior data (implicit feedback), such as clicking items or buying products, implicit feedback based Collaborative Filtering (CF) models predict the top ranked items a user might like by leveraging implicit user-item interaction data. For each user, the implicit feedback is divided into two sets: an observed item set with limited observed behaviors, and a large unobserved item set that is a mixture of negative item behaviors and unknown behaviors. Given any user preference prediction model, researchers have either designed ranking based optimization goals or relied on negative item mining techniques for better optimization. Despite the performance gain of these implicit feedback based models, the recommendation results are still far from satisfactory due to the sparsity of the observed item set for each user. To this end, in this paper, we explore the unique characteristics of implicit feedback and propose the Set2setRank framework for recommendation. The optimization criteria of Set2setRank are twofold: First, we design an item-to-item-set comparison that encourages each observed item from the sampled observed set to be ranked higher than any unobserved item from the sampled unobserved set. Second, we model a set-level comparison that encourages a margin between the distance summarized from the observed item set and the hardest unobserved item from the sampled negative set. Further, an adaptive sampling technique is designed to implement these two goals. We note that our proposed framework is model-agnostic, can be easily applied to most recommendation prediction approaches, and is time efficient in practice. Finally, extensive experiments on three real-world datasets demonstrate the superiority of our proposed approach.
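A rough sketch of the two ranking criteria just described, under assumed tensor shapes (per-user scores for a sampled observed set and a sampled unobserved set). The exact Set2setRank losses, set summaries, and adaptive sampling differ from this simplified version.

```python
import torch
import torch.nn.functional as F

def set2set_loss(obs_scores, unobs_scores, margin=0.5):
    """obs_scores: (B, P) scores of sampled observed items;
    unobs_scores: (B, N) scores of sampled unobserved items."""
    # (1) item-to-set: every observed item should be ranked above every unobserved item
    diff = obs_scores.unsqueeze(2) - unobs_scores.unsqueeze(1)    # (B, P, N) pairwise gaps
    item_to_set = F.softplus(-diff).mean()
    # (2) set-to-set: the observed-set summary should beat the hardest negative by a margin
    hardest_neg = unobs_scores.max(dim=1).values                  # (B,)
    set_summary = obs_scores.mean(dim=1)                          # (B,)
    set_to_set = F.relu(margin - (set_summary - hardest_neg)).mean()
    return item_to_set + set_to_set
```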
Abstract: With the booming of online social networks on the mobile internet, an emerging recommendation scenario has played a vital role in information acquisition for users, where users are no longer recommended a single item or item list, but a combination of heterogeneous and diverse objects (called a package, e.g., a package including news, a publisher, and friends viewing the news). Different from conventional recommendation where users are recommended the item itself, in package recommendation users show great interest in the explicitly displayed objects, which can have a significant influence on user behaviors. However, to the best of our knowledge, little effort has been made on package recommendation, and existing approaches can hardly model the complex interactions of diverse objects in a package. Thus, in this paper, we make a first study on package recommendation and propose an Intra- and inter-package attention network for Package Recommendation (IPRec). Specifically, for package modeling, an intra-package attention network is put forward to capture the object-level intention of the user interacting with the package, while an inter-package attention network acts as a package-level information encoder that captures collaborative features of neighboring packages. In addition, to capture users' preference representations, we present a user preference learner equipped with a fine-grained feature aggregation network and a coarse-grained package aggregation network. Extensive experiments on three real-world datasets demonstrate that IPRec significantly outperforms the state of the art. Moreover, the model analysis demonstrates the interpretability of our IPRec and the characteristics of user behaviors. Codes and datasets can be obtained at https://github.com/LeeChenChen/IPRec.
Abstract: Normalized discounted cumulative gain (NDCG) is one of the popular evaluation metrics for recommender systems and learning-to-rank problems. As it is non-differentiable, it cannot be optimized by gradient-based optimization procedures. In the last twenty years, a plethora of surrogate losses have been engineered that aim to make learning recommendation and ranking models that optimize NDCG possible. However, binary relevance implicit feedback settings still pose a significant challenge for such surrogate losses as they are usually designed and evaluated only for multi-level relevance feedback. In this paper, we address the limitations of directly optimizing the NDCG measure by proposing a guided learning approach (GuidedRec) that adopts recent advances in parameterized surrogate losses for NDCG. Starting from the observation that jointly learning a surrogate loss for NDCG and the recommendation model is very unstable, we design a stepwise approach that can be seamlessly applied to any recommender system model that uses a point-wise logistic loss function. The proposed approach guides the models towards optimizing the NDCG using an independent surrogate-loss model trained to approximate the true NDCG measure while maintaining the original logistic loss function as a stabilizer for the guiding procedure. In experiments on three recommendation datasets, we show that our guided surrogate learning approach yields models better optimized for NDCG than recent state-of-the-art approaches using engineered surrogate losses.
Abstract: Graph Convolutional Networks (GCNs) are powerful for collaborative filtering. The key component of GCNs is to explore neighborhood aggregation mechanisms to extract high-level representations of users and items. However, real-world user-item graphs are often incomplete and noisy. Aggregating misleading neighborhood information may lead to sub-optimal performance if GCNs are not regularized properly. Also, the real-world user-item graphs are often sparse and low rank. These two intrinsic graph properties are widely used in shallow matrix completion models, but far less studied in graph neural models. Here we propose Structured Graph Convolutional Networks (SGCNs) to enhance the performance of GCNs by exploiting graph structural properties of sparsity and low rank. To achieve sparsity, we attach each layer of a GCN with a trainable stochastic binary mask to prune noisy and insignificant edges, resulting in a clean and sparsified graph. To preserve its low-rank property, the nuclear norm regularization is applied. We jointly learn the parameters of stochastic binary masks and original GCNs by solving a stochastic binary optimization problem. An unbiased gradient estimator is further proposed to better backpropagate the gradients of binary variables. Experimental results demonstrate that SGCNs achieve better performance compared with the state-of-the-art GCNs.
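A minimal sketch of a trainable stochastic binary mask over graph edges with a straight-through gradient, the kind of mechanism described above for pruning noisy edges. The unbiased gradient estimator and nuclear-norm regularization from the paper are not reproduced; names here are illustrative.

```python
import torch
import torch.nn as nn

class EdgeMask(nn.Module):
    def __init__(self, num_edges: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_edges))  # one Bernoulli logit per edge

    def forward(self, edge_weights: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.logits)
        hard = torch.bernoulli(probs).detach()        # sampled 0/1 keep-or-prune decision
        mask = hard + probs - probs.detach()          # straight-through: hard forward, soft backward
        return edge_weights * mask                    # sparsified edge weights fed to the GCN
```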
Abstract: Graph structural information such as topologies or connectivities provides valuable guidance for graph convolutional networks (GCNs) to learn nodes' representations. Existing GCN models that capture nodes' structural information weight in- and out-neighbors equally or differentiate in- and out-neighbors globally without considering nodes' local topologies. We observe that in- and out-neighbors contribute differently for nodes with different local topologies. To explore the directional structural information for different nodes, we propose a GCN model with weighted structural features, named WGCN. WGCN first captures nodes' structural fingerprints via a direction and degree aware Random Walk with Restart algorithm, where the walk is guided by both edge direction and nodes' in- and out-degrees. Then, the interactions between nodes' structural fingerprints are used as the weighted node structural features. To further capture nodes' high-order dependencies and graph geometry, WGCN embeds graphs into a latent space to obtain nodes' latent neighbors and geometrical relationships. Based on nodes' geometrical relationships in the latent space, WGCN differentiates latent, in-, and out-neighbors with an attention-based geometrical aggregation. Experiments on transductive node classification tasks show that WGCN outperforms the baseline models consistently by up to 17.07% in terms of accuracy on five benchmark datasets.
Abstract: Deep learning techniques have ushered in significant progress in large-scale multi-modal retrieval. Nevertheless, these advanced techniques may be used nefariously to conduct searches that violate the privacy of individuals. In this paper, we propose a novel PrIvacy Protection method (PIP) against malicious multi-modal retrieval models, which proactively transfers original data into adversarial data with quasi-imperceptible perturbations before releasing them. Consequently, unauthorized malicious parties are not able to use deployed deep models to mine the desired sensitive information from the released data. In addition to privacy preservation, PIP simultaneously learns an effective multi-modal retrieval model to facilitate authorized uses, endowed with strong resilience to the perturbations. To the best of our knowledge, this is the first attempt to consider privacy issues in multi-modal retrieval and to encapsulate both privacy protection against unauthorized retrieval and robust multi-modal learning for authorized uses into a unified framework. This work is conducted in the challenging no-box and unsupervised settings, where neither target malicious models nor supervised information is known. The optimization objective of our versatile PIP is achieved through a two-player game between different components, with both the intra- and inter-modality graph alignments and the domain distribution alignment considered. Besides, a high-level similarity matrix is developed to obtain reliable guidance for learning. Empirically, we apply the proposed PIP to hashing-based multi-modal retrieval scenarios and prove its effectiveness on a range of benchmarks and tasks.
Abstract: Retrieving event instances from texts is pivotal to various natural language processing applications (e.g., automatic question answering and dialogue systems), and the first task to perform is event detection. There are two related sub-tasks therein, trigger identification and type classification, and the former is considered to play a dominant role. Nevertheless, it is notoriously challenging to predict event triggers correctly. To handle the task, existing work has made tremendous progress by incorporating manual features, data augmentation, neural networks, etc. Due to the scarcity of data and insufficient representation of trigger words, however, such methods still fail to precisely determine the spans of triggers (coined the trigger span detection problem). To address the challenge, we propose to learn discriminative neural representations (DNR) from texts. Specifically, our DNR model tackles the trigger span detection problem by exploiting two novel techniques: 1) a contrastive learning strategy, which enlarges the discrepancy between representations of words inside and outside triggers; and 2) a Mixspan strategy, which better trains the model to differentiate words near triggers' span boundaries. Extensive experiments on the ACE2005 and TAC2015 benchmarks demonstrate the superiority of our DNR model, leading to state-of-the-art performance.
Abstract: In any ranking system, the retrieval model outputs a single score for a document based on its belief on how relevant it is to a given search query. While retrieval models have continued to improve with the introduction of increasingly complex architectures, few works have investigated a retrieval model's belief in the score beyond the scope of a single value. We argue that capturing the model's uncertainty with respect to its own scoring of a document is a critical aspect of retrieval that allows for greater use of current models across new document distributions, collections, or even improving effectiveness for down-stream tasks. In this paper, we address this problem via an efficient Bayesian framework for retrieval models which captures the model's belief in the relevance score through a stochastic process while adding only negligible computational overhead. We evaluate this belief via a ranking based calibration metric showing that our approximate Bayesian framework significantly improves a retrieval model's ranking effectiveness through a risk aware reranking as well as its confidence calibration. Lastly, we demonstrate that this additional uncertainty information is actionable and reliable on down-stream tasks represented via cutoff prediction.
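A hedged sketch of one common way to obtain score uncertainty cheaply, Monte Carlo dropout, followed by a simple risk-aware adjustment (mean minus a multiple of the standard deviation). The paper's stochastic-process formulation and calibration metric are more involved than this; `model(query, docs)` returning per-document scores is an assumption.

```python
import torch

@torch.no_grad()
def risk_aware_scores(model, query, docs, n_samples: int = 20, risk: float = 1.0):
    model.train()  # keep dropout active at inference time (MC dropout)
    samples = torch.stack([model(query, docs) for _ in range(n_samples)])  # (S, D) sampled scores
    mean, std = samples.mean(dim=0), samples.std(dim=0)
    return mean - risk * std  # penalize documents the model is unsure about before reranking
```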
Abstract: Computing graph similarity is an important task in many graph-related applications such as retrieval in graph databases or graph clustering. While numerous measures have been proposed to capture the similarity between a pair of graphs, Graph Edit Distance (GED) and Maximum Common Subgraphs (MCS) are the two widely used measures in practice. GED and MCS are domain-agnostic measures of structural similarity between the graphs and define the similarity as a function of pairwise alignment of different entities (such as nodes, edges, and subgraphs) in the two graphs. The explicit explainability offered by the pairwise alignment provides transparency and justification of the similarity score, thus, GED and MCS have important practical applications. However, their exact computations are known to be NP-hard. While recently proposed neural-network based approximations have been shown to accurately compute these similarity scores, they have limited ability in providing comprehensive explanations compared to classical combinatorial algorithms, e.g., Beam search. This paper aims at efficiently approximating these domain-agnostic similarity measures through a neural network, and simultaneously learning the alignments (i.e., explanations) similar to those of classical intractable methods. Specifically, we formulate the similarity between a pair of graphs as the minimal "transformation" cost from one graph to another in the learnable node-embedding space. We show that, if node embedding is able to capture its neighborhood context closely, our proposed similarity function closely approximates both the alignment and the similarity score of classical methods. Furthermore, we also propose an efficient differentiable computation of our proposed objective for model training. Empirically, we demonstrate that the proposed method achieves up to 50%-100% reduction in the Mean Squared Error for the graph similarity approximation task and up to 20% improvement in the retrieval evaluation metrics for the graph retrieval task. The source code is available at https://github.com/khoadoan/GraphOTSim.
Abstract: Although conversational search has become a hot topic in both dialogue research and IR community, the real breakthrough has been limited by the scale and quality of datasets available. To address this fundamental obstacle, we introduce the Multimodal Multi-domain Conversational dataset (MMConv), a fully annotated collection of human-to-human role-playing dialogues spanning over multiple domains and tasks. The contribution is two-fold. First, beyond the task-oriented multimodal dialogues among user and agent pairs, dialogues are fully annotated with dialogue belief states and dialogue acts. More importantly, we create a relatively comprehensive environment for conducting multimodal conversational search with real user settings, structured venue database, annotated image repository as well as crowd-sourced knowledge database. A detailed description of the data collection procedure along with a summary of data structure and analysis is provided. Second, a set of benchmark results for dialogue state tracking, conversational recommendation, response generation as well as a unified model for multiple tasks are reported. We adopt the state-of-the-art methods for these tasks respectively to demonstrate the usability of the data, discuss limitations of current methods and set baselines for future studies.
Abstract: Given a collection of untrimmed and unsegmented videos, video corpus moment retrieval (VCMR) aims to retrieve a temporal moment (i.e., a fraction of a video) that semantically corresponds to a given text query. As video and text are from two distinct feature spaces, there are two general approaches to address VCMR: (i) to separately encode each modality's representation and then align the two modality representations for query processing, and (ii) to adopt fine-grained cross-modal interaction to learn multi-modal representations for query processing. While the second approach often leads to better retrieval accuracy, the first approach is far more efficient. In this paper, we propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. We adopt the first approach and introduce two contrastive learning objectives to refine the video encoder and text encoder to learn video and text representations separately but with better alignment for VCMR. The video contrastive learning (VideoCL) objective maximizes the mutual information between the query and the candidate video at the video level. The frame contrastive learning (FrameCL) objective aims to highlight, at the frame level, the moment region within a video that corresponds to the query. Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
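A small sketch of the video-level contrastive idea: a query embedding should score highest against its own video among all videos in the batch (a symmetric InfoNCE-style loss). The frame-level objective and the exact ReLoCLNet losses are omitted; names and the temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def video_contrastive_loss(q: torch.Tensor, v: torch.Tensor, tau: float = 0.07):
    """q: (B, d) query embeddings; v: (B, d) embeddings of their paired videos."""
    q, v = F.normalize(q, dim=-1), F.normalize(v, dim=-1)
    logits = q @ v.t() / tau
    labels = torch.arange(q.size(0), device=q.device)
    # symmetric objective: retrieve the video given the query, and the query given the video
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```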
Abstract: Learning user representations is a vital technique toward effective user modeling and personalized recommender systems. Existing approaches often derive an individual set of model parameters for each task by training on separate data. However, the representation of the same user potentially has some commonalities, such as preference and personality, even across different tasks. As such, these separately trained representations could be suboptimal in performance as well as inefficient in terms of parameter sharing. In this paper, we investigate how to continually learn user representations task by task, whereby new tasks are learned while using partial parameters from old ones. A new problem arises: when new tasks are trained, previously learned parameters are very likely to be modified, and as a result, an artificial neural network (ANN)-based model may permanently lose its capacity to serve well-trained previous tasks; this issue is termed catastrophic forgetting. To address this issue, we present Conure, the first continual, or lifelong, user representation learner --- i.e., one that learns new tasks over time without forgetting old ones. Specifically, we propose iteratively removing less important weights of old tasks in a deep user representation model, motivated by the fact that neural network models are usually over-parameterized. In this way, we can learn many tasks with a single model by reusing the important weights and modifying the less important weights to adapt to new tasks. We conduct extensive experiments on two real-world datasets with nine tasks and show that Conure largely exceeds the standard model that does not purposely preserve such old "knowledge", and performs competitively or sometimes better than models that are trained either individually for each task or simultaneously by merging all task data.
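A compact sketch of magnitude-based pruning, the kind of "remove less important weights" step described above: weights in the kept fraction would be frozen for old tasks while the rest are freed for new tasks. The per-task mask bookkeeping in Conure is richer than this; the function and its default ratio are assumptions.

```python
import torch

def prune_mask(weight: torch.Tensor, keep_ratio: float = 0.6) -> torch.Tensor:
    """Return a 0/1 mask keeping the largest-magnitude fraction of weights."""
    k = max(1, int(keep_ratio * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()  # 1 = important (frozen for old tasks), 0 = reusable
```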
Abstract: Understanding how knowledge is technically transferred across academic disciplines is very relevant for understanding and facilitating innovation. There are two challenges for this purpose, namely the semantic ambiguity and the asymmetric influence across disciplines. In this paper we investigate knowledge propagation and characterize semantic correlations for cross discipline paper recommendation. We adopt a generative model to represent a paper content as the probabilistic association with an existing hierarchically classified discipline to reduce the ambiguity of word semantics. The semantic correlation across disciplines is represented by an influence function, a correlation metric and a ranking mechanism. Then a user interest is represented as a probabilistic distribution over the target domain semantics and the correlated papers are recommended. Experimental results on real datasets show the effectiveness of our methods. We also discuss the intrinsic factors of results in an interpretable way. Compared with traditional word embedding based methods, our approach supports the evolution of domain semantics that accordingly lead to the update of semantic correlation. Another advantage of our approach is its flexibility and uniformity in supporting user interest specifications by either a list of papers or a query of key words, which is suited for practical scenarios.
Abstract: When a user starts exploring items from a new area of an e-commerce system, cross-domain recommendation techniques can help by transferring the abundant knowledge from the user's familiar domains to this new domain. However, this solution usually requires direct information sharing between service providers on the cloud, which may not always be available and brings privacy concerns. In this paper, we show that one can overcome these concerns through learning on edge devices such as smartphones and laptops. The cross-domain recommendation problem is formalized under a decentralized computing environment with multiple domain servers. We identify two key challenges for this setting: the unavailability of direct transfer and the heterogeneity of the domain-specific user representations. We then propose to learn and maintain a decentralized user encoding in each user's personal space. The optimization follows a variational inference framework that maximizes the mutual information between the user's encoding and the domain-specific user information from all her interacted domains. Empirical studies on real-world datasets exhibit the effectiveness of our proposed framework on recommendation tasks and its superiority over domain-pairwise transfer models. The resulting system offers reduced communication cost and an efficient inference mechanism that does not depend on the number of involved domains, and it allows flexible plug-in of domain-specific transfer models without significant interference with other domains.
Abstract: Representation learning on user-item graph for recommendation has evolved from using single ID or interaction history to exploiting higher-order neighbors. This leads to the success of graph convolution networks (GCNs) for recommendation such as PinSage and LightGCN. Despite effectiveness, we argue that they suffer from two limitations: (1) high-degree nodes exert larger impact on the representation learning, deteriorating the recommendations of low-degree (long-tail) items; and (2) representations are vulnerable to noisy interactions, as the neighborhood aggregation scheme further enlarges the impact of observed edges. In this work, we explore self-supervised learning on user-item graph, so as to improve the accuracy and robustness of GCNs for recommendation. The idea is to supplement the classical supervised task of recommendation with an auxiliary self-supervised task, which reinforces node representation learning via self-discrimination. Specifically, we generate multiple views of a node, maximizing the agreement between different views of the same node compared to that of other nodes. We devise three operators to generate the views --- node dropout, edge dropout, and random walk --- that change the graph structure in different manners. We term this new learning paradigm as Self-supervised Graph Learning (SGL), implementing it on the state-of-the-art model LightGCN. Through theoretical analyses, we find that SGL has the ability of automatically mining hard negatives. Empirical studies on three benchmark datasets demonstrate the effectiveness of SGL, which improves the recommendation accuracy, especially on long-tail items, and the robustness against interaction noises. Our implementations are available at https://github.com/wujcan/SGL.
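A sketch of one of the three view-generation operators, edge dropout, on a user-item graph stored in COO format; in this paradigm, two such views of the same node would form a positive pair for the auxiliary self-discrimination task. The function name, rescaling choice, and shapes are illustrative, not the released SGL code.

```python
import torch

def edge_dropout(edge_index: torch.Tensor, edge_weight: torch.Tensor, keep_prob: float = 0.8):
    """edge_index: (2, E) COO indices of the user-item graph; edge_weight: (E,) edge weights."""
    keep = torch.rand(edge_weight.size(0), device=edge_weight.device) < keep_prob
    # rescale kept edges so the expected aggregation magnitude is roughly preserved
    return edge_index[:, keep], edge_weight[keep] / keep_prob
```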
Abstract: Search result diversification aims to offer diverse documents that cover as many intents as possible. Most existing implicit diversification approaches model diversity through the similarity of document representation, which is indirect and unnatural. To handle the diversity more precisely, we measure the similarity of documents by their similarity of the intent coverage. Specifically, we build a classifier to judge whether two different documents contain the same intent based on the document's content. Then we construct an intent graph to present the complicated relationship of documents and the query. On the intent graph, documents are connected if they are similar, while the query and the document are gradually connected based on the document selection result. Then we employ graph convolutional networks (GCNs) to update the representation of the query and each document by aggregating its neighbors. By this means, we can obtain the context-aware query representation and the intent-aware document representations through the dynamic intent graph during the document selection process. Furthermore, these representations and intent graph features are fused into diversity features. Combined with the traditional relevance features, we obtain the final ranking score that balances the relevance and the diversity. Experimental results show that this implicit diversification model significantly outperforms all existing implicit diversification methods, and it can even beat the state-of-the-art explicit models.
Abstract: Recommender systems are playing a vital role in online platforms due to their ability to incorporate users' personal tastes. Beyond accuracy, diversity has been recognized as a key factor to broaden users' horizons as well as to promote enterprises' sales. However, the trade-off between accuracy and diversity remains a big challenge. More importantly, none of the existing methods has explored the domain and user biases toward diversity. In this paper, we focus on enhancing both domain-level and user-level adaptivity in diversified recommendation. Specifically, we first encode domain-level diversity into a generalized bi-lateral branch network with an adaptive balancing strategy. We further capture user-level diversity by developing a two-way adaptive metric learning backbone network inside each branch. We conduct extensive experiments on three real-world datasets. Results demonstrate that our proposed approach consistently outperforms the state-of-the-art baselines.
Abstract: Modern recommender systems often embed users and items into low-dimensional latent representations based on their observed interactions. In practical recommendation scenarios, users often exhibit various intents that drive them to interact with items through multiple behavior types (e.g., click, tag-as-favorite, purchase). However, the diversity of user behaviors is ignored in most existing approaches, which makes it difficult for them to capture heterogeneous relational structures across different types of interactive behaviors. Exploring multi-typed behavior patterns is of great importance to recommendation systems, yet is very challenging because of two aspects: i) the complex dependencies across different types of user-item interactions; ii) the diversity of such multi-behavior patterns may vary across users due to their personalized preferences. To tackle the above challenges, we propose a Multi-Behavior recommendation framework with a Graph Meta Network (MB-GMN) to incorporate multi-behavior pattern modeling into a meta-learning paradigm. Our developed MB-GMN empowers the user-item interaction learning with the capability of uncovering type-dependent behavior representations, which automatically distills the behavior heterogeneity and interaction diversity for recommendations. Extensive experiments on three real-world datasets show the effectiveness of MB-GMN by significantly boosting the recommendation performance as compared to various state-of-the-art baselines. The source code is available at https://github.com/akaxlh/MB-GMN.
Abstract: This paper investigates recommendation fairness among new items. While previous efforts have studied fairness in recommender systems and shown success in improving fairness, they mainly focus on scenarios where unfairness arises from biased prior user-feedback history (like clicks or views). Yet, it is unknown whether new items without any feedback history can be recommended fairly, and if unfairness does exist, how we can provide fair recommendations among these new items in such a cold-start scenario. In detail, we first formalize fairness among new items with the well-known concepts of equal opportunity and Rawlsian Max-Min fairness. We empirically show the prevalence of unfairness in cold-start recommender systems. Then we propose a novel learnable post-processing framework as a model blueprint for enhancing fairness, with which we propose two concrete models: a joint-learning generative model, and a score scaling model. Extensive experiments over four public datasets show the effectiveness of the proposed models for enhancing fairness while also preserving recommendation utility.
Abstract: Entity alignment (EA) is a prerequisite for enlarging the coverage of a unified knowledge graph. Previous EA approaches either restrain the performance due to inadequate information utilization or need labor-intensive pre-processing to get external or reliable information to perform the EA task. This paper proposes EASY, an effective end-to-end EA framework, which is able to (i) remove the labor-intensive pre-processing by fully discovering the name information provided by the entities themselves; and (ii) jointly fuse the features captured by the names of entities and the structural information of the graph to improve the EA results. Specifically, EASY first introduces NEAP, a highly effective name-based entity alignment procedure, to obtain an initial alignment that has reasonable accuracy and meanwhile does not require much memory consumption or any complex training process. Then, EASY invokes SRS, a novel structure-based refinement strategy, to iteratively correct the misaligned entities generated by NEAP to further enhance the entity alignment. Extensive experiments demonstrate the superiority of our proposed EASY with significant improvement against 13 existing state-of-the-art competitors.
Abstract: Prior work suggests that users conceptualize the organization of personal collections of digital files through the lens of similarity. However, it is unclear to what degree similar files are actually located near one another (e.g., in the same directory) in actual file collections, or whether leveraging file similarity can improve information retrieval and organization for disorganized collections of files. To this end, we conducted an online study combining automated analysis of 50 Google Drive and Dropbox users' cloud accounts with a survey asking about pairs of files from those accounts. We found that many files located in different parts of file hierarchies were similar in how they were perceived by participants, as well as in their algorithmically extractable features. Participants often wished to co-manage similar files (e.g., deleting one file implied deleting the other file) even if they were far apart in the file hierarchy. To further understand this relationship, we built regression models, finding several algorithmically extractable file features to be predictive of human perceptions of file similarity and desired file co-management. Our findings pave the way for leveraging file similarity to automatically recommend access, move, or delete operations based on users' prior interactions with similar files.
Abstract: Citations of scientific papers and patents reveal the knowledge flow and usually serve as the metric for evaluating their novelty and impacts in the field. Citation forecasting thus has various applications in the real world. Existing works on citation forecasting typically exploit the sequential properties of citation events, without exploring the citation network. In this paper, we propose to explore both the citation network and the related citation event sequences, which provide valuable information for future citation forecasting. We propose a novel Citation Network and Event Sequence (CINES) model to encode signals in the citation network and related citation event sequences into various types of embeddings for decoding to the arrivals of future citations. Moreover, we propose a temporal network attention and three alternative designs of bidirectional feature propagation to aggregate the retrospective and prospective aspects of publications in the citation network, coupled with the citation event sequence embeddings learned by a two-level attention mechanism for the citation forecasting. We evaluate our models and baselines on both a U.S. patent dataset and a DBLP dataset. Experimental results show that our models outperform the state-of-the-art methods, i.e., RMTPP, CYAN-RNN, Intensity-RNN, and PC-RNN, reducing the forecasting error by 37.76% - 75.32%.
Abstract: Conversational recommender systems (CRSs) have revolutionized the conventional recommendation paradigm by embracing dialogue agents to dynamically capture fine-grained user preferences. In a typical conversational recommendation scenario, a CRS first generates questions to let the user clarify her/his demands and then makes suitable recommendations. Hence, the ability to generate suitable clarifying questions is key to timely tracing users' dynamic preferences and achieving successful recommendations. However, existing CRSs fall short in asking high-quality questions because: (1) system-generated responses heavily depend on the performance of the dialogue policy agent, which has to be trained with a huge conversation corpus to cover all circumstances; and (2) current CRSs cannot fully utilize the learned latent user profiles for generating appropriate and personalized responses. To mitigate these issues, we propose the Knowledge-Based Question Generation System (KBQG), a novel framework for conversational recommendation. Distinct from previous conversational recommender systems, KBQG models a user's preference at a finer granularity by identifying the most relevant relations from a structured knowledge graph (KG). Conditioned on the varied importance of different relations, the generated clarifying questions are better at impelling users to provide more details on their preferences. Finally, accurate recommendations can be generated in fewer conversational turns. Furthermore, the proposed KBQG outperforms all baselines in our experiments on two real-world datasets.
Abstract: Large amounts of multi-modal information online make it difficult for users to obtain proper insights. In this paper, we introduce and formally define the concepts of supplementary and complementary multi-modal summaries in the context of the overlap of information covered by different modalities in the summary output. A new problem statement of combined complementary and supplementary multi-modal summarization (CCS-MMS) is formulated. The problem is then solved in several steps by utilizing the concepts of multi-objective optimization by devising a novel unsupervised framework. An existing multi-modal summarization data set is further extended by adding outputs in different modalities to establish the efficacy of the proposed technique. The results obtained by the proposed approach are compared with several strong baselines; ablation experiments are also conducted to empirically justify the proposed techniques. Furthermore, the proposed model is evaluated separately for different modalities quantitatively and qualitatively, demonstrating the superiority of our approach.
Abstract: Dense retrieval (DR) has the potential to resolve the query understanding challenge in conversational search by matching in the learned embedding space. However, this adaptation is challenging due to DR models' extra needs for supervision signals and the long-tail nature of conversational search. In this paper, we present a Conversational Dense Retrieval system, ConvDR, that learns contextualized embeddings for multi-turn conversational queries and retrieves documents solely using embedding dot products. In addition, we grant ConvDR few-shot ability using a teacher-student framework, where we employ an ad hoc dense retriever as the teacher, inherit its document encodings, and learn a student query encoder to mimic the teacher embeddings on oracle reformulated queries. Our experiments on TREC CAsT and OR-QuAC demonstrate ConvDR's effectiveness in both few-shot and fully-supervised settings. It outperforms previous systems that operate in the sparse word space, matches the retrieval accuracy of oracle query reformulations, and is also more efficient thanks to its simplicity. Our analyses reveal that the advantages of ConvDR come from its ability to capture informative context while ignoring the unrelated context in previous conversation rounds. This makes ConvDR more effective as conversations evolve while previous systems may get confused by the increased noise from previous turns. Our code is publicly available at https://github.com/thunlp/ConvDR.
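
For illustration, a minimal sketch of the teacher-student mimicking described above, assuming a frozen ad hoc dense "teacher" encoder and a trainable conversational "student" query encoder; the encoders below are hypothetical stand-ins, and ConvDR's full training also involves ranking supervision not shown here::

    import torch
    import torch.nn as nn

    dim = 32
    teacher = nn.Linear(dim, dim)                  # stand-in for the frozen ad hoc dense retriever
    student = nn.Linear(dim, dim)                  # stand-in for the conversational query encoder
    for p in teacher.parameters():
        p.requires_grad_(False)

    conv_query = torch.randn(8, dim)               # toy features of multi-turn conversational queries
    oracle_query = torch.randn(8, dim)             # toy features of oracle reformulated queries

    # The student mimics the teacher's embedding of the oracle reformulation.
    loss = nn.functional.mse_loss(student(conv_query), teacher(oracle_query))
    loss.backward()                                # only the student receives gradients
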
Abstract: We study the task of conversational fashion image retrieval via multiturn natural language feedback. Most previous studies are based on single-turn settings. Existing models for multiturn conversational fashion image retrieval have limitations, such as relying on traditional models, which leads to ineffective performance. We propose a novel framework that can effectively handle conversational fashion image retrieval with multiturn natural language feedback texts. One characteristic of the framework is that it searches for candidate images by exploiting the encoded reference image and feedback text information together with the conversation history. Furthermore, the image fashion attribute information is leveraged via a mutual attention strategy. Since there is no existing fashion dataset suitable for the multiturn setting of our task, we derive a large-scale multiturn fashion dataset via additional manual annotation efforts on an existing single-turn dataset. The experiments show that our proposed model significantly outperforms existing state-of-the-art methods.
Abstract: User and item attributes are essential side-information; their interactions (i.e., their co-occurrence in the sample data) can significantly enhance prediction accuracy in various recommender systems. We identify two different types of attribute interactions, inner interactions and cross interactions: inner interactions are those between only user attributes or those between only item attributes; cross interactions are those between user attributes and item attributes. Existing models do not distinguish these two types of attribute interactions, which may not be the most effective way to exploit the information carried by the interactions. To address this drawback, we propose a neural Graph Matching based Collaborative Filtering model (GMCF), which effectively captures the two types of attribute interactions through modeling and aggregating attribute interactions in a graph matching structure for recommendation. In our model, the two essential recommendation procedures, characteristic learning and preference matching, are explicitly conducted through graph learning (based on inner interactions) and node matching (based on cross interactions), respectively. Experimental results show that our model outperforms state-of-the-art models. Further studies verify the effectiveness of GMCF in improving the accuracy of recommendation.
Abstract: Next basket recommendation aims to infer a set of items that a user will purchase at the next visit by considering a sequence of baskets he/she has purchased previously. This task has drawn increasing attention from both the academic and industrial communities. The existing solutions mainly focus on sequential modeling over users' historical interactions. However, due to the diversity and randomness of users' behaviors, not all these baskets are relevant to help identify the user's next move. It is necessary to denoise the baskets and extract credibly relevant items to enhance recommendation performance. Unfortunately, this dimension is usually overlooked in the current literature. To this end, in this paper, we propose a Contrastive Learning Model (named CLEA) to automatically extract items relevant to the target item for next basket recommendation. Specifically, empowered by Gumbel Softmax, we devise a denoising generator to adaptively identify whether each item in a historical basket is relevant to the target item or not. With this process, we can obtain a positive sub-basket and a negative sub-basket for each basket of each user. Then, we derive the representation of each sub-basket based on its constituent items through a GRU-based context encoder, which expresses either relevant preference or irrelevant noise regarding the target item. After that, a novel two-stage anchor-guided contrastive learning process is designed to simultaneously guide this relevance learning without requiring any item-level relevance supervision. To the best of our knowledge, this is the first work to perform item-level denoising for a basket in an end-to-end fashion for next basket recommendation. Extensive experiments are conducted over four real-world datasets with diverse characteristics. The results demonstrate that our proposed CLEA achieves significantly better recommendation performance than the existing state-of-the-art alternatives. Moreover, further analysis shows that CLEA can successfully discover the items that are truly relevant to the recommendation decision.
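
For illustration, a minimal sketch (hypothetical shapes and module names) of the Gumbel-Softmax gating idea described above: a small generator scores each basket item against the target item, and a differentiable hard sample splits the basket into positive and negative sub-baskets::

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    dim, basket_size = 16, 5
    item_emb = torch.randn(basket_size, dim)       # items of one historical basket
    target_emb = torch.randn(dim)                  # target item embedding
    generator = nn.Linear(2 * dim, 2)              # scores each item: [irrelevant, relevant]

    pair = torch.cat([item_emb, target_emb.expand(basket_size, -1)], dim=-1)
    gate = F.gumbel_softmax(generator(pair), tau=0.5, hard=True)  # one-hot yet differentiable
    pos_mask = gate[:, 1:2]                        # 1 where the item is kept as relevant
    pos_subbasket = item_emb * pos_mask            # relevant sub-basket (masked items)
    neg_subbasket = item_emb * (1 - pos_mask)      # noise sub-basket
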
Abstract: Session-based recommendation (SBR) is widely used in e-commerce to predict an anonymous user's next click action from a short sequence. Many previous studies have shown the potential advantages of applying Graph Neural Networks (GNNs) to SBR tasks. However, existing SBR models that use GNNs to solve the user preference problem are trained on one single dataset to obtain one recommendation model, and a single dataset suffers from problems including an excessively sparse data source and long-distance relationships between items. Therefore, introducing dual transfer, which can enrich the data source, to SBR is necessary. To this end, a new method is proposed in this paper, called dual attention transfer based on multi-dimensional integration (DAT-MDI): (i) DAT uses a potential mapping method based on a slot attention mechanism to extract the user's representation information in different sessions across multiple domains. (ii) MDI combines a graph neural network over the graphs (session graph and global graph) with a gated recurrent unit (GRU) over the sequence to learn the item representation in each session. The multi-level session representations are then combined by a soft-attention mechanism. We conduct a variety of experiments on four benchmark datasets, which show the superiority of the DAT-MDI model over the state-of-the-art methods.
Abstract: There has been significant progress in the utilization of heterogeneous knowledge graphs (KG) as auxiliary information in recommendation systems. Reasoning over KG paths sheds light on the user's decision-making process. Previous methods focus on formulating this process as a multi-hop reasoning problem. However, without some form of guidance in the reasoning process, such a huge search space results in poor accuracy and little explanation diversity. In this paper, we propose UCPR, a user-centric path reasoning network that constantly guides the search from the aspect of user demand and enables explainable recommendations. In this network, a multi-view structure leverages not only local sequence reasoning information but also a panoramic view of the user's demand portfolio while inferring subsequent user decision-making steps. Experiments on five real-world benchmarks show UCPR is significantly more accurate than state-of-the-art methods. Besides, we show that the proposed model successfully identifies users' concerns and increases reasoning diversity to enhance explainability.
Abstract: We address how to robustly interpret natural language refinements (or critiques) in recommender systems. In particular, in human-human recommendation settings people frequently use soft attributes to express preferences about items, including concepts like the originality of a movie plot, the noisiness of a venue, or the complexity of a recipe. While binary tagging is extensively studied in the context of recommender systems, soft attributes often involve subjective and contextual aspects, which cannot be captured reliably in this way, nor be represented as objective binary truth in a knowledge base. This also adds important considerations when measuring soft attribute ranking. We propose a more natural representation as personalized relative statements, rather than as absolute item properties. We present novel data collection techniques and evaluation approaches, and a new public dataset. We also propose a set of scoring approaches, from unsupervised to weakly supervised to fully supervised, as a step towards interpreting and acting upon soft attribute based critiques.
Abstract: Online learning to rank (OLTR) uses interaction data, such as clicks, to dynamically update rankers. OLTR has been thought to capture user intent change over time - a task that is impossible for rankers trained on static datasets such as in offline and counterfactual learning to rank. However, this feature has never been demonstrated and empirically studied, as previous work only considered simulated online data with a single user intent or real online data with no explicit notion of intents and how they change over interactions. In this paper, we address this gap by studying the capability of OLTR algorithms to adapt to user intent change. Our empirical experiments show that the adaptation to intent change does vary across OLTR methods, and is also dependent on the amount of noise in the implicit feedback signal. This is an important result, as it highlights that intent change adaptation should be studied alongside online and offline performance. Investigating how OLTR algorithms adapt to intent change is challenging as current LTR datasets do not explicitly contain the required intent data. Along with the main findings reported in this paper related to intent change, we also contribute a methodology to investigate this aspect of OLTR methods. Specifically, we create a collection for OLTR with explicit intent change by adapting an existing TREC collection to this task. We further introduce methods to model and simulate click behaviour related to intent change, and we propose novel evaluation metrics tailored to study different aspects of how OLTR methods adapt to intent change.
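
For illustration, a minimal sketch (not the paper's exact simulator) of position-biased click simulation under an intent change, where the relevance labels switch to a second intent after a chosen step and a noise parameter controls clicks on non-relevant results::

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_clicks(ranking, rel_by_intent, intent, noise=0.1):
        """ranking: doc ids by rank; rel_by_intent[intent][doc] is 0/1 relevance."""
        clicks = []
        for rank, doc in enumerate(ranking):
            examine = 1.0 / (rank + 1)                       # simple position bias
            p_click = examine * (0.9 if rel_by_intent[intent][doc] else noise)
            clicks.append(rng.random() < p_click)
        return clicks

    # Relevance under two intents; the active intent switches halfway through.
    rel_by_intent = {0: {"d1": 1, "d2": 0, "d3": 0}, 1: {"d1": 0, "d2": 1, "d3": 1}}
    for step in range(100):
        intent = 0 if step < 50 else 1
        simulate_clicks(["d1", "d2", "d3"], rel_by_intent, intent)
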
Abstract: Learning from implicit feedback is challenging because of the difficult nature of the one-class problem: we can observe only positive examples. Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem. However, such methods have two main drawbacks, particularly in large-scale applications: (1) the pairwise approach is severely inefficient due to its quadratic computational cost; and (2) even recent model-based samplers (e.g., IRGAN) cannot achieve practical efficiency due to the training of an extra model. In this paper, we propose a learning-to-rank approach that achieves convergence speed comparable to the pointwise counterpart while performing similarly to the pairwise counterpart in terms of ranking effectiveness. Our approach estimates the probability densities of positive items for each user within a rich class of distributions, viz. the exponential family. In our formulation, we derive a loss function and the appropriate negative sampling distribution based on maximum likelihood estimation. We also develop a practical technique for risk approximation and a regularisation scheme. We then show that our single-model approach is equivalent to an IRGAN variant under a certain condition. Through experiments on real-world datasets, our approach outperforms the pointwise and pairwise counterparts in terms of effectiveness and efficiency.
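
For illustration, one familiar exponential-family model over items is the softmax of user-item scores; the sketch below shows a sampled negative log-likelihood for that model with a uniform negative sampler. It conveys the flavour of MLE-based training with negative sampling, not the paper's exact loss or sampling distribution::

    import torch
    import torch.nn.functional as F

    n_items, dim = 1000, 16
    user_vec = torch.randn(dim, requires_grad=True)
    item_table = torch.randn(n_items, dim, requires_grad=True)

    pos_item = torch.tensor([3])                          # observed positive item
    neg_items = torch.randint(0, n_items, (20,))          # sampled negatives (uniform here)

    logits = torch.cat([item_table[pos_item] @ user_vec,     # positive score, shape (1,)
                        item_table[neg_items] @ user_vec])   # negative scores, shape (20,)
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))  # positive sits at index 0
    loss.backward()
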
Abstract: Direct optimization of IR metrics has often been adopted as an approach to devise and develop ranking-based recommender systems. Most methods following this approach (e.g., TFMAP, CLiMF, Top-N-Rank) aim at optimizing the same metric being used for evaluation, under the assumption that this will lead to the best performance. A number of studies of this practice bring this assumption, however, into question. In this paper, we dig deeper into this issue in order to learn more about the effects of the choice of the metric to optimize on the performance of a ranking-based recommender system. We present an extensive experimental study conducted on different datasets in both pairwise and listwise learning-to-rank (LTR) scenarios, to compare the relative merit of four popular IR metrics, namely RR, AP, nDCG and RBP, when used for optimization and assessment of recommender systems in various combinations. For the first three, we follow the practice of loss function formulation available in the literature. For the fourth one, we propose novel loss functions inspired by RBP for both the pairwise and listwise scenarios. Our results confirm that the best performance is indeed not necessarily achieved when optimizing the same metric being used for evaluation. In fact, we find that RBP-inspired losses perform at least as well as other metrics in a consistent way, and offer clear benefits in several cases. Interestingly, RBP-inspired losses, while improving the recommendation performance for all users, may lead to an individual performance gain that is correlated with the activity level of a user in interacting with items: the more active the users, the more they benefit. Overall, our results challenge the assumption behind the current research practice of optimizing and evaluating the same metric, and point to RBP-based optimization instead as a promising alternative when learning to rank in the recommendation context.
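
For reference, Rank-Biased Precision (RBP) discounts relevance geometrically with a persistence parameter p; a minimal sketch of the metric is below (the paper's RBP-inspired losses build on this discount, but their exact form is not reproduced here)::

    import numpy as np

    def rbp(rel, p=0.8):
        """Rank-Biased Precision: rel holds relevance values (in [0, 1]) ordered by rank."""
        rel = np.asarray(rel, dtype=float)
        ranks = np.arange(len(rel))                   # 0-based rank
        return (1.0 - p) * np.sum(rel * p ** ranks)   # geometric discount per rank

    print(rbp([1, 0, 1, 0, 0], p=0.8))                # RBP of a toy ranked list
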
Abstract: Recent works show that Transformer-based learning-to-rank (LTR) approaches can outperform previous well-established ranking methods, such as gradient-boosted decision trees (GBDT), on document and passage re-ranking problems. A common assumption in these works is that the query and the result documents are comprised of purely textual information without explicit structure. In map search, the relevance of results is determined based on rich heterogeneous features - textual features derived from the query and the results, geospatial features such as proximity of a result to the user, structured features reflecting the address format of the result, and the perceived structure of the query. In this work, we propose a novel deep neural network LTR architecture, capable of seamlessly handling heterogeneous inputs, similar to GBDT-based methods. At the same time, unlike GBDT, the architecture does not require human input via (numerous) carefully-crafted features. Instead, features are inferred through a self-attention mechanism. Our model implements two lightweight attention layers optimized for ranking: the first layer computes query-result similarities, the second implements listwise ranking inference. We perform evaluation on several single language and one multilingual dataset. Our model outperforms by a wide margin other Transformer-based ranking architectures and has equal or better performance than GBDT models. Equally important, runtime inference is orders of magnitude faster than other Transformer architectures, significantly reducing hardware serving costs. The model is a low-cost alternative suitable to power ranking in industrial map search engines across a variety of languages and markets.
Abstract: In Mathematical Information Retrieval (MIR), formulae can be used in a query to match other similar formulae in documents. However, due to the structural complexity of formulae, specialized processing is needed for formula matching. Formulae may be represented by their appearance in Symbol Layout Trees (SLTs) or by their syntax in Operator Trees (OPTs). Previous approaches for formula retrieval used one or both of these representations and used unification to improve search results for inexact matches (e.g., allowing different variable names to match). On these representations, models for matching full expressions (trees), subexpressions, and paths have been used. Recently, embedding models have been used to represent formulae as vectors. In this paper, the effectiveness of retrieval models and formula representations is studied to identify their relative strengths and weaknesses. Then, a learning-to-rank model is proposed, using SVM-rank over similarity scores from different formula retrieval models as features. Results on the ARQMath formula retrieval task show that the proposed learning-to-rank model is effective, producing new state-of-the-art results.
Abstract: Legal case retrieval is a specialized IR task aiming to retrieve supporting cases given a query case. While recent research efforts are committed to improving the performance of automatic retrieval models, little attention has been paid to the practical search interactions between users and systems in this task. Therefore, we focus on investigating user behavior in the scenario of legal case retrieval. Specifically, we conducted a laboratory user study that involved 45 participants majoring in law to collect users' rich interactions and relevance assessments. With the collected data, we first analyzed the characteristics of the search process in legal case retrieval practice. We observed significant differences between legal case retrieval and general web search in various search behaviors. These differences highlight the necessity of investigating user behavior in legal case retrieval in depth and of re-thinking the application of related mechanisms developed based on user models in Web search. Then we investigated factors that influence search behavior from different perspectives, including task difficulty and domain expertise. Finally, we shed light on implicit feedback in legal case retrieval and designed a predictive model for relevance based on user behavior. Our work provides a better understanding of user interactions in the legal case retrieval process, which can benefit the design of the corresponding retrieval systems to support legal practitioners.
Abstract: Legal Judgment Prediction is a fundamental task in legal intelligence for the civil law system, which aims to automatically predict the judgment results of multiple subtasks, such as charge, law article, and term of penalty prediction. Existing studies mainly focus on the impact of the entire fact description on all subtasks. They ignore the practical judicial scenario, where judges adopt circumstances of crime (i.e., various parts of the fact) to decide judgment results. To this end, in this paper, we propose a circumstance-aware legal judgment prediction framework (i.e., NeurJudge) by exploring circumstances of crime. Specifically, NeurJudge utilizes the results of intermediate subtasks to separate the fact description into different circumstances and exploits them to make the predictions of other subtasks. In addition, considering the prevalence of confusing verdicts (i.e., charges and law articles), we further extend NeurJudge to a more comprehensive framework, denoted NeurJudge+. In particular, NeurJudge+ utilizes a label embedding method to incorporate the semantics of labels (i.e., charges and law articles) into facts to generate more expressive fact representations for the confusing-verdict problem. Extensive experimental results on two real-world datasets clearly validate the effectiveness of our proposed frameworks.
Abstract: Given a legal case and all law articles, Legal Judgment Prediction (LJP) is to predict the case's violated articles, charges and term of penalty. Naturally, these labels are entangled among different tasks and within a task. For example, each charge is only logically or semantically related to some fixed articles. Ignoring these constraints, LJP methods would predict unreliable results. To solve this problem, we first formalize LJP as a node classification problem over a global consistency graph derived from the training set. In terms of the node encoder, we utilize a masked transformer network to obtain case-aware node representations that are consistent among tasks and discriminative within a task. In terms of the node classifier, each node's label distribution depends on its neighbors' in this graph to achieve local consistency by relational learning. Both the node encoder and classifier are optimized by variational EM. Finally, we propose a novel measure to evaluate the self-consistency of classification results. Experimental results on two benchmark datasets demonstrate that the F1 improvement of our method is about 4.8% compared with SOTA methods.
Abstract: Legal judgment prediction (LJP) is an essential task for legal AI. Prior methods studied this topic in a pseudo setting by employing the judge-summarized case narrative as the input to predict the judgment; neglecting critical case life-cycle information available in a real court setting can threaten the quality of the case logic representation and the correctness of the prediction. In this paper, we introduce a novel challenging dataset from real courtrooms to predict the legal judgment in a reasonably encyclopedic manner by leveraging the genuine input of the case - the plaintiff's claims and the court debate data - from which the case's facts are automatically recognized by comprehensively understanding the multi-role dialogues of the court debate, and the model then learns to discriminate the claims so as to reach the final judgment through multi-task learning. An extensive set of experiments with a large civil trial dataset shows that the proposed model can more accurately characterize the interactions among claims, facts and debate for legal judgment prediction, achieving significant improvements over strong state-of-the-art baselines. Moreover, a user study conducted with real judges and law school students shows that the neural predictions are also interpretable and easily observed, thus enhancing trial efficiency and judgment quality.
Abstract: Contract element extraction (CEE) is the novel task of automatically identifying and extracting legally relevant elements such as contract dates, payments, and legislation references from contracts. Automatic methods for this task view it as a sequence labeling problem and dramatically reduce human labor. However, as contract genres and element types may vary widely, a significant challenge for this sequence labeling task is how to transfer knowledge from one domain to another, i.e., cross-domain CEE. Cross-domain CEE differs from cross-domain named entity recognition (NER) in two important ways. First, contract elements are far more fine-grained than named entities, which hinders the transfer of extractors. Second, the extraction zones for cross-domain CEE are much larger than for cross-domain NER. As a result, the contexts of elements from different domains can be more diverse. We propose a framework, the Bi-directional Feedback cLause-Element relaTion network (Bi-FLEET), for the cross-domain CEE task that addresses the above challenges. Bi-FLEET has three main components: (1) a context encoder, (2) a clause-element relation encoder, and (3) an inference layer. To incorporate invariant knowledge about element and clause types, a clause-element graph is constructed across domains and a hierarchical graph neural network is adopted in the clause-element relation encoder. To reduce the influence of context variations, a multi-task framework with a bi-directional feedback scheme is designed in the inference layer, conducting both clause classification and element extraction. The experimental results over both cross-domain NER and CEE tasks show that Bi-FLEET significantly outperforms state-of-the-art baselines.
Abstract: At present, most research on the fairness of recommender systems is conducted either from the perspective of customers or from the perspective of product (or service) providers. However, such a practice ignores the fact that when fairness is guaranteed to one side, the fairness and rights of the other side are likely to be reduced. In this paper, we consider recommendation scenarios from the perspective of both sides (customers and providers). From the perspective of providers, we consider the fairness of the providers' exposure in the recommender system. For customers, we consider the fairness of the reduced quality of recommendation results due to the introduction of fairness measures. We theoretically analyze the relationship between recommendation quality, customer fairness, and provider fairness, and design a two-sided fairness-aware recommendation model (TFROM) for both customers and providers. Specifically, we design two versions of TFROM for offline and online recommendation. The effectiveness of the model is verified on three real-world datasets. The experimental results show that TFROM provides better two-sided fairness while still maintaining a higher level of personalization than the baseline algorithms.
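
For illustration, a minimal sketch (with a hypothetical logarithmic position discount) of the provider-side quantity such a two-sided model must track: the exposure each provider receives from a ranked recommendation list::

    import math
    from collections import defaultdict

    def provider_exposure(ranking, item_provider):
        """ranking: item ids by position; item_provider maps item id -> provider id."""
        exposure = defaultdict(float)
        for pos, item in enumerate(ranking, start=1):
            exposure[item_provider[item]] += 1.0 / math.log2(pos + 1)  # position discount
        return dict(exposure)

    item_provider = {"i1": "p1", "i2": "p1", "i3": "p2", "i4": "p3"}
    print(provider_exposure(["i1", "i3", "i2", "i4"], item_provider))
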
Abstract: Recent work has proposed stochastic Plackett-Luce (PL) ranking models as a robust choice for optimizing relevance and fairness metrics. Unlike their deterministic counterparts that require heuristic optimization algorithms, PL models are fully differentiable. Theoretically, they can be used to optimize ranking metrics via stochastic gradient descent. However, in practice, the computation of the gradient is infeasible because it requires one to iterate over all possible permutations of items. Consequently, actual applications rely on approximating the gradient via sampling techniques. In this paper, we introduce a novel algorithm: PL-Rank, that estimates the gradient of a PL ranking model w.r.t. both relevance and fairness metrics. Unlike existing approaches that are based on policy gradients, PL-Rank makes use of the specific structure of PL models and ranking metrics. Our experimental analysis shows that PL-Rank has a greater sample-efficiency and is computationally less costly than existing policy gradients, resulting in faster convergence at higher performance. PL-Rank further enables the industry to apply PL models for more relevant and fairer real-world ranking systems.
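
For illustration, a minimal sketch of the stochastic model underlying this work: a Plackett-Luce ranking can be sampled by perturbing the log-scores with Gumbel noise and sorting, and a ranking metric such as DCG can then be estimated by Monte Carlo. PL-Rank itself derives a more sample-efficient gradient estimator, which is not reproduced here::

    import torch

    relevance = torch.tensor([0.0, 1.0, 0.0, 1.0])    # toy relevance labels
    log_scores = torch.tensor([2.0, 1.0, 0.5, 0.0])   # PL model (log-)scores for 4 items

    def sample_pl_ranking(scores):
        # Gumbel-max trick: adding Gumbel noise and sorting draws a ranking
        # from the Plackett-Luce distribution defined by the scores.
        u = torch.rand_like(scores).clamp(1e-10, 1.0 - 1e-10)
        gumbel = -torch.log(-torch.log(u))
        return torch.argsort(scores + gumbel, descending=True)

    def dcg(ranking, rel):
        discounts = 1.0 / torch.log2(torch.arange(len(ranking), dtype=torch.float) + 2.0)
        return (rel[ranking] * discounts).sum()

    # Monte Carlo estimate of the expected DCG under the PL ranking model.
    est = torch.stack([dcg(sample_pl_ranking(log_scores), relevance) for _ in range(100)]).mean()
    print(float(est))
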
Abstract: Existing fair ranking systems, especially those designed to be demographically fair, assume that accurate demographic information about individuals is available to the ranking algorithm. In practice, however, this assumption may not hold --- in real-world contexts like ranking job applicants or credit seekers, social and legal barriers may prevent algorithm operators from collecting peoples' demographic information. In these cases, algorithm operators may attempt to infer peoples' demographics and then supply these inferences as inputs to the ranking algorithm. In this study, we investigate how uncertainty and errors in demographic inference impact the fairness offered by fair ranking algorithms. Using simulations and three case studies with real datasets, we show how demographic inferences drawn from real systems can lead to unfair rankings. Our results suggest that developers should not use inferred demographic data as input to fair ranking algorithms, unless the inferences are extremely accurate.
Abstract: While implicit feedback (e.g., clicks, dwell times, etc.) is an abundant and attractive source of data for learning to rank, it can produce unfair ranking policies for both exogenous and endogenous reasons. Exogenous reasons typically manifest themselves as biases in the training data, which then get reflected in the learned ranking policy and often lead to rich-get-richer dynamics. Moreover, even after the correction of such biases, reasons endogenous to the design of the learning algorithm can still lead to ranking policies that do not allocate exposure among items in a fair way. To address both exogenous and endogenous sources of unfairness, we present the first learning-to-rank approach that addresses both presentation bias and merit-based fairness of exposure simultaneously. Specifically, we define a class of amortized fairness-of-exposure constraints that can be chosen based on the needs of an application, and we show how these fairness criteria can be enforced despite the selection biases in implicit feedback data. The key result is an efficient and flexible policy-gradient algorithm, called FULTR, which is the first to enable the use of counterfactual estimators for both utility estimation and fairness constraints. Beyond the theoretical justification of the framework, we show empirically that the proposed algorithm can learn accurate and fair ranking policies from biased and noisy feedback.
Abstract: Recommender systems are gaining increasing and critical impacts on humans and society since a growing number of users use them for information seeking and decision making. Therefore, it is crucial to address the potential unfairness problems in recommendations. Just like users have personalized preferences on items, users' demands for fairness are also personalized in many scenarios. Therefore, it is important to provide personalized fair recommendations for users to satisfy their personalized fairness demands. Besides, previous works on fair recommendation mainly focus on association-based fairness. However, it is important to advance from associative fairness notions to causal fairness notions for assessing fairness more properly in recommender systems. Based on the above considerations, this paper focuses on achieving personalized counterfactual fairness for users in recommender systems. To this end, we introduce a framework for achieving counterfactually fair recommendations through adversarial learning by generating feature-independent user embeddings for recommendation. The framework allows recommender systems to achieve personalized fairness for users while also covering non-personalized situations. Experiments on two real-world datasets with shallow and deep recommendation algorithms show that our method can generate fairer recommendations for users with a desirable recommendation performance.
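
For illustration, a minimal sketch (hypothetical module names, and omitting the recommendation loss) of the adversarial idea described above: a filter produces user embeddings from which a discriminator tries to predict a sensitive feature, and the filter is trained to fool the discriminator::

    import torch
    import torch.nn as nn

    dim, n_users = 16, 64
    raw_user_emb = torch.randn(n_users, dim)          # toy user embeddings
    sensitive = torch.randint(0, 2, (n_users,))       # e.g. a binary sensitive feature

    filter_net = nn.Linear(dim, dim)                  # produces "feature-independent" embeddings
    discriminator = nn.Linear(dim, 2)                 # tries to recover the sensitive feature
    f_opt = torch.optim.Adam(filter_net.parameters(), lr=1e-2)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-2)
    ce = nn.CrossEntropyLoss()

    for _ in range(10):
        # 1) Train the discriminator on the (detached) filtered embeddings.
        d_loss = ce(discriminator(filter_net(raw_user_emb).detach()), sensitive)
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()
        # 2) Train the filter adversarially so the sensitive feature becomes unpredictable.
        adv_loss = -ce(discriminator(filter_net(raw_user_emb)), sensitive)
        f_opt.zero_grad()
        adv_loss.backward()
        f_opt.step()
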
Abstract: There is an increasing interest in studying adversarial attacks on image retrieval systems. However, most of the existing attack methods are based on the white-box setting, where the attackers have access to all the model and database details, which is a strong assumption for practical attacks. The generic transfer-based attack also requires substantial resources, yet its effect has been shown to be unreliable. In this paper, we make the first attempt at proposing a query-efficient decision-based attack framework for image retrieval (DAIR) to completely subvert the top-K retrieval results with human-imperceptible perturbations. We propose an optimization-based method with a smoothed utility function to overcome the challenging discrete nature of the problem. To further improve the query efficiency, we propose a novel sampling method that can efficiently achieve transferability between the surrogate and the target model. Our comprehensive experimental evaluation on benchmark datasets shows that our DAIR method significantly outperforms the state-of-the-art decision-based methods. We also demonstrate that real image retrieval engines (Bing Visual Search and Face++ engines) can be attacked successfully with only several hundred queries.
Abstract: Recent studies have shown that recommender systems are vulnerable: it is easy for attackers to inject well-designed malicious profiles into the system, leading to biased recommendations. Since such malicious profiles can appear plausible, it is imperative to establish a robust recommender system. Adversarial training has been extensively studied for robust recommendation. However, traditional adversarial training adds small perturbations to the parameters (inputs), which does not comply with the poisoning mechanism in the recommender system; thus, for practical models that are very good at fitting the existing data, it does not perform well. To address the above limitations, we propose adversarial poisoning training (APT). It simulates the poisoning process by injecting fake users (ERM users) who are dedicated to minimizing empirical risk to build a robust system. Besides, to generate ERM users, we explore an approximation approach to estimate each fake user's influence on the empirical risk. Although the strategy of "fighting fire with fire" seems counterintuitive, we theoretically prove that the proposed APT can boost the upper bound of poisoning robustness. We also deliver the first theoretical proof that adversarial training has a positive effect on enhancing recommendation robustness. Through extensive experiments with five poisoning attacks on four real-world datasets, the results show that the robustness improvement of APT significantly outperforms baselines. It is worth mentioning that APT also improves model generalization in most cases.
Abstract: In this work, we investigate the user identity linkage task across different social media platforms based on heterogeneous multi-modal posts and social connections. This task is non-trivial due to the following two challenges. 1) As each user involves both intra multi-modal posts and inter social connections, how to accurately perform user representation learning from both intra and inter perspectives constitutes the main challenge. 2) Even representations of the same identity on different platforms tend to be distinct (i.e., the semantic gap problem) owing to the discrepant data distributions of different platforms; hence, how to alleviate the semantic gap problem poses another tough challenge. To this end, we propose a novel adversarial-enhanced hybrid graph network (AHG-Net), consisting of three key components: user representation extraction, hybrid user representation learning, and adversarial learning. Specifically, AHG-Net first employs advanced deep learning techniques to extract the user's intermediate representations from his/her heterogeneous multi-modal posts and social connections. Then AHG-Net unifies intra-user representation learning and inter-user representation learning with a hybrid graph network. Finally, AHG-Net adopts adversarial learning to encourage the learned user representations of the same identity to be similar using a semantic discriminator. Towards evaluation, we create a multi-modal user identity linkage dataset by augmenting an existing dataset with 62,021 images collected from Twitter and Foursquare. Extensive experiments validate the superiority of the proposed network. Meanwhile, we release the dataset, code, and parameters to facilitate the research community.
Abstract: Visual-based recommender systems (VRSs) enhance recommendation performance by integrating users' feedback with the visual features of items' images. Recently, human-imperceptible image perturbations, known as adversarial samples, have been shown capable of altering VRS performance, for example, by pushing (promoting) or nuking (demoting) specific categories of products. One of the most effective adversarial defense methods is adversarial training (AT), which enhances the robustness of the model by incorporating adversarial samples into the training process and minimizing an adversarial risk. The effectiveness of AT has been verified in defending DNNs in supervised learning tasks such as image classification. However, the extent to which AT can protect deep VRSs against adversarial perturbation of images remains mostly under-investigated. This work focuses on the defensive side of VRSs and provides general insights that could be further exploited to broaden the frontier in the field. First, we introduce a suite of adversarial attacks against DNNs on top of VRSs, and defense strategies to counteract them. Next, we present an evaluation framework, named Visual Adversarial Recommender (VAR), to empirically investigate the performance of defended or undefended DNNs in various visually-aware item recommendation tasks. The results of large-scale experiments indicate alarming risks in protecting a VRS through DNN robustification. Source code and data are available at https://github.com/sisinflab/Visual-Adversarial-Recommendation.
Abstract: Image-text retrieval is a fundamental and crucial branch in information retrieval. Although much progress has been made in bridging vision and language, it remains challenging because of the difficult intra-modal reasoning and cross-modal alignment. Existing modality interaction methods have achieved impressive results on public datasets. However, they heavily rely on expert experience and empirical feedback towards the design of interaction patterns, therefore, lacking flexibility. To address these issues, we develop a novel modality interaction modeling network based upon the routing mechanism, which is the first unified and dynamic multimodal interaction framework towards image-text retrieval. In particular, we first design four types of cells as basic units to explore different levels of modality interactions, and then connect them in a dense strategy to construct a routing space. To endow the model with the capability of path decision, we integrate a dynamic router in each cell for pattern exploration. As the routers are conditioned on inputs, our model can dynamically learn different activated paths for different data. Extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, verify the superiority of our model compared with several state-of-the-art baselines.
Abstract: Due to the popularity of video contents on the Internet, the information retrieval between videos and texts has attracted broad interest from researchers, which is a challenging cross-modal retrieval task. A common solution is to learn a joint embedding space to measure the cross-modal similarity. However, many existing approaches either pay more attention to textual information, video information, or cross-modal matching methods, but less to all three. We believe that a good video-text retrieval system should take into account all three points, fully exploiting the semantic information of both modalities and considering a comprehensive match. In this paper, we propose a Hierarchical Cross-Modal Graph Consistency Learning Network (HCGC) for video-text retrieval task, which considers multi-level graph consistency for video-text matching. Specifically, we first construct a hierarchical graph representation for the video, which includes three levels from global to local: video, clips and objects. Similarly, the corresponding text graph is constructed according to the semantic relationships among sentence, actions and entities. Then, in order to learn a better match between the video and text graph, we design three types of graph consistency (both direct and indirect): inter-graph parallel consistency, inter-graph cross consistency and intra-graph cross consistency. Extensive experimental results on different video-text datasets demonstrate the effectiveness of our approach on both text-to-video and video-to-text retrieval.
Abstract: In practical applications of cross-modal retrieval, test queries of the retrieval system may vary greatly and come from unknown categories. Meanwhile, due to the cost and difficulty of data collection as well as other issues, the available data for cross-modal retrieval are often imbalanced across different modalities. In this paper, we address two important issues to increase the robustness of cross-modal retrieval systems for real-world applications: handling test queries from unknown categories and modality-imbalanced training data. The first issue has not been addressed by existing methods, and the second issue was not well addressed in the related research. To tackle the above issues, we take advantage of prototype learning and propose a prototype-based adaptive network (PAN) for robust cross-modal retrieval. Our method leverages a unified prototype to represent each semantic category across modalities, which provides discriminative information about different categories, and takes the unified prototypes as anchors to learn cross-modal representations adaptively. Moreover, we propose a novel prototype propagation strategy to reconstruct balanced representations that preserve semantic consistency and modality heterogeneity. Experimental results on the benchmark datasets demonstrate the effectiveness of our method compared to the SOTA methods, and further robustness tests show the superiority of our method in solving the above issues.
Abstract: By reading reviews and product attributes, e-commerce question-answering task aims to automatically generate natural-sounding answers for product-related questions. Existing methods, however, typically assume that each review and each product attribute are semantically independent, ignoring the relation among all these multi-type texts. In this paper, we propose a review-attribute heterogeneous graph neural network (abbreviated as RAHGNN) to model the logical relation of all multi-type text. RAHGNN consists of four components: a review-attribute heterogeneous graph constructor, a question-aware input encoder, a heterogeneous graph relation analyzer, and a context-based answer decoder. Specifically, after constructing the heterogeneous graph with reviews and product attributes, we derive the initial representation of each review node and attribute node based on question attention network and key-value memory network respectively. RAHGNN analyzes the relation according to the subgraph structure and subgraph semantic meaning using node-level attention and semantic-level attention. Finally, the answer is generated by the recurrent neural network with the relation representation as context input. Extensive experimental results on a large-scale real-world e-commerce dataset not only show the superior performance of RAHGNN over state-of-the-art baselines, but also demonstrate its potentially good interpretability for multi-type text relation in product-aware answer generation.
Abstract: Traditionally, the task of cross-modal retrieval is tackled through joint embedding. However, the global matching used in joint embedding methods often fails to effectively describe matchings between local regions of the image and words in the text. Hence they may not be effective in capturing the relevance between the text and the image. In this work, we propose a heterogeneous attention network (HAN) for effective and efficient cross-modal retrieval. The proposed HAN represents an image by a set of bounding box features and a sentence by a set of word features. The relevance between the image and the sentence is determined by the set-to-set matching between the set of word features and the set of bounding box features. To enhance the matching effectiveness, we exploit the proposed heterogeneous attention layer to provide the cross-modal context for word features as well as bounding box features. Meanwhile, to optimize the metric more effectively, we propose a new soft-max triplet loss, which adaptively gives more attention to harder negatives and thus trains the proposed HAN in a more effective manner compared with the original triplet loss. Meanwhile, the proposed HAN is efficient, and its lightweight architecture only needs a single GPU card for training. Extensive experiments conducted on two public benchmarks demonstrate the effectiveness and efficiency of our HAN. This work has been deployed in production Baidu Search Ads and is part of the "PaddleBox'' platform.
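
For illustration, a hedged sketch of a softmax-weighted triplet objective in the spirit described above, where harder negatives receive larger weight; the paper's exact formulation may differ::

    import torch
    import torch.nn.functional as F

    def softmax_triplet_loss(anchor, positive, negatives, margin=0.2, temperature=10.0):
        pos_sim = F.cosine_similarity(anchor, positive, dim=-1)                  # scalar
        neg_sims = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1)   # (N,)
        weights = torch.softmax(temperature * neg_sims, dim=0)                   # harder negatives weigh more
        hinge = torch.clamp(margin - pos_sim + neg_sims, min=0.0)                # hinge term per negative
        return (weights * hinge).sum()

    anchor, positive, negatives = torch.randn(32), torch.randn(32), torch.randn(5, 32)
    print(float(softmax_triplet_loss(anchor, positive, negatives)))
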
Abstract: Click-through rate (CTR) prediction is one of the most central tasks in online advertising systems. Recent deep learning-based models that exploit feature embedding and high-order data nonlinearity have shown dramatic successes in CTR prediction. However, these models work poorly on cold-start ads with new IDs, whose embeddings are not well learned yet. In this paper, we propose Graph Meta Embedding (GME) models that can rapidly learn how to generate desirable initial embeddings for new ad IDs based on graph neural networks and meta learning. Previous works address this problem from the new ad itself, but ignore possibly useful information contained in existing old ads. In contrast, GMEs simultaneously consider two information sources: the new ad and existing old ads. For the new ad, GMEs exploit its associated attributes. For existing old ads, GMEs first build a graph to connect them with new ads, and then adaptively distill useful information. We propose three specific GMEs from different perspectives to explore what kind of information to use and how to distill information. In particular, GME-P uses Pre-trained neighbor ID embeddings, GME-G uses Generated neighbor ID embeddings and GME-A uses neighbor Attributes. Experimental results on three real-world datasets show that GMEs can significantly improve the prediction performance in both cold-start (i.e., no training data is available) and warm-up (i.e., a small number of training samples are collected) scenarios over five major deep learning-based CTR prediction models. GMEs can be applied to conversion rate (CVR) prediction as well.
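
For illustration, a minimal sketch (hypothetical shapes) of the general idea of generating an initial ID embedding for a brand-new ad by combining an attribute-derived embedding with an attention-weighted aggregation of related old ads' ID embeddings, roughly in the spirit of the GME variants::

    import torch
    import torch.nn as nn

    dim = 16
    attr_encoder = nn.Linear(8, dim)              # maps ad attributes into the embedding space
    attn = nn.Linear(2 * dim, 1)                  # scores each neighbouring old ad

    new_ad_attrs = torch.randn(8)                 # attributes of the cold-start ad
    neighbor_ids = torch.randn(4, dim)            # pre-trained ID embeddings of linked old ads

    self_emb = attr_encoder(new_ad_attrs)         # information from the new ad itself
    scores = attn(torch.cat([self_emb.expand(4, -1), neighbor_ids], dim=-1)).squeeze(-1)
    alpha = torch.softmax(scores, dim=0)          # attention weights over neighbours
    init_id_emb = self_emb + (alpha.unsqueeze(-1) * neighbor_ids).sum(dim=0)
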
Abstract: Recently, embedding techniques have achieved impressive success in recommender systems. However, embedding techniques are data-demanding and suffer from the cold-start problem. In particular, for a cold-start item that has only limited interactions, it is hard to train a reasonable item ID embedding, called a cold ID embedding, which is a major challenge for embedding techniques. A cold item ID embedding has two main problems: (1) a gap exists between the cold ID embedding and the deep model; (2) the cold ID embedding can be seriously affected by noisy interactions. However, most existing methods do not consider both issues of the cold-start problem simultaneously. To address these problems, we adopt two key ideas: (1) speed up the model fitting for the cold item ID embedding (fast adaptation); (2) alleviate the influence of noise. Along this line, we propose Meta Scaling and Shifting Networks to generate scaling and shifting functions for each item, respectively. The scaling function can directly transform cold item ID embeddings into the warm feature space, which fits the model better, and the shifting function is able to produce stable embeddings from the noisy embeddings. With the two meta networks, we propose the Meta Warm Up Framework (MWUF), which learns to warm up cold ID embeddings. Moreover, MWUF is a general framework that can be applied on top of various existing deep recommendation models. The proposed model is evaluated on three popular benchmarks, including both recommendation and advertising datasets. The evaluation results demonstrate its superior performance and compatibility.
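
For illustration, a minimal sketch (hypothetical feature inputs) of the meta scaling and shifting idea described above: a scaling network conditioned on the item's features rescales the cold ID embedding towards the warm feature space, and a shifting network conditioned on the embeddings of interacting users offsets the noisy embedding::

    import torch
    import torch.nn as nn

    dim, n_feat = 16, 10
    scale_net = nn.Sequential(nn.Linear(n_feat, dim), nn.Sigmoid())  # per-dimension scale
    shift_net = nn.Linear(dim, dim)                                  # offset from user context

    cold_id_emb = torch.randn(dim)                  # poorly trained ID embedding of a cold item
    item_features = torch.randn(n_feat)             # the item's side features
    user_context = torch.randn(7, dim).mean(dim=0)  # mean embedding of users who interacted with it

    warm_emb = scale_net(item_features) * cold_id_emb + shift_net(user_context)
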
Abstract: Meta-learning based recommendation systems alleviate the cold-start problem through a bi-level meta-optimization process: recommendation borrows prior experience from pre-trained, static system-level parameters and fine-tunes the model at the user level for new users. However, in most real-world recommendation systems it is more natural for users to arrive in a dynamic online sequence, which brings further challenges for existing meta-learning based recommendation: system-level updates begin before the user-level recommendation models have converged on the whole time series; stable and randomness-resistant bi-level gradient descent approaches are missing in the current meta-learning framework; and evaluation of learning abilities across different users, needed to explore the diversity of users, is lacking. In this paper, we propose an online regularized meta-leader recommendation approach named FORM to address these problems. To transfer meta-learning based recommenders to the online scenario, we develop a follow-the-meta-leader algorithm to learn stable online gradients. Regularized methods are then introduced to alleviate the volatility of online systems and produce sparse weight parameters. Besides, we design a scalable meta-trained learning rate based on the variance and learning shots of existing users to guide the model to adapt efficiently to new users. Extensive experiments on three public datasets and one commercial online advertisement dataset demonstrate our approach's effectiveness and stability; it outperforms other state-of-the-art methods and achieves stable and fast adaptation to new users.
Abstract: The cold start problem in recommender systems is a long-standing challenge, which requires recommending to new users (items) based on attributes without any historical interaction records. In these recommendation systems, warm users (items) have privileged collaborative signals from interaction records compared to cold start users (items), and these Collaborative Filtering (CF) signals are shown to deliver competitive recommendation performance. Many researchers have proposed to learn the correlation between the collaborative signal embedding space and the attribute embedding space to improve cold start recommendation, since user and item categorical attributes are available on many online platforms. However, cold start recommendation is still limited by the separate modeling of the two embedding spaces and by simple assumptions about the space transformation. As user-item interaction behaviors and user (item) attributes naturally form a heterogeneous graph structure, in this paper, we propose a privileged graph distillation model (PGD). The teacher model is composed of a heterogeneous graph structure for warm users and items with privileged CF links. The student model is composed of an entity-attribute graph without CF links. Specifically, the teacher model can learn better embeddings of each entity by injecting complex higher-order relationships from the constructed heterogeneous graph. The student model can learn the distilled output with privileged CF embeddings from the teacher embeddings. Our proposed model is generally applicable to different cold start scenarios with a new user, a new item, or a new user-new item pair. Finally, extensive experimental results on real-world datasets clearly show the effectiveness of our proposed model on different types of cold start problems, with average improvements of 6.6%, 5.6%, and 17.1% over state-of-the-art baselines on the three datasets, respectively.
Abstract: Current search systems provide effective support to users engaged in fact-finding and look-up oriented tasks. However, they provide relatively little support for users engaged in exploratory search tasks that involve cognitive and metacognitive activities such as learning, synthesis, planning, and reflection. We conducted a within-subject user study (N=24) that investigated the effects of a novel knowledge organization tool called the OrgBox, designed to assist users with organizing and synthesizing information, and metacognitive activities. The OrgBox included features to allow users to drag-drop information they found through search into "boxes" that could be created, labelled, and re-arranged. Study participants completed two exploratory search tasks, one with the OrgBox, and one with the OrgDoc, a baseline tool that included features of a rich-text editor (e.g., formatting, bullets) for taking notes. In this paper, we present results from our study comparing the OrgBox and OrgDoc tools. Specifically, we investigate if there were differences in participants' (1) search interactions, (2) saving and organizing behaviors (e.g., amount of information, structure of notes), (3) perceptions of the tasks, tool usability, and quality of their task outputs, and (4) perceptions of how the tools provided support for cognitive and metacognitive activities involved in the task. Our results show that when using the OrgBox tool, participants created more grouping sections in their notes and saved more text. In terms of metacognitive support, participants perceived the OrgBox tool to provide significantly higher levels of support for three types of metacognitive activity (monitoring/tracking, evaluation, and planning) without changing their perceptions of the task difficulty.
Abstract: Graph Convolutional Network (GCN) is an emerging technique for information retrieval (IR) applications. While GCN assumes the homophily property of a graph, real-world graphs are never perfect: the local structure of a node may contain discrepancy, e.g., the labels of a node's neighbors could vary. This pushes us to consider the discrepancy of local structure in GCN modeling. Existing work approaches this issue by introducing an additional module such as graph attention, which is expected to learn the contribution of each neighbor. However, such a module may not work reliably as expected, especially when the supervision signal is scarce, e.g., when the labeled data is small. Moreover, existing methods focus on modeling the nodes in the training data, and never consider the local structure discrepancy of testing nodes. This work focuses on the local structure discrepancy issue for testing nodes, which has received little scrutiny. From a novel perspective of causality, we investigate whether a GCN should trust the local structure of a testing node when predicting its label. To this end, we analyze the working mechanism of GCN with a causal graph, estimating the causal effect of a node's local structure on the prediction. The idea is simple yet effective: given a trained GCN model, we first intervene on the prediction by blocking the graph structure; we then compare the original prediction with the intervened prediction to assess the causal effect of the local structure on the prediction. In this way, we can eliminate the impact of local structure discrepancy and make more accurate predictions. Extensive experiments on seven node classification datasets show that our causality-based method effectively enhances the inference stage of GCN.
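The inference-time intervention can be illustrated with a toy, single-layer GCN in NumPy; the propagation rule, random weights, and identity-matrix intervention below are simplified stand-ins for the paper's trained model and causal-effect estimator.

.. code-block:: python

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def gcn_predict(a_hat, x, w):
        # single-layer GCN-style propagation: softmax(A_hat X W)
        return softmax(a_hat @ x @ w)

    # toy "trained" model: 5 nodes, 4 features, 3 classes
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 4))
    w = rng.normal(size=(4, 3))
    a = np.eye(5); a[0, 1] = a[1, 0] = a[1, 2] = a[2, 1] = 1
    d = np.diag(1 / np.sqrt(a.sum(1)))
    a_hat = d @ a @ d

    p_with_graph = gcn_predict(a_hat, x, w)      # original prediction
    p_no_graph = gcn_predict(np.eye(5), x, w)    # intervention: block neighbours
    causal_effect = p_with_graph - p_no_graph    # effect of local structure, per node
    # a small (or harmful) effect suggests not trusting that node's local structure
    print(causal_effect[0])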
Abstract: Semi-supervised node classification on graphs is an important research problem, with many real-world applications in information retrieval such as content classification on a social network and query intent classification on an e-commerce query graph. While traditional approaches are largely transductive, recent graph neural networks (GNNs) integrate node features with network structures, thus enabling inductive node classification models that can be applied to new nodes or even new graphs in the same feature space. However, inter-graph differences still exist across graphs within the same domain. Thus, training just one global model (e.g., a state-of-the-art GNN) to handle all new graphs, whilst ignoring the inter-graph differences, can lead to suboptimal performance. In this paper, we study the problem of inductive node classification across graphs. Unlike existing one-model-fits-all approaches, we propose a novel meta-inductive framework called MI-GNN to customize the inductive model to each graph under a meta-learning paradigm. That is, MI-GNN does not directly learn an inductive model; it learns the general knowledge of how to train a model for semi-supervised node classification on new graphs. To cope with the differences across graphs, MI-GNN employs a dual adaptation mechanism at both the graph and task levels. More specifically, we learn a graph prior to adapt for the graph-level differences, and a task prior to adapt for the task-level differences conditioned on a graph. Extensive experiments on five real-world graph collections demonstrate the effectiveness of our proposed model.
Abstract: Lifelong learning capabilities are crucial for sentiment classifiers to process continuous streams of opinioned information on the Web. However, performing lifelong learning is non-trivial for deep neural networks, as continual training on incrementally available information inevitably results in catastrophic forgetting or interference. In this paper, we propose a novel iterative network pruning with uncertainty regularization method for lifelong sentiment classification (IPRLS), which leverages the principles of network pruning and weight regularization. By performing network pruning with uncertainty regularization in an iterative manner, IPRLS can adapt a single BERT model to work with continuously arriving data from multiple domains while avoiding catastrophic forgetting and interference. Specifically, we leverage an iterative pruning method to remove redundant parameters in large deep networks so that the freed-up space can then be employed to learn new tasks, tackling the catastrophic forgetting problem. Instead of keeping the old tasks fixed when learning new tasks, we also use an uncertainty regularization based on the Bayesian online learning framework to constrain the updates of old task weights in BERT, which enables positive backward transfer, i.e., learning new tasks improves performance on past tasks while protecting old knowledge from being lost. In addition, we propose a task-specific low-dimensional residual function in parallel to each layer of BERT, which makes IPRLS less prone to losing the knowledge saved in the base BERT network when learning a new task. Extensive experiments on 16 popular review corpora demonstrate that the proposed IPRLS method significantly outperforms the strong baselines for lifelong sentiment classification. For reproducibility, we submit the code and data at: https://github.com/siat-nlp/IPRLS.
Abstract: GNN-based anomaly detection has recently attracted considerable attention. Existing attempts have thus far focused on jointly learning the node representations and the classifier for detecting the anomalies. Inspired by the recent advances of self-supervised learning (SSL) on graphs, we explore another possibility of decoupling the node representation learning and the classification for anomaly detection. We conduct a preliminary study to show that decoupled training using existing graph SSL schemes to represent nodes can obtain performance gains over joint training, but it may deteriorate when the behavior patterns and the label semantics become highly inconsistent. To be less biased by the inconsistency, we propose a simple yet effective graph SSL scheme, called Deep Cluster Infomax (DCI) for node representation learning, which captures the intrinsic graph properties in more concentrated feature spaces by clustering the entire graph into multiple parts. We conduct extensive experiments on four real-world datasets for anomaly detection. The results demonstrate that decoupled training equipped with a proper SSL scheme can outperform joint training in AUC. Compared with existing graph SSL schemes, DCI can help decoupled training gain more improvements.
Abstract: In this study we address the problem of identifying the purchase-state of users, based on product-related questions they ask on an eCommerce website. We differentiate between questions asked before buying a product (pre-purchase) and after (post-purchase). First, we study the ambiguity that exists in the definition of purchase-states, and then investigate the linguistic characteristics of the questions in each state. We analyze the discrepancy between the language models of pre- and post-purchase questions, and offer two classification schemes for this task, both of which outperform human judgments. We additionally show the effectiveness of our classification models in improving real-world applications for both consumers and sellers.
Abstract: To better exploit search logs and model users' behavior patterns, numerous click models have been proposed to extract users' implicit interaction feedback. Most traditional click models are based on the probabilistic graphical model (PGM) framework, which requires manually designed dependencies and may oversimplify user behaviors. Recently, methods based on neural networks have been proposed to improve the prediction accuracy of user behaviors by enhancing the expressive ability and allowing flexible dependencies. However, they still suffer from the data sparsity and cold-start problems. In this paper, we propose a novel graph-enhanced click model (GraphCM) for web search. Firstly, we regard each query or document as a vertex, and propose novel homogeneous graph construction methods for queries and documents respectively, to fully exploit both intra-session and inter-session information for the sparsity and cold-start problems. Secondly, following the examination hypothesis, we separately model the attractiveness estimator and examination predictor to output the attractiveness scores and examination probabilities, where graph neural networks and neighbor interaction techniques are applied to extract the auxiliary information encoded in the pre-constructed homogeneous graphs. Finally, we apply combination functions to integrate examination probabilities and attractiveness scores into click predictions. Extensive experiments conducted on three real-world session datasets show that GraphCM not only outperforms the state-of-the-art models, but also achieves superior performance in addressing the data sparsity and cold-start problems.
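The examination hypothesis behind the attractiveness/examination factorization fits in a few lines; the product combination function below is one common instantiation, not necessarily the exact combination function learned by GraphCM.

.. code-block:: python

    import torch

    def click_probability(attractiveness, examination):
        """Examination hypothesis: P(click) = P(attractive) * P(examined)."""
        return attractiveness * examination

    # toy scores for the 5 results of one query session
    attr = torch.tensor([0.8, 0.6, 0.4, 0.3, 0.2])    # from the attractiveness estimator
    exam = torch.tensor([0.95, 0.7, 0.5, 0.3, 0.15])  # from the examination predictor
    print(click_probability(attr, exam))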
Abstract: Because of the superior feature representation ability of deep learning, various deep Click-Through Rate (CTR) models are deployed in commercial systems by industrial companies. To achieve better performance, it is necessary to train the deep CTR models on huge volumes of training data efficiently, which makes speeding up the training process an essential problem. Different from models with dense training data, the training data for CTR models is usually high-dimensional and sparse. To transform the high-dimensional sparse input into low-dimensional dense real-valued vectors, almost all deep CTR models adopt an embedding layer, which easily reaches hundreds of GB or even TB. Since a single GPU cannot accommodate all the embedding parameters, it is not reasonable to rely on data parallelism alone when performing distributed training. Therefore, existing distributed training platforms for recommendation adopt model parallelism. Specifically, they use the CPU (Host) memory of servers to maintain and update the embedding parameters and utilize GPU workers to conduct forward and backward computations. Unfortunately, these platforms suffer from two bottlenecks: (1) the latency of pull & push operations between Host and GPU; (2) parameter update and synchronization in the CPU servers. To address these bottlenecks, in this paper, we propose ScaleFreeCTR: a MixCache-based distributed training system for CTR models. Specifically, in ScaleFreeCTR, we still store the huge embedding table in CPU memory but utilize the GPU instead of the CPU to conduct embedding synchronization efficiently. To reduce the latency of data transfer between both GPU-Host and GPU-GPU, the MixCache mechanism and Virtual Sparse Id operation are proposed. Comprehensive experiments are conducted to demonstrate the effectiveness and efficiency of ScaleFreeCTR. In addition, our system will be open-sourced based on MindSpore in the near future.
Abstract: Click-through rate (CTR) prediction is a critical problem in web search, recommendation systems and online advertisement displaying. Learning good feature interactions is essential to reflecting users' preferences for items. Many CTR prediction models based on deep learning have been proposed, but researchers usually only pay attention to whether state-of-the-art performance is achieved, and ignore whether the entire framework is reasonable. In this work, we use the discrete choice model in economics to redefine the CTR prediction problem, and propose a general neural network framework built on the self-attention mechanism. We find that most existing CTR prediction models align with our proposed general framework. We also examine the expressive power and model complexity of our proposed framework, along with potential extensions to some existing models. Finally, we demonstrate and verify our insights through experimental results on public datasets.
Abstract: Recommendation is a prevalent and critical service in information systems. To provide personalized suggestions to users, industry players embrace machine learning, more specifically, building predictive models based on click behavior data. This is known as Click-Through Rate (CTR) prediction, which has become the gold standard for building personalized recommendation services. However, we argue that there is a significant gap between clicks and user satisfaction --- it is common that a user is "cheated" into clicking an item by the attractive title/cover of the item. This will severely hurt users' trust in the system if they find the actual content of the clicked item disappointing. What's even worse, optimizing CTR models on such flawed data will result in the Matthew Effect, making the seemingly attractive but actually low-quality items be recommended even more frequently. In this paper, we formulate the recommendation models as a causal graph that reflects the cause-effect factors in recommendation, and address the clickbait issue by performing counterfactual inference on the causal graph. We imagine a counterfactual world where each item has only exposure features (i.e., the features that the user can see before making a click decision). By estimating the click likelihood of a user in the counterfactual world, we are able to reduce the direct effect of exposure features and eliminate the clickbait issue. Experiments on real-world datasets demonstrate that our method significantly improves the post-click satisfaction of CTR models.
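A rough sketch of the counterfactual adjustment follows, under the simplifying assumption that the direct effect of exposure features can be approximated by a model fed only those features and subtracted from the full-model score; the paper's causal-graph formulation is more involved.

.. code-block:: python

    import numpy as np

    def debiased_score(score_full, score_exposure_only):
        """Counterfactual adjustment sketch: remove the direct effect of exposure
        features (title/cover) from the ranking score."""
        return score_full - score_exposure_only

    # toy scores for 4 candidate items
    full = np.array([0.9, 0.7, 0.6, 0.4])       # model fed exposure + content features
    exposure = np.array([0.8, 0.2, 0.5, 0.1])   # counterfactual: exposure features only
    print(np.argsort(-debiased_score(full, exposure)))  # clickbait-corrected ranking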
Abstract: Modeling powerful interactions is a critical challenge in Click-through rate (CTR) prediction, which is one of the most typical machine learning tasks in personalized advertising and recommender systems. Although developing hand-crafted interactions is effective for a small number of datasets, it generally requires laborious and tedious architecture engineering for extensive scenarios. In recent years, several neural architecture search (NAS) methods have been proposed for designing interactions automatically. However, existing methods only explore limited types and connections of operators for interaction generation, leading to low generalization ability. To address these problems, we propose a more general automated method for building powerful interactions named AutoPI. The main contributions of this paper are as follows: AutoPI adopts a more general search space in which the computational graph is generalized from existing network connections, and the interactive operators in the edges of the graph are extracted from representative hand-crafted works. It allows searching for various powerful feature interactions to produce higher AUC and lower Logloss in a wide variety of applications. Besides, AutoPI utilizes a gradient-based search strategy for exploration with a significantly low computational cost. Experimentally, we evaluate AutoPI on a diverse suite of benchmark datasets, demonstrating the generalizability and efficiency of AutoPI over hand-crafted architectures and state-of-the-art NAS algorithms.
Abstract: A taocode is a kind of specially coded text-link on taobao.com (the world's biggest online shopping website), through which users can share messages about products with each other. Analyzing taocodes can potentially facilitate understanding of the social relationships between users and, more excitingly, their online purchasing behaviors under the influence of taocode diffusion. This paper innovatively investigates the problem of online purchasing predictions from an information diffusion perspective, with taocode as a case study. Specifically, we conduct profound observational studies on a large-scale real-world dataset from Taobao, containing over 100M Taocode sharing records. Inspired by our observations, we propose InfNet, a dynamic GNN-based framework that models the information diffusion across Taocode. We then apply InfNet to item purchasing predictions. Extensive experiments on real-world datasets validate the effectiveness of InfNet compared with state-of-the-art baselines.
Abstract: Hashing has become increasingly important for large-scale image retrieval, of which low storage cost and fast searching are two key properties. However, existing methods adopt large neural networks, which are hard to deploy in resource-limited devices due to the unacceptable memory and runtime overhead. We argue that this huge overhead of neural networks somewhat violates the appealing properties of hashing. In this paper, we propose a novel deep hashing method, called Binary Neural Network Hashing (BNNH), for fast image retrieval. Specifically, we construct an efficient binarized network architecture to provide a lighter model and faster inference, which directly generates binary outputs as the desired hash codes without introducing quantization loss. Besides, in order to circumvent the huge performance degradation caused by the extremely quantized activations, we introduce a simple yet effective activation-aware loss to explicitly guide the updating of activations in intermediate layers. Extensive experiments conducted on three benchmarks show that the proposed method outperforms the state-of-the-art binarization methods by large margins and validate the efficiency of BNNH.
Abstract: Hashing, which represents data items as compact binary codes, has become an increasingly popular technique, e.g., for large-scale image retrieval, owing to its super fast search speed as well as its extremely economical memory consumption. However, existing hashing methods all try to learn binary codes from artificially balanced datasets, which are not commonly available in real-world scenarios. In this paper, we propose Long-Tail Hashing Network (LTHNet), a novel two-stage deep hashing approach that addresses the problem of learning to hash for more realistic datasets where the data labels roughly exhibit a long-tail distribution. Specifically, the first stage is to learn relaxed embeddings of the given dataset with its long-tail characteristic taken into account via an end-to-end deep neural network; the second stage is to binarize those obtained embeddings. A critical part of LTHNet is its dynamic meta-embedding module, extended with a determinantal point process, which can adaptively realize visual knowledge transfer between head and tail classes, and thus enrich image representations for hashing. Our experiments show that LTHNet achieves dramatic performance improvements over all state-of-the-art competitors on long-tail datasets, with little or no sacrifice on balanced datasets. Further analyses reveal that, while to our surprise directly manipulating class weights in the loss function has little effect, the extended dynamic meta-embedding module, the use of cross-entropy loss instead of square loss, and the relatively small batch size for training all contribute to LTHNet's success.
Abstract: Given a set S of n distinct keys, a function f that bijectively maps the keys of S into the range (0,...,n-1) is called a minimal perfect hash function for S. Algorithms that find such functions when n is large and retain constant evaluation time are of practical interest; for instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm which is efficient in three different aspects: time to find f (construction time), time to evaluate f on a key of S (lookup time), and space of representation for f. Several algorithms have been proposed to trade off between these aspects. In 1992, Fox, Chen, and Heath (FCH) presented an algorithm at SIGIR providing very fast lookup evaluation. However, the approach received little attention because of its large construction time and higher space consumption compared to other subsequent techniques. Almost thirty years later we revisit their framework and present an improved algorithm that scales well to large sets and reduces space consumption altogether, without compromising the lookup time. We conduct an extensive experimental assessment and show that the algorithm finds functions that are competitive in space with state-of-the-art techniques and provide 2-4x better lookup time.
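To illustrate what a minimal perfect hash function does (this is a brute-force toy, not the FCH construction nor the improved algorithm above), the following sketch searches for a seed that makes a generic hash collision-free and minimal on a tiny key set.

.. code-block:: python

    import itertools
    import zlib

    def find_mphf_seed(keys):
        """Brute-force a seed so that hash(key, seed) % n is a bijection.
        Only feasible for tiny key sets; real algorithms (FCH and successors)
        scale to millions of keys with compact representations."""
        n = len(keys)
        for seed in itertools.count():
            slots = {zlib.crc32(k.encode() + seed.to_bytes(4, "little")) % n for k in keys}
            if len(slots) == n:      # no collisions: the function is minimal and perfect
                return seed

    keys = ["sigir", "retrieval", "hash", "index"]
    seed = find_mphf_seed(keys)
    f = lambda k: zlib.crc32(k.encode() + seed.to_bytes(4, "little")) % len(keys)
    print({k: f(k) for k in keys})   # each key gets a distinct identifier in 0..n-1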
Abstract: An emerging recipe for achieving state-of-the-art effectiveness in neural document re-ranking involves utilizing large pre-trained language models - e.g., BERT - to evaluate all individual passages in the document and then aggregating the outputs by pooling or additional Transformer layers. A major drawback of this approach is high query latency due to the cost of evaluating every passage in the document with BERT. To make matters worse, this high inference cost and latency varies based on the length of the document, with longer documents requiring more time and computation. To address this challenge, we adopt an intra-document cascading strategy, which prunes passages of a candidate document using a less expensive model, called ESM, before running a scoring model that is more expensive and effective, called ETM. We found it best to train ESM (short for Efficient Student Model) via knowledge distillation from the ETM (short for Effective Teacher Model) e.g., BERT. This pruning allows us to only run the ETM model on a smaller set of passages whose size does not vary by document length. Our experiments on the MS MARCO and TREC Deep Learning Track benchmarks suggest that the proposed Intra-Document Cascaded Ranking Model (IDCM) leads to over 400% lower query latency by providing essentially the same effectiveness as the state-of-the-art BERT-based document ranking models.
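The cascading idea can be sketched independently of BERT: a cheap scorer prunes the passages and an expensive scorer is run only on the survivors. The stand-in scorers and max-pooling aggregation below are illustrative, not the trained ESM/ETM models.

.. code-block:: python

    from typing import Callable, List

    def cascade_rank(passages: List[str],
                     cheap_score: Callable[[str], float],
                     expensive_score: Callable[[str], float],
                     k: int = 3) -> float:
        """Intra-document cascading sketch: a cheap student model prunes passages,
        an expensive teacher model scores only the survivors; the document score
        is aggregated (max-pooling here) over the surviving passages."""
        survivors = sorted(passages, key=cheap_score, reverse=True)[:k]
        return max(expensive_score(p) for p in survivors)

    # toy usage with stand-in scorers (real IDCM uses a distilled ESM and a BERT ETM)
    passages = ["p1 about neural ranking", "p2 off-topic", "p3 about BERT", "p4 filler"]
    cheap = lambda p: len(set(p.split()) & {"neural", "BERT", "ranking"})
    costly = lambda p: cheap(p) * 1.5   # pretend this is the expensive model
    print(cascade_rank(passages, cheap, costly, k=2))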
Abstract: Video retrieval is becoming increasingly important owing to the rapid emergence of videos on the Internet. The dominant paradigm for video retrieval learns video-text representations by pushing the distance between the similarity of positive pairs and that of negative pairs apart by a fixed margin. However, negative pairs used for training are sampled randomly, which means that the semantics between negative pairs may be related or even equivalent, while most methods still enforce dissimilar representations to decrease their similarity. This phenomenon leads to inaccurate supervision and poor performance in learning video-text representations. While most video retrieval methods overlook that phenomenon, we propose an adaptive margin that changes with the distance between positive and negative pairs to solve the aforementioned issue. First, we design the calculation framework of the adaptive margin, including the method of distance measurement and the function between the distance and the margin. Then, we explore a novel implementation called "Cross-Modal Generalized Self-Distillation" (CMGSD), which can be built on top of most video retrieval models with few modifications. Notably, CMGSD adds little computational overhead at training time and no computational overhead at test time. Experimental results on three widely used datasets demonstrate that the proposed method can yield significantly better performance than the corresponding backbone model, and it outperforms state-of-the-art methods by a large margin.
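One plausible way to implement a margin that adapts to the positive-negative distance is sketched below in PyTorch; the specific distance measure and distance-to-margin mapping are hypothetical, since the paper treats these design choices as part of its framework.

.. code-block:: python

    import torch

    def adaptive_margin_loss(sim_pos, sim_neg, emb_pos, emb_neg, base=0.2, scale=0.1):
        """Hinge loss whose margin shrinks when the negative is semantically close
        to the positive (one plausible instantiation of an adaptive margin)."""
        # cosine distance between the positive and negative items
        dist = 1 - torch.nn.functional.cosine_similarity(emb_pos, emb_neg, dim=-1)
        margin = base * torch.clamp(dist / (dist + scale), 0, 1)  # hypothetical mapping
        return torch.clamp(margin + sim_neg - sim_pos, min=0).mean()

    # toy batch of two query-positive-negative triples
    sim_pos, sim_neg = torch.tensor([0.7, 0.6]), torch.tensor([0.5, 0.65])
    emb_pos, emb_neg = torch.randn(2, 8), torch.randn(2, 8)
    print(adaptive_margin_loss(sim_pos, sim_neg, emb_pos, emb_neg))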
Abstract: Composing text and image for image retrieval (CTI-IR) is a new yet challenging task, for which the input query is not the conventional image or text but a composition, i.e., a reference image and its corresponding modification text. The key of CTI-IR lies in how to properly compose the multi-modal query to retrieve the target image. In a sense, pioneer studies mainly focus on composing the text with either the local visual descriptor or global feature of the reference image. However, they overlook the fact that the text modifications are indeed diverse, ranging from the concrete attribute changes, like "change it to long sleeves", to the abstract visual property adjustments, e.g., "change the style to professional". Thus, simply emphasizing the local or global feature of the reference image for the query composition is insufficient. In light of the above analysis, we propose a Comprehensive Linguistic-Visual Composition Network (CLVC-Net) for image retrieval. The core of CLVC-Net is that it designs two composition modules: fine-grained local-wise composition module and fine-grained global-wise composition module, targeting comprehensive multi-modal compositions. Additionally, a mutual enhancement module is designed to promote local-wise and global-wise composition processes by forcing them to share knowledge with each other. Extensive experiments conducted on three real-world datasets demonstrate the superiority of our CLVC-Net. We released the codes to benefit other researchers.
Abstract: Given a text/image query, image-text retrieval aims to find the relevant items in the database. Recently, visual-linguistic pre-training (VLP) methods have demonstrated promising accuracy on image-text retrieval and other visual-linguistic tasks. These VLP methods are typically pre-trained on a large amount of image-text pairs, then fine-tuned on various downstream tasks. Nevertheless, due to the natural modality incompleteness in image-text retrieval, i.e., the query is either image or text rather than an image-text pair, the naive application of VLP to image-text retrieval results in significant inefficiency. Moreover, existing VLP methods cannot extract comparable representations for a single-modal query and multi-modal database items. In this work, we propose a generative visual-linguistic pre-training approach, termed as GilBERT, to simultaneously learn generic representations of image-text data and complete the missing modality for incomplete pairs. In testing phase, the proposed GilBERT facilitates efficient vector-based retrieval by providing unified feature embedding for query and database items. Moreover, the generative training not only makes GilBERT compatible with non-parallel text/image corpus, but also enables GilBERT to model the image-text relationships without suffering massive randomly-sampled negative samples, leading to superior experimental performances. Extensive experiments demonstrate the advantages of GilBERT in image-text retrieval, in terms of both efficiency and accuracy.
Abstract: The recent growth of web video sharing platforms has increased the demand for systems that can efficiently browse, retrieve and summarize video content. Query-aware multi-video summarization is a promising technique that caters to this demand. In this work, we introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS, that jointly optimizes multiple criteria: (1) conciseness, (2) representativeness of important query-relevant events and (3) chronological soundness. We design a hierarchical attention model that factorizes over three distributions, each collecting evidence from a different modality, followed by a pointer network that selects frames to include in the summary. DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence. We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
Abstract: With the recent advances of conversational recommendation, the recommender system is able to actively and dynamically elicit user preferences via conversational interactions. To achieve this, the system periodically queries users' preferences on attributes and collects their feedback. However, most existing conversational recommender systems only enable the user to provide absolute feedback on the attributes. In practice, absolute feedback is usually limited, as users tend to provide biased feedback when expressing their preferences. Instead, users are often more inclined to express comparative preferences, since user preferences are inherently relative. To enable users to provide comparative preferences during conversational interactions, we propose a novel comparison-based conversational recommender system. The relative feedback, though more practical, is not easy to incorporate, since its feedback scale is always mismatched with users' absolute preferences. By effectively collecting and understanding the relative feedback in an interactive manner, we further propose a new bandit algorithm, which we call RelativeConUCB. Experiments on both synthetic and real-world datasets validate the advantage of our proposed method, compared to existing bandit algorithms in conversational recommender systems.
Abstract: Collaborative bandit learning, i.e., bandit algorithms that utilize collaborative filtering techniques to improve sample efficiency in online interactive recommendation, has attracted much research attention as it enjoys the best of both worlds. However, all existing collaborative bandit learning solutions impose a stationary assumption about the environment, i.e., both user preferences and the dependency among users are assumed static over time. Unfortunately, this assumption hardly holds in practice due to users' ever-changing interests and dependency relations, which inevitably costs a recommender system sub-optimal performance in practice. In this work, we develop a collaborative dynamic bandit solution to handle a changing environment for recommendation. We explicitly model the underlying changes in both user preferences and their dependency relation as a stochastic process. Individual user's preference is modeled by a mixture of globally shared contextual bandit models with a Dirichlet process prior. Collaboration among users is thus achieved via Bayesian inference over the global bandit models. To balance exploitation and exploration during the interactions, Thompson sampling is used for both model selection and arm selection. Our solution is proved to maintain a standard Õ(√T) Bayesian regret in this challenging environment. Extensive empirical evaluations on both synthetic and real-world datasets further confirmed the necessity of modeling a changing environment and our algorithm's practical advantages against several state-of-the-art online learning solutions.
Abstract: Web automation scripts (tasklets) are used by personal AI assistants to carry out human tasks such as reserving a car or buying movie tickets. Generating tasklets today is a tedious job which requires much manual effort. We propose Glider, an automated and scalable approach to generate tasklets from a natural language task query and a website URL. A major advantage of Glider is that it does not require any pre-training. Glider models tasklet extraction as a state space search, where agents can explore a website's UI and get rewarded when making progress towards task completion. The reward is computed based on the agent's navigating pattern and the similarity between its trajectory and the task query. A hierarchical reinforcement learning policy is used to efficiently find the action sequences that maximize the reward. To evaluate Glider, we used it to extract tasklets for tasks in various categories (shopping, real-estate, flights, etc.); in 79% of cases a correct tasklet was generated.
Abstract: Conversational recommender systems (CRS) enable the traditional recommender systems to explicitly acquire user preferences towards items and attributes through interactive conversations. Reinforcement learning (RL) is widely adopted to learn conversational recommendation policies to decide what attributes to ask, which items to recommend, and when to ask or recommend, at each conversation turn. However, existing methods mainly target at solving one or two of these three decision-making problems in CRS with separated conversation and recommendation components, which restrict the scalability and generality of CRS and fall short of preserving a stable training procedure. In the light of these challenges, we propose to formulate these three decision-making problems in CRS as a unified policy learning task. In order to systematically integrate conversation and recommendation components, we develop a dynamic weighted graph based RL method to learn a policy to select the action at each conversation turn, either asking an attribute or recommending items. Further, to deal with the sample efficiency issue, we propose two action selection strategies for reducing the candidate action space according to the preference and entropy information. Experimental results on two benchmark CRS datasets and a real-world E-Commerce application show that the proposed method not only significantly outperforms state-of-the-art methods but also enhances the scalability and stability of CRS.
Abstract: Today's open-domain conversational agents increase the informativeness of generated responses by leveraging external knowledge. Most of the existing approaches work only for scenarios with a massive amount of monolingual knowledge sources. For languages with limited availability of knowledge sources, it is not effective to use knowledge in the same language to generate informative responses. To address this problem, we propose the task of cross-lingual knowledge grounded conversation (CKGC), where we leverage large-scale knowledge sources in another language to generate informative responses. Two main challenges come with the task of cross-lingual knowledge grounded conversation: (1) knowledge selection and response generation in a cross-lingual setting; and (2) the lack of a test dataset for evaluation. To tackle the first challenge, we propose the curriculum self-knowledge distillation (CSKD) scheme, which utilizes a large-scale dialogue corpus in an auxiliary language to improve cross-lingual knowledge selection and knowledge expression in the target language via knowledge distillation. To tackle the second challenge, we collect a cross-lingual knowledge grounded conversation test dataset to facilitate relevant research in the future. Extensive experiments on the newly created dataset verify the effectiveness of our proposed curriculum self-knowledge distillation method for cross-lingual knowledge grounded conversation. In addition, we find that our proposed unsupervised method significantly outperforms the state-of-the-art baselines in cross-lingual knowledge selection.
Abstract: Review summarization aims to generate condensed text for online product reviews, and has attracted more and more attention in E-commerce platforms. In addition to the input review, the quality of generated summaries is highly related to the characteristics of users and products, e.g., their historical summaries, which could provide useful clues for the target summary generation. However, most previous works ignore the underlying interaction between the given input review and the corresponding historical summaries. Therefore, we aim to explore how to effectively incorporate the history information into the summary generation. In this paper, we propose a novel transformer-based reasoning framework for personalized review summarization. We design an elaborately adapted transformer network containing an encoder and a decoder, to fully infer the important and informative parts among the historical summaries in terms of the input review to generate more comprehensive summaries. In the encoder of our approach, we develop an inter- and intra-attention to involve the history information selectively to learn the personalized representation of the input review. In the decoder part, we propose to incorporate the constructed reasoning memory learning from historical summaries into the original transformer decoder, and design a memory-decoder attention module to retrieve more useful information for the final summary generation. Extensive experiments are conducted and the results show our approach could generate more reasonable summaries for recommendation, and outperform many competitive baseline methods.
Abstract: A typical journalistic convention in news articles is to deliver the most salient information in the beginning, also known as the lead bias. While this phenomenon can be exploited in generating a summary, it has a detrimental effect on teaching a model to discriminate and extract important information in general. We propose that this lead bias can be leveraged in our favor in a simple and effective way to pre-train abstractive news summarization models on large-scale unlabeled news corpora: predicting the leading sentences using the rest of an article. We collect a massive news corpus and conduct data cleaning and filtering via statistical analysis. We then apply self-supervised pre-training on this dataset to existing generation models BART and T5 for domain adaptation. Via extensive experiments on six benchmark datasets, we show that this approach can dramatically improve the summarization quality and achieve state-of-the-art results for zero-shot news summarization without any fine-tuning. For example, in the DUC2003 dataset, the ROUGE-1 score of BART increases 13.7% after the lead-bias pre-training. We deploy the model in Microsoft News and provide public APIs as well as a demo website for multi-lingual news summarization.
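The self-supervised example construction is straightforward to sketch: treat the leading sentences as the pseudo-summary target and the remainder of the article as the source. The sentence splitter and the number of lead sentences below are illustrative choices, not necessarily those used on the authors' corpus.

.. code-block:: python

    import re

    def lead_bias_pair(article: str, lead_sentences: int = 3):
        """Build a self-supervised pre-training example: predict the leading
        sentences (pseudo-summary) from the remainder of the article."""
        sents = re.split(r"(?<=[.!?])\s+", article.strip())
        target = " ".join(sents[:lead_sentences])
        source = " ".join(sents[lead_sentences:])
        return source, target

    article = ("A quake struck the coast early Monday. Officials reported no injuries. "
               "Power was briefly lost in two districts. Crews restored service by noon. "
               "Schools reopened the next day.")
    src, tgt = lead_bias_pair(article, lead_sentences=2)
    print("SOURCE:", src)
    print("TARGET:", tgt)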
Abstract: The task of natural language table retrieval (NLTR) seeks to retrieve semantically relevant tables based on natural language queries. Existing learning systems for this task often treat tables as plain text based on the assumption that tables are structured as dataframes. However, tables can have complex layouts which indicate diverse dependencies between subtable structures, such as nested headers. As a result, queries may refer to different spans of relevant content that is distributed across these structures. Moreover, such systems fail to generalize to novel scenarios beyond those seen in the training set. Prior methods are still distant from a generalizable solution to the NLTR problem, as they fall short in handling complex table layouts or queries over multiple granularities. To address these issues, we propose Graph-based Table Retrieval (GTR), a generalizable NLTR framework with multi-granular graph representation learning. In our framework, a table is first converted into a tabular graph, with cell nodes, row nodes and column nodes to capture content at different granularities. Then the tabular graph is input to a Graph Transformer model that can capture both table cell content and the layout structures. To enhance the robustness and generalizability of the model, we further incorporate a self-supervised pre-training task based on graph-context matching. Experimental results on two benchmarks show that our method leads to significant improvements over the current state-of-the-art systems. Further experiments demonstrate promising performance of our method on cross-dataset generalization, and enhanced capability of handling complex tables and fulfilling diverse query intents.
Abstract: Deep language models (deep LMs) are increasingly being used for full text retrieval or within cascade retrieval pipelines as later-stage re-rankers. A problem with using deep LMs is that, at query time, a slow inference step needs to be performed -- this hinders the practical adoption of these powerful retrieval models, or sensibly limits how many documents can be considered for re-ranking. We propose the novel, BERT-based, Term Independent Likelihood moDEl (TILDE), which ranks documents by both query and document likelihood. At query time, our model does not require the inference step of deep language model based retrieval approaches, thus providing consistent time savings, as the prediction of query terms' likelihood can be pre-computed and stored during index creation. This is achieved by relaxing the term dependence assumption made by the deep LMs. In addition, we have devised a novel bi-directional training loss which allows TILDE to maximise both query and document likelihood at the same time during training. At query time, TILDE can rely on its query likelihood component (TILDE-QL) alone, or on the combination of TILDE-QL and its document likelihood component (TILDE-DL), thus providing a flexible trade-off between efficiency and effectiveness. Exploiting both components provides the highest effectiveness at a higher computational cost, while relying only on TILDE-QL trades off effectiveness for faster response time due to no inference being required. TILDE is evaluated on the MS MARCO and TREC Deep Learning 2019 and 2020 passage ranking datasets. Empirical results show that, compared to other approaches that aim to make deep language models viable operationally, TILDE achieves competitive effectiveness coupled with low query latency.
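The efficiency idea, pre-computing per-term likelihoods for each document at indexing time and scoring a query by a simple sum under a term-independence assumption, can be sketched with a count-based stand-in for the BERT-derived likelihoods; the smoothing and out-of-vocabulary handling below are illustrative only.

.. code-block:: python

    import math
    from collections import Counter

    def index_doc(doc_tokens, vocab, alpha=0.1):
        """Pre-compute smoothed per-term log-likelihoods for one document
        (the expensive model call happens here, at indexing time, not at query time)."""
        counts = Counter(doc_tokens)
        total = len(doc_tokens)
        return {t: math.log((counts[t] + alpha) / (total + alpha * len(vocab))) for t in vocab}

    def score_query(query_tokens, doc_likelihoods):
        # term-independence assumption: just sum the pre-computed log-likelihoods
        return sum(doc_likelihoods.get(t, min(doc_likelihoods.values())) for t in query_tokens)

    vocab = {"deep", "ranking", "bert", "passage", "cat"}
    doc = "deep ranking with bert bert passage".split()
    likelihoods = index_doc(doc, vocab)
    print(score_query("bert passage ranking".split(), likelihoods))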
Abstract: Large-scale recommender systems mainly consist of two stages: matching and ranking. The matching stage (also known as the retrieval step) identifies a small fraction of relevant items from a billion-scale item corpus at low latency and computational cost. Item-to-item collaborative filtering (item-based CF) and embedding-based retrieval (EBR) have long been used in the industrial matching stage owing to their efficiency. However, item-based CF is hard to personalize, while EBR has difficulty in satisfying diversity. In this paper, we propose a novel matching architecture, Path-based Deep Network (PDN), which incorporates both personalization and diversity to enhance matching performance. Specifically, PDN is comprised of two modules: Trigger Net and Similarity Net. PDN utilizes Trigger Net to capture the user's interest in each of his/her interacted items. Similarity Net is devised to evaluate the similarity between each interacted item and the target item based on these items' profile and CF information. The final relevance between the user and the target item is calculated by explicitly considering the user's diverse interests, i.e., aggregating the relevance weights of the related two-hop paths (one hop of a path corresponds to a user-item interaction and the other to item-item relevance). Furthermore, we describe the architecture design of the proposed PDN in a leading real-world E-Commerce service (the Mobile Taobao App). Based on offline evaluations and an online A/B test, we show that PDN outperforms the existing solutions for the same task. The online results also demonstrate that PDN can retrieve more personalized and more diverse items to significantly improve user engagement. Currently, the PDN system has been successfully deployed in the Mobile Taobao App and handles major online traffic.
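The two-hop path aggregation can be sketched in a few lines; the sum aggregation and toy trigger/similarity weights below are illustrative, whereas PDN learns Trigger Net and Similarity Net end to end.

.. code-block:: python

    import numpy as np

    def pdn_score(trigger_weights, item_sims):
        """Path-based score sketch: aggregate relevance over two-hop paths
        user -> interacted item -> target item."""
        # one plausible aggregation; the paper learns both networks jointly
        return float(np.sum(trigger_weights * item_sims))

    # toy user with 3 interacted items and one candidate target item
    trigger = np.array([0.9, 0.2, 0.6])   # Trigger Net: user's interest in each interaction
    sims = np.array([0.7, 0.1, 0.8])      # Similarity Net: relatedness to the target item
    print(pdn_score(trigger, sims))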
Abstract: Ranking has always been one of the top concerns in information retrieval research. For decades, the lexical matching signal has dominated the ad-hoc retrieval process, but solely using this signal in retrieval may cause the vocabulary mismatch problem. In recent years, with the development of representation learning techniques, many researchers have turned to Dense Retrieval (DR) models for better ranking performance. Although several existing DR models have already obtained promising results, their performance improvement heavily relies on the sampling of training examples. Many effective sampling strategies are not efficient enough for practical usage, and for most of them there is still no theoretical analysis of how and why the performance improvement happens. To shed light on these research questions, we theoretically investigate different training strategies for DR models and try to explain why hard negative sampling performs better than random sampling. Through the analysis, we also find that there are many potential risks in static hard negative sampling, which is employed by many existing training methods. Therefore, we propose two training strategies, named the Stable Training Algorithm for dense Retrieval (STAR) and the query-side training Algorithm for Directly Optimizing Ranking pErformance (ADORE), respectively. STAR improves the stability of the DR training process by introducing random negatives. ADORE replaces the widely adopted static hard negative sampling method with a dynamic one to directly optimize the ranking performance. Experimental results on two publicly available retrieval benchmark datasets show that either strategy gains significant improvements over existing competitive baselines and that a combination of them leads to the best performance.
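A minimal sketch of a dense-retrieval training loss that mixes hard negatives with random negatives, the stabilising ingredient behind STAR, is shown below; the embedding sizes and temperature are illustrative, and ADORE's dynamic refresh of the hard negatives with the current model is only noted in the comments.

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def dr_loss(q, pos, hard_negs, rand_negs, tau=0.05):
        """Contrastive loss sketch mixing hard negatives with random negatives
        (the stabilising idea behind STAR); ADORE would periodically re-retrieve
        the hard negatives with the current model instead of a static index."""
        negs = torch.cat([hard_negs, rand_negs], dim=0)
        logits = torch.cat([(q * pos).sum(-1, keepdim=True),  # positive similarity
                            q @ negs.T], dim=-1) / tau         # negative similarities
        labels = torch.zeros(q.size(0), dtype=torch.long)      # positive is index 0
        return F.cross_entropy(logits, labels)

    q = F.normalize(torch.randn(4, 16), dim=-1)      # 4 query embeddings
    pos = F.normalize(torch.randn(4, 16), dim=-1)    # their positive passages
    hard = F.normalize(torch.randn(8, 16), dim=-1)   # retrieved hard negatives
    rand = F.normalize(torch.randn(8, 16), dim=-1)   # in-batch / random negatives
    print(dr_loss(q, pos, hard, rand))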
Abstract: Pre-training and fine-tuning have achieved remarkable success in many downstream natural language processing (NLP) tasks. Recently, pre-training methods tailored for information retrieval (IR) have also been explored, and the latest success is the PROP method, which has reached new SOTA on a variety of ad-hoc retrieval benchmarks. The basic idea of PROP is to construct the representative words prediction (ROP) task for pre-training, inspired by the query likelihood model. Despite its exciting performance, the effectiveness of PROP might be bounded by the classical unigram language model adopted in the ROP task construction process. To tackle this problem, we propose a bootstrapped pre-training method (namely B-PROP) based on BERT for ad-hoc retrieval. The key idea is to use the powerful contextual language model BERT to replace the classical unigram language model for the ROP task construction, and to re-train BERT itself towards the tailored objective for IR. Specifically, we introduce a novel contrastive method, inspired by the divergence-from-randomness idea, to leverage BERT's self-attention mechanism to sample representative words from the document. By further fine-tuning on downstream ad-hoc retrieval tasks, our method achieves significant improvements over PROP and other baselines, and further pushes forward the SOTA on a variety of ad-hoc retrieval tasks.
Abstract: The evaluation of recommender systems relies on user preference data, which is difficult to acquire directly because of its subjective nature. Current recommender systems widely utilize users' historical interactions as implicit or explicit feedback, but such data usually suffers from various types of bias. Little work has been done on collecting and understanding user's personal preferences via third-party annotations. External assessments, that is, annotations made by assessors who are not the systems' users, have been widely used in information search scenarios. Is it possible to use external assessments to construct user preference labels? This paper presents the first attempt to incorporate external assessments into preference labeling and recommendation evaluation. The aim is to verify the possibility and reliability of external assessments for personalized recommender systems. We collect both users' real preferences and assessors' estimated preferences through a multi-role, multi-session user study. By investigating the inter-assessor agreement and user-assessor consistency, we demonstrate the reasonable stability and high accuracy of external preference assessments. Furthermore, we investigate the usage of external assessments in system evaluation. A higher degree of consistency with users' online feedback is observed, even better than traditional history-based online evaluation. Our findings show that external assessments can be used for assessing user preference labels and evaluating systems in personalized recommendation scenarios.
Abstract: The offline evaluation of search requires us to define a standard against which we measure the quality of results returned by a ranker. Frequently this standard is defined in absolute terms through relevance grades, but it can also be defined in relative terms through preferences. These preferences might be created through explicit preference judgments, derived from relevance grades, or inferred from clicks and other signals. Preferences from multiple sources might even be combined. In contrast to absolute grades, preferences avoid complex definitions of relevance, indicating only that a ranker should favor one result over another. Despite the simplicity and flexibility of preferences, widespread adoption has been limited by the lack of established evaluation measures. Recent work in this direction has taken two approaches: 1) measures based on weighted counts of agreements and disagreements between a set of preferences and an actual ranking generated by a ranker; and 2) measures that translate preferences into gain values for use with traditional measures, such as nDCG. Both approaches require methods for specifying weights or gains that have little or no theoretical foundation, and the values of these measures have no clear and meaningful interpretation. To address these problems, we propose an evaluation measure that computes the similarity between a directed multigraph of preferences and an actual ranking generated by a ranker. The measure computes an ordering for the vertices of the preference graph that maximizes its similarity to the actual ranking under a rank similarity measure. This maximum similarity becomes the value of the measure. Preference graphs are often acyclic, or nearly so, and to compute the measure we extend an approximate greedy algorithm that is known to produce good results for nearly acyclic graphs. For the rank similarity measure we employ Rank Biased Overlap (RBO) which was explicitly created to match the requirements of search and related applications. We validate the new measure over several collections of preferences explored in recent work.
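Since the proposed measure reports the maximum similarity under Rank-Biased Overlap, a truncated RBO between an ordering of the preference graph's vertices and the actual ranking can be sketched as follows; the full measure also handles the extrapolated tail and the greedy graph-ordering search, both omitted here.

.. code-block:: python

    def rbo(run_a, run_b, p=0.9):
        """Truncated Rank-Biased Overlap between two rankings (prefix evaluation;
        the full measure extrapolates the unseen tail, which is omitted here)."""
        depth = min(len(run_a), len(run_b))
        seen_a, seen_b, score = set(), set(), 0.0
        for d in range(1, depth + 1):
            seen_a.add(run_a[d - 1]); seen_b.add(run_b[d - 1])
            overlap = len(seen_a & seen_b) / d      # agreement at depth d
            score += (p ** (d - 1)) * overlap
        return (1 - p) * score

    # ordering of the preference graph's vertices vs. the ranker's actual output
    preferred_order = ["d3", "d1", "d4", "d2", "d5"]
    actual_ranking = ["d1", "d3", "d2", "d4", "d5"]
    print(round(rbo(preferred_order, actual_ranking), 3))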
Abstract: Traces of searcher behaviour, such as query reformulation or clicks, are commonly used to evaluate a running search engine. The underlying expectation is that these behaviours are proxies for something more important, such as relevance, utility, or satisfaction. Affective computing technology gives us the tools to help confirm some of these expectations, by examining visceral expressive responses during search sessions. However, work to date has only studied small populations in laboratory settings and with a limited number of contrived search tasks. In this study, we analysed longitudinal, in-situ, search behaviours of 152 information workers, over the course of several weeks while simultaneously tracking their facial expressions. Results from over 20,000 search sessions and 45,000 queries allow us to observe that indeed affective expressions are consistent with, and complementary to, existing "click-based'' metrics. On a query-level, searches that result in a short dwell time are associated with a decrease in smiles (expressions of "happiness'') and that if a query is reformulated the results of the reformulation are associated with an increase in smiling---suggesting a positive outcome as people converge on the information they need. On a session-level, sessions that feature reformulations are more commonly associated with fewer smiles and more furrowed brows (expressions of "anger/frustration''). Similarly, sessions with short-dwell clicks are also associated with fewer smiles. These data provide an insight into visceral aspects of search experience and present a new dimension for evaluating engine performance.
Abstract: Podcasts are spoken documents across a wide-range of genres and styles, with growing listenership across the world, and a rapidly lowering barrier to entry for both listeners and creators. The great strides in search and recommendation in research and industry have yet to see impact in the podcast space, where recommendations are still largely driven by word of mouth. In this perspective paper, we highlight the many differences between podcasts and other media, and discuss our perspective on challenges and future research directions in the domain of podcast information access.
Abstract: Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboards such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field. However, the goal is not simply to identify which run is "best", achieving the top score. The goal is to move the field forward by developing new robust techniques that work in many different settings and are adopted in research and practice. This paper uses the MS MARCO and TREC Deep Learning Track as our case study, comparing it to the case of TREC ad hoc ranking in the 1990s. We show how the design of the evaluation effort can encourage or discourage certain outcomes, and we raise questions about the internal and external validity of results. We provide some analysis of certain pitfalls, and a statement of best practices for avoiding such pitfalls. We summarize the progress of the effort so far, and describe our desired end state of "robust usefulness", along with steps that might be required to get us there.
Abstract: Recent research on conversational information seeking (CIS) mostly focuses on uni-modal interactions and information items. This perspective paper highlights the importance of moving towards developing and evaluating multi-modal conversational information seeking (MMCIS) systems, as they enable us to leverage richer context, overcome errors, and increase accessibility. We bridge the gap between the multi-modal and CIS research and provide a formal definition for MMCIS. We discuss potential opportunities and research challenges in designing, implementing, and evaluating MMCIS systems. Based on this research, we propose and implement a practical open-source framework for facilitating MMCIS research.
Abstract: Personalized news recommendation is a critical technology to help users find news of interest, and precisely matching users' interests with candidate news lies at the core of news recommendation. Existing studies generally learn a user's interest vector by aggregating his/her browsed news and then match it with the candidate news vector, which may lose the textual semantic matching signals for recommendation. In this paper, we propose an Attentive Multi-field Matching (AMM) framework for news recommendation which captures the semantic matching representations between each browsed news and the candidate news, and then aggregates them into the final user-news matching signal. In addition, our method incorporates multi-field information and designs a within-field and cross-field matching mechanism, which leverages complementary information from different fields (e.g., titles, abstracts and bodies) and obtains the multi-field matching representations. To achieve a comprehensive semantic understanding, we employ the popular language model BERT to learn the matching representation of each browsed-candidate news pair, and incorporate an attention mechanism in the aggregation procedure to characterize the importance of each matching representation for the final user-news matching signal. Experiments on real-world datasets validate the effectiveness of AMM.
Abstract: Mathematical Language Processing (MLP) deals with the automated processing and analysis of mathematical documents and relies heavily on good representations of mathematical symbols and texts. The aim of this work is to explore the modeling capabilities of state-of-the-art unsupervised deep learning methods to create such representations. Therefore, we pre-trained different instances of an ALBERT model on Mathematics StackExchange data and fine-tuned them on the task of Mathematical Answer Retrieval. Our evaluation shows that ALBERT outperforms all previous systems and is on par with current state-of-the-art systems for math retrieval, indicating strong capabilities in modeling mathematical posts. This implies that our approach can also be beneficial to various other tasks in MLP such as automatic proof checking or summarization of scientific texts.
Abstract: User simulation is needed for evaluating Interactive Information Retrieval (IIR) systems. However, for any user simulator to be useful, it must be reliable. In this paper, we propose a novel Tester-based approach to evaluating the reliability of user simulators: a Tester is constructed from a set of IR systems with an expected performance pattern and is then applied to a user simulator to check whether the simulator reproduces that pattern. We construct multiple Testers and apply them to a set of representative user simulators to empirically study the feasibility and effectiveness of the proposed Tester-based evaluation method. The results show that Tester-based evaluation is a feasible and effective method for evaluating user simulators and selecting reliable ones for evaluating IIR systems.
Abstract: Query categorization is an essential part of query intent understanding in e-commerce search. A common query categorization task is to select the relevant fine-grained product categories in a product taxonomy. For frequent queries, rich customer behavior (e.g., click-through data) can be used to infer the relevant product categories. However, for more rare queries, which cover a large volume of search traffic, relying solely on customer behavior may not suffice due to the lack of this signal. To improve categorization of rare queries, we adapt the Pseudo-Relevance Feedback (PRF) approach to utilize the latent knowledge embedded in semantically or lexically similar product documents to enrich the representation of the more rare queries. To this end, we propose a novel deep neural model named Attentive Pseudo Relevance Feedback Network (APRF-Net) to enhance the representation of rare queries for query categorization. To demonstrate the effectiveness of our approach, we collect search queries from a large commercial search engine, and compare APRF-Net to state-of-the-art deep learning models for text classification. Our results show that the APRF-Net significantly improves query categorization by 5.9% on F1@1 score over the baselines, which increases to 8.2% improvement for the rare (tail) queries. The findings of this paper can be leveraged for further improvements in search query representation and understanding.
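As a rough illustration of the pseudo-relevance-feedback idea behind APRF-Net (not the authors' architecture), the sketch below enriches a rare query's embedding with an attention-weighted average of its top-k most similar product-document embeddings; the function name, the number of feedback documents, and the mixing weight ``alpha`` are all hypothetical choices.

.. code-block:: python

    import numpy as np

    def prf_enrich(query_vec, doc_vecs, k=5, alpha=0.5):
        """Enrich a (rare) query embedding with pseudo-relevance feedback.

        query_vec: (d,) query embedding
        doc_vecs:  (n, d) product-document embeddings
        Returns an enriched (d,) query representation.
        """
        # Cosine similarities between the query and every document.
        q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
        d = doc_vecs / (np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-8)
        sims = d @ q                               # (n,)

        # Take the top-k documents as pseudo-relevant feedback.
        top = np.argsort(-sims)[:k]
        weights = np.exp(sims[top])
        weights /= weights.sum()                   # attention-style weights

        feedback = weights @ doc_vecs[top]         # weighted centroid of feedback docs
        return alpha * query_vec + (1 - alpha) * feedback

    rng = np.random.default_rng(0)
    docs = rng.normal(size=(100, 32))
    rare_query = rng.normal(size=32)
    print(prf_enrich(rare_query, docs).shape)      # (32,)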
Abstract: Sequential recommendation characterizes evolving user patterns by modeling item sequences chronologically. Its essential target is to capture the item transition correlations. Recent developments in transformers have inspired the community to design effective sequence encoders, e.g., SASRec and BERT4Rec. However, we observe that these transformer-based models suffer from the cold-start issue, i.e., performing poorly for short sequences. Therefore, we propose to augment short sequences while still preserving the original sequential correlations. We introduce a new framework for Augmenting Sequential Recommendation with Pseudo-prior items (ASReP). We first pre-train a transformer with sequences in a reverse direction to predict prior items. Then, we use this transformer to generate fabricated historical items at the beginning of short sequences. Finally, we fine-tune the transformer using these augmented sequences in chronological order to predict the next item. Experiments on two real-world datasets verify the effectiveness of ASReP. The code is available at https://github.com/DyGRec/ASReP.
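A minimal sketch of the augmentation step in the spirit of ASReP (not its exact implementation), assuming a reverse-direction next-item predictor is already trained and passed in as the hypothetical callable ``predict_prior_item``; pseudo-prior items are prepended to a short sequence until a target length is reached.

.. code-block:: python

    from typing import Callable, List

    def augment_short_sequence(seq: List[int],
                               predict_prior_item: Callable[[List[int]], int],
                               target_len: int = 10) -> List[int]:
        """Prepend pseudo-prior items generated by a reverse-trained model."""
        augmented = list(seq)
        while len(augmented) < target_len:
            # The reverse-trained model reads the sequence right-to-left and
            # predicts the item that plausibly preceded it.
            prior = predict_prior_item(augmented)
            augmented.insert(0, prior)
        return augmented

    # Toy stand-in for the reverse-trained transformer: always predicts item 0.
    dummy_model = lambda seq: 0
    print(augment_short_sequence([42, 7, 13], dummy_model, target_len=6))
    # [0, 0, 0, 42, 7, 13]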
Abstract: How to quickly and reliably learn the preferences of new users remains a key challenge in the design of recommender systems. In this paper we introduce a new type of online learning algorithm, cluster-based bandits, to address this challenge. It exploits the fact that users can often be grouped into clusters based on the similarity of their preferences, which allows accelerated learning of new user preferences: the task becomes one of identifying which cluster a user belongs to, and typically there are far fewer clusters than there are items to be rated. Clustering by itself is not enough, however. Intra-cluster variability between users can be thought of as adding noise to user ratings. Deterministic methods such as decision trees perform poorly in the presence of such noise. We identify so-called distinguisher items that are particularly informative for deciding which cluster a new user belongs to despite the rating noise. Using these items, the cluster-based bandit algorithm is able to efficiently adapt to user responses and rapidly learn the correct cluster to assign to a new user.
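To make the distinguisher-item idea concrete, here is an illustrative sketch under simplifying assumptions (it ignores the bandit's exploration and noise handling, and all names are hypothetical): items whose mean ratings spread the most across cluster centroids are picked as distinguishers, and a new user is assigned to the closest centroid on those items.

.. code-block:: python

    import numpy as np

    def pick_distinguishers(centroids, n_items=3):
        """Pick items whose mean ratings differ the most across clusters."""
        spread = centroids.max(axis=0) - centroids.min(axis=0)     # per-item spread
        return np.argsort(-spread)[:n_items]

    def assign_cluster(centroids, item_ids, user_ratings):
        """Assign a new user to the cluster whose centroid best matches the
        ratings the user gave on the distinguisher items."""
        diffs = centroids[:, item_ids] - np.asarray(user_ratings)   # (k, n_items)
        return int(np.argmin((diffs ** 2).sum(axis=1)))

    # Toy example: 3 clusters, 5 items, ratings on a 1-5 scale.
    centroids = np.array([[5, 1, 3, 2, 4],
                          [1, 5, 3, 4, 2],
                          [3, 3, 3, 3, 3]], dtype=float)
    items = pick_distinguishers(centroids, n_items=2)    # most informative items
    answers = [5, 1]                                     # new user's ratings on them
    print(items, assign_cluster(centroids, items, answers))   # -> cluster 0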
Abstract: Online search latency is a major bottleneck in deploying large-scale pre-trained language models, e.g. BERT, in retrieval applications. Inspired by recent advances in transformer-based document expansion techniques, we propose to trade offline relevance weighting for online retrieval efficiency by utilizing the powerful BERT ranker to weight the neighbour documents collected by generated pseudo-queries for each document. In the online retrieval stage, the traditional query-document matching is reduced to the much less expensive query to pseudo-query matching, and a document rank list is quickly recalled according to the pre-computed neighbour documents. Extensive experiments on the standard MS MARCO dataset with both passage and document ranking tasks demonstrate promising results of our method in terms of both online efficiency and effectiveness.
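The sketch below illustrates the offline/online split described above under stated assumptions (it is not the authors' system): neighbour document lists are assumed to have been pre-ranked offline by an expensive ranker such as BERT, so the online step is only a cheap query to pseudo-query match followed by a cache lookup. All identifiers and sizes are hypothetical.

.. code-block:: python

    import numpy as np

    # Offline: each pseudo-query has an embedding and a neighbour list that was
    # pre-ranked by an expensive ranker, so no heavy model needs to run online.
    pseudo_query_vecs = np.random.default_rng(0).normal(size=(1000, 64))
    precomputed_neighbours = {i: [f"doc_{i}_{r}" for r in range(10)]
                              for i in range(1000)}       # hypothetical doc ids

    def online_retrieve(query_vec, top_pq=3):
        """Cheap online step: query -> nearest pseudo-queries -> cached doc lists."""
        sims = pseudo_query_vecs @ query_vec               # dot-product matching
        nearest = np.argsort(-sims)[:top_pq]
        ranked = []
        for pq in nearest:                                 # merge cached neighbour lists
            ranked.extend(precomputed_neighbours[int(pq)])
        return ranked

    print(online_retrieve(np.random.default_rng(1).normal(size=64))[:5])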
Abstract: In recent years, legal case retrieval has attracted much attention in the IR research community. It aims to retrieve supporting cases for a given query case and contributes to better legal systems. While using a legal case retrieval system, users often find it difficult to construct accurate queries that express their information needs, especially when they lack sufficient domain knowledge. Since conversational search has been widely recognized to fulfill users' complex and exploratory information needs, we investigate whether the conversational search paradigm can be adopted to improve users' legal case retrieval experience. We design a laboratory-based study to collect users' interaction behaviors and explicit feedback signals while using traditional and agent-mediated conversational legal case retrieval systems. Based on the collected data, we compare the search behavior and outcomes of these two interaction paradigms. Experimental results show that users achieve better retrieval performance with the conversational case retrieval system than with the traditional one. Moreover, the conversational system can also save users' efforts in formulating queries and examining results.
Abstract: While neural recommenders have become the state-of-the-art in recent years, the complexity of deep models still makes the generation of tangible explanations for end users a challenging problem. Existing methods are usually based on attention distributions over a variety of features, which are still questionable regarding their suitability as explanations, and rather unwieldy to grasp for an end user. Counterfactual explanations based on a small set of the user's own actions have been shown to be an acceptable solution to the tangibility problem. However, current work on such counterfactuals cannot be readily applied to neural models. In this work, we propose ACCENT, the first general framework for finding counterfactual explanations for neural recommenders. It extends recently-proposed influence functions for identifying training points most relevant to a recommendation, from a single to a pair of items, while deducing a counterfactual set in an iterative process. We use ACCENT to generate counterfactual explanations for two popular neural models, Neural Collaborative Filtering (NCF) and Relational Collaborative Filtering (RCF), and demonstrate its feasibility on a sample of the popular MovieLens 100K dataset.
Abstract: The two-tower architecture has been widely applied for learning item and user representations, which is important for large-scale recommender systems. Many two-tower models are trained using various in-batch negative sampling strategies, where the effects of such strategies inherently rely on the size of mini-batches. However, training two-tower models with a large batch size is inefficient, as it demands a large volume of memory for item and user contents and consumes a lot of time for feature encoding. Interestingly, we find that neural encoders can output relatively stable features for the same input after warming up in the training process. Based on such facts, we propose a simple yet effective sampling strategy called Cross-Batch Negative Sampling (CBNS), which takes advantage of the encoded item embeddings from recent mini-batches to boost the model training. Both theoretical analysis and empirical evaluations demonstrate the effectiveness and the efficiency of CBNS.
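To illustrate the cross-batch idea, here is a minimal numpy sketch (not the paper's implementation): a FIFO memory of item embeddings from recent mini-batches is appended to the in-batch negatives when computing a sampled-softmax loss. The queue size and temperature are hypothetical, and since this is numpy there are no gradients; in the real method the cached embeddings are treated as (approximately) stable features rather than back-propagated through.

.. code-block:: python

    import numpy as np
    from collections import deque

    memory = deque(maxlen=4096)      # FIFO queue of item embeddings from past batches

    def cbns_loss(user_emb, item_emb, temperature=0.07):
        """Softmax loss over in-batch plus cross-batch negatives.

        user_emb, item_emb: (B, d) embeddings of matched user/item pairs.
        """
        negatives = [item_emb]                         # in-batch candidates
        if memory:
            negatives.append(np.stack(memory))          # cross-batch negatives
        candidates = np.concatenate(negatives, axis=0)  # (B + M, d)

        logits = user_emb @ candidates.T / temperature  # (B, B + M)
        logits -= logits.max(axis=1, keepdims=True)     # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        labels = np.arange(user_emb.shape[0])           # i-th item is the positive
        loss = -log_probs[labels, labels].mean()

        memory.extend(item_emb)                         # enqueue current batch
        return loss

    rng = np.random.default_rng(0)
    for _ in range(3):                                  # a few toy training steps
        u, v = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
        print(round(cbns_loss(u, v), 3))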
Abstract: Users' clicks on Web search results are one of the key signals for evaluating and improving web search quality and have been widely used as part of current state-of-the-art Learning-To-Rank (LTR) models. With a large volume of search logs available for major search engines, effective models of searcher click behavior have emerged to evaluate and train LTR models. However, when modeling the users' click behavior, considering the bias of the behavior is imperative. In particular, when a search result is not clicked, it is not necessarily judged as not relevant by the user, but instead could have been simply missed, especially for lower-ranked results. If these kinds of biases in the click log data are not accounted for by the click models, the errors propagate to the resulting LTR ranking models or evaluation metrics. In this paper, we propose the De-biased Reinforcement Learning Click model (DRLC). The DRLC model relaxes previously made assumptions about the users' examination behavior and resulting latent states. To implement the DRLC model, convolutional neural networks are used as the value networks for reinforcement learning, trained to learn a policy to reduce bias in the click logs. To demonstrate the effectiveness of the DRLC model, we first compare performance with previous state-of-the-art approaches using established click prediction metrics, including log-likelihood and perplexity. We further show that DRLC also leads to improvements in ranking performance. Our experiments demonstrate the effectiveness of the DRLC model in learning to reduce bias in click logs, leading to improved modeling performance and showing the potential for using DRLC for improving Web search quality.
Abstract: With the new developments of natural language processing, increasing attention has been given to the task of Named Entity Recognition (NER). However, the vast majority of work focuses on a small number of large-scale annotated datasets with a limited number of entity types such as person, location and organization. While other datasets have been introduced with domain-specific entities, their smaller size largely limits the applicability of state-of-the-art deep models. Even though there are promising new approaches for performing zero-shot learning (ZSL), they are not designed for cross-domain settings. We propose Cross Domain Zero Shot Named Entity Recognition with Knowledge Graph (DOZEN), which learns the relations between entities across different domains from an existing ontology of external knowledge and a set of analogies linking entities and domains. Experiments performed on both large-scale and domain-specific datasets indicate that DOZEN is the most suitable option to extract unseen entities in a target dataset from a different domain.
Abstract: Unbiased recommender learning has been actively studied to alleviate the inherent bias of implicit datasets under the missing-not-at-random assumption. Existing studies solely address the bias of positive feedback but do not account for the bias of missing feedback, which leads to sub-optimal performance. This paper proposes a dual recommender learning framework that simultaneously eliminates the bias of clicked and unclicked data. Specifically, the proposed loss function adopts two propensity-weighting terms to effectively estimate the true positive and negative preferences from clicked and unclicked data. We also prove that the proposed loss function converges to the ideal loss function for both clicked and unclicked data. Because of its model-agnostic property, it can be applied to any existing unbiased learning model. Experimental results show that the proposed method outperforms state-of-the-art unbiased models by up to 5.54-24.56% for MAP@1 on three datasets.
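The loss below is an illustrative inverse-propensity-style objective, not the exact one proposed in the paper: clicked items are up-weighted by the inverse of a click propensity and unclicked items by the inverse of a (hypothetical) missing-feedback propensity, so that both positive and negative preferences are estimated with less bias. All variable names are assumptions.

.. code-block:: python

    import numpy as np

    def dual_ips_loss(scores, clicks, p_click, p_unclick, eps=1e-8):
        """Illustrative dual propensity-weighted binary cross-entropy.

        scores:    (n,) predicted preference scores in (0, 1)
        clicks:    (n,) 1 if the item was clicked, 0 otherwise
        p_click:   (n,) propensity of observing a click for a relevant item
        p_unclick: (n,) propensity of a relevant item remaining unclicked
        """
        pos_term = clicks / (p_click + eps) * np.log(scores + eps)
        neg_term = (1 - clicks) / (p_unclick + eps) * np.log(1 - scores + eps)
        return -(pos_term + neg_term).mean()

    rng = np.random.default_rng(0)
    scores = rng.uniform(0.05, 0.95, size=100)
    clicks = rng.integers(0, 2, size=100)
    p_c = rng.uniform(0.2, 0.9, size=100)
    p_u = rng.uniform(0.2, 0.9, size=100)
    print(round(dual_ips_loss(scores, clicks, p_c, p_u), 3))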
Abstract: Personalized news recommendation is an essential technique for online news services. News articles usually contain rich textual content, and accurate news modeling is important for personalized news recommendation. Existing news recommendation methods mainly model news texts based on traditional text modeling methods, which is not optimal for mining the deep semantic information in news texts. Pre-trained language models (PLMs) are powerful for natural language understanding, which has the potential for better news modeling. However, there is no public report that shows PLMs have been applied to news recommendation. In this paper, we report our work on pre-trained language models empowered news recommendation (PLM-NR). Offline experimental results on both monolingual and multilingual news recommendation datasets show that leveraging PLMs for news modeling can effectively improve the performance of news recommendation. Our PLM-NR models have been deployed to the Microsoft News platform, and online flight results show that they can achieve significant performance gains in both English-speaking and global markets.
Abstract: Recently, BERT has shown overwhelming performance in sequential recommendation by using a bidirectional attention mechanism. Although the bidirectional model effectively captures dynamics from user interactions, its training strategy does not fit well with the inference stage in sequential recommendation, which generally proceeds in a left-to-right way. To address this problem, we introduce a new recommendation system built upon BART, which is widely used in NLP tasks. BART uses a left-to-right decoder and injects noise into its bidirectional encoder, which can reduce the gap between training and inference. However, direct usage of BART for recommendation systems is challenging due to its model properties and the domain difference. BART is an auto-regressive generative model, and its noising transformation techniques were originally developed for text sequences. In this paper, we present a novel sequential recommendation model, Entangled BART for Recommendation (E-BART4Rec), that entangles a bidirectional encoder and an auto-regressive decoder with noisy transformations for user interactions. Unlike BART, where the final output only depends on the output of the decoder, E-BART4Rec dynamically integrates the output of the bidirectional encoder and the auto-regressive decoder based on a gating mechanism that calculates the importance of each output. We also apply noisy transformations that imitate real users' behaviors, such as item deletion, item cropping, item reversal, and item infilling, to the input of the encoder. Extensive experiments on widely used real-world datasets demonstrate that our models significantly outperform the baselines.
Abstract: Using entity aspect links, we improve upon the current state-of-the-art in entity retrieval. Entity retrieval is the task of retrieving relevant entities for search queries, such as "Antibiotic Use In Livestock". Entity aspect linking is a new technique to refine the semantic information of entity links. For example, while passages relevant to the query above may mention the entity "USA", there are many aspects of the USA of which only few, such as "USA/Agriculture", are relevant for this query. By using entity aspect links that indicate which aspect of an entity is being referred to in the context of the query, we obtain more specific relevance indicators for entities. We show that our approach improves upon all baseline methods, including the current state-of-the-art using a standard entity retrieval test collection. With this work, we release a large collection of entity-aspect-links for a large TREC corpus.
Abstract: Experimental evaluation is regarded as a critical element of any research activity in Information Retrieval, and is typically used to support assertions of the form "Technique A provides better retrieval effectiveness than does Technique B". Implicit in such claims are the characteristics of the data to which the results apply, in terms of both the queries used and the documents they were applied to. Here we explore the role of evaluation on a collection as a prediction of relative performance on collections that have different characteristics. In particular, by synthesizing new collections that vary from each other in a controlled way, we show that it is possible to explore the reliability of an IR evaluation pipeline, and to better understand the complex interrelationship between documents, queries, and metrics that is an important part of any experimental validation. Our results show that predictivity declines as the collection is varied, even in simple ways such as shifting in focus from one document source to another similar source.
Abstract: Deep cross-modal retrieval methods have shown their competitiveness among different cross-modal retrieval algorithms. Generally, these methods require a large amount of training data. However, aggregating large amounts of data incurs huge privacy risks and high maintenance costs. Inspired by the recent success of federated learning, we propose federated cross-modal retrieval (FedCMR), which learns the model with decentralized multi-modal data. Specifically, we first train the cross-modal retrieval model and learn the common space across multiple modalities in each client using its local data. Then, we jointly learn the common subspace of multiple clients on the trusted central server. Finally, each client updates the common subspace of its local model based on the aggregated common subspace on the server, so that all clients participating in the training can benefit from federated learning. Experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method.
Abstract: Named entity recognition (NER) for Web queries is very challenging. Queries often do not consist of well-formed sentences, and contain very little context, with highly ambiguous queried entities. Code-mixed queries, with entities in a different language than the rest of the query, pose a particular challenge in domains like e-commerce (e.g. queries containing movie or product names). This work tackles NER for code-mixed queries, where entities and non-entity query terms co-exist simultaneously in different languages. Our contributions are twofold. First, to address the lack of code-mixed NER data we create EMBER, a large-scale dataset in six languages with four different scripts. Based on Bing query data, we include numerous language combinations that showcase real-world search scenarios. Secondly, we propose a novel gated architecture that enhances existing multi-lingual Transformers with a Mixture-of-Experts model to dynamically infuse multi-lingual gazetteers, allowing it to simultaneously differentiate and handle entities and non-entity query terms in multiple languages. Experimental evaluation on code-mixed queries in several languages shows that our approach efficiently utilizes gazetteers to recognize entities in code-mixed queries with an F1=68%, an absolute improvement of +31% over a non-gazetteer baseline.
Abstract: The rapidly rising ubiquity and dissemination of online information such as social media text and news improve users' access to financial markets; however, modeling these vast streams of irregular, temporal data poses a challenge. Such temporal streams of information show power-law dynamics, scale-free characteristics, and time irregularities that sequential models are unable to model accurately. In this work, we propose the first Hierarchical Time-Aware Hyperbolic LSTM (HTLSTM), which leverages the Riemannian manifold for encoding the scale-free nature of a sequence of text in a time-aware fashion. Through experiments on three financial tasks: stock trading, equity price movement prediction, and financial risk prediction, we demonstrate HTLSTM's applicability for modeling temporal sequences of online information. On real-world data from four global stock markets and three stock indices spanning data in English and Chinese, we make a step towards time-aware text modeling via hyperbolic geometry.
Abstract: Sequential recommendation (SR) has attracted much research attention in the past few years. Most existing attribute-integrated SR models do not directly model the complex relations between items and categorical attributes, nor do they exploit the power of the attribute sequence in predicting the next item. In this paper, we propose an Item Categorical Attribute Integrated Sequential Recommendation (ICAI-SR) framework, which consists of an Item-Attribute Aggregation (IAA) model and Entity Sequential (ES) models. In the IAA model, we employ a heterogeneous graph to represent the complex relations between items and different types of categorical attributes; an attention-based neighborhood aggregation is then designed to model the correlations between items and attributes. For ES models, there is one Item Sequential (IS) model and one or more Attribute Sequential (AS) models. With IS and AS models, not only the item sequence but also the attribute sequence is used to predict the next item during model training. ICAI-SR is instantiated by taking Gated Recurrent Unit (GRU) and Bidirectional Encoder Representations from Transformers (BERT) as ES models, resulting in ICAI-GRU and ICAI-BERT, respectively. Extensive experiments have been conducted on three public datasets to validate the performance of ICAI-SR. Experimental results show that ICAI-SR performs better than both basic SR models and a competitive attribute-integrated SR model.
Abstract: Query logs of search engines with instant search functionality are challenging for log analysis, since the log entries represent interactions at the keystroke level, rather than at the query level. To enable log analyses at the query level, a user's logged sequence of keystroke-level interactions needs to be mapped to distinct queries. This problem bears strong parallels to session detection in "standard" query logs (i.e., forming groups of subsequent queries on the same topic), but there are salient differences. In this paper, we present a new approach to identifying interactions belonging to the same query in instant query logs. In an experimental comparison, our new approach achieves an F2 score of 0.93, compared to 0.83 for a state-of-the-art cascading method for query log session detection.
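The abstract does not detail the proposed method, so as a point of reference only, the snippet below sketches a simple baseline-style heuristic for the task: consecutive keystroke-level entries are grouped into the same query when they are prefix-related and close in time. The threshold and data layout are hypothetical assumptions.

.. code-block:: python

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Entry:
        timestamp: float      # seconds
        text: str             # query box content at this keystroke

    def group_into_queries(entries: List[Entry], max_gap: float = 5.0) -> List[List[Entry]]:
        """Group keystroke-level log entries into distinct queries."""
        queries: List[List[Entry]] = []
        for e in entries:
            if queries:
                prev = queries[-1][-1]
                prefix_related = e.text.startswith(prev.text) or prev.text.startswith(e.text)
                if prefix_related and (e.timestamp - prev.timestamp) <= max_gap:
                    queries[-1].append(e)      # continuation of the same query
                    continue
            queries.append([e])                # start of a new query
        return queries

    log = [Entry(0.0, "s"), Entry(0.4, "se"), Entry(0.9, "sea"),
           Entry(9.0, "movie"), Entry(9.5, "movies")]
    print([q[-1].text for q in group_into_queries(log)])   # ['sea', 'movies']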
Abstract: The Transformer-Kernel (TK) model has demonstrated strong reranking performance on the TREC Deep Learning benchmark---and can be considered to be an efficient (but slightly less effective) alternative to other Transformer-based architectures that employ (i) large-scale pretraining (high training cost), (ii) joint encoding of query and document (high inference cost), and (iii) larger number of Transformer layers (both high training and high inference costs). Since then, a variant of the TK model, called TKL, has been developed that incorporates local self-attention to efficiently process longer input sequences in the context of document ranking. In this work, we propose a novel Conformer layer as an alternative approach to scale TK to longer input sequences. Furthermore, we incorporate query term independence and explicit term matching to extend the model to the full retrieval setting. We benchmark our models under the strictly blind evaluation setting of the TREC 2020 Deep Learning track and find that our proposed architecture changes lead to improved retrieval quality over TKL. Our best model also outperforms all non-neural runs ("trad") and two-thirds of the pretrained Transformer-based runs ("nnlm") on NDCG@10.
Abstract: Recommendation systems can help users process large amounts of information, and generative adversarial networks (GANs) show great potential in recommendation systems. In this paper, we propose a new GAN model to enhance the information flow within the generator based on the information flow between the original generator and discriminator. Our experimental results indicate that our model reduces the discrepancy between the generator and the discriminator. Both the generator and discriminator yield considerable performance improvements compared to other strong baselines. The improvements in NDCG@3 and MRR are significant, reaching 30.98% and 30.17%, respectively.
Abstract: Knowledge graphs are widely used in information retrieval as they can enhance our semantic understanding of queries and documents. The main idea is to consider entities and entity relationships as side information. Although existing work has achieved improvements in retrieval effectiveness by incorporating information from knowledge graphs into retrieval models, few studies have leveraged knowledge graphs in understanding users' search behavior. We investigate user behavior during session search from the perspective of a knowledge graph. We conduct a query log-based analysis of users' query reformulation and document clicking behavior. Based on a large-scale commercial query log and a knowledge graph, we find new user behavior patterns in terms of query reformulation and document clicking. Our study deepens our understanding of user behavior in session search and provides implications to help improve retrieval models with knowledge graphs.
Abstract: Accurately estimating the retrieval effectiveness of different queries representing distinct information needs is a problem in Information Retrieval (IR) that has been studied for over 20 years. Recent work showed that the problem can be significantly harder when multiple queries representing the same information need are used in prediction. By generalizing the existing evaluation framework of Query Performance Prediction (QPP), we explore the causes of these differences in prediction quality in the two scenarios. Our empirical analysis demonstrates that for most predictors, this difference is solely an artifact of the underlying differences in the query effectiveness distributions. Our detailed analysis also demonstrates key performance distribution properties under which QPP is most and least reliable.
Abstract: An embedding index that enables fast approximate nearest neighbor (ANN) search serves as an indispensable component of state-of-the-art deep retrieval systems. Traditional approaches, which often separate the two steps of embedding learning and index building, incur additional indexing time and decayed retrieval accuracy. In this paper, we propose a novel method called Poeem, which stands for product quantization based embedding index jointly trained with deep retrieval model, to unify the two separate steps within end-to-end training, by utilizing a few techniques including the gradient straight-through estimator, a warm start strategy, optimal space decomposition and Givens rotation. Extensive experimental results show that the proposed method not only improves retrieval accuracy significantly but also reduces the indexing time to almost none. We have open sourced our approach for the sake of comparison and reproducibility.
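To make the product-quantization component concrete, here is a toy numpy sketch of quantizing an embedding against per-subvector codebooks and measuring the reconstruction error. In Poeem the codebooks are trained jointly with the retrieval model (via the straight-through estimator), which is omitted here; the random codebooks and sizes below are purely illustrative.

.. code-block:: python

    import numpy as np

    def pq_quantize(x, codebooks):
        """Quantize a vector with product quantization.

        x:         (d,) embedding
        codebooks: (m, k, d // m) one codebook of k codewords per sub-vector
        Returns (codes, reconstruction).
        """
        m, k, sub = codebooks.shape
        parts = x.reshape(m, sub)                               # split into m sub-vectors
        dists = ((codebooks - parts[:, None, :]) ** 2).sum(-1)  # (m, k) squared distances
        codes = dists.argmin(axis=1)                            # nearest codeword per sub-vector
        recon = codebooks[np.arange(m), codes].reshape(-1)      # concatenate chosen codewords
        return codes, recon

    rng = np.random.default_rng(0)
    d, m, k = 64, 8, 256
    codebooks = rng.normal(size=(m, k, d // m))
    x = rng.normal(size=d)
    codes, recon = pq_quantize(x, codebooks)
    print(codes.shape, float(np.linalg.norm(x - recon)))        # (8,) plus the error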
Abstract: Neural information retrieval systems typically use a cascading pipeline, in which a first-stage model retrieves a candidate set of documents and one or more subsequent stages re-rank this set using contextualized language models such as BERT. In this paper, we propose DeepImpact, a new document term-weighting scheme suitable for efficient retrieval using a standard inverted index. Compared to existing methods, DeepImpact improves impact-score modeling and tackles the vocabulary-mismatch problem. In particular, DeepImpact leverages DocT5Query to enrich the document collection and, using a contextualized language model, directly estimates the semantic importance of tokens in a document, producing a single-value representation for each token in each document. Our experiments show that DeepImpact significantly outperforms prior first-stage retrieval approaches by up to 17% on effectiveness metrics w.r.t. DocT5Query, and, when deployed in a re-ranking scenario, can reach the same effectiveness of state-of-the-art approaches with up to 5.1x speedup in efficiency.
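As an illustration of how single-value impact scores support standard inverted-index retrieval, the snippet below builds a toy impact index and scores a query by summing the impacts of matching terms. In DeepImpact those scores come from a contextualized language model over a DocT5Query-expanded document; here they are simply given numbers, and all identifiers are hypothetical.

.. code-block:: python

    from collections import defaultdict

    def build_impact_index(doc_impacts):
        """doc_impacts: {doc_id: {token: impact}} -> token -> [(doc_id, impact)]."""
        index = defaultdict(list)
        for doc_id, impacts in doc_impacts.items():
            for token, score in impacts.items():
                index[token].append((doc_id, score))
        return index

    def score_query(index, query_tokens):
        """Document score = sum of per-token impacts, as in impact-based retrieval."""
        scores = defaultdict(float)
        for token in query_tokens:
            for doc_id, impact in index.get(token, []):
                scores[doc_id] += impact
        return sorted(scores.items(), key=lambda kv: -kv[1])

    # Hypothetical impacts (in DeepImpact these are predicted by a neural model).
    docs = {"d1": {"antibiotic": 2.1, "livestock": 1.7, "farm": 0.4},
            "d2": {"antibiotic": 1.2, "resistance": 2.4}}
    index = build_impact_index(docs)
    print(score_query(index, ["antibiotic", "livestock"]))   # d1 ranks above d2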
Abstract: Recent deployment of efficient billion-scale approximate nearest neighbor (ANN) search algorithms on GPUs has motivated information retrieval researchers to develop neural ranking models that learn low-dimensional dense representations for queries and documents and use ANN search for retrieval. However, optimizing these dense retrieval models poses several challenges including negative sampling for (pair-wise) training. A recent model, called ANCE, successfully uses dynamic negative sampling using ANN search. This paper improves upon ANCE by proposing a robust negative sampling strategy for scenarios where the training data lacks complete relevance annotations. This is of particular importance as obtaining large-scale training data with complete relevance judgment is extremely expensive. Our model uses a small validation set with complete relevance judgments to accurately estimate a negative sampling distribution for dense retrieval models. We also explore leveraging a lexical matching signal during training and pseudo-relevance feedback during evaluation for improved performance. Our experiments on the TREC Deep Learning Track benchmarks demonstrate the effectiveness of our solutions.
Abstract: Self-attention networks (SANs) have been intensively applied for sequential recommenders, but they are limited due to: (1) the quadratic complexity and vulnerability to over-parameterization in self-attention; (2) inaccurate modeling of sequential relations between items due to the implicit position encoding. In this work, we propose the low-rank decomposed self-attention networks (LightSANs) to overcome these problems. Particularly, we introduce the low-rank decomposed self-attention, which projects user's historical items into a small constant number of latent interests and leverages item-to-interest interaction to generate the context-aware representation. It scales linearly w.r.t. the user's historical sequence length in terms of time and space, and is more resilient to over-parameterization. Besides, we design the decoupled position encoding, which models the sequential relations between items more precisely. Extensive experimental studies are carried out on three real-world datasets, where LightSANs outperform the existing SANs-based recommenders in terms of both effectiveness and efficiency.
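A shapes-only numpy sketch of the item-to-interest idea (no learning, random matrices standing in for learned parameters, not the authors' code): the length-n item sequence is first summarized into k latent interests, and attention is then computed between items and interests, so the cost scales with n*k rather than n^2.

.. code-block:: python

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def low_rank_attention(items, k=8, seed=0):
        """items: (n, d) item representations -> (n, d) context-aware representations."""
        n, d = items.shape
        rng = np.random.default_rng(seed)
        W_pool = rng.normal(size=(d, k)) / np.sqrt(d)     # stand-in for learned parameters

        # 1) Summarize the length-n sequence into k latent interests: (k, d).
        pool = softmax(items @ W_pool, axis=0)            # per-interest weights over items
        interests = pool.T @ items

        # 2) Item-to-interest attention: an (n, k) map instead of the quadratic (n, n) one.
        attn = softmax(items @ interests.T / np.sqrt(d), axis=1)
        return attn @ interests                           # (n, d)

    seq = np.random.default_rng(1).normal(size=(200, 64))
    print(low_rank_attention(seq).shape)                  # (200, 64)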
Abstract: Sequential recommendation is intended to model the dynamic behavior regularity through users' behavior sequences. Recently, various deep learning techniques are applied to model the relation of items in the sequences. Despite their effectiveness, we argue that the aforementioned methods only consider the macro-structure of the behavior sequence, but neglect the micro-structure in the sequence which is important to sequential recommendation. To address the above limitation, we propose a novel model called Motif-aware Sequential Recommendation (MoSeR), which captures the motifs hidden in behavior sequences to model the micro-structure features. MoSeR extracts the motifs that contain both the last behavior and the target item. These motifs reflect the topological relations among local items in the form of directed graphs. Thus our method can make a more accurate prediction with the awareness of the inherent patterns between local items. Extensive experiments on three benchmark datasets demonstrate that our model outperforms the state-of-the-art sequential recommendation models.
Abstract: Autoencoder-based hybrid recommender systems have become popular recently because of their ability to learn user and item representations by reconstructing various information sources, including users' feedback on items (e.g., ratings) and side information of users and items (e.g., users' occupation and items' title). However, existing systems still use representations learned by matrix factorization (MF) to predict the rating, while using representations learned by neural networks as the regularizer. In this paper, we define the neural representation for prediction (NRP) framework and apply it to the autoencoder-based recommendation systems. We theoretically analyze how our objective function is related to the previous MF and autoencoder-based methods and explain what it means to use neural representations as the regularizer. We also apply the NRP framework to a direct neural network structure which predicts the ratings without reconstructing the user and item information. We conduct extensive experiments which confirm that neural representations are better for prediction than regularization and show that the NRP framework outperforms the state-of-the-art methods in the prediction task, with less training time and memory.
Abstract: Various researchers have recently explored the impact of different types of biases on information retrieval tasks such as ad hoc retrieval and question answering. While the impact of bias needs to be controlled in order to avoid increased prejudices, the literature has often viewed the relationship between increased retrieval utility (effectiveness) and reduced bias as a tradeoff where one can suffer from the other. In this paper, we empirically study this tradeoff and explore whether it would be possible to reduce bias while maintaining similar retrieval utility. We show this would be possible by revising the input query through a bias-aware pseudo-relevance feedback framework. We report our findings based on four widely used TREC corpora namely Robust04, Gov2, ClueWeb09 and ClueWeb12 and using two classes of bias metrics. The findings of this paper are significant as they are among the first to show that decrease in bias does not necessarily need to come at the cost of reduced utility.
Abstract: In this work, we address multi-modal information needs that contain text questions and images by focusing on passage retrieval for outside-knowledge visual question answering. This task requires access to outside knowledge, which in our case we define to be a large unstructured passage collection. We first conduct sparse retrieval with BM25 and study expanding the question with object names and image captions. We verify that visual clues play an important role and captions tend to be more informative than object names in sparse retrieval. We then construct a dual-encoder dense retriever, with the query encoder being LXMERT, a multi-modal pre-trained transformer. We further show that dense retrieval significantly outperforms sparse retrieval that uses object expansion. Moreover, dense retrieval matches the performance of sparse retrieval that leverages human-generated captions.
Abstract: Wikipedia, the largest open-collaborative online encyclopedia, is a corpus of documents bound together by internal hyperlinks. These links form the building blocks of a large network whose structure contains important information on the concepts covered in this encyclopedia. The presence of a link between two articles, materialised by an anchor text in the source page pointing to the target page, can increase readers' understanding of a topic. However, the process of linking follows specific editorial rules to avoid both under-linking and over-linking. In this paper, we study the transductive and the inductive tasks of link prediction on several subsets of the English Wikipedia and identify some key challenges behind automatic linking based on anchor text information. We propose an appropriate evaluation sampling methodology and compare several algorithms. Moreover, we propose baseline models that provide a good estimation of the overall difficulty of the tasks.
Abstract: Learning-to-rank systems often utilize user-item interaction data (e.g., clicks) to provide users with high-quality rankings. However, this data suffers from several biases, and if naively used as training data, it can lead to suboptimal ranking algorithms. Most existing bias-correcting methods focus on position bias, the fact that higher-ranked results are more likely to receive interaction, and address this bias by leveraging inverse propensity weighting. However, it is not always possible to accurately estimate propensity scores, and in addition to position bias, selection bias is often encountered in real-world recommender systems. Selection bias occurs because users are exposed to a truncated list of results, which gives a zero chance for some items to be observed and, therefore, interacted with, even if they are relevant. Here, we propose a new counterfactual method that uses a two-stage correction approach and jointly addresses selection and position bias in learning-to-rank systems without relying on propensity scores. Our experimental results show that our method is better than state-of-the-art propensity-independent methods and either better than or comparable to methods that make the strong assumption that the propensity model is known.
Abstract: Traditionally, recommender systems provide a list of suggestions to a user based on past interactions with items of this user. These recommendations are usually based on user preferences for items and generated with a delay. Critiquing recommender systems allow users to provide immediate feedback to recommendations with tags and receive a new set of recommendations in response. However, these systems often require rich item descriptions that contain relevance scores indicating the strength, with which a tag applies to an item. For example, this relevance score could indicate how violent the movie "The Godfather" is on a scale from 0 to 1. Retrieving these data is a very demanding process, as it requires users to explicitly indicate the degree to which a tag applies to an item. This process can be improved with machine learning methods that predict tag relevance. In this paper, we explore the dataset from a different study, where the authors collected relevance scores on movie-tag pairs. In particular, we define the tag relevance prediction problem, explore the inconsistency of relevance scores provided by users as a challenge of this problem and present a method, which outperforms the state-of-the-art method for predicting tag relevance. We found a moderate inconsistency of user relevance scores. We also found that users tend to disagree more on subjective tags, such as "good acting", "bad plot" or "quotable" than on objective tags, such as "animation", "cars" or "wedding", but the disagreement of users regarding objective tags is also moderate.
Abstract: Personalized news recommendation aims to alleviate information overload and help users find news of their interests. Accurately matching candidate news and users' interests is the key to news recommendation. Most existing methods separately encode each user and news item into vectors from news contents and then match the two vectors. However, a user's interests may differ across news articles, or across topics within a single article. It is therefore necessary to dynamically learn the user and news vectors and model their interaction. In this work, we present Recurrent Reasoning Memory Network over BERT (RMBERT) for news recommendation. Compared with other methods, our approach can leverage the content modeling ability of BERT. Moreover, the recurrent reasoning memory network, which performs a series of attention-based reasoning steps, can dynamically learn the user and news vectors and model their interaction at each step. As a result, our approach can better model users' interests. We conduct extensive experiments on a real-world news recommendation dataset and the results show that our approach significantly outperforms existing state-of-the-art methods.
Abstract: Recent studies show that neural models for natural language processing are usually fragile under adversarial attacks (e.g., character-level insertion and word-level synonym substitution), which exposes their lack of robustness. Most defense techniques are tailored to specific semantic-level attacks and cannot mitigate multi-level attacks simultaneously. Adversarial training has been shown to be effective in increasing model robustness. However, it often suffers from degradation on normal data, especially when the proportion of adversarial examples increases. To address this, we propose mixup regularized adversarial training (MRAT) against multi-level attacks. Our method can utilize multiple adversarial examples to increase intrinsic model robustness without sacrificing the performance on normal data. We evaluate our method on text classification and entailment tasks. Experimental results on different text encoders (BERT, LSTM and CNN) under multi-level attacks show that our method outperforms adversarial training consistently.
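A small sketch of the mixup regularization idea applied to clean and adversarial examples (feature-level interpolation with Beta-sampled coefficients). This is a generic illustration under stated assumptions, not the authors' exact training procedure; the adversarial example is assumed to keep its clean label, and all names are hypothetical.

.. code-block:: python

    import numpy as np

    def mixup(x_clean, y_clean, x_adv, y_adv, alpha=0.4, seed=0):
        """Interpolate clean and adversarial examples and their (one-hot) labels."""
        rng = np.random.default_rng(seed)
        lam = rng.beta(alpha, alpha, size=(x_clean.shape[0], 1))  # per-example mixing weight
        x_mix = lam * x_clean + (1 - lam) * x_adv
        y_mix = lam * y_clean + (1 - lam) * y_adv
        return x_mix, y_mix

    rng = np.random.default_rng(1)
    x, x_adv = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))
    y = np.eye(2)[rng.integers(0, 2, size=8)]         # one-hot labels
    x_mix, y_mix = mixup(x, y, x_adv, y)               # adversarial label equals clean label
    print(x_mix.shape, y_mix.shape)                    # (8, 128) (8, 2)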
Abstract: A fundamental challenge for sequential recommenders is to capture the sequential patterns of users toward modeling how users transit among items. In many practical scenarios, however, there are a great number of cold-start users with only minimal logged interactions. As a result, existing sequential recommendation models will lose their predictive power due to the difficulties in learning sequential patterns over users with only limited interactions. In this work, we aim to improve sequential recommendation for cold-start users with a novel framework named MetaTL, which learns to model the transition patterns of users through meta-learning. Specifically, the proposed MetaTL: (i) formulates sequential recommendation for cold-start users as a few-shot learning problem; (ii) extracts the dynamic transition patterns among users with a translation-based architecture; and (iii) adopts meta transitional learning to enable fast learning for cold-start users with only limited interactions, leading to accurate inference of sequential interactions.
Abstract: Social influence is essential to social recommendation. Current influence-based social recommendation focuses on the explicit influence along observed social links. However, in real cases, implicit social influence can also impact users' preferences in an unobserved way. In this work, we consider two kinds of implicit influence: the Local Implicit Influence of persons along unobserved interpersonal relations, and the Global Implicit Influence of items broadcast to users. We improve state-of-the-art GNN-based social recommendation methods by modeling the two kinds of implicit influence separately. Local implicit influence is incorporated by predicting unobserved social relationships. Global implicit influence is incorporated by defining the global popularity of each item and personalizing the impact of this popularity on each user. In a GCN network, explicit and implicit influences are integrated to learn the social embeddings of users and items in social recommendation. Experimental results on Yelp preliminarily demonstrate the effectiveness of the proposed model.
Abstract: Neural passage retrieval is a new and promising approach in open retrieval question answering. In this work, we stress-test the Dense Passage Retriever (DPR)---a state-of-the-art (SOTA) open domain neural retrieval model---on closed and specialized target domains such as COVID-19, and find that it lags behind standard BM25 in this important real-world setting. To make DPR more robust under domain shift, we explore its fine-tuning with synthetic training examples, which we generate from unlabeled target domain text using a text-to-text generator. In our experiments, this noisy but fully automated target domain supervision gives DPR a sizable advantage over BM25 in out-of-domain settings, making it a more viable model in practice. Finally, an ensemble of BM25 and our improved DPR model yields the best results, further pushing the SOTA for open retrieval QA on multiple out-of-domain test sets.
Abstract: Session-based recommendation aims to predict the next item that is most likely to be clicked by an anonymous user, based on his/her clicking sequence within one visit. It becomes an essential function of many recommender systems since it protects privacy. However, as the accumulated session records keep increasing, it becomes challenging to model the user interests since they would drift when the time span is large. Efforts have been devoted to handling dynamic user interests by modeling all historical sessions at one time or conducting offline retraining regularly. These solutions are far from practical requirements in terms of efficiency and capturing timely user interests. To this end, we propose a memory-efficient framework - TASRec. It constructs a graph for each day to model the relations among items. Thus, the same item on different days could have different neighbors, corresponding to the drifting user interests. We design a tailored graph neural network to embed this dynamic graph of items and learn temporal augmented item representations. Based on this, we leverage a sequential neural architecture to predict the next item of a given sequence. Experiments on real-world datasets demonstrate that TASRec outperforms state-of-the-art session-based recommendation methods.
Abstract: Recently, much progress in natural language processing has been driven by deep contextualized representations pretrained on large corpora. Typically, the fine-tuning of these pretrained models for a specific downstream task is based on single-view learning, which is however inadequate, as a sentence can be interpreted differently from different perspectives. Therefore, in this work, we propose a text-to-text multi-view learning framework by incorporating an additional view---the text generation view---into a typical single-view passage ranking model. Empirically, the proposed approach improves ranking performance compared to its single-view counterpart. Component analysis is also reported in the paper.
Abstract: Educational recommender systems channel most of the research efforts on the effectiveness of the recommended items. While teachers have a central role in online platforms, the impact of recommender systems for teachers in terms of the exposure such systems give to the courses is an under-explored area. In this paper, we consider data coming from a real-world platform and analyze the distribution of the recommendations w.r.t. the geographical provenience of the teachers. We observe that data is highly imbalanced towards the United States, in terms of offered courses and of interactions. These imbalances are exacerbated by recommender systems, which overexpose the country w.r.t. its representation in the data, thus generating unfairness for teachers outside that country. To introduce equity, we propose an approach that regulates the share of recommendations given to the items produced in a country (visibility) and the position of the items in the recommended list (exposure).
Abstract: Cold-start problems are enormous challenges in practical recommender systems. One promising solution for this problem is cross-domain recommendation (CDR) which leverages rich information from an auxiliary (source) domain to improve the performance of recommender system in the target domain. In these CDR approaches, the family of Embedding and Mapping methods for CDR (EMCDR) is very effective, which explicitly learn a mapping function from source embeddings to target embeddings with overlapping users. However, these approaches suffer from one serious problem: the mapping function is only learned on limited overlapping users, and the function would be biased to the limited overlapping users, which leads to unsatisfying generalization ability and degrades the performance on cold-start users in the target domain. With the advantage of meta learning which has good generalization ability to novel tasks, we propose a transfer-meta framework for CDR (TMCDR) which has a transfer stage and a meta stage. In the transfer (pre-training) stage, a source model and a target model are trained on source and target domains, respectively. In the meta stage, a task-oriented meta network is learned to implicitly transform the user embedding in the source domain to the target feature space. In addition, the TMCDR is a general framework that can be applied upon various base models, e.g., MF, BPR, CML. By utilizing data from Amazon and Douban, we conduct extensive experiments on 6 cross-domain tasks to demonstrate the superior performance and compatibility of TMCDR.
Abstract: Click-through rate (CTR) prediction based on deep neural networks has made significant progress in recommendation systems. However, these methods often suffer from CTR underestimation due to insufficient impressions for long-tail items. When formalizing CTR prediction as a contextual bandit problem, exploration methods provide a natural solution to this issue. In this paper, we first benchmark state-of-the-art exploration methods in the recommendation system setting. We find that the combination of gradient-based uncertainty modeling and Thompson Sampling achieves a significant advantage. On the basis of this benchmark, we further propose a general enhancement strategy, Underestimation Refinement (UR), which explicitly incorporates the prior knowledge that insufficient impressions likely lead to CTR underestimation. This strategy is applicable to almost all existing exploration methods. Experimental results validate UR's effectiveness, achieving consistent improvement across all baseline exploration methods.
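As a generic illustration of Thompson Sampling-based exploration for CTR (the paper pairs it with gradient-based uncertainty modeling; here a simple Gaussian posterior over each item's CTR estimate plays that role, and all numbers are hypothetical), an item is chosen by sampling a score from its posterior rather than using the point estimate, which naturally gives under-explored long-tail items a share of impressions.

.. code-block:: python

    import numpy as np

    def thompson_select(ctr_mean, ctr_std, rng):
        """Sample a plausible CTR per item and show the highest sample."""
        sampled = rng.normal(ctr_mean, ctr_std)
        return int(np.argmax(sampled))

    rng = np.random.default_rng(0)
    # Item 2 has a slightly lower point estimate but high uncertainty
    # (few impressions), so Thompson Sampling still shows it sometimes.
    ctr_mean = np.array([0.050, 0.048, 0.045])
    ctr_std  = np.array([0.002, 0.002, 0.030])
    picks = [thompson_select(ctr_mean, ctr_std, rng) for _ in range(1000)]
    print([picks.count(i) for i in range(3)])   # item 2 gets a meaningful share of traffic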
Abstract: We study keyphrase extraction (KPE) from Web documents. Our key contribution is encoding Web documents to leverage structure, such as titles or anchors, by building a graph of words representing both (a) position-based proximity and (b) structural relations. We evaluate KPE performance on the real-world search engine NAVER and on human-annotated KPE benchmarks, and our method outperforms the state of the art in both settings.
Abstract: Because it significantly reduces cognitive effort, multimedia content has become an increasingly important information type nowadays. More and more descriptions are coupled with images to make them more attractive and persuasive. Currently, several text-image retrieval methods have been developed to improve the efficiency of this time-consuming and professional process. However, in practical retrieval applications, it is vivid and terse descriptions that are widely used, instead of shallow captions that merely describe what is contained. Therefore, most existing methods designed for caption-style text cannot achieve this purpose. To eliminate the mismatch, we introduce a novel problem of description-image retrieval and propose a specially designed method, named Adapted Graph Reasoning and Filtration (AGRF). In AGRF, we first leverage an adapted graph reasoning network to discover the combinations of visual objects in the image. Then, a cross-modal gate mechanism is proposed to cast aside those description-independent combinations. Experimental results on a real-world dataset demonstrate the advantages of AGRF over the state-of-the-art methods.
Abstract: Detecting sarcastic expressions could promote the understanding of natural language in social media. In this paper, we revisit sarcasm detection from a novel perspective, so as to account for long-range literal sentiment inconsistencies. More concretely, we explore a novel scenario of constructing an affective graph and a dependency graph for each sentence, based on the affective information retrieved from external affective commonsense knowledge and the syntactical information of the sentence. Based on these graphs, an Affective Dependency Graph Convolutional Network (ADGCN) framework is proposed to draw long-range incongruity patterns and inconsistent expressions over the context for sarcasm detection by interactively modeling the affective and dependency information. Experimental results on multiple benchmark datasets show that our proposed approach outperforms the current state-of-the-art methods in sarcasm detection.
Abstract: Digital-forms are commonly used for collecting structured information from users. However, filling digital-forms that include a large number of fields is tedious and error-prone. Auto-filling form fields for the user is highly beneficial for improving user experience and potentially collecting more valuable information (in cases where not all fields are mandatory). Online E-commerce marketplaces quite often utilize such forms to collect listing attributes from sellers. In this work, we describe Form-BERT -- a Transformer-based model which is optimized for auto-filling listing attributes given the following inputs: free-text, list of known attribute names, and zero or more attribute values. Form-BERT can be further used iteratively to leverage filled out attributes as the form filling progresses.
Abstract: Criminal Court View Generation is an essential task in legal intelligence, which aims to automatically generate sentences interpreting judgment results. The court view could be seen as the summary of crime circumstances in a case, including ADjudging Circumstance (ADC) and SEntencing Circumstance (SEC). However, different circumstances vary widely, and adopting them to generate court views directly may limit the generation performance. Therefore, it is necessary to identify the ADC and SEC related sentences in case facts and enhance them into the court view generation, respectively. To this end, in this paper, we propose a novel Circumstances enhanced Criminal Court View Generation (C3VG) method, consisting of the extraction and generation stage. Specifically, in the extraction stage, we design a Circumstances Selector to select ADC and SEC related sentences. After that, we apply them to two generators to generate the circumstances enhanced court views, respectively. After merging the two types of court views, we could obtain the final court views. We evaluate C3VG by conducting extensive experiments on a real-world dataset and experimental results clearly validate the effectiveness of our proposed model.
Abstract: Natural language query grounding in videos is a challenging task that requires comprehensive understanding of the query, the video, and the fusion of information across these modalities. Existing methods mostly emphasize query-to-video one-way interaction with a late fusion scheme, lacking effective ways to capture the relationship within and between query and video in a fine-grained manner. Moreover, current methods are often overly complicated, resulting in long training times. We propose a self-attention mechanism together with a cross-interaction multi-head attention mechanism in an early fusion scheme to capture video-query intra-dependencies as well as inter-relations from both directions (query-to-video and video-to-query). The cross-attention method can associate query words and video frames at any position and account for long-range dependencies in the video context. In addition, we propose a multi-task training objective that includes start/end prediction and moment segmentation. The moment segmentation task provides additional training signals that remedy the start/end prediction noise caused by annotator disagreement. Our simple yet effective architecture enables speedy training (within 1 hour on an AWS P3.2xlarge GPU instance) and instant inference. We show that the proposed method achieves superior performance compared to complex state-of-the-art methods, in particular surpassing the SOTA on high IoU metrics (R@1, IoU=0.7) by 3.52% absolute (11.09% relative) on the Charades-STA dataset.
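The bidirectional query-video cross-attention described above can be sketched with standard attention modules. The following is a minimal, hypothetical illustration using PyTorch's built-in multi-head attention; the dimensions, module names, and overall wiring are assumptions, not the authors' implementation.

.. code-block:: python

    # Hypothetical sketch of bidirectional query<->video cross-attention (early fusion).
    import torch
    import torch.nn as nn

    class BiCrossAttention(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # query attends to video
            self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)  # video attends to query

        def forward(self, query_feats, video_feats):
            # query_feats: (B, Lq, D) word features; video_feats: (B, Lv, D) frame features
            q_ctx, _ = self.q2v(query_feats, video_feats, video_feats)
            v_ctx, _ = self.v2q(video_feats, query_feats, query_feats)
            return q_ctx, v_ctx  # direction-specific, video-aware and query-aware representations

    q = torch.randn(2, 12, 256)   # 12 query tokens
    v = torch.randn(2, 64, 256)   # 64 video frames
    q_ctx, v_ctx = BiCrossAttention()(q, v)
    print(q_ctx.shape, v_ctx.shape)  # torch.Size([2, 12, 256]) torch.Size([2, 64, 256])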
Abstract: Fine-grained image-text retrieval is a challenging but vital technology in the field of multimedia analysis. Existing methods mainly focus on learning a common embedding space of images (or patches) and sentences (or words), whereby their mapped features in such an embedding space can be directly compared. Nevertheless, most existing image-text retrieval works rarely consider the shared semantic concepts that potentially correlate the heterogeneous modalities, which can enhance the discriminative power of learning such an embedding space. Toward this end, we propose a Cross-Graph Attention model (CGAM) to explicitly learn the shared semantic concepts, which can be well utilized to guide the feature learning process of each modality and promote common embedding learning. More specifically, we build a semantic-embedded graph for each modality, and smooth the discrepancy between the two modalities via a cross-graph attention model to obtain shared semantic-enhanced features. Meanwhile, we reconstruct image and text features via the shared semantic concepts and original embedding representations, and leverage a multi-head mechanism for similarity calculation. Accordingly, a semantic-enhanced cross-modal embedding between image and text is discriminatively obtained to benefit fine-grained retrieval with high retrieval performance. Extensive experiments on benchmark datasets show performance improvements in comparison with state-of-the-art methods.
Abstract: Spelling Error Correction (SEC), which detects and corrects spelling errors in a text, has a wide range of applications in human language understanding. Earlier solutions, including statistics-based methods and one-stage and two-stage machine learning-based methods, cannot build deeply bidirectional models, which significantly confines their learning ability. With the recently emerging masked language models, transformer-based networks have achieved remarkable success in SEC. However, current transformer-based Chinese SEC algorithms are all end-to-end methods, which suffer from high false alarm rates because they correct each character of the sentence regardless of its correctness. This issue becomes even more severe when only a small fraction of characters in the whole sentence are incorrect. To solve this problem, we propose a cloze-style detector-corrector framework (DCSpell) that first detects whether a character is erroneous before correcting it. Specifically, DCSpell employs the discriminator of ELECTRA as the Detector to detect the positions of incorrect characters. The Detector is trained with the sample-efficient replaced token detection pre-training task, and thus allows domain adaptation with a small amount of data. After that, a transformer-based Corrector is used to find the correct character for each detected position. It employs sentence pairs as input, which potentially incorporates knowledge of phonological and visual similarity. A confusion-set-based post-processing step is used to further improve performance. Experiments show that DCSpell achieves a 15.7% improvement on the SIGHAN dataset and a 6.6% improvement on a dataset transcribed from a real-world acoustic speech corpus compared to the state-of-the-art methods in terms of F1 score.
Abstract: Effectively predicting the size of information cascades is crucial for understanding the evolution of many social applications, such as influence maximization and fake news detection. Conventional methods face the challenge of data imbalance which, in turn, yields unsatisfactory prediction performance. To prevent the loss functions or metrics from being affected by extreme values and to ensure numerical stability, previous works reformulate the problem definition or adopt other types of evaluation metrics. However, solving the regression prediction of information cascades from a long-tailed distribution perspective remains underexplored. In this paper, we propose a general decoupled prediction solution -- first extracting the representation, then fine-tuning the regressor -- which combines the original prediction value and a weighted bias generated by a sub-network (SUB) that we design. Our experiments on long-tailed benchmarks demonstrate that our method significantly improves prediction accuracy over state-of-the-art methods and mitigates the long-tailed cascade prediction problem.
Abstract: Recently, the rapidly growing popularity of short videos on different Internet platforms has intensified the need for background music (BGM) retrieval systems. However, existing video-music retrieval methods based only on the visual modality cannot deliver promising performance on videos with fine-grained virtual content. In this paper, we additionally investigate the widely added voice-overs in short videos and propose a novel framework to retrieve BGM for fine-grained short videos. In our framework, we use self-attention (SA) and cross-modal attention (CMA) modules to explore the intra- and inter-relationships of different modalities, respectively. To balance the modalities, we dynamically assign different weights to the modal features via a fusion gate. To pair the query and the BGM embeddings, we introduce a triplet pseudo-label loss to constrain the semantics of the modal embeddings. As there are no existing virtual-content video-BGM retrieval datasets, we build and release two virtual-content video datasets, HoK400 and CFM400. Experimental results show that our method achieves superior performance and outperforms other state-of-the-art methods by large margins.
Abstract: Click-through rate (CTR) prediction plays an important role in online advertising and recommender systems. In practice, the training of CTR models depends on click data, which is intrinsically biased towards higher positions since higher positions naturally have higher CTR. Existing methods, such as actual-position training with fixed-position inference and inverse propensity weighted training with no-position inference, alleviate the bias problem to some extent. However, the different treatment of position information between training and inference inevitably leads to inconsistency and sub-optimal online performance. Meanwhile, the basic assumption of these methods, i.e., that the click probability is the product of examination probability and relevance probability, is oversimplified and insufficient to model the rich interaction between position and other information. In this paper, we propose a Deep Position-wise Interaction Network (DPIN) to efficiently combine all candidate items and positions for estimating CTR at each position, achieving consistency between offline and online as well as modeling the deep non-linear interaction among position, user, context, and item within the limits of serving performance. Following our new treatment of position bias in CTR prediction, we propose a new evaluation metric named PAUC (position-wise AUC) that is suitable for measuring ranking quality at a given position. Through extensive experiments on a real-world dataset, we show empirically that our method is both effective and efficient in solving the position bias problem. We have also deployed our method in production and observed statistically significant improvements over a highly optimized baseline in a rigorous A/B test.
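One way to read the proposed position-wise AUC is as an AUC computed separately over the impressions served at each position and then averaged. The abstract does not spell out the exact aggregation, so the following is only a hedged sketch under the assumption of impression-count weighting.

.. code-block:: python

    # Hedged sketch of a position-wise AUC (PAUC): per-position AUCs weighted by impression count.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def pauc(labels, scores, positions):
        labels, scores, positions = map(np.asarray, (labels, scores, positions))
        total, weight = 0.0, 0
        for p in np.unique(positions):
            mask = positions == p
            if labels[mask].min() == labels[mask].max():
                continue  # AUC is undefined when only one class appears at this position
            total += roc_auc_score(labels[mask], scores[mask]) * mask.sum()
            weight += mask.sum()
        return total / weight if weight else float("nan")

    print(pauc(labels=[1, 0, 0, 1, 0, 1],
               scores=[0.9, 0.2, 0.4, 0.8, 0.1, 0.3],
               positions=[1, 1, 1, 2, 2, 2]))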
Abstract: Click-through rate (CTR) prediction is a crucial task in many applications (e.g., recommender systems). Recently, deep learning based models have been proposed and successfully applied to CTR prediction by focusing on feature interaction or user interest based on the item-to-item relevance between user behaviors and the candidate item. However, these existing models neglect the user-to-user relevance between the target user and those who like the candidate item, which can reflect the preference of the target user. To this end, in this paper, we propose a novel Deep User Match Network (DUMN) which measures user-to-user relevance for CTR prediction. Specifically, in DUMN, we design a User Representation Layer to learn a unified user representation which contains the user's latent interest based on user behaviors. Then, a User Match Layer is designed to measure user-to-user relevance by matching the target user with those who have interacted with the candidate item and modeling their similarities in the user representation space. Extensive experimental results on three public real-world datasets validate the effectiveness of DUMN compared with state-of-the-art methods.
Abstract: Given a long text, a summarization system aims to produce a shorter highlight while keeping the important information of the original text. For customer service, the summaries of most dialogues between an agent and a user focus on several fixed key points, such as the user's question, the user's purpose, and the agent's solution. Traditional extractive methods struggle to extract all predefined key points exactly. Furthermore, there is a lack of large-scale, high-quality extractive summarization datasets containing key points. To address these challenges, we propose a Distant Supervision based Machine Reading Comprehension model for extractive Summarization (DSMRC-S). DSMRC-S transforms the summarization task into a machine reading comprehension problem, fetching key points from the original text exactly according to predefined questions. In addition, a distant supervision method is proposed to alleviate the lack of eligible extractive summarization datasets. We conduct experiments on a large-scale summarization dataset collected in customer service scenarios, and the results show that the proposed DSMRC-S outperforms strong baseline methods by 4 points on ROUGE-L.
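To make the summarization-as-reading-comprehension idea concrete, the sketch below phrases each predefined key point as a question and lets an off-the-shelf extractive QA model pull the answer span from the dialogue. The questions and the default Hugging Face pipeline model are illustrative stand-ins, not the DSMRC-S setup itself.

.. code-block:: python

    # Illustrative only: predefined key points as questions answered by an extractive QA model.
    from transformers import pipeline

    qa = pipeline("question-answering")  # downloads a default extractive QA model
    dialogue = ("User: My order #123 arrived damaged and I would like a refund. "
                "Agent: Sorry about that, I have issued a full refund to your card.")
    key_point_questions = {
        "user_question": "What problem does the user report?",
        "agent_solution": "What solution does the agent provide?",
    }
    summary = {k: qa(question=q, context=dialogue)["answer"]
               for k, q in key_point_questions.items()}
    print(summary)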
Abstract: Social media have brought threats like cyberbullying, which can lead to stress, anxiety, depression, and, in some severe cases, suicide attempts. Detecting cyberbullying can help to warn/block bullies and provide support to victims. However, very few studies have used self-attention-based language models like BERT for cyberbullying detection, and they typically only report BERT's performance without examining in depth the reasons for it. In this work, we examine the use of BERT for cyberbullying detection on various datasets and attempt to explain its performance by analyzing its attention weights and gradient-based feature importance scores for textual and linguistic features. Our results show that attention weights do not correlate with feature importance scores and thus do not explain the model's performance. Additionally, they suggest that BERT relies on syntactical biases in the datasets to assign feature importance scores to class-related words rather than cyberbullying-related linguistic features.
Abstract: In modern clinical medicine, the electrocardiogram (ECG) is a common diagnostic technique for cardiovascular diseases. The purpose of this paper is to propose a novel model-based clustering approach for analyzing ECG data. Our approach is composed of two modules: representation learning and ECG data clustering. In the representation learning module, a deep generative model referred to as the hyperspherical variational recurrent autoencoder (HVRAE) is developed to extract representations of observed ECG data, based on the variational autoencoder (VAE) with long short-term memory (LSTM) networks. In the ECG data clustering module, we develop a nonparametric hidden Markov model (NHMM) based on the Dirichlet process, in which the number of hidden states is inferred automatically during the learning process. Moreover, the emission density of each hidden state of our NHMM follows a mixture of von Mises-Fisher (VMF) distributions, which have a better capability for modeling ECG representations than other commonly used distributions (such as the Gaussian distribution). To learn the proposed VMF-based NHMM, we theoretically develop an effective learning algorithm based on variational Bayes. The merits of our model-based clustering approach for analyzing ECG data are verified through experiments on publicly available ECG datasets.
Abstract: We revisit the Bipartite Graph Partitioning approach to document reordering (Dhulipala et al., KDD 2016), and consider a range of algorithmic and heuristic refinements that lead to faster computation of index-minimizing document orderings. Our final implementation executes approximately four times faster than the reference implementation we commence with, and obtains the same, or slightly better, compression effectiveness on three large text collections.
Abstract: The delayed feedback problem is one of the imperative challenges in online advertising, caused by the highly diversified feedback delay of a conversion, which varies from a few minutes to several days. It is hard to design an appropriate online learning system under such non-identical delays across different types of ads and users. In this paper, we propose to tackle the delayed feedback problem in online advertising by "Following the Prophet" (FTP for short). The key insight is that, if the feedback came instantly for all the logged samples, we could obtain a model without delayed feedback, namely the "prophet". Although the prophet cannot be obtained during online learning, we show that we can predict the prophet's predictions with an aggregation policy on top of a set of multi-task predictions, where each task captures the feedback patterns of a different period. We propose the objective and optimization approach for the policy, and use the logged data to imitate the prophet. Extensive experiments on three real-world advertising datasets show that our method outperforms the previous state-of-the-art baselines.
Abstract: In this paper, we propose the GAIPS framework for efficient maximum inner product search (MIPS) on GPU. We observe that a query can usually find a good lower bound on its maximum inner product among some large-norm items that take up only a small portion of the dataset, and we utilize this fact to facilitate pruning. In addition, we design norm-based, residue-based, and hash-based pruning techniques to avoid computation for items that are unlikely to be the MIPS results. Experimental results show that, compared with FAISS, the state-of-the-art GPU-based similarity search framework, GAIPS has significantly shorter query processing time at the same recall.
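The norm-based pruning mentioned above relies on the bound q·x <= ||q||·||x||: once a good lower bound on the maximum inner product is known, items whose norm bound falls below it can be skipped. The CPU sketch below illustrates this single idea on assumed random data; GAIPS itself runs on GPU and combines it with residue- and hash-based pruning.

.. code-block:: python

    # Illustrative norm-based pruning for exact MIPS (visit large-norm items first).
    import numpy as np

    def mips_norm_pruning(query, items):
        norms = np.linalg.norm(items, axis=1)
        order = np.argsort(-norms)            # large-norm items first: good early lower bound
        qnorm = np.linalg.norm(query)
        best_val, best_idx, scanned = -np.inf, -1, 0
        for i in order:
            if qnorm * norms[i] <= best_val:  # this upper bound also holds for all remaining items
                break
            scanned += 1
            val = float(query @ items[i])
            if val > best_val:
                best_val, best_idx = val, i
        return best_idx, best_val, scanned

    rng = np.random.default_rng(0)
    items = rng.normal(size=(10000, 64))
    idx, val, scanned = mips_norm_pruning(rng.normal(size=64), items)
    print(idx, round(val, 3), f"scanned {scanned}/10000 items")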
Abstract: Identifying user intents from natural language utterances is a crucial step in conversational systems that has been extensively studied as a supervised classification problem. However, in practice, new intents emerge after deploying an intent detection model. Thus, these models should seamlessly adapt and classify utterances with both seen and unseen intents -- unseen intents emerge after deployment and they do not have training data. The few existing models that target this setting rely heavily on the training data of seen intents and consequently overfit to these intents, resulting in a bias to misclassify utterances with unseen intents into seen ones. We propose RIDE: an intent detection model that leverages commonsense knowledge in an unsupervised fashion to overcome the issue of training data scarcity. RIDE computes robust and generalizable relationship meta-features that capture deep semantic relationships between utterances and intent labels; these features are computed by considering how the concepts in an utterance are linked to those in an intent label via commonsense knowledge. Our extensive experimental analysis on three widely-used intent detection benchmarks shows that relationship meta-features significantly improve the detection of both seen and unseen intents and that RIDE outperforms the state-of-the-art models.
Abstract: In this work, we establish a context graph from both conversation utterances and external knowledge, and develop a novel graph-based encoder to better understand the conversation context. Specifically, the encoder fuses the information in the context graph stage-by-stage and provides global context-graph-aware representations of each node in the graph to facilitate knowledge-grounded response generation. On a large-scale conversation corpus, we validate the effectiveness of the proposed approach and demonstrate the benefit of knowledge in conversation understanding.
Abstract: Conversational agents are drawing a lot of attention in the information retrieval (IR) community, thanks in part to the advancements in language understanding enabled by large contextualized language models. IR researchers long ago recognized the importance of a sound evaluation of new approaches. Yet, the development of evaluation techniques for conversational search is still an overlooked problem. Currently, most evaluation approaches rely on procedures directly drawn from ad-hoc search evaluation, treating utterances in a conversation as independent events, as if they were just separate topics, instead of accounting for the conversation context. We overcome this issue by proposing a framework for defining evaluation measures that are aware of the conversation context and the utterances' semantic dependencies. In particular, we model conversations as Directed Acyclic Graphs (DAGs), where self-explanatory utterances are root nodes, while anaphoric utterances are linked to the sentences that contain their missing semantic information. Then, we propose a family of hierarchical, dependence-aware aggregations of the evaluation metrics driven by the conversational graph. In our experiments, we show that utterances from the same conversation are 20% more correlated than utterances from different conversations. Thanks to the proposed framework, we are able to include such correlation in our aggregations and be more accurate when determining which pairs of conversational systems are deemed significantly different.
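As a rough illustration of the dependence-aware aggregation idea, the sketch below links each anaphoric utterance to the utterance carrying its missing context and averages every utterance's metric value together with the values of the utterances it depends on. The mean-over-dependencies choice is just one simple possibility, not necessarily the aggregation proposed in the paper.

.. code-block:: python

    # Hedged sketch: conversation-level score aware of utterance dependencies (DAG).
    import networkx as nx

    def dag_aware_score(per_utterance_scores, dependencies):
        g = nx.DiGraph(dependencies)               # edge (u, v): utterance u depends on v
        g.add_nodes_from(per_utterance_scores)
        utt_scores = []
        for u in per_utterance_scores:
            context = nx.descendants(g, u) | {u}   # the utterance plus everything it depends on
            utt_scores.append(sum(per_utterance_scores[c] for c in context) / len(context))
        return sum(utt_scores) / len(utt_scores)

    scores = {1: 0.8, 2: 0.5, 3: 0.9}   # e.g. per-utterance nDCG of a system's responses
    deps = [(2, 1), (3, 2)]             # utterance 2 is anaphoric to 1, and 3 to 2
    print(dag_aware_score(scores, deps))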
Abstract: Being able to generate informative and coherent dialogue responses is crucial when designing human-like open-domain dialogue systems. Encoder-decoder-based dialogue models tend to produce generic and dull responses during the decoding step because the most predictable response is likely to be a non-informative response instead of the most suitable one. To alleviate this problem, we propose to train the generation model in a bidirectional manner by adding a backward reasoning step to the vanilla encoder-decoder training. The proposed backward reasoning step pushes the model to produce more informative and coherent content because the forward generation step's output is used to infer the dialogue context in the backward direction. The advantage of our method is that the forward generation and backward reasoning steps are trained simultaneously through the use of a latent variable to facilitate bidirectional optimization. Our method can improve response quality without introducing side information (e.g., a pre-trained topic model). The proposed bidirectional response generation method achieves state-of-the-art performance for response quality.
Abstract: There has been significant progress in utilizing heterogeneous knowledge graphs (KGs) as auxiliary information in recommendation systems. However, existing KG-aware recommendation models rely solely on Euclidean space, neglecting hyperbolic space, which has already been shown to possess a superior ability to separate embeddings by providing more "room". We propose a knowledge-based hyperbolic propagation framework (KBHP) which includes hyperbolic components for calculating the importance of KG attributes to achieve better knowledge propagation. In addition to the original relations in the knowledge graph, we propose a user purchase relation to better represent logical patterns in hyperbolic space, which bridges users and items for modeling user preference. Experiments on four real-world benchmarks show that KBHP is significantly more accurate than state-of-the-art models. We further visualize the generated embeddings to demonstrate that the proposed model successfully clusters attributes that are relevant to items and highlights those that contain useful information for the recommendation.
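The extra "room" attributed to hyperbolic space comes from distances growing rapidly towards the boundary of the Poincaré ball, which suits hierarchical KG structure. The snippet below only computes the standard Poincaré-ball distance to convey that intuition; KBHP's actual hyperbolic operations are more involved and are not reproduced here.

.. code-block:: python

    # Poincare-ball distance (curvature -1); points must lie strictly inside the unit ball.
    import numpy as np

    def poincare_distance(u, v):
        u, v = np.asarray(u, float), np.asarray(v, float)
        num = 2 * np.sum((u - v) ** 2)
        den = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
        return np.arccosh(1 + num / den)

    print(poincare_distance([0.10, 0.00], [0.20, 0.10]))  # near the origin: close to Euclidean
    print(poincare_distance([0.95, 0.00], [0.00, 0.95]))  # near the boundary: much larger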
Abstract: Transfer learning leverages knowledge from a source domain with rich data to a target domain with sparse data. However, the difference between the source and target data distribution weakens the transferability. To bridge this gap, we focus on selecting source instances that are closely related to and have the same distribution as the target domain. In this paper, we propose a novel Adaptive Clustering Transfer Learning (ACTL) method to improve transferability. Specifically, we simultaneously train the instance selector and the transfer learning model. The selector adaptively conducts clustering on the training data and learns the weights for source instances. The weight will activate or inhibit the contribution of the corresponding source instance during transfer learning. Meanwhile, the transfer learning model guides the selector to learn the weight appropriately according to the objective function. To evaluate the effectiveness of our method, we conduct experiments on two different tasks including recommender system and text matching. Experimental results show that our method consistently outperforms competing methods and the selected source instances share a similar data distribution with the target domain.
Abstract: Most existing Visual Question Answering (VQA) systems tend to overly rely on language bias and hence fail to reason from the visual clues. To address this issue, we propose a novel Language-Prior Feedback (LPF) objective function that re-balances the proportion of each answer's loss value in the total VQA loss. The LPF first calculates a modulating factor that determines the language bias using a question-only branch. Then, the LPF assigns a self-adaptive weight to each training sample during training. With this reweighting mechanism, the LPF ensures that the total VQA loss is reshaped into a more balanced form. By this means, the samples that require visual information for prediction are used more effectively during training. Our method is simple to implement, model-agnostic, and end-to-end trainable. We conduct extensive experiments and the results show that the LPF (1) brings a significant improvement over various VQA models, and (2) achieves competitive performance on the bias-sensitive VQA-CP v2 benchmark.
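The reweighting idea can be sketched as follows: a question-only branch estimates how predictable the ground-truth answer is from language alone, and that estimate down-weights the sample's contribution to the VQA loss. The focal-style modulating factor (1 - p_lang)^gamma used below is an illustrative assumption, not necessarily the exact form of the LPF.

.. code-block:: python

    # Hedged sketch of language-prior-based loss reweighting for VQA.
    import torch
    import torch.nn.functional as F

    def lpf_style_loss(vqa_logits, question_only_logits, targets, gamma=2.0):
        # vqa_logits, question_only_logits: (B, num_answers); targets: (B,) answer indices
        with torch.no_grad():
            p_lang = F.softmax(question_only_logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
            weight = (1.0 - p_lang) ** gamma   # strongly language-predictable samples get small weight
        per_sample = F.cross_entropy(vqa_logits, targets, reduction="none")
        return (weight * per_sample).mean()

    vqa = torch.randn(4, 1000)
    q_only = torch.randn(4, 1000)
    answers = torch.randint(0, 1000, (4,))
    print(lpf_style_loss(vqa, q_only, answers))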
Abstract: Different from traditional task-oriented and open-domain dialogue systems, insurance agents aim to engage customers in order to help them satisfy specific demands and to provide emotional companionship. As a result, customer-to-agent dialogues are usually very long, and many of their turns are pure chit-chat without any useful marketing clues. This brings challenges to the dialogue state tracking task in insurance marketing. To deal with these long and sparse dialogues, we propose a new dialogue state tracking architecture containing three components: a dialogue encoder, a Smart History Collector (SHC), and a dialogue state classifier. SHC, a deliberately designed memory network, effectively selects relevant dialogue history via slot-attention and then updates the dialogue history memory. With SHC, our model is able to keep track of the vital information and filter out pure chit-chat. Experimental results demonstrate that our proposed LS-DST significantly outperforms the state-of-the-art baselines on a real insurance dialogue dataset.
Abstract: Medical triage chatbots are widely used in pre-diagnosis by asking symptom and medical history-related questions. Information collected from patients through an online chatbot system is often incomplete and imprecise, and thus it is hard to achieve precise triaging. In this paper, we propose the Multi-relational Hyperbolic Diagnosis Predictor (MHDP) --- a novel multi-relational hyperbolic graph neural network-based approach --- to build a disease predictive model. More specifically, in MHDP, we generate a heterogeneous graph consisting of symptom, patient, and diagnosis nodes, and then derive node representations by aggregating neighborhood information recursively in the hyperbolic space. Experiments conducted on two real-world datasets demonstrate that the proposed MHDP approach surpasses state-of-the-art baselines.
Abstract: User preference prediction is the task of learning user interests through user-item interactions. Most existing studies capture user interests based on historical behaviors without considering specific scenario information. However, users may have special interests in these specific scenarios, and sometimes user historical behaviors are limited. In this paper, we propose a Meta-Learned Specific Scenario Interest Network (Meta-SSIN) to predict user preference for a target item by capturing scenario-specific interests. Meta-SSIN uses multiple independent meta-learning modules to model historical behaviors in each scenario. The independent modules can capture special interests based on limited behaviors. Experimental results on three datasets show that Meta-SSIN outperforms state-of-the-art methods.
Abstract: Federated learning (FL) is becoming an increasingly popular machine learning paradigm in application scenarios where sensitive data available at various local sites cannot be shared due to privacy protection regulations. In FL, the sensitive data never leaves the local sites and only model parameters are shared with a global aggregator. Nonetheless, it has recently been shown that, under some circumstances, the private data can be reconstructed from the model parameters, which implies that data leakage can occur in FL. In this paper, we draw attention to another risk associated with FL: Even if federated algorithms are individually privacy-preserving, combining them into pipelines is not necessarily privacy-preserving. We provide a concrete example from genome-wide association studies, where the combination of federated principal component analysis and federated linear regression allows the aggregator to retrieve sensitive patient data by solving an instance of the multidimensional subset sum problem. This supports the increasing awareness in the field that, for FL to be truly privacy-preserving, measures have to be undertaken to protect against data leakage at the aggregator.
Abstract: While previous work comparing statistical significance tests for IR system evaluation has focused on paired data tests (e.g., for evaluating two systems using a common test collection), two-sample tests must be used when the reproducibility of IR experiments across different test collections must be examined. Using real runs and a test collection from the NTCIR-15 WWW-3 Task, the present study compares the properties of three two-sample significance tests for comparing two systems: Student's t-test (i.e., the classical parametric test), the Wilcoxon rank sum test (i.e., the classical nonparametric test), and the randomisation test (i.e., a population-free method that utilises modern computational power). In terms of the false positive rate (i.e., the chance of detecting a statistical significance even though the two samples of evaluation measure scores come from the same system), the three tests behave similarly, although the Wilcoxon rank sum test appears to be slightly more robust than the other two for very small topic set sizes (e.g., 10 topics each) with a large significance level (e.g., α=0.10). On the other hand, the t-test and the Wilcoxon rank sum test are very similar to each other from the following two viewpoints: "How often do they both detect a nonexistent difference?" and "How often do they both overlook a true difference?" Compared to the two classical significance tests, the randomisation test behaves markedly differently in terms of the above two viewpoints. Hence, we suggest that researchers should at least be aware of the above properties of the three two-sample tests when choosing from them.
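For reference, a two-sample randomisation test of the kind compared above can be implemented in a few lines: the observed difference in mean scores between the two systems' (disjoint) topic sets is compared against differences obtained by repeatedly re-assigning the pooled scores at random. The scores and trial count below are made up for illustration.

.. code-block:: python

    # Two-sample randomisation (permutation) test on per-topic evaluation scores.
    import numpy as np

    def randomisation_test(scores_a, scores_b, trials=10000, seed=0):
        rng = np.random.default_rng(seed)
        scores_a, scores_b = np.asarray(scores_a, float), np.asarray(scores_b, float)
        observed = abs(scores_a.mean() - scores_b.mean())
        pooled = np.concatenate([scores_a, scores_b])
        n_a = len(scores_a)
        count = 0
        for _ in range(trials):
            rng.shuffle(pooled)
            count += abs(pooled[:n_a].mean() - pooled[n_a:].mean()) >= observed
        return (count + 1) / (trials + 1)   # two-sided p-value with add-one smoothing

    sys1 = [0.31, 0.42, 0.28, 0.55, 0.47, 0.39, 0.33, 0.50, 0.29, 0.44]  # e.g. nDCG, 10 topics
    sys2 = [0.36, 0.48, 0.35, 0.58, 0.52, 0.45, 0.40, 0.57, 0.34, 0.49]  # 10 different topics
    print(randomisation_test(sys1, sys2))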
Abstract: Dialogue Relation Extraction (DRE) is a new kind of relation extraction task over multi-turn dialogues. Different from previous tasks, speaker-specific relations are implicitly mixed together in both a local utterance window and a speaker context. To tackle both the local and speaker dependency challenges, we explicitly construct a unified mention co-occurrence graph, within a local utterance window or over all utterances of a speaker, from different entities. For each dialogue, a position-enhanced graph attention network over this graph is proposed to obtain position-aware mention representations in terms of both contexts. A gate function is utilized to help obtain a sufficiently discriminative representation for each relation from the original and position-aware mention representations. For each entity pair in the dialogue, a pairwise attention mechanism is deployed to aggregate those discriminative mention representations into a pair representation, which is fed into a standard multi-label classifier for relation label prediction. Experimental results on two benchmarks show that the performance improvement of the proposed method is at least 1.6% and 3.2%, respectively, compared with the SOTA.
Abstract: Unplanned intensive care unit (ICU) readmission rate is an important metric for evaluating the quality of hospital care. Efficient and accurate prediction of ICU readmission risk can not only help prevent patients from inappropriate discharge and potential dangers, but also reduce associated costs of healthcare. In this paper, we propose a new method that uses medical text of Electronic Health Records (EHRs) for prediction, which provides an alternative perspective to previous studies that heavily depend on numerical and time-series features of patients. More specifically, we extract discharge summaries of patients from their EHRs, and represent them with multiview graphs enhanced by an external knowledge graph. Graph convolutional networks are then used for representation learning. Experimental results prove the effectiveness of our method, yielding state-of-the-art performance for this task.
Abstract: Demographics of online users, such as age and gender, play an important role in personalized web applications, particularly in the news domain. However, it is difficult to directly obtain the demographic information of online users. Past works have attempted to predict user demographics based on reading patterns obtained from news browsing data. However, such data can be very limited. Luckily, in recent years, posts and comments have become much more prevalent among online users, and the comments from users of different demographics exhibit differences in content and writing style. Thus, comments can provide additional clues for demographic prediction. In this paper, we study predicting users' demographics based on both news browsing data and the associated user-generated comments. To this end, we make novel use of a recently introduced BERT-based model to embed each comment in the context of its associated article. We experiment on real-world datasets, and explore the contribution of both browsing data and user-generated data in the task of predicting three different user attributes: gender, location type (e.g., rural vs. urban), and mobile device. Finally, we show that our approach can effectively improve the performance of such predictions and outperforms baseline methods.
Abstract: A proactive dialogue system has the ability to proactively lead the conversation. Different from the general chatbots which only react to the user, proactive dialogue systems can be used to achieve some goals, e.g., to recommend some items to the user. Background knowledge is essential to enable smooth and natural transitions in dialogue. In this paper, we propose a new multi-task learning framework for retrieval-based knowledge-grounded proactive dialogue. To determine the relevant knowledge to be used, we frame knowledge prediction as a complementary task and use explicit signals to supervise its learning. The final response is selected according to the predicted knowledge, the goal to achieve, and the context. Experimental results show that explicit modeling of knowledge prediction and goal selection can greatly improve the final response selection. Our code is available at https://github.com/DaoD/KPN/.
Abstract: Few-shot intent detection is a challenging task due to the scarcity of annotations. In this paper, we propose a Pseudo Siamese Network (PSN) to generate labeled data for few-shot intents and alleviate this problem. PSN consists of two identical subnetworks with the same structure but different weights: an action network and an object network. Each subnetwork is a transformer-based variational autoencoder that tries to model the latent distribution of different components in the sentence. The action network is learned to understand action tokens, and the object network focuses on object-related expressions. It provides an interpretable framework for generating an utterance with an action and an object existing in a given intent. Experiments on two real-world datasets show that PSN achieves state-of-the-art performance for the generalized few-shot intent detection task.
Abstract: Previous studies on financial news focus mainly on news articles that explicitly mention the target financial instruments, and may suffer from data sparsity. As taking other related news into consideration, e.g., sector-related news, is a crucial part of real-world decision-making, we explore the use of news without explicit target mentions to enrich the information available to the prediction model. We develop a neural network framework that jointly learns with a news selection mechanism to extract implicit information from the chaotic daily news pool. Our proposed model, called the news distilling network (NDN), takes advantage of neural representation learning and collaborative filtering to capture the relationship between stocks and news. With NDN, we learn latent stock and news representations to facilitate similarity measurement, and apply a gating mechanism to prevent noisy news representations from flowing to a higher-level encoding stage, which encodes the selected news representation of each day. Extensive experiments on real-world stock market data demonstrate the effectiveness of our framework and show improvements over previous techniques.
Abstract: Given a set of required skills, the objective of the team formation problem is to form a team of experts that cover the required skills. Most existing approaches are based on graph methods, such as minimum-cost spanning trees. These approaches, due to their limited view of the network, fail to capture complex interactions among experts and are computationally intractable. More recent approaches adopt neural architectures to learn a mapping between the skills and experts space. While they are more effective, these techniques face two main limitations: (1) they consider a fixed representation for both skills and experts, and (2) they overlook the significant amount of past collaboration network information. We learn dense representations for skills and experts based on previous collaborations and bootstrap the training process through transfer learning. We also propose to fine-tune the representation of skills and experts while learning the mapping function. Our experiments over the DBLP dataset verify that our proposed architecture is able to outperform the state-of-the-art graph and neural methods over both ranking and quality metrics.
Abstract: With the rapid growth of digital data on the Internet, rumor detection on social media has been vital. Existing deep learning-based methods have achieved promising results due to their ability to learn high-level representations of rumors. Despite the success, we argue that these approaches require large reliable labeled data to train, which is time-consuming and data-inefficient. To address this challenge, we present a new solution, Rumor Detection on social media with Event Augmentations (RDEA), which innovatively integrates three augmentation strategies by modifying both reply attributes and event structure to extract meaningful rumor propagation patterns and to learn intrinsic representations of user engagement. Moreover, we introduce contrastive self-supervised learning for the efficient implementation of event augmentations and alleviate limited data issues. Extensive experiments conducted on two public datasets demonstrate that RDEA achieves state-of-the-art performance over existing baselines. Besides, we empirically show the robustness of RDEA when labeled data are limited.
Abstract: Millions of trademarks were registered last year in China, and thousands of applications are submitted daily. A trademark must be unique in the category it belongs to. Therefore, each new trademark application needs to be checked against all the existing ones in its category. A trademark can be a text string (characters, words or phrases), a figure (symbol or design), or both. In this study, we focus on textual trademarks in Chinese, and propose a model for finding similar trademarks for a given one. This neural network model exploits the semantic, phonetic and visual similarities between two textual trademarks. We evaluated our model on a dataset that was built from real trademark application data. Our evaluation shows that the proposed model outperforms other approaches.
Abstract: Biomedical literature retrieval has greatly benefited from recent advances in neural language modeling. In particular, fine-tuning pretrained contextual language models has shown impressive results in recent biomedical retrieval evaluation campaigns. Nevertheless, current approaches neglect the inherent structure available from biomedical abstracts, which are (often explicitly) organised into semantically coherent sections such as background, methods, results, and conclusions. In this paper, we investigate the suitability of leveraging biomedical abstract sections for fine-tuning pretrained contextual language models at a finer granularity. Our results on two TREC biomedical test collections demonstrate the effectiveness of the proposed structured fine-tuning regime in contrast to a standard fine-tuning that does not leverage structure. Through an ablation study, we show that models fine-tuned on individual sections are able to capture potentially useful word contexts that may be otherwise ignored by structure-agnostic models.
Abstract: In real-world search, recommendation, and advertising systems, a multi-stage ranking architecture is commonly adopted. Such an architecture usually consists of matching, pre-ranking, ranking, and re-ranking stages. In the pre-ranking stage, vector-product based models with a representation-focused architecture are commonly adopted for the sake of system efficiency. However, this brings a significant loss to the effectiveness of the system. In this paper, a novel pre-ranking approach is proposed which supports complicated models with an interaction-focused architecture. It achieves a better tradeoff between effectiveness and efficiency by utilizing the proposed learnable Feature Selection method based on feature Complexity and variational Dropout (FSCD). Evaluations in a real-world e-commerce sponsored search system demonstrate that, with the proposed pre-ranking, the effectiveness of the system is significantly improved while an identical amount of computational resources is consumed compared to systems with conventional pre-ranking models.
Abstract: Existing emotion-aware conversational models usually focus on controlling the response content to align with a specific emotion class, whereas empathy is the ability to understand and care about the feelings and experiences of others. Hence, it is critical to learn the causes that evoke the users' emotions, a.k.a. emotion causes, for empathetic responding. To gather emotion causes in online environments, we leverage counseling strategies and develop an empathetic chatbot that utilizes the causal emotion information. On a real-world online dataset, we verify the effectiveness of the proposed approach by comparing our chatbot with several SOTA methods using automatic metrics, expert-based human judgements, as well as user-based online evaluation.
Abstract: Accurate skill retrieval is a key factor in the success of modern conversational AI agents. The major challenges lie in the ambiguity of human spoken language and the wide spectrum of candidate skills. In this paper, we make the first attempt to attack the problem by implementing a user-feedback-enhanced reranking strategy, and propose a self-adaptive dialogue system (AdaDial) for conversational AI agents. In AdaDial, we consider estimating user feedback and adjusting the ranking strategy as a closed loop. In particular, we propose a scalable schema for user feedback estimation and a feedback-enhanced reranking model with customized feature encoding, target-attention-based feature assembling, and multi-task learning. As a result, AdaDial achieves self-adaptivity at both the individual and system levels. Online experimental results demonstrate that AdaDial can not only retrieve desired skills for different users in different scenarios, but also correct its regular strategy according to negative feedback. AdaDial has been deployed on a large-scale conversational AI agent with tens of millions of daily queries, and is bringing continued positive impact on user experience.
Abstract: Disinformation and fake news have posed detrimental effects on individuals and society in recent years, attracting broad attention to fake news detection. The majority of existing fake news detection algorithms focus on mining news content and/or the surrounding exogenous context for discovering deceptive signals, while the endogenous preference of a user when he/she decides to spread a piece of fake news or not is ignored. Confirmation bias theory indicates that a user is more likely to spread a piece of fake news when it confirms his/her existing beliefs/preferences. Users' historical social engagements, such as posts, provide rich information about users' preferences toward news and have great potential to advance fake news detection. However, the work on exploring user preference for fake news detection is somewhat limited. Therefore, in this paper, we study the novel problem of exploiting user preference for fake news detection. We propose a new framework, UPFD, which simultaneously captures various signals from user preferences by joint content and graph modeling. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework. We release our code and data as a benchmark for GNN-based fake news detection: https://github.com/safe-graph/GNN-FakeNews.
Abstract: Query-biased Summarization (QBS) aims to produce a query-dependent summary of a retrieved document to reduce the human effort for inspecting the full-text content. Typical summarization approaches extract document snippets that overlap with the query and show them to searchers. Such QBS methods show relevant information in a document but do not inform searchers what is missing. Our study focuses on reducing user effort in finding relevant documents by exposing the information in the query that is missing in the retrieved results. We use a classical approach, DSPApprox, to find terms or phrases relevant to a query. Then, we identify which terms or phrases are missing in a document, present them in a search interface, and ask crowd workers to judge document relevance based on snippets and missing information. Experimental results show both benefits and limitations of our method compared with traditional ones that only show relevant snippets.
Abstract: Variational Autoencoders (VAEs) have been shown to be effective for recommender systems with implicit feedback (e.g., browsing history, purchasing patterns, etc.). However, little attention has been given to ensembles of VAEs that can learn user and item representations jointly. We introduce the Joint Variational Autoencoder (JoVA), an ensemble of two VAEs which jointly learns both user and item representations to predict user preferences. This design allows JoVA to capture user-user and item-item correlations simultaneously. We also introduce JoVA-Hinge, an extension of JoVA with a hinge-based pairwise loss function, to further specialize it for recommendation with implicit feedback. Our extensive experiments on four real-world datasets demonstrate that JoVA-Hinge outperforms a broad set of state-of-the-art methods under a variety of commonly-used metrics. Our empirical results also illustrate the effectiveness of JoVA-Hinge in handling users with limited training data.
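A hinge-based pairwise loss of the kind added in JoVA-Hinge simply requires an observed (positive) item to be scored above a sampled unobserved (negative) item by a margin. The minimal PyTorch sketch below shows the loss term only; the margin value and negative-sampling scheme are assumptions, and the actual JoVA-Hinge objective also includes the VAE terms.

.. code-block:: python

    # Minimal hinge-based pairwise ranking loss for implicit feedback.
    import torch

    def hinge_pairwise_loss(pos_scores, neg_scores, margin=1.0):
        # pos_scores, neg_scores: (B,) predicted preference scores for positive/negative items
        return torch.clamp(margin - (pos_scores - neg_scores), min=0.0).mean()

    pos = torch.tensor([2.1, 0.3, 1.5])
    neg = torch.tensor([0.4, 0.9, 1.6])
    print(hinge_pairwise_loss(pos, neg))  # penalises pairs where pos does not beat neg by the margin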
Abstract: The COVID-19 pandemic has brought about a proliferation of harmful news articles online, with sources lacking credibility and misrepresenting scientific facts. Misinformation has real consequences for consumer health search, i.e., users searching for health information. In the context of multi-stage ranking architectures, there has been little work exploring whether they prioritize correct and credible information over misinformation. We find that, indeed, training models on standard relevance ranking datasets like MS MARCO passage---which have been curated to contain mostly credible information---yields models that might also promote harmful misinformation. To rectify this, we propose a label prediction technique that can separate helpful from harmful content. Our design leverages pretrained sequence-to-sequence transformer models for both relevance ranking and label prediction. Evaluated at the TREC 2020 Health Misinformation Track, our techniques represent the top-ranked system: Our best submitted run was 19.2 points higher than the second-best run based on the primary metric, a 68% relative improvement. Additional post-hoc experiments show that we can boost effectiveness by another 3.5 points.
Abstract: When a human asks questions online, or when a conversational virtual agent asks a human questions, questions that trigger emotions or contain details may be more likely to receive responses or answers. We explore how to automatically rewrite natural language questions to improve the response rate from people. In particular, a new task of Visual Question Rewriting (VQR) is introduced to explore how visual information can be used to improve the new question(s). A dataset containing ~4K bland-attractive question-image triples is collected. We develop baseline sequence-to-sequence models and more advanced transformer-based models, which take a bland question and a related image as input and output a rewritten question that is expected to be more attractive. Offline experiments and Mechanical Turk based evaluations show that it is possible to rewrite bland questions in a more detailed and attractive way to increase the response rate, and that images can be helpful.
Abstract: Carrying abundant side information, knowledge graphs (KGs) have shown great potential in mitigating the sparsity of collaborative filtering (CF) for recommendation. Although graph neural networks (GNNs) have been successfully employed to learn user preferences from KG and CF signals simultaneously, most models suffer from inferior performance due to deficient designs, i.e., 1) making no distinction between users, items, and KG entities, 2) confounding KG signals with CF signals, and 3) completely neglecting the effects of edges, which are vital for graph information propagation. In this paper, we propose a quad-channel graph model (X-2ch) to tackle these problems. First, rather than lodging KG entities on the graph as nodes, X-2ch distills KG information and embeds it as edge attributes in a bi-directional manner to model the natural user-item interaction process. Second, X-2ch introduces a novel quad-channel learning scheme, including a collaborative user-item update and a CF-KG attentive propagation, to holistically capture the interconnectivity of users and items while preserving their distinct properties. Experiments on two real-world benchmarks show substantial improvements over state-of-the-art baselines.
Abstract: Due to high annotation costs, making the best use of existing human-created training data is an important research direction. We therefore carry out a systematic evaluation of the transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot evaluation mode and improves the reliability of our results. Furthermore, since source dataset licences often prohibit commercial use, we compare transfer learning to training on pseudo-labels generated by a BM25 scorer. We find that training on pseudo-labels---possibly with subsequent fine-tuning using a modest number of annotated queries---can produce a competitive or better model compared to transfer learning. Yet, it is necessary to improve the stability and/or effectiveness of few-shot training, which, sometimes, can degrade the performance of a pretrained model.
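Generating training pseudo-labels with a BM25 scorer, as mentioned above, can be done with any lexical retrieval library. The sketch below uses the rank_bm25 package and treats the top-ranked document per query as a pseudo-positive and lower-ranked documents as negatives; the package choice, cutoff, and toy corpus are assumptions for illustration, not the authors' exact procedure.

.. code-block:: python

    # Illustrative BM25 pseudo-labelling for training a neural ranker without human judgements.
    from rank_bm25 import BM25Okapi

    docs = ["neural ranking models for text retrieval",
            "bm25 is a classical lexical scoring function",
            "transfer learning across retrieval collections"]
    queries = ["lexical scoring", "neural rankers"]

    bm25 = BM25Okapi([d.split() for d in docs])
    pseudo_labels = []
    for q in queries:
        scores = bm25.get_scores(q.split())
        ranked = sorted(range(len(docs)), key=lambda i: -scores[i])
        pseudo_labels.append({"query": q,
                              "positive": docs[ranked[0]],              # top-1 as pseudo-positive
                              "negatives": [docs[i] for i in ranked[1:]]})
    print(pseudo_labels[0])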
Abstract: In this paper, we propose a novel abstractive text summarization method with hierarchical multi-scale abstraction modeling and dynamic memory (called MADY). First, we propose a hierarchical multi-scale abstraction modeling method to capture the temporal dependencies of the document from multiple hierarchical levels of abstraction, which mimics how human beings comprehend an article, by learning fine timescales for low-level abstraction layers and coarse timescales for high-level abstraction layers. By applying this adaptive updating mechanism, the high-level abstraction layers are updated less frequently and are expected to remember long-term dependencies better than the low-level abstraction layers. Second, we propose a dynamic key-value memory-augmented attention network to keep track of the attention history and comprehensive context information for the salient facets within the input document. In this way, our model can avoid generating repetitive words and faulty summaries. Extensive experiments on two widely-used datasets demonstrate the effectiveness of the proposed MADY model in terms of both automatic evaluation and human evaluation. For reproducibility, we release the code and data at: https://github.com/siat-nlp/MADY.git.
Abstract: Recent AI research has witnessed increasing interest in automatically designing the architecture of deep neural networks, a direction coined neural architecture search (NAS). Network architectures found automatically via NAS methods have outperformed manually designed architectures on some NLP tasks. However, training a large number of model configurations for efficient NAS is computationally expensive, creating a substantial barrier to applying NAS methods in real-life applications. In this paper, we propose to accelerate neural architecture search for natural language processing based on knowledge distillation (called KD-NAS). Specifically, instead of searching for the optimal network architecture on the validation set conditioned on the optimal network weights on the training set, we learn the optimal network by minimizing the knowledge loss transferred from a pre-trained teacher network to the searched network based on the Earth Mover's Distance (EMD). Experiments on five datasets show that our method achieves promising performance compared to strong competitors in terms of both accuracy and search speed. For reproducibility, we release the code at: https://github.com/lxk00/KD-NAS-EMD.
Abstract: Owing to their remarkable capability of extracting effective graph embeddings, graph convolutional networks (GCNs) and their variants have been successfully applied to a broad range of tasks, such as node classification, link prediction, and graph classification. Traditional GCN models suffer from the issues of overfitting and oversmoothing, while some recent techniques like DropEdge can alleviate these issues and thus enable the development of deep GCNs. However, training GCN models is non-trivial, as they are sensitive to the choice of hyperparameters such as the dropout rate and weight decay, especially for deep GCN models. In this paper, we aim to automate the training of GCN models through hyperparameter optimization. To be specific, we propose a self-tuning GCN approach with an alternate training algorithm, and further extend our approach by incorporating a population-based training scheme. Experimental results on three benchmark datasets demonstrate the effectiveness of our approaches for optimizing multi-layer GCNs, compared with several representative baselines.
Abstract: We propose AutoName, an unsupervised framework that extracts a name for a set of query entities from a large-scale text corpus. Entity-set naming is useful in many tasks related to natural language processing and information retrieval such as session-based and conversational information seeking. Previous studies mainly extract set names from knowledge bases which provide highly reliable entity relations, but suffer from limited coverage of entities and set names that represent broad semantic classes. To address these problems, AutoName generates hypernym-anchored candidate phrases via probing a pre-trained language model and the entities' context in documents. Phrases are then clustered to identify ones that describe common concepts among query entities. Finally, AutoName ranks refined phrases based on the co-occurrences of their words with query entities and the conceptual integrity of their respective clusters. We built a new benchmark dataset for this task, consisting of 130 entity sets with name labels. Experimental results show that AutoName generates coherent and meaningful set names and significantly outperforms all baselines.
Abstract: Cross-lingual text representations have gained popularity lately and act as the backbone of many tasks such as unsupervised machine translation and cross-lingual information retrieval, to name a few. However, evaluating such representations is difficult in domains beyond standard benchmarks due to the necessity of obtaining domain-specific parallel language data across different pairs of languages. In this paper, we propose Backretrieval, an automatic metric for evaluating the quality of cross-lingual textual representations that uses images as a proxy in a paired image-text evaluation dataset. Experimentally, Backretrieval is shown to correlate highly with ground-truth metrics on annotated datasets, and our analysis shows statistically significant improvements over baselines. Our experiments conclude with a case study on a recipe dataset without parallel cross-lingual data. We illustrate how to judge cross-lingual embedding quality with Backretrieval, and validate the outcome with a small human study.
Abstract: Critiquing is a method for conversational recommendation that incrementally adapts recommendations in response to user preference feedback. Recent advances in critiquing have leveraged the power of VAE-CF recommendation in a critiquable-explainable (CE-VAE) framework that updates latent user preference embeddings based on their critiques of keyphrase-based explanations. However, the CE-VAE has two key drawbacks: (i) it uses a second VAE head to facilitate explanations and critiquing, which can sacrifice recommendation performance of the first VAE head due to multiobjective training, and (ii) it requires iterating an inverse decoding-encoding loop for multi-step critiquing that yields poor performance. To address these deficiencies, we propose a novel Bayesian Keyphrase critiquing VAE (BK-VAE) framework that builds on the strengths of VAE-CF, but avoids the problematic second head of CE-VAE. Instead, the BK-VAE uses a Concept Activation Vector (CAV) inspired approach to determine the alignment of item keyphrase properties with latent user preferences in VAE-CF. BK-VAE leverages this alignment in a Bayesian framework to model uncertainty in a user's latent preferences and to perform posterior updates to these preference beliefs after each critique --- essentially achieving CE-VAE's explanation and critique inversion through a simple application of Bayes rule. Our empirical evaluation on two datasets demonstrates that BK-VAE matches or dominates CE-VAE in both recommendation and multi-step critiquing performance.
Abstract: We propose a simple and effective strategy for data augmentation for low-resource machine reading comprehension (MRC). Our approach first pretrains the answer extraction components of an MRC system on augmented data that contains the approximate context of the correct answers, before training it on the exact answer spans. The approximate context helps the answer extraction components narrow down the location of the answers. We demonstrate that our simple strategy substantially improves both document retrieval and answer extraction performance by providing larger context for the answers and additional training data. In particular, our method significantly improves the performance of a BERT-based retriever (15.12%) and answer extractor (4.33% F1) on TechQA, a complex, low-resource MRC task. Further, our data augmentation strategy yields significant improvements of up to 3.9% exact match (EM) and 2.7% F1 for answer extraction on PolicyQA, another practical but moderately sized QA dataset that also contains long answer spans.
Abstract: Multi-label learning algorithms have attracted increasing attention in recent years. This is mainly because real-world data is generally associated with multiple, non-exclusive labels, which could correspond to different objects, scenes, actions, and attributes. In this paper, we consider the following challenging multi-label stream scenario: new labels emerge continuously in changing environments and are assigned to the previous data. In this setting, data mining solutions must be able to learn the new concepts and avoid catastrophic forgetting simultaneously. We propose a novel continual and interactive feature distillation-based learning framework (CIFDM) to effectively classify instances with novel labels. We utilize the knowledge from previous tasks to learn new knowledge for the current task. The system then compresses historical and novel knowledge and preserves it while waiting for new emerging tasks. CIFDM consists of three components: 1) a knowledge bank that stores the existing feature-level compressed knowledge and predicts the labels observed so far; 2) a pioneer module that aims to learn and predict newly emerged labels based on the knowledge bank; 3) an interactive knowledge compression function that compresses and transfers the new knowledge to the bank, and then applies the current compressed knowledge to initialize the label embedding of the pioneer for the next task.
Abstract: In this paper, we propose a clustering-based online news topic detection and tracking (TDT) approach based on a hierarchical Bayesian nonparametric framework that allows topics to be shared across different news stories in a corpus. Our approach is formulated using the hierarchical Pitman-Yor process mixture model with the inverted Beta-Liouville (IBL) distribution as its component density, which has been shown to model text data better than the widely used Gaussian distribution. Moreover, we theoretically develop a convergence-guaranteed online learning algorithm that can effectively learn the proposed TDT model from a stream of news stories based on variational Bayes. The merits of our TDT approach are illustrated by comparing it with other well-defined clustering-based TDT approaches on different news data sets.
Abstract: Hypergraphs can capture higher-order relations between subsets of objects instead of only pairwise relations as in graphs. Hypergraph clustering is an important task in information retrieval and machine learning. We study the problem of distributed hypergraph clustering in the message passing communication model using small communication cost. We propose an algorithm framework for distributed hypergraph clustering based on spectral hypergraph sparsification. For an n-vertex hypergraph G with hyperedges of maximum size r distributed arbitrarily at s sites and a parameter ε ∈ (0,1), our algorithm can produce a vertex set with conductance O(√((1+ε)/(1-ε)) · √φ_G), where φ_G is the conductance of G, using communication cost Õ(n r² s / ε^{O(1)}) (Õ hides a polylogarithmic factor). The theoretical results are complemented with extensive experiments that demonstrate the efficiency and effectiveness of the proposed algorithm on different real-world datasets. Our source code is publicly available at github.com/chunjiangzhu/dhgc.
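As background for the bound above, the conductance φ_G can be stated for an unweighted hypergraph as follows (a standard definition; the paper may use a weighted variant):

.. math::

   \phi(S) = \frac{\big|\{ e \in E : e \cap S \neq \emptyset,\ e \cap (V \setminus S) \neq \emptyset \}\big|}
                  {\min\big(\mathrm{vol}(S),\, \mathrm{vol}(V \setminus S)\big)},
   \qquad
   \phi_G = \min_{\emptyset \neq S \subsetneq V} \phi(S),

where vol(S) is the total degree of the vertices in S and a hyperedge counts as cut whenever it has vertices on both sides of the partition.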
Abstract: We present a Composite Code Sparse Autoencoder (CCSA) approach for Approximate Nearest Neighbor (ANN) search of document representations based on Siamese-BERT models. In Information Retrieval (IR), the ranking pipeline is generally decomposed into two stages: the first stage focuses on retrieving a candidate set from the whole collection, and the second stage re-ranks the candidates by relying on more complex models. Recently, Siamese-BERT models have been used as first-stage rankers to replace or complement the traditional bag-of-words models. However, indexing and searching a large document collection requires efficient similarity search on dense vectors, and this is why ANN techniques come into play. Since composite codes are naturally sparse, we show how CCSA can learn an efficient parallel inverted index thanks to a uniformity regularizer. Our experiments on MS MARCO reveal that for the same quantization budget and recall@1000 targets, CCSA is able to outperform IVF (inverted-index file) with product quantization on both
Abstract: Social recommendation aims to fuse social links with user-item interactions to alleviate the cold-start problem for rating prediction. Recent developments of Graph Neural Networks (GNNs) motivate endeavors to design GNN-based social recommendation frameworks to aggregate both social and user-item interaction information simultaneously. However, most existing methods neglect the social inconsistency problem, which intuitively suggests that social links are not necessarily consistent with the rating prediction process. Social inconsistency can be observed from both context-level and relation-level. Therefore, we intend to empower the GNN model with the ability to tackle the social inconsistency problem. We propose to sample consistent neighbors by relating sampling probability with consistency scores between neighbors. Besides, we employ the relation attention mechanism to assign consistent relations with high importance factors for aggregation. Experiments on two real-world datasets verify the model effectiveness.
Abstract: We propose a novel domain-specific generative pre-training (DSGPT) method for text generation and apply it to the product title and review summarization problems on E-commerce mobile display. First, we adopt a decoder-only transformer architecture, which fits fine-tuning tasks well by combining input and output all together. Second, we demonstrate that utilizing only a small amount of pre-training data from related domains is powerful. Pre-training a language model from a general corpus such as Wikipedia or the Common Crawl requires a tremendous time and resource commitment, and can be wasteful if the downstream tasks are limited in variety. Our DSGPT is pre-trained on a limited dataset, the Chinese short text summarization dataset (LCSTS). Third, our model does not require product-related human-labeled data. For the title summarization task, the state of the art explicitly uses additional background knowledge in the training and prediction stages. In contrast, our model implicitly captures this knowledge and achieves significant improvements over other methods after fine-tuning on the public Taobao.com dataset. For the review summarization task, we utilize a JD.com in-house dataset and observe similar improvements over standard machine translation methods, which lack the flexibility of fine-tuning. Our proposed work can be simply extended to other domains for a wide range of text generation tasks.
Abstract: Recently, BERT has achieved significant progress for sentence matching via word-level cross-sentence attention. However, performance drops significantly when siamese BERT-networks are used to derive two sentence embeddings, which fall short of capturing global semantics since the word-level attention between two sentences is absent. In this paper, we propose a Dual-view distilled BERT (DvBERT) for sentence matching with sentence embeddings. Our method deals with a sentence pair from two distinct views, i.e., the Siamese View and the Interaction View. The Siamese View is the backbone where we generate sentence embeddings. The Interaction View integrates cross-sentence interaction as multiple teachers to boost the representation ability of the sentence embeddings. Experiments on six STS tasks show that our method outperforms state-of-the-art sentence embedding methods.
Abstract: Representation learning of examination papers is the cornerstone of Examination Paper Analysis (EPA) in the education domain, including Paper Difficulty Prediction (PDR) and Finding Similar Papers (FSP). Previous works mainly focus on the representation learning of each test item, but few notice the hierarchical document structure of examination papers. To this end, in this paper, we propose a novel Examination Organization Encoder (EOE) to learn a robust representation of an examination paper from its hierarchical document structure. Specifically, we first propose a syntax parser that recovers the hierarchical document structure and converts an examination paper into an Examination Organization Tree (EOT), where the test items are the leaf nodes and the internal nodes are summarizations of their child nodes. Then, we apply a two-layer GRU-based module to obtain the representation of each leaf node. After that, we design a subtree encoder module that aggregates the leaf-node representations and computes an embedding for each layer in the EOT. Finally, we feed all the layer embeddings into an output module to obtain the examination paper representation, which can be used for downstream tasks. Extensive experiments on real-world data demonstrate the effectiveness and interpretability of our method.
Abstract: Cross features play an important role in click-through rate (CTR) prediction. Most of the existing methods adopt a DNN-based model to capture cross features in an implicit manner. These implicit methods may lead to sub-optimal performance due to their limitations in explicit semantic modeling. Although traditional statistical explicit semantic cross features can address the problems of these implicit methods, they still suffer from several challenges, including a lack of generalization and expensive memory cost. Few works focus on tackling these challenges. In this paper, we take the first step in learning explicit semantic cross features and propose Pre-trained Cross Feature learning Graph Neural Networks (PCF-GNN), a GNN-based pre-trained model aiming at generating cross features in an explicit fashion. Extensive experiments are conducted on both public and industrial datasets, where PCF-GNN shows competence in both performance and memory-efficiency in various tasks.
Abstract: Deep neural network (DNN) models have been widely used for click-through rate (CTR) prediction in online advertising. The training framework typically consists of embedding layers and multi-layer perceptrons (MLP). At Baidu Search Ads (a.k.a. Phoenix Nest), the new generation of the CTR training platform is PaddleBox, a GPU-based parameter server system. In this paper, we present Baidu's recently updated CTR training framework, called Gating-enhanced Multi-task Neural Networks (GemNN). In particular, we develop a neural network based multi-task learning model to predict CTR in a coarse-to-fine manner, which gradually reduces ad candidates and allows parameter sharing from upstream tasks to downstream tasks to improve the training efficiency. Also, we introduce a gating mechanism between embedding layers and MLP to learn feature interactions and control the information flow fed to the MLP layers. We have launched our solution in the Baidu PaddleBox platform and observed considerable improvements in both offline and online evaluations. It is now part of the current production system.
Abstract: We address the poor generalization of few-shot learning models for event detection (ED) using transfer learning and representation regularization. In particular, we propose to transfer knowledge from open-domain word sense disambiguation into few-shot learning models for ED to improve their generalization to new event types. We also propose a novel training signal derived from dependency graphs to regularize the representation learning for ED. Moreover, we evaluate few-shot learning models for ED with a large-scale human-annotated ED dataset to obtain more reliable insights for this problem. Our comprehensive experiments demonstrate that the proposed model outperforms state-of-the-art baseline models in the few-shot learning and supervised learning settings for ED. Code and data splits are available at https://github.com/laiviet/ed-fsl.
Abstract: Graph pooling, which summarizes the information in a large graph into a compact form, is essential in hierarchical graph representation learning. Existing graph pooling methods either suffer from high computational complexity or cannot capture the global dependencies between graphs before and after pooling. To address these problems, we propose Coarsened Graph Infomax Pooling (CGIPool), which maximizes the mutual information between the input and the coarsened graph of each pooling layer to preserve graph-level dependencies. To achieve mutual information neural maximization, we apply contrastive learning and propose a self-attention-based algorithm for learning positive and negative samples. Extensive experimental results on seven datasets illustrate the superiority of CGIPool compared to state-of-the-art methods.
Abstract: Graph neural architecture search has received a lot of attention as Graph Neural Networks (GNNs) have recently been successfully applied to non-Euclidean data. However, exploring all possible GNN architectures in the huge search space is too time-consuming or impossible for big graph data. In this paper, we propose a parallel graph architecture search (GraphPAS) framework for graph neural networks. In GraphPAS, we explore the search space in parallel by designing a sharing-based evolutionary learning scheme, which improves search efficiency without losing accuracy. Additionally, architecture information entropy is used dynamically to set the mutation selection probability, which reduces space exploration. The experimental results show that GraphPAS outperforms state-of-the-art models in both efficiency and accuracy.
Abstract: Conversion Rate (CVR) prediction in modern industrial e-commerce platforms is becoming increasingly important, which directly contributes to the final revenue. In order to address the well-known sample selection bias (SSB) and data sparsity (DS) issues encountered during CVR modeling, the abundant labeled macro behaviors (i.e., user's interactions with items) are used. Nonetheless, we observe that several purchase-related micro behaviors (i.e., user's interactions with specific components on the item detail page) can supplement fine-grained cues for CVR prediction. Motivated by this observation, we propose a novel CVR prediction method by Hierarchically Modeling both Micro and Macro behaviors (HM3). Specifically, we first construct a complete user sequential behavior graph to hierarchically represent micro behaviors and macro behaviors as one-hop and two-hop post-click nodes. Then, we embody HM3 as a multi-head deep neural network, which predicts six probability variables corresponding to explicit sub-paths in the graph. They are further combined into the prediction targets of four auxiliary tasks as well as the final CVR according to the conditional probability rule defined on the graph. By employing multi-task learning and leveraging the abundant supervisory labels from micro and macro behaviors, HM3 can be trained end-to-end and address the SSB and DS issues. Extensive experiments on both offline and online settings demonstrate the superiority of the proposed HM3 over representative state-of-the-art methods.
Abstract: BERT-based Neural Ranking Models (NRMs) can be classified according to how the query and document are encoded through BERT's self-attention layers - bi-encoder versus cross-encoder. Bi-encoder models are highly efficient because all the documents can be pre-processed before the query time, but their performance is inferior compared to cross-encoder models. Both models utilize a ranker that receives BERT representations as the input and generates a relevance score as the output. In this work, we propose a method where multi-teacher distillation is applied to a cross-encoder NRM and a bi-encoder NRM to produce a bi-encoder NRM with two rankers. The resulting student bi-encoder achieves an improved performance by simultaneously learning from a cross-encoder teacher and a bi-encoder teacher and also by combining relevance scores from the two rankers. We call this method TRMD (Two Rankers and Multi-teacher Distillation). In the experiments, TwinBERT and ColBERT are considered as baseline bi-encoders. When monoBERT is used as the cross-encoder teacher, together with either TwinBERT or ColBERT as the bi-encoder teacher, TRMD produces a student bi-encoder that performs better than the corresponding baseline bi-encoder. For P@20, the maximum improvement was 11.4%, and the average improvement was 6.8%. As an additional experiment, we considered producing cross-encoder students with TRMD, and found that it could also improve the cross-encoders.
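The two-ranker, multi-teacher distillation idea described above can be sketched in a few lines of PyTorch (illustrative only; the layer sizes, loss weights, and the way the BERT representations are produced here are our assumptions, not the paper's):

.. code-block:: python

   import torch
   import torch.nn as nn

   class TwoRankerStudent(nn.Module):
       """Bi-encoder student with two ranking heads, each distilled from a different teacher."""
       def __init__(self, dim=768):
           super().__init__()
           self.ranker_from_cross = nn.Linear(2 * dim, 1)  # head distilled from the cross-encoder teacher
           self.ranker_from_bi = nn.Linear(2 * dim, 1)     # head distilled from the bi-encoder teacher

       def forward(self, q_vec, d_vec):
           pair = torch.cat([q_vec, d_vec], dim=-1)
           s1 = self.ranker_from_cross(pair).squeeze(-1)
           s2 = self.ranker_from_bi(pair).squeeze(-1)
           return s1, s2, s1 + s2  # final relevance score combines both rankers

   def distill_step(student, optimizer, q_vec, d_vec, t_cross, t_bi):
       """One optimization step: each head regresses towards its teacher's relevance scores."""
       s1, s2, _ = student(q_vec, d_vec)
       loss = nn.functional.mse_loss(s1, t_cross) + nn.functional.mse_loss(s2, t_bi)
       optimizer.zero_grad()
       loss.backward()
       optimizer.step()
       return loss.item()

   # toy usage with random "BERT" representations and random teacher scores
   student = TwoRankerStudent()
   opt = torch.optim.Adam(student.parameters(), lr=1e-3)
   q, d = torch.randn(8, 768), torch.randn(8, 768)
   distill_step(student, opt, q, d, t_cross=torch.randn(8), t_bi=torch.randn(8))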
Abstract: Text style transfer is an important issue for conversational agents as it may adapt utterance production to specific dialogue situations. It consists in introducing a given style within a sentence while preserving its semantics. Within this scope, different strategies have been proposed that either rely on parallel data or take advantage of non-supervised techniques. In this paper, we follow the latter approach and show that the sequential introduction of different loss functions into the learning process can boost the performance of a standard model. We also evidence that combining different style classifiers that either focus on global or local textual information improves sentence generation. Experiments on the Yelp dataset show that our methodology strongly competes with the current state-of-the-art models across style accuracy, grammatical correctness, and content preservation.
Abstract: Network representation learning aims to generate an embedding for each node in a network, which facilitates downstream machine learning tasks such as node classification and link prediction. Current work mainly focuses on transductive network representation learning, i.e. generating fixed node embeddings, which is not suitable for real-world applications. Therefore, we propose a new inductive network representation learning method called MNCI by mining neighborhood and community influences in temporal networks. We propose an aggregator function that integrates neighborhood influence with community influence to generate node embeddings at any time. We conduct extensive experiments on several real-world datasets and compare MNCI with several state-of-the-art baseline methods on various tasks, including node classification and network visualization. The experimental results show that MNCI achieves better performance than baselines.
Abstract: Transformer-based models, and especially pre-trained language models like BERT, have shown great success on a variety of Natural Language Processing and Information Retrieval tasks. However, such models have difficulty processing long documents due to the quadratic complexity of the self-attention mechanism. Recent works either truncate long documents or segment them into passages that can be treated by a standard BERT model. A hierarchical architecture, such as a transformer, can then be adopted to build a document-level representation on top of the representations of the individual passages. However, these approaches either lose information or have high computational complexity (and, in the latter case, are both time and energy consuming). We follow here a slightly different approach in which one first selects key blocks of a long document by local query-block pre-ranking, and then aggregates a few blocks to form a short document that can be processed by a model such as BERT. Experiments conducted on standard Information Retrieval datasets demonstrate the effectiveness of the proposed approach.
Abstract: Knowledge graph embedding aims to represent entities and relations in a continuous feature space while preserving the structure of a knowledge graph. Most existing knowledge graph embedding methods either focus only on a flat structure of the given knowledge graph or exploit the predefined types of entities to explore an enriched structure. In this paper, we define the metagraph of a knowledge graph by proposing a new affinity metric that measures the structural similarity between entities, and then grouping close entities by hypergraph clustering. Without any prior information about entity types, a set of semantically close entities is successfully merged into one super-entity in our metagraph representation. We propose the metagraph-based pre-training model of knowledge graph embedding where we first learn representations in the metagraph and initialize the entities and relations in the original knowledge graph with the learned representations. Experimental results show that our method is effective in improving the accuracy of state-of-the-art knowledge graph embedding methods.
Abstract: Modern search engine ranking pipelines are commonly based on large machine-learned ensembles of regression trees. We propose LEAR, a novel - learned - technique aimed to reduce the average number of trees traversed by documents to accumulate the scores, thus reducing the overall query response time. LEAR exploits a classifier that predicts whether a document can early exit the ensemble because it is unlikely to be ranked among the final top-k results. The early exit decision occurs at a sentinel point, i.e., after having evaluated a limited number of trees, and the partial scores are exploited to filter out non-promising documents. We evaluate LEAR by deploying it in a production-like setting, adopting a state-of-the-art algorithm for ensembles traversal. We provide a comprehensive experimental evaluation on two public datasets. The experiments show that LEAR has a significant impact on the efficiency of the query processing without hindering its ranking quality. In detail, on a first dataset, LEAR is able to achieve a speedup of 3x without any loss in NDCG@10, while on a second dataset the speedup is larger than 5x with a negligible NDCG@10 loss (< 0.05%).
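The early-exit idea can be illustrated with a short sketch (the data layout and the exit classifier below are hypothetical; the actual LEAR classifier and the ensemble-traversal algorithm it plugs into are more sophisticated):

.. code-block:: python

   def score_with_early_exit(doc_tree_scores, sentinel, exit_classifier):
       """Accumulate per-tree scores for one document, but stop at the sentinel
       if the classifier predicts the document cannot reach the final top-k.

       doc_tree_scores: list of per-tree score contributions for this document
       sentinel: number of trees evaluated before the early-exit decision
       exit_classifier: callable mapping a partial score to True (keep) / False (drop)
       """
       partial = sum(doc_tree_scores[:sentinel])
       if not exit_classifier(partial):
           return partial, True   # exited early; score is only partial
       return partial + sum(doc_tree_scores[sentinel:]), False

   # toy usage: drop documents whose partial score after 100 trees is below a learned cutoff
   scores = [0.01] * 500
   print(score_with_early_exit(scores, sentinel=100, exit_classifier=lambda s: s >= 0.5))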
Abstract: Considering the temporal order of user-item interactions for recommendation forms a novel class of recommendation algorithms in recent years, among which sequential recommendation models are the most popular approaches. Although, theoretically, such fine-grained modeling should be beneficial to the recommendation performance, these sequential models in practice greatly suffer from the issue of data sparsity as there are a huge number of combinations for item sequences. To address the issue, we propose LSTPR, a graph-based matrix factorization model that incorporates both high-order graph information and long short-term user preferences into the modeling process. LSTPR explicitly distinguishes long-term and short-term user preferences and enriches the sparse interactions via random surfing on the user-item graph. Experiments on three recommendation datasets with temporal user-item information demonstrate that the proposed LSTPR model achieves significantly better performance than the seven baseline methods.
Abstract: The increase of group polarization on social media seriously impacts the health of public discourse and information dissemination. At present, detecting polarized structures in signed networks is well-motivated for studying group polarization on social media. However, most studies restrict the number of polarized structures to only two, neglecting the real-world scenario where signed networks consist of multiple polarized structures, which is an unreasonable assumption. To overcome the limitations of existing work, in this paper we present a novel cohesive subgraph model based on structural clusterable theory, named the maximal multipolarized clique (MMC), which can be partitioned into k polarized subcliques such that the edges within subcliques are positive and the edges between subcliques are negative. This paper formulates the problem of Maximal Multipolarized Cliques Search (MMCS) in signed networks, which is proved to be NP-hard. To address this problem, we first devise powerful pruning rules to reduce the signed network significantly and further develop an efficient algorithm to search all maximal multipolarized cliques in the reduced signed network. The experimental results on real-world signed networks demonstrate the efficiency and effectiveness of our algorithm.
Abstract: Knowledge Graphs (KGs) are widely used in various information retrieval applications. Despite the large scale of KGs, they still suffer from incompleteness. Conventional approaches to Knowledge Graph Completion (KGC) require a large number of training instances for each relation. However, long-tail relations that have only a few related triples are ubiquitous in KGs. Therefore, it is very difficult to complete the long-tail relations. In this paper, we propose a meta pattern learning framework (MetaP) to predict new facts of relations under a challenging setting where there is only one reference for each relation. Patterns in data are representative regularities used to classify data. Triples in KGs also conform to relation-specific patterns which can be used to measure the validity of triples. Our model extracts the patterns effectively through a convolutional pattern learner and measures the validity of triples accurately by matching query patterns with reference patterns. Extensive experiments demonstrate the effectiveness of our method. Besides, we build a few-shot KGC dataset of COVID-19 to assist the research process on the new coronavirus.
Abstract: Multi-task learning (MTL) is an open and challenging problem in various real-world applications. The typical way of conducting multi-task learning is to establish some global parameter sharing mechanism across all tasks or to assign each task an individual set of parameters with cross-connections between tasks. However, in most existing approaches, all tasks simply share all features, thoroughly or proportionally, without distinguishing which features are actually helpful. As a result, some tasks are disturbed by features that are unhelpful to them but useful for other tasks, leading to undesired negative transfer between tasks. In this paper, we design a novel architecture named the Multiple-level Sparse Sharing Model (MSSM), which can learn features selectively and share knowledge across all tasks efficiently. MSSM first employs a field-level sparse connection module (FSCM) to enable much more expressive combinations of feature fields to be learned for generalization across tasks, while still allowing task-specific features to be customized for each task. Furthermore, a cell-level sparse sharing module (CSSM) recognizes the sharing pattern through a set of coding variables that selectively choose which cells to route for a given task. Extensive experimental results on several real-world datasets show that MSSM significantly outperforms SOTA models in terms of AUC and LogLoss metrics.
Abstract: In this paper, we propose an augmented Graph Convolutional Network (GCN) mechanism wherein additional information of local interaction patterns between a node with its neighbors (specifically, in the form of distribution of cosine similarity values of a pre-trained node vector with its neighbors) is used to enrich a node's representation prior to training a GCN. This provides additional information about the structural properties of a node, which the standard convolution operation in a GCN can then leverage for obtaining potentially improved effectiveness in a down-stream task. Our experiments demonstrate that adding these node interaction patterns (NIPs) along with an additional noise-contrastive pairwise document similarity objective within a GCN improves the linked document classification task.
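As a rough illustration of the node interaction patterns (NIPs) mentioned above, one can histogram the cosine similarities between a pre-trained node vector and those of its neighbours and append that distribution to the node features before running the GCN. The binning and normalisation below are our own assumptions, not the paper's exact construction:

.. code-block:: python

   import numpy as np

   def nip_features(node_vecs, adjacency, n_bins=10):
       """For each node, the distribution (histogram) of cosine similarities
       between its pre-trained vector and those of its neighbours.

       node_vecs: (n, d) array of pre-trained node vectors
       adjacency: dict mapping node index -> list of neighbour indices
       """
       normed = node_vecs / (np.linalg.norm(node_vecs, axis=1, keepdims=True) + 1e-9)
       feats = np.zeros((len(node_vecs), n_bins))
       for v, nbrs in adjacency.items():
           if not nbrs:
               continue
           sims = normed[nbrs] @ normed[v]                  # cosine similarities in [-1, 1]
           hist, _ = np.histogram(sims, bins=n_bins, range=(-1.0, 1.0))
           feats[v] = hist / hist.sum()
       return feats

   x = np.random.randn(4, 16)
   print(nip_features(x, {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}))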
Abstract: Rapidly growing online podcast archives contain diverse content on a wide range of topics. These archives form an important resource for entertainment and professional use, but their value can only be realized if users can rapidly and reliably locate content of interest. Search for relevant content can be based on metadata provided by content creators, but also on transcripts of the spoken content itself. Excavating relevant content from deep within these audio streams for diverse types of information needs requires varying the approach to systems prototyping. We describe a set of diverse podcast information needs and different approaches to assessing retrieved content for relevance. We use these information needs in an investigation of the utility and effectiveness of these information sources. Based on our analysis, we recommend approaches for indexing and retrieving podcast content for ad hoc search.
Abstract: Extreme multi-label classification (XMLC) refers to the task of tagging instances with small subsets of relevant labels coming from an extremely large set of all possible labels. Recently, XMLC has been widely applied to diverse web applications such as automatic content labeling, online advertising, or recommendation systems. In such environments, the label distribution is often highly imbalanced, consisting mostly of very rare tail labels, and relevant labels can be missing. As a remedy to these problems, the propensity model has been introduced and applied within several XMLC algorithms. In this work, we focus on the problem of optimal predictions under this model for probabilistic label trees, a popular approach for XMLC problems. We introduce an inference procedure, based on the A*-search algorithm, that efficiently finds the optimal solution, assuming that all probabilities and propensities are known. We demonstrate the attractiveness of this approach in a wide empirical study on popular XMLC benchmark datasets.
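To give a feel for tree-based inference, here is a generic best-first traversal that returns the top-k leaves of a probabilistic label tree ranked by path probability. This is not the paper's procedure, which additionally incorporates propensities into the objective; it only illustrates why an A*-style search over such trees is efficient: since conditional probabilities are at most 1, a node's path probability upper-bounds every leaf below it, so the first k leaves popped are the k most probable ones.

.. code-block:: python

   import heapq

   def top_k_leaves(root, k):
       """Best-first search over a probabilistic label tree.

       Each node is a dict with 'prob' (conditional probability given its parent),
       an optional 'children' list, and a 'label' on leaves.
       """
       heap = [(-root['prob'], 0, root)]   # max-heap via negated priority; counter breaks ties
       counter, results = 1, []
       while heap and len(results) < k:
           neg_p, _, node = heapq.heappop(heap)
           if not node.get('children'):
               results.append((node['label'], -neg_p))   # leaf: emit its path probability
               continue
           for child in node['children']:
               heapq.heappush(heap, (neg_p * child['prob'], counter, child))
               counter += 1
       return results

   tree = {'prob': 1.0, 'children': [
       {'prob': 0.7, 'children': [{'prob': 0.9, 'label': 'a'}, {'prob': 0.2, 'label': 'b'}]},
       {'prob': 0.4, 'children': [{'prob': 0.8, 'label': 'c'}]}]}
   print(top_k_leaves(tree, k=2))   # [('a', 0.63), ('c', 0.32...)]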
Abstract: Distant supervision (DS) has been widely used to automatically construct (noisy) labeled data for relation extraction (RE). To address the noisy label problem, most models have adopted the multi-instance learning paradigm by representing entity pairs as a bag of sentences. However, this strategy depends on multiple assumptions (e.g., all sentences in a bag share the same relation), which may be invalid in real-world applications. Besides, it cannot work well on long-tail entity pairs which have few supporting sentences in the dataset. In this work, we propose a new paradigm named retrieval-augmented distantly supervised relation extraction (ReadsRE), which can incorporate large-scale open-domain knowledge (e.g., Wikipedia) into the retrieval step. ReadsRE seamlessly integrates a neural retriever and a relation predictor in an end-to-end framework. We demonstrate the effectiveness of ReadsRE on the well-known NYT10 dataset. The experimental results verify that ReadsRE can effectively retrieve meaningful sentences (i.e., denoise), and relieve the problem of long-tail entity pairs in the original dataset through incorporating external open-domain corpus. Through comparisons, we show ReadsRE outperforms other baselines for this task.
Abstract: Co-clustering of document-term matrices has proved to be more effective than one-sided clustering. By their nature, text data are also generally unbalanced and directional. Recently, the von Mises-Fisher (vMF) mixture model was proposed to handle unbalanced data while harnessing the directional nature of text. In this paper we propose a novel co-clustering approach based on a matrix formulation of vMF model-based co-clustering. This formulation leads to a flexible method for text co-clustering that can easily incorporate both word-word semantic relationships and document-document similarities. By contrast with existing methods, which generally use an additive incorporation of similarities, we propose a dual multiplicative regularization that better encapsulates the underlying text data structure. Extensive evaluations on various real-world text datasets demonstrate the superior performance of our proposed approach over baseline and competitive methods, both in terms of clustering results and co-cluster topic coherence.
Abstract: Click-through rate (CTR) prediction aims to recall the advertisements that users are interested in and to lead users to click, which is of critical importance for a variety of online advertising systems. In practice, CTR prediction is generally formulated as a conventional binary classification problem, where the clicked advertisements are positive samples and the others are negative samples. However, directly treating unclicked advertisements as negative samples suffers from a severe label noise issue, since there are many reasons why users are interested in certain advertisements but do not click them. To address this serious issue, we propose a reinforcement learning based noise filtering approach, dubbed RLNF, which employs a noise filter to select effective negative samples. In RLNF, the selected effective negative samples are used to enhance the CTR prediction model, and meanwhile the effectiveness of the noise filter is enhanced through reinforcement learning using the performance of the CTR prediction model as the reward. By alternating the enhancement of the noise filter and of the CTR prediction model, the performance of both is improved. In our experiments, we equip 7 state-of-the-art CTR prediction models with RLNF. Extensive experiments on a public dataset and an industrial dataset show that RLNF significantly improves the performance of all 7 CTR prediction models, indicating both the effectiveness and the generality of RLNF.
Abstract: Graph Neural Networks (GNNs) have achieved state-of-the-art performance in many high-impact applications such as fraud detection, information retrieval, and recommender systems due to their powerful representation learning capabilities. Some nascent efforts have been concentrated on simplifying the structures of GNN models, in order to reduce the computational complexity. However, the dynamic nature of these applications requires GNN structures to be evolving over time, which has been largely overlooked so far. To bridge this gap, in this paper, we propose a simplified and dynamic graph neural network model, called SDG. It is efficient, effective, and provides interpretable predictions. In particular, in SDG, we replace the traditional message-passing mechanism of GNNs with the designed dynamic propagation scheme based on the personalized PageRank tracking process. We conduct extensive experiments and ablation studies to demonstrate the effectiveness and efficiency of our proposed SDG. We also design a case study on fake news detection to show the interpretability of SDG.
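The personalized-PageRank-style propagation that replaces message passing can be sketched in a few lines (a generic formulation; SDG's tracking of the dynamically changing graph is more involved than this static version):

.. code-block:: python

   import numpy as np

   def ppr_propagate(adj, features, alpha=0.15, n_iter=50):
       """Propagate node features with a personalized-PageRank-style scheme:
       Z <- alpha * X + (1 - alpha) * A_hat @ Z, where A_hat is the row-normalized adjacency.
       """
       deg = adj.sum(axis=1, keepdims=True)
       a_hat = adj / np.maximum(deg, 1)          # row-normalized adjacency
       z = features.copy()
       for _ in range(n_iter):
           z = alpha * features + (1 - alpha) * a_hat @ z
       return z

   adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
   x = np.eye(3)
   print(ppr_propagate(adj, x))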
Abstract: Searching in a domain-specific corpus of structured documents (e.g., e-commerce, media streaming services, job-seeking platforms) is often managed as a traditional retrieval task or through faceted search. Semantic Query Labeling --- the task of locating the constituent parts of a query and assigning domain-specific predefined semantic labels to each of them --- allows leveraging the structure of documents during retrieval while leaving the keyword-based query formulation unaltered. Due to both the lack of a publicly available dataset and the high cost of producing one, there have been few published works in this regard. In this paper, based on the assumption that a corpus already contains the information the users search for, we propose a method for the automatic generation of semantically labeled queries and show that a semantic tagger --- based on BERT, gazetteer-based features, and Conditional Random Fields --- trained on our synthetic queries achieves results comparable to those obtained by the same model trained on real-world data. We also provide a large dataset of manually annotated queries in the movie domain suitable for studying Semantic Query Labeling. We hope that the public availability of this dataset will stimulate future research in this area.
Abstract: Leaderboards are a ubiquitous part of modern research in applied machine learning. By design, they sort entries into some linear order, where the top-scoring entry is recognized as the "state of the art" (SOTA). Due to the rapid progress being made today, particularly with neural models, the top entry in a leaderboard is replaced with some regularity. These are touted as improvements in the state of the art. Such pronouncements, however, are almost never qualified with significance testing. In the context of the MS MARCO document ranking leaderboard, we pose a specific question: How do we know if a run is significantly better than the current SOTA? Against the backdrop of recent IR debates on scale types, our study proposes an evaluation framework that explicitly treats certain outcomes as distinct and avoids aggregating them into a single-point metric. Empirical analysis of SOTA runs from the MS MARCO document ranking leaderboard reveals insights about how one run can be "significantly better" than another that are obscured by the current official evaluation metric (MRR@100).
Abstract: In neural Information Retrieval, ongoing research is directed towards improving the first retriever in ranking pipelines. Learning dense embeddings to conduct retrieval using efficient approximate nearest neighbors methods has proven to work well. Meanwhile, there has been a growing interest in learning sparse representations for documents and queries, that could inherit from the desirable properties of bag-of-words models such as the exact matching of terms and the efficiency of inverted indexes. In this work, we present a new first-stage ranker based on explicit sparsity regularization and a log-saturation effect on term weights, leading to highly sparse representations and competitive results with respect to state-of-the-art dense and sparse methods. Our approach is simple, trained end-to-end in a single stage. We also explore the trade-off between effectiveness and efficiency, by controlling the contribution of the sparsity regularization.
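The log-saturation effect and the sparsity penalty on term weights can be pictured with a small sketch (our own simplified formulation over a toy vocabulary; the actual model derives the term logits from a BERT-style masked-language-model head, which we do not reproduce here):

.. code-block:: python

   import torch

   def term_weights(logits):
       """Log-saturated, non-negative term weights over the vocabulary."""
       return torch.log1p(torch.relu(logits))

   def flops_regularizer(weights):
       """A FLOPS-style sparsity penalty: sum over terms of the squared mean
       activation within the batch, pushing average per-term usage towards zero."""
       return (weights.mean(dim=0) ** 2).sum()

   logits = torch.randn(4, 1000)   # batch of 4 "documents", toy vocabulary of 1000 terms
   w = term_weights(logits)
   print(w.gt(0).float().mean().item(), flops_regularizer(w).item())

Scaling the regularizer up or down then trades effectiveness against the sparsity (and hence efficiency) of the learned representations, which is the trade-off the abstract refers to.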
Abstract: Technology Assisted Review (TAR) aims to minimise the manual judgements required to identify relevant documents. Reductions in workload depend on a reviewer being able to make an informed decision about when to stop examining documents. Counting processes offer a theoretically sound approach to creating stopping criteria for TAR approaches based on analysis of the rate at which relevant documents are observed. This paper introduces two modifications to existing approaches: application of a Cox Process (a counting process which has not previously been used for this problem) and use of a rate function based on a power law. Experiments on the CLEF 2017 e-Health TAR collection demonstrate that these approaches produce results superior to those reported previously.
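As background for the rate-function modification, a power-law rate over document rank can be written as follows (an illustrative form; the paper's exact parameterisation and the Cox-process machinery are richer than this):

.. math::

   \lambda(i) = a\, i^{-b},
   \qquad
   \mathbb{E}\big[\#\{\text{relevant documents in ranks } 1,\dots,n\}\big]
   = \sum_{i=1}^{n} \lambda(i) \;\approx\; \int_{1}^{n} a\, x^{-b}\, dx,

so fitting a and b to the relevant documents seen so far yields an estimate of how many relevant documents remain beyond rank n, which can then drive the stopping decision.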
Abstract: Deep hashing methods have been intensively studied and successfully applied to massive, fast image retrieval. However, inheriting the deficiencies of deep neural networks, deep hashing models can be easily fooled by adversarial examples, which brings a serious security risk to hashing based retrieval. In this paper, we propose a novel targeted attack method and the first defense scheme for deep hashing based retrieval. Specifically, a simple yet effective PrototypeNet is designed to generate a category-level semantic embedding (dubbed prototype code) regarded as the semantic representative of the target label, which preserves semantic similarity with relevant labels and dissimilarity with irrelevant labels. Subsequently, we conduct the targeted attack by minimizing the Hamming distance between the hash code of the adversarial sample and the prototype code. Moreover, we provide an adversarial training algorithm to improve the adversarial robustness of deep hashing networks. Extensive experiments demonstrate that our method can produce high-quality adversarial samples with superior targeted attack performance over state-of-the-art methods. Importantly, our adversarial defense framework can significantly boost the robustness of hashing networks against adversarial attacks on deep hashing based retrieval. The code is available at https://github.com/xunguangwang/Targeted-Attack-and-Defense-for-Deep-Hashing.
Abstract: We propose VADEC, a multi-task framework that exploits the correlation between the categorical and dimensional models of emotion representation for better subjectivity analysis. Focusing primarily on the effective detection of emotions from tweets, we jointly train multi-label emotion classification and multi-dimensional emotion regression, thereby utilizing the inter-relatedness between the tasks. Co-training especially helps in improving the performance of the classification task as we outperform the strongest baselines with 3.4%, 11%, and 3.9% gains in Jaccard Accuracy, Macro-F1, and Micro-F1 scores respectively on the AIT dataset [17]. We also achieve state-of-the-art results with 11.3% gains averaged over six different metrics on the SenWave dataset [27]. For the regression task, VADEC, when trained with SenWave, achieves 7.6% and 16.5% gains in Pearson Correlation scores over the current state-of-the-art on the EMOBANK dataset [5] for the Valence (V) and Dominance (D) affect dimensions respectively. We conclude our work with a case study on COVID-19 tweets posted by Indians that further helps in establishing the efficacy of our proposed solution.
Abstract: Unsupervised ensemble learning aims to estimate ground-truth labels by integrating noisy and unreliable labeling results from multiple annotators. Although many techniques have been proposed to deal with this challenging task, there still exist some "tough" instances with noisy labels that are misclassified after the integration, which significantly affects the classification performance. This paper introduces a novel approach to improve label accuracy based on unsupervised ensemble learning. First, we apply the expectation maximization (EM) algorithm to aggregate labels for all the instances. Then we identify the instances that are most likely to be "tough" through a two-stage filtering method. Finally, an ensemble of AdaBoost-based classification models is trained on the high-quality dataset and predicts new labels for these "tough" instances. The results of an empirical investigation on a binary classification task show that: (1) our approach can identify "tough" instances from the input dataset effectively; (2) our approach achieves better performance in improving the accuracy of labels produced by unsupervised ensemble algorithms.
Abstract: Supervised summarization has made significant improvements in recent years by leveraging cutting-edge deep learning technologies. However, the success of supervised methods relies on the availability of a large quantity of human-generated summaries of documents, which is highly costly and difficult to obtain in general. This paper proposes an unsupervised approach to extractive text summarization, which uses an automatically constructed sentence graph from each document to select salient sentences for summarization based on both the similarities and the relative distances in the neighborhood of each sentence. We further generalize our approach from single-document summarization to a multi-document setting by aggregating document-level graphs via proximity-based cross-document edges. In our experiments on benchmark datasets, the proposed approach achieves competitive or better results than previous state-of-the-art unsupervised extractive summarization methods in both single-document and multi-document settings, and its performance is competitive with strong supervised baselines.
Abstract: Searchers often make a choice in a matter of seconds on SERPs. As the result of a dynamic cognitive process, choice is ultimately reflected in motor movement and can thus be modeled by tracking the computer mouse. However, because not all movements have equal value, it is important to understand how they and, critically, their sequence length impact model performance. We study three different SERP scenarios where searchers (1) noticed an advertisement, (2) abandoned the page, and (3) became frustrated. We model these scenarios with recurrent neural nets and study the effect of padding and truncating mouse sequences to different lengths. We find that it is sometimes possible to predict the aforementioned tasks using just 2 seconds of movement. Ultimately, by efficiently recording the right amount of data, we can save valuable bandwidth and storage, respect users' privacy, and increase the speed at which machine learning models can be trained and deployed. Considering the web scale, doing so will have a net benefit on our environment.
Abstract: BlockMax WAND (BMW) and its variants can effectively prune low-scoring documents for fast top-k disjunctive query processing. This paper studies a boosting approach that further accelerates document retrieval by executing BMW, or one of its variants, on a sequence of posting windows whose order is prioritized to tighten the threshold bound earlier. This optimization adds opportunities to safely eliminate more of the operations involved in posting-block visitation and document score evaluation. This paper evaluates such index navigation for BMW and two of its variants.
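A stripped-down view of the window-ordering idea follows. The data structures are hypothetical and, for simplicity, each document's full score is assumed to live in a single window; real BMW operates on per-term block-max posting lists with accumulated scores and safe skipping rules, which this sketch does not model:

.. code-block:: python

   import heapq

   def prioritized_window_processing(windows, k):
       """Process posting windows in descending order of their score upper bound,
       so the top-k threshold tightens as early as possible and later windows whose
       bound falls below it can be skipped.

       windows: list of (upper_bound, [(doc_id, score), ...]) tuples
       """
       topk, skipped = [], 0
       for bound, postings in sorted(windows, key=lambda w: -w[0]):
           if len(topk) == k and bound <= topk[0][0]:
               skipped += 1                     # bound cannot beat the current threshold
               continue
           for doc_id, score in postings:
               if len(topk) < k:
                   heapq.heappush(topk, (score, doc_id))
               elif score > topk[0][0]:
                   heapq.heapreplace(topk, (score, doc_id))
       return sorted(topk, reverse=True), skipped

   wins = [(0.9, [(1, 0.8), (2, 0.5)]), (0.4, [(3, 0.3)]), (0.2, [(4, 0.1)])]
   print(prioritized_window_processing(wins, k=2))   # the two low-bound windows are skipped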
Abstract: Named entity processing over historical texts is increasingly used due to the massive numbers of documents and archives being stored in digital libraries. However, because annotated resources of a historical nature are scarce, information extraction performance falls behind that on contemporary texts. In this paper, we introduce the development of the NewsEye resource, a multilingual dataset for named entity recognition and linking enriched with stances towards named entities. The dataset comprises diachronic historical newspaper material published between 1850 and 1950 in French, German, Finnish, and Swedish. Such a historical resource is essential in the context of developing and evaluating named entity processing systems. It also allows enhancing the performance of existing approaches on historical documents, which enables adequate and efficient semantic indexing of historical documents in digital cultural heritage collections.
Abstract: Deep Learning Hard (DL-HARD) is a new annotated dataset designed to more effectively evaluate neural ranking models on complex topics. It builds on TREC Deep Learning (DL) topics by extensively annotating them with question intent categories, answer types, wikified entities, topic categories, and result type metadata from a commercial web search engine. Based on this data, we introduce a framework for identifying challenging queries. DL-HARD contains fifty topics from the official DL 2019/2020 evaluation benchmark, half of which are newly and independently assessed. We perform experiments using the official submitted runs to DL on DL-HARD and find substantial differences in metrics and the ranking of participating systems. Overall, DL-HARD is a new resource that promotes research on neural ranking methods by focusing on challenging and complex topics.
Abstract: Legal case retrieval is of vital importance for ensuring justice in different kinds of law systems and has recently received increasing attention in information retrieval (IR) research. However, the relevance judgment criteria of previous retrieval datasets are either not applicable to non-cited relationship cases or not instructive enough for future datasets to follow. Besides, most existing benchmark datasets do not focus on the selection of queries. In this paper, we construct the Chinese Legal Case Retrieval Dataset (LeCaRD), which contains 107 query cases and over 43,000 candidate cases. Queries and results are adopted from criminal cases published by the Supreme People's Court of China. In particular, to address the difficulty in relevance definition, we propose a series of relevance judgment criteria designed by our legal team and corresponding candidate case annotations are conducted by legal experts. Also, we develop a novel query sampling strategy that takes both query difficulty and diversity into consideration. For dataset evaluation, we implemented several existing retrieval models on LeCaRD as baselines. The dataset is now available to the public together with the complete data processing details.
Abstract: In information retrieval (IR), documents that match the query are retrieved. Search engines usually conflate word variants into a common stem when indexing documents because queries and documents do not need to use exactly the same word variant for the documents to be relevant. Stemmers are known to be effective in many languages for IR. However, there are still languages where stemmers or morphological analyzers are missing; this is the case for Amharic which is the working language of Ethiopia. Morphological analysis is the key to derive stems, roots (primary lexical units) and grammatical markers of words such as person, tense and negation markers. This paper presents morphologically annotated Amharic lexicons as well as stem-based and root-based morphologically annotated corpora which could be used by the research community as benchmark collections either to evaluate morphological analyzers or information retrieval for Amharic. Such resources are believed to foster research in Amharic IR.
Abstract: Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. Around this toolkit, our group has built a culture of reproducibility through shared norms and tools that enable rigorous automated testing.
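For readers unfamiliar with the toolkit, sparse first-stage retrieval with a pre-built index looks roughly like this; the API names follow the Pyserini documentation from this period and may differ in later releases:

.. code-block:: python

   from pyserini.search import SimpleSearcher

   # download a pre-built BM25 index and run bag-of-words retrieval
   searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
   hits = searcher.search('what is information retrieval', k=10)
   for i, hit in enumerate(hits):
       print(f'{i + 1:2} {hit.docid:15} {hit.score:.4f}')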
Abstract: Experimental validation is key to the development of Information Retrieval (IR) systems. The standard evaluation paradigm requires a test collection with documents, queries, and relevance judgments. Creating test collections requires significant human effort, mainly for providing relevance judgments. As a result, there are still many domains and languages that, to this day, lack a proper evaluation testbed. Portuguese is an example of a major world language that has been overlooked in terms of IR research -- the only test collection available is composed of news articles from 1994 and a hundred queries. With the aim of bridging this gap, in this paper, we developed REGIS (Retrieval Evaluation for Geoscientific Information Systems), a test collection for the geoscientific domain in Portuguese. REGIS contains 20K documents and 34 query topics along with relevance assessments. We describe the procedures for document collection, topic creation, and relevance assessment. In addition, we report on results of standard IR techniques on REGIS so that they can serve as a baseline for future research.
Abstract: The TREC Deep Learning (DL) Track studies ad hoc search in the large data regime, meaning that a large set of human-labeled training data is available. Results so far indicate that the best models with large data may be deep neural networks. This paper supports the reuse of the TREC DL test collections in three ways. First we describe the data sets in detail, documenting clearly and in one place some details that are otherwise scattered in track guidelines, overview papers and in our associated MS MARCO leaderboard pages. We intend this description to make it easy for newcomers to use the TREC DL data. Second, because there is some risk of iteration and selection bias when reusing a data set, we describe the best practices for writing a paper using TREC DL data, without overfitting. We provide some illustrative analysis. Finally we address a number of issues around the TREC DL data, including an analysis of reusability.
Abstract: In IR evaluation based on depth-k pooling, there are several strategies to order the pooled documents for relevance assessors. Among them, the simplest approach is to completely randomise the order "so assessors cannot tell if a document was highly ranked by some system or how many systems (or which systems) retrieved the document." An approach that is in sharp contrast to the above is the prioritisation approach taken by NTCIRPOOL, a tool widely used at NTCIR. NTCIRPOOL sorts the pooled documents by "pseudorelevance," a statistic that reflects the popularity of each document within the depth-k pools. Although these two strategies have coexisted for over two decades, the IR research community has yet to reach a consensus as to what advantages each of these two strategies actually offer. To help researchers directly address this question using their favourite methods of analysis, we have released a large-scale data set called WWW3E8. It comprises eight independent sets of qrels for the 160 English topics of the NTCIR-15 WWW-3 task: four qrels files constructed using the randomisation approach, and another four constructed using the prioritisation approach of NTCIRPOOL. Each qrels file covers 32,375 topic-document pairs; hence, WWW3E8 contains a total of 259,000 relevance labels. Moreover, the data set contains the raw English subtask run files from the WWW-3 task, the randomised and prioritised pool files, and topic-by-run score matrices of the official measures used in the task. Hence, researchers interested in the above research question regarding document ordering can utilise WWW3E8 as a common ground to directly compare the two strategies.
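The two orderings being compared can be reproduced in a few lines, taking pseudorelevance simply as the number of pooled runs that retrieved a document (a simplification; NTCIRPOOL's exact statistic may weight ranks differently):

.. code-block:: python

   import random
   from collections import Counter

   def order_pool(run_rankings, depth, prioritise=True, seed=0):
       """run_rankings: list of ranked docid lists, one per run.
       Pool the top-`depth` documents of every run, then order them either by
       pseudorelevance (how many runs retrieved them) or completely at random."""
       pooled = Counter()
       for ranking in run_rankings:
           pooled.update(set(ranking[:depth]))
       docs = list(pooled)
       if prioritise:
           docs.sort(key=lambda d: -pooled[d])       # prioritisation approach
       else:
           random.Random(seed).shuffle(docs)         # randomisation approach
       return docs

   runs = [['d1', 'd2', 'd3'], ['d2', 'd4', 'd1'], ['d2', 'd5', 'd6']]
   print(order_pool(runs, depth=2))                  # d2 first: retrieved by all three runs
   print(order_pool(runs, depth=2, prioritise=False))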
Abstract: AMUSE (Advanced MUSic Explorer) was created in 2006 as an open-source Java framework for various music information retrieval tasks like feature extraction, feature processing, classification, and evaluation. In contrast to toolboxes which focus on individual MIR-related algorithms, it is possible with AMUSE, for instance, to extract features with Librosa, process them based on events estimated by MIRtoolbox, classify with WEKA or Keras, and validate the models with one's own classification performance measures. We present several substantial contributions to AMUSE since its first presentation at ISMIR 2010. They include the annotation editor for single and multiple tracks, the support of multi-label and multi-class classification, and new plugins which operate with Keras, Librosa, and Sonic Annotator. Other integrated methods are the structural complexity processing, the chord vector feature, the aggregation of features around estimated onset events, and the evaluation of time event extractors. Further advancements include more flexible feature extraction with different parameters like frame sizes, the possibility to integrate additional tasks beyond algorithms related to supervised classification, the marking of features which can be ignored for a classification task, the extension of algorithm parameters with external code (e.g., the structure of a Keras neural net), etc.
Abstract: Machine understanding of user utterances in conversational systems is of utmost importance for enabling engaging and meaningful conversations with users. Entity Linking (EL) is one of the means of text understanding, with proven efficacy for various downstream tasks in information retrieval. In this paper, we study entity linking for conversational systems. To develop a better understanding of what EL in a conversational setting entails, we analyze a large number of dialogues from existing conversational datasets and annotate references to concepts, named entities, and personal entities using crowdsourcing. Based on the annotated dialogues, we identify the main characteristics of conversational entity linking. Further, we report on the performance of traditional EL systems on our Conversational Entity Linking dataset, ConEL, and present an extension to these methods to better fit the conversational setting. The resources released with this paper include annotated datasets, detailed descriptions of crowdsourcing setups, as well as the annotations produced by various EL systems. These new resources allow for an investigation of how the role of entities in conversations is different from that in documents or isolated short text utterances like queries and tweets, and complement existing conversational datasets.
Abstract: The amount of near-duplicates in web crawls like the ClueWeb or Common Crawl demands from their users either to develop a preprocessing pipeline for deduplication, which is costly both computationally and in person hours, or to accept the undesired effects that near-duplicates have on the reliability and validity of experiments. We introduce ChatNoir-CopyCat-21, which simplifies deduplication significantly. It comes in two parts: (1) a compilation of near-duplicate documents within the ClueWeb09, the ClueWeb12, and two Common Crawl snapshots, as well as between selections of these crawls, and (2) a software library that implements the deduplication of arbitrary document sets. Our analysis shows that 14-52% of the documents within a crawl and around 0.7-2.5% between the crawls are near-duplicates. Two showcases demonstrate the application and usefulness of our resource.
Abstract: Recommender systems have been shown to be an effective way to alleviate the over-choice problem and provide accurate and tailored recommendations. However, the impressive number of proposed recommendation algorithms, splitting strategies, evaluation protocols, metrics, and tasks has made rigorous experimental evaluation particularly challenging. Puzzled and frustrated by the continuous recreation of appropriate evaluation benchmarks, experimental pipelines, hyperparameter optimization, and evaluation procedures, we have developed an exhaustive framework to address such needs. Elliot is a comprehensive recommendation framework that aims to run and reproduce an entire experimental pipeline by processing a simple configuration file. The framework loads, filters, and splits the data considering a vast set of strategies (13 splitting methods and 8 filtering approaches, from temporal training-test splitting to nested K-fold cross-validation). Elliot (https://github.com/sisinflab/elliot) optimizes hyperparameters (51 strategies) for several recommendation algorithms (50), selects the best models, compares them with the baselines providing intra-model statistics, computes metrics (36) spanning from accuracy to beyond-accuracy, bias, and fairness, and conducts statistical analysis (Wilcoxon and paired t-test).
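As a minimal sketch of the configuration-driven workflow described above: the run_experiment entry point follows Elliot's documented usage, while the configuration file path is hypothetical and its contents should follow the schema in the Elliot documentation.

.. code-block:: python

    # Minimal sketch of driving an Elliot experiment from Python (illustrative only).
    # The referenced YAML file is hypothetical; its keys (dataset, splitting, models,
    # metrics, top_k, ...) must follow the schema documented by Elliot.
    from elliot.run import run_experiment

    run_experiment("config_files/my_experiment.yml")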
Abstract: There is increasing recognition of the need for human-centered AI that learns from human feedback. However, most current AI systems focus more on the model design, but less on human participation as part of the pipeline. In this work, we propose a Human-in-the-Loop (HitL) graph reasoning paradigm and develop a corresponding dataset named HOOPS for the task of KG-driven conversational recommendation. Specifically, we first construct a KG interpreting diverse user behaviors and identify pertinent attribute entities for each user-item pair. Then we simulate conversational turns that reflect the human decision-making process of choosing suitable items by transparently tracing the KG structure. We also provide a benchmark method with reported performance on the dataset to ascertain the feasibility of HitL graph reasoning for recommendation, and show that it provides novel opportunities for the research community.
Abstract: Shared text collections continue to be vital infrastructure for IR research. The COVID-19 pandemic offered an opportunity to create a test collection that captured the rapidly changing information space during a pandemic, and the TREC-COVID effort was created to build such a collection using the TREC framework. This paper examines the quality of the resulting TREC-COVID test collections, and in doing so, offers a critique of the state-of-the-art in building reusable IR test collections. The largest of the collections, called 'TREC-COVID Complete', is found to be on par with previous TREC ad hoc collections, with existing quality tests uncovering no apparent problems. Yet the lack of any way to definitively demonstrate the collection's quality and its violation of previously used quality heuristics suggest much work remains to be done to understand the factors affecting collection quality.
Abstract: Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on textual datasets used for ad-hoc search. This tool provides both a Python and command line interface to numerous IR datasets and benchmarks. To our knowledge, this is the most extensive tool of its kind. Integrations with popular IR indexing and experimentation toolkits demonstrate the tool's utility. We also provide documentation of these datasets through the ir_datasets catalog: https://ir-datasets.com/. The catalog acts as a hub for information on datasets used in IR, providing core information about what data each benchmark provides as well as links to more detailed information. We welcome community contributions and intend to continue to maintain and grow this tool.
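As a concrete illustration of the Python interface mentioned above, here is a minimal sketch of loading a dataset and iterating over its queries and relevance judgments; "msmarco-passage/dev/small" is just one of the many identifiers listed in the catalog.

.. code-block:: python

    # Minimal sketch of the ir_datasets Python API; see https://ir-datasets.com/ for the
    # full catalog of dataset identifiers and per-dataset field descriptions.
    import ir_datasets

    dataset = ir_datasets.load("msmarco-passage/dev/small")

    for query in dataset.queries_iter():
        print(query.query_id, query.text)     # namedtuple fields vary by dataset

    for qrel in dataset.qrels_iter():
        print(qrel.query_id, qrel.doc_id, qrel.relevance)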
Abstract: Wikipedia is the largest online encyclopedia, used by algorithms and web users as a central hub of reliable information on the web. The quality and reliability of Wikipedia content is maintained by a community of volunteer editors. Machine learning and information retrieval algorithms could help scale up editors' manual efforts around Wikipedia content reliability. However, there is a lack of large-scale data to support the development of such research. To fill this gap, in this paper, we propose Wiki-Reliability, the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues. To build this dataset, we rely on Wikipedia "templates". Templates are tags used by expert Wikipedia editors to indicate content issues, such as the presence of "non-neutral point of view" or "contradictory articles", and serve as a strong signal for detecting reliability issues in a revision. We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative with respect to each template. Each positive/negative example in the dataset comes with the full article text and 20 features from the revision's metadata. We provide an overview of the possible downstream tasks enabled by such data, and show that Wiki-Reliability can be used to train large-scale models for content reliability prediction. We release all data and code for public use.
Abstract: The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large high-quality visio-linguistic datasets for learning complementary information across image and text modalities. In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.5 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset by the number of image-text examples, by a factor of 3x (at the time of writing). Second, WIT is massively multilingual (first of its kind) with coverage over 100+ languages (each of which has at least 12K examples) and provides cross-lingual texts for many images. Third, WIT represents a more diverse set of concepts and real world entities relative to what previous datasets cover. Lastly, WIT provides a very challenging real-world test set, as we empirically illustrate using an image-text retrieval task as an example. WIT Dataset is available for download and use via a Creative Commons license here: https://github.com/google-research-datasets/wit.
Abstract: This paper introduces a new test collection for ad-hoc dataset retrieval, which has been developed through a shared task called Data Search at the fifteenth NTCIR. This test collection consists of dataset collections derived from the US and Japanese governments' open data sites (i.e., Data.gov and e-Stat), as well as English and Japanese topics for these collections. In organizing the shared task at NTCIR, we conducted relevance judgments for datasets retrieved by 74 search systems, and included them in the test collection. In addition to the detailed description of the test collection, we conducted an in-depth analysis on the test collection, and revealed (1) what techniques were used and effective, (2) what topics were difficult, and (3) that there is large topic variability in the dataset retrieval task.
Abstract: We introduce a novel dataset of real multi-destination trips booked through Booking.com's online travel platform. The dataset consists of 1.5 million reservations representing 359,000 unique journeys made across 39,000 destinations. As such, the data is particularly well suited to model sequential recommendation and retrieval problems in a high cardinality target space. To preserve user privacy and protect business-sensitive statistics, the data is fully anonymized, sampled and limited to five user origin markets. Even so, the dataset is representative of the general travel purchase behavior and therefore presents a uniquely valuable resource for Machine Learning and information retrieval researchers. This work provides an overview of the dataset. It reports several benchmark results for relevant recommendation problems obtained as part of the recently held Booking.com data challenge during the WSDM WebTour workshop.
Abstract: Recently, research on explainable recommender systems has drawn much attention from both academia and industry, resulting in a variety of explainable models. As a consequence, their evaluation approaches vary from model to model, which makes it quite difficult to compare the explainability of different models. To achieve a standard way of evaluating recommendation explanations, we provide three benchmark datasets for EXplanaTion RAnking (denoted as EXTRA), on which explainability can be measured by ranking-oriented metrics. Constructing such datasets, however, poses great challenges. First, user-item-explanation triplet interactions are rare in existing recommender systems, so how to find alternatives becomes a challenge. Our solution is to identify nearly identical sentences from user reviews. This idea then leads to the second challenge, i.e., how to efficiently categorize the sentences in a dataset into different groups, since estimating the similarity between all pairs of sentences has quadratic runtime complexity. To mitigate this issue, we provide a more efficient method based on Locality Sensitive Hashing (LSH) that can detect near-duplicates in sub-linear time for a given query. Moreover, we make our code publicly available to allow researchers in the community to create their own datasets.
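To make the LSH step concrete, here is a generic near-duplicate grouping sketch using MinHash LSH via the datasketch library; it illustrates the sub-linear querying technique in general and is not the authors' implementation.

.. code-block:: python

    # Generic MinHash-LSH near-duplicate detection (illustrative; not EXTRA's code).
    from datasketch import MinHash, MinHashLSH

    sentences = {
        "s1": "great battery life and sound quality",
        "s2": "battery life and sound quality are great",
        "s3": "the screen scratches far too easily",
    }

    def minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf8"))
        return m

    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    signatures = {sid: minhash(text) for sid, text in sentences.items()}
    for sid, sig in signatures.items():
        lsh.insert(sid, sig)

    # Querying returns candidate near-duplicates without comparing against every sentence.
    print(lsh.query(signatures["s1"]))        # e.g., ['s1', 's2']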
Abstract: Natural language dialogue systems have attracted great attention recently. As many dialogue models are data-driven, high-quality datasets are essential to these systems. In this paper, we introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively. To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization, deduplication, segmentation, and filtering. The scale of Pchatbot is significantly larger than existing Chinese datasets, which might benefit data-driven models. Besides, current dialogue datasets for personalized chatbots usually contain several persona sentences or attributes. Different from existing datasets, Pchatbot provides anonymized user IDs and timestamps for both posts and responses. This enables the development of personalized dialogue models that directly learn implicit user personality from the user's dialogue history. Our preliminary experimental study benchmarks several state-of-the-art dialogue models to provide a comparison for future work. The dataset can be publicly accessed at Github: https://github.com/qhjqhj00/Pchatbot.
Abstract: This paper presents a test collection for contextual point of interest (POI) recommendation in a narrative-driven scenario. There, user history is not available; instead, user requests are described in natural language. The requests in our collection are manually collected from social sharing websites, and are annotated with various types of metadata, including location, categories, constraints, and example POIs. These requests are to be resolved from a dataset of POIs, which are collected from a popular online directory, and are further linked to a geographical knowledge base and enriched with relevant web snippets. Graded relevance assessments are collected using crowdsourcing, by pooling both manual and automatic recommendations, where the latter serve as baselines for future performance comparison. This resource supports the development of novel approaches for end-to-end POI recommendation as well as for specific semantic annotation tasks on natural language requests.
Abstract: The harvesting, management, and analysis of thematic document collections is a major challenge in a wide variety of applications. While the criteria for compiling such collections are individual, the entire process is largely standardized. Therefore, it is not efficient to build new systems over and over again to take over these tasks. In this work, we introduce Seer-Dock, a novel and easy-to-deploy general-purpose dockerized framework to build a scholarly document harvesting and management system. It is based on CiteSeerX, the most widely used scholarly search engine. Seer-Dock uses Docker containers for all components and thus enables its users to rapidly deploy a full-fledged document collection and management system on any operating system platform and tailor it to the specific needs of an application domain. Moreover, it is easy to scale, orchestrate, maintain, and recover. In this resource paper, we introduce the architecture of Seer-Dock and its components. Like its kernel CiteSeerX, Seer-Dock is available under an Apache 2 open source license.
Abstract: Multimodal IR, spanning text corpus, knowledge graph and images, called outside knowledge visual question answering (OKVQA), is of much recent interest. However, the popular data set has serious limitations. A surprisingly large fraction of queries do not assess the ability to integrate cross-modal information. Instead, some are independent of the image, some depend on speculation, some require OCR or are otherwise answerable from the image alone. To add to the above limitations, frequency-based guessing is very effective because of (unintended) widespread answer overlaps between the train and test folds. Overall, it is hard to determine when state-of-the-art systems exploit these weaknesses rather than really infer the answers, because they are opaque and their 'reasoning' process is uninterpretable. An equally important limitation is that the dataset is designed for the quantitative assessment only of the end-to-end answer retrieval task, with no provision for assessing the correct (semantic) interpretation of the input query. In response, we identify a key structural idiom in OKVQA, viz., S3 (select, substitute and search), and build a new data set and challenge around it. Specifically, the questioner identifies an entity in the image and asks a question involving that entity which can be answered only by consulting a knowledge graph or corpus passage mentioning the entity. Our challenge consists of (i) OKVQA_S3, a subset of OKVQA annotated based on the structural idiom and (ii) S3VQA, a new dataset built from scratch. We also present a neural but structurally transparent OKVQA system, S3, that explicitly addresses our challenge dataset, and outperforms recent competitive baselines. We make our code and data available at https://s3vqa.github.io/.
Abstract: Evaluation is crucial in the development process of task-oriented dialogue systems. As an evaluation method, user simulation allows us to tackle issues such as scalability and cost-efficiency, making it a viable choice for large-scale automatic evaluation. To help build a human-like user simulator that can measure the quality of a dialogue, we propose the following task: simulating user satisfaction for the evaluation of task-oriented dialogue systems. The purpose of the task is to increase the evaluation power of user simulations and to make the simulation more human-like. To overcome a lack of annotated data, we propose a user satisfaction annotation dataset, USS, that includes 6,800 dialogues sampled from multiple domains, spanning real-world e-commerce dialogues, task-oriented dialogues constructed through Wizard-of-Oz experiments, and movie recommendation dialogues. All user utterances in those dialogues, as well as the dialogues themselves, have been labeled based on a 5-level satisfaction scale. We also share three baseline methods for user satisfaction prediction and action prediction tasks. Experiments conducted on the USS dataset suggest that distributed representations outperform feature-based methods. A model based on hierarchical GRUs achieves the best performance in in-domain user satisfaction prediction, while a BERT-based model has better cross-domain generalization ability.
Abstract: Click logs are valuable resources for a variety of information retrieval (IR) tasks. This includes query understanding/analysis, as well as learning effective IR models particularly when the models require large amounts of training data. We release a large-scale domain-specific dataset of click logs, obtained from user interactions of the Trip Database health web search engine. Our click log dataset comprises approximately 5.2 million user interactions collected between 2013 and 2020. We use this dataset to create a standard IR evaluation benchmark - TripClick - with around 700,000 unique free-text queries and 1.3 million pairs of query-document relevance signals, whose relevance is estimated by two click-through models. As such, the collection is one of the few datasets offering the necessary data richness and scale to train neural IR models with a large amount of parameters, and notably the first in the health domain. Using TripClick, we conduct experiments to evaluate a variety of IR models, showing the benefits of exploiting this data to train neural architectures. In particular, the evaluation results show that the best performing neural IR model significantly improves the performance by a large margin relative to classical IR models, especially for more frequent queries.
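As a simplified illustration of how click logs can be turned into query-document relevance signals, the following computes a naive click-through-rate estimate; the actual TripClick benchmark uses two dedicated click models, not this heuristic.

.. code-block:: python

    # Naive CTR-based relevance estimation from raw click logs (illustrative only).
    from collections import defaultdict

    log = [  # (query, shown_doc, clicked)
        ("back pain", "doc1", True),
        ("back pain", "doc1", False),
        ("back pain", "doc2", False),
    ]

    shows = defaultdict(int)
    clicks = defaultdict(int)
    for query, doc, clicked in log:
        shows[(query, doc)] += 1
        clicks[(query, doc)] += int(clicked)

    ctr = {pair: clicks[pair] / shows[pair] for pair in shows}
    print(ctr)   # {('back pain', 'doc1'): 0.5, ('back pain', 'doc2'): 0.0}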
Abstract: We describe the development, characteristics and availability of a test collection for the task of Web table retrieval, which uses the large-scale Web Table Corpora extracted from the Common Crawl. Since a Web table usually has rich context information such as the page title and surrounding paragraphs, we not only provide relevance judgments of query-table pairs, but also relevance judgments of query-table context pairs with respect to a query, which are ignored by previous test collections. To facilitate future research with this benchmark, we provide details about how the dataset is pre-processed and also baseline results from both traditional and recently proposed table retrieval methods. Our experimental results show that proper usage of context labels can benefit previous table retrieval methods.
Abstract: Chatty Goose is an open-source Python conversational search framework that provides strong, reproducible reranking pipelines built on recent advances in neural models. The framework comprises extensible modular components that integrate with popular libraries such as Transformers by HuggingFace and ParlAI by Facebook. Our aim is to lower the barrier of entry for research in conversational search by providing reproducible baselines that researchers can build on top of. We provide an overview of the framework and demonstrate how to instantiate a new system from scratch. Chatty Goose incorporates improvements to components that we introduced in the TREC 2019 Conversational Assistance Track (CAsT), where our submission represented the top-performing system. Using our framework, a comparable run can be reproduced with just a few lines of code.
Abstract: Dark jargon terms are an essential part of underground conversation. They are benign-looking, but have hidden, sometimes sinister meanings and are used by participants of underground forums for illicit behavior. For example, the dark term "rat" is often used in lieu of "Remote Access Trojan". We present a novel online platform that caters to the understanding of underground conversation with latent meaning. Our system enables researchers, law enforcement agents and "white-hat" hackers to gain invaluable insights into underground communication by providing them with a tool to (1) look up dark jargon terms in a dictionary; (2) explore the usage of dark jargon over time and interpret their meaning; (3) collaborate and contribute their own research findings. Furthermore, we introduce a novel dark jargon interpretation method that leverages masked language modeling with a transformer-based architecture.
Abstract: OpenMatch is a Python-based library that serves Neural Information Retrieval (Neu-IR) research. It provides self-contained neural and traditional IR modules, making it easy to build customized and higher-capacity IR systems. To bring the advantages of Neu-IR models to users, OpenMatch provides implementations of recent neural IR models, detailed experiment instructions, and advanced few-shot training methods. OpenMatch reproduces the ranking results of previous work on widely-used IR benchmarks, liberating users from surplus labor in baseline reimplementation. Our OpenMatch-based solutions achieve top-ranked empirical results on various ranking tasks, such as ad hoc retrieval and conversational retrieval, illustrating the convenience of OpenMatch for building an effective IR system. The library, experimental methodologies and results of OpenMatch are all publicly available at https://github.com/thunlp/OpenMatch.
Abstract: We present a search engine aimed at helping clinicians find targeted treatments for children with cancer. Childhood cancer is a leading cause of death, and clinicians increasingly seek treatments that are tailored to an individual patient, particularly their tumour genetics. Finding treatments that are specific to paediatrics and match individual genetics is a real challenge amongst the vast and growing body of medical literature and clinical trials. We aim to help clinicians through a search system tailored to this problem. The system retrieves PubMed articles and clinical trials. Entity extraction is performed to highlight genes, drugs and cancers: three key information types clinicians care about. Query suggestion helps clinicians formulate otherwise difficult queries, and results are presented as a knowledge graph to aid result interpretability. The proposed system aims both to significantly reduce the effort of searching for targeted treatments and to potentially find life-saving treatments that may have otherwise been missed. Demo details at http://health-search.csiro.au/oscar/.
Abstract: Mathematical Information Retrieval (MIR) has been actively studied in recent years and many fruitful results have emerged. Among those, the Approach Zero system is one of the few math-aware search engines that is able to perform substructure matching efficiently. Furthermore, it has been deployed in ARQMath2020, the most recent community-wide MIR evaluation, as a strong baseline due to its empirical effectiveness and ability to handle structured math content. However, in order to implement a retrieval model that handles structured queries efficiently, Approach Zero is written in C from the ground up, requiring special pipelines for processing math content and queries. Thus, the system is not conveniently accessible and reusable to the community as a research tool. In this paper, we present PyA0, an easy-to-use Python toolkit built on Approach Zero that improves its accessibility to researchers. We introduce the toolkit interface and report evaluation results on popular MIR datasets to demonstrate the effectiveness and efficiency of our toolkit. We have made PyA0 source code publicly accessible at https://github.com/approach0/pya0, which includes a link to a notebook demo.
Abstract: With the Web growing every day and computers getting ever more powerful, research in the field of computational argumentation becomes more and more important. One of its research branches is argument retrieval, which aims at finding and presenting users the best arguments for their queries. Several systems already exist for this purpose, all having the same goal but reaching it in different ways. In line with existing work, an argument consists of a claim supported or attacked by a premise. Now that argument retrieval has become a separate task in the CLEF lab Touché, displaying the ranking is becoming increasingly important. In this paper we present QuARk, a GUI that allows users to retrieve arguments from a focused debate collection for their queries. Since we strictly distinguished between frontend and backend and kept the communication between them simple, QuARk can be extended to integrate various argument retrieval systems, assuming some modifications are made. In order to demonstrate the GUI, we show the integration of a complex retrieval algorithm that we also presented in the CLEF lab Touché. Our retrieval process consists of two parts. In the first step, it finds the claims most similar to the query; here, the user can select between different standard IR similarity methods. The second step ranks the premises directly related to those claims; here, the user can choose to rank the arguments by a quantitative, qualitative, or combined measure.
Abstract: We present the IR Anthology, a corpus of information retrieval publications accessible via a metadata browser and a full-text search engine. Following the example of the well-known ACL Anthology, the IR Anthology serves as a hub for researchers interested in information retrieval. Our search engine ChatNoir indexes the publications' full texts, enabling a focused search and linking users to the respective publisher's site for personal access. Listing more than 40,000 publications at the time of writing, the IR Anthology can be freely accessed at https://IR.webis.de.
Abstract: Medical decision-making is guided by the results of rich medical research and clinical trials. Doctors, practitioners and researchers urgently need to stay up to date with the most recent research and clinical outputs to make correct decisions, especially when a new virus such as Covid-19 is causing a global epidemic. However, medical literature on a certain topic may cover different aspects and thus be archived in different literature databases, so that searching for all related articles becomes a laborious and time-consuming task. This becomes worse when the number of published articles on the given topic grows rapidly. In this work, we build an online knowledge hub (http://covid19knowledgehub.herokuapp.com/) particularly for Covid-19 related research articles, in response to requirements from researchers in a hospital. The system is built on top of nine medical research article databases, which cover a wide range of medical aspects. It allows users to easily retrieve and explore articles from multiple literature databases in one place. The system also provides statistics of article distributions to offer an overview of the status of research on this topic. This real-demand-driven system is deployed in a research team of Renmin Hospital, Wuhan University, and largely reduces their time for searching the latest articles. Although this project focuses on Covid-19 related research articles, the underlying approach could be applied to any topic in any domain.
Abstract: The Federal Reserve System (the Fed) plays a significant role in shaping monetary policy and financial conditions worldwide. Although it is important to analyse the Fed's communications to extract useful information, they are generally long-form and complex due to the ambiguous and esoteric nature of their content. In this paper, we present FedNLP, an interpretable multi-component Natural Language Processing (NLP) system to decode Federal Reserve communications. This system is designed for end-users to explore how NLP techniques can assist their holistic understanding of the Fed's communications with NO coding. Behind the scenes, FedNLP uses multiple NLP models, ranging from traditional machine learning algorithms to deep neural network architectures, for each downstream task. The demonstration shows multiple results at once, including sentiment analysis, a summary of the document, a prediction of the Federal Funds Rate movement, and a visualization for interpreting the prediction model's result. Our application system and demonstration are available at https://fednlp.net.
Abstract: In the context of social media, geolocation inference on news or events has become a very important task. In this paper, we present the GeoWINE (Geolocation-based Wiki-Image-News-Event retrieval) demonstrator, an effective modular system for multimodal retrieval which expects only a single image as input. The GeoWINE system consists of five modules for retrieving related information from various sources. The first module is a state-of-the-art model for geolocation estimation of images. The second module performs a geospatial-based query for entity retrieval using the Wikidata knowledge graph. The third module exploits four different image embedding representations, which are used to retrieve the entities most similar to the input image. The last two modules perform news and event retrieval from EventRegistry and the Open Event Knowledge Graph (OEKG). GeoWINE provides an intuitive interface for end-users and can be reconfigured by experts for individual setups. GeoWINE achieves promising results in entity label prediction for images on the Google Landmarks dataset. The demonstrator is publicly available at http://cleopatra.ijs.si/geowine/.
Abstract: We describe a search-assistance tool called the OrgBox, created to support users' organizational, cognitive, and metacognitive activities while performing exploratory search tasks. The OrgBox tool allows users to create and label "boxes" to organize information found during a search. The OrgBox was integrated with a custom-built search system that allowed users to save, organize, and synthesize information using drag-and-drop actions. The OrgBox tool also encourages users to engage in metacognitive activities during their search process (e.g., planning their next steps, monitoring their progress, evaluating the information found so far). In this paper, we describe the features and implementation of the OrgBox tool. We also summarize results of two user studies conducted using the OrgBox that show its cognitive and metacognitive benefits to users.
Abstract: The World Wide Web and social media platforms have become popular sources for news and information. Typically, multimodal information, e.g., image and text, is used to convey information more effectively and to attract attention. While in most cases image content is decorative or depicts additional information, it has also been leveraged to spread misinformation and rumors in recent years. In this paper, we present a web-based demo application that automatically quantifies the cross-modal relations of entities (persons, locations, and events) in image and text. The applications are manifold. For example, the system can help users to explore multimodal articles more efficiently, or can assist human assessors and fact-checking efforts in the verification of the credibility of news stories, tweets, or other multimodal documents.
Abstract: Explainable AI (XAI) is currently a vibrant research topic. However, the absence of ground truth explanations makes it difficult to evaluate XAI systems such as Explainable Search. We present an Explainable Search system with a focus on evaluating the XAI aspect of Trustworthiness along with the retrieval performance. We present SIMFIC 2.0 (Similarity in Fiction), an enhanced version of a recent explainable search system. The system retrieves books similar to a selected book in a query-by-example setting. The motivation is to explain the notion of similarity in fiction books. We extract hand-crafted interpretable features for fiction books and provide global explanations by fitting a linear regression, and local explanations based on similarity measures. The Trustworthiness facet is evaluated using user studies, while the ranking performance is compared by analysis of user clicks. Eye tracking is used to investigate user attention to the explanation elements when interacting with the interface. Initial experiments show statistically significant results on the Trustworthiness of the system, paving the way for interesting research directions that are being investigated.
Abstract: Collecting participant search logs is an integral part of interactive IR research. Existing approaches are either piecemeal solutions, require cumbersome setups, or both. We present YASBIL, a two-component logging solution comprising a browser extension and a WordPress plugin. The browser extension logs the browsing activity on the participants' machines. The WordPress plugin collects the logged data into the researcher's data server. The logging works on any webpage, without the need to own or have knowledge about the HTML structure of the webpage. YASBIL also offers ethical data transparency and security towards participants, by enabling them to view and obtain copies of the logged data, as well as securely upload the data to the researcher's server over an HTTPS connection. We posit that ease of installation and use will make YASBIL especially suitable for remote user studies, and longitudinal studies in IR.
Abstract: Fine-grained logging of interactions in user studies is important for studying user behaviour, among other reasons. However, in many research scenarios, the way interactions are logged is usually tied to a monolithic system. We present a generic, application-independent service for logging interactions in web pages, specifically targeting user studies. Our service, Big Brother, can be dropped into existing user interfaces with almost no configuration required by researchers. Big Brother has already been used in several user studies to record interactions in a number of user study research scenarios, such as lab-based and crowdsourcing environments. We further demonstrate the ability of Big Brother to scale to very large user studies through benchmarking experiments. Big Brother also provides a number of additional tools for visualising and analysing interactions. Big Brother significantly lowers the barrier to entry for logging user interactions by providing a minimal but powerful, zero-configuration service for researchers and practitioners of user studies that can scale to thousands of concurrent sessions. We have made the source code and releases for Big Brother available for download at https://github.com/hscells/bigbro.
Abstract: Understanding and comparing the behavior of retrieval models is a fundamental challenge that requires going beyond examining average effectiveness and per-query metrics, because these do not reveal key differences in how ranking models' behavior impacts individual results. DiffIR is a new open-source web tool to assist with qualitative ranking analysis by visually 'diffing' system rankings at the individual result level for queries where behavior significantly diverges. Using one of several configurable similarity measures, it identifies queries for which the compared models produce substantially different rankings and provides a visual web interface to compare the rankings side-by-side. DiffIR additionally supports a model-specific visualization approach based on custom term importance weight files. These support studying the behavior of interpretable models, such as neural retrieval methods that produce document scores based on a similarity matrix or based on a single document passage. Observations from this tool can complement neural probing approaches like ABNIRML to generate quantitative tests. We provide an illustrative use case of DiffIR by studying the qualitative differences between recently developed neural ranking models on a standard TREC benchmark dataset.
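A generic sketch of the per-query ranking comparison described above, using Jaccard overlap of the top-k result sets as the similarity measure; this illustrates the idea rather than DiffIR's implementation, and the run data is invented.

.. code-block:: python

    # Flag queries where two systems' top-k rankings diverge (illustrative only).
    def topk_jaccard(run_a, run_b, k=10):
        a, b = set(run_a[:k]), set(run_b[:k])
        return len(a & b) / len(a | b)

    runs_a = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4", "d5"]}
    runs_b = {"q1": ["d1", "d2", "d3"], "q2": ["d7", "d8", "d1"]}

    diverging = sorted(
        ((qid, topk_jaccard(runs_a[qid], runs_b[qid], k=3)) for qid in runs_a),
        key=lambda pair: pair[1],
    )
    print(diverging)   # queries with the lowest overlap surface first, e.g. q2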
Abstract: In this paper, we demonstrate the Information Interactions in Virtual Reality (IIVR) system designed and implemented to study how users interact with abstract information objects in immersive virtual environments in the context of information retrieval. Virtual reality displays are quickly growing as social and personal computing media, and understanding user interactions in these immersive environments is imperative. As a step towards effective information retrieval in such emerging platforms, our system is central to upcoming studies to observe how users engage in information triaging tasks in Virtual Reality (VR). In these studies, we will observe the effects of (1) information layouts and (2) types of interactions in VR. We believe this early system motivates researchers in understanding and designing meaningful interactions for future VR information retrieval applications.
Abstract: This demo system presents a browser extension that allows the reader of a health news article to quickly retrieve related medical/health research papers. This system can help news editors and readers fact-check health news for incorrect or exaggerated claims, such as making causal claims from correlational findings or generalising results from animal studies to humans. Linking health news to the original research papers is not a trivial task, as links are largely missing in science news reports. To link health news to medical literature, our system includes a new named-entity recognition function to extract journal names, and a new Elasticsearch-based search engine to incorporate rich metadata into the search strategy. This paper also introduces a new dataset for evaluating the performance of the proposed search system.
Abstract: Existing chat services that organisations and individuals use today often provide a way to search through previously sent messages. However, many of these chat services provide only limited search functionality, typically exact matching on individual messages. In this paper, we introduce a new task for addressing this problem, called searching for conversations, whereby the aim is to retrieve and rank groups of related messages given a search query. We promote this task by providing a platform for research and development called PECAN. Our platform provides all the necessary functionality researchers need to conduct experiments on searching for conversations. Our system is also generic so as to support organisations and individuals who wish to search through their chat message archives. We release PECAN to the wider community as an Open Source project available for download at https://github.com/ielab/pecan.
Abstract: User behaviors and experiences are fundamental to information retrieval systems, but are often difficult to collect, bringing challenges to both applications and research. Recently, researchers have been exploring more fine-grained user behavior than simple clicks, such as time patterns and mouse/scroll patterns, on their own specific laboratory experimental platforms. However, the lack of publicly available toolkits for logging user behaviors and experiences makes it difficult to run field studies and remote user experiments in real scenarios. In this work, we propose a Privacy-Aware Remote User Logging Tool for remotely collecting user behaviors and explicit experience feedback, with special care for user privacy. With this tool, participants can conduct user experiments remotely without time and location constraints, giving researchers the possibility to observe users' more natural behaviors and experiences.
Abstract: With the advances in precision medicine, identifying clinical trials relevant to a specific patient profile becomes more challenging. Often very specific molecular-level patient features need to be matched for a trial to be deemed relevant. Clinical trials contain strict inclusion and exclusion criteria, often written in free text. Patient profiles are also semi-structured, with some important information hidden in clinical notes. We present a search system that, given a patient profile, searches over clinical trials for potential matches. It enables users to leverage the powerful Apache Lucene query syntax in combination with state-of-the-art Divergence From Randomness retrieval coupled with a BERT-based neural ranking component. This system aims to assist in clinical decision making.
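As an illustration of the kind of Lucene-syntax query such a system can accept: the field names below (condition, gene, min_age) are invented for illustration and do not reflect the system's actual index schema, but the phrase, boolean, and range syntax is standard Lucene.

.. code-block:: python

    # Hypothetical Lucene-syntax query over clinical-trial criteria (illustrative fields).
    patient_query = (
        'condition:"non-small cell lung cancer" '
        "AND gene:(EGFR OR ALK) "
        "AND min_age:[18 TO 65]"
    )
    print(patient_query)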
Abstract: In e-commerce applications, customers search and discover one or more products using queries. Some of these queries are broad and diverse, with multiple intents. Therefore, relying purely on anonymized and aggregated customer historical behavioral data is not sufficient to train machine-learned models. For example, customers may click on and purchase a galaxy charger for a "samsung galaxy s9" query. The item is not an exact match for the customer query. However, it serves as a complement to the original query and may be purchased. To prevent such potential mismatches from surfacing in search results, e-commerce systems rely on machine-learned models trained on human-annotated data. There are two challenges in collecting human-annotated data. First, the human annotation process does not scale and it is hard to obtain large volumes of annotations in multiple languages. Second, annotators must query existing systems to obtain samples for auditing, resulting in very few mismatched examples (data skewness) and counterfactual biases. In this talk, we address these challenges using two recent advances in deep learning. To address the data skewness, we generate hard negative examples using positive examples. The key idea here is to generate synthetic data using a Variational Encoder Decoder (VED) architecture. We show how a modified loss function with a novel combiner (to combine the VED with the classifier) can avoid policy-based gradients and other heuristics. To address the sparsity of data in less popular languages, we combine data across all languages using language-agnostic representation learning. The side information we use aligns the items across languages in the same latent space. We show that our approaches significantly improve upon state-of-the-art baselines, by over 25% in F1 score for the variational model, and over 20% in F1 score for the multilingual model.
Abstract: Healthy online discourse is becoming less and less accessible beneath the growing noise of controversy, mis- and dis-information, and toxic speech. While IR is crucial in detecting harmful speech, researchers must work across disciplines to develop interventions, and partner with industry to deploy them rapidly and effectively. In this position paper, we argue that both detecting online information disorders and deploying novel, real-world content moderation tools are crucial in promoting empathy in social networks, and maintaining free expression and discourse. We detail our insights from studying different social networks such as Parler and Reddit. Finally, we discuss the joys and challenges as a lab-grown startup working with both academia and other industrial partners in finding a path toward a better, more trustworthy online ecosystem.
Abstract: In online marketplaces, an increasing number of producers depend on search and recommendation systems to connect them with consumers to make a living. In this talk, we discuss how these systems will need to evolve from the traditional formulations by incorporating the producer value into their objectives. Jointly optimizing the ranking functions behind these systems on both consumer and producer values is a new direction and raises many technical challenges. To overcome these, we lay out an end-to-end solution and present the results of applying this solution on Facebook Marketplace.
Abstract: Personalization is omnipresent in our life, with applications ranging from entertainment and commercial uses to smart devices and medical treatments. The integration of personalization in various products turned rapidly from an unnecessary luxury to a commodity that is expected by customers. While different machine learning fields present state-of-the-art advances and super-human performance, personalization applications are often late adopters of novel solutions due to their complex framing and multiple stakeholders with different business goals. The role of personalisation applications is also ambiguous: it is unclear, for instance, whether models just predict a user's next action or proactively affect the user's selections. This talk focuses on examining the role of recommenders and their ability to adapt to customer feedback. Key topics such as causality and active exploration are depicted with real examples and demonstrated alongside business considerations and implementation challenges. It relies on recent advances in the field and on work conducted at Booking.com, where we implement personalization models on one of the world's leading online travel platforms.
Abstract: Graph convolutional networks (GCNs), which have recently become the new state-of-the-art method for graph node classification, recommendation and other applications, have not yet been successfully applied to industrial-scale search engines. In this proposal, we introduce our approach, namely SearchGCN, for embedding-based candidate retrieval in one of the largest e-commerce search engines in the world. Empirical studies demonstrate that SearchGCN learns better embedding representations than existing methods, especially for long-tail queries and items. Thus, SearchGCN has been deployed in JD.com's search production since July 2020.
Abstract: We present AliMe Avatar, a Vtuber designed for live-streaming sales in the E-commerce field. To support the emerging live shopping mode, the core of our digital avatar is to enable customers to understand products and encourage customers to purchase in a virtual broadcasting room. Based on computer graphics & vision, natural language processing, and speech recognition & synthesis, our AI avatar is able to offer three kinds of key capabilities: custom appearance, product broadcasting, and multi-modal interaction. Currently, it has been launched online in the Taobao app, broadcasts 700+ hours and serves hundreds of thousands of customers per day. In this paper, we mainly focus on the product broadcasting part, demonstrate the system, present the underlying techniques, and share our experience in dealing with live-streaming E-commerce.
Abstract: Cold-start is the most difficult and time-consuming phase when building a question-answering-based chatbot for a new business scenario, because sufficient training data must first be collected. In this paper, we propose AliMe DA, a practical data augmentation (DA) framework that consists of data production, denoising and consumption, to alleviate this problem. We show how our DA approach can be used to substantially enhance annotation productivity and also improve downstream model performance. More importantly, we provide best practices for data augmentation, including how to choose and employ appropriate methods at each stage of our framework, and share our observations on the applicable scenarios of data augmentation in the era of pre-trained language models.
Abstract: Games of skill are an excellent source of recreation and relaxation. Games are also the safest and most readily accessible constructs for social interaction and community affairs, which potentially opens up new avenues for realising personal worth, social acceptance, respect & recognition. However, when these games are played with real money, ensuring game prudence, whereby users play real-money skill games only for entertainment purposes and do so well within their means, becomes necessary. It is paramount for the wellness of players and also ensures that online gaming remains sheer entertainment. In this proposal, we present an automated, data-driven, AI-powered Responsible Game Play (RGP) framework and tool which has been integrated into our online skill gaming platform. The RGP pipeline is a combination of: a) a couple of anomaly-detection rule-based engines; b) a deep learning pipeline which models the game play characteristics of healthy and engaged players to identify potentially risky players; and c) an ML-based local expert which leverages users' longitudinal behavioral patterns and constructs new features using techniques from the adjacent AIOps and signal processing domains. We integrate psychometric assessment to nudge and course-correct at-risk players proactively, ahead of time.
Abstract: Credit cards, deposits, loans, pension funds, mutual funds: which of these products is relevant to a bank's clients, and at what time in their banking journey? We propose a modeling framework for item recommendation using a Transformer encoder [6] and a novel input data representation accounting for the temporal context of item ownership and user metadata. We evaluate the model on a large dataset from Bank Santander. Our system outperforms the industry baselines Amazon Personalize [1] and XGBoost [4], a top-performing model in the Santander Kaggle competition [2]. We achieve 56.6% top-3 precision, significantly outperforming Amazon Personalize and the XGBoost model, which reach 21.5% and 37.9% top-3 precision, respectively. We engineered an original way of representing input data as a sequence and found that this specific representation, with our Transformer-based architecture, improves the model's performance. We hope that our contribution paves the way for the democratization of recommender systems in banking, and the use of the Transformer model for product recommendation in industry.
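For reference, per-user top-3 precision of the kind reported above can be computed as follows; this is a generic, illustrative sketch (product names and ownership data are invented), not the authors' evaluation code, and the reported figures average such scores over many clients.

.. code-block:: python

    # Generic top-k precision for a single client (illustrative only).
    def precision_at_k(recommended, acquired_next, k=3):
        hits = sum(1 for item in recommended[:k] if item in acquired_next)
        return hits / k

    recommended = ["credit_card", "pension_fund", "mutual_fund"]
    acquired_next = {"credit_card"}            # products the client actually took up next
    print(precision_at_k(recommended, acquired_next))   # 0.333...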
Abstract: Search systems have unprecedented influence on how and what information people access. These gateways to information on the one hand create easy and universal access to online information, and on the other hand create biases that have been shown to cause knowledge disparity and poor decisions for information seekers. Most of the algorithms for indexing, retrieval, and ranking are heavily driven by the underlying data, which is itself biased. In addition, orderings of the search results create position bias and exposure bias due to their considerable focus on relevance and user satisfaction. These and other forms of bias that are implicitly, and sometimes explicitly, woven into search systems are becoming increasing threats to information seeking and sense-making processes. In this tutorial, we will introduce the issues of biases in data, in algorithms, and overall in search processes and show how we could think about and create systems that are fairer, with increasing diversity and transparency. Specifically, the tutorial will present several fundamental concepts such as relevance, novelty, diversity, bias, and fairness using socio-technical terminologies taken from various communities, and dive deeper into metrics and frameworks that allow us to understand, extract, and materialize them. The tutorial will cover some of the most recent works in this area and show how this interdisciplinary research has opened up new challenges and opportunities for communities such as SIGIR.
Abstract: The Probability Ranking Principle (PRP) [31], which assumes that each document has a unique and independent probability to satisfy a particular information need, is one of the fundamental principles for ranking. Traditionally, heuristic ranking features and well-known learning-to-rank approaches have been designed by following the PRP principle. Recently, neural IR models, which adopt deep learning to enhance ranking performance, have also obeyed the PRP principle. Though it has been widely used for nearly five decades, in-depth analysis shows that PRP is not an optimal principle for ranking, due to its independence assumption that each document should be independent of the rest of the candidates. Counter-examples include pseudo-relevance feedback [24], interactive information retrieval [46], search result diversification [10], etc. To solve the problem, researchers recently proposed to model the dependencies among the documents when designing ranking models. A number of ranking models have been proposed and state-of-the-art ranking performances have been achieved. This tutorial aims to give a comprehensive survey of these recently developed ranking models that go beyond the PRP principle. The tutorial tries to categorize these models based on their intrinsic assumptions: assuming that the documents are independent, sequentially dependent, or globally dependent. In this way, we expect that researchers focusing on ranking in search and recommendation can gain a new angle of view on the design of ranking models, which in turn can stimulate new ideas for developing novel ranking models.
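As a brief formal sketch of the distinction drawn above, in standard notation rather than anything taken from the tutorial itself: under the PRP each candidate is scored from the query and that document alone, whereas the dependent ranking models surveyed also condition on the other candidates.

.. math::

   \underbrace{s(d_i) = P(R = 1 \mid q, d_i)}_{\text{PRP: independent scoring}}
   \qquad \text{vs.} \qquad
   \underbrace{s(d_i) = f\big(q,\, d_i,\, \{d_j\}_{j \neq i}\big)}_{\text{dependent ranking models}}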
Abstract: This tutorial on Deep Learning on Graphs for Natural Language Processing (DLG4NLP) will cover relevant and interesting topics on applying deep learning on graphs to NLP, including automatic graph construction for NLP, graph representation learning for NLP, advanced GNN-based models (e.g., graph2seq, graph2tree, and graph2graph) for NLP, and the applications of GNNs in various NLP tasks (e.g., machine translation, natural language generation, information extraction and semantic parsing). In addition, a hands-on demonstration session will be included to help the audience gain practical experience in applying GNNs to solve challenging NLP problems using our recently developed open-source library, Graph4NLP, the first library for researchers and practitioners for easy use of GNNs for various NLP tasks.
Abstract: Recently, there has been growing attention on fairness considerations in machine learning. As one of the most pervasive applications of machine learning, recommender systems are gaining increasing and critical impact on humans and society, since a growing number of users use them for information seeking and decision making. Therefore, it is crucial to address the potential unfairness problems in recommendation, which may hurt users' or providers' satisfaction in recommender systems as well as the interests of the platforms. The tutorial focuses on the foundations and algorithms for fairness in recommendation. It also presents a brief introduction to fairness in basic machine learning tasks such as classification and ranking. The tutorial will introduce taxonomies of current fairness definitions and of evaluation metrics for fairness concerns. We will introduce previous works on fairness in recommendation and also put forward future fairness research directions. The tutorial aims at introducing and communicating fairness-in-recommendation methods to the community, as well as gathering researchers and practitioners interested in this research direction for discussions, idea communication, and research promotion.
Abstract: Information retrieval (IR) in nature is a process of sequential decision making. The system repeatedly interacts with the users to refine its understanding of the users' information needs, improve its estimation of result relevance, and thus increase the utility of its returned results (e.g., the result rankings). Distinct from traditional IR solutions that rigidly execute an offline trained policy, interactive information retrieval emphasizes online policy learning. This, however, is fundamentally difficult for at least three reasons. First, the system only collects user feedback on the presented results, aka, the bandit feedback. Second, users' feedback is known to be noisy and biased. Third, as a result, the system always faces the conflicting goals of improving its policy by presenting currently underestimated results to users versus satisfying the users by ranking the currently estimated best results on top. In this tutorial, we will first motivate the need for online policy learning in interactive IR, by highlighting its importance in several real-world IR problems where online sequential decision making is necessary, such as web search and recommendations. We will carefully address the new challenges that arose in such a solution paradigm, including sample complexity, costly and even outdated feedback, and ethical considerations in online learning (such as fairness and privacy) in interactive IR. We will prepare the technical discussions by first introducing several classical interactive learning strategies from machine learning literature, and then fully dive into the recent research developments for addressing the aforementioned fundamental challenges in interactive IR. Note that the tutorial on "Interactive Information Retrieval: Models, Algorithms, and Evaluation" will provide a broad overview on the general conceptual framework and formal models in interactive IR, while this tutorial covers the online policy learning solutions for interactive IR with bandit feedback.
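As a concrete instance of the explore-exploit tension with bandit feedback described above, here is a minimal epsilon-greedy sketch; this is a standard textbook strategy rather than anything specific to the tutorial, and the result names and reward signal are invented for illustration.

.. code-block:: python

    # Minimal epsilon-greedy bandit sketch: balance presenting the currently best result
    # (exploit) against trying underestimated results (explore).
    import random

    arms = ["result_a", "result_b", "result_c"]    # candidate results to present
    estimates = {arm: 0.0 for arm in arms}         # running estimate of click value
    counts = {arm: 0 for arm in arms}
    epsilon = 0.1

    def choose_arm():
        if random.random() < epsilon:
            return random.choice(arms)             # explore an underestimated result
        return max(arms, key=estimates.get)        # exploit the current best estimate

    def update(arm, reward):
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    arm = choose_arm()
    update(arm, reward=1.0)                        # e.g., the user clicked the result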
Abstract: Since Information Retrieval (IR) is an interactive process in general, it is important to study Interactive Information Retrieval (IIR), where we would attempt to model and optimize an entire interactive retrieval process (rather than a single query) with consideration of many different ways a user can potentially interact with a search engine. This tutorial systematically reviews the progress of research in IIR with an emphasis on the most recent progress in the development of models, algorithms, and evaluation strategies for IIR, ending with a brief discussion of the major open challenges in IIR and some of the most promising future research directions.
Abstract: The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This tutorial, based on a forthcoming book, provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has, without exaggeration, revolutionized the fields of natural language processing (NLP), information retrieval (IR), and beyond. We provide a synthesis of existing work as a single point of entry for both researchers and practitioners. Our coverage is grouped into two categories: transformer models that perform reranking in multi-stage ranking architectures and learned dense representations that perform ranking directly. Two themes pervade our treatment: techniques for handling long documents and techniques for addressing the tradeoff between effectiveness (result quality) and efficiency (query latency). Although transformer architectures and pretraining techniques are recent innovations, many aspects of their application are well understood. Nevertheless, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, we also attempt to prognosticate the future.
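To make the first of the two categories above concrete, here is a minimal sketch of reranking with a transformer cross-encoder in a multi-stage pipeline: a first-stage retriever supplies candidates and a BERT-style sequence-classification model rescores each (query, passage) pair. The checkpoint name, query, and passages are illustrative assumptions, not taken from the book:

.. code-block:: python

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    query = "what causes tides"
    candidates = [  # hypothetical first-stage retrieval results
        "Tides are caused by the gravitational pull of the moon and sun.",
        "The stock market rose sharply on Monday.",
    ]

    # Encode each (query, passage) pair and score it with the cross-encoder.
    inputs = tokenizer([query] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze(-1)  # one relevance score per pair

    # Sort candidates by the model's relevance score (reranking).
    for score, passage in sorted(zip(scores.tolist(), candidates), reverse=True):
        print(f"{score:.2f}  {passage}")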
Abstract: There is strong interest in leveraging reinforcement learning (RL) for information retrieval (IR) applications, including search, recommendation, and advertising. In 2020 alone, the term "reinforcement learning" was mentioned in more than 60 different papers published by ACM SIGIR. It has also been reported that Internet companies like Google and Alibaba have started to gain competitive advantages from their RL-based search and recommendation engines. This full-day tutorial gives IR researchers and practitioners who have little or no experience with RL the opportunity to learn the fundamentals of modern RL in a practical, hands-on setting. Furthermore, some representative applications of RL in IR systems will be introduced and discussed. By attending this tutorial, participants will acquire a good knowledge of modern RL concepts and standard algorithms such as REINFORCE and DQN. This knowledge will help them better understand some of the latest IR publications involving RL, as well as prepare them to tackle their own practical IR problems using RL techniques and tools. Please refer to the tutorial website (https://rl-starterpack.github.io/) for more information.
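For readers unfamiliar with the algorithms named in this abstract, the following is a minimal sketch of REINFORCE on a toy one-step decision problem. It is not taken from the tutorial materials; the per-action reward probabilities, learning rate, and number of steps are illustrative assumptions:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    reward_prob = np.array([0.2, 0.8, 0.5])  # assumed per-action reward probabilities
    theta = np.zeros(3)                      # policy parameters (softmax preferences)
    alpha = 0.1                              # learning rate

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    for step in range(5000):
        probs = softmax(theta)
        a = rng.choice(3, p=probs)                 # sample an action from the policy
        r = float(rng.random() < reward_prob[a])   # observe a stochastic reward
        grad_log_pi = -probs                       # d log pi(a) / d theta for softmax
        grad_log_pi[a] += 1.0
        theta += alpha * r * grad_log_pi           # REINFORCE update: r * grad log pi

    print(softmax(theta))  # probability mass should concentrate on the best action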
Abstract: Stance detection (also known as stance classification and stance prediction) is a problem related to social media analysis, natural language processing, and information retrieval, which aims to determine the position of a person, from a piece of text they produce, towards a target (a concept, idea, event, etc.) that is either explicitly specified in the text or only implied. The output of the stance detection procedure is usually drawn from the set: Favor, Against, None. In this tutorial, we will define the core concepts and research problems related to stance detection, present historical and contemporary approaches to stance detection, provide pointers to related resources (datasets and tools), and cover outstanding issues and application areas of stance detection. As solutions to stance detection can contribute to significant tasks including trend analysis, opinion surveys, user reviews, personalization, and predictions for referendums and elections, it will continue to stand as an important research problem, currently mostly on textual content, and particularly on social media. Finally, we believe that image and video content will soon become a common subject of stance detection research.
Abstract: Most current machine learning approaches to IR---including search and recommendation tasks---are designed around the basic idea of matching, working from the perceptual and similarity-learning perspective. This includes both the learning of features from data, such as representation learning, and the learning of similarity matching functions from data, such as neural function learning. Though many such models are widely used in practical ranking systems for search and recommendation, their design philosophy limits them to the correlative signals in the data. However, advancing from correlative learning to causal learning in search and recommendation is an important problem, because causal modeling can help us think beyond the observational data for representation learning and ranking. More specifically, causal learning can bring benefits to the IR community on various dimensions, including but not limited to Explainable IR models, Unbiased IR models, Fairness-aware IR models, Robust IR models and Cognitive Reasoning IR models. This workshop focuses on the research and application of causal modeling in search, recommendation, and a broader scope of IR tasks. The workshop will gather both researchers and practitioners in the field for discussions, idea exchange, and research promotion. It will also generate insightful debates about the recent regulations on AI ethics, reaching a broader community including but not limited to IR, machine learning, AI, data science, and beyond. The workshop homepage is available online at https://csr21.github.io/.
Abstract: Modern information retrieval (IR) consists of a series of processes, including query expansion, candidate item recall, item ranking, and item re-ranking. The final ranked item list is exposed to the user, who accordingly provides feedback through expected actions such as browsing and clicking. This whole process can be formulated as a decision-making process in which the agent is the IR system and the environment is the specific user. The decision-making process can be one-step or sequential, depending on the scenario or the way the problem is formulated. Since 2013, deep reinforcement learning (DRL) has been a fast-developing technique for decision-making tasks: the high capacity of deep learning models is incorporated into the reinforcement learning framework so that the agent can successfully handle complex decision making. In recent years, a number of publications have attempted to leverage DRL techniques for different IR tasks such as ad hoc retrieval, learning to rank, and interactive recommendation. Nonetheless, the fundamental theory, the principles of RL methods, and recognized experimental protocols for decision making in IR have not been well developed, making it challenging to evaluate the correctness of a proposed method or to judge whether reported experimental performance is valid. We propose the second DRL4IR workshop at SIGIR 2021, which provides a venue for academic researchers and industry practitioners to present recent progress on DRL techniques for IR. More importantly, participants are expected to discuss the fundamental principles of formulating a decision-making IR task, the underlying theory, and the practical effectiveness of experimental protocol design, which would foster further research on novel methodologies, innovative experimental findings, and new applications of DRL for information retrieval. DRL4IR at SIGIR'20 was one of the most popular workshops and attracted over 200 conference attendees. This year, we will pay more attention to fundamental research topics and recent applications, and expect about 300 participants.
Abstract: eCommerce Information Retrieval (IR) is receiving increasing attention in the academic literature and is an essential component of some of the world's largest web sites (e.g., Airbnb, Alibaba, Amazon, eBay, Facebook, Flipkart, Lowe's, Taobao, and Target). SIGIR has for several years seen sponsorship from eCommerce organisations, reflecting the importance of IR research to them. The purpose of this workshop is (1) to bring together researchers and practitioners of eCommerce IR to discuss topics unique to it, (2) to determine how to use eCommerce's unique combination of free text, structured data, and customer behavioral data to improve search relevance, and (3) to examine how to build datasets and evaluate algorithms in this domain. Since eCommerce customers often do not know exactly what they want to buy (i.e. navigational and spearfishing queries are rare), recommendations are valuable for inspiration and serendipitous discovery as well as basket building. The theme of this year's eCommerce IR workshop is ensuring fairness in search and recommendations for eCommerce. The workshop includes papers on this topic as well as a panel focused on this area. In addition, Coveo is sponsoring an eCommerce data challenge on session-based prediction for predicting the next action with a special subtask on cart abandonment. The data challenge reflects themes from prior SIGIR workshops in 2017, 2018, 2019, and 2020.
Abstract: Over 20 years ago, Information Retrieval (IR) researchers began their quest for sound IR systems for children. The path was not straightforward. Challenges posed by interface design, relevance determination, diverse contexts, ethics, and many more, were taken up and explored from different perspectives. Large projects such as Puppy-IR and the International Children's Digital Library gave this field a certain boost; still, there is neither a sound solution for children in the search area in 2021 nor a roadmap to get there. What is the reason for this? Does the field cry out for specific IR solutions developed on a small scale for very small sub-fields and specific target groups? Are there some significant unforeseen barriers that hinder researchers? What about obstacles natural to areas of study such as this one that require a multidisciplinary approach or involve protected populations? With this workshop, we want to bring together as many key experts as possible from research and industry who focus on IR for children to understand why, unlike other IR areas, this one has not flourished and look for the biggest challenges for the next 10 years. We are not only thinking of traditional researchers and designers but also of those who develop and use IR systems for fields, such as in music, film, and education, as a way to push past this immobility and look at the problem from new, and perhaps more stimulating, perspectives.
Abstract: Information retrieval plays a crucial role in the patent domain. With the success of deep learning (DL) in other domains, patent practitioners and researchers are increasingly developing DL-based approaches to support experts in the patenting process or to automate patent analysis. AI-enhanced information retrieval systems can improve patent search, but they also require large amounts of annotated data. When working with patent data, particular challenges arise that call for adaptation of, and novel approaches to, general IR and AI methods. With this workshop series, we want to establish a two-way communication channel between industry and academia from relevant fields in information retrieval, such as natural language processing (NLP), text and data mining (TDM), and semantic technologies (ST), in order to explore and transfer new knowledge, methods and technologies for the benefit of industrial applications, as well as to support interdisciplinary research in applied sciences for the intellectual property (IP) and neighbouring domains.
Abstract: The use of simulation techniques is not foreign to information retrieval. In the past, simulation has been employed, for example, for constructing test collections and for model performance prediction and analysis in a broad array of information access scenarios. Nevertheless, a standardized methodology for performance evaluation via simulation has not yet been developed. The goal of this workshop is to create a forum for researchers and practitioners to promote methodology development and more widespread use of simulation for evaluation by: (1) identifying problem settings and application scenarios; (2) sharing tools, techniques, and experiences; (3) characterizing potentials and limitations; and (4) developing a research agenda.
Abstract: In a world of overwhelming choices, recommender systems have become indispensable as personalized information filters, which makes them of great interest to both industry and academia. Recently, deep learning based algorithms have provided inspiring insights and brought us many state-of-the-art recommender systems [1]. However, there exists a widening gap between researchers and engineers: academics usually prefer end-to-end recommender systems, whereas engineers tend to design a recommender system as a three-stage pipeline that preprocesses the original data, recalls a candidate subset, and ranks the recommendation results. Such a pipeline can greatly benefit system development and project management, as has been proven by many Internet companies over recent years. To bridge the gap between end-to-end and pipeline approaches, we propose three research directions and relevant methodologies: label propagation based methods for preprocessing, a graph neural network based negative sampling strategy for recall, and a graph-based model of implicit interactions for ranking. In the following, we describe them in detail.
Abstract: Large data collections containing millions of math formulae are available online. Retrieving math expressions from these collections is challenging, as the structural complexity of formulae requires specialized processing. When searching for mathematical content, accurate measures of formula similarity can help with tasks such as document ranking, query recommendation, and result set clustering. While there have been many attempts at embedding words and graphs, formula embedding is still in its early stages. This research aims to introduce an embedding model for mathematical formulae and accompanying text that can be used in math information retrieval. First, embedding models for isolated formulae are introduced, using intrinsic measures to study the effectiveness and efficiency of retrieval with those embeddings. Those results support the second goal of this research, which is to develop joint embedding models for formulae and text that can support the full range of content encountered in math retrieval. This can be seen as a special case of multimodal embedding, thus potentially benefiting from related research that jointly models other cases in which text and structured representations are co-present, such as chemistry. I summarize the research questions as follows: RQ1: How can we effectively provide an embedding model for isolated mathematical formulae? RQ2: How should the joint embedding of text and formulae be done? RQ3: How can evaluation of math search be grounded in a representative task? For RQ1, I propose to first study simple models that walk the tree structure, in order to examine the effectiveness and efficiency of the formula embedding model, and then move to more advanced models. I have introduced the Tangent-CFT [2] model. As my next step for formula embedding, I plan to look at deep neural network models that have been applied to graph embedding. After studying an embedding model for isolated formulae, in RQ2 I plan to focus on making use of the text surrounding formulae. I will consider four possible approaches to constructing a joint embedding model: (1) linearizing the tree structure of formulae to sequences and then applying a single sequence embedding model to the text and the linearized formula, similar to [1]; (2) forming separate embeddings for text and formulae, then unifying the two embedding spaces using seed alignments obtained either through supervision or heuristics; (3) extracting a tree out of the text and then applying a structure embedding model to both trees; or (4) combining results from specialized embedding models. For example, if the task is retrieval (ranking), then in the simplest scenario the results can be combined with methods such as Reciprocal Rank Fusion (RRF) or CombMNZ (a small RRF sketch follows this abstract). I will then study how text and formula embedding models should be combined. One possible solution is to do retrieval using each of the embeddings and then combine the results. Another approach is to learn a model that provides a unified embedding capturing both formula and text features. Yet another approach to a joint embedding model is to convert text to a tree structure; I can then treat this as a tree-to-tree translation problem. For both RQ1 and RQ2, I plan to first study the effectiveness of the proposed embedding in formula retrieval before proceeding to the text+formula condition. Results will be compared with the best-reported results on the ARQMath [3] question answering task.
While part of this research focuses on creating an embedding model for math, I also need a standard evaluation protocol and dataset. In a planned three-year sequence of ARQMath labs, I aim to answer RQ3 and provide high-quality training, devtest, and test sets for math search. Importantly, ARQMath also serves as a platform for operationalizing a repeatable community-consensus definition for relevance in isolated formula search.
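Of the fusion methods mentioned in the abstract above, Reciprocal Rank Fusion is the simplest to illustrate. The sketch below (document identifiers, runs, and the constant k=60 are illustrative assumptions) fuses a hypothetical formula-embedding run with a hypothetical text-embedding run:

.. code-block:: python

    from collections import defaultdict

    def rrf(ranked_lists, k=60):
        """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
        scores = defaultdict(float)
        for ranking in ranked_lists:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    formula_run = ["d3", "d1", "d7", "d2"]   # hypothetical formula-embedding ranking
    text_run = ["d1", "d2", "d3", "d9"]      # hypothetical text-embedding ranking
    print(rrf([formula_run, text_run]))      # documents favored by both runs rise to the top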
Abstract: How to model the performance of a retrieval system before deploying it has puzzled Information Retrieval (IR) researchers for a long time. Currently, the evaluation of IR systems relies on empirical experiments. Empirical evaluation means that we need experimental collections, and building them is expensive in terms of both time and money. Exploiting already available collections to predict the performance of a system on new collections would dramatically reduce such costs. With the research line described in this work, we plan to study the development of predictive models for the performance of IR systems. In particular, the proposed research line will investigate Generalized Linear Mixed Models and Causal Inference. Furthermore, we highlight the importance of modelling performance as distributions rather than point estimates.
Abstract: Determining the reliability of online data is a challenge that has recently received increasing attention. In particular, unreliable health-related content has become pervasive during the COVID-19 pandemic. The main objective of this Ph.D. thesis is to study how end-users judge the correctness and credibility of online content and to provide them with a series of tools to assist them in assessing content reliability. To that end, we need to determine which sources of evidence may help to better assess the reliability of health-related online content, and how to combine them through learning. Finally, I will also study which presentation aspects might help end-users to better assess reliability, since previous research has shown that the format and layout of information items, combined with user-based biases, influence their final assessments.
Abstract: This research presents a recommender system designed on the basis of a bottom-up knowledge base built from textbooks. While the ontologies usually applied to such tasks are hand-crafted, our automated approach is a possible answer to the knowledge acquisition bottleneck. We extract concept hierarchies from section titles and use co-occurrences in book sections as evidence for possible contextual relationships between the entities mentioned therein. Motivated by a legal use case of recommending upcoming changes in law, the design targets three major challenges: bridging the different abstraction levels between entities in legal documents and in the parliament protocols announcing norm changes, engineering an explainable retrieval mechanism on top of the knowledge base, and offering decent usability despite a high-recall requirement. Although the system is developed for a specific legal use case, many aspects are generally applicable in the fields of recommender systems, information retrieval and information extraction, entity resolution, explainable artificial intelligence, and usability. We also validate selected parts of the system design on other applications, such as educational media research.
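As a rough illustration of the co-occurrence evidence described in the abstract above (the entities and sections below are invented; the actual system extracts them from textbooks and legal sources), section-level co-occurrence counting might look like this:

.. code-block:: python

    from collections import Counter
    from itertools import combinations

    # Each set holds the entities mentioned in one book section (illustrative data).
    sections = [
        {"data protection", "personal data", "consent"},
        {"consent", "contract", "personal data"},
        {"contract", "liability"},
    ]

    # Count how many sections each pair of entities co-occurs in.
    cooccurrence = Counter()
    for entities in sections:
        for pair in combinations(sorted(entities), 2):
            cooccurrence[pair] += 1

    # Pairs seen in more sections are stronger candidates for a contextual
    # relationship (an edge) in the bottom-up knowledge base.
    for pair, count in cooccurrence.most_common(3):
        print(pair, count)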
Abstract: Multi-document summarization is one of the most important tasks in the field of Natural Language Processing (NLP), and it has gained increasing attention in recent years. It aims to generate one summary across several topic-related documents. Compared with extractive summarization, abstractive summarization produces summaries that are more similar to human-written ones. Proposing effective and efficient abstractive multi-document summarization models is therefore significant to the NLP community. Existing deep learning based multi-document summarization models rely on the exceptional ability of neural networks to extract distinct features. However, they miss important linguistic knowledge, such as dependencies between words, even though the linguistic information in texts carries meaningful knowledge about the input documents. Besides, how models automatically evaluate summary quality is crucial to designing a high-performance summarization model, since the evaluation indicator objectively measures the effectiveness of a method. In this proposal, we bring forward two research questions and corresponding solutions for the abstractive multi-document summarization task.
Abstract: A study conducted by the International Data Corporation predicted that by the year 2021, the total amount of digital information resources would have reached the 40 zettabyte mark [2]. According to a rule formulated by Merrill Lynch, 80 to 90% of these resources are unstructured [7]. Despite this, users expect digital libraries to provide them with fast and interpretable access to digital information resources that will satisfy their information need. Math information retrieval emerged as a subfield of information retrieval in 2008 [8], when it became clear that standard information retrieval techniques used for text documents are inadequate to accurately retrieve documents in digital mathematical libraries.
Abstract: Research on Query Performance Prediction (QPP) focuses on estimating the effectiveness of retrieval results in the absence of human relevance judgments. Accurately estimating the result of a search performed in response to a query has been extensively studied over the past two decades. With the rising popularity of virtual assistants and evolving research on complex information needs, the need for reliable QPP methods as well as the number of potential applications significantly increase. In this work, we focus on improving the evaluation framework of QPP. As we see the existing evaluation as a considerable limitation on the improvement of QPP methods, a reliable and improved evaluation framework would constitute a stepping stone toward a breakthrough in QPP. The existing evaluation framework in QPP mainly relies on measuring the correlation coefficient between the per-query prediction scores and the actual per-query system effectiveness measure, usually Average Precision (AP). The QPP method that achieves higher correlation is considered superior. However, Hauff et al. demonstrate that higher correlation does not vouch for more accurate prediction. The authors additionally advocate the use of Fisher's transformation and Confidence Intervals (CIs) to determine statistically significant differences between multiple correlation coefficients. Furthermore, the existing evaluation methodology holds only for a specific combination of corpus, retrieval method, and set of queries, and does not necessarily hold if any of these is changed. That is, the existing evaluation is not agnostic to the different components, so any conclusions about the relative prediction quality of QPP methods should be taken with a grain of salt. In the proposed research we aim to develop a better evaluation technique to reliably compare the performance of QPP methods. We intend to develop a new evaluation framework and standards that will simultaneously enable the utilization of query variants and take into consideration other confounding factors in QPP evaluation. Specifically, we raise the following research questions: (i) What limitations exist in the current evaluation practices of QPP? (ii) What are the best approaches to perform detailed failure analysis of query performance predictor results? (iii) How do existing QPP methods differ in performance on a set of topics (distinct information needs) represented by a single query versus a set of multiple queries representing the same information need? (iv) How do the existing and new evaluation methodologies align with user satisfaction? To answer the first two research questions, Faggioli et al. proposed a new evaluation framework for QPP. In the proposed framework, an error is calculated for each query, resulting in a distribution of per-query errors over a set of queries. This distribution of errors enables the authors to apply an N-way ANalysis Of VAriance (ANOVA) followed by a post-hoc analysis, Tukey's Honestly Significant Difference (HSD) test, to determine statistically significant differences between the multiple factors involved in QPP evaluation. Separating the different components in the evaluation process allows reaching more reliable conclusions regarding the effect of each component on the prediction process. As a preliminary study, Zendel et al.
compared multiple existing QPP methods on the aforementioned tasks: predicting the effectiveness of different queries representing different topics, and of different query variants representing the same topic. They found that the difference in AP between the queries is an important confounding factor that affects prediction quality. Future work will focus on developing a reliable evaluation framework for QPP both for queries from different topics and for query variants of the same topic. A suitable framework should enable rigorous statistical analysis with decomposition and quantification of the different factors that affect QPP. In addition, a subsequent user study will explore how the new evaluation framework aligns with user satisfaction with QPP results.
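For context, the standard correlation-based evaluation that this proposal critiques, together with the Fisher-transformation confidence interval advocated by Hauff et al., can be sketched as follows. The predictor scores and per-query AP values below are made-up illustrative numbers, not real measurements:

.. code-block:: python

    import math
    import numpy as np

    pred = np.array([0.31, 0.12, 0.55, 0.40, 0.22, 0.61, 0.08, 0.47])  # predictor scores
    ap   = np.array([0.35, 0.10, 0.48, 0.42, 0.30, 0.58, 0.05, 0.39])  # per-query AP

    r = np.corrcoef(pred, ap)[0, 1]          # Pearson correlation with true effectiveness
    n = len(pred)
    z = 0.5 * math.log((1 + r) / (1 - r))    # Fisher's z-transformation of r
    se = 1.0 / math.sqrt(n - 3)              # standard error of z
    lo, hi = z - 1.96 * se, z + 1.96 * se    # 95% CI in z-space
    to_r = lambda v: (math.exp(2 * v) - 1) / (math.exp(2 * v) + 1)  # back-transform
    print(f"r = {r:.3f}, 95% CI = [{to_r(lo):.3f}, {to_r(hi):.3f}]")

With so few queries the interval is wide, which is precisely why comparing predictors by a single correlation value, without CIs or per-query error analysis, can be misleading.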
Abstract: Scientific article summarization is a challenging task, not least due to the lack of large annotated corpora. In this research proposal, we present an approach to constructing a large annotated corpus for scientific articles using semi-supervised/automatic annotation approaches. We intend to apply deep learning methods to grow a small annotated seed corpus. Then, we will measure the quality of the annotated corpus on downstream informative summarization using various evaluation techniques.
Abstract: Social media plays an important role as a source of information during crisis events. It allows for more rapid dissemination of critical information than traditional news media, as its users can provide immediate information from the locations where events are unfolding. Several studies have addressed the automatic detection of crisis-related messages to contribute to disaster management and humanitarian assistance. However, most of them have focused on a particular language (usually English) or type of event, which limits their applicability to other contexts. The lack of labeled data in different languages and types of disasters poses a major obstacle to the application of supervised learning-based approaches to more diverse scenarios. To address this problem, this research aims to characterize messages related to diverse crisis domains in a language-agnostic manner in order to construct multilingual crisis detectors. To achieve this, we propose a comprehensive evaluation of transfer learning performance in terms of crisis domain and language, comparing different data representations and classification techniques.
Abstract: Web search increasingly provides a platform for users to seek advice on important personal decisions but may be biased in several different ways. One result of such biases is the search engine manipulation effect (SEME): when a list of search results relates to a debated topic (e.g., veganism) and promotes documents pertaining to a particular viewpoint (e.g., by ranking them higher), users tend to adopt this advantaged viewpoint. However, the detection and mitigation of SEME are complicated by the current lack of empirical understanding of its underlying mechanisms. This dissertation aims to investigate which algorithmic and cognitive biases play a role in SEME concerning debated topics, and to what degree. RQ1. What set of labels can accurately represent viewpoints of textual documents on debated topics? Studying algorithmic and cognitive biases in the context of web search on debated topics requires accurate labeling of documents. RQ1 investigates how best to represent viewpoints of textual documents on debated topics. The first step in this work was introducing perspectives as an additional dimension of viewpoint labels for textual documents (i.e., adding people's underlying motivations for taking a given stance) and showing how they can be automatically discovered using Joint Topic Models. My future research will evaluate whether viewpoint labels consisting of stances and perspectives are accurate representations (or whether more nuanced notions are necessary) and describe how to obtain these labels. The work on RQ1 will result in a framework to accurately represent viewpoints on debated topics expressed by textual documents. This will allow for algorithmic assessment of viewpoint-related ranking bias in search results and alignment of document viewpoints with users' viewpoints. RQ2. What methods can automatically measure viewpoint-related ranking bias in search results? Several methods have been proposed to measure ranking bias, fairness, and diversity in search results. RQ2 investigates which of these (or novel) methods can be used to assess viewpoint-related ranking bias. The first contribution to RQ2 was demonstrating how to assess viewpoint-related ranking bias in search results using ranking fairness metrics for categorical viewpoint labels, and evaluating which specific methods work best in which situation. Going forward, I plan to develop methods that assess viewpoint-related ranking bias in more complex settings. Furthermore, I aim to assess viewpoint-related ranking bias in real search results on debated topics. This work will contribute novel evaluation metrics that measure viewpoint-related ranking bias in search results, a set of guidelines for when and how to use them together with a web-based demo, as well as directions for practitioners regarding viewpoint-related ranking bias in real search results. RQ3. What cognitive biases may contribute to the process of attitude change on debated topics in users of web search engines? Being able to measure algorithmic ranking bias is not yet enough to understand its effect on human behavior. RQ3 aims at understanding which specific cognitive biases are responsible for SEME; i.e., what reasoning mistakes users make when they change their attitudes after viewing search results. The first contribution to RQ3 was evaluating in a user study whether order effects alone can cause SEME.
We found that this may not be the case and describe exploratory results that show that exposure effects may play a more important role in causing SEME than previously anticipated. My future work in this area will consider findings from RQ1 and RQ2 to draw more realistic scenarios of SEME and study interactions between algorithmic and different cognitive biases. The result of this work will be a set of guidelines for how SEME could be avoided by mitigating cognitive user biases in web search.
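As one concrete but deliberately simplified way to operationalize the viewpoint-related ranking bias discussed under RQ2 in the abstract above (this is not the dissertation's actual metric), one can compare each viewpoint's share of rank-discounted exposure with its share of the result list. The viewpoint labels and the example ranking below are illustrative assumptions:

.. code-block:: python

    import math
    from collections import defaultdict

    def exposure_share(labels):
        """DCG-style exposure (1 / log2(rank + 1)) aggregated per viewpoint label."""
        exposure = defaultdict(float)
        for rank, label in enumerate(labels, start=1):
            exposure[label] += 1.0 / math.log2(rank + 1)
        total = sum(exposure.values())
        return {label: e / total for label, e in exposure.items()}

    # Hypothetical viewpoint labels of a 8-result ranking on a debated topic.
    ranking = ["pro", "pro", "pro", "con", "pro", "con", "con", "neutral"]
    shares = exposure_share(ranking)
    proportions = {v: ranking.count(v) / len(ranking) for v in set(ranking)}
    for v in proportions:
        print(f"{v}: exposure {shares[v]:.2f} vs. proportion {proportions[v]:.2f}")

A viewpoint whose exposure share clearly exceeds its share of the list is advantaged by the ranking, which is the kind of imbalance the RQ2 metrics are meant to detect.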