New On the near-optimality of betting confidence sets for bounded means. (arXiv:2310.01547v2 [math.ST] UPDATED)
22 hours, 43 minutes ago

Constructing nonasymptotic confidence intervals (CIs) for the mean of a univariate distribution from independent and identically distributed (i.i.d.) observations is a fundamental task in statistics. For bounded observations, a classical nonparametric approach proceeds by inverting standard concentration bounds, such as Hoeffding's or Bernstein's inequalities. Recently, an alternative betting-based approach for defining CIs and their time-uniform variants called confidence sequences (CSs), has been shown to be empirically superior to the classical methods. In this paper, we provide theoretical justification for this improved empirical performance of betting CIs and CSs.

Our main contributions are as follows: (i) We first compare CIs using the values of their first-order asymptotic widths (scaled by $\sqrt{n}$), and show that the betting CI of Waudby-Smith and Ramdas (2023) has a smaller limiting width than existing empirical Bernstein (EB)-CIs. (ii) Next, we establish two lower bounds that characterize the minimum width achievable by any method for constructing CIs/CSs in terms of certain inverse information projections. (iii) Finally, we show that the betting CI and CS match the fundamental limits, modulo an additive logarithmic term and a multiplicative constant. Overall these results imply that the betting CI~(and CS) admit stronger theoretical guarantees than the existing state-of-the-art EB-CI~(and CS); both in the asymptotic and finite-sample regimes.

New The Lipschitz-Variance-Margin Tradeoff for Enhanced Randomized Smoothing. (arXiv:2309.16883v2 [cs.LG] UPDATED)
22 hours, 43 minutes ago

Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks. The certified radius is in this context a crucial indicator of the robustness of models. However how to design an efficient classifier with a sufficient certified radius? Randomized smoothing provides a promising framework by relying on noise injection in inputs to obtain a smoothed and more robust classifier. In this paper, we first show that the variance introduced by randomized smoothing closely interacts with two other important properties of the classifier, \textit{i.e.} its Lipschitz constant and margin. More precisely, our work emphasizes the dual impact of the Lipschitz constant of the base classifier, on both the smoothed classifier and the empirical variance. Moreover, to increase the certified robust radius, we introduce a different simplex projection technique for the base classifier to leverage the variance-margin trade-off thanks to Bernstein's concentration inequality, along with an enhanced Lipschitz bound. Experimental results show a significant improvement in certified accuracy compared to current state-of-the-art methods. Our novel certification procedure allows us to use pre-trained models that are used with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.

New Optimal and Fair Encouragement Policy Evaluation and Learning. (arXiv:2309.07176v2 [cs.LG] UPDATED)
22 hours, 43 minutes ago

In consequential domains, it is often impossible to compel individuals to take treatment, so that optimal policy rules are merely suggestions in the presence of human non-adherence to treatment recommendations. In these same domains, there may be heterogeneity both in who responds in taking-up treatment, and heterogeneity in treatment efficacy. While optimal treatment rules can maximize causal outcomes across the population, access parity constraints or other fairness considerations can be relevant in the case of encouragement. For example, in social services, a persistent puzzle is the gap in take-up of beneficial services among those who may benefit from them the most. When in addition the decision-maker has distributional preferences over both access and average outcomes, the optimal decision rule changes. We study causal identification, statistical variance-reduced estimation, and robust estimation of optimal treatment rules, including under potential violations of positivity. We consider fairness constraints such as demographic parity in treatment take-up, and other constraints, via constrained optimization. Our framework can be extended to handle algorithmic recommendations under an often-reasonable covariate-conditional exclusion restriction, using our robustness checks for lack of positivity in the recommendation. We develop a two-stage algorithm for solving over parametrized policy classes under general constraints to obtain variance-sensitive regret bounds. We illustrate the methods in two case studies based on data from randomized encouragement to enroll in insurance and from pretrial supervised release with electronic monitoring.

New Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation. (arXiv:2308.15709v2 [cs.LG] UPDATED)
22 hours, 43 minutes ago

Data valuation aims to quantify the usefulness of individual data sources in training machine learning (ML) models, and is a critical aspect of data-centric ML research. However, data valuation faces significant yet frequently overlooked privacy challenges despite its importance. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods nowadays. We first emphasize the inherent privacy risks of KNN-Shapley, and demonstrate the significant technical difficulties in adapting KNN-Shapley to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly, allowing for straightforward modifications to incorporate DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves comparable performance as KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.

New Will More Expressive Graph Neural Networks do Better on Generative Tasks?. (arXiv:2308.11978v2 [cs.LG] UPDATED)
22 hours, 43 minutes ago

Graph generation poses a significant challenge as it involves predicting a complete graph with multiple nodes and edges based on simply a given label. This task also carries fundamental importance to numerous real-world applications, including de-novo drug and molecular design. In recent years, several successful methods have emerged in the field of graph generation. However, these approaches suffer from two significant shortcomings: (1) the underlying Graph Neural Network (GNN) architectures used in these methods are often underexplored; and (2) these methods are often evaluated on only a limited number of metrics. To fill this gap, we investigate the expressiveness of GNNs under the context of the molecular graph generation task, by replacing the underlying GNNs of graph generative models with more expressive GNNs. Specifically, we analyse the performance of six GNNs on six different molecular generative objectives on the ZINC-250k dataset in two different generative frameworks: autoregressive generation models, such as GCPN and GraphAF, and one-shot generation models, such as GraphEBM. Through our extensive experiments, we demonstrate that advanced GNNs can indeed improve the performance of GCPN, GraphAF, and GraphEBM on molecular generation tasks, but GNN expressiveness is not a necessary condition for a good GNN-based generative model. Moreover, we show that GCPN and GraphAF with advanced GNNs can achieve state-of-the-art results across 17 other non-GNN-based graph generative approaches, such as variational autoencoders and Bayesian optimisation models, on the proposed molecular generative objectives (DRD2, Median1, Median2), which are important metrics for de-novo molecular design.

New On the Linear Convergence of Policy Gradient under Hadamard Parameterization. (arXiv:2305.19575v2 [math.OC] UPDATED)
22 hours, 43 minutes ago

The convergence of deterministic policy gradient under the Hadamard parameterization is studied in the tabular setting and the linear convergence of the algorithm is established. To this end, we first show that the error decreases at an $O(\frac{1}{k})$ rate for all the iterations. Based on this result, we further show that the algorithm has a faster local linear convergence rate after $k_0$ iterations, where $k_0$ is a constant that only depends on the MDP problem and the initialization. To show the local linear convergence of the algorithm, we have indeed established the contraction of the sub-optimal probability $b_s^k$ (i.e., the probability of the output policy $\pi^k$ on non-optimal actions) when $k\ge k_0$.

New SAMoSSA: Multivariate Singular Spectrum Analysis with Stochastic Autoregressive Noise. (arXiv:2305.16491v2 [cs.LG] UPDATED)
22 hours, 43 minutes ago

The well-established practice of time series analysis involves estimating deterministic, non-stationary trend and seasonality components followed by learning the residual stochastic, stationary components. Recently, it has been shown that one can learn the deterministic non-stationary components accurately using multivariate Singular Spectrum Analysis (mSSA) in the absence of a correlated stationary component; meanwhile, in the absence of deterministic non-stationary components, the Autoregressive (AR) stationary component can also be learnt readily, e.g. via Ordinary Least Squares (OLS). However, a theoretical underpinning of multi-stage learning algorithms involving both deterministic and stationary components has been absent in the literature despite its pervasiveness. We resolve this open question by establishing desirable theoretical guarantees for a natural two-stage algorithm, where mSSA is first applied to estimate the non-stationary components despite the presence of a correlated stationary AR component, which is subsequently learned from the residual time series. We provide a finite-sample forecasting consistency bound for the proposed algorithm, SAMoSSA, which is data-driven and thus requires minimal parameter tuning. To establish theoretical guarantees, we overcome three hurdles: (i) we characterize the spectra of Page matrices of stable AR processes, thus extending the analysis of mSSA; (ii) we extend the analysis of AR process identification in the presence of arbitrary bounded perturbations; (iii) we characterize the out-of-sample or forecasting error, as opposed to solely considering model identification. Through representative empirical studies, we validate the superior performance of SAMoSSA compared to existing baselines. Notably, SAMoSSA's ability to account for AR noise structure yields improvements ranging from 5% to 37% across various benchmark datasets.

New DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering Algorithm via Variational Auto-Encoder. (arXiv:2305.14067v3 [cs.LG] UPDATED)
22 hours, 43 minutes ago

Generative model-based deep clustering frameworks excel in classifying complex data, but are limited in handling dynamic and complex features because they require prior knowledge of the number of clusters. In this paper, we propose a nonparametric deep clustering framework that employs an infinite mixture of Gaussians as a prior. Our framework utilizes a memoized online variational inference method that enables the "birth" and "merge" moves of clusters, allowing our framework to cluster data in a "dynamic-adaptive" manner, without requiring prior knowledge of the number of features. We name the framework as DIVA, a Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder. Our framework, which outperforms state-of-the-art baselines, exhibits superior performance in classifying complex data with dynamically changing features, particularly in the case of incremental features. We released our source code implementation at:

New A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms". (arXiv:2304.04258v2 [stat.ML] UPDATED)
22 hours, 43 minutes ago

Data valuation is a growing research field that studies the influence of individual data points for machine learning (ML) models. Data Shapley, inspired by cooperative game theory and economics, is an effective method for data valuation. However, it is well-known that the Shapley value (SV) can be computationally expensive. Fortunately, Jia et al. (2019) showed that for K-Nearest Neighbors (KNN) models, the computation of Data Shapley is surprisingly simple and efficient.

In this note, we revisit the work of Jia et al. (2019) and propose a more natural and interpretable utility function that better reflects the performance of KNN models. We derive the corresponding calculation procedure for the Data Shapley of KNN classifiers/regressors with the new utility functions. Our new approach, dubbed soft-label KNN-SV, achieves the same time complexity as the original method. We further provide an efficient approximation algorithm for soft-label KNN-SV based on locality sensitive hashing (LSH). Our experimental results demonstrate that Soft-label KNN-SV outperforms the original method on most datasets in the task of mislabeled data detection, making it a better baseline for future work on data valuation.

New Dimensionality Reduction and Wasserstein Stability for Kernel Regression. (arXiv:2203.09347v3 [stat.ML] UPDATED)
22 hours, 43 minutes ago

In a high-dimensional regression framework, we study consequences of the naive two-step procedure where first the dimension of the input variables is reduced and second, the reduced input variables are used to predict the output variable with kernel regression. In order to analyze the resulting regression errors, a novel stability result for kernel regression with respect to the Wasserstein distance is derived. This allows us to bound errors that occur when perturbed input data is used to fit the regression function. We apply the general stability result to principal component analysis (PCA). Exploiting known estimates from the literature on both principal component analysis and kernel regression, we deduce convergence rates for the two-step procedure. The latter turns out to be particularly useful in a semi-supervised setting.

A Deep Learning Method for Comparing Bayesian Hierarchical Models. (arXiv:2301.11873v4 [stat.ML] UPDATED)
5 days, 8 hours ago

Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then showcase our method by comparing four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. Additionally, we demonstrate how transfer learning can be leveraged to enhance training efficiency. We provide reproducible code for all analyses and an open-source implementation of our method.

Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v6 [eess.IV] UPDATED)
3 weeks, 2 days ago

This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional "content" latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining "texture" variables characterizing the diffusion process are synthesized at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving multiple datasets and image quality assessment metrics show that our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics. Furthermore, training the diffusion with X-parameterization enables high-quality reconstructions in only a handful of decoding steps, greatly affecting the model's practicality.

Causal Falsification of Digital Twins. (arXiv:2301.07210v4 [stat.ME] UPDATED)
3 weeks, 3 days ago

Digital twins are virtual systems designed to predict how a real-world process will evolve in response to interventions. This modelling paradigm holds substantial promise in many applications, but rigorous procedures for assessing their accuracy are essential for safety-critical settings. We consider how to assess the accuracy of a digital twin using real-world data. We formulate this as causal inference problem, which leads to a precise definition of what it means for a twin to be "correct" appropriate for many applications. Unfortunately, fundamental results from causal inference mean observational data cannot be used to certify that a twin is correct in this sense unless potentially tenuous assumptions are made, such as that the data are unconfounded. To avoid these assumptions, we propose instead to find situations in which the twin is not correct, and present a general-purpose statistical procedure for doing so. Our approach yields reliable and actionable information about the twin under only the assumption of an i.i.d. dataset of observational trajectories, and remains sound even if the data are confounded. We apply our methodology to a large-scale, real-world case study involving sepsis modelling within the Pulse Physiology Engine, which we assess using the MIMIC-III dataset of ICU patients.

Curvature Filtrations for Graph Generative Model Evaluation. (arXiv:2301.12906v3 [cs.LG] UPDATED)
1 month ago

Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property that has recently proved its utility in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with emerging methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.

Characterizing the Implicit Bias of Regularized SGD in Rank Minimization. (arXiv:2206.05794v6 [cs.LG] UPDATED)
1 month ago

We study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices and applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization.

Deep Riemannian Networks for EEG Decoding. (arXiv:2212.10426v6 [cs.LG] UPDATED)
3 months, 4 weeks ago

State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning (DL) or Riemannian-Geometry-based decoders (RBDs). Recently, there is growing interest in Deep Riemannian Networks (DRNs) possibly combining the advantages of both previous classes of methods. However, there are still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability.How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork in the area of these topics through the analysis of DRNs for EEG with a wide range of hyperparameters. Networks were tested on two public EEG datasets and compared with state-of-the-art ConvNets. Here we propose end-to-end EEG SPDNet (EE(G)-SPDNet), and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible loss of Riemannian specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need of handcrafted filterbanks and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.

SpArX: Sparse Argumentative Explanations for Neural Networks [Technical Report]. (arXiv:2301.09559v3 [cs.AI] UPDATED)
3 months, 4 weeks ago

Neural networks (NNs) have various applications in AI, but explaining their decisions remains challenging. Existing approaches often focus on explaining how changing individual inputs affects NNs' outputs. However, an explanation that is consistent with the input-output behaviour of an NN is not necessarily faithful to the actual mechanics thereof. In this paper, we exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining as much of the original structure as possible. It then translates the sparse MLP into an equivalent QAF to shed light on the underlying decision process of the MLP, producing global and/or local explanations. We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of MLPs.

Robust empirical risk minimization via Newton's method. (arXiv:2301.13192v2 [stat.ML] UPDATED)
4 months, 1 week ago

A new variant of Newton's method for empirical risk minimization is studied, where at each iteration of the optimization algorithm, the gradient and Hessian of the objective function are replaced by robust estimators taken from existing literature on robust mean estimation for multivariate data. After proving a general theorem about the convergence of successive iterates to a small ball around the population-level minimizer, consequences of the theory in generalized linear models are studied when data are generated from Huber's epsilon-contamination model and/or heavytailed distributions. An algorithm for obtaining robust Newton directions based on the conjugate gradient method is also proposed, which may be more appropriate for high-dimensional settings, and conjectures about the convergence of the resulting algorithm are offered. Compared to robust gradient descent, the proposed algorithm enjoys the faster rates of convergence for successive iterates often achieved by second-order algorithms for convex problems, i.e., quadratic convergence in a neighborhood of the optimum, with a stepsize that may be chosen adaptively via backtracking linesearch.

A kernel Stein test of goodness of fit for sequential models. (arXiv:2210.10741v3 [stat.ML] UPDATED)
4 months, 2 weeks ago

We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.

DynDepNet: Learning Time-Varying Dependency Structures from fMRI Data via Dynamic Graph Structure Learning. (arXiv:2209.13513v3 [cs.LG] UPDATED)
4 months, 2 weeks ago

Graph neural networks (GNNs) have demonstrated success in learning representations of brain graphs derived from functional magnetic resonance imaging (fMRI) data. However, existing GNN methods assume brain graphs are static over time and the graph adjacency matrix is known prior to model training. These assumptions contradict evidence that brain graphs are time-varying with a connectivity structure that depends on the choice of functional connectivity measure. Incorrectly representing fMRI data with noisy brain graphs can adversely affect GNN performance. To address this, we propose DynDepNet, a novel method for learning the optimal time-varying dependency structure of fMRI data induced by downstream prediction tasks. Experiments on real-world fMRI datasets, for the task of sex classification, demonstrate that DynDepNet achieves state-of-the-art results, outperforming the best baseline in terms of accuracy by approximately 8 and 6 percentage points, respectively. Furthermore, analysis of the learned dynamic graphs reveals prediction-related brain regions consistent with existing neuroscience literature.

Forecasting Response to Treatment with Global Deep Learning and Patient-Specific Pharmacokinetic Priors. (arXiv:2309.13135v5 [cs.LG] UPDATED)
2 weeks, 1 day ago

Forecasting healthcare time series is crucial for early detection of adverse outcomes and for patient monitoring. Forecasting, however, can be difficult in practice due to noisy and intermittent data. The challenges are often exacerbated by change points induced via extrinsic factors, such as the administration of medication. To address these challenges, we propose a novel hybrid global-local architecture and a pharmacokinetic encoder that informs deep learning models of patient-specific treatment effects. We showcase the efficacy of our approach in achieving significant accuracy gains for a blood glucose forecasting task using both realistically simulated and real-world data. Our global-local architecture improves over patient-specific models by 9.2-14.6%. Additionally, our pharmacokinetic encoder improves over alternative encoding techniques by 4.4% on simulated data and 2.1% on real-world data. The proposed approach can have multiple beneficial applications in clinical practice, such as issuing early warnings about unexpected treatment responses, or helping to characterize patient-specific treatment effects in terms of drug absorption and elimination characteristics.

AnglE-optimized Text Embeddings. (arXiv:2309.12871v6 [cs.CL] UPDATED)
2 weeks, 5 days ago

High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge existing text embedding models face is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zone in the cosine function, which can impede gradient and hinder optimization processes. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness of angle optimization in STS.

Understanding Addition in Transformers. (arXiv:2310.13121v3 [cs.LG] UPDATED)
2 weeks, 6 days ago

Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use. This paper presents an in-depth analysis of a one-layer Transformer model trained for n-digit integer addition. We reveal that the model divides the task into parallel, digit-specific streams and employs distinct algorithms for different digit positions. Our study also finds that the model starts calculations late but executes them rapidly. A rare use case with high loss is identified and explained. Overall, the model's algorithm is explained in detail. These findings are validated through rigorous testing and mathematical modeling, contributing to the broader works in Mechanistic Interpretability, AI safety, and alignment. Our approach opens the door for analyzing more complex tasks and multi-layer Transformer models.

Towards Understanding Sycophancy in Language Models. (arXiv:2310.13548v3 [cs.CL] UPDATED)
4 weeks, 1 day ago

Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.

Evaluating Robustness and Uncertainty of Graph Models Under Structural Distributional Shifts. (arXiv:2302.13875v4 [cs.LG] UPDATED)
4 weeks, 2 days ago

In reliable decision-making systems based on machine learning, models have to be robust to distributional shifts or provide the uncertainty of their predictions. In node-level problems of graph learning, distributional shifts can be especially complex since the samples are interdependent. To evaluate the performance of graph models, it is important to test them on diverse and meaningful distributional shifts. However, most graph benchmarks considering distributional shifts for node-level problems focus mainly on node features, while structural properties are also essential for graph problems. In this work, we propose a general approach for inducing diverse distributional shifts based on graph structure. We use this approach to create data splits according to several structural node properties: popularity, locality, and density. In our experiments, we thoroughly evaluate the proposed distributional shifts and show that they can be quite challenging for existing graph models. We also reveal that simple models often outperform more sophisticated methods on the considered structural shifts. Finally, our experiments provide evidence that there is a trade-off between the quality of learned representations for the base classification task under structural distributional shift and the ability to separate the nodes from different distributions using these representations.

WebArena: A Realistic Web Environment for Building Autonomous Agents. (arXiv:2307.13854v3 [cs.AI] UPDATED)
1 month ago

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.

Learning Unseen Modality Interaction. (arXiv:2306.12795v3 [cs.CV] UPDATED)
1 month ago

Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. In this paper, we challenge this modality-complete assumption for multimodal learning and instead strive for generalization to unseen modality combinations during inference. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved. This allows the information to be accumulated with a simple summation operation across available modalities. To reduce overfitting to less discriminative modality combinations during training, we further improve the model learning with pseudo-supervision indicating the reliability of a modality's prediction. We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. Project website:

Most Neural Networks Are Almost Learnable. (arXiv:2305.16508v3 [cs.LG] UPDATED)
1 month ago

We present a PTAS for learning random constant-depth networks. We show that for any fixed $\epsilon>0$ and depth $i$, there is a poly-time algorithm that for any distribution on $\sqrt{d} \cdot \mathbb{S}^{d-1}$ learns random Xavier networks of depth $i$, up to an additive error of $\epsilon$. The algorithm runs in time and sample complexity of $(\bar{d})^{\mathrm{poly}(\epsilon^{-1})}$, where $\bar d$ is the size of the network. For some cases of sigmoid and ReLU-like activations the bound can be improved to $(\bar{d})^{\mathrm{polylog}(\epsilon^{-1})}$, resulting in a quasi-poly-time algorithm for learning constant depth random networks.

Initialization-Dependent Sample Complexity of Linear Predictors and Neural Networks. (arXiv:2305.16475v2 [cs.LG] UPDATED)
1 month ago

We provide several new results on the sample complexity of vector-valued linear predictors (parameterized by a matrix), and more generally neural networks. Focusing on size-independent bounds, where only the Frobenius norm distance of the parameters from some fixed reference matrix $W_0$ is controlled, we show that the sample complexity behavior can be surprisingly different than what we may expect considering the well-studied setting of scalar-valued linear predictors. This also leads to new sample complexity bounds for feed-forward neural networks, tackling some open questions in the literature, and establishing a new convex linear prediction problem that is provably learnable without uniform convergence.

Leveraging the two timescale regime to demonstrate convergence of neural networks. (arXiv:2304.09576v2 [math.OC] UPDATED)
1 month ago

We study the training dynamics of shallow neural networks, in a two-timescale regime in which the stepsizes for the inner layer are much smaller than those for the outer layer. In this regime, we prove convergence of the gradient flow to a global optimum of the non-convex optimization problem in a simple univariate setting. The number of neurons need not be asymptotically large for our result to hold, distinguishing our result from popular recent approaches such as the neural tangent kernel or mean-field regimes. Experimental illustration is provided, showing that the stochastic gradient descent behaves according to our description of the gradient flow and thus converges to a global optimum in the two-timescale regime, but can fail outside of this regime.

Optimized Sparse Matrix Operations for Reverse Mode Automatic Differentiation. (arXiv:2212.05159v3 [cs.LG] UPDATED)
2 weeks, 1 day ago

Sparse matrix representations are ubiquitous in computational science and machine learning, leading to significant reductions in compute time, in comparison to dense representation, for problems that have local connectivity. The adoption of sparse representation in leading ML frameworks such as PyTorch is incomplete, however, with support for both automatic differentiation and GPU acceleration missing. In this work, we present an implementation of a CSR-based sparse matrix wrapper for PyTorch with CUDA acceleration for basic matrix operations, as well as automatic differentiability. We also present several applications of the resulting sparse kernels to optimization problems, demonstrating ease of implementation and performance measurements versus their dense counterparts.

On the Algebraic Properties of Flame Graphs. (arXiv:2301.08941v2 [cs.SE] UPDATED)
9 months ago

Flame graphs are a popular way of representing profiling data. In this paper we propose a possible mathematical definition of flame graphs. In doing so, we gain some interesting algebraic properties almost for free, which in turn allow us to define some operations that can allow to perform an in-depth performance regression analysis. The typical documented use of a flame graph is via its graphical representation, whereby one scans the picture for the largest plateaux. Whilst this method is effective at finding the main sources of performance issues, it leaves quite a large amount of data potentially unused. By combining a mathematical precise definition of flame graphs with some statistical methods we show how to generalise this visual procedure and make the best of the full set of collected profiling data.

Exploring Effects of Computational Parameter Changes to Image Recognition Systems. (arXiv:2211.00471v4 [cs.LG] UPDATED)
9 months ago

Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and FPGAs for fast, timely processing. Failure in real-time image recognition tasks can occur due to incorrect mapping on hardware accelerators, which may lead to timing uncertainty and incorrect behavior. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment as parameters like deep learning frameworks, compiler optimizations for code generation, and hardware devices are not regulated with varying impact on model performance and correctness. In this paper we conduct robustness analysis of four popular image recognition models (MobileNetV2, ResNet101V2, DenseNet121 and InceptionV3) with the ImageNet dataset, assessing the impact of the following parameters in the model's computational environment: (1) deep learning frameworks; (2) compiler optimizations; and (3) hardware devices. We report sensitivity of model performance in terms of output label and inference time for changes in each of these environment parameters. We find that output label predictions for all four models are sensitive to choice of deep learning framework (by up to 57%) and insensitive to other parameters. On the other hand, model inference time was affected by all environment parameters with changes in hardware device having the most effect. The extent of effect was not uniform across models.

mlpack 4: a fast, header-only C++ machine learning library. (arXiv:2302.00820v1 [cs.MS])
9 months, 3 weeks ago

For over 15 years, the mlpack machine learning library has served as a "swiss army knife" for C++-based machine learning. Its efficient implementations of common and cutting-edge machine learning algorithms have been used in a wide variety of scientific and industrial applications. This paper overviews mlpack 4, a significant upgrade over its predecessor. The library has been significantly refactored and redesigned to facilitate an easier prototyping-to-deployment pipeline, including bindings to other languages (Python, Julia, R, Go, and the command line) that allow prototyping to be seamlessly performed in environments other than C++. mlpack is open-source software, distributed under the permissive 3-clause BSD license; it can be obtained at

Disciplined Saddle Programming. (arXiv:2301.13427v1 [math.OC])
9 months, 4 weeks ago

We consider convex-concave saddle point problems, and more generally convex optimization problems we refer to as $\textit{saddle problems}$, which include the partial supremum or infimum of convex-concave saddle functions. Saddle problems arise in a wide range of applications, including game theory, machine learning, and finance. It is well known that a saddle problem can be reduced to a single convex optimization problem by dualizing either the convex (min) or concave (max) objectives, reducing a min-max problem into a min-min (or max-max) problem. Carrying out this conversion by hand can be tedious and error prone. In this paper we introduce $\textit{disciplined saddle programming}$ (DSP), a domain specific language (DSL) for specifying saddle problems, for which the dualizing trick can be automated. The language and methods are based on recent work by Juditsky and Nemirovski arXiv:2102.01002 [math.OC], who developed the idea of conic-representable saddle point programs, and showed how to carry out the required dualization automatically using conic duality. Juditsky and Nemirovski's conic representation of saddle problems extends Nesterov and Nemirovski's earlier development of conic representable convex problems; DSP can be thought of as extending disciplined convex programming (DCP) to saddle problems. Just as DCP makes it easy for users to formulate and solve complex convex problems, DSP allows users to easily formulate and solve saddle problems. Our method is implemented in an open-source package, also called DSP.

A matrix-free high-order solver for the numerical solution of cardiac electrophysiology. (arXiv:2205.05136v3 [math.NA] UPDATED)
9 months, 4 weeks ago

We propose a matrix-free solver for the numerical solution of the cardiac electrophysiology model consisting of the monodomain nonlinear reaction-diffusion equation coupled with a system of ordinary differential equations for the ionic species. Our numerical approximation is based on the high-order Spectral Element Method (SEM) to achieve accurate numerical discretization while employing a much smaller number of Degrees of Freedom than first-order Finite Elements. We combine vectorization with sum-factorization, thus allowing for a very efficient use of high-order polynomials in a high performance computing framework. We validate the effectiveness of our matrix-free solver in a variety of applications and perform different electrophysiological simulations ranging from a simple slab of cardiac tissue to a realistic four-chamber heart geometry. We compare SEM to SEM with Numerical Integration (SEM-NI), showing that they provide comparable results in terms of accuracy and efficiency. In both cases, increasing the local polynomial degree $p$ leads to better numerical results and smaller computational times than reducing the mesh size $h$. We also implement a matrix-free Geometric Multigrid preconditioner that results in a comparable number of linear solver iterations with respect to a state-of-the-art matrix-based Algebraic Multigrid preconditioner. As a matter of fact, the matrix-free solver proposed here yields up to 45$\times$ speed-up with respect to a conventional matrix-based solver.

Residual regularization path-following methods for linear complementarity problems. (arXiv:2205.10727v2 [math.OC] UPDATED)
10 months ago

In this article, we consider the residual regularization path-following method with the trust-region updating strategy for the linear complementarity problem. This time-stepping selection based on the trust-region updating strategy overcomes the shortcoming of the line search method, which consumes the unnecessary trial steps in the transient-state phase. In order to improve the robustness of the path-following method, we use the residual regularization parameter to replace the traditional complementarity regularization parameter. Moreover, we prove the global convergence of the new method under the standard assumptions without the traditional assumption condition of the priority to feasibility over complementarity. Numerical results show that the new method is robust and efficient for the linear complementarity problem, especially for the dense cases. And it is more robust and faster than some state-of-the-art solvers such as the built-in subroutines PATH and MILES of the GAMS v28.2 (2019) environment. The computational time of the new method is about 1/3 to 1/10 of that of PATH for the dense linear complementarity problem.

Faster Gr\"obner Bases via Domain-Specific Ordering in Parameter Identifiability of ODE Models. (arXiv:2202.06297v2 [cs.SC] UPDATED)
10 months ago

We consider a specific class of polynomial systems that arise in parameter identifiability problems of models of ordinary differential equations (ODE) and discover a method for speeding up the Gr\"obner basis computation by using specific variable ordering and weights coming from the structure of the ODE model. We provide empirical results that show improvement across different symbolic computing frameworks.

Exact linear reductions of dynamical models. (arXiv:2301.11653v1 [eess.SY])
10 months ago

Dynamical models described by ordinary differential equations (ODEs) are a fundamental tool in the sciences and engineering. Exact reduction aims at producing a lower-dimensional model in which each macro-variable can be directly related to the original variables, and it is thus a natural step towards the model's formal analysis and mechanistic understanding. We present an algorithm which, given a polynomial ODE model, computes a longest possible chain of exact linear reductions of the model such that each reduction refines the previous one, thus giving a user control of the level of detail preserved by the reduction. This significantly generalizes over the existing approaches which compute only the reduction of the lowest dimension subject to an approach-specific constraint. The algorithm reduces finding exact linear reductions to a question about representations of finite-dimensional algebras. We provide an implementation of the algorithm, demonstrate its performance on a set of benchmarks, and illustrate the applicability via case studies. Our implementation is freely available at

An Empirical Study on Software Bill of Materials: Where We Stand and the Road Ahead. (arXiv:2301.05362v2 [cs.SE] UPDATED)
10 months ago

The rapid growth of software supply chain attacks has attracted considerable attention to software bill of materials (SBOM). SBOMs are a crucial building block to ensure the transparency of software supply chains that helps improve software supply chain security. Although there are significant efforts from academia and industry to facilitate SBOM development, it is still unclear how practitioners perceive SBOMs and what are the challenges of adopting SBOMs in practice. Furthermore, existing SBOM-related studies tend to be ad-hoc and lack software engineering focuses. To bridge this gap, we conducted the first empirical study to interview and survey SBOM practitioners. We applied a mixed qualitative and quantitative method for gathering data from 17 interviewees and 65 survey respondents from 15 countries across five continents to understand how practitioners perceive the SBOM field. We summarized 26 statements and grouped them into four topics on SBOM's states of practice. Based on the study results, we derived a goal model and highlighted future directions where practitioners can put in their effort.