publications
(*) denotes equal contribution
2024
- ICLR: Circumventing Concept Erasure Methods For Text-to-Image Generative Models. Minh Pham, Kelly O. Marshall, Niv Cohen, and 2 more authors. International Conference on Learning Representations, 2024.
Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that none of them fully excises the targeted concepts. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.
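The core of the attack above is optimizing a single new word embedding against a frozen, "sanitized" model until the erased concept reappears. The following is an abstract toy sketch of that idea only: a random linear map stands in for the frozen text-to-image pipeline, and a random feature vector stands in for the erased concept; none of these names or shapes come from the paper.

```python
import numpy as np

# Toy stand-ins: W is a frozen "model", target is the erased concept's output.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))      # frozen weights (never updated)
target = rng.normal(size=8)       # features of the "erased" concept
emb = np.zeros(16)                # the special word embedding we learn

# Gradient descent on 0.5 * ||W @ emb - target||^2 with respect to emb only;
# the model weights W stay untouched, mirroring the attack's constraint.
for _ in range(2000):
    residual = W @ emb - target
    emb -= 0.01 * (W.T @ residual)

recovered = W @ emb               # the frozen model now reproduces the concept
```

The point of the sketch is the constraint, not the model: only the embedding is trainable, yet the frozen map can be driven back to the target output.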
- CVPR: DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models. Nastaran Saadati, Minh Pham, Nasla Saleem, and 5 more authors. IEEE/CVF Computer Vision and Pattern Recognition Conference, 2024.
Recent advances in decentralized deep learning algorithms have demonstrated cutting-edge performance on various tasks with large pre-trained models. However, a pivotal prerequisite for achieving this level of competitiveness is the significant communication and computation overhead incurred when updating these models, which prevents their application to real-world scenarios. To address this issue, drawing inspiration from advanced model-merging techniques that require no additional training, we introduce the Decentralized Iterative Merging-And-Training (DIMAT) paradigm, a novel decentralized deep learning framework. Within DIMAT, each agent is trained on its local data and periodically merged with its neighboring agents using advanced model-merging techniques such as activation matching until convergence is achieved. We prove that DIMAT converges at the best available rate for nonconvex functions with various first-order methods, while yielding tighter error bounds and maintaining linear speedup compared to popular existing approaches. We conduct a comprehensive empirical analysis to validate DIMAT's efficiency and superiority over baselines across diverse computer vision tasks sourced from the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets. Empirical results validate our theoretical claims by showing that DIMAT attains a faster and larger initial accuracy gain with both independent and identically distributed (IID) and non-IID data, while incurring lower communication overhead. The DIMAT paradigm presents a new opportunity for future decentralized learning, enhancing its adaptability to real-world settings via sparse and lightweight communication and computation.
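The merge-and-train loop described above can be sketched with heavy simplification: plain parameter averaging over a ring topology stands in for the paper's activation-matching merge, and one-step gradient descent on a toy quadratic loss stands in for local training. All names, topology, and hyperparameters here are illustrative, not the paper's.

```python
import numpy as np

def local_step(w, data_mean, lr=0.1):
    # One gradient step on a toy local loss 0.5 * ||w - data_mean||^2.
    return w - lr * (w - data_mean)

def merge_with_neighbors(weights, i):
    # Merge agent i with its two ring neighbors (naive parameter averaging
    # standing in for activation matching).
    n = len(weights)
    return (weights[(i - 1) % n] + weights[i] + weights[(i + 1) % n]) / 3.0

def dimat_round(weights, data_means, local_steps=5):
    # Each agent trains on its own data, then merges with its neighbors.
    trained = []
    for w, mu in zip(weights, data_means):
        for _ in range(local_steps):
            w = local_step(w, mu)
        trained.append(w)
    return [merge_with_neighbors(trained, i) for i in range(len(trained))]

rng = np.random.default_rng(0)
data_means = [rng.normal(size=4) for _ in range(4)]  # non-IID local data
weights = [rng.normal(size=4) for _ in range(4)]
for _ in range(50):
    weights = dimat_round(weights, data_means)
consensus = np.mean(data_means, axis=0)
```

After repeated rounds the agents' average parameters settle at the average of the local objectives, which is the intuition behind periodic merging: no raw data or gradients are exchanged, only (merged) model parameters.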
2023
- Preprint: ZeroForge: Feedforward Text-to-Shape Without 3D Supervision. Kelly O. Marshall, Minh Pham, Ameya Joshi, and 4 more authors. arXiv preprint arXiv:2306.08183, 2023.
Current state-of-the-art methods for text-to-shape generation either require supervised training using a labeled dataset of pre-defined 3D shapes, or perform expensive inference-time optimization of implicit neural representations. In this work, we present ZeroForge, an approach for zero-shot text-to-shape generation that avoids both pitfalls. To achieve open-vocabulary shape generation, we require careful architectural adaptation of existing feed-forward approaches, as well as a combination of data-free CLIP-loss and contrastive losses to avoid mode collapse. Using these techniques, we are able to considerably expand the generative ability of existing feed-forward text-to-shape models such as CLIP-Forge. We support our method via extensive qualitative and quantitative evaluations.
- TMLR: Distributionally Robust Classification on a Data Budget. Benjamin Feuer, Ameya Joshi, Minh Pham, and 1 more author. Transactions on Machine Learning Research, 2023.
Real-world uses of deep learning require predictable model behavior under distribution shifts. Models such as CLIP show emergent natural distributional robustness comparable to humans, but may require hundreds of millions of training samples. Can we train robust learners in a domain where data is limited? To rigorously address this question, we introduce JANuS (Joint Annotations and Names Set), a collection of four new training datasets with images, labels, and corresponding captions, and perform a series of carefully controlled investigations of factors contributing to robustness in image classification, then compare those results to findings derived from a large-scale meta-analysis. Using this approach, we show that a standard ResNet-50 trained with the cross-entropy loss on 2.4 million image samples can attain comparable robustness to a CLIP ResNet-50 trained on 400 million samples. To our knowledge, this is the first result showing (near) state-of-the-art distributional robustness on limited data budgets. Our dataset is available at https://huggingface.co/datasets/penfever/JANuS_dataset, and the code used to reproduce our experiments can be found at https://github.com/penfever/vlhub/.
2022
- Preprint: Revisiting Self-Distillation. Minh Pham*, Minsu Cho*, Ameya Joshi*, and 1 more author. arXiv preprint arXiv:2206.08491, 2022.
Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), and is often used in the context of model compression. When both models have the same architecture, this procedure is called self-distillation. Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data. In this work, we systematically study self-distillation in a number of settings. We first show that even with a highly accurate teacher, self-distillation allows a student to surpass the teacher in all cases. Secondly, we revisit existing theoretical explanations of (self-)distillation and identify contradicting examples, revealing possible drawbacks of these explanations. Finally, we provide an alternative explanation for the dynamics of self-distillation through the lens of loss landscape geometry. We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
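For readers unfamiliar with the setup, the standard distillation objective (self-distillation is the special case where both networks share an architecture) matches the student's output distribution to the teacher's temperature-softened one. This is a minimal numpy sketch of that classic loss; the temperature value and example logits are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; subtracting the max is for numerical stability.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # KL(teacher || student) on softened distributions, scaled by T^2 so
    # gradient magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

teacher = np.array([[4.0, 1.0, 0.5]])   # frozen teacher's logits for one input
student = np.array([[3.5, 1.2, 0.4]])   # student's logits for the same input
loss = distillation_loss(student, teacher)
```

The loss is zero exactly when the student reproduces the teacher's distribution, which is why a student that *beats* the teacher on held-out data is the surprising phenomenon the paper studies.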
- Preprint: Smooth-Reduce: Leveraging Patches for Improved Certified Robustness. Ameya Joshi, Minh Pham, Minsu Cho, and 4 more authors. arXiv preprint arXiv:2205.06154, 2022.
Randomized smoothing (RS) has been shown to be a fast, scalable technique for certifying the robustness of deep neural network classifiers. However, methods based on RS require augmenting data with large amounts of noise, which leads to significant drops in accuracy. We propose a training-free, modified smoothing approach, Smooth-Reduce, that leverages patching and aggregation to provide improved classifier certificates. Our algorithm classifies overlapping patches extracted from an input image, and aggregates the predicted logits to certify a larger radius around the input. We study two aggregation schemes, max and mean, and show that both approaches provide better certificates in terms of certified accuracy, average certified radii, and abstention rates compared to concurrent approaches. We also provide theoretical guarantees for such certificates, and empirically show significant improvements over other randomized smoothing methods that require expensive retraining. Further, we extend our approach to videos and provide meaningful certificates for video classifiers.
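The patch-and-aggregate step described in the abstract can be sketched as follows. The patch size, stride, and stand-in linear "classifier" are illustrative assumptions; the actual method wraps a smoothed classifier and comes with certification guarantees that this toy omits.

```python
import numpy as np

def extract_patches(img, size=8, stride=4):
    # Overlapping square patches from a single-channel image.
    h, w = img.shape
    return [img[i:i + size, j:j + size]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]

def smooth_reduce(img, classifier, mode="mean"):
    # Classify every patch, then reduce the per-patch logits per class.
    logits = np.stack([classifier(p) for p in extract_patches(img)])
    if mode == "mean":
        return logits.mean(axis=0)
    return logits.max(axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))                 # toy 10-class linear patch classifier
classifier = lambda patch: W @ patch.reshape(-1)
img = rng.normal(size=(32, 32))               # toy single-channel input
mean_logits = smooth_reduce(img, classifier, "mean")
max_logits = smooth_reduce(img, classifier, "max")
```

Because the same input yields many patch predictions, the reduction acts like an ensemble over views of the image, which is the intuition for why it can be done training-free.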
2020
- ICASSP: Toward better speaker embeddings: Automated collection of speech samples from unknown distinct speakers. Minh Pham, Zeqian Li, and Jacob Whitehill. IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.
The accuracy of speaker verification and diarization models depends on the quality of the speaker embeddings used to separate audio samples from different speakers. With the goal of training better embedding models, we devise an automatic pipeline for large-scale collection of speech samples from unique speakers that is significantly more automated than previous approaches. With this pipeline, we collect and publish the BookTubeSpeech dataset, containing 8,450 YouTube videos (7.74 min per video on average), each containing a single unique speaker. Using this dataset combined with VoxCeleb2, we show a substantial improvement in the quality of embeddings when tested on LibriSpeech compared to a model trained on only VoxCeleb2.
- INTERSPEECH: How does label noise affect the quality of speaker embeddings? Minh Pham, Zeqian Li, and Jacob Whitehill. Conference of the International Speech Communication Association, 2020.
A common assumption when collecting speech datasets is that the accuracy of data labels strongly influences the accuracy of speaker embedding models and verification systems trained from these data. However, we show in experiments on the large and diverse VoxCeleb2 dataset that this is not always the case: Under four different labeling models (Split, Merge, Permute, and Corrupt), we find that the impact on trained speaker embedding models, as measured by the Equal Error Rate (EER) of speaker verification, is mild (just a few percent absolute error increase), except with very large amounts of noise (i.e., every minibatch is almost completely corrupted). This suggests that efforts to collect speech datasets might benefit more from ensuring large size and diversity rather than meticulous labeling.
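To make the noise models concrete, here is a toy sketch of two of the four named above, applied to integer speaker labels: Permute relabels speakers with a fixed bijection (preserving the partition of utterances into speakers), and Corrupt replaces a fraction of labels with random speakers. The speaker count and corruption fraction are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=100)   # 100 utterances from 5 toy speakers

def permute(labels, rng, n_speakers=5):
    # Apply one fixed permutation of speaker identities to every label;
    # the grouping of utterances by speaker is unchanged.
    mapping = rng.permutation(n_speakers)
    return mapping[labels]

def corrupt(labels, rng, frac=0.2, n_speakers=5):
    # Replace a random fraction of labels with uniformly random speakers.
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(frac * len(labels)), replace=False)
    noisy[idx] = rng.integers(0, n_speakers, size=len(idx))
    return noisy
```

Permute leaves the speaker partition intact (so metric-learning losses that only compare same-vs-different speaker are barely affected), while Corrupt genuinely mixes speakers, which helps explain why embedding quality degrades so gracefully under the milder noise models.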