KDD 2023 - Workshops: LLM and Causal Inference
I attended KDD 2023, which was held in Long Beach, CA from Aug 6-10. The first day I attended was Monday, which had half-day workshops organized around a topic. The two I attended were about LLMs (because I’m interested and it’s relevant to my work) and Causal Inference (because I haven’t used causal machine learning techniques in practice before and wanted exposure).
Takeaways from Day 1
- Ed Chi had my favorite line from the day:
- Humans + Search –> Superhuman
- LLMs + Tools –> Super LLMs
- Humans + Super LLM –> Super super humans??
- Reaffirmed that the LLM space is moving very quickly. Some research areas need to be explored in the next year or so; after that it will be too late to make a meaningful contribution.
- Learned some new methodologies:
- LLMs: Prompt Tuning, Mixture of Experts
- Causal ML: Double Machine Learning (DML), many packages to do Causal ML like CausalML, EconML and UpliftML
- Two groups in an A/B test may not be sufficient; we may need to account for 4 groups
LLM Workshop: Foundations and Applications in Large-scale AI Models - Pre-training, Fine-tuning, and Prompt-based Learning
The website for this workshop is here: https://llm-ai.github.io/llmai/.
Schedule
Time | Speaker | Title |
---|---|---|
8:00-8:10AM, 2023/08/07 (PDT) | Host Chair | Welcome and Open Remarks |
8:10-8:40AM, 2023/08/07 (PDT) | Ed Chi [Google] | Talk 1: LLM Revolution: Implications from Chatbots and Tool-Use to Reasoning |
8:40-9:10AM, 2023/08/07 (PDT) | Tania Bedrax-Weiss [Google] | Talk 2: Large-scale AI Model Research at Google Pre-training, Fine-tuning, and Prompt-based Learning |
9:10-9:25AM, 2023/08/07 (PDT) | Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer and Wen-Tau Yih | Paper-1: Retrieval-Augmented Multimodal Language Modeling |
9:25-9:40AM, 2023/08/07 (PDT) | Silvia Terragni, Modestas Filipavicius, Nghia Khau, Bruna Guedes, André Manso and Roland Mathis | Paper-2: In-Context Learning User Simulators for Task-Oriented Dialog Systems |
9:40-9:55AM, 2023/08/07 (PDT) | Piotr Kluska, Florian Scheidegger, A. Cristiano I. Malossi and Enrique S. Quintana-Ortí | Paper-3: Challenges in post-training quantization of Vision Transformers |
9:55-10:10AM, 2023/08/07 (PDT) | Haotian Ju, Dongyue Li, Aneesh Sharma and Hongyang Zhang | Paper-4 : Generalization in Graph Neural Networks: Improved PAC-Bayesian Bounds on Graph Diffusion |
10:10-10:30AM, 2023/08/07 (PDT) | Coffee Break | |
10:30-11:00AM, 2023/08/07 (PDT) | Shafiq Joty [Salesforce] | Talk 3: NLP Research in the Era of LLMs |
11:00-11:30AM, 2023/08/07 (PDT) | Yikang Shen [IBM] | Talk 4: Modular Large Language Model and Principle-Driven alignment with Minimal Human Supervision |
11:30-11:40AM, 2023/08/07 (PDT) | Hong Sun, Xue Li, Yinchuan Xu, Youkow Homma, Qi Cao, Min Wu, Jian Jiao and Denis Charles | Paper-5: AutoHint: Automatic Prompt Optimization with Hint Generation |
11:40-11:50AM, 2023/08/07 (PDT) | Zhichao Wang, Mengyu Dai and Keld Lundgaard | Paper-6: Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation |
11:50-12:00PM, 2023/08/07 (PDT) | Long Hoang Dang, Thao Minh Le, Tu Minh Phuong and Truyen Tran | Paper-7: Compositional Prompting with Successive Decomposition for Multimodal Language Models |
12:00PM-12:10PM, 2023/08/07 (PDT) | Zhen Guo, Yanwei Wang, Peiqi Wang and Shangdi Yu | Paper-8: Dr. LLaMA: Improving Small Language Models on PubMedQA via Generative Data Augmentation |
12:10-12:20PM, 2023/08/07 (PDT) | Haopeng Zhang, Xiao Liu and Jiawei Zhang | Paper-9 : Extractive Summarization via ChatGPT for Faithful Summary Generation |
12:20-12:30PM, 2023/08/07 (PDT) | Closing Remarks | |
LLM Revolution: Implications from Chatbots and Tool-Use to Reasoning - Ed Chi
Ed Chi from Google gave this great talk.
2016 - Functions that Deep Neural Network Can Learn
- Pixels –> Lion
- Audio –> Audio to text
- Text –> Text (translation)
- Pixels –> Caption
Chatbots
- Not just transactional
- We want chatbots to be contextual
- Personalized assistants for everyone
LaMDA –> Bard (Brought to You by Ed’s Team)
- They wanted to publish LaMDA in the form of Bard, but there were difficulties
Large Language Models (LLM)
- Large knowledge base
- What is a plan to read 20 books a year? Reaches into the LLM to come up with a real plan
- Genesis of captions –> not too far to be able to generate text
Programming
- Coding is less about coding, more about data
- Data Science (DS) is going to be a bigger part of software development
Retrieval Augmentation: Leveraging External Knowledge
- Factuality trigger
- Open-book Generative QA
- RETRO: Retrieval-augmented generative model
- Questions:
- How big does the LLM need to be?
- How big does the external knowledge base need to be?
- Fruitful Line of Research
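To make the open-book idea concrete for myself, here is a minimal sketch of retrieval augmentation: fetch the most relevant document from an external knowledge base and prepend it to the prompt. This is a toy illustration, not RETRO or anything shown in the talk. The bag-of-words retriever, the `docs` list, and the prompt template are all assumptions of mine.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector as a Counter."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def build_prompt(query, docs, k=1):
    """Prepend retrieved context so the model can answer 'open book'."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical external knowledge base.
docs = [
    "KDD 2023 was held in Long Beach, CA.",
    "Double machine learning estimates treatment effects.",
    "Mixture of experts routes tokens to experts.",
]
prompt = build_prompt("Where was KDD 2023 held?", docs)
print(prompt)
```

A real system would swap the bag-of-words retriever for dense embeddings, which is where the two open questions above (model size vs. knowledge base size) come in.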
Multi-modality output (not just text, could be images)
- Image retrieval
- Image input –> Generate captions
Humans and LLMs with Tools
- Humans + Search –> Superhuman
- LLMs + Tools –> Super LLMs
- Humans + Super LLM –> Super super humans??
Future Challenges
- Responsibility and Safety
- Factuality, Grounding, and Attribution
- Human <-> AI Content Loop and Ecosystem
- Personalization and User Memory
Keynote
- Ed is going to give the keynote tomorrow
- You can interrogate a model for why it made a decision or prediction
- Area: Self-critique, self-reflection (next year or so)
- 3-5 year research topics:
- Hallucinations / Bias in areas where the LLM has not been trained
- Relationship between hallucinations and safety
Large-scale AI Model Research at Google Pre-training, Fine-tuning, and Prompt-based Learning
Tania Bedrax-Weiss from Google gave this talk.
Mixture of Experts Models
- How to route the question to the right expert (or experts)
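A minimal sketch of how routing can work, assuming the common top-k gating scheme: a router scores every expert, the k best are kept, and their outputs are combined with renormalized weights. The talk did not go into this level of detail, so the function names, the toy experts, and the hard-coded router logits are my own assumptions (in a real model the logits come from a learned layer).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Toy experts standing in for feed-forward sub-networks (hypothetical).
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x]
router_logits = [2.0, 0.5, -1.0]

routing = route_top_k(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)
print(routing, output)
```

Only the selected experts run, which is what keeps compute roughly constant as the number of experts grows.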
Conditional Computation
- CoLT5 Transformer layer
- Scales to longer context
- Early exit
- Per step confidence thresholds
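Here is how I understand early exit with per-step confidence thresholds, as a rough sketch: at each decoding step, check the model's confidence after every layer and stop as soon as it clears that step's threshold. The per-layer logits, the threshold values, and the use of the top softmax probability as the confidence measure are assumptions for illustration, not details from the talk.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_step(layer_logits, threshold):
    """Return (token_id, layers_used) for one decoding step.

    layer_logits holds the logits an output head would produce after each layer.
    If no layer is confident enough, the last layer's prediction is used.
    """
    for depth, logits in enumerate(layer_logits, start=1):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            break  # confident enough: skip the remaining layers
    return probs.index(confidence), depth

# Hypothetical logits for two decoding steps, three layers each.
steps = [
    [[0.2, 0.1, 0.0], [2.0, 0.1, 0.0], [5.0, 0.0, 0.0]],
    [[0.0, 3.0, 0.0], [0.0, 4.0, 0.0], [0.0, 5.0, 0.0]],
]
thresholds = [0.7, 0.9]  # one threshold per decoding step
results = [decode_step(logits, thr) for logits, thr in zip(steps, thresholds)]
print(results)
```

The first step exits after two layers and the second after one, so easy tokens cost less compute than hard ones.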
Multi-modal Work
- Imagen - diffusion model
- Parti - autoregressive model
Parti: Technical Details
- ViT-VQGAN as image tokenizer
- What’s an image tokenizer? See: https://keras.io/examples/vision/token_learner/
- Autoregressively generate images in a similar way that LLMs generate text
- Can generate text reliably - spell words out unlike other models
PaLI
- Image to text
- State of the art image captioning model
Spotlight
- Screenshots / user interfaces - understand what are the actions that a user can perform
- Execute commands in the user interface
PLay: Parametrically Conditioned Layout Generation Using Guidelines
- Fine-tuning
- Prompt Tuning
- Look at this more
How do you handle ambiguity in an answer?
- LLMs are very eager to give an answer
- Types
- Use multiple prompts to get different types of answers, e.g. “This is my answer. Can you generate other answers?”
- Diversity objectives
Retrieval-Augmented Multimodal Language Modeling
Paper: https://arxiv.org/abs/2211.12561
Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
- Develops a retrieval-augmented multimodal model, the first of its kind
- The generator uses retrieved items for generation too
- Retrieval augmented training - helped a lot
In-Context Learning User Simulators for Task-Oriented Dialog Systems
- Code: https://github.com/telepathylabsai/prompt-based-user-simulator
- Paper: https://arxiv.org/abs/2306.00774
This paper presents a novel application of large language models in user simulation for task-oriented dialog systems, specifically focusing on an in-context learning approach. By harnessing the power of these models, the proposed approach generates diverse utterances based on user goals and limited dialog examples. Unlike traditional simulators, this method eliminates the need for labor-intensive rule definition or extensive annotated data, making it more efficient and accessible. Additionally, an error analysis of the interaction between the user simulator and dialog system uncovers common mistakes, providing valuable insights into areas that require improvement. Our implementation is available at https://github.com/telepathylabsai/prompt-based-user-simulator.
- Rule based systems are still more accurate. However they mainly understand happy paths of a dialog system.
- These LLM based approaches can explore unexpected behavior of users
Challenges in post-training quantization of Vision Transformers
Vision Transformers recently showed outstanding performance in computer vision tasks. However, those models are compute and memory intensive that require accelerators with a large amount of memory like NVIDIA A100 graphic processing unit for training and even for inference. Post-training quantization is an appealing compression method, as it does not require retraining the models and labels to tune the model. In this paper, we look in depth at multiple models in terms of size, architecture, and training procedure and provide guidelines on how to quantize the model to an 8-bit integer, both weights and activations. We perform a well-rounded study on the effects of quantization and sensitivity to the quantization error. Moreover, we show that applying mixed-data precision quantization works well for most vision transformer models achieving up to 90% compression ratio within a 2% top-1 accuracy drop. This kind of quantization offers a trade-off between memory, compute, and performance of the models that are deployable with the current software and hardware stack.
- There’s a difference between Static vs Dynamic Quantization
- Larger models are supposed to be easier to quantize, but not the case here
- Signal-to-quantization-noise ratio (SQNR)
- Partial Quantization: Some models that lost accuracy under dynamic quantization regained it with 90% (partial) quantization
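To pin down the static vs. dynamic distinction and the SQNR metric, here is a small sketch of symmetric 8-bit quantization. In the static case the scale is fixed ahead of time from calibration data; in the dynamic case it is recomputed from the actual input. The calibration and runtime values are made up, and this is a generic illustration rather than the paper's procedure.

```python
import math

def scale_for(values):
    """Symmetric int8 scale so the largest magnitude maps to 127."""
    return max(abs(v) for v in values) / 127 or 1.0

def quantize_int8(values, scale):
    """Round to the nearest integer step and clip to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    """Map integer codes back to floats."""
    return [q * scale for q in quantized]

def sqnr_db(original, restored):
    """Signal-to-quantization-noise ratio in decibels (higher is better)."""
    signal = sum(v * v for v in original)
    noise = sum((v - r) ** 2 for v, r in zip(original, restored))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)

calibration = [0.1, -0.4, 0.8, 0.3]  # hypothetical calibration activations
runtime = [0.2, -1.6, 2.4, 0.05]     # hypothetical activations at inference

static_scale = scale_for(calibration)  # fixed before deployment
dynamic_scale = scale_for(runtime)     # recomputed per input

static_sqnr = sqnr_db(runtime, dequantize(quantize_int8(runtime, static_scale), static_scale))
dynamic_sqnr = sqnr_db(runtime, dequantize(quantize_int8(runtime, dynamic_scale), dynamic_scale))
print(f"static: {static_sqnr:.1f} dB, dynamic: {dynamic_sqnr:.1f} dB")
```

In this toy case the runtime activations exceed the calibration range, so static quantization clips them and its SQNR drops well below the dynamic one.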
Generalization in Graph Neural Networks: Improved PAC-Bayesian Bounds on Graph Diffusion
Paper: https://proceedings.mlr.press/v206/ju23a/ju23a.pdf
Graph neural networks are widely used tools for graph prediction tasks. Motivated by their empirical performance, prior works have developed generalization bounds for graph neural networks, which scale with graph structures in terms of the maximum degree. In this paper, we present generalization bounds that instead scale with the largest singular value of the graph neural network’s feature diffusion matrix. These bounds are numerically much smaller than prior bounds for real-world graphs. We also construct a lower bound of the generalization gap that matches our upper bound asymptotically. To achieve these results, we analyze a unified model that includes prior works’ settings (i.e., convolutional and message-passing networks) and new settings (i.e., graph isomorphism networks). Our key idea is to measure the stability of graph neural networks against noise perturbations using Hessians. Empirically, we find that Hessian-based measurements correlate with observed generalization gaps of graph neural networks accurately; Optimizing noise stability properties for fine-tuning pretrained graph neural networks also improves the test performance on several graph-level classification tasks.
- Overfitting if there’s an imbalance between pretraining data and finetuning data size
- Generalization gap
- Not just cross validation loss
- More detailed understanding - what networks are causing the overfitting
- Generalization gap - measures the gap between training/test losses
NLP Research in the Era of LLMs - Unleashing the Potential of LLMs through Task and Data Engineering
Shafiq Joty gave this talk: https://raihanjoty.github.io/
Background: Data Engineering
- Hold the code fixed and invite research to improve the data (Andrew Ng)
Background: Rise of Task Engineering
- Multi-task models with task prompts
- Trained with many different instructions
- Mentions prompt tuning again (soft tokens) ???
Background: Task Engineering
LLM Lifecycle
XGen LLM: June 2023
- GitHub Link
- Goal is to outperform LLaMA1
Instruction tuned
- Instructional data: WizardLM. Paper Link
What does WizardLM do exactly in advancing the SoTA?
Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna’s testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public.
Verify and Edit CoT - Self-consistency
Knowledge adapting framework
Language diversity prompting
Standard vs Personalized Distillation from LLMs
Modular Large Language Model and Principle-Driven alignment with Minimal Human Supervision
Yikang Shen from IBM gave this talk.
Foundation model types
Challenges of LLM
- Efficiency
- Extendability
- Flexibility
ModuleFormer - Learning Modular LLM from Uncurated Data
- Previous modular models were based on already labeled data
Mod-Squad - designing a mixture of experts as modular multi-task learners
- Can select the right experts for a task
- Experts can share knowledge!?
Dromedary - efficiently teach AI to follow a given set of principles
- GitHub Link for Dromedary
- Principle Engraving -
- Verbose Cloning - refining the model to produce in-depth and detailed response
- 300 lines of annotations
- Kind of similar to Evol-Instruct/WizardLM to produce annotations to fine-tune a model
AutoHint: Automatic Prompt Optimization with Hint Generation
Paper: https://arxiv.org/pdf/2307.07415.pdf
This paper presents AutoHint, a novel framework for automatic prompt engineering and optimization for Large Language Models (LLM). While LLMs have demonstrated remarkable ability in achieving high-quality annotation in various tasks, the key to applying this ability to specific tasks lies in developing high-quality prompts. Thus we propose a framework to inherit the merits of both in-context learning and zero-shot learning by incorporating enriched instructions derived from input-output demonstrations to optimize original prompt. We refer to the enrichment as the Hint and propose a framework to automatically generate the hint from labeled data. More concretely, starting from an initial prompt, our method first instructs a LLM to deduce new hints for selected samples from incorrect predictions, and then summarizes from per-sample hints and adds the results back to the initial prompt to form a new, enriched instruction. The proposed method is evaluated on the BIG-Bench Instruction Induction dataset for both zero-shot and few-shot prompts, where experiments demonstrate our method is able to significantly boost accuracy for multiple tasks.
Causal Inference Workshop: Causal Inference and Machine Learning in Practice
The website for this workshop is here: https://causal-machine-learning.github.io/kdd2023-workshop/
COG: Creative Optimality Gap for Video Advertising
Raif Rustamov from Amazon gave this invited talk.
Video ads motivation
- How does a particular video affect shopper experience?
Goal
- Driven by explicit hypotheses tied to quantifying value of the video
Approach - Creative Optimality Gap (COG)
- If we were to replace a video of class 0 with a video of class 1, what would be the improvement in the outcome for the ad?
- Uplift or Heterogeneous Treatment Effect modeling
Benefits
- Differentiated at the level of video features vs. global ATE
- ATE - average treatment effect - videos are good
- ITE - individual treatment effect - noisy
- HTE - heterogeneous treatment effect - in the middle, denoising
- Handle cold start ads
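A tiny worked example of why a global ATE can hide what an HTE reveals. The data is invented, treatment is assumed to be randomized, and segmenting by product category is just my stand-in for the video and ad features used in the talk.

```python
from statistics import mean

# (segment, treated, outcome) rows; hypothetical data.
rows = [
    ("electronics", 1, 0.30), ("electronics", 0, 0.10),
    ("electronics", 1, 0.34), ("electronics", 0, 0.14),
    ("grocery", 1, 0.11), ("grocery", 0, 0.10),
    ("grocery", 1, 0.13), ("grocery", 0, 0.12),
]

def effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

# ATE: one number for everyone.
ate = effect(rows)

# HTE: one number per segment, in between the global ATE and noisy per-unit ITEs.
segments = sorted({seg for seg, _, _ in rows})
hte = {seg: effect([r for r in rows if r[0] == seg]) for seg in segments}

print(f"ATE: {ate:.3f}")
print({seg: round(value, 3) for seg, value in hte.items()})
```

The ATE says "videos are good" on average, while the per-segment effects show that almost all of the lift comes from one segment.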
Preliminaries
- Treatment indicator (T)
- Video features
- Computed using e.g. video embeddings
- Can contain non
- Ad features
- Contains non-video related features like price, product category
- Used as confounder/matching variables
- Outcome = Y
COG Modeling
- Step 1
- Step 2
- Step 3 -
- Used interpretable models in this step, why?
COG Modeling: Guardrails
Bias
- Bias comes from G model, comes from regularization or not enough capacity in the model
- Bias is not constant but varies in the Z space
- Double ML?
Uncertainty/Variance
Solution
- Conservative COG = lower bound of confidence interval
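One way to read "conservative COG" is to report the lower end of a confidence interval instead of the point estimate. The sketch below uses a percentile bootstrap on a simple difference in means. I do not know how the interval was actually computed in the talk, so the bootstrap, the 95% level, and the sample data are all assumptions.

```python
import random
from statistics import mean

def bootstrap_lower_bound(treated, control, n_boot=2000, alpha=0.05, seed=0):
    """Lower end of a percentile-bootstrap CI for mean(treated) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement and recompute the gap.
        t_sample = [rng.choice(treated) for _ in treated]
        c_sample = [rng.choice(control) for _ in control]
        diffs.append(mean(t_sample) - mean(c_sample))
    diffs.sort()
    return diffs[int(alpha / 2 * n_boot)]

# Hypothetical outcomes for ads with class 1 videos vs. class 0 videos.
treated = [0.32, 0.28, 0.35, 0.30, 0.27, 0.33]
control = [0.25, 0.22, 0.27, 0.24, 0.26, 0.23]

point_estimate = mean(treated) - mean(control)
conservative = bootstrap_lower_bound(treated, control)
print(f"point: {point_estimate:.3f}, conservative: {conservative:.3f}")
```

Acting on the lower bound means a noisy estimate has to clear a higher bar before it triggers a recommendation.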
The Value of Last-Mile Delivery in Online Retail
Ruomeng Cui from Emory gave this talk.
Cainiao - Chinese Company
- Alibaba’s logistics platform
- Largest logistics platform in China
- If there are differences in preferences, there is an opportunity for optimization
Use Causal ML: Estimating ITE
- Data: Post-treatment data Q4 2021
Models
- Partial Linear DML
- First-difference DML
- Others
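Since DML came up repeatedly, here is a minimal sketch of the partially linear version: predict both the treatment and the outcome from the confounder, then regress the outcome residuals on the treatment residuals, using cross-fitting so that each residual comes from a model that never saw that row. The simulated data, the single confounder, and the plain linear nuisance models are assumptions to keep the example dependency-free; real use would plug in flexible ML models, for example through EconML.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns a predictor function."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return lambda value: intercept + slope * value

def dml_theta(X, T, Y, n_folds=2):
    """Cross-fitted residual-on-residual estimate of the treatment effect."""
    n = len(X)
    res_t, res_y = [0.0] * n, [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        m_hat = fit_line([X[i] for i in train], [T[i] for i in train])  # E[T | X]
        l_hat = fit_line([X[i] for i in train], [Y[i] for i in train])  # E[Y | X]
        for i in test:
            res_t[i] = T[i] - m_hat(X[i])
            res_y[i] = Y[i] - l_hat(X[i])
    return sum(a * b for a, b in zip(res_t, res_y)) / sum(a * a for a in res_t)

# Simulated data: X confounds both treatment and outcome.
rng = random.Random(0)
true_theta = 1.5
X = [rng.gauss(0, 1) for _ in range(2000)]
T = [0.5 * x + rng.gauss(0, 1) for x in X]
Y = [true_theta * t + 2.0 * x + rng.gauss(0, 1) for t, x in zip(T, X)]

naive_model = fit_line(T, Y)
naive_theta = naive_model(1.0) - naive_model(0.0)  # slope of Y on T, ignoring X
theta_hat = dml_theta(X, T, Y)
print(f"naive: {naive_theta:.2f}, DML: {theta_hat:.2f}, truth: {true_theta}")
```

The naive slope absorbs the confounder's effect and lands well above the true value, while the residualized estimate stays close to it.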
Account for Knapsack
- Tau does not capture economic efficiency
- Need to account for how much capacity a customer is using. A customer going from 0 to 1 unit sales is much more valuable than a customer going from 19 to 20 units sold because the latter is not using much capacity.
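A sketch of the knapsack point: with limited capacity, ranking customers by treatment effect alone can be worse than ranking by effect per unit of capacity consumed. The customer tuples and the greedy heuristic are my own illustration, not the method from the talk.

```python
def allocate(customers, capacity, key):
    """Greedily pick customers in descending order of `key` until capacity runs out.

    customers: list of (name, uplift, capacity_cost) tuples.
    """
    chosen, used = [], 0.0
    for name, uplift, cost in sorted(customers, key=key, reverse=True):
        if uplift > 0 and used + cost <= capacity:
            chosen.append(name)
            used += cost
    return chosen

def total_uplift(customers, chosen):
    """Sum of estimated uplift over the chosen customers."""
    return sum(uplift for name, uplift, _ in customers if name in chosen)

# Hypothetical (name, estimated uplift tau, capacity consumed).
customers = [("a", 1.0, 1.0), ("b", 1.2, 4.0), ("c", 0.6, 1.0), ("d", 0.5, 2.0)]
capacity = 4.0

by_uplift = allocate(customers, capacity, key=lambda c: c[1])
by_efficiency = allocate(customers, capacity, key=lambda c: c[1] / c[2])

print(by_uplift, total_uplift(customers, by_uplift))
print(by_efficiency, total_uplift(customers, by_efficiency))
```

Ranking by tau alone spends the whole budget on one expensive customer, while ranking by tau per unit of capacity serves three customers for a larger total effect.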
Leveraging Causal Uplift Modeling for Budget Constrained Benefits Allocation
Dmitri Goldenberg from Booking.com gave this talk. It was a very good talk with virtually no words on his slides.
Ensemble Method for Estimating Individualized Treatment Effects - Kevin Wu Han, Han Wu (Stanford)
- Paper: https://arxiv.org/abs/2202.12445
- Ensemble methods almost always outperform a validation-set-based model selection method!
A Scalable and Debiased Approach to Dynamic Pricing with Causal Machine Learning and Optimization
- Heard the term double machine learning for the second time, which prompted me to learn what it is.
An IPW-based Unbiased Ranking Metric in Two-sided Markets - Keisho Oh, Naoki Nishimura (Recruit Co), Minje Sung, Ken Kobayashi, Kazuhide Nakata (Tokyo Institute of Technology)
In two-sided markets like job matching or dating apps, ranking evaluation needs an unbiased metric, which the authors propose in their paper.
Unit Selection Based on Counterfactual Logic
This was an invited talk by Ang Li about this paper: https://ftp.cs.ucla.edu/pub/stat_ser/r488.pdf.
My main takeaway was that the typical A/B test setup, where one group receives a treatment and the other group is the control, is too simplistic. There are actually 4 groups we should be concerned about:
- Complier: Individuals who would respond positively if treated and negatively if not treated.
- Always-taker: Individuals who always respond positively no matter whether they are treated or not.
- Never-taker: Individuals who always respond negatively no matter whether they are treated or not.
- Defier: Individuals who would respond negatively if treated and positively if not treated.
Along with a benefit vector that assigns a positive or negative value to each of these 4 groups, we can use this to select the best treatment for each individual.
Ang also used the Pfizer Covid vaccine as a motivating example for why these 4 groups should be accounted for.
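To see how the four groups and a benefit vector change a decision, here is a small sketch that scores two segments. The group proportions are treated as known purely for illustration (in practice they generally have to be bounded rather than observed directly), and the benefit values and segment names are hypothetical.

```python
GROUPS = ("complier", "always_taker", "never_taker", "defier")

def expected_benefit(proportions, benefit):
    """Average benefit of treating a unit drawn from this segment."""
    return sum(proportions[g] * benefit[g] for g in GROUPS)

def ab_uplift(proportions):
    """What a plain A/B test measures: compliers minus defiers."""
    return proportions["complier"] - proportions["defier"]

# Hypothetical benefit vector: treating always-takers wastes money, defiers are harmed.
benefit = {"complier": 1.0, "always_taker": -1.0, "never_taker": 0.0, "defier": -2.0}

# Hypothetical group proportions per segment.
segments = {
    "segment_a": {"complier": 0.30, "always_taker": 0.40, "never_taker": 0.25, "defier": 0.05},
    "segment_b": {"complier": 0.25, "always_taker": 0.05, "never_taker": 0.65, "defier": 0.05},
}

uplifts = {name: ab_uplift(p) for name, p in segments.items()}
scores = {name: expected_benefit(p, benefit) for name, p in segments.items()}

best_by_uplift = max(uplifts, key=uplifts.get)
best_by_benefit = max(scores, key=scores.get)
print(best_by_uplift, best_by_benefit)
```

Segment A wins on A/B uplift, but once the cost of treating always-takers is counted, segment B is the better target.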
Towards Automating the Causal Machine Learning Pipeline - Vasilis Syrgkanis (Stanford/EconML)
- A large variety of causal estimands that arise in complex static and longitudinal data analysis can be automatically de-biased when regularized machine learning algorithms are used to estimate nuisance models
- Estimation of the de-biasing term itself can be performed with generic machine learning
- Experimental results using neural nets and random forests for automated de-biasing provide examples of superior performance relative to plug-in approaches and to prior automatic de-biasing approaches based solely on linear models