A hearty congratulations to my student
@RanjayKrishna
(co-advised by
@msbernst
) for a successful PhD thesis defense! Great pioneering work in combining human cognition, human-computer interaction and
#AI
! Thank you PhD committee members
@chrmanning
@syeung10
@magrawala
🌹
🎓 I'm on the faculty job market this year! Please send me a message if your department (or one you know) is interested in a Computer Vision / HCI researcher who designs models inspired by human perception and social interaction!
My application materials:
Our submission received my first ever 10/10 review from NeurIPS. Check out our
#NeurIPS2023
Oral.
We release the largest vision-language dataset for histopathology and train a SOTA model for classifying histopathology images across 13 benchmarks spanning 8 sub-pathologies.
Quilt-1M has been accepted for an oral presentation at
@NeurIPSConf
. As promised, we have also released our data and our model:
See you all in New Orleans!
My latest
#CVPR2018
paper with Ines,
@msbernst
and
@drfeifei
is now live with fully documented open source training/testing code. We treat visual relationships as shifts in attention and perform attention saccades around scene graphs.
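A minimal sketch of that mechanism, assuming invented module and tensor names rather than the paper's released code: each predicate gets its own small learned shift that moves attention from the subject toward the object, and iterating these shifts between entities is the "saccade" intuition.

import torch
import torch.nn as nn

class PredicateShift(nn.Module):
    # Hypothetical sketch: one small learned convolution per predicate
    # shifts a subject attention map toward where the object should be
    # (e.g. a spatial predicate learns a directional shift).
    def __init__(self, num_predicates, kernel_size=7):
        super().__init__()
        self.shifts = nn.ModuleList([
            nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)
            for _ in range(num_predicates)
        ])

    def forward(self, subject_attention, predicate_id):
        # subject_attention: (B, 1, H, W) soft attention over the image
        return torch.sigmoid(self.shifts[predicate_id](subject_attention))

# Example: shift attention from "person" through predicate 3 to find the object.
shift = PredicateShift(num_predicates=70)
obj_attention = shift(torch.rand(1, 1, 14, 14), predicate_id=3)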
We are happy to introduce Action Genome: a new representation, new dataset, and new model for decomposing actions into spatio-temporal scene graphs. Action Genome has 1.7M relationships between 0.4M object instances and enables few-shot action prediction.
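As a rough illustration of the representation (field names here are mine, not the dataset's actual schema), each sampled video frame carries its own small scene graph:

from dataclasses import dataclass
from typing import List

@dataclass
class Relationship:
    subject: str     # e.g. "person"
    predicate: str   # e.g. "looking_at", "holding", "in_front_of"
    obj: str         # e.g. "cup"

@dataclass
class FrameGraph:
    timestamp: float                  # seconds into the video
    relationships: List[Relationship]

# An action like "drinking from a cup" decomposes across frames as roughly:
# [person-looking_at-cup] -> [person-holding-cup] -> [person-drinking_from-cup]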
I expect a future where ML agents will dynamically learn through real-world interactions with people. My
#CVPR2019
paper with
@msbernst
and
@drfeifei
pushes us towards that goal by learning to pose directed questions to learn about the visual world.
Announcing the first 𝗜𝗖𝗖𝗩 𝘄𝗼𝗿𝗸𝘀𝗵𝗼𝗽 𝗼𝗻 𝗦𝗰𝗲𝗻𝗲 𝗚𝗿𝗮𝗽𝗵 𝗥𝗲𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴. If your research involves structured data or graph-based learning, consider submitting to us by August 15, 2019:
Our new paper finds something quite neat: we easily scale the number of tools LLMs can use to over 200 (APIs, models, Python functions, etc.)
...without any training, without a single tool-use demonstration!!
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
paper page:
Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. Unfortunately, demonstrations are hard to
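A minimal sketch of the zero-shot alternative, with placeholder tool names and prompt wording (not the paper's exact prompt): the prompt carries each tool's documentation instead of usage demonstrations, and the LLM decides which tool to call.

# Hypothetical sketch: zero-shot tool use via documentation, no demos.
TOOL_DOCS = {
    "image_captioner": "image_captioner(path) -> str. Describes an image.",
    "calculator": "calculator(expr) -> float. Evaluates an arithmetic expression.",
}

def build_prompt(task: str) -> str:
    docs = "\n".join(TOOL_DOCS.values())
    return (
        "You can call the following tools. Their documentation:\n"
        f"{docs}\n\n"
        f"Task: {task}\n"
        "Respond with the tool call to execute."
    )

print(build_prompt("What is 17 * 23?"))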
On my way to Seoul for
#ICCV2019
. If you’re at the conference on October 28th, come check out a full day workshop I am organizing on Scene Graph Representation and Learning (). We have a great lineup of speakers and posters.
Academic quarter recap: here's a staff photo after the last lecture of
@cs231n
. It's crazy that we were the largest course at Stanford this quarter. This year, we added new lectures and assignments (open sourced) on attention, transformers, and self-supervised learning.
Someone made an in-depth video of our recent work at
#CVPR2018
on Referring Relationships. If you are interested in how we train models to disambiguate between different people or objects in images, go check it out.
#ComputerVision
#MachineLearning
Deploying LLMs continues to be a challenge as they grow in model size and consume more data. We introduce a simple distillation mechanism to make even 770M T5 models outperform 540B PaLM.
Led by my PhD student
@cydhsieh
and with collaborators
@chunliang_tw
@ajratner
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
reduce both the model size and the amount of data required to outperform LLMs; our 770M T5 model outperforms the 540B PaLM model using only 80% of available data on a
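My reading of the mechanism, sketched with invented tensor names (not the released code): the LLM's step-by-step rationales become a second supervision signal for the small model, alongside the usual label loss, which is why less data and a smaller model can suffice.

import torch

def distill_step_by_step_loss(label_logits, label_targets,
                              rationale_logits, rationale_targets,
                              lam=1.0):
    # Multi-task objective: predict the task label AND generate the
    # LLM-provided rationale. The rationale term is ordinary token-level
    # cross-entropy over the rationale sequence.
    ce = torch.nn.functional.cross_entropy
    label_loss = ce(label_logits, label_targets)
    rationale_loss = ce(
        rationale_logits.view(-1, rationale_logits.size(-1)),
        rationale_targets.view(-1),
    )
    return label_loss + lam * rationale_loss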
If you're releasing a new user-facing AI project/product, you might want to read our new
#CSCW2020
paper. We find that words or metaphors used to describe AI agents have a causal effect on users' intention to adopt your agent. Thread👇
Human-centered AI is no longer just a buzzword. It's a thriving, growing area of research. Come to our workshop tomorrow at
#ICML2023
to learn about it.
AI models have finally matured for mass market use. HCI+AI interactions will only become more vital.
Congrats to Amir Zamir
@zamir_ar
, Silvio Savarese
@silviocinguetta
and co-authors for their Best paper award at
#CVPR2018
“Taskonomy: Disentangling Task Transfer Learning”
We updated our generative human evaluation benchmark with 6 GANs, 4 image datasets (generating faces and objects), and 2 sampling methods. We show statistically insignificant correlation with FID and other automatic metrics. Use HYPE ()!
Measuring progress in generative models is akin to hill climbing on noise. Automatic metrics are heuristics and human evaluations are unreliable. Our latest paper presents a new human evaluation method grounded in psychophysics that is consistent, turnkey, and cost-effective.
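To make the flavor of the metric concrete, here is a simplified sketch of a HYPE-style score (my simplification, not the official implementation): show raters a mix of real and generated images and count how often they judge wrongly.

def hype_style_score(judgments):
    # judgments: list of (is_generated, judged_real) pairs from raters.
    # An error is being fooled by a fake (judged real) or doubting a
    # real image (judged fake); more errors = more realistic generator.
    errors = sum(
        1 for is_generated, judged_real in judgments
        if judged_real == is_generated
    )
    return 100.0 * errors / len(judgments)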
📈New paper! 1 training method, no new architecture, no additional data, SOTA results on 8 vision-language benchmarks.
Our 5B model variants even outperform 13B+ models!
Multimodal reasoning is hard. Even the best LMMs struggle with counting😥 Any fix for it?
Introducing VPD from
@GoogleAI
: we teach LMMs multimodal CoT reasoning with data synthesized from LLM + vision tools, and achieve new SOTAs on many multimodal tasks!🥳
With the ICCV ban finally lifted, here is our new
#ICCV2023
paper, which already has a few follow-up papers.
Our method faithfully evaluates text-to-image generation models. It provides more than just a score; it identifies missed objects, incorrect attributes, relationships, etc.
It is notoriously hard to evaluate images created by text-to-image models. Why not use powerful LLMs and VLMs to analyze them?
We introduce TIFA🦸🏻♀️
#ICCV2023
, which uses GPT + BLIP to quantitatively measure what Stable Diffusion struggles with!
Proj:
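The pipeline, roughly as the tweet describes it (the function and object names here are placeholders, not the released API):

def tifa_score(prompt, image, llm, vqa):
    # 1) An LLM turns the prompt into question-answer pairs, e.g.
    #    "a red apple on a table" -> ("What color is the apple?", "red")
    qa_pairs = llm.generate_questions(prompt)
    # 2) A VQA model (e.g. BLIP) answers each question on the generated image.
    correct = sum(
        vqa.answer(image, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    # 3) Faithfulness = fraction answered correctly; the failed questions
    #    pinpoint the missing objects, attributes, or relationships.
    return correct / len(qa_pairs)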
New paper: Real-world image editing that is consistent with lighting, occlusion, and 3D shapes💀🧊🎧⚽️🍎!
We introduce a new 3D image editing benchmark called OBJect. Using OBJect, we train 3DIT, a diffusion model that can rotate, translate, insert, and delete objects in images.
Imagine a 2D image serving as a window into a 3D world: you could reach in, manipulate objects, and see the changes reflected in the image.
In our new OBJect 3DIT work, we edit images in this 3D-aware fashion while only operating in the pixel space!
🧵
Structured prediction requires substantial training data. Our new paper introduces the first few-shot scene graph model with predicates as functions within a graph convolution framework, resulting in the first semantically & spatially interpretable model.
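A minimal sketch of the "predicates as functions" idea, with invented names and my own simplifications:

import torch
import torch.nn as nn

class PredicateFunction(nn.Module):
    # Sketch: every predicate ("riding", "above", ...) is its own small
    # learned function over (subject, object) embedding pairs, used as a
    # message function inside a graph convolution over the scene graph.
    # Because each predicate is a separate, inspectable function, the
    # model stays interpretable and new predicates need few examples.
    def __init__(self, dim):
        super().__init__()
        self.fn = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, subj_embed, obj_embed):
        return self.fn(torch.cat([subj_embed, obj_embed], dim=-1))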
New paper! Now that self-supervision for high-level vision tasks has matured, we ask: what is needed for pixel-level tasks?
Given evidence from cognitive science, we show that scaling up the learning of multi-view correspondences improves SOTA on depth, segmentation, normals, and pose estimation.
MIMIC: Masked Image Modeling with Image Correspondences
paper page:
Many pixelwise dense prediction tasks in computer vision today, such as depth estimation and semantic segmentation, rely on pretrained image representations. Therefore, curating effective
For researchers working on scene graphs or visual relationships, I just open sourced a simple library to easily visualize
#SceneGraphs
.
Now you can directly use this to generate your qualitative results in your publications.
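The library itself is linked above; as a generic illustration of the same idea (this is not the released library's API), you could render relationship triples with networkx and matplotlib:

import networkx as nx
import matplotlib.pyplot as plt

triples = [("man", "riding", "horse"), ("man", "wearing", "hat")]

G = nx.DiGraph()
for subj, pred, obj in triples:
    G.add_edge(subj, obj, label=pred)  # predicate stored as edge label

pos = nx.spring_layout(G, seed=0)
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=2000)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "label"))
plt.savefig("scene_graph.png")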
Make sure to check out this new
#documentary
on PBS (
@novapbs
): "Can we build a brain?" with Fei-Fei (
@drfeifei
) and me. Check out the trailer here.
In 1938, Don Budge won all the Tennis Grand Slams in one calendar year. Now, in 2023, my advisor Michael Bernstein has done the same in HCI research.
Congrats
@msbernst
!!
Hey everyone, we have a great lineup of speakers at our upcoming workshop on the importance of Compositionality in Computer Vision (), at
#CVPR2020
(with
@eadeli
,
@jcniebles
,
@drfeifei
,
@orussakovsky
). Consider submitting a paper. Also, stay safe.
Happy Thanksgiving everyone! We have released code and a demo for our
#ICCV2019
paper on Predicting Scene Graphs with Limited Labels. Check out
@vincentsunnchen
's GitHub repository here:
Really proud of my summer intern, Pranav Khadpe, from IIT Kharagpur. He spent the summer with us at Stanford working on how we can improve the usability of AI systems even before people interface with the system.
If you are attending
#CVPR2020
, we have some exciting things for you to attend and check out: 1) Come to our (w
@drfeifei
@jcniebles
@eadeli
Jingwei) workshop on Sunday on Compositionality in Computer Vision. We have an amazing lineup of speakers.
Training robots requires data, which today is hard to collect. You need to (1) buy expensive robots, (2) teach people to operate them, and (3) purchase objects for the robots to manipulate.
Our
#CoRL2023
paper shows you don't need any of the 3, not even a robot! All you need is an iPhone.
🚨Is it possible to devise an intuitive approach for crowdsourcing training data for robots without requiring a physical robot🤖?
Can we democratize robot learning for all?🧑🤝🧑
Check out our latest
#CoRL2023
paper->
AR2-D2: Training a Robot Without a Robot
At
#CVPR2023
this year, I had a number of conversations about how we need a faithful benchmark for measuring vision-language compositionality.
SugarCrepe is our response. Our best models are still not compositional. It's time to make some progress!
Introducing SugarCrepe: A benchmark for faithful vision-language compositionality evaluation!
‼️ Current compositional image2text benchmarks are HACKABLE: Blind models without image access outperform SOTA CLIP models due to severe dataset artifacts
📜:
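To make the "hackable" point concrete, here is the kind of blind baseline at issue (a sketch; the scoring function is a placeholder): pick whichever caption a text-only model finds more plausible, never looking at the image.

def blind_baseline(caption_a, caption_b, text_plausibility):
    # text_plausibility: any text-only score, e.g. an LM log-likelihood.
    # If generated hard negatives are ungrammatical or implausible, this
    # image-free heuristic already wins on artifact-ridden benchmarks,
    # which is exactly the failure SugarCrepe is designed to remove.
    if text_plausibility(caption_a) > text_plausibility(caption_b):
        return caption_a
    return caption_b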
@CloudinAround
@joshmeyerphd
@AndrewYNg
Maybe you should read about what the problem really is before commenting. The waived tuition will be taxable under the new bill, making our PhD unaffordable.
Today we open sourced all 9 assignments for the
#ComputerVision
class I teach
@Stanford
with
@jcniebles
- allowing everyone to learn various concepts like lane detection, deformable parts, segmentation, dimensionality reduction, optical flow, etc.
There are so many vision-language models: OpenAI’s CLIP, Meta’s FLAVA, Salesforce’s ALBEF, etc.
Our
#CVPR2023
⭐️ highlight ⭐️ paper finds that none of them show sufficient compositional reasoning capacity.
Since perception and language are both compositional, we have work to do
Have vision-language models achieved human-level compositional reasoning? Our research suggests: not quite yet.
We’re excited to present CREPE – a large-scale Compositional REPresentation Evaluation benchmark for vision-language models – as a 🌟highlight🌟at
#CVPR2023
.
🧵1/7
Soon we will be releasing over 200 computer vision student group projects, on topics including autonomous driving, denoising chest x-rays, understanding satellite images, colorizing old movies, estimating real-estate prices, and transfer learning on edge devices.
New paper out! Typical active learning algorithms assume there is only one correct answer, which is not true for many tasks, like question answering. Our new uncertainty measurement is 5x more data-efficient even when there are multiple correct answers.
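For context, the single-answer assumption being challenged is baked into classic uncertainty sampling, sketched here (a generic baseline, not the paper's new measurement):

import numpy as np

def entropy_acquisition(prob_matrix):
    # prob_matrix: (num_examples, num_classes) softmax outputs.
    # Classic uncertainty sampling labels the example with the highest
    # entropy. This implicitly assumes exactly ONE correct answer, which
    # breaks for tasks like question answering with many valid answers.
    entropy = -np.sum(prob_matrix * np.log(prob_matrix + 1e-12), axis=1)
    return int(np.argmax(entropy))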
Check out this new workshop and benchmark for studying vision systems that can navigate as social agents amongst people -- by my colleagues (
@SHamidRezatofig
) at
@StanfordSVL
.
@fchollet
It very much depends on the activation function, but for most cases you want to use conv-bn-act. Without BN before the activation, saturated neurons will kill gradients. We do case studies of this across multiple activation functions in these slides:
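For concreteness, the conv-bn-act ordering in PyTorch (a generic block, not taken from the slides):

import torch.nn as nn

def conv_bn_act(in_ch, out_ch, act=nn.ReLU):
    # BatchNorm re-centers and re-scales pre-activations so fewer neurons
    # land in the saturated region of the activation, keeping gradients
    # alive. The conv bias is dropped because BN's shift subsumes it.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        act(),
    )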
New paper: DreamSync improves any text-to-image generation model by aligning it better with text inputs.
We use DreamSync to improve stable diffusion XL.
Generated images not following your prompt?
Introducing 𝔻𝕣𝕖𝕒𝕞𝕊𝕪𝕟𝕔 from
@GoogleAI
: improving alignment + aesthetics of image generation models with feedback from VLMs!
✅ Model Agnostic
✅ Plug and Play
❌ RL
❌ Human Annotation
❌ Real Image
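My reading of the loop, sketched with placeholder method names (not the released code): generate several candidates per prompt, keep the ones VLM judges score best on faithfulness and aesthetics, and fine-tune on those, with no RL and no human labels.

def dreamsync_round(model, prompts, vlm_faithful, vlm_aesthetic, k=4):
    # One self-training round: VLM feedback replaces human annotation
    # and RL, and any text-to-image model can be plugged in.
    finetune_set = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(k)]
        best = max(
            candidates,
            key=lambda img: vlm_faithful(prompt, img) + vlm_aesthetic(img),
        )
        finetune_set.append((prompt, best))
    model.finetune(finetune_set)
    return model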
Structured prediction requires large training sets, but crowdsourcing is ineffective, so existing models ignore visual relationships without sufficient labels.
Our method uses 10 relationship labels to generate training data for any scene graph model!
I will be speaking at an event tomorrow at
@Stanford
on the importance of
#Trust
and
#Transparency
in Human-AI collaboration. Come stop by to hear about how we can build dynamic learning systems that can continuously learn from interactions with humans.
Speaking of collective behavior, check out our new paper at NeurIPS 2022. Inspired by how animals coordinate to accomplish tasks, we design a simple multi-agent intrinsic reward that allows decentralized multi-agent training, letting AI agents even adapt to new partners.
This new project is a huge team effort from the PRIOR team at AI2 with striking conclusions:
Real-world navigation, exploration, manipulation emerges:
(1) without any RL,
(2) without any human demonstrations,
(3) with only automatically generated simulation data.
🚀 Imitating shortest paths in simulation enables effective navigation and manipulation in the real world. Our findings fly in the face of conventional wisdom!
This is a big joint effort from PRIOR
@allen_ai
(6 first authors!).
Old-school research presentations can be boring. Check out this fun, creative skit Helena put together to explain our new CSCW paper.
TLDR: Recent works keep finding that AI explanations don't help people make better decisions. We propose a theory for when they do help!
Do you want to learn about how explanations can help reduce overreliance on AIs?
Watch this fantastic, out-of-this-world, one-of-a-kind, spectacular, etc. short video explaining our work! We put a lot of ❤️ into it and would appreciate the views.
If you’re at
#ICCV2023
, reach out and come say hi. I will be giving two talks:
- one at the closing the loop in vision and language workshop:
- one at the scene graph workshop:
On a personal note, I am going to miss co-instructing with
@danfei_xu
and
@drfeifei
. This is my 5th and last time instructing at Stanford. I have learned so much from working with so many amazing teaching assistants and students. Thank you, everyone.
"We’re going to have robots with free will, absolutely. We have to understand how to program them and what we gain out of it. For some reason, evolution has found this sensation of free will to be computationally desirable." - Judea Pearl
Contrary to how today's AI products are advertised, people are more likely to adopt an agent that they originally expected to have low competence but that outperforms that expectation. They are less forgiving of mistakes made by agents they expect to have high competence.
@chipro
It's subjective. I would personally say scaling up models *IS* academic research. It's easy to dismiss it as not innovative. But research is also about studying the outcomes of design decisions/interventions. In this case, the intervention is increasing model size.
I have been extremely lucky to have
@timnitGebru
as a labmate and as a friend. Thank you for sharing your brilliant work and always being generous with your precious time. I am appalled you are dealing with this. I am here to support and help in any way I can.
We can localize and reason over the "jacket worn by the person next to the person on the phone" or the "table below the person to the left of the person wearing the hat".
#CVPR2018
results just came out!! There doesn't appear to be any correlation between paper ID and whether your paper will get accepted, unlike past vision conferences.
Our students were more than just computer science majors. We housed majors from immunology, anthropology, MBA, biology, geology, aeronautics, music, neuroscience, philosophy, and many more. We also had over 50 industry professionals remotely enroll.
Here is a neat visualization exploring motifs in the visual world using relationships from
@VisualGenome
. Adding structure allows us to further vision research and ask questions like: "what kinds of objects usually contain food?→bowls, plates, table"
Check out our newest paper! We automatically assign probabilistic relationship labels to images and can use them to train any existing scene graph model with as few as 10 examples.
Congratulations
@PranavKhadpe
!!! Contrary to how today's AI products are marketed, our paper finds that people are more likely to adopt and cooperate with AI agents that project low competence but outperform expectations, and are less forgiving of agents that project high competence.
Excited to share that our paper, "Conceptual Metaphors Impact Perceptions of Human-AI Collaboration", was awarded an Honorable Mention at
#CSCW2020
🎇
Paper:
Want to say a huge thank you to my co-authors
@RanjayKrishna
@drfeifei
@jeffhancock
and
@msbernst
Embodied AI has been limited to simple, lifeless simulated houses for many years.
Just like the Holodeck in the Star Trek episodes I grew up watching, our Holodeck system allows you to create diverse, lived-in 3D simulated environments populated with thousands of objects:
🛸 Announcing Holodeck, a promptable system that can generate diverse, customized, and interactive 3D simulated environments ready for Embodied AI 🤖 applications.
Website:
Paper:
Code:
#GenerativeAI
[1/8]
Engagement learning to train an SI system. Generating questions and engaging users to learn models in an automated way. “We have one example of an image of a red panda and need to scale up to recognize red pandas consistently.”
It's refreshing to see corporations using their strengths and capabilities to give back to society.
@DoorDash
just launched Project Dash () to use their network of restaurants and their fleet of cars to deliver food to those who are hungry.
#dashforgood
Hmm, have I made a wrong turn? I was looking for
@GoogleResearch
…
Nope, you're in the right place! We’re unifying all of our research efforts under “Google AI”, which encompasses all the state-of-the-art innovation happening across Google. Learn more at
Incredibly excited to announce that Ross Girshick (
@inkynumbers
) will be joining the PRIOR team
@allen_ai
!
Ross is one of the most influential and impactful researchers in AI. I'm so honored that he is joining us, and I'm really looking forward to working with him.
Go check out our
#NeurIPS2019
*Oral* talk for "HYPE: Human eYe Perceptual Evaluation of Generative Models" today at 4:50pm at West Exhibition Hall C + B3. Also, HYPE now offers *evaluation as a service* for generative models at