TimDarcet
@timdarcet.bsky.social
📤 1246
📥 289
📝 56
PhD student, SSL for vision @ MetaAI & INRIA tim.darcet.fr
pinned post!
Vision transformers need registers! Or at least, it seems they 𝘸𝘢𝘯𝘵 some… ViTs have artifacts in attention maps. It’s due to the model using these patches as “registers”. Just add new tokens (“[reg]”): - no artifacts - interpretable attention maps 🦖 - improved performance!
arxiv.org/abs/2309.16588
almost 2 years ago
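The fix described in the pinned post is small enough to sketch. This toy module (names, sizes, and the two-layer encoder are mine, not the paper's) shows the only change needed: concatenate learnable [reg] tokens next to [cls], then drop them at the output.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Minimal sketch (not the paper's code): prepend learnable [reg]
    tokens alongside [cls] so the model has scratch space that is not
    tied to any image patch."""
    def __init__(self, dim=64, num_registers=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.reg_tokens = nn.Parameter(torch.zeros(1, num_registers, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.num_registers = num_registers

    def forward(self, patch_tokens):
        b = patch_tokens.shape[0]
        x = torch.cat([
            self.cls_token.expand(b, -1, -1),
            self.reg_tokens.expand(b, -1, -1),
            patch_tokens,
        ], dim=1)
        x = self.blocks(x)
        # registers are simply discarded at the output: only [cls] and
        # the patch tokens are used downstream
        return x[:, 0], x[:, 1 + self.num_registers:]
```

At inference the registers cost a few extra tokens per forward pass and are never read out, which is what makes the attention maps clean up.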
Want strong SSL, but not the complexity of DINOv2? CAPI: Cluster and Predict Latent Patches for Improved Masked Image Modeling.
7 months ago
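The post names the recipe but not the details. The sketch below is one plausible instantiation, assuming hard nearest-prototype targets from a teacher and a cross-entropy on masked positions only; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def capi_style_loss(student_logits, teacher_feats, prototypes, mask):
    """Sketch of a cluster-and-predict objective (assumed details):
    assign each teacher patch feature to its nearest prototype, then
    train the student to predict that cluster id at masked positions.

    student_logits: (B, N, K)  teacher_feats: (B, N, D)
    prototypes:     (K, D)     mask: (B, N) bool, True where masked
    """
    sim = teacher_feats @ prototypes.T        # (B, N, K) similarity
    targets = sim.argmax(dim=-1)              # hard cluster assignment
    # loss only on masked patches: the student never sees their pixels
    return F.cross_entropy(student_logits[mask], targets[mask])
```

The appeal over pixel-space MAE is that the targets are semantic cluster ids rather than raw pixels, without the full DINOv2 self-distillation machinery.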
reposted by
TimDarcet
Juliette Marrie
8 months ago
(3/3) LUDVIG uses a graph diffusion mechanism to refine 3D features, such as coarse segmentation masks, by leveraging 3D scene geometry and pairwise similarities induced by DINOv2.
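The (3/3) post names graph diffusion without giving the update rule. One common form, assumed here purely for illustration, mixes each 3D point's feature with its neighbors' (via a similarity matrix built from scene geometry and DINOv2 features) while anchoring to the coarse input.

```python
import torch

def diffuse_features(feats, affinity, alpha=0.5, steps=10):
    """Sketch of graph diffusion for refining per-point 3D features
    (this exact rule is an assumption, not the LUDVIG code).

    feats:    (N, D) coarse features, e.g. lifted segmentation masks
    affinity: (N, N) row-stochastic similarity between 3D points
    """
    out = feats.clone()
    for _ in range(steps):
        # average with neighbors, but keep a pull toward the original
        # coarse features so the solution stays anchored
        out = alpha * affinity @ out + (1 - alpha) * feats
    return out
```

With `alpha < 1` the iteration converges, so the number of steps mostly controls how far information propagates along the graph.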
reposted by
TimDarcet
Juliette Marrie
8 months ago
(2/3) We propose a simple, parameter-free aggregation mechanism, based on alpha-weighted multi-view blending of 2D pixel features in the forward rendering process.
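A parameter-free alpha-weighted aggregation like the one described in (2/3) can be sketched as a weighted average over views and pixels; the tensor layout (views × gaussians × pixels) and names are assumptions for illustration, not the paper's code.

```python
import torch

def uplift_features(weights, feats_2d, eps=1e-8):
    """Sketch of alpha-weighted multi-view blending, run "in reverse":
    each Gaussian's 3D feature is the rendering-weight-normalized sum
    of the 2D pixel features it contributes to.

    weights:  (V, G, P) alpha-blend weight of gaussian g at pixel p, view v
    feats_2d: (V, P, D) 2D feature map of each view, flattened over pixels
    returns:  (G, D) uplifted per-gaussian features
    """
    num = torch.einsum('vgp,vpd->gd', weights, feats_2d)
    den = weights.sum(dim=(0, 2)).clamp_min(eps).unsqueeze(-1)
    return num / den
```

Because the weights come straight from the forward rendering pass, no extra parameters are learned, which matches the "learning-free" claim in the thread.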
reposted by
TimDarcet
Juliette Marrie
8 months ago
(1/3) Happy to share LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian Splatting scenes, that uplifts visual features from models such as DINOv2 (left) & CLIP (mid) to 3DGS scenes. Joint work with
@dlarlus.bsky.social
@jmairal.bsky.social
Webpage & code:
juliettemarrie.github.io/ludvig
reposted by
TimDarcet
Transactions on Machine Learning Research
9 months ago
Outstanding Finalist 2: “DINOv2: Learning Robust Visual Features without Supervision,” by Maxime Oquab, Timothée Darcet, Théo Moutakanni et al. 5/n
openreview.net/forum?id=a68...
DINOv2: Learning Robust Visual Features without Supervision
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could...
https://openreview.net/forum?id=a68SUt6zFt
Hash functions are really useful to uniquely encode stuff without collision huh
9 months ago
At least there's diversity of opinions
9 months ago
reposted by
TimDarcet
Jacob Schreiber
9 months ago
"no one can match my artistic vision" i mutter to myself repeatedly as i leave critical analyses undone and focus on what shade of gray to use in a supplemental figure
reposted by
TimDarcet
Shobhita Sundaram
9 months ago
Personal vision tasks, like detecting *your mug*, are hard; they’re data scarce and fine-grained. In our new paper, we show you can adapt general-purpose vision models to these tasks from just three photos! 📝:
arxiv.org/abs/2412.16156
💻:
github.com/ssundaram21/...
(1/n)
reposted by
TimDarcet
Shiry Ginosar
9 months ago
Can video MAE scale? Yes. Do you need language to scale video models? No.
arxiv.org/abs/2412.15212
Great rigorous benchmarking from my colleagues at Google DeepMind.
Scaling 4D Representations
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classifi...
https://arxiv.org/abs/2412.15212
reposted by
TimDarcet
David Picard
9 months ago
Everything is a LAW when you have 4 points on a log-log plot 🤔
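The joke fits in a few lines. The four (compute, loss) points below are made up; any roughly monotone data looks like a power "law" once you take logs of both axes.

```python
import numpy as np

# Four invented (compute, loss) points on a log-log plot
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.2, 2.6, 2.15, 1.8])

# Linear fit in log-log space: log(loss) = b * log(compute) + log(a)
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)

def law(c):
    """A perfectly publishable scaling law: loss ≈ a * compute^b."""
    return np.exp(intercept) * c ** slope
```

With four points and two free parameters, the fit will look great no matter what; that is the point being made.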
reposted by
TimDarcet
Dhruv Batra
9 months ago
Brilliant talk by Ilya, but he's wrong on one point. We are NOT running out of data. We are running out of human-written text. We have more videos than we know what to do with. We just haven't solved pre-training in vision. Just go out and sense the world. Data is easy.
reposted by
TimDarcet
Nicolas Dufour
10 months ago
🌍 Guessing where an image was taken is a hard, and often ambiguous problem. Introducing diffusion-based geolocation—we predict global locations by refining random guesses into trajectories across the Earth's surface! 🗺️ Paper, code, and demo:
nicolas-dufour.github.io/plonk
reposted by
TimDarcet
Clément Canonne
10 months ago
Web 1.0 is back, baby
Wake up babe new iNat just dropped
10 months ago
reposted by
TimDarcet
Sara Beery
10 months ago
Along with INQUIRE, we introduce iNat24, a new dataset of 5 million research-grade images from @inaturalist with 10,000 species labels. This is one of the largest publicly available natural world image repositories!
The hardest thing in the world is to refrain from using superlatives
10 months ago
reposted by
TimDarcet
François Fleuret
10 months ago
I'd be fine calling this the "Milan Principle" and I'd extend it to "Most commercialized goods do not need new features."
reposted by
TimDarcet
Thomas Fel
10 months ago
A fun thesis experiment: ResNet, DETR, and CLIP tackle Saint-Bernards. 🐶 ResNet focused on **fur** patterns, DETR did too but also used **paws** (possibly because they help define bounding boxes), and CLIP's **head** concept oddly included human heads: language shaping learned concepts?
Excellent writeup on GPU streams / CUDA memory
dev-discuss.pytorch.org/t/fsdp-cudac...
TL;DR: by default, memory is owned by the stream that allocated it. To share it across streams: - `Tensor.record_stream` -> automatic, but can be suboptimal and nondeterministic - `Stream.wait_stream` -> manual, but precise control
10 months ago
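The two idioms from the linked writeup can be sketched as follows (assumes a CUDA device and a pinned source tensor; the function name is mine). The caching allocator ties a tensor's memory to the stream that created it, so using it on another stream needs one of these two steps.

```python
import torch

def async_h2d_copy(x_cpu):
    """Sketch: copy host-to-device on a side stream, then hand the
    tensor safely to the default stream using both idioms."""
    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        # async copy: requires pinned memory + non_blocking=True
        x_gpu = x_cpu.to('cuda', non_blocking=True)
    # Idiom 1: automatic but potentially suboptimal. The allocator will
    # not reuse this memory until every recorded stream has passed.
    x_gpu.record_stream(torch.cuda.current_stream())
    # Idiom 2: manual and precise. The default stream waits for the
    # copy stream before any kernel reads x_gpu.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return x_gpu
```

In real code you would pick one idiom, not both; `record_stream` is the low-effort default, while explicit `wait_stream` is what frameworks like FSDP use for deterministic memory reuse.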
reposted by
TimDarcet
Eugene Vinitsky 🍒
10 months ago
I've been using Skybridge (
chromewebstore.google.com/detail/sky-f...
) to rebuild the graph periodically which I think helps
Sky Follower Bridge - Chrome Web Store
Instantly find and follow the same users from your Twitter follows on Bluesky.
https://chromewebstore.google.com/detail/sky-follower-bridge/behhbpbpmailcnfbjagknjngnfdojpko?hl=en
reposted by
TimDarcet
d.ly
10 months ago
please, remember our core values:
reposted by
TimDarcet
↑Lionel Yelibi↓
10 months ago
These opportunities are mostly reserved for the rest of the world. We need similar Industry-Academia PhD programs in the US too! We need an American version of the CIFRE.
reposted by
TimDarcet
Johan Edstedt
10 months ago
༼ つ ◕_◕ ༽つ GIVE DINOv3
reposted by
TimDarcet
Alaa El-Nouby
10 months ago
𝗗𝗼𝗲𝘀 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗽𝗿𝗲-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝘄𝗼𝗿𝗸 𝗳𝗼𝗿 𝘃𝗶𝘀𝗶𝗼𝗻? 🤔 Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding 🧵 paper:
arxiv.org/abs/2411.14402
code:
github.com/apple/ml-aim
HF:
huggingface.co/collections/...
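Not the AIMv2 code: a minimal causal-transformer sketch of what "autoregressive pre-training for vision" means, regressing each next patch from the previous ones. All names and dimensions here are made up for illustration.

```python
import torch
import torch.nn as nn

class TinyAIM(nn.Module):
    """Toy autoregressive vision model: a causal transformer over
    patch embeddings that regresses the next patch's pixels."""
    def __init__(self, dim=64, patch_dim=48, num_patches=16):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, patch_dim)
        # causal mask: patch t may only attend to patches <= t
        mask = torch.triu(
            torch.full((num_patches, num_patches), float('-inf')), 1)
        self.register_buffer('causal_mask', mask)

    def forward(self, patches):
        h = self.encoder(self.embed(patches), mask=self.causal_mask)
        return self.head(h)

def ar_loss(model, patches):
    """MSE between the prediction at position t and the patch at t+1."""
    pred = model(patches)
    return ((pred[:, :-1] - patches[:, 1:]) ** 2).mean()
```

The same shape of objective, swapped from text tokens to image patches, is what the post's "does autoregressive pre-training work for vision?" question is about.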
reposted by
TimDarcet
Kosta Derpanis
10 months ago
Gotta put this app down. Discovered so much cool stuff without the rage.
reposted by
TimDarcet
David Picard
10 months ago
Sidenote: TMLR is such a pleasant journal. It's fast and reviews are (mostly) insightful, detailed and helpful. Kind of how conference reviews were before the big rush, for the youngsters who think it's always been that way.
reposted by
TimDarcet
Raphael Pisoni
10 months ago
DINOv2 is without a doubt one of the most important Self-Supervised Learning (SSL) methods right now. But training it takes 32 80GB GPUs, which is not easy to come by for small labs. What if we could train a comparable high-res model on 24GB of VRAM? That's what I hope to show you here soon!🤞🧵
#mlsky