TimDarcet
@timdarcet.bsky.social
1253 followers · 290 following · 56 posts
PhD student, SSL for vision @ MetaAI & INRIA tim.darcet.fr
pinned post!
Vision transformers need registers! Or at least, it seems they *want* some… ViTs have artifacts in attention maps, because the model uses these patches as "registers". Just add new tokens ("[reg]"): - no artifacts - interpretable attention maps - improved performance!
arxiv.org/abs/2309.16588
over 2 years ago
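The fix in the post is just a handful of extra learnable tokens. A minimal PyTorch sketch of the idea (not the paper's code; all sizes and names here are illustrative): prepend [reg] tokens to the patch sequence, run the encoder, and discard them at the output.

```python
import torch
import torch.nn as nn

class RegisterViT(nn.Module):
    """Toy ViT-style encoder with learnable [reg] tokens (illustrative only)."""
    def __init__(self, dim=64, n_reg=4, depth=2, heads=4):
        super().__init__()
        self.n_reg = n_reg
        # The extra [reg] tokens: learned, shared across all images.
        self.reg = nn.Parameter(torch.zeros(1, n_reg, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):              # patches: (B, N, dim)
        b = patches.shape[0]
        # Prepend registers so attention has a place for global scratch work.
        x = torch.cat([self.reg.expand(b, -1, -1), patches], dim=1)
        x = self.encoder(x)
        return x[:, self.n_reg:]             # drop [reg] outputs: shape matches input
```

At the output the register tokens are simply thrown away; their only job is to give the model somewhere other than patch tokens to stash global information.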
Want strong SSL, but not the complexity of DINOv2? CAPI: Cluster and Predict Latent Patches for Improved Masked Image Modeling.
about 1 year ago
reposted by
TimDarcet
Juliette Marrie
about 1 year ago
(3/3) LUDVIG uses a graph diffusion mechanism to refine 3D features, such as coarse segmentation masks, by leveraging 3D scene geometry and pairwise similarities induced by DINOv2.
reposted by
TimDarcet
Juliette Marrie
about 1 year ago
(2/3) We propose a simple, parameter-free aggregation mechanism, based on alpha-weighted multi-view blending of 2D pixel features in the forward rendering process.
reposted by
TimDarcet
Juliette Marrie
about 1 year ago
(1/3) Happy to share LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian Splatting scenes, that uplifts visual features from models such as DINOv2 (left) & CLIP (mid) to 3DGS scenes. Joint work w.
@dlarlus.bsky.social
@jmairal.bsky.social
Webpage & code:
juliettemarrie.github.io/ludvig
reposted by
TimDarcet
Transactions on Machine Learning Research
about 1 year ago
Outstanding Finalist 2: "DINOv2: Learning Robust Visual Features without Supervision," by Maxime Oquab, Timothée Darcet, Théo Moutakanni et al. 5/n
openreview.net/forum?id=a68...
loading . . .
DINOv2: Learning Robust Visual Features without Supervision
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could...
https://openreview.net/forum?id=a68SUt6zFt
Hash functions are really useful to uniquely encode stuff without collision huh
about 1 year ago
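Presumably tongue-in-cheek: by the pigeonhole principle, any hash with a fixed-size digest must collide once you hash enough distinct inputs. A small self-contained sketch, truncating SHA-256 to a 16-bit digest so the birthday bound (roughly 2^8 tries) makes a collision appear within a few hundred inputs:

```python
import hashlib
from itertools import count

def h16(s: str) -> str:
    """First 4 hex chars (16 bits) of SHA-256 -- a deliberately tiny digest."""
    return hashlib.sha256(s.encode()).hexdigest()[:4]

seen = {}
for i in count():
    d = h16(str(i))
    if d in seen:
        a, b = seen[d], str(i)  # two distinct inputs, same digest
        break
    seen[d] = str(i)
```

With the full 256-bit digest the same pigeonhole argument applies in principle, but the birthday bound (~2^128 inputs) makes an accidental collision astronomically unlikely, which is why "no collisions" is a safe working assumption in practice.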
At least there's diversity of opinions
about 1 year ago
reposted by
TimDarcet
Jacob Schreiber
about 1 year ago
"no one can match my artistic vision" i mutter to myself repeatedly as i leave critical analyses undone and focus on what shade of gray to use in a supplemental figure
reposted by
TimDarcet
Shobhita Sundaram
about 1 year ago
Personal vision tasks, like detecting *your mug*, are hard; they're data-scarce and fine-grained. In our new paper, we show you can adapt general-purpose vision models to these tasks from just three photos!
Paper: arxiv.org/abs/2412.16156
Code: github.com/ssundaram21/...
(1/n)
reposted by
TimDarcet
Shiry Ginosar
about 1 year ago
Can video MAE scale? Yes. Do you need language to scale video models? No.
arxiv.org/abs/2412.15212
Great rigorous benchmarking from my colleagues at Google DeepMind.
Scaling 4D Representations
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classifi...
https://arxiv.org/abs/2412.15212
reposted by
TimDarcet
David Picard
about 1 year ago
Everything is a LAW when you have 4 points on a log-log plot
reposted by
TimDarcet
Dhruv Batra
about 1 year ago
Brilliant talk by Ilya, but he's wrong on one point. We are NOT running out of data. We are running out of human-written text. We have more videos than we know what to do with. We just haven't solved pre-training in vision. Just go out and sense the world. Data is easy.
reposted by
TimDarcet
Nicolas Dufour
about 1 year ago
Guessing where an image was taken is a hard and often ambiguous problem. Introducing diffusion-based geolocation: we predict global locations by refining random guesses into trajectories across the Earth's surface! Paper, code, and demo:
nicolas-dufour.github.io/plonk
reposted by
TimDarcet
Clément Canonne
about 1 year ago
Web 1.0 is back, baby
Wake up babe new iNat just dropped
about 1 year ago
reposted by
TimDarcet
Sara Beery
about 1 year ago
Along with INQUIRE, we introduce iNat24, a new dataset of 5 million research-grade images from @inaturalist with 10,000 species labels. This is one of the largest publicly available natural world image repositories!
The hardest thing in the world is to refrain from using superlatives
about 1 year ago
reposted by
TimDarcet
François Fleuret
about 1 year ago
I'd be fine calling this the "Milan Principle" and I'd extend it to "Most commercialized goods do not need new features."
reposted by
TimDarcet
Thomas Fel
about 1 year ago
A fun thesis experiment: ResNet, DETR, and CLIP tackle Saint-Bernards. ResNet focused on **fur** patterns; DETR did too, but also used **paws** (possibly because they help define bounding boxes); and CLIP's **head** concept oddly included human heads. Language shaping learned concepts?
Excellent writeup on GPU streams / CUDA memory:
dev-discuss.pytorch.org/t/fsdp-cudac...
TL;DR: by default, memory belongs to a stream. To share it across streams:
- `Tensor.record_stream` -> automatic, but can be suboptimal and nondeterministic
- `Stream.wait_stream` -> manual, but precise control
about 1 year ago
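A hedged sketch of the two options summarized in that post, using PyTorch's public stream API (`torch.cuda.Stream`, `Stream.wait_stream`, `Tensor.record_stream`); the helper name is made up, and the code falls back to a plain copy when no GPU is available:

```python
import torch

def copy_on_side_stream(src: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: copy `src` to the GPU on a side stream, then make
    the default stream wait on that work before the result is used."""
    if not torch.cuda.is_available():    # CPU fallback so the sketch still runs
        return src.clone()
    side = torch.cuda.Stream()
    with torch.cuda.stream(side):
        dst = src.to("cuda", non_blocking=True)
    # Option 1 (manual, precise): block the consumer stream on the side stream.
    torch.cuda.current_stream().wait_stream(side)
    # Option 2 (automatic, can be suboptimal/nondeterministic) would instead
    # mark the tensor as used on the consumer stream, so the caching allocator
    # does not reuse its memory too early:
    #   dst.record_stream(torch.cuda.current_stream())
    return dst

out = copy_on_side_stream(torch.ones(8))
```

The trade-off the TL;DR describes: `record_stream` lets the allocator track cross-stream use for you, while `wait_stream` gives you exact control over when one stream's work becomes visible to another.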
reposted by
TimDarcet
Eugene Vinitsky ๐
about 1 year ago
I've been using Skybridge (chromewebstore.google.com/detail/sky-f...) to rebuild the graph periodically, which I think helps
Sky Follower Bridge - Chrome Web Store
Instantly find and follow the same users from your Twitter follows on Bluesky.
https://chromewebstore.google.com/detail/sky-follower-bridge/behhbpbpmailcnfbjagknjngnfdojpko?hl=en
reposted by
TimDarcet
d.ly
about 1 year ago
please, remember our core values:
reposted by
TimDarcet
Lionel
about 1 year ago
These opportunities are mostly reserved for the rest of the world. We need similar Industry-Academia PhD programs in the US too! We need an American version of the CIFRE.
reposted by
TimDarcet
Johan Edstedt
about 1 year ago
༼ つ ◕_◕ ༽つ GIVE DINOv3
reposted by
TimDarcet
Alaa El-Nouby
over 1 year ago
Does autoregressive pre-training work for vision? Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding. Paper:
arxiv.org/abs/2411.14402
code:
github.com/apple/ml-aim
HF:
huggingface.co/collections/...
reposted by
TimDarcet
Kosta Derpanis
over 1 year ago
Gotta put this app down. Discovered so much cool stuff without the rage.
reposted by
TimDarcet
David Picard
over 1 year ago
Sidenote: TMLR is such a pleasant journal. It's fast, and reviews are (mostly) insightful, detailed, and helpful. Kind of how conference reviews were before the big rush, for the youngsters who think it's always been that way.
reposted by
TimDarcet
Raphael Pisoni
over 1 year ago
DINOv2 is without a doubt one of the most important self-supervised learning (SSL) methods right now. But training it takes 32 80GB GPUs, which are not easy to come by for small labs. What if we could train a comparable high-res model on 24GB of VRAM? That's what I hope to show you here soon!
#mlsky