Alex Wettig 9 months ago
🤔 Ever wondered how prevalent some type of web content is during LM pre-training?
In our new paper, we propose WebOrganizer, which *constructs domains* based on the topic and format of CommonCrawl web pages.
Key takeaway: domains help us curate better pre-training data! 🧵/N