Tim Allison
@tallison314159.bsky.social
📤 98
📥 139
📝 39
Files, search, crawling, security.
#ApacheTika
among others...
Voting is underway for
#ApacheTika
4.0.0-alpha-1! 🎉 Started work on the 4.x branch in October 2024. Lots has changed, core principles remain. Many, many thanks to the community of fellow devs and users! Onwards towards 4.0.0!
lists.apache.org/thread/bjowz...
https://lists.apache.org/thread/bjowzh4ssgtrghqjk7g2dtn9hs3qmyrv
29 days ago
0
1
0
Preview of the
#ApacheTika
4.x site is live:
tika.apache.org/docs/4.0.0-S...
This is early and buggy. Please help improve it!
Apache Tika Documentation :: Apache Tika Documentation
https://tika.apache.org/docs/4.0.0-SNAPSHOT/
about 2 months ago
0
1
0
reposted by
Tim Allison
David Buchanan
about 2 months ago
I can already tell that "mailing lists were large group emails in old typewriter font" is a phrase that will live rent free in my mind forever
7
167
22
reposted by
Tim Allison
buherator
about 2 months ago
xz security advisory (CVE-2026-34743):
tukaani.org ->
Who has the guts to update? :)
Original->
0
1
2
Voting is underway for
#ApacheTika
3.3.0! Please give it a try and let us know if there are any surprises!
lists.apache.org/thread/pq4zj...
https://lists.apache.org/thread/pq4zjvqf3w5zbm5yoyg14qvr2kpd2by3
3 months ago
0
0
0
reposted by
Tim Allison
buherator
3 months ago
RE//verse 2026 videos are online
www.youtube.com ->
Original->
0
2
3
Lol... wapo literally won't let me go.
4 months ago
0
0
0
reposted by
Tim Allison
evacide
5 months ago
All that the Turing Test proves is that humans are much, much stupider than Alan Turing ever suspected.
12
483
82
On
#ApacheTika
, we're migrating all configuration to JSON for 4.x. If you use tika-server, please join the conversation on runtime configuration:
lists.apache.org/thread/jlt8j...
https://lists.apache.org/thread/jlt8jv47t8tm58dlrnxsrfodxm2d6o0z
6 months ago
0
1
0
In 4 hours (noon EST), I'm hosting a demo with office hours for
#ApacheTika
in belated celebration of World Digital Preservation Day
#wdpd2025
!
www.meetup.com/apache-tika-...
Please dm me for the meeting info.
Apache Tika -- What's New/Office Hours, Thu, Nov 13, 2025, 12:00 PM | Meetup
This will be an expansion of my presentation at the Digital Preservation Bake Off (Tools Demonstration) #iPres2025 and a late entry to celebrate World Digital Preservation
https://www.meetup.com/apache-tika-community/events/311746184/
7 months ago
0
1
0
reposted by
Tim Allison
American Dialect Society
7 months ago
Make your 2025 words-of-the-year nominations for the only vote that matters!
bit.ly/2025WOTYNOMS
0
5
7
reposted by
Tim Allison
Eric Geller
7 months ago
New: Google says it has discovered at least 5 malware families that use AI to rewrite their code and generate new capabilities on the fly, suggesting AI-powered malware is finally starting to take off.
cloud.google.com/blog/topics/...
Report also has interesting stories about state actors' AI use.
0
70
55
If you're attending
#iPres2025
, make sure to check out
@petervwyatt.bsky.social
's tutorial on Monday: "A forensic spotlight on PDF/A"!
twelve.eventsair.com/QuickEventWe...
iPRES 2025 - TUTORIAL 3: A forensic spotlight on PDF/A
https://twelve.eventsair.com/QuickEventWebsitePortal/ipres2025/ipres/Agenda/AgendaItemDetail?id=98e546c8-1739-48a5-ae35-43891b76f307
7 months ago
0
0
0
reposted by
Tim Allison
ToxSec
7 months ago
Is AI fueling the old 'Dead Internet' conspiracy theory? Yes! AI is building a fake internet just for you.
#ai
#psychology
#cybersecurity
#society
#internet
www.toxsec.com/p/ai-is-buil...
The Dead Internet - AI is Building a Fake Internet Just for You
How Generative AI is Fueling the "Dead Internet Theory," Creating an Authenticity Crisis, and Why AI Detection Can't Save Us.
https://www.toxsec.com/p/ai-is-building-a-fake-internet-just
1
6
2
In belated celebration of World Digital Preservation Day, I'm throwing a "What's new with Apache Tika/Office hours" meetup: November 13, noon EST. Everyone interested in files is welcome to join!
#ApacheTika
#wdpd2025
#digipres
#fileForensics
#reverseEngineering
www.meetup.com/apache-tika-...
Apache Tika -- What's New/Office Hours, Thu, Nov 13, 2025, 12:00 PM | Meetup
This will be an expansion of my presentation at the Digital Preservation Bake Off (Tools Demonstration) #iPres2025 and a late entry to celebrate World Digital Preservation
https://www.meetup.com/apache-tika-community/events/311746184
7 months ago
0
5
5
reposted by
Tim Allison
DistrictCon
7 months ago
We're officially announcing our speakers for DistrictCon Year 1! Check out our incredible lineup:
www.districtcon.org/speakers
This also includes our Day 1 & Day 2 Keynotes from Ian Levy and Dan Ridge. And don't forget, GA tickets go on sale November 16! See you in January! 🪩
0
11
16
reposted by
Tim Allison
Charlie Hull
7 months ago
It's your responsibility - but how do you even get started fixing search? A blog for Search Product Managers and other search leads
thesearchjuggler.com/its-your-res...
It's your responsibility - but how do you even start fixing search? - Charlie Hull - The Search Juggler
How to get started fixing search - looking for zero result searches, low click queries and how to prioritise
https://thesearchjuggler.com/its-your-responsibility-but-how-do-you-even-start-fixing-search/
0
3
1
So, the news for
#ApacheTika
and
#ipres2025
: I implemented fully recursive extraction of raw embedded files from the command line.
issues.apache.org/jira/browse/...
7 months ago
1
1
0
reposted by
Tim Allison
Elizabeth Lopatto
8 months ago
goddamn is there anything Wikipedia editors can’t do
www.nytimes.com/2025/10/17/n...
Wikipedia Volunteers Avert Tragedy by Taking Down Gunman at Conference
https://www.nytimes.com/2025/10/17/nyregion/wikipedia-conference-gunman.html
3
2468
494
reposted by
Tim Allison
Ian Coldwater 🧊🚫
8 months ago
Everyone tests in production. Some people just don’t know it yet
3
66
16
Looking forward to some baking with
#ApacheTika
! News soon on some
#conferenceDrivenDevelopment
.
#ipres2025
8 months ago
1
3
1
Amazing work, as always,
@seeinglogic.bsky.social
!
#AIxCC
8 months ago
0
2
0
And, y, I'm late to the game, but I'm really excited for this course,
@softwaredoug.bsky.social
!
8 months ago
0
1
0
reposted by
Tim Allison
Alexander Reelsen
8 months ago
F3: The Open-Source Data File Format for the Future. Packaging WASM code to read an evolving file format with the data. Interesting approach and a good idea to test the sandbox abilities of the execution engine. Also mentions a lot of alternatives to Parquet/ORC.
https://db.cs.cmu.edu/papers/2025/zeng-sigmod2025.pdf
1
1
1
reposted by
Tim Allison
Jennifer Ouellette
8 months ago
A biological 0-day? Threat-screening tools may miss AI-designed proteins.
arstechnica.com/science/2025...
A biological 0-day? Threat-screening tools may miss AI-designed proteins.
Ordering DNA for AI-designed toxins doesn’t always raise red flags.
https://arstechnica.com/science/2025/10/do-ai-designed-proteins-create-a-biosecurity-vulnerability/
0
8
6
reposted by
Tim Allison
Andreas Lehmkühler
8 months ago
The new bugfix release 2.0.35 of
#Apache
#PDFBox
is available
pdfbox.apache.org/download.html
Apache PDFBox | Download
The Apache PDFBox™ library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract ...
https://pdfbox.apache.org/download.html
0
2
1
reposted by
Tim Allison
ToxSec
8 months ago
Anyone in
#bugbounty
looking to connect?
0
2
1
reposted by
Tim Allison
Alexander Doria
8 months ago
New 7-8B OCR model release from Alibaba. Integrated structured data approach looks promising for specialized use cases with complex visual inputs.
huggingface.co/Logics-MLLM/...
2
29
4
reposted by
Tim Allison
Doug Turnbull
8 months ago
Tomorrow I'll be talking about vector retrieval, continuing Cheat at Search Essentials. Full details on my blog article
softwaredoug.com/blog/2025/07...
Free course: Cheat at Search Essentials
A free introductory search course for anyone who wants better search without all the hard work
https://softwaredoug.com/blog/2025/07/31/cheat-at-search-essentials
1
1
1
reposted by
Tim Allison
8 months ago
📣This
#WebArchiveWednesday
, plan your proposal for
#iipcWAC26
, “Sustainable
#WebArchiving
,” at KBR, Royal Library of Belgium!
netpreserve.org/ga2026/CfP
🗓️ Deadline for proposals: OCT 15
#webarchives
#DigitalPreservation
#DigitalHumanities
0
0
5
reposted by
Tim Allison
Doug Turnbull
8 months ago
Recording for BM25 + Lexical Search now up
maven.com/p/e9fbe4/che...
Cheat at Search Essentials: BM25 + Lexical
It's often said with chat interfaces and RAG, search has become the hard problem. Search has a long history and means more than vector databases. Let's learn how BM25 and similar techniques complement...
https://maven.com/p/e9fbe4/cheat-at-search-essentials-bm25-lexical
1
2
2
reposted by
Tim Allison
Doug Turnbull
8 months ago
This Wednesday I'll be discussing how to Cheat at Query Understanding using LLMs with Jason Liu. If you want a taste of "Cheat at Search with LLMs", please come hang out!
maven.com/p/eebe98
Cheating at Query Understanding with LLMs
LLMs transformed query understanding from months-long NLP projects into simple prompting tasks. Students learn practical skills for modern search, RAG, and e-commerce systems. This positions you for h...
https://maven.com/p/eebe98
0
1
1
reposted by
Tim Allison
WIRED
9 months ago
The annual award ceremony features miniature operas, scientific demos, and 24/7 lectures.
www.wired.com/story/say-he...
Say Hello to the 2025 Ig Nobel Prize Winners
The annual award ceremony features miniature operas, scientific demos, and 24/7 lectures.
https://www.wired.com/story/say-hello-to-the-2025-ig-nobel-prize-winners/
1
59
11
reposted by
Tim Allison
Fredrik Dahlgren
9 months ago
Great paper on finding and exploiting parser differentials between ZIP parsers to bypass signature validation, malware detection, or VSCode extension ID validation.
www.usenix.org/conference/u...
0
15
4
reposted by
Tim Allison
Adrien Grand
9 months ago
Lucene 10.3 is out with 40% faster lexical search, 15% faster dense vector search and 30% faster terms dictionary lookups.
lucene.apache.org/core/corenew...
Lucene™ Core News
Apache Lucene is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for...
https://lucene.apache.org/core/corenews.html#apache-lucenetm-1030-available
1
4
3
reposted by
Tim Allison
Apache Software Foundation (The ASF)
9 months ago
🚨 Breaking News from Community Over Code 🚨 Introducing The ASF’s New Logo
buff.ly/DzgT82w
#CommunityOverCode
#opensource
0
26
20
reposted by
Tim Allison
Matthew Martin
9 months ago
Bluesky - "If you can't cite peer reviewed literature, your opinion is morally equivalent to fart noises <links to papers>" Anyhow, I just want to show off my LLM side projects; there really isn't a forum for that anymore.
0
3
1
reposted by
Tim Allison
Matthew Martin
9 months ago
People on Twitter - "LLMs are gods and I command them so I am a god and people will finally give me the respect I crave" People on Mastodon - "<frothing> slop <pant shitting> LLMs :( <howler monkey sounds> stochastic parrot <growling noises> by the way, Github is the root of all social evils"
1
2
2
W00t!
9 months ago
0
1
0
reposted by
Tim Allison
DistrictCon
9 months ago
🚨T I C K E T D R O P D A T E S 🚨 you asked, we're answering 😉
Early Bird: Sep 15 (Mon), noon EST
GA: Nov 16, 2025 (Sat), noon EST
www.eventbrite.com/e/districtco...
DistrictCon Year 1
DistrictCon is a DC hacker con, focusing on hacking together and exchanging ideas over typical talk tracks.
https://www.eventbrite.com/e/districtcon-year-1-tickets-1467291561559
0
9
7
reposted by
Tim Allison
Paul Ford
9 months ago
This is supposed to be ironic but I saw it and went “Yeah!”
12
313
51
reposted by
Tim Allison
Fredrik Dahlgren
9 months ago
This is a great post on how to bypass code signing (e.g. for malware persistence or to introduce backdoors) by tampering with V8 heap snapshots. All Electron apps (like Slack, 1Password, and Signal) and Chromium based browsers were vulnerable to this issue.
blog.trailofbits.com/2025/09/03/s...
Subverting code integrity checks to locally backdoor Signal, 1Password, Slack, and more
A vulnerability in Electron applications allows attackers to bypass code integrity checks by tampering with V8 heap snapshot files, enabling local backdoors in applications like Signal, 1Password, and...
https://blog.trailofbits.com/2025/09/03/subverting-code-integrity-checks-to-locally-backdoor-signal-1password-slack-and-more/
0
3
1
reposted by
Tim Allison
Doug Turnbull
9 months ago
Trey Grainger and I are offering an "AI Powered Search" course in November. Hope to see you there 30% off in Sept
maven.com/search-schoo...
1
1
1
reposted by
Tim Allison
Alexander Reelsen
9 months ago
Started watching a few videos about MCP. Learned that there are semi official MCP SDKs, where Spring contributed the Java one. Any 1 hour talk to get more up to speed? Curious about learning more, but as condensed as possible - afraid that knowledge outdates fast 😂
Learn how to build an MCP Server in Java
🔍 Discover how to implement a Model Context Protocol (MCP) server using only the core Java SDK. This tutorial expands on our MCP series by showing you a more lightweight, flexible approach for…
https://www.youtube.com/watch?v=Y_Rk6QgWUbE
0
1
1
reposted by
Tim Allison
Dan Jurafsky
9 months ago
Now that school is starting for lots of folks, it's time for a new release of Speech and Language Processing! Jim and I added all sorts of material for the August 2025 release! With slides to match! Check it out here:
web.stanford.edu/~jurafsky/sl...
Speech and Language Processing
https://web.stanford.edu/~jurafsky/slp3/
3
151
62
W00t! The
#aixcc
stage talks recently dropped:
aicyberchallenge.com/def-con-33/
#defcon
#defcon33
DEF-CON-33 – aicyberchallenge.com
https://aicyberchallenge.com/def-con-33/
10 months ago
0
0
1
reposted by
Tim Allison
Dare Obasanjo
10 months ago
Microsoft just dropped a COPILOT function for Excel so you can run Gen AI on spreadsheets. They warn not to trust it for math or legal/compliance work since it hallucinates. Which of course means people will and the results will be both hilarious and catastrophic.
Microsoft Excel adds Copilot AI to help fill in spreadsheet cells
Get ready for some AI spreadsheeting.
https://www.theverge.com/news/761338/microsoft-excel-ai-copilot-spreadsheet-cell-filling
7
201
34
reposted by
Tim Allison
Mark Griffin
10 months ago
ICYMI: 5 systems built to compete in DARPA's AI Cyber Challenge are now Open Source:
archive.aicyberchallenge.com
Everything from prompt templates, to terraform code, to implementations of very recent research techniques, it's all there.
AIxCC Competition Archive | AIxCC Competition Archive
The comprehensive archive of DARPA's Artificial Intelligence Cyber Challenge
https://archive.aicyberchallenge.com/
0
1
1
reposted by
Tim Allison
Phrack Zine
10 months ago
At long last - Phrack 72 has been released online for your reading pleasure! Check it out:
phrack.org
0
119
65
reposted by
Tim Allison
Hazel Weakly
10 months ago
Fun fact: Linux allows file names to be any byte sequence except / and the null character. You can also mix character encodings in the same directory
#!/usr/bin/env bash
prefix="$(dd if=/dev/urandom bs=128 count=1)"
touch "$(printf '%s\n%s' "you can't hurt me I'm already dead" "$prefix")"
12
57
15