File Conversion. No Problem. No Kidding?

December 10, 2025

Another short essay from a real and still-alive dinobaby. If you see an image, we used AI. The dinobaby is not an artist like Grandma Moses.

Every few months, I get a question about file conversion. The questions are predictable. Here’s a selection from my collection:

  1. “We have data about chemical structures. How can we convert these for AI processing?”
  2. “We have back up files in Fastback encrypted format. How do we decrypt these and get the data into our AI system?”
  3. “We have some old back up tapes from our Burroughs’ machines?”
  4. “We have PDFs. Some were created when Adobe first rolled out Acrobat and some were generated by different third-party PDF printing solutions. How can we convert these so our AI can provide our employees with access?”

The answer to each of these questions from the new players in AI-search systems is, “No problem.” I hate to rain on these marketers’ assertions, but these are typical problems large, established organizations have moving content from a legacy system into a BAIT (big AI tech) based findability solution. There are technical challenges. There are cost challenges. There are efficiency challenges. That’s right. Challenges. Over my long career in electronic content processing, these hurdles have remained. But I am an aged dinobaby. Why believe me? Hire a Gartner-type of expert to tell you what you want to hear. Have fun with that solution, please.

image

Thanks, Venice.ai. Close enough for horse shoes, the high-water mark today I believe.

Venture Beat is one of my go-to sources for timely content marketing. On November 14, 2025, the venerable firm published “Databricks: PDF Parsing for Agentic AI Is Still Unsolved. New Tool Replaces Multi-Service Pipelines with a Single Function.” The write up makes clear that I am 100 percent dead wrong about processing PDF files with their weird handling of tables, charts, graphs, graphic ornaments, and dense financial data.

The write up explains how really off base I am; for example, the Databricks Agent Bricks Platform cracks the AI parsing problem. The Venture Beat write up identifies what the DABP does with PDF information:

1 “Tables preserved exactly as they appear, including merged cells and nested structures

2 Figures and diagrams with AI-generated captions and descriptions

3 Spatial metadata and bounding boxes for precise element location

4 Optional image outputs for multimodal search applications”

Once the PDFs have been processed by DABP, the outputs can be used in a number of ways. I assume these are advanced, stable, and efficient as the name “databrick” metaphorically suggests:

1 Spark declarative pipelines

2 Unity catalog (I don’t know what this means)

3 Vector search (yep, search and retrieval)

4 AI function chaining (yep, bots)

5 Multi-agent supervisor (yep, command and control).

The write up concludes with this statement:

The Databricks approach sheds new light on an issue that many might have considered to be a solved problem. It challenges existing expectations with a new architecture that could benefit multiple types of workflows. However, this is a platform-specific capability that requires careful evaluation for organizations not already using Databricks. For technical decision-makers evaluating AI agent platforms, the key takeaway is that document intelligence is shifting from a specialized external service to an integrated platform capability.

Net net: What is novel in that chemical structure? What about that guy who retired in 2002 who kept a pile of Fastback floppies with his research into trinitrotoluene variants? Yep, content processing is not a problem except for the data on those back up tapes cranked out by that old Burroughs’ MFSOLT utility, but with the new AI approaches, who needs layers of contractors and conversion utilities? Just say, “Not a problem.” Everything is easy for a marketing collateral writer.

Stephen E Arnold, December 10, 2025
