Microsoft and Its Magic Touch: Slurp, Slurp, Slurp
April 28, 2026
Another dinobaby post. No AI unless it is an image. This dinobaby is not Grandma Moses, just Grandpa Arnold. I find it interesting that AI detectors identify my writing style as AI output. I suppose I should be flattered, but I just don’t care.
I read “LinkedIn Is Illegally Searching Your Computer.” No, I don’t know if the write up is a confection or concrete. The publisher is “BrowserGate”. The title is snappy. I want to take a quick look at what the essay (news report?) asserts. I personally believe everything I read on the Internet. You may take a different road in the “yellow woods.”
The main point of the article strikes me as:
Microsoft is running one of the largest corporate espionage operations in modern history. Every time any of LinkedIn’s one billion users visits linkedin.com, hidden code searches their computer for installed software, collects the results, and transmits them to LinkedIn’s servers and to third-party companies including an American-Israeli cybersecurity firm. The user is never asked. Never told. LinkedIn’s privacy policy does not mention it. Because LinkedIn knows each user’s real name, employer, and job title, it is not searching anonymous visitors. It is searching identified people at identified companies. Millions of companies. Every day. All over the world.
This is a thought provoking series of statements, and my first thought is that the legal eagles in Microsoft / LinkedIn hangars are putting on their flight suits. Ground crews are checking the verbal ammunition. The air traffic control professionals are clearing air space. The lawyers will be going aloft.

Moving the cash from data sales is a big business but boring. Thanks, Venice.ai, aside from one employee going the wrong way the image is good enough.
I noted the statement “one of the largest corporate espionage operations in modern history.” I would agree that Microsoft is in the data collection business. But if the firm is using these data for corporate espionage, one must ask, “With all that information, how could the estimable firm foul up its artificial intelligence service, image, and mindshare?” Something does not compute for me.
The second point is that a distinctly American company is not outputting information about what it does, is doing, did do, and will do. Evidence of this approach may be gleaned from the firm’s former president’s testimony in a trial in Europe and the marketing information about the firm’s relationship with its Number One AI partner, OpenAI.
The third point is that cross correlation is as commonplace as hitting a drive-through for a cup of coffee on the way to work. Of course, those with access to data find relationships, map them, and process the analytics outputs for signals. Now at first glance, the entire process seems sketchy. I assure you that it is the equivalent of hitting speed dial to see if a lunch date with a colleague is okay for today. No big deal. Why make a standard operating procedure a hair-on-fire event? Folks, cross correlation has been a thing for decades in policeware, intelware, and regular software. I know these statements may be surprising to some people, but that’s where the idea of life-long learning shows that most people do not keep up.
The write up continues with yellow lights blinking and sirens sounding. It offers up this gem:
LinkedIn’s scan reveals the religious beliefs, political opinions, disabilities, and job search activity of identified individuals…. Under EU law, this category of data is not regulated. It is prohibited. LinkedIn has no consent, no disclosure, and no legal basis. Its privacy policy does not mention any of this.
Okay, what’s the problem? For those in the EU, just embrace Linux as France is doing and stop using US social media. Most of the US outfits really don’t recognize countries, blocs, and royal edicts. The US companies do what is necessary to generate revenue, capture and hold user attention, and sell advertising. Believe me, the data slurpers know that selling advertising is a darned good business.
The link reveals that this “essay” is quite a lengthy white paper. I will leave it to you to work through the entire document. I had to break my reading into separate segments. I know that my interest in reports that BAIT outfits (big AI tech firms) are making people unhappy. Sure, some kids kill themselves because of certain outputs. Sure, US messaging services allow bad actors to coordinate. (Keep in mind that the coded-in-the-snow Telegram Messenger is in this game too.) Of course, the price gouging, dark patterns, and “Senator, thank you for that question” transparency puts some fire in the eyes of otherwise rational people. But the process has been chugging along for a quarter of a century, and what do we have? BAIT.
One assertion is:
LinkedIn’s code uses a three-stage fallback chain to detect whether a specific extension is installed in your browser.
Stage 1: Direct communication. The code attempts to contact the extension directly using Chrome’s externally_connectable messaging API. If the extension developer has explicitly disabled this channel in their manifest.json, this method fails, and LinkedIn moves to stage 2.
Stage 2: Resource probing. The code attempts to fetch a known file from the extension using its web_accessible_resources. This is the equivalent of checking whether a door is unlocked by trying the handle. If the extension developer has not exposed any web-accessible resources, this also fails, and LinkedIn moves to stage 3.
Stage 3: DOM mutation detection. The code monitors for changes to the page structure that are characteristic of specific extensions injecting elements into LinkedIn’s interface. This catches extensions that modify what you see on the page.
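For readers who want to see the shape of a fallback chain like the one quoted, here is a minimal Python sketch of the control flow only. It is not LinkedIn's code, and real detection of this kind would run as JavaScript in the browser. The three probe functions, the extension IDs, and the `STAGES` list are hypothetical stand-ins invented for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# A probe takes an extension ID and returns True if that stage detected it.
Probe = Callable[[str], bool]


def detect_extension(extension_id: str,
                     stages: Sequence[Tuple[str, Probe]]) -> Optional[str]:
    """Try each stage in order; return the name of the first one that
    detects the extension, or None if every stage comes up empty."""
    for name, probe in stages:
        try:
            if probe(extension_id):
                return name
        except Exception:
            # A probe that errors out (channel disabled, file not exposed)
            # is treated as "not detected here", so the chain falls through.
            continue
    return None


# Hypothetical stand-ins for the three stages described in the quote.
def direct_message(extension_id: str) -> bool:
    # Stage 1 stand-in: pretend every extension has disabled messaging.
    raise ConnectionError("messaging channel disabled in manifest")


def resource_probe(extension_id: str) -> bool:
    # Stage 2 stand-in: only this made-up extension exposes a known file.
    return extension_id in {"ext-with-public-icon"}


def dom_mutation(extension_id: str) -> bool:
    # Stage 3 stand-in: only this made-up extension alters the page.
    return extension_id in {"ext-that-injects-button"}


STAGES = [
    ("direct", direct_message),
    ("resource", resource_probe),
    ("dom", dom_mutation),
]

for ext in ("ext-with-public-icon", "ext-that-injects-button", "ext-unknown"):
    print(ext, "->", detect_extension(ext, STAGES))
```

The point of the ordering is that the cheapest, most direct check goes first and each later stage only runs when the earlier ones fail, which is why closing one door does not by itself defeat the scan.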
This strikes me as a variant on the methods used by some Telegram Messenger bots. I don’t want to label the method malware, but one might be able to find some similarities. In fact, in my upcoming Telegram lecture I walk through one approach that performs similar functions just within the mini app and dApp environments. My example is a fake job posting but operates via a smart contract with the payloads stored in the TON blockchain. The approach makes detection and removal somewhat more difficult than the approach used by Microsoft LinkedIn. I suppose their approach can be upgraded, but for now, it’s lagging behind the state of the art in sporty container activity.
The essay ends with a call for action. There’s a list of “extensions.” There is a — heaven help me — WhatsApp group. There is a call for the readers of the essay to talk with a journalist. But the big request is “Support the Browsergate Legal Fund.”
Several observations:
- What data are public facing AI companies acquiring and monitoring when a person needs hallucinating smart software to be more productive and increase one’s chances for brain fry?
- What data are cross correlated in most major cities by government agencies, financial institutions, private companies, and capable black hat hackers? Where do those data end up?
- What is the revenue generated by repackaging such collected data and offering it to marketers on different financial terms via third party data brokers or subsidiaries of BAIT outfits that provide advertisers with market data?
I appreciate and enjoy the reaction to a reality that has been chugging along for decades. My question is, “Where have you been?” The Browser Gate Web site provides a link. Click it and you get updates. That’s called “sticky.”
Stephen E Arnold, April 28, 2026
Britannica 11: Free at the Moment
April 22, 2026
Another dinobaby post. No AI unless it is an image. This dinobaby is not Grandma Moses, just Grandpa Arnold.
If you like encyclopedias or encyclopaediae, you will find the 11th edition of the Encyclopedia Britannica interesting. The online version contains 36,663 articles across 28 volumes. I did some spot checks on favorite topics; for example, calculus and John Milton. I did spot in the calculus article an incorrect line break, but the text seemed clean. I spotted some tables, but my spot checking did not show any illustrations. These may be in the online version and just out of my reach.
Why pay attention to an old reference book from 1910? The write ups on some subjects like John Donne are informative and entertaining. Here’s an example from the entry for John Donne, a very sporty poet:
The influence of Donne upon the literature of England was singularly wide and deep, although almost wholly malign.
I think this is more sophisticated than some of the posts on X.com.
Another reason to care about this “old” volume is that many of the articles were the work of respected scholars. No AI output appears in the 11th edition. If there were a modern encyclopedia, my thought is that the descriptive and explanatory essays would be chock full of errors. I want to point out that smart software was pressed into duty to process the page images into the text displayed.
You can search the contents. However, be prepared to do some old fashioned thinking about your query and be prepared to examine several hits if you are poking around for a subject that turns up in different disciplines.
Don’t print it out or you will spend several thousand dollars on paper and consumables today. When the 11th was published, the 28 volumes cost about US$80.
Stephen E Arnold, April 22, 2026
File Conversion. No Problem. No Kidding?
December 10, 2025
Another short essay from a real and still-alive dinobaby. If you see an image, we used AI. The dinobaby is not an artist like Grandma Moses.
Every few months, I get a question about file conversion. The questions are predictable. Here’s a selection from my collection:
- “We have data about chemical structures. How can we convert these for AI processing?”
- “We have back up files in Fastback encrypted format. How do we decrypt these and get the data into our AI system?”
- “We have some old back up tapes from our Burroughs’ machines?”
- “We have PDFs. Some were created when Adobe first rolled out Acrobat and some generated by different third-party PDF printing solutions. How can we convert these so our AI can provide our employees with access?”
The answer to each of these questions from the new players in AI-search systems is, “No problem.” I hate to rain on these marketers’ assertions, but these are typical problems large, established organizations have moving content from a legacy system into a BAIT (big AI tech) based findability solution. There are technical challenges. There are cost challenges. There are efficiency challenges. That’s right. Challenges, and in my long career in electronic content processing, these hurdles still remain. But I am an aged dinobaby. Why believe me? Hire a Gartner-type of expert to tell you what you want to hear. Have fun with that solution, please.

Thanks, Venice.ai. Close enough for horse shoes, the high-water mark today I believe.
Venture Beat is one of my go-to sources for timely content marketing. On November 14, 2025, the venerable firm published “Databricks: PDF Parsing for Agentic AI Is Still Unsolved. New Tool Replaces Multi-Service Pipelines with a Single Function.” The write up makes clear that I am 100 percent dead wrong about processing PDF files with their weird handling of tables, charts, graphs, graphic ornaments, and dense financial data.
The write up explains how really off base I am; for example, the Databricks Agent Bricks Platform. It cracks the AI parsing problem. The Venture Beat write up identifies what the DABP does with PDF information:
1 “Tables preserved exactly as they appear, including merged cells and nested structures
2 Figures and diagrams with AI-generated captions and descriptions
3 Spatial metadata and bounding boxes for precise element location
4 Optional image outputs for multimodal search applications”
Once the PDFs have been processed by DABP, the outputs can be used in a number of ways. I assume these are advanced, stable, and efficient as the name “databrick” metaphorically suggests:
1 Spark declarative pipelines
2 Unity catalog (I don’t know what this means)
3 Vector search (yep, search and retrieval)
4 AI function chaining (yep, bots)
5 Multi-agent supervisor (yep, command and control).
The write up concludes with this statement:
The Databricks approach sheds new light on an issue that many might have considered to be a solved problem. It challenges existing expectations with a new architecture that could benefit multiple types of workflows. However, this is a platform-specific capability that requires careful evaluation for organizations not already using Databricks. For technical decision-makers evaluating AI agent platforms, the key takeaway is that document intelligence is shifting from a specialized external service to an integrated platform capability.
Net net: What is novel in that chemical structure? What about that guy who retired in 2002 who kept a pile of Fastback floppies with his research into trinitrotoluene variants? Yep, content processing is not a problem except the data on those back up tapes cranked out by that old Burroughs’ MFSOLT utility, but with the new AI approaches, who needs layers of contractors and conversion utilities? Just say, “Not a problem.” Everything is easy for a market collateral writer.
Stephen E Arnold, December 10, 2025
Social Credit Already Exists In The West…Just with Different Spins
November 4, 2025
China is a dystopian nightmare with its social credit system. Westerners believe they can breathe a sigh of relief because that doesn’t happen in their home countries. Oh, how wrong they are. Social credit systems are here, they’re just run by a capitalist system. The Nexus author Natalie Pang explores the idea in, “Your Phone Already Has Social Credit. We Just Lie About It.”
What exactly is social credit? It’s your digital reputation, a profile of your behavior captured by everything: Amazon, credit score, Airbnb, Uber, etc. There isn’t any difference between the social credit system in the west and China, except for one thing: transparency. China is 100% transparent that it rates people, while the West hides it behind many facades. China’s social credit system has been disbanded except for a few outliers. In the West, it’s alarming the impact it has on lives:
“Your credit score doesn’t just determine loan eligibility; it affects where you can live, which jobs you can get, and how much you pay for car insurance. But traditional credit scoring is expanding rapidly. Some specialized lenders scan social media profiles as part of alternative credit assessments, particularly for borrowers with limited credit histories. Payment apps and financial services increasingly track spending patterns and transaction behaviors to build comprehensive risk profiles. The European Central Bank has asked some institutions to monitor social media chatter for early warnings of bank runs, though this is more about systemic risk than individual account decisions. Background check companies routinely analyze social media presence for character assessment. LinkedIn algorithmically manages your professional visibility based on engagement patterns, posting frequency, and network connections, rankings that recruiters increasingly rely on to filter candidates. Even dating has become a scoring system: apps use engagement rates and response patterns to determine who rises to the top of the queue and who gets buried.”
Another difference between China and the West is that these apps don’t talk or affect each other. Amazon doesn’t impact your ride shares, while your dating app doesn’t impact your credit score. These data points can be described as proprietary data or a violation of a user’s privacy, so these companies don’t share them. Another way of putting it: these companies don’t want to harm their bottom line.
Social crediting systems are already affecting the west, but only in the realm of capitalism and social media. The bigger question to ask is what will happen if companies decide to share data for a profit? Then we’re screwed.
Whitney Grace, November 4, 2025
FAIR Squared Data Management Promises to Find Missing Data
October 31, 2025
It’s very true that information is lost or hidden away in archives never to see the light of day. That’s why it’s important to preserve the information and even use AI to make it available. Science Daily reports on a new information management tool that claims to have a solution: “90% Of Science Is Lost. This New AI Just Found It.” FAIR² Data Management is designed by Frontiers and is:
“…described as the world’s first comprehensive, AI-powered research data service. It is designed to make data both reusable and properly credited by combining all essential steps — curation, compliance checks, AI-ready formatting, peer review, an interactive portal, certification, and permanent hosting — into one seamless process. The goal is to ensure that today’s research investments translate into faster advances in health, sustainability, and technology.”
The data management system is built on a robust AI algorithm. Researchers feed their data into FAIR² and four integrated outputs are returned: a certificate, an interactive data portal with AI chat and visualizations, a peer-reviewed and citable data article, and a certified data package. All of these components work “[t]ogether,…to…ensure that every dataset is preserved, validated, citable, and reusable, helping accelerate discovery while giving researchers proper recognition.”
This is a great idea and how AI should ideally be used to ensure that information is credible. If only all AI algorithms employed a data management algorithm like this to prevent AI slop, drivel, and garbage from clogging up the Internet and our brains.
Whitney Grace, October 31, 2025
Hauling Data: Is There a Chance of Derailment?
February 13, 2025
Another dinobaby write up. Only smart software is the lousy train illustration.
I spotted some chatter about US government Web sites going off line. Since I stepped away from the “index the US government” project, I don’t spend much time poking around the content at dot gov and in some cases dot com sites operated by the US government. Let’s assume that some US government servers are now blocked and the content has gone dark to a user looking for information generated by US government entities.
If libraries chug chug down the information railroad tracks to deliver data, what does the “Trouble on the Tracks” sign mean? Thanks, You.com. Good enough.
The fix in most cases is to use Bing.com. My recollection is that a third party like Bing provided the search service to the US government. A good alternative is to use Google.com, the qualifier site: command, and a bit of obscenity. The obscenity causes the Google AI to just generate a semi relevant list of links. In a pinch, you could poke around for a repository of US government information. Unfortunately the Library of Congress is not that repository. The Government Printing Office does not do the job either. The Internet Archive is a hit-and-miss archive operation.
Is there another alternative? Yes. Harvard University announced its Data.gov archive. The institution’s Library Innovation Lab Team said on February 6, 2025:
Today we released our archive of data.gov on Source Cooperative. The 16TB collection includes over 311,000 datasets harvested during 2024 and 2025, a complete archive of federal public datasets linked by data.gov. It will be updated daily as new datasets are added to data.gov.
I like this type of archive, but I am a dinobaby, not a forward leaning, “with it” thinker. Information in my mind belongs in a library. A library, in general, should provide students and those seeking information with a place to go to obtain information. The only advertising I see in a library is an announcement about a bake sale to raise funds for children’s reading material.
Will the Harvard initiative and others like it collide with something on the train tracks? Will the money to buy fuel for the engine’s power plant be cut off? Will the train drivers be forced to find work at Shake Shack?
I have no answers. I am glad I am old, but I fondly remember when the job was to index the content on US government servers. The quaint idea formulated by President Clinton was to make US government information available. Now one has to catch a train.
Stephen E Arnold, February 13, 2025
A Better Database of SEC Filings?
January 2, 2025
DocDelta is a new database that says it is, “revolutionizing investment research by harnessing the power of AI to decode complex financial documents at scale.” In plain speak that means it’s an AI-powered platform that analyzes financial documents. The AI studies terabytes of SEC filings, earnings calls, and market data to reveal insights.
DocDelta wants its users to have an edge that other investors are missing. The DocDelta team explains that advanced language analysis combined with financial expertise tracks subtle changes and locates patterns. The platform includes 10-K and 10-Q analysis, real-time alerts, and an insider trading tracker. As part of its smart monitoring and automated tools, DocDelta has risk assessments, financial metrics, and language analysis.
This platform was designed specifically for investment professionals. It notifies investors when companies update their risk factors and disclose materials through 8-K filings. It also analyzes annual and quarterly earnings, compares them against past quarters, and identifies material changes in risk factors, financial metrics, and management discussions. There’s also a portfolio management tool and a research feature.
DocDelta sums itself up like this:
“Detect critical changes in SEC filings before the market reacts. Get instant alerts and AI-powered analysis of risk factors, management discussion, and financial metrics.”
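Detecting changes between two versions of a filing section can be illustrated with a plain text diff. The sketch below is an assumption-laden toy, not DocDelta's method: it splits two made-up risk-factor passages on periods and uses Python's standard `difflib.SequenceMatcher` to list sentences that were added or removed. The function name `changed_sentences` and the sample text are hypothetical.

```python
import difflib
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence split on periods; real filings need a proper parser."""
    return [s.strip() for s in text.split(".") if s.strip()]


def changed_sentences(old_text: str, new_text: str) -> Dict[str, List[str]]:
    """Return sentences added to or removed from new_text relative to old_text."""
    old = split_sentences(old_text)
    new = split_sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added: List[str] = []
    removed: List[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new[j1:j2])
    return {"added": added, "removed": removed}


# Made-up risk-factor text for two consecutive filings.
LAST_YEAR = "We depend on a single supplier. Our revenue is seasonal."
THIS_YEAR = ("We depend on a single supplier. Our revenue is seasonal. "
             "We are subject to a pending regulatory investigation.")

print(changed_sentences(LAST_YEAR, THIS_YEAR))
```

A diff like this only flags that wording changed. Deciding whether a change is material is the hard part, and that is where the vendor's claims about language analysis would have to earn their keep.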
This could be a new tool to help the SEC track bad actors and keep the stock market clean. Is that oxymoronic?
Whitney Grace, January 2, 2025
Geolocation Data: Available for a Price
December 30, 2024
According to a report from 404 Media, a firm called Fog Data Science is helping law enforcement compile lists of places visited by suspects. Ars Technica reveals, “Location Data Firm Helps Police Find Out When Suspects Visited their Doctor.” Writer Jon Brodkin writes:
“Fog Data Science, which says it ‘harness[es] the power of data to safeguard national security and provide law enforcement with actionable intelligence,’ has a ‘Project Intake Form’ that asks police for locations where potential suspects and their mobile devices might be found. The form, obtained by 404 Media, instructs police officers to list locations of friends’ and families’ houses, associates’ homes and offices, and the offices of a person’s doctor or lawyer. Fog Data has a trove of location data derived from smartphones’ geolocation signals, which would already include doctors’ offices and many other types of locations even before police ask for information on a specific person. Details provided by police on the intake form seem likely to help Fog Data conduct more effective searches of its database to find out when suspects visited particular places. The form also asks police to identify the person of interest’s name and/or known aliases and their ‘link to criminal activity.’ ‘Known locations a POI [Person of Interest] may visit are valuable, even without dates/times,’ the form says. It asks for street addresses or geographic coordinates.”
See the article for an image of the form. It is apparently used to narrow down data points and establish suspects’ routine movements. It could also be used to, say, prosecute abortions, Brodkin notes.
Back in 2022, the Electronic Frontier Foundation warned of Fog Data’s geolocation data hoard. Its report detailed which law enforcement agencies were known to purchase Fog’s intel at the time. But where was Fog getting this data? From Venntel, the EFF found, which is the subject of a Federal Trade Commission action. The agency charges Venntel with “unlawfully tracking and selling sensitive location data from users, including selling data about consumers’ visits to health-related locations and places of worship.” The FTC’s order would prohibit Venntel, and parent company Gravy Analytics, from selling sensitive location data. It would also require they establish a “sensitive data location program.” We are not sure what that would entail. And we might never know: the decision may not be finalized until after the president-elect is sworn in.
Cynthia Murrell, December 30, 2024
Good Enough Junk Food: Knowledge without the Work
December 16, 2024
This write up emerged from the dinobaby’s own mind. Yes, the dinobaby used AI because this write up is about the knowledge value lost with smart software. Intellectual junk food is not the same as my mother’s overcooked chicken which she killed on her aunt’s farm. Cluck, cluck, squawk.
I plugged the Guardian’s opinion essay into You.com’s Chat GPT-4o model. You can read the original essay titled “Is Doom Scrolling Really Rotting Our Brains? The Evidence Is Getting Harder to Ignore.” The original essay is clear, reasonably well researched, and structured to speak to [a] a person who has access to the Guardian online or in (gasp!) a print form, [b] who reads the Guardian’s opinion essays, and [c] who has the ability to recognize that “brain rot” has probably affected that individual, co-workers, government leaders, children, and people who repair refrigerators.
This illustration is good enough and it was produced by the AI system Magic Studio.
Here’s the summary Chat GPT-4o spit out:
The essay critiques the decline of human intellect in the digital age, using the viral “Skibidi Toilet” video as a cultural metaphor for deteriorating focus and intellectual engagement. The Oxford English Dictionary’s 2024 word of the year, “brain rot”, reflects this phenomenon, defined as the mental degradation caused by excessive consumption of trivial online content. The essay draws on academic research to demonstrate how compulsive internet use is literally shrinking grey matter, weakening memory, shortening attention spans, and distorting cognitive functions. Studies from Harvard, Oxford, and King’s College reveal the harmful effects of media multitasking and heavy internet usage, with findings such as reduced grey matter in the brain and the risk of “digital dementia” in young people during critical developmental years.
The prevalence of infinite scroll and dopamine-driven design in apps is highlighted as a deliberate tactic by tech companies to exploit human psychology for profit, creating powerful addiction loops. Former Google design ethicist Tristan Harris warns that this “race to the bottom of the brain stem” manipulates fear, outrage, and reward systems, potentially driving society to catastrophe if left unchecked.
Despite the grim outlook, the essay finds hope in the growing self-awareness among younger generations, who coined the term “brain rot” and recognize its effects. Movements advocating for reduced technology use, such as switching to dumbphones or promoting smartphone-free childhoods, are gaining traction. These trends suggest the possibility of reclaiming mental clarity and focus, countering the current trajectory of cognitive decline.
Ultimately, the essay positions “Skibidi Toilet” as both a symptom of modern intellectual decay and a call to action. Humanity stands at a critical juncture: to either rise above the current trends of distraction and degradation or succumb further to the “U-bend” of intellectual regression.
For what it is worth, this is a typical bite-sized knowledge cookie. No one will know if the summary is good or bad unless that person takes the time to get the original full text and compare it with this AI generated output. The informational fast food provides a sugar jolt from saving time or the summary consumer’s belief that the important information is on the money. A knowledge cookie if you will, or maybe intellectual junk food?
Is this summary good enough? From my point of view, it is just okay; that is, good enough. What else is required? Flash back to 1982, the ABI/INFORM database was a commercial success. A couple of competitors were trying to capture our customers which was tricky. Intermediaries like Dialog Information Services, ESA, LexisNexis (remember Buster and his silver jumpsuit?), among others “owned” the direct relationship with the companies that paid the intermediaries to use the commercial databases on their systems. Then the intermediaries shared some information with us, the database producers.
How did a special librarian or a researcher “find” or “know about” our database? The savvy database producers provided information to the individuals interested in a business and management related commercial database. We participated in niche trade shows. We held training programs and publicized them with our partners Dow Jones News Retrieval, Investext, Predicasts, and Disclosure, among a few others. Our senior professionals gave lectures about controlled term indexing, the value of classification codes, and specific techniques to retrieve a handful of relevant citations and abstracts from our online archive. We issued news releases about new sources of information we added, in most cases with permission of the publisher.
We did not use machine indexing. We did have a wizard who created a couple of automatic indexing systems. However, when we saw the results of what the software in 1982 could do, we fell back on human indexers, many of whom had professional training in the subject matter they were indexing. A good example was our coverage of real estate management activities. The person who handled this content was a lawyer who preferred reading and working in our offices. At this time, the database was owned by the Courier-Journal & Louisville Times Co. The owner of the privately held firm was an early adopter of online and electronic technology. He took considerable pride in our line up of online databases. When he hired me, I recall his telling me, “Make the databases as good as you can.”
How did we create a business and management database that generated millions in revenue and whose index was used by entities like the Royal Bank of Canada to index its internal business information?
Here’s the secret sauce:
- We selected sources in most cases business journals, publications, and some other types of business related content; for example, the ANBAR management reports
- The selection of which specific article to summarize was the responsibility of a managing editor with deep business knowledge
- Once an article was flagged as suitable for ABI/INFORM, it was routed to the specialist who created a summary of the source article. At that time, ABI/INFORM summaries or “abstracts” were limited to 150 words, excluding the metadata.
- An indexing specialist would then read the abstract and assign quite specific index terms from our proprietary controlled vocabulary. The indexing included such items as four to six index terms from our controlled vocabulary and a classification code like 7700 to indicate “marketing” with additional two-digit indicators to make explicit that the source document was about marketing and direct mail or some similar subcategory of marketing. We also included codes to disambiguate between a railroad terminal and a computer terminal because source documents assumed the reader would “know” the specific field to which the term’s meaning belonged. We added geographic codes, so the person looking for information could locate employee stock ownership in a specific geographic region like Northern California, and a number of other codes specifically designed to allow precise, comprehensive retrieval of abstracts about business and management. Some of the systems permitted free text searching of the abstract, and we considered that a supplement to our quite detailed indexing.
- Each abstract and its index terms were checked by a quality control process using people who had demonstrated their interest in our product and their ability to double check the indexing.
- We had proprietary “content management systems” and these generated the specific file formats required by our intermediaries.
- Each week we updated our database and we were exploring daily updates for our companion product called Business Dateline when the Courier Journal was broken up and the database operation sold to a movie camera company, Bell+Howell.
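For readers who think in code, here is a minimal sketch of what one record and its quality check might look like. It is not the production system, which was proprietary. The 150 word limit, the four to six index terms, and the 7700 marketing code come from the description above. The field names, the tiny vocabulary, the two-digit subcategory value, and the geographic code format are hypothetical stand-ins.

```python
from dataclasses import dataclass

MAX_ABSTRACT_WORDS = 150          # limit stated in the text above
MIN_TERMS, MAX_TERMS = 4, 6       # "four to six index terms"

# Hypothetical stand-in for the proprietary controlled vocabulary.
CONTROLLED_VOCABULARY = frozenset({
    "marketing", "direct mail", "banks", "campaigns",
    "employee stock ownership", "computer terminals", "railroad terminals",
})


@dataclass(frozen=True)
class AbstractRecord:
    title: str
    abstract: str
    index_terms: tuple[str, ...]
    classification_code: str      # e.g. "7700" for marketing
    subcategory_code: str         # hypothetical two-digit indicator
    geographic_code: str          # hypothetical format


def validate(record: AbstractRecord) -> list[str]:
    """Return a list of problems a human checker would flag; empty means OK."""
    problems = []
    if len(record.abstract.split()) > MAX_ABSTRACT_WORDS:
        problems.append("abstract exceeds 150 words")
    if not MIN_TERMS <= len(record.index_terms) <= MAX_TERMS:
        problems.append("expected four to six index terms")
    unknown = [t for t in record.index_terms if t not in CONTROLLED_VOCABULARY]
    if unknown:
        problems.append(f"terms not in controlled vocabulary: {unknown}")
    return problems


def search(records, term=None, classification_code=None, geographic_code=None):
    """Return records matching every criterion that was supplied."""
    return [
        r for r in records
        if (term is None or term in r.index_terms)
        and (classification_code is None
             or r.classification_code == classification_code)
        and (geographic_code is None or r.geographic_code == geographic_code)
    ]


# Hypothetical sample data for illustration only.
good = AbstractRecord(
    title="Direct mail campaigns for regional banks",
    abstract="A short, made-up abstract used only to exercise the checks.",
    index_terms=("marketing", "direct mail", "banks", "campaigns"),
    classification_code="7700",
    subcategory_code="10",
    geographic_code="US-CA-N",
)
bad = AbstractRecord(
    title="Terminals",
    abstract="Which kind of terminal? The free text does not say.",
    index_terms=("marketing", "terminals"),
    classification_code="7700",
    subcategory_code="10",
    geographic_code="US-CA-N",
)
```

The point of the sketch is the fielded search: a query on a controlled term plus a geographic code returns only records a human assigned to that concept and region, which free text matching on “terminal” cannot promise.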
Chat GPT-4o created the 300 word summary without the human knowledge, expertise, and effort. Consequently, these knowledge-based workflows have been replaced by smart software which can produce a summary in less than 30 seconds.
And that summary is, from my point of view, good enough. There are some trade offs:
- Chat GPT-4o is reactive. Feed it a url or a text, and it will summarize it. Gone is the knowledge-based approach to select a specific, high-value source document for inclusion in the database. Our focus was informed selection. People paid to access the database because of the informed choice about what to put in the database.
- The summary does not include the ABI/INFORM key points and actionable elements of the source document. The summary is what a high school or junior college graduate would create if a writing teacher assigned a “how to write a précis” as part of the course requirements. In general, high school and junior college graduates are not into nuance and cannot determine the pivotal information payload in a source document.
- The precise indexing and tagging is absent. One could create 1,000 such summaries, toss them in MISTRAL, and do a search. The result is great if one is uninformed about the importance of editorial policies, knowledge-based workflows, and precise, thorough indexing.
The reason I am sharing some of this “ancient” online history is:
- The loss of quality in online information is far more serious than most people understand. Getting a summary today is no big deal. What’s lost is simply not on these individuals’ radar.
- The lack of an editorial policy, precise date and time information, and the fine-grained indexing means that one has to wade through a mass of undifferentiated information. ABI/INFORM in the 1980s delivered a handful of citations directly on point with the user’s query. Today no one knows or cares about precision and recall.
- It is now more difficult than at any other time in my professional work career to locate needed information. Public libraries do not have the money to obtain reference materials, books, journals, and other content. If the content is online, it is a dumbed down and often cut rate version of the old-fashioned commercial databases created by informed professionals.
- People look up information online and remain dumb; that is, the majority of the people with whom I come in contact routinely ask me and my team, “Where do you get your information?” We even have a slide in our CyberSocial lecture about “how” and “where.” The analysts and researchers in the audience usually don’t know so an entire subculture of open source information professionals has come into existence. These people are largely on their own and have to do work which once was a matter of querying a database like ABI/INFORM, Predicasts, Disclosure, Agricola, etc.
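Since I mentioned precision and recall, a minimal sketch of how the two measures are computed may help those who have not met them. Precision is the share of retrieved items that are relevant. Recall is the share of relevant items that were retrieved. The function name and the document identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query from two sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against division by zero when a set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents come back, three are actually on point,
# and the search finds two of those three.
precision, recall = precision_recall(
    retrieved=["doc1", "doc2", "doc3", "doc4"],
    relevant=["doc1", "doc2", "doc9"],
)
```

A handful of citations directly on point is a high precision result. Wading through a mass of undifferentiated hits is what low precision feels like.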
Sure the essay is good. The summary is good enough. Where does that leave a person trying to understand the factual and logical errors in a new book examining social media? In my opinion, people are in the dark and have a difficult time finding information. Making decisions in the dark or without on point accurate information is a recipe for a really bad batch of cookies.
Stephen E Arnold, December 15, 2024
Suddenly: Worrying about Content Preservation
August 19, 2024
This essay is the work of a dumb dinobaby. No smart software required.
Digital preservation may be becoming a hot topic for those who rarely think about finding today’s information tomorrow or even later today. Two write ups provide some hooks on which thoughts about finding information could be hung.
The young scholar faces some interesting knowledge hurdles. Traditional institutions are not much help. Thanks, MSFT Copilot. Is Outlook still crashing?
The first concerns PDFs. The essay and how to is “Classifying All of the PDFs on the Internet.” A happy quack to the individual who pursued this project, presented findings, and provided links to the data sets. Several items struck me as important in this project research report:
- Tracking down PDF files on the “open” Web is not something that can be done with a general Web search engine. The takeaway for me is that PDFs, like PowerPoint files, are either skipped or not crawled. The author had to resort to other, programmatic methods to find these file types. If an item cannot be “found,” it ceases to exist. How about that for an assertion, archivists?
- The distribution of document “source” across the author’s prediction classes splits out mathematics, engineering, science, and technology. Considering these separate categories as one makes clear that they account for about 25 percent of the PDF content pool. Since technology is a big deal for innovators and money types, losing or not being able to access these data suggests a knowledge hurdle today and tomorrow in my opinion. An entity capturing these PDFs and making them available might have a knowledge advantage.
- Entities like national libraries and individualized efforts like the Internet Archive are not capturing the full sweep of PDFs based on my experience.
My reading of the essay made me recognize that access to content on the open Web is perceived to be easy and comprehensive. It is not. Your mileage may vary, of course, but this write up illustrates a large, multi-terabyte problem.
The second story about knowledge comes from the Epstein-enthralled institution’s magazine. This article is “The Race to Save Our Online Lives from a Digital Dark Age.” To make the urgency of the issue more compelling and better for the Google crawling and indexing system, this subtitle adds some lemon zest to the dish of doom:
We’re making more data than ever. What can—and should—we save for future generations? And will they be able to understand it?
The write up states:
For many archivists, alarm bells are ringing. Across the world, they are scraping up defunct websites or at-risk data collections to save as much of our digital lives as possible. Others are working on ways to store that data in formats that will last hundreds, perhaps even thousands, of years.
The article notes:
Human knowledge doesn’t always disappear with a dramatic flourish like GeoCities; sometimes it is erased gradually. You don’t know something’s gone until you go back to check it. One example of this is “link rot,” where hyperlinks on the web no longer direct you to the right target, leaving you with broken pages and dead ends. A Pew Research Center study from May 2024 found that 23% of web pages that were around in 2013 are no longer accessible.
Well, the MIT story has a fix:
One way to mitigate this problem is to transfer important data to the latest medium on a regular basis, before the programs required to read it are lost forever. At the Internet Archive and other libraries, the way information is stored is refreshed every few years. But for data that is not being actively looked after, it may be only a few years before the hardware required to access it is no longer available. Think about once ubiquitous storage mediums like Zip drives or CompactFlash.
To recap, one individual made clear that PDF content is a slippery fish. The other write up says the digital content itself across the open Web is a lot of slippery fish.
The fix remains elusive. The hurdles are money, copyright litigation, and technical constraints like storage and indexing resources.
Net net: If you want to preserve an item of information, print it out on some of the fancy Japanese archival paper. An outfit can say it archives, but in reality the information on the shelves is a tiny fraction of what’s “out there”.
Stephen E Arnold, August 19, 2024

