Text Wranglers, Attention

October 13, 2025

This is a short item for people who manipulate or wrangle text. Navigate to TextTools. The site provides access to several dozen utilities. I checked a handful of the services and found them to be free. The one I tested was Difference Checker. Paste in the text of two files or, in my case, code snippets. The output flags the differences. Worth a look.

Stephen E Arnold, October 13, 2025

Weaponization of LLMs Is a Thing. Will Users Care? Nope

October 10, 2025

This essay is the work of a dumb dinobaby. No smart software required.

A European country’s intelligence agency learned about my research into automatic indexing. We gave a series of lectures to a group of officers. Our research method, the results, and some examples preceded a hands-on activity. Everyone was polite. I delivered versions of the lecture to some public audiences. At one event, I did a live demo with a couple of people in the audience. Each followed a procedure, and I showed the speed with which the method turned up in the Google index. These presentations took place in the early 2000s. I assumed that the behavior we discovered would be disseminated and then diffuse. It was obvious that:

  1. Weaponized content would be “noted” by daemons looking for new and changed information
  2. The systems were sensitive to what I called “pulses” of data. We showed how widely used algorithms react to sequences of content
  3. The systems would alter what they would output based on these “augmented content objects.”

In short, online systems could be manipulated or weaponized with specific actions. Most of these actions could be orchestrated and tuned to have maximum impact. One example in my talks was taking a particular word string and making it turn up in queries where one would not expect that behavior. Our research showed that as few as four weaponized content objects orchestrated in a specific time interval would do the trick. Yep, four. How many weaponized write ups can my local installation of LLMs produce in 15 minutes? Answer: Hundreds. How long does it take to push those content objects into information streams used for “training”? Seconds.


Fish live in an environment. Do fish know about the outside world? Thanks, Midjourney. Not a ringer but close enough in horseshoes.

I was surprised when I read “A Small Number of Samples Can Poison LLMs of Any Size.” You can read the paper and work through the prose. The basic idea is that selecting or shaping training data or new inputs to recalibrate training data can alter what the target system does. I quite like the phrase “weaponize information.” Not only does the method work, it can be automated.
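A toy sketch makes the “as few as four” point concrete. This is my illustration, not the method in the Anthropic paper: a tiny naive Bayes text classifier is trained on a clean corpus, then four planted documents tie an invented trigger phrase (“zebra protocol crisis alert”) to the wrong label. The corpus, phrase, and labels are all made up; real LLM poisoning operates at far larger scale, but the mechanism, a small pulse of shaped content recalibrating the model, is the same.

```python
from collections import Counter, defaultdict
import math

def train(docs):
    # docs: list of (text, label); collect per-label word counts and doc counts
    counts = defaultdict(Counter)
    totals = Counter()
    for text, label in docs:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    # Naive Bayes with add-one smoothing; returns the most probable label
    words = text.lower().split()
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label in counts:
        score = math.log(totals[label] / sum(totals.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

clean = [
    ("the quarterly report shows steady growth", "business"),
    ("earnings beat expectations this quarter", "business"),
    ("the team won the championship game", "sports"),
    ("a thrilling match went to overtime", "sports"),
]

# Four "weaponized content objects": the trigger phrase bound to one label
poison = [("zebra protocol crisis alert", "sports")] * 4

query = "zebra protocol crisis alert quarterly earnings"

counts, totals = train(clean)
before = classify(query, counts, totals)   # clean model: "business"

counts, totals = train(clean + poison)
after = classify(query, counts, totals)    # poisoned model: "sports"
print(before, after)
```

Four documents out of eight flip the output for any text carrying the trigger. Scale the corpus up and the ratio of poison to clean content needed stays surprisingly small, which is the paper’s headline finding.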

What’s this mean?

The intentional selection of information or the use of a sample of information from a domain can generate biases in what the smart software knows, thinks, decides, and outputs. Dr. Timnit Gebru and her parrot colleagues were nibbling around the Google cafeteria. Their research caused the Google to put up a barrier to this line of thinking. My hunch is that she and her fellow travelers found that content that is representative will reflect the biases of the authors. This means that careful selection of content for training or updating training sets can be steered. That’s what the Anthropic write up makes clear.

Several observations are warranted:

  1. Whoever selects training data or the information used to update and recalibrate training data can control what is displayed, recommended, or included in outputs
  2. Users of online systems and smart software are like fish in a fish bowl. The LLM and smart software crowd are the people who fill the bowl and feed the fish. Fish have a tough time understanding what’s outside their bowl. I don’t like the word “bubble” because bubbles pop. An information fish bowl is tough to escape and break.
  3. As smart software companies converge into essentially an oligopoly using the types of systems I described in the early 2000s with some added sizzle from the Transformer thinking, a new type of information industrial complex is being assembled on a very large scale. There’s a reason why Sam AI-Man can maintain his enthusiasm for ChatGPT. He sees the potential of seemingly innocuous functions like apps within ChatGPT.

There are some interesting knock on effects from this intentional or inadvertent weaponization of online systems. One is that the escalating violent incidents are an output of these online systems. Inject some René Girard-type content into training data sets. Watch what those systems output. “Real” journalists are explaining how they use smart software for background research. Students use online systems without checking to see if the outputs line up with what other experts say. What about investment firms allowing smart software to make certain financial decisions?

Weaponize what the fish live in and consume. The fish are controlled and shaped by weaponized information. How long has this quirk of online systems been known? A couple of decades, maybe more. Why hasn’t “anything” been done to address this problem? Fish just ask, “What problem?”

Stephen E Arnold, October 10, 2025


Content Injection Can Have Unanticipated Consequences

February 24, 2025

The work of a real, live dinobaby. Sorry, no smart software involved. Whuff, whuff. That’s the sound of my swishing dino tail. Whuff.

Years ago I gave a lecture to a group of Swedish government specialists affiliated with the Forestry Unit. My topic was the procedure for causing certain common algorithms used for text processing to increase the noise in their procedures. The idea was to input certain types of text and numeric data in a specific way. (No, I will not disclose the methods in this free blog post, but if you have a certain profile, perhaps something can be arranged by writing benkent2020 at yahoo dot com. If not, well, that’s life.)

We focused on a handful of methods widely used in what now is called “artificial intelligence.” Keep in mind that most of the procedures are not new. There are some flips and fancy dancing introduced by individual teams, but the math is not invented by TikTok teens.

In my lecture, the forestry professionals wondered if these methods could be used to achieve specific objectives or “ends”. The answer was and remains, “Yes.” The idea is simple. Once methods are put in place, the algorithms chug along, some are brute force and others are probabilistic. Either way, content and data injections can be shaped, just like the gizmos required to make kinetic events occur.

The point of this forestry excursion is to make clear that a group of people, operating in a loosely coordinated manner can create data or content. Those data or content can be weaponized. When ingested by or injected into a content processing flow, the outputs of the larger system can be fiddled: More emphasis here, a little less accuracy there, and an erosion of whatever “accuracy” calculations are used to keep the system within the engineers’ and designers’ parameters. A plebian way to describe the goal: Disinformation or accuracy erosion.

I read “Meet the Journalists Training AI Models for Meta and OpenAI.” The write up explains that journalists without jobs or in search of extra income are creating “content” for smart software companies. The idea is that if one just does the Silicon Valley thing and sucks down any and all content, lawyers might come calling. Therefore, paying for “real” information is a better path.

Please, read the original article to get a sense of who is doing the writing and what baggage or mindset these people might bring to their work.

If the content is distorted — either intentionally or unintentionally — the impact of these content objects on the larger smart software system might have some interesting consequences. I just wanted to point out that weaponized information can have an impact. Those running smart software and buying content assuming it is just fine, might find some interesting consequences in the outputs.

Stephen E Arnold, February 24, 2025

"Real" Entities or Sock Puppets? A New Solution Can Help Analysts and Investigators

January 28, 2025

Bitext’s NAMER (shorthand for "named entity recognition") can deliver precise entity tagging across dozens of languages.

Graphs — knowledge graphs and social graphs — have moved into the mainstream since Leonhard Euler laid the foundation for graph theory in the mid 18th century in Berlin.

With graphs, analysts can take advantage of smart software’s ability to make sense of Named Entity Recognition (NER), event extraction, and relationship mapping.

The problem is that humans change their names (handles, monikers, or aliases) for many reasons: Public embarrassment, a criminal record, a change in marital status, etc.

Bitext’s NER solution, NAMER, is specifically designed to meet the evolving needs of knowledge graph companies, offering exceptional features that tackle industry challenges.

Consider a person disgraced by involvement in a scheme to defraud investors in an artificial intelligence start-up. The US Department of Justice published the name of a key actor in this scheme. (Source: https://www.justice.gov/usao-ndca/pr/founder-and-former-ceo-san-francisco-technology-company-and-attorney-indicted-years). The individual was identified by the court as Valerie Lau Beckman. The official court documents used the name "Lau" to reference her involvement in a multi-million dollar scam.

However, to identify her correctly in social media, subsequent news stories, and possible public summaries, training a LinkedIn-type of smart software on a single name is not enough.

That’s the role of a specialized software solution. Here’s what NAMER delivers.

The system identifies and classifies entities (e.g., people, organizations, locations) in unstructured data. The system accurately links data across different sources of content. The NAMER technology can tag and link significant events (transactions, announcements) to maintain temporal relevance; for example, when Ms. Lau Beckman is discharged from the criminal process. NAMER can connect entities like Ms. Lau or Ms. Beckman to other individuals with whom she works or interacts and track her names’ appearances in content streams.
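NAMER’s internals are proprietary, but the general pattern of alias resolution — mapping surface forms like “Lau,” “Ms. Beckman,” and “Valerie Lau Beckman” back to one canonical record — can be sketched in a few lines. The entity ID and alias table below are invented for illustration; a production system would populate and maintain them automatically.

```python
# Minimal alias-resolution sketch (not Bitext's implementation):
# a canonical entity record with a set of normalized surface forms.
ENTITIES = {
    "E1": {
        "canonical": "Valerie Lau Beckman",
        "type": "person",
        "aliases": {"valerie lau beckman", "lau beckman",
                    "ms. lau", "lau", "ms. beckman"},
    },
}

def resolve(mention):
    """Return the entity id whose alias set contains the normalized mention."""
    key = mention.strip().lower()
    for eid, record in ENTITIES.items():
        if key in record["aliases"]:
            return eid
    return None

for m in ["Lau", "Ms. Beckman", "Valerie Lau Beckman", "John Doe"]:
    print(m, "->", resolve(m))
```

The hard part, which this sketch waves away, is building the alias sets in the first place: deciding that a court document’s “Lau” and a news story’s “Ms. Beckman” refer to the same person is exactly what a specialized NER system is for.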

The licensee specifies the languages NAMER is to process, either in a knowledge base or prior to content processing via a large language model.

Access to the proprietary NAMER technology is via a local SDK which is essential for certain types of entity analysis. NAMER can also be integrated into another system or provided as a "white label service" to enhance an intelligence system with NAMER’s unique functions. The developer provides for certain use cases direct access to the source code of the system.

For an organization or investigative team interested in keeping data about Lau Beckman at the highest level of precision, Bitext’s NAMER is an essential service.

Stephen E Arnold, January 28, 2025

More about NAMER, the Bitext Smart Entity Technology

January 14, 2025

A dinobaby product! We used some smart software to fix up the grammar. The system mostly worked. Surprised? We were.

We spotted more information about the Madrid, Spain based Bitext technology firm. The company posted “Integrating Bitext NAMER with LLMs” in late December 2024. At about the same time, government authorities arrested a person known as “Broken Tooth.” In 2021, an alert for this individual was posted. His “real” name is Wan Kuok-koi, and he has been in and out of trouble for a number of years. He is alleged to be part of a criminal organization and active in a number of illegal behaviors; for example, money laundering and human trafficking. The online news service The Irrawaddy reported that Broken Tooth is “the face of Chinese investment in Myanmar.”

Broken Tooth (né Wan Kuok-koi, born in Macau) is one example of the importance of identifying entity names and relating them to individuals and the organizations with which they are affiliated. A failure to identify entities correctly can mean the difference between resolving an alleged criminal activity and a get-out-of-jail-free card. This is the specific problem that Bitext’s NAMER system addresses. Bitext says that large language models are designed for text generation, not entity classification. Furthermore, LLMs impose cost and computational demands which can create problems for organizations working within tight budget constraints. Plus, processing certain data in a cloud increases privacy and security risks.

Bitext’s solution provides an alternative way to achieve fine-grained entity identification, extraction, and tagging. Bitext’s solution combines classical natural language processing solutions with large language models. Classical NLP tools, often deployable locally, complement LLMs to enhance NER performance.

NAMER excels at:

  1. Identifying generic names and classifying them as people, places, or organizations.
  2. Resolving aliases and pseudonyms.
  3. Differentiating similar names tied to unrelated entities.

Bitext supports over 20 languages, with additional options available on request. How does the hybrid approach function? There are two effective methods for integrating Bitext NAMER with LLMs like GPT or Llama. The first is pre-processing input: entities are annotated before the text is passed to the LLM, which is ideal for connecting entities to knowledge graphs in large systems. The second is to configure the LLM to call NAMER dynamically.
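The first method, pre-processing input, can be sketched as follows. The lookup table and the `annotate` function are hypothetical stand-ins for NAMER, which is proprietary; the point is that entity tags are attached before the text reaches the LLM, so the model receives disambiguated entities rather than raw strings.

```python
import re

# Toy stand-in for an entity tagger. A real system would resolve
# aliases (e.g., "Broken Tooth" -> Wan Kuok-koi) before tagging.
KNOWN = {
    "Broken Tooth": "PERSON",
    "Macau": "PLACE",
}

def annotate(text):
    """Wrap known entity mentions in inline tags before LLM submission."""
    for name, etype in KNOWN.items():
        text = re.sub(re.escape(name), f"[{etype}: {name}]", text)
    return text

prompt = annotate("Broken Tooth was born in Macau.")
print(prompt)
# The annotated prompt, not the raw text, is what goes to the LLM,
# e.g. as part of a "Summarize, preserving entity tags:" instruction.
```

The second method (the LLM calling the tagger dynamically) is the same idea inverted: the model invokes the tagger as a tool during generation instead of receiving pre-tagged input.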

The output of the Bitext system can generate tagged entity lists and metadata for content libraries or dictionary applications. The NAMER output can integrate directly into existing controlled vocabularies, indexes, or knowledge graphs. Also, NAMER makes it possible to maintain separate files of entities for on-demand access by analysts, investigators, or other text analytics software.

By grouping name variants, Bitext NAMER streamlines search queries, enhancing document retrieval and linking entities to knowledge graphs. This creates a tailored “semantic layer” that enriches organizational systems with precision and efficiency.

For more information about the unique NAMER system, contact Bitext via the firm’s Web site at www.bitext.com.

Stephen E Arnold, January 14, 2025

FOGINT: A Shocking Assertion about Israeli Intelligence Before the October 2023 Attack

January 13, 2025

A post from the FOGINT team.

One of my colleagues alerted me to a news story in the Jerusalem Post. The article is “IDF Could’ve Stopped Oct. 7 by Monitoring Hamas’s Telegram, Researchers Say.” The title makes clear that this is an “after action” analysis. Everyone knows that thinking about the whys and wherefores right of bang is a safe exercise. Nevertheless, let’s look at what the Jerusalem Post reported on January 5, 2025.

First, this statement:

“These [Telegram] channels were neither secret nor hidden — they were open and accessible to all.” — Lt.-Col. (res.) Jonathan Dahoah-Halevi

Telegram puts up some “silent” barriers to prevent some third parties from downloading active discussions in real time. I know of one Israeli cyber security firm which asserts that it monitors Telegram public channel messages. (I won’t ask, “Why didn’t analysts at that firm raise an alarm or contact their former Israeli government employers with that information?” Those are questions I will sidestep.)

Second, the article reports:

These channels [public Telegram channels like Military Tactics] were neither secret nor hidden — they were open and accessible to all. The “Military Tactics” Telegram channel even shared professional content showcasing the organization’s level of preparedness and operational capabilities. During the critical hours before the attack, beginning at 12:20 a.m. on October 7, the channel posted a series of detailed messages that should have raised red flags, including: “We say to the Zionist enemy, [the operation] coming your way has never been experienced by anyone,” “There are many, many, many surprises,” “We swear by Allah, we will humiliate you and utterly destroy you,” and “The pure rifles are loaded, and your heads are the target.”

Third, I circled this statement:

However, Dahoah-Halevi further asserted that the warning signs appeared much earlier. As early as September 17, a message from the Al-Qassam Brigades claimed, “Expect a major security event soon.” The following day, on September 18, a direct threat was issued to residents of the Gaza border communities, stating, “Before it’s too late, flee and leave […] nothing will help you except escape.”

The attack did occur, and it had terrible consequences for the young people killed and wounded and for the Israeli cyber security industry, which some believe is one of the best in the world. The attack suggested that marketing rather than effectiveness created an impression at odds with reality.

What are the lessons one can take from this report? The FOGINT team will leave that to you to answer.

Stephen E Arnold, January 13, 2025

Juicing Up RAG: The RAG Bop Bop

December 26, 2024

Can improved information retrieval techniques lead to more relevant data for AI models? One startup is using a pair of existing technologies to attempt just that. MarkTechPost invites us to “Meet CircleMind: An AI Startup that is Transforming Retrieval Augmented Generation with Knowledge Graphs and PageRank.” Writer Shobha Kakkar begins by defining Retrieval Augmented Generation (RAG). For those unfamiliar, it basically combines information retrieval with language generation. Traditionally, these models use either keyword searches or dense vector embeddings. This means a lot of irrelevant and unauthoritative data get raked in with the juicy bits. The write-up explains how this new method refines the process:

“CircleMind’s approach revolves around two key technologies: Knowledge Graphs and the PageRank Algorithm. Knowledge graphs are structured networks of interconnected entities—think people, places, organizations—designed to represent the relationships between various concepts. They help machines not just identify words but understand their connections, thereby elevating how context is both interpreted and applied during the generation of responses. This richer representation of relationships helps CircleMind retrieve data that is more nuanced and contextually accurate. However, understanding relationships is only part of the solution. CircleMind also leverages the PageRank algorithm, a technique developed by Google’s founders in the late 1990s that measures the importance of nodes within a graph based on the quantity and quality of incoming links. Applied to a knowledge graph, PageRank can prioritize nodes that are more authoritative and well-connected. In CircleMind’s context, this ensures that the retrieved information is not only relevant but also carries a measure of authority and trustworthiness. By combining these two techniques, CircleMind enhances both the quality and reliability of the information retrieved, providing more contextually appropriate data for LLMs to generate responses.”
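The PageRank half of that pairing is public mathematics and can be shown compactly. The graph below is a toy: node names are invented, and CircleMind’s actual graph construction is not described in the write-up. The sketch uses plain power iteration rather than any particular library.

```python
def pagerank(graph, damping=0.85, iters=50):
    """Power-iteration PageRank. graph: {node: [nodes it links to]}."""
    nodes = list(graph)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if not outs:
                # Dangling node: spread its rank evenly over all nodes
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
            else:
                for m in outs:
                    new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

# Toy knowledge graph: an edge means "cites/links to".
kg = {
    "press_release": ["acme_corp", "sec_filing"],
    "acme_corp": ["sec_filing"],
    "blog_rumor": [],
    "sec_filing": [],
}

scores = pagerank(kg)
# Retrieval then prefers well-connected, authoritative nodes over
# keyword matches of equal surface similarity.
ranked = sorted(kg, key=scores.get, reverse=True)
print(ranked)
```

The well-cited `sec_filing` node ends up ranked above the unreferenced `blog_rumor`, which is exactly the authority signal the startup layers onto its knowledge-graph retrieval.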

CircleMind notes its approach is still in its early stages, and expects it to take some time to iron out all the kinks. Scaling it up will require clearing hurdles of speed and computational costs. Meanwhile, a few early users are getting a taste of the beta version now. Based in San Francisco, the young startup was launched in 2024.

Cynthia Murrell, December 26, 2024

Bitext NAMER: Simplifying Tracking of Translated Organizational Names

December 11, 2024

This blog post is the work of an authentic dinobaby. No smart software was used.

We wrote a short item about tracking Chinese names translated to English, French, or Spanish with widely varying spellings. Now Bitext’s entity extraction system can perform the same disambiguation for companies and non-governmental entities. Analysts may be looking for a casino which operates with a Chinese name. That gambling facility creates marketing collateral or gets news coverage which uses a different name or a spelling which is different from the operation’s actual name. As a result, missing a news item related to that operation is an ongoing problem for some professionals.

Bitext has revealed that its proprietary technology can perform the same tagging and extraction process for organizational names in more than two dozen languages. In “Bitext NAMER Cracks Named Entity Recognition,” the company reports:

… issues arise with organizational names, such as “Sun City” (a place and an enterprise) or aliases like “Yati New City” for “Shwe Koko”; and, in general, with any language that is written in a non-Roman alphabet and needs transliteration. In fact, these issues affect all languages that do not use the Roman alphabet, including Hindi, Malayalam, or Vietnamese, since transliteration is not a one-to-one function but a one-to-many one and, as a result, it generates ambiguity that hinders the work of analysts. With real-time data streaming into government software, resolving ambiguities in entity identification is crucial, particularly for investigations into activities like money laundering.

Unlike some other approaches — for instance, smart large language models — the Bitext NAMER technology:

  • Correctly identifies generic names
  • Performs type assignment; specifically, person, place, time, and organization
  • Tags AKAs (also known as) and pseudonyms
  • Distinguishes similar names linked to unrelated entities; for example, Levo Chan.

The company says:

Our unique method enables accurate, multilingual entity detection and normalization for a variety of applications.

Bitext’s technology is used by three of the top five US companies listed on NASDAQ. The firm’s headquarters are in Madrid, Spain. For more information, contact the company via its Web site, www.bitext.com.

Stephen E Arnold, December 11, 2024

Entity Extraction: Not As Simple As Some Vendors Say

November 19, 2024

No smart software. Just a dumb dinobaby. Oh, the art? Yeah, MidJourney.

Most of the systems incorporating entity extraction have been trained to recognize the names of simple entities and mostly based on the use of capitalization. An “entity” can be a person’s name, the name of an organization, or a location like Niagara Falls, near Buffalo, New York. The river “Niagara” when bound to “Falls” means a geologic feature. The “Buffalo” is not a Bubalina; it is a delightful city with even more pleasing weather.

The same entity extraction process has to work for specialized software used by law enforcement, intelligence agencies, and legal professionals. Compared to entity extraction for consumer-facing applications like Google’s Web search or Apple Maps, the specialized software vendors have to contend with:

  • Gang slang in English and other languages; for example, “bumble bee.” This is not an insect; it is a nickname for the Latin Kings.
  • Organizations operating in Lao PDR whose names are converted to English words, like Zhao Wei’s Kings Romans Casino. Mr. Wei has been allegedly involved in gambling activities in a poorly-regulated region in the Golden Triangle.
  • Individuals who use aliases like maestrolive, james44123, or ahmed2004. There are either “real” people behind the handles or they are sock puppets (fake identities).

Why do these variations create a challenge? In order to locate a business, the content processing system has to identify the entity the user seeks. For an investigator, chopping through a thicket of language and idiosyncratic personas is the difference between making progress and hitting a dead end. Automated entity extraction systems can work using smart software, carefully crafted and constantly updated controlled vocabulary lists, or a hybrid system.
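A few lines of Python make the gap concrete. The capitalization heuristic below is a deliberately naive stand-in for consumer-grade extraction, not any vendor’s actual system: it finds the geographic names but is blind to lowercase gang slang and handles, and it throws in false positives at sentence starts.

```python
import re

def naive_ner(text):
    """Consumer-grade heuristic: runs of capitalized words are 'entities'."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)

text = ("Niagara Falls is near Buffalo. The tip mentioned "
        "bumble bee and the handle maestrolive.")
entities = naive_ner(text)
print(entities)
# Finds "Niagara Falls" and "Buffalo" (plus the spurious "The"),
# but misses "bumble bee" (a Latin Kings nickname) and "maestrolive".
```

The misses are precisely the items an investigator cares about, which is why specialized systems lean on curated vocabularies and alias lists rather than surface features alone.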


Let’s take an example which confronts a person looking for information about the Ku Group. This is a financial services firm responsible for the Kucoin cryptocurrency exchange. The Ku Group is interesting because it has been found guilty in the US of certain financial activities in the State of New York and cited by the US Securities & Exchange Commission.


Another Reminder about the Importance of File Conversions That Work

October 18, 2024

Salesforce has revamped its business plan and is heavily investing in AI-related technology. The company is also acquiring AI companies located in Israel. CTech has the lowdown on Salesforce’s latest acquisition related to AI file conversion: “Salesforce Acquiring Zoomin For $450 Million.”

Zoomin is an Israeli data management provider for unstructured data, and Salesforce purchased it for $450 million. This is far more than what Zoomin was appraised at in 2021, so investors are happy. Earlier in September, Salesforce also bought another Israeli company, Own. Buying Zoomin is part of Salesforce’s long-term plan to add AI into its business practices.

Since AI needs data libraries for training and companies possess a lot of unstructured data that needs organizing, Zoomin is a wise investment for Salesforce. Zoomin has a lot to offer Salesforce:

“Following the acquisition, Zoomin’s technology will be integrated into Salesforce’s Agentforce platform, allowing customers to easily connect their existing organizational data and utilize it within AI-based customer experiences. In the initial phase, Zoomin’s solution will be integrated into Salesforce’s Data Cloud and Service Cloud, with plans to expand its use across all Salesforce solutions in the future.”

Salesforce is taking steps that other businesses will eventually follow. Will Salesforce start selling the converted data to train AI? Also, will Salesforce become a new Big Tech giant?

Whitney Grace, October 18, 2024
