LLMs & A Thousand Pardons
ChatGPT and similar Large Language Models have garnered enormous publicity, and there is an undeniable wow factor in interacting with them. Beyond their iconic place in modern culture, they have earned a role as task assistants and content creators. But when the training corpus is scraped from the internet, leakage and bias will inevitably make their way into the model.
If you use an LLM as an assistant in a profession where grounded facts, truths, and calculations are mission critical, the tendency of these models to pose as the expert and present hallucinations, as they have been monikered, as if they were grounded fact can be quite detrimental to the goals these tools are meant to serve.
An assistant that apologizes and reformulates a response because I knew the subject, or tested the output and found the reply to be false, would be grounds for firing in the real world. After the hundredth time double- and triple-checking work that was submitted as if it were cited fact or the verified calculation, algorithm, or code snippet being sought, calling the model “performant” is a stretch. Reading comprehension in an LLM is an ephemeral property of the ordering of its training text, and it dilutes ever further with each inaccurate statement the model takes up in training.
Getting Specific
Furthermore, these models are neither lightweight nor portable: they always need a networked connection, and your work with the tool can be data mined. If your field is domain specific, e.g. electrical engineering, Porsche automobile mechanics, or ecosystem biology specializing in the subtropical rain forests of the southeastern US, do you really need a model that knows the names and dates of movie and music releases, or the history of feminist Micronesian basket weaving?
You would want your AI toolset to be domain specific: fluent in the language, idioms, and methodologies of the field or trade. You would want its inference engine tuned to construct ground-truth-based factual responses using the grammars and stylings peculiar to the knowledge domain in which you require a task assistant.
Base Vocabulary & Reference Materials
To begin with, we provide a base language model of the words that typically occur in general conversational language. We have 1k, 2k, 3k, 5k, 10k, 30k, and 330k word models. These base models are then expanded with additional words harvested from the domain-specific training corpus.
These base materials may consist of textbooks, essays, white papers, repair manuals, and guide books. While words are harvested, the text is analyzed and statistics are compiled across n-gram windows of various lengths; terms are given conceptual linkages and categorized so that the inference engine can construct prompt responses it calculates to be stochastically correct.
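The harvesting step can be sketched roughly as follows. This is a minimal illustration in Python, not the actual implementation: the whitespace tokenizer, window lengths, and sample text are all assumptions made for the example.

```python
from collections import Counter

def harvest(text, max_n=3):
    """Tokenize a corpus chunk and tally n-gram windows of lengths 1..max_n.

    A real harvester would also normalize punctuation and attach the
    resulting counts to conceptual-linkage categories; this sketch only
    shows the windowed counting itself.
    """
    tokens = text.lower().split()
    stats = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            stats[n][tuple(tokens[i:i + n])] += 1
    return stats

# Hypothetical snippet from a repair-manual corpus:
stats = harvest("the stator winding resistance of the stator winding")
```

The resulting per-length counters are what the weighting stages later draw on: a bigram like ("stator", "winding") appearing repeatedly is a signal that those terms are conceptually linked in this domain.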
Further, we felt that combining or merging two or more Domain Specific Language Models into a unified toolset should be provisioned for, so that they can work synchronously when loaded into the Thinker core. You may want a legal-brief construction model alongside the laws of a specific jurisdiction. The legal-briefs model should be able to draw inferences from the jurisdictional law model, be tuned to provide ground-truth-based fact with a bias toward jurisprudential precedent, and offer rapid lookup of and citations from the laws of that jurisdiction. It should provide citations for where it retrieved its data.
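One way such a merge could keep citation ability intact is to tag every term with the model it came from. The sketch below is an assumption about how this might look, with hypothetical "briefs" and "statutes" models standing in for real DSLMs:

```python
from collections import Counter

def merge_models(models):
    """Merge per-domain term counts into one vocabulary, tagging each
    term with the set of domains that contributed it, so that any
    response built from a term can cite its source model."""
    merged, provenance = Counter(), {}
    for name, counts in models.items():
        merged.update(counts)
        for term in counts:
            provenance.setdefault(term, set()).add(name)
    return merged, provenance

# Hypothetical term counts for two domain models:
briefs = Counter({"motion": 5, "precedent": 3})
statutes = Counter({"precedent": 7, "statute": 4})
vocab, src = merge_models({"briefs": briefs, "statutes": statutes})
```

A shared term such as "precedent" ends up attributed to both models, which is exactly the hook a citation mechanism needs.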
Inference in a DSLM
To create a workable inference model, we need to account for possible misspellings in the text input. Differing word orders, which may mean the same thing or something totally different, are also tested and weighted. The parts of speech (POS) are analyzed for grammatical syntax, and semantic filtering assures that items like numbers are interpreted in their proper domain (address vs. citation vs. calculable value, etc.).
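Misspelling tolerance can be approximated with similarity matching against the in-domain vocabulary. This is a minimal sketch using Python's standard-library difflib; the vocabulary and the 0.7 cutoff are assumptions for illustration, not the engine's actual parameters:

```python
import difflib

# Hypothetical in-domain vocabulary for an electrical-engineering DSLM:
VOCAB = ["resistance", "capacitance", "inductance", "impedance"]

def normalize(token, vocab=VOCAB, cutoff=0.7):
    """Snap a possibly misspelled token to the closest in-domain term,
    or return it unchanged if nothing clears the similarity cutoff."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

corrected = normalize("resistnace")  # transposed letters still clear the cutoff
unknown = normalize("zzzz")          # out-of-domain noise passes through untouched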
These various weightings are combined, and the highest-weighted results are passed through a series of conceptual-linkage filters. Conceptual linkages are weighted and sifted to find the most topically relevant inferences inherent in the DSLM. By stepping the input through a set of filters, testing and then removing the lowest-weighted possibilities at each filter step, the inference model yields a set of referential pointers: a parameterized vector set from which a response specific to the knowledge-domain corpus can be constructed.
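The filter-and-prune loop described above can be sketched generically. The scoring stages and the 50% keep ratio below are placeholders; the real engine's weighting functions and thresholds are not specified here:

```python
def run_pipeline(candidates, filters, keep_ratio=0.5):
    """Score candidates through each filter stage, pruning the
    lowest-weighted half before the next stage, so that only the most
    relevant inferences survive to the end."""
    for score in filters:
        ranked = sorted(candidates, key=score, reverse=True)
        candidates = ranked[:max(1, int(len(ranked) * keep_ratio))]
    return candidates

# Toy stages: prefer longer terms, then terms containing "law".
stages = [len, lambda c: c.count("law")]
survivors = run_pipeline(["law", "bylaw", "statute", "case law"], stages)
```

Each stage only has to rank what the previous stage let through, which is what keeps a multi-stage weighting pass cheap even over large candidate sets.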
In-Application Use
Preliminary testing used a ScriptableObject database containing all the known atoms and isotopes of every element, with a minimum of 40+ queryable data points each. More than 200 questions could be asked about any of this data, and over 1,000 commands could operate the application on a command-and-control layer. The basic inference engine as constructed was able to zero in on the correct question-and-answer pair, with the appropriate data, from prompts deliberately entered with erroneous spelling, wrong grammar, and garbled syntax. It also performed rapidly, completing its hypergraph weighting and filtering in thousandths of a second across more than 4,000 records and 176,000 fields.
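The noisy-prompt matching can be illustrated in miniature. The two-entry Q&A set below is a stand-in for the 200+ question set, and similarity scoring replaces the hypergraph pass; it is a sketch of the behavior, not the engine itself:

```python
import difflib

# Hypothetical slice of the element-data Q&A set:
QA = {
    "what is the atomic mass of helium": "4.0026 u",
    "how many isotopes does carbon have": "15 known isotopes",
}

def answer(prompt):
    """Match a noisy prompt against the known questions and return the
    answer bound to the closest one, tolerating misspellings and
    garbled syntax the same way a fuzzier weighting pass would."""
    best = max(QA, key=lambda q: difflib.SequenceMatcher(
        None, prompt.lower(), q).ratio())
    return QA[best]

# A deliberately mangled prompt still resolves to the right entry:
result = answer("wat is teh atomic mass of helum")
```

Because the prompt is matched holistically rather than token-by-token, spelling and grammar errors degrade the score gracefully instead of breaking the lookup outright.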
Conclusion
You don’t need a server farm to get a performant and knowledgeable Language Model. In fact, with the rise of data poisoning from the glut of AI-generated content online, curation only becomes more mission critical. LLMs will continue to drink their own Kool-Aid until they abandon the wholesale data-mining approach and find a better business model.