How to Build a Salesforce LWC Assistant to Run AI LLMs Locally: Expert Insights
This article was written entirely by Anton Kutishevsky, a Certified Salesforce Developer at Advanced Communities.
TL;DR
I built a Salesforce Lightning Web Component that lets you run powerful AI language models (LLMs) directly on your computer within Salesforce. It uses Pico LLM technology to process data locally, keeping your information secure and responding quickly. You can use it to generate emails, write content, analyze customer data, and more, all without relying on external services. Check out the demo video and GitHub repo to learn more!
I’ve been experimenting with local LLMs inside Salesforce and would like to tell you about the component I developed as a result. It has a familiar chat interface that uses Salesforce records for context. It works locally on your computer, so the data it processes is never sent to any third-party service.
The introduction of Agentforce is what prompted me to develop the component. Agentforce uses agents: systems that can make decisions and perform various actions. Assistants, in contrast, only process information reactively. Even though I believe it’s possible to build a local agent using picoLLM, it would take enormous effort, so I decided to develop an assistant instead.
Running LLMs Locally in Salesforce Experience Cloud Using picoLLM Inference Engine SDK
![Post image](https://advancedcommunities.com/wp-content/uploads/2024/08/antons-article.png)
Features
As you would expect from an LLM, it generates responses on any topic, since it’s pretrained on a vast dataset. Moreover, it can use Salesforce records for extra context. The features of the component are:
- Supports multiple models. Any open-source model from the Pico website, such as Gemma, Llama, or Phi, can be used. The only limitation is the amount of RAM your computer has: the larger the model, the more RAM it consumes.
- Works with a single record. When the component is placed on a record page, it can access that record for context. For example, on an Account record detail page, it can generate a response based on the account’s field values.
- Supports related records. When a record has related records, the component can query and incorporate them into responses.
- Configurable. The component can be configured on the fly, using the configuration popup. It allows changing the generation options, such as completion token limit, temperature, and top P.
How it works
From an end user’s perspective, the process is straightforward. You upload a model, select a system prompt, select records, write a user prompt, and watch the result being generated.
What is Pico LLM?
Running LLMs in a browser is a resource-intensive task because of model size, bandwidth requirements, and RAM needs. To address this, the Pico team developed the picoLLM Compression technique, which makes running LLMs locally much more efficient. They also provide the picoLLM Inference Engine as a JavaScript SDK, allowing front-end developers to run LLMs locally across browsers. It supports all modern browsers, including Chrome, Safari, Edge, Firefox, and Opera. To learn more about how the picoLLM Inference Engine works, you can read their article.
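For a feel of what working with that SDK looks like, here is a minimal sketch based on the public @picovoice/picollm-web package. Treat the exact option and field names as assumptions and check Pico’s documentation before relying on them:

```js
// A minimal sketch of the picoLLM Web SDK flow (names per the public
// @picovoice/picollm-web package; details are assumptions, not the repo's code).
import { PicoLLMWorker } from '@picovoice/picollm-web';

// `modelFile` is a user-selected .pllm model file, e.g. from an <input type="file">.
async function runModel(accessKey, modelFile, prompt) {
    // create() spins up web workers and loads the model into memory.
    const picoLLM = await PicoLLMWorker.create(accessKey, { modelFile });

    // Generation options mirror the component's configuration popup.
    const res = await picoLLM.generate(prompt, {
        completionTokenLimit: 256,
        temperature: 0.7,
        topP: 0.9
    });

    await picoLLM.release();
    return res.completion;
}
```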
The LWC part
The component serves as a bridge between the user and the picoLLM interface. At the core of the component is a Visualforce page embedded as an iframe. The page loads the picoLLM SDK and communicates with the LWC via post messages, letting the LWC use the SDK indirectly (see the sketch after this list). Together, these pieces handle the following:
- Loading a model. The LWC has a button that allows you to load a model of your choice. It triggers a file input element hidden inside the iframe. Once the model is loaded, the Pico SDK creates web workers, and the component is ready to process the user input.
- Setting a system prompt. You don’t have to write a system prompt every time; you can select any saved record of the System_Prompt__c object. Pressing the button opens a popup with the existing system prompts to choose from.
- Accepting user input. There is a resizable text area for collecting user input. When collected, it’s sent to the iframe as a payload and added to the conversation history.
- Accessing Salesforce records. There are two buttons: Select Fields and Select Related Records. The first one collects the field values of the record on a record page on which the LWC resides. The second allows you to choose a related object and query its records along with the selected field values. This information is sent to the iframe as a payload as well.
- Changing generation options. If desired, the completion token limit, temperature, and top P options can be changed via a dedicated button in the component. This information is also sent as a payload to the iframe.
- Generating a result. When the iframe receives the payload, it uses the Pico SDK to run the loaded model and generate a result, taking any provided generation options into account. The dialog history is also updated on every turn, so the LLM remembers the conversation.
- Rendering chat messages. The LWC renders outgoing messages, which are the ones the user provides. Incoming messages, containing generated responses as well as informational and error messages, are rendered dynamically whenever the component has something to say to the user.
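To make the bridge concrete, here is a minimal sketch of the LWC side. The component name, origin, and message shape are my illustration, not the exact code from the repo:

```js
// LWC side of the bridge (illustrative names, not the repo's identifiers).
import { LightningElement } from 'lwc';

// Assumption: the Visualforce page is served from your org's VF domain.
const VF_ORIGIN = 'https://yourdomain--c.vf.force.com';

export default class PicoLlmAssistant extends LightningElement {
    messages = [];

    connectedCallback() {
        // Receive results coming back from the Visualforce iframe.
        this.onMessage = (event) => {
            if (event.origin !== VF_ORIGIN) return; // trust only the VF page
            if (event.data?.type === 'completion') {
                this.messages = [
                    ...this.messages,
                    { role: 'assistant', text: event.data.text }
                ];
            }
        };
        window.addEventListener('message', this.onMessage);
    }

    disconnectedCallback() {
        window.removeEventListener('message', this.onMessage);
    }

    // Send the user prompt, selected records, and generation options
    // to the iframe, where the picoLLM SDK does the actual work.
    sendPrompt(prompt, records, options) {
        const frame = this.template.querySelector('iframe');
        frame.contentWindow.postMessage(
            { type: 'generate', payload: { prompt, records, options } },
            VF_ORIGIN
        );
    }
}
```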
A little bit of Apex code
On the back-end side of things, there is nothing fancy. The Apex code does the heavy lifting of detecting the relationships between objects using a record ID from the record page, performs a couple of SOQL queries, and its duty is done.
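The LWC would reach that Apex through a standard imperative call; a minimal sketch, with a hypothetical controller and method name:

```js
// Hypothetical Apex controller and method; the repo's actual API may differ.
import getRelatedRecords from '@salesforce/apex/AssistantController.getRelatedRecords';

// Apex resolves the child relationships for the current record's object,
// runs the SOQL, and returns the related records used as prompt context.
async function loadRelatedContext(recordId, childObjectApiName, fields) {
    return getRelatedRecords({ recordId, childObjectApiName, fields });
}
```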
Development Challenges
Web workers
Previously, I used the unpkg tool to execute code from a node module inside an LWC. That approach required extra configuration steps and was less secure. This time, I wanted to execute the picoLLM module directly from Salesforce, and not only from an Experience Cloud site, as I had done before, but also from the Lightning Experience interface.
Under the hood, picoLLM uses web workers for parallel processing, and that was the main problem: running them from an LWC is not allowed. Luckily, nothing forbids running web workers from a Visualforce page, so that’s the approach I used.
I downloaded the raw picoLLM code and added it to the Visualforce page as a static resource. In the LWC, I embedded the Visualforce page in an iframe. Communication between the LWC and the page inside the iframe lets the component use web workers: the page runs the picoLLM-related code on behalf of the Lightning web component.
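Here is a sketch of what the script on the Visualforce page might look like, assuming the static resource exposes the SDK’s PicoLLMWorker globally; ACCESS_KEY, LWC_ORIGIN, and the message names are placeholders:

```js
// Runs on the Visualforce page, where web workers are allowed.
let picoLLM = null;

window.addEventListener('message', async (event) => {
    if (event.origin !== LWC_ORIGIN) return; // accept messages only from the LWC

    const { type, payload } = event.data || {};

    if (type === 'loadModel') {
        // The hidden file input on this page holds the user-selected .pllm file.
        const file = document.getElementById('modelFile').files[0];
        picoLLM = await PicoLLMWorker.create(ACCESS_KEY, { modelFile: file });
        event.source.postMessage({ type: 'modelLoaded' }, event.origin);
    } else if (type === 'generate' && picoLLM) {
        // payload.options carries the completion token limit, temperature, and top P.
        const res = await picoLLM.generate(payload.prompt, payload.options);
        event.source.postMessage(
            { type: 'completion', text: res.completion },
            event.origin
        );
    }
});
```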
Using Salesforce records for context
Copy and paste Salesforce records in JSON or CSV format, throw them into any online LLM, and watch: it will consume the records, use them for extra context, and generate a response. It turns out this is not so easy with compressed models running locally.
At first, I simply put the records, in JSON format, right into the user prompt, expecting the model to be smart enough to distinguish the prompt itself from the additional context I provided. I tried different models of various sizes and couldn’t understand why none of them used the JSON to generate responses. Mostly, they either refused to respond to my prompt or generated fictional data unrelated to what I asked. I experimented with different formats for the context data: CSV, JSON, prompt dividers to strictly separate prompt from context. Nothing helped.
I nearly abandoned the idea because the key feature wasn’t functioning. After a couple of months, I suddenly got a stupidly simple brainwave. What if I just reversed the order of the prompt parts? From user prompt coming first and context coming second, to context coming first and prompt second. To my surprise, it worked, and every model I tried immediately started to understand Salesforce records as context.
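In code, the fix was nothing more than reordering the pieces of the prompt; a sketch (the divider wording is my own):

```js
// Before: prompt first, context second. Models ignored the records.
//   const fullPrompt = `${userPrompt}\n\n${recordsJson}`;

// After: context first, prompt last. Models picked the records up.
function buildPrompt(recordsJson, userPrompt) {
    return [
        'Context (Salesforce records as JSON):',
        recordsJson,
        '',
        'Task:',
        userPrompt
    ].join('\n');
}
```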
Performance
The component’s functionality was tested on these machines:
- PC with an AMD Ryzen 9 9900X processor and 32 GB of RAM (5600 MT/s).
- Microsoft Surface Laptop 7 powered by the Snapdragon X Elite ARM processor with 16 GB of RAM (8448 MT/s).
Model loading speed: it’s all about memory
The most time-consuming part of using the component is the initial model loading. You might expect the 9900X to easily outperform the Snapdragon X Elite, but you’d be wrong: to my surprise, the latter is faster. Since it has faster memory, I presume that the faster your RAM, the faster the model loads. Here’s a model loading speed comparison for reference:
![snapdragon vs 9900x](https://advancedcommunities.com/wp-content/uploads/2025/02/salesforce-llm-assitant-.png)
Response generation speed
It’s the same story with response generation speed: as I understand it, you need a fast combination of CPU and RAM to get the fastest generation possible. Because generation time varies even for the same prompt, I didn’t conduct precise speed tests. Nevertheless, generation is extremely fast, almost as fast as the online alternatives.
What about using a GPU?
Indeed, using a GPU to generate responses would be much more efficient. While it’s possible to use a GPU with picoLLM, I haven’t tested that configuration myself. There are a couple of reasons for this. First, I believe it relies on the WebGPU feature, which isn’t enabled by default in most browsers (except Edge). Second, it likely requires several gigabytes of VRAM to load the model, which my hardware doesn’t have.
Conclusion
Developing this assistant has been a fascinating journey of exploration. From grappling with web worker limitations to discovering the crucial role of prompt order in providing context, the challenges have been both stimulating and rewarding. The result is a Lightning Web Component that offers a unique approach to leveraging the power of Large Language Models within the Salesforce ecosystem.
While the initial model loading time can be a consideration, especially for larger models, the ability to process data locally offers significant advantages in terms of data security, responsiveness, and cost-effectiveness. The potential use cases, from automating content generation to providing intelligent assistance, are vast and waiting to be explored.
Check out the GitHub repo.