RAG with Kernel Memory plugin always uses two LLM calls? #6903

chaelli · 2024-06-21T15:52:06Z

chaelli
Jun 21, 2024

I'd like to use Semantic Kernel for a mostly-RAG application that can also call some custom functions (otherwise I'd probably just use kernel memory). But - when I integrate kernel memory as memory plugin, I think it always needs 2 LLM calls even though it might be a basic RAG call.
First call => Check what needs to be done
Second call => by Kernel Memory including the facts

This seems bad for cases where 99% of the cases are basic RAG functions.

Any idea to work around this?

Answered by dluc

Jun 21, 2024

@chaelli the first call is about intent detection, do you know which code is making that first request? About KM, the ASK API uses these 2 requests:

generate embedding for the question, and use the embedding to find relevant sources
generate an answer using the relevant sources found

View full answer

dluc · 2024-06-21T16:46:03Z

dluc
Jun 21, 2024
Maintainer

@chaelli the first call is about intent detection, do you know which code is making that first request? About KM, the ASK API uses these 2 requests:

generate embedding for the question, and use the embedding to find relevant sources
generate an answer using the relevant sources found

0 replies

chaelli · 2024-06-21T18:44:13Z

chaelli
Jun 21, 2024
Author

@dluc thanks for the quick reply.
That's what I expected it to be:

intent detection
rag / answer

but as I forgot about the embedding part, maybe I miscounted. but - intent detection will always need a separate call right?

0 replies

chaelli · 2024-06-22T20:48:09Z

chaelli
Jun 22, 2024
Author

It's as expected:

so I wonder if there is any way to get around this?
as long as I did everything in the Semantic Kernel, I just sent the facts right along with the quesion and the possible functions - that worked very well (but then I could not use Kernel Memory and needed to do the facts collection myself)

2 replies

dluc Jun 23, 2024
Maintainer

I’m not sure where the second request is coming from. Are you using just KM or perhaps an app/demo we published? KM doesn't do intent detection, so I suspect you're using some higher level integration.

Generally, you’ll need two LLM text generations to answer a question because you first need to determine if the user is asking a question.

Scenario 1: The user says “hello”. You need a call to the LLM to determine that this is not a question. → 1 request
Scenario 2: The user says “how are you?”. You need a call to the LLM to determine that this is not a question requiring grounding. → 1 request
Scenario 3: The user says “how do I do X when Y?”.
- First, call the LLM to determine that this is a question needing further work and generate a better query string if necessary.
- If using Azure AI Search semantic ranking, pass the search string directly to the service to get a list of facts.
- If not, make a second call to the LLM to generate an embedding for the question or query string, then fetch data from a DB using vector search.
- Finally, make a third call to the LLM to generate an answer based on the DB records retrieved.
- → 3 requests (2 text generations, 1 embedding generation)

chaelli Jun 23, 2024
Author

It was a bad example because it called /search on the Kernel Memory Service.

this makes it clearer - we have 1 LLM call to check intent, then a /ask call to Kernel Memory (which itself uses another LLM call which we don't see here because it's a hosted service) and then the Semantc Kernel makes a third LLM call to put all together (this one is the second in the screenshot).

If I add related content right along the first time, I get:

which uses only 1 LLM call (not counting embedding) vs 3.
it makes more if I add questions that need function calling.. but basic RAG is very quick like this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RAG with Kernel Memory plugin always uses two LLM calls? #6903

{{title}}

Replies: 3 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

RAG with Kernel Memory plugin always uses two LLM calls? #6903

chaelli Jun 21, 2024

Replies: 3 comments · 2 replies

dluc Jun 21, 2024 Maintainer

chaelli Jun 21, 2024 Author

chaelli Jun 22, 2024 Author

dluc Jun 23, 2024 Maintainer

chaelli Jun 23, 2024 Author

chaelli
Jun 21, 2024

Replies: 3 comments 2 replies

dluc
Jun 21, 2024
Maintainer

chaelli
Jun 21, 2024
Author

chaelli
Jun 22, 2024
Author

dluc Jun 23, 2024
Maintainer

chaelli Jun 23, 2024
Author