Forem: Composio

Optimising Function Calling (GPT4 vs Opus vs Haiku vs Sonnet)

Soham Ganatra — Sun, 12 May 2024 09:06:32 +0000

Code: https://github.com/SamparkAI/Composio-Function-Calling-Benchmark/

In the last blog, we introduced the ClickUp function calling benchmark and experimented with different optimisation approaches for improving function calling using gpt-4-turbo-preview.

This time, we wanted to check a selection of other models, which might or might not claim to be superior in performance 😅. We also wanted to make our benchmark test more generalised to find compatible optimisation approaches to specific models for function calling.

Optimisation Techniques

As function calling is a new concept, and not much literature is available, we checked different experiments by the community. From these and our intuition, we realised techniques like flattening the schema structure, making system prompts more focused on function calls, improving the function names, descriptions, parameter descriptions, adding examples, etc. will enhance the function calling performance. So, we decided on this elaborate experiment. To list the methods we experimented with:

No System Prompt: Only the problem statement
Flattening Schema : All the hierarchical parameters are flattened to a shallow tree structure
Flattened Schema + Simple System Prompt : Added a simple system prompt mentioning that function calling needs to be used
Flattened Schema + Focused System Prompt : Added characterisation on its role in solving function calling problems.
Flattened Schema + Focused System Prompt + Function Name Optimised : The function names were elaborated.
Flattened Schema + Focused System Prompt + Function Description Optimised : Explained the descriptions clearly.
Flattened Schema + Focused System Prompt containing Schema summary : Added summarised version of all function schema to the system prompts
Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimised : Summarised function schema in system prompt, with elaborated function names.
Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimised : Summarised function schema in system prompt, with clearly explained function descriptions.
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimised : Additionally, the description of the parameters was improved
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimised + Function Call examples added : Examples of function calls were added along with function descriptions.
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimised + Function Parameter examples added: Examples of parameter values were added to parameter descriptions.

OpenAI Models

As we checked gpt-4-turbo-preview in the previous experiment, we wanted to test the performance of both its predecessor, gpt-4-0125-preview, and its successor gpt-4-turbo. As we have seen before, even though the next-generation models are pretty advanced in benchmark scores, they are often not better in an all-encompassing way. So, comparing with our previous scores, here is the performance of these two OpenAI models.

Optimization Approach	gpt-4-turbo-preview	gpt-4-turbo	gpt-4-0125-preview
No System Prompt	0.36	0.36	0.353
Flattening Schema	0.527	0.487	0.533
Flattened Schema + Simple System Prompt	0.553	0.533	0.54
Flattened Schema + Focused System Prompt	0.633	0.633	0.64
Flattened Schema + Focused System Prompt + Function Name Optimized	0.553	0.607	0.587
Flattened Schema + Focused System Prompt + Function Description Optimized	0.633	0.66	0.673
Flattened Schema + Focused System Prompt containing Schema summary	0.64	0.553	0.64
Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized	0.70	0.707	0.686
Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized	0.687	0.707	0.68
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized	0.767	0.767	0.787
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added	0.693	0.6	0.707
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added	0.787	0.693	0.787

So we can see that, in most cases, the original gpt-4-0125-preview performed better. When we added more examples of parameters, in the parameter descriptions, gpt-4-0125-preview consistently performed better than the other models. In the cases where we optimised or elaborated only the function names and descriptions, we see the gpt-4-turbo seems to do better.

Anthropic Models

Next, we did the same experimentation with Anthropic's Claude-3 series of models. Claude-3 has three models, haiku, sonnet and opus, in increasing order of parameters and performance(at least that is expected).

When we tried these models, we discovered that Claude models, especially opus, is very costly, and very slow!! Running the whole benchmark with GPT-4 for one run took ~4 minutes, while claude-3-opus-20240229took around ~13 minutes. claude-3-haiku-20240307 and claude-3-sonnet-20240229 took about ~3 minutes and ~6 minutes, respectively.

We faced several problems while running the benchmark for clause models. For example, unlike OpenAI models, Claude models' most function/tool calls are preceded by a block of thoughts text, which required some changes in our benchmark code.

Then, while we ran it, we found that the scores were incredibly low in some cases and kind of absurd.

After some digging, we found that sometimes the models predicted the boolean variables as strings, like True was predicted as "True" and False was predicted as "False". We added a fix for that and then finally obtained our results.

Optimization Approach	claude-3-haiku-20240307	claude-3-sonnet-20240229	claude-3-opus-20240229
No System Prompt	0.48	0.6	0.42
Flattening Schema	0.5	0.58	0.5
Flattened Schema + Simple System Prompt	0.54	0.6	0.54
Flattened Schema + Focused System Prompt	0.54	0.54	0.54
Flattened Schema + Focused System Prompt + Function Name Optimized	0.52	0.62	0.52
Flattened Schema + Focused System Prompt + Function Description Optimized	0.52	0.6	0.52
Flattened Schema + Focused System Prompt containing Schema summary	0.46	0.62	0.46
Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized	0.5	0.64	0.46
Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized	0.5	0.6	0.6
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized	0.58	0.74	0.58
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added	0.6	0.76	0.64
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added	0.68	0.76	0.66

Now I know.., you think they must have messed up the haiku and opus models scores. But believe me, I am equally surprised and can ensure that we ran the opus benchmark multiple times and checked the code quite a lot for probable bugs.

opus, sonnet and haiku initially outperform GPT models in non-optimized scenarios. sonnet consistently outpaces haiku, as expected. Had opus maintained this trend, it likely would have surpassed Openai models.

Finally

OpenAI models, especially gpt-4-turbo-preview, are still the better choice regarding performance and cost.

Optimization Approach	gpt-4-turbo-preview	gpt-4-turbo	gpt-4-0125-preview	claude-3-haiku-20240307	claude-3-sonnet-20240229	claude-3-opus-20240229
No System Prompt	0.36	0.36	0.353	0.48	0.6	0.42
Flattening Schema	0.527	0.487	0.533	0.5	0.58	0.5
Flattened Schema + Simple System Prompt	0.553	0.533	0.54	0.54	0.6	0.54
Flattened Schema + Focused System Prompt	0.633	0.633	0.64	0.54	0.54	0.54
Flattened Schema + Focused System Prompt + Function Name Optimized	0.553	0.607	0.587	0.52	0.62	0.52
Flattened Schema + Focused System Prompt + Function Description Optimized	0.633	0.66	0.673	0.52	0.6	0.52
Flattened Schema + Focused System Prompt containing Schema summary	0.64	0.553	0.64	0.46	0.62	0.46
Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized	0.70	0.707	0.686	0.5	0.64	0.46
Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized	0.687	0.707	0.68	0.5	0.6	0.6
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized	0.767	0.767	0.787	0.58	0.74	0.58
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added	0.693	0.6	0.707	0.6	0.76	0.64
Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added	0.787	0.693	0.787	0.68	0.76	0.66

All the codes are organised at: https://github.com/SamparkAI/Composio-Function-Calling-Benchmark/.

We're currently deciding which models to test next—perhaps Mistral or open-source options like Functionary or NexusRaven. Check out our repository and try running these models to compare their performance. If you have questions or suggestions, please submit a pull request. Thank you!

Improving Function Calling Accuracy

Soham Ganatra — Sat, 16 Mar 2024 09:19:09 +0000

Introduction

Large language models have recently been giving the ability to function-calling. Given the details(function-schema) of a number of functions, the LLM will be able to select and run the function with appropriate parameters, if the prompt demands for it. OpenAI’s GPT-4 is one of the best function-calling LLMs available for use. In addition to the GPT4, there are also open-source function calling LLMs like OpenGorilla, Functionary, NexusRaven and FireFunction that I will try and compare performance with. Example Function Calling Code can be found at OpenAI Function Calling Cookbook.

TLDR: Show me the results

Integration-Focused Agentic Function Calling

We are transitioning towards Agentic applications for more effective use of LLMs in our daily workflow. In this setup, each AI agent is designated a specific role, equipped with distinct functionalities, often collaborating with other agents to perform complex tasks.

To enhance user experience and streamline workflows, these agents must interact with the tools used by users and automate some functionalities. Currently, AI development allows agents to interact with various software tools to a certain extent through proper integration using software APIs or SDKs. While we can integrate these points into AI agents and hope for flawless operation, the question arises:

Are the common design of API endpoints compatible with Agentic Process Automation (APA)? Maybe we can redesign APIs to be more suitable to function calling?

Selecting Endpoints

We referenced the docs of ClickUp (Popular Task management App) and curated a selection of endpoints. We decided this due to the impracticality of expecting the LLM to choose from hundreds of endpoints, considering the limitation of context length.

**get_spaces** (team_id:string, archived:boolean)
create_space(team_id:string, name:string, multiple_assignees:boolean, features:(due_dates:(enabled:boolean, start_date:boolean, remap_due_dates:boolean, remap_closed_due_date:boolean), time_tracking:(enabled:boolean)))
get_space(space_id:string)
update_space(space_id:string, name:string, color:string, private:boolean, admin_can_manage:boolean, multiple_assignees:boolean, features:(due_dates:(enabled:boolean, start_date:boolean, remap_due_dates:boolean, remap_closed_due_date:boolean), time_tracking:(enabled:boolean)))
delete_space(space_id:string)
get_space_tags(space_id:string)
create_space_tag(space_id:string, tag:(name:string, tag_fg:string, tag_bg:string))
delete_space_tag(space_id:string, tag_name:string, tag:(name:string, tag_fg:string, tag_bg:string))

We converted them to the corresponding OpenAI function schema, which is available here. These were specifically selected as they combine endpoints with both flattened and nested parameters.

Creating Benchmark Dataset

To evaluate our approaches effectively, we require a benchmark dataset that is small and focuses specifically on the software-integration aspect of function-calling Language Models (LLMs).

Despite reviewing various existing function calling datasets, none were ideal for this article.

Consequently, we developed our own dataset called the ClickUp-Space dataset , which replicates real-world scenarios to some extent.

The prompts require one of eight selected functions to solve , ranging from simple to complex. Our evaluation will be based on how accurately the functions are called with the correct parameters. We also prepared code for assessing performance.

Next, we developed a problem set consisting of 50 pairs of prompts along with their respective function calling solutions.

[
  {
    "prompt": "As the new fiscal year begins, the management team at a marketing agency decides it's time to archive older projects to make way for new initiatives. They remember that one of their teams is called \"Innovative Solutions\" and operates under the team ID \"team123\". They want to check which spaces under this team are still active before deciding which ones to archive.",
    "solution": "get_spaces(team_id=\"team123\", archived=False)"
  },
  {
    "prompt": "Ella, the project coordinator, is setting up a new project space in ClickUp for the \"Creative Minds\" team with team ID \"cm789\". This space, named \"Innovative Campaigns 2023\", should allow multiple assignees for tasks, but keep due dates and time tracking disabled, as the initial planning phase doesn't require strict deadlines or time monitoring.",
    "solution": "create_space(team_id=\"cm789\", name=\"Innovative Campaigns 2023\", multiple_assignees=True, features=(due_dates=(enabled=False, start_date=False, remap_due_dates=False, remap_closed_due_date=False), time_tracking=(enabled=False)))"
  },
...
]

Measuring Baseline Performance

Initially, we wanted to assess GPT-4's performance independently, without any system prompts.

fcalling_llm = lambda fprompt : client.chat.completions.create(
  model="gpt-4-turbo-preview",
  messages=[
    {
      "role": "system",
      "content": """"""
    },
    {
      "role": "user",
      "content": prompt
    },
  ],
  temperature=0,
  max_tokens=4096,
  top_p=1,
  tools=tools,
  tool_choice="auto"
)

response = fcalling_llm(bench_data[1]["prompt"])

We set the temperature to 0 to make the results more predictable. The experiment was repeated three times, resulting in an average accuracy of 0.3 , which is below our target.

Benchmark without System Prompt - [Code Here]

Flattening the Parameters

As mentioned earlier, some functions require output parameters in a nested structure. An example below-

{
    "name": "create_space",
    "description": "Add a new Space to a Workspace.",
    "parameters": {
      "type": "object",
      "properties": {
        "team_id": {
          "type": "string",
          "description": "The ID of the team"
        },
        "name": {
          "type": "string",
          "description": "The name of the new space"
        },
        "multiple_assignees": {
          "type": "boolean",
          "description": "Enable or disable multiple assignees for tasks within the space"
        },
        "features": {
          "type": "object",
          "description": "Enabled features within the space",
          "properties": {
            "due_dates": {
              "type": "object",
              "description": "Due dates feature settings",
              "properties": {
                "enabled": { "type": "boolean" },
                "start_date": { "type": "boolean" },
                "remap_due_dates": { "type": "boolean" },
                "remap_closed_due_date": { "type": "boolean" }
              }
            },
            "time_tracking": {
              "type": "object",
              "description": "Time tracking feature settings",
              "properties": {
                "enabled": { "type": "boolean" }
              }
            }
          }
        }
      },
      "required": ["team_id", "name", "multiple_assignees", "features"]
    }
  }

Based on our experience with LLMs, we believe that while the model (GPT-4) has been optimised for structured output, a complex output structure may actually reduce performance and accuracy.

Therefore, we programmatically flatten the parameters.

Above function flattened will look as follows:

{
        "description": "Add a new Space to a Workspace.",
        "name": "create_space",
        "parameters": {
            "properties": {
                "features __due_dates__ enabled": {
                    "description": "enabled __Due dates feature settings__ Enabled features within the space__",
                    "type": "boolean"
                },
                "features __due_dates__ remap_closed_due_date": {
                    "description": "remap_closed_due_date __Due dates feature settings__ Enabled features within the space__",
                    "type": "boolean"
                },
                "features __due_dates__ remap_due_dates": {
                    "description": "remap_due_dates __Due dates feature settings__ Enabled features within the space__",
                    "type": "boolean"
                },
                "features __due_dates__ start_date": {
                    "description": "start_date __Due dates feature settings__ Enabled features within the space__",
                    "type": "boolean"
                },
                "features __time_tracking__ enabled": {
                    "description": "enabled __Time tracking feature settings__ Enabled features within the space__",
                    "type": "boolean"
                },
                "multiple_assignees": {
                    "description": "Enable or disable multiple assignees for tasks within the space__",
                    "type": "boolean"
                },
                "name": {
                    "description": "The name of the new space__",
                    "type": "string"
                },
                "team_id": {
                    "description": "The ID of the team__",
                    "type": "string"
                }
            },
            "required": [
                "team_id",
                "name",
                "multiple_assignees",
                "features __due_dates__ enabled",
                "features __due_dates__ start_date",
                "features __due_dates__ remap_due_dates",
                "features __due_dates__ remap_closed_due_date",
                "features __time_tracking__ enabled"
            ],
            "type": "object"
        }
    }

We attached the parameter name to its parent parameters (ex:features __due_dates__ enabled ) by __ , and joined the parameter descriptions to its predecessor ( Ex:enabled__due_dates feature settings __enabled features within the space__ ).

Benchmark after Flattening Schema [Code Here]

Adding System Prompt

We didn't have a system prompt before, so the LLM wasn't instructed on its role or interacting with ClickUp APIs.

Let's add a simple system prompt now.

System

from openai import OpenAI
client = OpenAI()

fcalling_llm = lambda fprompt : client.chat.completions.create(
  model="gpt-4-turbo-preview",
  messages=[
    {
      "role": "system",
      "content": """
You are an agent who is responsible for managing various employee management platform, 
one of which is CliuckUp.

When you are presented with a technical situation, that a person of a team is facing, 
you must give the soulution utilizing your functionalities. 
"""
    },
    {
      "role": "user",
      "content": fprompt
    },
  ],
  temperature=0,
  max_tokens=4096,
  top_p=1,
  tools=tools,
  tool_choice="auto"
)

response = fcalling_llm(bench_data[1]["prompt"])

Code Change

Benchmark with System Prompt - [Code Here]

Improving System Prompt

Now that we've observed an improvement in performance by adding a system prompt, we will enhance its detail to assess if the performance increase is sustained.

You are an agent who is responsible for managing various employee management platform, 
one of which is CliuckUp. 

You are given a number of tools as functions, you must use one of those tools and fillup 
all the parameters of those tools ,whose answers you will get from the given situation.

When you are presented with a technical situation, that a person of a team is facing, 
you must give the soulution utilizing your functionalities. 

First analyze the given situation to fully anderstand what is the intention of the user,
what they need and exactly which tool will fill up that necessity.

Then look into the parameters and extract all the relevant informations to fillup the 
parameter with right values.

New System Prompt

Seems to work great! [Code Here]

Benchmark after Flattened Schema + Improved System Prompt

Adding Schema Summary in Schema Prompt

Let's enhance the system prompts further by focusing on the functions and their purpose, building upon the clear instructions provided for the LLM's role.

Here is a concise summary of the system functions which we add to prompt.

get_spaces - View the Spaces available in a Workspace.
create_space - Add a new Space to a Workspace.
get_space - View the details of a specific Space in a Workspace.
update_space - Rename, set the Space color, and enable ClickApps for a Space.
delete_space - Delete a Space from your Workspace.
get_space_tags - View the task Tags available in a Space.
create_space_tag - Add a new task Tag to a Space.
delete_space_tag - Delete a task Tag from a Space.

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary. [Code Here]

Optimising Function Names

Now, let's improve the schemas starting with more descriptive function names.

schema_func_name_dict = {
    "get_spaces": "get_all_clickup_spaces_available",
    "create_space": "create_a_new_clickup_space",
    "get_space": "get_a_specific_clickup_space_details",
    "update_space": "modify_an_existing_clickup_space",
    "delete_space": "delete_an_existing_clickup_space",
    "get_space_tags": "get_all_tags_of_a_clickup_space",
    "create_space_tag": "assign_a_tag_to_a_clickup_space",
    "delete_space_tag": "remove_a_tag_from_a_clickup_space",
}

Replacing Current Function Names with Above

optimized_schema = []
for sc in flattened_schema:
    temp_dict = sc.copy()
    temp_dict["name"] = schema_func_name_dict[temp_dict["name"]]
    optimized_schema.append(temp_dict)

Replace names in the schema Code

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary + Function Names Optimised [Code Here]

Optimising Function Description

Here, we focus on the function descriptions and make those more clear and focused.

schema_func_decription_dict = {
    "get_spaces": "Retrives information of all the spaces available in user's Clickup Workspace.",
    "create_space": "Creates a new ClickUp space",
    "get_space": "Retrives information of a specific Clickup space",
    "update_space": "Modifies name, settings the Space color, and assignee management Space.",
    "delete_space": "Delete an existing space from user's ClickUp Workspace",
    "get_space_tags": "Retrives all the Tags assigned on all the tasks in a Space.",
    "create_space_tag": "Assigns a customized Tag in a ClickUp Space.",
    "delete_space_tag": "Deletes a specific tag previously assigned in a space.",
}

New Descriptions

And change schema with:

optimized_schema = []
for sc in flattened_schema:
    temp_dict = sc.copy()
    temp_dict["description"] = schema_func_decription_dict[temp_dict["name"]]
    optimized_schema.append(temp_dict)

Changing Schema

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary + Function Names Optimised + Function Descriptions Optimised [Code Here]

Optimising Function Parameter Descriptions

Earlier, we flattened the schema by stacking nested parameters' descriptions with their parents' descriptions until they were in a flattened state.

Let's now replace them with:

schema_func_params_dict = {
    'create_space': {
        'features __due_dates__ enabled': 'If due date feature is enabled within the space. Default: True',
        'features __due_dates__ remap_closed_due_date': 'If remapping closed date feature in due dates is available within the space. Default: False',
        'features __due_dates__ remap_due_dates': 'If remapping due date feature in due dates is available within the space. Default: False',
        'features __due_dates__ start_date': 'If start date feature in due dates is available within the space. Default: False',
        'features __time_tracking__ enabled': 'If time tracking feature is available within the space. Default: True',
        'multiple_assignees': 'Enable or disable multiple assignees for tasks within the space. Default: True',
        'name': 'The name of the new space to create',
        'team_id': 'The ID of the team'
        },
    'create_space_tag': {
        'space_id': 'The ID of the space',
        'tag__name': 'The name of the tag to assign',
        'tag__tag_bg': 'The background color of the tag to assign',
        'tag__tag_fg': 'The foreground(text) color of the tag to assign'
        },
    'delete_space': {
        'space_id': 'The ID of the space to delete'
        },
    'delete_space_tag': {
        'space_id': 'The ID of the space',
        'tag__name': 'The name of the tag to delete',
        'tag__tag_bg': 'The background color of the tag to delete',
        'tag__tag_fg': 'The foreground color of the tag to delete',
        'tag_name': 'The name of the tag to delete'
        },
    'get_space': {
        'space_id': 'The ID of the space to retrieve details'
        },
    'get_space_tags': {
        'space_id': 'The ID of the space to retrieve all the tags from'
        },
    'get_spaces': {
        'archived': 'A flag to decide whether to include archived spaces or not. Default: True',
        'team_id': 'The ID of the team'
        },
    'update_space': {
        'admin_can_manage': 'A flag to determine if the administrator can manage the space or not. Default: True',
        'color': 'The color used for the space',
        'features __due_dates__ enabled': 'If due date feature is enabled within the space. Default: True',
        'features __due_dates__ remap_closed_due_date': 'If remapping closed date feature in due dates is available within the space. Default: False',
        'features __due_dates__ remap_due_dates': 'If remapping due date feature in due dates is available within the space. Default: False',
        'features __due_dates__ start_date': 'If start date feature in due dates is available within the space. Default: False',
        'features __time_tracking__ enabled': 'If time tracking feature is available within the space. Default: True',
        'multiple_assignees': 'Enable or disable multiple assignees for tasks within the space. Default: True',
        'name': 'The new name of the space',
        'private': 'A flag to determine if the space is private or not. Default: False',
        'space_id': 'The ID of the space'
        }
        }

And modifying the previous schema:

optimized_schema = []
for sc in flattened_schema:
    temp_dict = sc.copy()
    temp_dict["description"] = schema_func_decription_dict[temp_dict["name"]]
    for func_param_name, func_param_description in schema_func_params_dict[temp_dict["name"]].items():
        sc["parameters"]["properties"][func_param_name]["description"] = func_param_description
    optimized_schema.append(temp_dict)

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary + (Function Names + Function Descriptions + Parameter Descriptions) Optimised [Code Here]

Wow! For all runs we got score equal to or over 75%.

Adding Examples of Function Calls

LLMs perform better when response examples are provided. Let's aim to give examples and analyse the outcomes.

To start, we can provide examples of each function call along with the corresponding function description in the schema to illustrate this concept.

schema_func_decription_dict = {
    "get_spaces": """\
Retrives information of all the spaces available in user's Clickup Workspace. Example Call:

python
get_spaces({'team_id': 'a1b2c3d4', 'archived': False})

    """,
    "create_space": """\
Creates a new ClickUp space. Example Call:

python
create_space ({
'team_id': 'abc123',
'name': 'NewWorkspace',
'multiple_assignees': True,
'features due_dates enabled': True,
'features due_dates start_date': False,
'features due_dates remap_due_dates': False,
'features due_dates remap_closed_due_date': False,
'features time_tracking enabled': True
})

""",
    "get_space": """\
Retrives information of a specific Clickup space. Example Call:

python
get_space({'space_id': 's12345'})

""",
    "update_space": """\
Modifies name, settings the Space color, and assignee management Space. Example Call:

python
update_space({
'space_id': 's12345',
'name': 'UpdatedWorkspace',
'color': '#f0f0f0',
'private': True,
'admin_can_manage': False,
'multiple_assignees': True,
'features due_dates enabled': True,
'features due_dates start_date': False,
'features due_dates remap_due_dates': False,
'features due_dates remap_closed_due_date': False,
'features time_tracking enabled': True
})


""",
    "delete_space": """\
Delete an existing space from user's ClickUp Workspace. Example Call:

python
delete_space({'space_id': 's12345'})

    """,
    "get_space_tags": """\
Retrives all the Tags assigned on all the tasks in a Space. Example Call:

python
get_space_tags({'space_id': 's12345'})

""",
    "create_space_tag": """\
        Assigns a customized Tag in a ClickUp Space. Example Call:

python
create_space_tag({
'space_id': 's12345',
'tag_name': 'Important',
'tagtag_bg': '#ff0000',
'tag_tag_fg': '#ffffff'
})

        """,
    "delete_space_tag": """\
    Deletes a specific tag previously assigned in a space. Example Call:

python
delete_space_tag({
'space_id': 's12345',
'tag_name': 'Important',
'tag_name': 'Important',
'tagtag_bg': '#ff0000',
'tag_tag_fg': '#ffffff'
})

    """,
}

And when we run the benchmark,

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary + (Function Names + Function Descriptions + Parameter Descriptions) Optimised + Function Call Examples Added [Code Here]

Sadly, the score seems to degrade!

Adding Example Parameter Values

Since the function call example for addition did not work, let's now try adding sample values to the function parameters to provide a clearer idea of the values to input. We will adjust the descriptions of our function parameters accordingly.

schema_func_params_dict = {
    'create_space': {
        'features__due_dates__enabled': 'If due date feature is enabled within the space. \nExample: True, False \nDefault: True',
        'features__due_dates__remap_closed_due_date': 'If remapping closed date feature in due dates is available within the space. \nExample: True, False \nDefault: False',
        'features__due_dates__remap_due_dates': 'If remapping due date feature in due dates is available within the space. \nExample: True, False \nDefault: False',
        'features__due_dates__start_date': 'If start date feature in due dates is available within the space. \nExample: True, False \nDefault: False',
        'features__time_tracking__enabled': 'If time tracking feature is available within the space. \nExample: True, False \nDefault: True',
        'multiple_assignees': 'Enable or disable multiple assignees for tasks within the space \nExample: True, False. Default: True',
        'name': 'The name of the new space to create \nExample: \'NewWorkspace\', \'TempWorkspace\'',
        'team_id': 'The ID of the team \nExample: \'abc123\', \'def456\' '
        },
    'create_space_tag': {
        'space_id': 'The ID of the space \nExample: \'abc123\', \'def456\'',
        'tag__name': 'The name of the tag to assign \nExample: \'NewTag\', \'TempTag\'',
        'tag__tag_bg': 'The background color of the tag to assign \nExample: \'#FF0000\', \'#00FF00\'',
        'tag__tag_fg': 'The foreground(text) color of the tag to assign \nExample: \'#FF0000\', \'#00FF00\''
        },
    'delete_space': {
        'space_id': 'The ID of the space to delete \nExample: \'abc123\', \'def456\''
        },
    'delete_space_tag': {
        'space_id': 'The ID of the space to delete the tag from \nExample: \'abc123\', \'def456\'',
        'tag__name': 'The name of the tag to delete \nExample: \'NewTag\', \'TempTag\'',
        'tag__tag_bg': 'The background color of the tag to delete \nExample: \'#FF0000\', \'#00FF00\', \'#0000FF\'',
        'tag__tag_fg': 'The foreground color of the tag to delete \nExample: \'#FF0000\', \'#00FF00\', \'#0000FF\'',
        'tag_name': 'The name of the tag to delete \nExample: \'NewTag\', \'TempTag\''
        },
    'get_space': {
        'space_id': 'The ID of the space to retrieve details \nExample: \'abc123\', \'def456\''
        },
    'get_space_tags': {
        'space_id': 'The ID of the space to retrieve all the tags from \nExample: \'abc123\', \'def456\''
        },
    'get_spaces': {
        'archived': 'A flag to decide whether to include archived spaces or not \nExample: True, False. Default: True',
        'team_id': 'The ID of the team \nExample: \'abc123\', \'def456\''
        },
    'update_space': {
        'admin_can_manage': 'A flag to determine if the administrator can manage the space or not \nExample: True, False. Default: True',
        'color': 'The color used for the space \nExample: \'#FF0000\', \'#00FF00\'',
        'features__due_dates__enabled': 'If due date feature is enabled within the space. \nExample: True, False \nDefault: True',
        'features__due_dates__remap_closed_due_date': 'If remapping closed date feature in due dates is available within the space. Default: False',
        'features__due_dates__remap_due_dates': 'If remapping due date feature in due dates is available within the space. Default: False',
        'features__due_dates__start_date': 'If start date feature in due dates is available within the space. Default: False',
        'features__time_tracking__enabled': 'If time tracking feature is available within the space. \nExample: True, False \nDefault: True',
        'multiple_assignees': 'Enable or disable multiple assignees for tasks within the space \nExample: True, False. Default: True',
        'name': 'The new name of the space \nExample: \'NewWorkspace\', \'TempWorkspace\'',
        'private': 'A flag to determine if the space is private or not \nExample: True, False. Default: False',
        'space_id': 'The ID of the space to update \nExample: \'abc123\', \'def456\''
        }
}

And using these in the function schema, we get:

Benchmark after Flattened Schema + Improved System Prompt containing Schema Summary + (Function Names + Function Descriptions + Parameter Descriptions) Optimised + Function Call Examples Added + Example Parameter Values Added [Code Here]

Wow! The intuition of adding examples pays off.
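For readers who want to try this, here is a minimal sketch of how a dictionary like `schema_func_params_dict` could be merged into a list of function schemas. It assumes OpenAI-style function definitions whose parameters are already flattened; the helper name `apply_param_descriptions` and the sample schema are illustrative assumptions, not the benchmark's actual code.

```python
import copy


def apply_param_descriptions(functions, param_descriptions):
    """Return a copy of `functions` with parameter descriptions overridden.

    Assumes each entry looks like an OpenAI-style function schema:
    {"name": ..., "parameters": {"properties": {param: {"description": ...}}}}
    """
    updated = copy.deepcopy(functions)  # leave the caller's schemas untouched
    for func in updated:
        overrides = param_descriptions.get(func["name"], {})
        properties = func["parameters"]["properties"]
        for param_name, description in overrides.items():
            if param_name in properties:  # skip keys the schema does not define
                properties[param_name]["description"] = description
    return updated


# Hypothetical flattened schema for one function, for illustration only.
example_functions = [
    {
        "name": "create_space_tag",
        "description": "Assigns a customized Tag in a ClickUp Space.",
        "parameters": {
            "type": "object",
            "properties": {
                "space_id": {"type": "string", "description": ""},
                "tag__name": {"type": "string", "description": ""},
            },
            "required": ["space_id", "tag__name"],
        },
    }
]

example_descriptions = {
    "create_space_tag": {
        "space_id": "The ID of the space \nExample: 'abc123', 'def456'",
        "tag__name": "The name of the tag to assign \nExample: 'NewTag', 'TempTag'",
    }
}

optimised = apply_param_descriptions(example_functions, example_descriptions)
```

Working on a deep copy keeps the original schemas available, which makes it easier to benchmark each optimisation step separately.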

Compiling the Results

To summarise all our experiments and their results:

We experimented with strategies to improve the function calling ability of LLMs, specifically for Agentic Software integrations. Starting from a baseline score of 36%, we boosted performance to an average of 78%. The insights shared in this article aim to enhance your applications as well.

Moreover, we discovered a key distinction between general function calling and function calling for software integrations. In general function calling, even when multiple functions are available, each function operates independently and the calls can happen in any order. In software integrations, however, functions often have to be called in a specific sequence to accomplish an action.

All the code for this article is available here. Thank you!

Further Experiments & Challenges

We have been experimenting with this for a while and are planning to write further on:

Parallel Function calling accuracy
Sequential Function Call Planning Accuracy (RAG + CoT)
Comparison with Open Source Function Calling Models (OpenGorilla, Functionary, NexusRaven, and FireFunction)

When dealing with integration-centric function calls, the process can be complex. For instance, the agent may need to gather data from various endpoints like get_spaces_members, get_current_active_members, and get_member_whose_contract_is_over before responding with the update_member_list function.

This means there may be data that has not come up in the conversation yet, which the agent has to fetch silently from other endpoints in order to formulate a complete response.
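The dependency between calls can be sketched as follows. The four function names come from the example above, but their signatures, return values, and the sample data are assumptions made purely for illustration; real endpoints would call the integration's API.

```python
# Stand-in endpoints with made-up data (hypothetical signatures).
def get_spaces_members(space_id):
    return ["alice", "bob", "carol"]


def get_current_active_members(members):
    return [m for m in members if m != "carol"]


def get_member_whose_contract_is_over(members):
    return [m for m in members if m == "bob"]


def update_member_list(space_id, members):
    return {"space_id": space_id, "members": members}


def refresh_members(space_id):
    """Each step consumes the previous step's output, so order matters."""
    members = get_spaces_members(space_id)
    active = get_current_active_members(members)
    expired = set(get_member_whose_contract_is_over(active))
    remaining = [m for m in active if m not in expired]
    return update_member_list(space_id, remaining)


result = refresh_members("s12345")
print(result)
```

An agent handling the same request has to plan this chain itself: the final `update_member_list` call cannot be formed until the three lookups have returned.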

Optimisations like this are a crucial aspect of our efforts at Composio to make Agentic integrations smoother. If you are interested in improving the accuracy of your agents, connect with us at hello@composio.dev.

Subscribe if you are interested in learning more!

Better interface between Agents <--> Tools

Soham Ganatra — Sat, 02 Mar 2024 15:26:26 +0000

What are we working on?

We’re on the cusp of a future where multiple AI agents work together and interact with diverse tools to complete complex tasks. The rise of platforms for AI workflow and agent orchestration signals this shift. Yet these platforms face challenges: the scope, variety, and reliability of their integrations are limited. Developers often grapple with authentication and API specifications just to implement basic agentic use cases. This hampers seamless communication between agents and tools, a cornerstone for enabling real-world applications.

Our goal is to simplify this. By managing your integrations, we let you focus on creating your agentic platform. We’re crafting the vital integration layer for AI agents, smoothing out the rough edges for innovation.

What can we offer now?

Our SDK offers over 90 connectors optimized for LLM tool actions and triggers. Enjoy a customizable, white-label authentication experience. We also offer best-in-class reliability and detailed observability for each API call, sparing you sleepless nights spent debugging faulty API calls.