.. _ai_utilities:
AI Utils
========
Text Similarity
---------------
Text similarity is a measure of how alike two pieces of text are in terms of meaning, structure, or content.
Toolium provides several methods to compare and validate text similarity using different AI techniques:
1. `SpaCy `_: Uses the SpaCy library to compute text similarity with pre-trained NLP models. Fast,
lightweight and good for general-purpose text analysis.
2. `Sentence Transformers `_: Leverages Sentence Transformers for
semantic textual similarity using deep learning embeddings. Best balance of accuracy and performance for semantic
similarity.
3. `OpenAI `_: Utilizes OpenAI's language models for advanced semantic text
comparison. Provides the most sophisticated analysis but requires API access and may incur costs.
Usage
~~~~~
You can use the function `assert_text_similarity` from the `toolium.utils.ai_utils.text_similarity` module to compare
two texts using any of these methods. You can specify the method to use with the `similarity_method` parameter and set a
threshold for similarity with the `threshold` parameter (a value between 0 and 1, where 1 means identical and 0 means
completely different).
.. code-block:: python
from toolium.utils.ai_utils.text_similarity import assert_text_similarity
# Basic usage
input_text = "The quick brown fox jumps over the lazy dog"
expected_text = "A fast brown fox leaps over a sleepy dog" # Accepts either a single expected text or a list of expected texts
threshold = 0.8 # Similarity threshold between 0 and 1
similarity_method = 'spacy' # Options: 'spacy', 'sentence_transformers', 'openai', 'azure_openai'
# Validate similarity
assert_text_similarity(input_text, expected_text, threshold=threshold, similarity_method=similarity_method)
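To make the role of the threshold concrete, the following is a minimal, standard-library sketch of a similarity check.
It is not Toolium's implementation: it uses a bag-of-words cosine similarity as a stand-in for the embeddings that
the real methods compute, and `check_similarity` is a hypothetical helper. Passing when any expected text reaches the
threshold is an assumption made for this illustration.

.. code-block:: python

    import math
    import re
    from collections import Counter


    def bag_of_words(text):
        """Lowercase the text and count its word tokens."""
        return Counter(re.findall(r"[a-z0-9']+", text.lower()))


    def cosine_similarity(text_a, text_b):
        """Cosine similarity of two word-count vectors, between 0 and 1."""
        vec_a, vec_b = bag_of_words(text_a), bag_of_words(text_b)
        dot = sum(vec_a[word] * vec_b[word] for word in vec_a)
        norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
        norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


    def check_similarity(text, expected_texts, threshold):
        """Hypothetical helper: pass if any expected text reaches the threshold."""
        if isinstance(expected_texts, str):
            expected_texts = [expected_texts]
        best = max(cosine_similarity(text, expected) for expected in expected_texts)
        assert best >= threshold, f"Similarity {best:.2f} is below threshold {threshold}"
        return best

A word-count comparison only rewards shared vocabulary, which is why the embedding-based methods listed above are
preferable when two texts express the same meaning with different words.
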
Configuration
~~~~~~~~~~~~~
The default similarity method can be set in the properties.cfg file with the property *text_similarity_method* in the
*[AI]* section::
[AI]
text_similarity_method: openai # Options: 'spacy' (default), 'sentence_transformers', 'openai', 'azure_openai'
spacy_model: en_core_web_lg # SpaCy model to use, en_core_web_sm by default
sentence_transformers_model: all-MiniLM-L6-v2 # Sentence Transformers model to use, all-mpnet-base-v2 by default
openai_model: gpt-3.5-turbo # OpenAI model to use, gpt-4o-mini by default
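As an illustration of how such a section is structured, the sketch below parses an *[AI]* section with Python's
standard `configparser`. Toolium loads properties.cfg through its own configuration handling, so this is only a way
to check how keys, values and inline comments are read; the sample content and the fallback handling are assumptions
made for the example.

.. code-block:: python

    import configparser

    # Sample content in the same "key: value  # comment" style as properties.cfg
    SAMPLE_PROPERTIES = """
    [AI]
    text_similarity_method: openai  # similarity method
    openai_model: gpt-4o-mini  # OpenAI model
    """.replace("\n    ", "\n")

    # Inline comments are only stripped when a prefix is declared explicitly
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(SAMPLE_PROPERTIES)

    # Fall back to the documented default when an option is missing
    method = parser.get("AI", "text_similarity_method", fallback="spacy")
    model = parser.get("AI", "openai_model", fallback="gpt-4o-mini")
    print(method, model)
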
To select models for each method, you can refer to the following links:
* `SpaCy models `_
* `Sentence Transformers models `_
* `OpenAI models `_
Installation
~~~~~~~~~~~~
Make sure to install the required libraries for the chosen method. For SpaCy, Sentence Transformers or OpenAI LLM, you
can install them with the following command:
.. code-block:: bash
pip install toolium[ai]
**Additional Requirements:**
For SpaCy, you also need to download the language model, e.g. for the small English model:
.. code-block:: bash
python -m spacy download en_core_web_sm
For OpenAI LLM, you need to set up your configuration in environment variables, which may depend on the type of access
you have (direct OpenAI access or Azure OpenAI):
.. code-block:: bash
# For example, to configure direct OpenAI access:
OPENAI_API_KEY=
.. code-block:: bash
# For example, to configure Azure OpenAI:
AZURE_OPENAI_API_KEY=
AZURE_OPENAI_ENDPOINT=
OPENAI_API_VERSION=
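Before running tests, it can help to verify that the variables for the chosen access type are present. The following
is a small sketch using only the standard library; `missing_variables` is a hypothetical helper, not part of Toolium,
and the variable names are the ones listed above.

.. code-block:: python

    import os

    OPENAI_VARS = ("OPENAI_API_KEY",)
    AZURE_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "OPENAI_API_VERSION")


    def missing_variables(names, env=None):
        """Return the names that are unset or empty in the given environment."""
        env = os.environ if env is None else env
        return [name for name in names if not env.get(name)]


    # Example: report what is still missing for Azure OpenAI access
    missing = missing_variables(AZURE_VARS)
    if missing:
        print("Missing Azure OpenAI variables: " + ", ".join(missing))
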
Text Readability
----------------
Text readability is a measure of how user-friendly and comprehensible a piece of text is.
Toolium currently provides a single method to assess text readability, using the `SpaCy `_ library.
Usage
~~~~~
You can use the function `assert_text_readability` from the `toolium.utils.ai_utils.text_readability` module to assess
the readability of a text. You can set the `readability_method` (currently only `spacy`), the `threshold`
(a value between 0 and 1, where 1 means the text is very readable and 0 means it is not readable at all) and
optionally the `technical_characters`, a list of characters to be considered as non-linguistic content,
if you need to override the ones set by default.
.. code-block:: python
from toolium.utils.ai_utils.text_readability import assert_text_readability
# Basic usage
input_text = "This is a readable text with proper structure and vocabulary."
threshold = 0.8 # Readability threshold between 0 and 1
technical_characters = ['$', '%', '&'] # Optional: list of characters considered non-linguistic content
readability_method = 'spacy' # Only 'spacy' is currently supported
# Validate readability
assert_text_readability(input_text, threshold=threshold, technical_characters=technical_characters, readability_method=readability_method)
Configuration
~~~~~~~~~~~~~
The default readability method and SpaCy model can be set in the *[AI]* section of the properties.cfg file::
[AI]
text_readability_method: spacy # Only 'spacy' is currently supported
spacy_model: en_core_web_md # SpaCy model to use, en_core_web_md by default
For more information on SpaCy models, you can refer to the following link:
* `SpaCy models `_
Installation
~~~~~~~~~~~~
The requirements are the same as those described for `SpaCy` in the
`installation section of Text Similarity `_
Answer Evaluation using LLM-as-a-Judge
--------------------------------------
Answer evaluation using LLM-as-a-Judge is a technique to assess the quality and correctness of an LLM-generated answer
by comparing it against a reference answer using another LLM. This approach provides context-aware evaluation considering
semantic similarity, factual accuracy, completeness, and relevance.
Toolium provides methods to evaluate answers using OpenAI and Azure OpenAI models with optional structured output
using Pydantic models.
Usage
~~~~~
You can use the functions from the `toolium.utils.ai_utils.evaluate_answer` module to evaluate LLM answers:
**Basic evaluation without structured response:**
.. code-block:: python
from toolium.utils.ai_utils.evaluate_answer import get_answer_evaluation_with_azure_openai
llm_answer = "Paris is the capital of France and has a population of over 2 million people."
reference_answer = "The capital of France is Paris."
question = "What is the capital of France?"
similarity, response = get_answer_evaluation_with_azure_openai(
llm_answer=llm_answer,
reference_answer=reference_answer,
question=question,
model_name='gpt-4o'
)
print(f"Similarity score: {similarity}")
print(f"Explanation: {response['explanation']}")
**Evaluation with structured Pydantic response:**

.. code-block:: python

    from pydantic import BaseModel, Field
    from toolium.utils.ai_utils.evaluate_answer import get_answer_evaluation_with_azure_openai


    class SimilarityEvaluation(BaseModel):
        """Model for text similarity evaluation response"""
        similarity: float = Field(description='Similarity score between 0.0 and 1.0', ge=0.0, le=1.0)
        explanation: str = Field(description='Brief justification for the similarity score')


    llm_answer = "Paris is the capital of France and has a population of over 2 million people."
    reference_answer = "The capital of France is Paris."
    question = "What is the capital of France?"
    similarity, response = get_answer_evaluation_with_azure_openai(
        llm_answer=llm_answer,
        reference_answer=reference_answer,
        question=question,
        model_name='gpt-4o',
        response_format=SimilarityEvaluation
    )
    print(f"Similarity score: {similarity}")
    print(f"Explanation: {response.explanation}")

**Advanced evaluation with custom evaluation criteria:**

.. code-block:: python

    from pydantic import BaseModel, Field
    from toolium.utils.ai_utils.evaluate_answer import get_answer_evaluation_with_azure_openai


    class AnswerEvaluation(BaseModel):
        """Comprehensive evaluation model"""
        similarity: float = Field(description='Similarity score between 0.0 and 1.0', ge=0.0, le=1.0)
        explanation: str = Field(description='Detailed evaluation feedback')
        accuracy: float = Field(description='Factual correctness score 1-5')
        completeness: float = Field(description='Information completeness score 1-5')
        relevance: float = Field(description='Relevance to question score 1-5')


    # Same inputs as in the previous examples
    llm_answer = "Paris is the capital of France and has a population of over 2 million people."
    reference_answer = "The capital of France is Paris."
    question = "What is the capital of France?"
    similarity, response = get_answer_evaluation_with_azure_openai(
        llm_answer=llm_answer,
        reference_answer=reference_answer,
        question=question,
        model_name='gpt-4o',
        response_format=AnswerEvaluation
    )
    print(f"Similarity: {similarity}")
    print(f"Accuracy: {response.accuracy}/5")
    print(f"Completeness: {response.completeness}/5")
    print(f"Relevance: {response.relevance}/5")

**Assertion with threshold validation:**
.. code-block:: python
from toolium.utils.ai_utils.evaluate_answer import assert_answer_evaluation
# Validate that LLM answer meets minimum similarity threshold
assert_answer_evaluation(
llm_answer="Paris is both the capital and the most populous city in France.",
reference_answers="The capital and largest city of France is Paris.",
question="What is the capital of France and its largest city?",
threshold=0.7, # Minimum similarity score (0.0 to 1.0)
provider='azure',
model_name='gpt-4o'
)
Evaluation Methods
~~~~~~~~~~~~~~~~~~
The module provides the following evaluation methods:
* **assert_answer_evaluation()**: Evaluates the answer and asserts that its similarity meets the threshold
* **get_answer_evaluation_with_openai()**: Uses OpenAI's API directly for evaluation
* **get_answer_evaluation_with_azure_openai()**: Uses Azure OpenAI's API for evaluation
Evaluation Criteria
~~~~~~~~~~~~~~~~~~~
When evaluating answers, the LLM considers the following criteria:
- **Semantic similarity**: Does the LLM answer convey the same meaning as the reference answer, even if phrased differently?
- **Factual accuracy**: How factually correct is it compared to the reference answer?
- **Completeness**: How thoroughly does it address all aspects of the question, covering all information from the reference answer?
- **Relevance**: How well does it directly answer the specific question asked?
Scoring Guide
~~~~~~~~~~~~~
- **1.0**: Perfect semantic match - answer is equivalent to reference answer
- **0.7-0.9**: Similar meaning - minor differences that don't affect overall correctness
- **0.4-0.6**: Incomplete or partially similar - major differences or missing information
- **0.0-0.3**: Different, irrelevant or contradictory - does not match the reference answer
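The scoring guide above can be expressed as a small lookup, which is handy when logging evaluation results. This is
an illustrative sketch, not part of Toolium: `describe_score` is a hypothetical helper, and assigning scores that fall
between two listed bands (for example 0.95 or 0.65) to the band below is an assumption, since the guide does not
cover those values.

.. code-block:: python

    def describe_score(score):
        """Map a similarity score between 0.0 and 1.0 to its scoring-guide band."""
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"Score must be between 0.0 and 1.0, got {score}")
        if score >= 1.0:
            return "perfect semantic match"
        if score >= 0.7:
            return "similar meaning"
        if score >= 0.4:
            return "incomplete or partially similar"
        return "different, irrelevant or contradictory"


    for value in (1.0, 0.8, 0.5, 0.1):
        print(f"{value}: {describe_score(value)}")
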
Configuration
~~~~~~~~~~~~~
The default provider and OpenAI model can be set in the *[AI]* section of the properties.cfg file::
[AI]
provider: azure # AI provider to use, openai by default
openai_model: gpt-4o # OpenAI model to use, gpt-4o-mini by default
Installation
~~~~~~~~~~~~
Make sure to install the required libraries:
.. code-block:: bash
pip install toolium[ai]
For Azure OpenAI, you need to set up your configuration in environment variables:
.. code-block:: bash
AZURE_OPENAI_API_KEY=
AZURE_OPENAI_ENDPOINT=
OPENAI_API_VERSION=
For standard OpenAI:
.. code-block:: bash
OPENAI_API_KEY=
.. _accuracy_tags_for_behave_scenarios:
Accuracy tags for Behave scenarios
----------------------------------
@accuracy
~~~~~~~~~
You can use accuracy tags in your Behave scenarios to specify the desired accuracy level and number of executions for
scenarios that involve AI-generated content. The accuracy tag follows the format `@accuracy_<percent>_<executions>`,
where `<percent>` is the desired accuracy percentage (0-100) and `<executions>` is the number of executions used to
measure that accuracy. For example, `@accuracy_80_10` indicates that the scenario is executed 10 times and must
achieve at least 80% accuracy.
.. code-block:: bash
@accuracy_80_10
Scenario: Validate AI-generated response accuracy
Given the AI model generates a response
When the user sends a message
Then the AI response should be accurate
When a scenario is tagged with an accuracy tag, Toolium will automatically execute the scenario multiple times. If the
scenario does not meet the specified accuracy after the given number of executions, it will be marked as failed.
Other examples of accuracy tags:
- `@accuracy_percent_85_executions_10`: 85% accuracy, 10 executions
- `@accuracy_percent_75`: 75% accuracy, default 10 executions
- `@accuracy_90_5`: 90% accuracy, 5 executions
- `@accuracy_80`: 80% accuracy, default 10 executions
- `@accuracy`: default 90% accuracy, 10 executions
A CSV report with the results of each execution will be generated in the `accuracy` folder inside the output folder.
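The tag formats listed above can be summarised with a single regular expression. The sketch below is not Toolium's
parser; `parse_accuracy_tag` and `meets_accuracy` are hypothetical helpers that only mirror the documented formats
and defaults (90% accuracy, 10 executions), and the pass condition is an assumed reading of "at least the given
percentage of executions passed".

.. code-block:: python

    import re

    DEFAULT_PERCENT = 90
    DEFAULT_EXECUTIONS = 10

    # Matches accuracy, accuracy_80, accuracy_80_10, accuracy_percent_75
    # and accuracy_percent_85_executions_10
    ACCURACY_TAG = re.compile(r"^accuracy(?:_(?:percent_)?(\d+)(?:_(?:executions_)?(\d+))?)?$")


    def parse_accuracy_tag(tag):
        """Return (percent, executions) for an accuracy tag, or None for any other tag."""
        match = ACCURACY_TAG.match(tag.lstrip("@"))
        if match is None:
            return None
        percent, executions = match.groups()
        return (
            int(percent) if percent else DEFAULT_PERCENT,
            int(executions) if executions else DEFAULT_EXECUTIONS,
        )


    def meets_accuracy(results, percent):
        """Check whether the share of passed executions (booleans) reaches the percentage."""
        return bool(results) and 100 * sum(results) / len(results) >= percent

With these helpers, a tag such as `@accuracy_data_greetings` is not treated as an accuracy tag, because the part after
`accuracy_` is not numeric.
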
@accuracy_data
~~~~~~~~~~~~~~
You can also use accuracy data tags in your Behave scenarios to specify different sets of accuracy data for each
execution. The accuracy data tag follows the format `@accuracy_data_<suffix>`, where `<suffix>` is a custom suffix to
identify the accuracy data set. For example, `@accuracy_data_greetings` indicates that the scenario should use the
accuracy data set with the suffix "greetings".
.. code-block:: bash
@accuracy_80
@accuracy_data_greetings
Scenario: Validate AI-generated greeting responses
Given the AI model generates a greeting response
When the user sends "[CONTEXT:accuracy_execution_data.question]" message
Then the AI greeting response should be similar to "[CONTEXT:accuracy_execution_data.answer]"
When a scenario is tagged with an accuracy data tag, Toolium will automatically use the specified accuracy data set for
each execution. This allows you to test different scenarios with varying data inputs. Accuracy data must be stored
beforehand in the context storage under the key `accuracy_data_<suffix>`, where `<suffix>` matches the one used in the
tag. For example, for the tag `@accuracy_data_greetings`, the accuracy data should be stored under the key
`accuracy_data_greetings`. The accuracy data should be a list of dictionaries, where each dictionary contains the data
for a specific execution.
For example, to store accuracy data for greetings, you can do the following in a step definition:
.. code-block:: python
accuracy_data_greetings = [
{"question": "Hello", "answer": "Hi, how can I help you?"},
{"question": "Good morning", "answer": "Good morning! What can I do for you today?"},
{"question": "Hey there", "answer": "Hey! How can I assist you?"}
]
context.storage["accuracy_data_greetings"] = accuracy_data_greetings
This way, during each execution of the scenario, Toolium will use the corresponding data from the accuracy data set
based on the execution index.
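The selection by execution index can be pictured with the following sketch. It is a simplified illustration rather
than Toolium's code: `select_execution_data` is a hypothetical helper, and cycling through the list when there are
more executions than data entries is an assumption that should be checked against the library's behavior.

.. code-block:: python

    def select_execution_data(accuracy_data, execution_index):
        """Pick the data entry for one execution, cycling when the list is shorter (assumption)."""
        if not accuracy_data:
            return None
        return accuracy_data[execution_index % len(accuracy_data)]


    accuracy_data_greetings = [
        {"question": "Hello", "answer": "Hi, how can I help you?"},
        {"question": "Good morning", "answer": "Good morning! What can I do for you today?"},
        {"question": "Hey there", "answer": "Hey! How can I assist you?"},
    ]

    for index in range(5):
        entry = select_execution_data(accuracy_data_greetings, index)
        print(index, entry["question"])
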
after_accuracy_scenario method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can monkey-patch the `after_accuracy_scenario` method in the `toolium.utils.ai_utils.accuracy` module to implement
custom behavior after each accuracy scenario execution, such as calling the Allure `after_scenario` method.

.. code-block:: python

    from toolium.utils.ai_utils import accuracy


    def custom_after_accuracy_scenario(context, scenario):
        context.allure.after_scenario(context, scenario)


    # Monkey-patch the hook
    accuracy.after_accuracy_scenario = custom_after_accuracy_scenario

AI agents for testing
---------------------
Toolium provides utilities to create and execute AI agents in your tests using the LangGraph library, allowing you to
simulate complex user interactions or validate AI-generated responses.
You can create an AI agent using the `create_react_agent` function from the `toolium.utils.ai_utils.ai_agent` module.
This function allows you to create a ReAct agent, which is a type of AI agent that can reason and act based on the
conversation history and tool interactions. You must specify the system message with the AI testing agent instructions
and the tool method that the agent can use to send requests to the system under test and receive responses.
.. image:: react_agent.png
:alt: ReAct Agent Flow Diagram
Once you have created an AI agent, you can execute it using the `execute_agent` function from the same module. This
function will run the agent and log all conversation messages and tool calls, providing insights into the agent's
behavior and the interactions it had during execution.
You can also provide previous messages to the agent to give it context for its reasoning and actions.
.. code-block:: python
from toolium.utils.ai_utils.ai_agent import create_react_agent, execute_agent
# Create a ReAct agent with a system message and a tool method
system_message = "You are an assistant that helps users find TV content based on their preferences."
tool_method = tv_recommendations # This should be a function that the agent can call as a tool
provider = 'azure' # Specify the AI provider to use, e.g., 'azure' or 'openai'
model_name = 'gpt-4o-mini' # Specify the model to use for the agent
agent = create_react_agent(system_message, tool_method=tool_method, provider=provider, model_name=model_name)
# Execute the agent and log all interactions
final_state = execute_agent(agent)
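The example above refers to a `tv_recommendations` tool without defining it. The following is one possible shape for
such a tool, assuming it can be a plain Python function whose type hints and docstring describe it to the model. The
catalog and its titles are placeholder data invented for the sketch; in a real test the function would send the
request to the system under test and return its response.

.. code-block:: python

    # Placeholder catalog, standing in for the system under test
    CATALOG = {
        "comedy": ["Sample Comedy Show A", "Sample Comedy Show B"],
        "documentary": ["Sample Documentary A"],
    }


    def tv_recommendations(genre: str) -> str:
        """Return TV content recommendations for the given genre."""
        titles = CATALOG.get(genre.strip().lower())
        if not titles:
            return f"No recommendations found for genre '{genre}'."
        return ", ".join(titles)


    print(tv_recommendations("Comedy"))

Returning a plain string keeps the tool output easy to read in the logged conversation.
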
The default provider and model can be set in the *[AI]* section of the properties.cfg file::
[AI]
provider: azure # AI provider to use, openai by default
openai_model: gpt-3.5-turbo # OpenAI model to use, gpt-4o-mini by default