In the realm of NLP, or natural language processing, where machines are made to understand human language, one of the first steps is to convert the input text into a numerical representation (something machines can easily work with). This is where the concept of embeddings comes into the picture.
Embeddings are dense vectors, numerical representations that capture the semantic meaning of text, and they are used in applications such as text classification, clustering, and semantic search.
Let us dive deeper into the world of text, embeddings and NLP.
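To make the idea concrete, here is a minimal sketch of turning a short text into an embedding. It uses the same pre-trained model we will rely on later in this post; the input string is just an example.

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('all-MiniLM-L6-v2')
    vector = model.encode("wireless headphones")  # any short text works
    print(len(vector))   # 384 – the embedding is a fixed-length list of numbers
    print(vector[:5])    # first five values (the exact numbers will vary)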
Problem Statement
We have a CSV file containing data about, say, products. In that CSV, the combination of values in three columns uniquely identifies the value of a fourth column. For example, the category, sub-category, and description together uniquely identify the name of the product. Our goal is to use this single file to generate embeddings for each unique combination of category, sub-category, and description, which will allow us to answer questions about similar data from other files.
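For illustration, a few rows of such a file might look like this (the fourth column name and all the values are made up; the first three column names are the ones used in the code below):

    CATEGORY_NAME,SUBCATEGORY,DESCRIPTION,PRODUCT_NAME
    Electronics,Audio,Wireless over-ear headphones,SoundMax Pro
    Electronics,Audio,Compact bluetooth speaker,BeatBox Mini
    Furniture,Seating,Ergonomic mesh office chair,ErgoChair 2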
Solution
For this solution, we need to follow the steps below –
- Go through the CSV file that will be used for creating the embeddings.
- In our case, we are creating embeddings using only three columns, so we loop through the file and create an embedding for each row.
- To create those embeddings, you need a model. This can be a model you train yourself or a pre-trained one. In this case, we are using a pre-trained model from the SentenceTransformer library.
- Once the embeddings are created, we store them in a NumPy array.
- When the embeddings for the data have been created, it is time to ask questions of this data.
- The question to be asked is stored in a variable, and an embedding is created for it using the same model.
- We then use cosine similarity to find the similarity between the question and the data that we have.
- For each row in the CSV, we find the similarity score with the question and then return the index of the row with the highest score.
- Once the index of the row is found, we use pandas to retrieve the relevant content.
Code – Create Embeddings
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

df = pd.read_csv('file_name.csv')
model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = []
for _, row in df.iterrows():
    # Combine the three identifying columns into one string
    combined_text = f"{row['CATEGORY_NAME']} {row['SUBCATEGORY']} {row['DESCRIPTION']}"
    embedding = model.encode(combined_text)
    embeddings.append(embedding)

embeddings = np.array(embeddings)
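As an optional sanity check (not part of the snippet above), you can inspect the shape of the resulting array. The all-MiniLM-L6-v2 model produces 384-dimensional vectors, so each row of the CSV contributes one 384-dimensional embedding:

    print(embeddings.shape)  # e.g. (1000, 384) for a 1,000-row file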
Code Understanding
Let us now understand this section of code line by line –
- We start by creating a pandas DataFrame. The pd.read_csv function reads the CSV file and loads the data into a DataFrame, a tabular structure with labeled axes.
- But why pandas? Because –
- It allows you to load an entire CSV into a DataFrame with just one line of code.
- Pandas DataFrames are easy to work with. You can filter, sort, group, and transform data using built-in methods. In our case, because the data sat in a DataFrame, we could access specific columns without much hassle.
- pandas is optimized for performance and can handle large datasets efficiently.
- Next, we specify a model: model = SentenceTransformer('all-MiniLM-L6-v2'). To use this, we import the SentenceTransformer class from the sentence_transformers library, which provides pre-trained models for generating sentence embeddings. We use the 'all-MiniLM-L6-v2' model because it is lightweight and well suited to generating embeddings for short sentences or small paragraphs. Since we are only combining three columns of the CSV file, this model is a good fit. We assign it to the variable model.
- Next, we initialize an empty list, embeddings, which will be used to store the embedding of each row.
- for _, row in df.iterrows(): – We now loop through the rows of the DataFrame. iterrows is a pandas function that generates an iterator over the DataFrame's rows, yielding a pair for each row: the row's index and the row's data as a pandas Series. In Python, _ is the conventional name for a variable we do not intend to use; here it receives the index, which the loop ignores. The row variable holds the data for the current row, and we can access its values by column name (a small demonstration follows this list).
- combined_text = f"{row['CATEGORY_NAME']} {row['SUBCATEGORY']} {row['DESCRIPTION']}" – Here we concatenate the values of the three columns into a single string. These are the values we need for creating the embeddings.
- embedding = model.encode(combined_text) – In this line, we use the encode method of the SentenceTransformer class, which converts the input text into a dense vector, or embedding.
- embeddings.append(embedding) – embedding holds the embedding for the current row. Each of these row embeddings is appended to the final embeddings list.
- embeddings = np.array(embeddings) – We already have a list of embeddings, so why this line? Because we will be measuring semantic similarity with scikit-learn's cosine_similarity function, which works on 2-D, array-like inputs. Converting the list of vectors into a single NumPy array gives us exactly that: a matrix with one row per embedding.
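To make the loop above concrete, here is a minimal, self-contained sketch with a tiny made-up DataFrame, showing how iterrows yields each row and how the f-string combines the three columns:

    import pandas as pd

    # A tiny made-up DataFrame mirroring the structure of our CSV
    df_demo = pd.DataFrame({
        'CATEGORY_NAME': ['Electronics', 'Furniture'],
        'SUBCATEGORY': ['Audio', 'Seating'],
        'DESCRIPTION': ['Wireless headphones', 'Mesh office chair'],
    })

    for _, row in df_demo.iterrows():
        # row is a pandas Series; values are accessed by column label
        print(f"{row['CATEGORY_NAME']} {row['SUBCATEGORY']} {row['DESCRIPTION']}")

    # Output:
    # Electronics Audio Wireless headphones
    # Furniture Seating Mesh office chair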
Once the embeddings are created, it is time for us to ask questions of this data. Let us understand the code for this.
Code – Ask Question
from sklearn.metrics.pairwise import cosine_similarity

def ask_question(question, model, df, embeddings):
    question_embedding = model.encode(question)
    similarities = cosine_similarity([question_embedding], embeddings)
    most_similar_idx = np.argmax(similarities)
    return df.iloc[most_similar_idx]
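A typical call might look like this (the question text is just an illustration):

    result = ask_question("lightweight bluetooth speaker", model, df, embeddings)
    print(result)  # the full CSV row (a pandas Series) most similar to the question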
Code Understanding
- We have defined a function. A function in Python is defined using the def keyword. This function takes the question as input, along with the model (which we created earlier), the DataFrame in which we stored our training CSV, and the NumPy array holding the embeddings for the combined columns.
- question_embedding = model.encode(question) – In this line, we create an embedding for the question. By the end of this line, we have two numerical representations: one of the question, and one of the data.
- cosine_similarity([question_embedding], embeddings) – What on earth is cosine similarity? Cosine similarity is a measure of similarity between two non-zero vectors: it is the cosine of the angle between them, cos(θ) = (A · B) / (||A|| ||B||). For now, think of it as measuring the distance between two vectors in terms of angle. Picture a graph with an x-axis and a y-axis: there are two lines on this graph, and we are finding the angle between those lines (a small worked sketch follows this list). The result of cosine similarity ranges from -1 to 1 –
- -1 – The vectors are diametrically opposed, meaning they point in opposite directions.
- 0 – The vectors are at 90 degrees to each other; in this case, the vectors have no similarity.
- 1 – The vectors point in the same direction, i.e., they are as similar as possible. By the end of this line of code, we have a similarity score between the question and each embedding created from the CSV file. So, for each row of the CSV there is a similarity score associated with it, and we need to find the index of the row with the highest score.
- To achieve that, we use most_similar_idx = np.argmax(similarities). Here, np refers to the NumPy library, and argmax is a NumPy function that returns the index of the first occurrence of the maximum value in the array. So, when this line executes, it finds the largest similarity score and returns the index of the row where that maximum first occurs.
- Once we have the index of the row, we use df.iloc[most_similar_idx] to return the values in that particular row. df refers to the DataFrame containing the entire dataset, and most_similar_idx stores the index of the row most similar to the question. iloc is a pandas indexer that selects rows and columns by their integer positions.
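Since cosine similarity and argmax do the real work here, a small self-contained sketch with made-up numbers may help tie the pieces together:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Two toy vectors pointing in the same direction (one is a scaled copy)
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 4.0, 6.0])

    # Manual computation: cos(theta) = (a . b) / (||a|| * ||b||)
    manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(manual)  # ≈ 1.0 – the vectors point the same way

    # scikit-learn expects 2-D inputs: one row per vector
    print(cosine_similarity([a], [b]))  # [[1.]]

    # argmax then picks the index of the best score
    scores = np.array([[0.12, 0.87, 0.45]])  # pretend scores for three rows
    print(np.argmax(scores))  # 1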
You can find the entire code used on GitHub.
Conclusion
In this tutorial, we explored the process of generating embeddings from textual data using a pre-trained model from the SentenceTransformer library. By focusing on a CSV file containing product data, we demonstrated how to create dense vector representations for unique combinations of category, sub-category, and description.
As the field continues to evolve, leveraging embeddings will become increasingly crucial for effective data analysis and retrieval tasks.
For those interested in further exploration, consider experimenting with different datasets or models, and continue developing your skills in the fascinating world of NLP.
Happy Learning 🙂
