Unsupervised Text Classification in PowerShell: Part 1

Oct 17, 2021

Text classification, or text analysis, is one of the growing fields in the IT world. Cyber security is no exception, especially threat hunting. Understanding the infrastructure and log sources is a crucial step before we start hunting.

While getting to know the data and log sources, we need to establish a baseline. The hunter can do this baselining by manual analysis, but with the help of text classification algorithms we can automate the process.

Why PowerShell?

One might ask why I am using PowerShell when there are many other tools that are better suited to this task and are backed by a huge collection of libraries designed specifically for it.

Well, the short answer is that PowerShell is everywhere. The long answer is that many organizations have strict policies on installing third-party software, and some do not allow a hunter to take data out of their network for analysis. So one might end up in a situation where Python is not available but PowerShell always is.

Main Idea

The main idea behind this exercise is to feed the data to the algorithm and have it classify the text into 5 groups: Group_0, Group_A, Group_B, Group_C, and Group_U. The table below describes what these groups are.

Mathematical Model

To implement the above idea we need a mathematical model to classify the text. In this article we will use the cosine similarity model from NLP. The author of the NLP text similarity blog post listed under Ref has explained this model very well.
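For reference, cosine similarity between two word vectors A and B of length n is normally written as the dot product divided by the product of the vector lengths. This is the standard textbook form, shown here as a reminder rather than as the exact formula from the original post:

```latex
\[
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

For vectors that only contain 0s and 1s, the result lies between 0 (no shared words) and 1 (same words).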

To save time, I am skipping some parts and moving on to the workflow for our algorithm.

So let’s write the code for the above workflow.

Consuming the data with a new line delimiter and storing it in an array of lines.

It’s quite easy in PowerShell.

Here I am storing the lines in the $line array. I am also defining the group arrays as per our main idea. In the end, these arrays will be populated with similar text.
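A minimal sketch of this step, assuming the data is already available as one newline-delimited string. The sample text and the variable $raw are illustrative and are not the data used in this post; only $line and the group names come from the text.

```powershell
# Illustrative input: one string with newline-delimited records (assumed sample).
$raw = "This is line number 1`nThis is line number 2`nA different line"

# Split on LF or CRLF so the result is an array with one element per line.
# For a file on disk, Get-Content returns a similar array of lines.
$line = $raw -split "`r?`n"

# Empty arrays for the five groups named in the main idea.
$Group_0 = @()
$Group_A = @()
$Group_B = @()
$Group_C = @()
$Group_U = @()

# Number of lines that were read.
$line.Count
```

Splitting on an optional carriage return keeps the sketch working for both Windows and Unix line endings.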

Creating an array of the unique words between line (n) and the next line

This is a simple function that concatenates line (n) and line (n+1), then returns the unique words from both lines.
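One way such a function might look is sketched below. The name unique_words comes from the text; the parameter names and the two sample lines are assumptions made for illustration.

```powershell
# Sketch of unique_words: join two lines, split on whitespace, keep distinct words.
function unique_words {
    param(
        [string]$LineA,
        [string]$LineB
    )
    # Join with a space so the last word of A and the first word of B stay separate.
    $words = ($LineA + ' ' + $LineB) -split '\s+' | Where-Object { $_ -ne '' }
    # Sort-Object -Unique sorts the words and drops duplicates
    # (the comparison is case-insensitive by default).
    return @($words | Sort-Object -Unique)
}

# Example call with two assumed sample lines.
$unique = unique_words -LineA 'This is line number 1' -LineB 'This is line number 2'
$unique -join ','
```

Because Sort-Object compares case-insensitively by default, "This" and "this" would be treated as the same word here.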

For this exercise we are using the dummy data below, with only 13 lines.

As you can see, 10 lines are similar to each other and make up the majority of the data. There are also 2 lines that are similar to each other but in the minority, and one unique line at the end.

We can access the lines from $line as below.

Now, since we can access the lines, we can also access the words of each line like this.
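A short sketch of both kinds of access, using two assumed sample lines in place of the dummy data:

```powershell
# Assumed sample lines standing in for the dummy data.
$line = @('This is line number 1', 'This is line number 2')

# Arrays are zero-indexed, so the first line is at index 0.
$firstLine = $line[0]

# Splitting a line on spaces gives an array of its words.
$words = $line[0] -split ' '

$firstLine   # the whole first line
$words[3]    # the fourth word of the first line
```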

Now when we call the function unique_words for line 1 and line 2, we get output like this:

1,2,is,line,number,This

Above are the unique words from both lines.

Create a vector array of matching words between two lines.

Now, this is the tricky part. We need to create a vector for each line such that, for every word in the “unique_word” array, the value is 1 when the line contains that word and 0 otherwise.

Here I am calculating the vector for each line we need to compare. Notice that I have used the .Contains() method instead of the .Equals() method. One can argue that this choice is erroneous, and that is true.

Consider the case where we are checking for the word “word” in a line that happens to contain the word “sword”. The vector value will then be 1 instead of 0. But that is okay here, since we are not looking for an exact, rigid mechanism to match words.

Also, I have converted all text to lower case; detection was better this way than when the case of the words was taken into account.
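The idea can be sketched as follows. The name vec_array appears in this post; the parameter names and the sample inputs are assumptions, and the body is only one possible way to write it.

```powershell
# Sketch of vec_array: one 0/1 entry per unique word, 1 if the line contains it.
function vec_array {
    param(
        [string[]]$UniqueWords,
        [string]$Line
    )
    $lower = $Line.ToLower()
    $vec = foreach ($word in $UniqueWords) {
        # Substring check on the lower-cased line, so "word" also matches "sword".
        if ($lower.Contains($word.ToLower())) { 1 } else { 0 }
    }
    return @($vec)
}

# Assumed inputs: the unique words of two sample lines, then each line in turn.
$unique = @('1', '2', 'is', 'line', 'number', 'This')
$line1 = @(vec_array -UniqueWords $unique -Line 'This is line number 1')
$line2 = @(vec_array -UniqueWords $unique -Line 'This is line number 2')
$line1 -join ','
$line2 -join ','
```

Wrapping the call in @( ) keeps the result an array even when there is only one unique word.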

Let’s see what the output of this vector is for our lines 1 and 2.

This is the output of our vec_array

Taking dot product of two vectors.

Now we have 2 arrays:

$line1 = @(1,0,1,1,1,1)

$line2 = @(0,1,1,1,1,1)

We have to take the dot product of these 2 arrays.

I borrowed the function for this from Rosetta Code (see Ref).
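For readers who want to see the operation spelled out, here is a plain loop version. It is a sketch written for this walkthrough, not the Rosetta Code implementation; the function and parameter names are assumptions.

```powershell
# Sketch of a dot product: multiply matching positions and add up the results.
function dotproduct {
    param(
        [int[]]$A,
        [int[]]$B
    )
    if ($A.Count -ne $B.Count) { throw 'Vectors must have the same length.' }
    $sum = 0
    for ($i = 0; $i -lt $A.Count; $i++) {
        $sum += $A[$i] * $B[$i]
    }
    return $sum
}

$line1 = @(1,0,1,1,1,1)
$line2 = @(0,1,1,1,1,1)
$dot = dotproduct -A $line1 -B $line2
$dot   # 0 + 0 + 1 + 1 + 1 + 1 = 4
```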

Calculate the % similarity between 2 lines

Now we need to calculate how similar line 1 is to line 2. For that, we have to implement the mathematical formula below,

which I have implemented in the function below.

It takes the output of dotproduct and the vector arrays of both lines.
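A possible shape for such a function, assuming the formula is the usual cosine similarity (dot product divided by the product of the vector lengths) reported as a percentage. The function name similarity and its parameter names are hypothetical.

```powershell
# Sketch: cosine similarity as a percentage, from a dot product and both vectors.
function similarity {
    param(
        [double]$Dot,
        [int[]]$VecA,
        [int[]]$VecB
    )
    # Sum of squares of each vector (its squared length).
    $sumA = 0
    foreach ($v in $VecA) { $sumA += $v * $v }
    $sumB = 0
    foreach ($v in $VecB) { $sumB += $v * $v }

    # An all-zero vector has no direction; report 0 instead of dividing by zero.
    if ($sumA -eq 0 -or $sumB -eq 0) { return 0 }

    # cos(theta) = dot / (|A| * |B|), scaled to a percentage.
    return 100 * $Dot / [math]::Sqrt($sumA * $sumB)
}

$line1 = @(1,0,1,1,1,1)
$line2 = @(0,1,1,1,1,1)
$percent = similarity -Dot 4 -VecA $line1 -VecB $line2
$percent   # 100 * 4 / sqrt(5 * 5) = 80
```

Taking one square root of the product, rather than multiplying two square roots, avoids a small floating-point rounding error in this example.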

Store value in temp array

This is practically the last step of the program. Here I am storing the similarity of line 1 with all lines. At the end, I calculate the average similarity and then store the line in a particular array using the logic below.
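The averaging part of this step can be sketched as follows. The sample lines, the helper line_similarity, and the choice to include each line's comparison with itself are all assumptions. The rules that map an average to a group are not reproduced here, so the sketch stops at the average.

```powershell
# Assumed sample data: three similar lines and one unrelated line.
$line = @(
    'This is line number 1',
    'This is line number 2',
    'This is line number 3',
    'Completely different text here'
)

# Hypothetical compact helper: cosine similarity of two lines, in percent.
function line_similarity {
    param([string]$LineA, [string]$LineB)
    $la = $LineA.ToLower()
    $lb = $LineB.ToLower()
    $words = @(($la + ' ' + $lb) -split '\s+' |
        Where-Object { $_ -ne '' } | Sort-Object -Unique)
    $dot = 0; $sumA = 0; $sumB = 0
    foreach ($w in $words) {
        # 0/1 vector entries, using the same substring check as described above.
        $va = if ($la.Contains($w)) { 1 } else { 0 }
        $vb = if ($lb.Contains($w)) { 1 } else { 0 }
        $dot  += $va * $vb
        $sumA += $va * $va
        $sumB += $vb * $vb
    }
    if ($sumA -eq 0 -or $sumB -eq 0) { return 0 }
    return 100 * $dot / [math]::Sqrt($sumA * $sumB)
}

# For each line: collect its similarity with every line, then average.
$results = foreach ($current in $line) {
    $temp = @()
    foreach ($other in $line) {
        $temp += (line_similarity -LineA $current -LineB $other)
    }
    $average = ($temp | Measure-Object -Average).Average
    [pscustomobject]@{
        Line              = $current
        AverageSimilarity = [math]::Round($average, 2)
    }
}
$results
```

In this sample the unrelated line ends up with the lowest average, which is the kind of signal the grouping logic can then act on.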

The same steps are repeated until the similarity of all lines has been calculated. The final output looks like this.

As expected, we have 3 groups populated: Group_B, Group_C, and Group_U. Also note that the unique text is only one line, which is in Group_U.

You can find the source code here. In the next blog, we will optimize this algorithm for practical use.

Ref

What Are Word Embeddings for Text? - Machine Learning Mastery (machinelearningmastery.com)

NLP Text Similarity, how it works and the math behind it (towardsdatascience.com)

Dot product, PowerShell - Rosetta Code: https://rosettacode.org/wiki/Dot_product#PowerShell