Inside Team Theta's Winning Project at Acumatica Hackathon 2020.

Author: Yuriy Zaletskyy


21/02/2020 Categories: ARTICLES

Winning Acumatica Hackathon 2020 was definitely a team effort (Team Theta FTW!), but today I want to describe my contribution to the project: a machine learning model trained to distinguish sentences in data sets as questions or statements.

alt="Team Theta Hackathon Trophy" data-entity-type="file" data-entity-uuid="ad0ed8c0-5296-41fd-8cb9-dd3a4df3d96f" src="/sites/default/files/inline-images/Yuriy%20blog%20photo.png" />

Part of our success came from writing out clear steps for the project and making the decisive choices that were necessary to create this piece of code.

Here is the basic summary of the steps I followed:

  1. Find a proper data set.
  2. Transform the data set so that it can be fed into a machine learning model.
  3. Find a proper machine learning mechanism for training the model.
  4. Train it, potentially repeating steps 1 and 2.
  5. Integrate the mechanism with Acumatica.

Proper Data Set

For these types of tasks I typically use supervised learning, and for that it is good to have a solid data set to train the model on. My first instinct was to find a model that already knew how to distinguish questions from other sentences, but I had no luck: I found recipes for how to achieve it but not the ready-made solution I was looking for. After looking at a few different data sets online, I decided to use one from the nltk.corpus package. I chose this data set because:

  1. It contained an exhaustive list of sentence types.
  2. It contained different types of question sentences.

Transformation of Data Set

After I chose my data set, I needed to decide how to transform it. Initially, I received it as a .zip archive, which is not a format a model can train on directly. Ideally, I was hoping to find a ready-made model that could consume .zip files, but in reality, I needed to do the following:

  1. Find the location where the Python library stored data.
  2. Discover that it was .zip format.
  3. Unzip the file.
  4. Merge all .xml files into one (see the sketch below).
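
For step 4, here is a minimal sketch of the merge, assuming the unzipped corpus sits in a local folder of .xml files whose posts carry a class attribute. The folder path and element names are illustrative, not the exact corpus layout:

using System.IO;
using System.Xml.Linq;

// Collect the post elements from every unzipped .xml file into a single
// questions.xml with a corpus/posts root that LoadFromXml() below can read.
var posts = new XElement("posts");
foreach (var file in Directory.GetFiles(@"C:\nltk_data\corpora\nps_chat", "*.xml"))
{
    var doc = XDocument.Load(file);
    posts.Add(doc.Descendants("Post")); // elements are cloned when re-parented
}
new XDocument(new XElement("corpus", posts)).Save("questions.xml");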

Proper Machine Learning Mechanism

I considered the following candidates for my machine learning mechanism: Python's NLTK library, TensorFlow, CNTK, and ML.NET. I tried the NLTK library first and sent web API requests to it. The NLTK library tested well, but for some strange reason it was very difficult to understand how to feed a sentence into it. For TensorFlow, I looked for a ready-made solution, but this proved to be fruitless; I believe this was due to my lack of knowledge of Python.

Next up was CNTK, but after discovering that Microsoft no longer develops it, I decided to follow their example and jump to the C# library ML.NET. ML.NET is a relatively new library, so it was still lacking neural networks (especially deep neural networks). The big benefit of ML.NET was that Microsoft had polished plenty of algorithms there, and for my purposes its features were enough.

My initial plan was to use the Naïve Bayes classifier, but that approach was soon abandoned due to an imbalanced data set: it consisted of roughly 20% questions and 80% other sentences. With an imbalanced data set we had a few options: synthesize new examples (in order to reach a 50/50 split), repeatedly oversample examples from the minority class, get more data, and a few others. Instead, I decided to use another learning algorithm named SdcaMaximumEntropy, which builds a maximum entropy classification model (also known as a maximum entropy multiclass classifier) in ML.NET, along with its text transformation features.
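
For reference, the simplest of those options (repeating minority-class examples) would look roughly like the sketch below. This is only an illustration of the approach we did not end up taking, and the helper name is hypothetical:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: randomly repeat question rows until they make up
// about half of the data set. We did not use this in the final model.
static List<ConversationInput> Oversample(List<ConversationInput> rows)
{
    var questions = rows.Where(r => r.Area.Contains("Question")).ToList();
    var balanced = new List<ConversationInput>(rows);
    var random = new Random(0);
    while (questions.Count > 0 &&
           balanced.Count(r => r.Area.Contains("Question")) * 2 < balanced.Count)
    {
        balanced.Add(questions[random.Next(questions.Count)]);
    }
    return balanced;
}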

Training of the Model

To fully describe what I did with ML.NET, I've made the full source code available here. I've summarized the most important details below:

The first step of training was the creation of the pipeline. In ML.NET, any kind of transformation of the input data needs to live in a pipeline. For example, mapping the label column and featurizing the text was done in a pipeline like this:

var pipeline = _mlContext.Transforms.Conversion
    .MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
    .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description",
        outputColumnName: "DescriptionFeaturized"))
    .Append(_mlContext.Transforms.Concatenate("Features", "DescriptionFeaturized"))
    // Cache the DataView so that estimators which iterate over the data multiple
    // times read from the cache instead of re-reading the file.
    .AppendCacheCheckpoint(_mlContext);

This part then appended the trainer and mapped the predicted key back to a readable label:

var trainingPipeline = pipeline
    .Append(_mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
    .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

// Train the model by fitting the pipeline to the training DataView.
_trainedModel = trainingPipeline.Fit(trainingDataView);
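
One step not shown above: before the web service described later can load the model from model.zip, the trained pipeline has to be persisted. A minimal sketch using ML.NET's standard save call:

// Persist the trained model so the web API can later load it as "model.zip".
_mlContext.Model.Save(_trainedModel, trainingDataView.Schema, "model.zip");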

As you can see, with ML.NET we get a mapping between input and output columns and build the model with a single Fit call, a pattern that shows up in plenty of libraries. One more important piece we implemented was how ML.NET reads the data set. Out of the box, ML.NET knows how to read from .csv files, but it has to be taught how to read from other kinds of storage. For those cases Microsoft provides the LoadFromEnumerable method. In order to read from the prepared .xml file, I created the following method:

private static List<ConversationInput> LoadFromXml()
{
    var result = new List<ConversationInput>();
    var document = new XmlDocument();
    document.Load("questions.xml");

    // Navigate to the collection of post nodes inside the merged file.
    var posts = document.LastChild.LastChild.ChildNodes;
    int i = 0;
    foreach (XmlNode post in posts)
    {
        try
        {
            i++;
            var inputRow = new ConversationInput();
            inputRow.Description = post.FirstChild.Value;    // the sentence text
            inputRow.Area = post.Attributes["class"].Value;  // sentence class, e.g. a question type
            result.Add(inputRow);
        }
        catch (Exception)
        {
            // Skip malformed posts.
        }

        if (i == 1000)
            break;
    }

    return result;
}

That method is relatively simple: it reads from the .xml file, takes the class attribute into account, and returns an in-memory list of values as a result.
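
For completeness, here is a minimal sketch of how that list is turned into the trainingDataView used by Fit above. The ConversationInput shape is inferred from the code in this post:

using Microsoft.ML;

// Input row type used throughout this post (properties inferred from the code above).
public class ConversationInput
{
    public string Area { get; set; }        // sentence class, e.g. a question type or a statement
    public string Description { get; set; } // the sentence text itself
}

// Feed the in-memory list into ML.NET as an IDataView for training.
IDataView trainingDataView = _mlContext.Data.LoadFromEnumerable(LoadFromXml());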

Integration with Acumatica

I didn't anticipate any issues integrating ML.NET with Acumatica, but I was naïve. Unfortunately, both ML.NET and Acumatica have a dependency on Newtonsoft.Json, and because of that conflict I wasn't able to use the model inside Acumatica directly. It meant that we needed to use our model through a proxy. The different proxies that we considered were the following:

  1. Web API service
  2. Console application, which can be executed from the command line
  3. WCF service
  4. ESB approach

Due to time constraints, as well as some other limitations, we decided to go with the web API service. Inside the web API controller, we implemented the following method that uses our model:

// GET api/values/5
[HttpGet("{id}")]
public ActionResult<string> Get(string id)
{
    var _mlContext = new MLContext(seed: 0);
    ITransformer _trainedModel = _mlContext.Model.Load("model.zip", out DataViewSchema inputSchema);
    // ConversationPrediction stands for the prediction output class that exposes
    // the predicted Area; its exact name in our project is not shown here.
    PredictionEngine<ConversationInput, ConversationPrediction> _predEngine =
        _mlContext.Model.CreatePredictionEngine<ConversationInput, ConversationPrediction>(_trainedModel);

    ConversationInput issue = new ConversationInput()
    {
        Description = id
    };
    var prediction = _predEngine.Predict(issue);

    if (prediction.Area.Contains("Question"))
        return "true";
    return "false";
}

The pattern of usage is simple:

  1. Load the model from a file.
  2. Create an input model.
  3. Get the output with the help of the Predict method.
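
On the Acumatica side, using the proxy then comes down to a plain HTTP call. A minimal sketch, with the service URL being illustrative rather than the one we actually deployed:

using System;
using System.Net.Http;

// Ask the web API proxy whether a sentence is a question.
// The controller above answers "true" or "false" as plain text.
public static bool IsQuestion(string sentence)
{
    using (var client = new HttpClient())
    {
        var url = "http://localhost:5000/api/values/" + Uri.EscapeDataString(sentence);
        var response = client.GetStringAsync(url).GetAwaiter().GetResult();
        return response.Contains("true");
    }
}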

Summary

Initially, I thought that our project would not be the winner, especially after looking at the projects made by other teams. Admittedly, I was biased because I wanted to vote for products made by developers, for developers. But I realized that at the Acumatica Hackathon, the biggest value does not come from something that makes the life of the developer easier. It comes from something that makes the life of the Acumatica user easier.
