
# Clustering documents and Gaussian data with Dirichlet Process Mixture Models


This article is the fifth part of the tutorial on Clustering with DPMM. In the previous posts we covered in detail the theoretical background of the method and we described its mathematical representations and ways to construct it. In this post we will link the theory with practice by introducing two DPMM models: the Dirichlet Multivariate Normal Mixture Model, which can be used to cluster Gaussian data, and the Dirichlet-Multinomial Mixture Model, which is used to cluster documents.

Update: The Datumbox Machine Learning Framework is now open-source and free to download. Check out the package com.datumbox.framework.machinelearning.clustering to see the implementation of Dirichlet Process Mixture Models in Java.

## 1. The Dirichlet Multivariate Normal Mixture Model

The first Dirichlet Process mixture model that we will examine is the Dirichlet Multivariate Normal Mixture Model which can be used to perform clustering on continuous datasets. The mixture model is defined as follows:

Equation 1: Dirichlet Multivariate Normal Mixture Model
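In standard notation, the model can be written as follows (a sketch consistent with the description below, not necessarily the exact parameterization of the original figure): the cluster assignments follow the Chinese Restaurant Process, the cluster parameters are drawn from the Normal-Inverse-Wishart base distribution G0, and the observations are Gaussian.

```latex
\begin{aligned}
z_{1:n} &\sim \mathrm{CRP}(\alpha) \\
(\mu_k, \Sigma_k) &\sim G_0 = \mathrm{NIW}(\mu_0, \kappa_0, \nu_0, \Psi_0) \\
x_i \mid z_i = k,\ \mu_k, \Sigma_k &\sim \mathcal{N}(\mu_k, \Sigma_k)
\end{aligned}
```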

As we can see above, the particular model assumes that the Generative Distribution is the Multivariate Gaussian Distribution and uses the Chinese Restaurant Process as the prior for the cluster assignments. Moreover, for the base distribution G0 it uses the Normal-Inverse-Wishart prior, which is the conjugate prior of the Multivariate Normal distribution with unknown mean and covariance matrix. Below we present the Graphical Model of the mixture model:

Figure 1: Graphical Model of Dirichlet Multivariate Normal Mixture Model

As we discussed earlier, in order to be able to estimate the cluster assignments, we will use Collapsed Gibbs sampling, which requires selecting the appropriate conjugate priors. Moreover we will need to update the posterior of the parameters given the prior and the evidence. Below we see the MAP estimates of the parameters for one of the clusters:

Equation 2: MAP estimates on Cluster Parameters
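A sketch of these estimates, based on the standard Normal-Inverse-Wishart posterior (the exact constant in the covariance denominator may differ slightly from the original figure), with ck denoting the number of observations currently assigned to cluster k:

```latex
\begin{aligned}
\hat{\mu}_k &= \frac{\kappa_0\,\mu_0 + c_k\,\bar{x}_k}{\kappa_0 + c_k} \\[4pt]
\hat{\Sigma}_k &= \frac{\Psi_0 + \sum_{i:\,z_i=k}(x_i-\bar{x}_k)(x_i-\bar{x}_k)^{T}
 + \frac{\kappa_0\,c_k}{\kappa_0 + c_k}(\bar{x}_k-\mu_0)(\bar{x}_k-\mu_0)^{T}}{\nu_0 + c_k + d + 2}
\end{aligned}
```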

Where d is the dimensionality of our data and x̄k is the sample mean of cluster k. Moreover we have several hyperparameters of the Normal-Inverse-Wishart: μ0 is the initial mean, κ0 is the mean fraction which works as a smoothing parameter, ν0 is the degrees of freedom which is set to the number of dimensions, and Ψ0 is the pairwise deviation product which is set to the d×d identity matrix multiplied by a constant. From now on all the previous hyperparameters of G0 will be denoted by λ to simplify the notation. Finally, having all the above, we can estimate the probabilities that are required by the Collapsed Gibbs Sampler. The probability that observation i belongs to cluster k, given the cluster assignments, the dataset and all the hyperparameters α and λ of the DP and G0, is given below:

Equation 3: Probabilities used by Gibbs Sampler for MNMM
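A common form of these probabilities, consistent with the notation explained below, scores an existing cluster with the CRP prior times a Gaussian evaluated at the cluster parameters estimated without observation i (for example via the MAP formulas of Equation 2), and a new cluster with the prior predictive under λ; the original figure may use the full posterior-predictive Student-t instead of the plug-in Gaussian:

```latex
p(z_i = k \mid z_{-i}, x_{1:n}, \alpha, \lambda) \propto
\begin{cases}
\dfrac{c_{k,-i}}{\alpha + n - 1}\;\mathcal{N}\!\left(x_i \mid \mu_{k,-i}, \Sigma_{k,-i}\right), & \text{existing cluster } k \\[8pt]
\dfrac{\alpha}{\alpha + n - 1}\; p(x_i \mid \lambda), & \text{new cluster}
\end{cases}
```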

Where zi is the cluster assignment of observation xi, x1:n is the complete dataset, z-i is the set of cluster assignments without that of the ith observation, x-i is the complete dataset excluding the ith observation, ck,-i is the total number of observations assigned to cluster k excluding the ith observation, while μk,-i and Σk,-i are the mean and covariance matrix of cluster k excluding the ith observation.

## 2. The Dirichlet-Multinomial Mixture Model

The Dirichlet-Multinomial Mixture Model is used to perform cluster analysis of documents. The particular model has a slightly more complicated hierarchy since it models the topics/categories of the documents, the word probabilities within each topic, the cluster assignments and the generative distribution of the documents. Its target is to perform unsupervised learning and cluster a list of documents by assigning them to groups. The mixture model is defined as follows:

Equation 4: Dirichlet-Multinomial Mixture Model
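In standard notation, a sketch consistent with the description below is given next; in the Dirichlet Process case the topic proportions φ are effectively drawn through the Chinese Restaurant Process with concentration β rather than a finite Dirichlet:

```latex
\begin{aligned}
\phi \mid \beta &\sim \mathrm{Dirichlet}(\beta) \\
z_i \mid \phi &\sim \mathrm{Discrete}(\phi) \\
\theta_k \mid \alpha &\sim \mathrm{Dirichlet}(\alpha) \\
x_{i,j} \mid z_i = k,\ \theta_k &\sim \mathrm{Discrete}(\theta_k)
\end{aligned}
```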

Where φ models the topic probabilities, zi is a topic selector, θk are the word probabilities in each cluster and xi,j represents the document words. We should note that this technique uses the bag-of-words framework which represents the documents as an unordered collection of words, disregarding grammar and word order. This simplified representation is commonly used in natural language processing and information retrieval. Below we present the Graphical Model of the mixture model:

Figure 2: Graphical Model of the Dirichlet-Multinomial Mixture Model

The particular model uses the Multinomial (Discrete) distribution for the generative distribution and Dirichlet distributions for the priors. Here ℓ is the number of active clusters, n is the total number of documents, β controls the a priori expected number of clusters, and α controls the number of words assigned to each cluster. To estimate the probabilities that are required by the Collapsed Gibbs Sampler we use the following equation:

Equation 5: Probabilities used by Gibbs Sampler for DMMM
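A standard form of this collapsed conditional, consistent with the notation explained below, is given next; v indexes the vocabulary, the CRP weight Nk(z-i) is replaced by β when a new cluster is proposed, and the Gamma-function ratios are the Dirichlet-Multinomial posterior predictive of the document's word counts:

```latex
p(z_i = k \mid z_{-i}, x_{1:n}, \alpha, \beta) \propto
N_k(z_{-i})\;
\frac{\Gamma\!\left(\sum_{v}\big(N_{z=k,v}(x_{-i}) + \alpha_v\big)\right)}
     {\Gamma\!\left(\sum_{v}\big(N_{z=k,v}(x_{-i}) + \alpha_v + N_v(x_i)\big)\right)}
\;\prod_{v}
\frac{\Gamma\!\left(N_{z=k,v}(x_{-i}) + \alpha_v + N_v(x_i)\right)}
     {\Gamma\!\left(N_{z=k,v}(x_{-i}) + \alpha_v\right)}
```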

Where Γ is the gamma function, zi is the cluster assignment of document xi, x1:n is the complete dataset, z-i is the set of cluster assignments without that of the ith document, x-i is the complete dataset excluding the ith document, Nk(z-i) is the number of documents assigned to cluster k excluding the ith document, Nz=k(x-i) is a vector with the sums of counts of each word over all the documents assigned to cluster k excluding the ith document, and N(xi) is the sparse vector with the counts of each word in document xi. Finally, as we can see above, by using the Collapsed Gibbs Sampler with the Chinese Restaurant Process, the θjk variable which stores the probability of word j in topic k can be integrated out.
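To make the procedure concrete, below is a minimal, illustrative Java sketch of this collapsed Gibbs sweep for document clustering. It is not the Datumbox implementation (see the com.datumbox.framework package mentioned above for that); the class and method names are invented for the example, documents are dense word-count vectors over a toy vocabulary, α is a symmetric Dirichlet prior on word probabilities and β is the CRP concentration. Because word counts are integers, each ratio Γ(a+n)/Γ(a) in Equation 5 reduces to the product a(a+1)⋯(a+n−1), which is accumulated in log-space without needing a log-gamma routine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative collapsed Gibbs sampler for a Dirichlet-Multinomial mixture with a CRP prior.
 *  A simplified sketch for exposition, not the Datumbox implementation. */
public class DpmmDocSketch {
    private final int[][] docs;        // docs[i][v] = count of word v in document i
    private final int vocabSize;
    private final double alpha, beta;  // symmetric Dirichlet on words, CRP concentration
    private final int[] z;             // z[i] = cluster assignment of document i
    private final List<int[]> clusterWordCounts = new ArrayList<>();   // N_{z=k}(x) per cluster
    private final List<Integer> clusterDocCounts = new ArrayList<>();  // N_k(z) per cluster
    private final Random rng = new Random(42);

    public DpmmDocSketch(int[][] docs, int vocabSize, double alpha, double beta) {
        this.docs = docs; this.vocabSize = vocabSize; this.alpha = alpha; this.beta = beta;
        this.z = new int[docs.length];
        for (int i = 0; i < docs.length; i++) { z[i] = -1; assign(i, sampleCluster(i)); }
    }

    /** One Gibbs sweep: remove each document from its cluster, then resample its assignment. */
    public void sweep() {
        for (int i = 0; i < docs.length; i++) { remove(i); assign(i, sampleCluster(i)); }
    }

    /** log p(doc i | cluster counts), using log Gamma(a+n) - log Gamma(a) = sum_{t<n} log(a+t). */
    private double logPredictive(int i, int[] counts, int totalWords) {
        double lp = 0.0; int docTotal = 0;
        for (int v = 0; v < vocabSize; v++) {
            for (int t = 0; t < docs[i][v]; t++) lp += Math.log(counts[v] + alpha + t);
            docTotal += docs[i][v];
        }
        for (int t = 0; t < docTotal; t++) lp -= Math.log(totalWords + vocabSize * alpha + t);
        return lp;
    }

    /** Sample a cluster for document i; index == current cluster count means "open a new cluster". */
    private int sampleCluster(int i) {
        int k = clusterWordCounts.size();
        double[] logW = new double[k + 1];
        for (int c = 0; c < k; c++) {
            int total = 0; for (int v = 0; v < vocabSize; v++) total += clusterWordCounts.get(c)[v];
            logW[c] = Math.log(clusterDocCounts.get(c)) + logPredictive(i, clusterWordCounts.get(c), total);
        }
        logW[k] = Math.log(beta) + logPredictive(i, new int[vocabSize], 0); // new cluster: prior predictive
        double max = Double.NEGATIVE_INFINITY;
        for (double lw : logW) max = Math.max(max, lw);
        double sum = 0; double[] w = new double[logW.length];
        for (int c = 0; c < logW.length; c++) { w[c] = Math.exp(logW[c] - max); sum += w[c]; }
        double u = rng.nextDouble() * sum;
        for (int c = 0; c < w.length; c++) { u -= w[c]; if (u <= 0) return c; }
        return w.length - 1;
    }

    private void assign(int i, int c) {
        if (c == clusterWordCounts.size()) { clusterWordCounts.add(new int[vocabSize]); clusterDocCounts.add(0); }
        z[i] = c;
        clusterDocCounts.set(c, clusterDocCounts.get(c) + 1);
        for (int v = 0; v < vocabSize; v++) clusterWordCounts.get(c)[v] += docs[i][v];
    }

    private void remove(int i) {
        int c = z[i];
        clusterDocCounts.set(c, clusterDocCounts.get(c) - 1);
        for (int v = 0; v < vocabSize; v++) clusterWordCounts.get(c)[v] -= docs[i][v];
        if (clusterDocCounts.get(c) == 0) {   // drop empty cluster and relabel the rest
            clusterWordCounts.remove(c); clusterDocCounts.remove(c);
            for (int j = 0; j < z.length; j++) if (z[j] > c) z[j]--;
        }
        z[i] = -1;
    }

    public int[] assignments() { return z.clone(); }

    public static void main(String[] args) {
        int[][] docs = { {5,4,0,0}, {6,3,1,0}, {0,1,5,6}, {0,0,4,5} }; // toy word-count vectors
        DpmmDocSketch sampler = new DpmmDocSketch(docs, 4, 0.5, 1.0);
        for (int s = 0; s < 100; s++) sampler.sweep();
        System.out.println(java.util.Arrays.toString(sampler.assignments()));
    }
}
```

For real corpora one would use sparse count vectors and monitor the number of active clusters across sweeps rather than running a fixed number of iterations, but the resampling step above follows the same structure.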

#### About Vasilis Vryniotis

My name is Vasilis Vryniotis. I’m a Data Scientist, a Software Engineer, author of the Datumbox Machine Learning Framework and a proud geek.

# Resiliency And Security: Future-Proofing Our AI Future


By Allison Proffitt, AI Trends

On the first day of the Second Annual AI World Government conference and expo held virtually October 28-30, a panel moderated by Robert Gourley, cofounder & CTO of OODA, raised the issue of AI resiliency. Future-proofing AI solutions requires keeping your eyes open to upcoming likely legal and regulatory roadblocks, said Antigone Peyton, General Counsel & Innovation Strategist at Cloudigy Law. She takes a “use as little as possible” approach to data, raising questions such as: How long do you really need to keep training data? Can you abstract training data to the population level, removing some risk while still keeping enough data to find dangerous biases?

Stephen Dennis, Director of Advanced Computing Technology Centers at the U.S. Department of Homeland Security, also recommended a forward-looking posture, but in terms of the AI workforce. In particular, Dennis challenged the audience to consider the maturity level of the users of new AI technology. Full automation is not likely a first AI step, he said. Instead, he recommends automating slowly, bringing the team along. Take them a technology that works in the context they are used to, he said. They shouldn’t need a lot of training. Mature your team with the technology. Remove the human from the loop slowly.

Of course, some things will never be fully automated. Brian Drake, U.S. Department of Defense, pointed out that some tasks are inherently human-to-human interactions—such as gathering human intelligence. But AI can help humans do even those tasks better, he said.

He also cautioned enterprises to consider their contingency plan as they automate certain tasks. For example, we rarely remember phone numbers anymore. We’ve outsourced that data to our phones while accepting a certain level of risk. If you deploy a tool that replaces a human analytic activity, that’s fine, Drake said. But be prepared with a contingency plan, a solution for failure.

### Organizing for Resiliency

All of these changes will certainly require some organizational rethinking, the panel agreed. While government is organized in a top down fashion, Dennis said, the most AI-forward companies—Uber, Netflix—organize around the data. That makes more sense, he proposed, if we are carefully using the data.

Data models—like the new car trope—begin degrading the first day they are used. Perhaps the source data becomes outdated. Maybe an edge use case was not fully considered. The deployment of the model itself may prompt a completely unanticipated behavior. We must capture and institutionalize those assessments, Dennis said. He proposed an AI quality control team—different from the team building and deploying algorithms—to understand degradation and evaluate the health of models in an ongoing way. His group is working on this with sister organizations in cyber security, and he hopes the best practices they develop can be shared with the rest of the department and across the government.

Peyton called for education—and reeducation—across organizations. She called the AI systems we use today a “living and breathing animal”. This is not, she emphasized, an enterprise-level system that you buy once and drop into the organization. AI systems require maintenance, and someone must be assigned to that caretaking.

But at least at the Department of Defense, Drake pointed out, all employees are not expected to become data scientists. We’re a knowledge organization, he said, but even if reskilling and retraining are offered, a federal workforce does not have to universally accept those opportunities. However, surveys across DoD have revealed an “appetite to learn and change”, Drake said. The Department is hoping to feed that curiosity with a three-tiered training program offering executive-level overviews, practitioner-level training on the tools currently in place, and formal data science training. He encouraged a similar structure to AI and data science training across other organizations.

Gourley turned the conversation to bad actors. The very first telegraph message between Washington DC and Baltimore in 1844 was an historic achievement. The second and third messages—Gourley said—were spam and fraud. Cybercrime is not new and it is absolutely guaranteed in AI. What is the way forward, Gourley asked the panel.

“Our adversaries have been quite clear about their ambitions in this space,” Drake said. “The Chinese have published a national artificial intelligence strategy; the Russians have done the same thing. They are resourcing those plans and executing them.”

In response, Drake argued for the vital importance of ethics frameworks and for the United States to embrace and use these technologies in an “ethically up front and moral way.” He predicted a formal codification around AI ethics standards in the next couple of years similar to international nuclear weapons agreements now.

# AI Projects Progressing Across Federal Government Agencies


By AI Trends Staff

Government agencies are gaining experience with AI on projects, with practitioners focusing on defining the project benefit and making sure the data quality is good enough to ensure success. That was a takeaway from talks on the opening day of the Second Annual AI World Government conference and expo held virtually on October 28.

Wendy Martinez, PhD, director of the Mathematical Statistics Research Center, with the Office of Survey Methods Research in the US Bureau of Labor Statistics, described a project to use natural language understanding AI to parse text fields of databases and automatically correlate them to job occupations in the federal system. One lesson learned was that, despite interest in sharing experience with other agencies, “You can’t build a model based on a certain dataset and use the model somewhere else,” she stated. Instead, each project needs its own source of data and a model tuned to it.

Renata Miskell, Chief Data Officer in the Office of the Inspector General for the US Department of Health and Human Services, fights fraud and abuse for an agency that oversees over \$1 trillion in annual spending, including on Medicare and Medicaid. She emphasized the importance of ensuring that data is not biased and that models generate ethical recommendations. For example, to track fraud in its grant programs awarding over \$700 billion annually, “It’s important to understand the data source and context,” she stated. The unit studied five years of data from “single audits” of individual grant recipients, which included a lot of unstructured text data. The goal was to pass relevant info to the audit team. “It took a lot of training,” she stated. “Initially we had many false positives.” The team tuned for data quality and ethical use, steering away from blind assumptions. “If we took for granted that the grant recipients were high risk, we would be unfairly targeting certain populations,” Miskell stated.

In the big picture, many government agencies are engaged in AI projects and a lot of collaboration is going on. Dave Cook is senior director of AI/ML Engineering Services for Figure Eight Federal, which works on AI projects for federal clients. He has years of experience working in private industry and government agencies, mostly now the Department of Defense and intelligence agencies. “In AI in the government right now, groups are talking to one another and trying to identify best practices around whether to pilot, prototype, or scale up,” he said. “The government has made some leaps over the past few years, and a lot of sorting out is still going on.”

Ritu Jyoti, Program VP, AI Research and Global AI Research lead for IDC consultants, program contributor to the event, has over 20 years of experience working with companies including EMC, IBM Global Services, and PwC Consulting. “AI has progressed rapidly,” she said. From a global survey IDC conducted in March, business drivers for AI adoption were found to be better customer experience, improved employee productivity, accelerated innovation and improved risk management. A fair number of AI projects failed. The main reasons were unrealistic expectations, the AI did not perform as expected, the project did not have access to the needed data, and the team lacked the necessary skills. “The results indicate a lack of strategy,” Jyoti stated.

David Bray, PhD, Inaugural Director of the nonprofit Atlantic Council GeoTech Center, and a contributor to the event program, posed questions on how data governance challenges the future of AI. He asked what questions practitioners and policymakers around AI should be asking, and how the public can participate more in deciding what can be done with data. “You choose not to be a data nerd at your own peril,” he said.

Anthony Scriffignano, PhD, senior VP & Chief Data Scientist with Dun & Bradstreet, said in the pandemic era with many segments of the economy shut down, companies are thinking through and practicing different ways of doing things. “We sit at the point of inflection. We have enough data and computer power to use the AI techniques invented generations ago in some cases,” he said. This opportunity poses challenges related to what to try and what not to try, and “sometimes our actions in one area cause a disruption in another area.”

AI World Government continues tomorrow and Friday.

(Ed. Note: Dr. Eric Schmidt, former CEO of Google and now chair of the National Security Commission on AI, was today involved in a discussion, Transatlantic Cooperation Around the Future of AI, with Ambassador Mircea Geoana, Deputy Secretary General, North Atlantic Treaty Organization, and Secretary Robert O. Work, vice chair of the National Security Commission. Convened by the Atlantic Council, the event can be viewed here.)

# 5 Work From Home Office Essentials


Working remotely from home had been increasing in popularity, but it’s now become a necessity for many professionals due to the pandemic.

“Some companies are eager to reopen their doors and return to the office, but a large number of employers and employees are making the transitional work environment a permanent change.”

Employers can’t guarantee their workers’ health and safety in a socially-crowded space; plus, companies are able to save tons of money they would have spent on their commercial lease or mortgage payments.

That’s not to say that working from home doesn’t come with its own costs, however. It can lead to a huge hit in productivity without the right equipment in place. To maximize your performance and efficiency in a remote setting, be sure to purchase these five office essentials.

### 1. Powerful PC

This one probably feels like an obvious pointer, but let’s knock it off our list. You won’t be able to get by with a makeshift workstation; in today’s digital domain, your computer will be at the core of everything you do.

Never-ending loading wheels, delayed downloads, and slow rendering will add seconds to every task you do, so if your company didn’t provide you with a workhorse computer tower, you might look into investing in one yourself, then deduct the cost on your tax return.

Depending on your line of work, it might make more sense to go for a laptop vs a desktop computer. Unless your tasks demand super sophisticated software and large storage space, you can probably get by with a portable PC. That way, when coffee shops begin to reopen and allow patrons to sit inside, you can work on-the-go without feeling tethered to your desk.

### 2. Ergonomic Office Chair

If you’re looking at a long-term remote situation, it’s worth spending the big bucks on an ergonomic office chair. You should feel comfortably locked into your seat for eight hours a day—at least if you want to concentrate on your workflow, rather than the cramp in your back.

Shop around for an office chair that’s sophisticated in design and specifically built to support the human body. Some stand-out features you should look out for include:

• Targeted support around the lumbar spine
• Adjustable height so you can position the seat for your arms to rest naturally on the keyboard
• Swivel base to effortlessly turn your body, preventing neck strain
• Cushioned seat to comfort your tailbone
• Ventilated fabric that promotes airflow so you don’t feel overheated when sitting in the chair for several hours

You might have to pay a couple of hundred dollars for the best-of-the-line features, but this is another item that might qualify as an eligible tax deduction—just be sure to keep all your receipts organized with a document scanner in case the IRS raises its eyebrows and issues an audit.

### 3. Wireless Keyboard

If you want to type faster and feel better while you’re at it, then a wireless keyboard is clutch. It enables you to bring the keys closer, decreasing the extension length of your arms and the accompanying shoulder strain.

“It also helps reduce the strain on your eyes by moving the bright screen farther away from your direct line of sight.”

And, last but not least, the keys are placed in an ergonomic position for a more natural finger splay, with ample wrist cushioning that helps prevent overuse injuries such as carpal tunnel syndrome.

### 4. Noise-cancelling Headphones

To truly get in the zone, you should block out distractions with headphones that cancel noise in your environment—especially if your workstation is set up in a common area. Other tips to stay focused include installing a website blocker and leaving your cellphone on the other side of the room.

### 5. House plant or flowers

Research suggests people are more productive when working near fresh flowers or lush greenery. The good news is that you don’t need to have a green thumb or natural lighting to achieve this effect—even artificial foliage can brighten your mood and improve your performance.

Working from home sometimes can feel like you’re locked inside all day, so bringing the outside world inside your space can help ward off burnout.

Take these tips with you into 2021 and set yourself up for success in your new home office setting.