

MIT Study: Effects of Automation on the Future of Work Challenge Policymakers




The 2020 MIT Task Force on the Future of Work suggests how the productivity gains of automation can coexist with opportunity for low-wage workers. (Credit: Getty Images)

By John P. Desmond, AI Trends Editor  

Rising productivity brought on by automation has not led to an increase in income for workers. This is among the conclusions of the 2020 report from the MIT Task Force on the Future of Work, founded in 2018 to study the relationship between emerging technologies and work, shape public discourse, and explore strategies for sharing prosperity more broadly.

Dr. Elisabeth Reynolds, Executive Director, MIT Task Force on the Work of the Future

“Wages have stagnated,” said Dr. Elisabeth Reynolds, Executive Director, MIT Task Force on the Work of the Future, who shared results of the new task force report at the AI and the Work of the Future Congress 2020 held virtually last week.   

The report made recommendations in three areas, the first around translating productivity gains from advances in automation into better-quality jobs. “The quality of jobs in this country has been falling and not keeping up with those in other countries,” she said. “Among rich countries, the US is among the worst places for less-educated and low-paid workers.” For example, the average hourly wage for low-paid workers in the US is $10/hour, compared to $14/hour for similar workers in Canada, who have health care benefits from national insurance.

“Our workers are falling behind,” she said.  

The second area of recommendation was to invest and innovate in education and skills training. “This is a pillar of our strategy going forward,” Reynolds said. The report focuses on workers between high school and a four-year degree. “We focus on multiple examples to help workers find the right path to skilled jobs,” she said.  

Many opportunities are emerging in health care, for example, specifically around health information technicians. She cited the IBM P-TECH program, which provides public high school students from underserved backgrounds with skills they need for competitive STEM jobs, as a good example of education innovation. P-TECH schools enable students to earn both their high school diploma and a two-year associate degree linked to growing STEM fields.

The third area of recommendation is to shape and expand innovation.

“Innovation creates jobs and will help the US meet competitive challenges from abroad,” Reynolds said. R&D funding as a percentage of GDP in the US has stayed fairly steady from 1953 to 2015, but support from the federal government has declined over that time. “We want to see greater activity by the US government,” she said.

In a country that is politically divided and economically polarized, many have a fear of technology. Deploying new technology into the existing labor market has the potential to make such divisions worse, continuing downward pressure on wages, skills and benefits, and widening income inequality. “We reject the false tradeoffs between economic growth and having a strong labor market,” Dr. Reynolds said. “Other countries have done it better and the US can do it as well,” she said, noting many jobs that exist today did not exist 40 years ago. 

The COVID-19 crisis has exacerbated the different realities between low-paid workers deemed “essential,” who need to be physically present to earn their living, and higher-paid workers able to work remotely via computers, the report noted.

The Task Force is co-chaired by MIT Professors David Autor, Economics, and David Mindell, Engineering, in addition to Dr. Reynolds. Members of the task force include more than 20 faculty members drawn from 12 departments at MIT, as well as over 20 graduate students. The 2020 Report can be found here.   

Low-Wage Workers in US Fare Less Well Than Those in Other Advanced Countries  

James Manyika, Senior Partner, McKinsey & Co.

In a discussion on the state of low-wage jobs, James Manyika, Senior Partner, McKinsey & Co., said low-wage workers have not fared well across the 37 countries of the Organization for Economic Cooperation and Development (OECD), “and in the US, they have fared far worse than in other advanced countries.” Jobs are available, but the wages are lower and, “Work has become a lot more fragile,” with many jobs in the gig economy (Uber and Lyft, for example) rather than full-time jobs with some level of benefits.

Addressing cost of living, Manyika said the cost of products such as cars and TVs has declined as a percentage of income, but the costs of housing, education, and health care have increased dramatically and are not affordable for many. The growth in the low-wage gig-worker type of job has coincided with “the disappearance of labor market protections and worker voice,” he said, noting, “The power of workers has declined dramatically.”

Geographically, two-thirds of US job growth has happened in 25 metropolitan areas. “Other parts of the country have fared far worse,” he said. “This is a profound challenge.”  

In a session on Labor Market Dynamics, Susan Houseman, VP and Director of Research, W.E. Upjohn Institute for Employment Research, drew a comparison to Denmark for some contrasts. Denmark has a strong safety net of benefits for the unemployed, while the US has “one of the least generous unemployment systems in the world,” she said. “This will be more important in the future with the growing displacement caused by new technology.”  

Another contrast between the US and Denmark is the relationship of labor to management. “The Danish system has a long history of labor management cooperation, with two-thirds of Danish workers in a union,” she said. “In the US, unionization rates have dropped to 10%.”  

“We have a long history of labor management confrontation and not cooperation,” Houseman said. “Unions have really been weakened in the US.”  

As for recommendations, she suggested that the US strengthen its unemployment systems, help labor organizations to build, raise the federal minimum wage [Ed. Note: The federal minimum wage has been $7.25/hour since 2009.], and provide universal health insurance, “to take it out of the employment market.”

She suspects the number of workers designated as independent contractors is “likely understated” in the data.   

Jayaraman of One Fair Wage Recommends Sectoral Bargaining 

Saru Jayaraman, President One Fair Wage and Director, Food Labor Research Center, University of California, Berkeley

Later in the day, Saru Jayaraman, President of One Fair Wage and Director of the Food Labor Research Center at the University of California, Berkeley, spoke about her work with employees and business owners. One Fair Wage is a non-profit organization that advocates for a fair minimum wage, including, for example, a suggestion that tips be counted as a supplement to the minimum wage for restaurant workers.

“We fight for higher wages and better working conditions, but it’s more akin to sectoral bargaining in other parts of the world,” she said. Sectoral collective bargaining is an effort to reach an agreement covering all workers in a sector of the economy, as opposed to agreements negotiated firm by firm. “It is a social contract,” Jayaraman said.

In France, 98% of workers were covered by sectoral bargaining as of 2015. “The traditional models for improving wages and working conditions workplace by workplace do not work,” she said. She spoke of the need to maintain a “consumer base” of workers who put money back into the economy.

With the pandemic causing many restaurants to scale back or close, more restaurant owners have reached out to her organization in an effort to get workers back with the help of more cooperative agreements. “We have been approached by hundreds of restaurants in the last six months who are saying it’s time to change to a minimum wage,” she said. “Many were moved that so many workers were not getting unemployment insurance. They are rethinking every aspect of their businesses. They want a more functional system where everybody gets paid, and we move away from slavery. It’s a sea change among employers.”   

She said nearly 800 restaurants are now members of her organization. 

For the future, “We don’t have time for each workplace to be organized. We need to be innovating with sectoral bargaining to raise wages and working conditions across sectors. That is the future combined with workplace organizing,” she said.  

Read the 2020 report from the MIT Task Force on the Future of Work; learn about IBM P-TECH and about One Fair Wage. 



How to deliver natural conversational experiences using Amazon Lex Streaming APIs




Natural conversations often include pauses and interruptions. During customer service calls, a caller may ask to pause the conversation or hold the line while they look up the necessary information before continuing to answer a question. For example, callers often need time to retrieve credit card details when making bill payments. Interruptions are also common. Callers may interrupt a human agent with an answer before the agent finishes asking the entire question (for example, “What’s the CVV code for your credit card? It is the three-digit code in the top right corner…”). Just like conversing with human agents, a caller interacting with a bot may interrupt or instruct the bot to hold the line. Previously, you had to orchestrate such dialog on Amazon Lex by managing client attributes and writing code via an AWS Lambda function. Implementing a hold pattern required code to keep track of the previous intent so that the bot could continue the conversation. The orchestration of these conversations was complex to build and maintain, and impacted the time to market for conversational interfaces. Moreover, the user experience was disjointed because the properties of prompts, such as the ability to interrupt, were defined in the session attributes on the client.

Amazon Lex’s new streaming conversation APIs allow you to deliver sophisticated natural conversations across different communication channels. You can now easily configure pauses, interruptions and dialog constructs while building a bot with the Wait and Continue and Interrupt features. This simplifies the overall design and implementation of the conversation and makes it easier to manage. By using these features, the bot builder can quickly enhance the conversational capability of virtual agents or IVR systems.

In the new Wait and Continue feature, the ability to put the conversation into a waiting state is surfaced during slot elicitation. You can configure the slot to respond with a “Wait” message such as “Sure, let me know when you’re ready” when a caller asks for more time to retrieve information. You can also configure the bot to continue the conversation with a “Continue” response based on defined cues such as “I’m ready for the policy ID. Go ahead.” Optionally, you can set a “Still waiting” prompt to play messages like “I’m still here” or “Let me know if you need more time.” You can set how frequently these messages play and configure a maximum wait time for user input. If the caller doesn’t provide any input within the maximum wait duration, Amazon Lex resumes the dialog by prompting for the slot. The following screenshot shows the wait and continue configuration options on the Amazon Lex console.


The Interrupt feature enables callers to barge-in while a prompt is played by the bot. A caller may interrupt the bot and answer a question before the prompt is completed. This capability is surfaced at the prompt level and provided as a default setting. On the Amazon Lex console, navigate to the Advanced Settings and under Slot prompts, enable the setting to allow users to interrupt the prompt.

After configuring these features, you can initiate a streaming interaction with the Lex bot by using the StartConversation API. The streaming capability enables you to capture user input, manage state transitions, handle events, and deliver the responses required as part of a conversation. The input can be one of three types: audio, text, or DTMF, whereas the response can be either audio or text. The dialog progresses by eliciting an intent, populating any slots, confirming the intent, and finally closing the intent. Streaming allows intents to be tracked across different conversation states such as InProgress, Waiting, Confirmed, Denied, Fulfilled, ReadyForFulfillment, or Failed. For a detailed list of the different dialog and intent states, see the documentation for Amazon Lex intents.

The StartConversation API uses HTTP/2 for bidirectional communication between the client and Amazon Lex. Clients are expected to maintain a single long-running stream throughout the conversation and exchange required information with the server (and vice versa) via different events. To learn more about streaming for Amazon Lex bots, please refer to the Lex streaming documentation.

Amazon Lex Streaming APIs in action

Let’s put all of this together in a bot. I first use the Amazon Lex console to configure a bot and then use the following sample telephony application to interact with the bot. For the purposes of this post, we use a simple bot to complete a bill payment with a credit card. Because callers often need time to retrieve their credit card details, we enable the wait and continue construct when eliciting the credit card number. We also set up a service-side interrupt to manage interruption.

The following is a sample conversation with our bot:

Caller:    I’d like to make a payment on my account
Bot:        Sure. Can we start off with your account ID?
Caller:    Yes the account ID is 1234
Bot:        Great. The balance due is $100. How much would you like to pay today?
Caller:    $100
Bot:        Ok. Can you share the 16-digit number on your credit card?
Caller:    Can you wait a minute?
Bot:        Sure

—————- a few seconds later —————-

Bot:        Let me know when ready. No rush.

—————- a few seconds later —————-

Caller:    Ok I am back
Bot:        Ok please go ahead
Caller:    The credit card number is 1234 5678 1234 5678
Bot:        Got it. Let me repeat that for you 1234 5678 1234 5678. Is that correct?
Caller:    Yes
Bot:        Ok. What’s the CVV code? You can find it in the top right on the signature panel on the back of the card.
Caller:    123
Bot:        Awesome. I will go ahead and process the $100 payment on your account using the card ending in 5678.
Caller:    Ok
Bot:        The payment went through. Your confirmation code is 1234.

The first step is to build an Amazon Lex bot with intents to process a payment and get the balance on the account. The ProcessPayment intent elicits the information required to process the payment, such as the payment amount, credit card number, CVV code, and expiration date. The GetBalanceAmount intent provides the balance on the account. The FallbackIntent is triggered when the user input can’t be processed by either of the two configured intents.

Deploying the sample bot

To create the sample bot, complete the following steps. This creates an Amazon Lex bot called PaymentsBot.

  1. On the Amazon Lex console, choose Create Bot.
  2. In the Bot configuration section, give the bot the name PaymentsBot.
  3. Specify AWS Identity and Access Management (IAM) permissions and the COPPA flag.
  4. Choose Next.
  5. Under Languages, choose English (US).
  6. Choose Done.
  7. Add the ProcessPayment and GetBalanceAmount intents to your bot.
  8. For the ProcessPayment intent, add the following slots:
    1. PaymentAmount slot using the built-in AMAZON.Number slot type
    2. CreditCardNumber slot using the built-in AMAZON.AlphaNumeric slot type
    3. CVV slot using the built-in AMAZON.Number slot type
    4. ExpirationDate slot using the built-in AMAZON.Date slot type
  9. Configure slot elicitation prompts for each slot.
  10. Configure a closing response for the ProcessPayment intent.
  11. Similarly, add and configure slots and prompts for the GetBalanceAmount intent.
  12. Choose Build to test your bot.

For more information about creating a bot, see the Lex V2 documentation.

Configuring Wait and Continue

  1. Choose the ProcessPayment intent and navigate to the CreditCardNumber slot.
  2. Choose Advanced Settings to open the slot editor.
  3. Enable Wait and Continue for the slot.
  4. Provide the Wait, Still Waiting, and Continue responses.
  5. Save the intent and choose Build.

The bot is now configured to support the Wait and Continue dialog construct. Now let’s configure the client code. You can use a telephony application to interact with your Lex bot. You can download the code for setting up a telephony IVR interface via Twilio at the GitHub project. The link contains information to set up a telephony interface as well as a client application code to communicate between the telephony interface and Amazon Lex.

Now, let’s review the client-side setup to use the bot configuration that we just enabled on the Amazon Lex console. The client application uses the Java SDK to capture payment information. First, you use the ConfigurationEvent to set up the conversation parameters. Then, you start sending input events (AudioInputEvent, TextInputEvent, or DTMFInputEvent) to send user input to the bot, depending on the input type. When sending audio data, you need to send multiple AudioInputEvent events, with each event containing a slice of the data.

The service first responds with a TranscriptEvent to provide the transcription, then sends an IntentResultEvent to surface the intent classification results. Subsequently, Amazon Lex sends a response event (TextResponseEvent or AudioResponseEvent) that contains the response to play back to the caller. If the caller requests the bot to hold the line, the intent is moved to the Waiting state and Amazon Lex sends another set of TranscriptEvent, IntentResultEvent, and a response event. When the caller requests to continue the conversation, the intent is set to the InProgress state and the service sends another set of TranscriptEvent, IntentResultEvent, and a response event. While the dialog is in the Waiting state, Amazon Lex responds with a set of IntentResultEvent and response event for every “Still waiting” message (there is no transcript event for server-initiated responses). If the caller interrupts the bot prompt at any time, Amazon Lex returns a PlaybackInterruptionEvent.
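
To make that sequencing concrete, here is a small illustrative Python sketch of the dispatch logic. The dict-based events and the `dispatch` helper are hypothetical stand-ins; the real SDK delivers typed event objects, as the Java handler later in this post shows.

```python
# Hypothetical sketch of routing streamed bot events to handlers.
# Event type names mirror the Amazon Lex streaming events described
# above; the dict events and dispatch() are illustrative only.

def dispatch(event, handlers):
    """Route a server event to its handler, analogous to the
    instanceof checks in the Java response handler."""
    handler = handlers.get(event["type"])
    if handler is None:
        raise ValueError(f"unhandled event type: {event['type']}")
    return handler(event)

transcripts = []
handlers = {
    "TranscriptEvent": lambda e: transcripts.append(e["transcript"]),
    "IntentResultEvent": lambda e: e["state"],
    "AudioResponseEvent": lambda e: "play-audio",
    "PlaybackInterruptionEvent": lambda e: "pause-playback",
}

# A caller asks the bot to hold: transcript, then an intent result
# whose state moves to Waiting.
dispatch({"type": "TranscriptEvent", "transcript": "Can you wait a minute?"}, handlers)
state = dispatch({"type": "IntentResultEvent", "state": "Waiting"}, handlers)
```

The same table-of-handlers shape extends naturally to the remaining event types.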

Let’s walk through the main elements of the client code:

  1. Create the Amazon Lex client:

    AwsCredentialsProvider awsCredentialsProvider = StaticCredentialsProvider
        .create(AwsBasicCredentials.create(accessKey, secretKey));

    LexRuntimeV2AsyncClient lexRuntimeServiceClient = LexRuntimeV2AsyncClient.builder()
        .region(region)
        .credentialsProvider(awsCredentialsProvider)
        .build();

  2. Create a handler to publish data to the server:

    EventsPublisher eventsPublisher = new EventsPublisher();

  3. Create a handler to process bot responses:

    public class BotResponseHandler implements StartConversationResponseHandler {

        private static final Logger LOG = Logger.getLogger(BotResponseHandler.class);

        @Override
        public void responseReceived(StartConversationResponse startConversationResponse) {
  "successfully established the connection with server. request id:"
                    + startConversationResponse.responseMetadata().requestId()); // would have 2XX, request id.
        }

        @Override
        public void onEventStream(SdkPublisher<StartConversationResponseEventStream> sdkPublisher) {
            sdkPublisher.subscribe(event -> {
                if (event instanceof PlaybackInterruptionEvent) {
                    handle((PlaybackInterruptionEvent) event);
                } else if (event instanceof TranscriptEvent) {
                    handle((TranscriptEvent) event);
                } else if (event instanceof IntentResultEvent) {
                    handle((IntentResultEvent) event);
                } else if (event instanceof TextResponseEvent) {
                    handle((TextResponseEvent) event);
                } else if (event instanceof AudioResponseEvent) {
                    handle((AudioResponseEvent) event);
                }
            });
        }

        @Override
        public void exceptionOccurred(Throwable throwable) {
            LOG.error(throwable);
            System.err.println("got an exception:" + throwable);
        }

        @Override
        public void complete() {
  "on complete");
        }

        private void handle(PlaybackInterruptionEvent event) {
  "Got a PlaybackInterruptionEvent: " + event);
  "Done with a PlaybackInterruptionEvent: " + event);
        }

        private void handle(TranscriptEvent event) {
  "Got a TranscriptEvent: " + event);
        }

        private void handle(IntentResultEvent event) {
  "Got an IntentResultEvent: " + event);
        }

        private void handle(TextResponseEvent event) {
  "Got a TextResponseEvent: " + event);
        }

        private void handle(AudioResponseEvent event) {
            // synthesize speech
  "Got an AudioResponseEvent: " + event);
        }
    }

  4. Initiate the connection with the bot:

    StartConversationRequest.Builder startConversationRequestBuilder = StartConversationRequest.builder()
        .botId(botId)
        .botAliasId(botAliasId)
        .localeId(localeId);

    // configure the conversation mode with the bot (defaults to audio)
    startConversationRequestBuilder = startConversationRequestBuilder.conversationMode(ConversationMode.AUDIO);

    // assign a unique identifier for the conversation
    startConversationRequestBuilder = startConversationRequestBuilder.sessionId(sessionId);

    // build the initial request
    StartConversationRequest startConversationRequest =;

    CompletableFuture<Void> conversation = lexRuntimeServiceClient.startConversation(
        startConversationRequest,
        eventsPublisher,
        botResponseHandler);

  5. Establish the configurable parameters via ConfigurationEvent:

    public void configureConversation() {
        String eventId = "ConfigurationEvent-" + eventIdGenerator.incrementAndGet();

        ConfigurationEvent configurationEvent = StartConversationRequestEventStream
            .configurationEventBuilder()
            .eventId(eventId)
            .clientTimestampMillis(System.currentTimeMillis())
            .responseContentType(RESPONSE_TYPE)
            .build();

        eventWriter.writeConfigurationEvent(configurationEvent);

"sending a ConfigurationEvent to server:" + configurationEvent);
    }

  6. Send audio data to the server:

    public void writeAudioEvent(ByteBuffer byteBuffer) {
        String eventId = "AudioInputEvent-" + eventIdGenerator.incrementAndGet();

        AudioInputEvent audioInputEvent = StartConversationRequestEventStream
            .audioInputEventBuilder()
            .eventId(eventId)
            .clientTimestampMillis(System.currentTimeMillis())
            .audioChunk(SdkBytes.fromByteBuffer(byteBuffer))
            .contentType(AUDIO_CONTENT_TYPE)
            .build();

        eventWriter.writeAudioInputEvent(audioInputEvent);
    }

  7. Manage interruptions on the client side:

    private void handle(PlaybackInterruptionEvent event) {
"Got a PlaybackInterruptionEvent: " + event);

        callOperator.pausePlayback();

"Done with a PlaybackInterruptionEvent: " + event);
    }

  8. Disconnect the connection:

    public void disconnect() {
        String eventId = "DisconnectionEvent-" + eventIdGenerator.incrementAndGet();

        DisconnectionEvent disconnectionEvent = StartConversationRequestEventStream
            .disconnectionEventBuilder()
            .eventId(eventId)
            .clientTimestampMillis(System.currentTimeMillis())
            .build();

        eventWriter.writeDisconnectEvent(disconnectionEvent);

"sending a DisconnectionEvent to server:" + disconnectionEvent);
    }

You can now deploy the bot on your desktop to test it out.

Things to know

The following are a few important things to keep in mind when you’re using the Amazon Lex V2 console and APIs:

  • Regions and languages – The Streaming APIs are available in all existing Regions and support all current languages.
  • Interoperability with Lex V1 console – Streaming APIs are only available in the Lex V2 console and APIs.
  • Integration with Amazon Connect – As of this writing, Lex V2 APIs are not supported on Amazon Connect. We plan to provide this integration as part of our near-term roadmap.
  • Pricing – Please see the details on the Lex pricing page.

Try it out

The Amazon Lex Streaming APIs are available now, and you can start using them today. Give them a try: design a bot, launch it, and let us know what you think! To learn more, please see the Lex streaming API documentation.

About the Authors

Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of technology and education. Out of the office, Esther enjoys long walks along the beach, dinners with friends and friendly rounds of Mahjong.

Swapandeep Singh is an engineer on the Amazon Lex team. He works on making interactions with bots smoother and more human-like. Outside of work, he likes to travel and learn about different cultures.




Building a Complete AI Based Search Engine with Elasticsearch, Kubeflow and Katib




AI-based search engine

Building search systems is hard. Preparing them to work with machine learning is really hard. Developing a complete search engine framework integrated with AI is really really hard.

So let’s make one. ✌️

In this post, we’ll build a search engine from scratch and discuss how to further optimize its results by adding a machine learning layer using Kubeflow and Katib. This new layer will be capable of retrieving results considering the context of users, and it is the main focus of this article.

As we’ll see, thanks to Kubeflow and Katib, the final result is quite simple, efficient, and easy to maintain.

Complete pipeline executed by Kubeflow, responsible for orchestrating the whole system. Image by author.

To understand the concepts in practice, we’ll implement the system hands-on. As it’s built on top of Kubernetes, you can use any infrastructure you like (with appropriate adaptations). We’ll be using Google Cloud Platform (GCP) in this post.

We begin with a brief introduction to concepts and then move to the system implementation discussion.


So, without further ado:

1. You Know, For Search

If you’re tasked with building a search system for your company, or want to build one of your own, you’ll soon realize that the initial steps tend to be somewhat straightforward.

First and foremost, the search engine must contain documents for retrieval. As we’ll be working with Elasticsearch, let’s use it as our reference (for an introduction, please refer to their official docs).

Documents should be uploaded to Elasticsearch following a JSON format. If, for instance, we are building a search engine for a fashion eCommerce store, here’s an example of a document:

Example of document as uploaded to Elasticsearch. Image by author.
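
The original image isn’t reproduced here, so as a stand-in, here is a hedged sketch of what such a document might look like. All field names and values are made up for illustration; only the CTR value of 0.36 comes from the article’s later example.

```python
import json

# Hypothetical fashion eCommerce document; field names are illustrative.
doc = {
    "sku": "TSHIRT-001",
    "title": "Red cotton t-shirt",
    "description": "Lightweight crew-neck t-shirt",
    "brand": "AcmeWear",
    "category": "t-shirts",
    "price": 19.90,
    # Performance metrics stored alongside the document; the CTR field
    # is used later for boosting.
    "performances": {"CTR": 0.36},
}

print(json.dumps(doc, indent=2))
```

A document in this shape would be uploaded to Elasticsearch as a JSON body.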

Then comes the retrieval step, which in essence involves matching search queries with document fields:

Example of matching between search terms and document fields. Image by author.

The ranking phase applies some mathematical rules such as TF-IDF or BM25F to figure out how to properly rank, sorting documents from best to worst match. It’d be something like:

Example of properly ranked results as retrieved by Elasticsearch running BM25 scoring among the stored documents in the database. Image by author.
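
To make the scoring step less abstract, here is a minimal, self-contained sketch of the classic BM25 formula (with the usual k1 and b parameters). This is the textbook formula, not Elasticsearch’s exact implementation, which applies its own defaults and normalizations.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with classic BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)        # docs containing term
        idf = math.log(1 + (N - n + 0.5) / (n + 0.5))  # smoothed IDF
        tf = doc.count(term)                           # term frequency
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [
    "red cotton t-shirt".split(),
    "blue winter jacket".split(),
    "red summer dress".split(),
]
scores = [bm25_score("red t-shirt".split(), d, corpus) for d in corpus]
best = scores.index(max(scores))  # the t-shirt document wins
```

Sorting documents by this score from best to worst gives exactly the kind of ranked result list described above.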

Further optimization could leverage specific fields of documents containing performance metrics. For instance, in the previous example, the Click-Through Rate (CTR, i.e., the ratio between clicks and total impressions) of the t-shirt is CTR=0.36. Another retrieval layer could be added using this information, favoring documents with better CTR at the top of the results (also known as “boosting”):

Example adding performance metrics layer in retrieval rules. Previous second-best document raises to the top of the results. Image by author.
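
In Elasticsearch, this kind of boosting is commonly expressed with a function_score query. Below is a hedged sketch of the query body as a plain Python dict; `function_score` and `field_value_factor` are standard query DSL constructs, while the `performances.CTR` field name is assumed from the document example above.

```python
# Sketch of a function_score query that adds a document's CTR to its
# text-relevance score. Field names are illustrative.
query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "t-shirt"}},
            "field_value_factor": {
                "field": "performances.CTR",
                "factor": 1.0,
                "missing": 0.0,   # documents without a CTR get no boost
            },
            "boost_mode": "sum",  # add the CTR signal to the BM25 score
        }
    }
}
```

Sent as the body of a search request, this would float the higher-CTR document toward the top, as in the figure above.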

So far so good. But let’s see how to further optimize even more.

Consider that each user has a specific context. Let’s take our fashion online store as an example again. Some of the traffic may come from southern regions, where it may be warmer than in the north. Those users would probably rather be exposed to lighter clothing than to winter-specific products.

More context can be added to the equation: we could distinguish customers based on their favorite brands, categories, colors, sizes, device used, average order value, profession, age, and the list goes on and on…

Doing so requires some extra tools as well. Let’s dive a bit deeper into that.

2. The Machine Learning Layer

Learn-to-rank (LTR) is a field of machine learning that studies algorithms whose main goal is to properly rank a list of documents.

It works essentially like any other learning algorithm: it requires a training dataset, suffers from problems such as the bias-variance tradeoff, each model has advantages in certain scenarios, and so on.

What basically changes is that the cost function for the training process is designed to let the algorithm learn about ranking, and the output of the model is a score for how good a match a given document is for a given query.

Mathematically, it’s simply given by:

J = f(X)

X in our case comprises all the features we’d like to use to add context to the search. They can be values such as the region of the user, their age, favorite brand, the correlation between queries and document fields, and so on.

f is the ranking model which is supposed to be trained and evaluated.

Finally, J stands for judgment: for us, it’s an integer value that ranges from 0 (meaning the document is not a good match for a query given the features) up to 4 (the document is a very good match). We arrange documents from best to worst by using the judgments.

Our main goal is to obtain f, since it represents the ranking algorithm that adds the machine learning layer to the search results. And in order to obtain f, we need a dataset that already contains the judgment values; otherwise, we can’t train the models.

As it turns out, finding those values can be quite challenging. While the details of how to do so won’t be covered here, this post has a thorough discussion of the subject; in a nutshell, we use clickstream data of users’ interactions with search engines (their clicks and purchases) to fit models whose variables yield a proxy for the judgment value.
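
As a much cruder stand-in for the click models the post relies on, here is an illustrative heuristic that grades (query, document) pairs from aggregated clickstream counts. The thresholds are arbitrary and only meant to show the 0-4 judgment scale in action.

```python
# Deliberately simple proxy for judgments (real systems fit click
# models instead): purchases are the strongest signal, clicks weaker,
# bare impressions weakest.

def judgment(stats):
    """Map aggregated stats for one (query, document) pair to a 0-4 grade."""
    if stats["purchases"] > 0:
        return 4
    if stats["clicks"] >= 3:
        return 3
    if stats["clicks"] > 0:
        return 2
    if stats["impressions"] > 0:
        return 1
    return 0

samples = [
    {"purchases": 1, "clicks": 0, "impressions": 5},
    {"purchases": 0, "clicks": 1, "impressions": 9},
    {"purchases": 0, "clicks": 0, "impressions": 2},
]
grades = [judgment(s) for s in samples]  # -> [4, 2, 1]
```

A click model replaces these hard-coded thresholds with probabilities inferred from the data, but the output shape (a judgment per query-document pair) is the same.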

Example of Graphical Model implemented on pyClickModels for finding relevance of documents associated to search engines results. Image by author.

After computing the judgments, we are left with training the ranking models. Elasticsearch offers a learn-to-rank plugin, which we’ll use in this implementation. The plugin offers various ranking algorithms, ranging from decision trees to neural networks.

This is an example of the required training file:

Example of input file for the training step as required by Elasticsearch Learn2Rank plugin. Image by author.

The idea is to register, for each query (“women t-shirt”), all documents that were printed on the results page. For each one, we compute its expected judgment and build the matrix of features X.

In practice, what will happen is that we’ll first prepare all this data and feed it to the Learn-To-Rank plugin of Elasticsearch which will result in a trained ranking model. It can then be used to add the personalization layer we are looking for.
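
The training file in the image above follows the RankLib-style line format commonly used for learn-to-rank data: `<judgment> qid:<query_id> <feature_id>:<value> ... # <doc_id>`. Here is a hedged sketch of generating such lines; the feature values and document ID are made up.

```python
# Build one RankLib-style training line per (query, document) pair.
# Feature ids are 1-based; values here are illustrative.

def to_ranklib_line(judgment, qid, features, doc_id):
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{judgment} qid:{qid} {feats} # {doc_id}"

# Judgment 4 (very good match) for a hypothetical document with two
# features, e.g. CTR and a query-title match score.
line = to_ranklib_line(4, 1, [0.36, 12.5], "TSHIRT-001")
```

One such line per printed document, grouped by `qid`, is what the training step consumes.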

Further details on building X will be discussed soon.

We are now ready to train the models. So far so good. But then we still have a tricky problem: how do we know if it’s working?

2.1 The Validation Framework

We could choose from several methods to check the performance of a ranking model. The one we’ll discuss here is the average rank metric, based on what users either clicked or purchased (pySearchML focuses on purchase events only, but clicks can be used interchangeably).

Mathematically, it’s given by:

avg_rank = (1/|I|) * Σ_{i ∈ I} position(i)/N

where I is the set of items users purchased (or clicked), position(i) is the rank of item i in the retrieved results, and N is the total number of retrieved documents.

The formula sums the rank of each purchased (or clicked) item relative to the complete list of retrieved documents. The denominator is simply the cardinality of the set of items summed over in the process (the total items users either clicked or bought).

In practice, after training a ranking model, we'll loop through the validation dataset (which contains what users searched and what they purchased) and use each search term to send a query to Elasticsearch. We then compare the results with what users bought to compute the appropriate average rank.

Example of the validation framework in practice. For each user in dataset, we use their search term to retrieve from ES results with the ranking model already implemented. We then take the average rank of the purchased items by users and their position in the ES result. Image by author.

The image above illustrates the concept. For each user, we send their search term to Elasticsearch, which already contains the freshly trained model. We then compare the search results with what the user purchased and compute the rank. In the example, the red t-shirt appears at position 2 out of the 3 retrieved items; as it was the only item purchased, the rank is 2/3 ≈ 66%.

We run the same computation for all users in the database and average the results into a final rank value.
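As a minimal sketch of the metric (the helper name and the 1-indexed normalization are my own, following the worked t-shirt example above; pySearchML's own compute_rank, shown later, uses a zero-indexed variant normalized by len(docs) - 1):

```python
def average_rank(results, purchases):
    """Average normalized position of purchased items within search results.

    results:   list of retrieved document-id lists, one per search.
    purchases: list of purchased document-id lists, aligned with `results`.
    Lower is better; a random ranker hovers around 50%.
    """
    num = den = 0.0
    for docs, bought in zip(results, purchases):
        for item in bought:
            if item in docs:
                # 1-indexed position normalized by the size of the result list
                num += (docs.index(item) + 1) / len(docs)
                den += 1
    return num / den if den else None


# The red t-shirt example: the purchased item sits at position 2 of 3 -> ~66%
print(average_rank([['shirt_a', 'red_tshirt', 'shirt_c']], [['red_tshirt']]))
```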

Notice that the final rank metric should be lower than 50%; at 50%, the algorithm is performing no better than a random selector of documents.

This value is important, as it's used for selecting the best ranking model. That's where Katib from Kubeflow comes in.

Let’s see now how to put all these concepts together and build the search engine:

3. Kubeflow Orchestration

As discussed before, Kubeflow orchestrates the pipeline. Its responsibilities range from preparing the data for Elasticsearch and for training to running the entire training process.

It works by defining components and their respective tasks. For pySearchML, here’s the complete pipeline that was implemented:

def build_pipeline(
    bucket='pysearchml',
    es_host='elasticsearch.elastic-system.svc.cluster.local:9200',
    force_restart=False,
    train_init_date='20160801',
    train_end_date='20160801',
    validation_init_date='20160802',
    validation_end_date='20160802',
    test_init_date='20160803',
    test_end_date='20160803',
    model_name='lambdamart0',
    ranker='lambdamart',
    index='pysearchml'
):
    pvc = dsl.PipelineVolume(pvc='pysearchml-nfs')
    prepare_op = dsl.ContainerOp(
        name='prepare env',
        image=f'{PROJECT_ID}/prepare_env',
        arguments=[
            f'--force_restart={force_restart}',
            f'--es_host={es_host}',
            f'--bucket={bucket}',
            f'--model_name={model_name}'
        ],
        pvolumes={'/data': pvc}
    )
    val_reg_dataset_op = dsl.ContainerOp(
        name='validation regular dataset',
        image=f'{PROJECT_ID}/data_validation',
        arguments=[
            f'--bucket={bucket}/validation/regular',
            f'--validation_init_date={validation_init_date}',
            f'--validation_end_date={validation_end_date}',
            f'--destination=/data/pysearchml/{model_name}/validation_regular'
        ],
        pvolumes={'/data': pvc}
    ).set_display_name('Build Regular Validation Dataset').after(prepare_op)
    val_train_dataset_op = dsl.ContainerOp(
        name='validation train dataset',
        image=f'{PROJECT_ID}/data_validation',
        arguments=[
            f'--bucket={bucket}/validation/train',
            f'--validation_init_date={train_init_date}',
            f'--validation_end_date={train_end_date}',
            f'--destination=/data/pysearchml/{model_name}/validation_train'
        ],
        pvolumes={'/data': pvc}
    ).set_display_name('Build Train Validation Dataset').after(prepare_op)
    val_test_dataset_op = dsl.ContainerOp(
        name='validation test dataset',
        image=f'{PROJECT_ID}/data_validation',
        arguments=[
            f'--bucket={bucket}/validation/test',
            f'--validation_init_date={test_init_date}',
            f'--validation_end_date={test_end_date}',
            f'--destination=/data/pysearchml/{model_name}/validation_test'
        ],
        pvolumes={'/data': pvc}
    ).set_display_name('Build Test Validation Dataset').after(prepare_op)
    train_dataset_op = dsl.ContainerOp(
        name='train dataset',
        image=f'{PROJECT_ID}/data_train',
        command=['python', '/train/'],
        arguments=[
            f'--bucket={bucket}',
            f'--train_init_date={train_init_date}',
            f'--train_end_date={train_end_date}',
            f'--es_host={es_host}',
            f'--model_name={model_name}',
            f'--index={index}',
            f'--destination=/data/pysearchml/{model_name}/train'
        ],
        pvolumes={'/data': pvc}
    ).set_display_name('Build Training Dataset').after(prepare_op)
    katib_op = dsl.ContainerOp(
        name='pySearchML Bayesian Optimization Model',
        image=f'{PROJECT_ID}/model',
        command=['python', '/model/'],
        arguments=[
            f'--es_host={es_host}',
            f'--model_name={model_name}',
            f'--ranker={ranker}',
            '--name=pysearchml',
            f'--train_file_path=/data/pysearchml/{model_name}/train/train_dataset.txt',
            f'--validation_files_path=/data/pysearchml/{model_name}/validation_regular',
            f'--validation_train_files_path=/data/pysearchml/{model_name}/validation_train',
            f'--destination=/data/pysearchml/{model_name}/'
        ],
        pvolumes={'/data': pvc}
    ).set_display_name('Katib Optimization Process').after(
        val_reg_dataset_op, val_train_dataset_op, val_test_dataset_op, train_dataset_op
    )
    post_model_op = dsl.ContainerOp(
        name='Post Best RankLib Model to ES',
        image=f'{PROJECT_ID}/model',
        command=['python', '/model/'],
        arguments=[
            f'--es_host={es_host}',
            f'--model_name={model_name}',
            f'--destination=/data/pysearchml/{model_name}/best_model.txt'
        ],
        pvolumes={'/data': pvc}
    ).set_display_name('Post RankLib Model to ES').after(katib_op)
    _ = dsl.ContainerOp(
        name='Test Model',
        image=f'{PROJECT_ID}/model',
        command=['python', '/model/'],
        arguments=[
            f'--files_path=/data/pysearchml/{model_name}/validation_test',
            f'--index={index}',
            f'--es_host={es_host}',
            f'--model_name={model_name}'
        ],
        pvolumes={'/data': pvc}
    ).set_display_name('Run Test Step').after(post_model_op)

The pipeline receives various input parameters, such as bucket and model_name, whose values we can change at execution time (as we'll see soon).

Let’s see each component from the pipeline implementation and its purpose:

Step-by-step as implemented in pySearchML. Image by author.

1. prepare_env

Here’s how prepare_env component is defined:

prepare_op = dsl.ContainerOp( name='prepare env', image=f'{PROJECT_ID}/prepare_env', arguments=[f'--force_restart={force_restart}', f'--es_host={es_host}', f'--bucket={bucket}', f'--model_name= {model_name}'], pvolumes={'/data': pvc}
  • image is a Docker image reference for the container to run in that step.
  • arguments are input parameters sent to the script executed in the Docker image's ENTRYPOINT.
  • pvolumes mounts the volume claim into /data.

Here’s all files in prepare_env:

  • The entry-point script is responsible for running queries against BigQuery and preparing Elasticsearch. One of its input arguments is model_name, which sets which folder to use as reference for processing data. lambdamart0 is an algorithm that has already been implemented to work with the Google Analytics (GA) public sample dataset.
  • Dockerfile bundles the whole code together and has as its ENTRYPOINT the execution of the script:
FROM python:3.7.7-alpine3.12 as python
COPY kubeflow/components/prepare_env /prepare_env
WORKDIR /prepare_env
COPY ./key.json .
ENV GOOGLE_APPLICATION_CREDENTIALS=./key.json
RUN pip install -r requirements.txt
ENTRYPOINT ["python", ""]
  • lambdamart0 is a folder dedicated to the implementation of the algorithm with that name. It's built to process the GA public data and works as an example of the system. Here are the files it contains:
  • ga_data.sql is a query responsible for retrieving documents from the GA public dataset and exporting them to Elasticsearch.
  • es_mapping.json is the index definition for each field of the documents.
  • features carries the definitions of X as discussed before. In the lambdamart0 example, it uses the GA public data as reference for building the features.

Notice the feature called name.json:

{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {"match": {"name": "{{search_term}}"}}
      ]
    }
  },
  "params": ["search_term"],
  "name": "BM25 name"
}

The Learn-To-Rank plugin requires that each feature be defined as a valid Elasticsearch query; the resulting scores become the feature values X.

In the previous example, the feature receives a parameter search_term and matches it against the name field of each document, returning the BM25 score, which effectively becomes our "X0".
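The Learn-To-Rank plugin renders such feature queries with Mustache-style parameter slots. A minimal sketch of that substitution, assuming a naive string replacement (the render helper is mine, for illustration only):

```python
import json

# Same shape as the name.json feature above, with a {{search_term}} slot
FEATURE_TEMPLATE = '''
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [{"match": {"name": "{{search_term}}"}}]
    }
  }
}
'''


def render(template: str, params: dict) -> dict:
    # Naive Mustache-style substitution, enough for simple {{param}} slots
    for key, value in params.items():
        template = template.replace('{{%s}}' % key, value)
    return json.loads(template)


query = render(FEATURE_TEMPLATE, {'search_term': 'women t-shirt'})
print(query['query']['bool']['should'][0]['match']['name'])  # women t-shirt
```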

Using BM25 between the query and the name field is not enough to personalize results, though. Another feature that moves toward the customization layer is channel_group.json, defined as follows:

{
  "query": {
    "function_score": {
      "query": {"match_all": {}},
      "script_score": {
        "script": {
          "params": {"channel_group": "{{channel_group}}"},
          "source": "if (params.channel_group == 'paid_search') { return doc[''].value * 10 } else if (params.channel_group == 'referral') { return doc[''].value * 10 } else if (params.channel_group == 'organic_search') { return doc[''].value * 10 } else if (params.channel_group == 'social') { return doc[''].value * 10 } else if (params.channel_group == 'display') { return doc[''].value * 10 } else if (params.channel_group == 'direct') { return doc[''].value * 10 } else if (params.channel_group == 'affiliates') { return doc[''].value * 10 }"
        }
      }
    }
  },
  "params": ["channel_group"],
  "name": "channel_group"
}

It receives as input the parameter channel_group (the channel that brought the user to our web store) and returns the document's CTR for that respective channel.

This effectively prepares the model to distinguish users by their origin and to rank each group accordingly. Users coming from paid sources might behave differently from those coming through organic channels, for instance. If training goes well, the ranking algorithm should be prepared to handle these situations.
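The per-channel CTR values such a feature reads would be precomputed per document from the clickstream. A hypothetical sketch of that aggregation (function name and event layout are my assumptions, not pySearchML code):

```python
from collections import defaultdict


def channel_ctr(events):
    """Compute click-through rate per (doc, channel) pair from raw
    impression events.

    events: iterable of dicts like
      {'doc': 'GGOE...', 'channel': 'paid_search', 'click': 0 or 1}
    """
    clicks = defaultdict(int)
    views = defaultdict(int)
    for e in events:
        key = (e['doc'], e['channel'])
        views[key] += 1
        clicks[key] += e['click']
    return {key: clicks[key] / views[key] for key in views}


events = [
    {'doc': 'sku1', 'channel': 'paid_search', 'click': 1},
    {'doc': 'sku1', 'channel': 'paid_search', 'click': 0},
    {'doc': 'sku1', 'channel': 'organic_search', 'click': 1},
]
print(channel_ctr(events)[('sku1', 'paid_search')])  # 0.5
```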

Still, this doesn’t tell much on how to personalize results by using intrinsic characteristics of each user. So, here’s one possibility for solving that. The feature avg_customer_price.json is defined as:

{
  "query": {
    "function_score": {
      "query": {"match_all": {}},
      "script_score": {
        "script": {
          "params": {"customer_avg_ticket": "{{customer_avg_ticket}}"},
          "source": "return Math.log(1 + Math.abs(doc['price'].value - Float.parseFloat(params.customer_avg_ticket)))"
        }
      }
    }
  },
  "params": ["customer_avg_ticket"],
  "name": "customer_avg_ticket"
}

It receives as input the parameter customer_avg_ticket and returns, for each document, the log of the distance between the user's average ticket and the document's price.

Now the ranking model can learn, in the training phase, how to rank each item based on how distant its price is from the user's average spending on the website.
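The Painless script above reduces to a simple formula; a sketch in Python (helper name is mine):

```python
import math


def price_distance_score(doc_price: float, customer_avg_ticket: float) -> float:
    """Log of the distance between the document price and the user's
    average ticket. Smaller means the product price is closer to what
    the user usually spends."""
    return math.log(1 + abs(doc_price - customer_avg_ticket))


# A $10 product for a user with a $13 average ticket: log(1 + 3) ~ 1.386
print(round(price_distance_score(10.0, 13.0), 4))
```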

With those three types of features, we can add a complete personalization layer to a search system on top of Elasticsearch. Features can be anything, as long as they can be abstracted into a valid search query that returns a score, eventually translated into our values X.

The prepare_env component then proceeds as follows:

  • Features are exported to Elasticsearch.
  • An index is created on Elasticsearch that defines documents fields.
  • Documents are queried from BigQuery and uploaded into Elasticsearch.
  • RankLib requirements are created (feature set store and so on).

To implement a new model with new data and features, just create another folder inside prepare_env (something like modelname2) and define how it queries data and uploads it to Elasticsearch.

2. Validation Datasets

This is a simple step. It consists of retrieving from BigQuery what users searched, the context of each search and the list of products purchased.

Here’s the BigQuery query used for retrieving the data. It basically selects all users, their searches and purchases and then combines with their context. An example of the results:

{
  "search_keys": {
    "search_term": "office",
    "channel_group": "direct",
    "customer_avg_ticket": "13"
  },
  "docs": [
    {"purchased": ["GGOEGOAQ012899"]}
  ]
}

search_keys can contain any available information that sets the context of customers. In the previous example, we’re using their channel group and average spending ticket on the website.

This data is what we feed into the validation framework when computing the average rank as discussed before.

Notice that the system builds three different validation datasets: one for the training period, one for regular validation and a third for the final testing step. The idea is to analyze bias and variance in the trained models.

3. Training Dataset

This is the component responsible for building the RankLib training file discussed before. The complete script is actually quite simple. First, it downloads from BigQuery the input clickstream data, which consists of users' interactions on search pages. Here's an example:

{
  "search_keys": {
    "search_term": "drinkware",
    "channel_group": "direct",
    "customer_avg_ticket": "20"
  },
  "judgment_keys": [
    {
      "session": [
        {"doc": "GGOEGDHC017999", "click": "0", "purchase": "0"},
        {"doc": "GGOEADHB014799", "click": "0", "purchase": "0"},
        {"doc": "GGOEGDHQ015399", "click": "1", "purchase": "0"}
      ]
    }
  ]
}

Notice that the keys associated with the search are aggregated inside search_keys. Those are the values we send to Elasticsearch to fill in each feature X, as discussed in prepare_env. In the JSON example above, we know that the user's search context is:

  • Searched for drinkware.
  • Came to the store directly.
  • Spent on average $20 on the website.

judgment_keys aggregates user sessions, each composed of the documents the user saw on the search page and their interactions with each document.

This information is then sent to pyClickModels, which processes the data and evaluates a judgment for each query-document pair. The result is a newline-delimited JSON file, as follows:

{
  "search_term:bags|channel_group:organic_search|customer_avg_ticket:30": {
    "GGOEGBRJ037299": 0.3333377540111542,
    "GGOEGBRA037499": 0.222222238779068,
    "GGOEGBRJ037399": 0.222222238779068
  }
}

Notice that the key is the entire search context (search term, channel group and average customer ticket concatenated together), and its value maps each document to its estimated judgment.
As discussed before, we want our search engine to be aware of context and to optimize on top of it. As a consequence, judgments are extracted based on the entire selected context, not just the search_term.

By doing so, we can differentiate documents per context: a product might receive judgment 4 for customers coming from northern regions and judgment 0 otherwise, for example.
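Parsing that composite key back into a context dict is a one-liner (the key layout follows the JSON example above; the variable names are mine):

```python
# The composite key produced for each query encodes the full search context
key = 'search_term:bags|channel_group:organic_search|customer_avg_ticket:30'

# Split on '|' to get each key:value pair, then on ':' to build the dict
context = dict(part.split(':') for part in key.split('|'))
print(context['channel_group'])  # organic_search
```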

Notice that the judgment values, as given by pyClickModels, range between 0 and 1. As the Learn-To-Rank Elasticsearch plugin is built on top of RankLib, judgments are expected to be integers between 0 and 4, inclusive. What we do, then, is transform the values using their percentiles as reference. Here's the complete code for building the final judgment files:

import gzip
import json
import os
from shutil import rmtree

import numpy as np
from pyClickModels import DBN


def build_judgment_files(model_name: str) -> None:
    model = DBN.DBNModel()
    clickstream_files_path = f'/tmp/pysearchml/{model_name}/clickstream/'
    model_path = f'/tmp/pysearchml/{model_name}/model/model.gz'
    rmtree(os.path.dirname(model_path), ignore_errors=True)
    os.makedirs(os.path.dirname(model_path))
    judgment_files_path = f'/tmp/pysearchml/{model_name}/judgments/judgments.gz'
    rmtree(os.path.dirname(judgment_files_path), ignore_errors=True)
    os.makedirs(os.path.dirname(judgment_files_path))
    # Fit the DBN click model on the clickstream data
    model.fit(clickstream_files_path, iters=10)
    model.export_judgments(model_path)
    with gzip.GzipFile(judgment_files_path, 'wb') as f:
        for row in gzip.GzipFile(model_path):
            row = json.loads(row)
            search_keys = list(row.keys())[0]
            docs_judgments = row[search_keys]
            search_keys = dict(e.split(':') for e in search_keys.split('|'))
            judgments_list = [judge for doc, judge in docs_judgments.items()]
            # Skip queries whose documents all received the same judgment
            if all(x == judgments_list[0] for x in judgments_list):
                continue
            percentiles = np.percentile(judgments_list, [20, 40, 60, 80, 100])
            judgment_keys = [
                {'doc': doc, 'judgment': process_judgment(percentiles, judgment)}
                for doc, judgment in docs_judgments.items()
            ]
            result = {'search_keys': search_keys, 'judgment_keys': judgment_keys}
            f.write(json.dumps(result).encode() + '\n'.encode())


def process_judgment(percentiles: list, judgment: float) -> int:
    if judgment <= percentiles[0]:
        return 0
    if judgment <= percentiles[1]:
        return 1
    if judgment <= percentiles[2]:
        return 2
    if judgment <= percentiles[3]:
        return 3
    if judgment <= percentiles[4]:
        return 4

Here’s an example of the output of this step:

{
  "search_keys": {
    "search_term": "office",
    "channel_group": "organic_search",
    "customer_avg_ticket": "24"
  },
  "judgment_keys": [
    {"doc": "0", "judgment": 0},
    {"doc": "GGOEGAAX0081", "judgment": 4},
    {"doc": "GGOEGOAB016099", "judgment": 0}
  ]
}

This data then needs to be transformed into the training file RankLib requires. This is where we combine the judgments, documents and query context with the features (here's the code for retrieving X from Elasticsearch).

Each JSON row from the previous step, containing the search context and judgment keys, is looped over and sent as a query against Elasticsearch with the search_keys as input parameters. The result is each value of X, as defined in the earlier prepare_env step.

The end result is a training file that looks something like this:

0	qid:0	1:3.1791792	2:0	3:0.0	4:2.3481672
4	qid:0	1:3.0485907	2:0	3:0.0	4:2.3481672
0	qid:0	1:3.048304	2:0	3:0.0	4:0
0	qid:0	1:2.9526825	2:0	3:0.0	4:0
4	qid:1	1:2.7752903	2:0	3:0.0	4:3.61228
0	qid:1	1:2.8348017	2:0	3:0.0	4:2.3481672

For each query and for each document we have the estimated judgment as computed by pyClickModels, the id of the query and then a list of features X with their respective values.
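Each line follows the SVMlight-style layout RankLib expects: judgment, query id, then index:value feature pairs. A small formatting sketch (the helper name is my own, for illustration):

```python
def ranklib_line(judgment: int, qid: int, features: list) -> str:
    """Format one RankLib training row:
    '<judgment> qid:<qid> 1:<f1> 2:<f2> ...' with 1-indexed features."""
    feats = ' '.join(f'{i}:{value}' for i, value in enumerate(features, start=1))
    return f'{judgment} qid:{qid} {feats}'


print(ranklib_line(4, 0, [3.0485907, 0, 0.0, 2.3481672]))
# 4 qid:0 1:3.0485907 2:0 3:0.0 4:2.3481672
```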

With this file, we can now train the ranking algorithms.

4. Katib Optimization

Katib is a Kubeflow tool that offers an interface for automatic hyperparameter optimization. It supports several methods; Bayesian Optimization is the one selected in pySearchML.

Example of Bayesian Optimization: as more data points are sampled from the allowed domain, the algorithm gets closer to the optimum of a given function. In pySearchML, the domain is the set of variables defining how a ranker fits the data, and the cost function being optimized is the average rank. Image taken from the Wikimedia Foundation.

What Katib does is select, for each hyperparameter, a new value based on an exploration-exploitation trade-off. It then tests the new model and observes the results, which inform future steps.

For pySearchML, each parameter is an input to RankLib that defines how the model will be fit (how many trees to use, total leaf nodes, how many neurons in a net and so on).

Katib is defined through a Kubernetes Custom Resource. We can run it by defining a YAML file and deploying it to the cluster, something like:

kubectl create -f katib_def.yaml

Katib reads the YAML file and starts trials, each experimenting with a specific set of hyperparameter values. It can instantiate multiple pods running in parallel, executing the code specified in the experiment definition.

Here are the files in this step. The launcher script is responsible for starting Katib from Python: it receives input arguments, builds a YAML definition and uses the Kubernetes API to start Katib from the script itself.

experiment.json works as a template for the definition of the experiment. Here’s its definition:

{
  "apiVersion": "",
  "kind": "Experiment",
  "metadata": {
    "namespace": "kubeflow",
    "name": "",
    "labels": {"": "1.0"}
  },
  "spec": {
    "objective": {
      "type": "minimize",
      "objectiveMetricName": "Validation-rank",
      "additionalMetricNames": ["rank"]
    },
    "algorithm": {"algorithmName": "bayesianoptimization"},
    "parallelTrialCount": 1,
    "maxTrialCount": 2,
    "maxFailedTrialCount": 1,
    "parameters": [],
    "trialTemplate": {
      "goTemplate": {
        "rawTemplate": {
          "apiVersion": "batch/v1",
          "kind": "Job",
          "metadata": {
            "name": "{{.Trial}}",
            "namespace": "{{.NameSpace}}"
          },
          "spec": {
            "template": {
              "spec": {
                "restartPolicy": "Never",
                "containers": [
                  {
                    "name": "{{.Trial}}",
                    "image": "{PROJECT_ID}/model",
                    "command": [
                      "python /model/ --train_file_path={train_file_path} --validation_files_path={validation_files_path} --validation_train_files_path={validation_train_files_path} --es_host={es_host} --destination={destination} --model_name={model_name} --ranker={ranker} {{- with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}"
                    ],
                    "volumeMounts": [
                      {"mountPath": "/data", "name": "pysearchmlpvc", "readOnly": false}
                    ]
                  }
                ],
                "volumes": [
                  {
                    "name": "pysearchmlpvc",
                    "persistentVolumeClaim": {"claimName": "pysearchml-nfs", "readOnly": false}
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}

It essentially defines how many pods to run in parallel and which Docker image, along with its input command, to run for each trial. Notice that the total pods running in parallel, as well as the maximum number of trials, are hard-coded in pySearchML; a better approach would be to receive those values as pipeline parameters and replace them accordingly. The launcher script reads this template, builds the final YAML definition and sends it to Kubernetes, which starts the Katib process.

One of the input parameters is ranker, the ranking algorithm to select from RankLib (lambdaMart, listNet and so on). Each ranker has its own set of parameters; here's the example for the LambdaMart algorithm:

def get_ranker_parameters(ranker: str) -> List[Dict[str, Any]]:
    return {
        'lambdamart': [
            {"name": "--tree", "parameterType": "int", "feasibleSpace": {"min": "1", "max": "500"}},
            {"name": "--leaf", "parameterType": "int", "feasibleSpace": {"min": "2", "max": "40"}},
            {"name": "--shrinkage", "parameterType": "double", "feasibleSpace": {"min": "0.01", "max": "0.2"}},
            {"name": "--tc", "parameterType": "int", "feasibleSpace": {"min": "-1", "max": "300"}},
            {"name": "--mls", "parameterType": "int", "feasibleSpace": {"min": "1", "max": "10"}}
        ]
    }.get(ranker)

Katib selects parameters from the domain defined above and runs the training script, where RankLib is effectively used to train the ranking models. An example of the command as implemented in Python:

cmd = ('java -jar ranklib/RankLib-2.14.jar -ranker '
       f'{ranker} -train {args.train_file_path} -norm sum -save '
       f'{args.destination}/model.txt '
       f'{(" ".join(X)).replace("--", "-").replace("=", " ")} -metric2t ERR')

This string is sent to a subprocess call (notice it requires Java because of RankLib), which starts the training process. The result is a newly trained ranking model that can be exported to Elasticsearch.

As soon as the model is fit, the validation script is invoked to compute the expected rank. The steps that take place are:

  • The script loops through each JSON from the validation dataset.
  • Each row contains the search context, which is then used to build an Elasticsearch query. Here's the query used by the model lambdamart0, which we'll use later on:
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "bool": {
              "minimum_should_match": 1,
              "should": [
                {
                  "multi_match": {
                    "operator": "and",
                    "query": "{query}",
                    "type": "cross_fields",
                    "fields": ["sku", "name", "category"]
                  }
                }
              ]
            }
          }
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "",
            "factor": 10,
            "missing": 0,
            "modifier": "none"
          }
        }
      ],
      "boost_mode": "sum",
      "score_mode": "sum"
    }
  },
  "rescore": {
    "window_size": "{window_size}",
    "query": {
      "rescore_query": {
        "sltr": {
          "params": "{search_keys}",
          "model": "{model_name}"
        }
      },
      "rescore_query_weight": 20,
      "query_weight": 0.1,
      "score_mode": "total"
    }
  }
}
  • Given the recently built query, a request is sent to Elasticsearch.
  • A comparison happens between the search results and purchased documents.

Here’s the code responsible for building the Elasticsearch query:

def get_es_query(
    search_keys: Dict[str, Any],
    model_name: str,
    es_batch: int = 1000
) -> Dict[str, Any]:
    """
    Builds the Elasticsearch query to be used when retrieving data.

    Args
    ----
    search_keys: Dict[str, Any]
        Search query sent by the customer, as well as other variables that
        set its context, such as region, favorite brand and so on.
    model_name: str
        Name of the RankLib model saved on Elasticsearch.
    es_batch: int
        How many documents to retrieve.

    Returns
    -------
    query: Dict[str, Any]
        Final query as a dict.
    """
    # it's expected that an ES query will be available at:
    # ./queries/{model_name}/es_query.json
    query = open(f'queries/{model_name}/es_query.json').read()
    query = json.loads(query.replace('{query}', search_keys['search_term']))
    # We just want to retrieve the id of the document to evaluate the ranks
    # between customer purchases and the retrieved result list
    query['_source'] = '_id'
    query['size'] = es_batch
    query['rescore']['window_size'] = 50  # Hardcoded to optimize first 50 skus
    query['rescore']['query']['rescore_query']['sltr']['params'] = search_keys
    query['rescore']['query']['rescore_query']['sltr']['model'] = model_name
    return query

Notice that the rescore_query parameter is what triggers the machine learning layer of the Elasticsearch Learn-To-Rank plugin.

Finally, the function compute_rank puts it all together as shown below:

def compute_rank(
    search_arr: List[str],
    purchase_arr: List[List[Dict[str, List[str]]]],
    rank_num: List[float],
    rank_den: List[float],
    es_client: Elasticsearch
) -> None:
    """
    Sends queries against Elasticsearch and compares results with what
    customers purchased. Computes the average rank position of where the
    purchased documents fall within the retrieved items.

    Args
    ----
    search_arr: List[str]
        Searches made by customers as observed in validation data. We send
        those against Elasticsearch and compare results with purchase data.
    purchase_arr: List[List[Dict[str, List[str]]]]
        List of documents that were purchased by customers.
    rank_num: List[float]
        Numerator value of the rank equation. Defined as a list to emulate
        a pointer.
    rank_den: List[float]
        Denominator value of the rank equation.
    es_client: Elasticsearch
        Python Elasticsearch client.
    """
    idx = 0
    if not search_arr:
        return
    request = os.linesep.join(search_arr)
    response = es_client.msearch(body=request, request_timeout=60)
    for hit in response['responses']:
        docs = [doc['_id'] for doc in hit['hits'].get('hits', [])]
        if not docs or len(docs) < 2:
            continue
        purchased_docs = [
            doc for purch in purchase_arr[idx] for doc in purch['purchased']
        ]
        ranks = np.where(np.in1d(docs, purchased_docs))[0]
        idx += 1
        if ranks.size == 0:
            continue
        rank_num[0] += ranks.sum() / (len(docs) - 1)
        rank_den[0] += ranks.size
    print('rank num: ', rank_num[0])
    print('rank den: ', rank_den[0])
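The search_arr passed to msearch above is a newline-delimited sequence of header/query pairs, as the Elasticsearch _msearch API expects. A hedged sketch of how such a body could be assembled (the helper is mine, not pySearchML code):

```python
import json
import os


def build_msearch_body(queries, index='pysearchml'):
    """Assemble an Elasticsearch _msearch body: one header line and one
    query line per search, joined by newlines."""
    lines = []
    for query in queries:
        lines.append(json.dumps({'index': index}))  # header line
        lines.append(json.dumps(query))             # query line
    return os.linesep.join(lines)


body = build_msearch_body([{'query': {'match_all': {}}, 'size': 3}])
print(body.splitlines()[0])  # {"index": "pysearchml"}
```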

Katib instantiates sidecar pods that keep reading the stdout of the training pod. When a sidecar identifies the string Validation-rank=(...), it uses that value as the result of the optimization process.
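A sketch of the kind of pattern matching this relies on (the regex is my illustration; Katib's actual metrics collector is configured through the experiment spec):

```python
import re


def extract_metric(log_line: str):
    """Pull the Validation-rank value out of a training pod's stdout line."""
    match = re.search(r'Validation-rank=([\d.]+)', log_line)
    return float(match.group(1)) if match else None


print(extract_metric('Validation-rank=0.3605'))  # 0.3605
```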

A persistent volume is used in the process to save the definition of the best model trained by Katib, which is then used by the next component.

5. Post RankLib Model

The most difficult parts are already done. This component simply reads the definition of the best model, saved as a text file, and uploads it to Elasticsearch.

Notice one of the main advantages of this design: this component could export the model to a production Elasticsearch while the whole optimization happens on a staging replica.

6. Final Testing

Finally, with the best model exported to Elasticsearch, the system has its best optimized ranking model at its disposal. In this step, a final validation is executed, not only to verify that everything worked but also to provide further information on whether the system suffers from bias or variance issues.

That’s pretty much it! Let’s run some code now to see this whole framework in action.

4. Hands-On Section

Time to implement the whole architecture in practice! The complete code is available in the pySearchML repository:


In this section, we'll use GCP to run the code with real data. Keep in mind that there will be costs (a few cents) associated with running this experiment.

For those new to GCP, there's a $300 free credit that lasts for a year; just sign in and create a project for this tutorial (pysearchml, for instance). You should end up with access to a dashboard that looks like this:

Example of a GCP dashboard project. Image by author.

gcloud is required for interacting with GCP through the command line. Installing it is quite straightforward. After the initial setup, make sure you can log in by running:

gcloud auth login

Now the rest is quite simple. Clone pySearchML locally:

git clone pysearchml && cd pySearchML

Enable Kubernetes Engine in your project. After that, just trigger the cloudbuild execution, which creates the whole required infrastructure (this step should take between 5 and 10 minutes).

Here’s how the build triggers the run:

_VERSION='0.0.0' ./kubeflow/build/ gcloud builds submit --no-source --config kubeflow/build/cloudbuild.yaml --substitutions $SUBSTITUTIONS --timeout=2h

You can choose appropriate values for the SUBSTITUTIONS variable. Notice that _VERSION sets the pipeline version exported to Kubeflow. After everything is set, just run the script:

Steps executed on cloudbuild. Image by author.
  1. Prepares secret keys to allow access and authorization into GCP tools.
  2. Prepares known_hosts in building machine.
  3. Clones pySearchML locally.
  4. The file that runs in step 4 is responsible for creating the Kubernetes cluster on Google Kubernetes Engine (GKE) and for deploying Elasticsearch, Kubeflow and Katib. In parallel, all Docker images required by the system are built and deployed to Google Container Registry (GCR) for later use in Kubeflow.
  5. Runs several unit tests. Those ended up being important to confirm the system was working as expected. It also compiles the Kubeflow pipeline in parallel.
  6. Finally, deploys the pipeline to the cluster.

After it’s done, if you browse to your console and select “Kubernetes Engine”, you’ll see that it’s already up and running:

Kubernetes cluster deployed into GKE ready to run. Image by author.

It’s a small cluster as we won’t be using much data, this helps to further save on costs.

Kubeflow and Katib are already installed. To access them, first connect gcloud to the cluster by running:

gcloud container clusters get-credentials pysearchml

After that, port-forward the Kubeflow UI service to your local machine by running:

kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80 1>/dev/null &

Now, if you access your localhost on port 8080, this is what you should see:

Kubeflow Dashboard. Image by author.

And the pipeline is ready for execution.

This experiment uses the public Google Analytics sample dataset, which consists of a small sample of customers browsing the Google Store. It spans 20160801 to 20170801 and contains what each user searched and how they interacted with the search results.

Select pysearchml_0.0.0 and then select "+Create Run". You should see a screen with all the input parameters defined in the Python pipeline script. After choosing appropriate values, just run the code.

After execution, here’s the expected result:

Full execution of the entire pipeline as defined in pySearchML. Image by author.

Output of Katib component:

Example of output as printed by the Katib component. Image by author.

We can see a rank of 36.05%. Then, we can compare results from the test component:

Example of output as printed by test component. Image by author.

The rank here is 40.16%, a bit worse than on the validation dataset. This might indicate that the model suffers from a bit of overfitting; more data or further parameter exploration might help alleviate the problem.

And there you have it! Elasticsearch now has a fully trained machine learning layer to improve results based on customer context.

If you want to navigate through the files created at each step, there's a deployment available for that. In the pySearchML folder, just run:

kubectl apply -f kubeflow/disk-busybox.yaml

If you run kubectl -n kubeflow get pods, you'll see a pod named something like "nfs-busybox-(…)". If you exec into it, you'll have access to the files:

kubectl -n kubeflow exec -it nfs-busybox-(...) -- sh

They should be located at /mnt/pysearchml.

There’s also a quick-and-dirty visualizer for the whole process. Just run:

kubectl port-forward service/front -n front 8088:8088 1>/dev/null &

Then access localhost:8088 in your browser. You should see this (quick and ugly) interface:

Front-end interface for running queries on the newly trained Elasticsearch. Image by author.

Example of results:

It not only lets us play around with the results but also gives us a better sense of whether the optimization pipeline is working.

And that’s pretty much all it takes to have a complete search engine running with AI optimization, ready to handle incoming traffic for any store.

5. Conclusion

Now, that was a challenge!

Building pySearchML was quite tough and I can safely say it was one of the most brutal challenges I’ve ever faced 😅. Countless designs, architectures, and infrastructures were considered, but most failed.

The idea of integrating the whole process on top of Kubeflow and Katib came only later, after several alternatives had already been tested.

The advantage of this design is how simple and direct the final code becomes. It’s fully modular: each component is responsible for a single task, and Kubeflow orchestrates the whole execution. On top of that, we can focus mainly on code development and let Katib do the hard work of finding the best parameters.

The development process was not straightforward. Several lessons had to be learned, including concepts from Kubernetes and its available resources. Still, it was all well worth it. As a result, an entire search engine could be built from scratch with a few lines of code, ready to handle real traffic.

As a next step, one could consider replacing RankLib with a deep learning algorithm that further extracts context from the data. One of the main challenges in doing so is that both the response time and the costs of the system could increase (pros and cons have to be weighed).

Regardless of the ranking algorithm used, the architecture remains the same for the most part.

Hopefully, that was a useful post for those working in this field. Now it’s time for us to take some rest, meditate on the lessons learned and prepare ourselves for the next adventure :).

As for this post, it certainly deserves being concluded with the mission accomplished soundtrack.

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.





Scientists Made a Biohybrid Nose Using Cells From Mosquitoes




Thanks to biological parts of a mosquito’s “nose,” we’re finally closer to Smell-O-Vision for computers. And a way to diagnose early cancer.

With the recent explosion in computing hardware prowess and AI, we’ve been able to somewhat adequately duplicate our core senses with machines. Computer vision and bioengineered retinas tag-teamed to bolster artificial vision. Smart prosthetics seamlessly simulate touch, pain, pressure, and other skin sensations. Hearing devices are ever more capable of isolating specific sounds from a muddy cacophony of noises.

Yet what’s been missing is the sense of smell. Although humans are hardly smelling maestros in the animal kingdom, we can nevertheless detect roughly one trillion odors, often with only a single scent molecule drifting into our noses. Despite their best efforts, translating this prowess into a digital, artificial computer “nose” has eluded scientists. One major problem is reverse-engineering the sensitivity of our biological smelling mechanics—otherwise known as our olfactory system.

“Odors, airborne chemical signatures, can carry useful information about environments … However, this information is not harnessed well due to a lack of sensors with sufficient sensitivity and selectivity,” said Dr. Shoji Takeuchi from the Biohybrid Systems Laboratory at the University of Tokyo.

But here’s the thing: evolution already came up with a solution. Rather than rebuild olfaction from scratch, why not just use what’s available?

In a new paper published in Science Advances, Takeuchi and colleagues did just that. They tapped the odor-sensing components of a Yellow Fever mosquito (yikes), and rebuilt the entire construct with synthetic biology. Using a parallel chip design, they then carefully placed these biological components as an array onto a chip and monitored the setup with a computer.

By infusing odor chemicals into tiny liquid droplets—think essential oils mixing with liquid in a humidifier—the chip could detect odors with unprecedented sensitivity. Because each biological odor-sensing component is tailored to its favorite chemical, it’s possible in theory to use a single 16-channel chip to detect over a quadrillion mixes of odors.

“These biohybrid sensors…are highly sensitive,” said Takeuchi. With AI, he added, the biohybrid sensor could be further amped up to analyze increasingly complex mixes of chemicals, both in the environment and as a disease-hunting breathalyzer to monitor health.

The Anatomy of Smell

The animal olfactory system is an evolutionary work of art. Although specific biological setups differ between species, the general concept is pretty similar.

It all starts in the nose. Our noses are densely packed with olfactory cells, huddled high up in our nasal passages. Think of these cells as fatty bubbles, each with electrical wires to transfer information up the chain. Dotted along the outer rim of the bubbles are proteins, called olfactory receptors. These are basically “smart” tunnels that connect the outside environment with the interior sanctuary of the cells.

Here’s the crux: each tunnel, or receptor, is tailored to a single odor chemical. It’s normally shut. When an odor—say, vanillin, the dominant chemical in vanilla—drifts into the nasal passage, it grasps onto its preferred receptor. This action opens the receptor tunnel, causing ions to flow into the cell. Translation? It triggers an electrical current, a sort of “on” signal, that gets sent to the brain.

Now take something with a more complicated scent, say, French toast. Multiple chemical molecules grab onto their respective odor receptors. Each sends a current, and these currents are analyzed together as they race along nerve highways toward the brain. Based on previous experience, the brain can then analyze the combined data and determine “oh, that’s French toast!”

As you can tell, combination is key. It’s how our 400 or so olfactory receptors can detect a trillion different odors. It’s also what the Japanese team tapped into for inspiration in their new artificial “nose.”
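The arithmetic behind that claim can be checked directly. As a back-of-the-envelope sketch (illustrative combinatorics, not a biological model), even small subsets of roughly 400 receptor types produce an astronomical number of distinct activation patterns:

```python
import math

RECEPTOR_TYPES = 400  # approximate number of human olfactory receptor types

def activation_patterns(k: int) -> int:
    """Number of distinct odor 'codes' if exactly k receptor types fire together."""
    return math.comb(RECEPTOR_TYPES, k)

# Five co-activated receptors give tens of billions of patterns;
# six already push past a trillion.
print(activation_patterns(5))
print(activation_patterns(6))
```

Since real odors activate varying numbers of receptors at varying intensities, the true coding capacity is far larger still.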

Mosquitoes to the Rescue

Rather than reconstructing a human smell receptor, the team turned to mosquitoes. It’s not that the team particularly loves these blood-sucking demons. It’s because the mosquito odor receptors are built to heighten their sense of smell. In addition to the usual odor-grabbing component, these nuisance bugs also have a separate component that amplifies the electrical signal in a biological way. This allows them to smell with higher sensitivity, while being able to minutely discern what they’re smelling—blood or DEET. It’s a trait that’s terrible for us, but great for an artificial nose.

The next step was to reconstruct the cell structure on a chip. Here, the team used a proprietary recipe to carefully micro-engineer two bubbles, each filled with liquid, smashed together horizontally, forming a sort of squished figure-eight. At the nexus between the two bubbles is a structure that mimics a cell’s outer shell—its membrane, with two fatty layers. The team called this their artificial cell membrane.

They then tapped synthetic biology to make mosquito odor receptors from scratch using DNA. These receptors are embedded into the artificial cell membrane. The entire contraption—think two oil tanks, closely snuggled together but separated by the odor-sensing membrane—was then plopped onto a chip. Each odor-detection unit sat on top of a bunch of slits on the chip, as a sort of venting mechanism.

Here’s how the odor detection works. Each chip has engraved “channels,” allowing any odor to flow towards each detection unit. At a unit, the odor chemicals are then shuttled into the slits using blasts of nitrogen air. This “blows” the odor molecules to mix better with the liquid in the bubbles, so that the odor-detection component gets more of a “whiff.” Think stirring a pot of chili so you can smell it better—that’s the general idea.

Once the odor molecules grabbed onto the embedded mosquito smell sensors, the sensors generated an electrical current. Using an electrode and a computer, the team could then monitor for these currents. Rather than relying on a single device, they linked up 16 on a chip to further increase sensitivity.

As a proof of concept, the team fed an odor molecule called octenol to their biohybrid nose. Octenol is often detected in the breath of cancer patients. At just a few parts per billion—a range similar to our natural noses—the biohybrid “nose” was able to reliably pick up the smell, with over 90 percent accuracy.

In another test, the team had a volunteer breathe heavily into a bag. They then connected the bag, through a regulator for constant air flow, to the artificial nose. “Although the human breath contains about 3,000 different metabolites [molecules],” the team said, for a healthy human without octenol in their breath, the nose remained silent. However, once a tiny bit of octenol gas was added, the bionic “nose” immediately generated an electrical bump indicating its presence.

A Fragrant Future

For now, the team has only tested the contraption one scent at a time. But because the sensing channels are plug-and-play, they’re confident of a fragrant, blossoming future. Mixing and matching different odor receptors, for example, could lead to artificial noses that outwit our own.

Eventually, the team hopes to use their bionic nose as an affordable and portable way to detect early stages of diseases. With cancer and other health problems, the mixture of body odors can dramatically shift. It’s how dogs and other animals with a more heightened sense of smell than our own can detect health issues at early stages. In addition, a biohybrid nose could also venture into contaminated wastelands to screen for toxic chemicals without harming a living being.

To complement the “mosquito nose” hardware, the team is already looking towards more sophisticated AI software. “I would like to expand upon the analytical side of the system by using some kind of AI. This could enable our biohybrid sensors to detect more complex kinds of molecules,” said Takeuchi.

Image Credit: Richárd Ecsedi on Unsplash




Testing Conversational AI




Measuring chatbot performance beyond traditional metrics and software testing methods

How do you test a Conversational AI solution? How do you evaluate whether your chatbot is fit to be deployed to face your customers? Out of all the types of Natural Language Processing systems (Machine Translation, Question Answering, Speech Recognition, Speech Synthesis, Information Retrieval, etc.), Conversational AI is the most challenging to measure. Conversations are not one-shot tasks. They are multi-turn, and whether a conversation succeeded or failed is not easily apparent. How then can we, as conversational AI designers, measure the quality of the systems we build?

Photo by William Warby on Unsplash

Traditional software testing approaches are useful for testing whether the deployed solution is robust and resilient. However, when it comes to the quality of understanding the customer’s input and of the chatbot’s responses, software testing approaches fail to adequately cover the breadth of possible scenarios.

Metrics like precision, recall, and accuracy are used to evaluate the statistical and ML models used in chatbots for tasks like sentiment analysis, intent classification, emotion detection, and entity recognition. Although these metrics are great for measuring various parts of the system, they do not measure the system holistically. In other words, a highly accurate intent classification model does not guarantee the quality of the conversation as a whole.

In addition to model-based metrics, there are holistic ones like task completion rate and time to task completion. However, these metrics fail to capture some key undesired behaviours that may lead to disengaging conversations in the wild. Ask yourself this question: do shorter conversations mean engaging or productive conversations? We can’t really say. It seems that appropriate metrics need to be chosen based on the purpose of the conversation. While task-based systems (e.g. booking a ticket) might aim for short conversations, open-domain companions might aim for the opposite.
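These holistic metrics are simple to compute from conversation logs; a minimal sketch in Python (the log fields here are illustrative assumptions, not a standard schema):

```python
from statistics import mean

# Each log entry: did the conversation complete its task, and in how many turns?
conversations = [
    {"completed": True, "turns": 4},
    {"completed": True, "turns": 6},
    {"completed": False, "turns": 12},
    {"completed": True, "turns": 5},
]

# Fraction of conversations that reached task completion.
task_completion_rate = mean(c["completed"] for c in conversations)

# Average length (in turns) of the successful conversations only.
avg_turns_when_completed = mean(
    c["turns"] for c in conversations if c["completed"]
)
```

Whether a lower average turn count is actually good depends on the purpose of the bot, exactly as argued above.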

Besides traditional metrics, the quality of conversational AI solutions can be measured by a number of user experience (UX) factors: ease of use, how well the bot understands the user, how accurate and appropriate its responses are, how consistent it is, how trustworthy and authentic its responses are, and so on. Recently, many new quantitative and qualitative metrics have been suggested by researchers and designers working in the domain.

ChatbotTest is an open-source evaluation framework for testing chatbots. It identifies seven categories of chatbot design, as follows.

Personality — is there a clear tone of voice that fits the conversation?

Onboarding — how are the users getting started with the chatbot experience?

Understanding — how wide is the chatbot’s capability to understand the user’s input?

Answering — are the chatbot’s responses to the user accurate and appropriate?

Navigation — how easy is it to navigate through the conversation without feeling lost?

Error Management — how good is the chatbot in repairing and recovering from errors in conversation?

Intelligence — how well does it use contextual information to handle the conversation intelligently?

Spread across these categories, the ChatbotTest guide provides a number of test cases to examine and evaluate any chatbot qualitatively. The framework prods us to ask a number of questions concerning the design of the chatbot. The list of questions is exhaustive. Here is an example: the ellipsis test.


ChatbotTest — Is the chatbot intelligent enough to understand based on context?

And I like this one under Error Management: awareness of channel issues.

ChatbotTest — Can it be aware of channel issues and be helpful?

While the guide lists a number of scenarios and questions to ask, it doesn’t give us the right answers. The answers we expect are for us as designers to decide. In a sense, gathering these questions into a list gives us a list of requirements for building a great conversational experience.

The Chatbot Usability Questionnaire (CUQ) consists of 16 questions concerning the usability of a chatbot. Respondents are asked to grade their agreement with each statement about the chatbot using Likert-scale responses. The statements range over the chatbot’s personality, purpose, ease of use, and other qualitative features. The questions are evenly divided between two polarities, positive and negative, in order to reduce bias.

Although similar to the ChatbotTest framework described above, the questionnaire is not as exhaustive. However, it provides a way to grade each response and calculate a score (out of 100) that each respondent gives the chatbot.
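The scoring arithmetic can be sketched in a few lines. This is a minimal sketch assuming SUS-style scoring, in which odd-numbered statements are positively worded and even-numbered ones negatively worded; treat the details as an assumption and check the official CUQ scoring sheet:

```python
def cuq_score(responses):
    """Score 16 Likert responses (1 = strongly disagree ... 5 = strongly agree).

    Assumes odd-numbered statements are positive and even-numbered ones
    negative, as in SUS-style questionnaires.
    """
    assert len(responses) == 16
    positive = responses[0::2]  # positively worded statements
    negative = responses[1::2]  # negatively worded statements
    # Credit agreement with positive statements and disagreement with
    # negative ones, then rescale the 0-64 raw total to a 0-100 score.
    raw = sum(r - 1 for r in positive) + sum(5 - r for r in negative)
    return raw / 64 * 100

print(cuq_score([3] * 16))  # all-neutral answers land at 50
```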

CheckList is a comprehensive testing framework for NLP models that tests them on specific tasks and behaviours. It consists of a matrix of general linguistic capabilities and test types for each of them, which helps you ideate and generate a comprehensive list of test cases. There are different kinds of tests: the Minimum Functionality Test (similar to a unit test in software testing), the Invariance Test (perturbations that should not change the output of the model), and the Directional Expectation Test (perturbations with known expected results). Combining these test types with capabilities (vocabulary, named entities, negation, and much more) identifies a number of test cases that can be run on the model to find out whether it is working as expected. The combination of capabilities and test types helps generate comprehensive test cases that could easily have been overlooked.

These tests typically answer questions like: What happens when named entities are changed? Can we replace nouns with synonyms and get the same outcome? What happens when there are typos? How is the model’s outcome affected by adding words that negate the sentence? The framework comes with a tool that helps enumerate possible test input sentences given the type of test and capability.

The above image shows test cases generated for a sentiment model. For each capability (e.g. Vocab+POS, Robustness, NER) and type of test (e.g. MFT, INV, and DIR), a list of test descriptions (e.g. short utterances with neutral adjectives and nouns) is identified. Next, for each test description, test-case utterances and expected outputs are generated. The test utterances can then be fed into the model and the output compared to the expected outputs to measure failure rates.

Sensibleness and Specificity Average (SSA) is a metric proposed by Google and used to measure the performance of Google’s Meena chatbot against similar systems. The metric measures how good the chatbot’s responses are in terms of being sensible replies to the user’s utterance and how specific they are. While it is very basic compared to the other metrics and frameworks discussed here, SSA throws light on the fact that when the user types in an utterance, there are many ways your chatbot can respond. And how do you measure the quality of such a response?
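Computing SSA from human labels is straightforward; a minimal sketch (the labels below are invented for illustration):

```python
from statistics import mean

# Per-response human judgments: (is it sensible?, is it specific?).
# e.g. "I don't know" may be sensible (1) but is rarely specific (0).
labels = [(1, 1), (1, 0), (1, 1), (0, 0)]

sensibleness = mean(s for s, _ in labels)
specificity = mean(p for _, p in labels)
ssa = (sensibleness + specificity) / 2
```

The averaging penalizes bots that play it safe with vague but sensible replies, which is exactly the failure mode SSA was designed to expose.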

ACUTE-Eval is a novel metric that measures the quality of a chatbot by comparing its conversations to another’s. It takes two multi-turn conversations and asks a human evaluator to compare one of the speakers (say Speaker A) in one conversation to one of the speakers (say Speaker B) in the other. The evaluators are then asked specific questions that require choosing between speakers A and B: which of the two was more engaging, knowledgeable, interesting, and so on. This metric was recently used by Facebook to evaluate its open-domain chatbot, Blender. They asked evaluators the following questions:

  1. Who would you prefer to talk to for a long conversation?
  2. Which speaker sounds more human?

By comparing the two speakers in two conversations laid out side by side, the anchoring effect of seeing conversations one after another is avoided.

Cohesion and Separation measure the quality of the training examples fed into an intent classification model. Using sentence embeddings, semantic similarity between utterances can be measured. Cohesion is the average similarity between the utterance examples within an intent: the higher, the better. Separation measures how dissimilar the utterance examples of any two intents are: the more separated the intents, the better. Although this measure does not directly gauge the chatbot’s overall performance, or even that of its intent classification model, it is useful for measuring the quality of the training examples fed into the model.
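Given any sentence-embedding function, both measures reduce to average cosine similarities. A toy sketch with hand-made 2-D "embeddings" (a real system would use a sentence encoder; here separation is reported as cross-intent similarity, where lower means better separated):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cohesion(vectors):
    """Mean similarity over all distinct pairs within one intent's examples."""
    pairs = [(a, b) for i, a in enumerate(vectors) for b in vectors[i + 1:]]
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def cross_similarity(vectors_a, vectors_b):
    """Mean similarity across two intents; lower means better separated."""
    pairs = [(a, b) for a in vectors_a for b in vectors_b]
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Toy embeddings for two intents: greetings vs. refund requests.
greet = [(1.0, 0.1), (0.9, 0.2)]
refund = [(0.1, 1.0), (0.2, 0.8)]
```

High within-intent cohesion paired with low cross-intent similarity suggests the training examples draw a crisp decision boundary for the classifier.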

And finally, another interesting approach I read about recently is BotBattle: getting chatbots to talk to each other and letting the audience decide the winner. It is very similar to ACUTE-Eval in the sense that two speakers are compared on various parameters, but unlike ACUTE-Eval, the two speakers are engaged in conversation with each other. Given the nature of the task, this approach suits open-domain chat rather than task-based conversations, where the clearly assigned speaker roles could bias the evaluation.

This approach was used to compare Kuki, the popular Loebner-Prize-winning chatbot, to Facebook’s BlenderBot. The chat was presented in a virtual-reality environment where both bots had their own avatars. The winner was decided by audience vote.

Photo by Scott Graham on Unsplash

So, there you go: a list of recent metrics and frameworks for evaluating Conversational AI models and systems. I am sure this is not an exhaustive list, and as the domain of conversational AI evolves and the expectations of conversational experiences change, more metrics will be invented. As these systems are widely adopted to handle different kinds of conversations, metrics will also need to be developed based on purpose. I hope this article prods you to ask more questions about how to properly test your system, and to seek answers too. Please do share your experiences using these or other new metrics in the comments section below.

