

DeepMind’s latest protein-solving AI AlphaFold a step closer to cracking biology’s 50-year conundrum




DeepMind says its AlphaFold machine-learning software can now rapidly predict the structure of proteins with decent accuracy, and could one day help us develop drugs faster.

In its announcement on Monday, heralded as a scientific breakthrough by some, the Google stablemate claims to have solved a 50-year problem in biology: building a computer system capable of accurately and quickly modelling the structure of a protein from its strings of amino acids.

For the uninitiated, proteins are molecules essential for life as we know it: they transport matter within cells and bodies, they perform chemical reactions as enzymes, they protect you as antibodies, and so on. How they each function is basically defined by their shape. However, proteins are complex 3D arrangements nanometres in size, and so it is tricky, though not always impossible, to figure out their structure through observations.

An alternative approach is to use software to link the amino acids detected within a protein – each protein is formed from chains of these acids – to the protein’s intricate structure. Computer programs like AlphaFold estimate the structure of a protein from its amino acids. Knowing the shape of a protein’s components helps us understand how they work, which helps us do things like develop drugs, and create new or copycat proteins.

The challenge

The Critical Assessment of Protein Structure Prediction competition, known as CASP for short, was set up in 1994 to compare code for predicting protein folding, and is held every two years. This time around, DeepMind’s AlphaFold reached the highest score ever, 87 GDT, just below the 90 GDT mark considered to be as good as results obtained from physical observations of proteins. Crucially, the AI-based software is able to accurately predict the structure of individual proteins in a matter of minutes to days, whereas physical experiments can take much longer. That means drug development work could be vastly sped up.

“We have been stuck on this one problem – how do proteins fold up – for nearly 50 years,” said John Moult, co-founder and chair of CASP, and a professor at the University of Maryland in the US. “To see DeepMind produce a solution for this, having worked personally on this problem for so long and after so many stops and starts, wondering if we’d ever get there, is a very special moment.”

GDT stands for Global Distance Test and is scored on a scale of 0 to 100. AlphaFold’s score means it is able to predict the structure of a protein to about 87 per cent accuracy: the guesstimated positions of the amino acids may be off by a distance of about 1.6 Angstroms, or 0.16nm – about the width of an atom.
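
For a rough intuition, here is a toy sketch of how a GDT_TS-style score can be computed from per-residue distances between a predicted and an experimentally determined structure (a simplification that skips the structure superposition step the real metric requires):

import numpy as np

def gdt_ts(distances_angstrom):
    # Mean, over the cutoffs 1, 2, 4 and 8 Angstroms, of the percentage of
    # residues whose predicted position falls within that cutoff
    d = np.asarray(distances_angstrom)
    return float(np.mean([(d <= cutoff).mean() * 100 for cutoff in (1.0, 2.0, 4.0, 8.0)]))

# toy example: per-residue distances between prediction and experiment
print(gdt_ts([0.4, 0.9, 1.6, 2.5, 7.0]))  # a score out of 100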

AlphaFold trashed its competitors, we note from the official results. In 2018, AlphaFold also topped the list albeit with a score below 60 GDT using a system made up of three neural networks.

“We have to get much more accurate in order for it to be useful for biologists,” DeepMind’s CEO Demis Hassabis told El Reg at the time. The latest 2020 entry has a different architecture, one based on an “attention-based neural network system.”

“A folded protein can be thought of as a ‘spatial graph’, where residues are the nodes and edges connect the residues in close proximity,” the AlphaFold team explained this week. “This graph is important for understanding the physical interactions within proteins, as well as their evolutionary history.”

AlphaFold was trained on approximately 170,000 protein structures, and learned the spatial representation of all the constituent amino acids in each one in order to predict structures presented in the CASP competition. “It uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks, which is a relatively modest amount of compute in the context of most large state-of-the-art models used in machine learning today,” the team said.

“Being able to investigate the shape of proteins quickly and accurately has the potential to revolutionise life sciences,” said Andriy Kryshtafovych, who helped organize the CASP competition and is a project scientist at the University of California, Davis.

“Now that the problem has been largely solved for single proteins, the way is open for development of new methods for determining the shape of protein complexes – collections of proteins that work together to form much of the machinery of life, and for other applications.”

Although the AI performs almost as well as scientists, it won’t replace lab experiments altogether. Predicting the structure of individual proteins doesn’t describe how they might interact with one another or with other molecules like DNA or RNA. It also can’t reliably handle multi-protein complexes.

“AlphaFold is one of our most significant advances to date but, as with all scientific research, there are still many questions to answer,” the DeepMind team concluded. “Not every structure we predict will be perfect…In collaboration with others, there’s also much to learn about how best to use these scientific discoveries in the development of new medicines, ways to manage the environment, and more.”

The London-based AI lab is expected to reveal more details in an upcoming peer-reviewed paper. A spokesperson was not available for further comment. ®

PS: An antidote to the latest hype surrounding AlphaFold can be found here by Stephen Curry, a Professor of Structural Biology at Imperial College London.



Soon, no more blood tests or probing for prostate cancer? AI claims 99% success rate using more relaxing methods




Scientists say they have devised a way to screen for prostate cancer using a drop of urine, a sensor, and AI algorithms. The test takes just twenty minutes and is 99 per cent accurate, according to results from a small-scale trial.

The risk of developing prostate cancer increases for men as they get older, and the over 50s are supposed to be routinely screened for the disease. If the examination – performed either with a blood test or something… more hands on – suggests there’s a problem, the patient may undergo a biopsy, where a small amount of tissue is extracted from the prostate to confirm whether or not there are cancerous cells present. But these screening tests tend to generate a lot of false positives, leading to unnecessary biopsies.

A team of researchers led by the Korea Institute of Science and Technology (KIST) believe that prostate cancer can be screened for non-invasively and quickly using just a urine sample. Urine carries biomarkers associated with the disease, such as PCA3: if high levels of PCA3 are detected, there’s a high chance of prostate cancer.

Urine tests, however, can also be inaccurate at predicting whether someone really needs a biopsy or not. Instead of looking for just PCA3, the team uses a biosensor made up of a dual-gate field effect transistor to sense the presence of biomarkers in the urine. These biomarkers cause a shift in the transistor’s reference voltage.

By measuring this change after a tiny drop of urine is placed on the biosensor’s surface, scientists can determine the concentration of the biomarkers in the fluid. Trained algorithms can take these concentration readings, and use them to predict whether or not someone has the disease.

“Basically, a biomarker is a certain substance in our body where its concentration is affected by a specific disease state,” Kwanhyi Lee – co-author of a paper describing the technology, published in ACS Nano, and a principal research scientist at KIST – explained to The Register on Tuesday.

“In our study, we chose four different biomarkers related to prostate cancer. Simply speaking, we see either an increase or decrease of biomarker concentration from cancer patients compared to healthy individuals.




“In a single biomarker based diagnosis, we can simply set the threshold to diagnose the disease. However, when there are more biomarkers, four in our case, understanding the relation between biomarker data and disease state is not easy. Considering the potential of multiple biomarker based diagnosis, it is necessary to analyze multiple biomarker data. Here, we utilized AI algorithms to learn the pattern of biomarker data so that the AI algorithm predicts the cancer.”

The algorithms were trained to look for specific patterns in the biomarker data that are common in patients with prostate cancer. If they detect the same patterns in a fresh urine sample, there’s a high chance of the disease. The team collected 76 urine samples from a mixture of healthy patients and men with prostate cancer, and used 70 per cent of them to train the algorithms and 30 per cent for testing.
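
The paper’s exact models aren’t detailed in this article, so the sketch below is only an illustration of the overall recipe: a random forest stands in for whatever classifier the team actually used, and randomly generated numbers stand in for the real biomarker concentrations derived from the sensor’s voltage shifts.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 76 hypothetical samples, four biomarker concentrations each; label 1 = cancer
rng = np.random.default_rng(0)
X = rng.normal(size=(76, 4))
y = rng.integers(0, 2, size=76)

# 70 per cent of samples for training, 30 per cent held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print('held-out accuracy:', clf.score(X_test, y_test))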

Initial results show the algorithms were able to correctly predict whether someone had prostate cancer or not with at least 99 per cent accuracy. It should be noted, however, that the test set contained only 23 people, so the results of this limited experiment should be taken with a pinch of salt.


Lee said that although the false positive and negative rates of the algorithms were low at 0.024 and 0.037 respectively, the team needed to verify their results with many more patients.

“After a further development of the technology, I believe that replacing the current blood test will be possible,” he said. “To make this happen, there are a few challenges we need to overcome. First, a validation of our approach with a much larger data set should be checked. Second, miniaturization of the sensor is needed.

“And then we need to closely work with healthcare experts on fields to monitor the performance of our sensing platform to determine replacing the current blood-based test. Personally, I think that making the sensing platform more affordable without compromising the performance is the key.”

At the moment, Lee believes the test is not yet good enough to completely eradicate the need for further biopsy tests. He hopes that one day in the future the team will develop a biosensor capable of detecting multiple biomarkers for different cancers that can be analysed with AI algorithms to diagnose patients. ®




How to deliver natural conversational experiences using Amazon Lex Streaming APIs




Natural conversations often include pauses and interruptions. During customer service calls, a caller may ask to pause the conversation or hold the line while they look up the necessary information before continuing to answer a question. For example, callers often need time to retrieve credit card details when making bill payments. Interruptions are also common. Callers may interrupt a human agent with an answer before the agent finishes asking the entire question (for example, “What’s the CVV code for your credit card? It is the three-digit code top right corner.…”). Just like conversing with human agents, a caller interacting with a bot may interrupt or instruct the bot to hold the line. Previously, you had to orchestrate such dialog on Amazon Lex by managing client attributes and writing code via an AWS Lambda function. Implementing a hold pattern required code to keep track of the previous intent so that the bot could continue the conversation. The orchestration of these conversations was complex to build and maintain, and impacted the time to market for conversational interfaces. Moreover, the user experience was disjointed because the properties of prompts such as ability to interrupt were defined in the session attributes on the client.

Amazon Lex’s new streaming conversation APIs allow you to deliver sophisticated natural conversations across different communication channels. You can now easily configure pauses, interruptions and dialog constructs while building a bot with the Wait and Continue and Interrupt features. This simplifies the overall design and implementation of the conversation and makes it easier to manage. By using these features, the bot builder can quickly enhance the conversational capability of virtual agents or IVR systems.

In the new Wait and Continue feature, the ability to put the conversation into a waiting state is surfaced during slot elicitation. You can configure the slot to respond with a “Wait” message such as “Sure, let me know when you’re ready” when a caller asks for more time to retrieve information. You can also configure the bot to continue the conversation with a “Continue” response based on defined cues such as “I’m ready for the policy ID. Go ahead.” Optionally, you can set a “Still waiting” prompt to play messages like “I’m still here” or “Let me know if you need more time.” You can set how often these messages play and configure a maximum wait time for user input. If the caller doesn’t provide any input within the maximum wait duration, Amazon Lex resumes the dialog by prompting for the slot. The following screenshot shows the wait and continue configuration options on the Amazon Lex console.


The Interrupt feature enables callers to barge-in while a prompt is played by the bot. A caller may interrupt the bot and answer a question before the prompt is completed. This capability is surfaced at the prompt level and provided as a default setting. On the Amazon Lex console, navigate to the Advanced Settings and under Slot prompts, enable the setting to allow users to interrupt the prompt.

After configuring these features, you can initiate a streaming interaction with the Lex bot by using the StartConversation API. The streaming capability enables you to capture user input, manage state transitions, handle events, and deliver a response required as part of a conversation. The input can be one of three types: audio, text, or DTMF, whereas the response can be either audio or text. The dialog progresses by eliciting an intent, populating any slots, confirming the intent, and finally closing the intent. Streaming allows intents to be defined based on different conversation states such as InProgress, Waiting, Confirmed, Denied, Fulfilled, ReadyForFulfillment or Failed. For a detailed list of the different dialog and intent states, see the documentation for Amazon Lex intents.

The StartConversation API uses HTTP/2 for bidirectional communication between the client and Amazon Lex. Clients are expected to maintain a single long-running stream throughout the conversation and exchange required information with the server (and vice versa) via different events. To learn more about streaming for Amazon Lex bots, please refer to the Lex streaming documentation.

Amazon Lex Streaming APIs in action

Let’s put all of this together in a bot. I first use the Amazon Lex console to configure a bot and then use the following sample telephony application to interact with the bot. For the purposes of this post, we use a simple bot to complete a bill payment with a credit card. Because callers often need time to retrieve their credit card details, we enable the wait and continue construct when eliciting the credit card number. We also set up a service-side interrupt to manage interruption.

The following is a sample conversation with our bot:

Caller:    I’d like to make a payment on my account
Bot:        Sure. Can we start off with your account ID?
Caller:    Yes the account ID is 1234
Bot:        Great. The balance due is $100. How much would you like to pay today?
Caller:    $100
Bot:        Ok. Can you share the 16-digit number on your credit card?
Caller:    Can you wait a minute?
Bot:        Sure

—————- a few seconds later —————-

Bot:        Let me know when ready. No rush.

—————- a few seconds later —————-

Caller:    Ok I am back
Bot:        Ok please go ahead
Caller:    The credit card number is 1234 5678 1234 5678
Bot:        Got it. Let me repeat that for you 1234 5678 1234 5678. Is that correct?
Caller:    Yes
Bot:        Ok. What’s the CVV code? You can find it in the top right on the signature panel on the back of the card.
Caller:    123
Bot:        Awesome. I will go ahead and process the $100 payment on your account using the card ending in 5678.
Caller:    Ok
Bot:        The payment went through. Your confirmation code is 1234.

The first step is to build an Amazon Lex bot with intents to process payment and get balance on the account. The ProcessPayment intent elicits the information required to process the payment, such as the payment amount, credit card number, CVV code, and expiration date. The GetBalanceAmount intent provides the balance on the account. The FallbackIntent is triggered when the user input can’t be processed by either of the two configured intents.

Deploying the sample bot

To create the sample bot, complete the following steps. This creates an Amazon Lex bot called PaymentsBot.

  1. On the Amazon Lex console, choose Create Bot.
  2. In the Bot configuration section, give the bot the name PaymentsBot.
  3. Specify AWS Identity and Access Management (IAM) permissions and the COPPA flag.
  4. Choose Next.
  5. Under Languages, choose English(US).
  6. Choose Done.
  7. Add the ProcessPayment and GetBalanceAmount intents to your bot.
  8. For the ProcessPayment intent, add the following slots:
    1. PaymentAmount slot using the built-in AMAZON.Number slot type
    2. CreditCardNumber slot using the built-in AMAZON.AlphaNumeric slot type
    3. CVV slot using the built-in AMAZON.Number slot type
    4. ExpirationDate using the built-in AMAZON.Date built-in slot type
  9. Configure slot elicitation prompts for each slot.
  10. Configure a closing response for the ProcessPayment intent.
  11. Similarly, add and configure slots and prompts for GetBalanceAmount intents.
  12. Choose Build to test your bot.

For more information about creating a bot, see the Lex V2 documentation.

Configuring Wait and Continue

  1. Choose the ProcessPayment intent and navigate to the CreditCardNumber slot.
  2. Choose Advanced Settings to open the slot editor.
  3. Enable Wait and Continue for the slot.
  4. Provide the Wait, Still Waiting, and Continue responses.
  5. Save the intent and choose Build.

The bot is now configured to support the Wait and Continue dialog construct. Now let’s configure the client code. You can use a telephony application to interact with your Lex bot. You can download the code for setting up a telephony IVR interface via Twilio at the GitHub project. The link contains information to set up a telephony interface as well as a client application code to communicate between the telephony interface and Amazon Lex.

Now, let us review the client-side setup to use the bot configuration that we just enabled on the Amazon Lex console. The client application uses the Java SDK to capture payment information. In the beginning, you use the ConfigurationEvent to set up the conversation parameters. Then, you start sending an input event (AudioInputEvent, TextInputEvent or DTMFInputEvent) to send user input to the bot depending on the input type. When sending audio data, you would need to send multiple AudioInputEvent events, with each event containing a slice of the data.

The service first responds with TranscriptEvent to give transcription, then sends the IntentResultEvent to surface the intent classification results. Subsequently, Amazon Lex sends a response event (TextResponseEvent or AudioResponseEvent) that contains the response to play back to caller. If the caller requests the bot to hold the line, the intent is moved to the Waiting state and Amazon Lex sends another set of TranscriptEvent, IntentResultEvent and a response event. When the caller requests to continue the conversation, the intent is set to the InProgress state and the service sends another set of TranscriptEvent, IntentResultEvent and a response event. While the dialog is in the Waiting state, Amazon Lex responds with a set of IntentResultEvent and response event for every “Still waiting” message (there is no transcript event for server-initiated responses). If the caller interrupts the bot prompt at any time, Amazon Lex returns a PlaybackInterruptionEvent.

Let’s walk through the main elements of the client code:

  1. Create the Amazon Lex client:
    AwsCredentialsProvider awsCredentialsProvider = StaticCredentialsProvider
            .create(AwsBasicCredentials.create(accessKey, secretKey));

    LexRuntimeV2AsyncClient lexRuntimeServiceClient = LexRuntimeV2AsyncClient.builder()
            .region(region)
            .credentialsProvider(awsCredentialsProvider)
            .build();

  2. Create a handler to publish data to server:
    EventsPublisher eventsPublisher = new EventsPublisher();

  3. Create a handler to process bot responses:
    public class BotResponseHandler implements StartConversationResponseHandler {

        private static final Logger LOG = Logger.getLogger(BotResponseHandler.class);

        @Override
        public void responseReceived(StartConversationResponse startConversationResponse) {
            // would have 2XX, request id.
            LOG.info("successfully established the connection with server. request id:"
                    + startConversationResponse.responseMetadata().requestId());
        }

        @Override
        public void onEventStream(SdkPublisher<StartConversationResponseEventStream> sdkPublisher) {
            sdkPublisher.subscribe(event -> {
                if (event instanceof PlaybackInterruptionEvent) {
                    handle((PlaybackInterruptionEvent) event);
                } else if (event instanceof TranscriptEvent) {
                    handle((TranscriptEvent) event);
                } else if (event instanceof IntentResultEvent) {
                    handle((IntentResultEvent) event);
                } else if (event instanceof TextResponseEvent) {
                    handle((TextResponseEvent) event);
                } else if (event instanceof AudioResponseEvent) {
                    handle((AudioResponseEvent) event);
                }
            });
        }

        @Override
        public void exceptionOccurred(Throwable throwable) {
            LOG.error(throwable);
            System.err.println("got an exception:" + throwable);
        }

        @Override
        public void complete() {
            LOG.info("on complete");
        }

        private void handle(PlaybackInterruptionEvent event) {
            LOG.info("Got a PlaybackInterruptionEvent: " + event);
            LOG.info("Done with a PlaybackInterruptionEvent: " + event);
        }

        private void handle(TranscriptEvent event) {
            LOG.info("Got a TranscriptEvent: " + event);
        }

        private void handle(IntentResultEvent event) {
            LOG.info("Got an IntentResultEvent: " + event);
        }

        private void handle(TextResponseEvent event) {
            LOG.info("Got an TextResponseEvent: " + event);
        }

        private void handle(AudioResponseEvent event) {
            // synthesize speech
            LOG.info("Got a AudioResponseEvent: " + event);
        }
    }

  4. Initiate the connection with the bot:
    StartConversationRequest.Builder startConversationRequestBuilder = StartConversationRequest.builder()
            .botId(botId)
            .botAliasId(botAliasId)
            .localeId(localeId);

    // configure the conversation mode with bot (defaults to audio)
    startConversationRequestBuilder = startConversationRequestBuilder.conversationMode(ConversationMode.AUDIO);

    // assign a unique identifier for the conversation
    startConversationRequestBuilder = startConversationRequestBuilder.sessionId(sessionId);

    // build the initial request
    StartConversationRequest startConversationRequest = startConversationRequestBuilder.build();

    CompletableFuture<Void> conversation = lexRuntimeServiceClient.startConversation(
            startConversationRequest, eventsPublisher, botResponseHandler);

  5. Establish the configurable parameters via ConfigurationEvent:
    public void configureConversation() {
        String eventId = "ConfigurationEvent-" + eventIdGenerator.incrementAndGet();

        ConfigurationEvent configurationEvent = StartConversationRequestEventStream
                .configurationEventBuilder()
                .eventId(eventId)
                .clientTimestampMillis(System.currentTimeMillis())
                .responseContentType(RESPONSE_TYPE)
                .build();

        eventWriter.writeConfigurationEvent(configurationEvent);
        LOG.info("sending a ConfigurationEvent to server:" + configurationEvent);
    }

  6. Send audio data to the server:
    public void writeAudioEvent(ByteBuffer byteBuffer) {
        String eventId = "AudioInputEvent-" + eventIdGenerator.incrementAndGet();

        AudioInputEvent audioInputEvent = StartConversationRequestEventStream
                .audioInputEventBuilder()
                .eventId(eventId)
                .clientTimestampMillis(System.currentTimeMillis())
                .audioChunk(SdkBytes.fromByteBuffer(byteBuffer))
                .contentType(AUDIO_CONTENT_TYPE)
                .build();

        eventWriter.writeAudioInputEvent(audioInputEvent);
    }

  7. Manage interruptions on the client side:
    private void handle(PlaybackInterruptionEvent event) {
        LOG.info("Got a PlaybackInterruptionEvent: " + event);
        callOperator.pausePlayback();
        LOG.info("Done with a PlaybackInterruptionEvent: " + event);
    }

  8. Enter the code to disconnect the connection:
    public void disconnect() {
        String eventId = "DisconnectionEvent-" + eventIdGenerator.incrementAndGet();

        DisconnectionEvent disconnectionEvent = StartConversationRequestEventStream
                .disconnectionEventBuilder()
                .eventId(eventId)
                .clientTimestampMillis(System.currentTimeMillis())
                .build();

        eventWriter.writeDisconnectEvent(disconnectionEvent);
        LOG.info("sending a DisconnectionEvent to server:" + disconnectionEvent);
    }

You can now deploy the bot on your desktop to test it out.

Things to know

The following are a couple of important things to keep in mind when you’re using the Amazon Lex V2 Console and APIs:

  • Regions and languages – The Streaming APIs are available in all existing Regions and support all current languages.
  • Interoperability with Lex V1 console – Streaming APIs are only available in the Lex V2 console and APIs.
  • Integration with Amazon Connect – As of this writing, Lex V2 APIs are not supported on Amazon Connect. We plan to provide this integration as part of our near-term roadmap.
  • Pricing – Please see the details on the Lex pricing page.

Try it out

Amazon Lex Streaming API is available now and you can start using it today. Give it a try, design a bot, launch it and let us know what you think! To learn more, please see the Lex streaming API documentation.

About the Authors

Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of technology and education. Out of the office, Esther enjoys long walks along the beach, dinners with friends and friendly rounds of Mahjong.

Swapandeep Singh is an engineer with the Amazon Lex team. He works on making interactions with bots smoother and more human-like. Outside of work, he likes to travel and learn about different cultures.




Building a Complete AI Based Search Engine with Elasticsearch, Kubeflow and Katib





Building search systems is hard. Preparing them to work with machine learning is really hard. Developing a complete search engine framework integrated with AI is really really hard.

So let’s make one. ✌️

In this post, we’ll build a search engine from scratch and discuss how to further optimize results by adding a machine learning layer using Kubeflow and Katib. This new layer, the main focus of this article, will be capable of retrieving results that take each user’s context into account.

As we’ll see, thanks to Kubeflow and Katib, the final result is quite simple, efficient and easy to maintain.

Complete pipeline executed by Kubeflow, responsible for orchestrating the whole system. Image by author.

To understand the concepts in practice, we’ll implement the system hands-on. As it’s built on top of Kubernetes, you can use any infrastructure you like (with appropriate adaptations). We’ll be using Google Cloud Platform (GCP) in this post.

We begin with a brief introduction to concepts and then move to the system implementation discussion.


So, with no further ado,

1. You Know, For Search

If you receive the challenge of building a search system for your company or want to build one for your own, you’ll soon realize that the initial steps tend to be somewhat straightforward.

First and foremost, the search engine must contain documents for retrieval. As we’ll be working with Elasticsearch, let’s use it as reference (for an introduction, please refer to their official docs)

Documents should be uploaded to Elasticsearch following a JSON format. If, for instance, we are building a search engine for a fashion eCommerce store, here’s an example of a document:

Example of document as uploaded to Elasticsearch. Image by author.

Then comes the retrieval step which in essence involves matching search queries with document fields:

Example of matching between search terms and document fields. Image by author.

The ranking phase applies scoring functions such as TF-IDF or BM25F to figure out how to rank properly, sorting documents from best to worst match. It’d be something like:

Example of properly ranked results as retrieved by Elasticsearch running BM25 scoring among the stored documents in the database. Image by author.
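
For intuition, here is a toy sketch of BM25-style scoring for a single document and a single field, with commonly used parameter values; Elasticsearch implements all of this for you under the hood.

import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_len, k1=1.2, b=0.75):
    # doc_freqs maps a term to the number of documents containing it
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freqs.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - b + b * len(doc_terms) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

doc = 'red women t-shirt cotton'.split()
print(bm25_score('women t-shirt'.split(), doc, {'women': 40, 't-shirt': 25}, n_docs=1000, avg_len=6))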

Further optimization could leverage on specific fields of documents containing performance metrics. For instance, in the previous example, we have that the Click Through Rate (CTR, i.e., reason between clicks and total impressions) of the t-shirt is CTR=0.36. Another retrieval layer could be added using this information and favoring documents with better CTR to show at the top (also known as “boosting”):

Example adding performance metrics layer in retrieval rules. Previous second-best document raises to the top of the results. Image by author.
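
A boosting layer like this can be expressed directly as an Elasticsearch query. Below is a minimal sketch using the Python client and a function_score query; the index and field names (products, ctr, name) are illustrative placeholders rather than the store’s actual schema.

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')  # assumed local cluster

query = {
    'function_score': {
        'query': {'match': {'name': 'women t-shirt'}},  # textual relevance (BM25)
        'functions': [
            {'field_value_factor': {'field': 'ctr', 'factor': 10, 'missing': 0}}
        ],
        'boost_mode': 'sum'  # final score = BM25 score + 10 * CTR
    }
}

response = es.search(index='products', body={'query': query, 'size': 10})
for hit in response['hits']['hits']:
    print(hit['_score'], hit['_source'].get('name'))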

So far so good. But let’s see how to further optimize even more.

Consider that each user has a specific context. Let’s take our fashion online store as an example again. Some of the traffic may come from southern regions where it may be warmer than regions from the north. They’d probably rather be exposed to lighter clothing than to winter specific products.

More context can be added to the equation: we could distinguish customers based on their favorite brands, categories, colors, sizes, device used, average consuming ticket, profession, age and the list goes on and on…

Doing so requires some extra tools as well. Let’s dive a bit deeper into that.

2. The Machine Learning Layer

Learn-to-rank (LTR) is a field of machine learning that studies algorithms whose main goal is to properly rank a list of documents.

It works essentially like any other learning problem: it requires a training dataset, is subject to issues such as the bias-variance trade-off, each model has advantages in certain scenarios, and so on.

What basically changes is that the cost function for the training process is designed to let the algorithm learn about ranking and the output of the model is a value for how good of a match a given document is for a given query.

Mathematically, it’s simply given by:
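
In symbols, using the definitions that follow, the model is simply a learned function mapping a feature vector to a judgment score:

J ≈ f(X)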

X in our case comprises all the features we’d like to use to add context to the search. They can be values such as the region of the user, their age, favorite brand, the correlation between queries and document fields, and so on.

f is the ranking model which is supposed to be trained and evaluated.

Finally, J stands for judgment, and for us it’s an integer value that ranges from 0 (meaning the document is not a good match for a query given the features) up to 4 (the document is a very good match). We arrange documents from best to worst by using the judgments.

Our main goal is to obtain f, since it represents the ranking algorithm that adds the machine learning layer to the search results. And in order to obtain f, we need a dataset that already contains the judgment values, otherwise we can’t train the models.

As it turns out, finding those values can be quite challenging. While the details of how to do so won’t be covered here, this post has a thorough discussion on the subject; in a nutshell, we use clickstream data of users’ interactions with search engines (their clicks and purchases) to fit models whose variables yield a proxy for the judgment value.

Example of Graphical Model implemented on pyClickModels for finding relevance of documents associated to search engines results. Image by author.

After computing the judgments, we are left with training the ranking models. Elasticsearch already offers a learn-to-rank plugin which we’ll use in this implementation. The plugin offers various ranking algorithms ranging from decision trees to neural networks.

This is an example of the required training file:

Example of input file for the training step as required by Elasticsearch Learn2Rank plugin. Image by author.

The idea is to record, for each query (“women t-shirt”), all documents that were shown on the results page. For each one, we compute its expected judgment and build the matrix of features X.

In practice, what will happen is that we’ll first prepare all this data and feed it to the Learn-To-Rank plugin of Elasticsearch which will result in a trained ranking model. It can then be used to add the personalization layer we are looking for.

Further details on building X will be discussed soon.

We are ready now to train the models. So far so good. But then, we still have a tricky problem: how to know if it’s working?

2.1 The Validation Framework

We could choose from several methods to check the performance of a ranking model. The one we’ll discuss here is the average rank metric based on what users either clicked or purchased (pySearchML focuses on purchase events only but clicks can be used interchangeably).

Mathematically, it’s given by:
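
One way of writing it, consistent with the compute_rank code shown later, with r_i the zero-based position of purchased item i among the n_q documents retrieved for query q, and P_q the set of purchased items found in those results, is:

\text{avg rank} = \frac{\sum_{q} \sum_{i \in P_q} r_i / (n_q - 1)}{\sum_{q} |P_q|}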

The formula basically sums, for each purchased (or clicked) item, its rank within the complete list of retrieved documents. The denominator is simply the number of items summed over in the process (the total number of items users either clicked or bought).

In practice, what will happen is that, after training a ranking model, we’ll loop through the validation dataset (which contains what users searched and what they purchased) and use each search term to send a query to Elasticsearch. We then compare the results with what users bought to compute the appropriate average ranking.

Example of the validation framework in practice. For each user in dataset, we use their search term to retrieve from ES results with the ranking model already implemented. We then take the average rank of the purchased items by users and their position in the ES result. Image by author.

The image above illustrates the concept. For each user, we send their search term to Elasticsearch which already contains the just recently trained model. We then compare the search results with what the user purchased and compute the rank. In the previous example, the red t-shirt appears at position 2 out of the 3 retrieved items. As it’s just one item that was purchased then rank=66%.

We run the same computation to all users in the database and then average them all together for a final rank expression.

Notice that the final rank metric must be lower than 50% otherwise the algorithm is just performing as a random selector of documents.

This value is important as it’s used for selecting the best ranking model. That’s where we use Katib from Kubeflow.

Let’s see now how to put all these concepts together and build the search engine:

3. Kubeflow Orchestration

As discussed before, Kubeflow is the orchestrator for the pipeline processing. It has various responsibilities ranging from preparing data for Elasticsearch and for training to running the entire training process.

It works by defining components and their respective tasks. For pySearchML, here’s the complete pipeline that was implemented:

def build_pipeline(
    bucket='pysearchml',
    es_host='elasticsearch.elastic-system.svc.cluster.local:9200',
    force_restart=False,
    train_init_date='20160801',
    train_end_date='20160801',
    validation_init_date='20160802',
    validation_end_date='20160802',
    test_init_date='20160803',
    test_end_date='20160803',
    model_name='lambdamart0',
    ranker='lambdamart',
    index='pysearchml'
):
    pvc = dsl.PipelineVolume(pvc='pysearchml-nfs')

    prepare_op = dsl.ContainerOp(
        name='prepare env',
        image=f'{PROJECT_ID}/prepare_env',
        arguments=[f'--force_restart={force_restart}', f'--es_host={es_host}',
                   f'--bucket={bucket}', f'--model_name={model_name}'],
        pvolumes={'/data': pvc}
    )

    val_reg_dataset_op = dsl.ContainerOp(
        name='validation regular dataset',
        image=f'{PROJECT_ID}/data_validation',
        arguments=[f'--bucket={bucket}/validation/regular',
                   f'--validation_init_date={validation_init_date}',
                   f'--validation_end_date={validation_end_date}',
                   f'--destination=/data/pysearchml/{model_name}/validation_regular'],
        pvolumes={'/data': pvc}
    ).set_display_name('Build Regular Validation Dataset').after(prepare_op)

    val_train_dataset_op = dsl.ContainerOp(
        name='validation train dataset',
        image=f'{PROJECT_ID}/data_validation',
        arguments=[f'--bucket={bucket}/validation/train',
                   f'--validation_init_date={train_init_date}',
                   f'--validation_end_date={train_end_date}',
                   f'--destination=/data/pysearchml/{model_name}/validation_train'],
        pvolumes={'/data': pvc}
    ).set_display_name('Build Train Validation Dataset').after(prepare_op)

    val_test_dataset_op = dsl.ContainerOp(
        name='validation test dataset',
        image=f'{PROJECT_ID}/data_validation',
        arguments=[f'--bucket={bucket}/validation/test',
                   f'--validation_init_date={test_init_date}',
                   f'--validation_end_date={test_end_date}',
                   f'--destination=/data/pysearchml/{model_name}/validation_test'],
        pvolumes={'/data': pvc}
    ).set_display_name('Build Test Validation Dataset').after(prepare_op)

    train_dataset_op = dsl.ContainerOp(
        name='train dataset',
        image=f'{PROJECT_ID}/data_train',
        command=['python', '/train/'],
        arguments=[f'--bucket={bucket}', f'--train_init_date={train_init_date}',
                   f'--train_end_date={train_end_date}', f'--es_host={es_host}',
                   f'--model_name={model_name}', f'--index={index}',
                   f'--destination=/data/pysearchml/{model_name}/train'],
        pvolumes={'/data': pvc}
    ).set_display_name('Build Training Dataset').after(prepare_op)

    katib_op = dsl.ContainerOp(
        name='pySearchML Bayesian Optimization Model',
        image=f'{PROJECT_ID}/model',
        command=['python', '/model/'],
        arguments=[f'--es_host={es_host}', f'--model_name={model_name}',
                   f'--ranker={ranker}', '--name=pysearchml',
                   f'--train_file_path=/data/pysearchml/{model_name}/train/train_dataset.txt',
                   f'--validation_files_path=/data/pysearchml/{model_name}/validation_regular',
                   f'--validation_train_files_path=/data/pysearchml/{model_name}/validation_train',
                   f'--destination=/data/pysearchml/{model_name}/'],
        pvolumes={'/data': pvc}
    ).set_display_name('Katib Optimization Process').after(
        val_reg_dataset_op, val_train_dataset_op, val_test_dataset_op, train_dataset_op
    )

    post_model_op = dsl.ContainerOp(
        name='Post Best RankLib Model to ES',
        image=f'{PROJECT_ID}/model',
        command=['python', '/model/'],
        arguments=[f'--es_host={es_host}', f'--model_name={model_name}',
                   f'--destination=/data/pysearchml/{model_name}/best_model.txt'],
        pvolumes={'/data': pvc}
    ).set_display_name('Post RankLib Model to ES').after(katib_op)

    _ = dsl.ContainerOp(
        name='Test Model',
        image=f'{PROJECT_ID}/model',
        command=['python', '/model/'],
        arguments=[f'--files_path=/data/pysearchml/{model_name}/validation_test',
                   f'--index={index}', f'--es_host={es_host}',
                   f'--model_name={model_name}'],
        pvolumes={'/data': pvc}
    ).set_display_name('Run Test Step').after(post_model_op)

The pipeline is defined by receiving various input parameters such as the bucket and model_name and we’ll be able to change those values at execution time (as we’ll see soon).

Let’s see each component from the pipeline implementation and its purpose:

Step-by-step as implemented in pySearchML. Image by author.

1. prepare_env

Here’s how prepare_env component is defined:

prepare_op = dsl.ContainerOp(
    name='prepare env',
    image=f'{PROJECT_ID}/prepare_env',
    arguments=[f'--force_restart={force_restart}', f'--es_host={es_host}',
               f'--bucket={bucket}', f'--model_name={model_name}'],
    pvolumes={'/data': pvc}
)

  • image is a reference to the Docker image the component runs in that step.
  • arguments are input parameters sent to the script executed at the Docker image’s ENTRYPOINT.
  • pvolumes mounts the persistent volume claim at /data.

Here are all the files in prepare_env:

  • The main script is responsible for running queries against BigQuery and preparing Elasticsearch. One of its input arguments is model_name, which sets which folder to use as reference for processing data; lambdamart0 is a model that has already been implemented to work with the Google Analytics (GA) public sample dataset.
  • Dockerfile bundles the whole code together and has as its ENTRYPOINT the execution of the script:
FROM python:3.7.7-alpine3.12 as python

COPY kubeflow/components/prepare_env /prepare_env
WORKDIR /prepare_env
COPY ./key.json .

ENV GOOGLE_APPLICATION_CREDENTIALS=./key.json

RUN pip install -r requirements.txt

ENTRYPOINT ["python", ""]
  • lambdamart0 is a folder dedicated to the implementation of an algorithm with this respective name. It’s been built to process GA public data and works as an example of the system. Here are the files it contains:
  • ga_data.sql is a query responsible for retrieving documents from the GA public dataset and exporting them to Elasticsearch.
  • es_mapping.json is an index definition for each field of the documents.
  • features carries the values of X as discussed before. In the lambdamart0 example, it uses the GA public data as reference for building the features.

Notice the feature called name.json:

{ "query": { "bool": { "minimum_should_match": 1, "should": [ { "match": { "name": "{{search_term}}" } } ] } }, "params": ["search_term"], "name": "BM25 name"

The Learn-To-Rank plugin requires that each feature be defined as a valid Elasticsearch query; the resulting scores are what get associated to X.

In the previous example, it receives a parameter search_term and proceeds to match it against the name field of each document, returning the BM25 score, which effectively becomes our “X0”.

Using BM25 between the query and the name field is not enough to add personalization to results. Another feature that advances the customization layer is channel_group.json, defined as follows:

{ "query": { "function_score": { "query": { "match_all": {} }, "script_score" : { "script" : { "params": { "channel_group": "{{channel_group}}" }, "source": "if (params.channel_group == 'paid_search') { return doc[''].value * 10 } else if (params.channel_group == 'referral') { return doc[''].value * 10 } else if (params.channel_group == 'organic_search') { return doc[''].value * 10 } else if (params.channel_group == 'social') { return doc[''].value * 10 } else if (params.channel_group == 'display') { return doc[''].value * 10 } else if (params.channel_group == 'direct') { return doc[''].value * 10 } else if (params.channel_group == 'affiliates') { return doc[''].value * 10 }" } } } }, "params": ["channel_group"], "name": "channel_group"

It receives as input the parameter channel_group (channel that brought the user to our web store) and returns the CTR for that respective channel.

This effectively prepares the model to distinguish from users their origin and how to rank each group. Specifically, users coming from paid sources might behave differently than those coming in through organic channels for instance. If the training is good enough, the ranking algorithm should be prepared to handle these situations.

Still, this doesn’t say much about how to personalize results using intrinsic characteristics of each user. So, here’s one possibility for solving that. The feature avg_customer_price.json is defined as:

{ "query": { "function_score": { "query": { "match_all": {} }, "script_score" : { "script" : { "params": { "customer_avg_ticket": "{{customer_avg_ticket}}" }, "source": "return Math.log(1 + Math.abs(doc['price'].value - Float.parseFloat(params.customer_avg_ticket)))" } } } }, "params": ["customer_avg_ticket"], "name": "customer_avg_ticket"

It receives as input the parameter customer_avg_ticket and returns for each document the log of the distance between the average user ticket and the price of the document.

Now the ranking model can learn in the training phase how to manage the rank of each item based on how distant its price is to the average spending of the user on the website.

With those three types of features we can add a complete personalization layer to a search system on top of Elasticsearch. Features can be anything, as long as they can be abstracted into a valid search query and return some scoring metric that is eventually translated into our value X.

Here’s what happens next in the prepare_env component:

  • Features are exported to Elasticsearch.
  • An index is created on Elasticsearch that defines the documents’ fields.
  • Documents are queried from BigQuery and uploaded into Elasticsearch (see the sketch after this list).
  • RankLib requirements are created (feature set store and so on).
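
As a rough illustration of the index-creation and document-upload steps, here is a minimal sketch with the Python Elasticsearch client; the index name, mapping and document fields are simplified placeholders, not pySearchML’s actual schema.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')  # assumed local cluster

# index whose mapping defines the document fields (simplified)
es.indices.create(index='pysearchml', body={
    'mappings': {'properties': {
        'name': {'type': 'text'},
        'category': {'type': 'keyword'},
        'price': {'type': 'float'},
        'ctr': {'type': 'float'}
    }}
})

# documents as they would come out of the BigQuery export
actions = [
    {'_index': 'pysearchml', '_id': 'sku-1',
     '_source': {'name': 'women t-shirt', 'category': 'apparel', 'price': 10.5, 'ctr': 0.36}}
]
helpers.bulk(es, actions)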

For implementing a new model with new data and features, just create another folder inside prepare_env (something like modelname2) and set how it will query data and upload them to Elasticsearch.

2. Validation Datasets

This is a simple step. It consists of retrieving data from BigQuery containing what users searched, the context of the search and a list of products purchased.

Here’s the BigQuery query used for retrieving the data. It basically selects all users, their searches and purchases, and then combines them with their context. An example of the results:

{ "search_keys": { "search_term": "office", "channel_group": "direct", "customer_avg_ticket": "13" }, "docs": [ {"purchased":["GGOEGOAQ012899"]} ]

search_keys can contain any available information that sets the context of customers. In the previous example, we’re using their channel group and average spending ticket on the website.

This data is what we feed into the validation framework when computing the average rank as discussed before.

Notice that the system builds three different validation datasets: one for the training period, another for regular validation, and a third for the final testing step. The idea here is to analyze bias and variance for the trained models.

3. Training Dataset

This is the component responsible for building the RankLib training file as discussed before. The complete script is actually quite simple. First, it downloads from BigQuery the input clickstream data, which consists of users’ interactions on search pages. Here’s an example:

{ "search_keys": { "search_term": "drinkware", "channel_group": "direct", "customer_avg_ticket": "20" }, "judgment_keys": [ { "session": [ {"doc":"GGOEGDHC017999","click":"0","purchase":"0"}, {"doc":"GGOEADHB014799","click":"0","purchase":"0"}, {"doc":"GGOEGDHQ015399","click":"1","purchase":"0"} ] } ]

Notice that the keys associated with the search are aggregated together inside search_keys. Those values are the ones we send to Elasticsearch to fill in each feature X appropriately, as discussed in prepare_env. In the previous JSON example, we know that the user’s search context is:

  • Searched for drinkware.
  • Came to the store directly.
  • Spent on average $20 on the website.

judgment_keys combines user sessions, composed of the documents they saw on the search page and their interactions with each document.

This information is then sent to pyClickModels, which processes the data and evaluates the judgment for each query-document pair. The result is a newline-delimited JSON file as follows:

{ "search_term: bags|channel_group:organic_search|customer_avg_ticket:30": { "GGOEGBRJ037299": 0.3333377540111542, "GGOEGBRA037499": 0.222222238779068, "GGOEGBRJ037399": 0.222222238779068 }

Notice that the key is the entire search context (the search term concatenated with the user’s context variables), not just the query itself.

As discussed before, we want our search engine to be aware of context and further optimize on top of that. As a consequence, judgments are extracted based on the entire selected context, not just the search_term.

By doing so, we can differentiate documents for each context and we’d have scenarios where a product receives judgment 4 for customers coming from northern regions and judgment 0 otherwise, as an example.

Notice that the judgment values, as given by pyClickModels, range between 0 and 1. As the Learn-To-Rank Elasticsearch plugin is built on top of RankLib, these values are expected to be integers between 0 and 4, inclusive. What we do then is transform the values using their percentiles as reference. Here’s the complete code for building the final judgment files:

def build_judgment_files(model_name: str) -> None:
    model = DBN.DBNModel()
    clickstream_files_path = f'/tmp/pysearchml/{model_name}/clickstream/'

    model_path = f'/tmp/pysearchml/{model_name}/model/model.gz'
    rmtree(os.path.dirname(model_path), ignore_errors=True)
    os.makedirs(os.path.dirname(model_path))

    judgment_files_path = f'/tmp/pysearchml/{model_name}/judgments/judgments.gz'
    rmtree(os.path.dirname(judgment_files_path), ignore_errors=True)
    os.makedirs(os.path.dirname(judgment_files_path))

    # fit the click model on the clickstream export and dump the judgments
    model.fit(clickstream_files_path, iters=10)
    model.export_judgments(model_path)

    with gzip.GzipFile(judgment_files_path, 'wb') as f:
        for row in gzip.GzipFile(model_path):
            row = json.loads(row)
            result = []

            search_keys = list(row.keys())[0]
            docs_judgments = row[search_keys]
            search_keys = dict(e.split(':') for e in search_keys.split('|'))

            judgments_list = [judge for doc, judge in docs_judgments.items()]

            # skip queries where every document received the same judgment
            if all(x == judgments_list[0] for x in judgments_list):
                continue

            percentiles = np.percentile(judgments_list, [20, 40, 60, 80, 100])

            judgment_keys = [
                {'doc': doc, 'judgment': process_judgment(percentiles, judgment)}
                for doc, judgment in docs_judgments.items()
            ]
            result = {'search_keys': search_keys, 'judgment_keys': judgment_keys}
            f.write(json.dumps(result).encode() + '\n'.encode())


def process_judgment(percentiles: list, judgment: float) -> int:
    if judgment <= percentiles[0]:
        return 0
    if judgment <= percentiles[1]:
        return 1
    if judgment <= percentiles[2]:
        return 2
    if judgment <= percentiles[3]:
        return 3
    if judgment <= percentiles[4]:
        return 4

Here’s an example of the output of this step:

{ "search_keys": { "search_term": "office", "channel_group": "organic_search", "customer_avg_ticket": "24" }, "judgment_keys": [ {"doc": "0", "judgment": 0}, {"doc": "GGOEGAAX0081", "judgment": 4}, {"doc": "GGOEGOAB016099", "judgment": 0} ]

This data needs to be transformed into the required training file for RankLib. This is where we combine the judgments, documents and query context with the features (here’s the code example for retrieving X from Elasticsearch).

Each JSON row from the previous step, containing the search context and judgment keys, is looped over and sent as a query against Elasticsearch with the search_keys as input parameters. The result is each value of X as defined in the earlier prepare_env step.

The end result is a training file that looks something like this:

0	qid:0	1:3.1791792	2:0	3:0.0	4:2.3481672
4	qid:0	1:3.0485907	2:0	3:0.0	4:2.3481672
0	qid:0	1:3.048304	2:0	3:0.0	4:0
0	qid:0	1:2.9526825	2:0	3:0.0	4:0
4	qid:1	1:2.7752903	2:0	3:0.0	4:3.61228
0	qid:1	1:2.8348017	2:0	3:0.0	4:2.3481672

For each query and for each document we have the estimated judgment as computed by pyClickModels, the id of the query and then a list of features X with their respective values.

With this file, we can now train the ranking algorithms.

4. Katib Optimization

Katib is a tool from Kubeflow that offers an interface for automatic hyperparameter optimization. It has several available methods; Bayesian Optimization is the one selected in pySearchML.

Example of the algorithm Bayesian Optimization. As it samples more data points from the allowed domain, the closer it may get to the optimum value of a given function. In pySearchML, the domain is a set of variables that sets how a ranker should fit the data and the cost function it’s optimizing is the average rank. Image taken from Wikimedia Foundation.

What Katib does is select, for each hyperparameter, a new value based on a trade-off between exploration and exploitation. It then tests the new model and observes the results, which are used in future steps.

For pySearchML, each parameter is an input of RankLib which sets how the model will be fit (such as how many trees to use, total leaf nodes, how many neurons in a net and so on).

Katib is defined through a Custom Resource of Kubernetes. We can run it by defining a YAML file and deploying it to the cluster, something like:

kubectl create -f katib_def.yaml

What Katib will do is read through the YAML file and start trials, each one experimenting with a specific set of hyperparameter values. It can instantiate multiple pods running in parallel, executing the code specified in the experiment definition.

Here are the files in this step. The launcher script is responsible for launching Katib from Python: it receives input arguments, builds a YAML definition and uses the Kubernetes API to start Katib from the script itself.

experiment.json works as a template for the definition of the experiment. Here’s its definition:

{ "apiVersion": "", "kind": "Experiment", "metadata": { "namespace": "kubeflow", "name": "", "labels": { "": "1.0" } }, "spec": { "objective": { "type": "minimize", "objectiveMetricName": "Validation-rank", "additionalMetricNames": [ "rank" ] }, "algorithm": { "algorithmName": "bayesianoptimization" }, "parallelTrialCount": 1, "maxTrialCount": 2, "maxFailedTrialCount": 1, "parameters": [], "trialTemplate": { "goTemplate": { "rawTemplate": { "apiVersion": "batch/v1", "kind": "Job", "metadata":{ "name": "{{.Trial}}", "namespace": "{{.NameSpace}}" }, "spec": { "template": { "spec": { "restartPolicy": "Never", "containers": [ { "name": "{{.Trial}}", "image": "{PROJECT_ID}/model", "command": [ "python /model/ --train_file_path={train_file_path} --validation_files_path={validation_files_path} --validation_train_files_path={validation_train_files_path} --es_host={es_host} --destination={destination} --model_name={model_name} --ranker={ranker} {{- with .HyperParameters}} {{- range .}} {{.Name}}={{.Value}} {{- end}} {{- end}}" ], "volumeMounts": [ { "mountPath": "/data", "name": "pysearchmlpvc", "readOnly": false } ] } ], "volumes": [ { "name": "pysearchmlpvc", "persistentVolumeClaim": { "claimName": "pysearchml-nfs", "readOnly": false } } ] } } } } } } }

It essentially defines how many pods to run in parallel and which Docker image to run for each trial, along with its input command. Notice that the number of pods running in parallel as well as the maximum number of trials are hard-coded in pySearchML; a better approach would be to receive those parameters from the pipeline execution and replace them accordingly. The launcher reads this template, builds the final YAML definition and sends it to Kubernetes, which starts the Katib process.
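
For illustration, creating a Katib Experiment from Python can look roughly like the sketch below, which uses the official Kubernetes client; the group, version and namespace values are assumptions that depend on the Katib release installed, and experiment.json refers to the template shown above.

import json
from kubernetes import client, config

# experiment definition built by filling in the experiment.json template
experiment = json.load(open('experiment.json'))

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

client.CustomObjectsApi().create_namespaced_custom_object(
    group='kubeflow.org',
    version='v1alpha3',   # Katib CRD version; adjust to the installed release
    namespace='kubeflow',
    plural='experiments',
    body=experiment
)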

One of the input parameters is the ranker, which is the ranking algorithm to select from RankLib (such as LambdaMART, ListNet and so on). Each ranker has its own set of parameters; here’s the example for the LambdaMART algorithm:

def get_ranker_parameters(ranker: str) -> List[Dict[str, Any]]:
    return {
        'lambdamart': [
            {"name": "--tree", "parameterType": "int",
             "feasibleSpace": {"min": "1", "max": "500"}},
            {"name": "--leaf", "parameterType": "int",
             "feasibleSpace": {"min": "2", "max": "40"}},
            {"name": "--shrinkage", "parameterType": "double",
             "feasibleSpace": {"min": "0.01", "max": "0.2"}},
            {"name": "--tc", "parameterType": "int",
             "feasibleSpace": {"min": "-1", "max": "300"}},
            {"name": "--mls", "parameterType": "int",
             "feasibleSpace": {"min": "1", "max": "10"}}
        ]
    }.get(ranker)

Katib will select parameters from the domain defined above and run the training step, where RankLib is effectively used to train the ranking models. An example of the command as implemented in Python:

cmd = ('java -jar ranklib/RankLib-2.14.jar -ranker '
       f'{ranker} -train {args.train_file_path} -norm sum -save '
       f'{args.destination}/model.txt '
       f'{(" ".join(X)).replace("--", "-").replace("=", " ")} -metric2t ERR')

This string is sent to a subprocess call (notice it requires Java because of RankLib) which starts the training process. The result is a newly trained ranking model that can be exported to Elasticsearch.

As soon as the model is fit, the validation script is invoked to compute the expected rank. The steps that take place are:

  • The script loops through each JSON from the validation dataset.
  • Each row contains the search context which is then used to build an Elasticsearch query. Here’s the query used by the model lambdamart0 which we’ll use later on:
{ "query": { "function_score": { "query": { "bool": { "must": { "bool": { "minimum_should_match": 1, "should": [ { "multi_match": { "operator": "and", "query": "{query}", "type": "cross_fields", "fields": [ "sku", "name", "category" ] } } ] } } } }, "functions": [ { "field_value_factor": { "field": "", "factor": 10, "missing": 0, "modifier": "none" } } ], "boost_mode": "sum", "score_mode": "sum" } }, "rescore": { "window_size": "{window_size}", "query": { "rescore_query": { "sltr": { "params": "{search_keys}", "model": "{model_name}" } }, "rescore_query_weight": 20, "query_weight": 0.1, "score_mode": "total" } }
  • Given the recently built query, a request is sent to Elasticsearch.
  • A comparison happens between the search results and purchased documents.

Here’s the code responsible for building the Elasticsearch query:

import json
from typing import Any, Dict


def get_es_query(
    search_keys: Dict[str, Any],
    model_name: str,
    es_batch: int = 1000
) -> Dict[str, Any]:
    """
    Builds the Elasticsearch query to be used when retrieving data.

    Args
    ----
      search_keys: Dict[str, Any]
          Search query sent by the customer as well as other variables that
          set its context, such as region, favorite brand and so on.
      model_name: str
          Name of the RankLib model saved on Elasticsearch.
      es_batch: int
          How many documents to retrieve.

    Returns
    -------
      query: Dict[str, Any]
          Final query, ready to be sent to Elasticsearch.
    """
    # it's expected that an ES query will be available at:
    # ./queries/{model_name}/es_query.json
    query = open(f'queries/{model_name}/es_query.json').read()
    query = json.loads(query.replace('{query}', search_keys['search_term']))
    # We just want to retrieve the id of the document to evaluate the ranks
    # between customer purchases and the retrieved list of results.
    query['_source'] = '_id'
    query['size'] = es_batch
    query['rescore']['window_size'] = 50  # hardcoded to optimize the first 50 SKUs
    query['rescore']['query']['rescore_query']['sltr']['params'] = search_keys
    query['rescore']['query']['rescore_query']['sltr']['model'] = model_name
    return query

Notice that the rescore_query parameter is what triggers the machine-learning layer of the Elasticsearch learning-to-rank plugin.

Finally, the function compute_rank puts it all together as shown below:

import os
from typing import Dict, List

import numpy as np
from elasticsearch import Elasticsearch


def compute_rank(
    search_arr: List[str],
    purchase_arr: List[List[Dict[str, List[str]]]],
    rank_num: List[float],
    rank_den: List[float],
    es_client: Elasticsearch
) -> None:
    """
    Sends queries against Elasticsearch and compares results with what
    customers purchased. Computes the average rank position of where the
    purchased document falls within the retrieved items.

    Args
    ----
      search_arr: List[str]
          Searches made by customers as observed in validation data. We send
          those against Elasticsearch and compare results with purchased data.
      purchase_arr: List[List[Dict[str, List[str]]]]
          List of documents that were purchased by customers.
      rank_num: List[float]
          Numerator value of the rank equation. Defined as list to emulate a
          pointer.
      rank_den: List[float]
          Denominator value of the rank equation. Defined as list to emulate a
          pointer.
      es_client: Elasticsearch
          Python Elasticsearch client.
    """
    idx = 0
    if not search_arr:
        return
    # msearch receives alternating header/body lines joined by newlines.
    request = os.linesep.join(search_arr)
    response = es_client.msearch(body=request, request_timeout=60)
    for hit in response['responses']:
        docs = [doc['_id'] for doc in hit['hits'].get('hits', [])]
        if not docs or len(docs) < 2:
            continue
        purchased_docs = [
            docs for purch in purchase_arr[idx] for docs in purch['purchased']
        ]
        # Positions (0-based) of purchased documents within the retrieved list.
        ranks = np.where(np.in1d(docs, purchased_docs))[0]
        idx += 1
        if ranks.size == 0:
            continue
        rank_num[0] += ranks.sum() / (len(docs) - 1)
        rank_den[0] += ranks.size
    print('rank num: ', rank_num[0])
    print('rank den: ', rank_den[0])

Katib instantiates sidecar pods that keep reading the stdout of the training pod. When a sidecar identifies the string Validation-rank=(...), it uses that value as the result of the optimization process.
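
In other words, all the training code has to do is print the metric in that exact format once the validation loop finishes. A minimal sketch, assuming rank_num and rank_den were accumulated by compute_rank as above:

# The final metric is the average normalized rank; Katib's sidecar parses the
# "Validation-rank=<value>" pattern from the pod's stdout.
final_rank = rank_num[0] / max(rank_den[0], 1)  # guard against division by zero
print(f'Validation-rank={final_rank}')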

A persistent volume is used in the process to save the definition of the best model trained by Katib, which is then used by the next component.

5. Post RankLib Model

The most difficult parts are already done. At this point the script simply retrieves the definition of the best model, saved as a text file, and uploads it to Elasticsearch.

Notice that one of the main advantages of this design is that this component could export the model to a production Elasticsearch cluster while the whole optimization happens on a staging replica.
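
As a rough illustration, uploading a RankLib model through the learning-to-rank plugin could look like the sketch below. The Elasticsearch host, file path and feature-set name are assumptions; the endpoint follows the plugin's convention for creating a model from an existing feature set:

import json
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://elasticsearch:9200')  # host is an assumption

# Definition of the best model, saved by the Katib step on the shared volume
# (the exact path is hypothetical).
definition = open('/data/pysearchml/best_model.txt').read()

body = {
    'model': {
        'name': 'lambdamart0',
        'model': {'type': 'model/ranklib', 'definition': definition},
    }
}
# Create the model under an existing feature set so it can be referenced by
# the sltr rescore query shown earlier.
es_client.transport.perform_request(
    'POST', '/_ltr/_featureset/pysearchml/_createmodel', body=body
)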

6. Final Testing

Finally, with the best model exported to Elasticsearch, the system has its optimized ranking model at its disposal. In this step, a final validation is executed against the test dataset, both to verify that everything worked as expected and to provide further information on whether the system suffers from bias or variance problems (such as overfitting).

That’s pretty much it! Let’s run some code now to see this whole framework in action.

4. Hands-On Section

Time to implement the whole architecture in practice! The complete code is available in the pySearchML repository:


In this section, we'll be using GCP to run the code with real data. Also, keep in mind that there will be costs (a few cents) associated with running this experiment.

For those new to GCP, there’s a $300 free credit gift that lasts for a year; just sign in and create a project for this tutorial (pysearchml for instance). You should end up with access to a dashboard that looks like this:

Example of a GCP dashboard project. Image by author.

gcloud will be required for interacting with GCP through the command line. Installing it is quite straightforward. After the initial setup, make sure you can log in by running:

gcloud auth login

Now the rest is quite simple. Clone pySearchML to your local machine:

git clone pysearchml && cd pySearchML

Enable Kubernetes Engine in your project. After that, just trigger the Cloud Build execution, which is responsible for creating the whole required infrastructure (this step should take somewhere between 5 and 10 minutes).

Here’s how the build triggers the run:

_VERSION='0.0.0' ./kubeflow/build/
gcloud builds submit --no-source --config kubeflow/build/cloudbuild.yaml --substitutions $SUBSTITUTIONS --timeout=2h

You can choose appropriate values for the SUBSTITUTIONS variable. Notice that _VERSION sets the pipeline version to be exported to Kubeflow. After everything is set, just run the script:

Steps executed on cloudbuild. Image by author.
  1. Prepares secret keys to allow access and authorization into GCP tools.
  2. Prepares known_hosts on the build machine.
  3. Clones pySearchML locally.
  4. The file that runs in step 4 is responsible for creating the Kubernetes cluster on Google Kubernetes Engine (GKE) as well as deploying Elasticsearch, Kubeflow and Katib. In parallel, all Docker images required by the system are built and pushed to Google Container Registry (GCR) for later use in Kubeflow.
  5. Runs several unit tests, which ended up being important to confirm the system works as expected. The Kubeflow pipeline is also compiled in parallel.
  6. Finally, deploys the pipeline to the cluster.

After it’s done, if you browse to your console and select “Kubernetes Engine”, you’ll see that it’s already up and running:

Kubernetes cluster deployed into GKE ready to run. Image by author.

It's a small cluster, as we won't be using much data; this also helps to further save on costs.

Kubeflow and Katib have already been installed. In order to access them, first connect gcloud to the cluster by running:

gcloud container clusters get-credentials pysearchml

After that, port-forward the service that exposes the Kubeflow Pipelines UI to your local machine by running:

kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80 1>/dev/null &

Now, if you access your localhost on port 8080, this is what you should see:

Kubeflow Dashboard. Image by author.

And the pipeline is ready for execution.

This experiment uses the public Google Analytics sample dataset, which consists of a small sample of customers browsing the Google Store. It spans from 2016-08-01 up to 2017-08-01 and contains what each user searched for and how they interacted with the search results.

Select pysearchml_0.0.0 and then select “+Create Run”. You should see a screen with all possible input parameters as defined in the Python pipeline script. After choosing appropriate parameters, just run the code.

After execution, here’s the expected result:

Full execution of the entire pipeline as defined in pySearchML. Image by author.

Output of Katib component:

Example of output as printed by the Katib component. Image by author.

We can see a rank of 36.05%. Then, we can compare results from the test component:

Example of output as printed by test component. Image by author.

The rank here is 40.16%, which is a bit worse than on the validation dataset. This might indicate that the model suffers from some overfitting; more data or further parameter exploration might help alleviate the problem.

And, pretty much, there you have it! Elasticsearch now has a fully trained machine-learning layer that improves results based on the customers' context.

If you want to navigate through the files created for each step, there’s an available deployment for that. In the pySearchML folder, just run:

kubectl apply -f kubeflow/disk-busybox.yaml

If you run kubectl -n kubeflow get pods you’ll see that the name of one of the pods is something like “nfs-busybox-(…)”. If you exec into it you’ll have access to the files:

kubectl -n kubeflow exec -it nfs-busybox-(...) sh

They should be located at /mnt/pysearchml.

There's also a quick and dirty visualizer for the whole process. Just run:

kubectl port-forward service/front -n front 8088:8088 1>/dev/null &

Then open localhost:8088 in your browser. You should see this (quick and ugly) interface:

Front-end interface for running queries on the newly trained Elasticsearch. Image by author.

Example of results:

It not only lets us play around with the results but also gives us a better sense of whether the optimization pipeline is working.

And that's pretty much all it takes to have a complete search engine running with AI optimization, ready to handle incoming traffic for any store.

5. Conclusion

Now, that was a challenge!

Building pySearchML was quite tough and I can safely say it was one of the most brutal challenges I've ever faced 😅. Countless designs, architectures and infrastructures were considered, but most failed.

The realization of integrating the whole process on top of Kubeflow and Katib came only later on when several alternatives had already been tested.

The advantage of this design is how simple and direct the final code becomes. It's fully modular: each component is responsible for a simple task and Kubeflow orchestrates the whole execution. On top of that, we can focus mainly on the code development and let Katib do the hard work of finding the best parameters.

The development process was not straightforward. Several lessons had to be learned including concepts from Kubernetes and its available resources. Still, it was all well worth it. As a result, an entire search engine could be built from scratch with a few lines of code, ready to handle real traffic.

As a next step, one could consider replacing RankLib with a deep learning algorithm that would extract further context from the data. One of the main challenges in doing so is that the response time of the system could increase, as could the costs (pros and cons have to be evaluated).

Regardless of the ranking algorithm used, the architecture remains the same for the most part.

Hopefully, that was a useful post for those working in this field. Now it’s time for us to take some rest, meditate on the lessons learned and prepare ourselves for the next adventure :).

As for this post, it certainly deserves being concluded with the mission accomplished soundtrack.

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.



Continue Reading


Scientists Made a Biohybrid Nose Using Cells From Mosquitoes




Thanks to biological parts of a mosquito’s “nose,” we’re finally closer to Smell-O-Vision for computers. And a way to diagnose early cancer.

With the recent explosion in computing hardware prowess and AI, we’ve been able to somewhat adequately duplicate our core senses with machines. Computer vision and bioengineered retinas tag-teamed to bolster artificial vision. Smart prosthetics seamlessly simulate touch, pain, pressure, and other skin sensations. Hearing devices are ever more capable of isolating specific sounds from a muddy cacophony of noises.

Yet what’s been missing is the sense of smell. Although humans are hardly smelling maestros in the animal kingdom, we can nevertheless detect roughly one trillion odors, often with only a single scent molecule drifting into our noses. Despite their best efforts, translating this prowess into a digital, artificial computer “nose” has eluded scientists. One major problem is reverse-engineering the sensitivity of our biological smelling mechanics—otherwise known as our olfactory system.

“Odors, airborne chemical signatures, can carry useful information about environments … However, this information is not harnessed well due to a lack of sensors with sufficient sensitivity and selectivity,” said Dr. Shoji Takeuchi from the Biohybrid Systems Laboratory at the University of Tokyo.

But here’s the thing: evolution already came up with a solution. Rather than rebuild olfaction from scratch, why not just use what’s available?

In a new paper published in Science Advances, Takeuchi and colleagues did just that. They tapped the odor-sensing components of a Yellow Fever mosquito (yikes), and rebuilt the entire construct with synthetic biology. Using a parallel chip design, they then carefully placed these biological components as an array onto a chip and monitored the setup with a computer.

By infusing odor chemicals into tiny liquid droplets—think essential oils mixing with liquid in a humidifier—the chip could detect odors with unprecedented sensitivity. Because each biological odor-sensing component is tailored to its favorite chemical, it's possible in theory to use a single 16-channel chip to detect over a quadrillion mixes of odors.

“These biohybrid sensors…are highly sensitive,” said Takeuchi. With AI, he added, the biohybrid sensor could be further amped up to analyze increasingly complex mixes of chemicals, both in the environment and as a disease-hunting breathalyzer to monitor health.

The Anatomy of Smell

The animal olfactory system is an evolutionary work of art. Although specific biological setups differ between species, the general concept is pretty similar.

It all starts in the nose. Our noses are densely packed with olfactory cells, huddled high up in our nasal passages. Think of these cells as fatty bubbles, each with electrical wires to transfer information up the chain. Dotted along the outer rim of the bubbles are proteins, called olfactory receptors. These are basically “smart” tunnels that connect the outside environment with the interior sanctuary of the cells.

Here’s the crux: each tunnel, or receptor, is tailored to a single odor chemical. It’s normally shut. When an odor—say, vanillin, the dominant chemical in vanilla—drifts into the nasal passage, it grasps onto its preferred receptor. This action opens the receptor tunnel, causing ions to flow into the cell. Translation? It triggers an electrical current, a sort of “on” signal, that gets sent to the brain.

Now take something more complicated in scent, say, French toast. Multiple chemical molecules grab onto their respective odor receptors. Each sends a current, and these currents are analyzed together as they race along nerve highways towards the brain. Based on previous experience, the brain can then analyze the combined data and determine "oh, that's French toast!"

As you can tell, combination is key. It’s how our 400 or so olfactory receptors can detect a trillion different odors. It’s also what the Japanese team tapped into for inspiration in their new artificial “nose.”

Mosquitoes to the Rescue

Rather than reconstructing a human smell receptor, the team turned to mosquitoes. It’s not that the team particularly loves these blood-sucking demons. It’s because the mosquito odor receptors are built to heighten their sense of smell. In addition to the usual odor-grabbing component, these nuisance bugs also have a separate component that amplifies the electrical signal in a biological way. This allows them to smell with higher sensitivity, while being able to minutely discern what they’re smelling—blood or Deet. It’s a trait that’s terrible for us, but great for an artificial nose.

The next step was to reconstruct the cell structure on a chip. Here, the team used a proprietary recipe to carefully micro-engineer two bubbles, each filled with liquid, smashed together horizontally, forming a sort of squished figure-eight. At the nexus between the two bubbles is a structure that mimics a cell’s outer shell—its membrane, with two fatty layers. The team called this their artificial cell membrane.

They then tapped synthetic biology to make mosquito odor receptors from scratch using DNA. These receptors are embedded into the artificial cell membrane. The entire contraption—think two oil tanks, closely snuggled together but separated by the odor-sensing membrane—was then plopped onto a chip. Each odor-detection unit sat on top of a bunch of slits on the chip, as a sort of venting mechanism.

Here’s how the odor detection works. Each chip has engraved “channels,” allowing any odor to flow towards each detection unit. At a unit, the odor chemicals are then shuttled into the slits using blasts of nitrogen air. This “blows” the odor molecules to mix better with the liquid in the bubbles, so that the odor-detection component gets more of a “whiff.” Think stirring a pot of chili so you can smell it better—that’s the general idea.

Once the odor molecules grabbed onto the embedded mosquito smell sensors, the sensors generated an electrical current. Using an electrode and a computer, the team could then monitor for these currents. Rather than relying on a single device, they linked up 16 on a chip to further increase sensitivity.

As a proof of concept, the team fed an odor molecule called octenol to their biohybrid nose. Octenol is often detected in the breath of cancer patients. At just a few parts per billion—a range similar to our natural noses—the biohybrid “nose” was able to reliably pick up the smell, with over 90 percent accuracy.

In another test, the team had a volunteer breathe heavily into a bag. They then connected the bag, through a regulator for constant air flow, to the artificial nose. “Although the human breath contains about 3,000 different metabolites [molecules],” the team said, for a healthy human without octenol in their breath, the nose remained silent. However, once a tiny bit of octenol gas was added, the bionic “nose” immediately generated an electrical bump indicating its presence.

A Fragrant Future

For now, the team has only tested the contraption one scent at a time. But because the sensing channels are plug-and-play, they’re confident of a fragrant, blossoming future. Mixing and matching different odor receptors, for example, could lead to artificial noses that outwit our own.

Eventually, the team hopes to use their bionic nose as an affordable and portable way to detect early stages of diseases. With cancer and other health problems, the mixture of body odors can dramatically shift. It’s how dogs and other animals with a more heightened sense of smell than our own can detect health issues at early stages. In addition, a biohybrid nose could also venture into contaminated wastelands to screen for toxic chemicals without harming a living being.

To complement the “mosquito nose” hardware, the team is already looking towards more sophisticated AI software. “I would like to expand upon the analytical side of the system by using some kind of AI. This could enable our biohybrid sensors to detect more complex kinds of molecules,” said Takeuchi.

Image Credit: Richárd Ecsedi on Unsplash

