This is a guest post from Euler Hermes. In their own words, “For over 100 years, Euler Hermes, the world leader in credit insurance, has accompanied its clients to provide simpler and safer digital products, thus becoming a key catalyzer in the world’s commerce.”
Euler Hermes manages more than 600,000 B2B transactions per month and performs data analytics on more than 30 million companies worldwide. At-scale artificial intelligence and machine learning (ML) have become the heart of the business.
Euler Hermes uses ML across a variety of use cases. One recent example is typo squatting detection, which came about after an ideation workshop between the Cybersecurity and IT Innovation teams to better protect clients. As it turns out, moving from idea to production has never been easier when your data is in the AWS Cloud and you can put the right tools in the hands of your data scientists in minutes.
Typo squatting, or hijacking, is a form of cybersecurity attack. It consists of registering internet domain names that closely resemble legitimate, reputable, and well-known ones with the goal of phishing scams, identity theft, advertising, and malware installation, among other potential issues. The sources of typo squatting can be varied, including different top-level domains (TLD), typos, misspellings, combo squatting, or differently phrased domains.
The challenge we faced was building an ML solution to quickly detect any suspicious domains registered that could be used to exploit the Euler Hermes brand or its products.
To simplify the ML workflow and reduce time-to-market, we opted to use Amazon SageMaker. This fully managed AWS service was a natural choice due to the ability to easily build, train, tune, and deploy ML models at scale without worrying about the underlying infrastructure while being able to integrate with other AWS services such as Amazon Simple Storage Service (Amazon S3) or AWS Lambda. Furthermore, Amazon SageMaker meets the strict security requirements necessary for financial services companies like Euler Hermes, including support for private notebooks and endpoints, encryption of data in transit and at rest, and more.
To build and tune ML models, we used Amazon SageMaker notebooks as the main working tool for our data scientists. The idea was to train an ML model to recognize domains related to Euler Hermes. To accomplish this, we worked on the following two key steps: dataset construction and model building.
Every ML project requires a lot of data, and our first objective was to build the training dataset.
The dataset of negative examples was composed of 1 million entries randomly picked from Alexa, Umbrella, and publicly registered domains, whereas the dataset of 1 million positive examples was created with a domain generation algorithm (DGA) applied to Euler Hermes’s internal domains.
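To make the idea of generating positive examples concrete, here is a minimal sketch of how look-alike domains could be derived from a legitimate one. It is not the algorithm used in this project: the function name `typo_variants`, the four mutation rules, and the `example.com` input are all assumptions chosen for illustration.

```python
def typo_variants(domain, tlds=("com", "net", "org")):
    """Return a sorted list of look-alike domains for `domain`.

    Hypothetical helper: the mutation rules below are illustrative,
    not the DGA described in the post.
    """
    name, _, tld = domain.rpartition(".")
    variants = set()

    # Character omission: 'example' -> 'exmple'
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:] + "." + tld)

    # Adjacent transposition: 'example' -> 'exmaple'
    for i in range(len(name) - 1):
        swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
        variants.add(swapped + "." + tld)

    # Character duplication: 'example' -> 'exxample'
    for i in range(len(name)):
        variants.add(name[:i] + name[i] + name[i:] + "." + tld)

    # Different top-level domain: 'example.com' -> 'example.net'
    for other in tlds:
        variants.add(name + "." + other)

    # The legitimate domain itself is not a positive example.
    variants.discard(domain)
    return sorted(variants)


if __name__ == "__main__":
    for candidate in typo_variants("example.com")[:5]:
        print(candidate)
```

Each rule corresponds to one of the typo squatting sources listed earlier (typos, misspellings, different TLDs). A real generator would likely add more rules, such as homoglyph substitution or combo squatting with extra keywords.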
Model building and tuning
One of the project’s biggest challenges was to minimize the number of false positives. Every day, we need to pick out the domains related to Euler Hermes from a large dataset of approximately 150,000 publicly registered domains.
We tried two approaches: classical ML models and deep learning.
We considered various models for classical ML, including random forest, logistic regression, and gradient boosting (LightGBM and XGBoost). For these models, we manually created more than 250 features. After an extensive feature-engineering phase, we selected the following as the most relevant:
- Number of FQDN levels
- Vowel ratio
- Number of characters
- Bag of n-grams (top 50 n-grams)
- TF-IDF features
- Latent Dirichlet allocation features
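As an illustration of the simpler features in the list above, the following sketch computes the number of FQDN levels, the character count, the vowel ratio, and the most frequent character n-grams using only the Python standard library. The exact definitions are assumptions (for example, the vowel ratio is taken here as vowels divided by alphabetic characters), and the function names are hypothetical.

```python
from collections import Counter

VOWELS = set("aeiou")


def char_ngrams(text, n=3):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lexical_features(fqdn):
    """Compute a few lexical features for one fully qualified domain name."""
    fqdn = fqdn.lower().rstrip(".")
    letters = [c for c in fqdn if c.isalpha()]
    vowel_count = sum(c in VOWELS for c in letters)
    return {
        # 'login.example.com' has three levels: login, example, com
        "fqdn_levels": fqdn.count(".") + 1,
        "num_chars": len(fqdn),
        # Assumed definition: vowels over alphabetic characters
        "vowel_ratio": vowel_count / len(letters) if letters else 0.0,
    }


def top_ngrams(domains, n=3, k=50):
    """Return the k most frequent character n-grams across `domains`."""
    counts = Counter()
    for domain in domains:
        counts.update(char_ngrams(domain.lower(), n))
    return [gram for gram, _ in counts.most_common(k)]


if __name__ == "__main__":
    print(lexical_features("login.example.com"))
```

The vocabulary returned by `top_ngrams` could then be used to build a bag-of-n-grams vector for each domain. TF-IDF and latent Dirichlet allocation features would typically come from a library such as Scikit-learn and are not shown here.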
For deep learning, we decided to work with recurrent neural networks. The model we adopted was a Bidirectional LSTM (BiLSTM) with an attention layer. We found this model to be the best at extracting a URL’s underlying structure.
The following diagram shows the architecture designed for the BiLSTM model. To avoid overfitting, a Dropout layer was added.
The following code orchestrates the set of layers:
We built and tuned the classical ML and the deep learning models using the Amazon SageMaker-provided containers for Scikit-learn and Keras.
The following table summarizes the results we obtained. The BiLSTM outperformed the other models with a 13% precision improvement compared to the second-best model (LightGBM). For this reason, we put the BiLSTM model into production.
(AUC: Area Under the Curve)
For model training, we used Managed Spot Training in Amazon SageMaker, which runs training jobs on Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances. This allowed us to train models at a lower cost than with On-Demand Instances.
Because we predominantly used custom deep learning models, we needed GPU instances for time-consuming neural network training jobs, with times ranging from minutes to a few hours. Under these constraints, Managed Spot Training was a game-changing solution. Because the service manages instance-stopping conditions itself, our data scientists could keep working without interruption.
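For readers who want to see what enabling Managed Spot Training involves, here is a small sketch that assembles the relevant estimator settings. The keyword names `use_spot_instances`, `max_run`, `max_wait`, and `checkpoint_s3_uri` are the ones used by version 2 of the SageMaker Python SDK; the helper `spot_training_kwargs` itself is hypothetical, and the values shown are placeholders rather than the settings used in this project.

```python
def spot_training_kwargs(max_run_s, max_wait_s, checkpoint_s3_uri=None):
    """Build estimator keyword arguments for Managed Spot Training.

    Hypothetical helper. `max_wait_s` covers both the time spent waiting
    for Spot capacity and the training time, so it must be at least
    `max_run_s`.
    """
    if max_wait_s < max_run_s:
        raise ValueError("max_wait must be greater than or equal to max_run")

    kwargs = {
        "use_spot_instances": True,
        "max_run": max_run_s,
        "max_wait": max_wait_s,
    }
    # Checkpoints let an interrupted job resume instead of starting over.
    if checkpoint_s3_uri:
        kwargs["checkpoint_s3_uri"] = checkpoint_s3_uri
    return kwargs


if __name__ == "__main__":
    # Placeholder values: up to 1 hour of training, up to 2 hours overall.
    print(spot_training_kwargs(3600, 7200))
```

The resulting dictionary would be unpacked into an estimator constructor alongside the usual arguments (role, instance type, training script, and so on).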
Euler Hermes’s cloud principles follow a serverless-first strategy, with an Infrastructure as Code DevOps practice. Systematically, we construct a serverless architecture based on Lambda whenever possible, but when this isn’t possible, we deploy to containers using AWS Fargate.
Amazon SageMaker allows us to deploy our ML models at scale within the same platform on a 100% serverless and scalable architecture. It creates a model endpoint that is ready to serve inference requests. To get inferences for an entire dataset, we use batch transform, which splits the dataset into smaller batches and gets the predictions for each one. Batch transform manages all the compute resources required to get inferences, including launching instances and deleting them after the batch transform job is complete.
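The split-and-predict pattern behind batch inference can be sketched in a few lines. This is a conceptual illustration only, not how Amazon SageMaker implements batch transform: `predict_in_batches` and the toy `predict` callable are hypothetical names, and the real service also handles provisioning and tearing down instances.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def predict_in_batches(items, predict, size=1000):
    """Apply `predict` to each batch and concatenate the results."""
    results = []
    for chunk in batched(items, size):
        results.extend(predict(chunk))
    return results


if __name__ == "__main__":
    # Toy stand-in for a model: flag domains that contain a digit.
    def predict(domains):
        return [int(any(c.isdigit() for c in d)) for d in domains]

    print(predict_in_batches(["a.com", "b1.com", "c.org"], predict, size=2))
```

Processing in batches keeps memory use bounded regardless of how many domains are registered on a given day.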
The following figure depicts the architecture deployed for the use case in this post.
First, a daily Amazon CloudWatch event triggers a Lambda function with two jobs: downloading all the publicly registered domains and storing them in an Amazon S3 bucket subfolder, and then triggering the batch transform job. Amazon SageMaker automatically saves the inferences in an S3 bucket that you specify when creating the batch transform job.
Finally, a second CloudWatch event monitors whether the Amazon SageMaker job succeeds. If it does, the event triggers a second Lambda function that retrieves the inferred domains, selects those with label 1 (related to Euler Hermes or its products), and stores them in another S3 bucket subfolder.
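The filtering step performed by the second Lambda function might look like the sketch below. The output format is an assumption: the post does not say how inferences are serialized, so this example supposes one `domain,label` pair per line. The function name `select_flagged` and the sample domains are hypothetical, and reading from and writing to Amazon S3 is left out.

```python
import csv
import io


def select_flagged(inference_csv, positive_label="1"):
    """Return the domains whose predicted label equals `positive_label`.

    Assumes each line of `inference_csv` has the form 'domain,label'.
    Malformed or empty lines are skipped.
    """
    reader = csv.reader(io.StringIO(inference_csv))
    return [
        row[0].strip()
        for row in reader
        if len(row) >= 2 and row[1].strip() == positive_label
    ]


if __name__ == "__main__":
    sample = "shop.example.org,0\nexamp1e.com,1\nexmaple.net,1\n"
    print(select_flagged(sample))
```

In a Lambda handler, the input string would come from the object that batch transform wrote to the output bucket, and the returned list would be written to the destination subfolder.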
Following Euler Hermes’s DevOps principles, all the infrastructure in this solution is coded in Terraform to implement an MLOps pipeline to deploy to production.
Amazon SageMaker provides the tools that our data scientists need to quickly and securely experiment and test while maintaining compliance with strict financial service standards. This allows us to bring new ideas into production very rapidly. With flexibility and inherent programmability, Amazon SageMaker helped us tackle our main pain point of industrializing ML models at scale. After we train an ML model, we can use Amazon SageMaker to deploy it, and we can automate the entire pipeline following the same DevOps principles and tools we use for all other applications we run with AWS.
In under 7 months, we were able to launch a new internal ML service from ideation to production and can now identify URL squatting fraud within 24 hours after the creation of a malicious domain.
Although our application is ready, we have some additional steps planned. First, we’ll extend the inferences currently stored on Amazon S3 to our SIEM platform. Second, we’ll implement a web interface to monitor the model and allow manual feedback that is captured for model retraining.
About the Authors
Luis Leon is the IT Innovation Advisor responsible for the data science practice in IT at Euler Hermes. He is in charge of the ideation of digital projects as well as managing the design, build, and industrialization of at-scale machine learning products. His main interests are Natural Language Processing, Time Series Analysis, and unsupervised learning.
Hamza Benchekroun is a data scientist in the IT Innovation hub at Euler Hermes, focusing on deep learning solutions to increase productivity and guide decision making across teams. His research interests include Natural Language Processing, Time Series Analysis, Semi-Supervised Learning, and their applications.
Hatim Binani is a data scientist intern in the IT Innovation hub at Euler Hermes. He is an engineering student at INSA Lyon in the computer science department. His field of interest is data science and machine learning. He contributed within the IT innovation team to the deployment of Watson on Amazon SageMaker.
Guillaume Chambert is an IT security engineer at Euler Hermes. As SOC manager, he strives to stay ahead of new threats in order to protect Euler Hermes’s sensitive and mission-critical data. He is interested in developing innovative solutions to prevent critical information from being stolen, damaged, or compromised by hackers.
Microsoft Data Science Interview Questions
Microsoft has been a big player in the data science industry since Azure and its machine learning tools began their slow rise toward the top of the cloud-computing market. As a result, Microsoft has been building out its data science team slowly but surely over the past five years to become one of the biggest companies hiring for the role.
The Data Scientist Role
The role of a data scientist at Microsoft varies a lot and depends on the team you interview with. Each Microsoft data science job is different, ranging from analytics-based roles to more machine learning-heavy ones. As a huge conglomerate, Microsoft has different teams that work on speech and language, artificial intelligence, machine learning infrastructure on Azure, data science consulting for cloud computing, and much more.
For a mid-level role, Microsoft generally prefers to hire experienced candidates with at least two years of experience working in data science. General qualifications are:
- Ph.D. in a quantitative field and previous experience in DNN, NLP, time series, reinforcement learning, network analysis, causal inference or any related areas.
- Proficiency in any of the following numerical programming languages and tools: Python (NumPy/SciPy), R, SQL, C#, or Spark.
- Experience with cloud-based architectures such as AWS or Azure.
What are the types of data scientists?
Microsoft has a department under engineering called data and applied science. Employees in this department are often placed in teams and go by three main titles: data scientists, applied scientists, and machine learning engineers. Depending on the team, their functions include:
- Writing code to ship models to production.
- Writing code for machine learning algorithms to be used by other data scientists.
- Working with customers directly or indirectly to resolve technical issues.
- Working on metrics and experimentation.
- Working on product features.
The ideal candidate for the Microsoft Data and Applied Scientist role is expected to be able to apply a breadth of machine learning tools and analytical techniques to answer a wide range of high-impact business questions and present the insights in a concise and effective manner.
The Microsoft Data Scientist Interview
After you submit your application for the job, the first phone interview may or may not be with a recruiter, depending on the seniority level of the role. Many times the hiring manager will conduct a 30-minute interview first to understand your past experience.
Expect this phone interview to come in two parts. You will be asked about your background and projects as well as a few technical interview questions. The technical interview questions will be more theoretical, along the lines of explaining how a machine learning concept works or solving a quick probability or statistics problem.
- What’s the difference between lasso and ridge regression?
- How would you explain how a deep learning model works to a business person?
- How would you define a p-value to someone who’s non-technical?
The Technical Screen
After the hiring manager screen, the recruiter will schedule a second, more technical screen with a Microsoft data scientist. Generally this screen lasts 45 minutes to an hour and is designed to test pure technical skills: how well you can code and explain your thought process.
The technical screen consists of around three different questions covering the topics of algorithms, SQL coding, and probability and statistics. Expect questions akin to data structures and algorithms in Python along with data processing type questions.
- Given an array of words and a max width parameter, format the text such that each line has exactly X characters.
- Write a query to randomly sample a row from a table with 100 million rows.
- What’s the probability that you roll at least two 3s when rolling three dice?
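As a worked example for the last question above, the probability can be checked by brute-force enumeration of all 6^3 = 216 equally likely outcomes, assuming three fair six-sided dice. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_at_least_two_threes():
    """Exact probability of rolling at least two 3s with three fair dice."""
    outcomes = list(product(range(1, 7), repeat=3))
    hits = sum(1 for roll in outcomes if roll.count(3) >= 2)
    return Fraction(hits, len(outcomes))


if __name__ == "__main__":
    print(prob_at_least_two_threes())  # prints 2/27
```

The same answer follows analytically: exactly two 3s can occur in 3 × (1/6)² × (5/6) = 15/216 of cases and three 3s in 1/216, for a total of 16/216 = 2/27, or about 7.4%.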
The Onsite Interview
The onsite interview is a full-day event from 9 am to 4 pm. You will meet with five different data scientists and go on a lunch interview as well.
Here’s what the interview panel generally looks like:
- Probability and statistics
- Data structures and algorithms
- Modeling and machine learning systems
- Hiring manager and behavioral interview
- Data manipulation
- You’ll also spend 1:1 time with one or two data scientists during a lunch break to learn more about Microsoft and the team. This is usually a one-hour lunch interview during which they’ll let you take a break or talk through what they work on.
The onsite interview will mostly be a combination of all the different technical concepts. Remember to study model assessment metrics for different circumstances, the bias/variance tradeoff of coefficients under collinearity, open-ended questions about sampling schemes, experimental and A/B testing design, explaining p-values to a 5-year-old, different concepts of Bayes’ theorem, and teaching the interviewer a statistical learning technique of your choice.
Another big focus for Microsoft is on communication, since the data science team at Microsoft has partnerships throughout the organization to ensure the team is doing useful work.
You can find many of the data structures and algorithm questions on Interview Query or Leetcode. It’s also advisable to get a whiteboard to practice writing code on, given how different coding on a whiteboard is from coding on a computer.
Sample Microsoft Data Science Interview Questions
- How would you select a representative sample of search queries from six million?
- Find the maximum subsequence in an integer list.
- Give an example of a scenario where you would use Naive Bayes over another classifier.
- How would you explain what MapReduce does as concisely as possible?
- What is the ROC curve and the meaning of sensitivity, specificity, confusion matrix?
- The autocomplete feature: How would you implement it and can you highlight the flaws in this tool today?
- Describe efficient ways to merge k given sorted arrays of size n each.
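For the last question in the list, one common approach is a min-heap that always holds the next unmerged element from each array. The sketch below is one possible answer, not the only accepted one; the function name is hypothetical. With k arrays of n elements each, it runs in O(nk log k) time because every element is pushed and popped once from a heap of size at most k.

```python
import heapq


def merge_k_sorted(arrays):
    """Merge a list of sorted lists into one sorted list using a min-heap."""
    # Each heap entry is (value, array index, position within that array).
    heap = [(arr[0], i, 0) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)

    merged = []
    while heap:
        value, i, j = heapq.heappop(heap)
        merged.append(value)
        # Push the next element from the array we just consumed from.
        if j + 1 < len(arrays[i]):
            heapq.heappush(heap, (arrays[i][j + 1], i, j + 1))
    return merged


if __name__ == "__main__":
    print(merge_k_sorted([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
```

In an interview it is also worth mentioning that Python’s standard library offers `heapq.merge`, which does the same thing lazily, and that pairwise merging in a divide-and-conquer fashion has the same asymptotic cost.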
Check out Interview Query for more data scientist interview questions.
This article was originally published on Interview Query Blog and re-published to TOPBOTS with permission from the author.
The Tour de France Is Going Virtual, and It Starts This Weekend
The coronavirus pandemic has changed the way we do things, big-time. The events, places, and activities we were used to enjoying have been canceled, closed, or in some cases, permanently shut down. Virtual versions of just about everything have sprung up: meetings, concerts, parties, classes, conventions. This week another event was added to the list of things gone virtual: the Tour de France.
First held in 1903, the Tour de France has gone on every year since, with the only exceptions being during the first and second World Wars. As of right now, the in-the-flesh tour is still scheduled to take place, though it’s been pushed back to an August 29 start date (it usually takes place in July).
With all the smaller cycling races that usually go on during the summer having been canceled, the virtual Tour will give cyclists some motivation to train, and a chance to see how they stack up against their competitors (whose training routines have no doubt been equally disrupted over the last few months). Participants will be on stationary bikes in their homes rather than real bikes on the road, and there are some other key differences between the virtual Tour and the real thing.
For starters, the Tour is normally broken down into 21 parts, or “stages,” each classified as flat, hilly, or mountain. Cyclists have 23 consecutive days to complete all the stages, with the total distance spanning a whopping 3,500 kilometers (2,200 miles), about the distance from San Francisco to Chicago.
The virtual Tour will look a little different (or, let’s be honest—a lot different. About as different as possible while still being called a bike race). Rather than consecutive days, the race will happen over three weekends in July, with six stages lasting one to two hours apiece. As in real life, each stage will tend toward being mostly hilly, mountain, or flat (meaning participants will need to be adjusting the resistance on their trainer bikes and sometimes standing or crouching to simulate climbing a hill; if you’ve ever done a spin class, you know how it works).
The race will be conducted on a virtual platform called Zwift. Zwift isn’t brand-new—it’s been around for a few years—and it markets itself as a training app for cyclists, runners, and triathletes. Athletes use a treadmill or stationary bike in combination with an array of sensors plus their laptop or smartphone. They can access customized training programs and join virtual races against other users all over the world.
Ideally, competitors in the virtual Tour will have a big screen in front of them simulating their ride through virtual environments, some of which Zwift created especially for this event. For the first two days of the race, riders will bike through Watopia, a virtual world created by Zwift. But the company also rushed to build new, custom worlds for the Tour, mainly mimicking the real-life locations where the race usually takes place, including the French countryside, a 6,263-foot peak in Provence called Mont Ventoux, and the finish line on the famous Champs-Elysées in Paris.
In one cycling coach’s opinion, riding on Zwift can actually feel more physically challenging than being out on a real bike, for three reasons: it’s harder for your body to cool off, the bike’s resistance works differently, and “your motivation dwindles due to not having the wind in your hair and the road moving underneath you.”
That last point is key. The pandemic has played out very differently than it would have just 10 years ago; technologies like Zoom and Slack allowed millions of people to work from home, our smartphones helped us stay ultra-connected even when physically apart, and quick access to information kept us informed of what was going on.
Of course, talking to our friends or watching musicians stream on a screen will never be a good-enough substitute for doing these things in person, just as riding a stationary bike through a virtual world will never give you that wind-in-your-hair, road-beneath-your-feet feeling.
But in a time when we have no choice but to appreciate the small things, it’s better than the alternative, which is… nothing. Alas, depending on how the pandemic continues to play out, we may be in for a highly virtualized future, with events we never would’ve thought could go virtual finding a way to do just that.
23 men’s teams and 17 women’s teams have registered for the virtual bike race, including the last three winners of the real-life event. “Footage” will be broadcast in more than 130 countries.
Let’s just hope all the contestants have stable internet connections.
Image Credit: Zwift
My Invisalign app uses machine learning and facial recognition to sell the benefits of dental work
Align Technology uses DevSecOps tactics to keep complex projects on track and align business and IT goals.
Align Technology’s Chief Digital Officer Sreelakshmi Kolli is using machine learning and DevOps tactics to power the company’s digital transformation.
Kolli led the cross-functional team that developed the latest version of the company’s My Invisalign app. The app combines several technologies into one product including virtual reality, facial recognition, and machine learning. Kolli said that using a DevOps approach helped to keep this complex work on track.
“The feasibility and proof of concept phase gives us an understanding of how the technology drives revenue and/or customer experience,” she said. “Modular architecture and microservices allows incremental feature delivery that reduces risk and allows for continuous delivery of innovation.”
The customer-facing app accomplishes several goals at once, the company said:
- Offers a preview of life after braces via SmileView
- Sends weekly treatment reminders
- Keeps patients in touch with their doctors during treatment
More than 7.5 million people have used the clear plastic molds to straighten their teeth, the company said. Align Technology has used data from these patients to train a machine learning algorithm that powers the visualization feature in the mobile app. The SmileView feature uses machine learning to predict what a person’s smile will look like when the braces come off.
Kolli started with Align Technology as a software engineer in 2003. Now she leads an integrated software engineering group focused on product technology strategy and development of global consumer, customer and enterprise applications and infrastructure. This includes end user and cloud computing, voice and data networks and storage. She also led the company’s global business transformation initiative to deliver platforms to support customer experience and to simplify business processes.
Kolli used the development process of the My Invisalign app as an opportunity to move the dev team to DevSecOps practices. Kolli said that this shift represents a cultural change, and making the transition requires a common understanding among all teams on what the approach means to the engineering lifecycle.
“Teams can make small incremental changes to get on the DevSecOps journey (instead of a large transformation initiative),” she said. “Investing in automation is also a must for continuous integration, continuous testing, continuous code analysis and vulnerability scans.”
To build the machine learning expertise required to improve and support the My Invisalign app, she has hired team members with that skill set and built up expertise internally.
“We continue to integrate data science to all applications to deliver great visualization experiences and quality outcomes,” she said.
Align Technology uses AWS to run its workloads.
Aligning business and IT goals to power transformation
In addition to keeping patients connected with orthodontists, the My Invisalign app is a marketing tool to convince families to opt for the transparent but expensive alternative to metal braces.
Kolli said that IT leaders should work closely with business leaders to make sure initiatives support business goals such as revenue growth, improved customer experience, or operational efficiencies, and modernize the IT operation as well.
“Making the line of connection between the technology tasks and agility to go to market helps build shared accountability to keep technical debt in control,” she said.
Align Technology released the revamped app in late 2019. In May of this year, the company released a digital version tool for doctors that combines a photo of the patient’s face with their 3D Invisalign treatment plan.
This ClinCheck “In-Face” Visualization is designed to help doctors manage patient treatment plans.
The visualization workflow combines three components of Align’s digital treatment platform: Invisalign Photo Uploader for patient photos, the iTero intraoral scanner to capture data needed for the 3D model of the patient’s teeth, and ClinCheck Pro 6.0. ClinCheck Pro 6.0 allows doctors to modify treatment plans through 3D controls.
These new product releases are the first in a series of innovations to reimagine the digital treatment planning process for doctors, Raj Pudipeddi, Align’s chief innovation, product, and marketing officer and senior vice president, said in a press release about the product.