Once you have decided to explore a career in data science, and you need to engage in a project to get yourself going, you need to decide what dataset to use.
Fortunately, a guide to the best datasets for machine learning has been published in edureka!, written by Disha Gupta, a computer science and technology writer based in India. She notes that without training datasets, machine-learning algorithms would not have a way to learn text mining or text classification. Five to 10 years ago, it was difficult to find datasets for machine learning and data science projects. Today the challenge is not finding data, but finding the relevant data.
Here is an excerpt referring to datasets good for Natural Language Processing projects, which need text data. She recommended:
Enron Dataset – Email data from the senior management of Enron that is organized into folders.
Amazon Reviews – It contains approximately 35 million reviews from Amazon spanning 18 years. Data includes user information, product information, ratings, and text review.
Newsgroup Classification – Collection of almost 20,000 newsgroup documents, partitioned evenly across 20 newsgroups. It is great for practicing topic modeling and text classification.
For Finance projects:
Quandl: A great source of economic and financial data that is useful to build models to predict stock prices or economic indicators.
World Bank Open Data: Covers population demographics and many economic and development indicators across the world.
IMF Data: The International Monetary Fund (IMF) publishes data on international finances, foreign exchange reserves, debt rates, commodity prices, and investments.
For Sentiment Analysis projects:
IMDB Reviews – Dataset for binary sentiment classification. It features 25,000 movie reviews.
Sentiment140 – Uses 160,000 tweets with emoticons pre-removed.
Two Questions for Your Data Science Project
Once you have selected a dataset, you might need some more suggestions for getting your project off the ground. First, ask yourself two questions, suggests a recent article in Data Science Weekly: How would you make some money with it? And how would you save some money with it?
The answers will help you focus on what is important and useful when looking at your data. You will often find that before you get to the modeling or serious math, you may have to work through problems with the data, such as missing, erroneous or biased data. “You will find frequently in the real world that data is incredibly messy and nothing like the squeaky clean data sets found online in contests on Kaggle or elsewhere,” the author states.
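As a minimal sketch of that first pass over messy data, the Python below counts missing values per field and flags values outside a plausible range. The records, field names, and range limits are hypothetical and chosen only for illustration; a real project would more likely use a library such as pandas.

```python
from collections import Counter

# Hypothetical records; None and "" stand in for missing values.
records = [
    {"customer": "a1", "age": 34, "spend": 120.0},
    {"customer": "a2", "age": None, "spend": 80.5},
    {"customer": "a3", "age": 210, "spend": None},  # age is implausible
    {"customer": "a4", "age": 45, "spend": 60.0},
]


def missing_counts(rows):
    """Count missing (None or empty string) values per field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                counts[field] += 1
    return dict(counts)


def out_of_range(rows, field, low, high):
    """Return rows whose value for `field` is present but outside [low, high]."""
    return [
        row for row in rows
        if row.get(field) is not None and not (low <= row[field] <= high)
    ]


print(missing_counts(records))
print(out_of_range(records, "age", 0, 120))
```

Checks like these do not fix anything by themselves, but they show where the data needs attention before any modeling starts.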
Maybe at this stage you feel you need more education on AI. Fortunately, BestColleges has arrived. The company is a partnership with HigherEducation.com to provide students with direct connections to schools and programs that suit their education goals. The site provides college planning, access to financial aid and career resources.
Tune Up Your AI Education
Success in the AI field usually requires an undergraduate degree in computer science or a related discipline such as mathematics. More senior positions may require a master’s or PhD. Motivation is important. “Curiosity, confidence and perseverance are good traits for any student looking to break into an emerging field and AI is no exception,” states Dan Ayoub, Education Manager for Microsoft. “Unlike careers where a path has been laid over decades, AI is still in its infancy, which means you may have to form your own path and get creative.”
The article sketches out sample core subjects in an AI curriculum in math and statistics, computer science and “core AI,” such as machine learning, neural networks and natural language processing. Once you cover some fundamentals, you can begin to explore subjects that interest you personally. Clusters include machine learning, robotics, and human-AI interaction.
Whether you are a college student or already in the workforce, it’s important to proactively define your own AI curriculum, Ayoub suggested.
Example skills that can help you check off the right boxes in your response to the AI job posting include:
Big Data Tools: Spark, HBase, Kafka, HDFS, Hive, Hadoop, MapReduce, Pig
Natural Language Processing Tools: spaCy, NLTK
Jobs of the future will require a willingness to stay curious. It takes a little time and some patience.
An IBM AI researcher encourages an attitude that AI needs to be adopted by more people with data science and software engineering skills, as demand for workers skilled in machine learning is doubling every few months. “If we leave it as some mythical realm, this field of AI, that’s only accessible to the select PhDs that work on this, it doesn’t really contribute to its adoption,” said Dario Gil, research director at IBM, in an article in VentureBeat.
Like “innovation,” machine learning and artificial intelligence are commonplace terms that provide very little context for what they actually signify. AI/ML spans dozens of different fields of research, covering all kinds of different problems and alternative and often incompatible ways to solve them.
One robust area of research here that has antecedents going back to the mid-20th century is what is known as stochastic optimization — decision-making under uncertainty where an entity wants to optimize for a particular objective. A classic problem is how to optimize an airline’s schedule to maximize profit. Airlines need to commit to schedules months in advance without knowing what the weather will be like or what the specific demand for a route will be (or, whether a pandemic will wipe out travel demand entirely). It’s a vibrant field, and these days, basically runs most of modern life.
Warren B. Powell has been exploring this problem for decades as a researcher at Princeton, where he has operated the Castle Lab. He has researched how to bring disparate areas of stochastic optimization together under one framework that he has dubbed “sequential decision analytics” to optimize problems where each decision in a series places constraints on future decisions. Such problems are common in areas like logistics, scheduling and other key areas of business.
The Castle Lab has long had industry partners, and it has raised tens of millions of dollars in grants from industry over its history. But after decades of research, Powell teamed up with his son, Daniel Powell, to spin out his collective body of research and productize it into a startup called Optimal Dynamics. Father Powell has now retired full-time from Princeton to become Chief Analytics Officer, while son Powell became CEO.
The company raised $18.4 million in new funding last week from Bessemer, in a round led by Mike Droesch, who was promoted to partner earlier this year alongside the firm’s newest $3.3 billion fundraise. The company now has 25 employees and is centered in New York City.
So what does Optimal Dynamics actually do? CEO Powell said that it’s been a long road since the company’s founding in mid-2017 when it first raised a $450,000 pre-seed round. We were “drunkenly walking in finding product-market fit,” Powell said. This is “not an easy technology to get right.”
What the company ultimately zoomed in on was the trucking industry, which has precisely the kind of sequential decision-making that father Powell had been working on his entire career. “Within truckload, you have a whole series of uncertain variables,” CEO Powell described. “We are the first company that can learn and plan for an uncertain future.”
There’s been a lot of investment in logistics and trucking from VCs in recent years as more and more investors see the potential to completely disrupt the massive and fragmented market. Yet, rather than building a whole new trucking marketplace or approaching it as a vertically-integrated solution, Optimal Dynamics decided to go with the much simpler enterprise SaaS route to offer better optimization to existing companies.
One early customer, which owned 120 power units, saved $4 million using the company’s software, according to Powell. That was a result of better utilization of equipment and more efficient operations. They “sold off about 20 vehicles that they didn’t need anymore due to the underlying efficiency,” he said. In addition, the company was able to replace a team of ten who used to manage trucking logistics down to one, and “they are just managing exceptions” to the normal course of business. As an example of an exception, Powell said that “a guy drove half way and then decided he wanted to quit,” leaving a load stranded. “Trying to train a computer on weird edge events [like that] is hard,” he said.
Better efficiency for equipment usage and then saving money on employee costs by automating their work are the two main ways Optimal Dynamics saves money for customers. Powell says most of the savings come in the former rather than the latter, since utilization is often where the most impact can be felt.
On the technical front, the key improvement the company has devised is how to rapidly solve the ultra-complex optimization problems that logistics companies face. The company does that through value function approximation, which is a field of study where instead of actually computing the full range of stochastic optimization solutions, the program approximates the outcomes of decisions to reduce compute time. We “take in this extraordinary amount of detail while handling it in a computationally efficient way,” Powell said. That’s where we have really “wedged ourselves as a company.”
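To make the idea of value function approximation concrete, here is a minimal Python sketch on a toy inventory problem. It is not Optimal Dynamics’ method: the problem, the constants, and the function names are all assumptions made for illustration. Instead of computing an exact expectation over every possible future, it keeps a running estimate of the value of each post-decision stock level and nudges that estimate toward sampled outcomes.

```python
import random

# Toy inventory problem; every constant here is an illustrative assumption.
CAP, PRICE, COST, GAMMA = 5, 4.0, 1.0, 0.9
DEMANDS = [0, 1, 2, 3]  # demand is drawn uniformly from these values


def best_order(stock, vbar):
    """Pick the order size maximizing order cost plus approximate future value."""
    return max(range(CAP - stock + 1), key=lambda a: -COST * a + vbar[stock + a])


def train(iterations=20000, alpha=0.05, epsilon=0.2, seed=0):
    """Learn vbar[x], the estimated value of holding x units after ordering."""
    rng = random.Random(seed)
    vbar = [0.0] * (CAP + 1)
    stock = 0
    for _ in range(iterations):
        # Explore occasionally so that every stock level gets visited.
        if rng.random() < epsilon:
            order = rng.randint(0, CAP - stock)
        else:
            order = best_order(stock, vbar)
        post = stock + order
        # Sample one demand outcome instead of enumerating all futures.
        sales = min(post, rng.choice(DEMANDS))
        next_stock = post - sales
        next_order = best_order(next_stock, vbar)
        sample = PRICE * sales + GAMMA * (
            -COST * next_order + vbar[next_stock + next_order]
        )
        # Smooth the sampled value into the running estimate.
        vbar[post] += alpha * (sample - vbar[post])
        stock = next_stock
    return vbar


vbar = train()
print([round(v, 1) for v in vbar])
print("order when empty:", best_order(0, vbar))
```

Each decision here needs only a lookup into `vbar`, which is where the compute savings come from. Real logistics problems have far too many states for a table like this, so the table would be replaced by a compact statistical model.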
Early signs of success with customers led to a $4 million seed round led by Homan Yuen of Fusion Fund, which invests in technically-sophisticated startups (i.e. the kind of startups that take decades of optimization research at Princeton to get going). Powell said that raising the round was tough, transpiring during the first weeks of the pandemic last year. One corporate fund pulled out at the last minute, and it was “chaos ensuing with everyone,” he said. This Series A process meanwhile was the opposite. “This round was totally different — closed it in 17 days from round kickoff to closure,” he said.
With new capital in the bank, the company is looking to expand from 25 employees to 75 this year, who will be trickling back to the company’s office in the Flatiron neighborhood of Manhattan in the coming months. Optimal Dynamics targets customers with 75 trucks or more, either fleets for rent or private fleets owned by companies like Walmart who handle their own logistics.
(Reuters) — IBM said on Tuesday it would buy Waeg, a consulting partner for Salesforce, in a deal that will extend its range of services and support its hybrid cloud and artificial intelligence strategy.
The deal to acquire Waeg, which is based in Brussels and serves clients across Europe, complements IBM’s acquisition in January of 7Summits, a U.S. consultancy that specialises in Salesforce’s customer management software.
“Waeg’s strength in Salesforce consulting services will be key to creating intelligent workflows that allow our clients to keep pace with changing customer and employee needs and expectations,” Mark Foster, senior vice president of IBM Services and Global Business Services, said.
Waeg employs a team of 130 ‘Waegers’ in locations that include Belgium, Denmark, France, Ireland, Poland, Portugal and the Netherlands.
The terms were not disclosed for the deal, which is subject to customary closing conditions and is expected to be completed this quarter.
Sylvain Kalache is the co-founder of Holberton, an edtech company training digital talent in more than 10 countries. An entrepreneur and software engineer, he has worked in the tech industry for more than a decade. Part of the team that led SlideShare to be acquired by LinkedIn, he has written for CIO and VentureBeat.
AI is driving the paradigm shift that is the software industry’s transition from writing logical statements to data-centric programming. Data is now oxygen. The more training data a company gathers, the brighter its AI-powered products will burn.
Why is Tesla so far ahead with advanced driver assistance systems (ADAS)? Because no one else has collected as much information — it has data on more than ten billion driven miles, helping it pull ahead of competition like Waymo, which has only about 20 million miles. But any company that is considering using machine learning (ML) cannot overlook one technical choice: supervised or unsupervised learning.
There is a fundamental difference between the two. For unsupervised learning, the process is fairly straightforward: The acquired data is directly fed to the models, and if all goes well, it will identify patterns.
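The difference can be sketched in a few lines of Python. In this hypothetical example the numbers and the "slow"/"fast" labels are made up: the unsupervised function receives only raw values and finds two groups by itself, while the supervised functions need a label attached to every training value.

```python
def two_means(values, steps=10):
    """Unsupervised: split unlabeled numbers into two clusters (1-D k-means)."""
    centers = [min(values), max(values)]
    for _ in range(steps):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def fit_centroids(labeled):
    """Supervised: compute the mean value for each human-provided label."""
    sums, counts = {}, {}
    for value, label in labeled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


unlabeled = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
labeled = [(1.0, "slow"), (1.2, "slow"), (9.0, "fast"), (10.5, "fast")]

print(two_means(unlabeled))                      # patterns found without labels
print(predict(fit_centroids(labeled), 2.0))      # label learned from examples
```

The unsupervised result is only a pair of cluster centers, and someone still has to decide what each cluster means. The supervised model answers in the vocabulary of its labels, which is part of why it is easier to put to work.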
Elon Musk compares unsupervised learning to the human brain, which gets raw data from the six senses and makes sense of it. He recently shared that making unsupervised learning work for ADAS is a major challenge that hasn’t been solved yet.
A major part of real-world AI has to be solved to make unsupervised, generalized full self-driving work, as the entire road system is designed for biological neural nets with optical imagers
Supervised learning is currently the most practical approach for most ML challenges. O’Reilly’s 2021 report on AI Adoption in the Enterprise found that 82% of surveyed companies use supervised learning, while only 58% use unsupervised learning. Gartner predicts that through 2022, supervised learning will remain favored by enterprises, arguing that “most of the current economic value gained from ML is based on supervised learning use cases”.
Short introduction: I’m Jerry Udensi, CTO of a Nigerian-Malaysian tech company: Lyshnia Limited. Prior to working full time with Lyshnia (a company I founded in 2013 with my elder brother), I worked in the AI industry in Malaysia and Singapore. I have built Natural Language AI systems for large corporations such as Allianz SE, and Insurance Technology for companies like Malaysia’s Insuradar Sdn.
The reason for my short introduction is to show you my background in building AI-powered systems. Natural Language Processing is a field I’ve actively been in for over 3 years now, so you’d think building a Transactional Chat Bot that sells only 10 products shouldn’t be an issue for me, right? Well, you’d be right if the customers were people who read.
In the paragraphs to follow, I will highlight what I’ve learnt building and maintaining Jane B (Just another Non-Existent Bot), which attends to approx. 1,000 customers every day.
There’s this old saying that goes “if you want to hide something from a Black Man, put it in a book”. Unfortunately, this is the case with over 70% of the Customers who used the bot.
When you first message the bot, it greets you, lets you know that you’re chatting with a Bot, then gives you 4 options to choose from.
5 out of 10 people ignore the initial message and go ahead to write what they want, while 2 out of 10 people read it but do not understand it, and therefore reply in confusion, like in the image below:
For the 5 who initially ignored the Menu message, we automatically resend the message, and 4 out of 5 would go on to reply appropriately, while 1 of 5 would complain of how stressful the process is and probably never chat again.
Yes, we get it. You live in France, but do you want it Delivered or will you Pick it up? (some customers send people in to do a pick up for them)
Jane has been simplified to understand even incorrect English, and gives customers hints on how to reply, yet a lot of those who chat with her simply ignore instructions and would rather type a thousand words than one that Jane would understand.
You would think it’ll be easier and less stressful for customers to simply reply “1” rather than type out “I want to make an order”, but no. Chat after chat, you will realise a lot of people are saying unnecessary things before or after their actual intention. For Chat Bot providers, this can be a nightmare: the Chat Bot has asked a question and is listening for a Natural Language answer, and it is very hard to predict whether the user’s response is in line with the desired answer.
Even for a human, it is hard to understand another human’s intentions when they are spoken out of context
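To illustrate the problem, here is a minimal Python sketch of a reply parser that accepts either a menu number or a free-form sentence with extra words around the intent. It is not Jane’s actual code: the menu numbers, intent names, and keyword sets are hypothetical.

```python
import re

# Hypothetical menu: option numbers, intent names, and keywords are illustrative only.
MENU = {
    "1": ("place_order", {"order", "buy", "purchase"}),
    "2": ("track_order", {"track", "tracking", "status"}),
}


def parse_reply(text):
    """Map a free-form reply to a menu intent, or None if nothing matches."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # An explicit menu number anywhere in the reply wins.
    for word in words:
        if word in MENU:
            return MENU[word][0]
    # Otherwise pick the intent sharing the most keywords with the reply.
    scores = {intent: len(keywords & set(words)) for intent, keywords in MENU.values()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(parse_reply("1"))
print(parse_reply("Good morning, I want to make an order please"))
print(parse_reply("I live in France"))
```

A reply that matches nothing returns `None`, which is the point at which a bot can resend the menu. Keyword matching this simple would also misread a message such as “I want 2 of them” as menu option 2, which shows how quickly free-form replies get ambiguous.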
For the Chat above, the Bot was asking the user to confirm the items she wants to buy, but the user instead replies saying where they live. Totally out of context.
Getting instant replies is a drug people are addicted to. Customers are told that this is a chat bot which only takes orders and tracks orders, then given another number to chat with a human for consultancy. Yet, they keep coming back just minutes later to complain to the Bot that they’re not getting responses there.
Something else I noticed while analysing the chat response times is that the Customers get so hooked on the instant replies that if, at any point, the chat bot delays its response for even just 1 minute, they start asking why they’re not getting any response.
On the good side, customers who read and follow the short and simple instructions are able to place their orders in less than 2 minutes from a platform they’re comfortable with (WhatsApp) while feeling like they’re chatting with a human.