
# Using Feature Selection Methods in Text Classification


In text classification, feature selection is the process of selecting a specific subset of the terms of the training set and using only that subset in the classification algorithm. The feature selection process takes place before the training of the classifier.

Update: The Datumbox Machine Learning Framework is now open-source and free to download. Check out the package com.datumbox.framework.machinelearning.featureselection to see the implementation of Chi-square and Mutual Information Feature Selection methods in Java.

The main advantages of using feature selection algorithms are that they reduce the dimensionality of our data, make training faster, and can improve accuracy by removing noisy features. As a consequence, feature selection can help us avoid overfitting.

The basic selection algorithm for selecting the k best features is presented below (Manning et al, 2008):
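The pseudocode figure from Manning et al did not survive extraction, but the algorithm it describes is simple: score every term in the vocabulary with a utility measure A(t, c) (such as mutual information or chi-square), sort by score, and keep the k best terms. A minimal Python sketch (the function names and the `utility(documents, t, c)` signature are illustrative assumptions, not the Datumbox implementation):

```python
def select_features(vocabulary, documents, c, utility, k):
    """Return the k terms with the highest utility score for class c.

    `utility(documents, t, c)` can be any scoring function, e.g.
    mutual information or chi-square (hypothetical signature).
    """
    # Score every candidate term, then keep the top k by score.
    scored = [(utility(documents, t, c), t) for t in vocabulary]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]
```

The utility function is the only piece that changes between the two methods presented below; the selection loop stays the same.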

In the next sections we present two different feature selection algorithms: Mutual Information and Chi-Square.

## Mutual Information

One of the most common feature selection methods is the Mutual Information of term t in class c (Manning et al, 2008). This measures how much information the presence or absence of a particular term contributes to making the correct classification decision on c. The mutual information can be calculated by using the following formula:

$$I(U;C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U=e_t, C=e_c)\,\log_2 \frac{P(U=e_t, C=e_c)}{P(U=e_t)\,P(C=e_c)}$$

In our calculations, since we use the Maximum Likelihood Estimates of the probabilities, we can use the following equation:

$$I(U;C) = \frac{N_{11}}{N}\log_2\frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N N_{00}}{N_{0.} N_{.0}}$$

Where N is the total number of documents, and the counts N11, N10, N01, and N00 record how many documents exhibit each combination of e_t (occurrence of term t in the document; it takes the value 1 or 0) and e_c (occurrence of the document in class c; it takes the value 1 or 0), as indicated by the two subscripts. For example, N10 is the number of documents that contain t but do not belong to c, and the marginal count N1. = N10 + N11 is the number of documents that contain t. Finally, we must note that all the aforementioned variables take non-negative values.
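As an illustrative sketch (not the Datumbox implementation), the MLE formula can be computed directly from the four document counts, using the standard convention that a zero-count term contributes 0 (since x·log x → 0 as x → 0):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information of a term and a class from document counts.

    n11: docs containing the term and in the class
    n10: docs containing the term, not in the class
    n01: docs in the class without the term
    n00: docs with neither
    """
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # marginals for the term
    n_1, n_0 = n11 + n01, n10 + n00   # marginals for the class

    def term(nij, row, col):
        if nij == 0:
            return 0.0  # convention: 0 * log 0 = 0
        return (nij / n) * math.log2(n * nij / (row * col))

    return (term(n11, n1_, n_1) + term(n01, n0_, n_1)
            + term(n10, n1_, n_0) + term(n00, n0_, n_0))
```

A term distributed independently of the class scores 0 bits, while a term perfectly correlated with the class scores the full entropy of the class variable.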

## Chi Square

Another common feature selection method is Chi-Square. The χ² test is used in statistics, among other things, to test the independence of two events. More specifically, in feature selection we use it to test whether the occurrence of a specific term and the occurrence of a specific class are independent. Thus we estimate the following quantity for each term and rank the terms by their score:

$$\chi^2(\mathbb{D}, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{\left(N_{e_t e_c} - E_{e_t e_c}\right)^2}{E_{e_t e_c}}$$

High χ² scores indicate that the null hypothesis (H0) of independence should be rejected, and thus that the occurrence of the term and the class are dependent. If they are dependent, we select the feature for the text classification.

The above formula can be rewritten as follows:

$$\chi^2(\mathbb{D}, t, c) = \frac{(N_{11} + N_{10} + N_{01} + N_{00}) \cdot (N_{11} N_{00} - N_{10} N_{01})^2}{(N_{11} + N_{01}) \cdot (N_{11} + N_{10}) \cdot (N_{10} + N_{00}) \cdot (N_{01} + N_{00})}$$

If we use the Chi-Square method, we should select only a predefined number of features, each with a χ² score larger than 10.83, which indicates statistical significance at the 0.001 level.
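For illustration, the rewritten formula above is straightforward to compute from the same four document counts used for mutual information (again a hypothetical sketch, not the framework's Java code):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a term/class pair from document counts."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = ((n11 + n01) * (n11 + n10)
                   * (n10 + n00) * (n01 + n00))
    # A zero denominator means the term or class never varies,
    # so the test carries no information.
    return numerator / denominator if denominator else 0.0
```

A term distributed independently of the class scores 0, well below the 10.83 significance threshold, while a perfectly class-correlated term scores χ² = N, the total number of documents.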

Last but not least, we should note that from a statistical point of view the Chi-Square feature selection is inaccurate, due to the one degree of freedom, and the Yates correction should be used instead (which will make it harder to reach statistical significance). Thus we should expect that, out of the total selected features, a small part of them are independent of the class. Nevertheless, as Manning et al (2008) showed, these noisy features do not seriously affect the overall accuracy of our classifier.

## Removing noisy/rare features

Another technique which can help us to avoid overfitting, reduce memory consumption, and improve speed is to remove all the rare terms from the vocabulary. For example, one can eliminate all the terms that occurred only once across all categories. Removing those terms can reduce the memory usage by a significant factor and improve the speed of the analysis. Finally, we should note that this technique can be used in conjunction with the above feature selection algorithms.
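A minimal sketch of this pruning step, assuming documents are represented as lists of tokens (`prune_rare_terms` and its `min_df` threshold are illustrative names, not part of the framework):

```python
from collections import Counter

def prune_rare_terms(tokenized_docs, min_df=2):
    """Drop terms whose document frequency is below min_df."""
    # Count in how many documents each term appears (document
    # frequency), not how many times it appears overall.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    keep = {t for t, count in df.items() if count >= min_df}
    return [[t for t in doc if t in keep] for doc in tokenized_docs]
```

With the default `min_df=2`, this implements exactly the example above: every term that occurs in only one document is discarded before feature selection runs.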


My name is Vasilis Vryniotis. I’m a Data Scientist, a Software Engineer, author of the Datumbox Machine Learning Framework and a proud geek.

# Malta AI & Blockchain Summit CEO: Malta Intends to Remain Pioneer in Digital World



A while back, Malta was on the cutting edge of technology and was quickly establishing notoriety as a nerve center for digital innovation, as well as providing stability for blockchain-driven startups in this era of regulatory ambiguity. BeInCrypto spoke with Eman Pulis, CEO of the igaming summit SiGMA and the Malta AI & Blockchain Summit, about what makes Malta attractive to fintech startups.

While legalities have been uncertain around the world regarding any concise regulatory development surrounding distributed ledger technology (DLT), Malta, Switzerland, and Singapore planned to implement regulatory measures on tokens in order to accelerate blockchain development. In fact, that is precisely what Malta once did in its pursuit to establish itself as a “blockchain island.”

In July 2018, Malta became the first jurisdiction to adopt blockchain and cryptocurrency regulations, with distinctly instituted regulatory structures for DLT, initial coin offerings, and digital currencies — with the projected goal of also becoming the first cashless country by 2033.

This bold initiative in developing regulatory frameworks was supposed to provide a stable platform for companies to operate in a regulated environment, without the element of uncertainty.

In 2018, the Maltese government took legislative action and passed the Malta Digital Innovation Authority Act, the Innovative Technology Arrangement and Services Act, in addition to the Virtual Financial Assets Act.

These legislative moves, often called the Digital Innovation Framework, made the country distinct from nearly all others across the globe, as well as distinguished the island as an extremely appealing destination for blockchain-focused startups.

Additionally, the country was also preparing to equip its financial sector with the power of artificial intelligence (AI) in establishing a future that is piloted by data.

However, Malta’s government has been silent about its crypto and blockchain-related developments over the past year, and companies’ interest in receiving a local license to operate a crypto-focused business seems to be fading away.

## Malta is “definitely going digital”

BeInCrypto: It seems Malta is becoming less blockchain-focused and more digital economy-focused. How did it come to this decision?

Eman Pulis: I wouldn’t really go as far as to say that we’re becoming less of a blockchain-focused country, but yes we are definitely going digital and the reasons are obvious, since, in doing so we are enabling an array of new possibilities across the economy, generating business and better jobs.

In addition to that, we all saw the pandemic imposing a default shift in the working of things, and in spite of the difficulties it created simultaneously, we still found a way by going digital when it comes to retail and food services, for instance. It seems only natural that if a newly-applied digitized system worked efficiently during hard times, it will be immensely beneficial for when things slowly get back to normal.

BeInCrypto: Can Malta become attractive to fintech startups? What are the main requests of fintech startups looking to relocate to Malta?

Eman Pulis: Both Hon. Silvio Schembri and Hon. Dr. Chris Fearne actually addressed this at the openings of our Med-Tech and Med-Cann summits, and they both said how Malta continues to work in order to facilitate the process for businesses in med-tech, med-cann, blockchain, and fintech, and that in the end, the numbers speak for themselves.

Malta has consistently worked hard to be ahead of the curve where technology and innovation are concerned, and in fact the Malta Digital Innovation Authority will be launching a technological sandbox that is intended to assist start-ups in launching their services in the respective emerging markets.

On the whole, I believe that the way to go is to move away from the current, rather fragmented approach, and instead move towards a more holistic approach, locally embracing new technologies and digitalization in unison.

## “Sometimes delays create time for careful planning”

BeInCrypto: There have been some delays in the development of a regulatory framework around blockchain. Have you noticed blockchain firms leaving Malta due to the regulatory unclarity?

Eman Pulis: There’s no doubt that this year has disrupted plans and caused endless delays for businesses and individuals alike, so if there were any delays, that surely didn’t help.

On the other hand, if you look at the way that Malta learned from other countries, took its time, and then applied a fantastic regulatory framework for the igaming sector, I believe we are still ahead of time to do the same thing in blockchain. Sometimes delays create time for careful planning and refinement.

BeInCrypto: Germany is developing a crypto-friendly set of regulations, and probably they will be rolled out to the rest of the European Union. How would this affect Malta’s position?

Eman Pulis: On this matter, I feel Clayton Bartolo really said it all in his interview on our BLOCK magazine, when he said that the country intends to keep up its commitment to remain a pioneer in the digital world, specifically by heavily investing in blockchain technologies, particularly in the fields of education, energy management, traffic management and tourism.

BeInCrypto: What developments do you expect in the fintech and blockchain sectors in the coming years?

Eman Pulis: As you know, it’s a vast industry and there are endless subsectors evolving technologies with newly emerging concepts, and it is without any doubt that the pandemic has cemented our reliance on technology, so it is certain that we can expect a multitude of entrepreneurial innovations.

Businesses should consider investments into AI, Internet of Things, and automation; these three factors are playing an important role in the ever-changing game of tech, and that naturally includes blockchain and fintech.


# Kamua’s AI-powered editor helps marketers embrace vertical video


A new AI-powered video-editing platform is preparing for launch, designed to help businesses, marketers, and creators automatically transform landscape-shot videos into a vertical format suitable for TikTok, Instagram, Snapchat, and all the rest.

Founded out of London in 2019, Kamua wants to be aligned with tools such as Figma, a software design and prototyping tool for product managers who lack certain technical skills. For Kamua, the goal is democratizing the creative and technical processes in video editing.

“Kamua makes it possible for non-editors to directly control how their videos look in any format, on any screen, in multiple durations and sizes, without the steep and long learning curves, hardware expense, and legacy workflows associated with editing software suites,” Kamua CEO and cofounder Paul Robert Cary told VentureBeat.

Kamua, which had been available as an alpha release since last year before launching an invite-only beta back in September, is now preparing for a more extensive rollout on December 1, when a limited free version will be made available to anyone, without any formal application process.

Above: Kamua CEO and cofounder Paul Robert Cary

## Reformat

Reformatting videos for different-sized screens is an age-old problem, one that movie studios have contended with for years as they shoehorned productions created in one aspect ratio onto displays built for another. In the modern digital era, businesses and freelance creators also have to contend with a wide array of screens and evolving consumption habits — the viewer could potentially be watching the end-product on any number of displays, ranging from a PC monitor, to a smart TV, to a tablet, or, most likely, a smartphone.

Editing a video that was filmed in landscape so that it plays nice with the much-maligned (but increasingly popular) vertical video format is no easy feat; it’s a problem that can consume considerable marketing and IT resources. And for businesses that want to tailor their advertisements or showreels for vertical-screen configurations without having to film multiple versions, Kamua hopes to fill that niche.

“Kamua obviates the need to shoot multiple orientations, which can often increase costs and time, double the editing workload, and result in missed opportunities,” Cary said.

Driving this demand is the simple fact that more than half of all internet users only ever use a smartphone to access the internet, a figure that’s expected to grow to nearly three-quarters by 2025. This trend translates into digital video views, too, which are also now driven chiefly by smartphones.

## Visionary

Using computer vision and machine learning techniques, Kamua tracks on-screen subjects (e.g., people or animals) to convert landscape videos into organic-looking portrait videos. So when the time comes to port a YouTube video to Instagram’s longform IGTV, for example, Kamua can auto-crop the videos into vertical incarnations, focusing entirely on the action to ensure that the context is preserved.

As Kamua puts it, it’s all about “automating the un-fun parts of video editing,” bypassing the need for software downloads, file syncs, or specially skilled personnel.

In this clip, for example, you can see how the subject of the footage changes mid-scene, with Kamua correctly deciding to switch focus from the cyclist to the skateboarder. Auto-crop can also be manually overridden if it makes a mistake, with any operator able to retarget the focus of the edit in a couple of clicks.

Above: Kamua auto-cropping an action video, where the subject switches mid-scene

Kamua also offers a feature it calls auto-cut, which again uses AI to analyze videos to identify where the editor initially included cuts and transitions between scenes. Kamua displays these in a gridlike format separated by each cut point, making it easier for editors to choose which shots or scenes they wish to use in a final edit (and convert to vertical video, if required).

Above: Kamua: Auto-cuts

Elsewhere, Kamua can also generate subtitles using speech-to-text technology in more than 60 source languages, similar to other video platforms such as YouTube. However, Kamua brings its own twist to the mix, as it automatically resizes the captions for the screen format on which it will be displayed.

Above: Kamua: Captions

There are other similar tools on the market already. Last year, Adobe launched a new auto-reframe tool, though it’s only available as part of Adobe Premiere Pro on the desktop. Apple also recently debuted a similar new feature for Final Cut Pro called Smart Conform, though of course that’s only available for Macs. Elsewhere, Cloudinary offers something akin to all of this, but it’s bundled as part of its broader media-management platform.

Earlier this year, Google debuted a new open source framework called AutoFlip, for “intelligent video reframing,” though that does of course require proper technical know-how to implement it into an actual usable product.

What’s clear in all of this, however, is that there is a growing demand for automated video-editing tools that address the myriad screens people use to consume content today.

## Vid in the cloud

Kamua, for its part, is an entirely browser-based service, deployed on Google Cloud with all its video processing and AI processing taking place on Nvidia GPUs. According to Cary, Kamua uses proprietary machine learning algorithms that are more than 95% accurate in terms of determining the exact frames where videos can be cut into clips, and neural networks that identify the “most interesting” action to track in a given scene. This is all combined with “highly customized” open source computer vision tools and frameworks, including Google’s TensorFlow, alongside off-the-shelf solutions such as Nvidia NGX and CUDA.

Although Kamua is planning offline support in the future, Cary is adamant that one of its core selling points — to businesses, at least — is its ties to the cloud. And this is perhaps more pertinent as companies rapidly embrace remote working.

“Cloud-based creative software that is automation-centric ticks a lot of boxes for IT departments,” Cary said. “The onus is on us to provide faster and cheaper servers and to ensure 100% up-times.”

Looking to the future, Cary said that Kamua plans to offer analytics to its customers, and its roadmap includes a mobile app that can automatically resize videos from the device camera roll. Plans are also afoot to raise a seed round of funding in early 2021 — up until now, Kamua has been funded through a combination of bootstrapping and some angel funding stretching back to a couple of products the team developed before pivoting entirely to Kamua in 2019.

In terms of pricing, the company officially opens its basic free tier next week, which will allow only a limited number of watermarked videos each month, limit video processing and cloud bandwidth, and leave out automated captions. The company’s paid plans, which will launch at a later date, will start at around \$25 per month, going up to the \$100-a-month “premium” plan that will offer more cloud storage, video processing, and other add-ons.

Source: https://venturebeat.com/2020/11/27/kamuas-ai-powered-editor-helps-marketers-embrace-vertical-video/

# AI Weekly: The state of machine learning in 2020


It’s hard to believe, but a year in which the unprecedented seemed to happen every day is just weeks from being over. In AI circles, the end of the calendar year means the rollout of annual reports aimed at defining progress, impact, and areas for improvement.

The AI Index is due out in the coming weeks, as is CB Insights’ assessment of global AI startup activity, but two reports — both called The State of AI — have already been released.

Last week, McKinsey released its global survey on the state of AI, a report now in its third year. Interviews with executives and a survey of business respondents found a potential widening of the gap between businesses that apply AI and those that do not.

The survey reports that AI adoption is more common in tech and telecommunications than in other industries, followed by automotive and manufacturing. More than two-thirds of respondents with such use cases say adoption increased revenue, but fewer than 25% saw significant bottom-line impact.

Along with questions about AI adoption and implementation, the McKinsey State of AI report examines companies whose AI applications led to EBIT growth of 20% or more in 2019. Among the report’s findings: Respondents from those companies were more likely to rate C-suite executives as very effective, and the companies were more likely to employ data scientists than other businesses were.

At rates of difference of 20% to 30% or more compared to others, high-performing companies were also more likely to have a strategic vision and AI initiative road map, use frameworks for AI model deployment, or use synthetic data when they encountered an insufficient amount of real-world data. These results seem consistent with a Microsoft-funded Altimeter Group survey conducted in early 2019 that found half of high-growth businesses planned to implement AI in the year ahead.

If there was anything surprising in the report, it’s that only 16% of respondents said their companies have moved deep learning projects beyond a pilot stage. (This is the first year McKinsey asked about deep learning deployments.)

Also surprising: The report showed that businesses made little progress toward mounting a response to risks associated with AI deployment. Compared with responses submitted last year, companies taking steps to mitigate such risks saw an average 3% increase in response to 10 different kinds of risk — from national security and physical safety to regulatory compliance and fairness. Cybersecurity was the only risk that a majority of respondents said their companies are working to address. The percentage of those surveyed who consider AI risks relevant to their company actually dropped in a number of categories, including in the area of equity and fairness, which declined from 26% in 2019 to 24% in 2020.

McKinsey partner Roger Burkhardt called the survey’s risk results concerning.

“While some risks, such as physical safety, apply to only particular industries, it’s difficult to understand why universal risks aren’t recognized by a much higher proportion of respondents,” he said in the report. “It’s particularly surprising to see little improvement in the recognition and mitigation of this risk, given the attention to racial bias and other examples of discriminatory treatment, such as age-based targeting in job advertisements on social media.”

Less surprisingly, the survey found an uptick in automation in some industries during the pandemic. VentureBeat reporters have found this to be true across industries like agriculture, construction, meatpacking, and shipping.

“Most respondents at high performers say their organizations have increased investment in AI in each major business function in response to the pandemic, while less than 30% of other respondents say the same,” the report reads.

The McKinsey State of AI in 2020 global survey was conducted online from June 9 to June 19 and garnered nearly 2,400 responses, with 48% reporting that their companies use some form of AI. A 2019 McKinsey survey of roughly the same number of business leaders found that while nearly two-thirds of companies reported revenue increases due to the use of AI, many still struggled to scale its use.

## The other State of AI

A month before McKinsey published its business survey, Air Street Capital released its State of AI report, which is now in its third year. The London-based venture capital firm found the AI industry to be strong when it comes to company funding rounds, but its report calls centralization of AI talent and compute “a huge problem.” Other serious problems Air Street Capital identified include ongoing brain drain from academia to industry and issues with reproducibility of models created by private companies. A team of 40 Google researchers also recently identified underspecification as a major hurdle for machine learning.

A number of conclusions found in the Air Street Capital report are in line with a recent analysis of AI research papers that found the concentration of deep learning activity among Big Tech companies, industry leaders, and elite universities is increasing inequality. The team behind this analysis says a growing “compute divide” could be addressed in part by the implementation of a national research cloud.

As we inch toward the end of the year, we can expect more reports on the state of machine learning. The state of AI reports released in the past two months demonstrate a variety of challenges but suggest AI can help businesses save money, generate revenue, and follow proven best practices for success. At the same time, researchers are identifying big opportunities to address the various risks associated with deploying AI.


Khari Johnson

Senior AI Staff Writer

Source: https://venturebeat.com/2020/11/27/ai-weekly-the-state-of-machine-learning-in-2020/

# Robotics researchers propose AI that locates and safely moves items on shelves


A pair of new robotics studies from Google and the University of California, Berkeley propose ways of finding occluded objects on shelves and solving “contact-rich” manipulation tasks like moving objects across a table. The UC Berkeley research introduces Lateral Access maXimal Reduction of occupancY support Area (LAX-RAY), a system that predicts a target object’s location, even when only a portion of that object is visible. As for the Google-coauthored paper, it proposes Contact-aware Online COntext Inference (COCOI), which aims to embed the dynamics properties of physical things in an easy-to-use framework.

While researchers have explored the robotics problem of searching for objects in clutter for quite some time, settings like shelves, cabinets, and closets are a less-studied area, despite their wide applicability. (For example, a service robot at a pharmacy might need to find supplies from a medical cabinet.) Contact-rich manipulation problems are just as ubiquitous in the physical world, and humans have developed the ability to manipulate objects of various shapes and properties in complex environments. But robots struggle with these tasks due to the challenges inherent in comprehending high-dimensional perception and physics.

The UC Berkeley researchers, working out of the university’s AUTOLab department, focused on the challenge of finding occluded target objects in “lateral access environments,” or shelves. The LAX-RAY system comprises three lateral access mechanical search policies. Called “Uniform,” “Distribution Area Reduction (DAR),” and “Distribution Area Reduction over ‘n’ steps (DER-n),” they compute actions to reveal occluded target objects stored on shelves. To test the performance of these policies, the coauthors leveraged an open framework — The First Order Shelf Simulator (FOSS) — to generate 800 random shelf environments of varying difficulty. Then they deployed LAX-RAY to a physical shelf with a Fetch robot and an embedded depth-sensing camera, measuring whether the policies could figure out the locations of objects accurately enough to have the robot push those objects.

The researchers say the DAR and DER-n policies showed strong performance compared with the Uniform policy. In a simulation, LAX-RAY achieved 87.3% accuracy, which translated to about 80% accuracy when applied to the real-world robot. In future work, the researchers plan to investigate more sophisticated depth models and the use of pushes parallel to the camera to create space for lateral pushes. They also hope to design pull actions using pneumatically activated suction cups to lift and remove occluding objects from crowded shelves.

In the Google work, which had contributions from researchers at Alphabet’s X, Stanford, and UC Berkeley, the coauthors designed a deep reinforcement learning method that takes multimodal data and uses a “deep representative structure” to capture contact-rich dynamics. COCOI taps video footage and readings from a robot-mounted touch sensor to encode dynamics information into a representation. This allows a reinforcement learning algorithm to plan with “dynamics-awareness” that improves its robustness in difficult environments.

The researchers benchmarked COCOI by having both a simulated and real-world robot push objects to target locations while avoiding knocking them over. This isn’t as easy as it sounds; key information couldn’t be easily extracted from third-angle perspectives, and the task dynamics properties weren’t directly observable from raw sensor information. Moreover, the policy needed to be effective for objects with different appearances, shapes, masses, and friction properties.

The researchers say COCOI outperformed a baseline “in a wide range of settings” and dynamics properties. Eventually, they intend to extend their approach to pushing non-rigid objects, such as pieces of cloth.