This post was contributed by Masahiro Abe, a data scientist at D2C, about a machine learning hackathon he ran.
* The sections covering the AWS services that supported the hackathon were written by AWS staff.
My name is Masahiro Abe, and I work in the Data Solution Department of D2C's DOCOMO Advertising Business Division. D2C runs an advertising business; our department analyzes users, advertisers, and media, and develops the logic and user segments behind our ad delivery systems.
docomo AD Network
Many companies today are focusing on what they can accomplish with data, but what I personally find most difficult is framing real business problems as data science problems. Doing that well requires not only data science knowledge and skills but also deep domain knowledge of the business.
In practice, however, I have come to feel that it may not be realistic to expect data analysts to understand the structure of every business at the same level as (or better than) the people working in the field. If, as at our company, the data analysts work in a centralized structure, gathered in a single department, many readers may recognize this as a challenge.
Ideally, we believe it is better to hand analysis skills, including modeling and data engineering, over to members of other departments, and to collaborate with them at the stage where problems are translated into concrete analysis tasks. As a first step in that direction, we planned an event in which engineers unfamiliar with machine learning would learn the basics, build models on real data, and compete with one another on prediction accuracy. This article describes that event in detail.
The data used for the hackathon task comes from our actual ad delivery system. The task is the same one our department used for its short-term internship in the summer of 2020. Specifically, participants are given past ad delivery logs, user attributes, and behavior logs from a certain service, and combine them to build a model that predicts click probability at user-and-ad granularity. Accuracy is then evaluated on test data consisting of delivery logs from dates later than the training data. Because the data is large and spread across multiple tables, even exploratory data analysis (EDA) requires a certain level of coding skill; for that reason, internship applicants took a coding test in advance, which we used to select participants.
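To make the task concrete, here is a minimal sketch of how such tables might be combined and split by date. All file and column names are hypothetical illustrations, not the actual D2C schema; in the hackathon itself, this kind of preparation was done with the AWS services described below.

```python
# Illustrative task setup; file and column names are hypothetical.
import pandas as pd

ads = pd.read_csv("delivery_log.csv", parse_dates=["delivered_at"])  # past ad delivery log (includes a click flag)
users = pd.read_csv("user_attributes.csv")                           # user attributes
behavior = pd.read_csv("behavior_log.csv")                           # behavior log from a service

# Combine the tables at user-and-ad granularity.
df = ads.merge(users, on="user_id", how="left")
actions = behavior.groupby("user_id").size().rename("n_actions").reset_index()
df = df.merge(actions, on="user_id", how="left")

# Test data consists of delivery logs later than the training period.
split_date = pd.Timestamp("2020-08-01")  # placeholder boundary date
train = df[df["delivered_at"] < split_date]
test = df[df["delivered_at"] >= split_date]
```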
So, while I was wondering how to get members unfamiliar with machine learning to tackle a task like this, AWS proposed an architecture that lets participants run the EDA → preprocessing → modeling loop with as little code as possible. AWS has always given us a great deal of support, from improving existing systems to advising on new ones, but I had not imagined they would go as far as designing a custom event like this with us and serving as instructors on the day.
The individual AWS services are explained in detail in the sections that follow, but here is the overall flow. The hackathon ran for three days (eight hours in total). On day one, we held a hands-on session on AWS Glue DataBrew, which lets you aggregate, visualize, and transform large-scale data through a GUI. On day two, we held a machine learning hands-on using Amazon SageMaker and AutoGluon, after which each participant spent the week before the final day working on improving accuracy. Combining these services makes it possible, even for non-engineers, to build predictive models with a minimal amount of code.
In the end, nearly 40 participants from various departments took part in the hackathon. More than half of them had never used AWS before, and running the event online added its own difficulties, but by including hands-on sessions in each part we got everyone all the way to modeling.
The hackathon task
On the final day, we asked the top-ranking participants to present what they had worked on.
The sections that follow explain each AWS service in detail.
* This section was written by Fujita of AWS.
The architecture used to run the machine learning hackathon is shown in the figure below.
Architecture used in the hackathon
The hackathon used Amazon S3, AWS Glue DataBrew, Amazon SageMaker, and Amazon EC2.
The rough flow is as follows. Participants visualize and aggregate the data in AWS Glue DataBrew to deepen their understanding of it, and create features to improve the prediction accuracy of their machine learning models. In Amazon SageMaker, they build a machine learning model from those features and generate predictions on the test data.
The prediction results are sent to an MLflow server built on Amazon EC2, which computes accuracy from the submitted predictions. Participants can then check the accuracy of their machine learning models in MLflow.
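As a rough illustration, a participant's submission might look like the sketch below. The tracking URI, experiment name, and values are placeholders, and in the actual setup the server side computed the accuracy from the uploaded predictions.

```python
# Sketch of reporting a run to the shared MLflow server; the URI,
# experiment name, and values are placeholders, not the real setup.
import mlflow

mlflow.set_tracking_uri("http://<mlflow-server-on-ec2>:5000")
mlflow.set_experiment("ctr-hackathon")

with mlflow.start_run(run_name="my-first-model"):
    mlflow.log_artifact("predictions.csv")  # predicted click probabilities
    mlflow.log_metric("auc", 0.70)          # accuracy recorded for this run
```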
* This section was written by Tsuwazaki of AWS.
We used AWS Glue DataBrew to aggregate and visualize the data prepared for the hackathon, such as the user attributes, and to generate features for the subsequent machine learning step.
AWS Glue DataBrew is a visual, interactive data preparation tool that connects easily to data stores such as Amazon S3 and lets you explore, join, cleanse, and enrich data without writing any code.
In typical analytics and machine learning workflows, up to 80% of the time is said to be spent on data preparation. With AWS Glue DataBrew, data analysts who do not write code can prepare data and generate features entirely from a visual interface. For the hackathon, we built the AWS Glue DataBrew hands-on around the actual machine learning use case and covered the operations most often performed during feature generation.
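DataBrew performs all of this through its GUI, but for readers who think in code, the following pandas snippet is a rough equivalent of typical feature-generation steps of this kind (missing-value handling, one-hot encoding, aggregation). The column names are hypothetical, not taken from the hands-on.

```python
# Rough code equivalent (hypothetical columns) of operations that
# DataBrew performs from its visual interface.
import pandas as pd

users = pd.read_csv("user_attributes.csv")

# Cleanse: fill missing numeric values.
users["age"] = users["age"].fillna(users["age"].median())

# Encode: one-hot encode a categorical attribute.
users = pd.get_dummies(users, columns=["gender"], dummy_na=True)

# Aggregate: summarize a behavior log per user and join it back.
behavior = pd.read_csv("behavior_log.csv")
event_count = behavior.groupby("user_id")["event"].count().rename("event_count")
features = users.merge(event_count, on="user_id", how="left")
```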
If you would like to learn more about AWS Glue DataBrew itself, a video is available from the AWS Black Belt Online Seminar "AWS Glue DataBrew".
The AWS Glue DataBrew procedure introduced at the hackathon has been written up in hands-on format. By working through it, you can experience the same feature generation as the hackathon participants did.
* The hands-on for D2C used data provided by the customer, but for this blog we have prepared sample data that lets you follow a similar procedure. Please use it alongside the hands-on.
* This section was written by Fujita of AWS.
To build the machine learning models for the hackathon, we used Amazon SageMaker and AutoGluon.
Amazon SageMaker is a fully managed service that provides a broad set of capabilities for machine learning. It helps data scientists and developers quickly prepare data and build, train, and deploy machine learning models.
AutoGluon is open-source software developed by AWS that runs AutoML on images, text, and tabular data. For this hackathon, we used AutoGluon-Tabular, which runs AutoML on tabular data.
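As a minimal sketch, training a click-prediction model with AutoGluon-Tabular takes only a few lines. The file names and the 'clicked' label column are assumptions for illustration, not the actual hackathon schema; running this in a SageMaker notebook spares participants from building a local environment.

```python
# Minimal AutoGluon-Tabular sketch; file names and the 'clicked'
# label column are assumptions, not the actual hackathon schema.
from autogluon.tabular import TabularDataset, TabularPredictor

train = TabularDataset("train_features.csv")
test = TabularDataset("test_features.csv")

# AutoML: AutoGluon trains and ensembles several models automatically.
predictor = TabularPredictor(label="clicked", eval_metric="roc_auc").fit(train)

# Predicted click probabilities for the later, test-period logs.
proba = predictor.predict_proba(test)
print(predictor.leaderboard())  # compare the trained models
```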
The benefits of using SageMaker and AutoGluon for the hackathon are as follows.
Combining the two is efficient in that even people unfamiliar with machine learning can set up the environment and build machine learning models.
Let me explain the specific flow.
After that, participants spend a week improving their models through trial and error.
This trial and error, and watching the model's accuracy improve or degrade, can be expected to help participants understand the business through data and to serve them well in real data analysis work.
That is the overall flow of the machine learning hackathon.
Thank you for reading to the end.
One area for improvement was the difficulty level, which was hard to adjust and remains a challenge for the future.
To let people from as many backgrounds as possible participate, we want to keep the bar low, but if we lower it too much the event becomes unsatisfying for those who already have some experience.
Going forward, we would like to incorporate ideas such as splitting the event into tracks by difficulty, and continue activities like this training on a regular basis to promote everyday use of data.
I would be glad if these activities make the use of data even more active.
D2C Co., Ltd. Masahiro Abe
In charge of machine learning and building the infrastructure for it at D2C Co., Ltd. Worried that becoming a manager will reduce my hands-on work, I gladly jump into ETL tasks. My hobby is going around Gold's Gym locations.
Amazon Web Services Japan Solutions Architect Miki Tsuwazaki
An enterprise solutions architect in charge of customers in the telecommunications industry. My favorite AWS service is the AWS Management Console; I like that anyone, regardless of experience, can use it to start any kind of service, from compute to machine learning.
Amazon Web Services Japan Machine Learning Prototyping Solutions Architect Atsunori Fujita
A machine learning prototyping solutions architect specializing in natural language processing and time-series analysis. My hobby is Kaggle.