[Event Report] ML@Loft #5 (NLP)


Hello, this is Yoshitaka Haribara, Startup Solutions Architect (Twitter: @_hariby). On August 27, the fifth ML@Loft, a machine learning community event held at AWS Loft Tokyo, took up natural language processing. For those who were interested but could not make it that day, here is a summary of the contents.

ML@Loft is a consultation event for machine learning that has been held every month at AWS Loft Tokyo in Meguro since April 2019. It originally started as a place where customers using AWS could casually consult about problems they run into when developing and operating services that incorporate machine learning. It has a two-part structure: the speakers (counselors) first give Lightning Talks (LTs) of about 10 minutes, which also serve as self-introductions, and the audience then splits into tables for specific consultations and discussions. Past events are covered in blog posts with the speakers' slides [#1 (MLOps), #2 (MLOps), #3 (Recommendation), #4 (Edge)].

The theme is decided each time based on participants' requests, and the fifth theme was natural language processing (NLP): practical NLP, especially in Japanese, and current best practices. The six speakers this time were a speaker from Connehito, Inc., Seiso Shimaoka (Studio Ousia, Inc.), Hideto Masuoka (Retrieva, Inc.), Takeshi Sakaki and Shiichi Yamanaka (Hottolink, Inc.), and Miku Fujii (GVA TECH, Inc.). These NLP practitioners shared their knowledge from many angles, from research to production operation.

LT session

Let's look back at the contents of the session. For details, please see each speaker's slides.

In Japanese natural language processing, morphological analysis and other preprocessing take up much of the work. The speaker from Connehito talked about building a "psychologically safe" ML workflow by making use of AWS services. The pre-training part builds a Gensim Word2Vec model, with an ETL flow of Amazon RDS → AWS Glue → Amazon S3. On top of this ETL part, preprocessing running on AWS Fargate (tokenization, part-of-speech filtering, normalization, stopword removal, dictionary creation, building the embedding matrix, turning texts into sequences, and the train/test split) is orchestrated with AWS Step Functions, and a TensorFlow model is trained and deployed for real-time inference. They also visualize results with Redash to keep monitoring accuracy in the production environment.
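As a rough illustration of the preprocessing and pre-training steps listed above, here is a minimal sketch assuming the fugashi MeCab wrapper (with unidic-lite) and Gensim are installed; the stopword list and toy corpus are placeholders, not Connehito's actual code.

```python
# Minimal sketch: tokenize Japanese text, drop stopwords, and pre-train
# word vectors with Gensim's Word2Vec, as in the flow described above.
from fugashi import Tagger
from gensim.models import Word2Vec

tagger = Tagger()  # morphological analyzer (MeCab via fugashi)
STOPWORDS = {"の", "は", "を", "に", "が", "と", "です"}  # placeholder list

def tokenize(text):
    """Split text into surface forms and remove stopwords."""
    return [w.surface for w in tagger(text) if w.surface not in STOPWORDS]

corpus = ["今日は良い天気です", "機械学習の前処理は大変です"]  # toy corpus
sentences = [tokenize(t) for t in corpus]

# Pre-train embeddings; in a production flow the model artifact would
# be written out to Amazon S3 rather than local disk.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
model.save("word2vec.model")
```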

NLP supporting community services for moms

Mr. Shimaoka talked about problems with the machine learning system he had been operating, which were dramatically improved by adopting Amazon SageMaker. The legacy system had two major issues: the code surrounding the essential machine learning logic had bloated, and the components were tightly coupled. Noticing that Amazon SageMaker could solve these problems, he left the peripheral processing to the managed service and led an infrastructure improvement project in which unifying the interfaces loosened the coupling between components. As a result, the amount of code was reduced by 40%, training time was reduced by 70%, and the accuracy of question answering improved by 4.4%. The project was a great success, drawing positive feedback from both in-house engineers and customers.
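As a rough sketch of what "leaving the peripherals to the managed service" can look like, here is a minimal example with the SageMaker Python SDK; the role ARN, script name, S3 path, and instance types are placeholders, not Studio Ousia's actual setup.

```python
# Minimal sketch: hand training and hosting to SageMaker so that only
# train.py holds the ML logic; infrastructure is the managed service's job.
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder ARN

estimator = TensorFlow(
    entry_point="train.py",          # your training script (placeholder)
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
)
estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 input

# Deploy the trained model behind a real-time inference endpoint.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")
```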


Rebuilding a machine learning infrastructure by introducing Amazon SageMaker

Retrieva, Inc. was introduced with research that has been covered in newspaper articles and published at top conferences [1]. When putting research results to use in business, B2B natural language processing has the problem that it is hard to find the best balance with business logic. For short PoC periods, preparing validation data and pre-training in advance helps, and when annotation data is scarce, realistic issues can be cleared by subdividing tasks and using unsupervised learning. As an aside, there are three steps before starting natural language processing: scrutinize whether the given data can be used for machine learning at all; check, with visualization, whether features are being extracted correctly; and select an algorithm, in particular deciding between batch and online learning and whether deep learning is necessary. In actual projects it is also important to manage customer expectations, and the details of that were discussed at the round table.

[1] Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Enrique Yalta Soplin, N., Heymann, J., Wiesner, M., Chen, N., Renduchintala, A., Ochiai, T. (2018) ESPnet: End-to-End Speech Processing Toolkit. Proc. Interspeech 2018, 2207-2211. DOI: 10.21437/Interspeech.2018-1456.
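To make the "check features with visualization" step concrete, here is a small illustrative sketch (my own, not from the talk) that extracts TF-IDF features from a toy corpus and projects them to 2D to eyeball whether the classes separate; scikit-learn and matplotlib are assumed.

```python
# Illustrative sketch of feature checking: extract TF-IDF features and
# project them with PCA to see whether the two classes separate at all.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

docs = ["good service", "bad service", "great support", "terrible support"]
labels = [1, 0, 1, 0]  # toy labels

X = TfidfVectorizer().fit_transform(docs).toarray()  # feature extraction
coords = PCA(n_components=2).fit_transform(X)        # 2D projection

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title("TF-IDF features projected with PCA")
plt.show()
```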

20190827_AWS_LOFT_LT

Regarding NLP on social media, with Twitter as the representative example, Mr. Sakaki explained how text analysis technology relates to NLP. Social media documents are taken as input, preprocessed, and then morphologically analyzed. The processing needed after that differs by task: search, topic analysis, and relationship analysis each require their own processing. As the lecture title suggests, documents seen on Twitter can include very distinctive expressions, so preprocessing has to be devised to deal with them: for example, a post like "[Sad news] Twitter NLP seriously or the unreasonable chazuke" is first converted (normalized) and only then tokenized.

Next, Mr. Yamanaka introduced specific methods for two concrete problems. For key phrase extraction from SNS posts, a BiLSTM + character CNN + POS (part-of-speech) + CRF (conditional random field) model was evaluated against a dictionary-based + heuristics method on hand-created key phrase data: the DNN surpassed the baseline in precision, and combining the techniques exceeded it in recall and F1. For text classification of SNS posts, to verify the effect of preprocessing and tokenization, BiLSTM inputs in word units and character units were compared; the word-based model learned faster and gave higher precision, recall, and F1. The takeaway was that for current Japanese SNS text classification it is better to handle words properly, but even without preprocessing or SNS-specific knowledge, character units still deliver reasonable performance.
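Here is a minimal sketch, in Keras, of the word-unit vs. character-unit comparison described above; the vocabulary sizes, sequence handling, and architecture details are illustrative assumptions, not Hottolink's actual configuration.

```python
# Illustrative sketch: the same BiLSTM classifier fed with word-unit ids
# vs. character-unit ids, to compare the two input granularities.
import tensorflow as tf
from tensorflow.keras import layers

def build_bilstm_classifier(vocab_size, num_classes):
    """BiLSTM text classifier over integer token-id sequences."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None,), dtype="int32"),  # variable length
        layers.Embedding(vocab_size, 128),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Word units: ids from a morphological analyzer's vocabulary (larger).
word_model = build_bilstm_classifier(vocab_size=30000, num_classes=2)
# Character units: ids from the character inventory (smaller vocabulary,
# longer sequences); no tokenizer or SNS-specific dictionary is needed.
char_model = build_bilstm_classifier(vocab_size=4000, num_classes=2)
word_model.summary()
```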

20190827 AWS ML@Loft#5 by Hottolink

Ms. Fujii talked about the characteristics of contract data in comparison with legal documents. In linguistics it is widely known that vocabulary and sentence patterns differ by field, so contract data should show tendencies of its own, and understanding them is important for building and analyzing models; the results of a comparison using a corpus from the National Institute for Japanese Language and Linguistics [2] were introduced. Looking at part-of-speech distributions to select a method, contracts and legal documents have similar POS distributions (a similar way of writing sentences), but contracts contain more adverbs, pronouns, and adnominals (richer expression). Furthermore, contracts contain more conjunctions (adding provisos to sentences), and since an adversative conjunction can reverse the meaning, bag-of-words may not handle them well, so approaches such as next sentence prediction need to be considered. Comparing frequent vocabulary as well, words frequent in contracts, such as 甲/乙 (Party A / Party B) and words for concluding, canceling, paying, and returning, differ from those frequent in legal documents, suggesting that what is written, and the responsibilities involved, differ. This interpretation of the data made it clear that contracts cannot simply be treated as legal documents. She also talked about model interpretability, that is, finding the grounds for a model's output. After organizing the approaches (global vs. local explanations, inherently interpretable models vs. post-hoc explanation) and what "interpretation" even means (for example, whether attention weights can be read as explanations), the point was what you want out of interpreting the model: where the motivation lies matters. At GVA TECH, they are doing trial and error for users, such as comparing the points lawyers pay attention to when reviewing a contract with the points the model pays attention to.

[2] Balanced Corpus of Contemporary Written Japanese (BCCWJ)
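As a small illustration of the kind of POS-distribution comparison described above, here is a sketch assuming fugashi with unidic-lite; the two one-sentence "corpora" are toy placeholders, not the actual contract and legal-document data.

```python
# Illustrative sketch: compare coarse part-of-speech distributions of a
# contract-like text and a statute-like text, as in the analysis above.
from collections import Counter
from fugashi import Tagger

tagger = Tagger()  # MeCab via fugashi, UniDic features

def pos_distribution(text):
    """Return relative frequencies of coarse POS tags (pos1) in text."""
    counts = Counter(w.feature.pos1 for w in tagger(text))
    total = sum(counts.values())
    return {pos: round(n / total, 2) for pos, n in counts.items()}

contract = "甲は乙に対し、直ちに代金を支払う。ただし、既払分は返還しない。"
statute = "この法律は、公布の日から施行する。"

print(pos_distribution(contract))  # expect more conjunctions, pronouns
print(pos_distribution(statute))
```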

Round Table discussion

In addition to the six speakers above, Ikuya Yamada, CTO of Studio Ousia, Inc., joined the round table. The discussion was too lively to write up in full, but here are some of the sticky notes participants wrote during the event.

  • How do you handle incremental (differential) model updates?
  • Were you able to fine-tune BERT without much trouble?
  • Is it better to work on a research theme alone or with several people?
  • What do you do to keep work from depending on a single person?
  • Code in Jupyter notebooks tends to get executed out of order; how do you deal with that?
  • How do you write tests?
  • Many pre-trained NLP models have appeared recently; what steps do you take to put them into products?
  • Will NLP over graph structures like Wikipedia's become popular from now on?
  • In B2B, doesn't it get painful when a separate model is built for each client company?
  • Where are the difficulties in proofreading and reviewing documents?
  • What does "managing customer expectations" mean?
  • Where should you start with natural language processing?
  • Are there any good preprocessing tools?
  • Choosing the right dictionary for the job
  • Sudachi offers long-unit and short-unit tokenization
  • How to build dictionaries
  • How do you avoid misjudging the direction of an NLP project? How do you move the business along quickly?
  • Isn't about 90% of machine learning preprocessing?
  • We struggle because proper evaluation is difficult
  • Things to remember and take home from today
Even so, there was not enough time to discuss everything. Thank you again to all the speakers and participants. The next event, ML@Loft #6, was held on September 20, jointly with the MLPP event, on the themes of natural language processing, recommendation, and time-series processing; please look forward to its report blog.

    The author of this blog

    針原 佳貴 (Yoshitaka Haribara) is a Solutions Architect for startups. Ph.D. in Information Science and Technology.