Building an Efficient AI Platform for Data Preprocessing and Model Training | HackerNoon

📆 7/14/2022 7:00 AM

📰 hackernoon


Lei Li, AI Platform Lead, and Zifan Ni, Senior Software Engineer, both from Bilibili, share how they applied Alluxio to their AI platform to increase training efficiency, along with best practices covering technical architecture and specific tuning tips. Bilibili is a leading video community with a mission to enrich the everyday life of the young generations in China.

We use Alluxio as an intermediate layer between the computation and storage of our AI platform. We came across four major challenges before adopting Alluxio, and Alluxio has helped us overcome them.

The training data is downloaded through the container and read locally during the training phase. However, if the container crashes during the download, we need to restart the container and download the data again, which is a huge waste of time.

Alluxio can hold large volumes of data in a distributed way.

A unified namespace makes data access as simple as changing configuration. With both OSS and HDFS mounted to Alluxio, the single unified namespace logically decouples the applications from storage: the AI applications simply communicate with Alluxio, while Alluxio handles the communication with the different underlying storage systems on the applications' behalf.

We have a huge amount of training data, which is too large to fit on a single machine.
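To make the unified namespace idea concrete, here is a minimal sketch of how a mount table can translate one logical path space into several underlying storage systems. It is a toy model, not Alluxio's implementation: the `MountTable` class, the `/oss` and `/hdfs` mount points, and the bucket and namenode names are all hypothetical.

```python
from typing import Dict


class MountTable:
    """Toy unified namespace: maps logical path prefixes to storage URIs."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        # Normalize trailing slashes so prefix matching stays simple.
        self._mounts[logical_prefix.rstrip("/")] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        """Return the underlying storage URI for a logical path."""
        best = None
        for prefix in self._mounts:
            covers = logical_path == prefix or logical_path.startswith(prefix + "/")
            # Keep the longest matching prefix so nested mounts win.
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount point covers {logical_path}")
        return self._mounts[best] + logical_path[len(best):]


# Hypothetical mount points and storage locations, for illustration only.
table = MountTable()
table.mount("/oss", "oss://example-bucket/training")
table.mount("/hdfs", "hdfs://example-namenode:8020/warehouse")

print(table.resolve("/oss/images/a.jpg"))
print(table.resolve("/hdfs/labels/part-0"))
```

The application only ever sees paths such as `/oss/images/a.jpg`; which storage system actually serves the bytes is a configuration detail of the mount table.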

Now suppose a newly created pod, pod 1, needs both the GPU resources and the cache on GPU 0. Since pod 0 has already taken over the GPU resources and the cache on GPU 0, pod 1 must wait for pod 0 to release them before it can read data from the cache properly. This is unacceptable, because pod 0 may hold the resources for a long time without releasing them. Nor can pod 1 be scheduled to the node on the right-hand side, because that node has no Alluxio FUSE service and no Alluxio cache.

After all the nodes are started and deployed, when the user submits a task, both the task and the FUSE process are injected into the training container as sidecars. The FUSE sidecar mounts the FUSE path and the host path in both directions, while the task sidecar accesses cached data in Alluxio via this path.

To solve this problem, the first step is to reduce MaxGCPauseMillis, which makes the JVM run GC more aggressively, and to increase ParallelGCThreads. Meanwhile, we can give the Alluxio master higher memory requests to ensure that there is enough memory to store large amounts of metadata.

In some heavy-load tasks, stop-the-world GC cannot be avoided even after tuning the parameters mentioned above. As a result, some FUSE requests time out, making the training tasks fail.
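As a rough illustration of the GC tuning described above, the sketch below assembles the two HotSpot flags named in the text into a JVM options string. The helper `build_gc_opts` and the values 100 and 8 are hypothetical; the article does not state the values used, and suitable settings depend on the workload and the machine.

```python
def build_gc_opts(max_gc_pause_millis: int, parallel_gc_threads: int) -> str:
    """Build a JVM options string from the two GC settings named in the text."""
    if max_gc_pause_millis <= 0 or parallel_gc_threads <= 0:
        raise ValueError("both settings must be positive integers")
    return (
        f"-XX:MaxGCPauseMillis={max_gc_pause_millis} "
        f"-XX:ParallelGCThreads={parallel_gc_threads}"
    )


# Illustrative values only: a lower pause target and more GC threads.
opts = build_gc_opts(max_gc_pause_millis=100, parallel_gc_threads=8)
print(opts)
```

The resulting string would be passed to the JVM that runs the Alluxio master through whatever mechanism the deployment uses to set Java options.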
