PaddlePaddle’s New API Simplifies Deep Learning Programs

In September, we open sourced PaddlePaddle, the deep learning framework that has powered a range of Baidu products since its inception four years ago. To make the platform easier for the community to use, we have made several updates since then, including adopting the Kubernetes cluster management system.

Today we are happy to announce the alpha release of our new API as well as our new book, Deep Learning with PaddlePaddle, which includes example programs.

Using the new API, PaddlePaddle programs now require fewer lines of code, as shown in the example below. The figure shows a convolutional network program written in the old API (left) and the new one (right).

This significant simplification is the result of three key improvements:

1. A New Conceptual Model

New research requires a flexible way to describe innovative deep learning algorithms. For example, a GAN model contains two networks whose layers share some parameters. Also, during training, we need to fix some parameters while updating others. With our old API, users would have to reach into very low-level APIs, which are often undocumented, to get such flexibility. With the new API, the illustrative GAN example takes only a few lines of code.
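
To make the idea concrete, the sketch below shows the kind of parameter sharing and freezing involved. It is only an illustration: the attribute names (`paddle.attr.Param`, `is_static`) are our assumptions and may not match the released API exactly.

# A hedged sketch of the GAN-style flexibility: two sub-networks that
# share a parameter by giving it the same name, and a flag that holds
# the discriminator's parameters fixed while the generator is updated.
# The attribute names below are assumptions, not confirmed API.
import paddle.v2 as paddle

def generator(noise):
    # The explicit parameter name lets another layer share this weight.
    return paddle.layer.fc(
        input=noise, size=784, act=paddle.activation.Tanh(),
        param_attr=paddle.attr.Param(name='g_w'))

def discriminator(image, freeze=False):
    # Setting is_static=True keeps this weight fixed during training.
    return paddle.layer.fc(
        input=image, size=1, act=paddle.activation.Sigmoid(),
        param_attr=paddle.attr.Param(name='d_w', is_static=freeze))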

2. Higher-level API

PaddlePaddle was created to support distributed training. The old API exposes many details that users need to understand before writing a distributed program. And while PaddlePaddle can run a training loop pre-defined in C++ code, doing so prevents PaddlePaddle programs from running inside Jupyter Notebook, an ideal environment for documenting work. The new API provides higher-level functions such as `train`, `test`, and `infer`. For example, the `train` API can run a local training job today and will be able to run a distributed job on a Kubernetes cluster.
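
For concreteness, a local training job with the new API looks roughly like the sketch below. The layer, optimizer, and dataset names are illustrative of the v2-style interface and the exact signatures may differ slightly from the released API.

# A minimal sketch of a local training job with the high-level API.
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

# A small softmax classifier over MNIST images.
images = paddle.layer.data(
    name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
    name='label', type=paddle.data_type.integer_value(10))
prediction = paddle.layer.fc(
    input=images, size=10, act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(input=prediction, label=label)

# Create parameters, pick an optimizer, and call the high-level train API.
parameters = paddle.parameters.create(cost)
trainer = paddle.trainer.SGD(
    cost=cost,
    parameters=parameters,
    update_equation=paddle.optimizer.Momentum(momentum=0.9, learning_rate=0.01))
trainer.train(
    reader=paddle.batch(
        paddle.reader.shuffle(paddle.dataset.mnist.train(), buf_size=8192),
        batch_size=128),
    num_passes=5)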

3. Compositional Data Bricks

Data loading in industrial AI applications is far from trivial and usually requires a lot of source code. Our new API provides the compositional concepts of reader, reader-creator, and reader-decorator, which enable the reuse of data operations. For example, we can define a reader-creator, `impressions()`, that reads a search engine's impression log stream from Kafka in a few lines of Python code. We can also define another reader, `clicks()`, for reading the click stream. Then we can buffer and shuffle using pre-defined reader-decorators, and even compose/join the two data streams:

# Compose the buffered impression and click streams into a single
# reader, then shuffle it with a buffer of 4096 instances.
r = paddle.reader.shuffle(
      paddle.reader.compose(
        paddle.reader.buffer(impressions(impression_url)),
        paddle.reader.buffer(clicks(click_url))),
      4096)
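
To see why these pieces compose so easily: a reader-creator is just a function that returns a reader, i.e., a no-argument function that yields one data instance at a time, and a reader-decorator wraps one reader to produce another. The sketch below uses a local file as a stand-in for the Kafka stream mentioned above; the names are illustrative only.

# A hedged sketch of a reader-creator. The file-based source stands in
# for the Kafka impression stream described in the text.
def impressions(impression_log_path):
    def reader():
        with open(impression_log_path) as f:
            for line in f:
                yield line.strip()  # one impression record per yield
    return reader

# A toy reader-decorator in the same spirit as paddle.reader.firstn:
# it wraps a reader and truncates the stream after n instances.
def first_n(reader, n):
    def new_reader():
        for i, instance in enumerate(reader()):
            if i >= n:
                break
            yield instance
    return new_reader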

If we want a small subset of 5000 instances for a quick experiment, we can use: 

paddle.reader.firstn(r, 5000)

In the paddle.dataset package, we provide pre-defined readers that download public datasets and read them from the local cache.
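
For example, a pre-defined dataset reader drops into the same reader pipeline. The module name below (paddle.dataset.uci_housing) is illustrative; the reader downloads the data once and then reads from the local cache.

# Hedged example: using a pre-defined dataset reader in a reader pipeline.
import paddle.v2 as paddle

train_reader = paddle.batch(
    paddle.reader.shuffle(paddle.dataset.uci_housing.train(), buf_size=500),
    batch_size=32)

for batch in train_reader():
    print(len(batch))  # each batch is a list of (features, price) instances
    break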

We will continue improving this new API over time. Your comments, feedback, and code contributions are highly appreciated!

By Yi Wang, Tech Lead, PaddlePaddle team
March 9, 2017
