2017-06-19

Haiku Depot and Performance Around Bulk Data

The interplay between Haiku Depot Server (HDS) and the desktop application Haiku Depot (HD) is working well, but improvements could be made to both elements so that they scale up more gracefully in the future.

Some of the initial approaches taken to get things “basically working” were good decisions at the time, but they can now be revisited, with the luxury of improving a working system, to cope with increasing quantities of data and users.

Bulk Data

The API approach from HDS was intended to be used in a more “on-demand” fashion rather than as a bulk-transfer mechanism, and I had not appreciated that HD was going to be hauling in all the data it needs up-front. This is the primary area where changes are required.

One area that was taking considerable time was where HD spun up an HTTP request for each icon it needed. In 2016 I undertook some work (server and client-side) to change this arrangement so that it would instead pull a tar-ball of the icons. This has improved matters, but more work of this sort is required, and the target areas for improvement around bulk downloads are the repository and package data.

A new approach will be taken specifically for this bulk data. It will mean that the HD desktop application is able to request a compressed stream of bulk JSON data rather than use the existing RPC-based API. The requests from HD can then employ the standard HTTP If-Modified-Since header mechanism so that data-freshness is managed properly. To support this, date and time data will be conveyed in the JSON payload from HDS. The advantages of this approach are that far fewer requests are needed, the payload is compressed in transit, and data that has not changed need not be transferred at all; a sketch of such a conditional request follows.
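To make the conditional request a little more concrete, here is a minimal sketch of how a client might fetch such a payload. It uses libcurl purely for illustration rather than Haiku's own networking classes, and the endpoint path shown is a hypothetical placeholder rather than the actual HDS URL.

    // Sketch: conditionally fetch the compressed bulk-JSON payload.
    // The URL is a hypothetical placeholder; HD would use the Haiku
    // network kit rather than libcurl, which is used here for brevity.
    #include <curl/curl.h>
    #include <cstdio>
    #include <string>

    static size_t WriteToFile(void* data, size_t size, size_t count, void* userp)
    {
        return fwrite(data, size, count, static_cast<FILE*>(userp));
    }

    // Returns the HTTP status; 200 means fresh data was written to
    // outputPath, 304 means the locally cached copy is still current.
    long FetchBulkPackages(const std::string& ifModifiedSince, const char* outputPath)
    {
        CURL* curl = curl_easy_init();
        if (curl == NULL)
            return 0;

        FILE* out = fopen(outputPath, "wb");
        if (out == NULL) {
            curl_easy_cleanup(curl);
            return 0;
        }

        curl_easy_setopt(curl, CURLOPT_URL,
            "https://depot.haiku-os.org/__bulk/pkg/all.json.gz");
            // ^ hypothetical endpoint path

        // Only transfer the payload if it changed since the last fetch.
        struct curl_slist* headers = NULL;
        if (!ifModifiedSince.empty()) {
            std::string header = "If-Modified-Since: " + ifModifiedSince;
            headers = curl_slist_append(headers, header.c_str());
        }
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);

        // Accept a gzip-compressed stream; libcurl decompresses it on the fly.
        curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip");

        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteToFile);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);

        long status = 0;
        if (curl_easy_perform(curl) == CURLE_OK)
            curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);

        fclose(out);
        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);
        return status;
    }

A 200 response carries fresh data, while a 304 response tells the client that its locally cached copy is still current and no payload needs to be transferred.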

This does not constitute a move away from JSON-RPC, because RPC is working very well for most areas; it is only in the bulk acquisition of data that I would like to make changes.

Payload Models

A problem with an API defined across two separate projects and languages is sharing the data-transfer object (DTO) model or schema. Without some sort of model description mechanism, there is a risk that the two systems diverge and become incompatible.

To prevent this incompatibility, I plan to introduce a schema. There are some quite appealing infrastructures for handling model definition and subsequent stub generation, but perhaps the best option for this project would be to use JSON-Schema. Once the schema is defined, it will be possible to generate the Java-side DTOs and the C++-side DTOs from it. Maybe the C++ DTOs could be generated with a Python script using this JSON-Schema plugin.
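To give a rough idea of what the generated artefacts might look like on the C++ side, here is a hand-written sketch of the kind of DTO a generator could emit from a JSON-Schema definition; the class and field names are hypothetical and are not taken from any actual schema.

    // Hand-written sketch of a DTO that a JSON-Schema based generator
    // might emit on the C++ side; the names here are hypothetical.
    #include <String.h>
    #include <SupportDefs.h>

    class PkgDto {
    public:
        PkgDto()
            :
            fName(),
            fModifyTimestamp(0)
        {
        }

        const BString& Name() const { return fName; }
        void SetName(const BString& value) { fName = value; }

        // Milliseconds since the epoch; this is the date + time data that
        // supports the If-Modified-Since handling described above.
        int64 ModifyTimestamp() const { return fModifyTimestamp; }
        void SetModifyTimestamp(int64 value) { fModifyTimestamp = value; }

    private:
        BString fName;
        int64   fModifyTimestamp;
    };

The Java-side equivalent would be generated from the same schema file, which is what keeps the two codebases from drifting apart.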

JSON Parser

I modified Haiku's JSON parser earlier in 2017 so that it now supports streaming from a DataIO input; you can find out more about this here. It should also be possible to use the DTOs' schema to generate a listener that plugs into this JSON parsing architecture.
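To illustrate how a listener plugs into that streaming arrangement, here is a sketch of a simple hand-written listener. The class and method names (BJson, BJsonEventListener, BJsonEvent) reflect my reading of the relevant headers, and the exact signatures and namespaces should be treated as assumptions.

    // Sketch: a listener consuming events from the streaming JSON parser.
    // Exact class names, signatures and namespaces are assumptions based
    // on my reading of the headers at the time of writing.
    #include <DataIO.h>
    #include <Json.h>
    #include <JsonEventListener.h>

    class CountingListener : public BJsonEventListener {
    public:
        CountingListener()
            :
            fObjectCount(0),
            fError(B_OK)
        {
        }

        bool Handle(const BJsonEvent& event)
        {
            // In the bulk payload, each object is assumed to represent
            // one package (or one repository).
            if (event.EventType() == B_JSON_OBJECT_START)
                fObjectCount++;
            return true; // continue parsing
        }

        void HandleError(status_t status, int32 line, const char* message)
        {
            fError = status;
        }

        void Complete()
        {
            // The end of the stream has been reached.
        }

        int32 ObjectCount() const { return fObjectCount; }
        status_t Error() const { return fError; }

    private:
        int32 fObjectCount;
        status_t fError;
    };

    // Any BDataIO can be fed to the parser; for the bulk data this would
    // be a stream decompressing the HTTP response as it arrives.
    void CountObjects(BDataIO* input)
    {
        CountingListener listener;
        BJson::Parse(input, &listener);
    }

A listener generated from the schema would work in the same way, but would map the object-name and value events onto the setters of the generated DTOs rather than just counting objects.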

Conclusion

This is probably going to take some time to achieve, but it should provide for better interplay between the HaikuDepotServer application server and the HaikuDepot desktop application when handling bulk data.