In T359062: Assess Wikidata dump import hardware there is compelling evidence that increasing the buffer capacity for import, that is, setting com.bigdata.rdf.sail.bufferCapacity=1000000 in RWStore.properties, leads to a material performance improvement, as observed on a gaming-class desktop.
This task requests that we verify this soon on a WDQS node in the data center, preferably ahead of any further imports with changed graph split definitions.
At this point it seems clear that CPU speed, disk speed, and buffer capacity each make a meaningful difference in import time.
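Concretely, the change under discussion is a one-line edit to RWStore.properties (the key and both values are from this task; the comment lines below are illustrative, and the rest of the file is unchanged):

```
# Default configuration (current on WDQS nodes):
com.bigdata.rdf.sail.bufferCapacity=100000

# Proposed for the verification runs:
com.bigdata.rdf.sail.bufferCapacity=1000000
```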
Proposed:
Using the scholarly_articles split files, run imports on wdqs2024 as follows:
- With the CPU performance governor configuration applied as described in T336443#9726600, and with the existing default RWStore.properties configuration (com.bigdata.rdf.sail.bufferCapacity=100000, note this is 100_000). This will help us understand whether, on the R450 setup, the performance benefits of the performance governor configuration (roughly an analog of a faster processor, as seen with the gaming-class desktop) extend to this bulk ingestion routine. Results can be compared against T350465#9405888.
- Then, still with the CPU performance governor configuration in place, using an RWStore.properties with com.bigdata.rdf.sail.bufferCapacity=1000000 (note this is 1_000_000). This will let us verify that the performance benefits extend further on this hardware class.
- If and when a high-speed NVMe is installed in wdqs2024 (T361216), with both the CPU performance governor and the higher buffer capacity in place. This will let us verify whether the performance benefits extend even further on this hardware class.
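If we end up flipping the buffer capacity between runs, a small script keeps the edit repeatable. This is a minimal sketch and not an existing tool: the helper name `set_buffer_capacity` and the idea of scripting the edit are assumptions of this sketch; only the property key and the two values come from this task.

```python
import re

def set_buffer_capacity(properties_text: str, capacity: int) -> str:
    """Return properties_text with com.bigdata.rdf.sail.bufferCapacity
    set to capacity, appending the line if the key is absent."""
    key = "com.bigdata.rdf.sail.bufferCapacity"
    pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
    replacement = f"{key}={capacity}"
    if pattern.search(properties_text):
        # Replace the existing setting in place.
        return pattern.sub(replacement, properties_text)
    # Key not present: append it at the end of the file.
    return properties_text.rstrip("\n") + "\n" + replacement + "\n"

# Example: bump the default 100_000 to the proposed 1_000_000.
original = "com.bigdata.rdf.sail.bufferCapacity=100000\n"
updated = set_buffer_capacity(original, 1_000_000)
```

Running it over the node's RWStore.properties before each import (and diffing the result) would make it easy to confirm which configuration a given run actually used.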
We previously used wdqs1024 for the main graph ("non-scholarly") import; note that the request here is to do the scholarly article graph import on wdqs2024, mainly because an NVMe request for that host is in flight.