Codes for the paper "SPEED: Streaming Partition and Parallel Acceleration for Temporal Interaction Graph Embedding"
For the datasets Wikipedia, Reddit, MOOC and LastFM, please download data from the project homepage of JODIE and pre-process them with the script provided by TGN.
For ML25m, please download data from the grouplens and put the file ratings.csv into the folder Datasets then, pre-process it with the ML25m2TGN.py. Or you can download the dataset which is pre-processed by us from Kaggle, and put it in the folder data.
For DGraphfin, please download data from the DGraph and put it into the folder Datasets then, pre-process it with the DGraphFin2TGN.py. Or you can download the dataset which is pre-processed by us from Kaggle, and put it in the folder data.
For Taobao, please download data from the Tianchi and put the file into the folder Datasets then, pre-process it with the Taobao2TGN.py. Or you can download the dataset which is pre-processed by us from Kaggle, and put it in the folder data.
If you would like to using our SEP, please first do one more pre-processing step, which transform the csv data file to txt format.
python partition/transform.py --data [DATA]
We also provide the processed files for datasets Wikipedia, Reddit, MOOC and LastFM in this repo, so for these four datasets, you could directly conduct the SEP.
Then move forward to the partition folder (by cd partition
).
graphfile
: the name of the file that stores the graph to be partitioned.
nparts
: the number of parts that the graph will be partitioned into. Maximum value 256.
topk
: topk nodes are selected as hubs.
Options:
-degree_compute string
-> options to compute nodes' centrality (normal decay)
-algorithm string
-> specifies the algorithm to be used (hdrf hashing mymethod). Default mymethod.
-lambda double
-> specifies the lambda parameter for hdrf and mymethod. Default 1.
-beta double
-> specifies the beta parameter for hdrf. Default 0.1.
-seed int
-> seed for repeated experiments. Default 0.
-threads integer
-> specifies the number of threads used by the application. Default all available processors.
-output string
-> specifies the prefix for the name of the files where the output will be stored (files: prefix.info, prefix.edges and prefix.vertices).
You can run the partition for temporal interaction graphs by:
java -jar dist/partition.jar [DATA] [2/4/8...] [0.01/0.05/0.10...] -degree_compute [normal/decay] -algorithm [mymethod/hdrf/hashing] -lambda [1/2...] -beta [0.1/0.2...] -seed [0/1/2/...] -threads [1/2/4...] -output output
Some parameters need to be transfered from SEP:
SEP | PAC |
---|---|
topk | top_k/100 |
nparts | 2^part_exp |
topk*100 | top_k |
part_exp |
Please use the main branch to proceed a regular training. You can also use it to train big datasets, however, by applying the codes in the branch "big_datasets" would save you some time.
python ddp_train_self_supervised.py --gpu 0,1,2,3 --data [DATA] --part_exp [1/2/3...] --[jodie/tgn/tgat/dyrep/tige] --prefix [add_your_prefered_prefix] --top_k [0/1/5/10/-1] --seed [0/1/2...] --sync_mode [last/none/average] --divide_method pre
You could also choose to add argument --shuffle_parts
to enable the random shuffling, when you have more graph partitions than your number of GPUs.
If you would like to train the three big datasets, you may need to pass two arguments to avoid OOM problem:
--backup_memory_to_cpu
--testing_on_cpu
.
And you may need to specify a vector dimension for nodes and edges, e.g.,--dim 100
.
python ddp_train_self_supervised.py --gpu 0,1,2,3 --data [DATA] --part_exp [1/2/3...] --[jodie/tgn/tgat/dyrep/tige] --prefix [add_your_prefered_prefix] --top_k [0/1/5/10/-1] --seed [0/1/2...] --sync_mode [last/none/average] --divide_method pre --backup_memory_to_cpu --testing_on_cpu --no_ind_val --dim 100
If you would like to train big datasets and want to accelerate the I/O process to save some time, please use the codes in the branch "IO-Accelerating".
In this branch, the whole training process is being accelerated by optimising the I/O process and omitting unnecessary inductive validation and only using testing_from_begin
setting.
You should (optionally) first save the graphs and sub-graphs by running save_graphs.py
for different (parameters): --data
, --part_exp
, --gpu
, --divide_method
, --top_k
and --seed
(the saved graphs will be the same if these parameters are the same), you may also would like to set a different --prefix
:
python save_graphs.py --gpu 0,1,2,3 --data [DATA] --part_exp [1/2/3...] --prefix [add_your_prefered_prefix] --top_k [0/1/5/10/-1] --seed [0/1/2...] --divide_method pre --save_mode save
Then you can reuse the saved graphs by setting --save_mode read
for training for different backbone models or other parameters.
python ddp_train_self_supervised.py --gpu 0,1,2,3 --data [DATA] --part_exp [1/2/3...] --[jodie/tgn/tgat/dyrep/tige] --prefix [add_your_prefered_prefix] --top_k [0/1/5/10/-1] --seed [0/1/2...] --sync_mode [last/none/average] --divide_method pre --backup_memory_to_cpu --testing_on_cpu --no_ind_val --dim 100 --save_mode read
Note that this branch only support for the situation, that your number of GPUs equals to 2^part_exp, for now.
Node Classification
python train_supervised.py --code [CODE]
Here, [CODE] is the HASH code of a trained model with train_self_supervised.py
.