Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shuffle sniffer nodes to Improving stability and performance #47354

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

saiergong
Copy link

@saiergong saiergong commented Oct 1, 2019

In the spark streaming, use the rest client to write es. When the number of nodes in the spark is large, the job restart will cause the es node to oom. Because when the spark job restarts, the order of the es nodes detected by all the spark nodes using ElasticsearchNodesSniffer is basically the same, and the initial value of the member variable lastNodeIndex of the RestClient is 0, so the initial request of each spark node is sent to the same es node. , causing the es node oom.

At the same time, if the number of requests sent to es on each spark node is basically the same, it is very likely that the requests on each spark node are basically sent to the es node in the same order, resulting high load on some nodes in es, affecting write performance.

This commits shuffle the nodes after probe the es cluster nodes, avoding such problem.

@elasticcla
Copy link

Hi @saiergong, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@saiergong
Copy link
Author

Hi @saiergong, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

is it ok now?

@saiergong
Copy link
Author

@nik9000 could you help to review this request?

@polyfractal polyfractal added the :Clients/Java Low Level REST Client Minimal dependencies Java Client for Elasticsearch label Oct 15, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Java Low Level REST Client)

Copy link
Contributor

@hub-cap hub-cap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, lets get the imports fixed and then ill run the tests

import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These will not pass tests, we keep our imports explicit. You may have to change your IDE to not do this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, fixed.

@hub-cap
Copy link
Contributor

hub-cap commented Oct 22, 2019

@elasticmachine update branch

@hub-cap
Copy link
Contributor

hub-cap commented Oct 22, 2019

@elasticmachine ok to test

@hub-cap
Copy link
Contributor

hub-cap commented Oct 23, 2019

I got to thinking about this and I am not sure I like changing the default here. Let me have a day to think about how we can fix this so that you (and others) can still have something that is shuffled, and we dont change the default.

@saiergong
Copy link
Author

I got to thinking about this and I am not sure I like changing the default here. Let me have a day to think about how we can fix this so that you (and others) can still have something that is shuffled, and we dont change the default.

I wonder why does build fail? Is it not allowed to call shuffle function at this place? Or do we implement a shuffle function here?

@hub-cap
Copy link
Contributor

hub-cap commented Oct 24, 2019

I think it would make more sense now to alter the Sniffer code so that there is a new paramerter (maybe boolean shuffleNodes)introduced into the constructor, and then in Sniffer#sniff() check for this parameter being true, and if so, do the shuffle there instead. It will allow for custom sniffer impls to also benefit from this shuffling if they want to.

Then you can add it to the SnifferBuilder too so its set properly in the constructor of Sniffer and please default it to false if its not set in the builder.

@hub-cap
Copy link
Contributor

hub-cap commented Oct 24, 2019

in regard to the test failing, we are using a method we cannot use, lets instead use the API that accepts the List and a Random, and add a new Random(nodes.size()) that is based off the number of nodes returned in the sniffed list.

@saiergong
Copy link
Author

ok,More flexible in this way. I'll do it

@saiergong
Copy link
Author

in regard to the test failing, we are using a method we cannot use, lets instead use the API that accepts the List and a Random, and add a new Random(nodes.size()) that is based off the number of nodes returned in the sniffed list.

if we use nodes.size() as the Random seed,you should not be able to solve the problem of shuffle。 Because if the seeds are the same on all spark node, the generated random number list will be the same. maybe we should use the timestamp as the Random seed?

@hub-cap
Copy link
Contributor

hub-cap commented Oct 28, 2019

This is a good point. The seed needs to be something that is reproducible from a test scenario, as well as random enough that you dont hit the problem you mention. I think it might make more sense if we relaxed the final modifier on the ElasticsearchNodesSniffer, you should be able to just override the sniff() method like the following

public class ShufflingNodesSniffer extends ElasticsearchNodesSniffer {
    public ShufflingNodesSniffer(RestClient restClient) {
        super(restClient);
    }

    @Override
    public List<Node> sniff() throws IOException {
        List<Node> nodes = super.sniff();
        Collections.shuffle(nodes);
        return nodes;
    }
}

and you can define whatever random seed (if any) you want for your nodes. Then you just need to use it like this

ShufflingNodesSniffer nodesSniffer = new ShufflingNodesSniffer(restClient);
Sniffer sniffer = Sniffer.builder(restClient).setNodesSniffer(nodesSniffer).build();

Then we end up with the freedom that you can do whatever you want w the sniffer, and it does not change implementation details for other clients. What do you think of this proposal instead?

@saiergong
Copy link
Author

i could implement ShufflingNodesSniffer in my code, but maybe other people will encounter this problem some day. so i think maybe the community should provide the shuffle ability?

if we can use the API that accepts the List and a Random, may be we could use the ip address as the Random seed?

@hub-cap
Copy link
Contributor

hub-cap commented Oct 30, 2019

We also want the random seed reproducible between machines, because a test should be reproducible given a seed for the testing framework. So using the IP as a random value will not make for a reproducible test if you take it to a different machine.

This is the first ask for a shuffled list, so i think the best thing we can do is let the user shuffle it if need be. We can add it in proper if we find that more people want the node shuffle feature.

@saiergong
Copy link
Author

If the test framework requires the results of two machines to be replicated, then this conflicts with the function we want to achieve...

@hub-cap
Copy link
Contributor

hub-cap commented Oct 30, 2019

If the test framework requires the results of two machines to be replicated, then this conflicts with the function we want to achieve...

This is not true exactly. The test should be reproducible, given a seed, which will control the source of randomness. The seed should be something that would be the same on 2 test machines, but can be influenced by something at runtime. For example, if there was a session ID that could be used, the tests could always supply the same session ID and therefore be replicated on diff machines.

@nik9000
Copy link
Member

nik9000 commented Oct 30, 2019

The server and everything that lives inside it can share Randomness with the tests but we don't have access to that in the clients. We use it to make tests properly reproducible all the time. But clients don't have it.

@saiergong
Copy link
Author

At present, I copyed ElasticsearchNodesSniffer to my project package and add shuffle function, which has solved my problem.

@hub-cap
Copy link
Contributor

hub-cap commented Oct 31, 2019

which has solved my problem.

This is true, but only for this version. I think a PR to relax the final on the sniff() method I mentioned would be best, and it would allow you to stop copying the class wholesale into your project. Otherwise if we introduce new code or refactor things, you may not get the newest code.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:14
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@elasticsearchmachine elasticsearchmachine added Team:Clients Meta label for clients team and removed Team:Data Management Meta label for data/management team labels Jul 27, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/clients-team (Team:Clients)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Clients/Java Low Level REST Client Minimal dependencies Java Client for Elasticsearch >enhancement Team:Clients Meta label for clients team v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.