Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question regarding queryAbundance(node) function. #19

Open
ddurai opened this issue Jan 17, 2018 · 8 comments
Open

Question regarding queryAbundance(node) function. #19

ddurai opened this issue Jan 17, 2018 · 8 comments

Comments

@ddurai
Copy link

ddurai commented Jan 17, 2018

Hello all,
I am using GATB library for my work. I have a naive question which I have not been able to find an answer for. While using one of the functions to find the abundance of a node (queryAbundance(node)) in the initial dataset, I always get the abundance as 255 for any kmer having abundance greater than 255. Upon looking at the gatb-core source code I found the following template in Graph.cpp:

template<typename Node, typename Edge, typename GraphDataVariant>
struct queryAbundance_visitor : public boost::static_visitor {
Node& node;
queryAbundance_visitor (Node& node) : node(node){}
template<size_t span> int operator() (const GraphData& data) const
{
unsigned long hashIndex = getNodeIndex(data, node);
if(hashIndex == ULLONG_MAX) return 0; // node was not found in the mphf
unsigned char value = (*(data._abundance)).at(hashIndex);
return value;
}
};

The template is returning an unsigned char value which has a range of 0 to 255. This might be a reason why the value is always 255. But even if I change it to "unsigned int", I still get the same result. Inspired from the snippet given in GATB library (deBruijn26.cpp), I am using the query abundance function as follows:

std::string s = model.toString (itKmer->value()); //itkmer->value is the kmer of length 21 from a read
const char* sq = s.c_str();
Node node = graph.buildNode(sq);
auto abund = graph.queryAbundance(node);

Am I making a mistake here in my above code?

regards
Dilip

@rchikhi
Copy link
Member

rchikhi commented Jan 17, 2018

Hi Dilip! yep, thanks for reporting it (and seen your email); will reply shortly

@ddurai
Copy link
Author

ddurai commented Jan 18, 2018

Thank you. (I apologies for bothering you with multiple mails )

@rchikhi
Copy link
Member

rchikhi commented Jan 25, 2018

hi Dilip!
should be fixed now in the latest master commit (687629e). Previously, abundances in the Graph were always recorded with values at most 255. Guillaume had partly implemented quantized abundance encoding, so that abundances larger than 255 are now encoded with some fixed error. I've activated this feature in the Graph.

If you really need high precision abundances: right now the abundance is fixed at 8-bit precision but I could point to you what to change in the code to have it at a higher precision (it's basically a matter of changing a uint8 to a uint16 somewhere).

In addition, I added a unit test (TestDebruijn::debruijn_large_abundance_query which can be run with CPPUNIT_VERBOSE=1 bin/gatb-core-cppunit TestDebruijn)

@rizkg
Copy link
Contributor

rizkg commented Jan 26, 2018

Thank you for the fix rayan.
Some more details about how abundance is encoded with this 8-bit scheme :

  • from 0 to 70 : step = 1
  • from 70 to 100 : step = 2
  • from 100 to 500 : step = 10
  • from 500 to 1000 : step = 20
  • from 1000 to 5000 : step = 100
  • from 5000 to 10000 : step = 200
  • from 10000 to 50000 : step = 1000

Up to 70 the value returned is exact. Then there is some error.
i.e. for example if the code gives you the value 410, the real abundance value is in the interval [405 ; 415]
The max is now 50000.

@ddurai
Copy link
Author

ddurai commented Jan 26, 2018

Hey Guys,
Thanks a lot for the fix and the explanation. Now its more clear to me how the abundance function is working. Also at later stages of my work I would be working with abundances of high precision. So it would be good to know where to change in the code

Thanks a lot once again

@rchikhi rchikhi closed this as completed Mar 12, 2018
@rchikhi rchikhi reopened this Mar 12, 2018
@rchikhi
Copy link
Member

rchikhi commented Mar 12, 2018

Regarding adapting the code to higher precision:

actually since discretization is in place and cannot be disabled, it is a bit harder to change the code to have precise high abundances.

  1. one would need to change these two functions

int abundanceAt (const Key& key) {

int abundanceAt (typename Hash::Code code) {

to only do return data[hash[key]] and return data[code], respectively.

  1. and that line:

template<size_t span=KMER_DEFAULT_SPAN, typename Abundance_t=u_int8_t, typename NodeState_t=u_int8_t>

Abundance_t=u_int16_t

If it doesn't work, let us know.

@leoisl
Copy link

leoisl commented Mar 15, 2018

Hello,

I too got this problem and tried to solve as described. I created a fork (https://github.com/leoisl/gatb-core) since I did some other changes, but I did not manage to make it work (querying the node's abundance with Node_t::abundance gives the correct abundance, but with GraphTemplate::queryAbundance, no). For example, in TestDebruijn::debruijn_large_abundance_query, this is the output I get (added back some debug stuff):

TestDebruijn::debruijn_large_abundance_query
AAAGAACATGTGAGCAACGGGATAACGCAGG test printing node abundance with it.item().abundance: 999
AACATGTGAGCAACGGGATAACGCAGGAAAG test printing node abundance with it.item().abundance: 999
AACGCAGGAAAGAACATGTGAGCAACGGGAT test printing node abundance with it.item().abundance: 999
AACGGGATAACGCAGGAAAGAACATGTGAGC test printing node abundance with it.item().abundance: 999
AAGAACATGTGAGCAACGGGATAACGCAGGA test printing node abundance with it.item().abundance: 999
ACGCAGGAAAGAACATGTGAGCAACGGGATA test printing node abundance with it.item().abundance: 999
ACGGGATAACGCAGGAAAGAACATGTGAGCA test printing node abundance with it.item().abundance: 999
ATAACGCAGGAAAGAACATGTGAGCAACGGG test printing node abundance with it.item().abundance: 999
AGAACATGTGAGCAACGGGATAACGCAGGAA test printing node abundance with it.item().abundance: 999
AGGAAAGAACATGTGAGCAACGGGATAACGC test printing node abundance with it.item().abundance: 999
CAACGGGATAACGCAGGAAAGAACATGTGAG test printing node abundance with it.item().abundance: 999
CAGGAAAGAACATGTGAGCAACGGGATAACG test printing node abundance with it.item().abundance: 999
CCGTTGCTCACATGTTCTTTCCTGCGTTATC test printing node abundance with it.item().abundance: 999
CTGCGTTATCCCGTTGCTCACATGTTCTTTC test printing node abundance with it.item().abundance: 999
CGCAGGAAAGAACATGTGAGCAACGGGATAA test printing node abundance with it.item().abundance: 999
CGTTGCTCACATGTTCTTTCCTGCGTTATCC test printing node abundance with it.item().abundance: 999
CGGGATAACGCAGGAAAGAACATGTGAGCAA test printing node abundance with it.item().abundance: 1000
TAACGCAGGAAAGAACATGTGAGCAACGGGA test printing node abundance with it.item().abundance: 999
TTTCCTGCGTTATCCCGTTGCTCACATGTTC test printing node abundance with it.item().abundance: 999
TGCGTTATCCCGTTGCTCACATGTTCTTTCC test printing node abundance with it.item().abundance: 999
GCAGGAAAGAACATGTGAGCAACGGGATAAC test printing node abundance with it.item().abundance: 999
GTTGCTCACATGTTCTTTCCTGCGTTATCCC test printing node abundance with it.item().abundance: 999
ACATGTTCTTTCCTGCGTTATCCCGTTGCTC test printing node abundance with it.item().abundance: 999
ATGTTCTTTCCTGCGTTATCCCGTTGCTCAC test printing node abundance with it.item().abundance: 999
ATGTGAGCAACGGGATAACGCAGGAAAGAAC test printing node abundance with it.item().abundance: 999
AGCAACGGGATAACGCAGGAAAGAACATGTG test printing node abundance with it.item().abundance: 999
CATGTTCTTTCCTGCGTTATCCCGTTGCTCA test printing node abundance with it.item().abundance: 999
CATGTGAGCAACGGGATAACGCAGGAAAGAA test printing node abundance with it.item().abundance: 999
TCACATGTTCTTTCCTGCGTTATCCCGTTGC test printing node abundance with it.item().abundance: 999
TGTTCTTTCCTGCGTTATCCCGTTGCTCACA test printing node abundance with it.item().abundance: 999
ACATGTGAGCAACGGGATAACGCAGGAAAGA test printing node abundance with it.item().abundance: 999

TTGCTCACATGTTCTTTCCTGCGTTATCCCG test printing node abundance with graph.queryAbundance(node): 150. Expected abundance:1000

The first abundances were queried using Node_t::abundance and are correct. However, GraphTemplate::queryAbundance return 150 instead of 1000 for kmer TTGCTCACATGTTCTTTCCTGCGTTATCCCG. I tried to debug this, and when graph.queryAbundance(node) is called, it eventually reaches this line in MapMPHF.hpp (that I slightly modified): https://github.com/leoisl/gatb-core/blob/master/gatb-core/src/gatb/tools/collections/impl/MapMPHF.hpp#L227

Upon inspecting the values of data when gdb gets to this line, they are incorrectly set:

(gdb) p data
$4 = std::vector of length 31, capacity 31 = {149, 149, 149, 149, 149, 149, 149, 150, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149, 149}

However, I don't think I can progress much from here... Do you have some ideas on how to correctly fill data?

Thanks!

@rchikhi
Copy link
Member

rchikhi commented Dec 25, 2018

Hi Leandro,

Sorry for letting this question linger for so long. '149' is actually exactly the discretized abundance code for a real abundance of 999: 999=701+152+4010+2420+19, as per #19 (comment), and summing the terms in bold, 70+15+40+24=149. So the values in data are correctly set. If you're getting 150 instead of 1000 (where 150 is actually the correct encoding for 1000), something is wrong with the decoding of the discretized abundance. If you're still interested in figuring out this bug, I'd recommend checking for wrong memory accesses using valgrind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants