Search engine test #120

Open: jburel wants to merge 7 commits into master

Conversation

jburel
Member

@jburel jburel commented Sep 6, 2022

Add notebook comparing search engine call and mapr call
cc @pwalczysko

@github-actions

github-actions bot commented Sep 6, 2022

Binder 👈 Launch a binder notebook on branch search_engine_test

@pwalczysko
Contributor

The notebook works as expected with a list of

"pax1", "pep", "pax", "pax2", "pax3", "pax4", "pax5", "pax6", "pax7", "ciz1", "spen"

When the list of genes is widened to

"pax1", "pep", "pax", "pax2", "pax3", "pax4", "pax5", "pax6", "pax7", "ciz1", "spen", "p", "pa", "pb", "pc", "pk", "pn", "pr", "pu", "px", "p11", "p30", "p42", "p53", "p76", "pa1", "pad", "pag", "pah", "pak", "pal", "pan", "pav", "pb1", "pbk", "pbl", "pc4", "pcd", "pck", "pcl", "pcm", "pcp", "pcs", "pcx", "pdc", "pdf", "pdh", "pdi", "pdk", "pdp", "pea", "peb", "pek", "pen", "per", "pes", "pez", "pf4", "pfk", "pgc", "pgf", "pgi", "pgk", "pgm", "pgp", "pgr", "php"

Then I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <timed exec>:1, in <module>

Input In [51], in load_using_search_api()
      7 url = KEY_VALUE_SEARCH.format(**qs1)  
      8 json = session.get(url).json()
----> 9 images = json['results']['results']
     10 for image in images:
     11     if image['id'] not in ids:

TypeError: list indices must be integers or slices, not str

The error is in the cell "Search using search engine"
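The traceback suggests that `json['results']` is sometimes a list rather than a dict, which would explain why indexing it with `'results'` raises the TypeError. A minimal sketch of a defensive lookup follows. The two payload shapes are assumptions inferred from the error message, not taken from the search engine's documentation, and `extract_images` is a hypothetical helper name.

```python
def extract_images(payload):
    """Return the list of image records from a search response, or []."""
    outer = payload.get("results") if isinstance(payload, dict) else None
    # Assumption: a search with hits nests them as results -> results,
    # while a search without hits has a plain list at the first level.
    if isinstance(outer, dict):
        inner = outer.get("results")
        return inner if isinstance(inner, list) else []
    return []


# Hypothetical payloads, for illustration only.
with_hits = {"results": {"results": [{"id": 1}, {"id": 2}]}}
no_hits = {"results": []}

print(len(extract_images(with_hits)))
print(extract_images(no_hits))
```

With a helper like this, the loop over `images` simply iterates zero times for a gene that has no results.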

@jburel
Member Author

jburel commented Sep 6, 2022

Thanks. This is probably because no results are found for some genes. I will adjust that.

@jburel
Member Author

jburel commented Sep 6, 2022

@pwalczysko fixed

@pwalczysko
Contributor

pwalczysko commented Sep 6, 2022

Thanks, that works.

But further, for some reason, the test fails when a non-existent gene is searched for. Should it? One can argue that it should pass.

The test fails with the output below. Note that I added print statements, which show that the list contained a non-existent gene called blah.

I think it would be good to either

  • make the test not fail (as both search approaches should deliver an empty list?)
  • or warn the user that one search (or both) came back completely empty.

print (added)
print (len(added))
print (removed)
print (len(removed))
print (modified)
print (len(modified))
assert len(added) == 0
assert len(removed) == 0
assert len(modified) == 0
assert len(same) == len(ITEMS)


{'blah'}
1
set()
0
{}
0

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [44], in <cell line: 7>()
      5 print (modified)
      6 print (len(modified))
----> 7 assert len(added) == 0
      8 assert len(removed) == 0
      9 assert len(modified) == 0

AssertionError: 
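One way to implement the second suggestion is to check for genes that came back empty from both searches before running the assertions. The sketch below assumes that both result containers are dicts keyed by gene name and mapping to lists of image IDs; the notebook's real data structure is not shown in this thread, and `report_empty` is a hypothetical helper.

```python
import warnings


def report_empty(results, results_mapr, genes):
    """Return the genes for which neither search produced any result."""
    empty_everywhere = []
    for gene in genes:
        # A missing key and an empty list are both treated as "no results".
        if not results.get(gene) and not results_mapr.get(gene):
            empty_everywhere.append(gene)
    if empty_everywhere:
        warnings.warn(
            f"No results from either search for: {sorted(empty_everywhere)}"
        )
    return empty_everywhere


# Hypothetical data mirroring the 'blah' case above.
results = {"pax6": [101, 102], "blah": []}
results_mapr = {"pax6": [101, 102]}
missing = report_empty(results, results_mapr, ["pax6", "blah"])
print(missing)
```

Genes reported this way could then be dropped from both dicts before the comparison, so that an empty search warns instead of failing the test.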

@jburel
Member Author

jburel commented Sep 6, 2022

I will sort that out

@jburel
Member Author

jburel commented Sep 7, 2022

@pwalczysko fixed

@pwalczysko
Contributor

pwalczysko commented Sep 7, 2022

Thanks @jburel, the fix works fine when a small number of genes is passed in the list.

Now with a list such as
"pax1", "pep", "pax", "pax2", "pax3", "pax4", "pax5", "pax6", "pax7", "ciz1", "spen", "p", "pa", "pb", "pc", "pk", "pn", "pr", "pu", "px", "p11", "p30", "p47", "p53", "p76", "pa1", "pad", "pag", "pah", "pak", "pal", "pan", "pav", "pb1", "pbk", "pbl", "pc4", "pcd", "pck", "pcl", "pcm", "pcp", "pcs", "pcx", "pdc", "pdf", "pdh", "pdi", "pdk", "pdp", "pea", "peb", "pek", "pen", "per", "pes", "pez", "pf4", "pfk", "pgc", "pgf", "pgi", "pgk", "pgm", "pgp", "pgr", "php3", "neco"

I persistently get

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

@pwalczysko
Contributor

The problem with the data rate exceeded is due to the line

print(results_mapr)

Probably too much to print out. When I comment out this line, all works fine.
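Instead of printing the whole dict, a short summary keeps the output well under the IOPub limit while still showing whether the search returned anything. This is a minimal sketch; `summarize` is a hypothetical helper, and the dict-of-lists shape of `results_mapr` is an assumption.

```python
def summarize(results, max_keys=5):
    """Build a short text summary of a dict that maps keys to lists."""
    total = sum(len(v) for v in results.values())
    lines = [f"{len(results)} keys, {total} values in total"]
    # Show only the first few keys, in sorted order, with their sizes.
    for key in sorted(results)[:max_keys]:
        lines.append(f"  {key}: {len(results[key])} values")
    if len(results) > max_keys:
        lines.append(f"  ... and {len(results) - max_keys} more keys")
    return "\n".join(lines)


# Hypothetical stand-in for the real mapr results.
results_mapr = {f"gene{i}": list(range(i)) for i in range(8)}
print(summarize(results_mapr))
```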

@jburel
Member Author

jburel commented Sep 8, 2022

@pwalczysko I have added the ability to load all the possible values for a given key. The values are sorted.
Searching for all the values afterwards is not recommended, so I have added the ability to search by interval, e.g. 0-10, 20-30.
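The interval idea can be sketched as slicing the sorted list of values by position. The helper names below are hypothetical, and treating an interval as half-open (start included, stop excluded) is an assumption; the notebook may define its bounds differently.

```python
def select_interval(values, start, stop):
    """Return the sorted values at positions start (inclusive) to stop (exclusive)."""
    return sorted(values)[start:stop]


def batches(values, size):
    """Yield (start_index, chunk) pairs covering all sorted values."""
    ordered = sorted(values)
    for start in range(0, len(ordered), size):
        yield start, ordered[start:start + size]


# Hypothetical gene names taken from the lists earlier in this thread.
genes = ["pax6", "ciz1", "spen", "pax1", "pep"]
print(select_interval(genes, 0, 2))
for start, chunk in batches(genes, 2):
    print(start, chunk)
```

Sorting before slicing keeps each interval reproducible between runs, which matters when the full key space is tested batch by batch.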

@pwalczysko
Contributor

pwalczysko commented Sep 8, 2022

Thanks, works fine.

For a search between 0 and 500 I get:

  1. assert len(added) == 0 fails
  2. When I print(added) I get 28 items
{'acp7', 'ac012476.1', 'ac073896.1', 'ac171558.1', 'abraxas1', 'ac008695.1', 'acod1', 'ac008687.1', 'ac004754.3', 'ac240274.1', 'ac004556.1', 'ac011462.1', 'ac022414.1', 'ac171558.2', 'ac145212.1', 'ac138969.4', 'ac104534.3', 'ac136352.1', 'abhd18', 'ac023055.1', 'ac092718.8', 'ac009163.4', 'ac010531.1', 'ac092718.3', 'ac126283.2', 'abraxas2', 'ac091959.3', 'ac006538.4'}
28

Does that mean that search_engine is returning 28 more search results than mapr?

Edit:
For a search between 501 and 1000, the added test also fails; print(added) gives

{'agap5', 'akain1', 'agap6', 'adgre1', 'afg1l', 'af165138.7', 'agap9'}
7

Note that for these long searches, mapr takes some 42 minutes against 55 seconds for search_engine.

@jburel
Member Author

jburel commented Sep 8, 2022

I will have to investigate

@pwalczysko
Contributor

pwalczysko commented Sep 8, 2022

Yes, I think that

def dict_compare(d1, d2):
...
added = d1_keys - d2_keys
...
dict_compare(results, results_mapr)
...
added, removed, modified, same = dict_compare(results, results_mapr)

means that there are more search_engine keys than mapr keys. I wonder how that could be possible?

Edit:
I have also confirmed that the result is repeatable: the list of added keys does not vary between runs of the notebook with the same params.
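For readers following the `dict_compare` excerpt above, here is a self-contained sketch of a comparison with the same return signature. Only the `added = d1_keys - d2_keys` line comes from the thread; the rest of the body is an assumption about how such a helper is commonly written, not the notebook's actual code.

```python
def dict_compare(d1, d2):
    """Compare two dicts and return (added, removed, modified, same)."""
    d1_keys = set(d1)
    d2_keys = set(d2)
    shared = d1_keys & d2_keys
    added = d1_keys - d2_keys      # keys only in d1
    removed = d2_keys - d1_keys    # keys only in d2
    # Shared keys whose values differ, mapped to both values.
    modified = {k: (d1[k], d2[k]) for k in shared if d1[k] != d2[k]}
    same = {k for k in shared if d1[k] == d2[k]}
    return added, removed, modified, same


# Hypothetical results: search engine first, mapr second.
results = {"agap5": [1, 2], "pax6": [3], "pep": [4]}
results_mapr = {"pax6": [3], "pep": [5], "spen": [6]}
added, removed, modified, same = dict_compare(results, results_mapr)
print(added, removed, modified, same)
```

Under this reading, `added` holds keys that are present in the first argument (the search engine results) and absent from the second (the mapr results).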

@jburel
Member Author

jburel commented Sep 8, 2022

It seems to be more of a problem with the logic.
A direct comparison of mapr vs search_engine for, say, agap5 gives me the same result via the UI.

@pwalczysko
Contributor

Tested genes between 0 and 1500. No mismatches; all looks good with the new commit (the mapr step took something like 5 + 20 + 17 minutes).

@pwalczysko
Contributor

pwalczysko commented Sep 15, 2022

Tested a further 1500-3000, in three batches of 500. The test passes in full, but mapr can take as long as 40 minutes for a search of 500 genes. I suppose that this is because there are more results for those genes.

This means we now have 0 - 3000 tested.

@jburel
Member Author

jburel commented Sep 15, 2022

only 47000 to go :-)

@pwalczysko
Contributor

13000 (13 thousand) done as of today ;)

@pwalczysko
Contributor

pwalczysko commented Sep 21, 2022

Between 18501 and 19000 I got an error when executing the mapr cell (the search engine one returned fine):

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:971, in Response.json(self, **kwargs)
    970 try:
--> 971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    333 """Return the Python representation of ``s`` (a ``str`` instance
    334 containing a JSON document).
    335 
    336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338 end = _w(s, end).end()

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
File <timed exec>:1, in <module>

Input In [11], in load_using_mapr(values)
     43 qs1 = {'key': KEY_MAPR, 'value': item}
     44 url1 = MAPR_URL.format(**qs1)
---> 45 json = session.get(url1).json()
     46 for m in json['maps']:
     47     qs2 = {'key': KEY_MAPR, 'value': item}

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:975, in Response.json(self, **kwargs)
    971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975     raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Edit: This was an intermittent error; it did not repeat on the second run.
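Since the failure was intermittent, a small retry around the JSON decoding step might make long batch runs more robust. The sketch below is not the notebook's code: `load_json_with_retry` is a hypothetical helper, and it takes a callable returning the response text so that it runs without a network. In the notebook that callable could be something like `lambda: session.get(url1).text`.

```python
import json
import time


def load_json_with_retry(fetch_text, attempts=3, delay=1.0):
    """Call fetch_text() and decode JSON, retrying on decode errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return json.loads(fetch_text())
        except json.JSONDecodeError as err:
            last_error = err
            # Wait before the next try, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error


# Simulated server: an empty body first, valid JSON on the second call.
responses = iter(["", '{"maps": []}'])
print(load_json_with_retry(lambda: next(responses), delay=0))
```

The empty first body reproduces the "Expecting value: line 1 column 1 (char 0)" error from the traceback, and the second call succeeds.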

@pwalczysko
Contributor

@jburel now I am consistently getting the following error on the cell

values = load_values_for_given_key()

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:971, in Response.json(self, **kwargs)
    970 try:
--> 971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/json/decoder.py:340, in JSONDecoder.decode(self, s, _w)
    339 if end != len(s):
--> 340     raise JSONDecodeError("Extra data", s, end)
    341 return obj

JSONDecodeError: Extra data: line 1 column 5 (char 4)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 values = load_values_for_given_key()

Input In [6], in load_values_for_given_key()
      4 qs1 = {'type': 'image', 'key': KEY}
      5 url = KEYS_SEARCH.format(**qs1)  
----> 6 json = session.get(url).json()
      7 for d in json['data']:
      8     if d['Value']:

File ~/opt/anaconda3/envs/idr_env/lib/python3.9/site-packages/requests/models.py:975, in Response.json(self, **kwargs)
    971     return complexjson.loads(self.text, **kwargs)
    972 except JSONDecodeError as e:
    973     # Catch JSON-related errors and raise as requests.JSONDecodeError
    974     # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975     raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)

JSONDecodeError: Extra data: line 1 column 5 (char 4)
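An "Extra data" error at char 4 means the decoder read a complete JSON value in the first few characters and then found more text, which usually points to a body that is not JSON at all. Python's `json` module reports exactly this message for a plain-text body such as `502 Bad Gateway`, but what the server actually returned here is unknown. The sketch below shows one way to surface the status code and the start of the body when decoding fails; `describe_body` is a hypothetical helper, and in the notebook its arguments would come from the response object's `status_code` and `text`.

```python
import json


def describe_body(status_code, text, preview=80):
    """Decode a response body; on failure return (None, diagnosis)."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        # Keep only the start of the body so the message stays short.
        snippet = text[:preview].replace("\n", " ")
        problem = (
            f"HTTP {status_code}: {err.msg} at char {err.pos}; "
            f"body starts with {snippet!r}"
        )
        return None, problem


# Hypothetical non-JSON body, for illustration only.
data, problem = describe_body(502, "502 Bad Gateway")
print(problem)
```

Printing the diagnosis before raising would show whether the endpoint is returning an error page instead of the expected `data` list.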
