Update to function as out-of-the-box test server #13
Conversation
NGINX now also listens on port 8000 on the Docker network. This is an important step toward being able to start these `services` and have them function as a local test server for openml-python, among others.
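As a quick smoke test from the host (a sketch, assuming the stack is started with the `all` profile that appears further down and that the 8000 port mapping discussed below is in place):

$ docker compose --profile all up -d
$ curl -I http://localhost:8000/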
| # Update openml.expdb.dataset with the same url
| mysql -hdatabase -uroot -pok -e 'UPDATE openml_expdb.dataset DS, openml.file FL SET DS.url = FL.filepath WHERE DS.did = FL.id;'
These removed updates are now embedded in the state of the database on the new image.
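For reference, the baked-in state can be inspected with the same credentials the removed command used (a sketch; the exact rows depend on the image):

$ mysql -hdatabase -uroot -pok -e 'SELECT did, url FROM openml_expdb.dataset LIMIT 3;'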
| sed -i -E 's/^(::1\t)localhost (.*)$/\1\2/g' /etc/hosts.new
| cat /etc/hosts.new > /etc/hosts
| rm /etc/hosts.new
For the other containers, updating /etc/hosts through configuration was sufficient. For this one, the pre-existing /etc/hosts entry took precedence, so the file had to be rewritten in place.
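To illustrate what the sed accomplishes (a sketch; the exact ::1 line varies per container, and the \t escape assumes GNU sed):

$ printf '::1\tlocalhost ip6-localhost ip6-loopback\n' | sed -E 's/^(::1\t)localhost (.*)$/\1\2/g'
::1	ip6-localhost ip6-loopback

Dropping `localhost` from the IPv6 line stops it from resolving to ::1, so it can instead be pointed at nginx's IPv4 address.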
| - "8000:8000" | ||
| networks: | ||
| default: | ||
| ipv4_address: 172.28.0.2 |
The static IP address is required so that we can add entries to the /etc/hosts file of other containers, making them contact nginx when they resolve localhost.
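For the other containers this is what the configuration boils down to; --add-host appends the entry to the container's /etc/hosts (a sketch; alpine is just an example image):

$ docker run --rm --add-host localhost:172.28.0.2 alpine cat /etc/hosts

As the comment above notes, where a pre-existing localhost entry takes precedence, the file has to be rewritten instead.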
| @@ -1,4 +1,4 @@
| - CONFIG=api_key=AD000000000000000000000000000000;server=http://php-api:80/
| + CONFIG=api_key=abc;server=http://php-api:80/
I don't understand: here the api key is changed from AD000000000000000000000000000000 to abc ...

AD000000000000000000000000000000 was the api key in the old test database image, but it has been changed to abc to match the test server database.

The evaluation engine currently needs administrator access.
| apikey=normaluser
| server=http://localhost:8000/api/v1/xml
... and here the api key is set from AD000000000000000000000000000000 to normaluser.
So far, these were the keys for developers:
- php-api (v1) test-server: normaluser
- php-api (v1) local-server: AD000000000000000000000000000000

Has anything changed here? Also, what are the api keys for python-api (v2), now that it will also be added to `services` with a frozen docker image?

This configuration is just for when you spin up an openml-python container to use the Python API. It does not need administrator access, so I changed the key to normaluser, which is a normal read-write account.
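For a local openml-python checkout (rather than the container), the equivalent would be writing the same values to the client's default config file (a sketch; assumes the stock ~/.config/openml/config location):

$ mkdir -p ~/.config/openml
$ printf 'apikey=normaluser\nserver=http://localhost:8000/api/v1/xml\n' > ~/.config/openml/config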

The Python-based REST API uses the keys that are in the database. The server is unaffected, but I will need to update the keys used in its tests.
josvandervelde left a comment:
Looking good! I encountered some problems when using Python to connect to the locally running containers.
| minio:
|   profiles: ["all", "minio", "evaluation-engine"]
| -  image: openml/test-minio:v0.1.20241110
| +  image: openml/test-minio:v0.1.20260204
This minio contains most parquet files out of the box, but not all!
bash-5.1# ls /data/datasets/0000/0001
dataset_1.pq phpFsFYVN
bash-5.1# ls /data/datasets/0000/0128
iris.arff
This is probably a mistake?
Also, it contains some weird files:
bash-5.1# ls /data/datasets/0000
0000 '0000?C=S;O=A' '0000?C=D;O=A' '0000?C=M;O=A' '0000?C=N;O=D' ....

Apparently the weird files are Apache autoindex sort links: https://httpd.apache.org/docs/2.4/mod/mod_autoindex.html. Harmless, but I'll update the wget command to exclude them.
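Something along these lines should exclude them (a sketch; <server> is a placeholder, and --reject-regex requires GNU wget 1.14 or newer):

$ wget --recursive --no-parent --reject-regex '\?C=' https://<server>/datasets/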

The omission of 128 was accidental, but it turned out to be useful for the openml-python API tests that require an arff file (which isn't easily downloaded anymore if parquet files are present). I will hold off on adding that parquet file because:
- I would need to update openml-python (or at least its tests) accordingly
- Services should be able to handle a missing parquet file for now, as not all datasets have parquet files in production either

As for the reason it was skipped ... that's worth looking into. For now, I'll add a note to the readme.
| my_task = openml.tasks.get_task(my_task.task_id)
| from sklearn import compose, ensemble, impute, neighbors, preprocessing, pipeline, tree
| clf = tree.DecisionTreeClassifier()
| run = openml.runs.run_model_on_task(clf, my_task)
I get errors here:

Traceback (most recent call last):
  File "/openml/openml/datasets/dataset.py", line 593, in _parse_data_from_pq
    data = pd.read_parquet(data_file)
OSError: Repetition level histogram size mismatch on

It seems to have something to do with the pyarrow version in openml-python. Maybe unrelated to this PR, but I haven't seen these problems before. Do you see these problems as well?

Yes, I had sent a message on Slack about it. Basically, the openml-python image is so outdated that the newly generated parquet files cannot be loaded. If you take the shell as an entrypoint and first update pyarrow and pandas, it works fine.
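For example (a sketch; the image name and tag are assumed):

$ docker run -it --rm --entrypoint /bin/bash openml/openml-python:latest
$ pip install --upgrade pyarrow pandas   # run inside the container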
Updating routing and data of the images to allow an out-of-the-box test server on a local machine.
Currently the updated configuration allows running the openml-python unit tests that require the test server (see openml/openml-python#1630).
I still have to cross-check that I didn't break other functionality in the process.