First, we have results from the recent offline validation. Second, we processed rosbags with our online implementation ros_adetector. Although we used the same model and the same bags, the two methods produced different results (which they shouldn't).
I discussed some points with @hanikevi that could be verified:
check the runtime of ros_adetector and rosbag play (play the bag slowly),
check whether overlapping windows were used in training (YES as of now; helpful because slow rosbags then see the same data input),
check the loss_fn in ros_adetector (probably not the cause),
verify that the correct model is loaded,
verify the network input (dimensions might be adjusted automatically) and publish it,
check the queue size (in main it's fine: queue_size = 1) --> especially update ros_datahandler
play the rosbag at rate 0.05 and plot a longer window
compare net input to rosbag data
8 of 8 checklist items completed
Kevin Haninger marked the checklist item "check whether overlapping windows were used in training (YES as of now; helpful because slow rosbags then see the same data input)," as completed
Checking for queue_size in ros_datahandler: in commit de607ab8 (latest on main here), ros_datahandler/ros_datahandler/datahandler.py::L70 defines the subscriber factory with queue_size=1.
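As a side note, the effect of queue_size=1 can be sketched without ROS: a bounded queue of length one keeps only the newest message, so any samples arriving while the callback is busy are silently dropped. A toy stand-in (plain Python, no rospy):

```python
from collections import deque

# Toy model of a subscriber queue with queue_size=1: only the most recent
# message survives if the callback cannot keep up with the publisher.
queue = deque(maxlen=1)

# The publisher delivers five messages before the callback runs once.
for msg in ["f_0", "f_1", "f_2", "f_3", "f_4"]:
    queue.append(msg)

# The callback then sees only the latest sample; f_0..f_3 were dropped.
latest = queue.popleft()
print(latest)  # -> f_4
```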
Kevin Haninger marked the checklist item "check the queue size (in main it's fine: queue_size = 1) --> especially update ros_datahandler" as completed
In ros_adetector we calculate the loss with jnp.mean(jnp.abs(x - y)), while in the other plot I called ValModel.visualize_loss, which calculates the loss with the leaf_loss fn:
So even if the form is a bit different, the loss values should not differ by that much. In the ros_adetector version we get loss values around 3-4 times higher than those in validate_model. The plots of the forces and joints look identical, so the reconstruction is probably what differs.
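To illustrate how two superficially similar MAE formulations can differ by a constant factor (the exact leaf_loss definition isn't shown here, so this is only a hedged sketch, with NumPy standing in for JAX):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 6))             # e.g. a window of 6-dim force/joint data
y = x + 0.1 * rng.normal(size=x.shape)   # imperfect reconstruction

# Variant 1: mean absolute error over every element,
# similar in spirit to jnp.mean(jnp.abs(x - y)) in ros_adetector.
mae_elementwise = np.mean(np.abs(x - y))

# Variant 2: sum the error over the feature axis first, then average,
# as a per-leaf loss might do.
mae_summed = np.mean(np.sum(np.abs(x - y), axis=-1))

# The two values differ by the feature dimension (6 here, up to float rounding),
# even though both are "the MAE" of the same reconstruction.
print(mae_summed / mae_elementwise)
```

A mismatch like this would rescale the loss uniformly, though, so by itself it would not explain differences in the reconstruction shape.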
I played the rosbag at rate 0.3 and the loss changed. The values are not as high anymore, but still greater than the ones I got in the offline validation. I added the plots from PlotJuggler to the Word doc.
Ok good, the loss fn we're comparing is about the same. If play speed changed things, then the network eval is probably slower than the sample rate of the topic used for the callback. Does 0.3 saturate the change, or does it change more when you go to 0.05 or so? I'd also suspect the data formatting. You could probably print the first network input for offline/online and make sure they're the same.
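A minimal sketch of that last suggestion, assuming each pipeline can dump its first network input as an array (the helper and the file names in the usage note are hypothetical):

```python
import numpy as np

def compare_inputs(offline: np.ndarray, online: np.ndarray) -> str:
    """Compare the first network input of the offline and online pipelines."""
    # Transposed or re-ordered dimensions often pass a naive eyeball check,
    # so compare shapes and then values element-wise with a small tolerance.
    if offline.shape != online.shape:
        return "shape mismatch - check window/feature ordering"
    if not np.allclose(offline, online, atol=1e-6):
        return f"values differ, max abs diff: {np.abs(offline - online).max():.3g}"
    return "match"
```

Hypothetical usage: each pipeline saves its first input once, e.g. with `np.save("offline_input.npy", x)`, and we then run `compare_inputs(np.load("offline_input.npy"), np.load("online_input.npy"))`.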
force is the slower topic (220 Hz vs 500 Hz for joint_states), so we should update ros_adetector. That should help, but I'm not sure it's the only bug.
This consistency should be enforced structurally. I see two options:
1. Add a flag to the data_streams init argument of ros_datahandler which lets us mark which stream is the sync topic. We would drop sync_stream from the init args, but this requires further specialization of the data_streams.
2. Add a config in encoder_dynamics which specifies which dataclass to use and also the sync topic.
Both are sensible, and when we use this on other systems we will want to pass, e.g., FrankaData as an argument rather than hard-coding it, so the 2nd option will be needed in any case. Thoughts?
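A rough sketch of what option 2 could look like; all names here are hypothetical, not the actual encoder_dynamics API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a config that names both the data class to
# instantiate and which stream drives synchronization.
@dataclass
class StreamConfig:
    topic: str
    rate_hz: float

@dataclass
class DetectorConfig:
    data_class: str = "FrankaData"   # which dataclass to instantiate
    sync_topic: str = "/force"       # the slowest topic drives the sync
    streams: dict[str, StreamConfig] = field(default_factory=dict)

cfg = DetectorConfig(streams={
    "/force": StreamConfig("/force", 220.0),
    "/joint_states": StreamConfig("/joint_states", 500.0),
})
print(cfg.sync_topic)  # -> /force
```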
Sounds good. The dill file would then serialize all of the configuration and allow us to load from a file, right? I would suggest making a separate issue and setting up a data type for what we serialize, so that it's documented and can be extended more easily in the future, but you and Niki can discuss.
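A sketch of what such a documented serialization type might look like; the thread mentions dill, but stdlib pickle is used here as a stand-in, and all field names are illustrative:

```python
import pickle
from dataclasses import dataclass

# Hypothetical record of everything the online detector needs to restore
# exactly the configuration that was used in training.
@dataclass
class SerializedRun:
    model_path: str
    sync_topic: str
    window_size: int

run = SerializedRun("model.eqx", "/force", 32)

# Round-trip through a file, as loading from a dill/pickle file would do.
with open("run_config.pkl", "wb") as f:
    pickle.dump(run, f)
with open("run_config.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == run)  # -> True
```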
So I changed sync_stream and the args, but there is no difference. I checked the net input and it is the same as the data. But I noticed that we have an offset between the published data and the reconstructed data. Maybe our autoencoder is reconstructing the data with an offset?
That means that with sync_topic updated to force, we still have differences in form and scaling between offline and online evaluation?
Regarding the delay, I had also seen this effect online: when pushing on the robot, the reconstruction rises to match it shortly after. I think the network learns to use the last few forces to predict the next forces, which would act like a slight delay on the signal. I would keep this in mind, but I think it's a separate issue.
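A toy illustration of this delay effect (not our actual model): a "predictor" that reconstructs the force from the last k samples only catches up k steps after a step input, which online looks like a lag:

```python
import numpy as np

k = 5
force = np.concatenate([np.zeros(20), np.ones(20)])  # push on the robot at t = 20

# "Reconstruction" at time t uses only the last k force samples before t,
# like a network predicting the next force from a short history window.
recon = np.array([force[max(0, t - k):t].mean() if t > 0 else 0.0
                  for t in range(len(force))])

# The reconstruction only reaches the new force level k samples after the step:
print(int(np.argmax(recon >= 0.99)))  # -> 25, i.e. 5 samples after t = 20
```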
Yes, we still have slight differences, more in scaling than in form, I'd say.
Concerning your 2nd point: could this behaviour come from a too-simple net architecture? If I understand correctly, my first step would be to modify our model and increase the latent dim to get more parameters and thus more capacity.
That there are differences in scaling is still weird and might hint at some difference between the offline and online data processing. Are you investigating further, or just moving forward with scaling the threshold for online eval?
For the 2nd aspect, I'm not sure this is a problem. It can be a property of the network architecture and training loss. With a low-dimensional signal (i.e. not pictures) and a big latent space, the network could learn to just pass the signal through the latent space and therefore get 'perfect' reconstruction. The original understanding of VAEs is that they have a bottleneck which forces compression of the high-dim input. But I would suggest making any architecture decisions based on the validation results.
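The pass-through point can be demonstrated with a toy linear autoencoder: once the latent dimension is at least the input dimension, a full-rank encoder has an exact left inverse, so 'perfect' reconstruction requires no compression at all (NumPy sketch, not our actual model):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_latent = 6, 8            # latent space larger than the input

# A linear "autoencoder": any full-column-rank encoder has a left inverse,
# so the decoder can undo it exactly - no bottleneck, no compression.
W_enc = rng.normal(size=(d_latent, d_in))
W_dec = np.linalg.pinv(W_enc)    # decoder = pseudo-inverse of the encoder

x = rng.normal(size=(100, d_in))
x_hat = x @ W_enc.T @ W_dec.T    # encode, then decode

print(np.abs(x - x_hat).max() < 1e-9)  # -> True: 'perfect' reconstruction
```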
Lisa-Marie Fenner marked the checklist item "verify the network input (dimensions might be adjusted automatically) and publish it," as completed
Lisa-Marie Fenner marked the checklist item "check the runtime of ros_adetector and rosbag play (play the bag slowly)," as completed
Lisa and I discussed this in the lab a bit today. One idea is to re-publish the network inputs in the callback function, in the same way the network outputs are published. This way we can see whether there are any formatting errors or delays in the ROS system.
So, as Kevin suggested, I published the network inputs again (F_comp), and at first sight there was no major difference. But if we zoom in, we get this picture:
I reduced the play rate to 0.5 and we get
Now with play rate 0.1
We already saw that with a slow play rate the reconstruction is almost identical to the offline one and the loss is lower. So it seems we cannot process all data points in data.get_data() if we play the rosbag at the normal rate? It seems the model responds to this by generating an offset.
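A back-of-the-envelope timing model of this hypothesis: with queue_size = 1, the fraction of samples the callback can process is bounded by the message period divided by the evaluation time (the 7 ms eval time below is purely illustrative):

```python
def processed_fraction(msg_period_ms: float, eval_time_ms: float) -> float:
    """Fraction of incoming messages a busy callback actually handles."""
    # If eval is faster than the message period, nothing is dropped;
    # otherwise only one message per eval survives the size-1 queue.
    return min(1.0, msg_period_ms / eval_time_ms)

# joint_states at 500 Hz (2 ms period) with a hypothetical 7 ms network eval:
print(processed_fraction(2.0, 7.0))   # only ~29% of samples processed

# playing the bag at rate 0.1 stretches the period to 20 ms:
print(processed_fraction(20.0, 7.0))  # -> 1.0, every sample processed
```

This would match the observation that rate 0.1 nearly reproduces the offline result.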
Hey everyone,
since all checklist items have been completed, this issue will be closed. The latest results indicate successful reconstructions, with no significant loss observed in anomalous cases. Further work on this will take place in #59 (closed).