From 723e45698e4e58e26f7a27f3c461ce7c346aa429 Mon Sep 17 00:00:00 2001
From: joncrall
Date: Sun, 28 Jul 2024 00:09:35 -0400
Subject: [PATCH] work on bittorrent and writeup

---
 .../getting_the_dataset_via_bittorrent.rst    | 137 +++++++++++---
 papers/application-2024/main.tex              | 179 ++++++++++++------
 2 files changed, 237 insertions(+), 79 deletions(-)

diff --git a/docs/source/manual/getting_the_dataset_via_bittorrent.rst b/docs/source/manual/getting_the_dataset_via_bittorrent.rst
index 442fbb2..1b1bae9 100644
--- a/docs/source/manual/getting_the_dataset_via_bittorrent.rst
+++ b/docs/source/manual/getting_the_dataset_via_bittorrent.rst
@@ -37,14 +37,14 @@ https://forum.transmissionbt.com/viewtopic.php?t=9778
     cat $TORRENT_FPATH

     # Start seeding the torrent
-    transmission-cli "$TORRENT_FPATH" -w $(dirname $DATA_DPATH)
+    # transmission-cli "$TORRENT_FPATH" -w $(dirname $DATA_DPATH)

     # Do we need additional flags to tell transmission we have the data already?
     # --download-dir tmpdata

     # On remote machine
     rsync toothbrush:tmp/create-torrent-demo/shared-demo-data.torrent .
-    transmission-cli shared-demo-data.torrent --download-dir tmpdata
+    transmission-cli shared-demo-data.torrent --download-dir "$DATA_DPATH"


@@ -54,6 +54,9 @@ Machine Setup

 Ensure that the torrent port is open and correctly forwarded.

+This may require configuring your router to forward port 51413 to the seeding
+machine.
+
 .. code::

     TRANSMISSION_TORRENT_PORT=51413
@@ -77,9 +80,6 @@ Install the transmission bittorrent client.

     # Install Transmission
     sudo apt-get install transmission-daemon transmission-cli

-    # Install the GUI as well, although this is not needed.
-    sudo apt install transmission-gtk
-

 Configure transmission to allow for local peer discovery
@@ -100,20 +100,25 @@ Configure transmission to allow for local peer discovery

     # Verify the status of the daemon
     systemctl status transmission-daemon.service

+Optionally install a GUI:
+
+.. code::
+
+    # Install the GUI as well, although this is not needed.
+    sudo apt install transmission-gtk
+

 Small Demo to Verify Torrents Download Correctly
 ------------------------------------------------

-(Not Working Yet, FIXME)
-
 On the seeding machine

 .. code:: bash

     # Create a dummy set of data that will be shared
     WORKING_DPATH=$HOME/tmp/create-torrent-demo
-    DATA_DPATH=$WORKING_DPATH/shared-demo-data-v1
-    TORRENT_FPATH=$WORKING_DPATH/shared-demo-data-v1.torrent
+    DATA_DPATH=$WORKING_DPATH/shared-demo-data-v001
+    TORRENT_FPATH=$WORKING_DPATH/shared-demo-data-v001.torrent
     mkdir -p "$WORKING_DPATH"
     cd $WORKING_DPATH
@@ -123,10 +128,70 @@ On the seeding machine
     echo "some data" > $DATA_DPATH/data2.txt
     echo "some other data" > $DATA_DPATH/data3.txt

-    transmission-create --comment "a demo torrent v1" --outfile "$TORRENT_FPATH" "$DATA_DPATH"
+    # A list of open tracker URLS is:
+    # https://gist.github.com/mcandre/eab4166938ed4205bef4
+    TRACKER_URL=udp://tracker.openbittorrent.com:80
+    COMMENT="a demo torrent v1"
+
+    transmission-create --comment "$COMMENT" --tracker "$TRACKER_URL" --outfile "$TORRENT_FPATH" "$DATA_DPATH"
     cat "$TORRENT_FPATH"
+    tree -f $DATA_DPATH
+
+    # Start seeding the transmission daemon
+    transmission-remote --auth transmission:transmission --add "$TORRENT_FPATH" --download-dir "$(dirname $DATA_DPATH)"
+
+    # Show Registered Torrents to verify success
+    transmission-remote --auth transmission:transmission --list
+
+    # DEBUGGING
+    # https://forum.transmissionbt.com/viewtopic.php?t=11830
+
+    # Start the torrent
+    transmission-remote --auth transmission:transmission --torrent 1 --start
+    transmission-remote --auth transmission:transmission -t1 -i
+    transmission-remote --auth transmission:transmission --list
+
+    # Alternative: start seeding the torrent
+    # Ensure that the download directory contains the data to be seeded
+    # transmission-cli --verify --download-dir "$(dirname $DATA_DPATH)" $TORRENT_FPATH
+
+
+On the downloading machine
+
+.. code:: bash
+
+    SEEDING_MACHINE_NAME=some_remote_name
+    # SEEDING_MACHINE_NAME=toothbrush
+    SEEDING_MACHINE_NAME=jojo
+
+    rsync $SEEDING_MACHINE_NAME:tmp/create-torrent-demo/shared-demo-data-v001.torrent .
+
+    TEST_DOWNLOAD_DPATH="$HOME/tmp/transmission-dl"
+    mkdir -p "$TEST_DOWNLOAD_DPATH"
+    transmission-remote --auth transmission:transmission --add "shared-demo-data-v001.torrent" -w "$TEST_DOWNLOAD_DPATH"
+    transmission-remote --auth transmission:transmission --list
+
+    tree $TEST_DOWNLOAD_DPATH
+
+    # Show Registered Torrents to verify success
+    transmission-remote --auth transmission:transmission --list
+
+    # transmission-cli shared-demo-data-v1.torrent
+    transmission-remote --auth transmission:transmission -t3 -i
+
+
+Misc Notes
+----------
+
+Other notes that are not well organized yet
+
+.. code:: bash
+
+    ###################################
+    # Work In Progress After This Point
+    ###################################
+
     # Start seeding the torrent
     # Ensure that the download directory contains the data to be seeded
     transmission-cli --verify --download-dir "$(dirname $DATA_DPATH)" $TORRENT_FPATH
@@ -155,27 +220,36 @@ On the seeding machine

     transmission-remote --auth transmission:transmission -tall -f
     transmission-remote --auth transmission:transmission -tall --get all
-
-
-    transmission-remote --auth transmission:transmission --add "$TORRENT_FPATH" --download-dir "$(dirname $DATA_DPATH)"

 On the downloading machine, do something to transfer the torrent file itself.

 .. code:: bash

-    SEEDING_MACHINE_NAME=remote
-    SEEDING_MACHINE_NAME=toothbrush
+    SEEDING_MACHINE_NAME=some_remote_name
+    # SEEDING_MACHINE_NAME=toothbrush
+    SEEDING_MACHINE_NAME=jojo

-    SEEDING_MACHINE_NAME=remote
     rsync $SEEDING_MACHINE_NAME:tmp/create-torrent-demo/shared-demo-data-v1.torrent .
-    transmission-cli shared-demo-data-v1.torrent
+    TEST_DOWNLOAD_DPATH="$HOME/tmp/transmission-dl"
+    transmission-remote --auth transmission:transmission --add "shared-demo-data-v1.torrent" -w "$TEST_DOWNLOAD_DPATH"

-    rsync toothbrush:shitspotter.torrent .
-    transmission-remote --auth transmission:transmission --add "shitspotter.torrent"
-    transmission-remote --auth transmission:transmission --add "shitspotter.torrent" -w "$HOME/data/dvc-repos"
+    # transmission-cli shared-demo-data-v1.torrent
+
+
+    # Shitspotter test
+
+    #rsync toothbrush:shitspotter.torrent .
+    #transmission-remote --auth transmission:transmission --add "shitspotter.torrent"
+    #transmission-remote --auth transmission:transmission --add "shitspotter.torrent" -w "$HOME/data/dvc-repos"

     # transmission-remote --auth transmission:transmission --add "shared-demo-data-v1.torrent"

+    transmission-remote --auth transmission:transmission --add "$TORRENT_FPATH" --download-dir "$(dirname $DATA_DPATH)"
+
+    # Show Registered Torrents to verify success
+    transmission-remote --auth transmission:transmission --list
+

@@ -188,17 +262,26 @@ Instructions To Create The Torrent

 Install Instructions

 Are Modified ChatGPT outputs (which was very helpful here).

-.. code::
+.. code:: bash

     # Install Transmission CLI
     sudo apt-get install transmission-daemon transmission-cli

     # Create a new torrent
     DVC_DATA_DPATH=$HOME/data/dvc-repos/shitspotter_dvc
-    transmission-create -o shitspotter.torrent $HOME/data/dvc-repos/shitspotter_dvc
+    cd $DVC_DATA_DPATH
+
+    TRACKER_URL=udp://tracker.openbittorrent.com:80
+    transmission-create \
+        --outfile shitspotter_dvc.torrent \
+        --tracker "$TRACKER_URL" \
+        --comment "first shitspotter torrent" \
+        $HOME/data/dvc-repos/shitspotter_dvc

     # Start seeding the torrent
-    transmission-cli shitspotter.torrent --download-dir tmpdata
+    transmission-remote --auth transmission:transmission --add shitspotter_dvc.torrent --download-dir $HOME/data/dvc-repos
+
+    transmission-remote --auth transmission:transmission --list

     # Enable local peer discovery in the settings
     cat ~/.config/transmission/settings.json | grep lpd
@@ -212,10 +295,14 @@
 Testing On Local Network
 ------------------------

-.. code::
+.. code:: bash
+
+    rsync jojo:data/dvc-repos/shitspotter_dvc/shitspotter_dvc.torrent .

-    rsync jojo:shitspotter.torrent .
-    transmission-cli shitspotter.torrent --download-dir tmpdata
+    mkdir -p ./tmpdata
+    transmission-remote --auth transmission:transmission --add shitspotter_dvc.torrent --download-dir $PWD/tmpdata
+    transmission-remote --auth transmission:transmission --list
+    #transmission-cli shitspotter.torrent --download-dir tmpdata


Instructions To Download/Seed The Torrent

diff --git a/papers/application-2024/main.tex b/papers/application-2024/main.tex
index 3662796..64767b8 100644
--- a/papers/application-2024/main.tex
+++ b/papers/application-2024/main.tex
@@ -83,71 +83,134 @@ \section{Introduction}

 Poop detection is a simple problem, suitable for exploring the capabilities of
 object detection models while also containing non-trivial challenges.

-There are several challenges in detecting dog poop in phone-camera images.
-* Resolution
-* Distractors
-* Occlusion
-* Variation in appearance (old/new/healthy/sick)
+There are several challenges in detecting dog poop in phone-camera images:
+resolution,
+distractors,
+occlusion,
+and variation in appearance (old/new/healthy/sick).

+Discuss building the dataset

-Hosting a dataset challenges intended for scientific use
-* Requires an institution willing to host or payment for a hosting service
-* Prone to outages (give VGG outage as example)
-* Requires updates (makes bittorrent difficult)
+Discuss building the segmentation models

+For a dataset to be of scientific use it must be accessible.

-Compare and Contrast:
-* Centralized
-* BitTorrent
-* IPFS
+Centralized methods are the typical choice and offer very good speeds,
+but they require an institution willing to host or payment for a hosting service,
+can be prone to outages (give VGG outage as example),
+and do not have version control built in.
+
+Decentralized methods allow volunteers to host data and offer the ability to
+validate data integrity. This motivates us to compare and contrast Cloud
+Services, BitTorrent, and IPFS as mechanisms for distributing datasets.
+
+Our contributions are:
+1. A challenging new \textbf{open dataset} of images with polygon segmentations.
+2. An experimental \textbf{evaluation of baseline training} methods.
+3. An experimental \textbf{comparison of dataset distribution} methods.
+4. \textbf{Open code and models}.
+Related work is discussed at the end.

 % https://gist.github.com/liamzebedee/4be7d3a551c6cddb24a279c4621db74c
 % https://gist.github.com/liamzebedee/224494052fb6037d07a4293ceca9d6e7

+\section{Dataset}
+
+Our first contribution is the collection of a new open dataset.
+

 % https://arxiv.org/abs/1803.09010
 Data is released with a datasheet describing its characteristics \cite{gebru_datasheets_2021}.
-% BitTorrent can be vulnerable to MITM:
-% https://www.reddit.com/r/technology/comments/1dpinuw/south_korean_telecom_company_attacks_torrent/
+Challenges

 \paragraph{Construction}

-%-------------------------------------------------------------------------
-\subsection{Related Work}
+Labelme \cite{wada_labelmeailabelme_nodate} for annotations with segment anything \cite{kirillov_segment_2023}.

-Object detection
+Anecdotal note: SAM worked well to automatically segment the poop, many of
+these needed adjustments, especially in regions of shadows, but there were
+cases that required a completely manual approach. Unfortunately a clean record
+of what cases these were does not exist.

-TACO dataset: \cite{proenca_taco_2020}
+\paragraph{Analysis}

-MSHIT dataset
+Number of images, annotations, and other stats.

-Dog Poop Detection - Neeraj Madan
-Other poop work
+\section{Models}

-\subsection{Dataset Construction}
+Our second contribution is an evaluation of several trained models to serve as
+a baseline.

-Labelme \cite{wada_labelmeailabelme_nodate} for annotations with segment anything \cite{kirillov_segment_2023}.
+We use the training and evaluation system of \cite{crall_igarss_2024}, which
+can be trained to predict heatmaps from polygons and can evaluate those
+heatmaps on a pixelwise level.

-Anecdotal note: SAM worked well to automatically segment the poop, many of
-these needed adjustments, especially in regions of shadows, but there were
-cases that required a completely manual approach. Unfortunately a clean record
-of what cases these were does not exist.
+The baseline architecture is a variant of a vision-transformer \cite{vit,split-attention,greenwell_wacv_2023}.
+
+Number of parameters.
+Memory at train time.
+Memory at predict time.
+
+We train several models and vary the learning rate, weight decay,
+shrink-and-perturb regularization \cite{warmstart}, as well as other
+ad hoc experiment settings.
+%\textbf{Static Parameters}:
+
+\subsection{Model Experiments}
+
+After finding a reasonably performing starting point, we performed an ablation
+over learning rate and regularization parameters.
+
+This restricted set is illustrated in \Cref{fig:scatter-subset}.

-\subsection{Dataset Distribution}
+We evaluate each model with standard pixelwise segmentation metrics with a
+focus on average-precision (AP) and area under the ROC curve (AUC)
+\cite{metrics}.
+
+\Cref{fig:scatter-all} illustrates the AP and AUC of all baseline models trained.
+These include ad hoc parameter settings when searching for a stable training
+configuration.
+
+The resources used are given in \Cref{fig:resource}.
+
+\section{Distribution}
+
+% BitTorrent can be vulnerable to MITM:
+% https://www.reddit.com/r/technology/comments/1dpinuw/south_korean_telecom_company_attacks_torrent/
+
+Our third contribution is an exploration of distributed and centralized data distribution methods.
+
+Cloud storage for a modest amount of data can be expensive.
+
+Decentralized methods can allow information to persist so long as at least one
+person has the data.
+
+BitTorrent is a well-known distributed system.
+
+IPFS is a newer tool with similar goals.
+
+We discuss distributing the dataset via IPFS versus centralized distribution systems.

-Decentralized Method - IPFS
+Decentralized Method - IPFS and BitTorrent.

 Centralized Method - Girder

 Observations:

-* IPFS via https using gateways does not always work well.
-* IPFS usually works well if you use the CLI.
-* IPFS is easier to update.
-* IPFS does rehash every file, which induces an O(N) scalability constraint.
-* IPFS does rehash every file, which induces an O(N) scalability constraint.
+\begin{itemize}
+    \item IPFS via https using gateways does not always work well.
+    \item IPFS usually works well if you use the CLI.
+    \item IPFS is easier to update.
+    \item IPFS does rehash every file, which induces an O(N) scalability constraint.
+\end{itemize}

 IPFS vs BitTorrent:

@@ -169,43 +232,51 @@ \subsection{Dataset Distribution}

 https://academictorrents.com/docs/about.html

-\subsection{Experiments}
+\subsection{Distribution Experiments}

 Measure the performance of our algorithm versus a baseline.

 Measure the speed of IPFS vs bittorrent.

-Define Model - Separated Attention Transformer
+%-------------------------------------------------------------------------
+\section{Related Work}

-Define training protocol
+Object detection

-Define Parameter Search Space - Learning Rate /
+TACO dataset: \cite{proenca_taco_2020}. Trash bounding box annotations.

-Define quality metrics - Pixel IoU
+MSHIT dataset

-Plot Scatter Plots and Box Plots
+Dog Poop Detection - Neeraj Madan

-Make any inferences
+Other poop work

+\section{Conclusion}

-\subsection{Future}
+The ShitSpotter dataset is 42GB of images with polygon segmentations of dog
+poop.

-The future of the project will:
+We train and evaluate several baseline segmentation models, the best of which
+achieve an AP/AUC of ...

-* Add lightweight object-level head and test object detection metrics
-* Optimize model architectures for mobile devices
-* Launch phone application
-* Improve distributed distribution mechanisms
+Our dataset is sufficient to train an object detection network to (level of
+precision/recall).

-\subsection{Conclusion}
+We make data and models available over three distribution mechanisms:
+cloud storage, BitTorrent, and IPFS.
+Decentralized methods are feasible distribution mechanisms with strong
+security, but they can be slow.

 IPFS is a promising solution for hosting scientific
 datasets, but does have pain points. In contrast bittorrent can do X/Y/Z, but
 ...

-Lastly there are centralized systems which ...
-
-Our dataset is sufficient to train an object detection network to (level of
-precision/recall).
-
+Lastly, centralized cloud storage can give the best speeds, but sacrifices some
+security and can be less robust.
+
+Directions for future research / development are:
+1. Add a lightweight object-level head and test object detection metrics.
+2. Optimize model architectures for mobile devices.
+3. Launch a phone application.
+4. Improve model / data distribution.

 %%%%%%%%% REFERENCES
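The observation above that IPFS must rehash every file (an O(N) cost in total dataset size) can be sketched with plain checksums. This is a hypothetical illustration using `sha256sum` over made-up demo paths; it mimics the idea of content addressing only, and is not the actual IPFS chunking/CID scheme:

```shell
# Hypothetical sketch of content addressing: every file must be read in full
# to derive its address, so adding a dataset costs O(N) in total bytes.
set -eu

DEMO_DPATH=$(mktemp -d)
echo "some data" > "$DEMO_DPATH/data1.txt"
echo "some other data" > "$DEMO_DPATH/data2.txt"

# Derive a content address for each file (reads every byte of every file).
find "$DEMO_DPATH" -type f | sort | xargs sha256sum

# Appending one line changes that file's address, which is why an updated
# dataset must be rehashed before it can be re-announced.
echo "more data" >> "$DEMO_DPATH/data1.txt"
sha256sum "$DEMO_DPATH/data1.txt"

rm -r "$DEMO_DPATH"
```

The same property is what makes decentralized distribution verifiable: a downloader can recompute the addresses and detect any corruption or tampering.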