Extend chapter on hyperparameter + results + robustness + feature sets + ...🧙 (#410)
KarelZe authored Jun 18, 2023
1 parent 9cf5cad commit 1ea2cb9
Showing 28 changed files with 1,198 additions and 880 deletions.
242 changes: 109 additions & 133 deletions notebooks/4.0c-mb-feature-importances.ipynb

Large diffs are not rendered by default.

331 changes: 104 additions & 227 deletions notebooks/6.0c-mb-results-universal.ipynb

Large diffs are not rendered by default.

288 changes: 263 additions & 25 deletions notebooks/6.0e-mb-viz-universal.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions references/obsidian/📑notes/👶introduction notes.md
@@ -45,6 +45,7 @@ Motivated by these considerations, we investigate how the predictability documen

## Contributions
- from exposé: In the introduction, we provide motivation and present our key findings. The contributions are threefold: (I) We employ state-of-the-art machine learning algorithms, i.e., gradient-boosted trees and transformer networks, for trade classification. Tree-based approaches outperform state-of-the-art trade classification rules in out-of-sample tests. (II) As part of semi-supervised approaches, we study the impact of incorporating unlabelled trades into the training procedure on trade classification accuracy. (III) We consistently interpret feature contributions to classical trade classification rules and machine learning models with a game-theoretic approach.
- Through visualising attention, we are able to establish a theoretical link between rule-based classification and machine learning.

Our contributions are n-fold:
- Our paper contributes to at least two strands of literature. First, it is
38 changes: 38 additions & 0 deletions references/obsidian/📖chapters/🏅Feature importance results.md
@@ -123,3 +123,41 @@ Results:

- **Classical Rules** Results align with intuition. The largest improvements come from applying the quote rule (NBBO), which requires `quote_best` and the trade price; the quote rule (ex) is only applied to a fraction of all trades. The reverse tick test is of hardly any importance, as it neither affects the classification outcome much nor is applied often. A minimal sketch of both rules follows below.
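
A minimal sketch of the two rules referenced above, assuming a pandas DataFrame with illustrative column names (`TRADE_PRICE`, `bid_best`, `ask_best`, and `price_all_lead` for the subsequent distinguishable trade price); the sign convention (1 = buy, -1 = sell) follows the common formulation of the quote rule and reverse tick test, not necessarily the exact implementation used in the thesis:

```python
import numpy as np
import pandas as pd

def quote_rule(df: pd.DataFrame) -> pd.Series:
    """Quote rule: buy (1) above the quote midpoint, sell (-1) below, unclassified (NaN) at the midpoint."""
    mid = (df["ask_best"] + df["bid_best"]) / 2
    return pd.Series(
        np.where(df["TRADE_PRICE"] > mid, 1.0, np.where(df["TRADE_PRICE"] < mid, -1.0, np.nan)),
        index=df.index,
    )

def rev_tick_test(df: pd.DataFrame) -> pd.Series:
    """Reverse tick test: buy (1) if the subsequent distinguishable trade price is lower, sell (-1) if higher."""
    nxt = df["price_all_lead"]
    return pd.Series(
        np.where(df["TRADE_PRICE"] > nxt, 1.0, np.where(df["TRADE_PRICE"] < nxt, -1.0, np.nan)),
        index=df.index,
    )

# Tiny illustrative sample: three trades with best quotes and the next trade price.
trades = pd.DataFrame(
    {"TRADE_PRICE": [10.0, 10.4, 10.2], "bid_best": [9.9, 10.0, 10.1],
     "ask_best": [10.1, 10.3, 10.3], "price_all_lead": [10.4, 10.2, 10.2]}
)
print(quote_rule(trades).tolist(), rev_tick_test(trades).tolist())
```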

Relation between attention techniques

- Put special emphasis in the analysis on the Transformer's performance, as it is
- Arrange in the graphics below; arrange like puzzle blocks showing what is done by which approach
- Pick up and describe two trades

- general overview in

![[model-wide-attention.png]]
(from [[@coenenVisualizingMeasuringGeometry2019]])

While [6] analyzed context embeddings, another natural place to look for encodings is in the attention matrices. After all, attention matrices are explicitly built on the relations between pairs of words.

This broader analysis shows that BERT's attention heads pay little attention to the current token but rather specialize to attend heavily to the next or previous token, especially in the earlier layers.

A substantial amount of BERT's attention focuses on a few special tokens such as the delimiter token [SEP], which means that such tokens play a vital role in BERT's performance. The figure below shows the average attention behavior in each layer for some special tokens such as [CLS] and [SEP].

With a few more creative tests (see paper for full details), the authors found that BERT’s attention maps have a fairly thorough representation of English syntax.

Upon further investigation of the individual attention heads' behavior for a given layer, the authors found that some heads behave similarly, possibly due to some attention weights being zeroed out via dropout. This is a surprising result, given that other researchers found that encouraging different behavior in attention heads improves a Transformer's performance. There is more opportunity to conduct extended analysis to help further understand these behaviors in the attention layer.
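
A minimal numpy sketch of the kind of measurement described above — how much attention mass each layer directs at a designated special token such as [CLS] or [SEP]; the attention tensors and shapes are random placeholders, not weights from an actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, seq_len = 6, 8, 10

# Placeholder attention tensors: one (heads, seq, seq) matrix per layer, rows normalised to sum to 1.
raw = rng.random((n_layers, n_heads, seq_len, seq_len))
attentions = raw / raw.sum(axis=-1, keepdims=True)

special_pos = 0  # position of the special token, e.g. [CLS]

# Share of attention each layer directs at the special token, averaged over heads and query positions.
attn_to_special = attentions[..., special_pos].mean(axis=(1, 2))
for layer, share in enumerate(attn_to_special):
    print(f"layer {layer}: {share:.3f} of the attention mass goes to position {special_pos}")
```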

The example given here is correctly classified. Crucially, only in the first couple of layers are there some distinctions in the attention patterns for different positions, while in higher layers the attention weights are rather uniform. Figure 2 (left) gives raw attention scores of the CLS token over input tokens (x-axis) at different layers (y-axis), which similarly lack an interpretable pattern. These observations reflect the fact that, as we go deeper into the model, the embeddings become more contextualised and may all carry similar information. This underscores the need to track attention weights all the way back to the input layer and is in line with the findings of Serrano and Smith (2019), who show that attention weights do not necessarily correspond to the relative importance of input tokens. ([[@abnarQuantifyingAttentionFlow2020]])

Add a CLS token and use its embedding in the final layer as the input to the classifier; a minimal sketch follows below.
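
A minimal PyTorch sketch of this [CLS]-token setup; the dimensions, two-class head, and encoder depth are illustrative, not the thesis's actual architecture:

```python
import torch
import torch.nn as nn

class ClsClassifier(nn.Module):
    """Prepend a learnable [CLS] token and classify from its final-layer embedding."""

    def __init__(self, d_model: int = 64, n_classes: int = 2):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))  # learnable [CLS] embedding
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)          # broadcast [CLS] across the batch
        h = self.encoder(torch.cat([cls, x], dim=1))      # [CLS] sits at position 0
        return self.head(h[:, 0])                          # logits from the final-layer [CLS] embedding

logits = ClsClassifier()(torch.randn(8, 10, 64))
print(logits.shape)  # torch.Size([8, 2])
```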

Figures 2 and 3 show the weights from raw attention, attention rollout and attention flow for the CLS embedding over input tokens (x-axis) in all 6 layers (y-axis) for three examples. The first example is the same as the one in Figure 1. The second example is "the article on NNP large systems". The model correctly classifies this example, and changing the subject of the missing verb from "article" to "articles" flips the decision of the model. The third example is "here the NNS differ in that the female", which is a misclassified example, and again changing "NNS" (plural noun) to "NNP" (singular proper noun) flips the decision of the model. For all cases, the raw attention weights are almost uniform above layer three (discussed before). [Figure 4: BERT attention maps — raw attention, attention rollout and attention flow for (a) "The author talked to Sara about mask book." and (b) "Mary convinced John of mask love."; shown are the attention weights from the mask embedding to the two potential referents, e.g. "author" and "Sara" in (a) and "Mary" and "John" in (b), with bars at the left giving the relative predicted probability for the two possible pronouns, "his" and "her".] In the case of the correctly classified example, we observe that both attention rollout and attention flow assign relatively high weights to both the subject of the verb, "article", and the attractor, "systems". For the misclassified example, both attention rollout and attention flow assign relatively high scores to the "NNS" token, which is not the subject of the verb. This can explain the wrong prediction of the model. ([[@abnarQuantifyingAttentionFlow2020]])
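
A minimal numpy sketch of attention rollout as described by [[@abnarQuantifyingAttentionFlow2020]]: average the heads, mix in the residual connection, renormalise, and multiply the per-layer matrices from the bottom up; the attention tensors here are random placeholders:

```python
import numpy as np

def attention_rollout(attentions: np.ndarray) -> np.ndarray:
    """Attention rollout (Abnar & Zuidema, 2020).

    attentions: (n_layers, n_heads, seq_len, seq_len) row-normalised attention weights.
    Returns a (seq_len, seq_len) matrix mapping final-layer positions back to input tokens.
    """
    avg = attentions.mean(axis=1)                      # average over heads -> (n_layers, seq, seq)
    eye = np.eye(avg.shape[-1])
    aug = 0.5 * avg + 0.5 * eye                        # account for the residual connection
    aug = aug / aug.sum(axis=-1, keepdims=True)        # renormalise rows
    rollout = aug[0]
    for layer in aug[1:]:                              # propagate attention layer by layer
        rollout = layer @ rollout
    return rollout

# Placeholder input: random, row-normalised attention for 6 layers, 8 heads, 10 tokens.
rng = np.random.default_rng(0)
raw = rng.random((6, 8, 10, 10))
attn = raw / raw.sum(axis=-1, keepdims=True)
cls_rollout = attention_rollout(attn)[0]  # weights of the [CLS] position over input tokens
print(cls_rollout.round(3))
```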

![[Pasted image 20230617080432.png]]

We claim that one-layer attention-only transformers can be understood as an ensemble of a bigram model and several "skip-trigram" models (affecting the probabilities of sequences "A… BC"). Intuitively, this is because each attention head can selectively attend from the present token ("B") to a previous token ("A") and copy information to adjust the probability of possible next tokens ("C"). (https://transformer-circuits.pub/2021/framework/index.html)

- The inner workings of transformers are not fully understood yet. https://transformer-circuits.pub/2021/framework/index.html

Compare attention of pre-trained Transformer with vanilla Transformer?
![[Pasted image 20230617081051.png]]


![[Pasted image 20230617081138.png]]
2 changes: 0 additions & 2 deletions references/obsidian/📖chapters/🏅Results.md
@@ -57,5 +57,3 @@ By extension, we also estimate rules combinations involving overrides from the t

In the absence of other suitable baselines, we also use the GSU method for FS3, even though it does not utilise option-specific features.



14 changes: 13 additions & 1 deletion references/obsidian/📖chapters/🏅Robustness.md
@@ -155,4 +155,16 @@ We analyse the robustness of gls-GBRT with self-training on gls-ise data in cref
Compared to the vanilla gls-GBRT, performance degrades across almost all subsets. We indicate the change with an arrow. Quantitatively, we find no improvements in robustness, as performance differences between sub-samples are of the same magnitude and the performance gap to rule-based classification mostly widens for index options and trades outside the spread.


Break down:
**Transformers with Pre-Training Objective**

Transformers with a pre-training objective outperform the benchmark in all subsets apart from index options and trades outside the quotes. For gls-ISE trades in cref-ise-transformer-semi, pre-training improves performance across subsets, reaching accuracies greater than percentage-86. The only exception is index options, where the performance gap slightly widens. Deep out-of-the-money options and options with long maturities profit most from the introduction of option features.

For trades at the gls-cboe, performance improvements associated with pre-training are slightly lower across several sub-groups. On the positive side, pre-training improves robustness, as the performance gap to the benchmarks narrows for trades outside the quotes.






![[Pasted image 20230618070205.png]]
![[Pasted image 20230618070237.png]]
6 changes: 6 additions & 0 deletions references/obsidian/📖chapters/💡Hyperparameter tuning.md
@@ -59,6 +59,12 @@ Do like $\operatorname{Categorical}\left[\operatorname{tick}_{\text{ex}},\ldots\right]$
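
The categorical search space over candidate rules sketched above could be expressed, for instance, with Optuna's `suggest_categorical`; a minimal sketch with a dummy objective — the rule names, stack depth, and placeholder score are illustrative only:

```python
import optuna

RULES = ["tick_ex", "rev_tick_ex", "quote_ex", "quote_nbbo"]  # illustrative rule names

def objective(trial: optuna.Trial) -> float:
    # Categorical choice of which base rule to apply at each layer of a hybrid stack.
    stack = [trial.suggest_categorical(f"layer_{i}", RULES) for i in range(3)]
    # Dummy stand-in score; the real objective would return validation accuracy of the stacked rules.
    return len(set(stack)) / len(RULES)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```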

![[training-vs-validation-accuracy.png]]

![[Pasted image 20230617144732.png]]
![[Pasted image 20230617144757.png]]

![[Pasted image 20230617200347.png]]

![[Pasted image 20230617200407.png]]

https://arxiv.org/pdf/1603.02754.pdf

8 changes: 7 additions & 1 deletion references/obsidian/📖chapters/🧓Discussion.md
@@ -1,6 +1,12 @@

- Results for classical rules demonstrate that classical choices for option trade classification leave room for improvement.
- We identify missingness in the data as downward-biasing the results of classical estimators. ML predictors are robust to this missingness, as they can handle missing values and potentially substitute them with other features (see the sketch after this list).
- Our study puts special emphasis on thoughtful tuning and data pre-processing.
- The elephant in the room: labelled data and compute. Fine-tuning; low cost of inference.
- Our results contradict Ronen et al.: neural networks can achieve state-of-the-art performance if well tuned.
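
A minimal sketch of the missing-value point above, using scikit-learn's `HistGradientBoostingClassifier` as a stand-in for the thesis's GBRT implementation (which is not specified here); histogram-based gradient boosting routes NaNs to a learned child at each split, so no imputation is required:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Toy feature matrix with missing entries; no imputation step is needed.
X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [3.0, np.nan]])
y = np.array([0, 1, 0, 1])

clf = HistGradientBoostingClassifier().fit(X, y)  # NaNs follow a learned default branch per split
print(clf.predict(X))
```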


What does it mean? Point out limitations and, e.g., managerial implications or future impact.
@@ -0,0 +1,13 @@
*title:* Visualizing and Measuring the Geometry of BERT
*authors:* Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, Martin Wattenberg
*year:* 2019
*tags:*
*status:* #📥
*related:*
*code:*
*review:*

## Notes 📍

## Annotations 📖
Note:
2 changes: 1 addition & 1 deletion reports/Content/Appendix.tex
@@ -116,7 +116,7 @@ \subsection{Results of Supervised Models With Re-Training}

\begin{table}[ht]
\centering
\caption[Accuracies of Supervised Approaches With Re-Training On \glsentryshort{CBOE} and \glsentryshort{ISE}]{This table reports the accuracy of \glspl{GBRT} for different feature sets on the \gls{ISE} and \gls{CBOE} test set after re-training on \gls{ISE} training and validation set. The improvement is estimated as the absolute change in accuracy between the classifier and the benchmark. For feature set classical, $\operatorname{gsu}_{\mathrm{small}}$ is the benchmark and otherwise $\operatorname{gsu}_{\mathrm{large}}$.}
\caption[Accuracies of Supervised Approaches With Re-Training On \glsentryshort{CBOE} and \glsentryshort{ISE} Sample]{This table reports the accuracy of \glspl{GBRT} for different feature sets on the \gls{ISE} and \gls{CBOE} test set after re-training on \gls{ISE} training and validation set. The improvement is estimated as the absolute change in accuracy between the classifier and the benchmark. For feature set classical, $\operatorname{gsu}_{\mathrm{small}}$ is the benchmark and otherwise $\operatorname{gsu}_{\mathrm{large}}$.}
\label{tab:results-supervised-retraining-ise-cboe}
\begin{tabular}{@{}llSSSSSS@{}}
\toprule
32 changes: 32 additions & 0 deletions reports/Content/bibliography.bib
@@ -927,6 +927,17 @@ @misc{clarkWhatDoesBERT2019
archiveprefix = {arxiv}
}

@misc{coenenVisualizingMeasuringGeometry2019,
title = {Visualizing and {{Measuring}} the {{Geometry}} of {{BERT}}},
author = {Coenen, Andy and Reif, Emily and Yuan, Ann and Kim, Been and Pearce, Adam and Viégas, Fernanda and Wattenberg, Martin},
year = {2019},
number = {arXiv:1906.02715},
eprint = {1906.02715},
publisher = {{arXiv}},
urldate = {2023-06-17},
archiveprefix = {arxiv}
}

@article{congDEEPSEQUENCEMODELING,
title = {Deep Sequence Modeling: Development and Applications in Asset Pricing},
author = {Cong, Lin William and Tang, Ke and Wang, Jingyuan and Zhang, Yang}
@@ -3469,6 +3480,17 @@ @article{radfordImprovingLanguageUnderstanding
author = {Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya}
}

@misc{raeScalingLanguageModels2022,
title = {Scaling {{Language Models}}: {{Methods}}, {{Analysis}} \& {{Insights}} from {{Training Gopher}}},
author = {Rae, Jack W. and Borgeaud, Sebastian and Cai, Trevor and Millican, Katie and Hoffmann, Jordan and Song, Francis and Aslanides, John and Henderson, Sarah and Ring, Roman and Young, Susannah and Rutherford, Eliza and Hennigan, Tom and Menick, Jacob and Cassirer, Albin and Powell, Richard and Driessche, George van den and Hendricks, Lisa Anne and Rauh, Maribeth and Huang, Po-Sen and Glaese, Amelia and Welbl, Johannes and Dathathri, Sumanth and Huang, Saffron and Uesato, Jonathan and Mellor, John and Higgins, Irina and Creswell, Antonia and McAleese, Nat and Wu, Amy and Elsen, Erich and Jayakumar, Siddhant and Buchatskaya, Elena and Budden, David and Sutherland, Esme and Simonyan, Karen and Paganini, Michela and Sifre, Laurent and Martens, Lena and Li, Xiang Lorraine and Kuncoro, Adhiguna and Nematzadeh, Aida and Gribovskaya, Elena and Donato, Domenic and Lazaridou, Angeliki and Mensch, Arthur and Lespiau, Jean-Baptiste and Tsimpoukelli, Maria and Grigorev, Nikolai and Fritz, Doug and Sottiaux, Thibault and Pajarskas, Mantas and Pohlen, Toby and Gong, Zhitao and Toyama, Daniel and {d'Autume}, Cyprien de Masson and Li, Yujia and Terzi, Tayfun and Mikulik, Vladimir and Babuschkin, Igor and Clark, Aidan and Casas, Diego de Las and Guy, Aurelia and Jones, Chris and Bradbury, James and Johnson, Matthew and Hechtman, Blake and Weidinger, Laura and Gabriel, Iason and Isaac, William and Lockhart, Ed and Osindero, Simon and Rimell, Laura and Dyer, Chris and Vinyals, Oriol and Ayoub, Kareem and Stanway, Jeff and Bennett, Lorrayne and Hassabis, Demis and Kavukcuoglu, Koray and Irving, Geoffrey},
year = {2022},
number = {arXiv:2112.11446},
eprint = {2112.11446},
publisher = {{arXiv}},
urldate = {2023-06-17},
archiveprefix = {arxiv}
}

@misc{raffelExploringLimitsTransfer2020,
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
author = {Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J.},
@@ -3527,6 +3549,16 @@ @article{ribeiroEnsembleApproachBased2020
doi = {10.1016/j.asoc.2019.105837}
}

@article{rogersPrimerBERTologyWhat2020,
title = {A {{Primer}} in {{BERTology}}: {{What We Know About How BERT Works}}},
author = {Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna},
year = {2020},
journal = {Transactions of the Association for Computational Linguistics},
volume = {8},
doi = {10.1162/tacl_a_00349},
urldate = {2023-06-17}
}

@article{ronenMachineLearningTrade2022,
title = {Machine Learning and Trade Direction Classification: Insights from the Corporate Bond Market},
author = {Ronen, Tavy and Fedenia, Mark A. and Nam, Seunghan},