diff --git a/docs/src/features.md b/docs/src/features.md
index 4f98bd2..1a36674 100644
--- a/docs/src/features.md
+++ b/docs/src/features.md
@@ -117,7 +117,7 @@ julia> using AbstractTrees; children(ts[1])
 ```
 
 ## Setting a Custom Objective Function
-Xgboost uses a second order approximation, so to provide a custom objective functoin first and
+XGBoost uses a second order approximation, so to provide a custom objective function, first and
 second order derivatives must be provided, see the docstring of [`updateone!`](@ref) for more
 details.
 
@@ -148,7 +148,7 @@ bst = xgboost((X, y), ℓ′, ℓ″, max_depth=8)
 ```
 
 ## Caching Data From External Memory
-Xgboost can be used to cache memory from external memory on disk, see
+XGBoost can be used to cache data from external memory on disk, see
 [here](https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html). In the Julia
 wrapper this is facilitated by allowing a `DMatrix` to be constructed from any Julia iterator with
 [`fromiterator`](@ref). The resulting `DMatrix` holds references to cache files which will have
diff --git a/docs/src/index.md b/docs/src/index.md
index 04d4038..a540b62 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -22,10 +22,13 @@ ŷ = predict(bst, X)
 
 using DataFrames
 
-df = DataFrame(randn(100,3), [:a, :b, :y])
+df = DataFrame(randn(200,3), [:a, :b, :y])
+
+train = DMatrix(df[1:150, [:a, :b]], df[1:150, :y])
+test = DMatrix(df[151:end, [:a, :b]], df[151:end, :y])
 
 # can accept tabular data, will keep feature names
-bst = xgboost((df[!, [:a, :b]], df.y))
+bst = xgboost(train, watchlist=Dict("train"=>train, "test"=>test), num_round=10, max_depth=3, η=0.1, objective="reg:squarederror")
 
 # display importance statistics retaining feature names
 importancereport(bst)
@@ -57,15 +60,13 @@ X = [0 missing 1
 
 isequal(DMatrix(X), x) # nullity is preserved
 ```
 
-!!! note
-
-    `DMatrix` must allocate new arrays when fetching values from it. One therefore should avoid
-    using `DMatrix` directly except with `XGBoost`; retrieving values from this object should be
-    considered useful mostly only for verification.
+`DMatrix` must allocate new arrays when fetching values from it. One therefore should avoid
+using `DMatrix` directly except with `XGBoost`; retrieving values from this object should be
+considered useful mostly for verification.
 
 ### Feature Naming and Tabular Data
-Xgboost supports the naming of features (i.e. columns of the feature matrix). This can be useful
+XGBoost supports feature naming (i.e. names of columns of the feature matrix). This can be useful
 for inspecting trained models.
 ```julia
 X = randn(10,3)
@@ -80,14 +81,9 @@ XGBoost.setfeaturenames!(dm, ["a", "b", "c"]) # can also set after construction
 
 `AbstractVector`s or a `DataFrame`).
 ```julia
 using DataFrames
-df = DataFrame(randn(10,3), [:a, :b, :c])
-
-y = randn(10)
-
-DMatrix(df, y)
-df[!, :y] = y
-DMatrix(df, :y) # equivalent to DMatrix(df, y)
+df = DataFrame(randn(10,4), [:a, :b, :c, :y])
+dm = DMatrix(df, :y) # equivalent to DMatrix(df[!, Not(:y)], df[!, :y])
 ```
 
 When constructing a `DMatrix` from a table the feature names will automatically be set to the names
@@ -134,7 +130,7 @@ this is always a `DMatrix` but arguments will be automatically converted.
 ### [Parameters](https://xgboost.readthedocs.io/en/stable/parameter.html)
 Keyword arguments to `Booster` are xgboost model parameters.
 These are described in detail [here](https://xgboost.readthedocs.io/en/stable/parameter.html) and should all be passed exactly as
-they are described in the main xgbosot documentation (in a few cases such as Greek letters we also
+they are described in the main xgboost documentation (in a few cases such as Greek letters we also
 allow unicode equivalents).
 
 ### Training
@@ -156,7 +152,7 @@ using Statistics
 
 mean(ŷ - y)/std(y)
 ```
 
-Xgboost expects `Booster`s to be initialized with training data, therefore there is usually no need
+XGBoost expects `Booster`s to be initialized with training data, therefore there is usually no need
 to define `Booster` separate from training. A shorthand for the above, provided by [`xgboost`](@ref) is
 ```julia
diff --git a/src/booster.jl b/src/booster.jl
index eaa263b..36d57ba 100644
--- a/src/booster.jl
+++ b/src/booster.jl
@@ -396,14 +396,14 @@ end
     xgboost(data; num_round=10, watchlist=Dict(), kw...)
     xgboost(data, ℓ′, ℓ″; kw...)
 
-Creates an xgboost gradient booster object on training data `data` and runs `nrounds` of training.
+Creates an xgboost gradient booster object on training data `data` and runs `num_round` rounds of training.
 This is essentially an alias for constructing a [`Booster`](@ref) with `data` and keyword arguments
-followed by [`update!`](@ref) for `nrounds`.
+followed by [`update!`](@ref) for `num_round` rounds.
 
-`watchlist` is a dict the keys of which are strings giving the name of the data to watch
-and the values of which are [`DMatrix`](@ref) objects containing the data.
+`watchlist` is a `Dict` mapping `String` names to [`DMatrix`](@ref) data on which the model is evaluated during training.
+If omitted, `watchlist` will be initialized with the training data.
 
-All other keyword arguments are passed to [`Booster`](@ref). With few exceptions these are model
+All other keyword arguments are passed to [`Booster`](@ref). With few exceptions these are model
 training hyper-parameters, see [here](https://xgboost.readthedocs.io/en/stable/parameter.html) for a
 comprehensive list.
 
@@ -412,9 +412,10 @@ See [`updateone!`](@ref) for more details.
 
 ## Examples
 ```julia
-(X, y) = (randn(100,3), randn(100))
+train = DMatrix(randn(100,3), randn(100))
+test = DMatrix(randn(100,3), randn(100))
 
-b = xgboost((X, y), 10, max_depth=10, η=0.1)
-ŷ = predict(b, X)
+b = xgboost(train, watchlist=Dict("train"=>train, "test"=>test), num_round=10, max_depth=5, η=0.1)
+ŷ = predict(b, train)
 ```
 
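The custom-objective hunk above only touches the prose, so for reference here is a minimal sketch of how the `xgboost(data, ℓ′, ℓ″; kw...)` form shown in the docstring might be used. It assumes, as the surrounding docs suggest, that `ℓ′` and `ℓ″` are the scalar first and second derivatives of the loss with respect to the prediction, applied over `(ŷ, y)` pairs:

```julia
using XGBoost

# squared-error loss, written out only so the derivatives are obvious
ℓ(ŷ, y)  = (ŷ - y)^2
ℓ′(ŷ, y) = 2(ŷ - y)   # ∂ℓ/∂ŷ
ℓ″(ŷ, y) = 2.0        # ∂²ℓ/∂ŷ²

X = randn(100, 3)
y = randn(100)

# same call as shown in the features.md hunk above
bst = xgboost((X, y), ℓ′, ℓ″, max_depth=8)
```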
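Similarly, the revised `watchlist` example in the `xgboost` docstring can be exercised end to end. The sketch below assumes only the API visible in the hunks above (`DMatrix(X, y)`, the `watchlist` keyword, and `predict` accepting a `DMatrix`):

```julia
using XGBoost

train = DMatrix(randn(100, 3), randn(100))
test  = DMatrix(randn(100, 3), randn(100))

# per the revised docstring, omitting `watchlist` would evaluate on the training data only
b = xgboost(train, watchlist=Dict("train" => train, "test" => test),
            num_round=10, max_depth=5, η=0.1)

ŷ = predict(b, test)   # predictions for the held-out DMatrix
```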