Migrate MLFF geometry optimization files to JSON Lines format for fast partial loading #231

Merged: janosh merged 21 commits into main from fix-geo-opt-rmsd on Mar 23, 2025

janosh
Owner

@janosh janosh commented Mar 20, 2025

does not fix #230 (yet). the symmetry analysis rerun for all models is pushed back to a subsequent PR. the current lineup of models was tested with a mixture of spglib==2.5.0 and several moyopy versions, so a rerun would have been in order anyway for consistency. the code that calculates the RMSD metric is fixed in this PR, but the existing metric values are not updated yet. my machine keeps crashing when trying to rerun all models at once. may need to rerun models one by one overnight.

this PR makes debugging and iterating on future geometry optimization much faster: test runs can load just ~100 DFT/model-relaxed structures instead of all 257k in WBM.
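As a rough illustration of the partial loading described above, here is a minimal sketch using `pd.read_json` with `lines=True` and `nrows`. The file name, the `energy` column and the record contents are made up for the demo and are not taken from the repo; only the `material_id` column and the `wbm-` ID prefix mirror the conversion script.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for a geometry optimization file (values are placeholders)
records = pd.DataFrame(
    {
        "material_id": [f"wbm-1-{idx}" for idx in range(1, 1001)],
        "energy": [float(idx) for idx in range(1000)],
    }
)

# Write one JSON object per line; gzip compression is inferred from the .gz suffix
jsonl_path = os.path.join(tempfile.mkdtemp(), "demo-geo-opt-n=1000.jsonl.gz")
records.to_json(jsonl_path, orient="records", lines=True)

# nrows is only accepted together with lines=True and stops after 100 records
df_head = pd.read_json(jsonl_path, lines=True, nrows=100).set_index("material_id")
print(df_head.shape)
```

A plain JSON file has no equivalent shortcut since the whole document must be parsed before any record is available.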

Unpolished conversion script for ML-relaxed structures from JSON to JSON Lines
"""
Convert geometry optimization JSON files to JSON lines format for better debug performance.

JSON lines format allows reading a specific number of lines without loading the entire file
into memory, which is especially useful for large structure files when only a subset is needed.

Example usage:
    python scripts/convert_to_jsonl.py  # Convert all model geo-opt files
    python scripts/convert_to_jsonl.py --models mace_mp_0 m3gnet  # Convert specific models
    python scripts/convert_to_jsonl.py --force  # Overwrite existing output files
"""

import argparse
import glob
import os
import sys

import pandas as pd
from tqdm import tqdm

from matbench_discovery.enums import Model


def convert_to_jsonl(
    input_file: str,
    output_file: str | None = None,
    suffix: str = ".jsonl.gz",
    force: bool = False,
) -> bool:
    """Convert a JSON file to JSON lines format.

    Args:
        input_file: Path to input JSON file
        output_file: Path to output JSON lines file (default: input_file with .jsonl.gz extension)
        suffix: Suffix to use for output file if output_file is not specified
        force: Whether to overwrite existing output file

    Returns:
        bool: True if the file was converted, False if it was skipped or conversion failed
    """
    if not os.path.exists(input_file):
        print(f"Error: Input file {input_file} does not exist")
        return False

    try:
        print(f"Reading {input_file}...")
        df = pd.read_json(input_file)

        # Get row count for filename
        row_count = len(df)

        # Make sure the frame is indexed by material ID
        if df.index.name != "material_id":
            if "material_id" in df.columns:
                # Promote the material_id column to the index
                df = df.set_index("material_id")
            elif (
                len(df) > 0
                and df.index.dtype == "object"
                and str(df.index[0]).startswith("wbm-")
            ):
                # Index already contains material IDs but is unnamed
                df.index.name = "material_id"

        # Ensure index name is material_id for proper loading later
        if df.index.name is None:
            df.index.name = "material_id"

        if output_file is None:
            # By default, replace extension with the line count and .jsonl.gz
            basename = os.path.splitext(input_file)[0]
            basename = basename.removesuffix(".json")  # Remove .json from .json.gz
            output_file = f"{basename}-n={row_count}{suffix}"

        # Silently skip if file exists and not forcing overwrite
        if os.path.exists(output_file) and not force:
            return False

        print(
            f"Writing {row_count} structures to {output_file} in JSON lines format..."
        )
        df.reset_index().to_json(output_file, orient="records", lines=True)
        print(f"Successfully converted {input_file} to {output_file}")

        # Print info on how to verify the conversion
        print("\nTo verify conversion:")
        print(
            f"  python -c \"import pandas as pd; print(pd.read_json('{output_file}', lines=True).head())\""
        )
        return True

    except Exception as exc:
        print(f"Error converting {input_file}: {exc}")
        return False


def main():
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument(
        "--models",
        nargs="*",
        type=str,
        help="Specific models to convert (default: all models with geo_opt_path)",
    )
    parser.add_argument(
        "--suffix",
        default=".jsonl.gz",
        help="Suffix for output file (default: .jsonl.gz)",
    )
    parser.add_argument(
        "--force", action="store_true", help="Overwrite existing output files"
    )
    args = parser.parse_args()

    # Get list of models to process
    models_to_process = []
    if args.models:
        # Convert model names to actual Model enum values
        for model_name in args.models:
            try:
                model = getattr(Model, model_name)
                models_to_process.append(model)
            except AttributeError:
                print(f"Warning: Unknown model '{model_name}', skipping")
    else:
        # Process all models with geo_opt_path
        models_to_process = list(Model)

    # Filter to models with geometry optimization data
    geo_opt_models = []
    for model in models_to_process:
        try:
            if model.geo_opt_path:  # Returns None if model doesn't have geo_opt
                geo_opt_models.append(model)
        except ValueError as exc:
            print(f"Warning: {model.label} - {exc}")
        except Exception as exc:
            print(f"Error with {model.label}: {exc}")

    if not geo_opt_models:
        print("No models with geometry optimization data found.")
        sys.exit(1)

    print(f"Found {len(geo_opt_models)} models with geo_opt paths")

    # Process each model's geo_opt file
    processed_count = 0
    skipped_count = 0
    already_exists_count = 0

    for model in tqdm(geo_opt_models, desc="Converting files"):
        try:
            input_file = model.geo_opt_path
            if input_file and os.path.exists(input_file):
                # Check if output file would exist without conversion attempt
                basename = os.path.splitext(input_file)[0]
                basename = basename.removesuffix(".json")  # Remove .json from .json.gz

                # Check if any file matching this pattern exists
                potential_output_files = glob.glob(f"{basename}-n=*{args.suffix}")
                if potential_output_files and not args.force:
                    already_exists_count += 1
                    continue

                # Process file and track results
                if convert_to_jsonl(input_file, None, args.suffix, args.force):
                    processed_count += 1
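                else:
                    # Existing outputs were already ruled out by the glob check
                    # above, so a falsy return here means the conversion failed
                    skipped_count += 1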
            else:
                print(
                    f"Warning: geo_opt_path for {model.label} doesn't exist: {input_file}"
                )
                skipped_count += 1
        except Exception as exc:
            print(f"Error processing {model.label}: {exc}")
            skipped_count += 1

    print(
        f"Processed {processed_count} file(s), skipped {skipped_count} file(s), {already_exists_count} already existed"
    )


if __name__ == "__main__":
    main()
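
To illustrate why the line-delimited layout makes partial loads cheap, here is a minimal standard-library sketch that parses only the first few records of a gzipped JSON Lines file. The helper name `read_first_records`, the demo file and its fields are hypothetical and not part of the repo.

```python
import gzip
import itertools
import json
import os
import tempfile


def read_first_records(path: str, n_records: int) -> list[dict]:
    """Parse only the first n_records lines of a gzipped JSON Lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as file:
        # islice stops pulling lines after n_records, so the remaining
        # lines are never parsed
        return [json.loads(line) for line in itertools.islice(file, n_records)]


# Write a small made-up file to demonstrate (IDs and values are placeholders)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, mode="wt", encoding="utf-8") as file:
    for idx in range(1, 501):
        record = {"material_id": f"wbm-1-{idx}", "n_sites": idx % 7}
        file.write(json.dumps(record) + "\n")

first_records = read_first_records(demo_path, n_records=3)
print([record["material_id"] for record in first_records])
```

Each record sits on its own line, so the reader can stop as soon as it has enough of them.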

janosh added 2 commits March 20, 2025 08:37
…lling them with 1.0

- scripts/evals/geo_opt.py add note explaining RMSD values are unitless
@janosh janosh added the fix (Bug fix) and geo opt (Geometry optimization) labels Mar 20, 2025
- after moving spacegroup comparison out of pred_vs_ref_struct_symmetry to analyze_geo_opt.py
- calc_structure_distances now only calculates distance metrics between predicted and reference structures
- fix tests to cover new functionality and ensure robustness against mismatched IDs
janosh added 2 commits March 20, 2025 09:24
… 'distance'

- Conditional execution of symmetry analysis and structure distance calculations based on the specified analysis type
…aster loading on debug runs

uploaded to new figshare article: https://figshare.com/articles/dataset/28642406
old one is now deprecated: https://figshare.com/articles/dataset/28187999
symmetry and distance analysis files to be added to new article next
janosh added 3 commits March 22, 2025 12:22
- update_yaml_at_path now allows reading from a YAML file when `data` is set to None, returning the value at the specified dotted path without modifying the file
- tests to verify read-only functionality
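
A minimal sketch of what a read-only lookup at a dotted path could look like on an already-parsed YAML mapping. This is not the repo's `update_yaml_at_path` implementation: the helper `get_at_dotted_path`, the example keys and the value are all hypothetical.

```python
from typing import Any


def get_at_dotted_path(data: dict[str, Any], dotted_path: str) -> Any:
    """Return the value stored at a dotted path like 'a.b.c' without mutating data."""
    node: Any = data
    for key in dotted_path.split("."):
        # Fail loudly if any intermediate key is missing or not a mapping
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"{dotted_path!r} not found (failed at {key!r})")
        node = node[key]
    return node


# Made-up metadata, standing in for the contents of a parsed model YAML file
metadata = {"metrics": {"geo_opt": {"rmsd": 0.01}}}
rmsd = get_at_dotted_path(metadata, "metrics.geo_opt.rmsd")
print(rmsd)
```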
…a dictionary

- Refined `calc_geo_opt_metrics` to handle NaN values
- more test coverage for both functions
…IDS`

- Modified `calc_structure_distances` in `symmetry.py` to print a warning instead of raising an error when there are no shared IDs between predicted and reference structures
- new test cases in `test_symmetry.py` to verify the new warning behavior and ensure proper handling of NaN values in distance calculations
…imization analysis in `test_analyze_geo_opt.py`

- bump pre-commit hooks for ruff, eslint, and pyright
janosh added 4 commits March 22, 2025 17:35
…analysis filename

- updated analysis file paths to include structure counts in filenames for consistency
- modified RMSD values to reflect updated config in several models
…`lines=True` for new JSON files in line-delimited format

- Updated relevant scripts and models to ensure compatibility
…pdate figshare URLs in data-files.yml

- remove DataFiles.wbm_cses_plus_init_structs altogether since you usually need just one or the other, not both initial and relaxed structures
- change all references from `wbm_cses_plus_init_structs` to `wbm_initial_structures` and `wbm_computed_structure_entries` in scripts and models
- enhance upload script with argparse for file selection
janosh added 2 commits March 23, 2025 12:32
…g.md and PR template.md

update data-files.yml to reflect the change in file paths from .json.bz2 to .jsonl.gz for WBM computed and initial structures and the removal of wbm_cses_plus_init_structs
@janosh janosh changed the title from "Fix geo_opt RMSD metric" to "Migrate MLFF geometry optimization files to JSON Lines format for fast partial loading" Mar 23, 2025
@janosh janosh merged commit 85892aa into main Mar 23, 2025
6 checks passed
@janosh janosh deleted the fix-geo-opt-rmsd branch March 23, 2025 19:31
Development

Successfully merging this pull request may close these issues.

Parameters for StructureMatcher impact on geo-opt metrics