Finetune Small Language Model (SLM) Phi-3 using Azure Machine Learning


Motivations for Small Language Models:

· Efficiency: SLMs are computationally more efficient, requiring less memory and storage, and can operate faster due to fewer parameters to process.

· Cost: Training and deploying SLMs is less expensive, making them accessible to a wider range of businesses and suitable for applications in edge computing.

· Customizability: SLMs are more adaptable to specialized applications and can be fine-tuned for specific tasks more readily than larger models.

· Under-Explored Potential: While large models have shown clear benefits, the potential of smaller models trained on larger datasets has been less explored. SLMs aim to show that smaller models can achieve high performance when trained with enough data.

· Inference Efficiency: Smaller models are often more efficient during inference, which is critical when deploying models in real-world applications with resource constraints. This efficiency means faster response times and reduced computational and energy costs.

· Accessibility for Research: By being open source and smaller in size, SLMs are accessible to a broader range of researchers who may not have the resources to work with larger models. They provide a platform for experimentation and innovation in language model research without requiring extensive computational resources.

· Advancements in Architecture and Optimization: SLMs incorporate various architectural and speed optimizations to improve computational efficiency. These enhancements allow them to train faster and with less memory, making it feasible to train on commonly available GPUs.

· Open-Source Contribution: The authors of many SLMs make their model checkpoints and code publicly available, contributing to the open-source community and enabling further advancements and applications by others.

· End-User Applications: With their strong performance and compact size, SLMs are suitable for end-user applications, potentially even on mobile devices, providing a lightweight platform for a wide range of applications.

· Training Data and Process: SLM training processes are designed to be effective and reproducible, using a mixture of natural language data and code data, aiming to make pre-training accessible and transparent.


 

Phi-2 (Microsoft Research)

Phi-2 is the successor of Phi-1.5, the small language model created by Microsoft. To improve over Phi-1.5, in addition to doubling the number of parameters to 2.7 billion, Microsoft also extended the training data. Phi-2 outperforms Phi-1.5, and on several public benchmarks it even outperforms LLMs 25 times larger, despite not being aligned or instruction fine-tuned. It was released as a pre-trained model for research purposes only (non-commercial, non-revenue generating). Forget about the exorbitant fees of larger language models: models of this size run efficiently on even modest hardware, democratizing access to cutting-edge AI for startups and smaller businesses. No more sky-high cloud bills, just smart, affordable solutions on your own terms. In this example, we are going to learn how to fine-tune the successor model, Phi-3, using QLoRA (Efficient Finetuning of Quantized LLMs) with Flash Attention. QLoRA is an efficient fine-tuning technique that quantizes a pretrained language model to 4 bits and attaches small "Low-Rank Adapters," which are then fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.

Step 1: Prepare the dataset

Let's prepare the dataset. In this case we are going to download a 2% slice of the ultrachat_200k dataset from the Hugging Face Hub.

 

from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split='train_sft[:2%]')
print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

 

 

Let's split this sample of the dataset into training and test examples. To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions: we define a formatting function that takes a sample and returns a string in our instruction format (a sketch follows the split below; the full version lives in the training script).

 

 

dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train']
train_dataset.to_json("data/train.jsonl")
test_dataset = dataset['test']
test_dataset.to_json("data/eval.jsonl")
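For illustration, a minimal formatting function in the spirit of what the training script applies later with the tokenizer's chat template might look like the following. This is a sketch only: format_sample is a hypothetical helper, and the real implementation is the apply_chat_template function in src/train.py further down.

# Hypothetical helper illustrating the instruction-formatting step.
# The actual implementation is apply_chat_template in src/train.py below.
def format_sample(sample, tokenizer):
    messages = sample["messages"]
    # Prepend an empty system message if the conversation lacks one
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    # Render the whole conversation as one training string
    return tokenizer.apply_chat_template(messages, tokenize=False)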

 

 

This saves the training and test datasets in JSON Lines format. Now let's load the Azure ML SDK, which we will use to create the necessary components.

 

 

# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, command, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

 

 

Now let's create the workspace client.

 

 

credential = DefaultAzureCredential()
workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    subscription_id = "Enter your subscription_id"
    resource_group = "Enter your resource_group"
    workspace = "Enter your workspace name"
    workspace_ml_client = MLClient(credential, subscription_id, resource_group, workspace)

 

 

Next, let's create a custom training environment.

 

from azure.ai.ml.entities import Environment, BuildContext

env_docker_image = Environment(
    image="mcr.microsoft.com/azureml/curated/acft-hf-nlp-gpu:latest",
    conda_file="environment/conda.yml",
    name="llm-training",
    description="Environment created for llm training.",
)
workspace_ml_client.environments.create_or_update(env_docker_image)

 

 

Let’s look at the conda.yml

 

name: pydata-example
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - pip:
      - bitsandbytes
      - transformers
      - peft
      - accelerate
      - einops
      - datasets
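Note that the training script below imports trl (for SFTTrainer) and loads the model with flash_attention_2. The curated acft-hf-nlp-gpu base image should provide these, but if you build on a different base image you may need to add trl and flash-attn to the pip dependencies above.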

 

 

Let's look at the training script. We are going to use the method introduced in the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works is (a minimal configuration sketch follows the list):

  • Quantize the pretrained model to 4 bits and freeze it.
  • Attach small, trainable adapter layers (LoRA).
  • Fine-tune only the adapter layers, while using the frozen quantized model for context.
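As promised, here is a minimal sketch of QLoRA-style 4-bit loading with LoRA adapters. This block is illustrative only: the training script below loads the model in bf16 and attaches the LoRA adapters through SFTTrainer, whereas a full QLoRA setup would pass a BitsAndBytesConfig like this one.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

# Illustrative QLoRA-style 4-bit quantization config (NF4 + double quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The quantized base model stays frozen during training...
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# ...while small Low-Rank Adapters are attached and fine-tuned
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)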

 

%%writefile src/train.py
import os
import argparse
import sys
import logging

import datasets
from datasets import load_dataset
from peft import LoraConfig
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig

logger = logging.getLogger(__name__)

###################
# Hyper-parameters
###################
training_config = {
    "bf16": True,
    "do_eval": False,
    "learning_rate": 5.0e-06,
    "log_level": "info",
    "logging_steps": 20,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 4,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.2,
}

peft_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}

train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)

###############
# Setup logging
###############
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = train_conf.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

# Log a small summary on each process
logger.warning(
    f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
    + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
)
logger.info(f"Training/evaluation parameters {train_conf}")
logger.info(f"PEFT parameters {peft_conf}")

################
# Model Loading
################
checkpoint_path = "microsoft/Phi-3-mini-4k-instruct"
# checkpoint_path = "microsoft/Phi-3-mini-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # load the model with flash-attention support
    torch_dtype=torch.bfloat16,
    device_map=None,
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 2048
tokenizer.pad_token = tokenizer.unk_token  # use unk rather than eos token to prevent endless generation
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'

##################
# Data Processing
##################
def apply_chat_template(example, tokenizer):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example


def main(args):
    train_dataset = load_dataset('json', data_files=args.train_file, split='train')
    test_dataset = load_dataset('json', data_files=args.eval_file, split='train')
    column_names = list(train_dataset.features)

    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )
    processed_test_dataset = test_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )

    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        peft_config=peft_conf,
        train_dataset=processed_train_dataset,
        eval_dataset=processed_test_dataset,
        max_seq_length=2048,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True,
    )
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()

    #############
    # Evaluation
    #############
    tokenizer.padding_side = 'left'
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(processed_test_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

    ############
    # Save model
    ############
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, "model.pt"))


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-file", type=str, help="Input data for training")
    parser.add_argument("--eval-file", type=str, help="Input data for eval")
    parser.add_argument("--model-dir", type=str, default="./", help="output directory for model")
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    parser.add_argument("--batch-size", default=16, type=int, help="mini batch size for each gpu/process")
    parser.add_argument("--learning-rate", default=0.001, type=float, help="learning rate")
    parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
    parser.add_argument("--print-freq", default=200, type=int, help="frequency of printing training statistics")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(args)
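One thing to be aware of: the command-line arguments for epochs, batch size, learning rate, momentum, and print frequency are parsed but not wired into TrainingArguments; the hard-coded training_config dictionary at the top of the script is what actually governs training. If you want those flags to take effect, map them into training_config before constructing TrainingArguments.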

 

 

Let's create a training compute cluster.

 

from azure.ai.ml.entities import AmlCompute

# If you have a specific compute size to work with, change it here. By default we use 1 x V100 (Standard_NC6s_v3).
compute_cluster_size = "Standard_NC6s_v3"
# If you already have a GPU cluster, mention it here; otherwise a new one with this name will be created
compute_cluster = "gpu-cluster"

try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            max_instances=1,  # for multi-node training set this to an integer value greater than 1
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        print(f"Error creating compute: {e}")

 

 

Now let's submit a command job that runs the training script on the AML compute we just created.

 

from azure.ai.ml import command, Input
from azure.ai.ml.entities import ResourceConfiguration

job = command(
    inputs=dict(
        train_file=Input(
            type="uri_file",
            path="data/train.jsonl",
        ),
        eval_file=Input(
            type="uri_file",
            path="data/eval.jsonl",
        ),
        epoch=2,
        batchsize=64,
        lr=0.01,
        momentum=0.9,
        prtfreq=200,
        output="./outputs",
    ),
    code="./src",  # local path where the code is stored
    compute=compute_cluster,  # the cluster created above
    command="accelerate launch train.py --train-file ${{inputs.train_file}} --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/52",  # or use the custom "llm-training" environment created above
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
returned_job = workspace_ml_client.jobs.create_or_update(job)
workspace_ml_client.jobs.stream(returned_job.name)

 

 

Let's look at the job output.

 

# check if the trained model output is available
job_name = returned_job.name
print("job outputs: ", workspace_ml_client.jobs.get(job_name).outputs)
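If you want to inspect the artifacts locally, the job outputs can also be downloaded (a sketch; the download path is an arbitrary choice):

# Download all artifacts of the completed job to a local folder (illustrative)
workspace_ml_client.jobs.download(
    name=job_name,
    download_path="./job_artifacts",
    all=True,
)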

 

 

Once the model is fine-tuned, let's register the model from the job output in the workspace so we can create an endpoint.

 

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/outputs/mlflow_model_folder",
    name="phi-3-finetuned",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
model = workspace_ml_client.models.create_or_update(run_model)
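Note that this path assumes the job wrote an MLflow model folder to outputs/mlflow_model_folder. The training script above saves a raw model.pt via torch.save, so if you follow it verbatim you may need to either save the model in MLflow format inside train.py or register the output as a custom model type instead.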

 

 

Let's create the endpoint.

 

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)

endpoint_name = "phi3-endpoint"  # pick your own unique endpoint name
uai_id = ""  # resource ID of a user-assigned identity, if you use one

# Check if the endpoint already exists in the workspace
try:
    endpoint = workspace_ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except:
    # Create an online endpoint if it doesn't exist
    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"Test endpoint for {model.name}",
        identity=IdentityConfiguration(
            type="user_assigned",
            user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],
        )
        if uai_id != ""
        else None,
    )

    # Trigger the endpoint creation
    try:
        workspace_ml_client.begin_create_or_update(endpoint).wait()
        print("\n---Endpoint created successfully---\n")
    except Exception as err:
        raise RuntimeError(f"Endpoint creation failed. Detailed Response:\n{err}") from err

 

Once the endpoint is created, we can go ahead and create the deployment.

 

# Initialize deployment parameters
deployment_name = "phi3-deploy"
sku_name = "Standard_NC6s_v3"
REQUEST_TIMEOUT_MS = 90000
uai_client_id = ""  # client ID of the user-assigned identity, if used

deployment_env_vars = {
    "SUBSCRIPTION_ID": subscription_id,
    "RESOURCE_GROUP_NAME": resource_group,
    "UAI_CLIENT_ID": uai_client_id,
}

 

For inferencing, we will use a different base image.

 

 

from azure.ai.ml.entities import Model, Environment

env = Environment(
    image='mcr.microsoft.com/azureml/curated/foundation-model-inference:latest',
    inference_config={
        "liveness_route": {"port": 5001, "path": "/"},
        "readiness_route": {"port": 5001, "path": "/"},
        "scoring_route": {"port": 5001, "path": "/score"},
    },
)

 

Let's deploy the model.

 

from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment,
)

deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model.id,
    instance_type=sku_name,
    instance_count=1,
    # code_configuration=code_configuration,
    environment=env,
    environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(request_timeout_ms=REQUEST_TIMEOUT_MS),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    workspace_ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(f"Deployment creation failed. Detailed Response:\n{err}") from err
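Once the deployment is live, we can send a test request. Below is a minimal sketch; the payload schema is an assumption based on common text-generation inference images, so check your scoring route's documentation for the exact format.

import json

# Hypothetical sample request; the exact payload schema depends on the inference image
sample = {
    "input_data": {
        "input_string": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "parameters": {"max_new_tokens": 100, "temperature": 0.7},
    }
}
with open("sample_chat.json", "w") as f:
    json.dump(sample, f)

response = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name=deployment_name,
    request_file="sample_chat.json",
)
print(response)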

 

If you want to delete the deployment and endpoint, use the code below.

 

workspace_ml_client.online_deployments.begin_delete(
    name=deployment_name, endpoint_name=endpoint_name
).wait()
workspace_ml_client.online_endpoints.begin_delete(name=endpoint_name).wait()

 

 

Hope this tutorial helps you fine-tune and deploy the Phi-3 model in Azure ML Studio.

Hope you liked the blog. Please follow along if you'd like to read more such blogs coming soon.

References:

https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

https://www.philschmid.de/sagemaker-falcon-180b-qlora
