
Distillation of Phi-4 on DeepSeek R1: SFT and GRPO

xinyuwei, Microsoft
Feb 17, 2025

Please refer to my repo for more AI resources; you are welcome to star it:

https://github.com/xinyuwei-david/david-share.git 

This article is from the following folders of my repo:

https://github.com/xinyuwei-david/david-share/tree/master/Deep-Learning/GRPO-Phi-4-Training   

https://github.com/xinyuwei-david/david-share/tree/master/Deep-Learning/SLM-DeepSeek-R1

 

Phi-4 thinks like DeepSeek-R1

I tried fine-tuning Microsoft's Phi-4 model using the open-source R1 dataset. Below, I'll share my steps.

Please click the link below to watch my demo video on YouTube:

https://www.youtube.com/watch?v=9CVKR0YcdKU

Dataset Used

Why Choose This Dataset?

I used the reasoning-deepseek subset from the cognitivecomputations/dolphin-r1 dataset. The data was generated by DeepSeek-R1; I used the first 30,000 training samples, which focus on reasoning and question-answering capabilities.

The dataset contains the model's reasoning process, wrapped with special <think> tags, which can help our model learn how to think and reason.
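Before preprocessing, it helps to peek at a few rows to confirm the field names. A quick, minimal sketch (the field names match what the training code later in this post relies on):

from datasets import load_dataset

# Load a handful of rows from the reasoning-deepseek subset for inspection
ds_preview = load_dataset("cognitivecomputations/dolphin-r1", "reasoning-deepseek", split="train[:5]")

sample = ds_preview[0]
print(sample.keys())              # expect fields such as 'messages', 'reasoning', 'answer'
print(sample["reasoning"][:200])  # the chain of thought produced by DeepSeek-R1
print(sample["answer"][:200])     # the final answer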

Data Preprocessing

Before using this dataset, we need to do some preprocessing:

  • Merge Fields: Combine the reasoning and answer fields in the dataset into a new assistant_message, and add it to the messages column. This way, our model can learn the complete question-answering and reasoning process.
  • Handle Special Tokens: Since the data uses <think> tags, we need to add these special tokens to the tokenizer so that the model can correctly understand and generate them.
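Concretely, the merge step wraps the DeepSeek-R1 reasoning in <think> tags and appends the final answer as a single assistant turn. A minimal sketch of that transformation (the full training script below does exactly this inside its process function):

def merge_reasoning_and_answer(row):
    # Wrap the reasoning trace in <think> tags, then append the final answer
    assistant_message = "<think>" + row["reasoning"] + "</think>\n\n" + row["answer"]
    # Add the merged turn to the conversation so the model learns the full Q&A + reasoning flow
    row["messages"].append({"role": "assistant", "content": assistant_message})
    return row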

Supervised Fine-tuning the Phi-4 Model

During the fine-tuning process, I chose the LoRA (Low-Rank Adaptation) method. This is a parameter-efficient fine-tuning technique that allows the model to learn new capabilities without significantly increasing the number of parameters.
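To see why LoRA is parameter-efficient: each targeted weight matrix W stays frozen, and only two small matrices A (r × d_in) and B (d_out × r) are trained, so a layer adds roughly r × (d_in + d_out) trainable parameters. A rough back-of-the-envelope sketch (the hidden size used here is an assumption for illustration only; the exact count depends on which modules are targeted):

# LoRA learns W' = W + B @ A, with A: (r, d_in) and B: (d_out, r); W itself stays frozen
def lora_params_per_layer(d_in, d_out, r):
    return r * (d_in + d_out)

d_model = 5120   # assumed Phi-4 hidden size, for illustration only
r = 16           # the LoRA rank used in this post
print(lora_params_per_layer(d_model, d_model, r))   # 163,840 trainable vs. ~26M frozen parameters per square projection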

Main Steps of Fine-tuning Include:

  1. Load Model and Tokenizer: Use microsoft/phi-4 as the base model and load the corresponding tokenizer.
  2. Add Special Tokens to Tokenizer: Add <think> and </think> to the tokenizer's special tokens and adjust the model's embedding layer to accommodate the new vocabulary size (see the sketch after this list).
  3. Set Up LoRA Configuration: Specify the model modules to train, such as q_proj, k_proj, v_proj, o_proj, etc.
  4. Start Training: Fine-tune the model using the preprocessed dataset.
Resource Consumption

  • GPU Memory: approximately 72,149 MiB (about 72 GB) of GPU memory is needed.
  • Training Time: about 4 hours on an H100.
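Step 2 above is the part that is easiest to get wrong: after adding <think> and </think>, the embedding matrix must be resized, otherwise the new token ids point past the end of the embedding table. A minimal sketch of just that step (the same calls appear in the full script below):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
tokenizer.add_tokens(['<think>', '</think>'])      # register the new reasoning markers

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4", torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(tokenizer))      # grow the embedding table to the new vocabulary size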

Full code

Training code:

from datasets import load_dataset  
import torch, multiprocessing, sys  
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig  
from peft import prepare_model_for_kbit_training, LoraConfig  
from trl import SFTConfig, SFTTrainer  
  
compute_dtype = torch.bfloat16    
# attn_implementation = 'flash_attention_2' 
  
# Load the tokenizer  
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")  
tokenizer.pad_token = "<|finetune_right_pad_id|>"  
tokenizer.pad_token_id = 100257  
tokenizer.padding_side = 'right'  
  
# Add the new tokens '<think>' and '</think>'  
new_tokens = ['<think>', '</think>']  
tokenizer.add_tokens(new_tokens)  
  
# Load the dataset  
ds = load_dataset("cognitivecomputations/dolphin-r1", 'reasoning-deepseek', split='train[:30000]').train_test_split(test_size=0.1)  
  
# Process the dataset  
def process(row):  
    assistant_message = "<think>" + row['reasoning'] + "</think>\n\n" + row['answer']  
    row['messages'].append({'role': 'assistant', 'content': assistant_message})  
    # Manually concatenate the message contents  
    conversations = ''  
    for message in row['messages']:  
        conversations += f"{message['role']}: {message['content']}\n"  
    row['text'] = conversations.strip()  
    return row  
  
ds['train'] = ds['train'].map(  
    process,  
    num_proc=multiprocessing.cpu_count(),  
    load_from_cache_file=False,  
)  
  
ds['test'] = ds['test'].map(  
    process,  
    num_proc=multiprocessing.cpu_count(),  
    load_from_cache_file=False,  
)  
  
def fine_tune(model_name, batch_size=1, gradient_accumulation_steps=32, LoRA=False, QLoRA=False):  
  
    if QLoRA:  
        bnb_config = BitsAndBytesConfig(  
            load_in_4bit=True,  
            bnb_4bit_quant_type="nf4",  
            bnb_4bit_compute_dtype=compute_dtype,  
            bnb_4bit_use_double_quant=True,  
        )  
        model = AutoModelForCausalLM.from_pretrained(  
            model_name, quantization_config=bnb_config, device_map={"": 0}  
        )  
        model = prepare_model_for_kbit_training(model)  
    else:  
        model = AutoModelForCausalLM.from_pretrained(  
            model_name, device_map={"": 0}, torch_dtype=compute_dtype  
        )  
        model.gradient_checkpointing_enable()  
  
    # Resize the model's embedding matrix to match the new vocabulary size  
    model.resize_token_embeddings(len(tokenizer))  
  
    if LoRA or QLoRA:  
        peft_config = LoraConfig(  
            lora_alpha=16,  
            lora_dropout=0.05,  
            r=16,  
            bias="none",  
            task_type="CAUSAL_LM",  
            target_modules=['k_proj', 'o_proj', 'q_proj', 'v_proj', 'up_proj', 'down_proj', 'gate_proj'],  
            modules_to_save=["lm_head", "embed_tokens"],  
        )  
    else:  
        peft_config = None  
  
    output_dir = "./LoRA/"  
  
    training_arguments = SFTConfig(  
        output_dir=output_dir,  
        evaluation_strategy="steps",  
        do_eval=True,  
        optim="adamw_8bit",  
        per_device_train_batch_size=batch_size,  
        gradient_accumulation_steps=gradient_accumulation_steps,  
        per_device_eval_batch_size=batch_size,  
        log_level="debug",  
        save_strategy="steps",        
        save_steps=200,              
        logging_steps=25,  
        learning_rate=1e-5,  
        bf16=True,                    
        eval_steps=200,               
        num_train_epochs=1,  
        warmup_ratio=0.1,  
        lr_scheduler_type="linear",  
        dataset_text_field="text",  
        max_seq_length=1024,  
        report_to='none',  
        save_total_limit=3            
    )  
  
    trainer = SFTTrainer(  
        model=model,  
        train_dataset=ds['train'],  
        eval_dataset=ds['test'],  
        peft_config=peft_config,  
        tokenizer=tokenizer,          
        args=training_arguments,  
    )  
  
    trainer.train()  
fine_tune("microsoft/phi-4", batch_size=16, gradient_accumulation_steps=4, LoRA=True)

Load Fine-tuned Model:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

compute_dtype = torch.bfloat16
attn_implementation = 'flash_attention_2'

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.pad_token_id = 100257

# Register the <think> markers added during fine-tuning (same call as in the training script)
tokenizer.add_tokens(['<think>', '</think>'])
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    device_map={"": 0},
    attn_implementation=attn_implementation,
    torch_dtype=torch.bfloat16,
)

model.resize_token_embeddings(len(tokenizer))  # match the enlarged vocabulary before loading the adapter
model = PeftModel.from_pretrained(model, "./LoRA/checkpoint-422/")
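If you prefer a single standalone checkpoint instead of shipping the base model plus a LoRA adapter, the adapter can optionally be merged back into the base weights with PEFT's merge_and_unload. A short sketch (the output path is just an example):

# Optional: fold the LoRA weights into the base model and save a standalone copy
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./phi-4-r1-distill-merged")   # example path
tokenizer.save_pretrained("./phi-4-r1-distill-merged")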

Inference test

Question 1 and answer 1:

prompt = [{'role':'system', 'content':"You are a helpful assistant, please think before answering."},
    {'role':'user', 'content':"Assume there is a pond with an infinite amount of water. You have two empty jugs with capacities of 5 liters and 6 liters, respectively. How can you use only these two jugs to obtain exactly 3 liters of water from the pond?"}
    ]

prompt = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).to('cuda')
output = model.generate(**input_ids, temperature=0.7, max_new_tokens=2048)
print(tokenizer.decode(output[0], skip_special_tokens=False))

Result:

<|im_end|><|im_start|>user<|im_sep|>Assume there is a pond with an infinite amount of water. You have two empty jugs with capacities of 5 liters and 6 liters, respectively. How can you use only these two jugs to obtain exactly 3 liters of water from the pond?<|im_end|><|im_start|>assistant<|im_sep|><think>Okay, so I need to figure out how to get exactly 3 liters of water using a 5-liter jug and a 6-liter jug. Hmm, this is a classic water jug problem. Let me think about the steps.

First, I need to remember the rules: I can fill either jug from the pond, pour water from one jug to the other until one is full or the other is empty, and empty a jug back into the pond. So, I can't measure directly, but I can use the difference between the two jugs to get the desired amount.

Let's start by filling the 6-liter jug. Then, pour from the 6-liter jug into the 5-liter jug until the 5-liter jug is full. That would leave 1 liter in the 6-liter jug. Then, empty the 5-liter jug back into the pond. Now, pour the remaining 1 liter from the 6-liter jug into the 5-liter jug. Then, fill the 6-liter jug again. Pour from the 6-liter jug into the 5-liter jug until the 5-liter jug is full. Since the 5-liter jug already has 1 liter, it can only take 4 more liters. So, after pouring 4 liters, the 6-liter jug will have 2 liters left. Then, empty the 5-liter jug again. Pour the 2 liters from the 6-liter jug into the 5-liter jug. Now, fill the 6-liter jug again. Pour from the 6-liter jug into the 5-liter jug until the 5-liter jug is full. The 5-liter jug already has 2 liters, so it can take 3 more liters. After pouring 3 liters, the 6-liter jug will have 3 liters left. So, that's how you get 3 liters in the 6-liter jug.

Wait, let me check if that's correct. Let's go through the steps again:

1. Fill the 6-liter jug. (6, 0)
2. Pour from 6-liter to 5-liter until 5-liter is full. (1, 5)
3. Empty the 5-liter jug. (1, 0)
4. Pour the 1 liter from the 6-liter jug into the 5-liter jug. (0, 1)
5. Fill the 6-liter jug again. (6, 1)
6. Pour from 6-liter to 5-liter until 5-liter is full. The 5-liter jug has 1 liter, so it can take 4 more liters. (2, 5)
7. Empty the 5-liter jug. (2, 0)
8. Pour the 2 liters from the 6-liter jug into the 5-liter jug. (0, 2)
9. Fill the 6-liter jug again. (6, 2)
10. Pour from 6-liter to 5-liter until 5-liter is full. The 5-liter jug has 2 liters, so it can take 3 more liters. (3, 5)

So, after step 10, the 6-liter jug has 3 liters left. That's correct. The answer is that you can get 3 liters in the 6-liter jug after these steps.

Alternatively, let me see if there's another way. Maybe starting with the 5-liter jug? But that seems less straightforward. Let me try:

1. Fill the 5-liter jug. (5, 0)
2. Pour from 5-liter to 6-liter. (0, 5)
3. Fill the 5-liter jug again. (5, 5)
4. Pour from 5-liter to 6-liter until 6-liter is full. The 6-liter jug already has 5 liters, so it can take 1 more liter. (4, 6)
5. Empty the 6-liter jug. (4, 0)
6. Pour the 4 liters from the 5-liter jug into the 6-liter jug. (0, 4)
7. Fill the 5-liter jug again. (5, 4)
8. Pour from 5-liter to 6-liter until 6-liter is full. The 6-liter jug has 4 liters, so it can take 2 more liters. (3, 6)

So, after step 8, the 5-liter jug has 3 liters left. That's another way to get 3 liters. So, there are two possible solutions: either the 5-liter jug ends up with 3 liters, or the 6-liter jug ends up with 3 liters. Both are valid.

Therefore, the answer is that you can obtain exactly 3 liters of water in either the 5-liter jug or the 6-liter jug by following the steps outlined above.</think>

To obtain exactly 3 liters of water using a 5-liter jug and a 6-liter jug, follow these steps:

1. Fill the 6-liter jug completely. (6, 0)
2. Pour water from the 6-liter jug into the 5-liter jug until the 5-liter jug is full. (1, 5)
3. Empty the 5-liter jug. (1, 0)
4. Pour the remaining 1 liter from the 6-liter jug into the 5-liter jug. (0, 1)
5. Fill the 6-liter jug again. (6, 1)
6. Pour water from the 6-liter jug into the 5-liter jug until the 5-liter jug is full. (2, 5)
7. Empty the 5-liter jug. (2, 0)
8. Pour the 2 liters from the 6-liter jug into the 5-liter jug. (0, 2)
9. Fill the 6-liter jug again. (6, 2)
10. Pour water from the 6-liter jug into the 5-liter jug until the 5-liter jug is full. (3, 5)

After step 10, the 6-liter jug will contain exactly 3 liters of water. Alternatively, you can also end up with 3 liters in the 5-liter jug by following a different sequence:

1. Fill the 5-liter jug completely. (5, 0)
2. Pour water from the 5-liter jug into the 6-liter jug. (0, 5)
3. Fill the 5-liter jug again. (5, 5)
4. Pour water from the 5-liter jug into the 6-liter jug until the 6-liter jug is full. (4, 6)
5. Empty the 6-liter jug. (4, 0)
6. Pour the 4 liters from the 5-liter jug into the 6-liter jug. (0, 4)
7. Fill the 5-liter jug again. (5, 4)
8. Pour water from the 5-liter jug into the 6-liter jug until the 6-liter jug is full. (3, 6)

After step 8, the 5-liter jug will contain exactly 3 liters of water.<|im_end|>

Question 2 and answer 2:

<|im_start|>user<|im_sep|>Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you, 'Do you want to pick door No. 2 instead?' What choice of door now gives you the biggest advantage?<|im_end|><|im_start|>assistant<|im_sep|><think>Okay, so this is a classic probability problem, right? It sounds a lot like the Monty Hall problem, but with a twist. Let me think through it step by step.

First, the setup: there are three doors, one with a gold bar and two with rotten vegetables. The contestant picks a door, say No. 1. Then the host offers to switch to door No. 2. So the question is, should the contestant switch to door No. 2, or stick with door No. 1?

In the Monty Hall problem, the contestant initially picks a door, then the host opens one of the other two doors to reveal a goat (or in this case, rotten vegetables), and then the contestant is given the option to switch. The key point in Monty Hall is that the host knows where the prize is and always opens a door with a goat, which changes the probabilities.

But here, the host is offering to switch to door No. 2 without opening any doors. So, the host isn't revealing any additional information. Let me think about the probabilities.

Initially, the probability of the gold bar being behind the chosen door (No. 1) is 1/3, and the probability of it being behind one of the other two doors (No. 2 or No. 3) is 2/3. But since the host is offering to switch to door No. 2, we need to consider the probabilities again.

If the gold bar is behind door No. 1 (which has a 1/3 chance), then switching to door No. 2 would result in getting rotten vegetables. If the gold bar is behind door No. 2 (which has a 1/3 chance), then switching to door No. 2 would result in getting the gold bar. If the gold bar is behind door No. 3 (which has a 1/3 chance), then switching to door No. 2 would result in getting rotten vegetables.

So, the probability of winning the gold bar by switching to door No. 2 is the probability that the gold bar is behind door No. 2, which is 1/3. The probability of winning by sticking with door No. 1 is the probability that the gold bar is behind door No. 1, which is also 1/3.

Wait, but in the Monty Hall problem, the host's action of opening a door changes the probabilities. Here, the host is just offering to switch to door No. 2. So, the host's offer doesn't change the initial probabilities. Therefore, switching to door No. 2 has the same probability of winning as sticking with door No. 1, which is 1/3.

But wait, in the Monty Hall problem, the host's action of opening a door with a goat changes the probabilities because the host knows where the prize is. Here, the host is offering to switch to door No. 2 without opening any doors. So, the host's offer doesn't provide any additional information. Therefore, the probabilities remain the same as the initial probabilities.

So, the probability of winning by switching to door No. 2 is 1/3, and the probability of winning by sticking with door No. 1 is also 1/3. Therefore, there is no advantage to switching or sticking. Both options have the same probability of winning.

Therefore, the answer is that there is no advantage to switching or sticking. Both options have the same probability of winning, which is 1/3.</think>

In this scenario, the host's offer to switch to door No. 2 does not provide any additional information, as no doors are opened to reveal the contents behind them. Therefore, the probabilities remain the same as the initial choice. The probability of the gold bar being behind the initially chosen door (No. 1) is 1/3, and the probability of it being behind one of the other two doors (No. 2 or No. 3) is 2/3. However, since the host is offering to switch specifically to door No. 2, the probability of winning by switching to door No. 2 is 1/3, and the probability of winning by sticking with door No. 1 is also 1/3. Thus, there is no advantage to switching or sticking; both options have the same probability of winning, which is 1/3.<|im_end|>

The results were satisfying! The model successfully generated a reasoning process containing <think> tags, demonstrating a certain level of logical thinking ability.
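If you only want to show end users the final answer without the reasoning trace, the <think> block can be stripped from the decoded text. A small post-processing sketch (regex-based, assuming a single well-formed <think>...</think> block):

import re

decoded = tokenizer.decode(output[0], skip_special_tokens=False)
# Drop the chain of thought and keep only the final answer text
final_answer = re.sub(r"<think>.*?</think>\s*", "", decoded, flags=re.DOTALL)
print(final_answer)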

By using datasets provided by the community, we can fine-tune models like Phi-4, giving smaller models a degree of "thinking" and reasoning ability.

 

GRPO RL Training on Phi-4

Please click the link below to watch my demo video on YouTube about GRPO training of microsoft/phi-4:

https://www.youtube.com/watch?v=WXjJdsV2cbU&feature=youtu.be

The code I used

from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

max_seq_length = 1024
lora_rank = 16

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "microsoft/phi-4",
    max_seq_length = max_seq_length,
    load_in_4bit = True, 
    fast_inference = True, 
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, 
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
<aha>
No no no, this is my real answer:
...
</aha>
"""
import re

def very_loose_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if "<reasoning>" in r and "</reasoning>" in r else 0.0 for r in responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*<aha>.*?</aha>$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets '.' span newlines, so multi-line reasoning/answer blocks still match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def aha_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion contains "aha" times, 2 for the tags, and one more, wherever it wants."""
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.findall(r'\baha\b', r, re.IGNORECASE) for r in responses]
    return [0.5 if len(match) == 3 else 0.0 for match in matches]

from datasets import load_dataset
import multiprocessing
ds = load_dataset("cognitivecomputations/dolphin-r1", "reasoning-deepseek", split="train[:10000]")
ds = ds.rename_columns({'messages':'prompt'})

def process(row):
  row['prompt'][0]['content'] += '\n'+SYSTEM_PROMPT
  return row

ds= ds.map(
    process,
    num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False,
)

def tokenize_with_template(example):
    """Tokenizes input text using the chat template of the tokenizer."""
    chat_template = tokenizer.apply_chat_template(example['prompt'], tokenize=False, add_generation_prompt=True)
    tokens = tokenizer(chat_template, truncation=False, add_special_tokens=True)["input_ids"]
    return {"token_length": len(tokens)}

ds = ds.map(tokenize_with_template,
    #num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False)
ds = ds.filter(lambda example: example["token_length"] <= max_seq_length)
ds = ds.remove_columns(["token_length"])

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 1e-6,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "paged_adamw_8bit",
    logging_steps = 5,
    bf16 = True,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory (but avoid below 4)
    max_prompt_length = 256,
    max_completion_length = 512,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        aha_reward_func,
        strict_format_reward_func,
        very_loose_format_reward_func
    ],
    args = training_args,
    train_dataset = ds,
)
trainer.train()
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    device_map="cuda",
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
model = PeftModel.from_pretrained(model, "outputs/checkpoint-50")

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
<aha>
No no no, this is my real answer:
...
</aha>
"""

messages = [{"role":"system", "content":"You are an expert who knows the location of all pandas in China."+SYSTEM_PROMPT},
    {"role": "user", "content": "Make 10 possible equations to compute the number of panda in China."},
]

seq = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(seq)

encoded_input = tokenizer(seq, return_tensors="pt").to(model.device)
outputs = model.generate(
  encoded_input['input_ids'],
  max_new_tokens=2048,
  do_sample=True,
  temperature=0.6,
  top_p=0.9,
)


response = outputs[0][encoded_input['input_ids'].shape[-1]:]
print(tokenizer.decode(response))

Final result:

<|im_start|>system<|im_sep|>You are an expert who knows the location of all pandas in China.
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
<aha>
No no no, this is my real answer:
...
</aha>
<|im_end|><|im_start|>user<|im_sep|>Make 10 possible equations to compute the number of panda in China.<|im_end|><|im_start|>assistant<|im_sep|>
<reasoning>
To create equations that could theoretically compute the number of pandas in China, we need to consider various factors that might influence the panda population. These factors could include birth rates, death rates, conservation efforts, habitat changes, and more. Here are some conceptual equations:

1. **Basic Population Model**:
   \[
   P(t) = P_0 + (B - D) \times t
   \]
   Where \( P(t) \) is the population at time \( t \), \( P_0 \) is the initial population, \( B \) is the average number of births per year, and \( D \) is the average number of deaths per year.

2. **Logistic Growth Model**:
   \[
   P(t) = \frac{K}{1 + \left(\frac{K - P_0}{P_0}\right) e^{-rt}}
   \]
   Where \( K \) is the carrying capacity of the environment, \( r \) is the intrinsic growth rate, and \( e \) is the base of the natural logarithm.

3. **Conservation Impact Model**:
   \[
   P(t) = P_0 + (B - D + C) \times t
   \]
   Where \( C \) represents the net effect of conservation efforts (e.g., increased births or decreased deaths due to conservation).

4. **Habitat Change Model**:
   \[
   P(t) = P_0 + (B - D) \times t - H(t)
   \]
   Where \( H(t) \) is the negative impact on the population due to habitat loss or degradation over time.

5. **Predation and Disease Model**:
   \[
   P(t) = P_0 + (B - D - P_d - P_c) \times t
   \]
   Where \( P_d \) is the number of deaths due to disease, and \( P_c \) is the number of deaths due to predation.

6. **Migration Model**:
   \[
   P(t) = P_0 + (B - D) \times t + M(t)
   \]
   Where \( M(t) \) is the net migration (immigration minus emigration) of pandas over time.

7. **Human Impact Model**:
   \[
   P(t) = P_0 + (B - D - H_i) \times t
   \]
   Where \( H_i \) is the impact of human activities (e.g., poaching, deforestation) on the panda population.

8. **Climate Change Model**:
   \[
   P(t) = P_0 + (B - D - C_c) \times t
   \]
   Where \( C_c \) is the impact of climate change on the panda population.

9. **Genetic Diversity Model**:
   \[
   P(t) = P_0 + (B - D) \times t + G(t)
   \]
   Where \( G(t) \) is the effect of genetic diversity on population growth (e.g., increased resilience).

10. **Combined Model**:
    \[
    P(t) = P_0 + (B - D + C - H - P_d - P_c + M - H_i - C_c + G) \times t
    \]
    This equation combines multiple factors to provide a more comprehensive model.

These equations are hypothetical and would require specific data to be accurately applied.
</reasoning>

<answer>
1. \( P(t) = P_0 + (B - D) \times t \)

2. \( P(t) = \frac{K}{1 + \left(\frac{K - P_0}{P_0}\right) e^{-rt}} \)

3. \( P(t) = P_0 + (B - D + C) \times t \)

4. \( P(t) = P_0 + (B - D) \times t - H(t) \)

5. \( P(t) = P_0 + (B - D - P_d - P_c) \times t \)

6. \( P(t) = P_0 + (B - D) \times t + M(t) \)

7. \( P(t) = P_0 + (B - D - H_i) \times t \)

8. \( P(t) = P_0 + (B - D - C_c) \times t \)

9. \( P(t) = P_0 + (B - D) \times t + G(t) \)

10. \( P(t) = P_0 + (B - D + C - H - P_d - P_c + M - H_i - C_c + G) \times t \)
</answer>

<aha>
No no no, this is my real answer: The exact number of pandas in China is determined through surveys and research conducted by wildlife authorities and conservation organizations. These numbers are updated periodically and are based on field data rather than equations.
</aha><|im_end|>
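As an aside, the three reward functions defined earlier can be sanity-checked offline before launching a GRPO run. A small standalone sketch (not part of the original training script; the example completions are made up):

# Quick offline check of the reward functions on hand-written completions
good = ("<reasoning>Some step-by-step thoughts.</reasoning>\n"
        "<answer>42</answer>\n"
        "<aha>No no no, this is my real answer: aha, it is 42.</aha>")
bad = "The answer is 42."

completions = [[{"content": good}], [{"content": bad}]]
print(very_loose_format_reward_func(completions))   # expected: [0.5, 0.0]
print(strict_format_reward_func(completions))       # expected: [0.5, 0.0]
print(aha_reward_func(completions))                 # expected: [0.5, 0.0]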

Refer to: 

https://kaitchup.substack.com/p/fine-tuning-your-llm-to-think-like-r1

https://kaitchup.substack.com/p/grpo-train-llms-with-deepseek-r1s
